rotten apples: an investigation of the prevalence and predictors of ...

35
ROTTEN APPLES: AN INVESTIGATION OF THE PREVALENCE AND PREDICTORS OF TEACHER CHEATING* BRIAN A. JACOB AND STEVEN D. LEVITT We develop an algorithm for detecting teacher cheating that combines infor- mation on unexpected test score uctuations and suspicious patterns of answers for students in a classroom. Using data from the Chicago public schools, we estimate that serious cases of teacher or administrator cheating on standardized tests occur in a minimum of 4–5 percent of elementary school classrooms annu- ally. The observed frequency of cheating appears to respond strongly to relatively minor changes in incentives. Our results highlight the fact that high-powered incentive systems, especially those with bright line rules, may induce unexpected behavioral distortions such as cheating. Statistical analysis, however, may pro- vide a means of detecting illicit acts, despite the best attempts of perpetrators to keep them clandestine. I. INTRODUCTION High-powered incentive schemes are designed to align the behavior of agents with the interests of the principal implement- ing the system. A shortcoming of such schemes, however, is that they are likely to induce behavior distortions along other dimen- sions as agents seek to game the rules (see, for instance, Holm- strom and Milgrom [1991] and Baker [1992]). The distortions may be particularly pronounced in systems with bright line rules [Glaeser and Shleifer 2001]. It may be impossible to anticipate the many ways in which a particular incentive scheme may be gamed. Test-based accountability systems in education provide an excellent example of the costs and bene ts of high-powered in- centive schemes. In an effort to improve student achievement, a number of states and districts have recently implemented pro- grams that use student test scores to punish or reward schools. * We would like to thank Suzanne Cooper, Mark Duggan, Susan Dynarski, Arne Duncan, Michael Greenstone, James Heckman, Lars Lefgren, two anony- mous referees, the editor, Edward Glaeser, and seminar participants too numer- ous to mention for helpful comments and discussions. We also thank Arne Dun- can, Philip Hansen, Carol Perlman, and Jessie Qualles of the Chicago public schools for their help and cooperation on the project. Financial support was provided by the National Science Foundation and the Sloan Foundation. All remaining errors are our own. Addresses: Brian Jacob, Kennedy School of Gov- ernment, Harvard University, 79 JFK Street, Cambridge, MA 02138; Steven Levitt, Department of Economics, University of Chicago, 1126 E. 59th Street, Chicago, IL 60637. © 2003 by the President and Fellows of Harvard College and the Massachusetts Institute of Technology. The Quarterly Journal of Economics, August 2003 843

Transcript of rotten apples: an investigation of the prevalence and predictors of ...

Page 1: rotten apples: an investigation of the prevalence and predictors of ...

ROTTEN APPLES AN INVESTIGATION OF THEPREVALENCE AND PREDICTORS OF TEACHER

CHEATING

BRIAN A JACOB AND STEVEN D LEVITT

We develop an algorithm for detecting teacher cheating that combines infor-mation on unexpected test score uctuations and suspicious patterns of answersfor students in a classroom Using data from the Chicago public schools weestimate that serious cases of teacher or administrator cheating on standardizedtests occur in a minimum of 4ndash5 percent of elementary school classrooms annu-ally The observed frequency of cheating appears to respond strongly to relativelyminor changes in incentives Our results highlight the fact that high-poweredincentive systems especially those with bright line rules may induce unexpectedbehavioral distortions such as cheating Statistical analysis however may pro-vide a means of detecting illicit acts despite the best attempts of perpetrators tokeep them clandestine

I INTRODUCTION

High-powered incentive schemes are designed to align thebehavior of agents with the interests of the principal implement-ing the system A shortcoming of such schemes however is thatthey are likely to induce behavior distortions along other dimen-sions as agents seek to game the rules (see for instance Holm-strom and Milgrom [1991] and Baker [1992]) The distortionsmay be particularly pronounced in systems with bright line rules[Glaeser and Shleifer 2001] It may be impossible to anticipatethe many ways in which a particular incentive scheme may begamed

Test-based accountability systems in education provide anexcellent example of the costs and benets of high-powered in-centive schemes In an effort to improve student achievement anumber of states and districts have recently implemented pro-grams that use student test scores to punish or reward schools

We would like to thank Suzanne Cooper Mark Duggan Susan DynarskiArne Duncan Michael Greenstone James Heckman Lars Lefgren two anony-mous referees the editor Edward Glaeser and seminar participants too numer-ous to mention for helpful comments and discussions We also thank Arne Dun-can Philip Hansen Carol Perlman and Jessie Qualles of the Chicago publicschools for their help and cooperation on the project Financial support wasprovided by the National Science Foundation and the Sloan Foundation Allremaining errors are our own Addresses Brian Jacob Kennedy School of Gov-ernment Harvard University 79 JFK Street Cambridge MA 02138 StevenLevitt Department of Economics University of Chicago 1126 E 59th StreetChicago IL 60637

copy 2003 by the President and Fellows of Harvard College and the Massachusetts Institute ofTechnologyThe Quarterly Journal of Economics August 2003

843

Recent federal legislation institutionalizes this practice requir-ing states to test elementary students each year rate schools onthe basis of student performance and intervene in schools that donot make sufcient improvement1 Several prior studies suggestthat such accountability policies may be effective at raising stu-dent achievement [Richards and Sheu 1992 Grissmer Flanaganet al 2000 Deere and Strayer 2001 Jacob 2002 Carnoy and Loeb2002 Hanushek and Raymond 2002] At the same time howeverresearchers have documented instances of manipulation includ-ing documented shifts away from nontested areas or ldquoteaching tothe testrdquo [Klein et al 2002 Jacob 2002] and increasing placementin special education [Jacob 2002 Figlio and Getzler 2002 Cullenand Reback 2002]

In this paper we explore a very different mechanism forinating test scores outright cheating on the part of teachers andadministrators2 As incentives for high test scores increase un-scrupulous teachers may be more likely to engage in a range ofillicit activities including changing student responses on answersheets providing correct answers to students or obtaining copiesof an exam illegitimately prior to the test date and teachingstudents using knowledge of the precise exam questions3 Whilesuch allegations may seem far-fetched documented cases ofcheating have recently been uncovered in California [May 2000]Massachusetts [Marcus 2000] New York [Loughran and Comis-key 1999] Texas [Kolker 1999] and Great Britain [Hofkins 1995Tysome 1994]

There has been very little previous empirical analysis ofteacher cheating4 The few studies that do exist involve investi-

1 The federal legislation No Child Left Behind was passed in 2001 Prior tothis legislation virtually every state had linked test-score outcomes to schoolfunding or required students to pass an exit examination to graduate high schoolIn the state of California a policy providing for merit pay bonuses of as much as$25000 per teacher in schools with large test score gains was recently put intoplace

2 Hereinafter we uses the phrase ldquoteacher cheatingrdquo to encompass cheatingdone by either teachers or administrators

3 We have no way of knowing whether the patterns we observe arise becausea teacher explicitly alters studentsrsquo answer sheets directly provides answers tostudents during a test or perhaps makes test materials available to students inadvance of the exam (for instance by teaching a reading passage that is on thetest) If we had access to the actual exams it might be possible to distinguishbetween these scenarios through an analysis of erasure patterns

4 In contrast there is a well-developed statistics literature for identifyingwhether one student has copied answers from another student [Wollack 1997Holland 1996 Frary 1993 Bellezza and Bellezza 1989 Frary Tideman andWatts 1977 Angoff 1974] These methods involve the identication of unusual

844 QUARTERLY JOURNAL OF ECONOMICS

gations of specic instances of cheating and generally rely on theanalysis of erasure patterns and the controlled retesting of stu-dents5 While this earlier research provides convincing evidenceof isolated cheating incidents our paper represents the rst sys-tematic attempt to (1) identify the overall prevalence of teachercheating empirically and (2) analyze the factors that predictcheating To address these questions we use detailed adminis-trative data from the Chicago public schools (CPS) that includesthe question-by-question answers given by every student ingrades 3 to 8 who took the Iowa Test of Basic Skills (ITBS) from1993 to 20006 In addition to the test responses we also haveaccess to each studentrsquos full academic record including past testscores the school and room to which a student was assigned andextensive demographic and socioeconomic characteristics

Our approach to detecting classroom cheating combinestwo types of indicators unexpected test score uctuations andunusual patterns of answers for students within a classroomTeacher cheating increases the likelihood that students in aclassroom will experience large unexpected increases in testscores one year followed by very small test score gains (or evendeclines) the following year Teacher cheating especially ifdone in an unsophisticated manner is also likely to leavetell-tale signs in the form of blocks of identical answers un-usual patterns of correlations across student answers withinthe classroom or unusual response patterns within a studentrsquosexam (eg a student who answers a number of very difcult

patterns of agreement in student responses and for the most part are onlyeffective in identifying the most egregious cases of copying

5 In the mid-eighties Perlman [1985] investigated suspected cheating in anumber of Chicago public schools (CPS) The study included 23 suspect schoolsmdashidentied on the basis of a high percentage of erasures unusual patterns of scoreincreases unnecessarily large orders of blank answer sheets for the ITBS and tipsto the CPS Ofce of Researchmdashalong with 17 comparison schools When a secondform of the test was administered to the 40 schools under more controlled condi-tions the suspect schools did much worse than the comparison schools Ananalysis of several dozen Los Angeles schools where the percentage of erasuresand changed answers was unusually high revealed evidence of teacher cheating[Aiken 1991] One of the most highly publicized cheating scandals involved Strat-eld elementary an award-winning school in Connecticut In 1996 the rm thatdeveloped and scored the exam found that the rate of erasures at Strateld was upto ve times greater than other schools in the same district and that 89 percent oferasures at Strateld were from an incorrect to a correct response Subsequentretesting resulted in signicantly lower scores [Lindsay 1996]

6 We do not however have access to the actual test forms that studentslled out so we are unable to analyze these tests for evidence of suspiciouspatterns of erasures

845ROTTEN APPLES

questions correctly while missing many simple questions) Ouridentication strategy exploits the fact that these two types ofindicators are very weakly correlated in classrooms unlikely tohave cheated but very highly correlated in situations wherecheating likely occurred That allows us to credibly estimatethe prevalence of cheating without having to invoke arbitrarycutoffs as to what constitutes cheating

Empirically we detect cheating in approximately 4 to 5 per-cent of the classes in our sample This estimate is likely tounderstate the true incidence of cheating for two reasons Firstwe focus only on the most egregious type of cheating whereteachers systematically alter student test forms There are othermore subtle ways in which teachers can cheat such as providingextra time to students that our algorithm is unlikely to detectSecond even when test forms are altered our approach is onlypartially successful in detecting illicit behavior As discussedlater when we ourselves simulate cheating by altering studentanswer strings and then testing for cheating in the articiallymanipulated classrooms many instances of moderate cheating goundetected by our methods

A number of patterns in the results reinforce our con-dence that what we measure is indeed cheating First simu-lation results demonstrate that there is nothing mechanicalabout our identication approach that automatically generatespatterns like those observed in the data When we randomlyassign students to classrooms and search for cheating in thesesimulated classes our methods nd little evidence of cheatingSecond cheating on one part of the test (eg math) is a strongpredictor of cheating on other sections of the test (eg read-ing) Third cheating is also correlated within classrooms overtime and across classrooms in a particular school Finally andperhaps most convincingly with the cooperation of the Chicagopublic schools we were able to conduct a prospective test of ourmethods in which we retested a subset of classrooms undercontrolled conditions that precluded teacher cheating Class-rooms identied as likely cheaters experienced large declinesin test scores on the retest whereas classrooms not suspectedof cheating maintained their test score gains

The prevalence of cheating is also shown to respond torelatively minor changes in teacher incentives The importanceof standardized tests in the Chicago public schools increased

846 QUARTERLY JOURNAL OF ECONOMICS

substantially with a change in leadership in 1996 particularlyfor low-achieving students Following the introduction of thesepolicies the prevalence of cheating rose sharply in low-achiev-ing classrooms whereas classes with average or higher-achiev-ing students showed no increase in cheating Cheating preva-lence also appears to be systematically lower in cases wherethe costs of cheating are higher (eg in mixed-grade class-rooms in which two different exams are administered simulta-neously) or the benets of cheating are lower (eg in class-rooms with more special education or bilingual students whotake the standardized tests but whose scores are excludedfrom ofcial calculations)

The remainder of the paper is structured as follows SectionII discusses the set of indicators we use to capture cheatingbehavior Section III describes our identication strategy SectionIV provides a brief overview of the institutional details of theChicago public schools and the data set that we use Section Vreports the basic empirical results on the prevalence of cheatingand presents a wide range of evidence supporting the interpreta-tion of these results as cheating Section VI analyzes how teachercheating responds to incentives Section VII discusses the resultsand the implications for increasing reliance on high-stakestesting

II INDICATORS OF TEACHER CHEATING

Teacher cheating especially in extreme cases is likely toleave tell-tale signs In motivating our discussion of the indicatorswe employ for detecting cheating it is instructive to compare twoactual classrooms taking the same test (see Figure I) Each row inFigure I represents one studentrsquos answers to each item on thetest Columns correspond to the different questions asked Theletter ldquoArdquo ldquoBrdquo ldquoCrdquo or ldquoDrdquo means a student provided the correctanswer If a number is entered the student answered the ques-tion incorrectly with ldquo1rdquo corresponding to a wrong answer of ldquoArdquoldquo2rdquo corresponding to a wrong answer of ldquoBrdquo etc On the right-hand side of the table we also present student test scores for thepreceding current and following year Test scores are in units ofldquograde equivalentsrdquo The grading scale is normed so that a stu-dent at the national average for sixth graders taking the test inthe eighth month of the school year would score 68 A typical

847ROTTEN APPLES

FIGURE ISample Answer Strings and Test Scores from Two Classrooms

The data in the gure represent actual answer strings and test scores from twoCPS classrooms taking the same exam The top classroom is suspected of cheat-ing the bottom classroom is not Each row corresponds to an individual studentEach column represents a particular question on the exam A letter indicates thatthe student gave that answer and the answer was correct A number means thatthe student gave the corresponding letter answer (eg 1 5 ldquoArdquo) but the answerwas incorrect A value of ldquo0rdquo means the question was left blank Student testscores in grade equivalents are shown in the last three columns of the gure Thetest year for which the answer strings are presented is denoted year t The scoresfrom years t 2 1 and t 1 1 correspond to the preceding and following yearsrsquoexaminations

848 QUARTERLY JOURNAL OF ECONOMICS

student would be expected to gain one grade equivalent for eachyear of school

The top panel of data shows a class in which we suspectteacher cheating took place the bottom panel corresponds to atypical classroom Two striking differences between the class-rooms are readily apparent First in the cheating classroom alarge block of students provided identical answers on consecutivequestions in the middle of the test as indicated by the boxed areain the gure For the other classroom no such pattern existsSecond looking at the pattern of test scores students in thecheating classroom experienced large increases in test scoresfrom the previous to the current year (17 grade equivalents onaverage) and actually experienced declines on average the fol-lowing year In contrast the students in the typical classroomgained roughly one grade equivalent each year as would beexpected

The indicators we use as evidence of cheating formalize andextend the basic picture that emerges from Figure I We dividethese indicators into two distinct groups that respectively cap-ture unusual test score uctuations and suspicious patterns ofanswer strings In this section we describe informally the mea-sures that we use A more rigorous treatment of their construc-tion is provided in the Appendix

IIIA Indicator One Unexpected Test Score Fluctuations

Given that the aim of cheating is to raise test scores onesignal of teacher cheating is an unusually large gain in testscores relative to how those same students tested in the pre-vious year Since test score gains that result from cheating donot represent real gains in knowledge there is no reason toexpect the gains to be sustained on future exams taken bythese students (unless of course next yearrsquos teachers alsocheat on behalf of the students) Thus large gains due tocheating should be followed by unusually small test score gainsfor these students in the following year In contrast if largetest score gains are due to a talented teacher the student gainsare likely to have a greater permanent component even if someregression to the mean occurs We construct a summary mea-sure of how unusual the test score uctuations are by rankingeach classroomrsquos average test score gains relative to all other

849ROTTEN APPLES

classrooms in that same subject grade and year7 and com-puting the following statistic

(3) SCOREcbt 5 ~rank_ gaincb t2 1 ~1 2 rank_ gaincbt11

2

where rank_ gain cb t is the percentile rank for class c in subject bin year t Classes with relatively big gains on this yearrsquos test andrelatively small gains on next yearrsquos test will have high values ofSCORE Squaring the individual terms gives relatively moreweight to big test score gains this year and big test score declinesthe following year8 In the empirical analysis we consider threepossible cutoffs for what it means to have a ldquohighrdquo value onSCORE corresponding to the eightieth ninetieth and ninety-fth percentiles among all classrooms in the sample

IIIB Indicator Two Suspicious Answer Strings

The quickest and easiest way for a teacher to cheat is to alterthe same block of consecutive questions for a substantial portionof students in the class as was apparently done in the classroomin the top panel of Figure I More sophisticated interventionsmight involve skipping some questions so as to avoid a large blockof identical answers or altering different blocks of questions fordifferent students

We combine four different measures of how suspicious aclassroomrsquos answer strings are in determining whether a class-room may be cheating The rst measure focuses on the mostunlikely block of identical answers given by students on consecu-tive questions Using past test scores future test scores andbackground characteristics we predict the likelihood that eachstudent will give each possible answer (A B C or D) on everyquestion using a multinomial logit Each studentrsquos predictedprobability of choosing a particular response is identied by thelikelihood that other students (in the same year grade andsubject) with similar background characteristics will choose that

7 We also experimented with more complicated mechanisms for deninglarge or small test score gains (eg predicting each studentrsquos expected test scoregain as a function of past test scores and background characteristics and comput-ing a deviation measure for each student which was then aggregated to theclassroom level) but because the results were similar we elected to use thesimpler method We have also dened gains and losses using an absolute metric(eg where gains in excess of 15 or 2 grade equivalents are considered unusuallylarge) and obtain comparable results

8 In the following year the students who were in a particular classroom aretypically scattered across multiple classrooms We base all calculations off of thecomposition of this yearrsquos class

850 QUARTERLY JOURNAL OF ECONOMICS

response We then search over all combinations of students andconsecutive questions to nd the block of identical answers givenby students in a classroom least likely to have arisen by chance9

The more unusual is the most unusual block of test responses(adjusting for class size and the number of questions on the examboth of which increase the possible combinations over which wesearch) the more likely it is that cheating occurred Thus if tenvery bright students in a class of thirty give the correct answersto the rst ve questions on the exam (typically the easier ques-tions) the block of identical answers will not appear unusual Incontrast if all fteen students in a low-achieving classroom givethe same correct answers to the last ve questions on the exam(typically the harder questions) this would appear quite suspect

The second measure of suspicious answer strings involvesthe overall degree of correlation in student answers across thetest When a teacher changes answers on test forms it presum-ably increases the uniformity of student test forms across stu-dents in the class This measure is meant to capture more generalpatterns of similarity in student responses beyond just identicalblocks of answers Based on the results of the multinomial logitdescribed above for each question and each student we create ameasure of how unexpected the studentrsquos response was We thencombine the information for each student in the classroom tocreate something akin to the within-classroom correlation in stu-dent responses This measure will be high if students in a class-room tend to give the same answers on many questions espe-cially if the answers given are unexpected (ie correct answers onhard questions or systematic mistakes on easy questions)

Of course within-classroom correlation may arise for manyreasons other than cheating (eg the teacher may emphasizecertain topics during the school year) Therefore a third indicatorof potential cheating is a high variance in the degree of correla-tion across questions If the teacher changes answers for multiplestudents on selected questions the within-class correlation on

9 Note that we do not require the answers to be correct Indeed in manyclassrooms the most unusual strings include some incorrect answers Note alsothat these calculations are done under the assumption that a given studentrsquosanswers are uncorrelated (conditional on observables) across questions on theexam and that answers are uncorrelated across students Of course this assump-tion is unlikely to be true Since all of our comparisons rely on the relativeunusualness of the answers given in different classrooms this simplifying as-sumption is not problematic unless the correlation within and across studentsvaries by classroom

851ROTTEN APPLES

those particular questions will be extremely high while the de-gree of within-class correlation on other questions is likely to betypical This leads the cross-question variance in correlations tobe larger than normal in cheating classrooms

Our nal indicator compares the answers that students inone classroom give compared with the answers of other studentsin the system who take the identical test and get the exact samescore This measure relies on the fact that questions vary signi-cantly in difculty The typical student will answer most of theeasy questions correctly but get many of the hard questionswrong (where ldquoeasyrdquo and ldquohardrdquo are based on how well students ofsimilar ability do on the question) If students in a class system-atically miss the easy questions while correctly answering thehard questions this may be an indication of cheating

Our overall measure of suspicious answer strings is con-structed in a manner parallel to our measure of unusual testscore uctuations Within a given subject grade and year werank classrooms on each of these four indicators and then takethe sum of squared ranks across the four measures10

(4) ANSWERScbt 5 ~rank_m1cbt2 1 ~rank_m2cbt

2

1 ~rank_m3cbt2 1 ~rank_m4cbt

2

In the empirical work we again use three possible cutoffs forpotential cheating eightieth ninetieth and ninety-fthpercentiles

III A STRATEGY FOR IDENTIFYING THE PREVALENCE OF CHEATING

The previous section described indicators that are likely to becorrelated with cheating Because sometimes such patterns ariseby chance however not every classroom with large test scoreuctuations and suspicious answer strings is cheating Further-more the particular choice of what qualies as a ldquolargerdquo uctu-ation or a ldquosuspiciousrdquo set of answer strings will necessarily bearbitrary In this section we present an identication strategythat under a set of defensible assumptions nonetheless providesestimates of the prevalence of cheating

To identify the number of cheating classrooms in a given

10 Because different subjects and grades have differing numbers of ques-tions it is difcult to make meaningful comparisons across tests on the rawindicators

852 QUARTERLY JOURNAL OF ECONOMICS

year we would like to compare the observed joint distribution oftest score uctuations and suspicious answer strings with a coun-terfactual distribution in which no cheating occurs Differencesbetween the two distributions would provide an estimate of howmuch cheating occurred If teachers in a particular school or yearare cheating there will be more classrooms exhibiting both un-usual test score uctuations and suspicious answer strings thanotherwise expected In practice we do not have the luxury ofobserving such a counterfactual Instead we must make assump-tions about what the patterns would look like absent cheating

Our identication strategy hinges on three key assumptions(1) cheating increases the likelihood a class will have both largetest score uctuations and suspicious answer strings (2) if cheat-ing classrooms had not cheated their distribution of test scoreuctuations and answer strings patterns would be identical tononcheating classrooms and (3) in noncheating classrooms thecorrelation between test score uctuations and suspicious an-swers is constant throughout the distribution11

If assumption (1) holds then cheating classrooms will beconcentrated in the upper tail of the joint distribution of unusualtest score uctuations and suspicious answer strings Other partsof the distribution (eg classrooms ranked in the ftieth to sev-enty-fth percentile of suspicious answer strings) will conse-quently include few cheaters As long as cheating classroomswould look similar to noncheating classrooms on our measures ifthey had not cheated (assumption (2)) classes in the part of thedistribution with few cheaters provide the noncheating counter-factual that we are seeking

The difculty however is that we only observe thisnoncheating counterfactual in the bottom and middle parts of thedistribution when what we really care about is the upper tail ofthe distribution where the cheaters are concentrated Assump-tion (3) which requires that in noncheating classrooms the cor-relation between test score uctuations and suspicious answers isconstant throughout the distribution provides a potential solu-tion If this assumption holds we can use the part of the distri-bution that is relatively free of cheaters to project what the righttail of the distribution would look like absent cheating The gapbetween the predicted and observed frequency of classrooms that

11 We formally derive the mathematical model described in this section inJacob and Levitt [2003]

853ROTTEN APPLES

are extreme on both the test score uctuation and suspiciousanswer string measures provides our estimate of cheating

Figure II demonstrates how our identication strategy worksempirically The horizontal axis in the gure ranks classroomsaccording to how suspicious their answer strings are The verticalaxis is the fraction of the classrooms that are above the ninety-fth percentile on the unusual test score uctuations measure12

The graph combines all classrooms and all subjects in our data13

Over most of the range (roughly from zero to the seventy-fthpercentile on the horizontal axis) there is a slight positive corre-lation between unusual test scores and suspicious answer strings

12 The choice of the ninety-fth percentile is somewhat arbitrary In theempirical work that follows we also consider the eightieth and ninetieth percen-tiles The choice of the cutoff does not affect the basic patterns observed in thedata

13 To construct the gure classes were rank ordered according to theiranswer strings and divided into 200 equally sized segments The circles in thegure represent these 200 local means The line displayed in the graph is thetted value of a regression with a seventh-order polynomial in a classroomrsquos rankon the suspicious strings measure

FIGURE IIThe Relationship between Unusual Test Scores and Suspicious Answer Strings

The horizontal axis reects a classroomrsquos percentile rank in the distribution ofsuspicious answer strings within a given grade subject and year with zerorepresenting the least suspicious classroom and one representing the most sus-picious classroom The vertical axis is the probability that a classroom will beabove the ninety-fth percentile on our measure of unusual test score uctuationsThe circles in the gure represent averages from 200 equally spaced cells alongthe x-axis The predicted line is based on a probit model estimated with seventh-order polynomials in the suspicious string measure

854 QUARTERLY JOURNAL OF ECONOMICS

This is the part of the distribution that is unlikely to containmany cheating classrooms Under the assumption that the corre-lation between our two measures is constant in noncheatingclassrooms over the whole distribution we would predict that thestraight line observed for all but the right-hand portion of thegraph would continue all the way to the right edge of the graph

Instead as one approaches the extreme right tail of thedistribution of suspicious answer strings the probability of largetest score uctuations rises dramatically That sharp rise weargue is a consequence of cheating To estimate the prevalence ofcheating we simply compare the actual area under the curve inthe far right tail of Figure I to the predicted area under the curvein the right tail projecting the linear relationship observed fromthe ftieth to seventy-fth percentile out to the ninety-ninthpercentile Because this identication strategy is necessarily in-direct we devote a great deal of attention to presenting a widevariety of tests of the validity of our approach the sensitivity ofthe results to alternative assumptions and the plausibility of ourndings

IV DATA AND INSTITUTIONAL BACKGROUND

Elementary students in Chicago public schools take a stan-dardized multiple-choice achievement exam known as the IowaTest of Basic Skills (ITBS) The ITBS is a national norm-refer-enced exam with a reading comprehension section and threeseparate math sections14 Third through eighth grade students inChicago are required to take the exams each year Most schoolsadminister the exams to rst and second grade students as wellalthough this is not a district mandate

Our base sample includes all students in third to seventhgrade for the years 1993ndash200015 For each student we have thequestion-by-question answer string on each yearrsquos tests schooland classroom identiers the full history of prior and future testscores and demographic variables including age sex race andfree lunch eligibility We also have information about a wide

14 There are also other parts of the test which are either not included inofcial school reporting (spelling punctuation grammar) or are given only inselect grades (science and social studies) for which we do not have information

15 We also have test scores for eighth graders but we exclude them becauseour algorithm requires test score data for the following year and the ITBS test isnot administered to ninth graders

855ROTTEN APPLES

range of school-level characteristics We do not however haveindividual teacher identiers so we are unable to directly linkteachers to classrooms or to track a particular teacher over time

Because our cheating proxies rely on comparisons to past andfuture test scores we drop observations that are missing readingor math scores in either the preceding year or the following year(in addition to those with missing test scores in the baselineyear)16 Less than one-half of 1 percent of students with missingdemographic data are also excluded from the analysis Finallybecause our algorithms for identifying cheating rely on identify-ing suspicious patterns within a classroom our methods havelittle power in classrooms with small numbers of students Con-sequently we drop all classrooms for which we have fewer thanten valid students in a particular grade after our other exclusions(roughly 3 percent of students) A handful of classrooms recordedas having more than 40 studentsmdashpresumably multiple class-rooms not separately identied in the datamdashare also droppedOur nal data set contains roughly 20000 students per grade peryear distributed across approximately 1000 classrooms for atotal of over 40000 classroom-years of data (with four subjecttests per classroom-year) and over 700000 student-yearobservations

Summary statistics for the full sample of classrooms areshown in Table I where the unit of observation is at the level ofclasssubjectyear As in many urban school districts students inChicago are disproportionately poor and minority Roughly 60percent of students are Black and 26 percent are Hispanic withnearly 85 percent receiving free or reduced-price lunch Less than29 percent of students scored at or above national norms inreading

The ITBS exams are administered over a weeklong period inearly May Third grade teachers are permitted to administer theexam to their own students while other teachers switch classes toadminister the exams The exams are generally delivered to theschools one to two weeks before testing and are supposed to be

16 The exact number of students with missing test data varies by year andgrade Overall we drop roughly 12 percent of students because of missing testscore information in the baseline year These are students who (i) were not testedbecause of placement in bilingual or special education programs or (ii) simplymissed the exam that particular year In addition to these students we also dropapproximately 13 percent of students who were missing test scores in either thepreceding or following years Test data may be missing either because a studentdid not attend school on the days of the test or because the student transferredinto the CPS system in the current year or left the system prior to the next yearof testing

856 QUARTERLY JOURNAL OF ECONOMICS

kept in a secure location by the principal or the schoolrsquos testcoordinator an individual in the school designated to coordinatethe testing process (often a counselor or administrator) Each

TABLE ISUMMARY STATISTICS

Mean(sd)

Classroom characteristicsMixed grade classroom 0073Teacher administers exams to her own students (3rd grade) 0206Percent of students who were tested and included in ofcial

reporting 0883Average prior achievement 20004(as deviation from yeargradesubject mean) (0661) Black 0595 Hispanic 0263 Male 0495 Old for grade 0086 Living in foster care 0044 Living with nonparental relative 0104Cheatermdash95th percentile cutoff 0013School characteristicsAverage quality of teachersrsquo undergraduate institution

in the school22550(0877)

Percent of teachers who live in Chicago 0712Percent of teachers who have an MA or a PhD 0475Percent of teachers who majored in education 0712Percent of teachers under 30 years of age 0114Percent of teachers at the school less than 3 years 0547 students at national norms in reading last year 288 students receiving free lunch in school 847Predominantly Black school 0522Predominantly Hispanic school 0205Mobility rate in school 286Attendance rate in school 926School size 722

(317)Accountability policySocial promotion policy 0215School probation policy 0127Test form offered for the rst time 0371Number of observations 163474

The data summarized above include one observation per classroomyearsubjectThe sample includes allstudents in third to seventh grade for the years 1993ndash2000 We drop observations for the following reasons(a) missing reading or math scores in either the baseline year the preceding year or the following year (b)missing demographic data (c) classrooms for which we have fewer than ten valid students in a particulargrade or more than 40 students See the discussion in the text for a more detailed explanation of the sampleincluding the percentages excluded for various reasons

857ROTTEN APPLES

section of the exam consists of 30 to 60 multiple choice questionswhich students are given between 30 and 75 minutes to com-plete17 Students mark their responses on answer sheets whichare scanned to determine a studentrsquos score There is no penaltyfor guessing A studentrsquos raw score is simply calculated as thesum of correct responses on the exam The raw score is thentranslated into a grade equivalent

After collecting student exams teachers or administratorsthen ldquocleanrdquo the answer keys erasing stray pencil marks remov-ing dirt or debris from the form and darkening item responsesthat were only faintly marked by the student At the end of thetesting week the test coordinators at each school deliver thecompleted answer keys and exams to the CPS central ofceSchool personnel are not supposed to keep copies of the actualexams although school ofcials acknowledge that a number ofteachers each year do so The CPS has administered three differ-ent versions of the ITBS between 1993 and 2000 The CPS alter-nates forms each year with new forms being offered for the rsttime in 1993 1994 and 199718

V THE PREVALENCE OF TEACHER CHEATING

As noted earlier our estimates of the prevalence of cheatingare derived from a comparison of the actual number of classroomsthat are above some threshold on both of our cheating indicatorsrelative to the number we would expect to observe based on thecorrelation between these two measures in the 50thndash75th percen-tiles of the suspicious string measure The top panel of Table IIpresents our estimates of the percentage of classrooms that arecheating on average on a given subject test (ie reading compre-hension or one of the three math tests) in a given year19 Because

17 The mathematics and reading tests measure basic skills The readingcomprehension exam consists of three to eight short passages followed by up tonine questions relating to the passage The math exam consists of three sectionsthat assess number concepts problem-solving and computation

18 These three forms are used for retesting summer school testing andmidyear testing as well so that it is likely that over the years teachers have seenthe same exam on a number of occasions

19 It is important to note that we exclude a particular form of cheating thatappears to be quite prevalent in the data teachers randomly lling in answers leftblank by students at the end of the exam In many classrooms almost everystudent will end the test with a long string of the identical answers (typically allthe same response like ldquoBrdquo or ldquoDrdquo) The fact that almost all students in the classcoordinate on the same pattern strongly suggests that the students themselvesdid not ll in the blanks or were under explicit instructions by the teacher to do

858 QUARTERLY JOURNAL OF ECONOMICS

the decision as to what cutoff signies a ldquohighrdquo value on ourcheating indicators is arbitrary we present a 3 3 3 matrix ofestimates using three different thresholds (eightieth ninetiethand ninety-fth percentiles) for each of our cheating measuresThe estimated prevalence of cheaters ranges from 11 percent to21 percent depending on the particular set of cutoffs used Aswould be expected the number of cheaters is generally decliningas higher thresholds are employed Nonetheless it is encouragingthat over such a wide set of cutoffs the range of estimates isrelatively tight

The bottom panel of Table II presents estimates of the per-centage of classrooms that are cheating on any of the four subjecttests in a particular year If every classroom that cheated did soonly on one subject test then the results in the bottom panel

so Since there is no penalty for guessing on the test lling in the blanks can onlyincrease student test scores While this type of teacher behavior is likely to beviewed by many as unethical we do not make it the focus of our analysis because(1) it is difcult to provide denitive evidence of such behavior (a teacher couldargue that he or she instructed students well in advance of the test to ll in allblanks with the letter ldquoCrdquo as part of good test-taking strategy) and (2) in ourminds it is categorically different than a teacher who systematically changesstudent responses to the correct answer

TABLE IIESTIMATED PREVALENCE OF TEACHER CHEATING

Cutoff for suspicious answerstrings (ANSWERS)

Cutoff for test score uctuations (SCORE)

80th percentile 90th percentile 95th percentile

Percent cheating on a particular test80th percentile 21 21 1890th percentile 18 18 1595th percentile 13 13 11

Percent cheating on at least one of the four tests given80th percentile 45 56 5390th percentile 42 49 4495th percentile 35 38 34

The top panel of the table presents estimates of the percentage of classrooms cheating on a particularsubject test in a given year based on three alternative cutoffs for ANSWERS and SCORE In all cases theprevalence of cheating is based on the excess number of classrooms with unexpected test score uctuationamong classes with suspicious answer strings relative to classes that do not have suspicious answer stringsThe bottom panel of the table presents estimates of the percentage of classrooms cheating on at least one ofthe four subject tests that comprise the overall test In the bottom panel classrooms that cheat on more thanone subject test are counted only once Our sample includes over 35000 3rdndash7th grade classrooms in theChicago public schools for the years 1993ndash1999

859ROTTEN APPLES

would simply be four times the results in the top panel In manyinstances however classrooms appear to cheat on multiple sub-jects Thus the prevalence rates range from 34ndash56 percent of allclassrooms20 Because of the necessarily indirect nature of ouridentication strategy we explore a range of supplementary testsdesigned to assess the validity of the estimates21

VA Simulation Results

Our prevalence estimates may be biased downward or up-ward depending on the extent to which our algorithm fails tosuccessfully identify all instances of cheating (ldquofalse negativesrdquo)versus the frequency with which we wrongly classify a teacher ascheating (ldquofalse positiverdquo)

One simple simulation exercise we undertook with respect topossible false positives involves randomly assigning students tohypothetical classrooms These synthetic classrooms thus consistof students who in actuality had no connection to one another Wethen analyze these hypothetical classrooms using the same algo-rithm applied to the actual data As one would hope no evidenceof cheating was found in the simulated classes Indeed the esti-mated prevalence of cheating was slightly negative in this simu-lation (ie classrooms with large test score increases in thecurrent year followed by big declines the next year were slightlyless likely to have unusual patterns of answer strings) Thusthere does not appear to be anything mechanical in our algorithmthat generates evidence of cheating

A second simulation involves articially altering a class-

20 Computation of the overall prevalence is somewhat complicated becauseit involves calculating not only how many classrooms are actually above thethresholds on multiple subject tests but also how frequently this would occur inthe absence of cheating Details on these calculations are available from theauthors

21 In an earlier version of this paper [Jacob and Levitt 2003] we present anumber of additional sensitivity analyses that conrm the basic results First wetest whether our results are sensitive to the way in which we predict the preva-lence of high values of ANSWERS and SCORE in the upper part of the otherdistribution (eg tting a linear or quadratic model to the data in the lowerportion of the distribution versus simply using the average value from the 50thndash75th percentiles) We nd our results are extremely robust Second we examinewhether our results might simply be due to mean reversion by testing whetherclassrooms with suspicious answer strings are less likely to maintain large testscore gains Among a sample of classrooms with large test score gains we nd thatmean reversion is substantially greater among the set of classes with highlysuspicious answer strings Third we investigate whether we might be detectingstudent rather than teacher cheating by examining whether students who havesuspicious answer strings in one year are more likely to have suspicious answerstrings in other years We nd that they do not

860 QUARTERLY JOURNAL OF ECONOMICS

roomrsquos answer strings in ways that we believe mimic teachercheating or outstanding teaching22 We are then able to see howfrequently we label these altered classes as ldquocheatingrdquo and howthis varies with the extent and nature of the changes we make toa classroomrsquos answers We simulate two different types of teachercheating The rst is a very naive version in which a teacherstarts cheating at the same question for a number of students andchanges consecutive questions to the right answers for thesestudents creating a block of identical and correct responses Thesecond type of cheating is much more sophisticated we randomlychange answers from incorrect to correct for selected students ina class The outcome of this simulation reveals that our methodsare surprisingly weak in detecting moderate cheating even in thenave case For instance when three consecutive answers arechanged to correct for 25 percent of a class we catch such behav-ior only 4 percent of the time Even when six questions arechanged for half of the class we catch the cheating in less than 60percent of the cases Only when the cheating is very extreme (egsix answers changed for the entire class) are we very likely tocatch the cheating The explanation for this poor performance isthat classrooms with low test-score gains even with the boostprovided by this cheating do not have large enough test scoreuctuations to make them appear unusual because there is somuch inherent volatility in test scores When the cheating takesthe more sophisticated form described above we are even lesssuccessful at catching low and moderate cheaters (2 percent and34 percent respectively in the rst two scenarios describedabove) but we almost always detect very extreme cheating

From a political or equity perspective however we may beeven more concerned with false positives specically the risk ofaccusing highly effective teachers of cheating Hence we alsosimulate the impact of a good teacher changing certain answersfor certain students from incorrect to correct responses Our ma-nipulation in this case differs from the cheating simulationsabove in two ways (1) we allow some of the gains to be preservedin the following year and (2) we alter questions such that stu-dents will not get random answers correct but rather will tend toshow the greatest improvement on the easiest questions that thestudents were getting wrong These simulation results suggest

22 See Jacob and Levitt [2003] for the precise details of this simulationexercise

861ROTTEN APPLES

that only rarely are good teachers mistakenly labeled as cheatersFor instance a teacher whose prowess allows him or her to raisetest scores by six questions for half the students in the class (morethan a 05 grade equivalent increase on average for the class) islabeled a cheater only 2 percent of the time In contrast a navecheater with the same test score gains is correctly identied as acheater in more than 50 percent of cases

In another test for false positives we explored whether wemight be mistaking emphasis on certain subject material forcheating For example if a math teacher spends several monthson fractions with a particular class one would expect the class todo particularly well on all of the math questions relating tofractions and perhaps worse than average on other math ques-tions One might imagine a similar scenario in reading if forexample a teacher creates a lengthy unit on the UndergroundRailroad which later happens to appear in a passage on thereading comprehension exam To examine this we analyze thenature of the across-question correlations in cheating versusnoncheating classrooms We nd that the most highly correlatedquestions in cheating classrooms are no more likely to measurethe same skill (eg fractions geometry) than in noncheatingclassrooms The implication of these results is that the prevalenceestimates presented above are likely to substantially understatethe true extent of cheatingmdashthere is little evidence of false posi-tivesmdashbut we frequently miss moderate cases of cheating

VB The Correlation of Cheating Indicators across SubjectsClassrooms and Years

If what we are detecting is truly cheating then one wouldexpect that a teacher who cheats on one part of the test would bemore likely to cheat on other parts of the test Also a teacher whocheats one year would be more likely to cheat the following yearFinally to the extent that cheating is either condoned by theprincipal or carried out by the test coordinator one would expectto nd multiple classes in a school cheating in any given year andperhaps even that cheating in a school one year predicts cheatingin future years If what we are detecting is not cheating then onewould not necessarily expect to nd strong correlation in ourcheating indicators across exams for a specic classroom acrossclassrooms or across years

862 QUARTERLY JOURNAL OF ECONOMICS

Table III reports regression results testing these predictionsThe dependent variable is an indicator for whether we believe aclassroom is likely to be cheating on a particular subject testusing our most stringent denition (above the ninety-fth per-centile on both cheating indicators) The baseline probability of

TABLE IIIPATTERNS OF CHEATING WITHIN CLASSROOMS AND SCHOOLS

Independent variables

Dependent variable 5 Class suspectedof cheating (mean of the dependent

variable 5 0011)

Full sample

Sample ofclasses andschool that

existed in theprior year

Classroom cheated on exactly oneother subject this year

0105(0008)

0103(0008)

0101(0009)

0101(0009)

Classroom cheated on exactly twoother subjects this year

0289(0027)

0285(0027)

0243(0031)

0243(0031)

Classroom cheated on all three othersubjects this year

0627(0051)

0622(0051)

0595(0054)

0595(0054)

Cheating rate among all other classesin the school this year on thissubject

0166(0030)

0134(0027)

0129(0027)

Cheating rate among all other classesin the school this year on othersubjects

0023(0024)

0059(0026)

0045(0029)

Cheating in this classroom in thissubject last year

0096(0012)

0091(0012)

Number of other subjects thisclassroom cheated on last year

0023(0004)

0018(0004)

Cheating in this classroom ever in thepast

0006(0002)

Cheating rate among other classroomsin this school in past years

0090(0040)

Fixed effects for gradesubjectyear Yes Yes Yes YesR2 0090 0093 0109 0109Number of observations 165578 165578 94182 94170

The dependent variable is an indicator for whether a classroom is above the 95th percentile on both oursuspicious strings and unusual test score measures of cheating on a particular subject test Estimation isdone using a linear probability model The unit of observation is classroomgradeyearsubjectFor columnsthat include measures of cheating in prior years observations where that classroom or school does not appearin the data in the prior year are excluded Standard errors are clustered at the school level to take intoaccount correlations across classroom as well as serial correlation

863ROTTEN APPLES

qualifying as a cheater for this cutoff is 11 percent To fullyappreciate the enormity of the effects implied by the table it isimportant to keep this very low baseline in mind We reportestimates from linear probability models (probits yield similarmarginal effects) with standard errors clustered at the schoollevel

Column 1 of Table III shows that cheating on other tests inthe same year is an extremely powerful predictor of cheating in adifferent subject If a classroom cheats on exactly one other sub-ject test the predicted probability of cheating on this test in-creases by over ten percentage points Since the baseline cheatingrates are only 11 percent classrooms cheating on exactly oneother test are ten times more likely to have cheated on thissubject than are classrooms that did not cheat on any of the othersubjects (which is the omitted category) Classrooms that cheaton two other subjects are almost 30 times more likely to cheat onthis test relative to those not cheating on other tests If a classcheats on all three other subjects it is 50 times more likely to alsocheat on this test

There also is evidence of correlation in cheating withinschools A ten-percentage-point increase in cheating classroomsin a school (excluding the classroom in question) on the samesubject test raises the likelihood this class cheats by roughly 016percentage points This potentially suggests some role for cen-tralized cheating by a school counselor test coordinator or theprincipal rather than by teachers operating independentlyThere is little evidence that cheating rates within the school onother subject tests affect cheating on this test

When making comparisons across years (columns 3 and 4) itis important to note that we do not actually have teacher identi-ers We do however know what classroom a student is assignedto Thus we can only compare the correlation between past andcurrent cheating in a given classroom To the extent that teacherturnover occurs or teachers switch classrooms this proxy will becontaminated Even given this important limitation cheating inthe classroom last year predicts cheating this year In column 3for example we see that classrooms that cheated in the samesubject last year are 96 percentage points more likely to cheatthis year even after we control for cheating on other subjects inthe same year and cheating in other classes in the school Column4 shows that prior cheating in the school strongly predicts thelikelihood that a classroom will cheat this year


V.C. Evidence from Retesting under Controlled Circumstances

Perhaps the most compelling evidence for our methods comes from the results of a unique policy intervention. In spring 2002 the Chicago public schools provided us the opportunity to retest over 100 classrooms under controlled circumstances that precluded cheating, a few weeks after the initial ITBS exam was administered.23 Based on suspicious answer strings and unusual test score gains on the initial spring 2002 exam, we identified a set of classrooms most likely to have cheated. In addition, we chose a second group of classrooms that had suspicious answer strings but did not have large test score gains, a pattern that is consistent with a bad teacher who attempts to mask a poor teaching performance by cheating. We also retested two sets of classrooms as control groups. The first control group consisted of classes with large test score gains but no evidence of cheating in their answer strings, a potential indication of effective teaching. These classes would be expected to maintain their gains when retested, subject to possible mean reversion. The second control group consisted of randomly selected classrooms.

The results of the retests are reported in Table IV. Columns 1-3 correspond to reading; columns 4-6 are for math. For each subject we report the test score gain (in standard score units) for three periods: (i) from spring 2001 to the initial spring 2002 test; (ii) from the initial spring 2002 test to the retest a few weeks later; and (iii) from spring 2001 to the spring 2002 retest. For purposes of comparison, we also report the average gain for all classrooms in the system for the first of those measures (the other two measures are unavailable unless a classroom was retested).

Classrooms prospectively identified as "most likely to have cheated" experienced gains on the initial spring 2002 test that were nearly twice as large as the typical CPS classroom (see columns 1 and 4). On the retest, however, those excess gains completely disappeared: the gains between spring 2001 and the spring 2002 retest were close to the systemwide average. Those classrooms prospectively identified as "bad teachers likely to have cheated" also experienced large score declines on the retest (-8.8 and -10.5 standard score points, or roughly two-thirds of a grade equivalent). Indeed, comparing the spring 2001 results with the

23. Full details of this policy experiment are reported in Jacob and Levitt [2003]. The results of the retests were used by CPS to initiate disciplinary action against a number of teachers, principals, and staff.


TABLE IV
RESULTS OF RETESTS: COMPARISON OF RESULTS FOR SPRING 2002 ITBS AND AUDIT TEST

                                              Reading gains between           Math gains between
                                           2001 to   2001 to   2002 to    2001 to   2001 to   2002 to
Category of classroom                        2002     retest    retest      2002     retest    retest

All classrooms in the Chicago
  public schools                             14.3       --        --        16.9       --        --
Classrooms prospectively identified
  as most likely cheaters
  (N = 36 on math, N = 39 on reading)        28.8      12.6     -16.2       30.0      19.3     -10.7
Classrooms prospectively identified
  as bad teachers suspected of cheating
  (N = 16 on math, N = 20 on reading)        16.6       7.8      -8.8       17.3       6.8     -10.5
Classrooms prospectively identified
  as good teachers who did not cheat
  (N = 17 on math, N = 17 on reading)        20.6      21.1       0.5       28.8      25.5      -3.3
Randomly selected classrooms
  (N = 24 overall, but only one
  test per classroom)                        14.5      12.2      -2.3       14.5      12.2      -2.3

Results in this table compare test scores at three points in time (spring 2001, spring 2002, and a retest administered under controlled conditions a few weeks after the initial spring 2002 test). The unit of observation is a classroom-subject test pair. Only students taking all three tests are included in the calculations. Because of limited data, math and reading results for the randomly selected classrooms are combined. Only the first two columns are available for all CPS classrooms, since audits were performed only on a subset of classrooms. All entries in the table are in standard score units.


spring 2002 retest, children in these classrooms had gains only half as large as the typical yearly CPS gain. In stark contrast, classrooms prospectively identified as having "good teachers" (i.e., classrooms with large gains but not suspicious answer strings) actually scored even higher on the reading retest than they did on the initial test. The math scores fell slightly on retesting, but these classrooms continued to experience extremely large gains. The randomly selected classrooms also maintained almost all of their gains when retested, as would be expected. In summary, our methods proved to be extremely effective both in identifying likely cheaters and in correctly classifying teachers who achieved large test score gains in a legitimate manner.

VI. DOES TEACHER CHEATING RESPOND TO INCENTIVES?

From the perspective of economics, perhaps the most interesting question related to teacher cheating is the degree to which it responds to incentives. As noted in the Introduction, there were two major changes in the incentives faced by teachers and students over our sample period. Prior to 1996, ITBS scores were primarily used to provide teachers and parents with a sense of how a child was progressing academically. Beginning in 1996, with the appointment of Paul Vallas as CEO of Schools, the CPS launched an initiative designed to hold students and teachers accountable for student learning.

The reform had two main elements. The first involved putting schools "on probation" if less than 15 percent of students scored at or above national norms on the ITBS reading exams. The CPS did not use math performance to determine probation status. Probation schools that do not exhibit sufficient improvement may be reconstituted, a procedure that involves closing the school and dismissing or reassigning all of the school staff.24 It is clear from our discussions with teachers and administrators that being on probation is viewed as an extremely undesirable circumstance. The second piece of the accountability reform was an end to social promotion, the practice of passing students to the next grade regardless of their academic skills or school performance. Under the new policy, students in third, sixth, and eighth grade must meet minimum standards on the ITBS in both reading and

24. For a more detailed analysis of the probation policy, see Jacob [2002] and Jacob and Lefgren [forthcoming].


mathematics in order to be promoted to the next grade. The promotion standards were implemented in spring 1997 for third and sixth graders. Promotion decisions are based solely on scores in reading comprehension and mathematics.

Table V presents OLS estimates of the relationship between teacher cheating and a variety of classroom and school characteristics.25 The unit of observation is a classroom×subject×grade×year. The dependent variable is an indicator of whether we suspect the classroom cheated, using our most stringent definition: a classroom is designated a cheater if both its test score fluctuations and the suspiciousness of its answer strings are above the ninety-fifth percentile for that grade, subject, and year.26
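As a concrete illustration, the following is a hedged sketch of how such a dependent variable can be constructed from the two underlying statistics. The column names (score_stat, answers_stat, and the grouping keys) are our own placeholders, not the authors' actual variable names.

```python
# Sketch: flag a classroom as a suspected cheater when both statistics fall
# above the 95th percentile within its grade-subject-year cell.
import pandas as pd

def flag_cheaters(df: pd.DataFrame, cutoff: float = 0.95) -> pd.Series:
    """df: one row per classroom-grade-year-subject, with columns
    'score_stat' and 'answers_stat' holding the SCORE and ANSWERS values."""
    cells = df.groupby(["grade", "subject", "year"])
    score_rank = cells["score_stat"].rank(pct=True)
    answers_rank = cells["answers_stat"].rank(pct=True)
    return ((score_rank > cutoff) & (answers_rank > cutoff)).astype(int)
```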

In column (1) the policy changes are restricted to have a constant impact across all classrooms. The introduction of the social promotion and probation policies is positively correlated with the likelihood of classroom cheating, although the point estimates are not statistically significant at conventional levels. Much more interesting results emerge when we interact the policy changes with the previous year's test scores for the classroom. For both probation and social promotion, cheating rates in the lowest performing classrooms prove to be quite responsive to the change in incentives. In column (2) we see that a classroom one standard deviation below the mean increased its likelihood of cheating by 0.43 percentage points in response to the school probation policy, and roughly 0.65 percentage points due to the ending of social promotion. Given a baseline rate of cheating of 1.1 percent, these effects are substantial. The magnitude of these changes is particularly large considering that no elementary school on probation was actually reconstituted during this period, and that the social promotion policy has a direct impact on students but no direct ramifications for teacher pay or rewards. A classroom one standard deviation above the mean does not see any significant change in cheating in response to these two policies. Such classes are very unlikely to be located in schools at risk for being put on probation, and also are likely to have few students at risk for being retained. These findings are robust to

25. Logit and probit models evaluated at the mean yield comparable results, so the estimates from a linear probability model are presented for ease of interpretation.

26. The results are not sensitive to the cheating cutoff used. Note that this measure may include error due to both false positives and negatives. Since the measurement error is in the dependent variable, the primary impact will likely be to simply decrease the precision of our estimates.


TABLE V
OLS ESTIMATES OF THE RELATIONSHIP BETWEEN CHEATING AND CLASSROOM CHARACTERISTICS

Dependent variable = Indicator of classroom cheating

Independent variables                               (1)        (2)        (3)        (4)

Social promotion policy                           0.0011     0.0011     0.0015     0.0023
                                                 (0.0013)   (0.0013)   (0.0013)   (0.0009)
School probation policy                           0.0020     0.0019     0.0021     0.0029
                                                 (0.0014)   (0.0014)   (0.0014)   (0.0013)
Prior classroom achievement                      -0.0047    -0.0028    -0.0016    -0.0028
                                                 (0.0005)   (0.0005)   (0.0007)   (0.0007)
Social promotion × classroom achievement                    -0.0049    -0.0051    -0.0046
                                                            (0.0014)   (0.0014)   (0.0012)
School probation × classroom achievement                    -0.0070    -0.0070    -0.0064
                                                            (0.0013)   (0.0013)   (0.0013)
Mixed grade classroom                            -0.0084    -0.0085    -0.0089    -0.0089
                                                 (0.0007)   (0.0007)   (0.0008)   (0.0012)
% of students included in official reporting      0.0252     0.0249     0.0141     0.0131
                                                 (0.0031)   (0.0031)   (0.0037)   (0.0037)
Teacher administers exam to own students          0.0067     0.0067     0.0066     0.0061
                                                 (0.0015)   (0.0015)   (0.0015)   (0.0011)
Test form offered for the first time (a)         -0.0007    -0.0007    -0.0011       --
                                                 (0.0011)   (0.0011)   (0.0010)
Average quality of teachers'
  undergraduate institution                                            -0.0026
                                                                       (0.0007)
Percent of teachers who have worked at
  the school less than 3 years                                         -0.0045
                                                                       (0.0031)
Percent of teachers under 30 years of age                               0.0156
                                                                       (0.0065)
Percent of students in the school meeting
  national norms in reading last year                                   0.0001
                                                                       (0.0000)
Percent free lunch in school                                            0.0001
                                                                       (0.0000)
Predominantly Black school                                              0.0068
                                                                       (0.0019)
Predominantly Hispanic school                                          -0.0009
                                                                       (0.0016)
School × Year fixed effects                         No         No         No        Yes
Number of observations                           163,474    163,474    163,474    163,474

The unit of observation is classroom×grade×year×subject, and the sample includes eight years (1993 to 2000), four subjects (reading comprehension and three math sections), and five grades (three to seven). The dependent variable is the cheating indicator derived using the ninety-fifth percentile cutoff. Robust standard errors clustered by school×year are shown in parentheses. Other variables included in the regressions in columns (1) and (2) include a linear time trend, cubic terms for the number of students, a linear grade variable, and fixed effects for subjects. The regression shown in column (3) also includes the following variables: indicators of the percent of students in the classroom who were Black, Hispanic, male, receiving free lunch, old for grade, in a special education program, living in foster care, and living with a nonparental relative; indicators of school size, mobility rate, and attendance rate; and the percent of teachers in the school who had a master's or doctoral degree, lived in Chicago, and were education majors.

a. Test forms vary only by year, so this variable will drop out of the analysis when school×year fixed effects are included.


adding a number of classroom and school characteristics (column (3)), as well as school×year fixed effects (column (4)).

Cheating also appears to be responsive to other costs and benefits. Classrooms that tested poorly last year are much more likely to cheat. For example, a classroom with average student prior achievement one classroom standard deviation below the mean is 23 percent more likely to cheat. Classrooms with students in multiple grades are 65 percent less likely to cheat than classrooms where all students are in the same grade. This is consistent with the fact that it is likely more difficult for teachers in such classrooms to cheat, since they must administer two different test forms to students, which will necessarily have different correct answers. Moreover, classes with a higher proportion of students who are included in the official test reporting are more likely to cheat: a 10 percentage point increase in the proportion of students in a class whose test scores "count" will increase the likelihood of cheating by roughly 20 percent.27

Teachers who administer the exam to their own students are 0.67 percentage points (approximately 50 percent) more likely to cheat.28

A few other notable results emerge. First, there is no statistically significant impact on cheating of reusing a test form that has been administered in a previous year. That finding is of interest because it suggests that teachers taking old exams and teaching the precise questions to students is not an important component of what we are detecting as cheating (although anecdotal evidence suggests this practice exists). Classrooms in schools with lower achievement, higher poverty rates, and more Black students are more likely to cheat. Classrooms in schools with teachers who graduated from more prestigious undergraduate institutions are less likely to cheat, whereas classrooms in schools with younger teachers are more likely to cheat.

27. Consistent with this finding, within cheating classrooms teachers are less likely to alter the answer sheets for students who are excluded from test reporting. Teachers also appear to cheat less frequently for boys and for older students.

28. In an earlier version of the paper [Jacob and Levitt 2002] we examine for which students teachers tend to cheat. We find that teachers are most likely to cheat for students in the second quartile of the national ability distribution (twenty-fifth to fiftieth percentile) in the previous year, consistent with the incentives under the probation policy.


VII. CONCLUSION

This paper develops an algorithm for determining the prevalence of teacher cheating on standardized tests and applies the results to data from the Chicago public schools. Our methods reveal thousands of instances of classroom cheating, representing 4-5 percent of the classrooms each year. Moreover, we find that teacher cheating appears quite responsive to relatively minor changes in incentives.

The obvious benefits of high-stakes tests as a means of providing incentives must therefore be weighed against the possible distortions that these measures induce. Explicit cheating is merely one form of distortion. Indeed, the kind of cheating we focus on is one that could be greatly reduced at a relatively low cost through the implementation of proper safeguards, such as is done by the Educational Testing Service on the SAT or GRE exams.29

Even if this particular form of cheating were eliminated, however, there are a virtually infinite number of dimensions along which strategic school personnel can act to game the current system. For instance, Figlio and Winicki [2001] and Rohlfs [2002] document that schools attempt to increase the caloric intake of students on test days, and disruptive students are especially likely to be suspended from school during official testing periods [Figlio 2002]. It may be prohibitively costly, or even impossible, to completely safeguard the system.

Finally, this paper fits into a small but growing body of research focused on identifying corrupt or illicit behavior on the part of economic actors (see Porter and Zona [1993], Fisman [2001], Di Tella and Schargrodsky [2001], and Duggan and Levitt [2002]). Because individuals engaged in such behavior actively attempt to hide their trails, the intellectual exercise associated with uncovering their misdeeds differs substantially from the typical economic application, in which the researcher starts with a well-defined measure of the outcome variable (e.g., earnings, economic growth, profits) and then attempts to uncover the determinants of these outcomes. In the case of corruption, there is typically no clear outcome variable, making it necessary for the researcher to employ nonstandard approaches in generating such a measure.

29. Although even tests administered by the Educational Testing Service are not immune to cheating, as evidenced by recent reports that many of the GRE scores from China are fraudulent [Contreras 2002].



APPENDIX: THE CONSTRUCTION OF SUSPICIOUS STRING MEASURES

This appendix describes in greater detail how we construct each of our measures of unexpected or suspicious responses.

The first measure focuses on the most unlikely block of identical answers given on consecutive questions. Using past test scores, future test scores, and background characteristics, we predict the likelihood that each student will give each answer on each question. For each item a student has four choices (A, B, C, or D), only one of which is correct. We estimate a multinomial logit for each item on the exam in order to predict how students will respond to each question. We estimate the following model for each item, using information from other students in that year, grade, and subject:

(1)    \Pr(Y_{isc} = j) = \frac{e^{\beta_j X_s}}{\sum_{j=1}^{J} e^{\beta_j X_s}}

where Y_{isc} indicates the response of student s in class c on item i, the number of possible responses (J) is four, and X_s is a vector that includes measures of prior and future student achievement in math and reading, as well as demographic variables (such as race, gender, and free lunch status) for student s. Thus, a student's predicted probability of choosing a particular response is identified by the likelihood of other students (in the same year, grade, and subject) with similar background characteristics choosing that response.

Notice that by including future as well as prior test scores in the model, we decrease the likelihood that students with unusually good teachers will be identified as cheaters, since these students will likely retain some of the knowledge learned in the base year and thus have higher future test scores. Also note that by estimating the probability of selecting each possible response, rather than simply estimating the probability of choosing the correct response, we take advantage of any additional information that is provided by particular response patterns in a classroom.

Using the estimates from this model, we calculate the predicted probability that each student would answer each item in the way that he or she in fact did:

(2)    p_{isc} = \frac{e^{\beta_k X_s}}{\sum_{j=1}^{J} e^{\beta_j X_s}},  where k = the response actually chosen by student s on item i.
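The following is an illustrative sketch of equations (1) and (2) for a single item: fit a multinomial logit of the chosen option on student covariates, then read off each student's probability of the answer actually given. The data are simulated, and the covariates are placeholders for the achievement and demographic measures described above; this is not the authors' code.

```python
# Sketch of the item-level multinomial logit (eq. (1)) and the predicted
# probability of each student's actual answer (eq. (2)).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n_students = 2000
X = rng.normal(size=(n_students, 5))  # stand-ins for the covariates X_s
y = rng.integers(0, 4, n_students)    # chosen option: 0=A, 1=B, 2=C, 3=D

item_logit = LogisticRegression(max_iter=1000)  # multinomial with lbfgs
item_logit.fit(X, y)
probs = item_logit.predict_proba(X)             # eq. (1): n_students x 4

# eq. (2): one measure per student per item
p_chosen = probs[np.arange(n_students), y]
```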

This provides us with one measure per student per item. Taking the product over items within student, we calculate the probability that a student would have answered a string of consecutive questions from item m to item n as he or she did:

(3)    \tilde{p}_{sc}^{mn} = \prod_{i=m}^{n} p_{isc}

We then take the product across all students in the classroom who had identical responses in the string. If we define z as a student, S_{zc}^{mn} as the string of responses for student z from item m to item n, and S_{sc}^{mn} as the string for student s, then we can express the product as

(4)    \bar{p}_{sc}^{mn} = \prod_{s \in \{z : S_{zc}^{mn} = S_{sc}^{mn}\}} \tilde{p}_{sc}^{mn}

Note that if there are n_s students in class c and each student has a unique set of responses to these particular items, then \bar{p}_{sc}^{mn} collapses to \tilde{p}_{sc}^{mn} for each student, and there will be n_s distinct values within the class. At the other extreme, if all of the students in class c have identical responses, then there is only one distinct value of \bar{p}_{sc}^{mn}. We repeat this calculation for all possible consecutive strings of length three to seven, that is, for all S^{mn} such that 3 ≤ n - m ≤ 7. To create our first indicator of suspicious string patterns, we take the minimum of the predicted block probability for each classroom:

MEASURE 1.  M1_c = \min_s (\bar{p}_{sc}^{mn})

This measure captures the least likely block of identical answers given on consecutive questions in the classroom.
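To make the search concrete, here is a simplified sketch of Measure 1. It assumes the per-item probabilities from equation (2) are already computed, scores only blocks actually shared by at least two students, and is written for clarity rather than speed; it is our reading of the procedure, not a reproduction of the authors' code.

```python
# Simplified sketch of Measure 1: scan all consecutive blocks of 3-7 items,
# group students sharing an identical answer substring, and keep the least
# likely block. `answers` is (students x items, coded 0-3); `probs` is
# (students x items) of eq. (2) probabilities.
import numpy as np
from collections import defaultdict

def measure_1(answers: np.ndarray, probs: np.ndarray) -> float:
    n_students, n_items = answers.shape
    least_likely = 1.0
    for length in range(3, 8):                    # block lengths 3..7
        for start in range(n_items - length + 1):
            block = slice(start, start + length)
            groups = defaultdict(list)
            for s in range(n_students):
                key = tuple(answers[s, block])    # the answer substring
                groups[key].append(np.prod(probs[s, block]))  # eq. (3)
            for member_probs in groups.values():
                if len(member_probs) > 1:         # identical responses
                    least_likely = min(least_likely,
                                       float(np.prod(member_probs)))  # eq. (4)
    return least_likely
```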

The second measure of suspicious answer strings is intended to capture more general patterns of similarity in student responses. To construct this measure, we first calculate the residuals for each of the possible choices a student could have made for each item:

(5)    e_{jisc} = 0 - \frac{e^{\beta_j X_s}}{\sum_{j=1}^{J} e^{\beta_j X_s}}  if j ≠ k;
       e_{jisc} = 1 - \frac{e^{\beta_j X_s}}{\sum_{j=1}^{J} e^{\beta_j X_s}}  if j = k,

where e_{jisc} is the residual for response j on item i by student s in classroom c. We thus have four separate residuals per student per item.

To create a classroom-level measure of the response to item i, we need to combine the information for each student. First, we sum the residuals for each response across students within a classroom:

(6)    \bar{e}_{jic} = \sum_s e_{jisc}

If there is no within-class correlation in the way that students responded to a particular item, this term should be approximately zero. Second, we sum across the four possible responses for each item within classrooms. At the same time, we square each of the component residual measures to accentuate outliers, and divide by the number of students in the class (n_{sc}) to normalize by class size:

(7)    v_{ic} = \frac{\sum_j \bar{e}_{jic}^2}{n_{sc}}

The statistic v_{ic} captures something like the variance of student responses on item i within classroom c. Notice that we choose to first sum the residuals of each response across students, and then sum the classroom-level measures for each response, rather than summing across responses within student initially. We do this in order to emphasize the classroom-level tendencies in response patterns.

Our second measure of suspicious strings is simply the classroom average (across items) of this variance term across all test items:

MEASURE 2.  M2_c = \bar{v}_c = \sum_i v_{ic} / n_i, where n_i is the number of items on the exam.

Our third measure is simply the variance (as opposed to the mean) in the degree of correlation across questions within a classroom:

MEASURE 3.  M3_c = \sigma_{v_c}^2 = \sum_i (v_{ic} - \bar{v}_c)^2 / n_i
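A compact sketch of Measures 2 and 3 follows, assuming the residual array from equation (5) has already been computed; the array layout and names are our own assumptions.

```python
# Sketch of Measures 2 and 3 from equations (6)-(7). `resid` holds the
# e_jisc residuals with shape (students x items x 4 responses).
import numpy as np

def measures_2_and_3(resid: np.ndarray) -> tuple[float, float]:
    n_students = resid.shape[0]
    e_bar = resid.sum(axis=0)                    # eq. (6): (items x 4)
    v = (e_bar ** 2).sum(axis=1) / n_students    # eq. (7): one v_ic per item
    return float(v.mean()), float(v.var())      # Measure 2, Measure 3
```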

Our final indicator focuses on the extent to which a student's response pattern was different from those of other students with the same aggregate score that year. Let q_{isc} equal one if student s in classroom c answered item i correctly, and zero otherwise. Let A_s equal the aggregate score of student s on the exam. We then determine what fraction of students at each aggregate score level answered each item correctly. If we let n_{sA} equal the number of students with an aggregate score of A, then this fraction \bar{q}_i^A can be expressed as

(8)    \bar{q}_i^A = \frac{\sum_{s \in \{z : A_z = A_s\}} q_{isc}}{n_{sA}}

We then calculate a measure of how much the response pattern of student s differed from the response pattern of other students with the same aggregate score. We do so by subtracting a student's answer on item i from the mean response of all students with aggregate score A, squaring these deviations, and then summing across all items on the exam:

(9)    Z_{sc} = \sum_i (q_{isc} - \bar{q}_i^A)^2

We then subtract out the mean deviation for all students with the same aggregate score, \bar{Z}_A, and sum across the students within each classroom to obtain our final indicator:

MEASURE 4.  M4_c = \sum_s (Z_{sc} - \bar{Z}_A)
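The following is a hedged sketch of Measure 4 via equations (8) and (9), computed over all students taking one exam; the input names (`correct`, `classroom`) are our own.

```python
# Sketch of Measure 4. `correct` is a (students x items) 0/1 array and
# `classroom` holds each student's classroom id.
import numpy as np
import pandas as pd

def measure_4(correct: np.ndarray, classroom: np.ndarray) -> pd.Series:
    agg = correct.sum(axis=1)                         # aggregate score A_s
    q_bar = pd.DataFrame(correct).groupby(agg) \
              .transform("mean").values               # eq. (8): mean by score
    z = ((correct - q_bar) ** 2).sum(axis=1)          # eq. (9): Z_sc
    z_bar = pd.Series(z).groupby(agg).transform("mean").values
    return pd.Series(z - z_bar).groupby(classroom).sum()  # M4 per classroom
```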

KENNEDY SCHOOL OF GOVERNMENT, HARVARD UNIVERSITY

UNIVERSITY OF CHICAGO AND AMERICAN BAR FOUNDATION

REFERENCES

Aiken, Linda R., "Detecting, Understanding, and Controlling for Cheating on Tests," Research in Higher Education, XXXII (1991), 725-736.

Angoff, William H., "The Development of Statistical Indices for Detecting Cheaters," Journal of the American Statistical Association, LXIX (1974), 44-49.

Baker, George, "Incentive Contracts and Performance Measurement," Journal of Political Economy, C (1992), 598-614.

Bellezza, Francis S., and Suzanne F. Bellezza, "Detection of Cheating on Multiple-Choice Tests by Using Error Similarity Analysis," Teaching of Psychology, XVI (1989), 151-155.

Carnoy, Martin, and Susanna Loeb, "Does External Accountability Affect Student Outcomes? A Cross-State Analysis," Working Paper, Stanford University School of Education, 2002.

Contreras, Russell, "Inquiry Uncovers Possible Cheating on GRE in Asia," Associated Press, August 7, 2002.

Cullen, Julie Berry, and Randall Reback, "Tinkering Toward Accolades: School Gaming under a Performance Accountability System," Working Paper, University of Michigan, 2002.

Deere, Donald, and Wayne Strayer, "Putting Schools to the Test: School Accountability, Incentives and Behavior," Working Paper, Texas A&M University, 2002.

Di Tella, Rafael, and Ernesto Schargrodsky, "The Role of Wages and Auditing during a Crackdown on Corruption in the City of Buenos Aires," unpublished manuscript, Harvard Business School, 2001.

Duggan, Mark, and Steven Levitt, "Winning Isn't Everything: Corruption in Sumo Wrestling," American Economic Review, XCII (2002), 1594-1605.

Figlio, David N., "Testing, Crime and Punishment," Working Paper, University of Florida, 2002.

Figlio, David N., and Joshua Winicki, "Food for Thought: The Effects of School Accountability Plans on School Nutrition," Working Paper, University of Florida, 2001.

Figlio, David N., and Lawrence S. Getzler, "Accountability, Ability and Disability: Gaming the System," Working Paper, University of Florida, 2002.

Fisman, Ray, "Estimating the Value of Political Connections," American Economic Review, XCI (2001), 1095-1102.

Frary, Robert B., "Statistical Detection of Multiple-Choice Answer Copying: Review and Commentary," Applied Measurement in Education, VI (1993), 153-165.

Frary, Robert B., T. Nicholas Tideman, and Thomas Watts, "Indices of Cheating on Multiple-Choice Tests," Journal of Educational Statistics, II (1977), 235-256.

Glaeser, Edward, and Andrei Shleifer, "A Reason for Quantity Regulation," American Economic Review, XCI (2001), 431-435.

Grissmer, David W., Ann Flanagan, et al., "Improving Student Achievement: What NAEP Test Scores Tell Us," RAND Corporation, MR-924-EDU, 2000.

Hanushek, Eric A., and Margaret E. Raymond, "Improving Educational Quality: How Best to Evaluate Our Schools?" Conference paper, Federal Reserve Bank of Boston, 2002.

Hofkins, Diane, "Cheating 'Rife' in National Tests," New York Times Educational Supplement, June 16, 1995.

Holland, Paul W., "Assessing Unusual Agreement between the Incorrect Answers of Two Examinees Using the K-Index: Statistical Theory and Empirical Support," Technical Report No. 96-5, Educational Testing Service, 1996.

Holmstrom, Bengt, and Paul Milgrom, "Multitask Principal-Agent Analyses: Incentive Contracts, Asset Ownership, and Job Design," Journal of Law, Economics, and Organization, VII (1991), 24-51.

Jacob, Brian A., "Accountability, Incentives and Behavior: The Impact of High-Stakes Testing in the Chicago Public Schools," Working Paper No. 8968, National Bureau of Economic Research, 2002.

Jacob, Brian A., and Lars Lefgren, "The Impact of Teacher Training on Student Achievement: Quasi-Experimental Evidence from School Reform Efforts in Chicago," Journal of Human Resources, forthcoming.

Jacob, Brian A., and Steven Levitt, "Catching Cheating Teachers: The Results of an Unusual Experiment in Implementing Theory," Working Paper No. 9413, National Bureau of Economic Research, 2003.

Klein, Stephen P., Laura S. Hamilton, et al., "What Do Test Scores in Texas Tell Us?" RAND Corporation, 2000.

Kolker, Claudia, "Texas Offers Hard Lessons on School Accountability," Los Angeles Times, April 14, 1999.

Lindsay, Drew, "Whodunit? Officials Find Thousands of Erasures on Standardized Tests and Suspect Tampering," Education Week (1996), 25-29.

Loughran, Regina, and Thomas Comiskey, "Cheating the Children: Educator Misconduct on Standardized Tests," Report of the City of New York Special Commissioner of Investigation for the New York City School District, 1999.

Marcus, John, "Faking the Grade," Boston Magazine, February 2000.

May, Meredith, "State Fears Cheating by Teachers," San Francisco Chronicle, October 4, 2000.

Perlman, Carol L., "Results of a Citywide Testing Program Audit in Chicago," Paper for American Educational Research Association annual meeting, Chicago, IL, 1985.

Porter, Robert H., and J. Douglas Zona, "Detection of Bid Rigging in Procurement Auctions," Journal of Political Economy, CI (1993), 518-538.

Richards, Craig E., and Tian Ming Sheu, "The South Carolina School Incentive Reward Program: A Policy Analysis," Economics of Education Review, XI (1992), 71-86.

Rohlfs, Chris, "Estimating Classroom Learning Using the School Breakfast Program as an Instrument for Attendance and Class Size: Evidence from Chicago Public Schools," unpublished manuscript, University of Chicago, 2002.

Tysome, Tony, "Cheating Purge: Inspectors Out," New York Times Higher Education Supplement, August 19, 1994.

Wollack, James A., "A Nominal Response Model Approach for Detecting Answer Copying," Applied Psychological Measurement, XX (1997), 144-152.



Recent federal legislation institutionalizes this practice, requiring states to test elementary students each year, rate schools on the basis of student performance, and intervene in schools that do not make sufficient improvement.1 Several prior studies suggest that such accountability policies may be effective at raising student achievement [Richards and Sheu 1992; Grissmer, Flanagan, et al. 2000; Deere and Strayer 2001; Jacob 2002; Carnoy and Loeb 2002; Hanushek and Raymond 2002]. At the same time, however, researchers have documented instances of manipulation, including shifts away from nontested areas or "teaching to the test" [Klein et al. 2002; Jacob 2002] and increasing placement in special education [Jacob 2002; Figlio and Getzler 2002; Cullen and Reback 2002].

In this paper we explore a very different mechanism for inflating test scores: outright cheating on the part of teachers and administrators.2 As incentives for high test scores increase, unscrupulous teachers may be more likely to engage in a range of illicit activities, including changing student responses on answer sheets, providing correct answers to students, or obtaining copies of an exam illegitimately prior to the test date and teaching students using knowledge of the precise exam questions.3 While such allegations may seem far-fetched, documented cases of cheating have recently been uncovered in California [May 2000], Massachusetts [Marcus 2000], New York [Loughran and Comiskey 1999], Texas [Kolker 1999], and Great Britain [Hofkins 1995; Tysome 1994].

There has been very little previous empirical analysis of teacher cheating.4 The few studies that do exist involve investigations of specific instances of cheating and generally rely on the analysis of erasure patterns and the controlled retesting of students.5

1. The federal legislation, No Child Left Behind, was passed in 2001. Prior to this legislation, virtually every state had linked test-score outcomes to school funding or required students to pass an exit examination to graduate high school. In the state of California, a policy providing for merit pay bonuses of as much as $25,000 per teacher in schools with large test score gains was recently put into place.

2. Hereinafter, we use the phrase "teacher cheating" to encompass cheating done by either teachers or administrators.

3. We have no way of knowing whether the patterns we observe arise because a teacher explicitly alters students' answer sheets, directly provides answers to students during a test, or perhaps makes test materials available to students in advance of the exam (for instance, by teaching a reading passage that is on the test). If we had access to the actual exams, it might be possible to distinguish between these scenarios through an analysis of erasure patterns.

4. In contrast, there is a well-developed statistics literature for identifying whether one student has copied answers from another student [Wollack 1997; Holland 1996; Frary 1993; Bellezza and Bellezza 1989; Frary, Tideman, and Watts 1977; Angoff 1974]. These methods involve the identification of unusual patterns of agreement in student responses and, for the most part, are only effective in identifying the most egregious cases of copying.


While this earlier research provides convincing evidence of isolated cheating incidents, our paper represents the first systematic attempt to (1) identify the overall prevalence of teacher cheating empirically and (2) analyze the factors that predict cheating. To address these questions, we use detailed administrative data from the Chicago public schools (CPS) that include the question-by-question answers given by every student in grades 3 to 8 who took the Iowa Test of Basic Skills (ITBS) from 1993 to 2000.6 In addition to the test responses, we also have access to each student's full academic record, including past test scores, the school and room to which a student was assigned, and extensive demographic and socioeconomic characteristics.

Our approach to detecting classroom cheating combines two types of indicators: unexpected test score fluctuations and unusual patterns of answers for students within a classroom. Teacher cheating increases the likelihood that students in a classroom will experience large, unexpected increases in test scores one year, followed by very small test score gains (or even declines) the following year. Teacher cheating, especially if done in an unsophisticated manner, is also likely to leave tell-tale signs in the form of blocks of identical answers, unusual patterns of correlations across student answers within the classroom, or unusual response patterns within a student's exam (e.g., a student who answers a number of very difficult questions correctly while missing many simple questions).


5. In the mid-eighties, Perlman [1985] investigated suspected cheating in a number of Chicago public schools (CPS). The study included 23 suspect schools, identified on the basis of a high percentage of erasures, unusual patterns of score increases, unnecessarily large orders of blank answer sheets for the ITBS, and tips to the CPS Office of Research, along with 17 comparison schools. When a second form of the test was administered to the 40 schools under more controlled conditions, the suspect schools did much worse than the comparison schools. An analysis of several dozen Los Angeles schools where the percentage of erasures and changed answers was unusually high revealed evidence of teacher cheating [Aiken 1991]. One of the most highly publicized cheating scandals involved Stratfield elementary, an award-winning school in Connecticut. In 1996, the firm that developed and scored the exam found that the rate of erasures at Stratfield was up to five times greater than at other schools in the same district, and that 89 percent of erasures at Stratfield were from an incorrect to a correct response. Subsequent retesting resulted in significantly lower scores [Lindsay 1996].

6. We do not, however, have access to the actual test forms that students filled out, so we are unable to analyze these tests for evidence of suspicious patterns of erasures.


Our identification strategy exploits the fact that these two types of indicators are very weakly correlated in classrooms unlikely to have cheated, but very highly correlated in situations where cheating likely occurred. That allows us to credibly estimate the prevalence of cheating without having to invoke arbitrary cutoffs as to what constitutes cheating.

Empirically, we detect cheating in approximately 4 to 5 percent of the classes in our sample. This estimate is likely to understate the true incidence of cheating for two reasons. First, we focus only on the most egregious type of cheating, where teachers systematically alter student test forms. There are other, more subtle ways in which teachers can cheat, such as providing extra time to students, that our algorithm is unlikely to detect. Second, even when test forms are altered, our approach is only partially successful in detecting illicit behavior. As discussed later, when we ourselves simulate cheating by altering student answer strings and then testing for cheating in the artificially manipulated classrooms, many instances of moderate cheating go undetected by our methods.

A number of patterns in the results reinforce our confidence that what we measure is indeed cheating. First, simulation results demonstrate that there is nothing mechanical about our identification approach that automatically generates patterns like those observed in the data. When we randomly assign students to classrooms and search for cheating in these simulated classes, our methods find little evidence of cheating. Second, cheating on one part of the test (e.g., math) is a strong predictor of cheating on other sections of the test (e.g., reading). Third, cheating is also correlated within classrooms over time and across classrooms in a particular school. Finally, and perhaps most convincingly, with the cooperation of the Chicago public schools, we were able to conduct a prospective test of our methods in which we retested a subset of classrooms under controlled conditions that precluded teacher cheating. Classrooms identified as likely cheaters experienced large declines in test scores on the retest, whereas classrooms not suspected of cheating maintained their test score gains.

The prevalence of cheating is also shown to respond to relatively minor changes in teacher incentives. The importance of standardized tests in the Chicago public schools increased


substantially with a change in leadership in 1996, particularly for low-achieving students. Following the introduction of these policies, the prevalence of cheating rose sharply in low-achieving classrooms, whereas classes with average or higher-achieving students showed no increase in cheating. Cheating prevalence also appears to be systematically lower in cases where the costs of cheating are higher (e.g., in mixed-grade classrooms in which two different exams are administered simultaneously) or the benefits of cheating are lower (e.g., in classrooms with more special education or bilingual students who take the standardized tests but whose scores are excluded from official calculations).

The remainder of the paper is structured as follows. Section II discusses the set of indicators we use to capture cheating behavior. Section III describes our identification strategy. Section IV provides a brief overview of the institutional details of the Chicago public schools and the data set that we use. Section V reports the basic empirical results on the prevalence of cheating and presents a wide range of evidence supporting the interpretation of these results as cheating. Section VI analyzes how teacher cheating responds to incentives. Section VII discusses the results and the implications for increasing reliance on high-stakes testing.

II. INDICATORS OF TEACHER CHEATING

Teacher cheating, especially in extreme cases, is likely to leave tell-tale signs. In motivating our discussion of the indicators we employ for detecting cheating, it is instructive to compare two actual classrooms taking the same test (see Figure I). Each row in Figure I represents one student's answers to each item on the test. Columns correspond to the different questions asked. The letter "A," "B," "C," or "D" means a student provided the correct answer. If a number is entered, the student answered the question incorrectly, with "1" corresponding to a wrong answer of "A," "2" corresponding to a wrong answer of "B," etc. On the right-hand side of the table we also present student test scores for the preceding, current, and following year. Test scores are in units of "grade equivalents." The grading scale is normed so that a student at the national average for sixth graders taking the test in the eighth month of the school year would score 6.8.


FIGURE I
Sample Answer Strings and Test Scores from Two Classrooms

The data in the figure represent actual answer strings and test scores from two CPS classrooms taking the same exam. The top classroom is suspected of cheating; the bottom classroom is not. Each row corresponds to an individual student. Each column represents a particular question on the exam. A letter indicates that the student gave that answer and the answer was correct. A number means that the student gave the corresponding letter answer (e.g., 1 = "A"), but the answer was incorrect. A value of "0" means the question was left blank. Student test scores, in grade equivalents, are shown in the last three columns of the figure. The test year for which the answer strings are presented is denoted year t. The scores from years t - 1 and t + 1 correspond to the preceding and following years' examinations.


A typical student would be expected to gain one grade equivalent for each year of school.

The top panel of data shows a class in which we suspect teacher cheating took place; the bottom panel corresponds to a typical classroom. Two striking differences between the classrooms are readily apparent. First, in the cheating classroom, a large block of students provided identical answers on consecutive questions in the middle of the test, as indicated by the boxed area in the figure. For the other classroom, no such pattern exists. Second, looking at the pattern of test scores, students in the cheating classroom experienced large increases in test scores from the previous to the current year (1.7 grade equivalents on average), and actually experienced declines on average the following year. In contrast, the students in the typical classroom gained roughly one grade equivalent each year, as would be expected.

The indicators we use as evidence of cheating formalize and extend the basic picture that emerges from Figure I. We divide these indicators into two distinct groups that respectively capture unusual test score fluctuations and suspicious patterns of answer strings. In this section we describe informally the measures that we use. A more rigorous treatment of their construction is provided in the Appendix.

II.A. Indicator One: Unexpected Test Score Fluctuations

Given that the aim of cheating is to raise test scores, one signal of teacher cheating is an unusually large gain in test scores relative to how those same students tested in the previous year. Since test score gains that result from cheating do not represent real gains in knowledge, there is no reason to expect the gains to be sustained on future exams taken by these students (unless, of course, next year's teachers also cheat on behalf of the students). Thus, large gains due to cheating should be followed by unusually small test score gains for these students in the following year. In contrast, if large test score gains are due to a talented teacher, the student gains are likely to have a greater permanent component, even if some regression to the mean occurs. We construct a summary measure of how unusual the test score fluctuations are by ranking each classroom's average test score gains relative to all other classrooms in that same subject, grade, and year,7 and computing the following statistic:


(3)    SCORE_{cbt} = (rank\_gain_{cbt})^2 + (1 - rank\_gain_{cb,t+1})^2

where rank_gain_{cbt} is the percentile rank for class c in subject b in year t. Classes with relatively big gains on this year's test and relatively small gains on next year's test will have high values of SCORE. Squaring the individual terms gives relatively more weight to big test score gains this year and big test score declines the following year.8 In the empirical analysis we consider three possible cutoffs for what it means to have a "high" value on SCORE, corresponding to the eightieth, ninetieth, and ninety-fifth percentiles among all classrooms in the sample.
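A hedged sketch of this statistic follows; the data frame layout and column names are our own assumptions rather than the authors' actual variables.

```python
# Sketch of the SCORE statistic in equation (3). `df` has one row per
# classroom-subject-year, with this year's and next year's average gains.
import pandas as pd

def score_statistic(df: pd.DataFrame) -> pd.Series:
    cells = df.groupby(["subject", "grade", "year"])
    r_now = cells["gain_this_year"].rank(pct=True)    # rank_gain_cbt
    r_next = cells["gain_next_year"].rank(pct=True)   # rank_gain_cb,t+1
    return r_now ** 2 + (1.0 - r_next) ** 2           # big now, small next
```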

II.B. Indicator Two: Suspicious Answer Strings

The quickest and easiest way for a teacher to cheat is to alter the same block of consecutive questions for a substantial portion of students in the class, as was apparently done in the classroom in the top panel of Figure I. More sophisticated interventions might involve skipping some questions so as to avoid a large block of identical answers, or altering different blocks of questions for different students.

We combine four different measures of how suspicious a classroom's answer strings are in determining whether a classroom may be cheating. The first measure focuses on the most unlikely block of identical answers given by students on consecutive questions. Using past test scores, future test scores, and background characteristics, we predict the likelihood that each student will give each possible answer (A, B, C, or D) on every question using a multinomial logit. Each student's predicted probability of choosing a particular response is identified by the likelihood that other students (in the same year, grade, and subject) with similar background characteristics will choose that response.

7. We also experimented with more complicated mechanisms for defining large or small test score gains (e.g., predicting each student's expected test score gain as a function of past test scores and background characteristics and computing a deviation measure for each student, which was then aggregated to the classroom level), but because the results were similar, we elected to use the simpler method. We have also defined gains and losses using an absolute metric (e.g., where gains in excess of 1.5 or 2 grade equivalents are considered unusually large), and obtain comparable results.

8. In the following year the students who were in a particular classroom are typically scattered across multiple classrooms. We base all calculations off of the composition of this year's class.


We then search over all combinations of students and consecutive questions to find the block of identical answers given by students in a classroom least likely to have arisen by chance.9

The more unusual is the most unusual block of test responses (adjusting for class size and the number of questions on the exam, both of which increase the possible combinations over which we search), the more likely it is that cheating occurred. Thus, if ten very bright students in a class of thirty give the correct answers to the first five questions on the exam (typically the easier questions), the block of identical answers will not appear unusual. In contrast, if all fifteen students in a low-achieving classroom give the same correct answers to the last five questions on the exam (typically the harder questions), this would appear quite suspect.

The second measure of suspicious answer strings involves the overall degree of correlation in student answers across the test. When a teacher changes answers on test forms, it presumably increases the uniformity of student test forms across students in the class. This measure is meant to capture more general patterns of similarity in student responses, beyond just identical blocks of answers. Based on the results of the multinomial logit described above, for each question and each student we create a measure of how unexpected the student's response was. We then combine the information for each student in the classroom to create something akin to the within-classroom correlation in student responses. This measure will be high if students in a classroom tend to give the same answers on many questions, especially if the answers given are unexpected (i.e., correct answers on hard questions or systematic mistakes on easy questions).

Of course, within-classroom correlation may arise for many reasons other than cheating (e.g., the teacher may emphasize certain topics during the school year). Therefore, a third indicator of potential cheating is a high variance in the degree of correlation across questions. If the teacher changes answers for multiple students on selected questions, the within-class correlation on those particular questions will be extremely high, while the degree of within-class correlation on other questions is likely to be typical.

9. Note that we do not require the answers to be correct. Indeed, in many classrooms the most unusual strings include some incorrect answers. Note also that these calculations are done under the assumption that a given student's answers are uncorrelated (conditional on observables) across questions on the exam, and that answers are uncorrelated across students. Of course, this assumption is unlikely to be true. Since all of our comparisons rely on the relative unusualness of the answers given in different classrooms, this simplifying assumption is not problematic unless the correlation within and across students varies by classroom.


This leads the cross-question variance in correlations to be larger than normal in cheating classrooms.

Our final indicator compares the answers that students in one classroom give with the answers of other students in the system who take the identical test and get the exact same score. This measure relies on the fact that questions vary significantly in difficulty. The typical student will answer most of the easy questions correctly, but get many of the hard questions wrong (where "easy" and "hard" are based on how well students of similar ability do on the question). If students in a class systematically miss the easy questions while correctly answering the hard questions, this may be an indication of cheating.

Our overall measure of suspicious answer strings is constructed in a manner parallel to our measure of unusual test score fluctuations. Within a given subject, grade, and year, we rank classrooms on each of these four indicators, and then take the sum of squared ranks across the four measures:10

(4)    ANSWERS_{cbt} = (rank\_m1_{cbt})^2 + (rank\_m2_{cbt})^2 + (rank\_m3_{cbt})^2 + (rank\_m4_{cbt})^2

In the empirical work we again use three possible cutoffs for potential cheating: the eightieth, ninetieth, and ninety-fifth percentiles.
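A parallel sketch of equation (4) is given below. It assumes each measure column is oriented so that larger values are more suspicious; all names are our own placeholders.

```python
# Sketch of the ANSWERS statistic: percentile-rank the four answer-string
# measures within each subject-grade-year cell and sum the squared ranks.
import pandas as pd

def answers_statistic(df: pd.DataFrame) -> pd.Series:
    cells = df.groupby(["subject", "grade", "year"])
    total = None
    for col in ["m1", "m2", "m3", "m4"]:
        sq_rank = cells[col].rank(pct=True) ** 2
        total = sq_rank if total is None else total + sq_rank
    return total
```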

III. A STRATEGY FOR IDENTIFYING THE PREVALENCE OF CHEATING

The previous section described indicators that are likely to be correlated with cheating. Because sometimes such patterns arise by chance, however, not every classroom with large test score fluctuations and suspicious answer strings is cheating. Furthermore, the particular choice of what qualifies as a "large" fluctuation or a "suspicious" set of answer strings will necessarily be arbitrary. In this section we present an identification strategy that, under a set of defensible assumptions, nonetheless provides estimates of the prevalence of cheating.

To identify the number of cheating classrooms in a given year, we would like to compare the observed joint distribution of test score fluctuations and suspicious answer strings with a counterfactual distribution in which no cheating occurs.

10. Because different subjects and grades have differing numbers of questions, it is difficult to make meaningful comparisons across tests on the raw indicators.


Differences between the two distributions would provide an estimate of how much cheating occurred. If teachers in a particular school or year are cheating, there will be more classrooms exhibiting both unusual test score fluctuations and suspicious answer strings than otherwise expected. In practice, we do not have the luxury of observing such a counterfactual. Instead, we must make assumptions about what the patterns would look like absent cheating.

Our identification strategy hinges on three key assumptions: (1) cheating increases the likelihood a class will have both large test score fluctuations and suspicious answer strings; (2) if cheating classrooms had not cheated, their distribution of test score fluctuations and answer string patterns would be identical to noncheating classrooms; and (3) in noncheating classrooms, the correlation between test score fluctuations and suspicious answers is constant throughout the distribution.11

If assumption (1) holds, then cheating classrooms will be concentrated in the upper tail of the joint distribution of unusual test score fluctuations and suspicious answer strings. Other parts of the distribution (e.g., classrooms ranked in the fiftieth to seventy-fifth percentile of suspicious answer strings) will consequently include few cheaters. As long as cheating classrooms would look similar to noncheating classrooms on our measures if they had not cheated (assumption (2)), classes in the part of the distribution with few cheaters provide the noncheating counterfactual that we are seeking.

The difficulty, however, is that we only observe this noncheating counterfactual in the bottom and middle parts of the distribution, when what we really care about is the upper tail of the distribution, where the cheaters are concentrated. Assumption (3), which requires that in noncheating classrooms the correlation between test score fluctuations and suspicious answers is constant throughout the distribution, provides a potential solution. If this assumption holds, we can use the part of the distribution that is relatively free of cheaters to project what the right tail of the distribution would look like absent cheating. The gap between the predicted and observed frequency of classrooms that are extreme on both the test score fluctuation and suspicious answer string measures provides our estimate of cheating.

11. We formally derive the mathematical model described in this section in Jacob and Levitt [2003].



Figure II demonstrates how our identification strategy works empirically. The horizontal axis in the figure ranks classrooms according to how suspicious their answer strings are. The vertical axis is the fraction of the classrooms that are above the ninety-fifth percentile on the unusual test score fluctuations measure.12

The graph combines all classrooms and all subjects in our data.13

Over most of the range (roughly from zero to the seventy-fifth percentile on the horizontal axis), there is a slight positive correlation between unusual test scores and suspicious answer strings.

12. The choice of the ninety-fifth percentile is somewhat arbitrary. In the empirical work that follows, we also consider the eightieth and ninetieth percentiles. The choice of the cutoff does not affect the basic patterns observed in the data.

13. To construct the figure, classes were rank ordered according to their answer strings and divided into 200 equally sized segments. The circles in the figure represent these 200 local means. The line displayed in the graph is the fitted value of a regression with a seventh-order polynomial in a classroom's rank on the suspicious strings measure.

FIGURE II
The Relationship between Unusual Test Scores and Suspicious Answer Strings

The horizontal axis reflects a classroom's percentile rank in the distribution of suspicious answer strings within a given grade, subject, and year, with zero representing the least suspicious classroom and one representing the most suspicious classroom. The vertical axis is the probability that a classroom will be above the ninety-fifth percentile on our measure of unusual test score fluctuations. The circles in the figure represent averages from 200 equally spaced cells along the x-axis. The predicted line is based on a probit model estimated with seventh-order polynomials in the suspicious string measure.


This is the part of the distribution that is unlikely to contain many cheating classrooms. Under the assumption that the correlation between our two measures is constant in noncheating classrooms over the whole distribution, we would predict that the straight line observed for all but the right-hand portion of the graph would continue all the way to the right edge of the graph.

Instead, as one approaches the extreme right tail of the distribution of suspicious answer strings, the probability of large test score fluctuations rises dramatically. That sharp rise, we argue, is a consequence of cheating. To estimate the prevalence of cheating, we simply compare the actual area under the curve in the far right tail of Figure II to the predicted area under the curve in the right tail, projecting the linear relationship observed from the fiftieth to seventy-fifth percentile out to the ninety-ninth percentile. Because this identification strategy is necessarily indirect, we devote a great deal of attention to presenting a wide variety of tests of the validity of our approach, the sensitivity of the results to alternative assumptions, and the plausibility of our findings.
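The projection step can be sketched as follows. This is our reading of the procedure described in the text (fit the mid-range relationship, extrapolate into the tail, and take observed minus predicted mass), not the authors' exact implementation; all names are assumptions.

```python
# Hedged sketch of the prevalence calculation from the joint distribution.
import numpy as np

def estimated_prevalence(string_rank: np.ndarray,
                         high_fluct: np.ndarray,
                         tail: float = 0.95) -> float:
    """string_rank: percentile rank (0-1) on the ANSWERS measure, per class;
    high_fluct: 1 if the class is extreme on the SCORE measure."""
    # fit the linear relationship where few cheaters are expected
    mid = (string_rank >= 0.50) & (string_rank <= 0.75)
    slope, intercept = np.polyfit(string_rank[mid], high_fluct[mid], 1)
    # project into the suspicious tail and compare with what is observed
    in_tail = string_rank > tail
    predicted = (slope * string_rank[in_tail] + intercept).sum()
    observed = high_fluct[in_tail].sum()
    return (observed - predicted) / len(string_rank)  # share of classes
```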

IV. DATA AND INSTITUTIONAL BACKGROUND

Elementary students in Chicago public schools take a standardized, multiple-choice achievement exam known as the Iowa Test of Basic Skills (ITBS). The ITBS is a national, norm-referenced exam with a reading comprehension section and three separate math sections.14 Third through eighth grade students in Chicago are required to take the exams each year. Most schools administer the exams to first and second grade students as well, although this is not a district mandate.

Our base sample includes all students in third to seventhgrade for the years 1993ndash200015 For each student we have thequestion-by-question answer string on each yearrsquos tests schooland classroom identiers the full history of prior and future testscores and demographic variables including age sex race andfree lunch eligibility We also have information about a wide

14 There are also other parts of the test which are either not included inofcial school reporting (spelling punctuation grammar) or are given only inselect grades (science and social studies) for which we do not have information

15 We also have test scores for eighth graders but we exclude them becauseour algorithm requires test score data for the following year and the ITBS test isnot administered to ninth graders

855ROTTEN APPLES

range of school-level characteristics We do not however haveindividual teacher identiers so we are unable to directly linkteachers to classrooms or to track a particular teacher over time

Because our cheating proxies rely on comparisons to past andfuture test scores we drop observations that are missing readingor math scores in either the preceding year or the following year(in addition to those with missing test scores in the baselineyear)16 Less than one-half of 1 percent of students with missingdemographic data are also excluded from the analysis Finallybecause our algorithms for identifying cheating rely on identify-ing suspicious patterns within a classroom our methods havelittle power in classrooms with small numbers of students Con-sequently we drop all classrooms for which we have fewer thanten valid students in a particular grade after our other exclusions(roughly 3 percent of students) A handful of classrooms recordedas having more than 40 studentsmdashpresumably multiple class-rooms not separately identied in the datamdashare also droppedOur nal data set contains roughly 20000 students per grade peryear distributed across approximately 1000 classrooms for atotal of over 40000 classroom-years of data (with four subjecttests per classroom-year) and over 700000 student-yearobservations

Summary statistics for the full sample of classrooms areshown in Table I where the unit of observation is at the level ofclasssubjectyear As in many urban school districts students inChicago are disproportionately poor and minority Roughly 60percent of students are Black and 26 percent are Hispanic withnearly 85 percent receiving free or reduced-price lunch Less than29 percent of students scored at or above national norms inreading

The ITBS exams are administered over a weeklong period inearly May Third grade teachers are permitted to administer theexam to their own students while other teachers switch classes toadminister the exams The exams are generally delivered to theschools one to two weeks before testing and are supposed to be

16 The exact number of students with missing test data varies by year andgrade Overall we drop roughly 12 percent of students because of missing testscore information in the baseline year These are students who (i) were not testedbecause of placement in bilingual or special education programs or (ii) simplymissed the exam that particular year In addition to these students we also dropapproximately 13 percent of students who were missing test scores in either thepreceding or following years Test data may be missing either because a studentdid not attend school on the days of the test or because the student transferredinto the CPS system in the current year or left the system prior to the next yearof testing

856 QUARTERLY JOURNAL OF ECONOMICS

kept in a secure location by the principal or the schoolrsquos testcoordinator an individual in the school designated to coordinatethe testing process (often a counselor or administrator) Each

TABLE ISUMMARY STATISTICS

Mean(sd)

Classroom characteristicsMixed grade classroom 0073Teacher administers exams to her own students (3rd grade) 0206Percent of students who were tested and included in ofcial

reporting 0883Average prior achievement 20004(as deviation from yeargradesubject mean) (0661) Black 0595 Hispanic 0263 Male 0495 Old for grade 0086 Living in foster care 0044 Living with nonparental relative 0104Cheatermdash95th percentile cutoff 0013School characteristicsAverage quality of teachersrsquo undergraduate institution

in the school22550(0877)

Percent of teachers who live in Chicago 0712Percent of teachers who have an MA or a PhD 0475Percent of teachers who majored in education 0712Percent of teachers under 30 years of age 0114Percent of teachers at the school less than 3 years 0547 students at national norms in reading last year 288 students receiving free lunch in school 847Predominantly Black school 0522Predominantly Hispanic school 0205Mobility rate in school 286Attendance rate in school 926School size 722

(317)Accountability policySocial promotion policy 0215School probation policy 0127Test form offered for the rst time 0371Number of observations 163474

The data summarized above include one observation per classroomyearsubjectThe sample includes allstudents in third to seventh grade for the years 1993ndash2000 We drop observations for the following reasons(a) missing reading or math scores in either the baseline year the preceding year or the following year (b)missing demographic data (c) classrooms for which we have fewer than ten valid students in a particulargrade or more than 40 students See the discussion in the text for a more detailed explanation of the sampleincluding the percentages excluded for various reasons

857ROTTEN APPLES

section of the exam consists of 30 to 60 multiple choice questionswhich students are given between 30 and 75 minutes to com-plete17 Students mark their responses on answer sheets whichare scanned to determine a studentrsquos score There is no penaltyfor guessing A studentrsquos raw score is simply calculated as thesum of correct responses on the exam The raw score is thentranslated into a grade equivalent

After collecting student exams teachers or administratorsthen ldquocleanrdquo the answer keys erasing stray pencil marks remov-ing dirt or debris from the form and darkening item responsesthat were only faintly marked by the student At the end of thetesting week the test coordinators at each school deliver thecompleted answer keys and exams to the CPS central ofceSchool personnel are not supposed to keep copies of the actualexams although school ofcials acknowledge that a number ofteachers each year do so The CPS has administered three differ-ent versions of the ITBS between 1993 and 2000 The CPS alter-nates forms each year with new forms being offered for the rsttime in 1993 1994 and 199718

V THE PREVALENCE OF TEACHER CHEATING

As noted earlier our estimates of the prevalence of cheatingare derived from a comparison of the actual number of classroomsthat are above some threshold on both of our cheating indicatorsrelative to the number we would expect to observe based on thecorrelation between these two measures in the 50thndash75th percen-tiles of the suspicious string measure The top panel of Table IIpresents our estimates of the percentage of classrooms that arecheating on average on a given subject test (ie reading compre-hension or one of the three math tests) in a given year19 Because

17 The mathematics and reading tests measure basic skills The readingcomprehension exam consists of three to eight short passages followed by up tonine questions relating to the passage The math exam consists of three sectionsthat assess number concepts problem-solving and computation

18 These three forms are used for retesting summer school testing andmidyear testing as well so that it is likely that over the years teachers have seenthe same exam on a number of occasions

19 It is important to note that we exclude a particular form of cheating thatappears to be quite prevalent in the data teachers randomly lling in answers leftblank by students at the end of the exam In many classrooms almost everystudent will end the test with a long string of the identical answers (typically allthe same response like ldquoBrdquo or ldquoDrdquo) The fact that almost all students in the classcoordinate on the same pattern strongly suggests that the students themselvesdid not ll in the blanks or were under explicit instructions by the teacher to do

858 QUARTERLY JOURNAL OF ECONOMICS

the decision as to what cutoff signies a ldquohighrdquo value on ourcheating indicators is arbitrary we present a 3 3 3 matrix ofestimates using three different thresholds (eightieth ninetiethand ninety-fth percentiles) for each of our cheating measuresThe estimated prevalence of cheaters ranges from 11 percent to21 percent depending on the particular set of cutoffs used Aswould be expected the number of cheaters is generally decliningas higher thresholds are employed Nonetheless it is encouragingthat over such a wide set of cutoffs the range of estimates isrelatively tight

The bottom panel of Table II presents estimates of the per-centage of classrooms that are cheating on any of the four subjecttests in a particular year If every classroom that cheated did soonly on one subject test then the results in the bottom panel

so Since there is no penalty for guessing on the test lling in the blanks can onlyincrease student test scores While this type of teacher behavior is likely to beviewed by many as unethical we do not make it the focus of our analysis because(1) it is difcult to provide denitive evidence of such behavior (a teacher couldargue that he or she instructed students well in advance of the test to ll in allblanks with the letter ldquoCrdquo as part of good test-taking strategy) and (2) in ourminds it is categorically different than a teacher who systematically changesstudent responses to the correct answer

TABLE IIESTIMATED PREVALENCE OF TEACHER CHEATING

Cutoff for suspicious answerstrings (ANSWERS)

Cutoff for test score uctuations (SCORE)

80th percentile 90th percentile 95th percentile

Percent cheating on a particular test80th percentile 21 21 1890th percentile 18 18 1595th percentile 13 13 11

Percent cheating on at least one of the four tests given80th percentile 45 56 5390th percentile 42 49 4495th percentile 35 38 34

The top panel of the table presents estimates of the percentage of classrooms cheating on a particularsubject test in a given year based on three alternative cutoffs for ANSWERS and SCORE In all cases theprevalence of cheating is based on the excess number of classrooms with unexpected test score uctuationamong classes with suspicious answer strings relative to classes that do not have suspicious answer stringsThe bottom panel of the table presents estimates of the percentage of classrooms cheating on at least one ofthe four subject tests that comprise the overall test In the bottom panel classrooms that cheat on more thanone subject test are counted only once Our sample includes over 35000 3rdndash7th grade classrooms in theChicago public schools for the years 1993ndash1999

859ROTTEN APPLES

would simply be four times the results in the top panel In manyinstances however classrooms appear to cheat on multiple sub-jects Thus the prevalence rates range from 34ndash56 percent of allclassrooms20 Because of the necessarily indirect nature of ouridentication strategy we explore a range of supplementary testsdesigned to assess the validity of the estimates21

VA Simulation Results

Our prevalence estimates may be biased downward or up-ward depending on the extent to which our algorithm fails tosuccessfully identify all instances of cheating (ldquofalse negativesrdquo)versus the frequency with which we wrongly classify a teacher ascheating (ldquofalse positiverdquo)

One simple simulation exercise we undertook with respect topossible false positives involves randomly assigning students tohypothetical classrooms These synthetic classrooms thus consistof students who in actuality had no connection to one another Wethen analyze these hypothetical classrooms using the same algo-rithm applied to the actual data As one would hope no evidenceof cheating was found in the simulated classes Indeed the esti-mated prevalence of cheating was slightly negative in this simu-lation (ie classrooms with large test score increases in thecurrent year followed by big declines the next year were slightlyless likely to have unusual patterns of answer strings) Thusthere does not appear to be anything mechanical in our algorithmthat generates evidence of cheating

A second simulation involves articially altering a class-

20 Computation of the overall prevalence is somewhat complicated becauseit involves calculating not only how many classrooms are actually above thethresholds on multiple subject tests but also how frequently this would occur inthe absence of cheating Details on these calculations are available from theauthors

21 In an earlier version of this paper [Jacob and Levitt 2003] we present anumber of additional sensitivity analyses that conrm the basic results First wetest whether our results are sensitive to the way in which we predict the preva-lence of high values of ANSWERS and SCORE in the upper part of the otherdistribution (eg tting a linear or quadratic model to the data in the lowerportion of the distribution versus simply using the average value from the 50thndash75th percentiles) We nd our results are extremely robust Second we examinewhether our results might simply be due to mean reversion by testing whetherclassrooms with suspicious answer strings are less likely to maintain large testscore gains Among a sample of classrooms with large test score gains we nd thatmean reversion is substantially greater among the set of classes with highlysuspicious answer strings Third we investigate whether we might be detectingstudent rather than teacher cheating by examining whether students who havesuspicious answer strings in one year are more likely to have suspicious answerstrings in other years We nd that they do not

860 QUARTERLY JOURNAL OF ECONOMICS

roomrsquos answer strings in ways that we believe mimic teachercheating or outstanding teaching22 We are then able to see howfrequently we label these altered classes as ldquocheatingrdquo and howthis varies with the extent and nature of the changes we make toa classroomrsquos answers We simulate two different types of teachercheating The rst is a very naive version in which a teacherstarts cheating at the same question for a number of students andchanges consecutive questions to the right answers for thesestudents creating a block of identical and correct responses Thesecond type of cheating is much more sophisticated we randomlychange answers from incorrect to correct for selected students ina class The outcome of this simulation reveals that our methodsare surprisingly weak in detecting moderate cheating even in thenave case For instance when three consecutive answers arechanged to correct for 25 percent of a class we catch such behav-ior only 4 percent of the time Even when six questions arechanged for half of the class we catch the cheating in less than 60percent of the cases Only when the cheating is very extreme (egsix answers changed for the entire class) are we very likely tocatch the cheating The explanation for this poor performance isthat classrooms with low test-score gains even with the boostprovided by this cheating do not have large enough test scoreuctuations to make them appear unusual because there is somuch inherent volatility in test scores When the cheating takesthe more sophisticated form described above we are even lesssuccessful at catching low and moderate cheaters (2 percent and34 percent respectively in the rst two scenarios describedabove) but we almost always detect very extreme cheating

From a political or equity perspective however we may beeven more concerned with false positives specically the risk ofaccusing highly effective teachers of cheating Hence we alsosimulate the impact of a good teacher changing certain answersfor certain students from incorrect to correct responses Our ma-nipulation in this case differs from the cheating simulationsabove in two ways (1) we allow some of the gains to be preservedin the following year and (2) we alter questions such that stu-dents will not get random answers correct but rather will tend toshow the greatest improvement on the easiest questions that thestudents were getting wrong These simulation results suggest

22 See Jacob and Levitt [2003] for the precise details of this simulationexercise

861ROTTEN APPLES

that only rarely are good teachers mistakenly labeled as cheatersFor instance a teacher whose prowess allows him or her to raisetest scores by six questions for half the students in the class (morethan a 05 grade equivalent increase on average for the class) islabeled a cheater only 2 percent of the time In contrast a navecheater with the same test score gains is correctly identied as acheater in more than 50 percent of cases

In another test for false positives we explored whether wemight be mistaking emphasis on certain subject material forcheating For example if a math teacher spends several monthson fractions with a particular class one would expect the class todo particularly well on all of the math questions relating tofractions and perhaps worse than average on other math ques-tions One might imagine a similar scenario in reading if forexample a teacher creates a lengthy unit on the UndergroundRailroad which later happens to appear in a passage on thereading comprehension exam To examine this we analyze thenature of the across-question correlations in cheating versusnoncheating classrooms We nd that the most highly correlatedquestions in cheating classrooms are no more likely to measurethe same skill (eg fractions geometry) than in noncheatingclassrooms The implication of these results is that the prevalenceestimates presented above are likely to substantially understatethe true extent of cheatingmdashthere is little evidence of false posi-tivesmdashbut we frequently miss moderate cases of cheating

VB The Correlation of Cheating Indicators across SubjectsClassrooms and Years

If what we are detecting is truly cheating then one wouldexpect that a teacher who cheats on one part of the test would bemore likely to cheat on other parts of the test Also a teacher whocheats one year would be more likely to cheat the following yearFinally to the extent that cheating is either condoned by theprincipal or carried out by the test coordinator one would expectto nd multiple classes in a school cheating in any given year andperhaps even that cheating in a school one year predicts cheatingin future years If what we are detecting is not cheating then onewould not necessarily expect to nd strong correlation in ourcheating indicators across exams for a specic classroom acrossclassrooms or across years

862 QUARTERLY JOURNAL OF ECONOMICS

Table III reports regression results testing these predictionsThe dependent variable is an indicator for whether we believe aclassroom is likely to be cheating on a particular subject testusing our most stringent denition (above the ninety-fth per-centile on both cheating indicators) The baseline probability of

TABLE IIIPATTERNS OF CHEATING WITHIN CLASSROOMS AND SCHOOLS

Independent variables

Dependent variable 5 Class suspectedof cheating (mean of the dependent

variable 5 0011)

Full sample

Sample ofclasses andschool that

existed in theprior year

Classroom cheated on exactly oneother subject this year

0105(0008)

0103(0008)

0101(0009)

0101(0009)

Classroom cheated on exactly twoother subjects this year

0289(0027)

0285(0027)

0243(0031)

0243(0031)

Classroom cheated on all three othersubjects this year

0627(0051)

0622(0051)

0595(0054)

0595(0054)

Cheating rate among all other classesin the school this year on thissubject

0166(0030)

0134(0027)

0129(0027)

Cheating rate among all other classesin the school this year on othersubjects

0023(0024)

0059(0026)

0045(0029)

Cheating in this classroom in thissubject last year

0096(0012)

0091(0012)

Number of other subjects thisclassroom cheated on last year

0023(0004)

0018(0004)

Cheating in this classroom ever in thepast

0006(0002)

Cheating rate among other classroomsin this school in past years

0090(0040)

Fixed effects for gradesubjectyear Yes Yes Yes YesR2 0090 0093 0109 0109Number of observations 165578 165578 94182 94170

The dependent variable is an indicator for whether a classroom is above the 95th percentile on both oursuspicious strings and unusual test score measures of cheating on a particular subject test Estimation isdone using a linear probability model The unit of observation is classroomgradeyearsubjectFor columnsthat include measures of cheating in prior years observations where that classroom or school does not appearin the data in the prior year are excluded Standard errors are clustered at the school level to take intoaccount correlations across classroom as well as serial correlation

863ROTTEN APPLES

qualifying as a cheater for this cutoff is 11 percent To fullyappreciate the enormity of the effects implied by the table it isimportant to keep this very low baseline in mind We reportestimates from linear probability models (probits yield similarmarginal effects) with standard errors clustered at the schoollevel

Column 1 of Table III shows that cheating on other tests inthe same year is an extremely powerful predictor of cheating in adifferent subject If a classroom cheats on exactly one other sub-ject test the predicted probability of cheating on this test in-creases by over ten percentage points Since the baseline cheatingrates are only 11 percent classrooms cheating on exactly oneother test are ten times more likely to have cheated on thissubject than are classrooms that did not cheat on any of the othersubjects (which is the omitted category) Classrooms that cheaton two other subjects are almost 30 times more likely to cheat onthis test relative to those not cheating on other tests If a classcheats on all three other subjects it is 50 times more likely to alsocheat on this test

There also is evidence of correlation in cheating withinschools A ten-percentage-point increase in cheating classroomsin a school (excluding the classroom in question) on the samesubject test raises the likelihood this class cheats by roughly 016percentage points This potentially suggests some role for cen-tralized cheating by a school counselor test coordinator or theprincipal rather than by teachers operating independentlyThere is little evidence that cheating rates within the school onother subject tests affect cheating on this test

When making comparisons across years (columns 3 and 4) itis important to note that we do not actually have teacher identi-ers We do however know what classroom a student is assignedto Thus we can only compare the correlation between past andcurrent cheating in a given classroom To the extent that teacherturnover occurs or teachers switch classrooms this proxy will becontaminated Even given this important limitation cheating inthe classroom last year predicts cheating this year In column 3for example we see that classrooms that cheated in the samesubject last year are 96 percentage points more likely to cheatthis year even after we control for cheating on other subjects inthe same year and cheating in other classes in the school Column4 shows that prior cheating in the school strongly predicts thelikelihood that a classroom will cheat this year

864 QUARTERLY JOURNAL OF ECONOMICS

VC Evidence from Retesting under Controlled Circumstances

Perhaps the most compelling evidence for our methods comesfrom the results of a unique policy intervention In spring 2002the Chicago public schools provided us the opportunity to retestover 100 classrooms under controlled circumstances that pre-cluded cheating a few weeks after the initial ITBS exam wasadministered23 Based on suspicious answer strings and unusualtest score gains on the initial spring 2002 exam we identied aset of classrooms most likely to have cheated In addition wechose a second group of classrooms that had suspicious answerstrings but did not have large test score gains a pattern that isconsistent with a bad teacher who attempts to mask a poorteaching performance by cheating We also retested two sets ofclassrooms as control groups The rst control group consisted ofclasses with large test score gains but no evidence of cheating intheir answer strings a potential indication of effective teachingThese classes would be expected to maintain their gains whenretested subject to possible mean reversion The second controlgroup consisted of randomly selected classrooms

The results of the retests are reported in Table IV Columns1ndash3 correspond to reading columns 4ndash6 are for math For eachsubject we report the test score gain (in standard score units) forthree periods (i) from spring 2001 to the initial spring 2002 test(ii) from the initial spring 2002 test to the retest a few weekslater and (iii) from spring 2001 to the spring 2002 retest Forpurposes of comparison we also report the average gain for allclassrooms in the system for the rst of those measures (the othertwo measures are unavailable unless a classroom was retested)

Classrooms prospectively identied as ldquomost likely to havecheatedrdquo experienced gains on the initial spring 2002 test thatwere nearly twice as large as the typical CPS classroom (seecolumns 1 and 4) On the retest however those excess gainscompletely disappearedmdashthe gains between spring 2001 and thespring 2002 retest were close to the systemwide average Thoseclassrooms prospectively identied as ldquobad teachers likely to havecheatedrdquo also experienced large score declines on the retest (288and 2105 standard score points or roughly two-thirds of a gradeequivalent) Indeed comparing the spring 2001 results with the

23 Full details of this policy experiment are reported in Jacob and Levitt[2003] The results of the retests were used by CPS to initiate disciplinary actionagainst a number of teachers principals and staff

865ROTTEN APPLES

TABLE IVRESULTS OF RETESTS COMPARISON OF RESULTS FOR SPRING 2002 ITBS

AND AUDIT TEST

Category ofclassroom

Reading gains between Math gains between

Spring2001and

spring2002

Spring2001and2002retest

Spring2002and2002retest

Spring2001and

spring2002

Spring2001and2002retest

Spring2002and2002retest

All classrooms inthe Chicagopublic schools 143 mdash mdash 169 mdash mdash

Classroomsprospectivelyidentied asmost likelycheaters(N 5 36 onmath N 5 39on reading) 288 126 2162 300 193 2107

Classroomsprospectivelyidentied as badteacherssuspected ofcheating(N 5 16 onmath N 5 20on reading) 166 78 288 173 68 2105

Classroomsprospectivelyidentied asgood teacherswho did notcheat (N 5 17on math N 517 on reading) 206 211 105 288 255 233

Randomly selectedclassrooms (N 524 overall butonly one test perclassroom) 145 122 223 145 122 223

Results in this table compare test scores at three points in time (spring 2001 spring 2002 and a retestadministered under controlled conditions a few weeks after the initial spring 2002 test) The unit ofobservation is a classroom-subject test pair Only students taking all three tests are included in thecalculations Because of limited data math and reading results for the randomly selected classrooms arecombined Only the rst two columns are available for all CPS classrooms since audits were performed onlyon a subset of classrooms All entries in the table are in standard score units

866 QUARTERLY JOURNAL OF ECONOMICS

spring 2002 retest children in these classrooms had gains onlyhalf as large as the typical yearly CPS gain In stark contrastclassrooms prospectively identied as having ldquogood teachersrdquo(ie classrooms with large gains but not suspicious answerstrings) actually scored even higher on the reading retest thanthey did on the initial test The math scores fell slightly onretesting but these classrooms continued to experience extremelylarge gains The randomly selected classrooms also maintainedalmost all of their gains when retested as would be expected Insummary our methods proved to be extremely effective both inidentifying likely cheaters and in correctly classifying teacherswho achieved large test score gains in a legitimate manner

VI DOES TEACHER CHEATING RESPOND TO INCENTIVES

From the perspective of economics perhaps the most inter-esting question related to teacher cheating is the degree to whichit responds to incentives As noted in the Introduction there weretwo major changes in the incentives faced by teachers and stu-dents over our sample period Prior to 1996 ITBS scores wereprimarily used to provide teachers and parents with a sense ofhow a child was progressing academically Beginning in 1996with the appointment of Paul Vallas as CEO of Schools the CPSlaunched an initiative designed to hold students and teachersaccountable for student learning

The reform had two main elements The rst involved put-ting schools ldquoon probationrdquo if less than 15 percent of studentsscored at or above national norms on the ITBS reading examsThe CPS did not use math performance to determine probationstatus Probation schools that do not exhibit sufcient improve-ment may be reconstituted a procedure that involves closing theschool and dismissing or reassigning all of the school staff24 It isclear from our discussions with teachers and administrators thatbeing on probation is viewed as an extremely undesirable circum-stance The second piece of the accountability reform was an endto social promotionmdashthe practice of passing students to the nextgrade regardless of their academic skills or school performanceUnder the new policy students in third sixth and eighth grademust meet minimum standards on the ITBS in both reading and

24 For a more detailed analysis of the probation policy see Jacob [2002] andJacob and Lefgren [forthcoming]

867ROTTEN APPLES

mathematics in order to be promoted to the next grade Thepromotion standards were implemented in spring 1997 for thirdand sixth graders Promotion decisions are based solely on scoresin reading comprehension and mathematics

Table V presents OLS estimates of the relationship betweenteacher cheating and a variety of classroom and school charac-teristics25 The unit of observation is a classroomsubjectgradeyear The dependent variable is an indicator of whether we sus-pect the classroom cheated using our most stringent denition aclassroom is designated a cheater if both its test score uctua-tions and the suspiciousness of its answer strings are above theninety-fth percentile for that grade subject and year26

In column (1) the policy changes are restricted to have aconstant impact across all classrooms The introduction of thesocial promotion and probation policies is positively correlatedwith the likelihood of classroom cheating although the pointestimates are not statistically signicant at conventional levelsMuch more interesting results emerge when we interact thepolicy changes with the previous yearrsquos test scores for the class-room For both probation and social promotion cheating rates inthe lowest performing classrooms prove to be quite responsive tothe change in incentives In column (2) we see that a classroomone standard deviation below the mean increased its likelihood ofcheating by 043 percentage points in response to the schoolprobation policy and roughly 065 percentage points due to theending of social promotion Given a baseline rate of cheating of11 percent these effects are substantial The magnitude of thesechanges is particularly large considering that no elementaryschool on probation was actually reconstituted during this periodand that the social promotion policy has a direct impact on stu-dents but not direct ramications for teacher pay or rewards Aclassroom one standard deviation above the mean does not seeany signicant change in cheating in response to these two poli-cies Such classes are very unlikely to be located in schools at riskfor being put on probation and also are likely to have few stu-dents at risk for being retained These ndings are robust to

25 Logit and Probit models evaluated at the mean yield comparable resultsso the estimates from a linear probability model are presented for ease ofinterpretation

26 The results are not sensitive to the cheating cutoff used Note that thismeasure may include error due to both false positives and negatives Since themeasurement error is in the dependent variable the primary impact will likely beto simply decrease the precision of our estimates

868 QUARTERLY JOURNAL OF ECONOMICS

TABLE VOLS ESTIMATES OF THE RELATIONSHIP BETWEEN CHEATING AND CLASSROOM

CHARACTERISTICS

Independent variables

Dependent variable 5 Indicator ofclassroom cheating

(1) (2) (3) (4)

Social promotion policy 00011(00013)

00011(00013)

00015(00013)

00023(00009)

School probation policy 00020(00014)

00019(00014)

00021(00014)

00029(00013)

Prior classroom achievement 200047(00005)

200028(00005)

200016(00007)

200028(00007)

Social promotionclassroom achievement 200049(00014)

200051(00014)

200046(00012)

School probationclassroom achievement 200070(00013)

200070(00013)

200064(00013)

Mixed grade classroom 200084(00007)

200085(00007)

200089(00008)

200089(00012)

of students included in ofcial reporting 00252(00031)

00249(00031)

00141(00037)

00131(00037)

Teacher administers exam to ownstudents

00067(00015)

00067(00015)

00066(00015)

00061(00011)

Test form offered for the rst timea 200007(00011)

200007(00011)

200011(00010)

mdash

Average quality of teachersrsquoundergraduate institution

200026(00007)

Percent of teachers who have worked atthe school less than 3 years

200045(00031)

Percent of teachers under 30 years of age 00156(00065)

Percent of students in the school meetingnational norms in reading last year

00001(00000)

Percent free lunch in school 00001(00000)

Predominantly Black school 00068(00019)

Predominantly Hispanic school 200009(00016)

SchoolYear xed effects No No No YesNumber of observations 163474 163474 163474 163474

The unit of observation is classroomgradeyearsubject and the sample includes eight years (1993 to2000) four subjects (reading comprehension and three math sections) and ve grades (three to seven) Thedependentvariable is the cheating indicator derived using the ninety-fth percentile cutoff Robust standarderrors clustered by schoolyear are shown in parentheses Other variables included in the regressions incolumn (1) and (2) include a linear time trend grade cubic terms for the number of students a linear gradevariable and xed effects for subjects The regression shown in column (3) also includes the followingvariables indicators of the percentofstudents in theclassroom who wereBlack Hispanic male receivingfree lunchold for grade in a special education program living in foster care and living with a nonparental relative indicators ofschool size mobility rate and attendance rate and indicators of the percent of teachers in the school Other variablesinclude thepercentof teachersin the schoolwho hada masterrsquos ordoctoraldegreelivedin Chicagoand wereeducationmajors

a Test forms vary only by year so this variable will drop out of the analysis when schoolyear xed effectsare included

869ROTTEN APPLES

adding a number of classroom and school characteristics (column(3)) as well as schoolyear xed effects (column (4))

Cheating also appears to be responsive to other costs andbenets Classrooms that tested poorly last year are much morelikely to cheat For example a classroom with average studentprior achievement one classroom standard deviation below themean is 23 percent more likely to cheat Classrooms with stu-dents in multiple grades are 65 percent less likely to cheat thanclassrooms where all students are in the same grade This isconsistent with the fact that it is likely more difcult for teachersin such classrooms to cheat since they must administer twodifferent test forms to students which will necessarily have dif-ferent correct answers Moreover classes with a higher propor-tion of students who are included in the ofcial test reporting aremore likely to cheatmdasha 10 percentage point increase in the pro-portion of students in a class whose test scores ldquocountrdquo willincrease the likelihood of cheating by roughly 20 percent27

Teachers who administer the exam to their own students are 067percentage pointsmdashapproximately 50 percentmdashmore likely tocheat28

A few other notable results emerge First there is no statis-tically signicant impact on cheating of reusing a test form thathas been administered in a previous year That nding is ofinterest because it suggests that teachers taking old exams andteaching the precise questions to students is not an importantcomponent of what we are detecting as cheating (although anec-dotal evidence suggests this practice exists) Classrooms inschools with lower achievement higher poverty rates and moreBlack students are more likely to cheat Classrooms in schoolswith teachers who graduated from more prestigious undergrad-uate institutions are less likely to cheat whereas classrooms inschools with younger teachers are more likely to cheat

27 Consistent with this nding within cheating classrooms teachers areless likely to alter the answer sheets for students who are excluded from testreporting Teachers also appear to less frequently cheat for boys and for olderstudents

28 In an earlier version of the paper [Jacob and Levitt 2002] we examine forwhich students teachers tend to cheat We nd that teachers are most likely tocheat for students in the second quartile of the national ability distribution(twenty-fth to ftieth percentile) in the previous year consistent with the incen-tives under the probation policy

870 QUARTERLY JOURNAL OF ECONOMICS

VII CONCLUSION

This paper develops an algorithm for determining the preva-lence of teacher cheating on standardized tests and applies theresults to data from the Chicago public schools Our methodsreveal thousands of instances of classroom cheating representing4ndash5 percent of the classrooms each year Moreover we nd thatteacher cheating appears quite responsive to relatively minorchanges in incentives

The obvious benets of high-stakes tests as a means of pro-viding incentives must therefore be weighed against the possibledistortions that these measures induce Explicit cheating ismerely one form of distortion Indeed the kind of cheating wefocus on is one that could be greatly reduced at a relatively lowcost through the implementation of proper safeguards such as isdone by Educational Testing Service on the SAT or GRE exams29

Even if this particular form of cheating were eliminatedhowever there are a virtually innite number of dimensionsalong which strategic school personnel can act to game the cur-rent system For instance Figlio and Winicki [2001] and Rohlfs[2002] document that schools attempt to increase the caloricintake of students on test days and disruptive students areespecially likely to be suspended from school during ofcial test-ing periods [Figlio 2002] It may be prohibitively costly or evenimpossible to completely safeguard the system

Finally this paper ts into a small but growing body ofresearch focused on identifying corrupt or illicit behavior on thepart of economic actors (see Porter and Zona [1993] Fisman[2001] Di Tella and Schargrodsky [2001] and Duggan and Levitt[2002]) Because individuals engaged in such behavior activelyattempt to hide their trails the intellectual exercise associatedwith uncovering their misdeeds differs substantially from thetypical economic application in which the researcher starts witha well-dened measure of the outcome variable (eg earningseconomic growth prots) and then attempts to uncover the de-terminants of these outcomes In the case of corruption there istypically no clear outcome variable making it necessary for the

29 Although even tests administered by the Educational Testing Service arenot immune to cheating as evidenced by recent reports that many of the GREscores from China are fraudulent [Contreras 2002]

871ROTTEN APPLES

researcher to employ nonstandard approaches in generating sucha measure

APPENDIX THE CONSTRUCTION OF SUSPICIOUS STRING MEASURES

This appendix describes in greater detail how we constructeach of our measures of unexpected or suspicious responses

The rst measure focuses on the most unlikely block of iden-tical answers given on consecutive questions Using past testscores future test scores and background characteristics wepredict the likelihood that each student will give each answer oneach question For each item a student has four choices (A B Cor D) only one of which is correct We estimate a multinomiallogit for each item on the exam in order to predict how studentswill respond to each question We estimate the following modelfor each item using information from other students in that yeargrade and subject

(1) Pr~Yisc 5 j 5eb jxs

Sj51J eb jxs

where Y isc indicates the response of student s in class c on itemi the number of possible responses ( J) is four and Xs is a vectorthat includes measures of prior and future student achievementin math and reading as well as demographic variables (such asrace gender and free lunch status) for student s Thus a stu-dentrsquos predicted probability of choosing a particular response isidentied by the likelihood of other students (in the same yeargrade and subject) with similar background characteristicschoosing that response

Notice that by including future as well as prior test scores inthe model we decrease the likelihood that students with unusu-ally good teachers will be identied as cheaters since these stu-dents will likely retain some of the knowledge learned in the baseyear and thus have higher future test scores Also note that byestimating the probability of selecting each possible responserather than simply estimating the probability of choosing thecorrect response we take advantage of any additional informa-tion that is provided by particular response patterns in aclassroom

Using the estimates from this model we calculate the pre-dicted probability that each student would answer each item in

872 QUARTERLY JOURNAL OF ECONOMICS

the way that he or she in fact did

(2) pisc 5ebkxs

S j51J eb j xs

for k

5 response actually chosen by student s on item i

This provides us with one measure per student per item Takingthe product over items within student we calculate the probabil-ity that a student would have answered a string of consecutivequestions from item m to item n as he or she did

(3) pscmn 5 P

i5m

n

pisc

We then take the product across all students in the classroomwho had identical responses in the string If we dene z as astudent Szc

m n as the string of responses for student z from item mto item n and S sc

m n and as the string for student s then we canexpress the product as

(4) pscmn 5 P

s[$ zS icmn5 S sc

mn

pscmn

Note that if there are ns students in class c and each student hasa unique set of responses to these particular items then psc

m n

collapses to pscm n for each student and there will be ns distinct

values within the class On the other extreme if all of the stu-dents in class c have identical responses then there is only onedistinct value of psc

m n We repeat this calculation for all possibleconsecutive strings of length three to seven that is for all Sm n

such that 3 m 2 n 7 To create our rst indicator ofsuspicious string patterns we take the minimum of the predictedblock probability for each classroom

MEASURE 1 M1c 5 mins( ps cm n)

This measure captures the least likely block of identical answersgiven on consecutive questions in the classroom

The second measure of suspicious answer strings is intendedto capture more general patterns of similarity in student re-sponses To construct this measure we rst calculate the resid-

873ROTTEN APPLES

uals for each of the possible choices a student could have made foreach item

(5) e jisc 5 0 2eb jxs

S j51J eb j xs

if j THORN k

5 1 2eb j xs

S j51J eb jxs

if j 5 k

where e jis c is the residual for response j on item i by student s inclassroom c We thus have four separate residuals per studentper item

To create a classroom level measure of the response to item iwe need to combine the information for each student First wesum the residuals for each response across students within aclassroom

(6) ejic 5 Os

e jisc

If there is no within-class correlation in the way that studentsresponded to a particular item this term should be approximatelyzero Second we sum across the four possible responses for eachitem within classrooms At the same time we square each of thecomponent residual measures to accentuate outliers and divideby number of students in the class (nsc) to normalize by class size

(7) v ic 5S j ejic

2

nsc

The statistic vic captures something like the variance of studentresponses on item i within classroom c Notice that we choose torst sum across the residuals of each response across studentsand then sum the classroom level measures for each responserather than summing across responses within student initiallyWe do this in order to emphasize the classroom level tendencies inresponse patterns

Our second measure of suspicious strings is simply the class-room average (across items) of this variance term across all testitems

MEASURE 2 M2c 5 v c 5 yeni vic ni where ni is the number of itemson the exam

Our third measure is simply the variance (as opposed to the

874 QUARTERLY JOURNAL OF ECONOMICS

mean) in the degree of correlation across questions within aclassroom

MEASURE 3 M3c 5 sv c

2 5 yeni (vic 2 v c)2 ni

Our nal indicator focuses on the extent to which a studentrsquosresponse pattern was different from other studentrsquos with thesame aggregate score that year Let qisc equal one if student s inclassroom c answered item i correctly and zero otherwise Let As

equal the aggregate score of student s on the exam We thendetermine what fraction of students at each aggregate score levelanswered each item correctly If we let nsA equal number ofstudents with an aggregate score of A then this fraction q i

A canbe expressed as

(8) q iA 5

Ss[$ z Az5As qisc

nsA

We then calculate a measure of how much the response pattern ofstudent s differed from the response pattern of other studentswith the same aggregate score We do so by subtracting a stu-dentrsquos answer on item i from the mean response of all studentswith aggregate score A squaring these deviations and then sum-ming across all items on the exam

(9) Zsc 5 Oi

~qisc 2 q iA2

We then subtract out the mean deviation for all students with thesame aggregate score Z A and sum the students within eachclassroom to obtain our nal indicator

MEASURE 4 M4c 5 yens (Zs c 2 Z A)

KENNEDY SCHOOL OF GOVERNMENT HARVARD UNIVERSITY

UNIVERSITY OF CHICAGO AND AMERICAN BAR FOUNDATION

REFERENCES

Aiken Linda R ldquoDetecting Understanding and Controlling for Cheating onTestsrdquo Research in Higher Education XXXII (1991) 725ndash736

Angoff William H ldquoThe Development of Statistical Indices for Detecting Cheat-ersrdquo Journal of the American Statistical Association LXIX (1974) 44ndash49

Baker George ldquoIncentive Contracts and Performance Measurementrdquo Journal ofPolitical Economy C (1992) 598ndash614

Bellezza Francis S and Suzanne F Bellezza ldquoDetection of Cheating on Multiple-Choice Tests by Using Error Similarity Analysisrdquo Teaching of PsychologyXVI (1989) 151ndash155

Carnoy Martin and Susanna Loeb ldquoDoes External Accountability Affect Student

875ROTTEN APPLES

Outcomes A Cross-State Analysisrdquo Working Paper Stanford UniversitySchool of Education 2002

Contreras Russell ldquoInquiry Uncovers Possible Cheating on GRE in Asiardquo Asso-ciated Press August 7 2002

Cullen Julie Berry and Randall Reback ldquoTinkering Toward Accolades SchoolGaming under a Performance Accountability Systemrdquo Working paper Uni-versity of Michigan 2002

Deere Donald and Wayne Strayer ldquoPutting Schools to the Test School Account-ability Incentives and Behaviorrdquo Working Paper Texas AampM University2002

DiTella Rafael and Ernesto Schargrodsky ldquoThe Role of Wages and Auditingduring a Crackdown on Corruption in the City of Buenos Airesrdquo unpublishedmanuscript Harvard Business School 2001

Duggan Mark and Steven Levitt ldquoWinning Isnrsquot Everything Corruption in SumoWrestlingrdquo American Economic Review XCII (2002) 1594ndash1605

Figlio David N ldquoTesting Crime and Punishmentrdquo Working Paper University ofFlorida 2002

Figlio David N and Joshua Winicki ldquoFood for Thought The Effects of SchoolAccountability Plans on School Nutritionrdquo Working Paper University ofFlorida 2001

Figlio David N and Lawrence S Getzler ldquoAccountability Ability and DisabilityGaming the Systemrdquo Working Paper University of Florida 2002

Fisman Ray ldquoEstimating the Value of Political Connectionsrdquo American EconomicReview XCI (2001) 1095ndash1102

Frary Robert B ldquoStatistical Detection of Multiple-Choice Answer Copying Re-view and Commentaryrdquo Applied Measurement in Education VI (1993)153ndash165

Frary Robert B T Nicholas Tideman and Thomas Watts ldquoIndices of Cheatingon Multiple-Choice Testsrdquo Journal of Educational Statistics II (1977)235ndash256

Glaeser Edward and Andrei Shleifer ldquoA Reason for Quantity Regulationrdquo Amer-ican Economic Review XCI (2001) 431ndash435

Grissmer David W Ann Flanagan et al ldquoImproving Student AchievementWhat NAEP Test Scores Tell Usrdquo RAND Corporation MR-924-EDU 2000

Hanushek Eric A and Margaret E Raymond ldquoImproving Educational QualityHow Best to Evaluate Our Schoolsrdquo Conference paper Federal Reserve Bankof Boston 2002

Hofkins Diane ldquoCheating lsquoRifersquo in National Testsrdquo New York Times EducationalSupplement June 16 1995

Holland Paul W ldquoAssessing Unusual Agreement between the Incorrect Answersof Two Examinees Using the K-Index Statistical Theory and EmpiricalSupportrdquo Technical Report No 96-5 Educational Testing Service 1996

Holmstrom Bengt and Paul Milgrom ldquoMultitask Principal-Agent Analyses In-centive Contracts Asset Ownership and Job Designrdquo Journal of Law Eco-nomics and Organization VII (1991) 24ndash51

Jacob Brian A ldquoAccountability Incentives and Behavior The Impact of High-Stakes Testing in the Chicago Public Schoolsrdquo Working Paper No 8968National Bureau of Economic Research 2002

Jacob Brian A and Lars Lefgren ldquoThe Impact of Teacher Training on StudentAchievement Quasi-Experimental Evidence from School Reform Efforts inChicagordquo Journal of Human Resources forthcoming

Jacob Brian A and Steven Levitt ldquoCatching Cheating Teachers The Results ofan Unusual Experiment in Implementing Theoryrdquo Working Paper No 9413National Bureau of Economic Research 2003

Klein Stephen P Laura S Hamilton et al ldquoWhat Do Test Scores in Texas TellUsrdquo RAND Corporation 2000

Kolker Claudia ldquoTexas Offers Hard Lessons on School Accountabilityrdquo LosAngeles Times April 14 1999

Lindsay Drew ldquoWhodunit Ofcials Find Thousands of Erasures on Standard-ized Tests and Suspect Tamperingrdquo Education Week (1996) 25ndash29

Loughran Regina and Thomas Comiskey ldquoCheating the Children Educator

876 QUARTERLY JOURNAL OF ECONOMICS

Misconduct on Standardized Testsrdquo Report of the City of New York SpecialCommissioner of Investigation for the New York City School District 1999

Marcus John ldquoFaking the Graderdquo Boston Magazine February 2000May Meredith ldquoState Fears Cheating by Teachersrdquo San Francisco Chronicle

October 4 2000Perlman Carol L ldquoResults of a Citywide Testing Program Audit in Chicagordquo

Paper for American Educational Research Association annual meeting Chi-cago IL 1985

Porter Robert H and J Douglas Zona ldquoDetection of Bid Rigging in ProcurementAuctionsrdquo Journal of Political Economy CI (1993) 518ndash538

Richards Craig E and Tian Ming Sheu ldquoThe South Carolina School IncentiveReward Program A Policy Analysisrdquo Economics of Education Review XI(1992) 71ndash86

Rohlfs Chris ldquoEstimating Classroom Learning Using the School Breakfast Pro-gram as an Instrument for Attendance and Class Size Evidence from ChicagoPublic Schoolsrdquo Unpublished manuscript University of Chicago 2002

Tysome Tony ldquoCheating Purge Inspectors outrdquo New York Times Higher Educa-tion Supplement August 19 1994

Wollack James A ldquoA Nominal Response Model Approach for Detecting AnswerCopyingrdquo Applied Psychological Measurement XX (1997) 144ndash152

877ROTTEN APPLES

Page 3: rotten apples: an investigation of the prevalence and predictors of ...

gations of specic instances of cheating and generally rely on theanalysis of erasure patterns and the controlled retesting of stu-dents5 While this earlier research provides convincing evidenceof isolated cheating incidents our paper represents the rst sys-tematic attempt to (1) identify the overall prevalence of teachercheating empirically and (2) analyze the factors that predictcheating To address these questions we use detailed adminis-trative data from the Chicago public schools (CPS) that includesthe question-by-question answers given by every student ingrades 3 to 8 who took the Iowa Test of Basic Skills (ITBS) from1993 to 20006 In addition to the test responses we also haveaccess to each studentrsquos full academic record including past testscores the school and room to which a student was assigned andextensive demographic and socioeconomic characteristics

Our approach to detecting classroom cheating combinestwo types of indicators unexpected test score uctuations andunusual patterns of answers for students within a classroomTeacher cheating increases the likelihood that students in aclassroom will experience large unexpected increases in testscores one year followed by very small test score gains (or evendeclines) the following year Teacher cheating especially ifdone in an unsophisticated manner is also likely to leavetell-tale signs in the form of blocks of identical answers un-usual patterns of correlations across student answers withinthe classroom or unusual response patterns within a studentrsquosexam (eg a student who answers a number of very difcult

patterns of agreement in student responses and for the most part are onlyeffective in identifying the most egregious cases of copying

5 In the mid-eighties Perlman [1985] investigated suspected cheating in anumber of Chicago public schools (CPS) The study included 23 suspect schoolsmdashidentied on the basis of a high percentage of erasures unusual patterns of scoreincreases unnecessarily large orders of blank answer sheets for the ITBS and tipsto the CPS Ofce of Researchmdashalong with 17 comparison schools When a secondform of the test was administered to the 40 schools under more controlled condi-tions the suspect schools did much worse than the comparison schools Ananalysis of several dozen Los Angeles schools where the percentage of erasuresand changed answers was unusually high revealed evidence of teacher cheating[Aiken 1991] One of the most highly publicized cheating scandals involved Strat-eld elementary an award-winning school in Connecticut In 1996 the rm thatdeveloped and scored the exam found that the rate of erasures at Strateld was upto ve times greater than other schools in the same district and that 89 percent oferasures at Strateld were from an incorrect to a correct response Subsequentretesting resulted in signicantly lower scores [Lindsay 1996]

6 We do not however have access to the actual test forms that studentslled out so we are unable to analyze these tests for evidence of suspiciouspatterns of erasures

845ROTTEN APPLES

questions correctly while missing many simple questions) Ouridentication strategy exploits the fact that these two types ofindicators are very weakly correlated in classrooms unlikely tohave cheated but very highly correlated in situations wherecheating likely occurred That allows us to credibly estimatethe prevalence of cheating without having to invoke arbitrarycutoffs as to what constitutes cheating

Empirically we detect cheating in approximately 4 to 5 per-cent of the classes in our sample This estimate is likely tounderstate the true incidence of cheating for two reasons Firstwe focus only on the most egregious type of cheating whereteachers systematically alter student test forms There are othermore subtle ways in which teachers can cheat such as providingextra time to students that our algorithm is unlikely to detectSecond even when test forms are altered our approach is onlypartially successful in detecting illicit behavior As discussedlater when we ourselves simulate cheating by altering studentanswer strings and then testing for cheating in the articiallymanipulated classrooms many instances of moderate cheating goundetected by our methods

A number of patterns in the results reinforce our confidence that what we measure is indeed cheating. First, simulation results demonstrate that there is nothing mechanical about our identification approach that automatically generates patterns like those observed in the data. When we randomly assign students to classrooms and search for cheating in these simulated classes, our methods find little evidence of cheating. Second, cheating on one part of the test (e.g., math) is a strong predictor of cheating on other sections of the test (e.g., reading). Third, cheating is also correlated within classrooms over time and across classrooms in a particular school. Finally, and perhaps most convincingly, with the cooperation of the Chicago public schools, we were able to conduct a prospective test of our methods in which we retested a subset of classrooms under controlled conditions that precluded teacher cheating. Classrooms identified as likely cheaters experienced large declines in test scores on the retest, whereas classrooms not suspected of cheating maintained their test score gains.

The prevalence of cheating is also shown to respond to relatively minor changes in teacher incentives. The importance of standardized tests in the Chicago public schools increased


substantially with a change in leadership in 1996, particularly for low-achieving students. Following the introduction of these policies, the prevalence of cheating rose sharply in low-achieving classrooms, whereas classes with average or higher-achieving students showed no increase in cheating. Cheating prevalence also appears to be systematically lower in cases where the costs of cheating are higher (e.g., in mixed-grade classrooms in which two different exams are administered simultaneously) or the benefits of cheating are lower (e.g., in classrooms with more special education or bilingual students who take the standardized tests but whose scores are excluded from official calculations).

The remainder of the paper is structured as follows. Section II discusses the set of indicators we use to capture cheating behavior. Section III describes our identification strategy. Section IV provides a brief overview of the institutional details of the Chicago public schools and the data set that we use. Section V reports the basic empirical results on the prevalence of cheating and presents a wide range of evidence supporting the interpretation of these results as cheating. Section VI analyzes how teacher cheating responds to incentives. Section VII discusses the results and the implications for increasing reliance on high-stakes testing.

II. INDICATORS OF TEACHER CHEATING

Teacher cheating, especially in extreme cases, is likely to leave tell-tale signs. In motivating our discussion of the indicators we employ for detecting cheating, it is instructive to compare two actual classrooms taking the same test (see Figure I). Each row in Figure I represents one student's answers to each item on the test. Columns correspond to the different questions asked. The letter "A," "B," "C," or "D" means a student provided the correct answer. If a number is entered, the student answered the question incorrectly, with "1" corresponding to a wrong answer of "A," "2" corresponding to a wrong answer of "B," etc. On the right-hand side of the table, we also present student test scores for the preceding, current, and following year. Test scores are in units of "grade equivalents." The grading scale is normed so that a student at the national average for sixth graders taking the test in the eighth month of the school year would score 6.8. A typical


FIGURE I
Sample Answer Strings and Test Scores from Two Classrooms

The data in the figure represent actual answer strings and test scores from two CPS classrooms taking the same exam. The top classroom is suspected of cheating; the bottom classroom is not. Each row corresponds to an individual student. Each column represents a particular question on the exam. A letter indicates that the student gave that answer and the answer was correct. A number means that the student gave the corresponding letter answer (e.g., 1 = "A"), but the answer was incorrect. A value of "0" means the question was left blank. Student test scores, in grade equivalents, are shown in the last three columns of the figure. The test year for which the answer strings are presented is denoted year t. The scores from years t − 1 and t + 1 correspond to the preceding and following years' examinations.


student would be expected to gain one grade equivalent for each year of school.

The top panel of data shows a class in which we suspect teacher cheating took place; the bottom panel corresponds to a typical classroom. Two striking differences between the classrooms are readily apparent. First, in the cheating classroom, a large block of students provided identical answers on consecutive questions in the middle of the test, as indicated by the boxed area in the figure. For the other classroom, no such pattern exists. Second, looking at the pattern of test scores, students in the cheating classroom experienced large increases in test scores from the previous to the current year (1.7 grade equivalents on average), and actually experienced declines on average the following year. In contrast, the students in the typical classroom gained roughly one grade equivalent each year, as would be expected.

The indicators we use as evidence of cheating formalize and extend the basic picture that emerges from Figure I. We divide these indicators into two distinct groups that respectively capture unusual test score fluctuations and suspicious patterns of answer strings. In this section we describe informally the measures that we use. A more rigorous treatment of their construction is provided in the Appendix.

II.A. Indicator One: Unexpected Test Score Fluctuations

Given that the aim of cheating is to raise test scores, one signal of teacher cheating is an unusually large gain in test scores relative to how those same students tested in the previous year. Since test score gains that result from cheating do not represent real gains in knowledge, there is no reason to expect the gains to be sustained on future exams taken by these students (unless, of course, next year's teachers also cheat on behalf of the students). Thus, large gains due to cheating should be followed by unusually small test score gains for these students in the following year. In contrast, if large test score gains are due to a talented teacher, the student gains are likely to have a greater permanent component, even if some regression to the mean occurs. We construct a summary measure of how unusual the test score fluctuations are by ranking each classroom's average test score gains relative to all other


classrooms in that same subject, grade, and year,7 and computing the following statistic:

(3)  SCORE_cbt = (rank_gain_cbt)² + (1 − rank_gain_cb,t+1)²

where rank_gain_cbt is the percentile rank for class c in subject b in year t. Classes with relatively big gains on this year's test and relatively small gains on next year's test will have high values of SCORE. Squaring the individual terms gives relatively more weight to big test score gains this year and big test score declines the following year.8 In the empirical analysis we consider three possible cutoffs for what it means to have a "high" value on SCORE, corresponding to the eightieth, ninetieth, and ninety-fifth percentiles among all classrooms in the sample.
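To make the construction of SCORE in equation (3) concrete, the sketch below computes it from classroom-level average gains. This is our illustration rather than the authors' code; the input column names (gain_t for this year's average gain, gain_t1 for the same students' gain the following year) and the use of pandas percentile ranks are assumptions.

import pandas as pd

def compute_score(df: pd.DataFrame, cutoff: float = 0.95) -> pd.DataFrame:
    """Compute SCORE_cbt = rank_gain_t^2 + (1 - rank_gain_{t+1})^2.

    df has one row per classroom-subject-year, with hypothetical
    columns: subject, grade, year, gain_t, gain_t1.
    """
    group = ["subject", "grade", "year"]
    # Percentile rank of this year's average gain among classrooms
    # in the same subject, grade, and year.
    df["rank_gain_t"] = df.groupby(group)["gain_t"].rank(pct=True)
    # Percentile rank of the following year's gain for the same
    # students (based on this year's class composition).
    df["rank_gain_t1"] = df.groupby(group)["gain_t1"].rank(pct=True)
    # Squaring weights big gains now and big declines later heavily.
    df["SCORE"] = df["rank_gain_t"] ** 2 + (1.0 - df["rank_gain_t1"]) ** 2
    # Flag "high" values; the paper uses 80th, 90th, and 95th cutoffs.
    df["high_score"] = df.groupby(group)["SCORE"].rank(pct=True) > cutoff
    return df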

II.B. Indicator Two: Suspicious Answer Strings

The quickest and easiest way for a teacher to cheat is to alter the same block of consecutive questions for a substantial portion of students in the class, as was apparently done in the classroom in the top panel of Figure I. More sophisticated interventions might involve skipping some questions so as to avoid a large block of identical answers, or altering different blocks of questions for different students.

We combine four different measures of how suspicious a classroom's answer strings are in determining whether a classroom may be cheating. The first measure focuses on the most unlikely block of identical answers given by students on consecutive questions. Using past test scores, future test scores, and background characteristics, we predict the likelihood that each student will give each possible answer (A, B, C, or D) on every question using a multinomial logit. Each student's predicted probability of choosing a particular response is identified by the likelihood that other students (in the same year, grade, and subject) with similar background characteristics will choose that

7. We also experimented with more complicated mechanisms for defining large or small test score gains (e.g., predicting each student's expected test score gain as a function of past test scores and background characteristics, and computing a deviation measure for each student, which was then aggregated to the classroom level), but because the results were similar, we elected to use the simpler method. We have also defined gains and losses using an absolute metric (e.g., where gains in excess of 1.5 or 2 grade equivalents are considered unusually large) and obtain comparable results.

8. In the following year, the students who were in a particular classroom are typically scattered across multiple classrooms. We base all calculations off of the composition of this year's class.


response. We then search over all combinations of students and consecutive questions to find the block of identical answers given by students in a classroom least likely to have arisen by chance.9

The more unusual the most unusual block of test responses is (adjusting for class size and the number of questions on the exam, both of which increase the possible combinations over which we search), the more likely it is that cheating occurred. Thus, if ten very bright students in a class of thirty give the correct answers to the first five questions on the exam (typically the easier questions), the block of identical answers will not appear unusual. In contrast, if all fifteen students in a low-achieving classroom give the same correct answers to the last five questions on the exam (typically the harder questions), this would appear quite suspect.
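The block search can be sketched as follows. We take as given an array of predicted response probabilities (in the paper these come from the multinomial logit); the function below is our own simplified illustration and omits the class-size and test-length adjustment the authors describe.

import numpy as np

def least_likely_block(answers: np.ndarray, probs: np.ndarray,
                       min_len: int = 3, max_len: int = 7) -> float:
    """Log-probability of the least likely block of identical answers.

    answers: (n_students, n_questions) integer answer codes.
    probs:   (n_students, n_questions, n_choices) predicted probability
             of each student giving each answer.
    Searches all contiguous question blocks and all groups of students
    whose answers on the block are identical; returns the smallest
    (most suspicious) log-probability found.
    """
    n_students, n_questions = answers.shape
    best = 0.0  # log-probabilities are <= 0
    for length in range(min_len, max_len + 1):
        for q0 in range(n_questions - length + 1):
            groups = {}
            for s in range(n_students):
                key = tuple(answers[s, q0:q0 + length])
                groups.setdefault(key, []).append(s)
            for key, members in groups.items():
                if len(members) < 2:
                    continue  # need at least two identical strings
                logp = sum(np.log(probs[s, q0 + j, key[j]])
                           for s in members for j in range(length))
                best = min(best, logp)
    return best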

The second measure of suspicious answer strings involves the overall degree of correlation in student answers across the test. When a teacher changes answers on test forms, it presumably increases the uniformity of student test forms across students in the class. This measure is meant to capture more general patterns of similarity in student responses, beyond just identical blocks of answers. Based on the results of the multinomial logit described above, for each question and each student we create a measure of how unexpected the student's response was. We then combine the information for each student in the classroom to create something akin to the within-classroom correlation in student responses. This measure will be high if students in a classroom tend to give the same answers on many questions, especially if the answers given are unexpected (i.e., correct answers on hard questions or systematic mistakes on easy questions).

Of course, within-classroom correlation may arise for many reasons other than cheating (e.g., the teacher may emphasize certain topics during the school year). Therefore, a third indicator of potential cheating is a high variance in the degree of correlation across questions. If the teacher changes answers for multiple students on selected questions, the within-class correlation on

9. Note that we do not require the answers to be correct. Indeed, in many classrooms, the most unusual strings include some incorrect answers. Note also that these calculations are done under the assumption that a given student's answers are uncorrelated (conditional on observables) across questions on the exam, and that answers are uncorrelated across students. Of course, this assumption is unlikely to be true. Since all of our comparisons rely on the relative unusualness of the answers given in different classrooms, this simplifying assumption is not problematic unless the correlation within and across students varies by classroom.


those particular questions will be extremely high, while the degree of within-class correlation on other questions is likely to be typical. This leads the cross-question variance in correlations to be larger than normal in cheating classrooms.

Our final indicator compares the answers that students in one classroom give with the answers of other students in the system who take the identical test and get the exact same score. This measure relies on the fact that questions vary significantly in difficulty. The typical student will answer most of the easy questions correctly but get many of the hard questions wrong (where "easy" and "hard" are based on how well students of similar ability do on the question). If students in a class systematically miss the easy questions while correctly answering the hard questions, this may be an indication of cheating.
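Our rough rendering of the second and third measures is sketched below; the exact Appendix definitions differ, and this is only a simplified illustration. The fourth measure requires systemwide data on students with identical total scores, so it is omitted here.

import numpy as np

def string_measures(answers: np.ndarray, probs: np.ndarray):
    """Simplified versions of the second and third string measures.

    answers: (n_students, n_questions) integer answer codes.
    probs:   (n_students, n_questions, n_choices) predicted response
             probabilities from the multinomial logit.
    """
    n_students, n_questions = answers.shape
    # "Unexpectedness" of each observed response: -log of its
    # predicted probability (large when the answer is surprising).
    p_given = np.take_along_axis(probs, answers[:, :, None], axis=2)
    surprise = -np.log(p_given.squeeze(2) + 1e-12)

    # Question-level index of coordinated, unexpected responding:
    # the share of students giving the modal answer, scaled by the
    # average surprise of the responses on that question.
    per_question = np.empty(n_questions)
    for q in range(n_questions):
        _, counts = np.unique(answers[:, q], return_counts=True)
        per_question[q] = (counts.max() / n_students) * surprise[:, q].mean()

    m2 = per_question.mean()  # overall within-class similarity
    m3 = per_question.var()   # unevenness of agreement across questions
    return m2, m3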

Our overall measure of suspicious answer strings is constructed in a manner parallel to our measure of unusual test score fluctuations. Within a given subject, grade, and year, we rank classrooms on each of these four indicators and then take the sum of squared ranks across the four measures:10

(4)  ANSWERS_cbt = (rank_m1_cbt)² + (rank_m2_cbt)² + (rank_m3_cbt)² + (rank_m4_cbt)²

In the empirical work we again use three possible cutoffs for potential cheating: the eightieth, ninetieth, and ninety-fifth percentiles.
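Equation (4) is then a straightforward rank-and-sum, sketched below with hypothetical column names m1 through m4 for the four indicators (again our illustration, not the authors' code).

import pandas as pd

def compute_answers(df: pd.DataFrame, cutoff: float = 0.95) -> pd.DataFrame:
    """ANSWERS_cbt = sum of squared within-cell percentile ranks of
    the four string measures (a sketch of equation (4))."""
    group = ["subject", "grade", "year"]
    total = 0.0
    for m in ["m1", "m2", "m3", "m4"]:
        # Rank each indicator within subject/grade/year, then square.
        total = total + df.groupby(group)[m].rank(pct=True) ** 2
    df["ANSWERS"] = total
    df["high_answers"] = df.groupby(group)["ANSWERS"].rank(pct=True) > cutoff
    return df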

III. A STRATEGY FOR IDENTIFYING THE PREVALENCE OF CHEATING

The previous section described indicators that are likely to be correlated with cheating. Because such patterns sometimes arise by chance, however, not every classroom with large test score fluctuations and suspicious answer strings is cheating. Furthermore, the particular choice of what qualifies as a "large" fluctuation or a "suspicious" set of answer strings will necessarily be arbitrary. In this section we present an identification strategy that, under a set of defensible assumptions, nonetheless provides estimates of the prevalence of cheating.

To identify the number of cheating classrooms in a given

10. Because different subjects and grades have differing numbers of questions, it is difficult to make meaningful comparisons across tests on the raw indicators.


year, we would like to compare the observed joint distribution of test score fluctuations and suspicious answer strings with a counterfactual distribution in which no cheating occurs. Differences between the two distributions would provide an estimate of how much cheating occurred. If teachers in a particular school or year are cheating, there will be more classrooms exhibiting both unusual test score fluctuations and suspicious answer strings than otherwise expected. In practice, we do not have the luxury of observing such a counterfactual. Instead, we must make assumptions about what the patterns would look like absent cheating.

Our identification strategy hinges on three key assumptions: (1) cheating increases the likelihood a class will have both large test score fluctuations and suspicious answer strings; (2) if cheating classrooms had not cheated, their distribution of test score fluctuations and answer string patterns would be identical to noncheating classrooms; and (3) in noncheating classrooms, the correlation between test score fluctuations and suspicious answers is constant throughout the distribution.11

If assumption (1) holds, then cheating classrooms will be concentrated in the upper tail of the joint distribution of unusual test score fluctuations and suspicious answer strings. Other parts of the distribution (e.g., classrooms ranked in the fiftieth to seventy-fifth percentile of suspicious answer strings) will consequently include few cheaters. As long as cheating classrooms would look similar to noncheating classrooms on our measures if they had not cheated (assumption (2)), classes in the part of the distribution with few cheaters provide the noncheating counterfactual that we are seeking.

The difficulty, however, is that we only observe this noncheating counterfactual in the bottom and middle parts of the distribution, when what we really care about is the upper tail of the distribution, where the cheaters are concentrated. Assumption (3), which requires that in noncheating classrooms the correlation between test score fluctuations and suspicious answers is constant throughout the distribution, provides a potential solution. If this assumption holds, we can use the part of the distribution that is relatively free of cheaters to project what the right tail of the distribution would look like absent cheating. The gap between the predicted and observed frequency of classrooms that

11. We formally derive the mathematical model described in this section in Jacob and Levitt [2003].


are extreme on both the test score fluctuation and suspicious answer string measures provides our estimate of cheating.

Figure II demonstrates how our identification strategy works empirically. The horizontal axis in the figure ranks classrooms according to how suspicious their answer strings are. The vertical axis is the fraction of the classrooms that are above the ninety-fifth percentile on the unusual test score fluctuations measure.12

The graph combines all classrooms and all subjects in our data.13

Over most of the range (roughly from zero to the seventy-fifth percentile on the horizontal axis), there is a slight positive correlation between unusual test scores and suspicious answer strings.

12. The choice of the ninety-fifth percentile is somewhat arbitrary. In the empirical work that follows, we also consider the eightieth and ninetieth percentiles. The choice of the cutoff does not affect the basic patterns observed in the data.

13. To construct the figure, classes were rank ordered according to their answer strings and divided into 200 equally sized segments. The circles in the figure represent these 200 local means. The line displayed in the graph is the fitted value of a regression with a seventh-order polynomial in a classroom's rank on the suspicious strings measure.

FIGURE II
The Relationship between Unusual Test Scores and Suspicious Answer Strings

The horizontal axis reflects a classroom's percentile rank in the distribution of suspicious answer strings within a given grade, subject, and year, with zero representing the least suspicious classroom and one representing the most suspicious classroom. The vertical axis is the probability that a classroom will be above the ninety-fifth percentile on our measure of unusual test score fluctuations. The circles in the figure represent averages from 200 equally spaced cells along the x-axis. The predicted line is based on a probit model estimated with seventh-order polynomials in the suspicious string measure.


This is the part of the distribution that is unlikely to contain many cheating classrooms. Under the assumption that the correlation between our two measures is constant in noncheating classrooms over the whole distribution, we would predict that the straight line observed for all but the right-hand portion of the graph would continue all the way to the right edge of the graph.

Instead, as one approaches the extreme right tail of the distribution of suspicious answer strings, the probability of large test score fluctuations rises dramatically. That sharp rise, we argue, is a consequence of cheating. To estimate the prevalence of cheating, we simply compare the actual area under the curve in the far right tail of Figure II to the predicted area under the curve in the right tail, projecting the linear relationship observed from the fiftieth to seventy-fifth percentile out to the ninety-ninth percentile. Because this identification strategy is necessarily indirect, we devote a great deal of attention to presenting a wide variety of tests of the validity of our approach, the sensitivity of the results to alternative assumptions, and the plausibility of our findings.
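In code, the counting exercise might look like the sketch below. It is ours, and it uses the simplest version of the counterfactual (the mean of the 50th–75th percentile range rather than a fitted polynomial).

import numpy as np

def estimate_prevalence(answers_rank: np.ndarray,
                        high_score: np.ndarray,
                        tail: float = 0.95) -> float:
    """Estimate the share of classrooms cheating on one subject test.

    answers_rank: percentile rank (0 to 1) on the ANSWERS measure.
    high_score:   boolean array, above the SCORE cutoff.
    """
    # No-cheating counterfactual: P(high SCORE) in a region of the
    # ANSWERS distribution assumed to contain few cheaters.
    clean = (answers_rank >= 0.50) & (answers_rank < 0.75)
    baseline = high_score[clean].mean()
    # Observed rate in the far right tail, where cheaters concentrate.
    in_tail = answers_rank >= tail
    observed = high_score[in_tail].mean()
    # Excess mass, rescaled to a share of all classrooms.
    return (observed - baseline) * in_tail.mean()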

IV. DATA AND INSTITUTIONAL BACKGROUND

Elementary students in Chicago public schools take a standardized multiple-choice achievement exam known as the Iowa Test of Basic Skills (ITBS). The ITBS is a national norm-referenced exam with a reading comprehension section and three separate math sections.14 Third through eighth grade students in Chicago are required to take the exams each year. Most schools administer the exams to first and second grade students as well, although this is not a district mandate.

Our base sample includes all students in third to seventh grade for the years 1993–2000.15 For each student we have the question-by-question answer string on each year's tests, school and classroom identifiers, the full history of prior and future test scores, and demographic variables including age, sex, race, and free lunch eligibility. We also have information about a wide

14. There are also other parts of the test which are either not included in official school reporting (spelling, punctuation, grammar) or are given only in select grades (science and social studies), for which we do not have information.

15. We also have test scores for eighth graders, but we exclude them because our algorithm requires test score data for the following year, and the ITBS test is not administered to ninth graders.


range of school-level characteristics. We do not, however, have individual teacher identifiers, so we are unable to directly link teachers to classrooms or to track a particular teacher over time.

Because our cheating proxies rely on comparisons to past and future test scores, we drop observations that are missing reading or math scores in either the preceding year or the following year (in addition to those with missing test scores in the baseline year).16 Less than one-half of 1 percent of students with missing demographic data are also excluded from the analysis. Finally, because our algorithms for identifying cheating rely on identifying suspicious patterns within a classroom, our methods have little power in classrooms with small numbers of students. Consequently, we drop all classrooms for which we have fewer than ten valid students in a particular grade after our other exclusions (roughly 3 percent of students). A handful of classrooms recorded as having more than 40 students (presumably multiple classrooms not separately identified in the data) are also dropped. Our final data set contains roughly 20,000 students per grade per year, distributed across approximately 1,000 classrooms, for a total of over 40,000 classroom-years of data (with four subject tests per classroom-year) and over 700,000 student-year observations.

Summary statistics for the full sample of classrooms are shown in Table I, where the unit of observation is a class/subject/year. As in many urban school districts, students in Chicago are disproportionately poor and minority. Roughly 60 percent of students are Black and 26 percent are Hispanic, with nearly 85 percent receiving free or reduced-price lunch. Less than 29 percent of students scored at or above national norms in reading.

The ITBS exams are administered over a weeklong period in early May. Third grade teachers are permitted to administer the exam to their own students, while other teachers switch classes to administer the exams. The exams are generally delivered to the schools one to two weeks before testing and are supposed to be

16. The exact number of students with missing test data varies by year and grade. Overall, we drop roughly 12 percent of students because of missing test score information in the baseline year. These are students who (i) were not tested because of placement in bilingual or special education programs, or (ii) simply missed the exam that particular year. In addition to these students, we also drop approximately 13 percent of students who were missing test scores in either the preceding or following years. Test data may be missing either because a student did not attend school on the days of the test, or because the student transferred into the CPS system in the current year or left the system prior to the next year of testing.


kept in a secure location by the principal or the school's test coordinator, an individual in the school designated to coordinate the testing process (often a counselor or administrator). Each

TABLE I
SUMMARY STATISTICS

                                                             Mean (s.d.)
Classroom characteristics
  Mixed grade classroom                                        0.073
  Teacher administers exams to her own students (3rd grade)    0.206
  Percent of students who were tested and included in
    official reporting                                         0.883
  Average prior achievement
    (as deviation from year/grade/subject mean)               −0.004 (0.661)
  % Black                                                      0.595
  % Hispanic                                                   0.263
  % Male                                                       0.495
  % Old for grade                                              0.086
  % Living in foster care                                      0.044
  % Living with nonparental relative                           0.104
  Cheater (95th percentile cutoff)                             0.013
School characteristics
  Average quality of teachers' undergraduate institution
    in the school                                             −2.550 (0.877)
  Percent of teachers who live in Chicago                      0.712
  Percent of teachers who have an MA or a PhD                  0.475
  Percent of teachers who majored in education                 0.712
  Percent of teachers under 30 years of age                    0.114
  Percent of teachers at the school less than 3 years          0.547
  % students at national norms in reading last year           28.8
  % students receiving free lunch in school                   84.7
  Predominantly Black school                                   0.522
  Predominantly Hispanic school                                0.205
  Mobility rate in school                                     28.6
  Attendance rate in school                                   92.6
  School size                                                722 (317)
Accountability policy
  Social promotion policy                                      0.215
  School probation policy                                      0.127
  Test form offered for the first time                         0.371
Number of observations                                       163,474

The data summarized above include one observation per classroom/year/subject. The sample includes all students in third to seventh grade for the years 1993–2000. We drop observations for the following reasons: (a) missing reading or math scores in either the baseline year, the preceding year, or the following year; (b) missing demographic data; (c) classrooms for which we have fewer than ten valid students in a particular grade, or more than 40 students. See the discussion in the text for a more detailed explanation of the sample, including the percentages excluded for various reasons.


section of the exam consists of 30 to 60 multiple-choice questions, which students are given between 30 and 75 minutes to complete.17 Students mark their responses on answer sheets, which are scanned to determine a student's score. There is no penalty for guessing. A student's raw score is simply calculated as the sum of correct responses on the exam. The raw score is then translated into a grade equivalent.

After collecting student exams, teachers or administrators then "clean" the answer keys, erasing stray pencil marks, removing dirt or debris from the form, and darkening item responses that were only faintly marked by the student. At the end of the testing week, the test coordinators at each school deliver the completed answer keys and exams to the CPS central office. School personnel are not supposed to keep copies of the actual exams, although school officials acknowledge that a number of teachers each year do so. The CPS has administered three different versions of the ITBS between 1993 and 2000. The CPS alternates forms each year, with new forms being offered for the first time in 1993, 1994, and 1997.18

V. THE PREVALENCE OF TEACHER CHEATING

As noted earlier, our estimates of the prevalence of cheating are derived from a comparison of the actual number of classrooms that are above some threshold on both of our cheating indicators, relative to the number we would expect to observe based on the correlation between these two measures in the 50th–75th percentiles of the suspicious string measure. The top panel of Table II presents our estimates of the percentage of classrooms that are cheating, on average, on a given subject test (i.e., reading comprehension or one of the three math tests) in a given year.19 Because

17. The mathematics and reading tests measure basic skills. The reading comprehension exam consists of three to eight short passages, followed by up to nine questions relating to the passage. The math exam consists of three sections that assess number concepts, problem-solving, and computation.

18. These three forms are used for retesting, summer school testing, and midyear testing as well, so it is likely that over the years teachers have seen the same exam on a number of occasions.

19. It is important to note that we exclude a particular form of cheating that appears to be quite prevalent in the data: teachers randomly filling in answers left blank by students at the end of the exam. In many classrooms, almost every student will end the test with a long string of identical answers (typically all the same response, like "B" or "D"). The fact that almost all students in the class coordinate on the same pattern strongly suggests that the students themselves did not fill in the blanks, or were under explicit instructions by the teacher to do


the decision as to what cutoff signifies a "high" value on our cheating indicators is arbitrary, we present a 3 × 3 matrix of estimates using three different thresholds (eightieth, ninetieth, and ninety-fifth percentiles) for each of our cheating measures. The estimated prevalence of cheaters ranges from 1.1 percent to 2.1 percent, depending on the particular set of cutoffs used. As would be expected, the number of cheaters is generally declining as higher thresholds are employed. Nonetheless, it is encouraging that over such a wide set of cutoffs the range of estimates is relatively tight.

The bottom panel of Table II presents estimates of the percentage of classrooms that are cheating on any of the four subject tests in a particular year. If every classroom that cheated did so on only one subject test, then the results in the bottom panel

so. Since there is no penalty for guessing on the test, filling in the blanks can only increase student test scores. While this type of teacher behavior is likely to be viewed by many as unethical, we do not make it the focus of our analysis because (1) it is difficult to provide definitive evidence of such behavior (a teacher could argue that he or she instructed students well in advance of the test to fill in all blanks with the letter "C" as part of good test-taking strategy), and (2) in our minds, it is categorically different from a teacher who systematically changes student responses to the correct answer.

TABLE II
ESTIMATED PREVALENCE OF TEACHER CHEATING

                                    Cutoff for test score fluctuations (SCORE)
Cutoff for suspicious answer
strings (ANSWERS)                  80th percentile  90th percentile  95th percentile

Percent cheating on a particular test
  80th percentile                        2.1              2.1              1.8
  90th percentile                        1.8              1.8              1.5
  95th percentile                        1.3              1.3              1.1

Percent cheating on at least one of the four tests given
  80th percentile                        4.5              5.6              5.3
  90th percentile                        4.2              4.9              4.4
  95th percentile                        3.5              3.8              3.4

The top panel of the table presents estimates of the percentage of classrooms cheating on a particular subject test in a given year, based on three alternative cutoffs for ANSWERS and SCORE. In all cases, the prevalence of cheating is based on the excess number of classrooms with unexpected test score fluctuations among classes with suspicious answer strings, relative to classes that do not have suspicious answer strings. The bottom panel of the table presents estimates of the percentage of classrooms cheating on at least one of the four subject tests that comprise the overall test. In the bottom panel, classrooms that cheat on more than one subject test are counted only once. Our sample includes over 35,000 3rd–7th grade classrooms in the Chicago public schools for the years 1993–1999.


would simply be four times the results in the top panel. In many instances, however, classrooms appear to cheat on multiple subjects. Thus, the prevalence rates range from 3.4 to 5.6 percent of all classrooms.20 Because of the necessarily indirect nature of our identification strategy, we explore a range of supplementary tests designed to assess the validity of the estimates.21

V.A. Simulation Results

Our prevalence estimates may be biased downward or upward, depending on the extent to which our algorithm fails to successfully identify all instances of cheating ("false negatives") versus the frequency with which we wrongly classify a teacher as cheating ("false positives").

One simple simulation exercise we undertook with respect to possible false positives involves randomly assigning students to hypothetical classrooms. These synthetic classrooms thus consist of students who in actuality had no connection to one another. We then analyze these hypothetical classrooms using the same algorithm applied to the actual data. As one would hope, no evidence of cheating was found in the simulated classes. Indeed, the estimated prevalence of cheating was slightly negative in this simulation (i.e., classrooms with large test score increases in the current year followed by big declines the next year were slightly less likely to have unusual patterns of answer strings). Thus, there does not appear to be anything mechanical in our algorithm that generates evidence of cheating.
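A sketch of this placebo test (our reconstruction; column names are hypothetical): shuffle classroom assignments within grade and year, then rerun the full detection algorithm on the synthetic classes.

import numpy as np
import pandas as pd

def placebo_classrooms(students: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """Randomly reassign students to synthetic classrooms of the same
    sizes within each grade and year. Running the cheating algorithm
    on the result should yield no excess mass in the right tail."""
    rng = np.random.default_rng(seed)

    def shuffle(group: pd.DataFrame) -> pd.DataFrame:
        group = group.copy()
        # Permuting the classroom labels preserves class sizes while
        # breaking any real connection among classmates.
        group["classroom"] = rng.permutation(group["classroom"].values)
        return group

    return students.groupby(["grade", "year"], group_keys=False).apply(shuffle)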

A second simulation involves artificially altering a

20. Computation of the overall prevalence is somewhat complicated, because it involves calculating not only how many classrooms are actually above the thresholds on multiple subject tests, but also how frequently this would occur in the absence of cheating. Details on these calculations are available from the authors.

21. In an earlier version of this paper [Jacob and Levitt 2003], we present a number of additional sensitivity analyses that confirm the basic results. First, we test whether our results are sensitive to the way in which we predict the prevalence of high values of ANSWERS and SCORE in the upper part of the other distribution (e.g., fitting a linear or quadratic model to the data in the lower portion of the distribution versus simply using the average value from the 50th–75th percentiles). We find our results are extremely robust. Second, we examine whether our results might simply be due to mean reversion, by testing whether classrooms with suspicious answer strings are less likely to maintain large test score gains. Among a sample of classrooms with large test score gains, we find that mean reversion is substantially greater among the set of classes with highly suspicious answer strings. Third, we investigate whether we might be detecting student rather than teacher cheating by examining whether students who have suspicious answer strings in one year are more likely to have suspicious answer strings in other years. We find that they do not.


classroom's answer strings in ways that we believe mimic teacher cheating or outstanding teaching.22 We are then able to see how frequently we label these altered classes as "cheating," and how this varies with the extent and nature of the changes we make to a classroom's answers. We simulate two different types of teacher cheating. The first is a very naive version, in which a teacher starts cheating at the same question for a number of students and changes consecutive questions to the right answers for these students, creating a block of identical and correct responses. The second type of cheating is much more sophisticated: we randomly change answers from incorrect to correct for selected students in a class. The outcome of this simulation reveals that our methods are surprisingly weak in detecting moderate cheating, even in the naive case. For instance, when three consecutive answers are changed to correct for 25 percent of a class, we catch such behavior only 4 percent of the time. Even when six questions are changed for half of the class, we catch the cheating in less than 60 percent of the cases. Only when the cheating is very extreme (e.g., six answers changed for the entire class) are we very likely to catch the cheating. The explanation for this poor performance is that classrooms with low test-score gains, even with the boost provided by this cheating, do not have large enough test score fluctuations to make them appear unusual, because there is so much inherent volatility in test scores. When the cheating takes the more sophisticated form described above, we are even less successful at catching low and moderate cheaters (2 percent and 34 percent, respectively, in the first two scenarios described above), but we almost always detect very extreme cheating.
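The naive-cheating manipulation can be sketched as below. This is our reconstruction of the simulation design; the paper's exact protocol is described in Jacob and Levitt [2003].

import numpy as np

def simulate_naive_cheating(answers: np.ndarray, key: np.ndarray,
                            frac_students: float = 0.25,
                            block_len: int = 3, seed: int = 0) -> np.ndarray:
    """Overwrite a consecutive block of questions with the correct
    answers for a random subset of students, mimicking the naive
    cheater who creates a block of identical, correct responses.

    answers: (n_students, n_questions) response codes.
    key:     (n_questions,) correct-answer codes.
    """
    rng = np.random.default_rng(seed)
    n_students, n_questions = answers.shape
    doctored = answers.copy()
    q0 = int(rng.integers(0, n_questions - block_len + 1))
    n_cheat = max(1, int(frac_students * n_students))
    who = rng.choice(n_students, size=n_cheat, replace=False)
    cols = np.arange(q0, q0 + block_len)
    doctored[np.ix_(who, cols)] = key[cols]  # identical correct block
    return doctored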

From a political or equity perspective, however, we may be even more concerned with false positives, specifically the risk of accusing highly effective teachers of cheating. Hence, we also simulate the impact of a good teacher changing certain answers for certain students from incorrect to correct responses. Our manipulation in this case differs from the cheating simulations above in two ways: (1) we allow some of the gains to be preserved in the following year, and (2) we alter questions such that students will not get random answers correct, but rather will tend to show the greatest improvement on the easiest questions that the students were getting wrong. These simulation results suggest

22. See Jacob and Levitt [2003] for the precise details of this simulation exercise.


that only rarely are good teachers mistakenly labeled as cheaters. For instance, a teacher whose prowess allows him or her to raise test scores by six questions for half the students in the class (more than a 0.5 grade equivalent increase, on average, for the class) is labeled a cheater only 2 percent of the time. In contrast, a naive cheater with the same test score gains is correctly identified as a cheater in more than 50 percent of cases.

In another test for false positives, we explored whether we might be mistaking emphasis on certain subject material for cheating. For example, if a math teacher spends several months on fractions with a particular class, one would expect the class to do particularly well on all of the math questions relating to fractions, and perhaps worse than average on other math questions. One might imagine a similar scenario in reading if, for example, a teacher creates a lengthy unit on the Underground Railroad, which later happens to appear in a passage on the reading comprehension exam. To examine this, we analyze the nature of the across-question correlations in cheating versus noncheating classrooms. We find that the most highly correlated questions in cheating classrooms are no more likely to measure the same skill (e.g., fractions, geometry) than in noncheating classrooms. The implication of these results is that the prevalence estimates presented above are likely to substantially understate the true extent of cheating (there is little evidence of false positives), but we frequently miss moderate cases of cheating.

V.B. The Correlation of Cheating Indicators across Subjects, Classrooms, and Years

If what we are detecting is truly cheating, then one would expect that a teacher who cheats on one part of the test would be more likely to cheat on other parts of the test. Also, a teacher who cheats one year would be more likely to cheat the following year. Finally, to the extent that cheating is either condoned by the principal or carried out by the test coordinator, one would expect to find multiple classes in a school cheating in any given year, and perhaps even that cheating in a school one year predicts cheating in future years. If what we are detecting is not cheating, then one would not necessarily expect to find strong correlation in our cheating indicators across exams for a specific classroom, across classrooms, or across years.


Table III reports regression results testing these predictions. The dependent variable is an indicator for whether we believe a classroom is likely to be cheating on a particular subject test, using our most stringent definition (above the ninety-fifth percentile on both cheating indicators). The baseline probability of

TABLE III
PATTERNS OF CHEATING WITHIN CLASSROOMS AND SCHOOLS

Dependent variable = Class suspected of cheating
(mean of the dependent variable = 0.011)

                                                            Sample of classes and
                                            Full sample     schools that existed
                                                            in the prior year
Independent variables                      (1)      (2)      (3)      (4)

Classroom cheated on exactly one         0.105    0.103    0.101    0.101
  other subject this year               (0.008)  (0.008)  (0.009)  (0.009)
Classroom cheated on exactly two         0.289    0.285    0.243    0.243
  other subjects this year              (0.027)  (0.027)  (0.031)  (0.031)
Classroom cheated on all three other     0.627    0.622    0.595    0.595
  subjects this year                    (0.051)  (0.051)  (0.054)  (0.054)
Cheating rate among all other classes      —      0.166    0.134    0.129
  in the school this year on this               (0.030)  (0.027)  (0.027)
  subject
Cheating rate among all other classes      —      0.023    0.059    0.045
  in the school this year on other              (0.024)  (0.026)  (0.029)
  subjects
Cheating in this classroom in this         —        —      0.096    0.091
  subject last year                                      (0.012)  (0.012)
Number of other subjects this              —        —      0.023    0.018
  classroom cheated on last year                         (0.004)  (0.004)
Cheating in this classroom ever in         —        —        —      0.006
  the past                                                        (0.002)
Cheating rate among other classrooms       —        —        —      0.090
  in this school in past years                                    (0.040)
Fixed effects for grade/subject/year      Yes      Yes      Yes      Yes
R²                                       0.090    0.093    0.109    0.109
Number of observations                  165,578  165,578   94,182   94,170

The dependent variable is an indicator for whether a classroom is above the 95th percentile on both our suspicious strings and unusual test score measures of cheating on a particular subject test. Estimation is done using a linear probability model. The unit of observation is classroom/grade/year/subject. For columns that include measures of cheating in prior years, observations where that classroom or school does not appear in the data in the prior year are excluded. Standard errors are clustered at the school level to take into account correlations across classrooms as well as serial correlation.


qualifying as a cheater for this cutoff is 1.1 percent. To fully appreciate the magnitude of the effects implied by the table, it is important to keep this very low baseline in mind. We report estimates from linear probability models (probits yield similar marginal effects), with standard errors clustered at the school level.
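A sketch of this specification (ours; the variable and column names are hypothetical) using statsmodels:

import pandas as pd
import statsmodels.formula.api as smf

def cheating_regression(df: pd.DataFrame):
    """Linear probability model of the cheating indicator on
    same-year cheating in other subjects, with grade/subject/year
    fixed effects and school-clustered standard errors (a sketch of
    the column (1) specification in Table III)."""
    model = smf.ols(
        "cheat ~ cheat_one_other + cheat_two_other + cheat_three_other"
        " + C(grade) + C(subject) + C(year)",
        data=df,
    )
    return model.fit(cov_type="cluster",
                     cov_kwds={"groups": df["school_id"]})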

Column 1 of Table III shows that cheating on other tests in the same year is an extremely powerful predictor of cheating in a different subject. If a classroom cheats on exactly one other subject test, the predicted probability of cheating on this test increases by over ten percentage points. Since the baseline cheating rate is only 1.1 percent, classrooms cheating on exactly one other test are ten times more likely to have cheated on this subject than are classrooms that did not cheat on any of the other subjects (which is the omitted category). Classrooms that cheat on two other subjects are almost 30 times more likely to cheat on this test, relative to those not cheating on other tests. If a class cheats on all three other subjects, it is 50 times more likely to also cheat on this test.

There is also evidence of correlation in cheating within schools. A ten-percentage-point increase in cheating classrooms in a school (excluding the classroom in question) on the same subject test raises the likelihood this class cheats by roughly 0.16 percentage points. This potentially suggests some role for centralized cheating by a school counselor, test coordinator, or the principal, rather than by teachers operating independently. There is little evidence that cheating rates within the school on other subject tests affect cheating on this test.

When making comparisons across years (columns 3 and 4), it is important to note that we do not actually have teacher identifiers. We do, however, know what classroom a student is assigned to. Thus, we can only compare the correlation between past and current cheating in a given classroom. To the extent that teacher turnover occurs or teachers switch classrooms, this proxy will be contaminated. Even given this important limitation, cheating in the classroom last year predicts cheating this year. In column 3, for example, we see that classrooms that cheated in the same subject last year are 9.6 percentage points more likely to cheat this year, even after we control for cheating on other subjects in the same year and cheating in other classes in the school. Column 4 shows that prior cheating in the school strongly predicts the likelihood that a classroom will cheat this year.


V.C. Evidence from Retesting under Controlled Circumstances

Perhaps the most compelling evidence for our methods comes from the results of a unique policy intervention. In spring 2002 the Chicago public schools provided us the opportunity to retest over 100 classrooms under controlled circumstances that precluded cheating, a few weeks after the initial ITBS exam was administered.23 Based on suspicious answer strings and unusual test score gains on the initial spring 2002 exam, we identified a set of classrooms most likely to have cheated. In addition, we chose a second group of classrooms that had suspicious answer strings but did not have large test score gains, a pattern that is consistent with a bad teacher who attempts to mask a poor teaching performance by cheating. We also retested two sets of classrooms as control groups. The first control group consisted of classes with large test score gains but no evidence of cheating in their answer strings, a potential indication of effective teaching. These classes would be expected to maintain their gains when retested, subject to possible mean reversion. The second control group consisted of randomly selected classrooms.

The results of the retests are reported in Table IV. Columns 1–3 correspond to reading; columns 4–6 are for math. For each subject we report the test score gain (in standard score units) for three periods: (i) from spring 2001 to the initial spring 2002 test, (ii) from the initial spring 2002 test to the retest a few weeks later, and (iii) from spring 2001 to the spring 2002 retest. For purposes of comparison, we also report the average gain for all classrooms in the system for the first of those measures (the other two measures are unavailable unless a classroom was retested).

Classrooms prospectively identified as "most likely to have cheated" experienced gains on the initial spring 2002 test that were nearly twice as large as the typical CPS classroom (see columns 1 and 4). On the retest, however, those excess gains completely disappeared: the gains between spring 2001 and the spring 2002 retest were close to the systemwide average. Those classrooms prospectively identified as "bad teachers likely to have cheated" also experienced large score declines on the retest (−8.8 and −10.5 standard score points, or roughly two-thirds of a grade equivalent). Indeed, comparing the spring 2001 results with the

23. Full details of this policy experiment are reported in Jacob and Levitt [2003]. The results of the retests were used by CPS to initiate disciplinary action against a number of teachers, principals, and staff.


TABLE IV
RESULTS OF RETESTS: COMPARISON OF RESULTS FOR SPRING 2002 ITBS AND AUDIT TEST

                                     Reading gains between        Math gains between
                                   Spring   Spring   Spring    Spring   Spring   Spring
                                   2001     2001     2002      2001     2001     2002
                                   and      and      and       and      and      and
                                   spring   2002     2002      spring   2002     2002
Category of classroom              2002     retest   retest    2002     retest   retest

All classrooms in the Chicago
  public schools                   14.3      —        —        16.9      —        —
Classrooms prospectively
  identified as most likely
  cheaters (N = 36 on math,
  N = 39 on reading)               28.8     12.6    −16.2      30.0     19.3    −10.7
Classrooms prospectively
  identified as bad teachers
  suspected of cheating
  (N = 16 on math, N = 20
  on reading)                      16.6      7.8     −8.8      17.3      6.8    −10.5
Classrooms prospectively
  identified as good teachers
  who did not cheat (N = 17
  on math, N = 17 on reading)      20.6     21.1      0.5      28.8     25.5     −3.3
Randomly selected classrooms
  (N = 24 overall, but only
  one test per classroom)          14.5     12.2     −2.3      14.5     12.2     −2.3

Results in this table compare test scores at three points in time (spring 2001, spring 2002, and a retest administered under controlled conditions a few weeks after the initial spring 2002 test). The unit of observation is a classroom-subject test pair. Only students taking all three tests are included in the calculations. Because of limited data, math and reading results for the randomly selected classrooms are combined. Only the first two columns are available for all CPS classrooms, since audits were performed only on a subset of classrooms. All entries in the table are in standard score units.


spring 2002 retest, children in these classrooms had gains only half as large as the typical yearly CPS gain. In stark contrast, classrooms prospectively identified as having "good teachers" (i.e., classrooms with large gains but not suspicious answer strings) actually scored even higher on the reading retest than they did on the initial test. The math scores fell slightly on retesting, but these classrooms continued to experience extremely large gains. The randomly selected classrooms also maintained almost all of their gains when retested, as would be expected. In summary, our methods proved to be extremely effective both in identifying likely cheaters and in correctly classifying teachers who achieved large test score gains in a legitimate manner.

VI. DOES TEACHER CHEATING RESPOND TO INCENTIVES?

From the perspective of economics, perhaps the most interesting question related to teacher cheating is the degree to which it responds to incentives. As noted in the Introduction, there were two major changes in the incentives faced by teachers and students over our sample period. Prior to 1996, ITBS scores were primarily used to provide teachers and parents with a sense of how a child was progressing academically. Beginning in 1996, with the appointment of Paul Vallas as CEO of Schools, the CPS launched an initiative designed to hold students and teachers accountable for student learning.

The reform had two main elements. The first involved putting schools "on probation" if less than 15 percent of students scored at or above national norms on the ITBS reading exams. The CPS did not use math performance to determine probation status. Probation schools that do not exhibit sufficient improvement may be reconstituted, a procedure that involves closing the school and dismissing or reassigning all of the school staff.24 It is clear from our discussions with teachers and administrators that being on probation is viewed as an extremely undesirable circumstance. The second piece of the accountability reform was an end to social promotion, the practice of passing students to the next grade regardless of their academic skills or school performance. Under the new policy, students in third, sixth, and eighth grade must meet minimum standards on the ITBS in both reading and

24. For a more detailed analysis of the probation policy, see Jacob [2002] and Jacob and Lefgren [forthcoming].


mathematics in order to be promoted to the next grade. The promotion standards were implemented in spring 1997 for third and sixth graders. Promotion decisions are based solely on scores in reading comprehension and mathematics.

Table V presents OLS estimates of the relationship between teacher cheating and a variety of classroom and school characteristics.25 The unit of observation is a classroom/subject/grade/year. The dependent variable is an indicator of whether we suspect the classroom cheated, using our most stringent definition: a classroom is designated a cheater if both its test score fluctuations and the suspiciousness of its answer strings are above the ninety-fifth percentile for that grade, subject, and year.26

In column (1), the policy changes are restricted to have a constant impact across all classrooms. The introduction of the social promotion and probation policies is positively correlated with the likelihood of classroom cheating, although the point estimates are not statistically significant at conventional levels. Much more interesting results emerge when we interact the policy changes with the previous year's test scores for the classroom. For both probation and social promotion, cheating rates in the lowest performing classrooms prove to be quite responsive to the change in incentives. In column (2), we see that a classroom one standard deviation below the mean increased its likelihood of cheating by 0.43 percentage points in response to the school probation policy, and roughly 0.65 percentage points due to the ending of social promotion. Given a baseline rate of cheating of 1.1 percent, these effects are substantial. The magnitude of these changes is particularly large considering that no elementary school on probation was actually reconstituted during this period, and that the social promotion policy has a direct impact on students but no direct ramifications for teacher pay or rewards. A classroom one standard deviation above the mean does not see any significant change in cheating in response to these two policies. Such classes are very unlikely to be located in schools at risk of being put on probation, and also are likely to have few students at risk of being retained. These findings are robust to

25. Logit and probit models evaluated at the mean yield comparable results, so the estimates from a linear probability model are presented for ease of interpretation.

26. The results are not sensitive to the cheating cutoff used. Note that this measure may include error due to both false positives and negatives. Since the measurement error is in the dependent variable, the primary impact will likely be to simply decrease the precision of our estimates.


TABLE V
OLS ESTIMATES OF THE RELATIONSHIP BETWEEN CHEATING AND CLASSROOM CHARACTERISTICS

Dependent variable = Indicator of classroom cheating

Independent variables                      (1)        (2)        (3)        (4)

Social promotion policy                  0.0011     0.0011     0.0015     0.0023
                                        (0.0013)   (0.0013)   (0.0013)   (0.0009)
School probation policy                  0.0020     0.0019     0.0021     0.0029
                                        (0.0014)   (0.0014)   (0.0014)   (0.0013)
Prior classroom achievement             −0.0047    −0.0028    −0.0016    −0.0028
                                        (0.0005)   (0.0005)   (0.0007)   (0.0007)
Social promotion × classroom               —       −0.0049    −0.0051    −0.0046
  achievement                                      (0.0014)   (0.0014)   (0.0012)
School probation × classroom               —       −0.0070    −0.0070    −0.0064
  achievement                                      (0.0013)   (0.0013)   (0.0013)
Mixed grade classroom                   −0.0084    −0.0085    −0.0089    −0.0089
                                        (0.0007)   (0.0007)   (0.0008)   (0.0012)
% of students included in official       0.0252     0.0249     0.0141     0.0131
  reporting                             (0.0031)   (0.0031)   (0.0037)   (0.0037)
Teacher administers exam to own          0.0067     0.0067     0.0066     0.0061
  students                              (0.0015)   (0.0015)   (0.0015)   (0.0011)
Test form offered for the first         −0.0007    −0.0007    −0.0011       —
  time (a)                              (0.0011)   (0.0011)   (0.0010)
Average quality of teachers'               —          —       −0.0026
  undergraduate institution                                   (0.0007)
Percent of teachers who have worked        —          —       −0.0045
  at the school less than 3 years                             (0.0031)
Percent of teachers under 30 years         —          —        0.0156
  of age                                                      (0.0065)
Percent of students in the school          —          —        0.0001
  meeting national norms in reading                           (0.0000)
  last year
Percent free lunch in school               —          —        0.0001
                                                              (0.0000)
Predominantly Black school                 —          —        0.0068
                                                              (0.0019)
Predominantly Hispanic school              —          —       −0.0009
                                                              (0.0016)
School × Year fixed effects               No         No         No        Yes
Number of observations                  163,474    163,474    163,474    163,474

The unit of observation is classroom/grade/year/subject, and the sample includes eight years (1993 to 2000), four subjects (reading comprehension and three math sections), and five grades (three to seven). The dependent variable is the cheating indicator derived using the ninety-fifth percentile cutoff. Robust standard errors, clustered by school/year, are shown in parentheses. Other variables included in the regressions in columns (1) and (2) include a linear time trend, cubic terms for the number of students, a linear grade variable, and fixed effects for subjects. The regression shown in column (3) also includes the following variables: indicators of the percent of students in the classroom who were Black, Hispanic, male, receiving free lunch, old for grade, in a special education program, living in foster care, and living with a nonparental relative; indicators of school size, mobility rate, and attendance rate; and the percent of teachers in the school who had a master's or doctoral degree, lived in Chicago, and were education majors.

(a) Test forms vary only by year, so this variable will drop out of the analysis when school/year fixed effects are included.


adding a number of classroom and school characteristics (column (3)), as well as school × year fixed effects (column (4)).

Cheating also appears to be responsive to other costs and benefits. Classrooms that tested poorly last year are much more likely to cheat. For example, a classroom with average student prior achievement one classroom standard deviation below the mean is 23 percent more likely to cheat. Classrooms with students in multiple grades are 65 percent less likely to cheat than classrooms where all students are in the same grade. This is consistent with the fact that it is likely more difficult for teachers in such classrooms to cheat, since they must administer two different test forms to students, which will necessarily have different correct answers. Moreover, classes with a higher proportion of students who are included in the official test reporting are more likely to cheat: a 10 percentage point increase in the proportion of students in a class whose test scores "count" will increase the likelihood of cheating by roughly 20 percent.27

Teachers who administer the exam to their own students are 067percentage pointsmdashapproximately 50 percentmdashmore likely tocheat28

A few other notable results emerge First there is no statis-tically signicant impact on cheating of reusing a test form thathas been administered in a previous year That nding is ofinterest because it suggests that teachers taking old exams andteaching the precise questions to students is not an importantcomponent of what we are detecting as cheating (although anec-dotal evidence suggests this practice exists) Classrooms inschools with lower achievement higher poverty rates and moreBlack students are more likely to cheat Classrooms in schoolswith teachers who graduated from more prestigious undergrad-uate institutions are less likely to cheat whereas classrooms inschools with younger teachers are more likely to cheat

27 Consistent with this nding within cheating classrooms teachers areless likely to alter the answer sheets for students who are excluded from testreporting Teachers also appear to less frequently cheat for boys and for olderstudents

28 In an earlier version of the paper [Jacob and Levitt 2002] we examine forwhich students teachers tend to cheat We nd that teachers are most likely tocheat for students in the second quartile of the national ability distribution(twenty-fth to ftieth percentile) in the previous year consistent with the incen-tives under the probation policy

870 QUARTERLY JOURNAL OF ECONOMICS

VII CONCLUSION

This paper develops an algorithm for determining the preva-lence of teacher cheating on standardized tests and applies theresults to data from the Chicago public schools Our methodsreveal thousands of instances of classroom cheating representing4ndash5 percent of the classrooms each year Moreover we nd thatteacher cheating appears quite responsive to relatively minorchanges in incentives

The obvious benets of high-stakes tests as a means of pro-viding incentives must therefore be weighed against the possibledistortions that these measures induce Explicit cheating ismerely one form of distortion Indeed the kind of cheating wefocus on is one that could be greatly reduced at a relatively lowcost through the implementation of proper safeguards such as isdone by Educational Testing Service on the SAT or GRE exams29

Even if this particular form of cheating were eliminatedhowever there are a virtually innite number of dimensionsalong which strategic school personnel can act to game the cur-rent system For instance Figlio and Winicki [2001] and Rohlfs[2002] document that schools attempt to increase the caloricintake of students on test days and disruptive students areespecially likely to be suspended from school during ofcial test-ing periods [Figlio 2002] It may be prohibitively costly or evenimpossible to completely safeguard the system

Finally this paper ts into a small but growing body ofresearch focused on identifying corrupt or illicit behavior on thepart of economic actors (see Porter and Zona [1993] Fisman[2001] Di Tella and Schargrodsky [2001] and Duggan and Levitt[2002]) Because individuals engaged in such behavior activelyattempt to hide their trails the intellectual exercise associatedwith uncovering their misdeeds differs substantially from thetypical economic application in which the researcher starts witha well-dened measure of the outcome variable (eg earningseconomic growth prots) and then attempts to uncover the de-terminants of these outcomes In the case of corruption there istypically no clear outcome variable making it necessary for the

29 Although even tests administered by the Educational Testing Service arenot immune to cheating as evidenced by recent reports that many of the GREscores from China are fraudulent [Contreras 2002]

871ROTTEN APPLES

researcher to employ nonstandard approaches in generating sucha measure

APPENDIX THE CONSTRUCTION OF SUSPICIOUS STRING MEASURES

This appendix describes in greater detail how we constructeach of our measures of unexpected or suspicious responses

The rst measure focuses on the most unlikely block of iden-tical answers given on consecutive questions Using past testscores future test scores and background characteristics wepredict the likelihood that each student will give each answer oneach question For each item a student has four choices (A B Cor D) only one of which is correct We estimate a multinomiallogit for each item on the exam in order to predict how studentswill respond to each question We estimate the following modelfor each item using information from other students in that yeargrade and subject

(1) Pr~Yisc 5 j 5eb jxs

Sj51J eb jxs

where Y isc indicates the response of student s in class c on itemi the number of possible responses ( J) is four and Xs is a vectorthat includes measures of prior and future student achievementin math and reading as well as demographic variables (such asrace gender and free lunch status) for student s Thus a stu-dentrsquos predicted probability of choosing a particular response isidentied by the likelihood of other students (in the same yeargrade and subject) with similar background characteristicschoosing that response

Notice that by including future as well as prior test scores inthe model we decrease the likelihood that students with unusu-ally good teachers will be identied as cheaters since these stu-dents will likely retain some of the knowledge learned in the baseyear and thus have higher future test scores Also note that byestimating the probability of selecting each possible responserather than simply estimating the probability of choosing thecorrect response we take advantage of any additional informa-tion that is provided by particular response patterns in aclassroom

Using the estimates from this model we calculate the pre-dicted probability that each student would answer each item in

872 QUARTERLY JOURNAL OF ECONOMICS

the way that he or she in fact did

(2) pisc 5ebkxs

S j51J eb j xs

for k

5 response actually chosen by student s on item i

This provides us with one measure per student per item Takingthe product over items within student we calculate the probabil-ity that a student would have answered a string of consecutivequestions from item m to item n as he or she did

(3) pscmn 5 P

i5m

n

pisc

We then take the product across all students in the classroomwho had identical responses in the string If we dene z as astudent Szc

m n as the string of responses for student z from item mto item n and S sc

m n and as the string for student s then we canexpress the product as

(4) pscmn 5 P

s[$ zS icmn5 S sc

mn

pscmn

Note that if there are ns students in class c and each student hasa unique set of responses to these particular items then psc

m n

collapses to pscm n for each student and there will be ns distinct

values within the class On the other extreme if all of the stu-dents in class c have identical responses then there is only onedistinct value of psc

m n We repeat this calculation for all possibleconsecutive strings of length three to seven that is for all Sm n

such that 3 m 2 n 7 To create our rst indicator ofsuspicious string patterns we take the minimum of the predictedblock probability for each classroom

MEASURE 1 M1c 5 mins( ps cm n)

This measure captures the least likely block of identical answersgiven on consecutive questions in the classroom

The second measure of suspicious answer strings is intendedto capture more general patterns of similarity in student re-sponses To construct this measure we rst calculate the resid-

873ROTTEN APPLES

uals for each of the possible choices a student could have made foreach item

(5) e jisc 5 0 2eb jxs

S j51J eb j xs

if j THORN k

5 1 2eb j xs

S j51J eb jxs

if j 5 k

where e jis c is the residual for response j on item i by student s inclassroom c We thus have four separate residuals per studentper item

To create a classroom level measure of the response to item iwe need to combine the information for each student First wesum the residuals for each response across students within aclassroom

(6) ejic 5 Os

e jisc

If there is no within-class correlation in the way that studentsresponded to a particular item this term should be approximatelyzero Second we sum across the four possible responses for eachitem within classrooms At the same time we square each of thecomponent residual measures to accentuate outliers and divideby number of students in the class (nsc) to normalize by class size

(7) v ic 5S j ejic

2

nsc

The statistic vic captures something like the variance of studentresponses on item i within classroom c Notice that we choose torst sum across the residuals of each response across studentsand then sum the classroom level measures for each responserather than summing across responses within student initiallyWe do this in order to emphasize the classroom level tendencies inresponse patterns

Our second measure of suspicious strings is simply the class-room average (across items) of this variance term across all testitems

MEASURE 2 M2c 5 v c 5 yeni vic ni where ni is the number of itemson the exam

Our third measure is simply the variance (as opposed to the

874 QUARTERLY JOURNAL OF ECONOMICS

mean) in the degree of correlation across questions within aclassroom

MEASURE 3 M3c 5 sv c

2 5 yeni (vic 2 v c)2 ni

Our nal indicator focuses on the extent to which a studentrsquosresponse pattern was different from other studentrsquos with thesame aggregate score that year Let qisc equal one if student s inclassroom c answered item i correctly and zero otherwise Let As

equal the aggregate score of student s on the exam We thendetermine what fraction of students at each aggregate score levelanswered each item correctly If we let nsA equal number ofstudents with an aggregate score of A then this fraction q i

A canbe expressed as

(8) q iA 5

Ss[$ z Az5As qisc

nsA

We then calculate a measure of how much the response pattern ofstudent s differed from the response pattern of other studentswith the same aggregate score We do so by subtracting a stu-dentrsquos answer on item i from the mean response of all studentswith aggregate score A squaring these deviations and then sum-ming across all items on the exam

(9) Zsc 5 Oi

~qisc 2 q iA2

We then subtract out the mean deviation for all students with thesame aggregate score Z A and sum the students within eachclassroom to obtain our nal indicator

MEASURE 4 M4c 5 yens (Zs c 2 Z A)

KENNEDY SCHOOL OF GOVERNMENT HARVARD UNIVERSITY

UNIVERSITY OF CHICAGO AND AMERICAN BAR FOUNDATION

REFERENCES

Aiken Linda R ldquoDetecting Understanding and Controlling for Cheating onTestsrdquo Research in Higher Education XXXII (1991) 725ndash736

Angoff William H ldquoThe Development of Statistical Indices for Detecting Cheat-ersrdquo Journal of the American Statistical Association LXIX (1974) 44ndash49

Baker George ldquoIncentive Contracts and Performance Measurementrdquo Journal ofPolitical Economy C (1992) 598ndash614

Bellezza Francis S and Suzanne F Bellezza ldquoDetection of Cheating on Multiple-Choice Tests by Using Error Similarity Analysisrdquo Teaching of PsychologyXVI (1989) 151ndash155

Carnoy Martin and Susanna Loeb ldquoDoes External Accountability Affect Student

875ROTTEN APPLES

Outcomes A Cross-State Analysisrdquo Working Paper Stanford UniversitySchool of Education 2002

Contreras Russell ldquoInquiry Uncovers Possible Cheating on GRE in Asiardquo Asso-ciated Press August 7 2002

Cullen Julie Berry and Randall Reback ldquoTinkering Toward Accolades SchoolGaming under a Performance Accountability Systemrdquo Working paper Uni-versity of Michigan 2002

Deere Donald and Wayne Strayer ldquoPutting Schools to the Test School Account-ability Incentives and Behaviorrdquo Working Paper Texas AampM University2002

DiTella Rafael and Ernesto Schargrodsky ldquoThe Role of Wages and Auditingduring a Crackdown on Corruption in the City of Buenos Airesrdquo unpublishedmanuscript Harvard Business School 2001

Duggan Mark and Steven Levitt ldquoWinning Isnrsquot Everything Corruption in SumoWrestlingrdquo American Economic Review XCII (2002) 1594ndash1605

Figlio David N ldquoTesting Crime and Punishmentrdquo Working Paper University ofFlorida 2002

Figlio David N and Joshua Winicki ldquoFood for Thought The Effects of SchoolAccountability Plans on School Nutritionrdquo Working Paper University ofFlorida 2001

Figlio David N and Lawrence S Getzler ldquoAccountability Ability and DisabilityGaming the Systemrdquo Working Paper University of Florida 2002

Fisman Ray ldquoEstimating the Value of Political Connectionsrdquo American EconomicReview XCI (2001) 1095ndash1102

Frary Robert B ldquoStatistical Detection of Multiple-Choice Answer Copying Re-view and Commentaryrdquo Applied Measurement in Education VI (1993)153ndash165

Frary Robert B T Nicholas Tideman and Thomas Watts ldquoIndices of Cheatingon Multiple-Choice Testsrdquo Journal of Educational Statistics II (1977)235ndash256

Glaeser Edward and Andrei Shleifer ldquoA Reason for Quantity Regulationrdquo Amer-ican Economic Review XCI (2001) 431ndash435

Grissmer David W Ann Flanagan et al ldquoImproving Student AchievementWhat NAEP Test Scores Tell Usrdquo RAND Corporation MR-924-EDU 2000

Hanushek Eric A and Margaret E Raymond ldquoImproving Educational QualityHow Best to Evaluate Our Schoolsrdquo Conference paper Federal Reserve Bankof Boston 2002

Hofkins Diane ldquoCheating lsquoRifersquo in National Testsrdquo New York Times EducationalSupplement June 16 1995

Holland Paul W ldquoAssessing Unusual Agreement between the Incorrect Answersof Two Examinees Using the K-Index Statistical Theory and EmpiricalSupportrdquo Technical Report No 96-5 Educational Testing Service 1996

Holmstrom Bengt and Paul Milgrom ldquoMultitask Principal-Agent Analyses In-centive Contracts Asset Ownership and Job Designrdquo Journal of Law Eco-nomics and Organization VII (1991) 24ndash51

Jacob Brian A ldquoAccountability Incentives and Behavior The Impact of High-Stakes Testing in the Chicago Public Schoolsrdquo Working Paper No 8968National Bureau of Economic Research 2002

Jacob Brian A and Lars Lefgren ldquoThe Impact of Teacher Training on StudentAchievement Quasi-Experimental Evidence from School Reform Efforts inChicagordquo Journal of Human Resources forthcoming

Jacob Brian A and Steven Levitt ldquoCatching Cheating Teachers The Results ofan Unusual Experiment in Implementing Theoryrdquo Working Paper No 9413National Bureau of Economic Research 2003

Klein Stephen P Laura S Hamilton et al ldquoWhat Do Test Scores in Texas TellUsrdquo RAND Corporation 2000

Kolker Claudia ldquoTexas Offers Hard Lessons on School Accountabilityrdquo LosAngeles Times April 14 1999

Lindsay Drew ldquoWhodunit Ofcials Find Thousands of Erasures on Standard-ized Tests and Suspect Tamperingrdquo Education Week (1996) 25ndash29

Loughran Regina and Thomas Comiskey ldquoCheating the Children Educator

876 QUARTERLY JOURNAL OF ECONOMICS

Misconduct on Standardized Testsrdquo Report of the City of New York SpecialCommissioner of Investigation for the New York City School District 1999

Marcus John ldquoFaking the Graderdquo Boston Magazine February 2000May Meredith ldquoState Fears Cheating by Teachersrdquo San Francisco Chronicle

October 4 2000Perlman Carol L ldquoResults of a Citywide Testing Program Audit in Chicagordquo

Paper for American Educational Research Association annual meeting Chi-cago IL 1985

Porter Robert H and J Douglas Zona ldquoDetection of Bid Rigging in ProcurementAuctionsrdquo Journal of Political Economy CI (1993) 518ndash538

Richards Craig E and Tian Ming Sheu ldquoThe South Carolina School IncentiveReward Program A Policy Analysisrdquo Economics of Education Review XI(1992) 71ndash86

Rohlfs Chris ldquoEstimating Classroom Learning Using the School Breakfast Pro-gram as an Instrument for Attendance and Class Size Evidence from ChicagoPublic Schoolsrdquo Unpublished manuscript University of Chicago 2002

Tysome Tony ldquoCheating Purge Inspectors outrdquo New York Times Higher Educa-tion Supplement August 19 1994

Wollack James A ldquoA Nominal Response Model Approach for Detecting AnswerCopyingrdquo Applied Psychological Measurement XX (1997) 144ndash152

877ROTTEN APPLES

Page 4: rotten apples: an investigation of the prevalence and predictors of ...

questions correctly while missing many simple questions) Ouridentication strategy exploits the fact that these two types ofindicators are very weakly correlated in classrooms unlikely tohave cheated but very highly correlated in situations wherecheating likely occurred That allows us to credibly estimatethe prevalence of cheating without having to invoke arbitrarycutoffs as to what constitutes cheating

Empirically we detect cheating in approximately 4 to 5 per-cent of the classes in our sample This estimate is likely tounderstate the true incidence of cheating for two reasons Firstwe focus only on the most egregious type of cheating whereteachers systematically alter student test forms There are othermore subtle ways in which teachers can cheat such as providingextra time to students that our algorithm is unlikely to detectSecond even when test forms are altered our approach is onlypartially successful in detecting illicit behavior As discussedlater when we ourselves simulate cheating by altering studentanswer strings and then testing for cheating in the articiallymanipulated classrooms many instances of moderate cheating goundetected by our methods

A number of patterns in the results reinforce our con-dence that what we measure is indeed cheating First simu-lation results demonstrate that there is nothing mechanicalabout our identication approach that automatically generatespatterns like those observed in the data When we randomlyassign students to classrooms and search for cheating in thesesimulated classes our methods nd little evidence of cheatingSecond cheating on one part of the test (eg math) is a strongpredictor of cheating on other sections of the test (eg read-ing) Third cheating is also correlated within classrooms overtime and across classrooms in a particular school Finally andperhaps most convincingly with the cooperation of the Chicagopublic schools we were able to conduct a prospective test of ourmethods in which we retested a subset of classrooms undercontrolled conditions that precluded teacher cheating Class-rooms identied as likely cheaters experienced large declinesin test scores on the retest whereas classrooms not suspectedof cheating maintained their test score gains

The prevalence of cheating is also shown to respond torelatively minor changes in teacher incentives The importanceof standardized tests in the Chicago public schools increased

846 QUARTERLY JOURNAL OF ECONOMICS

substantially with a change in leadership in 1996 particularlyfor low-achieving students Following the introduction of thesepolicies the prevalence of cheating rose sharply in low-achiev-ing classrooms whereas classes with average or higher-achiev-ing students showed no increase in cheating Cheating preva-lence also appears to be systematically lower in cases wherethe costs of cheating are higher (eg in mixed-grade class-rooms in which two different exams are administered simulta-neously) or the benets of cheating are lower (eg in class-rooms with more special education or bilingual students whotake the standardized tests but whose scores are excludedfrom ofcial calculations)

The remainder of the paper is structured as follows SectionII discusses the set of indicators we use to capture cheatingbehavior Section III describes our identication strategy SectionIV provides a brief overview of the institutional details of theChicago public schools and the data set that we use Section Vreports the basic empirical results on the prevalence of cheatingand presents a wide range of evidence supporting the interpreta-tion of these results as cheating Section VI analyzes how teachercheating responds to incentives Section VII discusses the resultsand the implications for increasing reliance on high-stakestesting

II INDICATORS OF TEACHER CHEATING

Teacher cheating especially in extreme cases is likely toleave tell-tale signs In motivating our discussion of the indicatorswe employ for detecting cheating it is instructive to compare twoactual classrooms taking the same test (see Figure I) Each row inFigure I represents one studentrsquos answers to each item on thetest Columns correspond to the different questions asked Theletter ldquoArdquo ldquoBrdquo ldquoCrdquo or ldquoDrdquo means a student provided the correctanswer If a number is entered the student answered the ques-tion incorrectly with ldquo1rdquo corresponding to a wrong answer of ldquoArdquoldquo2rdquo corresponding to a wrong answer of ldquoBrdquo etc On the right-hand side of the table we also present student test scores for thepreceding current and following year Test scores are in units ofldquograde equivalentsrdquo The grading scale is normed so that a stu-dent at the national average for sixth graders taking the test inthe eighth month of the school year would score 68 A typical

847ROTTEN APPLES

FIGURE ISample Answer Strings and Test Scores from Two Classrooms

The data in the gure represent actual answer strings and test scores from twoCPS classrooms taking the same exam The top classroom is suspected of cheat-ing the bottom classroom is not Each row corresponds to an individual studentEach column represents a particular question on the exam A letter indicates thatthe student gave that answer and the answer was correct A number means thatthe student gave the corresponding letter answer (eg 1 5 ldquoArdquo) but the answerwas incorrect A value of ldquo0rdquo means the question was left blank Student testscores in grade equivalents are shown in the last three columns of the gure Thetest year for which the answer strings are presented is denoted year t The scoresfrom years t 2 1 and t 1 1 correspond to the preceding and following yearsrsquoexaminations

848 QUARTERLY JOURNAL OF ECONOMICS

student would be expected to gain one grade equivalent for eachyear of school

The top panel of data shows a class in which we suspectteacher cheating took place the bottom panel corresponds to atypical classroom Two striking differences between the class-rooms are readily apparent First in the cheating classroom alarge block of students provided identical answers on consecutivequestions in the middle of the test as indicated by the boxed areain the gure For the other classroom no such pattern existsSecond looking at the pattern of test scores students in thecheating classroom experienced large increases in test scoresfrom the previous to the current year (17 grade equivalents onaverage) and actually experienced declines on average the fol-lowing year In contrast the students in the typical classroomgained roughly one grade equivalent each year as would beexpected

The indicators we use as evidence of cheating formalize andextend the basic picture that emerges from Figure I We dividethese indicators into two distinct groups that respectively cap-ture unusual test score uctuations and suspicious patterns ofanswer strings In this section we describe informally the mea-sures that we use A more rigorous treatment of their construc-tion is provided in the Appendix

IIIA Indicator One Unexpected Test Score Fluctuations

Given that the aim of cheating is to raise test scores onesignal of teacher cheating is an unusually large gain in testscores relative to how those same students tested in the pre-vious year Since test score gains that result from cheating donot represent real gains in knowledge there is no reason toexpect the gains to be sustained on future exams taken bythese students (unless of course next yearrsquos teachers alsocheat on behalf of the students) Thus large gains due tocheating should be followed by unusually small test score gainsfor these students in the following year In contrast if largetest score gains are due to a talented teacher the student gainsare likely to have a greater permanent component even if someregression to the mean occurs We construct a summary mea-sure of how unusual the test score uctuations are by rankingeach classroomrsquos average test score gains relative to all other

849ROTTEN APPLES

classrooms in that same subject grade and year7 and com-puting the following statistic

(3) SCOREcbt 5 ~rank_ gaincb t2 1 ~1 2 rank_ gaincbt11

2

where rank_ gain cb t is the percentile rank for class c in subject bin year t Classes with relatively big gains on this yearrsquos test andrelatively small gains on next yearrsquos test will have high values ofSCORE Squaring the individual terms gives relatively moreweight to big test score gains this year and big test score declinesthe following year8 In the empirical analysis we consider threepossible cutoffs for what it means to have a ldquohighrdquo value onSCORE corresponding to the eightieth ninetieth and ninety-fth percentiles among all classrooms in the sample

IIIB Indicator Two Suspicious Answer Strings

The quickest and easiest way for a teacher to cheat is to alterthe same block of consecutive questions for a substantial portionof students in the class as was apparently done in the classroomin the top panel of Figure I More sophisticated interventionsmight involve skipping some questions so as to avoid a large blockof identical answers or altering different blocks of questions fordifferent students

We combine four different measures of how suspicious aclassroomrsquos answer strings are in determining whether a class-room may be cheating The rst measure focuses on the mostunlikely block of identical answers given by students on consecu-tive questions Using past test scores future test scores andbackground characteristics we predict the likelihood that eachstudent will give each possible answer (A B C or D) on everyquestion using a multinomial logit Each studentrsquos predictedprobability of choosing a particular response is identied by thelikelihood that other students (in the same year grade andsubject) with similar background characteristics will choose that

7 We also experimented with more complicated mechanisms for deninglarge or small test score gains (eg predicting each studentrsquos expected test scoregain as a function of past test scores and background characteristics and comput-ing a deviation measure for each student which was then aggregated to theclassroom level) but because the results were similar we elected to use thesimpler method We have also dened gains and losses using an absolute metric(eg where gains in excess of 15 or 2 grade equivalents are considered unusuallylarge) and obtain comparable results

8 In the following year the students who were in a particular classroom aretypically scattered across multiple classrooms We base all calculations off of thecomposition of this yearrsquos class

850 QUARTERLY JOURNAL OF ECONOMICS

response We then search over all combinations of students andconsecutive questions to nd the block of identical answers givenby students in a classroom least likely to have arisen by chance9

The more unusual is the most unusual block of test responses(adjusting for class size and the number of questions on the examboth of which increase the possible combinations over which wesearch) the more likely it is that cheating occurred Thus if tenvery bright students in a class of thirty give the correct answersto the rst ve questions on the exam (typically the easier ques-tions) the block of identical answers will not appear unusual Incontrast if all fteen students in a low-achieving classroom givethe same correct answers to the last ve questions on the exam(typically the harder questions) this would appear quite suspect

The second measure of suspicious answer strings involvesthe overall degree of correlation in student answers across thetest When a teacher changes answers on test forms it presum-ably increases the uniformity of student test forms across stu-dents in the class This measure is meant to capture more generalpatterns of similarity in student responses beyond just identicalblocks of answers Based on the results of the multinomial logitdescribed above for each question and each student we create ameasure of how unexpected the studentrsquos response was We thencombine the information for each student in the classroom tocreate something akin to the within-classroom correlation in stu-dent responses This measure will be high if students in a class-room tend to give the same answers on many questions espe-cially if the answers given are unexpected (ie correct answers onhard questions or systematic mistakes on easy questions)

Of course within-classroom correlation may arise for manyreasons other than cheating (eg the teacher may emphasizecertain topics during the school year) Therefore a third indicatorof potential cheating is a high variance in the degree of correla-tion across questions If the teacher changes answers for multiplestudents on selected questions the within-class correlation on

9 Note that we do not require the answers to be correct Indeed in manyclassrooms the most unusual strings include some incorrect answers Note alsothat these calculations are done under the assumption that a given studentrsquosanswers are uncorrelated (conditional on observables) across questions on theexam and that answers are uncorrelated across students Of course this assump-tion is unlikely to be true Since all of our comparisons rely on the relativeunusualness of the answers given in different classrooms this simplifying as-sumption is not problematic unless the correlation within and across studentsvaries by classroom

851ROTTEN APPLES

those particular questions will be extremely high while the de-gree of within-class correlation on other questions is likely to betypical This leads the cross-question variance in correlations tobe larger than normal in cheating classrooms

Our nal indicator compares the answers that students inone classroom give compared with the answers of other studentsin the system who take the identical test and get the exact samescore This measure relies on the fact that questions vary signi-cantly in difculty The typical student will answer most of theeasy questions correctly but get many of the hard questionswrong (where ldquoeasyrdquo and ldquohardrdquo are based on how well students ofsimilar ability do on the question) If students in a class system-atically miss the easy questions while correctly answering thehard questions this may be an indication of cheating

Our overall measure of suspicious answer strings is con-structed in a manner parallel to our measure of unusual testscore uctuations Within a given subject grade and year werank classrooms on each of these four indicators and then takethe sum of squared ranks across the four measures10

(4) ANSWERScbt 5 ~rank_m1cbt2 1 ~rank_m2cbt

2

1 ~rank_m3cbt2 1 ~rank_m4cbt

2

In the empirical work we again use three possible cutoffs forpotential cheating eightieth ninetieth and ninety-fthpercentiles

III A STRATEGY FOR IDENTIFYING THE PREVALENCE OF CHEATING

The previous section described indicators that are likely to becorrelated with cheating Because sometimes such patterns ariseby chance however not every classroom with large test scoreuctuations and suspicious answer strings is cheating Further-more the particular choice of what qualies as a ldquolargerdquo uctu-ation or a ldquosuspiciousrdquo set of answer strings will necessarily bearbitrary In this section we present an identication strategythat under a set of defensible assumptions nonetheless providesestimates of the prevalence of cheating

To identify the number of cheating classrooms in a given

10 Because different subjects and grades have differing numbers of ques-tions it is difcult to make meaningful comparisons across tests on the rawindicators

852 QUARTERLY JOURNAL OF ECONOMICS

year we would like to compare the observed joint distribution oftest score uctuations and suspicious answer strings with a coun-terfactual distribution in which no cheating occurs Differencesbetween the two distributions would provide an estimate of howmuch cheating occurred If teachers in a particular school or yearare cheating there will be more classrooms exhibiting both un-usual test score uctuations and suspicious answer strings thanotherwise expected In practice we do not have the luxury ofobserving such a counterfactual Instead we must make assump-tions about what the patterns would look like absent cheating

Our identication strategy hinges on three key assumptions(1) cheating increases the likelihood a class will have both largetest score uctuations and suspicious answer strings (2) if cheat-ing classrooms had not cheated their distribution of test scoreuctuations and answer strings patterns would be identical tononcheating classrooms and (3) in noncheating classrooms thecorrelation between test score uctuations and suspicious an-swers is constant throughout the distribution11

If assumption (1) holds then cheating classrooms will beconcentrated in the upper tail of the joint distribution of unusualtest score uctuations and suspicious answer strings Other partsof the distribution (eg classrooms ranked in the ftieth to sev-enty-fth percentile of suspicious answer strings) will conse-quently include few cheaters As long as cheating classroomswould look similar to noncheating classrooms on our measures ifthey had not cheated (assumption (2)) classes in the part of thedistribution with few cheaters provide the noncheating counter-factual that we are seeking

The difculty however is that we only observe thisnoncheating counterfactual in the bottom and middle parts of thedistribution when what we really care about is the upper tail ofthe distribution where the cheaters are concentrated Assump-tion (3) which requires that in noncheating classrooms the cor-relation between test score uctuations and suspicious answers isconstant throughout the distribution provides a potential solu-tion If this assumption holds we can use the part of the distri-bution that is relatively free of cheaters to project what the righttail of the distribution would look like absent cheating The gapbetween the predicted and observed frequency of classrooms that

11 We formally derive the mathematical model described in this section inJacob and Levitt [2003]

853ROTTEN APPLES

are extreme on both the test score uctuation and suspiciousanswer string measures provides our estimate of cheating

Figure II demonstrates how our identication strategy worksempirically The horizontal axis in the gure ranks classroomsaccording to how suspicious their answer strings are The verticalaxis is the fraction of the classrooms that are above the ninety-fth percentile on the unusual test score uctuations measure12

The graph combines all classrooms and all subjects in our data13

Over most of the range (roughly from zero to the seventy-fthpercentile on the horizontal axis) there is a slight positive corre-lation between unusual test scores and suspicious answer strings

12 The choice of the ninety-fth percentile is somewhat arbitrary In theempirical work that follows we also consider the eightieth and ninetieth percen-tiles The choice of the cutoff does not affect the basic patterns observed in thedata

13 To construct the gure classes were rank ordered according to theiranswer strings and divided into 200 equally sized segments The circles in thegure represent these 200 local means The line displayed in the graph is thetted value of a regression with a seventh-order polynomial in a classroomrsquos rankon the suspicious strings measure

FIGURE IIThe Relationship between Unusual Test Scores and Suspicious Answer Strings

The horizontal axis reects a classroomrsquos percentile rank in the distribution ofsuspicious answer strings within a given grade subject and year with zerorepresenting the least suspicious classroom and one representing the most sus-picious classroom The vertical axis is the probability that a classroom will beabove the ninety-fth percentile on our measure of unusual test score uctuationsThe circles in the gure represent averages from 200 equally spaced cells alongthe x-axis The predicted line is based on a probit model estimated with seventh-order polynomials in the suspicious string measure

854 QUARTERLY JOURNAL OF ECONOMICS

This is the part of the distribution that is unlikely to containmany cheating classrooms Under the assumption that the corre-lation between our two measures is constant in noncheatingclassrooms over the whole distribution we would predict that thestraight line observed for all but the right-hand portion of thegraph would continue all the way to the right edge of the graph

Instead as one approaches the extreme right tail of thedistribution of suspicious answer strings the probability of largetest score uctuations rises dramatically That sharp rise weargue is a consequence of cheating To estimate the prevalence ofcheating we simply compare the actual area under the curve inthe far right tail of Figure I to the predicted area under the curvein the right tail projecting the linear relationship observed fromthe ftieth to seventy-fth percentile out to the ninety-ninthpercentile Because this identication strategy is necessarily in-direct we devote a great deal of attention to presenting a widevariety of tests of the validity of our approach the sensitivity ofthe results to alternative assumptions and the plausibility of ourndings

IV DATA AND INSTITUTIONAL BACKGROUND

Elementary students in Chicago public schools take a stan-dardized multiple-choice achievement exam known as the IowaTest of Basic Skills (ITBS) The ITBS is a national norm-refer-enced exam with a reading comprehension section and threeseparate math sections14 Third through eighth grade students inChicago are required to take the exams each year Most schoolsadminister the exams to rst and second grade students as wellalthough this is not a district mandate

Our base sample includes all students in third to seventhgrade for the years 1993ndash200015 For each student we have thequestion-by-question answer string on each yearrsquos tests schooland classroom identiers the full history of prior and future testscores and demographic variables including age sex race andfree lunch eligibility We also have information about a wide

14 There are also other parts of the test which are either not included inofcial school reporting (spelling punctuation grammar) or are given only inselect grades (science and social studies) for which we do not have information

15 We also have test scores for eighth graders but we exclude them becauseour algorithm requires test score data for the following year and the ITBS test isnot administered to ninth graders

855ROTTEN APPLES

range of school-level characteristics We do not however haveindividual teacher identiers so we are unable to directly linkteachers to classrooms or to track a particular teacher over time

Because our cheating proxies rely on comparisons to past andfuture test scores we drop observations that are missing readingor math scores in either the preceding year or the following year(in addition to those with missing test scores in the baselineyear)16 Less than one-half of 1 percent of students with missingdemographic data are also excluded from the analysis Finallybecause our algorithms for identifying cheating rely on identify-ing suspicious patterns within a classroom our methods havelittle power in classrooms with small numbers of students Con-sequently we drop all classrooms for which we have fewer thanten valid students in a particular grade after our other exclusions(roughly 3 percent of students) A handful of classrooms recordedas having more than 40 studentsmdashpresumably multiple class-rooms not separately identied in the datamdashare also droppedOur nal data set contains roughly 20000 students per grade peryear distributed across approximately 1000 classrooms for atotal of over 40000 classroom-years of data (with four subjecttests per classroom-year) and over 700000 student-yearobservations

Summary statistics for the full sample of classrooms areshown in Table I where the unit of observation is at the level ofclasssubjectyear As in many urban school districts students inChicago are disproportionately poor and minority Roughly 60percent of students are Black and 26 percent are Hispanic withnearly 85 percent receiving free or reduced-price lunch Less than29 percent of students scored at or above national norms inreading

The ITBS exams are administered over a weeklong period inearly May Third grade teachers are permitted to administer theexam to their own students while other teachers switch classes toadminister the exams The exams are generally delivered to theschools one to two weeks before testing and are supposed to be

16 The exact number of students with missing test data varies by year andgrade Overall we drop roughly 12 percent of students because of missing testscore information in the baseline year These are students who (i) were not testedbecause of placement in bilingual or special education programs or (ii) simplymissed the exam that particular year In addition to these students we also dropapproximately 13 percent of students who were missing test scores in either thepreceding or following years Test data may be missing either because a studentdid not attend school on the days of the test or because the student transferredinto the CPS system in the current year or left the system prior to the next yearof testing

856 QUARTERLY JOURNAL OF ECONOMICS

kept in a secure location by the principal or the schoolrsquos testcoordinator an individual in the school designated to coordinatethe testing process (often a counselor or administrator) Each

TABLE ISUMMARY STATISTICS

Mean(sd)

Classroom characteristicsMixed grade classroom 0073Teacher administers exams to her own students (3rd grade) 0206Percent of students who were tested and included in ofcial

reporting 0883Average prior achievement 20004(as deviation from yeargradesubject mean) (0661) Black 0595 Hispanic 0263 Male 0495 Old for grade 0086 Living in foster care 0044 Living with nonparental relative 0104Cheatermdash95th percentile cutoff 0013School characteristicsAverage quality of teachersrsquo undergraduate institution

in the school22550(0877)

Percent of teachers who live in Chicago 0712Percent of teachers who have an MA or a PhD 0475Percent of teachers who majored in education 0712Percent of teachers under 30 years of age 0114Percent of teachers at the school less than 3 years 0547 students at national norms in reading last year 288 students receiving free lunch in school 847Predominantly Black school 0522Predominantly Hispanic school 0205Mobility rate in school 286Attendance rate in school 926School size 722

(317)Accountability policySocial promotion policy 0215School probation policy 0127Test form offered for the rst time 0371Number of observations 163474

The data summarized above include one observation per classroomyearsubjectThe sample includes allstudents in third to seventh grade for the years 1993ndash2000 We drop observations for the following reasons(a) missing reading or math scores in either the baseline year the preceding year or the following year (b)missing demographic data (c) classrooms for which we have fewer than ten valid students in a particulargrade or more than 40 students See the discussion in the text for a more detailed explanation of the sampleincluding the percentages excluded for various reasons

857ROTTEN APPLES

section of the exam consists of 30 to 60 multiple choice questionswhich students are given between 30 and 75 minutes to com-plete17 Students mark their responses on answer sheets whichare scanned to determine a studentrsquos score There is no penaltyfor guessing A studentrsquos raw score is simply calculated as thesum of correct responses on the exam The raw score is thentranslated into a grade equivalent

After collecting student exams teachers or administratorsthen ldquocleanrdquo the answer keys erasing stray pencil marks remov-ing dirt or debris from the form and darkening item responsesthat were only faintly marked by the student At the end of thetesting week the test coordinators at each school deliver thecompleted answer keys and exams to the CPS central ofceSchool personnel are not supposed to keep copies of the actualexams although school ofcials acknowledge that a number ofteachers each year do so The CPS has administered three differ-ent versions of the ITBS between 1993 and 2000 The CPS alter-nates forms each year with new forms being offered for the rsttime in 1993 1994 and 199718

V THE PREVALENCE OF TEACHER CHEATING

As noted earlier our estimates of the prevalence of cheatingare derived from a comparison of the actual number of classroomsthat are above some threshold on both of our cheating indicatorsrelative to the number we would expect to observe based on thecorrelation between these two measures in the 50thndash75th percen-tiles of the suspicious string measure The top panel of Table IIpresents our estimates of the percentage of classrooms that arecheating on average on a given subject test (ie reading compre-hension or one of the three math tests) in a given year19 Because

17 The mathematics and reading tests measure basic skills The readingcomprehension exam consists of three to eight short passages followed by up tonine questions relating to the passage The math exam consists of three sectionsthat assess number concepts problem-solving and computation

18 These three forms are used for retesting summer school testing andmidyear testing as well so that it is likely that over the years teachers have seenthe same exam on a number of occasions

19 It is important to note that we exclude a particular form of cheating thatappears to be quite prevalent in the data teachers randomly lling in answers leftblank by students at the end of the exam In many classrooms almost everystudent will end the test with a long string of the identical answers (typically allthe same response like ldquoBrdquo or ldquoDrdquo) The fact that almost all students in the classcoordinate on the same pattern strongly suggests that the students themselvesdid not ll in the blanks or were under explicit instructions by the teacher to do

858 QUARTERLY JOURNAL OF ECONOMICS

the decision as to what cutoff signies a ldquohighrdquo value on ourcheating indicators is arbitrary we present a 3 3 3 matrix ofestimates using three different thresholds (eightieth ninetiethand ninety-fth percentiles) for each of our cheating measuresThe estimated prevalence of cheaters ranges from 11 percent to21 percent depending on the particular set of cutoffs used Aswould be expected the number of cheaters is generally decliningas higher thresholds are employed Nonetheless it is encouragingthat over such a wide set of cutoffs the range of estimates isrelatively tight

The bottom panel of Table II presents estimates of the per-centage of classrooms that are cheating on any of the four subjecttests in a particular year If every classroom that cheated did soonly on one subject test then the results in the bottom panel

so Since there is no penalty for guessing on the test lling in the blanks can onlyincrease student test scores While this type of teacher behavior is likely to beviewed by many as unethical we do not make it the focus of our analysis because(1) it is difcult to provide denitive evidence of such behavior (a teacher couldargue that he or she instructed students well in advance of the test to ll in allblanks with the letter ldquoCrdquo as part of good test-taking strategy) and (2) in ourminds it is categorically different than a teacher who systematically changesstudent responses to the correct answer

TABLE IIESTIMATED PREVALENCE OF TEACHER CHEATING

Cutoff for suspicious answerstrings (ANSWERS)

Cutoff for test score uctuations (SCORE)

80th percentile 90th percentile 95th percentile

Percent cheating on a particular test80th percentile 21 21 1890th percentile 18 18 1595th percentile 13 13 11

Percent cheating on at least one of the four tests given80th percentile 45 56 5390th percentile 42 49 4495th percentile 35 38 34

The top panel of the table presents estimates of the percentage of classrooms cheating on a particularsubject test in a given year based on three alternative cutoffs for ANSWERS and SCORE In all cases theprevalence of cheating is based on the excess number of classrooms with unexpected test score uctuationamong classes with suspicious answer strings relative to classes that do not have suspicious answer stringsThe bottom panel of the table presents estimates of the percentage of classrooms cheating on at least one ofthe four subject tests that comprise the overall test In the bottom panel classrooms that cheat on more thanone subject test are counted only once Our sample includes over 35000 3rdndash7th grade classrooms in theChicago public schools for the years 1993ndash1999

859ROTTEN APPLES

would simply be four times the results in the top panel In manyinstances however classrooms appear to cheat on multiple sub-jects Thus the prevalence rates range from 34ndash56 percent of allclassrooms20 Because of the necessarily indirect nature of ouridentication strategy we explore a range of supplementary testsdesigned to assess the validity of the estimates21

VA Simulation Results

Our prevalence estimates may be biased downward or up-ward depending on the extent to which our algorithm fails tosuccessfully identify all instances of cheating (ldquofalse negativesrdquo)versus the frequency with which we wrongly classify a teacher ascheating (ldquofalse positiverdquo)

One simple simulation exercise we undertook with respect topossible false positives involves randomly assigning students tohypothetical classrooms These synthetic classrooms thus consistof students who in actuality had no connection to one another Wethen analyze these hypothetical classrooms using the same algo-rithm applied to the actual data As one would hope no evidenceof cheating was found in the simulated classes Indeed the esti-mated prevalence of cheating was slightly negative in this simu-lation (ie classrooms with large test score increases in thecurrent year followed by big declines the next year were slightlyless likely to have unusual patterns of answer strings) Thusthere does not appear to be anything mechanical in our algorithmthat generates evidence of cheating

A second simulation involves articially altering a class-

20 Computation of the overall prevalence is somewhat complicated becauseit involves calculating not only how many classrooms are actually above thethresholds on multiple subject tests but also how frequently this would occur inthe absence of cheating Details on these calculations are available from theauthors

21 In an earlier version of this paper [Jacob and Levitt 2003] we present anumber of additional sensitivity analyses that conrm the basic results First wetest whether our results are sensitive to the way in which we predict the preva-lence of high values of ANSWERS and SCORE in the upper part of the otherdistribution (eg tting a linear or quadratic model to the data in the lowerportion of the distribution versus simply using the average value from the 50thndash75th percentiles) We nd our results are extremely robust Second we examinewhether our results might simply be due to mean reversion by testing whetherclassrooms with suspicious answer strings are less likely to maintain large testscore gains Among a sample of classrooms with large test score gains we nd thatmean reversion is substantially greater among the set of classes with highlysuspicious answer strings Third we investigate whether we might be detectingstudent rather than teacher cheating by examining whether students who havesuspicious answer strings in one year are more likely to have suspicious answerstrings in other years We nd that they do not

860 QUARTERLY JOURNAL OF ECONOMICS

roomrsquos answer strings in ways that we believe mimic teachercheating or outstanding teaching22 We are then able to see howfrequently we label these altered classes as ldquocheatingrdquo and howthis varies with the extent and nature of the changes we make toa classroomrsquos answers We simulate two different types of teachercheating The rst is a very naive version in which a teacherstarts cheating at the same question for a number of students andchanges consecutive questions to the right answers for thesestudents creating a block of identical and correct responses Thesecond type of cheating is much more sophisticated we randomlychange answers from incorrect to correct for selected students ina class The outcome of this simulation reveals that our methodsare surprisingly weak in detecting moderate cheating even in thenave case For instance when three consecutive answers arechanged to correct for 25 percent of a class we catch such behav-ior only 4 percent of the time Even when six questions arechanged for half of the class we catch the cheating in less than 60percent of the cases Only when the cheating is very extreme (egsix answers changed for the entire class) are we very likely tocatch the cheating The explanation for this poor performance isthat classrooms with low test-score gains even with the boostprovided by this cheating do not have large enough test scoreuctuations to make them appear unusual because there is somuch inherent volatility in test scores When the cheating takesthe more sophisticated form described above we are even lesssuccessful at catching low and moderate cheaters (2 percent and34 percent respectively in the rst two scenarios describedabove) but we almost always detect very extreme cheating

From a political or equity perspective however we may beeven more concerned with false positives specically the risk ofaccusing highly effective teachers of cheating Hence we alsosimulate the impact of a good teacher changing certain answersfor certain students from incorrect to correct responses Our ma-nipulation in this case differs from the cheating simulationsabove in two ways (1) we allow some of the gains to be preservedin the following year and (2) we alter questions such that stu-dents will not get random answers correct but rather will tend toshow the greatest improvement on the easiest questions that thestudents were getting wrong These simulation results suggest

22 See Jacob and Levitt [2003] for the precise details of this simulationexercise

861ROTTEN APPLES

that only rarely are good teachers mistakenly labeled as cheatersFor instance a teacher whose prowess allows him or her to raisetest scores by six questions for half the students in the class (morethan a 05 grade equivalent increase on average for the class) islabeled a cheater only 2 percent of the time In contrast a navecheater with the same test score gains is correctly identied as acheater in more than 50 percent of cases

In another test for false positives we explored whether wemight be mistaking emphasis on certain subject material forcheating For example if a math teacher spends several monthson fractions with a particular class one would expect the class todo particularly well on all of the math questions relating tofractions and perhaps worse than average on other math ques-tions One might imagine a similar scenario in reading if forexample a teacher creates a lengthy unit on the UndergroundRailroad which later happens to appear in a passage on thereading comprehension exam To examine this we analyze thenature of the across-question correlations in cheating versusnoncheating classrooms We nd that the most highly correlatedquestions in cheating classrooms are no more likely to measurethe same skill (eg fractions geometry) than in noncheatingclassrooms The implication of these results is that the prevalenceestimates presented above are likely to substantially understatethe true extent of cheatingmdashthere is little evidence of false posi-tivesmdashbut we frequently miss moderate cases of cheating

VB The Correlation of Cheating Indicators across SubjectsClassrooms and Years

If what we are detecting is truly cheating then one wouldexpect that a teacher who cheats on one part of the test would bemore likely to cheat on other parts of the test Also a teacher whocheats one year would be more likely to cheat the following yearFinally to the extent that cheating is either condoned by theprincipal or carried out by the test coordinator one would expectto nd multiple classes in a school cheating in any given year andperhaps even that cheating in a school one year predicts cheatingin future years If what we are detecting is not cheating then onewould not necessarily expect to nd strong correlation in ourcheating indicators across exams for a specic classroom acrossclassrooms or across years


Table III reports regression results testing these predictions. The dependent variable is an indicator for whether we believe a classroom is likely to be cheating on a particular subject test, using our most stringent definition (above the ninety-fifth percentile on both cheating indicators). The baseline probability of

TABLE III
PATTERNS OF CHEATING WITHIN CLASSROOMS AND SCHOOLS

Dependent variable = Class suspected of cheating
(mean of the dependent variable = 0.011)

                                                                        Sample of classes and
                                                       Full sample      schools that existed
                                                                        in the prior year
Independent variables                                 (1)       (2)       (3)       (4)

Classroom cheated on exactly one other               0.105     0.103     0.101     0.101
  subject this year                                 (0.008)   (0.008)   (0.009)   (0.009)
Classroom cheated on exactly two other               0.289     0.285     0.243     0.243
  subjects this year                                (0.027)   (0.027)   (0.031)   (0.031)
Classroom cheated on all three other                 0.627     0.622     0.595     0.595
  subjects this year                                (0.051)   (0.051)   (0.054)   (0.054)
Cheating rate among all other classes in the                   0.166     0.134     0.129
  school this year on this subject                            (0.030)   (0.027)   (0.027)
Cheating rate among all other classes in the                   0.023     0.059     0.045
  school this year on other subjects                          (0.024)   (0.026)   (0.029)
Cheating in this classroom in this subject                               0.096     0.091
  last year                                                             (0.012)   (0.012)
Number of other subjects this classroom                                  0.023     0.018
  cheated on last year                                                  (0.004)   (0.004)
Cheating in this classroom ever in the past                                        0.006
                                                                                  (0.002)
Cheating rate among other classrooms in this                                       0.090
  school in past years                                                            (0.040)
Fixed effects for grade x subject x year             Yes       Yes       Yes       Yes
R2                                                   0.090     0.093     0.109     0.109
Number of observations                               165,578   165,578   94,182    94,170

The dependent variable is an indicator for whether a classroom is above the 95th percentile on both our suspicious strings and unusual test score measures of cheating on a particular subject test. Estimation is done using a linear probability model. The unit of observation is classroom x grade x year x subject. For columns that include measures of cheating in prior years, observations where that classroom or school does not appear in the data in the prior year are excluded. Standard errors are clustered at the school level to take into account correlations across classrooms as well as serial correlation.


qualifying as a cheater for this cutoff is 1.1 percent. To fully appreciate the magnitude of the effects implied by the table, it is important to keep this very low baseline in mind. We report estimates from linear probability models (probits yield similar marginal effects), with standard errors clustered at the school level.
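As a concrete illustration, a regression of this form is straightforward to estimate with standard software. The sketch below is a minimal stand-in for the Table III estimation, using toy data and made-up column names (`cheat`, `cheat_one_other`, `cell`, `school_id`); it is not the authors' code, only a demonstration of a linear probability model with school-clustered standard errors.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Toy stand-in data; in the actual application each row is a
# classroom x grade x year x subject observation.
rng = np.random.default_rng(1)
n = 1000
df = pd.DataFrame({
    "cheat": rng.integers(0, 2, n),            # 95th-percentile cheating indicator
    "cheat_one_other": rng.integers(0, 2, n),  # cheated on exactly one other subject
    "cell": rng.integers(0, 20, n),            # grade x subject x year cell
    "school_id": rng.integers(0, 50, n),
})

# Linear probability model with fixed effects for the cell and
# standard errors clustered at the school level.
result = smf.ols("cheat ~ cheat_one_other + C(cell)", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["school_id"]}
)
print(result.params["cheat_one_other"], result.bse["cheat_one_other"])
```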

Column 1 of Table III shows that cheating on other tests in the same year is an extremely powerful predictor of cheating in a different subject. If a classroom cheats on exactly one other subject test, the predicted probability of cheating on this test increases by over ten percentage points. Since the baseline cheating rate is only 1.1 percent, classrooms cheating on exactly one other test are ten times more likely to have cheated on this subject than are classrooms that did not cheat on any of the other subjects (which is the omitted category). Classrooms that cheat on two other subjects are almost 30 times more likely to cheat on this test, relative to those not cheating on other tests. If a class cheats on all three other subjects, it is 50 times more likely to also cheat on this test.

There also is evidence of correlation in cheating within schools. A ten-percentage-point increase in cheating classrooms in a school (excluding the classroom in question) on the same subject test raises the likelihood this class cheats by roughly 0.16 percentage points. This potentially suggests some role for centralized cheating by a school counselor, test coordinator, or the principal, rather than by teachers operating independently. There is little evidence that cheating rates within the school on other subject tests affect cheating on this test.

When making comparisons across years (columns 3 and 4), it is important to note that we do not actually have teacher identifiers. We do, however, know what classroom a student is assigned to. Thus, we can only compare the correlation between past and current cheating in a given classroom. To the extent that teacher turnover occurs or teachers switch classrooms, this proxy will be contaminated. Even given this important limitation, cheating in the classroom last year predicts cheating this year. In column 3, for example, we see that classrooms that cheated in the same subject last year are 9.6 percentage points more likely to cheat this year, even after we control for cheating on other subjects in the same year and cheating in other classes in the school. Column 4 shows that prior cheating in the school strongly predicts the likelihood that a classroom will cheat this year.


V.C. Evidence from Retesting under Controlled Circumstances

Perhaps the most compelling evidence for our methods comes from the results of a unique policy intervention. In spring 2002 the Chicago public schools provided us the opportunity to retest over 100 classrooms under controlled circumstances that precluded cheating, a few weeks after the initial ITBS exam was administered.23 Based on suspicious answer strings and unusual test score gains on the initial spring 2002 exam, we identified a set of classrooms most likely to have cheated. In addition, we chose a second group of classrooms that had suspicious answer strings but did not have large test score gains, a pattern that is consistent with a bad teacher who attempts to mask a poor teaching performance by cheating. We also retested two sets of classrooms as control groups. The first control group consisted of classes with large test score gains but no evidence of cheating in their answer strings, a potential indication of effective teaching. These classes would be expected to maintain their gains when retested, subject to possible mean reversion. The second control group consisted of randomly selected classrooms.

The results of the retests are reported in Table IV. Columns 1-3 correspond to reading; columns 4-6 are for math. For each subject we report the test score gain (in standard score units) for three periods: (i) from spring 2001 to the initial spring 2002 test, (ii) from the initial spring 2002 test to the retest a few weeks later, and (iii) from spring 2001 to the spring 2002 retest. For purposes of comparison, we also report the average gain for all classrooms in the system for the first of those measures (the other two measures are unavailable unless a classroom was retested).

Classrooms prospectively identified as "most likely to have cheated" experienced gains on the initial spring 2002 test that were nearly twice as large as those of the typical CPS classroom (see columns 1 and 4). On the retest, however, those excess gains completely disappeared: the gains between spring 2001 and the spring 2002 retest were close to the systemwide average. Those classrooms prospectively identified as "bad teachers likely to have cheated" also experienced large score declines on the retest (-8.8 and -10.5 standard score points, or roughly two-thirds of a grade equivalent). Indeed, comparing the spring 2001 results with the

23. Full details of this policy experiment are reported in Jacob and Levitt [2003]. The results of the retests were used by CPS to initiate disciplinary action against a number of teachers, principals, and staff.


TABLE IV
RESULTS OF RETESTS: COMPARISON OF RESULTS FOR SPRING 2002 ITBS AND AUDIT TEST

                                             Reading gains between            Math gains between

                                        Spring 2001  Spring 2001  Spring 2002  Spring 2001  Spring 2001  Spring 2002
                                        and spring   and 2002     and 2002     and spring   and 2002     and 2002
Category of classroom                   2002         retest       retest       2002         retest       retest

All classrooms in the Chicago
  public schools                        14.3                                   16.9
Classrooms prospectively identified
  as most likely cheaters (N = 36 on
  math, N = 39 on reading)              28.8         12.6         -16.2        30.0         19.3         -10.7
Classrooms prospectively identified
  as bad teachers suspected of
  cheating (N = 16 on math, N = 20
  on reading)                           16.6         7.8          -8.8         17.3         6.8          -10.5
Classrooms prospectively identified
  as good teachers who did not cheat
  (N = 17 on math, N = 17 on reading)   20.6         21.1         0.5          28.8         25.5         -3.3
Randomly selected classrooms (N = 24
  overall, but only one test per
  classroom)                            14.5         12.2         -2.3         14.5         12.2         -2.3

Results in this table compare test scores at three points in time (spring 2001, spring 2002, and a retest administered under controlled conditions a few weeks after the initial spring 2002 test). The unit of observation is a classroom-subject test pair. Only students taking all three tests are included in the calculations. Because of limited data, math and reading results for the randomly selected classrooms are combined. Only the first two columns are available for all CPS classrooms, since audits were performed only on a subset of classrooms. All entries in the table are in standard score units.


spring 2002 retest, children in these classrooms had gains only half as large as the typical yearly CPS gain. In stark contrast, classrooms prospectively identified as having "good teachers" (i.e., classrooms with large gains but not suspicious answer strings) actually scored even higher on the reading retest than they did on the initial test. The math scores fell slightly on retesting, but these classrooms continued to experience extremely large gains. The randomly selected classrooms also maintained almost all of their gains when retested, as would be expected. In summary, our methods proved to be extremely effective both in identifying likely cheaters and in correctly classifying teachers who achieved large test score gains in a legitimate manner.

VI. DOES TEACHER CHEATING RESPOND TO INCENTIVES?

From the perspective of economics, perhaps the most interesting question related to teacher cheating is the degree to which it responds to incentives. As noted in the Introduction, there were two major changes in the incentives faced by teachers and students over our sample period. Prior to 1996, ITBS scores were primarily used to provide teachers and parents with a sense of how a child was progressing academically. Beginning in 1996, with the appointment of Paul Vallas as CEO of Schools, the CPS launched an initiative designed to hold students and teachers accountable for student learning.

The reform had two main elements. The first involved putting schools "on probation" if less than 15 percent of students scored at or above national norms on the ITBS reading exams. The CPS did not use math performance to determine probation status. Probation schools that do not exhibit sufficient improvement may be reconstituted, a procedure that involves closing the school and dismissing or reassigning all of the school staff.24 It is clear from our discussions with teachers and administrators that being on probation is viewed as an extremely undesirable circumstance. The second piece of the accountability reform was an end to social promotion, the practice of passing students on to the next grade regardless of their academic skills or school performance. Under the new policy, students in third, sixth, and eighth grade must meet minimum standards on the ITBS in both reading and

24. For a more detailed analysis of the probation policy, see Jacob [2002] and Jacob and Lefgren [forthcoming].


mathematics in order to be promoted to the next grade. The promotion standards were implemented in spring 1997 for third and sixth graders. Promotion decisions are based solely on scores in reading comprehension and mathematics.

Table V presents OLS estimates of the relationship between teacher cheating and a variety of classroom and school characteristics.25 The unit of observation is a classroom x subject x grade x year. The dependent variable is an indicator of whether we suspect the classroom cheated, using our most stringent definition: a classroom is designated a cheater if both its test score fluctuations and the suspiciousness of its answer strings are above the ninety-fifth percentile for that grade, subject, and year.26

In column (1) the policy changes are restricted to have a constant impact across all classrooms. The introduction of the social promotion and probation policies is positively correlated with the likelihood of classroom cheating, although the point estimates are not statistically significant at conventional levels. Much more interesting results emerge when we interact the policy changes with the previous year's test scores for the classroom. For both probation and social promotion, cheating rates in the lowest performing classrooms prove to be quite responsive to the change in incentives. In column (2) we see that a classroom one standard deviation below the mean increased its likelihood of cheating by 0.43 percentage points in response to the school probation policy, and roughly 0.65 percentage points due to the ending of social promotion. Given a baseline rate of cheating of 1.1 percent, these effects are substantial. The magnitude of these changes is particularly large considering that no elementary school on probation was actually reconstituted during this period, and that the social promotion policy has a direct impact on students but no direct ramifications for teacher pay or rewards. A classroom one standard deviation above the mean does not see any significant change in cheating in response to these two policies. Such classes are very unlikely to be located in schools at risk for being put on probation, and also are likely to have few students at risk for being retained. These findings are robust to

25. Logit and probit models evaluated at the mean yield comparable results, so the estimates from a linear probability model are presented for ease of interpretation.

26. The results are not sensitive to the cheating cutoff used. Note that this measure may include error due to both false positives and negatives. Since the measurement error is in the dependent variable, the primary impact will likely be to simply decrease the precision of our estimates.


TABLE V
OLS ESTIMATES OF THE RELATIONSHIP BETWEEN CHEATING AND CLASSROOM CHARACTERISTICS

Dependent variable = Indicator of classroom cheating

Independent variables                                  (1)        (2)        (3)        (4)

Social promotion policy                                0.0011     0.0011     0.0015     0.0023
                                                      (0.0013)   (0.0013)   (0.0013)   (0.0009)
School probation policy                                0.0020     0.0019     0.0021     0.0029
                                                      (0.0014)   (0.0014)   (0.0014)   (0.0013)
Prior classroom achievement                           -0.0047    -0.0028    -0.0016    -0.0028
                                                      (0.0005)   (0.0005)   (0.0007)   (0.0007)
Social promotion x classroom achievement                         -0.0049    -0.0051    -0.0046
                                                                 (0.0014)   (0.0014)   (0.0012)
School probation x classroom achievement                         -0.0070    -0.0070    -0.0064
                                                                 (0.0013)   (0.0013)   (0.0013)
Mixed grade classroom                                 -0.0084    -0.0085    -0.0089    -0.0089
                                                      (0.0007)   (0.0007)   (0.0008)   (0.0012)
% of students included in official reporting           0.0252     0.0249     0.0141     0.0131
                                                      (0.0031)   (0.0031)   (0.0037)   (0.0037)
Teacher administers exam to own students               0.0067     0.0067     0.0066     0.0061
                                                      (0.0015)   (0.0015)   (0.0015)   (0.0011)
Test form offered for the first time^a                -0.0007    -0.0007    -0.0011
                                                      (0.0011)   (0.0011)   (0.0010)
Average quality of teachers' undergraduate                                  -0.0026
  institution                                                               (0.0007)
Percent of teachers who have worked at the                                  -0.0045
  school less than 3 years                                                  (0.0031)
Percent of teachers under 30 years of age                                    0.0156
                                                                            (0.0065)
Percent of students in the school meeting                                    0.0001
  national norms in reading last year                                       (0.0000)
Percent free lunch in school                                                 0.0001
                                                                            (0.0000)
Predominantly Black school                                                   0.0068
                                                                            (0.0019)
Predominantly Hispanic school                                               -0.0009
                                                                            (0.0016)
School x Year fixed effects                            No         No         No         Yes
Number of observations                                 163,474    163,474    163,474    163,474

The unit of observation is classroom x grade x year x subject, and the sample includes eight years (1993 to 2000), four subjects (reading comprehension and three math sections), and five grades (three to seven). The dependent variable is the cheating indicator derived using the ninety-fifth percentile cutoff. Robust standard errors clustered by school x year are shown in parentheses. Other variables included in the regressions in columns (1) and (2) include a linear time trend, cubic terms for the number of students, a linear grade variable, and fixed effects for subjects. The regression shown in column (3) also includes the following variables: indicators of the percent of students in the classroom who were Black, Hispanic, male, receiving free lunch, old for grade, in a special education program, living in foster care, and living with a nonparental relative; indicators of school size, mobility rate, and attendance rate; and the percent of teachers in the school who had a master's or doctoral degree, lived in Chicago, and were education majors.

a. Test forms vary only by year, so this variable drops out of the analysis when school x year fixed effects are included.


adding a number of classroom and school characteristics (column (3)), as well as school x year fixed effects (column (4)).

Cheating also appears to be responsive to other costs and benefits. Classrooms that tested poorly last year are much more likely to cheat. For example, a classroom with average student prior achievement one classroom standard deviation below the mean is 23 percent more likely to cheat. Classrooms with students in multiple grades are 65 percent less likely to cheat than classrooms where all students are in the same grade. This is consistent with the fact that it is likely more difficult for teachers in such classrooms to cheat, since they must administer two different test forms to students, which will necessarily have different correct answers. Moreover, classes with a higher proportion of students who are included in the official test reporting are more likely to cheat: a 10 percentage point increase in the proportion of students in a class whose test scores "count" will increase the likelihood of cheating by roughly 20 percent.27

Teachers who administer the exam to their own students are 0.67 percentage points (approximately 50 percent) more likely to cheat.28

A few other notable results emerge. First, there is no statistically significant impact on cheating of reusing a test form that has been administered in a previous year. That finding is of interest because it suggests that teachers taking old exams and teaching the precise questions to students is not an important component of what we are detecting as cheating (although anecdotal evidence suggests this practice exists). Classrooms in schools with lower achievement, higher poverty rates, and more Black students are more likely to cheat. Classrooms in schools with teachers who graduated from more prestigious undergraduate institutions are less likely to cheat, whereas classrooms in schools with younger teachers are more likely to cheat.

27. Consistent with this finding, within cheating classrooms teachers are less likely to alter the answer sheets for students who are excluded from test reporting. Teachers also appear to cheat less frequently for boys and for older students.

28. In an earlier version of the paper [Jacob and Levitt 2002] we examine for which students teachers tend to cheat. We find that teachers are most likely to cheat for students in the second quartile of the national ability distribution (twenty-fifth to fiftieth percentile) in the previous year, consistent with the incentives under the probation policy.


VII. CONCLUSION

This paper develops an algorithm for determining the prevalence of teacher cheating on standardized tests and applies the results to data from the Chicago public schools. Our methods reveal thousands of instances of classroom cheating, representing 4-5 percent of the classrooms each year. Moreover, we find that teacher cheating appears quite responsive to relatively minor changes in incentives.

The obvious benefits of high-stakes tests as a means of providing incentives must therefore be weighed against the possible distortions that these measures induce. Explicit cheating is merely one form of distortion. Indeed, the kind of cheating we focus on is one that could be greatly reduced at a relatively low cost through the implementation of proper safeguards, such as is done by the Educational Testing Service on the SAT or GRE exams.29

Even if this particular form of cheating were eliminated, however, there are a virtually infinite number of dimensions along which strategic school personnel can act to game the current system. For instance, Figlio and Winicki [2001] and Rohlfs [2002] document that schools attempt to increase the caloric intake of students on test days, and disruptive students are especially likely to be suspended from school during official testing periods [Figlio 2002]. It may be prohibitively costly, or even impossible, to completely safeguard the system.

Finally, this paper fits into a small but growing body of research focused on identifying corrupt or illicit behavior on the part of economic actors (see Porter and Zona [1993], Fisman [2001], Di Tella and Schargrodsky [2001], and Duggan and Levitt [2002]). Because individuals engaged in such behavior actively attempt to hide their trails, the intellectual exercise associated with uncovering their misdeeds differs substantially from the typical economic application, in which the researcher starts with a well-defined measure of the outcome variable (e.g., earnings, economic growth, profits) and then attempts to uncover the determinants of these outcomes. In the case of corruption, there is typically no clear outcome variable, making it necessary for the

29. Although even tests administered by the Educational Testing Service are not immune to cheating, as evidenced by recent reports that many of the GRE scores from China are fraudulent [Contreras 2002].


researcher to employ nonstandard approaches in generating such a measure.

APPENDIX: THE CONSTRUCTION OF SUSPICIOUS STRING MEASURES

This appendix describes in greater detail how we construct each of our measures of unexpected or suspicious responses.

The first measure focuses on the most unlikely block of identical answers given on consecutive questions. Using past test scores, future test scores, and background characteristics, we predict the likelihood that each student will give each answer on each question. For each item a student has four choices (A, B, C, or D), only one of which is correct. We estimate a multinomial logit for each item on the exam in order to predict how students will respond to each question. We estimate the following model for each item, using information from other students in that year, grade, and subject:

(1)  \Pr(Y_{isc} = j) = \frac{e^{\beta_j X_s}}{\sum_{j=1}^{J} e^{\beta_j X_s}}

where Y_{isc} indicates the response of student s in class c on item i, the number of possible responses (J) is four, and X_s is a vector that includes measures of prior and future student achievement in math and reading, as well as demographic variables (such as race, gender, and free lunch status) for student s. Thus, a student's predicted probability of choosing a particular response is identified by the likelihood of other students (in the same year, grade, and subject) with similar background characteristics choosing that response.

Notice that by including future as well as prior test scores in the model, we decrease the likelihood that students with unusually good teachers will be identified as cheaters, since these students will likely retain some of the knowledge learned in the base year and thus have higher future test scores. Also note that by estimating the probability of selecting each possible response, rather than simply estimating the probability of choosing the correct response, we take advantage of any additional information that is provided by particular response patterns in a classroom.

Using the estimates from this model, we calculate the predicted probability that each student would answer each item in the way that he or she in fact did:

(2)  p_{isc} = \frac{e^{\beta_k X_s}}{\sum_{j=1}^{J} e^{\beta_j X_s}},  where k is the response actually chosen by student s on item i.
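For illustration, the per-item model in equations (1) and (2) can be sketched as follows, with a toy feature matrix standing in for X_s; in the actual application the model is estimated separately for every item within a year, grade, and subject cell.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 6))      # toy stand-in for X_s (test scores, demographics)
y = rng.integers(0, 4, size=500)   # toy answer choices A-D coded 0-3 on one item

# Multinomial logit (softmax) over the four answer choices, as in equation (1)
logit = LogisticRegression(max_iter=1000).fit(X, y)
probs = logit.predict_proba(X)     # Pr(Y_is = j | X_s), shape (students, 4)

# Equation (2): probability of the answer each student actually gave
p_actual = probs[np.arange(len(y)), y]
```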

This provides us with one measure per student per item. Taking the product over items within student, we calculate the probability that a student would have answered a string of consecutive questions from item m to item n as he or she did:

(3)  p_{sc}^{mn} = \prod_{i=m}^{n} p_{isc}

We then take the product across all students in the classroom who had identical responses in the string. If we define z as a student, S_{zc}^{mn} as the string of responses for student z from item m to item n, and S_{sc}^{mn} as the string for student s, then we can express the product as

(4)  \tilde{p}_{sc}^{mn} = \prod_{s \in \{z : S_{zc}^{mn} = S_{sc}^{mn}\}} p_{sc}^{mn}

Note that if there are n_s students in class c and each student has a unique set of responses to these particular items, then \tilde{p}_{sc}^{mn} collapses to p_{sc}^{mn} for each student, and there will be n_s distinct values within the class. At the other extreme, if all of the students in class c have identical responses, then there is only one distinct value of \tilde{p}_{sc}^{mn}. We repeat this calculation for all possible consecutive strings of length three to seven (i.e., all S^{mn} spanning three to seven consecutive items). To create our first indicator of suspicious string patterns, we take the minimum of the predicted block probability for each classroom:

MEASURE 1:  M1_c = \min_s (\tilde{p}_{sc}^{mn})

This measure captures the least likely block of identical answers given on consecutive questions in the classroom.
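A direct, if brute-force, implementation of Measure 1 might look like the sketch below, assuming arrays `p` and `resp` that hold each student's answer probabilities from equation (2) and coded responses; the variable names are ours.

```python
import numpy as np
from collections import defaultdict

def measure_one(p, resp, min_w=3, max_w=7):
    """p: (students x items) array of p_is from equation (2);
    resp: matching (students x items) array of coded answers."""
    n_students, n_items = p.shape
    m1 = np.inf
    for w in range(min_w, max_w + 1):            # window lengths three to seven
        for start in range(n_items - w + 1):
            # Equation (3): probability each student gives his or her own string
            p_block = p[:, start:start + w].prod(axis=1)
            # Group students with identical response strings in this window
            groups = defaultdict(list)
            for s in range(n_students):
                groups[tuple(resp[s, start:start + w])].append(s)
            # Equation (4): product across students sharing a string
            for members in groups.values():
                m1 = min(m1, p_block[members].prod())
    return m1
```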

The second measure of suspicious answer strings is intended to capture more general patterns of similarity in student responses. To construct this measure, we first calculate the residuals for each of the possible choices a student could have made for each item:

(5)  e_{jisc} = 0 - \frac{e^{\beta_j X_s}}{\sum_{j=1}^{J} e^{\beta_j X_s}}  if j \ne k;   e_{jisc} = 1 - \frac{e^{\beta_j X_s}}{\sum_{j=1}^{J} e^{\beta_j X_s}}  if j = k

where e_{jisc} is the residual for response j on item i by student s in classroom c. We thus have four separate residuals per student per item.

To create a classroom-level measure of the response to item i, we need to combine the information for each student. First, we sum the residuals for each response across students within a classroom:

(6)  \bar{e}_{jic} = \sum_s e_{jisc}

If there is no within-class correlation in the way that students responded to a particular item, this term should be approximately zero. Second, we sum across the four possible responses for each item within classrooms. At the same time, we square each of the component residual measures to accentuate outliers, and divide by the number of students in the class (n_{sc}) to normalize by class size:

(7)  v_{ic} = \frac{\sum_j \bar{e}_{jic}^2}{n_{sc}}

The statistic v_{ic} captures something like the variance of student responses on item i within classroom c. Notice that we choose to first sum the residuals of each response across students, and then sum the classroom-level measures for each response, rather than summing across responses within student initially. We do this in order to emphasize the classroom-level tendencies in response patterns.

Our second measure of suspicious strings is simply the classroom average (across items) of this variance term across all test items:

MEASURE 2:  M2_c = \bar{v}_c = \sum_i v_{ic} / n_i,  where n_i is the number of items on the exam.

Our third measure is simply the variance (as opposed to the mean) in the degree of correlation across questions within a classroom:

MEASURE 3:  M3_c = \sigma_{v_c}^2 = \sum_i (v_{ic} - \bar{v}_c)^2 / n_i
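Measures 2 and 3 follow mechanically from equations (5)-(7). A compact sketch, assuming a students x items x 4 array of predicted response probabilities and our own variable names, is:

```python
import numpy as np

def measures_two_three(probs, resp):
    """probs: (students x items x 4) predicted response probabilities;
    resp: (students x items) chosen answers coded 0-3."""
    n_students, n_items, n_choices = probs.shape
    chosen = np.zeros_like(probs)
    chosen[np.arange(n_students)[:, None], np.arange(n_items)[None, :], resp] = 1.0
    e = chosen - probs                            # equation (5): one residual per choice
    e_class = e.sum(axis=0)                       # equation (6): sum over students, (items x 4)
    v = (e_class ** 2).sum(axis=1) / n_students   # equation (7): one v_i per item
    m2 = v.mean()                                 # Measure 2: average across items
    m3 = v.var()                                  # Measure 3: variance across items
    return m2, m3
```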

Our final indicator focuses on the extent to which a student's response pattern differs from those of other students with the same aggregate score that year. Let q_{isc} equal one if student s in classroom c answered item i correctly, and zero otherwise. Let A_s equal the aggregate score of student s on the exam. We then determine what fraction of students at each aggregate score level answered each item correctly. If we let n_{sA} equal the number of students with an aggregate score of A, then this fraction \bar{q}_i^A can be expressed as

(8)  \bar{q}_i^A = \frac{\sum_{s \in \{z : A_z = A_s\}} q_{isc}}{n_{sA}}

We then calculate a measure of how much the response pattern of student s differed from the response pattern of other students with the same aggregate score. We do so by subtracting a student's answer on item i from the mean response of all students with aggregate score A, squaring these deviations, and then summing across all items on the exam:

(9)  Z_{sc} = \sum_i (q_{isc} - \bar{q}_i^A)^2

We then subtract out the mean deviation for all students with the same aggregate score, \bar{Z}_A, and sum across the students within each classroom to obtain our final indicator:

MEASURE 4:  M4_c = \sum_s (Z_{sc} - \bar{Z}_A)
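Measure 4 can likewise be sketched in a few lines, assuming a 0/1 correctness matrix for all students in a grade-year-subject cell and a classroom label for each student; the helper name is ours.

```python
import pandas as pd

def measure_four(correct, classroom):
    """correct: (students x items) 0/1 array for every student in a
    grade-year-subject cell; classroom: per-student classroom labels."""
    df = pd.DataFrame(correct)
    agg = df.sum(axis=1)                                  # aggregate score A_s
    qbar = df.groupby(agg).transform("mean").to_numpy()   # qbar_i^A, equation (8)
    z = ((df.to_numpy() - qbar) ** 2).sum(axis=1)         # Z_sc, equation (9)
    zbar = pd.Series(z).groupby(agg.to_numpy()).transform("mean")  # Zbar_A
    return (pd.Series(z) - zbar).groupby(pd.Series(classroom)).sum()  # M4 per class
```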

KENNEDY SCHOOL OF GOVERNMENT, HARVARD UNIVERSITY

UNIVERSITY OF CHICAGO AND AMERICAN BAR FOUNDATION

REFERENCES

Aiken, Linda R., "Detecting, Understanding, and Controlling for Cheating on Tests," Research in Higher Education, XXXII (1991), 725-736.

Angoff, William H., "The Development of Statistical Indices for Detecting Cheaters," Journal of the American Statistical Association, LXIX (1974), 44-49.

Baker, George, "Incentive Contracts and Performance Measurement," Journal of Political Economy, C (1992), 598-614.

Bellezza, Francis S., and Suzanne F. Bellezza, "Detection of Cheating on Multiple-Choice Tests by Using Error Similarity Analysis," Teaching of Psychology, XVI (1989), 151-155.

Carnoy, Martin, and Susanna Loeb, "Does External Accountability Affect Student Outcomes? A Cross-State Analysis," Working Paper, Stanford University School of Education, 2002.

Contreras, Russell, "Inquiry Uncovers Possible Cheating on GRE in Asia," Associated Press, August 7, 2002.

Cullen, Julie Berry, and Randall Reback, "Tinkering Toward Accolades: School Gaming under a Performance Accountability System," Working Paper, University of Michigan, 2002.

Deere, Donald, and Wayne Strayer, "Putting Schools to the Test: School Accountability, Incentives and Behavior," Working Paper, Texas A&M University, 2002.

Di Tella, Rafael, and Ernesto Schargrodsky, "The Role of Wages and Auditing during a Crackdown on Corruption in the City of Buenos Aires," unpublished manuscript, Harvard Business School, 2001.

Duggan, Mark, and Steven Levitt, "Winning Isn't Everything: Corruption in Sumo Wrestling," American Economic Review, XCII (2002), 1594-1605.

Figlio, David N., "Testing, Crime and Punishment," Working Paper, University of Florida, 2002.

Figlio, David N., and Joshua Winicki, "Food for Thought: The Effects of School Accountability Plans on School Nutrition," Working Paper, University of Florida, 2001.

Figlio, David N., and Lawrence S. Getzler, "Accountability, Ability and Disability: Gaming the System," Working Paper, University of Florida, 2002.

Fisman, Ray, "Estimating the Value of Political Connections," American Economic Review, XCI (2001), 1095-1102.

Frary, Robert B., "Statistical Detection of Multiple-Choice Answer Copying: Review and Commentary," Applied Measurement in Education, VI (1993), 153-165.

Frary, Robert B., T. Nicholas Tideman, and Thomas Watts, "Indices of Cheating on Multiple-Choice Tests," Journal of Educational Statistics, II (1977), 235-256.

Glaeser, Edward, and Andrei Shleifer, "A Reason for Quantity Regulation," American Economic Review, XCI (2001), 431-435.

Grissmer, David W., Ann Flanagan, et al., "Improving Student Achievement: What NAEP Test Scores Tell Us," RAND Corporation, MR-924-EDU, 2000.

Hanushek, Eric A., and Margaret E. Raymond, "Improving Educational Quality: How Best to Evaluate Our Schools?" Conference paper, Federal Reserve Bank of Boston, 2002.

Hofkins, Diane, "Cheating 'Rife' in National Tests," New York Times Educational Supplement, June 16, 1995.

Holland, Paul W., "Assessing Unusual Agreement between the Incorrect Answers of Two Examinees Using the K-Index: Statistical Theory and Empirical Support," Technical Report No. 96-5, Educational Testing Service, 1996.

Holmstrom, Bengt, and Paul Milgrom, "Multitask Principal-Agent Analyses: Incentive Contracts, Asset Ownership, and Job Design," Journal of Law, Economics, and Organization, VII (1991), 24-51.

Jacob, Brian A., "Accountability, Incentives and Behavior: The Impact of High-Stakes Testing in the Chicago Public Schools," Working Paper No. 8968, National Bureau of Economic Research, 2002.

Jacob, Brian A., and Lars Lefgren, "The Impact of Teacher Training on Student Achievement: Quasi-Experimental Evidence from School Reform Efforts in Chicago," Journal of Human Resources, forthcoming.

Jacob, Brian A., and Steven Levitt, "Catching Cheating Teachers: The Results of an Unusual Experiment in Implementing Theory," Working Paper No. 9413, National Bureau of Economic Research, 2003.

Klein, Stephen P., Laura S. Hamilton, et al., "What Do Test Scores in Texas Tell Us?" RAND Corporation, 2000.

Kolker, Claudia, "Texas Offers Hard Lessons on School Accountability," Los Angeles Times, April 14, 1999.

Lindsay, Drew, "Whodunit? Officials Find Thousands of Erasures on Standardized Tests and Suspect Tampering," Education Week (1996), 25-29.

Loughran, Regina, and Thomas Comiskey, "Cheating the Children: Educator Misconduct on Standardized Tests," Report of the City of New York Special Commissioner of Investigation for the New York City School District, 1999.

Marcus, John, "Faking the Grade," Boston Magazine, February 2000.

May, Meredith, "State Fears Cheating by Teachers," San Francisco Chronicle, October 4, 2000.

Perlman, Carol L., "Results of a Citywide Testing Program Audit in Chicago," Paper for American Educational Research Association annual meeting, Chicago, IL, 1985.

Porter, Robert H., and J. Douglas Zona, "Detection of Bid Rigging in Procurement Auctions," Journal of Political Economy, CI (1993), 518-538.

Richards, Craig E., and Tian Ming Sheu, "The South Carolina School Incentive Reward Program: A Policy Analysis," Economics of Education Review, XI (1992), 71-86.

Rohlfs, Chris, "Estimating Classroom Learning Using the School Breakfast Program as an Instrument for Attendance and Class Size: Evidence from Chicago Public Schools," unpublished manuscript, University of Chicago, 2002.

Tysome, Tony, "Cheating Purge: Inspectors out," New York Times Higher Education Supplement, August 19, 1994.

Wollack, James A., "A Nominal Response Model Approach for Detecting Answer Copying," Applied Psychological Measurement, XX (1997), 144-152.



substantially with a change in leadership in 1996, particularly for low-achieving students. Following the introduction of these policies, the prevalence of cheating rose sharply in low-achieving classrooms, whereas classes with average or higher-achieving students showed no increase in cheating. Cheating prevalence also appears to be systematically lower in cases where the costs of cheating are higher (e.g., in mixed-grade classrooms in which two different exams are administered simultaneously) or the benefits of cheating are lower (e.g., in classrooms with more special education or bilingual students, who take the standardized tests but whose scores are excluded from official calculations).

The remainder of the paper is structured as follows. Section II discusses the set of indicators we use to capture cheating behavior. Section III describes our identification strategy. Section IV provides a brief overview of the institutional details of the Chicago public schools and the data set that we use. Section V reports the basic empirical results on the prevalence of cheating and presents a wide range of evidence supporting the interpretation of these results as cheating. Section VI analyzes how teacher cheating responds to incentives. Section VII discusses the results and the implications for increasing reliance on high-stakes testing.

II. INDICATORS OF TEACHER CHEATING

Teacher cheating, especially in extreme cases, is likely to leave tell-tale signs. In motivating our discussion of the indicators we employ for detecting cheating, it is instructive to compare two actual classrooms taking the same test (see Figure I). Each row in Figure I represents one student's answers to each item on the test. Columns correspond to the different questions asked. The letter "A," "B," "C," or "D" means a student provided the correct answer. If a number is entered, the student answered the question incorrectly, with "1" corresponding to a wrong answer of "A," "2" corresponding to a wrong answer of "B," etc. On the right-hand side of the table, we also present student test scores for the preceding, current, and following year. Test scores are in units of "grade equivalents." The grading scale is normed so that a student at the national average for sixth graders taking the test in the eighth month of the school year would score 6.8. A typical


FIGURE I
Sample Answer Strings and Test Scores from Two Classrooms

The data in the figure represent actual answer strings and test scores from two CPS classrooms taking the same exam. The top classroom is suspected of cheating; the bottom classroom is not. Each row corresponds to an individual student. Each column represents a particular question on the exam. A letter indicates that the student gave that answer and the answer was correct. A number means that the student gave the corresponding letter answer (e.g., 1 = "A"), but the answer was incorrect. A value of "0" means the question was left blank. Student test scores, in grade equivalents, are shown in the last three columns of the figure. The test year for which the answer strings are presented is denoted year t. The scores from years t - 1 and t + 1 correspond to the preceding and following years' examinations.


student would be expected to gain one grade equivalent for each year of school.

The top panel of data shows a class in which we suspect teacher cheating took place; the bottom panel corresponds to a typical classroom. Two striking differences between the classrooms are readily apparent. First, in the cheating classroom, a large block of students provided identical answers on consecutive questions in the middle of the test, as indicated by the boxed area in the figure. For the other classroom, no such pattern exists. Second, looking at the pattern of test scores, students in the cheating classroom experienced large increases in test scores from the previous to the current year (1.7 grade equivalents on average), and actually experienced declines on average the following year. In contrast, the students in the typical classroom gained roughly one grade equivalent each year, as would be expected.

The indicators we use as evidence of cheating formalize and extend the basic picture that emerges from Figure I. We divide these indicators into two distinct groups that respectively capture unusual test score fluctuations and suspicious patterns of answer strings. In this section we describe informally the measures that we use. A more rigorous treatment of their construction is provided in the Appendix.

II.A. Indicator One: Unexpected Test Score Fluctuations

Given that the aim of cheating is to raise test scores, one signal of teacher cheating is an unusually large gain in test scores relative to how those same students tested in the previous year. Since test score gains that result from cheating do not represent real gains in knowledge, there is no reason to expect the gains to be sustained on future exams taken by these students (unless, of course, next year's teachers also cheat on behalf of the students). Thus, large gains due to cheating should be followed by unusually small test score gains for these students in the following year. In contrast, if large test score gains are due to a talented teacher, the student gains are likely to have a greater permanent component, even if some regression to the mean occurs. We construct a summary measure of how unusual the test score fluctuations are by ranking each classroom's average test score gains relative to all other


classrooms in that same subject, grade, and year,7 and computing the following statistic:

(3)  SCORE_{cbt} = (rank\_gain_{cbt})^2 + (1 - rank\_gain_{cb,t+1})^2

where rank\_gain_{cbt} is the percentile rank for class c in subject b in year t. Classes with relatively big gains on this year's test and relatively small gains on next year's test will have high values of SCORE. Squaring the individual terms gives relatively more weight to big test score gains this year and big test score declines the following year.8 In the empirical analysis we consider three possible cutoffs for what it means to have a "high" value of SCORE, corresponding to the eightieth, ninetieth, and ninety-fifth percentiles among all classrooms in the sample.
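Computed on a classroom-level data set, the SCORE statistic amounts to two within-cell percentile ranks. A minimal sketch, with hypothetical column names of our own choosing, is:

```python
import pandas as pd

def score_stat(df):
    """df columns (assumed): `cell` (a subject x grade x year identifier),
    `gain_t` (this year's average classroom gain), and `gain_t1` (the same
    students' average gain the following year)."""
    r_t = df.groupby("cell")["gain_t"].rank(pct=True)    # rank_gain_cbt
    r_t1 = df.groupby("cell")["gain_t1"].rank(pct=True)  # rank_gain_cb,t+1
    return r_t ** 2 + (1.0 - r_t1) ** 2                  # equation (3)
```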

II.B. Indicator Two: Suspicious Answer Strings

The quickest and easiest way for a teacher to cheat is to alter the same block of consecutive questions for a substantial portion of students in the class, as was apparently done in the classroom in the top panel of Figure I. More sophisticated interventions might involve skipping some questions so as to avoid a large block of identical answers, or altering different blocks of questions for different students.

We combine four different measures of how suspicious a classroom's answer strings are in determining whether a classroom may be cheating. The first measure focuses on the most unlikely block of identical answers given by students on consecutive questions. Using past test scores, future test scores, and background characteristics, we predict the likelihood that each student will give each possible answer (A, B, C, or D) on every question, using a multinomial logit. Each student's predicted probability of choosing a particular response is identified by the likelihood that other students (in the same year, grade, and subject) with similar background characteristics will choose that

7. We also experimented with more complicated mechanisms for defining large or small test score gains (e.g., predicting each student's expected test score gain as a function of past test scores and background characteristics and computing a deviation measure for each student, which was then aggregated to the classroom level), but because the results were similar, we elected to use the simpler method. We have also defined gains and losses using an absolute metric (e.g., where gains in excess of 1.5 or 2 grade equivalents are considered unusually large) and obtain comparable results.

8. In the following year, the students who were in a particular classroom are typically scattered across multiple classrooms. We base all calculations on the composition of this year's class.


response. We then search over all combinations of students and consecutive questions to find the block of identical answers given by students in a classroom least likely to have arisen by chance.9

The more unusual is the most unusual block of test responses (adjusting for class size and the number of questions on the exam, both of which increase the possible combinations over which we search), the more likely it is that cheating occurred. Thus, if ten very bright students in a class of thirty give the correct answers to the first five questions on the exam (typically the easier questions), the block of identical answers will not appear unusual. In contrast, if all fifteen students in a low-achieving classroom give the same correct answers to the last five questions on the exam (typically the harder questions), this would appear quite suspect.

The second measure of suspicious answer strings involves the overall degree of correlation in student answers across the test. When a teacher changes answers on test forms, it presumably increases the uniformity of student test forms across students in the class. This measure is meant to capture more general patterns of similarity in student responses, beyond just identical blocks of answers. Based on the results of the multinomial logit described above, for each question and each student we create a measure of how unexpected the student's response was. We then combine the information for each student in the classroom to create something akin to the within-classroom correlation in student responses. This measure will be high if students in a classroom tend to give the same answers on many questions, especially if the answers given are unexpected (i.e., correct answers on hard questions or systematic mistakes on easy questions).

Of course, within-classroom correlation may arise for many reasons other than cheating (e.g., the teacher may emphasize certain topics during the school year). Therefore, a third indicator of potential cheating is a high variance in the degree of correlation across questions. If the teacher changes answers for multiple students on selected questions, the within-class correlation on

9. Note that we do not require the answers to be correct. Indeed, in many classrooms the most unusual strings include some incorrect answers. Note also that these calculations are done under the assumption that a given student's answers are uncorrelated (conditional on observables) across questions on the exam, and that answers are uncorrelated across students. Of course, this assumption is unlikely to be true. Since all of our comparisons rely on the relative unusualness of the answers given in different classrooms, this simplifying assumption is not problematic unless the correlation within and across students varies by classroom.


those particular questions will be extremely high, while the degree of within-class correlation on other questions is likely to be typical. This leads the cross-question variance in correlations to be larger than normal in cheating classrooms.

Our final indicator compares the answers that students in one classroom give with the answers of other students in the system who take the identical test and get the exact same score. This measure relies on the fact that questions vary significantly in difficulty. The typical student will answer most of the easy questions correctly, but get many of the hard questions wrong (where "easy" and "hard" are based on how well students of similar ability do on the question). If students in a class systematically miss the easy questions while correctly answering the hard questions, this may be an indication of cheating.

Our overall measure of suspicious answer strings is constructed in a manner parallel to our measure of unusual test score fluctuations. Within a given subject, grade, and year, we rank classrooms on each of these four indicators, and then take the sum of squared ranks across the four measures:10

(4)  ANSWERS_{cbt} = (rank\_m1_{cbt})^2 + (rank\_m2_{cbt})^2 + (rank\_m3_{cbt})^2 + (rank\_m4_{cbt})^2

In the empirical work we again use three possible cutoffs for potential cheating: the eightieth, ninetieth, and ninety-fifth percentiles.
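The ANSWERS statistic in equation (4) is computed in the same way from the four string measures. In the sketch below (same assumed pandas conventions as the SCORE sketch above), each measure is taken to be oriented so that larger values are more suspicious before ranking:

```python
def answers_stat(df):
    """df columns (assumed): `cell` plus the four string measures m1..m4,
    each oriented so that larger values are more suspicious."""
    total = 0.0
    for m in ["m1", "m2", "m3", "m4"]:
        total = total + df.groupby("cell")[m].rank(pct=True) ** 2
    return total  # equation (4)
```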

III. A STRATEGY FOR IDENTIFYING THE PREVALENCE OF CHEATING

The previous section described indicators that are likely to be correlated with cheating. Because such patterns sometimes arise by chance, however, not every classroom with large test score fluctuations and suspicious answer strings is cheating. Furthermore, the particular choice of what qualifies as a "large" fluctuation or a "suspicious" set of answer strings will necessarily be arbitrary. In this section we present an identification strategy that, under a set of defensible assumptions, nonetheless provides estimates of the prevalence of cheating.

To identify the number of cheating classrooms in a given

10. Because different subjects and grades have differing numbers of questions, it is difficult to make meaningful comparisons across tests on the raw indicators.


year, we would like to compare the observed joint distribution of test score fluctuations and suspicious answer strings with a counterfactual distribution in which no cheating occurs. Differences between the two distributions would provide an estimate of how much cheating occurred. If teachers in a particular school or year are cheating, there will be more classrooms exhibiting both unusual test score fluctuations and suspicious answer strings than otherwise expected. In practice, we do not have the luxury of observing such a counterfactual. Instead, we must make assumptions about what the patterns would look like absent cheating.

Our identification strategy hinges on three key assumptions: (1) cheating increases the likelihood a class will have both large test score fluctuations and suspicious answer strings; (2) if cheating classrooms had not cheated, their distribution of test score fluctuations and answer string patterns would be identical to noncheating classrooms; and (3) in noncheating classrooms, the correlation between test score fluctuations and suspicious answers is constant throughout the distribution.11

If assumption (1) holds, then cheating classrooms will be concentrated in the upper tail of the joint distribution of unusual test score fluctuations and suspicious answer strings. Other parts of the distribution (e.g., classrooms ranked in the fiftieth to seventy-fifth percentile of suspicious answer strings) will consequently include few cheaters. As long as cheating classrooms would look similar to noncheating classrooms on our measures if they had not cheated (assumption (2)), classes in the part of the distribution with few cheaters provide the noncheating counterfactual that we are seeking.

The difficulty, however, is that we only observe this noncheating counterfactual in the bottom and middle parts of the distribution, when what we really care about is the upper tail of the distribution, where the cheaters are concentrated. Assumption (3), which requires that in noncheating classrooms the correlation between test score fluctuations and suspicious answers is constant throughout the distribution, provides a potential solution. If this assumption holds, we can use the part of the distribution that is relatively free of cheaters to project what the right tail of the distribution would look like absent cheating. The gap between the predicted and observed frequency of classrooms that

11. We formally derive the mathematical model described in this section in Jacob and Levitt [2003].


are extreme on both the test score fluctuation and suspicious answer string measures provides our estimate of cheating.

Figure II demonstrates how our identification strategy works empirically. The horizontal axis in the figure ranks classrooms according to how suspicious their answer strings are. The vertical axis is the fraction of classrooms that are above the ninety-fifth percentile on the unusual test score fluctuations measure.12

The graph combines all classrooms and all subjects in our data.13

Over most of the range (roughly from zero to the seventy-fifth percentile on the horizontal axis), there is a slight positive correlation between unusual test scores and suspicious answer strings.

12. The choice of the ninety-fifth percentile is somewhat arbitrary. In the empirical work that follows, we also consider the eightieth and ninetieth percentiles. The choice of the cutoff does not affect the basic patterns observed in the data.

13. To construct the figure, classes were rank ordered according to their answer strings and divided into 200 equally sized segments. The circles in the figure represent these 200 local means. The line displayed in the graph is the fitted value of a regression with a seventh-order polynomial in a classroom's rank on the suspicious strings measure.

FIGURE II
The Relationship between Unusual Test Scores and Suspicious Answer Strings

The horizontal axis reflects a classroom's percentile rank in the distribution of suspicious answer strings within a given grade, subject, and year, with zero representing the least suspicious classroom and one representing the most suspicious classroom. The vertical axis is the probability that a classroom will be above the ninety-fifth percentile on our measure of unusual test score fluctuations. The circles in the figure represent averages from 200 equally spaced cells along the x-axis. The predicted line is based on a probit model estimated with seventh-order polynomials in the suspicious string measure.


This is the part of the distribution that is unlikely to contain many cheating classrooms. Under the assumption that the correlation between our two measures is constant for noncheating classrooms over the whole distribution, we would predict that the straight line observed for all but the right-hand portion of the graph would continue all the way to the right edge of the graph.

Instead, as one approaches the extreme right tail of the distribution of suspicious answer strings, the probability of large test score fluctuations rises dramatically. That sharp rise, we argue, is a consequence of cheating. To estimate the prevalence of cheating, we simply compare the actual area under the curve in the far right tail of Figure II to the predicted area under the curve in the right tail, projecting the linear relationship observed from the fiftieth to seventy-fifth percentile out to the ninety-ninth percentile. Because this identification strategy is necessarily indirect, we devote a great deal of attention to presenting a wide variety of tests of the validity of our approach, the sensitivity of the results to alternative assumptions, and the plausibility of our findings.
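To make the projection logic concrete, the following sketch estimates the excess mass in the right tail under a simplified version of this procedure: it fits the linear relationship on the fiftieth to seventy-fifth percentile range and attributes the gap between observed and predicted frequencies above the cutoff to cheating. The input names and the reduction of the joint-distribution comparison to a single threshold are our own simplifications, not the paper's exact algorithm.

```python
import numpy as np

def estimate_prevalence(answers_rank, high_fluct, lo=0.50, hi=0.75, tail=0.95):
    """answers_rank: percentile rank on the suspicious-strings measure, one
    entry per classroom; high_fluct: indicator for being above the 95th
    percentile on the SCORE measure."""
    mid = (answers_rank >= lo) & (answers_rank <= hi)
    # Linear fit on the portion of the distribution with few cheaters
    slope, intercept = np.polyfit(answers_rank[mid], high_fluct[mid].astype(float), 1)
    in_tail = answers_rank > tail
    predicted = (intercept + slope * answers_rank[in_tail]).sum()
    observed = high_fluct[in_tail].sum()
    # Excess observed mass in the tail, as a share of all classrooms
    return (observed - predicted) / len(answers_rank)
```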

IV. DATA AND INSTITUTIONAL BACKGROUND

Elementary students in Chicago public schools take a standardized multiple-choice achievement exam known as the Iowa Test of Basic Skills (ITBS). The ITBS is a national norm-referenced exam with a reading comprehension section and three separate math sections.14 Third through eighth grade students in Chicago are required to take the exams each year. Most schools administer the exams to first and second grade students as well, although this is not a district mandate.

Our base sample includes all students in third to seventh grade for the years 1993-2000.15 For each student we have the question-by-question answer string on each year's tests, school and classroom identifiers, the full history of prior and future test scores, and demographic variables including age, sex, race, and free lunch eligibility. We also have information about a wide

14. There are also other parts of the test which are either not included in official school reporting (spelling, punctuation, grammar) or are given only in select grades (science and social studies), for which we do not have information.

15. We also have test scores for eighth graders, but we exclude them because our algorithm requires test score data for the following year, and the ITBS test is not administered to ninth graders.


range of school-level characteristics. We do not, however, have individual teacher identifiers, so we are unable to directly link teachers to classrooms or to track a particular teacher over time.

Because our cheating proxies rely on comparisons to past and future test scores, we drop observations that are missing reading or math scores in either the preceding year or the following year (in addition to those with missing test scores in the baseline year).16 Less than one-half of 1 percent of students with missing demographic data are also excluded from the analysis. Finally, because our algorithms for identifying cheating rely on identifying suspicious patterns within a classroom, our methods have little power in classrooms with small numbers of students. Consequently, we drop all classrooms for which we have fewer than ten valid students in a particular grade after our other exclusions (roughly 3 percent of students). A handful of classrooms recorded as having more than 40 students (presumably multiple classrooms not separately identified in the data) are also dropped. Our final data set contains roughly 20,000 students per grade per year, distributed across approximately 1,000 classrooms, for a total of over 40,000 classroom-years of data (with four subject tests per classroom-year) and over 700,000 student-year observations.

Summary statistics for the full sample of classrooms are shown in Table I, where the unit of observation is a class x subject x year. As in many urban school districts, students in Chicago are disproportionately poor and minority. Roughly 60 percent of students are Black and 26 percent are Hispanic, with nearly 85 percent receiving free or reduced-price lunch. Less than 29 percent of students scored at or above national norms in reading.

The ITBS exams are administered over a weeklong period in early May. Third grade teachers are permitted to administer the exam to their own students, while other teachers switch classes to administer the exams. The exams are generally delivered to the schools one to two weeks before testing, and are supposed to be

16. The exact number of students with missing test data varies by year and grade. Overall, we drop roughly 12 percent of students because of missing test score information in the baseline year. These are students who (i) were not tested because of placement in bilingual or special education programs, or (ii) simply missed the exam that particular year. In addition to these students, we also drop approximately 13 percent of students who were missing test scores in either the preceding or following years. Test data may be missing either because a student did not attend school on the days of the test, or because the student transferred into the CPS system in the current year or left the system prior to the next year of testing.


kept in a secure location by the principal or the school's test coordinator, an individual in the school designated to coordinate the testing process (often a counselor or administrator). Each

TABLE I
SUMMARY STATISTICS

                                                              Mean (s.d.)
Classroom characteristics
  Mixed grade classroom                                         0.073
  Teacher administers exams to her own students (3rd grade)     0.206
  Percent of students who were tested and included in
    official reporting                                          0.883
  Average prior achievement
    (as deviation from year-grade-subject mean)                -0.004 (0.661)
  % Black                                                       0.595
  % Hispanic                                                    0.263
  % Male                                                        0.495
  % Old for grade                                               0.086
  % Living in foster care                                       0.044
  % Living with nonparental relative                            0.104
  Cheater (95th percentile cutoff)                              0.013
School characteristics
  Average quality of teachers' undergraduate institution
    in the school                                              -2.550 (0.877)
  Percent of teachers who live in Chicago                       0.712
  Percent of teachers who have an MA or a PhD                   0.475
  Percent of teachers who majored in education                  0.712
  Percent of teachers under 30 years of age                     0.114
  Percent of teachers at the school less than 3 years           0.547
  % students at national norms in reading last year            28.8
  % students receiving free lunch in school                    84.7
  Predominantly Black school                                    0.522
  Predominantly Hispanic school                                 0.205
  Mobility rate in school                                      28.6
  Attendance rate in school                                    92.6
  School size                                                 722 (317)
Accountability policy
  Social promotion policy                                       0.215
  School probation policy                                       0.127
  Test form offered for the first time                          0.371
Number of observations                                        163,474

The data summarized above include one observation per classroom-year-subject. The sample includes all students in third to seventh grade for the years 1993–2000. We drop observations for the following reasons: (a) missing reading or math scores in either the baseline year, the preceding year, or the following year; (b) missing demographic data; (c) classrooms for which we have fewer than ten valid students in a particular grade, or more than 40 students. See the discussion in the text for a more detailed explanation of the sample, including the percentages excluded for various reasons.


section of the exam consists of 30 to 60 multiple-choice questions, which students are given between 30 and 75 minutes to complete.17 Students mark their responses on answer sheets, which are scanned to determine a student's score. There is no penalty for guessing. A student's raw score is simply calculated as the sum of correct responses on the exam. The raw score is then translated into a grade equivalent.

After collecting student exams, teachers or administrators then "clean" the answer keys, erasing stray pencil marks, removing dirt or debris from the form, and darkening item responses that were only faintly marked by the student. At the end of the testing week, the test coordinators at each school deliver the completed answer keys and exams to the CPS central office. School personnel are not supposed to keep copies of the actual exams, although school officials acknowledge that a number of teachers each year do so. The CPS has administered three different versions of the ITBS between 1993 and 2000. The CPS alternates forms each year, with new forms being offered for the first time in 1993, 1994, and 1997.18

V. THE PREVALENCE OF TEACHER CHEATING

As noted earlier, our estimates of the prevalence of cheating are derived from a comparison of the actual number of classrooms that are above some threshold on both of our cheating indicators, relative to the number we would expect to observe based on the correlation between these two measures in the 50th–75th percentiles of the suspicious string measure. The top panel of Table II presents our estimates of the percentage of classrooms that are cheating, on average, on a given subject test (i.e., reading comprehension or one of the three math tests) in a given year.19 Because

17. The mathematics and reading tests measure basic skills. The reading comprehension exam consists of three to eight short passages, followed by up to nine questions relating to the passage. The math exam consists of three sections that assess number concepts, problem-solving, and computation.

18. These three forms are used for retesting, summer school testing, and midyear testing as well, so that it is likely that over the years teachers have seen the same exam on a number of occasions.

19. It is important to note that we exclude a particular form of cheating that appears to be quite prevalent in the data: teachers randomly filling in answers left blank by students at the end of the exam. In many classrooms, almost every student will end the test with a long string of the identical answers (typically all the same response, like "B" or "D"). The fact that almost all students in the class coordinate on the same pattern strongly suggests that the students themselves did not fill in the blanks, or were under explicit instructions by the teacher to do


the decision as to what cutoff signifies a "high" value on our cheating indicators is arbitrary, we present a 3 × 3 matrix of estimates using three different thresholds (eightieth, ninetieth, and ninety-fifth percentiles) for each of our cheating measures. The estimated prevalence of cheaters ranges from 1.1 percent to 2.1 percent, depending on the particular set of cutoffs used. As would be expected, the number of cheaters is generally declining as higher thresholds are employed. Nonetheless, it is encouraging that over such a wide set of cutoffs, the range of estimates is relatively tight.
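To make the excess-mass logic concrete, the following sketch (in Python; the DataFrame and the column names score_rank and answers_rank are hypothetical) computes the implied prevalence for one pair of cutoffs. It illustrates the calculation described above rather than reproducing the authors' actual code.

    import pandas as pd

    def prevalence_estimate(df, score_cut=0.95, answers_cut=0.95):
        # df: one row per classroom-subject-year, with percentile ranks
        # (scaled 0-1) on the two cheating indicators.
        # Benchmark: how often SCORE is "high" among classrooms in the
        # 50th-75th percentile range of the suspicious-strings measure,
        # where cheating is assumed to be rare.
        benchmark = df[(df.answers_rank >= 0.50) & (df.answers_rank < 0.75)]
        expected = (benchmark.score_rank >= score_cut).mean()

        # Observed: how often SCORE is "high" among classrooms with
        # suspicious answer strings.
        suspicious = df[df.answers_rank >= answers_cut]
        observed = (suspicious.score_rank >= score_cut).mean()

        # The excess mass, scaled by the share of suspicious classrooms,
        # is the estimated fraction of all classrooms that cheated.
        share = (df.answers_rank >= answers_cut).mean()
        return (observed - expected) * share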

The bottom panel of Table II presents estimates of the percentage of classrooms that are cheating on any of the four subject tests in a particular year. If every classroom that cheated did so only on one subject test, then the results in the bottom panel

so. Since there is no penalty for guessing on the test, filling in the blanks can only increase student test scores. While this type of teacher behavior is likely to be viewed by many as unethical, we do not make it the focus of our analysis because (1) it is difficult to provide definitive evidence of such behavior (a teacher could argue that he or she instructed students well in advance of the test to fill in all blanks with the letter "C" as part of good test-taking strategy), and (2) in our minds, it is categorically different than a teacher who systematically changes student responses to the correct answer.

TABLE II
ESTIMATED PREVALENCE OF TEACHER CHEATING

                                     Cutoff for test score fluctuations (SCORE)
Cutoff for suspicious answer
strings (ANSWERS)                 80th percentile   90th percentile   95th percentile

Percent cheating on a particular test
  80th percentile                       2.1               2.1               1.8
  90th percentile                       1.8               1.8               1.5
  95th percentile                       1.3               1.3               1.1

Percent cheating on at least one of the four tests given
  80th percentile                       4.5               5.6               5.3
  90th percentile                       4.2               4.9               4.4
  95th percentile                       3.5               3.8               3.4

The top panel of the table presents estimates of the percentage of classrooms cheating on a particular subject test in a given year, based on three alternative cutoffs for ANSWERS and SCORE. In all cases, the prevalence of cheating is based on the excess number of classrooms with unexpected test score fluctuations among classes with suspicious answer strings, relative to classes that do not have suspicious answer strings. The bottom panel of the table presents estimates of the percentage of classrooms cheating on at least one of the four subject tests that comprise the overall test. In the bottom panel, classrooms that cheat on more than one subject test are counted only once. Our sample includes over 35,000 third- through seventh-grade classrooms in the Chicago public schools for the years 1993–1999.


would simply be four times the results in the top panel. In many instances, however, classrooms appear to cheat on multiple subjects. Thus, the prevalence rates range from 3.4 to 5.6 percent of all classrooms.20 Because of the necessarily indirect nature of our identification strategy, we explore a range of supplementary tests designed to assess the validity of the estimates.21

V.A. Simulation Results

Our prevalence estimates may be biased downward or upward, depending on the extent to which our algorithm fails to successfully identify all instances of cheating ("false negatives") versus the frequency with which we wrongly classify a teacher as cheating ("false positives").

One simple simulation exercise we undertook with respect to possible false positives involves randomly assigning students to hypothetical classrooms. These synthetic classrooms thus consist of students who in actuality had no connection to one another. We then analyze these hypothetical classrooms using the same algorithm applied to the actual data. As one would hope, no evidence of cheating was found in the simulated classes. Indeed, the estimated prevalence of cheating was slightly negative in this simulation (i.e., classrooms with large test score increases in the current year followed by big declines the next year were slightly less likely to have unusual patterns of answer strings). Thus, there does not appear to be anything mechanical in our algorithm that generates evidence of cheating.
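A minimal sketch of this placebo exercise, assuming a student-level DataFrame with grade, year, and classroom columns, is given below; everything except the shuffling step simply reuses whatever detection routine is applied to the real data.

    import numpy as np

    def synthetic_classrooms(students, seed=0):
        # Randomly permute classroom assignments within each grade-year
        # cell, preserving the observed distribution of class sizes.
        rng = np.random.default_rng(seed)
        out = students.copy()
        for _, idx in out.groupby(['grade', 'year']).groups.items():
            out.loc[idx, 'classroom'] = rng.permutation(
                out.loc[idx, 'classroom'].values)
        return out

    # Running the full algorithm on synthetic_classrooms(students) should
    # yield a prevalence estimate near zero if nothing mechanical in the
    # indicators generates spurious evidence of cheating.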

A second simulation involves artificially altering a classroom's

20. Computation of the overall prevalence is somewhat complicated, because it involves calculating not only how many classrooms are actually above the thresholds on multiple subject tests, but also how frequently this would occur in the absence of cheating. Details on these calculations are available from the authors.

21. In an earlier version of this paper [Jacob and Levitt 2003], we present a number of additional sensitivity analyses that confirm the basic results. First, we test whether our results are sensitive to the way in which we predict the prevalence of high values of ANSWERS and SCORE in the upper part of the other distribution (e.g., fitting a linear or quadratic model to the data in the lower portion of the distribution, versus simply using the average value from the 50th–75th percentiles). We find our results are extremely robust. Second, we examine whether our results might simply be due to mean reversion, by testing whether classrooms with suspicious answer strings are less likely to maintain large test score gains. Among a sample of classrooms with large test score gains, we find that mean reversion is substantially greater among the set of classes with highly suspicious answer strings. Third, we investigate whether we might be detecting student rather than teacher cheating, by examining whether students who have suspicious answer strings in one year are more likely to have suspicious answer strings in other years. We find that they do not.


answer strings in ways that we believe mimic teacher cheating or outstanding teaching.22 We are then able to see how frequently we label these altered classes as "cheating," and how this varies with the extent and nature of the changes we make to a classroom's answers. We simulate two different types of teacher cheating. The first is a very naive version, in which a teacher starts cheating at the same question for a number of students and changes consecutive questions to the right answers for these students, creating a block of identical and correct responses. The second type of cheating is much more sophisticated: we randomly change answers from incorrect to correct for selected students in a class. The outcome of this simulation reveals that our methods are surprisingly weak in detecting moderate cheating, even in the naive case. For instance, when three consecutive answers are changed to correct for 25 percent of a class, we catch such behavior only 4 percent of the time. Even when six questions are changed for half of the class, we catch the cheating in less than 60 percent of the cases. Only when the cheating is very extreme (e.g., six answers changed for the entire class) are we very likely to catch the cheating. The explanation for this poor performance is that classrooms with low test-score gains, even with the boost provided by this cheating, do not have large enough test score fluctuations to make them appear unusual, because there is so much inherent volatility in test scores. When the cheating takes the more sophisticated form described above, we are even less successful at catching low and moderate cheaters (2 percent and 34 percent, respectively, in the first two scenarios described above), but we almost always detect very extreme cheating.
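The naive form of simulated cheating can be written compactly; in the sketch below (array layout assumed), a block of consecutive items is overwritten with the correct answers for a randomly chosen share of the class.

    import numpy as np

    def inject_naive_cheating(answers, key, start, length, share, seed=0):
        # answers: (n_students, n_items) array of responses, e.g. 'A'-'D'
        # key:     (n_items,) array of correct answers
        # Overwrite `length` consecutive items, beginning at `start`,
        # for a random `share` of the class.
        rng = np.random.default_rng(seed)
        altered = answers.copy()
        n = altered.shape[0]
        rows = rng.choice(n, size=int(round(share * n)), replace=False)
        altered[rows, start:start + length] = key[start:start + length]
        return altered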

From a political or equity perspective, however, we may be even more concerned with false positives, specifically the risk of accusing highly effective teachers of cheating. Hence, we also simulate the impact of a good teacher changing certain answers for certain students from incorrect to correct responses. Our manipulation in this case differs from the cheating simulations above in two ways: (1) we allow some of the gains to be preserved in the following year, and (2) we alter questions such that students will not get random answers correct, but rather will tend to show the greatest improvement on the easiest questions that the students were getting wrong. These simulation results suggest

22. See Jacob and Levitt [2003] for the precise details of this simulation exercise.


that only rarely are good teachers mistakenly labeled as cheaters. For instance, a teacher whose prowess allows him or her to raise test scores by six questions for half the students in the class (more than a 0.5 grade equivalent increase, on average, for the class) is labeled a cheater only 2 percent of the time. In contrast, a naive cheater with the same test score gains is correctly identified as a cheater in more than 50 percent of cases.

In another test for false positives, we explored whether we might be mistaking emphasis on certain subject material for cheating. For example, if a math teacher spends several months on fractions with a particular class, one would expect the class to do particularly well on all of the math questions relating to fractions, and perhaps worse than average on other math questions. One might imagine a similar scenario in reading if, for example, a teacher creates a lengthy unit on the Underground Railroad, which later happens to appear in a passage on the reading comprehension exam. To examine this, we analyze the nature of the across-question correlations in cheating versus noncheating classrooms. We find that the most highly correlated questions in cheating classrooms are no more likely to measure the same skill (e.g., fractions, geometry) than in noncheating classrooms. The implication of these results is that the prevalence estimates presented above are likely to substantially understate the true extent of cheating: there is little evidence of false positives, but we frequently miss moderate cases of cheating.

V.B. The Correlation of Cheating Indicators across Subjects, Classrooms, and Years

If what we are detecting is truly cheating, then one would expect that a teacher who cheats on one part of the test would be more likely to cheat on other parts of the test. Also, a teacher who cheats one year would be more likely to cheat the following year. Finally, to the extent that cheating is either condoned by the principal or carried out by the test coordinator, one would expect to find multiple classes in a school cheating in any given year, and perhaps even that cheating in a school one year predicts cheating in future years. If what we are detecting is not cheating, then one would not necessarily expect to find strong correlation in our cheating indicators across exams for a specific classroom, across classrooms, or across years.


Table III reports regression results testing these predictions. The dependent variable is an indicator for whether we believe a classroom is likely to be cheating on a particular subject test, using our most stringent definition (above the ninety-fifth percentile on both cheating indicators). The baseline probability of

TABLE III
PATTERNS OF CHEATING WITHIN CLASSROOMS AND SCHOOLS

Dependent variable = Class suspected of cheating
(mean of the dependent variable = 0.011)

                                                            Sample of classes and
                                          Full sample       schools that existed
                                                              in the prior year
Independent variables                    (1)       (2)        (3)       (4)

Classroom cheated on exactly one        0.105     0.103      0.101     0.101
  other subject this year              (0.008)   (0.008)    (0.009)   (0.009)
Classroom cheated on exactly two        0.289     0.285      0.243     0.243
  other subjects this year             (0.027)   (0.027)    (0.031)   (0.031)
Classroom cheated on all three          0.627     0.622      0.595     0.595
  other subjects this year             (0.051)   (0.051)    (0.054)   (0.054)
Cheating rate among all other classes    --       0.166      0.134     0.129
  in the school this year on this                (0.030)    (0.027)   (0.027)
  subject
Cheating rate among all other classes    --       0.023      0.059     0.045
  in the school this year on other               (0.024)    (0.026)   (0.029)
  subjects
Cheating in this classroom in this       --        --        0.096     0.091
  subject last year                                         (0.012)   (0.012)
Number of other subjects this            --        --        0.023     0.018
  classroom cheated on last year                            (0.004)   (0.004)
Cheating in this classroom ever in       --        --         --       0.006
  the past                                                            (0.002)
Cheating rate among other classrooms     --        --         --       0.090
  in this school in past years                                        (0.040)
Fixed effects for grade-subject-year    Yes       Yes        Yes       Yes
R^2                                     0.090     0.093      0.109     0.109
Number of observations                 165,578   165,578    94,182    94,170

The dependent variable is an indicator for whether a classroom is above the 95th percentile on both our suspicious strings and unusual test score measures of cheating on a particular subject test. Estimation is done using a linear probability model. The unit of observation is classroom-grade-year-subject. For columns that include measures of cheating in prior years, observations where that classroom or school does not appear in the data in the prior year are excluded. Standard errors are clustered at the school level to take into account correlations across classrooms as well as serial correlation.


qualifying as a cheater for this cutoff is 1.1 percent. To fully appreciate the magnitude of the effects implied by the table, it is important to keep this very low baseline in mind. We report estimates from linear probability models (probits yield similar marginal effects), with standard errors clustered at the school level.

Column 1 of Table III shows that cheating on other tests in the same year is an extremely powerful predictor of cheating in a different subject. If a classroom cheats on exactly one other subject test, the predicted probability of cheating on this test increases by over ten percentage points. Since the baseline cheating rate is only 1.1 percent, classrooms cheating on exactly one other test are ten times more likely to have cheated on this subject than are classrooms that did not cheat on any of the other subjects (which is the omitted category). Classrooms that cheat on two other subjects are almost 30 times more likely to cheat on this test, relative to those not cheating on other tests. If a class cheats on all three other subjects, it is 50 times more likely to also cheat on this test.

There also is evidence of correlation in cheating within schools. A ten-percentage-point increase in cheating classrooms in a school (excluding the classroom in question) on the same subject test raises the likelihood this class cheats by roughly 0.16 percentage points. This potentially suggests some role for centralized cheating by a school counselor, test coordinator, or the principal, rather than by teachers operating independently. There is little evidence that cheating rates within the school on other subject tests affect cheating on this test.

When making comparisons across years (columns 3 and 4), it is important to note that we do not actually have teacher identifiers. We do, however, know what classroom a student is assigned to. Thus, we can only compare the correlation between past and current cheating in a given classroom. To the extent that teacher turnover occurs or teachers switch classrooms, this proxy will be contaminated. Even given this important limitation, cheating in the classroom last year predicts cheating this year. In column 3, for example, we see that classrooms that cheated in the same subject last year are 9.6 percentage points more likely to cheat this year, even after we control for cheating on other subjects in the same year and cheating in other classes in the school. Column 4 shows that prior cheating in the school strongly predicts the likelihood that a classroom will cheat this year.
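Regressions of this form are straightforward to estimate with standard tools. The sketch below, with hypothetical column names, fits a linear probability model corresponding to the first column of Table III, with standard errors clustered at the school level.

    import statsmodels.formula.api as smf

    # df: one row per classroom-grade-year-subject; 'cheat' is the
    # 95th-percentile cheating indicator, and the cheat_*_other dummies
    # mark cheating on exactly one, two, or all three other subjects.
    model = smf.ols(
        'cheat ~ cheat_one_other + cheat_two_other + cheat_three_other'
        ' + C(grade):C(subject):C(year)',
        data=df)
    result = model.fit(cov_type='cluster',
                       cov_kwds={'groups': df['school_id']})
    print(result.summary())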


V.C. Evidence from Retesting under Controlled Circumstances

Perhaps the most compelling evidence for our methods comes from the results of a unique policy intervention. In spring 2002 the Chicago public schools provided us the opportunity to retest over 100 classrooms under controlled circumstances that precluded cheating, a few weeks after the initial ITBS exam was administered.23 Based on suspicious answer strings and unusual test score gains on the initial spring 2002 exam, we identified a set of classrooms most likely to have cheated. In addition, we chose a second group of classrooms that had suspicious answer strings but did not have large test score gains, a pattern that is consistent with a bad teacher who attempts to mask a poor teaching performance by cheating. We also retested two sets of classrooms as control groups. The first control group consisted of classes with large test score gains but no evidence of cheating in their answer strings, a potential indication of effective teaching. These classes would be expected to maintain their gains when retested, subject to possible mean reversion. The second control group consisted of randomly selected classrooms.

The results of the retests are reported in Table IV. Columns 1–3 correspond to reading; columns 4–6 are for math. For each subject, we report the test score gain (in standard score units) for three periods: (i) from spring 2001 to the initial spring 2002 test, (ii) from the initial spring 2002 test to the retest a few weeks later, and (iii) from spring 2001 to the spring 2002 retest. For purposes of comparison, we also report the average gain for all classrooms in the system for the first of those measures (the other two measures are unavailable unless a classroom was retested).

Classrooms prospectively identified as "most likely to have cheated" experienced gains on the initial spring 2002 test that were nearly twice as large as the typical CPS classroom (see columns 1 and 4). On the retest, however, those excess gains completely disappeared: the gains between spring 2001 and the spring 2002 retest were close to the systemwide average. Those classrooms prospectively identified as "bad teachers likely to have cheated" also experienced large score declines on the retest (-8.8 and -10.5 standard score points, or roughly two-thirds of a grade equivalent). Indeed, comparing the spring 2001 results with the

23. Full details of this policy experiment are reported in Jacob and Levitt [2003]. The results of the retests were used by CPS to initiate disciplinary action against a number of teachers, principals, and staff.


TABLE IV
RESULTS OF RETESTS: COMPARISON OF RESULTS FOR SPRING 2002 ITBS AND AUDIT TEST

                                   Reading gains between        Math gains between
                                 Spring   Spring   Spring    Spring   Spring   Spring
                                 2001     2001     2002      2001     2001     2002
                                 and      and      and       and      and      and
Category of classroom            spring   2002     2002      spring   2002     2002
                                 2002     retest   retest    2002     retest   retest

All classrooms in the Chicago
  public schools                  14.3      --       --       16.9      --       --
Classrooms prospectively
  identified as most likely
  cheaters (N = 36 on math,
  N = 39 on reading)              28.8     12.6    -16.2      30.0     19.3    -10.7
Classrooms prospectively
  identified as bad teachers
  suspected of cheating
  (N = 16 on math, N = 20
  on reading)                     16.6      7.8     -8.8      17.3      6.8    -10.5
Classrooms prospectively
  identified as good teachers
  who did not cheat (N = 17
  on math, N = 17 on reading)     20.6     21.1      0.5      28.8     25.5     -3.3
Randomly selected classrooms
  (N = 24 overall, but only
  one test per classroom)         14.5     12.2     -2.3      14.5     12.2     -2.3

Results in this table compare test scores at three points in time (spring 2001, spring 2002, and a retest administered under controlled conditions a few weeks after the initial spring 2002 test). The unit of observation is a classroom-subject test pair. Only students taking all three tests are included in the calculations. Because of limited data, math and reading results for the randomly selected classrooms are combined. Only the first column for each subject is available for all CPS classrooms, since audits were performed only on a subset of classrooms. All entries in the table are in standard score units.


spring 2002 retest, children in these classrooms had gains only half as large as the typical yearly CPS gain. In stark contrast, classrooms prospectively identified as having "good teachers" (i.e., classrooms with large gains but not suspicious answer strings) actually scored even higher on the reading retest than they did on the initial test. The math scores fell slightly on retesting, but these classrooms continued to experience extremely large gains. The randomly selected classrooms also maintained almost all of their gains when retested, as would be expected. In summary, our methods proved to be extremely effective both in identifying likely cheaters and in correctly classifying teachers who achieved large test score gains in a legitimate manner.

VI. DOES TEACHER CHEATING RESPOND TO INCENTIVES?

From the perspective of economics, perhaps the most interesting question related to teacher cheating is the degree to which it responds to incentives. As noted in the Introduction, there were two major changes in the incentives faced by teachers and students over our sample period. Prior to 1996, ITBS scores were primarily used to provide teachers and parents with a sense of how a child was progressing academically. Beginning in 1996, with the appointment of Paul Vallas as CEO of Schools, the CPS launched an initiative designed to hold students and teachers accountable for student learning.

The reform had two main elements. The first involved putting schools "on probation" if less than 15 percent of students scored at or above national norms on the ITBS reading exams. The CPS did not use math performance to determine probation status. Probation schools that do not exhibit sufficient improvement may be reconstituted, a procedure that involves closing the school and dismissing or reassigning all of the school staff.24 It is clear from our discussions with teachers and administrators that being on probation is viewed as an extremely undesirable circumstance. The second piece of the accountability reform was an end to social promotion, the practice of passing students to the next grade regardless of their academic skills or school performance. Under the new policy, students in third, sixth, and eighth grade must meet minimum standards on the ITBS in both reading and

24. For a more detailed analysis of the probation policy, see Jacob [2002] and Jacob and Lefgren [forthcoming].


mathematics in order to be promoted to the next grade. The promotion standards were implemented in spring 1997 for third and sixth graders. Promotion decisions are based solely on scores in reading comprehension and mathematics.

Table V presents OLS estimates of the relationship between teacher cheating and a variety of classroom and school characteristics.25 The unit of observation is a classroom-subject-grade-year. The dependent variable is an indicator of whether we suspect the classroom cheated, using our most stringent definition: a classroom is designated a cheater if both its test score fluctuations and the suspiciousness of its answer strings are above the ninety-fifth percentile for that grade, subject, and year.26

In column (1), the policy changes are restricted to have a constant impact across all classrooms. The introduction of the social promotion and probation policies is positively correlated with the likelihood of classroom cheating, although the point estimates are not statistically significant at conventional levels. Much more interesting results emerge when we interact the policy changes with the previous year's test scores for the classroom. For both probation and social promotion, cheating rates in the lowest performing classrooms prove to be quite responsive to the change in incentives. In column (2), we see that a classroom one standard deviation below the mean increased its likelihood of cheating by 0.43 percentage points in response to the school probation policy, and roughly 0.65 percentage points due to the ending of social promotion. Given a baseline rate of cheating of 1.1 percent, these effects are substantial. The magnitude of these changes is particularly large considering that no elementary school on probation was actually reconstituted during this period, and that the social promotion policy has a direct impact on students but no direct ramifications for teacher pay or rewards. A classroom one standard deviation above the mean does not see any significant change in cheating in response to these two policies. Such classes are very unlikely to be located in schools at risk for being put on probation, and also are likely to have few students at risk for being retained. These findings are robust to

25. Logit and probit models evaluated at the mean yield comparable results, so the estimates from a linear probability model are presented for ease of interpretation.

26. The results are not sensitive to the cheating cutoff used. Note that this measure may include error due to both false positives and negatives. Since the measurement error is in the dependent variable, the primary impact will likely be to simply decrease the precision of our estimates.


TABLE V
OLS ESTIMATES OF THE RELATIONSHIP BETWEEN CHEATING AND CLASSROOM CHARACTERISTICS

Dependent variable = Indicator of classroom cheating

Independent variables                          (1)        (2)        (3)        (4)

Social promotion policy                      0.0011     0.0011     0.0015     0.0023
                                            (0.0013)   (0.0013)   (0.0013)   (0.0009)
School probation policy                      0.0020     0.0019     0.0021     0.0029
                                            (0.0014)   (0.0014)   (0.0014)   (0.0013)
Prior classroom achievement                 -0.0047    -0.0028    -0.0016    -0.0028
                                            (0.0005)   (0.0005)   (0.0007)   (0.0007)
Social promotion x classroom achievement       --      -0.0049    -0.0051    -0.0046
                                                       (0.0014)   (0.0014)   (0.0012)
School probation x classroom achievement       --      -0.0070    -0.0070    -0.0064
                                                       (0.0013)   (0.0013)   (0.0013)
Mixed grade classroom                       -0.0084    -0.0085    -0.0089    -0.0089
                                            (0.0007)   (0.0007)   (0.0008)   (0.0012)
% of students included in official           0.0252     0.0249     0.0141     0.0131
  reporting                                 (0.0031)   (0.0031)   (0.0037)   (0.0037)
Teacher administers exam to own              0.0067     0.0067     0.0066     0.0061
  students                                  (0.0015)   (0.0015)   (0.0015)   (0.0011)
Test form offered for the first time (a)    -0.0007    -0.0007    -0.0011      --
                                            (0.0011)   (0.0011)   (0.0010)
Average quality of teachers'                   --         --      -0.0026      --
  undergraduate institution                                       (0.0007)
Percent of teachers who have worked at         --         --      -0.0045      --
  the school less than 3 years                                    (0.0031)
Percent of teachers under 30 years of age      --         --       0.0156      --
                                                                  (0.0065)
Percent of students in the school meeting      --         --       0.0001      --
  national norms in reading last year                             (0.0000)
Percent free lunch in school                   --         --       0.0001      --
                                                                  (0.0000)
Predominantly Black school                     --         --       0.0068      --
                                                                  (0.0019)
Predominantly Hispanic school                  --         --      -0.0009      --
                                                                  (0.0016)
School-year fixed effects                     No         No         No        Yes
Number of observations                      163,474    163,474    163,474    163,474

The unit of observation is classroom-grade-year-subject, and the sample includes eight years (1993 to 2000), four subjects (reading comprehension and three math sections), and five grades (three to seven). The dependent variable is the cheating indicator derived using the ninety-fifth percentile cutoff. Robust standard errors clustered by school-year are shown in parentheses. Other variables included in the regressions in columns (1) and (2) include a linear time trend, cubic terms for the number of students, a linear grade variable, and fixed effects for subjects. The regression shown in column (3) also includes the following variables: indicators of the percent of students in the classroom who were Black, Hispanic, male, receiving free lunch, old for grade, in a special education program, living in foster care, and living with a nonparental relative; indicators of school size, mobility rate, and attendance rate; and the percent of teachers in the school who had a master's or doctoral degree, lived in Chicago, and were education majors.
(a) Test forms vary only by year, so this variable will drop out of the analysis when school-year fixed effects are included.


adding a number of classroom and school characteristics (column (3)), as well as school-year fixed effects (column (4)).

Cheating also appears to be responsive to other costs and benefits. Classrooms that tested poorly last year are much more likely to cheat. For example, a classroom with average student prior achievement one classroom standard deviation below the mean is 23 percent more likely to cheat. Classrooms with students in multiple grades are 65 percent less likely to cheat than classrooms where all students are in the same grade. This is consistent with the fact that it is likely more difficult for teachers in such classrooms to cheat, since they must administer two different test forms to students, which will necessarily have different correct answers. Moreover, classes with a higher proportion of students who are included in the official test reporting are more likely to cheat: a 10 percentage point increase in the proportion of students in a class whose test scores "count" will increase the likelihood of cheating by roughly 20 percent.27

Teachers who administer the exam to their own students are 0.67 percentage points (approximately 50 percent) more likely to cheat.28

A few other notable results emerge. First, there is no statistically significant impact on cheating of reusing a test form that has been administered in a previous year. That finding is of interest because it suggests that teachers taking old exams and teaching the precise questions to students is not an important component of what we are detecting as cheating (although anecdotal evidence suggests this practice exists). Classrooms in schools with lower achievement, higher poverty rates, and more Black students are more likely to cheat. Classrooms in schools with teachers who graduated from more prestigious undergraduate institutions are less likely to cheat, whereas classrooms in schools with younger teachers are more likely to cheat.

27. Consistent with this finding, within cheating classrooms teachers are less likely to alter the answer sheets for students who are excluded from test reporting. Teachers also appear to cheat less frequently for boys and for older students.

28. In an earlier version of the paper [Jacob and Levitt 2002], we examine for which students teachers tend to cheat. We find that teachers are most likely to cheat for students in the second quartile of the national ability distribution (twenty-fifth to fiftieth percentile) in the previous year, consistent with the incentives under the probation policy.


VII. CONCLUSION

This paper develops an algorithm for determining the prevalence of teacher cheating on standardized tests and applies the results to data from the Chicago public schools. Our methods reveal thousands of instances of classroom cheating, representing 4–5 percent of the classrooms each year. Moreover, we find that teacher cheating appears quite responsive to relatively minor changes in incentives.

The obvious benefits of high-stakes tests as a means of providing incentives must therefore be weighed against the possible distortions that these measures induce. Explicit cheating is merely one form of distortion. Indeed, the kind of cheating we focus on is one that could be greatly reduced at a relatively low cost through the implementation of proper safeguards, such as is done by the Educational Testing Service on the SAT or GRE exams.29

Even if this particular form of cheating were eliminated, however, there are a virtually infinite number of dimensions along which strategic school personnel can act to game the current system. For instance, Figlio and Winicki [2001] and Rohlfs [2002] document that schools attempt to increase the caloric intake of students on test days, and disruptive students are especially likely to be suspended from school during official testing periods [Figlio 2002]. It may be prohibitively costly, or even impossible, to completely safeguard the system.

Finally, this paper fits into a small but growing body of research focused on identifying corrupt or illicit behavior on the part of economic actors (see Porter and Zona [1993], Fisman [2001], Di Tella and Schargrodsky [2001], and Duggan and Levitt [2002]). Because individuals engaged in such behavior actively attempt to hide their trails, the intellectual exercise associated with uncovering their misdeeds differs substantially from the typical economic application, in which the researcher starts with a well-defined measure of the outcome variable (e.g., earnings, economic growth, profits) and then attempts to uncover the determinants of these outcomes. In the case of corruption, there is typically no clear outcome variable, making it necessary for the

29. Although even tests administered by the Educational Testing Service are not immune to cheating, as evidenced by recent reports that many of the GRE scores from China are fraudulent [Contreras 2002].


researcher to employ nonstandard approaches in generating such a measure.

APPENDIX: THE CONSTRUCTION OF SUSPICIOUS STRING MEASURES

This appendix describes in greater detail how we construct each of our measures of unexpected or suspicious responses.

The first measure focuses on the most unlikely block of identical answers given on consecutive questions. Using past test scores, future test scores, and background characteristics, we predict the likelihood that each student will give each answer on each question. For each item a student has four choices (A, B, C, or D), only one of which is correct. We estimate a multinomial logit for each item on the exam in order to predict how students will respond to each question. We estimate the following model for each item, using information from other students in that year, grade, and subject:

(1)   \Pr(Y_{isc} = j) = \frac{e^{\beta_j X_s}}{\sum_{j=1}^{J} e^{\beta_j X_s}}

where Y_{isc} indicates the response of student s in class c on item i, the number of possible responses (J) is four, and X_s is a vector that includes measures of prior and future student achievement in math and reading, as well as demographic variables (such as race, gender, and free lunch status) for student s. Thus, a student's predicted probability of choosing a particular response is identified by the likelihood of other students (in the same year, grade, and subject) with similar background characteristics choosing that response.

Notice that by including future as well as prior test scores in the model, we decrease the likelihood that students with unusually good teachers will be identified as cheaters, since these students will likely retain some of the knowledge learned in the base year and thus have higher future test scores. Also note that by estimating the probability of selecting each possible response, rather than simply estimating the probability of choosing the correct response, we take advantage of any additional information that is provided by particular response patterns in a classroom.
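For a single item, a model of this form can be estimated with an off-the-shelf multinomial logit; the sketch below uses statsmodels, with X and y standing in for the covariates and observed responses (coded 0–3) of all students in a grade-year-subject cell. It is an illustration under those assumptions, not the authors' estimation code.

    import statsmodels.api as sm

    # X: prior and future test scores plus demographics for each student;
    # y: the response (0=A, 1=B, 2=C, 3=D) each student gave on this item.
    X = sm.add_constant(X)
    fit = sm.MNLogit(y, X).fit(disp=0)

    # probs[s, j] is the predicted probability that student s chooses
    # response j on this item; one such model is fit for every item.
    probs = fit.predict(X)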

Using the estimates from this model, we calculate the predicted probability that each student would answer each item in


the way that he or she in fact did:

(2)   \hat{p}_{isc} = \frac{e^{\beta_k X_s}}{\sum_{j=1}^{J} e^{\beta_j X_s}},
      for k = response actually chosen by student s on item i.

This provides us with one measure per student per item. Taking the product over items within student, we calculate the probability that a student would have answered a string of consecutive questions from item m to item n as he or she did:

(3)   \hat{p}_{sc}^{mn} = \prod_{i=m}^{n} \hat{p}_{isc}

We then take the product across all students in the classroom who had identical responses in the string. If we define z as a student, S_{zc}^{mn} as the string of responses for student z from item m to item n, and S_{sc}^{mn} as the string for student s, then we can express the product as

(4)   \tilde{p}_{sc}^{mn} = \prod_{s \in \{z : S_{zc}^{mn} = S_{sc}^{mn}\}} \hat{p}_{sc}^{mn}

Note that if there are n_s students in class c and each student has a unique set of responses to these particular items, then \tilde{p}_{sc}^{mn} collapses to \hat{p}_{sc}^{mn} for each student, and there will be n_s distinct values within the class. On the other extreme, if all of the students in class c have identical responses, then there is only one distinct value of \tilde{p}_{sc}^{mn}. We repeat this calculation for all possible consecutive strings of length three to seven, that is, for all S^{mn} such that 3 \le n - m + 1 \le 7. To create our first indicator of suspicious string patterns, we take the minimum of the predicted block probability for each classroom:

MEASURE 1:   M1_c = \min_{s,m,n} (\tilde{p}_{sc}^{mn})

This measure captures the least likely block of identical answers given on consecutive questions in the classroom.
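Given the matrix of probabilities of each student's actual responses (from the logit sketched above), Measure 1 reduces to a brute-force search over consecutive blocks of length three to seven. A sketch, with assumed array shapes:

    import numpy as np

    def measure_1(p_hat, responses):
        # p_hat:     (n_students, n_items) probability of the response
        #            each student actually gave
        # responses: (n_students, n_items) the responses themselves
        n_students, n_items = p_hat.shape
        best = 1.0
        for length in range(3, 8):
            for m in range(n_items - length + 1):
                # Probability each student gives his or her observed block.
                p_block = p_hat[:, m:m + length].prod(axis=1)
                strings = [tuple(r) for r in responses[:, m:m + length]]
                # Group students with identical strings on this block and
                # take the product of their block probabilities (eq. (4)).
                for s in set(strings):
                    members = [i for i, t in enumerate(strings) if t == s]
                    best = min(best, np.prod(p_block[members]))
        return best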

The second measure of suspicious answer strings is intended to capture more general patterns of similarity in student responses. To construct this measure, we first calculate the residuals


for each of the possible choices a student could have made for each item:

(5)   e_{jisc} = \begin{cases}
        0 - \dfrac{e^{\beta_j X_s}}{\sum_{j=1}^{J} e^{\beta_j X_s}} & \text{if } j \neq k \\
        1 - \dfrac{e^{\beta_j X_s}}{\sum_{j=1}^{J} e^{\beta_j X_s}} & \text{if } j = k
      \end{cases}

where e_{jisc} is the residual for response j on item i by student s in classroom c. We thus have four separate residuals per student per item.

To create a classroom-level measure of the response to item i, we need to combine the information for each student. First, we sum the residuals for each response across students within a classroom:

(6)   \bar{e}_{jic} = \sum_s e_{jisc}

If there is no within-class correlation in the way that students responded to a particular item, this term should be approximately zero. Second, we sum across the four possible responses for each item within classrooms. At the same time, we square each of the component residual measures to accentuate outliers, and divide by the number of students in the class (n_{sc}) to normalize by class size:

(7)   v_{ic} = \frac{\sum_j \bar{e}_{jic}^2}{n_{sc}}

The statistic v_{ic} captures something like the variance of student responses on item i within classroom c. Notice that we choose to first sum the residuals of each response across students, and then sum the classroom-level measures for each response, rather than summing across responses within student initially. We do this in order to emphasize the classroom-level tendencies in response patterns.

Our second measure of suspicious strings is simply the classroom average (across items) of this variance term across all test items:

MEASURE 2:   M2_c = \bar{v}_c = \sum_i v_{ic} / n_i, where n_i is the number of items on the exam.

Our third measure is simply the variance (as opposed to the


mean) in the degree of correlation across questions within a classroom:

MEASURE 3:   M3_c = \sigma_{v_c}^2 = \sum_i (v_{ic} - \bar{v}_c)^2 / n_i
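Measures 2 and 3 follow directly from the residuals in equation (5); a sketch under the same assumed array layout:

    import numpy as np

    def measures_2_and_3(residuals):
        # residuals: (n_students, n_items, 4) array of e_jisc from eq. (5)
        n_students = residuals.shape[0]
        # Sum residuals for each response across students (eq. (6)) ...
        e_bar = residuals.sum(axis=0)                 # (n_items, 4)
        # ... then square, sum across the four responses, and normalize
        # by class size (eq. (7)).
        v = (e_bar ** 2).sum(axis=1) / n_students     # (n_items,)
        m2 = v.mean()   # Measure 2: average across items
        m3 = v.var()    # Measure 3: variance across items
        return m2, m3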

Our final indicator focuses on the extent to which a student's response pattern was different from other students' with the same aggregate score that year. Let q_{isc} equal one if student s in classroom c answered item i correctly, and zero otherwise. Let A_s equal the aggregate score of student s on the exam. We then determine what fraction of students at each aggregate score level answered each item correctly. If we let n_{sA} equal the number of students with an aggregate score of A, then this fraction \bar{q}_i^A can be expressed as

(8)   \bar{q}_i^A = \frac{\sum_{s \in \{z : A_z = A_s\}} q_{isc}}{n_{sA}}

We then calculate a measure of how much the response pattern of student s differed from the response pattern of other students with the same aggregate score. We do so by subtracting a student's answer on item i from the mean response of all students with aggregate score A, squaring these deviations, and then summing across all items on the exam:

(9)   Z_{sc} = \sum_i (q_{isc} - \bar{q}_i^A)^2

We then subtract out the mean deviation for all students with the same aggregate score, \bar{Z}_A, and sum across the students within each classroom to obtain our final indicator:

MEASURE 4:   M4_c = \sum_s (Z_{sc} - \bar{Z}_A)
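Measure 4 can be sketched as follows; here q_bar and z_bar would be precomputed from all students in the system at each aggregate score level, and are passed in as lookup tables for brevity.

    import numpy as np

    def measure_4(correct, agg_scores, q_bar, z_bar):
        # correct:    (n_students, n_items) 0/1 matrix q_isc for the class
        # agg_scores: (n_students,) aggregate score A_s of each student
        # q_bar[A]:   (n_items,) fraction of all students with score A
        #             answering each item correctly (eq. (8))
        # z_bar[A]:   mean of Z_sc among all students with score A
        m4 = 0.0
        for s in range(correct.shape[0]):
            A = agg_scores[s]
            z_s = ((correct[s] - q_bar[A]) ** 2).sum()   # eq. (9)
            m4 += z_s - z_bar[A]
        return m4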

KENNEDY SCHOOL OF GOVERNMENT, HARVARD UNIVERSITY

UNIVERSITY OF CHICAGO AND AMERICAN BAR FOUNDATION

REFERENCES

Aiken, Linda R., "Detecting, Understanding, and Controlling for Cheating on Tests," Research in Higher Education, XXXII (1991), 725–736.

Angoff, William H., "The Development of Statistical Indices for Detecting Cheaters," Journal of the American Statistical Association, LXIX (1974), 44–49.

Baker, George, "Incentive Contracts and Performance Measurement," Journal of Political Economy, C (1992), 598–614.

Bellezza, Francis S., and Suzanne F. Bellezza, "Detection of Cheating on Multiple-Choice Tests by Using Error Similarity Analysis," Teaching of Psychology, XVI (1989), 151–155.

Carnoy, Martin, and Susanna Loeb, "Does External Accountability Affect Student Outcomes? A Cross-State Analysis," Working Paper, Stanford University School of Education, 2002.

Contreras, Russell, "Inquiry Uncovers Possible Cheating on GRE in Asia," Associated Press, August 7, 2002.

Cullen, Julie Berry, and Randall Reback, "Tinkering Toward Accolades: School Gaming under a Performance Accountability System," Working Paper, University of Michigan, 2002.

Deere, Donald, and Wayne Strayer, "Putting Schools to the Test: School Accountability, Incentives, and Behavior," Working Paper, Texas A&M University, 2002.

Di Tella, Rafael, and Ernesto Schargrodsky, "The Role of Wages and Auditing during a Crackdown on Corruption in the City of Buenos Aires," unpublished manuscript, Harvard Business School, 2001.

Duggan, Mark, and Steven Levitt, "Winning Isn't Everything: Corruption in Sumo Wrestling," American Economic Review, XCII (2002), 1594–1605.

Figlio, David N., "Testing, Crime and Punishment," Working Paper, University of Florida, 2002.

Figlio, David N., and Joshua Winicki, "Food for Thought: The Effects of School Accountability Plans on School Nutrition," Working Paper, University of Florida, 2001.

Figlio, David N., and Lawrence S. Getzler, "Accountability, Ability and Disability: Gaming the System," Working Paper, University of Florida, 2002.

Fisman, Ray, "Estimating the Value of Political Connections," American Economic Review, XCI (2001), 1095–1102.

Frary, Robert B., "Statistical Detection of Multiple-Choice Answer Copying: Review and Commentary," Applied Measurement in Education, VI (1993), 153–165.

Frary, Robert B., T. Nicholas Tideman, and Thomas Watts, "Indices of Cheating on Multiple-Choice Tests," Journal of Educational Statistics, II (1977), 235–256.

Glaeser, Edward, and Andrei Shleifer, "A Reason for Quantity Regulation," American Economic Review, XCI (2001), 431–435.

Grissmer, David W., Ann Flanagan, et al., "Improving Student Achievement: What NAEP Test Scores Tell Us," RAND Corporation, MR-924-EDU, 2000.

Hanushek, Eric A., and Margaret E. Raymond, "Improving Educational Quality: How Best to Evaluate Our Schools," Conference paper, Federal Reserve Bank of Boston, 2002.

Hofkins, Diane, "Cheating 'Rife' in National Tests," New York Times Educational Supplement, June 16, 1995.

Holland, Paul W., "Assessing Unusual Agreement between the Incorrect Answers of Two Examinees Using the K-Index: Statistical Theory and Empirical Support," Technical Report No. 96-5, Educational Testing Service, 1996.

Holmstrom, Bengt, and Paul Milgrom, "Multitask Principal-Agent Analyses: Incentive Contracts, Asset Ownership, and Job Design," Journal of Law, Economics, and Organization, VII (1991), 24–51.

Jacob, Brian A., "Accountability, Incentives and Behavior: The Impact of High-Stakes Testing in the Chicago Public Schools," Working Paper No. 8968, National Bureau of Economic Research, 2002.

Jacob, Brian A., and Lars Lefgren, "The Impact of Teacher Training on Student Achievement: Quasi-Experimental Evidence from School Reform Efforts in Chicago," Journal of Human Resources, forthcoming.

Jacob, Brian A., and Steven Levitt, "Catching Cheating Teachers: The Results of an Unusual Experiment in Implementing Theory," Working Paper No. 9413, National Bureau of Economic Research, 2003.

Klein, Stephen P., Laura S. Hamilton, et al., "What Do Test Scores in Texas Tell Us?" RAND Corporation, 2000.

Kolker, Claudia, "Texas Offers Hard Lessons on School Accountability," Los Angeles Times, April 14, 1999.

Lindsay, Drew, "Whodunit? Officials Find Thousands of Erasures on Standardized Tests and Suspect Tampering," Education Week (1996), 25–29.

Loughran, Regina, and Thomas Comiskey, "Cheating the Children: Educator Misconduct on Standardized Tests," Report of the City of New York Special Commissioner of Investigation for the New York City School District, 1999.

Marcus, John, "Faking the Grade," Boston Magazine, February 2000.
May, Meredith, "State Fears Cheating by Teachers," San Francisco Chronicle, October 4, 2000.
Perlman, Carol L., "Results of a Citywide Testing Program Audit in Chicago," Paper for American Educational Research Association annual meeting, Chicago, IL, 1985.

Porter, Robert H., and J. Douglas Zona, "Detection of Bid Rigging in Procurement Auctions," Journal of Political Economy, CI (1993), 518–538.

Richards, Craig E., and Tian Ming Sheu, "The South Carolina School Incentive Reward Program: A Policy Analysis," Economics of Education Review, XI (1992), 71–86.

Rohlfs, Chris, "Estimating Classroom Learning Using the School Breakfast Program as an Instrument for Attendance and Class Size: Evidence from Chicago Public Schools," unpublished manuscript, University of Chicago, 2002.

Tysome, Tony, "Cheating Purge: Inspectors Out," New York Times Higher Education Supplement, August 19, 1994.

Wollack, James A., "A Nominal Response Model Approach for Detecting Answer Copying," Applied Psychological Measurement, XX (1997), 144–152.


Page 6: rotten apples: an investigation of the prevalence and predictors of ...

FIGURE ISample Answer Strings and Test Scores from Two Classrooms

The data in the gure represent actual answer strings and test scores from twoCPS classrooms taking the same exam The top classroom is suspected of cheat-ing the bottom classroom is not Each row corresponds to an individual studentEach column represents a particular question on the exam A letter indicates thatthe student gave that answer and the answer was correct A number means thatthe student gave the corresponding letter answer (eg 1 5 ldquoArdquo) but the answerwas incorrect A value of ldquo0rdquo means the question was left blank Student testscores in grade equivalents are shown in the last three columns of the gure Thetest year for which the answer strings are presented is denoted year t The scoresfrom years t 2 1 and t 1 1 correspond to the preceding and following yearsrsquoexaminations

848 QUARTERLY JOURNAL OF ECONOMICS

student would be expected to gain one grade equivalent for eachyear of school

The top panel of data shows a class in which we suspectteacher cheating took place the bottom panel corresponds to atypical classroom Two striking differences between the class-rooms are readily apparent First in the cheating classroom alarge block of students provided identical answers on consecutivequestions in the middle of the test as indicated by the boxed areain the gure For the other classroom no such pattern existsSecond looking at the pattern of test scores students in thecheating classroom experienced large increases in test scoresfrom the previous to the current year (17 grade equivalents onaverage) and actually experienced declines on average the fol-lowing year In contrast the students in the typical classroomgained roughly one grade equivalent each year as would beexpected

The indicators we use as evidence of cheating formalize andextend the basic picture that emerges from Figure I We dividethese indicators into two distinct groups that respectively cap-ture unusual test score uctuations and suspicious patterns ofanswer strings In this section we describe informally the mea-sures that we use A more rigorous treatment of their construc-tion is provided in the Appendix

IIIA Indicator One Unexpected Test Score Fluctuations

Given that the aim of cheating is to raise test scores onesignal of teacher cheating is an unusually large gain in testscores relative to how those same students tested in the pre-vious year Since test score gains that result from cheating donot represent real gains in knowledge there is no reason toexpect the gains to be sustained on future exams taken bythese students (unless of course next yearrsquos teachers alsocheat on behalf of the students) Thus large gains due tocheating should be followed by unusually small test score gainsfor these students in the following year In contrast if largetest score gains are due to a talented teacher the student gainsare likely to have a greater permanent component even if someregression to the mean occurs We construct a summary mea-sure of how unusual the test score uctuations are by rankingeach classroomrsquos average test score gains relative to all other

849ROTTEN APPLES

classrooms in that same subject grade and year7 and com-puting the following statistic

(3) SCOREcbt 5 ~rank_ gaincb t2 1 ~1 2 rank_ gaincbt11

2

where rank_ gain cb t is the percentile rank for class c in subject bin year t Classes with relatively big gains on this yearrsquos test andrelatively small gains on next yearrsquos test will have high values ofSCORE Squaring the individual terms gives relatively moreweight to big test score gains this year and big test score declinesthe following year8 In the empirical analysis we consider threepossible cutoffs for what it means to have a ldquohighrdquo value onSCORE corresponding to the eightieth ninetieth and ninety-fth percentiles among all classrooms in the sample

IIIB Indicator Two Suspicious Answer Strings

The quickest and easiest way for a teacher to cheat is to alterthe same block of consecutive questions for a substantial portionof students in the class as was apparently done in the classroomin the top panel of Figure I More sophisticated interventionsmight involve skipping some questions so as to avoid a large blockof identical answers or altering different blocks of questions fordifferent students

We combine four different measures of how suspicious aclassroomrsquos answer strings are in determining whether a class-room may be cheating The rst measure focuses on the mostunlikely block of identical answers given by students on consecu-tive questions Using past test scores future test scores andbackground characteristics we predict the likelihood that eachstudent will give each possible answer (A B C or D) on everyquestion using a multinomial logit Each studentrsquos predictedprobability of choosing a particular response is identied by thelikelihood that other students (in the same year grade andsubject) with similar background characteristics will choose that

7 We also experimented with more complicated mechanisms for deninglarge or small test score gains (eg predicting each studentrsquos expected test scoregain as a function of past test scores and background characteristics and comput-ing a deviation measure for each student which was then aggregated to theclassroom level) but because the results were similar we elected to use thesimpler method We have also dened gains and losses using an absolute metric(eg where gains in excess of 15 or 2 grade equivalents are considered unusuallylarge) and obtain comparable results

8 In the following year the students who were in a particular classroom aretypically scattered across multiple classrooms We base all calculations off of thecomposition of this yearrsquos class

850 QUARTERLY JOURNAL OF ECONOMICS

response We then search over all combinations of students andconsecutive questions to nd the block of identical answers givenby students in a classroom least likely to have arisen by chance9

The more unusual is the most unusual block of test responses(adjusting for class size and the number of questions on the examboth of which increase the possible combinations over which wesearch) the more likely it is that cheating occurred Thus if tenvery bright students in a class of thirty give the correct answersto the rst ve questions on the exam (typically the easier ques-tions) the block of identical answers will not appear unusual Incontrast if all fteen students in a low-achieving classroom givethe same correct answers to the last ve questions on the exam(typically the harder questions) this would appear quite suspect

The second measure of suspicious answer strings involvesthe overall degree of correlation in student answers across thetest When a teacher changes answers on test forms it presum-ably increases the uniformity of student test forms across stu-dents in the class This measure is meant to capture more generalpatterns of similarity in student responses beyond just identicalblocks of answers Based on the results of the multinomial logitdescribed above for each question and each student we create ameasure of how unexpected the studentrsquos response was We thencombine the information for each student in the classroom tocreate something akin to the within-classroom correlation in stu-dent responses This measure will be high if students in a class-room tend to give the same answers on many questions espe-cially if the answers given are unexpected (ie correct answers onhard questions or systematic mistakes on easy questions)

Of course, within-classroom correlation may arise for many reasons other than cheating (e.g., the teacher may emphasize certain topics during the school year). Therefore, a third indicator of potential cheating is a high variance in the degree of correlation across questions.

9. Note that we do not require the answers to be correct. Indeed, in many classrooms, the most unusual strings include some incorrect answers. Note also that these calculations are done under the assumption that a given student's answers are uncorrelated (conditional on observables) across questions on the exam, and that answers are uncorrelated across students. Of course, this assumption is unlikely to be true. Since all of our comparisons rely on the relative unusualness of the answers given in different classrooms, this simplifying assumption is not problematic unless the correlation within and across students varies by classroom.


If the teacher changes answers for multiple students on selected questions, the within-class correlation on those particular questions will be extremely high, while the degree of within-class correlation on other questions is likely to be typical. This leads the cross-question variance in correlations to be larger than normal in cheating classrooms.

Our final indicator compares the answers that students in one classroom give with the answers of other students in the system who take the identical test and get the exact same score. This measure relies on the fact that questions vary significantly in difficulty. The typical student will answer most of the easy questions correctly but get many of the hard questions wrong (where "easy" and "hard" are based on how well students of similar ability do on the question). If students in a class systematically miss the easy questions while correctly answering the hard questions, this may be an indication of cheating.

Our overall measure of suspicious answer strings is constructed in a manner parallel to our measure of unusual test score fluctuations. Within a given subject, grade, and year, we rank classrooms on each of these four indicators, and then take the sum of squared ranks across the four measures.10

(4)   ANSWERS_{cbt} = (\text{rank\_m1}_{cbt})^2 + (\text{rank\_m2}_{cbt})^2 + (\text{rank\_m3}_{cbt})^2 + (\text{rank\_m4}_{cbt})^2

In the empirical work, we again use three possible cutoffs for potential cheating: the eightieth, ninetieth, and ninety-fifth percentiles.
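As a sketch of how equation (4) might be computed with pandas (illustrative column names and synthetic data; percentile ranks are used, which preserve the within-cell ordering of raw ranks):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 4000  # synthetic classroom/subject/year observations
df = pd.DataFrame({
    "grade": rng.integers(3, 8, n),
    "subject": rng.integers(0, 4, n),
    "year": rng.integers(1993, 2001, n),
    # the four raw suspicious-string indicators (random placeholders)
    "m1": rng.random(n), "m2": rng.random(n),
    "m3": rng.random(n), "m4": rng.random(n),
})

# Rank classrooms on each indicator within subject, grade, and year,
# then take the sum of squared ranks across the four measures (eq. (4)).
cells = df.groupby(["grade", "subject", "year"])
for m in ["m1", "m2", "m3", "m4"]:
    df[f"rank_{m}"] = cells[m].rank(pct=True)
df["ANSWERS"] = sum(df[f"rank_{m}"] ** 2 for m in ["m1", "m2", "m3", "m4"])

# Flag classrooms above, e.g., the 95th percentile within each cell.
df["suspicious"] = (
    df.groupby(["grade", "subject", "year"])["ANSWERS"].rank(pct=True) > 0.95
)
print(df["suspicious"].mean())  # roughly 0.05 by construction
```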

III. A STRATEGY FOR IDENTIFYING THE PREVALENCE OF CHEATING

The previous section described indicators that are likely to be correlated with cheating. Because sometimes such patterns arise by chance, however, not every classroom with large test score fluctuations and suspicious answer strings is cheating. Furthermore, the particular choice of what qualifies as a "large" fluctuation or a "suspicious" set of answer strings will necessarily be arbitrary. In this section we present an identification strategy that, under a set of defensible assumptions, nonetheless provides estimates of the prevalence of cheating.


10. Because different subjects and grades have differing numbers of questions, it is difficult to make meaningful comparisons across tests on the raw indicators.


To identify the number of cheating classrooms in a given year, we would like to compare the observed joint distribution of test score fluctuations and suspicious answer strings with a counterfactual distribution in which no cheating occurs. Differences between the two distributions would provide an estimate of how much cheating occurred. If teachers in a particular school or year are cheating, there will be more classrooms exhibiting both unusual test score fluctuations and suspicious answer strings than otherwise expected. In practice, we do not have the luxury of observing such a counterfactual. Instead, we must make assumptions about what the patterns would look like absent cheating.

Our identification strategy hinges on three key assumptions: (1) cheating increases the likelihood a class will have both large test score fluctuations and suspicious answer strings; (2) if cheating classrooms had not cheated, their distribution of test score fluctuations and answer string patterns would be identical to noncheating classrooms; and (3) in noncheating classrooms, the correlation between test score fluctuations and suspicious answers is constant throughout the distribution.11

If assumption (1) holds, then cheating classrooms will be concentrated in the upper tail of the joint distribution of unusual test score fluctuations and suspicious answer strings. Other parts of the distribution (e.g., classrooms ranked in the fiftieth to seventy-fifth percentile of suspicious answer strings) will consequently include few cheaters. As long as cheating classrooms would look similar to noncheating classrooms on our measures if they had not cheated (assumption (2)), classes in the part of the distribution with few cheaters provide the noncheating counterfactual that we are seeking.

The difficulty, however, is that we only observe this noncheating counterfactual in the bottom and middle parts of the distribution, when what we really care about is the upper tail of the distribution where the cheaters are concentrated. Assumption (3), which requires that in noncheating classrooms the correlation between test score fluctuations and suspicious answers is constant throughout the distribution, provides a potential solution. If this assumption holds, we can use the part of the distribution that is relatively free of cheaters to project what the right tail of the distribution would look like absent cheating.

11. We formally derive the mathematical model described in this section in Jacob and Levitt [2003].


The gap between the predicted and observed frequency of classrooms that are extreme on both the test score fluctuation and suspicious answer string measures provides our estimate of cheating.

Figure II demonstrates how our identification strategy works empirically. The horizontal axis in the figure ranks classrooms according to how suspicious their answer strings are. The vertical axis is the fraction of the classrooms that are above the ninety-fifth percentile on the unusual test score fluctuations measure.12

The graph combines all classrooms and all subjects in our data.13

Over most of the range (roughly from zero to the seventy-fifth percentile on the horizontal axis), there is a slight positive correlation between unusual test scores and suspicious answer strings.

12. The choice of the ninety-fifth percentile is somewhat arbitrary. In the empirical work that follows we also consider the eightieth and ninetieth percentiles. The choice of the cutoff does not affect the basic patterns observed in the data.

13. To construct the figure, classes were rank ordered according to their answer strings and divided into 200 equally sized segments. The circles in the figure represent these 200 local means. The line displayed in the graph is the fitted value of a regression with a seventh-order polynomial in a classroom's rank on the suspicious strings measure.

FIGURE II
The Relationship between Unusual Test Scores and Suspicious Answer Strings

The horizontal axis reflects a classroom's percentile rank in the distribution of suspicious answer strings within a given grade, subject, and year, with zero representing the least suspicious classroom and one representing the most suspicious classroom. The vertical axis is the probability that a classroom will be above the ninety-fifth percentile on our measure of unusual test score fluctuations. The circles in the figure represent averages from 200 equally spaced cells along the x-axis. The predicted line is based on a probit model estimated with seventh-order polynomials in the suspicious string measure.


This is the part of the distribution that is unlikely to contain many cheating classrooms. Under the assumption that the correlation between our two measures is constant in noncheating classrooms over the whole distribution, we would predict that the straight line observed for all but the right-hand portion of the graph would continue all the way to the right edge of the graph.

Instead, as one approaches the extreme right tail of the distribution of suspicious answer strings, the probability of large test score fluctuations rises dramatically. That sharp rise, we argue, is a consequence of cheating. To estimate the prevalence of cheating, we simply compare the actual area under the curve in the far right tail of Figure II to the predicted area under the curve in the right tail, projecting the linear relationship observed from the fiftieth to seventy-fifth percentile out to the ninety-ninth percentile. Because this identification strategy is necessarily indirect, we devote a great deal of attention to presenting a wide variety of tests of the validity of our approach, the sensitivity of the results to alternative assumptions, and the plausibility of our findings.
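The following stylized sketch (entirely synthetic data, not the CPS data) illustrates the projection logic: fit the fluctuation-string relationship on the 50th-75th percentile window, project it into the tail, and read the excess mass as the implied cheating share:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20000
rank = rng.uniform(0, 1, n)           # suspicious-strings percentile rank

# Noncheating world: mild linear relation with big score fluctuations;
# then inject extra "cheaters" into the extreme right tail.
p_high = 0.03 + 0.04 * rank
p_high[rank > 0.97] += 0.35
high_fluct = rng.random(n) < p_high   # above the SCORE cutoff, say

# Fit a line on the 50th-75th percentile window (assumed cheater-free).
w = (rank >= 0.50) & (rank <= 0.75)
slope, intercept = np.polyfit(rank[w], high_fluct[w].astype(float), 1)

# Compare observed and projected frequencies in the right tail.
tail = rank > 0.95
observed = high_fluct[tail].mean()
projected = intercept + slope * rank[tail].mean()
cheat_share = (observed - projected) * tail.mean()
print(f"observed={observed:.3f} projected={projected:.3f} "
      f"implied cheating share={cheat_share:.4f}")
```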

IV. DATA AND INSTITUTIONAL BACKGROUND

Elementary students in Chicago public schools take a standardized multiple-choice achievement exam known as the Iowa Test of Basic Skills (ITBS). The ITBS is a national norm-referenced exam with a reading comprehension section and three separate math sections.14 Third through eighth grade students in Chicago are required to take the exams each year. Most schools administer the exams to first and second grade students as well, although this is not a district mandate.

Our base sample includes all students in third to seventh grade for the years 1993-2000.15 For each student we have the question-by-question answer string on each year's tests, school and classroom identifiers, the full history of prior and future test scores, and demographic variables including age, sex, race, and free lunch eligibility.

14. There are also other parts of the test which are either not included in official school reporting (spelling, punctuation, grammar) or are given only in select grades (science and social studies), for which we do not have information.

15. We also have test scores for eighth graders, but we exclude them because our algorithm requires test score data for the following year, and the ITBS test is not administered to ninth graders.


We also have information about a wide range of school-level characteristics. We do not, however, have individual teacher identifiers, so we are unable to directly link teachers to classrooms or to track a particular teacher over time.

Because our cheating proxies rely on comparisons to past and future test scores, we drop observations that are missing reading or math scores in either the preceding year or the following year (in addition to those with missing test scores in the baseline year).16 Less than one-half of 1 percent of students with missing demographic data are also excluded from the analysis. Finally, because our algorithms for identifying cheating rely on identifying suspicious patterns within a classroom, our methods have little power in classrooms with small numbers of students. Consequently, we drop all classrooms for which we have fewer than ten valid students in a particular grade after our other exclusions (roughly 3 percent of students). A handful of classrooms recorded as having more than 40 students (presumably multiple classrooms not separately identified in the data) are also dropped. Our final data set contains roughly 20,000 students per grade per year distributed across approximately 1,000 classrooms, for a total of over 40,000 classroom-years of data (with four subject tests per classroom-year) and over 700,000 student-year observations.
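In code, these sample restrictions amount to a short filtering pass; the sketch below assumes a student-year table with hypothetical column names:

```python
import pandas as pd

def build_sample(students: pd.DataFrame) -> pd.DataFrame:
    """Apply the exclusions described above (column names are illustrative:
    score_prev/score_base/score_next, race, sex, free_lunch, classroom_id)."""
    # Drop students missing baseline, prior-year, or following-year scores.
    df = students.dropna(subset=["score_prev", "score_base", "score_next"])
    # Drop the small share of students with missing demographics.
    df = df.dropna(subset=["race", "sex", "free_lunch"])
    # Keep classrooms with 10 to 40 valid students after the exclusions.
    size = df.groupby("classroom_id")["classroom_id"].transform("size")
    return df[(size >= 10) & (size <= 40)]
```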

Summary statistics for the full sample of classrooms are shown in Table I, where the unit of observation is at the level of class/subject/year. As in many urban school districts, students in Chicago are disproportionately poor and minority. Roughly 60 percent of students are Black and 26 percent are Hispanic, with nearly 85 percent receiving free or reduced-price lunch. Less than 29 percent of students scored at or above national norms in reading.

The ITBS exams are administered over a weeklong period in early May. Third grade teachers are permitted to administer the exam to their own students, while other teachers switch classes to administer the exams.

16. The exact number of students with missing test data varies by year and grade. Overall, we drop roughly 12 percent of students because of missing test score information in the baseline year. These are students who (i) were not tested because of placement in bilingual or special education programs, or (ii) simply missed the exam that particular year. In addition to these students, we also drop approximately 13 percent of students who were missing test scores in either the preceding or following years. Test data may be missing either because a student did not attend school on the days of the test, or because the student transferred into the CPS system in the current year or left the system prior to the next year of testing.


The exams are generally delivered to the schools one to two weeks before testing and are supposed to be kept in a secure location by the principal or the school's test coordinator, an individual in the school designated to coordinate the testing process (often a counselor or administrator).

TABLE I
SUMMARY STATISTICS

                                                                Mean (s.d.)
Classroom characteristics:
  Mixed grade classroom                                            0.073
  Teacher administers exams to her own students (3rd grade)        0.206
  Percent of students who were tested and included in
    official reporting                                             0.883
  Average prior achievement
    (as deviation from year/grade/subject mean)                   -0.004
                                                                  (0.661)
  % Black                                                          0.595
  % Hispanic                                                       0.263
  % Male                                                           0.495
  % Old for grade                                                  0.086
  % Living in foster care                                          0.044
  % Living with nonparental relative                               0.104
  Cheater (95th percentile cutoff)                                 0.013
School characteristics:
  Average quality of teachers' undergraduate institution
    in the school                                                 -2.550
                                                                  (0.877)
  Percent of teachers who live in Chicago                          0.712
  Percent of teachers who have an MA or a PhD                      0.475
  Percent of teachers who majored in education                     0.712
  Percent of teachers under 30 years of age                        0.114
  Percent of teachers at the school less than 3 years              0.547
  % students at national norms in reading last year               28.8
  % students receiving free lunch in school                       84.7
  Predominantly Black school                                       0.522
  Predominantly Hispanic school                                    0.205
  Mobility rate in school                                         28.6
  Attendance rate in school                                       92.6
  School size                                                    722
                                                                 (317)
Accountability policy:
  Social promotion policy                                          0.215
  School probation policy                                          0.127
  Test form offered for the first time                             0.371
Number of observations                                         163,474

The data summarized above include one observation per classroom/year/subject. The sample includes all students in third to seventh grade for the years 1993-2000. We drop observations for the following reasons: (a) missing reading or math scores in either the baseline year, the preceding year, or the following year; (b) missing demographic data; (c) classrooms for which we have fewer than ten valid students in a particular grade, or more than 40 students. See the discussion in the text for a more detailed explanation of the sample, including the percentages excluded for various reasons.


Each section of the exam consists of 30 to 60 multiple-choice questions, which students are given between 30 and 75 minutes to complete.17 Students mark their responses on answer sheets, which are scanned to determine a student's score. There is no penalty for guessing. A student's raw score is simply calculated as the sum of correct responses on the exam. The raw score is then translated into a grade equivalent.

After collecting student exams, teachers or administrators then "clean" the answer keys, erasing stray pencil marks, removing dirt or debris from the form, and darkening item responses that were only faintly marked by the student. At the end of the testing week, the test coordinators at each school deliver the completed answer keys and exams to the CPS central office. School personnel are not supposed to keep copies of the actual exams, although school officials acknowledge that a number of teachers each year do so. The CPS has administered three different versions of the ITBS between 1993 and 2000. The CPS alternates forms each year, with new forms being offered for the first time in 1993, 1994, and 1997.18

V. THE PREVALENCE OF TEACHER CHEATING

As noted earlier, our estimates of the prevalence of cheating are derived from a comparison of the actual number of classrooms that are above some threshold on both of our cheating indicators, relative to the number we would expect to observe based on the correlation between these two measures in the 50th-75th percentiles of the suspicious string measure. The top panel of Table II presents our estimates of the percentage of classrooms that are cheating on average on a given subject test (i.e., reading comprehension or one of the three math tests) in a given year.19

17. The mathematics and reading tests measure basic skills. The reading comprehension exam consists of three to eight short passages, followed by up to nine questions relating to the passage. The math exam consists of three sections that assess number concepts, problem-solving, and computation.

18. These three forms are used for retesting, summer school testing, and midyear testing as well, so that it is likely that over the years teachers have seen the same exam on a number of occasions.

19. It is important to note that we exclude a particular form of cheating that appears to be quite prevalent in the data: teachers randomly filling in answers left blank by students at the end of the exam. In many classrooms, almost every student will end the test with a long string of the identical answers (typically all the same response, like "B" or "D"). The fact that almost all students in the class coordinate on the same pattern strongly suggests that the students themselves did not fill in the blanks, or were under explicit instructions by the teacher to do so. Since there is no penalty for guessing on the test, filling in the blanks can only increase student test scores. While this type of teacher behavior is likely to be viewed by many as unethical, we do not make it the focus of our analysis because (1) it is difficult to provide definitive evidence of such behavior (a teacher could argue that he or she instructed students well in advance of the test to fill in all blanks with the letter "C" as part of good test-taking strategy), and (2) in our minds, it is categorically different than a teacher who systematically changes student responses to the correct answer.


Because the decision as to what cutoff signifies a "high" value on our cheating indicators is arbitrary, we present a 3 × 3 matrix of estimates, using three different thresholds (eightieth, ninetieth, and ninety-fifth percentiles) for each of our cheating measures. The estimated prevalence of cheaters ranges from 1.1 percent to 2.1 percent, depending on the particular set of cutoffs used. As would be expected, the number of cheaters is generally declining as higher thresholds are employed. Nonetheless, it is encouraging that over such a wide set of cutoffs, the range of estimates is relatively tight.

The bottom panel of Table II presents estimates of the percentage of classrooms that are cheating on any of the four subject tests in a particular year.


TABLE II
ESTIMATED PREVALENCE OF TEACHER CHEATING

                                     Cutoff for test score fluctuations (SCORE)
Cutoff for suspicious answer
strings (ANSWERS)                 80th percentile   90th percentile   95th percentile

Percent cheating on a particular test
  80th percentile                       2.1               2.1               1.8
  90th percentile                       1.8               1.8               1.5
  95th percentile                       1.3               1.3               1.1

Percent cheating on at least one of the four tests given
  80th percentile                       4.5               5.6               5.3
  90th percentile                       4.2               4.9               4.4
  95th percentile                       3.5               3.8               3.4

The top panel of the table presents estimates of the percentage of classrooms cheating on a particular subject test in a given year, based on three alternative cutoffs for ANSWERS and SCORE. In all cases, the prevalence of cheating is based on the excess number of classrooms with unexpected test score fluctuations among classes with suspicious answer strings, relative to classes that do not have suspicious answer strings. The bottom panel of the table presents estimates of the percentage of classrooms cheating on at least one of the four subject tests that comprise the overall test. In the bottom panel, classrooms that cheat on more than one subject test are counted only once. Our sample includes over 35,000 3rd-7th grade classrooms in the Chicago public schools for the years 1993-1999.


If every classroom that cheated did so only on one subject test, then the results in the bottom panel would simply be four times the results in the top panel. In many instances, however, classrooms appear to cheat on multiple subjects. Thus, the prevalence rates range from 3.4-5.6 percent of all classrooms.20 Because of the necessarily indirect nature of our identification strategy, we explore a range of supplementary tests designed to assess the validity of the estimates.21

V.A. Simulation Results

Our prevalence estimates may be biased downward or upward, depending on the extent to which our algorithm fails to successfully identify all instances of cheating ("false negatives") versus the frequency with which we wrongly classify a teacher as cheating ("false positives").

One simple simulation exercise we undertook with respect to possible false positives involves randomly assigning students to hypothetical classrooms. These synthetic classrooms thus consist of students who in actuality had no connection to one another. We then analyze these hypothetical classrooms using the same algorithm applied to the actual data. As one would hope, no evidence of cheating was found in the simulated classes. Indeed, the estimated prevalence of cheating was slightly negative in this simulation (i.e., classrooms with large test score increases in the current year followed by big declines the next year were slightly less likely to have unusual patterns of answer strings). Thus, there does not appear to be anything mechanical in our algorithm that generates evidence of cheating.
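Schematically, the placebo exercise amounts to permuting classroom labels and rerunning the detection algorithm; a sketch with hypothetical column names:

```python
import numpy as np
import pandas as pd

def placebo_classrooms(students: pd.DataFrame, rng=None) -> pd.DataFrame:
    """Randomly reassign students to synthetic classrooms of the same sizes
    within each grade-year cell, severing any real teacher-student link."""
    rng = rng if rng is not None else np.random.default_rng()
    df = students.copy()
    for _, idx in df.groupby(["grade", "year"]).groups.items():
        labels = df.loc[idx, "classroom_id"].to_numpy()
        df.loc[idx, "classroom_id"] = rng.permutation(labels)
    return df

# Running the full cheating algorithm on placebo_classrooms(students)
# should yield an estimated prevalence of roughly zero, as reported above.
```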


20. Computation of the overall prevalence is somewhat complicated because it involves calculating not only how many classrooms are actually above the thresholds on multiple subject tests, but also how frequently this would occur in the absence of cheating. Details on these calculations are available from the authors.

21. In an earlier version of this paper [Jacob and Levitt 2003], we present a number of additional sensitivity analyses that confirm the basic results. First, we test whether our results are sensitive to the way in which we predict the prevalence of high values of ANSWERS and SCORE in the upper part of the other distribution (e.g., fitting a linear or quadratic model to the data in the lower portion of the distribution, versus simply using the average value from the 50th-75th percentiles). We find our results are extremely robust. Second, we examine whether our results might simply be due to mean reversion, by testing whether classrooms with suspicious answer strings are less likely to maintain large test score gains. Among a sample of classrooms with large test score gains, we find that mean reversion is substantially greater among the set of classes with highly suspicious answer strings. Third, we investigate whether we might be detecting student rather than teacher cheating, by examining whether students who have suspicious answer strings in one year are more likely to have suspicious answer strings in other years. We find that they do not.


A second simulation involves artificially altering a classroom's answer strings in ways that we believe mimic teacher cheating or outstanding teaching.22 We are then able to see how frequently we label these altered classes as "cheating," and how this varies with the extent and nature of the changes we make to a classroom's answers. We simulate two different types of teacher cheating. The first is a very naive version, in which a teacher starts cheating at the same question for a number of students and changes consecutive questions to the right answers for these students, creating a block of identical and correct responses. The second type of cheating is much more sophisticated: we randomly change answers from incorrect to correct for selected students in a class. The outcome of this simulation reveals that our methods are surprisingly weak in detecting moderate cheating, even in the naive case. For instance, when three consecutive answers are changed to correct for 25 percent of a class, we catch such behavior only 4 percent of the time. Even when six questions are changed for half of the class, we catch the cheating in less than 60 percent of the cases. Only when the cheating is very extreme (e.g., six answers changed for the entire class) are we very likely to catch the cheating. The explanation for this poor performance is that classrooms with low test-score gains, even with the boost provided by this cheating, do not have large enough test score fluctuations to make them appear unusual, because there is so much inherent volatility in test scores. When the cheating takes the more sophisticated form described above, we are even less successful at catching low and moderate cheaters (2 percent and 34 percent, respectively, in the first two scenarios described above), but we almost always detect very extreme cheating.
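A sketch of the naive manipulation used in this kind of exercise (synthetic arrays; the authors' actual simulation is detailed in Jacob and Levitt [2003]):

```python
import numpy as np

def inject_naive_cheating(answers, key, frac=0.25, block_len=3, rng=None):
    """Overwrite `block_len` consecutive answers with the answer key for a
    random `frac` of students, starting at a common random question."""
    rng = rng if rng is not None else np.random.default_rng()
    out = answers.copy()
    n_students, n_items = out.shape
    start = int(rng.integers(0, n_items - block_len + 1))
    who = rng.choice(n_students, size=int(frac * n_students), replace=False)
    out[np.ix_(who, range(start, start + block_len))] = \
        key[start:start + block_len]
    return out

# Detection rates are estimated by altering many classrooms this way,
# rerunning the algorithm, and counting how often they exceed the cutoffs.
```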

From a political or equity perspective, however, we may be even more concerned with false positives, specifically the risk of accusing highly effective teachers of cheating. Hence we also simulate the impact of a good teacher changing certain answers for certain students from incorrect to correct responses. Our manipulation in this case differs from the cheating simulations above in two ways: (1) we allow some of the gains to be preserved in the following year, and (2) we alter questions such that students will not get random answers correct, but rather will tend to show the greatest improvement on the easiest questions that the students were getting wrong.

22. See Jacob and Levitt [2003] for the precise details of this simulation exercise.


These simulation results suggest that only rarely are good teachers mistakenly labeled as cheaters. For instance, a teacher whose prowess allows him or her to raise test scores by six questions for half the students in the class (more than a 0.5 grade equivalent increase on average for the class) is labeled a cheater only 2 percent of the time. In contrast, a naive cheater with the same test score gains is correctly identified as a cheater in more than 50 percent of cases.

In another test for false positives, we explored whether we might be mistaking emphasis on certain subject material for cheating. For example, if a math teacher spends several months on fractions with a particular class, one would expect the class to do particularly well on all of the math questions relating to fractions, and perhaps worse than average on other math questions. One might imagine a similar scenario in reading if, for example, a teacher creates a lengthy unit on the Underground Railroad, which later happens to appear in a passage on the reading comprehension exam. To examine this, we analyze the nature of the across-question correlations in cheating versus noncheating classrooms. We find that the most highly correlated questions in cheating classrooms are no more likely to measure the same skill (e.g., fractions, geometry) than in noncheating classrooms. The implication of these results is that the prevalence estimates presented above are likely to substantially understate the true extent of cheating (there is little evidence of false positives), but we frequently miss moderate cases of cheating.
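One way such a check might be operationalized, as a sketch (synthetic inputs; `skills` is an assumed array tagging each item with the skill it tests):

```python
import numpy as np

def same_skill_share(correct, skills, top_k=10):
    """Among the top_k most correlated item pairs within a classroom's
    correct/incorrect matrix (students x items), return the share of
    pairs that measure the same skill."""
    corr = np.corrcoef(correct.T)               # item-by-item correlations
    i, j = np.triu_indices_from(corr, k=1)      # all distinct item pairs
    order = np.argsort(corr[i, j])[::-1][:top_k]
    return float(np.mean(skills[i[order]] == skills[j[order]]))

# Comparing this share between flagged and unflagged classrooms tests the
# curricular-emphasis story: similar shares point toward cheating instead.
```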

V.B. The Correlation of Cheating Indicators across Subjects, Classrooms, and Years

If what we are detecting is truly cheating, then one would expect that a teacher who cheats on one part of the test would be more likely to cheat on other parts of the test. Also, a teacher who cheats one year would be more likely to cheat the following year. Finally, to the extent that cheating is either condoned by the principal or carried out by the test coordinator, one would expect to find multiple classes in a school cheating in any given year, and perhaps even that cheating in a school one year predicts cheating in future years. If what we are detecting is not cheating, then one would not necessarily expect to find strong correlation in our cheating indicators across exams for a specific classroom, across classrooms, or across years.


Table III reports regression results testing these predictions. The dependent variable is an indicator for whether we believe a classroom is likely to be cheating on a particular subject test, using our most stringent definition (above the ninety-fifth percentile on both cheating indicators).

TABLE III
PATTERNS OF CHEATING WITHIN CLASSROOMS AND SCHOOLS

Dependent variable = Class suspected of cheating
(mean of the dependent variable = 0.011)

                                                        Full sample        Sample of classes and
                                                                           schools that existed
                                                                           in the prior year
Independent variables                                  (1)       (2)        (3)       (4)

Classroom cheated on exactly one other                0.105     0.103      0.101     0.101
  subject this year                                  (0.008)   (0.008)    (0.009)   (0.009)
Classroom cheated on exactly two other                0.289     0.285      0.243     0.243
  subjects this year                                 (0.027)   (0.027)    (0.031)   (0.031)
Classroom cheated on all three other                  0.627     0.622      0.595     0.595
  subjects this year                                 (0.051)   (0.051)    (0.054)   (0.054)
Cheating rate among all other classes in the            --      0.166      0.134     0.129
  school this year on this subject                             (0.030)    (0.027)   (0.027)
Cheating rate among all other classes in the            --      0.023      0.059     0.045
  school this year on other subjects                           (0.024)    (0.026)   (0.029)
Cheating in this classroom in this subject              --        --       0.096     0.091
  last year                                                               (0.012)   (0.012)
Number of other subjects this classroom                 --        --       0.023     0.018
  cheated on last year                                                    (0.004)   (0.004)
Cheating in this classroom ever in the past             --        --         --      0.006
                                                                                    (0.002)
Cheating rate among other classrooms in this            --        --         --      0.090
  school in past years                                                              (0.040)
Fixed effects for grade/subject/year                   Yes       Yes        Yes       Yes
R^2                                                   0.090     0.093      0.109     0.109
Number of observations                               165,578   165,578    94,182    94,170

The dependent variable is an indicator for whether a classroom is above the 95th percentile on both our suspicious strings and unusual test score measures of cheating on a particular subject test. Estimation is done using a linear probability model. The unit of observation is classroom/grade/year/subject. For columns that include measures of cheating in prior years, observations where that classroom or school does not appear in the data in the prior year are excluded. Standard errors are clustered at the school level to take into account correlations across classrooms as well as serial correlation.


The baseline probability of qualifying as a cheater for this cutoff is 1.1 percent. To fully appreciate the magnitude of the effects implied by the table, it is important to keep this very low baseline in mind. We report estimates from linear probability models (probits yield similar marginal effects), with standard errors clustered at the school level.
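Such a specification can be run with statsmodels; the sketch below uses synthetic data and illustrative regressor names, not the CPS data:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 5000
df = pd.DataFrame({
    "cheat": (rng.random(n) < 0.011).astype(float),  # suspected-cheater flag
    "n_other_cheats": rng.integers(0, 4, n),   # other subjects cheated on
    "school_rate_same": 0.05 * rng.random(n),  # school rate, same subject
    "school_id": rng.integers(0, 300, n),
    "grade": rng.integers(3, 8, n),
    "subject": rng.integers(0, 4, n),
    "year": rng.integers(1993, 2001, n),
})

# Linear probability model with grade/subject/year fixed effects and
# standard errors clustered at the school level, as in Table III.
fit = smf.ols(
    "cheat ~ C(n_other_cheats) + school_rate_same"
    " + C(grade) + C(subject) + C(year)",
    data=df,
).fit(cov_type="cluster", cov_kwds={"groups": df["school_id"]})
print(fit.params.filter(like="n_other_cheats"))
```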

Column 1 of Table III shows that cheating on other tests in the same year is an extremely powerful predictor of cheating in a different subject. If a classroom cheats on exactly one other subject test, the predicted probability of cheating on this test increases by over ten percentage points. Since the baseline cheating rates are only 1.1 percent, classrooms cheating on exactly one other test are ten times more likely to have cheated on this subject than are classrooms that did not cheat on any of the other subjects (which is the omitted category). Classrooms that cheat on two other subjects are almost 30 times more likely to cheat on this test, relative to those not cheating on other tests. If a class cheats on all three other subjects, it is 50 times more likely to also cheat on this test.

There also is evidence of correlation in cheating within schools. A ten-percentage-point increase in cheating classrooms in a school (excluding the classroom in question) on the same subject test raises the likelihood this class cheats by roughly 0.16 percentage points. This potentially suggests some role for centralized cheating by a school counselor, test coordinator, or the principal, rather than by teachers operating independently. There is little evidence that cheating rates within the school on other subject tests affect cheating on this test.

When making comparisons across years (columns 3 and 4), it is important to note that we do not actually have teacher identifiers. We do, however, know what classroom a student is assigned to. Thus, we can only compare the correlation between past and current cheating in a given classroom. To the extent that teacher turnover occurs or teachers switch classrooms, this proxy will be contaminated. Even given this important limitation, cheating in the classroom last year predicts cheating this year. In column 3, for example, we see that classrooms that cheated in the same subject last year are 9.6 percentage points more likely to cheat this year, even after we control for cheating on other subjects in the same year and cheating in other classes in the school. Column 4 shows that prior cheating in the school strongly predicts the likelihood that a classroom will cheat this year.


V.C. Evidence from Retesting under Controlled Circumstances

Perhaps the most compelling evidence for our methods comes from the results of a unique policy intervention. In spring 2002, the Chicago public schools provided us the opportunity to retest over 100 classrooms under controlled circumstances that precluded cheating, a few weeks after the initial ITBS exam was administered.23 Based on suspicious answer strings and unusual test score gains on the initial spring 2002 exam, we identified a set of classrooms most likely to have cheated. In addition, we chose a second group of classrooms that had suspicious answer strings but did not have large test score gains, a pattern that is consistent with a bad teacher who attempts to mask a poor teaching performance by cheating. We also retested two sets of classrooms as control groups. The first control group consisted of classes with large test score gains but no evidence of cheating in their answer strings, a potential indication of effective teaching. These classes would be expected to maintain their gains when retested, subject to possible mean reversion. The second control group consisted of randomly selected classrooms.

The results of the retests are reported in Table IV. Columns 1-3 correspond to reading; columns 4-6 are for math. For each subject we report the test score gain (in standard score units) for three periods: (i) from spring 2001 to the initial spring 2002 test, (ii) from the initial spring 2002 test to the retest a few weeks later, and (iii) from spring 2001 to the spring 2002 retest. For purposes of comparison, we also report the average gain for all classrooms in the system for the first of those measures (the other two measures are unavailable unless a classroom was retested).

Classrooms prospectively identified as "most likely to have cheated" experienced gains on the initial spring 2002 test that were nearly twice as large as the typical CPS classroom (see columns 1 and 4). On the retest, however, those excess gains completely disappeared: the gains between spring 2001 and the spring 2002 retest were close to the systemwide average. Those classrooms prospectively identified as "bad teachers likely to have cheated" also experienced large score declines on the retest (-8.8 and -10.5 standard score points, or roughly two-thirds of a grade equivalent).

23. Full details of this policy experiment are reported in Jacob and Levitt [2003]. The results of the retests were used by CPS to initiate disciplinary action against a number of teachers, principals, and staff.


TABLE IV
RESULTS OF RETESTS: COMPARISON OF RESULTS FOR SPRING 2002 ITBS AND AUDIT TEST

                                      Reading gains between         Math gains between

                                   Spring    Spring    Spring    Spring    Spring    Spring
                                   2001      2001      2002      2001      2001      2002
                                   and       and       and       and       and       and
                                   spring    2002      2002      spring    2002      2002
Category of classroom              2002      retest    retest    2002      retest    retest

All classrooms in the Chicago
  public schools                    14.3       --        --       16.9       --        --
Classrooms prospectively
  identified as most likely
  cheaters (N = 36 on math,
  N = 39 on reading)                28.8      12.6     -16.2      30.0      19.3     -10.7
Classrooms prospectively
  identified as bad teachers
  suspected of cheating
  (N = 16 on math, N = 20
  on reading)                       16.6       7.8      -8.8      17.3       6.8     -10.5
Classrooms prospectively
  identified as good teachers
  who did not cheat (N = 17
  on math, N = 17 on reading)       20.6      21.1       0.5      28.8      25.5      -3.3
Randomly selected classrooms
  (N = 24 overall, but only
  one test per classroom)           14.5      12.2      -2.3      14.5      12.2      -2.3

Results in this table compare test scores at three points in time (spring 2001, spring 2002, and a retest administered under controlled conditions a few weeks after the initial spring 2002 test). The unit of observation is a classroom-subject test pair. Only students taking all three tests are included in the calculations. Because of limited data, math and reading results for the randomly selected classrooms are combined. Only the first two columns are available for all CPS classrooms, since audits were performed only on a subset of classrooms. All entries in the table are in standard score units.


Indeed, comparing the spring 2001 results with the spring 2002 retest, children in these classrooms had gains only half as large as the typical yearly CPS gain. In stark contrast, classrooms prospectively identified as having "good teachers" (i.e., classrooms with large gains but not suspicious answer strings) actually scored even higher on the reading retest than they did on the initial test. The math scores fell slightly on retesting, but these classrooms continued to experience extremely large gains. The randomly selected classrooms also maintained almost all of their gains when retested, as would be expected. In summary, our methods proved to be extremely effective, both in identifying likely cheaters and in correctly classifying teachers who achieved large test score gains in a legitimate manner.

VI. DOES TEACHER CHEATING RESPOND TO INCENTIVES?

From the perspective of economics, perhaps the most interesting question related to teacher cheating is the degree to which it responds to incentives. As noted in the Introduction, there were two major changes in the incentives faced by teachers and students over our sample period. Prior to 1996, ITBS scores were primarily used to provide teachers and parents with a sense of how a child was progressing academically. Beginning in 1996, with the appointment of Paul Vallas as CEO of Schools, the CPS launched an initiative designed to hold students and teachers accountable for student learning.

The reform had two main elements. The first involved putting schools "on probation" if less than 15 percent of students scored at or above national norms on the ITBS reading exams. The CPS did not use math performance to determine probation status. Probation schools that do not exhibit sufficient improvement may be reconstituted, a procedure that involves closing the school and dismissing or reassigning all of the school staff.24 It is clear from our discussions with teachers and administrators that being on probation is viewed as an extremely undesirable circumstance. The second piece of the accountability reform was an end to social promotion, the practice of passing students to the next grade regardless of their academic skills or school performance.

24. For a more detailed analysis of the probation policy, see Jacob [2002] and Jacob and Lefgren [forthcoming].


Under the new policy, students in third, sixth, and eighth grade must meet minimum standards on the ITBS in both reading and mathematics in order to be promoted to the next grade. The promotion standards were implemented in spring 1997 for third and sixth graders. Promotion decisions are based solely on scores in reading comprehension and mathematics.

Table V presents OLS estimates of the relationship between teacher cheating and a variety of classroom and school characteristics.25 The unit of observation is a classroom/subject/grade/year. The dependent variable is an indicator of whether we suspect the classroom cheated, using our most stringent definition: a classroom is designated a cheater if both its test score fluctuations and the suspiciousness of its answer strings are above the ninety-fifth percentile for that grade, subject, and year.26

In column (1), the policy changes are restricted to have a constant impact across all classrooms. The introduction of the social promotion and probation policies is positively correlated with the likelihood of classroom cheating, although the point estimates are not statistically significant at conventional levels. Much more interesting results emerge when we interact the policy changes with the previous year's test scores for the classroom. For both probation and social promotion, cheating rates in the lowest performing classrooms prove to be quite responsive to the change in incentives. In column (2), we see that a classroom one standard deviation below the mean increased its likelihood of cheating by 0.43 percentage points in response to the school probation policy, and roughly 0.65 percentage points due to the ending of social promotion. Given a baseline rate of cheating of 1.1 percent, these effects are substantial. The magnitude of these changes is particularly large considering that no elementary school on probation was actually reconstituted during this period, and that the social promotion policy has a direct impact on students but no direct ramifications for teacher pay or rewards. A classroom one standard deviation above the mean does not see any significant change in cheating in response to these two policies. Such classes are very unlikely to be located in schools at risk for being put on probation, and also are likely to have few students at risk for being retained.

25. Logit and probit models evaluated at the mean yield comparable results, so the estimates from a linear probability model are presented for ease of interpretation.

26. The results are not sensitive to the cheating cutoff used. Note that this measure may include error due to both false positives and negatives. Since the measurement error is in the dependent variable, the primary impact will likely be to simply decrease the precision of our estimates.


TABLE V
OLS ESTIMATES OF THE RELATIONSHIP BETWEEN CHEATING AND CLASSROOM CHARACTERISTICS

Dependent variable = Indicator of classroom cheating

Independent variables                              (1)         (2)         (3)         (4)

Social promotion policy                          0.0011      0.0011      0.0015      0.0023
                                                (0.0013)    (0.0013)    (0.0013)    (0.0009)
School probation policy                          0.0020      0.0019      0.0021      0.0029
                                                (0.0014)    (0.0014)    (0.0014)    (0.0013)
Prior classroom achievement                     -0.0047     -0.0028     -0.0016     -0.0028
                                                (0.0005)    (0.0005)    (0.0007)    (0.0007)
Social promotion x classroom achievement           --       -0.0049     -0.0051     -0.0046
                                                            (0.0014)    (0.0014)    (0.0012)
School probation x classroom achievement           --       -0.0070     -0.0070     -0.0064
                                                            (0.0013)    (0.0013)    (0.0013)
Mixed grade classroom                           -0.0084     -0.0085     -0.0089     -0.0089
                                                (0.0007)    (0.0007)    (0.0008)    (0.0012)
% of students included in official               0.0252      0.0249      0.0141      0.0131
  reporting                                     (0.0031)    (0.0031)    (0.0037)    (0.0037)
Teacher administers exam to own                  0.0067      0.0067      0.0066      0.0061
  students                                      (0.0015)    (0.0015)    (0.0015)    (0.0011)
Test form offered for the first time(a)         -0.0007     -0.0007     -0.0011       --
                                                (0.0011)    (0.0011)    (0.0010)
Average quality of teachers'                       --          --       -0.0026       --
  undergraduate institution                                             (0.0007)
Percent of teachers who have worked at             --          --       -0.0045       --
  the school less than 3 years                                          (0.0031)
Percent of teachers under 30 years of age          --          --        0.0156       --
                                                                        (0.0065)
Percent of students in the school meeting          --          --        0.0001       --
  national norms in reading last year                                   (0.0000)
Percent free lunch in school                       --          --        0.0001       --
                                                                        (0.0000)
Predominantly Black school                         --          --        0.0068       --
                                                                        (0.0019)
Predominantly Hispanic school                      --          --       -0.0009       --
                                                                        (0.0016)
School x Year fixed effects                       No          No          No          Yes
Number of observations                          163,474     163,474     163,474     163,474

The unit of observation is classroom/grade/year/subject, and the sample includes eight years (1993 to 2000), four subjects (reading comprehension and three math sections), and five grades (three to seven). The dependent variable is the cheating indicator derived using the ninety-fifth percentile cutoff. Robust standard errors clustered by school/year are shown in parentheses. Other variables included in the regressions in columns (1) and (2) include a linear time trend, cubic terms for the number of students, a linear grade variable, and fixed effects for subjects. The regression shown in column (3) also includes the following variables: indicators of the percent of students in the classroom who were Black, Hispanic, male, receiving free lunch, old for grade, in a special education program, living in foster care, and living with a nonparental relative; indicators of school size, mobility rate, and attendance rate; and the percent of teachers in the school who had a master's or doctoral degree, lived in Chicago, and were education majors.

(a) Test forms vary only by year, so this variable will drop out of the analysis when school x year fixed effects are included.


These findings are robust to adding a number of classroom and school characteristics (column (3)), as well as school x year fixed effects (column (4)).

Cheating also appears to be responsive to other costs and benefits. Classrooms that tested poorly last year are much more likely to cheat. For example, a classroom with average student prior achievement one classroom standard deviation below the mean is 23 percent more likely to cheat. Classrooms with students in multiple grades are 65 percent less likely to cheat than classrooms where all students are in the same grade. This is consistent with the fact that it is likely more difficult for teachers in such classrooms to cheat, since they must administer two different test forms to students, which will necessarily have different correct answers. Moreover, classes with a higher proportion of students who are included in the official test reporting are more likely to cheat: a 10 percentage point increase in the proportion of students in a class whose test scores "count" will increase the likelihood of cheating by roughly 20 percent.27

Teachers who administer the exam to their own students are 0.67 percentage points (approximately 50 percent) more likely to cheat.28

A few other notable results emerge. First, there is no statistically significant impact on cheating of reusing a test form that has been administered in a previous year. That finding is of interest because it suggests that teachers taking old exams and teaching the precise questions to students is not an important component of what we are detecting as cheating (although anecdotal evidence suggests this practice exists). Classrooms in schools with lower achievement, higher poverty rates, and more Black students are more likely to cheat. Classrooms in schools with teachers who graduated from more prestigious undergraduate institutions are less likely to cheat, whereas classrooms in schools with younger teachers are more likely to cheat.

27. Consistent with this finding, within cheating classrooms, teachers are less likely to alter the answer sheets for students who are excluded from test reporting. Teachers also appear to less frequently cheat for boys and for older students.

28. In an earlier version of the paper [Jacob and Levitt 2002], we examine for which students teachers tend to cheat. We find that teachers are most likely to cheat for students in the second quartile of the national ability distribution (twenty-fifth to fiftieth percentile) in the previous year, consistent with the incentives under the probation policy.


VII. CONCLUSION

This paper develops an algorithm for determining the prevalence of teacher cheating on standardized tests and applies the results to data from the Chicago public schools. Our methods reveal thousands of instances of classroom cheating, representing 4-5 percent of the classrooms each year. Moreover, we find that teacher cheating appears quite responsive to relatively minor changes in incentives.

The obvious benefits of high-stakes tests as a means of providing incentives must therefore be weighed against the possible distortions that these measures induce. Explicit cheating is merely one form of distortion. Indeed, the kind of cheating we focus on is one that could be greatly reduced at a relatively low cost through the implementation of proper safeguards, such as is done by Educational Testing Service on the SAT or GRE exams.29

Even if this particular form of cheating were eliminated, however, there are a virtually infinite number of dimensions along which strategic school personnel can act to game the current system. For instance, Figlio and Winicki [2001] and Rohlfs [2002] document that schools attempt to increase the caloric intake of students on test days, and disruptive students are especially likely to be suspended from school during official testing periods [Figlio 2002]. It may be prohibitively costly, or even impossible, to completely safeguard the system.

Finally, this paper fits into a small but growing body of research focused on identifying corrupt or illicit behavior on the part of economic actors (see Porter and Zona [1993], Fisman [2001], Di Tella and Schargrodsky [2001], and Duggan and Levitt [2002]). Because individuals engaged in such behavior actively attempt to hide their trails, the intellectual exercise associated with uncovering their misdeeds differs substantially from the typical economic application, in which the researcher starts with a well-defined measure of the outcome variable (e.g., earnings, economic growth, profits) and then attempts to uncover the determinants of these outcomes.

29. Although even tests administered by the Educational Testing Service are not immune to cheating, as evidenced by recent reports that many of the GRE scores from China are fraudulent [Contreras 2002].


In the case of corruption, there is typically no clear outcome variable, making it necessary for the researcher to employ nonstandard approaches in generating such a measure.

APPENDIX: THE CONSTRUCTION OF SUSPICIOUS STRING MEASURES

This appendix describes in greater detail how we construct each of our measures of unexpected or suspicious responses.

The first measure focuses on the most unlikely block of identical answers given on consecutive questions. Using past test scores, future test scores, and background characteristics, we predict the likelihood that each student will give each answer on each question. For each item, a student has four choices (A, B, C, or D), only one of which is correct. We estimate a multinomial logit for each item on the exam in order to predict how students will respond to each question. We estimate the following model for each item, using information from other students in that year, grade, and subject:

(1)   \Pr(Y_{isc} = j) = \frac{e^{\beta_j X_s}}{\sum_{j=1}^{J} e^{\beta_j X_s}}

where Y_{isc} indicates the response of student s in class c on item i, the number of possible responses (J) is four, and X_s is a vector that includes measures of prior and future student achievement in math and reading, as well as demographic variables (such as race, gender, and free lunch status) for student s. Thus, a student's predicted probability of choosing a particular response is identified by the likelihood of other students (in the same year, grade, and subject) with similar background characteristics choosing that response.

Notice that by including future as well as prior test scores in the model, we decrease the likelihood that students with unusually good teachers will be identified as cheaters, since these students will likely retain some of the knowledge learned in the base year and thus have higher future test scores. Also note that by estimating the probability of selecting each possible response, rather than simply estimating the probability of choosing the correct response, we take advantage of any additional information that is provided by particular response patterns in a classroom.
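As an illustration, a per-item fit might look like the following, with scikit-learn's multinomial logistic regression standing in for the authors' estimator (synthetic covariates; the last line previews the chosen-response probability of equation (2) below):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
n_students = 2000

# X: prior/future math and reading scores plus demographics (synthetic);
# y: the response (0=A, 1=B, 2=C, 3=D) on a single item.
X = rng.normal(size=(n_students, 6))
y = rng.integers(0, 4, size=n_students)

# One multinomial logit per item, fit on students in the same year, grade,
# and subject; predict_proba yields the probabilities of equation (1).
logit = LogisticRegression(max_iter=1000).fit(X, y)
probs = logit.predict_proba(X)                 # shape (n_students, 4)

# Probability each student assigns to the answer actually chosen.
p_chosen = probs[np.arange(n_students), y]
print(p_chosen[:5])
```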

Using the estimates from this model, we calculate the predicted probability that each student would answer each item in the way that he or she in fact did:



(2)   p_{isc} = \frac{e^{\beta_k X_s}}{\sum_{j=1}^{J} e^{\beta_j X_s}}, \quad \text{for } k = \text{response actually chosen by student } s \text{ on item } i

This provides us with one measure per student per item. Taking the product over items within student, we calculate the probability that a student would have answered a string of consecutive questions from item m to item n as he or she did:

(3)   p_{sc}^{mn} = \prod_{i=m}^{n} p_{isc}

We then take the product across all students in the classroom who had identical responses in the string. If we define z as a student, S_{zc}^{mn} as the string of responses for student z from item m to item n, and S_{sc}^{mn} as the string for student s, then we can express the product as

(4)   \tilde{p}_{sc}^{mn} = \prod_{s \in \{z : S_{zc}^{mn} = S_{sc}^{mn}\}} p_{sc}^{mn}

Note that if there are n_s students in class c and each student has a unique set of responses to these particular items, then \tilde{p}_{sc}^{mn} collapses to p_{sc}^{mn} for each student, and there will be n_s distinct values within the class. On the other extreme, if all of the students in class c have identical responses, then there is only one distinct value of \tilde{p}_{sc}^{mn}. We repeat this calculation for all possible consecutive strings of length three to seven, that is, for all S^{mn} such that 3 \le n - m \le 7. To create our first indicator of suspicious string patterns, we take the minimum of the predicted block probability for each classroom:

such that 3 m 2 n 7 To create our rst indicator ofsuspicious string patterns we take the minimum of the predictedblock probability for each classroom

MEASURE 1. M1_c = \min_s (\tilde{p}_{sc}^{mn})

This measure captures the least likely block of identical answers given on consecutive questions in the classroom.

The second measure of suspicious answer strings is intended to capture more general patterns of similarity in student responses. To construct this measure, we first calculate the residuals for each of the possible choices a student could have made for each item:



(5)   e_{jisc} = \begin{cases} 0 - \dfrac{e^{\beta_j X_s}}{\sum_{j=1}^{J} e^{\beta_j X_s}} & \text{if } j \neq k \\[2ex] 1 - \dfrac{e^{\beta_j X_s}}{\sum_{j=1}^{J} e^{\beta_j X_s}} & \text{if } j = k \end{cases}

where e_{jisc} is the residual for response j on item i by student s in classroom c. We thus have four separate residuals per student per item.

To create a classroom-level measure of the response to item i, we need to combine the information for each student. First, we sum the residuals for each response across students within a classroom:

(6)   e_{jic} = \sum_s e_{jisc}

If there is no within-class correlation in the way that students responded to a particular item, this term should be approximately zero. Second, we sum across the four possible responses for each item within classrooms. At the same time, we square each of the component residual measures to accentuate outliers, and divide by the number of students in the class (n_{sc}) to normalize by class size:

(7)   v_{ic} = \frac{\sum_j e_{jic}^2}{n_{sc}}

The statistic v_{ic} captures something like the variance of student responses on item i within classroom c. Notice that we choose to first sum the residuals of each response across students, and then sum the classroom-level measures for each response, rather than summing across responses within student initially. We do this in order to emphasize the classroom-level tendencies in response patterns.

Our second measure of suspicious strings is simply the classroom average (across items) of this variance term across all test items:

MEASURE 2. M2_c = \bar{v}_c = \sum_i v_{ic} / n_i, where n_i is the number of items on the exam.

Our third measure is simply the variance (as opposed to the mean) in the degree of correlation across questions within a classroom:



MEASURE 3. M3_c = \sigma_{v_c}^2 = \sum_i (v_{ic} - \bar{v}_c)^2 / n_i
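Given residuals from equation (5), Measures 2 and 3 reduce to a few array operations; a sketch for a single classroom with synthetic residuals standing in for the multinomial-logit output:

```python
import numpy as np

rng = np.random.default_rng(5)
n_choices, n_items, n_students = 4, 40, 25

# e[j, i, s]: residual for choice j on item i for student s (equation (5));
# random values here stand in for the fitted residuals.
e = rng.normal(0.0, 0.4, size=(n_choices, n_items, n_students))

e_jic = e.sum(axis=2)                         # equation (6): sum over students
v_ic = (e_jic ** 2).sum(axis=0) / n_students  # equation (7): one value per item

M2 = v_ic.mean()                  # Measure 2: mean of v_ic across items
M3 = ((v_ic - M2) ** 2).mean()    # Measure 3: variance of v_ic across items
print(M2, M3)
```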

Our final indicator focuses on the extent to which a student's response pattern was different from that of other students with the same aggregate score that year. Let q_{isc} equal one if student s in classroom c answered item i correctly, and zero otherwise. Let A_s equal the aggregate score of student s on the exam. We then determine what fraction of students at each aggregate score level answered each item correctly. If we let n_{sA} equal the number of students with an aggregate score of A, then this fraction \bar{q}_i^A can be expressed as

(8)   \bar{q}_i^A = \frac{\sum_{s \in \{z : A_z = A_s\}} q_{isc}}{n_{sA}}

We then calculate a measure of how much the response pattern of student s differed from the response pattern of other students with the same aggregate score. We do so by subtracting a student's answer on item i from the mean response of all students with aggregate score A, squaring these deviations, and then summing across all items on the exam:

(9)   Z_{sc} = \sum_i (q_{isc} - \bar{q}_i^A)^2

We then subtract out the mean deviation for all students with the same aggregate score, \bar{Z}^A, and sum across the students within each classroom to obtain our final indicator:

MEASURE 4. M4_c = \sum_s (Z_{sc} - \bar{Z}^A)
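A compact sketch of Measure 4 (synthetic correct/incorrect matrix, with classrooms assigned at random):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(6)
n_students, n_items = 3000, 40
q = (rng.random((n_students, n_items)) < 0.6).astype(float)  # 1 = correct
classroom = rng.integers(0, 120, n_students)

A = q.sum(axis=1).astype(int)      # aggregate score per student
# Equation (8): mean response pattern among students with the same score.
qbar = np.zeros_like(q)
for a in np.unique(A):
    qbar[A == a] = q[A == a].mean(axis=0)

Z = ((q - qbar) ** 2).sum(axis=1)  # equation (9)

# Subtract the mean deviation at each aggregate score, then sum by class.
Z_demeaned = Z - pd.Series(Z).groupby(A).transform("mean").to_numpy()
M4 = pd.Series(Z_demeaned).groupby(classroom).sum()
print(M4.sort_values(ascending=False).head())
```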

KENNEDY SCHOOL OF GOVERNMENT, HARVARD UNIVERSITY

UNIVERSITY OF CHICAGO AND AMERICAN BAR FOUNDATION

REFERENCES

Aiken, Linda R., "Detecting, Understanding, and Controlling for Cheating on Tests," Research in Higher Education, XXXII (1991), 725-736.

Angoff, William H., "The Development of Statistical Indices for Detecting Cheaters," Journal of the American Statistical Association, LXIX (1974), 44-49.

Baker, George, "Incentive Contracts and Performance Measurement," Journal of Political Economy, C (1992), 598-614.

Bellezza, Francis S., and Suzanne F. Bellezza, "Detection of Cheating on Multiple-Choice Tests by Using Error Similarity Analysis," Teaching of Psychology, XVI (1989), 151-155.

Carnoy, Martin, and Susanna Loeb, "Does External Accountability Affect Student Outcomes? A Cross-State Analysis," Working Paper, Stanford University School of Education, 2002.

Contreras, Russell, "Inquiry Uncovers Possible Cheating on GRE in Asia," Associated Press, August 7, 2002.

Cullen, Julie Berry, and Randall Reback, "Tinkering Toward Accolades: School Gaming under a Performance Accountability System," Working Paper, University of Michigan, 2002.

Deere, Donald, and Wayne Strayer, "Putting Schools to the Test: School Accountability, Incentives and Behavior," Working Paper, Texas A&M University, 2002.

DiTella, Rafael, and Ernesto Schargrodsky, "The Role of Wages and Auditing during a Crackdown on Corruption in the City of Buenos Aires," unpublished manuscript, Harvard Business School, 2001.

Duggan, Mark, and Steven Levitt, "Winning Isn't Everything: Corruption in Sumo Wrestling," American Economic Review, XCII (2002), 1594-1605.

Figlio, David N., "Testing, Crime and Punishment," Working Paper, University of Florida, 2002.

Figlio, David N., and Joshua Winicki, "Food for Thought: The Effects of School Accountability Plans on School Nutrition," Working Paper, University of Florida, 2001.

Figlio, David N., and Lawrence S. Getzler, "Accountability, Ability and Disability: Gaming the System," Working Paper, University of Florida, 2002.

Fisman, Ray, "Estimating the Value of Political Connections," American Economic Review, XCI (2001), 1095-1102.

Frary, Robert B., "Statistical Detection of Multiple-Choice Answer Copying: Review and Commentary," Applied Measurement in Education, VI (1993), 153-165.

Frary, Robert B., T. Nicholas Tideman, and Thomas Watts, "Indices of Cheating on Multiple-Choice Tests," Journal of Educational Statistics, II (1977), 235-256.

Glaeser, Edward, and Andrei Shleifer, "A Reason for Quantity Regulation," American Economic Review, XCI (2001), 431-435.

Grissmer, David W., Ann Flanagan, et al., "Improving Student Achievement: What NAEP Test Scores Tell Us," RAND Corporation, MR-924-EDU, 2000.

Hanushek, Eric A., and Margaret E. Raymond, "Improving Educational Quality: How Best to Evaluate Our Schools," Conference paper, Federal Reserve Bank of Boston, 2002.

Hofkins, Diane, "Cheating 'Rife' in National Tests," New York Times Educational Supplement, June 16, 1995.

Holland, Paul W., "Assessing Unusual Agreement between the Incorrect Answers of Two Examinees Using the K-Index: Statistical Theory and Empirical Support," Technical Report No. 96-5, Educational Testing Service, 1996.

Holmstrom, Bengt, and Paul Milgrom, "Multitask Principal-Agent Analyses: Incentive Contracts, Asset Ownership and Job Design," Journal of Law, Economics and Organization, VII (1991), 24-51.

Jacob, Brian A., "Accountability, Incentives and Behavior: The Impact of High-Stakes Testing in the Chicago Public Schools," Working Paper No. 8968, National Bureau of Economic Research, 2002.

Jacob, Brian A., and Lars Lefgren, "The Impact of Teacher Training on Student Achievement: Quasi-Experimental Evidence from School Reform Efforts in Chicago," Journal of Human Resources, forthcoming.

Jacob, Brian A., and Steven Levitt, "Catching Cheating Teachers: The Results of an Unusual Experiment in Implementing Theory," Working Paper No. 9413, National Bureau of Economic Research, 2003.

Klein, Stephen P., Laura S. Hamilton, et al., "What Do Test Scores in Texas Tell Us?" RAND Corporation, 2000.

Kolker, Claudia, "Texas Offers Hard Lessons on School Accountability," Los Angeles Times, April 14, 1999.

Lindsay, Drew, "Whodunit? Officials Find Thousands of Erasures on Standardized Tests and Suspect Tampering," Education Week (1996), 25-29.

Loughran, Regina, and Thomas Comiskey, "Cheating the Children: Educator Misconduct on Standardized Tests," Report of the City of New York Special Commissioner of Investigation for the New York City School District, 1999.

Marcus, John, "Faking the Grade," Boston Magazine, February 2000.

May, Meredith, "State Fears Cheating by Teachers," San Francisco Chronicle, October 4, 2000.

Perlman, Carol L., "Results of a Citywide Testing Program Audit in Chicago," Paper for American Educational Research Association annual meeting, Chicago, IL, 1985.

Porter, Robert H., and J. Douglas Zona, "Detection of Bid Rigging in Procurement Auctions," Journal of Political Economy, CI (1993), 518-538.

Richards, Craig E., and Tian Ming Sheu, "The South Carolina School Incentive Reward Program: A Policy Analysis," Economics of Education Review, XI (1992), 71-86.

Rohlfs, Chris, "Estimating Classroom Learning Using the School Breakfast Program as an Instrument for Attendance and Class Size: Evidence from Chicago Public Schools," unpublished manuscript, University of Chicago, 2002.

Tysome, Tony, "Cheating Purge: Inspectors Out," New York Times Higher Education Supplement, August 19, 1994.

Wollack, James A., "A Nominal Response Model Approach for Detecting Answer Copying," Applied Psychological Measurement, XX (1997), 144-152.


student would be expected to gain one grade equivalent for each year of school.

The top panel of data shows a class in which we suspect teacher cheating took place; the bottom panel corresponds to a typical classroom. Two striking differences between the classrooms are readily apparent. First, in the cheating classroom, a large block of students provided identical answers on consecutive questions in the middle of the test, as indicated by the boxed area in the figure. For the other classroom, no such pattern exists. Second, looking at the pattern of test scores, students in the cheating classroom experienced large increases in test scores from the previous to the current year (1.7 grade equivalents on average), and actually experienced declines on average the following year. In contrast, the students in the typical classroom gained roughly one grade equivalent each year, as would be expected.

The indicators we use as evidence of cheating formalize and extend the basic picture that emerges from Figure I. We divide these indicators into two distinct groups that respectively capture unusual test score fluctuations and suspicious patterns of answer strings. In this section we describe informally the measures that we use. A more rigorous treatment of their construction is provided in the Appendix.

III.A. Indicator One: Unexpected Test Score Fluctuations

Given that the aim of cheating is to raise test scores, one signal of teacher cheating is an unusually large gain in test scores relative to how those same students tested in the previous year. Since test score gains that result from cheating do not represent real gains in knowledge, there is no reason to expect the gains to be sustained on future exams taken by these students (unless, of course, next year's teachers also cheat on behalf of the students). Thus, large gains due to cheating should be followed by unusually small test score gains for these students in the following year. In contrast, if large test score gains are due to a talented teacher, the student gains are likely to have a greater permanent component, even if some regression to the mean occurs. We construct a summary measure of how unusual the test score fluctuations are by ranking each classroom's average test score gains relative to all other


classrooms in that same subject, grade, and year,7 and computing the following statistic:

(3)  SCORE_{cbt} = (rank\_gain_{cbt})^2 + (1 - rank\_gain_{cb,t+1})^2

where rank_gain_{cbt} is the percentile rank for class c in subject b in year t. Classes with relatively big gains on this year's test and relatively small gains on next year's test will have high values of SCORE. Squaring the individual terms gives relatively more weight to big test score gains this year and big test score declines the following year.8 In the empirical analysis we consider three possible cutoffs for what it means to have a "high" value on SCORE, corresponding to the eightieth, ninetieth, and ninety-fifth percentiles among all classrooms in the sample.
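As an illustration, the SCORE statistic can be computed with a few lines of pandas. The data layout (one row per classroom-subject-year, with hypothetical columns 'gain' and 'gain_next') is assumed for this sketch and is not the authors' code:

```python
import pandas as pd

def score_statistic(df):
    """SCORE_cbt = rank_gain_cbt^2 + (1 - rank_gain_cb,t+1)^2 (equation (3))."""
    grp = df.groupby(["subject", "grade", "year"])
    rank_t = grp["gain"].rank(pct=True)        # percentile rank of this year's gain
    rank_t1 = grp["gain_next"].rank(pct=True)  # rank of the same students' gain next year
    return rank_t ** 2 + (1 - rank_t1) ** 2
```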

III.B. Indicator Two: Suspicious Answer Strings

The quickest and easiest way for a teacher to cheat is to alter the same block of consecutive questions for a substantial portion of students in the class, as was apparently done in the classroom in the top panel of Figure I. More sophisticated interventions might involve skipping some questions so as to avoid a large block of identical answers, or altering different blocks of questions for different students.

We combine four different measures of how suspicious a classroom's answer strings are in determining whether a classroom may be cheating. The first measure focuses on the most unlikely block of identical answers given by students on consecutive questions. Using past test scores, future test scores, and background characteristics, we predict the likelihood that each student will give each possible answer (A, B, C, or D) on every question, using a multinomial logit. Each student's predicted probability of choosing a particular response is identified by the likelihood that other students (in the same year, grade, and subject) with similar background characteristics will choose that response. We then search over all combinations of students and consecutive questions to find the block of identical answers given by students in a classroom least likely to have arisen by chance.9

7. We also experimented with more complicated mechanisms for defining large or small test score gains (e.g., predicting each student's expected test score gain as a function of past test scores and background characteristics, and computing a deviation measure for each student, which was then aggregated to the classroom level), but because the results were similar, we elected to use the simpler method. We have also defined gains and losses using an absolute metric (e.g., where gains in excess of 1.5 or 2 grade equivalents are considered unusually large), and obtain comparable results.

8. In the following year, the students who were in a particular classroom are typically scattered across multiple classrooms. We base all calculations off of the composition of this year's class.

The more unusual is the most unusual block of test responses (adjusting for class size and the number of questions on the exam, both of which increase the possible combinations over which we search), the more likely it is that cheating occurred. Thus, if ten very bright students in a class of thirty give the correct answers to the first five questions on the exam (typically the easier questions), the block of identical answers will not appear unusual. In contrast, if all fifteen students in a low-achieving classroom give the same correct answers to the last five questions on the exam (typically the harder questions), this would appear quite suspect.
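A brute-force sketch of this first measure is shown below. It scans every window of consecutive questions, groups students by their answer string on that window, and records the least likely shared block under the fitted probabilities. The window lengths, the two-student minimum, and the omission of the class-size and test-length adjustment described above are simplifications of ours, not the authors' choices.

```python
import numpy as np
from collections import defaultdict

def most_unlikely_block(probs, answers, max_len=7):
    """Smallest log-likelihood over all blocks of identical consecutive answers.

    probs   -- (n_students, n_items, 4) predicted choice probabilities
    answers -- (n_students, n_items) integer-coded responses (0..3)
    """
    n_students, n_items = answers.shape
    # log-probability each student assigns to the answer actually given
    logp = np.log(np.take_along_axis(probs, answers[..., None], axis=2)[..., 0])
    best = 0.0  # log-likelihoods are <= 0; smaller means more suspicious
    for start in range(n_items):
        for length in range(3, max_len + 1):
            end = start + length
            if end > n_items:
                break
            groups = defaultdict(list)
            for s in range(n_students):
                groups[tuple(answers[s, start:end])].append(s)
            for members in groups.values():
                if len(members) < 2:
                    continue  # a single student is not a shared block
                best = min(best, logp[members, start:end].sum())
    return best
```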

The second measure of suspicious answer strings involves the overall degree of correlation in student answers across the test. When a teacher changes answers on test forms, it presumably increases the uniformity of student test forms across students in the class. This measure is meant to capture more general patterns of similarity in student responses, beyond just identical blocks of answers. Based on the results of the multinomial logit described above, for each question and each student we create a measure of how unexpected the student's response was. We then combine the information for each student in the classroom to create something akin to the within-classroom correlation in student responses. This measure will be high if students in a classroom tend to give the same answers on many questions, especially if the answers given are unexpected (i.e., correct answers on hard questions or systematic mistakes on easy questions).

Of course, within-classroom correlation may arise for many reasons other than cheating (e.g., the teacher may emphasize certain topics during the school year). Therefore, a third indicator of potential cheating is a high variance in the degree of correlation across questions. If the teacher changes answers for multiple students on selected questions, the within-class correlation on those particular questions will be extremely high, while the degree of within-class correlation on other questions is likely to be typical. This leads the cross-question variance in correlations to be larger than normal in cheating classrooms.

9. Note that we do not require the answers to be correct. Indeed, in many classrooms the most unusual strings include some incorrect answers. Note also that these calculations are done under the assumption that a given student's answers are uncorrelated (conditional on observables) across questions on the exam, and that answers are uncorrelated across students. Of course, this assumption is unlikely to be true. Since all of our comparisons rely on the relative unusualness of the answers given in different classrooms, this simplifying assumption is not problematic unless the correlation within and across students varies by classroom.

Our final indicator compares the answers that students in one classroom give with the answers of other students in the system who take the identical test and get the exact same score. This measure relies on the fact that questions vary significantly in difficulty. The typical student will answer most of the easy questions correctly, but get many of the hard questions wrong (where "easy" and "hard" are based on how well students of similar ability do on the question). If students in a class systematically miss the easy questions while correctly answering the hard questions, this may be an indication of cheating.

Our overall measure of suspicious answer strings is constructed in a manner parallel to our measure of unusual test score fluctuations. Within a given subject, grade, and year, we rank classrooms on each of these four indicators, and then take the sum of squared ranks across the four measures:10

(4)  ANSWERS_{cbt} = (rank\_m1_{cbt})^2 + (rank\_m2_{cbt})^2 + (rank\_m3_{cbt})^2 + (rank\_m4_{cbt})^2

In the empirical work we again use three possible cutoffs for potential cheating: eightieth, ninetieth, and ninety-fifth percentiles.
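Given classroom-level values of the four string measures (hypothetical columns m1 through m4), the ANSWERS statistic mirrors the SCORE construction; a sketch:

```python
def answers_statistic(df):
    """ANSWERS_cbt: sum of squared within-cell percentile ranks (equation (4))."""
    grp = df.groupby(["subject", "grade", "year"])
    total = 0.0
    for m in ["m1", "m2", "m3", "m4"]:
        total = total + grp[m].rank(pct=True) ** 2
    return total
```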

III. A STRATEGY FOR IDENTIFYING THE PREVALENCE OF CHEATING

The previous section described indicators that are likely to be correlated with cheating. Because sometimes such patterns arise by chance, however, not every classroom with large test score fluctuations and suspicious answer strings is cheating. Furthermore, the particular choice of what qualifies as a "large" fluctuation or a "suspicious" set of answer strings will necessarily be arbitrary. In this section we present an identification strategy that, under a set of defensible assumptions, nonetheless provides estimates of the prevalence of cheating.

To identify the number of cheating classrooms in a given year, we would like to compare the observed joint distribution of test score fluctuations and suspicious answer strings with a counterfactual distribution in which no cheating occurs. Differences between the two distributions would provide an estimate of how much cheating occurred. If teachers in a particular school or year are cheating, there will be more classrooms exhibiting both unusual test score fluctuations and suspicious answer strings than otherwise expected. In practice, we do not have the luxury of observing such a counterfactual. Instead, we must make assumptions about what the patterns would look like absent cheating.

10. Because different subjects and grades have differing numbers of questions, it is difficult to make meaningful comparisons across tests on the raw indicators.

Our identification strategy hinges on three key assumptions: (1) cheating increases the likelihood a class will have both large test score fluctuations and suspicious answer strings; (2) if cheating classrooms had not cheated, their distribution of test score fluctuations and answer string patterns would be identical to noncheating classrooms; and (3) in noncheating classrooms, the correlation between test score fluctuations and suspicious answers is constant throughout the distribution.11

If assumption (1) holds, then cheating classrooms will be concentrated in the upper tail of the joint distribution of unusual test score fluctuations and suspicious answer strings. Other parts of the distribution (e.g., classrooms ranked in the fiftieth to seventy-fifth percentile of suspicious answer strings) will consequently include few cheaters. As long as cheating classrooms would look similar to noncheating classrooms on our measures if they had not cheated (assumption (2)), classes in the part of the distribution with few cheaters provide the noncheating counterfactual that we are seeking.

The difficulty, however, is that we only observe this noncheating counterfactual in the bottom and middle parts of the distribution, when what we really care about is the upper tail of the distribution where the cheaters are concentrated. Assumption (3), which requires that in noncheating classrooms the correlation between test score fluctuations and suspicious answers is constant throughout the distribution, provides a potential solution. If this assumption holds, we can use the part of the distribution that is relatively free of cheaters to project what the right tail of the distribution would look like absent cheating. The gap between the predicted and observed frequency of classrooms that are extreme on both the test score fluctuation and suspicious answer string measures provides our estimate of cheating.

11. We formally derive the mathematical model described in this section in Jacob and Levitt [2003].

Figure II demonstrates how our identification strategy works empirically. The horizontal axis in the figure ranks classrooms according to how suspicious their answer strings are. The vertical axis is the fraction of the classrooms that are above the ninety-fifth percentile on the unusual test score fluctuations measure.12

The graph combines all classrooms and all subjects in our data.13

Over most of the range (roughly from zero to the seventy-fifth percentile on the horizontal axis), there is a slight positive correlation between unusual test scores and suspicious answer strings.

12. The choice of the ninety-fifth percentile is somewhat arbitrary. In the empirical work that follows, we also consider the eightieth and ninetieth percentiles. The choice of the cutoff does not affect the basic patterns observed in the data.

13. To construct the figure, classes were rank ordered according to their answer strings and divided into 200 equally sized segments. The circles in the figure represent these 200 local means. The line displayed in the graph is the fitted value of a regression with a seventh-order polynomial in a classroom's rank on the suspicious strings measure.

FIGURE II
The Relationship between Unusual Test Scores and Suspicious Answer Strings

The horizontal axis reflects a classroom's percentile rank in the distribution of suspicious answer strings within a given grade, subject, and year, with zero representing the least suspicious classroom and one representing the most suspicious classroom. The vertical axis is the probability that a classroom will be above the ninety-fifth percentile on our measure of unusual test score fluctuations. The circles in the figure represent averages from 200 equally spaced cells along the x-axis. The predicted line is based on a probit model estimated with seventh-order polynomials in the suspicious string measure.


This is the part of the distribution that is unlikely to contain many cheating classrooms. Under the assumption that the correlation between our two measures is constant in noncheating classrooms over the whole distribution, we would predict that the straight line observed for all but the right-hand portion of the graph would continue all the way to the right edge of the graph.

Instead, as one approaches the extreme right tail of the distribution of suspicious answer strings, the probability of large test score fluctuations rises dramatically. That sharp rise, we argue, is a consequence of cheating. To estimate the prevalence of cheating, we simply compare the actual area under the curve in the far right tail of Figure II to the predicted area under the curve in the right tail, projecting the linear relationship observed from the fiftieth to seventy-fifth percentile out to the ninety-ninth percentile. Because this identification strategy is necessarily indirect, we devote a great deal of attention to presenting a wide variety of tests of the validity of our approach, the sensitivity of the results to alternative assumptions, and the plausibility of our findings.
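The following sketch illustrates the projection step with a simple linear fit on the 50th-75th percentile range (footnote 21 notes that linear and quadratic fits and simple averages give similar results); the variable names are ours:

```python
import numpy as np

def estimated_prevalence(answers_rank, high_score, tail_cut=0.95):
    """Excess mass of high-SCORE classrooms in the right tail of ANSWERS.

    answers_rank -- (n,) percentile rank (0..1) of each classroom's ANSWERS
    high_score   -- (n,) boolean: classroom above the chosen SCORE cutoff
    """
    # Fit the relationship where few cheaters should be (50th-75th percentiles)
    calib = (answers_rank >= 0.50) & (answers_rank < 0.75)
    slope, intercept = np.polyfit(answers_rank[calib],
                                  high_score[calib].astype(float), 1)
    # Project that line into the tail and count the excess classrooms
    tail = answers_rank >= tail_cut
    predicted = (intercept + slope * answers_rank[tail]).sum()
    observed = float(high_score[tail].sum())
    return (observed - predicted) / len(answers_rank)
```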

IV. DATA AND INSTITUTIONAL BACKGROUND

Elementary students in Chicago public schools take a standardized multiple-choice achievement exam known as the Iowa Test of Basic Skills (ITBS). The ITBS is a national norm-referenced exam with a reading comprehension section and three separate math sections.14 Third through eighth grade students in Chicago are required to take the exams each year. Most schools administer the exams to first and second grade students as well, although this is not a district mandate.

Our base sample includes all students in third to seventh grade for the years 1993-2000.15 For each student we have the question-by-question answer string on each year's tests, school and classroom identifiers, the full history of prior and future test scores, and demographic variables including age, sex, race, and free lunch eligibility. We also have information about a wide range of school-level characteristics. We do not, however, have individual teacher identifiers, so we are unable to directly link teachers to classrooms or to track a particular teacher over time.

14. There are also other parts of the test which are either not included in official school reporting (spelling, punctuation, grammar) or are given only in select grades (science and social studies), for which we do not have information.

15. We also have test scores for eighth graders, but we exclude them because our algorithm requires test score data for the following year, and the ITBS test is not administered to ninth graders.

Because our cheating proxies rely on comparisons to past and future test scores, we drop observations that are missing reading or math scores in either the preceding year or the following year (in addition to those with missing test scores in the baseline year).16 Less than one-half of 1 percent of students with missing demographic data are also excluded from the analysis. Finally, because our algorithms for identifying cheating rely on identifying suspicious patterns within a classroom, our methods have little power in classrooms with small numbers of students. Consequently, we drop all classrooms for which we have fewer than ten valid students in a particular grade after our other exclusions (roughly 3 percent of students). A handful of classrooms recorded as having more than 40 students, presumably multiple classrooms not separately identified in the data, are also dropped. Our final data set contains roughly 20,000 students per grade per year, distributed across approximately 1,000 classrooms, for a total of over 40,000 classroom-years of data (with four subject tests per classroom-year) and over 700,000 student-year observations.

Summary statistics for the full sample of classrooms are shown in Table I, where the unit of observation is at the level of class/subject/year. As in many urban school districts, students in Chicago are disproportionately poor and minority. Roughly 60 percent of students are Black and 26 percent are Hispanic, with nearly 85 percent receiving free or reduced-price lunch. Less than 29 percent of students scored at or above national norms in reading.

The ITBS exams are administered over a weeklong period in early May. Third grade teachers are permitted to administer the exam to their own students, while other teachers switch classes to administer the exams. The exams are generally delivered to the schools one to two weeks before testing, and are supposed to be kept in a secure location by the principal or the school's test coordinator, an individual in the school designated to coordinate the testing process (often a counselor or administrator).

16. The exact number of students with missing test data varies by year and grade. Overall, we drop roughly 12 percent of students because of missing test score information in the baseline year. These are students who (i) were not tested because of placement in bilingual or special education programs, or (ii) simply missed the exam that particular year. In addition to these students, we also drop approximately 13 percent of students who were missing test scores in either the preceding or following years. Test data may be missing either because a student did not attend school on the days of the test, or because the student transferred into the CPS system in the current year or left the system prior to the next year of testing.

TABLE I
SUMMARY STATISTICS

                                                               Mean (s.d.)
Classroom characteristics
  Mixed grade classroom                                        0.073
  Teacher administers exams to her own students (3rd grade)    0.206
  Percent of students who were tested and included in
    official reporting                                         0.883
  Average prior achievement
    (as deviation from year/grade/subject mean)                -0.004 (0.661)
  % Black                                                      0.595
  % Hispanic                                                   0.263
  % Male                                                       0.495
  % Old for grade                                              0.086
  % Living in foster care                                      0.044
  % Living with nonparental relative                           0.104
  Cheater (95th percentile cutoff)                             0.013
School characteristics
  Average quality of teachers' undergraduate institution
    in the school                                              -2.550 (0.877)
  Percent of teachers who live in Chicago                      0.712
  Percent of teachers who have an MA or a PhD                  0.475
  Percent of teachers who majored in education                 0.712
  Percent of teachers under 30 years of age                    0.114
  Percent of teachers at the school less than 3 years          0.547
  % students at national norms in reading last year            28.8
  % students receiving free lunch in school                    84.7
  Predominantly Black school                                   0.522
  Predominantly Hispanic school                                0.205
  Mobility rate in school                                      28.6
  Attendance rate in school                                    92.6
  School size                                                  722 (317)
Accountability policy
  Social promotion policy                                      0.215
  School probation policy                                      0.127
  Test form offered for the first time                         0.371
Number of observations                                         163,474

The data summarized above include one observation per classroom/year/subject. The sample includes all students in third to seventh grade for the years 1993-2000. We drop observations for the following reasons: (a) missing reading or math scores in either the baseline year, the preceding year, or the following year; (b) missing demographic data; (c) classrooms for which we have fewer than ten valid students in a particular grade, or more than 40 students. See the discussion in the text for a more detailed explanation of the sample, including the percentages excluded for various reasons.


Each section of the exam consists of 30 to 60 multiple-choice questions, which students are given between 30 and 75 minutes to complete.17 Students mark their responses on answer sheets, which are scanned to determine a student's score. There is no penalty for guessing. A student's raw score is simply calculated as the sum of correct responses on the exam. The raw score is then translated into a grade equivalent.

After collecting student exams, teachers or administrators then "clean" the answer keys, erasing stray pencil marks, removing dirt or debris from the form, and darkening item responses that were only faintly marked by the student. At the end of the testing week, the test coordinators at each school deliver the completed answer keys and exams to the CPS central office. School personnel are not supposed to keep copies of the actual exams, although school officials acknowledge that a number of teachers each year do so. The CPS has administered three different versions of the ITBS between 1993 and 2000. The CPS alternates forms each year, with new forms being offered for the first time in 1993, 1994, and 1997.18

V. THE PREVALENCE OF TEACHER CHEATING

As noted earlier, our estimates of the prevalence of cheating are derived from a comparison of the actual number of classrooms that are above some threshold on both of our cheating indicators, relative to the number we would expect to observe based on the correlation between these two measures in the 50th-75th percentiles of the suspicious string measure. The top panel of Table II presents our estimates of the percentage of classrooms that are cheating, on average, on a given subject test (i.e., reading comprehension or one of the three math tests) in a given year.19

17. The mathematics and reading tests measure basic skills. The reading comprehension exam consists of three to eight short passages, followed by up to nine questions relating to the passage. The math exam consists of three sections that assess number concepts, problem-solving, and computation.

18. These three forms are used for retesting, summer school testing, and midyear testing as well, so that it is likely that over the years teachers have seen the same exam on a number of occasions.

19. It is important to note that we exclude a particular form of cheating that appears to be quite prevalent in the data: teachers randomly filling in answers left blank by students at the end of the exam. In many classrooms, almost every student will end the test with a long string of identical answers (typically all the same response, like "B" or "D"). The fact that almost all students in the class coordinate on the same pattern strongly suggests that the students themselves did not fill in the blanks, or were under explicit instructions by the teacher to do so. Since there is no penalty for guessing on the test, filling in the blanks can only increase student test scores. While this type of teacher behavior is likely to be viewed by many as unethical, we do not make it the focus of our analysis because (1) it is difficult to provide definitive evidence of such behavior (a teacher could argue that he or she instructed students well in advance of the test to fill in all blanks with the letter "C" as part of good test-taking strategy), and (2) in our minds, it is categorically different than a teacher who systematically changes student responses to the correct answer.

Because the decision as to what cutoff signifies a "high" value on our cheating indicators is arbitrary, we present a 3 x 3 matrix of estimates using three different thresholds (eightieth, ninetieth, and ninety-fifth percentiles) for each of our cheating measures. The estimated prevalence of cheaters ranges from 1.1 percent to 2.1 percent, depending on the particular set of cutoffs used. As would be expected, the number of cheaters is generally declining as higher thresholds are employed. Nonetheless, it is encouraging that over such a wide set of cutoffs the range of estimates is relatively tight.

The bottom panel of Table II presents estimates of the percentage of classrooms that are cheating on any of the four subject tests in a particular year. If every classroom that cheated did so only on one subject test, then the results in the bottom panel

TABLE II
ESTIMATED PREVALENCE OF TEACHER CHEATING

                                      Cutoff for test score fluctuations (SCORE)
Cutoff for suspicious answer
strings (ANSWERS)                   80th percentile  90th percentile  95th percentile

Percent cheating on a particular test
  80th percentile                        2.1              2.1              1.8
  90th percentile                        1.8              1.8              1.5
  95th percentile                        1.3              1.3              1.1

Percent cheating on at least one of the four tests given
  80th percentile                        4.5              5.6              5.3
  90th percentile                        4.2              4.9              4.4
  95th percentile                        3.5              3.8              3.4

The top panel of the table presents estimates of the percentage of classrooms cheating on a particular subject test in a given year, based on three alternative cutoffs for ANSWERS and SCORE. In all cases, the prevalence of cheating is based on the excess number of classrooms with unexpected test score fluctuations among classes with suspicious answer strings, relative to classes that do not have suspicious answer strings. The bottom panel of the table presents estimates of the percentage of classrooms cheating on at least one of the four subject tests that comprise the overall test. In the bottom panel, classrooms that cheat on more than one subject test are counted only once. Our sample includes over 35,000 3rd-7th grade classrooms in the Chicago public schools for the years 1993-1999.


would simply be four times the results in the top panel. In many instances, however, classrooms appear to cheat on multiple subjects. Thus, the prevalence rates range from 3.4-5.6 percent of all classrooms.20 Because of the necessarily indirect nature of our identification strategy, we explore a range of supplementary tests designed to assess the validity of the estimates.21

V.A. Simulation Results

Our prevalence estimates may be biased downward or upward, depending on the extent to which our algorithm fails to successfully identify all instances of cheating ("false negatives") versus the frequency with which we wrongly classify a teacher as cheating ("false positives").

One simple simulation exercise we undertook with respect to possible false positives involves randomly assigning students to hypothetical classrooms. These synthetic classrooms thus consist of students who in actuality had no connection to one another. We then analyze these hypothetical classrooms using the same algorithm applied to the actual data. As one would hope, no evidence of cheating was found in the simulated classes. Indeed, the estimated prevalence of cheating was slightly negative in this simulation (i.e., classrooms with large test score increases in the current year followed by big declines the next year were slightly less likely to have unusual patterns of answer strings). Thus, there does not appear to be anything mechanical in our algorithm that generates evidence of cheating.
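The placebo exercise amounts to shuffling classroom labels within a grade/subject/year cell and rerunning the full algorithm; a sketch (column names hypothetical):

```python
import numpy as np

def synthetic_classrooms(students, seed=0):
    """Randomly reassign students to hypothetical classrooms of the same sizes.

    students -- DataFrame for one grade/subject/year with a 'classroom' column.
    Rerunning the cheating algorithm on the shuffled data should find
    (essentially) no cheating if nothing mechanical drives the estimates.
    """
    rng = np.random.default_rng(seed)
    shuffled = students.copy()
    shuffled["classroom"] = rng.permutation(students["classroom"].to_numpy())
    return shuffled
```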

A second simulation involves artificially altering a classroom's answer strings in ways that we believe mimic teacher cheating or outstanding teaching.22 We are then able to see how frequently we label these altered classes as "cheating," and how this varies with the extent and nature of the changes we make to a classroom's answers. We simulate two different types of teacher cheating. The first is a very naive version, in which a teacher starts cheating at the same question for a number of students and changes consecutive questions to the right answers for these students, creating a block of identical and correct responses. The second type of cheating is much more sophisticated: we randomly change answers from incorrect to correct for selected students in a class. The outcome of this simulation reveals that our methods are surprisingly weak in detecting moderate cheating, even in the naive case. For instance, when three consecutive answers are changed to correct for 25 percent of a class, we catch such behavior only 4 percent of the time. Even when six questions are changed for half of the class, we catch the cheating in less than 60 percent of the cases. Only when the cheating is very extreme (e.g., six answers changed for the entire class) are we very likely to catch the cheating. The explanation for this poor performance is that classrooms with low test-score gains, even with the boost provided by this cheating, do not have large enough test score fluctuations to make them appear unusual, because there is so much inherent volatility in test scores. When the cheating takes the more sophisticated form described above, we are even less successful at catching low and moderate cheaters (2 percent and 34 percent, respectively, in the first two scenarios described above), but we almost always detect very extreme cheating.

20. Computation of the overall prevalence is somewhat complicated because it involves calculating not only how many classrooms are actually above the thresholds on multiple subject tests, but also how frequently this would occur in the absence of cheating. Details on these calculations are available from the authors.

21. In an earlier version of this paper [Jacob and Levitt 2003], we present a number of additional sensitivity analyses that confirm the basic results. First, we test whether our results are sensitive to the way in which we predict the prevalence of high values of ANSWERS and SCORE in the upper part of the other distribution (e.g., fitting a linear or quadratic model to the data in the lower portion of the distribution, versus simply using the average value from the 50th-75th percentiles). We find our results are extremely robust. Second, we examine whether our results might simply be due to mean reversion, by testing whether classrooms with suspicious answer strings are less likely to maintain large test score gains. Among a sample of classrooms with large test score gains, we find that mean reversion is substantially greater among the set of classes with highly suspicious answer strings. Third, we investigate whether we might be detecting student rather than teacher cheating, by examining whether students who have suspicious answer strings in one year are more likely to have suspicious answer strings in other years. We find that they do not.

From a political or equity perspective, however, we may be even more concerned with false positives, specifically the risk of accusing highly effective teachers of cheating. Hence we also simulate the impact of a good teacher by changing certain answers for certain students from incorrect to correct responses. Our manipulation in this case differs from the cheating simulations above in two ways: (1) we allow some of the gains to be preserved in the following year, and (2) we alter questions such that students will not get random answers correct, but rather will tend to show the greatest improvement on the easiest questions that the students were getting wrong. These simulation results suggest that only rarely are good teachers mistakenly labeled as cheaters. For instance, a teacher whose prowess allows him or her to raise test scores by six questions for half the students in the class (more than a 0.5 grade equivalent increase on average for the class) is labeled a cheater only 2 percent of the time. In contrast, a naive cheater with the same test score gains is correctly identified as a cheater in more than 50 percent of cases.

22. See Jacob and Levitt [2003] for the precise details of this simulation exercise.

In another test for false positives, we explored whether we might be mistaking emphasis on certain subject material for cheating. For example, if a math teacher spends several months on fractions with a particular class, one would expect the class to do particularly well on all of the math questions relating to fractions, and perhaps worse than average on other math questions. One might imagine a similar scenario in reading if, for example, a teacher creates a lengthy unit on the Underground Railroad, which later happens to appear in a passage on the reading comprehension exam. To examine this, we analyze the nature of the across-question correlations in cheating versus noncheating classrooms. We find that the most highly correlated questions in cheating classrooms are no more likely to measure the same skill (e.g., fractions, geometry) than in noncheating classrooms. The implication of these results is that the prevalence estimates presented above are likely to substantially understate the true extent of cheating: there is little evidence of false positives, but we frequently miss moderate cases of cheating.

V.B. The Correlation of Cheating Indicators across Subjects, Classrooms, and Years

If what we are detecting is truly cheating, then one would expect that a teacher who cheats on one part of the test would be more likely to cheat on other parts of the test. Also, a teacher who cheats one year would be more likely to cheat the following year. Finally, to the extent that cheating is either condoned by the principal or carried out by the test coordinator, one would expect to find multiple classes in a school cheating in any given year, and perhaps even that cheating in a school one year predicts cheating in future years. If what we are detecting is not cheating, then one would not necessarily expect to find strong correlation in our cheating indicators across exams for a specific classroom, across classrooms, or across years.


Table III reports regression results testing these predictions. The dependent variable is an indicator for whether we believe a classroom is likely to be cheating on a particular subject test, using our most stringent definition (above the ninety-fifth percentile on both cheating indicators). The baseline probability of

TABLE III
PATTERNS OF CHEATING WITHIN CLASSROOMS AND SCHOOLS

Dependent variable = Class suspected of cheating
(mean of the dependent variable = 0.011)

                                                 Full sample        Sample of classes and
                                                                    schools that existed
                                                                    in the prior year
Independent variables                           (1)       (2)        (3)       (4)

Classroom cheated on exactly one other
  subject this year                            0.105     0.103      0.101     0.101
                                              (0.008)   (0.008)    (0.009)   (0.009)
Classroom cheated on exactly two other
  subjects this year                           0.289     0.285      0.243     0.243
                                              (0.027)   (0.027)    (0.031)   (0.031)
Classroom cheated on all three other
  subjects this year                           0.627     0.622      0.595     0.595
                                              (0.051)   (0.051)    (0.054)   (0.054)
Cheating rate among all other classes in
  the school this year on this subject           --      0.166      0.134     0.129
                                                        (0.030)    (0.027)   (0.027)
Cheating rate among all other classes in
  the school this year on other subjects         --      0.023      0.059     0.045
                                                        (0.024)    (0.026)   (0.029)
Cheating in this classroom in this
  subject last year                              --        --       0.096     0.091
                                                                   (0.012)   (0.012)
Number of other subjects this classroom
  cheated on last year                           --        --       0.023     0.018
                                                                   (0.004)   (0.004)
Cheating in this classroom ever in the past      --        --         --      0.006
                                                                             (0.002)
Cheating rate among other classrooms in
  this school in past years                      --        --         --      0.090
                                                                             (0.040)
Fixed effects for grade/subject/year            Yes       Yes        Yes       Yes
R^2                                            0.090     0.093      0.109     0.109
Number of observations                        165,578   165,578    94,182    94,170

The dependent variable is an indicator for whether a classroom is above the 95th percentile on both our suspicious strings and unusual test score measures of cheating on a particular subject test. Estimation is done using a linear probability model. The unit of observation is classroom/grade/year/subject. For columns that include measures of cheating in prior years, observations where that classroom or school does not appear in the data in the prior year are excluded. Standard errors are clustered at the school level to take into account correlations across classrooms as well as serial correlation.


qualifying as a cheater for this cutoff is 1.1 percent. To fully appreciate the enormity of the effects implied by the table, it is important to keep this very low baseline in mind. We report estimates from linear probability models (probits yield similar marginal effects), with standard errors clustered at the school level.

Column 1 of Table III shows that cheating on other tests in the same year is an extremely powerful predictor of cheating in a different subject. If a classroom cheats on exactly one other subject test, the predicted probability of cheating on this test increases by over ten percentage points. Since the baseline cheating rates are only 1.1 percent, classrooms cheating on exactly one other test are ten times more likely to have cheated on this subject than are classrooms that did not cheat on any of the other subjects (which is the omitted category). Classrooms that cheat on two other subjects are almost 30 times more likely to cheat on this test, relative to those not cheating on other tests. If a class cheats on all three other subjects, it is 50 times more likely to also cheat on this test.

There also is evidence of correlation in cheating within schools. A ten-percentage-point increase in cheating classrooms in a school (excluding the classroom in question) on the same subject test raises the likelihood this class cheats by roughly 0.16 percentage points. This potentially suggests some role for centralized cheating by a school counselor, test coordinator, or the principal, rather than by teachers operating independently. There is little evidence that cheating rates within the school on other subject tests affect cheating on this test.

When making comparisons across years (columns 3 and 4), it is important to note that we do not actually have teacher identifiers. We do, however, know what classroom a student is assigned to. Thus, we can only compare the correlation between past and current cheating in a given classroom. To the extent that teacher turnover occurs or teachers switch classrooms, this proxy will be contaminated. Even given this important limitation, cheating in the classroom last year predicts cheating this year. In column 3, for example, we see that classrooms that cheated in the same subject last year are 9.6 percentage points more likely to cheat this year, even after we control for cheating on other subjects in the same year and cheating in other classes in the school. Column 4 shows that prior cheating in the school strongly predicts the likelihood that a classroom will cheat this year.
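A linear probability model with school-level clustering of this kind can be sketched with statsmodels; the column names below are placeholders, and 'cell' stands for the grade/subject/year combination absorbed by the fixed effects:

```python
import statsmodels.formula.api as smf

# df: one row per classroom/grade/year/subject, with hypothetical columns
model = smf.ols(
    "cheat ~ cheat_one_other + cheat_two_other + cheat_three_other"
    " + school_rate_same_subject + school_rate_other_subjects + C(cell)",
    data=df,
)
# Cluster standard errors at the school level, as in Table III
result = model.fit(cov_type="cluster", cov_kwds={"groups": df["school_id"]})
```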


V.C. Evidence from Retesting under Controlled Circumstances

Perhaps the most compelling evidence for our methods comes from the results of a unique policy intervention. In spring 2002 the Chicago public schools provided us the opportunity to retest over 100 classrooms under controlled circumstances that precluded cheating, a few weeks after the initial ITBS exam was administered.23 Based on suspicious answer strings and unusual test score gains on the initial spring 2002 exam, we identified a set of classrooms most likely to have cheated. In addition, we chose a second group of classrooms that had suspicious answer strings but did not have large test score gains, a pattern that is consistent with a bad teacher who attempts to mask a poor teaching performance by cheating. We also retested two sets of classrooms as control groups. The first control group consisted of classes with large test score gains but no evidence of cheating in their answer strings, a potential indication of effective teaching. These classes would be expected to maintain their gains when retested, subject to possible mean reversion. The second control group consisted of randomly selected classrooms.

The results of the retests are reported in Table IV. Columns 1-3 correspond to reading; columns 4-6 are for math. For each subject we report the test score gain (in standard score units) for three periods: (i) from spring 2001 to the initial spring 2002 test, (ii) from the initial spring 2002 test to the retest a few weeks later, and (iii) from spring 2001 to the spring 2002 retest. For purposes of comparison, we also report the average gain for all classrooms in the system for the first of those measures (the other two measures are unavailable unless a classroom was retested).

Classrooms prospectively identified as "most likely to have cheated" experienced gains on the initial spring 2002 test that were nearly twice as large as the typical CPS classroom (see columns 1 and 4). On the retest, however, those excess gains completely disappeared: the gains between spring 2001 and the spring 2002 retest were close to the systemwide average. Those classrooms prospectively identified as "bad teachers likely to have cheated" also experienced large score declines on the retest (-8.8 and -10.5 standard score points, or roughly two-thirds of a grade equivalent). Indeed, comparing the spring 2001 results with the

23. Full details of this policy experiment are reported in Jacob and Levitt [2003]. The results of the retests were used by CPS to initiate disciplinary action against a number of teachers, principals, and staff.


TABLE IV
RESULTS OF RETESTS: COMPARISON OF RESULTS FOR SPRING 2002 ITBS AND AUDIT TEST

                                     Reading gains between        Math gains between
                                    Spring   Spring   Spring    Spring   Spring   Spring
                                    2001     2001     2002      2001     2001     2002
Category of                         and      and      and       and      and      and
classroom                           spring   2002     2002      spring   2002     2002
                                    2002     retest   retest    2002     retest   retest

All classrooms in the Chicago
  public schools                     14.3      --       --       16.9      --       --
Classrooms prospectively
  identified as most likely
  cheaters (N = 36 on math,
  N = 39 on reading)                 28.8     12.6    -16.2      30.0     19.3    -10.7
Classrooms prospectively
  identified as bad teachers
  suspected of cheating (N = 16
  on math, N = 20 on reading)        16.6      7.8     -8.8      17.3      6.8    -10.5
Classrooms prospectively
  identified as good teachers
  who did not cheat (N = 17 on
  math, N = 17 on reading)           20.6     21.1      0.5      28.8     25.5     -3.3
Randomly selected classrooms
  (N = 24 overall, but only one
  test per classroom)                14.5     12.2     -2.3      14.5     12.2     -2.3

Results in this table compare test scores at three points in time (spring 2001, spring 2002, and a retest administered under controlled conditions a few weeks after the initial spring 2002 test). The unit of observation is a classroom-subject test pair. Only students taking all three tests are included in the calculations. Because of limited data, math and reading results for the randomly selected classrooms are combined. Only the first two columns are available for all CPS classrooms, since audits were performed only on a subset of classrooms. All entries in the table are in standard score units.


spring 2002 retest, children in these classrooms had gains only half as large as the typical yearly CPS gain. In stark contrast, classrooms prospectively identified as having "good teachers" (i.e., classrooms with large gains but not suspicious answer strings) actually scored even higher on the reading retest than they did on the initial test. The math scores fell slightly on retesting, but these classrooms continued to experience extremely large gains. The randomly selected classrooms also maintained almost all of their gains when retested, as would be expected. In summary, our methods proved to be extremely effective, both in identifying likely cheaters and in correctly classifying teachers who achieved large test score gains in a legitimate manner.

VI. DOES TEACHER CHEATING RESPOND TO INCENTIVES?

From the perspective of economics, perhaps the most interesting question related to teacher cheating is the degree to which it responds to incentives. As noted in the Introduction, there were two major changes in the incentives faced by teachers and students over our sample period. Prior to 1996, ITBS scores were primarily used to provide teachers and parents with a sense of how a child was progressing academically. Beginning in 1996, with the appointment of Paul Vallas as CEO of Schools, the CPS launched an initiative designed to hold students and teachers accountable for student learning.

The reform had two main elements. The first involved putting schools "on probation" if less than 15 percent of students scored at or above national norms on the ITBS reading exams. The CPS did not use math performance to determine probation status. Probation schools that do not exhibit sufficient improvement may be reconstituted, a procedure that involves closing the school and dismissing or reassigning all of the school staff.24 It is clear from our discussions with teachers and administrators that being on probation is viewed as an extremely undesirable circumstance. The second piece of the accountability reform was an end to social promotion, the practice of passing students to the next grade regardless of their academic skills or school performance. Under the new policy, students in third, sixth, and eighth grade must meet minimum standards on the ITBS in both reading and

24. For a more detailed analysis of the probation policy, see Jacob [2002] and Jacob and Lefgren [forthcoming].


mathematics in order to be promoted to the next grade. The promotion standards were implemented in spring 1997 for third and sixth graders. Promotion decisions are based solely on scores in reading comprehension and mathematics.

Table V presents OLS estimates of the relationship between teacher cheating and a variety of classroom and school characteristics.25 The unit of observation is a classroom/subject/grade/year. The dependent variable is an indicator of whether we suspect the classroom cheated, using our most stringent definition: a classroom is designated a cheater if both its test score fluctuations and the suspiciousness of its answer strings are above the ninety-fifth percentile for that grade, subject, and year.26

In column (1), the policy changes are restricted to have a constant impact across all classrooms. The introduction of the social promotion and probation policies is positively correlated with the likelihood of classroom cheating, although the point estimates are not statistically significant at conventional levels. Much more interesting results emerge when we interact the policy changes with the previous year's test scores for the classroom. For both probation and social promotion, cheating rates in the lowest performing classrooms prove to be quite responsive to the change in incentives. In column (2), we see that a classroom one standard deviation below the mean increased its likelihood of cheating by 0.43 percentage points in response to the school probation policy, and roughly 0.65 percentage points due to the ending of social promotion. Given a baseline rate of cheating of 1.1 percent, these effects are substantial. The magnitude of these changes is particularly large considering that no elementary school on probation was actually reconstituted during this period, and that the social promotion policy has a direct impact on students but no direct ramifications for teacher pay or rewards. A classroom one standard deviation above the mean does not see any significant change in cheating in response to these two policies. Such classes are very unlikely to be located in schools at risk for being put on probation, and also are likely to have few students at risk for being retained. These findings are robust to

25. Logit and probit models evaluated at the mean yield comparable results, so the estimates from a linear probability model are presented for ease of interpretation.

26. The results are not sensitive to the cheating cutoff used. Note that this measure may include error due to both false positives and negatives. Since the measurement error is in the dependent variable, the primary impact will likely be to simply decrease the precision of our estimates.


TABLE V
OLS ESTIMATES OF THE RELATIONSHIP BETWEEN CHEATING AND CLASSROOM CHARACTERISTICS

Dependent variable = Indicator of classroom cheating

Independent variables                          (1)        (2)        (3)        (4)

Social promotion policy                       0.0011     0.0011     0.0015     0.0023
                                             (0.0013)   (0.0013)   (0.0013)   (0.0009)
School probation policy                       0.0020     0.0019     0.0021     0.0029
                                             (0.0014)   (0.0014)   (0.0014)   (0.0013)
Prior classroom achievement                  -0.0047    -0.0028    -0.0016    -0.0028
                                             (0.0005)   (0.0005)   (0.0007)   (0.0007)
Social promotion x classroom achievement        --      -0.0049    -0.0051    -0.0046
                                                        (0.0014)   (0.0014)   (0.0012)
School probation x classroom achievement        --      -0.0070    -0.0070    -0.0064
                                                        (0.0013)   (0.0013)   (0.0013)
Mixed grade classroom                        -0.0084    -0.0085    -0.0089    -0.0089
                                             (0.0007)   (0.0007)   (0.0008)   (0.0012)
% of students included in official
  reporting                                   0.0252     0.0249     0.0141     0.0131
                                             (0.0031)   (0.0031)   (0.0037)   (0.0037)
Teacher administers exam to own students      0.0067     0.0067     0.0066     0.0061
                                             (0.0015)   (0.0015)   (0.0015)   (0.0011)
Test form offered for the first time (a)     -0.0007    -0.0007    -0.0011       --
                                             (0.0011)   (0.0011)   (0.0010)
Average quality of teachers' undergraduate
  institution                                   --         --      -0.0026       --
                                                                   (0.0007)
Percent of teachers who have worked at the
  school less than 3 years                      --         --      -0.0045       --
                                                                   (0.0031)
Percent of teachers under 30 years of age       --         --       0.0156       --
                                                                   (0.0065)
Percent of students in the school meeting
  national norms in reading last year           --         --      -0.0001       --
                                                                   (0.0000)
Percent free lunch in school                    --         --       0.0001       --
                                                                   (0.0000)
Predominantly Black school                      --         --       0.0068       --
                                                                   (0.0019)
Predominantly Hispanic school                   --         --      -0.0009       --
                                                                   (0.0016)
School/Year fixed effects                      No         No         No         Yes
Number of observations                       163,474    163,474    163,474    163,474

The unit of observation is classroom/grade/year/subject, and the sample includes eight years (1993 to 2000), four subjects (reading comprehension and three math sections), and five grades (three to seven). The dependent variable is the cheating indicator derived using the ninety-fifth percentile cutoff. Robust standard errors clustered by school/year are shown in parentheses. Other variables included in the regressions in columns (1) and (2) include a linear time trend, cubic terms for the number of students, a linear grade variable, and fixed effects for subjects. The regression shown in column (3) also includes the following variables: the percent of students in the classroom who were Black, Hispanic, male, receiving free lunch, old for grade, in a special education program, living in foster care, and living with a nonparental relative; indicators of school size, mobility rate, and attendance rate; and the percent of teachers in the school who had a master's or doctoral degree, lived in Chicago, and were education majors.

(a) Test forms vary only by year, so this variable will drop out of the analysis when school/year fixed effects are included.



Cheating also appears to be responsive to other costs and benefits. Classrooms that tested poorly last year are much more likely to cheat. For example, a classroom with average student prior achievement one classroom standard deviation below the mean is 23 percent more likely to cheat. Classrooms with students in multiple grades are 65 percent less likely to cheat than classrooms where all students are in the same grade. This is consistent with the fact that it is likely more difficult for teachers in such classrooms to cheat, since they must administer two different test forms to students, which will necessarily have different correct answers. Moreover, classes with a higher proportion of students who are included in the official test reporting are more likely to cheat: a 10 percentage point increase in the proportion of students in a class whose test scores "count" will increase the likelihood of cheating by roughly 20 percent.27

Teachers who administer the exam to their own students are 0.67 percentage points (approximately 50 percent) more likely to cheat.28

A few other notable results emerge. First, there is no statistically significant impact on cheating of reusing a test form that has been administered in a previous year. That finding is of interest because it suggests that teachers taking old exams and teaching the precise questions to students is not an important component of what we are detecting as cheating (although anecdotal evidence suggests this practice exists). Classrooms in schools with lower achievement, higher poverty rates, and more Black students are more likely to cheat. Classrooms in schools with teachers who graduated from more prestigious undergraduate institutions are less likely to cheat, whereas classrooms in schools with younger teachers are more likely to cheat.

27. Consistent with this finding, within cheating classrooms teachers are less likely to alter the answer sheets for students who are excluded from test reporting. Teachers also appear to cheat less frequently for boys and for older students.

28. In an earlier version of the paper [Jacob and Levitt 2002], we examine for which students teachers tend to cheat. We find that teachers are most likely to cheat for students in the second quartile of the national ability distribution (twenty-fifth to fiftieth percentile) in the previous year, consistent with the incentives under the probation policy.


VII. CONCLUSION

This paper develops an algorithm for determining the prevalence of teacher cheating on standardized tests and applies the results to data from the Chicago public schools. Our methods reveal thousands of instances of classroom cheating, representing 4–5 percent of the classrooms each year. Moreover, we find that teacher cheating appears quite responsive to relatively minor changes in incentives.

The obvious benefits of high-stakes tests as a means of providing incentives must therefore be weighed against the possible distortions that these measures induce. Explicit cheating is merely one form of distortion. Indeed, the kind of cheating we focus on is one that could be greatly reduced at a relatively low cost through the implementation of proper safeguards, such as is done by the Educational Testing Service on the SAT or GRE exams.29

Even if this particular form of cheating were eliminated, however, there are a virtually infinite number of dimensions along which strategic school personnel can act to game the current system. For instance, Figlio and Winicki [2001] and Rohlfs [2002] document that schools attempt to increase the caloric intake of students on test days, and disruptive students are especially likely to be suspended from school during official testing periods [Figlio 2002]. It may be prohibitively costly, or even impossible, to completely safeguard the system.

Finally, this paper fits into a small but growing body of research focused on identifying corrupt or illicit behavior on the part of economic actors (see Porter and Zona [1993], Fisman [2001], Di Tella and Schargrodsky [2001], and Duggan and Levitt [2002]). Because individuals engaged in such behavior actively attempt to hide their trails, the intellectual exercise associated with uncovering their misdeeds differs substantially from the typical economic application, in which the researcher starts with a well-defined measure of the outcome variable (e.g., earnings, economic growth, profits) and then attempts to uncover the determinants of these outcomes. In the case of corruption, there is typically no clear outcome variable, making it necessary for the

29. Although even tests administered by the Educational Testing Service are not immune to cheating, as evidenced by recent reports that many of the GRE scores from China are fraudulent [Contreras 2002].


researcher to employ nonstandard approaches in generating such a measure.

APPENDIX: THE CONSTRUCTION OF SUSPICIOUS STRING MEASURES

This appendix describes in greater detail how we construct each of our measures of unexpected or suspicious responses.

The first measure focuses on the most unlikely block of identical answers given on consecutive questions. Using past test scores, future test scores, and background characteristics, we predict the likelihood that each student will give each answer on each question. For each item a student has four choices (A, B, C, or D), only one of which is correct. We estimate a multinomial logit for each item on the exam in order to predict how students will respond to each question. We estimate the following model for each item, using information from other students in that year, grade, and subject:

(1)  $\Pr(Y_{isc} = j) = \dfrac{e^{\beta_j X_s}}{\sum_{j=1}^{J} e^{\beta_j X_s}}$,

where $Y_{isc}$ indicates the response of student s in class c on item i, the number of possible responses (J) is four, and $X_s$ is a vector that includes measures of prior and future student achievement in math and reading, as well as demographic variables (such as race, gender, and free lunch status) for student s. Thus, a student's predicted probability of choosing a particular response is identified by the likelihood of other students (in the same year, grade, and subject) with similar background characteristics choosing that response.

Notice that by including future as well as prior test scores in the model, we decrease the likelihood that students with unusually good teachers will be identified as cheaters, since these students will likely retain some of the knowledge learned in the base year and thus have higher future test scores. Also note that by estimating the probability of selecting each possible response, rather than simply estimating the probability of choosing the correct response, we take advantage of any additional information that is provided by particular response patterns in a classroom.

Using the estimates from this model, we calculate the predicted probability that each student would answer each item in the way that he or she in fact did:

(2)  $p_{isc} = \dfrac{e^{\beta_k X_s}}{\sum_{j=1}^{J} e^{\beta_j X_s}}$, where $k$ is the response actually chosen by student s on item i.
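To make equations (1) and (2) concrete, here is a minimal sketch of how the fitted probabilities might be computed for a single item. It is an illustration under our own assumptions: scikit-learn's logistic regression standing in for whatever estimation routine was actually used, and hypothetical array names (`X`, `responses`); the paper does not specify an implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_item_probabilities(X, responses):
    """Multinomial logit for one exam item, in the spirit of equation (1).

    X         : (n_students, k) covariates for all students in the same
                year-grade-subject cell (prior and future test scores,
                demographics)
    responses : (n_students,) answer actually chosen, coded 0-3 for A-D
    Returns the p_isc of equation (2): each student's fitted probability
    of giving the answer he or she actually gave.
    """
    # The default lbfgs solver fits a multinomial (softmax) model when
    # the outcome has more than two classes.
    logit = LogisticRegression(max_iter=1000).fit(X, responses)
    p_all = logit.predict_proba(X)                 # (n_students, 4)
    return p_all[np.arange(len(responses)), responses]
```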

This provides us with one measure per student per item. Taking the product over items within student, we calculate the probability that a student would have answered a string of consecutive questions from item m to item n as he or she did:

(3)  $p_{sc}^{mn} = \prod_{i=m}^{n} p_{isc}$.

We then take the product across all students in the classroom who had identical responses in the string. If we define z as a student, $S_{zc}^{mn}$ as the string of responses for student z from item m to item n, and $S_{sc}^{mn}$ as the string for student s, then we can express the product as

(4)  $\bar{p}_{sc}^{mn} = \prod_{z \in \{z:\, S_{zc}^{mn} = S_{sc}^{mn}\}} p_{zc}^{mn}$.

Note that if there are $n_s$ students in class c and each student has a unique set of responses to these particular items, then $\bar{p}_{sc}^{mn}$ collapses to $p_{sc}^{mn}$ for each student, and there will be $n_s$ distinct values within the class. At the other extreme, if all of the students in class c have identical responses, then there is only one distinct value of $\bar{p}_{sc}^{mn}$. We repeat this calculation for all possible consecutive strings of length three to seven, that is, for all $S^{mn}$ such that $3 \le n - m + 1 \le 7$. To create our first indicator of suspicious string patterns, we take the minimum of the predicted block probability for each classroom:

MEASURE 1: $M1_c = \min_s (\bar{p}_{sc}^{mn})$.

This measure captures the least likely block of identical answers given on consecutive questions in the classroom.
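A sketch of Measure 1 itself, consuming per-item probabilities such as those produced above. Again, this is our own illustrative NumPy implementation with hypothetical array names, not the authors' code; it works in log probabilities to avoid numerical underflow on long strings.

```python
import numpy as np

def measure_1(probs, answers, min_len=3, max_len=7):
    """Least likely block of identical consecutive answers in a class.

    probs   : (n_students, n_items) fitted probability that each student
              gives the answer he or she actually gave (p_isc)
    answers : (n_students, n_items) integer-coded chosen responses
    Returns M1 for the classroom: the minimum block probability over all
    consecutive strings of length min_len to max_len.
    """
    n_students, n_items = answers.shape
    log_p = np.log(probs)
    best = 0.0  # log probabilities are negative, so 0 exceeds them all

    for length in range(min_len, max_len + 1):
        for start in range(n_items - length + 1):
            block = answers[:, start:start + length]
            # log probability each student answers the block as observed
            student_logp = log_p[:, start:start + length].sum(axis=1)
            # pool students with identical responses on the block and take
            # the product (sum of logs) within each group, as in eq. (4)
            _, group = np.unique(block, axis=0, return_inverse=True)
            group_logp = np.bincount(group.ravel(), weights=student_logp)
            best = min(best, group_logp.min())

    return np.exp(best)
```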

The second measure of suspicious answer strings is intended to capture more general patterns of similarity in student responses. To construct this measure, we first calculate the residuals for each of the possible choices a student could have made for each item:

(5)  $e_{jisc} = 0 - \dfrac{e^{\beta_j X_s}}{\sum_{j=1}^{J} e^{\beta_j X_s}}$ if $j \ne k$; $\quad e_{jisc} = 1 - \dfrac{e^{\beta_j X_s}}{\sum_{j=1}^{J} e^{\beta_j X_s}}$ if $j = k$,

where $e_{jisc}$ is the residual for response j on item i by student s in classroom c. We thus have four separate residuals per student per item.

To create a classroom-level measure of the response to item i, we need to combine the information for each student. First, we sum the residuals for each response across students within a classroom:

(6)  $e_{jic} = \sum_s e_{jisc}$.

If there is no within-class correlation in the way that students responded to a particular item, this term should be approximately zero. Second, we sum across the four possible responses for each item within classrooms. At the same time, we square each of the component residual measures to accentuate outliers, and divide by the number of students in the class ($n_{sc}$) to normalize by class size:

(7)  $v_{ic} = \dfrac{\sum_j e_{jic}^2}{n_{sc}}$.

The statistic $v_{ic}$ captures something like the variance of student responses on item i within classroom c. Notice that we choose to first sum the residuals of each response across students, and then sum the classroom-level measures for each response, rather than summing across responses within student initially. We do this in order to emphasize the classroom-level tendencies in response patterns.

Our second measure of suspicious strings is simply the classroom average (across items) of this variance term across all test items:

MEASURE 2: $M2_c = \bar{v}_c = \sum_i v_{ic} / n_i$, where $n_i$ is the number of items on the exam.

Our third measure is simply the variance (as opposed to the mean) in the degree of correlation across questions within a classroom:

MEASURE 3: $M3_c = \sigma_{v_c}^2 = \sum_i (v_{ic} - \bar{v}_c)^2 / n_i$.
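Measures 2 and 3 can be sketched the same way, from the full array of residuals defined in equation (5). Array names and shapes are again our own assumptions, not the paper's.

```python
import numpy as np

def measures_2_and_3(resid):
    """Classroom mean and variance of the item-level statistic v_ic.

    resid : (n_students, n_items, 4) residuals e_jisc: the 0/1 indicator
            that student s chose response j on item i, minus the fitted
            probability of that response.
    Returns (M2, M3) for the classroom.
    """
    n_students = resid.shape[0]
    e_jic = resid.sum(axis=0)                      # eq. (6): (n_items, 4)
    v_ic = (e_jic ** 2).sum(axis=1) / n_students   # eq. (7): (n_items,)
    return v_ic.mean(), v_ic.var()                 # Measures 2 and 3
```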

Our final indicator focuses on the extent to which a student's response pattern differs from those of other students with the same aggregate score that year. Let $q_{isc}$ equal one if student s in classroom c answered item i correctly, and zero otherwise. Let $A_s$ equal the aggregate score of student s on the exam. We then determine what fraction of students at each aggregate score level answered each item correctly. If we let $n_{sA}$ equal the number of students with an aggregate score of A, then this fraction $q_i^A$ can be expressed as

(8)  $q_i^A = \dfrac{\sum_{z \in \{z:\, A_z = A_s\}} q_{izc}}{n_{sA}}$.

We then calculate a measure of how much the response pattern of student s differed from the response pattern of other students with the same aggregate score. We do so by subtracting a student's answer on item i from the mean response of all students with aggregate score A, squaring these deviations, and then summing across all items on the exam:

(9)  $Z_{sc} = \sum_i (q_{isc} - q_i^A)^2$.

We then subtract out the mean deviation for all students with the same aggregate score, $\bar{Z}^A$, and sum across the students within each classroom to obtain our final indicator:

MEASURE 4: $M4_c = \sum_s (Z_{sc} - \bar{Z}^A)$.
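Finally, a sketch of Measure 4 under the same caveats (hypothetical names, plain NumPy). It computes $q_i^A$ within each aggregate-score group, the squared deviations $Z_{sc}$, and the classroom sums of $Z_{sc} - \bar{Z}^A$.

```python
import numpy as np

def measure_4(correct, agg_scores, classroom_ids):
    """M4 for every classroom, following equations (8)-(9).

    correct       : (n_students, n_items) 0/1 array of q_isc
    agg_scores    : (n_students,) aggregate score A_s
    classroom_ids : (n_students,) classroom identifiers
    Returns {classroom id: M4}.
    """
    n_students = correct.shape[0]
    z = np.empty(n_students)        # Z_sc for each student
    z_bar = np.empty(n_students)    # mean deviation Z^A at that score
    for a in np.unique(agg_scores):
        mask = agg_scores == a
        q_a = correct[mask].mean(axis=0)                # q_i^A, eq. (8)
        dev = ((correct[mask] - q_a) ** 2).sum(axis=1)  # Z_sc, eq. (9)
        z[mask] = dev
        z_bar[mask] = dev.mean()
    return {c: (z - z_bar)[classroom_ids == c].sum()
            for c in np.unique(classroom_ids)}
```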

KENNEDY SCHOOL OF GOVERNMENT, HARVARD UNIVERSITY

UNIVERSITY OF CHICAGO AND AMERICAN BAR FOUNDATION

REFERENCES

Aiken, Linda R., "Detecting, Understanding, and Controlling for Cheating on Tests," Research in Higher Education, XXXII (1991), 725–736.

Angoff, William H., "The Development of Statistical Indices for Detecting Cheaters," Journal of the American Statistical Association, LXIX (1974), 44–49.

Baker, George, "Incentive Contracts and Performance Measurement," Journal of Political Economy, C (1992), 598–614.

Bellezza, Francis S., and Suzanne F. Bellezza, "Detection of Cheating on Multiple-Choice Tests by Using Error Similarity Analysis," Teaching of Psychology, XVI (1989), 151–155.

Carnoy, Martin, and Susanna Loeb, "Does External Accountability Affect Student Outcomes? A Cross-State Analysis," Working Paper, Stanford University School of Education, 2002.

Contreras, Russell, "Inquiry Uncovers Possible Cheating on GRE in Asia," Associated Press, August 7, 2002.

Cullen, Julie Berry, and Randall Reback, "Tinkering Toward Accolades: School Gaming under a Performance Accountability System," Working Paper, University of Michigan, 2002.

Deere, Donald, and Wayne Strayer, "Putting Schools to the Test: School Accountability, Incentives and Behavior," Working Paper, Texas A&M University, 2002.

Di Tella, Rafael, and Ernesto Schargrodsky, "The Role of Wages and Auditing during a Crackdown on Corruption in the City of Buenos Aires," unpublished manuscript, Harvard Business School, 2001.

Duggan, Mark, and Steven Levitt, "Winning Isn't Everything: Corruption in Sumo Wrestling," American Economic Review, XCII (2002), 1594–1605.

Figlio, David N., "Testing, Crime and Punishment," Working Paper, University of Florida, 2002.

Figlio, David N., and Joshua Winicki, "Food for Thought: The Effects of School Accountability Plans on School Nutrition," Working Paper, University of Florida, 2001.

Figlio, David N., and Lawrence S. Getzler, "Accountability, Ability and Disability: Gaming the System," Working Paper, University of Florida, 2002.

Fisman, Ray, "Estimating the Value of Political Connections," American Economic Review, XCI (2001), 1095–1102.

Frary, Robert B., "Statistical Detection of Multiple-Choice Answer Copying: Review and Commentary," Applied Measurement in Education, VI (1993), 153–165.

Frary, Robert B., T. Nicholas Tideman, and Thomas Watts, "Indices of Cheating on Multiple-Choice Tests," Journal of Educational Statistics, II (1977), 235–256.

Glaeser, Edward, and Andrei Shleifer, "A Reason for Quantity Regulation," American Economic Review, XCI (2001), 431–435.

Grissmer, David W., Ann Flanagan, et al., "Improving Student Achievement: What NAEP Test Scores Tell Us," RAND Corporation, MR-924-EDU, 2000.

Hanushek, Eric A., and Margaret E. Raymond, "Improving Educational Quality: How Best to Evaluate Our Schools?" Conference paper, Federal Reserve Bank of Boston, 2002.

Hofkins, Diane, "Cheating 'Rife' in National Tests," New York Times Educational Supplement, June 16, 1995.

Holland, Paul W., "Assessing Unusual Agreement between the Incorrect Answers of Two Examinees Using the K-Index: Statistical Theory and Empirical Support," Technical Report No. 96-5, Educational Testing Service, 1996.

Holmstrom, Bengt, and Paul Milgrom, "Multitask Principal-Agent Analyses: Incentive Contracts, Asset Ownership, and Job Design," Journal of Law, Economics, and Organization, VII (1991), 24–51.

Jacob, Brian A., "Accountability, Incentives and Behavior: The Impact of High-Stakes Testing in the Chicago Public Schools," Working Paper No. 8968, National Bureau of Economic Research, 2002.

Jacob, Brian A., and Lars Lefgren, "The Impact of Teacher Training on Student Achievement: Quasi-Experimental Evidence from School Reform Efforts in Chicago," Journal of Human Resources, forthcoming.

Jacob, Brian A., and Steven Levitt, "Catching Cheating Teachers: The Results of an Unusual Experiment in Implementing Theory," Working Paper No. 9413, National Bureau of Economic Research, 2003.

Klein, Stephen P., Laura S. Hamilton, et al., "What Do Test Scores in Texas Tell Us?" RAND Corporation, 2000.

Kolker, Claudia, "Texas Offers Hard Lessons on School Accountability," Los Angeles Times, April 14, 1999.

Lindsay, Drew, "Whodunit? Officials Find Thousands of Erasures on Standardized Tests and Suspect Tampering," Education Week (1996), 25–29.

Loughran, Regina, and Thomas Comiskey, "Cheating the Children: Educator Misconduct on Standardized Tests," Report of the City of New York Special Commissioner of Investigation for the New York City School District, 1999.

Marcus, John, "Faking the Grade," Boston Magazine, February 2000.

May, Meredith, "State Fears Cheating by Teachers," San Francisco Chronicle, October 4, 2000.

Perlman, Carol L., "Results of a Citywide Testing Program Audit in Chicago," Paper for American Educational Research Association annual meeting, Chicago, IL, 1985.

Porter, Robert H., and J. Douglas Zona, "Detection of Bid Rigging in Procurement Auctions," Journal of Political Economy, CI (1993), 518–538.

Richards, Craig E., and Tian Ming Sheu, "The South Carolina School Incentive Reward Program: A Policy Analysis," Economics of Education Review, XI (1992), 71–86.

Rohlfs, Chris, "Estimating Classroom Learning Using the School Breakfast Program as an Instrument for Attendance and Class Size: Evidence from Chicago Public Schools," unpublished manuscript, University of Chicago, 2002.

Tysome, Tony, "Cheating Purge: Inspectors Out," New York Times Higher Education Supplement, August 19, 1994.

Wollack, James A., "A Nominal Response Model Approach for Detecting Answer Copying," Applied Psychological Measurement, XX (1997), 144–152.



classrooms in that same subject, grade, and year,7 and computing the following statistic:

(3)  $SCORE_{cbt} = (rank\_gain_{cbt})^2 + (1 - rank\_gain_{cb,t+1})^2$,

where $rank\_gain_{cbt}$ is the percentile rank for class c in subject b in year t. Classes with relatively big gains on this year's test and relatively small gains on next year's test will have high values of SCORE. Squaring the individual terms gives relatively more weight to big test score gains this year and big test score declines the following year.8 In the empirical analysis we consider three possible cutoffs for what it means to have a "high" value on SCORE, corresponding to the eightieth, ninetieth, and ninety-fifth percentiles among all classrooms in the sample.
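As an illustration, the statistic can be computed with a few lines of pandas. This is a minimal sketch assuming a classroom-level DataFrame with hypothetical column names (`gain_t` for this year's average gain and `gain_t1` for next year's gain for the same class composition), not the authors' code.

```python
import pandas as pd

def score_statistic(df):
    """SCORE_cbt = rank_gain_cbt^2 + (1 - rank_gain_cb,t+1)^2.

    df : one row per classroom-subject-grade-year, with columns
         'subject', 'grade', 'year', 'gain_t', 'gain_t1'.
    Percentile ranks are computed within subject-grade-year cells.
    """
    cells = df.groupby(["subject", "grade", "year"])
    rank_t = cells["gain_t"].rank(pct=True)    # this year's gain rank
    rank_t1 = cells["gain_t1"].rank(pct=True)  # next year's gain rank
    return rank_t ** 2 + (1.0 - rank_t1) ** 2
```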

II.B. Indicator Two: Suspicious Answer Strings

The quickest and easiest way for a teacher to cheat is to alter the same block of consecutive questions for a substantial portion of students in the class, as was apparently done in the classroom in the top panel of Figure I. More sophisticated interventions might involve skipping some questions so as to avoid a large block of identical answers, or altering different blocks of questions for different students.

We combine four different measures of how suspicious a classroom's answer strings are in determining whether a classroom may be cheating. The first measure focuses on the most unlikely block of identical answers given by students on consecutive questions. Using past test scores, future test scores, and background characteristics, we predict the likelihood that each student will give each possible answer (A, B, C, or D) on every question using a multinomial logit. Each student's predicted probability of choosing a particular response is identified by the likelihood that other students (in the same year, grade, and subject) with similar background characteristics will choose that

7. We also experimented with more complicated mechanisms for defining large or small test score gains (e.g., predicting each student's expected test score gain as a function of past test scores and background characteristics, and computing a deviation measure for each student, which was then aggregated to the classroom level), but because the results were similar, we elected to use the simpler method. We have also defined gains and losses using an absolute metric (e.g., where gains in excess of 1.5 or 2 grade equivalents are considered unusually large) and obtain comparable results.

8. In the following year, the students who were in a particular classroom are typically scattered across multiple classrooms. We base all calculations off of the composition of this year's class.


response. We then search over all combinations of students and consecutive questions to find the block of identical answers given by students in a classroom least likely to have arisen by chance.9

The more unusual is the most unusual block of test responses (adjusting for class size and the number of questions on the exam, both of which increase the possible combinations over which we search), the more likely it is that cheating occurred. Thus, if ten very bright students in a class of thirty give the correct answers to the first five questions on the exam (typically the easier questions), the block of identical answers will not appear unusual. In contrast, if all fifteen students in a low-achieving classroom give the same correct answers to the last five questions on the exam (typically the harder questions), this would appear quite suspect.

The second measure of suspicious answer strings involves the overall degree of correlation in student answers across the test. When a teacher changes answers on test forms, it presumably increases the uniformity of student test forms across students in the class. This measure is meant to capture more general patterns of similarity in student responses, beyond just identical blocks of answers. Based on the results of the multinomial logit described above, for each question and each student we create a measure of how unexpected the student's response was. We then combine the information for each student in the classroom to create something akin to the within-classroom correlation in student responses. This measure will be high if students in a classroom tend to give the same answers on many questions, especially if the answers given are unexpected (i.e., correct answers on hard questions or systematic mistakes on easy questions).

Of course, within-classroom correlation may arise for many reasons other than cheating (e.g., the teacher may emphasize certain topics during the school year). Therefore, a third indicator of potential cheating is a high variance in the degree of correlation across questions. If the teacher changes answers for multiple students on selected questions, the within-class correlation on

9. Note that we do not require the answers to be correct. Indeed, in many classrooms the most unusual strings include some incorrect answers. Note also that these calculations are done under the assumption that a given student's answers are uncorrelated (conditional on observables) across questions on the exam, and that answers are uncorrelated across students. Of course, this assumption is unlikely to be true. Since all of our comparisons rely on the relative unusualness of the answers given in different classrooms, this simplifying assumption is not problematic unless the correlation within and across students varies by classroom.


those particular questions will be extremely high, while the degree of within-class correlation on other questions is likely to be typical. This leads the cross-question variance in correlations to be larger than normal in cheating classrooms.

Our final indicator compares the answers that students in one classroom give with the answers of other students in the system who take the identical test and get the exact same score. This measure relies on the fact that questions vary significantly in difficulty. The typical student will answer most of the easy questions correctly, but get many of the hard questions wrong (where "easy" and "hard" are based on how well students of similar ability do on the question). If students in a class systematically miss the easy questions while correctly answering the hard questions, this may be an indication of cheating.

Our overall measure of suspicious answer strings is constructed in a manner parallel to our measure of unusual test score fluctuations. Within a given subject, grade, and year, we rank classrooms on each of these four indicators, and then take the sum of squared ranks across the four measures:10

(4)  $ANSWERS_{cbt} = (rank\_m1_{cbt})^2 + (rank\_m2_{cbt})^2 + (rank\_m3_{cbt})^2 + (rank\_m4_{cbt})^2$.

In the empirical work we again use three possible cutoffs for potential cheating: the eightieth, ninetieth, and ninety-fifth percentiles.
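The combination step mirrors the SCORE construction. A minimal pandas sketch follows, assuming hypothetical columns `m1` through `m4` that are already oriented so that larger values are more suspicious (for Measure 1, e.g., the negative of the block probability); it is an illustration, not the paper's code.

```python
def answers_statistic(df):
    """ANSWERS_cbt of equation (4): sum of squared percentile ranks of
    the four suspicious-string measures within subject-grade-year cells."""
    cells = df.groupby(["subject", "grade", "year"])
    return sum(cells[m].rank(pct=True) ** 2 for m in ["m1", "m2", "m3", "m4"])
```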

III. A STRATEGY FOR IDENTIFYING THE PREVALENCE OF CHEATING

The previous section described indicators that are likely to be correlated with cheating. Because such patterns sometimes arise by chance, however, not every classroom with large test score fluctuations and suspicious answer strings is cheating. Furthermore, the particular choice of what qualifies as a "large" fluctuation or a "suspicious" set of answer strings will necessarily be arbitrary. In this section we present an identification strategy that, under a set of defensible assumptions, nonetheless provides estimates of the prevalence of cheating.

To identify the number of cheating classrooms in a given

10. Because different subjects and grades have differing numbers of questions, it is difficult to make meaningful comparisons across tests on the raw indicators.


year, we would like to compare the observed joint distribution of test score fluctuations and suspicious answer strings with a counterfactual distribution in which no cheating occurs. Differences between the two distributions would provide an estimate of how much cheating occurred. If teachers in a particular school or year are cheating, there will be more classrooms exhibiting both unusual test score fluctuations and suspicious answer strings than otherwise expected. In practice, we do not have the luxury of observing such a counterfactual. Instead, we must make assumptions about what the patterns would look like absent cheating.

Our identification strategy hinges on three key assumptions: (1) cheating increases the likelihood a class will have both large test score fluctuations and suspicious answer strings; (2) if cheating classrooms had not cheated, their distribution of test score fluctuations and answer string patterns would be identical to noncheating classrooms; and (3) in noncheating classrooms, the correlation between test score fluctuations and suspicious answers is constant throughout the distribution.11

If assumption (1) holds, then cheating classrooms will be concentrated in the upper tail of the joint distribution of unusual test score fluctuations and suspicious answer strings. Other parts of the distribution (e.g., classrooms ranked in the fiftieth to seventy-fifth percentile of suspicious answer strings) will consequently include few cheaters. As long as cheating classrooms would look similar to noncheating classrooms on our measures if they had not cheated (assumption (2)), classes in the part of the distribution with few cheaters provide the noncheating counterfactual that we are seeking.

The difficulty, however, is that we only observe this noncheating counterfactual in the bottom and middle parts of the distribution, when what we really care about is the upper tail of the distribution, where the cheaters are concentrated. Assumption (3), which requires that in noncheating classrooms the correlation between test score fluctuations and suspicious answers is constant throughout the distribution, provides a potential solution. If this assumption holds, we can use the part of the distribution that is relatively free of cheaters to project what the right tail of the distribution would look like absent cheating. The gap between the predicted and observed frequency of classrooms that

11. We formally derive the mathematical model described in this section in Jacob and Levitt [2003].


are extreme on both the test score fluctuation and suspicious answer string measures provides our estimate of cheating.

Figure II demonstrates how our identification strategy works empirically. The horizontal axis in the figure ranks classrooms according to how suspicious their answer strings are. The vertical axis is the fraction of the classrooms that are above the ninety-fifth percentile on the unusual test score fluctuations measure.12

The graph combines all classrooms and all subjects in our data.13

Over most of the range (roughly from zero to the seventy-fifth percentile on the horizontal axis), there is a slight positive correlation between unusual test scores and suspicious answer strings.

12. The choice of the ninety-fifth percentile is somewhat arbitrary. In the empirical work that follows, we also consider the eightieth and ninetieth percentiles. The choice of the cutoff does not affect the basic patterns observed in the data.

13. To construct the figure, classes were rank ordered according to their answer strings and divided into 200 equally sized segments. The circles in the figure represent these 200 local means. The line displayed in the graph is the fitted value of a regression with a seventh-order polynomial in a classroom's rank on the suspicious strings measure.

FIGURE II
The Relationship between Unusual Test Scores and Suspicious Answer Strings

The horizontal axis reflects a classroom's percentile rank in the distribution of suspicious answer strings within a given grade, subject, and year, with zero representing the least suspicious classroom and one representing the most suspicious classroom. The vertical axis is the probability that a classroom will be above the ninety-fifth percentile on our measure of unusual test score fluctuations. The circles in the figure represent averages from 200 equally spaced cells along the x-axis. The predicted line is based on a probit model estimated with seventh-order polynomials in the suspicious string measure.


This is the part of the distribution that is unlikely to contain many cheating classrooms. Under the assumption that the correlation between our two measures is constant in noncheating classrooms over the whole distribution, we would predict that the straight line observed for all but the right-hand portion of the graph would continue all the way to the right edge of the graph.

Instead, as one approaches the extreme right tail of the distribution of suspicious answer strings, the probability of large test score fluctuations rises dramatically. That sharp rise, we argue, is a consequence of cheating. To estimate the prevalence of cheating, we simply compare the actual area under the curve in the far right tail of Figure II to the predicted area under the curve in the right tail, projecting the linear relationship observed from the fiftieth to seventy-fifth percentile out to the ninety-ninth percentile. Because this identification strategy is necessarily indirect, we devote a great deal of attention to presenting a wide variety of tests of the validity of our approach, the sensitivity of the results to alternative assumptions, and the plausibility of our findings.
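To fix ideas, here is a deliberately simplified version of that excess-mass calculation: it fits a straight line to the middle of the distribution and projects it into the tail. The published figure uses a seventh-order polynomial probit fit, so this linear sketch, with hypothetical variable names, only illustrates the logic.

```python
import numpy as np

def estimate_prevalence(answers_pct, high_score, fit_lo=0.50, fit_hi=0.75,
                        tail=0.95):
    """Excess fraction of classrooms in the suspicious-string tail.

    answers_pct : (n,) percentile rank on the ANSWERS measure
    high_score  : (n,) 0/1 indicator of being above the SCORE cutoff
    """
    # fit the noncheating relationship on the 'clean' middle range
    mid = (answers_pct >= fit_lo) & (answers_pct <= fit_hi)
    slope, intercept = np.polyfit(answers_pct[mid], high_score[mid], 1)
    # compare observed and projected rates in the extreme right tail
    in_tail = answers_pct > tail
    predicted = (intercept + slope * answers_pct[in_tail]).mean()
    observed = high_score[in_tail].mean()
    # scale the excess by the share of classrooms in the tail
    return (observed - predicted) * in_tail.mean()
```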

IV. DATA AND INSTITUTIONAL BACKGROUND

Elementary students in Chicago public schools take a standardized multiple-choice achievement exam known as the Iowa Test of Basic Skills (ITBS). The ITBS is a national norm-referenced exam with a reading comprehension section and three separate math sections.14 Third through eighth grade students in Chicago are required to take the exams each year. Most schools administer the exams to first and second grade students as well, although this is not a district mandate.

Our base sample includes all students in third to seventh grade for the years 1993–2000.15 For each student we have the question-by-question answer string on each year's tests, school and classroom identifiers, the full history of prior and future test scores, and demographic variables, including age, sex, race, and free lunch eligibility. We also have information about a wide

14. There are also other parts of the test which are either not included in official school reporting (spelling, punctuation, grammar) or are given only in select grades (science and social studies), for which we do not have information.

15. We also have test scores for eighth graders, but we exclude them because our algorithm requires test score data for the following year, and the ITBS test is not administered to ninth graders.


range of school-level characteristics. We do not, however, have individual teacher identifiers, so we are unable to directly link teachers to classrooms or to track a particular teacher over time.

Because our cheating proxies rely on comparisons to past and future test scores, we drop observations that are missing reading or math scores in either the preceding year or the following year (in addition to those with missing test scores in the baseline year).16 Less than one-half of 1 percent of students with missing demographic data are also excluded from the analysis. Finally, because our algorithms for identifying cheating rely on identifying suspicious patterns within a classroom, our methods have little power in classrooms with small numbers of students. Consequently, we drop all classrooms for which we have fewer than ten valid students in a particular grade after our other exclusions (roughly 3 percent of students). A handful of classrooms recorded as having more than 40 students (presumably multiple classrooms not separately identified in the data) are also dropped. Our final data set contains roughly 20,000 students per grade per year, distributed across approximately 1,000 classrooms, for a total of over 40,000 classroom-years of data (with four subject tests per classroom-year) and over 700,000 student-year observations.

Summary statistics for the full sample of classrooms are shown in Table I, where the unit of observation is at the level of class-subject-year. As in many urban school districts, students in Chicago are disproportionately poor and minority. Roughly 60 percent of students are Black and 26 percent are Hispanic, with nearly 85 percent receiving free or reduced-price lunch. Less than 29 percent of students scored at or above national norms in reading.

The ITBS exams are administered over a weeklong period in early May. Third grade teachers are permitted to administer the exam to their own students, while other teachers switch classes to administer the exams. The exams are generally delivered to the schools one to two weeks before testing and are supposed to be

16. The exact number of students with missing test data varies by year and grade. Overall, we drop roughly 12 percent of students because of missing test score information in the baseline year. These are students who (i) were not tested because of placement in bilingual or special education programs, or (ii) simply missed the exam that particular year. In addition to these students, we also drop approximately 13 percent of students who were missing test scores in either the preceding or following years. Test data may be missing either because a student did not attend school on the days of the test, or because the student transferred into the CPS system in the current year or left the system prior to the next year of testing.


kept in a secure location by the principal or the school's test coordinator, an individual in the school designated to coordinate the testing process (often a counselor or administrator). Each

TABLE I
SUMMARY STATISTICS

                                                                   Mean (sd)
Classroom characteristics:
  Mixed grade classroom                                              0.073
  Teacher administers exams to her own students (3rd grade)          0.206
  Percent of students who were tested and included in
    official reporting                                               0.883
  Average prior achievement, as deviation from
    year-grade-subject mean                                         -0.004 (0.661)
  % Black                                                            0.595
  % Hispanic                                                         0.263
  % Male                                                             0.495
  % Old for grade                                                    0.086
  % Living in foster care                                            0.044
  % Living with nonparental relative                                 0.104
  Cheater, 95th percentile cutoff                                    0.013
School characteristics:
  Average quality of teachers' undergraduate institution
    in the school                                                   -2.550 (0.877)
  Percent of teachers who live in Chicago                            0.712
  Percent of teachers who have an MA or a PhD                        0.475
  Percent of teachers who majored in education                       0.712
  Percent of teachers under 30 years of age                          0.114
  Percent of teachers at the school less than 3 years                0.547
  % students at national norms in reading last year                 28.8
  % students receiving free lunch in school                         84.7
  Predominantly Black school                                         0.522
  Predominantly Hispanic school                                      0.205
  Mobility rate in school                                           28.6
  Attendance rate in school                                         92.6
  School size                                                      722 (317)
Accountability policy:
  Social promotion policy                                            0.215
  School probation policy                                            0.127
  Test form offered for the first time                               0.371
Number of observations                                             163,474

The data summarized above include one observation per classroom-year-subject. The sample includes all students in third to seventh grade for the years 1993–2000. We drop observations for the following reasons: (a) missing reading or math scores in either the baseline year, the preceding year, or the following year; (b) missing demographic data; (c) classrooms for which we have fewer than ten valid students in a particular grade, or more than 40 students. See the discussion in the text for a more detailed explanation of the sample, including the percentages excluded for various reasons.


section of the exam consists of 30 to 60 multiple choice questions, which students are given between 30 and 75 minutes to complete.17 Students mark their responses on answer sheets, which are scanned to determine a student's score. There is no penalty for guessing. A student's raw score is simply calculated as the sum of correct responses on the exam. The raw score is then translated into a grade equivalent.

After collecting student exams, teachers or administrators then "clean" the answer keys, erasing stray pencil marks, removing dirt or debris from the forms, and darkening item responses that were only faintly marked by the student. At the end of the testing week, the test coordinators at each school deliver the completed answer keys and exams to the CPS central office. School personnel are not supposed to keep copies of the actual exams, although school officials acknowledge that a number of teachers each year do so. The CPS has administered three different versions of the ITBS between 1993 and 2000. The CPS alternates forms each year, with new forms being offered for the first time in 1993, 1994, and 1997.18

V. THE PREVALENCE OF TEACHER CHEATING

As noted earlier, our estimates of the prevalence of cheating are derived from a comparison of the actual number of classrooms that are above some threshold on both of our cheating indicators, relative to the number we would expect to observe based on the correlation between these two measures in the 50th–75th percentiles of the suspicious string measure. The top panel of Table II presents our estimates of the percentage of classrooms that are cheating, on average, on a given subject test (i.e., reading comprehension or one of the three math tests) in a given year.19 Because

17. The mathematics and reading tests measure basic skills. The reading comprehension exam consists of three to eight short passages, followed by up to nine questions relating to the passage. The math exam consists of three sections that assess number concepts, problem-solving, and computation.

18. These three forms are used for retesting, summer school testing, and midyear testing as well, so it is likely that over the years teachers have seen the same exam on a number of occasions.

19. It is important to note that we exclude a particular form of cheating that appears to be quite prevalent in the data: teachers randomly filling in answers left blank by students at the end of the exam. In many classrooms almost every student will end the test with a long string of identical answers (typically all the same response, like "B" or "D"). The fact that almost all students in the class coordinate on the same pattern strongly suggests that the students themselves did not fill in the blanks, or were under explicit instructions by the teacher to do


the decision as to what cutoff signifies a "high" value on our cheating indicators is arbitrary, we present a 3 x 3 matrix of estimates using three different thresholds (eightieth, ninetieth, and ninety-fifth percentiles) for each of our cheating measures. The estimated prevalence of cheaters ranges from 1.1 percent to 2.1 percent, depending on the particular set of cutoffs used. As would be expected, the number of cheaters is generally declining as higher thresholds are employed. Nonetheless, it is encouraging that over such a wide set of cutoffs the range of estimates is relatively tight.

The bottom panel of Table II presents estimates of the percentage of classrooms that are cheating on any of the four subject tests in a particular year. If every classroom that cheated did so only on one subject test, then the results in the bottom panel

so. Since there is no penalty for guessing on the test, filling in the blanks can only increase student test scores. While this type of teacher behavior is likely to be viewed by many as unethical, we do not make it the focus of our analysis because (1) it is difficult to provide definitive evidence of such behavior (a teacher could argue that he or she instructed students well in advance of the test to fill in all blanks with the letter "C" as part of good test-taking strategy), and (2) in our minds, it is categorically different than a teacher who systematically changes student responses to the correct answer.

TABLE II
ESTIMATED PREVALENCE OF TEACHER CHEATING

                                      Cutoff for test score fluctuations (SCORE)
Cutoff for suspicious answer
strings (ANSWERS)                  80th percentile   90th percentile   95th percentile

Percent cheating on a particular test
  80th percentile                        2.1               2.1               1.8
  90th percentile                        1.8               1.8               1.5
  95th percentile                        1.3               1.3               1.1

Percent cheating on at least one of the four tests given
  80th percentile                        4.5               5.6               5.3
  90th percentile                        4.2               4.9               4.4
  95th percentile                        3.5               3.8               3.4

The top panel of the table presents estimates of the percentage of classrooms cheating on a particular subject test in a given year, based on three alternative cutoffs for ANSWERS and SCORE. In all cases, the prevalence of cheating is based on the excess number of classrooms with unexpected test score fluctuations among classes with suspicious answer strings, relative to classes that do not have suspicious answer strings. The bottom panel of the table presents estimates of the percentage of classrooms cheating on at least one of the four subject tests that comprise the overall test. In the bottom panel, classrooms that cheat on more than one subject test are counted only once. Our sample includes over 35,000 3rd-7th grade classrooms in the Chicago public schools for the years 1993–1999.


would simply be four times the results in the top panel. In many instances, however, classrooms appear to cheat on multiple subjects. Thus, the prevalence rates range from 3.4–5.6 percent of all classrooms.20 Because of the necessarily indirect nature of our identification strategy, we explore a range of supplementary tests designed to assess the validity of the estimates.21

V.A. Simulation Results

Our prevalence estimates may be biased downward or upward, depending on the extent to which our algorithm fails to successfully identify all instances of cheating ("false negatives") versus the frequency with which we wrongly classify a teacher as cheating ("false positives").

One simple simulation exercise we undertook with respect to possible false positives involves randomly assigning students to hypothetical classrooms. These synthetic classrooms thus consist of students who in actuality had no connection to one another. We then analyze these hypothetical classrooms using the same algorithm applied to the actual data. As one would hope, no evidence of cheating was found in the simulated classes. Indeed, the estimated prevalence of cheating was slightly negative in this simulation (i.e., classrooms with large test score increases in the current year followed by big declines the next year were slightly less likely to have unusual patterns of answer strings). Thus, there does not appear to be anything mechanical in our algorithm that generates evidence of cheating.
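A sketch of that placebo exercise: shuffle students into synthetic classrooms of the same sizes within each grade-year cell, then rerun the detection algorithm on the shuffled data. The column names below are hypothetical, and the paper does not describe its exact shuffling code.

```python
import numpy as np

def synthetic_classrooms(df, seed=0):
    """Randomly reassign students to hypothetical classrooms, preserving
    class sizes within each grade-year cell. df needs columns 'grade',
    'year', and 'classroom'; returns a copy with 'synthetic_classroom'."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    out["synthetic_classroom"] = (
        out.groupby(["grade", "year"])["classroom"]
           .transform(lambda c: rng.permutation(c.to_numpy()))
    )
    return out
```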


20. Computation of the overall prevalence is somewhat complicated because it involves calculating not only how many classrooms are actually above the thresholds on multiple subject tests, but also how frequently this would occur in the absence of cheating. Details on these calculations are available from the authors.

21. In an earlier version of this paper [Jacob and Levitt 2003], we present a number of additional sensitivity analyses that confirm the basic results. First, we test whether our results are sensitive to the way in which we predict the prevalence of high values of ANSWERS and SCORE in the upper part of the other distribution (e.g., fitting a linear or quadratic model to the data in the lower portion of the distribution, versus simply using the average value from the 50th–75th percentiles). We find our results are extremely robust. Second, we examine whether our results might simply be due to mean reversion, by testing whether classrooms with suspicious answer strings are less likely to maintain large test score gains. Among a sample of classrooms with large test score gains, we find that mean reversion is substantially greater among the set of classes with highly suspicious answer strings. Third, we investigate whether we might be detecting student rather than teacher cheating, by examining whether students who have suspicious answer strings in one year are more likely to have suspicious answer strings in other years. We find that they are not.


A second simulation involves artificially altering a classroom's answer strings in ways that we believe mimic teacher cheating or outstanding teaching.22 We are then able to see how frequently we label these altered classes as "cheating," and how this varies with the extent and nature of the changes we make to a classroom's answers. We simulate two different types of teacher cheating. The first is a very naive version, in which a teacher starts cheating at the same question for a number of students and changes consecutive questions to the right answers for these students, creating a block of identical and correct responses. The second type of cheating is much more sophisticated: we randomly change answers from incorrect to correct for selected students in a class. The outcome of this simulation reveals that our methods are surprisingly weak in detecting moderate cheating, even in the naive case. For instance, when three consecutive answers are changed to correct for 25 percent of a class, we catch such behavior only 4 percent of the time. Even when six questions are changed for half of the class, we catch the cheating in less than 60 percent of the cases. Only when the cheating is very extreme (e.g., six answers changed for the entire class) are we very likely to catch the cheating. The explanation for this poor performance is that classrooms with low test-score gains, even with the boost provided by this cheating, do not have large enough test score fluctuations to make them appear unusual, because there is so much inherent volatility in test scores. When the cheating takes the more sophisticated form described above, we are even less successful at catching low and moderate cheaters (2 percent and 34 percent, respectively, in the first two scenarios described above), but we almost always detect very extreme cheating.

From a political or equity perspective, however, we may be even more concerned with false positives, specifically the risk of accusing highly effective teachers of cheating. Hence we also simulate the impact of a good teacher changing certain answers for certain students from incorrect to correct responses. Our manipulation in this case differs from the cheating simulations above in two ways: (1) we allow some of the gains to be preserved in the following year, and (2) we alter questions such that students will not get random answers correct, but rather will tend to show the greatest improvement on the easiest questions that the students were getting wrong. These simulation results suggest

22. See Jacob and Levitt [2003] for the precise details of this simulation exercise.


that only rarely are good teachers mistakenly labeled as cheaters. For instance, a teacher whose prowess allows him or her to raise test scores by six questions for half the students in the class (more than a 0.5 grade equivalent increase on average for the class) is labeled a cheater only 2 percent of the time. In contrast, a naive cheater with the same test score gains is correctly identified as a cheater in more than 50 percent of cases.

In another test for false positives, we explored whether we might be mistaking emphasis on certain subject material for cheating. For example, if a math teacher spends several months on fractions with a particular class, one would expect the class to do particularly well on all of the math questions relating to fractions, and perhaps worse than average on other math questions. One might imagine a similar scenario in reading if, for example, a teacher creates a lengthy unit on the Underground Railroad, which later happens to appear in a passage on the reading comprehension exam. To examine this, we analyze the nature of the across-question correlations in cheating versus noncheating classrooms. We find that the most highly correlated questions in cheating classrooms are no more likely to measure the same skill (e.g., fractions, geometry) than in noncheating classrooms. The implication of these results is that the prevalence estimates presented above are likely to substantially understate the true extent of cheating (there is little evidence of false positives), but we frequently miss moderate cases of cheating.

V.B. The Correlation of Cheating Indicators across Subjects, Classrooms, and Years

If what we are detecting is truly cheating, then one would expect that a teacher who cheats on one part of the test would be more likely to cheat on other parts of the test. Also, a teacher who cheats one year would be more likely to cheat the following year. Finally, to the extent that cheating is either condoned by the principal or carried out by the test coordinator, one would expect to find multiple classes in a school cheating in any given year, and perhaps even that cheating in a school one year predicts cheating in future years. If what we are detecting is not cheating, then one would not necessarily expect to find strong correlation in our cheating indicators across exams for a specific classroom, across classrooms, or across years.


Table III reports regression results testing these predictions. The dependent variable is an indicator for whether we believe a classroom is likely to be cheating on a particular subject test, using our most stringent definition (above the ninety-fifth percentile on both cheating indicators). The baseline probability of

TABLE III
PATTERNS OF CHEATING WITHIN CLASSROOMS AND SCHOOLS

Dependent variable = Class suspected of cheating
(mean of the dependent variable = 0.011)

                                                        Full sample       Sample of classes and
                                                                          schools that existed
                                                                          in the prior year
Independent variables                                  (1)       (2)       (3)       (4)
Classroom cheated on exactly one other
  subject this year                                   0.105     0.103     0.101     0.101
                                                     (0.008)   (0.008)   (0.009)   (0.009)
Classroom cheated on exactly two other
  subjects this year                                  0.289     0.285     0.243     0.243
                                                     (0.027)   (0.027)   (0.031)   (0.031)
Classroom cheated on all three other
  subjects this year                                  0.627     0.622     0.595     0.595
                                                     (0.051)   (0.051)   (0.054)   (0.054)
Cheating rate among all other classes in the
  school this year, on this subject                     --      0.166     0.134     0.129
                                                               (0.030)   (0.027)   (0.027)
Cheating rate among all other classes in the
  school this year, on other subjects                   --      0.023     0.059     0.045
                                                               (0.024)   (0.026)   (0.029)
Cheating in this classroom in this subject
  last year                                             --        --      0.096     0.091
                                                                         (0.012)   (0.012)
Number of other subjects this classroom
  cheated on last year                                  --        --      0.023     0.018
                                                                         (0.004)   (0.004)
Cheating in this classroom ever in the past             --        --        --      0.006
                                                                                   (0.002)
Cheating rate among other classrooms in this
  school in past years                                  --        --        --      0.090
                                                                                   (0.040)
Fixed effects for grade-subject-year                   Yes       Yes       Yes       Yes
R^2                                                   0.090     0.093     0.109     0.109
Number of observations                              165,578   165,578    94,182    94,170

The dependent variable is an indicator for whether a classroom is above the 95th percentile on both our suspicious strings and unusual test score measures of cheating on a particular subject test. Estimation is done using a linear probability model. The unit of observation is classroom-grade-year-subject. For columns that include measures of cheating in prior years, observations where that classroom or school does not appear in the data in the prior year are excluded. Standard errors are clustered at the school level to take into account correlations across classrooms as well as serial correlation.


qualifying as a cheater for this cutoff is 1.1 percent. To fully appreciate the magnitude of the effects implied by the table, it is important to keep this very low baseline in mind. We report estimates from linear probability models (probits yield similar marginal effects), with standard errors clustered at the school level.
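For readers who want to replicate the style of estimation, the following is a minimal sketch of a linear probability model with school-clustered standard errors, using statsmodels on synthetic data. The variable names and the single regressor are hypothetical stand-ins for the full Table III specification.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the classroom-level panel.
rng = np.random.default_rng(0)
n = 5000
df = pd.DataFrame({
    "cheat": rng.binomial(1, 0.011, n),        # cheating indicator
    "cheat_other": rng.binomial(1, 0.02, n),   # cheated on another subject
    "school_id": rng.integers(0, 200, n),      # cluster variable
})

# OLS on a binary outcome (a linear probability model), with standard
# errors clustered at the school level.
res = smf.ols("cheat ~ cheat_other", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["school_id"]})
print(res.params, res.bse)
```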

Column 1 of Table III shows that cheating on other tests in the same year is an extremely powerful predictor of cheating in a different subject. If a classroom cheats on exactly one other subject test, the predicted probability of cheating on this test increases by over ten percentage points. Since the baseline cheating rate is only 1.1 percent, classrooms cheating on exactly one other test are ten times more likely to have cheated on this subject than are classrooms that did not cheat on any of the other subjects (which is the omitted category). Classrooms that cheat on two other subjects are almost 30 times more likely to cheat on this test, relative to those not cheating on other tests. If a class cheats on all three other subjects, it is 50 times more likely to also cheat on this test.

There also is evidence of correlation in cheating within schools. A ten-percentage-point increase in cheating classrooms in a school (excluding the classroom in question) on the same subject test raises the likelihood this class cheats by roughly 0.16 percentage points. This potentially suggests some role for centralized cheating by a school counselor, test coordinator, or the principal, rather than by teachers operating independently. There is little evidence that cheating rates within the school on other subject tests affect cheating on this test.

When making comparisons across years (columns 3 and 4), it is important to note that we do not actually have teacher identifiers. We do, however, know what classroom a student is assigned to. Thus, we can only compare the correlation between past and current cheating in a given classroom. To the extent that teacher turnover occurs or teachers switch classrooms, this proxy will be contaminated. Even given this important limitation, cheating in the classroom last year predicts cheating this year. In column 3, for example, we see that classrooms that cheated in the same subject last year are 9.6 percentage points more likely to cheat this year, even after we control for cheating on other subjects in the same year and cheating in other classes in the school. Column 4 shows that prior cheating in the school strongly predicts the likelihood that a classroom will cheat this year.


V.C. Evidence from Retesting under Controlled Circumstances

Perhaps the most compelling evidence for our methods comes from the results of a unique policy intervention. In spring 2002 the Chicago public schools provided us the opportunity to retest over 100 classrooms under controlled circumstances that precluded cheating, a few weeks after the initial ITBS exam was administered.23 Based on suspicious answer strings and unusual test score gains on the initial spring 2002 exam, we identified a set of classrooms most likely to have cheated. In addition, we chose a second group of classrooms that had suspicious answer strings but did not have large test score gains, a pattern that is consistent with a bad teacher who attempts to mask a poor teaching performance by cheating. We also retested two sets of classrooms as control groups. The first control group consisted of classes with large test score gains but no evidence of cheating in their answer strings, a potential indication of effective teaching. These classes would be expected to maintain their gains when retested, subject to possible mean reversion. The second control group consisted of randomly selected classrooms.

The results of the retests are reported in Table IV. Columns 1–3 correspond to reading; columns 4–6 are for math. For each subject we report the test score gain (in standard score units) for three periods: (i) from spring 2001 to the initial spring 2002 test, (ii) from the initial spring 2002 test to the retest a few weeks later, and (iii) from spring 2001 to the spring 2002 retest. For purposes of comparison, we also report the average gain for all classrooms in the system for the first of those measures (the other two measures are unavailable unless a classroom was retested).

Classrooms prospectively identified as "most likely to have cheated" experienced gains on the initial spring 2002 test that were nearly twice as large as the typical CPS classroom (see columns 1 and 4). On the retest, however, those excess gains completely disappeared: the gains between spring 2001 and the spring 2002 retest were close to the systemwide average. Those classrooms prospectively identified as "bad teachers likely to have cheated" also experienced large score declines on the retest (-8.8 and -10.5 standard score points, or roughly two-thirds of a grade equivalent). Indeed, comparing the spring 2001 results with the

23 Full details of this policy experiment are reported in Jacob and Levitt[2003] The results of the retests were used by CPS to initiate disciplinary actionagainst a number of teachers principals and staff


TABLE IV
RESULTS OF RETESTS: COMPARISON OF RESULTS FOR SPRING 2002 ITBS AND AUDIT TEST

                                              Reading gains between        Math gains between
                                            ------------------------    ------------------------
                                            Spring   Spring   Spring    Spring   Spring   Spring
                                            2001 to  2001 to  2002 to   2001 to  2001 to  2002 to
                                            spring   2002     2002      spring   2002     2002
Category of classroom                       2002     retest   retest    2002     retest   retest

All classrooms in the Chicago
  public schools                             14.3      --       --       16.9      --       --
Classrooms prospectively identified as
  most likely cheaters (N = 36 on math;
  N = 39 on reading)                         28.8     12.6    -16.2      30.0     19.3    -10.7
Classrooms prospectively identified as
  bad teachers suspected of cheating
  (N = 16 on math; N = 20 on reading)        16.6      7.8     -8.8      17.3      6.8    -10.5
Classrooms prospectively identified as
  good teachers who did not cheat
  (N = 17 on math; N = 17 on reading)        20.6     21.1      0.5      28.8     25.5     -3.3
Randomly selected classrooms (N = 24
  overall, but only one test per
  classroom)                                 14.5     12.2     -2.3      14.5     12.2     -2.3

Results in this table compare test scores at three points in time (spring 2001, spring 2002, and a retest administered under controlled conditions a few weeks after the initial spring 2002 test). The unit of observation is a classroom-subject test pair. Only students taking all three tests are included in the calculations. Because of limited data, math and reading results for the randomly selected classrooms are combined. Only the first two columns are available for all CPS classrooms, since audits were performed only on a subset of classrooms. All entries in the table are in standard score units.


Indeed, comparing the spring 2001 results with the spring 2002 retest, children in these classrooms had gains only half as large as the typical yearly CPS gain. In stark contrast, classrooms prospectively identified as having "good teachers" (i.e., classrooms with large gains but not suspicious answer strings) actually scored even higher on the reading retest than they did on the initial test. The math scores fell slightly on retesting, but these classrooms continued to experience extremely large gains. The randomly selected classrooms also maintained almost all of their gains when retested, as would be expected. In summary, our methods proved to be extremely effective, both in identifying likely cheaters and in correctly classifying teachers who achieved large test score gains in a legitimate manner.

VI. DOES TEACHER CHEATING RESPOND TO INCENTIVES?

From the perspective of economics, perhaps the most interesting question related to teacher cheating is the degree to which it responds to incentives. As noted in the Introduction, there were two major changes in the incentives faced by teachers and students over our sample period. Prior to 1996, ITBS scores were primarily used to provide teachers and parents with a sense of how a child was progressing academically. Beginning in 1996, with the appointment of Paul Vallas as CEO of Schools, the CPS launched an initiative designed to hold students and teachers accountable for student learning.

The reform had two main elements. The first involved putting schools "on probation" if less than 15 percent of students scored at or above national norms on the ITBS reading exams. The CPS did not use math performance to determine probation status. Probation schools that do not exhibit sufficient improvement may be reconstituted, a procedure that involves closing the school and dismissing or reassigning all of the school staff.24 It is clear from our discussions with teachers and administrators that being on probation is viewed as an extremely undesirable circumstance. The second piece of the accountability reform was an end to social promotion, the practice of passing students to the next grade regardless of their academic skills or school performance. Under the new policy, students in third, sixth, and eighth grade must meet minimum standards on the ITBS in both reading and mathematics in order to be promoted to the next grade.

24. For a more detailed analysis of the probation policy, see Jacob [2002] and Jacob and Lefgren [forthcoming].


The promotion standards were implemented in spring 1997 for third and sixth graders. Promotion decisions are based solely on scores in reading comprehension and mathematics.

Table V presents OLS estimates of the relationship between teacher cheating and a variety of classroom and school characteristics.25 The unit of observation is a classroom-subject-grade-year. The dependent variable is an indicator of whether we suspect the classroom cheated, using our most stringent definition: a classroom is designated a cheater if both its test score fluctuations and the suspiciousness of its answer strings are above the ninety-fifth percentile for that grade, subject, and year.26

In column (1) the policy changes are restricted to have a constant impact across all classrooms. The introduction of the social promotion and probation policies is positively correlated with the likelihood of classroom cheating, although the point estimates are not statistically significant at conventional levels. Much more interesting results emerge when we interact the policy changes with the previous year's test scores for the classroom. For both probation and social promotion, cheating rates in the lowest performing classrooms prove to be quite responsive to the change in incentives. In column (2) we see that a classroom one standard deviation below the mean increased its likelihood of cheating by 0.43 percentage points in response to the school probation policy, and by roughly 0.65 percentage points due to the ending of social promotion. Given a baseline rate of cheating of 1.1 percent, these effects are substantial. The magnitude of these changes is particularly large considering that no elementary school on probation was actually reconstituted during this period, and that the social promotion policy has a direct impact on students but no direct ramifications for teacher pay or rewards. A classroom one standard deviation above the mean does not see any significant change in cheating in response to these two policies. Such classes are very unlikely to be located in schools at risk for being put on probation, and also are likely to have few students at risk for being retained.

25. Logit and probit models evaluated at the mean yield comparable results, so the estimates from a linear probability model are presented for ease of interpretation.

26. The results are not sensitive to the cheating cutoff used. Note that this measure may include error due to both false positives and negatives. Since the measurement error is in the dependent variable, the primary impact will likely be to simply decrease the precision of our estimates.
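
To make the specification concrete, the following is a minimal sketch of a linear probability model with school-year-clustered standard errors, in the spirit of column (2) of Table V below. All data are synthetic and the variable names are illustrative, not the authors'; the paper's actual regressions include additional controls (grade, subject, class size, and a time trend) omitted here.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 5000  # hypothetical classroom-subject-year observations

df = pd.DataFrame({
    "cheated": rng.binomial(1, 0.011, n),           # ~1.1 percent baseline rate
    "social_promotion": rng.binomial(1, 0.2, n),    # policy exposure dummies
    "probation": rng.binomial(1, 0.13, n),
    "prior_achievement": rng.normal(0.0, 0.66, n),  # deviation from cell mean
    "school_year": rng.integers(0, 400, n),         # cluster identifier
})

# 'a * b' in the formula expands to main effects plus the interaction,
# mirroring the policy-by-achievement interactions in column (2).
lpm = smf.ols(
    "cheated ~ social_promotion * prior_achievement"
    " + probation * prior_achievement",
    data=df,
)
# Standard errors clustered by school-year, as in the Table V notes.
fit = lpm.fit(cov_type="cluster", cov_kwds={"groups": df["school_year"]})
print(fit.params)
```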


TABLE V
OLS ESTIMATES OF THE RELATIONSHIP BETWEEN CHEATING AND CLASSROOM CHARACTERISTICS

                                              Dependent variable = indicator of classroom cheating
Independent variables                            (1)          (2)          (3)          (4)

Social promotion policy                        0.0011       0.0011       0.0015       0.0023
                                              (0.0013)     (0.0013)     (0.0013)     (0.0009)
School probation policy                        0.0020       0.0019       0.0021       0.0029
                                              (0.0014)     (0.0014)     (0.0014)     (0.0013)
Prior classroom achievement                   -0.0047      -0.0028      -0.0016      -0.0028
                                              (0.0005)     (0.0005)     (0.0007)     (0.0007)
Social promotion x classroom achievement          --       -0.0049      -0.0051      -0.0046
                                                           (0.0014)     (0.0014)     (0.0012)
School probation x classroom achievement          --       -0.0070      -0.0070      -0.0064
                                                           (0.0013)     (0.0013)     (0.0013)
Mixed grade classroom                         -0.0084      -0.0085      -0.0089      -0.0089
                                              (0.0007)     (0.0007)     (0.0008)     (0.0012)
% of students included in official
  reporting                                    0.0252       0.0249       0.0141       0.0131
                                              (0.0031)     (0.0031)     (0.0037)     (0.0037)
Teacher administers exam to own students       0.0067       0.0067       0.0066       0.0061
                                              (0.0015)     (0.0015)     (0.0015)     (0.0011)
Test form offered for the first time (a)      -0.0007      -0.0007      -0.0011         --
                                              (0.0011)     (0.0011)     (0.0010)
Average quality of teachers'
  undergraduate institution                       --           --       -0.0026         --
                                                                        (0.0007)
Percent of teachers who have worked at
  the school less than 3 years                    --           --       -0.0045         --
                                                                        (0.0031)
Percent of teachers under 30 years of age         --           --        0.0156         --
                                                                        (0.0065)
Percent of students in the school meeting
  national norms in reading last year             --           --       -0.0001         --
                                                                        (0.0000)
Percent free lunch in school                      --           --        0.0001         --
                                                                        (0.0000)
Predominantly Black school                        --           --        0.0068         --
                                                                        (0.0019)
Predominantly Hispanic school                     --           --       -0.0009         --
                                                                        (0.0016)
School-year fixed effects                         No           No           No          Yes
Number of observations                        163,474      163,474      163,474      163,474

The unit of observation is classroom-grade-year-subject, and the sample includes eight years (1993 to 2000), four subjects (reading comprehension and three math sections), and five grades (three to seven). The dependent variable is the cheating indicator derived using the ninety-fifth percentile cutoff. Robust standard errors, clustered by school-year, are shown in parentheses. Other variables included in the regressions in columns (1) and (2) are a linear time trend, cubic terms for the number of students, a linear grade variable, and fixed effects for subjects. The regression shown in column (3) also includes the following variables: indicators of the percent of students in the classroom who were Black, Hispanic, male, receiving free lunch, old for grade, in a special education program, living in foster care, and living with a nonparental relative; indicators of school size, mobility rate, and attendance rate; and the percent of teachers in the school who had a master's or doctoral degree, lived in Chicago, and were education majors.

(a) Test forms vary only by year, so this variable will drop out of the analysis when school-year fixed effects are included.


These findings are robust to adding a number of classroom and school characteristics (column (3)), as well as school-year fixed effects (column (4)).

Cheating also appears to be responsive to other costs and benefits. Classrooms that tested poorly last year are much more likely to cheat. For example, a classroom with average student prior achievement one classroom standard deviation below the mean is 23 percent more likely to cheat. Classrooms with students in multiple grades are 65 percent less likely to cheat than classrooms where all students are in the same grade. This is consistent with the fact that it is likely more difficult for teachers in such classrooms to cheat, since they must administer two different test forms to students, which will necessarily have different correct answers. Moreover, classes with a higher proportion of students who are included in the official test reporting are more likely to cheat: a 10 percentage point increase in the proportion of students in a class whose test scores "count" will increase the likelihood of cheating by roughly 20 percent.27 Teachers who administer the exam to their own students are 0.67 percentage points (approximately 50 percent) more likely to cheat.28

A few other notable results emerge. First, there is no statistically significant impact on cheating of reusing a test form that has been administered in a previous year. That finding is of interest because it suggests that teachers taking old exams and teaching the precise questions to students is not an important component of what we are detecting as cheating (although anecdotal evidence suggests this practice exists). Classrooms in schools with lower achievement, higher poverty rates, and more Black students are more likely to cheat. Classrooms in schools with teachers who graduated from more prestigious undergraduate institutions are less likely to cheat, whereas classrooms in schools with younger teachers are more likely to cheat.

27. Consistent with this finding, within cheating classrooms teachers are less likely to alter the answer sheets for students who are excluded from test reporting. Teachers also appear to cheat less frequently for boys and for older students.

28. In an earlier version of the paper [Jacob and Levitt 2002], we examine for which students teachers tend to cheat. We find that teachers are most likely to cheat for students in the second quartile of the national ability distribution (twenty-fifth to fiftieth percentile) in the previous year, consistent with the incentives under the probation policy.


VII. CONCLUSION

This paper develops an algorithm for determining the prevalence of teacher cheating on standardized tests and applies the results to data from the Chicago public schools. Our methods reveal thousands of instances of classroom cheating, representing 4-5 percent of the classrooms each year. Moreover, we find that teacher cheating appears quite responsive to relatively minor changes in incentives.

The obvious benefits of high-stakes tests as a means of providing incentives must therefore be weighed against the possible distortions that these measures induce. Explicit cheating is merely one form of distortion. Indeed, the kind of cheating we focus on is one that could be greatly reduced at a relatively low cost through the implementation of proper safeguards, such as is done by Educational Testing Service on the SAT or GRE exams.29

Even if this particular form of cheating were eliminated, however, there are a virtually infinite number of dimensions along which strategic school personnel can act to game the current system. For instance, Figlio and Winicki [2001] and Rohlfs [2002] document that schools attempt to increase the caloric intake of students on test days, and disruptive students are especially likely to be suspended from school during official testing periods [Figlio 2002]. It may be prohibitively costly, or even impossible, to completely safeguard the system.

Finally, this paper fits into a small but growing body of research focused on identifying corrupt or illicit behavior on the part of economic actors (see Porter and Zona [1993], Fisman [2001], Di Tella and Schargrodsky [2001], and Duggan and Levitt [2002]). Because individuals engaged in such behavior actively attempt to hide their trails, the intellectual exercise associated with uncovering their misdeeds differs substantially from the typical economic application, in which the researcher starts with a well-defined measure of the outcome variable (e.g., earnings, economic growth, profits) and then attempts to uncover the determinants of these outcomes. In the case of corruption, there is typically no clear outcome variable, making it necessary for the researcher to employ nonstandard approaches in generating such a measure.

29. Although even tests administered by the Educational Testing Service are not immune to cheating, as evidenced by recent reports that many of the GRE scores from China are fraudulent [Contreras 2002].

APPENDIX: THE CONSTRUCTION OF SUSPICIOUS STRING MEASURES

This appendix describes in greater detail how we construct each of our measures of unexpected or suspicious responses.

The first measure focuses on the most unlikely block of identical answers given on consecutive questions. Using past test scores, future test scores, and background characteristics, we predict the likelihood that each student will give each answer on each question. For each item a student has four choices (A, B, C, or D), only one of which is correct. We estimate a multinomial logit for each item on the exam in order to predict how students will respond to each question. We estimate the following model for each item, using information from other students in that year, grade, and subject:

(1)   Pr(Y_{isc} = j) = \frac{e^{\beta_j X_s}}{\sum_{j=1}^{J} e^{\beta_j X_s}}

where Y_{isc} indicates the response of student s in class c on item i, the number of possible responses (J) is four, and X_s is a vector that includes measures of prior and future student achievement in math and reading, as well as demographic variables (such as race, gender, and free lunch status) for student s. Thus, a student's predicted probability of choosing a particular response is identified by the likelihood of other students (in the same year, grade, and subject) with similar background characteristics choosing that response.

Notice that by including future as well as prior test scores in the model, we decrease the likelihood that students with unusually good teachers will be identified as cheaters, since these students will likely retain some of the knowledge learned in the base year and thus have higher future test scores. Also note that by estimating the probability of selecting each possible response, rather than simply estimating the probability of choosing the correct response, we take advantage of any additional information that is provided by particular response patterns in a classroom.
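
A minimal sketch of this estimation step appears below, with synthetic data standing in for the CPS files; the three covariates are placeholders for X_s (prior and future achievement plus demographics), and one such model would be fit per item within each year-grade-subject cell. The last line previews the step described next (equation (2) below).

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 2000                              # students taking this item in one cell
X = sm.add_constant(rng.normal(size=(n, 3)))   # stand-ins for X_s
y = rng.integers(0, 4, n)             # response chosen: 0=A, 1=B, 2=C, 3=D

fit = sm.MNLogit(y, X).fit(disp=0)    # multinomial logit over the four choices
probs = fit.predict(X)                # n x 4 predicted response probabilities

# Probability of the response each student actually chose (equation (2)).
p_chosen = probs[np.arange(n), y]
```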

Using the estimates from this model, we calculate the predicted probability that each student would answer each item in the way that he or she in fact did:

(2)   p_{isc} = \frac{e^{\beta_k X_s}}{\sum_{j=1}^{J} e^{\beta_j X_s}} ,   for k = the response actually chosen by student s on item i.

This provides us with one measure per student per item. Taking the product over items within student, we calculate the probability that a student would have answered a string of consecutive questions from item m to item n as he or she did:

(3)   p_{sc}^{mn} = \prod_{i=m}^{n} p_{isc}

We then take the product across all students in the classroom who had identical responses in the string. If we define z as a student, S_{zc}^{mn} as the string of responses for student z from item m to item n, and S_{sc}^{mn} as the string for student s, then we can express the product as

(4)   \tilde{p}_{sc}^{mn} = \prod_{s \in \{z : S_{zc}^{mn} = S_{sc}^{mn}\}} p_{sc}^{mn}

Note that if there are n_s students in class c and each student has a unique set of responses to these particular items, then \tilde{p}_{sc}^{mn} collapses to p_{sc}^{mn} for each student, and there will be n_s distinct values within the class. At the other extreme, if all of the students in class c have identical responses, then there is only one distinct value of \tilde{p}_{sc}^{mn}. We repeat this calculation for all possible consecutive strings of length three to seven, that is, for all S^{mn} such that 3 ≤ n − m + 1 ≤ 7. To create our first indicator of suspicious string patterns, we take the minimum of the predicted block probability for each classroom:

MEASURE 1:   M1_c = \min_{s,m,n} (\tilde{p}_{sc}^{mn})

This measure captures the least likely block of identical answers given on consecutive questions in the classroom.
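
The sketch below shows how Measure 1 might be computed for one classroom, given the predicted probabilities of each chosen response. The inputs are synthetic and the implementation details (the brute-force window search in particular) are ours, not the authors':

```python
import numpy as np

def measure_1(answers, p_chosen, min_len=3, max_len=7):
    """Least likely block of identical consecutive answers (Measure 1).

    answers:  (students x items) matrix of chosen responses (0-3)
    p_chosen: (students x items) predicted probability of each chosen response
    """
    n_students, n_items = answers.shape
    best = 1.0
    for length in range(min_len, max_len + 1):
        for m in range(n_items - length + 1):
            block = answers[:, m:m + length]
            # Equation (3): probability of each student's own string.
            p_str = p_chosen[:, m:m + length].prod(axis=1)
            # Equation (4): product over students with identical strings.
            _, group = np.unique(block, axis=0, return_inverse=True)
            for g in range(group.max() + 1):
                best = min(best, p_str[group == g].prod())
    return best

rng = np.random.default_rng(2)
ans = rng.integers(0, 4, (20, 40))     # a hypothetical class of 20, 40 items
p = rng.uniform(0.1, 0.9, ans.shape)
print(measure_1(ans, p))
```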

The second measure of suspicious answer strings is intended to capture more general patterns of similarity in student responses. To construct this measure, we first calculate the residuals for each of the possible choices a student could have made for each item:

(5)   e_{jisc} = 0 − \frac{e^{\beta_j X_s}}{\sum_{j=1}^{J} e^{\beta_j X_s}}   if j ≠ k

      e_{jisc} = 1 − \frac{e^{\beta_j X_s}}{\sum_{j=1}^{J} e^{\beta_j X_s}}   if j = k

where e_{jisc} is the residual for response j on item i by student s in classroom c. We thus have four separate residuals per student per item.

To create a classroom-level measure of the response to item i, we need to combine the information for each student. First, we sum the residuals for each response across students within a classroom:

(6)   e_{jic} = \sum_{s} e_{jisc}

If there is no within-class correlation in the way that students responded to a particular item, this term should be approximately zero. Second, we sum across the four possible responses for each item within classrooms. At the same time, we square each of the component residual measures to accentuate outliers, and divide by the number of students in the class (n_{sc}) to normalize by class size:

(7)   v_{ic} = \frac{\sum_{j} e_{jic}^2}{n_{sc}}

The statistic v_{ic} captures something like the variance of student responses on item i within classroom c. Notice that we choose to first sum the residuals of each response across students, and then sum the classroom-level measures for each response, rather than summing across responses within student initially. We do this in order to emphasize the classroom-level tendencies in response patterns.

Our second measure of suspicious strings is simply the classroom average (across items) of this variance term across all test items:

MEASURE 2:   M2_c = \bar{v}_c = \sum_{i} v_{ic} / n_i , where n_i is the number of items on the exam.

Our third measure is simply the variance (as opposed to the mean) in the degree of correlation across questions within a classroom:

MEASURE 3:   M3_c = \sigma_{v_c}^2 = \sum_{i} (v_{ic} − \bar{v}_c)^2 / n_i
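Measures 2 and 3 can both be read off the item-level statistic v_{ic}; a compact sketch on synthetic data (the Dirichlet draws are only a convenient way to generate probabilities that sum to one; they are not the paper's fitted values):

```python
import numpy as np

def measures_2_3(answers, probs):
    """Measures 2 and 3 from equations (5)-(7), for one classroom.

    answers: (students x items) chosen responses (0-3)
    probs:   (students x items x 4) predicted probabilities per response
    """
    n_students = answers.shape[0]
    chosen = np.eye(4)[answers]        # one-hot indicator of each response
    resid = chosen - probs             # e_jisc, equation (5)
    e_jc = resid.sum(axis=0)           # equation (6): sum over students
    v = (e_jc ** 2).sum(axis=1) / n_students   # equation (7), one per item
    return v.mean(), v.var()           # Measure 2 (mean), Measure 3 (variance)

rng = np.random.default_rng(3)
ans = rng.integers(0, 4, (20, 40))
pr = rng.dirichlet(np.ones(4), size=(20, 40))  # each row of four sums to one
print(measures_2_3(ans, pr))
```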

Our final indicator focuses on the extent to which a student's response pattern differs from those of other students with the same aggregate score that year. Let q_{isc} equal one if student s in classroom c answered item i correctly, and zero otherwise. Let A_s equal the aggregate score of student s on the exam. We then determine what fraction of students at each aggregate score level answered each item correctly. If we let n_{sA} equal the number of students with an aggregate score of A, then this fraction \bar{q}_i^A can be expressed as

(8)   \bar{q}_i^A = \frac{\sum_{s \in \{z : A_z = A_s\}} q_{isc}}{n_{sA}}

We then calculate a measure of how much the response pattern of student s differed from the response pattern of other students with the same aggregate score. We do so by subtracting a student's answer on item i from the mean response of all students with aggregate score A, squaring these deviations, and then summing across all items on the exam:

(9)   Z_{sc} = \sum_{i} (q_{isc} − \bar{q}_i^A)^2

We then subtract out the mean deviation for all students with the same aggregate score, \bar{Z}^A, and sum over the students within each classroom to obtain our final indicator:

MEASURE 4:   M4_c = \sum_{s} (Z_{sc} − \bar{Z}^A)
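
A sketch of Measure 4 on synthetic system-wide data follows; grouping by the aggregate score implements equations (8) and (9), and the identifiers and sizes are hypothetical:

```python
import numpy as np
import pandas as pd

def measure_4(correct, class_id):
    """Measure 4: within-class sum of deviations of each student's
    right/wrong pattern from peers with the same aggregate score.

    correct:  (students x items) 0/1 matrix for every student in a
              grade-subject-year cell (system-wide, not one classroom)
    class_id: classroom identifier for each student
    """
    agg = correct.sum(axis=1)                         # aggregate score A_s
    qbar = pd.DataFrame(correct).groupby(agg).transform("mean").to_numpy()
    z = ((correct - qbar) ** 2).sum(axis=1)           # Z_sc, equation (9)
    zbar = pd.Series(z).groupby(agg).transform("mean").to_numpy()
    return pd.Series(z - zbar).groupby(pd.Series(class_id)).sum()

rng = np.random.default_rng(4)
correct = (rng.random((500, 40)) < 0.6).astype(int)   # hypothetical data
classes = rng.integers(0, 25, 500)
print(measure_4(correct, classes).head())
```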

KENNEDY SCHOOL OF GOVERNMENT, HARVARD UNIVERSITY

UNIVERSITY OF CHICAGO AND AMERICAN BAR FOUNDATION

REFERENCES

Aiken, Linda R., "Detecting, Understanding, and Controlling for Cheating on Tests," Research in Higher Education, XXXII (1991), 725-736.

Angoff, William H., "The Development of Statistical Indices for Detecting Cheaters," Journal of the American Statistical Association, LXIX (1974), 44-49.

Baker, George, "Incentive Contracts and Performance Measurement," Journal of Political Economy, C (1992), 598-614.

Bellezza, Francis S., and Suzanne F. Bellezza, "Detection of Cheating on Multiple-Choice Tests by Using Error Similarity Analysis," Teaching of Psychology, XVI (1989), 151-155.

Carnoy, Martin, and Susanna Loeb, "Does External Accountability Affect Student Outcomes? A Cross-State Analysis," Working Paper, Stanford University School of Education, 2002.

Contreras, Russell, "Inquiry Uncovers Possible Cheating on GRE in Asia," Associated Press, August 7, 2002.

Cullen, Julie Berry, and Randall Reback, "Tinkering Toward Accolades: School Gaming under a Performance Accountability System," Working Paper, University of Michigan, 2002.

Deere, Donald, and Wayne Strayer, "Putting Schools to the Test: School Accountability, Incentives, and Behavior," Working Paper, Texas A&M University, 2002.

Di Tella, Rafael, and Ernesto Schargrodsky, "The Role of Wages and Auditing during a Crackdown on Corruption in the City of Buenos Aires," unpublished manuscript, Harvard Business School, 2001.

Duggan, Mark, and Steven Levitt, "Winning Isn't Everything: Corruption in Sumo Wrestling," American Economic Review, XCII (2002), 1594-1605.

Figlio, David N., "Testing, Crime and Punishment," Working Paper, University of Florida, 2002.

Figlio, David N., and Joshua Winicki, "Food for Thought: The Effects of School Accountability Plans on School Nutrition," Working Paper, University of Florida, 2001.

Figlio, David N., and Lawrence S. Getzler, "Accountability, Ability and Disability: Gaming the System," Working Paper, University of Florida, 2002.

Fisman, Ray, "Estimating the Value of Political Connections," American Economic Review, XCI (2001), 1095-1102.

Frary, Robert B., "Statistical Detection of Multiple-Choice Answer Copying: Review and Commentary," Applied Measurement in Education, VI (1993), 153-165.

Frary, Robert B., T. Nicholas Tideman, and Thomas Watts, "Indices of Cheating on Multiple-Choice Tests," Journal of Educational Statistics, II (1977), 235-256.

Glaeser, Edward, and Andrei Shleifer, "A Reason for Quantity Regulation," American Economic Review, XCI (2001), 431-435.

Grissmer, David W., Ann Flanagan, et al., "Improving Student Achievement: What NAEP Test Scores Tell Us," RAND Corporation, MR-924-EDU, 2000.

Hanushek, Eric A., and Margaret E. Raymond, "Improving Educational Quality: How Best to Evaluate Our Schools?" Conference paper, Federal Reserve Bank of Boston, 2002.

Hofkins, Diane, "Cheating 'Rife' in National Tests," New York Times Educational Supplement, June 16, 1995.

Holland, Paul W., "Assessing Unusual Agreement between the Incorrect Answers of Two Examinees Using the K-Index: Statistical Theory and Empirical Support," Technical Report No. 96-5, Educational Testing Service, 1996.

Holmstrom, Bengt, and Paul Milgrom, "Multitask Principal-Agent Analyses: Incentive Contracts, Asset Ownership, and Job Design," Journal of Law, Economics, and Organization, VII (1991), 24-51.

Jacob, Brian A., "Accountability, Incentives and Behavior: The Impact of High-Stakes Testing in the Chicago Public Schools," Working Paper No. 8968, National Bureau of Economic Research, 2002.

Jacob, Brian A., and Lars Lefgren, "The Impact of Teacher Training on Student Achievement: Quasi-Experimental Evidence from School Reform Efforts in Chicago," Journal of Human Resources, forthcoming.

Jacob, Brian A., and Steven Levitt, "Catching Cheating Teachers: The Results of an Unusual Experiment in Implementing Theory," Working Paper No. 9413, National Bureau of Economic Research, 2003.

Klein, Stephen P., Laura S. Hamilton, et al., "What Do Test Scores in Texas Tell Us?" RAND Corporation, 2000.

Kolker, Claudia, "Texas Offers Hard Lessons on School Accountability," Los Angeles Times, April 14, 1999.

Lindsay, Drew, "Whodunit? Officials Find Thousands of Erasures on Standardized Tests and Suspect Tampering," Education Week (1996), 25-29.

Loughran, Regina, and Thomas Comiskey, "Cheating the Children: Educator Misconduct on Standardized Tests," Report of the City of New York Special Commissioner of Investigation for the New York City School District, 1999.

Marcus, John, "Faking the Grade," Boston Magazine, February 2000.

May, Meredith, "State Fears Cheating by Teachers," San Francisco Chronicle, October 4, 2000.

Perlman, Carol L., "Results of a Citywide Testing Program Audit in Chicago," Paper for American Educational Research Association annual meeting, Chicago, IL, 1985.

Porter, Robert H., and J. Douglas Zona, "Detection of Bid Rigging in Procurement Auctions," Journal of Political Economy, CI (1993), 518-538.

Richards, Craig E., and Tian Ming Sheu, "The South Carolina School Incentive Reward Program: A Policy Analysis," Economics of Education Review, XI (1992), 71-86.

Rohlfs, Chris, "Estimating Classroom Learning Using the School Breakfast Program as an Instrument for Attendance and Class Size: Evidence from Chicago Public Schools," unpublished manuscript, University of Chicago, 2002.

Tysome, Tony, "Cheating Purge: Inspectors Out," New York Times Higher Education Supplement, August 19, 1994.

Wollack, James A., "A Nominal Response Model Approach for Detecting Answer Copying," Applied Psychological Measurement, XX (1997), 144-152.



response. We then search over all combinations of students and consecutive questions to find the block of identical answers given by students in a classroom least likely to have arisen by chance.9

The more unusual is the most unusual block of test responses (adjusting for class size and the number of questions on the exam, both of which increase the possible combinations over which we search), the more likely it is that cheating occurred. Thus, if ten very bright students in a class of thirty give the correct answers to the first five questions on the exam (typically the easier questions), the block of identical answers will not appear unusual. In contrast, if all fifteen students in a low-achieving classroom give the same correct answers to the last five questions on the exam (typically the harder questions), this would appear quite suspect.

The second measure of suspicious answer strings involves the overall degree of correlation in student answers across the test. When a teacher changes answers on test forms, it presumably increases the uniformity of student test forms across students in the class. This measure is meant to capture more general patterns of similarity in student responses, beyond just identical blocks of answers. Based on the results of the multinomial logit described above, for each question and each student, we create a measure of how unexpected the student's response was. We then combine the information for each student in the classroom to create something akin to the within-classroom correlation in student responses. This measure will be high if students in a classroom tend to give the same answers on many questions, especially if the answers given are unexpected (i.e., correct answers on hard questions or systematic mistakes on easy questions).

Of course, within-classroom correlation may arise for many reasons other than cheating (e.g., the teacher may emphasize certain topics during the school year). Therefore, a third indicator of potential cheating is a high variance in the degree of correlation across questions. If the teacher changes answers for multiple students on selected questions, the within-class correlation on those particular questions will be extremely high, while the degree of within-class correlation on other questions is likely to be typical. This leads the cross-question variance in correlations to be larger than normal in cheating classrooms.

9. Note that we do not require the answers to be correct. Indeed, in many classrooms the most unusual strings include some incorrect answers. Note also that these calculations are done under the assumption that a given student's answers are uncorrelated (conditional on observables) across questions on the exam, and that answers are uncorrelated across students. Of course, this assumption is unlikely to be true. Since all of our comparisons rely on the relative unusualness of the answers given in different classrooms, this simplifying assumption is not problematic unless the correlation within and across students varies by classroom.



Our final indicator compares the answers that students in one classroom give with the answers of other students in the system who take the identical test and get the exact same score. This measure relies on the fact that questions vary significantly in difficulty. The typical student will answer most of the easy questions correctly but get many of the hard questions wrong (where "easy" and "hard" are based on how well students of similar ability do on the question). If students in a class systematically miss the easy questions while correctly answering the hard questions, this may be an indication of cheating.

Our overall measure of suspicious answer strings is constructed in a manner parallel to our measure of unusual test score fluctuations. Within a given subject, grade, and year, we rank classrooms on each of these four indicators, and then take the sum of squared ranks across the four measures:10

(4)   ANSWERS_{cbt} = (rank\_m1_{cbt})^2 + (rank\_m2_{cbt})^2 + (rank\_m3_{cbt})^2 + (rank\_m4_{cbt})^2

In the empirical work we again use three possible cutoffs for potential cheating: the eightieth, ninetieth, and ninety-fifth percentiles.
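
A sketch of this rank aggregation on synthetic indicator values follows; the column names are illustrative, and the sign flip for Measure 1 (a minimum probability, where lower values are more suspicious) is our assumption about how the four indicators are put on a common "higher = more suspicious" scale:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
# One row per classroom within a single grade-subject-year cell.
df = pd.DataFrame(rng.normal(size=(1000, 4)), columns=["m1", "m2", "m3", "m4"])
df["m1"] = -df["m1"]   # so that a higher rank always means "more suspicious"

ranks = df.rank()                       # rank 1 = least suspicious classroom
df["ANSWERS"] = (ranks ** 2).sum(axis=1)   # equation (4)

# Classrooms flagged at the ninety-fifth percentile cutoff.
flagged = df["ANSWERS"] > df["ANSWERS"].quantile(0.95)
print(int(flagged.sum()))
```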

III. A STRATEGY FOR IDENTIFYING THE PREVALENCE OF CHEATING

The previous section described indicators that are likely to be correlated with cheating. Because such patterns sometimes arise by chance, however, not every classroom with large test score fluctuations and suspicious answer strings is cheating. Furthermore, the particular choice of what qualifies as a "large" fluctuation or a "suspicious" set of answer strings will necessarily be arbitrary. In this section we present an identification strategy that, under a set of defensible assumptions, nonetheless provides estimates of the prevalence of cheating.

10. Because different subjects and grades have differing numbers of questions, it is difficult to make meaningful comparisons across tests on the raw indicators.


To identify the number of cheating classrooms in a given year, we would like to compare the observed joint distribution of test score fluctuations and suspicious answer strings with a counterfactual distribution in which no cheating occurs. Differences between the two distributions would provide an estimate of how much cheating occurred. If teachers in a particular school or year are cheating, there will be more classrooms exhibiting both unusual test score fluctuations and suspicious answer strings than otherwise expected. In practice, we do not have the luxury of observing such a counterfactual. Instead, we must make assumptions about what the patterns would look like absent cheating.

Our identification strategy hinges on three key assumptions: (1) cheating increases the likelihood a class will have both large test score fluctuations and suspicious answer strings; (2) if cheating classrooms had not cheated, their distribution of test score fluctuations and answer string patterns would be identical to noncheating classrooms; and (3) in noncheating classrooms, the correlation between test score fluctuations and suspicious answers is constant throughout the distribution.11

If assumption (1) holds, then cheating classrooms will be concentrated in the upper tail of the joint distribution of unusual test score fluctuations and suspicious answer strings. Other parts of the distribution (e.g., classrooms ranked in the fiftieth to seventy-fifth percentile of suspicious answer strings) will consequently include few cheaters. As long as cheating classrooms would look similar to noncheating classrooms on our measures if they had not cheated (assumption (2)), classes in the part of the distribution with few cheaters provide the noncheating counterfactual that we are seeking.

The difficulty, however, is that we only observe this noncheating counterfactual in the bottom and middle parts of the distribution, when what we really care about is the upper tail of the distribution, where the cheaters are concentrated. Assumption (3), which requires that in noncheating classrooms the correlation between test score fluctuations and suspicious answers is constant throughout the distribution, provides a potential solution. If this assumption holds, we can use the part of the distribution that is relatively free of cheaters to project what the right tail of the distribution would look like absent cheating. The gap between the predicted and observed frequency of classrooms that are extreme on both the test score fluctuation and suspicious answer string measures provides our estimate of cheating.

11. We formally derive the mathematical model described in this section in Jacob and Levitt [2003].



Figure II demonstrates how our identification strategy works empirically. The horizontal axis in the figure ranks classrooms according to how suspicious their answer strings are. The vertical axis is the fraction of the classrooms that are above the ninety-fifth percentile on the unusual test score fluctuations measure.12 The graph combines all classrooms and all subjects in our data.13 Over most of the range (roughly from zero to the seventy-fifth percentile on the horizontal axis), there is a slight positive correlation between unusual test scores and suspicious answer strings.

12. The choice of the ninety-fifth percentile is somewhat arbitrary. In the empirical work that follows, we also consider the eightieth and ninetieth percentiles. The choice of the cutoff does not affect the basic patterns observed in the data.

13. To construct the figure, classes were rank ordered according to their answer strings and divided into 200 equally sized segments. The circles in the figure represent these 200 local means. The line displayed in the graph is the fitted value of a regression with a seventh-order polynomial in a classroom's rank on the suspicious strings measure.

FIGURE II
The Relationship between Unusual Test Scores and Suspicious Answer Strings

The horizontal axis reflects a classroom's percentile rank in the distribution of suspicious answer strings within a given grade, subject, and year, with zero representing the least suspicious classroom and one representing the most suspicious classroom. The vertical axis is the probability that a classroom will be above the ninety-fifth percentile on our measure of unusual test score fluctuations. The circles in the figure represent averages from 200 equally spaced cells along the x-axis. The predicted line is based on a probit model estimated with seventh-order polynomials in the suspicious string measure.


This is the part of the distribution that is unlikely to contain many cheating classrooms. Under the assumption that the correlation between our two measures is constant in noncheating classrooms over the whole distribution, we would predict that the straight line observed for all but the right-hand portion of the graph would continue all the way to the right edge of the graph.

Instead, as one approaches the extreme right tail of the distribution of suspicious answer strings, the probability of large test score fluctuations rises dramatically. That sharp rise, we argue, is a consequence of cheating. To estimate the prevalence of cheating, we simply compare the actual area under the curve in the far right tail of Figure II to the predicted area under the curve in the right tail, projecting the linear relationship observed from the fiftieth to seventy-fifth percentile out to the ninety-ninth percentile. Because this identification strategy is necessarily indirect, we devote a great deal of attention to presenting a wide variety of tests of the validity of our approach, the sensitivity of the results to alternative assumptions, and the plausibility of our findings.
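
A stylized version of this excess-mass calculation is sketched below. For simplicity it projects the flat baseline rate from the 50th-75th percentile range rather than fitting a polynomial, and all data are synthetic, with extra joint extremes planted in the far right tail:

```python
import numpy as np

def prevalence_estimate(str_rank, fluct_flag, tail=0.95, lo=0.50, hi=0.75):
    """Excess-mass estimate of the share of cheating classrooms.

    str_rank:   percentile rank (0-1) on the suspicious answer-string measure
    fluct_flag: 1 if the class is above the cutoff on score fluctuations
    """
    str_rank = np.asarray(str_rank)
    fluct_flag = np.asarray(fluct_flag)
    # Baseline joint rate in a region assumed to contain few cheaters.
    base_rate = fluct_flag[(str_rank >= lo) & (str_rank < hi)].mean()
    in_tail = str_rank >= tail
    # Observed minus predicted counts of jointly extreme classrooms.
    excess = fluct_flag[in_tail].sum() - base_rate * in_tail.sum()
    return excess / len(str_rank)

rng = np.random.default_rng(6)
rank = rng.random(40000)
flag = (rng.random(40000) < 0.05 + 0.45 * (rank > 0.97)).astype(int)
print(prevalence_estimate(rank, flag))
```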

IV. DATA AND INSTITUTIONAL BACKGROUND

Elementary students in Chicago public schools take a standardized multiple-choice achievement exam known as the Iowa Test of Basic Skills (ITBS). The ITBS is a national norm-referenced exam with a reading comprehension section and three separate math sections.14 Third through eighth grade students in Chicago are required to take the exams each year. Most schools administer the exams to first and second grade students as well, although this is not a district mandate.

Our base sample includes all students in third to seventh grade for the years 1993-2000.15 For each student we have the question-by-question answer string on each year's tests, school and classroom identifiers, the full history of prior and future test scores, and demographic variables, including age, sex, race, and free lunch eligibility. We also have information about a wide range of school-level characteristics. We do not, however, have individual teacher identifiers, so we are unable to directly link teachers to classrooms or to track a particular teacher over time.

14. There are also other parts of the test, which are either not included in official school reporting (spelling, punctuation, grammar) or are given only in select grades (science and social studies), for which we do not have information.

15. We also have test scores for eighth graders, but we exclude them because our algorithm requires test score data for the following year, and the ITBS test is not administered to ninth graders.



Because our cheating proxies rely on comparisons to past and future test scores, we drop observations that are missing reading or math scores in either the preceding year or the following year (in addition to those with missing test scores in the baseline year).16 Less than one-half of 1 percent of students with missing demographic data are also excluded from the analysis. Finally, because our algorithms for identifying cheating rely on identifying suspicious patterns within a classroom, our methods have little power in classrooms with small numbers of students. Consequently, we drop all classrooms for which we have fewer than ten valid students in a particular grade after our other exclusions (roughly 3 percent of students). A handful of classrooms recorded as having more than 40 students, presumably multiple classrooms not separately identified in the data, are also dropped. Our final data set contains roughly 20,000 students per grade per year distributed across approximately 1,000 classrooms, for a total of over 40,000 classroom-years of data (with four subject tests per classroom-year) and over 700,000 student-year observations.
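
In pandas terms, the sample restrictions amount to something like the sketch below; the file path and all column names are hypothetical placeholders for the CPS student-year file:

```python
import pandas as pd

df = pd.read_csv("students.csv")   # placeholder path, one row per student-year

# Require reading and math scores in the baseline, prior, and following years.
score_cols = ["read_t", "math_t", "read_prev", "math_prev",
              "read_next", "math_next"]
df = df.dropna(subset=score_cols)

# Drop students with missing demographics.
df = df.dropna(subset=["age", "sex", "race", "free_lunch"])

# Keep classrooms with at least ten and at most forty valid students.
size = df.groupby("classroom_id")["student_id"].transform("size")
df = df[(size >= 10) & (size <= 40)]
```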

Summary statistics for the full sample of classrooms are shown in Table I, where the unit of observation is at the level of class-subject-year. As in many urban school districts, students in Chicago are disproportionately poor and minority. Roughly 60 percent of students are Black and 26 percent are Hispanic, with nearly 85 percent receiving free or reduced-price lunch. Less than 29 percent of students scored at or above national norms in reading.

The ITBS exams are administered over a weeklong period in early May. Third grade teachers are permitted to administer the exam to their own students, while other teachers switch classes to administer the exams. The exams are generally delivered to the schools one to two weeks before testing and are supposed to be kept in a secure location by the principal or the school's test coordinator, an individual in the school designated to coordinate the testing process (often a counselor or administrator).

16. The exact number of students with missing test data varies by year and grade. Overall, we drop roughly 12 percent of students because of missing test score information in the baseline year. These are students who (i) were not tested because of placement in bilingual or special education programs, or (ii) simply missed the exam that particular year. In addition to these students, we also drop approximately 13 percent of students who were missing test scores in either the preceding or following years. Test data may be missing either because a student did not attend school on the days of the test, or because the student transferred into the CPS system in the current year or left the system prior to the next year of testing.



TABLE I
SUMMARY STATISTICS

                                                                   Mean (s.d.)
Classroom characteristics:
  Mixed grade classroom                                               0.073
  Teacher administers exams to her own students (3rd grade)           0.206
  Percent of students who were tested and included in
    official reporting                                                0.883
  Average prior achievement
    (as deviation from year-grade-subject mean)                      -0.004
                                                                     (0.661)
  % Black                                                             0.595
  % Hispanic                                                          0.263
  % Male                                                              0.495
  % Old for grade                                                     0.086
  % Living in foster care                                             0.044
  % Living with nonparental relative                                  0.104
  Cheater, 95th percentile cutoff                                     0.013
School characteristics:
  Average quality of teachers' undergraduate institution
    in the school                                                    -2.550
                                                                     (0.877)
  Percent of teachers who live in Chicago                             0.712
  Percent of teachers who have an MA or a PhD                         0.475
  Percent of teachers who majored in education                        0.712
  Percent of teachers under 30 years of age                           0.114
  Percent of teachers at the school less than 3 years                 0.547
  % students at national norms in reading last year                  28.8
  % students receiving free lunch in school                          84.7
  Predominantly Black school                                          0.522
  Predominantly Hispanic school                                       0.205
  Mobility rate in school                                            28.6
  Attendance rate in school                                          92.6
  School size                                                       722
                                                                    (317)
Accountability policy:
  Social promotion policy                                             0.215
  School probation policy                                             0.127
  Test form offered for the first time                                0.371
Number of observations                                            163,474

The data summarized above include one observation per classroom-year-subject. The sample includes all students in third to seventh grade for the years 1993-2000. We drop observations for the following reasons: (a) missing reading or math scores in either the baseline year, the preceding year, or the following year; (b) missing demographic data; (c) classrooms for which we have fewer than ten valid students in a particular grade, or more than 40 students. See the discussion in the text for a more detailed explanation of the sample, including the percentages excluded for various reasons.


Each section of the exam consists of 30 to 60 multiple-choice questions, which students are given between 30 and 75 minutes to complete.17 Students mark their responses on answer sheets, which are scanned to determine a student's score. There is no penalty for guessing. A student's raw score is simply calculated as the sum of correct responses on the exam. The raw score is then translated into a grade equivalent.

After collecting student exams, teachers or administrators then "clean" the answer keys, erasing stray pencil marks, removing dirt or debris from the form, and darkening item responses that were only faintly marked by the student. At the end of the testing week, the test coordinators at each school deliver the completed answer keys and exams to the CPS central office. School personnel are not supposed to keep copies of the actual exams, although school officials acknowledge that a number of teachers each year do so. The CPS has administered three different versions of the ITBS between 1993 and 2000. The CPS alternates forms each year, with new forms being offered for the first time in 1993, 1994, and 1997.18

V. THE PREVALENCE OF TEACHER CHEATING

As noted earlier, our estimates of the prevalence of cheating are derived from a comparison of the actual number of classrooms that are above some threshold on both of our cheating indicators, relative to the number we would expect to observe based on the correlation between these two measures in the 50th-75th percentiles of the suspicious string measure. The top panel of Table II presents our estimates of the percentage of classrooms that are cheating, on average, on a given subject test (i.e., reading comprehension or one of the three math tests) in a given year.19 Because the decision as to what cutoff signifies a "high" value on our cheating indicators is arbitrary, we present a 3 × 3 matrix of estimates, using three different thresholds (eightieth, ninetieth, and ninety-fifth percentiles) for each of our cheating measures. The estimated prevalence of cheaters ranges from 1.1 percent to 2.1 percent, depending on the particular set of cutoffs used. As would be expected, the number of cheaters is generally declining as higher thresholds are employed. Nonetheless, it is encouraging that over such a wide set of cutoffs the range of estimates is relatively tight.

17. The mathematics and reading tests measure basic skills. The reading comprehension exam consists of three to eight short passages, followed by up to nine questions relating to the passage. The math exam consists of three sections that assess number concepts, problem-solving, and computation.

18. These three forms are used for retesting, summer school testing, and midyear testing as well, so it is likely that over the years teachers have seen the same exam on a number of occasions.

19. It is important to note that we exclude a particular form of cheating that appears to be quite prevalent in the data: teachers randomly filling in answers left blank by students at the end of the exam. In many classrooms, almost every student will end the test with a long string of identical answers (typically all the same response, like "B" or "D"). The fact that almost all students in the class coordinate on the same pattern strongly suggests that the students themselves did not fill in the blanks, or were under explicit instructions by the teacher to do so. Since there is no penalty for guessing on the test, filling in the blanks can only increase student test scores. While this type of teacher behavior is likely to be viewed by many as unethical, we do not make it the focus of our analysis because (1) it is difficult to provide definitive evidence of such behavior (a teacher could argue that he or she instructed students well in advance of the test to fill in all blanks with the letter "C" as part of good test-taking strategy), and (2) in our minds it is categorically different from a teacher who systematically changes student responses to the correct answer.



The bottom panel of Table II presents estimates of the percentage of classrooms that are cheating on any of the four subject tests in a particular year. If every classroom that cheated did so only on one subject test, then the results in the bottom panel would simply be four times the results in the top panel. In many instances, however, classrooms appear to cheat on multiple subjects. Thus, the prevalence rates range from 3.4 to 5.6 percent of all classrooms.20 Because of the necessarily indirect nature of our identification strategy, we explore a range of supplementary tests designed to assess the validity of the estimates.21


TABLE II
ESTIMATED PREVALENCE OF TEACHER CHEATING

                                        Cutoff for test score fluctuations (SCORE)
Cutoff for suspicious answer
strings (ANSWERS)                   80th percentile   90th percentile   95th percentile

Percent cheating on a particular test:
  80th percentile                        2.1               2.1               1.8
  90th percentile                        1.8               1.8               1.5
  95th percentile                        1.3               1.3               1.1

Percent cheating on at least one of the four tests given:
  80th percentile                        4.5               5.6               5.3
  90th percentile                        4.2               4.9               4.4
  95th percentile                        3.5               3.8               3.4

The top panel of the table presents estimates of the percentage of classrooms cheating on a particular subject test in a given year, based on three alternative cutoffs for ANSWERS and SCORE. In all cases, the prevalence of cheating is based on the excess number of classrooms with unexpected test score fluctuations among classes with suspicious answer strings, relative to classes that do not have suspicious answer strings. The bottom panel of the table presents estimates of the percentage of classrooms cheating on at least one of the four subject tests that comprise the overall test. In the bottom panel, classrooms that cheat on more than one subject test are counted only once. Our sample includes over 35,000 3rd-7th grade classrooms in the Chicago public schools for the years 1993-1999.



V.A. Simulation Results

Our prevalence estimates may be biased downward or upward, depending on the extent to which our algorithm fails to successfully identify all instances of cheating ("false negatives") versus the frequency with which we wrongly classify a teacher as cheating ("false positives").

One simple simulation exercise we undertook with respect to possible false positives involves randomly assigning students to hypothetical classrooms. These synthetic classrooms thus consist of students who in actuality had no connection to one another. We then analyze these hypothetical classrooms using the same algorithm applied to the actual data. As one would hope, no evidence of cheating was found in the simulated classes. Indeed, the estimated prevalence of cheating was slightly negative in this simulation (i.e., classrooms with large test score increases in the current year, followed by big declines the next year, were slightly less likely to have unusual patterns of answer strings). Thus, there does not appear to be anything mechanical in our algorithm that generates evidence of cheating.
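
The reassignment step could be sketched as follows, under the assumption that the synthetic classes preserve the observed distribution of class sizes; all identifiers are hypothetical:

```python
import numpy as np

def synthetic_classrooms(student_ids, class_sizes, rng):
    """Randomly reassign students to hypothetical classrooms of given sizes.

    Students in a synthetic class have no real connection to one another,
    so applying the cheating algorithm to them is a false-positive check.
    """
    shuffled = rng.permutation(student_ids)
    cuts = np.cumsum(class_sizes)[:-1]
    return np.split(shuffled, cuts)

rng = np.random.default_rng(7)
sizes = rng.integers(10, 35, 800)     # hypothetical class sizes
ids = np.arange(sizes.sum())          # one record per student
classes = synthetic_classrooms(ids, sizes, rng)
print(len(classes), [len(c) for c in classes[:3]])
```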


20. Computation of the overall prevalence is somewhat complicated because it involves calculating not only how many classrooms are actually above the thresholds on multiple subject tests, but also how frequently this would occur in the absence of cheating. Details on these calculations are available from the authors.

21. In an earlier version of this paper [Jacob and Levitt 2003], we present a number of additional sensitivity analyses that confirm the basic results. First, we test whether our results are sensitive to the way in which we predict the prevalence of high values of ANSWERS and SCORE in the upper part of the other distribution (e.g., fitting a linear or quadratic model to the data in the lower portion of the distribution, versus simply using the average value from the 50th-75th percentiles). We find our results are extremely robust. Second, we examine whether our results might simply be due to mean reversion, by testing whether classrooms with suspicious answer strings are less likely to maintain large test score gains. Among a sample of classrooms with large test score gains, we find that mean reversion is substantially greater among the set of classes with highly suspicious answer strings. Third, we investigate whether we might be detecting student rather than teacher cheating, by examining whether students who have suspicious answer strings in one year are more likely to have suspicious answer strings in other years. We find that they are not.


A second simulation involves artificially altering a classroom's answer strings in ways that we believe mimic teacher cheating or outstanding teaching.22 We are then able to see how frequently we label these altered classes as "cheating," and how this varies with the extent and nature of the changes we make to a classroom's answers. We simulate two different types of teacher cheating. The first is a very naive version, in which a teacher starts cheating at the same question for a number of students and changes consecutive questions to the right answers for these students, creating a block of identical and correct responses. The second type of cheating is much more sophisticated: we randomly change answers from incorrect to correct for selected students in a class. The outcome of this simulation reveals that our methods are surprisingly weak in detecting moderate cheating, even in the naive case. For instance, when three consecutive answers are changed to correct for 25 percent of a class, we catch such behavior only 4 percent of the time. Even when six questions are changed for half of the class, we catch the cheating in less than 60 percent of the cases. Only when the cheating is very extreme (e.g., six answers changed for the entire class) are we very likely to catch it. The explanation for this poor performance is that classrooms with low test-score gains, even with the boost provided by this cheating, do not have large enough test score fluctuations to make them appear unusual, because there is so much inherent volatility in test scores. When the cheating takes the more sophisticated form described above, we are even less successful at catching low and moderate cheaters (2 percent and 34 percent, respectively, in the first two scenarios described above), but we almost always detect very extreme cheating.
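
The naive manipulation could be mimicked as in the sketch below (synthetic data; the function name and parameters are illustrative). The sophisticated variant would instead flip randomly chosen incorrect answers to correct for the selected students.

```python
import numpy as np

def simulate_naive_cheating(answers, key, start, n_questions, share, rng):
    """Overwrite a block of consecutive questions with the correct answers
    for a random subset of students, mimicking the 'naive' cheater.

    answers: (students x items) response matrix; key: correct answer per item
    """
    out = answers.copy()
    n_students = out.shape[0]
    chosen = rng.choice(n_students, size=int(share * n_students),
                        replace=False)
    out[np.ix_(chosen, np.arange(start, start + n_questions))] = \
        key[start:start + n_questions]
    return out

rng = np.random.default_rng(8)
key = rng.integers(0, 4, 40)               # hypothetical answer key
ans = rng.integers(0, 4, (25, 40))         # class of 25 students
altered = simulate_naive_cheating(ans, key, 20, 6, 0.5, rng)
print((altered != ans).sum(), "answers changed")
```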

From a political or equity perspective, however, we may be even more concerned with false positives, specifically the risk of accusing highly effective teachers of cheating. Hence, we also simulate the impact of a good teacher changing certain answers for certain students from incorrect to correct responses. Our manipulation in this case differs from the cheating simulations above in two ways: (1) we allow some of the gains to be preserved in the following year, and (2) we alter questions such that students will not get random answers correct, but rather will tend to show the greatest improvement on the easiest questions that they had been getting wrong. These simulation results suggest that only rarely are good teachers mistakenly labeled as cheaters.

22. See Jacob and Levitt [2003] for the precise details of this simulation exercise.


For instance, a teacher whose prowess allows him or her to raise test scores by six questions for half the students in the class (more than a 0.5 grade equivalent increase, on average, for the class) is labeled a cheater only 2 percent of the time. In contrast, a naive cheater with the same test score gains is correctly identified as a cheater in more than 50 percent of cases.

In another test for false positives, we explored whether we might be mistaking emphasis on certain subject material for cheating. For example, if a math teacher spends several months on fractions with a particular class, one would expect the class to do particularly well on all of the math questions relating to fractions, and perhaps worse than average on other math questions. One might imagine a similar scenario in reading if, for example, a teacher creates a lengthy unit on the Underground Railroad, which later happens to appear in a passage on the reading comprehension exam. To examine this, we analyze the nature of the across-question correlations in cheating versus noncheating classrooms. We find that the most highly correlated questions in cheating classrooms are no more likely to measure the same skill (e.g., fractions, geometry) than in noncheating classrooms. The implication of these results is that the prevalence estimates presented above are likely to substantially understate the true extent of cheating: there is little evidence of false positives, but we frequently miss moderate cases of cheating.

V.B. The Correlation of Cheating Indicators across Subjects, Classrooms, and Years

If what we are detecting is truly cheating, then one would expect that a teacher who cheats on one part of the test would be more likely to cheat on other parts of the test. Also, a teacher who cheats one year would be more likely to cheat the following year. Finally, to the extent that cheating is either condoned by the principal or carried out by the test coordinator, one would expect to find multiple classes in a school cheating in any given year, and perhaps even that cheating in a school one year predicts cheating in future years. If what we are detecting is not cheating, then one would not necessarily expect to find strong correlation in our cheating indicators across exams for a specific classroom, across classrooms, or across years.


Table III reports regression results testing these predictions. The dependent variable is an indicator for whether we believe a classroom is likely to be cheating on a particular subject test, using our most stringent definition (above the ninety-fifth percentile on both cheating indicators).

TABLE III
PATTERNS OF CHEATING WITHIN CLASSROOMS AND SCHOOLS

Dependent variable = Class suspected of cheating (mean of the dependent variable = 0.011).
Columns (1) and (2) use the full sample; columns (3) and (4) use the sample of classes and schools that existed in the prior year.

Independent variables                                     (1)        (2)        (3)        (4)
Classroom cheated on exactly one other subject
  this year                                              0.105      0.103      0.101      0.101
                                                        (0.008)    (0.008)    (0.009)    (0.009)
Classroom cheated on exactly two other subjects
  this year                                              0.289      0.285      0.243      0.243
                                                        (0.027)    (0.027)    (0.031)    (0.031)
Classroom cheated on all three other subjects
  this year                                              0.627      0.622      0.595      0.595
                                                        (0.051)    (0.051)    (0.054)    (0.054)
Cheating rate among all other classes in the
  school this year on this subject                                  0.166      0.134      0.129
                                                                   (0.030)    (0.027)    (0.027)
Cheating rate among all other classes in the
  school this year on other subjects                                0.023      0.059      0.045
                                                                   (0.024)    (0.026)    (0.029)
Cheating in this classroom in this subject
  last year                                                                    0.096      0.091
                                                                              (0.012)    (0.012)
Number of other subjects this classroom
  cheated on last year                                                         0.023      0.018
                                                                              (0.004)    (0.004)
Cheating in this classroom ever in the past                                               0.006
                                                                                         (0.002)
Cheating rate among other classrooms in this
  school in past years                                                                    0.090
                                                                                         (0.040)
Fixed effects for grade × subject × year                  Yes        Yes        Yes        Yes
R²                                                       0.090      0.093      0.109      0.109
Number of observations                                  165,578    165,578     94,182     94,170

Notes: The dependent variable is an indicator for whether a classroom is above the ninety-fifth percentile on both our suspicious strings and unusual test score measures of cheating on a particular subject test. Estimation is done using a linear probability model. The unit of observation is classroom-grade-year-subject. For columns that include measures of cheating in prior years, observations where that classroom or school does not appear in the data in the prior year are excluded. Standard errors are clustered at the school level to take into account correlations across classrooms as well as serial correlation.


The baseline probability of qualifying as a cheater for this cutoff is 1.1 percent. To fully appreciate the magnitude of the effects implied by the table, it is important to keep this very low baseline in mind. We report estimates from linear probability models (probits yield similar marginal effects), with standard errors clustered at the school level.
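For readers who want to see the mechanics, a specification of this form can be estimated as a linear probability model with school-level clustering using standard tools. The sketch below is a minimal illustration with hypothetical variable and file names, not the authors' code.

    import pandas as pd
    import statsmodels.formula.api as smf

    # One row per classroom-grade-year-subject, with hypothetical columns:
    # cheat (0/1), indicators for cheating on one/two/three other subjects
    # this year, and school_id for clustering.
    df = pd.read_csv("classrooms.csv")  # placeholder data source

    lpm = smf.ols(
        "cheat ~ cheat_one_other + cheat_two_other + cheat_three_other",
        data=df,
    ).fit(cov_type="cluster", cov_kwds={"groups": df["school_id"]})
    print(lpm.summary())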

Column 1 of Table III shows that cheating on other tests in the same year is an extremely powerful predictor of cheating in a different subject. If a classroom cheats on exactly one other subject test, the predicted probability of cheating on this test increases by over ten percentage points. Since the baseline cheating rates are only 1.1 percent, classrooms cheating on exactly one other test are ten times more likely to have cheated on this subject than are classrooms that did not cheat on any of the other subjects (which is the omitted category). Classrooms that cheat on two other subjects are almost 30 times more likely to cheat on this test, relative to those not cheating on other tests. If a class cheats on all three other subjects, it is 50 times more likely to also cheat on this test.

There also is evidence of correlation in cheating within schools. A ten-percentage-point increase in cheating classrooms in a school (excluding the classroom in question) on the same subject test raises the likelihood this class cheats by roughly 0.16 percentage points. This potentially suggests some role for centralized cheating by a school counselor, test coordinator, or the principal, rather than by teachers operating independently. There is little evidence that cheating rates within the school on other subject tests affect cheating on this test.

When making comparisons across years (columns 3 and 4), it is important to note that we do not actually have teacher identifiers. We do, however, know what classroom a student is assigned to. Thus, we can only compare the correlation between past and current cheating in a given classroom. To the extent that teacher turnover occurs or teachers switch classrooms, this proxy will be contaminated. Even given this important limitation, cheating in the classroom last year predicts cheating this year. In column 3, for example, we see that classrooms that cheated in the same subject last year are 9.6 percentage points more likely to cheat this year, even after we control for cheating on other subjects in the same year and cheating in other classes in the school. Column 4 shows that prior cheating in the school strongly predicts the likelihood that a classroom will cheat this year.


V.C. Evidence from Retesting under Controlled Circumstances

Perhaps the most compelling evidence for our methods comes from the results of a unique policy intervention. In spring 2002 the Chicago public schools provided us the opportunity to retest over 100 classrooms under controlled circumstances that precluded cheating, a few weeks after the initial ITBS exam was administered.23 Based on suspicious answer strings and unusual test score gains on the initial spring 2002 exam, we identified a set of classrooms most likely to have cheated. In addition, we chose a second group of classrooms that had suspicious answer strings but did not have large test score gains, a pattern that is consistent with a bad teacher who attempts to mask a poor teaching performance by cheating. We also retested two sets of classrooms as control groups. The first control group consisted of classes with large test score gains but no evidence of cheating in their answer strings, a potential indication of effective teaching. These classes would be expected to maintain their gains when retested, subject to possible mean reversion. The second control group consisted of randomly selected classrooms.

The results of the retests are reported in Table IV. Columns 1–3 correspond to reading; columns 4–6 are for math. For each subject we report the test score gain (in standard score units) for three periods: (i) from spring 2001 to the initial spring 2002 test; (ii) from the initial spring 2002 test to the retest a few weeks later; and (iii) from spring 2001 to the spring 2002 retest. For purposes of comparison, we also report the average gain for all classrooms in the system for the first of those measures (the other two measures are unavailable unless a classroom was retested).

Classrooms prospectively identified as "most likely to have cheated" experienced gains on the initial spring 2002 test that were nearly twice as large as the typical CPS classroom (see columns 1 and 4). On the retest, however, those excess gains completely disappeared: the gains between spring 2001 and the spring 2002 retest were close to the systemwide average. Those classrooms prospectively identified as "bad teachers likely to have cheated" also experienced large score declines on the retest (-8.8 and -10.5 standard score points, or roughly two-thirds of a grade equivalent).

23. Full details of this policy experiment are reported in Jacob and Levitt [2003]. The results of the retests were used by the CPS to initiate disciplinary action against a number of teachers, principals, and staff.


TABLE IV
RESULTS OF RETESTS: COMPARISON OF RESULTS FOR SPRING 2002 ITBS AND AUDIT TEST

                                        Reading gains between         Math gains between
                                      Spring    Spring    Spring    Spring    Spring    Spring
                                       2001      2001      2002      2001      2001      2002
                                        and       and       and       and       and       and
                                      spring     2002      2002     spring     2002      2002
Category of classroom                  2002     retest    retest     2002     retest    retest

All classrooms in the Chicago
  public schools                       14.3       —         —        16.9       —         —
Classrooms prospectively identified
  as most likely cheaters (N = 36
  on math, N = 39 on reading)          28.8      12.6     -16.2      30.0      19.3     -10.7
Classrooms prospectively identified
  as bad teachers suspected of
  cheating (N = 16 on math,
  N = 20 on reading)                   16.6       7.8      -8.8      17.3       6.8     -10.5
Classrooms prospectively identified
  as good teachers who did not
  cheat (N = 17 on math, N = 17
  on reading)                          20.6      21.1       0.5      28.8      25.5      -3.3
Randomly selected classrooms
  (N = 24 overall, but only one
  test per classroom)                  14.5      12.2      -2.3      14.5      12.2      -2.3

Notes: Results in this table compare test scores at three points in time (spring 2001, spring 2002, and a retest administered under controlled conditions a few weeks after the initial spring 2002 test). The unit of observation is a classroom-subject test pair. Only students taking all three tests are included in the calculations. Because of limited data, math and reading results for the randomly selected classrooms are combined. Only the first column for each subject is available for all CPS classrooms, since audits were performed only on a subset of classrooms. All entries in the table are in standard score units.


Indeed, comparing the spring 2001 results with the spring 2002 retest, children in these classrooms had gains only half as large as the typical yearly CPS gain. In stark contrast, classrooms prospectively identified as having "good teachers" (i.e., classrooms with large gains but not suspicious answer strings) actually scored even higher on the reading retest than they did on the initial test. The math scores fell slightly on retesting, but these classrooms continued to experience extremely large gains. The randomly selected classrooms also maintained almost all of their gains when retested, as would be expected. In summary, our methods proved to be extremely effective, both in identifying likely cheaters and in correctly classifying teachers who achieved large test score gains in a legitimate manner.

VI. DOES TEACHER CHEATING RESPOND TO INCENTIVES?

From the perspective of economics, perhaps the most interesting question related to teacher cheating is the degree to which it responds to incentives. As noted in the Introduction, there were two major changes in the incentives faced by teachers and students over our sample period. Prior to 1996, ITBS scores were primarily used to provide teachers and parents with a sense of how a child was progressing academically. Beginning in 1996, with the appointment of Paul Vallas as CEO of Schools, the CPS launched an initiative designed to hold students and teachers accountable for student learning.

The reform had two main elements. The first involved putting schools "on probation" if less than 15 percent of students scored at or above national norms on the ITBS reading exams. The CPS did not use math performance to determine probation status. Probation schools that do not exhibit sufficient improvement may be reconstituted, a procedure that involves closing the school and dismissing or reassigning all of the school staff.24 It is clear from our discussions with teachers and administrators that being on probation is viewed as an extremely undesirable circumstance. The second piece of the accountability reform was an end to social promotion, the practice of passing students to the next grade regardless of their academic skills or school performance.

24. For a more detailed analysis of the probation policy, see Jacob [2002] and Jacob and Lefgren [forthcoming].


Under the new policy, students in third, sixth, and eighth grade must meet minimum standards on the ITBS in both reading and mathematics in order to be promoted to the next grade. The promotion standards were implemented in spring 1997 for third and sixth graders. Promotion decisions are based solely on scores in reading comprehension and mathematics.

Table V presents OLS estimates of the relationship between teacher cheating and a variety of classroom and school characteristics.25 The unit of observation is a classroom-subject-grade-year. The dependent variable is an indicator of whether we suspect the classroom cheated, using our most stringent definition: a classroom is designated a cheater if both its test score fluctuations and the suspiciousness of its answer strings are above the ninety-fifth percentile for that grade, subject, and year.26
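In code, this classification rule amounts to a joint percentile cutoff applied within each grade-subject-year cell. The sketch below, with hypothetical column names, shows one way to construct the indicator.

    import pandas as pd

    def flag_cheaters(df, cutoff=0.95):
        """Flag classrooms above the cutoff percentile on BOTH the unusual
        test score fluctuation measure and the suspicious answer string
        measure, ranking within grade-subject-year cells."""
        cells = df.groupby(["grade", "subject", "year"])
        score_pct = cells["score_fluct"].rank(pct=True)
        answer_pct = cells["answers_susp"].rank(pct=True)
        return ((score_pct > cutoff) & (answer_pct > cutoff)).astype(int)

    # Usage on a classroom-level frame: df["cheat95"] = flag_cheaters(df)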

In column (1) the policy changes are restricted to have a constant impact across all classrooms. The introduction of the social promotion and probation policies is positively correlated with the likelihood of classroom cheating, although the point estimates are not statistically significant at conventional levels. Much more interesting results emerge when we interact the policy changes with the previous year's test scores for the classroom. For both probation and social promotion, cheating rates in the lowest performing classrooms prove to be quite responsive to the change in incentives. In column (2) we see that a classroom one standard deviation below the mean increased its likelihood of cheating by 0.43 percentage points in response to the school probation policy, and roughly 0.65 percentage points due to the ending of social promotion. Given a baseline rate of cheating of 1.1 percent, these effects are substantial. The magnitude of these changes is particularly large considering that no elementary school on probation was actually reconstituted during this period, and that the social promotion policy has a direct impact on students but no direct ramifications for teacher pay or rewards. A classroom one standard deviation above the mean does not see any significant change in cheating in response to these two policies. Such classes are very unlikely to be located in schools at risk for being put on probation, and also are likely to have few students at risk for being retained.

25. Logit and probit models evaluated at the mean yield comparable results, so the estimates from a linear probability model are presented for ease of interpretation.

26. The results are not sensitive to the cheating cutoff used. Note that this measure may include error due to both false positives and negatives. Since the measurement error is in the dependent variable, the primary impact will likely be to simply decrease the precision of our estimates.


TABLE V
OLS ESTIMATES OF THE RELATIONSHIP BETWEEN CHEATING AND CLASSROOM CHARACTERISTICS

Dependent variable = Indicator of classroom cheating

Independent variables                                    (1)        (2)        (3)        (4)
Social promotion policy                                0.0011     0.0011     0.0015     0.0023
                                                      (0.0013)   (0.0013)   (0.0013)   (0.0009)
School probation policy                                0.0020     0.0019     0.0021     0.0029
                                                      (0.0014)   (0.0014)   (0.0014)   (0.0013)
Prior classroom achievement                           -0.0047    -0.0028    -0.0016    -0.0028
                                                      (0.0005)   (0.0005)   (0.0007)   (0.0007)
Social promotion × classroom achievement                         -0.0049    -0.0051    -0.0046
                                                                 (0.0014)   (0.0014)   (0.0012)
School probation × classroom achievement                         -0.0070    -0.0070    -0.0064
                                                                 (0.0013)   (0.0013)   (0.0013)
Mixed grade classroom                                 -0.0084    -0.0085    -0.0089    -0.0089
                                                      (0.0007)   (0.0007)   (0.0008)   (0.0012)
% of students included in official reporting           0.0252     0.0249     0.0141     0.0131
                                                      (0.0031)   (0.0031)   (0.0037)   (0.0037)
Teacher administers exam to own students               0.0067     0.0067     0.0066     0.0061
                                                      (0.0015)   (0.0015)   (0.0015)   (0.0011)
Test form offered for the first time (a)              -0.0007    -0.0007    -0.0011       —
                                                      (0.0011)   (0.0011)   (0.0010)
Average quality of teachers' undergraduate
  institution                                                               -0.0026
                                                                            (0.0007)
Percent of teachers who have worked at the
  school less than 3 years                                                  -0.0045
                                                                            (0.0031)
Percent of teachers under 30 years of age                                    0.0156
                                                                            (0.0065)
Percent of students in the school meeting
  national norms in reading last year                                        0.0001
                                                                            (0.0000)
Percent free lunch in school                                                 0.0001
                                                                            (0.0000)
Predominantly Black school                                                   0.0068
                                                                            (0.0019)
Predominantly Hispanic school                                               -0.0009
                                                                            (0.0016)
School-year fixed effects                                No         No         No        Yes
Number of observations                                163,474    163,474    163,474    163,474

Notes: The unit of observation is classroom-grade-year-subject, and the sample includes eight years (1993 to 2000), four subjects (reading comprehension and three math sections), and five grades (three to seven). The dependent variable is the cheating indicator derived using the ninety-fifth percentile cutoff. Robust standard errors, clustered by school-year, are shown in parentheses. Other variables included in the regressions in columns (1) and (2) are a linear time trend, cubic terms for the number of students, a linear grade variable, and fixed effects for subjects. The regression shown in column (3) also includes the following variables: indicators of the percent of students in the classroom who were Black, Hispanic, male, receiving free lunch, old for grade, in a special education program, living in foster care, and living with a nonparental relative; indicators of school size, mobility rate, and attendance rate; and the percent of teachers in the school who had a master's or doctoral degree, lived in Chicago, and were education majors.

(a) Test forms vary only by year, so this variable will drop out of the analysis when school-year fixed effects are included.


These findings are robust to adding a number of classroom and school characteristics (column (3)), as well as school-year fixed effects (column (4)).
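Because the policies are interacted with prior classroom achievement, the implied effect of a policy for a given classroom is the policy main effect plus the interaction coefficient times the classroom's achievement. The sketch below shows this arithmetic with the column (2) point estimates; the scaling of the achievement input is our assumption, so the illustrative output is not meant to reproduce the exact percentage point effects quoted in the text.

    def policy_effect(b_policy, b_interaction, achievement):
        """Implied change in cheating probability for a classroom whose
        prior achievement is the given deviation from the mean."""
        return b_policy + b_interaction * achievement

    # Column (2) point estimates from Table V.
    b_promo, b_promo_x = 0.0011, -0.0049
    b_prob, b_prob_x = 0.0019, -0.0070

    low = -1.0  # one standard deviation below the mean (assumed scale)
    print(policy_effect(b_promo, b_promo_x, low))  # ending social promotion
    print(policy_effect(b_prob, b_prob_x, low))    # school probation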

Cheating also appears to be responsive to other costs and benefits. Classrooms that tested poorly last year are much more likely to cheat. For example, a classroom with average student prior achievement one classroom standard deviation below the mean is 23 percent more likely to cheat. Classrooms with students in multiple grades are 65 percent less likely to cheat than classrooms where all students are in the same grade. This is consistent with the fact that it is likely more difficult for teachers in such classrooms to cheat, since they must administer two different test forms to students, which will necessarily have different correct answers. Moreover, classes with a higher proportion of students who are included in the official test reporting are more likely to cheat: a 10 percentage point increase in the proportion of students in a class whose test scores "count" will increase the likelihood of cheating by roughly 20 percent.27

Teachers who administer the exam to their own students are 0.67 percentage points (approximately 50 percent) more likely to cheat.28

A few other notable results emerge. First, there is no statistically significant impact on cheating of reusing a test form that has been administered in a previous year. That finding is of interest because it suggests that teachers taking old exams and teaching the precise questions to students is not an important component of what we are detecting as cheating (although anecdotal evidence suggests this practice exists). Classrooms in schools with lower achievement, higher poverty rates, and more Black students are more likely to cheat. Classrooms in schools with teachers who graduated from more prestigious undergraduate institutions are less likely to cheat, whereas classrooms in schools with younger teachers are more likely to cheat.

27. Consistent with this finding, within cheating classrooms teachers are less likely to alter the answer sheets for students who are excluded from test reporting. Teachers also appear to cheat less frequently for boys and for older students.

28. In an earlier version of the paper [Jacob and Levitt 2002], we examine for which students teachers tend to cheat. We find that teachers are most likely to cheat for students in the second quartile of the national ability distribution (twenty-fifth to fiftieth percentile) in the previous year, consistent with the incentives under the probation policy.


VII. CONCLUSION

This paper develops an algorithm for determining the prevalence of teacher cheating on standardized tests and applies the results to data from the Chicago public schools. Our methods reveal thousands of instances of classroom cheating, representing 4–5 percent of the classrooms each year. Moreover, we find that teacher cheating appears quite responsive to relatively minor changes in incentives.

The obvious benefits of high-stakes tests as a means of providing incentives must therefore be weighed against the possible distortions that these measures induce. Explicit cheating is merely one form of distortion. Indeed, the kind of cheating we focus on is one that could be greatly reduced at a relatively low cost through the implementation of proper safeguards, such as is done by the Educational Testing Service on the SAT or GRE exams.29

Even if this particular form of cheating were eliminated, however, there are a virtually infinite number of dimensions along which strategic school personnel can act to game the current system. For instance, Figlio and Winicki [2001] and Rohlfs [2002] document that schools attempt to increase the caloric intake of students on test days, and disruptive students are especially likely to be suspended from school during official testing periods [Figlio 2002]. It may be prohibitively costly or even impossible to completely safeguard the system.

Finally, this paper fits into a small but growing body of research focused on identifying corrupt or illicit behavior on the part of economic actors (see Porter and Zona [1993], Fisman [2001], Di Tella and Schargrodsky [2001], and Duggan and Levitt [2002]). Because individuals engaged in such behavior actively attempt to hide their trails, the intellectual exercise associated with uncovering their misdeeds differs substantially from the typical economic application, in which the researcher starts with a well-defined measure of the outcome variable (e.g., earnings, economic growth, profits) and then attempts to uncover the determinants of these outcomes. In the case of corruption, there is typically no clear outcome variable, making it necessary for the researcher to employ nonstandard approaches in generating such a measure.

29. Although even tests administered by the Educational Testing Service are not immune to cheating, as evidenced by recent reports that many of the GRE scores from China are fraudulent [Contreras 2002].



APPENDIX: THE CONSTRUCTION OF SUSPICIOUS STRING MEASURES

This appendix describes in greater detail how we construct each of our measures of unexpected or suspicious responses.

The first measure focuses on the most unlikely block of identical answers given on consecutive questions. Using past test scores, future test scores, and background characteristics, we predict the likelihood that each student will give each answer on each question. For each item a student has four choices (A, B, C, or D), only one of which is correct. We estimate a multinomial logit for each item on the exam in order to predict how students will respond to each question. We estimate the following model for each item, using information from other students in that year, grade, and subject:

(1)    \Pr(Y_{isc} = j) = \frac{e^{\beta_j X_s}}{\sum_{j=1}^{J} e^{\beta_j X_s}},

where Y_{isc} indicates the response of student s in class c on item i, the number of possible responses (J) is four, and X_s is a vector that includes measures of prior and future student achievement in math and reading, as well as demographic variables (such as race, gender, and free lunch status) for student s. Thus, a student's predicted probability of choosing a particular response is identified by the likelihood of other students (in the same year, grade, and subject) with similar background characteristics choosing that response.

Notice that by including future as well as prior test scores in the model, we decrease the likelihood that students with unusually good teachers will be identified as cheaters, since these students will likely retain some of the knowledge learned in the base year and thus have higher future test scores. Also note that by estimating the probability of selecting each possible response, rather than simply estimating the probability of choosing the correct response, we take advantage of any additional information that is provided by particular response patterns in a classroom.
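A per-item multinomial logit of this form can be sketched with standard tools (our illustration; the paper does not specify the software used). Here X stacks prior and future achievement with the demographic controls, and y codes the four responses.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def fit_item_model(X, y):
        """Fit a multinomial logit for one item and return an
        n_students x 4 matrix of predicted response probabilities."""
        model = LogisticRegression(max_iter=1000)
        model.fit(X, y)
        return model.predict_proba(X)

    # Synthetic example: 500 students, 6 covariates, responses coded 0-3.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(500, 6))
    y = rng.integers(0, 4, size=500)
    probs = fit_item_model(X, y)  # probs[s, j] = Pr(student s picks response j)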

Using the estimates from this model, we calculate the predicted probability that each student would answer each item in the way that he or she in fact did:

(2)    p_{isc} = \frac{e^{\beta_k X_s}}{\sum_{j=1}^{J} e^{\beta_j X_s}},  for k = the response actually chosen by student s on item i.

This provides us with one measure per student per item. Taking the product over items within student, we calculate the probability that a student would have answered a string of consecutive questions from item m to item n as he or she did:

(3)    p_{sc}^{mn} = \prod_{i=m}^{n} p_{isc}

We then take the product across all students in the classroom who had identical responses in the string. If we define z as a student, S_{zc}^{mn} as the string of responses for student z from item m to item n, and S_{sc}^{mn} as the string for student s, then we can express the product as

(4)    \bar{p}_{sc}^{mn} = \prod_{s \in \{z : S_{zc}^{mn} = S_{sc}^{mn}\}} p_{sc}^{mn}

Note that if there are n_s students in class c and each student has a unique set of responses to these particular items, then \bar{p}_{sc}^{mn} collapses to p_{sc}^{mn} for each student, and there will be n_s distinct values within the class. On the other extreme, if all of the students in class c have identical responses, then there is only one distinct value of \bar{p}_{sc}^{mn}. We repeat this calculation for all possible consecutive strings of length three to seven, that is, for all S^{mn} such that 3 \leq n - m + 1 \leq 7. To create our first indicator of suspicious string patterns, we take the minimum of the predicted block probability for each classroom:

MEASURE 1. M1_c = \min_s (\bar{p}_{sc}^{mn})

This measure captures the least likely block of identical answers given on consecutive questions in the classroom.
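Given a matrix p[s, i] holding each student's predicted probability of giving the answer actually observed, Measure 1 can be computed by brute force over all windows of length three to seven. This is a sketch under our reading of equations (3) and (4), not the authors' code.

    import numpy as np

    def measure_1(p, responses, min_len=3, max_len=7):
        """Return the probability of the least likely block of identical
        answers on consecutive items: group students with the same response
        string in each window, multiply their string probabilities (eq. (4)),
        and take the minimum over all windows and groups."""
        n_students, n_items = p.shape
        best = 1.0
        for length in range(min_len, max_len + 1):
            for m in range(n_items - length + 1):
                window = slice(m, m + length)
                string_prob = p[:, window].prod(axis=1)  # eq. (3), per student
                groups = {}
                for s in range(n_students):
                    key = tuple(responses[s, window])
                    groups[key] = groups.get(key, 1.0) * string_prob[s]
                best = min(best, min(groups.values()))
        return best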

The second measure of suspicious answer strings is intended to capture more general patterns of similarity in student responses. To construct this measure, we first calculate the residuals for each of the possible choices a student could have made for each item:

(5)    e_{jisc} = 0 - \frac{e^{\beta_j X_s}}{\sum_{j=1}^{J} e^{\beta_j X_s}}  if j \neq k,
       e_{jisc} = 1 - \frac{e^{\beta_j X_s}}{\sum_{j=1}^{J} e^{\beta_j X_s}}  if j = k,

where e_{jisc} is the residual for response j on item i by student s in classroom c. We thus have four separate residuals per student per item.

To create a classroom-level measure of the response to item i, we need to combine the information for each student. First, we sum the residuals for each response across students within a classroom:

(6)    \bar{e}_{jic} = \sum_s e_{jisc}

If there is no within-class correlation in the way that students responded to a particular item, this term should be approximately zero. Second, we sum across the four possible responses for each item within classrooms. At the same time, we square each of the component residual measures to accentuate outliers, and divide by the number of students in the class (n_{sc}) to normalize by class size:

(7)    v_{ic} = \frac{\sum_j \bar{e}_{jic}^2}{n_{sc}}

The statistic v_{ic} captures something like the variance of student responses on item i within classroom c. Notice that we choose to first sum the residuals of each response across students, and then sum the classroom-level measures for each response, rather than summing across responses within student initially. We do this in order to emphasize the classroom-level tendencies in response patterns.

Our second measure of suspicious strings is simply the classroom average of this variance term across all test items:

MEASURE 2. M2_c = \bar{v}_c = \frac{\sum_i v_{ic}}{n_i}, where n_i is the number of items on the exam.

Our third measure is simply the variance (as opposed to the mean) in the degree of correlation across questions within a classroom:

MEASURE 3. M3_c = \sigma_{v_c}^2 = \frac{\sum_i (v_{ic} - \bar{v}_c)^2}{n_i}
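Measures 2 and 3 then follow mechanically from the item-level statistic v_ic. A minimal sketch, assuming resid holds the equation (5) residuals with shape (students, items, choices):

    import numpy as np

    def measures_2_and_3(resid):
        """Return (M2, M3): the mean and variance across items of the
        within-classroom 'variance' statistic v_ic of equation (7)."""
        n_students = resid.shape[0]
        e_bar = resid.sum(axis=0)                  # eq. (6): (items, choices)
        v = (e_bar ** 2).sum(axis=1) / n_students  # eq. (7): one value per item
        return v.mean(), v.var()                   # Measures 2 and 3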

Our final indicator focuses on the extent to which a student's response pattern was different from those of other students with the same aggregate score that year. Let q_{isc} equal one if student s in classroom c answered item i correctly, and zero otherwise. Let A_s equal the aggregate score of student s on the exam. We then determine what fraction of students at each aggregate score level answered each item correctly. If we let n_{sA} equal the number of students with an aggregate score of A, then this fraction, \bar{q}_i^A, can be expressed as

(8)    \bar{q}_i^A = \frac{\sum_{s \in \{z : A_z = A_s\}} q_{isc}}{n_{sA}}

We then calculate a measure of how much the response pattern of student s differed from the response pattern of other students with the same aggregate score. We do so by subtracting a student's answer on item i from the mean response of all students with aggregate score A, squaring these deviations, and then summing across all items on the exam:

(9)    Z_{sc} = \sum_i (q_{isc} - \bar{q}_i^A)^2

We then subtract out the mean deviation for all students with the same aggregate score, \bar{Z}_A, and sum across the students within each classroom to obtain our final indicator:

MEASURE 4. M4_c = \sum_s (Z_{sc} - \bar{Z}_A)
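Measure 4 can likewise be assembled from the systemwide matrix of correct/incorrect indicators. In this sketch the inputs (q and the classroom assignment vector) are hypothetical, and students at each aggregate score are handled by simple pooling.

    import numpy as np

    def measure_4(q, class_id):
        """q[s, i] = 1 if student s answered item i correctly; class_id[s]
        gives s's classroom. Deviations are taken from the mean response
        pattern of all students with the same aggregate score (eqs. (8)-(9)),
        demeaned by score group, then summed within classrooms."""
        total = q.sum(axis=1)                  # aggregate score A_s
        Z = np.zeros(q.shape[0])
        for A in np.unique(total):
            same = total == A
            q_bar = q[same].mean(axis=0)       # eq. (8): fraction correct per item
            dev = ((q[same] - q_bar) ** 2).sum(axis=1)  # eq. (9)
            Z[same] = dev - dev.mean()         # subtract mean deviation for score A
        return {c: Z[class_id == c].sum() for c in np.unique(class_id)}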

KENNEDY SCHOOL OF GOVERNMENT, HARVARD UNIVERSITY

UNIVERSITY OF CHICAGO AND AMERICAN BAR FOUNDATION

REFERENCES

Aiken, Linda R., "Detecting, Understanding, and Controlling for Cheating on Tests," Research in Higher Education, XXXII (1991), 725–736.
Angoff, William H., "The Development of Statistical Indices for Detecting Cheaters," Journal of the American Statistical Association, LXIX (1974), 44–49.
Baker, George, "Incentive Contracts and Performance Measurement," Journal of Political Economy, C (1992), 598–614.
Bellezza, Francis S., and Suzanne F. Bellezza, "Detection of Cheating on Multiple-Choice Tests by Using Error Similarity Analysis," Teaching of Psychology, XVI (1989), 151–155.
Carnoy, Martin, and Susanna Loeb, "Does External Accountability Affect Student Outcomes? A Cross-State Analysis," Working Paper, Stanford University School of Education, 2002.
Contreras, Russell, "Inquiry Uncovers Possible Cheating on GRE in Asia," Associated Press, August 7, 2002.
Cullen, Julie Berry, and Randall Reback, "Tinkering Toward Accolades: School Gaming under a Performance Accountability System," Working Paper, University of Michigan, 2002.
Deere, Donald, and Wayne Strayer, "Putting Schools to the Test: School Accountability, Incentives and Behavior," Working Paper, Texas A&M University, 2002.
Di Tella, Rafael, and Ernesto Schargrodsky, "The Role of Wages and Auditing during a Crackdown on Corruption in the City of Buenos Aires," unpublished manuscript, Harvard Business School, 2001.
Duggan, Mark, and Steven Levitt, "Winning Isn't Everything: Corruption in Sumo Wrestling," American Economic Review, XCII (2002), 1594–1605.
Figlio, David N., "Testing, Crime and Punishment," Working Paper, University of Florida, 2002.
Figlio, David N., and Joshua Winicki, "Food for Thought: The Effects of School Accountability Plans on School Nutrition," Working Paper, University of Florida, 2001.
Figlio, David N., and Lawrence S. Getzler, "Accountability, Ability and Disability: Gaming the System," Working Paper, University of Florida, 2002.
Fisman, Ray, "Estimating the Value of Political Connections," American Economic Review, XCI (2001), 1095–1102.
Frary, Robert B., "Statistical Detection of Multiple-Choice Answer Copying: Review and Commentary," Applied Measurement in Education, VI (1993), 153–165.
Frary, Robert B., T. Nicholas Tideman, and Thomas Watts, "Indices of Cheating on Multiple-Choice Tests," Journal of Educational Statistics, II (1977), 235–256.
Glaeser, Edward, and Andrei Shleifer, "A Reason for Quantity Regulation," American Economic Review, XCI (2001), 431–435.
Grissmer, David W., Ann Flanagan, et al., "Improving Student Achievement: What NAEP Test Scores Tell Us," RAND Corporation, MR-924-EDU, 2000.
Hanushek, Eric A., and Margaret E. Raymond, "Improving Educational Quality: How Best to Evaluate Our Schools," Conference paper, Federal Reserve Bank of Boston, 2002.
Hofkins, Diane, "Cheating 'Rife' in National Tests," New York Times Educational Supplement, June 16, 1995.
Holland, Paul W., "Assessing Unusual Agreement between the Incorrect Answers of Two Examinees Using the K-Index: Statistical Theory and Empirical Support," Technical Report No. 96-5, Educational Testing Service, 1996.
Holmstrom, Bengt, and Paul Milgrom, "Multitask Principal-Agent Analyses: Incentive Contracts, Asset Ownership, and Job Design," Journal of Law, Economics, and Organization, VII (1991), 24–51.
Jacob, Brian A., "Accountability, Incentives and Behavior: The Impact of High-Stakes Testing in the Chicago Public Schools," Working Paper No. 8968, National Bureau of Economic Research, 2002.
Jacob, Brian A., and Lars Lefgren, "The Impact of Teacher Training on Student Achievement: Quasi-Experimental Evidence from School Reform Efforts in Chicago," Journal of Human Resources, forthcoming.
Jacob, Brian A., and Steven Levitt, "Catching Cheating Teachers: The Results of an Unusual Experiment in Implementing Theory," Working Paper No. 9413, National Bureau of Economic Research, 2003.
Klein, Stephen P., Laura S. Hamilton, et al., "What Do Test Scores in Texas Tell Us?" RAND Corporation, 2000.
Kolker, Claudia, "Texas Offers Hard Lessons on School Accountability," Los Angeles Times, April 14, 1999.
Lindsay, Drew, "Whodunit? Officials Find Thousands of Erasures on Standardized Tests and Suspect Tampering," Education Week (1996), 25–29.
Loughran, Regina, and Thomas Comiskey, "Cheating the Children: Educator Misconduct on Standardized Tests," Report of the City of New York Special Commissioner of Investigation for the New York City School District, 1999.
Marcus, John, "Faking the Grade," Boston Magazine, February 2000.
May, Meredith, "State Fears Cheating by Teachers," San Francisco Chronicle, October 4, 2000.
Perlman, Carol L., "Results of a Citywide Testing Program Audit in Chicago," Paper for American Educational Research Association annual meeting, Chicago, IL, 1985.
Porter, Robert H., and J. Douglas Zona, "Detection of Bid Rigging in Procurement Auctions," Journal of Political Economy, CI (1993), 518–538.
Richards, Craig E., and Tian Ming Sheu, "The South Carolina School Incentive Reward Program: A Policy Analysis," Economics of Education Review, XI (1992), 71–86.
Rohlfs, Chris, "Estimating Classroom Learning Using the School Breakfast Program as an Instrument for Attendance and Class Size: Evidence from Chicago Public Schools," unpublished manuscript, University of Chicago, 2002.
Tysome, Tony, "Cheating Purge: Inspectors Out," New York Times Higher Education Supplement, August 19, 1994.
Wollack, James A., "A Nominal Response Model Approach for Detecting Answer Copying," Applied Psychological Measurement, XX (1997), 144–152.



roomrsquos answer strings in ways that we believe mimic teachercheating or outstanding teaching22 We are then able to see howfrequently we label these altered classes as ldquocheatingrdquo and howthis varies with the extent and nature of the changes we make toa classroomrsquos answers We simulate two different types of teachercheating The rst is a very naive version in which a teacherstarts cheating at the same question for a number of students andchanges consecutive questions to the right answers for thesestudents creating a block of identical and correct responses Thesecond type of cheating is much more sophisticated we randomlychange answers from incorrect to correct for selected students ina class The outcome of this simulation reveals that our methodsare surprisingly weak in detecting moderate cheating even in thenave case For instance when three consecutive answers arechanged to correct for 25 percent of a class we catch such behav-ior only 4 percent of the time Even when six questions arechanged for half of the class we catch the cheating in less than 60percent of the cases Only when the cheating is very extreme (egsix answers changed for the entire class) are we very likely tocatch the cheating The explanation for this poor performance isthat classrooms with low test-score gains even with the boostprovided by this cheating do not have large enough test scoreuctuations to make them appear unusual because there is somuch inherent volatility in test scores When the cheating takesthe more sophisticated form described above we are even lesssuccessful at catching low and moderate cheaters (2 percent and34 percent respectively in the rst two scenarios describedabove) but we almost always detect very extreme cheating

From a political or equity perspective, however, we may be even more concerned with false positives, specifically the risk of accusing highly effective teachers of cheating. Hence we also simulate the impact of a good teacher, changing certain answers for certain students from incorrect to correct responses. Our manipulation in this case differs from the cheating simulations above in two ways: (1) we allow some of the gains to be preserved in the following year, and (2) we alter questions such that students will not get random answers correct, but rather will tend to show the greatest improvement on the easiest questions that the students were getting wrong. These simulation results suggest that only rarely are good teachers mistakenly labeled as cheaters. For instance, a teacher whose prowess allows him or her to raise test scores by six questions for half the students in the class (more than a 0.5 grade equivalent increase on average for the class) is labeled a cheater only 2 percent of the time. In contrast, a naive cheater with the same test score gains is correctly identified as a cheater in more than 50 percent of cases.

22. See Jacob and Levitt [2003] for the precise details of this simulation exercise.

In another test for false positives, we explored whether we might be mistaking emphasis on certain subject material for cheating. For example, if a math teacher spends several months on fractions with a particular class, one would expect the class to do particularly well on all of the math questions relating to fractions, and perhaps worse than average on other math questions. One might imagine a similar scenario in reading if, for example, a teacher creates a lengthy unit on the Underground Railroad, which later happens to appear in a passage on the reading comprehension exam. To examine this, we analyze the nature of the across-question correlations in cheating versus noncheating classrooms. We find that the most highly correlated questions in cheating classrooms are no more likely to measure the same skill (e.g., fractions, geometry) than in noncheating classrooms. The implication of these results is that the prevalence estimates presented above are likely to substantially understate the true extent of cheating (there is little evidence of false positives), but we frequently miss moderate cases of cheating.

V.B. The Correlation of Cheating Indicators across Subjects, Classrooms, and Years

If what we are detecting is truly cheating, then one would expect that a teacher who cheats on one part of the test would be more likely to cheat on other parts of the test. Also, a teacher who cheats one year would be more likely to cheat the following year. Finally, to the extent that cheating is either condoned by the principal or carried out by the test coordinator, one would expect to find multiple classes in a school cheating in any given year, and perhaps even that cheating in a school one year predicts cheating in future years. If what we are detecting is not cheating, then one would not necessarily expect to find strong correlation in our cheating indicators across exams for a specific classroom, across classrooms, or across years.

Table III reports regression results testing these predictions. The dependent variable is an indicator for whether we believe a classroom is likely to be cheating on a particular subject test, using our most stringent definition (above the ninety-fifth percentile on both cheating indicators). The baseline probability of qualifying as a cheater for this cutoff is 1.1 percent. To fully appreciate the enormity of the effects implied by the table, it is important to keep this very low baseline in mind. We report estimates from linear probability models (probits yield similar marginal effects), with standard errors clustered at the school level.

TABLE III
PATTERNS OF CHEATING WITHIN CLASSROOMS AND SCHOOLS

Dependent variable = Class suspected of cheating (mean of the dependent variable = 0.011)

                                                              Sample of classes and
                                             Full sample      schools that existed
                                                              in the prior year
Independent variables                       (1)      (2)        (3)      (4)

Classroom cheated on exactly one
  other subject this year                  0.105    0.103      0.101    0.101
                                          (0.008)  (0.008)    (0.009)  (0.009)
Classroom cheated on exactly two
  other subjects this year                 0.289    0.285      0.243    0.243
                                          (0.027)  (0.027)    (0.031)  (0.031)
Classroom cheated on all three other
  subjects this year                       0.627    0.622      0.595    0.595
                                          (0.051)  (0.051)    (0.054)  (0.054)
Cheating rate among all other classes
  in the school this year on this
  subject                                    --     0.166      0.134    0.129
                                                   (0.030)    (0.027)  (0.027)
Cheating rate among all other classes
  in the school this year on other
  subjects                                   --     0.023      0.059    0.045
                                                   (0.024)    (0.026)  (0.029)
Cheating in this classroom in this
  subject last year                          --       --       0.096    0.091
                                                              (0.012)  (0.012)
Number of other subjects this
  classroom cheated on last year             --       --       0.023    0.018
                                                              (0.004)  (0.004)
Cheating in this classroom ever in
  the past                                   --       --         --     0.006
                                                                       (0.002)
Cheating rate among other classrooms
  in this school in past years               --       --         --     0.090
                                                                       (0.040)
Fixed effects for grade × subject × year    Yes      Yes        Yes      Yes
R2                                         0.090    0.093      0.109    0.109
Number of observations                    165,578  165,578    94,182   94,170

The dependent variable is an indicator for whether a classroom is above the 95th percentile on both our suspicious strings and unusual test score measures of cheating on a particular subject test. Estimation is done using a linear probability model. The unit of observation is classroom × grade × year × subject. For columns that include measures of cheating in prior years, observations where that classroom or school does not appear in the data in the prior year are excluded. Standard errors are clustered at the school level to take into account correlations across classrooms as well as serial correlation.
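A regression of the Table III variety can be estimated with standard tools. The sketch below is an illustration on fabricated data (all variable names and the toy data generation are ours, not the paper's): a linear probability model with grade × subject × year fixed effects and school-clustered standard errors, fit with statsmodels.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(seed=2)
n = 2000
df = pd.DataFrame({
    "cheat": rng.binomial(1, 0.011, size=n),        # 0/1 suspected-cheater flag
    "n_other_cheats": rng.integers(0, 4, size=n),   # other subjects cheated this year
    "school_rate_same": 0.05 * rng.random(size=n),  # school cheating rate, same subject
    "school_id": rng.integers(0, 100, size=n),
    "grade": rng.integers(3, 8, size=n),
    "subject": rng.integers(0, 4, size=n),
    "year": rng.integers(1993, 2001, size=n),
})

# Linear probability model with grade x subject x year fixed effects;
# standard errors clustered at the school level.
res = smf.ols(
    "cheat ~ C(n_other_cheats) + school_rate_same + C(grade):C(subject):C(year)",
    data=df,
).fit(cov_type="cluster", cov_kwds={"groups": df["school_id"]})
print(res.params.filter(like="n_other_cheats"))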

Column 1 of Table III shows that cheating on other tests in the same year is an extremely powerful predictor of cheating in a different subject. If a classroom cheats on exactly one other subject test, the predicted probability of cheating on this test increases by over ten percentage points. Since the baseline cheating rates are only 1.1 percent, classrooms cheating on exactly one other test are ten times more likely to have cheated on this subject than are classrooms that did not cheat on any of the other subjects (which is the omitted category). Classrooms that cheat on two other subjects are almost 30 times more likely to cheat on this test, relative to those not cheating on other tests. If a class cheats on all three other subjects, it is 50 times more likely to also cheat on this test.

There also is evidence of correlation in cheating within schools. A ten-percentage-point increase in cheating classrooms in a school (excluding the classroom in question) on the same subject test raises the likelihood this class cheats by roughly 0.16 percentage points. This potentially suggests some role for centralized cheating by a school counselor, test coordinator, or the principal, rather than by teachers operating independently. There is little evidence that cheating rates within the school on other subject tests affect cheating on this test.

When making comparisons across years (columns 3 and 4), it is important to note that we do not actually have teacher identifiers. We do, however, know what classroom a student is assigned to. Thus, we can only compare the correlation between past and current cheating in a given classroom. To the extent that teacher turnover occurs or teachers switch classrooms, this proxy will be contaminated. Even given this important limitation, cheating in the classroom last year predicts cheating this year. In column 3, for example, we see that classrooms that cheated in the same subject last year are 9.6 percentage points more likely to cheat this year, even after we control for cheating on other subjects in the same year and cheating in other classes in the school. Column 4 shows that prior cheating in the school strongly predicts the likelihood that a classroom will cheat this year.


V.C. Evidence from Retesting under Controlled Circumstances

Perhaps the most compelling evidence for our methods comes from the results of a unique policy intervention. In spring 2002 the Chicago public schools provided us the opportunity to retest over 100 classrooms under controlled circumstances that precluded cheating, a few weeks after the initial ITBS exam was administered.23 Based on suspicious answer strings and unusual test score gains on the initial spring 2002 exam, we identified a set of classrooms most likely to have cheated. In addition, we chose a second group of classrooms that had suspicious answer strings but did not have large test score gains, a pattern that is consistent with a bad teacher who attempts to mask a poor teaching performance by cheating. We also retested two sets of classrooms as control groups. The first control group consisted of classes with large test score gains but no evidence of cheating in their answer strings, a potential indication of effective teaching. These classes would be expected to maintain their gains when retested, subject to possible mean reversion. The second control group consisted of randomly selected classrooms.

The results of the retests are reported in Table IV. Columns 1–3 correspond to reading; columns 4–6 are for math. For each subject we report the test score gain (in standard score units) for three periods: (i) from spring 2001 to the initial spring 2002 test, (ii) from the initial spring 2002 test to the retest a few weeks later, and (iii) from spring 2001 to the spring 2002 retest. For purposes of comparison, we also report the average gain for all classrooms in the system for the first of those measures (the other two measures are unavailable unless a classroom was retested).

Classrooms prospectively identified as "most likely to have cheated" experienced gains on the initial spring 2002 test that were nearly twice as large as the typical CPS classroom (see columns 1 and 4). On the retest, however, those excess gains completely disappeared: the gains between spring 2001 and the spring 2002 retest were close to the systemwide average. Those classrooms prospectively identified as "bad teachers likely to have cheated" also experienced large score declines on the retest (-8.8 and -10.5 standard score points, or roughly two-thirds of a grade equivalent). Indeed, comparing the spring 2001 results with the spring 2002 retest, children in these classrooms had gains only half as large as the typical yearly CPS gain. In stark contrast, classrooms prospectively identified as having "good teachers" (i.e., classrooms with large gains but not suspicious answer strings) actually scored even higher on the reading retest than they did on the initial test. The math scores fell slightly on retesting, but these classrooms continued to experience extremely large gains. The randomly selected classrooms also maintained almost all of their gains when retested, as would be expected. In summary, our methods proved to be extremely effective both in identifying likely cheaters and in correctly classifying teachers who achieved large test score gains in a legitimate manner.

23. Full details of this policy experiment are reported in Jacob and Levitt [2003]. The results of the retests were used by CPS to initiate disciplinary action against a number of teachers, principals, and staff.

TABLE IV
RESULTS OF RETESTS: COMPARISON OF RESULTS FOR SPRING 2002 ITBS AND AUDIT TEST

                                          Reading gains between         Math gains between
                                       Spring    Spring    Spring    Spring    Spring    Spring
                                       2001      2001      2002      2001      2001      2002
                                       and       and       and       and       and       and
                                       spring    2002      2002      spring    2002      2002
Category of classroom                  2002      retest    retest    2002      retest    retest

All classrooms in the Chicago
  public schools                       14.3       --        --       16.9       --        --
Classrooms prospectively identified
  as most likely cheaters (N = 36
  on math, N = 39 on reading)          28.8      12.6     -16.2      30.0      19.3     -10.7
Classrooms prospectively identified
  as bad teachers suspected of
  cheating (N = 16 on math,
  N = 20 on reading)                   16.6       7.8      -8.8      17.3       6.8     -10.5
Classrooms prospectively identified
  as good teachers who did not
  cheat (N = 17 on math, N = 17
  on reading)                          20.6      21.1       0.5      28.8      25.5      -3.3
Randomly selected classrooms
  (N = 24 overall, but only one
  test per classroom)                  14.5      12.2      -2.3      14.5      12.2      -2.3

Results in this table compare test scores at three points in time (spring 2001, spring 2002, and a retest administered under controlled conditions a few weeks after the initial spring 2002 test). The unit of observation is a classroom-subject test pair. Only students taking all three tests are included in the calculations. Because of limited data, math and reading results for the randomly selected classrooms are combined. Only the first two columns are available for all CPS classrooms, since audits were performed only on a subset of classrooms. All entries in the table are in standard score units.

VI. DOES TEACHER CHEATING RESPOND TO INCENTIVES?

From the perspective of economics, perhaps the most interesting question related to teacher cheating is the degree to which it responds to incentives. As noted in the Introduction, there were two major changes in the incentives faced by teachers and students over our sample period. Prior to 1996, ITBS scores were primarily used to provide teachers and parents with a sense of how a child was progressing academically. Beginning in 1996, with the appointment of Paul Vallas as CEO of Schools, the CPS launched an initiative designed to hold students and teachers accountable for student learning.

The reform had two main elements. The first involved putting schools "on probation" if less than 15 percent of students scored at or above national norms on the ITBS reading exams. The CPS did not use math performance to determine probation status. Probation schools that do not exhibit sufficient improvement may be reconstituted, a procedure that involves closing the school and dismissing or reassigning all of the school staff.24 It is clear from our discussions with teachers and administrators that being on probation is viewed as an extremely undesirable circumstance. The second piece of the accountability reform was an end to social promotion (the practice of passing students to the next grade regardless of their academic skills or school performance). Under the new policy, students in third, sixth, and eighth grade must meet minimum standards on the ITBS in both reading and mathematics in order to be promoted to the next grade. The promotion standards were implemented in spring 1997 for third and sixth graders. Promotion decisions are based solely on scores in reading comprehension and mathematics.

24. For a more detailed analysis of the probation policy, see Jacob [2002] and Jacob and Lefgren [forthcoming].

Table V presents OLS estimates of the relationship between teacher cheating and a variety of classroom and school characteristics.25 The unit of observation is a classroom × subject × grade × year. The dependent variable is an indicator of whether we suspect the classroom cheated, using our most stringent definition: a classroom is designated a cheater if both its test score fluctuations and the suspiciousness of its answer strings are above the ninety-fifth percentile for that grade, subject, and year.26
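In code, this indicator amounts to a within-cell percentile rank on each measure. The sketch below is our own construction with hypothetical column names for the two raw measures.

import pandas as pd

def flag_cheaters(df: pd.DataFrame, cutoff: float = 0.95) -> pd.Series:
    # Percentile-rank classrooms within each grade x subject x year cell
    # on the two raw measures, then flag classes above the cutoff on BOTH
    # the suspicious-answer-strings and test-score-fluctuation measures.
    cells = df.groupby(["grade", "subject", "year"])
    answers_pct = cells["answers_measure"].rank(pct=True)
    score_pct = cells["score_measure"].rank(pct=True)
    return (answers_pct > cutoff) & (score_pct > cutoff)

# Example: df["cheat95"] = flag_cheaters(df); with the 95th percentile
# cutoff on both measures, roughly 1 percent of classes are flagged.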

In column (1) the policy changes are restricted to have a constant impact across all classrooms. The introduction of the social promotion and probation policies is positively correlated with the likelihood of classroom cheating, although the point estimates are not statistically significant at conventional levels. Much more interesting results emerge when we interact the policy changes with the previous year's test scores for the classroom. For both probation and social promotion, cheating rates in the lowest performing classrooms prove to be quite responsive to the change in incentives. In column (2) we see that a classroom one standard deviation below the mean increased its likelihood of cheating by 0.43 percentage points in response to the school probation policy, and roughly 0.65 percentage points due to the ending of social promotion. Given a baseline rate of cheating of 1.1 percent, these effects are substantial. The magnitude of these changes is particularly large considering that no elementary school on probation was actually reconstituted during this period, and that the social promotion policy has a direct impact on students but no direct ramifications for teacher pay or rewards. A classroom one standard deviation above the mean does not see any significant change in cheating in response to these two policies. Such classes are very unlikely to be located in schools at risk for being put on probation, and also are likely to have few students at risk for being retained. These findings are robust to adding a number of classroom and school characteristics (column (3)) as well as school × year fixed effects (column (4)).

25. Logit and probit models evaluated at the mean yield comparable results, so the estimates from a linear probability model are presented for ease of interpretation.

26. The results are not sensitive to the cheating cutoff used. Note that this measure may include error due to both false positives and negatives. Since the measurement error is in the dependent variable, the primary impact will likely be to simply decrease the precision of our estimates.

TABLE V
OLS ESTIMATES OF THE RELATIONSHIP BETWEEN CHEATING AND CLASSROOM CHARACTERISTICS

Dependent variable = Indicator of classroom cheating

Independent variables                          (1)        (2)        (3)        (4)

Social promotion policy                      0.0011     0.0011     0.0015     0.0023
                                            (0.0013)   (0.0013)   (0.0013)   (0.0009)
School probation policy                      0.0020     0.0019     0.0021     0.0029
                                            (0.0014)   (0.0014)   (0.0014)   (0.0013)
Prior classroom achievement                 -0.0047    -0.0028    -0.0016    -0.0028
                                            (0.0005)   (0.0005)   (0.0007)   (0.0007)
Social promotion × classroom achievement       --      -0.0049    -0.0051    -0.0046
                                                       (0.0014)   (0.0014)   (0.0012)
School probation × classroom achievement       --      -0.0070    -0.0070    -0.0064
                                                       (0.0013)   (0.0013)   (0.0013)
Mixed grade classroom                       -0.0084    -0.0085    -0.0089    -0.0089
                                            (0.0007)   (0.0007)   (0.0008)   (0.0012)
Percent of students included in official
  reporting                                  0.0252     0.0249     0.0141     0.0131
                                            (0.0031)   (0.0031)   (0.0037)   (0.0037)
Teacher administers exam to own students     0.0067     0.0067     0.0066     0.0061
                                            (0.0015)   (0.0015)   (0.0015)   (0.0011)
Test form offered for the first time (a)    -0.0007    -0.0007    -0.0011      --
                                            (0.0011)   (0.0011)   (0.0010)
Average quality of teachers'
  undergraduate institution                    --         --      -0.0026      --
                                                                  (0.0007)
Percent of teachers who have worked at
  the school less than 3 years                 --         --      -0.0045      --
                                                                  (0.0031)
Percent of teachers under 30 years of age      --         --       0.0156      --
                                                                  (0.0065)
Percent of students in the school meeting
  national norms in reading last year          --         --      -0.0001      --
                                                                  (0.0000)
Percent free lunch in school                   --         --       0.0001      --
                                                                  (0.0000)
Predominantly Black school                     --         --       0.0068      --
                                                                  (0.0019)
Predominantly Hispanic school                  --         --      -0.0009      --
                                                                  (0.0016)
School × Year fixed effects                    No         No         No        Yes
Number of observations                      163,474    163,474    163,474    163,474

The unit of observation is classroom × grade × year × subject, and the sample includes eight years (1993 to 2000), four subjects (reading comprehension and three math sections), and five grades (three to seven). The dependent variable is the cheating indicator derived using the ninety-fifth percentile cutoff. Robust standard errors clustered by school × year are shown in parentheses. Other variables included in the regressions in columns (1) and (2) include a linear time trend, cubic terms for the number of students, a linear grade variable, and fixed effects for subjects. The regression shown in column (3) also includes the following variables: indicators of the percent of students in the classroom who were Black, Hispanic, male, receiving free lunch, old for grade, in a special education program, living in foster care, and living with a nonparental relative; indicators of school size, mobility rate, and attendance rate; and the percent of teachers in the school who had a master's or doctoral degree, lived in Chicago, and were education majors.

a. Test forms vary only by year, so this variable will drop out of the analysis when school × year fixed effects are included.

Cheating also appears to be responsive to other costs and benefits. Classrooms that tested poorly last year are much more likely to cheat. For example, a classroom with average student prior achievement one classroom standard deviation below the mean is 23 percent more likely to cheat. Classrooms with students in multiple grades are 65 percent less likely to cheat than classrooms where all students are in the same grade. This is consistent with the fact that it is likely more difficult for teachers in such classrooms to cheat, since they must administer two different test forms to students, which will necessarily have different correct answers. Moreover, classes with a higher proportion of students who are included in the official test reporting are more likely to cheat: a 10 percentage point increase in the proportion of students in a class whose test scores "count" will increase the likelihood of cheating by roughly 20 percent.27 Teachers who administer the exam to their own students are 0.67 percentage points (approximately 50 percent) more likely to cheat.28

A few other notable results emerge. First, there is no statistically significant impact on cheating of reusing a test form that has been administered in a previous year. That finding is of interest because it suggests that teachers taking old exams and teaching the precise questions to students is not an important component of what we are detecting as cheating (although anecdotal evidence suggests this practice exists). Classrooms in schools with lower achievement, higher poverty rates, and more Black students are more likely to cheat. Classrooms in schools with teachers who graduated from more prestigious undergraduate institutions are less likely to cheat, whereas classrooms in schools with younger teachers are more likely to cheat.

27. Consistent with this finding, within cheating classrooms teachers are less likely to alter the answer sheets for students who are excluded from test reporting. Teachers also appear to cheat less frequently for boys and for older students.

28. In an earlier version of the paper [Jacob and Levitt 2002], we examine for which students teachers tend to cheat. We find that teachers are most likely to cheat for students in the second quartile of the national ability distribution (twenty-fifth to fiftieth percentile) in the previous year, consistent with the incentives under the probation policy.


VII. CONCLUSION

This paper develops an algorithm for determining the prevalence of teacher cheating on standardized tests and applies the results to data from the Chicago public schools. Our methods reveal thousands of instances of classroom cheating, representing 4–5 percent of the classrooms each year. Moreover, we find that teacher cheating appears quite responsive to relatively minor changes in incentives.

The obvious benefits of high-stakes tests as a means of providing incentives must therefore be weighed against the possible distortions that these measures induce. Explicit cheating is merely one form of distortion. Indeed, the kind of cheating we focus on is one that could be greatly reduced at a relatively low cost through the implementation of proper safeguards, such as is done by Educational Testing Service on the SAT or GRE exams.29

Even if this particular form of cheating were eliminated, however, there are a virtually infinite number of dimensions along which strategic school personnel can act to game the current system. For instance, Figlio and Winicki [2001] and Rohlfs [2002] document that schools attempt to increase the caloric intake of students on test days, and disruptive students are especially likely to be suspended from school during official testing periods [Figlio 2002]. It may be prohibitively costly, or even impossible, to completely safeguard the system.

Finally, this paper fits into a small but growing body of research focused on identifying corrupt or illicit behavior on the part of economic actors (see Porter and Zona [1993], Fisman [2001], Di Tella and Schargrodsky [2001], and Duggan and Levitt [2002]). Because individuals engaged in such behavior actively attempt to hide their trails, the intellectual exercise associated with uncovering their misdeeds differs substantially from the typical economic application, in which the researcher starts with a well-defined measure of the outcome variable (e.g., earnings, economic growth, profits) and then attempts to uncover the determinants of these outcomes. In the case of corruption, there is typically no clear outcome variable, making it necessary for the researcher to employ nonstandard approaches in generating such a measure.

29. Although even tests administered by the Educational Testing Service are not immune to cheating, as evidenced by recent reports that many of the GRE scores from China are fraudulent [Contreras 2002].

APPENDIX: THE CONSTRUCTION OF SUSPICIOUS STRING MEASURES

This appendix describes in greater detail how we construct each of our measures of unexpected or suspicious responses.

The first measure focuses on the most unlikely block of identical answers given on consecutive questions. Using past test scores, future test scores, and background characteristics, we predict the likelihood that each student will give each answer on each question. For each item a student has four choices (A, B, C, or D), only one of which is correct. We estimate a multinomial logit for each item on the exam in order to predict how students will respond to each question. We estimate the following model for each item, using information from other students in that year, grade, and subject:

(1)   \Pr(Y_{isc} = j) = \frac{e^{\beta_j X_s}}{\sum_{j=1}^{J} e^{\beta_j X_s}},

where Y_{isc} indicates the response of student s in class c on item i, the number of possible responses (J) is four, and X_s is a vector that includes measures of prior and future student achievement in math and reading, as well as demographic variables (such as race, gender, and free lunch status) for student s. Thus, a student's predicted probability of choosing a particular response is identified by the likelihood of other students (in the same year, grade, and subject) with similar background characteristics choosing that response.
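For illustration, a per-item model of this general form can be fit with off-the-shelf multinomial logistic regression. The sketch below uses scikit-learn on fabricated data; the covariates are stand-ins for the achievement and demographic controls named above, not the authors' exact specification.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(seed=3)
n = 5000
# X_s: prior and future scores plus a demographic indicator (toy data).
X = np.column_stack([
    rng.normal(size=n),          # prior achievement
    rng.normal(size=n),          # future achievement
    rng.integers(0, 2, size=n),  # e.g., free lunch status
])
y = rng.integers(0, 4, size=n)   # observed response A-D, coded 0-3

# One multinomial logit per item, estimated on all students in the
# same year, grade, and subject.
item_model = LogisticRegression(max_iter=1000).fit(X, y)

# p_isc: predicted probability of the answer each student actually gave.
probs = item_model.predict_proba(X)   # n x 4 matrix, rows sum to 1
p_isc = probs[np.arange(n), y]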

Notice that by including future as well as prior test scores in the model, we decrease the likelihood that students with unusually good teachers will be identified as cheaters, since these students will likely retain some of the knowledge learned in the base year and thus have higher future test scores. Also note that by estimating the probability of selecting each possible response, rather than simply estimating the probability of choosing the correct response, we take advantage of any additional information that is provided by particular response patterns in a classroom.

Using the estimates from this model, we calculate the predicted probability that each student would answer each item in the way that he or she in fact did:

(2)   p_{isc} = \frac{e^{\beta_k X_s}}{\sum_{j=1}^{J} e^{\beta_j X_s}}   for k = response actually chosen by student s on item i.

This provides us with one measure per student per item. Taking the product over items within student, we calculate the probability that a student would have answered a string of consecutive questions from item m to item n as he or she did:

(3)   p_{sc}^{mn} = \prod_{i=m}^{n} p_{isc}.

We then take the product across all students in the classroom who had identical responses in the string. If we define z as a student, S_{zc}^{mn} as the string of responses for student z from item m to item n, and S_{sc}^{mn} as the string for student s, then we can express the product as

(4)   \tilde{p}_{sc}^{mn} = \prod_{s \in \{z : S_{zc}^{mn} = S_{sc}^{mn}\}} p_{sc}^{mn}.

Note that if there are n_s students in class c and each student has a unique set of responses to these particular items, then \tilde{p}_{sc}^{mn} collapses to p_{sc}^{mn} for each student, and there will be n_s distinct values within the class. On the other extreme, if all of the students in class c have identical responses, then there is only one distinct value of \tilde{p}_{sc}^{mn}. We repeat this calculation for all possible consecutive strings of length three to seven, that is, for all S^{mn} such that 3 \leq n - m \leq 7. To create our first indicator of suspicious string patterns, we take the minimum of the predicted block probability for each classroom:

MEASURE 1:   M1_c = \min_s (\tilde{p}_{sc}^{mn}).

This measure captures the least likely block of identical answers given on consecutive questions in the classroom.
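A direct, unoptimized transcription of equations (3) and (4) into code might look as follows. Inputs are assumed to be the p_isc matrix and the raw responses for a single classroom; this is our sketch, not the authors' program.

import numpy as np

def measure_1(p, responses):
    # p: (students x items) array of p_isc, the probability each student
    # gives the answer he or she actually gave.
    # responses: (students x items) array of the answers actually chosen.
    n_students, n_items = p.shape
    m1 = 1.0
    for length in range(3, 8):                # blocks of 3 to 7 consecutive items
        for m in range(n_items - length + 1):
            window = slice(m, m + length)
            # Eq. (3): probability each student answers this block as observed.
            p_block = p[:, window].prod(axis=1)
            for s in range(n_students):
                # Students whose responses on the block match student s exactly.
                same = (responses[:, window] == responses[s, window]).all(axis=1)
                # Eq. (4): joint probability of the identical block.
                p_tilde = p_block[same].prod()
                m1 = min(m1, p_tilde)
    return m1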

The second measure of suspicious answer strings is intended to capture more general patterns of similarity in student responses. To construct this measure, we first calculate the residuals for each of the possible choices a student could have made for each item:

(5)   e_{jisc} = \begin{cases} 0 - \frac{e^{\beta_j X_s}}{\sum_{j=1}^{J} e^{\beta_j X_s}} & \text{if } j \neq k \\ 1 - \frac{e^{\beta_j X_s}}{\sum_{j=1}^{J} e^{\beta_j X_s}} & \text{if } j = k, \end{cases}

where e_{jisc} is the residual for response j on item i by student s in classroom c. We thus have four separate residuals per student per item.

To create a classroom-level measure of the response to item i, we need to combine the information for each student. First, we sum the residuals for each response across students within a classroom:

(6)   e_{jic} = \sum_s e_{jisc}.

If there is no within-class correlation in the way that students responded to a particular item, this term should be approximately zero. Second, we sum across the four possible responses for each item within classrooms. At the same time, we square each of the component residual measures to accentuate outliers, and divide by the number of students in the class (n_{sc}) to normalize by class size:

(7)   v_{ic} = \frac{\sum_j e_{jic}^2}{n_{sc}}.

The statistic v_{ic} captures something like the variance of student responses on item i within classroom c. Notice that we choose to first sum the residuals of each response across students, and then sum the classroom-level measures for each response, rather than summing across responses within student initially. We do this in order to emphasize the classroom-level tendencies in response patterns.

Our second measure of suspicious strings is simply the classroom average (across items) of this variance term across all test items:

MEASURE 2:   M2_c = \bar{v}_c = \sum_i v_{ic} / n_i,   where n_i is the number of items on the exam.

Our third measure is simply the variance (as opposed to the mean) in the degree of correlation across questions within a classroom:

MEASURE 3:   M3_c = \sigma_{v_c}^2 = \sum_i (v_{ic} - \bar{v}_c)^2 / n_i.
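In code, Measures 2 and 3 follow mechanically from equations (5)-(7). The sketch below is our illustration for a single classroom, assuming the per-choice predicted probabilities from the item models are already in hand.

import numpy as np

def measures_2_and_3(probs, responses):
    # probs: (students x items x 4) predicted probabilities for each choice.
    # responses: (students x items) chosen answers coded 0-3.
    n_students, n_items, n_choices = probs.shape
    # Eq. (5): residual for every possible choice, 1[j chosen] - predicted prob.
    chosen = np.zeros_like(probs)
    rows, cols = np.indices(responses.shape)
    chosen[rows, cols, responses] = 1.0
    e = chosen - probs
    # Eq. (6): sum residuals across students within the classroom.
    e_class = e.sum(axis=0)                      # (items x 4)
    # Eq. (7): square, sum over the four choices, normalize by class size.
    v = (e_class ** 2).sum(axis=1) / n_students  # one value per item
    m2 = v.mean()                                # Measure 2: average across items
    m3 = v.var()                                 # Measure 3: variance across items
    return m2, m3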

Our final indicator focuses on the extent to which a student's response pattern was different from other students' with the same aggregate score that year. Let q_{isc} equal one if student s in classroom c answered item i correctly, and zero otherwise. Let A_s equal the aggregate score of student s on the exam. We then determine what fraction of students at each aggregate score level answered each item correctly. If we let n_{sA} equal the number of students with an aggregate score of A, then this fraction \bar{q}_i^A can be expressed as

(8)   \bar{q}_i^A = \frac{\sum_{s \in \{z : A_z = A_s\}} q_{isc}}{n_{sA}}.

We then calculate a measure of how much the response pattern of student s differed from the response pattern of other students with the same aggregate score. We do so by subtracting a student's answer on item i from the mean response of all students with aggregate score A, squaring these deviations, and then summing across all items on the exam:

(9)   Z_{sc} = \sum_i (q_{isc} - \bar{q}_i^A)^2.

We then subtract out the mean deviation for all students with the same aggregate score, \bar{Z}_A, and sum across the students within each classroom to obtain our final indicator:

MEASURE 4:   M4_c = \sum_s (Z_{sc} - \bar{Z}_A).
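A compact pandas implementation of equations (8)-(9) and Measure 4 might look as follows; the input layout is assumed, and the code is our sketch rather than the authors' program.

import numpy as np
import pandas as pd

def measure_4(correct, classroom):
    # correct: (students x items) 0/1 matrix for all students in a grade-year.
    # classroom: length-students array of classroom labels.
    agg = correct.sum(axis=1)                     # aggregate score A_s
    df = pd.DataFrame(correct)
    # Eq. (8): fraction answering each item correctly, by aggregate score level.
    qbar = df.groupby(agg).transform("mean")
    # Eq. (9): squared deviations from same-score peers, summed over items.
    z = ((df - qbar) ** 2).sum(axis=1)
    # Subtract the mean deviation at each aggregate score level...
    z_centered = z - z.groupby(agg).transform("mean")
    # ...and sum within classrooms to get M4 for each class.
    return z_centered.groupby(pd.Series(classroom)).sum()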

KENNEDY SCHOOL OF GOVERNMENT, HARVARD UNIVERSITY

UNIVERSITY OF CHICAGO AND AMERICAN BAR FOUNDATION

REFERENCES

Aiken, Linda R., "Detecting, Understanding, and Controlling for Cheating on Tests," Research in Higher Education, XXXII (1991), 725–736.

Angoff, William H., "The Development of Statistical Indices for Detecting Cheaters," Journal of the American Statistical Association, LXIX (1974), 44–49.

Baker, George, "Incentive Contracts and Performance Measurement," Journal of Political Economy, C (1992), 598–614.

Bellezza, Francis S., and Suzanne F. Bellezza, "Detection of Cheating on Multiple-Choice Tests by Using Error Similarity Analysis," Teaching of Psychology, XVI (1989), 151–155.

Carnoy, Martin, and Susanna Loeb, "Does External Accountability Affect Student Outcomes? A Cross-State Analysis," Working Paper, Stanford University School of Education, 2002.

Contreras, Russell, "Inquiry Uncovers Possible Cheating on GRE in Asia," Associated Press, August 7, 2002.

Cullen, Julie Berry, and Randall Reback, "Tinkering Toward Accolades: School Gaming under a Performance Accountability System," Working paper, University of Michigan, 2002.

Deere, Donald, and Wayne Strayer, "Putting Schools to the Test: School Accountability, Incentives and Behavior," Working Paper, Texas A&M University, 2002.

Di Tella, Rafael, and Ernesto Schargrodsky, "The Role of Wages and Auditing during a Crackdown on Corruption in the City of Buenos Aires," unpublished manuscript, Harvard Business School, 2001.

Duggan, Mark, and Steven Levitt, "Winning Isn't Everything: Corruption in Sumo Wrestling," American Economic Review, XCII (2002), 1594–1605.

Figlio, David N., "Testing, Crime and Punishment," Working Paper, University of Florida, 2002.

Figlio, David N., and Joshua Winicki, "Food for Thought: The Effects of School Accountability Plans on School Nutrition," Working Paper, University of Florida, 2001.

Figlio, David N., and Lawrence S. Getzler, "Accountability, Ability and Disability: Gaming the System," Working Paper, University of Florida, 2002.

Fisman, Ray, "Estimating the Value of Political Connections," American Economic Review, XCI (2001), 1095–1102.

Frary, Robert B., "Statistical Detection of Multiple-Choice Answer Copying: Review and Commentary," Applied Measurement in Education, VI (1993), 153–165.

Frary, Robert B., T. Nicholas Tideman, and Thomas Watts, "Indices of Cheating on Multiple-Choice Tests," Journal of Educational Statistics, II (1977), 235–256.

Glaeser, Edward, and Andrei Shleifer, "A Reason for Quantity Regulation," American Economic Review, XCI (2001), 431–435.

Grissmer, David W., Ann Flanagan, et al., "Improving Student Achievement: What NAEP Test Scores Tell Us," RAND Corporation, MR-924-EDU, 2000.

Hanushek, Eric A., and Margaret E. Raymond, "Improving Educational Quality: How Best to Evaluate Our Schools?" Conference paper, Federal Reserve Bank of Boston, 2002.

Hofkins, Diane, "Cheating 'Rife' in National Tests," New York Times Educational Supplement, June 16, 1995.

Holland, Paul W., "Assessing Unusual Agreement between the Incorrect Answers of Two Examinees Using the K-Index: Statistical Theory and Empirical Support," Technical Report No. 96-5, Educational Testing Service, 1996.

Holmstrom, Bengt, and Paul Milgrom, "Multitask Principal-Agent Analyses: Incentive Contracts, Asset Ownership and Job Design," Journal of Law, Economics and Organization, VII (1991), 24–51.

Jacob, Brian A., "Accountability, Incentives and Behavior: The Impact of High-Stakes Testing in the Chicago Public Schools," Working Paper No. 8968, National Bureau of Economic Research, 2002.

Jacob, Brian A., and Lars Lefgren, "The Impact of Teacher Training on Student Achievement: Quasi-Experimental Evidence from School Reform Efforts in Chicago," Journal of Human Resources, forthcoming.

Jacob, Brian A., and Steven Levitt, "Catching Cheating Teachers: The Results of an Unusual Experiment in Implementing Theory," Working Paper No. 9413, National Bureau of Economic Research, 2003.

Klein, Stephen P., Laura S. Hamilton, et al., "What Do Test Scores in Texas Tell Us?" RAND Corporation, 2000.

Kolker, Claudia, "Texas Offers Hard Lessons on School Accountability," Los Angeles Times, April 14, 1999.

Lindsay, Drew, "Whodunit? Officials Find Thousands of Erasures on Standardized Tests and Suspect Tampering," Education Week (1996), 25–29.

Loughran, Regina, and Thomas Comiskey, "Cheating the Children: Educator Misconduct on Standardized Tests," Report of the City of New York Special Commissioner of Investigation for the New York City School District, 1999.

Marcus, John, "Faking the Grade," Boston Magazine, February 2000.

May, Meredith, "State Fears Cheating by Teachers," San Francisco Chronicle, October 4, 2000.

Perlman, Carol L., "Results of a Citywide Testing Program Audit in Chicago," Paper for American Educational Research Association annual meeting, Chicago, IL, 1985.

Porter, Robert H., and J. Douglas Zona, "Detection of Bid Rigging in Procurement Auctions," Journal of Political Economy, CI (1993), 518–538.

Richards, Craig E., and Tian Ming Sheu, "The South Carolina School Incentive Reward Program: A Policy Analysis," Economics of Education Review, XI (1992), 71–86.

Rohlfs, Chris, "Estimating Classroom Learning Using the School Breakfast Program as an Instrument for Attendance and Class Size: Evidence from Chicago Public Schools," Unpublished manuscript, University of Chicago, 2002.

Tysome, Tony, "Cheating Purge: Inspectors Out," New York Times Higher Education Supplement, August 19, 1994.

Wollack, James A., "A Nominal Response Model Approach for Detecting Answer Copying," Applied Psychological Measurement, XX (1997), 144–152.



VB The Correlation of Cheating Indicators across SubjectsClassrooms and Years

If what we are detecting is truly cheating then one wouldexpect that a teacher who cheats on one part of the test would bemore likely to cheat on other parts of the test Also a teacher whocheats one year would be more likely to cheat the following yearFinally to the extent that cheating is either condoned by theprincipal or carried out by the test coordinator one would expectto nd multiple classes in a school cheating in any given year andperhaps even that cheating in a school one year predicts cheatingin future years If what we are detecting is not cheating then onewould not necessarily expect to nd strong correlation in ourcheating indicators across exams for a specic classroom acrossclassrooms or across years

862 QUARTERLY JOURNAL OF ECONOMICS

Table III reports regression results testing these predictionsThe dependent variable is an indicator for whether we believe aclassroom is likely to be cheating on a particular subject testusing our most stringent denition (above the ninety-fth per-centile on both cheating indicators) The baseline probability of

TABLE IIIPATTERNS OF CHEATING WITHIN CLASSROOMS AND SCHOOLS

Independent variables

Dependent variable 5 Class suspectedof cheating (mean of the dependent

variable 5 0011)

Full sample

Sample ofclasses andschool that

existed in theprior year

Classroom cheated on exactly oneother subject this year

0105(0008)

0103(0008)

0101(0009)

0101(0009)

Classroom cheated on exactly twoother subjects this year

0289(0027)

0285(0027)

0243(0031)

0243(0031)

Classroom cheated on all three othersubjects this year

0627(0051)

0622(0051)

0595(0054)

0595(0054)

Cheating rate among all other classesin the school this year on thissubject

0166(0030)

0134(0027)

0129(0027)

Cheating rate among all other classesin the school this year on othersubjects

0023(0024)

0059(0026)

0045(0029)

Cheating in this classroom in thissubject last year

0096(0012)

0091(0012)

Number of other subjects thisclassroom cheated on last year

0023(0004)

0018(0004)

Cheating in this classroom ever in thepast

0006(0002)

Cheating rate among other classroomsin this school in past years

0090(0040)

Fixed effects for gradesubjectyear Yes Yes Yes YesR2 0090 0093 0109 0109Number of observations 165578 165578 94182 94170

The dependent variable is an indicator for whether a classroom is above the 95th percentile on both oursuspicious strings and unusual test score measures of cheating on a particular subject test Estimation isdone using a linear probability model The unit of observation is classroomgradeyearsubjectFor columnsthat include measures of cheating in prior years observations where that classroom or school does not appearin the data in the prior year are excluded Standard errors are clustered at the school level to take intoaccount correlations across classroom as well as serial correlation

863ROTTEN APPLES

qualifying as a cheater for this cutoff is 11 percent To fullyappreciate the enormity of the effects implied by the table it isimportant to keep this very low baseline in mind We reportestimates from linear probability models (probits yield similarmarginal effects) with standard errors clustered at the schoollevel

Column 1 of Table III shows that cheating on other tests inthe same year is an extremely powerful predictor of cheating in adifferent subject If a classroom cheats on exactly one other sub-ject test the predicted probability of cheating on this test in-creases by over ten percentage points Since the baseline cheatingrates are only 11 percent classrooms cheating on exactly oneother test are ten times more likely to have cheated on thissubject than are classrooms that did not cheat on any of the othersubjects (which is the omitted category) Classrooms that cheaton two other subjects are almost 30 times more likely to cheat onthis test relative to those not cheating on other tests If a classcheats on all three other subjects it is 50 times more likely to alsocheat on this test

There also is evidence of correlation in cheating withinschools A ten-percentage-point increase in cheating classroomsin a school (excluding the classroom in question) on the samesubject test raises the likelihood this class cheats by roughly 016percentage points This potentially suggests some role for cen-tralized cheating by a school counselor test coordinator or theprincipal rather than by teachers operating independentlyThere is little evidence that cheating rates within the school onother subject tests affect cheating on this test

When making comparisons across years (columns 3 and 4) itis important to note that we do not actually have teacher identi-ers We do however know what classroom a student is assignedto Thus we can only compare the correlation between past andcurrent cheating in a given classroom To the extent that teacherturnover occurs or teachers switch classrooms this proxy will becontaminated Even given this important limitation cheating inthe classroom last year predicts cheating this year In column 3for example we see that classrooms that cheated in the samesubject last year are 96 percentage points more likely to cheatthis year even after we control for cheating on other subjects inthe same year and cheating in other classes in the school Column4 shows that prior cheating in the school strongly predicts thelikelihood that a classroom will cheat this year

864 QUARTERLY JOURNAL OF ECONOMICS

VC Evidence from Retesting under Controlled Circumstances

Perhaps the most compelling evidence for our methods comesfrom the results of a unique policy intervention In spring 2002the Chicago public schools provided us the opportunity to retestover 100 classrooms under controlled circumstances that pre-cluded cheating a few weeks after the initial ITBS exam wasadministered23 Based on suspicious answer strings and unusualtest score gains on the initial spring 2002 exam we identied aset of classrooms most likely to have cheated In addition wechose a second group of classrooms that had suspicious answerstrings but did not have large test score gains a pattern that isconsistent with a bad teacher who attempts to mask a poorteaching performance by cheating We also retested two sets ofclassrooms as control groups The rst control group consisted ofclasses with large test score gains but no evidence of cheating intheir answer strings a potential indication of effective teachingThese classes would be expected to maintain their gains whenretested subject to possible mean reversion The second controlgroup consisted of randomly selected classrooms

The results of the retests are reported in Table IV Columns1ndash3 correspond to reading columns 4ndash6 are for math For eachsubject we report the test score gain (in standard score units) forthree periods (i) from spring 2001 to the initial spring 2002 test(ii) from the initial spring 2002 test to the retest a few weekslater and (iii) from spring 2001 to the spring 2002 retest Forpurposes of comparison we also report the average gain for allclassrooms in the system for the rst of those measures (the othertwo measures are unavailable unless a classroom was retested)

Classrooms prospectively identied as ldquomost likely to havecheatedrdquo experienced gains on the initial spring 2002 test thatwere nearly twice as large as the typical CPS classroom (seecolumns 1 and 4) On the retest however those excess gainscompletely disappearedmdashthe gains between spring 2001 and thespring 2002 retest were close to the systemwide average Thoseclassrooms prospectively identied as ldquobad teachers likely to havecheatedrdquo also experienced large score declines on the retest (288and 2105 standard score points or roughly two-thirds of a gradeequivalent) Indeed comparing the spring 2001 results with the

23 Full details of this policy experiment are reported in Jacob and Levitt[2003] The results of the retests were used by CPS to initiate disciplinary actionagainst a number of teachers principals and staff

865ROTTEN APPLES

TABLE IVRESULTS OF RETESTS COMPARISON OF RESULTS FOR SPRING 2002 ITBS

AND AUDIT TEST

Category ofclassroom

Reading gains between Math gains between

Spring2001and

spring2002

Spring2001and2002retest

Spring2002and2002retest

Spring2001and

spring2002

Spring2001and2002retest

Spring2002and2002retest

All classrooms inthe Chicagopublic schools 143 mdash mdash 169 mdash mdash

Classroomsprospectivelyidentied asmost likelycheaters(N 5 36 onmath N 5 39on reading) 288 126 2162 300 193 2107

Classroomsprospectivelyidentied as badteacherssuspected ofcheating(N 5 16 onmath N 5 20on reading) 166 78 288 173 68 2105

Classroomsprospectivelyidentied asgood teacherswho did notcheat (N 5 17on math N 517 on reading) 206 211 105 288 255 233

Randomly selectedclassrooms (N 524 overall butonly one test perclassroom) 145 122 223 145 122 223

Results in this table compare test scores at three points in time (spring 2001 spring 2002 and a retestadministered under controlled conditions a few weeks after the initial spring 2002 test) The unit ofobservation is a classroom-subject test pair Only students taking all three tests are included in thecalculations Because of limited data math and reading results for the randomly selected classrooms arecombined Only the rst two columns are available for all CPS classrooms since audits were performed onlyon a subset of classrooms All entries in the table are in standard score units

866 QUARTERLY JOURNAL OF ECONOMICS

spring 2002 retest children in these classrooms had gains onlyhalf as large as the typical yearly CPS gain In stark contrastclassrooms prospectively identied as having ldquogood teachersrdquo(ie classrooms with large gains but not suspicious answerstrings) actually scored even higher on the reading retest thanthey did on the initial test The math scores fell slightly onretesting but these classrooms continued to experience extremelylarge gains The randomly selected classrooms also maintainedalmost all of their gains when retested as would be expected Insummary our methods proved to be extremely effective both inidentifying likely cheaters and in correctly classifying teacherswho achieved large test score gains in a legitimate manner

VI DOES TEACHER CHEATING RESPOND TO INCENTIVES

From the perspective of economics perhaps the most inter-esting question related to teacher cheating is the degree to whichit responds to incentives As noted in the Introduction there weretwo major changes in the incentives faced by teachers and stu-dents over our sample period Prior to 1996 ITBS scores wereprimarily used to provide teachers and parents with a sense ofhow a child was progressing academically Beginning in 1996with the appointment of Paul Vallas as CEO of Schools the CPSlaunched an initiative designed to hold students and teachersaccountable for student learning

The reform had two main elements The rst involved put-ting schools ldquoon probationrdquo if less than 15 percent of studentsscored at or above national norms on the ITBS reading examsThe CPS did not use math performance to determine probationstatus Probation schools that do not exhibit sufcient improve-ment may be reconstituted a procedure that involves closing theschool and dismissing or reassigning all of the school staff24 It isclear from our discussions with teachers and administrators thatbeing on probation is viewed as an extremely undesirable circum-stance The second piece of the accountability reform was an endto social promotionmdashthe practice of passing students to the nextgrade regardless of their academic skills or school performanceUnder the new policy students in third sixth and eighth grademust meet minimum standards on the ITBS in both reading and

24 For a more detailed analysis of the probation policy see Jacob [2002] andJacob and Lefgren [forthcoming]

867ROTTEN APPLES

mathematics in order to be promoted to the next grade Thepromotion standards were implemented in spring 1997 for thirdand sixth graders Promotion decisions are based solely on scoresin reading comprehension and mathematics

Table V presents OLS estimates of the relationship betweenteacher cheating and a variety of classroom and school charac-teristics25 The unit of observation is a classroomsubjectgradeyear The dependent variable is an indicator of whether we sus-pect the classroom cheated using our most stringent denition aclassroom is designated a cheater if both its test score uctua-tions and the suspiciousness of its answer strings are above theninety-fth percentile for that grade subject and year26

In column (1) the policy changes are restricted to have aconstant impact across all classrooms The introduction of thesocial promotion and probation policies is positively correlatedwith the likelihood of classroom cheating although the pointestimates are not statistically signicant at conventional levelsMuch more interesting results emerge when we interact thepolicy changes with the previous yearrsquos test scores for the class-room For both probation and social promotion cheating rates inthe lowest performing classrooms prove to be quite responsive tothe change in incentives In column (2) we see that a classroomone standard deviation below the mean increased its likelihood ofcheating by 043 percentage points in response to the schoolprobation policy and roughly 065 percentage points due to theending of social promotion Given a baseline rate of cheating of11 percent these effects are substantial The magnitude of thesechanges is particularly large considering that no elementaryschool on probation was actually reconstituted during this periodand that the social promotion policy has a direct impact on stu-dents but not direct ramications for teacher pay or rewards Aclassroom one standard deviation above the mean does not seeany signicant change in cheating in response to these two poli-cies Such classes are very unlikely to be located in schools at riskfor being put on probation and also are likely to have few stu-dents at risk for being retained These ndings are robust to

25 Logit and Probit models evaluated at the mean yield comparable resultsso the estimates from a linear probability model are presented for ease ofinterpretation

26 The results are not sensitive to the cheating cutoff used Note that thismeasure may include error due to both false positives and negatives Since themeasurement error is in the dependent variable the primary impact will likely beto simply decrease the precision of our estimates

868 QUARTERLY JOURNAL OF ECONOMICS

TABLE VOLS ESTIMATES OF THE RELATIONSHIP BETWEEN CHEATING AND CLASSROOM

CHARACTERISTICS

Independent variables

Dependent variable 5 Indicator ofclassroom cheating

(1) (2) (3) (4)

Social promotion policy 00011(00013)

00011(00013)

00015(00013)

00023(00009)

School probation policy 00020(00014)

00019(00014)

00021(00014)

00029(00013)

Prior classroom achievement 200047(00005)

200028(00005)

200016(00007)

200028(00007)

Social promotionclassroom achievement 200049(00014)

200051(00014)

200046(00012)

School probationclassroom achievement 200070(00013)

200070(00013)

200064(00013)

Mixed grade classroom 200084(00007)

200085(00007)

200089(00008)

200089(00012)

of students included in ofcial reporting 00252(00031)

00249(00031)

00141(00037)

00131(00037)

Teacher administers exam to ownstudents

00067(00015)

00067(00015)

00066(00015)

00061(00011)

Test form offered for the rst timea 200007(00011)

200007(00011)

200011(00010)

mdash

Average quality of teachersrsquoundergraduate institution

200026(00007)

Percent of teachers who have worked atthe school less than 3 years

200045(00031)

Percent of teachers under 30 years of age 00156(00065)

Percent of students in the school meetingnational norms in reading last year

00001(00000)

Percent free lunch in school 00001(00000)

Predominantly Black school 00068(00019)

Predominantly Hispanic school 200009(00016)

SchoolYear xed effects No No No YesNumber of observations 163474 163474 163474 163474

The unit of observation is classroomgradeyearsubject and the sample includes eight years (1993 to2000) four subjects (reading comprehension and three math sections) and ve grades (three to seven) Thedependentvariable is the cheating indicator derived using the ninety-fth percentile cutoff Robust standarderrors clustered by schoolyear are shown in parentheses Other variables included in the regressions incolumn (1) and (2) include a linear time trend grade cubic terms for the number of students a linear gradevariable and xed effects for subjects The regression shown in column (3) also includes the followingvariables indicators of the percentofstudents in theclassroom who wereBlack Hispanic male receivingfree lunchold for grade in a special education program living in foster care and living with a nonparental relative indicators ofschool size mobility rate and attendance rate and indicators of the percent of teachers in the school Other variablesinclude thepercentof teachersin the schoolwho hada masterrsquos ordoctoraldegreelivedin Chicagoand wereeducationmajors

a Test forms vary only by year so this variable will drop out of the analysis when schoolyear xed effectsare included

869ROTTEN APPLES

adding a number of classroom and school characteristics (column(3)) as well as schoolyear xed effects (column (4))

Cheating also appears to be responsive to other costs andbenets Classrooms that tested poorly last year are much morelikely to cheat For example a classroom with average studentprior achievement one classroom standard deviation below themean is 23 percent more likely to cheat Classrooms with stu-dents in multiple grades are 65 percent less likely to cheat thanclassrooms where all students are in the same grade This isconsistent with the fact that it is likely more difcult for teachersin such classrooms to cheat since they must administer twodifferent test forms to students which will necessarily have dif-ferent correct answers Moreover classes with a higher propor-tion of students who are included in the ofcial test reporting aremore likely to cheatmdasha 10 percentage point increase in the pro-portion of students in a class whose test scores ldquocountrdquo willincrease the likelihood of cheating by roughly 20 percent27

Teachers who administer the exam to their own students are 067percentage pointsmdashapproximately 50 percentmdashmore likely tocheat28

A few other notable results emerge First there is no statis-tically signicant impact on cheating of reusing a test form thathas been administered in a previous year That nding is ofinterest because it suggests that teachers taking old exams andteaching the precise questions to students is not an importantcomponent of what we are detecting as cheating (although anec-dotal evidence suggests this practice exists) Classrooms inschools with lower achievement higher poverty rates and moreBlack students are more likely to cheat Classrooms in schoolswith teachers who graduated from more prestigious undergrad-uate institutions are less likely to cheat whereas classrooms inschools with younger teachers are more likely to cheat

27 Consistent with this nding within cheating classrooms teachers areless likely to alter the answer sheets for students who are excluded from testreporting Teachers also appear to less frequently cheat for boys and for olderstudents

28 In an earlier version of the paper [Jacob and Levitt 2002] we examine forwhich students teachers tend to cheat We nd that teachers are most likely tocheat for students in the second quartile of the national ability distribution(twenty-fth to ftieth percentile) in the previous year consistent with the incen-tives under the probation policy

870 QUARTERLY JOURNAL OF ECONOMICS

VII CONCLUSION

This paper develops an algorithm for determining the preva-lence of teacher cheating on standardized tests and applies theresults to data from the Chicago public schools Our methodsreveal thousands of instances of classroom cheating representing4ndash5 percent of the classrooms each year Moreover we nd thatteacher cheating appears quite responsive to relatively minorchanges in incentives

The obvious benets of high-stakes tests as a means of pro-viding incentives must therefore be weighed against the possibledistortions that these measures induce Explicit cheating ismerely one form of distortion Indeed the kind of cheating wefocus on is one that could be greatly reduced at a relatively lowcost through the implementation of proper safeguards such as isdone by Educational Testing Service on the SAT or GRE exams29

Even if this particular form of cheating were eliminatedhowever there are a virtually innite number of dimensionsalong which strategic school personnel can act to game the cur-rent system For instance Figlio and Winicki [2001] and Rohlfs[2002] document that schools attempt to increase the caloricintake of students on test days and disruptive students areespecially likely to be suspended from school during ofcial test-ing periods [Figlio 2002] It may be prohibitively costly or evenimpossible to completely safeguard the system

Finally this paper ts into a small but growing body ofresearch focused on identifying corrupt or illicit behavior on thepart of economic actors (see Porter and Zona [1993] Fisman[2001] Di Tella and Schargrodsky [2001] and Duggan and Levitt[2002]) Because individuals engaged in such behavior activelyattempt to hide their trails the intellectual exercise associatedwith uncovering their misdeeds differs substantially from thetypical economic application in which the researcher starts witha well-dened measure of the outcome variable (eg earningseconomic growth prots) and then attempts to uncover the de-terminants of these outcomes In the case of corruption there istypically no clear outcome variable making it necessary for the

29 Although even tests administered by the Educational Testing Service arenot immune to cheating as evidenced by recent reports that many of the GREscores from China are fraudulent [Contreras 2002]

871ROTTEN APPLES

researcher to employ nonstandard approaches in generating sucha measure

APPENDIX THE CONSTRUCTION OF SUSPICIOUS STRING MEASURES

This appendix describes in greater detail how we constructeach of our measures of unexpected or suspicious responses

The rst measure focuses on the most unlikely block of iden-tical answers given on consecutive questions Using past testscores future test scores and background characteristics wepredict the likelihood that each student will give each answer oneach question For each item a student has four choices (A B Cor D) only one of which is correct We estimate a multinomiallogit for each item on the exam in order to predict how studentswill respond to each question We estimate the following modelfor each item using information from other students in that yeargrade and subject

(1) Pr~Yisc 5 j 5eb jxs

Sj51J eb jxs

where Y isc indicates the response of student s in class c on itemi the number of possible responses ( J) is four and Xs is a vectorthat includes measures of prior and future student achievementin math and reading as well as demographic variables (such asrace gender and free lunch status) for student s Thus a stu-dentrsquos predicted probability of choosing a particular response isidentied by the likelihood of other students (in the same yeargrade and subject) with similar background characteristicschoosing that response

Notice that by including future as well as prior test scores inthe model we decrease the likelihood that students with unusu-ally good teachers will be identied as cheaters since these stu-dents will likely retain some of the knowledge learned in the baseyear and thus have higher future test scores Also note that byestimating the probability of selecting each possible responserather than simply estimating the probability of choosing thecorrect response we take advantage of any additional informa-tion that is provided by particular response patterns in aclassroom

Using the estimates from this model we calculate the pre-dicted probability that each student would answer each item in

872 QUARTERLY JOURNAL OF ECONOMICS

the way that he or she in fact did

(2) pisc 5ebkxs

S j51J eb j xs

for k

5 response actually chosen by student s on item i

This provides us with one measure per student per item Takingthe product over items within student we calculate the probabil-ity that a student would have answered a string of consecutivequestions from item m to item n as he or she did

(3) pscmn 5 P

i5m

n

pisc

We then take the product across all students in the classroomwho had identical responses in the string If we dene z as astudent Szc

m n as the string of responses for student z from item mto item n and S sc

m n and as the string for student s then we canexpress the product as

(4) pscmn 5 P

s[$ zS icmn5 S sc

mn

pscmn

Note that if there are ns students in class c and each student hasa unique set of responses to these particular items then psc

m n

collapses to pscm n for each student and there will be ns distinct

values within the class On the other extreme if all of the stu-dents in class c have identical responses then there is only onedistinct value of psc

m n We repeat this calculation for all possibleconsecutive strings of length three to seven that is for all Sm n

such that 3 m 2 n 7 To create our rst indicator ofsuspicious string patterns we take the minimum of the predictedblock probability for each classroom

MEASURE 1 M1c 5 mins( ps cm n)

This measure captures the least likely block of identical answersgiven on consecutive questions in the classroom

The second measure of suspicious answer strings is intendedto capture more general patterns of similarity in student re-sponses To construct this measure we rst calculate the resid-

873ROTTEN APPLES

uals for each of the possible choices a student could have made foreach item

(5) e jisc 5 0 2eb jxs

S j51J eb j xs

if j THORN k

5 1 2eb j xs

S j51J eb jxs

if j 5 k

where e jis c is the residual for response j on item i by student s inclassroom c We thus have four separate residuals per studentper item

To create a classroom level measure of the response to item iwe need to combine the information for each student First wesum the residuals for each response across students within aclassroom

(6) ejic 5 Os

e jisc

If there is no within-class correlation in the way that studentsresponded to a particular item this term should be approximatelyzero Second we sum across the four possible responses for eachitem within classrooms At the same time we square each of thecomponent residual measures to accentuate outliers and divideby number of students in the class (nsc) to normalize by class size

(7) v ic 5S j ejic

2

nsc

The statistic vic captures something like the variance of studentresponses on item i within classroom c Notice that we choose torst sum across the residuals of each response across studentsand then sum the classroom level measures for each responserather than summing across responses within student initiallyWe do this in order to emphasize the classroom level tendencies inresponse patterns

Our second measure of suspicious strings is simply the class-room average (across items) of this variance term across all testitems

MEASURE 2 M2c 5 v c 5 yeni vic ni where ni is the number of itemson the exam

Our third measure is simply the variance (as opposed to the

874 QUARTERLY JOURNAL OF ECONOMICS

mean) in the degree of correlation across questions within aclassroom

MEASURE 3 M3c 5 sv c

2 5 yeni (vic 2 v c)2 ni

Our nal indicator focuses on the extent to which a studentrsquosresponse pattern was different from other studentrsquos with thesame aggregate score that year Let qisc equal one if student s inclassroom c answered item i correctly and zero otherwise Let As

equal the aggregate score of student s on the exam We thendetermine what fraction of students at each aggregate score levelanswered each item correctly If we let nsA equal number ofstudents with an aggregate score of A then this fraction q i

A canbe expressed as

(8) q iA 5

Ss[$ z Az5As qisc

nsA

We then calculate a measure of how much the response pattern ofstudent s differed from the response pattern of other studentswith the same aggregate score We do so by subtracting a stu-dentrsquos answer on item i from the mean response of all studentswith aggregate score A squaring these deviations and then sum-ming across all items on the exam

(9) Zsc 5 Oi

~qisc 2 q iA2

We then subtract out the mean deviation for all students with thesame aggregate score Z A and sum the students within eachclassroom to obtain our nal indicator

MEASURE 4 M4c 5 yens (Zs c 2 Z A)

KENNEDY SCHOOL OF GOVERNMENT HARVARD UNIVERSITY

UNIVERSITY OF CHICAGO AND AMERICAN BAR FOUNDATION

REFERENCES

Aiken Linda R ldquoDetecting Understanding and Controlling for Cheating onTestsrdquo Research in Higher Education XXXII (1991) 725ndash736

Angoff William H ldquoThe Development of Statistical Indices for Detecting Cheat-ersrdquo Journal of the American Statistical Association LXIX (1974) 44ndash49

Baker George ldquoIncentive Contracts and Performance Measurementrdquo Journal ofPolitical Economy C (1992) 598ndash614

Bellezza Francis S and Suzanne F Bellezza ldquoDetection of Cheating on Multiple-Choice Tests by Using Error Similarity Analysisrdquo Teaching of PsychologyXVI (1989) 151ndash155

Carnoy Martin and Susanna Loeb ldquoDoes External Accountability Affect Student

875ROTTEN APPLES

Outcomes A Cross-State Analysisrdquo Working Paper Stanford UniversitySchool of Education 2002

Contreras Russell ldquoInquiry Uncovers Possible Cheating on GRE in Asiardquo Asso-ciated Press August 7 2002

Cullen Julie Berry and Randall Reback ldquoTinkering Toward Accolades SchoolGaming under a Performance Accountability Systemrdquo Working paper Uni-versity of Michigan 2002

Deere Donald and Wayne Strayer ldquoPutting Schools to the Test School Account-ability Incentives and Behaviorrdquo Working Paper Texas AampM University2002

DiTella Rafael and Ernesto Schargrodsky ldquoThe Role of Wages and Auditingduring a Crackdown on Corruption in the City of Buenos Airesrdquo unpublishedmanuscript Harvard Business School 2001

Duggan Mark and Steven Levitt ldquoWinning Isnrsquot Everything Corruption in SumoWrestlingrdquo American Economic Review XCII (2002) 1594ndash1605

Figlio David N ldquoTesting Crime and Punishmentrdquo Working Paper University ofFlorida 2002

Figlio David N and Joshua Winicki ldquoFood for Thought The Effects of SchoolAccountability Plans on School Nutritionrdquo Working Paper University ofFlorida 2001

Figlio David N and Lawrence S Getzler ldquoAccountability Ability and DisabilityGaming the Systemrdquo Working Paper University of Florida 2002

Fisman Ray ldquoEstimating the Value of Political Connectionsrdquo American EconomicReview XCI (2001) 1095ndash1102

Frary Robert B ldquoStatistical Detection of Multiple-Choice Answer Copying Re-view and Commentaryrdquo Applied Measurement in Education VI (1993)153ndash165

Frary Robert B T Nicholas Tideman and Thomas Watts ldquoIndices of Cheatingon Multiple-Choice Testsrdquo Journal of Educational Statistics II (1977)235ndash256

Glaeser Edward and Andrei Shleifer ldquoA Reason for Quantity Regulationrdquo Amer-ican Economic Review XCI (2001) 431ndash435

Grissmer David W Ann Flanagan et al ldquoImproving Student AchievementWhat NAEP Test Scores Tell Usrdquo RAND Corporation MR-924-EDU 2000

Hanushek Eric A and Margaret E Raymond ldquoImproving Educational QualityHow Best to Evaluate Our Schoolsrdquo Conference paper Federal Reserve Bankof Boston 2002

Hofkins Diane ldquoCheating lsquoRifersquo in National Testsrdquo New York Times EducationalSupplement June 16 1995

Holland Paul W ldquoAssessing Unusual Agreement between the Incorrect Answersof Two Examinees Using the K-Index Statistical Theory and EmpiricalSupportrdquo Technical Report No 96-5 Educational Testing Service 1996

Holmstrom Bengt and Paul Milgrom ldquoMultitask Principal-Agent Analyses In-centive Contracts Asset Ownership and Job Designrdquo Journal of Law Eco-nomics and Organization VII (1991) 24ndash51

Jacob Brian A ldquoAccountability Incentives and Behavior The Impact of High-Stakes Testing in the Chicago Public Schoolsrdquo Working Paper No 8968National Bureau of Economic Research 2002

Jacob Brian A and Lars Lefgren ldquoThe Impact of Teacher Training on StudentAchievement Quasi-Experimental Evidence from School Reform Efforts inChicagordquo Journal of Human Resources forthcoming

Jacob Brian A and Steven Levitt ldquoCatching Cheating Teachers The Results ofan Unusual Experiment in Implementing Theoryrdquo Working Paper No 9413National Bureau of Economic Research 2003

Klein Stephen P Laura S Hamilton et al ldquoWhat Do Test Scores in Texas TellUsrdquo RAND Corporation 2000

Kolker Claudia ldquoTexas Offers Hard Lessons on School Accountabilityrdquo LosAngeles Times April 14 1999

Lindsay Drew ldquoWhodunit Ofcials Find Thousands of Erasures on Standard-ized Tests and Suspect Tamperingrdquo Education Week (1996) 25ndash29

Loughran Regina and Thomas Comiskey ldquoCheating the Children Educator

876 QUARTERLY JOURNAL OF ECONOMICS

Misconduct on Standardized Testsrdquo Report of the City of New York SpecialCommissioner of Investigation for the New York City School District 1999

Marcus John ldquoFaking the Graderdquo Boston Magazine February 2000May Meredith ldquoState Fears Cheating by Teachersrdquo San Francisco Chronicle

October 4 2000Perlman Carol L ldquoResults of a Citywide Testing Program Audit in Chicagordquo

Paper for American Educational Research Association annual meeting Chi-cago IL 1985

Porter Robert H and J Douglas Zona ldquoDetection of Bid Rigging in ProcurementAuctionsrdquo Journal of Political Economy CI (1993) 518ndash538

Richards Craig E and Tian Ming Sheu ldquoThe South Carolina School IncentiveReward Program A Policy Analysisrdquo Economics of Education Review XI(1992) 71ndash86

Rohlfs Chris ldquoEstimating Classroom Learning Using the School Breakfast Pro-gram as an Instrument for Attendance and Class Size Evidence from ChicagoPublic Schoolsrdquo Unpublished manuscript University of Chicago 2002

Tysome Tony ldquoCheating Purge Inspectors outrdquo New York Times Higher Educa-tion Supplement August 19 1994

Wollack James A ldquoA Nominal Response Model Approach for Detecting AnswerCopyingrdquo Applied Psychological Measurement XX (1997) 144ndash152

877ROTTEN APPLES

Page 12: rotten apples: an investigation of the prevalence and predictors of ...

are extreme on both the test score uctuation and suspiciousanswer string measures provides our estimate of cheating

Figure II demonstrates how our identication strategy worksempirically The horizontal axis in the gure ranks classroomsaccording to how suspicious their answer strings are The verticalaxis is the fraction of the classrooms that are above the ninety-fth percentile on the unusual test score uctuations measure12

The graph combines all classrooms and all subjects in our data13

Over most of the range (roughly from zero to the seventy-fthpercentile on the horizontal axis) there is a slight positive corre-lation between unusual test scores and suspicious answer strings

12 The choice of the ninety-fth percentile is somewhat arbitrary In theempirical work that follows we also consider the eightieth and ninetieth percen-tiles The choice of the cutoff does not affect the basic patterns observed in thedata

13 To construct the gure classes were rank ordered according to theiranswer strings and divided into 200 equally sized segments The circles in thegure represent these 200 local means The line displayed in the graph is thetted value of a regression with a seventh-order polynomial in a classroomrsquos rankon the suspicious strings measure

FIGURE IIThe Relationship between Unusual Test Scores and Suspicious Answer Strings

The horizontal axis reects a classroomrsquos percentile rank in the distribution ofsuspicious answer strings within a given grade subject and year with zerorepresenting the least suspicious classroom and one representing the most sus-picious classroom The vertical axis is the probability that a classroom will beabove the ninety-fth percentile on our measure of unusual test score uctuationsThe circles in the gure represent averages from 200 equally spaced cells alongthe x-axis The predicted line is based on a probit model estimated with seventh-order polynomials in the suspicious string measure

854 QUARTERLY JOURNAL OF ECONOMICS

This is the part of the distribution that is unlikely to containmany cheating classrooms Under the assumption that the corre-lation between our two measures is constant in noncheatingclassrooms over the whole distribution we would predict that thestraight line observed for all but the right-hand portion of thegraph would continue all the way to the right edge of the graph

Instead as one approaches the extreme right tail of thedistribution of suspicious answer strings the probability of largetest score uctuations rises dramatically That sharp rise weargue is a consequence of cheating To estimate the prevalence ofcheating we simply compare the actual area under the curve inthe far right tail of Figure I to the predicted area under the curvein the right tail projecting the linear relationship observed fromthe ftieth to seventy-fth percentile out to the ninety-ninthpercentile Because this identication strategy is necessarily in-direct we devote a great deal of attention to presenting a widevariety of tests of the validity of our approach the sensitivity ofthe results to alternative assumptions and the plausibility of ourndings

IV DATA AND INSTITUTIONAL BACKGROUND

Elementary students in Chicago public schools take a stan-dardized multiple-choice achievement exam known as the IowaTest of Basic Skills (ITBS) The ITBS is a national norm-refer-enced exam with a reading comprehension section and threeseparate math sections14 Third through eighth grade students inChicago are required to take the exams each year Most schoolsadminister the exams to rst and second grade students as wellalthough this is not a district mandate

Our base sample includes all students in third to seventhgrade for the years 1993ndash200015 For each student we have thequestion-by-question answer string on each yearrsquos tests schooland classroom identiers the full history of prior and future testscores and demographic variables including age sex race andfree lunch eligibility We also have information about a wide

14 There are also other parts of the test which are either not included inofcial school reporting (spelling punctuation grammar) or are given only inselect grades (science and social studies) for which we do not have information

15 We also have test scores for eighth graders but we exclude them becauseour algorithm requires test score data for the following year and the ITBS test isnot administered to ninth graders

855ROTTEN APPLES

range of school-level characteristics We do not however haveindividual teacher identiers so we are unable to directly linkteachers to classrooms or to track a particular teacher over time

Because our cheating proxies rely on comparisons to past andfuture test scores we drop observations that are missing readingor math scores in either the preceding year or the following year(in addition to those with missing test scores in the baselineyear)16 Less than one-half of 1 percent of students with missingdemographic data are also excluded from the analysis Finallybecause our algorithms for identifying cheating rely on identify-ing suspicious patterns within a classroom our methods havelittle power in classrooms with small numbers of students Con-sequently we drop all classrooms for which we have fewer thanten valid students in a particular grade after our other exclusions(roughly 3 percent of students) A handful of classrooms recordedas having more than 40 studentsmdashpresumably multiple class-rooms not separately identied in the datamdashare also droppedOur nal data set contains roughly 20000 students per grade peryear distributed across approximately 1000 classrooms for atotal of over 40000 classroom-years of data (with four subjecttests per classroom-year) and over 700000 student-yearobservations

Summary statistics for the full sample of classrooms areshown in Table I where the unit of observation is at the level ofclasssubjectyear As in many urban school districts students inChicago are disproportionately poor and minority Roughly 60percent of students are Black and 26 percent are Hispanic withnearly 85 percent receiving free or reduced-price lunch Less than29 percent of students scored at or above national norms inreading

The ITBS exams are administered over a weeklong period inearly May Third grade teachers are permitted to administer theexam to their own students while other teachers switch classes toadminister the exams The exams are generally delivered to theschools one to two weeks before testing and are supposed to be

16 The exact number of students with missing test data varies by year andgrade Overall we drop roughly 12 percent of students because of missing testscore information in the baseline year These are students who (i) were not testedbecause of placement in bilingual or special education programs or (ii) simplymissed the exam that particular year In addition to these students we also dropapproximately 13 percent of students who were missing test scores in either thepreceding or following years Test data may be missing either because a studentdid not attend school on the days of the test or because the student transferredinto the CPS system in the current year or left the system prior to the next yearof testing

856 QUARTERLY JOURNAL OF ECONOMICS

kept in a secure location by the principal or the schoolrsquos testcoordinator an individual in the school designated to coordinatethe testing process (often a counselor or administrator) Each

TABLE ISUMMARY STATISTICS

Mean(sd)

Classroom characteristicsMixed grade classroom 0073Teacher administers exams to her own students (3rd grade) 0206Percent of students who were tested and included in ofcial

reporting 0883Average prior achievement 20004(as deviation from yeargradesubject mean) (0661) Black 0595 Hispanic 0263 Male 0495 Old for grade 0086 Living in foster care 0044 Living with nonparental relative 0104Cheatermdash95th percentile cutoff 0013School characteristicsAverage quality of teachersrsquo undergraduate institution

in the school22550(0877)

Percent of teachers who live in Chicago 0712Percent of teachers who have an MA or a PhD 0475Percent of teachers who majored in education 0712Percent of teachers under 30 years of age 0114Percent of teachers at the school less than 3 years 0547 students at national norms in reading last year 288 students receiving free lunch in school 847Predominantly Black school 0522Predominantly Hispanic school 0205Mobility rate in school 286Attendance rate in school 926School size 722

(317)Accountability policySocial promotion policy 0215School probation policy 0127Test form offered for the rst time 0371Number of observations 163474

The data summarized above include one observation per classroomyearsubjectThe sample includes allstudents in third to seventh grade for the years 1993ndash2000 We drop observations for the following reasons(a) missing reading or math scores in either the baseline year the preceding year or the following year (b)missing demographic data (c) classrooms for which we have fewer than ten valid students in a particulargrade or more than 40 students See the discussion in the text for a more detailed explanation of the sampleincluding the percentages excluded for various reasons

857ROTTEN APPLES

section of the exam consists of 30 to 60 multiple choice questionswhich students are given between 30 and 75 minutes to com-plete17 Students mark their responses on answer sheets whichare scanned to determine a studentrsquos score There is no penaltyfor guessing A studentrsquos raw score is simply calculated as thesum of correct responses on the exam The raw score is thentranslated into a grade equivalent

After collecting student exams teachers or administratorsthen ldquocleanrdquo the answer keys erasing stray pencil marks remov-ing dirt or debris from the form and darkening item responsesthat were only faintly marked by the student At the end of thetesting week the test coordinators at each school deliver thecompleted answer keys and exams to the CPS central ofceSchool personnel are not supposed to keep copies of the actualexams although school ofcials acknowledge that a number ofteachers each year do so The CPS has administered three differ-ent versions of the ITBS between 1993 and 2000 The CPS alter-nates forms each year with new forms being offered for the rsttime in 1993 1994 and 199718

V THE PREVALENCE OF TEACHER CHEATING

As noted earlier our estimates of the prevalence of cheatingare derived from a comparison of the actual number of classroomsthat are above some threshold on both of our cheating indicatorsrelative to the number we would expect to observe based on thecorrelation between these two measures in the 50thndash75th percen-tiles of the suspicious string measure The top panel of Table IIpresents our estimates of the percentage of classrooms that arecheating on average on a given subject test (ie reading compre-hension or one of the three math tests) in a given year19 Because

17 The mathematics and reading tests measure basic skills The readingcomprehension exam consists of three to eight short passages followed by up tonine questions relating to the passage The math exam consists of three sectionsthat assess number concepts problem-solving and computation

18 These three forms are used for retesting summer school testing andmidyear testing as well so that it is likely that over the years teachers have seenthe same exam on a number of occasions

19 It is important to note that we exclude a particular form of cheating thatappears to be quite prevalent in the data teachers randomly lling in answers leftblank by students at the end of the exam In many classrooms almost everystudent will end the test with a long string of the identical answers (typically allthe same response like ldquoBrdquo or ldquoDrdquo) The fact that almost all students in the classcoordinate on the same pattern strongly suggests that the students themselvesdid not ll in the blanks or were under explicit instructions by the teacher to do

858 QUARTERLY JOURNAL OF ECONOMICS

the decision as to what cutoff signies a ldquohighrdquo value on ourcheating indicators is arbitrary we present a 3 3 3 matrix ofestimates using three different thresholds (eightieth ninetiethand ninety-fth percentiles) for each of our cheating measuresThe estimated prevalence of cheaters ranges from 11 percent to21 percent depending on the particular set of cutoffs used Aswould be expected the number of cheaters is generally decliningas higher thresholds are employed Nonetheless it is encouragingthat over such a wide set of cutoffs the range of estimates isrelatively tight

The bottom panel of Table II presents estimates of the per-centage of classrooms that are cheating on any of the four subjecttests in a particular year If every classroom that cheated did soonly on one subject test then the results in the bottom panel

so Since there is no penalty for guessing on the test lling in the blanks can onlyincrease student test scores While this type of teacher behavior is likely to beviewed by many as unethical we do not make it the focus of our analysis because(1) it is difcult to provide denitive evidence of such behavior (a teacher couldargue that he or she instructed students well in advance of the test to ll in allblanks with the letter ldquoCrdquo as part of good test-taking strategy) and (2) in ourminds it is categorically different than a teacher who systematically changesstudent responses to the correct answer

TABLE IIESTIMATED PREVALENCE OF TEACHER CHEATING

Cutoff for suspicious answerstrings (ANSWERS)

Cutoff for test score uctuations (SCORE)

80th percentile 90th percentile 95th percentile

Percent cheating on a particular test80th percentile 21 21 1890th percentile 18 18 1595th percentile 13 13 11

Percent cheating on at least one of the four tests given80th percentile 45 56 5390th percentile 42 49 4495th percentile 35 38 34

The top panel of the table presents estimates of the percentage of classrooms cheating on a particularsubject test in a given year based on three alternative cutoffs for ANSWERS and SCORE In all cases theprevalence of cheating is based on the excess number of classrooms with unexpected test score uctuationamong classes with suspicious answer strings relative to classes that do not have suspicious answer stringsThe bottom panel of the table presents estimates of the percentage of classrooms cheating on at least one ofthe four subject tests that comprise the overall test In the bottom panel classrooms that cheat on more thanone subject test are counted only once Our sample includes over 35000 3rdndash7th grade classrooms in theChicago public schools for the years 1993ndash1999

859ROTTEN APPLES

would simply be four times the results in the top panel In manyinstances however classrooms appear to cheat on multiple sub-jects Thus the prevalence rates range from 34ndash56 percent of allclassrooms20 Because of the necessarily indirect nature of ouridentication strategy we explore a range of supplementary testsdesigned to assess the validity of the estimates21

VA Simulation Results

Our prevalence estimates may be biased downward or up-ward depending on the extent to which our algorithm fails tosuccessfully identify all instances of cheating (ldquofalse negativesrdquo)versus the frequency with which we wrongly classify a teacher ascheating (ldquofalse positiverdquo)

One simple simulation exercise we undertook with respect topossible false positives involves randomly assigning students tohypothetical classrooms These synthetic classrooms thus consistof students who in actuality had no connection to one another Wethen analyze these hypothetical classrooms using the same algo-rithm applied to the actual data As one would hope no evidenceof cheating was found in the simulated classes Indeed the esti-mated prevalence of cheating was slightly negative in this simu-lation (ie classrooms with large test score increases in thecurrent year followed by big declines the next year were slightlyless likely to have unusual patterns of answer strings) Thusthere does not appear to be anything mechanical in our algorithmthat generates evidence of cheating

A second simulation involves articially altering a class-

20 Computation of the overall prevalence is somewhat complicated becauseit involves calculating not only how many classrooms are actually above thethresholds on multiple subject tests but also how frequently this would occur inthe absence of cheating Details on these calculations are available from theauthors

21 In an earlier version of this paper [Jacob and Levitt 2003] we present anumber of additional sensitivity analyses that conrm the basic results First wetest whether our results are sensitive to the way in which we predict the preva-lence of high values of ANSWERS and SCORE in the upper part of the otherdistribution (eg tting a linear or quadratic model to the data in the lowerportion of the distribution versus simply using the average value from the 50thndash75th percentiles) We nd our results are extremely robust Second we examinewhether our results might simply be due to mean reversion by testing whetherclassrooms with suspicious answer strings are less likely to maintain large testscore gains Among a sample of classrooms with large test score gains we nd thatmean reversion is substantially greater among the set of classes with highlysuspicious answer strings Third we investigate whether we might be detectingstudent rather than teacher cheating by examining whether students who havesuspicious answer strings in one year are more likely to have suspicious answerstrings in other years We nd that they do not

860 QUARTERLY JOURNAL OF ECONOMICS

roomrsquos answer strings in ways that we believe mimic teachercheating or outstanding teaching22 We are then able to see howfrequently we label these altered classes as ldquocheatingrdquo and howthis varies with the extent and nature of the changes we make toa classroomrsquos answers We simulate two different types of teachercheating The rst is a very naive version in which a teacherstarts cheating at the same question for a number of students andchanges consecutive questions to the right answers for thesestudents creating a block of identical and correct responses Thesecond type of cheating is much more sophisticated we randomlychange answers from incorrect to correct for selected students ina class The outcome of this simulation reveals that our methodsare surprisingly weak in detecting moderate cheating even in thenave case For instance when three consecutive answers arechanged to correct for 25 percent of a class we catch such behav-ior only 4 percent of the time Even when six questions arechanged for half of the class we catch the cheating in less than 60percent of the cases Only when the cheating is very extreme (egsix answers changed for the entire class) are we very likely tocatch the cheating The explanation for this poor performance isthat classrooms with low test-score gains even with the boostprovided by this cheating do not have large enough test scoreuctuations to make them appear unusual because there is somuch inherent volatility in test scores When the cheating takesthe more sophisticated form described above we are even lesssuccessful at catching low and moderate cheaters (2 percent and34 percent respectively in the rst two scenarios describedabove) but we almost always detect very extreme cheating

From a political or equity perspective, however, we may be even more concerned with false positives, specifically the risk of accusing highly effective teachers of cheating. Hence we also simulate the impact of a good teacher, changing certain answers for certain students from incorrect to correct responses. Our manipulation in this case differs from the cheating simulations above in two ways: (1) we allow some of the gains to be preserved in the following year, and (2) we alter questions such that students will not get random answers correct, but rather will tend to show the greatest improvement on the easiest questions that the students were getting wrong.

22. See Jacob and Levitt [2003] for the precise details of this simulation exercise.


These simulation results suggest that only rarely are good teachers mistakenly labeled as cheaters. For instance, a teacher whose prowess allows him or her to raise test scores by six questions for half the students in the class (more than a 0.5 grade equivalent increase on average for the class) is labeled a cheater only 2 percent of the time. In contrast, a naive cheater with the same test score gains is correctly identified as a cheater in more than 50 percent of cases.

In another test for false positives, we explored whether we might be mistaking emphasis on certain subject material for cheating. For example, if a math teacher spends several months on fractions with a particular class, one would expect the class to do particularly well on all of the math questions relating to fractions, and perhaps worse than average on other math questions. One might imagine a similar scenario in reading if, for example, a teacher creates a lengthy unit on the Underground Railroad, which later happens to appear in a passage on the reading comprehension exam. To examine this, we analyze the nature of the across-question correlations in cheating versus noncheating classrooms. We find that the most highly correlated questions in cheating classrooms are no more likely to measure the same skill (e.g., fractions, geometry) than in noncheating classrooms. The implication of these results is that the prevalence estimates presented above are likely to substantially understate the true extent of cheating (there is little evidence of false positives), but we frequently miss moderate cases of cheating.
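A stylized version of this check might look as follows, assuming a 0/1 right/wrong matrix for one classroom and a per-item skill label; this is an illustration of the idea, not the paper's exact procedure:

```python
import numpy as np

def top_pair_same_skill_share(correct, skills, k=10):
    """For one classroom: share of the k most-correlated item pairs that
    measure the same skill (e.g., 'fractions').

    correct : (n_students, n_items) 0/1 right/wrong matrix
    skills  : length n_items sequence of skill labels
    """
    keep = correct.std(axis=0) > 0                 # drop constant items
    sub = correct[:, keep]
    labels = [s for s, kept in zip(skills, keep) if kept]
    c = np.corrcoef(sub, rowvar=False)             # item-by-item correlations
    iu = np.triu_indices_from(c, k=1)              # unique item pairs
    order = np.argsort(c[iu])[::-1][:k]            # k most correlated pairs
    same = [labels[i] == labels[j]
            for i, j in zip(iu[0][order], iu[1][order])]
    return float(np.mean(same))
```

Comparing this share between suspected-cheating and other classrooms implements the test described above: genuine curricular emphasis should make same-skill pairs overrepresented, whereas answer-changing should not.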

V.B. The Correlation of Cheating Indicators across Subjects, Classrooms, and Years

If what we are detecting is truly cheating, then one would expect that a teacher who cheats on one part of the test would be more likely to cheat on other parts of the test. Also, a teacher who cheats one year would be more likely to cheat the following year. Finally, to the extent that cheating is either condoned by the principal or carried out by the test coordinator, one would expect to find multiple classes in a school cheating in any given year, and perhaps even that cheating in a school one year predicts cheating in future years. If what we are detecting is not cheating, then one would not necessarily expect to find strong correlation in our cheating indicators across exams for a specific classroom, across classrooms, or across years.


Table III reports regression results testing these predictions. The dependent variable is an indicator for whether we believe a classroom is likely to be cheating on a particular subject test, using our most stringent definition (above the ninety-fifth percentile on both cheating indicators).

TABLE III
PATTERNS OF CHEATING WITHIN CLASSROOMS AND SCHOOLS

Dependent variable = Class suspected of cheating (mean of the dependent variable = 0.011)

                                                       Full sample          Sample of classes and
                                                                            schools that existed
                                                                            in the prior year
Independent variables                                 (1)        (2)        (3)        (4)
--------------------------------------------------------------------------------------------
Classroom cheated on exactly one other               0.105      0.103      0.101      0.101
  subject this year                                 (0.008)    (0.008)    (0.009)    (0.009)
Classroom cheated on exactly two other               0.289      0.285      0.243      0.243
  subjects this year                                (0.027)    (0.027)    (0.031)    (0.031)
Classroom cheated on all three other                 0.627      0.622      0.595      0.595
  subjects this year                                (0.051)    (0.051)    (0.054)    (0.054)
Cheating rate among all other classes in               —        0.166      0.134      0.129
  the school this year, on this subject                        (0.030)    (0.027)    (0.027)
Cheating rate among all other classes in               —        0.023      0.059      0.045
  the school this year, on other subjects                      (0.024)    (0.026)    (0.029)
Cheating in this classroom in this                     —          —        0.096      0.091
  subject last year                                            	       (0.012)    (0.012)
Number of other subjects this classroom                —          —        0.023      0.018
  cheated on last year                                                    (0.004)    (0.004)
Cheating in this classroom ever in the past            —          —          —        0.006
                                                                                     (0.002)
Cheating rate among other classrooms in                —          —          —        0.090
  this school in past years                                                          (0.040)
Fixed effects for grade × subject × year              Yes        Yes        Yes        Yes
R²                                                   0.090      0.093      0.109      0.109
Number of observations                             165,578    165,578     94,182     94,170

The dependent variable is an indicator for whether a classroom is above the 95th percentile on both our suspicious strings and unusual test score measures of cheating on a particular subject test. Estimation is done using a linear probability model. The unit of observation is classroom × grade × year × subject. For columns that include measures of cheating in prior years, observations where that classroom or school does not appear in the data in the prior year are excluded. Standard errors are clustered at the school level to take into account correlations across classrooms as well as serial correlation.


The baseline probability of qualifying as a cheater for this cutoff is 1.1 percent. To fully appreciate the magnitude of the effects implied by the table, it is important to keep this very low baseline in mind. We report estimates from linear probability models (probits yield similar marginal effects), with standard errors clustered at the school level.

Column 1 of Table III shows that cheating on other tests in the same year is an extremely powerful predictor of cheating in a different subject. If a classroom cheats on exactly one other subject test, the predicted probability of cheating on this test increases by over ten percentage points. Since the baseline cheating rates are only 1.1 percent, classrooms cheating on exactly one other test are ten times more likely to have cheated on this subject than are classrooms that did not cheat on any of the other subjects (which is the omitted category). Classrooms that cheat on two other subjects are almost 30 times more likely to cheat on this test, relative to those not cheating on other tests. If a class cheats on all three other subjects, it is 50 times more likely to also cheat on this test.
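A sketch of this specification, assuming a classroom-by-subject-by-year DataFrame with hypothetical column names; the clustered covariance call is standard statsmodels usage:

```python
import statsmodels.formula.api as smf

def cheating_lpm(df):
    """Linear probability model: suspected-cheater indicator regressed on
    indicators for cheating on exactly 1, 2, or 3 other subjects (0 is the
    omitted category), with grade x subject x year fixed effects and
    standard errors clustered by school."""
    model = smf.ols(
        "cheat ~ C(n_other_subjects_cheated) + C(grade):C(subject):C(year)",
        data=df,
    )
    return model.fit(
        cov_type="cluster",
        cov_kwds={"groups": df["school_id"]},
    )
```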

There also is evidence of correlation in cheating within schools. A ten-percentage-point increase in cheating classrooms in a school (excluding the classroom in question) on the same subject test raises the likelihood this class cheats by roughly 0.16 percentage points. This potentially suggests some role for centralized cheating by a school counselor, test coordinator, or the principal, rather than by teachers operating independently. There is little evidence that cheating rates within the school on other subject tests affect cheating on this test.

When making comparisons across years (columns 3 and 4), it is important to note that we do not actually have teacher identifiers. We do, however, know what classroom a student is assigned to. Thus, we can only compare the correlation between past and current cheating in a given classroom. To the extent that teacher turnover occurs or teachers switch classrooms, this proxy will be contaminated. Even given this important limitation, cheating in the classroom last year predicts cheating this year. In column 3, for example, we see that classrooms that cheated in the same subject last year are 9.6 percentage points more likely to cheat this year, even after we control for cheating on other subjects in the same year and cheating in other classes in the school. Column 4 shows that prior cheating in the school strongly predicts the likelihood that a classroom will cheat this year.


V.C. Evidence from Retesting under Controlled Circumstances

Perhaps the most compelling evidence for our methods comes from the results of a unique policy intervention. In spring 2002 the Chicago public schools provided us the opportunity to retest over 100 classrooms under controlled circumstances that precluded cheating, a few weeks after the initial ITBS exam was administered.23 Based on suspicious answer strings and unusual test score gains on the initial spring 2002 exam, we identified a set of classrooms most likely to have cheated. In addition, we chose a second group of classrooms that had suspicious answer strings but did not have large test score gains, a pattern that is consistent with a bad teacher who attempts to mask a poor teaching performance by cheating. We also retested two sets of classrooms as control groups. The first control group consisted of classes with large test score gains but no evidence of cheating in their answer strings, a potential indication of effective teaching. These classes would be expected to maintain their gains when retested, subject to possible mean reversion. The second control group consisted of randomly selected classrooms.

The results of the retests are reported in Table IV. Columns 1–3 correspond to reading; columns 4–6 are for math. For each subject we report the test score gain (in standard score units) for three periods: (i) from spring 2001 to the initial spring 2002 test, (ii) from the initial spring 2002 test to the retest a few weeks later, and (iii) from spring 2001 to the spring 2002 retest. For purposes of comparison, we also report the average gain for all classrooms in the system for the first of those measures (the other two measures are unavailable unless a classroom was retested).
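These three gain measures are simple to compute once classrooms are matched across the three test administrations; a sketch under assumed column names:

```python
import pandas as pd

def retest_gains(scores: pd.DataFrame) -> pd.DataFrame:
    """Compute the three gain measures reported in Table IV for each
    classroom-subject cell, keeping only students who took all three tests.
    Expects columns: classroom, subject, s2001, s2002, retest2002
    (standard score units; column names hypothetical)."""
    complete = scores.dropna(subset=["s2001", "s2002", "retest2002"])
    g = complete.assign(
        gain_01_02=complete.s2002 - complete.s2001,           # (i)
        gain_02_retest=complete.retest2002 - complete.s2002,  # (ii)
        gain_01_retest=complete.retest2002 - complete.s2001,  # (iii)
    )
    return g.groupby(["classroom", "subject"])[
        ["gain_01_02", "gain_02_retest", "gain_01_retest"]
    ].mean()
```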

Classrooms prospectively identified as "most likely to have cheated" experienced gains on the initial spring 2002 test that were nearly twice as large as the typical CPS classroom (see columns 1 and 4). On the retest, however, those excess gains completely disappeared: the gains between spring 2001 and the spring 2002 retest were close to the systemwide average. Those classrooms prospectively identified as "bad teachers likely to have cheated" also experienced large score declines on the retest (-8.8 and -10.5 standard score points, or roughly two-thirds of a grade equivalent).

23. Full details of this policy experiment are reported in Jacob and Levitt [2003]. The results of the retests were used by CPS to initiate disciplinary action against a number of teachers, principals, and staff.


TABLE IV
RESULTS OF RETESTS: COMPARISON OF RESULTS FOR SPRING 2002 ITBS AND AUDIT TEST

                                           Reading gains between         Math gains between
                                        Spring    Spring    Spring    Spring    Spring    Spring
                                        2001-     2001-     2002-     2001-     2001-     2002-
                                        spring    2002      2002      spring    2002      2002
Category of classroom                   2002      retest    retest    2002      retest    retest
------------------------------------------------------------------------------------------------
All classrooms in the Chicago
  public schools                         14.3       —         —        16.9       —         —
Classrooms prospectively identified
  as most likely cheaters (N = 36
  on math, N = 39 on reading)            28.8      12.6     -16.2      30.0      19.3     -10.7
Classrooms prospectively identified
  as bad teachers suspected of
  cheating (N = 16 on math,
  N = 20 on reading)                     16.6       7.8      -8.8      17.3       6.8     -10.5
Classrooms prospectively identified
  as good teachers who did not
  cheat (N = 17 on math,
  N = 17 on reading)                     20.6      21.1       0.5      28.8      25.5      -3.3
Randomly selected classrooms
  (N = 24 overall, but only one
  test per classroom)                    14.5      12.2      -2.3      14.5      12.2      -2.3

Results in this table compare test scores at three points in time (spring 2001, spring 2002, and a retest administered under controlled conditions a few weeks after the initial spring 2002 test). The unit of observation is a classroom-subject test pair. Only students taking all three tests are included in the calculations. Because of limited data, math and reading results for the randomly selected classrooms are combined. Only the first column in each panel is available for all CPS classrooms, since audits were performed only on a subset of classrooms. All entries in the table are in standard score units.


Indeed, comparing the spring 2001 results with the spring 2002 retest, children in these classrooms had gains only half as large as the typical yearly CPS gain. In stark contrast, classrooms prospectively identified as having "good teachers" (i.e., classrooms with large gains but not suspicious answer strings) actually scored even higher on the reading retest than they did on the initial test. The math scores fell slightly on retesting, but these classrooms continued to experience extremely large gains. The randomly selected classrooms also maintained almost all of their gains when retested, as would be expected. In summary, our methods proved to be extremely effective both in identifying likely cheaters and in correctly classifying teachers who achieved large test score gains in a legitimate manner.

VI. DOES TEACHER CHEATING RESPOND TO INCENTIVES?

From the perspective of economics, perhaps the most interesting question related to teacher cheating is the degree to which it responds to incentives. As noted in the Introduction, there were two major changes in the incentives faced by teachers and students over our sample period. Prior to 1996, ITBS scores were primarily used to provide teachers and parents with a sense of how a child was progressing academically. Beginning in 1996, with the appointment of Paul Vallas as CEO of Schools, the CPS launched an initiative designed to hold students and teachers accountable for student learning.

The reform had two main elements. The first involved putting schools "on probation" if less than 15 percent of students scored at or above national norms on the ITBS reading exams. The CPS did not use math performance to determine probation status. Probation schools that do not exhibit sufficient improvement may be reconstituted, a procedure that involves closing the school and dismissing or reassigning all of the school staff.24 It is clear from our discussions with teachers and administrators that being on probation is viewed as an extremely undesirable circumstance. The second piece of the accountability reform was an end to social promotion, the practice of passing students to the next grade regardless of their academic skills or school performance. Under the new policy, students in third, sixth, and eighth grade must meet minimum standards on the ITBS in both reading and mathematics in order to be promoted to the next grade.

24. For a more detailed analysis of the probation policy, see Jacob [2002] and Jacob and Lefgren [forthcoming].


The promotion standards were implemented in spring 1997 for third and sixth graders. Promotion decisions are based solely on scores in reading comprehension and mathematics.

Table V presents OLS estimates of the relationship between teacher cheating and a variety of classroom and school characteristics.25 The unit of observation is a classroom × subject × grade × year. The dependent variable is an indicator of whether we suspect the classroom cheated, using our most stringent definition: a classroom is designated a cheater if both its test score fluctuations and the suspiciousness of its answer strings are above the ninety-fifth percentile for that grade, subject, and year.26
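A sketch of how such an indicator can be constructed, assuming classroom-level columns answers and score for the two cheating measures (column names hypothetical):

```python
import pandas as pd

def flag_cheaters(df: pd.DataFrame, cutoff: float = 0.95) -> pd.Series:
    """Binary cheating indicator: 1 if a classroom is above the `cutoff`
    quantile on BOTH the suspicious-answer-strings measure and the
    test-score-fluctuation measure, within its grade x subject x year cell.
    Percentile ranks approximate the paper's within-cell percentile cutoffs."""
    cells = df.groupby(["grade", "subject", "year"])
    ans_rank = cells["answers"].rank(pct=True)
    score_rank = cells["score"].rank(pct=True)
    return ((ans_rank > cutoff) & (score_rank > cutoff)).astype(int)
```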

In column (1) the policy changes are restricted to have a constant impact across all classrooms. The introduction of the social promotion and probation policies is positively correlated with the likelihood of classroom cheating, although the point estimates are not statistically significant at conventional levels. Much more interesting results emerge when we interact the policy changes with the previous year's test scores for the classroom. For both probation and social promotion, cheating rates in the lowest performing classrooms prove to be quite responsive to the change in incentives. In column (2) we see that a classroom one standard deviation below the mean increased its likelihood of cheating by roughly 0.65 percentage points in response to the school probation policy, and by 0.43 percentage points due to the ending of social promotion. Given a baseline rate of cheating of 1.1 percent, these effects are substantial. The magnitude of these changes is particularly large considering that no elementary school on probation was actually reconstituted during this period, and that the social promotion policy has a direct impact on students but no direct ramifications for teacher pay or rewards. A classroom one standard deviation above the mean does not see any significant change in cheating in response to these two policies. Such classes are very unlikely to be located in schools at risk for being put on probation, and also are likely to have few students at risk for being retained.
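As an arithmetic check, these magnitudes can be reproduced from the column (2) coefficients of Table V, using the classroom-level standard deviation of prior achievement (about 0.66, from the paper's summary statistics):

```latex
% Marginal effect of a policy for a classroom with prior achievement "ach":
%   dPr(cheat)/dPolicy = beta_policy + beta_{policy x ach} * ach
% At one classroom SD below the mean (ach = -0.661):
\begin{align*}
\Delta_{\text{social promotion}} &= 0.0011 + (-0.0049)(-0.661) \approx 0.0043 \quad (0.43\ \text{pp}),\\
\Delta_{\text{probation}}        &= 0.0019 + (-0.0070)(-0.661) \approx 0.0065 \quad (0.65\ \text{pp}).
\end{align*}
```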

25. Logit and probit models evaluated at the mean yield comparable results, so the estimates from a linear probability model are presented for ease of interpretation.

26. The results are not sensitive to the cheating cutoff used. Note that this measure may include error due to both false positives and negatives. Since the measurement error is in the dependent variable, the primary impact will likely be to simply decrease the precision of our estimates.


TABLE V
OLS ESTIMATES OF THE RELATIONSHIP BETWEEN CHEATING AND CLASSROOM CHARACTERISTICS

Dependent variable = Indicator of classroom cheating

Independent variables                                  (1)         (2)         (3)         (4)
------------------------------------------------------------------------------------------------
Social promotion policy                              0.0011      0.0011      0.0015      0.0023
                                                    (0.0013)    (0.0013)    (0.0013)    (0.0009)
School probation policy                              0.0020      0.0019      0.0021      0.0029
                                                    (0.0014)    (0.0014)    (0.0014)    (0.0013)
Prior classroom achievement                         -0.0047     -0.0028     -0.0016     -0.0028
                                                    (0.0005)    (0.0005)    (0.0007)    (0.0007)
Social promotion × classroom achievement               —        -0.0049     -0.0051     -0.0046
                                                                (0.0014)    (0.0014)    (0.0012)
School probation × classroom achievement               —        -0.0070     -0.0070     -0.0064
                                                                (0.0013)    (0.0013)    (0.0013)
Mixed grade classroom                               -0.0084     -0.0085     -0.0089     -0.0089
                                                    (0.0007)    (0.0007)    (0.0008)    (0.0012)
% of students included in official reporting         0.0252      0.0249      0.0141      0.0131
                                                    (0.0031)    (0.0031)    (0.0037)    (0.0037)
Teacher administers exam to own students             0.0067      0.0067      0.0066      0.0061
                                                    (0.0015)    (0.0015)    (0.0015)    (0.0011)
Test form offered for the first time (a)            -0.0007     -0.0007     -0.0011        —
                                                    (0.0011)    (0.0011)    (0.0010)
Average quality of teachers'                           —           —        -0.0026        —
  undergraduate institution                                                 (0.0007)
Percent of teachers who have worked at                 —           —        -0.0045        —
  the school less than 3 years                                              (0.0031)
Percent of teachers under 30 years of age              —           —         0.0156        —
                                                                            (0.0065)
Percent of students in the school meeting              —           —         0.0001        —
  national norms in reading last year                                       (0.0000)
Percent free lunch in school                           —           —         0.0001        —
                                                                            (0.0000)
Predominantly Black school                             —           —         0.0068        —
                                                                            (0.0019)
Predominantly Hispanic school                          —           —        -0.0009        —
                                                                            (0.0016)
School × Year fixed effects                            No          No          No         Yes
Number of observations                              163,474     163,474     163,474     163,474

The unit of observation is classroom × grade × year × subject, and the sample includes eight years (1993 to 2000), four subjects (reading comprehension and three math sections), and five grades (three to seven). The dependent variable is the cheating indicator derived using the ninety-fifth percentile cutoff. Robust standard errors clustered by school × year are shown in parentheses. Other variables included in the regressions in columns (1) and (2) are a linear time trend, cubic terms for the number of students, a linear grade variable, and fixed effects for subjects. The regression shown in column (3) also includes the following variables: indicators of the percent of students in the classroom who were Black, Hispanic, male, receiving free lunch, old for grade, in a special education program, living in foster care, and living with a nonparental relative; indicators of school size, mobility rate, and attendance rate; and the percent of teachers in the school who had a master's or doctoral degree, lived in Chicago, and were education majors.

a. Test forms vary only by year, so this variable drops out of the analysis when school × year fixed effects are included.


These findings are robust to adding a number of classroom and school characteristics (column (3)), as well as school × year fixed effects (column (4)).

Cheating also appears to be responsive to other costs and benefits. Classrooms that tested poorly last year are much more likely to cheat. For example, a classroom with average student prior achievement one classroom standard deviation below the mean is 23 percent more likely to cheat. Classrooms with students in multiple grades are 65 percent less likely to cheat than classrooms where all students are in the same grade. This is consistent with the fact that it is likely more difficult for teachers in such classrooms to cheat, since they must administer two different test forms to students, which will necessarily have different correct answers. Moreover, classes with a higher proportion of students who are included in the official test reporting are more likely to cheat: a 10 percentage point increase in the proportion of students in a class whose test scores "count" will increase the likelihood of cheating by roughly 20 percent.27

Teachers who administer the exam to their own students are 0.67 percentage points (approximately 50 percent) more likely to cheat.28

A few other notable results emerge. First, there is no statistically significant impact on cheating of reusing a test form that has been administered in a previous year. That finding is of interest because it suggests that teachers taking old exams and teaching the precise questions to students is not an important component of what we are detecting as cheating (although anecdotal evidence suggests this practice exists). Classrooms in schools with lower achievement, higher poverty rates, and more Black students are more likely to cheat. Classrooms in schools with teachers who graduated from more prestigious undergraduate institutions are less likely to cheat, whereas classrooms in schools with younger teachers are more likely to cheat.

27. Consistent with this finding, within cheating classrooms teachers are less likely to alter the answer sheets for students who are excluded from test reporting. Teachers also appear to less frequently cheat for boys and for older students.

28. In an earlier version of the paper [Jacob and Levitt 2002], we examine for which students teachers tend to cheat. We find that teachers are most likely to cheat for students in the second quartile of the national ability distribution (twenty-fifth to fiftieth percentile) in the previous year, consistent with the incentives under the probation policy.


VII. CONCLUSION

This paper develops an algorithm for determining the prevalence of teacher cheating on standardized tests and applies the results to data from the Chicago public schools. Our methods reveal thousands of instances of classroom cheating, representing 4–5 percent of the classrooms each year. Moreover, we find that teacher cheating appears quite responsive to relatively minor changes in incentives.

The obvious benefits of high-stakes tests as a means of providing incentives must therefore be weighed against the possible distortions that these measures induce. Explicit cheating is merely one form of distortion. Indeed, the kind of cheating we focus on is one that could be greatly reduced at a relatively low cost through the implementation of proper safeguards, such as is done by Educational Testing Service on the SAT or GRE exams.29

Even if this particular form of cheating were eliminated, however, there are a virtually infinite number of dimensions along which strategic school personnel can act to game the current system. For instance, Figlio and Winicki [2001] and Rohlfs [2002] document that schools attempt to increase the caloric intake of students on test days, and disruptive students are especially likely to be suspended from school during official testing periods [Figlio 2002]. It may be prohibitively costly, or even impossible, to completely safeguard the system.

Finally, this paper fits into a small but growing body of research focused on identifying corrupt or illicit behavior on the part of economic actors (see Porter and Zona [1993], Fisman [2001], Di Tella and Schargrodsky [2001], and Duggan and Levitt [2002]). Because individuals engaged in such behavior actively attempt to hide their trails, the intellectual exercise associated with uncovering their misdeeds differs substantially from the typical economic application, in which the researcher starts with a well-defined measure of the outcome variable (e.g., earnings, economic growth, profits) and then attempts to uncover the determinants of these outcomes. In the case of corruption, there is typically no clear outcome variable, making it necessary for the researcher to employ nonstandard approaches in generating such a measure.

29. Although even tests administered by the Educational Testing Service are not immune to cheating, as evidenced by recent reports that many of the GRE scores from China are fraudulent [Contreras 2002].



APPENDIX: THE CONSTRUCTION OF SUSPICIOUS STRING MEASURES

This appendix describes in greater detail how we construct each of our measures of unexpected or suspicious responses.

The first measure focuses on the most unlikely block of identical answers given on consecutive questions. Using past test scores, future test scores, and background characteristics, we predict the likelihood that each student will give each answer on each question. For each item a student has four choices (A, B, C, or D), only one of which is correct. We estimate a multinomial logit for each item on the exam in order to predict how students will respond to each question. We estimate the following model for each item, using information from other students in that year, grade, and subject:

$$\Pr(Y_{isc} = j) = \frac{e^{\beta_j X_s}}{\sum_{j=1}^{J} e^{\beta_j X_s}} \tag{1}$$

where $Y_{isc}$ indicates the response of student $s$ in class $c$ on item $i$, the number of possible responses ($J$) is four, and $X_s$ is a vector that includes measures of prior and future student achievement in math and reading, as well as demographic variables (such as race, gender, and free lunch status) for student $s$. Thus, a student's predicted probability of choosing a particular response is identified by the likelihood of other students (in the same year, grade, and subject) with similar background characteristics choosing that response.
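A sketch of this estimation step for a single item, using statsmodels' multinomial logit (the covariate names are placeholders for the achievement and demographic controls described above):

```python
import pandas as pd
import statsmodels.api as sm

def predicted_response_probs(df: pd.DataFrame, item_col: str, x_cols: list):
    """Fit a multinomial logit of the chosen option (coded 0-3 for A-D) on
    student covariates, and return each student's predicted probability of
    each of the four options on this item."""
    X = sm.add_constant(df[x_cols])
    y = df[item_col]                       # chosen option, coded 0..3
    fit = sm.MNLogit(y, X).fit(disp=0)
    return pd.DataFrame(fit.predict(X), index=df.index)  # (n_students, 4)
```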

Notice that by including future as well as prior test scores in the model, we decrease the likelihood that students with unusually good teachers will be identified as cheaters, since these students will likely retain some of the knowledge learned in the base year and thus have higher future test scores. Also note that by estimating the probability of selecting each possible response, rather than simply estimating the probability of choosing the correct response, we take advantage of any additional information that is provided by particular response patterns in a classroom.

Using the estimates from this model, we calculate the predicted probability that each student would answer each item in the way that he or she in fact did:



$$p_{isc} = \frac{e^{\beta_k X_s}}{\sum_{j=1}^{J} e^{\beta_j X_s}}, \qquad k = \text{the response actually chosen by student } s \text{ on item } i \tag{2}$$

This provides us with one measure per student per item. Taking the product over items within student, we calculate the probability that a student would have answered a string of consecutive questions from item $m$ to item $n$ as he or she did:

$$p_{sc}^{mn} = \prod_{i=m}^{n} p_{isc} \tag{3}$$

We then take the product across all students in the classroom who had identical responses in the string. If we define $z$ as a student, $S_{zc}^{mn}$ as the string of responses for student $z$ from item $m$ to item $n$, and $S_{sc}^{mn}$ as the string for student $s$, then we can express the product as

$$\tilde{p}_{sc}^{mn} = \prod_{z \,:\, S_{zc}^{mn} = S_{sc}^{mn}} p_{zc}^{mn} \tag{4}$$

Note that if there are $n_s$ students in class $c$ and each student has a unique set of responses to these particular items, then $\tilde{p}_{sc}^{mn}$ collapses to $p_{sc}^{mn}$ for each student, and there will be $n_s$ distinct values within the class. On the other extreme, if all of the students in class $c$ have identical responses, then there is only one distinct value of $\tilde{p}_{sc}^{mn}$. We repeat this calculation for all possible consecutive strings of length three to seven, that is, for all $S^{mn}$ with $3 \le n - m + 1 \le 7$. To create our first indicator of suspicious string patterns, we take the minimum of the predicted block probability for each classroom:

MEASURE 1. $M1_c = \min_s (\tilde{p}_{sc}^{mn})$.

This measure captures the least likely block of identical answers given on consecutive questions in the classroom.
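A direct, unoptimized implementation of this measure might look as follows, taking as given the per-item probabilities from equation (2):

```python
import numpy as np

def measure_1(probs, answers, min_len=3, max_len=7):
    """Least likely block of identical consecutive answers in a classroom.

    probs   : (n_students, n_items) predicted probability of each student's
              actual answer on each item (from the multinomial logit)
    answers : (n_students, n_items) the answers actually chosen
    Returns the minimum, over students and strings, of the joint probability
    that every student sharing a given answer block answered it as they did."""
    n_students, n_items = probs.shape
    best = 1.0
    for length in range(min_len, max_len + 1):
        for m in range(n_items - length + 1):
            block = answers[:, m:m + length]
            block_p = probs[:, m:m + length].prod(axis=1)   # p_sc^mn
            for s in range(n_students):
                same = (block == block[s]).all(axis=1)      # identical strings
                best = min(best, block_p[same].prod())      # tilde-p_sc^mn
            # a production version would deduplicate identical rows
    return best
```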

The second measure of suspicious answer strings is intended to capture more general patterns of similarity in student responses. To construct this measure, we first calculate the residuals for each of the possible choices a student could have made for each item:



$$e_{jisc} = \begin{cases} 0 - \dfrac{e^{\beta_j X_s}}{\sum_{j'=1}^{J} e^{\beta_{j'} X_s}} & \text{if } j \neq k \\[1.5ex] 1 - \dfrac{e^{\beta_j X_s}}{\sum_{j'=1}^{J} e^{\beta_{j'} X_s}} & \text{if } j = k \end{cases} \tag{5}$$

where $e_{jisc}$ is the residual for response $j$ on item $i$ by student $s$ in classroom $c$. We thus have four separate residuals per student per item.

To create a classroom-level measure of the response to item $i$, we need to combine the information for each student. First, we sum the residuals for each response across students within a classroom:

$$\bar{e}_{jic} = \sum_s e_{jisc} \tag{6}$$

If there is no within-class correlation in the way that students responded to a particular item, this term should be approximately zero. Second, we sum across the four possible responses for each item within classrooms. At the same time, we square each of the component residual measures, to accentuate outliers, and divide by the number of students in the class ($n_{sc}$) to normalize by class size:

$$v_{ic} = \frac{\sum_j \bar{e}_{jic}^{\,2}}{n_{sc}} \tag{7}$$

The statistic $v_{ic}$ captures something like the variance of student responses on item $i$ within classroom $c$. Notice that we choose to first sum the residuals of each response across students, and then sum the classroom-level measures for each response, rather than summing across responses within student initially. We do this in order to emphasize the classroom-level tendencies in response patterns.

Our second measure of suspicious strings is simply the classroom average (across items) of this variance term across all test items:

MEASURE 2. $M2_c = \bar{v}_c = \sum_i v_{ic}/n_i$, where $n_i$ is the number of items on the exam.

Our third measure is simply the variance (as opposed to the mean) in the degree of correlation across questions within a classroom:



MEASURE 3. $M3_c = \sigma_{v_c}^2 = \sum_i (v_{ic} - \bar{v}_c)^2/n_i$.
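Measures 2 and 3 can be computed together from the full matrix of predicted option probabilities; a sketch under assumed array conventions:

```python
import numpy as np

def measures_2_and_3(option_probs, answers):
    """Classroom-level variance measures built from the residuals in (5)-(7).

    option_probs : (n_students, n_items, 4) predicted probability of each
                   option for each student and item
    answers      : (n_students, n_items) option actually chosen, coded 0..3
    Returns (M2, M3): the mean and variance across items of v_ic."""
    n_students, n_items, n_opts = option_probs.shape
    chosen = np.zeros_like(option_probs)
    rows, cols = np.indices(answers.shape)
    chosen[rows, cols, answers] = 1.0              # indicator of chosen option
    resid = chosen - option_probs                  # e_jisc, eq. (5)
    e_bar = resid.sum(axis=0)                      # sum over students, eq. (6)
    v = (e_bar ** 2).sum(axis=1) / n_students      # v_ic, eq. (7)
    return v.mean(), v.var()                       # M2_c and M3_c (pop. var.)
```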

Our final indicator focuses on the extent to which a student's response pattern differs from those of other students with the same aggregate score that year. Let $q_{isc}$ equal one if student $s$ in classroom $c$ answered item $i$ correctly, and zero otherwise. Let $A_s$ equal the aggregate score of student $s$ on the exam. We then determine what fraction of students at each aggregate score level answered each item correctly. If we let $n_{sA}$ equal the number of students with an aggregate score of $A$, then this fraction $\bar{q}_i^A$ can be expressed as

$$\bar{q}_i^A = \frac{\sum_{s \in \{z : A_z = A\}} q_{isc}}{n_{sA}} \tag{8}$$

We then calculate a measure of how much the response pattern of student $s$ differed from the response pattern of other students with the same aggregate score. We do so by subtracting a student's answer on item $i$ from the mean response of all students with aggregate score $A$, squaring these deviations, and then summing across all items on the exam:

$$Z_{sc} = \sum_i (q_{isc} - \bar{q}_i^A)^2 \tag{9}$$

We then subtract out the mean deviation for all students with the same aggregate score, $\bar{Z}_A$, and sum over the students within each classroom to obtain our final indicator:

MEASURE 4. $M4_c = \sum_s (Z_{sc} - \bar{Z}_A)$.
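A compact pandas sketch of this measure, assuming a grade-year-wide 0/1 correctness matrix and classroom labels (both hypothetical inputs):

```python
import numpy as np
import pandas as pd

def measure_4(correct, classroom):
    """Excess within-class deviation from same-aggregate-score peers.

    correct   : (n_students, n_items) 0/1 matrix over the whole grade/year
    classroom : (n_students,) classroom label for each student
    Returns a Series of M4_c by classroom."""
    agg = correct.sum(axis=1)                      # A_s, aggregate score
    df = pd.DataFrame(correct)
    q_bar = df.groupby(agg).transform("mean")      # q_i^A for each student
    z = ((df - q_bar) ** 2).sum(axis=1)            # Z_sc, eq. (9)
    z_bar = z.groupby(agg).transform("mean")       # mean Z among same score
    return (z - z_bar).groupby(pd.Series(classroom)).sum()  # M4_c
```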

KENNEDY SCHOOL OF GOVERNMENT, HARVARD UNIVERSITY
UNIVERSITY OF CHICAGO AND AMERICAN BAR FOUNDATION

REFERENCES

Aiken, Linda R., "Detecting, Understanding, and Controlling for Cheating on Tests," Research in Higher Education, XXXII (1991), 725–736.
Angoff, William H., "The Development of Statistical Indices for Detecting Cheaters," Journal of the American Statistical Association, LXIX (1974), 44–49.
Baker, George, "Incentive Contracts and Performance Measurement," Journal of Political Economy, C (1992), 598–614.
Bellezza, Francis S., and Suzanne F. Bellezza, "Detection of Cheating on Multiple-Choice Tests by Using Error Similarity Analysis," Teaching of Psychology, XVI (1989), 151–155.
Carnoy, Martin, and Susanna Loeb, "Does External Accountability Affect Student Outcomes? A Cross-State Analysis," Working Paper, Stanford University School of Education, 2002.
Contreras, Russell, "Inquiry Uncovers Possible Cheating on GRE in Asia," Associated Press, August 7, 2002.
Cullen, Julie Berry, and Randall Reback, "Tinkering Toward Accolades: School Gaming under a Performance Accountability System," Working Paper, University of Michigan, 2002.
Deere, Donald, and Wayne Strayer, "Putting Schools to the Test: School Accountability, Incentives and Behavior," Working Paper, Texas A&M University, 2002.
Di Tella, Rafael, and Ernesto Schargrodsky, "The Role of Wages and Auditing during a Crackdown on Corruption in the City of Buenos Aires," unpublished manuscript, Harvard Business School, 2001.
Duggan, Mark, and Steven Levitt, "Winning Isn't Everything: Corruption in Sumo Wrestling," American Economic Review, XCII (2002), 1594–1605.
Figlio, David N., "Testing, Crime and Punishment," Working Paper, University of Florida, 2002.
Figlio, David N., and Joshua Winicki, "Food for Thought: The Effects of School Accountability Plans on School Nutrition," Working Paper, University of Florida, 2001.
Figlio, David N., and Lawrence S. Getzler, "Accountability, Ability and Disability: Gaming the System," Working Paper, University of Florida, 2002.
Fisman, Ray, "Estimating the Value of Political Connections," American Economic Review, XCI (2001), 1095–1102.
Frary, Robert B., "Statistical Detection of Multiple-Choice Answer Copying: Review and Commentary," Applied Measurement in Education, VI (1993), 153–165.
Frary, Robert B., T. Nicholas Tideman, and Thomas Watts, "Indices of Cheating on Multiple-Choice Tests," Journal of Educational Statistics, II (1977), 235–256.
Glaeser, Edward, and Andrei Shleifer, "A Reason for Quantity Regulation," American Economic Review, XCI (2001), 431–435.
Grissmer, David W., Ann Flanagan, et al., "Improving Student Achievement: What NAEP Test Scores Tell Us," RAND Corporation, MR-924-EDU, 2000.
Hanushek, Eric A., and Margaret E. Raymond, "Improving Educational Quality: How Best to Evaluate Our Schools?" Conference paper, Federal Reserve Bank of Boston, 2002.
Hofkins, Diane, "Cheating 'Rife' in National Tests," New York Times Educational Supplement, June 16, 1995.
Holland, Paul W., "Assessing Unusual Agreement between the Incorrect Answers of Two Examinees Using the K-Index: Statistical Theory and Empirical Support," Technical Report No. 96-5, Educational Testing Service, 1996.
Holmstrom, Bengt, and Paul Milgrom, "Multitask Principal-Agent Analyses: Incentive Contracts, Asset Ownership, and Job Design," Journal of Law, Economics, and Organization, VII (1991), 24–51.
Jacob, Brian A., "Accountability, Incentives and Behavior: The Impact of High-Stakes Testing in the Chicago Public Schools," Working Paper No. 8968, National Bureau of Economic Research, 2002.
Jacob, Brian A., and Lars Lefgren, "The Impact of Teacher Training on Student Achievement: Quasi-Experimental Evidence from School Reform Efforts in Chicago," Journal of Human Resources, forthcoming.
Jacob, Brian A., and Steven Levitt, "Catching Cheating Teachers: The Results of an Unusual Experiment in Implementing Theory," Working Paper No. 9413, National Bureau of Economic Research, 2003.
Klein, Stephen P., Laura S. Hamilton, et al., "What Do Test Scores in Texas Tell Us?" RAND Corporation, 2000.
Kolker, Claudia, "Texas Offers Hard Lessons on School Accountability," Los Angeles Times, April 14, 1999.
Lindsay, Drew, "Whodunit? Officials Find Thousands of Erasures on Standardized Tests and Suspect Tampering," Education Week (1996), 25–29.
Loughran, Regina, and Thomas Comiskey, "Cheating the Children: Educator Misconduct on Standardized Tests," Report of the City of New York Special Commissioner of Investigation for the New York City School District, 1999.
Marcus, John, "Faking the Grade," Boston Magazine, February 2000.
May, Meredith, "State Fears Cheating by Teachers," San Francisco Chronicle, October 4, 2000.
Perlman, Carol L., "Results of a Citywide Testing Program Audit in Chicago," Paper for American Educational Research Association annual meeting, Chicago, IL, 1985.
Porter, Robert H., and J. Douglas Zona, "Detection of Bid Rigging in Procurement Auctions," Journal of Political Economy, CI (1993), 518–538.
Richards, Craig E., and Tian Ming Sheu, "The South Carolina School Incentive Reward Program: A Policy Analysis," Economics of Education Review, XI (1992), 71–86.
Rohlfs, Chris, "Estimating Classroom Learning Using the School Breakfast Program as an Instrument for Attendance and Class Size: Evidence from Chicago Public Schools," unpublished manuscript, University of Chicago, 2002.
Tysome, Tony, "Cheating Purge: Inspectors Out," New York Times Higher Education Supplement, August 19, 1994.
Wollack, James A., "A Nominal Response Model Approach for Detecting Answer Copying," Applied Psychological Measurement, XX (1997), 144–152.

877ROTTEN APPLES

Page 13: rotten apples: an investigation of the prevalence and predictors of ...

This is the part of the distribution that is unlikely to containmany cheating classrooms Under the assumption that the corre-lation between our two measures is constant in noncheatingclassrooms over the whole distribution we would predict that thestraight line observed for all but the right-hand portion of thegraph would continue all the way to the right edge of the graph

Instead as one approaches the extreme right tail of thedistribution of suspicious answer strings the probability of largetest score uctuations rises dramatically That sharp rise weargue is a consequence of cheating To estimate the prevalence ofcheating we simply compare the actual area under the curve inthe far right tail of Figure I to the predicted area under the curvein the right tail projecting the linear relationship observed fromthe ftieth to seventy-fth percentile out to the ninety-ninthpercentile Because this identication strategy is necessarily in-direct we devote a great deal of attention to presenting a widevariety of tests of the validity of our approach the sensitivity ofthe results to alternative assumptions and the plausibility of ourndings

IV DATA AND INSTITUTIONAL BACKGROUND

Elementary students in Chicago public schools take a stan-dardized multiple-choice achievement exam known as the IowaTest of Basic Skills (ITBS) The ITBS is a national norm-refer-enced exam with a reading comprehension section and threeseparate math sections14 Third through eighth grade students inChicago are required to take the exams each year Most schoolsadminister the exams to rst and second grade students as wellalthough this is not a district mandate

Our base sample includes all students in third to seventhgrade for the years 1993ndash200015 For each student we have thequestion-by-question answer string on each yearrsquos tests schooland classroom identiers the full history of prior and future testscores and demographic variables including age sex race andfree lunch eligibility We also have information about a wide

14 There are also other parts of the test which are either not included inofcial school reporting (spelling punctuation grammar) or are given only inselect grades (science and social studies) for which we do not have information

15 We also have test scores for eighth graders but we exclude them becauseour algorithm requires test score data for the following year and the ITBS test isnot administered to ninth graders

855ROTTEN APPLES

range of school-level characteristics We do not however haveindividual teacher identiers so we are unable to directly linkteachers to classrooms or to track a particular teacher over time

Because our cheating proxies rely on comparisons to past andfuture test scores we drop observations that are missing readingor math scores in either the preceding year or the following year(in addition to those with missing test scores in the baselineyear)16 Less than one-half of 1 percent of students with missingdemographic data are also excluded from the analysis Finallybecause our algorithms for identifying cheating rely on identify-ing suspicious patterns within a classroom our methods havelittle power in classrooms with small numbers of students Con-sequently we drop all classrooms for which we have fewer thanten valid students in a particular grade after our other exclusions(roughly 3 percent of students) A handful of classrooms recordedas having more than 40 studentsmdashpresumably multiple class-rooms not separately identied in the datamdashare also droppedOur nal data set contains roughly 20000 students per grade peryear distributed across approximately 1000 classrooms for atotal of over 40000 classroom-years of data (with four subjecttests per classroom-year) and over 700000 student-yearobservations

Summary statistics for the full sample of classrooms areshown in Table I where the unit of observation is at the level ofclasssubjectyear As in many urban school districts students inChicago are disproportionately poor and minority Roughly 60percent of students are Black and 26 percent are Hispanic withnearly 85 percent receiving free or reduced-price lunch Less than29 percent of students scored at or above national norms inreading

The ITBS exams are administered over a weeklong period inearly May Third grade teachers are permitted to administer theexam to their own students while other teachers switch classes toadminister the exams The exams are generally delivered to theschools one to two weeks before testing and are supposed to be

16 The exact number of students with missing test data varies by year andgrade Overall we drop roughly 12 percent of students because of missing testscore information in the baseline year These are students who (i) were not testedbecause of placement in bilingual or special education programs or (ii) simplymissed the exam that particular year In addition to these students we also dropapproximately 13 percent of students who were missing test scores in either thepreceding or following years Test data may be missing either because a studentdid not attend school on the days of the test or because the student transferredinto the CPS system in the current year or left the system prior to the next yearof testing

856 QUARTERLY JOURNAL OF ECONOMICS

kept in a secure location by the principal or the schoolrsquos testcoordinator an individual in the school designated to coordinatethe testing process (often a counselor or administrator) Each

TABLE ISUMMARY STATISTICS

Mean(sd)

Classroom characteristicsMixed grade classroom 0073Teacher administers exams to her own students (3rd grade) 0206Percent of students who were tested and included in ofcial

reporting 0883Average prior achievement 20004(as deviation from yeargradesubject mean) (0661) Black 0595 Hispanic 0263 Male 0495 Old for grade 0086 Living in foster care 0044 Living with nonparental relative 0104Cheatermdash95th percentile cutoff 0013School characteristicsAverage quality of teachersrsquo undergraduate institution

in the school22550(0877)

Percent of teachers who live in Chicago 0712Percent of teachers who have an MA or a PhD 0475Percent of teachers who majored in education 0712Percent of teachers under 30 years of age 0114Percent of teachers at the school less than 3 years 0547 students at national norms in reading last year 288 students receiving free lunch in school 847Predominantly Black school 0522Predominantly Hispanic school 0205Mobility rate in school 286Attendance rate in school 926School size 722

(317)Accountability policySocial promotion policy 0215School probation policy 0127Test form offered for the rst time 0371Number of observations 163474

The data summarized above include one observation per classroomyearsubjectThe sample includes allstudents in third to seventh grade for the years 1993ndash2000 We drop observations for the following reasons(a) missing reading or math scores in either the baseline year the preceding year or the following year (b)missing demographic data (c) classrooms for which we have fewer than ten valid students in a particulargrade or more than 40 students See the discussion in the text for a more detailed explanation of the sampleincluding the percentages excluded for various reasons

857ROTTEN APPLES

section of the exam consists of 30 to 60 multiple choice questionswhich students are given between 30 and 75 minutes to com-plete17 Students mark their responses on answer sheets whichare scanned to determine a studentrsquos score There is no penaltyfor guessing A studentrsquos raw score is simply calculated as thesum of correct responses on the exam The raw score is thentranslated into a grade equivalent

After collecting student exams teachers or administratorsthen ldquocleanrdquo the answer keys erasing stray pencil marks remov-ing dirt or debris from the form and darkening item responsesthat were only faintly marked by the student At the end of thetesting week the test coordinators at each school deliver thecompleted answer keys and exams to the CPS central ofceSchool personnel are not supposed to keep copies of the actualexams although school ofcials acknowledge that a number ofteachers each year do so The CPS has administered three differ-ent versions of the ITBS between 1993 and 2000 The CPS alter-nates forms each year with new forms being offered for the rsttime in 1993 1994 and 199718

V THE PREVALENCE OF TEACHER CHEATING

As noted earlier our estimates of the prevalence of cheatingare derived from a comparison of the actual number of classroomsthat are above some threshold on both of our cheating indicatorsrelative to the number we would expect to observe based on thecorrelation between these two measures in the 50thndash75th percen-tiles of the suspicious string measure The top panel of Table IIpresents our estimates of the percentage of classrooms that arecheating on average on a given subject test (ie reading compre-hension or one of the three math tests) in a given year19 Because

17 The mathematics and reading tests measure basic skills The readingcomprehension exam consists of three to eight short passages followed by up tonine questions relating to the passage The math exam consists of three sectionsthat assess number concepts problem-solving and computation

18 These three forms are used for retesting summer school testing andmidyear testing as well so that it is likely that over the years teachers have seenthe same exam on a number of occasions

19 It is important to note that we exclude a particular form of cheating thatappears to be quite prevalent in the data teachers randomly lling in answers leftblank by students at the end of the exam In many classrooms almost everystudent will end the test with a long string of the identical answers (typically allthe same response like ldquoBrdquo or ldquoDrdquo) The fact that almost all students in the classcoordinate on the same pattern strongly suggests that the students themselvesdid not ll in the blanks or were under explicit instructions by the teacher to do

858 QUARTERLY JOURNAL OF ECONOMICS

the decision as to what cutoff signies a ldquohighrdquo value on ourcheating indicators is arbitrary we present a 3 3 3 matrix ofestimates using three different thresholds (eightieth ninetiethand ninety-fth percentiles) for each of our cheating measuresThe estimated prevalence of cheaters ranges from 11 percent to21 percent depending on the particular set of cutoffs used Aswould be expected the number of cheaters is generally decliningas higher thresholds are employed Nonetheless it is encouragingthat over such a wide set of cutoffs the range of estimates isrelatively tight

The bottom panel of Table II presents estimates of the per-centage of classrooms that are cheating on any of the four subjecttests in a particular year If every classroom that cheated did soonly on one subject test then the results in the bottom panel

so Since there is no penalty for guessing on the test lling in the blanks can onlyincrease student test scores While this type of teacher behavior is likely to beviewed by many as unethical we do not make it the focus of our analysis because(1) it is difcult to provide denitive evidence of such behavior (a teacher couldargue that he or she instructed students well in advance of the test to ll in allblanks with the letter ldquoCrdquo as part of good test-taking strategy) and (2) in ourminds it is categorically different than a teacher who systematically changesstudent responses to the correct answer

TABLE IIESTIMATED PREVALENCE OF TEACHER CHEATING

Cutoff for suspicious answerstrings (ANSWERS)

Cutoff for test score uctuations (SCORE)

80th percentile 90th percentile 95th percentile

Percent cheating on a particular test80th percentile 21 21 1890th percentile 18 18 1595th percentile 13 13 11

Percent cheating on at least one of the four tests given80th percentile 45 56 5390th percentile 42 49 4495th percentile 35 38 34

The top panel of the table presents estimates of the percentage of classrooms cheating on a particularsubject test in a given year based on three alternative cutoffs for ANSWERS and SCORE In all cases theprevalence of cheating is based on the excess number of classrooms with unexpected test score uctuationamong classes with suspicious answer strings relative to classes that do not have suspicious answer stringsThe bottom panel of the table presents estimates of the percentage of classrooms cheating on at least one ofthe four subject tests that comprise the overall test In the bottom panel classrooms that cheat on more thanone subject test are counted only once Our sample includes over 35000 3rdndash7th grade classrooms in theChicago public schools for the years 1993ndash1999

859ROTTEN APPLES

would simply be four times the results in the top panel In manyinstances however classrooms appear to cheat on multiple sub-jects Thus the prevalence rates range from 34ndash56 percent of allclassrooms20 Because of the necessarily indirect nature of ouridentication strategy we explore a range of supplementary testsdesigned to assess the validity of the estimates21

VA Simulation Results

Our prevalence estimates may be biased downward or up-ward depending on the extent to which our algorithm fails tosuccessfully identify all instances of cheating (ldquofalse negativesrdquo)versus the frequency with which we wrongly classify a teacher ascheating (ldquofalse positiverdquo)

One simple simulation exercise we undertook with respect topossible false positives involves randomly assigning students tohypothetical classrooms These synthetic classrooms thus consistof students who in actuality had no connection to one another Wethen analyze these hypothetical classrooms using the same algo-rithm applied to the actual data As one would hope no evidenceof cheating was found in the simulated classes Indeed the esti-mated prevalence of cheating was slightly negative in this simu-lation (ie classrooms with large test score increases in thecurrent year followed by big declines the next year were slightlyless likely to have unusual patterns of answer strings) Thusthere does not appear to be anything mechanical in our algorithmthat generates evidence of cheating

A second simulation involves articially altering a class-

20 Computation of the overall prevalence is somewhat complicated becauseit involves calculating not only how many classrooms are actually above thethresholds on multiple subject tests but also how frequently this would occur inthe absence of cheating Details on these calculations are available from theauthors

21 In an earlier version of this paper [Jacob and Levitt 2003] we present anumber of additional sensitivity analyses that conrm the basic results First wetest whether our results are sensitive to the way in which we predict the preva-lence of high values of ANSWERS and SCORE in the upper part of the otherdistribution (eg tting a linear or quadratic model to the data in the lowerportion of the distribution versus simply using the average value from the 50thndash75th percentiles) We nd our results are extremely robust Second we examinewhether our results might simply be due to mean reversion by testing whetherclassrooms with suspicious answer strings are less likely to maintain large testscore gains Among a sample of classrooms with large test score gains we nd thatmean reversion is substantially greater among the set of classes with highlysuspicious answer strings Third we investigate whether we might be detectingstudent rather than teacher cheating by examining whether students who havesuspicious answer strings in one year are more likely to have suspicious answerstrings in other years We nd that they do not

860 QUARTERLY JOURNAL OF ECONOMICS

roomrsquos answer strings in ways that we believe mimic teachercheating or outstanding teaching22 We are then able to see howfrequently we label these altered classes as ldquocheatingrdquo and howthis varies with the extent and nature of the changes we make toa classroomrsquos answers We simulate two different types of teachercheating The rst is a very naive version in which a teacherstarts cheating at the same question for a number of students andchanges consecutive questions to the right answers for thesestudents creating a block of identical and correct responses Thesecond type of cheating is much more sophisticated we randomlychange answers from incorrect to correct for selected students ina class The outcome of this simulation reveals that our methodsare surprisingly weak in detecting moderate cheating even in thenave case For instance when three consecutive answers arechanged to correct for 25 percent of a class we catch such behav-ior only 4 percent of the time Even when six questions arechanged for half of the class we catch the cheating in less than 60percent of the cases Only when the cheating is very extreme (egsix answers changed for the entire class) are we very likely tocatch the cheating The explanation for this poor performance isthat classrooms with low test-score gains even with the boostprovided by this cheating do not have large enough test scoreuctuations to make them appear unusual because there is somuch inherent volatility in test scores When the cheating takesthe more sophisticated form described above we are even lesssuccessful at catching low and moderate cheaters (2 percent and34 percent respectively in the rst two scenarios describedabove) but we almost always detect very extreme cheating

From a political or equity perspective however we may beeven more concerned with false positives specically the risk ofaccusing highly effective teachers of cheating Hence we alsosimulate the impact of a good teacher changing certain answersfor certain students from incorrect to correct responses Our ma-nipulation in this case differs from the cheating simulationsabove in two ways (1) we allow some of the gains to be preservedin the following year and (2) we alter questions such that stu-dents will not get random answers correct but rather will tend toshow the greatest improvement on the easiest questions that thestudents were getting wrong These simulation results suggest

22 See Jacob and Levitt [2003] for the precise details of this simulationexercise

861ROTTEN APPLES

that only rarely are good teachers mistakenly labeled as cheatersFor instance a teacher whose prowess allows him or her to raisetest scores by six questions for half the students in the class (morethan a 05 grade equivalent increase on average for the class) islabeled a cheater only 2 percent of the time In contrast a navecheater with the same test score gains is correctly identied as acheater in more than 50 percent of cases

In another test for false positives we explored whether wemight be mistaking emphasis on certain subject material forcheating For example if a math teacher spends several monthson fractions with a particular class one would expect the class todo particularly well on all of the math questions relating tofractions and perhaps worse than average on other math ques-tions One might imagine a similar scenario in reading if forexample a teacher creates a lengthy unit on the UndergroundRailroad which later happens to appear in a passage on thereading comprehension exam To examine this we analyze thenature of the across-question correlations in cheating versusnoncheating classrooms We nd that the most highly correlatedquestions in cheating classrooms are no more likely to measurethe same skill (eg fractions geometry) than in noncheatingclassrooms The implication of these results is that the prevalenceestimates presented above are likely to substantially understatethe true extent of cheatingmdashthere is little evidence of false posi-tivesmdashbut we frequently miss moderate cases of cheating

V.B. The Correlation of Cheating Indicators across Subjects, Classrooms, and Years

If what we are detecting is truly cheating, then one would expect that a teacher who cheats on one part of the test would be more likely to cheat on other parts of the test. Also, a teacher who cheats one year would be more likely to cheat the following year. Finally, to the extent that cheating is either condoned by the principal or carried out by the test coordinator, one would expect to find multiple classes in a school cheating in any given year, and perhaps even that cheating in a school one year predicts cheating in future years. If what we are detecting is not cheating, then one would not necessarily expect to find strong correlation in our cheating indicators across exams for a specific classroom, across classrooms, or across years.


Table III reports regression results testing these predictions. The dependent variable is an indicator for whether we believe a classroom is likely to be cheating on a particular subject test, using our most stringent definition (above the ninety-fifth percentile on both cheating indicators). The baseline probability of

TABLE III
PATTERNS OF CHEATING WITHIN CLASSROOMS AND SCHOOLS

Dependent variable = Class suspected of cheating (mean of the dependent variable = 0.011).
Columns (1) and (2) use the full sample; columns (3) and (4) use the sample of classes and schools that existed in the prior year.

Independent variables                                         (1)       (2)       (3)       (4)
Classroom cheated on exactly one other subject this year     0.105     0.103     0.101     0.101
                                                            (0.008)   (0.008)   (0.009)   (0.009)
Classroom cheated on exactly two other subjects this year    0.289     0.285     0.243     0.243
                                                            (0.027)   (0.027)   (0.031)   (0.031)
Classroom cheated on all three other subjects this year      0.627     0.622     0.595     0.595
                                                            (0.051)   (0.051)   (0.054)   (0.054)
Cheating rate among all other classes in the school
  this year on this subject                                    —       0.166     0.134     0.129
                                                                      (0.030)   (0.027)   (0.027)
Cheating rate among all other classes in the school
  this year on other subjects                                  —       0.023     0.059     0.045
                                                                      (0.024)   (0.026)   (0.029)
Cheating in this classroom in this subject last year           —         —       0.096     0.091
                                                                                (0.012)   (0.012)
Number of other subjects this classroom cheated on
  last year                                                    —         —       0.023     0.018
                                                                                (0.004)   (0.004)
Cheating in this classroom ever in the past                    —         —         —       0.006
                                                                                          (0.002)
Cheating rate among other classrooms in this school
  in past years                                                —         —         —       0.090
                                                                                          (0.040)
Fixed effects for grade × subject × year                      Yes       Yes       Yes       Yes
R²                                                           0.090     0.093     0.109     0.109
Number of observations                                      165,578   165,578    94,182    94,170

The dependent variable is an indicator for whether a classroom is above the 95th percentile on both our suspicious strings and unusual test score measures of cheating on a particular subject test. Estimation is done using a linear probability model. The unit of observation is classroom × grade × year × subject. For columns that include measures of cheating in prior years, observations where that classroom or school does not appear in the data in the prior year are excluded. Standard errors are clustered at the school level to take into account correlations across classrooms as well as serial correlation.


qualifying as a cheater for this cutoff is 1.1 percent. To fully appreciate the magnitude of the effects implied by the table, it is important to keep this very low baseline in mind. We report estimates from linear probability models (probits yield similar marginal effects), with standard errors clustered at the school level.
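For readers who want to replicate this type of specification, the sketch below runs a linear probability model with school-clustered standard errors on a synthetic stand-in for the classroom panel; the data frame and column names are our own, not the variables in the authors' data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 5000

# Synthetic stand-in for the classroom x grade x year x subject panel.
df = pd.DataFrame({
    "cheat": rng.binomial(1, 0.011, size=n),
    "n_other_subjects_cheated": rng.integers(0, 4, size=n),
    "school_cheat_rate_same_subject": rng.uniform(0, 0.1, size=n),
    "school_id": rng.integers(0, 200, size=n),
    "grade_subject_year": rng.integers(0, 40, size=n),
})

model = smf.ols(
    "cheat ~ C(n_other_subjects_cheated)"
    " + school_cheat_rate_same_subject + C(grade_subject_year)",
    data=df,
)
# Cluster the standard errors at the school level, as in Table III.
result = model.fit(cov_type="cluster", cov_kwds={"groups": df["school_id"]})
print(result.params.filter(like="n_other_subjects_cheated"))
```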

Column 1 of Table III shows that cheating on other tests in the same year is an extremely powerful predictor of cheating in a different subject. If a classroom cheats on exactly one other subject test, the predicted probability of cheating on this test increases by over ten percentage points. Since the baseline cheating rate is only 1.1 percent, classrooms cheating on exactly one other test are ten times more likely to have cheated on this subject than are classrooms that did not cheat on any of the other subjects (which is the omitted category). Classrooms that cheat on two other subjects are almost 30 times more likely to cheat on this test, relative to those not cheating on other tests. If a class cheats on all three other subjects, it is 50 times more likely to also cheat on this test.

There also is evidence of correlation in cheating within schools. A ten-percentage-point increase in cheating classrooms in a school (excluding the classroom in question) on the same subject test raises the likelihood this class cheats by roughly 0.16 percentage points. This potentially suggests some role for centralized cheating by a school counselor, test coordinator, or the principal, rather than by teachers operating independently. There is little evidence that cheating rates within the school on other subject tests affect cheating on this test.

When making comparisons across years (columns 3 and 4), it is important to note that we do not actually have teacher identifiers. We do, however, know what classroom a student is assigned to. Thus, we can only compare the correlation between past and current cheating in a given classroom. To the extent that teacher turnover occurs or teachers switch classrooms, this proxy will be contaminated. Even given this important limitation, cheating in the classroom last year predicts cheating this year. In column 3, for example, we see that classrooms that cheated in the same subject last year are 9.6 percentage points more likely to cheat this year, even after we control for cheating on other subjects in the same year and cheating in other classes in the school. Column 4 shows that prior cheating in the school strongly predicts the likelihood that a classroom will cheat this year.


V.C. Evidence from Retesting under Controlled Circumstances

Perhaps the most compelling evidence for our methods comes from the results of a unique policy intervention. In spring 2002 the Chicago public schools provided us the opportunity to retest over 100 classrooms under controlled circumstances that precluded cheating, a few weeks after the initial ITBS exam was administered.23 Based on suspicious answer strings and unusual test score gains on the initial spring 2002 exam, we identified a set of classrooms most likely to have cheated. In addition, we chose a second group of classrooms that had suspicious answer strings but did not have large test score gains, a pattern that is consistent with a bad teacher who attempts to mask a poor teaching performance by cheating. We also retested two sets of classrooms as control groups. The first control group consisted of classes with large test score gains but no evidence of cheating in their answer strings, a potential indication of effective teaching. These classes would be expected to maintain their gains when retested, subject to possible mean reversion. The second control group consisted of randomly selected classrooms.

The results of the retests are reported in Table IV. Columns 1-3 correspond to reading; columns 4-6 are for math. For each subject we report the test score gain (in standard score units) for three periods: (i) from spring 2001 to the initial spring 2002 test, (ii) from the initial spring 2002 test to the retest a few weeks later, and (iii) from spring 2001 to the spring 2002 retest. For purposes of comparison, we also report the average gain for all classrooms in the system for the first of those measures (the other two measures are unavailable unless a classroom was retested).

Classrooms prospectively identified as "most likely to have cheated" experienced gains on the initial spring 2002 test that were nearly twice as large as the typical CPS classroom (see columns 1 and 4). On the retest, however, those excess gains completely disappeared: the gains between spring 2001 and the spring 2002 retest were close to the systemwide average. Those classrooms prospectively identified as "bad teachers likely to have cheated" also experienced large score declines on the retest (−8.8 and −10.5 standard score points, or roughly two-thirds of a grade equivalent). Indeed, comparing the spring 2001 results with the

23. Full details of this policy experiment are reported in Jacob and Levitt [2003]. The results of the retests were used by CPS to initiate disciplinary action against a number of teachers, principals, and staff.


TABLE IV
RESULTS OF RETESTS: COMPARISON OF RESULTS FOR SPRING 2002 ITBS AND AUDIT TEST

                                              Reading gains between           Math gains between
                                          Spring    Spring    Spring     Spring    Spring    Spring
                                          2001 to   2001 to   2002 to    2001 to   2001 to   2002 to
                                          spring    2002      2002       spring    2002      2002
Category of classroom                     2002      retest    retest     2002      retest    retest
All classrooms in the Chicago public
  schools                                  14.3       —         —         16.9       —         —
Classrooms prospectively identified as
  most likely cheaters (N = 36 on math;
  N = 39 on reading)                       28.8      12.6     −16.2       30.0      19.3     −10.7
Classrooms prospectively identified as
  bad teachers suspected of cheating
  (N = 16 on math; N = 20 on reading)      16.6       7.8      −8.8       17.3       6.8     −10.5
Classrooms prospectively identified as
  good teachers who did not cheat
  (N = 17 on math; N = 17 on reading)      20.6      21.1       0.5       28.8      25.5      −3.3
Randomly selected classrooms (N = 24
  overall, but only one test per
  classroom)                               14.5      12.2      −2.3       14.5      12.2      −2.3

Results in this table compare test scores at three points in time (spring 2001, spring 2002, and a retest administered under controlled conditions a few weeks after the initial spring 2002 test). The unit of observation is a classroom-subject test pair. Only students taking all three tests are included in the calculations. Because of limited data, math and reading results for the randomly selected classrooms are combined. Only the first two columns are available for all CPS classrooms, since audits were performed only on a subset of classrooms. All entries in the table are in standard score units.


spring 2002 retest, children in these classrooms had gains only half as large as the typical yearly CPS gain. In stark contrast, classrooms prospectively identified as having "good teachers" (i.e., classrooms with large gains but not suspicious answer strings) actually scored even higher on the reading retest than they did on the initial test. The math scores fell slightly on retesting, but these classrooms continued to experience extremely large gains. The randomly selected classrooms also maintained almost all of their gains when retested, as would be expected. In summary, our methods proved to be extremely effective, both in identifying likely cheaters and in correctly classifying teachers who achieved large test score gains in a legitimate manner.

VI. DOES TEACHER CHEATING RESPOND TO INCENTIVES?

From the perspective of economics, perhaps the most interesting question related to teacher cheating is the degree to which it responds to incentives. As noted in the Introduction, there were two major changes in the incentives faced by teachers and students over our sample period. Prior to 1996, ITBS scores were primarily used to provide teachers and parents with a sense of how a child was progressing academically. Beginning in 1996, with the appointment of Paul Vallas as CEO of Schools, the CPS launched an initiative designed to hold students and teachers accountable for student learning.

The reform had two main elements. The first involved putting schools "on probation" if less than 15 percent of students scored at or above national norms on the ITBS reading exams. The CPS did not use math performance to determine probation status. Probation schools that do not exhibit sufficient improvement may be reconstituted, a procedure that involves closing the school and dismissing or reassigning all of the school staff.24 It is clear from our discussions with teachers and administrators that being on probation is viewed as an extremely undesirable circumstance. The second piece of the accountability reform was an end to social promotion, the practice of passing students to the next grade regardless of their academic skills or school performance. Under the new policy, students in third, sixth, and eighth grade must meet minimum standards on the ITBS in both reading and

24. For a more detailed analysis of the probation policy, see Jacob [2002] and Jacob and Lefgren [forthcoming].


mathematics in order to be promoted to the next grade. The promotion standards were implemented in spring 1997 for third and sixth graders. Promotion decisions are based solely on scores in reading comprehension and mathematics.

Table V presents OLS estimates of the relationship between teacher cheating and a variety of classroom and school characteristics.25 The unit of observation is a classroom × subject × grade × year. The dependent variable is an indicator of whether we suspect the classroom cheated, using our most stringent definition: a classroom is designated a cheater if both its test score fluctuations and the suspiciousness of its answer strings are above the ninety-fifth percentile for that grade, subject, and year.26

In column (1), the policy changes are restricted to have a constant impact across all classrooms. The introduction of the social promotion and probation policies is positively correlated with the likelihood of classroom cheating, although the point estimates are not statistically significant at conventional levels. Much more interesting results emerge when we interact the policy changes with the previous year's test scores for the classroom. For both probation and social promotion, cheating rates in the lowest performing classrooms prove to be quite responsive to the change in incentives. In column (2) we see that a classroom one standard deviation below the mean increased its likelihood of cheating by 0.43 percentage points in response to the school probation policy, and roughly 0.65 percentage points due to the ending of social promotion. Given a baseline rate of cheating of 1.1 percent, these effects are substantial. The magnitude of these changes is particularly large considering that no elementary school on probation was actually reconstituted during this period, and that the social promotion policy has a direct impact on students but no direct ramifications for teacher pay or rewards. A classroom one standard deviation above the mean does not see any significant change in cheating in response to these two policies. Such classes are very unlikely to be located in schools at risk for being put on probation, and also are likely to have few students at risk for being retained. These findings are robust to

25. Logit and probit models evaluated at the mean yield comparable results, so the estimates from a linear probability model are presented for ease of interpretation.

26. The results are not sensitive to the cheating cutoff used. Note that this measure may include error due to both false positives and negatives. Since the measurement error is in the dependent variable, the primary impact will likely be to simply decrease the precision of our estimates.


TABLE V
OLS ESTIMATES OF THE RELATIONSHIP BETWEEN CHEATING AND CLASSROOM CHARACTERISTICS

Dependent variable = Indicator of classroom cheating

Independent variables                                        (1)        (2)        (3)        (4)
Social promotion policy                                     0.0011     0.0011     0.0015     0.0023
                                                           (0.0013)   (0.0013)   (0.0013)   (0.0009)
School probation policy                                    0.0020     0.0019     0.0021     0.0029
                                                           (0.0014)   (0.0014)   (0.0014)   (0.0013)
Prior classroom achievement                                −0.0047    −0.0028    −0.0016    −0.0028
                                                           (0.0005)   (0.0005)   (0.0007)   (0.0007)
Social promotion × classroom achievement                      —       −0.0049    −0.0051    −0.0046
                                                                      (0.0014)   (0.0014)   (0.0012)
School probation × classroom achievement                      —       −0.0070    −0.0070    −0.0064
                                                                      (0.0013)   (0.0013)   (0.0013)
Mixed grade classroom                                      −0.0084    −0.0085    −0.0089    −0.0089
                                                           (0.0007)   (0.0007)   (0.0008)   (0.0012)
% of students included in official reporting                0.0252     0.0249     0.0141     0.0131
                                                           (0.0031)   (0.0031)   (0.0037)   (0.0037)
Teacher administers exam to own students                    0.0067     0.0067     0.0066     0.0061
                                                           (0.0015)   (0.0015)   (0.0015)   (0.0011)
Test form offered for the first time (a)                   −0.0007    −0.0007    −0.0011       —
                                                           (0.0011)   (0.0011)   (0.0010)
Average quality of teachers' undergraduate institution        —          —       −0.0026       —
                                                                                 (0.0007)
Percent of teachers who have worked at the school
  less than 3 years                                           —          —       −0.0045       —
                                                                                 (0.0031)
Percent of teachers under 30 years of age                     —          —        0.0156       —
                                                                                 (0.0065)
Percent of students in the school meeting national
  norms in reading last year                                  —          —        0.0001       —
                                                                                 (0.0000)
Percent free lunch in school                                  —          —        0.0001       —
                                                                                 (0.0000)
Predominantly Black school                                    —          —        0.0068       —
                                                                                 (0.0019)
Predominantly Hispanic school                                 —          —       −0.0009       —
                                                                                 (0.0016)
School × year fixed effects                                  No         No         No         Yes
Number of observations                                     163,474    163,474    163,474    163,474

The unit of observation is classroom × grade × year × subject, and the sample includes eight years (1993 to 2000), four subjects (reading comprehension and three math sections), and five grades (three to seven). The dependent variable is the cheating indicator derived using the ninety-fifth percentile cutoff. Robust standard errors clustered by school × year are shown in parentheses. Other variables included in the regressions in columns (1) and (2) include a linear time trend, cubic terms for the number of students, a linear grade variable, and fixed effects for subjects. The regression shown in column (3) also includes the following variables: indicators of the percent of students in the classroom who were Black, Hispanic, male, receiving free lunch, old for grade, in a special education program, living in foster care, and living with a nonparental relative; indicators of school size, mobility rate, and attendance rate; and the percent of teachers in the school who had a master's or doctoral degree, lived in Chicago, and were education majors.

a. Test forms vary only by year, so this variable drops out of the analysis when school × year fixed effects are included.


adding a number of classroom and school characteristics (column (3)), as well as school × year fixed effects (column (4)).

Cheating also appears to be responsive to other costs and benefits. Classrooms that tested poorly last year are much more likely to cheat. For example, a classroom with average student prior achievement one classroom standard deviation below the mean is 23 percent more likely to cheat. Classrooms with students in multiple grades are 65 percent less likely to cheat than classrooms where all students are in the same grade. This is consistent with the fact that it is likely more difficult for teachers in such classrooms to cheat, since they must administer two different test forms to students, which will necessarily have different correct answers. Moreover, classes with a higher proportion of students who are included in the official test reporting are more likely to cheat: a 10 percentage point increase in the proportion of students in a class whose test scores "count" will increase the likelihood of cheating by roughly 20 percent.27

Teachers who administer the exam to their own students are 0.67 percentage points (approximately 50 percent) more likely to cheat.28
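The percentage effects quoted in this paragraph follow from scaling the linear probability coefficients in Table V by the baseline cheating rate; as a worked example, using a baseline of roughly 0.013 (the sample mean of the cheating indicator in the summary statistics):

\[
0.10 \times 0.0252 \approx 0.0025, \qquad
\frac{0.0025}{0.013} \approx 20\%, \qquad
\frac{0.0067}{0.013} \approx 50\%.
\]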

A few other notable results emerge. First, there is no statistically significant impact on cheating of reusing a test form that has been administered in a previous year. That finding is of interest because it suggests that teachers taking old exams and teaching the precise questions to students is not an important component of what we are detecting as cheating (although anecdotal evidence suggests this practice exists). Classrooms in schools with lower achievement, higher poverty rates, and more Black students are more likely to cheat. Classrooms in schools with teachers who graduated from more prestigious undergraduate institutions are less likely to cheat, whereas classrooms in schools with younger teachers are more likely to cheat.

27. Consistent with this finding, within cheating classrooms teachers are less likely to alter the answer sheets for students who are excluded from test reporting. Teachers also appear to less frequently cheat for boys and for older students.

28. In an earlier version of the paper [Jacob and Levitt 2002], we examine for which students teachers tend to cheat. We find that teachers are most likely to cheat for students in the second quartile of the national ability distribution (twenty-fifth to fiftieth percentile) in the previous year, consistent with the incentives under the probation policy.


VII. CONCLUSION

This paper develops an algorithm for determining the prevalence of teacher cheating on standardized tests, and applies the results to data from the Chicago public schools. Our methods reveal thousands of instances of classroom cheating, representing 4-5 percent of the classrooms each year. Moreover, we find that teacher cheating appears quite responsive to relatively minor changes in incentives.

The obvious benefits of high-stakes tests as a means of providing incentives must therefore be weighed against the possible distortions that these measures induce. Explicit cheating is merely one form of distortion. Indeed, the kind of cheating we focus on is one that could be greatly reduced at a relatively low cost through the implementation of proper safeguards, such as is done by the Educational Testing Service on the SAT or GRE exams.29

Even if this particular form of cheating were eliminated, however, there are a virtually infinite number of dimensions along which strategic school personnel can act to game the current system. For instance, Figlio and Winicki [2001] and Rohlfs [2002] document that schools attempt to increase the caloric intake of students on test days, and disruptive students are especially likely to be suspended from school during official testing periods [Figlio 2002]. It may be prohibitively costly, or even impossible, to completely safeguard the system.

Finally, this paper fits into a small but growing body of research focused on identifying corrupt or illicit behavior on the part of economic actors (see Porter and Zona [1993], Fisman [2001], Di Tella and Schargrodsky [2001], and Duggan and Levitt [2002]). Because individuals engaged in such behavior actively attempt to hide their trails, the intellectual exercise associated with uncovering their misdeeds differs substantially from the typical economic application, in which the researcher starts with a well-defined measure of the outcome variable (e.g., earnings, economic growth, profits) and then attempts to uncover the determinants of these outcomes. In the case of corruption, there is typically no clear outcome variable, making it necessary for the

29. Although even tests administered by the Educational Testing Service are not immune to cheating, as evidenced by recent reports that many of the GRE scores from China are fraudulent [Contreras 2002].


researcher to employ nonstandard approaches in generating such a measure.

APPENDIX: THE CONSTRUCTION OF SUSPICIOUS STRING MEASURES

This appendix describes in greater detail how we construct each of our measures of unexpected or suspicious responses.

The first measure focuses on the most unlikely block of identical answers given on consecutive questions. Using past test scores, future test scores, and background characteristics, we predict the likelihood that each student will give each answer on each question. For each item, a student has four choices (A, B, C, or D), only one of which is correct. We estimate a multinomial logit for each item on the exam in order to predict how students will respond to each question. We estimate the following model for each item, using information from other students in that year, grade, and subject:

(1)   \Pr(Y_{isc} = j) = \frac{e^{\beta_j X_s}}{\sum_{j=1}^{J} e^{\beta_j X_s}}

where Y_{isc} indicates the response of student s in class c on item i, the number of possible responses (J) is four, and X_s is a vector that includes measures of prior and future student achievement in math and reading, as well as demographic variables (such as race, gender, and free lunch status) for student s. Thus, a student's predicted probability of choosing a particular response is identified by the likelihood of other students (in the same year, grade, and subject) with similar background characteristics choosing that response.
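As an illustration, the per-item model in equation (1) can be fit with any multinomial logit routine; the sketch below uses scikit-learn on synthetic data, with the covariates standing in for the prior and future achievement scores and demographics in X_s (all names and data here are hypothetical).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_students = 2000

# X_s: prior/future math and reading scores plus demographics (synthetic).
X = rng.normal(size=(n_students, 6))
# Observed response on one item, coded 0-3 for choices A-D (synthetic).
y = rng.integers(0, 4, size=n_students)

# One multinomial logit is estimated per item; a single item is shown here.
logit = LogisticRegression(max_iter=1000).fit(X, y)

# Predicted probability of each possible choice for every student ...
probs = logit.predict_proba(X)                    # shape (n_students, 4)
# ... and of the response each student actually gave (used in eq. (2) below).
p_actual = probs[np.arange(n_students), y]
```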

Notice that by including future as well as prior test scores in the model, we decrease the likelihood that students with unusually good teachers will be identified as cheaters, since these students will likely retain some of the knowledge learned in the base year and thus have higher future test scores. Also note that by estimating the probability of selecting each possible response, rather than simply estimating the probability of choosing the correct response, we take advantage of any additional information that is provided by particular response patterns in a classroom.

Using the estimates from this model, we calculate the predicted probability that each student would answer each item in the way that he or she in fact did:

(2)   p_{isc} = \frac{e^{\beta_k X_s}}{\sum_{j=1}^{J} e^{\beta_j X_s}}, \quad k = \text{the response actually chosen by student } s \text{ on item } i

This provides us with one measure per student per item. Taking the product over items within student, we calculate the probability that a student would have answered a string of consecutive questions from item m to item n as he or she did:

(3)   p_{sc}^{mn} = \prod_{i=m}^{n} p_{isc}

We then take the product across all students in the classroom who had identical responses in the string. If we define z as a student, S_{zc}^{mn} as the string of responses for student z from item m to item n, and S_{sc}^{mn} as the string for student s, then we can express the product as

(4)   \tilde{p}_{sc}^{mn} = \prod_{s \in \{z \,:\, S_{zc}^{mn} = S_{sc}^{mn}\}} p_{sc}^{mn}

Note that if there are n_s students in class c and each student has a unique set of responses to these particular items, then \tilde{p}_{sc}^{mn} collapses to p_{sc}^{mn} for each student, and there will be n_s distinct values within the class. On the other extreme, if all of the students in class c have identical responses, then there is only one distinct value of \tilde{p}_{sc}^{mn}. We repeat this calculation for all possible consecutive strings of length three to seven, that is, for all blocks S^{mn} spanning three to seven consecutive items. To create our first indicator of suspicious string patterns, we take the minimum of the predicted block probability for each classroom:

MEASURE 1:   M1_c = \min_s (\tilde{p}_{sc}^{mn})

This measure captures the least likely block of identical answers given on consecutive questions in the classroom.
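A direct, if brute-force, way to compute this measure from the per-student, per-item probabilities of equation (2) is sketched below; the array layout is a hypothetical convenience.

```python
import numpy as np
from collections import defaultdict

def measure_1(p_actual, responses, min_len=3, max_len=7):
    """p_actual: (students x items) probability that each student gives the
    answer he or she actually gave (eq. (2)). responses: (students x items)
    coded answers. Returns the probability of the classroom's least likely
    block of identical consecutive answers (eqs. (3) and (4))."""
    n_students, n_items = p_actual.shape
    best = 1.0
    for length in range(min_len, max_len + 1):
        for m in range(n_items - length + 1):
            block = slice(m, m + length)
            # Probability each student answers the block as observed (eq. (3)).
            p_block = p_actual[:, block].prod(axis=1)
            # Group students whose response strings on the block are identical.
            groups = defaultdict(list)
            for s in range(n_students):
                groups[tuple(responses[s, block])].append(s)
            for members in groups.values():
                # Joint probability across students sharing the string (eq. (4)).
                best = min(best, p_block[members].prod())
    return best
```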

The second measure of suspicious answer strings is intended to capture more general patterns of similarity in student responses. To construct this measure, we first calculate the residuals for each of the possible choices a student could have made for each item:

(5)   e_{jisc} = \begin{cases} 0 - \dfrac{e^{\beta_j X_s}}{\sum_{j=1}^{J} e^{\beta_j X_s}}, & \text{if } j \neq k \\[2ex] 1 - \dfrac{e^{\beta_j X_s}}{\sum_{j=1}^{J} e^{\beta_j X_s}}, & \text{if } j = k \end{cases}

where e_{jisc} is the residual for response j on item i by student s in classroom c. We thus have four separate residuals per student per item.

To create a classroom-level measure of the response to item i, we need to combine the information for each student. First, we sum the residuals for each response across students within a classroom:

(6)   \bar{e}_{jic} = \sum_s e_{jisc}

If there is no within-class correlation in the way that students responded to a particular item, this term should be approximately zero. Second, we sum across the four possible responses for each item within classrooms. At the same time, we square each of the component residual measures to accentuate outliers, and divide by the number of students in the class (n_{sc}) to normalize by class size:

(7)   v_{ic} = \frac{\sum_j \bar{e}_{jic}^{\,2}}{n_{sc}}

The statistic v_{ic} captures something like the variance of student responses on item i within classroom c. Notice that we choose to first sum the residuals of each response across students, and then sum the classroom-level measures for each response, rather than summing across responses within student initially. We do this in order to emphasize the classroom-level tendencies in response patterns.

Our second measure of suspicious strings is simply the classroom average (across items) of this variance term across all test items:

MEASURE 2:   M2_c = \bar{v}_c = \sum_i v_{ic} / n_i, where n_i is the number of items on the exam.

Our third measure is simply the variance (as opposed to the mean) in the degree of correlation across questions within a classroom:

MEASURE 3:   M3_c = \sigma_{v_c}^2 = \sum_i (v_{ic} - \bar{v}_c)^2 / n_i
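Given a (students × items × choices) array of the residuals e_{jisc} from equation (5), Measures 2 and 3 reduce to a few array operations, as in this sketch (the array layout is again our own convention):

```python
import numpy as np

def measures_2_and_3(resid):
    """resid: (students x items x choices) array of residuals e_jisc for one
    classroom. Returns (M2, M3)."""
    n_students = resid.shape[0]
    # Sum the residuals across students for each item/choice pair (eq. (6)) ...
    e_bar = resid.sum(axis=0)                   # shape (items, choices)
    # ... then square, sum over choices, and normalize by class size (eq. (7)).
    v = (e_bar ** 2).sum(axis=1) / n_students   # shape (items,)
    return v.mean(), v.var()                    # M2 (mean) and M3 (variance)
```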

Our final indicator focuses on the extent to which a student's response pattern was different from other students' with the same aggregate score that year. Let q_{isc} equal one if student s in classroom c answered item i correctly, and zero otherwise. Let A_s equal the aggregate score of student s on the exam. We then determine what fraction of students at each aggregate score level answered each item correctly. If we let n_{sA} equal the number of students with an aggregate score of A, then this fraction \bar{q}_i^A can be expressed as

(8)   \bar{q}_i^A = \frac{\sum_{s \in \{z \,:\, A_z = A_s\}} q_{isc}}{n_{sA}}

We then calculate a measure of how much the response pattern of student s differed from the response pattern of other students with the same aggregate score. We do so by subtracting a student's answer on item i from the mean response of all students with aggregate score A, squaring these deviations, and then summing across all items on the exam:

(9)   Z_{sc} = \sum_i (q_{isc} - \bar{q}_i^A)^2

We then subtract out the mean deviation for all students with the same aggregate score, \bar{Z}_A, and sum over the students within each classroom to obtain our final indicator:

MEASURE 4:   M4_c = \sum_s (Z_{sc} - \bar{Z}_A)
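Putting equations (8) and (9) together, Measure 4 can be computed for all classrooms in a grade and year at once; correct and class_ids below are hypothetical inputs (a 0/1 response matrix for every tested student, and each student's classroom label).

```python
import numpy as np

def measure_4(correct, class_ids):
    """correct: (students x items) 0/1 matrix for all tested students in a
    grade/year. class_ids: length-students array of classroom labels.
    Returns a dict mapping each classroom to M4."""
    correct = np.asarray(correct, dtype=float)
    class_ids = np.asarray(class_ids)
    totals = correct.sum(axis=1)              # aggregate score A_s
    q_bar = np.empty_like(correct)
    dev_demeaned = np.empty(correct.shape[0])
    for a in np.unique(totals):
        same = totals == a
        # Fraction answering each item correctly at this score level (eq. (8)).
        q_bar[same] = correct[same].mean(axis=0)
    # Squared deviations from the score-level means, summed over items (eq. (9)).
    dev = ((correct - q_bar) ** 2).sum(axis=1)
    for a in np.unique(totals):
        same = totals == a
        # Subtract the mean deviation among students with the same score.
        dev_demeaned[same] = dev[same] - dev[same].mean()
    # Sum within classrooms to obtain M4 for each class.
    return {c: dev_demeaned[class_ids == c].sum() for c in np.unique(class_ids)}
```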

KENNEDY SCHOOL OF GOVERNMENT, HARVARD UNIVERSITY

UNIVERSITY OF CHICAGO AND AMERICAN BAR FOUNDATION

REFERENCES

Aiken, Linda R., "Detecting, Understanding, and Controlling for Cheating on Tests," Research in Higher Education, XXXII (1991), 725-736.

Angoff, William H., "The Development of Statistical Indices for Detecting Cheaters," Journal of the American Statistical Association, LXIX (1974), 44-49.

Baker, George, "Incentive Contracts and Performance Measurement," Journal of Political Economy, C (1992), 598-614.

Bellezza, Francis S., and Suzanne F. Bellezza, "Detection of Cheating on Multiple-Choice Tests by Using Error Similarity Analysis," Teaching of Psychology, XVI (1989), 151-155.

Carnoy, Martin, and Susanna Loeb, "Does External Accountability Affect Student Outcomes? A Cross-State Analysis," Working Paper, Stanford University School of Education, 2002.

Contreras, Russell, "Inquiry Uncovers Possible Cheating on GRE in Asia," Associated Press, August 7, 2002.

Cullen, Julie Berry, and Randall Reback, "Tinkering Toward Accolades: School Gaming under a Performance Accountability System," Working Paper, University of Michigan, 2002.

Deere, Donald, and Wayne Strayer, "Putting Schools to the Test: School Accountability, Incentives, and Behavior," Working Paper, Texas A&M University, 2002.

Di Tella, Rafael, and Ernesto Schargrodsky, "The Role of Wages and Auditing during a Crackdown on Corruption in the City of Buenos Aires," unpublished manuscript, Harvard Business School, 2001.

Duggan, Mark, and Steven Levitt, "Winning Isn't Everything: Corruption in Sumo Wrestling," American Economic Review, XCII (2002), 1594-1605.

Figlio, David N., "Testing, Crime and Punishment," Working Paper, University of Florida, 2002.

Figlio, David N., and Joshua Winicki, "Food for Thought: The Effects of School Accountability Plans on School Nutrition," Working Paper, University of Florida, 2001.

Figlio, David N., and Lawrence S. Getzler, "Accountability, Ability and Disability: Gaming the System," Working Paper, University of Florida, 2002.

Fisman, Ray, "Estimating the Value of Political Connections," American Economic Review, XCI (2001), 1095-1102.

Frary, Robert B., "Statistical Detection of Multiple-Choice Answer Copying: Review and Commentary," Applied Measurement in Education, VI (1993), 153-165.

Frary, Robert B., T. Nicholas Tideman, and Thomas Watts, "Indices of Cheating on Multiple-Choice Tests," Journal of Educational Statistics, II (1977), 235-256.

Glaeser, Edward, and Andrei Shleifer, "A Reason for Quantity Regulation," American Economic Review, XCI (2001), 431-435.

Grissmer, David W., Ann Flanagan, et al., "Improving Student Achievement: What NAEP Test Scores Tell Us," RAND Corporation, MR-924-EDU, 2000.

Hanushek, Eric A., and Margaret E. Raymond, "Improving Educational Quality: How Best to Evaluate Our Schools?" Conference paper, Federal Reserve Bank of Boston, 2002.

Hofkins, Diane, "Cheating 'Rife' in National Tests," New York Times Educational Supplement, June 16, 1995.

Holland, Paul W., "Assessing Unusual Agreement between the Incorrect Answers of Two Examinees Using the K-Index: Statistical Theory and Empirical Support," Technical Report No. 96-5, Educational Testing Service, 1996.

Holmstrom, Bengt, and Paul Milgrom, "Multitask Principal-Agent Analyses: Incentive Contracts, Asset Ownership, and Job Design," Journal of Law, Economics, and Organization, VII (1991), 24-51.

Jacob, Brian A., "Accountability, Incentives and Behavior: The Impact of High-Stakes Testing in the Chicago Public Schools," Working Paper No. 8968, National Bureau of Economic Research, 2002.

Jacob, Brian A., and Lars Lefgren, "The Impact of Teacher Training on Student Achievement: Quasi-Experimental Evidence from School Reform Efforts in Chicago," Journal of Human Resources, forthcoming.

Jacob, Brian A., and Steven Levitt, "Catching Cheating Teachers: The Results of an Unusual Experiment in Implementing Theory," Working Paper No. 9413, National Bureau of Economic Research, 2003.

Klein, Stephen P., Laura S. Hamilton, et al., "What Do Test Scores in Texas Tell Us?" RAND Corporation, 2000.

Kolker, Claudia, "Texas Offers Hard Lessons on School Accountability," Los Angeles Times, April 14, 1999.

Lindsay, Drew, "Whodunit? Officials Find Thousands of Erasures on Standardized Tests and Suspect Tampering," Education Week (1996), 25-29.

Loughran, Regina, and Thomas Comiskey, "Cheating the Children: Educator Misconduct on Standardized Tests," Report of the City of New York Special Commissioner of Investigation for the New York City School District, 1999.

Marcus, John, "Faking the Grade," Boston Magazine, February 2000.

May, Meredith, "State Fears Cheating by Teachers," San Francisco Chronicle, October 4, 2000.

Perlman, Carol L., "Results of a Citywide Testing Program Audit in Chicago," Paper for American Educational Research Association annual meeting, Chicago, IL, 1985.

Porter, Robert H., and J. Douglas Zona, "Detection of Bid Rigging in Procurement Auctions," Journal of Political Economy, CI (1993), 518-538.

Richards, Craig E., and Tian Ming Sheu, "The South Carolina School Incentive Reward Program: A Policy Analysis," Economics of Education Review, XI (1992), 71-86.

Rohlfs, Chris, "Estimating Classroom Learning Using the School Breakfast Program as an Instrument for Attendance and Class Size: Evidence from Chicago Public Schools," unpublished manuscript, University of Chicago, 2002.

Tysome, Tony, "Cheating Purge: Inspectors Out," New York Times Higher Education Supplement, August 19, 1994.

Wollack, James A., "A Nominal Response Model Approach for Detecting Answer Copying," Applied Psychological Measurement, XX (1997), 144-152.

877ROTTEN APPLES

Page 14: rotten apples: an investigation of the prevalence and predictors of ...

range of school-level characteristics We do not however haveindividual teacher identiers so we are unable to directly linkteachers to classrooms or to track a particular teacher over time

Because our cheating proxies rely on comparisons to past andfuture test scores we drop observations that are missing readingor math scores in either the preceding year or the following year(in addition to those with missing test scores in the baselineyear)16 Less than one-half of 1 percent of students with missingdemographic data are also excluded from the analysis Finallybecause our algorithms for identifying cheating rely on identify-ing suspicious patterns within a classroom our methods havelittle power in classrooms with small numbers of students Con-sequently we drop all classrooms for which we have fewer thanten valid students in a particular grade after our other exclusions(roughly 3 percent of students) A handful of classrooms recordedas having more than 40 studentsmdashpresumably multiple class-rooms not separately identied in the datamdashare also droppedOur nal data set contains roughly 20000 students per grade peryear distributed across approximately 1000 classrooms for atotal of over 40000 classroom-years of data (with four subjecttests per classroom-year) and over 700000 student-yearobservations

Summary statistics for the full sample of classrooms areshown in Table I where the unit of observation is at the level ofclasssubjectyear As in many urban school districts students inChicago are disproportionately poor and minority Roughly 60percent of students are Black and 26 percent are Hispanic withnearly 85 percent receiving free or reduced-price lunch Less than29 percent of students scored at or above national norms inreading

The ITBS exams are administered over a weeklong period inearly May Third grade teachers are permitted to administer theexam to their own students while other teachers switch classes toadminister the exams The exams are generally delivered to theschools one to two weeks before testing and are supposed to be

16 The exact number of students with missing test data varies by year andgrade Overall we drop roughly 12 percent of students because of missing testscore information in the baseline year These are students who (i) were not testedbecause of placement in bilingual or special education programs or (ii) simplymissed the exam that particular year In addition to these students we also dropapproximately 13 percent of students who were missing test scores in either thepreceding or following years Test data may be missing either because a studentdid not attend school on the days of the test or because the student transferredinto the CPS system in the current year or left the system prior to the next yearof testing

856 QUARTERLY JOURNAL OF ECONOMICS

kept in a secure location by the principal or the schoolrsquos testcoordinator an individual in the school designated to coordinatethe testing process (often a counselor or administrator) Each

TABLE ISUMMARY STATISTICS

Mean(sd)

Classroom characteristicsMixed grade classroom 0073Teacher administers exams to her own students (3rd grade) 0206Percent of students who were tested and included in ofcial

reporting 0883Average prior achievement 20004(as deviation from yeargradesubject mean) (0661) Black 0595 Hispanic 0263 Male 0495 Old for grade 0086 Living in foster care 0044 Living with nonparental relative 0104Cheatermdash95th percentile cutoff 0013School characteristicsAverage quality of teachersrsquo undergraduate institution

in the school22550(0877)

Percent of teachers who live in Chicago 0712Percent of teachers who have an MA or a PhD 0475Percent of teachers who majored in education 0712Percent of teachers under 30 years of age 0114Percent of teachers at the school less than 3 years 0547 students at national norms in reading last year 288 students receiving free lunch in school 847Predominantly Black school 0522Predominantly Hispanic school 0205Mobility rate in school 286Attendance rate in school 926School size 722

(317)Accountability policySocial promotion policy 0215School probation policy 0127Test form offered for the rst time 0371Number of observations 163474

The data summarized above include one observation per classroomyearsubjectThe sample includes allstudents in third to seventh grade for the years 1993ndash2000 We drop observations for the following reasons(a) missing reading or math scores in either the baseline year the preceding year or the following year (b)missing demographic data (c) classrooms for which we have fewer than ten valid students in a particulargrade or more than 40 students See the discussion in the text for a more detailed explanation of the sampleincluding the percentages excluded for various reasons

857ROTTEN APPLES

section of the exam consists of 30 to 60 multiple choice questionswhich students are given between 30 and 75 minutes to com-plete17 Students mark their responses on answer sheets whichare scanned to determine a studentrsquos score There is no penaltyfor guessing A studentrsquos raw score is simply calculated as thesum of correct responses on the exam The raw score is thentranslated into a grade equivalent

After collecting student exams teachers or administratorsthen ldquocleanrdquo the answer keys erasing stray pencil marks remov-ing dirt or debris from the form and darkening item responsesthat were only faintly marked by the student At the end of thetesting week the test coordinators at each school deliver thecompleted answer keys and exams to the CPS central ofceSchool personnel are not supposed to keep copies of the actualexams although school ofcials acknowledge that a number ofteachers each year do so The CPS has administered three differ-ent versions of the ITBS between 1993 and 2000 The CPS alter-nates forms each year with new forms being offered for the rsttime in 1993 1994 and 199718

V THE PREVALENCE OF TEACHER CHEATING

As noted earlier our estimates of the prevalence of cheatingare derived from a comparison of the actual number of classroomsthat are above some threshold on both of our cheating indicatorsrelative to the number we would expect to observe based on thecorrelation between these two measures in the 50thndash75th percen-tiles of the suspicious string measure The top panel of Table IIpresents our estimates of the percentage of classrooms that arecheating on average on a given subject test (ie reading compre-hension or one of the three math tests) in a given year19 Because

17 The mathematics and reading tests measure basic skills The readingcomprehension exam consists of three to eight short passages followed by up tonine questions relating to the passage The math exam consists of three sectionsthat assess number concepts problem-solving and computation

18 These three forms are used for retesting summer school testing andmidyear testing as well so that it is likely that over the years teachers have seenthe same exam on a number of occasions

19 It is important to note that we exclude a particular form of cheating thatappears to be quite prevalent in the data teachers randomly lling in answers leftblank by students at the end of the exam In many classrooms almost everystudent will end the test with a long string of the identical answers (typically allthe same response like ldquoBrdquo or ldquoDrdquo) The fact that almost all students in the classcoordinate on the same pattern strongly suggests that the students themselvesdid not ll in the blanks or were under explicit instructions by the teacher to do

858 QUARTERLY JOURNAL OF ECONOMICS

the decision as to what cutoff signies a ldquohighrdquo value on ourcheating indicators is arbitrary we present a 3 3 3 matrix ofestimates using three different thresholds (eightieth ninetiethand ninety-fth percentiles) for each of our cheating measuresThe estimated prevalence of cheaters ranges from 11 percent to21 percent depending on the particular set of cutoffs used Aswould be expected the number of cheaters is generally decliningas higher thresholds are employed Nonetheless it is encouragingthat over such a wide set of cutoffs the range of estimates isrelatively tight

The bottom panel of Table II presents estimates of the per-centage of classrooms that are cheating on any of the four subjecttests in a particular year If every classroom that cheated did soonly on one subject test then the results in the bottom panel

so Since there is no penalty for guessing on the test lling in the blanks can onlyincrease student test scores While this type of teacher behavior is likely to beviewed by many as unethical we do not make it the focus of our analysis because(1) it is difcult to provide denitive evidence of such behavior (a teacher couldargue that he or she instructed students well in advance of the test to ll in allblanks with the letter ldquoCrdquo as part of good test-taking strategy) and (2) in ourminds it is categorically different than a teacher who systematically changesstudent responses to the correct answer

TABLE IIESTIMATED PREVALENCE OF TEACHER CHEATING

Cutoff for suspicious answerstrings (ANSWERS)

Cutoff for test score uctuations (SCORE)

80th percentile 90th percentile 95th percentile

Percent cheating on a particular test80th percentile 21 21 1890th percentile 18 18 1595th percentile 13 13 11

Percent cheating on at least one of the four tests given80th percentile 45 56 5390th percentile 42 49 4495th percentile 35 38 34

The top panel of the table presents estimates of the percentage of classrooms cheating on a particularsubject test in a given year based on three alternative cutoffs for ANSWERS and SCORE In all cases theprevalence of cheating is based on the excess number of classrooms with unexpected test score uctuationamong classes with suspicious answer strings relative to classes that do not have suspicious answer stringsThe bottom panel of the table presents estimates of the percentage of classrooms cheating on at least one ofthe four subject tests that comprise the overall test In the bottom panel classrooms that cheat on more thanone subject test are counted only once Our sample includes over 35000 3rdndash7th grade classrooms in theChicago public schools for the years 1993ndash1999

859ROTTEN APPLES

would simply be four times the results in the top panel In manyinstances however classrooms appear to cheat on multiple sub-jects Thus the prevalence rates range from 34ndash56 percent of allclassrooms20 Because of the necessarily indirect nature of ouridentication strategy we explore a range of supplementary testsdesigned to assess the validity of the estimates21

VA Simulation Results

Our prevalence estimates may be biased downward or up-ward depending on the extent to which our algorithm fails tosuccessfully identify all instances of cheating (ldquofalse negativesrdquo)versus the frequency with which we wrongly classify a teacher ascheating (ldquofalse positiverdquo)

One simple simulation exercise we undertook with respect topossible false positives involves randomly assigning students tohypothetical classrooms These synthetic classrooms thus consistof students who in actuality had no connection to one another Wethen analyze these hypothetical classrooms using the same algo-rithm applied to the actual data As one would hope no evidenceof cheating was found in the simulated classes Indeed the esti-mated prevalence of cheating was slightly negative in this simu-lation (ie classrooms with large test score increases in thecurrent year followed by big declines the next year were slightlyless likely to have unusual patterns of answer strings) Thusthere does not appear to be anything mechanical in our algorithmthat generates evidence of cheating

A second simulation involves articially altering a class-

20 Computation of the overall prevalence is somewhat complicated becauseit involves calculating not only how many classrooms are actually above thethresholds on multiple subject tests but also how frequently this would occur inthe absence of cheating Details on these calculations are available from theauthors

21 In an earlier version of this paper [Jacob and Levitt 2003] we present anumber of additional sensitivity analyses that conrm the basic results First wetest whether our results are sensitive to the way in which we predict the preva-lence of high values of ANSWERS and SCORE in the upper part of the otherdistribution (eg tting a linear or quadratic model to the data in the lowerportion of the distribution versus simply using the average value from the 50thndash75th percentiles) We nd our results are extremely robust Second we examinewhether our results might simply be due to mean reversion by testing whetherclassrooms with suspicious answer strings are less likely to maintain large testscore gains Among a sample of classrooms with large test score gains we nd thatmean reversion is substantially greater among the set of classes with highlysuspicious answer strings Third we investigate whether we might be detectingstudent rather than teacher cheating by examining whether students who havesuspicious answer strings in one year are more likely to have suspicious answerstrings in other years We nd that they do not

860 QUARTERLY JOURNAL OF ECONOMICS

roomrsquos answer strings in ways that we believe mimic teachercheating or outstanding teaching22 We are then able to see howfrequently we label these altered classes as ldquocheatingrdquo and howthis varies with the extent and nature of the changes we make toa classroomrsquos answers We simulate two different types of teachercheating The rst is a very naive version in which a teacherstarts cheating at the same question for a number of students andchanges consecutive questions to the right answers for thesestudents creating a block of identical and correct responses Thesecond type of cheating is much more sophisticated we randomlychange answers from incorrect to correct for selected students ina class The outcome of this simulation reveals that our methodsare surprisingly weak in detecting moderate cheating even in thenave case For instance when three consecutive answers arechanged to correct for 25 percent of a class we catch such behav-ior only 4 percent of the time Even when six questions arechanged for half of the class we catch the cheating in less than 60percent of the cases Only when the cheating is very extreme (egsix answers changed for the entire class) are we very likely tocatch the cheating The explanation for this poor performance isthat classrooms with low test-score gains even with the boostprovided by this cheating do not have large enough test scoreuctuations to make them appear unusual because there is somuch inherent volatility in test scores When the cheating takesthe more sophisticated form described above we are even lesssuccessful at catching low and moderate cheaters (2 percent and34 percent respectively in the rst two scenarios describedabove) but we almost always detect very extreme cheating

From a political or equity perspective however we may beeven more concerned with false positives specically the risk ofaccusing highly effective teachers of cheating Hence we alsosimulate the impact of a good teacher changing certain answersfor certain students from incorrect to correct responses Our ma-nipulation in this case differs from the cheating simulationsabove in two ways (1) we allow some of the gains to be preservedin the following year and (2) we alter questions such that stu-dents will not get random answers correct but rather will tend toshow the greatest improvement on the easiest questions that thestudents were getting wrong These simulation results suggest

22 See Jacob and Levitt [2003] for the precise details of this simulationexercise

861ROTTEN APPLES

that only rarely are good teachers mistakenly labeled as cheatersFor instance a teacher whose prowess allows him or her to raisetest scores by six questions for half the students in the class (morethan a 05 grade equivalent increase on average for the class) islabeled a cheater only 2 percent of the time In contrast a navecheater with the same test score gains is correctly identied as acheater in more than 50 percent of cases

In another test for false positives we explored whether wemight be mistaking emphasis on certain subject material forcheating For example if a math teacher spends several monthson fractions with a particular class one would expect the class todo particularly well on all of the math questions relating tofractions and perhaps worse than average on other math ques-tions One might imagine a similar scenario in reading if forexample a teacher creates a lengthy unit on the UndergroundRailroad which later happens to appear in a passage on thereading comprehension exam To examine this we analyze thenature of the across-question correlations in cheating versusnoncheating classrooms We nd that the most highly correlatedquestions in cheating classrooms are no more likely to measurethe same skill (eg fractions geometry) than in noncheatingclassrooms The implication of these results is that the prevalenceestimates presented above are likely to substantially understatethe true extent of cheatingmdashthere is little evidence of false posi-tivesmdashbut we frequently miss moderate cases of cheating

VB The Correlation of Cheating Indicators across SubjectsClassrooms and Years

If what we are detecting is truly cheating then one wouldexpect that a teacher who cheats on one part of the test would bemore likely to cheat on other parts of the test Also a teacher whocheats one year would be more likely to cheat the following yearFinally to the extent that cheating is either condoned by theprincipal or carried out by the test coordinator one would expectto nd multiple classes in a school cheating in any given year andperhaps even that cheating in a school one year predicts cheatingin future years If what we are detecting is not cheating then onewould not necessarily expect to nd strong correlation in ourcheating indicators across exams for a specic classroom acrossclassrooms or across years

862 QUARTERLY JOURNAL OF ECONOMICS

Table III reports regression results testing these predictionsThe dependent variable is an indicator for whether we believe aclassroom is likely to be cheating on a particular subject testusing our most stringent denition (above the ninety-fth per-centile on both cheating indicators) The baseline probability of

TABLE IIIPATTERNS OF CHEATING WITHIN CLASSROOMS AND SCHOOLS

Independent variables

Dependent variable 5 Class suspectedof cheating (mean of the dependent

variable 5 0011)

Full sample

Sample ofclasses andschool that

existed in theprior year

Classroom cheated on exactly oneother subject this year

0105(0008)

0103(0008)

0101(0009)

0101(0009)

Classroom cheated on exactly twoother subjects this year

0289(0027)

0285(0027)

0243(0031)

0243(0031)

Classroom cheated on all three othersubjects this year

0627(0051)

0622(0051)

0595(0054)

0595(0054)

Cheating rate among all other classesin the school this year on thissubject

0166(0030)

0134(0027)

0129(0027)

Cheating rate among all other classesin the school this year on othersubjects

0023(0024)

0059(0026)

0045(0029)

Cheating in this classroom in thissubject last year

0096(0012)

0091(0012)

Number of other subjects thisclassroom cheated on last year

0023(0004)

0018(0004)

Cheating in this classroom ever in thepast

0006(0002)

Cheating rate among other classroomsin this school in past years

0090(0040)

Fixed effects for gradesubjectyear Yes Yes Yes YesR2 0090 0093 0109 0109Number of observations 165578 165578 94182 94170

The dependent variable is an indicator for whether a classroom is above the 95th percentile on both oursuspicious strings and unusual test score measures of cheating on a particular subject test Estimation isdone using a linear probability model The unit of observation is classroomgradeyearsubjectFor columnsthat include measures of cheating in prior years observations where that classroom or school does not appearin the data in the prior year are excluded Standard errors are clustered at the school level to take intoaccount correlations across classroom as well as serial correlation

863ROTTEN APPLES

qualifying as a cheater for this cutoff is 11 percent To fullyappreciate the enormity of the effects implied by the table it isimportant to keep this very low baseline in mind We reportestimates from linear probability models (probits yield similarmarginal effects) with standard errors clustered at the schoollevel

Column 1 of Table III shows that cheating on other tests inthe same year is an extremely powerful predictor of cheating in adifferent subject If a classroom cheats on exactly one other sub-ject test the predicted probability of cheating on this test in-creases by over ten percentage points Since the baseline cheatingrates are only 11 percent classrooms cheating on exactly oneother test are ten times more likely to have cheated on thissubject than are classrooms that did not cheat on any of the othersubjects (which is the omitted category) Classrooms that cheaton two other subjects are almost 30 times more likely to cheat onthis test relative to those not cheating on other tests If a classcheats on all three other subjects it is 50 times more likely to alsocheat on this test

There also is evidence of correlation in cheating withinschools A ten-percentage-point increase in cheating classroomsin a school (excluding the classroom in question) on the samesubject test raises the likelihood this class cheats by roughly 016percentage points This potentially suggests some role for cen-tralized cheating by a school counselor test coordinator or theprincipal rather than by teachers operating independentlyThere is little evidence that cheating rates within the school onother subject tests affect cheating on this test

When making comparisons across years (columns 3 and 4) itis important to note that we do not actually have teacher identi-ers We do however know what classroom a student is assignedto Thus we can only compare the correlation between past andcurrent cheating in a given classroom To the extent that teacherturnover occurs or teachers switch classrooms this proxy will becontaminated Even given this important limitation cheating inthe classroom last year predicts cheating this year In column 3for example we see that classrooms that cheated in the samesubject last year are 96 percentage points more likely to cheatthis year even after we control for cheating on other subjects inthe same year and cheating in other classes in the school Column4 shows that prior cheating in the school strongly predicts thelikelihood that a classroom will cheat this year


V.C. Evidence from Retesting under Controlled Circumstances

Perhaps the most compelling evidence for our methods comes from the results of a unique policy intervention. In spring 2002 the Chicago public schools provided us the opportunity to retest over 100 classrooms under controlled circumstances that precluded cheating, a few weeks after the initial ITBS exam was administered.23 Based on suspicious answer strings and unusual test score gains on the initial spring 2002 exam, we identified a set of classrooms most likely to have cheated. In addition, we chose a second group of classrooms that had suspicious answer strings but did not have large test score gains, a pattern that is consistent with a bad teacher who attempts to mask a poor teaching performance by cheating. We also retested two sets of classrooms as control groups. The first control group consisted of classes with large test score gains but no evidence of cheating in their answer strings, a potential indication of effective teaching. These classes would be expected to maintain their gains when retested, subject to possible mean reversion. The second control group consisted of randomly selected classrooms.

The results of the retests are reported in Table IV. Columns 1-3 correspond to reading; columns 4-6 are for math. For each subject we report the test score gain (in standard score units) for three periods: (i) from spring 2001 to the initial spring 2002 test, (ii) from the initial spring 2002 test to the retest a few weeks later, and (iii) from spring 2001 to the spring 2002 retest. For purposes of comparison, we also report the average gain for all classrooms in the system for the first of those measures (the other two measures are unavailable unless a classroom was retested).

Classrooms prospectively identified as "most likely to have cheated" experienced gains on the initial spring 2002 test that were nearly twice as large as the typical CPS classroom (see columns 1 and 4). On the retest, however, those excess gains completely disappeared: the gains between spring 2001 and the spring 2002 retest were close to the systemwide average. Those classrooms prospectively identified as "bad teachers likely to have cheated" also experienced large score declines on the retest (-8.8 and -10.5 standard score points, or roughly two-thirds of a grade equivalent). Indeed, comparing the spring 2001 results with the

23. Full details of this policy experiment are reported in Jacob and Levitt [2003]. The results of the retests were used by CPS to initiate disciplinary action against a number of teachers, principals, and staff.


TABLE IV
RESULTS OF RETESTS: COMPARISON OF RESULTS FOR SPRING 2002 ITBS AND AUDIT TEST

                                         Reading gains between           Math gains between
                                      Spring    Spring    Spring      Spring    Spring    Spring
                                      2001-     2001-     2002-       2001-     2001-     2002-
Category of classroom                 2002      retest    retest      2002      retest    retest

All classrooms in the Chicago
  public schools                       14.3       --        --         16.9       --        --

Classrooms prospectively identified
  as most likely cheaters
  (N = 36 on math, N = 39
  on reading)                          28.8      12.6     -16.2        30.0      19.3     -10.7

Classrooms prospectively identified
  as bad teachers suspected of
  cheating (N = 16 on math,
  N = 20 on reading)                   16.6       7.8      -8.8        17.3       6.8     -10.5

Classrooms prospectively identified
  as good teachers who did not
  cheat (N = 17 on math,
  N = 17 on reading)                   20.6      21.1       0.5        28.8      25.5      -3.3

Randomly selected classrooms
  (N = 24 overall, but only one
  test per classroom)                  14.5      12.2      -2.3        14.5      12.2      -2.3

Results in this table compare test scores at three points in time (spring 2001, spring 2002, and a retest administered under controlled conditions a few weeks after the initial spring 2002 test). The unit of observation is a classroom-subject test pair. Only students taking all three tests are included in the calculations. Because of limited data, math and reading results for the randomly selected classrooms are combined. Only the spring 2001 to spring 2002 gains are available for all CPS classrooms, since audits were performed only on a subset of classrooms. All entries in the table are in standard score units.


spring 2002 retest, children in these classrooms had gains only half as large as the typical yearly CPS gain. In stark contrast, classrooms prospectively identified as having "good teachers" (i.e., classrooms with large gains but not suspicious answer strings) actually scored even higher on the reading retest than they did on the initial test. The math scores fell slightly on retesting, but these classrooms continued to experience extremely large gains. The randomly selected classrooms also maintained almost all of their gains when retested, as would be expected. In summary, our methods proved to be extremely effective both in identifying likely cheaters and in correctly classifying teachers who achieved large test score gains in a legitimate manner.

VI. DOES TEACHER CHEATING RESPOND TO INCENTIVES?

From the perspective of economics, perhaps the most interesting question related to teacher cheating is the degree to which it responds to incentives. As noted in the Introduction, there were two major changes in the incentives faced by teachers and students over our sample period. Prior to 1996, ITBS scores were primarily used to provide teachers and parents with a sense of how a child was progressing academically. Beginning in 1996, with the appointment of Paul Vallas as CEO of Schools, the CPS launched an initiative designed to hold students and teachers accountable for student learning.

The reform had two main elements. The first involved putting schools "on probation" if fewer than 15 percent of students scored at or above national norms on the ITBS reading exams. The CPS did not use math performance to determine probation status. Probation schools that do not exhibit sufficient improvement may be reconstituted, a procedure that involves closing the school and dismissing or reassigning all of the school staff.24 It is clear from our discussions with teachers and administrators that being on probation is viewed as an extremely undesirable circumstance. The second piece of the accountability reform was an end to social promotion, the practice of passing students to the next grade regardless of their academic skills or school performance. Under the new policy, students in third, sixth, and eighth grade must meet minimum standards on the ITBS in both reading and

24. For a more detailed analysis of the probation policy, see Jacob [2002] and Jacob and Lefgren [forthcoming].


mathematics in order to be promoted to the next grade. The promotion standards were implemented in spring 1997 for third and sixth graders. Promotion decisions are based solely on scores in reading comprehension and mathematics.

Table V presents OLS estimates of the relationship between teacher cheating and a variety of classroom and school characteristics.25 The unit of observation is a classroom × subject × grade × year. The dependent variable is an indicator of whether we suspect the classroom cheated, using our most stringent definition: a classroom is designated a cheater if both its test score fluctuations and the suspiciousness of its answer strings are above the ninety-fifth percentile for that grade, subject, and year.26
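To make this definition concrete, the following is a minimal sketch of how such an indicator could be constructed, assuming a pandas DataFrame with one row per classroom × grade × year × subject and hypothetical columns score_fluct and answer_susp holding the two continuous cheating measures (the construction of those underlying measures is described in the Appendix).

import pandas as pd

def flag_cheaters(df: pd.DataFrame, cutoff: float = 0.95) -> pd.Series:
    # Percentile-rank each classroom on the two continuous measures
    # within its grade x subject x year cell, then require BOTH ranks
    # to clear the cutoff (0.95 reproduces the most stringent rule).
    groups = df.groupby(["grade", "subject", "year"])
    score_rank = groups["score_fluct"].rank(pct=True)    # unusual test score gains
    string_rank = groups["answer_susp"].rank(pct=True)   # suspicious answer strings
    return (score_rank > cutoff) & (string_rank > cutoff)

# df["cheater_95"] = flag_cheaters(df)         # the definition used in Table V
# df["cheater_80"] = flag_cheaters(df, 0.80)   # a looser threshold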

In column (1) the policy changes are restricted to have a constant impact across all classrooms. The introduction of the social promotion and probation policies is positively correlated with the likelihood of classroom cheating, although the point estimates are not statistically significant at conventional levels. Much more interesting results emerge when we interact the policy changes with the previous year's test scores for the classroom. For both probation and social promotion, cheating rates in the lowest performing classrooms prove to be quite responsive to the change in incentives. In column (2) we see that a classroom one standard deviation below the mean increased its likelihood of cheating by 0.43 percentage points in response to the school probation policy, and roughly 0.65 percentage points due to the ending of social promotion. Given a baseline rate of cheating of 1.1 percent, these effects are substantial. The magnitude of these changes is particularly large considering that no elementary school on probation was actually reconstituted during this period, and that the social promotion policy has a direct impact on students but no direct ramifications for teacher pay or rewards. A classroom one standard deviation above the mean does not see any significant change in cheating in response to these two policies. Such classes are very unlikely to be located in schools at risk of being put on probation, and also are likely to have few students at risk of being retained. These findings are robust to

25. Logit and probit models evaluated at the mean yield comparable results, so the estimates from a linear probability model are presented for ease of interpretation.

26. The results are not sensitive to the cheating cutoff used. Note that this measure may include error due to both false positives and negatives. Since the measurement error is in the dependent variable, the primary impact will likely be to simply decrease the precision of our estimates.


TABLE V
OLS ESTIMATES OF THE RELATIONSHIP BETWEEN CHEATING AND CLASSROOM CHARACTERISTICS

                                            Dependent variable = Indicator of classroom cheating
Independent variables                              (1)          (2)          (3)          (4)

Social promotion policy                          0.0011       0.0011       0.0015       0.0023
                                                (0.0013)     (0.0013)     (0.0013)     (0.0009)
School probation policy                          0.0020       0.0019       0.0021       0.0029
                                                (0.0014)     (0.0014)     (0.0014)     (0.0013)
Prior classroom achievement                     -0.0047      -0.0028      -0.0016      -0.0028
                                                (0.0005)     (0.0005)     (0.0007)     (0.0007)
Social promotion × classroom achievement                     -0.0049      -0.0051      -0.0046
                                                             (0.0014)     (0.0014)     (0.0012)
School probation × classroom achievement                     -0.0070      -0.0070      -0.0064
                                                             (0.0013)     (0.0013)     (0.0013)
Mixed grade classroom                           -0.0084      -0.0085      -0.0089      -0.0089
                                                (0.0007)     (0.0007)     (0.0008)     (0.0012)
% of students included in official reporting     0.0252       0.0249       0.0141       0.0131
                                                (0.0031)     (0.0031)     (0.0037)     (0.0037)
Teacher administers exam to own students         0.0067       0.0067       0.0066       0.0061
                                                (0.0015)     (0.0015)     (0.0015)     (0.0011)
Test form offered for the first time (a)        -0.0007      -0.0007      -0.0011         --
                                                (0.0011)     (0.0011)     (0.0010)
Average quality of teachers'
  undergraduate institution                                               -0.0026
                                                                          (0.0007)
Percent of teachers who have worked at
  the school less than 3 years                                            -0.0045
                                                                          (0.0031)
Percent of teachers under 30 years of age                                  0.0156
                                                                          (0.0065)
Percent of students in the school meeting
  national norms in reading last year                                      0.0001
                                                                          (0.0000)
Percent free lunch in school                                               0.0001
                                                                          (0.0000)
Predominantly Black school                                                 0.0068
                                                                          (0.0019)
Predominantly Hispanic school                                             -0.0009
                                                                          (0.0016)
School × Year fixed effects                         No           No           No          Yes
Number of observations                          163,474      163,474      163,474      163,474

The unit of observation is classroom × grade × year × subject, and the sample includes eight years (1993 to 2000), four subjects (reading comprehension and three math sections), and five grades (three to seven). The dependent variable is the cheating indicator derived using the ninety-fifth percentile cutoff. Robust standard errors clustered by school × year are shown in parentheses. Other variables included in the regressions in columns (1) and (2) include a linear time trend, cubic terms for the number of students, a linear grade variable, and fixed effects for subjects. The regression shown in column (3) also includes the following variables: indicators of the percent of students in the classroom who were Black, Hispanic, male, receiving free lunch, old for grade, in a special education program, living in foster care, and living with a nonparental relative; indicators of school size, mobility rate, and attendance rate; and the percent of teachers in the school who had a master's or doctoral degree, lived in Chicago, and were education majors.

(a) Test forms vary only by year, so this variable will drop out of the analysis when school × year fixed effects are included.


adding a number of classroom and school characteristics (column (3)), as well as school × year fixed effects (column (4)).
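For readers who want to map Table V to an estimating equation, the sketch below shows one way to fit the column (2)-style specification. All variable names are hypothetical stand-ins, and the control set is abbreviated relative to the table notes; it is an illustration of the general approach, not the authors' exact code.

import pandas as pd
import statsmodels.formula.api as smf

def fit_policy_lpm(df: pd.DataFrame):
    # Linear probability model of the cheating indicator on the two policy
    # dummies and their interactions with prior classroom achievement,
    # with standard errors clustered on a school-year identifier as in
    # the Table V notes.
    formula = (
        "cheater ~ social_promotion + probation + prior_achievement"
        " + social_promotion:prior_achievement"
        " + probation:prior_achievement"
        " + mixed_grade + pct_reported + own_test_admin + first_time_form"
        " + C(subject)"  # the paper also adds trend, grade, and class-size terms
    )
    return smf.ols(formula, data=df).fit(
        cov_type="cluster", cov_kwds={"groups": df["school_year_id"]}
    )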

Cheating also appears to be responsive to other costs and benefits. Classrooms that tested poorly last year are much more likely to cheat. For example, a classroom with average student prior achievement one classroom standard deviation below the mean is 23 percent more likely to cheat. Classrooms with students in multiple grades are 65 percent less likely to cheat than classrooms where all students are in the same grade. This is consistent with the fact that it is likely more difficult for teachers in such classrooms to cheat, since they must administer two different test forms to students, which will necessarily have different correct answers. Moreover, classes with a higher proportion of students who are included in the official test reporting are more likely to cheat: a 10-percentage-point increase in the proportion of students in a class whose test scores "count" will increase the likelihood of cheating by roughly 20 percent.27

Teachers who administer the exam to their own students are 0.67 percentage points (approximately 50 percent) more likely to cheat.28

A few other notable results emerge. First, there is no statistically significant impact on cheating of reusing a test form that has been administered in a previous year. That finding is of interest because it suggests that teachers taking old exams and teaching the precise questions to students is not an important component of what we are detecting as cheating (although anecdotal evidence suggests this practice exists). Classrooms in schools with lower achievement, higher poverty rates, and more Black students are more likely to cheat. Classrooms in schools with teachers who graduated from more prestigious undergraduate institutions are less likely to cheat, whereas classrooms in schools with younger teachers are more likely to cheat.

27. Consistent with this finding, within cheating classrooms teachers are less likely to alter the answer sheets for students who are excluded from test reporting. Teachers also appear to cheat less frequently for boys and for older students.

28. In an earlier version of the paper [Jacob and Levitt 2002] we examine for which students teachers tend to cheat. We find that teachers are most likely to cheat for students in the second quartile of the national ability distribution (twenty-fifth to fiftieth percentile) in the previous year, consistent with the incentives under the probation policy.


VII. CONCLUSION

This paper develops an algorithm for determining the prevalence of teacher cheating on standardized tests and applies the results to data from the Chicago public schools. Our methods reveal thousands of instances of classroom cheating, representing 4-5 percent of the classrooms each year. Moreover, we find that teacher cheating appears quite responsive to relatively minor changes in incentives.

The obvious benefits of high-stakes tests as a means of providing incentives must therefore be weighed against the possible distortions that these measures induce. Explicit cheating is merely one form of distortion. Indeed, the kind of cheating we focus on is one that could be greatly reduced at a relatively low cost through the implementation of proper safeguards, such as is done by the Educational Testing Service on the SAT or GRE exams.29

Even if this particular form of cheating were eliminated, however, there are a virtually infinite number of dimensions along which strategic school personnel can act to game the current system. For instance, Figlio and Winicki [2001] and Rohlfs [2002] document that schools attempt to increase the caloric intake of students on test days, and disruptive students are especially likely to be suspended from school during official testing periods [Figlio 2002]. It may be prohibitively costly, or even impossible, to completely safeguard the system.

Finally, this paper fits into a small but growing body of research focused on identifying corrupt or illicit behavior on the part of economic actors (see Porter and Zona [1993], Fisman [2001], Di Tella and Schargrodsky [2001], and Duggan and Levitt [2002]). Because individuals engaged in such behavior actively attempt to hide their trails, the intellectual exercise associated with uncovering their misdeeds differs substantially from the typical economic application, in which the researcher starts with a well-defined measure of the outcome variable (e.g., earnings, economic growth, profits) and then attempts to uncover the determinants of these outcomes. In the case of corruption, there is typically no clear outcome variable, making it necessary for the

29. Although even tests administered by the Educational Testing Service are not immune to cheating, as evidenced by recent reports that many of the GRE scores from China are fraudulent [Contreras 2002].


researcher to employ nonstandard approaches in generating such a measure.

APPENDIX: THE CONSTRUCTION OF SUSPICIOUS STRING MEASURES

This appendix describes in greater detail how we construct each of our measures of unexpected or suspicious responses.

The first measure focuses on the most unlikely block of identical answers given on consecutive questions. Using past test scores, future test scores, and background characteristics, we predict the likelihood that each student will give each answer on each question. For each item a student has four choices (A, B, C, or D), only one of which is correct. We estimate a multinomial logit for each item on the exam in order to predict how students will respond to each question. We estimate the following model for each item, using information from other students in that year, grade, and subject:

(1)   \Pr(Y_{isc} = j) = \frac{e^{\beta_j X_s}}{\sum_{j=1}^{J} e^{\beta_j X_s}}

where Y_{isc} indicates the response of student s in class c on item i, the number of possible responses (J) is four, and X_s is a vector that includes measures of prior and future student achievement in math and reading, as well as demographic variables (such as race, gender, and free lunch status) for student s. Thus, a student's predicted probability of choosing a particular response is identified by the likelihood of other students (in the same year, grade, and subject) with similar background characteristics choosing that response.

Notice that by including future as well as prior test scores in the model, we decrease the likelihood that students with unusually good teachers will be identified as cheaters, since these students will likely retain some of the knowledge learned in the base year and thus have higher future test scores. Also note that by estimating the probability of selecting each possible response, rather than simply estimating the probability of choosing the correct response, we take advantage of any additional information that is provided by particular response patterns in a classroom.
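As a concrete illustration of equation (1) and of the predicted probabilities used in equation (2) below, a per-item multinomial logit could be estimated along the following lines. The inputs (a covariate matrix X and a response matrix answers) are hypothetical stand-ins for the student-level data described above.

import numpy as np
from sklearn.linear_model import LogisticRegression

def actual_answer_probs(X: np.ndarray, answers: np.ndarray) -> np.ndarray:
    # X: (n_students, k) covariates (prior/future scores, demographics).
    # answers: (n_students, n_items) chosen responses coded 0-3.
    # Returns p[s, i] = Pr(student s gives the answer it actually gave on item i).
    n_students, n_items = answers.shape
    p = np.empty((n_students, n_items))
    for i in range(n_items):
        y = answers[:, i]
        # With a multiclass target, LogisticRegression fits a multinomial
        # logit, playing the role of equation (1).
        fit = LogisticRegression(max_iter=1000).fit(X, y)
        proba = fit.predict_proba(X)                  # (n_students, n_choices)
        cols = np.searchsorted(fit.classes_, y)       # column of each chosen answer
        p[:, i] = proba[np.arange(n_students), cols]  # equation (2)
    return p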

Using the estimates from this model, we calculate the predicted probability that each student would answer each item in the way that he or she in fact did:

(2)   p_{isc} = \frac{e^{\beta_k X_s}}{\sum_{j=1}^{J} e^{\beta_j X_s}}, \quad k = \text{response actually chosen by student } s \text{ on item } i

This provides us with one measure per student per item. Taking the product over items within student, we calculate the probability that a student would have answered a string of consecutive questions from item m to item n as he or she did:

(3)   p_{sc}^{mn} = \prod_{i=m}^{n} p_{isc}

We then take the product across all students in the classroom who had identical responses in the string. If we define z as a student, S_{zc}^{mn} as the string of responses for student z from item m to item n, and S_{sc}^{mn} as the string for student s, then we can express the product as

(4)   \tilde{p}_{sc}^{mn} = \prod_{s \in \{z : S_{zc}^{mn} = S_{sc}^{mn}\}} p_{sc}^{mn}

Note that if there are n_s students in class c and each student has a unique set of responses to these particular items, then \tilde{p}_{sc}^{mn} collapses to p_{sc}^{mn} for each student, and there will be n_s distinct values within the class. At the other extreme, if all of the students in class c have identical responses, then there is only one distinct value of \tilde{p}_{sc}^{mn}. We repeat this calculation for all possible consecutive strings of length three to seven, that is, for all S^{mn} such that 3 \le n - m \le 7. To create our first indicator of suspicious string patterns, we take the minimum of the predicted block probability for each classroom:

MEASURE 1:   M1_c = \min_{s} (\tilde{p}_{sc}^{mn})

This measure captures the least likely block of identical answers given on consecutive questions in the classroom.
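A sketch of how Measure 1 might be computed for a single classroom from these ingredients, assuming the probability matrix p from the previous sketch. Working in logs is our addition, to avoid numerical underflow on long strings; it does not change which block is the minimum.

import numpy as np
from collections import defaultdict

def measure_1(p: np.ndarray, answers: np.ndarray,
              min_len: int = 3, max_len: int = 7) -> float:
    # p: per-student, per-item probabilities of the answers actually given.
    # answers: the matching (n_students, n_items) response matrix.
    n_students, n_items = answers.shape
    log_p = np.log(p)
    best = np.inf
    for length in range(min_len, max_len + 1):
        for m in range(n_items - length + 1):
            window = slice(m, m + length)
            # log of equation (3): each student's own string probability
            log_p_str = log_p[:, window].sum(axis=1)
            # group students sharing an identical response string on this window
            groups = defaultdict(list)
            for s in range(n_students):
                groups[tuple(answers[s, window])].append(s)
            for members in groups.values():
                # log of equation (4): product across students with the same string
                best = min(best, log_p_str[members].sum())
    return float(np.exp(best))  # Measure 1: the least likely identical block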

The second measure of suspicious answer strings is intended to capture more general patterns of similarity in student responses. To construct this measure, we first calculate the residuals for each of the possible choices a student could have made for each item:

(5)   e_{jisc} = \begin{cases} 0 - \dfrac{e^{\beta_j X_s}}{\sum_{j=1}^{J} e^{\beta_j X_s}} & \text{if } j \ne k \\ 1 - \dfrac{e^{\beta_j X_s}}{\sum_{j=1}^{J} e^{\beta_j X_s}} & \text{if } j = k \end{cases}

where e_{jisc} is the residual for response j on item i by student s in classroom c. We thus have four separate residuals per student per item.

To create a classroom-level measure of the response to item i, we need to combine the information for each student. First, we sum the residuals for each response across students within a classroom:

(6)   e_{jic} = \sum_{s} e_{jisc}

If there is no within-class correlation in the way that students responded to a particular item, this term should be approximately zero. Second, we sum across the four possible responses for each item within classrooms. At the same time, we square each of the component residual measures to accentuate outliers, and divide by the number of students in the class (n_{sc}) to normalize by class size:

(7)   v_{ic} = \frac{\sum_{j} e_{jic}^{2}}{n_{sc}}

The statistic v_{ic} captures something like the variance of student responses on item i within classroom c. Notice that we choose to first sum the residuals of each response across students, and then sum the classroom-level measures for each response, rather than summing across responses within student initially. We do this in order to emphasize the classroom-level tendencies in response patterns.

Our second measure of suspicious strings is simply the classroom average (across items) of this variance term across all test items:

MEASURE 2:   M2_c = \bar{v}_c = \sum_{i} v_{ic} / n_i, where n_i is the number of items on the exam.

Our third measure is simply the variance (as opposed to the mean) in the degree of correlation across questions within a classroom:

MEASURE 3:   M3_c = \sigma_{v_c}^{2} = \sum_{i} (v_{ic} - \bar{v}_c)^{2} / n_i

Our final indicator focuses on the extent to which a student's response pattern was different from those of other students with the same aggregate score that year. Let q_{isc} equal one if student s in classroom c answered item i correctly, and zero otherwise. Let A_s equal the aggregate score of student s on the exam. We then determine what fraction of students at each aggregate score level answered each item correctly. If we let n_{sA} equal the number of students with an aggregate score of A, then this fraction \bar{q}_i^A can be expressed as

(8)   \bar{q}_{i}^{A} = \frac{\sum_{s \in \{z : A_z = A_s\}} q_{isc}}{n_{sA}}

We then calculate a measure of how much the response pattern of student s differed from the response pattern of other students with the same aggregate score. We do so by subtracting a student's answer on item i from the mean response of all students with aggregate score A, squaring these deviations, and then summing across all items on the exam:

(9)   Z_{sc} = \sum_{i} (q_{isc} - \bar{q}_{i}^{A})^{2}

We then subtract out the mean deviation for all students with the same aggregate score, \bar{Z}^{A}, and sum across the students within each classroom to obtain our final indicator:

MEASURE 4:   M4_c = \sum_{s} (Z_{sc} - \bar{Z}^{A})
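Measure 4 can be sketched the same way. Here correct and class_id are hypothetical inputs covering every student in a grade/year, since the benchmark \bar{q}_i^A is computed across classrooms, not within one class.

import numpy as np
import pandas as pd

def measure_4(correct: np.ndarray, class_id: np.ndarray) -> pd.Series:
    # correct: (n_students, n_items) 0/1 matrix for ALL students in a grade/year.
    # class_id: classroom assignment for each student.
    agg = correct.sum(axis=1)                        # aggregate score A_s
    # Equation (8): fraction correct on each item among same-score students
    q_bar = pd.DataFrame(correct).groupby(agg).transform("mean").to_numpy()
    # Equation (9): squared deviations from same-score peers, summed over items
    z = ((correct - q_bar) ** 2).sum(axis=1)
    # Measure 4: classroom sum of each student's excess deviation
    z_bar = pd.Series(z).groupby(agg).transform("mean").to_numpy()
    return pd.Series(z - z_bar).groupby(class_id).sum()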

KENNEDY SCHOOL OF GOVERNMENT, HARVARD UNIVERSITY

UNIVERSITY OF CHICAGO AND AMERICAN BAR FOUNDATION

REFERENCES

Aiken, Linda R., "Detecting, Understanding, and Controlling for Cheating on Tests," Research in Higher Education, XXXII (1991), 725-736.

Angoff, William H., "The Development of Statistical Indices for Detecting Cheaters," Journal of the American Statistical Association, LXIX (1974), 44-49.

Baker, George, "Incentive Contracts and Performance Measurement," Journal of Political Economy, C (1992), 598-614.

Bellezza, Francis S., and Suzanne F. Bellezza, "Detection of Cheating on Multiple-Choice Tests by Using Error Similarity Analysis," Teaching of Psychology, XVI (1989), 151-155.

Carnoy, Martin, and Susanna Loeb, "Does External Accountability Affect Student Outcomes? A Cross-State Analysis," Working Paper, Stanford University School of Education, 2002.

Contreras, Russell, "Inquiry Uncovers Possible Cheating on GRE in Asia," Associated Press, August 7, 2002.

Cullen, Julie Berry, and Randall Reback, "Tinkering Toward Accolades: School Gaming under a Performance Accountability System," Working Paper, University of Michigan, 2002.

Deere, Donald, and Wayne Strayer, "Putting Schools to the Test: School Accountability, Incentives, and Behavior," Working Paper, Texas A&M University, 2002.

Di Tella, Rafael, and Ernesto Schargrodsky, "The Role of Wages and Auditing during a Crackdown on Corruption in the City of Buenos Aires," unpublished manuscript, Harvard Business School, 2001.

Duggan, Mark, and Steven Levitt, "Winning Isn't Everything: Corruption in Sumo Wrestling," American Economic Review, XCII (2002), 1594-1605.

Figlio, David N., "Testing, Crime and Punishment," Working Paper, University of Florida, 2002.

Figlio, David N., and Joshua Winicki, "Food for Thought: The Effects of School Accountability Plans on School Nutrition," Working Paper, University of Florida, 2001.

Figlio, David N., and Lawrence S. Getzler, "Accountability, Ability and Disability: Gaming the System," Working Paper, University of Florida, 2002.

Fisman, Ray, "Estimating the Value of Political Connections," American Economic Review, XCI (2001), 1095-1102.

Frary, Robert B., "Statistical Detection of Multiple-Choice Answer Copying: Review and Commentary," Applied Measurement in Education, VI (1993), 153-165.

Frary, Robert B., T. Nicholas Tideman, and Thomas Watts, "Indices of Cheating on Multiple-Choice Tests," Journal of Educational Statistics, II (1977), 235-256.

Glaeser, Edward, and Andrei Shleifer, "A Reason for Quantity Regulation," American Economic Review, XCI (2001), 431-435.

Grissmer, David W., Ann Flanagan, et al., "Improving Student Achievement: What NAEP Test Scores Tell Us," RAND Corporation, MR-924-EDU, 2000.

Hanushek, Eric A., and Margaret E. Raymond, "Improving Educational Quality: How Best to Evaluate Our Schools?" Conference paper, Federal Reserve Bank of Boston, 2002.

Hofkins, Diane, "Cheating 'Rife' in National Tests," New York Times Educational Supplement, June 16, 1995.

Holland, Paul W., "Assessing Unusual Agreement between the Incorrect Answers of Two Examinees Using the K-Index: Statistical Theory and Empirical Support," Technical Report No. 96-5, Educational Testing Service, 1996.

Holmstrom, Bengt, and Paul Milgrom, "Multitask Principal-Agent Analyses: Incentive Contracts, Asset Ownership, and Job Design," Journal of Law, Economics, and Organization, VII (1991), 24-51.

Jacob, Brian A., "Accountability, Incentives and Behavior: The Impact of High-Stakes Testing in the Chicago Public Schools," Working Paper No. 8968, National Bureau of Economic Research, 2002.

Jacob, Brian A., and Lars Lefgren, "The Impact of Teacher Training on Student Achievement: Quasi-Experimental Evidence from School Reform Efforts in Chicago," Journal of Human Resources, forthcoming.

Jacob, Brian A., and Steven Levitt, "Catching Cheating Teachers: The Results of an Unusual Experiment in Implementing Theory," Working Paper No. 9413, National Bureau of Economic Research, 2003.

Klein, Stephen P., Laura S. Hamilton, et al., "What Do Test Scores in Texas Tell Us?" RAND Corporation, 2000.

Kolker, Claudia, "Texas Offers Hard Lessons on School Accountability," Los Angeles Times, April 14, 1999.

Lindsay, Drew, "Whodunit? Officials Find Thousands of Erasures on Standardized Tests and Suspect Tampering," Education Week (1996), 25-29.

Loughran, Regina, and Thomas Comiskey, "Cheating the Children: Educator Misconduct on Standardized Tests," Report of the City of New York Special Commissioner of Investigation for the New York City School District, 1999.

Marcus, John, "Faking the Grade," Boston Magazine, February 2000.

May, Meredith, "State Fears Cheating by Teachers," San Francisco Chronicle, October 4, 2000.

Perlman, Carol L., "Results of a Citywide Testing Program Audit in Chicago," Paper for American Educational Research Association annual meeting, Chicago, IL, 1985.

Porter, Robert H., and J. Douglas Zona, "Detection of Bid Rigging in Procurement Auctions," Journal of Political Economy, CI (1993), 518-538.

Richards, Craig E., and Tian Ming Sheu, "The South Carolina School Incentive Reward Program: A Policy Analysis," Economics of Education Review, XI (1992), 71-86.

Rohlfs, Chris, "Estimating Classroom Learning Using the School Breakfast Program as an Instrument for Attendance and Class Size: Evidence from Chicago Public Schools," unpublished manuscript, University of Chicago, 2002.

Tysome, Tony, "Cheating Purge: Inspectors Out," New York Times Higher Education Supplement, August 19, 1994.

Wollack, James A., "A Nominal Response Model Approach for Detecting Answer Copying," Applied Psychological Measurement, XX (1997), 144-152.

Teacher administers exam to ownstudents

00067(00015)

00067(00015)

00066(00015)

00061(00011)

Test form offered for the rst timea 200007(00011)

200007(00011)

200011(00010)

mdash

Average quality of teachersrsquoundergraduate institution

200026(00007)

Percent of teachers who have worked atthe school less than 3 years

200045(00031)

Percent of teachers under 30 years of age 00156(00065)

Percent of students in the school meetingnational norms in reading last year

00001(00000)

Percent free lunch in school 00001(00000)

Predominantly Black school 00068(00019)

Predominantly Hispanic school 200009(00016)

SchoolYear xed effects No No No YesNumber of observations 163474 163474 163474 163474

The unit of observation is classroomgradeyearsubject and the sample includes eight years (1993 to2000) four subjects (reading comprehension and three math sections) and ve grades (three to seven) Thedependentvariable is the cheating indicator derived using the ninety-fth percentile cutoff Robust standarderrors clustered by schoolyear are shown in parentheses Other variables included in the regressions incolumn (1) and (2) include a linear time trend grade cubic terms for the number of students a linear gradevariable and xed effects for subjects The regression shown in column (3) also includes the followingvariables indicators of the percentofstudents in theclassroom who wereBlack Hispanic male receivingfree lunchold for grade in a special education program living in foster care and living with a nonparental relative indicators ofschool size mobility rate and attendance rate and indicators of the percent of teachers in the school Other variablesinclude thepercentof teachersin the schoolwho hada masterrsquos ordoctoraldegreelivedin Chicagoand wereeducationmajors

a Test forms vary only by year so this variable will drop out of the analysis when schoolyear xed effectsare included

869ROTTEN APPLES

adding a number of classroom and school characteristics (column(3)) as well as schoolyear xed effects (column (4))

Cheating also appears to be responsive to other costs andbenets Classrooms that tested poorly last year are much morelikely to cheat For example a classroom with average studentprior achievement one classroom standard deviation below themean is 23 percent more likely to cheat Classrooms with stu-dents in multiple grades are 65 percent less likely to cheat thanclassrooms where all students are in the same grade This isconsistent with the fact that it is likely more difcult for teachersin such classrooms to cheat since they must administer twodifferent test forms to students which will necessarily have dif-ferent correct answers Moreover classes with a higher propor-tion of students who are included in the ofcial test reporting aremore likely to cheatmdasha 10 percentage point increase in the pro-portion of students in a class whose test scores ldquocountrdquo willincrease the likelihood of cheating by roughly 20 percent27

Teachers who administer the exam to their own students are 067percentage pointsmdashapproximately 50 percentmdashmore likely tocheat28

A few other notable results emerge First there is no statis-tically signicant impact on cheating of reusing a test form thathas been administered in a previous year That nding is ofinterest because it suggests that teachers taking old exams andteaching the precise questions to students is not an importantcomponent of what we are detecting as cheating (although anec-dotal evidence suggests this practice exists) Classrooms inschools with lower achievement higher poverty rates and moreBlack students are more likely to cheat Classrooms in schoolswith teachers who graduated from more prestigious undergrad-uate institutions are less likely to cheat whereas classrooms inschools with younger teachers are more likely to cheat

27 Consistent with this nding within cheating classrooms teachers areless likely to alter the answer sheets for students who are excluded from testreporting Teachers also appear to less frequently cheat for boys and for olderstudents

28 In an earlier version of the paper [Jacob and Levitt 2002] we examine forwhich students teachers tend to cheat We nd that teachers are most likely tocheat for students in the second quartile of the national ability distribution(twenty-fth to ftieth percentile) in the previous year consistent with the incen-tives under the probation policy

870 QUARTERLY JOURNAL OF ECONOMICS

VII CONCLUSION

This paper develops an algorithm for determining the preva-lence of teacher cheating on standardized tests and applies theresults to data from the Chicago public schools Our methodsreveal thousands of instances of classroom cheating representing4ndash5 percent of the classrooms each year Moreover we nd thatteacher cheating appears quite responsive to relatively minorchanges in incentives

The obvious benets of high-stakes tests as a means of pro-viding incentives must therefore be weighed against the possibledistortions that these measures induce Explicit cheating ismerely one form of distortion Indeed the kind of cheating wefocus on is one that could be greatly reduced at a relatively lowcost through the implementation of proper safeguards such as isdone by Educational Testing Service on the SAT or GRE exams29

Even if this particular form of cheating were eliminatedhowever there are a virtually innite number of dimensionsalong which strategic school personnel can act to game the cur-rent system For instance Figlio and Winicki [2001] and Rohlfs[2002] document that schools attempt to increase the caloricintake of students on test days and disruptive students areespecially likely to be suspended from school during ofcial test-ing periods [Figlio 2002] It may be prohibitively costly or evenimpossible to completely safeguard the system

Finally this paper ts into a small but growing body ofresearch focused on identifying corrupt or illicit behavior on thepart of economic actors (see Porter and Zona [1993] Fisman[2001] Di Tella and Schargrodsky [2001] and Duggan and Levitt[2002]) Because individuals engaged in such behavior activelyattempt to hide their trails the intellectual exercise associatedwith uncovering their misdeeds differs substantially from thetypical economic application in which the researcher starts witha well-dened measure of the outcome variable (eg earningseconomic growth prots) and then attempts to uncover the de-terminants of these outcomes In the case of corruption there istypically no clear outcome variable making it necessary for the

29 Although even tests administered by the Educational Testing Service arenot immune to cheating as evidenced by recent reports that many of the GREscores from China are fraudulent [Contreras 2002]

871ROTTEN APPLES

researcher to employ nonstandard approaches in generating sucha measure

APPENDIX THE CONSTRUCTION OF SUSPICIOUS STRING MEASURES

This appendix describes in greater detail how we constructeach of our measures of unexpected or suspicious responses

The rst measure focuses on the most unlikely block of iden-tical answers given on consecutive questions Using past testscores future test scores and background characteristics wepredict the likelihood that each student will give each answer oneach question For each item a student has four choices (A B Cor D) only one of which is correct We estimate a multinomiallogit for each item on the exam in order to predict how studentswill respond to each question We estimate the following modelfor each item using information from other students in that yeargrade and subject

(1) Pr~Yisc 5 j 5eb jxs

Sj51J eb jxs

where Y isc indicates the response of student s in class c on itemi the number of possible responses ( J) is four and Xs is a vectorthat includes measures of prior and future student achievementin math and reading as well as demographic variables (such asrace gender and free lunch status) for student s Thus a stu-dentrsquos predicted probability of choosing a particular response isidentied by the likelihood of other students (in the same yeargrade and subject) with similar background characteristicschoosing that response

Notice that by including future as well as prior test scores inthe model we decrease the likelihood that students with unusu-ally good teachers will be identied as cheaters since these stu-dents will likely retain some of the knowledge learned in the baseyear and thus have higher future test scores Also note that byestimating the probability of selecting each possible responserather than simply estimating the probability of choosing thecorrect response we take advantage of any additional informa-tion that is provided by particular response patterns in aclassroom

Using the estimates from this model we calculate the pre-dicted probability that each student would answer each item in

872 QUARTERLY JOURNAL OF ECONOMICS

the way that he or she in fact did

(2) pisc 5ebkxs

S j51J eb j xs

for k

5 response actually chosen by student s on item i

This provides us with one measure per student per item Takingthe product over items within student we calculate the probabil-ity that a student would have answered a string of consecutivequestions from item m to item n as he or she did

(3) pscmn 5 P

i5m

n

pisc

We then take the product across all students in the classroomwho had identical responses in the string If we dene z as astudent Szc

m n as the string of responses for student z from item mto item n and S sc

m n and as the string for student s then we canexpress the product as

(4) pscmn 5 P

s[$ zS icmn5 S sc

mn

pscmn

Note that if there are ns students in class c and each student hasa unique set of responses to these particular items then psc

m n

collapses to pscm n for each student and there will be ns distinct

values within the class On the other extreme if all of the stu-dents in class c have identical responses then there is only onedistinct value of psc

m n We repeat this calculation for all possibleconsecutive strings of length three to seven that is for all Sm n

such that 3 m 2 n 7 To create our rst indicator ofsuspicious string patterns we take the minimum of the predictedblock probability for each classroom

MEASURE 1 M1c 5 mins( ps cm n)

This measure captures the least likely block of identical answersgiven on consecutive questions in the classroom

The second measure of suspicious answer strings is intendedto capture more general patterns of similarity in student re-sponses To construct this measure we rst calculate the resid-

873ROTTEN APPLES

uals for each of the possible choices a student could have made foreach item

(5) e jisc 5 0 2eb jxs

S j51J eb j xs

if j THORN k

5 1 2eb j xs

S j51J eb jxs

if j 5 k

where e jis c is the residual for response j on item i by student s inclassroom c We thus have four separate residuals per studentper item

To create a classroom level measure of the response to item iwe need to combine the information for each student First wesum the residuals for each response across students within aclassroom

(6) ejic 5 Os

e jisc

If there is no within-class correlation in the way that studentsresponded to a particular item this term should be approximatelyzero Second we sum across the four possible responses for eachitem within classrooms At the same time we square each of thecomponent residual measures to accentuate outliers and divideby number of students in the class (nsc) to normalize by class size

(7) v ic 5S j ejic

2

nsc

The statistic vic captures something like the variance of studentresponses on item i within classroom c Notice that we choose torst sum across the residuals of each response across studentsand then sum the classroom level measures for each responserather than summing across responses within student initiallyWe do this in order to emphasize the classroom level tendencies inresponse patterns

Our second measure of suspicious strings is simply the class-room average (across items) of this variance term across all testitems

MEASURE 2 M2c 5 v c 5 yeni vic ni where ni is the number of itemson the exam

Our third measure is simply the variance (as opposed to the

874 QUARTERLY JOURNAL OF ECONOMICS

mean) in the degree of correlation across questions within aclassroom

MEASURE 3 M3c 5 sv c

2 5 yeni (vic 2 v c)2 ni

Our nal indicator focuses on the extent to which a studentrsquosresponse pattern was different from other studentrsquos with thesame aggregate score that year Let qisc equal one if student s inclassroom c answered item i correctly and zero otherwise Let As

equal the aggregate score of student s on the exam We thendetermine what fraction of students at each aggregate score levelanswered each item correctly If we let nsA equal number ofstudents with an aggregate score of A then this fraction q i

A canbe expressed as

(8) q iA 5

Ss[$ z Az5As qisc

nsA

We then calculate a measure of how much the response pattern ofstudent s differed from the response pattern of other studentswith the same aggregate score We do so by subtracting a stu-dentrsquos answer on item i from the mean response of all studentswith aggregate score A squaring these deviations and then sum-ming across all items on the exam

(9) Zsc 5 Oi

~qisc 2 q iA2

We then subtract out the mean deviation for all students with thesame aggregate score Z A and sum the students within eachclassroom to obtain our nal indicator

MEASURE 4 M4c 5 yens (Zs c 2 Z A)

KENNEDY SCHOOL OF GOVERNMENT HARVARD UNIVERSITY

UNIVERSITY OF CHICAGO AND AMERICAN BAR FOUNDATION

REFERENCES

Aiken Linda R ldquoDetecting Understanding and Controlling for Cheating onTestsrdquo Research in Higher Education XXXII (1991) 725ndash736

Angoff William H ldquoThe Development of Statistical Indices for Detecting Cheat-ersrdquo Journal of the American Statistical Association LXIX (1974) 44ndash49

Baker George ldquoIncentive Contracts and Performance Measurementrdquo Journal ofPolitical Economy C (1992) 598ndash614

Bellezza Francis S and Suzanne F Bellezza ldquoDetection of Cheating on Multiple-Choice Tests by Using Error Similarity Analysisrdquo Teaching of PsychologyXVI (1989) 151ndash155

Carnoy Martin and Susanna Loeb ldquoDoes External Accountability Affect Student

875ROTTEN APPLES

Outcomes A Cross-State Analysisrdquo Working Paper Stanford UniversitySchool of Education 2002

Contreras Russell ldquoInquiry Uncovers Possible Cheating on GRE in Asiardquo Asso-ciated Press August 7 2002

Cullen Julie Berry and Randall Reback ldquoTinkering Toward Accolades SchoolGaming under a Performance Accountability Systemrdquo Working paper Uni-versity of Michigan 2002

Deere Donald and Wayne Strayer ldquoPutting Schools to the Test School Account-ability Incentives and Behaviorrdquo Working Paper Texas AampM University2002

DiTella Rafael and Ernesto Schargrodsky ldquoThe Role of Wages and Auditingduring a Crackdown on Corruption in the City of Buenos Airesrdquo unpublishedmanuscript Harvard Business School 2001

Duggan Mark and Steven Levitt ldquoWinning Isnrsquot Everything Corruption in SumoWrestlingrdquo American Economic Review XCII (2002) 1594ndash1605

Figlio David N ldquoTesting Crime and Punishmentrdquo Working Paper University ofFlorida 2002

Figlio David N and Joshua Winicki ldquoFood for Thought The Effects of SchoolAccountability Plans on School Nutritionrdquo Working Paper University ofFlorida 2001

Figlio David N and Lawrence S Getzler ldquoAccountability Ability and DisabilityGaming the Systemrdquo Working Paper University of Florida 2002

Fisman Ray ldquoEstimating the Value of Political Connectionsrdquo American EconomicReview XCI (2001) 1095ndash1102

Frary Robert B ldquoStatistical Detection of Multiple-Choice Answer Copying Re-view and Commentaryrdquo Applied Measurement in Education VI (1993)153ndash165

Frary Robert B T Nicholas Tideman and Thomas Watts ldquoIndices of Cheatingon Multiple-Choice Testsrdquo Journal of Educational Statistics II (1977)235ndash256

Glaeser Edward and Andrei Shleifer ldquoA Reason for Quantity Regulationrdquo Amer-ican Economic Review XCI (2001) 431ndash435

Grissmer David W Ann Flanagan et al ldquoImproving Student AchievementWhat NAEP Test Scores Tell Usrdquo RAND Corporation MR-924-EDU 2000

Hanushek Eric A and Margaret E Raymond ldquoImproving Educational QualityHow Best to Evaluate Our Schoolsrdquo Conference paper Federal Reserve Bankof Boston 2002

Hofkins Diane ldquoCheating lsquoRifersquo in National Testsrdquo New York Times EducationalSupplement June 16 1995

Holland Paul W ldquoAssessing Unusual Agreement between the Incorrect Answersof Two Examinees Using the K-Index Statistical Theory and EmpiricalSupportrdquo Technical Report No 96-5 Educational Testing Service 1996

Holmstrom Bengt and Paul Milgrom ldquoMultitask Principal-Agent Analyses In-centive Contracts Asset Ownership and Job Designrdquo Journal of Law Eco-nomics and Organization VII (1991) 24ndash51

Jacob Brian A ldquoAccountability Incentives and Behavior The Impact of High-Stakes Testing in the Chicago Public Schoolsrdquo Working Paper No 8968National Bureau of Economic Research 2002

Jacob Brian A and Lars Lefgren ldquoThe Impact of Teacher Training on StudentAchievement Quasi-Experimental Evidence from School Reform Efforts inChicagordquo Journal of Human Resources forthcoming

Jacob Brian A and Steven Levitt ldquoCatching Cheating Teachers The Results ofan Unusual Experiment in Implementing Theoryrdquo Working Paper No 9413National Bureau of Economic Research 2003

Klein Stephen P Laura S Hamilton et al ldquoWhat Do Test Scores in Texas TellUsrdquo RAND Corporation 2000

Kolker Claudia ldquoTexas Offers Hard Lessons on School Accountabilityrdquo LosAngeles Times April 14 1999

Lindsay Drew ldquoWhodunit Ofcials Find Thousands of Erasures on Standard-ized Tests and Suspect Tamperingrdquo Education Week (1996) 25ndash29

Loughran Regina and Thomas Comiskey ldquoCheating the Children Educator

876 QUARTERLY JOURNAL OF ECONOMICS

Misconduct on Standardized Testsrdquo Report of the City of New York SpecialCommissioner of Investigation for the New York City School District 1999

Marcus John ldquoFaking the Graderdquo Boston Magazine February 2000May Meredith ldquoState Fears Cheating by Teachersrdquo San Francisco Chronicle

October 4 2000Perlman Carol L ldquoResults of a Citywide Testing Program Audit in Chicagordquo

Paper for American Educational Research Association annual meeting Chi-cago IL 1985

Porter Robert H and J Douglas Zona ldquoDetection of Bid Rigging in ProcurementAuctionsrdquo Journal of Political Economy CI (1993) 518ndash538

Richards Craig E and Tian Ming Sheu ldquoThe South Carolina School IncentiveReward Program A Policy Analysisrdquo Economics of Education Review XI(1992) 71ndash86

Rohlfs Chris ldquoEstimating Classroom Learning Using the School Breakfast Pro-gram as an Instrument for Attendance and Class Size Evidence from ChicagoPublic Schoolsrdquo Unpublished manuscript University of Chicago 2002

Tysome Tony ldquoCheating Purge Inspectors outrdquo New York Times Higher Educa-tion Supplement August 19 1994

Wollack James A ldquoA Nominal Response Model Approach for Detecting AnswerCopyingrdquo Applied Psychological Measurement XX (1997) 144ndash152

877ROTTEN APPLES

Page 16: rotten apples: an investigation of the prevalence and predictors of ...

section of the exam consists of 30 to 60 multiple choice questionswhich students are given between 30 and 75 minutes to com-plete17 Students mark their responses on answer sheets whichare scanned to determine a studentrsquos score There is no penaltyfor guessing A studentrsquos raw score is simply calculated as thesum of correct responses on the exam The raw score is thentranslated into a grade equivalent

After collecting student exams teachers or administratorsthen ldquocleanrdquo the answer keys erasing stray pencil marks remov-ing dirt or debris from the form and darkening item responsesthat were only faintly marked by the student At the end of thetesting week the test coordinators at each school deliver thecompleted answer keys and exams to the CPS central ofceSchool personnel are not supposed to keep copies of the actualexams although school ofcials acknowledge that a number ofteachers each year do so The CPS has administered three differ-ent versions of the ITBS between 1993 and 2000 The CPS alter-nates forms each year with new forms being offered for the rsttime in 1993 1994 and 199718

V THE PREVALENCE OF TEACHER CHEATING

As noted earlier our estimates of the prevalence of cheatingare derived from a comparison of the actual number of classroomsthat are above some threshold on both of our cheating indicatorsrelative to the number we would expect to observe based on thecorrelation between these two measures in the 50thndash75th percen-tiles of the suspicious string measure The top panel of Table IIpresents our estimates of the percentage of classrooms that arecheating on average on a given subject test (ie reading compre-hension or one of the three math tests) in a given year19 Because

17 The mathematics and reading tests measure basic skills The readingcomprehension exam consists of three to eight short passages followed by up tonine questions relating to the passage The math exam consists of three sectionsthat assess number concepts problem-solving and computation

18 These three forms are used for retesting summer school testing andmidyear testing as well so that it is likely that over the years teachers have seenthe same exam on a number of occasions

19 It is important to note that we exclude a particular form of cheating thatappears to be quite prevalent in the data teachers randomly lling in answers leftblank by students at the end of the exam In many classrooms almost everystudent will end the test with a long string of the identical answers (typically allthe same response like ldquoBrdquo or ldquoDrdquo) The fact that almost all students in the classcoordinate on the same pattern strongly suggests that the students themselvesdid not ll in the blanks or were under explicit instructions by the teacher to do

858 QUARTERLY JOURNAL OF ECONOMICS

the decision as to what cutoff signies a ldquohighrdquo value on ourcheating indicators is arbitrary we present a 3 3 3 matrix ofestimates using three different thresholds (eightieth ninetiethand ninety-fth percentiles) for each of our cheating measuresThe estimated prevalence of cheaters ranges from 11 percent to21 percent depending on the particular set of cutoffs used Aswould be expected the number of cheaters is generally decliningas higher thresholds are employed Nonetheless it is encouragingthat over such a wide set of cutoffs the range of estimates isrelatively tight

The bottom panel of Table II presents estimates of the per-centage of classrooms that are cheating on any of the four subjecttests in a particular year If every classroom that cheated did soonly on one subject test then the results in the bottom panel

so Since there is no penalty for guessing on the test lling in the blanks can onlyincrease student test scores While this type of teacher behavior is likely to beviewed by many as unethical we do not make it the focus of our analysis because(1) it is difcult to provide denitive evidence of such behavior (a teacher couldargue that he or she instructed students well in advance of the test to ll in allblanks with the letter ldquoCrdquo as part of good test-taking strategy) and (2) in ourminds it is categorically different than a teacher who systematically changesstudent responses to the correct answer

TABLE IIESTIMATED PREVALENCE OF TEACHER CHEATING

Cutoff for suspicious answerstrings (ANSWERS)

Cutoff for test score uctuations (SCORE)

80th percentile 90th percentile 95th percentile

Percent cheating on a particular test80th percentile 21 21 1890th percentile 18 18 1595th percentile 13 13 11

Percent cheating on at least one of the four tests given80th percentile 45 56 5390th percentile 42 49 4495th percentile 35 38 34

The top panel of the table presents estimates of the percentage of classrooms cheating on a particularsubject test in a given year based on three alternative cutoffs for ANSWERS and SCORE In all cases theprevalence of cheating is based on the excess number of classrooms with unexpected test score uctuationamong classes with suspicious answer strings relative to classes that do not have suspicious answer stringsThe bottom panel of the table presents estimates of the percentage of classrooms cheating on at least one ofthe four subject tests that comprise the overall test In the bottom panel classrooms that cheat on more thanone subject test are counted only once Our sample includes over 35000 3rdndash7th grade classrooms in theChicago public schools for the years 1993ndash1999

859ROTTEN APPLES

would simply be four times the results in the top panel In manyinstances however classrooms appear to cheat on multiple sub-jects Thus the prevalence rates range from 34ndash56 percent of allclassrooms20 Because of the necessarily indirect nature of ouridentication strategy we explore a range of supplementary testsdesigned to assess the validity of the estimates21

VA Simulation Results

Our prevalence estimates may be biased downward or up-ward depending on the extent to which our algorithm fails tosuccessfully identify all instances of cheating (ldquofalse negativesrdquo)versus the frequency with which we wrongly classify a teacher ascheating (ldquofalse positiverdquo)

One simple simulation exercise we undertook with respect topossible false positives involves randomly assigning students tohypothetical classrooms These synthetic classrooms thus consistof students who in actuality had no connection to one another Wethen analyze these hypothetical classrooms using the same algo-rithm applied to the actual data As one would hope no evidenceof cheating was found in the simulated classes Indeed the esti-mated prevalence of cheating was slightly negative in this simu-lation (ie classrooms with large test score increases in thecurrent year followed by big declines the next year were slightlyless likely to have unusual patterns of answer strings) Thusthere does not appear to be anything mechanical in our algorithmthat generates evidence of cheating

A second simulation involves articially altering a class-

20 Computation of the overall prevalence is somewhat complicated becauseit involves calculating not only how many classrooms are actually above thethresholds on multiple subject tests but also how frequently this would occur inthe absence of cheating Details on these calculations are available from theauthors

21 In an earlier version of this paper [Jacob and Levitt 2003] we present anumber of additional sensitivity analyses that conrm the basic results First wetest whether our results are sensitive to the way in which we predict the preva-lence of high values of ANSWERS and SCORE in the upper part of the otherdistribution (eg tting a linear or quadratic model to the data in the lowerportion of the distribution versus simply using the average value from the 50thndash75th percentiles) We nd our results are extremely robust Second we examinewhether our results might simply be due to mean reversion by testing whetherclassrooms with suspicious answer strings are less likely to maintain large testscore gains Among a sample of classrooms with large test score gains we nd thatmean reversion is substantially greater among the set of classes with highlysuspicious answer strings Third we investigate whether we might be detectingstudent rather than teacher cheating by examining whether students who havesuspicious answer strings in one year are more likely to have suspicious answerstrings in other years We nd that they do not

860 QUARTERLY JOURNAL OF ECONOMICS

roomrsquos answer strings in ways that we believe mimic teachercheating or outstanding teaching22 We are then able to see howfrequently we label these altered classes as ldquocheatingrdquo and howthis varies with the extent and nature of the changes we make toa classroomrsquos answers We simulate two different types of teachercheating The rst is a very naive version in which a teacherstarts cheating at the same question for a number of students andchanges consecutive questions to the right answers for thesestudents creating a block of identical and correct responses Thesecond type of cheating is much more sophisticated we randomlychange answers from incorrect to correct for selected students ina class The outcome of this simulation reveals that our methodsare surprisingly weak in detecting moderate cheating even in thenave case For instance when three consecutive answers arechanged to correct for 25 percent of a class we catch such behav-ior only 4 percent of the time Even when six questions arechanged for half of the class we catch the cheating in less than 60percent of the cases Only when the cheating is very extreme (egsix answers changed for the entire class) are we very likely tocatch the cheating The explanation for this poor performance isthat classrooms with low test-score gains even with the boostprovided by this cheating do not have large enough test scoreuctuations to make them appear unusual because there is somuch inherent volatility in test scores When the cheating takesthe more sophisticated form described above we are even lesssuccessful at catching low and moderate cheaters (2 percent and34 percent respectively in the rst two scenarios describedabove) but we almost always detect very extreme cheating

From a political or equity perspective however we may beeven more concerned with false positives specically the risk ofaccusing highly effective teachers of cheating Hence we alsosimulate the impact of a good teacher changing certain answersfor certain students from incorrect to correct responses Our ma-nipulation in this case differs from the cheating simulationsabove in two ways (1) we allow some of the gains to be preservedin the following year and (2) we alter questions such that stu-dents will not get random answers correct but rather will tend toshow the greatest improvement on the easiest questions that thestudents were getting wrong These simulation results suggest

22 See Jacob and Levitt [2003] for the precise details of this simulationexercise

861ROTTEN APPLES

that only rarely are good teachers mistakenly labeled as cheatersFor instance a teacher whose prowess allows him or her to raisetest scores by six questions for half the students in the class (morethan a 05 grade equivalent increase on average for the class) islabeled a cheater only 2 percent of the time In contrast a navecheater with the same test score gains is correctly identied as acheater in more than 50 percent of cases

In another test for false positives we explored whether wemight be mistaking emphasis on certain subject material forcheating For example if a math teacher spends several monthson fractions with a particular class one would expect the class todo particularly well on all of the math questions relating tofractions and perhaps worse than average on other math ques-tions One might imagine a similar scenario in reading if forexample a teacher creates a lengthy unit on the UndergroundRailroad which later happens to appear in a passage on thereading comprehension exam To examine this we analyze thenature of the across-question correlations in cheating versusnoncheating classrooms We nd that the most highly correlatedquestions in cheating classrooms are no more likely to measurethe same skill (eg fractions geometry) than in noncheatingclassrooms The implication of these results is that the prevalenceestimates presented above are likely to substantially understatethe true extent of cheatingmdashthere is little evidence of false posi-tivesmdashbut we frequently miss moderate cases of cheating

VB The Correlation of Cheating Indicators across SubjectsClassrooms and Years

If what we are detecting is truly cheating then one wouldexpect that a teacher who cheats on one part of the test would bemore likely to cheat on other parts of the test Also a teacher whocheats one year would be more likely to cheat the following yearFinally to the extent that cheating is either condoned by theprincipal or carried out by the test coordinator one would expectto nd multiple classes in a school cheating in any given year andperhaps even that cheating in a school one year predicts cheatingin future years If what we are detecting is not cheating then onewould not necessarily expect to nd strong correlation in ourcheating indicators across exams for a specic classroom acrossclassrooms or across years

862 QUARTERLY JOURNAL OF ECONOMICS

Table III reports regression results testing these predictionsThe dependent variable is an indicator for whether we believe aclassroom is likely to be cheating on a particular subject testusing our most stringent denition (above the ninety-fth per-centile on both cheating indicators) The baseline probability of

TABLE IIIPATTERNS OF CHEATING WITHIN CLASSROOMS AND SCHOOLS

Independent variables

Dependent variable 5 Class suspectedof cheating (mean of the dependent

variable 5 0011)

Full sample

Sample ofclasses andschool that

existed in theprior year

Classroom cheated on exactly oneother subject this year

0105(0008)

0103(0008)

0101(0009)

0101(0009)

Classroom cheated on exactly twoother subjects this year

0289(0027)

0285(0027)

0243(0031)

0243(0031)

Classroom cheated on all three othersubjects this year

0627(0051)

0622(0051)

0595(0054)

0595(0054)

Cheating rate among all other classesin the school this year on thissubject

0166(0030)

0134(0027)

0129(0027)

Cheating rate among all other classesin the school this year on othersubjects

0023(0024)

0059(0026)

0045(0029)

Cheating in this classroom in thissubject last year

0096(0012)

0091(0012)

Number of other subjects thisclassroom cheated on last year

0023(0004)

0018(0004)

Cheating in this classroom ever in thepast

0006(0002)

Cheating rate among other classroomsin this school in past years

0090(0040)

Fixed effects for gradesubjectyear Yes Yes Yes YesR2 0090 0093 0109 0109Number of observations 165578 165578 94182 94170

The dependent variable is an indicator for whether a classroom is above the 95th percentile on both oursuspicious strings and unusual test score measures of cheating on a particular subject test Estimation isdone using a linear probability model The unit of observation is classroomgradeyearsubjectFor columnsthat include measures of cheating in prior years observations where that classroom or school does not appearin the data in the prior year are excluded Standard errors are clustered at the school level to take intoaccount correlations across classroom as well as serial correlation

863ROTTEN APPLES

qualifying as a cheater for this cutoff is 11 percent To fullyappreciate the enormity of the effects implied by the table it isimportant to keep this very low baseline in mind We reportestimates from linear probability models (probits yield similarmarginal effects) with standard errors clustered at the schoollevel

Column 1 of Table III shows that cheating on other tests inthe same year is an extremely powerful predictor of cheating in adifferent subject If a classroom cheats on exactly one other sub-ject test the predicted probability of cheating on this test in-creases by over ten percentage points Since the baseline cheatingrates are only 11 percent classrooms cheating on exactly oneother test are ten times more likely to have cheated on thissubject than are classrooms that did not cheat on any of the othersubjects (which is the omitted category) Classrooms that cheaton two other subjects are almost 30 times more likely to cheat onthis test relative to those not cheating on other tests If a classcheats on all three other subjects it is 50 times more likely to alsocheat on this test

There also is evidence of correlation in cheating withinschools A ten-percentage-point increase in cheating classroomsin a school (excluding the classroom in question) on the samesubject test raises the likelihood this class cheats by roughly 016percentage points This potentially suggests some role for cen-tralized cheating by a school counselor test coordinator or theprincipal rather than by teachers operating independentlyThere is little evidence that cheating rates within the school onother subject tests affect cheating on this test

When making comparisons across years (columns 3 and 4) itis important to note that we do not actually have teacher identi-ers We do however know what classroom a student is assignedto Thus we can only compare the correlation between past andcurrent cheating in a given classroom To the extent that teacherturnover occurs or teachers switch classrooms this proxy will becontaminated Even given this important limitation cheating inthe classroom last year predicts cheating this year In column 3for example we see that classrooms that cheated in the samesubject last year are 96 percentage points more likely to cheatthis year even after we control for cheating on other subjects inthe same year and cheating in other classes in the school Column4 shows that prior cheating in the school strongly predicts thelikelihood that a classroom will cheat this year

864 QUARTERLY JOURNAL OF ECONOMICS

VC Evidence from Retesting under Controlled Circumstances

Perhaps the most compelling evidence for our methods comesfrom the results of a unique policy intervention In spring 2002the Chicago public schools provided us the opportunity to retestover 100 classrooms under controlled circumstances that pre-cluded cheating a few weeks after the initial ITBS exam wasadministered23 Based on suspicious answer strings and unusualtest score gains on the initial spring 2002 exam we identied aset of classrooms most likely to have cheated In addition wechose a second group of classrooms that had suspicious answerstrings but did not have large test score gains a pattern that isconsistent with a bad teacher who attempts to mask a poorteaching performance by cheating We also retested two sets ofclassrooms as control groups The rst control group consisted ofclasses with large test score gains but no evidence of cheating intheir answer strings a potential indication of effective teachingThese classes would be expected to maintain their gains whenretested subject to possible mean reversion The second controlgroup consisted of randomly selected classrooms

The results of the retests are reported in Table IV Columns1ndash3 correspond to reading columns 4ndash6 are for math For eachsubject we report the test score gain (in standard score units) forthree periods (i) from spring 2001 to the initial spring 2002 test(ii) from the initial spring 2002 test to the retest a few weekslater and (iii) from spring 2001 to the spring 2002 retest Forpurposes of comparison we also report the average gain for allclassrooms in the system for the rst of those measures (the othertwo measures are unavailable unless a classroom was retested)

Classrooms prospectively identied as ldquomost likely to havecheatedrdquo experienced gains on the initial spring 2002 test thatwere nearly twice as large as the typical CPS classroom (seecolumns 1 and 4) On the retest however those excess gainscompletely disappearedmdashthe gains between spring 2001 and thespring 2002 retest were close to the systemwide average Thoseclassrooms prospectively identied as ldquobad teachers likely to havecheatedrdquo also experienced large score declines on the retest (288and 2105 standard score points or roughly two-thirds of a gradeequivalent) Indeed comparing the spring 2001 results with the

23 Full details of this policy experiment are reported in Jacob and Levitt[2003] The results of the retests were used by CPS to initiate disciplinary actionagainst a number of teachers principals and staff

865ROTTEN APPLES

TABLE IVRESULTS OF RETESTS COMPARISON OF RESULTS FOR SPRING 2002 ITBS

AND AUDIT TEST

Category ofclassroom

Reading gains between Math gains between

Spring2001and

spring2002

Spring2001and2002retest

Spring2002and2002retest

Spring2001and

spring2002

Spring2001and2002retest

Spring2002and2002retest

All classrooms inthe Chicagopublic schools 143 mdash mdash 169 mdash mdash

Classroomsprospectivelyidentied asmost likelycheaters(N 5 36 onmath N 5 39on reading) 288 126 2162 300 193 2107

Classroomsprospectivelyidentied as badteacherssuspected ofcheating(N 5 16 onmath N 5 20on reading) 166 78 288 173 68 2105

Classroomsprospectivelyidentied asgood teacherswho did notcheat (N 5 17on math N 517 on reading) 206 211 105 288 255 233

Randomly selectedclassrooms (N 524 overall butonly one test perclassroom) 145 122 223 145 122 223

Results in this table compare test scores at three points in time (spring 2001 spring 2002 and a retestadministered under controlled conditions a few weeks after the initial spring 2002 test) The unit ofobservation is a classroom-subject test pair Only students taking all three tests are included in thecalculations Because of limited data math and reading results for the randomly selected classrooms arecombined Only the rst two columns are available for all CPS classrooms since audits were performed onlyon a subset of classrooms All entries in the table are in standard score units

866 QUARTERLY JOURNAL OF ECONOMICS

spring 2002 retest children in these classrooms had gains onlyhalf as large as the typical yearly CPS gain In stark contrastclassrooms prospectively identied as having ldquogood teachersrdquo(ie classrooms with large gains but not suspicious answerstrings) actually scored even higher on the reading retest thanthey did on the initial test The math scores fell slightly onretesting but these classrooms continued to experience extremelylarge gains The randomly selected classrooms also maintainedalmost all of their gains when retested as would be expected Insummary our methods proved to be extremely effective both inidentifying likely cheaters and in correctly classifying teacherswho achieved large test score gains in a legitimate manner

VI DOES TEACHER CHEATING RESPOND TO INCENTIVES

From the perspective of economics perhaps the most inter-esting question related to teacher cheating is the degree to whichit responds to incentives As noted in the Introduction there weretwo major changes in the incentives faced by teachers and stu-dents over our sample period Prior to 1996 ITBS scores wereprimarily used to provide teachers and parents with a sense ofhow a child was progressing academically Beginning in 1996with the appointment of Paul Vallas as CEO of Schools the CPSlaunched an initiative designed to hold students and teachersaccountable for student learning

The reform had two main elements The rst involved put-ting schools ldquoon probationrdquo if less than 15 percent of studentsscored at or above national norms on the ITBS reading examsThe CPS did not use math performance to determine probationstatus Probation schools that do not exhibit sufcient improve-ment may be reconstituted a procedure that involves closing theschool and dismissing or reassigning all of the school staff24 It isclear from our discussions with teachers and administrators thatbeing on probation is viewed as an extremely undesirable circum-stance The second piece of the accountability reform was an endto social promotionmdashthe practice of passing students to the nextgrade regardless of their academic skills or school performanceUnder the new policy students in third sixth and eighth grademust meet minimum standards on the ITBS in both reading and

24 For a more detailed analysis of the probation policy see Jacob [2002] andJacob and Lefgren [forthcoming]

867ROTTEN APPLES

mathematics in order to be promoted to the next grade Thepromotion standards were implemented in spring 1997 for thirdand sixth graders Promotion decisions are based solely on scoresin reading comprehension and mathematics

Table V presents OLS estimates of the relationship betweenteacher cheating and a variety of classroom and school charac-teristics25 The unit of observation is a classroomsubjectgradeyear The dependent variable is an indicator of whether we sus-pect the classroom cheated using our most stringent denition aclassroom is designated a cheater if both its test score uctua-tions and the suspiciousness of its answer strings are above theninety-fth percentile for that grade subject and year26

In column (1) the policy changes are restricted to have aconstant impact across all classrooms The introduction of thesocial promotion and probation policies is positively correlatedwith the likelihood of classroom cheating although the pointestimates are not statistically signicant at conventional levelsMuch more interesting results emerge when we interact thepolicy changes with the previous yearrsquos test scores for the class-room For both probation and social promotion cheating rates inthe lowest performing classrooms prove to be quite responsive tothe change in incentives In column (2) we see that a classroomone standard deviation below the mean increased its likelihood ofcheating by 043 percentage points in response to the schoolprobation policy and roughly 065 percentage points due to theending of social promotion Given a baseline rate of cheating of11 percent these effects are substantial The magnitude of thesechanges is particularly large considering that no elementaryschool on probation was actually reconstituted during this periodand that the social promotion policy has a direct impact on stu-dents but not direct ramications for teacher pay or rewards Aclassroom one standard deviation above the mean does not seeany signicant change in cheating in response to these two poli-cies Such classes are very unlikely to be located in schools at riskfor being put on probation and also are likely to have few stu-dents at risk for being retained These ndings are robust to

25 Logit and Probit models evaluated at the mean yield comparable resultsso the estimates from a linear probability model are presented for ease ofinterpretation

26 The results are not sensitive to the cheating cutoff used Note that thismeasure may include error due to both false positives and negatives Since themeasurement error is in the dependent variable the primary impact will likely beto simply decrease the precision of our estimates

868 QUARTERLY JOURNAL OF ECONOMICS

TABLE VOLS ESTIMATES OF THE RELATIONSHIP BETWEEN CHEATING AND CLASSROOM

CHARACTERISTICS

Independent variables

Dependent variable 5 Indicator ofclassroom cheating

(1) (2) (3) (4)

Social promotion policy 00011(00013)

00011(00013)

00015(00013)

00023(00009)

School probation policy 00020(00014)

00019(00014)

00021(00014)

00029(00013)

Prior classroom achievement 200047(00005)

200028(00005)

200016(00007)

200028(00007)

Social promotionclassroom achievement 200049(00014)

200051(00014)

200046(00012)

School probationclassroom achievement 200070(00013)

200070(00013)

200064(00013)

Mixed grade classroom 200084(00007)

200085(00007)

200089(00008)

200089(00012)

of students included in ofcial reporting 00252(00031)

00249(00031)

00141(00037)

00131(00037)

Teacher administers exam to ownstudents

00067(00015)

00067(00015)

00066(00015)

00061(00011)

Test form offered for the rst timea 200007(00011)

200007(00011)

200011(00010)

mdash

Average quality of teachersrsquoundergraduate institution

200026(00007)

Percent of teachers who have worked atthe school less than 3 years

200045(00031)

Percent of teachers under 30 years of age 00156(00065)

Percent of students in the school meetingnational norms in reading last year

00001(00000)

Percent free lunch in school 00001(00000)

Predominantly Black school 00068(00019)

Predominantly Hispanic school 200009(00016)

SchoolYear xed effects No No No YesNumber of observations 163474 163474 163474 163474

The unit of observation is classroomgradeyearsubject and the sample includes eight years (1993 to2000) four subjects (reading comprehension and three math sections) and ve grades (three to seven) Thedependentvariable is the cheating indicator derived using the ninety-fth percentile cutoff Robust standarderrors clustered by schoolyear are shown in parentheses Other variables included in the regressions incolumn (1) and (2) include a linear time trend grade cubic terms for the number of students a linear gradevariable and xed effects for subjects The regression shown in column (3) also includes the followingvariables indicators of the percentofstudents in theclassroom who wereBlack Hispanic male receivingfree lunchold for grade in a special education program living in foster care and living with a nonparental relative indicators ofschool size mobility rate and attendance rate and indicators of the percent of teachers in the school Other variablesinclude thepercentof teachersin the schoolwho hada masterrsquos ordoctoraldegreelivedin Chicagoand wereeducationmajors

a Test forms vary only by year so this variable will drop out of the analysis when schoolyear xed effectsare included

869ROTTEN APPLES

adding a number of classroom and school characteristics (column(3)) as well as schoolyear xed effects (column (4))

Cheating also appears to be responsive to other costs andbenets Classrooms that tested poorly last year are much morelikely to cheat For example a classroom with average studentprior achievement one classroom standard deviation below themean is 23 percent more likely to cheat Classrooms with stu-dents in multiple grades are 65 percent less likely to cheat thanclassrooms where all students are in the same grade This isconsistent with the fact that it is likely more difcult for teachersin such classrooms to cheat since they must administer twodifferent test forms to students which will necessarily have dif-ferent correct answers Moreover classes with a higher propor-tion of students who are included in the ofcial test reporting aremore likely to cheatmdasha 10 percentage point increase in the pro-portion of students in a class whose test scores ldquocountrdquo willincrease the likelihood of cheating by roughly 20 percent27

Teachers who administer the exam to their own students are 067percentage pointsmdashapproximately 50 percentmdashmore likely tocheat28

A few other notable results emerge First there is no statis-tically signicant impact on cheating of reusing a test form thathas been administered in a previous year That nding is ofinterest because it suggests that teachers taking old exams andteaching the precise questions to students is not an importantcomponent of what we are detecting as cheating (although anec-dotal evidence suggests this practice exists) Classrooms inschools with lower achievement higher poverty rates and moreBlack students are more likely to cheat Classrooms in schoolswith teachers who graduated from more prestigious undergrad-uate institutions are less likely to cheat whereas classrooms inschools with younger teachers are more likely to cheat

27 Consistent with this nding within cheating classrooms teachers areless likely to alter the answer sheets for students who are excluded from testreporting Teachers also appear to less frequently cheat for boys and for olderstudents

28 In an earlier version of the paper [Jacob and Levitt 2002] we examine forwhich students teachers tend to cheat We nd that teachers are most likely tocheat for students in the second quartile of the national ability distribution(twenty-fth to ftieth percentile) in the previous year consistent with the incen-tives under the probation policy

870 QUARTERLY JOURNAL OF ECONOMICS

VII CONCLUSION

This paper develops an algorithm for determining the preva-lence of teacher cheating on standardized tests and applies theresults to data from the Chicago public schools Our methodsreveal thousands of instances of classroom cheating representing4ndash5 percent of the classrooms each year Moreover we nd thatteacher cheating appears quite responsive to relatively minorchanges in incentives


the decision as to what cutoff signifies a “high” value on our cheating indicators is arbitrary, we present a 3 × 3 matrix of estimates using three different thresholds (eightieth, ninetieth, and ninety-fifth percentiles) for each of our cheating measures. The estimated prevalence of cheaters ranges from 1.1 percent to 2.1 percent, depending on the particular set of cutoffs used. As would be expected, the number of cheaters is generally declining as higher thresholds are employed. Nonetheless, it is encouraging that over such a wide set of cutoffs, the range of estimates is relatively tight.

The bottom panel of Table II presents estimates of the percentage of classrooms that are cheating on any of the four subject tests in a particular year. If every classroom that cheated did so only on one subject test, then the results in the bottom panel

so. Since there is no penalty for guessing on the test, filling in the blanks can only increase student test scores. While this type of teacher behavior is likely to be viewed by many as unethical, we do not make it the focus of our analysis because (1) it is difficult to provide definitive evidence of such behavior (a teacher could argue that he or she instructed students well in advance of the test to fill in all blanks with the letter “C” as part of good test-taking strategy), and (2) in our minds it is categorically different than a teacher who systematically changes student responses to the correct answer.

TABLE II
ESTIMATED PREVALENCE OF TEACHER CHEATING

Cutoff for suspicious answer           Cutoff for test score fluctuations (SCORE)
strings (ANSWERS)                80th percentile   90th percentile   95th percentile

Percent cheating on a particular test
  80th percentile                      2.1               2.1               1.8
  90th percentile                      1.8               1.8               1.5
  95th percentile                      1.3               1.3               1.1

Percent cheating on at least one of the four tests given
  80th percentile                      4.5               5.6               5.3
  90th percentile                      4.2               4.9               4.4
  95th percentile                      3.5               3.8               3.4

The top panel of the table presents estimates of the percentage of classrooms cheating on a particular subject test in a given year, based on three alternative cutoffs for ANSWERS and SCORE. In all cases, the prevalence of cheating is based on the excess number of classrooms with unexpected test score fluctuations among classes with suspicious answer strings, relative to classes that do not have suspicious answer strings. The bottom panel of the table presents estimates of the percentage of classrooms cheating on at least one of the four subject tests that comprise the overall test. In the bottom panel, classrooms that cheat on more than one subject test are counted only once. Our sample includes over 35,000 3rd–7th grade classrooms in the Chicago public schools for the years 1993–1999.


would simply be four times the results in the top panel. In many instances, however, classrooms appear to cheat on multiple subjects. Thus, the prevalence rates range from 3.4–5.6 percent of all classrooms.20 Because of the necessarily indirect nature of our identification strategy, we explore a range of supplementary tests designed to assess the validity of the estimates.21
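To make the excess-number calculation that underlies Table II concrete, the sketch below implements one plausible reading of it. It is illustrative only: the column names (answers_pctile, score_pctile) and the use of the 50th–75th percentile range as the no-cheating baseline are our assumptions, not the authors' exact procedure.

```python
import pandas as pd

def prevalence_estimate(df, answers_cut=0.95, score_cut=0.95):
    """Excess-mass estimate of cheating prevalence on one subject test.

    df: one row per classroom-subject-year, with columns
    'answers_pctile' and 'score_pctile' holding percentile ranks
    in [0, 1] of the two cheating indicators (names assumed).
    """
    suspicious = df['answers_pctile'] >= answers_cut
    # Rate of large score fluctuations among suspicious classrooms.
    rate_suspicious = (df.loc[suspicious, 'score_pctile'] >= score_cut).mean()
    # Baseline rate among clearly non-suspicious classrooms
    # (assumed here: 50th-75th percentiles of the ANSWERS measure).
    baseline = df['answers_pctile'].between(0.50, 0.75)
    rate_baseline = (df.loc[baseline, 'score_pctile'] >= score_cut).mean()
    # The excess, scaled by the share of suspicious classrooms, is the
    # estimated fraction of all classrooms cheating on this subject.
    return (rate_suspicious - rate_baseline) * suspicious.mean()
```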

V.A. Simulation Results

Our prevalence estimates may be biased downward or upward, depending on the extent to which our algorithm fails to successfully identify all instances of cheating (“false negatives”) versus the frequency with which we wrongly classify a teacher as cheating (“false positives”).

One simple simulation exercise we undertook with respect to possible false positives involves randomly assigning students to hypothetical classrooms. These synthetic classrooms thus consist of students who in actuality had no connection to one another. We then analyze these hypothetical classrooms using the same algorithm applied to the actual data. As one would hope, no evidence of cheating was found in the simulated classes. Indeed, the estimated prevalence of cheating was slightly negative in this simulation (i.e., classrooms with large test score increases in the current year followed by big declines the next year were slightly less likely to have unusual patterns of answer strings). Thus, there does not appear to be anything mechanical in our algorithm that generates evidence of cheating.
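A minimal sketch of this placebo exercise, assuming a student-level DataFrame with a 'classroom' column and some detection routine run_cheating_algorithm (a hypothetical stand-in for the paper's algorithm):

```python
import numpy as np

def placebo_classrooms(students, seed=0):
    """Return a copy of the student data with classroom labels randomly
    permuted, creating synthetic classes whose members had no real
    connection to one another (class sizes are preserved).
    """
    rng = np.random.default_rng(seed)
    shuffled = students.copy()
    shuffled['classroom'] = rng.permutation(students['classroom'].values)
    return shuffled

# Usage (illustrative): the detection algorithm should find no excess of
# suspicious-string / large-fluctuation classes in the synthetic data.
# prevalence = run_cheating_algorithm(placebo_classrooms(students))
```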

20. Computation of the overall prevalence is somewhat complicated, because it involves calculating not only how many classrooms are actually above the thresholds on multiple subject tests, but also how frequently this would occur in the absence of cheating. Details on these calculations are available from the authors.

21. In an earlier version of this paper [Jacob and Levitt 2003], we present a number of additional sensitivity analyses that confirm the basic results. First, we test whether our results are sensitive to the way in which we predict the prevalence of high values of ANSWERS and SCORE in the upper part of the other distribution (e.g., fitting a linear or quadratic model to the data in the lower portion of the distribution, versus simply using the average value from the 50th–75th percentiles). We find our results are extremely robust. Second, we examine whether our results might simply be due to mean reversion, by testing whether classrooms with suspicious answer strings are less likely to maintain large test score gains. Among a sample of classrooms with large test score gains, we find that mean reversion is substantially greater among the set of classes with highly suspicious answer strings. Third, we investigate whether we might be detecting student rather than teacher cheating, by examining whether students who have suspicious answer strings in one year are more likely to have suspicious answer strings in other years. We find that they do not.

A second simulation involves artificially altering a classroom's answer strings in ways that we believe mimic teacher cheating or outstanding teaching.22 We are then able to see how frequently we label these altered classes as “cheating,” and how this varies with the extent and nature of the changes we make to a classroom's answers. We simulate two different types of teacher cheating. The first is a very naive version, in which a teacher starts cheating at the same question for a number of students and changes consecutive questions to the right answers for these students, creating a block of identical and correct responses. The second type of cheating is much more sophisticated: we randomly change answers from incorrect to correct for selected students in a class. The outcome of this simulation reveals that our methods are surprisingly weak in detecting moderate cheating, even in the naive case. For instance, when three consecutive answers are changed to correct for 25 percent of a class, we catch such behavior only 4 percent of the time. Even when six questions are changed for half of the class, we catch the cheating in less than 60 percent of the cases. Only when the cheating is very extreme (e.g., six answers changed for the entire class) are we very likely to catch the cheating. The explanation for this poor performance is that classrooms with low test-score gains, even with the boost provided by this cheating, do not have large enough test score fluctuations to make them appear unusual, because there is so much inherent volatility in test scores. When the cheating takes the more sophisticated form described above, we are even less successful at catching low and moderate cheaters (2 percent and 34 percent, respectively, in the first two scenarios described above), but we almost always detect very extreme cheating.
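The naive form of simulated cheating amounts to a small array manipulation. A sketch, assuming answers is a (students × items) array of responses and key the correct answers; the block start, block length, and share of affected students are the scenario parameters described above:

```python
import numpy as np

def simulate_naive_cheating(answers, key, start=10, n_items=3,
                            frac_students=0.25, seed=0):
    """Give a random subset of students an identical, correct block of
    answers on consecutive items, mimicking the naive cheating case.
    """
    rng = np.random.default_rng(seed)
    altered = answers.copy()
    n_students = answers.shape[0]
    chosen = rng.choice(n_students, size=int(frac_students * n_students),
                        replace=False)
    cols = np.arange(start, start + n_items)
    # Every chosen student gets the correct answer on the same block.
    altered[np.ix_(chosen, cols)] = key[cols]
    return altered
```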

22. See Jacob and Levitt [2003] for the precise details of this simulation exercise.

From a political or equity perspective, however, we may be even more concerned with false positives, specifically the risk of accusing highly effective teachers of cheating. Hence, we also simulate the impact of a good teacher changing certain answers for certain students from incorrect to correct responses. Our manipulation in this case differs from the cheating simulations above in two ways: (1) we allow some of the gains to be preserved in the following year, and (2) we alter questions such that students will not get random answers correct, but rather will tend to show the greatest improvement on the easiest questions that the students were getting wrong. These simulation results suggest that only rarely are good teachers mistakenly labeled as cheaters. For instance, a teacher whose prowess allows him or her to raise test scores by six questions for half the students in the class (more than a 0.5 grade equivalent increase on average for the class) is labeled a cheater only 2 percent of the time. In contrast, a naive cheater with the same test score gains is correctly identified as a cheater in more than 50 percent of cases.

In another test for false positives, we explored whether we might be mistaking emphasis on certain subject material for cheating. For example, if a math teacher spends several months on fractions with a particular class, one would expect the class to do particularly well on all of the math questions relating to fractions, and perhaps worse than average on other math questions. One might imagine a similar scenario in reading if, for example, a teacher creates a lengthy unit on the Underground Railroad, which later happens to appear in a passage on the reading comprehension exam. To examine this, we analyze the nature of the across-question correlations in cheating versus noncheating classrooms. We find that the most highly correlated questions in cheating classrooms are no more likely to measure the same skill (e.g., fractions, geometry) than in noncheating classrooms. The implication of these results is that the prevalence estimates presented above are likely to substantially understate the true extent of cheating (there is little evidence of false positives), but we frequently miss moderate cases of cheating.

V.B. The Correlation of Cheating Indicators across Subjects, Classrooms, and Years

If what we are detecting is truly cheating, then one would expect that a teacher who cheats on one part of the test would be more likely to cheat on other parts of the test. Also, a teacher who cheats one year would be more likely to cheat the following year. Finally, to the extent that cheating is either condoned by the principal or carried out by the test coordinator, one would expect to find multiple classes in a school cheating in any given year, and perhaps even that cheating in a school one year predicts cheating in future years. If what we are detecting is not cheating, then one would not necessarily expect to find strong correlation in our cheating indicators across exams for a specific classroom, across classrooms, or across years.


Table III reports regression results testing these predictions. The dependent variable is an indicator for whether we believe a classroom is likely to be cheating on a particular subject test, using our most stringent definition (above the ninety-fifth percentile on both cheating indicators). The baseline probability of

TABLE III
PATTERNS OF CHEATING WITHIN CLASSROOMS AND SCHOOLS

Dependent variable = Class suspected of cheating (mean of the dependent variable = 0.011)

                                                              Full sample         Sample of classes and schools
                                                                                  that existed in the prior year
Independent variables                                      (1)        (2)           (3)        (4)

Classroom cheated on exactly one other subject           0.105      0.103         0.101      0.101
  this year                                             (0.008)    (0.008)       (0.009)    (0.009)
Classroom cheated on exactly two other subjects          0.289      0.285         0.243      0.243
  this year                                             (0.027)    (0.027)       (0.031)    (0.031)
Classroom cheated on all three other subjects            0.627      0.622         0.595      0.595
  this year                                             (0.051)    (0.051)       (0.054)    (0.054)
Cheating rate among all other classes in the                        0.166         0.134      0.129
  school this year on this subject                                 (0.030)       (0.027)    (0.027)
Cheating rate among all other classes in the                        0.023         0.059      0.045
  school this year on other subjects                               (0.024)       (0.026)    (0.029)
Cheating in this classroom in this subject                                        0.096      0.091
  last year                                                                      (0.012)    (0.012)
Number of other subjects this classroom                                           0.023      0.018
  cheated on last year                                                           (0.004)    (0.004)
Cheating in this classroom ever in the past                                                  0.006
                                                                                            (0.002)
Cheating rate among other classrooms in this                                                 0.090
  school in past years                                                                      (0.040)
Fixed effects for grade-subject-year                      Yes        Yes           Yes        Yes
R²                                                       0.090      0.093         0.109      0.109
Number of observations                                  165,578    165,578       94,182     94,170

The dependent variable is an indicator for whether a classroom is above the 95th percentile on both our suspicious strings and unusual test score measures of cheating on a particular subject test. Estimation is done using a linear probability model. The unit of observation is classroom-grade-year-subject. For columns that include measures of cheating in prior years, observations where that classroom or school does not appear in the data in the prior year are excluded. Standard errors are clustered at the school level to take into account correlations across classrooms as well as serial correlation.


qualifying as a cheater for this cutoff is 1.1 percent. To fully appreciate the magnitude of the effects implied by the table, it is important to keep this very low baseline in mind. We report estimates from linear probability models (probits yield similar marginal effects), with standard errors clustered at the school level.
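This estimation strategy, a linear probability model with school-clustered standard errors, is straightforward to reproduce with standard tools. A sketch using statsmodels, where the variable names (cheat, cheat_one_other, school_id, etc.) are illustrative placeholders standing in for the paper's data:

```python
import statsmodels.formula.api as smf

# Linear probability model of the cheating indicator on dummies for
# cheating on other subject tests this year, with grade-subject-year
# fixed effects, clustering standard errors at the school level.
model = smf.ols(
    'cheat ~ cheat_one_other + cheat_two_other + cheat_three_other'
    ' + C(grade):C(subject):C(year)',
    data=df,
)
result = model.fit(cov_type='cluster', cov_kwds={'groups': df['school_id']})
print(result.summary())
```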

Column 1 of Table III shows that cheating on other tests in the same year is an extremely powerful predictor of cheating in a different subject. If a classroom cheats on exactly one other subject test, the predicted probability of cheating on this test increases by over ten percentage points. Since the baseline cheating rate is only 1.1 percent, classrooms cheating on exactly one other test are ten times more likely to have cheated on this subject than are classrooms that did not cheat on any of the other subjects (which is the omitted category). Classrooms that cheat on two other subjects are almost 30 times more likely to cheat on this test, relative to those not cheating on other tests. If a class cheats on all three other subjects, it is 50 times more likely to also cheat on this test.

There also is evidence of correlation in cheating within schools. A ten-percentage-point increase in cheating classrooms in a school (excluding the classroom in question) on the same subject test raises the likelihood this class cheats by roughly 0.16 percentage points. This potentially suggests some role for centralized cheating by a school counselor, test coordinator, or the principal, rather than by teachers operating independently. There is little evidence that cheating rates within the school on other subject tests affect cheating on this test.

When making comparisons across years (columns 3 and 4), it is important to note that we do not actually have teacher identifiers. We do, however, know what classroom a student is assigned to. Thus, we can only compare the correlation between past and current cheating in a given classroom. To the extent that teacher turnover occurs or teachers switch classrooms, this proxy will be contaminated. Even given this important limitation, cheating in the classroom last year predicts cheating this year. In column 3, for example, we see that classrooms that cheated in the same subject last year are 9.6 percentage points more likely to cheat this year, even after we control for cheating on other subjects in the same year and cheating in other classes in the school. Column 4 shows that prior cheating in the school strongly predicts the likelihood that a classroom will cheat this year.


V.C. Evidence from Retesting under Controlled Circumstances

Perhaps the most compelling evidence for our methods comes from the results of a unique policy intervention. In spring 2002 the Chicago public schools provided us the opportunity to retest over 100 classrooms under controlled circumstances that precluded cheating, a few weeks after the initial ITBS exam was administered.23 Based on suspicious answer strings and unusual test score gains on the initial spring 2002 exam, we identified a set of classrooms most likely to have cheated. In addition, we chose a second group of classrooms that had suspicious answer strings but did not have large test score gains, a pattern that is consistent with a bad teacher who attempts to mask a poor teaching performance by cheating. We also retested two sets of classrooms as control groups. The first control group consisted of classes with large test score gains but no evidence of cheating in their answer strings, a potential indication of effective teaching. These classes would be expected to maintain their gains when retested, subject to possible mean reversion. The second control group consisted of randomly selected classrooms.

The results of the retests are reported in Table IV. Columns 1–3 correspond to reading; columns 4–6 are for math. For each subject we report the test score gain (in standard score units) for three periods: (i) from spring 2001 to the initial spring 2002 test; (ii) from the initial spring 2002 test to the retest a few weeks later; and (iii) from spring 2001 to the spring 2002 retest. For purposes of comparison, we also report the average gain for all classrooms in the system for the first of those measures (the other two measures are unavailable unless a classroom was retested).

Classrooms prospectively identified as “most likely to have cheated” experienced gains on the initial spring 2002 test that were nearly twice as large as the typical CPS classroom (see columns 1 and 4). On the retest, however, those excess gains completely disappeared: the gains between spring 2001 and the spring 2002 retest were close to the systemwide average. Those classrooms prospectively identified as “bad teachers likely to have cheated” also experienced large score declines on the retest (−8.8 and −10.5 standard score points, or roughly two-thirds of a grade equivalent). Indeed, comparing the spring 2001 results with the

23. Full details of this policy experiment are reported in Jacob and Levitt [2003]. The results of the retests were used by CPS to initiate disciplinary action against a number of teachers, principals, and staff.


TABLE IV
RESULTS OF RETESTS: COMPARISON OF RESULTS FOR SPRING 2002 ITBS AND AUDIT TEST

                                         Reading gains between                 Math gains between
                                  Spring 2001  Spring 2001  Spring 2002  Spring 2001  Spring 2001  Spring 2002
                                  and spring   and 2002     and 2002     and spring   and 2002     and 2002
Category of classroom             2002         retest       retest       2002         retest       retest

All classrooms in the Chicago
  public schools                  14.3                                   16.9
Classrooms prospectively
  identified as most likely
  cheaters (N = 36 on math,
  N = 39 on reading)              28.8         12.6         −16.2        30.0         19.3         −10.7
Classrooms prospectively
  identified as bad teachers
  suspected of cheating
  (N = 16 on math, N = 20
  on reading)                     16.6          7.8          −8.8        17.3          6.8         −10.5
Classrooms prospectively
  identified as good teachers
  who did not cheat (N = 17
  on math, N = 17 on reading)     20.6         21.1           0.5        28.8         25.5          −3.3
Randomly selected classrooms
  (N = 24 overall, but only one
  test per classroom)             14.5         12.2          −2.3        14.5         12.2          −2.3

Results in this table compare test scores at three points in time (spring 2001, spring 2002, and a retest administered under controlled conditions a few weeks after the initial spring 2002 test). The unit of observation is a classroom-subject test pair. Only students taking all three tests are included in the calculations. Because of limited data, math and reading results for the randomly selected classrooms are combined. Only the first two columns are available for all CPS classrooms, since audits were performed only on a subset of classrooms. All entries in the table are in standard score units.


spring 2002 retest, children in these classrooms had gains only half as large as the typical yearly CPS gain. In stark contrast, classrooms prospectively identified as having “good teachers” (i.e., classrooms with large gains but not suspicious answer strings) actually scored even higher on the reading retest than they did on the initial test. The math scores fell slightly on retesting, but these classrooms continued to experience extremely large gains. The randomly selected classrooms also maintained almost all of their gains when retested, as would be expected. In summary, our methods proved to be extremely effective, both in identifying likely cheaters and in correctly classifying teachers who achieved large test score gains in a legitimate manner.

VI. DOES TEACHER CHEATING RESPOND TO INCENTIVES?

From the perspective of economics, perhaps the most interesting question related to teacher cheating is the degree to which it responds to incentives. As noted in the Introduction, there were two major changes in the incentives faced by teachers and students over our sample period. Prior to 1996, ITBS scores were primarily used to provide teachers and parents with a sense of how a child was progressing academically. Beginning in 1996, with the appointment of Paul Vallas as CEO of Schools, the CPS launched an initiative designed to hold students and teachers accountable for student learning.

The reform had two main elements. The first involved putting schools “on probation” if less than 15 percent of students scored at or above national norms on the ITBS reading exams. The CPS did not use math performance to determine probation status. Probation schools that do not exhibit sufficient improvement may be reconstituted, a procedure that involves closing the school and dismissing or reassigning all of the school staff.24 It is clear from our discussions with teachers and administrators that being on probation is viewed as an extremely undesirable circumstance. The second piece of the accountability reform was an end to social promotion, the practice of passing students to the next grade regardless of their academic skills or school performance. Under the new policy, students in third, sixth, and eighth grade must meet minimum standards on the ITBS in both reading and

24. For a more detailed analysis of the probation policy, see Jacob [2002] and Jacob and Lefgren [forthcoming].


mathematics in order to be promoted to the next grade. The promotion standards were implemented in spring 1997 for third and sixth graders. Promotion decisions are based solely on scores in reading comprehension and mathematics.

Table V presents OLS estimates of the relationship between teacher cheating and a variety of classroom and school characteristics.25 The unit of observation is a classroom-subject-grade-year. The dependent variable is an indicator of whether we suspect the classroom cheated, using our most stringent definition: a classroom is designated a cheater if both its test score fluctuations and the suspiciousness of its answer strings are above the ninety-fifth percentile for that grade, subject, and year.26

In column (1) the policy changes are restricted to have a constant impact across all classrooms. The introduction of the social promotion and probation policies is positively correlated with the likelihood of classroom cheating, although the point estimates are not statistically significant at conventional levels. Much more interesting results emerge when we interact the policy changes with the previous year's test scores for the classroom. For both probation and social promotion, cheating rates in the lowest performing classrooms prove to be quite responsive to the change in incentives. In column (2) we see that a classroom one standard deviation below the mean increased its likelihood of cheating by 0.43 percentage points in response to the school probation policy, and roughly 0.65 percentage points due to the ending of social promotion. Given a baseline rate of cheating of 1.1 percent, these effects are substantial. The magnitude of these changes is particularly large considering that no elementary school on probation was actually reconstituted during this period, and that the social promotion policy has a direct impact on students but no direct ramifications for teacher pay or rewards. A classroom one standard deviation above the mean does not see any significant change in cheating in response to these two policies. Such classes are very unlikely to be located in schools at risk for being put on probation, and also are likely to have few students at risk for being retained. These findings are robust to

25. Logit and probit models evaluated at the mean yield comparable results, so the estimates from a linear probability model are presented for ease of interpretation.

26. The results are not sensitive to the cheating cutoff used. Note that this measure may include error due to both false positives and negatives. Since the measurement error is in the dependent variable, the primary impact will likely be to simply decrease the precision of our estimates.


TABLE V
OLS ESTIMATES OF THE RELATIONSHIP BETWEEN CHEATING AND CLASSROOM CHARACTERISTICS

Dependent variable = Indicator of classroom cheating

Independent variables                                  (1)         (2)         (3)         (4)

Social promotion policy                              0.0011      0.0011      0.0015      0.0023
                                                    (0.0013)    (0.0013)    (0.0013)    (0.0009)
School probation policy                              0.0020      0.0019      0.0021      0.0029
                                                    (0.0014)    (0.0014)    (0.0014)    (0.0013)
Prior classroom achievement                         −0.0047     −0.0028     −0.0016     −0.0028
                                                    (0.0005)    (0.0005)    (0.0007)    (0.0007)
Social promotion × classroom achievement                        −0.0049     −0.0051     −0.0046
                                                                (0.0014)    (0.0014)    (0.0012)
School probation × classroom achievement                        −0.0070     −0.0070     −0.0064
                                                                (0.0013)    (0.0013)    (0.0013)
Mixed grade classroom                               −0.0084     −0.0085     −0.0089     −0.0089
                                                    (0.0007)    (0.0007)    (0.0008)    (0.0012)
Percent of students included in official             0.0252      0.0249      0.0141      0.0131
  reporting                                         (0.0031)    (0.0031)    (0.0037)    (0.0037)
Teacher administers exam to own students             0.0067      0.0067      0.0066      0.0061
                                                    (0.0015)    (0.0015)    (0.0015)    (0.0011)
Test form offered for the first time (a)            −0.0007     −0.0007     −0.0011
                                                    (0.0011)    (0.0011)    (0.0010)
Average quality of teachers'                                                −0.0026
  undergraduate institution                                                 (0.0007)
Percent of teachers who have worked at                                      −0.0045
  the school less than 3 years                                              (0.0031)
Percent of teachers under 30 years of age                                    0.0156
                                                                            (0.0065)
Percent of students in the school meeting                                    0.0001
  national norms in reading last year                                       (0.0000)
Percent free lunch in school                                                 0.0001
                                                                            (0.0000)
Predominantly Black school                                                   0.0068
                                                                            (0.0019)
Predominantly Hispanic school                                               −0.0009
                                                                            (0.0016)
School-year fixed effects                            No          No          No          Yes
Number of observations                               163,474     163,474     163,474     163,474

The unit of observation is classroom-grade-year-subject, and the sample includes eight years (1993 to 2000), four subjects (reading comprehension and three math sections), and five grades (three to seven). The dependent variable is the cheating indicator derived using the ninety-fifth percentile cutoff. Robust standard errors, clustered by school-year, are shown in parentheses. Other variables included in the regressions in columns (1) and (2) include a linear time trend, cubic terms for the number of students, a linear grade variable, and fixed effects for subjects. The regression shown in column (3) also includes the following variables: indicators of the percent of students in the classroom who were Black, Hispanic, male, receiving free lunch, old for grade, in a special education program, living in foster care, and living with a nonparental relative; indicators of school size, mobility rate, and attendance rate; and the percent of teachers in the school who had a master's or doctoral degree, lived in Chicago, and were education majors.

(a) Test forms vary only by year, so this variable will drop out of the analysis when school-year fixed effects are included.


adding a number of classroom and school characteristics (column (3)), as well as school-year fixed effects (column (4)).
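Written out, a plausible reading of the specification in columns (2) through (4) is the linear probability model below; the control vector X and the fixed effects are those described in the notes to Table V, subscripts c, s, and t index classrooms, schools, and years, and the variable names are ours rather than the paper's:

\text{Cheat}_{cst} = \beta_1\,\text{SocProm}_t + \beta_2\,\text{Probation}_t + \beta_3\,\text{Ach}_{c,t-1} + \beta_4\,(\text{SocProm}_t \times \text{Ach}_{c,t-1}) + \beta_5\,(\text{Probation}_t \times \text{Ach}_{c,t-1}) + X_{cst}'\gamma + \varepsilon_{cst}

where \text{Ach}_{c,t-1} is prior classroom achievement, so that \beta_4 and \beta_5 capture how much more strongly low-achieving classrooms respond to the two policies.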

Cheating also appears to be responsive to other costs and benefits. Classrooms that tested poorly last year are much more likely to cheat. For example, a classroom with average student prior achievement one classroom standard deviation below the mean is 23 percent more likely to cheat. Classrooms with students in multiple grades are 65 percent less likely to cheat than classrooms where all students are in the same grade. This is consistent with the fact that it is likely more difficult for teachers in such classrooms to cheat, since they must administer two different test forms to students, which will necessarily have different correct answers. Moreover, classes with a higher proportion of students who are included in the official test reporting are more likely to cheat: a 10 percentage point increase in the proportion of students in a class whose test scores “count” will increase the likelihood of cheating by roughly 20 percent.27

Teachers who administer the exam to their own students are 0.67 percentage points (approximately 50 percent) more likely to cheat.28

A few other notable results emerge. First, there is no statistically significant impact on cheating of reusing a test form that has been administered in a previous year. That finding is of interest because it suggests that teachers taking old exams and teaching the precise questions to students is not an important component of what we are detecting as cheating (although anecdotal evidence suggests this practice exists). Classrooms in schools with lower achievement, higher poverty rates, and more Black students are more likely to cheat. Classrooms in schools with teachers who graduated from more prestigious undergraduate institutions are less likely to cheat, whereas classrooms in schools with younger teachers are more likely to cheat.

27. Consistent with this finding, within cheating classrooms teachers are less likely to alter the answer sheets for students who are excluded from test reporting. Teachers also appear to cheat less frequently for boys and for older students.

28. In an earlier version of the paper [Jacob and Levitt 2002], we examine for which students teachers tend to cheat. We find that teachers are most likely to cheat for students in the second quartile of the national ability distribution (twenty-fifth to fiftieth percentile) in the previous year, consistent with the incentives under the probation policy.


VII. CONCLUSION

This paper develops an algorithm for determining the prevalence of teacher cheating on standardized tests and applies the results to data from the Chicago public schools. Our methods reveal thousands of instances of classroom cheating, representing 4–5 percent of the classrooms each year. Moreover, we find that teacher cheating appears quite responsive to relatively minor changes in incentives.

The obvious benefits of high-stakes tests as a means of providing incentives must therefore be weighed against the possible distortions that these measures induce. Explicit cheating is merely one form of distortion. Indeed, the kind of cheating we focus on is one that could be greatly reduced at a relatively low cost through the implementation of proper safeguards, such as is done by Educational Testing Service on the SAT or GRE exams.29

Even if this particular form of cheating were eliminated, however, there are a virtually infinite number of dimensions along which strategic school personnel can act to game the current system. For instance, Figlio and Winicki [2001] and Rohlfs [2002] document that schools attempt to increase the caloric intake of students on test days, and disruptive students are especially likely to be suspended from school during official testing periods [Figlio 2002]. It may be prohibitively costly, or even impossible, to completely safeguard the system.

Finally, this paper fits into a small but growing body of research focused on identifying corrupt or illicit behavior on the part of economic actors (see Porter and Zona [1993], Fisman [2001], Di Tella and Schargrodsky [2001], and Duggan and Levitt [2002]). Because individuals engaged in such behavior actively attempt to hide their trails, the intellectual exercise associated with uncovering their misdeeds differs substantially from the typical economic application, in which the researcher starts with a well-defined measure of the outcome variable (e.g., earnings, economic growth, profits) and then attempts to uncover the determinants of these outcomes. In the case of corruption, there is typically no clear outcome variable, making it necessary for the

29. Although even tests administered by the Educational Testing Service are not immune to cheating, as evidenced by recent reports that many of the GRE scores from China are fraudulent [Contreras 2002].


researcher to employ nonstandard approaches in generating such a measure.

APPENDIX: THE CONSTRUCTION OF SUSPICIOUS STRING MEASURES

This appendix describes in greater detail how we construct each of our measures of unexpected or suspicious responses.

The first measure focuses on the most unlikely block of identical answers given on consecutive questions. Using past test scores, future test scores, and background characteristics, we predict the likelihood that each student will give each answer on each question. For each item, a student has four choices (A, B, C, or D), only one of which is correct. We estimate a multinomial logit for each item on the exam in order to predict how students will respond to each question. We estimate the following model for each item, using information from other students in that year, grade, and subject:

(1)   \Pr(Y_{isc} = j) = \frac{e^{\beta_j X_s}}{\sum_{j=1}^{J} e^{\beta_j X_s}}

where Y_{isc} indicates the response of student s in class c on item i, the number of possible responses (J) is four, and X_s is a vector that includes measures of prior and future student achievement in math and reading, as well as demographic variables (such as race, gender, and free lunch status) for student s. Thus, a student's predicted probability of choosing a particular response is identified by the likelihood of other students (in the same year, grade, and subject) with similar background characteristics choosing that response.

Notice that by including future as well as prior test scores in the model, we decrease the likelihood that students with unusually good teachers will be identified as cheaters, since these students will likely retain some of the knowledge learned in the base year and thus have higher future test scores. Also note that by estimating the probability of selecting each possible response, rather than simply estimating the probability of choosing the correct response, we take advantage of any additional information that is provided by particular response patterns in a classroom.
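A per-item multinomial logit of this form can be estimated with standard tools; a sketch using statsmodels.MNLogit, with assumed column names (response coded 0–3, plus covariate columns) and one model per item within a year-grade-subject cell:

```python
import pandas as pd
import statsmodels.api as sm

def fit_item_choice_models(df, covariates):
    """Fit one multinomial logit per exam item and return, for each
    item, the predicted probability of every possible response for
    every student. df needs an 'item' column and a 'response' column
    coded 0-3, plus the covariate columns (names assumed).
    """
    predicted = {}
    for item, grp in df.groupby('item'):
        X = sm.add_constant(grp[covariates])
        fit = sm.MNLogit(grp['response'], X).fit(disp=0)
        # predict() yields an (n_students, 4) matrix of probabilities.
        predicted[item] = pd.DataFrame(fit.predict(X), index=grp.index)
    return predicted
```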

Using the estimates from this model, we calculate the predicted probability that each student would answer each item in the way that he or she in fact did:

(2)   p_{isc} = \frac{e^{\beta_k X_s}}{\sum_{j=1}^{J} e^{\beta_j X_s}} \quad \text{for } k = \text{the response actually chosen by student } s \text{ on item } i.

This provides us with one measure per student per item. Taking the product over items within student, we calculate the probability that a student would have answered a string of consecutive questions from item m to item n as he or she did:

(3)   p_{sc}^{mn} = \prod_{i=m}^{n} p_{isc}

We then take the product across all students in the classroom who had identical responses in the string. If we define z as a student, S_{zc}^{mn} as the string of responses for student z from item m to item n, and S_{sc}^{mn} as the string for student s, then we can express the product as

(4)   \tilde{p}_{sc}^{mn} = \prod_{s \in \{z \,:\, S_{zc}^{mn} = S_{sc}^{mn}\}} p_{sc}^{mn}

Note that if there are n_s students in class c and each student has a unique set of responses to these particular items, then \tilde{p}_{sc}^{mn} collapses to p_{sc}^{mn} for each student, and there will be n_s distinct values within the class. At the other extreme, if all of the students in class c have identical responses, then there is only one distinct value of \tilde{p}_{sc}^{mn}. We repeat this calculation for all possible consecutive strings of length three to seven, that is, for all S^{mn} such that 3 \le n - m \le 7. To create our first indicator of suspicious string patterns, we take the minimum of the predicted block probability for each classroom:

MEASURE 1.   M1_c = \min_s (\tilde{p}_{sc}^{mn})

The second measure of suspicious answer strings is intendedto capture more general patterns of similarity in student re-sponses To construct this measure we rst calculate the resid-


The second measure of suspicious answer strings is intended to capture more general patterns of similarity in student responses. To construct this measure, we first calculate the residuals for each of the possible choices a student could have made for each item:

(5)   e_{jisc} = \begin{cases} 0 - \dfrac{e^{\beta_j X_s}}{\sum_{j=1}^{J} e^{\beta_j X_s}} & \text{if } j \ne k \\[2ex] 1 - \dfrac{e^{\beta_j X_s}}{\sum_{j=1}^{J} e^{\beta_j X_s}} & \text{if } j = k \end{cases}

where e_{jisc} is the residual for response j on item i by student s in classroom c. We thus have four separate residuals per student per item.

To create a classroom-level measure of the response to item i, we need to combine the information for each student. First, we sum the residuals for each response across students within a classroom:

(6)   e_{jic} = \sum_{s} e_{jisc}

If there is no within-class correlation in the way that students responded to a particular item, this term should be approximately zero. Second, we sum across the four possible responses for each item within classrooms. At the same time, we square each of the component residual measures to accentuate outliers, and divide by the number of students in the class (n_{sc}) to normalize by class size:

(7)   v_{ic} = \frac{\sum_{j} (e_{jic})^2}{n_{sc}}

The statistic v_{ic} captures something like the variance of student responses on item i within classroom c. Notice that we choose to first sum across the residuals of each response across students, and then sum the classroom-level measures for each response, rather than summing across responses within student initially. We do this in order to emphasize the classroom-level tendencies in response patterns.

Our second measure of suspicious strings is simply the classroom average (across items) of this variance term across all test items:

MEASURE 2.   M2_c = \bar{v}_c = \sum_i v_{ic} / n_i, where n_i is the number of items on the exam.

Our third measure is simply the variance (as opposed to the mean) in the degree of correlation across questions within a classroom:

MEASURE 3.   M3_c = \sigma_{v_c}^{2} = \sum_i (v_{ic} - \bar{v}_c)^2 / n_i
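Measures 2 and 3 follow mechanically from the residuals. A sketch, assuming resid is a (4, items, students) array of the e_{jisc} residuals for one classroom:

```python
import numpy as np

def measures_2_and_3(resid):
    """Classroom mean (Measure 2) and variance (Measure 3) of the
    item-level statistic v_ic built from equations (6)-(7).
    resid: (4 responses, n_items, n_students) array of e_jisc.
    """
    n_students = resid.shape[2]
    e_jic = resid.sum(axis=2)                      # equation (6)
    v_ic = (e_jic ** 2).sum(axis=0) / n_students   # equation (7)
    m2 = v_ic.mean()   # Measure 2: average of v_ic across items
    m3 = v_ic.var()    # Measure 3: variance of v_ic across items
    return m2, m3
```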

Our final indicator focuses on the extent to which a student's response pattern differs from those of other students with the same aggregate score that year. Let q_{isc} equal one if student s in classroom c answered item i correctly, and zero otherwise. Let A_s equal the aggregate score of student s on the exam. We then determine what fraction of students at each aggregate score level answered each item correctly. If we let n_{sA} equal the number of students with an aggregate score of A, then this fraction \bar{q}_i^A can be expressed as

(8)   \bar{q}_i^{A} = \frac{\sum_{s \in \{z \,:\, A_z = A_s\}} q_{isc}}{n_{sA}}

We then calculate a measure of how much the response pattern of student s differed from the response pattern of other students with the same aggregate score. We do so by subtracting a student's answer on item i from the mean response of all students with aggregate score A, squaring these deviations, and then summing across all items on the exam:

(9)   Z_{sc} = \sum_{i} (q_{isc} - \bar{q}_i^{A})^2

We then subtract out the mean deviation for all students with the same aggregate score, \bar{Z}_A, and sum across the students within each classroom to obtain our final indicator:

MEASURE 4.   M4_c = \sum_s (Z_{sc} - \bar{Z}_A)
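A sketch of this fourth measure, assuming a student-level DataFrame with 0/1 item columns, an aggregate 'score', and a 'classroom' label (all names illustrative):

```python
import pandas as pd

def measure_4(df, item_cols):
    """Within-classroom sum of each student's mean-adjusted squared
    deviation from the response pattern of students with the same
    aggregate score (equations (8)-(9) and Measure 4).
    """
    # Equation (8): fraction answering each item correctly, by score.
    q_bar = df.groupby('score')[item_cols].transform('mean')
    # Equation (9): squared deviations summed across items.
    z = ((df[item_cols] - q_bar) ** 2).sum(axis=1)
    # Subtract the mean deviation among same-score students, then sum
    # within classrooms to obtain M4 for each class.
    z_centered = z - z.groupby(df['score']).transform('mean')
    return z_centered.groupby(df['classroom']).sum()
```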

KENNEDY SCHOOL OF GOVERNMENT, HARVARD UNIVERSITY

UNIVERSITY OF CHICAGO AND AMERICAN BAR FOUNDATION

REFERENCES

Aiken, Linda R., "Detecting, Understanding, and Controlling for Cheating on Tests," Research in Higher Education, XXXII (1991), 725–736.
Angoff, William H., "The Development of Statistical Indices for Detecting Cheaters," Journal of the American Statistical Association, LXIX (1974), 44–49.
Baker, George, "Incentive Contracts and Performance Measurement," Journal of Political Economy, C (1992), 598–614.
Bellezza, Francis S., and Suzanne F. Bellezza, "Detection of Cheating on Multiple-Choice Tests by Using Error Similarity Analysis," Teaching of Psychology, XVI (1989), 151–155.
Carnoy, Martin, and Susanna Loeb, "Does External Accountability Affect Student Outcomes? A Cross-State Analysis," Working Paper, Stanford University School of Education, 2002.
Contreras, Russell, "Inquiry Uncovers Possible Cheating on GRE in Asia," Associated Press, August 7, 2002.
Cullen, Julie Berry, and Randall Reback, "Tinkering Toward Accolades: School Gaming under a Performance Accountability System," Working Paper, University of Michigan, 2002.
Deere, Donald, and Wayne Strayer, "Putting Schools to the Test: School Accountability, Incentives, and Behavior," Working Paper, Texas A&M University, 2002.
Di Tella, Rafael, and Ernesto Schargrodsky, "The Role of Wages and Auditing during a Crackdown on Corruption in the City of Buenos Aires," unpublished manuscript, Harvard Business School, 2001.
Duggan, Mark, and Steven Levitt, "Winning Isn't Everything: Corruption in Sumo Wrestling," American Economic Review, XCII (2002), 1594–1605.
Figlio, David N., "Testing, Crime and Punishment," Working Paper, University of Florida, 2002.
Figlio, David N., and Joshua Winicki, "Food for Thought: The Effects of School Accountability Plans on School Nutrition," Working Paper, University of Florida, 2001.
Figlio, David N., and Lawrence S. Getzler, "Accountability, Ability and Disability: Gaming the System," Working Paper, University of Florida, 2002.
Fisman, Ray, "Estimating the Value of Political Connections," American Economic Review, XCI (2001), 1095–1102.
Frary, Robert B., "Statistical Detection of Multiple-Choice Answer Copying: Review and Commentary," Applied Measurement in Education, VI (1993), 153–165.
Frary, Robert B., T. Nicholas Tideman, and Thomas Watts, "Indices of Cheating on Multiple-Choice Tests," Journal of Educational Statistics, II (1977), 235–256.
Glaeser, Edward, and Andrei Shleifer, "A Reason for Quantity Regulation," American Economic Review, XCI (2001), 431–435.
Grissmer, David W., Ann Flanagan, et al., "Improving Student Achievement: What NAEP Test Scores Tell Us," RAND Corporation, MR-924-EDU, 2000.
Hanushek, Eric A., and Margaret E. Raymond, "Improving Educational Quality: How Best to Evaluate Our Schools?" Conference paper, Federal Reserve Bank of Boston, 2002.
Hofkins, Diane, "Cheating 'Rife' in National Tests," New York Times Educational Supplement, June 16, 1995.
Holland, Paul W., "Assessing Unusual Agreement between the Incorrect Answers of Two Examinees Using the K-Index: Statistical Theory and Empirical Support," Technical Report No. 96-5, Educational Testing Service, 1996.
Holmstrom, Bengt, and Paul Milgrom, "Multitask Principal-Agent Analyses: Incentive Contracts, Asset Ownership, and Job Design," Journal of Law, Economics, and Organization, VII (1991), 24–51.
Jacob, Brian A., "Accountability, Incentives and Behavior: The Impact of High-Stakes Testing in the Chicago Public Schools," Working Paper No. 8968, National Bureau of Economic Research, 2002.
Jacob, Brian A., and Lars Lefgren, "The Impact of Teacher Training on Student Achievement: Quasi-Experimental Evidence from School Reform Efforts in Chicago," Journal of Human Resources, forthcoming.
Jacob, Brian A., and Steven Levitt, "Catching Cheating Teachers: The Results of an Unusual Experiment in Implementing Theory," Working Paper No. 9413, National Bureau of Economic Research, 2003.
Klein, Stephen P., Laura S. Hamilton, et al., "What Do Test Scores in Texas Tell Us?" RAND Corporation, 2000.
Kolker, Claudia, "Texas Offers Hard Lessons on School Accountability," Los Angeles Times, April 14, 1999.
Lindsay, Drew, "Whodunit? Officials Find Thousands of Erasures on Standardized Tests and Suspect Tampering," Education Week (1996), 25–29.
Loughran, Regina, and Thomas Comiskey, "Cheating the Children: Educator Misconduct on Standardized Tests," Report of the City of New York Special Commissioner of Investigation for the New York City School District, 1999.
Marcus, John, "Faking the Grade," Boston Magazine, February 2000.
May, Meredith, "State Fears Cheating by Teachers," San Francisco Chronicle, October 4, 2000.
Perlman, Carol L., "Results of a Citywide Testing Program Audit in Chicago," Paper for American Educational Research Association annual meeting, Chicago, IL, 1985.
Porter, Robert H., and J. Douglas Zona, "Detection of Bid Rigging in Procurement Auctions," Journal of Political Economy, CI (1993), 518–538.
Richards, Craig E., and Tian Ming Sheu, "The South Carolina School Incentive Reward Program: A Policy Analysis," Economics of Education Review, XI (1992), 71–86.
Rohlfs, Chris, "Estimating Classroom Learning Using the School Breakfast Program as an Instrument for Attendance and Class Size: Evidence from Chicago Public Schools," unpublished manuscript, University of Chicago, 2002.
Tysome, Tony, "Cheating Purge: Inspectors Out," New York Times Higher Education Supplement, August 19, 1994.
Wollack, James A., "A Nominal Response Model Approach for Detecting Answer Copying," Applied Psychological Measurement, XX (1997), 144–152.


Page 18: rotten apples: an investigation of the prevalence and predictors of ...

would simply be four times the results in the top panel In manyinstances however classrooms appear to cheat on multiple sub-jects Thus the prevalence rates range from 34ndash56 percent of allclassrooms20 Because of the necessarily indirect nature of ouridentication strategy we explore a range of supplementary testsdesigned to assess the validity of the estimates21

VA Simulation Results

Our prevalence estimates may be biased downward or up-ward depending on the extent to which our algorithm fails tosuccessfully identify all instances of cheating (ldquofalse negativesrdquo)versus the frequency with which we wrongly classify a teacher ascheating (ldquofalse positiverdquo)

One simple simulation exercise we undertook with respect topossible false positives involves randomly assigning students tohypothetical classrooms These synthetic classrooms thus consistof students who in actuality had no connection to one another Wethen analyze these hypothetical classrooms using the same algo-rithm applied to the actual data As one would hope no evidenceof cheating was found in the simulated classes Indeed the esti-mated prevalence of cheating was slightly negative in this simu-lation (ie classrooms with large test score increases in thecurrent year followed by big declines the next year were slightlyless likely to have unusual patterns of answer strings) Thusthere does not appear to be anything mechanical in our algorithmthat generates evidence of cheating

A second simulation involves artificially altering a classroom's answer strings in ways that we believe mimic teacher cheating or outstanding teaching.[22] We are then able to see how frequently we label these altered classes as "cheating," and how this varies with the extent and nature of the changes we make to a classroom's answers. We simulate two different types of teacher cheating. The first is a very naive version, in which a teacher starts cheating at the same question for a number of students and changes consecutive questions to the right answers for these students, creating a block of identical and correct responses. The second type of cheating is much more sophisticated: we randomly change answers from incorrect to correct for selected students in a class. The outcome of this simulation reveals that our methods are surprisingly weak in detecting moderate cheating, even in the naive case. For instance, when three consecutive answers are changed to correct for 25 percent of a class, we catch such behavior only 4 percent of the time. Even when six questions are changed for half of the class, we catch the cheating in less than 60 percent of the cases. Only when the cheating is very extreme (e.g., six answers changed for the entire class) are we very likely to catch it. The explanation for this poor performance is that classrooms with low test-score gains, even with the boost provided by this cheating, do not have large enough test score fluctuations to make them appear unusual, because there is so much inherent volatility in test scores. When the cheating takes the more sophisticated form described above, we are even less successful at catching low and moderate cheaters (2 percent and 34 percent, respectively, in the first two scenarios described above), but we almost always detect very extreme cheating.
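To make the naive manipulation concrete, a sketch follows; the function and its argument layout are illustrative assumptions, not the authors' code.

```python
import numpy as np

def naive_cheat(answers, key, start_item, n_items, frac_students, rng):
    """Overwrite a consecutive block of items with the correct answers for
    a random fraction of the class, creating the block of identical and
    correct responses described above.

    answers: (n_students, n_test_items) array of response codes
    key:     (n_test_items,) array of correct answers
    """
    altered = answers.copy()
    n_students = altered.shape[0]
    chosen = rng.choice(n_students, size=int(frac_students * n_students),
                        replace=False)
    cols = np.arange(start_item, start_item + n_items)
    altered[np.ix_(chosen, cols)] = key[cols]
    return altered

# naive_cheat(a, key, start_item=20, n_items=6, frac_students=0.5,
#             rng=np.random.default_rng(1)) mimics the "six questions
# changed for half of the class" scenario.
```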

From a political or equity perspective, however, we may be even more concerned with false positives, specifically the risk of accusing highly effective teachers of cheating. Hence we also simulate the impact of a good teacher changing certain answers for certain students from incorrect to correct responses. Our manipulation in this case differs from the cheating simulations above in two ways: (1) we allow some of the gains to be preserved in the following year, and (2) we alter questions such that students will not get random answers correct, but rather will tend to show the greatest improvement on the easiest questions that the students were getting wrong. These simulation results suggest that only rarely are good teachers mistakenly labeled as cheaters. For instance, a teacher whose prowess allows him or her to raise test scores by six questions for half the students in the class (more than a 0.5 grade equivalent increase on average for the class) is labeled a cheater only 2 percent of the time. In contrast, a naive cheater with the same test score gains is correctly identified as a cheater in more than 50 percent of cases.

22. See Jacob and Levitt [2003] for the precise details of this simulation exercise.

In another test for false positives, we explored whether we might be mistaking emphasis on certain subject material for cheating. For example, if a math teacher spends several months on fractions with a particular class, one would expect the class to do particularly well on all of the math questions relating to fractions, and perhaps worse than average on other math questions. One might imagine a similar scenario in reading if, for example, a teacher creates a lengthy unit on the Underground Railroad, which later happens to appear in a passage on the reading comprehension exam. To examine this, we analyze the nature of the across-question correlations in cheating versus noncheating classrooms. We find that the most highly correlated questions in cheating classrooms are no more likely to measure the same skill (e.g., fractions, geometry) than in noncheating classrooms. The implication of these results is that the prevalence estimates presented above are likely to substantially understate the true extent of cheating (there is little evidence of false positives), but we frequently miss moderate cases of cheating.

V.B. The Correlation of Cheating Indicators across Subjects, Classrooms, and Years

If what we are detecting is truly cheating, then one would expect that a teacher who cheats on one part of the test would be more likely to cheat on other parts of the test. Also, a teacher who cheats one year would be more likely to cheat the following year. Finally, to the extent that cheating is either condoned by the principal or carried out by the test coordinator, one would expect to find multiple classes in a school cheating in any given year, and perhaps even that cheating in a school one year predicts cheating in future years. If what we are detecting is not cheating, then one would not necessarily expect to find strong correlation in our cheating indicators across exams for a specific classroom, across classrooms, or across years.


Table III reports regression results testing these predictions. The dependent variable is an indicator for whether we believe a classroom is likely to be cheating on a particular subject test, using our most stringent definition (above the ninety-fifth percentile on both cheating indicators).

TABLE III
PATTERNS OF CHEATING WITHIN CLASSROOMS AND SCHOOLS

Dependent variable = Class suspected of cheating (mean of the dependent variable = 0.011)
Columns (1) and (2): full sample. Columns (3) and (4): sample of classes and schools that existed in the prior year.

Independent variables                                          (1)            (2)            (3)            (4)
Classroom cheated on exactly one other subject this year       0.105 (0.008)  0.103 (0.008)  0.101 (0.009)  0.101 (0.009)
Classroom cheated on exactly two other subjects this year      0.289 (0.027)  0.285 (0.027)  0.243 (0.031)  0.243 (0.031)
Classroom cheated on all three other subjects this year        0.627 (0.051)  0.622 (0.051)  0.595 (0.054)  0.595 (0.054)
Cheating rate among all other classes in the school
  this year on this subject                                    —              0.166 (0.030)  0.134 (0.027)  0.129 (0.027)
Cheating rate among all other classes in the school
  this year on other subjects                                  —              0.023 (0.024)  0.059 (0.026)  0.045 (0.029)
Cheating in this classroom in this subject last year           —              —              0.096 (0.012)  0.091 (0.012)
Number of other subjects this classroom cheated on last year   —              —              0.023 (0.004)  0.018 (0.004)
Cheating in this classroom ever in the past                    —              —              —              0.006 (0.002)
Cheating rate among other classrooms in this school
  in past years                                                —              —              —              0.090 (0.040)
Fixed effects for grade × subject × year                       Yes            Yes            Yes            Yes
R²                                                             0.090          0.093          0.109          0.109
Number of observations                                         165,578        165,578        94,182         94,170

The dependent variable is an indicator for whether a classroom is above the 95th percentile on both our suspicious strings and unusual test score measures of cheating on a particular subject test. Estimation is done using a linear probability model. The unit of observation is classroom × grade × year × subject. For columns that include measures of cheating in prior years, observations where that classroom or school does not appear in the data in the prior year are excluded. Standard errors are clustered at the school level to take into account correlations across classrooms as well as serial correlation.

The baseline probability of qualifying as a cheater for this cutoff is 1.1 percent. To fully appreciate the enormity of the effects implied by the table, it is important to keep this very low baseline in mind. We report estimates from linear probability models (probits yield similar marginal effects), with standard errors clustered at the school level.
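To fix ideas, here is a minimal sketch of this estimation strategy in statsmodels, run on toy data; the column names and the data-generating process are invented for illustration, not taken from the paper.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Toy data shaped like the paper's unit of observation:
# one row per classroom x grade x year x subject.
rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "cheat": rng.binomial(1, 0.011, n),       # indicator, baseline ~1.1 percent
    "n_other_cheats": rng.integers(0, 4, n),  # other subjects flagged this year
    "grade": rng.integers(3, 8, n),
    "subject": rng.integers(0, 4, n),
    "year": rng.integers(1993, 2001, n),
    "school_id": rng.integers(0, 100, n),
})

# Linear probability model with grade x subject x year fixed effects and
# standard errors clustered at the school level, mirroring the
# specification described for Table III.
res = smf.ols(
    "cheat ~ C(n_other_cheats) + C(grade):C(subject):C(year)", data=df
).fit(cov_type="cluster", cov_kwds={"groups": df["school_id"]})
print(res.params.filter(like="n_other_cheats"))
```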

Column 1 of Table III shows that cheating on other tests in the same year is an extremely powerful predictor of cheating in a different subject. If a classroom cheats on exactly one other subject test, the predicted probability of cheating on this test increases by over ten percentage points. Since the baseline cheating rates are only 1.1 percent, classrooms cheating on exactly one other test are ten times more likely to have cheated on this subject than are classrooms that did not cheat on any of the other subjects (which is the omitted category). Classrooms that cheat on two other subjects are almost 30 times more likely to cheat on this test, relative to those not cheating on other tests. If a class cheats on all three other subjects, it is 50 times more likely to also cheat on this test.

There also is evidence of correlation in cheating within schools. A ten-percentage-point increase in cheating classrooms in a school (excluding the classroom in question) on the same subject test raises the likelihood this class cheats by roughly 0.16 percentage points. This potentially suggests some role for centralized cheating by a school counselor, test coordinator, or the principal, rather than by teachers operating independently. There is little evidence that cheating rates within the school on other subject tests affect cheating on this test.

When making comparisons across years (columns 3 and 4), it is important to note that we do not actually have teacher identifiers. We do, however, know what classroom a student is assigned to. Thus, we can only compare the correlation between past and current cheating in a given classroom. To the extent that teacher turnover occurs or teachers switch classrooms, this proxy will be contaminated. Even given this important limitation, cheating in the classroom last year predicts cheating this year. In column 3, for example, we see that classrooms that cheated in the same subject last year are 9.6 percentage points more likely to cheat this year, even after we control for cheating on other subjects in the same year and cheating in other classes in the school. Column 4 shows that prior cheating in the school strongly predicts the likelihood that a classroom will cheat this year.


V.C. Evidence from Retesting under Controlled Circumstances

Perhaps the most compelling evidence for our methods comes from the results of a unique policy intervention. In spring 2002 the Chicago public schools provided us the opportunity to retest over 100 classrooms under controlled circumstances that precluded cheating, a few weeks after the initial ITBS exam was administered.[23] Based on suspicious answer strings and unusual test score gains on the initial spring 2002 exam, we identified a set of classrooms most likely to have cheated. In addition, we chose a second group of classrooms that had suspicious answer strings but did not have large test score gains, a pattern that is consistent with a bad teacher who attempts to mask a poor teaching performance by cheating. We also retested two sets of classrooms as control groups. The first control group consisted of classes with large test score gains but no evidence of cheating in their answer strings, a potential indication of effective teaching. These classes would be expected to maintain their gains when retested, subject to possible mean reversion. The second control group consisted of randomly selected classrooms.

The results of the retests are reported in Table IV. Columns 1–3 correspond to reading; columns 4–6 are for math. For each subject we report the test score gain (in standard score units) for three periods: (i) from spring 2001 to the initial spring 2002 test, (ii) from the initial spring 2002 test to the retest a few weeks later, and (iii) from spring 2001 to the spring 2002 retest. For purposes of comparison, we also report the average gain for all classrooms in the system for the first of those measures (the other two measures are unavailable unless a classroom was retested).

Classrooms prospectively identified as "most likely to have cheated" experienced gains on the initial spring 2002 test that were nearly twice as large as the typical CPS classroom (see columns 1 and 4). On the retest, however, those excess gains completely disappeared: the gains between spring 2001 and the spring 2002 retest were close to the systemwide average. Those classrooms prospectively identified as "bad teachers likely to have cheated" also experienced large score declines on the retest (−8.8 and −10.5 standard score points, or roughly two-thirds of a grade equivalent). Indeed, comparing the spring 2001 results with the spring 2002 retest, children in these classrooms had gains only half as large as the typical yearly CPS gain.

23. Full details of this policy experiment are reported in Jacob and Levitt [2003]. The results of the retests were used by CPS to initiate disciplinary action against a number of teachers, principals, and staff.


TABLE IV
RESULTS OF RETESTS: COMPARISON OF RESULTS FOR SPRING 2002 ITBS AND AUDIT TEST

                                           Reading gains between            Math gains between
                                         Spring    Spring    Spring     Spring    Spring    Spring
                                         2001 &    2001 &    2002 &     2001 &    2001 &    2002 &
                                         spring    2002      2002       spring    2002      2002
Category of classroom                    2002      retest    retest     2002      retest    retest

All classrooms in the Chicago
  public schools                          14.3       —         —         16.9       —         —
Classrooms prospectively identified
  as most likely cheaters
  (N = 36 on math, N = 39 on reading)     28.8      12.6     −16.2       30.0      19.3     −10.7
Classrooms prospectively identified
  as bad teachers suspected of cheating
  (N = 16 on math, N = 20 on reading)     16.6       7.8      −8.8       17.3       6.8     −10.5
Classrooms prospectively identified
  as good teachers who did not cheat
  (N = 17 on math, N = 17 on reading)     20.6      21.1       0.5       28.8      25.5      −3.3
Randomly selected classrooms
  (N = 24 overall, but only one test
  per classroom)                          14.5      12.2      −2.3       14.5      12.2      −2.3

Results in this table compare test scores at three points in time (spring 2001, spring 2002, and a retest administered under controlled conditions a few weeks after the initial spring 2002 test). The unit of observation is a classroom-subject test pair. Only students taking all three tests are included in the calculations. Because of limited data, math and reading results for the randomly selected classrooms are combined. Only the spring 2001 to spring 2002 gain (the first column of each panel) is available for all CPS classrooms, since audits were performed only on a subset of classrooms. All entries in the table are in standard score units.

In stark contrast, classrooms prospectively identified as having "good teachers" (i.e., classrooms with large gains but not suspicious answer strings) actually scored even higher on the reading retest than they did on the initial test. The math scores fell slightly on retesting, but these classrooms continued to experience extremely large gains. The randomly selected classrooms also maintained almost all of their gains when retested, as would be expected. In summary, our methods proved to be extremely effective, both in identifying likely cheaters and in correctly classifying teachers who achieved large test score gains in a legitimate manner.

VI. DOES TEACHER CHEATING RESPOND TO INCENTIVES?

From the perspective of economics, perhaps the most interesting question related to teacher cheating is the degree to which it responds to incentives. As noted in the Introduction, there were two major changes in the incentives faced by teachers and students over our sample period. Prior to 1996, ITBS scores were primarily used to provide teachers and parents with a sense of how a child was progressing academically. Beginning in 1996, with the appointment of Paul Vallas as CEO of Schools, the CPS launched an initiative designed to hold students and teachers accountable for student learning.

The reform had two main elements. The first involved putting schools "on probation" if less than 15 percent of students scored at or above national norms on the ITBS reading exams. The CPS did not use math performance to determine probation status. Probation schools that do not exhibit sufficient improvement may be reconstituted, a procedure that involves closing the school and dismissing or reassigning all of the school staff.[24] It is clear from our discussions with teachers and administrators that being on probation is viewed as an extremely undesirable circumstance. The second piece of the accountability reform was an end to social promotion, the practice of passing students to the next grade regardless of their academic skills or school performance. Under the new policy, students in third, sixth, and eighth grade must meet minimum standards on the ITBS in both reading and mathematics in order to be promoted to the next grade. The promotion standards were implemented in spring 1997 for third and sixth graders. Promotion decisions are based solely on scores in reading comprehension and mathematics.

24. For a more detailed analysis of the probation policy, see Jacob [2002] and Jacob and Lefgren [forthcoming].

Table V presents OLS estimates of the relationship between teacher cheating and a variety of classroom and school characteristics.[25] The unit of observation is a classroom × subject × grade × year. The dependent variable is an indicator of whether we suspect the classroom cheated, using our most stringent definition: a classroom is designated a cheater if both its test score fluctuations and the suspiciousness of its answer strings are above the ninety-fifth percentile for that grade, subject, and year.[26]
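A sketch of how such an indicator can be constructed with pandas, assuming a DataFrame with one row per classroom × grade × year × subject and hypothetical columns score_fluct and string_susp holding the two underlying measures:

```python
import pandas as pd

def flag_cheaters(df: pd.DataFrame, cutoff: float = 0.95) -> pd.Series:
    """Flag classes above the cutoff percentile on BOTH measures,
    with percentiles computed within each grade x subject x year cell."""
    grp = df.groupby(["grade", "subject", "year"])
    fluct_pct = grp["score_fluct"].rank(pct=True)   # test score fluctuation
    string_pct = grp["string_susp"].rank(pct=True)  # answer string suspiciousness
    return (fluct_pct > cutoff) & (string_pct > cutoff)

# df["cheat"] = flag_cheaters(df)  # mean should come out near 0.011 here
```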

In column (1) the policy changes are restricted to have a constant impact across all classrooms. The introduction of the social promotion and probation policies is positively correlated with the likelihood of classroom cheating, although the point estimates are not statistically significant at conventional levels. Much more interesting results emerge when we interact the policy changes with the previous year's test scores for the classroom. For both probation and social promotion, cheating rates in the lowest performing classrooms prove to be quite responsive to the change in incentives. In column (2) we see that a classroom one standard deviation below the mean increased its likelihood of cheating by 0.43 percentage points in response to the school probation policy, and roughly 0.65 percentage points due to the ending of social promotion. Given a baseline rate of cheating of 1.1 percent, these effects are substantial. The magnitude of these changes is particularly large considering that no elementary school on probation was actually reconstituted during this period, and that the social promotion policy has a direct impact on students but no direct ramifications for teacher pay or rewards. A classroom one standard deviation above the mean does not see any significant change in cheating in response to these two policies. Such classes are very unlikely to be located in schools at risk for being put on probation, and also are likely to have few students at risk for being retained. These findings are robust to adding a number of classroom and school characteristics (column (3)), as well as school × year fixed effects (column (4)).

25. Logit and probit models evaluated at the mean yield comparable results, so the estimates from a linear probability model are presented for ease of interpretation.

26. The results are not sensitive to the cheating cutoff used. Note that this measure may include error due to both false positives and negatives. Since the measurement error is in the dependent variable, the primary impact will likely be to simply decrease the precision of our estimates.


TABLE V
OLS ESTIMATES OF THE RELATIONSHIP BETWEEN CHEATING AND CLASSROOM CHARACTERISTICS

Dependent variable = Indicator of classroom cheating

Independent variables                                    (1)                (2)                (3)                (4)
Social promotion policy                                  0.0011 (0.0013)    0.0011 (0.0013)    0.0015 (0.0013)    0.0023 (0.0009)
School probation policy                                  0.0020 (0.0014)    0.0019 (0.0014)    0.0021 (0.0014)    0.0029 (0.0013)
Prior classroom achievement                             −0.0047 (0.0005)   −0.0028 (0.0005)   −0.0016 (0.0007)   −0.0028 (0.0007)
Social promotion × classroom achievement                 —                 −0.0049 (0.0014)   −0.0051 (0.0014)   −0.0046 (0.0012)
School probation × classroom achievement                 —                 −0.0070 (0.0013)   −0.0070 (0.0013)   −0.0064 (0.0013)
Mixed grade classroom                                   −0.0084 (0.0007)   −0.0085 (0.0007)   −0.0089 (0.0008)   −0.0089 (0.0012)
Percent of students included in official reporting       0.0252 (0.0031)    0.0249 (0.0031)    0.0141 (0.0037)    0.0131 (0.0037)
Teacher administers exam to own students                 0.0067 (0.0015)    0.0067 (0.0015)    0.0066 (0.0015)    0.0061 (0.0011)
Test form offered for the first time^a                  −0.0007 (0.0011)   −0.0007 (0.0011)   −0.0011 (0.0010)    —
Average quality of teachers' undergraduate institution   —                  —                 −0.0026 (0.0007)    —
Percent of teachers who have worked at the school
  less than 3 years                                      —                  —                 −0.0045 (0.0031)    —
Percent of teachers under 30 years of age                —                  —                  0.0156 (0.0065)    —
Percent of students in the school meeting national
  norms in reading last year                             —                  —                  0.0001 (0.0000)    —
Percent free lunch in school                             —                  —                  0.0001 (0.0000)    —
Predominantly Black school                               —                  —                  0.0068 (0.0019)    —
Predominantly Hispanic school                            —                  —                 −0.0009 (0.0016)    —
School × Year fixed effects                              No                 No                 No                 Yes
Number of observations                                   163,474            163,474            163,474            163,474

The unit of observation is classroom × grade × year × subject, and the sample includes eight years (1993 to 2000), four subjects (reading comprehension and three math sections), and five grades (three to seven). The dependent variable is the cheating indicator derived using the ninety-fifth percentile cutoff. Robust standard errors clustered by school × year are shown in parentheses. Other variables included in the regressions in columns (1) and (2) include a linear time trend, cubic terms for the number of students, a linear grade variable, and fixed effects for subjects. The regression shown in column (3) also includes the following variables: indicators of the percent of students in the classroom who were Black, Hispanic, male, receiving free lunch, old for grade, in a special education program, living in foster care, and living with a nonparental relative; indicators of school size, mobility rate, and attendance rate; and the percent of teachers in the school who had a master's or doctoral degree, lived in Chicago, and were education majors.

a. Test forms vary only by year, so this variable will drop out of the analysis when school × year fixed effects are included.


Cheating also appears to be responsive to other costs and benefits. Classrooms that tested poorly last year are much more likely to cheat. For example, a classroom with average student prior achievement one classroom standard deviation below the mean is 23 percent more likely to cheat. Classrooms with students in multiple grades are 65 percent less likely to cheat than classrooms where all students are in the same grade. This is consistent with the fact that it is likely more difficult for teachers in such classrooms to cheat, since they must administer two different test forms to students, which will necessarily have different correct answers. Moreover, classes with a higher proportion of students who are included in the official test reporting are more likely to cheat: a 10 percentage point increase in the proportion of students in a class whose test scores "count" will increase the likelihood of cheating by roughly 20 percent.[27]

Teachers who administer the exam to their own students are 0.67 percentage points (approximately 50 percent) more likely to cheat.[28]

A few other notable results emerge. First, there is no statistically significant impact on cheating of reusing a test form that has been administered in a previous year. That finding is of interest because it suggests that teachers taking old exams and teaching the precise questions to students is not an important component of what we are detecting as cheating (although anecdotal evidence suggests this practice exists). Classrooms in schools with lower achievement, higher poverty rates, and more Black students are more likely to cheat. Classrooms in schools with teachers who graduated from more prestigious undergraduate institutions are less likely to cheat, whereas classrooms in schools with younger teachers are more likely to cheat.

27. Consistent with this finding, within cheating classrooms teachers are less likely to alter the answer sheets for students who are excluded from test reporting. Teachers also appear to cheat less frequently for boys and for older students.

28. In an earlier version of the paper [Jacob and Levitt 2002], we examine for which students teachers tend to cheat. We find that teachers are most likely to cheat for students in the second quartile of the national ability distribution (twenty-fifth to fiftieth percentile) in the previous year, consistent with the incentives under the probation policy.


VII. CONCLUSION

This paper develops an algorithm for determining the prevalence of teacher cheating on standardized tests and applies the results to data from the Chicago public schools. Our methods reveal thousands of instances of classroom cheating, representing 4–5 percent of the classrooms each year. Moreover, we find that teacher cheating appears quite responsive to relatively minor changes in incentives.

The obvious benefits of high-stakes tests as a means of providing incentives must therefore be weighed against the possible distortions that these measures induce. Explicit cheating is merely one form of distortion. Indeed, the kind of cheating we focus on is one that could be greatly reduced at a relatively low cost through the implementation of proper safeguards, such as is done by Educational Testing Service on the SAT or GRE exams.[29]

Even if this particular form of cheating were eliminated, however, there are a virtually infinite number of dimensions along which strategic school personnel can act to game the current system. For instance, Figlio and Winicki [2001] and Rohlfs [2002] document that schools attempt to increase the caloric intake of students on test days, and disruptive students are especially likely to be suspended from school during official testing periods [Figlio 2002]. It may be prohibitively costly, or even impossible, to completely safeguard the system.

Finally, this paper fits into a small but growing body of research focused on identifying corrupt or illicit behavior on the part of economic actors (see Porter and Zona [1993], Fisman [2001], Di Tella and Schargrodsky [2001], and Duggan and Levitt [2002]). Because individuals engaged in such behavior actively attempt to hide their trails, the intellectual exercise associated with uncovering their misdeeds differs substantially from the typical economic application, in which the researcher starts with a well-defined measure of the outcome variable (e.g., earnings, economic growth, profits) and then attempts to uncover the determinants of these outcomes. In the case of corruption, there is typically no clear outcome variable, making it necessary for the researcher to employ nonstandard approaches in generating such a measure.

29. Although even tests administered by the Educational Testing Service are not immune to cheating, as evidenced by recent reports that many of the GRE scores from China are fraudulent [Contreras 2002].

APPENDIX: THE CONSTRUCTION OF SUSPICIOUS STRING MEASURES

This appendix describes in greater detail how we construct each of our measures of unexpected or suspicious responses.

The first measure focuses on the most unlikely block of identical answers given on consecutive questions. Using past test scores, future test scores, and background characteristics, we predict the likelihood that each student will give each answer on each question. For each item a student has four choices (A, B, C, or D), only one of which is correct. We estimate a multinomial logit for each item on the exam in order to predict how students will respond to each question. We estimate the following model for each item, using information from other students in that year, grade, and subject:

(1)    \Pr(Y_{isc} = j) = \frac{e^{\beta_j X_s}}{\sum_{j=1}^{J} e^{\beta_j X_s}},

where Y_{isc} indicates the response of student s in class c on item i, the number of possible responses (J) is four, and X_s is a vector that includes measures of prior and future student achievement in math and reading, as well as demographic variables (such as race, gender, and free lunch status) for student s. Thus, a student's predicted probability of choosing a particular response is identified by the likelihood of other students (in the same year, grade, and subject) with similar background characteristics choosing that response.
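As one concrete (and assumed) implementation of this per-item estimation, scikit-learn's logistic regression fits the softmax model above; the paper does not specify its software, and the array names are ours.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_item_model(X, y_item):
    """One multinomial logit per test item.

    X:      (n_students, n_covariates) prior/future scores and demographics
    y_item: (n_students,) response code 0-3 (A-D) actually chosen on the item
    """
    return LogisticRegression(max_iter=1000).fit(X, y_item)

def response_probabilities(model, X):
    """Predicted probability of each of the four responses, per student."""
    return model.predict_proba(X)  # shape (n_students, 4)
```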

Notice that by including future as well as prior test scores in the model, we decrease the likelihood that students with unusually good teachers will be identified as cheaters, since these students will likely retain some of the knowledge learned in the base year and thus have higher future test scores. Also note that by estimating the probability of selecting each possible response, rather than simply estimating the probability of choosing the correct response, we take advantage of any additional information that is provided by particular response patterns in a classroom.

Using the estimates from this model, we calculate the predicted probability that each student would answer each item in the way that he or she in fact did:

(2)    p_{isc} = \frac{e^{\beta_k X_s}}{\sum_{j=1}^{J} e^{\beta_j X_s}}, \quad k = \text{the response actually chosen by student } s \text{ on item } i.

This provides us with one measure per student per item. Taking the product over items within student, we calculate the probability that a student would have answered a string of consecutive questions from item m to item n as he or she did:

(3)    p_{sc}^{mn} = \prod_{i=m}^{n} p_{isc}.

We then take the product across all students in the classroom who had identical responses in the string. If we define z as a student, S_{zc}^{mn} as the string of responses for student z from item m to item n, and S_{sc}^{mn} as the string for student s, then we can express the product as

(4)    \tilde{p}_{sc}^{mn} = \prod_{s \in \{z : S_{zc}^{mn} = S_{sc}^{mn}\}} p_{sc}^{mn}.

Note that if there are n_s students in class c and each student has a unique set of responses to these particular items, then \tilde{p}_{sc}^{mn} collapses to p_{sc}^{mn} for each student, and there will be n_s distinct values within the class. On the other extreme, if all of the students in class c have identical responses, then there is only one distinct value of \tilde{p}_{sc}^{mn}. We repeat this calculation for all possible consecutive strings of length three to seven, that is, for all S^{mn} such that 3 \le n - m \le 7. To create our first indicator of suspicious string patterns, we take the minimum of the predicted block probability for each classroom:

MEASURE 1:    M1_c = \min_s (\tilde{p}_{sc}^{mn}).

This measure captures the least likely block of identical answers given on consecutive questions in the classroom.
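Equations (2)-(4) and Measure 1 combine into a short routine. A sketch, assuming probs holds each student's predicted probability of the answer actually given (equation (2)) and reading "length three to seven" as windows of three to seven items:

```python
import numpy as np
from collections import defaultdict

def measure_1(probs, answers, min_len=3, max_len=7):
    """Probability of the least likely block of identical consecutive
    answers in one classroom (Measure 1).

    probs:   (n_students, n_items) predicted probability of each
             student's actual answer (equation (2))
    answers: (n_students, n_items) response codes
    """
    n_students, n_items = probs.shape
    best = 1.0
    for length in range(min_len, max_len + 1):
        for m in range(n_items - length + 1):
            window = slice(m, m + length)
            p_str = probs[:, window].prod(axis=1)  # equation (3)
            groups = defaultdict(list)             # students with identical strings
            for s in range(n_students):
                groups[tuple(answers[s, window])].append(p_str[s])
            for member_probs in groups.values():   # equation (4): within-group product
                best = min(best, float(np.prod(member_probs)))
    return best
```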

The second measure of suspicious answer strings is intended to capture more general patterns of similarity in student responses. To construct this measure, we first calculate the residuals for each of the possible choices a student could have made for each item:

(5)    e_{jisc} = 0 - \frac{e^{\beta_j X_s}}{\sum_{j=1}^{J} e^{\beta_j X_s}} \quad \text{if } j \ne k, \qquad e_{jisc} = 1 - \frac{e^{\beta_j X_s}}{\sum_{j=1}^{J} e^{\beta_j X_s}} \quad \text{if } j = k,

where e_{jisc} is the residual for response j on item i by student s in classroom c. We thus have four separate residuals per student per item.

To create a classroom-level measure of the response to item i, we need to combine the information for each student. First, we sum the residuals for each response across students within a classroom:

(6)    \bar{e}_{jic} = \sum_s e_{jisc}.

If there is no within-class correlation in the way that students responded to a particular item, this term should be approximately zero. Second, we sum across the four possible responses for each item within classrooms. At the same time, we square each of the component residual measures to accentuate outliers, and divide by the number of students in the class (n_{sc}) to normalize by class size:

(7)    v_{ic} = \frac{\sum_j \bar{e}_{jic}^2}{n_{sc}}.

The statistic v_{ic} captures something like the variance of student responses on item i within classroom c. Notice that we choose to first sum the residuals of each response across students, and then sum the classroom-level measures for each response, rather than summing across responses within student initially. We do this in order to emphasize the classroom-level tendencies in response patterns.

Our second measure of suspicious strings is simply the classroom average (across items) of this variance term across all test items:

MEASURE 2:    M2_c = \bar{v}_c = \sum_i v_{ic} / n_i, where n_i is the number of items on the exam.

Our third measure is simply the variance (as opposed to the mean) in the degree of correlation across questions within a classroom:

MEASURE 3:    M3_c = \sigma_{v_c}^2 = \sum_i (v_{ic} - \bar{v}_c)^2 / n_i.
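Measures 2 and 3 translate directly into array operations. A sketch, assuming resid stacks the four residuals of equation (5) for every student and item in one classroom:

```python
import numpy as np

def measures_2_and_3(resid):
    """Classroom mean (Measure 2) and variance (Measure 3) of the
    item-level statistic v_ic.

    resid: (n_students, n_items, 4) residuals e_jisc from equation (5).
    """
    n_students = resid.shape[0]
    e_bar = resid.sum(axis=0)                   # equation (6): sum over students
    v = (e_bar ** 2).sum(axis=1) / n_students   # equation (7): v_ic per item
    return v.mean(), v.var()                    # Measures 2 and 3
```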

Our final indicator focuses on the extent to which a student's response pattern was different from other students' with the same aggregate score that year. Let q_{isc} equal one if student s in classroom c answered item i correctly, and zero otherwise. Let A_s equal the aggregate score of student s on the exam. We then determine what fraction of students at each aggregate score level answered each item correctly. If we let n_{sA} equal the number of students with an aggregate score of A, then this fraction \bar{q}_i^A can be expressed as

(8)    \bar{q}_i^A = \frac{\sum_{s \in \{z : A_z = A_s\}} q_{isc}}{n_{sA}}.

We then calculate a measure of how much the response pattern of student s differed from the response pattern of other students with the same aggregate score. We do so by subtracting a student's answer on item i from the mean response of all students with aggregate score A, squaring these deviations, and then summing across all items on the exam:

(9)    Z_{sc} = \sum_i (q_{isc} - \bar{q}_i^A)^2.

We then subtract out the mean deviation for all students with the same aggregate score, \bar{Z}_A, and sum over the students within each classroom to obtain our final indicator:

MEASURE 4:    M4_c = \sum_s (Z_{sc} - \bar{Z}_A).
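A sketch of Measure 4 for one grade-year pool of students, with illustrative array names:

```python
import numpy as np

def measure_4(correct, class_id):
    """Within-class sum of score-matched answer-pattern deviations.

    correct:  (n_students, n_items) 0/1 array, student s correct on item i
    class_id: (n_students,) classroom identifier for each student
    """
    agg = correct.sum(axis=1)        # aggregate score A_s
    Z = np.empty(len(agg))
    Z_bar = np.empty(len(agg))
    for a in np.unique(agg):
        peers = agg == a
        q_bar = correct[peers].mean(axis=0)                 # equation (8)
        dev = ((correct[peers] - q_bar) ** 2).sum(axis=1)   # equation (9)
        Z[peers] = dev
        Z_bar[peers] = dev.mean()    # mean deviation at this score level
    # Sum the de-meaned deviations within each classroom (Measure 4).
    return {c: float((Z - Z_bar)[class_id == c].sum())
            for c in np.unique(class_id)}
```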

KENNEDY SCHOOL OF GOVERNMENT, HARVARD UNIVERSITY

UNIVERSITY OF CHICAGO AND AMERICAN BAR FOUNDATION

REFERENCES

Aiken, Linda R., "Detecting, Understanding, and Controlling for Cheating on Tests," Research in Higher Education, XXXII (1991), 725–736.

Angoff, William H., "The Development of Statistical Indices for Detecting Cheaters," Journal of the American Statistical Association, LXIX (1974), 44–49.

Baker, George, "Incentive Contracts and Performance Measurement," Journal of Political Economy, C (1992), 598–614.

Bellezza, Francis S., and Suzanne F. Bellezza, "Detection of Cheating on Multiple-Choice Tests by Using Error Similarity Analysis," Teaching of Psychology, XVI (1989), 151–155.

Carnoy, Martin, and Susanna Loeb, "Does External Accountability Affect Student Outcomes? A Cross-State Analysis," Working Paper, Stanford University School of Education, 2002.

Contreras, Russell, "Inquiry Uncovers Possible Cheating on GRE in Asia," Associated Press, August 7, 2002.

Cullen, Julie Berry, and Randall Reback, "Tinkering Toward Accolades: School Gaming under a Performance Accountability System," Working Paper, University of Michigan, 2002.

Deere, Donald, and Wayne Strayer, "Putting Schools to the Test: School Accountability, Incentives, and Behavior," Working Paper, Texas A&M University, 2002.

Di Tella, Rafael, and Ernesto Schargrodsky, "The Role of Wages and Auditing during a Crackdown on Corruption in the City of Buenos Aires," unpublished manuscript, Harvard Business School, 2001.

Duggan, Mark, and Steven Levitt, "Winning Isn't Everything: Corruption in Sumo Wrestling," American Economic Review, XCII (2002), 1594–1605.

Figlio, David N., "Testing, Crime and Punishment," Working Paper, University of Florida, 2002.

Figlio, David N., and Joshua Winicki, "Food for Thought: The Effects of School Accountability Plans on School Nutrition," Working Paper, University of Florida, 2001.

Figlio, David N., and Lawrence S. Getzler, "Accountability, Ability and Disability: Gaming the System," Working Paper, University of Florida, 2002.

Fisman, Ray, "Estimating the Value of Political Connections," American Economic Review, XCI (2001), 1095–1102.

Frary, Robert B., "Statistical Detection of Multiple-Choice Answer Copying: Review and Commentary," Applied Measurement in Education, VI (1993), 153–165.

Frary, Robert B., T. Nicholas Tideman, and Thomas Watts, "Indices of Cheating on Multiple-Choice Tests," Journal of Educational Statistics, II (1977), 235–256.

Glaeser, Edward, and Andrei Shleifer, "A Reason for Quantity Regulation," American Economic Review, XCI (2001), 431–435.

Grissmer, David W., Ann Flanagan, et al., "Improving Student Achievement: What NAEP Test Scores Tell Us," RAND Corporation, MR-924-EDU, 2000.

Hanushek, Eric A., and Margaret E. Raymond, "Improving Educational Quality: How Best to Evaluate Our Schools?" Conference paper, Federal Reserve Bank of Boston, 2002.

Hofkins, Diane, "Cheating 'Rife' in National Tests," New York Times Educational Supplement, June 16, 1995.

Holland, Paul W., "Assessing Unusual Agreement between the Incorrect Answers of Two Examinees Using the K-Index: Statistical Theory and Empirical Support," Technical Report No. 96-5, Educational Testing Service, 1996.

Holmstrom, Bengt, and Paul Milgrom, "Multitask Principal-Agent Analyses: Incentive Contracts, Asset Ownership, and Job Design," Journal of Law, Economics, and Organization, VII (1991), 24–51.

Jacob, Brian A., "Accountability, Incentives and Behavior: The Impact of High-Stakes Testing in the Chicago Public Schools," Working Paper No. 8968, National Bureau of Economic Research, 2002.

Jacob, Brian A., and Lars Lefgren, "The Impact of Teacher Training on Student Achievement: Quasi-Experimental Evidence from School Reform Efforts in Chicago," Journal of Human Resources, forthcoming.

Jacob, Brian A., and Steven Levitt, "Catching Cheating Teachers: The Results of an Unusual Experiment in Implementing Theory," Working Paper No. 9413, National Bureau of Economic Research, 2003.

Klein, Stephen P., Laura S. Hamilton, et al., "What Do Test Scores in Texas Tell Us?" RAND Corporation, 2000.

Kolker, Claudia, "Texas Offers Hard Lessons on School Accountability," Los Angeles Times, April 14, 1999.

Lindsay, Drew, "Whodunit? Officials Find Thousands of Erasures on Standardized Tests and Suspect Tampering," Education Week (1996), 25–29.

Loughran, Regina, and Thomas Comiskey, "Cheating the Children: Educator Misconduct on Standardized Tests," Report of the City of New York Special Commissioner of Investigation for the New York City School District, 1999.

Marcus, John, "Faking the Grade," Boston Magazine, February 2000.

May, Meredith, "State Fears Cheating by Teachers," San Francisco Chronicle, October 4, 2000.

Perlman, Carol L., "Results of a Citywide Testing Program Audit in Chicago," Paper for American Educational Research Association annual meeting, Chicago, IL, 1985.

Porter, Robert H., and J. Douglas Zona, "Detection of Bid Rigging in Procurement Auctions," Journal of Political Economy, CI (1993), 518–538.

Richards, Craig E., and Tian Ming Sheu, "The South Carolina School Incentive Reward Program: A Policy Analysis," Economics of Education Review, XI (1992), 71–86.

Rohlfs, Chris, "Estimating Classroom Learning Using the School Breakfast Program as an Instrument for Attendance and Class Size: Evidence from Chicago Public Schools," Unpublished manuscript, University of Chicago, 2002.

Tysome, Tony, "Cheating Purge: Inspectors Out," New York Times Higher Education Supplement, August 19, 1994.

Wollack, James A., "A Nominal Response Model Approach for Detecting Answer Copying," Applied Psychological Measurement, XX (1997), 144–152.


Teachers who administer the exam to their own students are 067percentage pointsmdashapproximately 50 percentmdashmore likely tocheat28

A few other notable results emerge First there is no statis-tically signicant impact on cheating of reusing a test form thathas been administered in a previous year That nding is ofinterest because it suggests that teachers taking old exams andteaching the precise questions to students is not an importantcomponent of what we are detecting as cheating (although anec-dotal evidence suggests this practice exists) Classrooms inschools with lower achievement higher poverty rates and moreBlack students are more likely to cheat Classrooms in schoolswith teachers who graduated from more prestigious undergrad-uate institutions are less likely to cheat whereas classrooms inschools with younger teachers are more likely to cheat

27 Consistent with this nding within cheating classrooms teachers areless likely to alter the answer sheets for students who are excluded from testreporting Teachers also appear to less frequently cheat for boys and for olderstudents

28 In an earlier version of the paper [Jacob and Levitt 2002] we examine forwhich students teachers tend to cheat We nd that teachers are most likely tocheat for students in the second quartile of the national ability distribution(twenty-fth to ftieth percentile) in the previous year consistent with the incen-tives under the probation policy

870 QUARTERLY JOURNAL OF ECONOMICS

VII CONCLUSION

This paper develops an algorithm for determining the preva-lence of teacher cheating on standardized tests and applies theresults to data from the Chicago public schools Our methodsreveal thousands of instances of classroom cheating representing4ndash5 percent of the classrooms each year Moreover we nd thatteacher cheating appears quite responsive to relatively minorchanges in incentives

The obvious benets of high-stakes tests as a means of pro-viding incentives must therefore be weighed against the possibledistortions that these measures induce Explicit cheating ismerely one form of distortion Indeed the kind of cheating wefocus on is one that could be greatly reduced at a relatively lowcost through the implementation of proper safeguards such as isdone by Educational Testing Service on the SAT or GRE exams29

Even if this particular form of cheating were eliminatedhowever there are a virtually innite number of dimensionsalong which strategic school personnel can act to game the cur-rent system For instance Figlio and Winicki [2001] and Rohlfs[2002] document that schools attempt to increase the caloricintake of students on test days and disruptive students areespecially likely to be suspended from school during ofcial test-ing periods [Figlio 2002] It may be prohibitively costly or evenimpossible to completely safeguard the system

Finally this paper ts into a small but growing body ofresearch focused on identifying corrupt or illicit behavior on thepart of economic actors (see Porter and Zona [1993] Fisman[2001] Di Tella and Schargrodsky [2001] and Duggan and Levitt[2002]) Because individuals engaged in such behavior activelyattempt to hide their trails the intellectual exercise associatedwith uncovering their misdeeds differs substantially from thetypical economic application in which the researcher starts witha well-dened measure of the outcome variable (eg earningseconomic growth prots) and then attempts to uncover the de-terminants of these outcomes In the case of corruption there istypically no clear outcome variable making it necessary for the

29 Although even tests administered by the Educational Testing Service arenot immune to cheating as evidenced by recent reports that many of the GREscores from China are fraudulent [Contreras 2002]

871ROTTEN APPLES

researcher to employ nonstandard approaches in generating sucha measure

APPENDIX THE CONSTRUCTION OF SUSPICIOUS STRING MEASURES

This appendix describes in greater detail how we constructeach of our measures of unexpected or suspicious responses

The rst measure focuses on the most unlikely block of iden-tical answers given on consecutive questions Using past testscores future test scores and background characteristics wepredict the likelihood that each student will give each answer oneach question For each item a student has four choices (A B Cor D) only one of which is correct We estimate a multinomiallogit for each item on the exam in order to predict how studentswill respond to each question We estimate the following modelfor each item using information from other students in that yeargrade and subject

(1) Pr~Yisc 5 j 5eb jxs

Sj51J eb jxs

where Y isc indicates the response of student s in class c on itemi the number of possible responses ( J) is four and Xs is a vectorthat includes measures of prior and future student achievementin math and reading as well as demographic variables (such asrace gender and free lunch status) for student s Thus a stu-dentrsquos predicted probability of choosing a particular response isidentied by the likelihood of other students (in the same yeargrade and subject) with similar background characteristicschoosing that response

Notice that by including future as well as prior test scores inthe model we decrease the likelihood that students with unusu-ally good teachers will be identied as cheaters since these stu-dents will likely retain some of the knowledge learned in the baseyear and thus have higher future test scores Also note that byestimating the probability of selecting each possible responserather than simply estimating the probability of choosing thecorrect response we take advantage of any additional informa-tion that is provided by particular response patterns in aclassroom

Using the estimates from this model we calculate the pre-dicted probability that each student would answer each item in

872 QUARTERLY JOURNAL OF ECONOMICS

the way that he or she in fact did

(2) pisc 5ebkxs

S j51J eb j xs

for k

5 response actually chosen by student s on item i

This provides us with one measure per student per item Takingthe product over items within student we calculate the probabil-ity that a student would have answered a string of consecutivequestions from item m to item n as he or she did

(3) pscmn 5 P

i5m

n

pisc

We then take the product across all students in the classroomwho had identical responses in the string If we dene z as astudent Szc

m n as the string of responses for student z from item mto item n and S sc

m n and as the string for student s then we canexpress the product as

(4) pscmn 5 P

s[$ zS icmn5 S sc

mn

pscmn

Note that if there are ns students in class c and each student hasa unique set of responses to these particular items then psc

m n

collapses to pscm n for each student and there will be ns distinct

values within the class On the other extreme if all of the stu-dents in class c have identical responses then there is only onedistinct value of psc

m n We repeat this calculation for all possibleconsecutive strings of length three to seven that is for all Sm n

such that 3 m 2 n 7 To create our rst indicator ofsuspicious string patterns we take the minimum of the predictedblock probability for each classroom

MEASURE 1 M1c 5 mins( ps cm n)

This measure captures the least likely block of identical answersgiven on consecutive questions in the classroom

The second measure of suspicious answer strings is intendedto capture more general patterns of similarity in student re-sponses To construct this measure we rst calculate the resid-

873ROTTEN APPLES

uals for each of the possible choices a student could have made foreach item

(5) e jisc 5 0 2eb jxs

S j51J eb j xs

if j THORN k

5 1 2eb j xs

S j51J eb jxs

if j 5 k

where e jis c is the residual for response j on item i by student s inclassroom c We thus have four separate residuals per studentper item

To create a classroom level measure of the response to item iwe need to combine the information for each student First wesum the residuals for each response across students within aclassroom

(6) ejic 5 Os

e jisc

If there is no within-class correlation in the way that studentsresponded to a particular item this term should be approximatelyzero Second we sum across the four possible responses for eachitem within classrooms At the same time we square each of thecomponent residual measures to accentuate outliers and divideby number of students in the class (nsc) to normalize by class size

(7) v ic 5S j ejic

2

nsc

The statistic vic captures something like the variance of studentresponses on item i within classroom c Notice that we choose torst sum across the residuals of each response across studentsand then sum the classroom level measures for each responserather than summing across responses within student initiallyWe do this in order to emphasize the classroom level tendencies inresponse patterns

Our second measure of suspicious strings is simply the class-room average (across items) of this variance term across all testitems

MEASURE 2 M2c 5 v c 5 yeni vic ni where ni is the number of itemson the exam

Our third measure is simply the variance (as opposed to the

874 QUARTERLY JOURNAL OF ECONOMICS

mean) in the degree of correlation across questions within aclassroom

MEASURE 3 M3c 5 sv c

2 5 yeni (vic 2 v c)2 ni

Our nal indicator focuses on the extent to which a studentrsquosresponse pattern was different from other studentrsquos with thesame aggregate score that year Let qisc equal one if student s inclassroom c answered item i correctly and zero otherwise Let As

equal the aggregate score of student s on the exam We thendetermine what fraction of students at each aggregate score levelanswered each item correctly If we let nsA equal number ofstudents with an aggregate score of A then this fraction q i

A canbe expressed as

(8) q iA 5

Ss[$ z Az5As qisc

nsA

We then calculate a measure of how much the response pattern ofstudent s differed from the response pattern of other studentswith the same aggregate score We do so by subtracting a stu-dentrsquos answer on item i from the mean response of all studentswith aggregate score A squaring these deviations and then sum-ming across all items on the exam

(9) Zsc 5 Oi

~qisc 2 q iA2

We then subtract out the mean deviation for all students with thesame aggregate score Z A and sum the students within eachclassroom to obtain our nal indicator

MEASURE 4 M4c 5 yens (Zs c 2 Z A)

KENNEDY SCHOOL OF GOVERNMENT HARVARD UNIVERSITY

UNIVERSITY OF CHICAGO AND AMERICAN BAR FOUNDATION

REFERENCES

Aiken Linda R ldquoDetecting Understanding and Controlling for Cheating onTestsrdquo Research in Higher Education XXXII (1991) 725ndash736

Angoff William H ldquoThe Development of Statistical Indices for Detecting Cheat-ersrdquo Journal of the American Statistical Association LXIX (1974) 44ndash49

Baker George ldquoIncentive Contracts and Performance Measurementrdquo Journal ofPolitical Economy C (1992) 598ndash614

Bellezza Francis S and Suzanne F Bellezza ldquoDetection of Cheating on Multiple-Choice Tests by Using Error Similarity Analysisrdquo Teaching of PsychologyXVI (1989) 151ndash155

Carnoy Martin and Susanna Loeb ldquoDoes External Accountability Affect Student

875ROTTEN APPLES

Outcomes A Cross-State Analysisrdquo Working Paper Stanford UniversitySchool of Education 2002

Contreras Russell ldquoInquiry Uncovers Possible Cheating on GRE in Asiardquo Asso-ciated Press August 7 2002

Cullen Julie Berry and Randall Reback ldquoTinkering Toward Accolades SchoolGaming under a Performance Accountability Systemrdquo Working paper Uni-versity of Michigan 2002

Deere Donald and Wayne Strayer ldquoPutting Schools to the Test School Account-ability Incentives and Behaviorrdquo Working Paper Texas AampM University2002

DiTella Rafael and Ernesto Schargrodsky ldquoThe Role of Wages and Auditingduring a Crackdown on Corruption in the City of Buenos Airesrdquo unpublishedmanuscript Harvard Business School 2001

Duggan Mark and Steven Levitt ldquoWinning Isnrsquot Everything Corruption in SumoWrestlingrdquo American Economic Review XCII (2002) 1594ndash1605

Figlio David N ldquoTesting Crime and Punishmentrdquo Working Paper University ofFlorida 2002

Figlio David N and Joshua Winicki ldquoFood for Thought The Effects of SchoolAccountability Plans on School Nutritionrdquo Working Paper University ofFlorida 2001

Figlio David N and Lawrence S Getzler ldquoAccountability Ability and DisabilityGaming the Systemrdquo Working Paper University of Florida 2002

Fisman Ray ldquoEstimating the Value of Political Connectionsrdquo American EconomicReview XCI (2001) 1095ndash1102

Frary Robert B ldquoStatistical Detection of Multiple-Choice Answer Copying Re-view and Commentaryrdquo Applied Measurement in Education VI (1993)153ndash165

Frary Robert B T Nicholas Tideman and Thomas Watts ldquoIndices of Cheatingon Multiple-Choice Testsrdquo Journal of Educational Statistics II (1977)235ndash256

Glaeser Edward and Andrei Shleifer ldquoA Reason for Quantity Regulationrdquo Amer-ican Economic Review XCI (2001) 431ndash435

Grissmer David W Ann Flanagan et al ldquoImproving Student AchievementWhat NAEP Test Scores Tell Usrdquo RAND Corporation MR-924-EDU 2000

Hanushek Eric A and Margaret E Raymond ldquoImproving Educational QualityHow Best to Evaluate Our Schoolsrdquo Conference paper Federal Reserve Bankof Boston 2002

Hofkins Diane ldquoCheating lsquoRifersquo in National Testsrdquo New York Times EducationalSupplement June 16 1995

Holland Paul W ldquoAssessing Unusual Agreement between the Incorrect Answersof Two Examinees Using the K-Index Statistical Theory and EmpiricalSupportrdquo Technical Report No 96-5 Educational Testing Service 1996

Holmstrom Bengt and Paul Milgrom ldquoMultitask Principal-Agent Analyses In-centive Contracts Asset Ownership and Job Designrdquo Journal of Law Eco-nomics and Organization VII (1991) 24ndash51

Jacob Brian A ldquoAccountability Incentives and Behavior The Impact of High-Stakes Testing in the Chicago Public Schoolsrdquo Working Paper No 8968National Bureau of Economic Research 2002

Jacob Brian A and Lars Lefgren ldquoThe Impact of Teacher Training on StudentAchievement Quasi-Experimental Evidence from School Reform Efforts inChicagordquo Journal of Human Resources forthcoming

Jacob Brian A and Steven Levitt ldquoCatching Cheating Teachers The Results ofan Unusual Experiment in Implementing Theoryrdquo Working Paper No 9413National Bureau of Economic Research 2003

Klein Stephen P Laura S Hamilton et al ldquoWhat Do Test Scores in Texas TellUsrdquo RAND Corporation 2000

Kolker Claudia ldquoTexas Offers Hard Lessons on School Accountabilityrdquo LosAngeles Times April 14 1999

Lindsay Drew ldquoWhodunit Ofcials Find Thousands of Erasures on Standard-ized Tests and Suspect Tamperingrdquo Education Week (1996) 25ndash29

Loughran Regina and Thomas Comiskey ldquoCheating the Children Educator

876 QUARTERLY JOURNAL OF ECONOMICS

Misconduct on Standardized Testsrdquo Report of the City of New York SpecialCommissioner of Investigation for the New York City School District 1999

Marcus John ldquoFaking the Graderdquo Boston Magazine February 2000May Meredith ldquoState Fears Cheating by Teachersrdquo San Francisco Chronicle

October 4 2000Perlman Carol L ldquoResults of a Citywide Testing Program Audit in Chicagordquo

Paper for American Educational Research Association annual meeting Chi-cago IL 1985

Porter Robert H and J Douglas Zona ldquoDetection of Bid Rigging in ProcurementAuctionsrdquo Journal of Political Economy CI (1993) 518ndash538

Richards Craig E and Tian Ming Sheu ldquoThe South Carolina School IncentiveReward Program A Policy Analysisrdquo Economics of Education Review XI(1992) 71ndash86

Rohlfs Chris ldquoEstimating Classroom Learning Using the School Breakfast Pro-gram as an Instrument for Attendance and Class Size Evidence from ChicagoPublic Schoolsrdquo Unpublished manuscript University of Chicago 2002

Tysome Tony ldquoCheating Purge Inspectors outrdquo New York Times Higher Educa-tion Supplement August 19 1994

Wollack James A ldquoA Nominal Response Model Approach for Detecting AnswerCopyingrdquo Applied Psychological Measurement XX (1997) 144ndash152

877ROTTEN APPLES

Page 20: rotten apples: an investigation of the prevalence and predictors of ...

that only rarely are good teachers mistakenly labeled as cheaters. For instance, a teacher whose prowess allows him or her to raise test scores by six questions for half the students in the class (more than a 0.5 grade equivalent increase on average for the class) is labeled a cheater only 2 percent of the time. In contrast, a naive cheater with the same test score gains is correctly identified as a cheater in more than 50 percent of cases.

In another test for false positives, we explored whether we might be mistaking emphasis on certain subject material for cheating. For example, if a math teacher spends several months on fractions with a particular class, one would expect the class to do particularly well on all of the math questions relating to fractions, and perhaps worse than average on other math questions. One might imagine a similar scenario in reading if, for example, a teacher creates a lengthy unit on the Underground Railroad, which later happens to appear in a passage on the reading comprehension exam. To examine this, we analyze the nature of the across-question correlations in cheating versus noncheating classrooms. We find that the most highly correlated questions in cheating classrooms are no more likely to measure the same skill (e.g., fractions, geometry) than in noncheating classrooms. The implication of these results is that the prevalence estimates presented above are likely to substantially understate the true extent of cheating (there is little evidence of false positives), but we frequently miss moderate cases of cheating.

V.B. The Correlation of Cheating Indicators across Subjects, Classrooms, and Years

If what we are detecting is truly cheating, then one would expect that a teacher who cheats on one part of the test would be more likely to cheat on other parts of the test. Also, a teacher who cheats one year would be more likely to cheat the following year. Finally, to the extent that cheating is either condoned by the principal or carried out by the test coordinator, one would expect to find multiple classes in a school cheating in any given year, and perhaps even that cheating in a school one year predicts cheating in future years. If what we are detecting is not cheating, then one would not necessarily expect to find strong correlation in our cheating indicators across exams for a specific classroom, across classrooms, or across years.


Table III reports regression results testing these predictions. The dependent variable is an indicator for whether we believe a classroom is likely to be cheating on a particular subject test, using our most stringent definition (above the ninety-fifth percentile on both cheating indicators).

TABLE III
PATTERNS OF CHEATING WITHIN CLASSROOMS AND SCHOOLS

Dependent variable = class suspected of cheating (mean of the dependent variable = 0.011)

                                                             Full sample            Sample of classes and
                                                                                    schools that existed
                                                                                    in the prior year
Independent variables                                       (1)        (2)         (3)        (4)

Classroom cheated on exactly one other subject this year    0.105      0.103       0.101      0.101
                                                            (0.008)    (0.008)     (0.009)    (0.009)
Classroom cheated on exactly two other subjects this year   0.289      0.285       0.243      0.243
                                                            (0.027)    (0.027)     (0.031)    (0.031)
Classroom cheated on all three other subjects this year     0.627      0.622       0.595      0.595
                                                            (0.051)    (0.051)     (0.054)    (0.054)
Cheating rate among all other classes in the school                    0.166       0.134      0.129
  this year on this subject                                            (0.030)     (0.027)    (0.027)
Cheating rate among all other classes in the school                    0.023       0.059      0.045
  this year on other subjects                                          (0.024)     (0.026)    (0.029)
Cheating in this classroom in this subject last year                               0.096      0.091
                                                                                   (0.012)    (0.012)
Number of other subjects this classroom cheated on                                 0.023      0.018
  last year                                                                        (0.004)    (0.004)
Cheating in this classroom ever in the past                                                   0.006
                                                                                              (0.002)
Cheating rate among other classrooms in this school                                           0.090
  in past years                                                                               (0.040)
Fixed effects for grade × subject × year                    Yes        Yes         Yes        Yes
R^2                                                         0.090      0.093       0.109      0.109
Number of observations                                      165,578    165,578     94,182     94,170

The dependent variable is an indicator for whether a classroom is above the ninety-fifth percentile on both our suspicious strings and unusual test score measures of cheating on a particular subject test. Estimation is done using a linear probability model. The unit of observation is classroom × grade × year × subject. For columns that include measures of cheating in prior years, observations where that classroom or school does not appear in the data in the prior year are excluded. Standard errors are clustered at the school level to take into account correlations across classrooms as well as serial correlation.


The baseline probability of qualifying as a cheater for this cutoff is 1.1 percent. To fully appreciate the magnitude of the effects implied by the table, it is important to keep this very low baseline in mind. We report estimates from linear probability models (probits yield similar marginal effects), with standard errors clustered at the school level.

Column 1 of Table III shows that cheating on other tests in the same year is an extremely powerful predictor of cheating in a different subject. If a classroom cheats on exactly one other subject test, the predicted probability of cheating on this test increases by over ten percentage points. Since the baseline cheating rate is only 1.1 percent, classrooms cheating on exactly one other test are ten times more likely to have cheated on this subject than are classrooms that did not cheat on any of the other subjects (which is the omitted category). Classrooms that cheat on two other subjects are almost 30 times more likely to cheat on this test, relative to those not cheating on other tests. If a class cheats on all three other subjects, it is 50 times more likely to also cheat on this test.
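The multiples quoted here follow directly from the column (1) coefficients and the 1.1 percent baseline (the text rounds the final figure conservatively):

    (0.011 + 0.105) / 0.011 \approx 10.5,
    (0.011 + 0.289) / 0.011 \approx 27,
    (0.011 + 0.627) / 0.011 \approx 58.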

There also is evidence of correlation in cheating within schools. A ten-percentage-point increase in cheating classrooms in a school (excluding the classroom in question) on the same subject test raises the likelihood that this class cheats by roughly 0.16 percentage points. This potentially suggests some role for centralized cheating by a school counselor, test coordinator, or the principal, rather than by teachers operating independently. There is little evidence that cheating rates within the school on other subject tests affect cheating on this test.

When making comparisons across years (columns 3 and 4), it is important to note that we do not actually have teacher identifiers. We do, however, know what classroom a student is assigned to. Thus, we can only compare the correlation between past and current cheating in a given classroom. To the extent that teacher turnover occurs or teachers switch classrooms, this proxy will be contaminated. Even given this important limitation, cheating in the classroom last year predicts cheating this year. In column 3, for example, we see that classrooms that cheated in the same subject last year are 9.6 percentage points more likely to cheat this year, even after we control for cheating on other subjects in the same year and cheating in other classes in the school. Column 4 shows that prior cheating in the school strongly predicts the likelihood that a classroom will cheat this year.


V.C. Evidence from Retesting under Controlled Circumstances

Perhaps the most compelling evidence for our methods comes from the results of a unique policy intervention. In spring 2002 the Chicago public schools provided us the opportunity to retest over 100 classrooms under controlled circumstances that precluded cheating, a few weeks after the initial ITBS exam was administered.23 Based on suspicious answer strings and unusual test score gains on the initial spring 2002 exam, we identified a set of classrooms most likely to have cheated. In addition, we chose a second group of classrooms that had suspicious answer strings but did not have large test score gains, a pattern that is consistent with a bad teacher who attempts to mask a poor teaching performance by cheating. We also retested two sets of classrooms as control groups. The first control group consisted of classes with large test score gains but no evidence of cheating in their answer strings, a potential indication of effective teaching. These classes would be expected to maintain their gains when retested, subject to possible mean reversion. The second control group consisted of randomly selected classrooms.

The results of the retests are reported in Table IV. Columns 1-3 correspond to reading; columns 4-6 are for math. For each subject we report the test score gain (in standard score units) for three periods: (i) from spring 2001 to the initial spring 2002 test; (ii) from the initial spring 2002 test to the retest a few weeks later; and (iii) from spring 2001 to the spring 2002 retest. For purposes of comparison, we also report the average gain for all classrooms in the system for the first of those measures (the other two measures are unavailable unless a classroom was retested).

Classrooms prospectively identified as "most likely to have cheated" experienced gains on the initial spring 2002 test that were nearly twice as large as the typical CPS classroom (see columns 1 and 4). On the retest, however, those excess gains completely disappeared: the gains between spring 2001 and the spring 2002 retest were close to the systemwide average. Those classrooms prospectively identified as "bad teachers likely to have cheated" also experienced large score declines on the retest (-8.8 and -10.5 standard score points, or roughly two-thirds of a grade equivalent). Indeed, comparing the spring 2001 results with the

23. Full details of this policy experiment are reported in Jacob and Levitt [2003]. The results of the retests were used by CPS to initiate disciplinary action against a number of teachers, principals, and staff.


TABLE IV
RESULTS OF RETESTS: COMPARISON OF RESULTS FOR SPRING 2002 ITBS AND AUDIT TEST

                                             Reading gains between             Math gains between
                                         Spring     Spring     Spring      Spring     Spring     Spring
                                         2001 and   2001 and   2002 and    2001 and   2001 and   2002 and
                                         spring     2002       2002        spring     2002       2002
Category of classroom                    2002       retest     retest      2002       retest     retest

All classrooms in the Chicago public
  schools                                14.3       --         --          16.9       --         --
Classrooms prospectively identified as
  most likely cheaters (N = 36 on
  math, N = 39 on reading)               28.8       12.6       -16.2       30.0       19.3       -10.7
Classrooms prospectively identified as
  bad teachers suspected of cheating
  (N = 16 on math, N = 20 on reading)    16.6       7.8        -8.8        17.3       6.8        -10.5
Classrooms prospectively identified as
  good teachers who did not cheat
  (N = 17 on math, N = 17 on reading)    20.6       21.1       0.5         28.8       25.5       -3.3
Randomly selected classrooms (N = 24
  overall, but only one test per
  classroom)                             14.5       12.2       -2.3        14.5       12.2       -2.3

Results in this table compare test scores at three points in time (spring 2001, spring 2002, and a retest administered under controlled conditions a few weeks after the initial spring 2002 test). The unit of observation is a classroom-subject test pair. Only students taking all three tests are included in the calculations. Because of limited data, math and reading results for the randomly selected classrooms are combined. Only the first two columns are available for all CPS classrooms, since audits were performed only on a subset of classrooms. All entries in the table are in standard score units.


spring 2002 retest, children in these classrooms had gains only half as large as the typical yearly CPS gain. In stark contrast, classrooms prospectively identified as having "good teachers" (i.e., classrooms with large gains but not suspicious answer strings) actually scored even higher on the reading retest than they did on the initial test. The math scores fell slightly on retesting, but these classrooms continued to experience extremely large gains. The randomly selected classrooms also maintained almost all of their gains when retested, as would be expected. In summary, our methods proved to be extremely effective both in identifying likely cheaters and in correctly classifying teachers who achieved large test score gains in a legitimate manner.

VI. DOES TEACHER CHEATING RESPOND TO INCENTIVES?

From the perspective of economics, perhaps the most interesting question related to teacher cheating is the degree to which it responds to incentives. As noted in the Introduction, there were two major changes in the incentives faced by teachers and students over our sample period. Prior to 1996, ITBS scores were primarily used to provide teachers and parents with a sense of how a child was progressing academically. Beginning in 1996, with the appointment of Paul Vallas as CEO of Schools, the CPS launched an initiative designed to hold students and teachers accountable for student learning.

The reform had two main elements. The first involved putting schools "on probation" if less than 15 percent of students scored at or above national norms on the ITBS reading exams. The CPS did not use math performance to determine probation status. Probation schools that do not exhibit sufficient improvement may be reconstituted, a procedure that involves closing the school and dismissing or reassigning all of the school staff.24 It is clear from our discussions with teachers and administrators that being on probation is viewed as an extremely undesirable circumstance. The second piece of the accountability reform was an end to social promotion, the practice of passing students to the next grade regardless of their academic skills or school performance. Under the new policy, students in third, sixth, and eighth grade must meet minimum standards on the ITBS in both reading and

24. For a more detailed analysis of the probation policy, see Jacob [2002] and Jacob and Lefgren [forthcoming].


mathematics in order to be promoted to the next grade. The promotion standards were implemented in spring 1997 for third and sixth graders. Promotion decisions are based solely on scores in reading comprehension and mathematics.

Table V presents OLS estimates of the relationship between teacher cheating and a variety of classroom and school characteristics.25 The unit of observation is a classroom × subject × grade × year. The dependent variable is an indicator of whether we suspect the classroom cheated, using our most stringent definition: a classroom is designated a cheater if both its test score fluctuations and the suspiciousness of its answer strings are above the ninety-fifth percentile for that grade, subject, and year.26
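In code, this cheater flag is a simple double-threshold rule. The following is a minimal sketch of our reading of that definition, applied within a single grade-subject-year cell; the function and array names are ours, not the authors':

    import numpy as np

    def cheating_indicator(fluctuation, suspicion):
        """Flag classrooms whose unusual test score fluctuation AND suspicious
        answer string measures both exceed the 95th percentile within one
        grade/subject/year cell."""
        f_cut = np.quantile(fluctuation, 0.95)
        s_cut = np.quantile(suspicion, 0.95)
        return (fluctuation > f_cut) & (suspicion > s_cut)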

In column (1) the policy changes are restricted to have a constant impact across all classrooms. The introduction of the social promotion and probation policies is positively correlated with the likelihood of classroom cheating, although the point estimates are not statistically significant at conventional levels. Much more interesting results emerge when we interact the policy changes with the previous year's test scores for the classroom. For both probation and social promotion, cheating rates in the lowest performing classrooms prove to be quite responsive to the change in incentives. In column (2) we see that a classroom one standard deviation below the mean increased its likelihood of cheating by 0.43 percentage points in response to the school probation policy, and roughly 0.65 percentage points due to the ending of social promotion. Given a baseline rate of cheating of 1.1 percent, these effects are substantial. The magnitude of these changes is particularly large considering that no elementary school on probation was actually reconstituted during this period, and that the social promotion policy has a direct impact on students but no direct ramifications for teacher pay or rewards. A classroom one standard deviation above the mean does not see any significant change in cheating in response to these two policies. Such classes are very unlikely to be located in schools at risk for being put on probation, and also are likely to have few students at risk for being retained. These findings are robust to

25. Logit and probit models evaluated at the mean yield comparable results, so the estimates from a linear probability model are presented for ease of interpretation.

26. The results are not sensitive to the cheating cutoff used. Note that this measure may include error due to both false positives and negatives. Since the measurement error is in the dependent variable, the primary impact will likely be to simply decrease the precision of our estimates.


TABLE V
OLS ESTIMATES OF THE RELATIONSHIP BETWEEN CHEATING AND CLASSROOM CHARACTERISTICS

Dependent variable = Indicator of classroom cheating

Independent variables                                       (1)         (2)         (3)         (4)

Social promotion policy                                     0.0011      0.0011      0.0015      0.0023
                                                            (0.0013)    (0.0013)    (0.0013)    (0.0009)
School probation policy                                     0.0020      0.0019      0.0021      0.0029
                                                            (0.0014)    (0.0014)    (0.0014)    (0.0013)
Prior classroom achievement                                 -0.0047     -0.0028     -0.0016     -0.0028
                                                            (0.0005)    (0.0005)    (0.0007)    (0.0007)
Social promotion × classroom achievement                                -0.0049     -0.0051     -0.0046
                                                                        (0.0014)    (0.0014)    (0.0012)
School probation × classroom achievement                                -0.0070     -0.0070     -0.0064
                                                                        (0.0013)    (0.0013)    (0.0013)
Mixed grade classroom                                       -0.0084     -0.0085     -0.0089     -0.0089
                                                            (0.0007)    (0.0007)    (0.0008)    (0.0012)
% of students included in official reporting                0.0252      0.0249      0.0141      0.0131
                                                            (0.0031)    (0.0031)    (0.0037)    (0.0037)
Teacher administers exam to own students                    0.0067      0.0067      0.0066      0.0061
                                                            (0.0015)    (0.0015)    (0.0015)    (0.0011)
Test form offered for the first time (a)                    -0.0007     -0.0007     -0.0011     --
                                                            (0.0011)    (0.0011)    (0.0010)
Average quality of teachers' undergraduate institution                              -0.0026
                                                                                    (0.0007)
Percent of teachers who have worked at the school                                   -0.0045
  less than 3 years                                                                 (0.0031)
Percent of teachers under 30 years of age                                           0.0156
                                                                                    (0.0065)
Percent of students in the school meeting national                                  -0.0001
  norms in reading last year                                                        (0.0000)
Percent free lunch in school                                                        0.0001
                                                                                    (0.0000)
Predominantly Black school                                                          0.0068
                                                                                    (0.0019)
Predominantly Hispanic school                                                       -0.0009
                                                                                    (0.0016)
School × Year fixed effects                                 No          No          No          Yes
Number of observations                                      163,474     163,474     163,474     163,474

The unit of observation is classroom × grade × year × subject, and the sample includes eight years (1993 to 2000), four subjects (reading comprehension and three math sections), and five grades (three to seven). The dependent variable is the cheating indicator derived using the ninety-fifth percentile cutoff. Robust standard errors clustered by school × year are shown in parentheses. Other variables included in the regressions in columns (1) and (2) are a linear time trend, cubic terms for the number of students, a linear grade variable, and fixed effects for subjects. The regression shown in column (3) also includes the following variables: indicators of the percent of students in the classroom who were Black, Hispanic, male, receiving free lunch, old for grade, in a special education program, living in foster care, and living with a nonparental relative; indicators of school size, mobility rate, and attendance rate; and indicators of the percent of teachers in the school who had a master's or doctoral degree, lived in Chicago, and were education majors.

(a) Test forms vary only by year, so this variable will drop out of the analysis when school × year fixed effects are included.


adding a number of classroom and school characteristics (column (3)), as well as school × year fixed effects (column (4)).
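In the interacted specifications of columns (2)-(4), the implied effect of either policy varies linearly with prior classroom achievement; writing a for standardized prior achievement, the marginal effect is

    \frac{\partial \Pr(\text{cheat})}{\partial \text{policy}} = \beta_{\text{policy}} + \beta_{\text{policy} \times \text{achievement}} \cdot a,

so that low-achieving classrooms (a < 0) face a positive implied increase in cheating, while for classrooms above the mean the negative interaction term offsets or reverses it, matching the pattern described in the text.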

Cheating also appears to be responsive to other costs and benefits. Classrooms that tested poorly last year are much more likely to cheat. For example, a classroom with average student prior achievement one classroom standard deviation below the mean is 23 percent more likely to cheat. Classrooms with students in multiple grades are 65 percent less likely to cheat than classrooms where all students are in the same grade. This is consistent with the fact that it is likely more difficult for teachers in such classrooms to cheat, since they must administer two different test forms to students, which will necessarily have different correct answers. Moreover, classes with a higher proportion of students who are included in the official test reporting are more likely to cheat: a 10 percentage point increase in the proportion of students in a class whose test scores "count" will increase the likelihood of cheating by roughly 20 percent.27

Teachers who administer the exam to their own students are 0.67 percentage points (approximately 50 percent) more likely to cheat.28

A few other notable results emerge. First, there is no statistically significant impact on cheating of reusing a test form that has been administered in a previous year. That finding is of interest because it suggests that teachers taking old exams and teaching the precise questions to students is not an important component of what we are detecting as cheating (although anecdotal evidence suggests this practice exists). Classrooms in schools with lower achievement, higher poverty rates, and more Black students are more likely to cheat. Classrooms in schools with teachers who graduated from more prestigious undergraduate institutions are less likely to cheat, whereas classrooms in schools with younger teachers are more likely to cheat.

27. Consistent with this finding, within cheating classrooms teachers are less likely to alter the answer sheets for students who are excluded from test reporting. Teachers also appear to cheat less frequently for boys and for older students.

28. In an earlier version of the paper [Jacob and Levitt 2002], we examine for which students teachers tend to cheat. We find that teachers are most likely to cheat for students in the second quartile of the national ability distribution (twenty-fifth to fiftieth percentile) in the previous year, consistent with the incentives under the probation policy.


VII. CONCLUSION

This paper develops an algorithm for determining the prevalence of teacher cheating on standardized tests and applies the results to data from the Chicago public schools. Our methods reveal thousands of instances of classroom cheating, representing 4-5 percent of the classrooms each year. Moreover, we find that teacher cheating appears quite responsive to relatively minor changes in incentives.

The obvious benefits of high-stakes tests as a means of providing incentives must therefore be weighed against the possible distortions that these measures induce. Explicit cheating is merely one form of distortion. Indeed, the kind of cheating we focus on is one that could be greatly reduced at a relatively low cost through the implementation of proper safeguards, such as is done by the Educational Testing Service on the SAT or GRE exams.29

Even if this particular form of cheating were eliminated, however, there are a virtually infinite number of dimensions along which strategic school personnel can act to game the current system. For instance, Figlio and Winicki [2001] and Rohlfs [2002] document that schools attempt to increase the caloric intake of students on test days, and disruptive students are especially likely to be suspended from school during official testing periods [Figlio 2002]. It may be prohibitively costly, or even impossible, to completely safeguard the system.

Finally, this paper fits into a small but growing body of research focused on identifying corrupt or illicit behavior on the part of economic actors (see Porter and Zona [1993], Fisman [2001], Di Tella and Schargrodsky [2001], and Duggan and Levitt [2002]). Because individuals engaged in such behavior actively attempt to hide their trails, the intellectual exercise associated with uncovering their misdeeds differs substantially from the typical economic application, in which the researcher starts with a well-defined measure of the outcome variable (e.g., earnings, economic growth, profits) and then attempts to uncover the determinants of these outcomes. In the case of corruption, there is typically no clear outcome variable, making it necessary for the

29. Although even tests administered by the Educational Testing Service are not immune to cheating, as evidenced by recent reports that many of the GRE scores from China are fraudulent [Contreras 2002].


researcher to employ nonstandard approaches in generating such a measure.

APPENDIX: THE CONSTRUCTION OF SUSPICIOUS STRING MEASURES

This appendix describes in greater detail how we construct each of our measures of unexpected or suspicious responses.

The first measure focuses on the most unlikely block of identical answers given on consecutive questions. Using past test scores, future test scores, and background characteristics, we predict the likelihood that each student will give each answer on each question. For each item a student has four choices (A, B, C, or D), only one of which is correct. We estimate a multinomial logit for each item on the exam in order to predict how students will respond to each question. We estimate the following model for each item, using information from other students in that year, grade, and subject:

    (1)   \Pr(Y_{isc} = j) = \frac{e^{\beta_j X_s}}{\sum_{j=1}^{J} e^{\beta_j X_s}}

where Y_{isc} indicates the response of student s in class c on item i, the number of possible responses (J) is four, and X_s is a vector that includes measures of prior and future student achievement in math and reading, as well as demographic variables (such as race, gender, and free lunch status) for student s. Thus, a student's predicted probability of choosing a particular response is identified by the likelihood of other students (in the same year, grade, and subject) with similar background characteristics choosing that response.

Notice that by including future as well as prior test scores in the model, we decrease the likelihood that students with unusually good teachers will be identified as cheaters, since these students will likely retain some of the knowledge learned in the base year and thus have higher future test scores. Also note that by estimating the probability of selecting each possible response, rather than simply estimating the probability of choosing the correct response, we take advantage of any additional information that is provided by particular response patterns in a classroom.
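As a concrete illustration, the per-item prediction step can be sketched as follows. This is our own minimal reconstruction, not the authors' code: the array names, the synthetic data, and the use of scikit-learn's multinomial logistic regression are all assumptions.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    n_students, n_items, n_choices = 500, 40, 4

    # Synthetic stand-ins: X holds prior/future scores and demographics (the
    # X_s vector); responses holds each student's chosen answer (0-3) per item.
    X = rng.normal(size=(n_students, 6))
    responses = rng.integers(0, n_choices, size=(n_students, n_items))

    # Fit one multinomial logit per item; recover P(student s picks answer j).
    probs = np.zeros((n_students, n_items, n_choices))
    for i in range(n_items):
        model = LogisticRegression(max_iter=1000)  # multinomial for >2 classes
        model.fit(X, responses[:, i])
        probs[:, i, model.classes_] = model.predict_proba(X)

    # p_isc in equation (2) below: probability of the answer actually given.
    p_actual = np.take_along_axis(probs, responses[:, :, None], axis=2).squeeze(-1)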

Using the estimates from this model, we calculate the predicted probability that each student would answer each item in the way that he or she in fact did:

    (2)   p_{isc} = \frac{e^{\beta_k X_s}}{\sum_{j=1}^{J} e^{\beta_j X_s}}, \quad k = \text{the response actually chosen by student } s \text{ on item } i.

This provides us with one measure per student per item. Taking the product over items within student, we calculate the probability that a student would have answered a string of consecutive questions from item m to item n as he or she did:

    (3)   p_{sc}^{mn} = \prod_{i=m}^{n} p_{isc}.

We then take the product across all students in the classroom who had identical responses in the string. If we define z as a student, S_{zc}^{mn} as the string of responses for student z from item m to item n, and S_{sc}^{mn} as the string for student s, then we can express the product as

    (4)   \tilde{p}_{sc}^{mn} = \prod_{s \in \{z : S_{zc}^{mn} = S_{sc}^{mn}\}} p_{sc}^{mn}.

Note that if there are n_s students in class c and each student has a unique set of responses to these particular items, then \tilde{p}_{sc}^{mn} collapses to p_{sc}^{mn} for each student, and there will be n_s distinct values within the class. On the other extreme, if all of the students in class c have identical responses, then there is only one distinct value of \tilde{p}_{sc}^{mn}. We repeat this calculation for all possible consecutive strings of length three to seven, that is, for all S^{mn} such that 3 \le n - m \le 7. To create our first indicator of suspicious string patterns, we take the minimum of the predicted block probability for each classroom:

MEASURE 1: M1_c = \min_s (\tilde{p}_{sc}^{mn}).

This measure captures the least likely block of identical answers given on consecutive questions in the classroom.
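Continuing the hypothetical arrays from the sketch above, the grouping logic of equations (3) and (4) can be sketched like this (again our own reconstruction, restricted to the students of a single classroom):

    def measure_1(p_actual, responses):
        """M1_c: the least likely block of identical consecutive answers in one
        classroom; p_actual and responses cover only that class's students."""
        n_students, n_items = responses.shape
        best = 1.0
        for length in range(3, 8):                  # string lengths three to seven
            for m in range(n_items - length + 1):
                window = slice(m, m + length)
                # Equation (3): probability of each student's actual string.
                p_str = p_actual[:, window].prod(axis=1)
                # Equation (4): multiply across students with identical strings.
                grouped = {}
                for key, p in zip(map(tuple, responses[:, window]), p_str):
                    grouped[key] = grouped.get(key, 1.0) * p
                best = min(best, min(grouped.values()))
        return best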

The second measure of suspicious answer strings is intended to capture more general patterns of similarity in student responses. To construct this measure, we first calculate the residuals for each of the possible choices a student could have made for each item:

    (5)   e_{jisc} = \begin{cases} 0 - \frac{e^{\beta_j X_s}}{\sum_{j=1}^{J} e^{\beta_j X_s}} & \text{if } j \neq k \\ 1 - \frac{e^{\beta_j X_s}}{\sum_{j=1}^{J} e^{\beta_j X_s}} & \text{if } j = k \end{cases}

where e_{jisc} is the residual for response j on item i by student s in classroom c. We thus have four separate residuals per student per item.

To create a classroom-level measure of the response to item i, we need to combine the information for each student. First, we sum the residuals for each response across students within a classroom:

    (6)   e_{jic} = \sum_s e_{jisc}.

If there is no within-class correlation in the way that students responded to a particular item, this term should be approximately zero. Second, we sum across the four possible responses for each item within classrooms. At the same time, we square each of the component residual measures to accentuate outliers, and divide by the number of students in the class (n_{sc}) to normalize by class size:

    (7)   v_{ic} = \frac{\sum_j e_{jic}^2}{n_{sc}}.

The statistic v_{ic} captures something like the variance of student responses on item i within classroom c. Notice that we choose to first sum the residuals of each response across students, and then sum the classroom-level measures for each response, rather than summing across responses within student initially. We do this in order to emphasize the classroom-level tendencies in response patterns.

Our second measure of suspicious strings is simply the classroom average (across items) of this variance term across all test items:

MEASURE 2: M2_c = \bar{v}_c = \sum_i v_{ic} / n_i, where n_i is the number of items on the exam.

Our third measure is simply the variance (as opposed to the mean) in the degree of correlation across questions within a classroom:

MEASURE 3: M3_c = \sigma_{v_c}^2 = \sum_i (v_{ic} - \bar{v}_c)^2 / n_i.
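In code, Measures 2 and 3 reduce to a few array operations. The sketch below uses the probs and responses arrays assumed earlier, restricted to one classroom; all names are ours:

    import numpy as np

    def measures_2_and_3(probs, responses):
        """probs: (students, items, 4) predicted answer probabilities for ONE
        class; responses: (students, items) actual choices coded 0-3."""
        n_students, n_items, n_choices = probs.shape
        chosen = np.eye(n_choices)[responses]         # 1 for the chosen answer
        resid = chosen - probs                        # equation (5): e_jisc
        e_jic = resid.sum(axis=0)                     # equation (6): sum over students
        v_ic = (e_jic ** 2).sum(axis=1) / n_students  # equation (7)
        return v_ic.mean(), v_ic.var()                # Measure 2 and Measure 3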

Our final indicator focuses on the extent to which a student's response pattern differs from that of other students with the same aggregate score that year. Let q_{isc} equal one if student s in classroom c answered item i correctly, and zero otherwise. Let A_s equal the aggregate score of student s on the exam. We then determine what fraction of students at each aggregate score level answered each item correctly. If we let n_{sA} equal the number of students with an aggregate score of A, then this fraction \bar{q}_i^A can be expressed as

    (8)   \bar{q}_i^A = \frac{\sum_{s \in \{z : A_z = A_s\}} q_{isc}}{n_{sA}}.

We then calculate a measure of how much the response pattern of student s differed from the response pattern of other students with the same aggregate score. We do so by subtracting a student's answer on item i from the mean response of all students with aggregate score A, squaring these deviations, and then summing across all items on the exam:

    (9)   Z_{sc} = \sum_i (q_{isc} - \bar{q}_i^A)^2.

We then subtract out the mean deviation for all students with the same aggregate score, \bar{Z}^A, and sum across the students within each classroom to obtain our final indicator:

MEASURE 4: M4_c = \sum_s (Z_{sc} - \bar{Z}^A).
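A sketch of Measure 4 under the same assumptions; here correct is a 0/1 matrix over all students in a grade-year and class_ids maps students to classrooms, both hypothetical names of ours:

    import numpy as np

    def measure_4(correct, class_ids):
        """correct: (students, items) 0/1 matrix for ALL students in a
        grade/year; class_ids: classroom of each student.  Returns M4 by class."""
        scores = correct.sum(axis=1)                  # aggregate score A_s
        # Equation (8): mean response pattern at each aggregate score level.
        q_bar = {A: correct[scores == A].mean(axis=0) for A in np.unique(scores)}
        # Equation (9): each student's squared deviation from that pattern.
        Z = np.array([((correct[s] - q_bar[scores[s]]) ** 2).sum()
                      for s in range(len(scores))])
        # Center by the mean deviation at the same score, then sum within class.
        Z_bar = np.array([Z[scores == A].mean() for A in scores])
        centered = Z - Z_bar
        return {c: centered[class_ids == c].sum() for c in np.unique(class_ids)}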

KENNEDY SCHOOL OF GOVERNMENT, HARVARD UNIVERSITY

UNIVERSITY OF CHICAGO AND AMERICAN BAR FOUNDATION

REFERENCES

Aiken, Linda R., "Detecting, Understanding, and Controlling for Cheating on Tests," Research in Higher Education, XXXII (1991), 725-736.

Angoff, William H., "The Development of Statistical Indices for Detecting Cheaters," Journal of the American Statistical Association, LXIX (1974), 44-49.

Baker, George, "Incentive Contracts and Performance Measurement," Journal of Political Economy, C (1992), 598-614.

Bellezza, Francis S., and Suzanne F. Bellezza, "Detection of Cheating on Multiple-Choice Tests by Using Error Similarity Analysis," Teaching of Psychology, XVI (1989), 151-155.

Carnoy, Martin, and Susanna Loeb, "Does External Accountability Affect Student Outcomes? A Cross-State Analysis," Working Paper, Stanford University School of Education, 2002.

Contreras, Russell, "Inquiry Uncovers Possible Cheating on GRE in Asia," Associated Press, August 7, 2002.

Cullen, Julie Berry, and Randall Reback, "Tinkering Toward Accolades: School Gaming under a Performance Accountability System," Working Paper, University of Michigan, 2002.

Deere, Donald, and Wayne Strayer, "Putting Schools to the Test: School Accountability, Incentives and Behavior," Working Paper, Texas A&M University, 2002.

Di Tella, Rafael, and Ernesto Schargrodsky, "The Role of Wages and Auditing during a Crackdown on Corruption in the City of Buenos Aires," unpublished manuscript, Harvard Business School, 2001.

Duggan, Mark, and Steven Levitt, "Winning Isn't Everything: Corruption in Sumo Wrestling," American Economic Review, XCII (2002), 1594-1605.

Figlio, David N., "Testing, Crime and Punishment," Working Paper, University of Florida, 2002.

Figlio, David N., and Joshua Winicki, "Food for Thought: The Effects of School Accountability Plans on School Nutrition," Working Paper, University of Florida, 2001.

Figlio, David N., and Lawrence S. Getzler, "Accountability, Ability and Disability: Gaming the System," Working Paper, University of Florida, 2002.

Fisman, Ray, "Estimating the Value of Political Connections," American Economic Review, XCI (2001), 1095-1102.

Frary, Robert B., "Statistical Detection of Multiple-Choice Answer Copying: Review and Commentary," Applied Measurement in Education, VI (1993), 153-165.

Frary, Robert B., T. Nicholas Tideman, and Thomas Watts, "Indices of Cheating on Multiple-Choice Tests," Journal of Educational Statistics, II (1977), 235-256.

Glaeser, Edward, and Andrei Shleifer, "A Reason for Quantity Regulation," American Economic Review, XCI (2001), 431-435.

Grissmer, David W., Ann Flanagan, et al., "Improving Student Achievement: What NAEP Test Scores Tell Us," RAND Corporation, MR-924-EDU, 2000.

Hanushek, Eric A., and Margaret E. Raymond, "Improving Educational Quality: How Best to Evaluate Our Schools?" Conference paper, Federal Reserve Bank of Boston, 2002.

Hofkins, Diane, "Cheating 'Rife' in National Tests," New York Times Educational Supplement, June 16, 1995.

Holland, Paul W., "Assessing Unusual Agreement between the Incorrect Answers of Two Examinees Using the K-Index: Statistical Theory and Empirical Support," Technical Report No. 96-5, Educational Testing Service, 1996.

Holmstrom, Bengt, and Paul Milgrom, "Multitask Principal-Agent Analyses: Incentive Contracts, Asset Ownership, and Job Design," Journal of Law, Economics, and Organization, VII (1991), 24-51.

Jacob, Brian A., "Accountability, Incentives and Behavior: The Impact of High-Stakes Testing in the Chicago Public Schools," Working Paper No. 8968, National Bureau of Economic Research, 2002.

Jacob, Brian A., and Lars Lefgren, "The Impact of Teacher Training on Student Achievement: Quasi-Experimental Evidence from School Reform Efforts in Chicago," Journal of Human Resources, forthcoming.

Jacob, Brian A., and Steven Levitt, "Catching Cheating Teachers: The Results of an Unusual Experiment in Implementing Theory," Working Paper No. 9413, National Bureau of Economic Research, 2003.

Klein, Stephen P., Laura S. Hamilton, et al., "What Do Test Scores in Texas Tell Us?" RAND Corporation, 2000.

Kolker, Claudia, "Texas Offers Hard Lessons on School Accountability," Los Angeles Times, April 14, 1999.

Lindsay, Drew, "Whodunit? Officials Find Thousands of Erasures on Standardized Tests and Suspect Tampering," Education Week (1996), 25-29.

Loughran, Regina, and Thomas Comiskey, "Cheating the Children: Educator Misconduct on Standardized Tests," Report of the City of New York Special Commissioner of Investigation for the New York City School District, 1999.

Marcus, John, "Faking the Grade," Boston Magazine, February 2000.

May, Meredith, "State Fears Cheating by Teachers," San Francisco Chronicle, October 4, 2000.

Perlman, Carol L., "Results of a Citywide Testing Program Audit in Chicago," Paper for American Educational Research Association annual meeting, Chicago, IL, 1985.

Porter, Robert H., and J. Douglas Zona, "Detection of Bid Rigging in Procurement Auctions," Journal of Political Economy, CI (1993), 518-538.

Richards, Craig E., and Tian Ming Sheu, "The South Carolina School Incentive Reward Program: A Policy Analysis," Economics of Education Review, XI (1992), 71-86.

Rohlfs, Chris, "Estimating Classroom Learning Using the School Breakfast Program as an Instrument for Attendance and Class Size: Evidence from Chicago Public Schools," unpublished manuscript, University of Chicago, 2002.

Tysome, Tony, "Cheating Purge: Inspectors Out," New York Times Higher Education Supplement, August 19, 1994.

Wollack, James A., "A Nominal Response Model Approach for Detecting Answer Copying," Applied Psychological Measurement, XX (1997), 144-152.

877ROTTEN APPLES

Page 21: rotten apples: an investigation of the prevalence and predictors of ...

Table III reports regression results testing these predictions. The dependent variable is an indicator for whether we believe a classroom is likely to be cheating on a particular subject test, using our most stringent definition (above the ninety-fifth percentile on both cheating indicators). The baseline probability of qualifying as a cheater for this cutoff is 1.1 percent. To fully appreciate the size of the effects implied by the table, it is important to keep this very low baseline in mind. We report estimates from linear probability models (probits yield similar marginal effects), with standard errors clustered at the school level.

TABLE III
PATTERNS OF CHEATING WITHIN CLASSROOMS AND SCHOOLS

Dependent variable = class suspected of cheating (mean of the dependent variable = 0.011)

                                                       Full sample        Classes and schools that
                                                                          existed in the prior year
Independent variables                                 (1)       (2)        (3)       (4)
--------------------------------------------------------------------------------------------
Classroom cheated on exactly one other               0.105     0.103      0.101     0.101
  subject this year                                 (0.008)   (0.008)    (0.009)   (0.009)
Classroom cheated on exactly two other               0.289     0.285      0.243     0.243
  subjects this year                                (0.027)   (0.027)    (0.031)   (0.031)
Classroom cheated on all three other                 0.627     0.622      0.595     0.595
  subjects this year                                (0.051)   (0.051)    (0.054)   (0.054)
Cheating rate among all other classes in              --       0.166      0.134     0.129
  the school this year on this subject                        (0.030)    (0.027)   (0.027)
Cheating rate among all other classes in              --       0.023      0.059     0.045
  the school this year on other subjects                      (0.024)    (0.026)   (0.029)
Cheating in this classroom in this subject            --        --        0.096     0.091
  last year                                                              (0.012)   (0.012)
Number of other subjects this classroom               --        --        0.023     0.018
  cheated on last year                                                   (0.004)   (0.004)
Cheating in this classroom ever in the past           --        --         --       0.006
                                                                                   (0.002)
Cheating rate among other classrooms in               --        --         --       0.090
  this school in past years                                                        (0.040)
Fixed effects for grade×subject×year                  Yes       Yes        Yes       Yes
R²                                                   0.090     0.093      0.109     0.109
Number of observations                              165,578   165,578    94,182    94,170

The dependent variable is an indicator for whether a classroom is above the 95th percentile on both our suspicious strings and unusual test score measures of cheating on a particular subject test. Estimation is done using a linear probability model. The unit of observation is classroom×grade×year×subject. For columns that include measures of cheating in prior years, observations where that classroom or school does not appear in the data in the prior year are excluded. Standard errors are clustered at the school level to take into account correlations across classrooms as well as serial correlation.
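To make the dependent variable concrete, the following is a minimal pandas sketch of how such a flag could be constructed (illustrative only; the DataFrame df and the column names answer_string_score and score_fluctuation are hypothetical, not taken from our data):

    import pandas as pd

    def flag_suspected_cheaters(df: pd.DataFrame, cutoff: float = 0.95) -> pd.Series:
        # Flag a classroom-subject observation when it is above the `cutoff`
        # quantile on BOTH the suspicious-answer-strings measure and the
        # unusual-test-score-fluctuation measure, within its grade x subject
        # x year cell (hypothetical column names).
        cells = df.groupby(["grade", "subject", "year"])
        string_cut = cells["answer_string_score"].transform(lambda s: s.quantile(cutoff))
        gain_cut = cells["score_fluctuation"].transform(lambda s: s.quantile(cutoff))
        return (df["answer_string_score"] > string_cut) & (df["score_fluctuation"] > gain_cut)

    # df["cheat"] = flag_suspected_cheaters(df)
    # With the 0.95 cutoff on both measures, about 1.1 percent of observations clear both bars.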

Column 1 of Table III shows that cheating on other tests in the same year is an extremely powerful predictor of cheating in a different subject. If a classroom cheats on exactly one other subject test, the predicted probability of cheating on this test increases by over ten percentage points. Since the baseline cheating rate is only 1.1 percent, classrooms cheating on exactly one other test are ten times more likely to have cheated on this subject than are classrooms that did not cheat on any of the other subjects (which is the omitted category). Classrooms that cheat on two other subjects are almost 30 times more likely to cheat on this test, relative to those not cheating on other tests. If a class cheats on all three other subjects, it is 50 times more likely to also cheat on this test.

There also is evidence of correlation in cheating within schools. A ten-percentage-point increase in cheating classrooms in a school (excluding the classroom in question) on the same subject test raises the likelihood this class cheats by roughly 0.16 percentage points. This potentially suggests some role for centralized cheating by a school counselor, test coordinator, or the principal, rather than by teachers operating independently. There is little evidence that cheating rates within the school on other subject tests affect cheating on this test.

When making comparisons across years (columns 3 and 4), it is important to note that we do not actually have teacher identifiers. We do, however, know what classroom a student is assigned to. Thus, we can only compare the correlation between past and current cheating in a given classroom. To the extent that teacher turnover occurs or teachers switch classrooms, this proxy will be contaminated. Even given this important limitation, cheating in the classroom last year predicts cheating this year. In column 3, for example, we see that classrooms that cheated in the same subject last year are 9.6 percentage points more likely to cheat this year, even after we control for cheating on other subjects in the same year and cheating in other classes in the school. Column 4 shows that prior cheating in the school strongly predicts the likelihood that a classroom will cheat this year.
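As a rough sketch of this estimation approach (not the code used for the analysis; all variable names are hypothetical), a linear probability model with school-clustered standard errors can be run in statsmodels:

    import statsmodels.formula.api as smf

    # Linear probability model of the cheating flag on indicators for cheating
    # on exactly one, two, or all three other subjects this year, with
    # grade x subject x year fixed effects and school-clustered standard errors.
    model = smf.ols(
        "cheat ~ cheat_one_other + cheat_two_other + cheat_three_other"
        " + C(grade):C(subject):C(year)",
        data=df,
    )
    result = model.fit(cov_type="cluster", cov_kwds={"groups": df["school_id"]})
    print(result.summary())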


V.C. Evidence from Retesting under Controlled Circumstances

Perhaps the most compelling evidence for our methods comes from the results of a unique policy intervention. In spring 2002 the Chicago public schools provided us the opportunity to retest over 100 classrooms under controlled circumstances that precluded cheating, a few weeks after the initial ITBS exam was administered.23 Based on suspicious answer strings and unusual test score gains on the initial spring 2002 exam, we identified a set of classrooms most likely to have cheated. In addition, we chose a second group of classrooms that had suspicious answer strings but did not have large test score gains, a pattern that is consistent with a bad teacher who attempts to mask a poor teaching performance by cheating. We also retested two sets of classrooms as control groups. The first control group consisted of classes with large test score gains but no evidence of cheating in their answer strings, a potential indication of effective teaching. These classes would be expected to maintain their gains when retested, subject to possible mean reversion. The second control group consisted of randomly selected classrooms.

The results of the retests are reported in Table IV. Columns 1-3 correspond to reading; columns 4-6 are for math. For each subject we report the test score gain (in standard score units) for three periods: (i) from spring 2001 to the initial spring 2002 test; (ii) from the initial spring 2002 test to the retest a few weeks later; and (iii) from spring 2001 to the spring 2002 retest. For purposes of comparison, we also report the average gain for all classrooms in the system for the first of those measures (the other two measures are unavailable unless a classroom was retested).
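In bookkeeping terms, the three gain measures are simple first differences for the subset of students observed on all three administrations; a hedged pandas sketch (hypothetical columns s2001, s2002, and retest2002 holding each student's three standard scores) might read:

    import pandas as pd

    # Keep only students observed on all three administrations, then average
    # the three first differences (in standard score units) within classroom.
    complete = df.dropna(subset=["s2001", "s2002", "retest2002"])
    gains = complete.groupby("classroom_id").apply(
        lambda g: pd.Series({
            "gain_2001_to_2002":   (g["s2002"] - g["s2001"]).mean(),
            "gain_2002_to_retest": (g["retest2002"] - g["s2002"]).mean(),
            "gain_2001_to_retest": (g["retest2002"] - g["s2001"]).mean(),
        })
    )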

Classrooms prospectively identified as "most likely to have cheated" experienced gains on the initial spring 2002 test that were nearly twice as large as the typical CPS classroom (see columns 1 and 4). On the retest, however, those excess gains completely disappeared: the gains between spring 2001 and the spring 2002 retest were close to the systemwide average. Those classrooms prospectively identified as "bad teachers likely to have cheated" also experienced large score declines on the retest (-8.8 and -10.5 standard score points, or roughly two-thirds of a grade equivalent). Indeed, comparing the spring 2001 results with the

23. Full details of this policy experiment are reported in Jacob and Levitt [2003]. The results of the retests were used by CPS to initiate disciplinary action against a number of teachers, principals, and staff.


TABLE IV
RESULTS OF RETESTS: COMPARISON OF RESULTS FOR SPRING 2002 ITBS AND AUDIT TEST

                                             Reading gains between          Math gains between
                                           Spring    Spring    Spring    Spring    Spring    Spring
                                            2001      2001      2002      2001      2001      2002
                                            and       and       and       and       and       and
                                           spring     2002      2002     spring     2002      2002
Category of classroom                       2002     retest    retest     2002     retest    retest
---------------------------------------------------------------------------------------------------
All classrooms in the Chicago
  public schools                            14.3       --        --       16.9       --        --
Classrooms prospectively identified
  as most likely cheaters (N = 36 on
  math, N = 39 on reading)                  28.8      12.6     -16.2      30.0      19.3     -10.7
Classrooms prospectively identified
  as bad teachers suspected of
  cheating (N = 16 on math, N = 20
  on reading)                               16.6       7.8      -8.8      17.3       6.8     -10.5
Classrooms prospectively identified
  as good teachers who did not cheat
  (N = 17 on math, N = 17 on reading)       20.6      21.1       0.5      28.8      25.5      -3.3
Randomly selected classrooms (N = 24
  overall, but only one test per
  classroom)                                14.5      12.2      -2.3      14.5      12.2      -2.3

Results in this table compare test scores at three points in time (spring 2001, spring 2002, and a retest administered under controlled conditions a few weeks after the initial spring 2002 test). The unit of observation is a classroom-subject test pair. Only students taking all three tests are included in the calculations. Because of limited data, math and reading results for the randomly selected classrooms are combined. Only the first column of each subject panel is available for all CPS classrooms, since audits were performed only on a subset of classrooms. All entries in the table are in standard score units.


spring 2002 retest, children in these classrooms had gains only half as large as the typical yearly CPS gain. In stark contrast, classrooms prospectively identified as having "good teachers" (i.e., classrooms with large gains but not suspicious answer strings) actually scored even higher on the reading retest than they did on the initial test. The math scores fell slightly on retesting, but these classrooms continued to experience extremely large gains. The randomly selected classrooms also maintained almost all of their gains when retested, as would be expected. In summary, our methods proved to be extremely effective, both in identifying likely cheaters and in correctly classifying teachers who achieved large test score gains in a legitimate manner.

VI. DOES TEACHER CHEATING RESPOND TO INCENTIVES?

From the perspective of economics, perhaps the most interesting question related to teacher cheating is the degree to which it responds to incentives. As noted in the Introduction, there were two major changes in the incentives faced by teachers and students over our sample period. Prior to 1996, ITBS scores were primarily used to provide teachers and parents with a sense of how a child was progressing academically. Beginning in 1996, with the appointment of Paul Vallas as CEO of Schools, the CPS launched an initiative designed to hold students and teachers accountable for student learning.

The reform had two main elements. The first involved putting schools "on probation" if less than 15 percent of students scored at or above national norms on the ITBS reading exams. The CPS did not use math performance to determine probation status. Probation schools that do not exhibit sufficient improvement may be reconstituted, a procedure that involves closing the school and dismissing or reassigning all of the school staff.24 It is clear from our discussions with teachers and administrators that being on probation is viewed as an extremely undesirable circumstance. The second piece of the accountability reform was an end to social promotion, the practice of passing students to the next grade regardless of their academic skills or school performance. Under the new policy, students in third, sixth, and eighth grade must meet minimum standards on the ITBS in both reading and

24. For a more detailed analysis of the probation policy, see Jacob [2002] and Jacob and Lefgren [forthcoming].


mathematics in order to be promoted to the next grade. The promotion standards were implemented in spring 1997 for third and sixth graders. Promotion decisions are based solely on scores in reading comprehension and mathematics.

Table V presents OLS estimates of the relationship between teacher cheating and a variety of classroom and school characteristics.25 The unit of observation is a classroom×subject×grade×year. The dependent variable is an indicator of whether we suspect the classroom cheated, using our most stringent definition: a classroom is designated a cheater if both its test score fluctuations and the suspiciousness of its answer strings are above the ninety-fifth percentile for that grade, subject, and year.26

In column (1) the policy changes are restricted to have a constant impact across all classrooms. The introduction of the social promotion and probation policies is positively correlated with the likelihood of classroom cheating, although the point estimates are not statistically significant at conventional levels. Much more interesting results emerge when we interact the policy changes with the previous year's test scores for the classroom. For both probation and social promotion, cheating rates in the lowest performing classrooms prove to be quite responsive to the change in incentives. In column (2) we see that a classroom one standard deviation below the mean increased its likelihood of cheating by 0.43 percentage points in response to the school probation policy, and by roughly 0.65 percentage points due to the ending of social promotion. Given a baseline rate of cheating of 1.1 percent, these effects are substantial. The magnitude of these changes is particularly large considering that no elementary school on probation was actually reconstituted during this period, and that the social promotion policy has a direct impact on students but no direct ramifications for teacher pay or rewards. A classroom one standard deviation above the mean does not see any significant change in cheating in response to these two policies. Such classes are very unlikely to be located in schools at risk for being put on probation, and also are likely to have few students at risk for being retained. These findings are robust to

25. Logit and probit models evaluated at the mean yield comparable results, so the estimates from a linear probability model are presented for ease of interpretation.

26. The results are not sensitive to the cheating cutoff used. Note that this measure may include error due to both false positives and negatives. Since the measurement error is in the dependent variable, the primary impact will likely be to simply decrease the precision of our estimates.


TABLE V
OLS ESTIMATES OF THE RELATIONSHIP BETWEEN CHEATING AND CLASSROOM CHARACTERISTICS

Dependent variable = indicator of classroom cheating

Independent variables                                  (1)        (2)        (3)        (4)
----------------------------------------------------------------------------------------------
Social promotion policy                               0.0011     0.0011     0.0015     0.0023
                                                     (0.0013)   (0.0013)   (0.0013)   (0.0009)
School probation policy                               0.0020     0.0019     0.0021     0.0029
                                                     (0.0014)   (0.0014)   (0.0014)   (0.0013)
Prior classroom achievement                          -0.0047    -0.0028    -0.0016    -0.0028
                                                     (0.0005)   (0.0005)   (0.0007)   (0.0007)
Social promotion × classroom achievement                --      -0.0049    -0.0051    -0.0046
                                                                (0.0014)   (0.0014)   (0.0012)
School probation × classroom achievement                --      -0.0070    -0.0070    -0.0064
                                                                (0.0013)   (0.0013)   (0.0013)
Mixed grade classroom                                -0.0084    -0.0085    -0.0089    -0.0089
                                                     (0.0007)   (0.0007)   (0.0008)   (0.0012)
% of students included in official reporting          0.0252     0.0249     0.0141     0.0131
                                                     (0.0031)   (0.0031)   (0.0037)   (0.0037)
Teacher administers exam to own students              0.0067     0.0067     0.0066     0.0061
                                                     (0.0015)   (0.0015)   (0.0015)   (0.0011)
Test form offered for the first time(a)              -0.0007    -0.0007    -0.0011       --
                                                     (0.0011)   (0.0011)   (0.0010)
Average quality of teachers' undergraduate              --         --      -0.0026       --
  institution                                                              (0.0007)
Percent of teachers who have worked at the              --         --      -0.0045       --
  school less than 3 years                                                 (0.0031)
Percent of teachers under 30 years of age               --         --       0.0156       --
                                                                           (0.0065)
Percent of students in the school meeting               --         --       0.0001       --
  national norms in reading last year                                      (0.0000)
Percent free lunch in school                            --         --       0.0001       --
                                                                           (0.0000)
Predominantly Black school                              --         --       0.0068       --
                                                                           (0.0019)
Predominantly Hispanic school                           --         --      -0.0009       --
                                                                           (0.0016)
School×Year fixed effects                               No         No        No         Yes
Number of observations                               163,474    163,474    163,474    163,474

The unit of observation is classroom×grade×year×subject, and the sample includes eight years (1993 to 2000), four subjects (reading comprehension and three math sections), and five grades (three to seven). The dependent variable is the cheating indicator derived using the ninety-fifth percentile cutoff. Robust standard errors clustered by school×year are shown in parentheses. Other variables included in the regressions in columns (1) and (2) include a linear time trend, cubic terms for the number of students, a linear grade variable, and fixed effects for subjects. The regression shown in column (3) also includes the following variables: indicators of the percent of students in the classroom who were Black, Hispanic, male, receiving free lunch, old for grade, in a special education program, living in foster care, and living with a nonparental relative; indicators of school size, mobility rate, and attendance rate; and the percent of teachers in the school who had a master's or doctoral degree, lived in Chicago, and were education majors.

(a) Test forms vary only by year, so this variable will drop out of the analysis when school×year fixed effects are included.


adding a number of classroom and school characteristics (column (3)), as well as school×year fixed effects (column (4)).
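In equation form, the column (2) specification amounts to roughly the following (our notation, simplified from the table; $X_{ct}$ collects the remaining controls listed in the table notes):

    \text{Cheat}_{csgt} = \beta_1 \text{SocialPromotion}_t + \beta_2 \text{Probation}_t
                        + \beta_3 \text{Ach}_{c,t-1}
                        + \beta_4 (\text{SocialPromotion}_t \times \text{Ach}_{c,t-1})
                        + \beta_5 (\text{Probation}_t \times \text{Ach}_{c,t-1})
                        + X_{ct}' \gamma + \varepsilon_{csgt} .

The implied policy response for a classroom with prior achievement $a$ is $\beta_1 + \beta_4 a$ for the end of social promotion and $\beta_2 + \beta_5 a$ for probation; because the estimated $\beta_4$ and $\beta_5$ are negative, the response is concentrated in low-achieving classrooms.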

Cheating also appears to be responsive to other costs and benefits. Classrooms that tested poorly last year are much more likely to cheat. For example, a classroom with average student prior achievement one classroom standard deviation below the mean is 23 percent more likely to cheat. Classrooms with students in multiple grades are 65 percent less likely to cheat than classrooms where all students are in the same grade. This is consistent with the fact that it is likely more difficult for teachers in such classrooms to cheat, since they must administer two different test forms to students, which will necessarily have different correct answers. Moreover, classes with a higher proportion of students who are included in the official test reporting are more likely to cheat: a 10 percentage point increase in the proportion of students in a class whose test scores "count" will increase the likelihood of cheating by roughly 20 percent.27

Teachers who administer the exam to their own students are 0.67 percentage points (approximately 50 percent) more likely to cheat.28

A few other notable results emerge. First, there is no statistically significant impact on cheating of reusing a test form that has been administered in a previous year. That finding is of interest because it suggests that teachers taking old exams and teaching the precise questions to students is not an important component of what we are detecting as cheating (although anecdotal evidence suggests this practice exists). Classrooms in schools with lower achievement, higher poverty rates, and more Black students are more likely to cheat. Classrooms in schools with teachers who graduated from more prestigious undergraduate institutions are less likely to cheat, whereas classrooms in schools with younger teachers are more likely to cheat.

27. Consistent with this finding, within cheating classrooms teachers are less likely to alter the answer sheets for students who are excluded from test reporting. Teachers also appear to cheat less frequently for boys and for older students.

28. In an earlier version of the paper [Jacob and Levitt 2002], we examine for which students teachers tend to cheat. We find that teachers are most likely to cheat for students in the second quartile of the national ability distribution (twenty-fifth to fiftieth percentile) in the previous year, consistent with the incentives under the probation policy.


VII. CONCLUSION

This paper develops an algorithm for determining the prevalence of teacher cheating on standardized tests and applies the results to data from the Chicago public schools. Our methods reveal thousands of instances of classroom cheating, representing 4-5 percent of the classrooms each year. Moreover, we find that teacher cheating appears quite responsive to relatively minor changes in incentives.

The obvious benefits of high-stakes tests as a means of providing incentives must therefore be weighed against the possible distortions that these measures induce. Explicit cheating is merely one form of distortion. Indeed, the kind of cheating we focus on is one that could be greatly reduced at a relatively low cost through the implementation of proper safeguards, such as is done by the Educational Testing Service on the SAT or GRE exams.29

Even if this particular form of cheating were eliminated, however, there are a virtually infinite number of dimensions along which strategic school personnel can act to game the current system. For instance, Figlio and Winicki [2001] and Rohlfs [2002] document that schools attempt to increase the caloric intake of students on test days, and disruptive students are especially likely to be suspended from school during official testing periods [Figlio 2002]. It may be prohibitively costly, or even impossible, to completely safeguard the system.

Finally, this paper fits into a small but growing body of research focused on identifying corrupt or illicit behavior on the part of economic actors (see Porter and Zona [1993], Fisman [2001], Di Tella and Schargrodsky [2001], and Duggan and Levitt [2002]). Because individuals engaged in such behavior actively attempt to hide their trails, the intellectual exercise associated with uncovering their misdeeds differs substantially from the typical economic application, in which the researcher starts with a well-defined measure of the outcome variable (e.g., earnings, economic growth, profits) and then attempts to uncover the determinants of these outcomes. In the case of corruption, there is typically no clear outcome variable, making it necessary for the

29. Although even tests administered by the Educational Testing Service are not immune to cheating, as evidenced by recent reports that many of the GRE scores from China are fraudulent [Contreras 2002].


researcher to employ nonstandard approaches in generating such a measure.

APPENDIX: THE CONSTRUCTION OF SUSPICIOUS STRING MEASURES

This appendix describes in greater detail how we construct each of our measures of unexpected or suspicious responses.

The first measure focuses on the most unlikely block of identical answers given on consecutive questions. Using past test scores, future test scores, and background characteristics, we predict the likelihood that each student will give each answer on each question. For each item a student has four choices (A, B, C, or D), only one of which is correct. We estimate a multinomial logit for each item on the exam in order to predict how students will respond to each question. We estimate the following model for each item, using information from other students in that year, grade, and subject:

(1)    \Pr(Y_{isc} = j) = \frac{e^{\beta_j X_s}}{\sum_{j=1}^{J} e^{\beta_j X_s}} ,

where $Y_{isc}$ indicates the response of student s in class c on item i, the number of possible responses ($J$) is four, and $X_s$ is a vector that includes measures of prior and future student achievement in math and reading, as well as demographic variables (such as race, gender, and free lunch status) for student s. Thus, a student's predicted probability of choosing a particular response is identified by the likelihood of other students (in the same year, grade, and subject) with similar background characteristics choosing that response.

Notice that by including future as well as prior test scores in the model, we decrease the likelihood that students with unusually good teachers will be identified as cheaters, since these students will likely retain some of the knowledge learned in the base year and thus have higher future test scores. Also note that by estimating the probability of selecting each possible response, rather than simply estimating the probability of choosing the correct response, we take advantage of any additional information that is provided by particular response patterns in a classroom.
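A compact sketch of this first stage, substituting scikit-learn's multinomial logistic regression for the estimator above (hypothetical data layout; the returned values serve as the $p_{isc}$ defined below):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def predicted_choice_probs(X: np.ndarray, y: np.ndarray) -> np.ndarray:
        # X: one row per student (prior/future scores and demographics).
        # y: the response each student chose on this item, coded 0-3; all four
        #    responses are assumed to appear in y.
        logit = LogisticRegression(multi_class="multinomial", max_iter=1000)
        logit.fit(X, y)
        probs = logit.predict_proba(X)        # n_students x 4 matrix of probabilities
        # Probability of the answer each student actually chose.
        return probs[np.arange(len(y)), y]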

Using the estimates from this model, we calculate the predicted probability that each student would answer each item in the way that he or she in fact did:

(2)    p_{isc} = \frac{e^{\beta_k X_s}}{\sum_{j=1}^{J} e^{\beta_j X_s}} , \qquad k = \text{response actually chosen by student } s \text{ on item } i .

This provides us with one measure per student per item. Taking the product over items within student, we calculate the probability that a student would have answered a string of consecutive questions from item m to item n as he or she did:

(3)    p_{sc}^{mn} = \prod_{i=m}^{n} p_{isc} .

We then take the product across all students in the classroom who had identical responses in the string. If we define z as a student, $S_{zc}^{mn}$ as the string of responses for student z from item m to item n, and $S_{sc}^{mn}$ as the string for student s, then we can express the product as

(4)    \tilde{p}_{sc}^{mn} = \prod_{s \in \{ z \,:\, S_{zc}^{mn} = S_{sc}^{mn} \}} p_{sc}^{mn} .

Note that if there are $n_s$ students in class c, and each student has a unique set of responses to these particular items, then $\tilde{p}_{sc}^{mn}$ collapses to $p_{sc}^{mn}$ for each student, and there will be $n_s$ distinct values within the class. At the other extreme, if all of the students in class c have identical responses, then there is only one distinct value of $\tilde{p}_{sc}^{mn}$. We repeat this calculation for all possible consecutive strings of length three to seven, that is, for all $S^{mn}$ whose length $n - m + 1$ is between three and seven. To create our first indicator of suspicious string patterns, we take the minimum of the predicted block probability for each classroom:

MEASURE 1.    $M1_c = \min_s ( \tilde{p}_{sc}^{mn} )$ .

This measure captures the least likely block of identical answers given on consecutive questions in the classroom.
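A sketch of the search for the least likely identical block, taking as given a classroom's matrix of predicted probabilities (hypothetical inputs; P[s, i] is $p_{isc}$ for student s's actual answer on item i, and R[s, i] is the answer itself):

    import numpy as np

    def measure_1(P: np.ndarray, R: np.ndarray) -> float:
        # Scan every consecutive window of three to seven items; within each
        # window, group students with identical response strings, multiply
        # their string probabilities together (equation (4)), and keep the
        # smallest block probability observed anywhere in the classroom.
        n_students, n_items = P.shape
        best = 1.0
        for length in range(3, 8):
            for m in range(n_items - length + 1):
                window = slice(m, m + length)
                p_str = P[:, window].prod(axis=1)          # equation (3), per student
                strings = [tuple(row) for row in R[:, window]]
                for s in set(strings):
                    members = [k for k, t in enumerate(strings) if t == s]
                    best = min(best, p_str[members].prod())
        return best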

The second measure of suspicious answer strings is intended to capture more general patterns of similarity in student responses. To construct this measure, we first calculate the residuals for each of the possible choices a student could have made for each item:

(5)    e_{jisc} = \begin{cases} 0 - \dfrac{e^{\beta_j X_s}}{\sum_{j=1}^{J} e^{\beta_j X_s}} & \text{if } j \neq k \\[6pt] 1 - \dfrac{e^{\beta_j X_s}}{\sum_{j=1}^{J} e^{\beta_j X_s}} & \text{if } j = k , \end{cases}

where $e_{jisc}$ is the residual for response j on item i by student s in classroom c. We thus have four separate residuals per student per item.

To create a classroom-level measure of the response to item i, we need to combine the information for each student. First, we sum the residuals for each response across students within a classroom:

(6)    e_{jic} = \sum_s e_{jisc} .

If there is no within-class correlation in the way that students responded to a particular item, this term should be approximately zero. Second, we sum across the four possible responses for each item within classrooms. At the same time, we square each of the component residual measures to accentuate outliers, and we divide by the number of students in the class ($n_{sc}$) to normalize by class size:

(7)    v_{ic} = \frac{\sum_j e_{jic}^2}{n_{sc}} .

The statistic $v_{ic}$ captures something like the variance of student responses on item i within classroom c. Notice that we choose to first sum the residuals of each response across students, and then sum the classroom-level measures for each response, rather than summing across responses within student initially. We do this in order to emphasize the classroom-level tendencies in response patterns.

Our second measure of suspicious strings is simply the classroom average (across items) of this variance term:

MEASURE 2.    $M2_c = \bar{v}_c = \sum_i v_{ic} / n_i$ , where $n_i$ is the number of items on the exam.

Our third measure is simply the variance (as opposed to the mean) in the degree of correlation across questions within a classroom:

MEASURE 3.    $M3_c = \sigma_{v_c}^2 = \sum_i ( v_{ic} - \bar{v}_c )^2 / n_i$ .
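Measures 2 and 3 reduce to a few array operations once the residuals are in hand; a minimal numpy sketch (hypothetical inputs) follows:

    import numpy as np

    def measures_2_and_3(probs: np.ndarray, chosen: np.ndarray):
        # probs[s, i, j]:  predicted probability that student s picks response j on item i.
        # chosen[s, i, j]: 1 if student s actually picked response j on item i, else 0.
        resid = chosen - probs                         # e_jisc, equation (5)
        e_jic = resid.sum(axis=0)                      # equation (6): items x responses
        n_students = probs.shape[0]
        v_ic = (e_jic ** 2).sum(axis=1) / n_students   # equation (7), one value per item
        m2 = v_ic.mean()                               # Measure 2: mean across items
        m3 = v_ic.var()                                # Measure 3: variance across items
        return m2, m3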

Our final indicator focuses on the extent to which a student's response pattern was different from those of other students with the same aggregate score that year. Let $q_{isc}$ equal one if student s in classroom c answered item i correctly, and zero otherwise. Let $A_s$ equal the aggregate score of student s on the exam. We then determine what fraction of students at each aggregate score level answered each item correctly. If we let $n_{sA}$ equal the number of students with an aggregate score of A, then this fraction $\bar{q}_i^A$ can be expressed as

(8)    \bar{q}_i^A = \frac{\sum_{s \in \{ z \,:\, A_z = A_s \}} q_{isc}}{n_{sA}} .

We then calculate a measure of how much the response pattern of student s differed from the response pattern of other students with the same aggregate score. We do so by subtracting a student's answer on item i from the mean response of all students with aggregate score A, squaring these deviations, and then summing across all items on the exam:

(9)    Z_{sc} = \sum_i ( q_{isc} - \bar{q}_i^A )^2 .

We then subtract out the mean deviation for all students with the same aggregate score, $\bar{Z}_A$, and sum across the students within each classroom to obtain our final indicator:

MEASURE 4.    $M4_c = \sum_s ( Z_{sc} - \bar{Z}_A )$ .
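A pandas sketch of this last construction (hypothetical layout: correct holds one row of 0/1 item columns per student, over all students in a given grade, subject, and year, plus a classroom_id column):

    import pandas as pd

    def measure_4(correct: pd.DataFrame) -> pd.Series:
        items = correct.drop(columns=["classroom_id"])
        agg = items.sum(axis=1)                         # A_s, each student's aggregate score
        q_bar = items.groupby(agg).transform("mean")    # equation (8), by score level
        z = ((items - q_bar) ** 2).sum(axis=1)          # Z_sc, equation (9)
        z_bar = z.groupby(agg).transform("mean")        # mean deviation at each score level
        return (z - z_bar).groupby(correct["classroom_id"]).sum()   # M4_c per classroom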

KENNEDY SCHOOL OF GOVERNMENT, HARVARD UNIVERSITY

UNIVERSITY OF CHICAGO AND AMERICAN BAR FOUNDATION

REFERENCES

Aiken, Linda R., "Detecting, Understanding, and Controlling for Cheating on Tests," Research in Higher Education, XXXII (1991), 725–736.

Angoff, William H., "The Development of Statistical Indices for Detecting Cheaters," Journal of the American Statistical Association, LXIX (1974), 44–49.

Baker, George, "Incentive Contracts and Performance Measurement," Journal of Political Economy, C (1992), 598–614.

Bellezza, Francis S., and Suzanne F. Bellezza, "Detection of Cheating on Multiple-Choice Tests by Using Error Similarity Analysis," Teaching of Psychology, XVI (1989), 151–155.

Carnoy, Martin, and Susanna Loeb, "Does External Accountability Affect Student Outcomes? A Cross-State Analysis," Working Paper, Stanford University School of Education, 2002.

Contreras, Russell, "Inquiry Uncovers Possible Cheating on GRE in Asia," Associated Press, August 7, 2002.

Cullen, Julie Berry, and Randall Reback, "Tinkering Toward Accolades: School Gaming under a Performance Accountability System," Working Paper, University of Michigan, 2002.

Deere, Donald, and Wayne Strayer, "Putting Schools to the Test: School Accountability, Incentives, and Behavior," Working Paper, Texas A&M University, 2002.

Di Tella, Rafael, and Ernesto Schargrodsky, "The Role of Wages and Auditing during a Crackdown on Corruption in the City of Buenos Aires," unpublished manuscript, Harvard Business School, 2001.

Duggan, Mark, and Steven Levitt, "Winning Isn't Everything: Corruption in Sumo Wrestling," American Economic Review, XCII (2002), 1594–1605.

Figlio, David N., "Testing, Crime and Punishment," Working Paper, University of Florida, 2002.

Figlio, David N., and Joshua Winicki, "Food for Thought: The Effects of School Accountability Plans on School Nutrition," Working Paper, University of Florida, 2001.

Figlio, David N., and Lawrence S. Getzler, "Accountability, Ability and Disability: Gaming the System," Working Paper, University of Florida, 2002.

Fisman, Ray, "Estimating the Value of Political Connections," American Economic Review, XCI (2001), 1095–1102.

Frary, Robert B., "Statistical Detection of Multiple-Choice Answer Copying: Review and Commentary," Applied Measurement in Education, VI (1993), 153–165.

Frary, Robert B., T. Nicholas Tideman, and Thomas Watts, "Indices of Cheating on Multiple-Choice Tests," Journal of Educational Statistics, II (1977), 235–256.

Glaeser, Edward, and Andrei Shleifer, "A Reason for Quantity Regulation," American Economic Review, XCI (2001), 431–435.

Grissmer, David W., Ann Flanagan, et al., "Improving Student Achievement: What NAEP Test Scores Tell Us," RAND Corporation, MR-924-EDU, 2000.

Hanushek, Eric A., and Margaret E. Raymond, "Improving Educational Quality: How Best to Evaluate Our Schools," Conference paper, Federal Reserve Bank of Boston, 2002.

Hofkins, Diane, "Cheating 'Rife' in National Tests," New York Times Educational Supplement, June 16, 1995.

Holland, Paul W., "Assessing Unusual Agreement between the Incorrect Answers of Two Examinees Using the K-Index: Statistical Theory and Empirical Support," Technical Report No. 96-5, Educational Testing Service, 1996.

Holmstrom, Bengt, and Paul Milgrom, "Multitask Principal-Agent Analyses: Incentive Contracts, Asset Ownership, and Job Design," Journal of Law, Economics, and Organization, VII (1991), 24–51.

Jacob, Brian A., "Accountability, Incentives and Behavior: The Impact of High-Stakes Testing in the Chicago Public Schools," Working Paper No. 8968, National Bureau of Economic Research, 2002.

Jacob, Brian A., and Lars Lefgren, "The Impact of Teacher Training on Student Achievement: Quasi-Experimental Evidence from School Reform Efforts in Chicago," Journal of Human Resources, forthcoming.

Jacob, Brian A., and Steven Levitt, "Catching Cheating Teachers: The Results of an Unusual Experiment in Implementing Theory," Working Paper No. 9413, National Bureau of Economic Research, 2003.

Klein, Stephen P., Laura S. Hamilton, et al., "What Do Test Scores in Texas Tell Us?" RAND Corporation, 2000.

Kolker, Claudia, "Texas Offers Hard Lessons on School Accountability," Los Angeles Times, April 14, 1999.

Lindsay, Drew, "Whodunit? Officials Find Thousands of Erasures on Standardized Tests and Suspect Tampering," Education Week (1996), 25–29.

Loughran, Regina, and Thomas Comiskey, "Cheating the Children: Educator Misconduct on Standardized Tests," Report of the City of New York Special Commissioner of Investigation for the New York City School District, 1999.

Marcus, John, "Faking the Grade," Boston Magazine, February 2000.

May, Meredith, "State Fears Cheating by Teachers," San Francisco Chronicle, October 4, 2000.

Perlman, Carol L., "Results of a Citywide Testing Program Audit in Chicago," Paper for American Educational Research Association annual meeting, Chicago, IL, 1985.

Porter, Robert H., and J. Douglas Zona, "Detection of Bid Rigging in Procurement Auctions," Journal of Political Economy, CI (1993), 518–538.

Richards, Craig E., and Tian Ming Sheu, "The South Carolina School Incentive Reward Program: A Policy Analysis," Economics of Education Review, XI (1992), 71–86.

Rohlfs, Chris, "Estimating Classroom Learning Using the School Breakfast Program as an Instrument for Attendance and Class Size: Evidence from Chicago Public Schools," Unpublished manuscript, University of Chicago, 2002.

Tysome, Tony, "Cheating Purge: Inspectors out," New York Times Higher Education Supplement, August 19, 1994.

Wollack, James A., "A Nominal Response Model Approach for Detecting Answer Copying," Applied Psychological Measurement, XX (1997), 144–152.


Carnoy Martin and Susanna Loeb ldquoDoes External Accountability Affect Student

875ROTTEN APPLES

Outcomes A Cross-State Analysisrdquo Working Paper Stanford UniversitySchool of Education 2002

Contreras Russell ldquoInquiry Uncovers Possible Cheating on GRE in Asiardquo Asso-ciated Press August 7 2002

Cullen Julie Berry and Randall Reback ldquoTinkering Toward Accolades SchoolGaming under a Performance Accountability Systemrdquo Working paper Uni-versity of Michigan 2002

Deere Donald and Wayne Strayer ldquoPutting Schools to the Test School Account-ability Incentives and Behaviorrdquo Working Paper Texas AampM University2002

DiTella Rafael and Ernesto Schargrodsky ldquoThe Role of Wages and Auditingduring a Crackdown on Corruption in the City of Buenos Airesrdquo unpublishedmanuscript Harvard Business School 2001

Duggan Mark and Steven Levitt ldquoWinning Isnrsquot Everything Corruption in SumoWrestlingrdquo American Economic Review XCII (2002) 1594ndash1605

Figlio David N ldquoTesting Crime and Punishmentrdquo Working Paper University ofFlorida 2002

Figlio David N and Joshua Winicki ldquoFood for Thought The Effects of SchoolAccountability Plans on School Nutritionrdquo Working Paper University ofFlorida 2001

Figlio David N and Lawrence S Getzler ldquoAccountability Ability and DisabilityGaming the Systemrdquo Working Paper University of Florida 2002

Fisman Ray ldquoEstimating the Value of Political Connectionsrdquo American EconomicReview XCI (2001) 1095ndash1102

Frary Robert B ldquoStatistical Detection of Multiple-Choice Answer Copying Re-view and Commentaryrdquo Applied Measurement in Education VI (1993)153ndash165

Frary Robert B T Nicholas Tideman and Thomas Watts ldquoIndices of Cheatingon Multiple-Choice Testsrdquo Journal of Educational Statistics II (1977)235ndash256

Glaeser Edward and Andrei Shleifer ldquoA Reason for Quantity Regulationrdquo Amer-ican Economic Review XCI (2001) 431ndash435

Grissmer David W Ann Flanagan et al ldquoImproving Student AchievementWhat NAEP Test Scores Tell Usrdquo RAND Corporation MR-924-EDU 2000

Hanushek Eric A and Margaret E Raymond ldquoImproving Educational QualityHow Best to Evaluate Our Schoolsrdquo Conference paper Federal Reserve Bankof Boston 2002

Hofkins Diane ldquoCheating lsquoRifersquo in National Testsrdquo New York Times EducationalSupplement June 16 1995

Holland Paul W ldquoAssessing Unusual Agreement between the Incorrect Answersof Two Examinees Using the K-Index Statistical Theory and EmpiricalSupportrdquo Technical Report No 96-5 Educational Testing Service 1996

Holmstrom Bengt and Paul Milgrom ldquoMultitask Principal-Agent Analyses In-centive Contracts Asset Ownership and Job Designrdquo Journal of Law Eco-nomics and Organization VII (1991) 24ndash51

Jacob Brian A ldquoAccountability Incentives and Behavior The Impact of High-Stakes Testing in the Chicago Public Schoolsrdquo Working Paper No 8968National Bureau of Economic Research 2002

Jacob Brian A and Lars Lefgren ldquoThe Impact of Teacher Training on StudentAchievement Quasi-Experimental Evidence from School Reform Efforts inChicagordquo Journal of Human Resources forthcoming

Jacob Brian A and Steven Levitt ldquoCatching Cheating Teachers The Results ofan Unusual Experiment in Implementing Theoryrdquo Working Paper No 9413National Bureau of Economic Research 2003

Klein Stephen P Laura S Hamilton et al ldquoWhat Do Test Scores in Texas TellUsrdquo RAND Corporation 2000

Kolker Claudia ldquoTexas Offers Hard Lessons on School Accountabilityrdquo LosAngeles Times April 14 1999

Lindsay Drew ldquoWhodunit Ofcials Find Thousands of Erasures on Standard-ized Tests and Suspect Tamperingrdquo Education Week (1996) 25ndash29

Loughran Regina and Thomas Comiskey ldquoCheating the Children Educator

876 QUARTERLY JOURNAL OF ECONOMICS

Misconduct on Standardized Testsrdquo Report of the City of New York SpecialCommissioner of Investigation for the New York City School District 1999

Marcus John ldquoFaking the Graderdquo Boston Magazine February 2000May Meredith ldquoState Fears Cheating by Teachersrdquo San Francisco Chronicle

October 4 2000Perlman Carol L ldquoResults of a Citywide Testing Program Audit in Chicagordquo

Paper for American Educational Research Association annual meeting Chi-cago IL 1985

Porter Robert H and J Douglas Zona ldquoDetection of Bid Rigging in ProcurementAuctionsrdquo Journal of Political Economy CI (1993) 518ndash538

Richards Craig E and Tian Ming Sheu ldquoThe South Carolina School IncentiveReward Program A Policy Analysisrdquo Economics of Education Review XI(1992) 71ndash86

Rohlfs Chris ldquoEstimating Classroom Learning Using the School Breakfast Pro-gram as an Instrument for Attendance and Class Size Evidence from ChicagoPublic Schoolsrdquo Unpublished manuscript University of Chicago 2002

Tysome Tony ldquoCheating Purge Inspectors outrdquo New York Times Higher Educa-tion Supplement August 19 1994

Wollack James A ldquoA Nominal Response Model Approach for Detecting AnswerCopyingrdquo Applied Psychological Measurement XX (1997) 144ndash152

877ROTTEN APPLES

Page 23: rotten apples: an investigation of the prevalence and predictors of ...

V.C. Evidence from Retesting under Controlled Circumstances

Perhaps the most compelling evidence for our methods comes from the results of a unique policy intervention. In spring 2002 the Chicago public schools provided us the opportunity to retest over 100 classrooms under controlled circumstances that precluded cheating, a few weeks after the initial ITBS exam was administered.[23] Based on suspicious answer strings and unusual test score gains on the initial spring 2002 exam, we identified a set of classrooms most likely to have cheated. In addition, we chose a second group of classrooms that had suspicious answer strings but did not have large test score gains, a pattern that is consistent with a bad teacher who attempts to mask a poor teaching performance by cheating. We also retested two sets of classrooms as control groups. The first control group consisted of classes with large test score gains but no evidence of cheating in their answer strings, a potential indication of effective teaching. These classes would be expected to maintain their gains when retested, subject to possible mean reversion. The second control group consisted of randomly selected classrooms.

The results of the retests are reported in Table IV. Columns 1–3 correspond to reading; columns 4–6 are for math. For each subject we report the test score gain (in standard score units) for three periods: (i) from spring 2001 to the initial spring 2002 test, (ii) from the initial spring 2002 test to the retest a few weeks later, and (iii) from spring 2001 to the spring 2002 retest. For purposes of comparison, we also report the average gain for all classrooms in the system for the first of those measures (the other two measures are unavailable unless a classroom was retested).

Classrooms prospectively identified as "most likely to have cheated" experienced gains on the initial spring 2002 test that were nearly twice as large as the typical CPS classroom (see columns 1 and 4). On the retest, however, those excess gains completely disappeared: the gains between spring 2001 and the spring 2002 retest were close to the systemwide average. Those classrooms prospectively identified as "bad teachers likely to have cheated" also experienced large score declines on the retest (-8.8 and -10.5 standard score points, or roughly two-thirds of a grade equivalent). Indeed, comparing the spring 2001 results with the spring 2002 retest, children in these classrooms had gains only half as large as the typical yearly CPS gain. In stark contrast, classrooms prospectively identified as having "good teachers" (i.e., classrooms with large gains but not suspicious answer strings) actually scored even higher on the reading retest than they did on the initial test. The math scores fell slightly on retesting, but these classrooms continued to experience extremely large gains. The randomly selected classrooms also maintained almost all of their gains when retested, as would be expected. In summary, our methods proved to be extremely effective both in identifying likely cheaters and in correctly classifying teachers who achieved large test score gains in a legitimate manner.

23. Full details of this policy experiment are reported in Jacob and Levitt [2003]. The results of the retests were used by CPS to initiate disciplinary action against a number of teachers, principals, and staff.


TABLE IV
RESULTS OF RETESTS: COMPARISON OF RESULTS FOR SPRING 2002 ITBS AND AUDIT TEST

                                       Reading gains between:             Math gains between:
                                   Spring 01-  Spring 01-  Spring 02-  Spring 01-  Spring 01-  Spring 02-
Category of classroom              spring 02   02 retest   02 retest   spring 02   02 retest   02 retest

All classrooms in the Chicago
  public schools                      14.3         --          --         16.9         --          --
Classrooms prospectively
  identified as most likely
  cheaters (N = 36 on math;
  N = 39 on reading)                  28.8        12.6       -16.2        30.0        19.3       -10.7
Classrooms prospectively
  identified as bad teachers
  suspected of cheating (N = 16
  on math; N = 20 on reading)         16.6         7.8        -8.8        17.3         6.8       -10.5
Classrooms prospectively
  identified as good teachers
  who did not cheat (N = 17 on
  math; N = 17 on reading)            20.6        21.1         0.5        28.8        25.5        -3.3
Randomly selected classrooms
  (N = 24 overall, but only one
  test per classroom)                 14.5        12.2        -2.3        14.5        12.2        -2.3

Results in this table compare test scores at three points in time (spring 2001, spring 2002, and a retest administered under controlled conditions a few weeks after the initial spring 2002 test). The unit of observation is a classroom-subject test pair. Only students taking all three tests are included in the calculations. Because of limited data, math and reading results for the randomly selected classrooms are combined. Only the first two columns are available for all CPS classrooms, since audits were performed only on a subset of classrooms. All entries in the table are in standard score units.



VI. DOES TEACHER CHEATING RESPOND TO INCENTIVES?

From the perspective of economics, perhaps the most interesting question related to teacher cheating is the degree to which it responds to incentives. As noted in the Introduction, there were two major changes in the incentives faced by teachers and students over our sample period. Prior to 1996, ITBS scores were primarily used to provide teachers and parents with a sense of how a child was progressing academically. Beginning in 1996, with the appointment of Paul Vallas as CEO of Schools, the CPS launched an initiative designed to hold students and teachers accountable for student learning.

The reform had two main elements. The first involved putting schools "on probation" if less than 15 percent of students scored at or above national norms on the ITBS reading exams. The CPS did not use math performance to determine probation status. Probation schools that do not exhibit sufficient improvement may be reconstituted, a procedure that involves closing the school and dismissing or reassigning all of the school staff.[24] It is clear from our discussions with teachers and administrators that being on probation is viewed as an extremely undesirable circumstance. The second piece of the accountability reform was an end to social promotion, the practice of passing students to the next grade regardless of their academic skills or school performance. Under the new policy, students in third, sixth, and eighth grade must meet minimum standards on the ITBS in both reading and mathematics in order to be promoted to the next grade. The promotion standards were implemented in spring 1997 for third and sixth graders. Promotion decisions are based solely on scores in reading comprehension and mathematics.

24. For a more detailed analysis of the probation policy, see Jacob [2002] and Jacob and Lefgren [forthcoming].



Table V presents OLS estimates of the relationship between teacher cheating and a variety of classroom and school characteristics.[25] The unit of observation is a classroom-subject-grade-year. The dependent variable is an indicator of whether we suspect the classroom cheated, using our most stringent definition: a classroom is designated a cheater if both its test score fluctuations and the suspiciousness of its answer strings are above the ninety-fifth percentile for that grade, subject, and year.[26]
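Computationally, this indicator is simple once the two classroom-level measures exist. The sketch below is a minimal illustration, assuming a pandas data frame with one row per classroom-subject-grade-year and hypothetical column names (score_fluctuation, string_suspicion) that are not from the paper's data:

import pandas as pd

def flag_cheaters(df, cutoff=0.95):
    # Within each grade-subject-year cell, compute the 95th percentile of
    # each measure and flag classrooms that exceed both cutoffs.
    grp = df.groupby(["grade", "subject", "year"])
    gain_cut = grp["score_fluctuation"].transform(lambda s: s.quantile(cutoff))
    string_cut = grp["string_suspicion"].transform(lambda s: s.quantile(cutoff))
    return (df["score_fluctuation"] > gain_cut) & (df["string_suspicion"] > string_cut)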

In column (1) the policy changes are restricted to have a constant impact across all classrooms. The introduction of the social promotion and probation policies is positively correlated with the likelihood of classroom cheating, although the point estimates are not statistically significant at conventional levels. Much more interesting results emerge when we interact the policy changes with the previous year's test scores for the classroom. For both probation and social promotion, cheating rates in the lowest performing classrooms prove to be quite responsive to the change in incentives. In column (2) we see that a classroom one standard deviation below the mean increased its likelihood of cheating by 0.43 percentage points in response to the school probation policy, and roughly 0.65 percentage points due to the ending of social promotion. Given a baseline rate of cheating of 1.1 percent, these effects are substantial. The magnitude of these changes is particularly large considering that no elementary school on probation was actually reconstituted during this period, and that the social promotion policy has a direct impact on students but no direct ramifications for teacher pay or rewards. A classroom one standard deviation above the mean does not see any significant change in cheating in response to these two policies. Such classes are very unlikely to be located in schools at risk for being put on probation, and also are likely to have few students at risk for being retained. These findings are robust to adding a number of classroom and school characteristics (column (3)) as well as school-year fixed effects (column (4)).

25. Logit and Probit models evaluated at the mean yield comparable results, so the estimates from a linear probability model are presented for ease of interpretation.

26. The results are not sensitive to the cheating cutoff used. Note that this measure may include error due to both false positives and negatives. Since the measurement error is in the dependent variable, the primary impact will likely be to simply decrease the precision of our estimates.


TABLE V
OLS ESTIMATES OF THE RELATIONSHIP BETWEEN CHEATING AND CLASSROOM CHARACTERISTICS

                                               Dependent variable = Indicator of classroom cheating
Independent variables                              (1)          (2)          (3)          (4)

Social promotion policy                          0.0011       0.0011       0.0015       0.0023
                                                (0.0013)     (0.0013)     (0.0013)     (0.0009)
School probation policy                          0.0020       0.0019       0.0021       0.0029
                                                (0.0014)     (0.0014)     (0.0014)     (0.0013)
Prior classroom achievement                     -0.0047      -0.0028      -0.0016      -0.0028
                                                (0.0005)     (0.0005)     (0.0007)     (0.0007)
Social promotion x classroom achievement           --        -0.0049      -0.0051      -0.0046
                                                             (0.0014)     (0.0014)     (0.0012)
School probation x classroom achievement           --        -0.0070      -0.0070      -0.0064
                                                             (0.0013)     (0.0013)     (0.0013)
Mixed grade classroom                           -0.0084      -0.0085      -0.0089      -0.0089
                                                (0.0007)     (0.0007)     (0.0008)     (0.0012)
% of students included in official reporting     0.0252       0.0249       0.0141       0.0131
                                                (0.0031)     (0.0031)     (0.0037)     (0.0037)
Teacher administers exam to own students         0.0067       0.0067       0.0066       0.0061
                                                (0.0015)     (0.0015)     (0.0015)     (0.0011)
Test form offered for the first time (a)        -0.0007      -0.0007      -0.0011         --
                                                (0.0011)     (0.0011)     (0.0010)
Average quality of teachers'
  undergraduate institution                                               -0.0026
                                                                          (0.0007)
Percent of teachers who have worked at
  the school less than 3 years                                            -0.0045
                                                                          (0.0031)
Percent of teachers under 30 years of age                                  0.0156
                                                                          (0.0065)
Percent of students in the school meeting
  national norms in reading last year                                     -0.0001
                                                                          (0.0000)
Percent free lunch in school                                               0.0001
                                                                          (0.0000)
Predominantly Black school                                                 0.0068
                                                                          (0.0019)
Predominantly Hispanic school                                             -0.0009
                                                                          (0.0016)
School-year fixed effects                           No           No           No          Yes
Number of observations                           163,474      163,474      163,474      163,474

The unit of observation is classroom-grade-year-subject, and the sample includes eight years (1993 to 2000), four subjects (reading comprehension and three math sections), and five grades (three to seven). The dependent variable is the cheating indicator derived using the ninety-fifth percentile cutoff. Robust standard errors clustered by school-year are shown in parentheses. Other variables included in the regressions in columns (1) and (2) include a linear time trend, cubic terms for the number of students, a linear grade variable, and fixed effects for subjects. The regression shown in column (3) also includes the following variables: indicators of the percent of students in the classroom who were Black, Hispanic, male, receiving free lunch, old for grade, in a special education program, living in foster care, and living with a nonparental relative; indicators of school size, mobility rate, and attendance rate; and indicators of the percent of teachers in the school. Other variables include the percent of teachers in the school who had a master's or doctoral degree, lived in Chicago, and were education majors.

(a) Test forms vary only by year, so this variable will drop out of the analysis when school-year fixed effects are included.


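As a rough guide to what the Table V specifications involve, the following sketch fits a linear probability model of the cheating indicator with standard errors clustered by school-year, using statsmodels. The outcome y, design matrix X, and cluster labels are assumed to be built elsewhere; nothing here is specific to the paper's data.

import statsmodels.api as sm

def fit_lpm(y, X, school_year_ids):
    # Linear probability model with school-year clustered standard errors.
    X = sm.add_constant(X)
    return sm.OLS(y, X).fit(cov_type="cluster",
                            cov_kwds={"groups": school_year_ids})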

Cheating also appears to be responsive to other costs and benefits. Classrooms that tested poorly last year are much more likely to cheat. For example, a classroom with average student prior achievement one classroom standard deviation below the mean is 23 percent more likely to cheat. Classrooms with students in multiple grades are 65 percent less likely to cheat than classrooms where all students are in the same grade. This is consistent with the fact that it is likely more difficult for teachers in such classrooms to cheat, since they must administer two different test forms to students, which will necessarily have different correct answers. Moreover, classes with a higher proportion of students who are included in the official test reporting are more likely to cheat: a 10 percentage point increase in the proportion of students in a class whose test scores "count" will increase the likelihood of cheating by roughly 20 percent.[27]

Teachers who administer the exam to their own students are 0.67 percentage points (approximately 50 percent) more likely to cheat.[28]

A few other notable results emerge. First, there is no statistically significant impact on cheating of reusing a test form that has been administered in a previous year. That finding is of interest because it suggests that teachers taking old exams and teaching the precise questions to students is not an important component of what we are detecting as cheating (although anecdotal evidence suggests this practice exists). Classrooms in schools with lower achievement, higher poverty rates, and more Black students are more likely to cheat. Classrooms in schools with teachers who graduated from more prestigious undergraduate institutions are less likely to cheat, whereas classrooms in schools with younger teachers are more likely to cheat.

27. Consistent with this finding, within cheating classrooms teachers are less likely to alter the answer sheets for students who are excluded from test reporting. Teachers also appear to cheat less frequently for boys and for older students.

28. In an earlier version of the paper [Jacob and Levitt 2002], we examine for which students teachers tend to cheat. We find that teachers are most likely to cheat for students in the second quartile of the national ability distribution (twenty-fifth to fiftieth percentile) in the previous year, consistent with the incentives under the probation policy.


VII. CONCLUSION

This paper develops an algorithm for determining the prevalence of teacher cheating on standardized tests and applies the results to data from the Chicago public schools. Our methods reveal thousands of instances of classroom cheating, representing 4–5 percent of the classrooms each year. Moreover, we find that teacher cheating appears quite responsive to relatively minor changes in incentives.

The obvious benefits of high-stakes tests as a means of providing incentives must therefore be weighed against the possible distortions that these measures induce. Explicit cheating is merely one form of distortion. Indeed, the kind of cheating we focus on is one that could be greatly reduced at a relatively low cost through the implementation of proper safeguards, such as is done by the Educational Testing Service on the SAT or GRE exams.[29]

Even if this particular form of cheating were eliminated, however, there are a virtually infinite number of dimensions along which strategic school personnel can act to game the current system. For instance, Figlio and Winicki [2001] and Rohlfs [2002] document that schools attempt to increase the caloric intake of students on test days, and disruptive students are especially likely to be suspended from school during official testing periods [Figlio 2002]. It may be prohibitively costly, or even impossible, to completely safeguard the system.

Finally, this paper fits into a small but growing body of research focused on identifying corrupt or illicit behavior on the part of economic actors (see Porter and Zona [1993], Fisman [2001], Di Tella and Schargrodsky [2001], and Duggan and Levitt [2002]). Because individuals engaged in such behavior actively attempt to hide their trails, the intellectual exercise associated with uncovering their misdeeds differs substantially from the typical economic application, in which the researcher starts with a well-defined measure of the outcome variable (e.g., earnings, economic growth, profits) and then attempts to uncover the determinants of these outcomes. In the case of corruption there is typically no clear outcome variable, making it necessary for the researcher to employ nonstandard approaches in generating such a measure.

29. Although even tests administered by the Educational Testing Service are not immune to cheating, as evidenced by recent reports that many of the GRE scores from China are fraudulent [Contreras 2002].

APPENDIX: THE CONSTRUCTION OF SUSPICIOUS STRING MEASURES

This appendix describes in greater detail how we construct each of our measures of unexpected or suspicious responses.

The first measure focuses on the most unlikely block of identical answers given on consecutive questions. Using past test scores, future test scores, and background characteristics, we predict the likelihood that each student will give each answer on each question. For each item a student has four choices (A, B, C, or D), only one of which is correct. We estimate a multinomial logit for each item on the exam in order to predict how students will respond to each question. We estimate the following model for each item, using information from other students in that year, grade, and subject:

(1)   $\Pr(Y_{isc} = j) = e^{\beta_j X_s} / \sum_{j=1}^{J} e^{\beta_j X_s}$,

where $Y_{isc}$ indicates the response of student $s$ in class $c$ on item $i$, the number of possible responses ($J$) is four, and $X_s$ is a vector that includes measures of prior and future student achievement in math and reading, as well as demographic variables (such as race, gender, and free lunch status) for student $s$. Thus a student's predicted probability of choosing a particular response is identified by the likelihood of other students (in the same year, grade, and subject) with similar background characteristics choosing that response.
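As a concrete illustration of this step, the sketch below fits one such per-item model with scikit-learn's multinomial logistic regression standing in for whatever estimator was actually used; the function and variable names are our own assumptions, not the paper's code:

from sklearn.linear_model import LogisticRegression

def fit_item_model(X, y):
    # X: (n_students, n_covariates) prior/future scores and demographics
    # for all students in the same year, grade, and subject.
    # y: responses to this item, coded 0-3 for choices A-D.
    model = LogisticRegression(multi_class="multinomial", max_iter=1000)
    model.fit(X, y)
    return model

# model.predict_proba(X) then returns, for each student, the predicted
# probability of each of the four responses, as in equation (1).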

Notice that by including future as well as prior test scores in the model, we decrease the likelihood that students with unusually good teachers will be identified as cheaters, since these students will likely retain some of the knowledge learned in the base year and thus have higher future test scores. Also note that by estimating the probability of selecting each possible response, rather than simply estimating the probability of choosing the correct response, we take advantage of any additional information that is provided by particular response patterns in a classroom.

Using the estimates from this model, we calculate the predicted probability that each student would answer each item in the way that he or she in fact did:

(2)   $p_{isc} = e^{\beta_k X_s} / \sum_{j=1}^{J} e^{\beta_j X_s}$, where $k$ is the response actually chosen by student $s$ on item $i$.
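In code, equation (2) simply indexes each item's predicted-probability matrix by the answer actually chosen; a minimal sketch under the same assumed names:

import numpy as np

def prob_of_chosen(probs, answers):
    # probs: (n_students, 4) output of predict_proba for one item.
    # answers: the response (0-3) each student actually gave.
    return probs[np.arange(len(answers)), answers]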

This provides us with one measure per student per item. Taking the product over items within student, we calculate the probability that a student would have answered a string of consecutive questions from item $m$ to item $n$ as he or she did:

(3)   $p_{sc}^{mn} = \prod_{i=m}^{n} p_{isc}$.

We then take the product across all students in the classroom who had identical responses in the string. If we define $z$ as a student, $S_{zc}^{mn}$ as the string of responses for student $z$ from item $m$ to item $n$, and $S_{sc}^{mn}$ as the string for student $s$, then we can express the product as

(4)   $\bar{p}_{sc}^{mn} = \prod_{z \in \{z : S_{zc}^{mn} = S_{sc}^{mn}\}} p_{zc}^{mn}$.

Note that if there are $n_s$ students in class $c$ and each student has a unique set of responses to these particular items, then $\bar{p}_{sc}^{mn}$ collapses to $p_{sc}^{mn}$ for each student, and there will be $n_s$ distinct values within the class. At the other extreme, if all of the students in class $c$ have identical responses, then there is only one distinct value of $\bar{p}_{sc}^{mn}$. We repeat this calculation for all possible consecutive strings of length three to seven, that is, for all $S^{mn}$ such that $3 \le n - m + 1 \le 7$. To create our first indicator of suspicious string patterns, we take the minimum of the predicted block probability for each classroom:

MEASURE 1: $M1_c = \min_s(\bar{p}_{sc}^{mn})$.

This measure captures the least likely block of identical answers given on consecutive questions in the classroom.
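A direct, if brute-force, rendering of Measure 1 follows. It assumes p[s, i] holds each student's equation-(2) probability for item i and answers[s, i] the chosen option, both restricted to a single classroom; the names are illustrative only.

import numpy as np

def measure_1(p, answers, min_len=3, max_len=7):
    n_students, n_items = p.shape
    best = np.inf
    for length in range(min_len, max_len + 1):
        for m in range(n_items - length + 1):
            block = answers[:, m:m + length]
            # Equation (3): each student's probability of his or her own string.
            p_own = p[:, m:m + length].prod(axis=1)
            for s in range(n_students):
                # Equation (4): pool students whose string matches student s.
                same = (block == block[s]).all(axis=1)
                best = min(best, p_own[same].prod())
    return best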

The second measure of suspicious answer strings is intended to capture more general patterns of similarity in student responses. To construct this measure, we first calculate the residuals for each of the possible choices a student could have made for each item:

(5)   $e_{jisc} = 0 - e^{\beta_j X_s} / \sum_{j=1}^{J} e^{\beta_j X_s}$ if $j \neq k$,
      $e_{jisc} = 1 - e^{\beta_j X_s} / \sum_{j=1}^{J} e^{\beta_j X_s}$ if $j = k$,

where $e_{jisc}$ is the residual for response $j$ on item $i$ by student $s$ in classroom $c$. We thus have four separate residuals per student per item.

To create a classroom-level measure of the response to item $i$, we need to combine the information for each student. First, we sum the residuals for each response across students within a classroom:

(6)   $\bar{e}_{jic} = \sum_s e_{jisc}$.

If there is no within-class correlation in the way that students responded to a particular item, this term should be approximately zero. Second, we sum across the four possible responses for each item within classrooms. At the same time, we square each of the component residual measures to accentuate outliers, and divide by the number of students in the class ($n_{sc}$) to normalize by class size:

(7)   $v_{ic} = \sum_j \bar{e}_{jic}^2 / n_{sc}$.

The statistic $v_{ic}$ captures something like the variance of student responses on item $i$ within classroom $c$. Notice that we choose to first sum the residuals of each response across students, and then sum the classroom-level measures for each response, rather than summing across responses within student initially. We do this in order to emphasize the classroom-level tendencies in response patterns.

Our second measure of suspicious strings is simply the classroom average (across items) of this variance term:

MEASURE 2: $M2_c = \bar{v}_c = \sum_i v_{ic} / n_i$, where $n_i$ is the number of items on the exam.

Our third measure is simply the variance (as opposed to the mean) in the degree of correlation across questions within a classroom:

MEASURE 3: $M3_c = \sigma^2_{v_c} = \sum_i (v_{ic} - \bar{v}_c)^2 / n_i$.
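Measures 2 and 3 can be sketched together, assuming probs[s, i, j] holds the predicted probability that student s gives response j on item i and answers[s, i] the chosen response, again for the students of a single classroom and with assumed names:

import numpy as np

def measures_2_and_3(probs, answers):
    n_students, n_items, n_choices = probs.shape
    # Equation (5): indicator of the chosen response minus its predicted
    # probability, giving one residual per student, item, and choice.
    chosen = np.zeros_like(probs)
    s_idx, i_idx = np.meshgrid(np.arange(n_students), np.arange(n_items),
                               indexing="ij")
    chosen[s_idx, i_idx, answers] = 1.0
    resid = chosen - probs
    # Equation (6): sum residuals across students within the classroom.
    e_bar = resid.sum(axis=0)
    # Equation (7): square, sum over the four choices, normalize by class size.
    v = (e_bar ** 2).sum(axis=1) / n_students
    return v.mean(), v.var()  # Measure 2, Measure 3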

Our final indicator focuses on the extent to which a student's response pattern differed from those of other students with the same aggregate score that year. Let $q_{isc}$ equal one if student $s$ in classroom $c$ answered item $i$ correctly, and zero otherwise. Let $A_s$ equal the aggregate score of student $s$ on the exam. We then determine what fraction of students at each aggregate score level answered each item correctly. If we let $n_{sA}$ equal the number of students with an aggregate score of $A$, then this fraction $\bar{q}_i^A$ can be expressed as

(8)   $\bar{q}_i^A = \sum_{s \in \{z : A_z = A_s\}} q_{isc} / n_{sA}$.

We then calculate a measure of how much the response pattern of student $s$ differed from the response pattern of other students with the same aggregate score. We do so by subtracting a student's answer on item $i$ from the mean response of all students with aggregate score $A$, squaring these deviations, and then summing across all items on the exam:

(9)   $Z_{sc} = \sum_i (q_{isc} - \bar{q}_i^A)^2$.

We then subtract out the mean deviation for all students with the same aggregate score, $\bar{Z}_A$, and sum across the students within each classroom to obtain our final indicator:

MEASURE 4: $M4_c = \sum_s (Z_{sc} - \bar{Z}_A)$.
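A sketch of this final indicator, assuming correct[s, i] is a 0/1 matrix of right answers for every student in the year-grade-subject, agg the vector of aggregate scores, and classroom the classroom label of each student (all names hypothetical):

import numpy as np

def measure_4(correct, agg, classroom):
    z = np.empty(correct.shape[0])
    for a in np.unique(agg):
        mask = agg == a
        q_bar = correct[mask].mean(axis=0)                # equation (8)
        dev = ((correct[mask] - q_bar) ** 2).sum(axis=1)  # equation (9)
        z[mask] = dev - dev.mean()                        # subtract mean deviation for score A
    # Measure 4: sum the demeaned deviations within each classroom.
    return {c: z[classroom == c].sum() for c in np.unique(classroom)}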

KENNEDY SCHOOL OF GOVERNMENT, HARVARD UNIVERSITY

UNIVERSITY OF CHICAGO AND AMERICAN BAR FOUNDATION

REFERENCES

Aiken, Linda R., "Detecting, Understanding, and Controlling for Cheating on Tests," Research in Higher Education, XXXII (1991), 725–736.

Angoff, William H., "The Development of Statistical Indices for Detecting Cheaters," Journal of the American Statistical Association, LXIX (1974), 44–49.

Baker, George, "Incentive Contracts and Performance Measurement," Journal of Political Economy, C (1992), 598–614.

Bellezza, Francis S., and Suzanne F. Bellezza, "Detection of Cheating on Multiple-Choice Tests by Using Error Similarity Analysis," Teaching of Psychology, XVI (1989), 151–155.

Carnoy, Martin, and Susanna Loeb, "Does External Accountability Affect Student Outcomes? A Cross-State Analysis," Working Paper, Stanford University School of Education, 2002.

Contreras, Russell, "Inquiry Uncovers Possible Cheating on GRE in Asia," Associated Press, August 7, 2002.

Cullen, Julie Berry, and Randall Reback, "Tinkering Toward Accolades: School Gaming under a Performance Accountability System," Working Paper, University of Michigan, 2002.

Deere, Donald, and Wayne Strayer, "Putting Schools to the Test: School Accountability, Incentives and Behavior," Working Paper, Texas A&M University, 2002.

Di Tella, Rafael, and Ernesto Schargrodsky, "The Role of Wages and Auditing during a Crackdown on Corruption in the City of Buenos Aires," unpublished manuscript, Harvard Business School, 2001.

Duggan, Mark, and Steven Levitt, "Winning Isn't Everything: Corruption in Sumo Wrestling," American Economic Review, XCII (2002), 1594–1605.

Figlio, David N., "Testing, Crime and Punishment," Working Paper, University of Florida, 2002.

Figlio, David N., and Joshua Winicki, "Food for Thought: The Effects of School Accountability Plans on School Nutrition," Working Paper, University of Florida, 2001.

Figlio, David N., and Lawrence S. Getzler, "Accountability, Ability and Disability: Gaming the System," Working Paper, University of Florida, 2002.

Fisman, Ray, "Estimating the Value of Political Connections," American Economic Review, XCI (2001), 1095–1102.

Frary, Robert B., "Statistical Detection of Multiple-Choice Answer Copying: Review and Commentary," Applied Measurement in Education, VI (1993), 153–165.

Frary, Robert B., T. Nicholas Tideman, and Thomas Watts, "Indices of Cheating on Multiple-Choice Tests," Journal of Educational Statistics, II (1977), 235–256.

Glaeser, Edward, and Andrei Shleifer, "A Reason for Quantity Regulation," American Economic Review, XCI (2001), 431–435.

Grissmer, David W., Ann Flanagan, et al., "Improving Student Achievement: What NAEP Test Scores Tell Us," RAND Corporation, MR-924-EDU, 2000.

Hanushek, Eric A., and Margaret E. Raymond, "Improving Educational Quality: How Best to Evaluate Our Schools?" Conference paper, Federal Reserve Bank of Boston, 2002.

Hofkins, Diane, "Cheating 'Rife' in National Tests," New York Times Educational Supplement, June 16, 1995.

Holland, Paul W., "Assessing Unusual Agreement between the Incorrect Answers of Two Examinees Using the K-Index: Statistical Theory and Empirical Support," Technical Report No. 96-5, Educational Testing Service, 1996.

Holmstrom, Bengt, and Paul Milgrom, "Multitask Principal-Agent Analyses: Incentive Contracts, Asset Ownership, and Job Design," Journal of Law, Economics, and Organization, VII (1991), 24–51.

Jacob, Brian A., "Accountability, Incentives and Behavior: The Impact of High-Stakes Testing in the Chicago Public Schools," Working Paper No. 8968, National Bureau of Economic Research, 2002.

Jacob, Brian A., and Lars Lefgren, "The Impact of Teacher Training on Student Achievement: Quasi-Experimental Evidence from School Reform Efforts in Chicago," Journal of Human Resources, forthcoming.

Jacob, Brian A., and Steven Levitt, "Catching Cheating Teachers: The Results of an Unusual Experiment in Implementing Theory," Working Paper No. 9413, National Bureau of Economic Research, 2003.

Klein, Stephen P., Laura S. Hamilton, et al., "What Do Test Scores in Texas Tell Us?" RAND Corporation, 2000.

Kolker, Claudia, "Texas Offers Hard Lessons on School Accountability," Los Angeles Times, April 14, 1999.

Lindsay, Drew, "Whodunit? Officials Find Thousands of Erasures on Standardized Tests and Suspect Tampering," Education Week (1996), 25–29.

Loughran, Regina, and Thomas Comiskey, "Cheating the Children: Educator Misconduct on Standardized Tests," Report of the City of New York Special Commissioner of Investigation for the New York City School District, 1999.

Marcus, John, "Faking the Grade," Boston Magazine, February 2000.

May, Meredith, "State Fears Cheating by Teachers," San Francisco Chronicle, October 4, 2000.

Perlman, Carol L., "Results of a Citywide Testing Program Audit in Chicago," Paper for American Educational Research Association annual meeting, Chicago, IL, 1985.

Porter, Robert H., and J. Douglas Zona, "Detection of Bid Rigging in Procurement Auctions," Journal of Political Economy, CI (1993), 518–538.

Richards, Craig E., and Tian Ming Sheu, "The South Carolina School Incentive Reward Program: A Policy Analysis," Economics of Education Review, XI (1992), 71–86.

Rohlfs, Chris, "Estimating Classroom Learning Using the School Breakfast Program as an Instrument for Attendance and Class Size: Evidence from Chicago Public Schools," unpublished manuscript, University of Chicago, 2002.

Tysome, Tony, "Cheating Purge: Inspectors Out," New York Times Higher Education Supplement, August 19, 1994.

Wollack, James A., "A Nominal Response Model Approach for Detecting Answer Copying," Applied Psychological Measurement, XX (1997), 144–152.

Page 24: rotten apples: an investigation of the prevalence and predictors of ...

TABLE IVRESULTS OF RETESTS COMPARISON OF RESULTS FOR SPRING 2002 ITBS

AND AUDIT TEST

Category ofclassroom

Reading gains between Math gains between

Spring2001and

spring2002

Spring2001and2002retest

Spring2002and2002retest

Spring2001and

spring2002

Spring2001and2002retest

Spring2002and2002retest

All classrooms inthe Chicagopublic schools 143 mdash mdash 169 mdash mdash

Classroomsprospectivelyidentied asmost likelycheaters(N 5 36 onmath N 5 39on reading) 288 126 2162 300 193 2107

Classroomsprospectivelyidentied as badteacherssuspected ofcheating(N 5 16 onmath N 5 20on reading) 166 78 288 173 68 2105

Classroomsprospectivelyidentied asgood teacherswho did notcheat (N 5 17on math N 517 on reading) 206 211 105 288 255 233

Randomly selectedclassrooms (N 524 overall butonly one test perclassroom) 145 122 223 145 122 223

Results in this table compare test scores at three points in time (spring 2001 spring 2002 and a retestadministered under controlled conditions a few weeks after the initial spring 2002 test) The unit ofobservation is a classroom-subject test pair Only students taking all three tests are included in thecalculations Because of limited data math and reading results for the randomly selected classrooms arecombined Only the rst two columns are available for all CPS classrooms since audits were performed onlyon a subset of classrooms All entries in the table are in standard score units

866 QUARTERLY JOURNAL OF ECONOMICS

spring 2002 retest children in these classrooms had gains onlyhalf as large as the typical yearly CPS gain In stark contrastclassrooms prospectively identied as having ldquogood teachersrdquo(ie classrooms with large gains but not suspicious answerstrings) actually scored even higher on the reading retest thanthey did on the initial test The math scores fell slightly onretesting but these classrooms continued to experience extremelylarge gains The randomly selected classrooms also maintainedalmost all of their gains when retested as would be expected Insummary our methods proved to be extremely effective both inidentifying likely cheaters and in correctly classifying teacherswho achieved large test score gains in a legitimate manner

VI DOES TEACHER CHEATING RESPOND TO INCENTIVES

From the perspective of economics perhaps the most inter-esting question related to teacher cheating is the degree to whichit responds to incentives As noted in the Introduction there weretwo major changes in the incentives faced by teachers and stu-dents over our sample period Prior to 1996 ITBS scores wereprimarily used to provide teachers and parents with a sense ofhow a child was progressing academically Beginning in 1996with the appointment of Paul Vallas as CEO of Schools the CPSlaunched an initiative designed to hold students and teachersaccountable for student learning

The reform had two main elements The rst involved put-ting schools ldquoon probationrdquo if less than 15 percent of studentsscored at or above national norms on the ITBS reading examsThe CPS did not use math performance to determine probationstatus Probation schools that do not exhibit sufcient improve-ment may be reconstituted a procedure that involves closing theschool and dismissing or reassigning all of the school staff24 It isclear from our discussions with teachers and administrators thatbeing on probation is viewed as an extremely undesirable circum-stance The second piece of the accountability reform was an endto social promotionmdashthe practice of passing students to the nextgrade regardless of their academic skills or school performanceUnder the new policy students in third sixth and eighth grademust meet minimum standards on the ITBS in both reading and

24 For a more detailed analysis of the probation policy see Jacob [2002] andJacob and Lefgren [forthcoming]

867ROTTEN APPLES

mathematics in order to be promoted to the next grade Thepromotion standards were implemented in spring 1997 for thirdand sixth graders Promotion decisions are based solely on scoresin reading comprehension and mathematics

Table V presents OLS estimates of the relationship betweenteacher cheating and a variety of classroom and school charac-teristics25 The unit of observation is a classroomsubjectgradeyear The dependent variable is an indicator of whether we sus-pect the classroom cheated using our most stringent denition aclassroom is designated a cheater if both its test score uctua-tions and the suspiciousness of its answer strings are above theninety-fth percentile for that grade subject and year26

In column (1) the policy changes are restricted to have aconstant impact across all classrooms The introduction of thesocial promotion and probation policies is positively correlatedwith the likelihood of classroom cheating although the pointestimates are not statistically signicant at conventional levelsMuch more interesting results emerge when we interact thepolicy changes with the previous yearrsquos test scores for the class-room For both probation and social promotion cheating rates inthe lowest performing classrooms prove to be quite responsive tothe change in incentives In column (2) we see that a classroomone standard deviation below the mean increased its likelihood ofcheating by 043 percentage points in response to the schoolprobation policy and roughly 065 percentage points due to theending of social promotion Given a baseline rate of cheating of11 percent these effects are substantial The magnitude of thesechanges is particularly large considering that no elementaryschool on probation was actually reconstituted during this periodand that the social promotion policy has a direct impact on stu-dents but not direct ramications for teacher pay or rewards Aclassroom one standard deviation above the mean does not seeany signicant change in cheating in response to these two poli-cies Such classes are very unlikely to be located in schools at riskfor being put on probation and also are likely to have few stu-dents at risk for being retained These ndings are robust to

25 Logit and Probit models evaluated at the mean yield comparable resultsso the estimates from a linear probability model are presented for ease ofinterpretation

26 The results are not sensitive to the cheating cutoff used Note that thismeasure may include error due to both false positives and negatives Since themeasurement error is in the dependent variable the primary impact will likely beto simply decrease the precision of our estimates

868 QUARTERLY JOURNAL OF ECONOMICS

TABLE VOLS ESTIMATES OF THE RELATIONSHIP BETWEEN CHEATING AND CLASSROOM

CHARACTERISTICS

Independent variables

Dependent variable 5 Indicator ofclassroom cheating

(1) (2) (3) (4)

Social promotion policy 00011(00013)

00011(00013)

00015(00013)

00023(00009)

School probation policy 00020(00014)

00019(00014)

00021(00014)

00029(00013)

Prior classroom achievement 200047(00005)

200028(00005)

200016(00007)

200028(00007)

Social promotionclassroom achievement 200049(00014)

200051(00014)

200046(00012)

School probationclassroom achievement 200070(00013)

200070(00013)

200064(00013)

Mixed grade classroom 200084(00007)

200085(00007)

200089(00008)

200089(00012)

of students included in ofcial reporting 00252(00031)

00249(00031)

00141(00037)

00131(00037)

Teacher administers exam to ownstudents

00067(00015)

00067(00015)

00066(00015)

00061(00011)

Test form offered for the rst timea 200007(00011)

200007(00011)

200011(00010)

mdash

Average quality of teachersrsquoundergraduate institution

200026(00007)

Percent of teachers who have worked atthe school less than 3 years

200045(00031)

Percent of teachers under 30 years of age 00156(00065)

Percent of students in the school meetingnational norms in reading last year

00001(00000)

Percent free lunch in school 00001(00000)

Predominantly Black school 00068(00019)

Predominantly Hispanic school 200009(00016)

SchoolYear xed effects No No No YesNumber of observations 163474 163474 163474 163474

The unit of observation is classroomgradeyearsubject and the sample includes eight years (1993 to2000) four subjects (reading comprehension and three math sections) and ve grades (three to seven) Thedependentvariable is the cheating indicator derived using the ninety-fth percentile cutoff Robust standarderrors clustered by schoolyear are shown in parentheses Other variables included in the regressions incolumn (1) and (2) include a linear time trend grade cubic terms for the number of students a linear gradevariable and xed effects for subjects The regression shown in column (3) also includes the followingvariables indicators of the percentofstudents in theclassroom who wereBlack Hispanic male receivingfree lunchold for grade in a special education program living in foster care and living with a nonparental relative indicators ofschool size mobility rate and attendance rate and indicators of the percent of teachers in the school Other variablesinclude thepercentof teachersin the schoolwho hada masterrsquos ordoctoraldegreelivedin Chicagoand wereeducationmajors

a Test forms vary only by year so this variable will drop out of the analysis when schoolyear xed effectsare included

869ROTTEN APPLES

adding a number of classroom and school characteristics (column(3)) as well as schoolyear xed effects (column (4))

Cheating also appears to be responsive to other costs andbenets Classrooms that tested poorly last year are much morelikely to cheat For example a classroom with average studentprior achievement one classroom standard deviation below themean is 23 percent more likely to cheat Classrooms with stu-dents in multiple grades are 65 percent less likely to cheat thanclassrooms where all students are in the same grade This isconsistent with the fact that it is likely more difcult for teachersin such classrooms to cheat since they must administer twodifferent test forms to students which will necessarily have dif-ferent correct answers Moreover classes with a higher propor-tion of students who are included in the ofcial test reporting aremore likely to cheatmdasha 10 percentage point increase in the pro-portion of students in a class whose test scores ldquocountrdquo willincrease the likelihood of cheating by roughly 20 percent27

Teachers who administer the exam to their own students are 067percentage pointsmdashapproximately 50 percentmdashmore likely tocheat28

A few other notable results emerge First there is no statis-tically signicant impact on cheating of reusing a test form thathas been administered in a previous year That nding is ofinterest because it suggests that teachers taking old exams andteaching the precise questions to students is not an importantcomponent of what we are detecting as cheating (although anec-dotal evidence suggests this practice exists) Classrooms inschools with lower achievement higher poverty rates and moreBlack students are more likely to cheat Classrooms in schoolswith teachers who graduated from more prestigious undergrad-uate institutions are less likely to cheat whereas classrooms inschools with younger teachers are more likely to cheat

27 Consistent with this nding within cheating classrooms teachers areless likely to alter the answer sheets for students who are excluded from testreporting Teachers also appear to less frequently cheat for boys and for olderstudents

28 In an earlier version of the paper [Jacob and Levitt 2002] we examine forwhich students teachers tend to cheat We nd that teachers are most likely tocheat for students in the second quartile of the national ability distribution(twenty-fth to ftieth percentile) in the previous year consistent with the incen-tives under the probation policy

870 QUARTERLY JOURNAL OF ECONOMICS

VII CONCLUSION

This paper develops an algorithm for determining the preva-lence of teacher cheating on standardized tests and applies theresults to data from the Chicago public schools Our methodsreveal thousands of instances of classroom cheating representing4ndash5 percent of the classrooms each year Moreover we nd thatteacher cheating appears quite responsive to relatively minorchanges in incentives

The obvious benets of high-stakes tests as a means of pro-viding incentives must therefore be weighed against the possibledistortions that these measures induce Explicit cheating ismerely one form of distortion Indeed the kind of cheating wefocus on is one that could be greatly reduced at a relatively lowcost through the implementation of proper safeguards such as isdone by Educational Testing Service on the SAT or GRE exams29

Even if this particular form of cheating were eliminatedhowever there are a virtually innite number of dimensionsalong which strategic school personnel can act to game the cur-rent system For instance Figlio and Winicki [2001] and Rohlfs[2002] document that schools attempt to increase the caloricintake of students on test days and disruptive students areespecially likely to be suspended from school during ofcial test-ing periods [Figlio 2002] It may be prohibitively costly or evenimpossible to completely safeguard the system

Finally this paper ts into a small but growing body ofresearch focused on identifying corrupt or illicit behavior on thepart of economic actors (see Porter and Zona [1993] Fisman[2001] Di Tella and Schargrodsky [2001] and Duggan and Levitt[2002]) Because individuals engaged in such behavior activelyattempt to hide their trails the intellectual exercise associatedwith uncovering their misdeeds differs substantially from thetypical economic application in which the researcher starts witha well-dened measure of the outcome variable (eg earningseconomic growth prots) and then attempts to uncover the de-terminants of these outcomes In the case of corruption there istypically no clear outcome variable making it necessary for the

29 Although even tests administered by the Educational Testing Service arenot immune to cheating as evidenced by recent reports that many of the GREscores from China are fraudulent [Contreras 2002]

871ROTTEN APPLES

researcher to employ nonstandard approaches in generating sucha measure

APPENDIX THE CONSTRUCTION OF SUSPICIOUS STRING MEASURES

This appendix describes in greater detail how we constructeach of our measures of unexpected or suspicious responses

The rst measure focuses on the most unlikely block of iden-tical answers given on consecutive questions Using past testscores future test scores and background characteristics wepredict the likelihood that each student will give each answer oneach question For each item a student has four choices (A B Cor D) only one of which is correct We estimate a multinomiallogit for each item on the exam in order to predict how studentswill respond to each question We estimate the following modelfor each item using information from other students in that yeargrade and subject

(1) Pr~Yisc 5 j 5eb jxs

Sj51J eb jxs

where Y isc indicates the response of student s in class c on itemi the number of possible responses ( J) is four and Xs is a vectorthat includes measures of prior and future student achievementin math and reading as well as demographic variables (such asrace gender and free lunch status) for student s Thus a stu-dentrsquos predicted probability of choosing a particular response isidentied by the likelihood of other students (in the same yeargrade and subject) with similar background characteristicschoosing that response

Notice that by including future as well as prior test scores inthe model we decrease the likelihood that students with unusu-ally good teachers will be identied as cheaters since these stu-dents will likely retain some of the knowledge learned in the baseyear and thus have higher future test scores Also note that byestimating the probability of selecting each possible responserather than simply estimating the probability of choosing thecorrect response we take advantage of any additional informa-tion that is provided by particular response patterns in aclassroom

Using the estimates from this model we calculate the pre-dicted probability that each student would answer each item in

872 QUARTERLY JOURNAL OF ECONOMICS

the way that he or she in fact did

(2) pisc 5ebkxs

S j51J eb j xs

for k

5 response actually chosen by student s on item i

This provides us with one measure per student per item Takingthe product over items within student we calculate the probabil-ity that a student would have answered a string of consecutivequestions from item m to item n as he or she did

(3) pscmn 5 P

i5m

n

pisc

We then take the product across all students in the classroomwho had identical responses in the string If we dene z as astudent Szc

m n as the string of responses for student z from item mto item n and S sc

m n and as the string for student s then we canexpress the product as

(4) pscmn 5 P

s[$ zS icmn5 S sc

mn

pscmn

Note that if there are ns students in class c and each student hasa unique set of responses to these particular items then psc

m n

collapses to pscm n for each student and there will be ns distinct

values within the class On the other extreme if all of the stu-dents in class c have identical responses then there is only onedistinct value of psc

m n We repeat this calculation for all possibleconsecutive strings of length three to seven that is for all Sm n

such that 3 m 2 n 7 To create our rst indicator ofsuspicious string patterns we take the minimum of the predictedblock probability for each classroom

MEASURE 1 M1c 5 mins( ps cm n)

This measure captures the least likely block of identical answersgiven on consecutive questions in the classroom

The second measure of suspicious answer strings is intendedto capture more general patterns of similarity in student re-sponses To construct this measure we rst calculate the resid-

873ROTTEN APPLES

uals for each of the possible choices a student could have made foreach item

(5) e jisc 5 0 2eb jxs

S j51J eb j xs

if j THORN k

5 1 2eb j xs

S j51J eb jxs

if j 5 k

where e jis c is the residual for response j on item i by student s inclassroom c We thus have four separate residuals per studentper item

To create a classroom level measure of the response to item iwe need to combine the information for each student First wesum the residuals for each response across students within aclassroom

(6) ejic 5 Os

e jisc

If there is no within-class correlation in the way that students responded to a particular item, this term should be approximately zero. Second, we sum across the four possible responses for each item within classrooms. At the same time, we square each of the component residual measures to accentuate outliers, and divide by the number of students in the class ($n_{sc}$) to normalize by class size:

(7)   $v_{ic} = \dfrac{\sum_j e_{jic}^2}{n_{sc}}$

The statistic $v_{ic}$ captures something like the variance of student responses on item $i$ within classroom $c$. Notice that we choose first to sum the residuals of each response across students, and only then to combine the classroom-level measures across responses, rather than initially summing across responses within student. We do this in order to emphasize the classroom-level tendencies in response patterns.

Our second measure of suspicious strings is simply the classroom average (across items) of this variance term across all test items:

MEASURE 2: $M2_c = \bar{v}_c = \sum_i v_{ic} / n_i$, where $n_i$ is the number of items on the exam.

Our third measure is simply the variance (as opposed to the mean) in the degree of correlation across questions within a classroom:

MEASURE 3: $M3_c = \sigma_{v_c}^2 = \sum_i (v_{ic} - \bar{v}_c)^2 / n_i$
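Given a classroom's residual array from equation (5), equations (6)-(7) and Measures 2 and 3 reduce to a few axis reductions. A sketch with synthetic residuals (the array shapes are hypothetical):

```python
import numpy as np

# Hypothetical residuals for one classroom:
# e[s, i, j] = equation-(5) residual of student s, item i, response j.
rng = np.random.default_rng(0)
n_students, n_items = 25, 40
e = rng.normal(scale=0.4, size=(n_students, n_items, 4))

e_jic = e.sum(axis=0)                          # equation (6): sum over students
v_ic = (e_jic ** 2).sum(axis=1) / n_students   # equation (7): one value per item

M2 = v_ic.mean()                               # Measure 2: mean of v_ic over items
M3 = ((v_ic - v_ic.mean()) ** 2).mean()        # Measure 3: variance of v_ic
```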

Our final indicator focuses on the extent to which a student's response pattern differs from those of other students with the same aggregate score that year. Let $q_{isc}$ equal one if student $s$ in classroom $c$ answered item $i$ correctly, and zero otherwise. Let $A_s$ equal the aggregate score of student $s$ on the exam. We then determine what fraction of students at each aggregate score level answered each item correctly. If we let $n_{sA}$ equal the number of students with an aggregate score of $A$, then this fraction $\bar{q}_i^A$ can be expressed as

(8)   $\bar{q}_i^A = \dfrac{\sum_{s \in \{z \,:\, A_z = A_s\}} q_{isc}}{n_{sA}}$

We then calculate a measure of how much the response pattern of student $s$ differed from the response pattern of other students with the same aggregate score. We do so by subtracting a student's answer on item $i$ from the mean response of all students with aggregate score $A$, squaring these deviations, and then summing across all items on the exam:

(9)   $Z_{sc} = \sum_i (q_{isc} - \bar{q}_i^A)^2$

We then subtract out the mean deviation for all students with the same aggregate score, $\bar{Z}_A$, and sum over the students within each classroom to obtain our final indicator:

MEASURE 4: $M4_c = \sum_s (Z_{sc} - \bar{Z}_A)$
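A compact sketch of equations (8)-(9) and Measure 4, with hypothetical arrays: `q` marks correct answers, `agg` holds aggregate scores, and `cls` assigns students to classrooms.

```python
import numpy as np

rng = np.random.default_rng(0)
n_students, n_items = 200, 40
q = rng.integers(0, 2, size=(n_students, n_items))   # q[s, i]: 1 if correct
agg = q.sum(axis=1)                                  # A_s: aggregate scores
cls = rng.integers(0, 8, size=n_students)            # classroom of each student

# Equation (8): row s holds the item-level success rates of the students
# who share student s's aggregate score.
q_bar = np.vstack([q[agg == a].mean(axis=0) for a in agg])

# Equation (9): squared deviation of each student from same-score peers.
Z = ((q - q_bar) ** 2).sum(axis=1)

# Measure 4: demean Z at each aggregate score, then sum within classrooms.
Z_bar = np.array([Z[agg == a].mean() for a in agg])
M4 = {c: (Z - Z_bar)[cls == c].sum() for c in np.unique(cls)}
```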

KENNEDY SCHOOL OF GOVERNMENT, HARVARD UNIVERSITY

UNIVERSITY OF CHICAGO AND AMERICAN BAR FOUNDATION

REFERENCES

Aiken, Linda R., "Detecting, Understanding, and Controlling for Cheating on Tests," Research in Higher Education, XXXII (1991), 725–736.

Angoff, William H., "The Development of Statistical Indices for Detecting Cheaters," Journal of the American Statistical Association, LXIX (1974), 44–49.

Baker, George, "Incentive Contracts and Performance Measurement," Journal of Political Economy, C (1992), 598–614.

Bellezza, Francis S., and Suzanne F. Bellezza, "Detection of Cheating on Multiple-Choice Tests by Using Error Similarity Analysis," Teaching of Psychology, XVI (1989), 151–155.

Carnoy, Martin, and Susanna Loeb, "Does External Accountability Affect Student Outcomes? A Cross-State Analysis," Working Paper, Stanford University School of Education, 2002.

Contreras, Russell, "Inquiry Uncovers Possible Cheating on GRE in Asia," Associated Press, August 7, 2002.

Cullen, Julie Berry, and Randall Reback, "Tinkering Toward Accolades: School Gaming under a Performance Accountability System," Working Paper, University of Michigan, 2002.

Deere, Donald, and Wayne Strayer, "Putting Schools to the Test: School Accountability, Incentives, and Behavior," Working Paper, Texas A&M University, 2002.

Di Tella, Rafael, and Ernesto Schargrodsky, "The Role of Wages and Auditing during a Crackdown on Corruption in the City of Buenos Aires," unpublished manuscript, Harvard Business School, 2001.

Duggan, Mark, and Steven Levitt, "Winning Isn't Everything: Corruption in Sumo Wrestling," American Economic Review, XCII (2002), 1594–1605.

Figlio, David N., "Testing, Crime and Punishment," Working Paper, University of Florida, 2002.

Figlio, David N., and Joshua Winicki, "Food for Thought: The Effects of School Accountability Plans on School Nutrition," Working Paper, University of Florida, 2001.

Figlio, David N., and Lawrence S. Getzler, "Accountability, Ability and Disability: Gaming the System," Working Paper, University of Florida, 2002.

Fisman, Ray, "Estimating the Value of Political Connections," American Economic Review, XCI (2001), 1095–1102.

Frary, Robert B., "Statistical Detection of Multiple-Choice Answer Copying: Review and Commentary," Applied Measurement in Education, VI (1993), 153–165.

Frary, Robert B., T. Nicholas Tideman, and Thomas Watts, "Indices of Cheating on Multiple-Choice Tests," Journal of Educational Statistics, II (1977), 235–256.

Glaeser, Edward, and Andrei Shleifer, "A Reason for Quantity Regulation," American Economic Review, XCI (2001), 431–435.

Grissmer, David W., Ann Flanagan, et al., "Improving Student Achievement: What NAEP Test Scores Tell Us," RAND Corporation, MR-924-EDU, 2000.

Hanushek, Eric A., and Margaret E. Raymond, "Improving Educational Quality: How Best to Evaluate Our Schools?" Conference paper, Federal Reserve Bank of Boston, 2002.

Hofkins, Diane, "Cheating 'Rife' in National Tests," New York Times Educational Supplement, June 16, 1995.

Holland, Paul W., "Assessing Unusual Agreement between the Incorrect Answers of Two Examinees Using the K-Index: Statistical Theory and Empirical Support," Technical Report No. 96-5, Educational Testing Service, 1996.

Holmstrom, Bengt, and Paul Milgrom, "Multitask Principal-Agent Analyses: Incentive Contracts, Asset Ownership, and Job Design," Journal of Law, Economics, and Organization, VII (1991), 24–51.

Jacob, Brian A., "Accountability, Incentives and Behavior: The Impact of High-Stakes Testing in the Chicago Public Schools," Working Paper No. 8968, National Bureau of Economic Research, 2002.

Jacob, Brian A., and Lars Lefgren, "The Impact of Teacher Training on Student Achievement: Quasi-Experimental Evidence from School Reform Efforts in Chicago," Journal of Human Resources, forthcoming.

Jacob, Brian A., and Steven Levitt, "Catching Cheating Teachers: The Results of an Unusual Experiment in Implementing Theory," Working Paper No. 9413, National Bureau of Economic Research, 2003.

Klein, Stephen P., Laura S. Hamilton, et al., "What Do Test Scores in Texas Tell Us?" RAND Corporation, 2000.

Kolker, Claudia, "Texas Offers Hard Lessons on School Accountability," Los Angeles Times, April 14, 1999.

Lindsay, Drew, "Whodunit? Officials Find Thousands of Erasures on Standardized Tests and Suspect Tampering," Education Week (1996), 25–29.

Loughran, Regina, and Thomas Comiskey, "Cheating the Children: Educator Misconduct on Standardized Tests," Report of the City of New York Special Commissioner of Investigation for the New York City School District, 1999.

Marcus, John, "Faking the Grade," Boston Magazine, February 2000.

May, Meredith, "State Fears Cheating by Teachers," San Francisco Chronicle, October 4, 2000.

Perlman, Carol L., "Results of a Citywide Testing Program Audit in Chicago," Paper for American Educational Research Association annual meeting, Chicago, IL, 1985.

Porter, Robert H., and J. Douglas Zona, "Detection of Bid Rigging in Procurement Auctions," Journal of Political Economy, CI (1993), 518–538.

Richards, Craig E., and Tian Ming Sheu, "The South Carolina School Incentive Reward Program: A Policy Analysis," Economics of Education Review, XI (1992), 71–86.

Rohlfs, Chris, "Estimating Classroom Learning Using the School Breakfast Program as an Instrument for Attendance and Class Size: Evidence from Chicago Public Schools," unpublished manuscript, University of Chicago, 2002.

Tysome, Tony, "Cheating Purge: Inspectors Out," New York Times Higher Education Supplement, August 19, 1994.

Wollack, James A., "A Nominal Response Model Approach for Detecting Answer Copying," Applied Psychological Measurement, XX (1997), 144–152.





Page 27: rotten apples: an investigation of the prevalence and predictors of ...

TABLE V
OLS ESTIMATES OF THE RELATIONSHIP BETWEEN CHEATING AND CLASSROOM CHARACTERISTICS

Dependent variable = indicator of classroom cheating

Independent variables                                        (1)        (2)        (3)        (4)
Social promotion policy                                    0.0011     0.0011     0.0015     0.0023
                                                          (0.0013)   (0.0013)   (0.0013)   (0.0009)
School probation policy                                    0.0020     0.0019     0.0021     0.0029
                                                          (0.0014)   (0.0014)   (0.0014)   (0.0013)
Prior classroom achievement                               -0.0047    -0.0028    -0.0016    -0.0028
                                                          (0.0005)   (0.0005)   (0.0007)   (0.0007)
Social promotion × classroom achievement                              -0.0049    -0.0051    -0.0046
                                                                     (0.0014)   (0.0014)   (0.0012)
School probation × classroom achievement                              -0.0070    -0.0070    -0.0064
                                                                     (0.0013)   (0.0013)   (0.0013)
Mixed grade classroom                                     -0.0084    -0.0085    -0.0089    -0.0089
                                                          (0.0007)   (0.0007)   (0.0008)   (0.0012)
% of students included in official reporting               0.0252     0.0249     0.0141     0.0131
                                                          (0.0031)   (0.0031)   (0.0037)   (0.0037)
Teacher administers exam to own students                   0.0067     0.0067     0.0066     0.0061
                                                          (0.0015)   (0.0015)   (0.0015)   (0.0011)
Test form offered for the first time (a)                  -0.0007    -0.0007    -0.0011
                                                          (0.0011)   (0.0011)   (0.0010)
Average quality of teachers' undergraduate institution                          -0.0026
                                                                                (0.0007)
Percent of teachers who have worked at the school                               -0.0045
  less than 3 years                                                             (0.0031)
Percent of teachers under 30 years of age                                        0.0156
                                                                                (0.0065)
Percent of students in the school meeting national                               0.0001
  norms in reading last year                                                    (0.0000)
Percent free lunch in school                                                     0.0001
                                                                                (0.0000)
Predominantly Black school                                                       0.0068
                                                                                (0.0019)
Predominantly Hispanic school                                                   -0.0009
                                                                                (0.0016)
School × year fixed effects                                  No         No         No        Yes
Number of observations                                    163,474    163,474    163,474    163,474

Notes: The unit of observation is classroom × grade × year × subject, and the sample includes eight years (1993 to 2000), four subjects (reading comprehension and three math sections), and five grades (three to seven). The dependent variable is the cheating indicator derived using the ninety-fifth percentile cutoff. Robust standard errors clustered by school × year are shown in parentheses. Other variables included in the regressions in columns (1) and (2) are a linear time trend, cubic terms for the number of students, a linear grade variable, and fixed effects for subjects. The regression shown in column (3) also includes the following variables: indicators of the percent of students in the classroom who were Black, Hispanic, male, receiving free lunch, old for grade, in a special education program, living in foster care, and living with a nonparental relative; indicators of school size, mobility rate, and attendance rate; and indicators of the percent of teachers in the school who had a master's or doctoral degree, lived in Chicago, and were education majors.

a. Test forms vary only by year, so this variable drops out of the analysis when school × year fixed effects are included.

The results are robust to adding a number of classroom and school characteristics (column (3)) as well as school × year fixed effects (column (4)).

Cheating also appears to be responsive to other costs and benefits. Classrooms that tested poorly last year are much more likely to cheat. For example, a classroom with average student prior achievement one classroom standard deviation below the mean is 23 percent more likely to cheat. Classrooms with students in multiple grades are 65 percent less likely to cheat than classrooms where all students are in the same grade. This is consistent with the fact that it is likely more difficult for teachers in such classrooms to cheat, since they must administer two different test forms to students, which will necessarily have different correct answers. Moreover, classes with a higher proportion of students who are included in the official test reporting are more likely to cheat: a 10 percentage point increase in the proportion of students in a class whose test scores "count" will increase the likelihood of cheating by roughly 20 percent.[27] Teachers who administer the exam to their own students are 0.67 percentage points (approximately 50 percent) more likely to cheat.[28]

A few other notable results emerge. First, there is no statistically significant impact on cheating of reusing a test form that has been administered in a previous year. That finding is of interest because it suggests that teachers taking old exams and teaching the precise questions to students is not an important component of what we are detecting as cheating (although anecdotal evidence suggests this practice exists). Classrooms in schools with lower achievement, higher poverty rates, and more Black students are more likely to cheat. Classrooms in schools with teachers who graduated from more prestigious undergraduate institutions are less likely to cheat, whereas classrooms in schools with younger teachers are more likely to cheat.

27. Consistent with this finding, within cheating classrooms, teachers are less likely to alter the answer sheets for students who are excluded from test reporting. Teachers also appear to cheat less frequently for boys and for older students.

28. In an earlier version of the paper [Jacob and Levitt 2002], we examine for which students teachers tend to cheat. We find that teachers are most likely to cheat for students in the second quartile of the national ability distribution (twenty-fifth to fiftieth percentile) in the previous year, consistent with the incentives under the probation policy.
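As a concrete illustration of the specification behind Table V, the sketch below runs a linear probability regression of a cheating indicator on policy and classroom variables, with standard errors clustered by school × year as in the table notes. It is a minimal sketch on synthetic data; the variable names are hypothetical stand-ins, not the authors' data or estimation code.

# Sketch of a Table V-style linear probability model (synthetic, hypothetical data).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 5000
df = pd.DataFrame({
    "cheat": rng.integers(0, 2, n),              # 1 if the classroom is flagged as cheating
    "social_promotion": rng.integers(0, 2, n),   # policy-exposure indicators
    "probation": rng.integers(0, 2, n),
    "prior_achievement": rng.normal(0, 1, n),    # classroom mean prior score (standardized)
    "mixed_grade": rng.integers(0, 2, n),        # 1 if students span multiple grades
    "school_year": rng.integers(0, 200, n),      # school × year cluster identifier
})

# OLS on a binary outcome, i.e., a linear probability model, with
# standard errors clustered at the school × year level.
fit = smf.ols(
    "cheat ~ social_promotion + probation + prior_achievement"
    " + social_promotion:prior_achievement + probation:prior_achievement"
    " + mixed_grade",
    data=df,
).fit(cov_type="cluster", cov_kwds={"groups": df["school_year"]})
print(fit.summary())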

VII. CONCLUSION

This paper develops an algorithm for determining the prevalence of teacher cheating on standardized tests and applies the results to data from the Chicago public schools. Our methods reveal thousands of instances of classroom cheating, representing 4–5 percent of the classrooms each year. Moreover, we find that teacher cheating appears quite responsive to relatively minor changes in incentives.

The obvious benefits of high-stakes tests as a means of providing incentives must therefore be weighed against the possible distortions that these measures induce. Explicit cheating is merely one form of distortion. Indeed, the kind of cheating we focus on is one that could be greatly reduced at relatively low cost through the implementation of proper safeguards, such as is done by Educational Testing Service on the SAT or GRE exams.[29]

Even if this particular form of cheating were eliminated, however, there are a virtually infinite number of dimensions along which strategic school personnel can act to game the current system. For instance, Figlio and Winicki [2001] and Rohlfs [2002] document that schools attempt to increase the caloric intake of students on test days, and disruptive students are especially likely to be suspended from school during official testing periods [Figlio 2002]. It may be prohibitively costly, or even impossible, to completely safeguard the system.

Finally, this paper fits into a small but growing body of research focused on identifying corrupt or illicit behavior on the part of economic actors (see Porter and Zona [1993], Fisman [2001], Di Tella and Schargrodsky [2001], and Duggan and Levitt [2002]). Because individuals engaged in such behavior actively attempt to hide their trails, the intellectual exercise associated with uncovering their misdeeds differs substantially from the typical economic application, in which the researcher starts with a well-defined measure of the outcome variable (e.g., earnings, economic growth, profits) and then attempts to uncover the determinants of these outcomes. In the case of corruption, there is typically no clear outcome variable, making it necessary for the researcher to employ nonstandard approaches in generating such a measure.

29. Although even tests administered by the Educational Testing Service are not immune to cheating, as evidenced by recent reports that many of the GRE scores from China are fraudulent [Contreras 2002].

APPENDIX: THE CONSTRUCTION OF SUSPICIOUS STRING MEASURES

This appendix describes in greater detail how we construct each of our measures of unexpected or suspicious responses.

The first measure focuses on the most unlikely block of identical answers given on consecutive questions. Using past test scores, future test scores, and background characteristics, we predict the likelihood that each student will give each answer on each question. For each item, a student has four choices (A, B, C, or D), only one of which is correct. We estimate a multinomial logit for each item on the exam in order to predict how students will respond to each question. We estimate the following model for each item, using information from other students in that year, grade, and subject:

(1)   \Pr(Y_{isc} = j) = e^{\beta_j X_s} / \sum_{j=1}^{J} e^{\beta_j X_s},

where Y_{isc} indicates the response of student s in class c on item i, the number of possible responses (J) is four, and X_s is a vector that includes measures of prior and future student achievement in math and reading, as well as demographic variables (such as race, gender, and free lunch status) for student s. Thus, a student's predicted probability of choosing a particular response is identified by the likelihood of other students (in the same year, grade, and subject) with similar background characteristics choosing that response.

Notice that by including future as well as prior test scores in the model, we decrease the likelihood that students with unusually good teachers will be identified as cheaters, since these students will likely retain some of the knowledge learned in the base year and thus have higher future test scores. Also note that by estimating the probability of selecting each possible response, rather than simply estimating the probability of choosing the correct response, we take advantage of any additional information that is provided by particular response patterns in a classroom.
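To illustrate this first step, here is a minimal sketch of one per-item multinomial logit fit on synthetic data; the covariates are hypothetical stand-ins for the prior and future scores and demographics described above, and scikit-learn's softmax model plays the role of equation (1).

# Per-item multinomial logit: predict Pr(response = j) from student covariates X_s.
# A minimal sketch on synthetic data, not the authors' estimation code.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_students = 2000
X = np.column_stack([
    rng.normal(0, 1, n_students),      # prior achievement (hypothetical)
    rng.normal(0, 1, n_students),      # future achievement (hypothetical)
    rng.integers(0, 2, n_students),    # a demographic indicator, e.g., free lunch
])
y = rng.integers(0, 4, n_students)     # observed choice on this item: A/B/C/D coded 0..3

# With four classes, LogisticRegression fits a multinomial (softmax) model;
# predict_proba returns one probability per answer choice, as in equation (1).
model = LogisticRegression(max_iter=1000).fit(X, y)
probs = model.predict_proba(X)         # shape (n_students, 4)

# p_{isc} of equation (2): the probability of the response actually chosen.
p_chosen = probs[np.arange(n_students), y]

In the paper's setup this estimation would be repeated item by item, separately within each year, grade, and subject.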

Using the estimates from this model, we calculate the predicted probability that each student would answer each item in the way that he or she in fact did:

(2)   p_{isc} = e^{\beta_k X_s} / \sum_{j=1}^{J} e^{\beta_j X_s},   where k is the response actually chosen by student s on item i.

This provides us with one measure per student per item. Taking the product over items within student, we calculate the probability that a student would have answered a string of consecutive questions from item m to item n as he or she did:

(3)   p_{sc}^{mn} = \prod_{i=m}^{n} p_{isc}.

We then take the product across all students in the classroom who had identical responses in the string. If we define z as a student, S_{zc}^{mn} as the string of responses for student z from item m to item n, and S_{sc}^{mn} as the string for student s, then we can express the product as

(4)   \tilde{p}_{sc}^{mn} = \prod_{s \in \{z : S_{zc}^{mn} = S_{sc}^{mn}\}} p_{sc}^{mn}.

Note that if there are n_s students in class c and each student has a unique set of responses to these particular items, then \tilde{p}_{sc}^{mn} collapses to p_{sc}^{mn} for each student, and there will be n_s distinct values within the class. At the other extreme, if all of the students in class c have identical responses, then there is only one distinct value of \tilde{p}_{sc}^{mn}. We repeat this calculation for all possible consecutive strings of length three to seven, that is, for all S^{mn} with 3 \leq n - m + 1 \leq 7. To create our first indicator of suspicious string patterns, we take the minimum of the predicted block probability for each classroom:

MEASURE 1: M1_c = \min_s (\tilde{p}_{sc}^{mn}).

This measure captures the least likely block of identical answers given on consecutive questions in the classroom.
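Measure 1 then follows mechanically from equations (2) through (4). The sketch below assumes two hypothetical arrays for a single classroom: probs[s, i], the predicted probability of student s's actual answer on item i, and resp[s, i], the answer itself.

# Measure 1 sketch: the least likely block of identical consecutive answers.
import numpy as np
from collections import defaultdict

def measure_1(probs, resp):
    """probs[s, i]: predicted prob. of student s's actual answer on item i.
    resp[s, i]: the answer actually chosen. Returns the minimum block probability."""
    n_students, n_items = probs.shape
    best = 1.0
    for length in range(3, 8):                       # strings of length three to seven
        for m in range(n_items - length + 1):
            # p_sc^{mn}: product of item probabilities over the window, eq. (3).
            window = probs[:, m:m + length].prod(axis=1)
            # Group students whose response strings in this window are identical.
            groups = defaultdict(list)
            for s in range(n_students):
                groups[tuple(resp[s, m:m + length])].append(s)
            for members in groups.values():
                # \tilde{p}_sc^{mn}: product across students sharing the string, eq. (4).
                best = min(best, window[members].prod())
    return best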

The second measure of suspicious answer strings is intended to capture more general patterns of similarity in student responses. To construct this measure, we first calculate the residuals for each of the possible choices a student could have made for each item:

(5)   e_{jisc} = 0 - e^{\beta_j X_s} / \sum_{j=1}^{J} e^{\beta_j X_s}   if j \neq k,
      e_{jisc} = 1 - e^{\beta_j X_s} / \sum_{j=1}^{J} e^{\beta_j X_s}   if j = k,

where e_{jisc} is the residual for response j on item i by student s in classroom c. We thus have four separate residuals per student per item.

To create a classroom-level measure of the response to item i, we need to combine the information for each student. First, we sum the residuals for each response across students within a classroom:

(6)   \bar{e}_{jic} = \sum_s e_{jisc}.

If there is no within-class correlation in the way that students responded to a particular item, this term should be approximately zero. Second, we sum across the four possible responses for each item within classrooms. At the same time, we square each of the component residual measures to accentuate outliers, and divide by the number of students in the class (n_{sc}) to normalize by class size:

(7)   v_{ic} = \sum_j \bar{e}_{jic}^2 / n_{sc}.

The statistic v_{ic} captures something like the variance of student responses on item i within classroom c. Notice that we choose to first sum the residuals of each response across students, and only then combine the classroom-level measures across responses, rather than summing across responses within student initially. We do this in order to emphasize the classroom-level tendencies in response patterns.

Our second measure of suspicious strings is simply the classroom average of this variance term across all test items:

MEASURE 2: M2_c = \bar{v}_c = \sum_i v_{ic} / n_i, where n_i is the number of items on the exam.

Our third measure is simply the variance (as opposed to the mean) of v_{ic} across questions within a classroom:

MEASURE 3: M3_c = \sigma_{v_c}^2 = \sum_i (v_{ic} - \bar{v}_c)^2 / n_i.
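Under the same assumed inputs, extended so that a hypothetical array probs_full[s, i, j] holds the predicted probability of every choice j, Measures 2 and 3 can be sketched as follows.

# Measures 2 and 3 sketch: classroom mean and variance of the item-level
# statistic v_ic built from the response residuals of equations (5)-(7).
import numpy as np

def measures_2_and_3(probs_full, resp):
    """probs_full[s, i, j]: predicted prob. that student s picks choice j on item i.
    resp[s, i]: the choice actually made (coded 0..3). Returns (M2_c, M3_c)."""
    n_students, n_items, n_choices = probs_full.shape
    # Indicator of the chosen response, 1{j = k}, used in equation (5).
    chosen = np.zeros_like(probs_full)
    s_idx, i_idx = np.meshgrid(np.arange(n_students), np.arange(n_items), indexing="ij")
    chosen[s_idx, i_idx, resp] = 1.0
    resid = chosen - probs_full                 # e_{jisc}, eq. (5)
    e_bar = resid.sum(axis=0)                   # eq. (6): sum over students in the class
    v = (e_bar ** 2).sum(axis=1) / n_students   # eq. (7): v_{ic}, normalized by class size
    m2 = v.mean()                               # Measure 2: average of v_{ic} over items
    m3 = ((v - v.mean()) ** 2).mean()           # Measure 3: variance of v_{ic} over items
    return m2, m3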

Our final indicator focuses on the extent to which a student's response pattern differs from that of other students with the same aggregate score that year. Let q_{isc} equal one if student s in classroom c answered item i correctly, and zero otherwise. Let A_s equal the aggregate score of student s on the exam. We then determine what fraction of students at each aggregate score level answered each item correctly. If we let n_{sA} equal the number of students with an aggregate score of A, then this fraction \bar{q}_i^A can be expressed as

(8)   \bar{q}_i^A = \sum_{s \in \{z : A_z = A_s\}} q_{isc} / n_{sA}.

We then calculate a measure of how much the response pattern of student s differed from the response pattern of other students with the same aggregate score. We do so by subtracting a student's answer on item i from the mean response of all students with aggregate score A, squaring these deviations, and then summing across all items on the exam:

(9)   Z_{sc} = \sum_i (q_{isc} - \bar{q}_i^A)^2.

We then subtract out the mean deviation for all students with the same aggregate score, \bar{Z}^A, and sum over the students within each classroom to obtain our final indicator:

MEASURE 4: M4_c = \sum_s (Z_{sc} - \bar{Z}^A).
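Measure 4 can likewise be sketched directly from its definition. Here correct[s, i] is a hypothetical 0/1 matrix over all tested students (not just one classroom), score[s] their aggregate scores, and classroom[s] a class label per student.

# Measure 4 sketch: within-class sum of deviations of each student's response
# pattern from students elsewhere with the same aggregate score, eqs. (8)-(9).
import numpy as np

def measure_4(correct, score, classroom):
    """correct[s, i] in {0, 1}; score[s]: aggregate score; classroom[s]: class id.
    Returns {class id: M4_c}."""
    q_bar = np.zeros(correct.shape, dtype=float)
    for a in np.unique(score):
        mask = score == a
        # eq. (8): fraction of same-score students answering each item correctly.
        q_bar[mask] = correct[mask].mean(axis=0)
    # eq. (9): squared deviations from the same-score mean, summed over items.
    z = ((correct - q_bar) ** 2).sum(axis=1)
    # Subtract the mean deviation \bar{Z}^A within each aggregate-score group.
    z_centered = np.empty_like(z)
    for a in np.unique(score):
        mask = score == a
        z_centered[mask] = z[mask] - z[mask].mean()
    # Sum the centered deviations within each classroom.
    return {c: z_centered[classroom == c].sum() for c in np.unique(classroom)}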

KENNEDY SCHOOL OF GOVERNMENT, HARVARD UNIVERSITY

UNIVERSITY OF CHICAGO AND AMERICAN BAR FOUNDATION

REFERENCES

Aiken, Linda R., "Detecting, Understanding, and Controlling for Cheating on Tests," Research in Higher Education, XXXII (1991), 725–736.

Angoff, William H., "The Development of Statistical Indices for Detecting Cheaters," Journal of the American Statistical Association, LXIX (1974), 44–49.

Baker, George, "Incentive Contracts and Performance Measurement," Journal of Political Economy, C (1992), 598–614.

Bellezza, Francis S., and Suzanne F. Bellezza, "Detection of Cheating on Multiple-Choice Tests by Using Error Similarity Analysis," Teaching of Psychology, XVI (1989), 151–155.

Carnoy, Martin, and Susanna Loeb, "Does External Accountability Affect Student Outcomes? A Cross-State Analysis," Working Paper, Stanford University School of Education, 2002.

Contreras, Russell, "Inquiry Uncovers Possible Cheating on GRE in Asia," Associated Press, August 7, 2002.

Cullen, Julie Berry, and Randall Reback, "Tinkering Toward Accolades: School Gaming under a Performance Accountability System," Working Paper, University of Michigan, 2002.

Deere, Donald, and Wayne Strayer, "Putting Schools to the Test: School Accountability, Incentives, and Behavior," Working Paper, Texas A&M University, 2002.

Di Tella, Rafael, and Ernesto Schargrodsky, "The Role of Wages and Auditing during a Crackdown on Corruption in the City of Buenos Aires," unpublished manuscript, Harvard Business School, 2001.

Duggan, Mark, and Steven Levitt, "Winning Isn't Everything: Corruption in Sumo Wrestling," American Economic Review, XCII (2002), 1594–1605.

Figlio, David N., "Testing, Crime and Punishment," Working Paper, University of Florida, 2002.

Figlio, David N., and Joshua Winicki, "Food for Thought: The Effects of School Accountability Plans on School Nutrition," Working Paper, University of Florida, 2001.

Figlio, David N., and Lawrence S. Getzler, "Accountability, Ability and Disability: Gaming the System," Working Paper, University of Florida, 2002.

Fisman, Ray, "Estimating the Value of Political Connections," American Economic Review, XCI (2001), 1095–1102.

Frary, Robert B., "Statistical Detection of Multiple-Choice Answer Copying: Review and Commentary," Applied Measurement in Education, VI (1993), 153–165.

Frary, Robert B., T. Nicholas Tideman, and Thomas Watts, "Indices of Cheating on Multiple-Choice Tests," Journal of Educational Statistics, II (1977), 235–256.

Glaeser, Edward, and Andrei Shleifer, "A Reason for Quantity Regulation," American Economic Review, XCI (2001), 431–435.

Grissmer, David W., Ann Flanagan, et al., "Improving Student Achievement: What NAEP Test Scores Tell Us," RAND Corporation, MR-924-EDU, 2000.

Hanushek, Eric A., and Margaret E. Raymond, "Improving Educational Quality: How Best to Evaluate Our Schools?" Conference paper, Federal Reserve Bank of Boston, 2002.

Hofkins, Diane, "Cheating 'Rife' in National Tests," New York Times Educational Supplement, June 16, 1995.

Holland, Paul W., "Assessing Unusual Agreement between the Incorrect Answers of Two Examinees Using the K-Index: Statistical Theory and Empirical Support," Technical Report No. 96-5, Educational Testing Service, 1996.

Holmstrom, Bengt, and Paul Milgrom, "Multitask Principal-Agent Analyses: Incentive Contracts, Asset Ownership, and Job Design," Journal of Law, Economics, and Organization, VII (1991), 24–51.

Jacob, Brian A., "Accountability, Incentives and Behavior: The Impact of High-Stakes Testing in the Chicago Public Schools," Working Paper No. 8968, National Bureau of Economic Research, 2002.

Jacob, Brian A., and Lars Lefgren, "The Impact of Teacher Training on Student Achievement: Quasi-Experimental Evidence from School Reform Efforts in Chicago," Journal of Human Resources, forthcoming.

Jacob, Brian A., and Steven Levitt, "Catching Cheating Teachers: The Results of an Unusual Experiment in Implementing Theory," Working Paper No. 9413, National Bureau of Economic Research, 2003.

Klein, Stephen P., Laura S. Hamilton, et al., "What Do Test Scores in Texas Tell Us?" RAND Corporation, 2000.

Kolker, Claudia, "Texas Offers Hard Lessons on School Accountability," Los Angeles Times, April 14, 1999.

Lindsay, Drew, "Whodunit? Officials Find Thousands of Erasures on Standardized Tests and Suspect Tampering," Education Week (1996), 25–29.

Loughran, Regina, and Thomas Comiskey, "Cheating the Children: Educator Misconduct on Standardized Tests," Report of the City of New York Special Commissioner of Investigation for the New York City School District, 1999.

Marcus, John, "Faking the Grade," Boston Magazine, February 2000.

May, Meredith, "State Fears Cheating by Teachers," San Francisco Chronicle, October 4, 2000.

Perlman, Carol L., "Results of a Citywide Testing Program Audit in Chicago," Paper for American Educational Research Association annual meeting, Chicago, IL, 1985.

Porter, Robert H., and J. Douglas Zona, "Detection of Bid Rigging in Procurement Auctions," Journal of Political Economy, CI (1993), 518–538.

Richards, Craig E., and Tian Ming Sheu, "The South Carolina School Incentive Reward Program: A Policy Analysis," Economics of Education Review, XI (1992), 71–86.

Rohlfs, Chris, "Estimating Classroom Learning Using the School Breakfast Program as an Instrument for Attendance and Class Size: Evidence from Chicago Public Schools," Unpublished manuscript, University of Chicago, 2002.

Tysome, Tony, "Cheating Purge: Inspectors out," New York Times Higher Education Supplement, August 19, 1994.

Wollack, James A., "A Nominal Response Model Approach for Detecting Answer Copying," Applied Psychological Measurement, XX (1997), 144–152.

877ROTTEN APPLES

Page 28: rotten apples: an investigation of the prevalence and predictors of ...

adding a number of classroom and school characteristics (column(3)) as well as schoolyear xed effects (column (4))

Cheating also appears to be responsive to other costs andbenets Classrooms that tested poorly last year are much morelikely to cheat For example a classroom with average studentprior achievement one classroom standard deviation below themean is 23 percent more likely to cheat Classrooms with stu-dents in multiple grades are 65 percent less likely to cheat thanclassrooms where all students are in the same grade This isconsistent with the fact that it is likely more difcult for teachersin such classrooms to cheat since they must administer twodifferent test forms to students which will necessarily have dif-ferent correct answers Moreover classes with a higher propor-tion of students who are included in the ofcial test reporting aremore likely to cheatmdasha 10 percentage point increase in the pro-portion of students in a class whose test scores ldquocountrdquo willincrease the likelihood of cheating by roughly 20 percent27

Teachers who administer the exam to their own students are 067percentage pointsmdashapproximately 50 percentmdashmore likely tocheat28

A few other notable results emerge First there is no statis-tically signicant impact on cheating of reusing a test form thathas been administered in a previous year That nding is ofinterest because it suggests that teachers taking old exams andteaching the precise questions to students is not an importantcomponent of what we are detecting as cheating (although anec-dotal evidence suggests this practice exists) Classrooms inschools with lower achievement higher poverty rates and moreBlack students are more likely to cheat Classrooms in schoolswith teachers who graduated from more prestigious undergrad-uate institutions are less likely to cheat whereas classrooms inschools with younger teachers are more likely to cheat

27 Consistent with this nding within cheating classrooms teachers areless likely to alter the answer sheets for students who are excluded from testreporting Teachers also appear to less frequently cheat for boys and for olderstudents

28 In an earlier version of the paper [Jacob and Levitt 2002] we examine forwhich students teachers tend to cheat We nd that teachers are most likely tocheat for students in the second quartile of the national ability distribution(twenty-fth to ftieth percentile) in the previous year consistent with the incen-tives under the probation policy

870 QUARTERLY JOURNAL OF ECONOMICS

VII CONCLUSION

This paper develops an algorithm for determining the preva-lence of teacher cheating on standardized tests and applies theresults to data from the Chicago public schools Our methodsreveal thousands of instances of classroom cheating representing4ndash5 percent of the classrooms each year Moreover we nd thatteacher cheating appears quite responsive to relatively minorchanges in incentives

The obvious benets of high-stakes tests as a means of pro-viding incentives must therefore be weighed against the possibledistortions that these measures induce Explicit cheating ismerely one form of distortion Indeed the kind of cheating wefocus on is one that could be greatly reduced at a relatively lowcost through the implementation of proper safeguards such as isdone by Educational Testing Service on the SAT or GRE exams29

Even if this particular form of cheating were eliminatedhowever there are a virtually innite number of dimensionsalong which strategic school personnel can act to game the cur-rent system For instance Figlio and Winicki [2001] and Rohlfs[2002] document that schools attempt to increase the caloricintake of students on test days and disruptive students areespecially likely to be suspended from school during ofcial test-ing periods [Figlio 2002] It may be prohibitively costly or evenimpossible to completely safeguard the system

Finally this paper ts into a small but growing body ofresearch focused on identifying corrupt or illicit behavior on thepart of economic actors (see Porter and Zona [1993] Fisman[2001] Di Tella and Schargrodsky [2001] and Duggan and Levitt[2002]) Because individuals engaged in such behavior activelyattempt to hide their trails the intellectual exercise associatedwith uncovering their misdeeds differs substantially from thetypical economic application in which the researcher starts witha well-dened measure of the outcome variable (eg earningseconomic growth prots) and then attempts to uncover the de-terminants of these outcomes In the case of corruption there istypically no clear outcome variable making it necessary for the

29 Although even tests administered by the Educational Testing Service arenot immune to cheating as evidenced by recent reports that many of the GREscores from China are fraudulent [Contreras 2002]

871ROTTEN APPLES

researcher to employ nonstandard approaches in generating sucha measure

APPENDIX THE CONSTRUCTION OF SUSPICIOUS STRING MEASURES

This appendix describes in greater detail how we constructeach of our measures of unexpected or suspicious responses

The rst measure focuses on the most unlikely block of iden-tical answers given on consecutive questions Using past testscores future test scores and background characteristics wepredict the likelihood that each student will give each answer oneach question For each item a student has four choices (A B Cor D) only one of which is correct We estimate a multinomiallogit for each item on the exam in order to predict how studentswill respond to each question We estimate the following modelfor each item using information from other students in that yeargrade and subject

(1) Pr~Yisc 5 j 5eb jxs

Sj51J eb jxs

where Y isc indicates the response of student s in class c on itemi the number of possible responses ( J) is four and Xs is a vectorthat includes measures of prior and future student achievementin math and reading as well as demographic variables (such asrace gender and free lunch status) for student s Thus a stu-dentrsquos predicted probability of choosing a particular response isidentied by the likelihood of other students (in the same yeargrade and subject) with similar background characteristicschoosing that response

Notice that by including future as well as prior test scores inthe model we decrease the likelihood that students with unusu-ally good teachers will be identied as cheaters since these stu-dents will likely retain some of the knowledge learned in the baseyear and thus have higher future test scores Also note that byestimating the probability of selecting each possible responserather than simply estimating the probability of choosing thecorrect response we take advantage of any additional informa-tion that is provided by particular response patterns in aclassroom

Using the estimates from this model we calculate the pre-dicted probability that each student would answer each item in

872 QUARTERLY JOURNAL OF ECONOMICS

the way that he or she in fact did

(2) pisc 5ebkxs

S j51J eb j xs

for k

5 response actually chosen by student s on item i

This provides us with one measure per student per item Takingthe product over items within student we calculate the probabil-ity that a student would have answered a string of consecutivequestions from item m to item n as he or she did

(3) pscmn 5 P

i5m

n

pisc

We then take the product across all students in the classroomwho had identical responses in the string If we dene z as astudent Szc

m n as the string of responses for student z from item mto item n and S sc

m n and as the string for student s then we canexpress the product as

(4) pscmn 5 P

s[$ zS icmn5 S sc

mn

pscmn

Note that if there are ns students in class c and each student hasa unique set of responses to these particular items then psc

m n

collapses to pscm n for each student and there will be ns distinct

values within the class On the other extreme if all of the stu-dents in class c have identical responses then there is only onedistinct value of psc

m n We repeat this calculation for all possibleconsecutive strings of length three to seven that is for all Sm n

such that 3 m 2 n 7 To create our rst indicator ofsuspicious string patterns we take the minimum of the predictedblock probability for each classroom

MEASURE 1 M1c 5 mins( ps cm n)

This measure captures the least likely block of identical answersgiven on consecutive questions in the classroom

The second measure of suspicious answer strings is intendedto capture more general patterns of similarity in student re-sponses To construct this measure we rst calculate the resid-

873ROTTEN APPLES

uals for each of the possible choices a student could have made foreach item

(5) e jisc 5 0 2eb jxs

S j51J eb j xs

if j THORN k

5 1 2eb j xs

S j51J eb jxs

if j 5 k

where e jis c is the residual for response j on item i by student s inclassroom c We thus have four separate residuals per studentper item

To create a classroom level measure of the response to item iwe need to combine the information for each student First wesum the residuals for each response across students within aclassroom

(6) ejic 5 Os

e jisc

If there is no within-class correlation in the way that studentsresponded to a particular item this term should be approximatelyzero Second we sum across the four possible responses for eachitem within classrooms At the same time we square each of thecomponent residual measures to accentuate outliers and divideby number of students in the class (nsc) to normalize by class size

(7) v ic 5S j ejic

2

nsc

The statistic vic captures something like the variance of studentresponses on item i within classroom c Notice that we choose torst sum across the residuals of each response across studentsand then sum the classroom level measures for each responserather than summing across responses within student initiallyWe do this in order to emphasize the classroom level tendencies inresponse patterns

Our second measure of suspicious strings is simply the class-room average (across items) of this variance term across all testitems

MEASURE 2 M2c 5 v c 5 yeni vic ni where ni is the number of itemson the exam

Our third measure is simply the variance (as opposed to the

874 QUARTERLY JOURNAL OF ECONOMICS

mean) in the degree of correlation across questions within aclassroom

MEASURE 3 M3c 5 sv c

2 5 yeni (vic 2 v c)2 ni

Our nal indicator focuses on the extent to which a studentrsquosresponse pattern was different from other studentrsquos with thesame aggregate score that year Let qisc equal one if student s inclassroom c answered item i correctly and zero otherwise Let As

equal the aggregate score of student s on the exam We thendetermine what fraction of students at each aggregate score levelanswered each item correctly If we let nsA equal number ofstudents with an aggregate score of A then this fraction q i

A canbe expressed as

(8) q iA 5

Ss[$ z Az5As qisc

nsA

We then calculate a measure of how much the response pattern ofstudent s differed from the response pattern of other studentswith the same aggregate score We do so by subtracting a stu-dentrsquos answer on item i from the mean response of all studentswith aggregate score A squaring these deviations and then sum-ming across all items on the exam

(9) Zsc 5 Oi

~qisc 2 q iA2

We then subtract out the mean deviation for all students with thesame aggregate score Z A and sum the students within eachclassroom to obtain our nal indicator

MEASURE 4 M4c 5 yens (Zs c 2 Z A)

KENNEDY SCHOOL OF GOVERNMENT HARVARD UNIVERSITY

UNIVERSITY OF CHICAGO AND AMERICAN BAR FOUNDATION

REFERENCES

Aiken Linda R ldquoDetecting Understanding and Controlling for Cheating onTestsrdquo Research in Higher Education XXXII (1991) 725ndash736

Angoff William H ldquoThe Development of Statistical Indices for Detecting Cheat-ersrdquo Journal of the American Statistical Association LXIX (1974) 44ndash49

Baker George ldquoIncentive Contracts and Performance Measurementrdquo Journal ofPolitical Economy C (1992) 598ndash614

Bellezza Francis S and Suzanne F Bellezza ldquoDetection of Cheating on Multiple-Choice Tests by Using Error Similarity Analysisrdquo Teaching of PsychologyXVI (1989) 151ndash155

Carnoy Martin and Susanna Loeb ldquoDoes External Accountability Affect Student

875ROTTEN APPLES

Outcomes A Cross-State Analysisrdquo Working Paper Stanford UniversitySchool of Education 2002

Contreras Russell ldquoInquiry Uncovers Possible Cheating on GRE in Asiardquo Asso-ciated Press August 7 2002

Cullen Julie Berry and Randall Reback ldquoTinkering Toward Accolades SchoolGaming under a Performance Accountability Systemrdquo Working paper Uni-versity of Michigan 2002

Deere Donald and Wayne Strayer ldquoPutting Schools to the Test School Account-ability Incentives and Behaviorrdquo Working Paper Texas AampM University2002

DiTella Rafael and Ernesto Schargrodsky ldquoThe Role of Wages and Auditingduring a Crackdown on Corruption in the City of Buenos Airesrdquo unpublishedmanuscript Harvard Business School 2001

Duggan Mark and Steven Levitt ldquoWinning Isnrsquot Everything Corruption in SumoWrestlingrdquo American Economic Review XCII (2002) 1594ndash1605

Figlio David N ldquoTesting Crime and Punishmentrdquo Working Paper University ofFlorida 2002

Figlio David N and Joshua Winicki ldquoFood for Thought The Effects of SchoolAccountability Plans on School Nutritionrdquo Working Paper University ofFlorida 2001

Figlio David N and Lawrence S Getzler ldquoAccountability Ability and DisabilityGaming the Systemrdquo Working Paper University of Florida 2002

Fisman Ray ldquoEstimating the Value of Political Connectionsrdquo American EconomicReview XCI (2001) 1095ndash1102

Frary Robert B ldquoStatistical Detection of Multiple-Choice Answer Copying Re-view and Commentaryrdquo Applied Measurement in Education VI (1993)153ndash165

Frary Robert B T Nicholas Tideman and Thomas Watts ldquoIndices of Cheatingon Multiple-Choice Testsrdquo Journal of Educational Statistics II (1977)235ndash256

Glaeser Edward and Andrei Shleifer ldquoA Reason for Quantity Regulationrdquo Amer-ican Economic Review XCI (2001) 431ndash435

Grissmer David W Ann Flanagan et al ldquoImproving Student AchievementWhat NAEP Test Scores Tell Usrdquo RAND Corporation MR-924-EDU 2000

Hanushek Eric A and Margaret E Raymond ldquoImproving Educational QualityHow Best to Evaluate Our Schoolsrdquo Conference paper Federal Reserve Bankof Boston 2002

Hofkins Diane ldquoCheating lsquoRifersquo in National Testsrdquo New York Times EducationalSupplement June 16 1995

Holland Paul W ldquoAssessing Unusual Agreement between the Incorrect Answersof Two Examinees Using the K-Index Statistical Theory and EmpiricalSupportrdquo Technical Report No 96-5 Educational Testing Service 1996

Holmstrom Bengt and Paul Milgrom ldquoMultitask Principal-Agent Analyses In-centive Contracts Asset Ownership and Job Designrdquo Journal of Law Eco-nomics and Organization VII (1991) 24ndash51

Jacob Brian A ldquoAccountability Incentives and Behavior The Impact of High-Stakes Testing in the Chicago Public Schoolsrdquo Working Paper No 8968National Bureau of Economic Research 2002

Jacob Brian A and Lars Lefgren ldquoThe Impact of Teacher Training on StudentAchievement Quasi-Experimental Evidence from School Reform Efforts inChicagordquo Journal of Human Resources forthcoming

Jacob Brian A and Steven Levitt ldquoCatching Cheating Teachers The Results ofan Unusual Experiment in Implementing Theoryrdquo Working Paper No 9413National Bureau of Economic Research 2003

Klein Stephen P Laura S Hamilton et al ldquoWhat Do Test Scores in Texas TellUsrdquo RAND Corporation 2000

Kolker Claudia ldquoTexas Offers Hard Lessons on School Accountabilityrdquo LosAngeles Times April 14 1999

Lindsay Drew ldquoWhodunit Ofcials Find Thousands of Erasures on Standard-ized Tests and Suspect Tamperingrdquo Education Week (1996) 25ndash29

Loughran Regina and Thomas Comiskey ldquoCheating the Children Educator

876 QUARTERLY JOURNAL OF ECONOMICS

Misconduct on Standardized Testsrdquo Report of the City of New York SpecialCommissioner of Investigation for the New York City School District 1999

Marcus John ldquoFaking the Graderdquo Boston Magazine February 2000May Meredith ldquoState Fears Cheating by Teachersrdquo San Francisco Chronicle

October 4 2000Perlman Carol L ldquoResults of a Citywide Testing Program Audit in Chicagordquo

Paper for American Educational Research Association annual meeting Chi-cago IL 1985

Porter Robert H and J Douglas Zona ldquoDetection of Bid Rigging in ProcurementAuctionsrdquo Journal of Political Economy CI (1993) 518ndash538

Richards Craig E and Tian Ming Sheu ldquoThe South Carolina School IncentiveReward Program A Policy Analysisrdquo Economics of Education Review XI(1992) 71ndash86

Rohlfs Chris ldquoEstimating Classroom Learning Using the School Breakfast Pro-gram as an Instrument for Attendance and Class Size Evidence from ChicagoPublic Schoolsrdquo Unpublished manuscript University of Chicago 2002

Tysome Tony ldquoCheating Purge Inspectors outrdquo New York Times Higher Educa-tion Supplement August 19 1994

Wollack James A ldquoA Nominal Response Model Approach for Detecting AnswerCopyingrdquo Applied Psychological Measurement XX (1997) 144ndash152

877ROTTEN APPLES

Page 29: rotten apples: an investigation of the prevalence and predictors of ...

VII CONCLUSION

This paper develops an algorithm for determining the preva-lence of teacher cheating on standardized tests and applies theresults to data from the Chicago public schools Our methodsreveal thousands of instances of classroom cheating representing4ndash5 percent of the classrooms each year Moreover we nd thatteacher cheating appears quite responsive to relatively minorchanges in incentives

The obvious benets of high-stakes tests as a means of pro-viding incentives must therefore be weighed against the possibledistortions that these measures induce Explicit cheating ismerely one form of distortion Indeed the kind of cheating wefocus on is one that could be greatly reduced at a relatively lowcost through the implementation of proper safeguards such as isdone by Educational Testing Service on the SAT or GRE exams29

Even if this particular form of cheating were eliminatedhowever there are a virtually innite number of dimensionsalong which strategic school personnel can act to game the cur-rent system For instance Figlio and Winicki [2001] and Rohlfs[2002] document that schools attempt to increase the caloricintake of students on test days and disruptive students areespecially likely to be suspended from school during ofcial test-ing periods [Figlio 2002] It may be prohibitively costly or evenimpossible to completely safeguard the system

Finally this paper ts into a small but growing body ofresearch focused on identifying corrupt or illicit behavior on thepart of economic actors (see Porter and Zona [1993] Fisman[2001] Di Tella and Schargrodsky [2001] and Duggan and Levitt[2002]) Because individuals engaged in such behavior activelyattempt to hide their trails the intellectual exercise associatedwith uncovering their misdeeds differs substantially from thetypical economic application in which the researcher starts witha well-dened measure of the outcome variable (eg earningseconomic growth prots) and then attempts to uncover the de-terminants of these outcomes In the case of corruption there istypically no clear outcome variable making it necessary for the

29 Although even tests administered by the Educational Testing Service arenot immune to cheating as evidenced by recent reports that many of the GREscores from China are fraudulent [Contreras 2002]

871ROTTEN APPLES

researcher to employ nonstandard approaches in generating sucha measure

APPENDIX THE CONSTRUCTION OF SUSPICIOUS STRING MEASURES

This appendix describes in greater detail how we constructeach of our measures of unexpected or suspicious responses

The rst measure focuses on the most unlikely block of iden-tical answers given on consecutive questions Using past testscores future test scores and background characteristics wepredict the likelihood that each student will give each answer oneach question For each item a student has four choices (A B Cor D) only one of which is correct We estimate a multinomiallogit for each item on the exam in order to predict how studentswill respond to each question We estimate the following modelfor each item using information from other students in that yeargrade and subject

(1) Pr~Yisc 5 j 5eb jxs

Sj51J eb jxs

where Y isc indicates the response of student s in class c on itemi the number of possible responses ( J) is four and Xs is a vectorthat includes measures of prior and future student achievementin math and reading as well as demographic variables (such asrace gender and free lunch status) for student s Thus a stu-dentrsquos predicted probability of choosing a particular response isidentied by the likelihood of other students (in the same yeargrade and subject) with similar background characteristicschoosing that response

Notice that by including future as well as prior test scores inthe model we decrease the likelihood that students with unusu-ally good teachers will be identied as cheaters since these stu-dents will likely retain some of the knowledge learned in the baseyear and thus have higher future test scores Also note that byestimating the probability of selecting each possible responserather than simply estimating the probability of choosing thecorrect response we take advantage of any additional informa-tion that is provided by particular response patterns in aclassroom

Using the estimates from this model we calculate the pre-dicted probability that each student would answer each item in

872 QUARTERLY JOURNAL OF ECONOMICS

the way that he or she in fact did

(2) pisc 5ebkxs

S j51J eb j xs

for k

5 response actually chosen by student s on item i

This provides us with one measure per student per item Takingthe product over items within student we calculate the probabil-ity that a student would have answered a string of consecutivequestions from item m to item n as he or she did

(3) pscmn 5 P

i5m

n

pisc

We then take the product across all students in the classroomwho had identical responses in the string If we dene z as astudent Szc

m n as the string of responses for student z from item mto item n and S sc

m n and as the string for student s then we canexpress the product as

(4) pscmn 5 P

s[$ zS icmn5 S sc

mn

pscmn

Note that if there are ns students in class c and each student hasa unique set of responses to these particular items then psc

m n

collapses to pscm n for each student and there will be ns distinct

values within the class On the other extreme if all of the stu-dents in class c have identical responses then there is only onedistinct value of psc

m n We repeat this calculation for all possibleconsecutive strings of length three to seven that is for all Sm n

such that 3 m 2 n 7 To create our rst indicator ofsuspicious string patterns we take the minimum of the predictedblock probability for each classroom

MEASURE 1 M1c 5 mins( ps cm n)

This measure captures the least likely block of identical answersgiven on consecutive questions in the classroom

The second measure of suspicious answer strings is intendedto capture more general patterns of similarity in student re-sponses To construct this measure we rst calculate the resid-

873ROTTEN APPLES

uals for each of the possible choices a student could have made foreach item

(5) e jisc 5 0 2eb jxs

S j51J eb j xs

if j THORN k

5 1 2eb j xs

S j51J eb jxs

if j 5 k

where e jis c is the residual for response j on item i by student s inclassroom c We thus have four separate residuals per studentper item

To create a classroom level measure of the response to item iwe need to combine the information for each student First wesum the residuals for each response across students within aclassroom

(6) ejic 5 Os

e jisc

If there is no within-class correlation in the way that studentsresponded to a particular item this term should be approximatelyzero Second we sum across the four possible responses for eachitem within classrooms At the same time we square each of thecomponent residual measures to accentuate outliers and divideby number of students in the class (nsc) to normalize by class size

(7) v ic 5S j ejic

2

nsc

The statistic vic captures something like the variance of studentresponses on item i within classroom c Notice that we choose torst sum across the residuals of each response across studentsand then sum the classroom level measures for each responserather than summing across responses within student initiallyWe do this in order to emphasize the classroom level tendencies inresponse patterns

Our second measure of suspicious strings is simply the class-room average (across items) of this variance term across all testitems

MEASURE 2 M2c 5 v c 5 yeni vic ni where ni is the number of itemson the exam

Our third measure is simply the variance (as opposed to the

874 QUARTERLY JOURNAL OF ECONOMICS

mean) in the degree of correlation across questions within aclassroom

MEASURE 3 M3c 5 sv c

2 5 yeni (vic 2 v c)2 ni

Our nal indicator focuses on the extent to which a studentrsquosresponse pattern was different from other studentrsquos with thesame aggregate score that year Let qisc equal one if student s inclassroom c answered item i correctly and zero otherwise Let As

equal the aggregate score of student s on the exam We thendetermine what fraction of students at each aggregate score levelanswered each item correctly If we let nsA equal number ofstudents with an aggregate score of A then this fraction q i

A canbe expressed as

(8) q iA 5

Ss[$ z Az5As qisc

nsA

We then calculate a measure of how much the response pattern ofstudent s differed from the response pattern of other studentswith the same aggregate score We do so by subtracting a stu-dentrsquos answer on item i from the mean response of all studentswith aggregate score A squaring these deviations and then sum-ming across all items on the exam

(9) Zsc 5 Oi

~qisc 2 q iA2

We then subtract out the mean deviation for all students with thesame aggregate score Z A and sum the students within eachclassroom to obtain our nal indicator

MEASURE 4 M4c 5 yens (Zs c 2 Z A)

KENNEDY SCHOOL OF GOVERNMENT HARVARD UNIVERSITY

UNIVERSITY OF CHICAGO AND AMERICAN BAR FOUNDATION

REFERENCES

Aiken Linda R ldquoDetecting Understanding and Controlling for Cheating onTestsrdquo Research in Higher Education XXXII (1991) 725ndash736

Angoff William H ldquoThe Development of Statistical Indices for Detecting Cheat-ersrdquo Journal of the American Statistical Association LXIX (1974) 44ndash49

Baker George ldquoIncentive Contracts and Performance Measurementrdquo Journal ofPolitical Economy C (1992) 598ndash614

Bellezza Francis S and Suzanne F Bellezza ldquoDetection of Cheating on Multiple-Choice Tests by Using Error Similarity Analysisrdquo Teaching of PsychologyXVI (1989) 151ndash155

Carnoy Martin and Susanna Loeb ldquoDoes External Accountability Affect Student

875ROTTEN APPLES

Outcomes A Cross-State Analysisrdquo Working Paper Stanford UniversitySchool of Education 2002

Contreras Russell ldquoInquiry Uncovers Possible Cheating on GRE in Asiardquo Asso-ciated Press August 7 2002

Cullen Julie Berry and Randall Reback ldquoTinkering Toward Accolades SchoolGaming under a Performance Accountability Systemrdquo Working paper Uni-versity of Michigan 2002

Deere Donald and Wayne Strayer ldquoPutting Schools to the Test School Account-ability Incentives and Behaviorrdquo Working Paper Texas AampM University2002

DiTella Rafael and Ernesto Schargrodsky ldquoThe Role of Wages and Auditingduring a Crackdown on Corruption in the City of Buenos Airesrdquo unpublishedmanuscript Harvard Business School 2001

Duggan Mark and Steven Levitt ldquoWinning Isnrsquot Everything Corruption in SumoWrestlingrdquo American Economic Review XCII (2002) 1594ndash1605

Figlio David N ldquoTesting Crime and Punishmentrdquo Working Paper University ofFlorida 2002

Figlio David N and Joshua Winicki ldquoFood for Thought The Effects of SchoolAccountability Plans on School Nutritionrdquo Working Paper University ofFlorida 2001

Figlio David N and Lawrence S Getzler ldquoAccountability Ability and DisabilityGaming the Systemrdquo Working Paper University of Florida 2002

Fisman Ray ldquoEstimating the Value of Political Connectionsrdquo American EconomicReview XCI (2001) 1095ndash1102

Frary Robert B ldquoStatistical Detection of Multiple-Choice Answer Copying Re-view and Commentaryrdquo Applied Measurement in Education VI (1993)153ndash165

Frary Robert B T Nicholas Tideman and Thomas Watts ldquoIndices of Cheatingon Multiple-Choice Testsrdquo Journal of Educational Statistics II (1977)235ndash256

Glaeser Edward and Andrei Shleifer ldquoA Reason for Quantity Regulationrdquo Amer-ican Economic Review XCI (2001) 431ndash435

Grissmer David W Ann Flanagan et al ldquoImproving Student AchievementWhat NAEP Test Scores Tell Usrdquo RAND Corporation MR-924-EDU 2000

Hanushek Eric A and Margaret E Raymond ldquoImproving Educational QualityHow Best to Evaluate Our Schoolsrdquo Conference paper Federal Reserve Bankof Boston 2002

Hofkins Diane ldquoCheating lsquoRifersquo in National Testsrdquo New York Times EducationalSupplement June 16 1995

Holland Paul W ldquoAssessing Unusual Agreement between the Incorrect Answersof Two Examinees Using the K-Index Statistical Theory and EmpiricalSupportrdquo Technical Report No 96-5 Educational Testing Service 1996

Holmstrom Bengt and Paul Milgrom ldquoMultitask Principal-Agent Analyses In-centive Contracts Asset Ownership and Job Designrdquo Journal of Law Eco-nomics and Organization VII (1991) 24ndash51

Jacob Brian A ldquoAccountability Incentives and Behavior The Impact of High-Stakes Testing in the Chicago Public Schoolsrdquo Working Paper No 8968National Bureau of Economic Research 2002

Jacob Brian A and Lars Lefgren ldquoThe Impact of Teacher Training on StudentAchievement Quasi-Experimental Evidence from School Reform Efforts inChicagordquo Journal of Human Resources forthcoming

Jacob Brian A and Steven Levitt ldquoCatching Cheating Teachers The Results ofan Unusual Experiment in Implementing Theoryrdquo Working Paper No 9413National Bureau of Economic Research 2003

Klein Stephen P Laura S Hamilton et al ldquoWhat Do Test Scores in Texas TellUsrdquo RAND Corporation 2000

Kolker Claudia ldquoTexas Offers Hard Lessons on School Accountabilityrdquo LosAngeles Times April 14 1999

Lindsay Drew ldquoWhodunit Ofcials Find Thousands of Erasures on Standard-ized Tests and Suspect Tamperingrdquo Education Week (1996) 25ndash29

Loughran Regina and Thomas Comiskey ldquoCheating the Children Educator

876 QUARTERLY JOURNAL OF ECONOMICS

Misconduct on Standardized Testsrdquo Report of the City of New York SpecialCommissioner of Investigation for the New York City School District 1999

Marcus John ldquoFaking the Graderdquo Boston Magazine February 2000May Meredith ldquoState Fears Cheating by Teachersrdquo San Francisco Chronicle

October 4 2000Perlman Carol L ldquoResults of a Citywide Testing Program Audit in Chicagordquo

Paper for American Educational Research Association annual meeting Chi-cago IL 1985

Porter Robert H and J Douglas Zona ldquoDetection of Bid Rigging in ProcurementAuctionsrdquo Journal of Political Economy CI (1993) 518ndash538

Richards Craig E and Tian Ming Sheu ldquoThe South Carolina School IncentiveReward Program A Policy Analysisrdquo Economics of Education Review XI(1992) 71ndash86

Rohlfs Chris ldquoEstimating Classroom Learning Using the School Breakfast Pro-gram as an Instrument for Attendance and Class Size Evidence from ChicagoPublic Schoolsrdquo Unpublished manuscript University of Chicago 2002

Tysome Tony ldquoCheating Purge Inspectors outrdquo New York Times Higher Educa-tion Supplement August 19 1994

Wollack James A ldquoA Nominal Response Model Approach for Detecting AnswerCopyingrdquo Applied Psychological Measurement XX (1997) 144ndash152

877ROTTEN APPLES

Page 30: rotten apples: an investigation of the prevalence and predictors of ...

researcher to employ nonstandard approaches in generating sucha measure

APPENDIX THE CONSTRUCTION OF SUSPICIOUS STRING MEASURES

This appendix describes in greater detail how we constructeach of our measures of unexpected or suspicious responses

The rst measure focuses on the most unlikely block of iden-tical answers given on consecutive questions Using past testscores future test scores and background characteristics wepredict the likelihood that each student will give each answer oneach question For each item a student has four choices (A B Cor D) only one of which is correct We estimate a multinomiallogit for each item on the exam in order to predict how studentswill respond to each question We estimate the following modelfor each item using information from other students in that yeargrade and subject

(1) Pr~Yisc 5 j 5eb jxs

Sj51J eb jxs

where Y isc indicates the response of student s in class c on itemi the number of possible responses ( J) is four and Xs is a vectorthat includes measures of prior and future student achievementin math and reading as well as demographic variables (such asrace gender and free lunch status) for student s Thus a stu-dentrsquos predicted probability of choosing a particular response isidentied by the likelihood of other students (in the same yeargrade and subject) with similar background characteristicschoosing that response

Notice that by including future as well as prior test scores inthe model we decrease the likelihood that students with unusu-ally good teachers will be identied as cheaters since these stu-dents will likely retain some of the knowledge learned in the baseyear and thus have higher future test scores Also note that byestimating the probability of selecting each possible responserather than simply estimating the probability of choosing thecorrect response we take advantage of any additional informa-tion that is provided by particular response patterns in aclassroom

Using the estimates from this model we calculate the pre-dicted probability that each student would answer each item in

872 QUARTERLY JOURNAL OF ECONOMICS

the way that he or she in fact did

(2) pisc 5ebkxs

S j51J eb j xs

for k

5 response actually chosen by student s on item i

This provides us with one measure per student per item Takingthe product over items within student we calculate the probabil-ity that a student would have answered a string of consecutivequestions from item m to item n as he or she did

(3) pscmn 5 P

i5m

n

pisc

We then take the product across all students in the classroomwho had identical responses in the string If we dene z as astudent Szc

m n as the string of responses for student z from item mto item n and S sc

m n and as the string for student s then we canexpress the product as

(4) pscmn 5 P

s[$ zS icmn5 S sc

mn

pscmn

Note that if there are ns students in class c and each student hasa unique set of responses to these particular items then psc

m n

collapses to pscm n for each student and there will be ns distinct

values within the class On the other extreme if all of the stu-dents in class c have identical responses then there is only onedistinct value of psc

m n We repeat this calculation for all possibleconsecutive strings of length three to seven that is for all Sm n

such that 3 m 2 n 7 To create our rst indicator ofsuspicious string patterns we take the minimum of the predictedblock probability for each classroom

MEASURE 1 M1c 5 mins( ps cm n)

This measure captures the least likely block of identical answersgiven on consecutive questions in the classroom

The second measure of suspicious answer strings is intendedto capture more general patterns of similarity in student re-sponses To construct this measure we rst calculate the resid-

873ROTTEN APPLES

uals for each of the possible choices a student could have made foreach item

(5) e jisc 5 0 2eb jxs

S j51J eb j xs

if j THORN k

5 1 2eb j xs

S j51J eb jxs

if j 5 k

where e jis c is the residual for response j on item i by student s inclassroom c We thus have four separate residuals per studentper item

To create a classroom level measure of the response to item iwe need to combine the information for each student First wesum the residuals for each response across students within aclassroom

(6) ejic 5 Os

e jisc

If there is no within-class correlation in the way that studentsresponded to a particular item this term should be approximatelyzero Second we sum across the four possible responses for eachitem within classrooms At the same time we square each of thecomponent residual measures to accentuate outliers and divideby number of students in the class (nsc) to normalize by class size

(7) v ic 5S j ejic

2

nsc

The statistic vic captures something like the variance of studentresponses on item i within classroom c Notice that we choose torst sum across the residuals of each response across studentsand then sum the classroom level measures for each responserather than summing across responses within student initiallyWe do this in order to emphasize the classroom level tendencies inresponse patterns

Our second measure of suspicious strings is simply the class-room average (across items) of this variance term across all testitems

MEASURE 2 M2c 5 v c 5 yeni vic ni where ni is the number of itemson the exam

Our third measure is simply the variance (as opposed to the mean) in the degree of correlation across questions within a classroom:

MEASURE 3. $M3_c = \sigma_{v_c}^2 = \sum_i (v_{ic} - \bar{v}_c)^2 / n_i$.
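As an illustration of how equations (5)–(7) feed into Measures 2 and 3, the following sketch continues the example above; q is assumed to be a one-hot (0/1) array of the observed choices and p the predicted probabilities, both illustrative names rather than the authors' implementation.

```python
import numpy as np

def measures_2_and_3(q, p):
    """Classroom mean and variance of the per-item statistic v_ic.

    q: (n_students, n_items, J) one-hot array of observed choices.
    p: (n_students, n_items, J) predicted response probabilities.
    """
    e = q - p                  # equation (5): residual for every option
    e_class = e.sum(axis=0)    # equation (6): sum residuals across students
    n_students = q.shape[0]
    # Equation (7): square, sum over the J options, normalize by class size.
    v = (e_class ** 2).sum(axis=1) / n_students
    m2 = v.mean()              # Measure 2: average of v_ic over items
    m3 = v.var()               # Measure 3: variance of v_ic over items
    return m2, m3
```

Because the residuals are summed across students before squaring, idiosyncratic student-level errors tend to cancel while classroom-wide tendencies are accentuated, which is exactly the design choice described above.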

Our final indicator focuses on the extent to which a student's response pattern differed from those of other students with the same aggregate score that year. Let $q_{isc}$ equal one if student $s$ in classroom $c$ answered item $i$ correctly, and zero otherwise. Let $A_s$ equal the aggregate score of student $s$ on the exam. We then determine what fraction of students at each aggregate score level answered each item correctly. If we let $n_{sA}$ equal the number of students with an aggregate score of $A$, then this fraction $\bar{q}_i^A$ can be expressed as

(8)  $\bar{q}_i^A = \dfrac{\sum_{s \in \{z : A_z = A_s\}} q_{isc}}{n_{sA}}$.

We then calculate a measure of how much the response pattern of student $s$ differed from the response pattern of other students with the same aggregate score. We do so by subtracting a student's answer on item $i$ from the mean response of all students with aggregate score $A$, squaring these deviations, and then summing across all items on the exam:

(9)  $Z_{sc} = \sum_i (q_{isc} - \bar{q}_i^A)^2$.

We then subtract out the mean deviation for all students with the same aggregate score, $\bar{Z}_A$, and sum across the students within each classroom to obtain our final indicator:

MEASURE 4. $M4_c = \sum_s (Z_{sc} - \bar{Z}_A)$.
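Finally, a sketch of Measure 4 under the same caveats: correct, score, and classroom are hypothetical inputs covering every student who took the exam, not just one class.

```python
import numpy as np

def measure_4(correct, score, classroom, target_class):
    """Sum of centered squared deviations from same-score peers.

    correct: (n_students, n_items) 0/1 array, q_isc in the text.
    score: (n_students,) aggregate score A_s of each student.
    classroom: (n_students,) classroom identifier of each student.
    """
    z = np.zeros(correct.shape[0])
    for a in np.unique(score):
        peers = score == a
        q_bar = correct[peers].mean(axis=0)                # equation (8)
        dev = ((correct[peers] - q_bar) ** 2).sum(axis=1)  # equation (9)
        z[peers] = dev - dev.mean()     # subtract the mean deviation Z_A
    # Measure 4: sum the centered deviations over students in the classroom.
    return z[classroom == target_class].sum()
```

A large positive value indicates that students in a classroom answered items in ways that are unusual relative to other students with the same overall score, as one might expect if some answers had been altered.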

KENNEDY SCHOOL OF GOVERNMENT, HARVARD UNIVERSITY

UNIVERSITY OF CHICAGO AND AMERICAN BAR FOUNDATION

REFERENCES

Aiken, Linda R., "Detecting, Understanding, and Controlling for Cheating on Tests," Research in Higher Education, XXXII (1991), 725–736.

Angoff, William H., "The Development of Statistical Indices for Detecting Cheaters," Journal of the American Statistical Association, LXIX (1974), 44–49.

Baker, George, "Incentive Contracts and Performance Measurement," Journal of Political Economy, C (1992), 598–614.

Bellezza, Francis S., and Suzanne F. Bellezza, "Detection of Cheating on Multiple-Choice Tests by Using Error Similarity Analysis," Teaching of Psychology, XVI (1989), 151–155.

Carnoy, Martin, and Susanna Loeb, "Does External Accountability Affect Student Outcomes? A Cross-State Analysis," Working Paper, Stanford University School of Education, 2002.

Contreras, Russell, "Inquiry Uncovers Possible Cheating on GRE in Asia," Associated Press, August 7, 2002.

Cullen, Julie Berry, and Randall Reback, "Tinkering Toward Accolades: School Gaming under a Performance Accountability System," Working Paper, University of Michigan, 2002.

Deere, Donald, and Wayne Strayer, "Putting Schools to the Test: School Accountability, Incentives, and Behavior," Working Paper, Texas A&M University, 2002.

DiTella, Rafael, and Ernesto Schargrodsky, "The Role of Wages and Auditing during a Crackdown on Corruption in the City of Buenos Aires," unpublished manuscript, Harvard Business School, 2001.

Duggan, Mark, and Steven Levitt, "Winning Isn't Everything: Corruption in Sumo Wrestling," American Economic Review, XCII (2002), 1594–1605.

Figlio, David N., "Testing, Crime and Punishment," Working Paper, University of Florida, 2002.

Figlio, David N., and Joshua Winicki, "Food for Thought: The Effects of School Accountability Plans on School Nutrition," Working Paper, University of Florida, 2001.

Figlio, David N., and Lawrence S. Getzler, "Accountability, Ability and Disability: Gaming the System," Working Paper, University of Florida, 2002.

Fisman, Ray, "Estimating the Value of Political Connections," American Economic Review, XCI (2001), 1095–1102.

Frary, Robert B., "Statistical Detection of Multiple-Choice Answer Copying: Review and Commentary," Applied Measurement in Education, VI (1993), 153–165.

Frary, Robert B., T. Nicholas Tideman, and Thomas Watts, "Indices of Cheating on Multiple-Choice Tests," Journal of Educational Statistics, II (1977), 235–256.

Glaeser, Edward, and Andrei Shleifer, "A Reason for Quantity Regulation," American Economic Review, XCI (2001), 431–435.

Grissmer, David W., Ann Flanagan, et al., "Improving Student Achievement: What NAEP Test Scores Tell Us," RAND Corporation, MR-924-EDU, 2000.

Hanushek, Eric A., and Margaret E. Raymond, "Improving Educational Quality: How Best to Evaluate Our Schools?" Conference paper, Federal Reserve Bank of Boston, 2002.

Hofkins, Diane, "Cheating 'Rife' in National Tests," New York Times Educational Supplement, June 16, 1995.

Holland, Paul W., "Assessing Unusual Agreement between the Incorrect Answers of Two Examinees Using the K-Index: Statistical Theory and Empirical Support," Technical Report No. 96-5, Educational Testing Service, 1996.

Holmstrom, Bengt, and Paul Milgrom, "Multitask Principal-Agent Analyses: Incentive Contracts, Asset Ownership, and Job Design," Journal of Law, Economics, and Organization, VII (1991), 24–51.

Jacob, Brian A., "Accountability, Incentives and Behavior: The Impact of High-Stakes Testing in the Chicago Public Schools," Working Paper No. 8968, National Bureau of Economic Research, 2002.

Jacob, Brian A., and Lars Lefgren, "The Impact of Teacher Training on Student Achievement: Quasi-Experimental Evidence from School Reform Efforts in Chicago," Journal of Human Resources, forthcoming.

Jacob, Brian A., and Steven Levitt, "Catching Cheating Teachers: The Results of an Unusual Experiment in Implementing Theory," Working Paper No. 9413, National Bureau of Economic Research, 2003.

Klein, Stephen P., Laura S. Hamilton, et al., "What Do Test Scores in Texas Tell Us?" RAND Corporation, 2000.

Kolker, Claudia, "Texas Offers Hard Lessons on School Accountability," Los Angeles Times, April 14, 1999.

Lindsay, Drew, "Whodunit? Officials Find Thousands of Erasures on Standardized Tests and Suspect Tampering," Education Week (1996), 25–29.

Loughran, Regina, and Thomas Comiskey, "Cheating the Children: Educator Misconduct on Standardized Tests," Report of the City of New York Special Commissioner of Investigation for the New York City School District, 1999.

Marcus, John, "Faking the Grade," Boston Magazine, February 2000.

May, Meredith, "State Fears Cheating by Teachers," San Francisco Chronicle, October 4, 2000.

Perlman, Carol L., "Results of a Citywide Testing Program Audit in Chicago," Paper for American Educational Research Association annual meeting, Chicago, IL, 1985.

Porter, Robert H., and J. Douglas Zona, "Detection of Bid Rigging in Procurement Auctions," Journal of Political Economy, CI (1993), 518–538.

Richards, Craig E., and Tian Ming Sheu, "The South Carolina School Incentive Reward Program: A Policy Analysis," Economics of Education Review, XI (1992), 71–86.

Rohlfs, Chris, "Estimating Classroom Learning Using the School Breakfast Program as an Instrument for Attendance and Class Size: Evidence from Chicago Public Schools," Unpublished manuscript, University of Chicago, 2002.

Tysome, Tony, "Cheating Purge: Inspectors out," New York Times Higher Education Supplement, August 19, 1994.

Wollack, James A., "A Nominal Response Model Approach for Detecting Answer Copying," Applied Psychological Measurement, XX (1997), 144–152.
