TEACHER QUALITY 2.0

Teacher Quality 2.0

The HangoverThinking about the Unintended Consequences of the Nation’s Teacher Evaluation Binge

Sara Mead, Andrew Rotherham, Rachael Brown | September 2012

A M E R I C A N E N T E R P R I S E I N S T I T U T E Special Report 2

Special Report 2

1

Teacher Quality 2.0

Foreword

There is incredible interest and energy today in addressing issues of human capital in K–12 education,especially in the way we prepare, evaluate, pay, and manage teachers. States have been developing andimplementing systems intended to improve these practices, with a considerable push from foundationsand the federal government.

As we start to rethink outdated tenure, evaluation, and pay systems, we must take care to respecthow uncertain our efforts are and avoid tying our hands in ways that we will regret in the decade ahead.Well-intentioned legislators too readily replace old credential- and paper-based micromanagement withmandates that rely heavily on still-nascent observational evaluations and student outcome measurementsthat pose as many questions as answers. The flood of new legislative activity is in many respects wel-come, but it does pose a risk that premature solutions and imperfect metrics are being cemented intodifficult-to-change statutes.

AEI’s Teacher Quality 2.0 series seeks to reinvigorate our now-familiar conversations about teacherquality by looking at today’s reform efforts as constituting initial steps on a long path forward. As weconceptualize it, “Teacher Quality 2.0” starts from the premise that while we’ve made great improve-ments in the past ten years in creating systems and tools that allow us to evaluate, compensate, anddeploy educators in smarter ways, we must not let today’s “reform” conventions about hiring, evalua-tion, or pay limit school and system leaders’ ability to adapt more promising staffing and school models.

In this second installment of the series, Sara Mead, Andrew Rotherham, and Rachael Brown ofBellwether Education Partners draw out the key tensions and trade-offs associated with our sprint tolegislate and build educator evaluation systems. While we can take pride in the progress we have made,the authors call for some humility in thinking about the practical limits of such statutes and processes,and remind us that such rigid structures might unintentionally stifle future innovation. They explain:“the recent evaluation binge is not without risks…headlong rushes inevitably produce unintended con-sequences—something akin to a policy hangover as ideas move from conception to implementa-tion.” Mead, Rotherham, and Brown identify several recommendations as to how to approach thechallenges that lie ahead as we move into the next generation of teacher quality reform.

I found the piece to be a useful and engaging contribution to this series, and am hopeful that youwill do the same. For further information on the paper, Mead can be reached at [email protected]. For additional information on the activities of AEI’s education policy program, please visitwww.aei.org/hess or contact Lauren Aronson at [email protected].

—FREDERICK M. HESS Director of Education Policy Studies

American Enterprise Institute

Special Report 2

2

Teacher Quality 2.0

Over the past three years, more than twenty US stateshave passed legislation establishing new teacher evalua-tion requirements and systems, and even more have com-mitted to do so in Race to the Top or Elementary andSecondary Education Act Flexibility Waiver applications.These new evaluation systems have real potential to fostera more performance-oriented public education culturethat gives teachers meaningful feedback about the qualityand impact of their work. But there are pitfalls in states’rush to legislate new systems, and there are real tensionsand trade-offs in their design.

Unfortunately, much of the current policy debatehas been framed in stark ideological terms that leave littleroom for adult discussion of these tensions. This paperseeks to move the debate beyond ideology and technicalissues by highlighting four key tensions that policymak-ers, advocates, and educators must consider in the devel-opment of new teacher evaluations:

• Flexibility versus control: There is a temptation to prescribe and legislate details of evaluations toensure rigor and prevent evaluations from beingwatered down in implementation. But overly pre-scriptive policies may also limit school autonomyand stifle innovation that could lead to the develop-ment of better evaluations.

• Evaluation in an evolving system: Poorly designedevaluation requirements could pose an obstacle toblended learning and other innovative models inwhich it is difficult or impossible to attribute stu-dent learning gains in a particular subject to a par-ticular teacher.

• Purposes of evaluations: New evaluation systemshave been sold as a way both to identify and dis-miss underperforming teachers and to provide allteachers with useful feedback to help them improvetheir performance. But there are strong tensions

between these purposes that create trade-offs inevaluation system design.

• Evaluating teachers as professionals: Advocatesargue that holding teachers responsible for their performance will bring teaching more in line withnorms in other fields, but most professional fieldsrely on a combination of data and managerial judg-ment when making evaluation and personnel deci-sions, and subsequently hold managers accountablefor those decisions, rather than trying to eliminatesubjective judgments as some new teacher evaluationsystems seek to do.

Recognizing these tensions and trade-offs, this paperoffers several policy recommendations:

• Be clear about the problems new evaluation systemsare intended to solve.

• Do not mistake processes and systems as substitutesfor cultural change.

• Look at the entire education ecosystem, includingbroader labor-market impacts, pre- and in-servicepreparation, standards and assessments, charterschools, and growth of early childhood educationand innovative school models.

• Focus on improvement, not just deselection.

• Encourage and respect innovation.

• Think carefully about waivers versus umbrellas.

• Do not expect legislation to do regulation’s job.

• Create innovation zones for pilots—and fund them.

Executive Summary

Teacher evaluation is hot these days. In the past two years,more than twenty US states have passed legislationchanging teacher evaluation systems to include evidenceof teachers’ impact on student achievement. This feveredpace of legislation was sparked by the federal Race to theTop (RTT) program, which called on states to “Designand implement rigorous, transparent, and fair evaluationsystems for teachers and principals that differentiate effec-tiveness.” And the US Department of Education’s Ele-mentary and Secondary Education Act (ESEA) flexibilitywaiver process built momentum by demanding such policies as a condition for a waiver from No Child LeftBehind (NCLB) requirements. National and local mediahave added to the clamor; newspapers in New York andLos Angeles even published individual teacher value-added ratings as major stories.

After years of policies that ignored differences inteacher effectiveness, the pendulum is swinging in theother direction. By and large, this is progress—researchshows that teachers affect student achievement more thanany other within-school factor. Decades of inattention toteacher performance have been detrimental to students,teachers, and the credibility of the teaching profession.Addressing this problem is critical to improving publiceducation outcomes and raising the status of teaching,and neither the issues raised in this paper nor technicalconcerns about the design and mechanisms of evaluationsystems should be viewed as a reason not to move towarda more performance-oriented public education culturethat gives teachers meaningful feedback about the qualityand impact of their work.

Yet the recent evaluation binge is not without risks.By nature, education policymaking tends to lurch frominattention to overreach. When a political momentappears, policymakers and advocates rush to take advan-tage as quickly as they can, knowing that opportunitiesfor real change are fleeting. This is understandable, andarguably necessary, given the nature of America’s politicalsystem. But headlong rushes inevitably produce unin-tended consequences—something akin to a policy hang-over as ideas move from conception to implementation.

Welcome to teacher evaluation’s morning after. Asstates move from paying no attention to evaluation tostarting to design and implement ambitious statewideevaluation systems, it is clear that many are strugglingwith technical and political challenges. States are figuringout how to systematically evaluate personnel for whomevaluations have been cursory for years. They are incorpo-rating value-added measures of student learning into eval-uations for teachers in tested grades and subjects. Theyare grappling with the even greater challenge of how tomeasure impact on student learning for the majority ofteachers who teach nontested grades and subjects. Theyare also coping with insufficient managerial capacity.And, in the process, many are running up against thelimits of the carefully constructed systems and design fea-tures that well-meaning policy wonks told them werecritical to effective teacher evaluation systems.

And we are not even beginning to see the greatest ofthese challenges. The current range of teacher evaluationpolicies is designed with an eye toward the US educationsystem as it currently exists—even as technological innova-tion, blended learning, and the growth of charter andportfolio models in many urban areas are fundamentallychanging the way the system works. If we are not careful,new teacher evaluations will become another Ice Nine–likeelement in education, freezing in place what they touchand ultimately becoming just as much of an impedimentto progress as the old, inadequate systems they have displaced.

Special Report 2

3

Teacher Quality 2.0

The HangoverThinking about the Unintended Consequences of the Nation’s Teacher Evaluation Binge

Sara Mead, Andrew Rotherham, Rachael Brown

Sara Mead ([email protected]) is principal at Bellwether Education Partners. Andrew Rotherham ([email protected]) is cofounder of and partner at Bellwether Education Partners. Rachael Brown ([email protected]) is an associate at Bellwether EducationPartners.

Special Report 2

4

Teacher Quality 2.0

That issue is the focus of this paper. We trace the evo-lution of teacher evaluation policy, including the very realproblems with the previous status quo that created thecurrent situation. We then turn to the question of whatcosts today’s focus on teacher evaluation may carry forinnovation, for innovative schools, and for efforts withinthe public education system to do things differently.

We believe that the current attention to teacher eval-uation is long overdue and that as a nation America hasyet to wrestle honestly with the issue of teacher quality.Yet one can acknowledge that and still worry about whatis happening today. And we do.

So how did we get here?

Historical Context

Current efforts to establish teacher evaluation systemsmust be understood in historical context as the mostrecent in a series of reforms—stretching back to the ori-gins of public schooling— that sought to improve thequality of teaching in public schools. These historicalefforts have emphasized different indicators of teacherquality and mechanisms for improvement. But researchsuggests that many of these mechanisms and indicators infact have little relationship to improved student learning.Proponents of new teacher evaluation systems have seizedon this strategy, not because they “hate” teachers as somesuggest, but because the existing evidence suggests thathow teachers actually perform in the classroom is a farbetter indicator of quality than the proxy indicators—certification, years of experience, postgraduate credentials—on which our educational system currently relies.

For over a century, efforts to improve the qualityof teaching have largely focused on a “professionalism”agenda, seeking to improve quality through increasingstate regulation and formal training requirements forteacher certification and licensure.1 Beginning in the1980s, standards-based reformers also called for greaterrigor in teacher preparation programs, which reformersargued placed too much emphasis on pedagogical theo-ries and too little on ensuring teachers had deep contentknowledge in the subjects they taught. Massachusetts, forexample, enacted reforms that required all teachers tohold a major in a subject area other than education andto pass rigorous licensure exams of communication andliteracy skills and academic content knowledge.2

Building on this concern, NCLB also emphasizesteachers’ subject matter content knowledge. The law’s“highly qualified teacher” (HQT) provisions require all

teachers to hold a bachelor’s degree and state licensureand to demonstrate knowledge of the subject they teach,through either a college major, passage of a certificationexam, or, for veteran teachers, by meeting a state-defined“highly objective, uniform state standard of evaluation”(HOUSSE). These provisions were designed to ensurethat teachers have subject matter knowledge specific tothe subjects they teach, reduce the rate of “out of field”teaching in middle and secondary school, and improveequity in the distribution of qualified teachers for poorand minority students. But while the HQT provisionswere designed with the best of intentions, they ultimatelyfell short, creating paperwork hoops for teachers andschools to jump through without necessarily improvingthe quality of instruction. Because the definition of a“highly qualified teacher” relies almost entirely on ateacher’s subject matter knowledge, there is no guaranteethat teachers who meet the HQT standard are effective inimproving student achievement.3 NCLB’s provisionsrequiring states to ensure low-income and minority stu-dents’ equitable access to quality teachers also rely on theHQT standard, so they do not ensure that these studentsare taught by effective teachers.

Unfortunately, research suggests that many of theindicators policymakers have historically relied on asmeasures of teacher quality are at best very weak predic-tors of teachers’ effectiveness in improving student learn-ing. Indeed, even in studies that account for the range ofcharacteristics commonly perceived as external indicatorsof teacher quality, these characteristics collectively accountfor only a small percentage of the observed variation inteacher effectiveness, as measured by impact on studentlearning.4 Research shows that holding a master’s degree—a proxy for quality that most teacher compensation sys-tems reward—has no positive correlation with improvedstudent learning, with the exception of secondary mathand science teachers who hold master’s degrees in thosesubjects.5 As this finding suggests, indicators of subjectknowledge are correlated with improved effectiveness forsome teachers, but content knowledge alone does notensure teacher effectiveness. Research also shows thatwhile experience does matter in teacher quality, themajority of teacher improvement comes in the first few

Content knowledge alone does not

ensure teacher effectiveness.

years of teaching, and returns diminish beyond thatpoint.6 Similarly, a wide body of empirical research findslittle relationship between a teacher’s licensure credentialsand certification and her impact on student performance;the variation in effectiveness among teachers from thesame preparation pathway is much greater than the dif-ference between pathways.7 Given the lack of evidencethat indicates that many commonly viewed indicators ofquality actually correlate with improved performance, it isnot surprising that the last century of teacher qualityefforts have often proved disappointing.

But even as policymakers and advocates for low-income and minority students have grown disillusionedwith NCLB’s HQT provisions and other policies that relyon teacher credentials, another feature of NCLB—itsannual testing requirements in grades three through eight—is generating abundant student achievement data that aretransforming the national debate on teacher quality. Datasystems in a growing number of states are linking studentachievement data to teachers, making it possible to calcu-late individual teachers’ impact on student learning.

The Growth of Value-Added Measures. Value-addedmeasures of teacher effectiveness have been used to someextent for many years. In the early 1980s, two statisticiansat the University of Tennessee—William Sanders andRobert McLean—began experimenting with statisticalmethodologies to mitigate some of the challenges of usingstudent achievement data to assess teacher and schooleffectiveness. Working with data from Tennessee schooldistricts, they found evidence that there are significantmeasurable differences in schools’ and teachers’ impactson student learning and that estimates of school andteacher effectiveness tend to be consistent from year toyear. Value-added research also showed that these differ-ences could have significant implications for studentlearning, potentially large enough to meaningfully narrowor widen achievement gaps.8

In 1991, the Tennessee legislature passed the Educa-tion Improvement Act, which created a new statewideschool accountability system, a major component ofwhich was the Tennessee Value-Added AssessmentSystem, based on Sanders’s and McLean’s work. Althoughthis value-added model had clear advantages for improv-ing both school and teacher accountability, it remainedlargely under the radar outside of Tennessee and a smallcircle of education policy wonks for several years, in partbecause many states lacked both the annual assessmentsand robust data-tracking systems needed to replicate theTennessee model.9

This changed in 2002 with the passage of NCLB,which focused national attention on educational account-ability, requiring states to annually assess all students ingrades three through eight and producing an unprec-edented wealth of student achievement data. The law alsorequired states to identify schools and districts not meet-ing “adequate yearly progress” (AYP)—a measure of aschool’s student achievement. Discontent with AYP—anadmittedly crude measure that calculated annual schoolperformance but not student progress—led some policyanalysts to look to Tennessee’s value-added model as apotentially better way to measure school performance. Atthe same time, frustrations with the limitations of NCLB’sHQT provisions sparked interest in value-added measuresas a means to ensure equitable access to effective teachingfor low-income and minority students.

In a 2004 paper published by the Education Trust,policy analyst Kevin Carey made the case for using value-added data to evaluate teacher effectiveness and offered aset of recommendations for moving toward value-addedas a measure of teacher quality. Specifically, Carey recom-mended that policymakers develop and support systemsto collect and analyze value-added data; require evidenceof student learning as part of teacher preparation andlicensure; and use value-added data to inform teacherrecruitment, hiring, compensation, performance evalua-tion, and professional development. “When used wisely,”Carey wrote, value-added data “provides a strong basis foractions that will help states, districts, and schools improveteacher quality, raise overall student achievement, andclose the achievement gap.”10

But the use of value-added data for teacher evalua-tion really seized attention following a provocative 2006Brookings Institution report by Robert Gordon, ThomasKane, and Douglas Staiger. Given the paucity of evidence

Special Report 2

5

Teacher Quality 2.0

Data systems in a growing number

of states are linking student achieve-

ment data to teachers, making it

possible to calculate individual

teachers’ impact on student learning.

Special Report 2

6

Teacher Quality 2.0

that traditional entry credentials—such as certification—are predictive of teachers’ effectiveness in the classroom,the authors argued that policymakers would do better toreduce these barriers to entry into the profession. Instead,policymakers should seek to improve quality by focusingon value-added measures of teachers’ performance oncein the classroom, rewarding teachers who improve stu-dent performance and dismissing those who do not.Specifically, the paper’s most controversial recommenda-tion called for schools to dismiss the lowest-performing25 percent of teachers after two years of teaching. Byreducing barriers to entry, creating incentives for highperformers, and annually dismissing the lowest-perform-ing 25 percent of teachers, the authors argued, policy-makers could rapidly increase the level of effectiveness inthe teaching profession and generate improved studentachievement.11

These proposals were particularly provocative whencontrasted with prevailing practices in teacher evaluation.In a seminal 2009 report entitled The Widget Effect, theNew Teacher Project (TNTP) detailed the failure ofmost existing teacher evaluation systems, in which lessthan 1 percent of teachers received unsatisfactory ratings.The name of the report, which draws from an extensiveanalysis of over forty thousand teacher evaluation records,refers to the way in which existing evaluation systems failto meaningfully differentiate teacher performance, creat-ing a situation in which teachers are treated like inter-changeable widgets. “As a result,” the authors wrote,“teacher effectiveness is largely ignored. Excellent teach-ers cannot be recognized or rewarded, chronically low-performing teachers languish, and the wide majority ofteachers performing at moderate levels do not get thedifferentiated support and development they need toimprove as professionals.”12 The report called for schooldistricts to replace these broken evaluation systems withcomprehensive performance-based evaluation modelsthat differentiate teachers based on their effectiveness inboosting student achievement and that provide targetedprofessional development to help them improve.13

Teacher Evaluation 2.0. Even as TNTP detailed the fail-ings of current teacher evaluations, a new kind of teacherevaluation system was emerging—amid considerable con-troversy—in Washington, DC. In 2009, under the lead-ership of Chancellor Michelle Rhee, founder of TNTP,the district launched a new evaluation system—IMPACT.Under IMPACT, teachers are evaluated in four areas:

• Impact on student achievement (measured through

teacher- and school-level value-added data);

• Instructional expertise (measured through formalclassroom observations);

• Collaboration and commitment to school commu-nity; and

• Professionalism (which includes attendance, on-timearrival, following policies and procedures, and treat-ing colleagues and students with respect).

For teachers in grades and subjects that take the DCassessment, impact on student achievement comprises 55 percent of a teacher’s evaluation (50 percent based onindividual value-added data and 5 percent based onschool value-added), instructional expertise counts for 35 percent, and commitment to school community con-stitutes 10 percent. Professionalism is not scored, but fail-ure to meet standards for professionalism can reduce ateacher’s rating. Based on their performance in these fourareas, teachers receive one of four ratings:

• Highly Effective

• Effective

• Minimally Effective

• Ineffective

These ratings have implications for both teachers’compensation and continued employment. Teachers whoare rated highly effective can receive a bonus of up to$25,000, and those who are rated as such for two con-secutive years may receive a base pay increase of up to$20,000. Teachers rated as effective simply advance nor-mally on the pay scale. Teachers rated as minimally effec-tive receive targeted professional development to helpthem improve, and those who do not improve after twoyears may be dismissed. Finally, teachers rated ineffectiveare subject to dismissal. With strong encouragement fromthe federal government, IMPACT soon became themodel for a new wave of state and district teacher evalua-tion systems in states such as Florida and Illinois.

These fledgling efforts to improve evaluations caughtthe attention of national policymakers. The move towardnew systems of teacher evaluation gained steam in July2009, when the US Department of Education releaseddraft application guidelines for the RTT program. Cre-

ated as part of the American Recovery and ReinvestmentAct (ARRA)—the $787 billion stimulus bill passed bythe US Congress earlier that year to revive a flaggingeconomy—RTT was a $4.35 billion pot of money fromwhich the US Department of Education was authorizedto make competitive grants to states making progress infour “assurance areas” defined in the ARRA legislation:

• Making progress toward rigorous college- and career-ready standards and high-quality assessments that arevalid and reliable for all students, including Englishlanguage learners and students with disabilities;

• Establishing pre–K through college and career datasystems that track progress and foster continuousimprovement;

• Making improvements in teacher effectiveness and inthe equitable distribution of qualified teachers for allstudents, particularly students who are most in need;and

• Providing intensive support and effective interven-tions for the lowest-performing schools.

Within those parameters, the department had broadlatitude to define specific application criteria and priori-ties for RTT. And it chose to use the RTT guidance topromote the development of new teacher evaluation sys-tems that reflected The Widget Effect’s recommendations.The “Great Teachers and Leaders” component of RTTaccounted for more points than any other section of theapplication—more than one quarter of the total points.The largest component of that section—worth over 10percent of the total RTT application—required states toarticulate their plans for developing teacher and principalevaluation systems that evaluate all teachers and principalsat least annually, include student achievement growth asa significant factor in teacher evaluations, differentiateteacher effectiveness in multiple categories, and use teacherevaluation results to inform key personnel decisions,including professional development, compensation, pro-motion, retention, tenure, certification, and dismissal.The department also created an eligibility requirement forRTT that made any state that established statutory bar-riers to using student achievement data in teacher evalua-tion ineligible to win an RTT grant.14

RTT instigated a flurry of state activity around teacherevaluation.15 By the time the first round of applicationswas submitted in early 2010, eleven states had passed leg-

islation to eliminate statutory barriers to using studentachievement data in teacher evaluations, established newstandards for school and district teacher evaluations, orcreated new state teacher evaluation systems.16 Even morewould follow in the next two years, bringing the total tonearly twenty by the end of 2011. Other states accom-plished similar results through regulatory action. To sup-port states in establishing teacher evaluation policies thatmet the RTT criteria, TNTP published Teacher Evalua-tion 2.0, a report drawing on extant research literature tooutline six key design features for teacher evaluation sys-tems.17 These include:

• Annual evaluations (at least), including evaluationsof veteran teachers;

• Clear, rigorous expectations that prioritize studentlearning;

• Multiple measures of performance, primarily theteacher’s impact on students’ academic growth;

• Multiple rating levels (at least four) with cleardescriptions and expectations at each level;

• Frequent observations and ongoing constructivecritical feedback; and

• Use in employment decisions including bonuses,tenure, compensation, promotion, and dismissal.

These features have become the grammar of the eval-uation conversation. Most of the recent state laws over-hauling teacher evaluation requirements also reflect thesefeatures, although states vary in the extent to which theydo so, as reflected in a variety of ratings of state teacherevaluation laws produced by Democrats for EducationReform, the National Council on Teacher Quality, andBellwether Education Partners, the latter of which wasdone by the authors of this paper.18 While these ratingsvary in their focus and issues covered, all reflect the sixdesign principles articulated in TNTP’s Teacher Evalua-tion 2.0. Further, some of them go well beyond the sixdesign principals to specify how state teacher evaluationsystems should address a number of topics, including the role of teacher input in evaluation development,guidelines for choosing evaluators, and an appeals ormediation process.

Most recently, guidelines for the US Department of Education’s ESEA flexibility waiver process—which

Special Report 2

7

Teacher Quality 2.0

Special Report 2

8

Teacher Quality 2.0

allows states to request a waiver of key provisions ofNCLB, including the goal of 100 percent student proficiency by 2014—require states applying for a waiverto commit to “develop, adopt, pilot, and implement”teacher and principal evaluation systems that:

• Evaluate educators on a regular basis;

• Support instructional improvement;

• Differentiate performance using at least three levels;

• Incorporate multiple measures of educator perform-ance, including student growth as a significant factor;

• Provide clear, timely, and useful feedback to informprofessional development; and

• Inform personnel decisions.

States applying for a waiver must develop and adoptguidelines for these systems and require local educationalagencies to develop and implement evaluation systemsthat comply with those guidelines.19 Given frustrationswith NCLB in most states, these waiver requirements cre-ate a powerful incentive for states to adopt new teacherevaluation requirements aligned with federal guidelines,making it likely that Teacher Evaluation 2.0 will increas-ingly displace The Widget Effect as the norm for teacherevaluation throughout the country over the next few years.

Tensions and Trade-Offs

Recent state and district policy changes are leading tonew evaluation systems that have the potential to provideteachers with more useful feedback about their perform-ance, inform professional development, and ultimatelyimprove teacher practice. The information these systemsproduce can also help policymakers address persistentinequities in teacher distribution; improve the quality ofpreservice teacher preparation; better align compensa-tion and resource allocation decisions to what mattersmost for student learning; identify, reward, and retain highperformers through meaningful career ladders; and iden-tify and remediate underperforming teachers—and, whennecessary, exit them from the system. In short, improvedteacher evaluation policies and systems have much to offer.

That said, much of the public and policy debatearound reforming teacher evaluation has been framed in

stark either-or terms that obscure—rather than illuminate—real tensions inherent in these efforts. Too often, debatesare framed as a choice between adopting policies that dic-tate the uniform implementation of the six Teacher Evalu-ation 2.0 design elements across all schools and districts,and defending a status quo in which teacher evaluationsprovide little useful feedback to educators because nearlyeveryone is rated “satisfactory.”

Public debate about teacher evaluation tends to vas-cillate between the technical (implementation challenges,issues related to the validity and stability of value-addedmeasures, and appropriate methods for evaluating teach-ers in nontested grades and subjects) and the ideologi-cal (whether it is fair to hold teachers accountable forstudent learning and the extent to which test scores trulymeasure their most important contributions). But theemphasis on technical and ideological questions tends toelide fundamental tensions and trade-offs in the develop-ment of teacher evaluation systems, which deserve greaterconsideration and discussion than our current debatehas afforded them. The remainder of this paper addressesfour key tensions of the new generation of teacher evalua-tion systems: flexibility versus control, the role of teacherevaluation in an evolving education system, purposes forwhich evaluations are designed and used, and what itmeans to evaluate teachers as professionals. We also dis-cuss two models—developed independently from currentstate policy reforms—that encompass many elements ofTeacher Evaluation 2.0 but also serve to illustrate whatmay be lost if states implement evaluation system require-ments without caution.

Flexibility versus Control

Tensions related to centralization, flexibility, and control ofkey decisions are inherent in the structure of the US edu-cation system, which vests multiple levels of government—school district, state, federal—with the power to influencewhat happens in schools and classrooms. Over time, pol-icy approaches have oscillated between emphasizing cen-tralization as a means of quality control and prioritizingflexibility to place decision making with those “closest tothe child.” During the Progressive Era, for example,reformers sought to centralize power and decision makingat the state and urban school district level; decades later,site-based management and charter school reforms wouldseek to devolve power to school-level leaders. These shiftsreflect the tension between recognition that school- andlocal-level leaders often have the best insight into those

schools’ specific needs, and skepticism or distrust of theirability or will to make the right decisions.

Flexibility and control are not inherently in tension.It is possible for state and federal policies to be, to borrowsecretary of education Arne Duncan’s formulation, “tight”in defining accountability and outcomes goals for schools,and “loose” on the specifics of how they get there. In prac-tice, though, it can be difficult to delineate exactly wherethe boundary falls between the “ends” and “means.”

Teacher evaluation is a clear demonstration. On theone hand, the recent emphasis on teacher evaluationsseems like a shift in a “tight-loose” direction: pay lessattention to teachers’ training and other characteristics,and instead focus on the results they produce in the class-room. On the other hand, mandates that teacher evalua-tions include specific design elements could be seen asoverly prescriptive. This is particularly true in severalstates that have gone beyond the broad parameters laidout in Teacher Evaluation 2.0 and the federal RTT andwaiver guidelines, and now require school districts toadopt teacher evaluations that employ state-defined value-added models or specific teacher evaluation rubrics.

While federal waiver criteria require only that statescreate guidelines for teacher evaluation systems andensure local school districts to implement systems thatmeet those guidelines, some states—including Delawareand South Carolina—have chosen to adopt a singlestatewide teacher evaluation system, in which all districtsin the state must participate. It is one thing to say all dis-tricts must have teacher evaluation systems that includestudent achievement data, meaningful observations, andmultiple differentiated levels of performance—it isanother to mandate the specific tools used for observingteachers and analyzing performance data.

“People Proof” Systems Are Stupid Systems. Statesthat mandate statewide teacher evaluation systems, value-added measures, or rubrics are not just being controlfreaks. Given the poor track record of most school dis-tricts in conducting meaningful teacher evaluations, com-pellingly illustrated in The Widget Effect, states have goodreason to be skeptical that districts, left to their owndevices, will adopt or implement rigorous evaluation sys-tems. Moreover, getting this more rigorous teacher evalu-ation right is a complicated business—as evidenced bythe struggles of front-runner states such as Tennessee andDelaware—and many districts lack the capacity orresources to do so without state support and guidance.

But policies that deny districts the freedom to be badactors can also restrict their flexibility to adapt or inno-

vate in ways that make these systems work better. This isparticularly problematic because the experience of charterschools such as Mastery Charter Schools in Philadelphiaand districts that have successfully used teacher evalua-tions seems to indicate that aligning evaluation to aparticular school’s or district’s culture and philosophyof effective instruction is critical to maximizing thebenefit of these systems (see “Mastery Charter Schools”sidebar). If a school or district has already made signifi-cant investments or built buy-in around a particularframework of quality instructional practice, requiring it

to adopt a statewide rubric that is not fully aligned tothat framework may be counterproductive.

Casting this tension in sharp relief is the juxtaposition—in both RTT and ESEA waiver guidance—of teacherevaluation provisions with those intended to expandaccess to public charter schools and streamline state regu-latory burden on schools. Charter schools are independ-ent public schools of choice that, in many states, aregranted broad flexibility from regulatory requirements inexchange for accountability. This often includes greaterleeway than traditional schools and districts have in per-sonnel matters, including fewer barriers to dismissingunderperforming teachers. New state teacher evaluationpolicies, by mandating teacher evaluations that meet cer-tain parameters, could infringe on charters’ historical free-dom in personnel matters. Existing exemptions from stateregulations may also exempt charter schools from newteacher evaluation requirements—but federal ESEAwaiver requirements seem to discourage this.

The ESEA waiver guidance states that “An SEA[state education agency] must develop and adopt guide-lines for these systems, and LEAs must develop andimplement teacher and principal evaluation and supportsystems that are consistent with the SEA’s guidelines.”20

The term LEA, or “local education agency,” is typically

Special Report 2

9

Teacher Quality 2.0

It is one thing to say all districts must

have teacher evaluation systems—it

is another to mandate the specific

tools used for observing teachers

and analyzing performance data.

Special Report 2

10

Teacher Quality 2.0

Mastery Charter Schools is a high-performing charternetwork that operates nine turnaround or mergedschools and one new-start charter school in Philadel-phia, and it has delivered impressive results for previ-ously low-performing schools and students across itsnetwork. Mastery began developing a teacher and prin-cipal evaluation system when it became clear that it wasgoing to expand from a single charter school to multi-ple campuses—creating a need to systematize expecta-tions and evaluation in order to ensure fairness and aconsistent standard of teaching. The system, whichserves as the basis for a performance-based teachercompensation system, includes three elements:

• Instructional standards, drawn from a variety of research- and practice-based sources, as well aseducator feedback, establish common definitionsand expectations for what high-quality teachingshould look like across all ten campuses. Teacherperformance against the standards is evaluatedthrough classroom observations.

• Student performance data are measured based onstudent value-added scores from multiple assess-ments, including Pennsylvania’s state test, anationally normed exam, and regular bench-mark evaluations administered as part of Mas-tery’s instructional cycle. Mastery contractedwith a statistician to develop its own value-addedmodel using all of these data sources because thevalue-added data it receives from the state are bothtoo infrequent and too late to inform instructionalpractice or personnel decisions. Because Masteryuses regular benchmarks as part of the instructionalcycle in nearly all grades and subjects, it does notface the same nontested grades and subjects prob-lem that most state teacher evaluation systems do.

• Mastery responsibilities, values, and contributionis a more subjective measure of the extent to whichteachers demonstrate Mastery values and contributeto their school community, as evaluated by theirprincipal. The first two factors comprise a greatershare of the evaluation than this more subjectivemeasure, but Mastery feels it is important that theevaluation system reflect its particular values.

More important than the specifics of Mastery’sevaluation system is its integration into a broaderculture and commitment to developing effective teach-ing. Mastery makes significant investments in profes-sional development and individual coaching to improveteachers’ performance against the instructional stand-ards, and coaching and expectations for professionalgrowth are not limited to underperformers. In fact,high-performing teachers actively seek out such oppor-tunities. Coaches are evaluated based on their success inimproving teachers’ performance in targeted areas ofthe standards.

As a result, evaluations are but one piece in anongoing conversation about instruction, performance,and improvement that is happening throughout theyear among teachers, coaches, and principals. Becauseof this ongoing conversation, teacher evaluations arenot just a score at a particular point in time, nor arethe results a surprise for teachers who are engaged inongoing conversations about their performance.

Mastery founder Scott Gordon views this largerecosystem focus on instructional quality as critical tothe evaluation system’s success. “Without the clearstandards, focus on professional development, andlinear link between all the pieces,” Gordon said, “Idon’t think we’d get that big a bang out of it.” Gor-don worries that state and district policies that createteacher evaluations without putting in place thislarger ecosystem and culture will wind up disap-pointing their supporters. Gordon also notes thatcreating an evaluation system makes conversationswith staff about their performance even more impor-tant. Building principals’ capacity to engage teachersin these honest conversations and give them produc-tive feedback has been one of the most difficult—and critical—tasks in implementing Mastery’s teacherevaluation model.

Mastery’s experience and current high perform-ance demonstrate the benefits of an effective and well-designed teacher evaluation system: rewarding andretaining high performers, improving weaker ones,exiting people who do not improve, and building aperformance culture at the school level. But obtainingthese benefits takes cumulative work in implementingand refining the system and shaping a culture that val-ues performance.

Mastery Charter Schools

used in federal law to refer to school districts, but charterschools in many states are also their own LEAs, so thislanguage appears to suggest that they must also adoptevaluation systems consistent with state guidelines. In theDistrict of Columbia, for example, the Office of the StateSuperintendent of Education (OSSE) prepared an ESEAwaiver request stating that all LEAs in the district wouldadopt teacher evaluations meeting OSSE-definedguidelines—even though 95 percent of the LEAs in DCare charter schools, and DC’s charter school law givesOSSE no authority to mandate that they adopt such sys-tems. In Florida, teachers’ unions decried a new evalua-tion law as an assault on due process. But, because the lawapplied to charter as well as traditional district schools, it

actually subjected charters to more complicated evaluationand due process requirements than before.

The problem with these requirements is not simplythat they interfere with charter autonomy. They couldalso jeopardize sophisticated human capital systemsthat some high-performing charter schools (such as theAchievement First and Mastery charter networks profiledin each of the sidebars) have developed using their auton-omy over personnel. These systems, which predate thecurrent push for new systems of teacher evaluation, inte-grate teacher evaluation, development, and performance-based compensation with the school’s instructionalphilosophy and culture, in ways that may not fit into staterequirements for more uniform evaluation systems.

Much of the current rhetoric and policy surroundingteacher evaluation has, as its subtext, a desire on the partof reformers and policymakers to “people proof” or “poli-tics proof” teacher evaluation. Value-added measures andspecific observation rubrics are cast as a way to reduce thedanger of subjective administrator judgments in evaluationsand base them instead on objective, “scientific” criteria.The impulse to “people proof” schooling, to establish sys-tems, protocols, and tools that eliminate room for humanjudgment and error, is a longstanding impulse in education—and one that often creates more problems than it solves.

Indeed, many of the systems that proponents of new

teacher evaluation systems are seeking to replace—suchas the “steps and lanes” teacher salary schedule—werecreated in part to “people proof” decisions or protectteachers from administrator bias. Our culture tends toyearn for scientific solutions that allow us to make deci-sions based on purely objective information withoutdemanding human judgment. But, in fact, “people proof”decisions are often stupid decisions, and it is impossible toeffectively evaluate teacher performance without a signifi-cant role for human judgment.

New evaluation systems, particularly those includingstudent growth data, are being treated as a way to getaround a specific management problem: because the culturein K–12 education traditionally discourages candid conver-sation about performance, school leaders have failed to takeresponsibility for making difficult personnel decisions. Well-designed teacher evaluation systems can help change this bygiving principals a tool to enable these conversations, mak-ing them more accountable for having them, and reducingsome current barriers to dismissing low performers.

But evaluations alone cannot transform a culture thatis so resistant to frank discussions of performance. “Edu-cators are not trained and have no context for supervisoryconversation of that nature,” says Mastery CharterSchools founder Scott Gordon, noting that his schools’implementation of evaluation systems has required inten-sive training of school-level leaders, not only to use thesystem and tools consistently—the focus of much of thetraining associated with current state teacher evaluationefforts—but also to provide constructive feedback andhave meaningful discussions with teachers about theirperformance. Changing the culture and building capacityfor such feedback and conversations become particularlyimportant when we consider that the purpose of newevaluation systems is not just to dismiss low-performingteachers, but to help all teachers improve their profes-sional practice and performance.

Political Dynamics Create Incentives for Greater Con-trol. Policymakers and advocates who favor new systemsof teacher evaluation recognize that they are operating ina unique political moment, in which both a Democraticpresidential administration and Republican governorsand legislative majorities in many states favor majorchanges to current teacher evaluation policies. Further-more, unusual circumstances such as the RTT contestand state demands for relief from ESEA requirementshave provided this Democratic administration withunprecedented leverage to drive state policy changes,even in states with reluctant leadership.

Special Report 2

11

Teacher Quality 2.0

It is impossible to effectively evaluate

teacher performance without a

significant role for human judgment.

Special Report 2

12

Teacher Quality 2.0

This unique—and clearly limited—political oppor-tunity creates an incentive for champions of teacher certification overhauls to lean toward control over flex-ibility in designing new policies in order to lock in asmuch of their agenda as possible before the windowcloses. This means seeking to mandate now many keydesign features of teacher evaluation systems—see, forexample, the detailed provisions of Florida’s Senate Bill

736—rather than legislating broad parameters that leavespace for district variation. It also means a tendency toput key provisions and requirements in legislation ratherthan regulation—even when regulation may be a bettervehicle for addressing many complex issues involved inteacher evaluation—because legislation is more difficultto change down the road.

While the political incentives are obvious, there

Achievement First operates twenty high-performingcharter schools in New York and Connecticut. Unlikethe current wave of state activity on teacher evaluation,which has emphasized the need for evaluations to sup-port dismissal of underperforming teachers, AchievementFirst’s teacher evaluation system grew out of a desire todevelop and recognize the most effective teachers.

The Achievement First network has been aroundfor fourteen years and has grown rapidly in the pastdecade. To support this growth, Achievement Firstmade significant investments in developing existing staffmembers to lead new schools. But effective teacherswho had been in Achievement First classrooms for sev-eral years also wanted pathways to grow and develop asprofessionals without becoming principals or schoolleaders. Achievement First developed its evaluationsystem as part of an effort to provide these teachers withsuch opportunities. Achievement First’s teacher evalua-tion system is based on four elements:

• Student achievement, which Achievement Firstbelieves should be a critical component of evalua-tions for all teachers. For teachers in grades andsubjects that take the state test, it uses its ownvalue-added model. For teachers in nontestedgrades and subjects, it uses end-of-course assess-ments to measure student learning. Studentachievement accounts for a larger percentage ofteachers’ evaluations in tested grades and subjects,reflecting differences in the reliability and validityof the different types of assessments.

• Student character development, a key part ofAchievement First’s mission, is also a component ofevery teacher’s evaluation. It is measured using par-ent and student surveys and accounts for 15 per-cent of a teacher’s evaluation.

• Planning and instruction are measured throughclassroom observations conducted at multiplepoints throughout the year using a specificrubric developed by Achievement First. Mostobservations are conducted by building-leveladministrators, in some cases supplemented byother Achievement First network staff from out-side the school. A comprehensive rating from theteacher’s instructional coach is also incorporatedinto the overall planning and instruction rating.Planning and instruction counts for a higherpercentage of the evaluation for teachers in non-tested grades and subjects.

• Core values and contribution to the team aremeasured through principal assessment and apeer survey and count for 15 percent of ateacher’s evaluation.

Achievement First’s experience offers three poten-tial lessons for broader efforts to implement teacherevaluation systems. First, because Achievement Firstdesigned its evaluation system as part of a broadereffort to develop and recognize effective teachers, everyaspect of the system is designed to support this goal.Second, Achievement First was honest with teachersabout the challenges of developing an evaluationsystem and engaged them throughout its development.Finally, because Achievement First’s emphasis has beenon developing master teachers, it has designed its eval-uation rubric to define a high standard of instructionalexcellence, rather than a minimum threshold. As aresult, average teacher scores on the rubric have beenonly middling—even though Achievement First is anextremely high-performing network. This ensures thatthe system can support ongoing development andimprovement even for high-performing teachers.

Achievement First Charter Schools

is a real danger that, in seeking to lock in the most “rigorous” standards for teacher evaluation systems, policymakers and advocates may simply be locking inthe current state-of-the-art knowledge regarding teacherevaluation. Policymakers need to take action, but theyalso need to be willing to evolve—to learn from theresults of the policies they put in place and to makeadjustments as results demand. Legislating key designelements in an effort to protect them can also restrictthe ability of future policymakers to adapt policies inresponse to the new knowledge that will inevitablyemerge as teacher evaluation systems are implementedand studied, or as our education system itself evolves inother key respects.

Evaluation in an Evolving System

Teacher evaluation policies do not exist in a vacuum.Other forces are currently at work transforming our edu-cation system in ways that have real implications for thedesign and use of teacher evaluation—forces that couldcreate real challenges if policymakers do not take thoseimplications into account.

The challenge is broader than just charter schoolsand a few alternative schools. Technological innovation isalready beginning to reshape our understanding of whatschooling looks like—and with it, the work teachers do.Blended learning models, such as Rocketship, CarpeDiem, and School of One, are leveraging technology todeliver education in new ways that have the potential toincrease educational productivity and personalization tomeet individual student needs. Blended learning modelsvary in design, approach, costs, and the extent to whichtechnology supplants in-person teachers or enables teach-ers to be deployed in new ways. These models fundamen-tally change the way teachers do their jobs; teachers spendless time on traditional lecture and practice, and moretime working with students one-on-one or in smallgroups, analyzing data to diagnose student needs, andcrafting instructional experiences to meet them.

Student groupings in these models are more flexibleand fluid, and students receive instruction and tutoringfrom a variety of teachers, as well as from technology-based modalities. As a result, it can be difficult or impos-sible in some of these models to attribute student learninggains in a particular subject to a particular teacher. Thiscreates complications for teacher evaluation systems thatrely on linking teachers to their students’ academic results.Existing observational rubrics—such as those included in

the Bill & Melinda Gates Foundation-funded Measures ofEffective Teaching project and being adopted by statesand districts—were not designed for the more personal-ized modalities in which blended learning educators oftendeliver instruction.

Nor is blended learning the only force increasing thenumber of teachers whose students do not have state testscores. Over the past decade, states and school districtshave dramatically expanded the number of children theyserve in early childhood and pre–K programs—and thelong-term trend of expansion in publicly funded pre-school is likely to continue once states and districtsrebound from the recent recession. Early childhood stu-dents also do not take state assessments, and because ofthe low adult-to-child ratios necessary in early childhoodclassrooms, preschool teachers often work in coteachingsettings that make it difficult to attribute students’progress to an individual teacher.

In other words, the expansion of blended learningand early childhood programs is likely to dramaticallyincrease the number of teachers who pose the most diffi-culty for new teacher evaluation systems—those whosework does not directly map onto the state assessment per-formance of a specific, identifiable group. The greatestchallenge facing most states currently seeking to imple-ment new teacher evaluation systems is determining howto evaluate student achievement gains for teachers in non-tested grades and subjects—those not covered by NCLB’smandate to annually assess every student in grades threethrough eight in math and English language arts. Theexperience of Tennessee, winner of $500 million in RTT’sfirst round and the birthplace of value-added measures, isa good example.

Tennessee has struggled to identify appropriate meas-ures of student growth for teachers in nontested gradesand subjects—over half of Tennessee’s teachers. In the ini-tial stages of implementation of their new system, theseteachers’ impact on student achievement will be scoredusing schoolwide value-added data in math and reading—a move Tennessee teachers and outside observers havecriticized. Further, some educators whose responsibilitiesdo not typically include classroom instruction—such asmedia specialists—may be required to teach mock classesto receive ratings on required observational measures.21

Delaware, which received $120 million during the firstround of RTT, has also faced challenges implementingkey components of its planned statewide teacher evalua-tion system.22 Like Tennessee, Delaware has struggled toidentify “student growth measures” for grades and sub-jects not subject to state testing, and as a result delayed

Special Report 2

13

Teacher Quality 2.0

Special Report 2

14

Teacher Quality 2.0

teacher and leader evaluation systems based on thosemeasures for a year.

Just because blended learning models decouple thelink between an individual student’s progress and an indi-vidual teacher, this does not mean that blended learningteachers cannot be held accountable for their impact onstudent learning. Teachers at the Florida Virtual School,for example, do not get paid if their online students donot complete their courses. Indeed, because blended learn-ing teachers work collaboratively in teams with shiftinggroups of students and constantly collect and assess dataon students’ progress, there is more transparency for theirwork than for more traditional teachers. School of Oneuses open classes, so there is much more ongoing observa-tion and engagement between classrooms and educators.The types of data analysis and collaborative teaching thatoccur in blended learning situations actually open thedoor for entirely new forms of teacher performance evalu-ation that are still tightly linked to impact on students butlook very different from the 2.0 models that federal andstate policymakers are currently mandating.

Unless state policies provide for additional flexibilityaround teacher evaluation in blended learning and otherinnovative approaches to schooling, we will miss out onthe opportunity to develop these new forms of evalua-tion. More troubling, teacher evaluation requirementscould actually become a barrier to the expansion ofblended learning models that delink student learningfrom individual teachers, or to the development of newmodels combining in-person teaching with technologyand other delivery mechanisms to personalize studentlearning experiences. Charter school authorizers may beunwilling to approve schools using new blended modelsif those schools cannot explain how they will complywith state teacher evaluation laws.

Trade-Offs in the Use of Evaluationsand Value-Added Data

Proponents of new teacher evaluation systems haveemphasized a set of specific design principles—using stu-dent achievement data as a major factor, including class-room observations as a measure, and using evaluationdata to inform key personnel decisions—to define whatquality evaluations must look like. There is a reasonablebasis for each of these design features, but proponents ofTeacher Evaluation 2.0 have too often succumbed to thetemptation to oversell these systems, rather than acknowl-edging the complexity involved and the broader reality

that progress means moving forward with imperfect mod-els and refining them over time. Several key trade-offsdeserve greater attention than they have received in thecurrent debate. These include trade-offs between “deselec-tion” and professional improvement as strategies toimprove teaching quality, trade-offs among potential eval-uators, trade-offs between individual and organizationalaccountability, and trade-offs between teacher evaluation

and other potentially potent uses of value-added data toimprove student learning.

Deselection or Professional Improvement? The movetoward new teacher evaluation systems is rooted in thebelief that better evaluations are needed to identify under-performing teachers and remove them from classrooms.As Teacher Evaluation 2.0 has gained political traction,however, proponents—including Secretary Duncan andPresident Obama—have begun to emphasize that newteacher evaluation systems are not just about firing teach-ers but should also provide teachers with useful feedbackto help them improve their performance. This politicallynecessary rhetoric has the added benefit of being true.Because current value-added measures are most reliable inidentifying only those teachers at the high and low end ofthe spectrum, with the vast majority falling in the middle,dramatically improving teacher quality through evaluationwill require using those evaluations to help the majority ofteachers who do not fall at those extremes to improve.

But the design of states’ and districts’ 2.0 evaluationsystems do not necessarily back up this rhetoric. For bet-ter or worse, the design focus in most 2.0 evaluation sys-tems has been on building systems that are sufficientlyfair, valid, and reliable to hold up in court as a basis fordismissing low-performing teachers. State efforts todesign new teacher evaluation systems have focused muchmore attention on what happens to teachers at the bot-tom of the spectrum than those in the middle or at thetop. For example, despite the rhetoric of teacher develop-ment, several states’ new teacher evaluation laws requirethe creation of a professional development plan only for

Evaluations alone cannot transform

a culture that is so resistant to frank

discussions of performance.

low-performing teachers, and primarily as a way of givingthese teachers an opportunity to improve prior to dis-missing them. Similarly, current design efforts have notemphasized features that are critical to ensuring that eval-uations really help teachers improve. Useful feedbackcomes not just in the number generated by a particularevaluation measure (whether student achievement data oran observation rubric), but from the conversations thatoccur between a teacher and a principal, observer, orexpert coach based on that data. Effective feedbackrequires evaluators to engage in an analytic conversationwith the teacher about his or her performance that identi-fies strengths to build on, areas for improvement, andcritical action steps to accomplish both. Evaluation sys-tems need to be designed to facilitate and provide timefor such conversations. While state evaluation systems aredevoting attention to training evaluators in specific obser-vation instruments, most are not devoting the same timeto building their capacity to give meaningful feedback orengage in productive conversations about performance.

Who Evaluates? There are also trade-offs in decidingwho should be responsible for teacher observations andevaluations. One school of thought holds that evaluat-ing, developing, and ensuring the quality of teaching in aschool is the principal’s job, and therefore principalsshould have the primary responsibility for teacher obser-vation and evaluations. But this raises real issues of prin-cipal capacity and time, particularly to link evaluationwith ongoing professional development, which is usuallydelivered by coaches, master teachers, and others in theschool—not by the principal. Others argue that to elimi-nate potential principal biases and ensure independenceand reliability, independent evaluators from outside theschool should conduct the evaluations, either alone orin tandem with principal reviews—and some states areincluding language in teacher evaluation legislation torequire this. Some also argue that evaluations will bemost useful for teacher development if those conductingthe evaluation are also responsible for supporting anddeveloping teachers on an ongoing basis, such as instruc-tional coaches and master teachers. But some critics fearthat evaluators who are also engaged in providing profes-sional development may be biased in their evaluations ormay be more likely to go easy on underperformingteachers. (In fact, research suggests the opposite—thatcoaches and mentor teachers appear to be harder in theirevaluations of teachers than principals are.)23 Each ofthese approaches represents differing philosophies aboutthe ultimate aims of teacher evaluation, the incentives

and behavior of different players in the system, and howto weigh trade-offs between independence, reliability,and usefulness for professional development—but thesetrade-offs are not always explicitly discussed in publicdebates about teacher evaluation systems.

Experience from high-quality evaluation models(such as those described in the appendices) suggests thateffective teacher evaluation systems need to be integratedinto a broader system of teacher development, support,and advancement that includes a clear definition orframework for what effective instruction looks like; inten-tional supports for developing teaching practices againstthat framework; ongoing conversations among teachers,peers, supervisors, and coaches or master teachers aboutprofessional performance, data, and improvement; andperformance-based compensation. In these models, allindividuals in the organization share a common organiza-tional vision about effective instruction and are account-

able for both individual and organizational progresstoward aligned goals. Most current state and districtefforts do not include all these components.

Individual versus Organizational Accountability.Measures for which teachers are held accountable mustbe aligned to those for which the larger school or organi-zation is accountable. If teacher and organizational meas-ures are not clearly aligned, evaluation systems may notdrive teacher performance toward organizational goals.The current state ESEA flexibility waiver process raisesconcerns in this vein. Some states that have applied foror received ESEA waivers do not currently include value-added or growth measures of student performance intheir accountability system for schools, but they are pro-posing to create such value-added measures to meet the

Special Report 2

15

Teacher Quality 2.0

State efforts to design new teacher

evaluation systems have focused

much more attention on what

happens to teachers at the bottom

of the spectrum than those in

the middle or at the top.

waiver requirements related to teacher evaluation. Geor-gia, for example, which recently received a waiver, pro-poses to measure school performance based on studentproficiency and graduation rates, rather than growth.Some states are also proposing to integrate new value-added or growth data into revised school accountabilitysystems—but others are not. This is troubling because itcreates the potential for evaluation metrics and systemsfor teachers that are disconnected from those forschools and school districts. Moreover, it is a completelybackwards way of adopting the use of growth or value-added measures for accountability purposes. Becausevalue-added or growth measures for individual teachersare much less robust and more subject to error thanthose for schools—and also, in the proposed evaluationsystems, often carry higher stakes—it makes little senseto implement them before, or even at the same time as,value-added or growth accountability for schools. Infact, it could undermine teacher and public trust inthese systems.

Colorado’s experience is illustrative here. When thestate first developed its well-regarded Colorado growthmodel, it originally used the data only for public informa-tion purposes. After familiarizing citizens and educatorswith the measure and the information it produced, andidentifying and working out kinks in the tool, Coloradobegan to use growth data as part of its school accountabil-ity system in 2009. Only after the growth data had beenin place as part of the accountability system for schoolswas it extended to individual teacher evaluation andaccountability. Because the need to include value-addedor growth data in teacher evaluations is driving efforts todevelop these measures in many states, they are not tak-ing the time to roll out these measures in a deliberate waythat builds public and educator understanding, trust, andacceptance of them.

Evaluation Is Not the Only Use for Value-Added Data.A focus on value-added measures for teacher evaluationhas also distracted attention from other potentially valu-able uses of value-added data to improve student per-formance—some of which may interact with teacherevaluation systems in ways that complicate their results.For example, schools and teachers can use value-addeddata to analyze individual students’ learning trajectoriesand target aid or interventions accordingly. Value-addeddata can also help schools and districts strategically assignstudents to teachers in ways that match teacher strengthswith student needs.24

Such strategies have the potential to improve student

learning outcomes, but they could also impact the valid-ity of value-added measures of teacher performance, ifstudents are assigned to teachers on a nonrandom basisthat takes into account teachers’ and students’ past value-added data. Other strategies to benefit studentscould also impact evaluations. For example, several stateshave passed or are considering policies that prohibit dis-tricts from assigning the same child to an ineffectiveteacher for two or more consecutive years. These policiesmake a great deal of sense, given the data on the cumula-tive impacts of ineffective teachers on student perform-ance, but they would also affect assignment of students toteachers in ways that could impact evaluation results—particularly if more effective teachers received an influx ofstudents who had been taught by ineffective teachers inthe previous year.

Obviously, student and teacher assignments neverhave been, and never will be, random—an issue that alsohas implications for evaluation systems. But we shouldnot allow concerns about the impact on evaluation sys-tems to prevent us from doing things that might benefitstudents. A recent Center for American Progress report,for example, argued that one reason states should notprovide raw value-added data to journalists is that parentsmight use this data to advocate for more effective teachersfor their students, and the impact on student and teacherassignments might undermine the validity of the teacherevaluation system.25 This seems like a backwards argument.

Ultimately, value-added and evaluation data can beused in a variety of ways to improve both teacher effec-tiveness and student achievement, beyond the uses thattend to feature prominently in most policy debates aboutteacher evaluations. Failure to consider those uses and thetrade-offs involved in evaluation design could result inmissed opportunities to improve student learning.

Evaluating Teachers as Professionals?

Efforts to establish new teacher evaluation systems areoften accompanied by heated disputes about their impacton the standing of the teaching profession. Advocatesargue that holding teachers responsible for their perform-ance will bring teaching more in line with norms in otherfields. Furthermore, they maintain that new evaluationswill help raise the status of the profession by encouragingdismissal of poor teachers who give teaching a bad nameand by facilitating the implementation of performance-based compensation programs that increase salaries forthe most effective teachers. Critics of new evaluation sys-

Special Report 2

16

Teacher Quality 2.0

tems argue that mechanistic teacher evaluations based ontest scores and rubrics are demeaning and demoralizingand neglect the nuanced art that goes into teachers’ work.Both sides have a point.

New teacher evaluation systems should be under-stood as part of a larger effort to move attitudes towardhuman capital in education away from an industrial-eramodel that treats all workers as interchangeable parts withlittle differentiation of pay, status, or responsibility, andtoward a more performance- and talent-sensitive orienta-tion along the lines of law, medicine, and other profes-sions. The title of The Widget Effect alludes to this.Gordon, Kane, and Staiger also argue that increasedaccountability for student achievement will ultimatelyimprove the status of teaching and attract more skilledpeople to the field.26

But some features of 2.0 teacher evaluation systemsand policies are in fact very different from evaluationnorms in other professions. Most professional fields,including business, medicine, and law, rely on a combina-tion of data and managerial judgment when making eval-uation and personnel decisions, and subsequently holdmanagers accountable for those decisions. Methods thatlack human judgment and discretion are rare. Indeed, farfrom eliminating subjective feedback from personnel eval-uation, many firms have moved to adopt “360 degree”feedback mechanisms in which employees receive feed-back from managers, direct reports, peers, and clients.Others combine objective quantitative measures—such asdollars or clients brought into the firm—with more sub-jective or soft performance indicators. 2.0 policies, incontrast, have sought to minimize the role of managerialjudgment through the use of “objective” value-addeddata, common rubrics, and third-party evaluators. Ratherthan building the capacity of managers and educators toprovide meaningful development feedback to their directreports and peers, these systems seek to minimize theextent to which they are expected to do so.

There are clear tensions associated with usinghuman judgment in teacher evaluations. As is often thecase in education, politics and mistrust exacerbate thechallenge of designing evaluation systems. The currentemphasis on “objective” data and rubrics, as well asthird-party evaluators, is in large part a response to theopposition of teachers’ unions and teachers’ associationsto past evaluation and performance pay initiatives thatwere viewed as giving principals too much managerialdiscretion. Obviously no one wants a system in whichemployees are at the whim of arbitrary or biased mana-gerial judgments. But some labor concerns reflect a pre-

union, pre–civil rights era when employees lacked manyof the protections against discrimination and arbitrarydismissal that exist today—and that will continue toexist regardless of changes in evaluation systems or policies.

Moreover, the best protection against arbitrary or biasedmanagerial judgment is not to eliminate that judgmentaltogether, but to ensure that the managers themselvesare also held accountable for performance.

In designing value-added systems, policymakersshould consider whether the design elements they areputting in place move education away from or towardprofessional norms in other similar fields. For example,policy and design decisions are currently driven inlarge part by concerns about whether or not evaluationsystems can withstand legal challenges to dismissal ofthe worst teachers—a legitimate focus in systems thatconvey a property interest in continued employmenton tenured teachers. But the result is that design ele-ments that support professional improvement canbecome secondary.

Human judgment is critical both to making smartassessments of teacher performance and to using thoseassessments in ways that improve instruction for students.Given the limitations of our current value-added andobservational rubric tools, “people proof” evaluation sys-tems will almost certainly result in nonnegligible numbersof both Type I (false positive) and Type II (false negative)errors. That is not an argument against improving evalua-tion systems through the use of these tools—our currentsystem, in which more than 99 percent of teachers arerated satisfactory, yields few false positives but likely manyfalse negatives—but it is an argument for providing spacefor human judgment to intervene when such errors seemobvious, along with the right incentives and capacitybuilding to encourage sound managerial judgment. Bythe same token, building evaluation and professional

Special Report 2

17

Teacher Quality 2.0

Teaching is ultimately a people

business, and getting improvements

in teacher effectiveness will require

human judgment and interaction—

processes cannot do it alone!

growth systems that truly develop teachers as profession-als is impossible without providing space for humanjudgment and feedback apart from purely “objective” andimpersonal mechanisms.

Policy Implications

Because there are real trade-offs and tensions in thedesign and use of teacher evaluation systems, we need sys-tems that are flexible and able to adapt to particular cir-cumstances and changes over time. It is also critical topublicly and transparently engage policymakers, educa-tors, and the broader public in conversations about trade-offs, rather than glossing over them. Such conversationscan provide an opportunity for input about the prioritiespolicymakers should take into account in making trade-offs related to teacher evaluation systems, and in theprocess can build public and educator understanding andtrust in those systems. There are several key implicationsfor policymakers to consider:

Be Clear about the Pain Points. Policymakers need to beclear about the problems teacher evaluation systems areintended to solve. Right now, teacher evaluations are toooften marketed as an educational wonder drug, without aclear theory of action about how evaluation results willtranslate into improved teaching or the other system ele-ments necessary to foster effective teaching. Policymakersmust be clear about the problems they are trying toaddress, their goals, and their theory of action—and theymust make design choices and trade-offs that reflect thattheory of action.

Do Not Treat Processes and Systems as a Substitutefor Cultural Change. Policymakers are relying heavily onnew teacher evaluation systems and the processes theymandate to improve teacher effectiveness. But teaching isultimately a people business, and getting improvementsin teacher effectiveness, whether through professionalimprovement or deselection, will require human judg-ment and interaction—processes cannot do it alone!That, in turn, requires deep cultural change within theUS educational system. Teacher evaluation systems andprocesses can help facilitate this cultural shift, by settingclear expectations, creating language and venues within aschool to talk about performance, and empowering lead-ers to confront or dismiss low performers. But these sys-tems cannot carry all the work on their own. Otherelements of the education system, such as meaningful

outcomes accountability, competitive pressure, and newapproaches to training are also needed to foster a culturethat takes performance seriously.

Look at the Entire Ecosystem. Some advocates and policy-makers talk about teacher evaluation systems as if they arebeing implemented in a vacuum, without consideringhow they relate to other elements of the education eco-system. How will reforms that tie personnel decisions toevaluations based on student achievement affect the labormarket for teachers? What changes in principals’ pre- andin-service training are needed to develop their capacity aseffective evaluators? How should teacher preparationrequirements change to align better with new evaluationsor standards? Policymakers must pay attention to theentire teacher labor market ecosystem, not only to thepoint at which high-stakes evaluations take place or haveprofessional consequences for teachers.

Similarly, policy changes are increasing high-stakesindividual accountability for teachers even as they in somecases reduce school-level accountability, and the directionof new state accountability systems under ESEA waiverrequests is not always well aligned with states’ educatoraccountability proposals. This is unwise. Strong school-levelaccountability is essential to effective teacher evaluationsand must be aligned with them. Because well-designedand aligned school- and system-level accountability createthe right incentives and pressures for school-level leaders,they can reduce the need for excessive state control onteacher evaluations and create space for more judgment.

Focus on Improvement, Not Just Deselection. Researchshows that there is a subset of teachers who adverselyaffect student learning and are less effective than the aver-age first-year teacher—which means that deselecting theseteachers and replacing them with a random new teacherwould more likely than not improve student achieve-ment. But these teachers are only a fraction of the totalworkforce, and deselecting them is not enough to generatethe level of improvement needed. Evaluation systemsbased solely on creating legally robust methods forremoving low performers are insufficient. Excessiveefforts to decouple evaluation from human judgment andcreate standards that can withstand legal scrutiny moveeducation away from, rather than toward, the professionalnorms that guide similar fields.

Several key policy implications arise here: First, poli-cymakers must be honest about the need for teacher eval-uation to include a component of professional judgmentand should design systems accordingly. Second, principals

Special Report 2

18

Teacher Quality 2.0

and other evaluators need training not only in the techni-cal aspects of evaluation systems, but also in how to pro-vide effective feedback and engage in honest conversationsabout performance to help teachers improve. They alsoneed adequate time to do so. Related to this, policymakersshould carefully consider trade-offs in selecting evaluatorsand should not discount the impact of the choice on theability to support professional growth and development.Finally, evaluation components must be aligned with oneanother so teachers do not receive mixed messages aboutwhat they need to do to improve.

Encourage and Respect Innovation. Education has along history of “one best system” thinking that leads topolicies that freeze in amber the current state of the art,which curtails innovation and renders the system unableto adapt to changing needs and challenges. It would be amistake to bring this one best system thinking to teacherevaluations. A better model to emulate might be theapproach taken by law or other professional services firmswhere there are some clear commonalities in operations,norms, and performance expectations, but evaluation andaccountability metrics reflect diversity in what is valuedmost. Similarly, policymakers must ensure that evaluationrequirements do not curtail existing autonomies. Charterschools, for instance, should be held accountable to theirauthorizers for their outcomes, rather than bound by newevaluation requirements that curtail their autonomy inpersonnel matters.

Think Carefully about Waivers versus Umbrellas. Onepotential strategy to ensure that new teacher evaluationsystems do not become a barrier to innovation is toprovide broad waiver authority enabling innovativeproviders to gain an exemption to teacher evaluationrequirements if they can demonstrate strong perform-ance or well-designed alternative teacher evaluation anddevelopment mechanisms. Relatively few states havebuilt such well-designed waivers into their teacher effec-tiveness legislation, and those that have not would bewell advised to do so. At the same time, policymakersneed to think carefully about whether it is possible, andperhaps smarter in the long run, to design teacher evalu-ation policies with a broad umbrella that can cover bothtraditional and innovative models. The answer to thisquestion likely depends on one’s assumptions about thepotential scale of blended learning and other innovativemodels and the speed at which they will grow, as well ashow teaching arrangements in these models are likely todiffer from traditional schools. But it is worth taking

some time to explicitly consider these questions, and policymakers’ hypotheses about them, when makingdecisions about teacher evaluation policies.

Do Not Send Legislation to Do a Regulation’s Job. Ina rush to institutionalize, reformers have turned to laws toprotect reforms. Laws are obviously more durable thanregulations, and the legislative process can—although notalways—build greater buy-in from stakeholders. But leg-islation can also lock in policies that should be tweakedor even overhauled. Because the old model of evaluationwas so ubiquitous and we are only beginning to experi-ment with alternative models, there are many things wedo not know, and implementation of different models islikely to yield considerable learning about what does anddoes not work in different contexts. What can and shouldbe handled in legislation versus regulation varies withstate context, but in general, policymakers should try toavoid locking in legislation components of evaluation sys-tems for which implementation is likely to provideimportant lessons about how to do things better.

Create Innovation Zones for Pilots—and Fund Them.Within the overall context of federal and state policy,there must be room for school districts, consortia ofschool districts, or even entire states to try newapproaches. Federal dollars from Title II of the ESEA andthe Teacher Incentive Fund can support innovative proj-ects to try alternative teacher evaluation methods thatcombine quantitative and qualitative methods or aredesigned specifically for new education delivery models.In keeping with its role in fostering research and innova-tion, the federal government should not simply requirestates to establish new teacher evaluation policies, butshould simultaneously provide waivers and investmentsto support state, local, and charter school innovationsthat meet certain standards for rigor and protection ofcivil rights. As part of its fiscal year 2013 budget request,the Obama administration proposed a $5 billion compe-tition for states to work with colleges of education, teach-ers’ unions, and other stakeholders to reform the teachingprofession. An evaluation innovation pilot fits squarelywith the goals of that initiative.

Conclusion

Public education in the United States has for too longlacked a performance mindset or a strategic orientationtoward developing human capital. The current move

Special Report 2

19

Teacher Quality 2.0

toward new teacher evaluation systems represents signifi-cant progress to correct these shortcomings. That said, ifadvocates of 2.0 teacher evaluation rush too quickly tocreate new systems or do so without appropriate humilityabout what we do and do not know, there is a risk thatthey will end up replacing old broken systems with newones that, while better, are equally inflexible or createbarriers to innovation and reform. In other words, thenation’s teacher evaluation spree could turn into a bigheadache. The best way to mitigate this risk is not toignore it or brush it under the rug, but to be honest andtransparent about the trade-offs and tensions, the realitythat new systems will not be perfect, and the need tolearn as we move forward.

Notes1. Andrew Rotherham and Sara Mead, “Back to the Future: The History and

Politics of State Teacher Licensure and Certification,” in A Qualified Teacher inEvery Classroom?, ed. Frederick M. Hess, Andrew Rotherham, and Kate Walsh(Cambridge, MA: Harvard Education Press, 2004), 11–48.

2. Sandra L. Stotsky, “Revising Teacher Licensing Regulations to AdvanceEducation Reform in Massachusetts,” in Education Success Stories (Amherst:National Evaluation Systems, Inc., 2001), www.pearsonassessments.com/hai/images/NES_Publications/2001_14Stotsky_466_1.pdf (accessed August 1, 2012).

3. Emma Smith and Stephen Gorard, “Improving Teacher Quality:Lessons from America’s No Child Left Behind,” Cambridge Journal of Edu-cation 37, no. 2 (2007): 191–206, www.tandfonline.com/doi/abs/10.1080/03057640701372426 (accessed August 1, 2012); and Robert Rothman andPatte Barth, “Does Highly Qualified Mean Highly Effective?” Center forPublic Education, 2009, www.centerforpubliceducation.org/Main-Menu/Staffingstudents/How-good-are-your-teachers-Trying-to-define-teacher-quality/Does-highly-qualified-mean-highly-effective.html (accessed August 1, 2012).

4. Dan D. Goldhaber, “The Mystery of Good Teaching,” Education Next 2,no. 1 (2002): 50–55.

5. Charles Clotfelter, Helen Ladd, and Jacob Vigdor, “Teacher Credentialsand Student Achievement: Longitudinal Analysis with Student Fixed Effects,”Economics of Education Review 26, no. 6 (2007): 673–82; and Goldhaber, “TheMystery of Good Teaching.”

6. Jennifer Rice King, “The Impact of Teacher Experience: Examining theEvidence and Policy Implications,” CALDER Policy Brief, no. 11 (August2010), www.urban.org/uploadedpdf/1001455-impact-teacher-experience.pdf(accessed August 1, 2012).

7. Dan D. Goldhaber and Dominic J. Brewer, “Why Don’t Schools andTeachers Seem to Matter? Assessing the Impact of Unobservables on Educa-tional Productivity,” Journal of Human Resources 32, no. 3 (1997): 505–23.

8. June C. Rivers and William L. Sanders, Cumulative and Residual Effects ofTeachers on Future Student Academic Achievement (Knoxville: University of Ten-nessee Value-Added Research and Assessment Center, 1996), 9, http://heartland

.org/sites/all/modules/custom/heartland_migration/files/pdfs/3048.pdf (accessedAugust 1, 2012); and Kati Haycock, “Good Teaching Matters: How Well-Qualified Teachers Can Close the Achievement Gap,” Education Trust, 1998,www.take2theweb.com/pub/sso/eastlinton/images/Good_teaching_matters.pdf(accessed August 1, 2012).

9. William L. Sanders and Sandra P. Horn, “The Tennessee Value-AddedAssessment System (TVAAS): Mixed Model Methodology in EducationalAssessment,” Journal of Personnel Evaluation in Education 8 (1994): 299–311,www.redmond.k12.or.us/145410515152938173/lib/145410515152938173/The_Tennessee_Value-Added_Assessment_System-_Mixed-Model_Methodology_in_Ed_Assessment.pdf (accessed August 1, 2012).10. Kevin Carey, “The Real Value of Teachers: Using New Information about

Teacher Effectiveness to Close the Achievement Gap,” Thinking K–16 8, no. 1(2004): 3–42, www.edtrust.org/dc/publication/the-real-value-of-teachers-using-new-information-about-teacher-effectiveness-to-close (accessed August 1, 2012).11. Robert Gordon, Thomas Kane, and Douglas Staiger, “Identifying Effective

Teachers Using Performance on the Job,” Brookings Institution, 2006, www.brookings.edu/papers/2006/04education_gordon.aspx (accessed August 1, 2012).12. Daniel Weisberg et al., The Widget Effect: Our National Failure to

Acknowledge and Act on Differences in Teacher Effectiveness (Chicago: The NewTeacher Project, 2009), http://widgeteffect.org/ (accessed August 1, 2012), 6.13. Ibid.14. US Department of Education, “Race to the Top Program Executive Sum-

mary” (Washington, DC, November 2009), www2.ed.gov/programs/racetothetop/executive-summary.pdf (accessed August 1, 2012).15. Bellwether Education Partners staff, including two coauthors of this paper,

were involved in writing or advising state RTT applications.16. Learning Point Associates, “State Legislation: Emerging Trends Reflected

in State Phase 1 Race to the Top Applications,” June 2011, www.learningpt.org/pdfs/RttT_State_Legislation.pdf (accessed August 1, 2012).17. The New Teacher Project, Teacher Evaluation 2.0 (Brooklyn, NY, 2010),

tntp.org/files/Teacher-Evaluation-Oct10F.pdf (accessed August 1, 2012).18. Ron Tupa, Jocelyn Huber, and Barbara Martinez, “Built to Succeed?

Ranking New Statewide Teacher Evaluation Practices,” Democrats for EducationReform, October 17, 2011, http://www.dfer.org/Report%20-%20Evaluation%20Ratings%20DRAFT9.pdf (accessed August 1, 2012); National Councilon Teacher Quality, “State of the States: Trends and Early Lessons on TeacherEvaluation and Effectiveness Policies” (Washington, DC, 2011), www.nctq.org/p/publications/docs/nctq_stateOfTheStates.pdf (accessed August 1, 2012); andSara Mead, “Recent State Action on Teacher Effectiveness: What’s in StateLaws and Regulations?” Bellwether Education Partners, August 2012, http://bellwethereducation.org/ideas/publications (accessed August 1, 2012).19. US Department of Education, “ESEA Flexibility” (Washington, DC, Sep-

tember 23, 2011), www.ed.gov/esea/flexibility (accessed August 1, 2012).20. Ibid., 7.21. Michael Winerip, “In Tennessee, Following the Rules for Evaluation Off a

Cliff,” New York Times, November 6, 2011.22. US Department of Education, “Race to the Top Annual Performance

Report” (Washington, DC, 2012), www2.ed.gov/programs/racetothetop/annual-report.pdf (accessed August 1, 2012).23. Craig D. Jerald and Kristan Van Hook, More than Measurement: The TAP

System’s Lessons Learned for Designing Better Teacher Evaluation Systems (SantaMonica, CA: National Institute for Excellence in Teaching, 2011), www.tapsystem.org/publications/eval_lessons.pdf (accessed August 1, 2012). 24. Craig D. Jerald, “The Value of Value-Added Data,” Education Trust,

Special Report 2

20

Teacher Quality 2.0

TEACHER QUALITY 2.0

Documents

Transcript of TEACHER QUALITY 2.0