Download - Classroom Composition and Measured Teacher Performance ...static.politico.com/...of-classroom-observations-of...Classroom observations take a central role in these systems, accounting

Educational Evaluation and Policy AnalysisMonth 201X, Vol. XX, No. X, pp. 1 –25

DOI: 10.3102/0162373715616249© 2016 AERA. http://eepa.aera.net

Introduction

Despite the intense focus on the use of student test scores to gauge teacher performance, the majority of our nation’s teachers receive annual evaluation ratings based primarily on observa-tions of their classroom practice (Steinberg & Donaldson, in press). These observation-based performance measures aim to capture teachers’ instructional practice and their ability to structure and maintain high-functioning classroom envi-ronments. However, little is known about the ways that classroom context—the settings in which teachers work and the students that they teach—shapes measures of teacher effectiveness based on classroom observations. Given the widespread adoption of high-stakes evaluation systems that rely heavily on classroom observa-tions, and in light of the way stakeholders such as

principals may use these scores to inform hiring and retention decisions (Sartain & Steinberg, in press), it is critical that we have a clearer under-standing of how the composition of teachers’ classrooms influences their observation scores.

Interest in understanding the role of school context in teachers’ professional development has recently emerged. For example, Kraft and Papay (2014) find that school context—includ-ing supportive and collaborative environments, effective principal instructional leadership, and positive school culture—can facilitate greater returns to teacher experience or the extent to which teachers become more effective over time. Other studies have pointed to the important role that school composition plays in shaping teacher effectiveness. In one study, Loeb, Beteille, and Kalogrides (2012) found that teachers improve more rapidly in more effective schools—those

616249 EPAXXX10.3102/0162373715616249Steinberg and GarrettClassroom Composition and Teacher Observation Scoresresearch-article2016

Classroom Composition and Measured Teacher Performance: What Do Teacher Observation

Scores Really Measure?

Matthew P. Steinberg

University of PennsylvaniaRachel Garrett

American Institutes for Research

As states and districts implement more rigorous teacher evaluation systems, measures of teacher performance are increasingly being used to support instruction and inform retention decisions. Classroom observations take a central role in these systems, accounting for the majority of teacher ratings upon which accountability decisions are based. Using data from the Measures of Effective Teaching study, we explore the extent to which classroom composition influences measured teacher performance based on classroom observation scores. The context in which teachers work—most notably, the incoming academic performance of their students—plays a critical role in determining teachers’ measured performance. Furthermore, the intentional sorting of teachers to students has a significant influence on measured performance. Implications for high-stakes teacher accountability policies are discussed.

Keywords: classroom composition, classroom observation scores, teacher effectiveness, teacher evaluation

at UNIV OF PENNSYLVANIA on January 12, 2016http://eepa.aera.netDownloaded from

http://eepa.aera.net

Steinberg and Garrett

2

more able to raise the test-score performance of their students. In a second, Jackson and Bruegmann (2009) found that students of less-experienced teachers realize larger achievement gains when their teachers have higher perform-ing colleagues. Sass, Hannaway, Xu, Figlio, and Feng (2012) show that teachers improve more quickly in schools serving fewer low-income students.

Other work suggests that the characteristics of a teacher’s students play a meaningful role in determining teacher ratings. New evidence finds that classroom observation scores tend to be lower among teachers whose students are more disadvantaged and have lower incoming achieve-ment (Chaplin, Gill, Thompkins, & Miller, 2014; Whitehurst, Chingos, & Lindquist, 2014). These lower scores may reflect the nonrandom sorting of teachers to classes of students (Monk, 1987; Steinberg & Garrett, 2015) and the tendency of schools to assign novice, less-effective teachers to more disadvantaged students, while more experienced and more effective teachers are more likely to work with higher achieving stu-dents (Clotfelter, Ladd, & Vigdor, 2006; Kalogrides & Loeb, 2013; Kalogrides, Loeb, & Beteille, 2013; Monk, 1987).

Furthermore, evidence suggests that observa-tion-based measures of a teacher’s performance tend to be weakly correlated over time. Indeed, research using data from the Measures of Effective Teaching (MET) study finds that a teacher’s observation scores are only moderately correlated across 2 consecutive years (Garrett & Steinberg, 2015), on the order of the year-to-year, within-teacher correlation of student test-score-based measures of teacher performance found in other settings (Glazerman et al., 2010; Goldhaber & Hansen, 2010; McCaffrey, Sass, Lockwood, & Mihaly, 2009). These findings suggest not only that a teacher’s classroom prac-tice may vary over time but also that it may be influenced by classroom composition—the stu-dents and classes to which teachers are assigned.

As researchers and policymakers begin to bet-ter understand how different measures may differ-entially reflect distinct aspects of a teacher’s instructional performance, we need to further place these relationships into context by under-standing how the composition of the measures may be sensitive to the composition of the class

itself. Therefore, a closer investigation is war-ranted to more fully understand how classroom composition may influence observation-based measures of teacher performance, and whether these influences differentially affect teachers in different settings (e.g., math or reading subject areas, classroom teachers vs. subject specialists).

In this article, we focus on understanding the relationship between the characteristics of a teacher’s class and a teacher’s performance, as measured by observations of his or her classroom instruction and management. Given existing evi-dence that teachers who are judged to be more effective tend to be assigned higher achieving students, particular attention is paid to the rela-tionship between students’ incoming achieve-ment and measured teacher performance based on the Danielson Framework for Teaching (FFT) observation protocol. We leverage variation in student composition over time to estimate the influence on measured teacher effectiveness, allowing us to control for fixed characteristics of teachers (e.g., teacher quality endowments) and isolate the idiosyncratic influence of incoming student achievement. We further assess whether these relationships are more or less sensitive when considering specific aspects of classroom management and instruction and the extent to which the influence of incoming achievement differs for classroom generalists compared with subject-matter specialists. Moreover, we address the nonrandom sorting of teachers to classes that has been shown elsewhere to bias student achievement-based measures of teacher effec-tiveness (Rothstein, 2009, 2010). To do so, we take advantage of a hallmark of the MET study project—the randomization of teachers to classes that occurred just prior to the beginning of the MET study’s second year (i.e., 2010–2011 school year). Specifically, we exploit the randomization of teachers to classes within randomization blocks (i.e., grade-subject combinations of classes within a school) to generate experimental estimates of the effect of incoming achievement on measured teacher performance. Finally, we implement an instrumental variables (IV) strat-egy to formalize the nature of the bias in observa-tion scores that is due to the nonrandom sorting of teachers to classes. Then, we compare esti-mates generated by the teacher fixed-effects (FE) approach with ordinary least squares (OLS) and



Classroom Composition and Teacher Observation Scores

3

IV estimates to lend insight into whether the non-random matching of teachers to classes is a con-cern in the context of classroom observation-based measures of teacher performance.

We find that teacher performance, based on classroom observation, is significantly influ-enced by the context in which teachers work. In particular, students’ prior year (i.e., incoming) achievement positively affects a teacher’s mea-sured performance captured by the FFT. Students’ incoming achievement is more strongly associ-ated with observed English Language Art (ELA) instruction than math instruction. Although the influence of incoming achievement is concen-trated among subject specialists, who work with multiple classes of students within a single school day, compared with classroom generalist teachers who work with the same classroom of students all day, every day, we are unable to uniquely attribute these differences to either departmentalized (compared with generalist) class status or grade-level differences. Furthermore, incoming student achievement significantly influ-ences aspects of teacher performance that capture teachers’ interactions with their students, whereas core instructional practices are not prone to this influence. Finally, we offer evidence that the intentional, nonrandom matching of teachers to classrooms due to unobserved time-invariant teacher effects biases the relationship between incoming student achievement and observed classroom performance.

In the context of newly implemented teacher evaluation systems that incorporate multiple measures of teacher performance (including value-added measures [VAMs], student learning objectives [SLOs], and observation scores) into teachers’ summative evaluation ratings, recent evidence has indicated that principals’ human capital decisions around teacher effectiveness are based predominantly on their observations of teacher practice (Goldring et al., 2015). Moreover, some have argued that classroom observations provide a more transparent and objective means of evaluating and measuring teacher effectiveness and, in doing so, avoid many of the validity concerns that tend to be associated with VAMs (Goldring et al., 2015). However, our findings indicate that observation scores as measures of teacher effectiveness pres-ent similar validity concerns, as these scores

reflect both the systematic, nonrandom assign-ment of teachers to classes and the incoming achievement of a teacher’s students. Therefore, our results suggest that greater caution must be taken by policymakers and practitioners when making high-stakes personnel decisions based largely (if not entirely) on teachers’ classroom observation scores.

In the next section, we discuss the evolving landscape of teacher evaluation reform and the evidence on measures of teacher effectiveness being incorporated into new evaluation systems. Next, we specify a model relating teacher and classroom characteristics to observed measures of teacher performance. We then describe the data and empirical approach used to capture the influence of incoming student achievement on teacher performance. Finally, we present our findings and discuss their implications for high-stakes personnel decisions made in the context of newly developed and implemented teacher eval-uation systems.

Teacher Evaluation Reform Landscape

In the wake of the federal Race to the Top (RTTT) initiative, education policy efforts have recently led to dramatic revisions to teacher eval-uation systems and substantial effort toward improving measures of teacher effectiveness. At the state and local levels, policymakers are revis-ing and implementing new evaluation systems to incorporate more rigorous performance evalua-tions through the use of multiple measures of teacher performance—valued-added measures (VAM) and standards-based classroom observa-tions, among them—and multiple teacher ratings categories. By the start of the 2014–2015 school year, 78% of all states and 85% of the largest 25 school districts and the District of Columbia had revised and implemented teacher evaluation reforms (Steinberg & Donaldson, in press).

Revisions to teacher evaluation systems and the measures upon which teacher evaluation is based have been spurred by evidence in support of the critical role teachers play in improving student outcomes (Goldhaber, 2002; Rivkin, Hanushek, & Kain, 2005; Rockoff, 2004), while also acknowl-edging both the substantial within-school hetero-geneity in teacher effectiveness (Aaronson, Barrow, & Sander, 2007; Rivkin et al., 2005) and




4

the inability of traditional teacher-evaluation sys-tems to differentiate among teachers in their con-tribution to student learning (Weisberg, Sexton, Mulhern, & Keeling, 2009). Efforts to reform teacher evaluation are underscored by two funda-mental goals of personnel evaluation in education: to systematically support improvements in instruc-tional quality and to identify low-performing teachers for remediation or dismissal. The latter goal aims to inject more rigorous accountability into teacher evaluation systems that historically did a poor job of identifying and removing under-performing teachers (Weisberg et al., 2009). Early evidence on the effectiveness of teacher evalua-tion reforms for improving teaching and learning and identifying and removing the lowest perform-ing teachers is promising (Dee & Wyckoff, 2013; Sartain & Steinberg, in press; Steinberg & Sartain, 2015; Taylor & Tyler, 2012), suggesting that more rigorous teacher evaluation can meaningfully improve teacher quality and student achievement while satisfying the accountability function of newly developed evaluation systems.

As states and districts reform teacher evalua-tion and begin to make high-stakes personnel decisions, increasing attention is being given to the measures being incorporated into new evalu-ation systems. Eighty percent of both states and the largest districts implementing new evaluation systems are using one or more measures of teacher performance based on student test score data; these include VAM and student growth per-centiles (SGP; Steinberg & Donaldson, in press). However, evidence suggests that there is sub-stantial within-teacher variability in a teacher’s annual VAM scores (Goldhaber & Hansen, 2010; McCaffrey et al., 2009) and that this variability depends on the nature of the student assessment being used (Papay, 2011). This year-to-year vari-ability constrains the efficacy of new evaluation systems to serve the accountability function of identifying effective teachers (Rothstein, 2009, 2010). Moreover, VAM scores are limited in their ability to inform teachers’ instructional practice, as these measures are generated and provided to teachers after the conclusion of the school year, and offer no guidance for how teachers may improve their instructional practice.

Although educators and scholars alike point to the limitations implicit in test-score based mea-sures of teacher effectiveness, classroom

observations of teacher practice—designed to provide formative feedback to teachers to improve their instructional practice as well as to more carefully evaluate instructional performance—offer promise as a complementary measure for differentiating teacher performance (Kane, McCaffrey, Miller, & Staiger, 2013). The role of classroom observation in measuring teacher per-formance is particularly notable given that obser-vation-based scores constitute the majority, and often the entirety, of summative evaluation scores (Steinberg & Donaldson, in press), with nearly 70% of teachers nationwide teaching in nontested grades and subjects where student-test-score data—on which VAM scores are based—are unavailable (Watson, Kraemer, & Thorn, 2009). As noted in the introduction, however, new evi-dence suggests that observational measures of teacher practice alone are unable to identify teacher effectiveness (Garrett & Steinberg, 2015). Therefore, the reliance on observation-based measures of teacher performance raises concerns about the ability of newly implemented evalua-tion systems to both accurately capture teacher performance and satisfy the accountability goal of teacher evaluation reform.

Production of Teacher Performance

How teachers perform in the classroom depends on a number of factors. Consider the fol-lowing model where measured teacher perfor-mance is a function of a teacher’s endowed instructional ability and the student composition of his or her classroom:

Perform , , jct j ct jctf X= ( )θ ε . (1)

In Equation 1, Performjct is teacher j’s mea-sured performance, based on his or her classroom observation scores, in classroom c in school year t; θj captures the stable, time-invariant aspects of instructional skill (e.g., teacher quality) that a teacher brings to the classroom, which include a teacher’s ability to design lessons, provide instruction based on those lessons, and design and implement classroom management strategies to maximize the efficacy of instructional lessons within a classroom’s idiosyncratic environment. The endowment of teacher j’s instructional skills (captured by θj) is considered independent of the




5

students in his or her class, such that endowed teacher quality is not influenced by the class-room (c) to which the teacher is assigned. The variable Xct represents the characteristics of stu-dents in classroom c to whom a teacher is assigned in school year t; such characteristics may vary across years and likely influence how teachers perform from one year (and class) to the next. The variable εjct is an idiosyncratic error term capturing all other inputs to measured teacher performance (e.g., curriculum, teacher policy, observer/rater quality).

Just as the composition of teacher j’s class-room may influence measured performance, teachers may also alter their instruction based on the composition of students in their classes. That is, a teacher’s instruction and classroom manage-ment may differentially respond to classroom composition and the students to whom he or she is assigned. We can further decompose the vari-able εjct into an aggregate match effect, γjc, and all other inputs to measured teacher performance, µjct. We can then rewrite Equation 1 as follows:

Perform , , , jct j ct jc jctf X= ( )θ γ µ . (2)

This aggregate match effect (γjc) accounts for processes that occur both before teachers enter the classroom and those that occur during the course of teachers’ interactions with their classes. Specifically, the match effect captures (a) the process that assigns teachers to classrooms, which occurs prior to measured teacher perfor-mance and has been shown to incorporate his-torical teacher and student performance into the assignment decision (Clotfelter et al., 2006; Kalogrides & Loeb, 2013; Kalogrides et al., 2013; Monk, 1987; Steinberg & Garrett, 2015) and (b) the extent to which teacher j’s instruc-tional and classroom management skills interact with and are more (or less) effective with a par-ticular mix of students in classroom c.

In light of evidence that teachers with higher observation scores tend to work with higher achieving students, this article extends the exist-ing literature on the influence that classroom composition—in particular, incoming student achievement—has on measured teacher effec-tiveness. We investigate whether the composition of students (described by the variable Xct) to whom a teacher is assigned accounts for

variation in measured classroom performance, beyond what is attributable to the teacher alone (θj). We explore this relationship by using con-secutive years of data to examine how incoming achievement influences teachers’ observation scores on the most widely used measure of instructional performance, the Danielson FFT. This teacher fixed-effects approach, described in more detail in the following sections, allows us to account for the contribution of a teacher’s endowed ability to his or her observation scores.

As formalized in Equation 2, we recognize that the purposive sorting of teachers to students may exert its own, independent influence on measured teacher performance, thereby con-founding our ability to uniquely capture the influence of classroom composition on teacher effectiveness. We, therefore, further test the rela-tionship between incoming achievement and observed measures of teacher performance by leveraging the random assignment of teachers to classes, a prominent feature of the MET study. In the second year (2010–2011) of the MET study, a subset of teachers was randomly assigned to classes of students.1 In principle, the random assignment of teachers to classes disrupts the assortative matching process that can confound measures of teacher effectiveness (Clotfelter et al., 2006; Garrett & Steinberg, 2015; Rothstein, 2009, 2010; Whitehurst et al., 2014). However, an unintended consequence of the Year 2 ran-domization in the MET study was significant noncompliance among students, who ended up with teachers other than those who were ran-domly assigned to their classes (Kane et al., 2013; White & Rowan, 2012).

In prior work, we found that classroom-level noncompliance with the teacher random assign-ment in the MET study can be characterized in three ways: (a) full compliance classrooms are those where all students, within a classroom to which teacher j was randomly assigned, remained with teacher j; (b) partial-compliance classrooms are those where at least one student, within a classroom to which teacher j was randomly assigned, did not comply with the random teacher assignment and instead ended up in another teacher’s class; and (c) full noncompliance class-rooms are those where no student, within a class-room to which teacher j was randomly assigned, remained with the randomized teacher. Full




6

noncompliance classrooms were those for which principals likely assigned teachers to classes prior to receipt of the random teacher assign-ments, therefore (intentionally or not) ignoring the random assignment of teachers to classes as part of the Year 2 randomization sample in the MET study (Steinberg & Garrett, 2015; White & Rowan, 2012). These particular consequences of ex post, nonrandom sorting on classroom-level compliance with randomization allows for addi-tional insight into the influence of the nonran-dom assignment of teachers to classes on measured teacher performance.2

Data and Sample

We use data from the MET study, which was carried out over 2 school years (2009–2010 and 2010–2011) and across six districts.3 The six par-ticipating districts were Charlotte-Mecklenburg Schools (NC), Dallas Independent School District (TX), Denver Public Schools (CO), Hillsborough County Public Schools (FL), Memphis City Schools (TN), and the New York City Department of Education (NY; White & Rowan, 2012).

Teacher Sample

The teacher-level analytic sample consists of 834 teachers in Grades 4 to 9, of which 354 taught ELA, 304 taught math, and 176 taught both subjects.4 For the 834 teachers in our sample, we incorporate the following background characteristics: gender, race/ethnicity, degree status, and years of teaching expe-rience within the district. We also observe whether a teacher was a generalist or departmentalized teacher. Generalists are subject-matter generalists who taught both ELA and mathematics to a single class of students; departmentalized teachers are subject-matter specialists who taught exclusively ELA or math, and worked with multiple classes of students. Furthermore, each teacher record includes unique district, school, and classroom (section) identifiers, variables identifying the grade taught, and the grade-subject combinations (e.g., fifth-grade math). We are able to use this unique identifying information to link students to their randomly assigned teacher as well as to their actual classroom teacher, allowing us to identify the compliance category of the class to which a teacher was

randomly assigned in Year 2 (i.e., full compliance, partial compliance, or full noncompliance).

Table 1 summarizes the characteristics of all teachers in our sample and by subject-specific sample (i.e., ELA or math). In addition to the full sample of teachers, we construct two additional analytic samples by subject area. The IV sample includes teachers in Grades 5 to 9, and from this sample, we construct IV estimates (of the effect of incoming achievement on measured teacher performance), which we compare with teacher FE estimates for an assessment of the extent of bias in observation scores due to the nonrandom assignment of teachers to classes. The random-ization sample includes those randomization blocks (i.e., grade-subject combinations of classes within a school) in which all MET study teachers were in full compliance classrooms. We use the randomization sample to generate experi-mental estimates of the effect of incoming achievement on teacher performance.5

Among our sample of 834 teachers, 46% were in full compliance classrooms, 36% were in par-tial-compliance classrooms, and 18% were in full noncompliance classrooms. Table 2 summa-rizes the teacher characteristics by classroom compliance status. Among the ELA sample, teachers differed significantly by years of teach-ing experience (in district), master’s degree attainment, and subject-matter generalist status. Indeed, fewer teachers in full compliance classes were subject-matter generalists teaching fourth or fifth grade, suggesting that much of the non-compliance with the Year 2 randomization took place in the earlier grades among the MET sam-ple of ELA teachers. We also observe very simi-lar patterns among the sample of math teachers.

Nearly a quarter of all teachers in our sample were subject-matter generalists teaching multiple subjects to one class of students. Table 3 summa-rizes the characteristics of teachers by generalist and departmentalized status. In both the ELA and math samples, generalists were significantly less likely to be male and White, although significantly more likely to be Black and have a master’s degree. Generalist teachers had approximately less than 2.5 years of teaching experience than departmentalized teachers, with nearly all (98%) generalists teaching in either Grade 4 or 5. Furthermore, we find that early-grade, generalist teachers were significantly more likely to be in



7

full noncompliance classrooms (and, by exten-sion, less likely to be in full compliance class-rooms) than departmentalized teachers. This suggests that it may have been more difficult for school leaders to intentionally ignore the teacher random assignment among departmentalized teachers and purposively match entire classes of students to these teachers. In contrast, it appears that the purposive matching of teachers to classes (and the abandoning of the teacher random assign-ment) was more easily accomplished among the elementary grade (e.g., fourth and fifth grade) teachers teaching in self-contained classrooms.

Teacher Performance Measure: Classroom Observation Scores

We use the measure of teacher effectiveness that is among the most widespread in newly devel-oped teacher evaluation systems—observation scores of a teacher’s classroom practice using

Charlotte Danielson’s FFT. Measures of a teach-er’s instructional practice are captured through FFT scores generated by MET raters using videos of subject-specific (e.g., math or ELA) lessons.6 Generalist classroom teachers were videoed, on average, on 4 separate days throughout the year, with each day producing one ELA and one math lesson video. In the first year of the study, depart-mentalized teachers were videoed on 2 separate days, capturing two different sections taught by the teacher. We randomly selected one of a depart-mentalized teacher’s two sections from the first year for inclusion in our analysis.7 In the second year of the study, departmentalized teachers were videoed on 4 different days and focused on one of the teacher’s sections. Video recordings were spread over time to capture greater representative-ness of teacher instruction. MET raters watched the lesson videos remotely, coding them using eight components from two of the FFT domains: classroom environment and instruction (see White

TABLE 1Teacher Characteristics

Teacher characteristic

ELA sample Math sample

All teachers

Full sample IV sample

Randomization sample

Full sample

IV sample

Randomization sample

Male .18 .13 .15 .20 .21 .25 .23White .58 .57 .59 .65 .55 .56 .60Black .32 .34 .32 .27 .35 .35 .29Hispanic .08 .07 .07 .06 .07 .06 .07Other race .02 .01 .02 .02 .03 .04 .04Experience (years

in district)8.7 (7.01) 8.3 (6.63) 8.7 (6.94) 8.4 (6.97) 8.3 (7.01) 8.9 (7.48) 8.4 (7.00)

Master’s or higher .33 .35 .31 .21 .40 .37 .20Grade 4 .17 .22 .00 .12 .23 .00 .12Grade 5 .19 .24 .31 .10 .25 .32 .12Grade 6 .20 .16 .21 .22 .18 .24 .29Grade 7 .14 .11 .14 .19 .12 .15 .23Grade 8 .14 .12 .16 .25 .11 .14 .24Grade 9 .16 .14 .18 .12 .12 .15 .00Generalist .22 .34 .23 .15 .37 .25 .23

Teachers 834 530 412 130 480 372 86Schools 204 172 156 49 162 146 31Districts 5 5 5 5 5 5 4

Note. Data are from the second year of the MET study (2010–2011 school year). Proportions are reported for all teacher characteristics except experi-ence (mean and standard deviation reported). Generalist teachers are subject-matter generalists who taught ELA and mathematics to a single class of students. Of the 834 teachers, data on a teacher’s gender is available for 799 teachers, 798 report their race, 797 report years of experience (in the dis-trict), and 618 report degree attainment (e.g., master’s or higher). Of the 530 teachers in the ELA sample, data on a teacher’s gender and race are avail-able for 503 teachers, 501 report years of experience (in the district), and 381 report degree attainment (e.g., master’s or higher). Of the 480 teachers in the math sample, data on a teacher’s gender is available for 462 teachers, 461 report their race, 462 report years of experience (in the district), and 333 report degree attainment (e.g., master’s or higher). ELA = English Language Art; IV = instrumental variables; MET = Measures of Effective Teaching.



8

& Rowan, 2012 for a detailed discussion of the process of video recording and scoring teachers’ lessons). Each component was scored on an inte-ger scale from 1 (unsatisfactory) to 4 (distin-guished). The scores for each component were then averaged within subject (using the harmonic mean) across all segments (i.e., lesson videos) to create section-level aggregate scores for each component. These aggregate scores represent the average FFT scores for a teacher’s subject-specific instruction for a single class (section) of students.

For our measure of teacher effectiveness, we created a composite performance measure in the following manner. Using the section-level means (from multiple observation scores) of each of the eight FFT components, we averaged across the eight section-level aggregate component scores within a section, by subject. For example, each of the eight FFT component means based on ELA instruction from a teacher’s class section were

averaged for a final composite ELA FFT score. Previous research using the MET data found this to be an appropriate approach for aggregating teacher effectiveness measures based on classroom obser-vation scores (Garrett & Steinberg, 2015; Kane et al., 2013; Mihaly, McCaffrey, Staiger, & Lockwood, 2013).

Incoming Achievement and Classroom Composition

The MET data contain information on student achievement for Grades 4 to 8 provided by each district for its state accountability tests in ELA and mathematics. Student achievement information was provided for the 2 years of the study (e.g., 2009–2010 and 2010–2011) and up to 3 years prior to the study, as available. As the state assessments differ across locations, scores were standardized (e.g., z scores) at the student level within district,

TABLE 2Teacher Characteristics, by Classroom Compliance Status

Teacher characteristic


Full Comply Partial Comply Noncomply Full Comply Partial Comply Noncomply

Male .15 .09 .14 .20 .23 .17White .61 .56 .49 .57 .56 .48Black .32 .32 .43 .34 .32 .43Hispanic .05* .11 .06 .06 .09 .06Other race .01 .01 .02 .03 .03 .02Experience (years

in district)8.5** (7.06) 8.9 (6.66) 6.74 (4.84) 8.75 (7.25) 8.38 (6.91) 7.48 (6.71)

Master’s or higher .28** .40 .47 .30*** .44 .51Grade 4 .15*** .30 .27 .18 .25 .26Grade 5 .19*** .25 .37 .21** .22 .36Grade 6 .19 .16 .10 .22 .15 .17Grade 7 .17*** .07 .04 .16** .09 .07Grade 8 .17*** .07 .09 .16*** .07 .08Grade 9 .14 .15 .13 .06*** .22 .06Generalist .24*** .38 .53 .32 .38 .44

Teachers 258 180 92 189 184 107Schools 120 93 48 103 95 48Districts 5 5 4 5 5 4

Note. Data are from the second year of the MET study (2010–2011 school year). Proportions are reported for all teacher charac-teristics except experience (mean and standard deviation reported). Full Comply refers to teachers in full compliance classrooms; Partial Comply refers to teachers in partial compliance classrooms; and Noncomply refers to teachers in full noncompliance classrooms in the second year of the MET study. ELA = English Language Art; MET = Measures of Effective Teaching.

Differences by classroom compliance status, by subject, statistically significant at the *10%, **5%, and ***1% levels.



9

grade, and subject to enable comparability across sites. The MET data also capture key background characteristics of a teacher’s students that we include as controls for classroom composition, including class size, student race/ethnicity, gender, age, special education status (SPED), free or reduced-price lunch status (FRPL; as a proxy for student poverty), gifted-status, and whether a stu-dent is an English language learner (ELL; see Table 4). Both student background and achieve-ment characteristics were aggregated to the class-room level. The aggregated, classroom-level achievement represents the average (ELA or math) achievement for all students in a given class.

Importantly, the average prior-year achieve-ment of all students in the class captures the aca-demic skills students have when arriving to a class, a focal consideration for assessing the

influence of classroom composition on measured teacher performance.

Empirical Approach

Our aim is to understand how the incoming academic achievement of a class influences a teacher’s performance as measured by his or her classroom observation scores. Assuming addi-tive separability of inputs to teacher performance, we can write Equation 1 as follows:

Perform + PriorAchieve

+ + +

0 1jct ct

ct j jctX

= ( )β β

Γ θ ε , (3)

where Performjct is the observation score for teacher j with class c in year t; PriorAchieve is the average prior-year (i.e., incoming)

TABLE 3Teacher Characteristics, by Generalist/Departmentalized Status


Teacher characteristic Generalist Departmentalized Generalist Departmentalized

Male .07*** .16 .07*** .28White .46*** .63 .48** .59Black .46*** .28 .44*** .30Hispanic .06 .08 .06 .08Other race .02 .01 .02 .03Experience (years in district) 6.65*** (5.44) 9.19 (7.02) 6.71*** (5.47) 9.25 (7.60)Master’s or higher .68*** .23 .67*** .28Grade 4 .47*** .09 .48*** .08Grade 5 .51*** .11 .50*** .10Grade 6 .02*** .23 .02*** .28Grade 7 .00*** .17 .00*** .18Grade 8 .00*** .18 .00*** .17Grade 9 .00*** .21 .00*** .19Full Comply .35*** .56 .34* .42Partial Comply .38 .32 .39 .38Noncomply .27*** .12 .27* .20

Teachers 180 350 177 303Schools 54 122 52 115Districts 4 5 4 5

Note. Data are from the second year of the MET study (2010–2011 school year). Proportions are reported for all teacher charac-teristics except experience (mean and standard deviation reported). Generalist teachers are subject-matter generalists who taught ELA and mathematics to a single class of students. Departmentalized teachers are subject matter specialists who taught ELA or math to more than one class of students. ELA = English Language Art; MET = Measures of Effective Teaching.

Differences by generalist/departmentalized status, by subject, statistically significant at the *10%, **5%, and ***1% levels.



10

achievement of teacher j’s class c in year t; and Xct captures other observable characteristics of students in teacher j’s classroom in year t, includ-ing class size, the average age of students in the class, the proportion of students receiving FRPL, the proportion of students who are ELLs, the pro-portion of students receiving special education services (i.e., SPED), the proportion of gifted students, and the percentage of minority (i.e., Black or Hispanic) students; θj is a teacher FE and εjct is a random error term. Accounting for a teacher’s endowed ability in this teacher fixed-effects set-up will enable us to deconfound the relationship between incoming achievement and teacher performance that may be due to fixed performance characteristics of teachers. We clus-ter the standard errors at the teacher level.

We estimate Equation 3 separately for each of the two subject-specific (math and ELA) teacher samples to examine the extent to which incoming

achievement may differentially influence mea-sured teacher performance across content areas. In addition to teachers’ aggregate classroom observa-tion scores, we examine the relationship between incoming achievement and individual teacher practices across two domains—instruction and classroom management. Furthermore, we esti-mate the extent to which incoming achievement differentially influences the observation scores for subject-matter generalists and departmentalized, subject-matter specialists.

As previously discussed, measured teacher performance may reflect, in part, the extent to which a teacher’s instruction and classroom man-agement interacts with the students to whom he or she is assigned, as well as the process by which teachers are assigned to classrooms. As formal-ized in Equation 2, the ex ante assignment process matching teachers to students and the ex post interaction of a teacher’s instruction with

TABLE 4Classroom Composition

Classroom characteristic

All teachers ELA sample Math sample

Y1 Y2 Y1 Y2 Y1 Y2

Class size 23.8 (5.4) 24.6 (6.0) 24.0 (4.8) 24.9 (5.4) 24.1 (5.6) 24.8 (6.3)Age (years) 11.9 (1.8) 10.9 (1.8) 11.6 (1.8) 10.6 (1.8) 11.5 (1.8) 10.5 (1.8)Male .50 (.12) .50 (.11) .50 (.12) .49 (.11) .50 (.11) .51 (.11)White .26 (.28) .24 (.26) .24 (.28) .23 (.26) .22 (.26) .21 (.24)Black .32 (.33) .31 (.32) .35 (.35) .35 (.35) .36 (.35) .35 (.35)Hispanic .34 (.28) .36 (.27) .33 (.28) .34 (.28) .33 (.27) .35 (.27)Asian .06 (.12) .06 (.12) .06 (.12) .06 (.12) .06 (.13) .06 (.13)Other race .03 (.05) .03 (.04) .03 (.05) .02 (.04) .03 (.05) .03 (.04)FRPL .55 (.29) .57 (.30) .53 (.30) .55 (.31) .55 (.29) .58 (.30)ELL .12 (.16) .13 (.17) .11 (.16) .12 (.17) .13 (.17) .14 (.17)SPED .08 (.10) .09 (.10) .07 (.10) .09 (.10) .08 (.10) .10 (.11)Gifted .06 (.14) .07 (.15) .08 (.16) .09 (.17) .04 (.09) .04 (.09)Achievement (ELA) .06 (.54) .10 (.54) .09 (.52) .12 (.52) .00 (.52) .03 (.50)Achievement (Math) .06 (.54) .08 (.55) .10 (.50) .11 (.52) .01 (.53) .02 (.52)

Teachers 834 834 530 530 480 480Schools 204 204 172 172 162 162Districts 5 5 5 5 5 5

Note. Each cell reports the mean (standard deviation) proportion of students, by characteristic, in a teacher’s classroom, except for class size, age, and achievement. Y1 represents the first year of the MET study (2009–2010 school year) and Y2 represents the sec-ond year of the MET study (2010–2011 school year). For class size, we report the mean (standard deviation) number of students in a teacher’s class. Achievement represents the average incoming achievement (in standard deviation units) among students in a teacher’s class (e.g., Y1 achievement, in ELA and math, is for the 2008–2009 school year, and Y2 achievement is for the 2009–2010 school year), standardized at the subject/grade/district level. ELA = English Language Art; FRPL = free or reduced-price lunch status; ELL = English language learner; SPED = special education status; MET = Measures of Effective Teaching.




11

the particular composition of the class may be captured by an aggregate match effect, γjc. To dis-tinguish between these ex ante and ex post contri-butions to the aggregate match effect, we characterize the influence of nonrandom teacher-classroom matching as γ jc

match and the interaction

of teacher instruction and classroom composition as γ jc

instr. The teacher FE approach accounts for the

sorting of teachers to classes (i.e., the ex ante match effect, γ jc

match ) that is correlated with unob-served, time-invariant teacher characteristics. However, it is possible that the sorting of teachers to classrooms may also be a function of time-vary-ing, unobserved shocks to teacher performance. For example, Kalogrides and Loeb (2013) point out that principals may assign teachers to classes as a retention strategy, to compensate them for particularly strong performance in a given year, or even as a result of pressure senior teachers might place on principals to assign them their preferred classes and students. It is important to note that the FE approach will produce biased estimates of the influence of incoming achievement on a teacher’s measured performance if teacher assignment to classes is also based on such year-specific shocks.

Experimental Estimates of Incoming Achievement

One approach for addressing the concern that time-varying shocks may be correlated with the assignment of teachers to classes and bias the FE estimates is to leverage the randomization of teachers to classes that took place in the second year of the MET study. As previously described, a subset of all MET study teachers was randomly assigned to classes within randomization blocks, or grade-subject combinations of classes consist-ing of at least two participating MET teachers. If the random assignment of teachers to classes (within randomization block) was implemented faithfully, all teachers would be in full compli-ance classrooms in the 2010–2011 school year. Then, via OLS, we could estimate the causal effect of incoming achievement (which, condi-tional on the randomization block, should be exogenous to the teacher). However, in our full ELA sample, 51% of teachers were in either par-tial or full noncompliance classrooms; among our full math sample, 61% of teachers were in classes with some noncompliance.

Even though we observe extensive classroom-level noncompliance with the teacher random assignment, we identified a subset of randomiza-tion blocks in which all MET study teachers were in full compliance classrooms (see Table 1 for summary statistics on the randomization sam-ple). Of the 260 randomization blocks in the ELA sample, we identified 58 where all MET study teachers were in full compliance classes; for the math sample, 38 of the 244 randomization blocks included only full compliance classes. Given that only a subset of the randomization blocks was included in our randomization sample, we first assessed the fidelity of the randomization. Specifically, if the randomization held among teachers (and their classes) in our randomization sample, then the classroom composition vari-ables jointly should not predict a teacher’s pre-randomization measured performance (i.e., 2009–2010 FFT scores). To conduct this covari-ate balance test, we estimate the following model for both the ELA and math samples:

Perform + PriorAchieve

+ + +

0 1

1

j t ct

ct r j,tX

,

,

−

−

= ( )1 α α

ζ υ ε (4)

where Performj,t−1 is teacher j’s prerandomization classroom observation score from year t − 1 (i.e., 2009–2010 school year); PriorAchieve is the average prior-year (i.e., incoming) achievement of teacher j’s class c in year t (i.e., 2010–2011 school year); and Xct captures other observable characteristics of students in teacher j’s class-room in year t, as in Equation 3. Given that the randomization was conducted at the randomiza-tion block level, we include randomization block FE (υr); εj,t−1 is a random error term, and we clus-ter the standard errors at the randomization block level. To determine whether the characteristics of teacher j’s class c in the 2010–2011 school year jointly predicted his or her prerandomization measured performance from the 2009–2010 school year, we assess the F statistic (and associ-ated p value) from the joint test of the null hypothesis for Equation 4. If classroom composi-tion was distributed randomly across teachers within randomization blocks, then we will be unable to reject the joint null hypothesis.8 In such cases, OLS estimates of Equation 3 for the ran-domization year (2010–2011), replacing only the teacher FE (θj) with the randomization block FE




12

(υr), will produce unbiased estimates of the effect of incoming achievement on measured teacher performance. We present both OLS and teacher FE estimates for the ELA and math randomiza-tion samples, and discuss the validity of these estimates in light of the covariate balance test described above.

Examining Bias Due to Nonrandom Matching

Prior evidence suggests that when teachers are nonrandomly assigned to classes, systematic assortative matching occurs such that higher per-forming teachers are assigned to higher perform-ing students (Clotfelter et al., 2006; Kalogrides & Loeb, 2013; Kalogrides et al., 2013; Steinberg & Garrett, 2015). In addition to the systematic matching of teachers to classes on observed per-formance characteristics, policymakers and prac-titioners should also be concerned that teachers are assigned to classes based on unobserved characteristics. If teacher-class assignment depends, in part, on unobserved time-varying and time-invariant characteristics of teachers, then observation-based measures of teacher per-formance will be prone to bias.

Our aim is to uncover the extent to which such unobserved matching bias that is due to teacher-class sorting based on unobserved and time-invariant teacher quality may influence teachers’ measured performance. To do so, we compare estimates from the teacher FE approach with Year 2 OLS estimates based on Equation 3 and esti-mates from an IV approach (please see the appen-dix for details on the IV approach and the formal comparison between FE and IV strategies). Neither the OLS nor IV estimates will provide unbiased causal estimates of the effect of incom-ing achievement. Rather, the OLS estimates (based on Year 2 data) are compared with the FE estimates (based on 2 consecutive years of data) to reveal whether omitting time-invariant teacher quality biases estimates of the effect of incoming achievement on measured teacher performance. Furthermore, the IV estimates are compared with the FE estimates to examine whether omitting the correlation between time-invariant teacher qual-ity and incoming classroom achievement biases estimates of the effect of incoming achievement on measured teacher performance. Therefore,

both the OLS and IV comparisons to the FE esti-mates aim to shed light on how omitting fixed teacher quality biases the effect of incoming achievement on measured teacher performance. Most critically for policy and practice, these com-parisons will reveal whether bias in teacher obser-vation scores may be present due to the nonrandom assignment of teachers to classes based on fixed teacher quality.

Results

To what extent does measured teacher perfor-mance, based on classroom observation scores, reflect the incoming achievement of the students to whom teachers are assigned? To answer this ques-tion, we begin by showing the distribution of teacher effectiveness, measured by classroom observation scores, among teachers who teach stu-dents with varying levels of academic perfor-mance. We present results from the 2009–2010 school year only because all teachers were nonran-domly assigned to classes in the first year of the MET study. We find that ELA teachers were more than twice as likely to be rated in the top perfor-mance quintile if assigned the highest achieving students compared with teachers assigned the low-est achieving students; math teachers were more than 6 times as likely (see Figure 1). Furthermore, approximately half of the teachers—48% in ELA and 54% in math—were rated in the top two per-formance quintiles if assigned the highest perform-ing students, while 37% of ELA and only 18% of math teachers assigned the lowest performing stu-dents were highly rated based on classroom obser-vation scores.

The patterns observed in Figure 1 likely reflect a number of factors—both observable and unobservable characteristics of teachers and their students—that influence how teachers perform in the classroom and their subsequent performance ratings. To lend further insight into the influence of student achievement on measured teacher per-formance, we next present our main estimates of the effect of incoming student achievement on teachers’ classroom observation scores. Among the full sample of ELA and math teachers, esti-mates from the teacher FE approach reveal that incoming achievement has a large and significant effect on measured teacher performance (see col-umn 2, Table 5). For the ELA sample, the



13

coefficient of 0.11 FFT points suggests that if a teacher were to be assigned to a class with 1 stan-dard deviation higher incoming achievement, that teacher would realize one third of a standard deviation increase in measured teacher perfor-mance. For math teachers, the effect of incoming math achievement, although smaller in magni-tude, suggests that a 1 standard deviation increase in incoming student math achievement would generate approximately a 0.2 standard deviation increase in measured performance.

When we restrict the sample to teachers in Grades 5 to 9 (IV Sample), the FE estimates again suggest a substantive and statistically sig-nificant effect of incoming achievement on mea-sured teacher performance. For ELA teachers, a 0.14 FFT point effect is equivalent to increasing measured teacher performance by 0.4 standard deviations (with a commensurate 1 standard deviation increase in incoming student ELA achievement); for math teachers, a 0.08 FFT point effect is equivalent to increasing measured

FIGURE 1. Measured teacher performance by incoming student achievement.Note. Panel A shows the distribution of teacher FFT scores from ELA lessons from the 2009–2010 school year (for 530 ELA teachers) by the incoming ELA achievement of their students. Panel B shows the distribution of teacher FFT scores from math lessons from the 2009–2010 school year (for 480 math teachers) by the incoming math achievement of their students. As state achievement tests differ across the five districts in which our sample of teachers are located, student test scores were standardized (e.g., z scores) within district, grade, and subject to enable comparability across district sites. F tests confirm that the FFT score for ELA and math teachers differ significantly across the distribution of incoming classroom achievement—for the ELA sample, F = 4.21 (p = .002); for the math sample, F = 12.08 (p < .0001). FFT = Framework for Teaching; ELA = English Language Art.



14

TAB

LE

5In

com

ing

Ach

ieve

men

t and

Mea

sure

d Te

ache

r P

erfo

rman

ce

Ful

l sam

ple

IV s

ampl

eR

ando

miz

atio

n sa

mpl

e

(1)

(2)

(3)

(4)

(5)

(6)

OL

S

(Poo

led)

FE

IVF

EO

LS

(Y

2)F

E

Pan

el A

: EL

A s

ampl

e

Pri

or a

chie

vem

ent

0.04

(0.

03)

0.11

***

(0.0

4)0.

11*

(0.0

6)0.

14**

* (0

.05)

0.17

(0.

22)

0.16

** (

0.08

)

IV (

F s

tati

stic

)0.

89**

* (8

27.8

)

F

sta

tist

ic (

p va

lue)

fro

m C

ovar

iate

Bal

ance

Tes

t0.

64 (

0.80

1)

R

ando

miz

atio

n bl

ocks

260

260

205

205

58

58

T

each

er ×

Yea

r O

bser

vati

ons

1,06

01,

060

412

824

130

260

R

2.2

179

.029

8.2

681

.014

3.7

576

.069

1

FF

T M

(SD

)2.

53 (

0.33

)2.

50 (

0.33

)2.

50 (

0.35

)2.

52 (

0.34

)2.

51 (

0.36

)P

anel

B: M

ath

sam

ple

P

rior

ach

ieve

men

t0.

06**

(0.

03)

0.06

* (0

.04)

−0.

01 (

0.04

)0.

08**

(0.

04)

−0.

33 (

0.23

)0.

07 (

0.09

)

IV (

F s

tati

stic

)0.

85**

* (8

22.2

)

F

sta

tist

ic (

p va

lue)

of

Cov

aria

te B

alan

ce T

est

2.28

** (

0.02

7)

R

ando

miz

atio

n bl

ocks

244

244

194

194

38

38

T

each

er ×

Yea

r O

bser

vati

ons

960

960

372

744

86

172

R

2.2

149

.127

3.2

648

.142

2.6

695

.091

3

FF

T M

(SD

)2.

48 (

0.33

)2.

45 (

0.32

)2.

45 (

0.34

)2.

44 (

0.31

)2.

43 (

0.33

)

Not

e. E

ach

colu

mn

(wit

hin

a pa

nel)

rep

orts

a s

epar

ate

regr

essi

on. F

or p

oole

d O

LS

and

IV

est

imat

es, c

oeff

icie

nts

repo

rted

(w

ith

robu

st s

tand

ard

erro

rs c

lust

ered

at t

he s

choo

l lev

el);

for

teac

her

FE

est

imat

es, c

oeff

icie

nts

repo

rted

(wit

h ro

bust

sta

ndar

d er

rors

clu

ster

ed a

t the

teac

her l

evel

); fo

r OL

S (Y

2) e

stim

ates

, coe

ffic

ient

s re

port

ed (w

ith

robu

st s

tand

ard

erro

rs c

lust

ered

at t

he ra

ndom

-iz

atio

n bl

ock

leve

l). C

olum

ns 2

, 4, a

nd 6

incl

ude

teac

her

FE

; col

umn

3 in

clud

es d

istr

ict F

E, a

nd c

olum

n 5

incl

udes

ran

dom

izat

ion

bloc

k F

E. T

he I

V s

ampl

e in

clud

es te

ache

rs in

Gra

des

5 to

9

duri

ng th

e 20

10–2

011

scho

ol y

ear.

Pri

or A

chie

vem

ent i

s th

e av

erag

e in

com

ing

subj

ect-

spec

ific

(e.g

., E

LA

or m

ath)

ach

ieve

men

t am

ong

stud

ents

in a

teac

her’

s cl

ass.

For

the

IV e

stim

ates

, Pri

or

Ach

ieve

men

t est

imat

es th

e ef

fect

of

inco

min

g cl

assr

oom

ach

ieve

men

t on

mea

sure

d te

ache

r pe

rfor

man

ce w

hen

inst

rum

enti

ng w

ith

the

clas

s’ o

nce-

lagg

ed p

rior

ach

ieve

men

t (e.

g., t

wic

e-la

gged

ac

hiev

emen

t). T

he C

ovar

iate

Bal

ance

Tes

t rep

orts

the

F s

tati

stic

(p

valu

e) f

rom

the

join

t tes

t of

the

null

hyp

othe

sis

from

a r

egre

ssio

n of

pre

rand

omiz

atio

n (e

.g.,

2009

–201

0) m

easu

red

teac

her

perf

orm

ance

(Per

form

j,t−

1) o

n P

rior

Ach

ieve

men

t, co

ntro

llin

g fo

r cla

ssro

om c

hara

cter

isti

cs a

nd ra

ndom

izat

ion

bloc

k F

E. A

ll re

gres

sion

s co

ntro

l for

the

foll

owin

g cl

assr

oom

cha

ract

eris

tics

: cla

ss

size

and

the

sha

re o

f st

uden

ts b

y ra

ce/e

thni

city

, gen

der,

age

, spe

cial

edu

cati

on s

tatu

s, f

ree

or r

educ

ed-p

rice

lun

ch s

tatu

s, g

ifte

d-st

atus

, and

Eng

lish

lan

guag

e le

arne

r st

atus

. IV

= i

nstr

umen

tal

vari

able

s; O

LS

= o

rdin

ary

leas

t squ

ares

; FE

= f

ixed

-eff

ects

; EL

A =

Eng

lish

Lan

guag

e A

rt; F

FT

= F

ram

ewor

k fo

r T

each

ing.

Coe

ffic

ient

s st

atis

tica

lly

sign

ific

ant a

t the

*10

%, *

*5%

, and

***

1% le

vels

.




15

teacher performance by 0.24 standard deviations. We note that the FE estimates for both ELA and math teachers in the IV sample are larger in mag-nitude than the FE estimates among the full sam-ple, suggesting that the influence of incoming achievement may be less salient for teachers in earlier grades, as Grade 4 teachers were excluded from the IV sample. We return to this point when we examine the effect of incoming achievement among generalist and departmentalized teachers.

As previously discussed, the FE estimates will be biased in the presence of time-varying shocks to teacher-class assignment. To address this con-cern, we present experimental estimates of the effect of incoming achievement on measured teacher performance (see column 5, Table 5). For the randomization sample of ELA teachers, the covariate balance test indicates that the random-ization was preserved. However, the randomiza-tion does not appear to have held for the sample of math teachers, as we can reject the joint null hypothesis (p value = .03) that classroom compo-sition was uncorrelated with prerandomization measured teacher performance in math. As a result, the experimental estimate for only the sample of ELA teachers may be considered a valid causal effect of incoming achievement on measured teacher performance. For ELA teach-ers, we find a substantively large (0.17 FFT points), although not precisely estimated, effect of incoming achievement. The FE estimate (0.16 FFT points) is nearly identical to the experimen-tal estimate and more precisely estimated, con-firming the assumptions of the FE model, that is, E(εjct | PriorAchievect, X, θj) = 0 from Equation 3. We further note that the FE estimate for the ran-domization sample of math teachers (0.07) is remarkably similar to the FE estimates for the full and IV math samples. Taken together, the FE and experimental estimates suggest that incom-ing achievement plays a significant role in deter-mining measured teacher performance.

Does Incoming Achievement Influence Specific Teacher Practices?

To shed further light on the specific teacher practices that may be differentially influenced by incoming achievement, we next examine eight teacher practices across two domains—classroom environment and instruction—for which external

observers in the MET study rated teacher perfor-mance.9 We first examine the extent to which the incoming achievement of an ELA teacher’s stu-dents influences these specific teacher practices (see Panel A, Table 6). We note that all estimates are net of teacher FE. We find consistent evidence that incoming achievement influences each of the four practices related to the classroom environ-ment. These findings may not be particularly sur-prising. The composition of a classroom is likely to have an influence on measures of teacher prac-tice that require teachers and students to cocon-struct learning activities and collaborate on classroom procedures, while higher achieving stu-dents may be more engaged in learning, shaping the overall behavioral climate of the classroom. These findings therefore suggest that, independent of fixed teacher quality, teachers who are assigned higher achieving students will be judged to per-form better on measures that evaluate their capac-ity to create and sustain a higher functioning learning environment.

Moreover, evidence from Table 6 indicates that aspects of a teacher’s instructional practices that may be more sensitive to interactions between teachers and their students are influ-enced by the achievement of their students, inde-pendent of a teacher’s own instructional ability. Indeed, teachers with higher achieving students are rated higher in two areas of instruction—communicating with students (3a) and engaging students in learning (3c). Notably, however, mea-sures of instructional performance that depend more on strategies or instructional tools that teachers may bring with them to the classroom—using questioning techniques (3b) and assess-ment (3d) to drive instruction—are invariant to the level of achievement of their students.

We find that, unlike with ELA teachers, the measured performance of math teachers on indi-vidual practices is largely unrelated to the incom-ing achievement of their students (see Panel B, Table 6). The only exception is for a math teach-er’s performance in establishing a culture for learning in the class (FFT component 2b). These results are not surprising given that a math teach-er’s aggregate performance on the Danielson FFT has only a modest and marginally statisti-cally significant relationship with prior student achievement (see column 2, Panel B of Table 5). It is possible that the ways in which math



16

TAB

LE

6In

com

ing

Ach

ieve

men

t and

Tea

cher

Pra

ctic

es

Dom

ain

2: C

lass

room

env

iron

men

tD

omai

n 3:

Ins

truc

tion

2a

2b2c

2d3a

3b3c

3d

Pan

el A

: EL

A s

ampl

e

Pri

or a

chie

vem

ent

0.15

***

(0.0

5)0.

19**

* (0

.06)

0.12

** (

0.05

)0.

09*

(0.0

5)0.

10**

(0.

05)

0.07

(0.

06)

0.16

***

(0.0

5)0.

04 (

0.05

)

Cla

ssro

om c

hara

cter

isti

csX

XX

XX

XX

X

Tea

cher

FE

XX

XX

XX

XX

T

each

er ×

Yea

r O

bser

vati

ons

1,06

01,

060

1,06

01,

060

1,06

01,

060

1,06

01,

060

R

2.0

004

.037

4.0

688

.001

0.0

326

.002

6.0

770

.036

7

FF

T M

(SD

)2.

71 (

0.42

)2.

53 (

0.42

)2.

68 (

0.42

)2.

77 (

0.43

)2.

64 (

0.37

)2.

24 (

0.43

)2.

46 (

0.41

)2.

26 (

0.42

)P

anel

B: M

ath

sam

ple

P

rior

ach

ieve

men

t0.

08 (

0.06

)0.

10*

(0.0

5)0.

05 (

0.05

)0.

05 (

0.05

)0.

06 (

0.05

)0.

05 (

0.05

)0.

03 (

0.05

)0.

07 (

0.06

)

Cla

ssro

om c

hara

cter

isti

csX

XX

XX

XX

X

Tea

cher

FE

XX

XX

XX

XX

T

each

er ×

Yea

r O

bser

vati

ons

9

60

960

9

60

960

9

60

960

9

60

960

R

2.0

644

.086

4.0

468

.095

7.0

003

.017

6.0

980

.013

6

FF

T M

(SD

)2.

62 (

0.43

)2.

48 (

0.41

)2.

64 (

0.44

)2.

72 (

0.43

)2.

57 (

0.38

)2.

13 (

0.40

)2.

37 (

0.41

)2.

30 (

0.41

)

Not

e. E

ach

colu

mn

(wit

hin

a pa

nel)

rep

orts

a s

epar

ate

regr

essi

on. C

oeff

icie

nts

(wit

h ro

bust

sta

ndar

d er

rors

clu

ster

ed a

t the

teac

her

leve

l) r

epor

ted.

Eac

h co

lum

n re

pres

ents

a s

epar

ate

regr

essi

on

of a

n in

divi

dual

FF

T c

ompo

nent

(e.

g., t

each

er p

ract

ice)

on

clas

sroo

m c

ompo

siti

on. P

rior

Ach

ieve

men

t is

the

aver

age

inco

min

g E

LA

or

mat

h ac

hiev

emen

t am

ong

stud

ents

in a

teac

her’

s cl

ass.

D

omai

n 2

(cla

ssro

om e

nvir

onm

ent)

incl

udes

the

foll

owin

g F

FT

com

pone

nts:

(2a

) cr

eati

ng a

n en

viro

nmen

t of

resp

ect a

nd r

appo

rt, (

2b)

esta

blis

hing

a c

ultu

re f

or le

arni

ng, (

2c)

man

agin

g cl

ass-

room

pro

cedu

res,

and

(2d

) m

anag

ing

stud

ent b

ehav

ior.

Dom

ain

3 (i

nstr

ucti

on)

incl

udes

the

foll

owin

g F

FT

com

pone

nts:

(3a

) co

mm

unic

atin

g w

ith

stud

ents

, (3b

) us

ing

ques

tion

ing

and

disc

us-

sion

tech

niqu

es, (

3c)

enga

ging

stu

dent

s in

lear

ning

, and

(3d

) us

ing

asse

ssm

ent i

n in

stru

ctio

n (s

ee G

arre

tt &

Ste

inbe

rg, 2

015,

for

mor

e de

tail

on

each

of

the

eigh

t FF

T c

ompo

nent

s). C

lass

room

ch

arac

teri

stic

s in

clud

e cl

ass

size

and

the

shar

e of

stu

dent

s by

rac

e/et

hnic

ity,

gen

der,

age

, spe

cial

edu

cati

on s

tatu

s, f

ree

or r

educ

ed-p

rice

lunc

h st

atus

, gif

ted-

stat

us, a

nd E

ngli

sh la

ngua

ge le

arne

r st

atus

. EL

A =

Eng

lish

Lan

guag

e A

rt; F

E =

fix

ed e

ffec

ts; F

FT

= F

ram

ewor

k fo

r T

each

ing.

Coe

ffic

ient

s st

atis

tica

lly

sign

ific

ant a

t the

*10

%, *

*5%

, and

***

1% le

vels

.



17

instruction is delivered may play an important role in insulating teachers from the influence of incoming student achievement on measured per-formance. Although math instruction tends to rely more on direct instruction, ELA teachers tend to use more opportunities to coconstruct lit-eracy and reading instruction through, for exam-ple, balanced literacy approaches (such as group read alouds and small-group instruction). In the current policy environment, however, this dis-tinction between approaches to instruction across ELA and math subject areas may become less pronounced. In particular, the Common Core State Standards in mathematics (CCSS-M) reflect a key shift in mathematics instruction that is focused more on the coconstruction of knowl-edge in the classroom than on direct instruction. For example, the CCSS-M requires students to “understand and use stated assumptions, defini-tions, and previously stated results in construct-ing arguments” and to “justify their conclusions, communicate them to others, and respond to the arguments of others” (National Governors Association Center for Best Practices & Council of Chief State School Officers, 2010, pp. 6–7). With the adoption of and commensurate shift in instruction to a more constructivist approach

under the CCSS-M, measured teacher perfor-mance in mathematics may become more simi-larly influenced by student achievement over time to what we observe among ELA teachers in the MET study sample.

Does Incoming Achievement Differentially Influence Subject-Matter Generalists and

Subject-Specific Teachers?

As we have shown, incoming achievement may exert a differential influence on measured teacher performance depending on the subject a teacher teaches. To what extent, then, are teachers’ obser-vation scores sensitive to whether they teach mul-tiple subjects to the same class of students (e.g., subject-matter generalist teachers) or whether they teach one subject (math or ELA) to more than one class of students (e.g., departmentalized teachers)? Table 7 summarizes these results.10

We find that incoming achievement has no sig-nificant influence on the measured performance of ELA generalist teachers (teachers who teach ELA in addition to math and other subject areas). However, departmentalized teachers with students whose incoming achievement is 1 standard devia-tion better realize better observed performance, on

TABLE 7The Influence of Incoming Achievement, by Generalist/Departmentalized Status


Generalist Departmentalized Generalist Departmentalized

Prior achievement 0.05 (0.06) 0.15*** (0.05) −0.04 (0.07) 0.08* (0.04)Classroom characteristics X X X XTeacher FE X X X XTeacher × Year Observations 354 706 348 612R2 .0042 .0649 .0203 .1769FFT M (SD) 2.61 (0.23) 2.50 (0.37) 2.56 (0.27) 2.43 (0.36)

Note. Each column reports a separate regression. Coefficients (with robust standard errors clustered at the teacher level) reported. Generalist teachers are subject-matter generalists who taught ELA and mathematics to a single class of students; departmental-ized teachers are subject matter specialists who taught ELA or math to more than one class of students. We note that three ELA teachers and three math teachers were generalists in the 2010–2011 school year and departmentalized in the 2009–2010 school year; we treat these teachers as departmentalized for the purposes of the teacher FE models. Chow tests confirm that the process by which measured teacher performance depends on classroom composition is statistically significantly different across general-ist and departmentalized teachers for both the ELA, χ2(13) = 37.49, p value = .0003, and math, χ2(13) = 35.70, p value = .0007, samples. Classroom characteristics include class size and the share of students by race/ethnicity, gender, age, special education status, free or reduced-price lunch status, gifted-status, and English language learner status. ELA = English Language Art; FE = fixed effects; FFT = Framework for Teaching.

Coefficients statistically significant at the *10%, **5%, and ***1% levels.




18

the order of 0.15 FFT points (or 0.41 standard devi-ations). For math generalists (those who teach math in addition to ELA and other subject areas to the same class of students), incoming achievement is not significantly related to measured perfor-mance. Again, however, departmentalized math teachers realize better observed performance (0.08 FFT points, or approximately 0.22 standard devia-tions) with higher performing students.

Why might there be a systematic difference in the influence of incoming achievement for subject generalists and their subject-specific counterparts? The difference might be due to the fact that subject generalists teach fewer students as they face just one classroom of students each school day, while departmentalized teachers teach the same subject to multiple classes of students. As a result, subject generalists have more time with each student, which may provide more opportunities to adjust to students’ learning needs. However, we cannot definitively attribute these differences to general-ist versus departmentalized status, as we also find significant grade-level differences between gener-alist and departmentalized teachers (see Table 3). Indeed, 98% of generalist teachers (in both the ELA and math samples) taught either Grade 4 or Grade 5, whereas only 20% of ELA (and 18% of math) teachers in departmentalized classes taught these grades. It is possible that the observed differ-ences between classroom generalists and subject specialists are driven by differences in instruction for earlier grades compared with later grades, and thus, we are ultimately unable to empirically attri-bute these differences to class status (generalist vs. departmentalized) or grade-level differences.

Does the Nonrandom Assignment of Teachers to Classes Bias Teachers’ Measured

Performance?

As we have shown, the subject to which teach-ers were assigned, and whether or not they were assigned to teach a single class of students, had dif-ferent consequences for teachers’ measured perfor-mance. In addition to the influence of the incoming achievement of a teacher’s class, consideration must also be given to the potential influence that the nonrandom matching of teachers to classes based on fixed, unobserved teacher characteristics may exert on measured teacher performance (as formalized in Equation 2). To examine the extent

to which the systematic, nonrandom assignment of teachers to classes may bias estimates of measured teacher performance, we compare estimates from the FE approach with OLS and IV estimates, by classroom compliance status.

If the randomization held among teachers in full compliance classes, then unobserved, time-invari-ant teacher heterogeneity should be uncorrelated with incoming student achievement.11 In such cases, the OLS, IV, and FE estimates should not differ (see the appendix for a formalization of this result for the IV and FE approaches). For the ELA sample, this is what we find (see Full Comply col-umn, Panel A of Table 8). Among math teachers in full compliance classes, however, the OLS, IV, and FE estimates differ, and this difference is due to the fact that we find significant nonrandom selection of math teachers into full compliance classes (i.e., classroom composition predicted prerandomiza-tion measured teacher effectiveness). As a result, we focus our attention on the ELA teachers in par-tial compliance classes (where endogenous student sorting is most relevant) for an assessment of the extent (and direction) of bias due to nonrandom teacher assignment to classes.

Among ELA teachers in partial compliance classes, we find evidence of matching based on fixed, unobservable teacher characteristics. Specifically, the OLS (0.21 FFT points) and IV (0.17 FFT points) estimates are biased upward rel-ative to the FE estimate (0.14 FFT points). This suggests that, by not accounting for time-invariant teacher heterogeneity in the teacher-class matching process, estimates of the influence of incoming class achievement on measured teacher perfor-mance will be overstated. This result further sug-gests that higher performing students were endogenously sorted into the classes of higher per-forming teachers (as has been shown in prior work with the MET data; see, for example, Steinberg & Garrett, 2015). Therefore, the nonrandom and pos-itive assignment of teachers to classes of students based on time-invariant (and unobserved) teacher characteristics would reveal more effective teacher performance, as measured by classroom observa-tion scores, than may actually be true.

Discussion

High-stakes accountability systems aim to pro-vide more information about teacher performance



TAB

LE

8E

xam

inin

g B

ias

Due

to M

atch

ing:

Com

pari

son

of O

LS,

IV,

and

FE

Est

imat

es

Ful

l Com

ply

Par

tial

Com

ply

Non

com

ply

O

LS

(Y

2)IV

FE

OL

S (

Y2)

IVF

EO

LS

(Y

2)IV

FE

Pan

el A

: EL

A s

ampl

e

Pri

or a

chie

vem

ent

0.05

(0.

08)

0.04

(0.

10)

0.04

(0.

07)

0.21

***

(0.0

8)0.

17**

(0.

08)

0.14

** (

0.06

)0.

24**

(0.

09)

0.13

(0.

11)

0.26

* (0

.14)

IV

(F

sta

tist

ic)

0.88

***

(451

.2)

0.86

***

(321

.1)

0.97

***

(139

.6)

Ran

dom

izat

ion

bloc

ks13

513

513

5 8

9 8

9 8

942

4242

S

choo

ls10

610

610

6 7

8 7

8 7

840

4040

T

each

er ×

Yea

r O

bser

vati

ons

219

219

438

126

126

252

6767

134

R

2.2

436

.243

4.0

832

.330

1.3

285

.001

4.5

609

.551

0.0

011

F

FT

M (

SD)

2.48

(0.

35)

2.48

(0.

35)

2.49

(0.

36)

2.54

(0.

29)

2.54

(0.

29)

2.53

(0.

32)

2.49

(0.

36)

2.49

(0.

36)

2.48

(0.

35)

Pan

el B

: Mat

h sa

mpl

e

Pri

or a

chie

vem

ent

−0.

03 (

0.06

)−

0.06

(0.

08)

0.06

(0.

06)

−0.

02 (

0.06

)−

0.08

(0.

07)

0.09

(0.

08)

0.01

(0.

13)

−.0

2 (0

.13)

0.04

(0.

08)

IV

(F

sta

tist

ic)

0.84

***

(228

.2)

0.88

***

(429

.5)

0.82

***

(311

.0)

Ran

dom

izat

ion

bloc

ks10

410

410

4 9

6 9

6 9

648

4848

S

choo

ls 8

9 8

9 8

9 8

0 8

0 8

042

4242

T

each

er ×

Yea

r O

bser

vati

ons

155

155

310

138

138

276

7979

158

R

2.3

177

.316

9.1

106

.415

6.4

130

.146

4.3

381

.337

4.0

107

F

FT

M (

SD)

2.44

(0.

30)

2.44

(0.

30)

2.44

(0.

33)

2.42

(0.

33)

2.42

(0.

33)

2.42

(0.

37)

2.51

(0.

32)

2.51

(0.

32)

2.53

(0.

30)

Not

e. E

ach

colu

mn

(wit

hin

a pa

nel)

rep

orts

a s

epar

ate

regr

essi

on. F

or O

LS

est

imat

es, c

oeff

icie

nts

repo

rted

(w

ith

robu

st s

tand

ard

erro

rs c

lust

ered

at t

he s

choo

l lev

el);

for

teac

her

FE

est

imat

es,

coef

fici

ents

rep

orte

d (w

ith

robu

st s

tand

ard

erro

rs c

lust

ered

at t

he te

ache

r le

vel)

; for

IV

est

imat

es, c

oeff

icie

nts

repo

rted

(w

ith

robu

st s

tand

ard

erro

rs c

lust

ered

at t

he s

choo

l lev

el).

FE

reg

ress

ions

in

clud

e te

ache

r F

E; O

LS

and

IV

reg

ress

ions

incl

ude

dist

rict

FE

. The

sam

ple

incl

udes

teac

hers

in G

rade

s 5

to 9

dur

ing

the

2010

–201

1 sc

hool

yea

r (e

xclu

ding

Gra

de 4

teac

hers

wit

hout

twic

e-la

gged

stu

dent

ach

ieve

men

t te

st s

core

s).

Ful

l C

ompl

y re

fers

to

teac

hers

in

full

com

plia

nce

clas

sroo

ms;

Par

tial

Com

ply

refe

rs t

o te

ache

rs i

n pa

rtia

l co

mpl

ianc

e cl

assr

oom

s; a

nd N

onco

mpl

y re

fers

to te

ache

rs in

ful

l non

com

plia

nce

clas

sroo

ms

in th

e se

cond

yea

r of

the

ME

T s

tudy

. For

the

IV e

stim

ates

, Pri

or A

chie

vem

ent e

stim

ates

the

effe

ct o

f in

com

ing

clas

sroo

m a

chie

vem

ent o

n m

easu

red

teac

her

perf

orm

ance

whe

n in

stru

men

ting

wit

h th

e cl

ass’

onc

e-la

gged

pri

or a

chie

vem

ent (

e.g.

, tw

ice-

lagg

ed a

chie

vem

ent)

. All

reg

ress

ions

con

trol

for

the

foll

owin

g cl

assr

oom

cha

rac-

teri

stic

s: c

lass

siz

e an

d th

e sh

are

of s

tude

nts

by r

ace/

ethn

icit

y, g

ende

r, a

ge, s

peci

al e

duca

tion

sta

tus,

fre

e or

red

uced

-pri

ce lu

nch

stat

us, g

ifte

d-st

atus

, and

Eng

lish

lang

uage

lear

ner

stat

us. O

LS

=

ordi

nary

leas

t squ

ares

; IV

= in

stru

men

tal v

aria

bles

; FE

= f

ixed

eff

ects

; EL

A =

Eng

lish

Lan

guag

e A

rt; F

FT

= F

ram

ewor

k fo

r T

each

ing;

ME

T =

Mea

sure

s of

Eff

ecti

ve T

each

ing.

Coe

ffic

ient

s st

atis

tica

lly

sign

ific

ant a

t the

*10

%, *

*5%

, and

***

1% le

vels

.




20

for the purposes of improving classroom instruc-tion and satisfying the accountability objective of newly developed teacher evaluation systems—identifying, remediating, and, if necessary, remov-ing underperforming teachers from the classroom. However, when information about teacher perfor-mance does not reflect a teacher’s practice but rather the students to whom the teacher is assigned, such systems are at risk of misidentifying and mis-labeling teacher performance. The misidentifica-tion of teachers’ performance levels has real implications for personnel decisions and funda-mentally calls into question an evaluation sys-tem’s ability to effectively and equitably improve, reward, and sanction teachers.

In this article, we find that the incoming achievement of a teacher’s students significantly and substantively influences observation-based measures of teacher performance. Indeed, teach-ers working with higher achieving students tend to receive higher performance ratings, above and beyond that which might be attributable to aspects of teacher quality that are fixed over time. Moreover, incoming achievement matters differ-ently for teachers in different classroom settings. Specifically, incoming student achievement exerts a larger influence on the measured performance of ELA teachers (compared with math teachers) and subject-matter specialists (compared with their generalist counterparts). These results are particu-larly relevant for policy decisions around newly implemented teacher evaluation reforms, given that classroom observation scores account for the majority, and in some cases the entirety, of a teach-er’s summative evaluation rating across many state and district evaluation systems (Steinberg & Donaldson, in press).

When examining specific aspects of classroom management and instruction, we find that mea-sures of teacher performance related to the class-room’s climate and learning culture were more sensitive to student achievement than measures of a teacher’s instructional practices for ELA instruc-tion. These findings suggest that teachers who are assigned higher performing students may be inac-curately judged to be better managers of the class-room environment than they actually are. In contrast, this evidence may also indicate that teach-ers perform better when assigned higher achieving students. This uncertainty is important for policy-makers and practitioners to recognize. Moreover,

this evidence also suggests that the goals of policy reforms may be best met through the inclusion of measures that are less sensitive to the student com-position of the classroom and that focus more on specific instructional skills—such as the use of questioning and discussion techniques—to provide better signals of teacher performance.

We also find that the nonrandom process by which teachers are often assigned to classes biases estimates of teacher performance based on classroom observation scores. This result is con-sistent with prior research examining the influ-ence of nonrandom student sorting on measured teacher effectiveness based on value-added scores (Rothstein, 2009). Just as earlier evidence from the MET study demonstrates that observa-tion scores of teacher performance alone are lim-ited in their ability to identify effective teachers (Garrett & Steinberg, 2015), this work shows that the use of classroom observation scores for high-stakes decisions may be further complicated by the fact that such scores reflect, to a large extent, the students to whom teachers are assigned.

Why might incoming student achievement exert an important influence on a teacher’s mea-sured performance based on their classroom obser-vation scores? One potential explanation is that, of the observable student characteristics, test scores may incorporate information about students’ cog-nitive and noncognitive (e.g., behavior) skills. Our finding that aspects of classroom environment are more influenced by incoming student achievement than are aspects of a teacher’s instructional prac-tices supports this interpretation. Ideally, we would observe measures of student behavior from the previous year, such as disciplinary infractions and/or the number (and extent) of disciplinary conse-quences, including in-school and out-of-school suspensions. The inclusion of such behavioral measures would enable a more direct assessment of how student behavior affects a teacher’s mea-sured performance, while also distinguishing the influence of noncognitive aspects of student per-formance from cognitive achievement patterns. Yet even this could be complicated by the fact that student behavior in one context (i.e., a given group of peers matched with one teacher) may shift con-siderably when placed in a different context (i.e., a new group of peers matched with another teacher). Nonetheless, a deeper understanding of how stu-dent behavior patterns influence measures of




21

teacher performance based on classroom observa-tion, independent of incoming student achieve-ment on state assessments, is an important area for further research.

To account for the influence that classroom composition has on teachers’ observation scores, researchers have recommended that policymak-ers and district leaders adjust observation scores based on student demographic characteristics (Whitehurst et al., 2014). If higher achieving stu-dents are more easily instructed, then a teacher’s observation scores will, in part, reflect these eas-ier-to-teach students. Ultimately, we are unable to definitively determine whether the incoming achievement effect reflects bias in observation scores due to the composition of a teacher’s class or whether teachers are systematically higher performing when assigned to higher achieving students. However, our findings suggest that even if teachers’ observation scores were adjusted to account for observed differences across class-rooms—both demographic and, particularly, achievement characteristics—unmeasured dif-ferences due to the nonrandom assignment of teachers to classes based on fixed, unobserved teacher characteristics will continue to bias esti-mates of teacher performance based on observa-tion scores. Unless observation scores are generated using multiple years of teacher data (and/or across multiple classes within the same year) to adjust for the nonrandom assignment of teachers to classes based on fixed teacher charac-teristics, annual performance evaluation that relies heavily on observation-based measures will likely mischaracterize teacher effectiveness. We believe this is particularly noteworthy for policymakers and district leaders to consider as they develop and revise teacher evaluation sys-tems and make high-stakes decisions about teachers based, in large measure, on classroom observation scores.

Appendix

Instrumental Variable (IV) Strategy for Uncovering Unobserved Matching Bias

The proposed IV approach uses the twice-lagged achievement of students in teacher j’s classroom c during the 2010–2011 school year (i.e., student achievement from the 2008–2009 school year, denoted as PriorAchievec,t−1) as an

instrument for the once-lagged achievement (PriorAchievect) to estimate the effect of incom-ing achievement on teacher j’s measured perfor-mance in year t. To formalize an assessment of nonrandom matching bias due to unobserved teacher characteristics, we begin by writing Equation 3 conditional on the IV (PriorAchievec,t−1) as follows:

E Perform PriorAchieve

E PriorAchieve Prio

-1

1

jct c,t

ct

X|

|

,( )= β rrAchieve

E PriorAchieve

-1

-1

c,t

jct c,t

( )( )+ µ | ,

(A1)

where X contains other observable characteristics of students in teacher j’s classroom in year t as in Equation 3, and the teacher fixed effect (FE) has been subsumed into µjct, such that, µjct = θj + εjct. To obtain an estimate of the quantity of interest (β1 ), we rewrite Equation A1 as follows:

β1

1=( )−cov Perform PriorAchieve

cov PriorAchieve Priojct c t

ct

,

,,

rrAchieve

cov PriorAchieve

cov PriorAchiev

c t

jct c t

,

,,

−

−

( )

−( )

1

1µ

ee PriorAchieve

cov PriorAchieve

ct c t

IV

j jct c t

,

,

,

,

−

−

( )

= −+

1

1βθ ε

(( )( )

= −

−cov ct c t

IV

PriorAchieve PriorAchieve

Nonrandom M

,

(

, 1

β aatching

Time-Variant Shocks).+

(A2)

If all teachers randomly assigned to classrooms (within randomization block) in Year 2 of the Measures of Effective Teaching (MET) study remained in full compliance classrooms, then cov(µjct, PriorAchievec,t−1) = 0 and β IV would pro-vide unbiased estimates of the effect of incoming achievement on measured teacher performance. However, violations to this exclusion restriction could have resulted (a) from either partial compli-ance or full noncompliance with randomization due to endogenous student mobility or schools simply ignoring the teacher random assignment to classes or (b) if there were systematic differences, within randomization block, between teachers who remained in full compliance classrooms and those who experienced partial (or full) noncompliance (e.g., selection into full compliance classrooms within randomization block). In either case, the IV estimates will be biased (such that β β

IV ≠ 1 ), and the direction of the bias will depend on the sign of




22

the numerator (the covariance between unobserved time-invariant and time-varying teacher character-istics and twice-lagged classroom achievement) and the denominator (the covariance between once- and twice-lagged classroom achievement, which will be positive). For example, if higher per-forming teachers are systematically matched to higher achieving classes (assortative matching), the nonrandom matching of teachers to classes will upwardly bias the IV estimate of incoming achieve-ment on measured teacher performance. In addi-tion, time-variant shocks to teacher-class matching will capture the sorting of teachers to classes that may result, for example, from lagged shocks (posi-tive or negative) to teacher performance that are independent of fixed teacher quality, and will intro-duce additional bias into the IV estimate.

In contrast to the IV approach, the FE approach conditions on time-invariant teacher heterogeneity (θj) and therefore controls for nonrandom matching (e.g., γ jc

match) that is a function of unobserved, fixed

teacher characteristics. We again note that any cor-relation between time-varying unobserved teacher effects and prior class achievement may still bias the FE estimator. We derive the FE estimator by rewriting Equation 3 as follows:

β

ε

1 =( )

( )

−

cov Perform PriorAchieve

var PriorAchieve

cov

��

�

�

jc c

c

,

jjc c

c

FE

jc

,

,

PriorAchieve

var PriorAchieve

cov Prior

�

�

��

( )( )

= −βε AAchieve

var PriorAchieve

Time-Variant Shocks

�

�

�

c

c

FE

( )( )

= −β ( )..

(A3)

where (i) Perform Perform Perform

jc jct jc t= − −( ;), 1 (ii)

PriorAchieve PriorAchieve PriorAchieve

c ct c t= − −( );, 1 and (iii) ε ε εjc jct jc t= −( )−, .1

Putting together the IV and FE results from Equations A2 and A3, respectively, we get

β

β

IV

FE

−

= −

(Nonrandom Matching

+ Time-Variant Shocks)

Time-VVariant Shocks

Nonrandom Matching

( )= ( )−

=

β

β

IV

FE

(A4)

If, as shown in Equation A4, matching on fixed teacher characteristics is operative, then β β

IV FE≠ . We assess the presence of nonrandom matching bias among teachers in the IV sample (i.e., teachers in Grades 5–9 with twice-lagged student test score data). Note that we omit Grade 4 teachers from the IV sample because twice-lagged student test scores (test scores from Grade 2) are unavailable, as high-stakes state testing begins in Grade 3. We further note that this formalization relies on the assumption that the time-variant shocks captured through the two approaches are approximately equal (and can thus cancel out) as we draw on information over two consecutive school years. Over a longer time frame, this assumption could be more problematic.

As the IV sample includes teachers in class-rooms with endogenous student sorting (partial compliance) and those nonrandomly assigned to classes (full noncompliance), along with teachers in full compliance classes, we compare the IV and FE results by classroom compliance status. If, among full compliance classes, the randomiza-tion held and classroom composition variables were distributed randomly across teachers, then the IV results should not differ from the FE results, as randomization implies (from Equation A1) that cov(µjct, PriorAchievec,t−1) = 0. Among partial and full noncompliance classrooms, any difference between the IV and FE estimates will reveal the extent (and direction) of the nonran-dom matching of teachers to classes.

Acknowledgments

The authors thank Matthew Kraft, Rebecca Maynard, John Papay, Jonah Rockoff, Brian Rowan, Andrew Wayne, and participants at the Measures of Effective Teaching (MET) Early Career Scholars Grantee Meeting for helpful comments and suggestions, and Jennifer Moore for editorial assistance.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of inter-est with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following finan-cial support for the research, authorship, and/or publi-cation of this article: Funding from the National Academy of Education MET Early Career Grantee program is gratefully acknowledged.




23

Notes

1. The randomization was conducted within ran-domization blocks that consisted of grade-subject com-binations of classes within a school, where at least two teachers must have been teaching in the same grade-subject combination (e.g., sixth grade mathematics) to be included in the randomization sample. We fur-ther note that the class rosters were established by schools, and teachers were randomly assigned to these rosters by the Measures of Effective Teaching (MET) researchers. Students were not randomly assigned to specific teachers nor were students randomly assigned to rosters. Please see White and Rowan (2012) for more detail on the Year 2 randomization sample.

2. The MET data do not allow us to observe teachers who did not participate in the MET study. Furthermore, the MET data provide classroom com-position variables only for teachers’ actual classes and not for the class rosters to which teachers were randomly assigned. Classroom composition variables for the randomly assigned rosters are unavailable as demographic and achievement data for students who ended up with a non-MET teacher (either in the same school or in another district school) after randomiza-tion are not observed in the data. Moreover, we do not observe the proportion of teachers at a subject-grade-school level that participated in the MET study.

3. Of the six MET districts, we include five dis-tricts because one district did not provide data on stu-dents’ free or reduced-price lunch status.

4. Our final teacher sample is realized as follows: Of the 1,887 teachers active during both Year 1 (2009–2010) and Year 2 (2010–2011) of the MET study, we include a subset of those teachers who taught math and/or reading in Grades 4 to 9, reducing the teacher sample to 1,629 teachers. Of these 1,629 teachers, we were able to match 1,434 teachers to data indicating whether they were part of the Year 2 randomization sample, determining that 424 did not participate. As classroom observation scores (discussed below) were unavailable in the second year of the MET study for these 424 teachers, the teacher sample was further reduced to 1,010 teachers, all of whom participated in the Year 2 randomization sample. Of these 1,010 teachers, 29 did not have classroom observation scores in the same subject area (i.e., math and/or English Language Arts [ELA]) across both study years, reduc-ing the analytic sample to 981 teachers. Of these 981 teachers, we drop an additional 176 teachers due to missing classroom composition variables, as one school district did not report data on students’ free or reduced-price lunch status.

5. We discuss our approach for assessing the fidel-ity of the randomization in the “Empirical Approach” section. We note that we are able to generate valid

causal estimates of the effect of incoming achieve-ment on measured teacher performance for the ELA, although not the math, sample of teachers, as prior achievement and classroom composition (net of ran-domization block fixed effects) do not jointly predict a teacher’s prerandomization observation scores among the subset of randomization blocks containing only full compliance classrooms.

6. The full Framework for Teaching (FFT) instru-ment was not used in the MET study, as not all scoring areas are amenable to observation through video data. Please see Garrett and Steinberg (2015) for a detailed description of the FFT domains and components included in a teacher’s FFT observation score.

7. By randomly selecting one of a departmentalized teacher’s class sections from the first year of the MET study, his or her observation scores will be based on two classroom observations rather than four, as with generalist teachers. As a consequence, additional noise will be introduced into estimates of classroom context on measured performance for this group of teachers.

8. The joint null hypothesis may be expressed as H0: ςX = αPriorAchieve = 0, where ς represents the param-eters associated with the classroom composition variables contained in X as in Equation 4, and all sig-nificance tests are adjusted for randomization block fixed effects. The exogeneity condition (as it relates to incoming achievement) may be expressed as

E Perform PriorAchieve

E Perform

j t ct ct r

j t r

X,

,

|

|

,−

−

( )(=

1

1

, υ

υ )).9. The four practices contained within the classroom

environment domain (Domain 2 of the Danielson FFT) are (2a) creating an environment of respect and rapport between the teacher and the students and among the stu-dents; (2b) establishing a culture for learning, including the norms that govern interactions among individuals in the classroom and the physical appearance of the class; (2c) managing classroom procedures, including the rou-tines and procedures a teacher establishes with the class for how to move from a small to a large group or how the students are expected to line up at the end of the day; and (2d) managing student behavior, including the establish-ment (through the coconstruction of standards of con-duct among the teacher and students) and maintenance of an orderly classroom environment. The instruction domain (Domain 3 of the Danielson FFT) captures the following four practices: (3a) communicating learning expectations to students, providing clear directions for classroom activities, and clearly presenting concepts and information to students; (3b) using questioning and dis-cussion techniques as instructional strategies to deepen student understanding; (3c) engaging students in learn-ing by incorporating activities and assignments into the




24

class that ask students to think critically, to make con-nections, to formulate and test hypotheses, and to draw conclusions; and (3d) using assessment of student learn-ing to drive instruction through monitoring of student understanding, offering feedback to students, and (when necessary) making adjustments to the lesson. We further note that Component 3b captures the only instructional strategies specifically referred to in Danielson’s FFT. See Garrett and Steinberg (2015) for more detail on each of the eight Danielson FFT components.

10. Chow tests confirm that the process by which measured teacher performance depends on classroom composition is statistically significantly different across generalist and departmentalized teachers for both the ELA, χ2(13) = 37.49, p value = .0003, and math, χ2(13) = 35.70, p value = .0007, samples.

11. We test for whether the randomization held among full compliance classes by estimating Equation 4 for both the ELA and math instrumental variables (IV) samples. For the 219 ELA teachers in full compliance classes, we cannot reject the joint null hypothesis from this covari-ate balance test (F = 0.70, p value = .751). For the 155 math teachers in full compliance classes, however, we do reject the joint null hypothesis (F = 1.92, p value = .039), suggesting that there was nonrandom selection of math teachers into full compliance classrooms.

References

Aaronson, D., Barrow, L., & Sander, W. (2007). Teachers and student achievement in the Chicago public high schools. Journal of Labor Economics, 25, 95–135.

Chaplin, D., Gill, B., Thompkins, A., & Miller, H. (2014). Professional practice, student surveys, and value-added: Multiple measures of teacher effec-tiveness in the Pittsburgh Public Schools (REL 2014–024). Washington, DC: U.S. Department of Education, Institute of Education Sciences, National Center for Education Evaluation and Regional Assistance, Regional Educational Laboratory Mid-Atlantic. Retrieved from http://ies .ed.gov/ncee/edlabs

Clotfelter, C., Ladd, H., & Vigdor, J. (2006). Teacher-student matching and the assessment of teacher effectiveness. The Journal of Human Resources, 41, 778–820.

Dee, T., & Wyckoff, J. (2013). Incentives, selection, and teacher performance: Evidence from IMPACT. Journal of Policy Analysis and Management, 34, 267–297.

Garrett, R., & Steinberg, M. (2015). Examining teacher effectiveness using classroom observation scores: Evidence from the randomization of teach-ers to students. Educational Evaluation and Policy Analysis, 37, 224–242.

Glazerman, S., Loeb, S., Goldhaber, D., Staiger, D., Raudenbush, S., & Whitehurst, G. (2010). Evaluating teachers: The important role of value-added. Washington, DC: The Brookings Institution.

Goldhaber, D. D. (2002). The mystery of good teach-ing. Education Next, 2(1), 50–55.

Goldhaber, D. D., & Hansen, M. (2010). Is it just a bad class? Assessing the stability of measured teacher performance (CEDR Working Paper No. 2010-3). Seattle: University of Washington.

Goldring, E., Grissom, J. A., Rubin, M., Neumerski, C. M., Cannata, M., Drake, T., & Schuermann, P. (2015). Make room value added: Principals’ human capital decisions and the emergence of teacher obser-vation data. Educational Researcher, 44(2), 96–104.

Jackson, C. K., & Bruegmann, E. (2009). Teaching students and teaching each other: The importance of peer learning for teachers. American Economic Journal: Applied Economics, 1(4), 85–108.

Kalogrides, D., & Loeb, S. (2013). Different teachers, dif-ferent peers: The magnitude of student sorting within schools. Educational Researcher, 42, 304–316.

Kalogrides, D., Loeb, S., & Beteille, T. (2013). Systemic sorting: Teacher characteristics and class assignments. Sociology of Education, 86, 103–123.

Kane, T., McCaffrey, D., Miller, T., & Staiger, D. (2013). Have we identified effective teach-ers? Validating measures of effective teaching using random assignment (Bill & Melinda Gates Foundation MET Project research paper). Seattle, WA: Bill & Melinda Gates Foundation.

Kraft, M. A., & Papay, J. P. (2014). Can professional environments in schools promote teacher develop-ment? Explaining heterogeneity in returns to teach-ing experience. Educational Evaluation and Policy Analysis, 36, 476–500.

Loeb, S., Beteille, T., & Kalogrides, D. (2012). Effective schools: Teacher hiring, assignment, development and retention. Education Finance and Policy, 7, 269–304.

McCaffrey, D. F., Sass, T. R., Lockwood, J. R., & Mihaly, K. (2009). The intertemporal variability of teacher effect estimates. Education Finance and Policy, 4, 572–606.

Mihaly, K., McCaffrey, D. F., Staiger, D., & Lockwood, J. R. (2013). A composite estima-tor of effective teaching (Technical report for the Measures of Effective Teaching project). Seattle, WA: Bill & Melinda Gates Foundation.

Monk, D. H. (1987). Assigning elementary pupils to their teachers. The Elementary School Journal, 88, 166–187.

National Governors Association Center for Best Practices & Council of Chief State School Officers. (2010). Common Core State Standards (Mathematics). Washington, DC: Author.


http://ies.ed.gov/ncee/edlabs

http://ies.ed.gov/ncee/edlabs



25

Papay, J. (2011). Different tests, different answers: The stability of teacher value-added estimates across outcome measures. American Education Research Journal, 48, 163–193.

Rivkin, S. G., Hanushek, E. A., & Kain, J. F. (2005). Teachers, schools and academic achievement. Econometrica, 73, 417–458.

Rockoff, J. E. (2004). The impact of individual teach-ers on student achievement: Evidence from panel data. American Economic Review, 94, 247–252.

Rothstein, J. (2009). Student sorting and bias in value-added estimation: Selection on observables and unobservables. Education Finance and Policy, 4, 537–571.

Rothstein, J. (2010). Teacher quality in educational production: Tracking, decay, and student achieve-ment. The Quarterly Journal of Economics, 125, 175–214.

Sartain, L., & Steinberg, M. (in press). Teachers’ labor market responses to performance evaluation reform: Experimental evidence from Chicago pub-lic schools. Journal of Human Resources.

Sass, T., Hannaway, J., Xu, Z., Figlio, D., & Feng, L. (2012). Value added of teachers in high-poverty schools and lower-poverty schools. Journal of Urban Economics, 72, 104–122.

Steinberg, M., & Donaldson, M. (in press). The new educational accountability: Understanding the landscape of teacher evaluation in the post-NCLB era. Education Finance and Policy.

Steinberg, M., & Garrett, R. (2015). Teacher effec-tiveness, peer quality and student achievement: Evidence from student sorting following teacher random assignment (Working paper).

Steinberg, M., & Sartain, L. (2015). Does teacher evaluation improve school performance? Experimental evidence from Chicago’s Excellence in Teaching Project. Education Finance and Policy, 10, 535–572.

Taylor, E. S., & Tyler, J. H. (2012). The effect of evaluation on teacher performance. American Economic Review, 102, 3628–3651.

Watson, J. G., Kraemer, S. B., & Thorn, C. A. (2009). The other 69 percent. Washington, DC: Center for Educator Compensation Reform, U.S. Department of Education, Office of Elementary and Secondary Education.

Weisberg, D., Sexton, S., Mulhern, J., & Keeling, D. (2009). The widget effect: Our national failure to acknowledge and act on differences in teacher effec-tiveness. Brooklyn, NY: The New Teacher Project.

White, M., & Rowan, B. (2012). A user guide to the “core study” data files available to MET early career grantees. Ann Arbor: Inter-University Consortium for Political and Social Research, The University of Michigan.

Whitehurst, G., Chingos, M., & Lindquist, K. (2014). Evaluating teachers with classroom observations: Lessons learned in four districts. Washington, DC: Brown Center on Education Policy at Brookings.

Authors

MATTHEW P. STEINBERG is an assistant professor in the Graduate School of Education at the University of Pennsylvania. His research focuses on teacher eval-uation and human capital, urban school reform, school finance, and school discipline policy and safety.

RACHEL GARRETT is a senior researcher at American Institutes for Research. Her research focuses on educator effectiveness, English language learners, and mathematics instruction.

Manuscript received February 5, 2015First revision received June 8, 2015

Second revision received August 3, 2015Third revision received September 9, 2015

Accepted October 16, 2015