Educational Evaluation and Policy Analysis
Month 201X, Vol. XX, No. X, pp. 1–25
DOI: 10.3102/0162373715616249
© 2016 AERA. http://eepa.aera.net
Introduction
Despite the intense focus on the use of student test scores to gauge teacher performance, the majority of our nation’s teachers receive annual evaluation ratings based primarily on observations of their classroom practice (Steinberg & Donaldson, in press). These observation-based performance measures aim to capture teachers’ instructional practice and their ability to structure and maintain high-functioning classroom environments. However, little is known about the ways that classroom context—the settings in which teachers work and the students that they teach—shapes measures of teacher effectiveness based on classroom observations. Given the widespread adoption of high-stakes evaluation systems that rely heavily on classroom observations, and in light of the way stakeholders such as principals may use these scores to inform hiring and retention decisions (Sartain & Steinberg, in press), it is critical that we have a clearer understanding of how the composition of teachers’ classrooms influences their observation scores.
Interest in understanding the role of school context in teachers’ professional development has recently emerged. For example, Kraft and Papay (2014) find that school context—including supportive and collaborative environments, effective principal instructional leadership, and positive school culture—can facilitate greater returns to teacher experience, or the extent to which teachers become more effective over time. Other studies have pointed to the important role that school composition plays in shaping teacher effectiveness. In one study, Loeb, Beteille, and Kalogrides (2012) found that teachers improve more rapidly in more effective schools—those
Classroom Composition and Measured Teacher Performance: What Do Teacher Observation Scores Really Measure?
Matthew P. Steinberg
University of Pennsylvania
Rachel Garrett
American Institutes for Research
As states and districts implement more rigorous teacher evaluation systems, measures of teacher performance are increasingly being used to support instruction and inform retention decisions. Classroom observations take a central role in these systems, accounting for the majority of teacher ratings upon which accountability decisions are based. Using data from the Measures of Effective Teaching study, we explore the extent to which classroom composition influences measured teacher performance based on classroom observation scores. The context in which teachers work—most notably, the incoming academic performance of their students—plays a critical role in determining teachers’ measured performance. Furthermore, the intentional sorting of teachers to students has a significant influence on measured performance. Implications for high-stakes teacher accountability policies are discussed.
Keywords: classroom composition, classroom observation scores, teacher effectiveness, teacher evaluation
Downloaded from http://eepa.aera.net at UNIV OF PENNSYLVANIA on January 12, 2016
more able to raise the test-score performance of their students. In a second, Jackson and Bruegmann (2009) found that students of less-experienced teachers realize larger achievement gains when their teachers have higher performing colleagues. Sass, Hannaway, Xu, Figlio, and Feng (2012) show that teachers improve more quickly in schools serving fewer low-income students.
Other work suggests that the characteristics of a teacher’s students play a meaningful role in determining teacher ratings. New evidence finds that classroom observation scores tend to be lower among teachers whose students are more disadvantaged and have lower incoming achievement (Chaplin, Gill, Thompkins, & Miller, 2014; Whitehurst, Chingos, & Lindquist, 2014). These lower scores may reflect the nonrandom sorting of teachers to classes of students (Monk, 1987; Steinberg & Garrett, 2015) and the tendency of schools to assign novice, less-effective teachers to more disadvantaged students, while more experienced and more effective teachers are more likely to work with higher achieving students (Clotfelter, Ladd, & Vigdor, 2006; Kalogrides & Loeb, 2013; Kalogrides, Loeb, & Beteille, 2013; Monk, 1987).
Furthermore, evidence suggests that observation-based measures of a teacher’s performance tend to be weakly correlated over time. Indeed, research using data from the Measures of Effective Teaching (MET) study finds that a teacher’s observation scores are only moderately correlated across 2 consecutive years (Garrett & Steinberg, 2015), on the order of the year-to-year, within-teacher correlation of student test-score-based measures of teacher performance found in other settings (Glazerman et al., 2010; Goldhaber & Hansen, 2010; McCaffrey, Sass, Lockwood, & Mihaly, 2009). These findings suggest not only that a teacher’s classroom practice may vary over time but also that it may be influenced by classroom composition—the students and classes to which teachers are assigned.
As researchers and policymakers begin to better understand how different measures may differentially reflect distinct aspects of a teacher’s instructional performance, we need to further place these relationships into context by understanding how the composition of the measures may be sensitive to the composition of the class itself. Therefore, a closer investigation is warranted to more fully understand how classroom composition may influence observation-based measures of teacher performance, and whether these influences differentially affect teachers in different settings (e.g., math or reading subject areas, classroom teachers vs. subject specialists).
In this article, we focus on understanding the relationship between the characteristics of a teacher’s class and a teacher’s performance, as measured by observations of his or her classroom instruction and management. Given existing evidence that teachers who are judged to be more effective tend to be assigned higher achieving students, particular attention is paid to the relationship between students’ incoming achievement and measured teacher performance based on the Danielson Framework for Teaching (FFT) observation protocol. We leverage variation in student composition over time to estimate the influence on measured teacher effectiveness, allowing us to control for fixed characteristics of teachers (e.g., teacher quality endowments) and isolate the idiosyncratic influence of incoming student achievement. We further assess whether these relationships are more or less sensitive when considering specific aspects of classroom management and instruction and the extent to which the influence of incoming achievement differs for classroom generalists compared with subject-matter specialists. Moreover, we address the nonrandom sorting of teachers to classes that has been shown elsewhere to bias student achievement-based measures of teacher effectiveness (Rothstein, 2009, 2010). To do so, we take advantage of a hallmark of the MET study project—the randomization of teachers to classes that occurred just prior to the beginning of the MET study’s second year (i.e., 2010–2011 school year). Specifically, we exploit the randomization of teachers to classes within randomization blocks (i.e., grade-subject combinations of classes within a school) to generate experimental estimates of the effect of incoming achievement on measured teacher performance. Finally, we implement an instrumental variables (IV) strategy to formalize the nature of the bias in observation scores that is due to the nonrandom sorting of teachers to classes.
Then, we compare estimates generated by the teacher fixed-effects (FE) approach with ordinary least squares (OLS) and
IV estimates to lend insight into whether the nonrandom matching of teachers to classes is a concern in the context of classroom observation-based measures of teacher performance.
We find that teacher performance, based on classroom observation, is significantly influenced by the context in which teachers work. In particular, students’ prior year (i.e., incoming) achievement positively affects a teacher’s measured performance captured by the FFT. Students’ incoming achievement is more strongly associated with observed English Language Arts (ELA) instruction than math instruction. Although the influence of incoming achievement is concentrated among subject specialists, who work with multiple classes of students within a single school day, compared with classroom generalist teachers who work with the same classroom of students all day, every day, we are unable to uniquely attribute these differences to either departmentalized (compared with generalist) class status or grade-level differences. Furthermore, incoming student achievement significantly influences aspects of teacher performance that capture teachers’ interactions with their students, whereas core instructional practices are not prone to this influence. Finally, we offer evidence that the intentional, nonrandom matching of teachers to classrooms due to unobserved time-invariant teacher effects biases the relationship between incoming student achievement and observed classroom performance.
In the context of newly implemented teacher evaluation systems that incorporate multiple measures of teacher performance (including value-added measures [VAMs], student learning objectives [SLOs], and observation scores) into teachers’ summative evaluation ratings, recent evidence has indicated that principals’ human capital decisions around teacher effectiveness are based predominantly on their observations of teacher practice (Goldring et al., 2015). Moreover, some have argued that classroom observations provide a more transparent and objective means of evaluating and measuring teacher effectiveness and, in doing so, avoid many of the validity concerns that tend to be associated with VAMs (Goldring et al., 2015). However, our findings indicate that observation scores as measures of teacher effectiveness present similar validity concerns, as these scores reflect both the systematic, nonrandom assignment of teachers to classes and the incoming achievement of a teacher’s students. Therefore, our results suggest that greater caution must be taken by policymakers and practitioners when making high-stakes personnel decisions based largely (if not entirely) on teachers’ classroom observation scores.
In the next section, we discuss the evolving landscape of teacher evaluation reform and the evidence on measures of teacher effectiveness being incorporated into new evaluation systems. Next, we specify a model relating teacher and classroom characteristics to observed measures of teacher performance. We then describe the data and empirical approach used to capture the influence of incoming student achievement on teacher performance. Finally, we present our findings and discuss their implications for high-stakes personnel decisions made in the context of newly developed and implemented teacher evaluation systems.
Teacher Evaluation Reform Landscape
In the wake of the federal Race to the Top (RTTT) initiative, education policy efforts have recently led to dramatic revisions to teacher evaluation systems and substantial effort toward improving measures of teacher effectiveness. At the state and local levels, policymakers are revising and implementing new evaluation systems to incorporate more rigorous performance evaluations through the use of multiple measures of teacher performance—value-added measures (VAM) and standards-based classroom observations, among them—and multiple teacher rating categories. By the start of the 2014–2015 school year, 78% of all states and 85% of the largest 25 school districts and the District of Columbia had revised and implemented teacher evaluation reforms (Steinberg & Donaldson, in press).
Revisions to teacher evaluation systems and the measures upon which teacher evaluation is based have been spurred by evidence in support of the critical role teachers play in improving student outcomes (Goldhaber, 2002; Rivkin, Hanushek, & Kain, 2005; Rockoff, 2004), while also acknowledging both the substantial within-school heterogeneity in teacher effectiveness (Aaronson, Barrow, & Sander, 2007; Rivkin et al., 2005) and
the inability of traditional teacher-evaluation systems to differentiate among teachers in their contribution to student learning (Weisberg, Sexton, Mulhern, & Keeling, 2009). Efforts to reform teacher evaluation are underscored by two fundamental goals of personnel evaluation in education: to systematically support improvements in instructional quality and to identify low-performing teachers for remediation or dismissal. The latter goal aims to inject more rigorous accountability into teacher evaluation systems that historically did a poor job of identifying and removing underperforming teachers (Weisberg et al., 2009). Early evidence on the effectiveness of teacher evaluation reforms for improving teaching and learning and identifying and removing the lowest performing teachers is promising (Dee & Wyckoff, 2013; Sartain & Steinberg, in press; Steinberg & Sartain, 2015; Taylor & Tyler, 2012), suggesting that more rigorous teacher evaluation can meaningfully improve teacher quality and student achievement while satisfying the accountability function of newly developed evaluation systems.
As states and districts reform teacher evaluation and begin to make high-stakes personnel decisions, increasing attention is being given to the measures being incorporated into new evaluation systems. Eighty percent of both states and the largest districts implementing new evaluation systems are using one or more measures of teacher performance based on student test score data; these include VAM and student growth percentiles (SGP; Steinberg & Donaldson, in press). However, evidence suggests that there is substantial within-teacher variability in a teacher’s annual VAM scores (Goldhaber & Hansen, 2010; McCaffrey et al., 2009) and that this variability depends on the nature of the student assessment being used (Papay, 2011). This year-to-year variability constrains the efficacy of new evaluation systems to serve the accountability function of identifying effective teachers (Rothstein, 2009, 2010). Moreover, VAM scores are limited in their ability to inform teachers’ instructional practice, as these measures are generated and provided to teachers after the conclusion of the school year, and offer no guidance for how teachers may improve their instructional practice.
Although educators and scholars alike point to the limitations implicit in test-score based measures of teacher effectiveness, classroom observations of teacher practice—designed to provide formative feedback to teachers to improve their instructional practice as well as to more carefully evaluate instructional performance—offer promise as a complementary measure for differentiating teacher performance (Kane, McCaffrey, Miller, & Staiger, 2013). The role of classroom observation in measuring teacher performance is particularly notable given that observation-based scores constitute the majority, and often the entirety, of summative evaluation scores (Steinberg & Donaldson, in press), with nearly 70% of teachers nationwide teaching in nontested grades and subjects where student-test-score data—on which VAM scores are based—are unavailable (Watson, Kraemer, & Thorn, 2009). As noted in the introduction, however, new evidence suggests that observational measures of teacher practice alone are unable to identify teacher effectiveness (Garrett & Steinberg, 2015). Therefore, the reliance on observation-based measures of teacher performance raises concerns about the ability of newly implemented evaluation systems to both accurately capture teacher performance and satisfy the accountability goal of teacher evaluation reform.
Production of Teacher Performance
How teachers perform in the classroom depends on a number of factors. Consider the following model where measured teacher performance is a function of a teacher’s endowed instructional ability and the student composition of his or her classroom:
$$\mathrm{Perform}_{jct} = f(\theta_j, X_{ct}, \varepsilon_{jct}). \tag{1}$$
In Equation 1, $\mathrm{Perform}_{jct}$ is teacher $j$’s measured performance, based on his or her classroom observation scores, in classroom $c$ in school year $t$; $\theta_j$ captures the stable, time-invariant aspects of instructional skill (e.g., teacher quality) that a teacher brings to the classroom, which include a teacher’s ability to design lessons, provide instruction based on those lessons, and design and implement classroom management strategies to maximize the efficacy of instructional lessons within a classroom’s idiosyncratic environment. The endowment of teacher $j$’s instructional skills (captured by $\theta_j$) is considered independent of the
students in his or her class, such that endowed teacher quality is not influenced by the classroom ($c$) to which the teacher is assigned. The variable $X_{ct}$ represents the characteristics of students in classroom $c$ to whom a teacher is assigned in school year $t$; such characteristics may vary across years and likely influence how teachers perform from one year (and class) to the next. The variable $\varepsilon_{jct}$ is an idiosyncratic error term capturing all other inputs to measured teacher performance (e.g., curriculum, teacher policy, observer/rater quality).
Just as the composition of teacher $j$’s classroom may influence measured performance, teachers may also alter their instruction based on the composition of students in their classes. That is, a teacher’s instruction and classroom management may differentially respond to classroom composition and the students to whom he or she is assigned. We can further decompose the variable $\varepsilon_{jct}$ into an aggregate match effect, $\gamma_{jc}$, and all other inputs to measured teacher performance, $\mu_{jct}$. We can then rewrite Equation 1 as follows:
$$\mathrm{Perform}_{jct} = f(\theta_j, X_{ct}, \gamma_{jc}, \mu_{jct}). \tag{2}$$
This aggregate match effect ($\gamma_{jc}$) accounts for processes that occur both before teachers enter the classroom and those that occur during the course of teachers’ interactions with their classes. Specifically, the match effect captures (a) the process that assigns teachers to classrooms, which occurs prior to measured teacher performance and has been shown to incorporate historical teacher and student performance into the assignment decision (Clotfelter et al., 2006; Kalogrides & Loeb, 2013; Kalogrides et al., 2013; Monk, 1987; Steinberg & Garrett, 2015) and (b) the extent to which teacher $j$’s instructional and classroom management skills interact with and are more (or less) effective with a particular mix of students in classroom $c$.
In light of evidence that teachers with higher observation scores tend to work with higher achieving students, this article extends the existing literature on the influence that classroom composition—in particular, incoming student achievement—has on measured teacher effectiveness. We investigate whether the composition of students (described by the variable $X_{ct}$) to whom a teacher is assigned accounts for variation in measured classroom performance, beyond what is attributable to the teacher alone ($\theta_j$). We explore this relationship by using consecutive years of data to examine how incoming achievement influences teachers’ observation scores on the most widely used measure of instructional performance, the Danielson FFT. This teacher fixed-effects approach, described in more detail in the following sections, allows us to account for the contribution of a teacher’s endowed ability to his or her observation scores.
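A small simulation can illustrate the logic of the within-teacher comparison. The sketch below is not the estimation code used in this article: it assumes an additive, linear form for f(·), exactly two years of data per teacher, and made-up parameter values (`beta`, `sorting`, and the noise scales), and every function name is hypothetical. When teachers with higher fixed quality are sorted to higher achieving classes, a pooled regression slope absorbs part of the teacher effect, whereas demeaning within teacher removes it.

```python
import numpy as np


def simulate_panel(n_teachers=500, beta=0.10, sorting=0.5, seed=0):
    """Simulate two years of scores per teacher; all parameter values are hypothetical."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(0.0, 1.0, n_teachers)  # fixed teacher quality (theta_j)
    # Incoming achievement (X_ct): partly sorted on theta, partly year-specific.
    x = sorting * theta[:, None] + rng.normal(0.0, 1.0, (n_teachers, 2))
    noise = rng.normal(0.0, 0.5, (n_teachers, 2))  # all other inputs
    score = theta[:, None] + beta * x + noise  # assumed additive form of f(.)
    return x, score


def pooled_ols_slope(x, y):
    """Slope from regressing y on x, pooling all teacher-year observations."""
    xd = x - x.mean()
    yd = y - y.mean()
    return float((xd * yd).sum() / (xd ** 2).sum())


def teacher_fe_slope(x, y):
    """Slope after subtracting each teacher's own mean (the within estimator)."""
    xd = x - x.mean(axis=1, keepdims=True)
    yd = y - y.mean(axis=1, keepdims=True)
    return float((xd * yd).sum() / (xd ** 2).sum())


x, score = simulate_panel()
ols = pooled_ols_slope(x, score)
fe = teacher_fe_slope(x, score)
print(f"pooled OLS slope: {ols:.3f}")
print(f"teacher FE slope: {fe:.3f}")
```

Under these assumed values the pooled slope sits well above the simulated `beta`, while the within-teacher slope stays close to it. The gap comes entirely from the sorting term, and it disappears when `sorting=0`.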
As formalized in Equation 2, we recognize that the purposive sorting of teachers to students may exert its own, independent influence on measured teacher performance, thereby confounding our ability to uniquely capture the influence of classroom composition on teacher effectiveness. We, therefore, further test the relationship between incoming achievement and observed measures of teacher performance by leveraging the random assignment of teachers to classes, a prominent feature of the MET study. In the second year (2010–2011) of the MET study, a subset of teachers was randomly assigned to classes of students.1 In principle, the random assignment of teachers to classes disrupts the assortative matching process that can confound measures of teacher effectiveness (Clotfelter et al., 2006; Garrett & Steinberg, 2015; Rothstein, 2009, 2010; Whitehurst et al., 2014). However, an unintended consequence of the Year 2 randomization in the MET study was significant noncompliance among students, who ended up with teachers other than those who were randomly assigned to their classes (Kane et al., 2013; White & Rowan, 2012).
In prior work, we found that classroom-level noncompliance with the teacher random assignment in the MET study can be characterized in three ways: (a) full compliance classrooms are those where all students, within a classroom to which teacher j was randomly assigned, remained with teacher j; (b) partial-compliance classrooms are those where at least one student, within a classroom to which teacher j was randomly assigned, did not comply with the random teacher assignment and instead ended up in another teacher’s class; and (c) full noncompliance classrooms are those where no student, within a classroom to which teacher j was randomly assigned, remained with the randomized teacher. Full
noncompliance classrooms were those for which principals likely assigned teachers to classes prior to receipt of the random teacher assignments, therefore (intentionally or not) ignoring the random assignment of teachers to classes as part of the Year 2 randomization sample in the MET study (Steinberg & Garrett, 2015; White & Rowan, 2012). These particular consequences of ex post, nonrandom sorting on classroom-level compliance with randomization allow for additional insight into the influence of the nonrandom assignment of teachers to classes on measured teacher performance.2
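The three compliance categories can be expressed as a simple classification rule. The sketch below is illustrative only: the function name and the input layout (one actual-teacher ID per student on the randomized roster) are assumptions, not the MET data structure. It also reads "partial compliance" as some, but not all, students leaving the randomized teacher, so that the three categories are mutually exclusive.

```python
from typing import Iterable


def classify_compliance(actual_teachers: Iterable[str], randomized_teacher: str) -> str:
    """Classify one randomized classroom by how many rostered students stayed.

    actual_teachers holds, for each student on the randomized roster, the ID
    of the teacher who actually taught that student (hypothetical layout).
    """
    actual = list(actual_teachers)
    if not actual:
        raise ValueError("classroom roster is empty")
    stayed = sum(1 for teacher in actual if teacher == randomized_teacher)
    if stayed == len(actual):
        return "full compliance"  # every student remained with teacher j
    if stayed == 0:
        return "full noncompliance"  # no student remained with teacher j
    return "partial compliance"  # some, but not all, students remained


print(classify_compliance(["T1", "T1", "T1"], "T1"))
print(classify_compliance(["T1", "T2", "T1"], "T1"))
print(classify_compliance(["T2", "T3", "T2"], "T1"))
```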
Data and Sample
We use data from the MET study, which was carried out over 2 school years (2009–2010 and 2010–2011) and across six districts.3 The six par-ticipating districts were Charlotte-Mecklenburg Schools (NC), Dallas Independent School District (TX), Denver Public Schools (CO), Hillsborough County Public Schools (FL), Memphis City Schools (TN), and the New York City Department of Education (NY; White & Rowan, 2012).
Teacher Sample
The teacher-level analytic sample consists of 834 teachers in Grades 4 to 9, of whom 354 taught ELA, 304 taught math, and 176 taught both subjects.4 For the 834 teachers in our sample, we incorporate the following background characteristics: gender, race/ethnicity, degree status, and years of teaching experience within the district. We also observe whether a teacher was a generalist or departmentalized teacher. Generalists are subject-matter generalists who taught both ELA and mathematics to a single class of students; departmentalized teachers are subject-matter specialists who taught exclusively ELA or math, and worked with multiple classes of students. Furthermore, each teacher record includes unique district, school, and classroom (section) identifiers, variables identifying the grade taught, and the grade-subject combinations (e.g., fifth-grade math). We are able to use this unique identifying information to link students to their randomly assigned teacher as well as to their actual classroom teacher, allowing us to identify the compliance category of the class to which a teacher was randomly assigned in Year 2 (i.e., full compliance, partial compliance, or full noncompliance).
Table 1 summarizes the characteristics of all teachers in our sample and by subject-specific sample (i.e., ELA or math). In addition to the full sample of teachers, we construct two additional analytic samples by subject area. The IV sample includes teachers in Grades 5 to 9, and from this sample, we construct IV estimates (of the effect of incoming achievement on measured teacher performance), which we compare with teacher FE estimates for an assessment of the extent of bias in observation scores due to the nonrandom assignment of teachers to classes. The randomization sample includes those randomization blocks (i.e., grade-subject combinations of classes within a school) in which all MET study teachers were in full compliance classrooms. We use the randomization sample to generate experimental estimates of the effect of incoming achievement on teacher performance.5
Among our sample of 834 teachers, 46% were in full compliance classrooms, 36% were in partial-compliance classrooms, and 18% were in full noncompliance classrooms. Table 2 summarizes the teacher characteristics by classroom compliance status. Among the ELA sample, teachers differed significantly by years of teaching experience (in district), master’s degree attainment, and subject-matter generalist status. Indeed, fewer teachers in full compliance classes were subject-matter generalists teaching fourth or fifth grade, suggesting that much of the noncompliance with the Year 2 randomization took place in the earlier grades among the MET sample of ELA teachers. We also observe very similar patterns among the sample of math teachers.
Nearly a quarter of all teachers in our sample were subject-matter generalists teaching multiple subjects to one class of students. Table 3 summarizes the characteristics of teachers by generalist and departmentalized status. In both the ELA and math samples, generalists were significantly less likely to be male and White, although significantly more likely to be Black and have a master’s degree. Generalist teachers had approximately 2.5 fewer years of teaching experience than departmentalized teachers, with nearly all (98%) generalists teaching in either Grade 4 or 5. Furthermore, we find that early-grade, generalist teachers were significantly more likely to be in
full noncompliance classrooms (and, by extension, less likely to be in full compliance classrooms) than departmentalized teachers. This suggests that it may have been more difficult for school leaders to intentionally ignore the teacher random assignment among departmentalized teachers and purposively match entire classes of students to these teachers. In contrast, it appears that the purposive matching of teachers to classes (and the abandoning of the teacher random assignment) was more easily accomplished among the elementary grade (e.g., fourth and fifth grade) teachers teaching in self-contained classrooms.
Teacher Performance Measure: Classroom Observation Scores
We use the measure of teacher effectiveness that is among the most widespread in newly developed teacher evaluation systems—observation scores of a teacher’s classroom practice using Charlotte Danielson’s FFT. Measures of a teacher’s instructional practice are captured through FFT scores generated by MET raters using videos of subject-specific (e.g., math or ELA) lessons.6 Generalist classroom teachers were videoed, on average, on 4 separate days throughout the year, with each day producing one ELA and one math lesson video. In the first year of the study, departmentalized teachers were videoed on 2 separate days, capturing two different sections taught by the teacher. We randomly selected one of a departmentalized teacher’s two sections from the first year for inclusion in our analysis.7 In the second year of the study, departmentalized teachers were videoed on 4 different days and focused on one of the teacher’s sections. Video recordings were spread over time to capture greater representativeness of teacher instruction. MET raters watched the lesson videos remotely, coding them using eight components from two of the FFT domains: classroom environment and instruction (see White
TABLE 1
Teacher Characteristics

| Teacher characteristic | All teachers | ELA: Full sample | ELA: IV sample | ELA: Randomization sample | Math: Full sample | Math: IV sample | Math: Randomization sample |
|---|---|---|---|---|---|---|---|
| Male | .18 | .13 | .15 | .20 | .21 | .25 | .23 |
| White | .58 | .57 | .59 | .65 | .55 | .56 | .60 |
| Black | .32 | .34 | .32 | .27 | .35 | .35 | .29 |
| Hispanic | .08 | .07 | .07 | .06 | .07 | .06 | .07 |
| Other race | .02 | .01 | .02 | .02 | .03 | .04 | .04 |
| Experience (years in district) | 8.7 (7.01) | 8.3 (6.63) | 8.7 (6.94) | 8.4 (6.97) | 8.3 (7.01) | 8.9 (7.48) | 8.4 (7.00) |
| Master’s or higher | .33 | .35 | .31 | .21 | .40 | .37 | .20 |
| Grade 4 | .17 | .22 | .00 | .12 | .23 | .00 | .12 |
| Grade 5 | .19 | .24 | .31 | .10 | .25 | .32 | .12 |
| Grade 6 | .20 | .16 | .21 | .22 | .18 | .24 | .29 |
| Grade 7 | .14 | .11 | .14 | .19 | .12 | .15 | .23 |
| Grade 8 | .14 | .12 | .16 | .25 | .11 | .14 | .24 |
| Grade 9 | .16 | .14 | .18 | .12 | .12 | .15 | .00 |
| Generalist | .22 | .34 | .23 | .15 | .37 | .25 | .23 |
| Teachers | 834 | 530 | 412 | 130 | 480 | 372 | 86 |
| Schools | 204 | 172 | 156 | 49 | 162 | 146 | 31 |
| Districts | 5 | 5 | 5 | 5 | 5 | 5 | 4 |
Note. Data are from the second year of the MET study (2010–2011 school year). Proportions are reported for all teacher characteristics except experience (mean and standard deviation reported). Generalist teachers are subject-matter generalists who taught ELA and mathematics to a single class of students. Of the 834 teachers, data on a teacher’s gender are available for 799 teachers, 798 report their race, 797 report years of experience (in the district), and 618 report degree attainment (e.g., master’s or higher). Of the 530 teachers in the ELA sample, data on a teacher’s gender and race are available for 503 teachers, 501 report years of experience (in the district), and 381 report degree attainment (e.g., master’s or higher). Of the 480 teachers in the math sample, data on a teacher’s gender are available for 462 teachers, 461 report their race, 462 report years of experience (in the district), and 333 report degree attainment (e.g., master’s or higher). ELA = English Language Arts; IV = instrumental variables; MET = Measures of Effective Teaching.
& Rowan, 2012 for a detailed discussion of the process of video recording and scoring teachers’ lessons). Each component was scored on an integer scale from 1 (unsatisfactory) to 4 (distinguished). The scores for each component were then averaged within subject (using the harmonic mean) across all segments (i.e., lesson videos) to create section-level aggregate scores for each component. These aggregate scores represent the average FFT scores for a teacher’s subject-specific instruction for a single class (section) of students.
For our measure of teacher effectiveness, we created a composite performance measure in the following manner. Using the section-level means (from multiple observation scores) of each of the eight FFT components, we averaged across the eight section-level aggregate component scores within a section, by subject. For example, each of the eight FFT component means based on ELA instruction from a teacher’s class section were averaged for a final composite ELA FFT score. Previous research using the MET data found this to be an appropriate approach for aggregating teacher effectiveness measures based on classroom observation scores (Garrett & Steinberg, 2015; Kane et al., 2013; Mihaly, McCaffrey, Staiger, & Lockwood, 2013).
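The final aggregation step can be sketched in a few lines. The example below covers only the composite step: it takes the eight section-level component scores as given and assumes an unweighted arithmetic mean across them. The function name, the placeholder component labels, and the example values are hypothetical and are not drawn from the MET data.

```python
from statistics import fmean
from typing import Mapping

N_FFT_COMPONENTS = 8  # components scored by MET raters, as described in the text


def composite_fft(section_component_scores: Mapping[str, float]) -> float:
    """Average the eight section-level component scores into one composite."""
    if len(section_component_scores) != N_FFT_COMPONENTS:
        raise ValueError(f"expected {N_FFT_COMPONENTS} component scores")
    for name, score in section_component_scores.items():
        # Each rating is on the 1-4 scale, so an aggregate must fall in [1, 4].
        if not 1.0 <= score <= 4.0:
            raise ValueError(f"{name} score {score} is outside the 1-4 scale")
    return fmean(list(section_component_scores.values()))


# Hypothetical section-level ELA scores for one teacher's class section.
example_scores = [3.0, 3.0, 2.0, 3.0, 2.0, 2.0, 3.0, 2.0]
example = {f"component_{i}": s for i, s in enumerate(example_scores, start=1)}
composite = composite_fft(example)
print(composite)
```

Equal weighting of the eight components is the simplest reading of "averaged across"; a system that weights domains differently would need a weighted mean instead.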
Incoming Achievement and Classroom Composition
The MET data contain information on student achievement for Grades 4 to 8 provided by each district for its state accountability tests in ELA and mathematics. Student achievement information was provided for the 2 years of the study (i.e., 2009–2010 and 2010–2011) and up to 3 years prior to the study, as available. As the state assessments differ across locations, scores were standardized (i.e., converted to z scores) at the student level within district,
TABLE 2. Teacher Characteristics, by Classroom Compliance Status

| Teacher characteristic | ELA: Full Comply | ELA: Partial Comply | ELA: Noncomply | Math: Full Comply | Math: Partial Comply | Math: Noncomply |
|---|---|---|---|---|---|---|
| Male | .15 | .09 | .14 | .20 | .23 | .17 |
| White | .61 | .56 | .49 | .57 | .56 | .48 |
| Black | .32 | .32 | .43 | .34 | .32 | .43 |
| Hispanic | .05* | .11 | .06 | .06 | .09 | .06 |
| Other race | .01 | .01 | .02 | .03 | .03 | .02 |
| Experience (years in district) | 8.5** (7.06) | 8.9 (6.66) | 6.74 (4.84) | 8.75 (7.25) | 8.38 (6.91) | 7.48 (6.71) |
| Master’s or higher | .28** | .40 | .47 | .30*** | .44 | .51 |
| Grade 4 | .15*** | .30 | .27 | .18 | .25 | .26 |
| Grade 5 | .19*** | .25 | .37 | .21** | .22 | .36 |
| Grade 6 | .19 | .16 | .10 | .22 | .15 | .17 |
| Grade 7 | .17*** | .07 | .04 | .16** | .09 | .07 |
| Grade 8 | .17*** | .07 | .09 | .16*** | .07 | .08 |
| Grade 9 | .14 | .15 | .13 | .06*** | .22 | .06 |
| Generalist | .24*** | .38 | .53 | .32 | .38 | .44 |
| Teachers | 258 | 180 | 92 | 189 | 184 | 107 |
| Schools | 120 | 93 | 48 | 103 | 95 | 48 |
| Districts | 5 | 5 | 4 | 5 | 5 | 4 |

Note. Data are from the second year of the MET study (2010–2011 school year). Proportions are reported for all teacher characteristics except experience (mean and standard deviation reported). Full Comply refers to teachers in full compliance classrooms; Partial Comply refers to teachers in partial compliance classrooms; and Noncomply refers to teachers in full noncompliance classrooms in the second year of the MET study. ELA = English Language Arts; MET = Measures of Effective Teaching.
Differences by classroom compliance status, by subject, statistically significant at the *10%, **5%, and ***1% levels.
grade, and subject to enable comparability across sites. The MET data also capture key background characteristics of a teacher’s students that we include as controls for classroom composition, including class size, student race/ethnicity, gender, age, special education status (SPED), free or reduced-price lunch status (FRPL; as a proxy for student poverty), gifted status, and whether a student is an English language learner (ELL; see Table 4). Both student background and achievement characteristics were aggregated to the classroom level. The aggregated, classroom-level achievement represents the average (ELA or math) achievement for all students in a given class.
Importantly, the average prior-year achievement of all students in the class captures the academic skills students have when they arrive in a class, a focal consideration for assessing the influence of classroom composition on measured teacher performance.
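The standardization and aggregation steps described above can be illustrated as follows. This is a sketch under stated assumptions: the student records are invented, the population standard deviation is used (the text does not say which form was applied), and the names `students`, `standardize`, and `classroom_means` are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical prior-year records (made-up values):
# (district, grade, subject, class_id, scale_score)
students = [
    ("D1", 4, "ELA", "c1", 310.0), ("D1", 4, "ELA", "c1", 330.0),
    ("D1", 4, "ELA", "c2", 350.0), ("D1", 4, "ELA", "c2", 370.0),
    ("D2", 4, "ELA", "c3", 48.0), ("D2", 4, "ELA", "c3", 52.0),
    ("D2", 4, "ELA", "c4", 58.0), ("D2", 4, "ELA", "c4", 62.0),
]


def standardize(records):
    """Return (class_id, z) pairs, with z computed within district-grade-subject."""
    cells = defaultdict(list)
    for district, grade, subject, _class_id, score in records:
        cells[(district, grade, subject)].append(score)
    # Assumption: population SD within each cell.
    stats = {key: (mean(vals), pstdev(vals)) for key, vals in cells.items()}
    z_records = []
    for district, grade, subject, class_id, score in records:
        cell_mean, cell_sd = stats[(district, grade, subject)]
        z_records.append((class_id, (score - cell_mean) / cell_sd))
    return z_records


def classroom_means(z_records):
    """Aggregate student-level z scores to the classroom level."""
    by_class = defaultdict(list)
    for class_id, z in z_records:
        by_class[class_id].append(z)
    return {class_id: mean(zs) for class_id, zs in by_class.items()}


prior_achieve = classroom_means(standardize(students))
```

Standardizing within each district-grade-subject cell puts the two hypothetical districts, whose tests use very different scales, on a common footing before classroom averages are compared.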
Empirical Approach
Our aim is to understand how the incoming academic achievement of a class influences a teacher’s performance as measured by his or her classroom observation scores. Assuming additive separability of inputs to teacher performance, we can write Equation 1 as follows:

$$\mathrm{Perform}_{jct} = \beta_0 + \beta_1(\mathrm{PriorAchieve}_{ct}) + X_{ct}\Gamma + \theta_j + \varepsilon_{jct}, \tag{3}$$

where $\mathrm{Perform}_{jct}$ is the observation score for teacher $j$ with class $c$ in year $t$; $\mathrm{PriorAchieve}_{ct}$ is the average prior-year (i.e., incoming)
TABLE 3. Teacher Characteristics, by Generalist/Departmentalized Status

| Teacher characteristic | ELA: Generalist | ELA: Departmentalized | Math: Generalist | Math: Departmentalized |
|---|---|---|---|---|
| Male | .07*** | .16 | .07*** | .28 |
| White | .46*** | .63 | .48** | .59 |
| Black | .46*** | .28 | .44*** | .30 |
| Hispanic | .06 | .08 | .06 | .08 |
| Other race | .02 | .01 | .02 | .03 |
| Experience (years in district) | 6.65*** (5.44) | 9.19 (7.02) | 6.71*** (5.47) | 9.25 (7.60) |
| Master’s or higher | .68*** | .23 | .67*** | .28 |
| Grade 4 | .47*** | .09 | .48*** | .08 |
| Grade 5 | .51*** | .11 | .50*** | .10 |
| Grade 6 | .02*** | .23 | .02*** | .28 |
| Grade 7 | .00*** | .17 | .00*** | .18 |
| Grade 8 | .00*** | .18 | .00*** | .17 |
| Grade 9 | .00*** | .21 | .00*** | .19 |
| Full Comply | .35*** | .56 | .34* | .42 |
| Partial Comply | .38 | .32 | .39 | .38 |
| Noncomply | .27*** | .12 | .27* | .20 |
| Teachers | 180 | 350 | 177 | 303 |
| Schools | 54 | 122 | 52 | 115 |
| Districts | 4 | 5 | 4 | 5 |

Note. Data are from the second year of the MET study (2010–2011 school year). Proportions are reported for all teacher characteristics except experience (mean and standard deviation reported). Generalist teachers are subject-matter generalists who taught ELA and mathematics to a single class of students. Departmentalized teachers are subject-matter specialists who taught ELA or math to more than one class of students. ELA = English Language Arts; MET = Measures of Effective Teaching.
Differences by generalist/departmentalized status, by subject, statistically significant at the *10%, **5%, and ***1% levels.
achievement of teacher $j$’s class $c$ in year $t$; and $X_{ct}$ captures other observable characteristics of students in teacher $j$’s classroom in year $t$, including class size, the average age of students in the class, the proportion of students receiving FRPL, the proportion of students who are ELLs, the proportion of students receiving special education services (i.e., SPED), the proportion of gifted students, and the percentage of minority (i.e., Black or Hispanic) students; $\theta_j$ is a teacher FE and $\varepsilon_{jct}$ is a random error term. Accounting for a teacher’s endowed ability in this teacher fixed-effects setup will enable us to deconfound the relationship between incoming achievement and teacher performance that may be due to fixed performance characteristics of teachers. We cluster the standard errors at the teacher level.
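To make the role of the teacher FE concrete, the sketch below contrasts a pooled slope with the within-teacher (FE) slope on a tiny, noise-free panel. It is an illustration only: the data are invented, the model has a single regressor with no classroom controls, and `TRUE_SLOPE`, `teachers`, `within_slope`, and `pooled_slope` are hypothetical names rather than anything from the study.

```python
from collections import defaultdict
from statistics import mean

TRUE_SLOPE = 0.10  # hypothetical effect of incoming achievement, in FFT points

# Hypothetical two-year panel: teacher -> (fixed quality, [PriorAchieve in Y1, Y2]).
# Higher-quality teachers are deliberately given higher-achieving classes.
teachers = {
    "A": (2.8, [0.5, 0.7]),
    "B": (2.2, [-0.4, -0.1]),
    "C": (2.5, [0.0, 0.3]),
}

# Rows of (teacher, prior_achieve, perform), generated without noise for clarity.
rows = [
    (name, x, theta + TRUE_SLOPE * x)
    for name, (theta, xs) in teachers.items()
    for x in xs
]


def within_slope(data):
    """FE estimate: demean x and y by teacher, then regress y on x."""
    by_teacher = defaultdict(list)
    for teacher, x, y in data:
        by_teacher[teacher].append((x, y))
    num = den = 0.0
    for obs in by_teacher.values():
        x_bar = mean([x for x, _ in obs])
        y_bar = mean([y for _, y in obs])
        for x, y in obs:
            num += (x - x_bar) * (y - y_bar)
            den += (x - x_bar) ** 2
    return num / den


def pooled_slope(data):
    """Pooled OLS slope that ignores the teacher effect."""
    xs = [x for _, x, _ in data]
    ys = [y for _, _, y in data]
    x_bar, y_bar = mean(xs), mean(ys)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den
```

In this constructed example the within-teacher slope recovers `TRUE_SLOPE`, while the pooled slope is inflated because fixed teacher quality is correlated with incoming achievement. Real data would add noise, controls, and clustered standard errors.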
We estimate Equation 3 separately for each of the two subject-specific (math and ELA) teacher samples to examine the extent to which incoming achievement may differentially influence measured teacher performance across content areas. In addition to teachers’ aggregate classroom observation scores, we examine the relationship between incoming achievement and individual teacher practices across two domains: classroom environment and instruction. Furthermore, we estimate the extent to which incoming achievement differentially influences the observation scores for subject-matter generalists and departmentalized, subject-matter specialists.
As previously discussed, measured teacher performance may reflect, in part, the extent to which a teacher’s instruction and classroom management interact with the students to whom he or she is assigned, as well as the process by which teachers are assigned to classrooms. As formalized in Equation 2, the ex ante assignment process matching teachers to students and the ex post interaction of a teacher’s instruction with
TABLE 4. Classroom Composition

| Classroom characteristic | All teachers: Y1 | All teachers: Y2 | ELA sample: Y1 | ELA sample: Y2 | Math sample: Y1 | Math sample: Y2 |
|---|---|---|---|---|---|---|
| Class size | 23.8 (5.4) | 24.6 (6.0) | 24.0 (4.8) | 24.9 (5.4) | 24.1 (5.6) | 24.8 (6.3) |
| Age (years) | 11.9 (1.8) | 10.9 (1.8) | 11.6 (1.8) | 10.6 (1.8) | 11.5 (1.8) | 10.5 (1.8) |
| Male | .50 (.12) | .50 (.11) | .50 (.12) | .49 (.11) | .50 (.11) | .51 (.11) |
| White | .26 (.28) | .24 (.26) | .24 (.28) | .23 (.26) | .22 (.26) | .21 (.24) |
| Black | .32 (.33) | .31 (.32) | .35 (.35) | .35 (.35) | .36 (.35) | .35 (.35) |
| Hispanic | .34 (.28) | .36 (.27) | .33 (.28) | .34 (.28) | .33 (.27) | .35 (.27) |
| Asian | .06 (.12) | .06 (.12) | .06 (.12) | .06 (.12) | .06 (.13) | .06 (.13) |
| Other race | .03 (.05) | .03 (.04) | .03 (.05) | .02 (.04) | .03 (.05) | .03 (.04) |
| FRPL | .55 (.29) | .57 (.30) | .53 (.30) | .55 (.31) | .55 (.29) | .58 (.30) |
| ELL | .12 (.16) | .13 (.17) | .11 (.16) | .12 (.17) | .13 (.17) | .14 (.17) |
| SPED | .08 (.10) | .09 (.10) | .07 (.10) | .09 (.10) | .08 (.10) | .10 (.11) |
| Gifted | .06 (.14) | .07 (.15) | .08 (.16) | .09 (.17) | .04 (.09) | .04 (.09) |
| Achievement (ELA) | .06 (.54) | .10 (.54) | .09 (.52) | .12 (.52) | .00 (.52) | .03 (.50) |
| Achievement (Math) | .06 (.54) | .08 (.55) | .10 (.50) | .11 (.52) | .01 (.53) | .02 (.52) |
| Teachers | 834 | 834 | 530 | 530 | 480 | 480 |
| Schools | 204 | 204 | 172 | 172 | 162 | 162 |
| Districts | 5 | 5 | 5 | 5 | 5 | 5 |

Note. Each cell reports the mean (standard deviation) proportion of students, by characteristic, in a teacher’s classroom, except for class size, age, and achievement. Y1 represents the first year of the MET study (2009–2010 school year) and Y2 represents the second year of the MET study (2010–2011 school year). For class size, we report the mean (standard deviation) number of students in a teacher’s class. Achievement represents the average incoming achievement (in standard deviation units) among students in a teacher’s class (e.g., Y1 achievement, in ELA and math, is for the 2008–2009 school year, and Y2 achievement is for the 2009–2010 school year), standardized at the subject/grade/district level. ELA = English Language Arts; FRPL = free or reduced-price lunch status; ELL = English language learner; SPED = special education status; MET = Measures of Effective Teaching.
the particular composition of the class may be captured by an aggregate match effect, $\gamma_{jc}$. To distinguish between these ex ante and ex post contributions to the aggregate match effect, we characterize the influence of nonrandom teacher-classroom matching as $\gamma_{jc}^{\mathrm{match}}$ and the interaction of teacher instruction and classroom composition as $\gamma_{jc}^{\mathrm{instr}}$. The teacher FE approach accounts for the sorting of teachers to classes (i.e., the ex ante match effect, $\gamma_{jc}^{\mathrm{match}}$) that is correlated with unobserved, time-invariant teacher characteristics. However, it is possible that the sorting of teachers to classrooms may also be a function of time-varying, unobserved shocks to teacher performance. For example, Kalogrides and Loeb (2013) point out that principals may assign teachers to classes as a retention strategy, to compensate them for particularly strong performance in a given year, or even as a result of pressure senior teachers might place on principals to assign them their preferred classes and students. It is important to note that the FE approach will produce biased estimates of the influence of incoming achievement on a teacher’s measured performance if teacher assignment to classes is also based on such year-specific shocks.
Experimental Estimates of Incoming Achievement
One approach for addressing the concern that time-varying shocks may be correlated with the assignment of teachers to classes and bias the FE estimates is to leverage the randomization of teachers to classes that took place in the second year of the MET study. As previously described, a subset of all MET study teachers was randomly assigned to classes within randomization blocks, or grade-subject combinations of classes consisting of at least two participating MET teachers. If the random assignment of teachers to classes (within randomization block) was implemented faithfully, all teachers would be in full compliance classrooms in the 2010–2011 school year. Then, via OLS, we could estimate the causal effect of incoming achievement (which, conditional on the randomization block, should be exogenous to the teacher). However, in our full ELA sample, 51% of teachers were in either partial or full noncompliance classrooms; among our full math sample, 61% of teachers were in classes with some noncompliance.
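The noncompliance shares quoted above follow directly from the teacher counts in Table 2. The short check below reproduces that arithmetic; the dictionary layout and the name `noncompliance_share` are illustrative choices, not part of the study’s materials.

```python
# Teacher counts by compliance status, copied from Table 2.
counts = {
    "ELA": {"full": 258, "partial": 180, "non": 92},
    "Math": {"full": 189, "partial": 184, "non": 107},
}


def noncompliance_share(by_status):
    """Share of teachers in partial or full noncompliance classrooms."""
    total = sum(by_status.values())
    return (by_status["partial"] + by_status["non"]) / total


# Percentages rounded to whole numbers, as reported in the text.
shares = {
    subject: round(100 * noncompliance_share(by_status))
    for subject, by_status in counts.items()
}
```

The totals (530 ELA and 480 math teachers) match the sample sizes reported elsewhere in the tables, so the rounded shares serve as a quick consistency check between the text and Table 2.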
Even though we observe extensive classroom-level noncompliance with the teacher random assignment, we identified a subset of randomization blocks in which all MET study teachers were in full compliance classrooms (see Table 1 for summary statistics on the randomization sample). Of the 260 randomization blocks in the ELA sample, we identified 58 where all MET study teachers were in full compliance classes; for the math sample, 38 of the 244 randomization blocks included only full compliance classes. Given that only a subset of the randomization blocks was included in our randomization sample, we first assessed the fidelity of the randomization. Specifically, if the randomization held among teachers (and their classes) in our randomization sample, then the classroom composition variables jointly should not predict a teacher’s prerandomization measured performance (i.e., 2009–2010 FFT scores). To conduct this covariate balance test, we estimate the following model for both the ELA and math samples:

$$\mathrm{Perform}_{j,t-1} = \alpha_0 + \alpha_1(\mathrm{PriorAchieve}_{ct}) + X_{ct}\zeta + \upsilon_r + \varepsilon_{j,t-1}, \tag{4}$$

where $\mathrm{Perform}_{j,t-1}$ is teacher $j$’s prerandomization classroom observation score from year $t-1$ (i.e., 2009–2010 school year); $\mathrm{PriorAchieve}_{ct}$ is the average prior-year (i.e., incoming) achievement of teacher $j$’s class $c$ in year $t$ (i.e., 2010–2011 school year); and $X_{ct}$ captures other observable characteristics of students in teacher $j$’s classroom in year $t$, as in Equation 3. Given that the randomization was conducted at the randomization block level, we include randomization block FE ($\upsilon_r$); $\varepsilon_{j,t-1}$ is a random error term, and we cluster the standard errors at the randomization block level. To determine whether the characteristics of teacher $j$’s class $c$ in the 2010–2011 school year jointly predicted his or her prerandomization measured performance from the 2009–2010 school year, we assess the F statistic (and associated p value) from the joint test of the null hypothesis for Equation 4. If classroom composition was distributed randomly across teachers within randomization blocks, then we will be unable to reject the joint null hypothesis.8 In such cases, OLS estimates of Equation 3 for the randomization year (2010–2011), replacing only the teacher FE ($\theta_j$) with the randomization block FE
($\upsilon_r$), will produce unbiased estimates of the effect of incoming achievement on measured teacher performance. We present both OLS and teacher FE estimates for the ELA and math randomization samples, and discuss the validity of these estimates in light of the covariate balance test described above.
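The joint null hypothesis behind the covariate balance test can be written compactly. The expression below is a sketch that assumes a Wald-type F statistic built from the cluster-robust covariance matrix; the text does not state the exact form used, so the stacked vector $\delta$, its estimated covariance $\hat{V}_{\delta}$ (clustered at the randomization block level), and the number of restrictions $q$ are notation introduced here for illustration.

```latex
H_0:\ \delta \equiv (\alpha_1,\ \zeta')' = 0,
\qquad
F = \frac{\hat{\delta}'\,\hat{V}_{\delta}^{-1}\,\hat{\delta}}{q}
```

Here $q$ equals the number of elements in $\delta$, that is, the coefficient on incoming achievement plus one coefficient per classroom characteristic in $X_{ct}$. A small F statistic with a large p value means the Year 2 classroom characteristics do not jointly predict Year 1 observation scores, which is what faithful randomization implies.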
Examining Bias Due to Nonrandom Matching
Prior evidence suggests that when teachers are nonrandomly assigned to classes, systematic assortative matching occurs such that higher performing teachers are assigned to higher performing students (Clotfelter et al., 2006; Kalogrides & Loeb, 2013; Kalogrides et al., 2013; Steinberg & Garrett, 2015). In addition to the systematic matching of teachers to classes on observed performance characteristics, policymakers and practitioners should also be concerned that teachers are assigned to classes based on unobserved characteristics. If teacher-class assignment depends, in part, on unobserved time-varying and time-invariant characteristics of teachers, then observation-based measures of teacher performance will be prone to bias.
Our aim is to uncover the extent to which such unobserved matching bias that is due to teacher-class sorting based on unobserved and time-invariant teacher quality may influence teachers’ measured performance. To do so, we compare estimates from the teacher FE approach with Year 2 OLS estimates based on Equation 3 and estimates from an IV approach (please see the appendix for details on the IV approach and the formal comparison between FE and IV strategies). Neither the OLS nor IV estimates will provide unbiased causal estimates of the effect of incoming achievement. Rather, the OLS estimates (based on Year 2 data) are compared with the FE estimates (based on 2 consecutive years of data) to reveal whether omitting time-invariant teacher quality biases estimates of the effect of incoming achievement on measured teacher performance. Furthermore, the IV estimates are compared with the FE estimates to examine whether omitting the correlation between time-invariant teacher quality and incoming classroom achievement biases estimates of the effect of incoming achievement on measured teacher performance. Therefore, both the OLS and IV comparisons to the FE estimates aim to shed light on how omitting fixed teacher quality biases the effect of incoming achievement on measured teacher performance. Most critically for policy and practice, these comparisons will reveal whether bias in teacher observation scores may be present due to the nonrandom assignment of teachers to classes based on fixed teacher quality.
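The logic of comparing OLS with FE estimates can be summarized with the standard omitted-variable expression. The sketch below is a simplification, assuming a single regressor (no classroom controls $X_{ct}$) and an error term uncorrelated with incoming achievement once $\theta_j$ is accounted for; with controls, the same expression applies to the variables after partialling out $X_{ct}$.

```latex
\operatorname{plim}\,\hat{\beta}_1^{\mathrm{OLS}}
  = \beta_1 + \frac{\operatorname{Cov}(\mathrm{PriorAchieve}_{ct},\ \theta_j)}
                   {\operatorname{Var}(\mathrm{PriorAchieve}_{ct})},
\qquad
\operatorname{plim}\,\hat{\beta}_1^{\mathrm{FE}} = \beta_1
```

Under these assumptions, the gap between the OLS and FE estimates reflects the covariance between fixed teacher quality and incoming classroom achievement, and its sign follows the sign of that covariance.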
Results
To what extent does measured teacher performance, based on classroom observation scores, reflect the incoming achievement of the students to whom teachers are assigned? To answer this question, we begin by showing the distribution of teacher effectiveness, measured by classroom observation scores, among teachers who teach students with varying levels of academic performance. We present results from the 2009–2010 school year only because all teachers were nonrandomly assigned to classes in the first year of the MET study. We find that ELA teachers were more than twice as likely to be rated in the top performance quintile if assigned the highest achieving students compared with teachers assigned the lowest achieving students; math teachers were more than 6 times as likely (see Figure 1). Furthermore, approximately half of the teachers (48% in ELA and 54% in math) were rated in the top two performance quintiles if assigned the highest performing students, while 37% of ELA and only 18% of math teachers assigned the lowest performing students were highly rated based on classroom observation scores.
The patterns observed in Figure 1 likely reflect a number of factors (both observable and unobservable characteristics of teachers and their students) that influence how teachers perform in the classroom and their subsequent performance ratings. To lend further insight into the influence of student achievement on measured teacher performance, we next present our main estimates of the effect of incoming student achievement on teachers’ classroom observation scores. Among the full sample of ELA and math teachers, estimates from the teacher FE approach reveal that incoming achievement has a large and significant effect on measured teacher performance (see column 2, Table 5). For the ELA sample, the
coefficient of 0.11 FFT points suggests that if a teacher were to be assigned to a class with 1 standard deviation higher incoming achievement, that teacher would realize one third of a standard deviation increase in measured teacher performance. For math teachers, the effect of incoming math achievement, although smaller in magnitude, suggests that a 1 standard deviation increase in incoming student math achievement would generate approximately a 0.2 standard deviation increase in measured performance.
When we restrict the sample to teachers in Grades 5 to 9 (IV sample), the FE estimates again suggest a substantive and statistically significant effect of incoming achievement on measured teacher performance. For ELA teachers, a 0.14 FFT point effect is equivalent to increasing measured teacher performance by 0.4 standard deviations (with a commensurate 1 standard deviation increase in incoming student ELA achievement); for math teachers, a 0.08 FFT point effect is equivalent to increasing measured
FIGURE 1. Measured teacher performance by incoming student achievement.
Note. Panel A shows the distribution of teacher FFT scores from ELA lessons from the 2009–2010 school year (for 530 ELA teachers) by the incoming ELA achievement of their students. Panel B shows the distribution of teacher FFT scores from math lessons from the 2009–2010 school year (for 480 math teachers) by the incoming math achievement of their students. As state achievement tests differ across the five districts in which our sample of teachers is located, student test scores were standardized (i.e., converted to z scores) within district, grade, and subject to enable comparability across district sites. F tests confirm that the FFT scores for ELA and math teachers differ significantly across the distribution of incoming classroom achievement: for the ELA sample, F = 4.21 (p = .002); for the math sample, F = 12.08 (p < .0001). FFT = Framework for Teaching; ELA = English Language Arts.
TABLE 5. Incoming Achievement and Measured Teacher Performance

Panel A: ELA sample

| | Full sample: (1) OLS (Pooled) | Full sample: (2) FE | IV sample: (3) IV | IV sample: (4) FE | Randomization sample: (5) OLS (Y2) | Randomization sample: (6) FE |
|---|---|---|---|---|---|---|
| Prior achievement | 0.04 (0.03) | 0.11*** (0.04) | 0.11* (0.06) | 0.14*** (0.05) | 0.17 (0.22) | 0.16** (0.08) |
| IV (F statistic) | | | 0.89*** (827.8) | | | |
| F statistic (p value) from Covariate Balance Test | | | | | 0.64 (0.801) | |
| Randomization blocks | 260 | 260 | 205 | 205 | 58 | 58 |
| Teacher × Year Observations | 1,060 | 1,060 | 412 | 824 | 130 | 260 |
| R² | .2179 | .0298 | .2681 | .0143 | .7576 | .0691 |

FFT M (SD), as printed (five entries for the six columns): 2.53 (0.33), 2.50 (0.33), 2.50 (0.35), 2.52 (0.34), 2.51 (0.36)

Panel B: Math sample

| | Full sample: (1) OLS (Pooled) | Full sample: (2) FE | IV sample: (3) IV | IV sample: (4) FE | Randomization sample: (5) OLS (Y2) | Randomization sample: (6) FE |
|---|---|---|---|---|---|---|
| Prior achievement | 0.06** (0.03) | 0.06* (0.04) | −0.01 (0.04) | 0.08** (0.04) | −0.33 (0.23) | 0.07 (0.09) |
| IV (F statistic) | | | 0.85*** (822.2) | | | |
| F statistic (p value) of Covariate Balance Test | | | | | 2.28** (0.027) | |
| Randomization blocks | 244 | 244 | 194 | 194 | 38 | 38 |
| Teacher × Year Observations | 960 | 960 | 372 | 744 | 86 | 172 |
| R² | .2149 | .1273 | .2648 | .1422 | .6695 | .0913 |

FFT M (SD), as printed (five entries for the six columns): 2.48 (0.33), 2.45 (0.32), 2.45 (0.34), 2.44 (0.31), 2.43 (0.33)

Note. Each column (within a panel) reports a separate regression. For pooled OLS and IV estimates, coefficients reported (with robust standard errors clustered at the school level); for teacher FE estimates, coefficients reported (with robust standard errors clustered at the teacher level); for OLS (Y2) estimates, coefficients reported (with robust standard errors clustered at the randomization block level). Columns 2, 4, and 6 include teacher FE; column 3 includes district FE, and column 5 includes randomization block FE. The IV sample includes teachers in Grades 5 to 9 during the 2010–2011 school year. Prior Achievement is the average incoming subject-specific (e.g., ELA or math) achievement among students in a teacher’s class. For the IV estimates, Prior Achievement estimates the effect of incoming classroom achievement on measured teacher performance when instrumenting with the class’ once-lagged prior achievement (e.g., twice-lagged achievement). The Covariate Balance Test reports the F statistic (p value) from the joint test of the null hypothesis from a regression of prerandomization (e.g., 2009–2010) measured teacher performance ($\mathrm{Perform}_{j,t-1}$) on Prior Achievement, controlling for classroom characteristics and randomization block FE. All regressions control for the following classroom characteristics: class size and the share of students by race/ethnicity, gender, age, special education status, free or reduced-price lunch status, gifted status, and English language learner status. IV = instrumental variables; OLS = ordinary least squares; FE = fixed effects; ELA = English Language Arts; FFT = Framework for Teaching.
Coefficients statistically significant at the *10%, **5%, and ***1% levels.
teacher performance by 0.24 standard deviations. We note that the FE estimates for both ELA and math teachers in the IV sample are larger in magnitude than the FE estimates among the full sample, suggesting that the influence of incoming achievement may be less salient for teachers in earlier grades, as Grade 4 teachers were excluded from the IV sample. We return to this point when we examine the effect of incoming achievement among generalist and departmentalized teachers.
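The conversions from FFT points to standard deviation units reported above amount to dividing a coefficient by the sample standard deviation of FFT scores. The sketch below reproduces two of them; the function name `effect_in_sd` is hypothetical, and pairing the IV-sample coefficient with an SD of 0.35 is an assumption about which Table 5 entry belongs to that column.

```python
def effect_in_sd(coef_fft_points: float, fft_sd: float) -> float:
    """Express a coefficient in FFT points as a share of the FFT score SD."""
    return coef_fft_points / fft_sd


# ELA, full sample: FE coefficient of 0.11 and an FFT SD of 0.33 (Table 5).
ela_full = effect_in_sd(0.11, 0.33)

# ELA, IV sample: FE coefficient of 0.14; the SD of 0.35 is an assumed pairing.
ela_iv = effect_in_sd(0.14, 0.35)
```

Under these inputs, `ela_full` is about one third of a standard deviation and `ela_iv` is about 0.4, in line with the magnitudes described in the text.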
As previously discussed, the FE estimates will be biased in the presence of time-varying shocks to teacher-class assignment. To address this concern, we present experimental estimates of the effect of incoming achievement on measured teacher performance (see column 5, Table 5). For the randomization sample of ELA teachers, the covariate balance test indicates that the randomization was preserved. However, the randomization does not appear to have held for the sample of math teachers, as we can reject the joint null hypothesis (p value = .03) that classroom composition was uncorrelated with prerandomization measured teacher performance in math. As a result, the experimental estimate for only the sample of ELA teachers may be considered a valid causal effect of incoming achievement on measured teacher performance. For ELA teachers, we find a substantively large (0.17 FFT points), although not precisely estimated, effect of incoming achievement. The FE estimate (0.16 FFT points) is nearly identical to the experimental estimate and more precisely estimated, lending support to the assumptions of the FE model, that is, $E(\varepsilon_{jct} \mid \mathrm{PriorAchieve}_{ct}, X_{ct}, \theta_j) = 0$ from Equation 3. We further note that the FE estimate for the randomization sample of math teachers (0.07) is remarkably similar to the FE estimates for the full and IV math samples. Taken together, the FE and experimental estimates suggest that incoming achievement plays a significant role in determining measured teacher performance.
Does Incoming Achievement Influence Specific Teacher Practices?
To shed further light on the specific teacher practices that may be differentially influenced by incoming achievement, we next examine eight teacher practices across two domains (classroom environment and instruction) for which external observers in the MET study rated teacher performance.9 We first examine the extent to which the incoming achievement of an ELA teacher’s students influences these specific teacher practices (see Panel A, Table 6). We note that all estimates are net of teacher FE. We find consistent evidence that incoming achievement influences each of the four practices related to the classroom environment. These findings may not be particularly surprising. The composition of a classroom is likely to have an influence on measures of teacher practice that require teachers and students to coconstruct learning activities and collaborate on classroom procedures, while higher achieving students may be more engaged in learning, shaping the overall behavioral climate of the classroom. These findings therefore suggest that, independent of fixed teacher quality, teachers who are assigned higher achieving students will be judged to perform better on measures that evaluate their capacity to create and sustain a higher functioning learning environment.
Moreover, evidence from Table 6 indicates that aspects of a teacher’s instructional practices that may be more sensitive to interactions between teachers and their students are influenced by the achievement of their students, independent of a teacher’s own instructional ability. Indeed, teachers with higher achieving students are rated higher in two areas of instruction: communicating with students (3a) and engaging students in learning (3c). Notably, however, measures of instructional performance that depend more on strategies or instructional tools that teachers may bring with them to the classroom, namely using questioning techniques (3b) and assessment (3d) to drive instruction, are not significantly related to the level of achievement of their students.
We find that, unlike with ELA teachers, the measured performance of math teachers on individual practices is largely unrelated to the incoming achievement of their students (see Panel B, Table 6). The only exception is for a math teacher’s performance in establishing a culture for learning in the class (FFT component 2b). These results are not surprising given that a math teacher’s aggregate performance on the Danielson FFT has only a modest and marginally statistically significant relationship with prior student achievement (see column 2, Panel B of Table 5). It is possible that the ways in which math
TABLE 6. Incoming Achievement and Teacher Practices

Panel A: ELA sample

| | Domain 2 (Classroom environment): 2a | 2b | 2c | 2d | Domain 3 (Instruction): 3a | 3b | 3c | 3d |
|---|---|---|---|---|---|---|---|---|
| Prior achievement | 0.15*** (0.05) | 0.19*** (0.06) | 0.12** (0.05) | 0.09* (0.05) | 0.10** (0.05) | 0.07 (0.06) | 0.16*** (0.05) | 0.04 (0.05) |
| Classroom characteristics | X | X | X | X | X | X | X | X |
| Teacher FE | X | X | X | X | X | X | X | X |
| Teacher × Year Observations | 1,060 | 1,060 | 1,060 | 1,060 | 1,060 | 1,060 | 1,060 | 1,060 |
| R² | .0004 | .0374 | .0688 | .0010 | .0326 | .0026 | .0770 | .0367 |
| FFT M (SD) | 2.71 (0.42) | 2.53 (0.42) | 2.68 (0.42) | 2.77 (0.43) | 2.64 (0.37) | 2.24 (0.43) | 2.46 (0.41) | 2.26 (0.42) |

Panel B: Math sample

| | Domain 2 (Classroom environment): 2a | 2b | 2c | 2d | Domain 3 (Instruction): 3a | 3b | 3c | 3d |
|---|---|---|---|---|---|---|---|---|
| Prior achievement | 0.08 (0.06) | 0.10* (0.05) | 0.05 (0.05) | 0.05 (0.05) | 0.06 (0.05) | 0.05 (0.05) | 0.03 (0.05) | 0.07 (0.06) |
| Classroom characteristics | X | X | X | X | X | X | X | X |
| Teacher FE | X | X | X | X | X | X | X | X |
| Teacher × Year Observations | 960 | 960 | 960 | 960 | 960 | 960 | 960 | 960 |
| R² | .0644 | .0864 | .0468 | .0957 | .0003 | .0176 | .0980 | .0136 |
| FFT M (SD) | 2.62 (0.43) | 2.48 (0.41) | 2.64 (0.44) | 2.72 (0.43) | 2.57 (0.38) | 2.13 (0.40) | 2.37 (0.41) | 2.30 (0.41) |

Note. Each column (within a panel) reports a separate regression. Coefficients (with robust standard errors clustered at the teacher level) reported. Each column represents a separate regression of an individual FFT component (e.g., teacher practice) on classroom composition. Prior Achievement is the average incoming ELA or math achievement among students in a teacher’s class. Domain 2 (classroom environment) includes the following FFT components: (2a) creating an environment of respect and rapport, (2b) establishing a culture for learning, (2c) managing classroom procedures, and (2d) managing student behavior. Domain 3 (instruction) includes the following FFT components: (3a) communicating with students, (3b) using questioning and discussion techniques, (3c) engaging students in learning, and (3d) using assessment in instruction (see Garrett & Steinberg, 2015, for more detail on each of the eight FFT components). Classroom characteristics include class size and the share of students by race/ethnicity, gender, age, special education status, free or reduced-price lunch status, gifted status, and English language learner status. ELA = English Language Arts; FE = fixed effects; FFT = Framework for Teaching.
Coefficients statistically significant at the *10%, **5%, and ***1% levels.
instruction is delivered may play an important role in insulating teachers from the influence of incoming student achievement on measured performance. Whereas math instruction tends to rely more on direct instruction, ELA teachers tend to use more opportunities to coconstruct literacy and reading instruction through, for example, balanced literacy approaches (such as group read alouds and small-group instruction). In the current policy environment, however, this distinction between approaches to instruction across ELA and math subject areas may become less pronounced. In particular, the Common Core State Standards in mathematics (CCSS-M) reflect a key shift in mathematics instruction that is focused more on the coconstruction of knowledge in the classroom than on direct instruction. For example, the CCSS-M requires students to “understand and use stated assumptions, definitions, and previously stated results in constructing arguments” and to “justify their conclusions, communicate them to others, and respond to the arguments of others” (National Governors Association Center for Best Practices & Council of Chief State School Officers, 2010, pp. 6–7). With the adoption of the CCSS-M and the commensurate shift in instruction to a more constructivist approach, the influence of student achievement on measured teacher performance in mathematics may, over time, come to resemble what we observe among ELA teachers in the MET study sample.
Does Incoming Achievement Differentially Influence Subject-Matter Generalists and Subject-Specific Teachers?
As we have shown, incoming achievement may exert a differential influence on measured teacher performance depending on the subject a teacher teaches. To what extent, then, are teachers’ observation scores sensitive to whether they teach multiple subjects to the same class of students (i.e., subject-matter generalist teachers) or whether they teach one subject (math or ELA) to more than one class of students (i.e., departmentalized teachers)? Table 7 summarizes these results (see Note 10).
We find that incoming achievement has no significant influence on the measured performance of ELA generalist teachers (teachers who teach ELA in addition to math and other subject areas). However, departmentalized ELA teachers whose students’ incoming achievement is 1 standard deviation higher realize better observed performance, on the order of 0.15 FFT points (or 0.41 standard deviations). For math generalists (those who teach math in addition to ELA and other subject areas to the same class of students), incoming achievement is not significantly related to measured performance. Again, however, departmentalized math teachers realize better observed performance (0.08 FFT points, or approximately 0.22 standard deviations) with higher performing students.

TABLE 7
The Influence of Incoming Achievement, by Generalist/Departmentalized Status

| | ELA sample: Generalist | ELA sample: Departmentalized | Math sample: Generalist | Math sample: Departmentalized |
| --- | --- | --- | --- | --- |
| Prior achievement | 0.05 (0.06) | 0.15*** (0.05) | −0.04 (0.07) | 0.08* (0.04) |
| Classroom characteristics | X | X | X | X |
| Teacher FE | X | X | X | X |
| Teacher × Year Observations | 354 | 706 | 348 | 612 |
| R² | .0042 | .0649 | .0203 | .1769 |
| FFT M (SD) | 2.61 (0.23) | 2.50 (0.37) | 2.56 (0.27) | 2.43 (0.36) |

Note. Each column reports a separate regression. Coefficients (with robust standard errors clustered at the teacher level) reported. Generalist teachers are subject-matter generalists who taught ELA and mathematics to a single class of students; departmentalized teachers are subject-matter specialists who taught ELA or math to more than one class of students. We note that three ELA teachers and three math teachers were generalists in the 2010–2011 school year and departmentalized in the 2009–2010 school year; we treat these teachers as departmentalized for the purposes of the teacher FE models. Chow tests confirm that the process by which measured teacher performance depends on classroom composition is statistically significantly different across generalist and departmentalized teachers for both the ELA, χ²(13) = 37.49, p value = .0003, and math, χ²(13) = 35.70, p value = .0007, samples. Classroom characteristics include class size and the share of students by race/ethnicity, gender, age, special education status, free or reduced-price lunch status, gifted status, and English language learner status. ELA = English Language Arts; FE = fixed effects; FFT = Framework for Teaching.
Coefficients statistically significant at the *10%, **5%, and ***1% levels.
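The standardized magnitudes quoted for departmentalized teachers appear to follow from dividing each coefficient (in FFT points) by the FFT standard deviation of the corresponding departmentalized sample in Table 7. The short Python sketch below reproduces that arithmetic. The function name `to_sd_units` is ours, and the choice of denominator is our reading of the table rather than something the text states explicitly.

```python
def to_sd_units(coef_fft_points: float, fft_sd: float) -> float:
    """Convert a coefficient in FFT points into FFT standard deviation units."""
    return coef_fft_points / fft_sd


# Coefficients and sample SDs copied from Table 7 (departmentalized columns).
# Using the departmentalized-sample SD as the denominator is an assumption.
ela_effect = to_sd_units(0.15, 0.37)
math_effect = to_sd_units(0.08, 0.36)

print(f"ELA departmentalized:  {ela_effect:.2f} SD")
print(f"Math departmentalized: {math_effect:.2f} SD")
```

Rounded to two decimals, these ratios agree with the 0.41 and 0.22 standard deviations reported in the text.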
Why might there be a systematic difference in the influence of incoming achievement for subject generalists and their subject-specific counterparts? The difference might be due to the fact that subject generalists teach fewer students as they face just one classroom of students each school day, while departmentalized teachers teach the same subject to multiple classes of students. As a result, subject generalists have more time with each student, which may provide more opportunities to adjust to students’ learning needs. However, we cannot definitively attribute these differences to generalist versus departmentalized status, as we also find significant grade-level differences between generalist and departmentalized teachers (see Table 3). Indeed, 98% of generalist teachers (in both the ELA and math samples) taught either Grade 4 or Grade 5, whereas only 20% of ELA (and 18% of math) teachers in departmentalized classes taught these grades. It is possible that the observed differences between classroom generalists and subject specialists are driven by differences in instruction for earlier grades compared with later grades, and thus, we are ultimately unable to empirically attribute these differences to class status (generalist vs. departmentalized) or grade-level differences.
Does the Nonrandom Assignment of Teachers to Classes Bias Teachers’ Measured Performance?
As we have shown, the subject to which teachers were assigned, and whether or not they were assigned to teach a single class of students, had different consequences for teachers’ measured performance. In addition to the influence of the incoming achievement of a teacher’s class, consideration must also be given to the potential influence that the nonrandom matching of teachers to classes based on fixed, unobserved teacher characteristics may exert on measured teacher performance (as formalized in Equation 2). To examine the extent to which the systematic, nonrandom assignment of teachers to classes may bias estimates of measured teacher performance, we compare estimates from the FE approach with OLS and IV estimates, by classroom compliance status.
If the randomization held among teachers in full compliance classes, then unobserved, time-invariant teacher heterogeneity should be uncorrelated with incoming student achievement (see Note 11). In such cases, the OLS, IV, and FE estimates should not differ (see the appendix for a formalization of this result for the IV and FE approaches). For the ELA sample, this is what we find (see Full Comply column, Panel A of Table 8). Among math teachers in full compliance classes, however, the OLS, IV, and FE estimates differ, and this difference is due to the fact that we find significant nonrandom selection of math teachers into full compliance classes (i.e., classroom composition predicted prerandomization measured teacher effectiveness). As a result, we focus our attention on the ELA teachers in partial compliance classes (where endogenous student sorting is most relevant) for an assessment of the extent (and direction) of bias due to nonrandom teacher assignment to classes.
Among ELA teachers in partial compliance classes, we find evidence of matching based on fixed, unobservable teacher characteristics. Specifically, the OLS (0.21 FFT points) and IV (0.17 FFT points) estimates are biased upward relative to the FE estimate (0.14 FFT points). This suggests that, by not accounting for time-invariant teacher heterogeneity in the teacher-class matching process, estimates of the influence of incoming class achievement on measured teacher performance will be overstated. This result further suggests that higher performing students were endogenously sorted into the classes of higher performing teachers (as has been shown in prior work with the MET data; see, for example, Steinberg & Garrett, 2015). Therefore, the nonrandom and positive assignment of teachers to classes of students based on time-invariant (and unobserved) teacher characteristics would suggest more effective teacher performance, as measured by classroom observation scores, than may actually be true.
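The direction of the bias described above can be illustrated with a small simulation. This is a sketch under assumptions of our own choosing, not a reproduction of the MET analysis: the data are synthetic, the true effect of prior achievement is set to 0.14 FFT points purely to echo the FE estimate, and the hypothetical `match` parameter controls how strongly a teacher's fixed quality is tied to the incoming achievement of their classes. With two years per teacher, the first-difference slope stands in for the teacher FE estimator.

```python
import random


def cov(a, b):
    """Sample covariance of two equal-length sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (len(a) - 1)


def simulate(match, n_teachers=20000, beta=0.14, seed=0):
    """Return (pooled OLS slope, first-difference slope) on synthetic data.

    match > 0: teachers with higher fixed quality (theta) tend to receive
    higher-achieving classes in both years. match = 0 mimics random assignment.
    All distributional choices here are illustrative assumptions.
    """
    rng = random.Random(seed)
    x_all, y_all, dx, dy = [], [], [], []
    for _ in range(n_teachers):
        theta = rng.gauss(0, 0.3)  # time-invariant teacher quality
        xs = [match * theta + rng.gauss(0, 1) for _ in range(2)]  # class prior achievement
        ys = [beta * x + theta + rng.gauss(0, 0.2) for x in xs]   # observation score
        x_all += xs
        y_all += ys
        dx.append(xs[1] - xs[0])  # differencing removes theta
        dy.append(ys[1] - ys[0])
    ols = cov(x_all, y_all) / cov(x_all, x_all)
    fe = cov(dx, dy) / cov(dx, dx)
    return ols, fe


for m in (0.0, 1.0):
    ols, fe = simulate(match=m)
    print(f"match={m}: pooled OLS={ols:.3f}, first-difference={fe:.3f}")
```

Under these assumptions, with `match=0.0` both slopes should land near the true value, whereas with `match=1.0` the pooled OLS slope is pushed above the first-difference slope, mirroring the ordering of the OLS and FE estimates among partial compliance ELA classes.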
TABLE 8
Examining Bias Due to Matching: Comparison of OLS, IV, and FE Estimates

Panel A: ELA sample

| | Full Comply: OLS (Y2) | Full Comply: IV | Full Comply: FE | Partial Comply: OLS (Y2) | Partial Comply: IV | Partial Comply: FE | Noncomply: OLS (Y2) | Noncomply: IV | Noncomply: FE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Prior achievement | 0.05 (0.08) | 0.04 (0.10) | 0.04 (0.07) | 0.21*** (0.08) | 0.17** (0.08) | 0.14** (0.06) | 0.24** (0.09) | 0.13 (0.11) | 0.26* (0.14) |
| IV (F statistic) | | 0.88*** (451.2) | | | 0.86*** (321.1) | | | 0.97*** (139.6) | |
| Randomization blocks | 135 | 135 | 135 | 89 | 89 | 89 | 42 | 42 | 42 |
| Schools | 106 | 106 | 106 | 78 | 78 | 78 | 40 | 40 | 40 |
| Teacher × Year Observations | 219 | 219 | 438 | 126 | 126 | 252 | 67 | 67 | 134 |
| R² | .2436 | .2434 | .0832 | .3301 | .3285 | .0014 | .5609 | .5510 | .0011 |
| FFT M (SD) | 2.48 (0.35) | 2.48 (0.35) | 2.49 (0.36) | 2.54 (0.29) | 2.54 (0.29) | 2.53 (0.32) | 2.49 (0.36) | 2.49 (0.36) | 2.48 (0.35) |

Panel B: Math sample

| | Full Comply: OLS (Y2) | Full Comply: IV | Full Comply: FE | Partial Comply: OLS (Y2) | Partial Comply: IV | Partial Comply: FE | Noncomply: OLS (Y2) | Noncomply: IV | Noncomply: FE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Prior achievement | −0.03 (0.06) | −0.06 (0.08) | 0.06 (0.06) | −0.02 (0.06) | −0.08 (0.07) | 0.09 (0.08) | 0.01 (0.13) | −0.02 (0.13) | 0.04 (0.08) |
| IV (F statistic) | | 0.84*** (228.2) | | | 0.88*** (429.5) | | | 0.82*** (311.0) | |
| Randomization blocks | 104 | 104 | 104 | 96 | 96 | 96 | 48 | 48 | 48 |
| Schools | 89 | 89 | 89 | 80 | 80 | 80 | 42 | 42 | 42 |
| Teacher × Year Observations | 155 | 155 | 310 | 138 | 138 | 276 | 79 | 79 | 158 |
| R² | .3177 | .3169 | .1106 | .4156 | .4130 | .1464 | .3381 | .3374 | .0107 |
| FFT M (SD) | 2.44 (0.30) | 2.44 (0.30) | 2.44 (0.33) | 2.42 (0.33) | 2.42 (0.33) | 2.42 (0.37) | 2.51 (0.32) | 2.51 (0.32) | 2.53 (0.30) |

Note. Each column (within a panel) reports a separate regression. For OLS estimates, coefficients reported (with robust standard errors clustered at the school level); for teacher FE estimates, coefficients reported (with robust standard errors clustered at the teacher level); for IV estimates, coefficients reported (with robust standard errors clustered at the school level). FE regressions include teacher FE; OLS and IV regressions include district FE. The sample includes teachers in Grades 5 to 9 during the 2010–2011 school year (excluding Grade 4 teachers without twice-lagged student achievement test scores). Full Comply refers to teachers in full compliance classrooms; Partial Comply refers to teachers in partial compliance classrooms; and Noncomply refers to teachers in full noncompliance classrooms in the second year of the MET study. For the IV estimates, Prior Achievement estimates the effect of incoming classroom achievement on measured teacher performance when instrumenting with the class’ once-lagged prior achievement (e.g., twice-lagged achievement). All regressions control for the following classroom characteristics: class size and the share of students by race/ethnicity, gender, age, special education status, free or reduced-price lunch status, gifted status, and English language learner status. OLS = ordinary least squares; IV = instrumental variables; FE = fixed effects; ELA = English Language Arts; FFT = Framework for Teaching; MET = Measures of Effective Teaching.
Coefficients statistically significant at the *10%, **5%, and ***1% levels.

Discussion
High-stakes accountability systems aim to provide more information about teacher performance for the purposes of improving classroom instruction and satisfying the accountability objective of newly developed teacher evaluation systems: identifying, remediating, and, if necessary, removing underperforming teachers from the classroom. However, when information about teacher performance does not reflect a teacher’s practice but rather the students to whom the teacher is assigned, such systems are at risk of misidentifying and mislabeling teacher performance. The misidentification of teachers’ performance levels has real implications for personnel decisions and fundamentally calls into question an evaluation system’s ability to effectively and equitably improve, reward, and sanction teachers.
In this article, we find that the incoming achievement of a teacher’s students significantly and substantively influences observation-based measures of teacher performance. Indeed, teachers working with higher achieving students tend to receive higher performance ratings, above and beyond that which might be attributable to aspects of teacher quality that are fixed over time. Moreover, incoming achievement matters differently for teachers in different classroom settings. Specifically, incoming student achievement exerts a larger influence on the measured performance of ELA teachers (compared with math teachers) and subject-matter specialists (compared with their generalist counterparts). These results are particularly relevant for policy decisions around newly implemented teacher evaluation reforms, given that classroom observation scores account for the majority, and in some cases the entirety, of a teacher’s summative evaluation rating across many state and district evaluation systems (Steinberg & Donaldson, in press).
When examining specific aspects of classroom management and instruction, we find that measures of teacher performance related to the classroom’s climate and learning culture were more sensitive to student achievement than measures of a teacher’s instructional practices for ELA instruction. These findings suggest that teachers who are assigned higher performing students may be inaccurately judged to be better managers of the classroom environment than they actually are. Alternatively, this evidence may indicate that teachers perform better when assigned higher achieving students. This uncertainty is important for policymakers and practitioners to recognize. Moreover, this evidence also suggests that the goals of policy reforms may be best met through the inclusion of measures that are less sensitive to the student composition of the classroom and that focus more on specific instructional skills (such as the use of questioning and discussion techniques) to provide better signals of teacher performance.
We also find that the nonrandom process by which teachers are often assigned to classes biases estimates of teacher performance based on classroom observation scores. This result is consistent with prior research examining the influence of nonrandom student sorting on measured teacher effectiveness based on value-added scores (Rothstein, 2009). Just as earlier evidence from the MET study demonstrates that observation scores of teacher performance alone are limited in their ability to identify effective teachers (Garrett & Steinberg, 2015), this work shows that the use of classroom observation scores for high-stakes decisions may be further complicated by the fact that such scores reflect, to a large extent, the students to whom teachers are assigned.
Why might incoming student achievement exert an important influence on a teacher’s measured performance based on their classroom observation scores? One potential explanation is that, of the observable student characteristics, test scores may incorporate information about students’ cognitive and noncognitive (e.g., behavior) skills. Our finding that aspects of classroom environment are more influenced by incoming student achievement than are aspects of a teacher’s instructional practices supports this interpretation. Ideally, we would observe measures of student behavior from the previous year, such as disciplinary infractions and/or the number (and extent) of disciplinary consequences, including in-school and out-of-school suspensions. The inclusion of such behavioral measures would enable a more direct assessment of how student behavior affects a teacher’s measured performance, while also distinguishing the influence of noncognitive aspects of student performance from cognitive achievement patterns. Yet even this could be complicated by the fact that student behavior in one context (i.e., a given group of peers matched with one teacher) may shift considerably when placed in a different context (i.e., a new group of peers matched with another teacher). Nonetheless, a deeper understanding of how student behavior patterns influence measures of teacher performance based on classroom observation, independent of incoming student achievement on state assessments, is an important area for further research.
To account for the influence that classroom composition has on teachers’ observation scores, researchers have recommended that policymakers and district leaders adjust observation scores based on student demographic characteristics (Whitehurst et al., 2014). If higher achieving students are more easily instructed, then a teacher’s observation scores will, in part, reflect these easier-to-teach students. Ultimately, we are unable to definitively determine whether the incoming achievement effect reflects bias in observation scores due to the composition of a teacher’s class or whether teachers are systematically higher performing when assigned to higher achieving students. However, our findings suggest that even if teachers’ observation scores were adjusted to account for observed differences across classrooms (both demographic and, particularly, achievement characteristics), unmeasured differences due to the nonrandom assignment of teachers to classes based on fixed, unobserved teacher characteristics will continue to bias estimates of teacher performance based on observation scores. Unless observation scores are generated using multiple years of teacher data (and/or across multiple classes within the same year) to adjust for the nonrandom assignment of teachers to classes based on fixed teacher characteristics, annual performance evaluation that relies heavily on observation-based measures will likely mischaracterize teacher effectiveness. We believe this is particularly noteworthy for policymakers and district leaders to consider as they develop and revise teacher evaluation systems and make high-stakes decisions about teachers based, in large measure, on classroom observation scores.
Appendix
Instrumental Variable (IV) Strategy for Uncovering Unobserved Matching Bias
The proposed IV approach uses the twice-lagged achievement of students in teacher j’s classroom c during the 2010–2011 school year (i.e., student achievement from the 2008–2009 school year, denoted as $\text{PriorAchieve}_{c,t-1}$) as an instrument for the once-lagged achievement ($\text{PriorAchieve}_{ct}$) to estimate the effect of incoming achievement on teacher j’s measured performance in year t. To formalize an assessment of nonrandom matching bias due to unobserved teacher characteristics, we begin by writing Equation 3 conditional on the IV ($\text{PriorAchieve}_{c,t-1}$) as follows:

$$
E(\text{Perform}_{jct} \mid \text{PriorAchieve}_{c,t-1}, X) = \beta_1 E(\text{PriorAchieve}_{ct} \mid \text{PriorAchieve}_{c,t-1}) + E(\mu_{jct} \mid \text{PriorAchieve}_{c,t-1}), \tag{A1}
$$
where X contains other observable characteristics of students in teacher j’s classroom in year t as in Equation 3, and the teacher fixed effect (FE) has been subsumed into $\mu_{jct}$, such that $\mu_{jct} = \theta_j + \varepsilon_{jct}$. To obtain an estimate of the quantity of interest ($\beta_1$), we rewrite Equation A1 as follows:

$$
\begin{aligned}
\beta_1 &= \frac{\operatorname{cov}(\text{Perform}_{jct}, \text{PriorAchieve}_{c,t-1})}{\operatorname{cov}(\text{PriorAchieve}_{ct}, \text{PriorAchieve}_{c,t-1})} - \frac{\operatorname{cov}(\mu_{jct}, \text{PriorAchieve}_{c,t-1})}{\operatorname{cov}(\text{PriorAchieve}_{ct}, \text{PriorAchieve}_{c,t-1})} \\
&= \beta^{IV} - \frac{\operatorname{cov}(\theta_j + \varepsilon_{jct}, \text{PriorAchieve}_{c,t-1})}{\operatorname{cov}(\text{PriorAchieve}_{ct}, \text{PriorAchieve}_{c,t-1})} \\
&= \beta^{IV} - (\text{Nonrandom Matching} + \text{Time-Variant Shocks}).
\end{aligned}
\tag{A2}
$$
If all teachers randomly assigned to classrooms (within randomization block) in Year 2 of the Measures of Effective Teaching (MET) study remained in full compliance classrooms, then $\operatorname{cov}(\mu_{jct}, \text{PriorAchieve}_{c,t-1}) = 0$ and $\beta^{IV}$ would provide unbiased estimates of the effect of incoming achievement on measured teacher performance. However, violations of this exclusion restriction could have resulted (a) from either partial compliance or full noncompliance with randomization due to endogenous student mobility or schools simply ignoring the random assignment of teachers to classes or (b) from systematic differences, within randomization block, between teachers who remained in full compliance classrooms and those who experienced partial (or full) noncompliance (e.g., selection into full compliance classrooms within randomization block). In either case, the IV estimates will be biased (such that $\beta^{IV} \neq \beta_1$), and the direction of the bias will depend on the sign of the numerator (the covariance between unobserved time-invariant and time-varying teacher characteristics and twice-lagged classroom achievement) and the denominator (the covariance between once- and twice-lagged classroom achievement, which will be positive). For example, if higher performing teachers are systematically matched to higher achieving classes (assortative matching), the nonrandom matching of teachers to classes will upwardly bias the IV estimate of incoming achievement on measured teacher performance. In addition, time-variant shocks to teacher-class matching will capture the sorting of teachers to classes that may result, for example, from lagged shocks (positive or negative) to teacher performance that are independent of fixed teacher quality, and will introduce additional bias into the IV estimate.
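The ratio-of-covariances form of the IV estimator in Equation A2, and the upward bias under assortative matching, can be sketched on synthetic data. Everything below is an illustrative assumption rather than the authors' procedure: the variable names, the noise scales, the first-stage slope of 0.88 (chosen only to echo the magnitude in Table 8), and the hypothetical `match` parameter, which sets the covariance between the unobserved teacher term and twice-lagged class achievement.

```python
import random


def cov(a, b):
    """Sample covariance of two equal-length sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (len(a) - 1)


def iv_slope(perform, prior_once, prior_twice):
    """Simple IV slope: cov(outcome, instrument) / cov(regressor, instrument)."""
    return cov(perform, prior_twice) / cov(prior_once, prior_twice)


def simulate_iv(match, n=40000, beta=0.14, seed=1):
    """IV estimate on synthetic data; match = cov(mu, instrument) by construction."""
    rng = random.Random(seed)
    perform, once, twice = [], [], []
    for _ in range(n):
        z = rng.gauss(0, 1)                   # twice-lagged class achievement (instrument)
        x = 0.88 * z + rng.gauss(0, 0.5)      # once-lagged class achievement
        mu = match * z + rng.gauss(0, 0.35)   # theta_j + epsilon_jct, matched if match > 0
        perform.append(beta * x + mu)
        once.append(x)
        twice.append(z)
    return iv_slope(perform, once, twice)


print(f"random assignment (match=0.00): {simulate_iv(0.0):.3f}")
print(f"assortative match (match=0.05): {simulate_iv(0.05):.3f}")
```

In this setup the bias term equals `match / 0.88`, so a positive `match` moves the IV estimate above the true slope, while `match = 0` corresponds to the case in which the exclusion restriction holds.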
In contrast to the IV approach, the FE approach conditions on time-invariant teacher heterogeneity ($\theta_j$) and therefore controls for nonrandom matching (e.g., $\gamma_{jc}^{\text{match}}$) that is a function of unobserved, fixed teacher characteristics. We again note that any correlation between time-varying unobserved teacher effects and prior class achievement may still bias the FE estimator. We derive the FE estimator by rewriting Equation 3 as follows:
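$$
\begin{aligned}
\beta_1 &= \frac{\operatorname{cov}(\widetilde{\text{Perform}}_{jc}, \widetilde{\text{PriorAchieve}}_{c})}{\operatorname{var}(\widetilde{\text{PriorAchieve}}_{c})} - \frac{\operatorname{cov}(\tilde{\varepsilon}_{jc}, \widetilde{\text{PriorAchieve}}_{c})}{\operatorname{var}(\widetilde{\text{PriorAchieve}}_{c})} \\
&= \beta^{FE} - \frac{\operatorname{cov}(\tilde{\varepsilon}_{jc}, \widetilde{\text{PriorAchieve}}_{c})}{\operatorname{var}(\widetilde{\text{PriorAchieve}}_{c})} \\
&= \beta^{FE} - (\text{Time-Variant Shocks}),
\end{aligned}
\tag{A3}
$$

where (i) $\widetilde{\text{Perform}}_{jc} = (\text{Perform}_{jct} - \text{Perform}_{jc,t-1})$; (ii) $\widetilde{\text{PriorAchieve}}_{c} = (\text{PriorAchieve}_{ct} - \text{PriorAchieve}_{c,t-1})$; and (iii) $\tilde{\varepsilon}_{jc} = (\varepsilon_{jct} - \varepsilon_{jc,t-1})$.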
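The differencing step behind Equation A3 can be made explicit. The sketch below assumes that Equation 3 (not reproduced in this section) has the linear form $\text{Perform}_{jct} = \beta_1 \text{PriorAchieve}_{ct} + \theta_j + \varepsilon_{jct}$, with the classroom controls in X suppressed for readability, and it writes the year-to-year difference with a tilde. Subtracting the prior-year equation from the current-year equation removes the fixed teacher term:

$$
\begin{aligned}
\text{Perform}_{jct} - \text{Perform}_{jc,t-1} &= \beta_1 (\text{PriorAchieve}_{ct} - \text{PriorAchieve}_{c,t-1}) + (\theta_j - \theta_j) + (\varepsilon_{jct} - \varepsilon_{jc,t-1}) \\
\widetilde{\text{Perform}}_{jc} &= \beta_1 \widetilde{\text{PriorAchieve}}_{c} + \tilde{\varepsilon}_{jc}.
\end{aligned}
$$

Taking the covariance of both sides with $\widetilde{\text{PriorAchieve}}_{c}$ and dividing by $\operatorname{var}(\widetilde{\text{PriorAchieve}}_{c})$ then yields the first line of Equation A3. Only the time-varying error remains as a source of bias, which is why no nonrandom matching term appears.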
Putting together the IV and FE results from Equations A2 and A3, respectively, we get

$$
\begin{aligned}
\beta^{IV} - (\text{Nonrandom Matching} + \text{Time-Variant Shocks}) &= \beta^{FE} - (\text{Time-Variant Shocks}) \\
(\text{Nonrandom Matching}) &= \beta^{IV} - \beta^{FE}.
\end{aligned}
\tag{A4}
$$
If, as shown in Equation A4, matching on fixed teacher characteristics is operative, then $\beta^{IV} \neq \beta^{FE}$. We assess the presence of nonrandom matching bias among teachers in the IV sample (i.e., teachers in Grades 5–9 with twice-lagged student test score data). Note that we omit Grade 4 teachers from the IV sample because twice-lagged student test scores (test scores from Grade 2) are unavailable, as high-stakes state testing begins in Grade 3. We further note that this formalization relies on the assumption that the time-variant shocks captured through the two approaches are approximately equal (and can thus cancel out) as we draw on information over two consecutive school years. Over a longer time frame, this assumption could be more problematic.
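As a worked illustration of Equation A4, the implied nonrandom matching term can be computed directly from the ELA point estimates in Table 8. This is only arithmetic on the published coefficients: it ignores their standard errors (which are large in the Noncomply columns), it relies on the assumption that the time-variant shocks cancel, and the dictionary layout and function name are ours.

```python
# Prior-achievement coefficients for the ELA sample, copied from Table 8.
ela_estimates = {
    "Full Comply": {"IV": 0.04, "FE": 0.04},
    "Partial Comply": {"IV": 0.17, "FE": 0.14},
    "Noncomply": {"IV": 0.13, "FE": 0.26},
}


def implied_matching(iv: float, fe: float) -> float:
    """Equation A4: (Nonrandom Matching) = beta_IV - beta_FE, in FFT points."""
    return round(iv - fe, 2)


implied = {
    status: implied_matching(est["IV"], est["FE"])
    for status, est in ela_estimates.items()
}

for status, value in implied.items():
    print(f"{status}: {value:+.2f} FFT points")
```

The zero difference for full compliance classes and the positive difference for partial compliance classes correspond to the pattern discussed in the main text. The negative Noncomply difference should be read with caution given the imprecision of those estimates.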
As the IV sample includes teachers in classrooms with endogenous student sorting (partial compliance) and those nonrandomly assigned to classes (full noncompliance), along with teachers in full compliance classes, we compare the IV and FE results by classroom compliance status. If, among full compliance classes, the randomization held and classroom composition variables were distributed randomly across teachers, then the IV results should not differ from the FE results, as randomization implies (from Equation A1) that $\operatorname{cov}(\mu_{jct}, \text{PriorAchieve}_{c,t-1}) = 0$. Among partial and full noncompliance classrooms, any difference between the IV and FE estimates will reveal the extent (and direction) of the nonrandom matching of teachers to classes.
Acknowledgments
The authors thank Matthew Kraft, Rebecca Maynard, John Papay, Jonah Rockoff, Brian Rowan, Andrew Wayne, and participants at the Measures of Effective Teaching (MET) Early Career Scholars Grantee Meeting for helpful comments and suggestions, and Jennifer Moore for editorial assistance.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Funding from the National Academy of Education MET Early Career Grantee program is gratefully acknowledged.
Notes
1. The randomization was conducted within randomization blocks that consisted of grade-subject combinations of classes within a school, where at least two teachers must have been teaching in the same grade-subject combination (e.g., sixth grade mathematics) to be included in the randomization sample. We further note that the class rosters were established by schools, and teachers were randomly assigned to these rosters by the Measures of Effective Teaching (MET) researchers. Students were not randomly assigned to specific teachers nor were students randomly assigned to rosters. Please see White and Rowan (2012) for more detail on the Year 2 randomization sample.
2. The MET data do not allow us to observe teachers who did not participate in the MET study. Furthermore, the MET data provide classroom composition variables only for teachers’ actual classes and not for the class rosters to which teachers were randomly assigned. Classroom composition variables for the randomly assigned rosters are unavailable as demographic and achievement data for students who ended up with a non-MET teacher (either in the same school or in another district school) after randomization are not observed in the data. Moreover, we do not observe the proportion of teachers at a subject-grade-school level that participated in the MET study.
3. Of the six MET districts, we include five districts because one district did not provide data on students’ free or reduced-price lunch status.
4. Our final teacher sample is realized as follows: Of the 1,887 teachers active during both Year 1 (2009–2010) and Year 2 (2010–2011) of the MET study, we include a subset of those teachers who taught math and/or reading in Grades 4 to 9, reducing the teacher sample to 1,629 teachers. Of these 1,629 teachers, we were able to match 1,434 teachers to data indicating whether they were part of the Year 2 randomization sample, determining that 424 did not participate. As classroom observation scores (discussed below) were unavailable in the second year of the MET study for these 424 teachers, the teacher sample was further reduced to 1,010 teachers, all of whom participated in the Year 2 randomization sample. Of these 1,010 teachers, 29 did not have classroom observation scores in the same subject area (i.e., math and/or English Language Arts [ELA]) across both study years, reducing the analytic sample to 981 teachers. Of these 981 teachers, we drop an additional 176 teachers due to missing classroom composition variables, as one school district did not report data on students’ free or reduced-price lunch status.
5. We discuss our approach for assessing the fidelity of the randomization in the “Empirical Approach” section. We note that we are able to generate valid causal estimates of the effect of incoming achievement on measured teacher performance for the ELA, although not the math, sample of teachers, as prior achievement and classroom composition (net of randomization block fixed effects) do not jointly predict a teacher’s prerandomization observation scores among the subset of randomization blocks containing only full compliance classrooms.
6. The full Framework for Teaching (FFT) instrument was not used in the MET study, as not all scoring areas are amenable to observation through video data. Please see Garrett and Steinberg (2015) for a detailed description of the FFT domains and components included in a teacher’s FFT observation score.
7. By randomly selecting one of a departmentalized teacher’s class sections from the first year of the MET study, his or her observation scores will be based on two classroom observations rather than four, as with generalist teachers. As a consequence, additional noise will be introduced into estimates of classroom context on measured performance for this group of teachers.
8. The joint null hypothesis may be expressed as $H_0: \varsigma_X = \alpha_{\text{PriorAchieve}} = 0$, where $\varsigma$ represents the parameters associated with the classroom composition variables contained in X as in Equation 4, and all significance tests are adjusted for randomization block fixed effects. The exogeneity condition (as it relates to incoming achievement) may be expressed as

$$
E(\text{Perform}_{j,t-1} \mid \text{PriorAchieve}_{ct}, X_{ct}, \upsilon_r) = E(\text{Perform}_{j,t-1} \mid \upsilon_r).
$$

9. The four practices contained within the classroom environment domain (Domain 2 of the Danielson FFT) are (2a) creating an environment of respect and rapport between the teacher and the students and among the students; (2b) establishing a culture for learning, including the norms that govern interactions among individuals in the classroom and the physical appearance of the class; (2c) managing classroom procedures, including the routines and procedures a teacher establishes with the class for how to move from a small to a large group or how the students are expected to line up at the end of the day; and (2d) managing student behavior, including the establishment (through the coconstruction of standards of conduct among the teacher and students) and maintenance of an orderly classroom environment. The instruction domain (Domain 3 of the Danielson FFT) captures the following four practices: (3a) communicating learning expectations to students, providing clear directions for classroom activities, and clearly presenting concepts and information to students; (3b) using questioning and discussion techniques as instructional strategies to deepen student understanding; (3c) engaging students in learning by incorporating activities and assignments into the class that ask students to think critically, to make connections, to formulate and test hypotheses, and to draw conclusions; and (3d) using assessment of student learning to drive instruction through monitoring of student understanding, offering feedback to students, and (when necessary) making adjustments to the lesson. We further note that Component 3b captures the only instructional strategies specifically referred to in Danielson’s FFT. See Garrett and Steinberg (2015) for more detail on each of the eight Danielson FFT components.
10. Chow tests confirm that the process by which measured teacher performance depends on classroom composition is statistically significantly different across generalist and departmentalized teachers for both the ELA, χ²(13) = 37.49, p value = .0003, and math, χ²(13) = 35.70, p value = .0007, samples.
11. We test for whether the randomization held among full compliance classes by estimating Equation 4 for both the ELA and math instrumental variables (IV) samples. For the 219 ELA teachers in full compliance classes, we cannot reject the joint null hypothesis from this covariate balance test (F = 0.70, p value = .751). For the 155 math teachers in full compliance classes, however, we do reject the joint null hypothesis (F = 1.92, p value = .039), suggesting that there was nonrandom selection of math teachers into full compliance classrooms.
References
Aaronson, D., Barrow, L., & Sander, W. (2007). Teachers and student achievement in the Chicago public high schools. Journal of Labor Economics, 25, 95–135.
Chaplin, D., Gill, B., Thompkins, A., & Miller, H. (2014). Professional practice, student surveys, and value-added: Multiple measures of teacher effectiveness in the Pittsburgh Public Schools (REL 2014–024). Washington, DC: U.S. Department of Education, Institute of Education Sciences, National Center for Education Evaluation and Regional Assistance, Regional Educational Laboratory Mid-Atlantic. Retrieved from http://ies.ed.gov/ncee/edlabs
Clotfelter, C., Ladd, H., & Vigdor, J. (2006). Teacher-student matching and the assessment of teacher effectiveness. The Journal of Human Resources, 41, 778–820.
Dee, T., & Wyckoff, J. (2013). Incentives, selection, and teacher performance: Evidence from IMPACT. Journal of Policy Analysis and Management, 34, 267–297.
Garrett, R., & Steinberg, M. (2015). Examining teacher effectiveness using classroom observation scores: Evidence from the randomization of teachers to students. Educational Evaluation and Policy Analysis, 37, 224–242.
Glazerman, S., Loeb, S., Goldhaber, D., Staiger, D., Raudenbush, S., & Whitehurst, G. (2010). Evaluating teachers: The important role of value-added. Washington, DC: The Brookings Institution.
Goldhaber, D. D. (2002). The mystery of good teaching. Education Next, 2(1), 50–55.
Goldhaber, D. D., & Hansen, M. (2010). Is it just a bad class? Assessing the stability of measured teacher performance (CEDR Working Paper No. 2010-3). Seattle: University of Washington.
Goldring, E., Grissom, J. A., Rubin, M., Neumerski, C. M., Cannata, M., Drake, T., & Schuermann, P. (2015). Make room value added: Principals’ human capital decisions and the emergence of teacher observation data. Educational Researcher, 44(2), 96–104.
Jackson, C. K., & Bruegmann, E. (2009). Teaching students and teaching each other: The importance of peer learning for teachers. American Economic Journal: Applied Economics, 1(4), 85–108.
Kalogrides, D., & Loeb, S. (2013). Different teachers, different peers: The magnitude of student sorting within schools. Educational Researcher, 42, 304–316.
Kalogrides, D., Loeb, S., & Beteille, T. (2013). Systemic sorting: Teacher characteristics and class assignments. Sociology of Education, 86, 103–123.
Kane, T., McCaffrey, D., Miller, T., & Staiger, D. (2013). Have we identified effective teachers? Validating measures of effective teaching using random assignment (Bill & Melinda Gates Foundation MET Project research paper). Seattle, WA: Bill & Melinda Gates Foundation.
Kraft, M. A., & Papay, J. P. (2014). Can professional environments in schools promote teacher development? Explaining heterogeneity in returns to teaching experience. Educational Evaluation and Policy Analysis, 36, 476–500.
Loeb, S., Beteille, T., & Kalogrides, D. (2012). Effective schools: Teacher hiring, assignment, development and retention. Education Finance and Policy, 7, 269–304.
McCaffrey, D. F., Sass, T. R., Lockwood, J. R., & Mihaly, K. (2009). The intertemporal variability of teacher effect estimates. Education Finance and Policy, 4, 572–606.
Mihaly, K., McCaffrey, D. F., Staiger, D., & Lockwood, J. R. (2013). A composite estimator of effective teaching (Technical report for the Measures of Effective Teaching project). Seattle, WA: Bill & Melinda Gates Foundation.
Monk, D. H. (1987). Assigning elementary pupils to their teachers. The Elementary School Journal, 88, 166–187.
National Governors Association Center for Best Practices & Council of Chief State School Officers. (2010). Common Core State Standards (Mathematics). Washington, DC: Author.
Papay, J. (2011). Different tests, different answers: The stability of teacher value-added estimates across outcome measures. American Educational Research Journal, 48, 163–193.
Rivkin, S. G., Hanushek, E. A., & Kain, J. F. (2005). Teachers, schools and academic achievement. Econometrica, 73, 417–458.
Rockoff, J. E. (2004). The impact of individual teachers on student achievement: Evidence from panel data. American Economic Review, 94, 247–252.
Rothstein, J. (2009). Student sorting and bias in value-added estimation: Selection on observables and unobservables. Education Finance and Policy, 4, 537–571.
Rothstein, J. (2010). Teacher quality in educational production: Tracking, decay, and student achievement. The Quarterly Journal of Economics, 125, 175–214.
Sartain, L., & Steinberg, M. (in press). Teachers’ labor market responses to performance evaluation reform: Experimental evidence from Chicago public schools. Journal of Human Resources.
Sass, T., Hannaway, J., Xu, Z., Figlio, D., & Feng, L. (2012). Value added of teachers in high-poverty schools and lower-poverty schools. Journal of Urban Economics, 72, 104–122.
Steinberg, M., & Donaldson, M. (in press). The new educational accountability: Understanding the landscape of teacher evaluation in the post-NCLB era. Education Finance and Policy.
Steinberg, M., & Garrett, R. (2015). Teacher effectiveness, peer quality and student achievement: Evidence from student sorting following teacher random assignment (Working paper).
Steinberg, M., & Sartain, L. (2015). Does teacher evaluation improve school performance? Experimental evidence from Chicago’s Excellence in Teaching Project. Education Finance and Policy, 10, 535–572.
Taylor, E. S., & Tyler, J. H. (2012). The effect of evaluation on teacher performance. American Economic Review, 102, 3628–3651.
Watson, J. G., Kraemer, S. B., & Thorn, C. A. (2009). The other 69 percent. Washington, DC: Center for Educator Compensation Reform, U.S. Department of Education, Office of Elementary and Secondary Education.
Weisberg, D., Sexton, S., Mulhern, J., & Keeling, D. (2009). The widget effect: Our national failure to acknowledge and act on differences in teacher effectiveness. Brooklyn, NY: The New Teacher Project.
White, M., & Rowan, B. (2012). A user guide to the “core study” data files available to MET early career grantees. Ann Arbor: Inter-University Consortium for Political and Social Research, The University of Michigan.
Whitehurst, G., Chingos, M., & Lindquist, K. (2014). Evaluating teachers with classroom observations: Lessons learned in four districts. Washington, DC: Brown Center on Education Policy at Brookings.
Authors
MATTHEW P. STEINBERG is an assistant professor in the Graduate School of Education at the University of Pennsylvania. His research focuses on teacher evaluation and human capital, urban school reform, school finance, and school discipline policy and safety.
RACHEL GARRETT is a senior researcher at American Institutes for Research. Her research focuses on educator effectiveness, English language learners, and mathematics instruction.
Manuscript received February 5, 2015
First revision received June 8, 2015
Second revision received August 3, 2015
Third revision received September 9, 2015
Accepted October 16, 2015