
MET project

Ensuring Fair and Reliable Measures of Effective Teaching

Culminating Findings from the MET Project's Three-Year Study

Policy and Practice Brief


ABOUT THIS REPORT: This non-technical research brief for policymakers and practitioners summarizes recent analyses from the Measures of Effective Teaching (MET) project on identifying effective teaching while accounting for differences among teachers' students, on combining measures into composites, and on assuring reliable classroom observations.1

Readers who wish to explore the technical aspects of these analyses may go to www.metproject.org to find the three companion research reports: Have We Identified Effective Teachers? by Thomas J. Kane, Daniel F. McCaffrey, Trey Miller, and Douglas O. Staiger; A Composite Estimator of Effective Teaching by Kata Mihaly, Daniel F. McCaffrey, Douglas O. Staiger, and J.R. Lockwood; and The Reliability of Classroom Observations by School Personnel by Andrew D. Ho and Thomas J. Kane.

Earlier MET project briefs and research reports, also on the website, include:

Working with Teachers to Develop Fair and Reliable Measures of Teaching (2010). A white paper describing the rationale for and components of the MET project's study of multiple measures of effective teaching.

Learning about Teaching: Initial Findings from the Measures of Effective Teaching Project (2010). A research report and non-technical policy brief with the same title on analysis of student-perception surveys and student achievement gain measures.

Gathering Feedback for Teaching: Combining High-Quality Observations with Student Surveys and Achievement Gains (2012). A research report and policy/practitioner brief with the same title with initial findings on the reliability of classroom observations and implications for combining measures of teaching.

Asking Students about Teaching: Student Perception Surveys and Their Implementation (2012). A non-technical brief for policymakers and practitioners on the qualities of well-designed student surveys and implications for their implementation for teacher feedback and evaluation.

In addition, on www.metproject.org readers will find a set of principles to guide the design of teacher evaluation and support systems based on the work of the MET project, its partners, and other leading systems and organizations: Feedback for Better Teaching: Nine Principles for Using Measures of Effective Teaching (2013).

January 2013

ABOUT THE MET PROJECT: The MET project is a research partnership of academics, teachers, and education organizations committed to investigating better ways to identify and develop effective teaching. Funding is provided by the Bill & Melinda Gates Foundation.

The approximately 3,000 MET project teachers who volunteered to open up their classrooms for this work are from the following districts: the Charlotte-Mecklenburg Schools, the Dallas Independent Schools, the Denver Public Schools, the Hillsborough County Public Schools, the Memphis Public Schools, the New York City Schools, and the Pittsburgh Public Schools.

Partners include representatives of the following institutions and organizations: American Institutes for Research, Cambridge Education, University of Chicago, The Danielson Group, Dartmouth College, Educational Testing Service, Empirical Education, Harvard University, National Board for Professional Teaching Standards, National Math and Science Initiative, New Teacher Center, University of Michigan, RAND, Rutgers University, University of Southern California, Stanford University, Teachscape, University of Texas, University of Virginia, University of Washington, and Westat.

[Map: MET Project Teachers, by district — Denver, Memphis, Hillsborough County, Dallas, Charlotte-Mecklenburg, Pittsburgh, New York City]

[Cover image: MET project report, Feedback for Better Teaching: Nine Principles for Using Measures of Effective Teaching]


Contents

Executive Summary 3

Can Measures of Effective Teaching Identify Teachers Who Better Help Students Learn? 6

How Much Weight Should Be Placed on Each Measure of Effective Teaching? 10

How Can Teachers Be Assured Trustworthy Results from Classroom Observations? 16

What We Know Now 20

Endnotes 23


Executive Summary

States and districts have launched unprecedented efforts in recent years to build new feedback and evaluation systems that support teacher growth and development. The goal is to improve practice so that teachers can better help their students graduate from high school ready to succeed in college and beyond.

These systems depend on trustworthy information about teaching effectiveness: information that recognizes the complexity of teaching and is trusted by both teachers and administrators. To that end, the Measures of Effective Teaching (MET) project set out three years ago to investigate how a set of measures could identify effective teaching fairly and reliably. With the help of 3,000 teacher volunteers who opened up their classrooms to us, along with scores of academic and organizational partners, we have studied, among other measures:

Classroom observation instruments, including both subject-specific and cross-subject tools, that define discrete teaching competencies and describe different levels of performance for each;

Student perception surveys that assess key characteristics of the classroom environment, including supportiveness, challenge, and order; and

Student achievement gains on state tests and on more cognitively challenging assessments.

We have reported findings as we learned them in order to provide states and districts with evidence-based guidance to inform their ongoing work. In our initial report in 2010 (Learning about Teaching), we found that a well-designed student perception survey can provide reliable feedback on aspects of teaching practice that are predictive of student learning. In 2012 (Gathering Feedback for Teaching), we presented similar results for classroom observations. We also found that an accurate observation rating requires two or more lessons, each scored by a different certified observer. With each analysis we have better understood the particular contribution that each measure makes to a complete picture of effective teaching and how those measures should be implemented to provide teachers with accurate and meaningful feedback.

This final brief from the MET project's three-year study highlights new analyses that extend and deepen the insights from our previous work. These studies address three fundamental questions that face practitioners and policymakers engaged in creating teacher support and evaluation systems.


The Questions

Can measures of effective teaching identify teachers who better help students learn?

Despite decades of research suggesting that teachers are the most important in-school factor affecting student learning, an underlying question remains unanswered: Are seemingly more effective teachers truly better than other teachers at improving student learning, or do they simply have better students?

Ultimately, the only way to resolve that question was by randomly assigning students to teachers to see if teachers previously identified as more effective actually caused those students to learn more. That is what we did for a subset of MET project teachers. Based on data we collected during the 2009–10 school year, we produced estimates of teaching effectiveness for each teacher. We adjusted our estimates to account for student differences in prior test scores, demographics, and other traits. We then randomly assigned a classroom of students to each participating teacher for 2010–11.

Following the 2010–11 school year we asked two questions: First, did students actually learn more when randomly assigned to the teachers who seemed more effective when we evaluated them the prior year? And, second, did the magnitude of the difference in student outcomes following random assignment correspond with expectations?

How much weight should be placed on each measure of effective teaching?

While using multiple measures to provide feedback to teachers, many states and districts also are combining measures into a single index to support decisionmaking. To date, there has been little empirical evidence to inform how systems might weight each measure within a composite to support improvements in teacher effectiveness. To help fill that void, we tasked a group of our research partners to use data from MET project teachers to build and compare composites using different weights and different outcomes.

How can teachers be assured trustworthy results from classroom observations?

Our last report on classroom observations prompted numerous questions from practitioners about how best to use resources to produce quality information for feedback on classroom practice. For example: How many observers are needed to achieve sufficient reliability from a given number of observations? Do all observations need to be the same length to have confidence in the results? And what is the value of adding observers from outside a teacher's own school? To help answer these questions, we designed a study in which administrators and peer observers produced more than 3,000 scores for lessons taught by teachers within one MET project partner school district.

Key findings from those analyses:

1. Effective teaching can be measured. We collected measures of teaching during 2009–10. We adjusted those measures for the backgrounds and prior achievement of the students in each class. But, without random assignment, we had no way to know if the adjustments we made were sufficient to discern the markers of effective teaching from the unmeasured aspects of students' backgrounds.

Feedback and evaluation systems depend on trustworthy information about teaching effectiveness to support improvement in teachers' practice and better outcomes for students.


In fact, we learned that the adjusted measures did identify teachers who produced higher (and lower) average student achievement gains following random assignment in 2010–11. The data show that we can identify groups of teachers who are more effective in helping students learn. Moreover, the magnitude of the achievement gains that teachers generated was consistent with expectations.

In addition, we found that more effective teachers not only caused students to perform better on state tests, but they also caused students to score higher on other, more cognitively challenging assessments in math and English.

2. Balanced weights indicate multiple aspects of effective teaching. A composite with weights between 33 percent and 50 percent assigned to state test scores demonstrated the best mix of low volatility from year to year and ability to predict student gains on multiple assessments. The composite that best indicated improvement on state tests heavily weighted teachers' prior student achievement gains based on those same tests. But composites that assigned 33 percent to 50 percent of the weight to state tests did nearly as well and were somewhat better at predicting student learning on more cognitively challenging assessments.

Multiple measures also produce more consistent ratings than student achievement measures alone. Estimates of teachers' effectiveness are more stable from year to year when they combine classroom observations, student surveys, and measures of student achievement gains than when they are based solely on the latter.

3. Adding a second observer increases reliability significantly more than having the same observer score an additional lesson. Teachers' observation scores vary more from observer to observer than from lesson to lesson. Given the same total number of observations, including the perspectives of two or more observers per teacher greatly enhances reliability. Our study of video-based observation scoring also revealed that:

a. Additional shorter observations can increase reliability. Our analysis suggests that having additional observers watch just part of a lesson may be a cost-effective way to boost reliability by including additional perspectives.

b. Although school administrators rate their own teachers somewhat higher than do outside observers, how they rank their teachers' practice is very similar, and teachers' own administrators actually discern bigger differences in teaching practice, which increases reliability.

c. Adding observations by observers from outside a teacher's school to those carried out by a teacher's own administrator can provide an ongoing check against in-school bias. This could be done for a sample of teachers rather than all, as we said in Gathering Feedback for Teaching.

The following pages further explain these findings and the analyses that produced them.


Can Measures of Effective Teaching Identify Teachers Who Better Help Students Learn?

The teachers identified as more effective caused students to learn more; those identified as less effective caused students to learn less. We can say they caused more (or less) student learning because when we randomly assigned teachers to students during the second year, we could be confident that any subsequent differences in achievement were being driven by the teachers, not by the unmeasured characteristics of their students. In addition, the magnitude of the gains they caused was consistent with our expectations.

Figure 2 illustrates just how well the measures of effective teaching predicted student achievement following random assignment. The diagonal line represents perfect prediction. Dots above the diagonal line indicate groups of teachers whose student outcomes following random assignment were better than predicted. Dots below the line indicate groups of teachers whose student outcomes following random assignment were worse than predicted.

Figure 1: Putting Measures of Effective Teaching to the Test with Random Assignment

Do measures of teaching really identify teachers who help students learn more, or do seemingly more effective teachers just get better students? To find out, the MET project orchestrated a large-scale experiment with MET project teachers to see if teachers identified as more effective than their peers would have greater student achievement gains even with students who were assigned randomly.

To do so, the MET project first estimated teachers' effectiveness using multiple measures from the 2009–10 school year. As is common in schools, some teachers had been assigned students with stronger prior achievement than others. In assessing each teacher's practice that year, the project controlled for students' prior achievement and demographic characteristics. But there may have been other differences among students as well. So for the following school year (2010–11), principals created rosters of students for each class in the study, and then researchers randomly assigned each roster to a participating teacher from among those who could teach the class.

At the end of the 2010–11 school year, MET project analysts checked to see if students taught by teachers identified as more effective than their colleagues actually had greater achievement gains than students taught by teachers identified as less effective. They also checked to see how well actual student achievement gains for teachers matched predicted gains.
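To make the grouping logic concrete, here is a minimal sketch in Python (not the MET project's actual analysis code) of the comparison just described: sort teachers by their prior-year predicted impact, split them into 5 percent groups, and compare each group's mean predicted gain with the mean gain its students actually achieved after random assignment. The function name and the simulated data are illustrative assumptions.

```python
# Minimal sketch of the predicted-vs-actual comparison; all data illustrative.
import numpy as np

def group_predicted_vs_actual(predicted, actual, n_groups=20):
    """Mean predicted and actual gains for each 5 percent group of teachers."""
    order = np.argsort(predicted)              # sort teachers by predicted impact
    groups = np.array_split(order, n_groups)   # 20 groups of roughly 5% each
    mean_predicted = [float(predicted[g].mean()) for g in groups]
    mean_actual = [float(actual[g].mean()) for g in groups]
    return mean_predicted, mean_actual

# Simulated stand-in data: 200 teachers, gains in student-level standard deviations.
rng = np.random.default_rng(0)
predicted = rng.normal(0.0, 0.10, size=200)            # prior-year estimates
actual = predicted + rng.normal(0.0, 0.05, size=200)   # post-randomization gains
for p, a in zip(*group_predicted_vs_actual(predicted, actual)):
    print(f"predicted {p:+.3f}  actual {a:+.3f}")      # dots near the diagonal
```

Plotting each (predicted, actual) pair against a 45-degree line reproduces the kind of chart shown in Figure 2.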


Each dot represents 5 percent of the teachers in the analysis, sorted based on their predicted impact on student achievement.4

As seen in Figure 2, in both math and English language arts (ELA), the groups of teachers with greater predicted impacts on student achievement generally had greater actual impacts on student achievement following random assignment. Further, the actual impacts are approximately in line with the predicted impacts.5 We also found that teachers who we identified as being effective in promoting achievement on the state tests also generated larger gains on the supplemental tests administered in spring 2011.

Based on our analysis, we can unambiguously say that school systems should account for the prior test scores of students. When we removed this control, we wound up predicting much larger differences in achievement than actually occurred, indicating that student assignment biased the results. However, our analysis could not shed as much light on the need to control for demographics or peer effects, that is, the average prior achievement and demographics of each student's classmates. Although we included those controls, we cannot determine from our evidence whether school systems should include them. Our results were ambiguous on that score.

Figure 2: Effectiveness Measures Identify Teachers Who Help Students Learn More

[Two scatterplots: Actual and Predicted Achievement of Randomized Classrooms, for math and for English language arts. Each plots predicted achievement against actual achievement, in standard deviations, with a dashed Actual = Predicted diagonal.]

These charts compare the actual 2010–11 school year achievement gains for randomly assigned classrooms with the results that were predicted based on the earlier measures of teaching effectiveness. Each dot represents the combination of actual and estimated student performance for 5 percent of the teachers in the study, grouped by the teachers' estimated effectiveness. The dashed line shows where the dots would be if the actual and predicted gains matched perfectly.

On average, students of teachers with higher teacher effectiveness estimates outperformed students of teachers with lower teacher effectiveness estimates. Moreover, the magnitude of students' actual gains largely corresponded with gains predicted by their effectiveness measured the previous year. Both the actual and predicted achievement are reported relative to the mean in the randomization block. That is, a zero on either axis implies that the value was no different from the mean for the small group of teachers in a grade, subject, and school within which class lists were randomized.

Impacts are reported in student-level standard deviations. A .25 standard deviation difference is roughly equivalent to a year of schooling. The predicted impacts are adjusted downward to account for incomplete compliance with randomization.

To avoid over-interpretation of these results, we hasten to add two caveats: First, a prediction can be correct on average but still be subject to measurement error. Our predictions of students' achievement following random assignment were correct on average, but within every group there were some teachers whose students performed better than predicted and some whose students performed worse. Second, we could not, as a practical matter, randomly assign students or teachers to a different school site. As a result, our study does not allow us to investigate bias in teacher effectiveness measures arising from student sorting between different schools.6

Nonetheless, our analysis should give heart to those who have invested considerable effort to develop practices and policies to measure and support effective teaching. Through this large-scale study involving random assignment of teachers to students, we are confident that we can identify groups of teachers who are comparatively more effective than their peers in helping students learn. Great teaching does make a difference.

We can unambiguously say that school systems should adjust their achievement gain measures to account for the prior test scores of students. When we removed this control, we wound up predicting much larger differences in achievement than actually occurred.


How Much Weight Should Be Placed on Each Measure of Effective Teaching?7

Teaching is too complex for any single measure of performance to capture it accurately. Identifying great teachers requires multiple measures. While states and districts embrace multiple measures for targeted feedback, many also are combining measures into a single index, or composite. An index or composite can be a useful summary of complex information to support decisionmaking. The challenge is to combine measures in ways that support effective teaching while avoiding such unintended consequences as too narrow a focus on one aspect of effective teaching.

To date, there has been little empirical evidence to suggest a rationale for particular weights. The MET project's report Gathering Feedback for Teaching showed that equally weighting three measures, including achievement gains, did a better job predicting teachers' success (across several student outcomes) than teachers' years of experience and master's degrees. But that work did not attempt to determine optimal weights for composite measures.

Over the past year, a team of MET project researchers from the RAND Corporation and Dartmouth College used MET project data to compare differently weighted composites and study the implications of different weighting schemes for different outcomes. As in the Gathering Feedback for Teaching report, these composites included student achievement gains based on state assessments, classroom observations, and student surveys. The researchers estimated the ability of variously weighted composites to produce consistent results and accurately forecast teachers' impact on student achievement gains on different types of tests.

The goal was not to suggest a specific set of weights but to illustrate the trade-offs involved when choosing weights. Assigning significant weight to one measure might yield the best predictor of future performance on that measure. But heavily weighting a single measure may incentivize teachers to focus too narrowly on a single aspect of effective teaching and neglect its other important aspects. For example, a singular focus on state tests could displace gains on other harder-to-measure outcomes. Moreover, if the goal is for students to meet a broader set of learning objectives than are measured by a state's tests, then too heavily weighting that test could make it harder to identify teachers who are producing other valued outcomes.



Composites Compared

The research team compared four different weighting models, illustrated in Figure 3: (Model 1) the best predictor of state achievement test gains (with weights calculated to maximize the ability to predict teachers' student achievement gains on state tests, resulting in 65+ percent of the weight being placed on the student achievement gains across grades and subjects); (Model 2) a composite that assigned 50 percent of the weight to students' state achievement test gains; (Model 3) a composite that applied equal weights to each measure; and (Model 4) one that gave 50 percent to observation ratings and 25 percent each to achievement gains and student surveys. The weights that best predict state tests, shown for Model 1 in Figure 3, were calculated to predict gains on state ELA tests at the middle school level, which assigns a whopping 81 percent of the weight to prior gains on the same tests (best-predictor weights for other grades and subjects are in Table 1 on page 14).
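As a concrete illustration, the sketch below (Python; the variable names are ours, not the researchers') combines a teacher's three standardized scores into a single composite under each of the four weighting models, using the Figure 3 weights, with Model 1 shown at its middle school ELA values. It assumes the three scores have already been put on a common standardized scale.

```python
# Minimal sketch of the four composite weighting models from Figure 3.
# Scores are assumed already standardized; all teacher values are illustrative.
MODELS = {
    "Model 1 (best predictor, MS ELA)": {"state_gains": 0.81, "observations": 0.02, "surveys": 0.17},
    "Model 2 (50% state tests)":        {"state_gains": 0.50, "observations": 0.25, "surveys": 0.25},
    "Model 3 (equal weights)":          {"state_gains": 1/3,  "observations": 1/3,  "surveys": 1/3},
    "Model 4 (50% observations)":       {"state_gains": 0.25, "observations": 0.50, "surveys": 0.25},
}

def composite(scores, weights):
    """Weighted sum of a teacher's scores on the three measures."""
    return sum(weights[measure] * scores[measure] for measure in weights)

# Example teacher: modest state test gains, strong observation ratings.
teacher = {"state_gains": 0.20, "observations": 0.80, "surveys": 0.50}
for name, weights in MODELS.items():
    print(f"{name}: {composite(teacher, weights):+.2f}")
```

The same teacher ranks differently under each model, which is exactly the trade-off the researchers set out to quantify.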

Figure 4 compares the different weighting schemes on three criteria, using middle school ELA as an example (see Table 1 on page 14 for other grades and subjects). The first is predicting teachers' student achievement gains on state assessments. A correlation of 1.0 would indicate perfect accuracy in predicting teachers' student achievement gains on state tests.

Heavily weighting a single measure may incentivize teachers to focus too narrowly on a single aspect of effective teaching and neglect its other important aspects. ... [If] the goal is for students to meet a broader set of learning objectives than are measured by a state's tests, then too heavily weighting that test could make it harder to identify teachers who are producing other valued outcomes.

Figure 3: Four Ways to Weight

[Four pie charts showing composite weights across achievement gains on state tests, classroom observations, and student surveys. Model 1 (weighted for maximum accuracy in predicting gains on state tests*): 81% state tests, 2% observations, 17% student surveys. Model 2 (50% weight on state test results): 50% / 25% / 25%. Model 3 (equal weights): 33% / 33% / 33%. Model 4 (50% weight on observations): 25% state tests, 50% observations, 25% student surveys.]

*Weights shown for Model 1 were calculated to best predict gains on state tests for middle school English language arts. Similar best-predictor weights for other grades and subjects are in Table 1 on page 14.

These charts illustrate four ways to construct a composite measure of effective teaching. Each model uses different weights but includes the same components: student achievement gains on the state tests, student perception surveys, and classroom observations. Model 1 uses the weights that would best predict a teacher's impact on state test scores. Across grades and subjects, the best predictor model assigns 65 percent or more of the weight to a teacher's prior state test gains. Models 2–4 are not based on maximizing any particular outcome. They approximate different weighting schemes used by states and districts, with each model placing progressively less weight on student achievement gains on state tests.


By definition, the best composite in this regard is Model 1, the model weighted for maximizing accuracy on state test results. Models 2–4 show the effect of reducing weights on student achievement gains on state tests. As shown for middle school ELA, reducing weights on student achievement gains decreases the power to predict future student achievement gains on state tests from 0.69 to 0.63 with Model 2; to 0.53 with Model 3; and to 0.43 with Model 4. Other grades and subjects showed similar patterns, as indicated in Table 1 on page 14.

While it is true that the state tests are limited and that schools should value other outcomes, observations and student surveys may not be more correlated with those other outcomes than the state tests are. As a result, we set out to test the strength of each model's correlation with another set of test outcomes. The middle set of bars in Figure 4 compares the four models (see Figure 3), each using state test results to measure achievement gains, on how well they would predict teachers' student achievement gains on supplemental tests that were administered in MET project teachers' classrooms: the SAT 9 Open-Ended Reading Assessment (SAT 9 OE) and the Balanced Assessment in Mathematics (BAM).

Figure 4: Trade-Offs from Different Weighting Schemes (Middle School English Language Arts)

[Bar chart comparing Models 1–4 on three criteria. Correlation with state test gains: 0.69, 0.63, 0.53, 0.43. Correlation with higher-order tests: 0.29, 0.34, 0.33, 0.32. Reliability: 0.51, 0.66, 0.76, 0.75.]

These bars compare the four weighting schemes in Figure 3 on three criteria: accuracy in predicting teachers' achievement gains on state tests; accuracy in predicting student achievement gains on supplemental assessments designed to test higher-order thinking skills; and reliability, reflecting the year-to-year stability of teachers' results. Shown are the results for middle school ELA (see Table 1 on page 14 for results for other grades and subjects).

As indicated, Model 2 (50 percent state test results) and Model 3 (33 percent state tests) achieve much of the same predictive power as Model 1 (the best predictor of state test results) in anticipating teachers' future state test results. Model 4 (50 percent observations) is considerably less predictive. However, the figures also illustrate two other trade-offs. Models 2 and 3 also are somewhat better than Model 1 at predicting gains on the tests of higher-order thinking skills (for all but elementary school math). Across most grades and subjects, Model 1 was the least reliable.


Increasing Accuracy, Reducing Mistakes

When high-stakes decisions must be made, can these measures support them? Undoubtedly, that question will be repeated in school board meetings and in faculty break rooms around the country in the coming years.

The answer is yes, not because the measures are perfect (they are not), but because the combined measure is better on virtually every dimension than the measures in use now. There is no way to avoid the stakes attached to every hiring, retention, and pay decision. And deciding not to make a change is, after all, a decision. No measure is perfect, but better information should support better decisions.

In our report Gathering Feedback for Teaching, we compared the equally weighted measure (Model 3 in Figures 3 and 4) to two indicators that are almost universally used for pay or retention decisions today: teaching experience and possession of a master's degree. On every student outcome (the state tests, supplemental tests, and students' self-reported level of effort and enjoyment in class), the teachers who excelled on the composite measure had better outcomes than those with high levels of teaching experience or a master's degree.

In addition, many districts currently require classroom observations, but they do not include student surveys or achievement gains. We tested whether observations alone are enough. Even with four full classroom observations (two by one observer and two by another), conducted by observers trained and certified by the Educational Testing Service, the observation-only model performed far worse than any of our multiple measures composites. (The correlations comparable to those in Figure 5 would have been .14 and .25 with the state tests and test of higher-order skills.)

Still, it is fair to ask, what might be done to reduce error? Many steps have been discussed in this and other reports from the project:

First, if any type of student data is to be used, either from tests or from student surveys, school systems should give teachers a chance to correct errors in their student rosters.

Second, classroom observers should not only be trained on the instrument. They should first demonstrate their accuracy by scoring videos or observing a class with a master observer.

Third, observations should be done by more than one observer. A principal's observation is not enough. To ensure reliability, it is important to involve at least one other observer, either from inside or outside the school.

Fourth, if multiple years of data on student achievement gains, observations, and student surveys are available, they should be used. For novice teachers and for systems implementing teacher evaluations for the first time, there may be only a single year available. We have demonstrated that a single year contains information worth acting on. But the information would be even better if it included multiple years. When multiple years of data are available, they should be averaged, as sketched below (although some systems may choose to weight recent years more heavily).
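The sketch below illustrates the averaging suggested in that fourth step: a plain mean across available years, with an optional factor that weights recent years more heavily. The recency factor is an illustrative assumption, not a MET project recommendation.

```python
# Minimal sketch of multi-year averaging; the recency factor is illustrative.
def multi_year_average(yearly_scores, recency=1.0):
    """Average scores over years (oldest first); each later year's weight is
    multiplied by `recency`, so recency=1.0 gives a plain mean."""
    weights = [recency ** i for i in range(len(yearly_scores))]
    total = sum(w * s for w, s in zip(weights, yearly_scores))
    return total / sum(weights)

scores = [0.10, 0.30, 0.20]                      # three years, oldest first
print(multi_year_average(scores))                # plain mean: 0.20
print(multi_year_average(scores, recency=2.0))   # recent years count double: ~0.21
```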

While covering less material than state tests, the SAT 9 OE and BAM assessments include more cognitively challenging items that require writing, analysis, and application of concepts, and they are meant to assess higher-order thinking skills. Sample items released by the assessment consortia for the new Common Core State Standards assessments are more similar to the items on these supplemental tests than the ones on the state assessments. Shown in Figure 4 is the effect of reducing the weight on state test gains in predicting gains on these other assessments, again for middle school ELA. For most grades and subjects, Model 2 and Model 3 (50 percent state test and equal weights for all three measures) best predicted teachers' student achievement gains on these supplemental assessments, with little difference between the two models. The one exception was elementary school math, where Model 1 (best predictor) was best.

The third set of bars in Figure 4 compares composites on their reliability, that is, the extent to which the composite would produce consistent results for the same teachers from year to year (on a scale from 0 to 1.0, with 1.0 representing perfect consistency and no volatility).


Table 1

Calculated weights for maximum accuracy in predicting gains on state tests:

                 English Language Arts                      Math
                 State Tests  Observations  Surveys         State Tests  Observations  Surveys
  Elementary     65%          9%            25%             85%          5%            11%
  Middle         81%          2%            17%             91%          4%            5%

Reliability and accuracy of different weighting schemes (Model 1: weighted for max state test accuracy; Model 2: 50% state test; Model 3: equal weights; Model 4: 50% observations):

English Language Arts
  Elementary
    Reliability (Models 1-4):            0.42  0.46  0.50  0.49
    Correlation with state test:         0.61  0.59  0.53  0.45
    Correlation with higher-order test:  0.35  0.37  0.37  0.35
  Middle
    Reliability (Models 1-4):            0.51  0.66  0.76  0.75
    Correlation with state test:         0.69  0.63  0.53  0.43
    Correlation with higher-order test:  0.29  0.34  0.33  0.32

Math
  Elementary
    Reliability (Models 1-4):            0.52  0.57  0.57  0.55
    Correlation with state test:         0.72  0.65  0.54  0.46
    Correlation with higher-order test:  0.31  0.29  0.25  0.20
  Middle
    Reliability (Models 1-4):            0.86  0.88  0.88  0.83
    Correlation with state test:         0.92  0.84  0.73  0.65
    Correlation with higher-order test:  0.38  0.44  0.45  0.45

Again, results shown are for middle school ELA. Across all grades and subjects, the most reliable composites were either Model 2 (50 percent state test) or Model 3 (equal weights). For all but middle school math, the least reliable composite was Model 1 (best predictor). Model 4 (50 percent observations) was somewhat less reliable than Model 3 (equal weights) for all grades and subjects. Although not shown, student achievement gains on state tests by themselves are less stable than all of the composites, with one exception: Model 4 (50 percent observations) is slightly less stable than achievement gains alone for middle school math.

General Implications

The intent of this analysis was not to recommend an ideal set of weights to use in every circumstance. Rather, our goal was to describe the trade-offs among different approaches.8

If the goal is to predict gains on state tests, then the composites that put 65+ percent of the weight on the student achievement gains on those tests will generally show the greatest accuracy. However, reducing the weights on the state test achievement gain measures to 50 percent or 33 percent generates two positive trade-offs: it increases stability (lessens volatility from year to year), and it also somewhat increases the correlation with tests other than the state tests.

However, it is possible to go too far. Lowering the weight on state test achievement gains below 33 percent, raising the weight on observations to 50 percent, and including student surveys at 25 percent is counterproductive. It not only lowers the correlation with state achievement gains; it can also lower reliability and the correlation with other types of testing outcomes.


Ultimately, states, local education authorities, and other stakeholders need to decide how to weight the measures in a composite. Our data suggest that assigning 50 percent or 33 percent of the weight to state test results maintains considerable predictive power, increases reliability, and potentially avoids the unintended negative consequences of assigning too-heavy weights to a single measure. Removing too much weight from state tests, however, may not be a good idea, given the lower predictive power and reliability of Model 4 (25 percent state tests). In short, there is a range of reasonable weights for a composite of multiple measures.

Validating Content Knowledge for Teaching

Teachers shouldn't be asked to expend effort to improve something that doesn't help them achieve better outcomes for their students. If a measure is to be included in formal evaluation, then it should be shown that teachers who perform better on that measure are generally more effective in improving student outcomes.

This test for validity has been central to the MET project's analyses. Measures that have passed this test include high-quality classroom observations, well-designed student-perception surveys, and teachers' prior records of student achievement gains on state tests.

Over the past year, MET project researchers have investigated another type of measure, called the Content Knowledge for Teaching (CKT) tests. These are meant to assess teachers' understanding of how students acquire and understand subject-specific skills and concepts in math and ELA. Developed by the Educational Testing Service and researchers at the University of Michigan, these tests are among the newest measures of teaching included in the MET project's analyses. Mostly multiple choice, the questions ask how best to represent ideas to students, assess student understanding, and determine sources of students' confusion.

The CKT tests studied by the MET project did not pass our test for validity. MET project teachers who performed better on the CKT tests were not substantively more effective in improving student achievement on the outcomes we measured. This was true whether student achievement was measured using state tests or the supplemental assessments of higher-order thinking skills. For this reason, the MET project did not include CKT results within its composite measure of effective teaching.

These results, however, speak to the validity of the current measure, still early in its development, in predicting achievement gains on particular student assessments, not to the importance of content-specific pedagogical knowledge. CKT as a concept remains promising. The teachers with higher CKT scores did seem to have somewhat higher scores on two subject-based classroom observation instruments: the Mathematical Quality of Instruction (MQI) and the Protocol for Language Arts Teacher Observations (PLATO). Moreover, the MET project's last report suggested that some content-specific observation instruments were better than cross-subject ones in identifying teachers who were more effective in improving student achievement in ELA and math. Researchers will continue to develop measures for assessing teachers' content-specific teaching knowledge and validating them as states create new assessments aligned to the Common Core State Standards. When they have been shown to be substantively related to a teacher's students' achievement gains, these should be considered for inclusion as part of a composite measure of effective teaching.


How Can Teachers Be Assured Trustworthy Results from Classroom Observations?9

Classroom observations can be powerful tools for professional growth. But for observations to be of value, they must reliably reflect what teachers do throughout the year, as opposed to the subjective impressions of a particular observer or some unusual aspect of a particular lesson. Teachers need to know they are being observed by the right people, with the right skills, and a sufficient number of times to produce trustworthy results. Given this, the challenge for school systems is to make the best use of resources to provide teachers with high-quality feedback to improve their practice.

The MET project's report Gathering Feedback for Teaching showed the importance of averaging together multiple observations from multiple observers to boost reliability. Reliability represents the extent to which results reflect consistent aspects of a teacher's practice, as opposed to other factors such as observer judgment. We also stressed that observers must be well trained and assessed for accuracy before they score teachers' lessons.

But there were many practical questions the MET project couldn't answer in its previous study. Among them:

Can school administrators reliably assess the practice of teachers in their schools?

Can additional observations by external observers not familiar with a teacher increase reliability?

Must all observations involve viewing the entire lesson, or can partial lessons be used to increase reliability?

And what is the incremental benefit of adding additional lessons and additional observers?

These questions came from our partners, teachers, and administrators in urban school districts. In response, with the help of a partner district, the Hillsborough County (Fla.) Public Schools, the MET project added a study of classroom observation reliability.

For the same total number of observations, incorporating additional observers increases reliability.


Hillsborough County's Classroom Observation Instrument

Like many school districts, Hillsborough County uses an evaluation instrument adapted from the Framework for Teaching, developed by Charlotte Danielson. The framework defines four levels of performance for specific competencies in four domains of practice. Two of those domains pertain to activities outside the classroom: Planning and Preparation, and Professional Responsibility. Observers rated teachers on the 10 competencies in the framework's two classroom-focused domains, as shown:

Domain 2: The Classroom Environment
Creating an Environment of Respect and Rapport
Establishing a Culture of Learning
Managing Classroom Procedures
Managing Student Behavior
Organizing Physical Space

Domain 3: Instruction
Communicating with Students
Using Discussion and Questioning Techniques
Engaging Students in Learning
Using Assessment in Instruction
Demonstrating Flexibility and Responsiveness

This study engaged district administrators and teacher experts to observe video-recorded lessons of 67 Hillsborough County teachers who agreed to participate.

Comparing Ratings

Two types of observers took part in the study: Fifty-three were school-based administrators, either principals or assistant principals, and 76 were peer observers. The latter are district-based positions filled by teachers on leave from the classroom who are responsible for observing and providing feedback to teachers in multiple schools. In Hillsborough County's evaluation system, teachers are observed multiple times, formally and informally, by their administrators and by peer observers. Administrators and peers are trained and certified in the district's observation instrument, which is based on Charlotte Danielson's Framework for Teaching.

These observers each rated 24 lessons for us and produced more than 3,000 ratings that we could use to investigate our questions. MET project researchers were able to calculate reliability for many combinations of observers (administrator and peer), lessons (from one to four), and observation duration (full lesson or 15 minutes). We were able to compare differences in the ratings given to teachers' lessons by their own and unknown administrators, and between administrators and peers.


Effects on Reliability

Figure 5 graphically represents many of the key findings from our analyses of those ratings. Shown are the estimated reliabilities for results from a given set of classroom observations. Reliability is expressed on a scale from 0 to 1. A higher number indicates that results are more attributable to the particular teacher as opposed to other factors such as the particular observer or lesson. When results for the same teachers vary from lesson to lesson or from observer to observer, then averaging teachers' ratings across multiple lessons or observers decreases the amount of error due to such factors, and it increases reliability.

Adding lessons and observers increases the reliability of classroom observations. In our estimates, if a teacher's results are based on two lessons, having the second lesson scored by a second observer can boost reliability significantly. This is shown in Figure 5: When the same administrator observes a second lesson, reliability increases from .51 to .58, but when the second lesson is observed by a different administrator from the same school, reliability increases more than twice as much, from .51 to .67. Whenever a given number of lessons was split between multiple observers, the reliability was greater than that achieved by a single observer. In other words, for the same total number of observations, incorporating additional observers increases reliability.
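A standard way to see why the second observer helps more than a second lesson is a variance-components (generalizability) model: an observed score is the teacher's stable practice plus observer error, lesson-to-lesson variation, and residual noise, and averaging shrinks each error term by the number of distinct observers or lessons it varies over. The sketch below assumes a fully crossed design (every observer scores every lesson) and uses illustrative variance shares, not the MET estimates behind Figure 5.

```python
# Minimal sketch of reliability under a crossed variance-components model.
# The variance shares are illustrative, not MET project estimates.
def reliability(var_teacher, var_observer, var_lesson, var_residual,
                n_observers, n_lessons):
    """Reliability of a teacher's mean score when each of n_observers
    scores each of n_lessons."""
    error = (var_observer / n_observers          # averaged over observers
             + var_lesson / n_lessons            # averaged over lessons
             + var_residual / (n_observers * n_lessons))
    return var_teacher / (var_teacher + error)

v = dict(var_teacher=0.45, var_observer=0.20, var_lesson=0.10, var_residual=0.25)
print(reliability(**v, n_observers=1, n_lessons=1))  # one observer, one lesson
print(reliability(**v, n_observers=1, n_lessons=2))  # same observer, second lesson
print(reliability(**v, n_observers=2, n_lessons=2))  # second observer added
```

With these illustrative shares, a second lesson alone raises reliability from about .45 to .55, while adding a second observer pushes it to about .68, echoing the pattern in the Hillsborough data.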

Of course, it would be a problem if school administrators and peer observers produced vastly different results for the same teachers. But we didn't find that to be the case. Although administrators gave higher scores to their own teachers, their rankings of their own teachers were similar to those produced by peer observers and administrators from other schools. This implies that administrators are seeing the same things in the videos that others do, and they are not being swayed by personal biases.

Figure 5: There Are Many Roads to Reliability

[Bar chart of estimated reliabilities for different observation configurations, each totaling roughly equal observation time. One 45-minute lesson scored by a teacher's own administrator: .51. Two lessons scored by the same administrator: .58. Two lessons split between two different own-school administrators (A and B): .67. Configurations adding peer observers, including three additional 15-minute observations each by a different peer observer, yield reliabilities of .66 to .72.]

These bars show how the number of observations and observers affects reliability. Reliability represents the extent to which the variation in results reflects consistent aspects of a teacher's practice, as opposed to other factors such as differing observer judgments. Different colors represent different categories of observers. The A and B in column three show that ratings were averaged from two different own-school observers. Each circle represents approximately 45 minutes of observation time (a solid circle indicates one observation of that duration, while a circle split into three indicates three 15-minute observations by three observers). As shown, reliabilities of .66 to .72 can be achieved in multiple ways, with different combinations of numbers of observers and observations. (For example, one observation by a teacher's administrator combined with three short, 15-minute observations, each by a different observer, would produce a reliability of .67.)



If additional observations by additional observers are important, how can the time for those added observations be divided up to maximize the use of limited resources while assuring trustworthy results? This is an increasingly relevant question as more school systems make use of video in providing teachers with feedback on their practice. Assuming multiple videos for a teacher exist, an observer could use the same amount of time to watch one full lesson or two or three partial lessons. But to consider the latter, one would want to know whether partial-lesson observations increase reliability.

Our analysis from Hillsborough County showed observations based on the first 15 minutes of lessons were about 60 percent as reliable as full-lesson observations, while requiring one-third as much observer time. Therefore, one way to increase reliability is to expose a given teacher's practice to multiple perspectives. Having three different observers each observe for 15 minutes may be a more economical way to improve reliability than having one additional observer sit in for 45 minutes. Our results also suggest that it is important to have at least one or two full-length observations, given that some aspects of teaching scored on the Framework for Teaching (Danielson's instrument) were frequently not observed during the first 15 minutes of class.

Although administrators gave higher scores to their own teachers, their rankings of their own teachers were similar to those produced by external observers and administrators from other schools.


Together, these results provide a range of scenarios for achieving reliable classroom observations. There is a point where both additional observers and additional observations do little to reduce error. Reliability above 0.65 can be achieved with several configurations (see Figure 5).

Implications for Districts

Ultimately, districts must decide how to allocate time and resources to classroom observations. The answers to the questions of how many lessons, of what duration, and conducted by whom are informed by reliability considerations, as well as other relevant factors, such as novice teacher status, prior effectiveness ratings, and a district's overall professional development strategy.


What We Know Now

In three years we have learned a lot about how multiple measures can identify effective teaching and the contribution that teachers make to student learning. The goal is for such measures to inform state and district efforts to support improvements in teaching to benefit all students. Many of these lessons have already been put into practice as school systems eagerly seek out evidence-based guidance. Only a few years ago the norm for teacher evaluation was to assign satisfactory ratings to nearly all teachers evaluated while providing virtually no useful information to improve practice.10 Among the significant lessons learned through the MET project and the work of its partners:

    Stet peept sves

    ssm bsevts

    pve me eebk t

    tehes. They also can help system

    leaders prioritize their investments

    in proessional development to target

    the biggest gaps between teachers

    actual practice and the expectations

    or eective teaching.

    impemet spef pees

    evt sstems ese

    tst the t the ests.

    These include rigorous training and

    certication o observers; observa-

    tion o multiple lessons by dierent

    observers; and in the case o student

    surveys, the assurance o student

    condentiality.

Each measure adds something of value. Classroom observations provide rich feedback on practice. Student perception surveys provide a reliable indicator of the learning environment and give voice to the intended beneficiaries of instruction. Student learning gains (adjusted to account for differences among students) can help identify groups of teachers who, by virtue of their instruction, are helping students learn more.

A balanced approach is most sensible when assigning weights to form a composite measure. Compared with schemes that heavily weight one measure, those that assign 33 percent to 50 percent of the weight to student achievement gains achieve more consistency, avoid the risk of encouraging too narrow a focus on any one aspect of teaching, and can support a broader range of learning objectives than measured by a single test. (A minimal sketch of such a weighting follows this list.)

There is great potential in using video for teacher feedback and for the training and assessment of observers. The advances made in this technology have been significant, resulting in lower costs, greater ease of use, and better quality.
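As promised above, here is a minimal sketch of the balanced weighting idea. The component names, weights, and scores are hypothetical placeholders chosen so that achievement gains fall in the recommended 33 percent to 50 percent range; the composite analysis cited in endnote 7 is the authoritative treatment.

```python
# Minimal sketch of a weighted composite of standardized measures.
# All names, weights, and scores here are hypothetical illustrations.

def composite(scores: dict, weights: dict) -> float:
    """Weighted average of standardized (z-scored) component measures."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(weights[k] * scores[k] for k in weights)

# One teacher's standardized scores on each measure (illustrative values).
scores = {"achievement_gains": 0.40, "observations": 0.10, "student_survey": 0.25}

# A balanced scheme: achievement gains get 50 percent of the weight,
# the upper end of the 33-50 percent range described above.
weights = {"achievement_gains": 0.50, "observations": 0.25, "student_survey": 0.25}

print(f"composite score: {composite(scores, weights):+.2f}")
```

Shifting weight among components changes which aspect of teaching the composite emphasizes, which is why heavily weighting any single measure risks narrowing the focus of practice.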


The Work Ahead

As we move forward, MET project teachers are supporting the transition from research to practice. More than 300 teachers are helping the project build a video library of practice for use in professional development. They will record more than 50 lessons each by the end of this school year and make these lessons available to states, school districts, and other organizations committed to improving effective teaching. This will allow countless educators to analyze instruction and see examples of great teaching in action.

Furthermore, the unprecedented data collected by the MET project over the past three years are being made available to the larger research community to carry out additional analyses, which will increase knowledge of what constitutes effective teaching and how to support it. MET project partners already are tapping those data for new studies on observer training, combining student surveys and observations, and other practical concerns. Finally, commercially available video-based tools for observer training and certification now exist, using the lessons learned from the MET project's studies.

Many of the future lessons regarding teacher feedback and evaluation systems must necessarily come from the field, as states and districts innovate, assess the results, and make needed adjustments. This will be a significant undertaking, as systems work to better support great teaching. Thanks to the hard work of MET project partners, we have a solid foundation on which to build.


Endnotes

1. The lead authors of this brief are Steven Cantrell, Chief Research Officer at the Bill & Melinda Gates Foundation, and Thomas J. Kane, Professor of Education and Economics at the Harvard Graduate School of Education and principal investigator of the Measures of Effective Teaching (MET) project. Lead authors of the related research papers are Thomas J. Kane (Harvard), Daniel F. McCaffrey (RAND), and Douglas O. Staiger (Dartmouth). Essential support came from Jeff Archer, Sarah Buhayar, Alejandro Ganimian, Andrew Ho, Kerri Kerr, Erin McGoldrick, and David Parker. KSA-Plus Communications provided design and editorial assistance.

2. This section summarizes the analyses and key findings from the research report Have We Identified Effective Teachers? by Thomas J. Kane, Daniel F. McCaffrey, Trey Miller, and Douglas O. Staiger. Readers who want to review the full set of findings can download that report at www.metproject.org.

3. As expected, not every student on a randomly assigned roster stayed in the classroom of the intended teacher. Fortunately, we could track those students. We estimated the effects of teachers on student achievement using a statistical technique commonly used in randomized trials called instrumental variables.

4. These predictions, as well as the average achievement outcomes, are reported relative to the average among participating teachers in the same school, grade, and subject.

5. Readers may notice that some of the differences in Figure 2 are smaller than the differences reported in earlier MET reports. Due to non-compliance (students not remaining with their randomly assigned teacher), only about 30 percent of the randomly assigned difference in teacher effectiveness translated into differences in the effectiveness of students' actual teachers. The estimates in Figure 2 are adjusted for non-compliance. If all the students had remained with their randomly assigned teachers, we would have predicted impacts roughly three times as big. Our results imply that, without non-compliance, we would have expected to see differences just as large as those included in earlier reports.

6. Other researchers have studied natural movements of teachers between schools (as opposed to randomly assigned transfers) and found no evidence of bias in estimated teacher effectiveness between schools. See Raj Chetty, John Friedman, and Jonah E. Rockoff, "The Long-Term Impacts of Teachers: Teacher Value-Added and Student Outcomes in Adulthood," working paper no. 17699, National Bureau of Economic Research, December 2011.

7. The findings highlighted in this summary and the technical details of the methods that produced them are explained in detail in the research paper A Composite Estimator of Effective Teaching, by Kata Mihaly, Daniel F. McCaffrey, Douglas O. Staiger, and J.R. Lockwood. A copy may be found at www.metproject.org.

8. Different student assessments, observation protocols, and student survey instruments would likely yield somewhat different amounts of reliability and accuracy. Moreover, measures used for evaluation may produce different results than seen in the MET project, which attached no stakes to the measures it administered in the classrooms of its volunteer teachers.

9. This section summarizes key analyses and findings from the report The Reliability of Classroom Observations by School Personnel by Andrew D. Ho and Thomas J. Kane. Readers who want to review the full set of findings and methods for the analyses can download that report at www.metproject.org. The MET project acknowledges the hard work of Danni Greenberg Resnick and David Steele, of the Hillsborough County Public Schools, and the work of the teachers, administrators, and peer observers who participated in this study.

10. Weisberg, D., et al. (2009). The Widget Effect: Our National Failure to Acknowledge and Act on Differences in Teacher Effectiveness. Brooklyn: New Teacher Project.


© 2013 Bill & Melinda Gates Foundation. All Rights Reserved. Bill & Melinda Gates Foundation is a registered trademark in the United States and other countries.

Bill & Melinda Gates Foundation

Guided by the belief that every life has equal value, the Bill & Melinda Gates Foundation works to help all people lead healthy, productive lives. In developing countries, it focuses on improving people's health and giving them the chance to lift themselves out of hunger and extreme poverty. In the United States, it seeks to ensure that all people, especially those with the fewest resources, have access to the opportunities they need to succeed in school and life. Based in Seattle, Washington, the foundation is led by CEO Jeff Raikes and Co-chair William H. Gates Sr., under the direction of Bill and Melinda Gates and Warren Buffett.

For more information on the U.S. Program, which works primarily to improve high school and postsecondary education, please visit www.gatesfoundation.org.


www.gatesfoundation.org