Teacher Productivity & Models of Employer Learning
Economic Models in Education Research Workshop
University of Chicago, April 7, 2011
Douglas O. Staiger, Dartmouth College
Teacher Productivity & Models of Employer Learning
Teacher productivity
- Estimating value added models
- Statistical tests of model assumptions
- Stability of the effects
Models of employer learning
- Searching for effective teachers (heterogeneity)
- Career concerns (heterogeneity & effort)
Teacher Productivity
Huge non-experimental literature on "teacher effects": non-experimental studies estimate a standard deviation in teacher effects of .10 to .25 student-level standard deviations (2-5 percentiles) each year.
Key findings in the non-experimental literature:
- Teacher effects unrelated to traditional teacher credentials
- Payoff to experience steep in first 3 years but flat afterwards
- Sizable differences predicted with 1-3 years of prior performance
One experimental study (TN class-size experiment) yields a similar estimate of the variance.
Variation in Value Added Within & Between Groups of Teachers in NYC, by Teacher Certification
[Figure: kernel density estimates of teacher value added, x-axis from -.4 to .4 student-level standard deviations; groups: Traditionally Certified, Teaching Fellow, Teach for America, Uncertified.]
Note: Shown are estimates of teachers' impacts on average student performance, controlling for teachers' experience levels and students' baseline scores, demographics and program participation; includes teachers of grades 4-8 hired since the 1999-2000 school year.
Variation in Value Added Within & Between Groups of Teachers in LAUSD, by Teacher Certification
[Figure: Teacher impacts on math performance by initial certification (Traditionally Certified, Alternatively Certified, Uncertified): proportion of classrooms vs. change in percentile rank of the average student, -15 to 15.]
Note: Classroom-level impacts on average student performance, controlling for baseline scores, student demographics and program participation. LAUSD elementary teachers, grades 2 through 5.
Variation in Value Added Within & Between Groups of Teachers in LAUSD, by Years of Experience
[Figure: Teacher impacts on math performance by year of experience (1st Year, 2nd Year, 3rd Year): proportion of classrooms vs. change in percentile rank of the average student, -15 to 15.]
Note: Classroom-level impacts on average student performance, controlling for baseline scores, student demographics and program participation. LAUSD elementary teachers, < 4 years experience.
Variation in Value Added Within & Between Groups of Teachers in LAUSD, by Prior Value Added
[Figure: Teacher impacts on math performance in third year, by value-added ranking after first two years (Bottom Quartile, 3rd Quartile, 2nd Quartile, Top Quartile): proportion of classrooms vs. change in percentile rank of the average student, -15 to 15.]
Note: Classroom-level impacts on average student performance, controlling for baseline scores, student demographics and program participation. LAUSD elementary teachers, < 4 years experience.
How Are Teacher Effects Estimated?
Growing use of “value added” estimates to identify effective teachers for pay, promotion, and professional development.
But growing concern that the statistical assumptions needed to estimate teacher effects are strong and untested – are these "causal" effects of teachers?
Basics of Value Added Analysis
Teacher value added compares actual student achievement to a counterfactual expectation:
- Difference between actual and expected achievement, averaged over the teacher's students (average residual)
- Expected achievement is the average achievement of students who looked similar at the start of the year: same prior-year test scores; same demographics and program participation; same characteristics of classroom peers
Estimating Value Added
1. Estimating Non-Experimental Teacher Effects

A_ijt = X_ijt·β + μ_j + θ_jt + ε_ijt

where
- A_ijt = student test score (level or gain)
- X_ijt = different sets of student- and classroom-level covariates
- μ_j = teacher effect
- θ_jt = non-persistent classroom-by-year shock
- ε_ijt = student-by-year error

Similar teacher residuals by OLS, RE, FE (β driven by within variation). What matters is whether X includes baseline score & peer measures.
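A minimal sketch of this two-step procedure on simulated data: residualize scores on a baseline covariate, then average residuals by teacher. All parameter values and the single covariate are illustrative, not the LAUSD specification.

```python
import numpy as np

rng = np.random.default_rng(0)
n_teachers, n_students = 200, 25

# Simulate A_ijt = 0.7*baseline + mu_j + theta_jt + eps_ijt
mu = rng.normal(0, 0.15, n_teachers)        # persistent teacher effect
theta = rng.normal(0, 0.10, n_teachers)     # classroom-by-year shock
baseline = rng.normal(0, 1, (n_teachers, n_students))
score = (0.7 * baseline + (mu + theta)[:, None]
         + rng.normal(0, 0.5, (n_teachers, n_students)))

# Step 1: residualize scores on the covariate (OLS slope)
X, y = baseline.ravel(), score.ravel()
b = np.cov(X, y)[0, 1] / np.var(X)
resid = (y - b * X).reshape(n_teachers, n_students)

# Step 2: teacher "value added" = mean residual over the teacher's students
va_raw = resid.mean(axis=1)
print(np.corrcoef(va_raw, mu)[0, 1])        # noisy but positively correlated with mu
```

The raw averages are noisy estimates of μ_j, which is why the Empirical Bayes shrinkage step follows.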
Estimating Value Added

2. Generating Empirical Bayes Estimates of Non-Experimental Teacher Effects

VA_j = [σ̂_μ² / (σ̂_μ² + σ̂_noise,j²)] · μ̂_j,  where μ̂_j = μ_j + noise_j

Requires variance components:
- Teacher: σ̂_μ² = cov(v̄_jt, v̄_jt′), the covariance of a teacher's mean residuals across years
- Classroom: σ̂_θ², from Var(v̄_jt)
- Student: σ̂_ε², from Var(ε_ijt)
- Noise variance for teacher j: σ̂_noise,j² combines the classroom and student components and shrinks with the teacher's number of classrooms and students
Empirical Bayes Methods

Goal: forecast teacher performance next year (BLUP)
- Forecast is a prediction of the persistent teacher component
- "Shrinkage" estimator = posterior mean: E(μ|M) = Mβ
- Weight (β) placed on a measure increases with:
  - Correlation with the persistent component of interest
  - Reliability with which the measure is estimated (which may vary by teacher, e.g. based on sample size)
- Can apply to any measure (value added, video rating, etc.) or combination of measures (composite estimates)
Error components

Performance measure (M_jc) for teacher j in classroom c is a noisy estimate of the persistent teacher effect (μ_j).
Noise consists of two independent components:
- classroom component (θ_jc) representing peer effects, etc.
- sampling error (ν_jc) if the measure averages over students, videos, raters, etc. (variance depends on sample size)
Model for the error-prone measure: M_jc = μ_j + θ_jc + ν_jc
Prediction in simple case
Using one measure (M) to predict teacher performance on a possibly different measure (M') in a different classroom simplifies to predicting the persistent teacher component:
E(μ'j|Mjc) = Mjcβj
Optimal weights (βj) analogous to regression coefficients:
βj = Cov(μ'j,Mjc)/Var(Mjc) = Cov(μ'j,μj)/[Var(μj)+Var(θjc)+Var(νjc)]
= {Cov(μ'j,μj)/Var(μj)}*{Var(μj)/[Var(μj)+Var(θjc)+Var(νjc)]}
= {β if Mjc had no noise}*{reliability of Mjc}
Two Key Measurement Problems

- Reliability/Instability: imprecision, transitory measurement error; e.g., low correlation across classrooms
- Validity/Bias: persistently misrepresents performance (e.g. student sorting); test scores capture only one dimension of performance; depends on design, content, & scaling of the test

Validity & reliability determine a measure's ability to predict performance:
Correlation of measure with true performance = (correlation of persistent part of measure with true performance) × (square root of reliability)
E.g., Teacher certification versus value added
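As a quick numeric illustration of this identity (both input values below are assumed for illustration, not estimates from the talk):

```python
# corr(measure, true performance) = validity * sqrt(reliability)
validity = 0.8      # correlation of the persistent part with true performance (assumed)
reliability = 0.4   # share of the measure's variance that is persistent (assumed)
corr_with_truth = validity * reliability ** 0.5
print(round(corr_with_truth, 3))  # -> 0.506
```

Even a highly valid measure predicts poorly when reliability is low, which is the certification-vs-value-added contrast.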
Statistical Tests of Model Assumptions
Experimental forecasting test (Kane & Staiger)
Observational specification tests (Rothstein)
Quasi-experimental forecasting test (Carrell & West)
What Kane/Staiger do

- Randomly assign 78 pairs of teachers to classrooms in LAUSD elementary schools
- Provides an experimental estimate of the parameter of interest: if a given classroom of students were to have teacher A rather than teacher B, how different would their average test scores be at the end of the year?
- Evaluate whether pre-experimental estimates from various value-added models predict experimental results
Experimental Design

- All NBPTS applicants from the Los Angeles area.
- For each NBPTS applicant, identified comparison teachers working in the same school, grade, and calendar track.
- LAUSD chief of staff wrote letters to principals inviting them to draw up two classrooms that they would be willing to assign to either teacher.
- If the principal agreed, classroom rosters (not individual students) were randomly assigned by LAUSD on the day of switching. LAUSD made paper copies of rosters on the day of the switch.
- Yielded 78 pairs of teachers (156 classrooms and 3,500 students) for whom we had estimates of "value-added" impacts from the pre-experimental period.
LAUSD Data

- Grades 2 through 5
- Three time periods:
  - Years before random assignment: Spring 2000 through Spring 2003
  - Years of random assignment: either Spring 2004 or 2005
  - Years after random assignment: Spring 2005 (or 2006) through Spring 2007
- Outcomes:
  - California Standards Test (Spring 2004-2007)
  - Stanford 9 Tests (Spring 2000 through 2002)
  - California Achievement Test (Spring 2003)
- Covariates:
  - Student: baseline math and reading scores (interacted with grade), race/ethnicity (Hispanic, white, black, other or missing), ever retained, Title I, eligible for free lunch, gifted and talented, special education, English language development (levels 1-5)
  - Peers: means of all the above for students in the classroom
  - Fixed effects: school × grade × track × year
- Sample exclusions: classes with >20 percent special education; classes with fewer than 5 or more than 36 students
- All scores standardized by grade and year.
Table 3: Non-experimental Estimates of Teacher Effect Variance Components
Standard deviation of each component (in student-level standard deviation units)

Specification Used for Non-experimental Teacher Effect | Teacher Effect | Teacher-by-Year Random Effect | Mean Sample Size per Teacher
Math levels with...
  No controls | 0.448 | 0.229 | 47.255
  Student/peer controls (incl. prior scores) | 0.231 | 0.179 | 41.611
  Student/peer controls (incl. prior scores) & school F.E. | 0.219 | 0.177 | 41.611
  Student fixed effects | 0.101 | 0.061 | 47.255
Math gains with...
  No controls | 0.236 | 0.219 | 43.888
  Student/peer controls | 0.234 | 0.219 | 43.888
  Student/peer controls & school F.E. | 0.225 | 0.219 | 43.888

Note: The above estimates are based on the total variance in estimated teacher fixed effects using observations from the pre-experimental data (years 1999-2000 through 2002-03). See the text for discussion of the estimation of the decomposition into teacher-by-year random effects, student-level error, and "actual" teacher effects. The sample was limited to schools with teachers in the experimental sample. Any individual students who were in the experiment were dropped from the pre-experimental estimation, to avoid any spurious relationship due to regression to the mean, etc.
Evaluating Value Added

3. Test validity of VA_j against experimental outcomes

Y_jp = γ_p + δ·VA_jp + ε_jp,  for j = 1,2 and p = 1,...,78

Within-pair differencing gives:

Y_2p − Y_1p = δ·(VA_2p − VA_1p) + (ε_2p − ε_1p),  for p = 1,...,78

Hypotheses:
- Y = baseline characteristics: H_0: δ = 0
- Y = test scores, experimental year: H_0: δ = 1
- Y = test scores, experimental years +1 and +2: H_0: δ = 1 (fade-out implies δ < 1)
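The forecast test is a bivariate regression with no constant of within-pair outcome differences on within-pair value-added differences. A minimal sketch on simulated data (the 78-pair structure is kept; everything else, including the true δ = 1, is illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
pairs = 78
d_va = rng.normal(0, 0.2, pairs)        # VA_2p - VA_1p (predicted difference)
d_y = d_va + rng.normal(0, 0.3, pairs)  # Y_2p - Y_1p, generated with true delta = 1

# OLS through the origin: delta_hat = sum(x*y) / sum(x*x)
delta_hat = (d_va * d_y).sum() / (d_va ** 2).sum()
print(round(delta_hat, 2))  # near 1 when value added predicts without bias
```

An estimated δ near 1 for experimental-year scores is the "unbiased prediction" result in Table 6.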
Table 6. Regression of Experimental Difference in Average Test Scores on Non-Experimental Estimates of Differences in Teacher Effect

Specification Used for Non-experimental Teacher Effect | First Year Coeff. (s.e.) | First Year R² | Second Year Coeff. (s.e.) | Third Year Coeff. (s.e.)
Math levels with...
  No controls | 0.511*** (0.108) | 0.185 | 0.282** (0.107) | 0.124 (0.101)
  Student/peer controls (incl. prior scores) | 0.852*** (0.177) | 0.210 | 0.359* (0.172) | 0.034 (0.133)
  Student/peer controls (incl. prior scores) & school F.E. | 0.905*** (0.180) | 0.226 | 0.390* (0.176) | 0.07 (0.136)
  Student fixed effects | 1.859*** (0.470) | 0.153 | 0.822 (0.445) | 0.304 (0.408)
Math gains with...
  No controls | 0.794*** (0.201) | 0.162 | 0.342 (0.185) | 0.007 (0.146)
  Student/peer controls | 0.828*** (0.207) | 0.171 | 0.356 (0.191) | 0.01 (0.151)
  Student/peer controls & school F.E. | 0.865*** (0.213) | 0.177 | 0.382 (0.200) | 0.025 (0.157)
N: 78 in each column

Note: The within-pair difference in mean test scores (math or ELA, corresponding to the teacher effect) was regressed on different non-experimental estimates of differences in teacher effects. The coefficients were estimated in separate bivariate regressions with no constant. Robust standard errors are reported in parentheses.
Figure 1: Within-Pair Differences in Pre-experimental Value-added and End of First Year Test Score (Mathematics)
[Scatter plot: within-pair difference in end-of-first-year test score (y-axis, -1.5 to 1.5) vs. within-pair difference in pre-experimental value-added (x-axis, 0 to .8), with linear fit, lowess fit, and 45-degree line.]
How much of the variance in (μ_2p − μ_1p) is "explained" by (VA_2p − VA_1p)?

Total variance: Var(Y_2p − Y_1p) = 17.6
Signal variance: 2·Var(μ̂_j) = 7.48
Maximum R² (if VA captured the actual effect): 7.48 / 17.6 = .425
Actual R² = .226
Proportion "explained": .226 / .425 ≈ .53
pp
Not Clear How To Interpret Fade-out

- Forgetting, transitory teaching-to-the-test: value added overstates long-term impact
- Knowledge that is not used becomes inoperable: need a string of good teachers to maintain the effect
- Grade-specific content of tests not cumulative: later tests understate the contribution of the current teacher
- Students of the best teachers are mixed with students of the worst teachers the following year, and the new teacher will focus effort on students who are behind (peer effects): no fade-out if all teachers were effective
Our approach:
1. Test whether non-experimental estimators (μ̂, θ̂) validly predict student achievement following random assignment.

Rothstein's approach:
1. Test exclusion restrictions on lagged achievement and teacher assignments.
2. Var("Bias") = Var(μ̂_VAM1 − μ̂_VAM4) = Var(VAM1) + Var(VAM4) − 2·Cov(VAM1, VAM4).
3. Simulate effects of selection on unobservables.

Value-added models compared (g indexes grade; each conditions on a progressively richer set of prior scores and covariates):
- VAM1: A_ig = μ_ig + f_1(A_{i,g−1}, X_ig)
- VAM2: A_ig = μ_ig + f_2(A_{i,g−1}, A_{i,g−2}, X_ig)
- VAM4: A_ig = μ_ig + f_4(A_{i,g−1}, A_{i,g−2}, A_{i,g−3}, X_ig, and all other possible controls)
Reconciling with Rothstein (2010)

[Table: correlation of each alternative VAM specification with VAM4: .94, .93, .98, .998; bias variance bounded by ½·Var(VAM1 − VAM4).]

1. Both of us find that past teachers have lingering effects due to fade-out.
2. Rothstein finds that a richer set of covariates has negligible effects.
3. While Rothstein speculates that selection on unobservables could cause problems, our results fail to find evidence of bias.
Reconciling Kane/Staiger with Rothstein

- Both Rothstein and Kane/Staiger find evidence of fade-out.
  - Rothstein finds current student gain is associated with past teacher assignment, conditional on the student's prior test score.
  - Consistent with fade-out of the prior teacher's effect in Kane/Staiger.
  - Bias in the current teacher effect depends on the correlation between current & past teacher value added (small in Rothstein & Kane/Staiger data).
- Both Rothstein and Kane/Staiger find that, after conditioning on the prior test score, other observables don't matter much.
  - Rothstein finds prior student gain is associated with current teacher assignment, conditional on the student's prior test score; i.e., current teacher assignment is associated with the past 2 tests.
  - Rothstein (and others) finds that controlling for earlier tests has little effect on estimates of teacher effects (corr > .98).
- Rothstein speculates that other unobservables used to track students may bias estimates of teacher effects.
  - Kane/Staiger find no substantial bias from such omitted factors.
Carrell/West: A Cautionary Tale!

- Quasi-experimental evidence from the US Air Force Academy
  - Students randomized to classes, common test & grading
  - Estimate teacher effects in 1st-year intro classes
  - Does it predict performance in the 2nd-year class?
- Strong evidence of teaching-to-the-test
  - Big teacher effects in 1st year
  - Lower-ranked instructors have larger "value added" & satisfaction, but predict worse performance in the 2nd-year class
- AF system facilitated teaching to the test
Summary of Statistical Tests

- Value-added estimates in a low-stakes environment yielded unbiased predictions of teachers' causal effects on short-term student achievement.
  - Controlling for baseline score yielded unbiased predictions.
  - Further controlling for peer characteristics yielded the highest explanatory power, explaining over 50% of teacher variation.
- Relative differences in achievement between teachers' students fade out at an annual rate of .4-.6. Understanding the mechanism is key to the long-term benefits of using value added.
- Performance measures can go wrong when easily gamed.
Are Teacher Effects Stable?

- Different across students within a class? No.
- Change over time?
  - Correlation falls slowly at longer lags
  - Teacher peer effects (Jackson/Bruegman)
  - Effect of evaluation on performance (Taylor/Tyler)
- Depend on match/context?
  - Correlation falls when teachers change grades or courses
  - Correlation falls when teachers change schools (Jackson)
How Should Value Added Be Used?

- Growing use of value added to identify effective teachers for pay, promotion, and professional development
- Concern that current value-added estimates are too imprecise & volatile to be used in high-stakes decisions
  - Year-to-year correlation (reliability) around 0.3-0.5
  - Of the top quartile one year, >10% are in the bottom quartile the next year
- No systematic analysis of what this evidence implies for how the measures could be used
Models of Employer Learning
Motivating facts:
- Large persistent variation across teachers (heterogeneity)
- Difficult to predict at hire (not an inspection good)
- Predictable after hire (experience good, so employers learn)
- Return to experience in first few years (cost of hiring)
Searching For Effective Teachers
- Use a simple search model to illustrate how one could use imperfect information on effectiveness to screen teachers
- Use estimates of model parameters from NYC & LAUSD to simulate the potential gains from screening teachers
- Evaluate potential gains from:
  - Observing teacher performance for more years
  - Obtaining more reliable information on teacher performance
  - Obtaining more reliable information at time of hire
Simple search model: Setup
- Teacher effect: μ ~ N(0, σ_μ²)
- Pre-hire signal (if available): Y_0 ~ N(μ, σ_0²), reliability = σ_μ² / (σ_μ² + σ_0²)
- #applicants = 10 times natural attrition; constraint: #hired = #dismissed + natural turnover
- Annual performance on the job (t = 1,...,30): Y_t ~ N(μ + β_t, σ²), reliability = σ_μ² / (σ_μ² + σ²)
- Return to experience: β_t < 0 for early t (cost of hiring)
- Exogenous annual turnover rate (t < 30): π
- Can dismiss up until tenure at t = T
Simple search model: Solution

Objective: maximize student achievement by screening out ineffective teachers using an imperfect performance measure.

Solution is similar to the Jovanovic (1979) matching model:
- Principal sets a reservation value (r_t), increasing with t: dismiss after period t if E(μ | Y_0,...,Y_t) < r_t
- From the normal learning model, with n observed signals and average signal Ȳ:

E(μ | Y_0,...,Y_t) = [n·σ_μ² / (n·σ_μ² + σ²)] · Ȳ

- Reservation value increases because of declining option value
- No simple analytic solution to the general model: numerically estimate the optimal r_t through simulations
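The normal-learning posterior mean can be checked numerically. A small sketch using the base-case annual reliability of 0.4; the function name and inputs are illustrative:

```python
def posterior_mean(y_bar, n, var_mu=0.15**2, var_e=0.15**2 / 0.4 - 0.15**2):
    """E(mu | n signals averaging y_bar) under normal learning: shrink the
    running average toward the prior mean of 0. var_e is chosen so that the
    one-year reliability var_mu/(var_mu + var_e) equals 0.4 (base case)."""
    weight = n * var_mu / (n * var_mu + var_e)
    return weight * y_bar

print(round(posterior_mean(0.1, 1), 3))  # one year: weight = reliability = 0.4
print(round(posterior_mean(0.1, 5), 3))  # more years: less shrinkage
```

As n grows, the weight approaches 1, which is why the principal learns the teacher's quality over time.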
Tenure cutoff in simple case

Suppose:
- No pre-hire signal (new hire is a random draw)
- Tenure after 1 year (no option value)
- Return to experience only in year 1 (β_1 < 0)

After one year, E(μ | Y_1) = [σ_μ² / (σ_μ² + σ²)]·Y_1, so the dismissal rule is a cutoff c on Y_1.

f.o.c.: marginal tenured teacher = average teacher next year. The cutoff c equates the marginal tenured teacher's quality, E(μ | Y_1 = c), with the expected quality of a replacement next year: the replacement passes the bar with probability Pr(Y_1 > c) and then has quality E(μ | Y_1 > c), while the first-year cost β_1 and the survival probabilities (1 − π) over the remaining horizon (t = 2,...,30) discount this gamble.
Simulation assumptions from NYC & LAUSD

Maintained assumptions across all simulations:
- SD of teacher effect: σ_μ = 0.15 (in student SD units; national black-white gap = .8-.9)
- Turnover rate if not dismissed: π = 5%

Assumptions for simplest base case (varied later):
- No useful information at time of hire
- Reliability of Y_t: σ_μ² / (σ_μ² + σ²) = 0.4 (40% reliability)
- Cost of hiring a new teacher: β_t = -.07 in 1st year, -.02 in 2nd year
- Dismissal only after first year (e.g. tenure decision after 1 year)
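The base case can be simulated with a stripped-down model of a single teaching slot. This is an illustrative sketch under the assumptions above (one screening round after year 1, a fixed dismissal share), not the paper's full dynamic program:

```python
import numpy as np

rng = np.random.default_rng(0)
SD_MU = 0.15                         # SD of persistent teacher effect
SD_E = SD_MU * (1 / 0.4 - 1) ** 0.5  # noise SD giving 40% annual reliability
PI = 0.05                            # exogenous annual turnover
BETA = {1: -0.07, 2: -0.02}          # hiring cost in years 1 and 2

def average_achievement(dismiss_share, years=100_000):
    """Long-run average output of one teaching slot when a novice whose
    year-1 signal falls in the bottom `dismiss_share` is replaced."""
    y1_draws = rng.normal(0, (SD_MU**2 + SD_E**2) ** 0.5, 100_000)
    cutoff = np.quantile(y1_draws, dismiss_share) if dismiss_share > 0 else -np.inf
    total, mu, tenure = 0.0, rng.normal(0, SD_MU), 1
    for _ in range(years):
        total += mu + BETA.get(tenure, 0.0)      # output net of novice costs
        dismissed = tenure == 1 and mu + rng.normal(0, SD_E) < cutoff
        if dismissed or rng.random() < PI:
            mu, tenure = rng.normal(0, SD_MU), 1  # replace with a new hire
        else:
            tenure += 1
    return total / years

print(round(average_achievement(0.0), 3), round(average_achievement(0.8), 3))
```

Even this crude version shows that dismissing a large share of novices raises average achievement despite the recurring hiring costs.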
Simple model: dismiss 80% of probationary teachers(!)
[Figure: average value added (0 to .08) and proportion novice (0 to .8) as functions of the proportion dismissed (0 to 1).]
Why dismiss so many probationary teachers?

- Differences in teacher effects are large & persistent, relative to the short-lived costs of hiring a new teacher
- Even unreliable performance measures predict substantial differences in teacher effects: the costs of retaining an ineffective teacher outweigh the costs of dismissing an effective teacher
- Option value of new hires: for every 5 new hires, one will be highly effective; trade off the short-term cost of 4 dismissed vs. the long-term benefit of 1 retained
Why not dismiss so many probationary teachers?

- Smaller benefits than assumed in the model?
  - High turnover rates
  - Teacher differences that do not persist in the future (including if PD can help ineffective teachers)
  - High-stakes distortion of performance measures
- Larger costs than assumed in the model?
  - Direct costs of recruiting/firing (little effect if added)
  - Difficulty recruiting applicants (but LAUSD did)
  - Higher pay required to offset job insecurity (particularly if teacher training is required up front)
Requiring a 2nd or 3rd year to evaluate a probationary teacher is a bad idea.
Wait to dismiss until year: | T=1 | T=2 | T=3 | T=4
Average value added | 0.080 | 0.075 | 0.068 | 0.061
% dismissed | 81% | 75% | 71% | 68%
Allowing a 2nd or 3rd year to evaluate a probationary teacher is a good idea.
Dismissal at any time through year: | T=1 | T=2 | T=3 | T=4
Average value added | 0.080 | 0.095 | 0.099 | 0.101
% dismissed (total) | 81% | 83% | 84% | 84%
% dismissed in year 1 | 81% | 67% | 67% | 67%
% dismissed in year 2 | 0% | 16% | 8% | 8%
% dismissed in year 3 | 0% | 0% | 9% | 4%
% dismissed in year 4 | 0% | 0% | 0% | 5%
Obtaining more reliable information on teacher performance is valuable, little effect on dismissal
[Figure: average value added (0 to .16) and proportion dismissed (0 to .8), for tenure timing T=1, T=2, T=3, T>3, as functions of the reliability of the annual performance measure (0 to 1).]
Obtaining more reliable information at time of hire is even more valuable, and reduces dismissal rate.
[Figure: average value added (.1 to .25) and proportion dismissed (0 to 80%) as functions of the reliability of the pre-hire performance signal (0 to 1).]
Implications
- Why do principals set a low tenure bar?
  - Poor incentives (private schools?)
  - Lack of verifiable performance information
  - Current up-front training requirements (not necessary?)
  - Lose best teachers if cannot raise pay
- Why don't other occupations & professions dismiss 80%?
  - Job ladder: a low-stakes entry-level job is used to screen
  - MD, JD: require up-front training, job differentiation later
- Alternatives to the current system
  - No up-front investment: can train later
  - Rather than credentials, base certification on performance
  - Develop a "job ladder" pre-screen, e.g. an initial job where few students are put at risk but ability is revealed (summer school?)
Summary
- Potential gain is large
  - Could raise average annual achievement gains by ≈ 0.08
  - Similar magnitude to the STAR class-size experiment and to recent results from charter school lotteries
  - Gains could be doubled with a more reliable performance measure, and tripled if it were observed pre-hire
- Select only the most effective teachers, and do it quickly
  - There may be practical reasons limiting the success of this strategy
  - May require rethinking teacher training & the job ladder
- Focused on screening, but other uses may yield large gains
Combining Heterogeneity & Effort: Model of Career Concerns
Gibbons & Murphy (1992)
- Output (y_t) is the sum of ability (η), effort (a_t), and noise (e_t)
- Workers are risk-averse, with convex costs of effort
- Information is imperfect but symmetric, so firms pay expected output
Combining Heterogeneity & Effort: Model of Career Concerns, Gibbons & Murphy (1992)

Simple optimal linear contract: w_t = c_t + b_t*·(y_t − c_t), with c_t = a_t* + m_{t−1}

Base pay (c_t) is the expected value at t−1 of output at t, and is the sum of two terms:
- equilibrium effort (a_t*): an experience effect
- posterior mean of ability (m_{t−1}) based on earlier output

Incentive payment depends on:
- How much output exceeds expectations (y_t − c_t)
- Weight (b_t*) declines with noise in y_t, and grows with experience; early effort is rewarded indirectly through its impact on beliefs (m_{t−1})
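A numerical sketch of the contract and the belief updating behind m_{t−1}. The prior and noise variances, equilibrium effort a*, and incentive weight b_t below are all illustrative assumptions, not values from Gibbons & Murphy:

```python
VAR_ETA = 0.15**2   # prior variance of ability (assumed)
VAR_E = 0.10**2     # output noise variance (assumed)

def wage(y_t, m_prev, a_star=0.05, b_t=0.3):
    """w_t = c_t + b_t*(y_t - c_t): base pay c_t = a_star + m_{t-1}
    plus an incentive payment on output above expectation.
    a_star and b_t are illustrative constants, not equilibrium values."""
    c_t = a_star + m_prev
    return c_t + b_t * (y_t - c_t)

def update_belief(m_prev, prec_prev, y_t, a_star=0.05):
    """Posterior mean/precision of ability after observing y_t = eta + a_t + e_t,
    netting out equilibrium effort (normal-normal updating)."""
    prec = prec_prev + 1 / VAR_E
    m = (prec_prev * m_prev + (y_t - a_star) / VAR_E) / prec
    return m, prec

w1 = wage(0.12, 0.0)                          # year-1 wage given output 0.12
m1, p1 = update_belief(0.0, 1 / VAR_ETA, 0.12)
print(round(w1, 3), round(m1, 3))
```

The posterior mean m_1 rises after above-expectation output, which is the channel through which early effort is rewarded in later base pay.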
Implications For Teacher Contract

- Base pay determined by experience & the Empirical Bayes estimate of performance (pay grades)
- Incentive payment later in career (after tenure), based on performance relative to others in your pay grade
- Incentives depend on noise in value added, which may be worse in some subjects (ELA) or in small classes, so pay could be based on percentile rank within class-size & subject categories