Teacher Productivity & Models of Employer Learning
Economic Models in Education Research Workshop
University of Chicago, April 7, 2011
Douglas O. Staiger, Dartmouth College
Teacher Productivity & Models of Employer Learning
Teacher productivity
- Estimating value added models
- Statistical tests of model assumptions
- Stability of the effects
Models of employer learning
- Searching for effective teachers (heterogeneity)
- Career concerns (heterogeneity & effort)
Teacher Productivity
Huge non-experimental literature on "teacher effects": non-experimental studies estimate a standard deviation in teacher effects of .10 to .25 student-level standard deviations (2-5 percentiles) each year.
Key findings in the non-experimental literature:
- Teacher effects unrelated to traditional teacher credentials
- Payoff to experience steep in first 3 years but flat afterwards
- Sizable differences predicted with 1-3 years of prior performance
One experimental study (TN class-size experiment) yields a similar estimate of the variance.
Variation in Value Added Within & Between Groups of Teachers in NYC, by Teacher Certification
[Figure: kernel density estimates of teacher value added, x-axis from -.4 to .4 student-level standard deviations; groups: Traditionally Certified, Teaching Fellow, Teach for America, Uncertified.]
Note: Shown are estimates of teachers' impacts on average student performance, controlling for teachers' experience levels and students' baseline scores, demographics and program participation; includes teachers of grades 4-8 hired since the 1999-2000 school year.
Variation in Value Added Within & Between Groups of Teachers in LAUSD, by Teacher Certification
[Figure: Teacher impacts on math performance by initial certification (Traditionally Certified, Alternatively Certified, Uncertified): proportion of classrooms vs. change in percentile rank of the average student, -15 to 15.]
Note: Classroom-level impacts on average student performance, controlling for baseline scores, student demographics and program participation. LAUSD elementary teachers, grades 2 through 5.
Variation in Value Added Within & Between Groups of Teachers in LAUSD, by Years of Experience
[Figure: Teacher impacts on math performance by year of experience (1st Year, 2nd Year, 3rd Year): proportion of classrooms vs. change in percentile rank of the average student, -15 to 15.]
Note: Classroom-level impacts on average student performance, controlling for baseline scores, student demographics and program participation. LAUSD elementary teachers, < 4 years experience.
Variation in Value Added Within & Between Groups of Teachers in LAUSD, by Prior Value Added
[Figure: Teacher impacts on math performance in third year, by value-added ranking after first two years (Bottom Quartile, 3rd Quartile, 2nd Quartile, Top Quartile): proportion of classrooms vs. change in percentile rank of the average student, -15 to 15.]
Note: Classroom-level impacts on average student performance, controlling for baseline scores, student demographics and program participation. LAUSD elementary teachers, < 4 years experience.
How Are Teacher Effects Estimated?
Growing use of “value added” estimates to identify effective teachers for pay, promotion, and professional development.
But growing concern that the statistical assumptions needed to estimate teacher effects are strong and untested – are these "causal" effects of teachers?
Basics of Value Added Analysis
Teacher value added compares actual student achievement to a counterfactual expectation:
- Difference between actual and expected achievement, averaged over the teacher's students (average residual)
- Expected achievement is the average achievement of students who looked similar at the start of the year: same prior-year test scores; same demographics and program participation; same characteristics of classroom peers
Estimating Value Added
1. Estimating Non-Experimental Teacher Effects

A_ijt = X_ijt·β + μ_j + θ_jt + ε_ijt

where
- A_ijt = student test score (level or gain)
- X_ijt = different sets of student- and classroom-level covariates
- μ_j = teacher effect
- θ_jt = non-persistent classroom-by-year shock
- ε_ijt = student-by-year error

Similar teacher residuals by OLS, RE, FE (β driven by within variation). What matters is whether X includes baseline score & peer measures.
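A minimal sketch of this two-step procedure on simulated data: residualize scores on a baseline covariate, then average residuals by teacher. All parameter values and the single covariate are illustrative, not the LAUSD specification.

```python
import numpy as np

rng = np.random.default_rng(0)
n_teachers, n_students = 200, 25

# Simulate A_ijt = 0.7*baseline + mu_j + theta_jt + eps_ijt
mu = rng.normal(0, 0.15, n_teachers)        # persistent teacher effect
theta = rng.normal(0, 0.10, n_teachers)     # classroom-by-year shock
baseline = rng.normal(0, 1, (n_teachers, n_students))
score = (0.7 * baseline + (mu + theta)[:, None]
         + rng.normal(0, 0.5, (n_teachers, n_students)))

# Step 1: residualize scores on the covariate (OLS slope)
X, y = baseline.ravel(), score.ravel()
b = np.cov(X, y)[0, 1] / np.var(X)
resid = (y - b * X).reshape(n_teachers, n_students)

# Step 2: teacher "value added" = mean residual over the teacher's students
va_raw = resid.mean(axis=1)
print(np.corrcoef(va_raw, mu)[0, 1])        # noisy but positively correlated with mu
```

The raw averages are noisy estimates of μ_j, which is why the Empirical Bayes shrinkage step follows.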
Estimating Value Added

2. Generating Empirical Bayes Estimates of Non-Experimental Teacher Effects

VA_j = [σ̂_μ² / (σ̂_μ² + σ̂_noise,j²)] · μ̂_j,  where μ̂_j = μ_j + noise_j

Requires variance components:
- Teacher: σ̂_μ² = cov(v̄_jt, v̄_jt′), the covariance of a teacher's mean residuals across years
- Classroom: σ̂_θ², from Var(v̄_jt)
- Student: σ̂_ε², from Var(ε_ijt)
- Noise variance for teacher j: σ̂_noise,j² combines the classroom and student components and shrinks with the teacher's number of classrooms and students
Empirical Bayes Methods

Goal: forecast teacher performance next year (BLUP)
- Forecast is a prediction of the persistent teacher component
- "Shrinkage" estimator = posterior mean: E(μ|M) = Mβ
- Weight (β) placed on a measure increases with:
  - Correlation with the persistent component of interest
  - Reliability with which the measure is estimated (which may vary by teacher, e.g. based on sample size)
- Can apply to any measure (value added, video rating, etc.) or combination of measures (composite estimates)
Error components

Performance measure (M_jc) for teacher j in classroom c is a noisy estimate of the persistent teacher effect (μ_j).
Noise consists of two independent components:
- classroom component (θ_jc) representing peer effects, etc.
- sampling error (ν_jc) if the measure averages over students, videos, raters, etc. (variance depends on sample size)
Model for the error-prone measure: M_jc = μ_j + θ_jc + ν_jc
Prediction in simple case
Using one measure (M) to predict teacher performance on a possibly different measure (M') in a different classroom simplifies to predicting the persistent teacher component:
E(μ'j|Mjc) = Mjcβj
Optimal weights (βj) analogous to regression coefficients:
βj = Cov(μ'j,Mjc)/Var(Mjc) = Cov(μ'j,μj)/[Var(μj)+Var(θjc)+Var(νjc)]
= {Cov(μ'j,μj)/Var(μj)}*{Var(μj)/[Var(μj)+Var(θjc)+Var(νjc)]}
= {β if Mjc had no noise}*{reliability of Mjc}
Two Key Measurement Problems

- Reliability/Instability: imprecision, transitory measurement error; e.g., low correlation across classrooms
- Validity/Bias: persistently misrepresents performance (e.g. student sorting); test scores capture only one dimension of performance; depends on design, content, & scaling of the test

Validity & reliability determine a measure's ability to predict performance:
Correlation of measure with true performance = (correlation of persistent part of measure with true performance) × (square root of reliability)
E.g., Teacher certification versus value added
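As a quick numeric illustration of this identity (both input values below are assumed for illustration, not estimates from the talk):

```python
# corr(measure, true performance) = validity * sqrt(reliability)
validity = 0.8      # correlation of the persistent part with true performance (assumed)
reliability = 0.4   # share of the measure's variance that is persistent (assumed)
corr_with_truth = validity * reliability ** 0.5
print(round(corr_with_truth, 3))  # -> 0.506
```

Even a highly valid measure predicts poorly when reliability is low, which is the certification-vs-value-added contrast.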
Statistical Tests of Model Assumptions
Experimental forecasting test (Kane & Staiger)
Observational specification tests (Rothstein)
Quasi-experimental forecasting test (Carrell & West)
What Kane/Staiger do

- Randomly assign 78 pairs of teachers to classrooms in LAUSD elementary schools
- Provides an experimental estimate of the parameter of interest: if a given classroom of students were to have teacher A rather than teacher B, how different would their average test scores be at the end of the year?
- Evaluate whether pre-experimental estimates from various value-added models predict experimental results
Experimental Design

- All NBPTS applicants from the Los Angeles area.
- For each NBPTS applicant, identified comparison teachers working in the same school, grade, and calendar track.
- LAUSD chief of staff wrote letters to principals inviting them to draw up two classrooms that they would be willing to assign to either teacher.
- If the principal agreed, classroom rosters (not individual students) were randomly assigned by LAUSD on the day of switching. LAUSD made paper copies of rosters on the day of the switch.
- Yielded 78 pairs of teachers (156 classrooms and 3,500 students) for whom we had estimates of "value-added" impacts from the pre-experimental period.
LAUSD Data

- Grades 2 through 5
- Three time periods:
  - Years before random assignment: Spring 2000 through Spring 2003
  - Years of random assignment: either Spring 2004 or 2005
  - Years after random assignment: Spring 2005 (or 2006) through Spring 2007
- Outcomes:
  - California Standards Test (Spring 2004-2007)
  - Stanford 9 Tests (Spring 2000 through 2002)
  - California Achievement Test (Spring 2003)
- Covariates:
  - Student: baseline math and reading scores (interacted with grade), race/ethnicity (Hispanic, white, black, other or missing), ever retained, Title I, eligible for free lunch, gifted and talented, special education, English language development (levels 1-5)
  - Peers: means of all the above for students in the classroom
  - Fixed effects: school × grade × track × year
- Sample exclusions: classes with >20 percent special education; classes with fewer than 5 or more than 36 students
- All scores standardized by grade and year.
Table 3: Non-experimental Estimates of Teacher Effect Variance Components
Standard deviation of each component (in student-level standard deviation units)

Specification Used for Non-experimental Teacher Effect | Teacher Effect | Teacher-by-Year Random Effect | Mean Sample Size per Teacher
Math levels with...
  No controls | 0.448 | 0.229 | 47.255
  Student/peer controls (incl. prior scores) | 0.231 | 0.179 | 41.611
  Student/peer controls (incl. prior scores) & school F.E. | 0.219 | 0.177 | 41.611
  Student fixed effects | 0.101 | 0.061 | 47.255
Math gains with...
  No controls | 0.236 | 0.219 | 43.888
  Student/peer controls | 0.234 | 0.219 | 43.888
  Student/peer controls & school F.E. | 0.225 | 0.219 | 43.888

Note: The above estimates are based on the total variance in estimated teacher fixed effects using observations from the pre-experimental data (years 1999-2000 through 2002-03). See the text for discussion of the estimation of the decomposition into teacher-by-year random effects, student-level error, and "actual" teacher effects. The sample was limited to schools with teachers in the experimental sample. Any individual students who were in the experiment were dropped from the pre-experimental estimation, to avoid any spurious relationship due to regression to the mean, etc.
Evaluating Value Added

3. Test validity of VA_j against experimental outcomes

Y_jp = γ_p + δ·VA_jp + ε_jp,  for j = 1,2 and p = 1,...,78

Within-pair differencing gives:

Y_2p − Y_1p = δ·(VA_2p − VA_1p) + (ε_2p − ε_1p),  for p = 1,...,78

Hypotheses:
- Y = baseline characteristics: H_0: δ = 0
- Y = test scores, experimental year: H_0: δ = 1
- Y = test scores, experimental years +1 and +2: H_0: δ = 1 (fade-out implies δ < 1)
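The forecast test is a bivariate regression with no constant of within-pair outcome differences on within-pair value-added differences. A minimal sketch on simulated data (the 78-pair structure is kept; everything else, including the true δ = 1, is illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
pairs = 78
d_va = rng.normal(0, 0.2, pairs)        # VA_2p - VA_1p (predicted difference)
d_y = d_va + rng.normal(0, 0.3, pairs)  # Y_2p - Y_1p, generated with true delta = 1

# OLS through the origin: delta_hat = sum(x*y) / sum(x*x)
delta_hat = (d_va * d_y).sum() / (d_va ** 2).sum()
print(round(delta_hat, 2))  # near 1 when value added predicts without bias
```

An estimated δ near 1 for experimental-year scores is the "unbiased prediction" result in Table 6.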
Table 6. Regression of Experimental Difference in Average Test Scores on Non-Experimental Estimates of Differences in Teacher Effect

Specification Used for Non-experimental Teacher Effect | First Year Coeff. (s.e.) | First Year R² | Second Year Coeff. (s.e.) | Third Year Coeff. (s.e.)
Math levels with...
  No controls | 0.511*** (0.108) | 0.185 | 0.282** (0.107) | 0.124 (0.101)
  Student/peer controls (incl. prior scores) | 0.852*** (0.177) | 0.210 | 0.359* (0.172) | 0.034 (0.133)
  Student/peer controls (incl. prior scores) & school F.E. | 0.905*** (0.180) | 0.226 | 0.390* (0.176) | 0.07 (0.136)
  Student fixed effects | 1.859*** (0.470) | 0.153 | 0.822 (0.445) | 0.304 (0.408)
Math gains with...
  No controls | 0.794*** (0.201) | 0.162 | 0.342 (0.185) | 0.007 (0.146)
  Student/peer controls | 0.828*** (0.207) | 0.171 | 0.356 (0.191) | 0.01 (0.151)
  Student/peer controls & school F.E. | 0.865*** (0.213) | 0.177 | 0.382 (0.200) | 0.025 (0.157)
N: 78 in each column

Note: The within-pair difference in mean test scores (math or ELA, corresponding to the teacher effect) was regressed on different non-experimental estimates of differences in teacher effects. The coefficients were estimated in separate bivariate regressions with no constant. Robust standard errors are reported in parentheses.
Figure 1: Within-Pair Differences in Pre-experimental Value-added and End of First Year Test Score (Mathematics)
[Scatter plot: within-pair difference in end-of-first-year test score (y-axis, -1.5 to 1.5) vs. within-pair difference in pre-experimental value-added (x-axis, 0 to .8), with linear fit, lowess fit, and 45-degree line.]
How much of the variance in (μ_2p − μ_1p) is "explained" by (VA_2p − VA_1p)?

Total variance: Var(Y_2p − Y_1p) = 17.6
Signal variance: 2·Var(μ̂_j) = 7.48
Maximum R² (if VA captured the actual effect): 7.48 / 17.6 = .425
Actual R² = .226
Proportion "explained": .226 / .425 ≈ .53
pp
Not Clear How To Interpret Fade-out

- Forgetting, transitory teaching-to-the-test: value added overstates long-term impact
- Knowledge that is not used becomes inoperable: need a string of good teachers to maintain the effect
- Grade-specific content of tests not cumulative: later tests understate the contribution of the current teacher
- Students of the best teachers are mixed with students of the worst teachers the following year, and the new teacher will focus effort on students who are behind (peer effects): no fade-out if all teachers were effective
Our approach:
1. Test whether non-experimental estimators (μ̂, θ̂) validly predict student achievement following random assignment.

Rothstein's approach:
1. Test exclusion restrictions on lagged achievement and teacher assignments.
2. Var("Bias") = Var(μ̂_VAM1 − μ̂_VAM4) = Var(VAM1) + Var(VAM4) − 2·Cov(VAM1, VAM4).
3. Simulate effects of selection on unobservables.

Value-added models compared (g indexes grade; each conditions on a progressively richer set of prior scores and covariates):
- VAM1: A_ig = μ_ig + f_1(A_{i,g−1}, X_ig)
- VAM2: A_ig = μ_ig + f_2(A_{i,g−1}, A_{i,g−2}, X_ig)
- VAM4: A_ig = μ_ig + f_4(A_{i,g−1}, A_{i,g−2}, A_{i,g−3}, X_ig, and all other possible controls)
Reconciling with Rothstein (2010)

[Table: correlation of each alternative VAM specification with VAM4: .94, .93, .98, .998; bias variance bounded by ½·Var(VAM1 − VAM4).]

1. Both of us find that past teachers have lingering effects due to fade-out.
2. Rothstein finds that a richer set of covariates has negligible effects.
3. While Rothstein speculates that selection on unobservables could cause problems, our results fail to find evidence of bias.
Reconciling Kane/Staiger with Rothstein

- Both Rothstein and Kane/Staiger find evidence of fade-out.
  - Rothstein finds current student gain is associated with past teacher assignment, conditional on the student's prior test score.
  - Consistent with fade-out of the prior teacher's effect in Kane/Staiger.
  - Bias in the current teacher effect depends on the correlation between current & past teacher value added (small in Rothstein & Kane/Staiger data).
- Both Rothstein and Kane/Staiger find that, after conditioning on the prior test score, other observables don't matter much.
  - Rothstein finds prior student gain is associated with current teacher assignment, conditional on the student's prior test score; i.e., current teacher assignment is associated with the past 2 tests.
  - Rothstein (and others) finds that controlling for earlier tests has little effect on estimates of teacher effects (corr > .98).
- Rothstein speculates that other unobservables used to track students may bias estimates of teacher effects.
  - Kane/Staiger find no substantial bias from such omitted factors.
Carrell/West: A Cautionary Tale!

- Quasi-experimental evidence from the US Air Force Academy
  - Students randomized to classes, common test & grading
  - Estimate teacher effects in 1st-year intro classes
  - Does it predict performance in the 2nd-year class?
- Strong evidence of teaching-to-the-test
  - Big teacher effects in 1st year
  - Lower-ranked instructors have larger "value added" & satisfaction, but predict worse performance in the 2nd-year class
- AF system facilitated teaching to the test
Summary of Statistical Tests

- Value-added estimates in a low-stakes environment yielded unbiased predictions of teachers' causal effects on short-term student achievement.
  - Controlling for baseline score yielded unbiased predictions.
  - Further controlling for peer characteristics yielded the highest explanatory power, explaining over 50% of teacher variation.
- Relative differences in achievement between teachers' students fade out at an annual rate of .4-.6. Understanding the mechanism is key to the long-term benefits of using value added.
- Performance measures can go wrong when easily gamed.
Are Teacher Effects Stable?

- Different across students within a class? No.
- Change over time?
  - Correlation falls slowly at longer lags
  - Teacher peer effects (Jackson/Bruegman)
  - Effect of evaluation on performance (Taylor/Tyler)
- Depend on match/context?
  - Correlation falls when teachers change grades or courses
  - Correlation falls when teachers change schools (Jackson)
How Should Value Added Be Used?

- Growing use of value added to identify effective teachers for pay, promotion, and professional development
- Concern that current value-added estimates are too imprecise & volatile to be used in high-stakes decisions
  - Year-to-year correlation (reliability) around 0.3-0.5
  - Of the top quartile one year, >10% are in the bottom quartile the next year
- No systematic analysis of what this evidence implies for how the measures could be used
Models of Employer Learning
Motivating facts:
- Large persistent variation across teachers (heterogeneity)
- Difficult to predict at hire (not an inspection good)
- Predictable after hire (experience good, so employers learn)
- Return to experience in first few years (cost of hiring)
Searching For Effective Teachers
- Use a simple search model to illustrate how one could use imperfect information on effectiveness to screen teachers
- Use estimates of model parameters from NYC & LAUSD to simulate the potential gains from screening teachers
- Evaluate potential gains from:
  - Observing teacher performance for more years
  - Obtaining more reliable information on teacher performance
  - Obtaining more reliable information at time of hire
Simple search model: Setup
- Teacher effect: μ ~ N(0, σ_μ²)
- Pre-hire signal (if available): Y_0 ~ N(μ, σ_0²), reliability = σ_μ² / (σ_μ² + σ_0²)
- #applicants = 10 times natural attrition; constraint: #hired = #dismissed + natural turnover
- Annual performance on the job (t = 1,...,30): Y_t ~ N(μ + β_t, σ²), reliability = σ_μ² / (σ_μ² + σ²)
- Return to experience: β_t < 0 for early t (cost of hiring)
- Exogenous annual turnover rate (t < 30): π
- Can dismiss up until tenure at t = T
Simple search model: Solution

Objective: maximize student achievement by screening out ineffective teachers using an imperfect performance measure.

Solution is similar to the Jovanovic (1979) matching model:
- Principal sets a reservation value (r_t), increasing with t: dismiss after period t if E(μ | Y_0,...,Y_t) < r_t
- From the normal learning model, with n observed signals and average signal Ȳ:

E(μ | Y_0,...,Y_t) = [n·σ_μ² / (n·σ_μ² + σ²)] · Ȳ

- Reservation value increases because of declining option value
- No simple analytic solution to the general model: numerically estimate the optimal r_t through simulations
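The normal-learning posterior mean can be checked numerically. A small sketch using the base-case annual reliability of 0.4; the function name and inputs are illustrative:

```python
def posterior_mean(y_bar, n, var_mu=0.15**2, var_e=0.15**2 / 0.4 - 0.15**2):
    """E(mu | n signals averaging y_bar) under normal learning: shrink the
    running average toward the prior mean of 0. var_e is chosen so that the
    one-year reliability var_mu/(var_mu + var_e) equals 0.4 (base case)."""
    weight = n * var_mu / (n * var_mu + var_e)
    return weight * y_bar

print(round(posterior_mean(0.1, 1), 3))  # one year: weight = reliability = 0.4
print(round(posterior_mean(0.1, 5), 3))  # more years: less shrinkage
```

As n grows, the weight approaches 1, which is why the principal learns the teacher's quality over time.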
Tenure cutoff in simple case

Suppose:
- No pre-hire signal (new hire is a random draw)
- Tenure after 1 year (no option value)
- Return to experience only in year 1 (β_1 < 0)

After one year, E(μ | Y_1) = [σ_μ² / (σ_μ² + σ²)]·Y_1, so the dismissal rule is a cutoff c on Y_1.

f.o.c.: marginal tenured teacher = average teacher next year. The cutoff c equates the marginal tenured teacher's quality, E(μ | Y_1 = c), with the expected quality of a replacement next year: the replacement passes the bar with probability Pr(Y_1 > c) and then has quality E(μ | Y_1 > c), while the first-year cost β_1 and the survival probabilities (1 − π) over the remaining horizon (t = 2,...,30) discount this gamble.
Simulation assumptions from NYC & LAUSD

Maintained assumptions across all simulations:
- SD of teacher effect: σ_μ = 0.15 (in student SD units; national black-white gap = .8-.9)
- Turnover rate if not dismissed: π = 5%

Assumptions for simplest base case (varied later):
- No useful information at time of hire
- Reliability of Y_t: σ_μ² / (σ_μ² + σ²) = 0.4 (40% reliability)
- Cost of hiring a new teacher: β_t = -.07 in 1st year, -.02 in 2nd year
- Dismissal only after first year (e.g. tenure decision after 1 year)
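The base case can be simulated with a stripped-down model of a single teaching slot. This is an illustrative sketch under the assumptions above (one screening round after year 1, a fixed dismissal share), not the paper's full dynamic program:

```python
import numpy as np

rng = np.random.default_rng(0)
SD_MU = 0.15                         # SD of persistent teacher effect
SD_E = SD_MU * (1 / 0.4 - 1) ** 0.5  # noise SD giving 40% annual reliability
PI = 0.05                            # exogenous annual turnover
BETA = {1: -0.07, 2: -0.02}          # hiring cost in years 1 and 2

def average_achievement(dismiss_share, years=100_000):
    """Long-run average output of one teaching slot when a novice whose
    year-1 signal falls in the bottom `dismiss_share` is replaced."""
    y1_draws = rng.normal(0, (SD_MU**2 + SD_E**2) ** 0.5, 100_000)
    cutoff = np.quantile(y1_draws, dismiss_share) if dismiss_share > 0 else -np.inf
    total, mu, tenure = 0.0, rng.normal(0, SD_MU), 1
    for _ in range(years):
        total += mu + BETA.get(tenure, 0.0)      # output net of novice costs
        dismissed = tenure == 1 and mu + rng.normal(0, SD_E) < cutoff
        if dismissed or rng.random() < PI:
            mu, tenure = rng.normal(0, SD_MU), 1  # replace with a new hire
        else:
            tenure += 1
    return total / years

print(round(average_achievement(0.0), 3), round(average_achievement(0.8), 3))
```

Even this crude version shows that dismissing a large share of novices raises average achievement despite the recurring hiring costs.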
Simple model: dismiss 80% of probationary teachers(!)
[Figure: average value added (0 to .08) and proportion novice (0 to .8) as functions of the proportion dismissed (0 to 1).]
Why dismiss so many probationary teachers?

- Differences in teacher effects are large & persistent, relative to the short-lived costs of hiring a new teacher
- Even unreliable performance measures predict substantial differences in teacher effects: the costs of retaining an ineffective teacher outweigh the costs of dismissing an effective teacher
- Option value of new hires: for every 5 new hires, one will be highly effective; trade off the short-term cost of 4 dismissed vs. the long-term benefit of 1 retained
Why not dismiss so many probationary teachers?

- Smaller benefits than assumed in the model?
  - High turnover rates
  - Teacher differences that do not persist in the future (including if PD can help ineffective teachers)
  - High-stakes distortion of performance measures
- Larger costs than assumed in the model?
  - Direct costs of recruiting/firing (little effect if added)
  - Difficulty recruiting applicants (but LAUSD did)
  - Higher pay required to offset job insecurity (particularly if teacher training is required up front)
Requiring a 2nd or 3rd year to evaluate a probationary teacher is a bad idea.
Wait to dismiss until year: | T=1 | T=2 | T=3 | T=4
Average value added | 0.080 | 0.075 | 0.068 | 0.061
% dismissed | 81% | 75% | 71% | 68%
Allowing a 2nd or 3rd year to evaluate a probationary teacher is a good idea.
Dismissal at any time through year: | T=1 | T=2 | T=3 | T=4
Average value added | 0.080 | 0.095 | 0.099 | 0.101
% dismissed (total) | 81% | 83% | 84% | 84%
% dismissed in year 1 | 81% | 67% | 67% | 67%
% dismissed in year 2 | 0% | 16% | 8% | 8%
% dismissed in year 3 | 0% | 0% | 9% | 4%
% dismissed in year 4 | 0% | 0% | 0% | 5%
Obtaining more reliable information on teacher performance is valuable, little effect on dismissal
[Figure: average value added (0 to .16) and proportion dismissed (0 to .8), for tenure timing T=1, T=2, T=3, T>3, as functions of the reliability of the annual performance measure (0 to 1).]
Obtaining more reliable information at time of hire is even more valuable, and reduces dismissal rate.
[Figure: average value added (.1 to .25) and proportion dismissed (0 to 80%) as functions of the reliability of the pre-hire performance signal (0 to 1).]
Implications
- Why do principals set a low tenure bar?
  - Poor incentives (private schools?)
  - Lack of verifiable performance information
  - Current up-front training requirements (not necessary?)
  - Lose best teachers if cannot raise pay
- Why don't other occupations & professions dismiss 80%?
  - Job ladder: a low-stakes entry-level job is used to screen
  - MD, JD: require up-front training, job differentiation later
- Alternatives to the current system
  - No up-front investment: can train later
  - Rather than credentials, base certification on performance
  - Develop a "job ladder" pre-screen, e.g. an initial job where few students are put at risk but ability is revealed (summer school?)
Summary
- Potential gain is large
  - Could raise average annual achievement gains by ≈ 0.08
  - Similar magnitude to the STAR class-size experiment and to recent results from charter school lotteries
  - Gains could be doubled with a more reliable performance measure, and tripled if it were observed pre-hire
- Select only the most effective teachers, and do it quickly
  - There may be practical reasons limiting the success of this strategy
  - May require rethinking teacher training & the job ladder
- Focused on screening, but other uses may yield large gains
Combining Heterogeneity & Effort: Model of Career Concerns
Gibbons & Murphy (1992)
- Output (y_t) is the sum of ability (η), effort (a_t), and noise (e_t)
- Workers are risk-averse, with convex costs of effort
- Information is imperfect but symmetric, so firms pay expected output
Combining Heterogeneity & Effort: Model of Career Concerns, Gibbons & Murphy (1992)

Simple optimal linear contract: w_t = c_t + b_t*·(y_t − c_t), with c_t = a_t* + m_{t−1}

Base pay (c_t) is the expected value at t−1 of output at t, and is the sum of two terms:
- equilibrium effort (a_t*): an experience effect
- posterior mean of ability (m_{t−1}) based on earlier output

Incentive payment depends on:
- How much output exceeds expectations (y_t − c_t)
- Weight (b_t*) declines with noise in y_t, and grows with experience; early effort is rewarded indirectly through its impact on beliefs (m_{t−1})
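A numerical sketch of the contract and the belief updating behind m_{t−1}. The prior and noise variances, equilibrium effort a*, and incentive weight b_t below are all illustrative assumptions, not values from Gibbons & Murphy:

```python
VAR_ETA = 0.15**2   # prior variance of ability (assumed)
VAR_E = 0.10**2     # output noise variance (assumed)

def wage(y_t, m_prev, a_star=0.05, b_t=0.3):
    """w_t = c_t + b_t*(y_t - c_t): base pay c_t = a_star + m_{t-1}
    plus an incentive payment on output above expectation.
    a_star and b_t are illustrative constants, not equilibrium values."""
    c_t = a_star + m_prev
    return c_t + b_t * (y_t - c_t)

def update_belief(m_prev, prec_prev, y_t, a_star=0.05):
    """Posterior mean/precision of ability after observing y_t = eta + a_t + e_t,
    netting out equilibrium effort (normal-normal updating)."""
    prec = prec_prev + 1 / VAR_E
    m = (prec_prev * m_prev + (y_t - a_star) / VAR_E) / prec
    return m, prec

w1 = wage(0.12, 0.0)                          # year-1 wage given output 0.12
m1, p1 = update_belief(0.0, 1 / VAR_ETA, 0.12)
print(round(w1, 3), round(m1, 3))
```

The posterior mean m_1 rises after above-expectation output, which is the channel through which early effort is rewarded in later base pay.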
Implications For Teacher Contract

- Base pay determined by experience & the Empirical Bayes estimate of performance (pay grades)
- Incentive payment later in career (after tenure), based on performance relative to others in your pay grade
- Incentives depend on noise in value added, which may be worse in some subjects (ELA) or in small classes, so pay could be based on percentile rank within class-size & subject categories