Post on 17-Jan-2016
Statistical power in educational settings
Workshop at Wellcome seminar on educational research, May 2008
Dylan Wiliam
Institute of Education, University of London
www.dylanwiliam.net
The argument…
Premise 1
Learning is insensitive to instruction
Measures of learning even more so
So even small system-wide gains in learning are educationally important
Premise 2
Education systems are inherently multi-levelled
Taking account of clustering in data lowers statistical power
Educational experiments are inherently weak
Conclusion
RCTs in education frequently need to be very large, and therefore expensive
Learning is slow…
[Figure: Facility (0.00 to 1.00) by age (6 to 12 years) for the item 860 + 570 = ? Source: Leverhulme Numeracy Research Programme]
…especially for deep learning…
[Figure: Achievement in decimals by age. Proportion (0 to 100) at each level achieved (1 to 6), for ages 12, 13, 14 and 15. Source: Hart, 1981]
…and measures are insensitive…
[Figure: Annual growth in school attainment (STEP). Annual increase (standard deviations, 0 to 1) by grade (5 to 15) for Reading, Writing, Listening, Soc. Stud., Science and Math. Source: Sequential Tests of Educational Progress (ETS, 1957)]
…and measures are insensitive…
[Figure: Annual growth in school attainment (STEP). Annual increase (standard deviations, 0 to 1) by grade (5 to 15) for Reading, Writing, Listening, Soc. Stud., Science and Math, with NAEP and TIMSS added]
…so small gains in learning are worthwhile
Average rate of progress of cohorts is 0.3 standard deviations per year
Average cost of one year’s education for a cohort in England is £3bn
An effect size of 0.05 sd might be regarded as “small”
But system-wide, it is worth £6bn
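One way to reproduce the £6bn figure is sketched below. The slide does not show its working, so the calculation is an assumption: the 0.05 sd gain is converted to a fraction of a year's progress (0.05 / 0.3), priced at the cost of a cohort-year, and multiplied across roughly 12 cohorts in the system at once. The figure of 12 cohorts and the function name `system_wide_value` are illustrative, not taken from the slides.

```python
def system_wide_value(effect_sd, growth_sd_per_year, cost_per_cohort_year, n_cohorts):
    """Value of a gain, priced as the schooling time it is equivalent to."""
    # Fraction of a year's normal progress that the gain represents.
    years_equivalent = effect_sd / growth_sd_per_year
    # Price that time at the annual cost of a cohort, across all cohorts.
    return years_equivalent * cost_per_cohort_year * n_cohorts


# Figures from the slide: 0.05 sd gain, 0.3 sd per year, £3bn per cohort-year.
# n_cohorts=12 is an assumed number of year groups in the system.
value = system_wide_value(0.05, 0.3, 3e9, n_cohorts=12)
print(f"£{value / 1e9:.1f}bn")  # prints "£6.0bn"
```

Under these assumptions the gain is worth about £0.5bn per cohort, which only becomes £6bn when summed over the whole system.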
…but hard to detect…
Statistical power: the likelihood that a statistical test will reject a false null hypothesis
Depends on:
The level set for statistical significance
The size of the difference between compared groups (effect size)
The sensitivity of the measures
Clustering reduces statistical power, but is an inherent feature of educational settings, and especially for school-wide interventions:
Teacher quality
Ability grouping
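The effect of clustering on power can be illustrated with a rough sketch. It uses a normal approximation for a two-group comparison and the standard two-level design effect, 1 + (m - 1) × ICC, where m is the cluster size and ICC is the intraclass correlation. The function name, the sample sizes and the ICC of 0.2 are illustrative assumptions, not values from the slides.

```python
from math import sqrt
from statistics import NormalDist


def power_two_group(effect_size, n_per_group, alpha=0.05, cluster_size=1, icc=0.0):
    """Approximate power of a two-sided, two-group comparison of means.

    Normal approximation; clustering is handled by shrinking the sample
    to its effective size using the design effect 1 + (m - 1) * icc.
    """
    design_effect = 1 + (cluster_size - 1) * icc
    n_effective = n_per_group / design_effect
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic when the effect is real.
    shift = effect_size * sqrt(n_effective / 2)
    # Probability of landing in either rejection region.
    return NormalDist().cdf(shift - z_crit) + NormalDist().cdf(-shift - z_crit)


# A "small" effect of 0.05 sd with 6000 students per arm (assumed numbers).
unclustered = power_two_group(0.05, 6000)
# Same students, but in classes of 25 with an assumed ICC of 0.2.
clustered = power_two_group(0.05, 6000, cluster_size=25, icc=0.2)
print(f"unclustered: {unclustered:.2f}, clustered: {clustered:.2f}")
```

With these assumed values the same 12,000 students give reasonable power if treated as independent, but far less once classroom clustering is taken into account.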
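…especially in educational settings
[Figure: statistical power under classroom and school clustering (Konstantopoulos, 2006). Legend: p = number of students; n = number of classrooms; δ = effect size; ρc = classroom clustering; ρs = school clustering]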
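To give a sense of how classroom and school clustering combine, the sketch below uses the textbook design effect for a balanced three-level design in which whole schools are randomised: 1 + (n - 1) × ICC_class + (p × n - 1) × ICC_school, with n students per classroom and p classrooms per school. This is not necessarily the exact expression used by Konstantopoulos (2006), and the function name and all numeric values are illustrative assumptions.

```python
def three_level_design_effect(students_per_class, classes_per_school,
                              icc_class, icc_school):
    """Variance inflation when schools are randomised (balanced design).

    icc_class and icc_school are the shares of total outcome variance
    lying between classrooms (within schools) and between schools.
    """
    n, p = students_per_class, classes_per_school
    return 1 + (n - 1) * icc_class + (p * n - 1) * icc_school


# Assumed values: 25 students per class, 4 classes per school,
# 10% of variance between classrooms and 10% between schools.
deff = three_level_design_effect(25, 4, icc_class=0.10, icc_school=0.10)

# Number of independent students that 10,000 clustered students are worth.
effective_n = 10_000 / deff
print(f"design effect: {deff:.1f}, effective sample size: {effective_n:.0f}")
```

Because the school-level term scales with the number of students per school, even modest school-level clustering dominates the loss of power in this kind of design.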
So…
The most important question is not “Are RCTs good?” but “When are RCTs good?”
How should we answer?
Institute of Education Sciences (USA)
Five goals
1. identify existing programs, practices, and policies that may have an impact on student outcomes and the factors that may mediate or moderate the effects of these programs, practices, and policies;
2. develop programs, practices, and policies that are theoretically and empirically based;
3. evaluate the efficacy of fully developed programs, practices, and policies;
4. evaluate the impact of programs, practices, and policies implemented at scale;
5. develop and/or validate data and measurement systems and tools.