The reliability of educational assessments
Dylan Wiliam
www.dylanwiliam.net
Ofqual Annual Lecture, Coventry: 7 May 2009
The argument
The public understanding of the reliability of assessments is weak
Contributory factors are
  The need of humans for certainty (and beliefs that its absence is chaos)
  The inherent unreliability of all measurements, educational and otherwise
  The use in education of tools derived from individual differences psychology
  An emphasis on scores, rather than how they are used
  Political assumptions about the educability of the public, combined with
  A desire to use assessment outcomes as drivers of reform
Those who produce—and those who mandate the use of—assessments must take responsibility for informed use of assessment outcomes
Dealing with uncertainty in society
People like certainty…
  Hilbert (1900): “In mathematics, there is no ignorabimus”
  He was wrong
  And it was unsettling (Klein, 1980)
…and to attribute blame…
  Deaths of children in care (e.g., “Baby P.”)
…although there are some cases where uncertainty is accepted
  “It is better and more satisfactory to acquit a thousand guilty persons than to put a single innocent one to death” (Maimonides)
  “It is better that ten guilty persons escape than that one innocent suffer” (Blackstone)
The very first high-stakes assessment…
“Then Jephthah gathered together all the men of Gilead, and fought with Ephraim: and the men of Gilead smote Ephraim, because they said, Ye Gileadites are fugitives of Ephraim among the Ephraimites, and among the Manassites.
And the Gileadites took the passages of Jordan before the Ephraimites: and it was so, that when those Ephraimites which were escaped said, Let me go over; that the men of Gilead said unto him, Art thou an Ephraimite? If he said, Nay;
Then said they unto him, Say now Shibboleth: and he said Sibboleth: for he could not frame to pronounce it right. Then they took him, and slew him at the passages of Jordan: and there fell at that time of the Ephraimites forty and two thousand.” (Judges 12:4-6, King James Version)
Reliability
Hansen (1993) distinguishes between literal and representational assessments
  There are no literal assessments
  All assessments are representational
  All assessments involve generalization
Reliability is a measure of the stability of assessment outcomes under changes in, or the ability to generalize across, things that (we think) shouldn’t make a difference, such as:
  marker/rater
  occasion*
  item selection*
* UK excepted
Uncertainty in assessing English
Starch & Elliott (1912)
Uncertainty in assessing mathematics
Starch & Elliott (1913)
Measures of reliability
In classical test theory, reliability is defined as a kind of “signal-to-noise” ratio (in fact a signal to signal-plus-noise ratio)
Reliability is increased by decreasing the noise, or, easier, by increasing the signal
Hence the need for discrimination
The legacy of individual differences psychology
A focus on discrimination between individuals
In education, more appropriate ways of estimating reliability exist
  Discriminating between those who have and have not been taught
  Discriminating between those who have and have not been taught well
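The signal to signal-plus-noise definition of reliability given above can be sketched in a few lines. This is only an illustration: "signal" is taken to be the variance of true scores across the group and "noise" the variance of measurement error, and the variance values are hypothetical numbers chosen for the example, not figures from the lecture.

```python
def reliability(signal_variance: float, noise_variance: float) -> float:
    """Classical-test-theory reliability: signal / (signal + noise)."""
    return signal_variance / (signal_variance + noise_variance)


# Hypothetical variances, chosen only to illustrate the ratio.
baseline = reliability(signal_variance=4.0, noise_variance=1.0)

# Two routes to a higher ratio: reduce the noise...
less_noise = reliability(signal_variance=4.0, noise_variance=0.5)

# ...or increase the signal, i.e. spread the candidates out more.
more_signal = reliability(signal_variance=8.0, noise_variance=1.0)

print(baseline, less_noise, more_signal)
```

Under these assumptions, a test that spreads candidates out more (a larger signal variance) earns a higher coefficient even though the noise is unchanged, which is one way to read the slide's link between reliability and discrimination.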
Test length and reliability
           From
To         0.70  0.75  0.80  0.85  0.90  0.95
0.70       1.0
0.75       1.3   1.0
0.80       1.7   1.3   1.0
0.85       2.4   1.9   1.4   1.0
0.90       3.9   3.0   2.3   1.6   1.0
0.95       8.1   6.3   4.8   3.4   2.1   1.0
Just about the only way to increase the reliability of a test is to make it longer, or narrower (which amounts to the same thing).
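The multipliers in the table above are consistent with the Spearman-Brown formula from classical test theory. The slides do not name the formula, so treating it as the source of the table is an assumption, and the function names below are illustrative. A minimal sketch:

```python
def lengthened_reliability(r: float, n: float) -> float:
    """Spearman-Brown: reliability after making a test n times as long."""
    return n * r / (1 + (n - 1) * r)


def lengthening_factor(r_from: float, r_to: float) -> float:
    """Spearman-Brown solved for n: how many times longer the test must be."""
    return r_to * (1 - r_from) / (r_from * (1 - r_to))


# Raising reliability from 0.70 to 0.95 (bottom-left cell of the table).
factor = lengthening_factor(0.70, 0.95)
print(round(factor, 1))

# Check: lengthening by that factor gives the target reliability back.
print(lengthened_reliability(0.70, factor))
```

The formula assumes the added items behave like the existing ones (parallel items), which is why lengthening a test and narrowing what it covers have the same effect on the coefficient.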
Reliability is not what we really want
Take a test which is known to have a reliability of around 0.90 for a particular group of students.
  Administer the test to the group of students and score it
  Give each student a random script rather than their own
  Record the scores assigned to each student
What is the reliability of the scores assigned in this way?
  A. 0.10
  B. 0.30
  C. 0.50
  D. 0.70
  E. 0.90
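One way to explore the question is by simulation. The sketch below rests on assumptions the slide does not state: reliability is estimated with an internal-consistency coefficient (Cronbach's alpha), and the item scores are simulated as ability plus noise with arbitrary parameters. It compares the coefficient before and after the scripts are shuffled, and also checks how closely each student's assigned total tracks their own total.

```python
import random
from statistics import mean, pvariance


def cronbach_alpha(scripts):
    """Cronbach's alpha for a list of scripts (each a list of item scores)."""
    k = len(scripts[0])
    item_vars = [pvariance([s[i] for s in scripts]) for i in range(k)]
    total_var = pvariance([sum(s) for s in scripts])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)

# Simulated data (hypothetical): 200 students, 40 items,
# each item score = student ability + item-level noise.
scripts = []
for _ in range(200):
    ability = rng.gauss(0, 1)
    scripts.append([ability + rng.gauss(0, 2) for _ in range(40)])

# Give each student a random script rather than their own.
assigned = scripts[:]
rng.shuffle(assigned)

alpha_own = cronbach_alpha(scripts)
alpha_assigned = cronbach_alpha(assigned)
link = pearson([sum(s) for s in scripts], [sum(s) for s in assigned])

print(alpha_own, alpha_assigned, link)
```

Under these assumptions the coefficient is computed from the set of scripts alone, so it cannot register which student each script was handed to; the correlation between assigned and own totals is the quantity that reflects the shuffle.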
Reliability v consistency
Classical measures of reliability
  are meaningful only for groups
  are designed for continuous measures
Marks versus grades
  Scores suffer from spurious accuracy
  Grades suffer from spurious precision
Classification consistency
  A more technically appropriate measure of the reliability of assessment
  Closer to the intuitive meaning of reliability
Uncertainty in assessment at A-level
Classification consistency at A-level
Please, D. N. (1971) “Estimation of the proportion of examination candidates who are wrongly graded”. Br. J. math. statist. Psychol. 24: 230-238.
Here’s the table that got me into trouble…
Classification consistency of National Curriculum Assessment in England
                Reliability
      Levels    0.60  0.65  0.70  0.75  0.80  0.85  0.90  0.95
KS1   3         73%   75%   77%   79%   81%   83%   86%   90%
KS2   5         56%   58%   60%   64%   68%   73%   77%   84%
KS3   8         45%   47%   50%   54%   57%   62%   68%   76%
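The pattern in the table (consistency rising with reliability and falling as the number of levels grows) can be illustrated with a small simulation. This is not the method used to produce the table, and it will not reproduce those percentages. It assumes normally distributed true scores, the classical model observed = true + error, and hypothetical cut scores that split the true-score distribution into equal-sized bands.

```python
import random
from statistics import NormalDist


def classification_consistency(reliability, n_levels, n_students=20000, seed=0):
    """Share of simulated students whose observed level equals their true level."""
    rng = random.Random(seed)
    # Hypothetical cut scores: equal-sized bands of a standard normal.
    std_normal = NormalDist()
    cuts = [std_normal.inv_cdf(i / n_levels) for i in range(1, n_levels)]
    # Error SD chosen so that var(true) / var(observed) equals `reliability`.
    error_sd = ((1 - reliability) / reliability) ** 0.5

    def level(score):
        # Level index = number of cut scores the score exceeds.
        return sum(score > c for c in cuts)

    agree = 0
    for _ in range(n_students):
        true = rng.gauss(0, 1)
        observed = true + rng.gauss(0, error_sd)
        agree += level(true) == level(observed)
    return agree / n_students


for levels in (3, 5, 8):
    row = [classification_consistency(r, levels) for r in (0.60, 0.80, 0.95)]
    print(levels, [f"{p:.0%}" for p in row])
```

Changing where the cut scores fall, or the shape of the score distribution, changes the percentages, so the output should be read for its direction rather than its values.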
AERA, APA, NCME Standards (4th ed., 1999)
Standard 2.1
  For each total score, subscore or combination of scores that is to be interpreted, estimates of relevant reliabilities and standard errors of measurement or test information functions should be reported (p. 31)
Standard 2.2
  The standard error of measurement, both overall and conditional (if relevant) should be reported both in raw score or original scale units and in units of each derived score recommended for use in test interpretation (p. 31, my emphasis).
Standard 2.3
  When test interpretation emphasizes differences between two observed scores of an individual, or two averages of a group, reliability data, including standard errors, should be provided for such differences (p. 32)
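The standard errors these standards ask for can be estimated from a reliability coefficient using the usual classical-test-theory formulas. The sketch below is illustrative only: the score scale (mean 100, standard deviation 15) and the reliability of 0.91 are hypothetical, and the formula for a difference assumes the two sets of measurement errors are independent.

```python
import math


def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """Classical SEM: score SD times the square root of (1 - reliability)."""
    return sd * math.sqrt(1 - reliability)


def standard_error_of_difference(sem_a: float, sem_b: float) -> float:
    """SE of the difference between two scores, assuming independent errors."""
    return math.sqrt(sem_a ** 2 + sem_b ** 2)


# Hypothetical scale: SD of 15 points, reliability of 0.91.
sem = standard_error_of_measurement(sd=15.0, reliability=0.91)

# Approximate 95% band around an observed score of 100.
band = (100 - 1.96 * sem, 100 + 1.96 * sem)

# Comparing two scores from the same test doubles the error variance.
sed = standard_error_of_difference(sem, sem)

print(sem, band, sed)
```

Reporting in the units actually used, as Standard 2.2 requires, would mean translating a band like this into grades or levels rather than leaving it in raw score points.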
Conclusion
It is simply unethical to produce or to mandate the use of assessments without taking steps to ensure informed use of the outcomes of the assessments by those likely to use them.
Error bounds should be routinely estimated, and reported in terms of the units used for reporting (e.g., grades and aggregate measures)
Government and its agencies should actively promote public understanding of
  the limitations of assessments, both in terms of reliability and other aspects of validity
  appropriate interpretations of assessment outcomes, for individuals and groups