The reliability of educational assessments Dylan Wiliam dylanwiliam

The reliability of educational assessments

Dylan Wiliam

www.dylanwiliam.net

Ofqual Annual Lecture, Coventry: 7 May 2009

The argument

The public understanding of the reliability of assessments is weak
Contributory factors are:
  The need of humans for certainty (and beliefs that its absence is chaos)
  The inherent unreliability of all measurements, educational and otherwise
  The use in education of tools derived from individual differences psychology
  An emphasis on scores, rather than how they are used
  Political assumptions about the educability of the public, combined with
  A desire to use assessment outcomes as drivers of reform

Those who produce—and those who mandate the use of—assessments must take responsibility for informed use of assessment outcomes

Dealing with uncertainty in society

People like certainty…
  Hilbert (1900): “In mathematics, there is no ignorabimus”
  He was wrong
  And it was unsettling (Klein, 1980)

…and to attribute blame…
  Deaths of children in care (e.g., “Baby P.”)

…although there are some cases where uncertainty is accepted
  “It is better and more satisfactory to acquit a thousand guilty persons than to put a single innocent one to death” (Maimonides)
  “It is better that ten guilty persons escape than that one innocent suffer” (Blackstone)

The very first high-stakes assessment…

“Then Jephthah gathered together all the men of Gilead, and fought with Ephraim: and the men of Gilead smote Ephraim, because they said, Ye Gileadites are fugitives of Ephraim among the Ephraimites, and among the Manassites.

And the Gileadites took the passages of Jordan before the Ephraimites: and it was so, that when those Ephraimites which were escaped said, Let me go over; that the men of Gilead said unto him, Art thou an Ephraimite? If he said, Nay;

Then said they unto him, Say now Shibboleth: and he said Sibboleth: for he could not frame to pronounce it right. Then they took him, and slew him at the passages of Jordan: and there fell at that time of the Ephraimites forty and two thousand.” (Judges 12:4-6, King James Version)

Reliability

Hansen (1993) distinguishes between literal and representational assessments
  There are no literal assessments
  All assessments are representational
  All assessments involve generalization

Reliability is a measure of the stability of assessment outcomes under changes in, or the ability to generalize across, things that (we think) shouldn’t make a difference, such as:
  marker/rater
  occasion*
  item selection*

* UK excepted

Uncertainty in assessing English
Starch & Elliott (1912)

Uncertainty in assessing mathematics
Starch & Elliott (1913)

Measures of reliability

In classical test theory, reliability is defined as a kind of “signal-to-noise” ratio (in fact a signal to signal-plus-noise ratio)
Reliability is increased by decreasing the noise, or, easier, by increasing the signal

Hence the need for discrimination
The legacy of individual differences psychology
  A focus on discrimination between individuals
In education, more appropriate ways of estimating reliability exist
  Discriminating between those who have and have not been taught
  Discriminating between those who have and have not been taught well
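The signal to signal-plus-noise description of reliability can be written as a one-line calculation. The sketch below is illustrative only: the function name `reliability` and the variance figures are assumptions chosen for the example, not values from the lecture.

```python
def reliability(true_variance: float, error_variance: float) -> float:
    """Classical test theory reliability: signal / (signal + noise)."""
    if true_variance < 0 or error_variance < 0:
        raise ValueError("variances must be non-negative")
    total = true_variance + error_variance
    if total == 0:
        raise ValueError("total variance must be positive")
    return true_variance / total


# Hypothetical variances, chosen only to show the two routes to higher reliability.
baseline = reliability(true_variance=4.0, error_variance=1.0)
less_noise = reliability(true_variance=4.0, error_variance=0.5)   # decrease the noise
more_signal = reliability(true_variance=8.0, error_variance=1.0)  # increase the signal
print(baseline, less_noise, more_signal)
```

Increasing the signal here means spreading candidates' true scores further apart, which is why tests built on this definition favour items that discriminate between individuals.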

Test length and reliability

            From
To      0.70  0.75  0.80  0.85  0.90  0.95
0.70     1.0
0.75     1.3   1.0
0.80     1.7   1.3   1.0
0.85     2.4   1.9   1.4   1.0
0.90     3.9   3.0   2.3   1.6   1.0
0.95     8.1   6.3   4.8   3.4   2.1   1.0

Just about the only way to increase the reliability of a test is to make it longer, or narrower (which amounts to the same thing).
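The figures in the test length table appear consistent with the Spearman-Brown prophecy formula from classical test theory, although the slide does not name it. The sketch below assumes that formula; the function names are hypothetical. Read each entry as the factor by which a test must be lengthened to move from one reliability to another.

```python
def lengthening_factor(current: float, target: float) -> float:
    """Spearman-Brown: factor by which a test must be lengthened
    to move from reliability `current` to reliability `target`."""
    if not (0 < current < 1 and 0 < target < 1):
        raise ValueError("reliabilities must lie strictly between 0 and 1")
    return target * (1 - current) / (current * (1 - target))


def lengthened_reliability(current: float, factor: float) -> float:
    """Spearman-Brown: reliability after lengthening a test by `factor`."""
    return factor * current / (1 + (factor - 1) * current)


print(round(lengthening_factor(0.70, 0.95), 1))  # 8.1
print(round(lengthening_factor(0.85, 0.90), 1))  # 1.6
```

On this reading, moving a test from a reliability of 0.70 to 0.95 requires a test roughly eight times as long.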

Reliability is not what we really want

Take a test which is known to have a reliability of around 0.90 for a particular group of students.
  Administer the test to the group of students and score it
  Give each student a random script rather than their own
  Record the scores assigned to each student

What is the reliability of the scores assigned in this way?
  A. 0.10
  B. 0.30
  C. 0.50
  D. 0.70
  E. 0.90
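The slide leaves the question open. One way to explore it is a small simulation, sketched below under loud assumptions: each script is modelled as two parallel parts (`part_a`, `part_b`), the correlation between the parts stands in for a reliability estimate, and all data are randomly generated. None of these names or figures comes from the lecture.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)        # fixed seed so the run is repeatable
n_students = 5000
error_sd = math.sqrt(1 / 9)   # gives a correlation between parts near 0.90

# Hypothetical data: one true score per student, two parallel parts per script.
true_scores = [rng.gauss(0, 1) for _ in range(n_students)]
part_a = [t + rng.gauss(0, error_sd) for t in true_scores]
part_b = [t + rng.gauss(0, error_sd) for t in true_scores]

# Hand every student a randomly chosen script (both parts of it).
order = list(range(n_students))
rng.shuffle(order)
given_a = [part_a[i] for i in order]
given_b = [part_b[i] for i in order]

own_reliability = pearson(part_a, part_b)
random_reliability = pearson(given_a, given_b)
link_to_student = pearson(given_a, true_scores)

print(f"own scripts:    {own_reliability:.2f}")
print(f"random scripts: {random_reliability:.2f}")
print(f"random script vs student's true score: {link_to_student:.2f}")
```

Under these assumptions the coefficient computed from the scripts is unchanged by handing them to the wrong students, while the assigned scores carry almost no information about the students who receive them. That is one reading of the slide's title.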

Reliability v consistency

Classical measures of reliability
  are meaningful only for groups
  are designed for continuous measures
Marks versus grades
  Scores suffer from spurious accuracy
  Grades suffer from spurious precision
Classification consistency
  A more technically appropriate measure of the reliability of assessment
  Closer to the intuitive meaning of reliability

Uncertainty in assessment at A-level

Classification consistency at A-level
Please, D. N. (1971) “Estimation of the proportion of examination candidates who are wrongly graded”. Br. J. math. statist. Psychol. 24: 230-238.

Here’s the table that got me into trouble…

Classification consistency of National Curriculum Assessment in England

                Reliability
      Levels    0.60  0.65  0.70  0.75  0.80  0.85  0.90  0.95
KS1   3         73%   75%   77%   79%   81%   83%   86%   90%
KS2   5         56%   58%   60%   64%   68%   73%   77%   84%
KS3   8         45%   47%   50%   54%   57%   62%   68%   76%
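The pattern in the classification consistency table (lower figures with lower reliability and with more levels) can be explored with a rough simulation. The sketch below is not the method behind the table, whose assumptions are not given here, and it is not expected to reproduce the table's figures. It assumes normally distributed true scores and errors, cut scores that split the observed-score distribution into equally populated bands, and a hypothetical function name. It counts the share of simulated students whose observed level matches the level implied by their true score.

```python
import bisect
import math
import random
from statistics import NormalDist


def classification_consistency(reliability, n_levels, n_students=20000, seed=0):
    """Share of simulated students whose observed level equals their true level.

    Assumptions (not from the lecture): normal true scores and errors, and
    cut scores placing equal shares of observed scores in each of `n_levels` bands.
    """
    if not 0 < reliability < 1:
        raise ValueError("reliability must lie strictly between 0 and 1")
    rng = random.Random(seed)
    true_sd = math.sqrt(reliability)        # observed-score variance is fixed at 1
    error_sd = math.sqrt(1 - reliability)
    unit_normal = NormalDist()
    cuts = [unit_normal.inv_cdf(j / n_levels) for j in range(1, n_levels)]

    agree = 0
    for _ in range(n_students):
        true_score = rng.gauss(0, true_sd)
        observed = true_score + rng.gauss(0, error_sd)
        if bisect.bisect_right(cuts, true_score) == bisect.bisect_right(cuts, observed):
            agree += 1
    return agree / n_students


for levels in (3, 5, 8):
    row = [classification_consistency(r, levels) for r in (0.60, 0.80, 0.95)]
    print(levels, " ".join(f"{p:.0%}" for p in row))
```

The direction of the effect is the point of the sketch: more levels mean more boundaries for measurement error to push a student across.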

AERA, APA, NCME Standards (4th ed., 1999)

Standard 2.1
For each total score, subscore or combination of scores that is to be interpreted, estimates of relevant reliabilities and standard errors of measurement or test information functions should be reported (p. 31)

Standard 2.2
The standard error of measurement, both overall and conditional (if relevant) should be reported both in raw score or original scale units and in units of each derived score recommended for use in test interpretation (p. 31, my emphasis).

Standard 2.3
When test interpretation emphasizes differences between two observed scores of an individual, or two averages of a group, reliability data, including standard errors, should be provided for such differences (p. 32)
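The quantities these standards ask for follow from a reliability coefficient under classical test theory. The sketch below uses the textbook formulas (standard error of measurement as SD × √(1 − reliability), and the standard error of a difference as the root sum of squares of two independent standard errors). The function names and the example test are hypothetical.

```python
import math


def standard_error_of_measurement(score_sd: float, reliability: float) -> float:
    """Classical test theory SEM: SD * sqrt(1 - reliability)."""
    if not 0 <= reliability <= 1:
        raise ValueError("reliability must lie between 0 and 1")
    return score_sd * math.sqrt(1 - reliability)


def standard_error_of_difference(sem_first: float, sem_second: float) -> float:
    """Standard error of the difference between two scores with independent errors."""
    return math.hypot(sem_first, sem_second)


# Hypothetical test: scores with a standard deviation of 15 and reliability 0.90.
sem = standard_error_of_measurement(score_sd=15.0, reliability=0.90)
sed = standard_error_of_difference(sem, sem)
print(f"SEM = {sem:.2f}, SE of a difference = {sed:.2f}")
```

A difference between two scores is less certain than either score alone, which is the situation Standard 2.3 addresses.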

Conclusion

It is simply unethical to produce or to mandate the use of assessments without taking steps to ensure informed use of the outcomes of the assessments by those likely to use them.
Error bounds should be routinely estimated, and reported in terms of the units used for reporting (e.g., grades and aggregate measures)
Government and its agencies should actively promote public understanding of:
  the limitations of assessments, both in terms of reliability and other aspects of validity
  appropriate interpretations of assessment outcomes, for individuals and groups
