Not invented here: the baffling insularity of assessment practices in higher education
Dylan Wiliam
www.dylanwiliam.net
Keynote presentation at the University of London External System’s 150th anniversary Assessment Symposium
Overview: some assessment tensions
Function: Formative versus summative
Quality: Validity versus reliability
Format: Multiple-choice versus constructed response
Scope: Continuous versus one-off
Function | Quality | Format | Scope
A statement of the blindingly obvious
You can’t work out how good something is until you know what it’s intended to do…
Function, then quality
Formative and summative
Descriptions of:
  Instruments
  Purposes
  Functions
An assessment functions formatively when evidence about student achievement elicited by the assessment is interpreted and used to make decisions about the next steps in instruction that are likely to be better, or better founded, than the decisions that would have been taken in the absence of that evidence.
Gresham’s law and assessment
Usually (incorrectly) stated as “Bad money drives out good”
“The essential condition for Gresham's Law to operate is that there must be two (or more) kinds of money which are of equivalent value for some purposes and of different value for others” (Mundell, 1998)
The parallel for assessment: Summative drives out formative
The most that summative assessment (more properly, assessment designed to serve a summative function) can do is keep out of the way
Function | Quality | Format | Scope
Validity
Traditional definition: a property of assessments
A test is valid to the extent that it assesses what it purports to assess
Key properties (content validity):
  Relevance
  Representativeness
“Trinitarian” doctrines of validity
Content validity
Criterion-related validity
  Concurrent validity
  Predictive validity
Construct validity
Validity
Validity is a property of inferences, not of assessments
“One validates, not a test, but an interpretation of data arising from a specified procedure” (Cronbach, 1971; emphasis in original)
The phrase “A valid test” is therefore a category error (like “A happy rock”)
  No such thing as a valid (or indeed invalid) assessment
  No such thing as a biased assessment
Reliability is a prerequisite for validity
  Talking about “reliability and validity” is like talking about “swallows and birds”
  Validity includes reliability
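One standard way to make the "prerequisite" claim concrete comes from classical test theory. The slides do not spell this out, so the notation here is an assumption introduced for illustration: X is the observed test score, Y is a criterion measure, and the primed subscripts denote each measure's reliability. Under that model, the observed test-criterion correlation is bounded by the reliabilities:

```latex
\lvert \rho_{XY} \rvert \;\le\; \sqrt{\rho_{XX'}\,\rho_{YY'}} \;\le\; \sqrt{\rho_{XX'}}
```

For example, a test with reliability 0.64 cannot correlate with any criterion above 0.8. This bound covers only criterion-related evidence, which is one strand of the broader validity argument made in the talk.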
Modern conceptions of validity
Validity subsumes all aspects of assessment quality
  Reliability
  Representativeness (content coverage)
  Relevance
  Predictiveness
“Validity is an integrative evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores or other modes of assessment” (Messick, 1989, p. 13)
Meanings and consequences
                      Result interpretation    Result use
Evidential basis      Content validity         Construct validity + utility
Consequential basis   Value implications       Social consequences
Adverse social consequences … are not in themselves indicative of invalidity (Messick, 1989, p. 89)
Right concern, wrong concept (Popham, 1997)
Threats to validity
Inadequate reliability
Construct-irrelevant variance
  Differences in scores are caused, in part, by differences not relevant to the construct of interest
  The assessment assesses things it shouldn’t
  The assessment is “too big”
Construct under-representation
  Differences in the construct are not reflected in scores
  The assessment doesn’t assess things it should
  The assessment is “too small”
With clear construct definition, all of these are technical (not value) issues
But they interact strongly…
Function | Quality | Format | Scope
Item formats
“No assessment technique has been rubbished quite like multiple choice, unless it be graphology” (Wood, 1991, p. 32)
Myths about multiple-choice items
  They are biased against females
  They assess only candidates’ ability to spot or guess
  They test only lower-order skills
Mathematics 2
What can you say about the means of the following two data sets?
Set 1: 10 12 13 15
Set 2: 10 12 13 15 0
A. The two sets have the same mean.
B. The two sets have different means.
C. It depends on whether you choose to count the zero.
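The arithmetic behind this item can be checked directly. The following is a minimal sketch using Python's standard library; the variable names are illustrative and not part of the talk. It shows that the zero in Set 2 is a data value like any other: the sum stays at 50 while the count rises from 4 to 5, so the mean falls.

```python
from statistics import mean

# The two data sets from the item above.
set_1 = [10, 12, 13, 15]
set_2 = [10, 12, 13, 15, 0]

# Same sum, different counts: the zero adds nothing to the total
# but still counts as one more observation.
mean_1 = mean(set_1)  # 50 / 4 = 12.5
mean_2 = mean(set_2)  # 50 / 5 = 10

means_differ = mean_1 != mean_2
print(mean_1, mean_2, means_differ)
```

The means differ, which is the reasoning behind option B. Options A and C are plausibly chosen by students who treat zero as "nothing" rather than as an observation, and that is what makes the item diagnostic.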
Mathematics 3
Which of the shapes below contains a dotted line that is also a diagonal?
Wilson & Draney, 2004
Science
The ball sitting on the table is not moving. It is not moving because:
A. no forces are pushing or pulling on the ball.
B. gravity is pulling down, but the table is in the way.
C. the table pushes up with the same force that gravity pulls down.
D. gravity is holding it onto the table.
E. there is a force inside the ball keeping it from rolling off the table.
OU S354: Understanding space & time
Below are five statements about the cosmic background radiation of our Universe. Select two options that are correct, according to the standard model of the Universe.
A. The microwave radiation collected on Earth is dominated by signals of cosmic origin
B. The total energy of the cosmic background radiation is currently much greater than that of matter
C. In a closed universe, the cosmic background radiation would eventually appear as visible light
D. The temperature of the cosmic background radiation was equal to that of the matter in the Universe until the appearance of galaxies
E. The number of photons in the cosmic background radiation has remained approximately constant since the era of decoupling
English
Where would be the best place to begin a new paragraph?
No rules are carved in stone dictating how long a paragraph should be. However, for argumentative essays, a good rule of thumb is that, if your paragraph is shorter than five or six good, substantial sentences, then you should reexamine it to make sure that you've developed the ideas fully. [A] Do not look at that rule of thumb, however, as hard and fast. It is simply a general guideline that may not fit some paragraphs. [B] A paragraph should be long enough to do justice to the main idea of the paragraph. Sometimes a paragraph may be short; sometimes it will be long. [C] On the other hand, if your paragraph runs on to a page or longer, you should probably reexamine its coherence to make sure that you are sticking to only one main topic. Perhaps you can find subtopics that merit their own paragraphs. [D] Think more about the unity, coherence, and development of a paragraph than the basic length. [E] If you are worried that a paragraph is too short, then it probably lacks sufficient development. If you are worried that a paragraph is too long, then you may have rambled on to topics other than the one stated in your topic sentence.
English 2
In a piece of persuasive writing, which of these would be the best thesis statement?
A. The typical TV show has 9 violent incidents
B. There is a lot of violence on TV
C. The amount of violence on TV should be reduced
D. Some programs are more violent than others
E. Violence is included in programs to boost ratings
F. Violence on TV is interesting
G. I don’t like the violence on TV
H. The essay I am going to write is about violence on TV
History
Why are historians concerned with bias when analyzing sources?
A. People can never be trusted to tell the truth
B. People deliberately leave out important details
C. People are only able to provide meaningful information if they experienced an event firsthand
D. People interpret the same event in different ways, according to their experience
E. People are unaware of the motivations for their actions
F. People get confused about sequences of events
Automated scoring technologies
[Figure: skill level assessed (low-order to high-order) plotted against evidence structure (unstructured to structured), locating multiple-choice items, c-rater, m-rater, e-rater, and simulations]
Function | Quality | Format | Scope
Continuous vs. one-off assessment
Continuous assessment
  Pros
    High validity (including reliability)
    Reduced stress (for some students)
  Cons
    Comparability of work done at different times
    Questions about the accumulation of learning over the programme
One-off assessment
  Pros
    Synoptic
    Comparability issues minimized
  Cons
    Limited validity (especially reliability)
    Stressful for some students (construct-irrelevant variance)
Reflections
The challenge
To design an assessment system that is:
  Distributed: so that evidence collection is not undertaken entirely at the end
  Synoptic: so that learning has to accumulate
  Extensive: so that all important aspects are covered (breadth and depth)
  Manageable: so that costs are proportionate to benefits
  Trusted: so that stakeholders have faith in the outcomes
The minimal take-aways…
No such thing as a summative assessment
No such thing as a reliable test
No such thing as a valid test
No such thing as a biased test
“Validity including reliability”