International Journal of Assessment Tools in Education
2019, Vol. 6, No. 2, 218–234
https://dx.doi.org/10.21449/ijate.564824
Published at http://www.ijate.net http://dergipark.org.tr/ijate Research Article
Investigating a new method for standardising essay marking using levels-
based mark schemes
Jackie Greatorex 1*, Tom Sutch 1, Magda Werno 1, Jess Bowyer 2, Karen Dunn 3
1 Cambridge Assessment, Triangle Building, Shaftesbury Road, Cambridge, UK, CB2 8EA
2 University of Exeter, St Luke's Campus, Heavitree Road, Exeter, UK, EX1 2LU
3 British Council, 10 Spring Gardens, London SW1A 2BN, UK
ARTICLE HISTORY
Received: 18 January 2019
Revised: 11 April 2019
Accepted: 23 April 2019
KEYWORDS
Comparative judgement,
Marking,
Standardisation,
Reliability,
Essay
Abstract: Standardisation is a procedure used by Awarding Organisations to
maximise marking reliability, by teaching examiners to consistently judge
scripts using a mark scheme. However, research shows that people are better
at comparing two objects than judging each object individually.
Consequently, Oxford, Cambridge and RSA (OCR, a UK awarding
organisation) proposed investigating a new procedure, involving ranking
essays, where essay quality is judged in comparison to other essays. This
study investigated the marking reliability yielded by traditional
standardisation and ranking standardisation. The study entailed a marking
experiment followed by examiners completing a questionnaire. In the control
condition, live procedures were emulated as authentically as possible within
the confines of a study. The experimental condition involved ranking the
quality of essays from the best to the worst and then assigning marks. After
each standardisation procedure the examiners marked 50 essays from an AS
History unit. All participants experienced both procedures, and marking
reliability was measured. Additionally, the participants’ questionnaire
responses were analysed to gain an insight into examiners’ experience. It is
concluded that the Ranking Procedure is unsuitable for use in public
examinations in its current form. The Traditional Procedure produced
statistically significantly more reliable marking, whilst the Ranking Procedure
involved a complex decision-making process. However, the Ranking
Procedure produced slightly more reliable marking at the extremities of the
mark range, where previous research has shown that marking tends to be less
reliable.
1. INTRODUCTION
General Certificate of Secondary Education (GCSE), Advanced Subsidiary (AS) and
Advanced Level (A Level) are school examinations taken in the UK. Given the high-stakes
nature of these examinations, it is essential that marking reliability is high and that
standardisation (examiner training to accomplish uniform use of the mark scheme) is effectual,
so that marks and grades are dependable. Generally, marking reliability is greater for short
answer questions than for questions requiring a long response which are marked with levels-
based mark schemes†. Consequently, effective procedures for standardising examiners’ essay
marking are crucial.
It may be feasible to improve the current approach to standardisation and maximise essay
marking reliability. The purpose of standardisation is to ensure that all examiners apply the
mark scheme fairly and consistently, and procedures vary between Awarding Organisations,
subjects and units. Traditionally, standardisation consists of practice marking followed by a meeting at which
examiners are trained to apply the mark scheme, typically by marking a number of scripts as a
group. Examiners then individually mark a sample of scripts at home, which are checked by
their Team Leader, or the Principal Examiner (PE: the lead marker in charge of the examination
or qualification) in smaller subjects. The Team Leader or PE may require further samples of
marking to be checked.
In addition to standardisation there are several procedures for maximising marking quality
including scaling (correcting consistently lenient or severe marking), and marker monitoring
(observing marking post-standardisation). The focus of this research is standardisation; accounting for these additional procedures is beyond its scope.
Laming (2004) concludes from extensive research that people are better at comparing two
objects than making absolute judgements about an object. Subsequently, a new approach to
standardisation, hereafter called the Ranking Procedure, was proposed. This procedure focuses
on comparing essays with one another and ranking them from the best to the worst.
The present research had two aims:
• to investigate whether the Ranking Procedure and the Traditional Procedure resulted in equivalent levels of marking reliability, or whether the reliability from one was demonstrably better;
• to evaluate whether examiners considered the Ranking Procedure to be useful, how they conducted their marking, and whether the procedure was efficient.
1.1. Literature Review
Extended response questions are widely regarded as the most difficult questions to mark and
are associated with the lowest levels of marking reliability (Black, Suto, & Bramley, 2011;
Suto, Nádas, & Bell, 2011b). Consequently, there has been much research investigating why
extended response questions have lower reliability, and how this can be improved.
One suggestion is that marking extended response questions entails a high cognitive load for
the examiner, which could in turn lead to reduced marking reliability. Suto and Greatorex
(2008) found that more complex cognitive strategies, such as evaluating and scrutinising, were
used significantly more in the marking of GCSE Business Studies, which uses a levels-based
mark scheme, than in GCSE Mathematics, which uses a more objective points-based mark
scheme. Senior examiners in the same study suggested that it might be useful to train examiners
in the use of these cognitive strategies. This may particularly benefit new examiners, as
research indicates that examiners with lower subject expertise and less marking and teaching
experience mark extended response questions less accurately than others (Suto, Nádas, & Bell,
2011a).
† Levels-based mark schemes (levels-of-response mark schemes) are generally used for marking extended written responses. Such mark schemes often divide the available marks into smaller mark bands, each mark band is associated with a level and a description of the type of answer that will obtain a mark from within a given mark band. The examiner classifies a candidate’s response into a level and then decides which mark from the associated mark band is most appropriate. For more detailed descriptions see Pinot de Moira (2013) and Greatorex and Bell (2008a).
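As a rough illustration of the mechanism described in this footnote, the sketch below encodes a hypothetical levels-based mark scheme; the number of levels, band boundaries and descriptors are invented, not those of the unit examined in this study.

```python
# A minimal sketch of a levels-based mark scheme; the bands, boundaries and
# descriptors are illustrative, not those of the OCR History unit studied here.
LEVELS = [
    # (level, lowest mark in band, highest mark in band, abbreviated descriptor)
    (5, 41, 50, "sustained, well-supported analytical argument"),
    (4, 31, 40, "mostly analytical with some supported judgement"),
    (3, 21, 30, "partial analysis with uneven support"),
    (2, 11, 20, "mainly descriptive with limited relevance"),
    (1, 0, 10, "fragmentary or largely irrelevant"),
]

def award_mark(level: int, position_in_band: float) -> int:
    """Return a mark once the examiner has classified the response into a level
    and judged where it sits within that band (0.0 = bottom, 1.0 = top)."""
    for lvl, lo, hi, _ in LEVELS:
        if lvl == level:
            return round(lo + position_in_band * (hi - lo))
    raise ValueError(f"unknown level: {level}")

print(award_mark(4, 0.5))  # a mid-band Level 4 response -> 36
```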
Attempts to increase marking reliability in extended response questions have tended to focus
on two key aspects of the marking process: standardisation (or examiner training), and mark
schemes. Both are particularly pertinent to the current study.
1.2. Improving Standardisation
The greatest recent change to standardisation procedures (also called examiner or rater training
in the literature) has been the transition from face-to-face to online standardisation. Some
research indicates that online standardisation may be very slightly more effective in increasing
marking accuracy, albeit in the context of English as a Second Language (ESL) assessment
(Knoch, Read, & von Randow, 2007; Wolfe, Matthews, & Vickers, 2010; Wolfe & McVay,
2010). Research into the use of online standardisation in UK high-stakes examinations
indicates that online standardisation is equally as effective as face-to-face standardisation
(Billington & Davenport, 2011; Chamberlain & Taylor, 2010).
There is evidence to suggest that face-to-face or online meetings are not particularly effective
for increasing marking accuracy on their own (Greatorex & Bell, 2008b; Raikes, Fidler, & Gill,
2009), and should be combined with additional feedback (Greatorex & Bell, 2008b; Johnson
& Black, 2012). The type of feedback received does not affect marking reliability, whether it
is iterative or immediate, personalised or prewritten, or targeted at improving accuracy or
internal consistency (Greatorex & Bell, 2008b; Sykes et al., 2009). However, Johnson and
Black (2012) found that examiners find feedback most helpful when it is immediate, refers to
the mark scheme and focuses on specific problems with scripts.
Standardisation is of greatest benefit for new or less experienced examiners, whilst having little
or no effect on the marking accuracy of experienced examiners (Meadows & Billington, 2007;
Meadows & Billington, 2010; Raikes et al., 2009; Suto, Greatorex, & Nádas, 2009). Additional
background factors such as subject knowledge and expertise also affect marking reliability
(Suto & Nádas, 2008). Despite this, with adequate training, some examiners with no or little
teaching and marking experience can become as reliable as the most experienced examiners,
although these individuals may be difficult to identify before the standardisation process begins
(Meadows & Billington, 2010; Suto & Nádas, 2008). It is noteworthy that research by
Meadows and Billington (2010) and Suto and Nádas (2008) related to questions requiring short
or medium length responses rather than essays, and therefore the findings may not generalise
to examinations marked with levels-based mark schemes.
1.3. Mark Schemes
Other studies have investigated whether changes to levels-based mark schemes could
improve marking reliability. A key factor is that levels-based mark schemes are often lengthy
and contain a lot of information. Whilst more constrained mark schemes are associated with
higher levels of reliability (Bramley, 2009; Pinot de Moira, 2013; Suto & Nádas, 2009), they
achieve this by restricting the number of creditable responses and thus would compromise the validity
of extended-response assessment (Ahmed & Pollitt, 2011; O'Donovan, 2005; Pinot de Moira,
2011a; Pinot de Moira, 2011b).
An alternative is to use a holistic, rather than analytic, mark scheme. In a holistic mark scheme,
one overall mark is given to a response. The mark scheme may specify different
elements of performance, but the examiner attaches their own weighting to each feature. In
analytic levels-based mark schemes, the examiner awards separate marks for individual
elements of a response. There is no clear consensus as to which is more valid and reliable.
The evidence suggests that inter-rater reliability is higher in holistic scoring (Çetin, 2011;
Harsch & Martin, 2013; Lai, Wolfe, & Vickers, 2012). However, analytic scoring is
particularly helpful in diagnostic English as a Second Language (ESL) assessment as features
of a student’s writing can be assessed individually and that information can then be fed back to the
candidate to guide future learning (Barkaoui, 2011; Knoch, 2007; Lai et al., 2012; Michieka,
2010). Holistic mark schemes, on the other hand, can obscure differences in individual traits
of a student’s response, as well as how examiners weigh and apply different assessment criteria
(Harsch & Martin, 2013).
A style of holistic marking that has particular relevance for the current study is Comparative
Judgement (CJ). This method entails deciding which is the better of two scripts, thus making
holistic but also relative judgements about script quality. Examiners make a series of these
judgements, until each script has been judged a number of times. A rank order of all scripts is
then statistically compiled, usually by fitting a Bradley-Terry model to the paired comparison
data.
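To make the compilation step concrete, the sketch below fits a simple, non-adaptive Bradley-Terry model by maximum likelihood to a handful of invented paired-comparison outcomes; the script labels and results are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical paired-comparison outcomes, recorded as (winner, loser).
comparisons = [
    ("A", "B"), ("B", "A"), ("A", "C"), ("B", "C"), ("C", "B"),
    ("B", "D"), ("C", "D"), ("D", "C"), ("A", "D"),
]
scripts = sorted({s for pair in comparisons for s in pair})
idx = {s: i for i, s in enumerate(scripts)}

def neg_log_likelihood(theta):
    # Bradley-Terry: P(i beats j) = exp(theta_i) / (exp(theta_i) + exp(theta_j)).
    ll = 0.0
    for winner, loser in comparisons:
        d = theta[idx[winner]] - theta[idx[loser]]
        ll += d - np.log1p(np.exp(d))  # log-probability of the observed outcome
    return -ll

result = minimize(neg_log_likelihood, x0=np.zeros(len(scripts)), method="BFGS")
strengths = result.x - result.x.mean()  # centre: the scale is only identified up to a constant
rank_order = sorted(scripts, key=lambda s: -strengths[idx[s]])
print(rank_order)  # best to worst, e.g. ['A', 'B', 'C', 'D']
```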
If the pairs are presented online and the data can be analysed in ‘real time’ it is possible to
make the presentation of pairs ‘adaptive’. This means that as more judgements are made,
examiners are given scripts that appear to be closer together in quality, in order to make more
nuanced distinctions between scripts and reduce the overall number of comparisons that need
to be made. This process is known as ‘adaptive comparative judgement’ (ACJ; Pollitt, 2012a, 2012b).
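One possible pairing rule, shown purely as an illustration (the cited ACJ studies use their own selection algorithms), is sketched below: present the as-yet-uncompared pair whose current strength estimates are closest.

```python
from itertools import combinations

# Illustrative adaptive pairing rule, not the exact algorithm of the cited ACJ studies:
# after each round of judging, pair the scripts whose current strength estimates are
# closest and that have not yet been compared.
def next_pair(strengths: dict, already_compared: set):
    """strengths maps script id -> current Bradley-Terry estimate."""
    candidates = [
        (abs(strengths[a] - strengths[b]), a, b)
        for a, b in combinations(sorted(strengths), 2)
        if frozenset((a, b)) not in already_compared
    ]
    if not candidates:
        return None  # every pair has already been judged
    _, a, b = min(candidates)
    return a, b

print(next_pair({"A": 1.2, "B": 0.9, "C": -0.4}, {frozenset(("A", "B"))}))  # ('B', 'C')
```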
Whilst CJ is most often used for comparability studies, it is argued that it is more valid and
reliable than traditional marking, as examiners are simply making overall judgements about
script quality (Kimbell, 2007; Kimbell, Wheeler, Miller, & Pollitt, 2007; Pollitt, 2009, 2012a,
2012b; Pollitt, Elliott, & Ahmed, 2004). A project that used ACJ to assess Design and
Technology portfolios found a reliability coefficient of 0.93 (Kimbell, 2007), whilst
Whitehouse and Pollitt (2012) found a reliability coefficient of 0.97 when using ACJ in AS
level Geography papers. However, adaptivity can inflate the reliability coefficient, so the high
reliability found in these studies is disputed (Bramley, 2015; Bramley & Vitello, 2018).
Moreover, there is empirical evidence that the strength of CJ lies in multiple judgements and a
strong statistical model, rather than comparing one script directly with another (Benton &
Gallagher, 2018). Also the process is very time-consuming (Whitehouse & Pollitt, 2012).
Consequently, there are serious doubts as to whether it is practicable in large-scale, high-stakes
assessment.
1.4. Research Questions
The experiment tested the following hypotheses:
H0: The Traditional and Ranking Standardisation result in equivalent levels of marking
reliability.
H1: The Traditional or the Ranking Standardisation results in more reliable marking.
Examiners’ perspectives were collected to answer the following questions:
• How did the examiners undertake the Ranking Procedure?
• Was the Ranking Procedure (in)efficient and (un)suitable for upscaling or digitising?
2. METHOD
The project utilised candidates’ scripts from an OCR AS level History examination. This
examination was chosen for several reasons: firstly, the entry was large enough to select a wide
range of scripts. Secondly, the questions required essay responses and were marked using a
levels-based mark scheme.
The Principal Examiner from live examining acted as the Principal Examiner for this study.
Ten Assistant Examiners (examiners) participated in the study. They had not marked this paper
in live examining, but they had either been eligible to mark it or had marked a similar
examination (e.g. another A level History paper).
2.1. Design
The experiment had two conditions.
• The control condition was a simulation of a traditional standardisation process, as used by some current awarding organisations. Each examiner marked a Provisional Sample. They attended a standardisation meeting where their marking was standardised using the Traditional Mark Scheme. After the meeting, the examiners marked the Standardisation Sample and received feedback on their marking from the PE. The PE decided whether each examiner could proceed to the next stage of the experiment, re-mark the Standardisation Sample or mark a further Standardisation Sample. Finally, the examiners marked the Allocation.
• The experimental condition, called Ranking Standardisation, broadly followed the same process as the control condition. Each examiner ranked a Provisional Sample. They attended a standardisation meeting at which they learnt how to rank responses (with no marks) and then learnt how to mark the ranked essays. After the meeting each examiner rank ordered the Standardisation Sample from the best to the worst response. They received feedback on their ranking from the PE. The PE decided whether each examiner could proceed to the next stage of the experiment or re-rank the Standardisation Sample. Subsequently, each examiner marked the Standardisation Sample. The PE decided whether each examiner could proceed to the next stage of the experiment, re-mark the Standardisation Sample or rank and mark a further Standardisation Sample. Finally, the examiners marked the Allocation.
The design was within subjects. Marking reliability was measured at the end of the experiment.
Counterbalancing was achieved by:
• allocating examiners to groups based on which date they were available to attend a meeting‡
• conducting the conditions in the order determined by a 2×2 Latin Square:
  - group 1: control condition then experimental condition
  - group 2: experimental condition then control condition.
This was to guard against the order of conditions affecting the results.
2.2. Materials
2.2.1. Scripts
All the essays involved were responses to the two most popular questions for that examination
paper (referred to here as questions A and B). Each had a maximum of 50 marks available.
Scripts were anonymised for use by examiners.
The experiment involved four samples of essays: The Provisional, Meeting and Standardisation
samples, as well as the examiners’ final marking allocation. This was intended to mirror the
real standardisation process. Participants marked the Provisional Sample before the
standardisation meeting; the Meeting Sample was used in the standardisation meeting to teach
participants about applying the mark scheme correctly; and the Standardisation Sample was
marked after the meeting to ensure participants were applying the mark scheme correctly and
to gain a measure of inter-rater reliability. After the standardisation process was completed,
participants were asked to mark an allocation of 50 essays. The essays covered a range of
quality of performance.
‡ It was assumed that availability would be as random as any other way of allocating examiners to groups.
2.2.2. Mark Schemes
Traditional Mark Scheme
The Traditional Mark Scheme was the live mark scheme for question A or B. The live mark
scheme included level descriptors and a description of content for each question.
Ranking Mark Scheme
In the Ranking Mark Scheme the level descriptors from the live mark scheme were replaced
with a brief description of the characteristics of quality of performance. The Ranking Mark
Scheme was written by the PE and reviewed by OCR. After the standardisation meeting, an
indication of the marks given to each Meeting Essay was added. For an example of the Ranking
Mark Scheme see Figure 1.
2.2.3. Examiner Questionnaire
A questionnaire was developed that included open and closed questions. The questionnaire
focused on the usefulness of aspects of standardisation, how the Ranking Procedure might be
upscaled or conducted on-screen, and related merits and limitations.
2.3. Controls
Control mechanisms were in place. Firstly, examiners were allocated to groups based on
availability. Secondly, group 1 completed each stage of the control condition using the
Traditional Mark Scheme on question A before completing the parallel stage of the
experimental condition using the Ranking Mark Scheme for question B. Group 2 completed
each stage of the experimental condition with the Ranking Mark Scheme on question A before
completing the parallel stage of the control condition with the Traditional Mark Scheme for
question B. Thirdly, all examiners marked the same essays at each stage of the experiment.
Fourthly, none of the examiners marked the question paper in live marking, as such participants
would have violated the crossover design by experiencing the control condition before the
experimental condition.
These controls and the within subjects design enabled a direct comparison between the
reliability of marking generated by the two experimental conditions.
Figure 1. Ranking Mark Scheme Question A
2.5. Analysis
2.5.1. Quantitative
The within-subjects aspect of the design was statistically efficient because each examiner
served as their own control, so examiner-level variation was isolated.
The experiment was more complex than a standard crossover design because the measurement
taken for each examiner under each condition itself had a structure: it was an average over 50 essays.
This was exploited in the analysis by modelling candidate-level effects.
The difference from the definitive mark§ for each marked essay was computed. The appropriate
definitive mark was used depending on whether the essay was marked with the Traditional or
Ranking Mark Scheme.
Different mark schemes were used for the two procedures, therefore the definitive marks for a
given response could vary depending on procedure. In turn, this could mean that the effective
mark range may vary and thus differences between examiners could be affected. To take an
extreme example, if a mark scheme strongly fostered marking towards the centre of the mark
range, the examiners would comply and the differences between examiners would be small,
giving a false impression of high reliability. In order to address this, standardised differences
were computed, by dividing the difference between the examiner and PE marks (definitive
marks) by the standard deviation of the definitive mark for the given question under the
appropriate procedure.
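A minimal sketch of this standardisation of differences, using invented marks and hypothetical column names, is given below.

```python
import pandas as pd

# Invented marks for one examiner on three essays under each procedure; the column
# names (examiner_mark, definitive_mark, ...) are ours, not from the study.
marks = pd.DataFrame({
    "question":        ["A"] * 6,
    "procedure":       ["Traditional"] * 3 + ["Ranking"] * 3,
    "examiner_mark":   [22, 34, 41, 15, 30, 47],
    "definitive_mark": [20, 32, 44, 18, 29, 45],
})

marks["difference"] = marks["examiner_mark"] - marks["definitive_mark"]
marks["abs_difference"] = marks["difference"].abs()

# Divide by the standard deviation of the definitive marks for the same question
# under the same procedure, so that a compressed mark range under one procedure
# cannot masquerade as high reliability.
sd = marks.groupby(["question", "procedure"])["definitive_mark"].transform("std")
marks["std_difference"] = marks["difference"] / sd
marks["std_abs_difference"] = marks["abs_difference"] / sd
print(marks[["procedure", "difference", "std_difference"]])
```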
Analyses of variance were calculated using a variety of dependent variables. A standard analysis
for a 2×2 crossover design, as described by Senn (2002) for example, would be possible at the
examiner level (using the mean difference under each procedure), but this would not exploit the
multiple measurements for each examiner and the variation at a candidate and response level.
As a result, a more complex model was applied:
$$Y_{ijk} = \mu + m_{(i)k} + q_j + \tau_{d[i,j]} + c_l + cr_{jl} + e_{ijkl}$$

where the terms are as follows (notation based on Jones and Kenward, 1989):

$Y_{ijk}$: random variable representing the marking difference (with observed values $y_{ijk}$), either the actual or the absolute difference as appropriate
$\mu$: general mean
$m_{(i)k}$: the effect of examiner $k$ in group $i$
$q_j$: the effect of question (and also period) $j$
$\tau_{d[i,j]}$: the direct effect of the treatment (procedure) used in period $j$ for group $i$
$c_l$: the effect of candidate $l$
$cr_{jl}$: the effect of the response by candidate $l$ to question $j$
$e_{ijkl}$: a random error for candidate $l$, examiner $k$, period/question $j$ and group $i$, assumed to be independently and identically normally distributed with mean 0 and variance $\sigma^2$.
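The study does not report the software or estimation method used. As a hedged sketch, the code below fits a fixed-effects analysis of variance of the same general form to invented data with the crossover structure described above; all names, group sizes and values are ours, and treating every term as a fixed effect is a simplification of the model.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Toy data with the crossover structure described above; every name and number is
# invented. Two groups of two examiners, two questions (periods), and four candidates
# who each answered both questions, so candidate and response terms can be separated.
rng = np.random.default_rng(0)
rows = []
for group, examiners in {"1": ["E1", "E2"], "2": ["E3", "E4"]}.items():
    for examiner in examiners:
        for question in ["A", "B"]:
            # Group 1: Traditional on question A then Ranking on B; group 2 the reverse.
            traditional = (group == "1") == (question == "A")
            for candidate in ["c1", "c2", "c3", "c4"]:
                rows.append({
                    "examiner": examiner,
                    "question": question,
                    "procedure": "Traditional" if traditional else "Ranking",
                    "candidate": candidate,
                    # Ranking made slightly more severe, matching the direction reported.
                    "std_difference": rng.normal(-0.2 if traditional else -0.3, 0.3),
                })
data = pd.DataFrame(rows)

# Fixed-effects analysis of variance approximating the model above: examiner,
# question/period, procedure, candidate and candidate-by-question (response) effects.
fit = smf.ols(
    "std_difference ~ C(examiner) + C(question) + C(procedure)"
    " + C(candidate) + C(candidate):C(question)",
    data=data,
).fit()
print(fit.params["C(procedure)[T.Traditional]"])  # estimated direct procedure effect
```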
Note that no carry-over effect (denoted by Jones and Kenward (1989) as λ, and capturing any effect of the method used for the first period on the results in the second period) was included
in the model. We followed the advice of Senn (2002) in not testing for a carry-over effect and
not carrying out a two-stage analysis, as was once common**.
§ There are several legitimate ways to calculate the definitive mark. For the purposes of this study, the Principal Examiner’s marks were used as the definitive marks.
** A two-stage analysis first tests for the presence of a carry-over effect; if such an effect is found, only data from the first period are used to test for the treatment (in our case, procedure) effect. As Senn (2002) explains, this approach is flawed because the test based on the first period only is highly correlated with the pre-test for carry-over, and is thus heavily biased.
In our experiment it was not possible to separate any effect of period (that is, whether
examiners’ reliability changed between the first and second sets of responses marked) from the
question marked (any effect on reliability due to the essay question, or the History topic)
because all examiners in both groups were standardised on and marked question A first,
followed by question B.
2.5.2. Qualitative
A thematic analysis of the qualitative responses to the questionnaire was guided by advice from
Braun and Clarke (2006). There were four themes in the data; however, our focus here is on two:
• Decision Process for the Ranking Procedure (including an Initial Sorting stage)
• Ranking is Time-consuming
Regarding the Decision Process, a diagram was drawn to represent the data. A second researcher
checked the diagram against their reading of the data.
3. FINDINGS
3.1. Quantitative
In the six analysis of variance models, most effects were strongly statistically significant,
reflecting the size of the sample. For brevity, these figures are not included.
The focus of the research was the procedure effect, which was significant at the 5% level for
all analyses except standardised absolute difference from the average examiner mark (Table 1).
The unstandardised differences are more easily interpreted, as they are expressed in raw marks,
while the standardised differences are expressed as numbers of standard deviations (of the
definitive mark distribution). Using the measure of actual difference, the examiners were more
severe (by 0.9 of a mark), and further away from the definitive mark, under the Ranking
Procedure. When absolute difference from the definitive mark is considered, there was greater
marking error (by 0.5 of a mark on average) using the Ranking Procedure than the Traditional
Procedure; the difference was somewhat smaller (0.3 marks) when the average examiner mark
was used as the comparator.
Standardising the differences had little influence on the direction of the results, but it did
reduce the significance of the procedure effect when considering the absolute difference.
When focusing on the absolute difference with respect to the average examiner mark, the
effect became statistically non-significant (at the 5% level) when standardised differences
were used.
In short, the results supported the hypothesis (H1: the Traditional or the Ranking Standardisation
results in more reliable marking), and the Traditional Standardisation procedure yielded greater
marking reliability.
Figure 2 and Figure 3 show the mean error for each of the responses to question A and B (the
results for each procedure originated from a different group of examiners). The x-axis shows
the responses arranged by definitive mark, and the 10 Meeting Essays from the experimental
condition are shown as vertical lines††. The three panels for each question show the same results
with different y-axes:
• actual difference from definitive mark
• absolute difference from definitive mark
• absolute difference from average examiner mark‡‡.
†† Note that these vertical lines are not necessarily the definitive marks for the control condition, but they are retained to enable comparison between the two halves of each graph.
‡‡ Average actual difference from average examiner mark is not shown, as it is zero for each response.
Table 1. Effect sizes for procedure, and estimates of means under each procedure

Response                                                    Traditional   Ranking    Estimate   Standard error   t value   Pr > |t|
Actual difference                                           -1.628        -2.518     -0.890     0.247            -3.60     0.0003
Absolute difference                                          3.780         4.326      0.546     0.178             3.07     0.0022
Actual difference (standardised)                            -0.2449       -0.3622    -0.1173    0.0363           -3.23     0.0013
Absolute difference (standardised)                           0.5688        0.6237     0.0550    0.0261            2.11     0.0353
Absolute difference (average examiner mark)                  2.68          3.01       0.332     0.135             2.47     0.0138
Absolute difference (average examiner mark, standardised)    0.4084        0.4179     0.00949   0.01949           0.49     0.6266
There were few discernible trends. The proximity between the marks for Meeting Essays and
the definitive mark of the target essay had no clear effect on the marking reliability of question
A or question B. For actual difference in question A, the negative gradient suggests a slight
tendency for examiners to be harsher for higher marks (that is, they mark closer to the middle
of the mark scale than the PE) in both conditions.
For the bottom panel (absolute difference from average examiner mark) there were a few
indications that the Ranking Procedure yielded greater marking reliability at the extremes of
the mark range. For question A, there was more consensus among examiners using the Ranking
than the Traditional Procedure at the upper end of the mark range. For question B a similar
effect was observed at the lower end of the mark range, although the responses in the allocation
had definitive marks lower than the lowest Meeting Essay, G.
Figure 2. Mean error for each response along with definitive mark and Meeting Essays: question A
Figure 3. Mean error for each response along with definitive mark and Meeting Essays: question B
3.2. Qualitative
The Decision Process for the Ranking Procedure was a complex, multi-stage process comprising
several paired comparisons of essays (Figure 4). The core Decision Process was preceded in
some instances by the Initial Sorting of the essays.
Nine examiners said that the Ranking Procedure was more time-consuming than the Traditional
procedure. Reasons cited by examiners included:
• the difficulty of judging how essays compared to one another
• re-reading essays
• dealing with the accumulation of essays available to compare with the target essay
• lack of familiarity.
Figure 4. Decision Process for the Ranking Procedure. [Flow chart: an optional initial sorting of the essays into groups; selecting a target essay; estimating its quality relative to previously judged essays; selecting a previously judged (Meeting) essay of similar quality and making successive paired comparisons with previously judged essays of higher or lower quality until the target essay is placed in the ranking from best to worst; awarding provisional marks; repeating until all essays are ranked; then awarding final marks. At any point examiners may remember, read and re-read essays, make and re-read notes on essays, refer to the mark scheme, or draw on their expertise.]
4. DISCUSSION
Prior studies illustrate that people’s ability to compare two objects surpasses their ability to make
absolute judgements about an object (Laming, 2004). Therefore, OCR proposed investigating
a new procedure, the Ranking Procedure, which focuses on comparing the quality of essays and
ranking them according to their quality before assigning marks. This research evaluated the
marking reliability resulting from both Traditional and Ranking Standardisation, and
examiners’ experiences of the procedures. The research has limitations, which are outlined
below. However, the research generated important findings.
The experiment was designed to simulate standardisation processes. It was beyond the scope
of the research to incorporate the many checks and balances that are used to achieve reliable
marking in addition to standardisation, for instance scaling. Therefore, the marks and statistics
from the experiment were not directly comparable to marking reliability in live marking.
However, the experimental data were suitable for testing the hypotheses.
The examiners were likely to be more familiar with the Traditional Procedure than the Ranking
Procedure, which may have resulted in the former yielding more reliable marking. Arguably, examiners
new to both procedures should have been recruited to ensure the experiment was a fair
evaluation. However, if an Awarding Organisation were to switch from the Traditional
Procedure to another procedure then many examiners would, in the short term, be more familiar
with the Traditional Procedure. Consequently, the experiment is an authentic comparison of
reliability delivered from the Traditional Procedure and a potential new procedure.
There was insufficient time between operational activities for examiners to complete one
condition after another. Therefore, a departure from a standard crossover design was adopted. Group 1
undertook each stage of the experiment using the Traditional Mark Scheme on question A
before completing the parallel stage of the experiment using the Ranking Mark Scheme for
question B. Group 2 undertook each stage of the experiment using the Ranking Mark Scheme
on question A before completing the parallel stage of the experiment using the Traditional Mark
Scheme for question B. Each examiner marked the same essays as the other examiners at each
stage of the experiment. The interweaving of stages may have had a confounding effect on each
condition. Ideally there would be a ‘wash-out’ period between the two conditions, to allow any
effects from the first half of the experiment to dissipate before commencing the second. The
lack of a wash-out period was a major practical constraint on the design. It was hoped that the
effect of the conditions would be large enough to overpower any confounding variables,
particularly as the interweaving of stages was common to the groups (with the two procedures
reversed).
Several findings emphasised the limitations of the Ranking Procedure. First, the Traditional
Procedure produced greater marking reliability than the Ranking Procedure. The Traditional
Standardisation resulted in smaller mean differences for all measures. Second, the procedure
effect was statistically significant at the 5% level for all measures of reliability, with the
exception of standardised absolute difference from the average examiner mark. This concurred
with previous research. When alternatives to Traditional Standardisation were investigated they
did not consistently lead to better marking reliability than Traditional Standardisation
(Greatorex & Bell, 2008b). Additionally, the mark scheme and feedback to examiners improve
marking reliability, but exemplar scripts do not improve marking reliability in terms of absolute
difference between the PE’s and examiners’ marking (Baird, Greatorex, & Bell, 2004). Finally,
the high reliability of using paired comparisons in marking (ACJ) has been disputed (Bramley,
2015; Bramley & Vitello, 2018). Based on the reliability measures Ranking Standardisation
was not as effective as Traditional Standardisation.
There were additional limitations of the Ranking Procedure. First, it was time-consuming, due
to the need to re-read essays. Both the use of rankings of more than two objects and the introduction of
adaptivity have the potential to reduce the time taken. Whitehouse and Pollitt (2012) maintained
that ACJ with pairs was not viable as a form of summative assessment for large scale public
examinations in England in its current form as it was too time consuming. This study suggested
that the Ranking Procedure is also likely to be too time-consuming for these purposes. Secondly,
for examiners the Decision Process for the Ranking Procedure is complex, indeed more
complex than the decision process in CJ with pairs. Together the thematic analysis and the
literature suggested that in its current form the Ranking Procedure was too complex and too
time-consuming to be used for summative assessments for large scale public examinations in
England.
However, the Ranking Procedure had merits which suggest it is worth further consideration.
The Ranking Procedure overcame one drawback of levels-based mark schemes: that marking
can be less reliable at the extremities of the mark range. Ideally marking is reliable throughout
the mark range. However, prior research showed that marking reliability was better towards the
centre of the mark range and not as good towards the top and bottom of the mark scale (Pinot
de Moira, 2013), and a remedy is sought. Our qualitative data included comments from
examiners that the Ranking Procedure gave greater discrimination and a lower prospect of
inaccurate marking. The qualitative findings aligned with the quantitative evidence that at the
extremities of the mark scale the Ranking Procedure performed better than the Traditional
Procedure. In question A, there was higher reliability among examiners using the Ranking
Procedure than the Traditional Procedure at the upper end of the mark range, and in question B
a similar effect was noted at the lower end of the mark range. There is no clear cause for this
result. It is possible that the holistic marking of the Ranking Procedure outperformed the
analytical marking of the Traditional procedure in supporting reliability at the extremes of the
mark range. Further consideration may be given to how (features of) the Ranking Procedure
can be applied to mark essays at the extremes of the mark range.
5. CONCLUSION
This study investigated the merits and limitations of the Ranking Procedure compared with the
Traditional Procedure. The limitations of the Ranking Procedure were that it yielded less
reliable marking than the Traditional Procedure, was time-consuming and entailed an inefficient
Decision Process. However, the Ranking Procedure had several merits. Examiners noted that
the Ranking Procedure gave greater discrimination and a lower prospect of inaccurate marking.
The quantitative results indicate that the Ranking Procedure produced reliable marking at the
extremities of the mark range, whereas traditional levels-based mark schemes tend to generate
more reliable marking in the middle of the mark range and less reliable marking at the
extremities. The findings suggest that the Ranking Procedure is unsuitable for implementation
in public examinations in its current form. However, it may be advantageous to explore
techniques for refining the Ranking Procedure so that its merits may be realised, for example
in awarding or research studies.
ORCID
Jackie Greatorex https://orcid.org/0000-0002-2303-0638
Tom Sutch https://orcid.org/0000-0001-8157-277X
Karen Dunn https://orcid.org/0000-0002-7499-9895
Conflicts of Interest
The authors declared no conflict of interest.
Acknowledgements
We would like to thank colleagues in OCR and the Research Division at Cambridge Assessment
for their help and advice with the study, particularly Beth Black who had the original ideas for
the Ranking Procedure. We would also like to thank the examiners who engaged with the
research, especially the Principal Examiner who played a pivotal role in the project.
6. REFERENCES
Ahmed, A., & Pollitt, A. (2011). Improving marking quality through a taxonomy of mark
schemes. Assessment in Education: Principles, Policy & Practice, 18(3), 259-278. doi:
http://dx.doi.org/10.1080/0969594X.2010.546775
Baird, J.-A., Greatorex, J., & Bell, J. F. (2004). What makes marking reliable? Experiments
with UK examinations. Assessment in Education: Principles, Policy & Practice, 11(3),
331-348.
Barkaoui, K. (2011). Effects of marking method and rater experience on ESL essay scores and
rater performance. Assessment in Education: Principles, Policy & Practice, 18(3), 279-
293.
Benton, T., & Gallagher, T. (2018). Is comparative judgement just a quick form of multiple
marking? Research Matters: A Cambridge Assessment Publication (26), 22-28.
Billington, L., & Davenport, C. (2011). On line standardisation trial, Winter 2008: Evaluation
of examiner performance and examiner satisfaction. Manchester: AQA Centre for
Education Research Policy.
Black, B., Suto, W. M. I., & Bramley, T. (2011). The interrelations of features of questions,
mark schemes and examinee responses and their impact upon marker agreement.
Assessment in Education: Principles, Policy & Practice, 18(3), 295-318.
Bramley, T. (2009). Mark scheme features associated with different levels of marker agreement.
Research Matters: A Cambridge Assessment Publication (8), 16-23.
Bramley, T. (2015). Investigating the reliability of Adaptive Comparative Judgment (Cambridge
Assessment Research Report). Cambridge, UK: Cambridge Assessment.
Bramley, T., & Vitello, S. (2018). The effect of adaptivity on the reliability coefficient in
adaptive comparative judgement. Assessment in Education: Principles, Policy &
Practice, 1-16. doi: 10.1080/0969594X.2017.1418734
Braun, V., & Clarke, V. (2006). Using thematic analysis in psychology. Qualitative Research
in Psychology, 3, 77-101.
Çetin, Y. (2011). Reliability of raters for writing assessment: analytic-holistic, analytic-analytic,
holistic-holistic. Mustafa Kemal University Journal of Social Sciences Institute, 8(16), 471-486.
Chamberlain, S., & Taylor, R. (2010). Online or face to face? An experimental study of
examiner training. British Journal of Educational Technology, 42(4), 665-675.
Greatorex, J., & Bell, J. F. (2008a). What makes AS marking reliable? An experiment with
some stages from the standardisation process. Research Papers in Education, 23(3), 333-
355.
Greatorex, J., & Bell, J. F. (2008b). What makes AS marking reliable? An experiment with
some stages from the standardisation process. Research Papers in Education, 23(3), 333-
355.
Harsch, C., & Martin, G. (2013). Comparing holistic and analytic scoring methods: issues of
validity and reliability. Assessment in Education: Principles, Policy & Practice, 20(3),
281-307.
Johnson, M., & Black, B. (2012). Feedback as scaffolding: senior examiner monitoring
processes and their effects on examiner marking. Research in Post-Compulsory
Education, 17(4), 391-407.
Jones, B., & Kenward, M. G. (1989). Design and Analysis of Cross-Over Trials. London:
Chapman and Hall.
Kimbell, R. (2007). e-assessment in project e-scape. Design and Technology Education: an
International Journal, 12(2), 66-76.
Kimbell, R., Wheeler, T., Miller, S., & Pollitt, A. (2007). E-scape portfolio assessment. Phase
2 report. London: Department for Education and Skills.
Knoch, U. (2007). ‘Little coherence, considerable strain for reader’: A comparison between
two rating scales for the assessment of coherence. Assessing Writing, 12(2), 108-128. doi:
10.1016/j.asw.2007.07.002
Knoch, U., Read, J., & von Randow, J. (2007). Re-training writing raters online: How does it
compare with face-to-face training? Assessing Writing, 12, 26-43. doi:
10.1016/j.asw.2007.04.001
Lai, E. R., Wolfe, E. W., & Vickers, D. H. (2012). Halo Effects and Analytic Scoring: A
Summary of Two Empirical Studies (Research Report). New York: Pearson Research and
Innovation Network.
Laming, D. (2004). Human judgment: the eye of the beholder. Hong Kong: Thomson Learning.
Meadows, M., & Billington, L. (2007). NAA Enhancing the Quality of Marking Project: Final
Report for Research on Marker Selection. Manchester: National Assessment Agency.
Meadows, M., & Billington, L. (2010). The effect of marker background and training on the
quality of marking in GCSE English. Manchester: Centre for Education Research and
Policy.
Michieka, M. (2010). Holistic or Analytic Scoring? Issues in Grading ESL Writing. TNTESOL
Journal.
O'Donovan, N. (2005). There are no wrong answers: an investigation into the assessment of
candidates' responses to essay-based examinations. Oxford Review of Education, 31, 395-
422.
Pinot de Moira, A. (2011a). Effective discrimination in mark schemes. Manchester: AQA.
Pinot de Moira, A. (2011b). Levels-based mark schemes and marking bias. Manchester: AQA.
Pinot de Moira, A. (2013). Features of a levels-based mark scheme and their effect on marking
reliability. Manchester: AQA.
Pollitt, A. (2009). Abolishing marksism and rescuing validity. Paper presented at the
International Association for Educational Assessment, Brisbane, Australia.
http://www.iaea.info/documents/paper_4d527d4e.pdf
Pollitt, A. (2012a). Comparative judgement for assessment. International Journal of
Technology and Design Education, 22(2), 157-170.
Pollitt, A. (2012b). The method of adaptive comparative judgement. Assessment in Education:
Principles, Policy & Practice, 19(3), 281-300. doi: http://dx.doi.org/10.1080/0969594X.2012.665354
Pollitt, A., Elliott, G., & Ahmed, A. (2004). Let's stop marking exams. Paper presented at the
International Association for Educational Assessment, Philadelphia, USA.
Raikes, N., Fidler, J., & Gill, T. (2009). Must examiners meet in order to standardise their
marking? An experiment with new and experienced examiners of GCE AS Psychology.
Paper presented at the British Educational Research Association, University of
Manchester, UK.
Senn, S. (2002). Cross-Over Trials in Clinical Research. Chichester: Wiley.
Suto, I., Nádas, R., & Bell, J. (2011a). Who should mark what? A study of factors affecting
marking accuracy in a biology examination. Research Papers in Education, 26(1), 21-51.
Suto, W. M. I., & Greatorex, J. (2008). A quantitative analysis of cognitive strategy usage in
the marking of two GCSE examinations. Assessment in Education: Principles, Policy &
Practice, 15(1), 73-89.
Suto, W. M. I., Greatorex, J., & Nádas, R. (2009). Thinking about making the right mark: Using
cognitive strategy research to explore examiner training. Research Matters: A Cambridge
Assessment Publication(8), 23-32.
Suto, W. M. I., & Nádas, R. (2008). What determines GCSE marking accuracy? An exploration
of expertise among maths and physics markers. Research Papers in Education, 23(4),
477-497. doi: 10.1080/02671520701755499
Suto, W. M. I., & Nádas, R. (2009). Why are some GCSE examination questions harder to mark
accurately than others? Using Kelly's Repertory Grid technique to identify relevant
question features. Research Papers in Education, 24(3), 335-377. doi:
http://dx.doi.org/10.1080/02671520801945925
Suto, W. M. I., Nádas, R., & Bell, J. (2011b). Who should mark what? A study of factors
affecting marking accuracy in a biology examination. Research Papers in Education,
26(1), 21-51.
Sykes, E., Novakovic, N., Greatorex, J., Bell, J., Nádas, R., & Gill, T. (2009). How effective is
fast and automated feedback to examiners in tackling the size of marking errors? Research
Matters: A Cambridge Assessment Publication (8), 8-15.
Whitehouse, C., & Pollitt, A. (2012). Using adaptive comparative judgement to obtain a highly
reliable rank order in summative assessment. Manchester: AQA Centre for Education
Research and Policy.
Wolfe, E. W., Matthews, S., & Vickers, D. (2010). The effectiveness and efficiency of
distributed online, regional online, and regional face-to-face training for writing
assessment raters. The Journal of Technology, Learning and Assessment, 10(1).
http://ejournals.bc.edu/ojs/index.php/jtla/article/view/1601/1457
Wolfe, E. W., & McVay, A. (2010). Rater effects as a function of rater training context. New
York: Pearson Research and Innovation Network.