THE EFFECT OF GENDER ON MULTIPLE CHOICE EXAMS: … · 2017-08-04 · Dit onderzoek zal proberen bij...
THE EFFECT OF GENDER ON
MULTIPLE CHOICE EXAMS:
RETROSPECTIVE CORRECTING FOR
GUESSING
Word count: 20,877
Daphné Dejonckheere Student number: 01271215
Promotor: dr. Evelien Opdecam
Master's dissertation submitted to obtain the degree of:
Master of Science in Business Economics, main subject: Business Economics
Academic year: 2016 - 2017
PERMISSION
I declare that the content of this Master’s Dissertation may be consulted and/or reproduced,
provided that the source is referenced.
Daphné Dejonckheere
Dutch-language summary
Multiple choice exams are a widely known exam format for measuring the knowledge
of students in higher education. Previous literature, however, calls for further exploration
of alternative scoring methods for multiple choice exams, since the two most commonly
used scoring methods, namely “number right scoring” and “correction for guessing” (“giscorrectie”),
both show inherent drawbacks. An important concern with the use of multiple choice questions
is that it would favour particular groups of students. Earlier research often found
that men have an advantage over women on multiple choice questions.
Significant differences in performance between men and women on multiple choice exams
have been found on exams both with and without correction for guessing. This study will try
to contribute to previous research by examining whether significant differences between the
performance of men and women also occur when an alternative method is used
to score multiple choice exams. The alternative scoring method examined in this
study is called standard setting or “hogere cesuur” (a higher pass mark). With this
method, no marks are lost for a wrong answer, but students have to answer more
than half of the questions correctly in order to pass the exam. Besides
gender, other student characteristics that may explain differences in performance
between students are also taken into account. This study was
conducted among third-year bachelor students of Business Administration (“handelswetenschappen”). The results show no
significant differences in performance between men and women on exams where standard
setting was used as the scoring method. However, when looking at the scores of
students on the different types of multiple choice questions, the results show that
male students performed clearly better on calculation questions, while women scored
better on application questions. With regard to the other factors, weekly study time
and a surface approach to learning were found to have a significantly positive and a significantly
negative influence, respectively, on exam performance where standard setting is used. Finally,
students who attended the last lecture achieved a significantly higher score on the exam
than students who were absent.
Abstract
Multiple choice examinations are a widely known exam format for measuring students’
knowledge in higher education. Previous literature calls, however, for further exploration of
alternative scoring methods for multiple choice assessment, since the two most commonly used
scoring methods – “number right scoring” and “negative marking” – both have shown inherent
drawbacks. One major concern with the use of multiple choice questions is that it may favour
particular groups of students. More specifically, prior research has often identified a gender bias in
favour of male students on multiple choice questions. Gender differences in performance on
multiple choice exams have occurred both with and without the use of penalties for wrong
answers. This study tries to contribute to prior research by examining whether a significant
gender effect also exists when an alternative method is used to score multiple choice exams.
The alternative scoring method that will be explored in this study is called retrospective
correcting for guessing (also called “standard setting” or “hogere cesuur” in Dutch). This
scoring method does not penalize wrong answers, but students have to answer more than half
of the questions correctly in order to pass the exam. Besides gender, this study also takes
other student characteristics into account that may explain differences in performance
among students. The study was conducted in a third-year undergraduate Business
Administration course. The results provide no evidence for significant gender differences in
performance on multiple choice exams that are corrected retrospectively for guessing. When
looking at performance scores on the different types of multiple choice questions though, male
students performed significantly better on calculations, while women outperformed men on
application questions. With regard to other students’ characteristics, this study found that
weekly invested study time and a surface approach to learning respectively have a significantly
positive and negative influence on performance on exams which are corrected retrospectively
for guessing. Finally, students who attended the last lecture achieved significantly higher marks
compared to students who were absent.
Preface
This master thesis can be considered as the final proof of competence for obtaining the Master
of Science degree in Complementary Studies in Business Economics at the University of Ghent.
Several persons have substantially contributed to this master thesis. Therefore, I would like to
take this opportunity to express my gratitude towards those people who have helped me through
this.
First and foremost, I sincerely want to show gratitude towards my promotor, dr. Evelien
Opdecam. She introduced me to this interesting topic and offered me the opportunity to work on
this subject. I also want to thank her for her faith in my capabilities and her regular feedback
and guidance to improve the quality of this study.
Secondly, I would like to thank students enrolled in the third bachelor of Business
Administration at our university, who have completed my survey. Without their participation,
it would be impossible to investigate this topic. Their answers formed the backbone of this
study.
Thirdly, I want to thank my parents for the opportunity to follow this education and their
continuous support, even during hard times.
Lastly, I owe gratitude to Lukas, my boyfriend, for his moral support, comforting words and
sincere interest in my work. He also carefully read my text and corrected the grammatical
mistakes I made.
Daphné Dejonckheere
Table of contents
Introduction
1 Theoretical framework
1.1 Multiple choice examinations in higher education
1.2 The influence of gender on performance on multiple choice exams
1.3 Scoring methods for multiple choice assessment
1.3.1 Conventional scoring methods
1.3.1.1 Number right (NR) scoring
1.3.1.2 Negative marking (NM)
1.3.2 Retrospective correcting for guessing
1.4 Other explanatory factors of performance
1.4.1 Prior experience, familiarity and preference
1.4.2 Lesson attendance
1.4.3 Study time
1.4.4 Students’ perceptions about course difficulty
1.4.5 Learning approaches
1.4.6 Motivation
2 Research design & methodology
2.1 Research goal & questions
2.2 Research techniques
2.2.1 Surveys
2.2.1.1 Sample
2.3 Measurement
2.3.1 Dependent variables: performance
2.3.2 Independent variables
2.3.3 Control variable
2.4 Analysing the results
2.4.1 Independent samples T-test
2.4.2 Regression analyses
3 Research findings
3.1 Descriptive statistics
3.2 Correlations
3.3 Gender differences
3.4 Hypotheses testing
3.4.1 Hypothesis 1
3.4.2 Hypotheses 2
3.4.3 Hypothesis 3
3.4.4 Hypothesis 4
3.4.5 Hypothesis 5
3.4.6 Hypothesis 6
3.4.7 Robustness check
4 Discussion
4.1 Limitations
4.2 Future research
5 Conclusion
Bibliography
Appendices
Appendix 1: Survey
Appendix 2: Factor loadings and Cronbach’s alpha familiarity
Appendix 3: Factor loadings and Cronbach’s alphas R-SPQ-2F
Appendix 4: Factor loadings and Cronbach’s alphas RAI
List of used abbreviations
CR Constructed-response
MC Multiple choice
NM Negative marking
NR Number right (scoring)
RAI Relative autonomy index
R-SPQ-2F Revised two factor study process questionnaire
SDT Self-determination theory
SPQ Study process questionnaire
SRQ Self-regulation questionnaire
VIF Variance inflation factor
List of tables and figures
TABLES:
Table 1: Literature review regarding the influence of students’ characteristics on
performance
Table 2: Descriptives performance on the exam
Table 3: Frequencies gender
Table 4: Descriptives familiarity, preference, perceptions course difficulty, learning
approaches & ability
Table 5: Frequencies times participated in the exam
Table 6: Frequencies lesson attendance (exercises)
Table 7: Frequencies lesson attendance (theory)
Table 8: Frequencies weekly reported study time (excl. lessons)
Table 9: Frequencies quadrants of learning approaches
Table 10: Correlation table
Table 11: Gender differences (Independent samples T-test & Mann-Whitney U-test)
Table 12: ANCOVA for gender differences in performance (control variable: ability)
Table 13: Regression of familiarity with retrospective correcting for guessing on performance
Table 14: Regression of preference of scoring method on performance
Table 15: Regression of lesson attendance on performance
Table 16: Additional t-test regarding attendance of last course
Table 17: Regression of time weekly spent on performance
Table 18: Regression of perceptions about course difficulty on performance
Table 19: Regression of learning approaches on performance
Table 20: Regression of all the independent variables on performance with retrospective
correcting for guessing
Table 21: Regression of all the independent variables on performance on theoretical questions
Table 22: Regression of all the independent variables on performance on calculations
Table 23: Regression of all the independent variables on performance on application
questions
FIGURES:
Figure 1: The internalization continuum depicting the various types of extrinsic motivation
posited within self-determination theory
Figure 2: Histogram performance with retrospective correcting for guessing (mark on 40)
Figure 3: Plot of the learning approaches (Mean-split)
Figure 4: Mean scores on the different types of MC questions (mark on 10)
Introduction
Since 1953, higher education enrolments in Belgium have grown spectacularly, which has had
implications for the format of examinations. As courses are followed by larger groups of students,
instructors have to score considerable numbers of exams (Duchesne & Nonneman, 1998), and
grading them can be very time-consuming. Consequently, many
constructed-response (CR) tests have been replaced by multiple choice (MC) examinations, for
which computerized evaluation is possible (Kastner & Stangl, 2011). As examinations in higher
education mainly aim at extracting the knowledge of students from their responses, test scores have
to reflect the “true” level of knowledge mastery of students. Hence, a lot of education literature
concentrated on scoring methods for these MC test formats (Lesage, Valcke, & Sabbe, 2013).
This debate has, however, been rather one-sided, concentrating on the two most commonly used
scoring methods: number right (NR) scoring versus negative marking (NM). Results of former
studies indicated that neither method seems to meet expectations and that both have inherent
drawbacks with regard to test validity and reliability. Whereas the major problem with NR scoring
is that students can gain marks through guessing, the use of penalties under NM is said
to favour particular groups of students (Lesage, Valcke, & Sabbe, 2013). Several authors speak of
a gender bias, as implementing correction for guessing leads male and female students to omit
different numbers of items and consequently to differences in performance (Betts,
Elder, Hartley, & Trueman, 2009). Other drawbacks of the correction for guessing format have
also been mentioned in the literature and will be discussed later on.
Therefore, a growing need arises to explore alternative approaches for scoring MC exams in order
to inform and support instructors and other test designers (Lesage, Valcke, & Sabbe, 2013).
However, when exploring the literature, a substantial gap appears with regard to these
“non-conventional” scoring methods. Therefore, this study will try to contribute to previous research by
switching the focus to a non-conventional scoring method: retrospective correcting for guessing.
The lack of research devoted to this alternative approach, combined with the fact that
the University of Ghent decided in 2014 to replace the correction for guessing by this
non-conventional method (also known as “standard setting” or “hogere cesuur”), are the main reasons
for choosing this scoring method as the main focus of this dissertation. Nevertheless, it should be
acknowledged that other scoring methods exist, which may also benefit from additional
research.
This study will analyse whether differences in marks on multiple choice exams that are corrected
retrospectively for guessing can be attributed to different characteristics of students. The main
question is whether a gender difference in performance also appears when the retrospective
correcting for guessing scoring method is applied, and if so, how large that gender effect is.
The possible influence on performance of other variables, such as lesson attendance, invested study
time and learning approaches, will also be discussed. This research is carried out on a
sample of third-year bachelor Business Administration (“handelswetenschappen”) students following a
course on corporation tax (“vennootschapsbelasting”) at the University of Ghent.
The remainder of this study is organized as follows. First, an outline of relevant literature regarding
previous research on multiple choice (MC) examinations and their scoring methods will be given.
This literature study will lead to the formulation of hypotheses. Next, details concerning the
methodology and the data used in this study will be given. Thirdly, the results of the analyses will
be presented and discussed, which will lead to either the confirmation or rejection of the hypotheses.
Finally, conclusions of this study are contained in the last section.
1 Theoretical framework
1.1 Multiple choice examinations in higher education
Multiple-choice (MC) examinations have become a widespread evaluation tool within higher
education. MC questions have a stem and a set of possible answers from which examinees have to
select the correct answer(s). Contrary to MC tests, constructed-response (CR) questions require
students to independently formulate their own answers, which might be a short answer, an essay, a
diagram, an explanation of a procedure or a solution to a mathematical question (Kastner & Stangl,
2011). The frequent use of MC examinations can be observed in different disciplines such as
accounting (Bible, Simkin, & Kuechler, 2008; Arthur & Everaert, 2012), economics (Chan &
Kennedy, 2002; Du Plessis & Du Plessis, 2007), psychology (Betts, Elder, Hartley, & Trueman,
2009), information technology (Woodford & Bancroft, 2004) and mathematics (Beller & Gafni,
2000).
This increasing use of MC examinations can be attributed to several benefits this format offers in
comparison to CR tests. The main advantages for instructors include the possibility to cover a broad
range of subjects in a single examination, even for large cohorts of students, and greater efficiency
and reliability in scoring (Betts, Elder, Hartley, & Trueman, 2009; Kastner & Stangl, 2011). For
students, the most important benefits of MC exams are the perception that scoring
is more objective, the fact that their writing skills and writing speed are not determining
factors, and a heightened confidence in their ability to improve their marks by making correct
guesses or uncovering the solution by a process of elimination (Bible, Simkin, & Kuechler, 2008).
Nevertheless, prior literature also mentions several drawbacks of MC tests. First of all, the
possibility of gaining marks through lucky guessing is a concern frequently voiced by researchers
regarding the reliability of MC examinations (Betts, Elder, Hartley, & Trueman, 2009). A second
concern is whether these tests assess the same level of understanding as CR tests. A third
disadvantage relates to potential ambiguity in the MC questions themselves (Tsui et al., in: Bible,
Simkin, & Kuechler, 2008). Fourthly, it has been argued that these tests do not adequately measure
students’ critical, communication and analytical skills, although these skills are actively encouraged
in higher education and essential in preparing students for future employment (Bible, Simkin, &
Kuechler, 2008). Fifthly, some authors argue that MC examination typically promotes
‘surface’ rather than ‘deep’ learning as MC questions may encourage students to memorize subject
matter instead of understanding concepts (Williams & Clark, in: Betts, Elder, Hartley, & Trueman,
2009). Finally, the debate whether the use of MC questions favours particular groups of students
presents a particular case in research literature. More specifically, several studies identified a gender
bias in favour of male students with MC questions. However, findings about this issue have not been
consistent and will be discussed in the next section.
1.2 The influence of gender on performance on multiple choice exams
Considerable attention in prior literature has been devoted to the question whether exam format
matters when measuring student performance. It is often argued that, depending on personal
characteristics, some students are predisposed to perform better on a certain mode of assessment
(Krieg & Uyar, 2001). In particular, the relationship between gender and student performance on MC
examinations has been a prominent research focus in the education literature. In what follows, prior
findings about this relationship will be discussed.
A substantial amount of local and international research has identified a gender bias in favour of
male students with MC examinations. Bias or systematic error appears when it is impossible to
measure all subgroups of the population in the same way. Consequently, gender bias can be defined
as a systematic error in the measurement of differences in skills between men and women
(Willingham & Cole, 1997). Several studies found that MC questions favour male over
female students. For instance, this effect was confirmed in accounting examinations by
Arthur & Everaert (2012). Although women outperformed their male counterparts in both MC and
CR exam formats, their advantage was smaller on MC questions than on CR
questions. For both theory and exercise MC questions, male students seemed to have a relative
advantage over females. Research by Krieg & Uyar (2001), investigating the importance of
exam structure in economics, also found that male students are predisposed to perform better on MC
exams.
Leaver & van Walbeek (2006) supplemented prior research by examining whether certain types of
MC questions induced a stronger “gender bias” than other types. To this end, they examined whether
the gender difference could be explained by either the content type or the degree of cognitive reasoning
needed to answer questions. With regard to content, they divided questions into five categories:
1) Quantitative questions (i.e. calculations)
2) Qualitative questions (i.e. descriptions, definitions)
3) Specific graphical questions (i.e. finding a specific solution based on a graph)
4) General graphical questions (i.e. general shifts of curves based on a graph)
5) Factual questions (i.e. general knowledge or current affairs)
Secondly, they classified questions according to the level of cognitive reasoning by making use of
the Cognitive Model of Bloom’s taxonomy. The main idea of this taxonomy is that educational
objectives can be organized in a hierarchy from less to more complex. The six levels are successive,
meaning that one level has to be mastered before a following level can be attained. However, there
is only consensus that the first four classes of the taxonomy form a hierarchy, while there is some
disagreement about the fifth and sixth category. Most MC questions can, however, be classified
as belonging to one of the first four categories. The six levels of cognitive reasoning include:
1) Knowledge: recalling or recognizing previously learned information
2) Comprehension: understanding the meaning of information, interpreting information
3) Application: using information in new situations
4) Analysis: examining and dividing information into component parts
5) Synthesis: integrating or combining information
6) Evaluation: assessing the value of information (Leaver & van Walbeek, 2006)
The results of this research indicated that female students were outperformed by male students, and
this finding holds for both questions categorized according to content as well as for questions
classified according to Bloom’s taxonomy. With regard to content, females appeared to be at a
disadvantage for all five categories, but to a larger extent in case of quantitative and graphical
questions. In Bloom’s taxonomy, the gender difference becomes more prominent at higher levels:
the more complex the questions, the more likely women were to answer them incorrectly (Leaver &
van Walbeek, 2006).
Besides a gender bias in favour of male students on MC questions, some studies found evidence
for a positive female gender effect on CR tests. Research by Du Plessis & Du Plessis (2007), for
instance, could not confirm the strong claims about a gender bias in favour of male students in
MC examinations, but showed a positive female gender effect on performance on CR
questions. Bible, Simkin, & Kuechler (2008) also found a small but significant positive relation
between being female and performance on CR questions. Their research indicated that women have a
four percent advantage over men on CR questions.
There is also a stream of literature that contradicts the findings listed above. Wester &
Henrikkson (2000) used identical items in different exam formats to investigate performance
in mathematics. Women performed slightly better than men on the MC items and this difference in
performance remained the same for CR questions. Hence, no significant changes in gender
differences were found when the exam format was changed. The study of Hartley, Betts, &
Murray (2007), which compared the scores of male and female final-year psychology students
across different modes of assessment, also found that women performed significantly better than men on all
modes of assessment (including MC exams). Finally, some studies found no gender
differences in performance at all. Chan & Kennedy (2002), for instance, found no significant
differences between the performance of male and female students on either MC or CR tests.
1.3 Scoring methods for multiple choice assessment
Up to now, the way MC exams are scored has not been taken into account, although a variety of
options exists. An important discussion in this field concerns the question whether a penalty for
wrong answers should be used or not. Consequently, prior literature mainly concentrated on two
widely used methods: number right (NR) scoring versus negative marking (NM). These
conventional scoring methods will be discussed in the following subsection. Since both methods
show inherent benefits as well as disadvantages and no empirical evidence exists that helps to direct
the choice between them, an alternative approach will be explored as well. This alternative approach
is currently applied at Ghent University and will be the main focus of this master thesis.
1.3.1 Conventional scoring methods
1.3.1.1 Number right (NR) scoring
Number right scoring is one of the simplest scoring methods: it rewards right answers with
a positive value, while incorrect or omitted answers are scored with a value of zero (Lesage, Valcke,
& Sabbe, 2013).
Multiple choice exams that use the NR scoring method also seem a very good alternative to
constructed-response tests, since they follow much the same logic. Neither format
penalizes wrong answers, while most other scoring rules (e.g. negative marking) are found to be
stricter than NR scoring and CR tests (Kastner & Stangl, 2011).
However, one of the major drawbacks of this scoring method is that students can answer
correctly through guessing. Consequently, the reliability and validity of test scores decrease, as
instructors are not able to distinguish guessed answers from answers based on knowledge
(Bar-Hillel, Budescu, & Attali, in: Lesage, Valcke, & Sabbe, 2013). However, the frequency of blind
guessing may be substantially overestimated. Students hardly ever resort to blind guessing, which refers
to purely random guessing in which each answer option has an equal chance of being
chosen. Moreover, blind guessing alone is not likely to result in high grades. As Downing (2003)
put it: “the odds of achieving a perfect score on a test through random guessing alone
approach the odds of winning the lottery” (p. 670). In case of informed guessing, on the other
hand, students use their partial knowledge to eliminate incorrect answers in order to improve their
chance of picking the correct answer (Downing, 2003).
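The NR rule and Downing's remark about blind guessing can be illustrated with a minimal sketch. The answer key, the responses and the function names below are hypothetical and serve only as an illustration; they are not taken from the exam studied in this dissertation.

```python
from fractions import Fraction


def number_right_score(responses, key):
    """Number right (NR) score: one mark per correct answer, zero otherwise.

    An omitted item is represented by None (an assumption of this sketch).
    """
    return sum(1 for given, correct in zip(responses, key) if given == correct)


def perfect_score_by_blind_guessing(n_items, n_options):
    """Probability of answering every item correctly by purely random guessing."""
    return Fraction(1, n_options) ** n_items


# Hypothetical four-item exam with four answer options per item.
key = ["A", "C", "B", "D"]
responses = ["A", "B", None, "D"]  # one wrong answer, one omission

print(number_right_score(responses, key))      # 2
print(perfect_score_by_blind_guessing(4, 4))   # 1/256
```

Even for this very short hypothetical exam, a perfect score through blind guessing alone is unlikely, and the probability shrinks further with every additional item.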
1.3.1.2 Negative marking (NM)
Because student guessing has been an issue from the earliest use of the MC format, correction for
guessing or negative marking (NM) is nowadays frequently incorporated in MC exams (Betts, Elder,
Hartley, & Trueman, 2009). The predominant correction model within this method is ‘rights
minus wrongs’. Contrary to NR scoring, incorrect answers are penalized by deducting a fraction
of a mark. Mostly, the penalty for an incorrect answer is 1/(n−1), with n representing the number of
choices. However, whole marks are also sometimes deducted for incorrect answers (Lesage, Valcke,
& Sabbe, 2013).
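The ‘rights minus wrongs’ rule with a penalty of 1/(n−1) per wrong answer can be sketched as follows. The function names and the example numbers are hypothetical; the sketch only assumes that a right answer is worth one mark and that omitted items score zero.

```python
from fractions import Fraction


def negative_marking_score(n_right, n_wrong, n_options):
    """'Rights minus wrongs' score with a penalty of 1/(n-1) per wrong answer.

    Omitted items are neither rewarded nor penalised.
    """
    penalty = Fraction(1, n_options - 1)
    return n_right - penalty * n_wrong


def expected_mark_blind_guess(n_options):
    """Expected mark on a single item when one of n options is picked at random."""
    p_correct = Fraction(1, n_options)
    penalty = Fraction(1, n_options - 1)
    return p_correct * 1 - (1 - p_correct) * penalty


# Hypothetical student: 25 right, 9 wrong, 6 omitted, four options per item.
print(negative_marking_score(25, 9, 4))   # 22
print(expected_mark_blind_guess(4))       # 0
```

With this penalty, the expected gain from a blind guess is exactly zero for any number of options, which is why 1/(n−1) is regarded as the standard correction.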
As the introduction of negative marking is believed to discourage students from guessing, this method
would result in higher reliability and validity of test scores (Muijtjens et al., in: Lesage, Valcke, &
Sabbe, 2013): the test scores would be a more reliable reflection of a student’s capability.
Nevertheless, other problems seem to arise when making use of this scoring method.
First of all, it is argued that this method seems to miss its goal as it does not overcome the guessing
problem, but instead introduces new tensions. One student may show greater risk seeking behaviour
by trying to guess the correct answers more frequently, while another student may be more risk
averse and show a higher tendency to omit items when he or she is not sure. Hence, risk averse
students may be disadvantaged while they might have equal ability levels as their fellow students
who frequently dare to guess. The focus may shift away from measuring students’ knowledge
towards measuring students’ answering strategies and risk taking behaviour (Bar-Hillel, Budescu,
& Attali, in: Lesage, Valcke, & Sabbe, 2013).
A second disadvantage related to the guessing problem concerns the instructions that should be given to
students in advance. When negative marking was introduced, students were instructed not to guess
at all. Later on, however, it was stated that students should be advised to guess when they could
eliminate one or more alternative options (Betts et al., in: Lesage, Valcke, & Sabbe, 2013). It is clear
that this challenges the original underlying principle of this scoring method, namely discouraging
guessing. Since students will react differently and inconsistently, it can be concluded that instructors
have to be very cautious when instructing students whether to guess or not. Formulating instructions
that are beneficial to all students seems to be a very difficult task (Budescu & Bar-Hillel, in: Lesage,
Valcke, & Sabbe, 2013). Figuring out the optimal decision strategy under negative marking is
challenging for students as well (Lesage, Valcke, & Sabbe, 2013).
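Why the advice to guess after eliminating options follows from the scoring rule can be shown with a small calculation. The sketch below assumes the standard penalty of 1/(n−1), one mark for a right answer, and that the eliminated options are indeed incorrect; the function name is hypothetical.

```python
from fractions import Fraction


def expected_mark_informed_guess(n_options, n_eliminated):
    """Expected mark when guessing at random among the options that remain
    after n_eliminated incorrect options have been ruled out."""
    remaining = n_options - n_eliminated
    p_correct = Fraction(1, remaining)
    penalty = Fraction(1, n_options - 1)
    return p_correct - (1 - p_correct) * penalty


# Four-option item: expected mark for 0, 1 and 2 eliminated options.
for eliminated in range(3):
    print(eliminated, expected_mark_informed_guess(4, eliminated))
# 0 0
# 1 1/9
# 2 1/3
```

Under these assumptions, omitting an item yields zero, a blind guess also yields zero on average, and every eliminated option makes guessing strictly more attractive in expectation. A risk averse student may nevertheless prefer the certain zero of an omission, which is where differences in answering strategy enter the score.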
Thirdly, there is also disagreement in literature about the optimal penalty that should be attached to
incorrect answers. Some are in favour of a penalty exceeding the standard penalty of 1/(n−1) (Budescu
& Bar-Hillel, in: Lesage, Valcke, & Sabbe, 2013). A higher penalty can be justified since, although
it may discriminate against risk averse students, this effect is negligible compared to the
measurement error it prevents (Bible, Simkin, & Kuechler, 2008).
Finally, implementing a penalty to discourage guessing behaviour may also be detrimental to
students’ performance. As correction for guessing increases the number of questions left unanswered,
lower final grades are a logical, immediate consequence. Research by Betts et al. (2009) found,
however, that this detrimental effect of correcting for guessing only appears in closed-book
MC examinations. In open-book examinations, the implementation of a penalty does not
lead to significantly poorer performance. Furthermore, the implementation of correcting for
guessing may also lead to gender differences in performance, as men and women have shown
different risk patterns, resulting in different numbers of questions left unanswered.
Consequently, a recent stream of literature has concentrated on the debate whether the use of penalties
in MC assessment induces an (additional) gender bias or not. Again, mixed results can be observed;
they will be discussed in the next paragraph.
1.3.1.2.1 Gender differences in risk aversion
The use of penalties in MC exams inevitably results in a higher number of omitted items. Differences
between students in the tendency to omit items have been explained by their attitudes towards risk:
more risk averse students omit more items compared to less risk averse students (Espinosa &
Gardeazabal, 2010). Accordingly, students with a lower degree of risk aversion can obtain a higher
score on MC examinations that penalize wrong answers, while more risk averse students suffer a
disadvantage with this kind of exam (Marín & Rosa-García, 2011). As women are more risk averse
than men, it is sometimes argued that this type of examination involves a discrimination against
women.
Persistent differences between male and female students in the number of questions left unanswered
were, for instance, found in the study of Marín & Rosa-García (2011). They observed a
higher risk aversion in women, as they consistently answered fewer questions than men.
Though women obtained lower scores in comparison to men, these differences in marks were very
small and mostly insignificant. Nevertheless, they concluded that a discrimination against women
exists with this type of MC examinations due to their higher tendency to omit items compared to
men.
There are several studies, however, that found no gender differences at all concerning the degree of
risk aversion in MC exams. Research of Betts et al. (2009), for instance, found that men and women
omitted an approximately equal percentage of questions. The experiment of Du Plessis &
Du Plessis (2007) also revealed no evidence of gender differences in the level of risk aversion. Their
experiment, however, yielded another interesting result: a significant difference was found between
the success of guessing by men and women. Male students correctly guessed significantly more of the
MC versions of questions that were found difficult to answer in written form.
We can conclude that both these conventional scoring methods affect the test reliability in a negative
way. To overcome the weaknesses of both methods, increasing the number of questions in exams
as well as the number of alternative options for each question may offer a solution. However, in
higher education settings, this is not always a feasible solution since the time given to students for
completing an exam is mostly restricted. Moreover, test developers may also be confronted with
new difficulties when they have to think about additional item options. When these extra alternative
options are not able to act as effective distractors, they will not be able to discourage guessing
behaviour (Lesage, Valcke, & Sabbe, 2013).
1.3.2 Retrospective correcting for guessing
Due to the weaknesses of conventional scoring methods, alternative scoring methods will have to
be explored that may overcome these shortcomings. The ultimate goal of test designers is to
combine high reliability, as with NM, with the avoidance of bias due to
risk-taking behaviour, as with NR scoring (Muijtjens et al., in: Lesage, Valcke, & Sabbe, 2013).
Ghent University has also decided to no longer use the NM format of correction for guessing.
Reasons for abolishment were manifold and included, amongst others, the fact that differences in
students’ guessing behaviour may cause differences in final grades, the observation that students were too
occupied with tactical considerations about whether to answer questions or not, etc. (Universiteit Gent,
2017).
Since the academic year 2014 – 2015, the NM scoring method has been replaced by what is called
‘hogere cesuur’ or ‘standard setting’. A standard can be defined as a score that indicates a boundary
between those who perform well enough and others who do not (Norcini, 2003). Similar to NM
scoring, a correct answer will be rewarded with a positive score. Contrary to NM scoring, wrong
and absent answers will no longer be penalized with negative marks, but will be given a value of
zero. Afterwards, a recalculation of the grades follows where one has to answer more than half of
the questions correctly to pass the exam. This method allows students to fully concentrate on the
content during exams instead of considering whether or not to answer a question (Universiteit Gent,
2017). This alternative scoring method will be the focus of this master thesis. It should be
acknowledged, however, that there still exist alternative non-conventional scoring methods (e.g.
partial-credit scoring methods), but these fall outside the scope of this dissertation.
By exploring educational literature, the term ‘retrospective correcting for guessing’ can be
encountered. According to this format, students are encouraged to answer every question since an
omitted answer is considered as an incorrect answer. The correction for guessing is then
implemented afterwards, or retrospectively, hence the term ‘retrospective correcting for guessing’.
Based on an estimation of guessing behaviour of students, scores are corrected. On the one hand,
this method penalizes blind guessing, which is clearly an advantage compared to NR scoring. On
the other hand, risk taking behaviour of the students becomes irrelevant since students benefit from
answering all the questions as the expected mark for responding cannot be lower than omitting
(Lesage, Valcke, & Sabbe, 2013). With the introduction of ‘hogere cesuur’, the University of Ghent
emphasizes these underlying principles.
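The claim that the expected mark for responding cannot be lower than that for omitting can be checked with a few lines of Python. This is only a minimal sketch: the function name and the one-mark reward are assumptions made for illustration, and a blind guess is assumed to be uniformly random over the answer options.

```python
from fractions import Fraction


def expected_mark_blind_guess(n_options: int, reward: int = 1, penalty: int = 0) -> Fraction:
    """Expected mark of a uniformly random guess among n_options answers.

    reward  : mark for a correct answer (assumed to be 1 here)
    penalty : marks deducted for a wrong answer (0 when wrong answers score zero)
    """
    p_correct = Fraction(1, n_options)
    return p_correct * reward - (1 - p_correct) * penalty


# An omitted answer is treated as an incorrect answer, i.e. it scores zero.
OMIT_MARK = Fraction(0)

for n in (2, 3, 4, 5):
    guess = expected_mark_blind_guess(n)
    print(n, guess, guess >= OMIT_MARK)
```

Under these assumptions a blind guess is worth 1/n marks in expectation, which is never below the zero obtained by omitting, so a student's attitude towards risk no longer affects the decision to answer.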
In this master thesis, the term ‘retrospective correcting for guessing’ will be preferred over
‘standard setting’, the English rendering of the Dutch term ‘hogere cesuur’. The reason is mainly to avoid confusion, as standard setting
comprises two main categories in educational literature, being norm-referenced and criterion-
referenced assessment. Since especially one of the two categories differs conceptually from the
Dutch understanding of ‘standard setting’, the term should be used with caution in this context.
Whereas the norm-referenced form of assessment shows no similarities with the method applied at
Ghent University, the criterion-referenced option does closely relate to what we call ‘hogere
cesuur’.
Norm-referenced assessments or relative methods are an evaluation form in which an examinee’s
performance is compared to that of the current group of students participating in the test. Norm-
referenced standard setting is thus based on test results, with the performances of an appropriate
peer group (i.e. ‘norm group’) as the point of reference. This form of assessment is mainly used to
rank students rather than to measure individual performance against a standard or criterion.
Consequently, standards in this format will vary depending on group differences (Cohen‐Schotanus
& Van der Vleuten, 2010; Lesage, Valcke, & Sabbe, 2013).
The second category consists of the criterion-referenced assessments or absolute methods, which are
designed to measure performance of students against a specified achievement level. A pre-fixed
cut-off score is defined, which makes it possible to take the effect of guessing into account. This can be illustrated
with an example: the standard of a multiple choice test comprising 40 questions with 4 options can
be set at 50%. Since some questions may be answered correctly by randomly guessing, the passing
score can be increased to 25 out of 40 questions (Cohen‐Schotanus & Van der Vleuten, 2010;
Lesage, Valcke, & Sabbe, 2013). This form of standard setting is most appropriate for tests of
competence, where the goal is to ensure that the examinees have sufficient knowledge for a particular
purpose (Norcini, 2003). This form of standard setting is independent of test results, but can cause
variation in failure rates, merely as a function of test difficulty (Cohen‐Schotanus & Van der
Vleuten, 2010; Lesage, Valcke, & Sabbe, 2013). It is clear that the principles of this type of
assessment are applied by the University of Ghent.
It should, however, be recognized that there are still drawbacks related to this alternative scoring
method. For instance, it remains difficult to justify the fact that students are forced to guess when they
do not know the answer for sure. In disciplines where it is of utmost importance to know the answer
with certainty, as is the case in the medical training of doctors, this method
might not seem very appropriate. Furthermore, another concern may be the process of setting a
cut-off score (Lesage, Valcke, & Sabbe, 2013).
The higher passing score has to be determined in a way that the probability that a student passes a
test through guessing is similar in case of negative marking as well as in case of standard setting. At
Ghent University, teachers can make use of the standard formula for setting a higher cut-off score
or may determine the passing score themselves. The standard formula takes into account the
likelihood that students guess the correct answers, which depends on the number of choices (n). The
standard formula can be written as follows (Universiteit Gent, 2017):
$$\sum_{i=1}^{N} \frac{n_i + 1}{2 n_i} W_i$$
$N$ reflects the number of questions
$n_i$ reflects the number of choices per question
$W_i$ reflects the weights assigned to each question
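As a rough illustration, the standard formula can be evaluated as follows. This is a sketch, not the university's own tooling: the function name `cut_off_score` and the default of equal weights (1 per question) are assumptions made here.

```python
from fractions import Fraction


def cut_off_score(n_choices, weights=None):
    """Sum over all questions of (n_i + 1) / (2 * n_i) * W_i.

    n_choices : list with the number of answer options n_i for each question
    weights   : list with the weight W_i for each question
                (assumption: every question has weight 1 when omitted)
    """
    if weights is None:
        weights = [1] * len(n_choices)
    if len(weights) != len(n_choices):
        raise ValueError("n_choices and weights must have the same length")
    return sum(Fraction(n + 1, 2 * n) * w for n, w in zip(n_choices, weights))


# 40 equally weighted questions with four options each
print(cut_off_score([4] * 40))  # 25
```

Exact fractions are used so that a cut-off such as 27.5 (for example, 20 four-option and 20 two-option questions) is not affected by floating-point rounding.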
Subsequently, the number of correctly answered questions has to be converted to a final grade.
Students just reaching the cut-off score will obtain a 10/20. The maximum score will be given if a
student has answered every question correctly and a zero is given when all questions were answered
wrongly or not answered at all. In order to calculate the final grade, the following formula can be
used:
$$z = 10 + \frac{10}{N - c}\,(y - c)$$
z reflects the final grade of the student
N reflects the number of questions
c reflects the higher passing grade
y reflects the number of correctly answered questions
For example, in case of 40 MC questions with four answer options for each question, the chance of
guessing the correct answer is 25%, which corresponds to ten questions. Half of the remaining
30 questions is 15. So 25 (i.e. 10 + 15) of the 40 questions have to be answered correctly to get a
score of 10/20.
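The conversion to a grade out of 20 can be sketched in Python. The branch for scores at or above the cut-off follows the formula given above. The formula as printed would yield a negative grade for scores far below the cut-off (for y = 0, N = 40 and c = 25 it gives 10 − 250/15), whereas the text states that zero correct answers result in a zero. The branch below the cut-off is therefore an assumption made here (a straight line from 0 to 10), not something stated in the source; the function name is also hypothetical.

```python
def final_grade(y: float, N: int, c: float) -> float:
    """Convert y correct answers into a grade out of 20.

    N : number of questions
    c : higher passing score (cut-off), with 0 < c < N
    """
    if not 0 <= y <= N:
        raise ValueError("y must lie between 0 and N")
    if not 0 < c < N:
        raise ValueError("c must lie strictly between 0 and N")
    if y >= c:
        # Formula from the text: z = 10 + 10 / (N - c) * (y - c)
        return 10 + 10 * (y - c) / (N - c)
    # ASSUMPTION (not in the source): linear scale from 0/20 at y = 0
    # up to 10/20 at y = c, so that zero correct answers give a zero.
    return 10 * y / c


print(final_grade(25, 40, 25))  # 10.0
print(final_grade(40, 40, 25))  # 20.0
print(final_grade(0, 40, 25))   # 0.0
```

With the example of 40 four-option questions, a student with exactly 25 correct answers obtains 10/20 and a student with all 40 correct obtains 20/20, in line with the description above.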
Ghent University already performed a first evaluation of the implementation of the new
scoring method, especially regarding the impact of the method on exam scores. An essential finding
of that study was that the transition to the new system mainly appears to benefit female students.
The study revealed that students with a low tendency to guess have significantly higher final grades
when this new scoring method is applied than under the NM scoring method. For students with
a high guessing tendency, the transition seems to make no difference. The students with lower
tendencies to guess mainly consist of women. Women achieve slightly higher marks out of 20 when
this scoring method is being used: +0.89 in comparison to +0.46 for male students. The
percentage of graduated female students also increases, while that of male students remains the same (Van
de Poele & Sabbe, 2016). However, it should be noted that this study provides no evidence that
female students now outperform men on MC exams which are scored retrospectively for
guessing. It only shows that women benefit more from the transition in marking system than men.
Nevertheless, for this study, the following hypothesis will be tested regarding the relationship between
gender and performance:
Hypothesis 1: Female students perform significantly better than male students when MC
examinations are corrected retrospectively for guessing. However, a gender effect in favour of
women will be weaker for questions belonging to higher levels of Bloom’s taxonomy.
1.4 Other explanatory factors of performance
Besides gender, previous research has also related other factors to superior examination
performance. In this section, other student characteristics are discussed that may contribute to
higher performance. For each of these characteristics, a hypothesis about the possible
relationship with performance on MC examinations that are corrected retrospectively for
guessing is formulated. As research about this marking method for MC assessment is very
scarce, the formulation of the hypotheses is based on literature about general exam
performance, regardless of the exam format and scoring method being used. Table 1 gives an
overview of the consulted studies regarding the influence of different student
characteristics on performance.
1.4.1 Prior experience, familiarity and preference
First of all, one’s chance to perform relatively better on MC examinations may be enhanced by
one’s prior experience in taking such exams. Krieg & Uyar (2001) examined whether students
who retake a course have a propensity to perform better on the MC exam of that course. They
expected a positive effect of retaking a course on performance, as those students have had prior
exposure to the course material and feel a greater pressure to succeed. They indeed concluded
that repeating a course has a significantly positive influence on performance on the MC exam
as the students in question possess added experience with similar MC questions. This variable
of repeating a course may reflect one’s experience in taking similar MC exams and will,
therefore, also be used in this research. Furthermore, past success or proven ability in taking
MC exams may familiarize students with this type of examination. Consequently, these students
may achieve higher grades in contrast to others who are not comfortable with this type of
examination. Therefore, this study will measure how familiar students feel with retrospective
correcting for guessing as the current marking method for MC exams and how it affects their
score. In addition to this, I expect that students who prefer this alternative scoring method over
the NM scoring method also perform better compared to those who prefer NM. Therefore,
students will be asked which of these two scoring methods they prefer. Consequently, the
following hypotheses will be tested:
Hypothesis 2a: Repeating a course is associated with higher performance on MC
examinations that are corrected retrospectively for guessing.
Hypothesis 2b: Familiarity with MC examinations that are corrected retrospectively for
guessing is associated with higher performance on MC examinations where this scoring
method is applied.
Hypothesis 2c: Preference for MC examinations that are corrected retrospectively for
guessing is associated with higher performance on MC examinations where this scoring
method is applied.
1.4.2 Lesson attendance
It has also been widely assumed that students benefit from attending lectures, since lesson
attendance is positively related to examination performance (Krieg & Uyar, 2001). It can be
questioned, however, whether this is still the case today due to huge developments in
information technology, also in the field of education. These new technologies make alternative
educational models possible, such as distance learning (Stanca, in: Aden, Yahye, & Dahir,
2013). If lesson attendance is, indeed, a significant predictor of performance, this would be a
relevant finding for both students and instructors. On the one hand, it may motivate students to
attend classes, because this is related to higher learning outcomes. On the other hand, it can also
have a motivating effect for instructors as this may convince them that their instructing does
matter for the learning outcomes of their students. Research of Aden, Yahye & Dahir (2013)
found that students who attend lessons have a significantly higher chance of passing a course.
The study of Kirby & McElroy (2003) indicated only a small positive effect of lecture
attendance on the probability of passing a course. They found that lesson attendance is more
crucial for enhancing a grade rather than obtaining the pass mark. The results of the study of
Cortright, Lujan, Cox, & DiCarlo (2011) extend previous findings by documenting that the
impact of lecture attendance on examination performance is sex specific. According to them,
regular class attendance has a stronger impact on exam performance for female students than it
has on the performance of male students. Based on prior results, the following hypothesis on
lesson attendance will be tested:
Hypothesis 3: Lesson attendance is positively associated with performance on MC exams,
which are corrected retrospectively for guessing.
1.4.3 Study time
Invested study time has also often been examined as another potential predictor of academic
performance. Some researchers found a significant relation between time spent studying and
performance on exams (Rau & Durand, 2000; Stinebrickner & Stinebrickner, 2004; Diseth,
Pallesen, Brunborg, & Larsen, 2010). However, other authors found no direct link between the
amount of time spent on studying and academic performance (e.g. Nonis & Hudson, 2006).
Plant, Ericsson, Hill, & Asberg (2005) also found that the amount of study time invested by college
students is a rather weak predictor of academic performance. Their research found that the
relationship between invested study time and performance can be influenced by other factors
such as the quality of the study environment, previously attained study skills and aspects of a
certain discipline. For instance, students studying in a quiet environment may study more
effectively and need less study time to achieve grades comparable to those of students working in a
disruptive environment. Nevertheless, this quantitative factor of students’ learning activities
will again be tested in this master thesis for a possible influence on performance. As the
influence of lesson attendance will be examined separately, this variable will focus on the time
spent on studying outside of class. Though no clear empirical evidence exists, the following
relationship will be assumed:
Hypothesis 4: There is a positive relationship between the time spent on studying and
performance on MC exams, which are corrected retrospectively for guessing.
1.4.4 Students’ perceptions about course difficulty
Students’ perceptions about course difficulty may also play a role in explaining differences in
performance between students. Findings about the link between perceived difficulty and
performance have, however, not been consistent. Foos (1992) found that students who perceive
a course as rather difficult perform better on the exam than students who perceive the course
material as rather easy. This can be explained by the fact that students are more motivated to
study and work harder when they expect the exam to be difficult. However, the studies of Hong
(1999) and Combs, Michael, & Fiore (2002) found that beliefs about test difficulty had no direct
influence on test performance. As previous literature shows mixed results, this thesis will also
investigate the potential relationship between students’ perceptions about course difficulty and
their corresponding grades. The following hypothesis about the relationship between perceived
course difficulty and performance is formulated:
Hypothesis 5: As students who perceive a course as rather difficult are expected to study harder
for that course, they will obtain higher scores on the MC exam, which is corrected
retrospectively for guessing.
1.4.5 Learning approaches
Further, the process of learning can have a significant impact on learning outcomes (Davidson,
2002). Research on learning in higher education states that students have a preferred way of
approaching their studies. A widely used dichotomy in the manner students approach their
learning task is “deep” versus “surface” learning (Marton & Saljo, in: Scouller, 1998). A
learning approach encompasses two elements: the first element entails the strategy or the
manner a student approaches a learning task and the second component is the motive or reason
why a student wants to approach it. A deep learning approach involves a personal commitment
in learning and a sincere interest in the subject. There is a strong, internal incentive to
thoroughly understand the course material and relate new insights to previously acquired
knowledge. In contrast, students employing a surface approach only carry out a learning task
to either embrace positive consequences or to avoid failure. These students only have the
intention to memorise facts in order to reproduce them during examinations in order to pass a
course (Scouller, 1998). Previous research has mainly emphasized the importance of the deep
learning approach in order to reach high-quality learning outcomes, such as analytical and
conceptual thinking skills, which cannot be achieved through a surface approach to learning
(e.g. Hall, Ramsay, & Raven, 2004; Everaert, Opdecam, & Maussen, 2017). Byrne, Flood,
& Willis (2002) also found that the deep learning approach was positively associated with higher
academic performance. However, they only found evidence for the relationship between
performance and learning approaches for female students, while little evidence was found for
their male counterparts. Furthermore, Davidson (2002) made a
distinction between complex and simpler examination questions. He came to the
conclusion that the use of a deep learning approach has a significant positive effect on
performance on more complex examination questions, while no significant relationship was
found between the deep approach and performance on simpler questions. Based on prior
literature, the following hypothesis can be formulated:
Hypothesis 6: The deep approach has a significant positive influence on performance on MC
examinations that are scored retrospectively for guessing, while the opposite effect occurs
for the surface approach.
1.4.6 Motivation
Finally, the relationship between the motivational process and academic performance in
higher education has also received increasing empirical attention in recent decades. In particular, the
quality of students’ motivation has been investigated, which refers to the kind of motivation that
triggers the learning behaviour. A commonly made distinction is the one between intrinsically and
extrinsically motivated behaviour. When students are intrinsically motivated, they engage in
learning activities for their own sake and out of interest. On the other hand, extrinsically motivated
students want to achieve certain outcomes which are separable from the learning itself
(Vansteenkiste, Lens, & Deci, 2006).
An important theory of motivation that addresses these issues of intrinsic and extrinsic
motivation is the Self-determination theory (SDT). This theory was initially developed by Deci
& Ryan (1985) and has been refined by scholars from different countries. In SDT, different
forms of behavioural regulation can be distinguished based on the degree to which they
represent autonomous (i.e. self-determined) functioning. Intrinsic motivation represents fully
autonomous functioning, while extrinsically motivated behaviour is less self-determined and more
controlled. However, extrinsic motivation can be further subdivided into different categories
according to the extent to which it has been internalized: the more internalized and integrated with one’s
self, the more it can serve as a basis for autonomous functioning. The categories, ranging from
the least to most completely internalized, include (Ryan & Deci, 2000):
1. External regulation: behaviours are enacted to obtain a reward or to avoid a punishment.
2. Introjected regulation: people do something because they would feel guilty if
they did not (e.g. studying for exams because parents insist).
3. Identified regulation: considering the value of the activity as personally important, they
accept the benefits of an activity (e.g. studying because one considers it as valuable).
4. Integrated regulation: identified regulations have been combined with other aspects of
one’s self.
Figure 1: The internalization continuum depicting the various types of
extrinsic motivation posited within self-determination theory (Niemiec & Ryan,
2009)
Turner, Chandler, & Heffer (2009) have shown that intrinsic motivation is positively associated
with academic performance. Engagement in activities serving the realisation of intrinsic rather
than extrinsic goals endorses a deeper processing of learning material and hence, a greater
conceptual understanding of it. Consequently, the following hypothesis will be examined:
Hypothesis 7: Intrinsic motivation is positively associated with performance on MC
examinations, which are corrected retrospectively for guessing, while the opposite effect
occurs for extrinsic motivation.
A possible explanation for the positive relationship between intrinsic motivation and academic
success may lie in the relatedness between motivation and learning
approaches. It is more likely that students who are highly intrinsically motivated to enrol in a
given course, will adopt a deep learning approach. Extrinsically motivated students, on the
contrary, do not wish to become actively involved in the subject matter and are only
concentrating on what is necessary for assessment. Consequently, the latter group of students
will rather employ a surface learning strategy as their intention is to pass a course without
investing a lot of effort (De Lange & Mavondo, 2004).
Table 1: literature review regarding the influence of students’ characteristics on performance

GENERAL LITERATURE IN HIGHER EDUCATION
Author | Year of publication | Country of study | Discipline | Measurement of performance | Variable(s)
Foos | 1992 | US | Psychology | Multiple choice (MC); Constructed-response (CR) | Students’ perceptions about course difficulty
Hong | 1999 | US | Statistics | General performance | Students’ perceptions about course difficulty
Rau & Durand | 2000 | US | Sociology | General performance | Study time
Wester & Henriksson | 2000 | Sweden | Mathematics | Multiple choice (MC); Constructed-response (CR) | Gender
Krieg & Uyar | 2001 | US | Economics & business statistics | Multiple choice (MC); Constructed-response (CR) | Gender, repeating a course, lesson attendance
Chan & Kennedy | 2002 | Canada | Economics | Multiple choice (MC); Constructed-response (CR) | Gender
Combs, Michael, & Fiore | 2002 | US | Psychology | Multiple choice (MC) | Students’ perceptions about course difficulty
Kirby & McElroy | 2003 | Ireland | Economics | Multiple choice (MC) | Lesson attendance
Stinebrickner & Stinebrickner | 2004 | US | Arts | General performance | Study time
Plant, Ericsson, Hill, & Asberg | 2005 | US | Psychology | General performance | Study time, ability
Leaver & van Walbeek | 2006 | South Africa | Economics | Multiple choice (MC) | Gender
Nonis & Hudson | 2006 | US | Business courses (e.g. accounting, finance, management) | General performance | Study time
Du Plessis & Du Plessis | 2007 | South Africa | Economics | Multiple choice (MC) + penalty for wrong answers; Constructed-response (CR) | Gender
Hartley, Betts, & Murray | 2007 | UK | Psychology | Multiple choice (MC); Constructed-response (CR); Projects/dissertations | Gender
Betts, Elder, Hartley, & Trueman | 2009 | UK | Psychology | Multiple choice (MC) + penalty for wrong answers | Gender
Turner, Chandler, & Heffer | 2009 | US | Psychology | General performance | Motivation
Diseth, Pallesen, Brunborg, & Larsen | 2010 | Norway | Psychology | General performance | Study time
Cortright, Lujan, Cox, & DiCarlo | 2011 | US | Physiology | Multiple choice (MC) | Lesson attendance
Marín & Rosa-García | 2011 | Spain | Political economy | Multiple choice (MC) + penalty for wrong answers | Gender

LITERATURE IN ACCOUNTING EDUCATION
Author | Year of publication | Country of study | Measurement of performance | Variable(s)
Byrne, Flood, & Willis | 2002 | Ireland | Constructed-response (CR); Group presentations | Learning approaches
Davidson | 2002 | Canada | Multiple choice (MC); Constructed-response (CR) | Learning approaches
Hall, Ramsay, & Raven | 2004 | Australia | General performance | Learning approaches
Nonis & Hudson | 2006 | US | General performance | Study time, motivation, ability
Bible, Simkin, & Kuechler | 2008 | US | Multiple choice (MC); Constructed-response (CR) | Gender
Arthur & Everaert | 2012 | Belgium | Multiple choice (MC); Constructed-response (CR) | Gender
Aden, Yahye, & Dahir | 2013 | Somalia | General performance | Lesson attendance
Everaert, Opdecam, & Maussen | 2017 | Belgium | Multiple choice (MC); Constructed-response (CR) | Learning approaches

*Articles regarding the influence of students’ characteristics on performance have been searched for the period 1990 through 2017 in both general education
literature and accounting education literature.
2 Research design & methodology
2.1 Research goal & questions
The debate in literature about scoring methods for MC examinations has mainly been single-
sided and mostly focussed on NR scoring versus NM. As both have shown inherent drawbacks,
there is a need to explore alternative scoring methods to reduce the gaps between theoretical
options and reality in order to support test developers (Lesage, Valcke, & Sabbe, 2013). There
is, however, a lack of available research that offers alternative scoring methods. Hence, this
master thesis can make an important contribution as the main focus will be on performance on
MC examinations, when an alternative scoring method is applied, being the ‘retrospective
correcting for guessing’. In particular, the relationship between performance on this type of
exam and gender will be examined, as prior research often identified gender differences in
performance on MC exams. Therefore, the main goal of this master dissertation is to investigate
whether a certain form of gender bias also occurs in case MC exams are corrected
retrospectively for guessing. Furthermore, it will also be examined if other students’
characteristics, as listed in 1.4, can be held responsible for differences in students’ performance
on MC examinations that are retrospectively corrected for guessing. This leads to the following
research questions:
Research questions:
1. Does gender have an influence on performance on multiple choice exams that are corrected
retrospectively for guessing?
2. Which other students’ characteristics (of those described in 1.4) have an influence on
performance on multiple choice exams that are corrected retrospectively for guessing?
2.2 Research techniques
The research techniques that are applied are a literature review, followed by surveys. In this
section, I will discuss how data have been gathered and explain why I chose a survey as research
technique. This research will mainly have a deductive character. The first part of the study
draws on scientific literature, which made it possible to formulate hypotheses. In a second step,
it will be examined whether these hypotheses can be confirmed or rejected by analysing the
results (van Thiel, 2010).
2.2.1 Surveys
After thoroughly exploring literature, I was able to formulate questions for the survey. The
survey can be found in appendix 1. Surveys are a quantitative research method where questions
are asked in a direct way to a sample of individuals. Surveys can be used to gather facts, but are
mainly used to gather information about the views and attitudes of people on a research topic.
Data gathered by surveys thus mainly include opinions, behaviour and characteristics. I did not
opt for qualitative interviews to test the hypotheses, as surveys can be executed on a larger scale
in a shorter time period: they make it possible to question a large number of respondents and more variables
can be included. Surveys also offer the possibility of quick response and good follow-up, and interviewer
bias is excluded (van Thiel, 2010).
I preferred a written questionnaire over a web survey, since my promotor dr. Evelien Opdecam
gave me the opportunity to let my respondents fill out the questionnaire during one of their
courses. In this way, I was able to exercise more control over the response rate, which is not
straightforward in case of a web survey (van Thiel, 2010). I think students were also more willing to
fill out the survey, since they were allowed to do it during the course and did not need to spend
their spare time on this.
Almost all questions included in the questionnaire were closed-ended questions, which means
that respondents have to choose their answer from a list of pre-selected answer possibilities.
Closed-ended questions make it easier to compare the results of different respondents and
subsequently to analyse them statistically. For a substantial part of the questions
Likert scales have been used, which measure the attitudes of respondents. These scales require
respondents to indicate on a scale (usually going from one to five/seven) to which degree they
agree or disagree with a particular statement (van Thiel, 2010). The majority of these statements
used in the survey is taken from previous research. The learning approach, for example, was
measured by the Revised Two Factor Study Process Questionnaire (R-SPQ-2F), which will be
discussed in more detail in a later paragraph (Biggs, Kember, & Leung, 2001).
2.2.1.1 Sample
The study was conducted on a sample of Belgian students following the course of corporation tax
at the University of Ghent. More specifically, the population consisted of 350 third-year bachelor
students enrolled in the Bachelor of Business Administration. These students have to follow a
course in corporation tax during the first semester. During the last lesson before the exam, the
students were asked to complete the questionnaire during class time. Although 350 students
were enrolled in this course, a relatively large part of them were absent during this last lesson.
Consequently, a total of 129 students completed the questionnaire. Among them, there
were 49 male students, 77 female students and 3 respondents who did not indicate their gender.
Since 329 students participated in the exam, the response rate is equal to 39.2%.
The questionnaire included some general questions (e.g. gender), followed by several questions
specifically related to the course of corporation tax as well as questions related to scoring
methods for multiple choice assessment. At the end of the questionnaire, students were asked
to write their student number. When they did not have their student card with them to write
down their student number, they were asked to leave their name on the questionnaire. This way
of identification was necessary to relate the answers of the questionnaire to their marks on the
exam of corporation tax. Further on, these names were converted to numbers in order to make
sure that the data were treated anonymously.
2.3 Measurement
2.3.1 Dependent variables: performance
A data-analysis is executed on the results of the exam of the course ‘corporation tax’. Hence,
the dependent variable used in this study is the performance or obtained mark on the final exam.
The exam consisted of 40 multiple choice questions. The MC questions can be subdivided into
three types of questions, which can be linked to Bloom’s taxonomy. A first category
involved 15 theoretical questions, which can be linked to the first level of Bloom’s taxonomy
(i.e. knowledge) and require the recall of learned information. The second level of Bloom’s
taxonomy, being comprehension, requires students to demonstrate their ability to translate
knowledge into a new context, for instance from words to numbers. Accordingly, the 19
calculations that had to be solved at the exam can be considered as belonging to this level.
Finally, six application questions measured students’ competence to use information in new
situations. This type of question can therefore be assigned to the third level of Bloom’s
taxonomy, being “application”.
The first dependent variable is the score on the exam when retrospective correcting for
guessing is used. This means that for each MC question, one point can be earned for a correct answer
and there is no deduction for incorrect answers. Omitted answers are also given zero points.
However, this method requires students to answer more than half of the questions correctly to
pass the course. More specifically, the higher passing grade in this exam was set at 25.79 out of 40. The
final grade of students who just obtained the higher passing grade corresponds to 20/40
when retrospective correcting for guessing is applied.
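As a minimal sketch of how such a rescaling could work, the Python snippet below assumes a single linear transformation that maps the higher passing grade (25.79) to 20/40 and a perfect raw score (40) to 40/40. The linear form and the function name `rescale_retrospective` are assumptions made for illustration; the exact formula applied by the faculty is not stated here. Under this assumption, the raw extremes of 13 and 39 reported in table 2 map to roughly 2.0 and 38.59, which is close to the rescaled extremes in that table.

```python
def rescale_retrospective(raw_score, cutoff=25.79, max_score=40.0):
    """Map a number-right score to a final grade out of `max_score`.

    Assumption (not taken from the thesis): one straight line through
    (cutoff, max_score / 2) and (max_score, max_score).
    """
    pass_mark = max_score / 2.0
    # Each raw point above or below the cut-off is worth `slope` final points.
    slope = (max_score - pass_mark) / (max_score - cutoff)
    return pass_mark + slope * (raw_score - cutoff)


# A student exactly at the higher passing grade receives the pass mark.
print(round(rescale_retrospective(25.79), 2))
# A raw score of 39 lands just below the maximum grade.
print(round(rescale_retrospective(39), 2))
```

Because the assumed slope is larger than one, differences between raw scores are stretched after rescaling, while the ranking of students stays the same.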
A second dependent variable is the score on 40 when no correction for guessing is
implemented. Similarly, right answers are rewarded with positive values and incorrect or
omitted answers are scored with a value of zero. However, contrary to retrospective correcting
for guessing, students only have to answer half of the questions correctly to pass the course and
obtain a final grade of 20/40. Hence, this variable measures performance in case NR scoring
was applied as the marking method.
During the first exam period, 329 students participated in the exam: 185 of them were
male students and 144 were female students. Although 129 students completed the
questionnaire, the survey answers could be linked to the exam score for only 112 of them.
On the one hand, eleven students could not be identified, as they did not leave their
student number or name on the survey. On
the other hand, six students who completed the questionnaire did not participate in the exam.
2.3.2 Independent variables
The independent variables are gender, prior experience, familiarity with retrospective
correcting for guessing, preference of scoring method, lesson attendance, study time,
perceptions about course difficulty, learning approaches and motivation.
The data for the gender variable were collected by the questionnaire. The first question asked
students to indicate their sex. Subsequently, this variable is coded as 0 for male students
and 1 for female students.
To get an idea of students’ prior experience with similar MC questions, they were
asked how many times they had already participated in the exam of
corporation tax. This question makes it possible to distinguish between students
repeating the course and students following the course for the first time. Secondly, students
were asked how familiar they feel with the retrospective correcting for guessing scoring method
that is nowadays applied at the University of Ghent. Statements had to be answered on a
five-point Likert scale, ranging from ‘strongly disagree’ to ‘strongly agree’. In appendix 2, an
overview of the items, the Cronbach’s alpha and the factor loadings are listed. The second item
has been deleted, as the factor loading for that item (“The fact that a larger number of questions
has to be answered correctly in case of retrospective correcting for guessing, scares me”) is
extremely low. Deleting this item resulted in a rather low alpha of 0.47 for this variable.
Thirdly, students were asked about their preference for scoring methods used in MC examinations. More
specifically, students had to indicate on a scale from one to ten which scoring method they
preferred, with one being absolutely the NM method and ten being absolutely the retrospective
correcting for guessing method.
Also the variable “lesson attendance” has been included in the survey. For both theory and
exercise classes, students were asked how much of the lessons they have attended. The possible
answers included: 0 – 19%, 20 – 39%, 40 – 59%, 60 – 79%, 80 – 100% or no lessons at all.
Besides class attendance, students were asked to report the average number of hours per week
they spent at home working on the corporation tax course. The possible answers ranged from
less than one hour per week to more than six hours a week.
Furthermore, students were asked about their perceptions of the subject difficulty. On a scale
of 1 – 10, with 1 being easy and 10 being difficult, students had to indicate how difficult they
perceived the subject matter.
The variable “learning approaches” can be measured by different instruments. One of these
is the study process questionnaire (SPQ) developed by Biggs. In this study the revised version
of SPQ, i.e. the Revised Two Factor Study Process Questionnaire (R-SPQ-2F), was used as this
entails a questionnaire, which can be used by faculties to measure the learning approaches of
students. This questionnaire consists of 20 questions, using a five-point Likert scale. Half of the
questions measure the deep approach, while the other half measure the surface approach
(Biggs, Kember, & Leung, 2001). As the course of corporation tax is solely followed by Dutch-
speaking students, the questions have been translated into Dutch. In appendix 3, an overview
of the items, the Cronbach alphas and the factor loadings for the two constructs, deep approach
and surface approach, are listed. A limiting value of 0.30 is used as a point of reference for the
factor loadings. The Cronbach’s alpha of the deep approach amounts to 0.65. Concerning the
surface approach, the factor loading of item 1 (“my aim is to pass the course while doing as
little work as possible”) is lower than the limiting value. This item has been deleted, resulting
in an alpha of 0.63 for the surface approach.
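For readers unfamiliar with the reliability coefficient reported above, the following Python sketch shows the standard formula for Cronbach's alpha, alpha = k / (k - 1) * (1 - sum of item variances / variance of the total score). The function name `cronbach_alpha` and the example answers are hypothetical and serve only as an illustration; the values reported in this study were obtained from the survey data, not from this code.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for one scale.

    `item_scores` is a list of rows, one per respondent; each row holds
    that respondent's answers to the k items of the scale.
    """
    k = len(item_scores[0])
    # Sample variance of each item across respondents.
    item_vars = [variance([row[i] for row in item_scores]) for i in range(k)]
    # Sample variance of the respondents' summed scores.
    total_var = variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical answers of five respondents to three Likert items.
example = [
    [4, 5, 4],
    [2, 3, 2],
    [3, 3, 4],
    [5, 4, 5],
    [1, 2, 2],
]
print(round(cronbach_alpha(example), 2))
```

Deleting an item, as was done for the surface approach, amounts to dropping one column from every row before recomputing the coefficient.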
Motivation was measured by means of a self-regulation questionnaire (SRQ) that evaluates
domain-specific individual differences in types of motivation or regulation. Respondents have
been asked why they behave in a certain way. For each behaviour, a predefined list of reasons
was given, which represent the different types of regulation (Self-determination theory, 2017).
Again, each statement had to be answered on a five-point Likert scale, ranging from ‘strongly
disagree’ to ‘strongly agree’. Motivational scores have been computed by means of the
“Relative Autonomy Index” (RAI), where regulations are weighted according to their place on
the autonomy continuum. Hence, RAI is a composite score of relative autonomy. This index
subtracts controlled forms of motivation from autonomous forms. The most common formula
is (Chemolli & Gagné, 2014):
RAI = 2 × intrinsic + identified – introjection – 2 × external
The value for the RAI could range between -12 and +12. A higher positive score for the RAI
means that the student is more autonomously motivated, whereas lower negative scores indicate
less autonomous motivation (Self-determination theory, 2017).
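The formula and its range can be checked with a short Python sketch. It assumes that each of the four inputs is a sub-scale mean on the five-point Likert scale (1 to 5); the function name `relative_autonomy_index` is hypothetical.

```python
def relative_autonomy_index(intrinsic, identified, introjected, external):
    """Composite RAI score: autonomous forms of regulation are weighted
    positively, controlled forms negatively."""
    return 2 * intrinsic + identified - introjected - 2 * external


# Most autonomous profile possible on a 1-5 scale.
highest = relative_autonomy_index(5, 5, 1, 1)
# Most controlled profile possible on a 1-5 scale.
lowest = relative_autonomy_index(1, 1, 5, 5)
print(highest, lowest)
```

With sub-scale means bounded between 1 and 5, the two extreme profiles give +12 and -12, which matches the range stated above.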
However, after performing a factor analysis to verify the scale construction, it has been decided
to not use the results of this index for the analyses due to very weak factor loadings. The results
of testing the sub-scales of the RAI can be found in appendix 4. Each subset of the
scale represents a different dimension of relative autonomy. The statements within each dimension
of autonomy were expected to load strongly on one and the same component. However, it can
be observed that the items are not unidimensional for each of the four sub-scales. Some items
loaded strongly on multiple components; these are called “cross-loadings”. Moreover, some
items did not load strongly on any single component. Further, the Cronbach’s alphas, used
to determine the sub-scales’ reliability, are very low for identified and introjected
regulation. These unsatisfactory results of the factor and reliability analyses might be explained
by the fact that this is the first year in which this self-regulation questionnaire has been used here. The
questionnaire has not yet been fine-tuned and further improvements will probably be necessary.
Another possible explanation may be that the questions about motivation formed the last
part of the survey. The respondents were probably less attentive when answering these last
questions than at the beginning of the survey (van Thiel, 2010). Moreover, the items that
had to be answered regarding motivation were numerous: twenty statements
were included to examine students’ motivation. Hence, it is recommended that future research
take these pitfalls into account.
2.3.3 Control variable
Finally, ability has been added as a control variable, as prior literature found that ability and
academic performance are strongly positively correlated (e.g. Everaert, Opdecam, & Maussen,
2017). In this study, ability reflects a total score out of one thousand. More precisely, this score is
the weighted average of the students’ exam results during their second bachelor year. The study
volumes of each course have been used as weights.
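A weighted average of this kind can be sketched as follows. The snippet assumes that each course result is a mark out of 20 and that the study volume is expressed in credits; both conventions, the function name `ability_score` and the example results are assumptions for illustration and are not taken from the data of this study.

```python
def ability_score(results, max_mark=20, scale=1000):
    """Credit-weighted average of course marks, rescaled to `scale`.

    `results` is a list of (mark, credits) pairs, one per course.
    """
    total_credits = sum(credits for _, credits in results)
    weighted_sum = sum(mark * credits for mark, credits in results)
    weighted_mean = weighted_sum / total_credits
    return weighted_mean / max_mark * scale


# Hypothetical second-year results: (mark out of 20, credits).
courses = [(14, 6), (10, 3), (12, 6)]
print(round(ability_score(courses), 1))
```

Courses with a larger study volume therefore count more heavily towards the ability score than small courses.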
2.4 Analysing the results
2.4.1 Independent samples T-test
To compare male and female students with each other, an “independent samples t-test” is
conducted with the statistical computer program SPSS Statistics 24. The independent
samples t-test is used to compare the means of two independent groups in order to determine
whether the associated population means are significantly different. Gender differences are
examined for all the variables. The sample is divided into two groups by means of the categorical
variable ‘gender’ (0 = male, 1 = female).
To compare two groups by means of this t-test, the data of these groups have to comply with
certain conditions. First of all, the sample must be composed randomly. Additionally, the data in both
groups are required to follow a normal distribution. This requirement is considered to be met, since,
by the central limit theorem, the sampling distribution of the mean is approximately normal when
there are more than thirty observations in each group.
Furthermore, both samples must be independent of each other; there may be no relationship
between the subjects in each sample. This condition is met as subjects can only belong to one
group and the scores of male students cannot be influenced by the scores of female students and
vice versa. Finally, the variances should be approximately equal across both groups. To test this
assumption of variance homogeneity, Levene’s Test for Equality of Variances will be used
(Morgan, Leech, Gloeckner, & Barrett, 2004).
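The analyses in this study were run in SPSS. Purely as an illustration of what the test computes, the Python sketch below derives the t statistic and degrees of freedom for the equal-variance (pooled) case and for the Welch variant that does not assume equal variances. The function names and the example scores are hypothetical; p-values are left out because they require the t distribution, which is not part of the Python standard library, and Levene's test itself is not implemented here.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(group_a, group_b):
    """Student's t statistic and degrees of freedom, assuming equal variances."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    se = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(group_a) - mean(group_b)) / se, df


def welch_t(group_a, group_b):
    """Welch's t statistic and degrees of freedom, without the equal-variance assumption."""
    n_a, n_b = len(group_a), len(group_b)
    v_a, v_b = variance(group_a) / n_a, variance(group_b) / n_b
    t = (mean(group_a) - mean(group_b)) / sqrt(v_a + v_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (v_a + v_b) ** 2 / (v_a ** 2 / (n_a - 1) + v_b ** 2 / (n_b - 1))
    return t, df


# Hypothetical exam scores for two small groups.
men = [30, 32, 28, 35, 31]
women = [29, 33, 27, 30, 31]
print(pooled_t(men, women))
print(welch_t(men, women))
```

When both groups have the same size, the two t statistics coincide and only the degrees of freedom differ; with unequal group sizes and unequal variances, the two versions can lead to different conclusions.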
2.4.2 Regression analyses
Furthermore, in order to test the hypotheses, ordinary least squares regressions are performed
to study the relationship between the independent variables and the performance on the final
exam. Regressions examine the influence of an independent variable X on the dependent
variable Y. After performing “single” regressions for each hypothesis, a robustness check will
be done by including all independent variables in one regression model. When multiple variables
are taken into account, attention has to be paid to the possible occurrence of multicollinearity.
Multicollinearity exists when two or more independent variables, also called predictors, are
highly correlated. Multicollinearity can be tested by calculating the variance inflation factor
(VIF). As a rule of thumb, a VIF lower than ten indicates that there are no serious problems related to
multicollinearity. In case the value is higher than ten, multicollinearity may result in unstable
coefficient estimates, which are difficult to interpret. Nevertheless, multicollinearity can be
dealt with by eliminating or merging predictors that are highly correlated (Verlet, 2015).
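The VIF of predictor j is defined as 1 / (1 - R_j²), where R_j² is the explained variance when predictor j is regressed on the other predictors. As a simplified sketch, the Python snippet below covers only the special case of a model with exactly two predictors, where R_j² equals the squared correlation between them. The function name `vif_two_predictors` is hypothetical, and the value r = 0.711 is the correlation between exercise and theory class attendance reported in the correlation table of this study; in the full regression model with all predictors the VIF values would differ.

```python
def vif_two_predictors(r):
    """VIF in a model with exactly two predictors correlated at `r`.

    In this special case R_j^2 equals r^2 for both predictors.
    """
    return 1.0 / (1.0 - r ** 2)


# Correlation between exercise and theory class attendance.
vif = vif_two_predictors(0.711)
print(round(vif, 2))
```

Even the strongest correlation between two predictors in this study therefore stays far below the threshold of ten in this two-predictor setting.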
3 Research findings
3.1 Descriptive statistics
First, the descriptive statistics of the data will be discussed. In table 2 below, the average scores
on the dependent variable, i.e. performance on the exam with and without retrospective
correcting for guessing, can be compared. The table shows that the mean score is 30.71 when
no correcting for guessing is implemented and hence, NR scoring is the marking method being
used. This mean score decreases to 26.93 when scores are corrected retrospectively for
guessing.
Furthermore, as the exam consisted of three categories of multiple choice questions, the overall
score on the exam can be further refined. Although there were different numbers of theoretical
questions, calculations and application questions, the score on each type of question has been
rescaled to a mark on 10. In this way, the scores on each type of question can be compared more
easily. Students performed best with regard to the theoretical questions; on average 82.4% of
the questions were answered correctly. For the application questions, students answered on
average 80.4% of the questions correctly. Students performed worst on the calculations; they
solved on average 71.2% of the calculations correctly.
Table 2: Descriptives performance on the exam
Variable                                                             N    Minimum  Maximum  Mean   Standard deviation
Performance with NR scoring (mark on 40)                             112  13.00    39.00    30.71  5.67
Performance with retrospective correcting for guessing (mark on 40)  112  1.99     38.59    26.93  7.98
Performance theoretical questions (mark on 10)                       112  3.33     10.00    8.24   1.52
Performance calculations (mark on 10)                                112  2.63     10.00    7.12   1.75
Performance application questions (mark on 10)                       112  0.00     10.00    8.04   1.85
The histogram in figure 2 shows the underlying frequency distribution in case retrospective
correction for guessing is introduced. The figure shows a left-skewed (negatively skewed) distribution of
performance on the exam. Whereas only three students of the 112 respondents failed to pass
the exam when no correction for guessing was introduced, this number increased to 21
respondents when applying the retrospective correction for guessing.
Figure 2: Histogram performance with retrospective correcting for guessing (mark on 40)
As shown in the frequency table below, there were 40 male students (35.7%) and 72 female
students (64.3%) among the 112 respondents.
Table 3: Frequencies gender
Frequency Valid % Cumulative %
Valid Male 40 35.7 35.7
Female 72 64.3 100.0
Total 112 100.0
The descriptive statistics of the independent variables familiarity with retrospective correcting
for guessing, preference of scoring method, perceptions about course difficulty, learning
approaches and the control variable ability are summarized in table 4.
First, the degree to which students feel familiar with the retrospective correcting for guessing
scoring method in MC examination has been questioned. On average, students feel very familiar
with this type of examination (mean = 4.59 on a five-point Likert scale).
Furthermore, table 4 shows that, on average, students prefer the retrospective correcting for
guessing method, as nowadays applied at the University of Ghent, over the NM scoring
method (mean = 8.89). This preference has been measured on a scale from one to ten, with one
being an absolute preference for NM, and 10 being an absolute preference for retrospective
correction for guessing. Regarding students’ perceptions about course difficulty, one can
conclude that, on average, students perceive corporation tax as a rather difficult course (mean
= 7.89). Looking at the deep approach, we see that the mean is located between a low and
neutral deep approach (mean = 2.62 on a five-point Likert scale), tending more towards neutral.
The mean of the surface approach is situated between a low and neutral surface approach (mean
= 2.68), tending also more towards neutral. Finally, the mean of the control variable ability is
579.40 with a standard deviation of 93.52.
Table 4: Descriptives familiarity, preference, perceptions course difficulty, learning approaches & ability
Variable                                               N    Minimum  Maximum  Mean    Standard deviation
Familiarity with retrospective correcting for guessing 129  1.00     5.00     4.59    0.62
Preference scoring method (mark on 10)                 129  3.00     10.00    8.89    1.66
Perceptions course difficulty (mark on 10)             129  3.00     10.00    7.89    1.18
Deep approach                                          126  1.50     4.10     2.62    0.47
Surface approach                                       126  1.44     4.44     2.68    0.53
Ability (mark on 1000)                                 110  318.00   790.00   579.40  93.52
For the other independent variables, the frequency tables are included. The variable of prior
experience has been measured by asking how many times students have participated in the
exam. Table 5 shows that the majority of the respondents participated for the first time during
last exam period in January. More precisely, 96.1% of the 129 students did not participate
earlier. Consequently, only 3.9% of the respondents were repeating the course. As only such a
small fraction of the respondents are retaking the course, it may be clear that this variable will
not have a significant explanatory power for differences in performance. Hence, this variable
will be eliminated in further analyses.
Table 5: Frequencies times participated in the exam
Frequency % Cum. %
Valid For the 1st time this exam period 124 96.1 96.1
Already 1 time in the past 3 2.3 98.4
Already 2 times in the past 2 1.6 100.0
Total 129 100.0
For lesson attendance, we can conclude that, on average, students attended most of the lessons.
This holds for both the exercise and the theory classes. The score on this question could range
from zero (meaning attended no classes at all) to five (meaning attended between 80% and
100% of the classes). On average, students attended more exercise classes than theory classes.
As shown in the frequency tables 6 and 7, 103 respondents attended between 80% and 100%
of the exercise classes, whereas this number decreases to 85 respondents for the theory classes.
Table 6: Frequencies lesson attendance (exercises)
Frequency % Cum. %
Valid Never (0) 1 0.8 0.8
0 – 19% (1) 6 4.7 5.4
20 – 39% (2) 0 0.0 5.4
40 – 59% (3) 12 9.3 14.7
60 – 79% (4) 7 5.4 20.2
80 – 100% (5) 103 79.8 100.0
Total 129 100.0
Table 7: Frequencies lesson attendance (theory)
Frequency % Cum. %
Valid Never (0) 1 0.8 0.8
0 – 19% (1) 6 4.7 5.4
20 – 39% (2) 8 6.2 11.6
40 – 59% (3) 5 3.9 15.5
60 – 79% (4) 24 18.6 34.1
80 – 100% (5) 85 65.9 100.0
Total 129 100.0
Besides class attendance, students were asked how many hours per week they spent on average
studying the material outside of class. As shown in the frequency table 8, the answers could
range from less than one hour a week to more than six hours a week. More than half of the
respondents reported working less than one hour a week at home for this course. Only 15.5% of
the respondents indicated spending more than two hours a week at home working on this course.
We can therefore conclude that few students spent a large number of
hours working on the course at home, although they perceived corporation tax as a quite
difficult course.
Table 8: Frequencies weekly reported study time (excl. lessons)
Frequency % Cum. %
Valid < 1 hour (1) 66 51.2 51.2
Between 1 and 2 hours (2) 43 33.3 84.5
Between 2 and 3 hours (3) 9 7.0 91.5
Between 3 and 4 hours (4) 7 5.4 96.9
Between 4 and 5 hours (5) 3 2.3 99.2
Between 5 and 6 hours (6) 1 0.8 100.0
> 6 hours (7) 0 0.0 100.0
Total 129 100.0
Concerning the learning approaches, a further division into four groups of students can be
made: a group of students with low scores for deep approach and high scores for surface
approach (1), a group of students employing low levels for both approaches (2), a group of
students with high scores for deep approach and low scores for surface approach (3) and a fourth
group of students employing high levels for both learning approaches (4). To assign students to
a particular group, the mean of each learning approach was used as a threshold. For instance,
students with a score below the mean of the deep approach but above the mean
of the surface approach are assigned to the first quadrant.
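The mean-split described above can be sketched in a few lines of Python. The function name `assign_quadrants` and the example scores are hypothetical. The treatment of a score that is exactly equal to the mean is not specified in this study; the sketch assumes that such a score counts as "low".

```python
from statistics import mean


def assign_quadrants(deep_scores, surface_scores):
    """Assign each student to quadrant 1-4 using the sample means as thresholds."""
    deep_mean = mean(deep_scores)
    surface_mean = mean(surface_scores)
    quadrants = []
    for deep, surface in zip(deep_scores, surface_scores):
        # Assumption: a score equal to the mean is treated as "low".
        high_deep = deep > deep_mean
        high_surface = surface > surface_mean
        if not high_deep and high_surface:
            quadrants.append(1)  # low deep, high surface
        elif not high_deep and not high_surface:
            quadrants.append(2)  # low on both approaches
        elif high_deep and not high_surface:
            quadrants.append(3)  # high deep, low surface
        else:
            quadrants.append(4)  # high on both approaches
    return quadrants


# Hypothetical scale scores of four students.
deep = [2.0, 2.2, 3.4, 3.0]
surface = [3.2, 2.1, 2.0, 3.1]
print(assign_quadrants(deep, surface))
```

Because the thresholds are sample means, the group to which a student is assigned depends on the other respondents in the sample, not on a fixed cut-off on the Likert scale.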
The distribution of the students across the four groups is shown in table 9 and also visualized
in the mean plot in figure 3. The largest group of students employed low levels for both learning
approaches (n = 36). Although this is a surprising group, prior research has also identified a
profile that consisted of low scores on both learning approaches. For instance, a recent study of
Everaert, Opdecam, & Maussen (2017) also found a large cohort of students scoring low on
both learning approaches. They called these students “rote learners”. Rote learners typically
resort to a repetitive strategy by revising material until it is remembered, but they do not really
understand the material, and hence fail to use it. Low scores for the deep approach and high
scores for the surface approach were found for 32 students. For an approximately equally large
group of students, the opposite trend was found, being high scores on the deep approach and
low scores on the surface approach (n = 31). Finally, the smallest group of students employed
high levels for both learning approaches (n = 25). The fact that this cohort contains the fewest
students may be explained by the fact that the learning approaches are in theory mutually
exclusive; students will not maintain both approaches simultaneously (Biggs, 1987).
Nevertheless, more than 20% of the respondents belonged to this group.
Table 9: Frequencies quadrants of learning approaches
Frequency % Cum. %
Valid Low deep approach; high surface approach (1) 32 25.8 25.8
Low deep approach; low surface approach (2) 36 29.0 54.8
High deep approach; low surface approach (3) 31 25.0 79.8
High deep approach; high surface approach (4) 25 20.2 100.0
Total 124 100.0
Figure 3: Plot of the learning approaches (Mean-split)
3.2 Correlations
Table 10: Correlation table
Column numbers (1) to (12) refer to the numbered variables in the rows.
Variable                                             (1)       (2)       (3)       (4)       (5)       (6)       (7)       (8)       (9)       (10)      (11)      (12)
 1. Performance retrospect. correcting (mark on 40)  1
 2. Performance NR scoring (mark on 40)              1.000***  1
 3. Gender                                           -0.061    -0.061    1
 4. Familiarity with retrospect. correcting          0.016     0.016     -0.086    1
 5. Preference scoring method                        0.057     0.057     0.280***  0.032     1
 6. Lesson attendance (exercises)                    0.105     0.105     0.097     0.024     0.055     1
 7. Lesson attendance (theory)                       0.138     0.138     -0.104    0.086     0.046     0.711***  1
 8. Time weekly spent                                0.239**   0.239**   0.148     -0.105    0.152*    0.091     0.151*    1
 9. Perceptions course difficulty                    0.122     0.122     0.216**   -0.108    0.086     0.181**   0.110     0.144     1
10. Deep approach                                    0.034     0.034     -0.024    -0.093    0.164*    -0.157    -0.085    0.254***  -0.103    1
11. Surface approach                                 -0.139    -0.139    0.034     0.002     -0.023    0.129     0.139     -0.052    -0.040    -0.209**  1
12. Ability (mark on 1000)                           0.659***  0.659***  0.115     0.135     0.180     0.209**   0.273***  0.171*    0.154     -0.032    -0.291*** 1
Correlations performance on different types of questions
Theoretical questions (mark on 10)                   0.886***  0.886***  -0.031    0.029     0.036     0.125     0.140     0.219**   0.127     -0.036    -0.180*   0.584***
Calculations (mark on 10)                            0.922***  0.922***  -0.144    -0.020    0.028     0.120     0.143     0.234**   0.080     0.059     -0.121    0.631***
Application questions (mark on 10)                   0.525***  0.525***  0.183*    0.082     0.133     -0.082    -0.012    0.070     0.123     0.071     0.020     0.279***
*** indicates correlation is significant at the 0.01 level, ** indicates correlation is significant at the 0.05 level, * indicates correlation is significant at the 0.10 level.
Table 10 shows the correlations between all the different variables. On the main diagonal, every
variable is correlated with itself, which leads to perfect correlations (r = 1). The dependent
variable performance on the exam is, both in case of retrospective correcting for guessing and
NR scoring, significantly positively correlated with the amount of time students weekly spent
at home working on the course of corporation tax (r = 0.239, p < 0.05) and with ability (r =
0.659, p < 0.01). On the one hand, this means that students who spent more time studying for
the course, achieved a higher grade on the exam. On the other hand, students who obtained a
high total score during their second bachelor, also obtained a higher score on the exam of
corporation tax in comparison to those with low ability levels.
The independent variable gender shows a significant positive correlation with the variable
measuring the preference of scoring method (r = 0.280, p < 0.01) and with the perceptions of
students about course difficulty (r = 0.216, p < 0.05). This means that women have a higher
preference for the retrospective correcting for guessing scoring method, and that they also
perceive the course of corporation tax as more difficult.
The variable measuring the preference of scoring method shows a positive correlation with the
weekly invested study time (r = 0.152) and the deep learning approach (r = 0.164). These
correlations are, however, only significant at the 0.10 level.
The variables attendance of exercise classes and attendance of theory classes are strongly
positively correlated with each other (r = 0.711, p < 0.01). Furthermore, exercise class
attendance has a significant positive correlation with the perceptions about course difficulty,
indicating that students who attended more exercise classes perceived the course to be more
difficult (r = 0.181, p < 0.05). The attendance of theory classes shows a positive correlation
with the weekly invested study time, though only significant at the 0.10 level (r = 0.151).
The learning approaches are significantly negatively correlated with each other (r = -0.209, p
< 0.05). This negative correlation between the deep and surface learning approach makes sense,
since the learning approaches are, in theory, mutually exclusive (Biggs, 1987). A high score for
one learning approach normally results in a weak score on the other approach. Furthermore, the
deep approach has a positive correlation with the amount of weekly invested study time (r =
0.254, p < 0.01). This means that students with a deep approach spent more time on studying
the course of corporation tax at home.
Besides a strong, positive correlation with performance on the exam, the control variable ability
is significantly positively correlated with exercises lesson attendance (r = 0.209, p < 0.05), with
theory lesson attendance (r = 0.273, p < 0.01), with weekly invested study time (r = 0.171, p <
0.10) and significantly negatively correlated with the surface approach (r = -0.291, p < 0.01).
The dependent variable of performance can be further divided into performance on three types of
questions: theoretical questions, calculations and application questions. As shown in the lower part of
table 10, performance on each type of question is significantly positively correlated with general
performance on the exam, both with and without retrospective correcting for guessing. This
means that a high mark on each type of question is associated with a high grade on the final
exam. Moreover, there is a positive correlation between gender and performance on application
questions, indicating that female students perform better than men on this category of questions
(r = 0.183). This correlation is only significant at the 0.10 level, though it should be noted that
significance at the 0.05 level was narrowly missed (p = 0.053). Furthermore, the amount of
time students weekly spent working on the corporation tax at home is significantly positively
correlated with performance on theoretical questions (r = 0.219, p < 0.05) and performance on
the calculations (r = 0.234, p < 0.05). There is also a negative correlation between performance
on theoretical questions and the surface approach, significant at the 0.10 level (r = -0.180, p =
0.059). Finally, performance on each type of question is significantly positively correlated with
ability.
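The significance of a correlation coefficient follows from the standard test statistic t = r × √(n − 2) / √(1 − r²), with n − 2 degrees of freedom. The Python sketch below applies it to the correlation between gender and performance on application questions. The function name `t_from_r` is hypothetical, and n = 112 is an assumption based on the number of respondents whose survey could be linked to the exam.

```python
from math import sqrt


def t_from_r(r, n):
    """t statistic for testing whether a correlation `r` based on `n` pairs differs from zero."""
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)


# Gender versus performance on application questions (assumed n = 112).
t = t_from_r(0.183, 112)
print(round(t, 2))
```

The result is close in magnitude to the t-value of -1.96 reported for this variable in table 11. For a dichotomous variable such as gender, the correlation test and the equal-variance independent samples t-test are equivalent; the small difference stems from rounding of r, and the sign depends on which group mean is subtracted from the other.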
3.3 Gender differences
By means of the independent samples t-test, gender differences are examined for all variables.
In table 11, an overview can be found of the mean scores of men and women on the different
variables, the mean differences between the sexes, the obtained t-test score and the
corresponding level of significance. Levene’s test was applied to check whether the
variances were approximately equal across both groups. The p-value of this test was larger than
0.05 for almost all variables, meaning that equal variances for these variables could be assumed.
Only for the variables measuring the preference of scoring method and the weekly invested
study time did Levene’s test show a p-value below 0.05. A significant score for this test
implies that equal variances cannot be assumed. This can be solved by using the “Welch
modified t-test” or by choosing the Mann-Whitney U-test; the latter is the method preferred here. The
Mann-Whitney U-test is a valuable alternative when the condition of equal variances is not met.
The results of the Mann-Whitney U-test for these two variables can also be found in the
table below, panel B.
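To illustrate what the Mann-Whitney U-test is based on, the Python sketch below computes the U statistics from the ranks of the pooled observations, with tied values receiving their average rank. The function name `mann_whitney_u` and the example scores are hypothetical; the p-values in this study were produced by SPSS, and the sketch does not compute them.

```python
def mann_whitney_u(group_a, group_b):
    """Return (U_a, U_b) for two lists of observations, using average ranks for ties."""
    combined = sorted(group_a + group_b)
    rank_of = {}
    i = 0
    while i < len(combined):
        j = i
        # Find the end of the run of values tied with combined[i].
        while j < len(combined) and combined[j] == combined[i]:
            j += 1
        # Positions i..j-1 hold ranks i+1..j; tied values share the mean rank.
        rank_of[combined[i]] = (i + 1 + j) / 2
        i = j
    n_a, n_b = len(group_a), len(group_b)
    rank_sum_a = sum(rank_of[x] for x in group_a)
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    u_b = n_a * n_b - u_a
    return u_a, u_b


# Hypothetical preference scores (scale from one to ten).
men = [8, 7, 9, 6]
women = [9, 10, 10, 8]
print(mann_whitney_u(men, women))
```

The test only uses the ordering of the observations, not their exact values, which is why it does not require equal variances or normally distributed data.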
Table 11: Gender differences (Independent samples T-test & Mann-Whitney U-test)
Panel A: T-test
Variable                                                             Mean men  Mean women  Mean difference  t      p-value
Performance with NR scoring (mark on 40)                             31.18     30.46       0.72             0.64   0.524
Performance with retrospective correcting for guessing (mark on 40)  27.58     26.57       1.01             0.64   0.524
Performance theoretical questions (mark on 10)                       8.30      8.20        0.10             0.32   0.749
Performance calculations (mark on 10)                                7.46      6.94        0.52             1.52   0.130
Performance application questions (mark on 10)                       7.58      8.29        -0.70            -1.96  0.053
Familiarity with retrospective correcting                            4.69      4.59        0.10             0.91   0.366
Lesson attendance (exercises)                                        4.43      4.64        -0.21            -1.02  0.310
Lesson attendance (theory)                                           4.53      4.28        0.25             1.10   0.273
Perceptions course difficulty (mark on 10)                           7.60      8.13        -0.53            -2.32  0.022
Deep approach                                                        2.62      2.60        0.02             0.25   0.805
Surface approach                                                     2.62      2.66        -0.04            -0.35  0.726
Ability (mark on 1000)                                               569.78    591.72      -21.94           -1.18  0.242
Panel B: Mann-Whitney U-test
Variable                                                             Mean men  Mean women  Mean difference  p-value
Preference scoring method (mark on 10)                               8.28      9.22        -0.95            0.004
Time weekly spent (excl. lessons)                                    1.60      1.93        -0.33            0.411
From this table, it can be concluded that, although male students on average performed better
on the exam, the performance of men and women did not differ significantly. A logical
consequence of implementing retrospective correcting for guessing is that the mean difference
in performance becomes even larger in favour of male students (1.01 versus 0.72 marks on 40).
This is due to the fact that a higher cut-off score has to be reached to pass the exam in
comparison to NR scoring, where no correction for guessing is applied. Consequently, it seems
that the first part of hypothesis 1, stating that female students outperform male students when
MC exams are corrected retrospectively for guessing, cannot be confirmed. When looking at
ability, an opposite trend is observed. On average, female students obtained a higher total score
during their second bachelor year than their male counterparts. Nevertheless, this difference
between men and women is again not significant.
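The identical t and p-values reported for the two exam scores in table 11 are consistent with the retrospectively corrected mark being a positive linear rescaling of the NR mark. The following minimal sketch illustrates this property: the mean difference is stretched by the slope, while the t statistic is unchanged. The marks, the slope and the intercept are hypothetical and chosen only so that a mark of 40 stays 40; they are not the parameters used in the study.

```python
from math import sqrt
from statistics import mean, variance


def t_statistic(a, b):
    """Independent-samples t statistic with pooled variance."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))


def rescale(score, slope=1.4, intercept=-16.0):
    """Hypothetical linear rescaling of an NR mark (assumed, not taken from the study)."""
    return slope * score + intercept


# Invented NR marks on 40, for illustration only.
nr_men = [33, 28, 35, 30, 31]
nr_women = [29, 32, 27, 34, 30]

corrected_men = [rescale(s) for s in nr_men]
corrected_women = [rescale(s) for s in nr_women]

t_nr = t_statistic(nr_men, nr_women)
t_corrected = t_statistic(corrected_men, corrected_women)
diff_nr = mean(nr_men) - mean(nr_women)
diff_corrected = mean(corrected_men) - mean(corrected_women)
```

Because both the mean difference and the standard error are multiplied by the same slope, their ratio, and hence the significance level, does not change.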
Also when looking at the different types of MC questions, no significant gender differences in
performance were found. However, regarding application questions, it has to be mentioned that
significance at the 0.05 level is narrowly missed (p = 0.053). Female students in this sample
performed better than male students on application questions. Applications are, according to
Bloom’s taxonomy, more complex to solve than theoretical questions and calculations. Hence,
contrary to the second part of hypothesis 1, there appears to be a small gender effect in favour
of women when more complex questions are involved. The differences in mean scores between
both sexes on the three types of questions are visualised in figure 4 below. From this figure it
can also be seen that women perform considerably better on application questions.
Figure 4: Mean scores on the different types of MC questions (mark on 10)
Concerning the other independent variables, table 11 shows significant differences between
male students and female students related to the preference of scoring method and students’
perceptions about course difficulty. With regard to the preference of scoring method, female
students have a significantly higher preference for this non-conventional scoring method
compared to male students (mean of 8.28 for male versus 9.22 for female, p = 0.004).
Furthermore, female students perceive the course of corporation tax as significantly more
difficult than their male counterparts (mean of 7.60 for male versus 8.13 for female, p = 0.022).
Hence, it can be concluded that table 11 above has shown results quite similar to those detected
in the correlation table. Significant gender differences appear with regard to preference of
scoring method and perceptions about course difficulty, while the gender difference in
performance on application questions narrowly misses significance at the 0.05 level.
3.4 Hypotheses testing
In this section, the hypotheses are tested by examining the influence of the independent
variables on performance on the exam, which was corrected retrospectively for guessing.
Additionally, the results are shown for the case in which no “standard setting” was applied. It
will become clear that these results are the same as when retrospective correcting for guessing
is used. Furthermore, the possible relationships between each independent variable and
performance on the distinct types of questions are tested as well, although no hypotheses were
formulated regarding performance on these categories of questions, except for gender.
3.4.1 Hypothesis 1
The first hypothesis claimed that female students perform better than male students on MC
examinations that are corrected retrospectively for guessing. This was, however, not detected in
the t-test table. Nor was a gender effect found for performance on the distinct types of
questions, with the exception of a small gender effect for performance on application questions.
Therefore, additional ANCOVAs with performance as dependent variable and gender as
independent variable are performed. ANCOVAs are used for comparing groups on a dependent
variable when it is expected that another variable (i.e. “the covariate”) also affects the
dependent variable in addition to the independent variable (De Moor & Van Maele, 2008).
Since performance is highly correlated with ability, ability is added as the covariate. The results
can be found in table 12. While controlling for ability, a significant impact of gender is found on
performance on calculations (p = 0.006) and a marginally significant impact on performance on
application questions (p = 0.056). Similar to the results of the t-test, the results of the ANCOVAs
reveal that female students performed better on applications in comparison to male students. As
in the t-tests, significance at the 0.05 level was narrowly missed, but attained at the 0.10 level.
Regarding calculations, men performed significantly better than women. Thus, although men in
this sample on average have a lower ability level, they do better on calculations compared to
women. For general exam performance, both with and without retrospective correcting for
guessing, no significant impact of gender has been found. The same conclusion can be drawn
for performance on the simplest questions, being the theoretical questions.
Table 12: ANCOVA for gender differences in performance (control variable: ability)

| Variable | Estimated marginal mean men | Estimated marginal mean women | F | p-value |
|---|---|---|---|---|
| Performance with NR scoring (mark on 40) | 31.68 | 30.36 | 2.24 | 0.137 |
| Performance with retrospective correcting for guessing (mark on 40) | 28.29 | 26.43 | 2.24 | 0.137 |
| Performance theoretical questions (mark on 10) | 8.37 | 8.20 | 0.49 | 0.485 |
| Performance calculations (mark on 10) | 7.66 | 6.89 | 7.90 | 0.006 |
| Performance application questions (mark on 10) | 7.60 | 8.29 | 3.73 | 0.056 |
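The estimated marginal means in table 12 are group means adjusted to a common level of the covariate. A minimal sketch of this idea, assuming an ANCOVA without a group-by-covariate interaction, fits a regression with a group dummy and the covariate and evaluates both groups at the overall covariate mean. The helper names and the data are hypothetical; the data were constructed so that the true group effect is exactly 3.

```python
def solve(matrix, rhs):
    """Solve a small linear system by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def ancova_adjusted_means(y, group, covariate):
    """Fit y = b0 + b1*group + b2*covariate by least squares and return the
    covariate-adjusted means for group 0 and group 1."""
    rows = [[1.0, float(g), float(c)] for g, c in zip(group, covariate)]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(3)]
    b0, b1, b2 = solve(xtx, xty)
    covariate_mean = sum(covariate) / len(covariate)
    return b0 + b2 * covariate_mean, b0 + b1 + b2 * covariate_mean


# Invented data: score = 2 + 3*group + 0.01*ability, without noise.
group = [0, 0, 0, 1, 1, 1]
ability = [500, 600, 700, 550, 650, 750]
score = [7.0, 8.0, 9.0, 10.5, 11.5, 12.5]

adjusted_0, adjusted_1 = ancova_adjusted_means(score, group, ability)
raw_difference = sum(score[3:]) / 3 - sum(score[:3]) / 3
adjusted_difference = adjusted_1 - adjusted_0
```

In this invented example the raw difference between the groups is 3.5, whereas the adjusted difference equals the constructed group effect of 3, because part of the raw gap stems from the higher ability of the second group.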
Consequently, hypothesis 1 cannot be confirmed. First, no significant impact of gender was
found for performance on the exam, which was corrected retrospectively for guessing. Second,
a gender effect in favour of male students with calculations and a gender effect in favour of
female students with applications have been detected. This contradicts prior findings of Leaver
& van Walbeek (2006) who found a gender effect in favour of male students for all types of
questions and especially for the more complex questions.
3.4.2 Hypotheses 2
Hypothesis 2a supposed that repeating a course is associated with a higher performance on MC
exams which are corrected retrospectively for guessing. However, this hypothesis will not be
tested as the number of respondents retaking the course is very low (n = 5).
Hypothesis 2b stated that students who are familiar with MC exams that are corrected
retrospectively for guessing are predisposed to perform better on MC exams where this scoring
method is applied. Table 13 shows that a positive but insignificant coefficient is found
(coefficient = 0.237). Consequently, hypothesis 2b can be rejected. As could be expected, no
significant impact of familiarity with retrospective correcting for guessing was found either for
exam performance when NR scoring is applied or for the performance scores on the three
distinct types of questions.
Table 13: Regression of familiarity with retrospective correcting for guessing on performance

| | Performance with retrospect. correcting (mark on 40) | Performance with NR scoring (mark on 40) | Performance theoretical questions (mark on 10) | Performance calculations (mark on 10) | Performance application questions (mark on 10) |
|---|---|---|---|---|---|
| C (constant) | 25.835 (3.963)*** | 29.938 (6.464)*** | 7.862 (6.349)*** | 7.421 (5.184)*** | 6.741 (4.479)*** |
| Familiarity with retrospective correcting for guessing | 0.237 (0.169) | 0.168 (0.169) | 0.081 (0.306) | -0.064 (-0.209) | 0.280 (0.866) |
| F (model) | 0.029 | 0.029 | 0.094 | 0.044 | 0.750 |
| p-value (model) | 0.866 | 0.866 | 0.760 | 0.835 | 0.388 |
| R² | 0.000 | 0.000 | 0.001 | 0.000 | 0.007 |

Note: t-statistics are in parentheses.
*** indicates significant at the 0.01 level, ** indicates significant at the 0.05 level, * indicates significant at the 0.10 level.
Similarly, no significant impact of familiarity with retrospective correcting for guessing on performance has been found
when adding ability as a control variable to the regression model. In this model, only ability turned out to be an important predictor
of performance.
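The quantities reported in the single-predictor regression tables of this section are linked: with one predictor, the model F equals the square of the slope's t-statistic (in table 13, 0.169² ≈ 0.029). A minimal sketch of such a regression, using invented data rather than the study's data, makes this relation explicit.

```python
from math import sqrt


def simple_regression(x, y):
    """Ordinary least squares for y = c + b*x with a single predictor.

    Returns the constant, the slope, the slope's t statistic, the model F and R-squared.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    const = my - slope * mx
    ss_total = sum((yi - my) ** 2 for yi in y)
    ss_resid = sum((yi - const - slope * xi) ** 2 for xi, yi in zip(x, y))
    r2 = 1 - ss_resid / ss_total
    se_slope = sqrt(ss_resid / (n - 2) / sxx)
    t = slope / se_slope
    f = (ss_total - ss_resid) / (ss_resid / (n - 2))
    return const, slope, t, f, r2


# Invented predictor and outcome values, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

const, slope, t, f, r2 = simple_regression(x, y)
```

R² is the share of the variance in the outcome that the predictor accounts for, which is how the R² values in the tables are read in the text.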
Hypothesis 2c asserts that preference for MC examinations that are corrected retrospectively
for guessing is associated with higher performance on MC examinations where this scoring
method is applied. Table 14 shows a positive, but insignificant regression coefficient for
preference of scoring method (coefficient = 0.279). Hence, hypothesis 2c cannot be
confirmed either. As could be expected, no significant impact of the preferred scoring method
was found for exam performance without retrospective correcting for guessing or for the
performance scores on the three distinct types of questions.
Table 14: Regression of preference of scoring method on performance

| | Performance with retrospect. correcting (mark on 40) | Performance with NR scoring (mark on 40) | Performance theoretical questions (mark on 10) | Performance calculations (mark on 10) | Performance application questions (mark on 10) |
|---|---|---|---|---|---|
| C (constant) | 24.448 (5.798)*** | 28.950 (9.663)*** | 7.943 (9.903)*** | 6.854 (7.391)*** | 6.690 (6.899)*** |
| Preference scoring method | 0.279 (0.598) | 0.199 (0.599) | 0.033 (0.375) | 0.030 (0.296) | 0.152 (1.411) |
| F (model) | 0.358 | 0.358 | 0.140 | 0.088 | 1.991 |
| p-value (model) | 0.551 | 0.551 | 0.709 | 0.768 | 0.161 |
| R² | 0.003 | 0.003 | 0.001 | 0.001 | 0.018 |

Note: t-statistics are in parentheses.
*** indicates significant at the 0.01 level, ** indicates significant at the 0.05 level, * indicates significant at the 0.10 level.
Similarly, no significant impact of preference of scoring method on performance has been found when adding ability as a control
variable to the regression model. In this model, only ability turned out to be an important predictor of performance.
3.4.3 Hypothesis 3
Hypothesis 3 states that there is a positive relationship between lesson attendance and
performance on MC exams that are corrected retrospectively for guessing. The respective
coefficients for attendance of theory and exercise lessons are 0.871 and 0.146 in case of
retrospective correcting for guessing. The coefficients are, however, insignificant, meaning that
lesson attendance seems to have no impact on exam performance. From table 15, we can
conclude that hypothesis 3 cannot be confirmed. Furthermore, no significant impact of lesson
attendance was found for performance on the three distinct types of questions. These findings
conflict with previous research such as the studies of Kirby & McElroy (2003) and Aden,
Yahye, & Dahir (2013), who found that lesson attendance has a significant positive effect on
performance.
Table 15: Regression of lesson attendance on performance

| | Performance with retrospect. correcting (mark on 40) | Performance with NR scoring (mark on 40) | Performance theoretical questions (mark on 10) | Performance calculations (mark on 10) | Performance application questions (mark on 10) |
|---|---|---|---|---|---|
| C (constant) | 22.458 (6.490)*** | 27.536 (11.200)*** | 7.285 (11.092)*** | 6.041 (7.958)*** | 8.549 (10.617)*** |
| Lesson attendance (theory) | 0.871 (0.957) | 0.619 (0.957) | 0.138 (0.796) | 0.175 (0.874) | 0.135 (0.635) |
| Lesson attendance (exercises) | 0.146 (0.150) | 0.104 (0.151) | 0.077 (0.417) | 0.070 (0.328) | -0.241 (-1.064) |
| F (model) | 1.065 | 1.066 | 1.185 | 1.187 | 0.573 |
| p-value (model) | 0.348 | 0.348 | 0.310 | 0.309 | 0.565 |
| R² | 0.019 | 0.019 | 0.021 | 0.021 | 0.010 |

Note: t-statistics are in parentheses.
*** indicates significant at the 0.01 level, ** indicates significant at the 0.05 level, * indicates significant at the 0.10 level.
Similarly, no significant impacts of both theory and exercises lesson attendance on performance have been found when adding
ability as a control variable to the regression model. In this model, only ability turned out to be an important predictor of
performance.
However, an additional assumption can be made with regard to lesson attendance. More
specifically, it may be assumed that every student who attended the last class of corporation
tax filled out the survey. Consequently, students who did not participate in the survey are
assumed to have been absent from this last class. When comparing performance between the
respondents who completed the questionnaire and were present during the last class on the one
hand, and the students who did not participate in the survey and who are assumed to have been
absent on the other hand, significant differences in performance are found. The results of the
t-test comparing these two groups are shown in table 16. From this test, it can be concluded that
students who attended the last class performed significantly better on the exam than students
who are assumed to have been absent as they did not participate in the survey (p < 0.001).
This significant difference in performance also applies to the performance scores on the three
distinct types of questions.
Table 16: Additional t-test regarding attendance of last course

| Variable | Mean absent students | Mean respondents | Mean difference | t | p-value |
|---|---|---|---|---|---|
| Performance with NR scoring (mark on 40) | 27.18 | 30.71 | -3.53 | -5.15 | 0.000 |
| Performance with retrospective correcting for guessing (mark on 40) | 21.96 | 26.93 | -4.97 | -5.15 | 0.000 |
| Performance theoretical questions (mark on 10) | 7.41 | 8.24 | -0.82 | -4.24 | 0.000 |
| Performance calculations (mark on 10) | 6.19 | 7.12 | -0.93 | -4.48 | 0.000 |
| Performance application questions (mark on 10) | 7.18 | 8.04 | -0.85 | -3.80 | 0.000 |
3.4.4 Hypothesis 4
Hypothesis 4 asserts that weekly invested study time has a positive effect on performance on
MC exams that are corrected retrospectively for guessing. Table 17 shows that the coefficient
for this variable is positive and significant (coefficient = 1.767, p < 0.05). This means that
students who spent more time per week studying at home for the course of corporation tax
achieved higher grades on the exam that was corrected retrospectively for guessing. This finding
supports hypothesis 4. The R² is equal to 0.057, meaning that 5.7% of the variance in
performance on the exam can be explained by the weekly invested study time.
Logically, the same positive and significant impact of time spent is found for performance on
the exam when no retrospective correcting for guessing is applied. With regard to performance
on the three distinct types of questions, a positive and significant effect of time spent was found
for the theoretical questions (coefficient = 0.307, p < 0.05) and calculations (coefficient = 0.380,
p < 0.05). Concerning performance on application questions, no significant impact of reported
study time was found.
Table 17: Regression of time weekly spent on performance

| | Performance with retrospect. correcting (mark on 40) | Performance with NR scoring (mark on 40) | Performance theoretical questions (mark on 10) | Performance calculations (mark on 10) | Performance application questions (mark on 10) |
|---|---|---|---|---|---|
| C (constant) | 23.726 (16.444)*** | 28.439 (27.745)*** | 7.681 (27.882)*** | 6.435 (20.283)*** | 7.819 (22.774)*** |
| Time weekly spent | 1.767 (2.580)** | 1.255 (2.580)** | 0.307 (2.350)** | 0.380 (2.525)** | 0.119 (0.733) |
| F (model) | 6.656 | 6.654 | 5.523 | 6.375 | 0.537 |
| p-value (model) | 0.011 | 0.011 | 0.021 | 0.013 | 0.465 |
| R² | 0.057 | 0.057 | 0.048 | 0.055 | 0.005 |

Note: t-statistics are in parentheses.
*** indicates significant at the 0.01 level, ** indicates significant at the 0.05 level, * indicates significant at the 0.10 level.
When adding ability as a control variable to the regression model, a positive impact of time spent was still found for performance
with and without retrospective correcting for guessing, though only significant at the 0.10 level. The significant impact of time spent
on performance on theoretical questions and calculations disappeared completely when controlling for ability. These results might
be explained by the correlation between time spent and ability.
3.4.5 Hypothesis 5
Hypothesis 5 supposes that students perceiving a course as rather difficult will perform better
on the MC examination of this course, which is corrected retrospectively for guessing, because
they will put more effort into studying the subject. Table 18 shows that the coefficient for
perceptions about course difficulty is positive but not significant (coefficient = 0.830).
Consequently, this finding leads to the rejection of hypothesis 5. Similarly, no significant
impact of perceptions about course difficulty was found for exam performance without
retrospective correcting for guessing or for performance on the three distinct types of
questions.
Table 18: Regression of perceptions about course difficulty on performance

| | Performance with retrospect. correcting (mark on 40) | Performance with NR scoring (mark on 40) | Performance theoretical questions (mark on 10) | Performance calculations (mark on 10) | Performance application questions (mark on 10) |
|---|---|---|---|---|---|
| C (constant) | 20.338 (3.937)*** | 26.032 (7.093)*** | 6.937 (7.072)*** | 6.172 (5.417)*** | 6.497 (5.430)*** |
| Perceptions course difficulty | 0.830 (1.289) | 0.590 (1.290) | 0.164 (1.340) | 0.120 (0.844) | 0.194 (1.300) |
| F (model) | 1.663 | 1.663 | 1.796 | 0.713 | 1.691 |
| p-value (model) | 0.200 | 0.200 | 0.183 | 0.400 | 0.196 |
| R² | 0.015 | 0.015 | 0.016 | 0.006 | 0.015 |

Note: t-statistics are in parentheses.
*** indicates significant at the 0.01 level, ** indicates significant at the 0.05 level, * indicates significant at the 0.10 level.
Similarly, no significant impact of perceptions about course difficulty on performance has been found when adding ability as a
control variable to the regression model. In this model, only ability turned out to be an important predictor of performance.
3.4.6 Hypothesis 6
Hypothesis 6 assumes a significant positive impact of the deep approach and a significant
negative impact of the surface approach on performance on the exam, which was corrected
retrospectively for guessing. Table 19 shows that only a significant impact of the surface
approach is found. The coefficient for the surface approach is -2.650. It should, however, be
noted that significance is only attained at the 0.10 level. Hence, hypothesis 6 can only partly be
supported. With regard to performance on the theoretical questions, a negative and more
strongly significant effect of the surface approach is found (coefficient = -0.702, p < 0.05). This
means that students who score high on the surface approach performed less well on theoretical
questions. The R² for this regression model equals 0.062, meaning that 6.2% of the variance in
performance on theoretical questions can be explained by the learning approaches students
employ. Concerning performance on calculations and application questions, no significant
impact of the surface approach was found. Regarding the deep learning approach, no significant
impact on performance on the exam, which was corrected retrospectively for guessing, was
found. Nor was a significant positive impact of the deep approach found for performance on the
three categories of questions.
Table 19: Regression of learning approaches on performance

| | Performance with retrospect. correcting (mark on 40) | Performance with NR scoring (mark on 40) | Performance theoretical questions (mark on 10) | Performance calculations (mark on 10) | Performance application questions (mark on 10) |
|---|---|---|---|---|---|
| C (constant) | 35.328 (5.365)*** | 36.683 (7.840)*** | 11.112 (9.058)*** | 8.370 (5.720)*** | 6.855 (4.357)*** |
| Deep approach | -0.405 (-0.244) | -0.288 (-0.244) | -0.367 (-1.187) | 0.037 (0.099) | 0.322 (0.813) |
| Surface approach | -2.650 (-1.829)* | -1.883 (-1.829)* | -0.702 (-2.600)** | -0.482 (-1.497) | 0.142 (0.411) |
| F (model) | 1.713 | 1.714 | 3.492 | 1.269 | 0.348 |
| p-value (model) | 0.185 | 0.185 | 0.034 | 0.285 | 0.707 |
| R² | 0.032 | 0.032 | 0.062 | 0.024 | 0.007 |

Note: t-statistics are in parentheses.
*** indicates significant at the 0.01 level, ** indicates significant at the 0.05 level, * indicates significant at the 0.10 level.
Due to the high correlation between the surface approach and ability, the learning approaches and ability have not been taken
together into one regression model.
3.4.7 Robustness check
In what follows, the results of the multiple regression analyses, which included all the
independent variables, are discussed. Only the variable measuring how many times students
participated in the exam is excluded. As already mentioned, a condition for multiple regression
is that the independent variables are not highly correlated, as this can result in redundant
information in the regression model. This problem, called multicollinearity, can be assessed
by means of the variance inflation factor (VIF). When taking performance on the exam with
“standard setting” as the dependent variable, the highest value for VIF is 2.163 for the variable
of theory lesson attendance. Hence, multicollinearity is not a concern here (Verlet, 2015).
The results are similar to those of the “single regressions” and are summarized in table 20. On
the one hand, there is a significant positive effect of the weekly invested study time on exam
performance (standardized coefficient = 0.243, p = 0.020). On the other hand, the surface
approach has a negative effect on exam performance, significant at the 0.10 level (standardized
coefficient = -0.192, p = 0.057). The adjusted R² takes into account the number of variables that
have been included as independent variables and indicates to which degree the variance in the
score on the exam, corrected retrospectively for guessing, can be explained by all independent
variables in the model (Verlet, 2015). The adjusted R² of this regression model equals 0.059,
meaning that 5.9% of the variance in performance on the exam can be explained by the
regression model.
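The two diagnostics used in this robustness check follow from simple formulas, sketched below under the usual textbook definitions. The VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all other predictors, and the adjusted R² corrects R² for the number of predictors. The function names and the sample size in the usage lines are hypothetical, as the number of observations is not restated in this section.

```python
def vif(r2_j):
    """Variance inflation factor of a predictor, given the R-squared obtained
    when that predictor is regressed on all other predictors."""
    return 1.0 / (1.0 - r2_j)


def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared, which penalises R-squared for the number of predictors."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# R-squared among the predictors implied by the highest reported VIF of 2.163.
implied_r2 = 1.0 - 1.0 / 2.163

# Hypothetical example: R-squared of 0.5 with 11 observations and 2 predictors.
example_adjusted = adjusted_r2(0.5, n_obs=11, n_predictors=2)
```

A VIF of 2.163 thus corresponds to roughly 54% of the variance in theory lesson attendance being shared with the other predictors, which is in line with the text's conclusion that multicollinearity is limited.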
Table 20: Regression of all the independent variables on performance with retrospective
correcting for guessing

| Variable | Standardized coefficient (Beta) | t-value | p-value |
|---|---|---|---|
| Constant | | 1.786 | 0.077 |
| Gender | -0.115 | -1.118 | 0.266 |
| Preference scoring method | 0.035 | 0.351 | 0.726 |
| Familiarity with retrospective correcting | 0.038 | 0.394 | 0.694 |
| Lesson attendance (exercises) | 0.061 | 0.440 | 0.661 |
| Lesson attendance (theory) | 0.059 | 0.425 | 0.672 |
| Time weekly spent | 0.243 | 2.357 | 0.020 |
| Perceptions course difficulty | 0.112 | 1.115 | 0.268 |
| Deep approach | -0.066 | -0.620 | 0.537 |
| Surface approach | -0.192 | -1.929 | 0.057 |

| Model summary | Value |
|---|---|
| Dependent variable | Performance with retrospective correcting for guessing |
| F (model) | 1.751 |
| p-value (model) | 0.088 |
| Adjusted R² | 0.059 |

* Logically, the same results have been found when taking performance on the exam without retrospective correcting for guessing
(i.e. “performance with NR scoring”) as the dependent variable in the regression model.
Also when taking the performance on theoretical questions as the dependent variable, similar
conclusions can be drawn as with the “single” regressions. Table 21 shows that weekly invested
study time has a significantly positive influence (coefficient = 0.217, p = 0.036), while the
surface approach has a significantly negative influence (coefficient = -0.269, p = 0.007). The
adjusted R² of this regression model is 0.080, meaning that 8% of the variance in performance
on theoretical questions can be explained by this regression model.
Table 21: Regression of all the independent variables on performance on theoretical
questions

| Variable | Standardized coefficient (Beta) | t-value | p-value |
|---|---|---|---|
| Constant | | 3.746 | 0.000 |
| Gender | -0.074 | -0.721 | 0.472 |
| Preference scoring method | 0.012 | 0.120 | 0.905 |
| Familiarity with retrospective correcting | 0.036 | 0.378 | 0.706 |
| Lesson attendance (exercises) | 0.065 | 0.478 | 0.634 |
| Lesson attendance (theory) | 0.078 | 0.571 | 0.569 |
| Time weekly spent | 0.217 | 2.129 | 0.036 |
| Perceptions course difficulty | 0.104 | 1.044 | 0.299 |
| Deep approach | -0.146 | -1.380 | 0.171 |
| Surface approach | -0.269 | -2.738 | 0.007 |

| Model summary | Value |
|---|---|
| Dependent variable | Performance on theoretical questions (mark on 10) |
| F (model) | 2.028 |
| p-value (model) | 0.044 |
| Adjusted R² | 0.080 |

* The VIFs are all below 3.
In table 22, the results of the regression model using performance on calculations as the
dependent variable, are shown. Again, there is a significantly positive impact of the weekly
invested study time (coefficient = 0.245, p = 0.019). Furthermore, the gender coefficient is
negative and significant (coefficient = -0.209, p = 0.045). This means that male students
performed better on calculations than female students. The adjusted R² of this regression model
equals 0.064, meaning that 6.4% of the variance in performance on calculations can be
explained by this regression model.
Table 22: Regression of all the independent variables on performance on calculations

| Variable | Standardized coefficient (Beta) | t-value | p-value |
|---|---|---|---|
| Constant | | 2.478 | 0.015 |
| Gender | -0.209 | -2.030 | 0.045 |
| Preference scoring method | 0.019 | 0.194 | 0.847 |
| Familiarity with retrospective correcting | -0.007 | -0.076 | 0.940 |
| Lesson attendance (exercises) | 0.121 | 0.880 | 0.381 |
| Lesson attendance (theory) | 0.018 | 0.131 | 0.896 |
| Time weekly spent | 0.245 | 2.377 | 0.019 |
| Perceptions course difficulty | 0.074 | 0.741 | 0.461 |
| Deep approach | -0.035 | -0.327 | 0.744 |
| Surface approach | -0.162 | -1.632 | 0.106 |

| Model summary | Value |
|---|---|
| Dependent variable | Performance on calculations (mark on 10) |
| F (model) | 1.811 |
| p-value (model) | 0.076 |
| Adjusted R² | 0.064 |

* The VIFs are all below 3.
Finally, table 23 shows the results of the regression model with performance on applications as
the dependent variable. The independent variables seem to have no significant impact on
performance on application questions. Only for gender is significance attained, and only at the
0.10 level (p = 0.074). The gender coefficient is positive (coefficient = 0.189), indicating that
female students performed better on application questions than male students. The
adjusted R² of this regression model is 0.029, meaning that only 2.9% of the variance in
performance on applications can be explained by this regression model. The low explanatory
power of the independent variables for performance on applications might be due to the fact
that the exam of corporation tax contained only six application questions. Hence, results
regarding performance on applications might be influenced by this and have to be interpreted
with caution.
Table 23: Regression of all the independent variables on performance on application
questions

| Variable | Standardized coefficient (Beta) | t-value | p-value |
|---|---|---|---|
| Constant | | 0.802 | 0.424 |
| Gender | 0.189 | 1.807 | 0.074 |
| Preference scoring method | 0.096 | 0.939 | 0.350 |
| Familiarity with retrospective correcting | 0.141 | 1.428 | 0.156 |
| Lesson attendance (exercises) | -0.186 | -1.327 | 0.188 |
| Lesson attendance (theory) | 0.084 | 0.600 | 0.550 |
| Time weekly spent | 0.058 | 0.552 | 0.582 |
| Perceptions course difficulty | 0.134 | 1.310 | 0.193 |
| Deep approach | 0.064 | 0.587 | 0.558 |
| Surface approach | 0.058 | 0.576 | 0.566 |

| Model summary | Value |
|---|---|
| Dependent variable | Performance on application questions (mark on 10) |
| F (model) | 1.359 |
| p-value (model) | 0.217 |
| Adjusted R² | 0.029 |

* The VIFs are all below 3.
4 Discussion
In the present study, the relationship between performance on multiple choice exams that are
corrected retrospectively for guessing and gender was the main research focus. This focus has
grown out of concern that a gender effect may occur. Many previous studies found a gender
effect in favour of male students in MC examinations, especially in case these exams were
scored by means of negative marking (NM). Concerning MC exams which are corrected
retrospectively for guessing, as nowadays applied at the University of Ghent, research is very
scarce. Since prior literature also held other students’ characteristics responsible for differences
in performance, their explanatory power has been investigated in this study. In what follows,
the findings of each of the hypotheses are discussed and other important comments are given.
First, it can be concluded that gender has no significant impact on the performance on the
exam, which was corrected retrospectively for guessing. Additionally, the score on the exam
has been further refined to performance on different categories of questions. Bloom’s taxonomy
has been used to categorize questions according to the level of cognitive reasoning required.
The exam of corporation tax consisted of theoretical questions, calculations and application
questions, which can be assigned respectively to the first (“knowledge”), second
(“comprehension”) and third level (“application”) of this taxonomy. The higher the level of
Bloom’s hierarchy, the more complex questions become (Leaver & van Walbeek, 2006). By
performing ANCOVAs of gender on performance, while controlling for ability, a significant
gender effect was found for the calculations and the application questions. Concerning
calculations, male students performed significantly better compared to female students. Also
when including all the independent variables in a regression model, the gender coefficient was
negative and significant. This finding is in line with the study of Du Plessis & Du Plessis (2007)
and Declerck (2010) who also found that male students scored consistently better on MC
questions of a quantitative nature. For performance on application questions, the results show a
positive gender coefficient that is significant at the 0.10 level but narrowly misses the 0.05
level. This positive gender effect means that women tended to perform better than men on
application questions. Applications can be considered the most difficult questions that were
posed on the exam. They not only require students to memorize and understand material, but also
involve higher levels of thinking, as students have to apply previously learned information in
new situations. Finally, no significant gender effect has been found with regard to performance
on theoretical questions, which belong to the lowest level of Bloom’s taxonomy. These findings
do not correspond with the results of Leaver & van Walbeek (2006), who found a significant
gender effect in favour of male students for all types of questions, and especially for those
categorized at higher levels of Bloom’s taxonomy. However, in this study, caution is needed
when interpreting and generalizing the results concerning performance on applications. As the
exam of corporation tax contained only six application questions, this might have influenced
the results. Hence, no evidence was found that supports hypothesis 1.
Although not expected in advance, other gender differences have been detected. The correlation
table and the results of the t-test revealed that women perceive the corporation tax course as
more difficult than men do. Furthermore, female students have a higher preference for the
retrospective correcting for guessing scoring method than male students. This might be
explained by the fact that the transition from NM towards ‘standard setting’ at the University
of Ghent mainly benefits women. The higher risk aversion in women, as frequently observed
in prior literature, becomes completely irrelevant with this non-conventional scoring method.
The transition increased the mark on 20 by 0.89 for women and by 0.46 for men (Van de
Poele & Sabbe, 2016).
Second, a significant positive effect of time spent on exam performance has been detected,
supporting hypothesis 4. This means that the more students studied at home for this course, the
higher their grades on the exam, which was corrected retrospectively for guessing. Hence, it can
be concluded that these findings are in line with the studies of Rau & Durand (2000),
Stinebrickner & Stinebrickner (2004) and Diseth, Pallesen, Brunborg, & Larsen (2010). A
significant, positive impact of time spent was also found for performance on the rather
simple questions, being the theoretical questions and calculations. Also when the other
independent variables were included in the regression model, a significant positive effect of
weekly invested study time was found. However, regarding performance on application
questions, no positive impact of time spent was found. On the one hand, results for application
questions might again be influenced by the limited number of applications on the exam. On the
other hand, the fact that invested study time has no significant impact on performance on
applications may be explained by differences between students regarding skill sets. Students
with a higher critical thinking capacity, for instance, might attain the same or better marks on
applications with less study time invested (Plant et al., 2005).
Third, when investigating the relation between students’ learning approaches and exam
performance, evidence for hypothesis 6 is only found with regard to the surface approach. The
use of the surface approach has a negative influence on performance on the exam when
retrospective correcting for guessing is used. However, it should be noted that significance was
only attained at the 0.10 level. When looking at the impact of the surface approach on the
general performance scores on the three categories of questions, a negative and more significant
effect of the surface approach on performance on theoretical questions was found. This means
that having a surface approach results in lower performance on theoretical questions. When
adding the other independent variables in the regression model, there is an even more significant
impact of the surface approach on performance on theoretical questions. As such, it seems that
the negative relationship between a surface approach and performance on theoretical questions
is a robust one. Concerning performance on calculations and applications, no significant
negative effect of the surface approach was found. With regard to the deep learning approach,
a significant impact was neither found for performance on the exam nor for performance on the
three types of questions. Several prior studies, such as that of Diseth & Martinsen
(2003), likewise found no evidence for the deep approach and found only that higher surface
approach scores are associated with less successful academic performance. Furthermore, these
authors argue that this may be due to the fact that academic courses frequently include a fixed
curriculum and that the standards for good exam performance are well defined. Consequently,
students are not really encouraged or invited to explore subjects which are not included in the
curriculum. Hence, students may feel more inclined to adopt a surface approach to learning.
Moreover, the Cronbach’s alphas for the learning approaches are somewhat lower than the
alpha values of 0.73 and 0.64 obtained by the study of Biggs, Kember, & Leung (2001). This
is particularly true for the Cronbach’s alpha of the deep learning approach, which
amounted to only 0.65 in this study.
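The reliability coefficient discussed above can be illustrated with a minimal sketch. It assumes the standard definition of Cronbach's alpha, alpha = k/(k-1) * (1 - sum of item variances / variance of total scores), with k the number of statements in the scale. The function name and the example answers are hypothetical and are not taken from the survey data of this study.

```python
from statistics import variance


def cronbach_alpha(items):
    """Return Cronbach's alpha for a scale.

    items: one list of scores per statement (item); every list holds the
    answers of the same respondents in the same order.
    """
    k = len(items)
    if k < 2:
        raise ValueError("Cronbach's alpha needs at least two items")
    n = len(items[0])
    if n < 2 or any(len(scores) != n for scores in items):
        raise ValueError("every item needs the same number (>= 2) of respondents")

    # Sum of the sample variances of the individual items.
    item_variance_sum = sum(variance(scores) for scores in items)

    # Sample variance of each respondent's total score across all items.
    totals = [sum(scores[i] for scores in items) for i in range(n)]
    total_variance = variance(totals)
    if total_variance == 0:
        raise ValueError("total scores do not vary; alpha is undefined")

    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)


# Hypothetical answers of five respondents to three 5-point statements.
example_items = [
    [4, 5, 3, 4, 2],
    [2, 4, 3, 5, 3],
    [5, 4, 4, 3, 4],
]
print(round(cronbach_alpha(example_items), 2))
```

Because alpha depends on how strongly the items vary together relative to their individual variances, a scale with only a few weakly related statements will tend to produce a low value.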
Though the deep approach had no significant influence on performance, a significant positive
correlation was found between the deep approach and time spent studying at home for the
course. Students with a higher deep approach score reported having spent more time on corporation
tax during the semester. This is not surprising, as students with a deep learning approach have
a sincere interest in the subject and put more effort into thoroughly understanding the material
than those with a surface approach. This significant positive effect of the deep learning
approach on time spent was also found by Everaert, Opdecam, & Maussen (2017).
Further, classifying students’ approaches to learning with a mean split indicated that 25.8% of the
students mainly had a surface approach and 25% of the students mainly employed a deep
approach. However, 29% of the students scored below the mean for both learning approaches.
Although this is a surprisingly large group, prior research has identified a profile that
consisted of low scores on both learning approaches. A study of law students found that 23%
of the students scored low on both learning approaches (Lindblom-Ylanne, in:
Gijbels, Van de Watering, Dochy, & Van den Bossche, 2005). A more recent study by
Everaert, Opdecam, & Maussen (2017) also found a large cohort of students scoring low on both
approaches. They called these students rote learners. The aim of these students is not to
thoroughly understand the material, but they rather seek self-fulfilment by revising and revising
the material. These students often know the course by heart, but do not understand what they
have learned. Consequently, they fail to apply the learned information in new situations. They
are called rote learners, because they are willing to invest more effort in studying than strictly
necessary to pass the course, but probably not in an adequate manner. Although this kind of
combination in learning approaches is considered as “disintegrated”, this profile is quite typical
for novice students (Gijbels et al., 2005). The students in this sample are, however, third
bachelor students, and hence, can no longer be considered as “novice”. Nevertheless, it seems
that they still face difficulties in approaching their studies. Finally, 20.2% of the respondents
reported high scores on both learning approaches. Although the smallest number of respondents
fitted in this group, it is still a relatively large group. This is surprising as the learning
approaches are theoretically considered mutually exclusive (Biggs, 1987). Again Gijbels et al.
(2005) argued that this profile is quite typical of novice students. Hence, it can be concluded
that a large percentage of the students in the sample struggles to find a suitable method for
approaching their studies.
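The mean split described above can be sketched as follows. This is only an illustration under stated assumptions: a score counts as "high" when it lies strictly above the sample mean (how ties were handled in this study is not reported), and the function names, profile labels and example scores are hypothetical.

```python
from collections import Counter
from statistics import mean


def mean_split_profiles(deep_scores, surface_scores):
    """Assign each student to one of four profiles using a mean split.

    Assumption: a score strictly above the sample mean counts as 'high';
    a score at or below the mean counts as 'low'.
    """
    deep_mean = mean(deep_scores)
    surface_mean = mean(surface_scores)
    profiles = []
    for deep, surface in zip(deep_scores, surface_scores):
        high_deep = deep > deep_mean
        high_surface = surface > surface_mean
        if high_deep and high_surface:
            profiles.append("high on both")
        elif high_deep:
            profiles.append("mainly deep")
        elif high_surface:
            profiles.append("mainly surface")
        else:
            profiles.append("low on both")
    return profiles


def profile_shares(profiles):
    """Return the percentage of students in each profile."""
    counts = Counter(profiles)
    return {name: 100 * count / len(profiles) for name, count in counts.items()}


# Hypothetical scale scores for four students.
deep = [1, 2, 3, 4]
surface = [4, 1, 4, 1]
print(profile_shares(mean_split_profiles(deep, surface)))
```

A mean split is simple to apply, but students close to the mean can move between profiles with small changes in their scores, which is worth keeping in mind when interpreting the group sizes.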
Fourth, concerning familiarity with MC exams that are scored retrospectively for guessing,
the descriptive statistics revealed that students feel very comfortable with this marking method
and understand how grades are calculated. This is a great advantage compared to the NM
scoring method, which was frequently applied at Ghent University in the past. With NM,
students were often preoccupied with working out the optimal answering strategy, as different
teachers sometimes attached different penalties to incorrect answers. However,
feeling familiar with retrospective correcting for guessing did not result in higher performance
on this type of exam. Nevertheless, more research is needed here, since the Cronbach’s alpha
was extremely low and might have influenced the results. This low value is possibly due to the
fact that the statements are not based on an existing scale, because such an instrument was not
found in prior literature. Hence, we produced a scale of our own to take this factor into account.
Further, the low value of the Cronbach’s alpha can be attributed to the low number of statements
included. Also no evidence was found for hypothesis 2c, stating that students preferring the
retrospective correcting for guessing scoring method perform better on this type of MC
examination than those with a higher preference for NM. The absence of a significant impact
on performance may be explained by the great unanimity among respondents about the
preferred scoring method. Only a very small percentage of the respondents (4.7%) reported
a stronger preference for the NM scoring method than for retrospective correcting for
guessing.
Fifth, linking the survey answers regarding theory and exercise class attendance to
performance did not reveal a significant impact of class attendance on performance. This is
not in line with the studies of Krieg & Uyar (2001), Kirby & McElroy (2003), and Aden, Yahye,
& Dahir (2013). However, the results in this thesis might be biased, as responses were
collected only from students who attended the last class. The vast majority of these respondents
reported having attended between 80% and 100% of both theory and exercise classes. Notably,
more than half of the students enrolled in the corporation tax course were absent.
This is quite surprising as important information concerning the exam is often communicated
during this last class. Consequently, it is conceivable that many of these absent students also
skipped other classes of this course. Assuming that the students who did not participate
in the survey were absent during the last class and that every student present filled
out the survey, significant performance differences were found between the students attending
the last class and the absent students. Students who attended the last lecture obtained
significantly better marks on the exam that was scored retrospectively for guessing, as well as
on each type of question, than those who skipped the last class. On the one hand, this
finding may make students aware of the importance of attending lessons, as this may result in
higher grades. On the other hand, this finding may also motivate instructors as it shows that
teaching indeed has a positive influence on the performance outcomes of students. Research
into e-learning found that an important reason for absenteeism in higher education is that
students nowadays have technological alternatives at their disposal, such as the platform “Minerva” at Ghent
University (Naber & Köhle, in: Massingham & Herrington, 2006). Furthermore, it was also
noticeable that a relatively higher number of female students attended the last class, although a
higher proportion of male students (n = 185) participated in the exam in January in comparison
to female students (n = 144).
Sixth, perceptions of course difficulty also had no significant effect on performance. On the
one hand, this finding is in line with the results of Hong (1999) and Combs, Michael,
& Fiore (2002), who found no direct association between beliefs about difficulty and
performance. On the other hand, this finding contradicts the results of Foos (1992), who found
that students who perceive a course as difficult perform better because they work harder.
However, as shown in the correlation table, no positive association was found between
perceptions of course difficulty and invested study time. Those perceiving the course as
quite difficult did not spend more time studying the material at home than those who
perceived the course as easier. This might explain why no significant differences in
performance were found. Finally, though not hypothesized in advance, perceptions of course
difficulty correlate significantly and positively with exercise class attendance. This means
that students attending more exercise classes perceived the course as more difficult.
4.1 Limitations
There are several limitations to this study that one has to be aware of. First, the possible
occurrence of response bias is a limitation inherent to survey research. Response
bias refers to the tendency of respondents to systematically respond to questions on a different
basis than the content of the items. A common response tendency is socially desirable
responding, which means that respondents adjust their answers to what they think is socially
acceptable or politically correct, or to what they think the researcher would like to hear.
Consequently, a survey may only show what people claim to do or think, and may not always
correspond with reality (van Thiel, 2010). However, I tried to limit this gap as much as
possible by giving clear instructions for completing the survey. I emphasized that the data
would be handled confidentially and that results would be treated anonymously. In this way,
I wanted to assure students that their responses would not be passed on to the responsible
teacher, so that they could answer honestly.
The second limitation is the rather low number of observations. Although 350 students
were enrolled in the corporation tax course, more than half of them were absent during the last
lesson, when the survey was distributed. As this was the last class before the exam, this low
turnout was not expected in advance. Moreover, as some respondents could not be identified,
their answers to the questionnaire could not be linked to their score on the exam. Also for those
who completed the survey, but did not participate in the exam, influences on performance could
not be investigated.
External validity is a third limitation and refers to the extent to which the findings of this study
can be transferred to or applied in situations other than the context in which the study was
conducted. Although research aims at generating information that can be used in other settings
as well, it should be acknowledged that a study can never produce generally transferable results
(Malterud, 2001). As there is a gap in the literature concerning non-conventional scoring methods,
there are at present no clear indications that the findings of this study will also apply to other
MC exams that use a retrospective correcting for guessing scoring method. Furthermore, the
fact that this study only uses data from one course at one university also limits the generalizability
of the results.
Fourth, as already mentioned, the Cronbach’s alpha for familiarity was very low. The alphas for the
learning approaches, and especially for the deep approach, were also below the values obtained
by Biggs, Kember, & Leung (2001).
Fifth, many previous studies also investigated the effect of student motivation on academic
performance in higher education and concluded that motivation is an important predictor for
performance (e.g. Turner, Chandler, & Heffer, 2009). However, due to the weak outcomes of
the factor and reliability analyses of the measurement instrument for motivation, this study
could not produce reliable findings for this variable.
4.2 Future research
More research is needed to investigate whether gender differences in performance occur with
MC examinations that are corrected retrospectively for guessing. Besides gender differences,
further empirical studies should also measure which other factors may have an influence on
performance on these exams. A particularly interesting avenue for future research is to compare
the explanatory value of different students’ characteristics for performance on exams scored by
means of negative marking (NM) and for performance on exams scored by means of “standard
setting”. This was not possible in the current study, as students were told that “standard setting”
was the scoring method used for the exam. Had NM been applied, students would probably have
omitted several items and, consequently, obtained different marks on the exam.
Furthermore, it would be very interesting to replicate this research across different disciplines
in multiple higher education institutions. In addition, other factors (e.g. time restrictions
during examinations) that were not taken into account in the present study could be considered.
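To make the contrast between the two scoring methods concrete, the sketch below compares them for a single answer sheet. It rests on assumptions that are not taken from this study: NM is modelled as classical formula scoring (a penalty of 1/(k-1) per wrong answer on items with k options, zero for an omitted item, pass mark at half of the items), and “standard setting” is modelled as a pass mark halfway between the expected blind-guessing score and the maximum score. The exact penalties and cut-offs used in practice may differ, and all function names are hypothetical.

```python
def negative_marking_score(correct, wrong, options):
    """Raw score under the assumed NM rule: +1 per correct answer,
    -1/(options - 1) per wrong answer, 0 for an omitted item."""
    return correct - wrong / (options - 1)


def raised_cutoff(n_items, options):
    """Assumed raised pass mark: halfway between the expected score
    from blind guessing and the maximum score."""
    chance_score = n_items / options
    return chance_score + (n_items - chance_score) / 2


def passes_nm(correct, wrong, n_items, options):
    """Pass under NM if the penalised score reaches half of the items."""
    return negative_marking_score(correct, wrong, options) >= n_items / 2


def passes_raised_cutoff(correct, n_items, options):
    """Pass under the raised cut-off; wrong answers carry no penalty."""
    return correct >= raised_cutoff(n_items, options)


# Hypothetical exam: 40 items with 4 options each.
# One answer sheet: 24 correct, 12 wrong, 4 omitted.
print(negative_marking_score(24, 12, 4))   # penalised raw score under NM
print(raised_cutoff(40, 4))                # number of correct answers needed
print(passes_nm(24, 12, 40, 4), passes_raised_cutoff(24, 40, 4))
```

Under these assumptions the same answer sheet can pass under one method and fail under the other, which is why marks obtained under “standard setting” cannot simply be rescored as if NM had been announced: the answering behaviour itself would have been different.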
5 Conclusion
A major contribution of this study is that it extends prior literature on gender bias in MC
examinations since an alternative scoring method is explored, being retrospective correcting for
guessing, also known as “standard setting” or “hogere cesuur” in Dutch. Besides gender
differences, it has been tested whether other students’ characteristics lead to advantages in
taking this type of MC exam. Furthermore, this study also investigated the effect of the
different students’ characteristics on performance on the three distinct types of questions.
This study found no evidence of the existence of a gender effect in relation to performance on
MC exams that are corrected retrospectively for guessing. No significant differences between
men and women have been found for performance on the exam. However, when distinguishing
between general scores on the different types of questions posed on the exam,
other conclusions can be drawn. On the one hand, results showed that male students performed
significantly better on calculations than female students. On the other hand, female
students outperformed male students on application questions. Regarding the
simplest questions, being theoretical questions, no gender effect on performance was found.
It is, nevertheless, recommended for instructors to incorporate a balanced mix of different types
of questions in MC exams.
Besides gender, this study also took other factors into account which might affect the
performance of students. In fact, statistically significant performance differences have been
found for weekly invested study time and the use of the surface learning approach. First, the
present study found that invested study time is a strong predictor of performance. Weekly
invested study time has a positive impact on performance on exams that are scored
retrospectively for guessing. Also for general performance on theoretical questions and
calculations, self-study time had an independent effect on performance above and beyond the
other students’ characteristics and qualitative aspects of learning activities. Hence, educators
should convince their students of the importance of investing adequate time in their learning
activities on a regular basis, throughout the whole semester.
Second, regarding the approaches to learning, the results indicated that the use of the surface
learning approach leads to lower academic performance, while the deep approach unexpectedly
did not predict achievement. For general performance on the exam that was scored
retrospectively for guessing, however, the use of the surface approach showed only a slight
trend towards significance (p < 0.10). For performance on theoretical questions, being the
simplest MC questions, the surface approach showed a more significant association with lower
performance. Consequently, it is recommended that educators try to discourage students from
employing a surface learning approach. The use of a surface approach can, for instance, be made
less attractive by matching the level of the subject with students’ prior knowledge. When
students employed a surface approach in previous subjects, they will realise that they do not
have the expected prior knowledge at the start of a new subject. Furthermore, restricting the
workload to a level that allows students to explore the material more thoroughly, may also
discourage the use of the surface approach (Biggs & Tang, 2007). Finally, the results indicated
that a large group of students (“rote learners”) scored low on both learning approaches, which
also requires further attention. Assessment has to be aligned with the desired learning outcomes
in such a way that rote recall of information is rewarded less.
Regarding lesson attendance, no evidence of a significant impact on exam performance was
initially found. The reported degree of class attendance was not associated with performance
on the exam nor with performance on the three distinct types of questions. Nevertheless, when
comparing the scores of the students attending the last class with the scores of those who are
assumed to be absent, strong differences in performance occurred. Significant performance
differences were found for performance on the exam when retrospective correcting for guessing
is applied and for general performance on the different types of questions as well. As such, it
seems that attending lectures indeed has a positive effect on academic achievement. For the
other independent variables, no evidence has been found of a significant impact on performance
on exams scored retrospectively for guessing. However, as described above, several limitations
call for more extensive research in this field.
Bibliography
Aden, A.A., Yahye, Z.A., & Dahir, A.M. (2013). The Effect of Students’ Attendance on
Academic Performance: A Case Study at Simad University Mogadishu. Academic Research
International, 4 (6), 409 – 417.
Arthur, N., & Everaert, P. (2012). Gender and performance in accounting examinations:
Exploring the impact of examination format. Accounting Education, 21(5), 471–487.
Beller, M., & Gafni, N. (2000). Can item format (multiple choice vs. open-ended) account for
gender differences in mathematics achievement? Sex roles, 42(1 – 2), 1 – 21.
Betts, L. R., Elder, T. J., Hartley, J. & Trueman, M. (2009). Does correction for guessing reduce
students’ performance on multiple-choice examinations? Yes? No? Sometimes? Assessment &
Evaluation in Higher Education, 34(1), 1–15.
Bible, L., Simkin, M.G., & Kuechler, W.L. (2008). Using Multiple-Choice Tests to Evaluate
Students' Understanding of Accounting. Accounting Education, 17:1, S55 - S68.
Biggs, J. B. (1987). Student Approaches to Learning and Studying. Hawthorn, Victoria:
Australian Council for Educational Research.
Biggs, J., Kember, D., & Leung, D. Y. (2001). The revised two-factor study process
questionnaire: R – SPQ – 2F. British Journal Of Educational Psychology, 71(1), 133-149.
Biggs, J. & Tang, C. (2007). Teaching for Quality Learning at University (3rd Ed.).
Maidenhead: McGraw Hill Education & Open University Press.
Byrne, M., Flood, B., & Willis, P. (2002). The relationship between learning approaches and
learning outcomes: a study of Irish accounting students. Accounting Education, 11(1), 27-42.
Chan, N., & Kennedy, P. E. (2002). Are multiple choice exams easier for economics students?
A comparison of multiple-choice and ‘equivalent’ constructed-response exam questions.
Southern Economic Journal, 68(4), 957-971.
Chemolli, E., & Gagné, M. (2014). Evidence against the continuum structure underlying
motivation measures derived from self-determination theory. Psychological Assessment,
26(2), 575–585.
Cohen‐Schotanus J., & Van der Vleuten, C. (2010). A standard setting method with the best
performing students as point of reference: Practical and affordable. Medical Teacher, 32, 154‐160.
Combs, H. M., Michael, L., & Fiore, B. (2002). Easy Test or Hard Test, Does it Matter? The
Impact of Perceived Test Difficulty on Study Time and Test Anxiety. Retrieved on March 24,
2017, via http://www.kon.org/urc/v6/combs.html
Cortright, R., Lujan, H., Cox, J., & DiCarlo, S. (2011). Does sex (female versus male) influence
the impact of class attendance on examination performance? Advances in Physiology
Education, 35, 416-420.
Davidson, R. A. (2002). Relationship of study approach and exam performance. Journal of
Accounting Education, 20(1), 29-44.
Deci, E. L., & Ryan, R. M. (1985). Intrinsic motivation and self-determination in human
behavior. New York: Plenum
Declerck, S. (2010). De invloed van gender en examen formaat op de prestaties van studenten
[Masterproef]. Gent: Universiteit Gent Master in de bedrijfseconomie.
De Lange, P., & Mavondo, F. (2004). Gender and motivational differences in approaches to
learning by a cohort of open learning students. Accounting Education, 13(4), 431-448
De Moor, G., & Van Maele, G. (2008). Inleiding tot de biomedische statistiek. Leuven: Acco.
Diseth, A. & Martinsen, O. (2003). Approaches to learning, cognitive style, and motives as
predictors of academic achievement. Educational Psychology, 23(2), 195 – 207.
Diseth, A., Pallesen, S., Brunborg, G. S., & Larsen, S. (2010). Academic achievement among
first semester undergraduate psychology students: The role of course experience, effort,
motives and learning strategies. Higher Education, 59, 335 –352.
Downing S. M. (2003). Guessing on selected-response examinations. Medical Education, 37,
670 – 671.
Duchesne, I., & Nonneman, W. (1998). The demand for higher education in Belgium.
Economics of Education Review, 17(2), 211-218.
Du Plessis, S., & Du Plessis, S. (2007). A new and direct test of the ‘gender bias’ in multiple-
choice questions, Stellenbosch Economic Working Paper.
Espinosa, M. P. & Gardeazabal, J. (2010). Optimal correction for guessing in multiple-choice
tests. Journal of Mathematical Psychology, 54(5), 415–425.
Everaert, P., Opdecam, E., & Maussen, S. (2017). The Relationship between Motivation,
Learning approaches, Academic Performance and Time Spent. Accounting Education, 26(1),
78-107.
Foos, P. W. (1992). Test performance as a function of expected form and difficulty. Journal of
Experimental Education, 60(3), 205-211.
Gijbels, D., Van de Watering, G., Dochy, F., & Van den Bossche, P. (2005). The relationship
between students’ approaches to learning and the assessment of learning outcomes. European
Journal of Psychology of Education, 20(4), 327 – 341.
Hall, M., Ramsay, A., & Raven, J. (2004). Changing the learning environment to promote deep
learning approaches in first-year accounting students. Accounting Education, 13(4), 489-505.
Hartley, J., Betts, L., & Murray, W. (2007). Gender and assessment: differences, similarities and
implications. Psychology Teaching Review, 13(1), 34-47.
Hong, E. (1999). Test anxiety, perceived test difficulty, and test performance: temporal patterns
of their effects. Learning and Individual Differences, 11(4), 431-448.
Kastner, M., & Stangl, B. (2011). Multiple Choice and Constructed Response Tests: Do Test
Format and Scoring Matter? Procedia - Social and Behavioral Sciences, 12, 263-273.
Kirby, A., & McElroy, B. (2003). The Effect of Attendance on Grade for First Year Economics
Students in University college Cork. The Economic and Social Review, 34(3), 311-326.
Krieg, R.G., & Uyar, B. (2001). Student Performance in Business and Economics Statistics:
Does Exam Structure Matter? Journal of Economics and Finance, 25(2), 229-241.
Leaver, R., & van Walbeek, C. (2006). “Gender bias” in multiple choice questions: does the type
of question make a difference? University of Cape Town, School of Economics Working Paper.
Lesage, E., Valcke, M., & Sabbe, E. (2013). Scoring methods for multiple choice assessment
in higher education: is it still a matter of number right scoring or negative marking? Studies
in Educational Evaluation, 39(3), 188–193.
Malterud, K. (2001). Qualitative research: standards, challenges, and guidelines. The Lancet,
358, 483-488.
Marín, C., & Rosa-García, A. (2011). Gender bias in risk aversion: evidence from multiple
choice exams. Working Paper 39987, MPRA.
Massingham, P., & Herrington, T. (2006). Does Attendance Matter? An Examination of Student
Attitudes, Participation, Performance and Attendance. Journal of University Teaching &
Learning Practice, 3(2).
Morgan, G.A., Leech, N. L., Gloeckner, G. W., & Barrett, K.C. (2004). SPSS for Introductory
Statistics: Use and Interpretation (2nd ed.). Mahwah, NJ: Lawrence Erlbaum Associates.
Niemiec, C. P., & Ryan, R. M. (2009). Autonomy, competence, and relatedness in the
classroom: Applying self-determination theory to educational practice. Theory and Research in
Education, 7, 133-144.
Nonis, S.A., & Hudson, G. I. (2006). Academic Performance of College Students: Influence of
Time Spent Studying and Working. Journal of Education for Business, 81(3), 151-159.
Norcini, J.J. (2003). Setting standards on educational tests. Medical Education, 37, 464-469.
Plant, E. A., Ericsson, K. A., Hill, L., & Asberg, K. (2005). Why study time does not predict
grade point average across college students: Implications of deliberate practice for academic
performance. Contemporary Educational Psychology, 30(1), 96-116.
Rau, W., & Durand, A. (2000). The academic ethic and college grades: Does hard work help
students to “make the grade”? Sociology of Education, 19-38.
Ryan, R. M., & Deci, E. L. (2000). Self-determination theory and the facilitation of intrinsic
motivation, social development, and well-being. American Psychologist, 55, 68-78.
Scouller, K. (1998). The influence of assessment method on students' learning approaches:
multiple choice question examination versus assignment essay. Higher Education, 35(4), 453-
472.
Self-determination theory. (2017). Self-regulation questionnaires. Retrieved on February 26,
2017, from http://selfdeterminationtheory.org/self-regulation-questionnaires/
Stinebrickner, R., & Stinebrickner, T. (2004). Time-use and college outcomes. Journal of
Econometrics, 121, 243–269.
Turner, E. A., Chandler, M., & Heffer, R. W. (2009). The influence of parenting styles,
achievement motivation, and self-efficacy on academic performance in college students.
Journal of College Student Development, 50(3), 337-346.
Universiteit Gent. (2017). Geen giscorrectie meer bij meerkeuzevragen. Retrieved on February
10, 2017, from http://www.ugent.be/student/nl/studeren/examens/geen-giscorrectie-meer-bij-meerkeuzevragen
Van de Poele, L., & Sabbe, E. (2016). Hogere cesuur [PowerPoint-presentatie]. Retrieved on
February 28, 2017, via
https://www.timvervoort.com/lno2/documenten/2016/De_SIG_evalueren_goes_classic_hogere_cesuur_sessie.pdf
Vansteenkiste, M., Lens, W., & Deci, E. L. (2006). Intrinsic versus extrinsic goal contents in
self-determination theory: another look at the quality of academic motivation. Educational
psychologist, 41(1), 19-31.
van Thiel, S. (2010). Bestuurskundig onderzoek, een methodologische inleiding. Bussum:
Coutinho
Verlet, D. (2015). Onderzoeksmethoden: 6e sessie SPSS regressieanalyse. Faculteit Economie
en Bedrijfskunde, Universiteit Gent
Wester, A., & Henriksson, W. (2000). The interaction between item format and gender
differences in mathematics performance based on TIMSS data. Studies in Educational
Evaluation, 26, 79–90.
Willingham, W. W., & Cole, N. S. (1997). Gender and fair assessment. Mahwah, NJ:
Lawrence Erlbaum Associates.
Woodford, K., & Bancroft, P. (2004). Using multiple choice questions effectively in information
technology education. Paper presented at the 21st ASCILITE Conference, Perth.
Appendices
Appendix 1: Survey
Dear student,
I am a master's student in Business Economics. As part of my master's dissertation, I am
researching gender bias in standard setting as an assessment method for multiple-choice exams.
The aim of my research is to gain insight into which factors can influence the score on such
exams. The survey consists of a few general questions, followed by several
questions that relate specifically to the corporation tax course. We emphasise that the
data will be treated as strictly confidential and that no specific information will be
passed on to the teachers.
Completing this questionnaire will take 10 to 15 minutes of your time.
If you have any questions or comments about the research, please feel free to contact
me at [email protected].
Thank you in advance.
Kind regards,
Daphné Dejonckheere
INSTRUCTIONS FOR COMPLETING THE QUESTIONNAIRE:
I. For the quality of the research, it is important that you answer all
questions.
II. Only one answer is possible per question. If you are in doubt between several
answers, please try to indicate the answer that best matches
your actual situation.
III. There are no wrong answers to the questions in this survey. However, please try
to be honest when answering the questions. After all, every honest answer is
a good answer!
1. What is your gender?
Male
Female
2. What is your year of birth?
…….…….
3. How many times have you already taken the corporation tax exam?
I will take the corporation tax exam for the first time in the upcoming
exam period.
I have taken the corporation tax exam once in the past.
I have taken the corporation tax exam twice in the past.
I have taken the corporation tax exam more than twice in the
past.
4. What percentage of the corporation tax exercise classes did you attend this
semester?
0 – 19%
20 – 39%
40 – 59%
60 – 79%
80 – 100%
I never go to the exercise class.
5. What percentage of the corporation tax theory classes did you attend this
semester?
0 – 19%
20 – 39%
40 – 59%
60 – 79%
80 – 100%
I never go to the theory class.
6. How difficult is the content of the corporation tax course for you? Circle the
most appropriate answer.
Easy 1 2 3 4 5 6 7 8 9 10 Difficult
7. How much time did you spend on the corporation tax course per week on average
(excluding the classes you attended)?
Less than 1 hour per week
Between 1 and 2 hours per week
Between 2 and 3 hours per week
Between 3 and 4 hours per week
Between 4 and 5 hours per week
Between 5 and 6 hours per week
More than 6 hours per week
8. Which assessment method do you prefer for multiple-choice exams? Circle
the most appropriate answer.
Negative marking (giscorrectie)(*) 1 2 3 4 5 6 7 8 9 10 Standard setting(**)
(*) When negative marking is applied, you receive a positive score for each correct answer, but
you also lose points for a wrong answer or a question left blank (Universiteit Gent,
2016).
(**) When standard setting or a higher cut-off score is applied, you cannot lose points for
answering a multiple-choice question incorrectly, but you do have to answer more than the traditional
50% of the questions correctly in order to pass (Universiteit Gent, 2016).
9. Indicate to what extent you agree with the following statements about standard setting
as an assessment method.
Scale: 1 = Strongly disagree, 2 = Somewhat disagree, 3 = Neither agree nor disagree, 4 = Somewhat agree, 5 = Strongly agree
A. I have already taken many exams in which standard setting was used as the
marking method. 1 2 3 4 5
B. It puts me off that with standard setting a larger number of
questions must be answered correctly in order to pass. 1 2 3 4 5
C. I understand how the scores are calculated on exams with
standard setting as the marking method. 1 2 3 4 5
10. Indicate to what extent the following statements apply to you. Think of the
corporation tax course!
(1 = Never or rarely applicable, 2 = Sometimes applicable, 3 = Applicable half of the time, 4 = Often applicable, 5 = Always applicable)
A. My aim is to pass the course while putting in as little work as
possible. 1 2 3 4 5
B. I am only satisfied once I have studied a chapter enough to form my
own conclusions. 1 2 3 4 5
C. I only study my slides, or what was covered in class, thoroughly. 1 2 3 4 5
D. I find that studying topics in depth is not useful. It is a waste of
time, because you only need a 10 to pass. 1 2 3 4 5
E. I find that I can pass most exams by memorising important parts
rather than trying to understand them. 1 2 3 4 5
F. I test myself on important topics in a course until I understand them
completely. 1 2 3 4 5
G. Studying gives me a feeling of personal satisfaction. 1 2 3 4 5
H. I feel that virtually any topic can be very interesting once I get
into it. 1 2 3 4 5
I. I find most new topics interesting and spend extra time on them to
gain more insight. 1 2 3 4 5
J. When I do not find my course very interesting, I keep my studying to
a minimum. 1 2 3 4 5
K. I simply learn difficult parts of the material by heart and repeat
them until I know everything by heart, even if I do not fully
understand it. 1 2 3 4 5
L. I find that studying can be as interesting as reading a good book or
watching a good film. 1 2 3 4 5
M. I restrict my study to what is specifically indicated, because I think
extra things (such as looking up additional information) are not
necessary. 1 2 3 4 5
N. I work hard at my studies because I find them interesting. 1 2 3 4 5
O. I spend a lot of my free time finding out more about interesting
topics that were covered in the various classes. 1 2 3 4 5
P. I believe that professors should not expect students to spend a lot of
time studying topics that everyone knows will not be examined. 1 2 3 4 5
Q. I go to most exercise classes with specific questions that I want
answered. 1 2 3 4 5
R. I find it important to study the material in the textbook thoroughly
before going to the exercise class. 1 2 3 4 5
S. I see no point in studying topics that will not be asked on the exam
anyway. 1 2 3 4 5
T. Memorising the solutions to the exercises is probably the best way
for me to pass the exam. 1 2 3 4 5
11. Indicate to what extent you agree with the following statements.
1. I chose this field of study because…
(1 = Strongly disagree, 2 = Somewhat disagree, 3 = Neither agree nor disagree, 4 = Somewhat agree, 5 = Strongly agree)
A. I would otherwise regret not having done so. 1 2 3 4 5
B. Others (parents, friends, teachers, …) forced me to. 1 2 3 4 5
C. This was a personally important choice for me. 1 2 3 4 5
D. This field of study interested me. 1 2 3 4 5
2. I pay close attention in class because…
(1 = Strongly disagree, 2 = Somewhat disagree, 3 = Neither agree nor disagree, 4 = Somewhat agree, 5 = Strongly agree)
E. I very much want to deepen my knowledge of the corporation tax
course. 1 2 3 4 5
F. I will feel guilty if I do not. 1 2 3 4 5
G. I want to learn new things. 1 2 3 4 5
H. I am supposed to do so. 1 2 3 4 5
3. I (sometimes) prepared the exercises in advance because…
(1 = Strongly disagree, 2 = Somewhat disagree, 3 = Neither agree nor disagree, 4 = Somewhat agree, 5 = Strongly agree)
I. I would have felt guilty if I had not. 1 2 3 4 5
J. I found it fascinating to prepare the exercises. 1 2 3 4 5
K. Others (parents, friends, teachers, …) expected this of me. 1 2 3 4 5
L. I found it important to prepare these exercises. 1 2 3 4 5
4. I do my very best for the corporation tax course because…
(1 = Strongly disagree, 2 = Somewhat disagree, 3 = Neither agree nor disagree, 4 = Somewhat agree, 5 = Strongly agree)
M. Others (family, friends, …) expect it of me. 1 2 3 4 5
N. I want to give others the impression that I am a good student. 1 2 3 4 5
O. I find corporation tax interesting. 1 2 3 4 5
P. My parents would otherwise be disappointed in me. 1 2 3 4 5
Q. I will otherwise feel bad if I do not achieve the desired score. 1 2 3 4 5
R. I want to achieve high grades on the exam. 1 2 3 4 5
S. I can be proud of myself. 1 2 3 4 5
T. I am supposed to do so. 1 2 3 4 5
Please fill in your student number below:
If you do not have your student card with you and do not know your student number, please fill in
your name.
First name: ………………………………………..………………
Surname:……………………………………………………………..
We emphasise that the data are strictly confidential and that names will be converted into
numbers, so that the data can be processed anonymously.
Thank you for participating in the study!
Appendix 2: Factor loadings and Cronbach’s alpha familiarity
Item | Cronbach’s alpha | Factor loading
Familiarity with retrospective correcting for guessing | 0.47 |
I have already taken many exams which were corrected retrospectively for guessing. | | 0.82
The fact that a larger number of questions has to be answered correctly in case of retrospective correcting for guessing scares me. | | 0.05
I understand how scores are calculated on exams which are corrected retrospectively for guessing. | | 0.82
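As an illustration of how a Cronbach’s alpha such as the one reported above is obtained, the sketch below applies the standard formula, alpha = k/(k-1) × (1 − sum of item variances / variance of the summed score), to item-level responses. It is a minimal sketch and not the procedure used in this study: the function name cronbach_alpha and the example responses are hypothetical and only serve to show the calculation.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    k = len(rows[0])
    if k < 2 or len(rows) < 2:
        raise ValueError("need at least two items and two respondents")
    # Sample variance of each item across respondents.
    item_vars = [variance([row[i] for row in rows]) for i in range(k)]
    # Sample variance of each respondent's summed score.
    total_var = variance([sum(row) for row in rows])
    if total_var == 0:
        raise ValueError("summed scores have zero variance")
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical answers of five respondents to three 5-point Likert items.
example_responses = [
    [4, 2, 5],
    [3, 4, 3],
    [5, 1, 4],
    [2, 3, 2],
    [4, 5, 4],
]
print(round(cronbach_alpha(example_responses), 2))
```

An item that barely relates to the others adds its own variance without adding to the variance of the summed score, which pulls alpha down.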
Appendix 3: Factor loadings and Cronbach’s alphas R-SPQ-2F
Item | Cronbach’s alpha | Factor loading
Deep approach | 0.65 |
2. I find that I have to do enough work on a chapter so that I can form my own conclusion before I am satisfied. | | 0.45
6. I test myself on important topics in a course until I understand them completely. | | 0.47
7. I find that at times studying gives me a feeling of deep personal satisfaction. | | 0.64
8. I feel that virtually any topic can be highly interesting once I get into it. | | 0.49
9. I find most new topics interesting and often spend extra time trying to obtain more insights into them. | | 0.60
12. I find that studying academic topics can at times be as exciting as a good novel or movie. | | 0.41
14. I work hard at my studies because I find the material interesting. | | 0.57
15. I spend a lot of my free time finding out more about interesting topics which have been discussed in different classes. | | 0.40
17. I come to most exercise classes with questions in mind that I want answering. | | 0.33
18. I make a point of studying the course material in the textbook thoroughly before going to the exercise classes. | | 0.49
Surface approach | 0.63 |
1. My aim is to pass the course while doing as little work as possible. | | 0.25
3. I only study seriously what’s given out in class or in the course outlines. | | 0.45
4. I find it not helpful to study topics in depth. It confuses and wastes time, when all you need is a 10 to pass the course. | | 0.56
5. I find I can get by in most examinations by memorising key sections rather than trying to understand them. | | 0.74
10. I do not find my course very interesting, so I keep my work to the minimum. | | 0.34
11. I learn some things by rote, going over and over them until I know them by heart even if I do not understand them. | | 0.56
13. I generally restrict my study to what is specifically set as I think it is unnecessary to do anything extra. | | 0.45
16. I believe that lecturers should not expect students to spend significant amounts of time studying material everyone knows won’t be examined. | | 0.42
19. I see no point in learning material which is not likely to be in the examination. | | 0.40
20. I find the best way to pass examinations is to try to remember the solution of the exercises. | | 0.49
Appendix 4: Factor loadings and Cronbach’s alphas RAI
Item | Cronbach’s alpha | Factor loading: Component 1 | Component 2 | Component 3 | Component 4
Intrinsic | 0.66 | | | |
4. I have chosen this field of study because it interested me. | | 0.115 | 0.146 | 0.029 | 0.805
5. I pay attention in class because I want to deepen my knowledge in the subject of corporation tax. | | 0.040 | 0.886 | 0.097 | -0.070
10. I (sometimes) prepared the exercises in advance, because it fascinated me to prepare them. | | -0.137 | 0.385 | 0.704 | -0.130
15. I do the best I can for the course of corporation tax because I find it interesting. | | 0.090 | 0.876 | 0.071 | 0.094
Identified | 0.38 | | | |
3. I have chosen this field of study because this was an important choice for me. | | 0.130 | 0.059 | -0.078 | 0.755
7. I pay attention in class because I want to learn new things. | | -0.060 | 0.614 | 0.130 | 0.357
12. I (sometimes) prepared the exercises in advance, because I found it important to prepare these exercises. | | -0.031 | 0.161 | 0.817 | 0.090
18. I do the best I can for the course of corporation tax because I want to achieve high grades on the exam. | | 0.420 | 0.296 | 0.168 | 0.278
Introjected | 0.54 | | | |
1. I have chosen this field of study because I would regret it if I had not done it. | | 0.265 | 0.171 | -0.236 | 0.217
6. I pay attention in class because I would feel guilty if I did not. | | 0.465 | -0.282 | 0.305 | 0.217
9. I (sometimes) prepared the exercises in advance, because I would feel guilty if I did not. | | 0.225 | -0.006 | 0.753 | 0.004
14. I do the best I can for the course of corporation tax because I want to give others the impression that I am a good student. | | 0.412 | 0.441 | 0.332 | -0.066
17. I do the best I can for the course of corporation tax because I would feel bad if I did not achieve the desired mark. | | 0.634 | 0.128 | 0.020 | 0.224
19. I do the best I can for the course of corporation tax because I can be proud of myself. | | 0.484 | 0.277 | 0.015 | 0.273
External | 0.69 | | | |
2. I have chosen this field of study because other people (parents, friends, teachers, …) forced me to do so. | | 0.244 | 0.077 | -0.076 | -0.755
8. I pay attention in class because I am supposed to do so. | | 0.611 | -0.313 | 0.043 | 0.015
11. I (sometimes) prepared the exercises in advance, because other people (parents, friends, teachers, …) expect this from me. | | 0.168 | 0.015 | 0.667 | 0.050
13. I do the best I can for the course of corporation tax because other people (family, friends, …) expect this from me. | | 0.635 | 0.104 | 0.363 | -0.268
16. I do the best I can for the course of corporation tax because otherwise I would disappoint my parents. | | 0.650 | 0.215 | 0.112 | -0.211
20. I do the best I can for the course of corporation tax because I’m supposed to do so. | | 0.754 | -0.115 | -0.082 | -0.077