THE EFFECT OF GENDER ON MULTIPLE CHOICE EXAMS: … · 2017-08-04

THE EFFECT OF GENDER ON

MULTIPLE CHOICE EXAMS:

RETROSPECTIVE CORRECTING FOR

GUESSING

Word count: 20,877

Daphné Dejonckheere Student number: 01271215

Supervisor: dr. Evelien Opdecam

Master’s dissertation submitted to obtain the degree of:

Master of Science in Business Economics, Main subject: Business Economics

Academic year: 2016 - 2017

PERMISSION

I declare that the content of this Master’s Dissertation may be consulted and/or reproduced,

provided that the source is referenced.

Daphné Dejonckheere

Summary

Multiple choice exams are a widely known exam format for measuring the knowledge

of students in higher education. Previous literature, however, calls for further exploration

of alternative scoring methods for multiple choice exams, since the two most

commonly used scoring methods, namely “number right scoring” and “correction for guessing”, both have inherent

drawbacks. A major concern with the use of multiple choice questions is that they

may favour particular groups of students. Indeed, earlier research often found

that men have an advantage over women on multiple choice questions.

Significant differences in performance between men and women on multiple choice exams

have been found on exams both with and without correction for guessing. This study tries to

contribute to previous research by examining whether significant differences between the

performance of men and women also occur when an alternative method is used

to score multiple choice exams. The alternative scoring method examined in this

study is called standard setting or “hogere cesuur” (higher cut-off score). With this

method, no marks are lost for a wrong answer, but students have to answer more

than half of the questions correctly in order to pass the exam. Besides

gender, other student characteristics that

may explain differences in performance between students are also taken into account. The study was

carried out among third-year bachelor students of Business Administration. The results show no

significant differences in performance between men and women on exams scored with standard

setting. However, when looking at the scores of

students on the different types of multiple choice questions, the results show that

male students performed clearly better on calculation questions, while women

scored better on application questions. With regard to the other factors, weekly study time

and a surface approach to learning were found to have a significantly positive and a significantly

negative influence, respectively, on performance on exams scored with standard setting. Finally,

the students who attended the last lecture achieved a significantly higher exam score

than the students who were absent.

Abstract

Multiple choice examinations are a widely known exam format for measuring students’

knowledge in higher education. Previous literature calls, however, for further exploration of

alternative scoring methods for multiple choice assessment, since the two most commonly used

scoring methods – “number right scoring” and “negative marking” – both have shown inherent

drawbacks. One major concern with the use of multiple choice questions is that they may favour

particular groups of students. More specifically, prior research has often identified a gender bias in

favour of male students on multiple choice questions. Gender differences in performance on

multiple choice exams have occurred both with and without the use of penalties for wrong

answers. This study tries to contribute to prior research by examining whether a significant

gender effect also exists when an alternative method is used to score multiple choice exams.

The alternative scoring method that will be explored in this study is called retrospective

correcting for guessing (also called “standard setting” or “hogere cesuur” in Dutch). This

scoring method does not penalize wrong answers, but students have to answer more than half

of the questions correctly in order to pass the exam. Besides gender, this study also takes

other student characteristics into account that may explain differences in performance

among students. The study was conducted in a third-year undergraduate course in Business

Administration. The results provide no evidence for significant gender differences in

performance on multiple choice exams that are corrected retrospectively for guessing. When

looking at performance scores on the different types of multiple choice questions though, male

students performed significantly better on calculations, while women outperformed men on

application questions. With regard to other students’ characteristics, this study found that

weekly invested study time and a surface approach to learning respectively have a significantly

positive and negative influence on performance on exams which are corrected retrospectively

for guessing. Finally, students who attended the last lecture achieved significantly higher marks

compared to students who were absent.

Preface

This master thesis can be considered as the final proof of competence for obtaining the Master

of Science degree in Complementary Studies in Business Economics at the University of Ghent.

Several persons have substantially contributed to this master thesis. Therefore, I would like to

take this opportunity to express my gratitude towards those people who have helped me through

this.

First and foremost, I sincerely want to show gratitude towards my promotor, dr. Evelien

Opdecam. She introduced me to this interesting topic and offered me the opportunity to work on

this subject. I also want to thank her for her faith in my capabilities and her regular feedback

and guidance to improve the quality of this study.

Secondly, I would like to thank the students enrolled in the third bachelor year of Business

Administration at our university who completed my survey. Without their participation,

it would have been impossible to investigate this topic. Their answers formed the backbone of this

study.

Thirdly, I want to thank my parents for the opportunity to follow this education and their

continuous support, even during hard times.

Lastly, I owe gratitude to Lukas, my boyfriend, for his moral support, comforting words and

sincere interest in my work. He also carefully read my text and corrected the grammatical

mistakes I made.


Table of contents

Introduction

1 Theoretical framework

1.1 Multiple choice examinations in higher education

1.2 The influence of gender on performance on multiple choice exams

1.3 Scoring methods for multiple choice assessment

1.3.1 Conventional scoring methods

1.3.1.1 Number right (NR) scoring

1.3.1.2 Negative marking (NM)

1.3.2 Retrospective correcting for guessing

1.4 Other explanatory factors of performance

1.4.1 Prior experience, familiarity and preference

1.4.2 Lesson attendance

1.4.3 Study time

1.4.4 Students’ perceptions about course difficulty

1.4.5 Learning approaches

1.4.6 Motivation

2 Research design & methodology

2.1 Research goal & questions

2.2 Research techniques

2.2.1 Surveys

2.2.1.1 Sample

2.3 Measurement

2.3.1 Dependent variables: performance

2.3.2 Independent variables

2.3.3 Control variable

2.4 Analysing the results

2.4.1 Independent samples T-test

2.4.2 Regression analyses

3 Research findings

3.1 Descriptive statistics

3.2 Correlations

3.3 Gender differences

3.4 Hypotheses testing

3.4.1 Hypothesis 1

3.4.2 Hypotheses 2





3.4.7 Robustness check

4 Discussion

4.1 Limitations

4.2 Future research

5 Conclusion

Bibliography

Appendices

Appendix 1: Survey

Appendix 2: Factor loadings and Cronbach’s alpha familiarity

Appendix 3: Factor loadings and Cronbach’s alphas R-SPQ-2F

Appendix 4: Factor loadings and Cronbach’s alphas RAI

List of used abbreviations

CR Constructed-response

MC Multiple choice

NM Negative marking

NR Number right (scoring)

RAI Relative autonomy index

R-SPQ-2F Revised two factor study process questionnaire

SDT Self-determination theory

SPQ Study process questionnaire

SRQ Self-regulation questionnaire

VIF Variance inflation factor

List of tables and figures

TABLES:

Table 1: Literature review regarding the influence of students’ characteristics on

performance

Table 2: Descriptives performance on the exam

Table 3: Frequencies gender

Table 4: Descriptives familiarity, preference, perceptions course difficulty, learning

approaches & ability

Table 5: Frequencies times participated in the exam

Table 6: Frequencies lesson attendance (exercises)

Table 7: Frequencies lesson attendance (theory)

Table 8: Frequencies weekly reported study time (excl. lessons)

Table 9: Frequencies quadrants of learning approaches

Table 10: Correlation table

Table 11: Gender differences (Independent samples T-test & Mann-Whitney U-test)

Table 12: ANCOVA for gender differences in performance (control variable: ability)

Table 13: Regression of familiarity with retrospective correcting for guessing on performance

Table 14: Regression of preference of scoring method on performance

Table 15: Regression of lesson attendance on performance

Table 16: Additional t-test regarding attendance of last course

Table 17: Regression of time weekly spent on performance

Table 18: Regression of perceptions about course difficulty on performance

Table 19: Regression of learning approaches on performance

Table 20: Regression of all the independent variables on performance with retrospective

correcting for guessing

Table 21: Regression of all the independent variables on performance on theoretical questions

Table 22: Regression of all the independent variables on performance on calculations

Table 23: Regression of all the independent variables on performance on application

questions

FIGURES:

Figure 1: The internalization continuum depicting the various types of extrinsic motivation

posited within self-determination theory

Figure 2: Histogram performance with retrospective correcting for guessing (mark on 40)

Figure 3: Plot of the learning approaches (Mean-split)

Figure 4: Mean scores on the different types of MC questions (mark on 10)

Introduction

Since 1953, spectacular growth in higher education enrolments has been observed in Belgium, which

has had implications for the format of examinations. As courses are followed by larger groups of students,

instructors have to score a considerable number of exams (Duchesne & Nonneman, 1998). Grading

these exams can be very time-consuming for instructors. Consequently, many

constructed-response (CR) tests have been replaced by multiple choice (MC) examinations, for

which computerized evaluation is possible (Kastner & Stangl, 2011). As examinations in higher

education mainly aim at extracting the knowledge of students from their responses, test scores have

to reflect the “true” level of knowledge mastery of students. Hence, a lot of education literature

concentrated on scoring methods for these MC test formats (Lesage, Valcke, & Sabbe, 2013).

This debate has, however, been rather one-sided, concentrating on the two most commonly used

scoring methods: number right (NR) scoring versus negative marking (NM). Results of former

studies indicated that neither method seems to meet expectations and that both have inherent

drawbacks with regard to test validity and reliability. Whereas the major problem with NR scoring

is that students can gain marks through guessing, the use of penalties under NM is said

to favour particular groups of students (Lesage, Valcke, & Sabbe, 2013). Several authors speak of

a gender bias, as implementing correction for guessing leads male and female students to omit

different numbers of items and consequently to differences in performance (Betts,

Elder, Hartley, & Trueman, 2009). Other drawbacks of the correction for guessing format have also

been mentioned in the literature and will be discussed later on.

Therefore, a growing need arises to explore alternative approaches for scoring MC exams in order

to inform and support instructors and other test designers (Lesage, Valcke, & Sabbe, 2013).

However, when exploring the literature, a substantial gap appears with regard to these “non-conventional”

scoring methods. Therefore, this study will try to contribute to previous research by

switching the focus to a non-conventional scoring method: retrospective correcting for guessing.

The lack of research devoted to this alternative approach, combined with the fact that

the University of Ghent decided in 2014 to replace the correction for guessing with this non-conventional

method (also known as “standard setting” or “hogere cesuur”), are the main reasons

for choosing this scoring method as the main focus of this dissertation. Nevertheless, it should be

acknowledged that there still exist other scoring methods, which may also benefit from additional

research.

This study will analyse whether differences in marks on multiple choice exams that are corrected

retrospectively for guessing can be attributed to different characteristics of students. The main

question is whether a gender difference in performance also appears when the retrospective

correcting for guessing scoring method is applied and, if so, how large that gender effect is.

However, the possible influence on performance of other variables, such as lesson attendance, invested study

time and learning approaches, will also be discussed. The research is carried out on a

sample of third-year bachelor students in Business Administration (“handelswetenschappen”) following a

course on corporation tax (“vennootschapsbelasting”) at the University of Ghent.

The remainder of this study is organized as follows. First, an outline of relevant literature regarding

previous research on multiple choice (MC) examinations and their scoring methods will be given.

This literature study will lead to the formulation of hypotheses. Next, details concerning the

methodology and the data used in this study will be given. Thirdly, the results of the analyses will

be presented and discussed, which will lead to either the confirmation or rejection of the hypotheses.

Finally, conclusions of this study are contained in the last section.

1 Theoretical framework

1.1 Multiple choice examinations in higher education

Multiple-choice (MC) examinations have become a widespread evaluation tool within higher

education. MC questions have a stem and a set of possible answers from which examinees have to

select the correct answer(s). Contrary to MC tests, constructed-response (CR) questions require

students to independently formulate their own answers, which might be a short answer, an essay, a

diagram, an explanation of a procedure or a solution to a mathematical question (Kastner & Stangl,

2011). The frequent use of MC examinations can be observed in different disciplines such as

accounting (Bible, Simkin, & Kuechler, 2008; Arthur & Everaert, 2012), economics (Chan &

Kennedy, 2002; Du Plessis & Du Plessis, 2007), psychology (Betts, Elder, Hartley, & Trueman,

2009), information technology (Woodford & Bancroft, 2004) and mathematics (Beller & Gafni,

2000).

This increasing use of MC examinations can be attributed to several benefits this format offers in

comparison to CR tests. The main advantages for instructors include the possibility of covering a broad

range of subjects in a single examination, even for large cohorts of students, and greater efficiency

and reliability in scoring (Betts, Elder, Hartley, & Trueman, 2009; Kastner & Stangl, 2011). For

students, the most important benefits of MC exams are the following: the perception that this scoring

method is more objective, the fact that their writing skills and writing speed are not determining

factors, and a heightened confidence in their ability to improve their marks by making correct

guesses or uncovering the solution by a process of elimination (Bible, Simkin, & Kuechler, 2008).

Nevertheless, prior literature also mentioned several drawbacks of MC tests. First of all, the

possibility of gaining marks through lucky guessing is a prominent concern among researchers

regarding the reliability of MC examinations (Betts, Elder, Hartley, & Trueman, 2009). A second

concern is whether these tests assess the same level of understanding as CR tests. A third

disadvantage relates to potential ambiguity in MC questions themselves (Tsui et al., in: Bible,

Simkin, & Kuechler, 2008). Fourthly, it has been argued that these tests do not adequately measure

students’ critical, communication and analytical skills, although these skills are actively encouraged

in higher education and essential in preparing students for future employment (Bible, Simkin, &

Kuechler, 2008). Fifthly, it is argued by some authors that MC examination typically promotes

‘surface’ rather than ‘deep’ learning as MC questions may encourage students to memorize subject

matter instead of understanding concepts (Williams & Clark, in: Betts, Elder, Hartley, & Trueman,

2009). Finally, the debate whether the use of MC questions favours particular groups of students

presents a particular case in research literature. More specifically, several studies identified a gender

bias in favour of male students with MC questions. However, findings about this issue have not been

consistent and will be discussed in the next section.

1.2 The influence of gender on performance on multiple choice exams

Considerable attention in prior literature has been devoted to the question whether exam format

matters when measuring student performance. It is often argued that, depending on personal

characteristics, some students are predisposed to perform better on a certain mode of assessment

(Krieg & Uyar, 2001). Especially, the relationship between gender and student performance on MC

examinations has been a prominent research focus in education literature. In what follows, prior

findings about the relationship between gender and performance on MC exams will be discussed.

A substantial amount of local and international research has identified a gender bias in favour of

male students with MC examinations. Bias or systematic error appears when it is impossible to

measure all subgroups of the population in the same way. Consequently, gender bias can be defined

as a systematic error in the measurement of differences in skills between men and women

(Willingham & Cole, 1997). Several studies found that MC questions favour male students

over female students. For instance, this effect was confirmed in accounting examinations by

Arthur & Everaert (2012). Although women outperformed their male counterparts in both MC and

CR exam formats, their advantage was smaller on MC questions than on CR

questions. For both theory and exercise MC questions, male students seemed to have a relative

advantage over females. Research by Krieg & Uyar (2001), investigating the importance of

exam structure in economics, also found that male students are predisposed to perform better on MC

exams.

Leaver & van Walbeek (2006) supplemented prior research by examining whether certain types of

MC questions induced a stronger “gender bias” than other types. Therefore, they examined whether

the gender difference could be explained by either the content type or degree of cognitive reasoning

needed in order to answer questions. With regard to content, they divided questions into five categories:

1) Quantitative questions (i.e. calculations)

2) Qualitative questions (i.e. descriptions, definitions)

3) Specific graphical questions (i.e. finding a specific solution based on a graph)

4) General graphical questions (i.e. general shifts of curves based on a graph)

5) Factual questions (i.e. general knowledge or current affairs)

Secondly, they classified questions according to the level of cognitive reasoning by making use of

the Cognitive Model of Bloom’s taxonomy. The main idea of this taxonomy is that educational

objectives can be organized in a hierarchy from less to more complex. The six levels are successive,

meaning that one level has to be mastered before a following level can be attained. However, there

is only consensus that the first four classes of the taxonomy form a hierarchy, while there is some

disagreement about the fifth and sixth categories. Most MC questions can, however, be classified

as belonging to one of the first four categories. The six levels of cognitive reasoning include:

1) Knowledge: recalling or recognizing previously learned information

2) Comprehension: understanding the meaning of information, interpreting information

3) Application: using information in new situations

4) Analysis: examining and dividing information into component parts

5) Synthesis: integrating or combining information

6) Evaluation: assessing the value of information (Leaver & van Walbeek, 2006)

The results of this research indicated that female students were outperformed by male students, and

this finding holds both for questions categorized according to content and for questions

classified according to Bloom’s taxonomy. With regard to content, females appeared to be at a

disadvantage for all five categories, but to a larger extent in case of quantitative and graphical

questions. In Bloom’s taxonomy, the gender difference becomes more prominent at higher levels:

the more complex the questions, the more likely women were to answer them incorrectly (Leaver &

van Walbeek, 2006).

Apart from a gender bias in favour of male students on MC questions, some studies found evidence

for a positive female gender effect on CR tests. Research by Du Plessis & Du Plessis (2007), for

instance, could not confirm the strong claims about the gender bias in favour of male students in

MC examinations, but showed a positive female gender effect on performance in the case of CR

questions. Bible, Simkin, & Kuechler (2008) also found a small but significant positive relation

between being female and performance on CR questions. Their research indicated that women have a

four percent advantage over men on CR questions.

There is also a stream of literature that contradicts the findings listed above. Wester &

Henrikkson (2000) made use of identical items in different exam formats to investigate performance

in mathematics. Women performed slightly better than men for the MC items and this difference in

performance remained the same for CR questions. Hence, no significant changes in gender

differences were found when the exam format was changed. The study of Hartley, Betts, &

Murray (2007), which compared the scores of male and female final-year psychology students

across different modes of assessment, likewise found that women performed significantly better than men on all

modes of assessment (including MC exams). Finally, there are also studies that found no gender

differences in performance at all. Chan & Kennedy (2002), for instance, found no significant

differences between the performance of male and female students both on MC and CR tests.

1.3 Scoring methods for multiple choice assessment

Up to now, the way MC exams are scored has not been taken into account, though a variety of

options exists. An important discussion in this field concerns the question whether a penalty for

wrong answers should be used or not. Consequently, prior literature mainly concentrated on two

widely used methods: number right (NR) scoring versus negative marking (NM). These

conventional scoring methods will be discussed in the next subsection. Since both methods

show inherent benefits as well as disadvantages and no empirical evidence exists that helps to direct

the choice between both, an alternative approach will be explored as well. This alternative approach

is nowadays applied at Ghent University and will be the main focus of this master thesis.

1.3.1 Conventional scoring methods

1.3.1.1 Number right (NR) scoring

Number right scoring is one of the simplest scoring methods: it rewards right answers with

a positive value, while incorrect or omitted answers are scored with a value of zero (Lesage, Valcke,

& Sabbe, 2013).

Multiple choice exams that use the NR scoring method also seem a very good alternative to

constructed-response tests, since they follow much the same logic. Neither exam format

penalizes wrong answers, while most other scoring rules (e.g. negative marking) are found to be more

strict than NR scoring and CR tests (Kastner & Stangl, 2011).

However, one of the major drawbacks of this scoring method is that students can answer

correctly through guessing. Consequently, the reliability and validity of test scores decrease, as

instructors are not able to distinguish guessed answers from answers based on knowledge

(Bar-Hillel, Budescu, & Attali, in: Lesage, Valcke, & Sabbe, 2013). Still, the frequency of blind

guessing may be substantially overestimated. Students rarely resort to blind guessing, which refers

to purely random guessing in which each answer option has an equal chance of being

chosen. Moreover, blind guessing alone is not likely to result in high grades. As Downing (2003)

put it: “the odds of achieving a perfect score on a test through random guessing alone

approach the odds of winning the lottery” (p. 670). In the case of informed guessing, on the other

hand, students use their partial knowledge in eliminating incorrect answers in order to improve their

chance of picking the correct answer (Downing, 2003).
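The point about blind guessing can be illustrated numerically. The sketch below is only an illustration: it assumes a hypothetical exam of 40 items with 4 answer options each (figures not taken from the cited studies), and the function names are invented for this example. It computes the expected NR score from blind guessing and the probability of reaching a given number of correct answers by chance alone.

```python
from fractions import Fraction
from math import comb


def expected_blind_score(num_items: int, num_options: int) -> Fraction:
    """Expected number of correct answers when every item is guessed at random."""
    return Fraction(num_items, num_options)


def prob_at_least(threshold: int, num_items: int, num_options: int) -> Fraction:
    """Probability of at least `threshold` correct answers through blind guessing alone."""
    p = Fraction(1, num_options)  # chance of guessing a single item correctly
    return sum(
        comb(num_items, k) * p**k * (1 - p) ** (num_items - k)
        for k in range(threshold, num_items + 1)
    )


if __name__ == "__main__":
    # Hypothetical exam: 40 items, 4 options per item (assumed values).
    items, options = 40, 4
    print("expected score:", expected_blind_score(items, options))
    print("P(at least half correct):", float(prob_at_least(20, items, options)))
    print("P(perfect score):", float(prob_at_least(items, items, options)))
```

Under these assumptions a blind guesser expects a quarter of the marks, while a perfect score requires all 40 guesses to succeed, which is consistent with Downing's lottery comparison.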

1.3.1.2 Negative marking (NM)

Since student guessing has been an issue since the beginning of the MC format usage, correction for

guessing or negative marking (NM) is nowadays frequently incorporated in MC exams (Betts, Elder,

Hartley, & Trueman, 2009). The predominant correction model within this method is ‘rights

minus wrongs’. Contrary to NR scoring, incorrect answers are penalized by deducting a fraction

of a mark. Mostly, the penalty for an incorrect answer is $\frac{1}{n-1}$, with $n$ representing the number of

choices. However, whole marks are also sometimes deducted for incorrect answers (Lesage, Valcke,

& Sabbe, 2013).
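The two conventional scoring rules can be written down compactly. The following is a minimal sketch under stated assumptions: every item is worth one mark, omitted items score zero under both rules, and the NM penalty is the standard 1/(n−1) described above. The function names are hypothetical and not taken from the cited literature.

```python
from fractions import Fraction


def nr_score(correct: int) -> Fraction:
    """Number right scoring: one mark per correct answer, nothing else counts."""
    return Fraction(correct)


def nm_score(correct: int, wrong: int, num_options: int) -> Fraction:
    """'Rights minus wrongs' with the standard penalty of 1/(n-1) per wrong answer."""
    if num_options < 2:
        raise ValueError("an MC item needs at least two options")
    penalty = Fraction(1, num_options - 1)
    # Omitted items are neither rewarded nor penalized, so they do not appear here.
    return correct - penalty * wrong


if __name__ == "__main__":
    # Hypothetical student: 20 right, 6 wrong, 14 omitted on a 4-option exam.
    print("NR:", nr_score(20))
    print("NM:", nm_score(20, 6, 4))
```

As a usage example, a student who blindly guesses all 40 items of a 4-option exam expects 10 right and 30 wrong answers, which nets exactly zero under this penalty; that is the sense in which 1/(n−1) "corrects" for guessing.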

As the introduction of negative marking is believed to discourage students from guessing, this method

would result in higher reliability and validity of test scores (Muijtjens et al., in: Lesage, Valcke, &

Sabbe, 2013). The test scores would then be a more reliable reflection of a student’s capability.

Nevertheless, other problems seem to arise when making use of this scoring method.

First of all, it is argued that this method seems to miss its goal as it does not overcome the guessing

problem, but instead introduces new tensions. One student may show greater risk seeking behaviour

by trying to guess the correct answers more frequently, while another student may be more risk

averse and show a higher tendency to omit items when he or she is not sure. Hence, risk averse

students may be disadvantaged while they might have equal ability levels as their fellow students

who frequently dare to guess. The focus may shift away from measuring students’ knowledge

towards measuring students’ answering strategies and risk taking behaviour (Bar-Hillel, Budescu,

& Attali, in: Lesage, Valcke, & Sabbe, 2013).

A second disadvantage related to the guessing problem concerns the instructions that should be given to

students in advance. When negative marking was introduced, students were instructed not to guess

at all. Later on, however, it was stated that students should be advised to guess when they could

eliminate one or more alternative options (Betts et al., in: Lesage, Valcke, & Sabbe, 2013). It is clear

that this challenges the original underlying principle of this scoring method, namely discouraging

guessing. Since students will react differently and inconsistently, it can be concluded that instructors

have to be very cautious when instructing students whether to guess or not. Formulating instructions

that are beneficial to all students seems to be a very difficult task (Budescu & Bar-Hillel, in: Lesage,

Valcke, & Sabbe, 2013). Figuring out the optimal decision strategy under negative marking is

challenging for students as well (Lesage, Valcke, & Sabbe, 2013).
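Why guessing pays once at least one option has been eliminated can be made explicit with a short expected-value calculation. This is an illustrative sketch rather than a derivation from the cited studies. It assumes one mark per item, the standard penalty of $\frac{1}{n-1}$, and a risk-neutral student who has correctly ruled out $m$ of the $n$ options and guesses at random among the remaining $n-m$.

```latex
\begin{align*}
E[\text{gain}] &= \frac{1}{n-m}\cdot 1 \;-\; \frac{n-m-1}{n-m}\cdot\frac{1}{n-1} \\
               &= \frac{(n-1)-(n-m-1)}{(n-m)(n-1)} \\
               &= \frac{m}{(n-m)(n-1)}
\end{align*}
```

For $m=0$ the expected gain is zero, so blind guessing is neutral on average, while for any $m \geq 1$ it is positive (for example $\frac{1}{9}$ of a mark with $n=4$ and $m=1$). A risk-averse student may nevertheless prefer to omit the item, which is the tension described above.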

Thirdly, there is also disagreement in literature about the optimal penalty that should be attached to

incorrect answers. Some are in favour of a penalty exceeding the standard penalty of $\frac{1}{n-1}$ (Budescu

& Bar-Hillel, in: Lesage, Valcke, & Sabbe, 2013). A higher penalty can be justified since, although

it may discriminate against risk averse students, this effect is negligible compared to the

measurement error it prevents (Bible, Simkin, & Kuechler, 2008).

Finally, implementing a penalty to discourage guessing behaviour may also be detrimental for

students’ performance. As correction for guessing increases the number of unanswered questions,

lower final grades are a logical, immediate consequence. Research by Betts et al. (2009) found,

however, that this detrimental effect of correcting for guessing only appears in case of closed-book

MC examinations. In case of open-book examinations, the implementation of a penalty does not

lead to significantly poorer performance. Furthermore, the implementation of correcting for

guessing may also lead to gender differences in performance, as men and women have shown

different risk patterns resulting in leaving a different number of questions unanswered.

Consequently, a recent stream of literature concentrated on the debate whether the use of penalties

in MC assessment induces a(n) (additional) gender bias or not. Again, mixed results can be observed

and will be discussed in the next paragraph.

1.3.1.2.1 Gender differences in risk aversion

The use of penalties in MC exams inevitably results in a higher number of omitted items. Differences

between students in the tendency to omit items have been explained by their attitudes towards risk:

more risk averse students omit more items compared to less risk averse students (Espinosa &

Gardeazabal, 2010). Accordingly, students with a lower degree of risk aversion can obtain a higher

score on MC examinations that penalize wrong answers, while more risk averse students suffer a

disadvantage with this kind of exams (Marín & Rosa-García, 2011). As women are more risk averse

than men, it is sometimes argued that this type of examination involves a discrimination against

women.

Persistent differences in the number of questions left unanswered between male students and female

students were, for instance, found in the study of Marín & Rosa-García (2011). They observed a

higher risk aversion in women, as they consistently answered fewer questions than men.

Though women obtained lower scores in comparison to men, these differences in marks were very

small and mostly insignificant. Nevertheless, they concluded that a discrimination against women

exists with this type of MC examinations due to their higher tendency to omit items compared to

men.

There are several studies, however, that found no gender differences at all concerning the degree of

risk aversion in MC exams. Research of Betts et al. (2009), for instance, found that men and women

omitted an approximately equal percentage of questions. Also the experiment of Du Plessis &

Du Plessis (2007) revealed no evidence of gender differences in the level of risk aversion. Their

experiment, however, yielded another interesting result: a significant difference was found between

the success of guessing by men and women. Male students guessed correctly significantly more often

on MC versions of questions that were found difficult to answer in written form.

We can conclude that both these conventional scoring methods affect the test reliability in a negative

way. To overcome the weaknesses of both methods, increasing the number of questions in exams

as well as the number of alternative options for each question may offer a solution. However, in

higher education settings, this is not always a feasible solution since the time given to students for

completing an exam is mostly restricted. Moreover, test developers may also be confronted with

new difficulties when they have to think about additional item options. When these extra alternative

options are not able to act as effective distractors, they will not be able to discourage guessing

behaviour (Lesage, Valcke, & Sabbe, 2013).

1.3.2 Retrospective correcting for guessing

Due to the weaknesses of conventional scoring methods, alternative scoring methods will have to

be explored that may overcome these shortcomings. The ultimate goal of test designers is to find an

optimal balance between high reliability as with NM, while at the same time avoiding bias due to

risk-taking behaviour in case of NR scoring (Muijtjens et al., in: Lesage, Valcke, & Sabbe, 2013).

Also Ghent University has decided to no longer use the NM format of correction for guessing.

Reasons for abolishment were manifold and included, amongst others, the fact that differences in

students’ guessing behaviour may cause differences in final grades and the observation that students were too

occupied with tactical considerations whether to answer questions or not (Universiteit Gent,

2017).

Since the academic year 2014 – 2015, the NM scoring method has been replaced by what is called

‘hogere cesuur’ or ‘standard setting’. A standard can be defined as a score that indicates a boundary

between those who perform well enough and others who do not (Norcini, 2003). Similar to NM

scoring, a correct answer will be rewarded with a positive score. Contrary to NM scoring, wrong

and absent answers will no longer be penalized with negative marks, but will be given a value of

zero. Afterwards, a recalculation of the grades follows where one has to answer more than half of

the questions correctly to pass the exam. This method allows students to fully concentrate on the

content during exams instead of considering whether or not to answer a question (Universiteit Gent,

2017). This alternative scoring method will be the focus of this master thesis. It should be

acknowledged, however, that there still exist alternative non-conventional scoring methods (e.g.

partial-credit scoring methods), but these fall outside the scope of this dissertation.

By exploring educational literature, the term ‘retrospective correcting for guessing’ can be

encountered. According to this format, students are encouraged to answer every question since an

omitted answer is considered as an incorrect answer. The correction for guessing is then

implemented afterwards, or retrospectively, hence the term ‘retrospective correcting for guessing’.

Based on an estimation of guessing behaviour of students, scores are corrected. On the one hand,

this method penalizes blind guessing, which is clearly an advantage compared to NR scoring. On

the other hand, risk taking behaviour of the students becomes irrelevant since students benefit from

answering all the questions as the expected mark for responding cannot be lower than omitting

(Lesage, Valcke, & Sabbe, 2013). With the introduction of ‘hogere cesuur’, the University of Ghent

emphasizes these underlying principles.

In this master thesis, the term ‘retrospective correcting for guessing’ will be preferred over

‘standard setting’, the English rendering of the Dutch term ‘hogere cesuur’. The reason is mainly to avoid confusion, as standard setting

comprises two main categories in educational literature, being norm-referenced and criterion-

referenced assessment. Since especially one of the two categories differs conceptually from the

Dutch understanding of ‘standard setting’, the term should be used with caution in this context.

Whereas the norm-referenced form of assessment shows no similarities with the method applied at

Ghent University, the criterion-referenced option does closely relate to what we call ‘hogere

cesuur’.

Norm-referenced assessments or relative methods are an evaluation form in which an examinee’s

performance is compared to that of the current group of students participating in the test. Norm-

referenced standard setting is thus based on test results, with the performances of an appropriate

peer group (i.e. ‘norm group’) as the point of reference. This form of assessment is mainly used to

rank students rather than to measure individual performance against a standard or criterion.

Consequently, standards in this format will vary depending on group differences (Cohen‐Schotanus

& Van der Vleuten, 2010; Lesage, Valcke, & Sabbe, 2013).

A second category consists of the criterion-referenced assessments or absolute methods, which are

designed to measure performance of students against a specified achievement level. A pre-fixed cut-off

score is defined, which makes it possible to take the effect of guessing into account. This can be illustrated

with an example: the standard of a multiple choice test comprising 40 questions with 4 options can

be set at 50%. Since some questions may be answered correctly by randomly guessing, the passing

score can be increased to 25 out of 40 questions (Cohen‐Schotanus & Van der Vleuten, 2010;

Lesage, Valcke, & Sabbe, 2013). This form of standard setting is most appropriate for tests of

competence, where the goal is to ensure that the examinees have sufficient knowledge for a particular

purpose (Norcini, 2003). This form of standard setting is independent of test results, but can cause

variation in failure rates, merely as a function of test difficulty (Cohen‐Schotanus & Van der

Vleuten, 2010; Lesage, Valcke, & Sabbe, 2013). It is clear that the principles of this type of

assessment are applied by the University of Ghent.

It should, however, be recognized that there are still drawbacks related to this alternative scoring

method. First, it remains difficult to justify the fact that students are forced to guess when they

do not know the answer for sure. In disciplines where it is of utmost importance to know the answer

with certainty, which is for instance the case in the medical training of doctors, this method

might not seem very appropriate. Furthermore, another concern may be the process of setting a cut-

off score (Lesage, Valcke, & Sabbe, 2013).

The higher passing score has to be determined in a way that the probability that a student passes a

test through guessing is similar in case of negative marking as well as in case of standard setting. At

Ghent University, teachers can make use of the standard formula for setting a higher cut-off score

or may determine the passing score themselves. The standard formula takes into account the

likelihood that students guess the correct answers, which depends on the number of choices (n). The

standard formula can be written as follows (Universiteit Gent, 2017):

$$\sum_{i=1}^{N} \frac{n_i + 1}{2 n_i} \, W_i$$

$N$ reflects the number of questions

$n_i$ reflects the number of choices per question

$W_i$ reflects the weights assigned to each question
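As an illustration, the standard formula can be sketched in a few lines of Python. This is a minimal sketch and not the university's own implementation: the function name `cutoff_score` and its arguments are hypothetical, and a weight of 1 per question is assumed whenever no weights are supplied.

```python
from fractions import Fraction


def cutoff_score(num_choices, weights=None):
    """Raised cut-off: the sum over all questions of (n_i + 1) / (2 * n_i) * W_i.

    num_choices -- list with the number of answer options n_i per question
    weights     -- list with the weight W_i per question (assumed 1 if omitted)
    """
    if weights is None:
        weights = [1] * len(num_choices)  # assumption: unweighted questions
    if len(weights) != len(num_choices):
        raise ValueError("need exactly one weight per question")
    # Fractions keep the arithmetic exact, so 40 * 5/8 is exactly 25.
    return sum(Fraction(n + 1, 2 * n) * w for n, w in zip(num_choices, weights))


# 40 questions with four options each and unit weights
print(cutoff_score([4] * 40))  # prints 25
```

For a test of 40 questions with four options each, every term equals 5/8, so the sketch returns a cut-off of 25 correct answers.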

Subsequently, the number of correctly answered questions has to be converted to a final grade.

Students just reaching the cut-off score will obtain a 10/20. The maximum score will be given if a

student has answered every question correctly and a zero is given when all questions were answered

wrongly or not answered at all. In order to calculate the final grade, the following formula can be

used:

$$z = 10 + \frac{10}{N - c}\,(y - c)$$

z reflects the final grade of the student

N reflects the number of questions

c reflects the higher passing grade

y reflects the number of correctly answered questions
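The conversion can be sketched in Python as follows. The branch at or above the cut-off implements the formula as stated. The text does not spell out how scores below the cut-off are converted, so the linear scaling between 0 and 10 used in that branch is purely an assumption, chosen because it yields a zero for no correct answers and a 10/20 at the cut-off. The function name `final_grade` is hypothetical.

```python
def final_grade(y, n_questions, c):
    """Convert y correct answers into a grade out of 20.

    y           -- number of correctly answered questions
    n_questions -- total number of questions (N)
    c           -- raised cut-off score, with 0 < c < N
    """
    if not 0 < c < n_questions:
        raise ValueError("cut-off must lie strictly between 0 and N")
    if not 0 <= y <= n_questions:
        raise ValueError("y must lie between 0 and N")
    if y >= c:
        # Formula from the text: z = 10 + 10 / (N - c) * (y - c)
        return 10 + 10 * (y - c) / (n_questions - c)
    # ASSUMPTION (not given in the source): linear scaling from 0 to 10
    # below the cut-off, so that y = 0 maps to 0 and y = c maps to 10.
    return 10 * y / c


for correct in (0, 25, 31, 40):
    print(correct, final_grade(correct, n_questions=40, c=25))
```

With 40 questions and a cut-off of 25, a student with exactly 25 correct answers receives 10/20 and a student with 40 correct answers receives 20/20, as described in the text.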

For example, in case of 40 MC questions with four answer options for each question, the chance of

guessing the correct answer is 25%, which corresponds to ten questions. Half of the remaining

30 questions is 15. So 25 (i.e. 10 + 15) of the 40 questions have to be answered correctly to get a

score of 10/20.
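To get a rough feeling for how demanding such a cut-off is for a student who guesses blindly, the probability of reaching it by chance alone can be computed from the binomial distribution. The sketch below assumes that every question is answered by an independent random guess with a probability of 1/n of being correct. This is a simplification introduced here for illustration, not a model taken from the cited sources, and the function name is hypothetical.

```python
from math import comb


def prob_reach_cutoff_by_guessing(n_questions, n_choices, cutoff):
    """P(at least `cutoff` correct) when all answers are blind guesses.

    ASSUMPTION: guesses are independent and each is correct with
    probability 1 / n_choices (binomial model).
    """
    p = 1 / n_choices
    return sum(
        comb(n_questions, k) * p**k * (1 - p) ** (n_questions - k)
        for k in range(cutoff, n_questions + 1)
    )


# Compare the unadjusted 50% standard (20 of 40) with the raised cut-off (25 of 40).
print(prob_reach_cutoff_by_guessing(40, 4, 20))
print(prob_reach_cutoff_by_guessing(40, 4, 25))
```

Under these assumptions, raising the passing score from 20 to 25 correct answers makes it markedly less likely that a student passes through blind guessing alone.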

Ghent University already performed a first evaluation of the implementation of the new

scoring method, especially regarding the impact of the method on exam scores. An essential finding

of that study was that the transition to the new system mainly appears to benefit female students.

The study revealed that students with a low tendency to guess have significantly higher final grades

in case this new scoring method is applied in contrast to the NM scoring method. For students with

a high guessing tendency, the transition seems to make no difference. The students with lower

tendencies to guess are mainly women. Women achieve a slightly larger gain in marks out of 20 when

this scoring method is being used: +0.89 in comparison to +0.46 for male students. Also the

percentage of graduated female students increases, while that of male students remains the same (Van

de Poele & Sabbe, 2016). However, it should be noted that this study provides no evidence

that female students now outperform men on MC exams which are corrected retrospectively for

guessing. It only shows that women benefit more from the transition in marking system than men.

Nevertheless, for this study, the following hypothesis will be tested regarding the relationship between

gender and performance:

Hypothesis 1: Female students perform significantly better than male students when MC

examinations are corrected retrospectively for guessing. However, a gender effect in favour of

women will be weaker for questions belonging to higher levels of Bloom’s taxonomy.

1.4 Other explanatory factors of performance

Besides gender, previous research has also related other factors to superior examination

performance. In this section, other students’ characteristics are discussed that may contribute to

higher performance. For each of these characteristics, a hypothesis about the possible

relationship with performance on MC examinations, that are corrected retrospectively for

guessing, is formulated. As research about this marking method for MC assessment is very

scarce, the formulation of the hypotheses is based on literature about general exam

performance, regardless of the exam format and scoring method being used. In table 1, a literature

review can be found of the consulted studies regarding the influence of different students’

characteristics on performance.

1.4.1 Prior experience, familiarity and preference

First of all, one’s chance to perform relatively better on MC examinations may be enhanced by

one’s prior experience in taking such exams. Krieg & Uyar (2001) examined whether students

who retake a course have a propensity to perform better on the MC exam of that course. They

expected a positive effect of retaking a course on performance, as those students have had prior

exposure to the course material and feel a greater pressure to succeed. They indeed concluded

that repeating a course has a significantly positive influence on performance on the MC exam

as the students in question possess added experience with similar MC questions. This variable

of repeating a course may reflect one’s experience in taking similar MC exams and will,

therefore, also be used in this research. Furthermore, past success or proven ability in taking

MC exams may familiarize students with this type of examination. Consequently, these students

may achieve higher grades in contrast to others who are not comfortable with this type of

examination. Therefore, this study will measure how familiar students feel with retrospective

correcting for guessing as the current marking method for MC exams and how it affects their

score. In addition to this, I expect that students who prefer this alternative scoring method over

the NM scoring method also perform better compared to those who prefer NM. Therefore,

students will be asked which of these two scoring methods they prefer. Consequently, the

following hypotheses will be tested:

Hypothesis 2a: Repeating a course is associated with higher performance on MC

examinations, that are corrected retrospectively for guessing.

Hypothesis 2b: Familiarity with MC examinations that are corrected retrospectively for

guessing is associated with higher performance on MC examinations where this scoring

method is applied.

Hypothesis 2c: Preference for MC examinations that are corrected retrospectively for

guessing is associated with higher performance on MC examinations where this scoring

method is applied.

1.4.2 Lesson attendance

It has also widely been assumed that students benefit from attending lectures, since lesson

attendance is positively related to examination performance (Krieg & Uyar, 2001). It can be

questioned, however, whether this is still the case today due to huge developments in

information technology, also in the field of education. These new technologies make alternative

educational models possible, such as distance learning (Stanca, in: Aden, Yahye, & Dahir,

2013). If lesson attendance is, indeed, a significant predictor of performance, this would be a

relevant finding for both students and instructors. On the one hand, it may motivate students to

attend classes, because this is related to higher learning outcomes. On the other hand, it can also

have a motivating effect for instructors as this may convince them that their instructing does

matter for the learning outcomes of their students. Research of Aden, Yahye & Dahir (2013)

found that students who attend lessons have a significantly higher chance of passing a course.

The study of Kirby & McElroy (2003) indicated only a small positive effect of lecture

attendance on the probability of passing a course. They found that lesson attendance is more

crucial for enhancing a grade rather than obtaining the pass mark. The results of the study of

Cortright, Lujan, Cox, & DiCarlo (2011) extend previous findings by documenting that the

impact of lecture attendance on examination performance is sex specific. According to them,

regular class attendance has a stronger impact on exam performance for female students than it

has on the performance of male students. Based on prior results, the following hypothesis on

lesson attendance will be tested:

Hypothesis 3: Lesson attendance is positively associated with performance on MC exams,

which are corrected retrospectively for guessing.

1.4.3 Study time

Invested study time has also often been examined as another potential predictor of academic

performance. Some researchers found a significant relation between time spent studying and

performance on exams (Rau & Durand, 2000; Stinebrickner & Stinebrickner, 2004; Diseth,

Pallesen, Brunborg, & Larsen, 2010). However, other authors found no direct link between the

amount of time spent on studying and academic performance (e.g. Nonis & Hudson, 2006).

Also Plant, Ericsson, Hill, & Asberg (2005) found that the amount of study time by college

students is a rather weak predictor of academic performance. Their research found that the

relationship between invested study time and performance can be influenced by other factors

such as the quality of the study environment, previously attained study skills and aspects of a

certain discipline. For instance, students studying in a quiet environment may study more

effectively and need less study time to achieve grades comparable to those of students working in a

disruptive environment. Nevertheless, this quantitative factor of students’ learning activities

will again be tested in this master thesis for a possible influence on performance. As the

influence of lesson attendance will be examined separately, this variable will focus on the time

spent on studying outside of class. Though no clear empirical evidence exists, the following

relationship will be assumed:

Hypothesis 4: There is a positive relationship between the time spent on studying and

performance on MC exams, which are corrected retrospectively for guessing.

1.4.4 Students’ perceptions about course difficulty

Also students’ perceptions about course difficulty may play a role in explaining differences in

performance between students. Findings about the link between perceived difficulty and

performance have, however, not been consistent. Foos (1992) found that students who perceive

a course as rather difficult perform better on the exam than students who regard the course

material as rather easy. This can be explained by the fact that students are more motivated to

study and work harder when they expect the exam to be difficult. However, the studies of Hong

(1999) and Combs, Michael, & Fiore (2002) found that beliefs about test difficulty had no direct

influence on test performance. As previous literature shows mixed results, this thesis will also

investigate the potential relationship between students’ perceptions about course difficulty and

their corresponding grades. The following hypothesis about the relationship between perceived

course difficulty and performance is formulated:

Hypothesis 5: As students who perceive a course as rather difficult are expected to study harder

for that course, they will obtain higher scores on the MC exam, which is corrected

retrospectively for guessing.

1.4.5 Learning approaches

Further, the process of learning can have a significant impact on learning outcomes (Davidson,

2002). Research on learning in higher education states that students have a preferred way of

approaching their studies. A widely used dichotomy in the manner students approach their

learning task is “deep” versus “surface” learning (Marton & Saljo, in: Scouller, 1998). A

learning approach encompasses two elements: the first element entails the strategy or the

manner a student approaches a learning task and the second component is the motive or reason

why a student wants to approach it. A deep learning approach involves a personal commitment

in learning and a sincere interest in the subject. There is a strong, internal incentive to

thoroughly understand the course material and relate new insights to previously acquired

knowledge. In contrast, students employing a surface approach only carry out a learning task

to either embrace positive consequences or to avoid failure. These students only have the

intention to memorise facts so that they can reproduce them during examinations and pass a

course (Scouller, 1998). Previous research has mainly emphasized the importance of the deep

learning approach in order to reach high-quality learning outcomes, such as analytical and

conceptual thinking skills, which cannot be achieved through a surface approach to learning

(e.g. Hall, Ramsay, & Raven, 2004; Everaert, Opdecam, & Maussen, 2017). Also Byrne, Flood,

& Willis (2002) found that the deep learning approach was positively associated with higher

academic performance. However, they only found evidence for the relationship between

performance and learning approaches for female students, while little evidence was found for

their male counterparts. Furthermore, a study by Davidson (2002) made a

distinction between complex and more simple examination questions. He came to the

conclusion that the use of a deep learning approach has a significant positive effect on

performance on more complex examination questions, while no significant relationship was

found between the deep approach and performance on more simple questions. Based on prior

literature, the following hypothesis can be formulated:

Hypothesis 6: The deep approach has a significant positive influence on performance on MC

examinations that are scored retrospectively for guessing, while the opposite effect occurs

for the surface approach.

1.4.6 Motivation

Finally, the relationship between the motivational process and academic performance in

higher education has also received increasing empirical attention in recent decades. In particular, the

quality of students’ motivation has been investigated and refers to the kind of motivation that

triggers the learning behaviour. A commonly made distinction is the one between intrinsically and

extrinsically motivated behaviour. When students are intrinsically motivated, they engage in

learning activities for their own sake and out of interest. On the other hand, extrinsically motivated

students want to achieve certain outcomes which are separable from the learning itself

(Vansteenkiste, Lens, & Deci, 2006).

An important theory of motivation that addresses these issues of intrinsic and extrinsic

motivation is the Self-determination theory (SDT). This theory was initially developed by Deci

& Ryan (1985) and has been refined by scholars from different countries. In SDT, different

forms of behavioural regulation can be distinguished based on the degree to which they

represent autonomous (i.e. self-determined) functioning. Intrinsic motivation represents fully

autonomous functioning, while extrinsically motivated behaviour is less self-determined and more

controlled. However, extrinsic motivation can be further subdivided into different categories

according to the extent to which it has been internalized: the more internalized and integrated with one’s

self, the more it can serve as a basis for autonomous functioning. The categories, ranging from

the least to most completely internalized, include (Ryan & Deci, 2000):

1. External regulation: behaviours are enacted to obtain a reward or to avoid a punishment.

2. Introjected regulation: people do something because they would feel guilty

if they did not (e.g. studying for exams because parents insist).

3. Identified regulation: considering the value of the activity as personally important, they

accept the benefits of an activity (e.g. studying because one considers it as valuable).

4. Integrated regulation: identified regulations have been combined with other aspects of

one’s self.

Figure 1: The internalization continuum depicting the various types of

extrinsic motivation posited within self-determination theory (Niemiec & Ryan,

2009)

Turner, Chandler, & Heffer (2009) have shown that intrinsic motivation is positively associated

with academic performance. Engagement in activities serving the realisation of intrinsic rather

than extrinsic goals endorses a deeper processing of learning material and hence, a greater

conceptual understanding of it. Consequently, the following hypothesis will be examined:

Hypothesis 7: Intrinsic motivation is positively associated with performance on MC

examinations, which are corrected retrospectively for guessing, while the opposite effect

occurs for extrinsic motivation.

A possible explanation for the positive relationship between intrinsic motivation and academic

success may be the consequence of the relatedness between motivation and learning

approaches. It is more likely that students who are highly intrinsically motivated to enrol in a

given course, will adopt a deep learning approach. Extrinsically motivated students, on the

contrary, do not wish to become actively involved in the subject matter and are only

concentrating on what is necessary for assessment. Consequently, the latter group of students

will rather employ a surface learning strategy as their intention is to pass a course without

investing a lot of effort (De Lange & Mavondo, 2004).

Table 1: literature review regarding the influence of students’ characteristics on performance

GENERAL LITERATURE IN HIGHER EDUCATION

| Author | Year of publication | Country of study | Discipline | Measurement of performance | Variable(s) |
|---|---|---|---|---|---|
| Foos | 1992 | US | Psychology | Multiple choice (MC); Constructed-response (CR) | Students’ perceptions about course difficulty |
| Hong | 1999 | US | Statistics | General performance | Students’ perceptions about course difficulty |
| Rau & Durand | 2000 | US | Sociology | General performance | Study time |
| Wester & Henriksson | 2000 | Sweden | Mathematics | Multiple choice (MC); | Gender |
| Krieg & Uyar | 2001 | US | Economics & business statistics | Multiple choice (MC); | Gender, repeating a course, lesson attendance |
| Chan & Kennedy | 2002 | Canada | Economics | Multiple choice (MC); | Gender |
| Combs, Michael, & Fiore | 2002 | US | Psychology | Multiple choice (MC) | Students’ perceptions about course difficulty |
| Kirby & McElroy | 2003 | Ireland | Economics | Multiple choice (MC) | Lesson attendance |
| Stinebrickner & Stinebrickner | 2004 | US | Arts | General performance | Study time |
| Plant, Ericsson, Hill, & Asberg | 2005 | US | Psychology | General performance | Study time, ability |
| Leaver & van Walbeek | 2006 | South Africa | Economics | Multiple choice (MC) | Gender |
| Nonis & Hudson | 2006 | US | Business courses (e.g. accounting, finance, management) | General performance | Study time |
| Du Plessis & Du Plessis | 2007 | South Africa | Economics | Multiple choice (MC) + penalty for wrong answers; | Gender |
| Hartley, Betts, & Murray | 2007 | UK | Psychology | Multiple choice (MC); Constructed-response (CR); Projects/dissertations | Gender |
| Betts, Elder, Hartley, & Trueman | 2009 | UK | Psychology | Multiple choice (MC) + penalty for wrong answers | Gender |
| Turner, Chandler, & Heffer | 2009 | US | Psychology | General performance | Motivation |
| Diseth, Pallesen, Brunborg, & Larsen | 2010 | Norway | Psychology | General performance | Study time |
| Cortright, Lujan, Cox, & DiCarlo | 2011 | US | Physiology | Multiple choice (MC) | Lesson attendance |
| Marín & Rosa-García | 2011 | Spain | Political economy | Multiple choice (MC) + penalty for wrong answers | Gender |

LITERATURE IN ACCOUNTING EDUCATION

| Author | Year of publication | Country of study | Measurement of performance | Variable(s) |
|---|---|---|---|---|
| Byrne, Flood, & Willis | 2002 | Ireland | Constructed-response (CR); Group presentations | Learning approaches |
| Davidson | 2002 | Canada | Multiple choice (MC); | Learning approaches |
| Hall, Ramsay, & Raven | 2004 | Australia | General performance | Learning approaches |
| Nonis & Hudson | 2006 | US | General performance | Study time, motivation, ability |
| Bible, Simkin, & Kuechler | 2008 | US | Multiple choice (MC); | Gender |
| Arthur & Everaert | 2012 | Belgium | Multiple choice (MC); | Gender |
| Aden, Yahye, & Dahir | 2013 | Somalia | General performance | Lesson attendance |
| Everaert, Opdecam, & Maussen | 2017 | Belgium | Multiple choice (MC); | Learning approaches |

*Articles regarding the influence of students’ characteristics on performance have been searched for the period 1990 through 2017 in both general education literature and accounting education literature.

2 Research design & methodology

2.1 Research goal & questions

The debate in literature about scoring methods for MC examinations has mainly been one-sided

and mostly focussed on NR scoring versus NM. As both have shown inherent drawbacks,

there is a need to explore alternative scoring methods to reduce the gaps between theoretical

options and reality in order to support test developers (Lesage, Valcke, & Sabbe, 2013). There

is, however, a lack of available research that offers alternative scoring methods. Hence, this

master thesis can make an important contribution as the main focus will be on performance on

MC examinations, when an alternative scoring method is applied, being the ‘retrospective

correcting for guessing’. In particular, the relationship between performance on this type of

exam and gender will be examined, as prior research often identified gender differences in

performance on MC exams. Therefore, the main goal of this master dissertation is to investigate

whether a certain form of gender bias also occurs in case MC exams are corrected

retrospectively for guessing. Furthermore, it will also be examined if other students’

characteristics, as listed in 1.4, can be held responsible for differences in students’ performance

on MC examinations that are retrospectively corrected for guessing. This leads to the following

research questions:

Research questions:

1. Does gender have an influence on performance on multiple choice exams that are corrected

retrospectively for guessing?

2. Which other students’ characteristics (of those described in 1.4) have an influence on

performance on multiple choice exams that are corrected retrospectively for guessing?

2.2 Research techniques

The research techniques that are applied are a literature review, followed by surveys. In this

section, I will discuss how data have been gathered and explain why I chose a survey as research

technique. This research will mainly have a deductive character. The first part of the study

draws on scientific literature, which made it possible to formulate hypotheses. In a second step,

it will be examined whether these hypotheses can be confirmed or rejected by analysing the

results (van Thiel, 2010).

2.2.1 Surveys

After thoroughly exploring literature, I was able to formulate questions for the survey. The

survey can be found in appendix 1. Surveys are a quantitative research method where questions

are asked in a direct way to a sample of individuals. Surveys can be used to gather facts, but are

mainly used to gather information about the views and attitudes of people on a research topic.

Data gathered by surveys thus mainly include opinions, behaviour and characteristics. I did not

opt for qualitative interviews to test the hypotheses as surveys can be executed on a larger scale

in a shorter time period: they make it possible to question a large number of respondents and more variables

can be included. There is the possibility of quick response and good follow-up, and interview

bias is also excluded (van Thiel, 2010).

I preferred a written questionnaire over a web survey, since my promotor dr. Evelien Opdecam

gave me the opportunity to let my respondents fill out the questionnaire during one of their

courses. In this way, I was able to exercise more control over the response rate, which is not

straightforward in case of a web survey (van Thiel, 2010). I think students were also more willing to

fill out the survey, since they were allowed to do it during the course and did not need to spend

their spare time on this.

Almost all questions included in the questionnaire were closed-ended questions, which means

that respondents have to choose their answer from a list of pre-selected answer possibilities.

Closed-ended questions make it easier to compare the results of different respondents and

to analyse them statistically afterwards. For a substantial part of the questions

Likert scales have been used, which measure the attitudes of respondents. These scales require

respondents to indicate on a scale (usually going from one to five/seven) to which degree they

agree or disagree with a particular statement (van Thiel, 2010). The majority of these statements

used in the survey is taken from previous research. The learning approach, for example, was

measured by the Revised Two Factor Study Process Questionnaire (R-SPQ-2F), which will be

discussed in more detail in a later paragraph (Biggs, Kember, & Leung, 2001).

2.2.1.1 Sample

The study was conducted on a sample of Belgian students following the course of corporation tax

at the University of Ghent. More specifically, the population consisted of 350 third-year bachelor

students enrolled in the Bachelor of Business Administration. These students have to follow a

course in corporation tax during the first semester. During the last course before the exam, the

students were asked to complete the questionnaire during class time. Although 350 students

subscribed to this course, a relatively large part of them were absent during this last lesson.

Consequently, a total of 129 students completed the questionnaire. Among them, there

were 49 male students, 77 female students and 3 respondents who did not indicate their gender.

Since 329 students participated in the exam, the response rate is equal to 39.2%.

The questionnaire included some general questions (e.g. gender), followed by several questions

specifically related to the course of corporation tax as well as questions related to scoring

methods for multiple choice assessment. At the end of the questionnaire, students were asked

to write their student number. When they did not have their student card with them to write

down their student number, they were asked to leave their name on the questionnaire. This way

of identification was necessary to relate the answers of the questionnaire to their marks on the

exam of corporation tax. Further on, these names were converted to numbers in order to make

sure that the data were treated anonymously.

2.3 Measurement

2.3.1 Dependent variables: performance

A data-analysis is executed on the results of the exam of the course ‘corporation tax’. Hence,

the dependent variable used in this study is the performance or obtained mark on the final exam.

The exam consisted of 40 multiple choice questions. The MC questions can be subdivided into

three types of questions, which can also be linked to Bloom’s taxonomy. A first category

involved 15 theoretical questions, which require the recall of learned information and can be linked

to the first level of Bloom’s taxonomy (i.e. knowledge). The second level of Bloom’s

taxonomy, being comprehension, requires students to demonstrate their ability to translate

knowledge into a new context, for instance from words to numbers. Accordingly, the 19

calculations that had to be solved at the exam can be considered as belonging to this level.

Finally, six application questions measured students’ competence to use information in new

situations. This type of questions can therefore be assigned to the third level of Bloom’s

taxonomy, being “application”.

The first dependent variable is the score on the exam when retrospective correcting for guessing

is used. This means that for each MC question, one point can be earned for a correct answer

and there is no deduction for incorrect answers. Omitted answers are also given zero points.

However, this method requires students to answer more than half of the questions correctly to

pass the course. More specifically, the higher passing grade in this exam was set at 25.79. The

final grade of the students who just obtained the higher passing grade will correspond to 20/40

when retrospective correcting for guessing is applied.

A second dependent variable will be the score on 40, when no correction for guessing would

be implemented. Similarly, right answers are rewarded with positive values and incorrect or

omitted answers are scored with a value of zero. However, contrary to retrospective correcting

for guessing, students only have to answer half of the questions correctly to pass the course and

obtain a final grade of 20/40. Hence, this variable measures performance in case NR scoring

was applied as the marking method.
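
The relation between the two dependent variables can be sketched in a few lines of Python. The text does not state the rescaling formula, so the linear mapping below is an assumption: it sends the higher passing grade (25.79) to 20/40 and a perfect paper to 40/40, which reproduces the values reported in table 2 to within rounding. How the official procedure treats very low raw scores is not stated in the text, and the function names are illustrative only.

```python
MAX_SCORE = 40          # number of MC questions on the exam
HIGHER_CUTOFF = 25.79   # higher passing grade reported in the text
PASS_MARK = 20          # final grade that corresponds to a pass


def nr_score(raw_correct: float) -> float:
    """Number-right scoring: the final grade equals the number of correct answers."""
    return float(raw_correct)


def retrospective_score(raw_correct: float) -> float:
    """Assumed linear rescaling: the cut-off maps to 20/40, a perfect paper to 40/40."""
    slope = (MAX_SCORE - PASS_MARK) / (MAX_SCORE - HIGHER_CUTOFF)
    return MAX_SCORE - (MAX_SCORE - raw_correct) * slope


if __name__ == "__main__":
    # Raw scores taken from the text and from table 2 (minimum, cut-off, mean, maximum).
    for raw in (13, 25.79, 30.71, 39):
        print(f"raw {raw}: NR = {nr_score(raw):.2f}, "
              f"retrospective = {retrospective_score(raw):.2f}")
```

Because the assumed mapping is linear, the two performance variables are perfectly correlated, while all score differences are stretched by a factor of roughly 1.41.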

During the first exam period, 329 students participated in the exam. 185 of them were

male students and 144 of them were female students. Although 129 students completed the

questionnaire, the survey answers could be linked to the exam score for only 112 of them.

On the one hand, eleven students could not be identified, as they did not leave their

student number or name on the survey. On the other hand, six students who completed the

questionnaire did not participate in the exam.

2.3.2 Independent variables

The independent variables are gender, prior experience, familiarity with retrospective

correcting for guessing, preference of scoring method, lesson attendance, study time,

perceptions about course difficulty, learning approaches and motivation.

The data for the gender variable were collected by the questionnaire. The first question asked

to students was to indicate their sex. Subsequently, this variable is coded as 0 for male students

and 1 for female students.

To get an idea about the prior experience of students with taking similar MC questions,

students were asked how many times they had already participated in the exam of

corporation tax. This question makes it possible to make a distinction between students

repeating the course and students following the course for the first time. Secondly, it has been

asked how familiar students feel with the retrospective correcting for guessing scoring method

that is nowadays applied at the University of Ghent. Statements had to be answered on a five-

point Likert scale, ranging from ‘strongly disagree’ to ‘strongly agree’. In appendix 2, an

overview of the items, the Cronbach’s alpha and the factor loadings are listed. The second item

has been deleted, as the factor loading for that item (“The fact that a larger number of questions

has to be answered correctly in case of retrospective correcting for guessing, scares me”) is

extremely low. Deleting this item resulted in a rather low alpha of 0.47 for this variable.

Thirdly, students were asked about their preference for scoring methods used in MC examinations. More

specifically, students had to indicate on a scale from one to ten which scoring method they

preferred, with one being absolutely the NM method and ten being absolutely the retrospective

correcting for guessing method.

Also the variable “lesson attendance” has been included in the survey. For both theory and

exercise classes, students were asked how much of the lessons they have attended. The possible

answers included: 0 – 19%, 20 – 39%, 40 – 59%, 60 – 79%, 80 – 100% or no lessons at all.

Besides class attendance, students were asked to report the average number of hours per week

they spent at home working on the corporation tax course. The possible answers ranged from

less than one hour per week to more than six hours a week.

Furthermore, students were asked about their perceptions of the subject difficulty. On a scale

of 1 – 10, with 1 being easy and 10 being difficult, students had to indicate how difficult they

perceived the subject matter.

The variable “learning approaches” can be measured by different instruments. One of these

is the study process questionnaire (SPQ) developed by Biggs. In this study the revised version

of SPQ, i.e. the Revised Two Factor Study Process Questionnaire (R-SPQ-2F), was used as this

entails a questionnaire, which can be used by faculties to measure the learning approaches of

students. This questionnaire consists of 20 questions, using a five-point Likert scale. Half of the

questions measures the deep approach, while the other half measures the surface approach

(Biggs, Kember, & Leung, 2001). As the course of corporation tax is solely followed by Dutch-

speaking students, the questions have been translated into Dutch. In appendix 3, an overview

of the items, the Cronbach alphas and the factor loadings for the two constructs, deep approach

and surface approach, are listed. A limiting value of 0.30 is used as a point of reference for the

factor loadings. The Cronbach’s alpha of the deep approach amounts to 0.65. Concerning the

surface approach, the factor loading of item 1 (“my aim is to pass the course while doing as

little work as possible”) is lower than the limiting value. This item has been deleted, resulting

in an alpha of 0.63 for the surface approach.
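
As an illustration of the reliability measure reported above, the following minimal sketch computes Cronbach's alpha with the standard formula alpha = k/(k-1) * (1 - sum of item variances / variance of the total score). The analyses in this study were run in SPSS; the function name and the data layout (one list of answers per item) are assumptions made for this example only.

```python
from statistics import variance


def cronbach_alpha(items: list[list[float]]) -> float:
    """Cronbach's alpha; items[i][j] is the answer of respondent j to item i."""
    k = len(items)
    if k < 2:
        raise ValueError("Cronbach's alpha needs at least two items")
    # Sample variance of each item across respondents.
    item_variances = [variance(scores) for scores in items]
    # Total scale score per respondent (sum over items).
    totals = [sum(answers) for answers in zip(*items)]
    return k / (k - 1) * (1 - sum(item_variances) / variance(totals))
```

Deleting a weakly loading item, as was done for the surface approach, amounts to calling the function again without that item's list and comparing the two alphas.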

Motivation was measured by means of a self-regulation questionnaire (SRQ) that evaluates

domain-specific individual differences in types of motivation or regulation. Respondents have

been asked why they behave in a certain way. For each behaviour, a predefined list of reasons

was given, which represent the different types of regulation (Self-determination theory, 2017).

Again, each statement had to be answered on a five-point Likert scale, ranging from ‘strongly

disagree’ to ‘strongly agree’. Motivational scores have been computed by means of the

“Relative Autonomy Index” (RAI), where regulations are weighted according to their place on

the autonomy continuum. Hence, RAI is a composite score of relative autonomy. This index

subtracts controlled forms of motivation from autonomous forms. The most common formula

is (Chemolli & Gagné, 2014):

RAI = 2 × intrinsic + identified − introjected − 2 × external

The value for the RAI could range between -12 and +12. A higher positive score for the RAI

means that the student is more autonomously motivated, whereas lower negative scores indicate

less autonomous motivation (Self-determination theory, 2017).
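
The index can be written out as a short function. This sketch assumes that each of the four inputs is a sub-scale score on the five-point Likert scale (1 to 5), which is consistent with the stated range of -12 to +12; the function and parameter names are illustrative.

```python
def relative_autonomy_index(intrinsic: float, identified: float,
                            introjected: float, external: float) -> float:
    """RAI: autonomous regulations weighted +2/+1, controlled regulations -1/-2."""
    return 2 * intrinsic + identified - introjected - 2 * external


if __name__ == "__main__":
    # Extremes on a 1-5 scale give the bounds mentioned in the text.
    print(relative_autonomy_index(5, 5, 1, 1))   # most autonomous profile
    print(relative_autonomy_index(1, 1, 5, 5))   # most controlled profile
```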

However, after performing a factor analysis to verify the scale construction, it has been decided

to not use the results of this index for the analyses due to very weak factor loadings. The results

of testing the sub-scales of the RAI can be found in appendix 4. Each subset of the

scale represents a different dimension of relative autonomy. The statements within each dimension

of autonomy were expected to load strongly on one and the same component. However, it can

be observed that the items are not unidimensional for each of the four sub-scales. Some items

loaded strongly on multiple components; this is also called “cross-loadings”. Moreover, some

items did not load strongly on any component. Further, the Cronbach’s alphas, used

to determine the sub-scales’ reliability, are very low for identified and introjected

regulation. These unsatisfactory results of the factor and reliability analyses might be explained

by the fact that it is the first year this self-regulation questionnaire has been used here. The

questionnaire has not yet been fine-tuned and further improvements will probably be necessary.

Another possible explanation may be that the questions about motivation were the last

part of the survey. The respondents were probably less attentive when answering these final

questions compared to the beginning of the survey (van Thiel, 2010). Moreover, the items that

had to be answered regarding motivation were numerous. More precisely, twenty statements

were included to examine students’ motivation. Hence, it is recommended for future research

to take these pitfalls into account.

2.3.3 Control variable

Finally, ability has been added as a control variable, as prior literature found that ability and

academic performance are strongly positively correlated (e.g. Everaert, Opdecam, & Maussen,

2017). In this study, ability reflects a total score out of one thousand. More precisely, this score is

the weighted average of the exam results of the students during their second bachelor. The study

volumes of each course have been used as weights.
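
A credit-weighted average of this kind could be computed as in the sketch below. The text only states that study volumes are used as weights and that the result is expressed out of one thousand; the assumption that individual course marks are out of 20, the function name and the example figures are hypothetical.

```python
def ability_score(results: list[tuple[float, float]], max_mark: float = 20.0) -> float:
    """Weighted average of course marks, rescaled to a score out of 1000.

    results: (mark, study_volume) pairs, one per second-bachelor course.
    max_mark: assumed maximum mark per course (hypothetical, not given in the text).
    """
    total_volume = sum(volume for _, volume in results)
    if total_volume == 0:
        raise ValueError("total study volume must be positive")
    weighted_mean = sum(mark * volume for mark, volume in results) / total_volume
    return weighted_mean / max_mark * 1000


if __name__ == "__main__":
    # Hypothetical student: 10/20 on a 6-credit course and 14/20 on a 3-credit course.
    print(round(ability_score([(10, 6), (14, 3)]), 2))
```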

2.4 Analysing the results

2.4.1 Independent samples T-test

To compare male and female students with each other, an “independent samples t-test” is

conducted through the statistical computer program, SPSS Statistics 24. The independent

samples t-test is used to compare the means of two independent groups in order to determine

whether the associated populations means are significantly different. Gender differences are

examined for all the variables. The sample is divided in two groups by means of the categorical

variable ‘gender’ (0 = male, 1 = female).

To compare two groups by means of this t-test, the data of these groups have to comply with

certain conditions. First of all, the sample must be composed randomly. Additionally, the scores in both

groups are required to follow a normal distribution. This requirement is considered met, since with

more than thirty observations in each group the group means are approximately normally distributed.

Furthermore, both samples must be independent of each other; there may be no relationship

between the subjects in each sample. This condition is met as subjects can only belong to one

group and the scores of male students cannot be influenced by the scores of female students and

vice versa. Finally, the variances should approximately be equal across both groups. To test this

assumption of variance homogeneity, the Levene’s Test for Equality of Variances will be used

(Morgan, Leech, Gloeckner, & Barrett, 2004).
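
The test statistic behind this comparison can be sketched with the Python standard library. The study itself relied on SPSS output; the two functions below are a minimal illustration of the pooled-variance t statistic (equal variances assumed) and of the Welch variant (equal variances not assumed), with hypothetical function names. Obtaining a p-value would additionally require the t distribution, which is left out here.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(a: list[float], b: list[float]) -> tuple[float, float]:
    """Independent samples t statistic and degrees of freedom, equal variances assumed."""
    n1, n2 = len(a), len(b)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / sqrt(pooled_var * (1 / n1 + 1 / n2))
    return t, df


def welch_t(a: list[float], b: list[float]) -> tuple[float, float]:
    """Welch t statistic and Welch-Satterthwaite degrees of freedom."""
    n1, n2 = len(a), len(b)
    v1, v2 = variance(a) / n1, variance(b) / n2
    t = (mean(a) - mean(b)) / sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df
```

With group a holding the scores of male students and group b those of female students, a positive t indicates a higher mean for men, which matches the sign convention of the mean differences reported in the results.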

2.4.2 Regression analyses

Furthermore, in order to test the hypotheses, ordinary least squares regressions are performed

to study the relationship between the independent variables and the performance on the final

exam. Regressions examine the influence of an independent variable X on the dependent

variable Y. After performing “single” regressions for each hypothesis, a robustness check will

be done by including all independent variables in one regression model. When multiple variables

are taken into account, attention has to be paid to the possible occurrence of multicollinearity.

Multicollinearity exists when two or more independent variables, also called predictors, are

highly correlated. Multicollinearity can be tested by calculating the variance inflation factor

(VIF). If the value for VIF is lower than ten, then there are no problems related to

multicollinearity. In case the value is higher than ten, multicollinearity may result in unstable

coefficient estimates, which are difficult to interpret. Nevertheless, multicollinearity can be

easily dealt with by eliminating or merging predictors that are highly correlated (Verlet, 2015).
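
To make the threshold concrete: the VIF of a predictor equals 1 / (1 - R²), where R² comes from regressing that predictor on all other predictors. The sketch below uses the special case of only two predictors, in which R² is simply their squared correlation; this is a simplification, since the full model in this study contains more predictors. The example value 0.711 is the correlation between the two lesson attendance variables reported in the correlation table; the function names are illustrative.

```python
def vif_from_r_squared(r_squared: float) -> float:
    """VIF of a predictor, given the R² of regressing it on the other predictors."""
    if not 0 <= r_squared < 1:
        raise ValueError("R² must lie in [0, 1)")
    return 1 / (1 - r_squared)


def vif_two_predictors(r: float) -> float:
    """Special case with exactly two predictors: R² is the squared correlation."""
    return vif_from_r_squared(r ** 2)


if __name__ == "__main__":
    # Correlation between exercise and theory lesson attendance (r = 0.711).
    print(round(vif_two_predictors(0.711), 2))
    # An auxiliary R² of 0.90 corresponds to the threshold value of ten.
    print(round(vif_from_r_squared(0.90), 2))
```

In the two-predictor case, a correlation of about 0.95 would be needed before the VIF reaches ten, so even the strongest correlation among the predictors in this study stays far below the threshold.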

3 Research findings

3.1 Descriptive statistics

First, the descriptive statistics of the data will be discussed. In table 2 below, the average scores

on the dependent variable, i.e. performance on the exam with and without retrospective

correcting for guessing, can be compared. The table shows that the mean score is 30.71 when

no correcting for guessing is implemented and hence, NR scoring is the marking method being

used. This mean score decreases to 26.93 when scores are corrected retrospectively for

guessing.

Furthermore, as the exam consisted of three categories of multiple choice questions, the overall

score on the exam can be further refined. Although there were different amounts of theoretical

questions, calculations and application questions, the score on each type of question has been

rescaled to a mark on 10. In this way, the scores on each type of question can be compared more

easily. Students performed best with regard to the theoretical questions; on average 82.4% of

the questions were answered correctly. For the application questions, students answered on

average 80.4% of the questions correctly. Students performed worst on the calculations; they

solved on average 71.2% of the calculations correctly.
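
The rescaling to a mark on 10 is a simple proportion, sketched below with the number of questions per category taken from the description of the exam (15 theoretical questions, 19 calculations, 6 application questions). The function and dictionary names are illustrative.

```python
QUESTIONS_PER_TYPE = {"theory": 15, "calculation": 19, "application": 6}


def mark_on_ten(correct: int, question_type: str) -> float:
    """Rescale the number of correct answers in one category to a mark on 10."""
    return correct / QUESTIONS_PER_TYPE[question_type] * 10


if __name__ == "__main__":
    # Five correct answers per category, rescaled to a mark on 10.
    for question_type in QUESTIONS_PER_TYPE:
        print(question_type, round(mark_on_ten(5, question_type), 2))
```

For example, the minimum marks of 3.33 for theoretical questions and 2.63 for calculations in table 2 are consistent with five correct answers in each of these categories.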

Table 2: Descriptives performance on the exam

Variable                                                             N    Minimum  Maximum  Mean   Standard deviation
Performance with NR scoring (mark on 40)                             112  13.00    39.00    30.71  5.67
Performance with retrospective correcting for guessing (mark on 40)  112  1.99     38.59    26.93  7.98
Performance theoretical questions (mark on 10)                       112  3.33     10.00    8.24   1.52
Performance calculations (mark on 10)                                112  2.63     10.00    7.12   1.75
Performance application questions (mark on 10)                       112  0.00     10.00    8.04   1.85

The histogram in figure 2 shows the underlying frequency distribution in case retrospective

correction for guessing is introduced. The figure shows a skewed right distribution of

performance on the exam. Whereas only three students of the 112 respondents failed to pass

the exam when no correction for guessing was introduced, this number increased to 21

respondents when applying the retrospective correction for guessing.

Figure 2: Histogram performance with retrospective correcting for guessing

(mark on 40)

As shown in the frequency table below, there were 40 male students (35.7%) and 72 female

students (64.3%) among the 112 respondents.

Table 3: Frequencies gender

               Frequency  Valid %  Cumulative %
Valid  Male    40         35.7     35.7
       Female  72         64.3     100.0
       Total   112        100.0

The descriptive statistics of the independent variables familiarity with retrospective correcting

for guessing, preference of scoring method, perceptions about course difficulty, learning

approaches and the control variable ability are summarized in table 4.

First, the degree to which students feel familiar with the retrospective correcting for guessing

scoring method in MC examination has been questioned. On average, students feel very familiar

with this type of examination (mean = 4.59 on a five-point Likert scale).

Furthermore, table 4 shows that, on average, students prefer the retrospective correcting for

guessing method, as nowadays applied at the University of Ghent, above the NM scoring

method (mean = 8.89). This preference has been measured on a scale from one to ten, with one

being an absolute preference for NM, and 10 being an absolute preference for retrospective

correction for guessing. Regarding students’ perceptions about course difficulty, one can

conclude that, on average, students perceive corporation tax as a rather difficult course (mean

= 7.89). Looking at the deep approach, we see that the mean is located between a low and

neutral deep approach (mean = 2.62 on a five-point Likert scale), tending more towards neutral.

The mean of the surface approach is situated between a low and neutral surface approach (mean

= 2.68), tending also more towards neutral. Finally, the mean of the control variable ability is

579.40 with a standard deviation of 93.52.

Table 4: Descriptives familiarity, preference, perceptions course difficulty, learning approaches & ability

Variable                                    N    Minimum  Maximum  Mean    Standard deviation
Familiarity with retrospective correcting   129  1.00     5.00     4.59    0.62
Preference scoring method (mark on 10)      129  3.00     10.00    8.89    1.66
Perceptions course difficulty (mark on 10)  129  3.00     10.00    7.89    1.18
Deep approach                               126  1.50     4.10     2.62    0.47
Surface approach                            126  1.44     4.44     2.68    0.53
Ability (mark on 1000)                      110  318.00   790.00   579.40  93.52

For the other independent variables, the frequency tables are included. The variable of prior

experience has been measured by asking how many times students have participated in the

exam. Table 5 shows that the majority of the respondents participated for the first time during

last exam period in January. More precisely, 96.1% of the 129 students did not participate

earlier. Consequently, only 3.9% of the respondents were repeating the course. As only such a

small fraction of the respondents are retaking the course, it may be clear that this variable will

not have a significant explanatory power for differences in performance. Hence, this variable

will be eliminated in further analyses.

Table 5: Frequencies times participated in the exam

                                          Frequency  %      Cum. %
Valid  For the 1st time this exam period  124        96.1   96.1
       Already 1 time in the past         3          2.3    98.4
       Already 2 times in the past        2          1.6    100.0
       Total                              129        100.0

For lesson attendance, we can conclude that, on average, students attended most of the lessons.

This holds for both the exercise and theory classes. The score on this question could range

from zero (meaning no class attended at all) to five (meaning between 80% and

100% of the classes attended). On average, students attended more exercise classes than theory classes.

As shown in frequency tables 6 and 7, 103 respondents attended between 80% and 100%

of the exercise classes, whereas this number decreases to 85 respondents for the theory classes.

Table 6: Frequencies lesson attendance (exercises)

                      Frequency  %      Cum. %
Valid  Never (0)      1          0.8    0.8
       0 – 19% (1)    6          4.7    5.4
       20 – 39% (2)   0          0.0    5.4
       40 – 59% (3)   12         9.3    14.7
       60 – 79% (4)   7          5.4    20.2
       80 – 100% (5)  103        79.8   100.0
       Total          129        100.0

Table 7: Frequencies lesson attendance (theory)

                      Frequency  %      Cum. %
Valid  Never (0)      1          0.8    0.8
       0 – 19% (1)    6          4.7    5.4
       20 – 39% (2)   8          6.2    11.6
       40 – 59% (3)   5          3.9    15.5
       60 – 79% (4)   24         18.6   34.1
       80 – 100% (5)  85         65.9   100.0
       Total          129        100.0

Besides class attendance, students were asked how many hours per week they spent on average

studying the material outside of class. As shown in frequency table 8, the answers could

range from less than one hour a week to more than six hours a week. More than half of the

respondents reported working less than one hour a week at home for this course. Only 15.5% of

the respondents indicated that they spent more than two hours a week at home working on this course.

We can therefore conclude that not many students were encouraged to spend a large number of

hours working on that course at home, although they perceived corporation tax as a quite

difficult course.

Table 8: Frequencies weekly reported study time (excl. lessons)

                                  Frequency  %      Cum. %
Valid  < 1 hour (1)               66         51.2   51.2
       Between 1 and 2 hours (2)  43         33.3   84.5
       [...]
       > 6 hours (7)              0          0.0    100.0
       Total                      129        100.0

Concerning the learning approaches, a further division into four groups of students can be

made: a group of students with low scores for deep approach and high scores for surface

approach (1), a group of students employing low levels for both approaches (2), a group of

students with high scores for deep approach and low scores for surface approach (3) and a fourth

group of students employing high levels for both learning approaches (4). To assign students to

a particular group, the mean of both learning approaches was used as a threshold. For instance,

students with a score below the mean of the deep approach but above the mean

of the surface approach are assigned to the first quadrant.
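
The mean-split assignment described above can be sketched as follows. The group numbers follow the numbering in the text; the treatment of scores exactly equal to the mean (counted as "low" here) is an assumption, as the text does not specify it, and the function name is illustrative.

```python
def learning_quadrant(deep: float, surface: float,
                      mean_deep: float, mean_surface: float) -> int:
    """Assign a student to one of the four learning-approach groups (mean split)."""
    high_deep = deep > mean_deep            # ties are treated as "low" (assumption)
    high_surface = surface > mean_surface
    if not high_deep and high_surface:
        return 1  # low deep approach, high surface approach
    if not high_deep and not high_surface:
        return 2  # low levels for both approaches
    if high_deep and not high_surface:
        return 3  # high deep approach, low surface approach
    return 4      # high levels for both approaches


if __name__ == "__main__":
    # Thresholds: sample means of the deep (2.62) and surface (2.68) approach.
    print(learning_quadrant(2.0, 3.0, mean_deep=2.62, mean_surface=2.68))
```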

The distribution of the students across the four groups is shown in table 9 and also visualized

in the mean plot in figure 3. The largest group of students employed low levels for both learning

approaches (n = 36). Although this is a surprising group, prior research has also identified a

profile that consisted of low scores on both learning approaches. For instance, a recent study of

Everaert, Opdecam, & Maussen (2017) also found a large cohort of students scoring low on

both learning approaches. They called these students “rote learners”. Rote learners typically

resort to a repetitive strategy by revising material until it is remembered, but they do not really

understand the material, and hence fail to use it. Low scores for the deep approach and high

scores for the surface approach were found for 32 students. For an approximately equally large

group of students, the opposite trend was found, being high scores on the deep approach and

low scores on the surface approach (n = 31). Finally, the smallest group of students employed

high levels for both learning approaches (n = 25). The fact that this cohort contains the fewest

students may be explained by the fact that the learning approaches are in theory mutually

exclusive; students will not maintain both approaches simultaneously (Biggs, 1987).

Nevertheless, more than 20% of the respondents belonged to this group.

Table 9: Frequencies quadrants of learning approaches

                                                      Frequency  %      Cum. %
Valid  Low deep approach; high surface approach (1)   32         25.8   25.8
       Low deep approach; low surface approach (2)    36         29.0   54.8
       High deep approach; low surface approach (3)   31         25.0   79.8
       High deep approach; high surface approach (4)  25         20.2   100.0
       Total                                          124        100.0

Figure 3: Plot of the learning approaches (Mean-split)

3.2 Correlations

Table 10: Correlation table

                                                     (1)        (2)        (3)        (4)        (5)        (6)        (7)        (8)        (9)        (10)       (11)       (12)
(1) Performance retrospect. correcting (mark on 40)  1
(2) Performance NR scoring (mark on 40)              1.000***   1
(3) Gender                                           -0.061     -0.061     1
(4) Familiarity with retrospect. correcting          0.016      0.016      -0.086     1
(5) Preference scoring method                        0.057      0.057      0.280***   0.032      1
(6) Lesson attendance (exercises)                    0.105      0.105      0.097      0.024      0.055      1
(7) Lesson attendance (theory)                       0.138      0.138      -0.104     0.086      0.046      0.711***   1
(8) Time weekly spent                                0.239**    0.239**    0.148      -0.105     0.152*     0.091      0.151*     1
(9) Perceptions course difficulty                    0.122      0.122      0.216**    -0.108     0.086      0.181**    0.110      0.144      1
(10) Deep approach                                   0.034      0.034      -0.024     -0.093     0.164*     -0.157     -0.085     0.254***   -0.103     1
(11) Surface approach                                -0.139     -0.139     0.034      0.002      -0.023     0.129      0.139      -0.052     -0.040     -0.209**   1
(12) Ability (mark on 1000)                          0.659***   0.659***   0.115      0.135      0.180      0.209**    0.273***   0.171*     0.154      -0.032     -0.291***  1

Correlations performance on different types of questions
Theoretical questions (mark on 10)                   0.886***   0.886***   -0.031     0.029      0.036      0.125      0.140      0.219**    0.127      -0.036     -0.180*    0.584***
Calculations (mark on 10)                            0.922***   0.922***   -0.144     -0.020     0.028      0.120      0.143      0.234**    0.080      0.059      -0.121     0.631***
Application questions (mark on 10)                   0.525***   0.525***   0.183*     0.082      0.133      -0.082     -0.012     0.070      0.123      0.071      0.020      0.279***

*** indicates correlation is significant at the 0.01 level, ** indicates correlation is significant at the 0.05 level, * indicates correlation is significant at the 0.10 level.

Table 10 shows the correlations between all the different variables. On the main diagonal, every

variable is correlated with itself, which leads to perfect correlations (r = 1). The dependent

variable performance on the exam is, both in case of retrospective correcting for guessing and

NR scoring, significantly positively correlated with the amount of time students weekly spent

at home working on the course of corporation tax (r = 0.239, p < 0.05) and with ability (r =

0.659, p < 0.01). On the one hand, this means that students who spent more time studying for

the course, achieved a higher grade on the exam. On the other hand, students who obtained a

high total score during their second bachelor, also obtained a higher score on the exam of

corporation tax in comparison to those with low ability levels.

The independent variable gender shows a significant positive correlation with the variable

measuring the preference of scoring method (r = 0.280, p < 0.01) and with the perceptions of

students about course difficulty (r = 0.216, p < 0.05). This means that women have a higher

preference for the retrospective correcting for guessing scoring method, and that they also

perceive the course of corporation tax as more difficult.

The variable measuring the preference of scoring method shows a positive correlation with the

weekly invested study time (r = 0.152) and the deep learning approach (r = 0.164). These

correlations are, however, only significant at the 0.10 level.

The variables attendance of exercises classes and attendance of theory classes are strongly

positively correlated with each other (r = 0.711, p < 0.01). Furthermore, exercises lesson

attendance has a significant positive correlation with the perceptions about course difficulty,

indicating that students who attended more exercises classes, perceived the course to be more

difficult (r = 0.181, p < 0.05). The attendance of theory classes shows a positive correlation

with the weekly invested study time, though only significant at the 0.10 level (r = 0.151).

The learning approaches are significantly negatively correlated with each other (r = -0.209, p

< 0.05). This negative correlation between the deep and surface learning approach makes sense,

since the learning approaches are, in theory, mutually exclusive (Biggs, 1987). A high score for

one learning approach normally results in a weak score on the other approach. Furthermore, the

deep approach has a positive correlation with the amount of weekly invested study time (r =

0.254, p < 0.01). This means that students with a deep approach spent more time on studying

the course of corporation tax at home.

Besides a strong, positive correlation with performance on the exam, the control variable ability

is significantly positively correlated with exercises lesson attendance (r = 0.209, p < 0.05), with

theory lesson attendance (r = 0.273, p < 0.01), with weekly invested study time (r = 0.171, p <

0.10) and significantly negatively correlated with the surface approach (r = -0.291, p < 0.01).

The dependent variable of performance can be further divided into performance on three types of

questions: theoretical questions, calculations and application questions. As shown in

table 10, performance on each type of question is significantly positively correlated with general

performance on the exam, both with and without retrospective correcting for guessing. This

means that a high mark on each type of question is associated with a high grade on the final

exam. Moreover, there is a positive correlation between gender and performance on application

questions, indicating that female students perform better than men on this category of questions

(r = 0.183). This correlation is only significant at the 0.10 level, though it should be noted that

significance at the 0.05 level was narrowly missed (p = 0.053). Furthermore, the amount of

time students spent each week working on the corporation tax course at home is significantly positively

correlated with performance on theoretical questions (r = 0.219, p < 0.05) and performance on

the calculations (r = 0.234, p < 0.05). There is also a negative correlation between performance

on theoretical questions and the surface approach, significant at the 0.10 level (r = -0.180, p =

0.059). Finally, performance on each type of question is significantly positively correlated with

ability.
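
The significance of a correlation coefficient can be checked with the statistic t = r * sqrt(n - 2) / sqrt(1 - r²), which follows a t distribution with n - 2 degrees of freedom. The sketch below applies it to the correlation between gender and performance on application questions, assuming that all n = 112 linked respondents entered this correlation; the function name is illustrative.

```python
from math import sqrt


def t_from_correlation(r: float, n: int) -> float:
    """t statistic for testing whether a Pearson correlation differs from zero."""
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)


if __name__ == "__main__":
    # Gender versus performance on application questions (r = 0.183), n assumed 112.
    print(round(t_from_correlation(0.183, 112), 2))
```

With 110 degrees of freedom, the two-sided critical value at the 0.05 level is roughly 1.98, so a t value of about 1.95 falls just short of it, in line with the reported p = 0.053. Because gender is a binary variable, this statistic coincides in absolute value with the independent samples t statistic for the same comparison.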

3.3 Gender differences

By means of the independent samples t-test, gender differences are examined for all variables.

In table 11, an overview can be found of the mean scores of men and women on the different

variables, the mean differences between the sexes, the obtained t-test score and the

corresponding level of significance. The Levene’s test was applied to check whether the

variances were approximately equal across both groups. The p-value of this test was larger than

0.05 for almost all variables, meaning that equal variances for these variables could be assumed.

Only for the variables measuring the preference of scoring method and the weekly invested

study time did the Levene’s test show a p-value below 0.05. A significant result for this test

implies that equal variances cannot be assumed. This can be solved by using the results of the “Welch

modified t-test” or by choosing the Mann-Whitney U-test, which is the method preferred here. The

Mann-Whitney U-test is a valuable alternative when the condition of equal variances is not met.

The results of the Mann-Whitney U-test for these two variables can also be found in

panel B of the table below.

Table 11: Gender differences (Independent samples T-test & Mann-Whitney U-test)

Panel A: T-test

Variable                                                             Mean men  Mean women  Mean difference  t      p-value
Performance with NR scoring (mark on 40)                             31.18     30.46       0.72             0.64   0.524
Performance with retrospective correcting for guessing (mark on 40)  27.58     26.57       1.01             0.64   0.524
Performance theoretical questions (mark on 10)                       8.30      8.20        0.10             0.32   0.749
Performance calculations (mark on 10)                                7.46      6.94        0.52             1.52   0.130
Performance application questions (mark on 10)                       7.58      8.29        -0.70            -1.96  0.053
Familiarity with retrospective correcting                            4.69      4.59        0.10             0.91   0.366
Lesson attendance (exercises)                                        4.43      4.64        -0.21            -1.02  0.310
Lesson attendance (theory)                                           4.53      4.28        0.25             1.10   0.273
Perceptions course difficulty (mark on 10)                           7.60      8.13        -0.53            -2.32  0.022
Deep approach                                                        2.62      2.60        0.02             0.25   0.805
Surface approach                                                     2.62      2.66        -0.04            -0.35  0.726
Ability (mark on 1000)                                               569.78    591.72      -21.94           -1.18  0.242

Panel B: Mann-Whitney U-test

Variable                                                             Mean men  Mean women  Mean difference  p-value
Preference scoring method (mark on 10)                               8.28      9.22        -0.95            0.004
Time weekly spent (excl. lessons)                                    1.60      1.93        -0.33            0.411


From this table, it can be concluded that, although male students on average performed better

on the exam, the performance of men and women did not differ significantly. A logical

consequence of implementing retrospective correcting for guessing is that the mean difference

in performance becomes even larger in favour of male students. This is due to the fact that a

higher cut-off score has to be reached to pass the exam in comparison to NR scoring, where no

correction for guessing is applied. Consequently, it seems that the first part of hypothesis 1,

stating that female students outperform male students when MC exams are scored

retrospectively for guessing, cannot be confirmed. When looking at ability, an opposite trend is

observed. On average, female students obtained a higher total score in their second

bachelor year than their male counterparts. Nevertheless, this difference in performance between

men and women is again not significant.

Also when looking at the different types of MC questions, no significant gender differences in

performance were found. However, regarding application questions, it has to be mentioned that

significance at the 0.05 level is narrowly missed (p = 0.053). Female students in this sample

performed better than male students on application questions. Applications are, according to

Bloom’s taxonomy, more complex to solve than theoretical questions and calculations. Hence,

contrary to the second part of hypothesis 1, there appears a small gender effect in favour of

women when more complex questions are involved. The differences in mean scores between

both sexes on the three types of questions are visualised in figure 4 below. From this figure it

can also be seen that women perform considerably better regarding application questions.


Figure 4: Mean scores on the different types of MC questions (mark on 10)

Concerning the other independent variables, table 11 shows significant differences between

male students and female students related to the preference of scoring method and students’

perceptions about course difficulty. With regard to the preference of scoring method, female

students have a significantly higher preference for this non-conventional scoring method

compared to male students (mean of 8.28 for male versus 9.22 for female, p = 0.004).

Furthermore, female students perceive the course of corporation tax as significantly more difficult

than their male counterparts (mean of 7.60 for male versus 8.13 for female, p = 0.022).

Hence, it can be concluded that table 11 above shows results quite similar to those detected

in the correlation table. Gender differences appear with regard to performance on application

questions (significant only at the 0.10 level), preference of scoring method and perceptions

about course difficulty.


3.4 Hypotheses testing

In this section, the hypotheses are tested by examining the influence of the independent

variables on performance on the exam, which was corrected retrospectively for guessing.

Additionally, the results are shown in case no “standard setting” was applied. It will become

clear that the same results are yielded as when retrospective correcting for guessing is used.

Furthermore, the possible relationships between each independent variable and performance on

the distinct types of questions are tested as well, though there were no hypotheses formulated,

except for gender, regarding performance on these categories of questions.

3.4.1 Hypothesis 1

The first hypothesis claimed that female students perform better than male students on MC

examinations that are scored retrospectively for guessing. This was, however, not detected in

the t-test table. Nor was a gender effect found for performance on the distinct types of

questions, with the exception of a small gender effect for performance on application questions.

Therefore, additional ANCOVAs with performance as dependent variable and gender as

independent variable are performed. ANCOVAs are used for comparing groups on a dependent

variable when it is expected that another variable (i.e. “the covariate”) also affects the

dependent variable in addition to the independent variable (De Moor & Van Maele, 2008).

Since performance is highly correlated with ability, ability is added as the covariate. The results

can be found in table 12. A significant impact of gender on performance on calculations and

performance on application questions is found, while controlling for ability. Similar to the

results of the t-test, the results of the ANCOVAs reveal that female students performed better

on applications in comparison to male students. As detected in the t-tests, significance at the

0.05 level was narrowly missed, but attained at the 0.10 level. Regarding calculations, men

performed significantly better than women. Thus, although men in this sample on average have

a lower ability level, they do better on calculations compared to women. For general exam

performance, both with and without retrospective correcting for guessing, no significant impact

of gender has been found. The same conclusion can be drawn for performance on the simplest

questions, namely the theoretical questions.
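For readers less familiar with ANCOVA, the model behind table 12 can be written as a linear model with a group indicator and a covariate. The notation below is a generic textbook formulation rather than one taken from this study: $g_i$ is an assumed 0/1 gender indicator, $a_i$ is ability, and $\bar{a}$ is the sample mean of ability.

```latex
% ANCOVA with one binary factor (gender) and one covariate (ability)
y_i = \beta_0 + \beta_1 g_i + \beta_2 a_i + \varepsilon_i

% Estimated marginal mean for group g: predicted score at the overall mean ability
\hat{\mu}_g = \hat{\beta}_0 + \hat{\beta}_1 g + \hat{\beta}_2 \bar{a}, \qquad g \in \{0, 1\}
```

Under this formulation, the F-test on gender asks whether $\beta_1$ differs from zero once ability is held constant, which is why the estimated marginal means in table 12 can differ from the raw means in table 11.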


Table 12: ANCOVA for gender differences in performance (control variable: ability)

| Variable | Estimated marginal mean men | Estimated marginal mean women | F | p-value |
|---|---|---|---|---|
| Performance with NR scoring (mark on 40) | 31.68 | 30.36 | 2.24 | 0.137 |
| Performance with retrospective correcting for guessing (mark on 40) | 28.29 | 26.43 | 2.24 | 0.137 |
| Performance theoretical questions (mark on 10) | 8.37 | 8.20 | 0.49 | 0.485 |
| Performance calculations (mark on 10) | 7.66 | 6.89 | 7.90 | 0.006 |
| Performance application questions (mark on 10) | 7.60 | 8.29 | 3.73 | 0.056 |

Consequently, hypothesis 1 cannot be confirmed. First, no significant impact of gender was

found for performance on the exam, which was corrected retrospectively for guessing. Second,

a gender effect in favour of male students with calculations and a gender effect in favour of

female students with applications have been detected. This contradicts prior findings of Leaver

& van Walbeek (2006) who found a gender effect in favour of male students for all types of

questions and especially for the more complex questions.

3.4.2 Hypothesis 2

Hypothesis 2a supposed that repeating a course is associated with a higher performance on MC

exams which are corrected retrospectively for guessing. However, this hypothesis will not be

tested as the number of respondents retaking the course is very low (n = 5).

Hypothesis 2b stated that students who are familiar with MC exams that are corrected

retrospectively for guessing, are predisposed to perform better on MC exams where this scoring

method is applied. Table 13 shows that a positive but insignificant coefficient is found

(coefficient = 0.237). Consequently, hypothesis 2b can be rejected. As might be expected, a significant

impact of familiarity with retrospective correcting for guessing has been found neither for exam

performance when NR scoring is applied nor for the performance scores on the three distinct

types of questions.


Table 13: Regression of familiarity with retrospective correcting for guessing on performance

| Variable | Performance with retrospect. correcting (mark on 40) | Performance with NR scoring (mark on 40) | Performance theoretical questions (mark on 10) | Performance calculations (mark on 10) | Performance application questions (mark on 10) |
|---|---|---|---|---|---|
| C | 25.835 (3.963)*** | 29.938 (6.464)*** | 7.862 (6.349)*** | 7.421 (5.184)*** | 6.741 (4.479)*** |
| Familiarity with retrospective correcting for guessing | 0.237 (0.169) | 0.168 (0.169) | 0.081 (0.306) | -0.064 (-0.209) | 0.280 (0.866) |
| Model F | 0.029 | 0.029 | 0.094 | 0.044 | 0.750 |
| Model p-value | 0.866 | 0.866 | 0.760 | 0.835 | 0.388 |
| R² | 0.000 | 0.000 | 0.001 | 0.000 | 0.007 |

Note: t-statistics are in parentheses.

*** indicates significant at the 0.01 level, ** indicates significant at the 0.05 level, * indicates significant at the 0.10 level.

Similarly, no significant impact of familiarity with retrospective correcting for guessing on performance has been found

when adding ability as a control variable to the regression model. In this model, only ability turned out to be an important predictor

of performance.

Hypothesis 2c asserts that preference for MC examinations that are corrected retrospectively

for guessing is associated with higher performance on MC examinations where this scoring

method is applied. Table 14 shows a positive, but insignificant regression coefficient for

preference of scoring method (coefficient = 0.279). Hence, hypothesis 2c cannot be

confirmed either. As might be expected, no significant impact of the preferred scoring method

was found for exam performance without retrospective correcting for guessing or for

the performance scores on the three distinct types of questions.

Table 14: Regression of preference of scoring method on performance

| Variable | Performance with retrospect. correcting (mark on 40) | Performance with NR scoring (mark on 40) | Performance theoretical questions (mark on 10) | Performance calculations (mark on 10) | Performance application questions (mark on 10) |
|---|---|---|---|---|---|
| C | 24.448 (5.798)*** | 28.950 (9.663)*** | 7.943 (9.903)*** | 6.854 (7.391)*** | 6.690 (6.899)*** |
| Preference scoring method | 0.279 (0.598) | 0.199 (0.599) | 0.033 (0.375) | 0.030 (0.296) | 0.152 (1.411) |
| Model F | 0.358 | 0.358 | 0.140 | 0.088 | 1.991 |
| Model p-value | 0.551 | 0.551 | 0.709 | 0.768 | 0.161 |
| R² | 0.003 | 0.003 | 0.001 | 0.001 | 0.018 |



Similarly, no significant impact of preference of scoring method on performance has been found when adding ability as a control

variable to the regression model. In this model, only ability turned out to be an important predictor of performance.


3.4.3 Hypothesis 3

Hypothesis 3 states that there is a positive relationship between lesson attendance and

performance on MC exams, which are corrected retrospectively for guessing. The respective

coefficients for theory and exercises lesson attendance are 0.871 and 0.146 in case of

retrospective correcting for guessing. The coefficients are, however, insignificant, meaning that

lesson attendance seems to have no impact on exam performance. From table 15, we can

conclude that hypothesis 3 cannot be confirmed. Furthermore, no significant impact of lesson

attendance was found for performance on the three distinct types of questions. These findings

conflict with previous research such as the studies of Kirby & McElroy (2003) and Aden,

Yahye, & Dahir (2013) who found that lesson attendance has a significant positive effect on

performance.

Table 15: Regression of lesson attendance on performance

| Variable | Performance with retrospect. correcting (mark on 40) | Performance with NR scoring (mark on 40) | Performance theoretical questions (mark on 10) | Performance calculations (mark on 10) | Performance application questions (mark on 10) |
|---|---|---|---|---|---|
| C | 22.458 (6.490)*** | 27.536 (11.200)*** | 7.285 (11.092)*** | 6.041 (7.958)*** | 8.549 (10.617)*** |
| Lesson attendance (theory) | 0.871 (0.957) | 0.619 (0.957) | 0.138 (0.796) | 0.175 (0.874) | 0.135 (0.635) |
| Lesson attendance (exercises) | 0.146 (0.150) | 0.104 (0.151) | 0.077 (0.417) | 0.070 (0.328) | -0.241 (-1.064) |
| Model F | 1.065 | 1.066 | 1.185 | 1.187 | 0.573 |
| Model p-value | 0.348 | 0.348 | 0.310 | 0.309 | 0.565 |
| R² | 0.019 | 0.019 | 0.021 | 0.021 | 0.010 |



Similarly, no significant impact of either theory or exercises lesson attendance on performance has been found when adding

ability as a control variable to the regression model. In this model, only ability turned out to be an important predictor of

performance.

However, an additional assumption can be made with regard to lesson attendance. More

specifically, it may be assumed that every student who attended the last class of corporation

tax filled out the survey. Consequently, students who did not participate in the survey are

assumed to have been absent during this last class. When comparing performance between the

respondents who completed the questionnaire and were present during the last class on the one

hand, and the students who did not participate in the survey and who are assumed to be absent

on the other hand, significant differences in performances are found. The results of the t-test,

comparing these two groups, are shown in table 16. From this test, it can be concluded that

students who attended the last class performed significantly better on the exam compared

to students who are assumed to have been absent as they did not participate in the survey (p < 0.001).

This significant difference in performance also applies for the performance scores on the three

distinct types of questions.

Table 16: Additional t-test regarding attendance of last course

| Variable | Mean absent students | Mean respondents | Mean difference | t | p-value |
|---|---|---|---|---|---|
| Performance with NR scoring (mark on 40) | 27.18 | 30.71 | -3.53 | -5.15 | 0.000 |
| Performance with retrospective correcting for guessing (mark on 40) | 21.96 | 26.93 | -4.97 | -5.15 | 0.000 |
| Performance theoretical questions (mark on 10) | 7.41 | 8.24 | -0.82 | -4.24 | 0.000 |
| Performance calculations (mark on 10) | 6.19 | 7.12 | -0.93 | -4.48 | 0.000 |
| Performance application questions (mark on 10) | 7.18 | 8.04 | -0.85 | -3.80 | 0.000 |

3.4.4 Hypothesis 4

Hypothesis 4 asserts that weekly invested study time has a positive effect on performance on

MC exams that are corrected retrospectively for guessing. Table 17 shows that the coefficient

for this variable is positive and significant (coefficient = 1.767, p < 0.05). This means that

students who spent more time each week studying at home for the course of corporation tax

achieved higher grades on the exam that was scored retrospectively for guessing. This finding

supports hypothesis 4. The R² is equal to 0.057, meaning that 5.7% of the variance in

performance on the exam can be explained by the variable of the weekly invested study time.

Logically, the same positive and significant impact of time spent is found for performance on

the exam when no retrospective correcting for guessing is applied. With regard to performance

on the three distinct types of questions, a positive and significant effect of time spent was found

for the theoretical questions (coefficient = 0.307, p < 0.05) and calculations (coefficient = 0.380,

p < 0.05). Concerning performance on application questions, no significant impact of reported

study time was found.


Table 17: Regression of time weekly spent on performance

| Variable | Performance with retrospect. correcting (mark on 40) | Performance with NR scoring (mark on 40) | Performance theoretical questions (mark on 10) | Performance calculations (mark on 10) | Performance application questions (mark on 10) |
|---|---|---|---|---|---|
| C | 23.726 (16.444)*** | 28.439 (27.745)*** | 7.681 (27.882)*** | 6.435 (20.283)*** | 7.819 (22.774)*** |
| Time weekly spent | 1.767 (2.580)** | 1.255 (2.580)** | 0.307 (2.350)** | 0.380 (2.525)** | 0.119 (0.733) |
| Model F | 6.656 | 6.654 | 5.523 | 6.375 | 0.537 |
| Model p-value | 0.011 | 0.011 | 0.021 | 0.013 | 0.465 |
| R² | 0.057 | 0.057 | 0.048 | 0.055 | 0.005 |



When adding ability as a control variable to the regression model, a positive impact of time spent was still found for performance

with and without retrospective correcting for guessing, though only significant at the 0.10 level. The significant impact of time spent

on performance on theoretical questions and calculations disappeared completely when controlling for ability. These results might

be explained by the correlation between time spent and ability.
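In a regression with a single predictor, the model F-statistic equals the square of the predictor's t-statistic, which offers a quick consistency check on table 17. The short Python sketch below applies that check to the reported values. The dictionary name, the function name and the rounding tolerance are illustrative choices and not part of the original analysis.

```python
# Consistency check for the single-predictor regressions in table 17:
# with one predictor, the model F-statistic equals t squared.
# The (t, F) pairs are copied from table 17; the tolerance is an assumption
# that allows for rounding of the reported figures.
TABLE_17 = {
    "retrospective correcting": (2.580, 6.656),
    "NR scoring": (2.580, 6.654),
    "theoretical questions": (2.350, 5.523),
    "calculations": (2.525, 6.375),
    "application questions": (0.733, 0.537),
}


def f_matches_t_squared(t_value, f_value, tolerance=0.01):
    """Return True when F and t**2 agree up to rounding error."""
    return abs(t_value ** 2 - f_value) <= tolerance


for outcome, (t_value, f_value) in TABLE_17.items():
    status = "consistent" if f_matches_t_squared(t_value, f_value) else "check"
    print(f"{outcome}: t^2 = {t_value ** 2:.3f}, reported F = {f_value:.3f} ({status})")
```

The same identity explains why the model p-value and the p-value of the single coefficient coincide in each of these one-predictor tables.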

3.4.5 Hypothesis 5

Hypothesis 5 supposes that students perceiving a course as rather difficult will perform better

on the MC examination of this course, which is scored retrospectively for guessing, because

they will put more effort into studying the subject. Table 18 shows that the coefficient for

perceptions about course difficulty is positive but not significant (coefficient = 0.830).

Consequently, this finding leads to the rejection of hypothesis 5. Similarly, no significant

impact of perceptions about course difficulty was found for exam performance without

retrospective correcting for guessing and for performance on the three distinct types of

questions.

Table 18: Regression of perceptions about course difficulty on performance

| Variable | Performance with retrospect. correcting (mark on 40) | Performance with NR scoring (mark on 40) | Performance theoretical questions (mark on 10) | Performance calculations (mark on 10) | Performance application questions (mark on 10) |
|---|---|---|---|---|---|
| C | 20.338 (3.937)*** | 26.032 (7.093)*** | 6.937 (7.072)*** | 6.172 (5.417)*** | 6.497 (5.430)*** |
| Perceptions course difficulty | 0.830 (1.289) | 0.590 (1.290) | 0.164 (1.340) | 0.120 (0.844) | 0.194 (1.300) |
| Model F | 1.663 | 1.663 | 1.796 | 0.713 | 1.691 |
| Model p-value | 0.200 | 0.200 | 0.183 | 0.400 | 0.196 |
| R² | 0.015 | 0.015 | 0.016 | 0.006 | 0.015 |



Similarly, no significant impact of perceptions about course difficulty on performance has been found when adding ability as a

control variable to the regression model. In this model, only ability turned out to be an important predictor of performance.


3.4.6 Hypothesis 6

Hypothesis 6 assumes a positive significant impact of the deep approach and a negative

significant impact of the surface approach on performance on the exam, which was corrected

retrospectively for guessing. Table 19 shows that only a significant impact of the surface

approach is found. The coefficient for the surface approach is -2.650. It should, however, be

noted that significance is only attained at the 0.10 level. Hence, hypothesis 6 can only partly be

supported. With regard to performance on the theoretical questions, a negative and more

significant effect of the surface approach is found (coefficient = -0.702, p < 0.05). This means

that students with a high surface approach score performed less well on theoretical questions.

The R² for this regression model equals 0.062, meaning that 6.2% of the variance in

performance on theoretical questions can be explained by the learning approach students

employ. Concerning performance on calculations and application questions, no significant

impact of the surface approach was found. Regarding the deep learning approach, no significant

impact on performance on the exam that was corrected retrospectively for guessing was found.

Nor was a significant positive impact of the deep approach found for performance on the

three categories of questions.

Table 19: Regression of learning approaches on performance

| Variable | Performance with retrospect. correcting (mark on 40) | Performance with NR scoring (mark on 40) | Performance theoretical questions (mark on 10) | Performance calculations (mark on 10) | Performance application questions (mark on 10) |
|---|---|---|---|---|---|
| C | 35.328 (5.365)*** | 36.683 (7.840)*** | 11.112 (9.058)*** | 8.370 (5.720)*** | 6.855 (4.357)*** |
| Deep approach | -0.405 (-0.244) | -0.288 (-0.244) | -0.367 (-1.187) | 0.037 (0.099) | 0.322 (0.813) |
| Surface approach | -2.650 (-1.829)* | -1.883 (-1.829)* | -0.702 (-2.600)** | -0.482 (-1.497) | 0.142 (0.411) |
| Model F | 1.713 | 1.714 | 3.492 | 1.269 | 0.348 |
| Model p-value | 0.185 | 0.185 | 0.034 | 0.285 | 0.707 |
| R² | 0.032 | 0.032 | 0.062 | 0.024 | 0.007 |



Due to the high correlation between the surface approach and ability, the learning approaches and ability have not been taken

together into one regression model.


3.4.7 Robustness check

In what follows, the results of the multiple regression analyses, which included all the

independent variables, are discussed. Only the variable measuring how many times students

participated in the exam is excluded. As already mentioned, a condition for multiple regression

is that the independent variables are not highly correlated as this can result in redundant

information in the regression model. This problem, called multicollinearity, can be assessed

by means of the variance inflation factor (VIF). When taking performance on the exam with

“standard setting” as the dependent variable, the highest value for VIF is 2.163 for the variable

of theory lesson attendance. Hence, multicollinearity is not a concern here (Verlet, 2015).

The results are similar to those of the “single regressions” and are summarized in table 20. On

the one hand, there is a significant positive effect of the weekly invested study time on exam

performance (coefficient = 0.243, p = 0.020). On the other hand, the surface approach has a

negative effect on exam performance, significant at the 0.10 level (coefficient = -0.192, p =

0.057). The adjusted R² takes into account the number of variables that have been included as

independent variables and indicates to which degree the variance in the score on the exam,

corrected retrospectively for guessing, can be explained by all independent variables in the

model (Verlet, 2015). The adjusted R² of this regression model equals 0.059, meaning that 5.9%

of the variance in performance on the exam can be explained by the regression model.
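The two diagnostics used above follow simple formulas. The VIF of predictor j is 1 / (1 - R_j²), where R_j² comes from regressing that predictor on the other predictors, and the adjusted R² rescales R² by the number of observations and predictors. The Python sketch below illustrates both under simplifying assumptions: it handles only the two-predictor case (where R_j² reduces to the squared Pearson correlation), and the function names and attendance scores are hypothetical rather than taken from this study.

```python
def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def vif_two_predictors(x1, x2):
    """VIF shared by both predictors in a model with exactly two predictors.

    With two predictors, regressing one on the other gives R_j^2 = r^2,
    so VIF = 1 / (1 - r^2).
    """
    r = pearson_r(x1, x2)
    return 1 / (1 - r ** 2)


def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Hypothetical attendance scores for five students, not the study's data.
theory_attendance = [1, 2, 3, 4, 5]
exercise_attendance = [2, 1, 4, 3, 5]
print(f"VIF = {vif_two_predictors(theory_attendance, exercise_attendance):.3f}")
print(f"Adjusted R^2 = {adjusted_r_squared(0.5, n_obs=21, n_predictors=4):.3f}")
```

A VIF close to 1 indicates that a predictor shares little variance with the others, and the adjusted R² falls below R² more sharply as predictors are added relative to the sample size.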

Table 20: Regression of all the independent variables on performance with retrospective correcting for guessing

| Variable | Standardized coefficients (Beta) | t-value | p-value |
|---|---|---|---|
| Constant | | 1.786 | 0.077 |
| Gender | -0.115 | -1.118 | 0.266 |
| Preference scoring method | 0.035 | 0.351 | 0.726 |
| Familiarity with retrospective correcting | 0.038 | 0.394 | 0.694 |
| Lesson attendance (exercises) | 0.061 | 0.440 | 0.661 |
| Lesson attendance (theory) | 0.059 | 0.425 | 0.672 |
| Time weekly spent | 0.243 | 2.357 | 0.020 |
| Perceptions course difficulty | 0.112 | 1.115 | 0.268 |
| Deep approach | -0.066 | -0.620 | 0.537 |
| Surface approach | -0.192 | -1.929 | 0.057 |

Model summary

| Dependent variable | Performance with retrospective correcting for guessing |
|---|---|
| F (model) | 1.751 |
| p-value (model) | 0.088 |
| Adjusted R² | 0.059 |

* Logically, the same results have been found when taking performance on the exam without retrospective correcting for guessing (i.e. “performance with NR scoring”) as the dependent variable in the regression model.


Also when taking the performance on theoretical questions as the dependent variable, similar

conclusions can be drawn as with the “single” regressions. Table 21 shows that weekly invested

study time has a significantly positive influence (coefficient = 0.217, p = 0.036), while the

surface approach has a significantly negative influence (coefficient = -0.269, p = 0.007). The

adjusted R² of this regression model is 0.080, meaning that 8% of the variance in performance

on theoretical questions can be explained by this regression model.

Table 21: Regression of all the independent variables on performance on theoretical questions

| Variable | Standardized coefficients (Beta) | t-value | p-value |
|---|---|---|---|
| Gender | -0.074 | -0.721 | 0.472 |
| Familiarity with retrospective correcting | 0.036 | 0.378 | 0.706 |
| Lesson attendance (exercises) | 0.065 | 0.478 | 0.634 |
| Deep approach | -0.146 | -1.380 | 0.171 |

Model summary

| Dependent variable | Performance on theoretical questions (mark on 10) |
|---|---|
| F (model) | 2.028 |
| Adjusted R² | 0.080 |

* The VIF’s are all below 3.

In table 22, the results of the regression model using performance on calculations as the

dependent variable are shown. Again, there is a significantly positive impact of the weekly

invested study time (coefficient = 0.245, p = 0.019). Furthermore, the gender coefficient is

negative and significant (coefficient = -0.209, p = 0.045). This means that male students

performed better on calculations than female students. The adjusted R² of this regression model

equals 0.064, meaning that 6.4% of the variance in performance on calculations can be

explained by this regression model.


Table 22: Regression of all the independent variables on performance on calculations

| Variable | Standardized coefficients (Beta) | t-value | p-value |
|---|---|---|---|
| Gender | -0.209 | -2.030 | 0.045 |
| Familiarity with retrospective correcting | -0.007 | -0.076 | 0.940 |
| Lesson attendance (exercises) | 0.121 | 0.880 | 0.381 |
| Deep approach | -0.035 | -0.327 | 0.744 |

Model summary

| Dependent variable | Performance on calculations (mark on 10) |
|---|---|
| F (model) | 1.811 |
| Adjusted R² | 0.064 |


Finally, table 23 shows the results of the regression model with performance on applications as

the dependent variable. The independent variables seem to have no significant impact on

performance on application questions. Only for gender, significance is attained at the 0.10 level

(p = 0.074). The gender coefficient is positive, indicating that female students performed

better on application questions than male students, albeit only at the 0.10 level (coefficient = 0.189). The

adjusted R² of this regression model is 0.029, meaning that only 2.9% of the variance in

performance on applications can be explained by this regression model. The low explanatory

power of the independent variables for performance on applications might be due to the fact

that the exam of corporation tax contained only six application questions. Hence, results

regarding performance on applications might be influenced and have to be interpreted with

caution.


Table 23: Regression of all the independent variables on performance on application questions

| Variable | Standardized coefficients (Beta) | t-value | p-value |
|---|---|---|---|
| Gender | 0.189 | 1.807 | 0.074 |
| Familiarity with retrospective correcting | 0.141 | 1.428 | 0.156 |
| Lesson attendance (exercises) | -0.186 | -1.327 | 0.188 |
| Deep approach | 0.064 | 0.587 | 0.558 |
| Surface approach | 0.058 | 0.576 | 0.566 |

Model summary

| Dependent variable | Performance on application questions (mark on 10) |
|---|---|
| F (model) | 1.359 |
| Adjusted R² | 0.029 |



4 Discussion

In the present study, the relationship between performance on multiple choice exams that are

corrected retrospectively for guessing and gender was the main research focus. This focus has

grown out of concern that a gender effect may occur. Many previous studies found a gender

effect in favour of male students in MC examinations, especially in case these exams were

scored by means of negative marking (NM). Concerning MC exams which are corrected

retrospectively for guessing, as nowadays applied at the University of Ghent, research is very

scarce. Since prior literature also held other students’ characteristics responsible for differences

in performance, their explanatory power has been investigated in this study. In what follows,

the findings of each of the hypotheses are discussed and other important comments are given.

First, it can be concluded that gender has no significant impact on the performance on the

exam, which was corrected retrospectively for guessing. Additionally, the score on the exam

has been further refined to performance on different categories of questions. Bloom’s taxonomy

has been used to categorize questions according to the level of cognitive reasoning required.

The exam of corporation tax consisted of theoretical questions, calculations and application

questions, which can be assigned respectively to the first (“knowledge”), second

(“comprehension”) and third level (“application”) of this taxonomy. The higher the level of

Bloom’s hierarchy, the more complex questions become (Leaver & van Walbeek, 2006). By

performing ANCOVAs of gender on performance, while controlling for ability, a significant

gender effect was found for the calculations and the application questions. Concerning

calculations, male students performed significantly better compared to female students. Also

when including all the independent variables in a regression model, the gender coefficient was

negative and significant. This finding is in line with the studies of Du Plessis & Du Plessis (2007)

and Declerck (2010), who also found that male students scored consistently better on MC

questions of a quantitative nature. For performance on application questions, the results show a

positive gender coefficient and a trend towards significance at the 0.05 level. This positive

gender effect means that women tended to perform better than men on application

questions. Applications can be considered as the most difficult questions that have been posed

on the exam. They do not only require students to memorize and understand material, but

involve higher levels of thinking as students have to apply previously learned information in

new situations. Finally, no significant gender effect has been found with regard to performance

on theoretical questions, which belong to the lowest level of Bloom’s taxonomy. These findings


do not correspond with the results of Leaver & van Walbeek (2006) who found a significant

gender effect in favour of male students for all types of questions, and especially for those

categorized at higher levels of Bloom’s taxonomy. However, in this study, caution is needed

when interpreting and generalizing the results concerning performance on applications. As the

exam of corporation tax contained only six application questions, this might have influenced

the results. Hence, no evidence was found that supports hypothesis 1.

Although not expected in advance, other gender differences have been detected. The correlation

table and the results of the t-test revealed that women perceive the corporation tax course as more

difficult compared to men. Furthermore, female students have a higher preference for the

retrospective correcting for guessing scoring method than male students. This might be

explained by the fact that the transition from NM towards ‘standard setting’ at the University

of Ghent mainly benefits women. The higher risk aversion in women, as frequently observed

in prior literature, becomes completely irrelevant with this non-conventional scoring method.

The transition increased the mark on 20 by 0.89 for women and by 0.46 for men (Van de

Poele & Sabbe, 2016).

Second, a significant positive effect of time spent on exam performance has been detected,

supporting hypothesis 4. This means that the more students studied at home for this course, the

higher their grades on the exam which was scored retrospectively for guessing. Hence, it can

be concluded that these findings are in line with the studies of Rau & Durand (2000),

Stinebrickner & Stinebrickner (2004), and Diseth, Pallesen, Brunborg, & Larsen (2010). A

significant, positive impact of time spent was also found for general performance on the rather

simple questions, being the theoretical questions and calculations. Also when the other

independent variables were included in the regression model, a significant positive effect of

weekly invested study time was found. However, regarding performance on application

questions, no positive impact of time spent was found. On the one hand, results for application

questions might again be influenced by the limited number of applications on the exam. On the

other hand, the fact that invested study time has no significant impact on performance on

applications may be explained by differences between students regarding skill sets. Students

with a higher critical thinking capacity, for instance, might attain the same or better marks on

applications with less study time invested (Plant et al., 2005).


Third, when investigating the relation between students’ learning approaches and exam

performance, evidence for hypothesis 6 is only found with regard to the surface approach. The

use of the surface approach has a negative influence on performance on the exam when

retrospective correcting for guessing is used. However, it should be noted that significance was

only attained at the 0.10 level. When looking at the impact of the surface approach on the

general performance scores on the three categories of questions, a negative and more significant

effect of the surface approach on performance on theoretical questions was found. This means

that having a surface approach results in lower performance on theoretical questions. When

adding the other independent variables in the regression model, there is an even more significant

impact of the surface approach on performance on theoretical questions. As such, it seems that

the negative relationship between a surface approach and performance on theoretical questions

is a robust one. Concerning performance on calculations and applications, no significant

negative effect of the surface approach was found. With regard to the deep learning approach,

no significant impact was found, either for performance on the exam or for performance on the

three types of questions. Numerous prior studies, such as the study of Diseth & Martinsen

(2003), likewise found no evidence for the deep approach and only found that higher surface

approach scores are associated with less successful academic performance. Furthermore, these

authors argue that this may be due to the fact that academic courses frequently include a fixed

curriculum and that the standards for good exam performance are well defined. Consequently,

students are not really encouraged or invited to explore subjects which are not included in the

curriculum. Hence, students may feel more inclined to adopt a surface approach to learning.

Moreover, the Cronbach’s alphas for the learning approaches are somewhat lower than the

alpha values of 0.73 and 0.64 obtained by the study of Biggs, Kember, & Leung (2001). This

is particularly true for the Cronbach’s alpha of the deep learning approach, which amounted

to only 0.65 in this study.

Though the deep approach had no significant influence on performance, a significant positive

correlation was found between the deep approach and time spent studying at home for the

course. Students with a higher deep approach score reported having spent more time on corporation

tax during the semester. This is not surprising, as students with a deep learning approach have

a sincere interest in the subject and put more effort into thoroughly understanding the material

compared to those with a surface approach. This significant positive effect of the deep learning

approach on time spent was also found by Everaert, Opdecam, & Maussen (2017).

Further, plotting students’ approaches to learning with a mean split indicated that 25.8% of the

students mainly had a surface approach and 25% of the students mainly employed a deep

approach. However, 29% of the students scored below the mean for both learning approaches.

Although this is a surprisingly large group, prior research has identified a profile that

consisted of low scores on both learning approaches. A study of law students found that 23%

of the students scored low on both learning approaches (Lindblom-Ylanne, in:

Gijbels, Van de Watering, Dochy, & Van den Bossche, 2005). A more recent study by

Everaert, Opdecam, & Maussen (2017) also found a large cohort of students scoring low on both

approaches. They called these students rote learners. The aim of these students is not to

thoroughly understand the material, but they rather seek self-fulfilment by revising and revising

the material. These students often know the course by heart, but do not understand what they

have learned. Consequently, they fail to apply the learned information in new situations. They

are called rote learners, because they are willing to invest more effort in studying than strictly

necessary to pass the course, but probably not in an adequate manner. Although this kind of

combination in learning approaches is considered as “disintegrated”, this profile is quite typical

for novice students (Gijbels et al., 2005). The students in this sample are, however, third

bachelor students, and hence, can no longer be considered as “novice”. Nevertheless, it seems

that they still face difficulties in approaching their studies. Finally, 20.2% of the respondents

reported high scores on both learning approaches. Although the smallest number of respondents

fitted in this group, it is still a relatively large group. This is surprising as the learning

approaches are theoretically considered mutually exclusive (Biggs, 1987). Again Gijbels et al.

(2005) argued that this profile is quite typical of novice students. Hence, it can be concluded

that a large percentage of the students in the sample struggles to find a suitable method for

approaching their studies.
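The four learning-approach profiles discussed above follow from a mean split on the two scale scores. A minimal sketch of such a classification is given below. The function name, the labels, and the example scores are hypothetical and are not taken from the data of this study. It is also assumed here that a score exactly equal to the mean counts as "low".

```python
from statistics import mean


def mean_split_profiles(deep_scores, surface_scores):
    """Label each student by comparing both scale scores with the sample means."""
    deep_mean = mean(deep_scores)
    surface_mean = mean(surface_scores)
    labels = []
    for deep, surface in zip(deep_scores, surface_scores):
        high_deep = deep > deep_mean          # assumption: ties count as low
        high_surface = surface > surface_mean
        if high_deep and high_surface:
            labels.append("high on both")
        elif high_deep:
            labels.append("mainly deep")
        elif high_surface:
            labels.append("mainly surface")
        else:
            labels.append("low on both")
    return labels


# Hypothetical scale scores for six students (not data from the study).
deep = [3.8, 2.1, 3.5, 2.0, 4.0, 2.2]
surface = [2.0, 3.9, 3.6, 1.9, 2.1, 2.0]
profiles = mean_split_profiles(deep, surface)
```

Counting the labels and dividing by the number of students gives the share of each profile, which is how percentages such as those reported above could be obtained.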

Fourth, concerning familiarity with MC exams that are scored retrospectively for guessing,

the descriptive statistics revealed that students feel very comfortable with this marking method

and understand how grades are calculated. This is a great advantage compared to the NM

scoring method, which was frequently applied at Ghent University in the past. In case of NM,

students were often too occupied figuring out the optimal answering strategy as different

teachers sometimes attached different amounts of penalties to incorrect answers. However,

feeling familiar with retrospective correcting for guessing did not result in higher performance

on this type of exam. Nevertheless, more research is needed here, since the Cronbach’s alpha

was extremely low and might have influenced the results. This low value is possibly due to the

fact that the statements are not based on an existing scale, because such an instrument was not

found in prior literature. Hence, we produced a scale of our own to take this factor into account.

Further, the low value of the Cronbach’s alpha can be attributed to the low number of statements

included. No evidence was found either for hypothesis 2c, stating that students preferring the

retrospective correcting for guessing scoring method perform better on this type of MC

examination than those with a higher preference for NM. The absence of a significant impact

on performance may be explained by the great unanimity among respondents about the

preferred scoring method. Only a very small percentage of the respondents (4.7%) reported

a higher preference for the NM scoring method than for retrospective correcting for

guessing.

Fifth, linking the survey answers regarding theory and exercises class attendance to

performance did not reveal a significant impact of lesson attendance on performance. This is

not in line with the studies of Krieg & Uyar (2001), Kirby & McElroy (2003), and Aden, Yahye,

& Dahir (2013). However, the results in this thesis might be biased, as responses were only

collected from students who attended the last class. The vast majority of these respondents

reported having attended between 80% and 100% of both theory and exercises classes. Notably,

more than half of the students enrolled in the corporation tax course were absent.

This is quite surprising as important information concerning the exam is often communicated

during this last class. Consequently, it is conceivable that many of these absent students also

skipped other classes of this course. Assuming that students who did not participate

in the survey were absent during the last class and that every student present filled

out the survey, significant performance differences were found between the students attending

the last class and the absent students. Students who attended the last class obtained

significantly better marks on the exam that was scored retrospectively for guessing, as well as

on each type of question, than those who skipped the last class. On the one hand, this

finding may make students aware of the importance of attending lessons, as this may result in

higher grades. On the other hand, this finding may also motivate instructors as it shows that

teaching indeed has a positive influence on the performance outcomes of students. Research

into e-learning found that an important reason for absenteeism in higher education is that

students nowadays have technological alternatives at their disposal, such as the platform “Minerva” at Ghent

University (Naber & Köhle, in: Massingham & Herrington, 2006). Furthermore, it was also

noticeable that a relatively higher number of female students attended the last class, although a

higher proportion of male students (n = 185) participated in the exam in January in comparison

to female students (n = 144).
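The comparison between students who attended the last class and those assumed to be absent is a two-group comparison of mean exam scores. This section does not state which test was used, so the sketch below assumes Welch's t-test (unequal variances) purely for illustration. The scores are hypothetical and the function name is an illustrative choice.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and the approximate degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard errors, based on sample variances (n - 1 denominator).
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df


# Hypothetical exam scores out of 20 (not data from the study).
attended = [14, 12, 15, 13, 16]
absent = [10, 9, 12, 11, 8]
t_stat, df = welch_t(attended, absent)
```

A positive t statistic indicates a higher mean for the first group. The p-value would then be read from the t distribution with the computed degrees of freedom, for instance in a statistics package.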

Sixth, perceptions about course difficulty also had no significant effect on performance. On the

one hand, this finding is in line with results of the studies of Hong (1999) and Combs, Michael,

& Fiore (2002) who found no direct association between beliefs about difficulty and

performance. On the other hand, this finding contradicts the results of Foos (1992), who found

that students who perceive a course as difficult perform better because they work harder.

However, as shown in the correlation table, no positive association was found between

perceptions about course difficulty and invested study time. Those perceiving the course as

quite difficult did not spend more time studying the material at home compared to those who

perceived the course as easier. This might explain why no significant differences in

performance are found. Finally, though not hypothesized in advance, perceptions about course

difficulty hold a significantly positive correlation with exercises class attendance. This means

that students attending more exercises classes perceived the course as more difficult.

4.1 Limitations

There are several limitations to this study that one has to be aware of. First of all, the possible

appearance of response bias is a limitation inherent to the research method of surveys. Response

bias refers to the tendency of respondents to systematically respond to questions on a different

basis than the content of the items. A common response tendency is socially desirable

responding, which means that respondents adjust their answers to what they think is socially

acceptable or politically correct, or to what they think the researcher would like to hear.

Consequently, a survey may only show what people claim to do or think, and may not always

correspond with reality (van Thiel, 2010). However, I tried to limit this gap as much as

possible by giving clear instructions for completing the survey. I emphasized that the data

would be handled confidentially and that results would be treated anonymously. This way,

I wanted to assure students that their responses would not be passed on to the responsible

teacher, so that they could answer honestly.

The second limitation is due to the rather low number of observations. Although 350 students

were enrolled in the corporation tax course, more than half of them were absent during the last

lesson when the survey was distributed. As this was the last class before the exam, this low

turnout was not expected in advance. Moreover, as some respondents could not be identified,

their answers to the questionnaire could not be linked to their score on the exam. Also for those

who completed the survey, but did not participate in the exam, influences on performance could

not be investigated.

External validity is a third research limitation and refers to the extent to which the findings of this study

can be transferred to or applied in situations other than the context in which the study was

conducted. Although research aims at generating information that can be used in other settings

as well, it should be acknowledged that a study can never produce generally transferable results

(Malterud, 2001). As there is a gap in literature concerning non-conventional scoring methods,

there are at present no clear indications that the findings of this study will also apply for other

MC exams that use a retrospective correcting for guessing scoring method. Furthermore, the

fact that this study only uses data of one course at one university, also limits the generalizability

of the results.

Fourth, as already mentioned, the Cronbach’s alpha for familiarity was very low. The alphas for the

learning approaches, and especially for the deep approach, were also below the values obtained

by Biggs, Kember, & Leung (2001).
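Since several conclusions depend on the reliability of the scales, it may help to recall how Cronbach's alpha is computed: alpha = k / (k - 1) * (1 - sum of item variances / variance of the total score), with k the number of statements. The sketch below implements this standard formula. The item scores are hypothetical and are not the survey data of this study.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for a list of item-score lists (one list per statement)."""
    k = len(items)
    if k < 2:
        raise ValueError("Cronbach's alpha needs at least two items")
    # Total scale score per respondent: sum over the k statements.
    totals = [sum(scores) for scores in zip(*items)]
    item_variance_sum = sum(variance(scores) for scores in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


# Hypothetical answers of five respondents to three 5-point statements.
items = [
    [1, 2, 3, 4, 5],
    [2, 2, 3, 4, 4],
    [1, 3, 3, 5, 5],
]
alpha = cronbach_alpha(items)
```

The formula also shows why a scale with very few statements is vulnerable: with a small k, weakly related items contribute little shared variance to the total score, so alpha stays low.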

Fifth, many previous studies also investigated the effect of student motivation on academic

performance in higher education and concluded that motivation is an important predictor for

performance (e.g. Turner, Chandler, & Heffer, 2009). However, due to the weak outcomes of

the factor and reliability analyses of the measurement instrument for motivation, this study

could not produce reliable findings for this variable.

4.2 Future research

More research is needed to investigate whether gender differences in performance occur with

MC examinations that are corrected retrospectively for guessing. Besides gender differences,

further empirical studies should also measure which other factors may have an influence on

performance on these exams. A particularly interesting avenue for future research is to compare

the explanatory value of different students’ characteristics for performance on exams scored by

means of negative marking (NM) and for performance on exams scored by means of “standard

setting”. This was not possible in the current study as students were told that “standard setting”

was the scoring method being used at the exam. Had NM been applied, students would probably have

omitted several items and, consequently, would have obtained different marks on the exam.

Furthermore, it would be very interesting to replicate this research across different disciplines

in multiple higher education institutions. In addition, other factors (e.g. the time restrictions

during examinations) can be considered, which were not taken into account in the present study.
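The contrast between NM and "standard setting" mentioned above can be made concrete with a small calculation. The sketch below rests on assumptions that are not confirmed by this section: under NM each wrong answer is assumed to cost 1 / (k - 1) points for k answer options and blank answers are assumed to score zero, while under "standard setting" the pass mark is assumed to lie halfway between the expected blind-guessing score and the maximum score. Function names and numbers are hypothetical.

```python
def negative_marking_score(correct, wrong, options):
    """Assumed NM rule: each wrong answer costs 1 / (options - 1); blanks cost nothing."""
    return correct - wrong / (options - 1)


def raised_pass_mark(questions, options):
    """Assumed rule: pass mark halfway between the chance score and the maximum."""
    chance_score = questions / options
    return chance_score + (questions - chance_score) / 2


# Hypothetical exam: 40 questions with 4 options each.
questions, options = 40, 4

# Under NM a cautious student might leave 5 items blank.
nm_score = negative_marking_score(correct=26, wrong=9, options=options)

# Under standard setting nothing is deducted, but the pass mark is raised.
cutoff = raised_pass_mark(questions, options)
```

Under these assumptions the same answer sheet is judged differently by the two methods, and students have no reason to omit items under "standard setting". This is why marks obtained under one method cannot simply be recalculated as if the other had been announced.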

5 Conclusion

A major contribution of this study is that it extends prior literature on gender bias in MC

examinations since an alternative scoring method is explored, being retrospective correcting for

guessing, also known as “standard setting” or “hogere cesuur” in Dutch. Besides gender

differences, it has been tested whether other students’ characteristics lead to advantages in

taking this type of MC exam. Furthermore, this study also investigated the effect of the

different students’ characteristics on performance on the three distinct types of questions.

This study found no evidence of the existence of a gender effect in relation to performance on

MC exams that are corrected retrospectively for guessing. No significant differences between

men and women have been found for performance on the exam. However, when making a

distinction between general scores on the different types of questions being posed on the exam,

other conclusions can be drawn. On the one hand, results showed that male students performed

significantly better on calculations compared to female students. On the other hand, female

students outperformed male students when application questions were involved. Regarding the

simplest questions, namely the theoretical questions, no gender effect on performance was found.

It is, nevertheless, recommended for instructors to incorporate a balanced mix of different types

of questions in MC exams.

Besides gender, this study also took other factors into account which might affect the

performance of students. In fact, statistically significant performance differences have been

found for weekly invested study time and the use of the surface learning approach. First, the

present study found that invested study time is a strong predictor of performance. Weekly

invested study time has a positive impact on performance on exams that are scored

retrospectively for guessing. Also for general performance on theoretical questions and

calculations, self-study time had an independent effect on performance above and beyond the

other students’ characteristics and qualitative aspects of learning activities. Hence, educators

should convince their students of the importance of investing adequate time in their learning

activities on a frequent basis, throughout the whole semester.

Second, regarding the approaches to learning, the results indicated that the use of the surface

learning approach leads to lower academic performance, while the deep approach unexpectedly

did not predict achievement. For general performance on the exam that was scored

retrospectively for guessing, however, the use of the surface approach showed only a slight

trend towards significance (p < 0.10). For performance on theoretical questions, the

simplest MC questions, the surface approach showed a more significant association with lower

performance. Consequently, it is recommended that educators try to discourage students from

employing a surface learning approach. The use of a surface approach can, for instance, be made

less attractive by matching the level of the subject with students’ prior knowledge. When

students employed a surface approach in previous subjects, they will realise that they do not

have the expected prior knowledge at the start of a new subject. Furthermore, restricting the

workload to a level that allows students to explore the material more thoroughly, may also

discourage the use of the surface approach (Biggs & Tang, 2007). Finally, the results indicated

that a large group of students (“rote learners”) scored low on both learning approaches, which

also requires further attention. Assessment has to be aligned with the desired learning outcomes

in such a way that rote recall of information is rewarded less.

Regarding lesson attendance, no evidence of a significant impact on exam performance was

initially found. The reported degree of class attendance was not associated with performance

on the exam nor with performance on the three distinct types of questions. Nevertheless, when

comparing the scores of the students attending the last class with the scores of those who are

assumed to be absent, strong differences in performance occurred. Significant performance

differences were found for performance on the exam when retrospective correcting for guessing

is applied and for general performance on the different types of questions as well. As such, it

seems that attending lectures indeed has a positive effect on academic achievement. For the

other independent variables, no evidence has been found of a significant impact on performance

on exams scored retrospectively for guessing. However, as described above, several limitations

call for more extensive research in this field.

Bibliography

Aden, A.A., Yahye, Z.A., & Dahir, A.M. (2013). The Effect of Students’ Attendance on

Academic Performance: A Case Study at Simad University Mogadishu. Academic Research

International, 4 (6), 409 – 417.

Arthur, N., & Everaert, P. (2012). Gender and performance in accounting examinations:

Exploring the impact of examination format. Accounting Education, 21(5), 471–487.

Beller, M., & Gafni, N. (2000). Can item format (multiple choice vs. open-ended) account for

gender differences in mathematics achievement? Sex Roles, 42(1 – 2), 1 – 21.

Betts, L. R., Elder, T. J., Hartley, J. & Trueman, M. (2009). Does correction for guessing reduce

students’ performance on multiple-choice examinations? Yes? No? Sometimes? Assessment &

Evaluation in Higher Education, 34(1), 1–15.

Bible, L., Simkin, M.G., & Kuechler, W.L. (2008). Using Multiple-Choice Tests to Evaluate

Students' Understanding of Accounting. Accounting Education, 17:1, S55 - S68.

Biggs, J. B. (1987). Student approaches to Learning and Studying. Hawthorn, Victoria:

Australian Council for Educational Research.

Biggs, J., Kember, D., & Leung, D. Y. (2001). The revised two-factor study process

questionnaire: R – SPQ – 2F. British Journal Of Educational Psychology, 71(1), 133-149.

Biggs, J. & Tang, C. (2007). Teaching for Quality Learning at University (3rd Ed.).

Maidenhead: McGraw Hill Education & Open University Press.

Byrne, M., Flood, B., & Willis, P. (2002). The relationship between learning approaches and

learning outcomes: a study of Irish accounting students. Accounting Education, 11(1), 27-42.

Chan, N., & Kennedy, P. E. (2002). Are multiple choice exams easier for economics students?

A comparison of multiple-choice and ‘equivalent’ constructed-response exam questions.

Southern Economic Journal, 68(4), 957-971.

Chemolli, E., & Gagné, M. (2014). Evidence against the continuum structure underlying

motivation measures derived from self-determination theory. Psychological Assessment,

26(2), 575–585.

Cohen-Schotanus, J., & Van der Vleuten, C. (2010). A standard setting method with the best

performing students as point of reference: Practical and affordable. Medical Teacher, 32, 154-160.

Combs, H. M., Michael, L., & Fiore, B. (2002). Easy Test or Hard Test, Does it Matter? The

Impact of Perceived Test Difficulty on Study Time and Test Anxiety. Retrieved on March 24,

2017, via http://www.kon.org/urc/v6/combs.html

Cortright, R., Lujan, H., Cox, J., & DiCarlo, S. (2011). Does sex (female versus male) influence

the impact of class attendance on examination performance? Advances in Physiology

Education, 35, 416-420.

Davidson, R. A. (2002). Relationship of study approach and exam performance. Journal of

Accounting Education, 20(1), 29-44.

Deci, E. L., & Ryan, R. M. (1985). Intrinsic motivation and self-determination in human

behavior. New York: Plenum

Declerck, S. (2010). De invloed van gender en examen formaat op de prestaties van studenten

[Masterproef]. Gent: Universiteit Gent Master in de bedrijfseconomie.

De Lange, P., & Mavondo, F. (2004). Gender and motivational differences in approaches to

learning by a cohort of open learning students. Accounting Education, 13(4), 431-448

De Moor, G., & Van Maele, G. (2008). Inleiding tot de biomedische statistiek. Leuven: Acco.

Diseth, A. & Martinsen, O. (2003). Approaches to learning, cognitive style, and motives as

predictors of academic achievement. Educational Psychology, 23(2), 195 – 207.

Diseth, A., Pallesen, S., Brunborg, G. S., & Larsen, S. (2010). Academic achievement among

first semester undergraduate psychology students: The role of course experience, effort,

motives and learning strategies. Higher Education, 59, 335 –352.

Downing S. M. (2003). Guessing on selected-response examinations. Medical Education, 37,

670 – 671.

Duchesne, I., & Nonneman, W. (1998). The demand for higher education in Belgium.

Economics of Education Review, 17(2), 211-218.

Du Plessis, S., & Du Plessis, S. (2007). A new and direct test of the ‘gender bias’ in multiple-

choice questions, Stellenbosch Economic Working Paper.

Espinosa, M. P. & Gardeazabal, J. (2010). Optimal correction for guessing in multiple-choice

tests. Journal of Mathematical Psychology, 54(5), 415–425.

Everaert, P., Opdecam, E., & Maussen, S. (2017). The Relationship between Motivation,

Learning approaches, Academic Performance and Time Spent. Accounting Education, 26(1),

78-107.

Foos, P. W. (1992). Test performance as a function of expected form and difficulty. Journal of

Experimental Education, 60(3), 205-211.

Gijbels, D., Van de Watering, G., Dochy, F., & Van den Bossche, P. (2005). The relationship

between students’ approaches to learning and the assessment of learning outcomes. European

Journal of Psychology of Education, 20(4), 327 – 341.

Hall, M., Ramsay, A., & Raven, J. (2004). Changing the learning environment to promote deep

learning approaches in first-year accounting students. Accounting Education, 13(4); 489-505.

Hartley, J., Betts, L., & Murray, W. (2007). Gender and assessment: differences, similarities and

implications. Psychology Teaching Review, 13(1), 34-47.

Hong, E. (1999). Test anxiety, perceived test difficulty, and test performance: temporal patterns

of their effects. Learning and Individual Differences, 11(4), 431-448.

Kastner, M., & Stangl, B. (2011). Multiple Choice and Constructed Response Tests: Do Test

Format and Scoring Matter? Procedia - Social and Behavioral Sciences, 12, 263-273.

Kirby, A., & McElroy, B. (2003). The Effect of Attendance on Grade for First Year Economics

Students in University College Cork. The Economic and Social Review, 34(3), 311-326.

Krieg, R.G., & Uyar, B. (2001). Student Performance in Business and Economics Statistics:

Does Exam Structure Matter?, Journal of Economics and Finance, 25(2), 229-241.

Leaver, R., & van Walbeek, C. (2006). "Gender bias" in multiple choice questions: does the type

of question make a difference? University of Cape Town, School of Economics Working Paper.

Lesage, E., Valcke, M., & Sabbe, E. (2013). Scoring methods for multiple choice assessment

in higher education: is it still a matter of number right scoring or negative marking? Studies

in Educational Evaluation, 39(3), 188–193.

Malterud, K. (2001). Qualitative research: standards, challenges, and guidelines. The Lancet,

358, 483-488.

Marín, C., & Rosa-García, A. (2011). Gender bias in risk aversion: evidence from multiple

choice exams. Working Paper 39987, MPRA.

Massingham, P., & Herrington, T. (2006). Does Attendance Matter? An Examination of Student

Attitudes, Participation, Performance and Attendance, Journal of University Teaching &

Learning Practice, 3(2).

Morgan, G.A., Leech, N. L., Gloeckner, G. W., & Barrett, K.C. (2004). SPSS for Introductory

Statistics: Use and Interpretation (2nd ed.). Mahwah, NJ: Lawrence Erlbaum Associates.

Niemiec, C. P., & Ryan, R. M. (2009). Autonomy, competence, and relatedness in the

classroom: Applying self-determination theory to educational practice. Theory and Research in

Education, 7, 133-144.

Nonis, S.A., & Hudson, G. I. (2006). Academic Performance of College Students: Influence of

Time Spent Studying and Working, Journal of Education for Business, 81:3, 151-159.

Norcini, J.J. (2003). Setting standards on educational tests. Medical Education, 37, 464-469.

Plant, E. A., Ericsson, K. A., Hill, L., & Asberg, K. (2005). Why study time does not predict

grade point average across college students: Implications of deliberate practice for academic

performance. Contemporary Educational Psychology, 30(1), 96-116.

Rau, W., & Durand, A. (2000). The academic ethic and college grades: Does hard work help

students to « make the grade » ? Sociology of Education, 19-38.

Ryan, R. M., & Deci, E. L. (2000). Self-determination theory and the facilitation of intrinsic

motivation, social development, and well-being. American Psychologist, 55, 68-78.

Scouller, K. (1998). The influence of assessment method on students' learning approaches:

multiple choice question examination versus assignment essay. Higher Education, 35(4), 453-

472.

Self-determination theory. (2017). Self-regulation questionnaires. Retrieved on February 26,

2017, from http://selfdeterminationtheory.org/self-regulation-questionnaires/

Stinebrickner, R., & Stinebrickner, T. (2004). Time-use and college outcomes. Journal of

Econometrics, 121, 243–269.

Turner, E. A., Chandler, M., & Heffer, R. W. (2009). The influence of parenting styles,

achievement motivation, and self-efficacy on academic performance in college students.

Journal of College Student Development, 50(3), 337-346.

Universiteit Gent. (2017). Geen giscorrectie meer bij meerkeuzevragen. Retrieved on February

10, 2017, from http://www.ugent.be/student/nl/studeren/examens/geen-giscorrectie-meer-bij-meerkeuzevragen

Van de Poele, L. , & Sabbe, E. (2016). Hogere cesuur [PowerPoint-presentatie]. Retrieved on

February 28, 2017, via

https://www.timvervoort.com/lno2/documenten/2016/De_SIG_evalueren_goes_classic_hogere_cesuur_sessie.pdf

Vansteenkiste, M., Lens, W., & Deci, E. L. (2006). Intrinsic versus extrinsic goal contents in

self-determination theory: another look at the quality of academic motivation. Educational

Psychologist, 41(1), 19-31.

van Thiel, S. (2010). Bestuurskundig onderzoek, een methodologische inleiding. Bussum:

Coutinho

Verlet, D. (2015). Onderzoeksmethoden: 6e sessie SPSS regressieanalyse. Faculteit Economie

en Bedrijfskunde, Universiteit Gent

Wester, A., & Henriksson, W. (2000). The interaction between item format and gender

differences in mathematics performance based on TIMSS data. Studies in Educational

Evaluation, 26, 79–90.

Willingham, W. W., & Cole, N. S. (1997). Gender and fair assessment. Mahwah, NJ:

Lawrence Erlbaum Associates.

Woodford, K., & Bancroft, P. (2004). Using multiple choice questions effectively in information

technology education. Paper presented at the 21st ASCILITE Conference, Perth.

Appendices

Appendix 1: Survey

Dear student,

I am a master's student in Business Economics. As part of my master's dissertation, I am researching gender bias in standard setting as a scoring method for multiple-choice exams. The aim of my research is to gain insight into which factors can influence the score on such exams. The survey consists of a few general questions, followed by several questions that relate specifically to the course corporation tax. We emphasize that the data will be treated as strictly confidential and that no specific information will be passed on to the teaching staff.

Completing this questionnaire will take 10 to 15 minutes of your time.

If you have any further questions or comments about the research, feel free to contact me at [email protected].

Thank you in advance.

Kind regards,


INSTRUCTIONS FOR COMPLETING THE QUESTIONNAIRE:

I. For the quality of the research, it is important that you answer all questions.

II. Only one answer is possible per question. If you are in doubt between several answers, please try to mark the answer that best matches your actual situation.

III. There are no wrong answers to the questions in this survey. However, please try to be honest when answering the questions. After all, every honest answer is a good answer!

1. What is your gender?

Male

Female

2. What is your year of birth?

…….…….

3. How many times have you already taken the corporation tax exam?

I will take the corporation tax exam for the first time in the coming exam period.

I have taken the corporation tax exam once in the past.

I have taken the corporation tax exam twice in the past.

I have taken the corporation tax exam more than twice in the past.

4. What percentage of the corporation tax exercises classes did you attend this semester?

0 – 19%

20 – 39%

40 – 59%

60 – 79%

80 – 100%

I never go to the exercises class.

5. What percentage of the corporation tax theory classes did you attend this semester?

0 – 19%

20 – 39%

40 – 59%

60 – 79%

80 – 100%

I never go to the theory class.

6. How difficult is the content of the course corporation tax for you? Circle the most appropriate answer.

Easy 1 2 3 4 5 6 7 8 9 10 Difficult

7. How much time did you spend on the course corporation tax per week on average (excluding the classes you attended)?

Less than 1 hour per week

Between 1 and 2 hours per week

More than 6 hours per week

8. Which scoring method do you prefer for multiple-choice exams? Circle the most appropriate answer.

Correction for guessing(*) 1 2 3 4 5 6 7 8 9 10 Standard setting(**)

(*) When correction for guessing is applied, you receive a positive score for each correct answer, but you also lose points for an incorrect answer or a question left unanswered (Universiteit Gent, 2016).

(**) When standard setting or a higher cut-off score is applied, you cannot lose points for answering a multiple-choice question incorrectly, but you do have to answer more than the traditional 50% of the questions correctly in order to pass (Universiteit Gent, 2016).

9. Indicate to what extent you agree with the following statements about standard setting as a scoring method.

(1 = Strongly disagree, 2 = Somewhat disagree, 3 = Neither agree nor disagree, 4 = Somewhat agree, 5 = Strongly agree)

A. I have already taken many exams in which standard setting was used as the marking method. 1 2 3 4 5

B. It deters me that, with standard setting, a larger number of questions must be answered correctly in order to pass. 1 2 3 4 5

C. I understand how scores are calculated on exams that use standard setting as the marking method. 1 2 3 4 5

10. Indicate to what extent the following statements apply to you. Think of the corporation tax course when answering!

Response scale: 1 = Never or rarely applicable; 2 = Sometimes applicable; 3 = Applicable half of the time; 4 = Often applicable; 5 = Always applicable

A. My aim is to pass the course while doing as little work as possible. 1 2 3 4 5

B. I am only satisfied once I have studied a chapter enough to form my own conclusions. 1 2 3 4 5

C. I only study seriously my slides or what was covered in class. 1 2 3 4 5

D. I find that studying topics in depth is not useful. It is a waste of time, because you only need a 10 to pass. 1 2 3 4 5

E. I find that I can pass most exams by memorising key sections rather than trying to understand them. 1 2 3 4 5

F. I test myself on important topics in a course until I understand them completely. 1 2 3 4 5

G. Studying gives me a feeling of personal satisfaction. 1 2 3 4 5

H. I feel that virtually any topic can be highly interesting once I get into it. 1 2 3 4 5

I. I find most new topics interesting and spend extra time on them to gain more insight into them. 1 2 3 4 5

J. When I do not find my course very interesting, I keep my studying to the minimum. 1 2 3 4 5

K. I simply learn difficult parts of the course material by rote and repeat them until I know everything by heart, even if I do not fully understand it. 1 2 3 4 5

L. I find that studying can be as interesting as reading a good book or watching a good film. 1 2 3 4 5

M. I restrict my study to what is specifically set, because I think extra things (such as looking up additional information) are unnecessary. 1 2 3 4 5

N. I work hard at my studies because I find them interesting. 1 2 3 4 5

O. I spend a lot of my free time finding out more about interesting topics that were covered in the various classes. 1 2 3 4 5

P. I believe that professors should not expect students to spend much time studying topics that everyone knows will not be examined. 1 2 3 4 5

Q. I go to most exercise classes with specific questions that I want answered. 1 2 3 4 5

R. I find it important to go through the course material in the textbook thoroughly before going to the exercise class. 1 2 3 4 5

S. I see no point in studying topics that will not be asked on the exam anyway. 1 2 3 4 5

T. Memorising the solutions to the exercises is probably the best way for me to pass the exam. 1 2 3 4 5

11. Indicate to what extent you agree with the following statements.

1. I chose this field of study because…

Response scale: 1 = Strongly disagree; 2 = Somewhat disagree; 3 = Neither agree nor disagree; 4 = Somewhat agree; 5 = Strongly agree

A. I would otherwise have regretted not doing so. 1 2 3 4 5

B. Other people (parents, friends, teachers, …) forced me to do so. 1 2 3 4 5

C. This was a personally important choice for me. 1 2 3 4 5

D. This field of study interested me. 1 2 3 4 5

2. I pay close attention in class because…

Response scale: 1 = Strongly disagree; 2 = Somewhat disagree; 3 = Neither agree nor disagree; 4 = Somewhat agree; 5 = Strongly agree

E. I very much want to deepen my knowledge of the corporation tax course. 1 2 3 4 5

F. I will feel guilty if I do not. 1 2 3 4 5

G. I want to learn new things. 1 2 3 4 5

H. I am supposed to do so. 1 2 3 4 5

3. I (sometimes) prepared the exercises in advance because…

Response scale: 1 = Strongly disagree; 2 = Somewhat disagree; 3 = Neither agree nor disagree; 4 = Somewhat agree; 5 = Strongly agree

I. I would have felt guilty if I had not done so. 1 2 3 4 5

J. I found it fascinating to prepare the exercises. 1 2 3 4 5

K. Other people (parents, friends, lecturers, …) expected this of me. 1 2 3 4 5

L. I found it important to prepare these exercises. 1 2 3 4 5

4. I do my utmost for the corporation tax course because…

Response scale: 1 = Strongly disagree; 2 = Somewhat disagree; 3 = Neither agree nor disagree; 4 = Somewhat agree; 5 = Strongly agree

M. Other people (family, friends, …) expect it of me. 1 2 3 4 5

N. I want to give others the impression that I am a good student. 1 2 3 4 5

O. I find corporation tax interesting. 1 2 3 4 5

P. My parents would otherwise be disappointed in me. 1 2 3 4 5

Q. I would otherwise feel bad if I did not achieve the desired score. 1 2 3 4 5

R. I want to achieve high grades on the exam. 1 2 3 4 5

S. I can be proud of myself. 1 2 3 4 5

T. I am supposed to do so. 1 2 3 4 5

Finally, please fill in your student number (stamnummer) below:

If you do not have your student card with you and do not know your student number, please fill in your name.

First name: ………………………………………..………………

Last name:……………………………………………………………..

We emphasise that the data are strictly confidential and that names will be converted into numbers, so that the data can be processed anonymously.

Thank you for participating in this study!

Appendix 2: Factor loadings and Cronbach’s alpha familiarity

Item | Cronbach’s alpha | Factor loading

Familiarity with retrospective correcting for guessing | 0.47 |

I have already taken many exams that were corrected retrospectively for guessing. | | 0.82

The fact that a larger number of questions has to be answered correctly in case of retrospective correcting for guessing scares me. | | 0.05

I understand how scores are calculated on exams that are corrected retrospectively for guessing. | | 0.82
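As an illustration of how a Cronbach’s alpha such as the ones reported in these appendices can be obtained, the following is a minimal Python sketch. It assumes the standard formula alpha = k / (k - 1) × (1 - sum of item variances / variance of the total scores), with k the number of items. The function name `cronbach_alpha` and the response data are hypothetical; they are not taken from the study and do not reproduce its results.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondent rows.

    Each row holds one respondent's scores on the same k items
    (for example, 1-5 Likert answers).
    """
    k = len(responses[0])
    if k < 2:
        raise ValueError("at least two items are required")

    # Transpose the rows so that each tuple holds all scores for one item.
    items = list(zip(*responses))
    item_variance_sum = sum(variance(scores) for scores in items)

    # Sample variance of each respondent's total score across the k items.
    total_variance = variance([sum(row) for row in responses])
    if total_variance == 0:
        raise ValueError("total scores have zero variance")

    return k / (k - 1) * (1 - item_variance_sum / total_variance)


# Hypothetical answers of five respondents to three 5-point items
# (illustrative values only, not data from the survey).
hypothetical_responses = [
    [2, 3, 2],
    [4, 4, 5],
    [1, 2, 1],
    [5, 4, 4],
    [3, 3, 2],
]
print(round(cronbach_alpha(hypothetical_responses), 2))
```

The sketch uses sample variances (n - 1 in the denominator) for both the items and the total scores; statistical packages may differ in such details, so small deviations from reported values are possible.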

Appendix 3: Factor loadings and Cronbach’s alphas R-SPQ-2F

Item | Cronbach’s alpha | Factor loading

Deep approach | 0.65 |

2. I find that I have to do enough work on a chapter so that I can form my own conclusion before I am satisfied. | | 0.45

6. I test myself on important topics in a course until I understand them completely. | | 0.47

7. I find that at times studying gives me a feeling of deep personal satisfaction. | | 0.64

8. I feel that virtually any topic can be highly interesting once I get into it. | | 0.49

9. I find most new topics interesting and often spend extra time trying to obtain more insights into them. | | 0.60

12. I find that studying academic topics can at times be as exciting as a good novel or movie. | | 0.41

14. I work hard at my studies because I find the material interesting. | | 0.57

15. I spend a lot of my free time finding out more about interesting topics which have been discussed in different classes. | | 0.40

17. I come to most exercise classes with questions in mind that I want answering. | | 0.33

18. I make a point of studying the course material in the textbook thoroughly before going to the exercise classes. | | 0.49

Surface approach | 0.63 |

1. My aim is to pass the course while doing as little work as possible. | | 0.25

3. I only study seriously what’s given out in class or in the course outlines. | | 0.45

4. I find it not helpful to study topics in depth. It confuses and wastes time, when all you need is a 10 to pass the course. | | 0.56

5. I find I can get by in most examinations by memorising key sections rather than trying to understand them. | | 0.74

10. I do not find my course very interesting, so I keep my work to the minimum. | | 0.34

11. I learn some things by rote, going over and over them until I know them by heart even if I do not understand them. | | 0.56

13. I generally restrict my study to what is specifically set as I think it is unnecessary to do anything extra. | | 0.45

16. I believe that lecturers should not expect students to spend significant amounts of time studying material everyone knows won’t be examined. | | 0.42

19. I see no point in learning material which is not likely to be in the examination. | | 0.40

20. I find the best way to pass examinations is to try to remember the solution of the exercises. | | 0.49

Appendix 4: Factor loadings and Cronbach’s alphas RAI

Item | Cronbach’s alpha | Factor loading: Component 1 | Component 2 | Component 3 | Component 4

Intrinsic | 0.66 | | | |

4. I have chosen this field of study because it interested me. | | 0.115 | 0.146 | 0.029 | 0.805

5. I pay attention in class because I want to deepen my knowledge in the subject of corporation tax. | | 0.040 | 0.886 | 0.097 | -0.070

10. I (sometimes) prepared the exercises in advance, because it fascinated me to prepare them. | | -0.137 | 0.385 | 0.704 | -0.130

15. I do the best I can for the course of corporation tax because I find it interesting. | | 0.090 | 0.876 | 0.071 | 0.094

Identified | 0.38 | | | |

3. I have chosen this field of study because this was an important choice for me. | | 0.130 | 0.059 | -0.078 | 0.755

7. I pay attention in class because I want to learn new things. | | -0.060 | 0.614 | 0.130 | 0.357

12. I (sometimes) prepared the exercises in advance, because I found it important to prepare these exercises. | | -0.031 | 0.161 | 0.817 | 0.090

18. I do the best I can for the course of corporation tax because I want to achieve high grades on the exam. | | 0.420 | 0.296 | 0.168 | 0.278

Introjected | 0.54 | | | |

1. I have chosen this field of study because I would regret it if I had not done it. | | 0.265 | 0.171 | -0.236 | 0.217

6. I pay attention in class because I would feel guilty if I did not. | | 0.465 | -0.282 | 0.305 | 0.217

9. I (sometimes) prepared the exercises in advance, because I would feel guilty if I did not. | | 0.225 | -0.006 | 0.753 | 0.004

14. I do the best I can for the course of corporation tax because I want to give others the impression that I am a good student. | | 0.412 | 0.441 | 0.332 | -0.066

17. I do the best I can for the course of corporation tax because I would feel bad if I did not achieve the desired mark. | | 0.634 | 0.128 | 0.020 | 0.224

19. I do the best I can for the course of corporation tax because I can be proud of myself. | | 0.484 | 0.277 | 0.015 | 0.273

External | 0.69 | | | |

2. I have chosen this field of study because other people (parents, friends, teachers, …) forced me to do so. | | 0.244 | 0.077 | -0.076 | -0.755

8. I pay attention in class because I am supposed to do so. | | 0.611 | -0.313 | 0.043 | 0.015

11. I (sometimes) prepared the exercises in advance, because other people (parents, friends, teachers, …) expect this from me. | | 0.168 | 0.015 | 0.667 | 0.050

13. I do the best I can for the course of corporation tax because other people (family, friends, …) expect this from me. | | 0.635 | 0.104 | 0.363 | -0.268

16. I do the best I can for the course of corporation tax because otherwise I would disappoint my parents. | | 0.650 | 0.215 | 0.112 | -0.211

20. I do the best I can for the course of corporation tax because I’m supposed to do so. | | 0.754 | -0.115 | -0.082 | -0.077
