A study on teachers’ item weighting and the
Rasch model: Summative test items’ difficulty
logits calibration using the Rasch model
Author Name: Sanghoon Mun, University of Bath
British Council ELT Master’s Dissertation Awards: Special
commendation
Table of Contents
Chapter 1. Introduction
1.1 Introduction
1.2 Purpose of Study
1.3 Overview
Chapter 2. Literature Review
2.1 Introduction to Chapter Two
2.2 Test Item Weighting Methods
2.2.1 Equal Weighting Method
2.2.2 Differential Weighting Method: Weighting by Difficulty
2.3 Item Analysis
2.3.1 Classical Test Theory (True-Score Theory)
2.3.2 Item Response Theory (IRT)
2.3.2.1 The Rasch model: Fit statistics
2.4 Conclusion to Chapter 2
Chapter 3. Research Methodology
3.1 Introduction to Chapter 3
3.2 Context
3.3 Research Strategy
3.4 Research Design
3.4.1 Overview of the Research Procedure
3.4.2 Data Collection and Sampling
3.4.3 Data Analysis
3.4.3.1 Way of Categorising Language Elements
3.4.4 Validity and Reliability
3.5 Ethical Considerations
3.6 Conclusion to Chapter 3
Chapter 4. Data Analysis
4.1 Introduction to Chapter 4
4.2 Selecting Data
4.3 Test Item Classification
4.4 Data Refinement
4.5 Item Difficulty Calibration
4.5.1 X High School
4.5.2 Y High School
4.5.3 Z High School
4.6 Language Category Classification
4.7 Summary
Chapter 5. Discussion
5.1 Introduction to Chapter 5
5.2 Discussion and Implications
5.3 Recommendations for Future Research
5.4 Conclusion
References
Appendix
Chapter 1. Introduction
1.1 Introduction
As language tests have become a part of everyday life, their importance seems to grow in a number of domains. From this perspective, McNamara (2000) points out that language tests have played a powerful role as gateways at important transitional moments in education, in employment, and in immigration. Taylor (2005) also argues that, depending on language test results, test-takers' life chances or careers can be influenced in a given domain. Likewise, in Korean society, a high language test score is considered a way of gaining "access to the world of elite" (Hu & McKay, 2012, p. 352). The belief that a language test and its result serve as "crucial milestones in the journey to success" (Brown, 1994, p. 401) can be identified in the high school language test, because in most 'competitive' universities in South Korea, the English test results which students have obtained from school summative tests are usually used as evidence for predicting their academic latent trait (Weir, 2005) and directly influence students' university admission to a great extent.
In designing such a high-stakes test, language teachers usually write multiple-choice questions with the belief that those who choose the preferable answer over the others have more knowledge or ability than those who do not (Cliff, 1989). Even though multiple-choice testing is frequently the target of disparaging comments in the everyday conversations of students and teachers, it is generally said that multiple-choice items are economically practical and provide relatively objective (or valid) scoring in the testing field (Diekhoff, 1983; Pae, 2012) by preventing the possibility of divergent answers (Statman, 1998). I accept that multiple-choice items are economically practical and efficient in terms of the scoring process. However, the scoring issue needs to be carefully investigated in relation to the weighting process. Otherwise, I believe, it may become difficult to demonstrate the link between test users' interpretation of the score and the decisions that they make on the basis of the score (Fulcher & Davidson, 2009).
After designing multiple-choice items, most school teachers resort to differential item weighting with the assumption that equally weighted test items cannot reflect the importance of the test content (Feldt, 2004) and that only the very able will be able to identify the correct answer to the most difficult items (West, 1924). In addition, they aim to minimise the number of students who gain the same scores (this topic will be discussed in detail in chapter 3) so that students can be clearly differentiated in terms of their capacity. As a way of allocating differential weights to the multiple-choice test items, they usually rely on their subjective judgement of the difficulty levels of the items (this is called an 'a priori' weighting method). Based on the subjectively assigned weightings of the items, the scoring process takes place. At this point, I want to pose a question about whether weighting and scoring are discrete processes. If scoring were a simple process of adding the weighted points regardless of the weighting process, the particular values of the weights would be relatively unimportant (Koopman, 1988). However, that is not the case. Weighting and scoring are not discrete processes. Rather, the weighting process influences the scoring process to a certain degree, because scoring cannot take place before the items are weighted. In this sense, I believe that it is difficult to secure objectivity in scoring multiple-choice items.
In order for the test scores to be interpreted as critical indicators for making decisions, test
designers need to identify and eliminate any potential sources of errors that can decrease both the
reliability of scores and the validity of their interpretations (Bachman, 1990). In a similar vein, McNamara and Ryan (2011) claim that "a person's chances of success on a test should not be influenced by irrelevant factors" (p. 163). Among the factors which influence test reliability and validity is the weighting process (Guilford, 1954; Wang & Stanley, 1970), which is the topic of this study. Depending on how test designers weight the items, test reliability and validity can be enhanced or marred. In spite of the importance of the weighting process in language tests, little, if any, attention has been paid to the weighting issue in my context. Hence, the current research will investigate, using the Rasch model, how consistently language teachers weight test items.
1.2 Purpose of Study
When deciding the difficulty levels of test items and allocating corresponding marks to them, teachers in South Korea tend to rely on their knowledge and experience. Based on their prior experience and expertise, they weight the items and allocate different marks to them before test takers take the test. I assume that such item weightings may not match the actual difficulty levels of the items. Rather, there may be a discrepancy between teachers' prior item weighting and students' actual responses to the items. In relation to this assumption, two research questions will be investigated in this study.
• Is there a gap between teachers' weighting and students' actual item responses? If so, is the gap large or small?
• In terms of language categories, does what teachers believe to be difficult correspond to what students find to be difficult?
1.3 Overview
This dissertation is composed of five chapters, including this one. In chapter 2, the literature on item weighting methods and on two ways of analysing test items will be reviewed and discussed. In chapter 3, along with an explanation of the research context, the overall blueprint of the current research will be described. In chapter 4, the data collected from three different high schools will be analysed using the Rasch model and the findings will be expounded upon. In the final chapter, the findings will be discussed and recommendations for further research will be suggested.
Chapter 2. Literature Review
2.1 Introduction to Chapter Two
Wiliam (2011) argues that "it is only through assessment that we can find out whether a particular sequence of instructional activities has resulted in the intended learning outcomes" (p. 3). In addition, educational practitioners and theorists have widely noted the effects of assessment on learning and teaching (Zhan & Andrews, 2014; Brown, 1997). From this perspective, in a formal
education setting, most learning and teaching activities are accompanied by assessment, and tests (e.g. performance tests, summative tests) are frequently set up and implemented (Lloyd-Jones, 1992) as instruments for realising the assessment process in relation to teaching programmes and materials (Woodford, 1980). That is, "assessment is a superordinate term for all forms of
assessment, and testing is a term for one particular form of assessment" (Leung & Lewkowicz, 2006, p.
212). Given the relationship between assessment and tests, I believe that the quality of
assessment may be dependent on the quality of the test to a great extent.
When designing a test, test designers (mainly language teachers in my context) usually
encounter the moment in which their professional judgement needs to intervene (Allal, 2013). First
of all, they have to make a decision about the domain being tested. Subsequently, they choose
"proper" test methods and reflect both the domain and methods into the test. Along with those
processes, test designers also labour in speculating on the way of item weighting which has an
impact on the internal criteria such as reliability (Wang & Stanley, 1970). In the first section of this
chapter, the item weighting methods (an equal weighting method and a differential weighting
method) will be introduced.
After students take the test, teachers need to analyse and evaluate what the test scores mean on the basis of the test results (Taylor, 2013). In this process, the results are summarised into numerical values such as the mean, standard deviation, and frequency distribution, which contain much information about the test. Along with the overall analysis of the test, teachers also need to examine the individual test items in order to find out how difficult each item is and how high its discrimination power is. In this sense, after introducing the weighting methods, ways of analysing test scores and test items (classical test theory and item response theory) will be explained.
2.2 Test Item Weighting Methods
In general, a total score may be obtained by merely adding the marks of correctly answered items. That is, if a student responds to a certain item correctly, the score goes up by the item's mark; otherwise, the score does not change at all. However, students' scores are not always summated in this identical way. Depending on the method of weighting, the scores that students can obtain differ. In this section, two weighting methods will be explained: the equal weighting method and the differential weighting method.
2.2.1 Equal Weighting Method
In an equal weighting method, all test items have equal weight (Stalnaker, 1938): a test designer allocates the same mark to every item with the belief that each item is equally related to the underlying trait being measured (Xu & Stone, 2012). To put it simply, if the maximum possible score which a test taker can gain is 100 and a test designer creates 20 test items, each item can be assigned 5 points. This method usually makes the scoring process easy and helps test developers interpret the test result conveniently. However, in practice, "items are not equally correlated" within a test (Guilford, 1954, p. 443) because their standard deviations (SDs) differ. That is, even though this weighting method ostensibly distributes the same mark to each item, internally each item carries a different weighting, because the SD of an item determines its weight (Wang & Stanley, 1970). The SD indicates how far scores are dispersed around the mean score: the higher the SD, the further scores lie from the mean. In relation to the SD and item weightings, Kim et al. (2010) mention that if the SD of an item is high, the effective weighting of the item is high, and if the SD is low, the effective weighting is low. Suppose that 10 students answered 3 items with the results shown in Table 2.1.
        PERSON
ITEM    A  B  C  D  E  F  G  H  I  J    SD     Variance   Average
1       2  2  2  2  0  0  0  0  0  0    0.98   0.96       0.80
2       2  2  2  2  2  2  2  0  0  0    0.92   0.84       1.40
3       2  2  2  2  2  2  2  2  0  0    0.80   0.64       1.60
Table 2.1 10 students' test scores on three items
Table 2.1 shows that an identical mark, 2, has been allocated to each item. Some students responded to a given item correctly, while others did not. In addition, Table 2.1 shows that the SDs and variances of the three items differ. In this example, the fewer the correct responses, the higher the SD. Since a high SD implies a high effective weighting (Kim et al., 2010), the effective weighting of item 1 is higher than that of items 2 and 3. In this sense, even though the same mark is allocated to all three test items, in practice they carry different effective weights.
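For illustration, the figures in Table 2.1 can be reproduced with a few lines of Python. This is a minimal sketch, not part of the original study, and the array name item_scores is illustrative.

import numpy as np

# Marks of 10 students (A-J) on the three items in Table 2.1; each item carries 2 marks.
item_scores = np.array([
    [2, 2, 2, 2, 0, 0, 0, 0, 0, 0],  # item 1: 4 correct responses
    [2, 2, 2, 2, 2, 2, 2, 0, 0, 0],  # item 2: 7 correct responses
    [2, 2, 2, 2, 2, 2, 2, 2, 0, 0],  # item 3: 8 correct responses
])

for i, scores in enumerate(item_scores, start=1):
    # Population variance/SD (ddof=0), as in Table 2.1; the SD serves as a
    # proxy for the item's effective weighting.
    print(f"Item {i}: average={scores.mean():.2f}, "
          f"variance={scores.var():.2f}, SD={scores.std():.2f}")
# Item 1 shows the highest SD (0.98) and hence the highest effective weighting,
# even though all three items carry the same mark of 2.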
In a real test context (especially in a large-scale test), it may be extremely difficult for different items to have the same SD and variance, because students' abilities vary. In addition, what is difficult for one student can be easy for another, or vice versa. From this perspective, Wang and Stanley (1970) claim that even though the marks of the items within a test are equally assigned, the effective weightings of the items adjust naturally. This is called natural (or random) weighting.
Even if an equal weighting method can make the interpretation of results convenient and naturally adjust the effective weighting of the test items, Stalnaker (1938) and Kim et al. (2010) argue that since this weighting method distributes the same mark without considering the critical factors (e.g., item complexity, importance of content, time required) which can influence the test result, test stability and reliability are likely to be compromised. In addition, Shi and Chang (2012) contend that assigning different weights rather than an equal mark to the test items can lead to more accurate estimates of test-takers' latent traits. In spite of that, Wang and Stanley (1970) argue that "although differential weighting theoretically promises to provide substantial gains in predictive or construct validity, in practice these gains are often so slight that they do not seem to justify the labour involved in deriving the weights and scoring with them" (p. 664).
2.2.2 Differential Weighting Method: Weighting by Difficulty
While an equal weighting method distributes the same mark to the test items, a differential
weighting method assigns different marks to each item on the basis of the particular criterion
adopted. Among criteria (e.g., item length, item validity, etc.) for assigning different weights to the
items is item difficulty, which is the topic in this section.
When test developers or subject-matter experts assign different weights to sections or items of a test, they consider a number of factors such as time, the maximum score, and the number of questions. Ultimately, however, based on an intuitive feel for items' difficulty or "worth" (Wang & Stanley, 1970), they subjectively judge that one question is more important than others (Stalnaker, 1938; Wang & Stanley, 1970) and assign different weights to the items accordingly. This is called an 'a priori' (subjective) weighting method, and it is the method generally implemented in South Korea (Kim & Roh, 1999, cited in Kim et al., 2010).
However, it is extremely difficult to estimate how difficult the test item will be before the
test takers take the test (Baker, 1985; Kim et al., 2010). In some cases, what is considered to be
difficult by test designers may be easy for the test takers or vice versa. Because of the gap, there
may be a possibility that the weight is not equivalent to the proportion of those who fail to answer
the item correctly (Wang & Stanley, 1970). Thus, Guilford (1954) argues that since weighting on an 'a
priori' basis heavily relies on personal bias (subjectivity), this method can compromise the reliability
and validity of the test unless the criteria for the item weighting are consistently and strictly laid
down.
In addition, Gulliksen (1950) and Stalnaker (1938) contend that a differential weighting
method may not be advantageous over an equal weighting method. Gulliksen (1950) points out that in
a wide range of cases, an equal weighting method produces statistically similar results to a differential weighting method. Stalnaker (1938) indicates that the balancing of weights becomes highly complex, so that if more than two teachers are involved, a great amount of time may be spent in determining the appropriate weights.
In spite of such limitations of a differential weighting method, Wang and Stanley (1970) assess this method as follows: "an 'a priori' weighting is the most appropriate when it is actually used to define the nature of the composite measure" (p. 668). In addition, since test designers hold the entrenched conviction that knowing a very difficult item is evidence of considerably more ability or achievement than knowing a simple one, they have continued to distribute different weights to test items and to redefine the differential weighting method (Wang & Stanley, 1970).
In this section, two weighting methods, the equal weighting method and the differential weighting method, were explained. In the following section, two ways of analysing test items will be discussed, along with a specification of the relevant terms.
2.3 Item Analysis
Bachman (1991) argues that the broad purposes of language tests are to predict test takers'
authentic capacity in the future and fundamentally to make decisions about the test takers' ability in
non-test contexts (e.g., employment, placement, grading, etc.). In order for the test to fulfil those
two primary functions properly, careful attention needs to be paid to the principles of test construction when designing a "good" test (Kaplan & Saccuzzo, 2005). At the same time, as the methodology of language testing has advanced, the tools available for test analysis have advanced in step (Bachman, 1989), making it possible to examine the contribution that an individual test item makes to the whole test (Hughes, 2003). In this section, two test analysis theories will be explained: classical test theory and item response theory.
2.3.1 Classical Test Theory (True-Score Theory)
An individual test taker's raw score is the sum of the marks of all correctly answered test items. However, test takers can make lucky guesses or, by mistake, mark questions incorrectly. From this perspective, in the early 1900s, a theoretical framework was established by the British psychologist Charles Spearman on the basis of the simple notion that a test score is the sum of a "true" score plus random "error" (Domino & Domino, 2006). For example, suppose that out of 10 items a test taker knows only seven answers but luckily answers nine items correctly. In this case, the true score is seven, the random error is two, and the observed score is nine. Conversely, if the test taker mismarks two known items and obtains a score of five, the true score is still seven. The formula is below.
X (Observed Score) = T (True Score) + E (Error) (2.1)
Classical test theory assumes that the T (true score) for an individual will not change with
repeated applications of the same test (Kaplan & Saccuzzo, 2005). However, because error is
random and varies in every test administration, each test taker's observed score always differs from the
person's true ability or characteristic (Sharkness & DeAngelo, 2011). That is, the amounts of E (error)
produce inconsistent X (observed score) and consequently increase item variability (DeVellis, 2006).
When the aim of the test is to "yield a score that is a relatively close reflection of the true score"
(DeVellis, 2006, p. S51), the smaller the dispersion (standard deviation of errors or variance) between
T and X is, the more reliable and accurate the test is.
Along with an understanding of formula 2.1, the calculation of the facility (item difficulty) value matters in this theory, because the facility value provides useful information through which we can simply compare the difficulty of items within a test. Dividing the number of correct responses by the total number of responses gives the facility value. For example, out of 100 test takers having taken the same test, if 25 test takers respond to item X correctly and 60 test takers respond to item Y correctly, the facility values of items X and Y are 0.25 and 0.60 respectively. In this case, item X is said to be more difficult than item Y. The facility value is also closely connected with the discrimination index (DI), because if the facility value of a certain item is too high or too low, the item may not be considered suitable for discriminating between test takers' overall abilities. The DI is an indicator of how well an item discriminates between strong test takers (STTs) and weak test takers (WTTs) (Hughes, 2003). The maximum discrimination index is 1, indicating perfect discrimination power, whereas 0 (zero) indicates that an item does not discriminate at all.
In computing the DI, a question arises about how to define high scorers versus low scorers. In this regard, Kelley (1939, cited in Domino & Domino, 2006) suggests a specific percentage which makes it possible to distinguish STTs from WTTs: 27%. That is, out of 100 test takers, the 27 highest scorers form the STT group and the 27 lowest scorers form the WTT group. The formula for calculating the DI is below.
DI = (Correct Responses of STTs − Correct Responses of WTTs) / The number of cases (either the number of STTs or of WTTs)   (2.2)

For example, out of 100 test takers, if 23 STTs and 6 WTTs respond to one specific item correctly, the DI of the item becomes 0.63.

0.63 (DI) = (23 (Correct Responses of STTs) − 6 (Correct Responses of WTTs)) / 27 (The number of cases)
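The two indices can be illustrated with a short sketch reproducing the worked examples above; the function names are illustrative, not taken from any testing package.

def facility_value(correct_responses: int, total_responses: int) -> float:
    # Proportion of test takers answering the item correctly.
    return correct_responses / total_responses

def discrimination_index(correct_stt: int, correct_wtt: int, group_size: int) -> float:
    # Formula 2.2: difference in correct responses between the STT and WTT
    # groups, divided by the size of one group (here 27% of 100 = 27).
    return (correct_stt - correct_wtt) / group_size

print(facility_value(25, 100))                    # 0.25 -> a fairly difficult item
print(round(discrimination_index(23, 6, 27), 2))  # 0.63 -> a "very good" item (see Table 2.2 below)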
As a means of interpreting the numerical values of the DI and conveying this interpretation to a non-technical audience, the verbal labels used to describe an item's discrimination can be related to ranges of values as follows:

Range of values   Interpretation
.40 and above     Very good items
.30 to .39        Reasonably good items but possibly subject to improvement
.20 to .29        Marginal items, usually needing and being subject to improvement
.19 and below     Poor items, to be rejected or improved by revision
Table 2.2 Levels of discrimination (Popham, 2000, cited in Green, 2013, p. 29)
In classical test theory, a test taker's total score is defined by the total number of items which are correctly answered, and the score (or mean score) is interpreted in terms of relative ability within the group of students taking the test (Domino & Domino, 2006; Lawson, 2006). For example, out of 100 test takers, if 9 test takers obtain higher scores than test taker Y, the percentile rank of test taker Y is 90%; in other words, 90 test takers obtain lower scores than test taker Y does. However, such information does not account for the relationship between the person and the item. That is, classical test theory fails to assess how a test-taker responds to a certain item, because test-takers' trait levels and item difficulty levels are not intrinsically connected under classical test theory (Furr & Bacharach, 2008). Hence, in order to "formalize the relationship between the latent trait of a person and the item difficulty levels using a mathematical model" (Wilson, 2013, p. 3770), a new theory called item response theory was developed.
2.3.2 Item Response Theory (IRT)
Under classical test theory, test takers' ability levels are measured by simply adding responses across items and then converting the sum into a standard score (Embretson, 1999). In addition, item difficulties are calculated simply by dividing the number of correct responses by the total number of responses. Under classical test theory, therefore, the information about a person's latent level and an item's properties cannot yield the probability that a given test taker will respond to a given item correctly. That is, classical test theory does not account for the relationship between the person and the item.
In the IRT model, on the other hand, test-takers' ability levels and item difficulties are independent variables that are estimated separately (Embretson, 1999). In addition, both latent levels and item difficulties are estimated on a common unit of measurement called the 'logit scale' (Bond & Fox, 2001). Since both use the same scale, it becomes possible to quantify the probability that a test-taker will pass or fail a particular item by comparing the test-taker's ability logit with the item's difficulty logit (Henning, 1987). For example, if a test-taker's latent trait is +1.0, the probability of the test-taker passing an item whose difficulty logit is smaller than +1.0 is above 50 per cent; for an item whose difficulty logit is larger than +1.0, the probability falls below 50 per cent (this will be discussed in detail below). In this sense, Crocker and Algina (1986) accordingly define item analysis as the computation and examination of any statistical property of test takers' responses to an individual test item.
Figure 2.1 The relationship between ability and item response (Crocker & Algina, 1986, p. 341)
Intuitively, if a person has a low trait level, the likelihood of the person passing most items on a very hard test is low. On the other hand, if a person has a high trait level, it is more likely that the person will pass most items on a very hard test. The relationship between latent trait and item response is shown in Figure 2.1. According to the graph, as the logit of the latent trait (ability) moves away from 0 (zero), the probability of passing an item rises or falls accordingly. To be specific, as the logit of the latent trait rises above zero, the probability of the person selecting the correct answer increases; as it falls below zero, the probability decreases. Such a relationship between a person's ability and item difficulty is also well schematised in the item-person map.
Figure 2.2 Item-person map
As illustrated in Figure 2.2, on the item-person map the test items are placed on one side and the test-takers on the other (Murphy & Davidshofer, 1991; Mellenbergh, 1996). According to Figure 2.2, the person ability estimates range from approximately +2.4 to -2.4 logits. The test-taker with the highest ability logit is SAM and the one with the lowest is BEN. Based on their places on the map, it can be presumed that SAM is considerably more able than BEN. Likewise, the item estimates range from approximately +2.8 (item 3) to -2.4 (item 8) logits. Item 3, at the very top of the map, is measured as the toughest item, while item 8 is the easiest one. The locations of items 3 and 8 indicate that most test-takers would be very likely to answer item 3 incorrectly in this test, whereas very few test-takers would answer item 8 incorrectly.
On the basis of the locations of the items and persons on the map, it also becomes possible to roughly estimate the probability of a test-taker passing each particular item by comparing the latent trait logit with the item difficulty logit (Kaplan & Saccuzzo, 2005). That is, the relationship between a person's ability and an item's properties can be accounted for through this map. For example, the probability that RIO passes item 7 is high, because RIO is situated higher than item 7. However, although RIO is situated higher than item 6, the gap between RIO and item 6 is relatively small; it would therefore not be surprising if RIO responded to item 6 incorrectly. In addition, RIO is located at the same place as item 5, that is, the logit of the latent trait is identical to that of the item difficulty. In this case, the probabilities of finding the correct answer and the wrong answer are the same.
To be specific, it is possible to predict the approximate probability of a test taker's success on a given item (Green, 2013) if the logits of the person's ability and the item's difficulty are provided. This is done by using a conversion table such as the one shown in Table 2.3.
Positive (above zero)                              Negative (below zero)
Difference between      Probability of             Difference between      Probability of
a person's ability      answering the              a person's ability      answering the
and item difficulty     item correctly             and item difficulty     item correctly
 5.0                    99%                        -5.0                     1%
 4.6                    99%                        -4.6                     1%
 4.0                    98%                        -4.0                     2%
 3.0                    95%                        -3.0                     5%
 2.2                    90%                        -2.2                    10%
 2.0                    88%                        -2.0                    12%
 1.4                    80%                        -1.4                    20%
 1.1                    75%                        -1.1                    25%
 1.0                    73%                        -1.0                    27%
 0.8                    70%                        -0.8                    30%
 0.5                    62%                        -0.5                    38%
 0.4                    60%                        -0.4                    40%
 0.2                    55%                        -0.2                    45%
 0.1                    52%                        -0.1                    48%
 0.0                    50%                         0.0                    50%
Table 2.3 Conversion table (Green, 2013, p. 165)
For example, Figure 2.2 shows that the ability logit of TIM is 0 (zero) and the difficulty logit of item 7 is -1.0, so the logit difference between the person's ability and the item's difficulty is 1.0. According to Table 2.3, if the difference is 1.0 above zero, the test taker has a 73 per cent chance of answering the item correctly. On the other hand, if the difference between a person's ability and an item's difficulty is smaller than 0 (zero), the chance will be less than 50 per cent. From this perspective, after analysing test scores using the IRT model, if a 'weaker' examinee with a below-zero ability logit answers several 'very' difficult questions correctly, testers are usually advised to investigate the examinee carefully and find the reasons (Henning, 1987). In this case, the weaker person who responds to the several difficult questions correctly is flagged as misfit data (this topic will be dealt with later).
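Table 2.3 follows from the logistic form of the dichotomous Rasch model, in which the probability of a correct response depends only on the difference between the ability logit and the difficulty logit. A minimal sketch (the function name is illustrative):

import math

def p_correct(ability_logit: float, difficulty_logit: float) -> float:
    # Rasch model: P(correct) = exp(d) / (1 + exp(d)), where d = ability - difficulty.
    d = ability_logit - difficulty_logit
    return 1 / (1 + math.exp(-d))

print(round(p_correct(0.0, -1.0), 2))  # 0.73: TIM (ability 0) on item 7 (difficulty -1.0)
print(round(p_correct(0.0, 2.2), 2))   # 0.10: an item 2.2 logits above the person's ability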
In IRT, there are three logistic models: the one-parameter model, the two-parameter model, and the three-parameter model. The one-parameter model (also called the Rasch model) uses only item difficulty in order to measure test takers' ability, whereas the two-parameter model adds an item discrimination parameter and the three-parameter model adds a guessing (pseudo-chance) parameter on top of those two. As the number of parameters increases, the process of computation becomes more complex (Henning, 1987). Studies using these models are numerous (Skehan, 1989; Henning, 1987; Domino & Domino, 2006). However, in educational measurement, the one-parameter model tends to be preferred by language testers over the other two models for its simplicity of computation, easy interpretation, and the small sample size required, even though there has been much controversy over the choice of model (Henning, 1987).
2.3.2.1 The Rasch model: Fit statistics
In the Rasch model, fit (infit and outfit) statistics are used to detect discrepancies between the empirical data and the Rasch model's prescriptions (Bond & Fox, 2001). By statistically indicating the degree of match between observed and expected performance, fit statistics report how well the empirical data accord with the Rasch model (Linacre, 2002). Routinely, fit statistics are reported in both an unstandardized and a standardized form: the unstandardized form is the mean square (MNSQ) and the standardized form is the standardized t (ZSTD).
A fit MNSQ value provides information about "how confident we can be in the measures
(logits) associated with the persons and the items" (Green, 2013, p. 167). Depending on the MNSQ
value, it can be judged whether items or persons fit the Rasch model. The acceptable MNSQ value for a person or an item ranges from +0.5 to +1.5, which is considered productive for measurement (Green, 2013). On the other hand, all data whose MNSQ values are not between +0.5 and +1.5 are classified as misfit data, indicating that the data do not fit the Rasch model.
If the MNSQ value is less than +0.5, it means that a person or an item is performing in too predictable a way. For example, if a person with a certain ability responds to all easy questions correctly and to all difficult questions incorrectly, the MNSQ value may be lower than +0.5. "The MNSQ value of lower than +0.5 is considered to be 'less productive' for measurement" (Green, 2013, p. 169). On the other hand, if the MNSQ value is higher than +1.5, it means that persons or items are performing in an unpredictable way. For example, if an able person responds to an easy item incorrectly, the MNSQ value can be higher. Because of this unpredictability, "the MNSQ value of higher than +1.5 is considered 'unproductive' for measurement" (Green, 2013, p. 169). In the Rasch model, the 'unproductive' data (MNSQ > +1.5) are usually the focus of investigation rather than the 'less productive' data (MNSQ < +0.5).
"The infit and outfit statistics adopt slightly different techniques for assessing an item's fit in the
Rasch model" (Bond & Fox, 2001, p. 43). The infit MNSQ assigns relatively more weight to the
performances of persons which are closer to the item difficulty value (Ibid.). Thus, if a person
incorrectly answers the items particularly close to their ability level, the infit MNSQ value can be
affected. On the other hand, outfit MNSQ is more sensitive to the influence of outlying scores (lucky
guesses of low performers and careless mistakes of high performers). That is, outfit MNSQ is related to
how a person responds to the items that are very easy (item difficulty logit < -2.0) or very hard
(item difficulty logit > +2.0). Thus, if a very able person does not respond to a very easy item
correctly, the outfit MNSQ value can be affected.
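For illustration, the two mean squares can be computed from a response matrix once ability and difficulty logits are known. The sketch below is a simplified rendering of the standard formulas (outfit as the unweighted mean of squared standardized residuals, infit as the information-weighted mean); operational programs such as Winsteps refine these computations, so the figures are indicative only.

import numpy as np

def fit_mnsq(X, theta, b):
    """Simplified item infit/outfit mean squares for a 0/1 response matrix X
    (persons x items), person ability logits theta, item difficulty logits b."""
    p = 1 / (1 + np.exp(-(theta[:, None] - b[None, :])))  # expected scores
    w = p * (1 - p)                                       # model variance of each response
    r = X - p                                             # residuals (observed - expected)
    outfit = (r ** 2 / w).mean(axis=0)                    # sensitive to outlying responses
    infit = (r ** 2).sum(axis=0) / w.sum(axis=0)          # weighted towards on-target persons
    return infit, outfit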
It is usually easy to detect the reason for a high outfit MNSQ (> +1.5), while a high infit MNSQ (> +1.5) does not always offer a clear reason why the person or item responded in such a way (Green, 2013). In this sense, a high infit MNSQ is treated as a greater threat to the measurement system, whereas a high outfit MNSQ is considered less of a threat. In other words, "aberrant infit scores usually cause more concern than large outfit statistics" (Bond & Fox, 2001, p. 43). From this perspective, in the Rasch model more attention is routinely paid to infit values than to outfit values (Bond & Fox, 2001).
Another value which can distinguish fit data from misfit data is the ZSTD (also called standardized t or fit t). As an alternative measure of the degree of fit of an item or a person to the Rasch model, ZSTD values report the statistical probability of the MNSQ statistics occurring by chance when the data fit the Rasch model (Linacre, 2002). The acceptable ZSTD value ranges from -2.0 to +2.0. An infit or outfit ZSTD higher than +2.0 (underfit) or lower than -2.0 (overfit) means less compatibility with the Rasch model (Bond & Fox, 2001). In general, however, if items and persons have infit MNSQ values within the acceptable range between +0.5 and +1.5, the ZSTD statistics can be ignored (Linacre, 2002; Bond & Fox, 2001).
PERSONS    ITEM: 1  2  3  4  5  6  7  8  9  10    SUM
TOM              0  0  0  0  0  0  0  0  0  0
BEN              1  1  1  0  0  0  0  0  0  0      1
KIM              1  1  1  1  0  0  0  0  0  0      2
ANN              1  1  1  1  0  1  0  0  0  0      3
TIM              1  1  0  1  1  1  0  0  0  0      3
RIO              1  1  1  1  1  1  0  0  0  0      4
SUE              1  1  1  0  1  0  1  1  0  0      4
SAM              1  1  1  1  1  1  1  0  0  0      5
JUN              1  1  1  1  1  1  1  1  1  1
ROB              1  1  1  1  1  1  1  1  1  1
(* 0 = wrong response, 1 = right response; SUM as reported in Table 2.5, counting items 3-8 only)
Table 2.4 Scoring matrix for a 10-item vocabulary test (Henning, 1987, p. 119)

PERSON ENTRY   ABILITY MEASURE   SCORE   INFIT MNSQ   INFIT ZSTD   OUTFIT MNSQ   OUTFIT ZSTD
BEN            -2.16             1       0.49         -0.76        0.28          -0.27
KIM            -0.99             2       0.51         -1.19        0.39          -0.45
ANN            -0.04             3       0.62         -0.82        0.48          -0.66
TIM            -0.04             3       1.30          0.75        1.79           1.12
RIO             0.95             4       0.37         -1.31        0.29          -0.77
SUE             0.95             4       2.68          2.25        2.78           1.61
SAM             2.20             5       0.47         -0.78        0.25          -0.32

ITEM ENTRY     DIFFICULTY MEASURE   SCORE   INFIT MNSQ   INFIT ZSTD   OUTFIT MNSQ   OUTFIT ZSTD
3              -2.20                6       1.56          0.92        1.41           0.73
4              -1.08                5       1.15          0.45        1.37           0.67
5              -0.24                4       0.59         -1.10        0.48          -0.86
6              -0.24                4       0.93         -0.04        0.85          -0.06
7               1.34                2       0.62         -0.87        0.46          -0.41
8               2.42                1       1.21          0.52        0.78           0.30
Table 2.5 The Rasch analysis results (person data and item data) based on the data in Table 2.4
Table 2.4 is the scoring matrix of a 10-item vocabulary test, and Table 2.5 is the Rasch analysis report of the data in Table 2.4. The calibration was done with Winsteps (version 3.81.0). Note first that the numbers of examinees and items in the two tables differ: Table 2.4 contains 10 persons and 10 items, while Table 2.5 presents 7 persons and 6 items. In the Rasch model, extreme scores (all correct and all wrong) always fit the model exactly, so data with extreme scores are excluded from the computation of fit statistics (Linacre, 2002). Thus, TOM, JUN, and ROB are excluded. After excluding those persons, items 1, 2, 9, and 10 are also found to have extreme scores, so those 4 items are deleted. Table 2.5 indicates that SAM (+2.2) is the most able person and BEN (-2.16) is the weakest. In addition, the table shows that item 8 (+2.42) is the toughest and item 3 (-2.20) is the easiest.
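For readers who wish to experiment, the calibration can be approximated outside Winsteps. The following is a rough sketch of joint maximum-likelihood estimation on the non-extreme portion of Table 2.4; it is not the Winsteps algorithm, and its estimates will differ somewhat from Table 2.5 because Winsteps applies additional corrections.

import numpy as np

# Non-extreme portion of Table 2.4: persons BEN..SAM (rows), items 3-8 (columns).
X = np.array([
    [1, 0, 0, 0, 0, 0],  # BEN
    [1, 1, 0, 0, 0, 0],  # KIM
    [1, 1, 0, 1, 0, 0],  # ANN
    [0, 1, 1, 1, 0, 0],  # TIM
    [1, 1, 1, 1, 0, 0],  # RIO
    [1, 0, 1, 0, 1, 1],  # SUE
    [1, 1, 1, 1, 1, 0],  # SAM
])

theta = np.zeros(X.shape[0])   # person ability logits
b = np.zeros(X.shape[1])       # item difficulty logits
for _ in range(500):           # damped Newton-Raphson updates
    p = 1 / (1 + np.exp(-(theta[:, None] - b[None, :])))
    info = p * (1 - p)
    theta += 0.5 * (X - p).sum(axis=1) / info.sum(axis=1)
    b -= 0.5 * (X - p).sum(axis=0) / info.sum(axis=0)
    b -= b.mean()              # anchor the scale: item difficulties centred at zero

print("Item difficulty logits (items 3-8):", np.round(b, 2))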
From the Rasch analysis, information about the abilities of the persons and the difficulties of each item can be obtained. According to Table 2.5, the infit MNSQ values of most persons and items are within the acceptable range, except for SUE and item 3. SUE's infit MNSQ and ZSTD are +2.68 and +2.25 respectively. Those two values imply that SUE does not fit the Rasch model properly; that is, SUE is identified as a misfit datum. Looking into SUE's responses, it can be seen that SUE did not correctly answer items 4 and 6, which were relatively easy for her. Given SUE's latent trait, these unpredictable responses raise the infit MNSQ and ZSTD values. Item 3, however, is not classified as a misfit datum: even though its infit MNSQ is higher than +1.5, its infit ZSTD is within the acceptable range (< +2.0). In this case, it can be said that item 3 fits the Rasch model.
2.4 Conclusion to Chapter 2
Language testing shares with the study of language learning the goal of understanding the process of language learning (Shohamy, 2000). Thus, the theory of language testing is likely to be congruent with psychological knowledge about language learning (Lado, 1961). From this perspective, there has long been a tension in the field of language testing between the analytical and the integrative (Davies, 1978). For example, Lado (1961) viewed language as a linguistic phenomenon and held that testing should target linguistic abilities; he argued that a language must be broken down into its linguistic components. On the other hand, Taylor (2004) notes that recent language testing tends to pay much attention to how different language variables interact with one another. In language testing, however, such a distinction is "not a real or an absolute one" (Davies, 1978, p. 151). Rather, when developing a test, more attention needs to be paid to the test takers who actually take the tests and to their needs, which can vary from one context to another (Shohamy, 2000). To sum up, I do not believe that there exists a one-size-fits-all test instrument; I believe, rather, in the test designer who can fit any context by understanding it and the demands of the examinees (or of the users of the test results).
In this chapter, the weighting methods and the ways of analysing test items were dealt with.
In the next chapter, the method which was implemented in the current research will be discussed.
Chapter 3. Research Methodology
3.1 Introduction to Chapter 3
The previous chapter explained two item weighting methods and the ways of analysing test
items. In this chapter, firstly, the context in which Korean teachers design a summative test will be
explained. After the explanation about the context, a research strategy will be discussed and the
overall procedures of this research will be delineated.
3.2 Context
In South Korea, high school students study for three years, and each year is divided into two terms. In each term, students take a paper-and-pencil examination twice. Along with the paper-and-pencil examinations, performance-based tests (e.g. essay writing, student portfolios, group projects) are also administered in order to test students' actual performance (Brown, 1994). By summing the scores which students obtain from the paper-and-pencil tests and the performance-based tests, students' term scores are produced and the final results are ranked. Subsequently, the rank is converted into a percentage and, based on the percentage, students' grades are assigned from 1 to 9 (see Table 3.1). The lower the grade, the better the result.
Grade   Percentage      Frequency Distribution   Number of students in each
                        (cumulative)             grade (out of 100 students)
1       above 96%        4%                       4
2       95% to 89%       11%                      7
3       88% to 77%       23%                      12
4       76% to 60%       40%                      17
5       59% to 40%       60%                      20
6       39% to 23%       77%                      17
7       22% to 11%       89%                      12
8       10% to 4%        96%                      7
9       3% and below     100%                     4
Table 3.1 Grade index table
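The conversion in Table 3.1 amounts to a simple cumulative lookup. The helper below is purely hypothetical (it is not part of any school system); it takes a student's cumulative percentile rank, where the top 4% receive grade 1:

def grade_from_percentile(top_percentile: float) -> int:
    # Cumulative cut-offs from Table 3.1: top 4% -> grade 1, top 11% -> grade 2, etc.
    cutoffs = [(4, 1), (11, 2), (23, 3), (40, 4), (60, 5),
               (77, 6), (89, 7), (96, 8), (100, 9)]
    for upper, grade in cutoffs:
        if top_percentile <= upper:
            return grade

print(grade_from_percentile(10))  # 2: a student ranked 10th out of 100 falls in the top 11%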
In South Korea, subject teachers in each school design their summative tests themselves, so the examinations which students take differ between schools. Teachers autonomously decide the language categories, test objectives, and even the difficulty levels before students take the examination. After completing a test design, teachers submit the test sheets together with a two-dimensional table of test specifications (Figure 3.1) to the school head office. The two-dimensional table of test specifications contains specific information about answers, item marks, difficulty levels, language categories, and objectives.
Figure 3.1 Example of two-dimensional table of test specifications
As can be surmised from Figure 3.1, teachers estimate the difficulty level of each item and, based on their intuitive feel for item difficulty (Rudner, 2001), allocate different weightings to the items: the more difficult they believe an item to be, the higher the mark it is allocated. The primary reason for allocating different marks to the items lies in minimising the number of students who gain the same total score and thereby helping to classify students (Shohamy, 1998), because the Korean education office employs a policy against ties in total scores. For example, out of 100 students, suppose that the students ranked 9 to 13 have obtained the same score. According to Table 3.1, the students ranked 9 to 11 would ordinarily receive grade 2. However, the policy does not allow this; instead, all the tied students are allocated grade 3, and only 4 students (ranked 5 to 8) obtain grade 2. Teachers do not want their students to receive such disadvantageous results, so they allocate different marks to the test items. By adopting this policy, the education office believes that the discrimination power of the test items becomes high and that the reliability of the test results can be enhanced.
The school test grade is used as a means by which students gain entry to university. Thus, most students tend to study assiduously so as to obtain the highest scores they can. That is, summative tests play a high-stakes role in South Korea. It follows, then, that because the tests are so important, teachers should pay careful attention to the overall procedure of designing them. Among the many factors related to test design, the way items are weighted will be researched in this dissertation. To be specific, most Korean teachers use an 'a priori' weighting method in deciding item difficulty levels and assign different weights to the test items. My belief, which led to the development of this study, was that teachers' decisions about item difficulty levels may not always accord with students' actual responses to the items. In order to identify the gap between teachers' decisions about item difficulty levels and the actual difficulty levels of the test items, item analysis using the Rasch model will be implemented.
3.3 Research Strategy
The purpose of this research is to identify the gap between actual item difficulty and teachers' item weighting. Since teachers' decisions about item difficulty levels are well described in the two-dimensional tables of test specifications collected from the schools (see appendix 3), the focus of this research is to measure the actual difficulty value of each test item. From this perspective, quantitative research, which views social reality as an objective reality and "advocates the application of the methods of the natural sciences to the study of social reality and beyond" (Bryman, 2012, p. 28), seems more appropriate than a qualitative approach for conducting this study.
Of the quantitative research methods, item analysis will be used in this research. Crocker and Algina (1986) define item analysis as the computation and examination of any statistical property of test takers' responses to an individual test item. In this sense, epistemological objectivity needs to be secured in this research when processing the collected data precisely and transforming them into numerical values (Kumar, 2005). The data obtained will thus be analysed and interpreted with the statistical computer program Winsteps (version 3.81.0).
3.4 Research Design
3.4.1 Overview of the Research Procedure
As the first step, the test items will be arranged by the difficulty levels which the teachers decided prior to the test administration. The two-dimensional tables of test specifications which I obtained from X, Y, and Z high schools (see appendix 3) will be the source of reference for this arrangement. Through this process, how the teachers in X, Y, and Z schools have weighted the test items will be illustrated. In relation to the second research question, the test items will then be rearranged by language category and item difficulty level, which I believe will identify which language categories teachers believe to be difficult or easy for students.
After classifying the test items in terms of the predetermined difficulty levels and the language categories, I will calibrate the actual difficulty values (logits) of each item using the Rasch model. As preparation for measuring the difficulty logits of the test items, I will dichotomously transform students' responses into "1" for a correct response and "0 (zero)" for a wrong response (this procedure will be explained in detail in 3.4.3). Based on the dichotomously transformed data, I will measure the item difficulty logits.
However, in order to measure the difficulty logits of the items precisely, the elements which can affect the calibration need to be removed. As explained in chapter 2, high person infit MNSQ (> +1.5) and ZSTD (> +2.0) values can statistically compromise the measurement of the item difficulty logits (Linacre, 2002). In this sense, I will calculate the person infit MNSQ and ZSTD values, and all misfit data whose infit MNSQ and ZSTD exceed +1.5 and +2.0, respectively, will be filtered out, as sketched below. The whole process of measuring infit MNSQ and infit ZSTD will be carried out with the statistical software Winsteps (version 3.81.0).
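As a sketch of this filtering step (in the study itself the filtering is done inside Winsteps; the file name and column names below are assumed purely for illustration):

import pandas as pd

# Hypothetical person-fit report exported from the Rasch analysis.
persons = pd.read_csv("person_fit_report.csv")  # assumed columns: PERSON, INFIT_MNSQ, INFIT_ZSTD

# Keep only persons who fit the Rasch model: infit MNSQ <= +1.5 and infit ZSTD <= +2.0.
keep = (persons["INFIT_MNSQ"] <= 1.5) & (persons["INFIT_ZSTD"] <= 2.0)
fit_persons = persons[keep]
print(f"Removed {len(persons) - len(fit_persons)} misfit persons out of {len(persons)}")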
Once the misfit person data which could affect the difficulty measurement have been detected and removed, item analysis using the Rasch model will be carried out again in order to calculate the actual item difficulty logits. In this process, using Baker's (1985) verbal terms, the calculated item difficulty logits will be classified into five levels according to the index (logits): very easy, easy, medium, hard, and very hard. Figure 3.2 shows the difficulty index and the corresponding verbal terms.
Figure 3.2 Difficulty level and index (Baker, 1985)
After the actual difficulty logits of the test items have been computed with the help of Winsteps (version 3.81.0), two comparisons will be made: one between teachers' estimates of the item difficulty levels and the actual item difficulty levels, and the other between teachers' beliefs about the difficulty levels of the language categories and students' actual responses to those categories. Afterwards, based on the statistical results, the hypothesis will be checked and knowledge (answers to the research questions) will be generated. The overall blueprint of the current research can be schematised as in Figure 3.3.
Data collection and random sampling
→ Classifying the items and language categories in relation to difficulty levels
→ Detecting misfit person data and removing all misfit data detected
→ Calibrating actual item difficulty logits
→ Comparing teachers' beliefs with the Rasch analysis results / answering the research questions
Figure 3.3 Overall procedures of the research
3.4.2 Data Collection and Sampling
In this research, two different kinds of samples are required: the two-dimensional tables of test specifications (with test items) and the test results showing how students responded to each test item. These two samples were obtained from three different public high schools in Daejeon, South Korea, located within a few miles of one another. Since students in this city are randomly assigned to high schools by the city education office on the basis of where they live, the students of these three schools are demographically similar to a large extent, which enhances the reliability and validity of the findings to a certain extent (Scott & Morrison, 2006).
                         X high school     Y high school     Z high school
Type of school           Co-educational public school (all three)
Grade of subjects        3rd grade (all three)
Location                 Same part of the city (all schools within a 3-mile radius)
Size of collected data   262               362               195
Item type                Multiple choice (all three)
Number of test items     28                28                29
Table 3.2 Sample summarisation
The primary process of this research is to calculate the actual difficulty levels of the test items using the Rasch model. In order to do so, it is necessary to collect data about how students responded to the test items. Since students' response data are computerised and stored in Excel files such as the one in Figure 3.4 after students have taken the test, obtaining the students' response data is not difficult. Rather, the difficulty lies in deciding the sample size needed for accurate research (Lewin, 2011). With regard to sample size, Henning (1987) suggests that the recommended sample size for the Rasch model is between 100 and 200 for statistically meaningful results. Thus, 200 discrete samples will be randomly extracted using the Excel function "RANDUNIQ"; a sketch of an equivalent procedure is shown below.
Figure 3.4 Example of students' response data (a dot means a correct answer and a number means an incorrect answer)
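An equivalent sampling step can be sketched outside Excel as follows (the file name is assumed; the fixed seed merely makes the draw reproducible):

import pandas as pd

# Hypothetical: one school's computerised response sheets loaded from Excel.
responses = pd.read_excel("school_X_responses.xlsx")

# Draw 200 unique students at random, mirroring the role of the "RANDUNIQ" function.
sample = responses.sample(n=200, random_state=42)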
3.4.3 Data Analysis
Depending on the features of the data, an appropriate method of data coding should be implemented; if not, the findings will not be valid and reliable. The main aim of data analysis in this research is to calculate the difficulty logits of all test items and then compare the items' actual difficulty levels with the teachers' item weighting. In order to do so, students' responses will be dichotomously coded as "1" for correct responses and "0 (zero)" for wrong responses (see Table 2.4 in chapter 2); a minimal sketch follows below. Based on the dichotomously transformed data, the Rasch analysis will be done with the aid of Winsteps (version 3.81.0). Through the Rasch analysis, the raw data will be turned into a report containing much information (Oppenheim, 1992) that can be used to examine the research hypothesis about teachers' decisions on item weighting.
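A minimal sketch of this recoding, assuming the cell conventions of Figure 3.4 (a dot for a correct answer, a digit for the chosen distractor):

def dichotomise(cell: str) -> int:
    # "." marks a correct answer; any digit marks the distractor chosen.
    return 1 if cell.strip() == "." else 0

row = [".", "3", ".", ".", "5"]           # one student's responses to five items
print([dichotomise(c) for c in row])      # [1, 0, 1, 1, 0]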
Black (1999) claims that well-presented descriptive statistics with visual aids can enhance the comprehension of the outcomes of quantitative research. Similarly, Kumar (2005) suggests that visualising the findings is an important means of communication and that effective data-display techniques can help readers understand the findings clearly and easily. From this perspective, figures such as item-person maps will be included in this research, which may help readers recognise the hierarchy of item difficulty and the relationship between items and test-takers' abilities (e.g. Figure 2.2). In addition, a number of tables will be used to show information about item difficulty logits and, mainly, to identify the difference between teachers' item weighting and the actual item difficulty levels.
3.4.3.1 Way of Categorising Language Elements
In order to answer the second research question, it is necessary to categorise language
elements. As a way of classifying the language elements, I will refer to the directions of the test items,
because the directions of test items reflect the language category which teachers want to assess.
First of all, I will translate the Korean directions of all test items into English. After changing
the Korean directions into English ones, items with the same direction will be combined into the
same language category. Take X15 (X high school's 15th item) and Y3 (Y high school's third item) as an
example. As shown in Figure 3.5, the directions of X15 and Y3 were originally written in Korean, but I
will translate them into English. Those two items have the same direction, so they will be
combined into the same language category, "finding a suitable discourse marker" (see Table 4.2 and
appendix 3). In this way, all items will be categorised on the basis of the directions of the test items.
Figure 3.5 Examples of translating the Korean directions into English
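As a sketch of this grouping step, the snippet below assumes hypothetical (item, translated
direction) pairs; the X15/Y3 pairing follows the example above, while the exact wording of each
direction string is illustrative rather than taken from the test papers.

```python
# A minimal sketch of the grouping step: items sharing a translated direction
# are merged into one language category. The direction strings are invented.
from collections import defaultdict

items = [
    ("X15", "finding a suitable discourse marker"),
    ("Y3",  "finding a suitable discourse marker"),
    ("X4",  "distinguishing a pronoun which indicates a different reference"),
]

categories = defaultdict(list)
for item_id, direction in items:
    categories[direction].append(item_id)   # identical direction -> same category

for direction, members in categories.items():
    print(f"{direction}: {members}")
```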
3.4.4 Validity and Reliability
Validity refers to the accuracy of the data and the appropriateness of the research questions
being investigated (Denscombe, 2010). In order to attain trustworthy findings, therefore, careful
attention has to be paid to narrowing the gap between what is true and what is measured in relation
to the research questions. In this research, for the purpose of making the data accurate, a process of
data refinement will be executed. To be specific, before calibrating the items' difficulty logits,
person infit MNSQ and ZSTD will be calculated, and all person data whose infit MNSQ exceeds +1.5
or whose infit ZSTD exceeds +2.0 will be deleted, because such misfit data may affect the
item difficulty logit calibration. After excluding the misfit person data which can affect precise difficulty
logit measurement, only the person data that fit the Rasch model will be analysed, and the virtual difficulty
of each item will be calculated through the Rasch analysis. I believe that such a data refinement
process may be able to consolidate the validity of this research.
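A minimal sketch of this refinement rule is given below, assuming hypothetical person records
carrying the infit statistics that Winsteps reports; entries 3 and 80 mirror the misfit cases later
shown in Table 4.3, while entry 12 is invented for contrast.

```python
# A minimal sketch of the data-refinement rule described above.
persons = [
    {"entry": 3,  "infit_mnsq": 1.58, "infit_zstd": 2.63},   # misfit
    {"entry": 80, "infit_mnsq": 1.67, "infit_zstd": 3.46},   # misfit
    {"entry": 12, "infit_mnsq": 0.98, "infit_zstd": -0.21},  # acceptable fit (invented)
]

def is_misfit(person):
    # Delete person data whose infit MNSQ exceeds +1.5 or infit ZSTD exceeds +2.0.
    return person["infit_mnsq"] > 1.5 or person["infit_zstd"] > 2.0

refined = [p for p in persons if not is_misfit(p)]
print([p["entry"] for p in refined])   # [12]
```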
Along with the concept of validity, reliability also matters in this research, because reliability
serves as one of the "important safeguards against the contamination of scientific data" (Krippendorff, 1980, p.
129). Since "validity presumes reliability" (Bryman, 2012, p. 173), a high level of reliability will
be secured through the data refinement process in this research. In addition, in terms of sample size, as
claimed by Henning (1987), by attaining between 100 and 200 samples, the reliability of
the statistical results may be secured to a certain extent.
3.5 Ethical Considerations
In analysing students' responses to the test items, ethical issues may not be at the forefront
(Neuman, 2005), because the people being studied are not directly involved in this research. Nevertheless,
the students' response data and the two-dimensional table of test specifications contain private
information that needs to be kept confidential, because both data sets include school names, students'
names, and students' numbers, and disclosing such information would raise ethical problems. Thus, I
asked teachers to remove all private information before they provided the data to me. After
collecting the data from the teachers, I double-checked whether all private information had been
deleted by the teachers. All deleted information about school names and students' names was
replaced with pseudonyms to conform to ethical guidelines. Furthermore, the test items themselves could
be one of the concerns affecting the ethics of the current research. However, the test items are
uploaded on each school's webpage after the examination, so anyone who joins the webpage can access
them.
3.6 Conclusion to Chapter 3
In this chapter, I have explained the overall procedures of my quantitative research and its
theoretical underpinning with regard to the research questions. Moreover, the process of collecting
and analysing the data has been introduced. In the following chapter, the data will be analysed and
the findings of this research will be presented.
Chapter 4. Data Analysis
4.1 Introduction to Chapter 4
In the previous chapter, the context in which teachers develop test items was discussed.
In addition, the research method of this dissertation was explained along with issues relating to
validity and reliability. This chapter reports on how the data collected from the three schools were
analysed quantitatively with the help of Winsteps. Subsequently, based on the statistical information
which Winsteps provided, the research questions will be addressed.
4.2 Selecting Data
The size of the samples I obtained from the three public schools varied by institution: 262
samples were collected from X high school, 362 from Y high school, and 195 from Z high school.
Henning (1987) notes that the appropriate sample size for gaining statistically meaningful results
with the Rasch model is from 100 to 200. Therefore, I reduced the sample sizes of X and Y school to
200 using a random sampling process. In this process, the Excel function "RANDUNIQ", which makes
it possible to extract 200 unique random numbers, was used (see appendix 1 and 2).
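For illustration, a minimal sketch of the same down-sampling step is given below; random.sample
plays the role the text assigns to "RANDUNIQ" (drawing unique random record numbers), and the seed
is purely illustrative, not taken from the dissertation.

```python
# A minimal sketch of the random down-sampling step for X high school.
import random

random.seed(2014)                                    # illustrative seed
x_school_rows = list(range(1, 263))                  # 262 collected responses
selected = sorted(random.sample(x_school_rows, 200)) # 200 unique row numbers
print(len(selected), selected[:10])
```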
4.3 Test Item Classification
While designing test items, teachers tend to decide the difficulty level of each test item and
allocate a different mark to each item on the basis of their prior knowledge and experience.
Likewise, in X, Y, and Z high school, teachers had decided the difficulty level of each item before
students took the test, and they had allocated different marks to the items according to their feel for
the items' difficulty levels (Rudner, 2001). That is, the more difficult they believed a test item
was, the higher the mark they assigned to it. However, as shown in Table 4.1, two test items (X14
and X24) of X high school were exceptions: even though the difficulty level of those two items was
believed to be medium, teachers allocated lower marks to them than to some of the easy test items (e.g. X10 or
X12). Aside from those two items, marks were hierarchically assigned to all items on the basis
of the difficulty levels. Through the item classification of those three schools, I could observe the
correlation between difficulty level and mark allocation. The capital letters X, Y, and Z in the table
refer to X, Y, and Z high school, and the number following the capital letter is an item number.
Thus, X10 means X high school's tenth test item.
In addition, Table 4.1 shows that teachers had assigned a wide range of marks to items
within a single test. In X high school's English test, teachers made 28 test items and used 22
different marks; between the lowest mark (2.5) and the highest mark (4.6), 20 further marks
were allocated to items. Similarly, 7 different marks were allocated to Y high school's test items
and 10 different marks to Z high school's items. Assigning such a wide range of marks within a
single test created a further stratification between items with the same difficulty level. In other words,
even if the difficulty level of two items is the same, by assigning different marks, teachers create
another difficulty layer between them. Because of these stratifications, in some cases the mark
difference between items with different difficulty levels (e.g. X12 and X3) is smaller than the
difference between items with the same difficulty level (e.g. X4 and X15).
X High school            Y High school            Z High school
ITEM   MARK   LEVEL     ITEM   MARK   LEVEL     ITEM   MARK   LEVEL
X4     2.5    E         Y19    3.1    E         Z19    3      E
X17    2.6    E         Y26    3.1    E         Z29    3      E
X18    2.7    E         Y15    3.2    E         Z4     3.1    E
X8     2.8    E         Y25    3.2    E         Z6     3.1    E
X15    2.9    E         Y21    3.3    E         Z22    3.1    E
X24    3      M         Y27    3.3    E         Z26    3.1    E
X16    3.1    E         Y1     3.6    M         Z28    3.1    E
X14    3.2    M         Y2     3.6    M         Z9     3.2    E
X10    3.3    E         Y3     3.6    M         Z10    3.3    M
X12    3.3    E         Y4     3.6    M         Z27    3.3    M
X3     3.4    M         Y5     3.6    M         Z3     3.4    M
X27    3.4    M         Y8     3.6    M         Z7     3.4    M
X2     3.5    M         Y9     3.6    M         Z8     3.4    M
X7     3.5    M         Y11    3.6    M         Z13    3.4    M
X28    3.6    M         Y12    3.6    M         Z15    3.4    M
X9     3.7    M         Y14    3.6    M         Z21    3.4    M
X13    3.7    M         Y16    3.6    M         Z5     3.5    M
X25    3.8    M         Y20    3.6    M         Z14    3.5    M
X26    3.9    M         Y22    3.6    M         Z17    3.5    M
X1     4      M         Y24    3.6    M         Z11    3.6    M
X20    4      M         Y23    3.7    M         Z16    3.6    M
X21    4      M         Y6     3.8    H         Z24    3.7    M
X23    4.1    H         Y7     3.8    H         Z25    3.7    M
X22    4.2    H         Y10    3.8    H         Z18    3.8    H
X19    4.3    H         Y13    3.8    H         Z23    3.8    H
X6     4.4    H         Y17    3.8    H         Z1     3.9    H
X5     4.5    H         Y18    3.8    H         Z2     3.9    H
X11    4.6    H         Y28    3.9    H         Z12    3.9    H
                                                Z20    3.9    H
(* E: easy, M: medium, H: hard)
Table 4.1 Item mark allocation and item difficulty of three schools.
As explained in the previous chapter, based on the directions of the test items (see appendix 3), I
classified all the test items according to 17 different criteria, as illustrated in Table 4.2, which
summarises how teachers weighted the test items according to language categories. Teachers in X, Y,
and Z high school believed that the language categories "identifying the syntactically wrong (or
correct) usage of word categories" (10) and "inferring a word, a phrase, and a sentence" (11, 12, and
13) could be difficult for students, so they generally assigned higher marks to items pertinent to
those language categories. On the other hand, teachers considered "finding a wrong explanation
about a given passage" (7) and "finding suitable discourse markers" (9) to be easy language
categories, so they assigned relatively lower marks to the items related to categories (7) and (9). In
addition, the language categories "finding a title of a given passage" (5) and "placing sentences in a
logical order" (15) were mostly classified as medium tasks.
Table 4.2 X, Y, and Z high schools' language category specification and item difficulty levels
[The original table arranges all 85 items under the 17 language categories below, each split into the
difficulty levels (E/M/H) teachers assigned, with the corresponding school and item numbers; the
item-by-item grid did not survive transcription. The 17 language categories are:]
1. Distinguishing a pronoun which indicates a different reference
2. Finding a non-cohesive sentence
3. Finding suitable words to the context of a given passage
4. Finding a theme of a given passage
5. Finding a title of a given passage
6. Finding a topic of a given passage
7. Finding a wrong explanation about a given passage
8. Finding an awkward word to the context of the given passage
9. Finding suitable discourse markers
10. Identifying the syntactically wrong (or correct) usage of word categories (e.g. noun, verb, relative pronoun, preposition, adjective, etc.)
11. Inferring a phrase based on a given passage
12. Inferring a sentence based on a given passage
13. Inferring a word based on a given passage
14. Locating a given sentence in a passage
15. Placing sentences in a logical order
16. Summarising a given passage
17. Identifying author's mood
*H: hard, M: medium, E: easy
4.4 Data Refinement
In order to precisely calibrate the item difficulty logits, the misfit person data need to be
excluded, because such misfit data can distort or degrade the measurement system (Linacre, 2002).
Thus, by using the criteria explained in the previous chapter, I analysed the response data using the
Rasch model and found the misfit data whose infit MNSQ and infit ZSTD values were not within the
acceptable ranges. To be specific, two misfit person data (entry no. 3 and 80) were found in X high
school, while no misfit datum was found in Y or Z high school (see appendix 4). Those two misfit
data's infit MNSQ and ZSTD values were beyond the acceptable range, as shown in Table 4.3.
PERSON   ABILITY             INFIT   INFIT   OUTFIT   OUTFIT
ENTRY    MEASURE    SCORE    MNSQ    ZSTD    MNSQ     ZSTD
3        -0.95      9        1.58    2.63    1.73     1.61
80       -0.39      12       1.67    3.46    2.90     4.27
210      5.3        28       1.00    0.00    1.00     0.00
Table 4.3 Misfit person data and extreme scorer of X high school
In addition, one person (entry no. 210) in X high school responded to all test items correctly
(see appendix 4). Since extreme scorers (those who answer all test items correctly or incorrectly)
always fit the Rasch model perfectly (Henning, 1987), the extreme scorer found in X high
school was also excluded. After eliminating those three cases, the sample size of X high school became
197, while the sample sizes of Y and Z high school remained unchanged at 200 and 195, respectively.
4.5 Item Difficulty Calibration
Based on the refined students' item response data (see appendix 5), the difficulty logits of
each school's test items were measured by Winsteps. After calculating the difficulty logit of each
item, each item was tagged with one of five difficulty levels according to the logit ranges which Baker
(1985) suggested (see Figure 3.2): VH (very hard), H (hard), M (medium), E (easy), and VE (very easy).
Through this process, it was possible to compare the difficulty levels which teachers had decided
with the actual difficulty levels measured by Winsteps.
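The tagging step can be sketched as below. The cut-points are an assumption inferred from the
levels reported in Tables 4.4, 4.6, and 4.8 (e.g. +3.36 is tagged VH, +0.5 H, -0.47 M, -0.74 E, and
-2.75 VE), offered as one reading of Baker's scale rather than his exact figures.

```python
# A minimal sketch of mapping calibrated logits to verbal difficulty levels.
# Cut-points are inferred from the levels in Tables 4.4, 4.6 and 4.8.
def difficulty_level(logit):
    if logit >= 2.0:
        return "VH"    # very hard
    elif logit >= 0.5:
        return "H"     # hard
    elif logit >= -0.5:
        return "M"     # medium
    elif logit > -2.0:
        return "E"     # easy
    return "VE"        # very easy

# Logits taken from Tables 4.4 and 4.6; expected output: VH, H, M, E, VE
for item, logit in [("X22", 3.36), ("X1", 0.5), ("Y18", -0.47),
                    ("X19", -0.74), ("X4", -2.75)]:
    print(item, difficulty_level(logit))
```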
4.5.1 X High School
X HIGH SCHOOL (Item Reliability .97)
ITEM    DIFFICULTY   SCORE        MODLSE   INFIT   INFIT   OUTFIT   OUTFIT   DIFFICULTY
ENTRY   MEASURE      (Response)            MNSQ    ZSTD    MNSQ     ZSTD     LEVEL
X22     3.36         21           0.27     1.03    0.21    1.90     1.64     VH
X7      1.81         53           0.19     1.25    2.45    2.68     5.42     H
X16     1.46         63           0.18     1.31    3.22    1.52     2.52     H
X3      1.27         69           0.18     0.89    -1.27   0.94     -0.30    H
X18     0.8          84           0.17     0.76    -3.18   0.71     -2.19    H
X27     0.8          84           0.17     1.26    2.93    1.43     2.69     H
X21     0.74         86           0.17     1.01    0.18    0.92     -0.50    H
X5      0.71         87           0.17     1.25    2.84    1.37     2.44     H
X23     0.71         87           0.17     0.80    -2.66   0.78     -1.66    H
X28     0.62         90           0.17     0.91    -1.05   0.81     -1.41    H
X1      0.5          94           0.17     1.31    3.47    1.45     2.95     H
X24     0.29         101          0.17     0.98    -0.21   1.01     0.11     M
X14     0.17         105          0.17     0.99    -0.07   0.92     -0.55    M
X15     0.17         105          0.17     0.92    -1.00   0.84     -1.14    M
X2      -0.12        115          0.17     0.90    -1.28   0.83     -1.06    M
X13     -0.4         124          0.18     0.95    -0.61   0.83     -0.94    M
X8      -0.43        125          0.18     0.90    -1.23   0.83     -0.93    M
X12     -0.46        126          0.18     0.99    -0.14   0.92     -0.40    M
X19     -0.74        135          0.18     0.97    -0.33   1.40     1.76     E
X25     -0.77        136          0.18     0.94    -0.68   0.79     -1.00    E
X11     -0.84        138          0.18     0.96    -0.47   1.01     0.10     E
X17     -0.91        140          0.18     0.95    -0.57   0.89     -0.40    E
X6      -0.97        142          0.18     0.97    -0.30   0.76     -1.01    E
X9      -1.11        146          0.19     1.09    0.99    1.14     0.60     E
X10     -1.18        148          0.19     1.04    0.44    0.84     -0.56    E
X26     -1.33        152          0.19     0.79    -2.36   0.59     -1.58    E
X20     -1.4         154          0.2      0.83    -1.87   0.59     -1.52    E
X4      -2.75        181          0.27     0.83    -0.93   0.40     -1.44    VE
Table 4.4 Item difficulty measurement of X high school test items
Table 4.4 shows how difficult the items turned out to be. According to Table 4.4, the item
estimates range from +3.36 (X22) to -2.75 (X4) logits. That is, students in X high school had more
difficulty in responding to X22 correctly than to any other item, whereas they solved X4 with ease.
Out of 28 test items, 11 were estimated to be hard, while 10 were found to be easy for students;
the number of medium-level items was relatively small. By using an item map, the hierarchy of
item difficulty can be schematised (see Figure 4.1).
Figure 4.1 X high school item map
Teachers' Weighting          The Rasch Analysis
MARK   DIFFICULTY    ITEM     DIFFICULTY   DIFFICULTY
       LEVEL         ENTRY    LEVEL        LOGIT
4.6    H             X11      E            -0.84
4.5    H             X5       H            0.71
4.4    H             X6       E            -0.97
4.3    H             X19      E            -0.74
4.2    H             X22      VH           3.36
4.1    H             X23      H            0.71
4      M             X1       H            0.5
4      M             X20      E            -1.4
4      M             X21      H            0.74
3.9    M             X26      E            -1.33
3.8    M             X25      E            -0.77
3.7    M             X9       E            -1.11
3.7    M             X13      M            -0.4
3.6    M             X28      H            0.62
3.5    M             X2       M            -0.12
3.5    M             X7       H            1.81
3.4    M             X3       H            1.27
3.4    M             X27      H            0.8
3.3    E             X10      E            -1.18
3.3    E             X12      M            -0.46
3.2    M             X14      M            0.17
3.1    E             X16      H            1.46
3      M             X24      M            0.29
2.9    E             X15      M            0.17
2.8    E             X8       M            -0.43
2.7    E             X18      H            0.8
2.6    E             X17      E            -0.91
2.5    E             X4       VE           -2.75
Table 4.5 X high school teachers' difficulty level estimation and virtual difficulty levels
In the middle of Table 4.5 are the items which X school teachers designed. The left columns of
Table 4.5 contain the information about teachers' prior weighting and mark allocation. To the right
of the item entry column, the calibrated difficulty logits and difficulty levels of the items are
presented. The items are arranged by the marks, and hence the difficulty levels, which teachers in X
high school decided prior to the test administration.
Table 4.5 indicates that there are a number of cases which display a gap between teachers'
item difficulty estimation and the actual item difficulty level. Teachers in X high school believed
that X11 might be the most difficult item for students to solve, so they allocated the highest mark to
X11. However, according to Table 4.5, the measured difficulty logit of X11 is -0.84, suggesting that
X11 was an easy item. Under this model, the teachers should not have assigned the highest mark to
X11. Similarly, two other items, X6 and X19, had been estimated to be hard by teachers, but the
item logits of X6 (-0.97) and X19 (-0.74) indicate that it was easy for students to find the correct
answers to those two items. Teachers also misjudged the difficulty levels of two items, X16 and X18.
The teachers presumed that X16 and X18 might be easy items. However, the calculated difficulty
logits of X16 (+1.46) and X18 (+0.8) indicate that it was hard for students to respond to those two
items correctly. According to Table 4.4, less than a third of students responded to X16 correctly and
less than half answered X18 correctly. Aside from those five items, there were 13 other items in X
high school's test which showed an incongruity between teachers' 'a priori' weighting and the virtual
difficulty levels.
4.5.2 Y High School
Y HIGH SCHOOL (Item Reliability .94)
ITEM    DIFFICULTY   SCORE        MODLSE   INFIT   INFIT   OUTFIT   OUTFIT   DIFFICULTY
ENTRY   MEASURE      (Response)            MNSQ    ZSTD    MNSQ     ZSTD     LEVEL
Y26     1.48         41           0.2      1.13    1.14    1.16     0.79     H
Y22     1.21         48           0.19     1.15    1.36    1.47     2.27     H
Y17     1.03         53           0.19     1.24    2.24    1.52     2.72     H
Y27     0.93         56           0.18     1.38    3.52    1.51     2.85     H
Y13     0.76         61           0.18     0.87    -1.44   0.79     -1.52    H
Y16     0.63         65           0.18     1.20    2.06    1.30     2.02     H
Y24     0.63         65           0.18     0.89    -1.19   0.86     -1.02    H
Y6      0.54         68           0.18     0.81    -2.21   0.73     -2.26    H
Y19     0.54         68           0.18     0.98    -0.18   0.96     -0.30    H
Y28     0.51         69           0.17     1.54    5.24    1.71     4.60     H
Y8      0.16         81           0.17     0.85    -1.91   0.81     -1.76    M
Y7      0.07         84           0.17     1.11    1.31    1.21     1.82     M
Y12     -0.01        87           0.17     0.77    -3.15   0.72     -2.72    M
Y3      -0.04        88           0.17     1.12    1.56    1.30     2.54     M
Y20     -0.09        90           0.17     1.08    1.01    1.16     1.41     M
Y23     -0.12        91           0.17     1.19    2.37    1.13     1.20     M
Y5      -0.18        93           0.17     0.96    -0.54   0.90     -0.92    M
Y14     -0.2         94           0.16     0.82    -2.63   0.78     -2.08    M
Y21     -0.23        95           0.16     1.22    2.80    1.48     3.79     M
Y11     -0.34        99           0.16     0.72    -4.35   0.68     -3.14    M
Y9      -0.39        101          0.16     1.10    1.33    1.08     0.70     M
Y18     -0.47        104          0.16     0.91    -1.31   0.81     -1.69    M
Y2      -0.71        113          0.16     0.72    -4.68   0.63     -3.34    E
Y15     -0.76        115          0.16     0.79    -3.34   0.68     -2.76    E
Y25     -0.95        122          0.16     0.89    -1.68   0.84     -1.13    E
Y4      -1.06        126          0.17     0.76    -4.03   0.62     -2.89    E
Y1      -1.26        133          0.17     0.92    -1.21   0.85     -0.91    E
Y10     -1.7         148          0.18     0.93    -0.95   0.89     -0.46    E
Table 4.6 Item difficulty measurement of Y high school test items
Table 4.6 shows how difficult the items turned out to be. According to Table 4.6, the item
estimates range from +1.48 (Y26) to -1.7 (Y10) logits. Students in Y high school had more difficulty in
responding to Y26 correctly than to any other item, whereas they solved Y10 with ease. In Y high
school, the Rasch model rated 10 items as hard, 12 as medium, and 6 as easy; the frequency of easy
items was low, whereas that of medium and hard items was relatively high. Table 4.6 also indicates
that, with the exception of two pairs of items (Y1 and Y10 / Y26 and Y22), the item difficulty logit
gap between adjacent items that straddle a difficulty-level borderline is wider than the gap between
adjacent items within the same level. To be specific, the logit difference between two adjacent items
on a difficulty borderline (Y28 and Y8 / Y18 and Y2) is relatively wider than the logit difference
between two adjacent items within the same difficulty level (e.g. Y17 and Y27). This suggests that
students might readily distinguish hard items from medium items, and medium items from easy items,
while taking the test. Those gaps between adjacent items can also be identified in the item map (see
Figure 4.2).
Figure 4.2 Y high school item map
Teachers' Weighting          The Rasch Analysis
MARK   DIFFICULTY    ITEM     DIFFICULTY   DIFFICULTY
       LEVEL         ENTRY    LEVEL        LOGIT
3.9    H             Y28      H            0.51
3.8    H             Y6       H            0.54
3.8    H             Y7       M            0.07
3.8    H             Y10      E            -1.7
3.8    H             Y13      H            0.76
3.8    H             Y17      H            1.03
3.8    H             Y18      M            -0.47
3.7    M             Y23      M            -0.12
3.6    M             Y1       E            -1.26
3.6    M             Y2       E            -0.71
3.6    M             Y3       M            -0.04
3.6    M             Y4       E            -1.06
3.6    M             Y5       M            -0.18
3.6    M             Y8       M            0.16
3.6    M             Y9       M            -0.39
3.6    M             Y11      M            -0.34
3.6    M             Y12      M            -0.01
3.6    M             Y14      M            -0.2
3.6    M             Y16      H            0.63
3.6    M             Y20      M            -0.09
3.6    M             Y22      H            1.21
3.6    M             Y24      H            0.63
3.3    E             Y21      M            -0.23
3.3    E             Y27      H            0.93
3.2    E             Y15      E            -0.76
3.2    E             Y25      E            -0.95
3.1    E             Y19      H            0.54
3.1    E             Y26      H            1.48
Table 4.7 Y high school's teachers' weighting and virtual difficulty
English teachers in Y high school believed that Y28 might be the most difficult item for
students to solve, so they had allocated the highest mark (3.9) to Y28. In contrast, Y26 had been
considered the easiest item, so the teachers had assigned the lowest mark (3.1) to it. However, the
evidence which Winsteps provided showed that there was a gap between teachers' decisions about
item weighting and the virtual item difficulty levels in Y high school. Teachers' weighting of Y26
completely mismatched its virtual difficulty level: as seen in Figure 4.2 and Table 4.7, Y26 is placed
at the very top of the item map and has the highest difficulty logit (+1.48), meaning that no item was
more difficult than Y26. In spite of that, teachers believed that Y26 might be easy for students and
allocated the lowest mark (3.1) to it; they should have assigned it the highest mark. In addition,
teachers postulated that Y10 might be a tough item. However, Y10's difficulty logit measured by the
Rasch model showed that teachers' estimation differed, to a great extent, from the virtual item
difficulty level: the difficulty logit of Y10 (-1.7) suggested that Y10 was an easy item for students. In
Y high school, out of 28 items, 13 items' virtual difficulty levels did not match teachers' weighting.
4.5.3 Z High School
Z HIGH SCHOOL (Item Reliability .96)
ITEM    DIFFICULTY   SCORE        MODLSE   INFIT   INFIT   OUTFIT   OUTFIT   DIFFICULTY
ENTRY   MEASURE      (Response)            MNSQ    ZSTD    MNSQ     ZSTD     LEVEL
Z3      2.33         25           0.24     1.04    0.31    1.65     1.84     VH
Z12     1.35         47           0.19     1.11    1.10    1.56     2.70     H
Z20     1.31         48           0.19     1.00    0.03    0.88     -0.67    H
Z23     0.91         60           0.18     0.95    -0.54   0.91     -0.61    H
Z1      0.66         68           0.17     1.07    0.82    1.06     0.52     H
Z2      0.63         69           0.17     1.10    1.16    1.19     1.49     H
Z7      0.54         72           0.17     1.16    1.89    1.18     1.46     H
Z16     0.54         72           0.17     1.09    1.14    1.26     2.01     H
Z19     0.51         73           0.17     1.06    0.77    1.11     0.92     H
Z11     0.37         78           0.17     0.92    -1.08   0.90     -0.89    M
Z13     0.29         81           0.17     1.02    0.30    1.02     0.17     M
Z17     0.18         85           0.17     1.26    3.28    1.26     2.22     M
Z24     0.18         85           0.17     1.04    0.54    1.01     0.13     M
Z21     0.15         86           0.17     0.93    -0.96   0.96     -0.36    M
Z25     -0.04        93           0.16     1.25    3.31    1.20     1.73     M
Z28     -0.1         95           0.16     1.02    0.26    1.14     1.22     M
Z29     -0.18        98           0.16     1.02    0.26    1.01     0.16     M
Z26     -0.42        107          0.16     0.94    -0.88   0.86     -1.14    M
Z8      -0.5         110          0.16     1.05    0.72    1.15     1.17     M
Z10     -0.5         110          0.16     0.80    -3.21   0.72     -2.44    M
Z18     -0.6         114          0.16     1.12    1.75    1.10     0.79     E
Z27     -0.6         114          0.16     0.94    -0.97   0.86     -1.12    E
Z14     -0.69        117          0.16     0.73    -4.47   0.62     -3.18    E
Z15     -0.77        120          0.17     0.92    -1.20   0.86     -1.00    E
Z22     -0.77        120          0.17     1.12    1.79    1.66     3.92     E
Z9      -0.99        128          0.17     0.74    -4.23   0.60     -2.90    E
Z5      -1.05        130          0.17     0.78    -3.50   0.65     -2.36    E
Z6      -1.16        134          0.17     0.78    -3.36   0.63     -2.36    E
Z4      -1.6         148          0.18     0.94    -0.73   0.80     -0.90    E
*VH: very hard, H: hard, M: medium, E: easy, VE: very easy
Table 4.8 Item difficulty measurement of Z high school test items
Table 4.8 shows how difficult the items turned out to be. According to Table 4.8, the item
difficulty estimates range from +2.33 (Z3) to -1.6 (Z4) logits. Students in Z high school had more
difficulty in responding to Z3 correctly than to any other item, whereas they solved Z4 with ease. In
Z high school, the Rasch model rated 9 items as hard, 11 as medium, and 9 as easy. Compared to the
difficulty level distributions of the other two schools' items, Z school's item difficulty levels seem
to be fairly evenly distributed. In addition, as shown in Figure 4.3, most items (25 items) are located
between +0.91 (Z23) and -1.16 (Z6) logits. However, Figure 4.3 also shows that there is a large logit
gap between Z3 (+2.33) and Z12 (+1.35), suggesting that Z3 was too difficult for the students to find
the correct answer.
Figure 4.3 Z high school item map
Teachers' Weighting          The Rasch Analysis
MARK   DIFFICULTY    ITEM     DIFFICULTY   DIFFICULTY
       LEVEL         ENTRY    LEVEL        LOGIT
3.9    H             Z1       H            0.66
3.9    H             Z2       H            0.63
3.9    H             Z12      H            1.35
3.9    H             Z20      H            1.31
3.8    H             Z18      E            -0.6
3.8    H             Z23      H            0.91
3.7    M             Z24      M            0.18
3.7    M             Z25      M            -0.04
3.6    M             Z11      M            0.37
3.6    M             Z16      H            0.54
3.5    M             Z5       E            -1.05
3.5    M             Z14      E            -0.69
3.5    M             Z17      M            0.18
3.4    M             Z3       VH           2.33
3.4    M             Z7       H            0.54
3.4    M             Z8       M            -0.5
3.4    M             Z13      M            0.29
3.4    M             Z15      E            -0.77
3.4    M             Z21      M            0.15
3.3    M             Z10      M            -0.5
3.3    M             Z27      E            -0.6
3.2    E             Z9       E            -0.99
3.1    E             Z4       E            -1.6
3.1    E             Z6       E            -1.16
3.1    E             Z22      E            -0.77
3.1    E             Z26      M            -0.42
3.1    E             Z28      M            -0.1
3      E             Z19      H            0.51
3      E             Z29      M            -0.18
Table 4.9 Z high school's teachers' weighting and virtual difficulty
As identified in X and Y high school, there is also a discrepancy between teachers' 'a priori'
weighting and the virtual item difficulty levels in Z high school. That is, teachers in Z high school did
not allocate appropriate marks to the test items. For example, teachers believed that Z18 might be a
difficult item, but the difficulty logit of Z18 was -0.6, suggesting that Z18 was an easy item for Z
high school's students; teachers, therefore, should have assigned Z18 a mark lower than 3.8.
Teachers had also classified Z3 as a medium-level item and assigned it 3.4. However, the Rasch model
identified Z3 as the most difficult of all items, so teachers should have given Z3 the highest mark,
3.9. In addition, the teachers misjudged the difficulty level of Z19. They believed that Z19 was an
easy item, so they allocated the lowest mark to it. However, Z19's calibrated difficulty logit was
+0.51, suggesting that Z19 was a difficult item. Thus, a higher mark should have been assigned to
Z19 than the teachers had assigned. In Z high school's test, out of 29 items, 12 items' virtual
difficulty levels did not match teachers' weighting.
A comparison was made between teachers' weighting and the virtual difficulty levels in this
section. As a result, it was identified that in X, Y, and Z high schools' test items, there was a large gap
between teachers' weighting and the virtual difficulty levels. In the next section, another comparison will
be made in relation to language categories and item difficulty levels.
4.6 Language Category Classification
In the last subsection, the gap between teachers' weighting and virtual item difficulty was
identified. In this section, in relation to the second research question, a comparison will be made
between what teachers believed to be difficult and what students found to be difficult in terms of
the language category.
[Table 4.10 arranges all 85 test items by the 17 language categories of Table 4.2: teachers'
weighting (the items at each difficulty level, E/M/H) appears on the left, and the Rasch analysis
results appear on the right. The item-by-item grid did not survive transcription; as one legible
example, under category 17 ("Identifying author's mood") the teachers rated X18 easy (E), whereas
the Rasch analysis rated it hard (H).]
*H: hard, M: medium, E: easy
Table 4.10 Reclassification of language categories (teachers' weighting vs. the Rasch analysis)
All 85 test items which teachers in X, Y, and Z high school designed were arranged on
the basis of the language categories and difficulty levels, as shown in Table 4.10. In the middle of the
table are the 17 different language categories. On the left side of the table, items are arranged
according to the language categories and the difficulty levels which teachers decided; on the right
side, according to the language categories and the difficulty levels which the Rasch model measured.
Table 4.10 shows that there is also a gap between teachers' beliefs and students'
responses in terms of language categories. That is, in some cases, what teachers had believed to be a
difficult language category was easy for students, or vice versa. Teachers in X and Y high school had
estimated that the language category "finding suitable discourse markers" (9) might not be difficult,
so most items related to this category had been classified as easy items. Contrary to the teachers'
estimation, however, the students in X and Y high school had difficulty in solving the items pertinent
to category (9); that is, the category was more difficult than teachers had expected. Similarly, the
language categories "finding suitable words to the context of a given passage" (3) and "finding a title
of a given passage" (5) were found to be more difficult for students than teachers had predicted.
On the other hand, in the case of the language category "finding a theme of a given passage" (4),
teachers had assumed that this category might be tough for students, so most items related to it had
been rated as hard by teachers in X, Y, and Z school. However, that was not the case: once those
items were administered, the students in X, Y, and Z school found the correct answers more easily
than teachers had estimated.
Teachers in X, Y, and Z high school had believed that students might have difficulty in solving
the items related to the language categories "inferring a phrase (11), a sentence (12), and a word
(13)", so they had classified all the test items related to those categories as hard or medium. The
Rasch analysis showed that this expectation was only partially correct: teachers' estimation of the
hard items in those categories was right to a great extent, while some of the medium items turned
out to be more difficult (e.g. Z16) or easier (e.g. X25) than teachers had judged.
In the case of the language category "identifying the syntactically wrong (or correct) usage of
word categories" (10), there was a difference between schools. To be specific, X high school's
teachers had estimated the items related to this category to be hard, Z high school's teachers had
considered them hard and medium, and Y high school's teachers had rated them medium and easy.
After analysing the item responses using the Rasch model, it was discovered that there was a gap
between teachers' estimation and students' responses to the items relevant to category (10) in X and
Y high school, whereas Z school teachers' weighting of those items was aligned with the measured
difficulty levels.
Aside from the examples given above, a number of examples in Table 4.10 show that there is a
discrepancy between what teachers believed to be difficult and what students found to be difficult
in terms of language categories.
4.7 Summary
Through the Rasch analysis, the difficulty logits of the three schools' items were calibrated.
Based on the calibrated logits, I tagged the items with Baker's verbal terms (very easy, easy, medium,
hard, and very hard), and then a comparison was made between teachers' weighting and the virtual
item difficulty levels. As a result, it was discovered that there was a large gap between teachers'
'a priori' weighting and the virtual difficulty levels. That is, what teachers had believed to be difficult
was easy for students, or vice versa. For example, in some cases, teachers had believed that certain
items (e.g. X11, Y10) might be difficult for students, so they had assigned high marks to those
items. However, the Rasch model identified that it was not difficult for students to find the correct
answers to those items. In other cases, teachers had estimated that items (e.g. Y26, Z19) might be
easy for students, so they had allocated low marks to them. Yet the students actually had difficulty
in responding to those items correctly. Thus, considering the gap between teachers' weighting and
the virtual difficulty of the items, teachers should have assigned higher or lower marks to those
items than they actually did.
In relation to the language categories, it was found that there was also a gap between
teachers' beliefs and students' responses. In other words, what teachers had believed to be a
difficult language category was easy for students, or vice versa. For example, even though teachers
had considered the language category "finding suitable discourse markers" (9) to be easy, students
did not find it so; rather, Winsteps provided evidence that the students had difficulty in dealing with
category (9). Conversely, in the case of the language category "finding a theme of a given passage"
(4), teachers had assumed that the category might be difficult, but students found it easier than
teachers had expected.
To sum up, in relation to the first research question, the Rasch model confirmed that there
was a large gap between teachers' item weighting and the virtual item difficulty levels. Similarly, in
relation to the second research question, it was discovered that a gap also existed between what
teachers believed to be difficult and what students found to be difficult in terms of language
categories. In the next chapter, these findings will be discussed.
Chapter 5. Discussion
5.1 Introduction to Chapter 5
In this chapter, the findings of the current research will be discussed and ways to enhance the
fairness of item weighting will be suggested. Thereafter, recommendations for further research will
be made and a conclusion will be drawn.
5.2 Discussion and Implications
It is difficult to predict test item difficulty levels before a test is administered to test-takers.
In spite of that, since the government policy (see 3.2) disadvantages students who obtain the same
scores, teachers have implemented a differential item weighting method in the belief that it can
effectively decrease the number of students who gain the same scores. While allocating differing
marks to the test items, teachers usually rely on their feel for the item difficulty levels (Rudner,
2001). That is, the more difficult teachers believe a test item to be, the higher the mark they assign
to it.
In relation to the differential weighting method, Blood (1951, cited in Sabers & White, 1969)
claims that expert judgement in determining the weights may make it possible to raise test
reliability without changing validity. Unlike Blood, however, Feldt (2004) argues that test
designers' "somewhat arbitrary" judgement (p. 186) in item weighting can be "detrimental to total
score reliability" (p. 188). In addition, Feldt points out that when a differential weighting method is
used, it frequently happens that test designers assign the highest weight to less important test items.
Because of such drawbacks, many experts (e.g. Gulliksen, 1950) discourage differential weighting.
Even though caution must be exercised before generalising the results to other contexts, since
the sample was limited, the present research confirms that there was a large gap between teachers'
weighting and the virtual item difficulty. As Feldt (2004) describes, teachers in this research assigned
lower marks to some difficult items while giving higher marks to relatively easy items within the
same test. In relation to language categories, there was likewise a gap between what teachers
believed to be difficult and what students found to be difficult: what teachers believed to be a
difficult language category was shown to be easy for students, or vice versa.
The value of this study lies in demonstrating that the differential weighting method which
most Korean teachers implement can fail, to a certain extent, to reflect the importance and the
difficulty of the items. By using a differential weighting method, teachers may succeed in lining up
students' scores from the highest to the lowest (Shohamy, 1998) as the education office policy
demands, but this research identified that teachers often failed to distinguish the difficult items from
the easy ones. In this sense, I believe that solutions need to be put forward in order to minimise the
mistake of allocating higher (lower) marks to the less (more) important items.
As a way of securing consistency in allocating marks to the items, a post weighting
system needs to be established in my context. Few, if any, test designers can precisely measure the
difficulty levels of test items before test-takers take the test. Despite this, I wonder why schools
and the education office compel teachers to do what they are not able to do: teachers have been
required to design a test and assign differential weights to the test items simultaneously. However,
as identified in this research, teachers' weighting prior to the test did not satisfy the designers' own
premise that more difficult items should be assigned higher marks. From this perspective, the
convention in which teachers predict item difficulty levels and reflect them in the item marks needs
to be modified. To be specific, it would be better to separate item weighting from test design by
letting teachers decide the difficulty levels on the basis of how students actually respond to the
items and then assign corresponding marks.
Test item design → Test administration → Mark allocation → Final test results
Figure 5.1 Procedure of test design and post weighting method
Many may assume that a post weighting method would burden teachers physically and
psychologically, because they would have to compute statistical grounds for their item weighting
decisions and then assign the corresponding marks to the test items. However, in my context, all the
processes of marking students' answers are handled by a computerised system called NEIS
(National Education Information System). Through that system, teachers can easily verify how many
students responded to a certain item correctly; that is, it is not that difficult for teachers to ground
their decisions about item difficulty levels. Instead, for precise measurement of the difficulty index,
a statistical programme needs to be integrated with the NEIS. What is more, if the programme
assigns the difficulty levels of the items based on a difficulty index (e.g. Baker's index), teachers may
be able to weight the test items more consistently.
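As a minimal sketch of the post weighting idea, the snippet below assumes a hypothetical export
of per-item response counts (the dissertation names NEIS, but this is not NEIS code): the facility
value p = correct / examinees is computed after the test, and a difficulty level is then assigned from
observed data. The cut-offs are illustrative assumptions, not Baker's figures.

```python
# A minimal sketch: derive difficulty levels from observed proportion correct.
def post_level(p):
    # Illustrative cut-offs on the proportion-correct scale (an assumption).
    if p < 0.4:
        return "hard"
    elif p < 0.7:
        return "medium"
    return "easy"

results = {"item01": (63, 197), "item02": (148, 197)}   # invented counts
for item, (correct, examinees) in results.items():
    p = correct / examinees
    print(item, round(p, 2), post_level(p))   # item01 0.32 hard, item02 0.75 easy
```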
Another feasible solution would be for teachers to assign the same mark to all test items. As I
explained in chapter 3 (see 3.2), the government is "using tests as tools for setting educational
agendas" (Shohamy & McNamara, 2009, p. 1). To be specific, since the government policy
disadvantages students who gain the same scores, "teachers are reduced to following orders"
(Shohamy, 1998, p. 340) and allocate various marks to the test items in order to reduce the number of
students who gain the same scores. Thus, unless the education office abolishes the policy that
disadvantages those with equal scores, teachers may not accept the solution of abolishing
differential weighting; the policy which makes teachers inevitably implement a differential
weighting method should be repealed first. On that condition, I believe that in my context an equal
weighting method may be applicable for the purpose of minimising teachers' subjectivity and
enhancing the fairness of the test results to a large extent.
It could be argued that equal weighting is the equivalent of not weighting at all. However, this is
not the case. As explained in chapter 2, an equal weighting method naturally and internally gives a
different effective weight to each item (Wang & Stanley, 1970). In addition, compared to an
unequal weighting method, which requires much time to decide the appropriate weight of each item
(especially if more than two teachers are involved) (Stalnaker, 1938), an equal weighting method is
practical and bias-free. To sum up, an equal weighting method helps teachers to assign marks to the
items consistently and saves the time spent judging which items are more difficult or easier.
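To sketch Wang and Stanley's (1970) point that nominally equal weights still produce unequal
effective weights, the snippet below uses invented 0/1 item scores: an item's contribution to
total-score variance depends largely on its own spread (and its correlations with the other items),
so a zero-variance item carries no effective weight even under equal nominal weighting.

```python
# A minimal sketch of effective weighting under equal nominal weights.
import statistics

items = {
    "very_easy_item": [1, 1, 1, 1, 1],   # no variance -> no effective weight
    "medium_item":    [1, 1, 0, 1, 0],
    "hard_item":      [0, 1, 0, 0, 0],
}

for name, scores in items.items():
    sd = statistics.pstdev(scores)       # spread of the item scores
    print(name, "SD =", round(sd, 2))    # larger SD -> larger effective weight
```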
In a high stakes test, I believe that the test results should be scored as precisely as possible,
because they can serve as a critical indicator in deciding test-takers' eligibility in areas such as
employment and school admission. The initial step of precise scoring, I believe, is correct item
weighting: by allocating appropriate marks to each item, a test secures fairness and consistency.
5.3 Recommendations for Future Research
This quantitative research aimed to calculate item difficulty levels on the basis of test-takers'
item responses. In order to do so, I analysed the students' responses to the items using the Rasch
model and fulfilled the purpose of the research to a large extent. However, this quantitative
research does not show why teachers made the decisions they did about item difficulty levels and
mark allocation; it cannot uncover the deeper reasons behind test designers' weighting decisions.
Thus, it would be worth conducting qualitative research for the purpose of discovering teachers'
beliefs about item weighting.
In particular, in my context, more than two teachers are usually involved in designing test
items: each teacher designs a certain number of items, and a test is composed of all the items the
teachers create. While those teachers are designing the items and allocating marks to them, many
variables may influence an item weighting decision, such as teachers' professional identity, their
teaching experience, the atmosphere of the school where they teach, the national examination, and
so on. Qualitative research enquiring into the relationship between those variables and item
weighting would be worth conducting, because it could give much deeper insight into how teachers
in various contexts decide item difficulty levels.
5.4 Conclusion
One of our concerns is whether a language test can produce scores that accurately reflect an
examinee's ability in a certain area such as reading, writing, speaking, or listening (Weir, 2005;
Tierney, 2006). In a similar vein, McNamara (2000) contends that a language test is a procedure for
gathering evidence with which we can predict a candidate's use of language in real-world contexts.
Fundamentally, however, in order for a test to measure examinees' abilities precisely, teachers (test
designers) need to pay close attention to item weighting; that is, they need to weight the items
precisely, because the test scores differ depending on how teachers weight the items and allocate
marks, and consequently wrong weighting may prevent testers from predicting examinees' actual
abilities.
However, it is extremely difficult for teachers to weight the items precisely prior to the test
administration, because a number of variables influence the weighting of the items, such as
examinees' prior knowledge, test methods, length of test content, language categories, and so on.
While weighting the items, teachers may consider such variables, yet the final decisions about item
weighting are usually made on the basis of the teachers' subjective judgements and criteria. Thus, as
shown in this research, the item analysis data indicated that what teachers believed to be difficult
was not always identical to what students found to be difficult; in some cases, what teachers
believed to be difficult was easy (or medium) for students. As long as subjectivity is involved in item
weighting, if students raise objections to teachers' item weighting and mark allocation, it must be
difficult for most teachers to elucidate the reasons logically, causing "ethical challenges" (Davies,
2004, p. 97).
In relation to the misjudgement of weighting, teachers in my context seem to have had an
indifferent attitude. That is, once they decided the weights and assigned the marks to the items,
their concern seems to have shifted toward the results which students obtain from the test. Of
course, I do not deny the importance of the test results: no matter how incorrectly the items are
weighted, the results may serve as a tool which teachers and students use for their decision-making
to a certain extent. However, there is no doubt that the more accurately teachers weight the items,
the more reliable the test results will be. In this sense, I claim that when teachers weight items, they
need to take a more critical stance and make it a rule to find less subjective grounds for their
decisions by analysing the item response data gained from previous tests.
References
Allal, L., 2013. Teachers' professional judgement in assessment: a cognitive act and a socially
situated practice. Assessment in Education: Principles, Policy & Practice, 20(1), p. 20-34.
Bachman, L. F., 1989. Assessment and Evaluation. Applied Linguistics, Volume 10, pp. 210-226.
Bachman, L. F., 1990. Fundamental Considerations in language testing. New York: Oxford University
Press.
Bachman, L. F., 1991. What Does Language Testing Have to Offer?. TESOL Quarterly, 25(4), pp. 671-
704.
Baker, F. B., 1985. The basics of item response theory. Portsmouth, N.H.: Heinemann.
Black, T. R., 1999. Doing quantitative research in the social sciences : an integrated approach to
research design, measurement and statistics. London : SAGE.
Bond, T. G. & Fox, C. M., 2001. Applying the Rasch Model: Fundamental Measurement in the Human
Sciences. Mahwah, N.J.: Lawrence Erlbaum Associates.
Brown, G., 1997. Assessing student learning in higher education. London: Routledge.
Brown, H., 1994. Teaching by principles : an interactive approach to language pedagogy. 2 ed. Upper
Saddle River, N.J. : Prentice-Hall.
Bryman, A., 2012. Social research methods. 4 ed. Oxford: Oxford University Press.
Cliff, N., 1989. Ordinal consistency and ordinal true scores. Psychometrika, 54(1), pp. 75-91.
Crocker, L. & Algina, J., 1986. Introduction to classical and modern test theory. New York ; London :
Holt, Rinehart and Winston.
Davies, A., 1978. Language Testing. Language Teaching, 11(3), pp. 145-159 .
Denscombe, M., 2010. The good research guide for small-scale social research projects. 4 ed.
Maidenhead: McGraw-Hill/Open University Press.
DeVellis, R. F., 2006. Classical test theory. medical Care, 44(11), pp. S50-S59.
Diekhoff, G. M., 1983. Testing through relationship judgments. Journal of Educational Psychology,
75(2), pp. 227-233.
Domino, G. & Domino, M. L., 2006. Psychological Testing: An Introduction. 2 ed. New York ;
Cambridge : Cambridge University Press.
Embretson, S. E., 1999. Issues in the measurement of cognitive abilities. In: S. E. Embretson & S. L.
Hershberger, eds. The new rules of measurement: what every psychologist and educator should
know. Mahwah, N.J. ; London : L. Erlbaum Associates , pp. 1-15.
Feldt, L. S., 2004. Estimating the reliability of a test battery composite or a test score based on
weighted item scoring. Measurement And Evaluation In Counseling And Development, 37(3), pp.
184-190.
Fulcher, G. & Davidson, F., 2009. Test architecture, test retrofit. Language Testing, 26(1), pp. 123-
144.
Furr, R. M. & Bacharach, V. R., 2008. Psychometrics; an introduction. California: Sage Publications.
Green, R., 2013. Statistical analysis for language testers. New York: Palgrave Macmillan.
Guilford, J. P., 1954. Psychometric methods. 2 ed. New York: McGraw-Hill.
Gulliksen, H., 1950. Theory of mental tests. New York: Wiley.
Henning, G., 1987. A guide to language testing: development, evaluation, research. Cambridge:
Cambridge University Press.
Hughes, A., 2003. Testing for language teachers. 2 ed. Cambridge : Cambridge University Press.
Hu, G. & Mckay, S. L., 2012. English Language Education in East Asia: Some Recent Developments.
Journal of Multilingual and Multicultural Development , 33(4), pp. 345-362.
Kaplan, R. M. & Saccuzzo, D. P., 2005. Psychological testing : principles, applications and issues. 6 ed.
Belmont, Calif. : Thomson Wadsworth .
Kim, J. et al., 2010. An analysis of determinants and the validity of item weighting. The journal of
Curriculum and Evaluation, 13(2), pp. 197-218.
Koopman, R. F., 1988. On the sensitivity of a composite to its weights. Psychometrika, 53(4), pp. 547-
552 .
Krippendorff, K., 1980. Content analysis : an introduction to its methodology. Beverly Hills, Calif. :
Sage .
Kumar, R., 2005. Research methodology; A step-by-step guide for beginners. 2 ed. London: SAGE.
Lado, R., 1961. Language testing. London: Longmans.
Lawson, D. M., 2006. Applying the item response theory to classroom examinations. Journal of
Manipulative and Physiological Therapeutics, 29(5), pp. 393-397.
Leung, C. & Lewkowicz, J., 2006. Expanding horizons and unresolved conundrums: Language testing
and assessment. TESOL Quarterly, 40(1), pp. 211-234.
Lewin, C., 2011. Understanding and Describing Quantitative Data. In: B. Somekh & C. Lewin, eds.
Theory and Methods in Social Research. London : SAGE, pp. 220-230.
Linacre, J. M., 2002. Rasch.org. [Online] Available at: http://www.rasch.org/rmt/rmt162f.htm
[Accessed 19 May 2014].
Lloyd-Jones, R., 1992. An overview of assessment. In: Assessment: from principles to action.
London: Routledge, pp. 1-12.
McNamara, T., 2000. Language Testing. Oxford: Oxford University Press.
McNamara, T. & Ryan, K., 2011. Fairness Versus Justice in Language Testing: The Place of English
Literacy in the Australian Citizenship Test. Language Assessment Quarterly, 8(2), pp. 161-178.
Mellenbergh, G. J., 1996. Measurement Precision in Test Score and Item Response Models.
Psychological Methods, 1(3), pp. 293-299.
Murphy, K. R. & Davidshofer, C. O., c1991. Psychological testing : principles & applications. 2 ed.
Englewood Cliffs, N.J. : Prentice Hall.
Neuman, W. L., 2005. Social research methods : qualitative and quantitative approaches. 6 ed.
Boston, Mass. ; London : Pearson.
Oppenheim, A. N., 1992. Questionnaire design, interviewing and attitude measurement. London ;
New York: CONTINUUM.
Pae, H., 2012. Convergence and discriminant: assessing multiple traits using multiple methods.
Educational Research and Evaluation, 18(6), pp. 571-596 .
Pae, H. K., 2012. A psychometric measurement model for adult English language learners: Pearson
Test of English Academic. Educational Research and Evaluation, 18(3), p. 211-229.
Rudner, L. M., 2001. Informed Test Component Weighting. Educational Measurement: Issues and
Practice, 20(1), pp. 16-19.
Sabers, D. L. & White, G. W., 1969. The Effect of Differential Weighting of Individual Item Responses
on the Predictive Validity and Reliability of an Aptitude Test. Journal of Educational Measurement,
6(2), pp. 93-96.
Scott, D. & Morrison, M., 2006. Key ideas in educational research. London ; New York : Continuum.
Sharkness, J. & DeAngelo, L., 2011. Measuring Student Involvement: A Comparison of Classical Test
Theory and Item Response Theory in the Construction of Scales from Student Surveys. Research in
Higher Education, 52(5), pp. 480-507 .
Tao, J., Shi, N.-Z. & Chang, H.-H., 2012. Item-Weighted Likelihood Method for Ability Estimation in Tests
Composed of Both Dichotomous and Polytomous Items. Journal of Educational and Behavioral
Statistics, 37(2), pp. 298-315.
Shohamy, E., 1998. Critical Language Testing and Beyond. Studies in Educational Evaluation, 24(4),
pp. 331-345.
Shohamy, E., 2000. The relationship between language testing and second language acquisition,
revisited. System , Volume 28, pp. 541-553.
Shohamy, E. & McNamara, T., 2009. Language tests for citizenship, immigration, and asylum [Special
issue]. Language Assessment Quarterly, 6(1), pp. 1-5.
Skehan, P., 1989. Language testing part II. Language Teaching, 22(1), pp. 1-13.
Stalnaker, J. M., 1938. Weighting questions in the essay-type examination. Journal of Educational
Psychology, 29(7), pp. 481-490 .
Statman, S., 1998. Tester and testee: two sides of different coins. System , Volume 26, pp. 195-204.
Taylor, L., 2005. Washback and impact. ELT Journal , 59(2), pp. 154-155.
Taylor, L., 2013. Communicating the theory, practice and principles of language testing to test
stakeholders: Some reflections. Language Testing, 30(3), pp. 403-412 .
Taylor, L. B., 2004. Current Issues in English Language Testing Research. TESOL Quarterly, 38(1), pp.
141-146.
Tierney, R. D., 2006. Changing practices: influences on classroom assessment. Assessment in
Education: Principles, Policy & Practice, 13(3), pp. 239-264.
Wang, M. W. & Stanley, J. C., 1970. Differential Weighting: A Review of Methods and Empirical
Studies. Review of Educational Research, Volume 40, pp. 663-705.
Weir, C. J., 2005. Language Testing and Validation: An evidence-based approach. New York:
Palgrave Macmillan.
West, P. V., 1924. The Significance of Weighted Scores. Journal of Educational Psychology, 15(5),
pp. 302-308.
Wiliam, D., 2011. What is assessment for learning? Studies in Educational Evaluation, Volume 37,
pp. 3-14.
Wilson, M., 2013. Using the concept of a measurement system to characterize measurement models
used in psychometrics. Measurement, 46(9), pp. 3766-3774.
Woodford, P. E., 1980. Foreign Language Testing. Modern Language Journal, 64(1), pp. 97-102.
Xu, T. & Stone, C. A., 2012. Using IRT Trait Estimates Versus Summated Scores in Predicting
Outcomes. Educational and Psychological Measurement, 72(3), pp. 453-468.
Zhan, Y. & Andrews, S., 2014. Washback effects from a high-stakes examination on out-of-class
English learning: insights from possible self theories. Assessment in Education: Principles, Policy &
Practice, 21(1), pp. 71-89.
Appendix
Appendix 1
Randomly selected sample from X high school
1 2 3 4 8 10 11 13 17 18 20 21 24
25 27 28 29 30 32 33 35 38 39
40 41 42 43 45 46 47 48 49 50
52 53 55 56 57 58 59 61 63 64
65 66 67 69 71 72 73 74 75 78
80 82 84 85 86 87 88 89 90 91
92 93 94 95 96 97 99 101 102 103
105 106 108 109 111 112 113 114 115 116
117 118 119 122 123 124 126 127 128 129
130 132 133 134 135 136 137 140 141 142
145 147 148 149 151 152 153 154 155 156
157 158 159 160 161 162 163 164 165 166
167 168 169 170 171 172 173 174 175 176
177 178 179 180 181 182 183 184 185 186
187 188 189 190 191 192 193 194 195 196
197 198 199 200 201 202 203 204 205 206
207 208 210 213 214 215 216 217 219 221
222 226 227 228 229 230 231 232 233 234
236 237 239 240 242 247 248 249 251 254
255 256 257 258 259 260 262
Appendix 2
Randomly selected sample from Y high school
5 7 11 13 15 19 20 23 24 27
28 34 36 40 42 43 47 50 51 54
55 59 60 61 65 67 68 69 71 74
75 77 79 81 82 83 85 86 87 88
90 91 92 94 95 96 99 100 102 104
105 106 108 109 110 112 113 114 115 118
119 121 122 123 125 126 127 129 131 132
133 135 136 137 139 141 143 144 145 146
148 149 150 153 156 159 160 162 163 164
168 170 171 172 173 175 177 179 180 181
184 185 187 189 190 191 193 194 195 197
198 199 201 206 207 208 210 212 215 216
217 218 220 221 222 224 228 229 230 231
233 237 238 239 242 245 248 251 252 255
256 259 260 261 262 264 265 266 268 269
272 273 275 278 279 280 282 284 286 287
288 289 291 292 293 295 296 297 299 300
303 305 306 307 309 310 311 314 315 317
318 319 320 323 327 328 331 334 336 340
344 346 347 350 351 354 355 358 359 361
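Both lists contain 200 entry numbers drawn without replacement (X school entries run to 262, Y school entries to 361). For readers replicating the sampling step, a minimal sketch follows; the tool and seed actually used in the study are not reported, so both are assumptions here:

    import random

    # Simple random sampling of student entry numbers without replacement,
    # as in Appendices 1 and 2; the seed is an arbitrary assumption.
    random.seed(1)
    x_sample = sorted(random.sample(range(1, 263), 200))   # X high school
    y_sample = sorted(random.sample(range(1, 362), 200))   # Y high school
    print(x_sample[:10])
    print(y_sample[:10])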
Appendix 3 X, Y, and Z schools' two-dimensional tables of test specifications
X school's two-dimensional table of test specifications
[Each item also carries objective-taxonomy (K/C/Ap/An/S/E) and difficulty (H/M/E) markings, which are not reproduced here.]
Item Num.  Language category  Mark
1  Finding an awkward word to the context of the given passage  4
2  Finding an awkward word to the context of the given passage  3.5
3  Finding a suitable word  3.4
4  Distinguishing a pronoun which indicates a different reference  2.5
5  Identifying the syntactically wrong (or correct) usage of word categories (e.g. noun, verb, relative pronoun, preposition, adjective, etc.)  4.5
6  Identifying the syntactically wrong (or correct) usage of word categories (e.g. noun, verb, relative pronoun, preposition, adjective, etc.)  4.4
7  Finding a topic of a given passage  3.5
8  Finding a topic of a given passage  2.8
9  Finding a theme of a given passage  3.7
10  Finding a theme of a given passage  3.3
11  Summarising a given passage  4.6
12  Finding a non-cohesive sentence  3.3
13  Locating a given sentence in a passage  3.7
14  Placing sentences in a logical order  3.2
15  Finding suitable discourse markers  2.9
16  Finding suitable discourse markers  3.1
17  Finding a wrong explanation about a given passage  2.6
18  Identifying author's mood  2.7
19  Finding a title of a given passage  4.3
20  Finding a title of a given passage  4
21  Finding a title of a given passage  4
22  Inferring a word based on a given passage  4.2
23  Inferring a phrase based on a given passage  4.1
24  Inferring a word based on a given passage  3
25  Inferring a phrase based on a given passage  3.8
26  Inferring a phrase based on a given passage  3.9
27  Finding a title of a given long passage  3.4
28  Inferring words based on a given long passage  3.6
*K: Knowledge, C: comprehension, Ap: application, An: Analysis, S: synthesis, E: Evaluation
Y school's two-dimensional table of test specifications
[Each item also carries objective-taxonomy (V/G/C/I/Ap) and difficulty (H/M/E) markings, which are not reproduced here.]
Item Num.  Language category  Mark
1  Finding a non-cohesive sentence  3.6
2  Inferring words based on a given passage  3.6
3  Finding suitable discourse markers  3.6
4  Inferring a phrase based on a given passage  3.6
5  Locating a given sentence in a passage  3.6
6  Inferring a phrase based on a given passage  3.8
7  Inferring a phrase based on a given passage  3.8
8  Identifying the syntactically wrong (or correct) usage of word categories (e.g. noun, verb, relative pronoun, preposition, adjective, etc.)  3.6
9  Placing sentences in a logical order  3.6
10  Summarising a given passage  3.8
11  Finding an awkward word to the context of the given passage  3.6
12  Finding a theme of a given passage  3.6
13  Inferring a phrase based on a given passage  3.8
14  Inferring words based on a given passage  3.6
15  Locating a given sentence in a passage  3.2
16  Identifying the syntactically wrong (or correct) usage of word categories (e.g. noun, verb, relative pronoun, preposition, adjective, etc.)  3.6
17  Inferring a phrase based on a given passage  3.8
18  Finding a non-cohesive sentence  3.8
19  Finding suitable discourse markers  3.1
20  Placing sentences in a logical order  3.6
21  Finding suitable discourse markers  3.3
22  Finding suitable words to a given passage  3.6
23  Finding a title of a given passage  3.7
24  Summarising a given passage  3.6
25  Finding a wrong explanation about a given passage  3.2
26  Identifying the syntactically wrong (or correct) usage of word categories (e.g. noun, verb, relative pronoun, preposition, adjective, etc.)  3.1
27  Finding a title of a given passage  3.3
28  Finding an awkward word to the context of the given passage  3.9
*V: vocabulary, G: grammar, C: comprehension, I: inference, Ap: application
Z school's two-dimensional table of test specifications
[Each item also carries objective-taxonomy (K/C/Ap/An/S/E) and difficulty (H/M/E) markings, which are not reproduced here.]
Item Num.  Language category  Mark
1  Identifying the syntactically wrong (or correct) usage of word categories (e.g. noun, verb, relative pronoun, preposition, adjective, etc.)  3.9
2  Identifying the syntactically wrong (or correct) usage of word categories (e.g. noun, verb, relative pronoun, preposition, adjective, etc.)  3.9
3  Summarising a given passage  3.4
4  Finding a topic of a given passage  3.1
5  Finding a topic of a given passage  3.5
6  Finding a topic of a given passage  3.1
7  Finding a title of a given passage  3.4
8  Finding a title of a given passage  3.4
9  Finding suitable words to the context of a given passage  3.2
10  Finding suitable words to the context of a given passage  3.3
11  Inferring a phrase based on a given passage  3.6
12  Inferring a phrase based on a given passage  3.9
13  Inferring a sentence based on a given passage  3.4
14  Inferring a phrase based on a given passage  3.5
15  Inferring a phrase based on a given passage  3.4
16  Inferring a sentence based on a given passage  3.6
17  Finding a theme of a given passage  3.5
18  Locating a given sentence in a passage  3.8
19  Locating a given sentence in a passage  3
20  Inferring words based on a given passage  3.9
21  Inferring words based on a given passage  3.4
22  Finding suitable discourse markers  3.1
23  Identifying the syntactically wrong (or correct) usage of word categories (e.g. noun, verb, relative pronoun, preposition, adjective, etc.)  3.8
24  Identifying the syntactically wrong (or correct) usage of word categories (e.g. noun, verb, relative pronoun, preposition, adjective, etc.)  3.7
25  Identifying the syntactically wrong (or correct) usage of word categories (e.g. noun, verb, relative pronoun, preposition, adjective, etc.)  3.7
26  Placing sentences in a logical order  3.1
27  Placing sentences in a logical order  3.3
28  Placing sentences in a logical order  3.1
29  Finding a wrong explanation about a given passage  3
*K: Knowledge, C: comprehension, Ap: application, An: Analysis, S: synthesis, E: Evaluation
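The Mark column in each of these specifications is the teacher-assigned item weight, and within each school the marks sum to 100. A minimal sketch of how such differential weights turn a dichotomous response pattern into a total score, contrasted with equal weighting (the function names are illustrative, not from the study):

    # Z school's marks, transcribed from the table above; they sum to 100.
    marks = [3.9, 3.9, 3.4, 3.1, 3.5, 3.1, 3.4, 3.4, 3.2, 3.3,
             3.6, 3.9, 3.4, 3.5, 3.4, 3.6, 3.5, 3.8, 3.0, 3.9,
             3.4, 3.1, 3.8, 3.7, 3.7, 3.1, 3.3, 3.1, 3.0]
    assert round(sum(marks), 1) == 100.0

    def weighted_score(responses, weights):
        """Total when each correctly answered item earns its own mark."""
        return sum(w for r, w in zip(responses, weights) if r == 1)

    def equal_weight_score(responses, total=100.0):
        """Total when every item is worth an equal share of 100."""
        return total * sum(responses) / len(responses)

    # A hypothetical student answering only the first ten items correctly:
    resp = [1] * 10 + [0] * 19
    print(round(weighted_score(resp, marks), 1))   # 34.2 under teacher weighting
    print(round(equal_weight_score(resp), 1))      # 34.5 under equal weighting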
Appendix 4 X, Y, and Z high schools' person data analysis
X HIGH SCHOOL
PERSON-ENTRY ABILITY-MEASURE SCORE INFIT-MNSQ INFIT-ZSTD OUTFIT-MNSQ OUTFIT-ZSTD (two persons per printed row)
1 0.91 19 0.86 -0.65 0.69 -0.87    140 1.6 22 0.73 -0.94 0.51 -0.99
2 2.6 25 0.61 -0.78 0.28 -0.82    141 0.91 19 0.98 -0.03 0.87 -0.25
3 -0.95 9 1.58 2.63 1.73 1.61    142 -1.38 7 1.07 0.38 1.03 0.23
4 0.71 18 0.84 -0.85 0.75 -0.75    145 2.2 24 1.08 0.34 1.18 0.48
8 0.34 16 1.06 0.39 1.07 0.32    147 1.6 22 1.02 0.15 0.86 -0.11
10 1.35 21 0.79 -0.82 0.66 -0.69    148 2.2 24 0.89 -0.16 0.83 -0.01
11 -0.57 11 1.06 0.39 0.94 -0.07    149 -0.03 14 1.02 0.16 0.95 -0.10
13 -0.39 12 0.92 -0.42 0.93 -0.13 151 -0.76 10 0.97 -0.11 0.86 -0.27 17 -2.19 4 1.22 0.67 1.69 0.98 152 2.6 25 0.92 -0.02 0.75 0.01 18 -1.38 7 1.15 0.68 1.10 0.36 153 -0.95 9 1.20 1.06 3.39 3.83 20 -0.95 9 1.26 1.32 1.20 0.59 154 1.88 23 0.73 -0.77 0.47 -0.87 21 -1.38 7 1.29 1.20 2.03 1.72 155 -0.95 9 0.80 -1.06 0.72 -0.64
24 -0.95 9 0.86 -0.70 0.77 -0.48 156 1.35 21 0.79 -0.83 0.78 -0.37 25 0.71 18 1.08 0.49 1.12 0.46 157 1.88 23 0.91 -0.18 0.64 -0.47 27 0.15 15 0.97 -0.15 0.91 -0.24 158 0.34 16 0.98 -0.08 0.92 -0.21 28 -0.57 11 1.19 1.11 1.19 0.66 159 3.13 26 1.24 0.57 0.66 0.07 29 -1.88 5 1.26 0.88 1.35 0.69 160 -0.57 11 1.06 0.39 0.95 -0.03 30 -0.76 10 0.89 -0.58 0.80 -0.47 161 2.6 25 0.77 -0.36 0.44 -0.48 32 -2.19 4 1.28 0.83 1.78 1.05 162 3.98 27 1.16 0.46 0.50 0.03
33 -0.03 14 0.96 -0.17 0.90 -0.28 163 -2.57 3 1.03 0.21 3.57 1.97 35 -0.39 12 1.24 1.42 2.44 3.49 164 1.12 20 0.92 -0.30 0.80 -0.41 38 -1.88 5 1.36 1.13 1.40 0.75 165 1.35 21 1.09 0.41 1.33 0.81 39 -1.16 8 1.02 0.17 1.24 0.63 166 1.88 23 0.76 -0.69 0.98 0.17 40 -0.57 11 0.97 -0.13 0.90 -0.20 167 1.88 23 1.13 0.50 1.29 0.64
41 -2.19 4 1.28 0.83 1.33 0.64 168 1.6 22 0.97 -0.02 1.04 0.26 42 -0.95 9 1.12 0.67 1.24 0.67 169 -1.62 6 0.99 0.07 1.33 0.70 43 1.12 20 0.98 -0.03 0.89 -0.16 170 1.12 20 0.84 -0.67 0.69 -0.74 45 2.6 25 0.71 -0.51 0.35 -0.66 171 1.35 21 0.88 -0.40 0.93 0.00 46 -0.57 11 1.22 1.28 1.24 0.79 172 2.2 24 0.82 -0.35 0.61 -0.36 47 -0.95 9 1.27 1.38 3.67 4.13 173 1.6 22 1.06 0.29 1.02 0.21 48 -0.76 10 1.04 0.26 2.82 3.51 174 -1.16 8 1.18 0.88 1.07 0.29
49 -0.95 9 1.03 0.21 0.89 -0.14 175 0.91 19 1.07 0.39 1.10 0.38 50 -1.88 5 1.26 0.88 1.35 0.69 176 -1.16 8 1.19 0.91 1.67 1.37 52 -0.57 11 0.91 -0.51 0.87 -0.28 177 1.88 23 1.17 0.59 1.67 1.12 53 -0.2 13 1.31 1.78 1.25 0.90 178 -0.2 13 0.58 -3.04 0.52 -1.97 55 -2.57 3 1.28 0.71 3.44 1.91 179 0.71 18 0.99 0.03 0.83 -0.44
56 -0.57 11 1.22 1.29 1.12 0.45 180 1.12 20 0.90 -0.38 0.74 -0.57 57 -0.57 11 1.15 0.91 1.17 0.60 181 -0.03 14 1.02 0.18 0.92 -0.20 58 2.6 25 0.91 -0.04 0.71 -0.05 182 -1.16 8 0.83 -0.79 0.70 -0.57 59 -2.19 4 1.16 0.54 0.84 0.06 183 1.12 20 0.78 -0.99 0.80 -0.42 61 -0.57 11 0.88 -0.68 0.77 -0.64 184 1.35 21 0.91 -0.28 0.91 -0.06 63 -0.57 11 0.73 -1.73 0.62 -1.20 185 -0.95 9 0.96 -0.16 0.92 -0.07 64 -1.16 8 0.99 0.02 1.41 0.95 186 1.88 23 0.98 0.04 1.00 0.20
65 -0.76 10 0.99 0.03 1.01 0.16 187 -1.16 8 0.86 -0.63 0.69 -0.60 66 -0.2 13 1.16 0.97 1.24 0.88 188 1.88 23 0.79 -0.57 0.50 -0.80 67 -1.38 7 1.39 1.55 2.34 2.07 189 3.98 27 0.55 -0.33 0.10 -0.68 69 -1.16 8 1.04 0.27 1.00 0.14 190 1.12 20 0.89 -0.42 0.76 -0.51 71 1.35 21 1.03 0.22 0.98 0.11 191 1.35 21 1.29 1.12 1.35 0.84
72 -0.76 10 0.98 -0.08 1.18 0.58 192 0.71 18 0.79 -1.14 0.66 -1.09 73 1.6 22 0.90 -0.27 0.69 -0.48 193 0.71 18 0.95 -0.23 0.91 -0.18 74 0.91 19 1.09 0.51 1.17 0.58 194 0.71 18 0.98 -0.02 0.97 0.01 75 0.71 18 1.31 1.54 1.62 1.73 195 2.6 25 0.82 -0.25 0.59 -0.22 78 0.15 15 1.39 2.15 1.97 2.83 196 0.91 19 0.74 -1.34 0.59 -1.25 80 -0.39 12 1.67 3.46 2.90 4.27 197 1.35 21 0.68 -1.34 0.50 -1.22 82 -0.57 11 1.07 0.46 0.96 -0.01 198 1.12 20 0.92 -0.31 0.75 -0.56
84 0.71 18 0.84 -0.80 0.76 -0.71 199 0.91 19 0.93 -0.29 0.88 -0.25 85 -0.95 9 0.93 -0.30 0.79 -0.42 200 1.6 22 1.15 0.60 1.36 0.80 86 1.12 20 1.07 0.38 1.06 0.27 201 0.34 16 0.98 -0.05 0.93 -0.16 87 1.12 20 0.81 -0.84 0.62 -0.96 202 -0.76 10 0.97 -0.12 1.18 0.59 88 0.52 17 0.79 -1.24 0.70 -1.04 203 0.52 17 1.04 0.26 0.96 -0.02
PERSON-ENTRY ABILITY-MEASURE SCORE INFIT-MNSQ INFIT-ZSTD OUTFIT-MNSQ OUTFIT-ZSTD (two persons per printed row)
89 1.12 20 0.89 -0.44 0.85 -0.25    204 -0.95 9 1.00 0.04 1.23 0.67
90 -1.88 5 0.97 0.00 0.87 0.04    205 3.13 26 0.60 -0.59 0.21 -0.62
91 -1.62 6 1.35 1.26 1.36 0.74    206 -1.88 5 0.83 -0.48 0.57 -0.52
92 0.71 18 0.95 -0.22 0.98 0.06 207 0.91 19 1.09 0.48 1.13 0.48 93 -1.16 8 1.09 0.50 1.17 0.50 208 3.13 26 0.60 -0.59 0.21 -0.62 94 -0.95 9 0.77 -1.24 0.63 -0.91 210 5.3 28 1.00 0.00 1.00 0.00 95 -1.62 6 0.83 -0.57 0.60 -0.60 213 -0.2 13 1.43 2.41 1.68 2.07 96 1.35 21 1.12 0.54 0.99 0.13 214 -0.03 14 1.09 0.57 1.05 0.28 97 0.71 18 0.71 -1.67 0.60 -1.35 215 0.91 19 0.85 -0.72 0.75 -0.65 99 0.91 19 0.85 -0.72 0.74 -0.69 216 1.88 23 1.03 0.21 0.75 -0.23
101 -1.38 7 1.17 0.75 1.42 0.89 217 1.88 23 0.98 0.05 0.64 -0.46 102 -1.62 6 1.01 0.13 1.87 1.37 219 -0.03 14 0.78 -1.38 0.71 -1.05 103 -1.62 6 0.83 -0.57 0.60 -0.60 221 2.6 25 0.89 -0.09 0.63 -0.17 105 -0.95 9 1.09 0.54 1.19 0.59 222 0.15 15 0.80 -1.27 0.73 -1.01 106 0.15 15 0.97 -0.14 0.90 -0.30 226 2.2 24 0.96 0.03 1.18 0.48
108 -2.19 4 0.80 -0.45 0.51 -0.48 227 0.91 19 0.99 0.02 1.02 0.18 109 -0.57 11 1.10 0.62 1.03 0.20 228 1.6 22 0.93 -0.14 0.87 -0.09 111 -0.57 11 1.19 1.14 1.30 0.94 229 0.15 15 0.95 -0.23 0.86 -0.44 112 0.34 16 0.78 -1.37 0.68 -1.17 230 -0.76 10 0.99 0.00 1.00 0.11 113 -0.39 12 1.02 0.19 0.93 -0.13 231 -0.57 11 0.79 -1.25 0.69 -0.93 114 -0.39 12 0.82 -1.13 0.93 -0.14 232 1.12 20 0.81 -0.84 0.61 -1.00 115 -1.62 6 1.01 0.13 1.72 1.20 233 1.88 23 0.94 -0.07 0.96 0.15
116 -1.88 5 1.26 0.88 1.35 0.69 234 2.2 24 0.94 -0.03 0.63 -0.33 117 -1.88 5 1.00 0.10 1.95 1.32 236 -1.16 8 0.89 -0.48 1.23 0.61 118 -1.88 5 1.26 0.88 1.35 0.69 237 -1.16 8 1.08 0.44 1.19 0.55 119 1.88 23 0.85 -0.36 0.61 -0.53 239 1.6 22 0.91 -0.24 0.83 -0.16 122 1.6 22 0.98 0.03 0.91 0.00 240 0.34 16 1.00 0.07 1.43 1.44
123 0.52 17 1.03 0.23 1.05 0.26 242 3.98 27 1.30 0.61 3.90 1.69 124 0.15 15 0.94 -0.30 0.86 -0.43 247 -0.39 12 1.09 0.61 0.97 0.01 126 0.71 18 0.77 -1.28 0.65 -1.14 248 3.13 26 1.08 0.33 0.41 -0.25 127 -0.2 13 0.96 -0.20 0.86 -0.42 249 -0.03 14 1.18 1.09 1.11 0.49 128 0.34 16 0.84 -0.94 0.81 -0.61 251 3.98 27 1.24 0.54 0.92 0.41 129 -1.62 6 0.89 -0.35 0.63 -0.52 254 2.6 25 1.25 0.65 0.73 -0.03 130 -0.57 11 0.73 -1.74 0.65 -1.09 255 3.98 27 1.09 0.39 0.35 -0.15
132 1.12 20 0.79 -0.96 0.60 -1.05 256 0.52 17 0.80 -1.15 0.71 -0.98 133 3.98 27 1.25 0.55 1.03 0.49 257 0.52 17 1.01 0.14 1.01 0.14 134 0.91 19 1.09 0.48 0.99 0.09 258 -0.57 11 0.83 -1.03 0.70 -0.88 135 1.6 22 1.04 0.25 0.96 0.10 259 0.91 19 1.02 0.15 0.92 -0.11 136 1.35 21 0.84 -0.60 0.68 -0.65 260 1.6 22 0.67 -1.20 0.46 -1.12
137 1.88 23 0.93 -0.12 0.83 -0.09 262 3.13 26 1.08 0.33 0.41 -0.25
Y HIGH SCHOOL
PERSON-ENTRY ABILITY-MEASURE SCORE INFIT-MNSQ INFIT-ZSTD OUTFIT-MNSQ OUTFIT-ZSTD (two persons per printed row)
5 -0.49 11 0.92 -0.49 1.02 0.16    184 1.03 20 1.06 0.35 0.99 0.04
7 -0.84 9 1.30 1.59 1.49 1.85    185 -1.23 7 1.28 1.19 1.39 1.19
11 -1.45 6 1.24 0.93 1.75 1.77    187 -1.7 5 1.23 0.80 1.70 1.46
13 -0.84 9 1.28 1.53 1.44 1.69    189 -1.03 8 1.13 0.69 1.28 1.01
15 -1.45 6 1.02 0.18 1.07 0.31    190 1.23 21 0.89 -0.42 0.80 -0.55
19 -2.34 3 0.99 0.14 1.71 1.12    191 -1.23 7 0.91 -0.34 1.14 0.53
20 -1.03 8 1.28 1.34 1.40 1.38    193 -1.99 4 0.78 -0.52 0.59 -0.75
23 0.49 17 0.98 -0.08 0.92 -0.37    194 2.33 25 1.06 0.28 0.98 0.18
24 -1.03 8 1.07 0.42 1.22 0.84    195 -1.7 5 0.84 -0.45 0.64 -0.80
27 -1.99 4 1.23 0.70 1.41 0.86    197 0.33 16 0.78 -1.77 0.73 -1.68
28 -1.23 7 1.22 0.97 1.34 1.07    198 -1.45 6 0.78 -0.83 0.62 -1.04
34 -1.7 5 0.92 -0.16 0.76 -0.45    199 -1.23 7 1.16 0.76 1.06 0.28
36 -1.03 8 1.21 1.04 1.18 0.70    201 -1.45 6 1.16 0.68 1.29 0.84
40 -2.34 3 1.05 0.26 0.90 0.06    206 -1.23 7 1.22 0.97 1.34 1.07
42 -1.03 8 0.92 -0.33 0.86 -0.44    207 -0.16 13 0.93 -0.48 0.90 -0.61
43 -0.66 10 1.39 2.24 1.55 2.32    208 1.7 23 0.75 -0.79 0.53 -1.13
47 -0.16 13 0.87 -1.02 0.84 -1.04    210 -0.16 13 0.83 -1.41 0.86 -0.93
50 -0.84 9 0.84 -0.90 0.77 -0.95    212 -2.34 3 1.12 0.41 1.30 0.63
51 -1.45 6 0.87 -0.42 0.78 -0.51    215 1.23 21 0.74 -1.21 0.62 -1.24
54 -0.16 13 1.13 1.00 1.12 0.80    216 0.17 15 0.76 -2.06 0.74 -1.81
55 -1.23 7 1.08 0.42 1.09 0.37    217 1.45 22 1.00 0.08 0.83 -0.35
59 1.98 24 0.93 -0.08 0.73 -0.39    218 0.49 17 0.79 -1.50 0.74 -1.49
60 -0.49 11 0.89 -0.75 0.83 -0.94    220 -1.23 7 1.31 1.33 1.60 1.69
61 -1.45 6 0.98 -0.01 0.94 -0.04    221 1.23 21 0.86 -0.60 0.72 -0.87
65 -0.66 10 1.06 0.41 1.08 0.46    222 -1.7 5 1.21 0.75 1.27 0.72
67 -0.16 13 0.84 -1.27 0.84 -1.07    224 0.33 16 1.03 0.28 1.01 0.12
68 -1.7 5 1.07 0.32 1.37 0.89    228 2.33 25 1.02 0.18 0.85 -0.02
69 -1.45 6 1.25 0.97 1.52 1.33    229 0.33 16 0.98 -0.08 0.99 0.01
71 -1.7 5 1.02 0.15 1.36 0.88    230 0.84 19 1.13 0.78 1.07 0.36
74 3.54 27 1.05 0.36 1.32 0.65    231 -1.7 5 1.23 0.80 1.35 0.86
75 -0.84 9 0.96 -0.14 0.90 -0.34    233 -1.99 4 0.98 0.07 0.88 -0.05
76 -1.23 7 1.22 0.97 1.34 1.07    237 -1.03 8 1.40 1.84 1.71 2.21
79 -1.7 5 1.33 1.08 2.06 2.00    238 -1.23 7 1.27 1.16 1.24 0.81
81 1.98 24 1.08 0.33 2.09 1.75    239 0.67 18 0.99 -0.03 0.96 -0.10
82 0.67 18 0.92 -0.49 0.88 -0.49    242 1.45 22 0.97 -0.02 0.83 -0.35
83 -1.23 7 1.22 0.97 1.34 1.07    245 -1.23 7 1.37 1.55 1.62 1.75
85 -1.99 4 0.88 -0.21 0.77 -0.31    248 0.84 19 0.80 -1.17 0.79 -0.87
86 -1.45 6 1.29 1.10 1.73 1.73    251 -1.45 6 1.06 0.32 1.02 0.17
87 0.67 18 1.14 0.91 1.18 0.88    252 -1.03 8 1.13 0.70 1.13 0.56
88 -1.99 4 1.09 0.35 1.12 0.39    255 -1.7 5 0.87 -0.34 0.71 -0.60
90 -0.66 10 1.20 1.24 1.34 1.55    256 -0.66 10 0.85 -0.93 0.95 -0.20
91 0.33 16 0.87 -1.01 0.81 -1.12    259 0.84 19 0.82 -1.08 0.75 -1.05
92 -1.7 5 0.92 -0.16 0.76 -0.45    260 -1.03 8 1.10 0.57 1.21 0.80
94 -1.7 5 1.36 1.15 1.85 1.70    261 0.17 15 0.95 -0.37 0.93 -0.42
95 -0.49 11 0.81 -1.34 0.76 -1.34    262 0.84 19 0.82 -1.04 0.74 -1.11
96 -1.03 8 0.94 -0.22 0.94 -0.12    264 -0.84 9 1.13 0.74 1.22 0.92
99 -1.23 7 0.96 -0.11 0.98 0.04    265 -1.99 4 1.12 0.44 1.38 0.82
100 1.23 21 0.92 -0.30 0.79 -0.58    266 -1.45 6 1.01 0.11 0.86 -0.27
PERSON-ENTRY ABILITY-MEASURE SCORE INFIT-MNSQ INFIT-ZSTD OUTFIT-MNSQ OUTFIT-ZSTD (two persons per printed row)
102 -0.16 13 0.99 -0.02 1.02 0.17    268 0.33 16 0.88 -0.87 0.83 -1.01
104 0.67 18 0.82 -1.17 0.77 -1.09    269 0.84 19 0.84 -0.90 0.76 -1.00
105 -0.49 11 1.06 0.49 1.04 0.30    272 1.23 21 1.15 0.73 1.35 1.07
106 -0.66 10 0.88 -0.71 0.90 -0.40    273 0.84 19 0.75 -1.55 0.66 -1.51
108 -0.84 9 0.84 -0.87 0.83 -0.67    275 1.98 24 0.98 0.07 0.75 -0.35
109 -0.66 10 1.27 1.63 1.32 1.44    278 0.84 19 0.93 -0.36 0.84 -0.60
110 -0.84 9 1.06 0.40 1.12 0.56    279 1.23 21 0.83 -0.75 0.75 -0.75
112 -0.32 12 0.86 -1.07 0.87 -0.79    280 -1.23 7 0.95 -0.16 0.81 -0.52
113 -0.66 10 1.15 0.93 1.26 1.20    282 1.98 24 0.90 -0.17 0.72 -0.40
114 -0.49 11 0.97 -0.18 0.90 -0.47    284 1.45 22 0.97 -0.05 0.84 -0.33
115 0.17 15 0.99 -0.06 0.99 -0.03    286 1.03 20 0.82 -0.93 0.72 -1.02
118 -1.03 8 1.16 0.81 1.38 1.33    287 1.45 22 0.97 -0.02 0.85 -0.29
119 0.84 19 0.97 -0.10 0.98 0.01    288 -0.16 13 1.02 0.16 0.99 0.00
121 -0.66 10 0.79 -1.38 0.73 -1.33    289 -1.03 8 1.08 0.44 1.19 0.74
122 -1.99 4 0.96 0.00 0.95 0.07    291 1.7 23 0.85 -0.41 0.63 -0.83
123 1.03 20 0.77 -1.24 0.69 -1.15    292 -1.45 6 0.76 -0.92 0.61 -1.08
125 1.45 22 1.02 0.18 0.87 -0.23    293 0.67 18 0.70 -2.11 0.64 -1.90
126 0.17 15 1.05 0.43 1.03 0.22    295 0.67 18 0.80 -1.31 0.74 -1.29
127 -0.49 11 0.95 -0.29 0.97 -0.09    296 3.54 27 1.06 0.37 1.59 0.82
129 -0.16 13 0.91 -0.65 0.95 -0.31    297 -1.45 6 0.97 -0.03 0.90 -0.15
131 1.7 23 0.83 -0.48 0.73 -0.52    299 0.33 16 0.69 -2.63 0.64 -2.37
132 -1.45 6 0.99 0.06 1.03 0.21    300 -1.45 6 0.86 -0.46 0.72 -0.72
133 -0.84 9 0.99 0.00 0.99 0.03    303 2.79 26 0.98 0.16 0.65 -0.19
135 -0.49 11 0.83 -1.20 0.80 -1.12    305 0.84 19 0.85 -0.83 0.79 -0.83
136 -0.66 10 1.19 1.19 1.34 1.52    306 1.7 23 0.88 -0.31 0.79 -0.37
137 -1.45 6 0.93 -0.18 0.83 -0.35    307 0.17 15 0.82 -1.51 0.78 -1.48
139 0.84 19 0.93 -0.33 0.85 -0.57    309 -0.49 11 0.95 -0.32 0.92 -0.39
141 -1.23 7 1.02 0.17 1.12 0.46    310 1.7 23 1.05 0.27 1.20 0.56
143 -1.23 7 0.91 -0.35 0.93 -0.12    311 0.84 19 0.81 -1.09 0.73 -1.15
144 2.79 26 0.94 0.09 0.84 0.08    314 -0.49 11 1.18 1.21 1.19 1.03
145 -0.32 12 0.88 -0.85 0.86 -0.83    315 -0.66 10 0.96 -0.23 1.00 0.05
146 -1.7 5 0.97 0.01 0.94 0.01    317 -0.84 9 0.83 -0.96 0.78 -0.93
148 -1.23 7 1.12 0.58 1.42 1.28    318 1.7 23 0.83 -0.50 0.67 -0.71
149 -1.45 6 1.28 1.06 1.47 1.23    319 1.45 22 0.91 -0.29 0.81 -0.40
150 -1.45 6 0.97 -0.02 1.19 0.60    320 -1.03 8 1.02 0.14 0.94 -0.13
153 -0.16 13 0.80 -1.63 0.75 -1.68    323 -0.49 11 0.87 -0.88 0.86 -0.75
156 -0.49 11 0.88 -0.85 0.86 -0.73    327 -2.8 2 0.81 -0.13 0.40 -0.66
159 -0.49 11 0.88 -0.83 0.88 -0.60    328 -1.99 4 1.19 0.62 1.28 0.66
160 -0.32 12 0.95 -0.31 1.00 0.03    331 1.98 24 0.81 -0.43 0.55 -0.83
162 0.17 15 1.11 0.85 1.16 1.00    334 2.33 25 0.98 0.10 0.84 -0.04
163 0.33 16 0.70 -2.44 0.66 -2.23    336 -2.34 3 1.18 0.52 1.76 1.18
164 0.33 16 1.23 1.66 1.22 1.28    340 1.45 22 1.09 0.44 1.32 0.90
168 -1.7 5 1.18 0.66 1.29 0.76    344 2.33 25 0.90 -0.08 0.62 -0.47
170 1.23 21 0.96 -0.11 0.87 -0.33    346 -1.23 7 0.77 -1.04 0.63 -1.22
171 -0.49 11 1.08 0.61 1.18 0.97    347 -1.7 5 0.92 -0.16 0.76 -0.45
172 -0.84 9 1.22 1.22 1.31 1.25    350 0 14 0.91 -0.69 0.95 -0.30
173 2.33 25 1.08 0.33 1.06 0.31    351 -1.45 6 1.03 0.20 1.20 0.62
175 -0.66 10 1.03 0.22 1.05 0.31    354 -0.66 10 0.92 -0.44 0.87 -0.58
177 -1.7 5 0.92 -0.16 0.76 -0.45    355 -1.45 6 1.14 0.61 1.08 0.33
179 -1.45 6 1.06 0.32 1.17 0.56    358 0 14 0.96 -0.28 1.01 0.14
180 -1.23 7 1.17 0.78 1.54 1.56    359 0.17 15 1.06 0.51 1.10 0.69
181 -1.99 4 1.18 0.58 1.55 1.07    361 1.7 23 0.95 -0.08 0.85 -0.20
Z HIGH SCHOOL
PERSON-ENTRY ABILITY-MEASURE SCORE INFIT-MNSQ INFIT-ZSTD OUTFIT-MNSQ OUTFIT-ZSTD (two persons per printed row)
1 1.53 23 0.80 -0.67 0.63 -0.93 99 0.06 15 0.90 -0.73 0.89 -0.61 2 -1.32 7 0.95 -0.13 0.83 -0.37 100 0.38 17 0.95 -0.30 0.89 -0.52
3 -0.75 10 1.15 0.96 1.13 0.56 101 -1.32 7 1.17 0.79 1.21 0.66 4 -1.53 6 1.15 0.62 1.49 1.15 102 0.06 15 1.04 0.36 1.07 0.44 5 -0.26 13 1.00 0.01 0.95 -0.23 103 3.7 28 0.72 -0.04 0.16 -0.55 6
2.44 26 1.07 0.30 0.78 -0.11 104 -0.93 9 1.13 0.76 1.77 2.30 7
-0.75 10 0.94 -0.34 0.93 -0.20 105 -0.93 9 0.84 -0.91 0.77 -0.77 8
0.22 16 0.68 -2.56 0.63 -2.33 106 0.06 15 1.05 0.41 1.06 0.39 9
1.3 22 1.10 0.47 1.10 0.39 107 0.38 17 0.81 -1.35 0.75 -1.36 10 0.55 18 0.72 -1.91 0.65 -1.86 108 -1.78 5 1.04 0.23 0.82 -0.22 11 1.3 22 0.76 -0.99 0.61 -1.20 109 -0.93 9 1.23 1.30 1.39 1.30
12 1.3 22 1.05 0.27 1.09 0.38 110 -1.53 6 1.33 1.25 1.53 1.21
13 -1.12 8 1.06 0.37 1.04 0.22 111 -1.32 7 1.19 0.87 1.41 1.11
14 -0.75 10 1.11 0.70 1.21 0.86 112 0.72 19 0.91 -0.45 0.85 -0.62
15 -1.12 8 1.28 1.36 1.29 0.92 113 -1.53 6 0.92 -0.24 0.80 -0.36 16 0.22 16 0.99 -0.02 0.96 -0.14 114 -1.32 7 1.12 0.58 2.17 2.49 17 -2.87 2 0.88 -0.01 0.52 -0.33 115 1.78 24 1.10 0.40 1.17 0.51
18 2.08 25 1.08 0.33 0.84 -0.12 116 0.72 19 1.16 0.96 1.18 0.80
19 -1.12 8 1.00 0.08 0.92 -0.15 117 2.92 27 1.06 0.30 0.94 0.24
20 -0.75 10 1.07 0.49 1.02 0.14 118 -1.32 7 1.19 0.87 1.27 0.79
21 -1.53 6 0.94 -0.13 0.91 -0.07 119 -0.93 9 1.13 0.78 1.79 2.33 22 -1.78 5 1.26 0.88 1.95 1.66 120 -1.53 6 1.10 0.45 1.13 0.44 23 1.3 22 0.82 -0.73 0.71 -0.82 121 -1.12 8 0.95 -0.21 0.96 0.00
24 1.3 22 0.96 -0.09 0.97 0.04 122 1.53 23 0.88 -0.34 0.87 -0.20
25 -1.78 5 0.97 -0.01 0.84 -0.18 123 1.1 21 1.04 0.25 0.98 0.04
26 -1.32 7 1.15 0.72 2.31 2.72 124 -0.1 14 0.76 -1.98 0.71 -1.76
27 -1.32 7 1.12 0.59 1.01 0.15 125 -1.12 8 1.01 0.11 0.98 0.06 28 0.55 18 0.79 -1.40 0.73 -1.40 126 1.3 22 0.95 -0.13 0.76 -0.64 29 -1.78 5 1.10 0.43 1.02 0.21 127 -0.42 12 0.95 -0.33 0.93 -0.29
30 0.72 19 0.94 -0.28 0.86 -0.55 128 -0.42 12 0.91 -0.63 0.87 -0.59
31 0.38 17 1.03 0.28 1.10 0.59 129 -1.12 8 0.88 -0.56 0.81 -0.53
32 -0.75 10 0.89 -0.65 0.81 -0.74 130 -0.93 9 0.80 -1.19 0.73 -0.93
33 3.7 28 1.14 0.45 3.03 1.44 131 1.1 21 0.88 -0.53 0.80 -0.62 34 -1.53 6 1.20 0.81 1.54 1.22 132 0.22 16 0.85 -1.12 0.88 -0.62 35 1.1 21 0.92 -0.33 0.82 -0.55 133 -0.26 13 1.05 0.44 1.06 0.38
36 0.9 20 1.06 0.39 1.06 0.31 134 0.38 17 1.09 0.63 1.17 0.90
37 2.08 25 0.89 -0.19 0.71 -0.39 135 1.3 22 1.23 0.97 1.16 0.54
38 1.1 21 1.18 0.85 1.39 1.26 136 -0.42 12 0.73 -2.17 0.69 -1.68
39 -1.12 8 0.84 -0.79 0.72 -0.83 137 0.22 16 0.77 -1.79 0.72 -1.72 40 -0.58 11 1.03 0.26 1.10 0.52 138 -1.78 5 1.12 0.47 1.68 1.30 41 -0.75 10 1.26 1.56 1.50 1.80 139 -1.53 6 1.02 0.16 0.93 -0.02
42 -1.12 8 1.01 0.14 1.17 0.61 140 -0.42 12 0.93 -0.47 0.96 -0.15
43 -0.1 14 0.71 -2.46 0.67 -2.09 141 -0.1 14 0.90 -0.73 0.85 -0.83
44 0.55 18 0.91 -0.56 0.85 -0.70 142 2.08 25 1.06 0.28 0.75 -0.29
45 -0.58 11 1.28 1.82 1.32 1.37 143 1.3 22 0.95 -0.15 0.88 -0.26 46 0.38 17 0.84 -1.14 0.77 -1.26 144 0.06 15 0.89 -0.81 0.84 -0.90 47 0.06 15 1.13 0.96 1.11 0.66 145 -1.32 7 1.25 1.11 1.26 0.76
48 -0.26 13 0.80 -1.58 0.75 -1.41 146 1.78 24 1.07 0.31 1.12 0.40
49 -2.41 3 1.04 0.23 0.82 -0.03 147 -0.42 12 0.81 -1.45 0.76 -1.25
50 0.38 17 0.89 -0.70 0.88 -0.60 148 -0.58 11 1.18 1.24 1.29 1.25
PERSON-ENTRY ABILITY-MEASURE SCORE INFIT-MNSQ INFIT-ZSTD OUTFIT-MNSQ OUTFIT-ZSTD (two persons per printed row)
51 0.06 15 0.84 -1.26 0.81 -1.14    149 0.72 19 0.73 -1.66 0.64 -1.71
52 -1.53 6 0.80 -0.75 0.63 -0.87 150 3.7 28 1.07 0.38 0.84 0.33
53 -0.1 14 1.10 0.79 1.07 0.45 151 1.3 22 0.73 -1.15 0.57 -1.36
54 1.3 22 0.77 -0.93 0.64 -1.09 152 -1.32 7 1.24 1.06 1.31 0.89 55 0.9 20 0.96 -0.15 0.93 -0.19 153 1.3 22 0.77 -0.92 0.64 -1.08 56 0.22 16 0.74 -2.08 0.68 -1.98 154 -2.06 4 1.12 0.43 1.39 0.78
57 -1.32 7 1.08 0.41 1.09 0.36 155 0.9 20 0.89 -0.51 0.83 -0.61
58 -0.93 9 1.26 1.45 1.20 0.75 156 -0.75 10 1.12 0.75 1.07 0.35
59 0.22 16 1.09 0.69 1.04 0.30 157 -0.93 9 1.04 0.27 1.71 2.14
60 -0.58 11 0.94 -0.35 0.88 -0.50 158 -0.58 11 0.97 -0.15 0.92 -0.31 61 -0.58 11 0.79 -1.51 0.73 -1.23 159 0.06 15 0.91 -0.68 0.88 -0.67 62 -0.58 11 1.16 1.09 1.19 0.86 160 -1.12 8 1.34 1.61 1.56 1.58
63 0.06 15 0.86 -1.11 0.83 -0.99 161 -1.12 8 1.05 0.32 1.81 2.12
64 -1.12 8 1.28 1.38 1.51 1.46 162 0.38 17 1.08 0.58 1.10 0.59
65 2.44 26 1.21 0.59 1.46 0.81 163 1.1 21 0.95 -0.16 0.88 -0.31
66 -1.53 6 1.17 0.72 1.47 1.10 164 -0.93 9 1.03 0.21 1.13 0.52 67 0.55 18 0.91 -0.52 0.85 -0.69 165 0.22 16 1.04 0.36 1.02 0.15 68 0.55 18 1.02 0.15 1.00 0.06 166 -0.42 12 0.97 -0.20 0.89 -0.48
69 -1.12 8 1.12 0.67 1.03 0.19 167 -0.1 14 0.80 -1.64 0.76 -1.44
70 -0.42 12 0.92 -0.55 0.89 -0.47 168 -0.58 11 0.99 -0.02 0.95 -0.17
71 -1.78 5 0.97 0.01 0.75 -0.38 169 -1.12 8 1.24 1.19 1.21 0.71
72 1.3 22 0.95 -0.11 1.02 0.17 170 0.38 17 0.81 -1.37 0.75 -1.41 73 2.08 25 1.19 0.60 1.09 0.34 171 -0.42 12 1.07 0.54 1.06 0.34 74 -0.58 11 0.94 -0.37 0.88 -0.49 172 -2.41 3 1.10 0.36 1.08 0.35
75 -0.93 9 1.10 0.60 1.27 0.96 173 -0.75 10 1.15 0.95 1.68 2.33
76 -1.78 5 1.28 0.94 1.57 1.15 174 -1.53 6 1.03 0.18 2.38 2.49
77 2.08 25 0.98 0.07 1.00 0.19 175 -0.93 9 1.09 0.57 1.10 0.43
78 -1.12 8 0.79 -1.08 0.67 -1.04 176 -1.12 8 0.90 -0.43 0.92 -0.15 79 -0.93 9 1.01 0.14 0.99 0.07 177 -0.58 11 1.16 1.10 1.20 0.89 80 1.1 21 1.30 1.33 1.23 0.81 178 1.1 21 1.08 0.42 1.04 0.22
81 -1.78 5 1.14 0.53 1.57 1.14 179 -1.53 6 1.24 0.95 1.21 0.60
82 0.9 20 0.97 -0.10 0.92 -0.21 180 0.9 20 0.94 -0.27 0.88 -0.39
83 -1.32 7 1.00 0.08 1.14 0.49 181 2.08 25 0.71 -0.73 0.47 -1.01
84 0.55 18 0.88 -0.74 0.82 -0.88 182 -0.58 11 0.86 -0.95 0.80 -0.86 85 2.92 27 1.11 0.38 0.80 0.06 183 -1.12 8 0.88 -0.58 0.80 -0.55 86 -1.78 5 1.22 0.77 1.40 0.88 184 0.72 19 0.95 -0.23 0.87 -0.52
87 -1.32 7 1.06 0.35 1.09 0.35 185 -0.75 10 1.05 0.34 1.03 0.21
88 0.9 20 1.00 0.06 1.02 0.17 186 -0.1 14 1.05 0.45 0.99 0.01
89 1.53 23 0.87 -0.40 0.77 -0.50 187 -1.12 8 1.04 0.27 1.04 0.23
90 0.22 16 0.81 -1.48 0.75 -1.51 188 -0.75 10 0.95 -0.27 0.85 -0.52 91 -0.26 13 0.79 -1.68 0.74 -1.47 189 -0.26 13 1.05 0.41 1.14 0.79 92 1.78 24 0.94 -0.07 0.74 -0.44 190 -1.12 8 1.06 0.36 1.04 0.23
93 1.3 22 0.93 -0.22 0.88 -0.25 191 -1.12 8 1.29 1.40 1.36 1.11
94 -0.58 11 1.12 0.86 1.12 0.59 192 0.38 17 0.79 -1.52 0.74 -1.45
95 0.22 16 1.18 1.27 1.25 1.36 193 0.22 16 0.85 -1.11 0.81 -1.10
96 -0.93 9 0.93 -0.36 0.87 -0.38 194 1.53 23 0.74 -0.96 0.60 -1.03 97 -0.75 10 0.79 -1.39 0.73 -1.08 195 1.1 21 0.73 -1.29 0.62 -1.39 98 0.72 19 1.29 1.57 1.26 1.12
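The ABILITY MEASURE column is the Rasch person measure in logits, and the INFIT and OUTFIT mean-squares are the usual residual-based person-fit statistics. As a reading aid, a minimal sketch of how the two mean-squares are obtained for one person from dichotomous responses and item difficulties; the difficulties below are illustrative placeholders, not the study's calibrated logits, and the ZSTD columns further standardise the mean-squares via the Wilson-Hilferty transformation, omitted here:

    import math

    def rasch_p(theta, b):
        """Rasch model probability of a correct response."""
        return 1.0 / (1.0 + math.exp(-(theta - b)))

    def person_fit(theta, difficulties, responses):
        """Return (infit MNSQ, outfit MNSQ) for one person.

        Outfit is the plain mean of squared standardised residuals;
        infit weights each squared residual by its binomial variance.
        """
        sq_std_res, variances, sq_res = [], [], []
        for b, x in zip(difficulties, responses):
            p = rasch_p(theta, b)
            var = p * (1.0 - p)
            sq_res.append((x - p) ** 2)          # squared raw residual
            sq_std_res.append((x - p) ** 2 / var)  # squared standardised residual
            variances.append(var)
        outfit = sum(sq_std_res) / len(sq_std_res)
        infit = sum(sq_res) / sum(variances)
        return infit, outfit

    # Illustrative check with assumed difficulties (not the study's values):
    print(person_fit(theta=0.0,
                     difficulties=[-2.0, -1.0, 0.0, 1.0, 2.0],
                     responses=[1, 1, 1, 0, 0]))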
Appendix 5 X, Y, and Z high school students' item responses: Guttman data
Columns (three panels): Person Entry and responses for X High school students, Y High school students, and Z High school students.
001 002
004
008
010
011
013
017
018
020
021
024
025
027
028
029
030
032
033
035
038
039
040
041
042
043
045
046
047
048
049
050
052
053
055
056
057
058
059
061
063
064
065
066
067
069
071
072
073
074
075 078
0101010111111000101100011100 1111110111111110111110111111
0101010111111010101100001100
0101010111101010100100011100
1101110111110100111100111111
0101010111111010101100011100
0001010111000010101100001100
0001010011010000100100000100
0001010011110110101100001100
0101010111111000101100001100
0001010011111000101100001100
0101010011001000101100000100
0101010011111100101100111101
1001010111110010101100101101
0101010011100010101100011100
0001010011100000100100001100
0001010111010010101100000100
0001000011000000000100000100
1101010111110100100100000100
1101010011101010101100001100
0101010111111000101100001100
0001010011101000100100000100
0001010111100000101100000100
0001010011100000100100001100
0001010011111000101100001100
0001110111111010101110111101
1101110111111110111110111111
0001010011100110101100001100
0001010011100000100100001100
0001010011100000101100000100
0101010111100100100100000100
0001010011100000100100001100
0101010011110000101100011100
0101010111111000101100001100
0001010111110000101100001100
0101010111111110101100001100
0101010011000000100100000100
1101110111111110101110101111
0001010011111000101100001100
1001010011111100101100001100
0001010111110000101100011100
0001000111000000001100000100
0001010111100100100100001100
0001010011111110101100001100
0001010011100000100100001100
0101010111000000000100000100
1101010111110110101100111101
0001010011111000100100001100
0001010111111110101110111101
0101010111111100101100011100
1001010011111110101100001100 0101010111111010101100001100
005 1101000001000110000000101000 007 1101100011100110010110101000
011 1101100011100110010110101000
013 1111100011100110010110001000
015 1101000011000010010000001000
019 1101000011100010010010001000
020 1101100011100110010010101000
023 1111100011110110010010001000
024 1101000011100010010000001000
027 1101100011100110010110101000
028 1101100011100110010010101000
034 1001000001000000000100001000
036 1101100011100110010110101000
040 1101000011000010010000001000
042 1011001001000010000000101000
043 1101100011100110010110101000
047 1111100111000010010010001001
050 1111000001000010010000101000
051 1001000001010000000100000000
054 1101000011100110010010001000
055 1111000011110010010010001000
059 1111110111110110011110111000
060 1111100011100010010110101001
061 1101000011000110010100001000
065 1111000011100110010000001000
067 1101001001100110010010001000
068 1111100011100110010010101000
069 1101100011100110010010101000
071 1101100011100110010110101000
074 1101100011100110010110101000
075 1111000011000010010100001000
076 1101100011100110010010101000
079 1111100011100110010010101000
081 1111111111111111011110111001
082 1111100111110110011110001000
083 1101100011100110010010101000
085 1000000001000000000100000000
086 1111100011100110010010101000
087 1111101011110110010010101000
088 1111000011100010010010001000
090 1101100011100110010110101000
091 1101000111010110010010101000
092 1001000001000000000100001000
094 1111101011100110010010011000
095 1101100011100010010110101000
096 1111100011100010010000101000
099 1101000011100010010110101000
100 1111110011110110011110001001
102 1101000011110010010000101000
104 1111111111110110011110001000
105 1111100011100110010110001000 106 1101000011100110010010101000
001 10011101111011111110110101111 002 00011101110001101100010010101
003 00011101110001100100010011101
004 00011100110001100100010010100
005 00011101110001100100010110101
006 00011101111011110110110111111
007 00011101110001100000110000000
008 00011101110001100110010011101
009 00011101111011101100110111111
010 00011101111011111100010111100
011 00011101111001111110010111111
012 00011101110001100100110011111
013 00011101110001100100010000100
014 00011100110001100100010010100
015 00011101110001100100110011101
016 00011101110011101100010011110
017 00010000000000000100000000000
018 01011111111011111100110111101
019 00011100100001100100010000101
020 00011101110001100100010111111
021 00011100100001100100010000000
022 00011101110001100100010011111
023 00011101111011111110110111111
024 01011111111011111110110111111
025 00011100100001100100110000110
026 00011101110001100100010001101
027 00011101110001100100010011100
028 00011100110011101100110111101
029 00011101110001100100110001101
030 00011101111011100100110101110
031 00011101110001100100010111110
032 00011101100011100000110010001
033 00011101110001100100010001111
034 00011101110001100100010001111
035 00011101111011111110110001111
036 00011101111011101100010011111
037 11011101111011111110110101111
038 00011101110011100100110101110
039 00011100110000101100010000010
040 00011101110001100100010110100
041 00011101110001100100010001111
042 00011100100001100000010010101
043 00011101110001101000010001110
044 00011101111001111110110001111
045 00011101110001100100010011111
046 00011101110001100100010001111
047 00011101110001100100110000100
048 00011101110011100000110000010
049 00011100100001100100010000101
050 00011111111011110100110101110
051 00011100100001101100010111111 052 00011100100001000000010010000
082 084
085
086
087
088
089
090
091
092
093
094
095
096
097
099
101
102
103
105
106
108
109
111
112
113
114
115
116
117
118
119
122
123
124
126
127
128
129
130
132
133
134
135
136
137
140
141
142
145
147
148
149
151
152 153
1101010111110000101100011100 0101010111111010101110111100
0001010111111000100100001100
1101010111111100101100111100
1101010111111000101100111100
1001010111111010101100101100
1101110111111110111110101101
0001010011100000101100000100
0001010011101000101100001100
0001110111111110101110111110
0001000001000000000100000100
0001010011010010100100001100
0001010011001000101100000100
0101010111111010101100011100
0101010111101110101110011110
1101110011101110101110001101
0101010111111000101100001100
0001010011000000000100000100
0001010011001000101100000100
0001010111111100101100001100
1101010111110100101100011101
0001000011000000000100001100
0001010111100100101100001100
0001010011101110101100011100
0101010111101000101100001101
0101010011010100000100000100
0001010011110000101100000100
0001010011100100000100000100
0001010011100000100100001100
0001010011001000100100000100
0001010011100000100100001100
1101010111111110101100001101
1101110111111110101100011100
1101010111110100101100011100
1101010011101010100100011101
1101010011111010101110001101
1101010011101010101100011100
1101110111101010101100001100
0001010011110000100100001100
0001010011100100101100000100
1101010111111100101100111101
1101010111111110101100011101
1001010111110110101100001100
0001010111111100101100011100
1101110111110110101110011100
0101010111111110101100011100
0111110111111110111100111101
0101010011111010101100011100
0001010011111010101100001100
1001010111111110101100011101
0101110111111100101100111101
1011110111111011111110111111
0001010111111110101100011101
0001010111111100100100000100
1101010111111100101100111101 0001010111111000101100001100
108 1101100001100010000000101000 109 1101100011100110010110101000
110 1101100011100010010010001000
112 1101100011000110010000101000
113 1101000011100010010100001000
114 1111101011100110010010101000
115 1101001111110110010010101000
118 1101100011100110010110101000
119 1111101111110110010110101000
121 1001000001000110010100101001
122 1001000001000010000010001000
123 1111111111110110011110101000
125 1111101011110110010110101000
126 1111000011100010010010001000
127 1101000011010010010000101000
129 1111000011010010010110001000
131 1111111111111111011110111010
132 1101000011000010010000101000
133 1101000011100010010010101000
135 1111100001000010000010001000
136 1111000011100110010010001000
137 1101000001000110010000001000
139 1101001111110110010010001000
141 1101100011100010010010001000
143 1011000001000010000000001000
144 1111111111111111111011111011
145 1101100011100010010100101001
146 1101000011100110010010101000
148 1101100011100110010010101000
149 1101100011100110010110101000
150 1011000001000000000000001000
153 1101100111110110010110101000
156 1101100011100110010000101000
159 1101100011110010010000001000
160 1101000111100110010110001000
162 1101100011100110010110101000
163 1101001111110110011010101000
164 1101100011100110010110001000
168 1101100011100110010110101000
170 1111101111110110010110101000
171 1111000011100110010010001000
172 1101100011100110010110101000
173 1111101111110110010110101000
175 1111100011100110010110101000
177 1001000001000000000100001000
179 1111000011110110010010001000
180 1101100011100110010110101000
181 1101000011000010010000001000
184 1111101111100110010110101000
185 1101100011100110010110101000
187 1101100011100110010110001000
189 1101000011100110010010001000
190 1101101111110110010110101000
191 1101000001000010010010101000
193 1001000001010000000000000000 194 1101100011110110010110101000
053 00011101110001100100010011110 054 00011111111011110110110101111
055 00011101110001101100110111111
056 00011101110001101100110001100
057 00011101110001100100010000100
058 00011101110001100100110111111
059 00011101110001100100110000110
060 00011101110001101100010010111
061 00011100100001101100010000101
062 00011101110001100100010011111
063 00011101110001100000010101100
064 00011101110001100100010001111
065 00011101110011101100110111111
066 00011101110001100100010001100
067 00011101110001100100110101101
068 00011101110001100100110101111
069 00011101110001100100010001110
070 00011100110001100100010011101
071 00011100110001100000010000110
072 00011111111011101110110111111
073 00011101111011111100110111111
074 00011100110001101100010110100
075 00011100110001100100010000100
076 00011101110001100100110001111
077 00011101111011110110110111111
078 00011100110001101000010000100
079 00011101110001100100010011100
080 00011101110001100100110101101
081 00011101110001100100010001111
082 00011101110001100100110011110
083 00011101110001100100010001100
084 00011101111001100100110100111
085 00011111111011111100110111111
086 00011101110001100100010011111
087 00011101110001100100010001111
088 00011101110001100100110000110
089 00011101111011111100110111111
090 00011100111001101100110001100
091 00011101101001100110110000001
092 00011101111001100110110111111
093 00011101111001100110110111111
094 00011101110001100100010011100
095 00011101110001100100010001111
096 00011100100001100000010110000
097 00011101100000100000010011000
098 00011101110001100100010011111
099 00011101110011101100110001111
100 00011101111011101110010101101
101 00011101110001100100010111111
102 00011101110001101100010001111
103 11011111111111111111111111111
104 00011101110001100100010001100
105 00011100110001100100010000010
106 00011100110001101100010100110
107 00011100110011111100010001111 108 00011100100001100100010001110
154 155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208 213
1101010111111110111100111111 0001010111100000101100000100
0101110111111110101110111111
1101110111111110101110101100
0101010111110010101100011100
1111110111111110111110111101
0101010111100100101100001100
1111110111111110111110111101
1111110111111110111110111111
0101010111111000101100001100
0101110111111110101110101101
1101010111111110101100101101
1101110111111110111110111111
1101010111111110101100111101
1101010111111110101100101101
0101010011000000001100000100
1101110111110010101100111100
1101010111111100101110111101
1111010111111011111110111111
1101110111111010101100101101
1101010111100100100100001100
0001010111111000101100011101
0001010011000000100100000100
1101010111111110101100011101
0001010111101010101100001100
0001010111101000101100011101
0101010111101010101100111100
0001010011100000101100001100
1001000011010000001100001100
0101110111111110111100111101
1101010111111110101110111100
0101010011100000000100000100
0101110111111110101110111101
0001010111100000100100010100
1101010111111110111100111111
1111111111111111111110111111
0101110111111110101100011101
1001010111111100101100111101
0101010111111010101110101100
0101010111111010101110111101
0101110111111100101100111101
1111110111111110111110101111
0101110111111000101110111110
0101110111111110111110111100
0101110111110100101110011111
1001010111111110101100001100
1101110111111110101110111101
1101010111111010101100011100
0101010011100000101100001100
1101010111111110101100001101
0001010011110000101100000100
1111110111111111111110111111
0001010111000000000100000100
0101010011111000101100111101
1111110111111111111110111111 0001010111111000101100001100
195 1001000011000010000010001000 197 1101111111110110010010001001
198 1101100001000000000010000000
199 1111000011110110010010001000
201 1101100011000010010000101000
206 1101100011100110010010101000
207 1101000111110110010110001000
208 1111111111110111010110111001
210 1101000001100110010110001000
212 1101000011100110010010001000
215 1111111111111110011110001001
216 1111100011110110011100101000
217 1111110111110110010010111001
218 1111000011110010011110101000
220 1101100011100110010110101000
221 1111100011110110010110101000
222 1101100011100110010110101000
224 1111100011110110010110001000
228 1111111111110110011010101001
229 1111101111100110011010001000
230 1111100011100110010110101000
231 1101100011100110010010101000
233 1101000011100010010010001000
237 1101100011100110010110101000
238 1101100011100110010110101000
239 1101101111100110010110001000
242 1101101111110110010110001000
245 1101100011100110010110101000
248 1111101111110110011110111001
251 1101100011100110010110101000
252 1101100011100110010010001000
255 1101000001000010010110001000
256 1101001001100110010000001000
259 1111101011110110011010111001
260 1111100011100010010000001000
261 1101000011100110010010001000
262 1101011111110110011110101000
264 1101100011100110010010001000
265 1101100011000010010000001000
266 1111000011100010010110101000
268 1111000011110110010100001001
269 1111111011110010011110001000
272 1101100011100110010110101000
273 1111100111100110011110101000
275 1111111011110110010110111001
278 1101101111110110011110101000
279 1111111111111111011110111000
280 1101101011100010010010101000
282 1111111111111110011110011001
284 1111111111110110011010001000
286 1111110111110010010110111000
287 1101110111110111011010111001
288 1101100011100010010010001000
289 1111000011100010010010001000
291 1111101111110110010110101000 292 1101001001000010000000001000
109 00011101110001100100010011111 110 00011101110001100100010101101
111 00011101110001100100110001100
112 00011101110011100100010101111
113 00011100100000100100010000010
114 00011101110001100100010011100
115 00011101110001100100010001111
116 00011101110001100100010111111
117 11011111111011111111111111111
118 00011101110001100100110001111
119 00011100100001100100010010110
120 00011101110001100100010001100
121 00011100110000100000010000001
122 11011101111011110110110111111
123 00011101110001100100010011101
124 00011101100001100010110010111
125 00011100100001100100010000101
126 00011101111001100110110011111
127 00011101110001100100010101100
128 00011101110011101000010000010
129 00011101100001100000010001100
130 00011101110001100000010001000
131 00011101110001101100110111110
132 00011101111001100100110011100
133 00011101110001101100110010101
134 00011101110011101100010110110
135 00011101110001100100110001111
136 00011100110001100100010001101
137 00011101110011100010010101101
138 00011101110001100100010011111
139 00011100100001100000110110100
140 00011101110001100100010010100
141 00011100110001100000010100011
142 00011101111001101110110111111
143 00011101111011101110110001111
144 00011101111001101110110100111
145 00011101110001100100010111111
146 00011101110001100100010001111
147 00011101100001101100010101000
148 00011101110001100100010001111
149 00011101110001111110110001111
150 00011101111011101110110111111
151 00011111111001101100110111111
152 00011101110001100100010011111
153 00011111111011111110010101111
154 00011101110001100100010001100
155 00011101110011100110010111111
156 00011100110001100100110010100
157 00011100110001100100010010101
158 00011100110011101100010011100
159 00011101111011101110110001101
160 00011101110001100100010001111
161 00011101100001100000010010001
162 00011101111011100100010111111
163 00011101110011100100110011111 164 00011101110001100100010111100
214 215
216
217
219
221
222
226
227
228
229
230
231
232
233
234
236
237
239
240
242
247
248
249
251
254
255
256
257
258
259
260
262
0101010111110000100100000100 1101010111111100101100011101
0101010111111110101100001100
1101110111111110111110011101
0001010111110010101100001101
0101110111111110101110111101
1101010111110000101100001101
1101110111111110111100111111
1001010111111110101100111101
1101010111111110101100011100
0101010011110100100100011100
0101010011110000100100001100
1001010011011100000100001100
0101010111111110101110101101
1101110111111110101100101101
0101110111111110101110111101
0001010011110000100100000100
0001010111101010100100000100
0011110111111110111110111111
1101010011110010101100011100
0101010111111010101100001100
1001010011111110101100111100
1111110111111110111110111111
0001010111110110101100001100
1101110111111110101110111101
1101010111111110101110111101
1111110111111111111110111111
0101110011111110101100101101
0001010111111100101100011100
0101010111111000100100011100
0101010111110110101100001101
1101010111111110111110111110
1111110111111110111110111111
293 1111001111110110010110101000 295 1101111111110110010100001001
296 1101100011100110010110101000
297 1101000001000010010010101000
299 1101101011110110010010111000
300 1111000001100110000000001000
303 1111111111111111011110111000
305 1111101111110110010110001000
306 1101111111111111011110111001
307 1101100111100010010110101001
309 1101001011110010010010001000
310 1111101111110110010110101000
311 1111100111110110011110101001
314 1101100011100110010110101000
315 1101100001100010010000001000
317 1101000001110010010000001000
318 1111111111111111011010111000
319 1111111111111111011110111000
320 1111000011000110010000001000
323 1101101011100010010010101000
327 0000000001000000000000001000
328 1101100011100110010110101000
331 1111111111111110111110111010
334 1111111111111111011110111001
336 1101100011100110010110101000
340 1101101011110110010110001000
344 1111101111110110011110101000
346 1101000001010010010000001000
347 1001000001000000000100001000
350 1111100111100110011100001000
351 1111000011100010010010001000
354 1101000101010110010100001000
355 1101100011100010010010101000
358 1111000111100110010010101000
359 1111100111110110010010001000 361 1111111111111111011110111001
165 00011101100001100100010100101 166 00011101111011101100110010101
167 00011101100011100100010010011
168 00011101110001100100010001100
169 00011101110001100100010001111
170 00011101111011100110010101110
171 00011101110001100100110001100
172 00011101110001100100110001111
173 00011101110001100100010000101
174 00011101100000100000010000000
175 00011101110001100100010001101
176 00011100100001000100010000000
177 00011101110001100100010001111
178 00011101111011100110110011111
179 00011101110001100100010001111
180 00011111111011111110110000111
181 11011111110011111110111111111
182 00011100110001100100110011100
183 00011101100000100000010011000
184 00011101110001100100110001101
185 00011100100001100000010001101
186 00011101110001100100110011101
187 00011100100001100000110010100
188 00011101110001101100110101110
189 00011100110001100100010001111
190 00011101110001100100010010110
191 00011101110001100100010001111
192 00011101110001100100110011111
193 00011101101001100100010100111
194 11011101111011111110110111110
195 11011111110011101100110111101
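Each string above is one student's dichotomously scored response vector in item order; given calibrated item difficulties, the person measures in Appendix 4 follow by maximum likelihood. A minimal Newton-Raphson sketch for one 28-item string from the X school panel (the evenly spaced difficulties are placeholders, not the study's calibrations, and zero or perfect raw scores are excluded because their likelihood has no finite maximum):

    import math

    def mle_ability(responses, difficulties, iters=20):
        """Newton-Raphson MLE of Rasch ability for a 0/1 response vector."""
        score = sum(responses)
        if score == 0 or score == len(responses):
            raise ValueError("extreme score: the MLE is infinite")
        theta = 0.0
        for _ in range(iters):
            ps = [1.0 / (1.0 + math.exp(-(theta - b))) for b in difficulties]
            gradient = score - sum(ps)                   # d log-likelihood / d theta
            information = sum(p * (1.0 - p) for p in ps)  # test information
            theta += gradient / information
        return theta

    resp = [int(c) for c in "0101010111111000101100011100"]  # one X school string above
    b = [-2.7 + 0.2 * i for i in range(len(resp))]           # 28 assumed difficulty logits
    print(round(mle_ability(resp, b), 2))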