A study on teachers’ item weighting and the
Rasch model: Summative test items’ difficulty
logits calibration using the Rasch model
Author Name: Sanghoon Mun, University of Bath
British Council ELT Master’s Dissertation Awards: Special
commendation
Table of Contents
Chapter 1. Introduction
1.1 Introduction
1.2 Purpose of Study
1.3 Overview
Chapter 2. Literature Review
2.1 Introduction to Chapter Two
2.2 Test Item Weighting Methods
2.2.1 Equal Weighting Method
2.2.2 Differential Weighting Method: Weighting by Difficulty
2.3 Item Analysis
2.3.1 Classical Test Theory (True-Score Theory)
2.3.2 Item Response Theory (IRT)
2.3.2.1 The Rasch model: Fit statistics
2.4 Conclusion to Chapter 2
Chapter 3. Research Methodology
3.1 Introduction to Chapter 3
3.2 Context
3.3 Research Strategy
3.4 Research Design
3.4.1 Overview of the Research Procedure
3.4.2 Data Collection and Sampling
3.4.3 Data Analysis
3.4.3.1 Way of Categorising Language Elements
3.4.4 Validity and Reliability
3.5 Ethical Considerations
3.6 Conclusion to Chapter 3
Chapter 4. Data Analysis
4.1 Introduction to Chapter 4
4.2 Selecting Data
4.3 Test Item Classification
4.4 Data Refinement
4.5 Item Difficulty Calibration
4.5.1 X High School
4.5.2 Y High School
4.5.3 Z High School
4.6 Language Category Classification
4.7 Summary
Chapter 5. Discussion
5.1 Introduction to Chapter 5
5.2 Discussion and Implications
5.3 Recommendations for Future Research
5.4 Conclusion
References
Appendix
Chapter 1. Introduction
1.1 Introduction
As language tests have become a part of everyday life, their importance seems to grow in a number of domains. From this perspective, McNamara (2000) points out that language tests have played a powerful role as gateways at important transitional moments in education, in employment, and in immigration. Taylor (2005) also argues that, depending on language test results, test-takers' life chances or careers can be influenced in a given domain. Likewise, in Korean society, a high language test score is considered a way of gaining "access to the world of elite" (Hu & McKay, 2012, p. 352). The belief that a language test and its result serve as "crucial milestones in the journey to success" (Brown, 1994, p. 401) can be identified in the high school language test, because in most 'competitive' universities in South Korea, the English test results which students have obtained from school summative tests are usually used as evidence for predicting their academic latent trait (Weir, 2005) and directly influence students' university admission to a great extent.
In designing such a high-stakes test, language teachers usually write multiple-choice questions with the belief that those who choose the preferable answer over the others have more knowledge or ability than those who do not (Cliff, 1989). Even though multiple-choice testing is frequently the target of disparaging comments in the everyday conversations of students and teachers, it is generally said that multiple-choice items are economically practical and provide relatively objective (or valid) scoring in the testing field (Diekhoff, 1983; Pae, 2012) by preventing the possibility of divergent answers (Statman, 1998). I accept that multiple-choice items are economically practical and efficient in terms of the scoring process. However, the scoring issue needs to be carefully investigated in relation to the weighting process. Otherwise, I believe, it may become difficult to demonstrate the link between test users' interpretation of the score and the decisions that they make on the basis of the score (Fulcher & Davidson, 2009).
After designing multiple-choice items, most school teachers resort to differential item weighting with the assumption that equally weighted test items cannot reflect the importance of the test content (Feldt, 2004) and that only the very able will be able to identify the correct answer to the most difficult items (West, 1924). In addition, they aim to minimise the number of students who gain the same scores (this topic will be discussed in detail in chapter 3) so that students can be clearly differentiated in terms of their capacity. As a way of allocating differential weights to the multiple-choice test items, they usually rely on their subjective judgement of the difficulty levels of the items (this is called an 'a priori' weighting method). Based on the subjectively assigned weightings of the items, the scoring process takes place. At this point, I want to pose a question about whether weighting and scoring are discrete processes. If scoring were a simple process of adding the weighted points regardless of the weighting process, the particular values of the weights would be relatively unimportant (Koopman, 1988). However, that is not the case. Weighting and scoring are not discrete processes. Rather, the weighting process influences the scoring process to a certain degree, because scoring cannot take place before the items are weighted. In this sense, I believe that it is difficult to secure objectivity in scoring multiple-choice items.
In order for the test scores to be interpreted as critical indicators for making decisions, test
designers need to identify and eliminate any potential sources of errors that can decrease both the
reliability of scores and the validity of their interpretations (Bachman, 1990). In a similar vein, McNamara and Ryan (2011) claim that "a person's chances of success on a test should not be influenced by irrelevant factors" (p. 163). Among the factors which influence test reliability and validity is the weighting process (Guilford, 1954; Wang & Stanley, 1970), which is the topic of this study. Depending on how test designers weight the items, test reliability and validity can be enhanced or marred. In spite of the importance of the weighting process in language tests, little, if any, attention has been paid to the weighting issue in my context. Hence, the current research will investigate, using the Rasch model, how consistently language teachers weight test items.
1.2 Purpose of Study
When deciding the difficulty levels of test items and allocating corresponding marks to them, teachers in South Korea tend to rely on their knowledge and experience. Based on their prior experience and expertise, they weight the items and allocate different marks to them before test takers take the test. I assume that such item weightings may not match the actual difficulty levels of the items. Rather, there may be a discrepancy between teachers' prior item weighting and students' actual responses to the items. In relation to this assumption, two research questions will be investigated in this study.
• Is there a gap between teachers' weighting and students' actual item responses? If so, is the gap large or small?
• In terms of language categories, does what teachers believe to be difficult correspond to what students find to be difficult?
1.3 Overview
This dissertation is composed of five chapters, including this one. In chapter 2, the literature on item weighting methods and on two ways of analysing test items will be reviewed and discussed. In chapter 3, along with an explanation of the research context, the overall blueprint of the current research will be described. In chapter 4, the data collected from three different high schools will be analysed using the Rasch model and the findings will be expounded upon. In the final chapter, the findings will be discussed and recommendations for further research will be suggested.
Chapter 2. Literature Review
2.1 Introduction to Chapter Two
Wiliam (2011) argues that "it is only through assessment that we can find out whether a particular sequence of instructional activities has resulted in the intended learning outcomes" (p. 3). In addition, educational practitioners and theorists have widely noted the effects of assessment on learning and teaching (Zhan & Andrews, 2014; Brown, 1997). From this perspective, in a formal
education setting, most learning and teaching activities are accompanied by assessment, and tests (e.g. performance tests, summative tests) are frequently set up and implemented (Lloyd-Jones, 1992) as instruments for realising the assessment process in relation to teaching programmes and materials (Woodford, 1980). That is, "assessment is a superordinate term for all forms of
assessment, and testing is a term for one particular form of assessment" (Leung & Lewkowicz, 2006, p.
212). Given the relationship between assessment and tests, I believe that the quality of
assessment may be dependent on the quality of the test to a great extent.
When designing a test, test designers (mainly language teachers in my context) usually
encounter the moment in which their professional judgement needs to intervene (Allal, 2013). First
of all, they have to make a decision about the domain being tested. Subsequently, they choose
"proper" test methods and reflect both the domain and methods into the test. Along with those
processes, test designers also labour in speculating on the way of item weighting which has an
impact on the internal criteria such as reliability (Wang & Stanley, 1970). In the first section of this
chapter, the item weighting methods (an equal weighting method and a differential weighting
method) will be introduced.
After students take the test, teachers need to analyse and evaluate what the test scores mean on the basis of the test results (Taylor, 2013). In this process, the results are summarised into numerical values such as the mean, standard deviation, and frequency distribution, which contain much information about the test. Along with the overall analysis of the test, teachers also need to examine the individual test items in order to find out how difficult each item is and how high its discrimination power is. In this sense, after introducing the weighting methods, ways of analysing test scores and test items (classical test theory and item response theory) will be explained.
2.2 Test Item Weighting Methods
In general, a total score may be obtained by merely adding the marks of correctly answered items. That is, if a student responds to a certain item correctly, the score goes up by the item's mark; otherwise, the score does not change at all. However, students' scores are not always summated in this identical way. Depending on the method of weighting, the scores that students can obtain differ. In this section, two weighting methods will be explained: the equal weighting method and the differential weighting method.
2.2.1 Equal Weighting Method
In an equal weighting method, all test items have equal weight (Stalnaker, 1938): a test designer allocates the same mark to every item with the belief that each item is equally related to the underlying trait being measured (Xu & Stone, 2012). To put it simply, if the maximum possible score which a test taker can gain is 100 and a test designer creates 20 test items, each item can be assigned 5 points. This method usually makes the scoring process easy and helps test developers interpret the test result conveniently. However, in practice, "items are not equally correlated" within a test (Guilford, 1954, p. 443) because their standard deviations (SDs) differ. That is, even though this weighting method ostensibly distributes the same mark to each item, internally each item carries a different weighting, because the SD of an item determines its weight (Wang & Stanley, 1970). The SD indicates how far scores are dispersed around the mean score: the higher the SD, the further scores lie from the mean. In relation to the SD and item weightings, Kim et al. (2010) mention that if the SD of an item is high, the effective weighting of the item is high, and if the SD is low, the effective weighting is low. Suppose that 10 students answered 3 items with the results shown in Table 2.1.
        PERSON
ITEM    A  B  C  D  E  F  G  H  I  J    SD     Variance   Average
1       2  2  2  2  0  0  0  0  0  0    0.98   0.96       0.80
2       2  2  2  2  2  2  2  0  0  0    0.92   0.84       1.40
3       2  2  2  2  2  2  2  2  0  0    0.80   0.64       1.60
Table 2.1 10 students' test scores on three items
Table 2.1 shows that an identical mark, 2, has been allocated to each item. Some students responded to a given item correctly, while others did not. In addition, Table 2.1 shows that the SDs and variances of the three items differ. In this example, the fewer the correct responses, the higher the SD. Since a high SD implies a high effective weighting (Kim et al., 2010), the effective weighting of item 1 is higher than that of items 2 and 3. In this sense, even though the same mark is allocated to all three test items, in practice they carry different effective weights.
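For illustration, the figures in Table 2.1 can be reproduced with a few lines of Python. This is a minimal sketch, not part of the original study, and the array name item_scores is illustrative.

import numpy as np

# Marks of 10 students (A-J) on the three items in Table 2.1; each item carries 2 marks.
item_scores = np.array([
    [2, 2, 2, 2, 0, 0, 0, 0, 0, 0],  # item 1: 4 correct responses
    [2, 2, 2, 2, 2, 2, 2, 0, 0, 0],  # item 2: 7 correct responses
    [2, 2, 2, 2, 2, 2, 2, 2, 0, 0],  # item 3: 8 correct responses
])

for i, scores in enumerate(item_scores, start=1):
    # Population variance/SD (ddof=0), as in Table 2.1; the SD serves as a
    # proxy for the item's effective weighting.
    print(f"Item {i}: average={scores.mean():.2f}, "
          f"variance={scores.var():.2f}, SD={scores.std():.2f}")
# Item 1 shows the highest SD (0.98) and hence the highest effective weighting,
# even though all three items carry the same mark of 2.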
In a real test context (especially in a large-scale test), it may be extremely difficult for different items to have the same SD and variance, because students' abilities vary. In addition, what is difficult for one student can be easy for another, or vice versa. From this perspective, Wang and Stanley (1970) claim that even though the marks of the items within a test are equally assigned, the effective weightings of the items adjust naturally. This is called natural (or random) weighting.
Even if an equal weighting method can make the interpretation of results convenient and naturally adjust the effective weighting of the test items, Stalnaker (1938) and Kim et al. (2010) argue that since this weighting method distributes the same mark without considering the critical factors (e.g., item complexity, importance of content, time required) which can influence the test result, test stability and reliability are likely to be compromised. In addition, Shi and Chang (2012) contend that assigning different weights rather than an equal mark to the test items can lead to more accurate estimates of test-takers' latent traits. In spite of that, Wang and Stanley (1970) argue that "although differential weighting theoretically promises to provide substantial gains in predictive or construct validity, in practice these gains are often so slight that they do not seem to justify the labour involved in deriving the weights and scoring with them" (p. 664).
2.2.2 Differential Weighting Method: Weighting by Difficulty
While an equal weighting method distributes the same mark to the test items, a differential
weighting method assigns different marks to each item on the basis of the particular criterion
adopted. Among criteria (e.g., item length, item validity, etc.) for assigning different weights to the
items is item difficulty, which is the topic in this section.
When test developers or subject-matter experts assign different weights to sections or items of a test, they consider a number of factors such as time, the maximum score, and the number of questions. Ultimately, however, based on an intuitive feel for items' difficulty or "worth" (Wang & Stanley, 1970), they subjectively judge that one question is more important than others (Stalnaker, 1938; Wang & Stanley, 1970) and assign different weights to the items accordingly. This is called an 'a priori' (subjective) weighting method, and it is the method generally implemented in South Korea (Kim & Roh, 1999, cited in Kim et al., 2010).
However, it is extremely difficult to estimate how difficult the test item will be before the
test takers take the test (Baker, 1985; Kim et al., 2010). In some cases, what is considered to be
difficult by test designers may be easy for the test takers or vice versa. Because of the gap, there
may be a possibility that the weight is not equivalent to the proportion of those who fail to answer
the item correctly (Wang & Stanley, 1970). Thus, Guilford (1954) argues that since weighting on an 'a
priori' basis heavily relies on personal bias (subjectivity), this method can compromise the reliability
and validity of the test unless the criteria for the item weighting are consistently and strictly laid
down.
In addition, Gulliksen (1950) and Stalnaker (1938) contend that a differential weighting
method may not be advantageous over an equal weighting method. Gulliksen (1950) points out that in
a wide range of cases, an equal weighting method produces statistically similar results to a differential weighting method. Stalnaker (1938) indicates that the balancing of weights becomes highly complex, so that if more than two teachers are involved, a great amount of time may be spent in determining the appropriate weights.
In spite of such limitations of a differential weighting method, Wang and Stanley (1970) assess this method as follows: "an 'a priori' weighting is the most appropriate when it is actually used to define the nature of the composite measure" (p. 668). In addition, since test designers hold the entrenched conviction that knowing a very difficult item is evidence of considerably more ability or achievement than knowing a simple one, they have continued to distribute different weights to test items and to redefine the differential weighting method (Wang & Stanley, 1970).
In this section, two weighting methods, the equal weighting method and the differential weighting method, were explained. In the following section, two ways of analysing test items will be discussed, along with a specification of the relevant terms.
2.3 Item Analysis
Bachman (1991) argues that the broad purposes of language tests are to predict test takers'
authentic capacity in the future and fundamentally to make decisions about the test takers' ability in
non-test contexts (e.g., employment, placement, grading, etc.). In order for the test to fulfil those
two primary functions properly, careful attention needs to be paid to the principles of test construction when designing a "good" test (Kaplan & Saccuzzo, 2005). At the same time, as the methodology of language testing has advanced, the tools available for test analysis have advanced in step (Bachman, 1989), making it possible to examine the contribution that an individual test item makes to the whole test (Hughes, 2003). In this section, two test analysis theories will be explained: classical test theory and item response theory.
2.3.1 Classical Test Theory (True-Score Theory)
An individual test taker's raw score is the sum of the marks of all correctly answered test items. However, test takers can make lucky guesses or, by mistake, mark questions incorrectly. From this perspective, in the early 1900s, a theoretical framework was established by the British psychologist Charles Spearman on the basis of the simple notion that a test score is the sum of a "true" score plus random "error" (Domino & Domino, 2006). For example, suppose that out of 10 items a test taker knows only seven answers but luckily answers nine items correctly. In this case, the true score is seven, the random error is two, and the observed score is nine. Conversely, if the test taker mismarks two known items and obtains a score of five, the true score is still seven. The formula is below.
X (Observed Score) = T (True Score) + E (Error) (2.1)
Classical test theory assumes that the T (true score) for an individual will not change with
repeated applications of the same test (Kaplan & Saccuzzo, 2005). However, because error is
random and varies in every test administration, each test taker's observed score always differs from the
person's true ability or characteristic (Sharkness & DeAngelo, 2011). That is, the amounts of E (error)
produce inconsistent X (observed score) and consequently increase item variability (DeVellis, 2006).
When the aim of the test is to "yield a score that is a relatively close reflection of the true score"
(DeVellis, 2006, p. S51), the smaller the dispersion (standard deviation of errors or variance) between
T and X is, the more reliable and accurate the test is.
Along with an understanding of formula 2.1, the calculation of the facility (item difficulty) value matters in this theory, because the facility value provides useful information through which we can simply compare the difficulty of items within a test. Dividing the number of correct responses by the total number of responses gives the facility value. For example, out of 100 test takers having taken the same test, if 25 test takers respond to item X correctly and 60 test takers respond to item Y correctly, the facility values of items X and Y are 0.25 and 0.60 respectively. In this case, item X is said to be more difficult than item Y. The facility value is also closely connected with the discrimination index (DI), because if the facility value of a certain item is too high or too low, the item may not be considered suitable for discriminating between test takers' overall abilities. The DI is an indicator of how well an item discriminates between strong test takers (STTs) and weak test takers (WTTs) (Hughes, 2003). The maximum discrimination index is 1, indicating perfect discrimination power, whereas 0 (zero) indicates that an item does not discriminate at all.
In computing the DI, a question arises about how to define high scorers versus low scorers. In this regard, Kelley (1939, cited in Domino & Domino, 2006) suggests a specific percentage which makes it possible to distinguish STTs from WTTs: 27%. That is, out of 100 test takers, the 27 highest scorers form the STT group and the 27 lowest scorers form the WTT group. The formula for calculating the DI is below.
DI = (Correct Responses of STTs − Correct Responses of WTTs) / The number of cases (either the number of STTs or of WTTs)   (2.2)

For example, out of 100 test takers, if 23 STTs and 6 WTTs respond to one specific item correctly, the DI of the item becomes 0.63.

0.63 (DI) = (23 (Correct Responses of STTs) − 6 (Correct Responses of WTTs)) / 27 (The number of cases)
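The two indices can be illustrated with a short sketch reproducing the worked examples above; the function names are illustrative, not taken from any testing package.

def facility_value(correct_responses: int, total_responses: int) -> float:
    # Proportion of test takers answering the item correctly.
    return correct_responses / total_responses

def discrimination_index(correct_stt: int, correct_wtt: int, group_size: int) -> float:
    # Formula 2.2: difference in correct responses between the STT and WTT
    # groups, divided by the size of one group (here 27% of 100 = 27).
    return (correct_stt - correct_wtt) / group_size

print(facility_value(25, 100))                    # 0.25 -> a fairly difficult item
print(round(discrimination_index(23, 6, 27), 2))  # 0.63 -> a "very good" item (see Table 2.2 below)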
As a means of interpreting the numerical values of the DI and conveying this interpretation to a non-technical audience, the verbal labels used to describe an item's discrimination can be related to ranges of values as follows:

Range of values   Interpretation
.40 and above     Very good items
.30 to .39        Reasonably good items but possibly subject to improvement
.20 to .29        Marginal items, usually needing and being subject to improvement
.19 and below     Poor items, to be rejected or improved by revision
Table 2.2 Levels of discrimination (Popham, 2000, cited in Green, 2013, p. 29)
In classical test theory, a test taker's total score is defined by the total number of items which are correctly answered, and the score (or mean score) is interpreted in terms of relative ability within the group of students taking the test (Domino & Domino, 2006; Lawson, 2006). For example, out of 100 test takers, if 9 test takers obtain higher scores than test taker Y, the percentile rank of test taker Y is 90%; in other words, 90 test takers obtain lower scores than test taker Y does. However, such information does not account for the relationship between the person and the item. That is, classical test theory fails to assess how a test-taker responds to a certain item, because test-takers' trait levels and item difficulty levels are not intrinsically connected under classical test theory (Furr & Bacharach, 2008). Hence, in order to "formalize the relationship between the latent trait of a person and the item difficulty levels using a mathematical model" (Wilson, 2013, p. 3770), a new theory called item response theory was developed.
2.3.2 Item Response Theory (IRT)
Under classical test theory, test takers' ability levels are measured by simply adding responses across items and then converting the sum into a standard score (Embretson, 1999). In addition, item difficulties are calculated simply by dividing the number of correct responses by the total number of responses. Under classical test theory, therefore, the information about a person's latent level and an item's properties cannot yield the probability that a given test taker will respond to a given item correctly. That is, classical test theory does not account for the relationship between the person and the item.
In the IRT model, on the other hand, test-takers' ability levels and item difficulties are independent variables that are estimated separately (Embretson, 1999). In addition, both latent levels and item difficulties are estimated on a common unit of measurement called the 'logit scale' (Bond & Fox, 2001). Since both use the same scale, it becomes possible to quantify the probability that a test-taker will pass or fail a particular item by comparing the test-taker's ability logit with the item's difficulty logit (Henning, 1987). For example, if a test-taker's latent trait is +1.0, the probability of the test-taker passing an item whose difficulty logit is smaller than +1.0 is above 50 per cent; for an item whose difficulty logit is larger than +1.0, the probability falls below 50 per cent (this will be discussed in detail below). In this sense, Crocker and Algina (1986) accordingly define item analysis as the computation and examination of any statistical property of test takers' responses to an individual test item.
Figure 2.1 The relationship between ability and item response (Crocker & Algina, 1986, p. 341)
Intuitively, if a person has a low trait level, the likelihood of the person passing most items on a very hard test is low. On the other hand, if a person has a high trait level, it is more likely that the person will pass most items on a very hard test. The relationship between latent trait and item response is shown in Figure 2.1. According to the graph, as the logit of the latent trait (ability) moves away from 0 (zero), the probability of passing an item rises or falls accordingly. To be specific, as the logit of the latent trait rises above zero, the probability of the person selecting the correct answer increases; as it falls below zero, the probability decreases. Such a relationship between a person's ability and item difficulty is also well schematised in the item-person map.
Figure 2.2 Item-person map
As illustrated in Figure 2.2, on the item-person map the test items are placed on one side and the test-takers on the other (Murphy & Davidshofer, 1991; Mellenbergh, 1996). According to Figure 2.2, the person ability estimates range from approximately +2.4 to -2.4 logits. The test-taker with the highest ability logit is SAM and the one with the lowest is BEN. Based on their places on the map, it can be presumed that SAM is considerably more able than BEN. Likewise, the item estimates range from approximately +2.8 (item 3) to -2.4 (item 8) logits. Item 3, at the very top of the map, is measured as the toughest item, while item 8 is the easiest one. The locations of items 3 and 8 indicate that most test-takers would be very likely to answer item 3 incorrectly in this test, whereas very few test-takers would answer item 8 incorrectly.
On the basis of the locations of the items and persons on the map, it also becomes possible to roughly estimate the probability of a test-taker passing each particular item by comparing the latent trait logit with the item difficulty logit (Kaplan & Saccuzzo, 2005). That is, the relationship between a person's ability and an item's properties can be accounted for through this map. For example, the probability that RIO passes item 7 is high, because RIO is situated higher than item 7. However, although RIO is situated higher than item 6, the gap between RIO and item 6 is relatively small; it would therefore not be surprising if RIO responded to item 6 incorrectly. In addition, RIO is located at the same place as item 5, that is, the logit of the latent trait is identical to that of the item difficulty. In this case, the probabilities of finding the correct answer and the wrong answer are the same.
To be specific, it is possible to predict the approximate probability of a test taker's success on a given item (Green, 2013) if the logits of the person's ability and the item's difficulty are provided. This is done by using a conversion table such as the one shown in Table 2.3.
Positive (above zero)                              Negative (below zero)
Difference between      Probability of             Difference between      Probability of
a person's ability      answering the              a person's ability      answering the
and item difficulty     item correctly             and item difficulty     item correctly
 5.0                    99%                        -5.0                     1%
 4.6                    99%                        -4.6                     1%
 4.0                    98%                        -4.0                     2%
 3.0                    95%                        -3.0                     5%
 2.2                    90%                        -2.2                    10%
 2.0                    88%                        -2.0                    12%
 1.4                    80%                        -1.4                    20%
 1.1                    75%                        -1.1                    25%
 1.0                    73%                        -1.0                    27%
 0.8                    70%                        -0.8                    30%
 0.5                    62%                        -0.5                    38%
 0.4                    60%                        -0.4                    40%
 0.2                    55%                        -0.2                    45%
 0.1                    52%                        -0.1                    48%
 0.0                    50%                         0.0                    50%
Table 2.3 Conversion table (Green, 2013, p. 165)
For example, Figure 2.2 shows that the ability logit of TIM is 0 (zero) and the difficulty logit of item 7 is -1.0, so the logit difference between the person's ability and the item's difficulty is 1.0. According to Table 2.3, if the difference is 1.0 above zero, the test taker has a 73 per cent chance of answering the item correctly. On the other hand, if the difference between a person's ability and an item's difficulty is smaller than 0 (zero), the chance will be less than 50 per cent. From this perspective, after analysing test scores using the IRT model, if a 'weaker' examinee with a below-zero ability logit answers several 'very' difficult questions correctly, testers are usually advised to investigate the examinee carefully and find the reasons (Henning, 1987). In this case, the weaker person who responds to the several difficult questions correctly is flagged as misfit data (this topic will be dealt with later).
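Table 2.3 follows from the logistic form of the dichotomous Rasch model, in which the probability of a correct response depends only on the difference between the ability logit and the difficulty logit. A minimal sketch (the function name is illustrative):

import math

def p_correct(ability_logit: float, difficulty_logit: float) -> float:
    # Rasch model: P(correct) = exp(d) / (1 + exp(d)), where d = ability - difficulty.
    d = ability_logit - difficulty_logit
    return 1 / (1 + math.exp(-d))

print(round(p_correct(0.0, -1.0), 2))  # 0.73: TIM (ability 0) on item 7 (difficulty -1.0)
print(round(p_correct(0.0, 2.2), 2))   # 0.10: an item 2.2 logits above the person's ability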
In IRT, there are three logistic models: the one-parameter model, the two-parameter model, and the three-parameter model. The one-parameter model (also called the Rasch model) uses only item difficulty in order to measure test takers' ability, whereas the two-parameter model adds an item discrimination parameter and the three-parameter model adds a guessing (pseudo-chance) parameter on top of those two. As the number of parameters increases, the process of computation becomes more complex (Henning, 1987). Studies using these models are numerous (Skehan, 1989; Henning, 1987; Domino & Domino, 2006). However, in educational measurement, the one-parameter model tends to be preferred by language testers over the other two models for its simplicity of computation, easy interpretation, and the small sample size required, even though there has been much controversy over the choice of model (Henning, 1987).
2.3.2.1 The Rasch model: Fit statistics
In the Rasch model, fit (infit and outfit) statistics are used to detect discrepancies between the empirical data and the Rasch model's prescriptions (Bond & Fox, 2001). By statistically indicating the degree of match between observed and expected performance, fit statistics report how well the empirical data accord with the Rasch model (Linacre, 2002). Routinely, fit statistics are reported in both an unstandardized and a standardized form: the unstandardized form is the mean square (MNSQ) and the standardized form is the standardized t (ZSTD).
A fit MNSQ value provides information about "how confident we can be in the measures
(logits) associated with the persons and the items" (Green, 2013, p. 167). Depending on the MNSQ
value, it can be judged whether items or persons fit the Rasch model. The acceptable MNSQ value for a person or an item ranges from +0.5 to +1.5, which is considered productive for measurement (Green, 2013). On the other hand, all data whose MNSQ values are not between +0.5 and +1.5 are classified as misfit data, indicating that the data do not fit the Rasch model.
If the MNSQ value is less than +0.5, it means that a person or an item is performing in too predictable a way. For example, if a person with a certain ability responds to all easy questions correctly and to all difficult questions incorrectly, the MNSQ value may be lower than +0.5. "The MNSQ value of lower than +0.5 is considered to be 'less productive' for measurement" (Green, 2013, p. 169). On the other hand, if the MNSQ value is higher than +1.5, it means that persons or items are performing in an unpredictable way. For example, if an able person responds to an easy item incorrectly, the MNSQ value can be higher. Because of this unpredictability, "the MNSQ value of higher than +1.5 is considered 'unproductive' for measurement" (Green, 2013, p. 169). In the Rasch model, the 'unproductive' data (MNSQ > +1.5) are usually the focus of investigation rather than the 'less productive' data (MNSQ < +0.5).
"The infit and outfit statistics adopt slightly different techniques for assessing an item's fit in the
Rasch model" (Bond & Fox, 2001, p. 43). The infit MNSQ assigns relatively more weight to the
performances of persons which are closer to the item difficulty value (Ibid.). Thus, if a person
incorrectly answers the items particularly close to their ability level, the infit MNSQ value can be
affected. On the other hand, outfit MNSQ is more sensitive to the influence of outlying scores (lucky
guesses of low performers and careless mistakes of high performers). That is, outfit MNSQ is related to
how a person responds to the items that are very easy (item difficulty logit < -2.0) or very hard
(item difficulty logit > +2.0). Thus, if a very able person does not respond to a very easy item
correctly, the outfit MNSQ value can be affected.
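For illustration, the two mean squares can be computed from a response matrix once ability and difficulty logits are known. The sketch below is a simplified rendering of the standard formulas (outfit as the unweighted mean of squared standardized residuals, infit as the information-weighted mean); operational programs such as Winsteps refine these computations, so the figures are indicative only.

import numpy as np

def fit_mnsq(X, theta, b):
    """Simplified item infit/outfit mean squares for a 0/1 response matrix X
    (persons x items), person ability logits theta, item difficulty logits b."""
    p = 1 / (1 + np.exp(-(theta[:, None] - b[None, :])))  # expected scores
    w = p * (1 - p)                                       # model variance of each response
    r = X - p                                             # residuals (observed - expected)
    outfit = (r ** 2 / w).mean(axis=0)                    # sensitive to outlying responses
    infit = (r ** 2).sum(axis=0) / w.sum(axis=0)          # weighted towards on-target persons
    return infit, outfit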
It is usually easy to detect the reason for a high outfit MNSQ (> +1.5), while a high infit MNSQ (> +1.5) does not always offer a clear reason why the person or item responded in such a way (Green, 2013). In this sense, a high infit MNSQ is treated as a greater threat to the measurement system, whereas a high outfit MNSQ is considered less of a threat. In other words, "aberrant infit scores usually cause more concern than large outfit statistics" (Bond & Fox, 2001, p. 43). From this perspective, in the Rasch model more attention is routinely paid to infit values than to outfit values (Bond & Fox, 2001).
Another value which can distinguish fit data from misfit data is the ZSTD (also called standardized t or fit t). As an alternative measure of the degree of fit of an item or a person to the Rasch model, ZSTD values report the statistical probability of the MNSQ statistics occurring by chance when the data fit the Rasch model (Linacre, 2002). The acceptable ZSTD value ranges from -2.0 to +2.0. An infit or outfit ZSTD higher than +2.0 (underfit) or lower than -2.0 (overfit) means less compatibility with the Rasch model (Bond & Fox, 2001). In general, however, if items and persons have infit MNSQ values within the acceptable range between +0.5 and +1.5, the ZSTD statistics can be ignored (Linacre, 2002; Bond & Fox, 2001).
PERSONS    ITEM: 1  2  3  4  5  6  7  8  9  10    SUM
TOM              0  0  0  0  0  0  0  0  0  0
BEN              1  1  1  0  0  0  0  0  0  0      1
KIM              1  1  1  1  0  0  0  0  0  0      2
ANN              1  1  1  1  0  1  0  0  0  0      3
TIM              1  1  0  1  1  1  0  0  0  0      3
RIO              1  1  1  1  1  1  0  0  0  0      4
SUE              1  1  1  0  1  0  1  1  0  0      4
SAM              1  1  1  1  1  1  1  0  0  0      5
JUN              1  1  1  1  1  1  1  1  1  1
ROB              1  1  1  1  1  1  1  1  1  1
(* 0 = wrong response, 1 = right response; SUM as reported in Table 2.5, counting items 3-8 only)
Table 2.4 Scoring matrix for a 10-item vocabulary test (Henning, 1987, p. 119)

PERSON ENTRY   ABILITY MEASURE   SCORE   INFIT MNSQ   INFIT ZSTD   OUTFIT MNSQ   OUTFIT ZSTD
BEN            -2.16             1       0.49         -0.76        0.28          -0.27
KIM            -0.99             2       0.51         -1.19        0.39          -0.45
ANN            -0.04             3       0.62         -0.82        0.48          -0.66
TIM            -0.04             3       1.30          0.75        1.79           1.12
RIO             0.95             4       0.37         -1.31        0.29          -0.77
SUE             0.95             4       2.68          2.25        2.78           1.61
SAM             2.20             5       0.47         -0.78        0.25          -0.32

ITEM ENTRY     DIFFICULTY MEASURE   SCORE   INFIT MNSQ   INFIT ZSTD   OUTFIT MNSQ   OUTFIT ZSTD
3              -2.20                6       1.56          0.92        1.41           0.73
4              -1.08                5       1.15          0.45        1.37           0.67
5              -0.24                4       0.59         -1.10        0.48          -0.86
6              -0.24                4       0.93         -0.04        0.85          -0.06
7               1.34                2       0.62         -0.87        0.46          -0.41
8               2.42                1       1.21          0.52        0.78           0.30
Table 2.5 The Rasch analysis results (person data and item data) based on the data in Table 2.4
Table 2.4 is the scoring matrix of a 10-item vocabulary test, and Table 2.5 is the Rasch analysis report of the data in Table 2.4. The calibration was done with Winsteps (version 3.81.0). Note first that the numbers of examinees and items in the two tables differ: Table 2.4 contains 10 persons and 10 items, while Table 2.5 presents 7 persons and 6 items. In the Rasch model, extreme scores (all correct and all wrong) always fit the model exactly, so data with extreme scores are excluded from the computation of fit statistics (Linacre, 2002). Thus, TOM, JUN, and ROB are excluded. After excluding those persons, items 1, 2, 9, and 10 are also found to have extreme scores, so those 4 items are deleted. Table 2.5 indicates that SAM (+2.2) is the most able person and BEN (-2.16) is the weakest. In addition, the table shows that item 8 (+2.42) is the toughest and item 3 (-2.20) is the easiest.
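For readers who wish to experiment, the calibration can be approximated outside Winsteps. The following is a rough sketch of joint maximum-likelihood estimation on the non-extreme portion of Table 2.4; it is not the Winsteps algorithm, and its estimates will differ somewhat from Table 2.5 because Winsteps applies additional corrections.

import numpy as np

# Non-extreme portion of Table 2.4: persons BEN..SAM (rows), items 3-8 (columns).
X = np.array([
    [1, 0, 0, 0, 0, 0],  # BEN
    [1, 1, 0, 0, 0, 0],  # KIM
    [1, 1, 0, 1, 0, 0],  # ANN
    [0, 1, 1, 1, 0, 0],  # TIM
    [1, 1, 1, 1, 0, 0],  # RIO
    [1, 0, 1, 0, 1, 1],  # SUE
    [1, 1, 1, 1, 1, 0],  # SAM
])

theta = np.zeros(X.shape[0])   # person ability logits
b = np.zeros(X.shape[1])       # item difficulty logits
for _ in range(500):           # damped Newton-Raphson updates
    p = 1 / (1 + np.exp(-(theta[:, None] - b[None, :])))
    info = p * (1 - p)
    theta += 0.5 * (X - p).sum(axis=1) / info.sum(axis=1)
    b -= 0.5 * (X - p).sum(axis=0) / info.sum(axis=0)
    b -= b.mean()              # anchor the scale: item difficulties centred at zero

print("Item difficulty logits (items 3-8):", np.round(b, 2))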
From the Rasch analysis, information about the abilities of the persons and the difficulties of each item can be obtained. According to Table 2.5, the infit MNSQ values of most persons and items are within the acceptable range, except for SUE and item 3. SUE's infit MNSQ and ZSTD are +2.68 and +2.25 respectively. Those two values imply that SUE does not fit the Rasch model properly; that is, SUE is identified as a misfit datum. Looking into SUE's responses, it can be seen that SUE did not correctly answer items 4 and 6, which were relatively easy for her. Given SUE's latent trait, these unpredictable responses raise the infit MNSQ and ZSTD values. Item 3, however, is not classified as a misfit datum: even though its infit MNSQ is higher than +1.5, its infit ZSTD is within the acceptable range (< +2.0). In this case, it can be said that item 3 fits the Rasch model.
2.4 Conclusion to Chapter 2
Language testing shares with the study of language learning the goal of understanding the process of language learning (Shohamy, 2000). Thus, the theory of language testing is likely to be congruent with psychological knowledge about language learning (Lado, 1961). From this perspective, there has long been a tension in the field of language testing between the analytical and the integrative (Davies, 1978). For example, Lado (1961) viewed language as a linguistic phenomenon and held that testing should target linguistic abilities; he argued that a language must be broken down into its linguistic components. On the other hand, Taylor (2004) notes that recent language testing tends to pay much attention to how different language variables interact with one another. In language testing, however, such a distinction is "not a real or an absolute one" (Davies, 1978, p. 151). Rather, when developing a test, more attention needs to be paid to the test takers who actually take the tests and to their needs, which can vary from one context to another (Shohamy, 2000). To sum up, I do not believe that there exists a one-size-fits-all test instrument; I believe, rather, in the test designer who can fit any context by understanding it and the demands of the examinees (or of the users of the test results).
In this chapter, the weighting methods and the ways of analysing test items were dealt with.
In the next chapter, the method which was implemented in the current research will be discussed.
Chapter 3. Research Methodology
3.1 Introduction to Chapter 3
The previous chapter explained two item weighting methods and the ways of analysing test
items. In this chapter, firstly, the context in which Korean teachers design a summative test will be
explained. After the explanation about the context, a research strategy will be discussed and the
overall procedures of this research will be delineated.
3.2 Context
In South Korea, high school students study for three years, and each year is divided into two terms. In each term, students take a paper-and-pencil examination twice. Along with the paper-and-pencil examinations, performance-based tests (e.g. essay writing, student portfolios, group projects) are also administered in order to test students' actual performance (Brown, 1994). By summing the scores which students obtain from the paper-and-pencil tests and the performance-based tests, students' term scores are produced and the final results are ranked. Subsequently, the rank is converted into a percentage and, based on the percentage, students' grades are assigned from 1 to 9 (see Table 3.1). The lower the grade, the better the result.
Grade   Percentage      Frequency Distribution   Number of students in each
                        (cumulative)             grade (out of 100 students)
1       above 96%        4%                       4
2       95% to 89%       11%                      7
3       88% to 77%       23%                      12
4       76% to 60%       40%                      17
5       59% to 40%       60%                      20
6       39% to 23%       77%                      17
7       22% to 11%       89%                      12
8       10% to 4%        96%                      7
9       3% and below     100%                     4
Table 3.1 Grade index table
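The conversion in Table 3.1 amounts to a simple cumulative lookup. The helper below is purely hypothetical (it is not part of any school system); it takes a student's cumulative percentile rank, where the top 4% receive grade 1:

def grade_from_percentile(top_percentile: float) -> int:
    # Cumulative cut-offs from Table 3.1: top 4% -> grade 1, top 11% -> grade 2, etc.
    cutoffs = [(4, 1), (11, 2), (23, 3), (40, 4), (60, 5),
               (77, 6), (89, 7), (96, 8), (100, 9)]
    for upper, grade in cutoffs:
        if top_percentile <= upper:
            return grade

print(grade_from_percentile(10))  # 2: a student ranked 10th out of 100 falls in the top 11%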
In South Korea, subject teachers in each school design their summative tests themselves, so the examinations which students take differ between schools. Teachers autonomously decide the language categories, test objectives, and even the difficulty levels before students take the examination. After completing a test design, teachers submit the test sheets together with a two-dimensional table of test specifications (Figure 3.1) to the school head office. The two-dimensional table of test specifications contains specific information about answers, item marks, difficulty levels, language categories, and objectives.
Figure 3.1 Example of two-dimensional table of test specifications
As can be surmised from Figure 3.1, teachers estimate the difficulty level of each item and, based on their intuitive feel for item difficulty (Rudner, 2001), allocate different weightings to the items: the more difficult they believe an item to be, the higher the mark it is allocated. The primary reason for allocating different marks to the items lies in minimising the number of students who gain the same total score and thereby helping to classify students (Shohamy, 1998), because the Korean education office employs a policy against ties in total scores. For example, out of 100 students, suppose that the students ranked 9 to 13 have obtained the same score. According to Table 3.1, the students ranked 9 to 11 would ordinarily receive grade 2. However, the policy does not allow this; instead, all the tied students are allocated grade 3, and only 4 students (ranked 5 to 8) obtain grade 2. Teachers do not want their students to receive such disadvantageous results, so they allocate different marks to the test items. By adopting this policy, the education office believes that the discrimination power of the test items becomes high and that the reliability of the test results can be enhanced.
The school test grade is used as a means by which students gain entry to university. Thus, most students tend to study assiduously so as to obtain the highest scores they can. That is, summative tests play a high-stakes role in South Korea. It follows, then, that because the tests are so important, teachers should pay careful attention to the overall procedure of designing them. Among the many factors related to test design, the way items are weighted will be researched in this dissertation. To be specific, most Korean teachers use an 'a priori' weighting method in deciding item difficulty levels and assign different weights to the test items. My belief, which led to the development of this study, was that teachers' decisions about item difficulty levels may not always accord with students' actual responses to the items. In order to identify the gap between teachers' decisions about item difficulty levels and the actual difficulty levels of the test items, item analysis using the Rasch model will be implemented.
3.3 Research Strategy
The purpose of this research is to identify the gap between actual item difficulty and teachers' item weighting. Since teachers' decisions about item difficulty levels are well described in the two-dimensional tables of test specifications collected from the schools (see appendix 3), the focus of this research is to measure the actual difficulty value of each test item. From this perspective, quantitative research, which views social reality as an objective reality and "advocates the application of the methods of the natural sciences to the study of social reality and beyond" (Bryman, 2012, p. 28), seems more appropriate than a qualitative approach for conducting this study.
Of the quantitative research methods, item analysis will be used in this research. Crocker and Algina (1986) define item analysis as the computation and examination of any statistical property of test takers' responses to an individual test item. In this sense, epistemological objectivity needs to be secured in this research when processing the collected data precisely and transforming them into numerical values (Kumar, 2005). The data obtained will thus be analysed and interpreted with the statistical computer program Winsteps (version 3.81.0).
3.4 Research Design
3.4.1 Overview of the Research Procedure
As the first step, the test items will be arranged by the difficulty levels which the teachers decided prior to the test administration. The two-dimensional tables of test specifications which I obtained from X, Y, and Z high schools (see appendix 3) will be the source of reference for this arrangement. Through this process, how the teachers in X, Y, and Z schools have weighted the test items will be illustrated. In relation to the second research question, the test items will then be rearranged by language category and item difficulty level, which I believe will identify which language categories teachers believe to be difficult or easy for students.
After classifying the test items in terms of the predetermined difficulty levels and the language categories, I will calibrate the actual difficulty values (logits) of each item using the Rasch model. As preparation for measuring the difficulty logits of the test items, I will dichotomously transform students' responses into "1" for a correct response and "0 (zero)" for a wrong response (this procedure will be explained in detail in 3.4.3). Based on the dichotomously transformed data, I will measure the item difficulty logits.
However, in order to measure the difficulty logits of the items precisely, the elements which can affect the calibration need to be removed. As explained in chapter 2, high person infit MNSQ (> +1.5) and ZSTD (> +2.0) values can statistically compromise the measurement of the item difficulty logits (Linacre, 2002). In this sense, I will calculate the person infit MNSQ and ZSTD values, and all misfit data whose infit MNSQ and ZSTD exceed +1.5 and +2.0, respectively, will be filtered out, as sketched below. The whole process of measuring infit MNSQ and infit ZSTD will be carried out with the statistical software Winsteps (version 3.81.0).
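As a sketch of this filtering step (in the study itself the filtering is done inside Winsteps; the file name and column names below are assumed purely for illustration):

import pandas as pd

# Hypothetical person-fit report exported from the Rasch analysis.
persons = pd.read_csv("person_fit_report.csv")  # assumed columns: PERSON, INFIT_MNSQ, INFIT_ZSTD

# Keep only persons who fit the Rasch model: infit MNSQ <= +1.5 and infit ZSTD <= +2.0.
keep = (persons["INFIT_MNSQ"] <= 1.5) & (persons["INFIT_ZSTD"] <= 2.0)
fit_persons = persons[keep]
print(f"Removed {len(persons) - len(fit_persons)} misfit persons out of {len(persons)}")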
Once the misfit person data which could affect the difficulty measurement have been detected and removed, item analysis using the Rasch model will be carried out again in order to calculate the actual item difficulty logits. In this process, using Baker's (1985) verbal terms, the calculated item difficulty logits will be classified into five levels according to the index (logits): very easy, easy, medium, hard, and very hard. Figure 3.2 shows the difficulty index and the corresponding verbal terms.
Figure 3.2 Difficulty level and index (Baker, 1985)
After the actual difficulty logits of the test items have been computed with the help of Winsteps (version 3.81.0), two comparisons will be made: one between teachers' estimates of the item difficulty levels and the actual item difficulty levels, and the other between teachers' beliefs about the difficulty levels of the language categories and students' actual responses to those categories. Afterwards, based on the statistical results, the hypothesis will be checked and knowledge (answers to the research questions) will be generated. The overall blueprint of the current research can be schematised as in Figure 3.3.
Data collection and random sampling
→ Classifying the items and language categories in relation to difficulty levels
→ Detecting misfit person data and removing all misfit data detected
→ Calibrating actual item difficulty logits
→ Comparing teachers' beliefs with the Rasch analysis results / answering the research questions
Figure 3.3 Overall procedures of the research
3.4.2 Data Collection and Sampling
In this research, two different kinds of samples are required: the two-dimensional tables of test specifications (with test items) and the test results showing how students responded to each test item. These two samples were obtained from three different public high schools in Daejeon, South Korea, located within a few miles of one another. Since students in this city are randomly assigned to high schools by the city education office on the basis of where they live, the students of these three schools are demographically similar to a large extent, which enhances the reliability and validity of the findings to a certain extent (Scott & Morrison, 2006).
                         X high school     Y high school     Z high school
Type of school           Co-educational public school (all three)
Grade of subjects        3rd grade (all three)
Location                 Same part of the city (all schools within a 3-mile radius)
Size of collected data   262               362               195
Item type                Multiple choice (all three)
Number of test items     28                28                29
Table 3.2 Sample summarisation
The primary process of this research is to calculate the actual difficulty levels of the test items using the Rasch model. In order to do so, it is necessary to collect data about how students responded to the test items. Since students' response data are computerised and stored in Excel files such as the one in Figure 3.4 after students have taken the test, obtaining the students' response data is not difficult. Rather, the difficulty lies in deciding the sample size needed for accurate research (Lewin, 2011). With regard to sample size, Henning (1987) suggests that the recommended sample size for the Rasch model is between 100 and 200 for statistically meaningful results. Thus, 200 discrete samples will be randomly extracted using the Excel function "RANDUNIQ"; a sketch of an equivalent procedure is shown below.
Figure 3.4 Example of students' response data (a dot means a correct answer and a number means an incorrect answer)
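An equivalent sampling step can be sketched outside Excel as follows (the file name is assumed; the fixed seed merely makes the draw reproducible):

import pandas as pd

# Hypothetical: one school's computerised response sheets loaded from Excel.
responses = pd.read_excel("school_X_responses.xlsx")

# Draw 200 unique students at random, mirroring the role of the "RANDUNIQ" function.
sample = responses.sample(n=200, random_state=42)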
3.4.3 Data Analysis
Depending on the features of the data, an appropriate method of data coding should be implemented; if not, the findings will not be valid and reliable. The main aim of data analysis in this research is to calculate the difficulty logits of all test items and then compare the items' actual difficulty levels with the teachers' item weighting. In order to do so, students' responses will be dichotomously coded as "1" for correct responses and "0 (zero)" for wrong responses (see Table 2.4 in chapter 2); a minimal sketch follows below. Based on the dichotomously transformed data, the Rasch analysis will be done with the aid of Winsteps (version 3.81.0). Through the Rasch analysis, the raw data will be turned into a report containing much information (Oppenheim, 1992) that can be used to examine the research hypothesis about teachers' decisions on item weighting.
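A minimal sketch of this recoding, assuming the cell conventions of Figure 3.4 (a dot for a correct answer, a digit for the chosen distractor):

def dichotomise(cell: str) -> int:
    # "." marks a correct answer; any digit marks the distractor chosen.
    return 1 if cell.strip() == "." else 0

row = [".", "3", ".", ".", "5"]           # one student's responses to five items
print([dichotomise(c) for c in row])      # [1, 0, 1, 1, 0]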
Black (1999) claims that well-presented descriptive statistics with visual aids can enhance the comprehension of the outcomes of quantitative research. Similarly, Kumar (2005) suggests that visualising the findings is an important means of communication and that effective data-display techniques can help readers understand the findings clearly and easily. From this perspective, figures such as item-person maps will be included in this research, which may help readers recognise the hierarchy of item difficulty and the relationship between items and test-takers' abilities (e.g. Figure 2.2). In addition, a number of tables will be used to show information about item difficulty logits and, mainly, to identify the difference between teachers' item weighting and the actual item difficulty levels.
3.4.3.1 Way of Categorising Language Elements
In order to answer the second research question, it is necessary to categorise language
elements. As a way of classifying the language elements, I will refer to the directions of the test items,
because the directions of test items reflect the language category which teachers want to assess.
First of all, I will translate the Korean directions of all test items into English. After changing
the Korean directions into English ones, items with the same direction will be combined into the
same language category. Take X15 (X high school's 15th item) and Y3 (Y high school's third item) as an
example. As shown in Figure 3.5, the directions of X15 and Y3 were originally written in Korean, but I
will translate them into English. Those two items have the same direction, so they will be
combined into the same language category, "finding a suitable discourse marker" (see Table 4.2 and
appendix 3). In this way, all items will be categorised on the basis of the directions of the test items.
Figure 3.5 Examples of translating the Korean directions into English
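As a sketch of this grouping step, the snippet below assumes hypothetical (item, translated
direction) pairs; the X15/Y3 pairing follows the example above, while the exact wording of each
direction string is illustrative rather than taken from the test papers.

```python
# A minimal sketch of the grouping step: items sharing a translated direction
# are merged into one language category. The direction strings are invented.
from collections import defaultdict

items = [
    ("X15", "finding a suitable discourse marker"),
    ("Y3",  "finding a suitable discourse marker"),
    ("X4",  "distinguishing a pronoun which indicates a different reference"),
]

categories = defaultdict(list)
for item_id, direction in items:
    categories[direction].append(item_id)   # identical direction -> same category

for direction, members in categories.items():
    print(f"{direction}: {members}")
```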
3.4.4 Validity and Reliability
Validity refers to the accuracy of the data and the appropriateness of the research questions
being investigated (Denscombe, 2010). In order to attain trustworthy findings, therefore, careful
attention has to be paid to narrowing the gap between what is true and what is measured in relation
to the research questions. In this research, for the purpose of making the data accurate, a process of
data refinement will be executed. To be specific, before calibrating the items' difficulty logits,
person infit MNSQ and ZSTD will be calculated, and all person data whose infit MNSQ exceeds +1.5
or whose infit ZSTD exceeds +2.0 will be deleted, because such misfit data may affect the
item difficulty logit calibration. After excluding the misfit person data which can affect precise difficulty
logit measurement, only the person data that fit the Rasch model will be analysed, and the virtual difficulty
of each item will be calculated through the Rasch analysis. I believe that such a data refinement
process may be able to consolidate the validity of this research.
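A minimal sketch of this refinement rule is given below, assuming hypothetical person records
carrying the infit statistics that Winsteps reports; entries 3 and 80 mirror the misfit cases later
shown in Table 4.3, while entry 12 is invented for contrast.

```python
# A minimal sketch of the data-refinement rule described above.
persons = [
    {"entry": 3,  "infit_mnsq": 1.58, "infit_zstd": 2.63},   # misfit
    {"entry": 80, "infit_mnsq": 1.67, "infit_zstd": 3.46},   # misfit
    {"entry": 12, "infit_mnsq": 0.98, "infit_zstd": -0.21},  # acceptable fit (invented)
]

def is_misfit(person):
    # Delete person data whose infit MNSQ exceeds +1.5 or infit ZSTD exceeds +2.0.
    return person["infit_mnsq"] > 1.5 or person["infit_zstd"] > 2.0

refined = [p for p in persons if not is_misfit(p)]
print([p["entry"] for p in refined])   # [12]
```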
Along with the concept of validity, reliability also matters in this research, because reliability
serves as one of the "important safeguards against the contamination of scientific data" (Krippendorff, 1980, p.
129). Since "validity presumes reliability" (Bryman, 2012, p. 173), a high level of reliability will
be secured through the data refinement process in this research. In addition, in terms of sample size, as
claimed by Henning (1987), by attaining between 100 and 200 samples, the reliability of
the statistical results may be secured to a certain extent.
3.5 Ethical Considerations
In analysing students' responses to the test items, ethical issues may not be at the forefront
(Neuman, 2005), because the people being studied are not directly involved in this research. Nevertheless,
the students' response data and the two-dimensional table of test specifications contain private
information that needs to be kept confidential, because both data sets include school names, students'
names, and students' numbers, and disclosing such information would raise ethical problems. Thus, I
asked teachers to remove all private information before they provided the data to me. After
collecting the data from the teachers, I double-checked whether all private information had been
deleted by the teachers. All deleted information about school names and students' names was
replaced with pseudonyms to conform to ethical guidelines. Furthermore, the test items themselves could
be one of the concerns affecting the ethics of the current research. However, the test items are
uploaded on each school's webpage after the examination, so anyone who joins the webpage can access
them.
3.6 Conclusion to Chapter 3
In this chapter, I have explained the overall procedures of my quantitative research and its
theoretical underpinning with regard to the research questions. Moreover, the process of collecting
and analysing the data has been introduced. In the following chapter, the data will be analysed and
the findings of this research will be presented.
Chapter 4. Data Analysis
4.1 Introduction to Chapter 4
In the previous chapter, the context in which teachers develop test items was discussed.
In addition, the research method of this dissertation was explained along with issues relating to
validity and reliability. This chapter reports on how the data collected from the three schools were
analysed quantitatively with the help of Winsteps. Subsequently, based on the statistical information
which Winsteps provided, the research questions will be addressed.
4.2 Selecting Data
The size of the samples I obtained from the three public schools varied by institution: 262
samples were collected from X high school, 362 from Y high school, and 195 from Z high school.
Henning (1987) notes that the appropriate sample size for gaining statistically meaningful results
with the Rasch model is from 100 to 200. Therefore, I reduced the sample sizes of X and Y school to
200 using a random sampling process. In this process, the Excel function "RANDUNIQ", which makes
it possible to extract 200 unique random numbers, was used (see appendix 1 and 2).
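For illustration, a minimal sketch of the same down-sampling step is given below; random.sample
plays the role the text assigns to "RANDUNIQ" (drawing unique random record numbers), and the seed
is purely illustrative, not taken from the dissertation.

```python
# A minimal sketch of the random down-sampling step for X high school.
import random

random.seed(2014)                                    # illustrative seed
x_school_rows = list(range(1, 263))                  # 262 collected responses
selected = sorted(random.sample(x_school_rows, 200)) # 200 unique row numbers
print(len(selected), selected[:10])
```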
4.3 Test Item Classification
While designing test items, teachers tend to decide the difficulty level of each test item and
allocate a different mark to each item on the basis of their prior knowledge and experience.
Likewise, in X, Y, and Z high school, teachers had decided the difficulty level of each item before
students took the test, and they had allocated different marks to the items according to their feel for
the items' difficulty levels (Rudner, 2001). That is, the more difficult they believed a test item
was, the higher the mark they assigned to it. However, as shown in Table 4.1, two test items (X14
and X24) of X high school were exceptions: even though the difficulty level of those two items was
believed to be medium, teachers allocated lower marks to them than to some of the easy test items (e.g. X10 or
X12). Aside from those two items, marks were hierarchically assigned to all items on the basis
of the difficulty levels. Through the item classification of those three schools, I could observe the
correlation between difficulty level and mark allocation. The capital letters X, Y, and Z in the table
refer to X, Y, and Z high school, and the number following the capital letter is an item number.
Thus, X10 means X high school's tenth test item.
In addition, Table 4.1 shows that teachers had assigned a wide range of marks to items
within a single test. In X high school's English test, teachers made 28 test items and used 22
different marks; between the lowest mark (2.5) and the highest mark (4.6), 20 further marks
were allocated to items. Similarly, 7 different marks were allocated to Y high school's test items
and 10 different marks to Z high school's items. Assigning such a wide range of marks within a
single test created a further stratification between items with the same difficulty level. In other words,
even if the difficulty level of two items is the same, by assigning different marks, teachers create
another difficulty layer between them. Because of these stratifications, in some cases the mark
difference between items with different difficulty levels (e.g. X12 and X3) is smaller than the
difference between items with the same difficulty level (e.g. X4 and X15).
X High school            Y High school            Z High school
ITEM   MARK   LEVEL     ITEM   MARK   LEVEL     ITEM   MARK   LEVEL
X4     2.5    E         Y19    3.1    E         Z19    3      E
X17    2.6    E         Y26    3.1    E         Z29    3      E
X18    2.7    E         Y15    3.2    E         Z4     3.1    E
X8     2.8    E         Y25    3.2    E         Z6     3.1    E
X15    2.9    E         Y21    3.3    E         Z22    3.1    E
X24    3      M         Y27    3.3    E         Z26    3.1    E
X16    3.1    E         Y1     3.6    M         Z28    3.1    E
X14    3.2    M         Y2     3.6    M         Z9     3.2    E
X10    3.3    E         Y3     3.6    M         Z10    3.3    M
X12    3.3    E         Y4     3.6    M         Z27    3.3    M
X3     3.4    M         Y5     3.6    M         Z3     3.4    M
X27    3.4    M         Y8     3.6    M         Z7     3.4    M
X2     3.5    M         Y9     3.6    M         Z8     3.4    M
X7     3.5    M         Y11    3.6    M         Z13    3.4    M
X28    3.6    M         Y12    3.6    M         Z15    3.4    M
X9     3.7    M         Y14    3.6    M         Z21    3.4    M
X13    3.7    M         Y16    3.6    M         Z5     3.5    M
X25    3.8    M         Y20    3.6    M         Z14    3.5    M
X26    3.9    M         Y22    3.6    M         Z17    3.5    M
X1     4      M         Y24    3.6    M         Z11    3.6    M
X20    4      M         Y23    3.7    M         Z16    3.6    M
X21    4      M         Y6     3.8    H         Z24    3.7    M
X23    4.1    H         Y7     3.8    H         Z25    3.7    M
X22    4.2    H         Y10    3.8    H         Z18    3.8    H
X19    4.3    H         Y13    3.8    H         Z23    3.8    H
X6     4.4    H         Y17    3.8    H         Z1     3.9    H
X5     4.5    H         Y18    3.8    H         Z2     3.9    H
X11    4.6    H         Y28    3.9    H         Z12    3.9    H
                                                Z20    3.9    H
(* E: easy, M: medium, H: hard)
Table 4.1 Item mark allocation and item difficulty of three schools.
As explained in the previous chapter, based on the directions of the test items (see appendix 3), I
classified all the test items according to 17 different criteria, as illustrated in Table 4.2, which
summarises how teachers weighted the test items according to language categories. Teachers in X, Y,
and Z high school believed that the language categories "identifying the syntactically wrong (or
correct) usage of word categories" (10) and "inferring a word, a phrase, and a sentence" (11, 12, and
13) could be difficult for students, so they generally assigned higher marks to items pertinent to
those language categories. On the other hand, teachers considered "finding a wrong explanation
about a given passage" (7) and "finding suitable discourse markers" (9) to be easy language
categories, so they assigned relatively lower marks to the items related to categories (7) and (9). In
addition, the language categories "finding a title of a given passage" (5) and "placing sentences in a
logical order" (15) were mostly classified as medium tasks.
Table 4.2 X, Y, and Z high schools' language category specification and item difficulty levels
[The original table arranges all 85 items under the 17 language categories below, each split into the
difficulty levels (E/M/H) teachers assigned, with the corresponding school and item numbers; the
item-by-item grid did not survive transcription. The 17 language categories are:]
1. Distinguishing a pronoun which indicates a different reference
2. Finding a non-cohesive sentence
3. Finding suitable words to the context of a given passage
4. Finding a theme of a given passage
5. Finding a title of a given passage
6. Finding a topic of a given passage
7. Finding a wrong explanation about a given passage
8. Finding an awkward word to the context of the given passage
9. Finding suitable discourse markers
10. Identifying the syntactically wrong (or correct) usage of word categories (e.g. noun, verb, relative pronoun, preposition, adjective, etc.)
11. Inferring a phrase based on a given passage
12. Inferring a sentence based on a given passage
13. Inferring a word based on a given passage
14. Locating a given sentence in a passage
15. Placing sentences in a logical order
16. Summarising a given passage
17. Identifying author's mood
*H: hard, M: medium, E: easy
4.4 Data Refinement
In order to precisely calibrate the item difficulty logits, the misfit person data need to be
excluded, because such misfit data can distort or degrade the measurement system (Linacre, 2002).
Thus, by using the criteria explained in the previous chapter, I analysed the response data using the
Rasch model and found the misfit data whose infit MNSQ and infit ZSTD values were not within the
acceptable ranges. To be specific, two misfit person data (entry no. 3 and 80) were found in X high
school, while no misfit datum was found in Y or Z high school (see appendix 4). Those two misfit
data's infit MNSQ and ZSTD values were beyond the acceptable range, as shown in Table 4.3.
PERSON   ABILITY             INFIT   INFIT   OUTFIT   OUTFIT
ENTRY    MEASURE    SCORE    MNSQ    ZSTD    MNSQ     ZSTD
3        -0.95      9        1.58    2.63    1.73     1.61
80       -0.39      12       1.67    3.46    2.90     4.27
210      5.3        28       1.00    0.00    1.00     0.00
Table 4.3 Misfit person data and extreme scorer of X high school
In addition, one person (entry no. 210) in X high school responded to all test items correctly
(see appendix 4). Since extreme scorers (those who answer all test items correctly or incorrectly)
always fit the Rasch model perfectly (Henning, 1987), the extreme scorer found in X high
school was also excluded. After eliminating those three cases, the sample size of X high school became
197, while the sample sizes of Y and Z high school remained unchanged at 200 and 195, respectively.
4.5 Item Difficulty Calibration
Based on the refined students' item response data (see appendix 5), the difficulty logits of
each school's test items were measured by Winsteps. After calculating the difficulty logit of each
item, each item was tagged with one of five difficulty levels according to the logit ranges which Baker
(1985) suggested (see Figure 3.2): VH (very hard), H (hard), M (medium), E (easy), and VE (very easy).
Through this process, it was possible to compare the difficulty levels which teachers had decided
with the actual difficulty levels measured by Winsteps.
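The tagging step can be sketched as below. The cut-points are an assumption inferred from the
levels reported in Tables 4.4, 4.6, and 4.8 (e.g. +3.36 is tagged VH, +0.5 H, -0.47 M, -0.74 E, and
-2.75 VE), offered as one reading of Baker's scale rather than his exact figures.

```python
# A minimal sketch of mapping calibrated logits to verbal difficulty levels.
# Cut-points are inferred from the levels in Tables 4.4, 4.6 and 4.8.
def difficulty_level(logit):
    if logit >= 2.0:
        return "VH"    # very hard
    elif logit >= 0.5:
        return "H"     # hard
    elif logit >= -0.5:
        return "M"     # medium
    elif logit > -2.0:
        return "E"     # easy
    return "VE"        # very easy

# Logits taken from Tables 4.4 and 4.6; expected output: VH, H, M, E, VE
for item, logit in [("X22", 3.36), ("X1", 0.5), ("Y18", -0.47),
                    ("X19", -0.74), ("X4", -2.75)]:
    print(item, difficulty_level(logit))
```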
4.5.1 X High School
X HIGH SCHOOL (Item Reliability .97)
ITEM    DIFFICULTY   SCORE        MODLSE   INFIT   INFIT   OUTFIT   OUTFIT   DIFFICULTY
ENTRY   MEASURE      (Response)            MNSQ    ZSTD    MNSQ     ZSTD     LEVEL
X22     3.36         21           0.27     1.03    0.21    1.90     1.64     VH
X7      1.81         53           0.19     1.25    2.45    2.68     5.42     H
X16     1.46         63           0.18     1.31    3.22    1.52     2.52     H
X3      1.27         69           0.18     0.89    -1.27   0.94     -0.30    H
X18     0.8          84           0.17     0.76    -3.18   0.71     -2.19    H
X27     0.8          84           0.17     1.26    2.93    1.43     2.69     H
X21     0.74         86           0.17     1.01    0.18    0.92     -0.50    H
X5      0.71         87           0.17     1.25    2.84    1.37     2.44     H
X23     0.71         87           0.17     0.80    -2.66   0.78     -1.66    H
X28     0.62         90           0.17     0.91    -1.05   0.81     -1.41    H
X1      0.5          94           0.17     1.31    3.47    1.45     2.95     H
X24     0.29         101          0.17     0.98    -0.21   1.01     0.11     M
X14     0.17         105          0.17     0.99    -0.07   0.92     -0.55    M
X15     0.17         105          0.17     0.92    -1.00   0.84     -1.14    M
X2      -0.12        115          0.17     0.90    -1.28   0.83     -1.06    M
X13     -0.4         124          0.18     0.95    -0.61   0.83     -0.94    M
X8      -0.43        125          0.18     0.90    -1.23   0.83     -0.93    M
X12     -0.46        126          0.18     0.99    -0.14   0.92     -0.40    M
X19     -0.74        135          0.18     0.97    -0.33   1.40     1.76     E
X25     -0.77        136          0.18     0.94    -0.68   0.79     -1.00    E
X11     -0.84        138          0.18     0.96    -0.47   1.01     0.10     E
X17     -0.91        140          0.18     0.95    -0.57   0.89     -0.40    E
X6      -0.97        142          0.18     0.97    -0.30   0.76     -1.01    E
X9      -1.11        146          0.19     1.09    0.99    1.14     0.60     E
X10     -1.18        148          0.19     1.04    0.44    0.84     -0.56    E
X26     -1.33        152          0.19     0.79    -2.36   0.59     -1.58    E
X20     -1.4         154          0.2      0.83    -1.87   0.59     -1.52    E
X4      -2.75        181          0.27     0.83    -0.93   0.40     -1.44    VE
Table 4.4 Item difficulty measurement of X high school test items
Table 4.4 shows how difficult the items turned out to be. According to Table 4.4, the item
estimates range from +3.36 (X22) to -2.75 (X4) logits. That is, students in X high school had more
difficulty in responding to X22 correctly than to any other item, whereas they solved X4 with ease.
Out of 28 test items, 11 were estimated to be hard, while 10 were found to be easy for students;
the number of medium-level items was relatively small. By using an item map, the hierarchy of
item difficulty can be schematised (see Figure 4.1).
Figure 4.1 X high school item map
Teachers' Weighting          The Rasch Analysis
MARK   DIFFICULTY    ITEM     DIFFICULTY   DIFFICULTY
       LEVEL         ENTRY    LEVEL        LOGIT
4.6    H             X11      E            -0.84
4.5    H             X5       H            0.71
4.4    H             X6       E            -0.97
4.3    H             X19      E            -0.74
4.2    H             X22      VH           3.36
4.1    H             X23      H            0.71
4      M             X1       H            0.5
4      M             X20      E            -1.4
4      M             X21      H            0.74
3.9    M             X26      E            -1.33
3.8    M             X25      E            -0.77
3.7    M             X9       E            -1.11
3.7    M             X13      M            -0.4
3.6    M             X28      H            0.62
3.5    M             X2       M            -0.12
3.5    M             X7       H            1.81
3.4    M             X3       H            1.27
3.4    M             X27      H            0.8
3.3    E             X10      E            -1.18
3.3    E             X12      M            -0.46
3.2    M             X14      M            0.17
3.1    E             X16      H            1.46
3      M             X24      M            0.29
2.9    E             X15      M            0.17
2.8    E             X8       M            -0.43
2.7    E             X18      H            0.8
2.6    E             X17      E            -0.91
2.5    E             X4       VE           -2.75
Table 4.5 X high school teachers' difficulty level estimation and virtual difficulty levels
In the middle of Table 4.5 are the items which X school teachers designed. The left columns of
Table 4.5 contain the information about teachers' prior weighting and mark allocation. To the right
of the item entry column, the calibrated difficulty logits and difficulty levels of the items are
presented. The items are arranged by the marks, and hence the difficulty levels, which teachers in X
high school decided prior to the test administration.
Table 4.5 indicates that there are a number of cases which display a gap between teachers'
item difficulty estimation and the actual item difficulty level. Teachers in X high school believed
that X11 might be the most difficult item for students to solve, so they allocated the highest mark to
X11. However, according to Table 4.5, the measured difficulty logit of X11 is -0.84, suggesting that
X11 was an easy item. Under this model, the teachers should not have assigned the highest mark to
X11. Similarly, two other items, X6 and X19, had been estimated to be hard by teachers, but the
item logits of X6 (-0.97) and X19 (-0.74) indicate that it was easy for students to find the correct
answers to those two items. Teachers also misjudged the difficulty levels of two items, X16 and X18.
The teachers presumed that X16 and X18 might be easy items. However, the calculated difficulty
logits of X16 (+1.46) and X18 (+0.8) indicate that it was hard for students to respond to those two
items correctly. According to Table 4.4, less than a third of students responded to X16 correctly and
less than half answered X18 correctly. Aside from those five items, there were 13 other items in X
high school's test which showed an incongruity between teachers' 'a priori' weighting and the virtual
difficulty levels.
4.5.2 Y High School
Y HIGH SCHOOL (Item Reliability .94)
ITEM    DIFFICULTY   SCORE        MODLSE   INFIT   INFIT   OUTFIT   OUTFIT   DIFFICULTY
ENTRY   MEASURE      (Response)            MNSQ    ZSTD    MNSQ     ZSTD     LEVEL
Y26     1.48         41           0.2      1.13    1.14    1.16     0.79     H
Y22     1.21         48           0.19     1.15    1.36    1.47     2.27     H
Y17     1.03         53           0.19     1.24    2.24    1.52     2.72     H
Y27     0.93         56           0.18     1.38    3.52    1.51     2.85     H
Y13     0.76         61           0.18     0.87    -1.44   0.79     -1.52    H
Y16     0.63         65           0.18     1.20    2.06    1.30     2.02     H
Y24     0.63         65           0.18     0.89    -1.19   0.86     -1.02    H
Y6      0.54         68           0.18     0.81    -2.21   0.73     -2.26    H
Y19     0.54         68           0.18     0.98    -0.18   0.96     -0.30    H
Y28     0.51         69           0.17     1.54    5.24    1.71     4.60     H
Y8      0.16         81           0.17     0.85    -1.91   0.81     -1.76    M
Y7      0.07         84           0.17     1.11    1.31    1.21     1.82     M
Y12     -0.01        87           0.17     0.77    -3.15   0.72     -2.72    M
Y3      -0.04        88           0.17     1.12    1.56    1.30     2.54     M
Y20     -0.09        90           0.17     1.08    1.01    1.16     1.41     M
Y23     -0.12        91           0.17     1.19    2.37    1.13     1.20     M
Y5      -0.18        93           0.17     0.96    -0.54   0.90     -0.92    M
Y14     -0.2         94           0.16     0.82    -2.63   0.78     -2.08    M
Y21     -0.23        95           0.16     1.22    2.80    1.48     3.79     M
Y11     -0.34        99           0.16     0.72    -4.35   0.68     -3.14    M
Y9      -0.39        101          0.16     1.10    1.33    1.08     0.70     M
Y18     -0.47        104          0.16     0.91    -1.31   0.81     -1.69    M
Y2      -0.71        113          0.16     0.72    -4.68   0.63     -3.34    E
Y15     -0.76        115          0.16     0.79    -3.34   0.68     -2.76    E
Y25     -0.95        122          0.16     0.89    -1.68   0.84     -1.13    E
Y4      -1.06        126          0.17     0.76    -4.03   0.62     -2.89    E
Y1      -1.26        133          0.17     0.92    -1.21   0.85     -0.91    E
Y10     -1.7         148          0.18     0.93    -0.95   0.89     -0.46    E
Table 4.6 Item difficulty measurement of Y high school test items
Table 4.6 shows how difficult the items turned out to be. According to Table 4.6, the item
estimates range from +1.48 (Y26) to -1.7 (Y10) logits. Students in Y high school had more difficulty in
responding to Y26 correctly than to any other item, whereas they solved Y10 with ease. In Y high
school, the Rasch model rated 10 items as hard, 12 as medium, and 6 as easy; the frequency of easy
items was low, whereas that of medium and hard items was relatively high. Table 4.6 also indicates
that, with the exception of two pairs of items (Y1 and Y10 / Y26 and Y22), the item difficulty logit
gap between adjacent items that straddle a difficulty-level borderline is wider than the gap between
adjacent items within the same level. To be specific, the logit difference between two adjacent items
on a difficulty borderline (Y28 and Y8 / Y18 and Y2) is relatively wider than the logit difference
between two adjacent items within the same difficulty level (e.g. Y17 and Y27). This suggests that
students might readily distinguish hard items from medium items, and medium items from easy items,
while taking the test. Those gaps between adjacent items can also be identified in the item map (see
Figure 4.2).
Figure 4.2 Y high school item map
Teachers' Weighting          The Rasch Analysis
MARK   DIFFICULTY    ITEM     DIFFICULTY   DIFFICULTY
       LEVEL         ENTRY    LEVEL        LOGIT
3.9    H             Y28      H            0.51
3.8    H             Y6       H            0.54
3.8    H             Y7       M            0.07
3.8    H             Y10      E            -1.7
3.8    H             Y13      H            0.76
3.8    H             Y17      H            1.03
3.8    H             Y18      M            -0.47
3.7    M             Y23      M            -0.12
3.6    M             Y1       E            -1.26
3.6    M             Y2       E            -0.71
3.6    M             Y3       M            -0.04
3.6    M             Y4       E            -1.06
3.6    M             Y5       M            -0.18
3.6    M             Y8       M            0.16
3.6    M             Y9       M            -0.39
3.6    M             Y11      M            -0.34
3.6    M             Y12      M            -0.01
3.6    M             Y14      M            -0.2
3.6    M             Y16      H            0.63
3.6    M             Y20      M            -0.09
3.6    M             Y22      H            1.21
3.6    M             Y24      H            0.63
3.3    E             Y21      M            -0.23
3.3    E             Y27      H            0.93
3.2    E             Y15      E            -0.76
3.2    E             Y25      E            -0.95
3.1    E             Y19      H            0.54
3.1    E             Y26      H            1.48
Table 4.7 Y high school's teachers' weighting and virtual difficulty
English teachers in Y high school believed that Y28 might be the most difficult item for
students to solve, so they had allocated the highest mark (3.9) to Y28. In contrast, Y26 had been
considered the easiest item, so the teachers had assigned the lowest mark (3.1) to it. However, the
evidence which Winsteps provided showed that there was a gap between teachers' decisions about
item weighting and the virtual item difficulty levels in Y high school. Teachers' weighting of Y26
completely mismatched its virtual difficulty level: as seen in Figure 4.2 and Table 4.7, Y26 is placed
at the very top of the item map and has the highest difficulty logit (+1.48), meaning that no item was
more difficult than Y26. In spite of that, teachers believed that Y26 might be easy for students and
allocated the lowest mark (3.1) to it; they should have assigned it the highest mark. In addition,
teachers postulated that Y10 might be a tough item. However, Y10's difficulty logit measured by the
Rasch model showed that teachers' estimation differed, to a great extent, from the virtual item
difficulty level: the difficulty logit of Y10 (-1.7) suggested that Y10 was an easy item for students. In
Y high school, out of 28 items, 13 items' virtual difficulty levels did not match teachers' weighting.
4.5.3 Z High School
Z HIGH SCHOOL (Item Reliability .96)
ITEM    DIFFICULTY   SCORE        MODLSE   INFIT   INFIT   OUTFIT   OUTFIT   DIFFICULTY
ENTRY   MEASURE      (Response)            MNSQ    ZSTD    MNSQ     ZSTD     LEVEL
Z3      2.33         25           0.24     1.04    0.31    1.65     1.84     VH
Z12     1.35         47           0.19     1.11    1.10    1.56     2.70     H
Z20     1.31         48           0.19     1.00    0.03    0.88     -0.67    H
Z23     0.91         60           0.18     0.95    -0.54   0.91     -0.61    H
Z1      0.66         68           0.17     1.07    0.82    1.06     0.52     H
Z2      0.63         69           0.17     1.10    1.16    1.19     1.49     H
Z7      0.54         72           0.17     1.16    1.89    1.18     1.46     H
Z16     0.54         72           0.17     1.09    1.14    1.26     2.01     H
Z19     0.51         73           0.17     1.06    0.77    1.11     0.92     H
Z11     0.37         78           0.17     0.92    -1.08   0.90     -0.89    M
Z13     0.29         81           0.17     1.02    0.30    1.02     0.17     M
Z17     0.18         85           0.17     1.26    3.28    1.26     2.22     M
Z24     0.18         85           0.17     1.04    0.54    1.01     0.13     M
Z21     0.15         86           0.17     0.93    -0.96   0.96     -0.36    M
Z25     -0.04        93           0.16     1.25    3.31    1.20     1.73     M
Z28     -0.1         95           0.16     1.02    0.26    1.14     1.22     M
Z29     -0.18        98           0.16     1.02    0.26    1.01     0.16     M
Z26     -0.42        107          0.16     0.94    -0.88   0.86     -1.14    M
Z8      -0.5         110          0.16     1.05    0.72    1.15     1.17     M
Z10     -0.5         110          0.16     0.80    -3.21   0.72     -2.44    M
Z18     -0.6         114          0.16     1.12    1.75    1.10     0.79     E
Z27     -0.6         114          0.16     0.94    -0.97   0.86     -1.12    E
Z14     -0.69        117          0.16     0.73    -4.47   0.62     -3.18    E
Z15     -0.77        120          0.17     0.92    -1.20   0.86     -1.00    E
Z22     -0.77        120          0.17     1.12    1.79    1.66     3.92     E
Z9      -0.99        128          0.17     0.74    -4.23   0.60     -2.90    E
Z5      -1.05        130          0.17     0.78    -3.50   0.65     -2.36    E
Z6      -1.16        134          0.17     0.78    -3.36   0.63     -2.36    E
Z4      -1.6         148          0.18     0.94    -0.73   0.80     -0.90    E
*VH: very hard, H: hard, M: medium, E: easy, VE: very easy
Table 4.8 Item difficulty measurement of Z high school test items
Table 4.8 shows how difficult the items turned out to be. According to Table 4.8, the item
difficulty estimates range from +2.33 (Z3) to -1.6 (Z4) logits. Students in Z high school had more
difficulty in responding to Z3 correctly than to any other item, whereas they solved Z4 with ease. In
Z high school, the Rasch model rated 9 items as hard, 11 as medium, and 9 as easy. Compared to the
difficulty level distributions of the other two schools' items, Z school's item difficulty levels seem
to be fairly evenly distributed. In addition, as shown in Figure 4.3, most items (25 items) are located
between +0.91 (Z23) and -1.16 (Z6) logits. However, Figure 4.3 also shows that there is a large logit
gap between Z3 (+2.33) and Z12 (+1.35), suggesting that Z3 was too difficult for the students to find
the correct answer.
Figure 4.3 Z high school item map
Teachers' Weighting          The Rasch Analysis
MARK   DIFFICULTY    ITEM     DIFFICULTY   DIFFICULTY
       LEVEL         ENTRY    LEVEL        LOGIT
3.9    H             Z1       H            0.66
3.9    H             Z2       H            0.63
3.9    H             Z12      H            1.35
3.9    H             Z20      H            1.31
3.8    H             Z18      E            -0.6
3.8    H             Z23      H            0.91
3.7    M             Z24      M            0.18
3.7    M             Z25      M            -0.04
3.6    M             Z11      M            0.37
3.6    M             Z16      H            0.54
3.5    M             Z5       E            -1.05
3.5    M             Z14      E            -0.69
3.5    M             Z17      M            0.18
3.4    M             Z3       VH           2.33
3.4    M             Z7       H            0.54
3.4    M             Z8       M            -0.5
3.4    M             Z13      M            0.29
3.4    M             Z15      E            -0.77
3.4    M             Z21      M            0.15
3.3    M             Z10      M            -0.5
3.3    M             Z27      E            -0.6
3.2    E             Z9       E            -0.99
3.1    E             Z4       E            -1.6
3.1    E             Z6       E            -1.16
3.1    E             Z22      E            -0.77
3.1    E             Z26      M            -0.42
3.1    E             Z28      M            -0.1
3      E             Z19      H            0.51
3      E             Z29      M            -0.18
Table 4.9 Z high school's teachers' weighting and virtual difficulty
As identified in X and Y high school, there is also a discrepancy between teachers' 'a priori'
weighting and the virtual item difficulty levels in Z high school. That is, teachers in Z high school did
not allocate appropriate marks to the test items. For example, teachers believed that Z18 might be a
difficult item, but the difficulty logit of Z18 was -0.6, suggesting that Z18 was an easy item for Z
high school's students; teachers, therefore, should have assigned Z18 a mark lower than 3.8.
Teachers had also classified Z3 as a medium-level item and assigned it 3.4. However, the Rasch model
identified Z3 as the most difficult of all items, so teachers should have given Z3 the highest mark,
3.9. In addition, the teachers misjudged the difficulty level of Z19. They believed that Z19 was an
easy item, so they allocated the lowest mark to it. However, Z19's calibrated difficulty logit was
+0.51, suggesting that Z19 was a difficult item. Thus, a higher mark should have been assigned to
Z19 than the teachers had assigned. In Z high school's test, out of 29 items, 12 items' virtual
difficulty levels did not match teachers' weighting.
A comparison was made between teachers' weighting and the virtual difficulty levels in this
section. As a result, it was identified that in X, Y, and Z high schools' test items, there was a large gap
between teachers' weighting and the virtual difficulty levels. In the next section, another comparison will
be made in relation to language categories and item difficulty levels.
4.6 Language Category Classification
In the last subsection, the gap between teachers' weighting and virtual item difficulty was
identified. In this section, in relation to the second research question, a comparison will be made
between what teachers believed to be difficult and what students found to be difficult in terms of
the language category.
[Table 4.10 arranges all 85 test items by the 17 language categories of Table 4.2: teachers'
weighting (the items at each difficulty level, E/M/H) appears on the left, and the Rasch analysis
results appear on the right. The item-by-item grid did not survive transcription; as one legible
example, under category 17 ("Identifying author's mood") the teachers rated X18 easy (E), whereas
the Rasch analysis rated it hard (H).]
*H: hard, M: medium, E: easy
Table 4.10 Reclassification of language categories (teachers' weighting vs. the Rasch analysis)
All 85 test items which teachers in X, Y, and Z high school designed were arranged on
the basis of the language categories and difficulty levels, as shown in Table 4.10. In the middle of the
table are the 17 different language categories. On the left side of the table, items are arranged
according to the language categories and the difficulty levels which teachers decided; on the right
side, according to the language categories and the difficulty levels which the Rasch model measured.
Table 4.10 shows that there is also a gap between teachers' beliefs and students'
responses in terms of language categories. That is, in some cases, what teachers had believed to be a
difficult language category was easy for students, or vice versa. Teachers in X and Y high school had
estimated that the language category "finding suitable discourse markers" (9) might not be difficult,
so most items related to this category had been classified as easy items. Contrary to the teachers'
estimation, however, the students in X and Y high school had difficulty in solving the items pertinent
to category (9); that is, the category was more difficult than teachers had expected. Similarly, the
language categories "finding suitable words to the context of a given passage" (3) and "finding a title
of a given passage" (5) were found to be more difficult for students than teachers had predicted.
On the other hand, in the case of the language category "finding a theme of a given passage" (4),
teachers had assumed that this category might be tough for students, so most items related to it had
been rated as hard by teachers in X, Y, and Z school. However, that was not the case: once those
items were administered, the students in X, Y, and Z school found the correct answers more easily
than teachers had estimated.
Teachers in X, Y, and Z high school had believed that students might have difficulty in solving
the items related to the language categories "inferring a phrase (11), a sentence (12), and a word
(13)", so they had classified all the test items related to those categories as hard or medium. The
Rasch analysis showed that this expectation was only partially correct: teachers' estimation of the
hard items in those categories was right to a great extent, while some of the medium items turned
out to be more difficult (e.g. Z16) or easier (e.g. X25) than teachers had judged.
In the case of the language category "identifying the syntactically wrong (or correct) usage of
word categories" (10), there was a difference between schools. To be specific, X high school's
teachers had estimated the items related to this category to be hard, Z high school's teachers had
considered them hard and medium, and Y high school's teachers had rated them medium and easy.
After analysing the item responses using the Rasch model, it was discovered that there was a gap
between teachers' estimation and students' responses to the items relevant to category (10) in X and
Y high school, whereas Z school teachers' weighting of those items was aligned with the measured
difficulty levels.
Aside from the examples given above, a number of examples in Table 4.10 show that there is a
discrepancy between what teachers believed to be difficult and what students found to be difficult
in terms of language categories.
4.7 Summary
Through the Rasch analysis, the difficulty logits of the three schools' items were calibrated.
Based on the calibrated logits, I tagged the items with Baker's verbal terms (very easy, easy, medium,
hard, and very hard), and then a comparison was made between teachers' weighting and the virtual
item difficulty levels. As a result, it was discovered that there was a large gap between teachers'
'a priori' weighting and the virtual difficulty levels. That is, what teachers had believed to be difficult
was easy for students, or vice versa. For example, in some cases, teachers had believed that certain
items (e.g. X11, Y10) might be difficult for students, so they had assigned high marks to those
items. However, the Rasch model identified that it was not difficult for students to find the correct
answers to those items. In other cases, teachers had estimated that items (e.g. Y26, Z19) might be
easy for students, so they had allocated low marks to them. Yet the students actually had difficulty
in responding to those items correctly. Thus, considering the gap between teachers' weighting and
the virtual difficulty of the items, teachers should have assigned higher or lower marks to those
items than they actually did.
In relation to the language categories, it was found that there was also a gap between
teachers' beliefs and students' responses. In other words, what teachers had believed to be a
difficult language category was easy for students, or vice versa. For example, even though teachers
had considered the language category "finding suitable discourse markers" (9) to be easy, students
did not find it so; rather, Winsteps provided evidence that the students had difficulty in dealing with
category (9). Conversely, in the case of the language category "finding a theme of a given passage"
(4), teachers had assumed that the category might be difficult, but students found it easier than
teachers had expected.
To sum up, in relation to the first research question, the Rasch model confirmed that there
was a large gap between teachers' item weighting and the virtual item difficulty levels. Similarly, in
relation to the second research question, it was discovered that a gap also existed between what
teachers believed to be difficult and what students found to be difficult in terms of language
categories. In the next chapter, these findings will be discussed.
Chapter 5. Discussion
5.1 Introduction to Chapter 5
In this chapter, the findings of the current research will be discussed and ways to enhance the
fairness of item weighting will be suggested. Thereafter, recommendations for further research will
be made and a conclusion will be drawn.
5.2 Discussion and Implications
It is difficult to predict test item difficulty levels before a test is administered to test-takers.
In spite of that, since the government policy (see 3.2) disadvantages students who obtain the same
scores, teachers have implemented a differential item weighting method in the belief that it can
effectively decrease the number of students who gain the same scores. While allocating differing
marks to the test items, teachers usually rely on their feel for the item difficulty levels (Rudner,
2001). That is, the more difficult teachers believe a test item to be, the higher the mark they assign
to it.
In relation to the differential weighting method, Blood (1951, cited in Sabers & White, 1969)
claims that expert judgement in determining the weights may make it possible to raise test
reliability without changing validity. Unlike Blood, however, Feldt (2004) argues that test
designers' "somewhat arbitrary" judgement (p. 186) in item weighting can be "detrimental to total
score reliability" (p. 188). In addition, Feldt points out that when a differential weighting method is
used, it frequently happens that test designers assign the highest weight to less important test items.
Because of such drawbacks, many experts (e.g. Gulliksen, 1950) discourage differential weighting.
Even though caution must be exercised before generalising the results to other contexts, since
the sample was limited, the present research confirms that there was a large gap between teachers'
weighting and the virtual item difficulty. As Feldt (2004) describes, teachers in this research assigned
lower marks to some difficult items while giving higher marks to relatively easy items within the
same test. In relation to language categories, there was likewise a gap between what teachers
believed to be difficult and what students found to be difficult: what teachers believed to be a
difficult language category was shown to be easy for students, or vice versa.
The value of this study lies in demonstrating that the differential weighting method which
most Korean teachers implement can fail, to a certain extent, to reflect the importance and the
difficulty of the items. By using a differential weighting method, teachers may succeed in lining up
students' scores from the highest to the lowest (Shohamy, 1998) as the education office policy
demands, but this research identified that teachers often failed to distinguish the difficult items from
the easy ones. In this sense, I believe that solutions need to be put forward in order to minimise the
mistake of allocating higher (lower) marks to the less (more) important items.
As a way of securing consistency in allocating marks to the items, a post weighting
system needs to be established in my context. Few, if any, test designers can precisely measure the
difficulty levels of test items before test-takers take the test. Despite this, I wonder why schools
and the education office compel teachers to do what they are not able to do: teachers have been
required to design a test and assign differential weights to the test items simultaneously. However,
as identified in this research, teachers' weighting prior to the test did not satisfy the designers' own
premise that more difficult items should be assigned higher marks. From this perspective, the
convention in which teachers predict item difficulty levels and reflect them in the item marks needs
to be modified. To be specific, it would be better to separate item weighting from test design by
letting teachers decide the difficulty levels on the basis of how students actually respond to the
items and then assign corresponding marks.
Test item design → Test administration → Mark allocation → Final test results
Figure 5.1 Procedure of test design and post weighting method
Many may assume that a post weighting method would burden teachers physically and
psychologically, because they would have to compute statistical grounds for their item weighting
decisions and then assign the corresponding marks to the test items. However, in my context, all the
processes of marking students' answers are handled by a computerised system called NEIS
(National Education Information System). Through that system, teachers can easily verify how many
students responded to a certain item correctly; that is, it is not that difficult for teachers to ground
their decisions about item difficulty levels. Instead, for precise measurement of the difficulty index,
a statistical programme needs to be integrated with the NEIS. What is more, if the programme
assigns the difficulty levels of the items based on a difficulty index (e.g. Baker's index), teachers may
be able to weight the test items more consistently.
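As a minimal sketch of the post weighting idea, the snippet below assumes a hypothetical export
of per-item response counts (the dissertation names NEIS, but this is not NEIS code): the facility
value p = correct / examinees is computed after the test, and a difficulty level is then assigned from
observed data. The cut-offs are illustrative assumptions, not Baker's figures.

```python
# A minimal sketch: derive difficulty levels from observed proportion correct.
def post_level(p):
    # Illustrative cut-offs on the proportion-correct scale (an assumption).
    if p < 0.4:
        return "hard"
    elif p < 0.7:
        return "medium"
    return "easy"

results = {"item01": (63, 197), "item02": (148, 197)}   # invented counts
for item, (correct, examinees) in results.items():
    p = correct / examinees
    print(item, round(p, 2), post_level(p))   # item01 0.32 hard, item02 0.75 easy
```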
Another feasible solution would be for teachers to assign the same mark to all test items. As I
explained in chapter 3 (see 3.2), the government is "using tests as tools for setting educational
agendas" (Shohamy & McNamara, 2009, p. 1). To be specific, since the government policy
disadvantages students who gain the same scores, "teachers are reduced to following orders"
(Shohamy, 1998, p. 340) and allocate various marks to the test items in order to reduce the number of
students who gain the same scores. Thus, unless the education office abolishes the policy that
disadvantages those with equal scores, teachers may not accept the solution of abolishing
differential weighting; the policy which makes teachers inevitably implement a differential
weighting method should be repealed first. On that condition, I believe that in my context an equal
weighting method may be applicable for the purpose of minimising teachers' subjectivity and
enhancing the fairness of the test results to a large extent.
It could be argued that equal weighting is the equivalent of not weighting at all. However, this is
not the case. As explained in chapter 2, an equal weighting method naturally and internally gives a
different effective weight to each item (Wang & Stanley, 1970). In addition, compared to an
unequal weighting method, which requires much time to decide the appropriate weight of each item
(especially if more than two teachers are involved) (Stalnaker, 1938), an equal weighting method is
practical and bias-free. To sum up, an equal weighting method helps teachers to assign marks to the
items consistently and saves the time spent judging which items are more difficult or easier.
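To sketch Wang and Stanley's (1970) point that nominally equal weights still produce unequal
effective weights, the snippet below uses invented 0/1 item scores: an item's contribution to
total-score variance depends largely on its own spread (and its correlations with the other items),
so a zero-variance item carries no effective weight even under equal nominal weighting.

```python
# A minimal sketch of effective weighting under equal nominal weights.
import statistics

items = {
    "very_easy_item": [1, 1, 1, 1, 1],   # no variance -> no effective weight
    "medium_item":    [1, 1, 0, 1, 0],
    "hard_item":      [0, 1, 0, 0, 0],
}

for name, scores in items.items():
    sd = statistics.pstdev(scores)       # spread of the item scores
    print(name, "SD =", round(sd, 2))    # larger SD -> larger effective weight
```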
In a high stakes test, I believe that the test results should be scored as precisely as possible,
because they can serve as a critical indicator in deciding test-takers' eligibility in areas such as
employment and school admission. The initial step of precise scoring, I believe, is correct item
weighting: by allocating appropriate marks to each item, a test secures fairness and consistency.
5.3 Recommendations for Future Research
This quantitative research aimed to calculate item difficulty levels on the basis of test-takers'
item responses. In order to do so, I analysed the students' responses to the items using the Rasch
model and fulfilled the purpose of the research to a large extent. However, this quantitative
research does not show why teachers made the decisions they did about item difficulty levels and
mark allocation; it cannot uncover the deeper reasons behind test designers' weighting decisions.
Thus, it would be worth conducting qualitative research for the purpose of discovering teachers'
beliefs about item weighting.
In particular, in my context, more than two teachers are usually involved in designing test
items: each teacher designs a certain number of items, and a test is composed of all the items the
teachers create. While those teachers are designing the items and allocating marks to them, many
variables may influence an item weighting decision, such as teachers' professional identity, their
teaching experience, the atmosphere of the school where they teach, the national examination, and
so on. Qualitative research enquiring into the relationship between those variables and item
weighting would be worth conducting, because it could give much deeper insight into how teachers
in various contexts decide item difficulty levels.
5.4 Conclusion
One of our concerns is whether a language test can produce scores that accurately reflect an
examinee's ability in a certain area such as reading, writing, speaking, or listening (Weir, 2005;
Tierney, 2006). In a similar vein, McNamara (2000) contends that a language test is a procedure for
gathering evidence with which we can predict a candidate's use of language in real-world contexts.
Fundamentally, however, in order for a test to measure examinees' abilities precisely, teachers (test
designers) need to pay close attention to item weighting; that is, they need to weight the items
precisely, because the test scores differ depending on how teachers weight the items and allocate
marks, and consequently wrong weighting may prevent testers from predicting examinees' actual
abilities.
However, it is extremely difficult for teachers to weight the items precisely prior to the test
administration, because a number of variables influence the weighting of the items, such as
examinees' prior knowledge, test methods, length of test content, language categories, and so on.
While weighting the items, teachers may consider such variables, yet the final decisions about item
weighting are usually made on the basis of the teachers' subjective judgements and criteria. Thus, as
shown in this research, the item analysis data indicated that what teachers believed to be difficult
was not always identical to what students found to be difficult; in some cases, what teachers
believed to be difficult was easy (or medium) for students. As long as subjectivity is involved in item
weighting, if students raise objections to teachers' item weighting and mark allocation, it must be
difficult for most teachers to elucidate the reasons logically, causing "ethical challenges" (Davies,
2004, p. 97).
In relation to the misjudgement of weighting, teachers in my context seem to have had an
indifferent attitude. That is, once they decided the weights and assigned the marks to the items,
their concern seems to have shifted toward the results which students obtain from the test. Of
course, I do not deny the importance of the test results: no matter how incorrectly the items are
weighted, the results may serve as a tool which teachers and students use for their decision-making
to a certain extent. However, there is no doubt that the more accurately teachers weight the items,
the more reliable the test results will be. In this sense, I claim that when teachers weight items, they
need to take a more critical stance and make it a rule to find less subjective grounds for their
decisions by analysing the item response data gained from previous tests.
References
Allal, L., 2013. Teachers' professional judgement in assessment: a cognitive act and a socially
situated practice. Assessment in Education: Principles, Policy & Practice, 20(1), p. 20-34.
Bachman, L. F., 1989. Assessment and Evaluation. Applied Linguistics, Volume 10, pp. 210-226.
Bachman, L. F., 1990. Fundamental Considerations in language testing. New York: Oxford University
Press.
Bachman, L. F., 1991. What Does Language Testing Have to Offer?. TESOL Quarterly, 25(4), pp. 671-
704.
Baker, F. B., 1985. The basics of item response theory. Portsmouth, N.H.: Heinemann.
Black, T. R., 1999. Doing quantitative research in the social sciences : an integrated approach to
research design, measurement and statistics. London : SAGE.
Bond, T. G. & Fox, C. M., 2001. Applying the Rasch Model: Fundamental Measurement in the Human
Sciences. Mahwah, N.J.: Lawrence Erlbaum Associates.
Brown, G., 1997. Assessing student learning in higher education. London: Routledge.
Brown, H., 1994. Teaching by principles : an interactive approach to language pedagogy. 2 ed. Upper
Saddle River, N.J. : Prentice-Hall.
Bryman, A., 2012. Social research methods. 4 ed. Oxford: Oxford University Press.
Cliff, N., 1989. Ordinal consistency and ordinal true scores. Psychometrika, 54(1), pp. 75-91.
Crocker, L. & Algina, J., 1986. Introduction to classical and modern test theory. New York ; London :
Holt, Rinehart and Winston.
Davies, A., 1978. Language Testing. Language Teaching, 11(3), pp. 145-159 .
Denscombe, M., 2010. The good research guide for small-scale social research projects. 4 ed.
Maidenhead: McGraw-Hill/Open University Press.
DeVellis, R. F., 2006. Classical test theory. medical Care, 44(11), pp. S50-S59.
Diekhoff, G. M., 1983. Testing through relationship judgments. Journal of Educational Psychology,
75(2), pp. 227-233.
Domino, G. & Domino, M. L., 2006. Psychological Testing: An Introduction. 2 ed. New York ;
Cambridge : Cambridge University Press.
Embretson, S. E., 1999. Issues in the measurement of cognitive abilities. In: S. E. Embretson & S. L.
Hershberger, eds. The new rules of measurement: what every psychologist and educator should
know. Mahwah, N.J. ; London : L. Erlbaum Associates , pp. 1-15.
Feldt, L. S., 2004. Estimating the reliability of a test battery composite or a test score based on
weighted item scoring. Measurement And Evaluation In Counseling And Development, 37(3), pp.
184-190.
Fulcher, G. & Davidson, F., 2009. Test architecture, test retrofit. Language Testing, 26(1), pp. 123-
144.
Furr, R. M. & Bacharach, V. R., 2008. Psychometrics; an introduction. California: Sage Publications.
Green, R., 2013. Statistical analysis for language testers. New York: Palgrave Macmillan.
Guilford, J. P., 1954. Psychometric methods. 2 ed. New York: McGraw-Hill.
Gulliksen, H., 1950. Theory of mental tests. New York: Wiley.
Henning, G., 1987. A guide to language testing: development, evaluation, research. Cambridge:
Cambridge University Press.
Hughes, A., 2003. Testing for language teachers. 2 ed. Cambridge : Cambridge University Press.
Hu, G. & Mckay, S. L., 2012. English Language Education in East Asia: Some Recent Developments.
Journal of Multilingual and Multicultural Development , 33(4), pp. 345-362.
Kaplan, R. M. & Saccuzzo, D. P., 2005. Psychological testing : principles, applications and issues. 6 ed.
Belmont, Calif. : Thomson Wadsworth .
Kim, J. et al., 2010. An analysis of determinants and the validity of item weighting. The journal of
Curriculum and Evaluation, 13(2), pp. 197-218.
Koopman, R. F., 1988. On the sensitivity of a composite to its weights. Psychometrika, 53(4), pp. 547-
552 .
Krippendorff, K., 1980. Content analysis : an introduction to its methodology. Beverly Hills, Calif. :
Sage .
Kumar, R., 2005. Research methodology; A step-by-step guide for beginners. 2 ed. London: SAGE.
Lado, R., 1961. Language testing. London: Longmans.
Lawson, D. M., 2006. Applying the item response theory to classroom examinations. Journal of
Manipulative and Physiological Therapeutics, 29(5), pp. 393-397.
Leung, C. & Lewkowicz, J., 2006. Expanding horizons and unresolved conundrums: Language testing
and assessment. TESOL Quarterly, 40(1), pp. 211-234.
Lewin, C., 2011. Understanding and Describing Quantitative Data. In: B. Somekh & C. Lewin, eds.
Theory and Methods in Social Research. London : SAGE, pp. 220-230.
Linacre, J. M., 2002. Rasch.org. [Online] Available at: http://www.rasch.org/rmt/rmt162f.htm
[Accessed 19 May 2014].
Lloyd-Jones, R., 1992. An overview of assessment. In: Assessment: from principles to action.
London: Routledge, pp. 1-12.
McNamara, T., 2000. Language Testing. Oxford: Oxford University Press.
McNamara, T. & Ryan, K., 2011. Fairness Versus Justice in Language Testing: The Place of English
Literacy in the Australian Citizenship Test. Language Assessment Quarterly, 8(2), pp. 161-178.
Mellenbergh, G. J., 1996. Measurement Precision in Test Score and Item Response Models.
Psychological Methods, 1(3), pp. 293-299.
Murphy, K. R. & Davidshofer, C. O., c1991. Psychological testing : principles & applications. 2 ed.
Englewood Cliffs, N.J. : Prentice Hall.
Neuman, W. L., 2005. Social research methods : qualitative and quantitative approaches. 6 ed.
Boston, Mass. ; London : Pearson.
Oppenheim, A. N., 1992. Questionnaire design, interviewing and attitude measurement. London ;
New York: CONTINUUM.
Pae, H., 2012. Convergence and discriminant: assessing multiple traits using multiple methods.
Educational Research and Evaluation, 18(6), pp. 571-596 .
Pae, H. K., 2012. A psychometric measurement model for adult English language learners: Pearson
Test of English Academic. Educational Research and Evaluation, 18(3), p. 211-229.
Rudner, L. M., 2001. Informed Test Component Weighting. Educational Measurement: Issues and
Practice, 20(1), pp. 16-19.
Sabers, D. L. & White, G. W., 1969. The Effect of Differential Weighting of Individual Item Responses
on the Predictive Validity and Reliability of an Aptitude Test. Journal of Educational Measurement,
6(2), pp. 93-96.
Scott, D. & Morrison, M., 2006. Key ideas in educational research. London ; New York : Continuum.
Sharkness, J. & DeAngelo, L., 2011. Measuring Student Involvement: A Comparison of Classical Test
Theory and Item Response Theory in the Construction of Scales from Student Surveys. Research in
Higher Education, 52(5), pp. 480-507 .
Tao, J., Shi, N.-Z. & Chang, H.-H., 2012. Item-Weighted Likelihood Method for Ability Estimation in Tests
Composed of Both Dichotomous and Polytomous Items. Journal of Educational and Behavioral
Statistics, 37(2), pp. 298-315.
Shohamy, E., 1998. Critical Language Testing and Beyond. Studies in Educational Evaluation, 24(4),
pp. 331-345.
Shohamy, E., 2000. The relationship between language testing and second language acquisition,
revisited. System , Volume 28, pp. 541-553.
Shohamy, E. & McNamara, T., 2009. Language tests for citizenship, immigration, and asylum [Special
issue]. Language Assessment Quarterly, 6(1), pp. 1-5.
Skehan, P., 1989. Language testing part II. Language Teaching, 22(1), pp. 1-13.
Stalnaker, J. M., 1938. Weighting questions in the essay-type examination. Journal of Educational
Psychology, 29(7), pp. 481-490 .
Statman, S., 1998. Tester and testee: two sides of different coins. System , Volume 26, pp. 195-204.
Taylor, L., 2005. Washback and impact. ELT Journal , 59(2), pp. 154-155.
Taylor, L., 2013. Communicating the theory, practice and principles of language testing to test
stakeholders: Some reflections. Language Testing, 30(3), pp. 403-412 .
Taylor, L. B., 2004. Current Issues in English Language Testing Research. TESOL Quarterly, 38(1), pp.
141-146.
Tierney, R. D., 2006. Changing practices: influences on classroom assessment. Assessment in
Education: Principles, Policy & Practice, 13(3), pp. 239-264.
Wang, M. W. & Stanley, J. C., 1970. Differential Weighting: A Review of Methods and Empirical
Studies. Review of Educational Research, Volume 40, pp. 663-705.
Weir, C. J., 2005. Language Testing and Validation: An evidence-based approach. New York:
Palgrave Macmillan.
West, P. V., 1924. The Significance of Weighted Scores. Journal of Educational Psychology, 15(5),
pp. 302-308.
Wiliam, D., 2011. What is assessment for learning? Studies in Educational Evaluation, Volume 37,
pp. 3-14.
Wilson, M., 2013. Using the concept of a measurement system to characterize measurement models
used in psychometrics. Measurement, 46(9), pp. 3766-3774.
Woodford, P. E., 1980. Foreign Language Testing. Modern Language Journal, 64(1), pp. 97-102.
Xu, T. & Stone, C. A., 2012. Using IRT Trait Estimates Versus Summated Scores in Predicting
Outcomes. Educational and Psychological Measurement, 72(3), pp. 453-468.
Zhan, Y. & Andrews, S., 2014. Washback effects from a high-stakes examination on out-of-class
English learning: insights from possible self theories. Assessment in Education: Principles, Policy &
Practice, 21(1), pp. 71-89.
Appendix
Appendix 1
Randomly selected sample from X high school
1 2 3 4 8 10 11 13 17 18 20 21 24
25 27 28 29 30 32 33 35 38 39
40 41 42 43 45 46 47 48 49 50
52 53 55 56 57 58 59 61 63 64
65 66 67 69 71 72 73 74 75 78
80 82 84 85 86 87 88 89 90 91
92 93 94 95 96 97 99 101 102 103
105 106 108 109 111 112 113 114 115 116
117 118 119 122 123 124 126 127 128 129
130 132 133 134 135 136 137 140 141 142
145 147 148 149 151 152 153 154 155 156
157 158 159 160 161 162 163 164 165 166
167 168 169 170 171 172 173 174 175 176
177 178 179 180 181 182 183 184 185 186
187 188 189 190 191 192 193 194 195 196
197 198 199 200 201 202 203 204 205 206
207 208 210 213 214 215 216 217 219 221
222 226 227 228 229 230 231 232 233 234
236 237 239 240 242 247 248 249 251 254
255 256 257 258 259 260 262
Appendix 2
Randomly selected sample from Y high school
5 7 11 13 15 19 20 23 24 27
28 34 36 40 42 43 47 50 51 54
55 59 60 61 65 67 68 69 71 74
75 77 79 81 82 83 85 86 87 88
90 91 92 94 95 96 99 100 102 104
105 106 108 109 110 112 113 114 115 118
119 121 122 123 125 126 127 129 131 132
133 135 136 137 139 141 143 144 145 146
148 149 150 153 156 159 160 162 163 164
168 170 171 172 173 175 177 179 180 181
184 185 187 189 190 191 193 194 195 197
198 199 201 206 207 208 210 212 215 216
217 218 220 221 222 224 228 229 230 231
233 237 238 239 242 245 248 251 252 255
256 259 260 261 262 264 265 266 268 269
272 273 275 278 279 280 282 284 286 287
288 289 291 292 293 295 296 297 299 300
303 305 306 307 309 310 311 314 315 317
318 319 320 323 327 328 331 334 336 340
344 346 347 350 351 354 355 358 359 361
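Both lists contain 200 entry numbers drawn without replacement (X school entries run to 262, Y school entries to 361). For readers replicating the sampling step, a minimal sketch follows; the tool and seed actually used in the study are not reported, so both are assumptions here:

    import random

    # Simple random sampling of student entry numbers without replacement,
    # as in Appendices 1 and 2; the seed is an arbitrary assumption.
    random.seed(1)
    x_sample = sorted(random.sample(range(1, 263), 200))   # X high school
    y_sample = sorted(random.sample(range(1, 362), 200))   # Y high school
    print(x_sample[:10])
    print(y_sample[:10])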
Appendix 3 X, Y, and Z schools' two-dimensional tables of test specifications
X school's two-dimensional table of test specifications
[Each item also carries objective-taxonomy (K/C/Ap/An/S/E) and difficulty (H/M/E) markings, which are not reproduced here.]
Item Num.  Language category  Mark
1  Finding an awkward word to the context of the given passage  4
2  Finding an awkward word to the context of the given passage  3.5
3  Finding a suitable word  3.4
4  Distinguishing a pronoun which indicates a different reference  2.5
5  Identifying the syntactically wrong (or correct) usage of word categories (e.g. noun, verb, relative pronoun, preposition, adjective, etc.)  4.5
6  Identifying the syntactically wrong (or correct) usage of word categories (e.g. noun, verb, relative pronoun, preposition, adjective, etc.)  4.4
7  Finding a topic of a given passage  3.5
8  Finding a topic of a given passage  2.8
9  Finding a theme of a given passage  3.7
10  Finding a theme of a given passage  3.3
11  Summarising a given passage  4.6
12  Finding a non-cohesive sentence  3.3
13  Locating a given sentence in a passage  3.7
14  Placing sentences in a logical order  3.2
15  Finding suitable discourse markers  2.9
16  Finding suitable discourse markers  3.1
17  Finding a wrong explanation about a given passage  2.6
18  Identifying author's mood  2.7
19  Finding a title of a given passage  4.3
20  Finding a title of a given passage  4
21  Finding a title of a given passage  4
22  Inferring a word based on a given passage  4.2
23  Inferring a phrase based on a given passage  4.1
24  Inferring a word based on a given passage  3
25  Inferring a phrase based on a given passage  3.8
26  Inferring a phrase based on a given passage  3.9
27  Finding a title of a given long passage  3.4
28  Inferring words based on a given long passage  3.6
*K: Knowledge, C: comprehension, Ap: application, An: Analysis, S: synthesis, E: Evaluation
Y school's two-dimensional table of test specifications
[Each item also carries objective-taxonomy (V/G/C/I/Ap) and difficulty (H/M/E) markings, which are not reproduced here.]
Item Num.  Language category  Mark
1  Finding a non-cohesive sentence  3.6
2  Inferring words based on a given passage  3.6
3  Finding suitable discourse markers  3.6
4  Inferring a phrase based on a given passage  3.6
5  Locating a given sentence in a passage  3.6
6  Inferring a phrase based on a given passage  3.8
7  Inferring a phrase based on a given passage  3.8
8  Identifying the syntactically wrong (or correct) usage of word categories (e.g. noun, verb, relative pronoun, preposition, adjective, etc.)  3.6
9  Placing sentences in a logical order  3.6
10  Summarising a given passage  3.8
11  Finding an awkward word to the context of the given passage  3.6
12  Finding a theme of a given passage  3.6
13  Inferring a phrase based on a given passage  3.8
14  Inferring words based on a given passage  3.6
15  Locating a given sentence in a passage  3.2
16  Identifying the syntactically wrong (or correct) usage of word categories (e.g. noun, verb, relative pronoun, preposition, adjective, etc.)  3.6
17  Inferring a phrase based on a given passage  3.8
18  Finding a non-cohesive sentence  3.8
19  Finding suitable discourse markers  3.1
20  Placing sentences in a logical order  3.6
21  Finding suitable discourse markers  3.3
22  Finding suitable words to a given passage  3.6
23  Finding a title of a given passage  3.7
24  Summarising a given passage  3.6
25  Finding a wrong explanation about a given passage  3.2
26  Identifying the syntactically wrong (or correct) usage of word categories (e.g. noun, verb, relative pronoun, preposition, adjective, etc.)  3.1
27  Finding a title of a given passage  3.3
28  Finding an awkward word to the context of the given passage  3.9
*V: vocabulary, G: grammar, C: comprehension, I: inference, Ap: application
Z school's two-dimensional table of test specifications
[Each item also carries objective-taxonomy (K/C/Ap/An/S/E) and difficulty (H/M/E) markings, which are not reproduced here.]
Item Num.  Language category  Mark
1  Identifying the syntactically wrong (or correct) usage of word categories (e.g. noun, verb, relative pronoun, preposition, adjective, etc.)  3.9
2  Identifying the syntactically wrong (or correct) usage of word categories (e.g. noun, verb, relative pronoun, preposition, adjective, etc.)  3.9
3  Summarising a given passage  3.4
4  Finding a topic of a given passage  3.1
5  Finding a topic of a given passage  3.5
6  Finding a topic of a given passage  3.1
7  Finding a title of a given passage  3.4
8  Finding a title of a given passage  3.4
9  Finding suitable words to the context of a given passage  3.2
10  Finding suitable words to the context of a given passage  3.3
11  Inferring a phrase based on a given passage  3.6
12  Inferring a phrase based on a given passage  3.9
13  Inferring a sentence based on a given passage  3.4
14  Inferring a phrase based on a given passage  3.5
15  Inferring a phrase based on a given passage  3.4
16  Inferring a sentence based on a given passage  3.6
17  Finding a theme of a given passage  3.5
18  Locating a given sentence in a passage  3.8
19  Locating a given sentence in a passage  3
20  Inferring words based on a given passage  3.9
21  Inferring words based on a given passage  3.4
22  Finding suitable discourse markers  3.1
23  Identifying the syntactically wrong (or correct) usage of word categories (e.g. noun, verb, relative pronoun, preposition, adjective, etc.)  3.8
24  Identifying the syntactically wrong (or correct) usage of word categories (e.g. noun, verb, relative pronoun, preposition, adjective, etc.)  3.7
25  Identifying the syntactically wrong (or correct) usage of word categories (e.g. noun, verb, relative pronoun, preposition, adjective, etc.)  3.7
26  Placing sentences in a logical order  3.1
27  Placing sentences in a logical order  3.3
28  Placing sentences in a logical order  3.1
29  Finding a wrong explanation about a given passage  3
*K: Knowledge, C: comprehension, Ap: application, An: Analysis, S: synthesis, E: Evaluation
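The Mark column in each of these specifications is the teacher-assigned item weight, and within each school the marks sum to 100. A minimal sketch of how such differential weights turn a dichotomous response pattern into a total score, contrasted with equal weighting (the function names are illustrative, not from the study):

    # Z school's marks, transcribed from the table above; they sum to 100.
    marks = [3.9, 3.9, 3.4, 3.1, 3.5, 3.1, 3.4, 3.4, 3.2, 3.3,
             3.6, 3.9, 3.4, 3.5, 3.4, 3.6, 3.5, 3.8, 3.0, 3.9,
             3.4, 3.1, 3.8, 3.7, 3.7, 3.1, 3.3, 3.1, 3.0]
    assert round(sum(marks), 1) == 100.0

    def weighted_score(responses, weights):
        """Total when each correctly answered item earns its own mark."""
        return sum(w for r, w in zip(responses, weights) if r == 1)

    def equal_weight_score(responses, total=100.0):
        """Total when every item is worth an equal share of 100."""
        return total * sum(responses) / len(responses)

    # A hypothetical student answering only the first ten items correctly:
    resp = [1] * 10 + [0] * 19
    print(round(weighted_score(resp, marks), 1))   # 34.2 under teacher weighting
    print(round(equal_weight_score(resp), 1))      # 34.5 under equal weighting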
Appendix 4 X, Y, and Z high schools' person data analysis
X HIGH SCHOOL
PERSON-ENTRY ABILITY-MEASURE SCORE INFIT-MNSQ INFIT-ZSTD OUTFIT-MNSQ OUTFIT-ZSTD (two persons per printed row)
1 0.91 19 0.86 -0.65 0.69 -0.87    140 1.6 22 0.73 -0.94 0.51 -0.99
2 2.6 25 0.61 -0.78 0.28 -0.82    141 0.91 19 0.98 -0.03 0.87 -0.25
3 -0.95 9 1.58 2.63 1.73 1.61    142 -1.38 7 1.07 0.38 1.03 0.23
4 0.71 18 0.84 -0.85 0.75 -0.75    145 2.2 24 1.08 0.34 1.18 0.48
8 0.34 16 1.06 0.39 1.07 0.32    147 1.6 22 1.02 0.15 0.86 -0.11
10 1.35 21 0.79 -0.82 0.66 -0.69    148 2.2 24 0.89 -0.16 0.83 -0.01
11 -0.57 11 1.06 0.39 0.94 -0.07    149 -0.03 14 1.02 0.16 0.95 -0.10
13 -0.39 12 0.92 -0.42 0.93 -0.13 151 -0.76 10 0.97 -0.11 0.86 -0.27 17 -2.19 4 1.22 0.67 1.69 0.98 152 2.6 25 0.92 -0.02 0.75 0.01 18 -1.38 7 1.15 0.68 1.10 0.36 153 -0.95 9 1.20 1.06 3.39 3.83 20 -0.95 9 1.26 1.32 1.20 0.59 154 1.88 23 0.73 -0.77 0.47 -0.87 21 -1.38 7 1.29 1.20 2.03 1.72 155 -0.95 9 0.80 -1.06 0.72 -0.64
24 -0.95 9 0.86 -0.70 0.77 -0.48 156 1.35 21 0.79 -0.83 0.78 -0.37 25 0.71 18 1.08 0.49 1.12 0.46 157 1.88 23 0.91 -0.18 0.64 -0.47 27 0.15 15 0.97 -0.15 0.91 -0.24 158 0.34 16 0.98 -0.08 0.92 -0.21 28 -0.57 11 1.19 1.11 1.19 0.66 159 3.13 26 1.24 0.57 0.66 0.07 29 -1.88 5 1.26 0.88 1.35 0.69 160 -0.57 11 1.06 0.39 0.95 -0.03 30 -0.76 10 0.89 -0.58 0.80 -0.47 161 2.6 25 0.77 -0.36 0.44 -0.48 32 -2.19 4 1.28 0.83 1.78 1.05 162 3.98 27 1.16 0.46 0.50 0.03
33 -0.03 14 0.96 -0.17 0.90 -0.28 163 -2.57 3 1.03 0.21 3.57 1.97 35 -0.39 12 1.24 1.42 2.44 3.49 164 1.12 20 0.92 -0.30 0.80 -0.41 38 -1.88 5 1.36 1.13 1.40 0.75 165 1.35 21 1.09 0.41 1.33 0.81 39 -1.16 8 1.02 0.17 1.24 0.63 166 1.88 23 0.76 -0.69 0.98 0.17 40 -0.57 11 0.97 -0.13 0.90 -0.20 167 1.88 23 1.13 0.50 1.29 0.64
41 -2.19 4 1.28 0.83 1.33 0.64 168 1.6 22 0.97 -0.02 1.04 0.26 42 -0.95 9 1.12 0.67 1.24 0.67 169 -1.62 6 0.99 0.07 1.33 0.70 43 1.12 20 0.98 -0.03 0.89 -0.16 170 1.12 20 0.84 -0.67 0.69 -0.74 45 2.6 25 0.71 -0.51 0.35 -0.66 171 1.35 21 0.88 -0.40 0.93 0.00 46 -0.57 11 1.22 1.28 1.24 0.79 172 2.2 24 0.82 -0.35 0.61 -0.36 47 -0.95 9 1.27 1.38 3.67 4.13 173 1.6 22 1.06 0.29 1.02 0.21 48 -0.76 10 1.04 0.26 2.82 3.51 174 -1.16 8 1.18 0.88 1.07 0.29
49 -0.95 9 1.03 0.21 0.89 -0.14 175 0.91 19 1.07 0.39 1.10 0.38 50 -1.88 5 1.26 0.88 1.35 0.69 176 -1.16 8 1.19 0.91 1.67 1.37 52 -0.57 11 0.91 -0.51 0.87 -0.28 177 1.88 23 1.17 0.59 1.67 1.12 53 -0.2 13 1.31 1.78 1.25 0.90 178 -0.2 13 0.58 -3.04 0.52 -1.97 55 -2.57 3 1.28 0.71 3.44 1.91 179 0.71 18 0.99 0.03 0.83 -0.44
56 -0.57 11 1.22 1.29 1.12 0.45 180 1.12 20 0.90 -0.38 0.74 -0.57 57 -0.57 11 1.15 0.91 1.17 0.60 181 -0.03 14 1.02 0.18 0.92 -0.20 58 2.6 25 0.91 -0.04 0.71 -0.05 182 -1.16 8 0.83 -0.79 0.70 -0.57 59 -2.19 4 1.16 0.54 0.84 0.06 183 1.12 20 0.78 -0.99 0.80 -0.42 61 -0.57 11 0.88 -0.68 0.77 -0.64 184 1.35 21 0.91 -0.28 0.91 -0.06 63 -0.57 11 0.73 -1.73 0.62 -1.20 185 -0.95 9 0.96 -0.16 0.92 -0.07 64 -1.16 8 0.99 0.02 1.41 0.95 186 1.88 23 0.98 0.04 1.00 0.20
65 -0.76 10 0.99 0.03 1.01 0.16 187 -1.16 8 0.86 -0.63 0.69 -0.60 66 -0.2 13 1.16 0.97 1.24 0.88 188 1.88 23 0.79 -0.57 0.50 -0.80 67 -1.38 7 1.39 1.55 2.34 2.07 189 3.98 27 0.55 -0.33 0.10 -0.68 69 -1.16 8 1.04 0.27 1.00 0.14 190 1.12 20 0.89 -0.42 0.76 -0.51 71 1.35 21 1.03 0.22 0.98 0.11 191 1.35 21 1.29 1.12 1.35 0.84
72 -0.76 10 0.98 -0.08 1.18 0.58 192 0.71 18 0.79 -1.14 0.66 -1.09 73 1.6 22 0.90 -0.27 0.69 -0.48 193 0.71 18 0.95 -0.23 0.91 -0.18 74 0.91 19 1.09 0.51 1.17 0.58 194 0.71 18 0.98 -0.02 0.97 0.01 75 0.71 18 1.31 1.54 1.62 1.73 195 2.6 25 0.82 -0.25 0.59 -0.22 78 0.15 15 1.39 2.15 1.97 2.83 196 0.91 19 0.74 -1.34 0.59 -1.25 80 -0.39 12 1.67 3.46 2.90 4.27 197 1.35 21 0.68 -1.34 0.50 -1.22 82 -0.57 11 1.07 0.46 0.96 -0.01 198 1.12 20 0.92 -0.31 0.75 -0.56
84 0.71 18 0.84 -0.80 0.76 -0.71 199 0.91 19 0.93 -0.29 0.88 -0.25 85 -0.95 9 0.93 -0.30 0.79 -0.42 200 1.6 22 1.15 0.60 1.36 0.80 86 1.12 20 1.07 0.38 1.06 0.27 201 0.34 16 0.98 -0.05 0.93 -0.16 87 1.12 20 0.81 -0.84 0.62 -0.96 202 -0.76 10 0.97 -0.12 1.18 0.59 88 0.52 17 0.79 -1.24 0.70 -1.04 203 0.52 17 1.04 0.26 0.96 -0.02
PERSON-ENTRY ABILITY-MEASURE SCORE INFIT-MNSQ INFIT-ZSTD OUTFIT-MNSQ OUTFIT-ZSTD (two persons per printed row)
89 1.12 20 0.89 -0.44 0.85 -0.25    204 -0.95 9 1.00 0.04 1.23 0.67
90 -1.88 5 0.97 0.00 0.87 0.04    205 3.13 26 0.60 -0.59 0.21 -0.62
91 -1.62 6 1.35 1.26 1.36 0.74    206 -1.88 5 0.83 -0.48 0.57 -0.52
92 0.71 18 0.95 -0.22 0.98 0.06 207 0.91 19 1.09 0.48 1.13 0.48 93 -1.16 8 1.09 0.50 1.17 0.50 208 3.13 26 0.60 -0.59 0.21 -0.62 94 -0.95 9 0.77 -1.24 0.63 -0.91 210 5.3 28 1.00 0.00 1.00 0.00 95 -1.62 6 0.83 -0.57 0.60 -0.60 213 -0.2 13 1.43 2.41 1.68 2.07 96 1.35 21 1.12 0.54 0.99 0.13 214 -0.03 14 1.09 0.57 1.05 0.28 97 0.71 18 0.71 -1.67 0.60 -1.35 215 0.91 19 0.85 -0.72 0.75 -0.65 99 0.91 19 0.85 -0.72 0.74 -0.69 216 1.88 23 1.03 0.21 0.75 -0.23
101 -1.38 7 1.17 0.75 1.42 0.89 217 1.88 23 0.98 0.05 0.64 -0.46 102 -1.62 6 1.01 0.13 1.87 1.37 219 -0.03 14 0.78 -1.38 0.71 -1.05 103 -1.62 6 0.83 -0.57 0.60 -0.60 221 2.6 25 0.89 -0.09 0.63 -0.17 105 -0.95 9 1.09 0.54 1.19 0.59 222 0.15 15 0.80 -1.27 0.73 -1.01 106 0.15 15 0.97 -0.14 0.90 -0.30 226 2.2 24 0.96 0.03 1.18 0.48
108 -2.19 4 0.80 -0.45 0.51 -0.48 227 0.91 19 0.99 0.02 1.02 0.18 109 -0.57 11 1.10 0.62 1.03 0.20 228 1.6 22 0.93 -0.14 0.87 -0.09 111 -0.57 11 1.19 1.14 1.30 0.94 229 0.15 15 0.95 -0.23 0.86 -0.44 112 0.34 16 0.78 -1.37 0.68 -1.17 230 -0.76 10 0.99 0.00 1.00 0.11 113 -0.39 12 1.02 0.19 0.93 -0.13 231 -0.57 11 0.79 -1.25 0.69 -0.93 114 -0.39 12 0.82 -1.13 0.93 -0.14 232 1.12 20 0.81 -0.84 0.61 -1.00 115 -1.62 6 1.01 0.13 1.72 1.20 233 1.88 23 0.94 -0.07 0.96 0.15
116 -1.88 5 1.26 0.88 1.35 0.69 234 2.2 24 0.94 -0.03 0.63 -0.33 117 -1.88 5 1.00 0.10 1.95 1.32 236 -1.16 8 0.89 -0.48 1.23 0.61 118 -1.88 5 1.26 0.88 1.35 0.69 237 -1.16 8 1.08 0.44 1.19 0.55 119 1.88 23 0.85 -0.36 0.61 -0.53 239 1.6 22 0.91 -0.24 0.83 -0.16 122 1.6 22 0.98 0.03 0.91 0.00 240 0.34 16 1.00 0.07 1.43 1.44
123 0.52 17 1.03 0.23 1.05 0.26 242 3.98 27 1.30 0.61 3.90 1.69 124 0.15 15 0.94 -0.30 0.86 -0.43 247 -0.39 12 1.09 0.61 0.97 0.01 126 0.71 18 0.77 -1.28 0.65 -1.14 248 3.13 26 1.08 0.33 0.41 -0.25 127 -0.2 13 0.96 -0.20 0.86 -0.42 249 -0.03 14 1.18 1.09 1.11 0.49 128 0.34 16 0.84 -0.94 0.81 -0.61 251 3.98 27 1.24 0.54 0.92 0.41 129 -1.62 6 0.89 -0.35 0.63 -0.52 254 2.6 25 1.25 0.65 0.73 -0.03 130 -0.57 11 0.73 -1.74 0.65 -1.09 255 3.98 27 1.09 0.39 0.35 -0.15
132 1.12 20 0.79 -0.96 0.60 -1.05 256 0.52 17 0.80 -1.15 0.71 -0.98 133 3.98 27 1.25 0.55 1.03 0.49 257 0.52 17 1.01 0.14 1.01 0.14 134 0.91 19 1.09 0.48 0.99 0.09 258 -0.57 11 0.83 -1.03 0.70 -0.88 135 1.6 22 1.04 0.25 0.96 0.10 259 0.91 19 1.02 0.15 0.92 -0.11 136 1.35 21 0.84 -0.60 0.68 -0.65 260 1.6 22 0.67 -1.20 0.46 -1.12
137 1.88 23 0.93 -0.12 0.83 -0.09 262 3.13 26 1.08 0.33 0.41 -0.25
Y HIGH SCHOOL
PERSON-ENTRY ABILITY-MEASURE SCORE INFIT-MNSQ INFIT-ZSTD OUTFIT-MNSQ OUTFIT-ZSTD (two persons per printed row)
5 -0.49 11 0.92 -0.49 1.02 0.16    184 1.03 20 1.06 0.35 0.99 0.04
7 -0.84 9 1.30 1.59 1.49 1.85    185 -1.23 7 1.28 1.19 1.39 1.19
11 -1.45 6 1.24 0.93 1.75 1.77    187 -1.7 5 1.23 0.80 1.70 1.46
13 -0.84 9 1.28 1.53 1.44 1.69    189 -1.03 8 1.13 0.69 1.28 1.01
15 -1.45 6 1.02 0.18 1.07 0.31    190 1.23 21 0.89 -0.42 0.80 -0.55
19 -2.34 3 0.99 0.14 1.71 1.12    191 -1.23 7 0.91 -0.34 1.14 0.53
20 -1.03 8 1.28 1.34 1.40 1.38    193 -1.99 4 0.78 -0.52 0.59 -0.75
23 0.49 17 0.98 -0.08 0.92 -0.37    194 2.33 25 1.06 0.28 0.98 0.18
24 -1.03 8 1.07 0.42 1.22 0.84    195 -1.7 5 0.84 -0.45 0.64 -0.80
27 -1.99 4 1.23 0.70 1.41 0.86    197 0.33 16 0.78 -1.77 0.73 -1.68
28 -1.23 7 1.22 0.97 1.34 1.07    198 -1.45 6 0.78 -0.83 0.62 -1.04
34 -1.7 5 0.92 -0.16 0.76 -0.45    199 -1.23 7 1.16 0.76 1.06 0.28
36 -1.03 8 1.21 1.04 1.18 0.70    201 -1.45 6 1.16 0.68 1.29 0.84
40 -2.34 3 1.05 0.26 0.90 0.06    206 -1.23 7 1.22 0.97 1.34 1.07
42 -1.03 8 0.92 -0.33 0.86 -0.44    207 -0.16 13 0.93 -0.48 0.90 -0.61
43 -0.66 10 1.39 2.24 1.55 2.32    208 1.7 23 0.75 -0.79 0.53 -1.13
47 -0.16 13 0.87 -1.02 0.84 -1.04    210 -0.16 13 0.83 -1.41 0.86 -0.93
50 -0.84 9 0.84 -0.90 0.77 -0.95    212 -2.34 3 1.12 0.41 1.30 0.63
51 -1.45 6 0.87 -0.42 0.78 -0.51    215 1.23 21 0.74 -1.21 0.62 -1.24
54 -0.16 13 1.13 1.00 1.12 0.80    216 0.17 15 0.76 -2.06 0.74 -1.81
55 -1.23 7 1.08 0.42 1.09 0.37    217 1.45 22 1.00 0.08 0.83 -0.35
59 1.98 24 0.93 -0.08 0.73 -0.39    218 0.49 17 0.79 -1.50 0.74 -1.49
60 -0.49 11 0.89 -0.75 0.83 -0.94    220 -1.23 7 1.31 1.33 1.60 1.69
61 -1.45 6 0.98 -0.01 0.94 -0.04    221 1.23 21 0.86 -0.60 0.72 -0.87
65 -0.66 10 1.06 0.41 1.08 0.46    222 -1.7 5 1.21 0.75 1.27 0.72
67 -0.16 13 0.84 -1.27 0.84 -1.07    224 0.33 16 1.03 0.28 1.01 0.12
68 -1.7 5 1.07 0.32 1.37 0.89    228 2.33 25 1.02 0.18 0.85 -0.02
69 -1.45 6 1.25 0.97 1.52 1.33    229 0.33 16 0.98 -0.08 0.99 0.01
71 -1.7 5 1.02 0.15 1.36 0.88    230 0.84 19 1.13 0.78 1.07 0.36
74 3.54 27 1.05 0.36 1.32 0.65    231 -1.7 5 1.23 0.80 1.35 0.86
75 -0.84 9 0.96 -0.14 0.90 -0.34    233 -1.99 4 0.98 0.07 0.88 -0.05
76 -1.23 7 1.22 0.97 1.34 1.07    237 -1.03 8 1.40 1.84 1.71 2.21
79 -1.7 5 1.33 1.08 2.06 2.00    238 -1.23 7 1.27 1.16 1.24 0.81
81 1.98 24 1.08 0.33 2.09 1.75    239 0.67 18 0.99 -0.03 0.96 -0.10
82 0.67 18 0.92 -0.49 0.88 -0.49    242 1.45 22 0.97 -0.02 0.83 -0.35
83 -1.23 7 1.22 0.97 1.34 1.07    245 -1.23 7 1.37 1.55 1.62 1.75
85 -1.99 4 0.88 -0.21 0.77 -0.31    248 0.84 19 0.80 -1.17 0.79 -0.87
86 -1.45 6 1.29 1.10 1.73 1.73    251 -1.45 6 1.06 0.32 1.02 0.17
87 0.67 18 1.14 0.91 1.18 0.88    252 -1.03 8 1.13 0.70 1.13 0.56
88 -1.99 4 1.09 0.35 1.12 0.39    255 -1.7 5 0.87 -0.34 0.71 -0.60
90 -0.66 10 1.20 1.24 1.34 1.55    256 -0.66 10 0.85 -0.93 0.95 -0.20
91 0.33 16 0.87 -1.01 0.81 -1.12    259 0.84 19 0.82 -1.08 0.75 -1.05
92 -1.7 5 0.92 -0.16 0.76 -0.45    260 -1.03 8 1.10 0.57 1.21 0.80
94 -1.7 5 1.36 1.15 1.85 1.70    261 0.17 15 0.95 -0.37 0.93 -0.42
95 -0.49 11 0.81 -1.34 0.76 -1.34    262 0.84 19 0.82 -1.04 0.74 -1.11
96 -1.03 8 0.94 -0.22 0.94 -0.12    264 -0.84 9 1.13 0.74 1.22 0.92
99 -1.23 7 0.96 -0.11 0.98 0.04    265 -1.99 4 1.12 0.44 1.38 0.82
100 1.23 21 0.92 -0.30 0.79 -0.58    266 -1.45 6 1.01 0.11 0.86 -0.27
PERSON-ENTRY ABILITY-MEASURE SCORE INFIT-MNSQ INFIT-ZSTD OUTFIT-MNSQ OUTFIT-ZSTD (two persons per printed row)
102 -0.16 13 0.99 -0.02 1.02 0.17    268 0.33 16 0.88 -0.87 0.83 -1.01
104 0.67 18 0.82 -1.17 0.77 -1.09    269 0.84 19 0.84 -0.90 0.76 -1.00
105 -0.49 11 1.06 0.49 1.04 0.30    272 1.23 21 1.15 0.73 1.35 1.07
106 -0.66 10 0.88 -0.71 0.90 -0.40    273 0.84 19 0.75 -1.55 0.66 -1.51
108 -0.84 9 0.84 -0.87 0.83 -0.67    275 1.98 24 0.98 0.07 0.75 -0.35
109 -0.66 10 1.27 1.63 1.32 1.44    278 0.84 19 0.93 -0.36 0.84 -0.60
110 -0.84 9 1.06 0.40 1.12 0.56    279 1.23 21 0.83 -0.75 0.75 -0.75
112 -0.32 12 0.86 -1.07 0.87 -0.79    280 -1.23 7 0.95 -0.16 0.81 -0.52
113 -0.66 10 1.15 0.93 1.26 1.20    282 1.98 24 0.90 -0.17 0.72 -0.40
114 -0.49 11 0.97 -0.18 0.90 -0.47    284 1.45 22 0.97 -0.05 0.84 -0.33
115 0.17 15 0.99 -0.06 0.99 -0.03    286 1.03 20 0.82 -0.93 0.72 -1.02
118 -1.03 8 1.16 0.81 1.38 1.33    287 1.45 22 0.97 -0.02 0.85 -0.29
119 0.84 19 0.97 -0.10 0.98 0.01    288 -0.16 13 1.02 0.16 0.99 0.00
121 -0.66 10 0.79 -1.38 0.73 -1.33    289 -1.03 8 1.08 0.44 1.19 0.74
122 -1.99 4 0.96 0.00 0.95 0.07    291 1.7 23 0.85 -0.41 0.63 -0.83
123 1.03 20 0.77 -1.24 0.69 -1.15    292 -1.45 6 0.76 -0.92 0.61 -1.08
125 1.45 22 1.02 0.18 0.87 -0.23    293 0.67 18 0.70 -2.11 0.64 -1.90
126 0.17 15 1.05 0.43 1.03 0.22    295 0.67 18 0.80 -1.31 0.74 -1.29
127 -0.49 11 0.95 -0.29 0.97 -0.09    296 3.54 27 1.06 0.37 1.59 0.82
129 -0.16 13 0.91 -0.65 0.95 -0.31    297 -1.45 6 0.97 -0.03 0.90 -0.15
131 1.7 23 0.83 -0.48 0.73 -0.52    299 0.33 16 0.69 -2.63 0.64 -2.37
132 -1.45 6 0.99 0.06 1.03 0.21    300 -1.45 6 0.86 -0.46 0.72 -0.72
133 -0.84 9 0.99 0.00 0.99 0.03    303 2.79 26 0.98 0.16 0.65 -0.19
135 -0.49 11 0.83 -1.20 0.80 -1.12    305 0.84 19 0.85 -0.83 0.79 -0.83
136 -0.66 10 1.19 1.19 1.34 1.52    306 1.7 23 0.88 -0.31 0.79 -0.37
137 -1.45 6 0.93 -0.18 0.83 -0.35    307 0.17 15 0.82 -1.51 0.78 -1.48
139 0.84 19 0.93 -0.33 0.85 -0.57    309 -0.49 11 0.95 -0.32 0.92 -0.39
141 -1.23 7 1.02 0.17 1.12 0.46    310 1.7 23 1.05 0.27 1.20 0.56
143 -1.23 7 0.91 -0.35 0.93 -0.12    311 0.84 19 0.81 -1.09 0.73 -1.15
144 2.79 26 0.94 0.09 0.84 0.08    314 -0.49 11 1.18 1.21 1.19 1.03
145 -0.32 12 0.88 -0.85 0.86 -0.83    315 -0.66 10 0.96 -0.23 1.00 0.05
146 -1.7 5 0.97 0.01 0.94 0.01    317 -0.84 9 0.83 -0.96 0.78 -0.93
148 -1.23 7 1.12 0.58 1.42 1.28    318 1.7 23 0.83 -0.50 0.67 -0.71
149 -1.45 6 1.28 1.06 1.47 1.23    319 1.45 22 0.91 -0.29 0.81 -0.40
150 -1.45 6 0.97 -0.02 1.19 0.60    320 -1.03 8 1.02 0.14 0.94 -0.13
153 -0.16 13 0.80 -1.63 0.75 -1.68    323 -0.49 11 0.87 -0.88 0.86 -0.75
156 -0.49 11 0.88 -0.85 0.86 -0.73    327 -2.8 2 0.81 -0.13 0.40 -0.66
159 -0.49 11 0.88 -0.83 0.88 -0.60    328 -1.99 4 1.19 0.62 1.28 0.66
160 -0.32 12 0.95 -0.31 1.00 0.03    331 1.98 24 0.81 -0.43 0.55 -0.83
162 0.17 15 1.11 0.85 1.16 1.00    334 2.33 25 0.98 0.10 0.84 -0.04
163 0.33 16 0.70 -2.44 0.66 -2.23    336 -2.34 3 1.18 0.52 1.76 1.18
164 0.33 16 1.23 1.66 1.22 1.28    340 1.45 22 1.09 0.44 1.32 0.90
168 -1.7 5 1.18 0.66 1.29 0.76    344 2.33 25 0.90 -0.08 0.62 -0.47
170 1.23 21 0.96 -0.11 0.87 -0.33    346 -1.23 7 0.77 -1.04 0.63 -1.22
171 -0.49 11 1.08 0.61 1.18 0.97    347 -1.7 5 0.92 -0.16 0.76 -0.45
172 -0.84 9 1.22 1.22 1.31 1.25    350 0 14 0.91 -0.69 0.95 -0.30
173 2.33 25 1.08 0.33 1.06 0.31    351 -1.45 6 1.03 0.20 1.20 0.62
175 -0.66 10 1.03 0.22 1.05 0.31    354 -0.66 10 0.92 -0.44 0.87 -0.58
177 -1.7 5 0.92 -0.16 0.76 -0.45    355 -1.45 6 1.14 0.61 1.08 0.33
179 -1.45 6 1.06 0.32 1.17 0.56    358 0 14 0.96 -0.28 1.01 0.14
180 -1.23 7 1.17 0.78 1.54 1.56    359 0.17 15 1.06 0.51 1.10 0.69
181 -1.99 4 1.18 0.58 1.55 1.07    361 1.7 23 0.95 -0.08 0.85 -0.20
Z HIGH SCHOOL
PERSON-ENTRY ABILITY-MEASURE SCORE INFIT-MNSQ INFIT-ZSTD OUTFIT-MNSQ OUTFIT-ZSTD (two persons per printed row)
1 1.53 23 0.80 -0.67 0.63 -0.93 99 0.06 15 0.90 -0.73 0.89 -0.61 2 -1.32 7 0.95 -0.13 0.83 -0.37 100 0.38 17 0.95 -0.30 0.89 -0.52
3 -0.75 10 1.15 0.96 1.13 0.56 101 -1.32 7 1.17 0.79 1.21 0.66 4 -1.53 6 1.15 0.62 1.49 1.15 102 0.06 15 1.04 0.36 1.07 0.44 5 -0.26 13 1.00 0.01 0.95 -0.23 103 3.7 28 0.72 -0.04 0.16 -0.55 6
2.44 26 1.07 0.30 0.78 -0.11 104 -0.93 9 1.13 0.76 1.77 2.30 7
-0.75 10 0.94 -0.34 0.93 -0.20 105 -0.93 9 0.84 -0.91 0.77 -0.77 8
0.22 16 0.68 -2.56 0.63 -2.33 106 0.06 15 1.05 0.41 1.06 0.39 9
1.3 22 1.10 0.47 1.10 0.39 107 0.38 17 0.81 -1.35 0.75 -1.36 10 0.55 18 0.72 -1.91 0.65 -1.86 108 -1.78 5 1.04 0.23 0.82 -0.22 11 1.3 22 0.76 -0.99 0.61 -1.20 109 -0.93 9 1.23 1.30 1.39 1.30
12 1.3 22 1.05 0.27 1.09 0.38 110 -1.53 6 1.33 1.25 1.53 1.21
13 -1.12 8 1.06 0.37 1.04 0.22 111 -1.32 7 1.19 0.87 1.41 1.11
14 -0.75 10 1.11 0.70 1.21 0.86 112 0.72 19 0.91 -0.45 0.85 -0.62
15 -1.12 8 1.28 1.36 1.29 0.92 113 -1.53 6 0.92 -0.24 0.80 -0.36 16 0.22 16 0.99 -0.02 0.96 -0.14 114 -1.32 7 1.12 0.58 2.17 2.49 17 -2.87 2 0.88 -0.01 0.52 -0.33 115 1.78 24 1.10 0.40 1.17 0.51
18 2.08 25 1.08 0.33 0.84 -0.12 116 0.72 19 1.16 0.96 1.18 0.80
19 -1.12 8 1.00 0.08 0.92 -0.15 117 2.92 27 1.06 0.30 0.94 0.24
20 -0.75 10 1.07 0.49 1.02 0.14 118 -1.32 7 1.19 0.87 1.27 0.79
21 -1.53 6 0.94 -0.13 0.91 -0.07 119 -0.93 9 1.13 0.78 1.79 2.33 22 -1.78 5 1.26 0.88 1.95 1.66 120 -1.53 6 1.10 0.45 1.13 0.44 23 1.3 22 0.82 -0.73 0.71 -0.82 121 -1.12 8 0.95 -0.21 0.96 0.00
24 1.3 22 0.96 -0.09 0.97 0.04 122 1.53 23 0.88 -0.34 0.87 -0.20
25 -1.78 5 0.97 -0.01 0.84 -0.18 123 1.1 21 1.04 0.25 0.98 0.04
26 -1.32 7 1.15 0.72 2.31 2.72 124 -0.1 14 0.76 -1.98 0.71 -1.76
27 -1.32 7 1.12 0.59 1.01 0.15 125 -1.12 8 1.01 0.11 0.98 0.06 28 0.55 18 0.79 -1.40 0.73 -1.40 126 1.3 22 0.95 -0.13 0.76 -0.64 29 -1.78 5 1.10 0.43 1.02 0.21 127 -0.42 12 0.95 -0.33 0.93 -0.29
30 0.72 19 0.94 -0.28 0.86 -0.55 128 -0.42 12 0.91 -0.63 0.87 -0.59
31 0.38 17 1.03 0.28 1.10 0.59 129 -1.12 8 0.88 -0.56 0.81 -0.53
32 -0.75 10 0.89 -0.65 0.81 -0.74 130 -0.93 9 0.80 -1.19 0.73 -0.93
33 3.7 28 1.14 0.45 3.03 1.44 131 1.1 21 0.88 -0.53 0.80 -0.62 34 -1.53 6 1.20 0.81 1.54 1.22 132 0.22 16 0.85 -1.12 0.88 -0.62 35 1.1 21 0.92 -0.33 0.82 -0.55 133 -0.26 13 1.05 0.44 1.06 0.38
36 0.9 20 1.06 0.39 1.06 0.31 134 0.38 17 1.09 0.63 1.17 0.90
37 2.08 25 0.89 -0.19 0.71 -0.39 135 1.3 22 1.23 0.97 1.16 0.54
38 1.1 21 1.18 0.85 1.39 1.26 136 -0.42 12 0.73 -2.17 0.69 -1.68
39 -1.12 8 0.84 -0.79 0.72 -0.83 137 0.22 16 0.77 -1.79 0.72 -1.72 40 -0.58 11 1.03 0.26 1.10 0.52 138 -1.78 5 1.12 0.47 1.68 1.30 41 -0.75 10 1.26 1.56 1.50 1.80 139 -1.53 6 1.02 0.16 0.93 -0.02
42 -1.12 8 1.01 0.14 1.17 0.61 140 -0.42 12 0.93 -0.47 0.96 -0.15
43 -0.1 14 0.71 -2.46 0.67 -2.09 141 -0.1 14 0.90 -0.73 0.85 -0.83
44 0.55 18 0.91 -0.56 0.85 -0.70 142 2.08 25 1.06 0.28 0.75 -0.29
45 -0.58 11 1.28 1.82 1.32 1.37 143 1.3 22 0.95 -0.15 0.88 -0.26 46 0.38 17 0.84 -1.14 0.77 -1.26 144 0.06 15 0.89 -0.81 0.84 -0.90 47 0.06 15 1.13 0.96 1.11 0.66 145 -1.32 7 1.25 1.11 1.26 0.76
48 -0.26 13 0.80 -1.58 0.75 -1.41 146 1.78 24 1.07 0.31 1.12 0.40
49 -2.41 3 1.04 0.23 0.82 -0.03 147 -0.42 12 0.81 -1.45 0.76 -1.25
50 0.38 17 0.89 -0.70 0.88 -0.60 148 -0.58 11 1.18 1.24 1.29 1.25
PERSON-ENTRY ABILITY-MEASURE SCORE INFIT-MNSQ INFIT-ZSTD OUTFIT-MNSQ OUTFIT-ZSTD (two persons per printed row)
51 0.06 15 0.84 -1.26 0.81 -1.14    149 0.72 19 0.73 -1.66 0.64 -1.71
52 -1.53 6 0.80 -0.75 0.63 -0.87 150 3.7 28 1.07 0.38 0.84 0.33
53 -0.1 14 1.10 0.79 1.07 0.45 151 1.3 22 0.73 -1.15 0.57 -1.36
54 1.3 22 0.77 -0.93 0.64 -1.09 152 -1.32 7 1.24 1.06 1.31 0.89 55 0.9 20 0.96 -0.15 0.93 -0.19 153 1.3 22 0.77 -0.92 0.64 -1.08 56 0.22 16 0.74 -2.08 0.68 -1.98 154 -2.06 4 1.12 0.43 1.39 0.78
57 -1.32 7 1.08 0.41 1.09 0.36 155 0.9 20 0.89 -0.51 0.83 -0.61
58 -0.93 9 1.26 1.45 1.20 0.75 156 -0.75 10 1.12 0.75 1.07 0.35
59 0.22 16 1.09 0.69 1.04 0.30 157 -0.93 9 1.04 0.27 1.71 2.14
60 -0.58 11 0.94 -0.35 0.88 -0.50 158 -0.58 11 0.97 -0.15 0.92 -0.31 61 -0.58 11 0.79 -1.51 0.73 -1.23 159 0.06 15 0.91 -0.68 0.88 -0.67 62 -0.58 11 1.16 1.09 1.19 0.86 160 -1.12 8 1.34 1.61 1.56 1.58
63 0.06 15 0.86 -1.11 0.83 -0.99 161 -1.12 8 1.05 0.32 1.81 2.12
64 -1.12 8 1.28 1.38 1.51 1.46 162 0.38 17 1.08 0.58 1.10 0.59
65 2.44 26 1.21 0.59 1.46 0.81 163 1.1 21 0.95 -0.16 0.88 -0.31
66 -1.53 6 1.17 0.72 1.47 1.10 164 -0.93 9 1.03 0.21 1.13 0.52 67 0.55 18 0.91 -0.52 0.85 -0.69 165 0.22 16 1.04 0.36 1.02 0.15 68 0.55 18 1.02 0.15 1.00 0.06 166 -0.42 12 0.97 -0.20 0.89 -0.48
69 -1.12 8 1.12 0.67 1.03 0.19 167 -0.1 14 0.80 -1.64 0.76 -1.44
70 -0.42 12 0.92 -0.55 0.89 -0.47 168 -0.58 11 0.99 -0.02 0.95 -0.17
71 -1.78 5 0.97 0.01 0.75 -0.38 169 -1.12 8 1.24 1.19 1.21 0.71
72 1.3 22 0.95 -0.11 1.02 0.17 170 0.38 17 0.81 -1.37 0.75 -1.41 73 2.08 25 1.19 0.60 1.09 0.34 171 -0.42 12 1.07 0.54 1.06 0.34 74 -0.58 11 0.94 -0.37 0.88 -0.49 172 -2.41 3 1.10 0.36 1.08 0.35
75 -0.93 9 1.10 0.60 1.27 0.96 173 -0.75 10 1.15 0.95 1.68 2.33
76 -1.78 5 1.28 0.94 1.57 1.15 174 -1.53 6 1.03 0.18 2.38 2.49
77 2.08 25 0.98 0.07 1.00 0.19 175 -0.93 9 1.09 0.57 1.10 0.43
78 -1.12 8 0.79 -1.08 0.67 -1.04 176 -1.12 8 0.90 -0.43 0.92 -0.15 79 -0.93 9 1.01 0.14 0.99 0.07 177 -0.58 11 1.16 1.10 1.20 0.89 80 1.1 21 1.30 1.33 1.23 0.81 178 1.1 21 1.08 0.42 1.04 0.22
81 -1.78 5 1.14 0.53 1.57 1.14 179 -1.53 6 1.24 0.95 1.21 0.60
82 0.9 20 0.97 -0.10 0.92 -0.21 180 0.9 20 0.94 -0.27 0.88 -0.39
83 -1.32 7 1.00 0.08 1.14 0.49 181 2.08 25 0.71 -0.73 0.47 -1.01
84 0.55 18 0.88 -0.74 0.82 -0.88 182 -0.58 11 0.86 -0.95 0.80 -0.86 85 2.92 27 1.11 0.38 0.80 0.06 183 -1.12 8 0.88 -0.58 0.80 -0.55 86 -1.78 5 1.22 0.77 1.40 0.88 184 0.72 19 0.95 -0.23 0.87 -0.52
87 -1.32 7 1.06 0.35 1.09 0.35 185 -0.75 10 1.05 0.34 1.03 0.21
88 0.9 20 1.00 0.06 1.02 0.17 186 -0.1 14 1.05 0.45 0.99 0.01
89 1.53 23 0.87 -0.40 0.77 -0.50 187 -1.12 8 1.04 0.27 1.04 0.23
90 0.22 16 0.81 -1.48 0.75 -1.51 188 -0.75 10 0.95 -0.27 0.85 -0.52 91 -0.26 13 0.79 -1.68 0.74 -1.47 189 -0.26 13 1.05 0.41 1.14 0.79 92 1.78 24 0.94 -0.07 0.74 -0.44 190 -1.12 8 1.06 0.36 1.04 0.23
93 1.3 22 0.93 -0.22 0.88 -0.25 191 -1.12 8 1.29 1.40 1.36 1.11
94 -0.58 11 1.12 0.86 1.12 0.59 192 0.38 17 0.79 -1.52 0.74 -1.45
95 0.22 16 1.18 1.27 1.25 1.36 193 0.22 16 0.85 -1.11 0.81 -1.10
96 -0.93 9 0.93 -0.36 0.87 -0.38 194 1.53 23 0.74 -0.96 0.60 -1.03 97 -0.75 10 0.79 -1.39 0.73 -1.08 195 1.1 21 0.73 -1.29 0.62 -1.39 98 0.72 19 1.29 1.57 1.26 1.12
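The ABILITY MEASURE column is the Rasch person measure in logits, and the INFIT and OUTFIT mean-squares are the usual residual-based person-fit statistics. As a reading aid, a minimal sketch of how the two mean-squares are obtained for one person from dichotomous responses and item difficulties; the difficulties below are illustrative placeholders, not the study's calibrated logits, and the ZSTD columns further standardise the mean-squares via the Wilson-Hilferty transformation, omitted here:

    import math

    def rasch_p(theta, b):
        """Rasch model probability of a correct response."""
        return 1.0 / (1.0 + math.exp(-(theta - b)))

    def person_fit(theta, difficulties, responses):
        """Return (infit MNSQ, outfit MNSQ) for one person.

        Outfit is the plain mean of squared standardised residuals;
        infit weights each squared residual by its binomial variance.
        """
        sq_std_res, variances, sq_res = [], [], []
        for b, x in zip(difficulties, responses):
            p = rasch_p(theta, b)
            var = p * (1.0 - p)
            sq_res.append((x - p) ** 2)          # squared raw residual
            sq_std_res.append((x - p) ** 2 / var)  # squared standardised residual
            variances.append(var)
        outfit = sum(sq_std_res) / len(sq_std_res)
        infit = sum(sq_res) / sum(variances)
        return infit, outfit

    # Illustrative check with assumed difficulties (not the study's values):
    print(person_fit(theta=0.0,
                     difficulties=[-2.0, -1.0, 0.0, 1.0, 2.0],
                     responses=[1, 1, 1, 0, 0]))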
Appendix 5 X, Y, and Z high school students' item responses: Guttman data
Columns (three panels): Person Entry and responses for X High school students, Y High school students, and Z High school students.
001 002
004
008
010
011
013
017
018
020
021
024
025
027
028
029
030
032
033
035
038
039
040
041
042
043
045
046
047
048
049
050
052
053
055
056
057
058
059
061
063
064
065
066
067
069
071
072
073
074
075 078
0101010111111000101100011100 1111110111111110111110111111
0101010111111010101100001100
0101010111101010100100011100
1101110111110100111100111111
0101010111111010101100011100
0001010111000010101100001100
0001010011010000100100000100
0001010011110110101100001100
0101010111111000101100001100
0001010011111000101100001100
0101010011001000101100000100
0101010011111100101100111101
1001010111110010101100101101
0101010011100010101100011100
0001010011100000100100001100
0001010111010010101100000100
0001000011000000000100000100
1101010111110100100100000100
1101010011101010101100001100
0101010111111000101100001100
0001010011101000100100000100
0001010111100000101100000100
0001010011100000100100001100
0001010011111000101100001100
0001110111111010101110111101
1101110111111110111110111111
0001010011100110101100001100
0001010011100000100100001100
0001010011100000101100000100
0101010111100100100100000100
0001010011100000100100001100
0101010011110000101100011100
0101010111111000101100001100
0001010111110000101100001100
0101010111111110101100001100
0101010011000000100100000100
1101110111111110101110101111
0001010011111000101100001100
1001010011111100101100001100
0001010111110000101100011100
0001000111000000001100000100
0001010111100100100100001100
0001010011111110101100001100
0001010011100000100100001100
0101010111000000000100000100
1101010111110110101100111101
0001010011111000100100001100
0001010111111110101110111101
0101010111111100101100011100
1001010011111110101100001100 0101010111111010101100001100
005 1101000001000110000000101000 007 1101100011100110010110101000
011 1101100011100110010110101000
013 1111100011100110010110001000
015 1101000011000010010000001000
019 1101000011100010010010001000
020 1101100011100110010010101000
023 1111100011110110010010001000
024 1101000011100010010000001000
027 1101100011100110010110101000
028 1101100011100110010010101000
034 1001000001000000000100001000
036 1101100011100110010110101000
040 1101000011000010010000001000
042 1011001001000010000000101000
043 1101100011100110010110101000
047 1111100111000010010010001001
050 1111000001000010010000101000
051 1001000001010000000100000000
054 1101000011100110010010001000
055 1111000011110010010010001000
059 1111110111110110011110111000
060 1111100011100010010110101001
061 1101000011000110010100001000
065 1111000011100110010000001000
067 1101001001100110010010001000
068 1111100011100110010010101000
069 1101100011100110010010101000
071 1101100011100110010110101000
074 1101100011100110010110101000
075 1111000011000010010100001000
076 1101100011100110010010101000
079 1111100011100110010010101000
081 1111111111111111011110111001
082 1111100111110110011110001000
083 1101100011100110010010101000
085 1000000001000000000100000000
086 1111100011100110010010101000
087 1111101011110110010010101000
088 1111000011100010010010001000
090 1101100011100110010110101000
091 1101000111010110010010101000
092 1001000001000000000100001000
094 1111101011100110010010011000
095 1101100011100010010110101000
096 1111100011100010010000101000
099 1101000011100010010110101000
100 1111110011110110011110001001
102 1101000011110010010000101000
104 1111111111110110011110001000
105 1111100011100110010110001000 106 1101000011100110010010101000
001 10011101111011111110110101111 002 00011101110001101100010010101
003 00011101110001100100010011101
004 00011100110001100100010010100
005 00011101110001100100010110101
006 00011101111011110110110111111
007 00011101110001100000110000000
008 00011101110001100110010011101
009 00011101111011101100110111111
010 00011101111011111100010111100
011 00011101111001111110010111111
012 00011101110001100100110011111
013 00011101110001100100010000100
014 00011100110001100100010010100
015 00011101110001100100110011101
016 00011101110011101100010011110
017 00010000000000000100000000000
018 01011111111011111100110111101
019 00011100100001100100010000101
020 00011101110001100100010111111
021 00011100100001100100010000000
022 00011101110001100100010011111
023 00011101111011111110110111111
024 01011111111011111110110111111
025 00011100100001100100110000110
026 00011101110001100100010001101
027 00011101110001100100010011100
028 00011100110011101100110111101
029 00011101110001100100110001101
030 00011101111011100100110101110
031 00011101110001100100010111110
032 00011101100011100000110010001
033 00011101110001100100010001111
034 00011101110001100100010001111
035 00011101111011111110110001111
036 00011101111011101100010011111
037 11011101111011111110110101111
038 00011101110011100100110101110
039 00011100110000101100010000010
040 00011101110001100100010110100
041 00011101110001100100010001111
042 00011100100001100000010010101
043 00011101110001101000010001110
044 00011101111001111110110001111
045 00011101110001100100010011111
046 00011101110001100100010001111
047 00011101110001100100110000100
048 00011101110011100000110000010
049 00011100100001100100010000101
050 00011111111011110100110101110
051 00011100100001101100010111111 052 00011100100001000000010010000
082 084
085
086
087
088
089
090
091
092
093
094
095
096
097
099
101
102
103
105
106
108
109
111
112
113
114
115
116
117
118
119
122
123
124
126
127
128
129
130
132
133
134
135
136
137
140
141
142
145
147
148
149
151
152 153
1101010111110000101100011100 0101010111111010101110111100
0001010111111000100100001100
1101010111111100101100111100
1101010111111000101100111100
1001010111111010101100101100
1101110111111110111110101101
0001010011100000101100000100
0001010011101000101100001100
0001110111111110101110111110
0001000001000000000100000100
0001010011010010100100001100
0001010011001000101100000100
0101010111111010101100011100
0101010111101110101110011110
1101110011101110101110001101
0101010111111000101100001100
0001010011000000000100000100
0001010011001000101100000100
0001010111111100101100001100
1101010111110100101100011101
0001000011000000000100001100
0001010111100100101100001100
0001010011101110101100011100
0101010111101000101100001101
0101010011010100000100000100
0001010011110000101100000100
0001010011100100000100000100
0001010011100000100100001100
0001010011001000100100000100
0001010011100000100100001100
1101010111111110101100001101
1101110111111110101100011100
1101010111110100101100011100
1101010011101010100100011101
1101010011111010101110001101
1101010011101010101100011100
1101110111101010101100001100
0001010011110000100100001100
0001010011100100101100000100
1101010111111100101100111101
1101010111111110101100011101
1001010111110110101100001100
0001010111111100101100011100
1101110111110110101110011100
0101010111111110101100011100
0111110111111110111100111101
0101010011111010101100011100
0001010011111010101100001100
1001010111111110101100011101
0101110111111100101100111101
1011110111111011111110111111
0001010111111110101100011101
0001010111111100100100000100
1101010111111100101100111101 0001010111111000101100001100
108 1101100001100010000000101000 109 1101100011100110010110101000
110 1101100011100010010010001000
112 1101100011000110010000101000
113 1101000011100010010100001000
114 1111101011100110010010101000
115 1101001111110110010010101000
118 1101100011100110010110101000
119 1111101111110110010110101000
121 1001000001000110010100101001
122 1001000001000010000010001000
123 1111111111110110011110101000
125 1111101011110110010110101000
126 1111000011100010010010001000
127 1101000011010010010000101000
129 1111000011010010010110001000
131 1111111111111111011110111010
132 1101000011000010010000101000
133 1101000011100010010010101000
135 1111100001000010000010001000
136 1111000011100110010010001000
137 1101000001000110010000001000
139 1101001111110110010010001000
141 1101100011100010010010001000
143 1011000001000010000000001000
144 1111111111111111111011111011
145 1101100011100010010100101001
146 1101000011100110010010101000
148 1101100011100110010010101000
149 1101100011100110010110101000
150 1011000001000000000000001000
153 1101100111110110010110101000
156 1101100011100110010000101000
159 1101100011110010010000001000
160 1101000111100110010110001000
162 1101100011100110010110101000
163 1101001111110110011010101000
164 1101100011100110010110001000
168 1101100011100110010110101000
170 1111101111110110010110101000
171 1111000011100110010010001000
172 1101100011100110010110101000
173 1111101111110110010110101000
175 1111100011100110010110101000
177 1001000001000000000100001000
179 1111000011110110010010001000
180 1101100011100110010110101000
181 1101000011000010010000001000
184 1111101111100110010110101000
185 1101100011100110010110101000
187 1101100011100110010110001000
189 1101000011100110010010001000
190 1101101111110110010110101000
191 1101000001000010010010101000
193 1001000001010000000000000000 194 1101100011110110010110101000
053 00011101110001100100010011110 054 00011111111011110110110101111
055 00011101110001101100110111111
056 00011101110001101100110001100
057 00011101110001100100010000100
058 00011101110001100100110111111
059 00011101110001100100110000110
060 00011101110001101100010010111
061 00011100100001101100010000101
062 00011101110001100100010011111
063 00011101110001100000010101100
064 00011101110001100100010001111
065 00011101110011101100110111111
066 00011101110001100100010001100
067 00011101110001100100110101101
068 00011101110001100100110101111
069 00011101110001100100010001110
070 00011100110001100100010011101
071 00011100110001100000010000110
072 00011111111011101110110111111
073 00011101111011111100110111111
074 00011100110001101100010110100
075 00011100110001100100010000100
076 00011101110001100100110001111
077 00011101111011110110110111111
078 00011100110001101000010000100
079 00011101110001100100010011100
080 00011101110001100100110101101
081 00011101110001100100010001111
082 00011101110001100100110011110
083 00011101110001100100010001100
084 00011101111001100100110100111
085 00011111111011111100110111111
086 00011101110001100100010011111
087 00011101110001100100010001111
088 00011101110001100100110000110
089 00011101111011111100110111111
090 00011100111001101100110001100
091 00011101101001100110110000001
092 00011101111001100110110111111
093 00011101111001100110110111111
094 00011101110001100100010011100
095 00011101110001100100010001111
096 00011100100001100000010110000
097 00011101100000100000010011000
098 00011101110001100100010011111
099 00011101110011101100110001111
100 00011101111011101110010101101
101 00011101110001100100010111111
102 00011101110001101100010001111
103 11011111111111111111111111111
104 00011101110001100100010001100
105 00011100110001100100010000010
106 00011100110001101100010100110
107 00011100110011111100010001111 108 00011100100001100100010001110
154 155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208 213
1101010111111110111100111111 0001010111100000101100000100
0101110111111110101110111111
1101110111111110101110101100
0101010111110010101100011100
1111110111111110111110111101
0101010111100100101100001100
1111110111111110111110111101
1111110111111110111110111111
0101010111111000101100001100
0101110111111110101110101101
1101010111111110101100101101
1101110111111110111110111111
1101010111111110101100111101
1101010111111110101100101101
0101010011000000001100000100
1101110111110010101100111100
1101010111111100101110111101
1111010111111011111110111111
1101110111111010101100101101
1101010111100100100100001100
0001010111111000101100011101
0001010011000000100100000100
1101010111111110101100011101
0001010111101010101100001100
0001010111101000101100011101
0101010111101010101100111100
0001010011100000101100001100
1001000011010000001100001100
0101110111111110111100111101
1101010111111110101110111100
0101010011100000000100000100
0101110111111110101110111101
0001010111100000100100010100
1101010111111110111100111111
1111111111111111111110111111
0101110111111110101100011101
1001010111111100101100111101
0101010111111010101110101100
0101010111111010101110111101
0101110111111100101100111101
1111110111111110111110101111
0101110111111000101110111110
0101110111111110111110111100
0101110111110100101110011111
1001010111111110101100001100
1101110111111110101110111101
1101010111111010101100011100
0101010011100000101100001100
1101010111111110101100001101
0001010011110000101100000100
1111110111111111111110111111
0001010111000000000100000100
0101010011111000101100111101
1111110111111111111110111111 0001010111111000101100001100
195 1001000011000010000010001000 197 1101111111110110010010001001
198 1101100001000000000010000000
199 1111000011110110010010001000
201 1101100011000010010000101000
206 1101100011100110010010101000
207 1101000111110110010110001000
208 1111111111110111010110111001
210 1101000001100110010110001000
212 1101000011100110010010001000
215 1111111111111110011110001001
216 1111100011110110011100101000
217 1111110111110110010010111001
218 1111000011110010011110101000
220 1101100011100110010110101000
221 1111100011110110010110101000
222 1101100011100110010110101000
224 1111100011110110010110001000
228 1111111111110110011010101001
229 1111101111100110011010001000
230 1111100011100110010110101000
231 1101100011100110010010101000
233 1101000011100010010010001000
237 1101100011100110010110101000
238 1101100011100110010110101000
239 1101101111100110010110001000
242 1101101111110110010110001000
245 1101100011100110010110101000
248 1111101111110110011110111001
251 1101100011100110010110101000
252 1101100011100110010010001000
255 1101000001000010010110001000
256 1101001001100110010000001000
259 1111101011110110011010111001
260 1111100011100010010000001000
261 1101000011100110010010001000
262 1101011111110110011110101000
264 1101100011100110010010001000
265 1101100011000010010000001000
266 1111000011100010010110101000
268 1111000011110110010100001001
269 1111111011110010011110001000
272 1101100011100110010110101000
273 1111100111100110011110101000
275 1111111011110110010110111001
278 1101101111110110011110101000
279 1111111111111111011110111000
280 1101101011100010010010101000
282 1111111111111110011110011001
284 1111111111110110011010001000
286 1111110111110010010110111000
287 1101110111110111011010111001
288 1101100011100010010010001000
289 1111000011100010010010001000
291 1111101111110110010110101000 292 1101001001000010000000001000
109 00011101110001100100010011111 110 00011101110001100100010101101
111 00011101110001100100110001100
112 00011101110011100100010101111
113 00011100100000100100010000010
114 00011101110001100100010011100
115 00011101110001100100010001111
116 00011101110001100100010111111
117 11011111111011111111111111111
118 00011101110001100100110001111
119 00011100100001100100010010110
120 00011101110001100100010001100
121 00011100110000100000010000001
122 11011101111011110110110111111
123 00011101110001100100010011101
124 00011101100001100010110010111
125 00011100100001100100010000101
126 00011101111001100110110011111
127 00011101110001100100010101100
128 00011101110011101000010000010
129 00011101100001100000010001100
130 00011101110001100000010001000
131 00011101110001101100110111110
132 00011101111001100100110011100
133 00011101110001101100110010101
134 00011101110011101100010110110
135 00011101110001100100110001111
136 00011100110001100100010001101
137 00011101110011100010010101101
138 00011101110001100100010011111
139 00011100100001100000110110100
140 00011101110001100100010010100
141 00011100110001100000010100011
142 00011101111001101110110111111
143 00011101111011101110110001111
144 00011101111001101110110100111
145 00011101110001100100010111111
146 00011101110001100100010001111
147 00011101100001101100010101000
148 00011101110001100100010001111
149 00011101110001111110110001111
150 00011101111011101110110111111
151 00011111111001101100110111111
152 00011101110001100100010011111
153 00011111111011111110010101111
154 00011101110001100100010001100
155 00011101110011100110010111111
156 00011100110001100100110010100
157 00011100110001100100010010101
158 00011100110011101100010011100
159 00011101111011101110110001101
160 00011101110001100100010001111
161 00011101100001100000010010001
162 00011101111011100100010111111
163 00011101110011100100110011111 164 00011101110001100100010111100
214 215
216
217
219
221
222
226
227
228
229
230
231
232
233
234
236
237
239
240
242
247
248
249
251
254
255
256
257
258
259
260
262
0101010111110000100100000100 1101010111111100101100011101
0101010111111110101100001100
1101110111111110111110011101
0001010111110010101100001101
0101110111111110101110111101
1101010111110000101100001101
1101110111111110111100111111
1001010111111110101100111101
1101010111111110101100011100
0101010011110100100100011100
0101010011110000100100001100
1001010011011100000100001100
0101010111111110101110101101
1101110111111110101100101101
0101110111111110101110111101
0001010011110000100100000100
0001010111101010100100000100
0011110111111110111110111111
1101010011110010101100011100
0101010111111010101100001100
1001010011111110101100111100
1111110111111110111110111111
0001010111110110101100001100
1101110111111110101110111101
1101010111111110101110111101
1111110111111111111110111111
0101110011111110101100101101
0001010111111100101100011100
0101010111111000100100011100
0101010111110110101100001101
1101010111111110111110111110
1111110111111110111110111111
293 1111001111110110010110101000 295 1101111111110110010100001001
296 1101100011100110010110101000
297 1101000001000010010010101000
299 1101101011110110010010111000
300 1111000001100110000000001000
303 1111111111111111011110111000
305 1111101111110110010110001000
306 1101111111111111011110111001
307 1101100111100010010110101001
309 1101001011110010010010001000
310 1111101111110110010110101000
311 1111100111110110011110101001
314 1101100011100110010110101000
315 1101100001100010010000001000
317 1101000001110010010000001000
318 1111111111111111011010111000
319 1111111111111111011110111000
320 1111000011000110010000001000
323 1101101011100010010010101000
327 0000000001000000000000001000
328 1101100011100110010110101000
331 1111111111111110111110111010
334 1111111111111111011110111001
336 1101100011100110010110101000
340 1101101011110110010110001000
344 1111101111110110011110101000
346 1101000001010010010000001000
347 1001000001000000000100001000
350 1111100111100110011100001000
351 1111000011100010010010001000
354 1101000101010110010100001000
355 1101100011100010010010101000
358 1111000111100110010010101000
359 1111100111110110010010001000 361 1111111111111111011110111001
165 00011101100001100100010100101 166 00011101111011101100110010101
167 00011101100011100100010010011
168 00011101110001100100010001100
169 00011101110001100100010001111
170 00011101111011100110010101110
171 00011101110001100100110001100
172 00011101110001100100110001111
173 00011101110001100100010000101
174 00011101100000100000010000000
175 00011101110001100100010001101
176 00011100100001000100010000000
177 00011101110001100100010001111
178 00011101111011100110110011111
179 00011101110001100100010001111
180 00011111111011111110110000111
181 11011111110011111110111111111
182 00011100110001100100110011100
183 00011101100000100000010011000
184 00011101110001100100110001101
185 00011100100001100000010001101
186 00011101110001100100110011101
187 00011100100001100000110010100
188 00011101110001101100110101110
189 00011100110001100100010001111
190 00011101110001100100010010110
191 00011101110001100100010001111
192 00011101110001100100110011111
193 00011101101001100100010100111
194 11011101111011111110110111110
195 11011111110011101100110111101
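Each string above is one student's dichotomously scored response vector in item order; given calibrated item difficulties, the person measures in Appendix 4 follow by maximum likelihood. A minimal Newton-Raphson sketch for one 28-item string from the X school panel (the evenly spaced difficulties are placeholders, not the study's calibrations, and zero or perfect raw scores are excluded because their likelihood has no finite maximum):

    import math

    def mle_ability(responses, difficulties, iters=20):
        """Newton-Raphson MLE of Rasch ability for a 0/1 response vector."""
        score = sum(responses)
        if score == 0 or score == len(responses):
            raise ValueError("extreme score: the MLE is infinite")
        theta = 0.0
        for _ in range(iters):
            ps = [1.0 / (1.0 + math.exp(-(theta - b))) for b in difficulties]
            gradient = score - sum(ps)                   # d log-likelihood / d theta
            information = sum(p * (1.0 - p) for p in ps)  # test information
            theta += gradient / information
        return theta

    resp = [int(c) for c in "0101010111111000101100011100"]  # one X school string above
    b = [-2.7 + 0.2 * i for i in range(len(resp))]           # 28 assumed difficulty logits
    print(round(mle_ability(resp, b), 2))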