
ITEM RESPONSE THEORY Maryam Bolouri

Different Measurement Theories

Classical Test Theory (CTT) or Classical True Score (CTS)

Generalizability Theory (G-Theory)

Item Response Theory (IRT)

Problems with CTT

True score and error score are theoretical, unobservable constructs

Sample dependence (test & testee)

Unified error variance

No account of interaction of error variances

Single SEM across ability levels

Generalizability Theory (An Extension of CTT)

G-Theory advantages: Sources and interaction of variances accounted for

G-Theory problems: Sample dependent and single SEM

IRT or Latent Trait Theory

Item response theory (IRT) is an approach used to estimate how much of a latent trait an individual possesses. The theory aims to link individuals’ observed performances to a location on an underlying continuum of the unobservable trait. Because the trait is unobservable, IRT is also referred to as latent trait theory.

IRT can be used to link observable performances to various types of underlying traits.

Latent variables, constructs, or underlying traits, for example:

second language listening ability

English reading ability

test anxiety

Four Advantages of IRT:

1. As long as ability estimates are drawn from the population of interest, they are group independent. This means that ability estimates are not dependent on the particular group of test takers that complete the assessment.

2. IRT can aid in designing instruments that target specific ability levels based on the test information function (TIF). Using IRT item difficulty parameters makes it possible to design items with difficulty levels near the desired cut-score, which would increase the accuracy of decisions at this crucial ability location.

Advantages of IRT:

3. IRT provides information about various aspects of the assessment process, including items, raters, and test takers, which can be useful for test development. For instance, raters who have inconsistent rating patterns or are too lenient can be identified. These raters can then be provided with specific feedback on how to improve their rating behavior.

4. test takers do not need to take the same items to be meaningfully compared on the construct of interest (fairness)

Lack of widespread use is likely due to practical and technical disadvantages of IRT when compared to CTT.

1. The necessary assumptions underlying IRT may not hold with many language assessment data sets.

2. Lack of agreement on an appropriate algorithm to represent IRT-based test scores (to users) leads to distrust of IRT techniques.

3. understanding of the somewhat technical math which underlies IRT models is intimidating to many.


4. The relatively large sample sizes required for parameter estimation are not available for many assessment projects.

5. although IRT software packages continue to become more user friendly, most have steep learning curves which can discourage fledgling test developers and researchers.

History:

The history of IRT has been traced back to “ancient Babylon, to the Greek philosophers, to the adventurers of the Renaissance.”

Current IRT practices can be traced back to two separate lines of development:

1) Thurstone’s (1925) “A method of scaling psychological and educational tests” provided “intimations” of IRT for one line of development.

Frederic Lord (1952) provided the foundations of IRT as a measurement theory by outlining assumptions and providing detailed models.

History: Lord and Novick’s (1968) monumental

textbook, Statistical theories of mental test scores, outlined the principles of IRT

2) Georg Rasch (1960), a Danish mathematician, focused on the use of probability to separate test taker ability and item difficulty.

Wright and his graduate students are credited with many of the developments of the family of Rasch models.

The two development lines:

They have led to quite similar practices, with one major difference:

Rasch models are prescriptive. If data do not fit the model, the data must be edited or discarded.

The other approach (derived from Lord’s work) promotes a descriptive philosophy. Under this view, a model is built that best describes the characteristics of the data. If the model does not fit the data, the model is adapted until it can account for the data.

History:

The first IRT article in the journal Language Testing was by Grant Henning (1984), “Advantages of latent trait measurement in language testing.”

About a decade after IRT appeared in the journal Language Testing, an influential book on the subject was written by Tim McNamara (1996), Measuring Second Language Performance.

The book is an introduction to the many-facet Rasch model and the FACETS software used for estimating ability on performance-based assessments.

Studies which used MFRM began to appear in the language testing literature soon after McNamara’s publication.

Assumptions underlying IRT models

1. Local independence: This means that each item should be assessed independently of all other items.

The assumption of local independence could be violated on a reading test when the question or answer options for one item provide information that may be helpful for correctly answering another item about the same passage.

2. Unidimensionality: In a unidimensional data set, a single ability can account for the differences in scores. For example, a second language listening test would need to be constructed so that only listening ability underlies test takers’ responses to the test items. A violation of this assumption would be the inclusion of an item that measured both the targeted ability of listening and a reading ability not required for listening comprehension.

3. Certainty of response: Test takers make an effort to demonstrate the level of ability that they possess when they complete the assessment (Osterlind, 2010). Test takers must try to answer all questions correctly because the probability of a correct response in IRT is directly related to their ability. This assumption is often violated when researchers recruit test takers for a study, and there is little or no incentive for the test takers to offer their best effort.

It is important to bear in mind that almost all data will violate one or more of the IRT assumptions to some extent. It is the degree to which such violations occur that determines how meaningful the resulting analysis is (de Ayala, 2009).

How to assess assumptions:

Sample size: In general, smaller samples provide less accurate parameter estimates, and models with more parameters require larger samples for accurate estimates. A minimum of about 100 cases is required for most testing contexts when the simplest model, the 1PL Rasch model, is used (McNamara, 1996). As a general rule, de Ayala (2009) recommends that the starting point for determining sample size should be a few hundred.

IRT Parameters

1. Item Parameters

A parameter is used in IRT to indicate a characteristic of a test’s stimuli.

a) Item Characteristic Curve (ICC): Difficulty (b), Discrimination (a), Guessing Factor (c)

b) Item Information Function (IIF)

2. Test Parameter

a) Test Information Function (TIF)

3. Ability Parameter (θ)

A test taker with an ability of 0 logits would have a 50% chance of correctly answering an item with a difficulty level of 0 logits.
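The 50% figure follows from the logistic form used by the Rasch (1PL) model described later in these slides. The sketch below is illustrative only: the function name `rasch_probability` is an assumption, and the model is taken to be the standard Rasch form with ability and difficulty both expressed in logits.

```python
import math

def rasch_probability(theta: float, b: float) -> float:
    """Probability of a correct response under the Rasch (1PL) model.

    theta: test taker ability in logits
    b:     item difficulty in logits
    """
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# Ability equal to difficulty gives an even chance of success.
print(rasch_probability(0.0, 0.0))  # 0.5

# One logit above the item difficulty raises the chance to about 0.73.
print(round(rasch_probability(1.0, 0.0), 2))  # 0.73
```

Only the difference between ability and difficulty matters, so a test taker at 2 logits facing an item at 2 logits also has a 0.5 probability.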

ICC

The probability of a test taker correctly responding to an item is presented on the vertical axis. This scale ranges from zero probability at the bottom to a probability of 1 (certainty) at the top.

The horizontal axis displays the estimated ability level of test takers in relation to item difficulties, with the lowest at the far left and the highest at the far right. The measurement unit of the scale is a logit, and it is set to have a center point of 0.

ICCs express the relationship between the probability of a test taker correctly answering each item and a test taker’s ability. As a test taker’s ability level increases, moving from left to right along the horizontal axis, the probability of correctly answering each item increases, moving from the bottom to the top of the vertical axis.

The ICCs are somewhat S-shaped, meaning the probability of a correct response changes considerably over a small ability level range.

Test takers with abilities ranging from -3 to -1 have less than a 0.2 probability of answering the item correctly.

For test takers with ability levels in the middle of the scale, between roughly -1 and +1, the probability of correctly responding to that item changes from quite low, about 0.1, to quite high, about 0.9.

All ICCs have the same level of discrimination

Different location index (difficulty)

Left ICC: easy item

Right ICC: hard item

When an item’s difficulty matches a test taker’s ability, the test taker responds correctly roughly half of the time and incorrectly the other half. So these test takers have about a 0.5 probability of answering these items successfully. By capitalizing on these probabilities, the test taker’s ability can be defined by the items that are at this level of difficulty for the test taker.

Figure 3: All curves have the same level of difficulty but different levels of discrimination

Upper curve: highest discrimination; a short distance to the left or right gives a much different probability, with dramatic change (steep)

Middle curve: moderate level of discrimination

Lower curve: very small slope; the probability changes only slightly as a result of movement to the left or right of the 0.5 point
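The effect of discrimination on steepness can be sketched numerically. The example below assumes the two-parameter logistic form of the ICC and uses three made-up discrimination values (0.5, 1.0, 2.0) with a common difficulty of 0; these numbers are illustrative and are not read off the figure.

```python
import math

def icc_2pl(theta: float, a: float, b: float) -> float:
    """Two-parameter logistic ICC: a = discrimination, b = difficulty."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

b = 0.0  # same difficulty for all three hypothetical items
changes = {}
for a in (0.5, 1.0, 2.0):  # low, moderate, high discrimination (assumed values)
    low = icc_2pl(-0.5, a, b)
    high = icc_2pl(0.5, a, b)
    changes[a] = high - low
    print(f"a={a}: P(-0.5)={low:.2f}  P(+0.5)={high:.2f}  change={high - low:.2f}")
```

All three curves pass through 0.5 at the common difficulty, but the same one-logit move across that point changes the probability far more for the item with the larger a.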

Some issues about ICC

When a is less than moderate, the ICC is nearly linear and fairly flat

When a is more than moderate, the ICC is likely to be steep in the middle section

a and b are independent of each other

A horizontal line in the ICC means no discrimination and undefined difficulty

A probability of 0.5 corresponds to b: in easy items it occurs at a low ability level, and in hard items it occurs at a high ability level.


When the item is hard most of the ICC has the probability of correct response less than 0.5

When the item is easy most of the ICC has the probability of correct response that is larger than 0.5

Bear in mind

The figures show an ability range from -3 to +3

The theoretical range of ability is from negative infinity to positive infinity.

All ICCs become asymptotic to a probability of zero at one tail and one at the other tail.

The restricted range is needed only to fit the curves on the computer screen.

Perfect discrimination

The ICC is a vertical line at one point on the ability scale.

It is ideal for distinguishing between examinees with abilities above and below 1.5

It provides no discrimination among examinees whose abilities are all below, or all above, 1.5

Different IRT Models

Model | Item Format | Features

1-Parameter Logistic Model / Rasch Model | Dichotomous | Discrimination power equal across all items; difficulty varies across items

2-Parameter Logistic Model | Dichotomous | Discrimination and difficulty parameters vary across items

3-Parameter Logistic Model | Dichotomous | Also includes pseudo-guessing parameter

ICC models

A model is a mathematical equation in which independent variables are combined to optimally predict dependent variables.

Each of these models has a particular mathematical equation and is used to estimate individuals’ underlying traits on language ability constructs.

The standard mathematical model for the ICC is the cumulative form of the logistic function.

It was first derived in 1844 and has been widely used in the biological sciences to model the growth of plants and animals from birth to maturity.

It was first used to model the ICC in the late 1950s because of its simplicity.

A discrimination value a reported on the normal ogive scale is multiplied by 1.70 to obtain the corresponding logistic value.

$L = a(\theta - b)$, where L is the logit

The discrimination parameter is proportional to the slope of the ICC at θ = b.
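Written out, the logistic ICC described above can be stated as follows. This is a reconstruction following Baker (2001), not an equation recovered from the slides; P(θ) is the probability of a correct response, and a, b and θ are as defined in the parameter list.

```latex
P(\theta) = \frac{1}{1 + e^{-L}}, \qquad L = a(\theta - b)

\left.\frac{dP}{d\theta}\right|_{\theta = b} = a \, P(b)\,\bigl(1 - P(b)\bigr) = \frac{a}{4}
```

The second line shows the sense in which discrimination is proportional to the slope: at θ = b the probability is 0.5, so the slope there equals a/4.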

The most fundamental IRT model is the Rasch or one-parameter logistic (1PL) model.

Relating test taker ability to the difficulty of items makes it possible to mathematically model the probability that a test taker will respond correctly to an item.

1 PL model

It was first published by the Danish mathematician Georg Rasch.

Under this model, the discrimination parameter of the two-parameter logistic model is fixed at a value of a = 1.0 for all items; only the difficulty parameter can take on different values. Because of this, the Rasch model is often referred to as the one-parameter logistic model.

The probability of a correct response includes a small component that is due to guessing.

Neither of the two previous item characteristic curve models took the guessing phenomenon into consideration.

Birnbaum (1968) modified the two-parameter logistic model to include a parameter that represents the contribution of guessing to the probability of correct response.

Unfortunately, in so doing, some of the nice mathematical properties of the logistic function were lost.

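Nevertheless, the resulting model has become known as the three-parameter logistic model, even though it technically is no longer a logistic model. The equation for the three-parameter model is:

$P(\theta) = c + (1 - c)\,\dfrac{1}{1 + e^{-a(\theta - b)}}$

where b is the difficulty parameter, a is the discrimination parameter, c is the guessing parameter, and θ is the ability level.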
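A minimal sketch of the three-parameter curve, assuming the form given in Baker (2001): the guessing parameter c becomes the lower asymptote and the logistic part is rescaled to the remaining range. The function name `icc_3pl` and the item values are hypothetical.

```python
import math

def icc_3pl(theta: float, a: float, b: float, c: float) -> float:
    """Three-parameter ICC: c is the lower asymptote (pseudo-guessing)."""
    logistic = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return c + (1.0 - c) * logistic

# Hypothetical item with a = 1.0, b = 0.0 and a guessing parameter of 0.25.
print(round(icc_3pl(-10.0, 1.0, 0.0, 0.25), 2))  # 0.25, the guessing floor
print(icc_3pl(0.0, 1.0, 0.0, 0.25))              # 0.625 at theta = b, i.e. (1 + c) / 2
```

With c above zero, the probability at θ = b is no longer 0.5 but (1 + c) / 2, which is one of the properties of the logistic function that is lost.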

Range of parameters:

-2.80 < a < +2.80 (usual range seen in practice)

-3 < b < +3 (typical values)

0 < c < 1; values above 0.35 are not considered acceptable

Item parameters do not depend on the ability level of the examinees; that is, they are group invariant. The parameters are properties of the items, not of the group.

1PL, 2PLs, 3PLs

Positive and Negative Discrimination

Positive: the probability of correct response increases as the ability level increases.

Negative: the probability of correct response decreases as the ability level increases from low to high.

Items with negative discrimination occur in two ways:

First, the incorrect response to a two-choice item will always have a negative discrimination parameter if the correct response has a positive value.

Second, when something is wrong with the item: either it is poorly written or there is some misinformation prevalent among the high-ability students.
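The first case can be checked numerically. The sketch below assumes a two-parameter logistic ICC and a hypothetical two-choice item (a = 1.2, b = 0); it shows that the curve for the incorrect response is the same as an ICC whose discrimination is -a.

```python
import math

def icc_2pl(theta: float, a: float, b: float) -> float:
    """Two-parameter logistic ICC: a = discrimination, b = difficulty."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

a, b = 1.2, 0.0  # hypothetical item parameters for the correct response
incorrect_curve = []
for theta in (-2.0, 0.0, 2.0):
    p_correct = icc_2pl(theta, a, b)
    p_incorrect = 1.0 - p_correct
    # The incorrect-response curve coincides with an ICC of discrimination -a.
    assert abs(p_incorrect - icc_2pl(theta, -a, b)) < 1e-12
    incorrect_curve.append(p_incorrect)
    print(f"theta={theta:+.1f}  P(correct)={p_correct:.2f}  P(incorrect)={p_incorrect:.2f}")
```

The probability of choosing the incorrect option falls as ability rises, which is exactly what a negative discrimination parameter describes.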

AN ITEM INFORMATION FUNCTION (IIF) GIVING MAXIMUM INFORMATION FOR AVERAGE ABILITY LEVEL

A TEST INFORMATION FUNCTION (TIF)

ANOTHER TEST INFORMATION FUNCTION (TIF) GIVING MORE INFORMATION FOR LOWER ABILITY LEVELS

TIF

Information about all of the items on a test is often combined and presented in test information function (TIF) plots.

The TIF indicates the item information, summed over all items, at each ability level. The TIF can be used to help test developers locate areas on the ability continuum where there are few items. Items can then be written that target these ability levels.
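How item information combines into a TIF can be sketched as follows. The example assumes the two-parameter logistic model, for which item information is a²·P·(1 − P), takes test information as the sum of the item informations (Baker, 2001; dividing by the number of items gives the average), and uses the standard relation SEM = 1/√information. The five (a, b) pairs are invented for illustration.

```python
import math

def icc_2pl(theta: float, a: float, b: float) -> float:
    """Two-parameter logistic ICC: a = discrimination, b = difficulty."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta: float, a: float, b: float) -> float:
    """Item information under the 2PL model: a^2 * P * (1 - P)."""
    p = icc_2pl(theta, a, b)
    return a * a * p * (1.0 - p)

def tif(theta: float, items) -> float:
    """Test information: sum of the item informations at this ability."""
    return sum(item_information(theta, a, b) for a, b in items)

def sem(theta: float, items) -> float:
    """Standard error of measurement at this ability level."""
    return 1.0 / math.sqrt(tif(theta, items))

# Hypothetical test with most items targeted at lower ability levels.
items = [(1.0, -1.5), (1.2, -1.0), (0.8, -0.5), (1.0, 0.0), (1.1, 2.0)]

for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"theta={theta:+.1f}  TIF={tif(theta, items):.2f}  SEM={sem(theta, items):.2f}")
```

Ability levels where the TIF is low are the areas with few well-targeted items, and the SEM is correspondingly larger there, which is why IRT reports a different SEM at each ability level.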

Steps in running IRT analysis

Data entry

Model selection through scale and fit analyses

Estimating and inspecting:

1. ICC

2. IIF

3. DIF (if needed)

4. TIF

Many-facet Rasch measurement model

The many-facet Rasch measurement (MFRM) model has been used in the language testing field to model and adjust for various assessment characteristics on performance-based tests.

Facets such as:

1. Test taker ability

2. Item difficulty

3. Raters

4. Scales

The scores may be affected by factors like rater severity, the difficulty of the prompt, or the time of day that the test is administered. MFRM can be used to identify such effects and adjust the scores to compensate for them.

The difference between the MFRM and the 1PL Rasch model for items scored as correct or incorrect is that the MFRM includes additional facets:

The severity of the rater: rater severity denotes how strict a rater is in assigning scores to test takers.

The rating step difficulty: rating step difficulty refers to how much of the ability is required to move from one step on a rating scale to another.

For example, on a five-point writing scale with 1 indicating least proficient and 5 most proficient, the level of ability required to move from a rating of 1 to 2, or between any two adjacent points on the scale, would be the rating step difficulty.

A test taker with an ability level of 0 would have virtually no probability of a rating of 1 or 5, a little above a 0.2 probability of a rating of 2, and about a 0.7 probability of a rating of 3.
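How rater severity and rating step difficulty enter the model can be sketched with a rating-scale form of the many-facet Rasch model, in which the log-odds of moving up one category equal ability minus task difficulty minus rater severity minus step difficulty. The function names and all numeric values below are assumptions chosen for illustration; they are not the estimates behind the figure.

```python
import math

def category_probabilities(ability, task_difficulty, rater_severity, step_difficulties):
    """Probability of each rating category under a many-facet rating scale model.

    step_difficulties[k] is the difficulty of moving from category k to k + 1
    (zero-based), so a five-point scale has four steps. All values are in logits.
    """
    base = ability - task_difficulty - rater_severity
    cumulative = [0.0]  # the lowest category is the reference
    for step in step_difficulties:
        cumulative.append(cumulative[-1] + (base - step))
    weights = [math.exp(value) for value in cumulative]
    total = sum(weights)
    return [weight / total for weight in weights]

def expected_rating(probabilities):
    """Expected rating on a scale whose lowest category is scored 1."""
    return sum((k + 1) * p for k, p in enumerate(probabilities))

steps = [-3.0, -1.0, 1.0, 3.0]  # hypothetical step difficulties for a five-point scale

neutral = category_probabilities(0.0, 0.0, 0.0, steps)
lenient = category_probabilities(0.0, 0.0, -1.0, steps)
severe = category_probabilities(0.0, 0.0, 1.0, steps)

print([round(p, 2) for p in neutral])
print(round(expected_rating(lenient), 2), round(expected_rating(severe), 2))
```

For the same test taker and task, the severe rater yields a lower expected rating than the lenient one; estimating severity separately is what allows scores to be adjusted for it.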

CRC

Category response curves (CRCs) are analogous to ICCs. They show the probability of assignment of each rating on the scale, here the five-point scale, at each ability level.

The figure indicates that a score of 2 is the most commonly assigned, since its curve extends the furthest along the horizontal axis.

Ideally, rating categories should be highly peaked and equivalent in size and shape to each other.

Test developers can use the information in the CRCs to revise rating scales.

Use of MFRM:

Investigating task characteristics and their effects on various types of performance-based assessments.

Investigating the effects of rater bias, rater severity, rater training, rater feedback, task difficulty, and rating scale reliability.

IRT Applications

Item banking and calibration

Adaptive Tests (CAT/IBAT)

Differential Item Functioning (DIF) studies

Test equating

CAT

Applications of IRT to computer adaptive testing (CAT) are not commonly reported in the language assessment literature, likely because of the large number of items and test takers required for its feasibility. However, it is used in some large-scale language assessments and is considered one of the most promising applications of IRT.

A computer is programmed to deliver items increasingly closer to the test takers’ ability levels. In its simplest form, if a test taker answers an item correctly, the IRT-based algorithm assigns the test taker a more difficult item, whereas, if the test taker answers an item incorrectly, the next item will be easier. The test is complete when a predetermined level of precision of locating the test taker’s ability level has been achieved.
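The adaptive loop described above can be sketched as follows. This is a simplified illustration under stated assumptions, not an operational CAT: it assumes a Rasch-calibrated item bank, selects the unused item whose difficulty is closest to the current estimate, re-estimates ability by a grid-search maximum likelihood, and stops at a target standard error. The bank, the `answer` callback, the stopping threshold and the deterministic test taker are all hypothetical.

```python
import math

def p_correct(theta: float, b: float) -> float:
    """Rasch probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# Candidate ability values from -4.0 to +4.0 logits in steps of 0.1.
GRID = [g / 10.0 for g in range(-40, 41)]

def estimate_ability(responses):
    """Grid-search maximum likelihood estimate from (difficulty, correct) pairs."""
    def log_likelihood(theta):
        total = 0.0
        for b, correct in responses:
            p = p_correct(theta, b)
            total += math.log(p if correct else 1.0 - p)
        return total
    return max(GRID, key=log_likelihood)

def standard_error(theta, responses):
    """Standard error from the Rasch information of the administered items."""
    info = sum(p_correct(theta, b) * (1.0 - p_correct(theta, b)) for b, _ in responses)
    return 1.0 / math.sqrt(info)

def run_cat(bank, answer, target_se=0.6, max_items=20):
    """Deliver items until the ability estimate is precise enough."""
    remaining = list(bank)
    responses = []
    theta = 0.0  # start at the center of the logit scale
    while remaining and len(responses) < max_items:
        # Deliver the unused item closest in difficulty to the current estimate.
        item = min(remaining, key=lambda b: abs(b - theta))
        remaining.remove(item)
        responses.append((item, answer(item)))
        theta = estimate_ability(responses)
        if standard_error(theta, responses) <= target_se:
            break
    return theta, responses

# Hypothetical bank of 25 item difficulties from -3 to +3 logits, and a
# deterministic test taker who answers correctly up to 1.0 logits.
bank = [x / 4.0 for x in range(-12, 13)]
theta, administered = run_cat(bank, answer=lambda b: b <= 1.0)
print(theta, len(administered))
```

After a correct answer the estimate rises and a harder item is chosen; after an incorrect answer it falls and an easier item follows, so the delivered items close in on the test taker's level. Real test takers respond probabilistically, and operational systems use more refined estimation and item selection than this sketch.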

Differential Item Functioning (DIF)

Differential item functioning is said to occur when the probability of answering an item correctly is not the same for examinees who are on the same ability level but belong to different groups.

Language testers also use IRT techniques to identify and understand possible differences in the way items function for different groups of test takers. Differential item functioning (DIF), which can be an indicator of biased test items, exists if test takers from different groups with equal ability do not have the same chance of answering an item correctly. IRT DIF methods compare ICCs for the same item in the two groups of interest.

DIF is an extremely useful and rigorous method for studying group differences:

Sex differences

Race/ethnic differences

Academic background differences

Socioeconomic status differences

Cross-cultural and cross-national studies

It helps determine whether differences are an artifact of measurement or reflect something different about the construct and population.

Bias & DIF

The logical first step in detecting bias is to find items where one group performs much better than the other group: such items function differently for the two groups and this is known as Differential Item Functioning (DIF).

DIF is a necessary but not sufficient condition for bias: bias only exists if the difference is illegitimate, i.e., if both groups should be performing equally well on the item.

Bias & DIF (Continued)

An item may show DIF but not be biased if the difference is due to actual differences in the groups' ability needed to answer the item, e.g., if one group is high proficiency and the other low proficiency: the low proficiency group would necessarily score much lower.

Only where the difference is caused by construct-irrelevant factors can DIF be viewed as bias. In such cases, the item measures another construct, in addition to the one it is supposed to measure.

Bias is usually a characteristic of a whole test, whereas DIF is a characteristic of an individual item.

An example of an item that displays uniform DIF

The item favors all males regardless of ability.

Only difficulty parameters differ across groups.
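Uniform DIF of this kind can be sketched by comparing the two groups' ICCs for one item. The example assumes a two-parameter logistic ICC calibrated separately in a reference and a focal group, with the same discrimination and different difficulties; the parameter values are hypothetical.

```python
import math

def icc_2pl(theta: float, a: float, b: float) -> float:
    """Two-parameter logistic ICC: a = discrimination, b = difficulty."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical calibration of one item in two groups:
# same discrimination, different difficulty.
a = 1.0
b_reference, b_focal = -0.3, 0.4

gaps = []
for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    gap = icc_2pl(theta, a, b_reference) - icc_2pl(theta, a, b_focal)
    gaps.append(gap)
    print(f"theta={theta:+.1f}  advantage of reference group = {gap:.3f}")

# Uniform DIF: the sign of the gap never changes across the ability range.
print(all(g > 0 for g in gaps))  # True
```

If the discrimination parameters also differed, the two curves could cross and the favored group would change along the ability scale, which is the non-uniform case.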

Comparison of CTT and IRT (Embretson & Reise, 2000)

No. | CTT | IRT

1 | Single SEM across ability levels | Various SEM across ability levels

2 | Longer test more reliable | Shorter test can be equally or even more reliable (TIF)

3 | Score comparisons across parallel forms are optimal | Optimal when test difficulty varies between persons

4 | Unbiased estimates require a representative sample | OK with an unrepresentative sample

5 | Scores are meaningful against a norm | Test scores are meaningful against distance from items

6 | Interval scale properties achieved through normal distribution | Interval scale properties achieved by applying a justifiable measurement model

7 | Mixed item formats lead to imbalance | No problem

8 | Change scores not comparable when initial scores differ | No problem

9 | Factor analysis produces artifacts | Factor analysis produces full-information FA

10 | Item stimulus features are not important compared to psychometric properties | Item stimulus features are directly related to psychometric properties

11 | No graphic displays of item and test parameters | Graphic displays of item and test parameters

* CTT: All in all, better and more practical for class-based, low-stakes tests.

* IRT: Much more advantageous and preferable for high-stakes, large-sample tests.

* IRT: THE ONLY CHOICE FOR ADAPTIVE TESTS.

Future research:

Techniques, such as item bundling (to meet the assumption of local independence)

The development of techniques which require fewer cases for accurate parameter estimation

Guidance on using IRT (written resources specific to the needs of language testers)

Computer-friendly programs so that the use of IRT techniques will become more prevalent in the field

Thank you for your attention.

References:

Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford: Oxford University Press.

Baker, F. B. (2001). The basics of item response theory. ERIC Clearinghouse on Assessment and Evaluation.

Embretson, S. E. & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, New Jersey: Lawrence Erlbaum Associates.

Fulcher, G. & Davidson, F. (2007). Language testing and assessment: An advanced resource book. New York: Routledge.

Fulcher, G. & Davidson, F. (2012). The Routledge Handbook of Language Testing. New York: Routledge.