Transcript of Item Response Theory - University of Chicago Webinar (REL Midwest and ARC)

REL Midwest and ARC

Item Response Theory

Kimberly Maier, Ph.D., Assistant Professor, Measurement and Quantitative Methods, Michigan State University

Andrew Swanlund, Senior Research Associate and Psychometrician, Learning Point Associates

November 2009


Topics Covered

Rationale for IRT/Rasch methods

The 1PL/Rasch Dichotomous Model

The Rasch Rating Scale Model

Psychometric Analysis/Diagnostics

Advanced IRT Models


What’s a Latent Trait?

An unobservable trait (theorized to exist) that cannot be directly measured, although it can usually be easily described.

Examples include the following:
– Mathematics achievement
– Intelligence
– Attitudes
– Opinions

Tests or questionnaires are used to assess the latent trait.


Test Theory

Test theory or psychometrics is used to:
– Develop questionnaires.
  - Reliability
  - Validity
  - Determine dimensionality
– Provide measures (on a scale) for examinees.

Two “flavors” of test theory:
– Classical Test Theory
– Modern Test Theory (IRT)


Classical Test Theory

Observed score = True score + error

Focus on raw test scores

Item difficulty
– How hard is it to get the item “right”?
– Or, how hard is it to agree with a statement?
– Measured as the proportion of respondents who get the item “correct.”


Classical Test Theory

Item discrimination
– How effectively an item differentiates between examinees who are high and those who are low on the latent trait.
– Two types of measures of discrimination:
  - Index of discrimination – cannot perform statistical significance tests to determine whether it is zero.
  - Various correlation coefficients (e.g., point-biserial, biserial, tetrachoric, and phi) – measure the relationship between responses on an item and performance on the entire test.


What Classical Test Theory Can’t Do

A person’s ability level and the survey item difficulties cannot be estimated separately.
– Implications:
  - A person’s measure of the latent trait is dependent on the survey items administered.
  - Items’ means depend on the sample of people who took the survey.
– Therefore, ALL estimates of the model are sample dependent and cannot be compared across samples varying in the distribution of the underlying latent trait.


What Classical Test Theory Can’t Do

Doesn’t provide information about how examinees at different ability levels on the trait have performed on individual items.

Difficult to compare performance of examinees who have taken different tests that measure the same trait.

Difficult to apply results to another group to be tested.


Item Response Theory

Items on a test/instrument measure a single latent trait or several latent traits (multidimensional)

Allows one to compare the performance of one group taking Test A with another group taking Test B

The results of an item analysis can be applied to groups of respondents other than the original group used for the analysis


General Ideas of IRT

The item response model gives us an idea of the probability that a person with latent trait level θ will correctly answer an item of difficulty δ.

The relationship between ability (attitude) and item response is characterized by an item characteristic curve.

Each person is assumed to have a level of ability that situates him or her on the item characteristic curve.


Item Characteristic Curve (ICC)


Important Points About the ICC

Item difficulty
– The level of ability at which 50 percent of the respondents are able to correctly answer the item

Item discrimination
– The slope of the item characteristic curve
– Determines the difference in probabilities that respondents of different ability levels answer the item correctly

Difficulty and discrimination are independent of one another.


Measurement

The value of a latent trait measure θ usually varies from −3 to +3, although the limits are −∞ to +∞ (this can be rescaled to any metric if desired).

The higher one’s ability level, the higher his or her probability of correctly answering the item.


Psychometric Models

Three Main Properties/Assumptions of IRT (including Rasch)*
– Unidimensionality
– Local Independence
– Monotonicity of the Item Response Functions

*These also hold for Factor Analysis and Classical Test Theory, but we discuss them here in the IRT framework.


Psychometric Models

Unidimensionality
– The latent trait (or construct) is represented by a single number (often denoted θ)
– Examples – mathematics ability, self-efficacy
– Some constructs are multidimensional – personality – “the big five” (openness, conscientiousness, extroversion, agreeableness, neuroticism).

Models we talk about today are for unidimensional constructs only.

Can model multidimensional constructs with MIRT (Reckase, 2009)


Psychometric Models

Local Independence
– Responses to items are independent from one another after taking into account examinee/respondent ability
– Items shouldn’t cue one another – that is, knowing the answer to one item shouldn’t give away the answer to another


Psychometric Models

Monotonicity of the Item Characteristic Curve
– Higher ability examinees should have a higher probability of a successful/favorable response than lower ability examinees


The 1PL/Rasch Dichotomous Model

What kind of data can we use with the dichotomous model?
– Multiple-choice or correct/incorrect test data
– Checklist data
– Yes/no survey responses


The 1PL/Rasch Dichotomous Model

The 1-Parameter Logistic (or Rasch) Model:

$$P(X_{ij}=1 \mid \theta_i, \delta_j) = \frac{\exp(\theta_i - \delta_j)}{1 + \exp(\theta_i - \delta_j)}$$

Note: The probability of a correct response depends only on the ability of the person (θ_i) and the difficulty of the item (δ_j).
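As a quick illustration, the 1PL response function can be evaluated directly. This is a minimal sketch; the function name `p_correct_1pl` is ours, not from the webinar.

```python
import math

def p_correct_1pl(theta, delta):
    """Rasch/1PL probability of a correct response for a person with
    ability theta on an item with difficulty delta (both in logits)."""
    return math.exp(theta - delta) / (1.0 + math.exp(theta - delta))

# When ability equals difficulty, the probability is exactly 0.5.
print(round(p_correct_1pl(1.0, 1.0), 2))   # 0.5
# One logit above the item's difficulty gives about 0.73.
print(round(p_correct_1pl(2.0, 1.0), 2))   # 0.73
```

Note that the probability depends only on the difference θ − δ, which is what makes the logit scale interpretable.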


1PL/Rasch Theoretical ICCs

[Figure: theoretical 1PL/Rasch ICCs for items with difficulty −1, 0, and +1, plotted over ability −3 to +3; the y-axis is the probability of a correct response (0 to 1).]


The 1PL/Rasch Dichotomous Model

Measures of ability and item parameters are reported in logits:

– The mathematical unit of ability is defined as the log odds for succeeding on items of the kind chosen to define the “zero” point on the scale.

– The mathematical unit of an item’s difficulty is defined as the log odds for eliciting failure from persons with “zero” ability.

$$\mathrm{logit}(p_i) = \ln\frac{p_i}{1-p_i}$$
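The logit (log odds) transformation is easy to compute. In the sketch below (the function name is ours), answering about 73 percent of items of a given difficulty correctly places a person roughly one logit above those items.

```python
import math

def logit(p):
    """Log odds of a proportion p (0 < p < 1), expressed in logits."""
    return math.log(p / (1.0 - p))

print(round(logit(0.5), 2))    # 0.0  (50% success = zero logits)
print(round(logit(0.73), 2))   # 0.99 (about one logit above)
```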


The 1PL/Rasch Dichotomous Model

Probabilities of correctly answering an item with difficulty of 1.0:

Ability   Probability
 −3.00    0.02
 −2.00    0.05
 −1.00    0.12
  0.00    0.27
  1.00    0.50
  2.00    0.73
  3.00    0.88
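The table can be reproduced from the 1PL response function; this is a sketch and `p_correct` is our name for the helper.

```python
import math

def p_correct(theta, delta):
    """1PL/Rasch probability of a correct response."""
    return math.exp(theta - delta) / (1.0 + math.exp(theta - delta))

# Reproduce the slide's table for an item with difficulty 1.0.
for theta in [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]:
    print(f"{theta:5.2f}  {p_correct(theta, 1.0):.2f}")
```

Running this prints the same ability/probability pairs shown above (0.02 at −3.00 up to 0.88 at 3.00).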


2-PL Model

The 2-parameter logistic model:

$$P(X_{ij}=1 \mid \theta_i, a_j, b_j) = \frac{\exp\left(a_j(\theta_i - b_j)\right)}{1 + \exp\left(a_j(\theta_i - b_j)\right)}$$

Note: The probability of a correct response depends on the ability of the person (θ_i), the difficulty of the item (b_j), and the discrimination of the item (a_j).
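A sketch of the 2PL response function (names are ours) shows how the discrimination parameter steepens or flattens the curve around the item's difficulty.

```python
import math

def p_correct_2pl(theta, a, b):
    """2PL: discrimination a scales the distance between ability theta
    and difficulty b, steepening or flattening the ICC."""
    z = a * (theta - b)
    return math.exp(z) / (1.0 + math.exp(z))

# A highly discriminating item (a = 2) separates nearby abilities
# more sharply than a weakly discriminating one (a = 0.5).
print(round(p_correct_2pl(1.0, 2.0, 0.0), 2))   # 0.88
print(round(p_correct_2pl(1.0, 0.5, 0.0), 2))   # 0.62
```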


2-Parameter Logistic ICCs

[Figure: 2-parameter logistic ICCs plotted over ability −3 to +3; the y-axis is the probability of a correct response (0 to 1).]


3-PL Model

The 3-parameter logistic model:

$$P(X_{ij}=1 \mid \theta_i, a_j, b_j, c_j) = c_j + (1 - c_j)\,\frac{\exp\left(a_j(\theta_i - b_j)\right)}{1 + \exp\left(a_j(\theta_i - b_j)\right)}$$

Note: The probability of a correct response depends on the ability of the person (θ_i), the difficulty of the item (b_j), the discrimination (a_j), and the guessing parameter (c_j).
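A sketch of the 3PL response function (names are ours): the guessing parameter sets a lower asymptote, so even very low-ability examinees answer correctly with probability at least c.

```python
import math

def p_correct_3pl(theta, a, b, c):
    """3PL: lower asymptote c models guessing on top of a 2PL curve."""
    z = a * (theta - b)
    return c + (1.0 - c) * math.exp(z) / (1.0 + math.exp(z))

# With c = 0.25 (e.g., a four-option multiple-choice item), the curve
# bottoms out near 0.25 instead of 0.
print(round(p_correct_3pl(-3.0, 1.0, 0.0, 0.25), 2))  # 0.29
```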


Difference Between Rasch and 2/3PL?

2/3PL models will likely fit the data better (more highly parameterized models do that), but they require a lot of data to obtain stable parameter estimates.

Modeling items with high/low discrimination and high guessing parameters can lead to the inclusion of lower quality items (according to some); conversely, Rasch analysis may result in omitting some items.

In 2/3PL, the pattern of responses matters; in Rasch, there is one scale score per raw score.


The Rasch Rating Scale Model

What are some uses of the rating scale model?
– Survey response data with a standard set of responses across many items (SD, D, A, SA)
– Scoring rubrics where the performance levels are defined similarly across all indicators

If the score definitions vary across items, the Partial Credit Model would be required instead.


The Rasch Rating Scale Model

Here’s the math…

$$P(X_{ni}=x \mid \theta_n, \delta_i, \boldsymbol{\tau}) = \frac{\exp\left[\sum_{k=0}^{x}\left(\theta_n - (\delta_i + \tau_k)\right)\right]}{\sum_{j=0}^{m}\exp\left[\sum_{k=0}^{j}\left(\theta_n - (\delta_i + \tau_k)\right)\right]}$$

where θ_n is the person’s ability, δ_i is the item’s location, τ_k is the kth rating scale threshold (shared across all items, with τ_0 ≡ 0), and m is the highest category.
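The category probabilities are straightforward to compute. The sketch below is ours (function name and the three threshold values are illustrative, not from the webinar), for a four-category agreement scale (SD, D, A, SA).

```python
import math

def rsm_probs(theta, delta, taus):
    """Rasch rating scale model: probability of each category 0..m for a
    person with ability theta on an item with location delta. taus are
    the m threshold parameters tau_1..tau_m, shared across items."""
    # Cumulative sums of (theta - delta - tau_k); category 0 gets sum 0.
    numerators = [1.0]  # exp(0) for category 0
    cum = 0.0
    for tau in taus:
        cum += theta - delta - tau
        numerators.append(math.exp(cum))
    total = sum(numerators)
    return [n / total for n in numerators]

# Hypothetical thresholds -2, 0, +2 for a 4-category agreement scale;
# a person located exactly at the item sits between D and A.
probs = rsm_probs(theta=0.0, delta=0.0, taus=[-2.0, 0.0, 2.0])
print([round(p, 2) for p in probs])  # [0.06, 0.44, 0.44, 0.06]
```

The probabilities always sum to 1, and shifting θ upward moves mass toward the higher categories.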


Rasch Probability Curve

Category Probability Curves for an Agreement Scale…

[Figure; the thresholds are labeled at the points where adjacent category probability curves cross.]


“Validation” of an Instrument

Some psychometric properties to consider
– Reliability
– Rating scale functioning
– Item and person fit
– Point-measure correlation
– Differential item functioning (DIF)
– Dimensionality


“Validation” of a Survey

Reliability
– A definition: the degree to which scores are free from measurement error, or how consistent the scores are within an administration and over time.
– In general, reliability increases with the number of items and the ability of those items to spread people out along the scoring metric.
– Rules of thumb: 0.7 = OK, 0.8 = good, 0.9 = excellent (but it depends on the use of the scores).
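One common internal-consistency reliability estimate from classical test theory is Cronbach's alpha (Rasch software typically reports an analogous person separation reliability instead). A minimal sketch with hypothetical data; the function name and numbers are ours.

```python
def cronbach_alpha(item_scores):
    """Cronbach's alpha. item_scores is a list of columns, one list of
    scores per item, all over the same respondents."""
    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    k = len(item_scores)
    totals = [sum(person) for person in zip(*item_scores)]
    item_var_sum = sum(var(item) for item in item_scores)
    return (k / (k - 1)) * (1.0 - item_var_sum / var(totals))

# Hypothetical data: three items rated by five respondents.
items = [[4, 3, 5, 2, 4], [5, 3, 4, 2, 4], [4, 2, 5, 1, 3]]
print(round(cronbach_alpha(items), 2))  # 0.94
```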


Reliability and Score Distribution

A scale with reliability 0.40

[Figure: histogram of TLSS total scores (x-axis roughly 20 to 80); the y-axis is the count (0 to 150).]


“Validation” of a Survey

Rating Scale Functioning
– Are the respondents using the rating scale in a consistent fashion?
– Are any categories being over- or underutilized?
– Is there a good distribution of responses across categories?
– Are the categories “disordered”?


Disordered Categories

A frequency scale with a problem…(never, daily, weekly, biweekly, monthly)

[Figure: category probability curves for the scale; the never, weekly, and daily curves are labeled, and the 3/4 and 4/5 thresholds are marked, illustrating the disordering.]


What’s wrong with this scale?

A five-point partial credit observation item…


“Validation” of a Survey

Item and Person Fit
– Are there unpredictable responses in the data (under-fit)?
  - Can indicate multidimensionality, confusing wording or multiple meanings, content not consistent with the construct, or multiple classes of respondents
– Are there responses that are too predictable (over-fit)?
  - Can indicate redundancy in the items (or response sets)
– Are those responses made by individuals at the center of the distribution or at the extremes?


“Validation” of a Survey

An example of a misfitting person…

KEY: .1.=OBSERVED, 1=EXPECTED, (1)=OBSERVED, BUT VERY UNEXPECTED.

NUMBER - NAME ------------------ MEASURE - INFIT (MNSQ) OUTFIT - S.E.

372 3457102 62.28 4.8 A 4.8 7.85

-10 10 30 50 70 90 110

|---------+---------+---------+---------+---------+---------| NUM Item

(2) 4 10* 3c

.4. 12* 3e

4 (5) 14* 3g

.3. 4 9* 3b

(3) 4 8* 3a

4 (5) 13* 3f

4 (5) 11* 3d

|---------+---------+---------+---------+---------+---------| NUM Item

-10 10 30 50 70 90 110


“Validation” of a Survey

Point-measure correlation

Extent to which the item rating correlates with the total score.

+----------------------------------------------------------------------------+
|ENTRY   RAW                 MODEL|  INFIT  | OUTFIT  |PTMEA|               |
|NUMBER SCORE COUNT MEASURE  S.E. |MNSQ ZSTD|MNSQ ZSTD|CORR.| Item      G   |
|------------------------------------+----------+----------+-----+-----------|
|   115    13    15   35.86  8.33|1.51  1.1|9.90  3.5|A-.24| @51b      D   |
|   116    29   377   90.20  2.05|1.23  1.5|4.26  6.0|B-.02| I52       C   |
|    79    46   342   83.49  1.72|1.46  3.8|4.25  8.2|C-.16| I32       E   |
|   114    15   373   97.81  2.73|1.11   .5|3.86  3.8|D .02| I51       C   |
|   120   327   369   34.48  1.75|1.31  2.5|2.98  5.3|E-.04| I54       C   |
|   117    17    29   53.96  4.51|1.72  2.8|2.83  3.7|F .05| @52b      D   |
|   124   286   368   43.80  1.38|1.20  2.6|1.63  3.6|G .21| avail57   A   |
|   113   372   378   12.33  4.16|1.02   .2|1.61   .9|H .07| I50       D   |
|   118   144   356   64.60  1.23|1.33  5.8|1.58  6.1|I .20| I53       C   |
|   121   263   324   41.06  1.55|1.19  2.1|1.53  2.5|J .21| @54b      D   |
|    75   319   376   38.22  1.55|1.10  1.1|1.32  1.5|K .26| I28       E   |
|   119   100   142   51.10  2.09|1.16  1.6|1.30  1.5|L .35| @53b      D   |
|   122   247   338   47.16  1.36|1.04   .6|1.21  1.6|M .37| I55       C   |
|   123   259   370   48.77  1.27|1.19  3.1|1.17  1.5|N .30| avail56   A   |


“Validation” of a Survey

Differential Item Functioning Analysis
– A method for detecting item bias (or differing perceptions of the items on a survey)
– Item bias is different from test bias (the cumulative effect of item bias on the total score)
– DIF analysis looks for items where similar-ability respondents from different demographic groups respond in a very different manner
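One widely used DIF screen (not named on the slide, swapped in here for illustration) is the Mantel-Haenszel procedure: respondents are stratified by ability, and a common odds ratio far from 1 flags the item. The function name and counts below are hypothetical.

```python
def mantel_haenszel_or(strata):
    """Mantel-Haenszel common odds ratio across ability strata.
    Each stratum is (ref_correct, ref_wrong, focal_correct, focal_wrong).
    A value far from 1 suggests comparable-ability groups are
    responding to the item differently (possible DIF)."""
    num = sum(rc * fw / (rc + rw + fc + fw) for rc, rw, fc, fw in strata)
    den = sum(fc * rw / (rc + rw + fc + fw) for rc, rw, fc, fw in strata)
    return num / den

# Hypothetical counts for one item in three ability strata.
strata = [(30, 20, 20, 30), (40, 10, 30, 20), (45, 5, 40, 10)]
print(round(mantel_haenszel_or(strata), 2))  # 2.39
```

Here the odds of success favor the reference group at every ability level, so the item would be flagged for review.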


“Validation” of a Survey

Dimensionality (Rasch PCA)
– Similar to exploratory factor analysis
– Examines the variance structure in the model residuals after factoring out the variance explained by the scale scores
– Looks at correlations in the residual matrix to identify factors (dimensions) that may be affecting patterns of responses


Estimation of Model Parameters

The values of the respondents’ latent trait measures and the difficulties of the items are all unknown quantities.

Likelihood approaches are commonly used to estimate parameters:
– 1PL/Rasch – joint maximum likelihood (JMLE), conditional maximum likelihood (CMLE)
– All models – marginal maximum likelihood (MMLE)

Bayesian techniques are another option
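A toy JMLE sketch for the Rasch model conveys the core idea of alternating between person and item updates. This is illustrative only and entirely our own construction: it uses plain gradient ascent, whereas real software (e.g., WINSTEPS) uses Newton-Raphson updates and bias corrections, and it assumes no person or item has a perfect or zero score (those have no finite estimate).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def rasch_jmle(data, n_iter=200, lr=0.1):
    """Toy joint maximum likelihood estimation for the Rasch model.
    data[p][i] is person p's 0/1 response to item i."""
    n_persons, n_items = len(data), len(data[0])
    theta = [0.0] * n_persons   # person abilities (logits)
    delta = [0.0] * n_items     # item difficulties (logits)
    for _ in range(n_iter):
        # Gradient step on each person's ability, holding items fixed.
        for p in range(n_persons):
            grad = sum(data[p][i] - sigmoid(theta[p] - delta[i])
                       for i in range(n_items))
            theta[p] += lr * grad
        # Gradient step on each item's difficulty, holding persons fixed.
        for i in range(n_items):
            grad = sum(sigmoid(theta[p] - delta[i]) - data[p][i]
                       for p in range(n_persons))
            delta[i] += lr * grad
        # Fix the scale's origin: center item difficulties at 0
        # (shifting theta by the same amount leaves theta - delta intact).
        m = sum(delta) / n_items
        delta = [d - m for d in delta]
        theta = [t - m for t in theta]
    return theta, delta

# Hypothetical 5-person x 4-item response matrix (no perfect/zero scores).
data = [[1, 1, 1, 0],
        [1, 1, 0, 0],
        [0, 1, 0, 0],
        [1, 0, 1, 1],
        [1, 1, 0, 1]]
theta, delta = rasch_jmle(data)
# Items answered correctly by more people get lower difficulty estimates,
# and the person with the lowest raw score gets the lowest ability.
print(delta[0] < delta[2], theta[2] < theta[0])  # True True
```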


Available Software

IRT (1/2/3PL, Graded Response Model, Generalized Partial Credit Model)
– BILOG-MG 3.0, PARSCALE 4.0, TESTFACT 4.0, and MULTILOG 7.0; R packages such as plink

Rasch (1PL, Rasch Rating Scale Model, Partial Credit Model)
– WINSTEPS, RUMM, BIGSTEPS, ConQuest, WINMIRA; R packages such as eRm


More Advanced Models

Multidimensional random coefficients multinomial logit model (MRCML) – Conquest

Mixture distribution Rasch models (latent class analysis) – WINMIRA

Many-faceted Rasch model (FACETS)

Multilevel Rasch models (HLM)