Page 1:

© UCLES 2013

Assessing the Fit of IRT Models in Language Testing

Muhammad Naveed Khalid

Ardeshir Geranpayeh

Page 2:

Outline

• Item Response Theory (IRT)
• Importance of Model Fit within IRT
• Fit Procedures
• Issues and Limitations
• Lagrange Multiplier (LM) Test
• An empirical study using LM Fit statistics
• Sharing Results
• Conclusions

Page 3:

Item Response Theory (IRT)

A family of mathematical models that provide a common framework for describing people and items

Examinee performance can be predicted in terms of the underlying trait

Provides a means for estimating abilities of people and characteristics of items

Page 4:

IRT Models

Dichotomous or Discrete

1 Parameter Logistic Model / Rasch (1PL)

2 Parameter Logistic Model (2PL)

3 Parameter Logistic Model (3PL)

Polytomous or Scalar

Partial Credit Model (PCM)

Generalized Partial Credit Model (GPCM)

Graded Response Model (GRM)
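The three dichotomous models listed above are nested, which a short sketch can make concrete. The function name `irf` and the parameter names `theta`, `b`, `a` and `c` are illustrative choices following common IRT notation; they are not taken from the slides.

```python
import math


def irf(theta, b, a=1.0, c=0.0):
    """Probability of a correct response under the 3PL model.

    theta: examinee ability
    b:     item difficulty
    a:     item discrimination (a = 1 gives the 1PL / Rasch form)
    c:     lower asymptote, or pseudo-guessing (c = 0 gives the 2PL form)
    """
    logistic = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return c + (1.0 - c) * logistic


# 1PL / Rasch: ability equal to difficulty gives a 50% chance of success.
p_rasch = irf(theta=0.0, b=0.0)

# 2PL: a larger discrimination makes the curve steeper around b.
p_2pl = irf(theta=1.0, b=0.0, a=2.0)

# 3PL: very low ability still leaves a success probability close to c.
p_3pl = irf(theta=-10.0, b=0.0, a=1.0, c=0.2)
```

Setting `a = 1` and `c = 0` recovers the 1PL form, and `c = 0` alone recovers the 2PL form, so a single function covers all three models.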

Page 5:

Shape of Item Response Function

Page 6:

Model for Item with 5 response categories

[Figure: axes labelled Probability and Response Category]
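As a rough illustration of a five-category item, the sketch below computes category probabilities under the Partial Credit Model. The slide does not say which polytomous model the figure shows, so the choice of the PCM, the function name `pcm_probs` and the step values are all assumptions made for illustration.

```python
import math


def pcm_probs(theta, deltas):
    """Category probabilities under the Partial Credit Model.

    theta:  examinee ability
    deltas: step parameters delta_1..delta_m
    Returns m + 1 probabilities, one for each category 0..m.
    """
    # Cumulative sums of (theta - delta_k); category 0 has an empty sum.
    cumulative = [0.0]
    for delta in deltas:
        cumulative.append(cumulative[-1] + (theta - delta))
    # Subtract the maximum before exponentiating for numerical stability.
    largest = max(cumulative)
    weights = [math.exp(value - largest) for value in cumulative]
    total = sum(weights)
    return [weight / total for weight in weights]


# Four hypothetical step parameters give five response categories.
probs = pcm_probs(theta=0.0, deltas=[-1.5, -0.5, 0.5, 1.5])
```

Evaluating `pcm_probs` over a grid of `theta` values gives one curve per category, which is the kind of plot the slide title describes.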

Page 7:

IRT Applications

In language testing, IRT is mainly used in:

• Test development
• Item banking
• Differential item functioning (DIF)
• Computerized adaptive testing (CAT)
• Test equating, linking and scaling
• Standard setting

The utility of the IRT model is dependent upon the extent to which the model accurately reflects the data

Page 8:

Model Fit from Item Perspective

Measurement Invariance (MI): Item responses can be described by the same parameters in all sub-populations.

Item Characteristic Curve (ICC): Describes the relation between the latent variable and the observable responses to items.

Local Independence (LI): Responses to different items are independent given the latent trait variable value.

• Unidimensionality
• Speededness
• Global

Page 9:

Consequences of Misfit

Yen (2000) and Wainer & Thissen (2003) have shown that inadequate model-data fit has adverse consequences

Some of the adverse consequences are:

• Biased ability estimates
• Unfair ranks
• Wrongly equated scores
• Student misclassifications
• Reduced score precision
• Weakened validity

Page 10:

Existing Item Fit Procedures

Chi-Square Statistics: tests of the discrepancy between the observed and expected frequencies.

Pearson-Type Item-Fit Indices (Yen, 1984; Bock, 1972).

Likelihood Ratio Based Item-Fit Indices (McKinley & Mills, 1985).
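A Pearson-type index of this kind can be sketched as follows: examinees are sorted into ability groups, and the observed proportion correct in each group is compared with the proportion the model expects. The function name, the grouping and the numbers are hypothetical; the statistic follows the general form of Yen's index rather than any specific software implementation.

```python
def pearson_item_fit(observed, expected, counts):
    """Pearson-type item-fit statistic over ability groups.

    observed: observed proportion correct in each group
    expected: model-expected proportion correct in each group
    counts:   number of examinees in each group
    Returns the sum over groups of N * (O - E)^2 / (E * (1 - E)).
    """
    statistic = 0.0
    for obs, exp, n in zip(observed, expected, counts):
        statistic += n * (obs - exp) ** 2 / (exp * (1.0 - exp))
    return statistic


# Hypothetical item: three ability groups of 100 examinees each.
q1 = pearson_item_fit(
    observed=[0.35, 0.62, 0.80],
    expected=[0.30, 0.60, 0.85],
    counts=[100, 100, 100],
)
```

The statistic is zero when observed and expected proportions agree in every group and grows with both the discrepancy and the group size.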

Page 11:

Issues in Existing Fit Procedures

• The standard theory for chi-square statistics does not hold.
• They fail to take into account the stochastic nature of the item parameter estimates.
• The subgroups for the test are formed from model-dependent trait estimates.
• The number of degrees of freedom is unclear.
• The statistics are sensitive to test length and sample size.

Page 12:

Lagrange Multiplier (LM) Test

Glas (1999) proposed the LM test for the evaluation of model fit.

The LM tests are used for testing a restricted model against a more general alternative one.

Consider a null hypothesis about a model with parameters $\eta_0$. This model is a special case of a general model with parameters $\eta$.

$\eta_0' = (\eta_{01}', c)$

$LM(c) = h(c)' \, W^{-1} \, h(c)$
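Numerically, the LM statistic is a quadratic form in a vector h(c) and a matrix W. The sketch below assumes, following the usual score-test setup, that h(c) holds the first-order derivatives of the general model's log-likelihood at the restricted estimates, that W is its covariance matrix, and that the statistic is referred to a chi-square distribution with as many degrees of freedom as h(c) has elements. The function names and the example values for `h` and `W` are hypothetical.

```python
import math


def solve(W, h):
    """Solve W x = h by Gaussian elimination with partial pivoting."""
    n = len(h)
    A = [list(map(float, W[i])) + [float(h[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for c in range(col, n + 1):
                A[r][c] -= factor * A[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(A[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (A[i][n] - tail) / A[i][i]
    return x


def lm_statistic(h, W):
    """Quadratic form h' W^{-1} h."""
    return sum(hi * xi for hi, xi in zip(h, solve(W, h)))


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable, integer df >= 1."""
    half = x / 2.0
    if df % 2 == 1:
        q, k = math.erfc(math.sqrt(half)), 1   # closed form for df = 1
    else:
        q, k = math.exp(-half), 2              # closed form for df = 2
    while k < df:                              # step from df = k to df = k + 2
        q += half ** (k / 2.0) * math.exp(-half) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return q


# Hypothetical two-parameter test (df = 2).
h = [1.0, 2.0]
W = [[2.0, 0.0], [0.0, 4.0]]
lm = lm_statistic(h, W)
p_value = chi2_sf(lm, df=len(h))
```

With one tested parameter, `chi2_sf(7.10, 1)` rounds to 0.01, which agrees with the Prob value reported for Item13 in the MI table of the empirical example.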

Page 13:

LM Item Fit Statistics

MI / DIF

$P_i(\theta_n) = \dfrac{\exp(\alpha_i(\theta_n - \beta_i) + y_n \delta_i)}{1 + \exp(\alpha_i(\theta_n - \beta_i) + y_n \delta_i)}$

Null Model: $\delta_i = 0$; Alternative Model: $\delta_i \neq 0$

ICC

$P(X_{ni} = 1 \mid \theta_n, \delta_{ig})$

Null Model: $\delta_{ig} = 0$; Alternative Model: $\delta_{ig} \neq 0$

LI

$P(X_{ni} = 1, X_{nl} = 1 \mid \theta_n, \delta_{il})$

Null Model: $\delta_{il} = 0$; Alternative Model: $\delta_{il} \neq 0$

Page 14:

Empirical Example

Data from Cambridge English First (FCE)
– Reading: 3 parts / 30 questions
– Listening: 4 parts / 30 questions

Sample size: over 35,000

The approach can be applied to any other language exam

Page 15:

Lagrange tests MI for Rasch MODEL

----------------------------------------------------------------
                                Focal-Group     Reference   Abs.
     Item       LM  df   Prob    Obs    Exp    Obs    Exp   Dif.
----------------------------------------------------------------
  1  Item1    0.60   1   0.44   0.74   0.72   0.75   0.76   0.01
  2  Item2    0.34   1   0.56   0.94   0.94   0.96   0.95   0.00
  3  Item3    0.04   1   0.84   0.70   0.71   0.75   0.75   0.00
  4  Item4    2.10   1   0.15   0.78   0.75   0.78   0.79   0.02
  5  Item5    1.77   1   0.18   0.82   0.80   0.81   0.82   0.02
  6  Item6    0.15   1   0.69   0.70   0.71   0.75   0.75   0.01
  7  Item7    1.43   1   0.23   0.71   0.68   0.70   0.71   0.02
  8  Item8    0.40   1   0.53   0.87   0.87   0.89   0.90   0.01
  9  Item9    0.17   1   0.68   0.89   0.88   0.90   0.90   0.00
 10  Item10   0.85   1   0.36   0.77   0.78   0.83   0.82   0.01
 11  Item11   0.97   1   0.32   0.87   0.85   0.87   0.87   0.01
 12  Item12   0.09   1   0.76   0.87   0.87   0.89   0.89   0.00
 13  Item13   7.10   1   0.01   0.45   0.50   0.59   0.56   0.04
 14  Item14   2.04   1   0.15   0.51   0.55   0.61   0.60   0.02
 15  Item15   0.00   1   0.97   0.72   0.72   0.75   0.75   0.00
 16  Item16   0.03   1   0.85   0.62   0.62   0.68   0.68   0.00
 17  Item17   2.63   1   0.10   0.48   0.52   0.60   0.59   0.03
 18  Item18   0.01   1   0.91   0.44   0.44   0.49   0.49   0.00
 19  Item19   0.36   1   0.55   0.78   0.79   0.83   0.83   0.01
 20  Item20   1.05   1   0.31   0.66   0.69   0.73   0.72   0.02
 21  Item21   2.77   1   0.10   0.80   0.83   0.88   0.87   0.02
 22  Item22   4.17   1   0.04   0.71   0.75   0.81   0.80   0.02
 23  Item23   0.58   1   0.44   0.87   0.85   0.87   0.87   0.01
 24  Item24   0.13   1   0.71   0.83   0.83   0.87   0.87   0.00
 25  Item25   0.94   1   0.33   0.92   0.93   0.95   0.95   0.01
 26  Item26   5.05   1   0.02   0.60   0.55   0.59   0.61   0.03
 27  Item27   4.55   1   0.03   0.64   0.60   0.64   0.65   0.03
 28  Item28   2.76   1   0.10   0.49   0.45   0.49   0.50   0.03
 29  Item29   0.26   1   0.61   0.62   0.61   0.66   0.67   0.01
 30  Item30   3.07   1   0.08   0.70   0.66   0.69   0.71   0.03
----------------------------------------------------------------

Page 18:

Lagrange multipliers ICC for Rasch MODEL

------------------------------------------------------------------------------
                      Groups:        1             2             3        Abs.
     Item       LM  df   Prob   Obs.   Exp.   Obs.   Exp.   Obs.   Exp.   Dif.
------------------------------------------------------------------------------
  1  Item1    3.56   2   0.17   0.56   0.55   0.72   0.71   0.82   0.83   0.01
  2  Item2    1.98   2   0.37   0.60   0.59   0.79   0.78   0.89   0.90   0.01
  3  Item3    1.25   2   0.54   0.54   0.56   0.76   0.74   0.86   0.87   0.01
  4  Item4    1.23   2   0.54   0.67   0.66   0.83   0.83   0.91   0.92   0.01
  5  Item5    2.81   2   0.24   0.71   0.71   0.86   0.84   0.91   0.92   0.01
  6  Item6    2.96   2   0.23   0.58   0.57   0.68   0.71   0.84   0.83   0.02
  7  Item7    2.65   2   0.27   0.17   0.19   0.33   0.31   0.49   0.49   0.01
  8  Item8    4.82   2   0.09   0.65   0.66   0.76   0.77   0.87   0.86   0.01
  9  Item9    4.40   2   0.11   0.20   0.20   0.33   0.36   0.60   0.58   0.02
 10  Item10   3.89   2   0.14   0.24   0.23   0.51   0.54   0.84   0.82   0.02
 11  Item11   1.62   2   0.44   0.73   0.72   0.86   0.88   0.95   0.95   0.01
 12  Item12  19.55   2   0.00   0.42   0.37   0.50   0.57   0.77   0.76   0.04
 13  Item13   0.94   2   0.63   0.43   0.44   0.76   0.75   0.91   0.92   0.01
 14  Item14   2.82   2   0.24   0.64   0.63   0.89   0.88   0.96   0.97   0.01
 15  Item15  11.03   2   0.00   0.36   0.36   0.65   0.63   0.81   0.84   0.02
 16  Item16   3.88   2   0.14   0.52   0.51   0.83   0.83   0.95   0.96   0.01
 17  Item17   0.84   2   0.66   0.51   0.51   0.77   0.77   0.92   0.92   0.01
 18  Item18   0.85   2   0.65   0.25   0.25   0.41   0.41   0.59   0.60   0.01
 19  Item19   0.99   2   0.61   0.49   0.50   0.70   0.70   0.86   0.85   0.01
 20  Item20   0.90   2   0.64   0.34   0.33   0.59   0.59   0.81   0.81   0.00
 21  Item21   1.02   2   0.60   0.18   0.17   0.27   0.28   0.44   0.43   0.01
 22  Item22   2.92   2   0.23   0.43   0.44   0.72   0.72   0.90   0.89   0.01
 23  Item23   0.26   2   0.88   0.73   0.73   0.93   0.93   0.98   0.98   0.00
 24  Item24   1.47   2   0.48   0.69   0.70   0.91   0.90   0.97   0.97   0.01
 25  Item25   0.61   2   0.74   0.45   0.46   0.61   0.59   0.71   0.72   0.01
 26  Item26   8.56   2   0.01   0.53   0.56   0.74   0.71   0.81   0.82   0.02
 27  Item27   2.76   2   0.25   0.36   0.36   0.56   0.58   0.79   0.78   0.01
 28  Item28   1.64   2   0.44   0.38   0.36   0.53   0.56   0.76   0.75   0.02
 29  Item29   0.31   2   0.86   0.55   0.55   0.78   0.79   0.92   0.92   0.00
 30  Item30   2.21   2   0.33   0.37   0.39   0.53   0.50   0.62   0.63   0.02
------------------------------------------------------------------------------

Page 19:

Lagrange multipliers LI for Rasch MODEL

----------------------------------------------------------------
Itm  Itm      LM  df   Prob      Observed      Expected  Abs.Dif
----------------------------------------------------------------
  2    1    0.15   1   0.70   0.55   0.55   0.62   0.63     0.01
  3    2    6.31   1   0.04   0.57   0.59   0.71   0.69     0.01
  4    3    1.79   1   0.18   0.62   0.64   0.72   0.71     0.02
  5    4    0.26   1   0.61   0.72   0.73   0.77   0.77     0.01
  6    5    0.07   1   0.79   0.75   0.75   0.82   0.82     0.01
  7    6    0.02   1   0.88   0.51   0.52   0.62   0.61     0.03
  8    7   23.95   1   0.00   0.53   0.59   0.70   0.66     0.03
  9    8    0.27   1   0.61   0.61   0.61   0.76   0.76     0.01
 10    9    1.97   1   0.16   0.40   0.42   0.68   0.67     0.01
 11   10    1.20   1   0.27   0.61   0.60   0.78   0.79     0.01
 12   11   24.08   1   0.00   0.72   0.77   0.93   0.91     0.05
 13   12    2.11   1   0.15   0.53   0.56   0.81   0.80     0.01
 14   13    4.24   1   0.06   0.68   0.71   0.91   0.90     0.01
 15   14   41.66   1   0.00   0.14   0.25   0.62   0.60     0.05
 16   15    4.02   1   0.07   0.70   0.69   0.84   0.85     0.02
 17   16    7.04   1   0.01   0.66   0.70   0.87   0.86     0.01
 18   17    4.37   1   0.08   0.51   0.55   0.80   0.79     0.01
 19   18   13.69   1   0.00   0.52   0.57   0.84   0.82     0.04
 20   19    2.04   1   0.12   0.69   0.70   0.93   0.91     0.02
 21   20    3.85   1   0.05   0.41   0.46   0.67   0.66     0.01
 22   21    1.71   1   0.11   0.80   0.82   0.92   0.91     0.01
 23   22    2.01   1   0.16   0.79   0.82   0.94   0.94     0.01
 24   23   10.60   1   0.00   0.62   0.72   0.93   0.92     0.03
 25   24    1.02   1   0.31   0.61   0.58   0.84   0.84     0.02
 26   25    2.34   1   0.13   0.58   0.60   0.82   0.82     0.01
 27   26    2.10   1   0.09   0.41   0.45   0.67   0.65     0.02
 28   27    1.62   1   0.92   0.86   0.85   0.89   0.91     0.02
 29   28    0.17   1   0.68   0.48   0.47   0.63   0.63     0.01
 30   29    0.47   1   0.49   0.77   0.77   0.86   0.86     0.01
----------------------------------------------------------------

Page 20:

Conclusions

• LM statistics overcome the issues in existing fit procedures
• They are less computationally intensive
• The size of the residuals, reported as Abs.Dif, is highly valuable
• The IRT model fits the FCE data reasonably well
• Items violating the assumptions: MI (4); ICC (3); LI (7)
• The magnitude of the violations is not severe
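The violation counts can be checked against the Prob columns of the three LM tables. The sketch below copies those columns and counts the values below 0.05. The slides do not state the significance level, so the 0.05 threshold and the strict inequality are assumptions; the variable names are illustrative.

```python
ALPHA = 0.05  # assumed significance level; not stated in the slides

# Prob columns copied from the MI, ICC and LI tables, in row order.
prob = {
    "MI": [0.44, 0.56, 0.84, 0.15, 0.18, 0.69, 0.23, 0.53, 0.68, 0.36,
           0.32, 0.76, 0.01, 0.15, 0.97, 0.85, 0.10, 0.91, 0.55, 0.31,
           0.10, 0.04, 0.44, 0.71, 0.33, 0.02, 0.03, 0.10, 0.61, 0.08],
    "ICC": [0.17, 0.37, 0.54, 0.54, 0.24, 0.23, 0.27, 0.09, 0.11, 0.14,
            0.44, 0.00, 0.63, 0.24, 0.00, 0.14, 0.66, 0.65, 0.61, 0.64,
            0.60, 0.23, 0.88, 0.48, 0.74, 0.01, 0.25, 0.44, 0.86, 0.33],
    "LI": [0.70, 0.04, 0.18, 0.61, 0.79, 0.88, 0.00, 0.61, 0.16, 0.27,
           0.00, 0.15, 0.06, 0.00, 0.07, 0.01, 0.08, 0.00, 0.12, 0.05,
           0.11, 0.16, 0.00, 0.31, 0.13, 0.09, 0.92, 0.68, 0.49],
}

# Count the rows whose reported probability falls below the threshold.
flagged = {name: sum(p < ALPHA for p in values) for name, values in prob.items()}
print(flagged)  # {'MI': 4, 'ICC': 3, 'LI': 7}
```

Under these assumptions the counts match the figures quoted in the conclusions: 4 items for MI, 3 for ICC and 7 item pairs for LI.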

Page 21:

Thank you! & Questions