University of Ostrava Czech republic 26-31, March, 2012.

Page 2: University of Ostrava Czech republic 26-31, March, 2012.

Different forms of a test

Item banking

Achievement monitoring

Page 3: University of Ostrava Czech republic 26-31, March, 2012.

Classical Test Theory:
• Equating is applied only to different test forms
• Equating is often ignored (the concept of parallel test forms)
• Establishes equivalent scores on different test forms
• Doesn’t create a common scale

Item Response Theory:
• Satisfies all equating needs
• Places all estimates of item and examinee parameters on a common scale

Page 4: University of Ostrava Czech republic 26-31, March, 2012.

Equating is a special procedure that establishes the relation between examinee scores on different test forms and places them on the same scale.

As a result, a measure based on responses to one test can be matched to a measure based on responses to another test, and the conclusions drawn about an examinee are identical regardless of which test form produced the measure.

Equating of different test forms is called horizontal equating.

Page 5: University of Ostrava Czech republic 26-31, March, 2012.

The purpose: comparison of student achievements at different grade levels

Test forms are designed to be of different difficulties

Measures from different tests should be placed on the same linear continuum

This kind of test equating is called vertical equating.

Page 6: University of Ostrava Czech republic 26-31, March, 2012.

• Item bank – a set of items from which test forms that create equivalent measures may be constructed.

• Item bank is composed of a set of test items that have been placed onto a common scale, so that different subsets of these items produce interchangeable measures for an examinee.

• Once an item bank exists, no further equating is needed.

Page 7: University of Ostrava Czech republic 26-31, March, 2012.

Test equating and item banking are both designed to place estimated parameters onto a common scale

In test equating the goal is to place person measures from the multiple test forms onto the same scale

In item banking the goal is to place item calibrations on the same scale

Procedures are nearly identical when we use Rasch measurement

Page 8: University of Ostrava Czech republic 26-31, March, 2012.

Equating: a procedure that ensures that examinee measures obtained from different subsets of items are interchangeable. When two tests are equated, the resulting measures are placed onto the same scale.

Scaling: a procedure that associates numbers with the performance of examinees. Tests can be scaled identically without having been equated.

Page 9: University of Ostrava Czech republic 26-31, March, 2012.

Applies only to comparing examinee test scores on two different test forms

The problem can be ignored (by introducing “parallel” test forms)

Implies only establishing a relation between test scores on different test forms

Doesn’t imply creation of a common scale

Page 10: University of Ostrava Czech republic 26-31, March, 2012.

Linear equating

Equipercentile equating

Page 11: University of Ostrava Czech republic 26-31, March, 2012.

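It is based on equating the standard score on test X to the standard score on test Y:

$$\frac{x - \mu_x}{\sigma_x} = \frac{y - \mu_y}{\sigma_y}$$

Thus, $y = Ax + B$, where

$$A = \frac{\sigma_y}{\sigma_x}, \qquad B = \mu_y - A\,\mu_x,$$

with $\mu_x, \mu_y$ the mean scores and $\sigma_x, \sigma_y$ the standard deviations of the scores on tests X and Y.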
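A minimal sketch of linear equating as described on this slide: a score on test X is mapped to the score on test Y that has the same standard score. The function names and the raw scores are hypothetical, and the population standard deviation is an assumption (the slide does not say which estimator is used).

```python
from statistics import mean, pstdev

def linear_equating_coefficients(scores_x, scores_y):
    """Return (A, B) such that y = A * x + B equates standard scores on X and Y."""
    a = pstdev(scores_y) / pstdev(scores_x)       # ratio of standard deviations
    b = mean(scores_y) - a * mean(scores_x)       # aligns the two means
    return a, b

def equate_linear(x, scores_x, scores_y):
    """Map a raw score x on test X to the equivalent score on test Y."""
    a, b = linear_equating_coefficients(scores_x, scores_y)
    return a * x + b

# Hypothetical raw scores from two equivalent examinee groups
scores_x = [10, 12, 14, 16, 18]
scores_y = [20, 24, 28, 32, 36]
a, b = linear_equating_coefficients(scores_x, scores_y)
equated = equate_linear(14, scores_x, scores_y)   # the mean of X maps to the mean of Y
```

The mapping depends only on the first two moments of each score distribution, which is why it relies on the two distributions having the same shape.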

Page 12: University of Ostrava Czech republic 26-31, March, 2012.

Scores on tests X and Y are considered to be equivalent if their respective percentile ranks in any given group are equal.
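The definition above can be sketched in a few lines. This is a deliberately crude version, assuming the percentile rank counts half of the tied scores and that the equated score is chosen among the observed Y scores (operational implementations interpolate and smooth the distributions). All names and scores are hypothetical.

```python
from bisect import bisect_left, bisect_right

def percentile_rank(scores, value):
    """Percent of scores below value, counting half of the scores tied with it."""
    ordered = sorted(scores)
    below = bisect_left(ordered, value)
    ties = bisect_right(ordered, value) - below
    return 100.0 * (below + 0.5 * ties) / len(ordered)

def equate_equipercentile(x, scores_x, scores_y):
    """Return the observed Y score whose percentile rank is closest to that of x on X."""
    target = percentile_rank(scores_x, x)
    candidates = sorted(set(scores_y))
    return min(candidates, key=lambda y: abs(percentile_rank(scores_y, y) - target))

# Hypothetical raw scores from one examinee group
scores_x = [1, 2, 3, 4, 5]
scores_y = [11, 12, 13, 14, 15]
equivalent = equate_equipercentile(3, scores_x, scores_y)
```

Here the score 3 has percentile rank 50 on X, and 13 is the Y score with the same rank.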

Page 13: University of Ostrava Czech republic 26-31, March, 2012.

Both methods require assumptions about the identity of the test score distributions and about the equivalence of the examinee groups

Equating in CTT doesn’t imply creation of a common scale

Page 14: University of Ostrava Czech republic 26-31, March, 2012.

Measuring the same trait: tests of different content cannot be equated (but can be scaled in a similar manner).

Invariance of equating results across samples of examinees

Independence of equating results from the choice of the reference test

Page 15: University of Ostrava Czech republic 26-31, March, 2012.

• Method of common items: linkage between two test forms is accomplished by means of a set of items that are common to both test forms

• Method of common persons: linkage between two test forms is accomplished by means of a set of persons who respond to both test forms

• Combined methods: linkage between two test forms is accomplished by means of common items and/or common persons plus common raters

Page 16: University of Ostrava Czech republic 26-31, March, 2012.

Internal anchor: each test form has one set of items that is shared with other forms and another set of items that is unique to this form

Page 17: University of Ostrava Czech republic 26-31, March, 2012.

External anchor: each test form is accompanied by an additional set of items that are not part of the test forms themselves

Page 18: University of Ostrava Czech republic 26-31, March, 2012.

All examinees respond to both test forms.

There are two approaches to this design:

- same group/ same time

- same group/ different time

Page 19: University of Ostrava Czech republic 26-31, March, 2012.

Linkage between two test forms is accomplished by means of a set of examinees who respond to all items.

Page 20: University of Ostrava Czech republic 26-31, March, 2012.

• Selecting an equating method
• Parameter estimation
• Transformation of parameters from different test forms to the same scale
• Evaluating the quality of the links between test forms

Page 21: University of Ostrava Czech republic 26-31, March, 2012.

Simultaneous calibration: all parameters are estimated simultaneously in one run of the estimation software. The data are automatically placed on the same scale.

Separate calibration: parameters are estimated for each test form separately. That is, the data are calibrated in multiple runs of the estimation software.

Separate calibration may be harder to carry out because the test developer must then transform the measures to a common scale.

Page 22: University of Ostrava Czech republic 26-31, March, 2012.

Separate calibration of all test forms, followed by transformation of the measures to the common scale

Simultaneous calibration of all test forms, which places all measures on the common scale

Separate calibration of all test forms with the difficulty values of the common items anchored, which successively places all parameters on the common scale

Page 23: University of Ostrava Czech republic 26-31, March, 2012.

As a rule, this procedure is used with the method of common items, which are called nodal items in this case

Each test form is calibrated separately. As a result, the estimates for each test form lie on that form's own scale. The scales differ only in their origins

This difference can be removed by calculating a location shift

It is desirable to have at least 15-20% nodal items (some of them can be deleted from the link later).

Page 24: University of Ostrava Czech republic 26-31, March, 2012.

• Choice of a common scale
• Selection of nodal items
• Calibration of all test forms
• Calculating equating constants
• Link quality evaluation
• Transforming all parameters onto a common scale

Page 25: University of Ostrava Czech republic 26-31, March, 2012.

$$t_{12} = \frac{1}{l}\sum_{i=1}^{l}\left(\delta_{i2} - \delta_{i1}\right)$$

where t12 is the shift constant from test form 1 to test form 2; δi1 is the difficulty estimate of item i in test form 1; δi2 is the difficulty estimate of item i in test form 2; l is the number of common items.

Sometimes other formulas are applied: weighted mean, dispersion shift, etc.

Page 26: University of Ostrava Czech republic 26-31, March, 2012.

δi1' = δi1 + t12,

where δi1 is the difficulty estimate for item i in test form 1 and δi1' is the difficulty estimate for the same item on the scale of test form 2; i = 1, …, k, where k is the total number of test items;

θn1' = θn1 + t12,

where θn1 is the ability estimate for examinee n who responded to the items of test form 1 and θn1' is the ability estimate for the same examinee on the scale of test form 2; n = 1, …, N, where N is the total number of examinees who responded to the items of test form 1.

Parameter estimates of test form 1 shifted in this way are placed on the scale of test form 2.
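The shift can be sketched as follows, assuming the shift constant from form 1 to form 2 is the mean difference (form 2 minus form 1) of the nodal item difficulties, so that adding it to form 1 estimates places them on the form 2 scale. The function names are hypothetical; the difficulty values are those of nodal items 4, 6, 7, 14 and 20 from the worked example later in this presentation.

```python
def shift_constant(delta_from, delta_to):
    """Mean difference of nodal item difficulties: target form minus source form."""
    if not delta_from or len(delta_from) != len(delta_to):
        raise ValueError("need the same non-empty set of nodal items on both forms")
    return sum(t - f for f, t in zip(delta_from, delta_to)) / len(delta_from)

def shift(estimates, constant):
    """Apply the same location shift to item or person estimates (in logits)."""
    return [e + constant for e in estimates]

# Nodal item difficulties from the two separate calibrations
form1 = [-1.39, -0.93, -2.57, -0.44, 0.88]
form2 = [-1.07, -0.54, -1.99, -0.32, 0.96]

t12 = shift_constant(form1, form2)          # about 0.298 logits
form1_on_form2_scale = shift(form1, t12)    # the same shift applies to person measures
```

The worked example in this presentation goes in the opposite direction (form 2 onto the form 1 scale), which gives the same constant with the opposite sign.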

Page 27: University of Ostrava Czech republic 26-31, March, 2012.

Item-within-link (fit analysis of linking items);

Item-between-link (stability of the item calibrations between two test forms)

Page 28: University of Ostrava Czech republic 26-31, March, 2012.

$$U_i = \frac{\delta_{i1}' - \delta_{i2}}{\sigma_{i12}}, \qquad \sigma_{i12}^2 = \sigma_{i1}^2 + \sigma_{i2}^2,$$

where σi1, σi2 are the standard errors of measurement for item i under calibration of test forms 1 and 2; δi1' is the difficulty estimate for item i from test form 1, shifted to the scale of test form 2; δi2 is the difficulty estimate for the same item in test form 2; Ui ~ N(0,1).
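A minimal sketch of the item-between-link check, assuming the statistic is the difference between the two difficulty estimates of a linking item, both expressed on the common scale, divided by the square root of the sum of their squared standard errors. The function names are hypothetical. The cut-off of 1.96 is also an assumption (the conventional two-sided 5% point of N(0,1)); the slides do not state one. The input values are taken from the worked example later in this presentation.

```python
from math import sqrt

def standardized_difference(delta_a, delta_b, se_a, se_b):
    """Difference of two estimates on a common scale, in joint standard error units."""
    return (delta_a - delta_b) / sqrt(se_a ** 2 + se_b ** 2)

def is_unstable(u, cutoff=1.96):
    """Flag a linking item whose calibrations differ more than chance would suggest."""
    return abs(u) > cutoff

# Item 20: form 1 estimate 0.88 (SE 0.12), shifted form 2 estimate 0.662 (SE 0.11)
u_20 = standardized_difference(0.88, 0.662, 0.12, 0.11)
# Item 7: form 1 estimate -2.57 (SE 0.1), shifted form 2 estimate -2.288 (SE 0.1)
u_7 = standardized_difference(-2.57, -2.288, 0.1, 0.1)

flags = {20: is_unstable(u_20), 7: is_unstable(u_7)}
```

The sign of the statistic depends only on the order of subtraction, so the magnitude is what matters when screening items for removal from the link.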

Page 29: University of Ostrava Czech republic 26-31, March, 2012.

All parameters of all test forms are estimated simultaneously

This is the simplest approach to equating test forms or calibrating an item bank because it requires no subsequent transformation of the estimated measures or calibrations. The data are automatically placed on the same scale in one run of the estimation software

Page 31: University of Ostrava Czech republic 26-31, March, 2012.

As a rule, this procedure is used with the method of common items, which are called anchor items in this case

The common items are estimated once, during calibration of the first test form

During calibration of another test form the calibration values for these items are treated as being fixed or known and are not estimated. As a result, the remaining parameter estimates are forced onto the same scale as the anchor items

It is easy to anchor items in most estimation software

Page 32: University of Ostrava Czech republic 26-31, March, 2012.

IAFILE=*
2 -0.29
4 -1.06
8 -0.49
11 -0.04
17 -0.28
37 -2.20
38 -1.34
*

The numbers of the anchor items and their difficulties are specified. These difficulty values are fixed and are not estimated during calibration of the new test form
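When the anchor values come from an earlier calibration, the block can be generated rather than typed by hand. The sketch below only formats text in the layout shown on this slide; `format_iafile` is a hypothetical helper, not part of any estimation software, and the two-decimal precision is an assumption.

```python
def format_iafile(anchors):
    """Build an IAFILE=* ... * block from {item number: fixed difficulty in logits}."""
    lines = ["IAFILE=*"]
    for item, difficulty in sorted(anchors.items()):
        lines.append(f"{item} {difficulty:.2f}")   # one anchor item per line
    lines.append("*")                              # closing asterisk ends the list
    return "\n".join(lines)

# Anchor values copied from the slide
anchors = {2: -0.29, 4: -1.06, 8: -0.49, 11: -0.04, 17: -0.28, 37: -2.20, 38: -1.34}
block = format_iafile(anchors)
print(block)
```

Generating the block from stored calibrations avoids transcription errors in the fixed difficulty values.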

Page 33: University of Ostrava Czech republic 26-31, March, 2012.

• Choice of a common scale
• Selection of anchor items
• Calibration of the test form whose scale is accepted as the common scale
• Sequential calibration of the other test forms with the difficulty values of the anchor items fixed
• Item-within-link fit (fit analysis of linking items)

Page 34: University of Ostrava Czech republic 26-31, March, 2012.

If we use different equating procedures, the resulting scales will be different and cannot be compared directly. This is because the procedures select the origin of the scale in different ways.

There are publications in which all three procedures are analyzed (for example, Smith R.M. Applications of Rasch Measurement. Chicago: MESA Press, 1992). The precision of the estimated examinee and item parameters is approximately the same, and the correlation between the measures is high.

Page 35: University of Ostrava Czech republic 26-31, March, 2012.

• Each test form has 26 dichotomous items
• Both test forms have 6 common items: Nos. 4, 6, 7, 14, 20, 24 (23% of the total number of items)
• The total number of examinees is 654 for test form 1 and 661 for test form 2
• Winsteps software was used for test calibration
• The means of the examinee measures are -1.07 and -0.72 logits for test forms 1 and 2, respectively
• The scale of the first test form was chosen as the common scale

Page 36: University of Ostrava Czech republic 26-31, March, 2012.

Columns: δi is the difficulty estimate, σi is its standard error, δi' is the shifted difficulty estimate.

Item     Test form 1        Test form 2        Shifted    ui
number   δi       σi        δi       σi        δi'
4        -1.39    0.09      -1.07    0.09      -1.368     -0.17
6        -0.93    0.1       -0.54    0.09      -0.838     0.69
7        -2.57    0.1       -1.99    0.1       -2.288     2.0
14       -0.44    0.1       -0.32    0.09      -0.618     -1.33
20       0.88     0.12      0.96     0.11      0.662      -1.34
Sum      -4.45              -2.96              -4.45
Mean     -0.89              -0.592             -0.89

Shift constant (from test form 2 to the scale of test form 1): -0.89 - (-0.592) = -0.298.

Page 37: University of Ostrava Czech republic 26-31, March, 2012.

Simultaneous calibration implies creating a common response matrix for both test forms, containing 1315 examinees and 46 different items.

The measures of all examinees and the difficulty values of all items will be placed on a common scale that is centered at the mean difficulty of all 46 items
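A sketch of how such a common response matrix might be assembled, assuming each form's responses are stored as rows of 0/1 scores with a parallel list of item labels, and that items a person did not take are coded as missing (`None`). The function name, item labels and tiny data set are hypothetical. With the figures on this slide the result would have 654 + 661 = 1315 rows and 26 + 26 - 6 = 46 columns.

```python
def combine_forms(responses_1, items_1, responses_2, items_2):
    """Stack two response matrices over the union of items; None = not administered."""
    seen = set(items_1)
    all_items = list(items_1) + [item for item in items_2 if item not in seen]
    column = {item: j for j, item in enumerate(all_items)}

    combined = []
    for responses, items in ((responses_1, items_1), (responses_2, items_2)):
        for row in responses:
            new_row = [None] * len(all_items)
            for item, score in zip(items, row):
                new_row[column[item]] = score   # common items share one column
            combined.append(new_row)
    return all_items, combined

# Hypothetical forms sharing the common item "C1"
items_1 = ["A1", "A2", "C1"]
items_2 = ["C1", "B1", "B2"]
responses_1 = [[1, 0, 1], [0, 0, 1]]
responses_2 = [[1, 1, 0]]

all_items, matrix = combine_forms(responses_1, items_1, responses_2, items_2)
```

The missing cells are missing by design, so the estimation software has to treat them as not administered rather than as incorrect responses.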

Page 38: University of Ostrava Czech republic 26-31, March, 2012.

• Calibration of test form 1
• Calibration of test form 2 with the difficulty values of the anchor items fixed at their values from the first calibration:

IAFILE=*
4 -1.39
6 -0.93
7 -2.57
14 -0.44
20 0.88
*

• As a result, examinee measures from both test forms will be on the scale of the first test form

Page 39: University of Ostrava Czech republic 26-31, March, 2012.

Comparison of the examinee measures from the three equating procedures showed approximately the same results: the correlation is close to 1

The choice of equating procedure is determined by the design of the real data and the purpose of the research