Examining How Rasch/IRT Can Increase Measurement Precision

Examining How Rasch/IRT Can Increase Measurement Precision

Ann M. Doucette, Ph.D.

The Evaluators’ Institute

American Evaluation Association Annual MeetingChicago, Illinois

12 November 2015

Ann M. Doucette, PhD

Learning Objectives• Recognize the role of measurement precision in

obtaining reliable and generalizable psychotherapy research outcomes

• Understand the basic differences between Classical Test Theory (CTT) and Item Response theory (IRT)

• Discuss the advantages IRT offers over CTT in terms of measurement precision and testing the assumption often made in psychotherapy research of scaled interval-level data

• Describe the basic characteristics and assumptions (e.g., unidimensionality and local independence) of one-parameter logistic IRT models (Rasch) and multiple-parameter models (2PL, 3Pl, and 4Pl models)

• Recognize the pivotal role of measurement as a foundation of evaluation practice

2

Ann M. Doucette, PhD 3

Performance measurements:

balanced scorecard model for, 106;

converting data into information,

111–116fig; cost effectiveness of,

102; criteria for good, 107–108;

current challenges to, 119–122;

customer satisfaction, 103;

disability adjudicators’ claimsprocessing, 110t; EA (evaluabilityassessment),

90–91; efficiency of, 102–103; improving performance through,

120–121; logic modeling used in, 72–75fig; networked environments implementation of, 121–122;

outcomes of, 101–102; outputs, 102; reliability of, 13–14b, 107, 558b, 559;

RFE (rapid feedback evaluation), 95–96; service quality, 103; STEM

instruction program, 75fig; strategy maps for, 106; tables of performance indicators, 86; validity of, 12–17, 107,

558b, 559. See also

Data; Data analysis; Logic models;

Program evaluation; Statistics

Measurement validity: descriptionof, 12, 558b; internal, 14–15,558b; process for achieving,12–13b; statistical conclusion,16–17


Measurement

A scientific method – a way of finding out

(more or less reliably) what level of an

attribute is possessed by the object or

objects under investigation . . . the

magnitude of a level of an attribute via its

numerical relation (ratio) to another level of

the same attribute (Michell, 2001, p. 212).

4


Measurement Requirements

Measures must be*

• Linear – ordered/monotonic, conjointly additive permitting arithmetic computations can be conductedo Assumes unidimensionality

• Item calibrations must not be dependent on whose responses are used to estimate the measurement model (sample invariant)

• Person estimates/measures must not be dependent on which items are answered (test free)

• Missing data should not matter* Thurstone, 1925, 1926)

5


Basic Assumptions: Measurement

• Reliability: the consistency across observed scores for a particular attribute (depression, achievement, vocational interest, etc. Respondents would obtain the same scale score placement on different occasions

• Validity: the accuracy of the measurement in terms of expected scores for a universe of observations defining a specific attribute for a given population and a given purpose

• Dimensionality: measure scores are interpreted as

assessing a single trait (depression, anxiety, etc.)

o Multidimensionality confounds, making it impossible

to distinguish the attribute assessed by the measure

6


Measurement Function

7

• Assigning a numerical value across a set of

item that represents person placement on a

define attribute

Mild Moderate Severe


If the student gets all items correct we can assume the student likely gets thealgebra, geometry, addition, subtraction, etc., correct; and has exceptional math skill

Mapping Items to Attribute Levels

8

Respondents

Students with

exceptional

mathematical skill

Students with

moderate

mathematical skill

Students with

poor

mathematical skill

Items

Calculus,

Trigonometry

Algebra, Geometry

Basic math (addition,

subtraction,

multiplication, long

division)

If the student gets all items correct we cannot assume the student has exceptional math skill

Ann M. Doucette, PhD 9


Approaches To Measurement

• Classical Test Theory (CTT)o Currently the approach most frequently used

• Latent Trait Modelso Used frequently in educational testing and

healthcare• Standardized tests and high stakes tests

• Increasing acceptance in behavioral healthcare

• NIH Roadmap for Medical Research Initiative – Patient-Reported Outcomes Measurement Information System(PROMIS)

10


Measurement Approaches

• Classical Test Theory (CTT)

o Currently the approach most

frequently used

• Latent Trait Models

o Used frequently in educational

testing and physical

healthcare

• Increasing acceptance in behavioral healthcare

• NIH Roadmap for Medical Research Initiative – Patient-Reported Outcomes Measurement Information System (PROMIS)

Classical Test Theory(True Score Theory)

Latent Trait Models(Strong True Score Theory)

Rasch Item Response Theory

1PL 2PL 3PL 4PL

11


Classical Test Theory (CTT)

• Classical Test Theory is sometimes referred to as true score theory

• Measure score = true score + random erroro Measurement error is normally distributed

o Measurement error is uncorrelated with the true score

• Item characteristics are measure and sample dependent

• All items contribute equally to the overall scale score (unless separately weighted using a scoring protocol)

• Response options (e.g. Likert scales) are assumed and frequently analyzed as equal interval scales

12


Classical Test Theory: Limitations

• Observed score is dependent on choice of items

• Item statistics – item difficulty and item discrimination are sample dependento No modeling of data at item level

• Assumed common measurement error across all items

• Measure reliability is sample dependent

• Limits o Ability to equate across tests

o Develop computer adaptive applications

13


Item Response Theory (IRT)

14

• Item Response Theory (IRT): modern latent trait

theory/Strong True Score Theory – an alternative scaling

methodology that enables the examination of the

hierarchical structure, unidimensionality and additivity

of items and measure scores

• Item – ability relationships: probability of correct

response is a function of responder ability, an estimate

of his/her level of the measured latent trait

o Higher ability increased probability of favorable score

• Estimates are provided for each individual item and

each response scale option

• Items provide differential weight/contribution to the

overall measure scale score

o Answering more difficult items yields a higher score


Assumptions: IRT Models

• Unidimensional: items measure the same underlying construct – a dominant first factoro Unidimensionality is seldom perfectly achieved – it is a matter

of the degree of the departure from unidimensionality

• Local Independence: response to an item is conditionally independent of responses to other items in a scale after controlling for the latent trait measured by the scale. o Latent variable explains how items are related – e.g., person

with high depression would have a specific response profile across items

• Monotonicity: consistent increasing relationship between theta (θ – proficiency/ability) and item responseso Expectation that respondent will use consistent response

options for items of like difficulty

15


Item Response Theory Models• Data modeling, statistical estimation of parameters

that represent the location of persons and items on a latent continuum, the measurement construct of interest

• Statistical models specifying the probability of a discrete outcome (e.g., a correct response to an item, the selection of a response option on a Likert scale), in terms of person and item parameters. o Person parameters, ability, indicate

• The level of impairment along the measured construct (e.g., depression, distress, etc.)

• The strength of an individual’s attitude

• Competence in an academic area (e.g., mathematics, etc.)

o Item parameters refer to • Difficulty

• Discrimination (slope or correlation)

• Guessing (lower asymptote)

• Carelessness (upper asymptote)16


Item Response Theory Models

• The Rasch model/1PL IRT model assumes that item discrimination parameters are equal, implying that two items of the same difficulty (b) on a test are equally precise at measuring ability

• 2PL models allow discrimination parameters to vary –item difficulty (b) and discrimination (a) are both modeled o Does the item discriminate across individuals with the same

ability?

• 3PL models estimate the contribution of guessing (c). Guessing is of questionable importance in behavioral healthcare, although is being used in the area of personality research

• 4PL models estimate the contribution of carelessness

17


IRT Models: Response Scales

• Dichotomous (yes-no) –

• Polytomous (Likert scales, rating scales, etc.)

o Partial Credit Model (PCM): thresholds for each item

response are independently estimated – ordinal

scale properties are tested

• Example: multiple choice items where some responses are

more correct than others

• Additional parameters are estimated increasing model fit

o Rating Scale Model (RSM): assumes all items share

the same rating scale (e.g., Likert scaling)

18

Commonly used in evaluations


Rasch versus 1PL

• Global fit of the model to the

data

• Summarizes the respondent

sample as a normal

distribution

• ICCs modeled to be parallel

with a slope of 1.7

(approximating the slope of

the cumulative normal ogive)

• Model reconsidered in the

event of data misfit in terms

of adding additional

parameters, such as

discrimination and guessing –

fit of the model to the data

• Local fit of the data to the

model, one parameter at a time

• Parameterizes each member of

the respondent sample

individually

• Item characteristic curves (ICCs)

modeled to be parallel with a

slope of 1 (the natural logistic

ogive)

• Item data violating models

assumptions (not supporting

parameter separability) in a

linear framework is examined for

possible deletion from the model

– data fit the model

Rasch 1PL

19


2PL Model: Item Discrimination (a) • The a parameter is the steepness of the curve at its

steepest point (slope of the line tangent to the ICC at b)o Its closest relative in classical test theory is the item total

correlation

• The steeper the curve, the more discriminating the item is, and the greater its item total correlation.

• As the a parameter decreases, the curve gets flatter until there is virtually no change in probability across the ability continuum.

• Items with very low a values provide little useful information for distinguishing among respondentso In classical test theory, these items would have low item total

correlations.

• The 2-parameter model allows both a and b parameters to vary to describe the items. This model is used where there is no suspected guessing.

20


Example: Item Discrimination

n

innini

n

innin

i

bpp

bpx

a2

1

))(1(

))((

1

Logit Distanceai= 2.0

Formula for ai is obtained from the Newton-Raphsonestimation equation:

Logit Distanceai = 6.38

n

innini

n

innin

i

bpp

bpx

a2

1

))(1(

))((

1

21


3PL/4PL Model: Guessing – Carelessness

• Individuals responding to certain measures, such as tests of knowledge may guess in selecting a response. o Example: standardized achievement tests where it is suggested

that if you do not know the correct response, guess

• In the 3PL IRT model, guessing is parameterized as a lower asymptote to the item's logistic ogive of the probability of a correct answer – guessability

• In the 4PL IRT model (4-PL), carelessness is parameterized as an upper asymptote – mistake-ability

• The literature suggests that when the lower or upper asymptote is .10 or greater, it is significant

• The lower and upper asymptote estimates are used to modify the item difficulty and person ability estimates.

• The Rasch measurement model handles guessability and mistake-ability as misfit, and does not make adjustments for item difficulties and person abilities

22


Example: Guessing – Carelessness (3PL/4PL)

Difference between Logit and Probit models: logistic approach yields flatter tails – normal/probit curves approach axes more quickly

Quantitatively Logit and Probit models yield similar results, however estimates are not directly comparable.

MPlus uses a Probit IRT model and provides a conversion factor that yields comparable Probit and Logit estimates.

23


Advantages of IRT Models

• Sample invariant

• Modeling and estimation of measurement

properties at the item level

• Estimation of measurement precision at the score

(response option) level

• Estimation of item contribution to the overall test

precision for score ranges

• Identification of individuals (groups) responding to

items/response option in inconsistent manner

• Calibration across measures – equating

• Foundation for computer adaptive testing (CAT)

24


IRT Models: Limitations

• Separate software is needed to conduct IRT analyses (e.g., Winsteps, Bilog, Bilog-MG, Multilog, Parscale, Conquest, Rumm, Testfact, etc.)o Each software package has unique functions – some are

idiosyncratic to specific software packages

o IRT software has varying ability in terms of its integration with other statistical programs (SPSS, SAS, etc.)

• Expertise is needed in selecting models and interpreting parameter estimates

• Large samples are needed for multi-parameter models (the more parameters the larger the sample needed)

• Less emphasis in assessment, measurement, statistical academic courses

25


Application of the Rasch Measurement Model


The Rasch Measurement Model

• The Rasch model, as opposed to 2- and 3-

parameter models, questions how well empirical

data (measure scores/responses) fit in terms of

the measurement model constraints.o The additional parameters in 2PL (item difficulty) and 3PL

(respondents guessing) models are used to explain variance

in the measurement model.

• The Rasch model assume items are

distinguishable in terms of item difficulty (θ).• The Rasch model provides “sample invariant”

(sample independent) item calibrations, item difficulties (θ), from easy to hard — no impairment to severe impairment.

27


The Rasch Model . . . Selecting a Few Good Items

The Rasch model asserts that . . .

• Variation in discrimination (2PL) is

symptomatic of multi-dimensionality and

item bias

• Guessing (3Pl) and carelessness (4PL) is an

unreliable person characteristic and a

measurement liability

o Guessing and carelessness are not item

characteristicsThe Rasch model eliminates items violating model assumptions

28


Rasch Measurement Model

Persons with no/little

need for intervention

Persons with high need

for intervention

Items about adversity,

severe problemsItems about day-to-day

issues, small problems

Moderate

Scale items represent a construct along a continuum from low to high, minimal to maximal, etc. Every scale item is calibrated along this continuum.

Low Measured Construct HighNeed Need

Rasch Ruler

29


Calibrating Items

Mild

1 54 6

Item Calibration

Easiest

Item

Hardest

Item

2 3

Person1

DeteriorationImprovement

Deterioration

2

• Person1 has no opportunity to demonstrate improvement. Scores can only indicate stability or deterioration.

• Person3 has no opportunity to demonstrate deterioration or minimal improvement.

Person3

No ItemsNo Items

Severe

Person2

Improvement

30

Examples: Data Sources

31


Data Source: Examples• Large Commercial Health Plan: Assessment of

Global Distress (Life Status Questionnaire 30 items)o Behavioral healthcare (N=258,393)

• 32% male

• Adults (18-24: 9%, 25-35: 26%, 36-45: 31%, 45-60: 30% )

• Diagnosis (Depression 42%, Adjustment 20%, Anxiety 9%, Bipolar 7%, Alcohol/Drugs 6%)

• Project Match: URICA – Stage of Changeo Funded by NIAAA, 6-year study ending in 8/2005

o ~ 1700 volunteers with alcohol abuse/dependence• N=1726 (952 outpatients, 744 after care; 1307 males)

32


URICA

• 32 item, self-report scale

• Respondents asked to think about entry into treatment or about problems they experience in their lives

• Four eight-item subscaleso Pre-contemplation

o Contemplation

o Action

o Maintenance

• Scoring: 5-point Likert scaleo Scale: 1 = strong disagreement

5 = strong agreement (3 = undecided)

o Total score: mean of C + A + M – PC = second order continuous Readiness to Change score

Questionnaires

Life Status Outcome Questionnaire

OQ 30

• 30 item, self-report scale

• Respondents asked to think back about how their felt over the past

• 30 items – no subscaleso Items ask about psychological,

emotional function, physical function, and substance abuse

• Scoring: 5-point Likert scale

o Scale: 0 = Never, 5 = almost always (3 = sometimes)

o Total score: sum across all 30 items – scale score: 0 - 120


Basic Measurement QuestionAre Items Sufficiently Co-located with

Respondents?

34

Person Mean

Item Mean

Baseline Person – Item Map (Global Distress Scale)

Mil

d D

istr

ess

G

reat

er D

istr

ess

1 SD

1 SD

Persons and items are well matched at

baseline. Items differentiate a significant

proportion of the sample .

No Items are available to differentiate

individuals who are mildly or severely

distress, except for alcohol/drug and

suicide focused items.

Items clustering together, although

added to the overall score, reflect little in

terms of increased or decreased

distress, as they assess the same level

of the construct.

MAP OF ITEMSS

<more>|<rare>

6 . +

|

|

|

. |

5 +

|

|

. |

|

4 . +

. |

|

. |

. |

3 . +

. |

. |

. |

. | 24 TROUBLE BECAUSE ALCOHOL

2 . +

. | 11 USE ALCOHOL/DRUGS TO GET GOING 20 PEOPLE CRITICIZE DRINKING

.# T|T

.## | 7 SUICIDAL

.### |

1 .#### +

.##### |S

.####### S| 23 TROUBLE GET ALONG FRIENDS

.########### | 12 WORTHLESS

.########### | 19 DIST. THOUGHTS 21 STOMACH 25 BAD THINGS HAPPEN 28 WRONG W/MIND

0 .############ +M 15 FREQ. ARGUE 17 HOPELESS 18 HAPPY 30 SATIS W/RELATIONS 9 SCHL SATIS.

.############ | 10 FEARFUL 2 NO INTEREST 5 SATISFIED/LIFE 8 FEEL WEAK

.############ M| 22 NOT WORKING/STUDYING 27 NOT DOING WELL WORK/SCHOOL

.######### | 14 LONELY 26 NERVOUS 29 FEEL BLUE

.########### |S 16 DIFF. CONCENTRATING 1 TROUBLE SLEEPING

-1 .######### + 13 CONCERN FAM. TROUBLE 4 BLAME SELF 6 IRRITATED

.###### | 3 STRESSED

.####### S|

.#### |T

.### |

-2 .## +

.## T|

.# |

. |

. |

-3 . +

. |

. |

. |

. |

-4 . +

. |

. |

|

. |

-5 +

|

. |

. |

|

-6 . +

<less>|<frequ>

EACH '#' IS 1728. 35

118 81 608 142 3 2 24,379

24,623

76,935

106,451

40,233

4,816

0

20,000

40,000

60,000

80,000

100,000

120,000

-6.00 -5.00 -4.00 -3.00 -2.00 -1.00 0.00 1.00 2.00 3.00 4.00 5.00 6.00

10

8

6

4

2

Item

sF

req

uen

cy

Logit

Measure Score

Item Gap41.3% respondents

ItemGap

Decreasing person ability Increasing person abilityDecreasing item difficulty Increasing item difficulty

Per

son

Fre

qu

ency

ItemGap

ItemCluster

Global Distress


INPUT: 1726 PERSONS 32 ITEMS MEASURED 160 CATS --------------------------------------------------------------------------------

PERSONS - MAP - ITEMS

<more>|<rare>

6 # +

|

|

|

. |

. |

5 +

|

|

. |

|

. |

4 +

# |

|

. |

. |

.# T|

3 +

.### |

.## |

.# |

.## S|

.##### |

2 .## +

.##### |

.##### |

##### |

.##### M|

.####### |

1 .######### + URICA28 URICA9

.######### |

.######### |T URICA16

.############ | URICA32 URICA18

.#### S|S URICA31

.#### | URICA29 URICA6 URICA22 URICA10 URICA1 URICA27

0 .# +M URICA13 URICA3 URICA23

. | URICA114 URICA19 URICA26 URICA30 URICA17 URICA2 URICA25 URICA12 URICA11

. |S URICA24 URICA5 URICA20 URICA15 URICA21 URICA7 URICA8

. T| URICA4

. |T

. |

-1 . +

<less>|<frequent>

EACH '#' IS 16.

Sample Mean

Item Difficulty Mean

No items to differentiate individuals in terms of higher degrees of readiness to change

Items clustering together, although added to the overall

score, reflect little in terms of increased or decreased

distress, as they assess the same level of the construct.

Person – Item Map (URICA – Stage of Change)


300

200

100

2

4

6

-2.00 0.00 2.00 4.00 6.00

56.3% of the sampleNo items to accurately locate

readiness to change status

Nu

mb

er o

f I

tem

s

S

amp

le F

req

uen

cy

URICA: Readiness to Change

38

Measurement Model Fit: Global Distress+-------------------------------------------------------------------------------------------------------------------------------------+

|ENTRY RAW MODEL| INFIT | OUTFIT |PTBIS|EXACT MATCH|ESTIM| P- | |

|NUMBER SCORE COUNT MEASURE S.E. |MNSQ ZSTD|MNSQ ZSTD|CORR.| OBS% EXP%|DISCR|VALUE| ITEMS |

|------------------------------------+----------+----------+-----+-----------+-----+-----+--------------------------------------------|

| 24 30095 258004 2.20 .00|1.29 9.9|2.21 9.9| .17| 89.7 90.9| .91| .12| 24 TROUBLE BECAUSE OF ALCOHOL |

| 11 40340 257941 1.84 .00|1.47 9.9|3.46 9.9| .19| 86.8 88.5| .87| .16| 11 USE ALCOHOL/DRUGSTO GET GOING |

| 20 65489 257906 1.82 .00|1.53 9.9|3.50 9.9| .15| 76.0 80.5| .77| .25| 20 PEOPLE CRITICIZE MY DRININKING |

| 7 158279 257603 1.47 .00|1.02 6.9| .99 -2.6| .50| 57.1 56.7| 1.00| .61| 7 SUICIDAL THOUGHTS |

| 23 306791 255346 .69 .00|1.04 9.9|1.02 6.7| .53| 49.3 48.3| .96| 1.20| 23 TROUBLE GETTING ALONG WITH FRIENDS |

| 12 378985 257605 .32 .00| .74 -9.9| .73 -9.9| .72| 49.5 41.8| 1.35| 1.47| 12 FEEL WORTHLESS |

| 19 378725 256349 .30 .00|1.06 9.9|1.08 9.9| .57| 41.8 41.5| .92| 1.48| 19 DISTURBING THOUGHTS |

| 28 394076 257615 .18 .00| .92 -9.9| .92 -9.9| .65| 42.3 39.5| 1.12| 1.53| 28 SOMETHING WRONG WITH MIND |

| 21 417669 258025 .15 .00|1.27 9.9|1.30 9.9| .45| 39.6 43.2| .63| 1.62| 21 UPSET STOMACH |

| 25 417668 257807 .12 .00| .90 -9.9| .90 -9.9| .64| 45.7 42.2| 1.14| 1.62| 25 FEEL SOMETHING BAD GOPING TO HAPPEN |

| 30 463694 257896 .06 .00|1.05 9.9|1.07 9.9| .53| 51.5 49.0| .95| 1.80| 30 SATISFIED WITH RELATIONSHIPS |

| 18 468437 255779 .01 .00| .84 -9.9| .84 -9.9| .65| 56.7 49.8| 1.17| 1.83| 18 HAPPY PERSON |

| 15 438912 257407 -.04 .00|1.38 9.9|1.41 9.9| .36| 39.9 45.9| .55| 1.70| 15 FREQUENT ARGUMENTS |

| 9 473076 257474 -.06 .00| .95 -9.9| .96 -9.9| .58| 52.9 48.3| 1.05| 1.84| 9 SCHOOL/WORK SATISFYING |

| 17 460962 257686 -.09 .00| .73 -9.9| .74 -9.9| .73| 51.7 43.1| 1.36| 1.79| 17 HOPELESS ABOUT FUTURE |

| 8 471292 255913 -.13 .00| .89 -9.9| .90 -9.9| .64| 49.2 45.0| 1.13| 1.84| 8 FEEL WEAK |

| 2 475865 257112 -.13 .00| .81 -9.9| .82 -9.9| .68| 51.9 45.7| 1.23| 1.85| 2 NO INTEREST |

| 10 479365 256375 -.16 .00| .98 -6.2| .99 -5.0| .60| 45.4 43.9| 1.02| 1.87| 10 FEARFUL |

| 5 510395 256460 -.24 .00| .93 -9.9| .95 -9.9| .60| 53.8 48.6| 1.07| 1.99| 5 SATISFIED WITH LIFE |

| 22 510135 255327 -.34 .00| .95 -9.9| .96 -9.9| .63| 44.5 41.0| 1.08| 2.00| 22 NOT WORKING/STUDYING AS WELL AS NEED TO |

| 27 514855 257073 -.37 .00| .84 -9.9| .86 -9.9| .67| 49.2 42.8| 1.21| 2.00| 27 NOT DOING WELL SCHOOL/WORK |

| 14 558250 257386 -.57 .00| .96 -9.9| .96 -9.9| .62| 45.2 43.2| 1.06| 2.17| 14 FEEL LONELY |

| 29 570205 256874 -.65 .00| .71 -9.9| .71 -9.9| .73| 55.0 46.2| 1.34| 2.22| 29 FEEL BLUE |

| 26 573998 257278 -.68 .00| .83 -9.9| .84 -9.9| .67| 50.2 44.9| 1.21| 2.23| 26 FEEL NERVOUS |

| 16 562628 257369 -.70 .00| .84 -9.9| .84 -9.9| .66| 51.3 46.4| 1.20| 2.19| 16 DIFFICULTY CONCENTRATING |

| 1 584301 257786 -.70 .00|1.26 9.9|1.29 9.9| .47| 40.3 42.5| .65| 2.27| 1 TROUBLE FALLING ASLEEP |

| 4 608845 256446 -.96 .00| .98 -7.4| .98 -6.4| .57| 48.8 47.1| 1.03| 2.37| 4 BLAME SELF |

| 6 609827 256324 -1.00 .00| .98 -6.7| .99 -4.7| .54| 54.6 52.0| 1.02| 2.38| 6 IRRITATED |

| 13 658911 257368 -1.05 .00|1.56 9.9|1.67 9.9| .34| 35.9 42.5| .24| 2.56| 13 CONCERN ABOUT FAMILY TROUBLES |

| 3 665764 256301 -1.28 .00| .97 -9.9| .96 -9.9| .58| 49.7 47.1| 1.05| 2.60| 3 STRESSED |

|------------------------------------+----------+----------+-----+-----------+-----+-----+--------------------------------------------|

| MEAN441594.5257061. .00 .00|1.02 -2.4|1.19 -2.7| | 51.8 49.6| | | |

| S.D.167214.3 791.4 .85 .00| .23 9.2| .68 8.9| | 12.1 12.9| | | |

+-------------------------------------------------------------------------------------------------------------------------------------+

Estimates identified as having misfit are reviewed for potential deletion form the measurement model. These estimates are indicative of multidimensionality, guessability and mistake-ability (carelessness) – the need for a multi-parameter model to explain variance from the measurement model. 39


-------------------------------------------------------------------------------------------------------------------------------------------------------------------

|ENTRY TOTAL MODEL| INFIT | OUTFIT |PTBSE|EXACT MATCH|ESTIM| P- | |

|NUMBER SCORE COUNT MEASURE S.E. |MNSQ ZSTD|MNSQ ZSTD|CORR.| OBS% EXP%|DISCR|VALUE| ITEM |

|------------------------------------+----------+----------+-----+-----------+-----+-----+------------------------------------------------------------------------|

| 1 7573 1714 .12 .03|1.39 6.3|3.71 9.9|A .19| 62.8 58.3| .67| 4.42| uriaq1R ‘don't have any problems need changing’ |

| 9 5752 1710 .94 .03|1.47 9.9|1.90 9.9|B .29| 32.5 38.2| .36| 3.36| uriaq9 ‘successful working problem unsure can keep up on own’ |

| 29 6812 1705 .25 .03|1.25 6.0|1.79 9.9|C .32| 50.8 50.3| .73| 4.00| Uriaq29R ‘have worries spend time thinking about them’ |

| 31 6964 1706 .28 .03|1.21 4.4|1.79 9.9|D .33| 58.0 54.6| .82| 4.08| Uriaq31R ‘would rather cope with faults than change them’ |

| 28 5600 1701 .95 .03|1.43 9.9|1.62 9.9|E .31| 33.9 39.3| .41| 3.29| Uriaq28 ‘frustrating might have recurrence - thought it was resolved’ |

| 16 6165 1709 .67 .03|1.21 6.3|1.53 9.9|F .41| 37.8 39.6| .71| 3.61| uriaq16 ‘not following through here prevent relapse’ |

| 23 7423 1714 -.07 .03|1.05 1.0|1.50 8.3|G .40| 62.6 57.2| .96| 4.33| uriaq23R ‘part of the problem but not really’ |

| 13 7532 1713 .01 .03|1.00 .1|1.47 7.0|H .43| 69.7 60.5| 1.01| 4.40| uriaq13R ‘I have faults nothing that I need to change’ |

| 32 6367 1706 .50 .03|1.17 4.6|1.45 9.9|I .37| 45.2 47.4| .77| 3.73| Uriaq32 ‘try and change and it comes back to haunt me’ |

| 18 6541 1710 .47 .03|1.08 2.1|1.31 7.3|J .44| 48.6 47.5| .91| 3.83| uriaq18 ‘thought I resolved problem still struggling’ |

| 11 7935 1712 -.24 .04| .89 -1.4|1.31 3.5|K .46| 76.2 68.6| 1.08| 4.63| uriaq11R ‘waste of time for me problem not with me’ |

| 5 7912 1713 -.33 .04|1.13 1.8|1.30 3.7|L .34| 74.3 67.1| 1.02| 4.62| uriaq5R ‘I'm not the problem one no sense to be here’ |

| 26 7396 1703 -.12 .03| .95 -1.0|1.21 3.8|M .47| 65.3 58.6| 1.03| 4.34| uriaq26R ‘psychology is boring just forget about problems’ |

| 3 7220 1714 -.04 .04|1.05 .8|1.18 3.4|N .39| 67.1 65.6| 1.00| 4.21| uriaq3 ‘doing something about problems bothering me’ |

| 2 7775 1714 -.22 .04|1.11 1.5|1.15 2.3|O .37| 72.0 63.7| 1.05| 4.54| uriaq2 ‘ready for some self-improvement’ |

| 6 6916 1713 .21 .03| .96 -1.0|1.11 2.4|P .50| 53.0 47.9| 1.06| 4.04| uriaq6 ‘worries me might slip back here to seek help’ |

| 12 7674 1713 -.23 .04| .83 -2.8|1.06 1.0|p .54| 71.6 61.8| 1.13| 4.48| uriaq12 ‘hoping this will help understand myself’ |

| 10 6841 1711 .13 .03| .95 -1.1|1.04 .8|o .48| 66.4 62.6| 1.04| 4.00| uriaq10 ‘problem is difficult working on it’ |

| 19 7240 1710 -.12 .03| .92 -1.7|1.03 .7|n .50| 64.1 59.8| 1.06| 4.23| uriaq19 ‘wish more ideas how to solve problem’ |

| 22 6902 1712 .15 .03| .89 -2.7| .99 -.1|m .55| 59.1 55.3| 1.10| 4.03| uriaq22 ‘need maintain the changes already made’ |

| 4 7915 1715 -.56 .05| .98 -.3| .97 -.5|l .42| 75.2 66.0| 1.13| 4.62| uriaq4 ‘worthwhile to work on my problem’ |

| 14 6904 1708 -.09 .03| .92 -2.1| .96 -.9|k .52| 57.7 54.8| 1.08| 4.04| uriaq14 ‘working hard to change’ |

| 8 7668 1715 -.42 .04| .95 -.9| .96 -.7|j .46| 72.3 63.2| 1.14| 4.47| uriaq8 ‘been thinking might want change myself’ |

| 7 7456 1713 -.39 .04| .92 -1.5| .92 -1.6|i .48| 71.2 65.0| 1.12| 4.35| uriaq7 ‘finally doing work on problem’ |

| 17 7066 1713 -.19 .04| .85 -2.9| .90 -2.0|h .54| 73.4 69.1| 1.10| 4.12| uriaq17 ‘not successful in changing at least working problem’ |

| 30 7030 1705 -.16 .04| .88 -2.6| .89 -2.5|g .53| 67.3 64.4| 1.10| 4.12| Uriaq30 ‘actively working on my problem’ |

| 25 7067 1707 -.23 .04| .89 -2.5| .88 -2.9|f .53| 66.0 64.0| 1.09| 4.14| uriaq25 ‘actually doing something about problem – changing’ |

| 27 7012 1704 .10 .03| .83 -4.3| .88 -2.8|e .59| 63.2 53.2| 1.20| 4.12| Uriaq27 ‘here to prevent relapse of problem’ |

| 24 7532 1708 -.27 .04| .79 -3.7| .85 -2.9|d .59| 72.0 63.4| 1.18| 4.41| uriaq24 ‘hope someone will have good advice’ |

| 21 7538 1710 -.39 .04| .79 -4.0| .77 -5.1|c .60| 71.5 64.1| 1.19| 4.41| uriaq21 ‘maybe place will be able to help’ |

| 15 7719 1710 -.37 .04| .75 -4.2| .74 -5.0|b .61| 74.6 63.2| 1.25| 4.51| uriaq15 ‘have problem I should work on it’ |

| 20 7539 1714 -.36 .04| .73 -5.5| .69 -6.9|a .66| 73.6 62.0| 1.25| 4.40| uriaq20 ‘started working problem would like help’ |

|------------------------------------+----------+----------+-----+-----------+-----+-----+------------------------------------------------------------------------|

| MEAN 7155.8 1710.2 .00 .03|1.01 .3|1.25 2.5| | 62.8 58.0| | | |

| S.D. 581.8 3.8 .37 .01| .19 4.0| .55 5.2| | 12.0 8.5| | | |

-------------------------------------------------------------------------------------------------------------------------------------------------------------------

Measurement Model Fit: URICA

40


Rasch ModelExamining Assumptions

DimensionalityLocal Independence

MonotonicityAdditivity


Measurement Model FitExamining Deviation From The Rasch Model

• The Rasch model specifies that item discrimination (item slope) is uniform across items o This supports additivity and construct stability

• Item discrimination is NOT parameterizedo The Rasch slope is the average discrimination of all the

items. It is not the mean of the individual slopes because discrimination parameters are non-linear

o Mathematically, the average slope is set at 1.0 when the Rasch model is formulated in logits, or 1.70 when it is formulated in probits (2-PL and 3-PL – 0.59 is the conversion from logits to probits

• Issue to considero Are the misfitting items distinct?

• Could they be considered a separate subscale?

o Does removal of these item appreciable improve the measurement model?

42


Item DiscriminationIs The One Parameter Model Sufficient?

Item differences (difficulty)/person differences (ability) contribute additively to the

probability (log-odds) of responding positively.

43


Challenges – Adding the 2nd ParameterInterpretation: Which Item Is More Difficult?

-5 -4 -3 -2 -1 0 1 2 3 4 5

Probability of scoring

higher on green item is

.9. At this level of

impairment the red

item is more difficult.

Probability of scoring

higher on red item is .6. At

this level of impairment.

Pro

bab

ilit

y o

f R

esp

on

se –

Hig

her

= M

ore

Im

pai

rmen

t

Mild Moderate Severe 44


2PL – Is More Better?

• Measurement development

o Parsimonious model

o Good data fit to the measurement model

o Objective is to minimize measurement variance –development of measures that can be generalized to

diverse population groups, conditions, etc.

• Performance on a developed measure

o Additional parameters may shed light on interpretation of

measurement variance.

o Example: items may function differently for individuals at different positions on the measured construct.

45

Questioning The Assumption Of Additivity, Interval Level Data and Uncorrelated error

46

Approaching Uniform ResponseGlobal Distress

Satisfied with relationships Happy

0 4

1 4

3

0 4

1 4

3

0 = Never 3 = Frequently

1 = Rarely 4 = Almost Always

2 = Sometimes

An individual has a 50% probability of responding with “0” or a “1” and so forth where the probability curve intersects (Rasch-Step Calibration). 47


Non-uniform ResponseGlobal Distress

Having trouble with alcohol Using alcohol/drugs to get going

0 4

1 4

3

0 4

1

32

7.11 logits 5.99 logits

.18 logits .42 logits

48

0 = Never 3 = Frequently

1 = Rarely 4 = Almost Always

2 = Sometimes


\Assumptions: Non-uniform And Non-interval Response Patterns

OBSERVED AVERAGE MEASURES BY AN OBSERVED CATEGORY -6 -4 -2 0 2 4 6

|---------+---------+---------+---------+---------+---------| ITEMS

| 012 3 4 | 20 PEOPLE CRITICIZE MY DRININKING

| 0 12 34 | 7 SUICIDAL THOUGHTS

| 0 12 3 4 | 13 CONCERN ABOUT FAMILY TROUBLES

| 0 1 2 3 4 | 3 STRESSED |---------+---------+---------+---------+---------+---------|

-6 -4 -2 0 2 4 6

Never(0)

Rarely(1)

Sometimes(2)

Frequently(3)

AlmostAlways

(4)

Assumptions:

• Response options are equal intervals – a score of “4” is twice as much as a score of “2”

• All respondents systematically interpret and use response options across items in the

same manner

49

Never (0)

Rarely (1)

Sometimes (2)

Frequently (3)

Almost Always (4)

Never (0)

Rarely (1)

Sometimes (2)

Frequently (3)

Almost Always (4)

Never (0)

Rarely (1)

Sometimes (2)

Frequently (3)

Almost Always (4)

Stress

-5 -4 -3 -2 -1 0 1 2 3 4 5

Rasch

Ru

ler –L

og

it Measu

re Sco

res

Stress

-5 -4 -3 -2 -1 0 1 2 3 4 5

Rasch

Ru

ler –L

og

it Measu

re Sco

res

Stress

-5 -4 -3 -2 -1 0 1 2 3 4 5

Rasch

Ru

ler –L

og

it Measu

re Sco

res

Stress Alcohol/Drugs Suicidality

50


1.20

1.00

0.80

0.60

0.40

0.20

0.00

-0.20

-0.30

-0.40

-0.60

-0.80

-1.00

-1.20

-1.40

-1.60

-1.80

-2.00

Maintenance

Action

Contemplation

Pre-contemplation• Scales have some overlap

• Estimates for the Contemplation

scale indicate respondents have an

easier time responding positively to

these items – little distinction

between the two scales

Sta

nd

ard

ized

Ite

m D

iffi

cult

y /

Per

son

Ab

ilit

yR

esis

tan

ce

C

om

mit

men

t to

Ch

ang

e

Questioning the Assumption of Additivity – URICAReadiness to Change

STAGES1. Pre-contemplation2. Contemplation3. Action4. Maintenance

51


Error – Assumed To Be UncorrelatedStandard Error as a Function of Ability

Measure Score (OQ-30 Total Scale)

θ

Sta

nd

ard

Err

or

SE = 0.25

θ = -2.34 to 2.47

SE = 1.0θ = -5.21 / θ = 5.04 X X

52

Interpreting Score Change

Do All Items Provide the Same Level Of

Information?

53


Total Scores – Equivalent MeaningDoes It Matter Where Change Occurs?

I feel stressed at work/school

I am concerned about family troubles

I feel irritated

I blame myself for things

Total Score: 16

I am annoyed by people who criticize drinking

I need a drink the next morning to get going

I have thoughts of ending my life

I have trouble at work because of drinking

drug use

Total Score: 16

Nev

er

Rar

ely

Som

etim

es

Fre

quen

tly

Alm

ost a

lway

s

4

4

4

4

4 1

33

3

0

2

3

1

04

Baseline Follow-up

Nev

er

Rar

ely

Som

etim

es

Fre

quen

tly

Alm

ost a

lway

s

2

6

6

54


Response Option Approaches

55

• Dominance model – Likert scalingo The more strongly an opinion or perspective is held the

higher the probability of endorsing a favorable response All items of less difficulty would also be endorsed favorably - additive

• Ideal point modelo Probability of endorsement is associated with the

respondent’s position on the latent trait (θ) • Probability decreases as the item location moves away (either below

or above) from his/her position on the latent trait

• Person with moderately strong opinion would likely not endorse (disagree with) items reflecting stronger and weaker positions

• Person can disagree for two reasons, the trait level (θ) is below item level, and conversely the trait level (θ) is above the item location

• Multi-dimensional pairwise preference (MUPP)o Forced-choice statements

• Statements representing trait levels (θ) that are nearly proximate –balanced item sets

• Increased measurement precision

• UnfoldingStrongly Disagree Mid-point Strongly Agree

─ + ─ +


Interpreting Score Changes

The estimated Reliable Change Index for the General Distress Measure is reported to be 10. An individual’s score must change by at least 10 points on the measure to be considered clinically significantly changed.

10 points has a differential effect depending where on the continuum the change occurs.

Raw

To

tal

Sco

re (

Hig

h =

Mo

re D

istr

ess)

10 point difference

10 point difference

.5 logit u

nits

4.1 logit units

Norm

al

M

ild

Mod.

S

ever

e

Measure Score (logits)

56


Multidimensional IRT Models - MIRT

57

• The complexity of assessment includes dimensions not directly measuredo Literacy level

o Conscientiousness

o Emotional stability

o Interpretation of construct meaning

• MIRTo between-item (assessment is multidimensional, (multiple latent

traits), however each item assess only one dimension)

o within item (multiple skills/competencies are needed to respond to an item) dimensionality

o Models test information as a response surface representing the relationship between correct response, person ability (θ) and discrimination (a)

• MIRT scores – complexity in interpretationo Multidimensional item difficulty and discrimination over a

range of θ vectors


MIRT Item Surface

58

-4

-2

0

2

4

-4

-2

0

2

4

0

0.2

0.4

0.6

0.8

1

1

Item Response Surface

2

Pro

babili

ty o

f C

orr

ect

Response

Contour Plot

of Item Response Surface0.1

0.1

0.1

0.2

0.2

0.2

0.3

0.3

0.3

0.4

0.4

0.4

0.5

0.5

0.5

0.6

0.6

0.6

0.7

0.7

0.7

0.8

0.8

0.8

0.9

0.9

0.9

1

2

-4 -2 0 2 4-4

-3

-2

-1

0

1

2

3

4

• Item surface increases more quickly across θ1

o Consequence of the a parameter (discrimination)


Reliable Change*Evaluating and Characterizing Effective Treatment Providers

Raw Score1 Raw Score2 Measure

Scores

Improvement 22% 24% 20%

Stability 58% 56% 72%

Deterioration 20% 20% 8%

2)(2 SESEDIFF DIFFSE

XXRCI 12

*Reliable Change (Jacobson & Truax, 1991)Note: Suggested reliable raw score change

for the General Distress Measure is 10 points

(Raw Score1).

Sample specific RCI is estimated at a change

of 12 points (Raw Score2).

Rasch a

a Multidimensional random coefficient multinomial logit model – alcohol/drug items treated as a separate subscale.


Assessing Outcomes

60

Ann M. Doucette, Ph.D.

Email: [email protected]

Examining How Rasch/IRT Can Increase Measurement Precision

Documents

Transcript of Examining How Rasch/IRT Can Increase Measurement Precision