Examining How Rasch/IRT Can Increase Measurement Precision
Ann M. Doucette, Ph.D.
The Evaluators’ Institute
American Evaluation Association Annual Meeting, Chicago, Illinois
12 November 2015
Learning Objectives
• Recognize the role of measurement precision in obtaining reliable and generalizable psychotherapy research outcomes
• Understand the basic differences between Classical Test Theory (CTT) and Item Response Theory (IRT)
• Discuss the advantages IRT offers over CTT in terms of measurement precision and testing the assumption often made in psychotherapy research of scaled interval-level data
• Describe the basic characteristics and assumptions (e.g., unidimensionality and local independence) of one-parameter logistic IRT models (Rasch) and multiple-parameter models (2PL, 3PL, and 4PL models)
• Recognize the pivotal role of measurement as a foundation of evaluation practice
Performance measurements: balanced scorecard model for, 106; converting data into information, 111–116fig; cost effectiveness of, 102; criteria for good, 107–108; current challenges to, 119–122; customer satisfaction, 103; disability adjudicators’ claims processing, 110t; EA (evaluability assessment), 90–91; efficiency of, 102–103; improving performance through, 120–121; logic modeling used in, 72–75fig; networked environments implementation of, 121–122; outcomes of, 101–102; outputs, 102; reliability of, 13–14b, 107, 558b, 559; RFE (rapid feedback evaluation), 95–96; service quality, 103; STEM instruction program, 75fig; strategy maps for, 106; tables of performance indicators, 86; validity of, 12–17, 107, 558b, 559. See also Data; Data analysis; Logic models; Program evaluation; Statistics
Measurement validity: description of, 12, 558b; internal, 14–15, 558b; process for achieving, 12–13b; statistical conclusion, 16–17
Measurement
A scientific method – a way of finding out
(more or less reliably) what level of an
attribute is possessed by the object or
objects under investigation . . . the
magnitude of a level of an attribute via its
numerical relation (ratio) to another level of
the same attribute (Michell, 2001, p. 212).
Measurement Requirements
Measures must be*
• Linear – ordered/monotonic and conjointly additive, so that arithmetic computations can be conducted
o Assumes unidimensionality
• Item calibrations must not be dependent on whose responses are used to estimate the measurement model (sample invariant)
• Person estimates/measures must not be dependent on which items are answered (test free)
• Missing data should not matter
* (Thurstone, 1925, 1926)
Basic Assumptions: Measurement
• Reliability: the consistency across observed scores for a particular attribute (depression, achievement, vocational interest, etc.). Respondents would obtain the same scale score placement on different occasions
• Validity: the accuracy of the measurement in terms of expected scores for a universe of observations defining a specific attribute for a given population and a given purpose
• Dimensionality: measure scores are interpreted as assessing a single trait (depression, anxiety, etc.)
o Multidimensionality confounds, making it impossible to distinguish the attribute assessed by the measure
Measurement Function
• Assigning a numerical value across a set of items that represents person placement on a defined attribute
[Figure: attribute continuum running from Mild through Moderate to Severe]
Mapping Items to Attribute Levels
Respondents and the items located at the same level:
• Students with exceptional mathematical skill: Calculus, Trigonometry
• Students with moderate mathematical skill: Algebra, Geometry
• Students with poor mathematical skill: Basic math (addition, subtraction, multiplication, long division)
If the student gets all items correct we can assume the student likely gets the algebra, geometry, addition, subtraction, etc., correct; and has exceptional math skill
If the student gets all items correct we cannot assume the student has exceptional math skill
Approaches To Measurement
• Classical Test Theory (CTT)
o Currently the approach most frequently used
• Latent Trait Models
o Used frequently in educational testing and healthcare
• Standardized tests and high stakes tests
• Increasing acceptance in behavioral healthcare
• NIH Roadmap for Medical Research Initiative – Patient-Reported Outcomes Measurement Information System (PROMIS)
Measurement Approaches
• Classical Test Theory (CTT)
o Currently the approach most frequently used
• Latent Trait Models
o Used frequently in educational testing and physical healthcare
• Increasing acceptance in behavioral healthcare
• NIH Roadmap for Medical Research Initiative – Patient-Reported Outcomes Measurement Information System (PROMIS)
[Diagram]
Classical Test Theory (True Score Theory)
Latent Trait Models (Strong True Score Theory)
o Rasch
o Item Response Theory: 1PL, 2PL, 3PL, 4PL
Classical Test Theory (CTT)
• Classical Test Theory is sometimes referred to as true score theory
• Measure score = true score + random error
o Measurement error is normally distributed
o Measurement error is uncorrelated with the true score
• Item characteristics are measure and sample dependent
• All items contribute equally to the overall scale score (unless separately weighted using a scoring protocol)
• Response options (e.g. Likert scales) are assumed and frequently analyzed as equal interval scales
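The true score model above can be illustrated with a small simulation. This is a sketch under stated assumptions: true scores and errors are drawn from normal distributions, and the means, standard deviations, sample size, seed, and the function name `simulate_ctt` are illustrative choices rather than values from the slides. It shows reliability read as the share of observed-score variance that is true-score variance.

```python
import random
import statistics


def simulate_ctt(n_persons=20000, true_mean=50.0, true_sd=10.0,
                 error_sd=5.0, seed=1):
    """Simulate observed = true + random error and estimate reliability.

    All numeric defaults are hypothetical values chosen for illustration.
    """
    rng = random.Random(seed)
    true_scores = [rng.gauss(true_mean, true_sd) for _ in range(n_persons)]
    # Error is drawn independently of the true score (a CTT assumption).
    observed = [t + rng.gauss(0.0, error_sd) for t in true_scores]
    # Reliability: true-score variance divided by observed-score variance.
    return statistics.pvariance(true_scores) / statistics.pvariance(observed)


reliability = simulate_ctt()
print(round(reliability, 2))
```

With these assumed values the theoretical reliability is 10² / (10² + 5²) = 0.80, and the simulated estimate should land close to it. Changing the spread of true scores in the sample changes the estimate, which is one way to see why CTT reliability is sample dependent.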
Classical Test Theory: Limitations
• Observed score is dependent on choice of items
• Item statistics – item difficulty and item discrimination – are sample dependent
o No modeling of data at item level
• Assumed common measurement error across all items
• Measure reliability is sample dependent
• Limits
o Ability to equate across tests
o Development of computer adaptive applications
Item Response Theory (IRT)
• Item Response Theory (IRT): modern latent trait theory/Strong True Score Theory – an alternative scaling methodology that enables the examination of the hierarchical structure, unidimensionality and additivity of items and measure scores
• Item–ability relationships: the probability of a correct response is a function of responder ability, an estimate of his/her level of the measured latent trait
o Higher ability means an increased probability of a favorable score
• Estimates are provided for each individual item and each response scale option
• Items provide differential weight/contribution to the overall measure scale score
o Answering more difficult items yields a higher score
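The item–ability relationship described above can be sketched for the simplest case, the dichotomous Rasch model, where the probability of a favorable response depends only on the distance between person ability (theta) and item difficulty (b). The function name `rasch_probability` and the example values are assumptions made for illustration.

```python
import math


def rasch_probability(theta, b):
    """Probability of a favorable response under the dichotomous Rasch model.

    theta: person ability in logits; b: item difficulty in logits.
    """
    return 1.0 / (1.0 + math.exp(-(theta - b)))


# For a fixed item, higher ability gives a higher probability.
for theta in (-2.0, 0.0, 2.0):
    print(theta, round(rasch_probability(theta, b=0.0), 3))
```

When ability equals difficulty the probability is exactly 0.5; for a fixed person, a harder item (larger b) lowers the probability.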
Assumptions: IRT Models
• Unidimensional: items measure the same underlying construct – a dominant first factor
o Unidimensionality is seldom perfectly achieved – it is a matter of the degree of the departure from unidimensionality
• Local Independence: response to an item is conditionally independent of responses to other items in a scale after controlling for the latent trait measured by the scale.
o Latent variable explains how items are related – e.g., a person with high depression would have a specific response profile across items
• Monotonicity: consistently increasing relationship between theta (θ – proficiency/ability) and item responses
o Expectation that respondent will use consistent response options for items of like difficulty
Item Response Theory Models
• Data modeling, statistical estimation of parameters that represent the location of persons and items on a latent continuum, the measurement construct of interest
• Statistical models specifying the probability of a discrete outcome (e.g., a correct response to an item, the selection of a response option on a Likert scale), in terms of person and item parameters.
o Person parameters (ability) indicate
• The level of impairment along the measured construct (e.g., depression, distress, etc.)
• The strength of an individual’s attitude
• Competence in an academic area (e.g., mathematics, etc.)
o Item parameters refer to
• Difficulty
• Discrimination (slope or correlation)
• Guessing (lower asymptote)
• Carelessness (upper asymptote)
Item Response Theory Models
• The Rasch model/1PL IRT model assumes that item discrimination parameters are equal, implying that two items of the same difficulty (b) on a test are equally precise at measuring ability
• 2PL models allow discrimination parameters to vary – item difficulty (b) and discrimination (a) are both modeled
o Does the item discriminate across individuals with the same ability?
• 3PL models estimate the contribution of guessing (c). Guessing is of questionable importance in behavioral healthcare, although it is being used in the area of personality research
• 4PL models estimate the contribution of carelessness
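The four models listed above nest inside one another, which a single function can make concrete. This is a sketch in the logistic metric: the name `irt_probability` and the use of `d` for the upper asymptote are assumptions (the slides name only a, b, and c), and the example parameter values are hypothetical.

```python
import math


def irt_probability(theta, b, a=1.0, c=0.0, d=1.0):
    """Probability of a positive response under a 4PL logistic model.

    b: difficulty, a: discrimination, c: lower asymptote (guessing),
    d: upper asymptote (carelessness; the symbol d is an assumption).
    Defaults reduce the model step by step:
      a=1, c=0, d=1 -> Rasch/1PL; free a -> 2PL; free c -> 3PL; free d -> 4PL.
    """
    logistic = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return c + (d - c) * logistic


print(irt_probability(0.0, b=0.0))                       # Rasch/1PL
print(irt_probability(0.0, b=0.0, a=2.0))                # 2PL
print(irt_probability(-6.0, b=0.0, a=2.0, c=0.2))        # 3PL, near the floor
print(irt_probability(6.0, b=0.0, a=2.0, c=0.2, d=0.9))  # 4PL, near the ceiling
```

The asymptotes bound the curve: with c = 0.2 even very low-ability respondents keep roughly a 20% chance of a positive response, and with d = 0.9 even very high-ability respondents top out near 90%.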
IRT Models: Response Scales
• Dichotomous (yes-no)
• Polytomous (Likert scales, rating scales, etc.)
o Partial Credit Model (PCM): thresholds for each item response are independently estimated – ordinal scale properties are tested
• Example: multiple choice items where some responses are more correct than others
• Additional parameters are estimated, increasing model fit
o Rating Scale Model (RSM): assumes all items share the same rating scale (e.g., Likert scaling)
Commonly used in evaluations
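For polytomous items, the Partial Credit Model turns a person measure and a set of item thresholds into a probability for each response category. The sketch below assumes the standard Rasch-family formulation; the function name `pcm_category_probabilities` and the threshold values are hypothetical.

```python
import math


def pcm_category_probabilities(theta, thresholds):
    """Category probabilities for one item under the Partial Credit Model.

    thresholds[k - 1] is the step difficulty between category k - 1 and k,
    so an item with m thresholds has m + 1 response categories.
    """
    # Log-numerators are cumulative sums of (theta - step); category 0 is 0.
    log_numerators = [0.0]
    for step in thresholds:
        log_numerators.append(log_numerators[-1] + (theta - step))
    # Subtract the maximum before exponentiating for numerical stability.
    largest = max(log_numerators)
    weights = [math.exp(value - largest) for value in log_numerators]
    total = sum(weights)
    return [w / total for w in weights]


item_thresholds = [-1.0, 0.0, 1.5]  # hypothetical step difficulties (logits)
probabilities = pcm_category_probabilities(-1.0, item_thresholds)
print([round(p, 3) for p in probabilities])
```

At a person measure equal to a threshold, the two adjacent categories are equally likely. Under the Rating Scale Model the same function applies, with every item's thresholds built as its own location plus one shared set of step values.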
Rasch versus 1PL
Rasch
• Local fit of the data to the model, one parameter at a time
• Parameterizes each member of the respondent sample individually
• Item characteristic curves (ICCs) modeled to be parallel with a slope of 1 (the natural logistic ogive)
• Item data violating model assumptions (not supporting parameter separability) in a linear framework are examined for possible deletion from the model – data fit the model
1PL
• Global fit of the model to the data
• Summarizes the respondent sample as a normal distribution
• ICCs modeled to be parallel with a slope of 1.7 (approximating the slope of the cumulative normal ogive)
• Model reconsidered in the event of data misfit in terms of adding additional parameters, such as discrimination and guessing – fit of the model to the data
2PL Model: Item Discrimination (a)
• The a parameter is the steepness of the curve at its steepest point (slope of the line tangent to the ICC at b)
o Its closest relative in classical test theory is the item total correlation
• The steeper the curve, the more discriminating the item is, and the greater its item total correlation.
• As the a parameter decreases, the curve gets flatter until there is virtually no change in probability across the ability continuum.
• Items with very low a values provide little useful information for distinguishing among respondents
o In classical test theory, these items would have low item total correlations.
• The 2-parameter model allows both a and b parameters to vary to describe the items. This model is used where there is no suspected guessing.
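A short worked derivation may help connect the a parameter to the slope of the curve. It assumes the 2PL model written in the logistic metric (no 1.7 scaling constant), with $P_i(\theta)$ as assumed notation for the probability of a positive response to item $i$.

```latex
\[
P_i(\theta) = \frac{1}{1 + e^{-a_i(\theta - b_i)}}, \qquad
\frac{dP_i}{d\theta} = a_i \, P_i(\theta)\,\bigl(1 - P_i(\theta)\bigr), \qquad
\left.\frac{dP_i}{d\theta}\right|_{\theta = b_i} = \frac{a_i}{4}
\]
```

The slope is largest where $P_i = 0.5$, that is at $\theta = b_i$, and in this metric the tangent slope there is proportional to $a_i$ (specifically $a_i/4$), so doubling a doubles the steepness of the curve at its steepest point.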
Example: Item Discrimination
Formula for a_i is obtained from the Newton-Raphson estimation equation:
\[
a_i = 1 + \frac{\sum_{n} (x_{ni} - p_{ni})(\theta_n - b_i)}{\sum_{n} p_{ni}\,(1 - p_{ni})\,(\theta_n - b_i)^2}
\]
[Figure: two item characteristic curves plotted against logit distance, one with a_i = 2.0 and one with a_i = 6.38]
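The Newton-Raphson discrimination estimate can be sketched in code for dichotomous responses. This reads the slide's formula as a_i = 1 + Σ(x − p)(θ − b) / Σ p(1 − p)(θ − b)², with p taken from the Rasch model; that reading of the symbols, the function name `estimate_discrimination`, and the example data are assumptions for illustration only.

```python
import math


def estimate_discrimination(responses, abilities, difficulty):
    """One-step discrimination estimate for a single dichotomous item.

    responses: 0/1 scores, abilities: person measures (logits),
    difficulty: item calibration (logits). Rasch expectations are used for p.
    """
    numerator = 0.0
    denominator = 0.0
    for x, theta in zip(responses, abilities):
        distance = theta - difficulty
        p = 1.0 / (1.0 + math.exp(-distance))
        numerator += (x - p) * distance
        denominator += p * (1.0 - p) * distance ** 2
    # Undefined when every person sits exactly at the item difficulty.
    return 1.0 + numerator / denominator


abilities = [-2.0, -1.0, 1.0, 2.0]       # hypothetical person measures
guttman_like = [0, 0, 1, 1]              # sharper than the model expects
a_hat = estimate_discrimination(guttman_like, abilities, difficulty=0.0)
print(round(a_hat, 2))
```

When the observed responses match the Rasch expectations, the numerator is zero and the estimate is 1.0; responses that split more sharply around the item difficulty than the model predicts push the estimate above 1.0.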
3PL/4PL Model: Guessing – Carelessness
• Individuals responding to certain measures, such as tests of knowledge, may guess in selecting a response.
o Example: standardized achievement tests where it is suggested that if you do not know the correct response, guess
• In the 3PL IRT model, guessing is parameterized as a lower asymptote to the item's logistic ogive of the probability of a correct answer – guessability
• In the 4PL IRT model, carelessness is parameterized as an upper asymptote – mistake-ability
• The literature suggests that when the lower or upper asymptote is .10 or greater, it is significant
• The lower and upper asymptote estimates are used to modify the item difficulty and person ability estimates.
• The Rasch measurement model handles guessability and mistake-ability as misfit, and does not make adjustments for item difficulties and person abilities
Example: Guessing – Carelessness (3PL/4PL)
Difference between logit and probit models: the logistic approach yields flatter tails, whereas normal/probit curves approach the axes more quickly.
Quantitatively, logit and probit models yield similar results; however, the estimates are not directly comparable.
Mplus uses a probit IRT model and provides a conversion factor that yields comparable probit and logit estimates.
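The relationship between the two curves can be checked numerically. The sketch below compares the cumulative normal (probit) curve with the logistic curve, first unscaled and then with the conventional 1.7 scaling constant; the grid, the helper names, and the choice to compare on a fixed range are assumptions for illustration, and this is not a description of any particular software's conversion routine.

```python
import math


def normal_cdf(z):
    """Cumulative standard normal (the probit curve)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def logistic_cdf(z, scale=1.0):
    """Logistic curve, optionally rescaled on the horizontal axis."""
    return 1.0 / (1.0 + math.exp(-scale * z))


grid = [i / 100.0 for i in range(-600, 601)]
max_gap_unscaled = max(abs(logistic_cdf(z) - normal_cdf(z)) for z in grid)
max_gap_scaled = max(abs(logistic_cdf(z, scale=1.7) - normal_cdf(z)) for z in grid)
print(round(max_gap_unscaled, 3), round(max_gap_scaled, 4))
```

Without rescaling the curves differ by more than a tenth of a probability unit at their widest point, which is why raw logit and probit estimates are not directly comparable. With the 1.7 constant the two curves stay within about one hundredth of each other, and the reciprocal, 1/1.7 ≈ 0.59, converts in the other direction.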
Advantages of IRT Models
• Sample invariant
• Modeling and estimation of measurement properties at the item level
• Estimation of measurement precision at the score (response option) level
• Estimation of item contribution to the overall test precision for score ranges
• Identification of individuals (groups) responding to items/response options in an inconsistent manner
• Calibration across measures – equating
• Foundation for computer adaptive testing (CAT)
IRT Models: Limitations
• Separate software is needed to conduct IRT analyses (e.g., Winsteps, Bilog, Bilog-MG, Multilog, Parscale, Conquest, Rumm, Testfact, etc.)
o Each software package has unique functions – some are idiosyncratic to specific software packages
o IRT software has varying ability in terms of its integration with other statistical programs (SPSS, SAS, etc.)
• Expertise is needed in selecting models and interpreting parameter estimates
• Large samples are needed for multi-parameter models (the more parameters, the larger the sample needed)
• Less emphasis in assessment, measurement, and statistical academic courses
Application of the Rasch Measurement Model
The Rasch Measurement Model
• The Rasch model, as opposed to 2- and 3-parameter models, questions how well empirical data (measure scores/responses) fit in terms of the measurement model constraints.
o The additional parameters in 2PL (item discrimination) and 3PL (respondent guessing) models are used to explain variance in the measurement model.
• The Rasch model assumes items are distinguishable in terms of item difficulty (b).
• The Rasch model provides “sample invariant” (sample independent) item calibrations, item difficulties (b), from easy to hard (no impairment to severe impairment).
The Rasch Model . . . Selecting a Few Good Items
The Rasch model asserts that . . .
• Variation in discrimination (2PL) is symptomatic of multi-dimensionality and item bias
• Guessing (3PL) and carelessness (4PL) are unreliable person characteristics and a measurement liability
o Guessing and carelessness are not item characteristics
The Rasch model eliminates items violating model assumptions
Rasch Measurement Model
[Figure: the “Rasch Ruler”. The measured construct runs from low need to high need. Persons with no/little need for intervention and items about day-to-day issues and small problems sit toward the low end, persons with high need for intervention and items about adversity and severe problems sit toward the high end, and moderate levels fall in between.]
Scale items represent a construct along a continuum from low to high, minimal to maximal, etc. Every scale item is calibrated along this continuum.
Calibrating Items
[Figure: item calibration line running from the easiest item (Mild) to the hardest item (Severe), with items 1–6 and Person1, Person2, and Person3 located along it. Arrows mark each person's room for improvement or deterioration; the regions beyond the outermost items are labeled “No Items”.]
• Person1 has no opportunity to demonstrate improvement. Scores can only indicate stability or deterioration.
• Person3 has no opportunity to demonstrate deterioration or minimal improvement.
Examples: Data Sources
Data Source: Examples
• Large Commercial Health Plan: Assessment of Global Distress (Life Status Questionnaire, 30 items)
o Behavioral healthcare (N=258,393)
• 32% male
• Adults (18-24: 9%, 25-35: 26%, 36-45: 31%, 45-60: 30%)
• Diagnosis (Depression 42%, Adjustment 20%, Anxiety 9%, Bipolar 7%, Alcohol/Drugs 6%)
• Project MATCH: URICA – Stage of Change
o Funded by NIAAA, 6-year study ending in 8/2005
o ~ 1700 volunteers with alcohol abuse/dependence
• N=1726 (952 outpatients, 774 aftercare; 1307 males)
Questionnaires
URICA
• 32 item, self-report scale
• Respondents asked to think about entry into treatment or about problems they experience in their lives
• Four eight-item subscales
o Pre-contemplation
o Contemplation
o Action
o Maintenance
• Scoring: 5-point Likert scale
o Scale: 1 = strong disagreement, 5 = strong agreement (3 = undecided)
o Total score: mean of C + A + M – PC = second order continuous Readiness to Change score
Life Status Outcome Questionnaire (OQ 30)
• 30 item, self-report scale
• Respondents asked to think back about how they felt over the past
• 30 items – no subscales
o Items ask about psychological, emotional function, physical function, and substance abuse
• Scoring: 5-point Likert scale
o Scale: 0 = never, 4 = almost always (2 = sometimes)
o Total score: sum across all 30 items – scale score: 0 - 120
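The two raw scoring rules described for the URICA and the OQ 30 can be sketched as follows. This reads the URICA total as the Contemplation, Action, and Maintenance subscale means added together minus the Pre-contemplation mean, which is an interpretation of the slide's shorthand; the function names are hypothetical, and the item-to-subscale assignment is not given in the slides, so subscale responses are passed in already grouped.

```python
from statistics import mean


def urica_readiness(precontemplation, contemplation, action, maintenance):
    """Readiness to Change score: mean(C) + mean(A) + mean(M) - mean(PC).

    Each argument is the list of eight 1-5 responses for that subscale.
    """
    for subscale in (precontemplation, contemplation, action, maintenance):
        if len(subscale) != 8 or any(not 1 <= r <= 5 for r in subscale):
            raise ValueError("each subscale needs eight responses scored 1-5")
    return (mean(contemplation) + mean(action) + mean(maintenance)
            - mean(precontemplation))


def oq30_total(item_scores):
    """Total distress score: sum of 30 items, giving a 0-120 range.

    A 0-4 range per item is implied by the 0-120 total over 30 items.
    """
    if len(item_scores) != 30 or any(not 0 <= s <= 4 for s in item_scores):
        raise ValueError("expected 30 item scores, each 0-4")
    return sum(item_scores)


print(urica_readiness([1] * 8, [5] * 8, [5] * 8, [5] * 8))
print(oq30_total([4] * 30))
```

Both rules treat every response option as an equal step and every item as an equal contributor, which is the classical scoring assumption that the Rasch analyses in the following slides put to the test.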
Basic Measurement Question: Are Items Sufficiently Co-located with Respondents?
Baseline Person – Item Map (Global Distress Scale)
[Figure annotations: Person Mean; Item Mean; 1 SD above and below; vertical axis from Mild Distress to Greater Distress]
• Persons and items are well matched at baseline. Items differentiate a significant proportion of the sample.
• No items are available to differentiate individuals who are mildly or severely distressed, except for alcohol/drug and suicide focused items.
• Items clustering together, although added to the overall score, reflect little in terms of increased or decreased distress, as they assess the same level of the construct.
MAP OF ITEMSS
<more>|<rare>
6 . +
|
|
|
. |
5 +
|
|
. |
|
4 . +
. |
|
. |
. |
3 . +
. |
. |
. |
. | 24 TROUBLE BECAUSE ALCOHOL
2 . +
. | 11 USE ALCOHOL/DRUGS TO GET GOING 20 PEOPLE CRITICIZE DRINKING
.# T|T
.## | 7 SUICIDAL
.### |
1 .#### +
.##### |S
.####### S| 23 TROUBLE GET ALONG FRIENDS
.########### | 12 WORTHLESS
.########### | 19 DIST. THOUGHTS 21 STOMACH 25 BAD THINGS HAPPEN 28 WRONG W/MIND
0 .############ +M 15 FREQ. ARGUE 17 HOPELESS 18 HAPPY 30 SATIS W/RELATIONS 9 SCHL SATIS.
.############ | 10 FEARFUL 2 NO INTEREST 5 SATISFIED/LIFE 8 FEEL WEAK
.############ M| 22 NOT WORKING/STUDYING 27 NOT DOING WELL WORK/SCHOOL
.######### | 14 LONELY 26 NERVOUS 29 FEEL BLUE
.########### |S 16 DIFF. CONCENTRATING 1 TROUBLE SLEEPING
-1 .######### + 13 CONCERN FAM. TROUBLE 4 BLAME SELF 6 IRRITATED
.###### | 3 STRESSED
.####### S|
.#### |T
.### |
-2 .## +
.## T|
.# |
. |
. |
-3 . +
. |
. |
. |
. |
-4 . +
. |
. |
|
. |
-5 +
|
. |
. |
|
-6 . +
<less>|<frequ>
EACH '#' IS 1728.
[Figure: Global Distress. Person frequency and item frequency plotted against logit measure score (-6.00 to 6.00); person ability and item difficulty both increase from left to right. Annotations mark an item cluster and several item gaps, one of which covers 41.3% of respondents.]
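One simple way to quantify how well items are co-located with respondents is to count the persons who have no item calibrated near their own measure. This is a sketch under stated assumptions: the 0.5-logit cutoff, the function name `share_without_nearby_item`, and the example values are hypothetical, and this is not presented as the calculation behind the 41.3% figure reported for the Global Distress scale.

```python
def share_without_nearby_item(person_measures, item_calibrations,
                              max_distance=0.5):
    """Proportion of persons with no item within max_distance logits."""
    uncovered = 0
    for person in person_measures:
        nearest = min(abs(person - item) for item in item_calibrations)
        if nearest > max_distance:
            uncovered += 1
    return uncovered / len(person_measures)


persons = [-3.0, -0.2, 0.1, 0.4, 3.0]   # hypothetical person measures
items = [-0.5, 0.0, 0.5]                 # hypothetical item calibrations
print(share_without_nearby_item(persons, items))
```

In this toy example the two persons at the extremes have no nearby item, so their measures rest on items that are all too easy or all too hard for them, which is where measurement precision is lowest.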
INPUT: 1726 PERSONS 32 ITEMS MEASURED 160 CATS --------------------------------------------------------------------------------
PERSONS - MAP - ITEMS
<more>|<rare>
6 # +
|
|
|
. |
. |
5 +
|
|
. |
|
. |
4 +
# |
|
. |
. |
.# T|
3 +
.### |
.## |
.# |
.## S|
.##### |
2 .## +
.##### |
.##### |
##### |
.##### M|
.####### |
1 .######### + URICA28 URICA9
.######### |
.######### |T URICA16
.############ | URICA32 URICA18
.#### S|S URICA31
.#### | URICA29 URICA6 URICA22 URICA10 URICA1 URICA27
0 .# +M URICA13 URICA3 URICA23
. | URICA14 URICA19 URICA26 URICA30 URICA17 URICA2 URICA25 URICA12 URICA11
. |S URICA24 URICA5 URICA20 URICA15 URICA21 URICA7 URICA8
. T| URICA4
. |T
. |
-1 . +
<less>|<frequent>
EACH '#' IS 16.
Person – Item Map (URICA – Stage of Change)
[Figure annotations: Sample Mean; Item Difficulty Mean]
• No items to differentiate individuals in terms of higher degrees of readiness to change
• Items clustering together, although added to the overall score, reflect little in terms of increased or decreased distress, as they assess the same level of the construct.
URICA: Readiness to Change
[Figure: sample frequency and number of items plotted against logit measure (-2.00 to 6.00)]
56.3% of the sample: no items to accurately locate readiness to change status
Measurement Model Fit: Global Distress
+-------------------------------------------------------------------------------------------------------------------------------------+
|ENTRY RAW MODEL| INFIT | OUTFIT |PTBIS|EXACT MATCH|ESTIM| P- | |
|NUMBER SCORE COUNT MEASURE S.E. |MNSQ ZSTD|MNSQ ZSTD|CORR.| OBS% EXP%|DISCR|VALUE| ITEMS |
|------------------------------------+----------+----------+-----+-----------+-----+-----+--------------------------------------------|
| 24 30095 258004 2.20 .00|1.29 9.9|2.21 9.9| .17| 89.7 90.9| .91| .12| 24 TROUBLE BECAUSE OF ALCOHOL |
| 11 40340 257941 1.84 .00|1.47 9.9|3.46 9.9| .19| 86.8 88.5| .87| .16| 11 USE ALCOHOL/DRUGS TO GET GOING |
| 20 65489 257906 1.82 .00|1.53 9.9|3.50 9.9| .15| 76.0 80.5| .77| .25| 20 PEOPLE CRITICIZE MY DRINKING |
| 7 158279 257603 1.47 .00|1.02 6.9| .99 -2.6| .50| 57.1 56.7| 1.00| .61| 7 SUICIDAL THOUGHTS |
| 23 306791 255346 .69 .00|1.04 9.9|1.02 6.7| .53| 49.3 48.3| .96| 1.20| 23 TROUBLE GETTING ALONG WITH FRIENDS |
| 12 378985 257605 .32 .00| .74 -9.9| .73 -9.9| .72| 49.5 41.8| 1.35| 1.47| 12 FEEL WORTHLESS |
| 19 378725 256349 .30 .00|1.06 9.9|1.08 9.9| .57| 41.8 41.5| .92| 1.48| 19 DISTURBING THOUGHTS |
| 28 394076 257615 .18 .00| .92 -9.9| .92 -9.9| .65| 42.3 39.5| 1.12| 1.53| 28 SOMETHING WRONG WITH MIND |
| 21 417669 258025 .15 .00|1.27 9.9|1.30 9.9| .45| 39.6 43.2| .63| 1.62| 21 UPSET STOMACH |
| 25 417668 257807 .12 .00| .90 -9.9| .90 -9.9| .64| 45.7 42.2| 1.14| 1.62| 25 FEEL SOMETHING BAD GOING TO HAPPEN |
| 30 463694 257896 .06 .00|1.05 9.9|1.07 9.9| .53| 51.5 49.0| .95| 1.80| 30 SATISFIED WITH RELATIONSHIPS |
| 18 468437 255779 .01 .00| .84 -9.9| .84 -9.9| .65| 56.7 49.8| 1.17| 1.83| 18 HAPPY PERSON |
| 15 438912 257407 -.04 .00|1.38 9.9|1.41 9.9| .36| 39.9 45.9| .55| 1.70| 15 FREQUENT ARGUMENTS |
| 9 473076 257474 -.06 .00| .95 -9.9| .96 -9.9| .58| 52.9 48.3| 1.05| 1.84| 9 SCHOOL/WORK SATISFYING |
| 17 460962 257686 -.09 .00| .73 -9.9| .74 -9.9| .73| 51.7 43.1| 1.36| 1.79| 17 HOPELESS ABOUT FUTURE |
| 8 471292 255913 -.13 .00| .89 -9.9| .90 -9.9| .64| 49.2 45.0| 1.13| 1.84| 8 FEEL WEAK |
| 2 475865 257112 -.13 .00| .81 -9.9| .82 -9.9| .68| 51.9 45.7| 1.23| 1.85| 2 NO INTEREST |
| 10 479365 256375 -.16 .00| .98 -6.2| .99 -5.0| .60| 45.4 43.9| 1.02| 1.87| 10 FEARFUL |
| 5 510395 256460 -.24 .00| .93 -9.9| .95 -9.9| .60| 53.8 48.6| 1.07| 1.99| 5 SATISFIED WITH LIFE |
| 22 510135 255327 -.34 .00| .95 -9.9| .96 -9.9| .63| 44.5 41.0| 1.08| 2.00| 22 NOT WORKING/STUDYING AS WELL AS NEED TO |
| 27 514855 257073 -.37 .00| .84 -9.9| .86 -9.9| .67| 49.2 42.8| 1.21| 2.00| 27 NOT DOING WELL SCHOOL/WORK |
| 14 558250 257386 -.57 .00| .96 -9.9| .96 -9.9| .62| 45.2 43.2| 1.06| 2.17| 14 FEEL LONELY |
| 29 570205 256874 -.65 .00| .71 -9.9| .71 -9.9| .73| 55.0 46.2| 1.34| 2.22| 29 FEEL BLUE |
| 26 573998 257278 -.68 .00| .83 -9.9| .84 -9.9| .67| 50.2 44.9| 1.21| 2.23| 26 FEEL NERVOUS |
| 16 562628 257369 -.70 .00| .84 -9.9| .84 -9.9| .66| 51.3 46.4| 1.20| 2.19| 16 DIFFICULTY CONCENTRATING |
| 1 584301 257786 -.70 .00|1.26 9.9|1.29 9.9| .47| 40.3 42.5| .65| 2.27| 1 TROUBLE FALLING ASLEEP |
| 4 608845 256446 -.96 .00| .98 -7.4| .98 -6.4| .57| 48.8 47.1| 1.03| 2.37| 4 BLAME SELF |
| 6 609827 256324 -1.00 .00| .98 -6.7| .99 -4.7| .54| 54.6 52.0| 1.02| 2.38| 6 IRRITATED |
| 13 658911 257368 -1.05 .00|1.56 9.9|1.67 9.9| .34| 35.9 42.5| .24| 2.56| 13 CONCERN ABOUT FAMILY TROUBLES |
| 3 665764 256301 -1.28 .00| .97 -9.9| .96 -9.9| .58| 49.7 47.1| 1.05| 2.60| 3 STRESSED |
|------------------------------------+----------+----------+-----+-----------+-----+-----+--------------------------------------------|
| MEAN441594.5257061. .00 .00|1.02 -2.4|1.19 -2.7| | 51.8 49.6| | | |
| S.D.167214.3 791.4 .85 .00| .23 9.2| .68 8.9| | 12.1 12.9| | | |
+-------------------------------------------------------------------------------------------------------------------------------------+
Estimates identified as having misfit are reviewed for potential deletion from the measurement model. These estimates are indicative of multidimensionality, guessability and mistake-ability (carelessness) – the need for a multi-parameter model to explain variance from the measurement model.
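The INFIT and OUTFIT mean-square columns in the table can be illustrated for the simpler dichotomous case. This is a sketch, not the software's routine: the data in the table are polytomous, and the function name `item_fit` and the example responses are hypothetical. Outfit is the plain average of squared standardized residuals; infit weights the squared residuals by their model variance, so it is less sensitive to off-target outliers.

```python
import math


def item_fit(responses, abilities, difficulty):
    """Return (infit, outfit) mean-squares for one dichotomous Rasch item."""
    squared_residuals = []
    variances = []
    for x, theta in zip(responses, abilities):
        p = 1.0 / (1.0 + math.exp(-(theta - difficulty)))
        squared_residuals.append((x - p) ** 2)
        variances.append(p * (1.0 - p))
    # Outfit: unweighted mean of squared standardized residuals.
    outfit = sum(r / v for r, v in zip(squared_residuals, variances)) / len(variances)
    # Infit: information-weighted mean-square.
    infit = sum(squared_residuals) / sum(variances)
    return infit, outfit


abilities = [-1.0, 0.0, 1.0, 3.0]        # hypothetical person measures
expected_pattern = [0, 1, 1, 1]
surprising_pattern = [0, 1, 1, 0]        # the most able person fails the item
print(item_fit(expected_pattern, abilities, difficulty=0.0))
print(item_fit(surprising_pattern, abilities, difficulty=0.0))
```

Values near 1.0 indicate responses about as variable as the model expects. The single unexpected failure by the most able person inflates outfit far more than infit, which mirrors the alcohol and drug items in the table, where OUTFIT is well above INFIT.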
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
|ENTRY TOTAL MODEL| INFIT | OUTFIT |PTBSE|EXACT MATCH|ESTIM| P- | |
|NUMBER SCORE COUNT MEASURE S.E. |MNSQ ZSTD|MNSQ ZSTD|CORR.| OBS% EXP%|DISCR|VALUE| ITEM |
|------------------------------------+----------+----------+-----+-----------+-----+-----+------------------------------------------------------------------------|
| 1 7573 1714 .12 .03|1.39 6.3|3.71 9.9|A .19| 62.8 58.3| .67| 4.42| uriaq1R ‘don't have any problems need changing’ |
| 9 5752 1710 .94 .03|1.47 9.9|1.90 9.9|B .29| 32.5 38.2| .36| 3.36| uriaq9 ‘successful working problem unsure can keep up on own’ |
| 29 6812 1705 .25 .03|1.25 6.0|1.79 9.9|C .32| 50.8 50.3| .73| 4.00| Uriaq29R ‘have worries spend time thinking about them’ |
| 31 6964 1706 .28 .03|1.21 4.4|1.79 9.9|D .33| 58.0 54.6| .82| 4.08| Uriaq31R ‘would rather cope with faults than change them’ |
| 28 5600 1701 .95 .03|1.43 9.9|1.62 9.9|E .31| 33.9 39.3| .41| 3.29| Uriaq28 ‘frustrating might have recurrence - thought it was resolved’ |
| 16 6165 1709 .67 .03|1.21 6.3|1.53 9.9|F .41| 37.8 39.6| .71| 3.61| uriaq16 ‘not following through here prevent relapse’ |
| 23 7423 1714 -.07 .03|1.05 1.0|1.50 8.3|G .40| 62.6 57.2| .96| 4.33| uriaq23R ‘part of the problem but not really’ |
| 13 7532 1713 .01 .03|1.00 .1|1.47 7.0|H .43| 69.7 60.5| 1.01| 4.40| uriaq13R ‘I have faults nothing that I need to change’ |
| 32 6367 1706 .50 .03|1.17 4.6|1.45 9.9|I .37| 45.2 47.4| .77| 3.73| Uriaq32 ‘try and change and it comes back to haunt me’ |
| 18 6541 1710 .47 .03|1.08 2.1|1.31 7.3|J .44| 48.6 47.5| .91| 3.83| uriaq18 ‘thought I resolved problem still struggling’ |
| 11 7935 1712 -.24 .04| .89 -1.4|1.31 3.5|K .46| 76.2 68.6| 1.08| 4.63| uriaq11R ‘waste of time for me problem not with me’ |
| 5 7912 1713 -.33 .04|1.13 1.8|1.30 3.7|L .34| 74.3 67.1| 1.02| 4.62| uriaq5R ‘I'm not the problem one no sense to be here’ |
| 26 7396 1703 -.12 .03| .95 -1.0|1.21 3.8|M .47| 65.3 58.6| 1.03| 4.34| uriaq26R ‘psychology is boring just forget about problems’ |
| 3 7220 1714 -.04 .04|1.05 .8|1.18 3.4|N .39| 67.1 65.6| 1.00| 4.21| uriaq3 ‘doing something about problems bothering me’ |
| 2 7775 1714 -.22 .04|1.11 1.5|1.15 2.3|O .37| 72.0 63.7| 1.05| 4.54| uriaq2 ‘ready for some self-improvement’ |
| 6 6916 1713 .21 .03| .96 -1.0|1.11 2.4|P .50| 53.0 47.9| 1.06| 4.04| uriaq6 ‘worries me might slip back here to seek help’ |
| 12 7674 1713 -.23 .04| .83 -2.8|1.06 1.0|p .54| 71.6 61.8| 1.13| 4.48| uriaq12 ‘hoping this will help understand myself’ |
| 10 6841 1711 .13 .03| .95 -1.1|1.04 .8|o .48| 66.4 62.6| 1.04| 4.00| uriaq10 ‘problem is difficult working on it’ |
| 19 7240 1710 -.12 .03| .92 -1.7|1.03 .7|n .50| 64.1 59.8| 1.06| 4.23| uriaq19 ‘wish more ideas how to solve problem’ |
| 22 6902 1712 .15 .03| .89 -2.7| .99 -.1|m .55| 59.1 55.3| 1.10| 4.03| uriaq22 ‘need maintain the changes already made’ |
| 4 7915 1715 -.56 .05| .98 -.3| .97 -.5|l .42| 75.2 66.0| 1.13| 4.62| uriaq4 ‘worthwhile to work on my problem’ |
| 14 6904 1708 -.09 .03| .92 -2.1| .96 -.9|k .52| 57.7 54.8| 1.08| 4.04| uriaq14 ‘working hard to change’ |
| 8 7668 1715 -.42 .04| .95 -.9| .96 -.7|j .46| 72.3 63.2| 1.14| 4.47| uriaq8 ‘been thinking might want change myself’ |
| 7 7456 1713 -.39 .04| .92 -1.5| .92 -1.6|i .48| 71.2 65.0| 1.12| 4.35| uriaq7 ‘finally doing work on problem’ |
| 17 7066 1713 -.19 .04| .85 -2.9| .90 -2.0|h .54| 73.4 69.1| 1.10| 4.12| uriaq17 ‘not successful in changing at least working problem’ |
| 30 7030 1705 -.16 .04| .88 -2.6| .89 -2.5|g .53| 67.3 64.4| 1.10| 4.12| Uriaq30 ‘actively working on my problem’ |
| 25 7067 1707 -.23 .04| .89 -2.5| .88 -2.9|f .53| 66.0 64.0| 1.09| 4.14| uriaq25 ‘actually doing something about problem – changing’ |
| 27 7012 1704 .10 .03| .83 -4.3| .88 -2.8|e .59| 63.2 53.2| 1.20| 4.12| Uriaq27 ‘here to prevent relapse of problem’ |
| 24 7532 1708 -.27 .04| .79 -3.7| .85 -2.9|d .59| 72.0 63.4| 1.18| 4.41| uriaq24 ‘hope someone will have good advice’ |
| 21 7538 1710 -.39 .04| .79 -4.0| .77 -5.1|c .60| 71.5 64.1| 1.19| 4.41| uriaq21 ‘maybe place will be able to help’ |
| 15 7719 1710 -.37 .04| .75 -4.2| .74 -5.0|b .61| 74.6 63.2| 1.25| 4.51| uriaq15 ‘have problem I should work on it’ |
| 20 7539 1714 -.36 .04| .73 -5.5| .69 -6.9|a .66| 73.6 62.0| 1.25| 4.40| uriaq20 ‘started working problem would like help’ |
|------------------------------------+----------+----------+-----+-----------+-----+-----+------------------------------------------------------------------------|
| MEAN 7155.8 1710.2 .00 .03|1.01 .3|1.25 2.5| | 62.8 58.0| | | |
| S.D. 581.8 3.8 .37 .01| .19 4.0| .55 5.2| | 12.0 8.5| | | |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
Measurement Model Fit: URICA
Rasch Model: Examining Assumptions
• Dimensionality
• Local Independence
• Monotonicity
• Additivity
Measurement Model Fit: Examining Deviation From The Rasch Model
• The Rasch model specifies that item discrimination (item slope) is uniform across items
o This supports additivity and construct stability
• Item discrimination is NOT parameterized
o The Rasch slope is the average discrimination of all the items. It is not the mean of the individual slopes because discrimination parameters are non-linear
o Mathematically, the average slope is set at 1.0 when the Rasch model is formulated in logits, or 1.70 when it is formulated in probits (2-PL and 3-PL); 0.59 is the conversion from logits to probits
• Issues to consider
o Are the misfitting items distinct?
• Could they be considered a separate subscale?
o Does removal of these items appreciably improve the measurement model?
Item Discrimination: Is The One Parameter Model Sufficient?
Item differences (difficulty) and person differences (ability) contribute additively to the probability (log-odds) of responding positively.
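The additivity statement can be written out for the dichotomous Rasch model. The notation is an assumption: $P_{ni}$ stands for the probability that person $n$ responds positively to item $i$, $\theta_n$ for person ability, and $b_i$ for item difficulty.

```latex
\[
\ln\!\left(\frac{P_{ni}}{1 - P_{ni}}\right) = \theta_n - b_i
\]
```

Because the log-odds are a simple difference, a one-logit gain in ability offsets a one-logit increase in difficulty for every person and every item. A 2PL model multiplies the difference by an item-specific $a_i$, so this equal trade-off no longer holds across items.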
Challenges – Adding the 2nd Parameter
Interpretation: Which Item Is More Difficult?
[Figure: item characteristic curves for a green item and a red item. Horizontal axis: logits from -5 to 5, labeled Mild, Moderate, Severe. Vertical axis: probability of response, where higher = more impairment.]
• Probability of scoring higher on green item is .9. At this level of impairment the red item is more difficult.
• Probability of scoring higher on red item is .6. At this level of impairment.
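The interpretive problem with crossing curves can be reproduced with two 2PL items that differ in discrimination. The parameter values below are hypothetical and are not taken from the slide's figure; the names `two_pl`, `green`, and `red` are illustrative.

```python
import math


def two_pl(theta, a, b):
    """Probability of a positive response under the 2PL model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


green = {"a": 2.0, "b": 0.0}    # steep curve (hypothetical values)
red = {"a": 0.5, "b": -1.0}     # flat curve (hypothetical values)

for theta in (-2.0, 2.0):
    p_green = two_pl(theta, **green)
    p_red = two_pl(theta, **red)
    harder = "green" if p_green < p_red else "red"
    print(theta, round(p_green, 3), round(p_red, 3), "harder item:", harder)
```

At the low end of the continuum the green item is the harder one, while at the high end the red item is. Once slopes differ, the difficulty ordering of items depends on where the respondent sits, which is the challenge the slide raises.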
2PL – Is More Better?
• Measurement development
o Parsimonious model
o Good data fit to the measurement model
o Objective is to minimize measurement variance – development of measures that can be generalized to diverse population groups, conditions, etc.
• Performance on a developed measure
o Additional parameters may shed light on interpretation of measurement variance.
o Example: items may function differently for individuals at different positions on the measured construct.
Questioning The Assumption Of Additivity, Interval Level Data and Uncorrelated error
46
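Approaching Uniform Response: Global Distress
[Figure: category probability curves (categories 0 to 4) for two items: "Satisfied with relationships" and "Happy".]
Response key: 0 = Never, 1 = Rarely, 2 = Sometimes, 3 = Frequently, 4 = Almost Always
Where the probability curves for two adjacent categories intersect (the Rasch step calibration), an individual has a 50% probability of responding with one category rather than the other, e.g., “0” or “1”, and so forth up the scale.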
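The step-calibration idea can be sketched with a Rasch rating scale formulation, in which each threshold marks the trait level where two adjacent categories are equally likely. This is an illustrative sketch only: the function name, the item location, and the threshold values are assumptions and are not estimates from the Global Distress data.

```python
import math

def category_probabilities(theta, item_location, thresholds):
    """Category probabilities for a Rasch rating scale item.

    thresholds[k] is the step calibration between category k and k + 1.
    """
    # Cumulative sums of (theta - item_location - threshold) give the category logits.
    logits = [0.0]
    for tau in thresholds:
        logits.append(logits[-1] + (theta - item_location - tau))
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(v - peak) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical step calibrations for a 0-4 response scale.
thresholds = [-2.0, -0.7, 0.7, 2.0]

# At theta equal to the first step calibration, categories 0 and 1 are equally likely.
probs = category_probabilities(theta=-2.0, item_location=0.0, thresholds=thresholds)
print([round(p, 3) for p in probs])
print(round(probs[1] / (probs[0] + probs[1]), 3))
```

The last line prints the probability of choosing "1" given that the response is either "0" or "1", which is the 50% referred to above.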
Non-uniform Response: Global Distress
[Figure: category probability curves (categories 0 to 4) for two items: "Having trouble with alcohol" and "Using alcohol/drugs to get going". Annotated values on the plots: 7.11 logits, 5.99 logits, .18 logits, .42 logits.]
Response key: 0 = Never, 1 = Rarely, 2 = Sometimes, 3 = Frequently, 4 = Almost Always
Assumptions: Non-uniform And Non-interval Response Patterns
OBSERVED AVERAGE MEASURES BY AN OBSERVED CATEGORY
-6        -4        -2         0         2         4         6
|---------+---------+---------+---------+---------+---------| ITEMS
| 012 3 4 | 20 PEOPLE CRITICIZE MY DRINKING
| 0 12 34 | 7 SUICIDAL THOUGHTS
| 0 12 3 4 | 13 CONCERN ABOUT FAMILY TROUBLES
| 0 1 2 3 4 | 3 STRESSED
|---------+---------+---------+---------+---------+---------|
-6        -4        -2         0         2         4         6
Category key: 0 = Never, 1 = Rarely, 2 = Sometimes, 3 = Frequently, 4 = Almost Always
Assumptions:
• Response options are equal intervals – a score of “4” is twice as much as a score of “2”
• All respondents systematically interpret and use response options across items in the same manner
[Figure: three panels (Stress, Alcohol/Drugs, Suicidality). Each panel places the response options Never (0), Rarely (1), Sometimes (2), Frequently (3), Almost Always (4) along a Rasch ruler of logit measure scores from -5 to 5.]
Questioning the Assumption of Additivity – URICA: Readiness to Change
STAGES
1. Pre-contemplation
2. Contemplation
3. Action
4. Maintenance
[Figure: the Pre-contemplation, Contemplation, Action, and Maintenance scales plotted on a vertical axis of Standardized Item Difficulty / Person Ability (1.20 to -2.00), with axis annotations Resistance and Commitment to Change.]
• Scales have some overlap
• Estimates for the Contemplation scale indicate respondents have an easier time responding positively to these items – little distinction between the two scales
Error – Assumed To Be Uncorrelated: Standard Error as a Function of Ability
[Figure: Standard Error plotted against Measure Score θ (OQ-30 Total Scale).]
• SE = 0.25: θ = -2.34 to 2.47
• SE = 1.0: θ = -5.21 / θ = 5.04
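The pattern in the figure, small standard errors in the middle of the scale and large ones at the extremes, follows from how Rasch standard errors are derived from test information. This is a minimal sketch for dichotomous items; the 30 item difficulties are hypothetical and are not the OQ-30 calibrations.

```python
import math

def rasch_probability(theta: float, b: float) -> float:
    """Dichotomous Rasch response probability."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def standard_error(theta: float, difficulties) -> float:
    """SE of the ability estimate: 1 / sqrt(test information at theta)."""
    information = 0.0
    for b in difficulties:
        p = rasch_probability(theta, b)
        information += p * (1.0 - p)  # item information for a Rasch item
    return 1.0 / math.sqrt(information)

# Hypothetical 30-item test with difficulties spread from -2.9 to 2.9 logits.
difficulties = [-2.9 + 0.2 * i for i in range(30)]

for theta in (-5.0, -2.0, 0.0, 2.0, 5.0):
    print(f"theta={theta:+.1f}  SE={standard_error(theta, difficulties):.2f}")
```

Because the error varies with θ, a single reliability coefficient or a single standard error of measurement does not describe precision equally well at all score levels.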
Interpreting Score Change
Do All Items Provide the Same Level Of Information?
Total Scores – Equivalent Meaning: Does It Matter Where Change Occurs?
• I feel stressed at work/school
• I am concerned about family troubles
• I feel irritated
• I blame myself for things
Total Score: 16
• I am annoyed by people who criticize drinking
• I need a drink the next morning to get going
• I have thoughts of ending my life
• I have trouble at work because of drinking/drug use
Total Score: 16
[Figure: item responses at Baseline and Follow-up, marked on the response scale Never, Rarely, Sometimes, Frequently, Almost always.]
Response Option Approaches
• Dominance model – Likert scaling
o The more strongly an opinion or perspective is held the higher the probability of endorsing a favorable response
• All items of less difficulty would also be endorsed favorably - additive
• Ideal point model
o Probability of endorsement is associated with the respondent’s position on the latent trait (θ)
• Probability decreases as the item location moves away (either below or above) from his/her position on the latent trait
• Person with moderately strong opinion would likely not endorse (disagree with) items reflecting stronger and weaker positions
• Person can disagree for two reasons: the trait level (θ) is below the item location, or the trait level (θ) is above the item location
• Multi-dimensional pairwise preference (MUPP)
o Forced-choice statements
• Statements representing trait levels (θ) that are nearly proximate – balanced item sets
• Increased measurement precision
• Unfolding
Strongly Disagree Mid-point Strongly Agree
─ + ─ +
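The contrast between dominance and ideal point responding can be sketched with two response functions. The dominance curve below is the standard logistic form. The ideal point curve is a deliberately simple single-peaked (Gaussian-shaped) stand-in chosen for illustration; it is an assumption of this sketch and is not a specific published unfolding model, and the `width` parameter is hypothetical.

```python
import math

def dominance_probability(theta: float, location: float) -> float:
    """Monotone (dominance) response: endorsement rises as theta exceeds the item."""
    return 1.0 / (1.0 + math.exp(-(theta - location)))

def ideal_point_probability(theta: float, location: float, width: float = 1.0) -> float:
    """Illustrative single-peaked response: endorsement falls with distance from the item.

    Gaussian-shaped stand-in for an unfolding model (an assumption of this sketch).
    """
    return math.exp(-((theta - location) ** 2) / (2.0 * width ** 2))

item_location = 0.0  # hypothetical item position on the latent trait
for theta in (-2.0, 0.0, 2.0):
    d = dominance_probability(theta, item_location)
    i = ideal_point_probability(theta, item_location)
    print(f"theta={theta:+.1f}  dominance={d:.2f}  ideal point={i:.2f}")
```

Under the ideal point curve a respondent well above the item and a respondent well below it are both unlikely to endorse, which is the "two reasons to disagree" noted in the bullets.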
Interpreting Score Changes
The estimated Reliable Change Index for the General Distress Measure is reported to be 10. An individual’s score must change by at least 10 points on the measure for the change to be considered clinically significant.
A 10-point change has a differential effect depending on where on the continuum the change occurs.
[Figure: Raw Total Score (High = More Distress) plotted against Measure Score (logits), with severity bands Normal, Mild, Mod., Severe. Two 10 point differences in raw score are marked: one corresponds to .5 logit units and the other to 4.1 logit units.]
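The unequal logit value of equal raw-score changes can be reproduced with a small Rasch example. This is a sketch under stated assumptions: it uses 40 hypothetical dichotomous items rather than the polytomous General Distress Measure, and the helper names and the bisection search are illustrative choices.

```python
import math

def expected_raw_score(theta: float, difficulties) -> float:
    """Expected total score at theta: the sum of Rasch item probabilities."""
    return sum(1.0 / (1.0 + math.exp(-(theta - b))) for b in difficulties)

def theta_for_raw_score(raw: float, difficulties, lo=-10.0, hi=10.0, tol=1e-8) -> float:
    """Find the logit measure whose expected raw score equals `raw` (bisection)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if expected_raw_score(mid, difficulties) < raw:
            lo = mid  # expected score too low, so theta must be higher
        else:
            hi = mid
    return (lo + hi) / 2.0

# Hypothetical 40-item test, difficulties from -1.95 to 1.95 logits.
difficulties = [-1.95 + 0.1 * i for i in range(40)]

# The same 10-point raw change, taken mid-scale and near the top of the scale.
mid_change = theta_for_raw_score(25, difficulties) - theta_for_raw_score(15, difficulties)
top_change = theta_for_raw_score(39, difficulties) - theta_for_raw_score(29, difficulties)
print(f"15 -> 25 raw points: {mid_change:.2f} logits")
print(f"29 -> 39 raw points: {top_change:.2f} logits")
```

The raw-to-logit relationship is an ogive, so a fixed raw-score criterion for change represents more of the underlying construct near the extremes than near the middle.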
Multidimensional IRT Models - MIRT
• The complexity of assessment includes dimensions not directly measured
o Literacy level
o Conscientiousness
o Emotional stability
o Interpretation of construct meaning
• MIRT
o Between-item dimensionality (the assessment is multidimensional, with multiple latent traits, but each item assesses only one dimension)
o Within-item dimensionality (multiple skills/competencies are needed to respond to an item)
o Models test information as a response surface representing the relationship between correct response, person ability (θ) and discrimination (a)
• MIRT scores – complexity in interpretation
o Multidimensional item difficulty and discrimination over a range of θ vectors
MIRT Item Surface
[Figure: Item Response Surface, showing Probability of Correct Response (0 to 1) over θ1 and θ2 (each from -4 to 4).]
[Figure: Contour Plot of Item Response Surface, with contour lines from 0.1 to 0.9 over θ1 and θ2 (each from -4 to 4).]
• Item surface increases more quickly across θ1
o Consequence of the a parameter (discrimination)
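The effect of the discrimination parameters on the item surface can be sketched with a two-dimensional compensatory logistic model. The parameter values below (a1, a2, d) are hypothetical and were chosen only so that the item discriminates more strongly on θ1; they are not taken from the plotted item.

```python
import math

def mirt_probability(theta1: float, theta2: float, a1: float, a2: float, d: float) -> float:
    """Compensatory two-dimensional logistic item response surface."""
    z = a1 * theta1 + a2 * theta2 + d
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical item: higher discrimination on dimension 1 than on dimension 2.
a1, a2, d = 1.5, 0.5, 0.0

base = mirt_probability(0.0, 0.0, a1, a2, d)
gain_theta1 = mirt_probability(1.0, 0.0, a1, a2, d) - base  # one-unit move along theta1
gain_theta2 = mirt_probability(0.0, 1.0, a1, a2, d) - base  # one-unit move along theta2

print(f"gain from +1 on theta1: {gain_theta1:.3f}")
print(f"gain from +1 on theta2: {gain_theta2:.3f}")
```

Because the model is compensatory, a high standing on one dimension can offset a low standing on the other, which is part of what makes MIRT scores harder to interpret.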
Reliable Change*: Evaluating and Characterizing Effective Treatment Providers
                 Raw Score1    Raw Score2    Measure Scores (Rasch a)
Improvement      22%           24%           20%
Stability        58%           56%           72%
Deterioration    20%           20%           8%
$SE_{DIFF} = \sqrt{2(SE)^2}$
$RCI = \dfrac{X_2 - X_1}{SE_{DIFF}}$
*Reliable Change (Jacobson & Truax, 1991)
Note: Suggested reliable raw score change for the General Distress Measure is 10 points (Raw Score1). Sample specific RCI is estimated at a change of 12 points (Raw Score2).
a Multidimensional random coefficient multinomial logit model – alcohol/drug items treated as a separate subscale.
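The reliable change calculation can be sketched as follows, using the Jacobson and Truax formulation in which the standard error of measurement is derived from the scale's standard deviation and reliability. The standard deviation, reliability, and scores below are hypothetical values for illustration, not figures from the General Distress Measure, and the 1.96 cutoff and category labels are assumptions of this sketch.

```python
import math

def reliable_change_index(x1: float, x2: float, sd: float, reliability: float) -> float:
    """RCI = (X2 - X1) / SE_DIFF, with SE_DIFF = sqrt(2 * SE^2)."""
    se = sd * math.sqrt(1.0 - reliability)  # standard error of measurement
    se_diff = math.sqrt(2.0 * se ** 2)      # standard error of the difference
    return (x2 - x1) / se_diff

def classify_change(rci: float, critical: float = 1.96) -> str:
    """Label change, assuming higher scores mean more distress."""
    if rci <= -critical:
        return "Improvement"
    if rci >= critical:
        return "Deterioration"
    return "Stability"

# Hypothetical scale properties and scores.
sd, reliability = 10.0, 0.90
for baseline, follow_up in ((60, 48), (60, 55), (60, 70)):
    rci = reliable_change_index(baseline, follow_up, sd, reliability)
    print(f"{baseline} -> {follow_up}: RCI={rci:+.2f}  {classify_change(rci)}")
```

The same function can be applied to logit measure scores by supplying a standard error on the logit scale, which is one way the raw-score and measure-score columns in the table can come to differ.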
Assessing Outcomes
Ann M. Doucette, Ph.D.
Email: [email protected]