Reliability: SEM and Agreement - Michigan State University Notes/395 - 3... · 7 Example 2: Brutus...

10
1 Reliability: SEM and Agreement PSY 395 – Oswald Overview Review of Classical Test Theory Standard Error of Measurement (SEM) Comparing SEM to Reliability Reliability vs. Agreement Agreement Classical Test Theory – A refresher! If an immortal person took test an infinite number of times, what would be the mean of the observed scores? X = T + E, T is the same every time you took it E is random (sometimes +, sometimes -) and would average out to zero, so… average X = T Review of Classical Test Theory

Transcript of Reliability: SEM and Agreement - Michigan State University Notes/395 - 3... · 7 Example 2: Brutus...

Page 1: Reliability: SEM and Agreement - Michigan State University Notes/395 - 3... · 7 Example 2: Brutus and Maureen: Bogus Clinicians What if Brutus and Maureen were actors, just pretending

1

Reliability:SEM and Agreement

PSY 395 – Oswald

Overview

• Review of Classical Test Theory

• Standard Error of Measurement (SEM)

• Comparing SEM to Reliability

• Reliability vs. Agreement

• Agreement

Classical Test Theory – A refresher!

If an immortal person took test an infinite number of times, what would be the mean of the observed scores?

X = T + E,T is the same every time you took itE is random (sometimes +, sometimes -) and would average out to zero, so…average X = T

Review of Classical Test Theory

Page 2: Reliability: SEM and Agreement - Michigan State University Notes/395 - 3... · 7 Example 2: Brutus and Maureen: Bogus Clinicians What if Brutus and Maureen were actors, just pretending

2

Standard Error of Measurement (SEM)

• …but what would the standard deviation be?...

T = 10

+ errors- errors

Review of Classical Test Theory

The Mysteries of SEM …Revealed!

standard error of measurement = SEMstandard deviation of the error around any individual’s true score

±2 SEM captures95% of the error

Standard Error of Measurement

The Mysteries of SEM …Revealed!

…you’d like SEM/errorto be small

Why?How?

Standard Error of Measurement

Page 3: Reliability: SEM and Agreement - Michigan State University Notes/395 - 3... · 7 Example 2: Brutus and Maureen: Bogus Clinicians What if Brutus and Maureen were actors, just pretending

3

Large SEM

The CAT (Creative Analogies Test) has 50 items. Most people score anywhere from 20 to 100.

Amy scores 75 on the CAT.What if SEM = 10?Then…• 95% sure her true score is 55 – 95 (±2 SEM)

Standard Error of Measurement

Small SEM

The CAT (Creative Analogies Test) has 50 items. Most people score anywhere from 20 to 100.

Amy scores 75 on the CAT.What if SEM = 2?Then…• 95% sure her true score is 71 – 79 (±2 SEM)

Standard Error of Measurement

It’s SEMtacular!

1x xxSEM s r= −

xs

10 1 .84 4SEM = − =

xxr= SD of test scores = test reliability

10 1 .19 9SEM = − =

good reliability low SEM

poor reliability high SEM

Standard Error of Measurement

Page 4: Reliability: SEM and Agreement - Michigan State University Notes/395 - 3... · 7 Example 2: Brutus and Maureen: Bogus Clinicians What if Brutus and Maureen were actors, just pretending

4

SEM Example

Say you have a test wheremean = 50, sx = 10 and rxx = .84

SEM = − =10 1 84 4.

Fred scores 35 on this test, which means

• 95% sure his true score is 27 – 43 (±2 SEM)• 68% sure his true score is 36 – 44 (±1 SEM)

Standard Error of Measurement

SEM Assumptions

• a reliability coefficient based on an appropriate measure

• the sample appropriately represents the population

1x xxSEM s r= −

Standard Error of Measurement

Use of Cut Scores

• Cut scores are set values on a test that determines who passes and who fails the test.– Commonly used for licensure or certification

(bar exam, medical licensure, civil service)

• What is the impact of the SEM on interpreting cut scores?

Comparing SEM to Reliability

Page 5: Reliability: SEM and Agreement - Michigan State University Notes/395 - 3... · 7 Example 2: Brutus and Maureen: Bogus Clinicians What if Brutus and Maureen were actors, just pretending

5

Some general rules about reliability

• Bigger is better!– The more items, the more occasions, and the more

observers the better the reliability• The Spearman-Brown Prophesy formula gives

us the reliability of a test if the number of (similar) items is increased

• Need to compute the standard error of measurement to understand individual test scores

• A measure can not be valid if it isn’t reliable

Comparing SEM to Reliability

Reliability vs. Agreement

Ex. Two raters rate you on the MSU School Spirit Test!!!1-5 scale, 1=strongly disagree and 5=strongly agree

Are the scores reliably related? Do they agree?

R2R1Question

134. This person might tattoo her/his entire body green and white.

353. Maybe you think she/he is sick, but it’s just “Spartan Fever.”

352. For this person, an MSU football game is an all-American treat.

241. This person would change his/her name to Sparty or Spartyette.

Reliability vs. Agreement

Example 1 - Interrater Agreement

ExampleBrutus and Maureen are clinical psychologists. They independently conduct intake interviews for 200 clients presenting mood problems.

Agreement

Page 6: Reliability: SEM and Agreement - Michigan State University Notes/395 - 3... · 7 Example 2: Brutus and Maureen: Bogus Clinicians What if Brutus and Maureen were actors, just pretending

6

(Ex. 1, cont.) Assigned Categories

Clients assigned to 1 of 3 categories

– Cyclothymic– Bipolar– Depressed

Why would you think/hopethe therapists agree?

Agreement

(Ex. 1, cont.) Assignments

N=200 totalMaureen

cyclo bipo deprBrutus cyclo 106 10 4 120

bipo 22 28 10 60depr 2 12 6 20

130 50 20 140

Agreement

(Ex. 1, cont.) Convert # to Proportions

Proportions (=cell freq/200)Maureen

cyclo bipo deprBrutus cyclo .53 .05 .02 .60

bipo .11 .14 .05 .30depr .01 .06 .03 .10

.65 .25 .10 .70.53+.14+.03 = 70 % interrater agreement

Agreement

Page 7: Reliability: SEM and Agreement - Michigan State University Notes/395 - 3... · 7 Example 2: Brutus and Maureen: Bogus Clinicians What if Brutus and Maureen were actors, just pretending

7

Example 2:Brutus and Maureen: Bogus Clinicians

What if Brutus and Maureen were actors, just pretending to be clinical psychologists as a sick little drama class project?

In other words, what if Brutus and Maureen had NO IDEA WHAT THEY WERE DOING, and all their “diagnoses”were COMPLETELY DUE TO CHANCE?

Agreement

(Ex. 2, cont.) Frequencies

N=201 totalMaureen

cyclo bipo deprBrutus cyclo 22 23 22 67

bipo 23 22 22 67depr 22 22 23 67

67 67 67 67

Agreement

(Ex. 2, cont.) Proportions

Proportions (=cell freq/200)Maureen

cyclo bipo deprBrutus cyclo .11 .11 .11 .33

bipo .11 .11 .11 .33depr .11 .11 .11 .33

.33 .33 .33 .33.11+.11+.11 = 33 % interrater agreement just by chance

Agreement

Page 8: Reliability: SEM and Agreement - Michigan State University Notes/395 - 3... · 7 Example 2: Brutus and Maureen: Bogus Clinicians What if Brutus and Maureen were actors, just pretending

8

Wait a sec…

Two people could agree just by chance, right? (e.g., 2 people flippin’ coins)

(1) By how much do ratings beatchance agreement?

(2) What’s the MOST the ratings could beat chance by?

Agreement

Kappa CoefficientHow Much Above Chance Agreement?

κ =−

−P P

Pobs chance

chance1

The maximum amount beyond chance you couldhave agreement on.

The amount above chanceyou do have agreement on.

….out of….

Agreement

Proportions (=cell freq/200)Maureen

cyclo bipo deprBrutus cyclo .11 .11 .11 .33

bipo .11 .11 .11 .33depr .11 .11 .11 .33

.33 .33 .33 .33.11+.11+.11 = 33 % interrater agreement just by chance

. 3 3 .3 3 01 1 .3 3o b s c h a n c e

c h a n c e

P PP

κ − −= = =

− −

Agreement

Page 9: Reliability: SEM and Agreement - Michigan State University Notes/395 - 3... · 7 Example 2: Brutus and Maureen: Bogus Clinicians What if Brutus and Maureen were actors, just pretending

9

Proportions (=cell freq/200)Maureen

cyclo bipo deprBrutus cyclo .53 .05 .02 .60

bipo .11 .14 .05 .30depr .01 .06 .03 .10

.65 .25 .10 .70.53+.14+.03 = 70 % interrater agreement

.7 0 .4 7 5 .4 31 1 .4 7 5o b s ch a n ce

ch a n ce

P PP

κ − −= = =

− −

(chance is different from .33 here, why?)

Agreement

PregnancyTest

Say you had the followingdata on 1000 women…

Actual…

960 98%

9604098096020no preg20020preg

no pregpreg

Example 3:What if a pregnancy test was

claimed to be “98% effective”?

Agreement

κ = (Pobs – Pchance) / (1 – Pchance)κ = .98 – .94 = .67

1 – .94

PregnancyTest

Say you had the followingdata on 1000 women…

Actual…

.98.96.04

.98.96.02no preg

.02.00.02pregno pregpreg

(Ex. 3, cont.) Proportions

Page 10: Reliability: SEM and Agreement - Michigan State University Notes/395 - 3... · 7 Example 2: Brutus and Maureen: Bogus Clinicians What if Brutus and Maureen were actors, just pretending

10

What if a pregnancy test was claimed to be “98% effective”?

In these data, most (96%) were not pregnant anyway, so the test would have been correct 96% of the time even if the test always said the individual was not pregnant!

The pregnancy test does help above chance levels, but – given the sample data – not as much as the claim might lead one to believe.