Statistical Considerations for Educational Screening & Diagnostic Assessments


Page 1: Statistical Considerations for Educational Screening & Diagnostic Assessments

Yaacov Petscher, Ph.D.
Florida Center for Reading Research, Florida State University

Statistical Considerations for Educational Screening & Diagnostic Assessments

A discussion of methodological applications which have existed in the literature for a long time and are used in other disciplines but are emerging more now in education

Page 2: Statistical Considerations for Educational Screening & Diagnostic Assessments

Discussion Points

Assessment Assumptions
Contexts of Assessments
Statistical Considerations

Reliability Validity Benchmarking

“Disclaimer”
Focusing on breadth, not depth
Based on applied contract and grant research
One slide of equations

Page 3: Statistical Considerations for Educational Screening & Diagnostic Assessments

Assumptions of Assessment - Researchers

Constructs exist but we can’t see them
Constructs can be measured
Although we can measure constructs, our measurement is not perfect
There are different ways to measure any given construct
All assessment procedures have strengths and limitations

Page 4: Statistical Considerations for Educational Screening & Diagnostic Assessments

Assumptions of Assessment - Practitioner

Multiple sources of information should be part of the assessment process

Performance on tests can be generalized to non-test behaviors

Assessment can provide information that helps educators make better educational decisions

Assessment can be conducted in a fair manner

Testing and assessment can benefit our educational institutions and society as a whole

Page 5: Statistical Considerations for Educational Screening & Diagnostic Assessments

Contexts of Assessments

Instructional: Formative, Interim, Summative

Research: Individual Differences, Group Differences (RCT), Growth

Legislative Initiatives: NCLB, Reading First, Race to the Top, Common Core

Page 6: Statistical Considerations for Educational Screening & Diagnostic Assessments

Common Core Adoption

Page 7: Statistical Considerations for Educational Screening & Diagnostic Assessments

PARCC

Page 8: Statistical Considerations for Educational Screening & Diagnostic Assessments

Smarter Balanced

Page 9: Statistical Considerations for Educational Screening & Diagnostic Assessments

Within Common Core

USDOE: PARCC Assessments, Smarter Balanced Assessments, Reading for Understanding Assessments, I3 Assessments

Private Sector

Page 10: Statistical Considerations for Educational Screening & Diagnostic Assessments

Underlying “Code” of Assumptions

Researcher
Constructs exist but we can’t see them
Constructs can be measured
Although we can measure constructs, our measurement is not perfect
There are different ways to measure any given construct
All assessment procedures have strengths and limitations

Practitioner
Multiple sources of information should be part of the assessment process
Performance on tests can be generalized to non-test behaviors
Assessment can provide information that helps educators make better educational decisions
Assessment can be conducted in a fair manner
Testing and assessment can benefit our educational institutions and society as a whole

Page 11: Statistical Considerations for Educational Screening & Diagnostic Assessments

Statistical Considerations - Reliability

Stability, accuracy, or consistency of test scores
Many types: internal consistency, retest, parallel-form, split-half
These should not be viewed as interchangeable; one could have very high stability but very poor internal consistency (e.g., date of birth, height, SSN)
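As a concrete illustration of the internal-consistency idea above, here is a minimal sketch of coefficient alpha computed from an examinee-by-item score matrix (the matrix below is made-up toy data, not from any study in this talk):

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Coefficient alpha for an examinee-by-item score matrix."""
    k = items.shape[1]                          # number of items
    item_vars = items.var(axis=0, ddof=1)       # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of total scores
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Toy 0/1 responses: 5 examinees by 4 items (hypothetical)
x = np.array([[1, 1, 1, 0],
              [1, 0, 1, 1],
              [0, 0, 1, 0],
              [1, 1, 1, 1],
              [0, 0, 0, 0]])
print(round(cronbach_alpha(x), 2))
```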

Page 12: Statistical Considerations for Educational Screening & Diagnostic Assessments

Statistical Considerations - Reliability

Most frequently used framework is classical test theory

What does this assume?

X = T + e (an observed score is a true score plus error)
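The assumption is that observed-score variance splits into true-score and error variance, so reliability can be written as (a textbook formulation, added for context rather than taken from the slides):

\[ \rho_{XX'} = \frac{\sigma^2_T}{\sigma^2_X} = \frac{\sigma^2_T}{\sigma^2_T + \sigma^2_e}, \qquad \operatorname{Cov}(T, e) = 0 \]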

Page 13: Statistical Considerations for Educational Screening & Diagnostic Assessments

Benefits of IRT

Puts persons and items on the same scale (CTT looks at total scores and item p-values for difficulty)

Can result in shorter tests (CTT reliability increases with more items)

Can estimate the precision of scores at the individual level (CTT assumes error is the same for everyone)
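A minimal sketch of the machinery behind that last point: the 2PL item information function and the test-level standard error of ability (the a and b parameters below are invented for illustration):

```python
import numpy as np

def p_2pl(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """2PL item information: I(theta) = a^2 * P(theta) * (1 - P(theta))."""
    p = p_2pl(theta, a, b)
    return a ** 2 * p * (1.0 - p)

theta = np.linspace(-3, 3, 121)     # grid of ability values
a = [1.2, 0.8, 1.5]                 # hypothetical discriminations
b = [-1.0, 0.0, 1.0]                # hypothetical difficulties

# Test information is the sum of item informations; SE(theta) = 1 / sqrt(TI)
test_info = sum(item_information(theta, ai, bi) for ai, bi in zip(a, b))
se_theta = 1.0 / np.sqrt(test_info)
print(se_theta.min(), se_theta.max())   # error varies across the ability range
```

Unlike a single CTT reliability coefficient, the standard error here changes across the ability range, which is what the item and test information figures on the next slides show.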

Page 14: Statistical Considerations for Educational Screening & Diagnostic Assessments

Item Difficulty by Total Score Decile Groups

Page 15: Statistical Considerations for Educational Screening & Diagnostic Assessments

Item Difficulty by Ability

Page 16: Statistical Considerations for Educational Screening & Diagnostic Assessments

Items Don’t Always Do What We Want

Page 17: Statistical Considerations for Educational Screening & Diagnostic Assessments

Item Information

Page 18: Statistical Considerations for Educational Screening & Diagnostic Assessments

Test Information – Standard Error

Page 19: Statistical Considerations for Educational Screening & Diagnostic Assessments

Precision/Reliability

Page 20: Statistical Considerations for Educational Screening & Diagnostic Assessments

Statistical Considerations - Reliability

While precision improves on the idea of reliability, can precision itself be improved?

Account for context effects (Wainer et al., 2000): Petscher & Foorman, 2011

Account for time (Verhelst, Verstralen, & Jansen, 1997): Prindle, Petscher, & Mitchell, 2013

Page 21: Statistical Considerations for Educational Screening & Diagnostic Assessments

Statistical Considerations - Reliability

Context effects
Any influence or interpretation that an item may acquire as a result of its relationship to other items
A greater problem in CAT because each examinee sees a unique set of items
Emerges as both an item-level and a passage-level problem

Page 22: Statistical Considerations for Educational Screening & Diagnostic Assessments

Statistical Considerations - Reliability

Common stimulus

Page 23: Statistical Considerations for Educational Screening & Diagnostic Assessments

Statistical Considerations - Reliability

“If several questions within a test are experimentally linked so that the reaction to one question influences the reaction to another, the entire group of questions should be treated preferably as an ‘item’ when the data arising from application of split-half or appropriate analysis-of-variance methods are reported in the test manual”

APA Standards for Educational and Psychological Testing (1966)

Page 24: Statistical Considerations for Educational Screening & Diagnostic Assessments

Expressed in IRT

Standard 3PL model:

\[ p(x_{ij} = 1 \mid \theta_j) = c_i + (1 - c_i)\,\frac{\exp[a_i(\theta_j - b_i)]}{1 + \exp[a_i(\theta_j - b_i)]} \]

Testlet (context-effect) model, adding a person-by-passage effect \(\gamma_{jd(i)}\):

\[ p(x_{ij} = 1 \mid \theta_j) = c_i + (1 - c_i)\,\frac{\exp[a_i(\theta_j - b_i - \gamma_{jd(i)})]}{1 + \exp[a_i(\theta_j - b_i - \gamma_{jd(i)})]} \]
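A minimal numerical sketch of these two equations (the parameter values and the function name p_3pl_testlet are hypothetical, introduced only for illustration):

```python
import numpy as np

def p_3pl_testlet(theta, a, b, c, gamma=0.0):
    """3PL probability of a correct response; gamma is the person-specific
    testlet (context) effect for the passage containing the item.
    gamma = 0 reduces this to the standard 3PL."""
    z = a * (theta - b - gamma)
    return c + (1.0 - c) / (1.0 + np.exp(-z))

# One hypothetical item and examinee, with and without a testlet effect
print(p_3pl_testlet(theta=0.5, a=1.1, b=0.0, c=0.2, gamma=0.0))
print(p_3pl_testlet(theta=0.5, a=1.1, b=0.0, c=0.2, gamma=0.3))
```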

Page 25: Statistical Considerations for Educational Screening & Diagnostic Assessments

Study 1: Reading Comprehension in Florida

Page 26: Statistical Considerations for Educational Screening & Diagnostic Assessments

Precision – After 3 passages

Page 27: Statistical Considerations for Educational Screening & Diagnostic Assessments

FAIR Technical Manual

Page 28: Statistical Considerations for Educational Screening & Diagnostic Assessments

Simulations are all well and good…

How does accounting for item dependency improve testing in the real world?

Page 29: Statistical Considerations for Educational Screening & Diagnostic Assessments

N ≈ 800, randomly assigned to testing condition
Control condition used the current 2PL scoring
Experimental condition used an unrestricted bi-factor model

Evaluate: precision, number of passages, prediction to state achievement

RCT design

Page 30: Statistical Considerations for Educational Screening & Diagnostic Assessments
Page 31: Statistical Considerations for Educational Screening & Diagnostic Assessments
Page 32: Statistical Considerations for Educational Screening & Diagnostic Assessments

What this suggests

“Newer” models help us to more appropriately model the data

Precision/reliability are improved just by modeling the context effect

Improve the efficiency and precision of a computer-adaptive test by modeling the item-dependency

Page 33: Statistical Considerations for Educational Screening & Diagnostic Assessments

Study 2: Morphology CAT

Page 34: Statistical Considerations for Educational Screening & Diagnostic Assessments

Accounting for Time

Somewhat similar to the item dependency model

IRT models are concerned with accuracy; what about fluency?

CBM (DIBELS, AIMSweb, easyCBM) and brief assessments (TOWRE, TOSREC, etc.)

Prindle, Petscher, & Mitchell (2013): N = 200, a word knowledge test limited to 60 seconds; compared a 1PL model with a 1PL response-time model
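For a concrete picture, one common way to write a joint accuracy-and-speed model is the hierarchical formulation below (a sketch in the spirit of van der Linden-style response-time modeling, not necessarily the exact specification used in the studies cited above):

\[ P(x_{ij} = 1 \mid \theta_j) = \frac{\exp(\theta_j - b_i)}{1 + \exp(\theta_j - b_i)}, \qquad \log t_{ij} \sim N\!\left(\beta_i - \tau_j,\; \sigma^2_i\right) \]

where \(\theta_j\) and \(\tau_j\) are a person's ability and speed, and \(b_i\) and \(\beta_i\) are an item's difficulty and time intensity.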

Page 35: Statistical Considerations for Educational Screening & Diagnostic Assessments

Results

1PL marginal α = .80

1PL with response time: marginal α = .87

Page 36: Statistical Considerations for Educational Screening & Diagnostic Assessments

What this suggests

Accounting for the response time of items can improve precision for most participants

Limitations:
More difficult to do with younger children
Requires computer delivery to record accuracy and time
Cannot do with connected text

Page 37: Statistical Considerations for Educational Screening & Diagnostic Assessments

Validity

Page 38: Statistical Considerations for Educational Screening & Diagnostic Assessments

Statistical Considerations – Factor Validity

Assessments are measures of hypothetical constructs

Assessments are measured with error
Use a latent variable to leverage the common variance
How is this modeled? Unidimensional or multidimensional

Three illustrations:
Petscher & Foorman, 2012 (Syntactic Awareness)
Kieffer & Petscher, 2013 (Morphology/Vocabulary)
Justice, Petscher, & Pentimonti, 2013 (Early Literacy)

Page 39: Statistical Considerations for Educational Screening & Diagnostic Assessments
Page 40: Statistical Considerations for Educational Screening & Diagnostic Assessments

Study 1: Syntactic Awareness

Page 41: Statistical Considerations for Educational Screening & Diagnostic Assessments
Page 42: Statistical Considerations for Educational Screening & Diagnostic Assessments
Page 43: Statistical Considerations for Educational Screening & Diagnostic Assessments
Page 44: Statistical Considerations for Educational Screening & Diagnostic Assessments
Page 45: Statistical Considerations for Educational Screening & Diagnostic Assessments
Page 46: Statistical Considerations for Educational Screening & Diagnostic Assessments

Distribution of Ability

Page 47: Statistical Considerations for Educational Screening & Diagnostic Assessments

Precision (reliability) of Ability Scores

Page 48: Statistical Considerations for Educational Screening & Diagnostic Assessments

Predictive Validity of Factor Scores

Page 49: Statistical Considerations for Educational Screening & Diagnostic Assessments

Study 2: Morphological Awareness/Vocabulary

Page 50: Statistical Considerations for Educational Screening & Diagnostic Assessments

Morphological Awareness (MA) predicts Reading Comprehension (RC)

For a while, we have known that MA is correlated with reading comprehension (e.g., Carlisle, 2000; Freyd & Baron, 1982; Tyler & Nagy, 1990)

MA RC

Page 51: Statistical Considerations for Educational Screening & Diagnostic Assessments

MA predicts RC,above & beyond Vocabulary (V)

Unique contributions of MA to RC, controlling for vocabulary (e.g., Carlisle, 2000; Kieffer, Biancarosa, & Mancilla-Martinez, in press; Kieffer & Lesaux, 2008, 2012; Kieffer & Box, 2013; Nagy, Berninger, & Abbott, 2006)

MA RC

V

Page 52: Statistical Considerations for Educational Screening & Diagnostic Assessments

But wait…

Are we actually measuring MA and vocabulary as separate dimensions of lexical knowledge?

Observed correlations between MA and vocabulary are attenuated by measurement error

Reliability of researcher-created MA measures has been moderate, in the .70-.80 range and occasionally lower

So, “unique” contributions of MA beyond V could be an artifact of measurement error

MA V

Page 53: Statistical Considerations for Educational Screening & Diagnostic Assessments

But wait…

Using Confirmatory Factor Analysis (CFA), Muse (2005) found that MA could not be distinguished from vocabulary in fourth grade, and that the two instead form a unidimensional construct (see also Wagner, Muse, & Tannenbaum, 2007).

Spencer (2012) replicated this finding with eighth graders.

MA/V

Page 54: Statistical Considerations for Educational Screening & Diagnostic Assessments

On the other hand…

Using CFA, Kieffer & Lesaux (2012) found that MA was measurably separable from two other dimensions of vocabulary, though strongly related for both native English & language minority learners in Grade 6

Neugebauer, Kieffer, & Howard (under review) replicated this finding for Spanish speaking language minority learners in Grades 6-8

MA V

Page 55: Statistical Considerations for Educational Screening & Diagnostic Assessments

But

Is it possible that a multidimensional structure exists but is best captured by a general factor of lexical knowledge and specific factors of morphological awareness and vocabulary?

If the common variance is captured by a general factor as well as specific factors, does each predict a distal outcome?

Page 56: Statistical Considerations for Educational Screening & Diagnostic Assessments

Modeling Dimensionality of Lexical Knowledge: Unidimensional

Fit poorly; rejected across parametric & nonparametric EFA & CFA models

Page 57: Statistical Considerations for Educational Screening & Diagnostic Assessments

Modeling Dimensionality of Lexical Knowledge: Two-Dimensional

Page 58: Statistical Considerations for Educational Screening & Diagnostic Assessments

Modeling Dimensionality of Lexical Knowledge: Bi-factor Model

CFI = .98; TLI = .98; RMSEA = .015
vs. 1D: Δχ² = 66.71, Δdf = 34, p < .001
vs. 2D: Δχ² = 48.94, Δdf = 33, p < .05
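Those Δχ² comparisons can be spot-checked in a couple of lines (a sketch using scipy's chi-square survival function):

```python
from scipy.stats import chi2

# Nested-model chi-square difference tests reported on the slide
print(chi2.sf(66.71, df=34))   # bi-factor vs. unidimensional: p < .001
print(chi2.sf(48.94, df=33))   # bi-factor vs. two-dimensional: p < .05
```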

Page 59: Statistical Considerations for Educational Screening & Diagnostic Assessments

Statistical Considerations – Factor Validity

Page 60: Statistical Considerations for Educational Screening & Diagnostic Assessments

Statistical Considerations – Factor Validity

Page 61: Statistical Considerations for Educational Screening & Diagnostic Assessments

Statistical Considerations – Factor Validity

Page 62: Statistical Considerations for Educational Screening & Diagnostic Assessments

Statistical Considerations - SEM

Page 63: Statistical Considerations for Educational Screening & Diagnostic Assessments

Study 3Early Literacy Skills

Page 64: Statistical Considerations for Educational Screening & Diagnostic Assessments

SPOT

Measure developed by Jackie Van Lankveld
An embedded assessment administered while students read a story
Used primarily with students identified with LI (language impairment)
Measures alphabet knowledge, phonological awareness, and print knowledge

Present study: N ≈ 300; in this LI sample, how are the item responses best represented?

Page 65: Statistical Considerations for Educational Screening & Diagnostic Assessments

Statistical Considerations – Factor Validity

Page 66: Statistical Considerations for Educational Screening & Diagnostic Assessments

Statistical Considerations – Factor Validity

Model                          χ²        df    CFI    TLI    RMSEA (CI)
Unidimensional                 223.42    104   0.88   0.86   .064 (.052, .075)
Multidimensional, 3-factor     146.77    101   0.96   0.95   .040 (.024, .053)
Multidimensional, bi-factor    107.65     89   0.98   0.98   .027 (.000, .044)
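For readers who want to see how these indices relate to the underlying chi-squares, a minimal sketch using the standard CFI/TLI/RMSEA formulas (the baseline-model chi-square and df below are placeholders, since the slide does not report them; N ≈ 300 follows the SPOT sample described earlier):

```python
import numpy as np

def fit_indices(chisq_m, df_m, chisq_b, df_b, n):
    """Standard CFI, TLI, and RMSEA from model and baseline chi-squares."""
    d_m = max(chisq_m - df_m, 0.0)     # model misfit
    d_b = max(chisq_b - df_b, 0.0)     # baseline (null-model) misfit
    cfi = 1.0 - d_m / max(d_b, d_m)
    tli = ((chisq_b / df_b) - (chisq_m / df_m)) / ((chisq_b / df_b) - 1.0)
    rmsea = np.sqrt(d_m / (df_m * (n - 1)))
    return cfi, tli, rmsea

# Bi-factor row from the table; chisq_b and df_b are placeholder values
print(fit_indices(chisq_m=107.65, df_m=89, chisq_b=1500.0, df_b=120, n=300))
# RMSEA does not depend on the baseline and comes out near the .027 in the table
```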

Page 67: Statistical Considerations for Educational Screening & Diagnostic Assessments
Page 68: Statistical Considerations for Educational Screening & Diagnostic Assessments

Statistical Considerations – Factor Validity

[Path diagram relating the SPOT factors (Phonological, Print, Alphabet) to the outcome scores SS, Core, EV, WS, and EL; the reported standardized coefficients include -.30***, -.35***, -.46***, -.53***, .62***, .69***, .71***, and .82***, along with several nonsignificant paths between -.15 and .14.]
Page 69: Statistical Considerations for Educational Screening & Diagnostic Assessments

Implications

Research: Multidimensional, General
Good for individual differences
Limited in applicability, for now (Piasta, Petscher, & Anthony, in preparation)

Practice: Multidimensional, Correlated Traits
Good for easy use in the classroom
Limited in specificity

Model                          χ²        df    CFI    TLI    RMSEA (CI)
Unidimensional                 223.42    104   0.88   0.86   .064 (.052, .075)
Multidimensional, 3-factor     146.77    101   0.96   0.95   .040 (.024, .053)
Multidimensional, bi-factor    107.65     89   0.98   0.98   .027 (.000, .044)

Page 70: Statistical Considerations for Educational Screening & Diagnostic Assessments

Benchmarking

Page 71: Statistical Considerations for Educational Screening & Diagnostic Assessments

Statistical Considerations – Benchmarks

Students with poor reading skills have difficulty in closing achievement gaps

Accurate identification is necessary to remediate difficulties

Many assessments include guidelines for cut-points

Page 72: Statistical Considerations for Educational Screening & Diagnostic Assessments

Sample Risk Levels Chart

Page 73: Statistical Considerations for Educational Screening & Diagnostic Assessments

How to Validate – Current Theory

Variety of Methods Best Guess

+/- 1SD Percentile Ranks

Simple Stat Bivariate Correlations Interrater Reliability

More Advanced Logistic Regression Discriminant Function Analysis Achievement-IQ Discrepancies

Page 74: Statistical Considerations for Educational Screening & Diagnostic Assessments

Typical “Diagnostic/Screening” Q’s

What is the relationship (WITR) between blood characteristics and being HIV positive?

What is the relationship between electromagnetic signals and correctly distinguishing signal from noise?

What is the relationship between students’ scores on the Scholastic Reading Inventory and future risk on the SAT-10?

Page 75: Statistical Considerations for Educational Screening & Diagnostic Assessments

What is our question?

Correlational? Bivariate Correlation, Interrater Reliability

Discrimination? Logistic Regression, Discriminant Function Analysis, Receiver Operating Characteristic (ROC) Curves

Page 76: Statistical Considerations for Educational Screening & Diagnostic Assessments

ROC

Graphical representation of operating points
Multiple indices of efficiency
Moving cut-points
Outperforms other techniques in diagnostic efficiency (Hintze, 2005)

Page 77: Statistical Considerations for Educational Screening & Diagnostic Assessments

Advantages of using ROC

It summarizes the quality of a test or prediction without requiring a single cut-off value for decision making

Greater flexibility in diagnostic accuracy and predictive power

Assuming a normal distribution: the mean and standard error can be estimated, the 95% CI can be estimated, and statistical significance can be determined (see the sketch after this list)

Whether one test is better than another can be determined
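One standard way to get the standard error and confidence interval mentioned above is the Hanley & McNeil (1982) approximation; a minimal sketch (the AUC and group sizes below are illustrative placeholders):

```python
import math

def auc_se_hanley_mcneil(auc, n_pos, n_neg):
    """Hanley & McNeil (1982) standard error of an estimated AUC."""
    q1 = auc / (2 - auc)
    q2 = 2 * auc ** 2 / (1 + auc)
    var = (auc * (1 - auc)
           + (n_pos - 1) * (q1 - auc ** 2)
           + (n_neg - 1) * (q2 - auc ** 2)) / (n_pos * n_neg)
    return math.sqrt(var)

auc = 0.82                 # illustrative value
se = auc_se_hanley_mcneil(auc, n_pos=100, n_neg=100)
print(auc - 1.96 * se, auc + 1.96 * se)   # approximate 95% CI
```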

Page 78: Statistical Considerations for Educational Screening & Diagnostic Assessments

Old School Discrimination

Form two groups
Give the test

4 outcomes:
People who have the attribute and were detected
People who have the attribute and were not detected
People who don’t have the attribute and were detected
People who don’t have the attribute and were not detected

Using the results

Page 79: Statistical Considerations for Educational Screening & Diagnostic Assessments

What is a ROC Curve?

[ROC curve plot: sensitivity (y-axis) against 1-specificity (x-axis), both running from 0 to 1.]

Page 80: Statistical Considerations for Educational Screening & Diagnostic Assessments

What is a ROC Curve?

[ROC curve plot: sensitivity against 1-specificity, built from cumulative frequency percentages.]

Page 81: Statistical Considerations for Educational Screening & Diagnostic Assessments

Data Scheme

SRI Lexile Score    SAT-10 < 40th %ile (cumulative, y-axis)    SAT-10 >= 40th %ile (cumulative, x-axis)
505                 35 (.35)                                    5 (.05)
520                 30 (.65)                                   10 (.15)
550                 20 (.85)                                   20 (.35)
600                 10 (.95)                                   30 (.65)
700                  5 (1.00)                                  35 (1.00)
TOTALS              N = 100                                    N = 100
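A minimal sketch of how the cumulative proportions in this table become ROC operating points and a trapezoidal area under the curve:

```python
import numpy as np

# Cumulative proportions from the data scheme (cut scores 505-700), plus the origin
tpr = np.array([0.00, 0.35, 0.65, 0.85, 0.95, 1.00])   # sensitivity (y-axis)
fpr = np.array([0.00, 0.05, 0.15, 0.35, 0.65, 1.00])   # 1 - specificity (x-axis)

# Trapezoidal area under the ROC curve
auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0)
print(auc)   # about 0.82 for these operating points
```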

Page 82: Statistical Considerations for Educational Screening & Diagnostic Assessments

What is a ROC Curve?

[ROC curve plot: the five cut scores from the data scheme plotted as operating points, sensitivity against 1-specificity.]

Page 83: Statistical Considerations for Educational Screening & Diagnostic Assessments

Confusion Matrix

A 2x2 table crossing the screen decision with the criterion:

                     SAT-10 score
SRI score            < 40th %ile    >= 40th %ile
At-Risk              A              B
Not At-Risk          C              D

Page 84: Statistical Considerations for Educational Screening & Diagnostic Assessments

Confusion Matrix

(In the same 2x2 table, A = true positives and C = false negatives.)

Sensitivity: SE = A / (A + C)

Page 85: Statistical Considerations for Educational Screening & Diagnostic Assessments

Confusion Matrix

(B = false positives, D = true negatives.)

Specificity: SP = D / (B + D)

Page 86: Statistical Considerations for Educational Screening & Diagnostic Assessments

Confusion Matrix

(A = true positives, B = false positives.)

Positive predictive power: PPP = A / (A + B)

Page 87: Statistical Considerations for Educational Screening & Diagnostic Assessments

Confusion Matrix

(C = false negatives, D = true negatives.)

Negative predictive power: NPP = D / (C + D)

Page 88: Statistical Considerations for Educational Screening & Diagnostic Assessments

Confusion Matrix

Overall correct classification: OCC = (A + D) / (A + B + C + D)

Page 89: Statistical Considerations for Educational Screening & Diagnostic Assessments

Confusion Matrix

Base rate: BR = (A + C) / (A + B + C + D)

Page 90: Statistical Considerations for Educational Screening & Diagnostic Assessments

Confusion Matrix

                     SAT-10 score
SRI score            < 40th %ile    >= 40th %ile
At-Risk              A (TP)         B (FP)
Not At-Risk          C (FN)         D (TN)

SE = A / (A + C)          SP = D / (B + D)
PPP = A / (A + B)         NPP = D / (C + D)
OCC = (A + D) / (A + B + C + D)
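Collecting the cell formulas above in one place, a minimal sketch (cell labels follow the slides: A = true positive, B = false positive, C = false negative, D = true negative):

```python
def classification_indices(a, b, c, d):
    """Screening indices from a 2x2 confusion matrix (A=TP, B=FP, C=FN, D=TN)."""
    n = a + b + c + d
    return {
        "SE":  a / (a + c),         # sensitivity
        "SP":  d / (b + d),         # specificity
        "PPP": a / (a + b),         # positive predictive power
        "NPP": d / (c + d),         # negative predictive power
        "OCC": (a + d) / n,         # overall correct classification
        "BR":  (a + c) / n,         # base rate of the criterion
    }

# The 505 cut score worked through on the following slides
print(classification_indices(a=35, b=5, c=65, d=95))
```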

Page 91: Statistical Considerations for Educational Screening & Diagnostic Assessments

Data Scheme

SRI Lexile Score    SAT-10 < 40th %ile (cumulative, y-axis)    SAT-10 >= 40th %ile (cumulative, x-axis)
505                 35 (.35)                                    5 (.05)
520                 30 (.65)                                   10 (.15)
550                 20 (.85)                                   20 (.35)
600                 10 (.95)                                   30 (.65)
700                  5 (1.00)                                  35 (1.00)
TOTALS              N = 100                                    N = 100

Page 92: Statistical Considerations for Educational Screening & Diagnostic Assessments

[ROC curve plot: all five operating points from the data scheme, sensitivity against 1-specificity.]

Page 93: Statistical Considerations for Educational Screening & Diagnostic Assessments

[ROC curve plot highlighting the operating point for the 505 cut score: sensitivity .35, 1-specificity .05.]

Page 94: Statistical Considerations for Educational Screening & Diagnostic Assessments

Classification – Example 1

Evaluation of Cut Scores: Lexile = 505

                     SAT-10
SRI                  At-Risk    Not At-Risk    Total
At-Risk              35         5              40
Not At-Risk          65         95             160
Total                100        100            200

SE = .35    PPP = .88
SP = .95    NPP = .59
FN = .65    OCC = .65
FP = .05

Page 95: Statistical Considerations for Educational Screening & Diagnostic Assessments

[ROC curve plot highlighting the operating point for the 520 cut score: sensitivity .65, 1-specificity .15.]

Page 96: Statistical Considerations for Educational Screening & Diagnostic Assessments

Classification – Example 2

Evaluation of Cut Scores: Lexile = 520

                     SAT-10
SRI                  At-Risk    Not At-Risk    Total
At-Risk              65         15             80
Not At-Risk          35         85             120
Total                100        100            200

SE = .65    PPP = .81
SP = .85    NPP = .71
FN = .35    OCC = .75
FP = .15

Page 97: Statistical Considerations for Educational Screening & Diagnostic Assessments

Cut-Point Selection

Lexile = 520                              Lexile = 505
SE = .65    PPP = .81                     SE = .35    PPP = .88
SP = .85    NPP = .71                     SP = .95    NPP = .59
FN = .35    OCC = .75                     FN = .65    OCC = .65
FP = .15                                  FP = .05

(Cell counts for both cut scores as shown on the two preceding slides; criterion is the SAT-10 in both cases.)

Choose Lexile = 520, right?

Page 98: Statistical Considerations for Educational Screening & Diagnostic Assessments

What are you Maximizing?

Properties of the Test (population): Sensitivity, Specificity

Properties of the Sample: Positive Predictive Power, Negative Predictive Power

It’s All About the Base Rate!

Page 99: Statistical Considerations for Educational Screening & Diagnostic Assessments

Base Rates are Variables Too!!

Screening Test: the RSN test of Fan Fanaticism, SE = .95, SP = .90

Administered to Two Samples:
Sample 1 – 2,000 people in Boston, where 50% have the problem
Sample 2 – 2,000 people in New York, where 15% have the problem

Page 100: Statistical Considerations for Educational Screening & Diagnostic Assessments

Boston: Base Rate of 50%

                  Criterion
RSN               Jail      No Jail    Total
At-Risk           950       100        1,050
Not At-Risk       50        900        950
Total             1,000     1,000      2,000

Sensitivity = 950/1,000 = .95
Specificity = 900/1,000 = .90
PPP = 950/1,050 = .90
NPP = 900/950 = .95
Overall Correct Classification = (950 + 900)/2,000 = .925

Page 101: Statistical Considerations for Educational Screening & Diagnostic Assessments

NYC: Base Rate of 15%

                  Criterion
RSN               Jail      No Jail    Total
At-Risk           285       170        455
Not At-Risk       15        1,530      1,545
Total             300       1,700      2,000

Sensitivity = 285/300 = .95
Specificity = 1,530/1,700 = .90
PPP = 285/455 = .63
NPP = 1,530/1,545 = .99
Overall Correct Classification = (285 + 1,530)/2,000 = .91
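A minimal sketch of the computation behind this Boston/New York comparison: hold sensitivity and specificity fixed and let the base rate vary.

```python
def predictive_power(n, base_rate, se, sp):
    """Expected 2x2 cells and predictive power for a given base rate."""
    pos = n * base_rate              # people who truly have the attribute
    neg = n - pos
    tp, fn = se * pos, (1 - se) * pos
    tn, fp = sp * neg, (1 - sp) * neg
    return {"PPP": tp / (tp + fp), "NPP": tn / (tn + fn), "OCC": (tp + tn) / n}

print(predictive_power(2000, 0.50, se=0.95, sp=0.90))   # Boston: PPP near .90
print(predictive_power(2000, 0.15, se=0.95, sp=0.90))   # New York: PPP near .63
```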

Page 102: Statistical Considerations for Educational Screening & Diagnostic Assessments

Comparison

Indices    Boston    NYC
SE         0.95      0.95
SP         0.90      0.90
PPP        0.90      0.63
NPP        0.95      0.99
OCC        0.93      0.91


Page 104: Statistical Considerations for Educational Screening & Diagnostic Assessments

Applied to All 1st Graders in Florida: Base Rate of 15%

                  Criterion
Screen            Problem    No Problem    Total
At-Risk           26,754     15,958        42,712
Not At-Risk       1,408      143,626       145,034
Total             28,162     159,584       187,746

Sensitivity = 26,754/28,162 = .95
Specificity = 143,626/159,584 = .90
PPP = 26,754/42,712 = .63
NPP = 143,626/145,034 = .99
Overall Correct Classification = (26,754 + 143,626)/187,746 = .91

Page 105: Statistical Considerations for Educational Screening & Diagnostic Assessments

Statewide Screening

If we had employed a test with those measurement properties statewide to detect children who were at risk for reading problems, we would have mislabeled around 16,000 kids as at-risk who weren’t (37% of those flagged).

However, we would have missed only about 1,400 students who needed services but were not flagged for them (1% of those not flagged).

Page 106: Statistical Considerations for Educational Screening & Diagnostic Assessments

Concluding Thoughts - Reliability

Researchers: Evaluating other methods of reliability
Precision
Generalizability

Practitioners: What is being reported?
Internal consistency, test-retest, etc.
How reliable is it?
Nunnally/Bernstein guidelines: > .80 for research, > .90 for clinical decisions

Page 107: Statistical Considerations for Educational Screening & Diagnostic Assessments

Concluding Thoughts – Factor Validity

Researchers: Testing additional specifications outside of the traditional unidimensional/multidimensional framework
Bi-factor, Causal Indicator, etc.

Practitioners: What type of factor analysis was done?
EFA/CFA
Rules of thumb? Too many (N = 200?)

Page 108: Statistical Considerations for Educational Screening & Diagnostic Assessments

Concluding Thoughts - Benchmarking

Researchers: Improve the rigor of our methods
ROC, Diagnostic Measurement, Cost Curves

Practitioners:
Identify what “at-risk” means
Establish the goal of the screening process
Study how the screen was developed
Determine the base rate
Attend to positive and negative predictive power
Collect local data

Page 109: Statistical Considerations for Educational Screening & Diagnostic Assessments

Implications of these Considerations

We must be careful in how we choose assessments: AYP, value-added modeling, promotion/retention

Moving toward a new phase in assessments: computer-delivered and computer-adaptive (Smarter Balanced, FCRR, RFU)

Be more aware of what other disciplines are doing
Be more aware of what’s in the older literature

Technology!

Page 110: Statistical Considerations for Educational Screening & Diagnostic Assessments

Great Resources

IRT: The Theory and Practice of Item Response Theory (de Ayala); Fundamentals of Item Response Theory (Hambleton et al.)

Factor Analysis: CFA for Applied Research (Brown)

SEM: A Beginner’s Guide to SEM (Schumacker & Lomax)

ROC analysis: Analyzing ROC Curves with SAS (Gonen)

Page 111: Statistical Considerations for Educational Screening & Diagnostic Assessments

Resources

Shameless Plug

IRT: R.J. De Ayala
Factor Analysis: Rex Kline
SEM: Richard Lomax
Benchmarking: Chris Schatschneider

Page 112: Statistical Considerations for Educational Screening & Diagnostic Assessments

End