Assessing the Assessment Reliability. Am I measuring something? Validity. Am I measuring what I...

Page 1: Assessing the Assessment Reliability. Am I measuring something? Validity. Am I measuring what I think I am measuring? Test-retest Interobserver agreement.

Assessing the Assessment

Reliability. Am I measuring something?

Test-retest

Interobserver agreement

Parallel forms

Split-half (internal consistency)

Validity. Am I measuring what I think I am measuring?

Content

Criterion

Construct

Reliability is a necessary prerequisite for validity.

Page 2

Reliability

Reliability refers to the consistency of a measure across:

Time

Versions

Raters

And so on

A reliable test has little measurement error.

Observed Score = True Score + Error

Reliability = True Score Variability / (True Score Variability + Error Variability)
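The relationship Observed Score = True Score + Error can be turned into a small calculation. The sketch below assumes that error is unrelated to the true score, so that the two variabilities simply add up to the observed variability. The function name and the example numbers are illustrative, not taken from the slides.

```python
def reliability(true_score_variance: float, error_variance: float) -> float:
    """Share of observed-score variability that is true-score variability.

    Assumes error is uncorrelated with the true score, so that
    observed variance = true-score variance + error variance.
    """
    observed_variance = true_score_variance + error_variance
    if observed_variance <= 0:
        raise ValueError("total variance must be positive")
    return true_score_variance / observed_variance


# Illustrative numbers only: less error variability means higher reliability.
print(reliability(8.0, 2.0))
print(reliability(8.0, 0.0))
```

With no error variability at all, the ratio reaches its maximum of 1, which matches the statement that a reliable test has little measurement error.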

Page 3

Reliability

True score: the true or perfectly accurate score

E.g. the time

Often a fictional mark in psychology

Based on multiple measurements

Aggregation = averaging a number of imprecise measurements to increase reliability
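Aggregation can be illustrated with a short simulation. This is a sketch under stated assumptions: a single fixed true score, normally distributed error, and made-up values for the true score, the error spread, and the number of measurements averaged. None of these come from the slides.

```python
import random
import statistics

rng = random.Random(0)      # fixed seed so the run is repeatable
TRUE_SCORE = 100.0          # hypothetical true score
ERROR_SD = 10.0             # hypothetical spread of measurement error


def measure() -> float:
    """One imprecise measurement: true score plus random error."""
    return TRUE_SCORE + rng.gauss(0.0, ERROR_SD)


def aggregated_measure(n: int) -> float:
    """Average of n imprecise measurements of the same true score."""
    return statistics.fmean([measure() for _ in range(n)])


single = [measure() for _ in range(2000)]
averaged = [aggregated_measure(10) for _ in range(2000)]

# Spread around the true score is the error; averaging shrinks it.
single_spread = statistics.pstdev(single)
averaged_spread = statistics.pstdev(averaged)
print(single_spread, averaged_spread)
```

The averaged scores scatter much less around the true score than the single measurements do, which is the sense in which aggregation increases reliability.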

Page 4

Reliability

Test-retest: administer the same measure at two points in time

Interobserver agreement: multiple observers/judges/raters/scorers rate the same target

Parallel forms: compare alternate forms of the same test

Split-half reliability: split the test into two halves and compare scores across halves

Coefficient alpha: average of all possible split-half reliabilities
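The split-half idea and coefficient alpha can be sketched on a small table of item scores. The data below are invented for illustration (five respondents, four items). The odd/even split is only one of the possible splits, and alpha is computed with the usual variance-based formula rather than by literally averaging every split.

```python
import statistics


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / (ss_x * ss_y) ** 0.5


def split_half_r(scores):
    """Correlate totals on odd-numbered items with totals on even-numbered items."""
    odd_half = [sum(row[0::2]) for row in scores]
    even_half = [sum(row[1::2]) for row in scores]
    return pearson_r(odd_half, even_half)


def coefficient_alpha(scores):
    """Cronbach's alpha: k/(k-1) * (1 - sum of item variances / total variance)."""
    k = len(scores[0])
    item_variances = [statistics.variance(col) for col in zip(*scores)]
    total_variance = statistics.variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Invented data: rows are respondents, columns are items.
scores = [
    [4, 5, 4, 5],
    [2, 3, 2, 2],
    [5, 5, 4, 4],
    [1, 2, 1, 2],
    [3, 3, 4, 3],
]

half_r = split_half_r(scores)
alpha = coefficient_alpha(scores)
print(round(half_r, 2), round(alpha, 2))
```

The same `pearson_r` helper would serve for test-retest (scores at time 1 versus time 2) or parallel forms (form A versus form B), since each of those is also a correlation between two sets of scores from the same people.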

Page 5

Validity

Is the test measuring what I think it is?

This requires empirical demonstration

There are three types of validity:

Content Validity

Criterion Validity

Construct Validity

Page 6

Validity

Content Validity

A test has content validity if it adequately covers the area of content it is supposed to cover.

Difficult to examine statistically

Content validity typically must be built in at the beginning

Course exams are the best examples

Page 7

Validity

Criterion Validity

For criterion validity, tests are evaluated against some criterion

Often called predictive validity

Most at issue for tests employed to make decisions:

Selection of students

Parole decisions

Jobs

Page 8

Criterion Validity - Concurrent

Concurrent validity: does my measure correlate highly with an established measure?

Can my measurement instrument predict a criterion that occurs at the same point in time?

Can my measure (i.e. my operationalization) distinguish between two groups that it should be able to distinguish between?

Page 9

Criterion Validity - Predictive

Can my measure predict future behavior?

– If yes, it has predictive validity (a type of criterion validity)
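Criterion validity is usually summarized as a correlation between the measure and the criterion. The sketch below uses invented numbers: a test score collected first and an outcome collected later for the same five people. The variable names are hypothetical and the data do not come from any study cited in these slides.

```python
import statistics


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / (ss_x * ss_y) ** 0.5


# Invented data: the test is given first, the criterion is observed later.
test_scores = [150, 155, 160, 165, 170]
later_outcome = [3.0, 3.2, 3.1, 3.6, 3.8]

validity_coefficient = pearson_r(test_scores, later_outcome)
print(round(validity_coefficient, 2))
```

For concurrent validity the calculation is the same, except that the criterion is measured at the same point in time as the test.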

Page 10

Predictive Validity of the GRE

Graduate Record Examination

Kuncel, N.R., Hezlett, S.A., & Ones, D.S. (2001). A comprehensive meta-analysis of the predictive validity of the Graduate Record Examinations: Implications for graduate school student selection and performance. Psychological Bulletin, 127, 162-181.

Originally designed to measure “basic developed abilities relevant to performance in graduate studies”

Used often and heavily in decisions about admissions

Verbal measure: analogy, antonym, sentence completion, reading comprehension

Quantitative measure: quantitative, quantitative comparison, data interpretation

Analytic measure: analytical and logical reasoning

Subject test: acquired knowledge in particular area

Page 11

Predictive validity of GRE

Want to establish predictive validity of GRE

What will my criterion of graduate school performance be?

Use several indicators of “performance”:
– Graduate GPA
– 1st year graduate GPA
– Comprehensive exam scores
– Publication citation counts
– Faculty ratings
– (these are the criteria)

Page 12

Predictive Validity of the GRE

Page 13

Predictive Validity of the GRE

Page 14

Predictive Validity of the GRE

Page 15

Summary

All areas of GRE were found to be valid predictors of graduate GPA (GGPA), 1st year GGPA, faculty ratings, and comprehensive exam scores.

GRE subject tests were consistently better predictors of the criteria than quantitative or verbal tests; also better than undergraduate GPA (UGPA)

Page 16

Construct Validity

Most important type of validity

“If this were a measure of …, what would it look like?”

Depends heavily on theory:

How is this construct related to other constructs?

Requires broad thinking

In validating my construct, I am validating my theory

Page 17

Steps to establish construct validity

1. Need to establish convergent correlations

Measures of constructs that theoretically should be related to each other are, in fact, observed to be related to each other (that is, you should be able to show a correspondence or convergence between similar constructs)

2. Need to establish divergent correlations

Measures of constructs that theoretically should not be related to each other are, in fact, observed to not be related to each other (that is, you should be able to discriminate between dissimilar constructs)

3. Build nomological net

Page 18

Convergent validity

Measures that should be related are related

These 4 items are converging on the same thing (don’t know for sure that it is “self-esteem” yet)

Page 19

Divergent Validity

Self-esteem measures do not correlate with locus of control measures

These measures seem to be tapping different things

Page 20

Establishing convergent and divergent validity
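One way to look for convergent and divergent patterns is to correlate every pair of measures and compare the results with what theory predicts. The sketch below is a minimal illustration with invented scores for six people on two hypothetical self-esteem measures and one hypothetical locus of control measure. The names and numbers are assumptions, not data from the slides.

```python
import statistics
from itertools import combinations


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / (ss_x * ss_y) ** 0.5


# Invented scores; each list holds one measure for the same six people.
measures = {
    "self_esteem_a": [1, 2, 3, 4, 5, 6],
    "self_esteem_b": [2, 1, 4, 3, 6, 5],
    "locus_of_control": [4, 3, 1, 6, 5, 2],
}

# Correlate every pair of measures.
correlations = {
    (name_1, name_2): pearson_r(measures[name_1], measures[name_2])
    for name_1, name_2 in combinations(measures, 2)
}

for pair, r in correlations.items():
    print(pair, round(r, 2))

convergent = correlations[("self_esteem_a", "self_esteem_b")]
divergent = correlations[("self_esteem_a", "locus_of_control")]
```

In this made-up example the two self-esteem measures correlate strongly with each other (convergence), while each is close to uncorrelated with locus of control (discrimination).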

Page 21

Nomological Network

Must develop a “lawful network” for your measure in order to establish construct validity.

Includes
– Theoretical framework
– Empirical framework
– Observables

Page 22

Childhood Psychopathy Scale

Lynam, D.R. (1997). Pursuing the psychopath: Capturing the fledgling psychopath in a nomological net. Journal of Abnormal Psychology, 106, 425-438.

“The construct of psychopathy and attendant personality information might profitably be used at the childhood level to identify a more homogeneous group of antisocial children.”

Page 23

Psychopathy

The [psychopath] is unfamiliar with the primary facts or data of what might be called personal values and is altogether incapable of understanding such matters.

It is impossible for him to take even a slight interest in the tragedy or joy or the striving of humanity as presented in serious literature or art. He is also indifferent to all these matters in life itself. Beauty and ugliness, except in a very superficial sense, goodness, evil, love, horror, and humour have no actual meaning, no power to move him.

He is, furthermore, lacking in the ability to see that others are moved. It is as though he were colour-blind, despite his sharp intelligence, to this aspect of human existence. It cannot be explained to him because there is nothing in his orbit of awareness that can bridge the gap with comparison. He can repeat the words and say glibly that he understands, and there is no way for him to realize that he does not understand (Cleckley, 1941, p. 90 quoted in Hare, 1993, pp. 27-28).

Page 24

• Developed Childhood Psychopathy Scale
• Principles of rational scale construction
• Working from Psychopathy Checklist (PCL-R), identified mother-reported items that assessed PCL-R constructs

Operationalized 13 of the 20 PCL-R constructs as 3- to 4-item scales – glibness, untruthfulness, manipulation, lack of guilt, poverty of affect, callousness, parasitic lifestyle, behavioral dyscontrol, lack of planning, impulsiveness, unreliability, failure to accept responsibility, criminal versatility

Page 25

Items on the CPS

Page 26

Construct Validity of the CPS

If the CPS is truly assessing psychopathy, scores on the CPS should be positively related to serious delinquency

Page 27

Construct Validity of the CPS

If the CPS is truly assessing psychopathy, scores on the CPS should be positively related to stable delinquency

Page 28

Construct Validity of the CPS

If the CPS is truly assessing psychopathy, scores on the CPS should be positively related to impulsivity

Page 29

Construct Validity of the CPS

If the CPS is assessing psychopathy, scores on the CPS should be positively related to externalizing problems and negatively related to internalizing problems

Page 30

Construct Validity of the CPS

If the CPS is assessing psychopathy, scores on the CPS should predict delinquency above and beyond other well known predictors
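Prediction "above and beyond" other predictors is commonly checked by asking how much explained variance (R squared) rises when the new measure is added to a model that already contains a known predictor. The sketch below is one simple way to do that for exactly two predictors, using the standard formula that expresses R squared in terms of the three pairwise correlations. All data and variable names are invented for illustration and are not from Lynam (1997).

```python
import statistics


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / (ss_x * ss_y) ** 0.5


def r_squared_two_predictors(r_y1, r_y2, r_12):
    """R squared for an outcome regressed on two predictors.

    r_y1, r_y2: correlations of each predictor with the outcome.
    r_12: correlation between the two predictors (must not be +1 or -1).
    """
    return (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)


# Invented scores for six people.
known_predictor = [1, 2, 3, 4, 5, 6]   # an established predictor
new_measure = [3, 1, 2, 6, 4, 5]       # the measure being validated
outcome = [4, 3, 6, 9, 9, 11]          # the criterion to be predicted

r_y1 = pearson_r(outcome, known_predictor)
r_y2 = pearson_r(outcome, new_measure)
r_12 = pearson_r(known_predictor, new_measure)

r2_base = r_y1 ** 2                                   # known predictor alone
r2_full = r_squared_two_predictors(r_y1, r_y2, r_12)  # both predictors
incremental_r2 = r2_full - r2_base                    # gain from the new measure
print(round(r2_base, 3), round(r2_full, 3), round(incremental_r2, 3))
```

A positive gain means the new measure explains some variance in the outcome that the known predictor does not. With real data this gain would also be tested for statistical significance, which the sketch leaves out.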