
Construct Validity: A Universal Validity System

Susan Embretson

Georgia Institute of Technology

Introduction

• Validity is a controversial concept in educational and psychological testing

• Research on educational and psychological tests during the last half of the 20th century was guided by a distinction among types of validity
  • Criterion-related validity, content validity and construct validity

• Construct validity is the most problematic type of validity
  • It involves theory and the relationship of data to theory

Introduction

• Yet the most controversial type of validity became the sole type of validity in the revised joint standards for educational and psychological tests (AERA/APA/NCME, 1999)

• In the current standards, "Validity refers to the degree to which evidence and theory support the interpretations of test scores entailed by proposed uses of tests"

• Content validity and criterion-related validity are two of five different kinds of evidence.

• Reflects substantial impact from Messick's (1989) thesis of a single type of validity (construct validity) with several different aspects.

Topics

Overview of the validity concept

Current issues on validity
  Discontent with construct validity for educational tests
  Need for content validity

Critique of content validity as basis for educational testing

Universal system for construct validity
  Applies to all tests
    Achievement tests
    Ability tests
    Personality/psychopathology

Summary

History of the Construct Validity Concept: Origins

• American Psychological Association (1954). Technical recommendations for psychological tests and diagnostic techniques. Psychological Bulletin, 51, 2, 1-38.

• Prepared by a joint committee of the American Psychological Association, American Educational Research Association, and National Council on Measurements Used in Education.

– "Validity information indicates to the test user the degree to which the test is capable of achieving certain aims. … Thus, a vocabulary test might be used simply as a measure of present vocabulary, as a predictor of college success, as a means of discriminating schizophrenics from organics, or as a means of making inferences about 'intellectual capacity.'"

– "We can distinguish among the four types of validity by noting that each involves a different emphasis on the criterion." (p. 13)

History of the Construct Validity Concept: Origins

Types of validity by use

Content validity
  "The test user wishes to determine how an individual would perform at present in a given universe of situations of which the test situation constitutes a sample."

Predictive validity
  "The test user wishes to predict an individual's future performance."

Concurrent validity
  "The test user wishes to estimate an individual's present status on some variable external to the test."

Construct validity
  "The test user wishes to infer the degree to which the individual possesses some trait or quality (construct) presumed to be reflected in the test performance."

History of the Construct Validity Concept: Origins

Cronbach, L. J. & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281-302.

"We can distinguish among four types of validity by noting that each one puts a different emphasis on the criterion. In predictive or concurrent, the criterion behavior is of concern to the tester and he may have no concern whatever with the type of behavior observed on the test"

“ Content validity is studied when the tester is concerned with the type of behavior in the test performance. Indeed, if the test is a work sample, the test may be an end in itself.”

"Construct validity is ordinarily studied when the tester has no definite criterion measure of the quality with which he is concerned, and must use indirect measures. Here the trait or quality underlying the test is of central importance…"

Implications of Original Views

• Same test can be used in different ways

• Relevant type of validity depends on test use

• The types of validity differ in the importance of the behaviors involved in the test

More Recent Views on Types of Validity

• Standards for Educational and Psychological Testing (1954, 1966, 1974, 1985, 1999)

• 1985
  – "Traditionally, the various means of accumulating validity evidence have been grouped into categories called content-related, criterion-related and construct-related evidence of validity. … These categories are convenient … but the use of category labels does not imply that there are distinct types of validity …"
  – "An ideal validation includes several types of evidence, which span all three of the traditional categories."

Conceptualizations of Validity: Psychological Testing Textbooks

• "All validity analyses address the same basic question: Does the test measure knowledge and characteristics that are appropriate to its purpose? There are three types of validity analysis, each answering this question in a slightly different way." (Friedenberg, 1995)

• "… the types of validity are potentially independent of one another." (Murphy & Davidshofer, 1988)

• "There are three types of evidence: (1) construct-related, (2) criterion-related, and (3) content-related." … "It is important to emphasize that categories for grouping different types of validity are convenient; however, the use of categories does not imply that there are distinct forms of validity." (Kaplan & Saccuzzo, 1993)

Most Recent View on Types of Validity

• Standards for Educational and Psychological Testing 1999
  – "Validity refers to the degree to which evidence and theory support the interpretations of test scores entailed by proposed uses of tests. The proposed interpretation refers to the construct or concepts that the test is intended to represent." (p. 9)
  – "These sources of evidence may illuminate different aspects of validity, but they do not represent distinct types of validity. Validity is a unitary concept."
  – "The wide variety of tests and circumstances makes it natural that some types of evidence will be especially critical in a given case, whereas other types will be less useful." (p. 9)
  – "Because a validity argument typically depends on more than one proposition, strong evidence in support of one in no way diminishes the need for evidence to support others." (p. 11)

The Sources of Validity Evidence

Evidence based on test content
  Logical & empirical analysis of adequacy in representing a content domain -- includes themes, wording, item format and procedures for administration & scoring

Evidence based on response processes
  Theoretical and empirical analysis of the test taker's response process with respect to the construct

Evidence based on internal structure
  Relationships among test items correspond to construct structure (a small sketch of such evidence follows this list)

Evidence based on relations to other variables
  Convergent & discriminant evidence
  Test-criterion relationships
  Validity generalization

Evidence based on the consequences of testing
  Differential impact by group, claims of testing benefits
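As a concrete illustration of evidence based on internal structure, the sketch below computes Cronbach's alpha and corrected item-total correlations for a simulated item-response matrix. It is a minimal sketch with invented data (a hypothetical 10-item test), not an analysis from the presentation.

```python
# Minimal sketch, invented data: summarizing internal-structure evidence
# (Cronbach's alpha and corrected item-total correlations).
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=(200, 1))                              # latent proficiency
responses = (rng.normal(size=(200, 10)) < theta).astype(int)   # 200 x 10 scored items

def cronbach_alpha(x):
    """Cronbach's alpha from an examinee-by-item score matrix."""
    k = x.shape[1]
    return k / (k - 1) * (1 - x.var(axis=0, ddof=1).sum() / x.sum(axis=1).var(ddof=1))

total = responses.sum(axis=1)
item_total_r = [np.corrcoef(responses[:, j], total - responses[:, j])[0, 1]
                for j in range(responses.shape[1])]

print("alpha =", round(cronbach_alpha(responses), 3))
print("corrected item-total correlations:", np.round(item_total_r, 2))
```

High internal consistency alone does not establish what the items measure; it is one strand of internal evidence alongside content and response-process evidence.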

Implications of 1999 Validity Concept

No distinct types of validity

Multiple sources of evidence for a single test aim
  Example: a mathematical achievement test used to assess readiness for a more advanced course

Propositions for inference (a sketch of checking propositions 4 and 6 follows this list)
  1) Certain skills are prerequisite for the advanced course
  2) Content domain structure for the test represents the skills
  3) Test scores represent domain performance
  4) Test scores are not unduly influenced by irrelevant variables, such as writing ability, spatial ability, anxiety, etc.
  5) Success in the advanced course can be assessed
  6) Test scores are related to success in the advanced curriculum
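One way propositions 4 and 6 might be examined empirically is to compare the test's correlation with the later criterion against its correlation with a construct-irrelevant variable. The sketch below uses entirely invented data and variable names; it is an illustration, not the author's method.

```python
# Minimal sketch, invented data: checking propositions 4 and 6 above.
import numpy as np

rng = np.random.default_rng(1)
n = 300
math_skill = rng.normal(size=n)                      # intended construct
writing = rng.normal(size=n)                         # construct-irrelevant variable
math_score = math_skill + 0.2 * writing + 0.4 * rng.normal(size=n)
course_success = 0.7 * math_skill + 0.5 * rng.normal(size=n)

def r(a, b):
    return round(np.corrcoef(a, b)[0, 1], 2)

# Proposition 6: test scores are related to success in the advanced curriculum
print("r(math score, course success) =", r(math_score, course_success))
# Proposition 4: scores should not be unduly influenced by irrelevant variables
print("r(math score, writing) =", r(math_score, writing))
```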

Current Issues with the Validity Concept: Educational Testing

Crocker (2003)
  Content aspect of validity deserves more prominence
  Educational accountability needs content representativeness
  More methods for content-related evidence needed
    Design -- test specification and item generation
    Item review tasks; subject matter expert reliability
    Data analysis techniques for content judgments

Fremer (2000)
  Construct validity is an unreachable goal

Borsboom, Mellenbergh & van Heerden (2004)
  Current validity theory "fails to serve either the theoretically oriented psychologist or the practically inclined tester"

Current Issues with the Validity Concept: Educational Testing

Lissitz and Samuelsen (2007)
  Propose some changes in terminology and emphasis in the validity concept
  Argue that "construct validity as it currently exists has little to offer test construction in educational testing"

In fact, their system leads to a most startling conclusion
  Construct validity is irrelevant to defining what is measured by an educational test!!
  Content validity becomes primary in determining what an educational test measures

Current Issues with the Validity Concept: Educational Testing

Several published responses in Educational Researcher
  Embretson, S. E. (2007). Construct validity: A universal validity system or just another test evaluation procedure? Educational Researcher, 36(8), 449-455.

Lissitz' response: Organize a conference!

Critique of Content Validity as Basis for Educational Testing

• Content validity is not up to the burden of defining what is measured by a test

• Relying on content validity evidence, as available in practice, to determine the meaning of educational tests could have detrimental impact on test quality

• Giving content validity primacy for educational tests could lead to very different types and standards of evidence for educational and psychological tests

Validity in Educational Tests: Response to Lissitz & Samuelsen

• Background
  • Embretson, S. E. (1983). Construct validity: Construct representation versus nomothetic span. Psychological Bulletin, 93, 179-197.

• Construct representation
  • Establishes the meaning of test scores by identifying the theoretical mechanisms that underlie test performance (i.e., the processes, strategies and knowledge)

• Nomothetic span
  • Establishes the significance of test scores by identifying the network of relationships of test scores with other variables

Validity in Lissitz and Samuelsen's Framework

Taxonomy of test evaluation procedures

1) Investigative Focus
   Internal sources = analysis of the test and its items
     Provides evidence about what is measured
   External sources = relationship of test scores to other measures & criteria
     Provides evidence about impact, utility and trait theory

2) Perspective
   Theoretical orientation = concern with measuring traits
   Practical orientation = concern with measuring achievement

Figure 2. Taxonomy of Test Evaluation Procedures

Investigative Focus    Perspective: Theoretical    Perspective: Practical
Internal               Latent Process              Content and Reliability
External               Nomological Network         Utility and Impact

Figure 1. The Structure of the Technical Evaluation of Educational Testing

[Diagram: Test Evaluation branches into Internal and External sources. Internal: Latent Process, Content, Reliability. External: Utility (Criterion), Theory (Nomological), Impact. Validity is placed under the internal branch.]

Implications for Validity

System represents best current practices

Internal meaning (validity) established
  For educational tests, content and reliability evidence
    Evidence based on internal structure (i.e., reliability, etc.)
    Evidence based on test content
  For psychological tests, depends on latent processes
    Evidence based on response processes
    Evidence based on internal structure (item correlations)

But, notice the limitations
  Response process and test content evidence are not relevant to both types of tests
  External evidence based on relations to other variables has no role in validity

External Evidence Only?

Construct validity is removed from the validity sphere!
  Critical to this view of construct validity is its classification as external evidence

However, Cronbach and Meehl's conceptualization did include internal sources of evidence
  Studies of internal structure
  Studies of change
  Studies of processes

Within the nomological network, these sources would be classified as test-to-construct evidence.

Thus, construct validity need not be decentralized for this reason

Current Practice of Construct Validity

However, internal sources of information have no priority in Cronbach and Meehl
  Simply another source of evidence

Considering only external sources may characterize some current practices
  Re-conceptualize test meaning based on external evidence rather than develop new tests

Concern about the strong role of external sources motivated Embretson's (1983) distinctions
  If internal sources are primary, then item and test design principles can become central in establishing test validity (Embretson, 1995)

Construct Validity for Psychological Tests in a Revised Taxonomy

• If construct validity included internal sources
  • Now crucial to meaning for psychological tests
  • Requires scientific foundation for item and test design principles
    • Impact of item features and testing procedures on KSAs

• But, the concept of construct validity is still not extended to include internal evidence for educational tests
  • Test meaning depends primarily on content-related evidence and reliability evidence

Internal Evidence for Educational Tests

The reliability concept in the Lissitz and Samuelsen framework is generally multifaceted and traditional
  Item interrelationships
  Relationship of test scores over conditions or time
  Differential item functioning (DIF) (a minimal DIF sketch follows this slide)
  Adverse impact

(Perhaps adverse impact and DIF could be considered external information)
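As one illustration of the DIF evidence listed above, the sketch below computes the Mantel-Haenszel common odds ratio for a single item, matching examinees on a total-score stratum. The data, group sizes, and the injected DIF effect are all made up for the example.

```python
# Minimal sketch, invented data: Mantel-Haenszel DIF check for one item.
import numpy as np

rng = np.random.default_rng(2)
n = 1000
group = rng.integers(0, 2, n)                    # 0 = reference, 1 = focal
stratum = rng.integers(0, 5, n)                  # matching variable (total-score stratum)
p_correct = 0.3 + 0.1 * stratum - 0.15 * group   # the 0.15 term injects DIF
item = (rng.random(n) < p_correct).astype(int)

num = den = 0.0
for s in np.unique(stratum):
    m = stratum == s
    a = np.sum((group[m] == 0) & (item[m] == 1))   # reference, correct
    b = np.sum((group[m] == 0) & (item[m] == 0))   # reference, incorrect
    c = np.sum((group[m] == 1) & (item[m] == 1))   # focal, correct
    d = np.sum((group[m] == 1) & (item[m] == 0))   # focal, incorrect
    t = a + b + c + d
    num += a * d / t
    den += b * c / t

alpha_mh = num / den                               # near 1.0 means little DIF
print("MH common odds ratio:", round(alpha_mh, 2))
print("MH delta (ETS scale):", round(-2.35 * np.log(alpha_mh), 2))
```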

Internal Evidence for Educational Tests

• Concept of Content Validity
  • Previous test standards (1985)
    Content validity was a type of evidence that "… demonstrates the degree to which a sample of items, tasks or questions on a test are representative of some defined universe or domain of content"

• Two important elements added by L&S
  • Cognitive complexity level
    "whether the test covers the relevant instructional or content domain and the coverage is at the right level of cognitive complexity"
  • Test development procedures
    Information about item writer credentials and quality control

Test Blueprints as Content Validity Evidence

Blueprints specify percentages of test items that should fall in various categories

Example: test blueprint for NAEP mathematics
  Five content strands
  Three levels of complexity
  Majority of states employ similar strands

But, several reasons why blueprints and other forms of test specifications (along with reliability evidence) are not sufficient to establish meaning for an educational test

1. Domain Structure is a Theory Which Changes Over Time

NAEP framework, particularly for cognitive complexity, has evolved (NAGB, 2006)

Views on complexity level also may change based on empirical evidence, such as item difficulty modeling, task decomposition and other methods (a small item difficulty modeling sketch follows this slide)

Changes in domain structure also could evolve in response to recommendations of panels of experts
  National Mathematics Advisory Panel
    Recommends changes in the basic strands
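The item difficulty modeling mentioned above can be sketched as a regression of item difficulties on coded item features, in the spirit of the linear logistic test model. The item features, their codes, and the difficulty values below are invented for illustration.

```python
# Minimal sketch, invented items: regress item difficulty on coded features.
import numpy as np

# Columns: [multi-step reasoning, real-world context, number of operations]
features = np.array([
    [0, 0, 1], [0, 1, 1], [1, 0, 2], [1, 1, 2],
    [0, 0, 2], [1, 1, 3], [0, 1, 2], [1, 0, 3],
], dtype=float)
difficulty = np.array([-1.2, -0.8, 0.3, 0.6, -0.5, 1.4, -0.1, 1.0])  # e.g., IRT b-values

X = np.column_stack([np.ones(len(features)), features])   # add intercept
weights, *_ = np.linalg.lstsq(X, difficulty, rcond=None)
predicted = X @ weights
r2 = 1 - np.sum((difficulty - predicted) ** 2) / np.sum((difficulty - difficulty.mean()) ** 2)

print("feature weights (intercept first):", np.round(weights, 2))
print("proportion of difficulty variance explained:", round(r2, 2))
```

To the extent that complexity-related features account for difficulty, such evidence bears on whether the blueprint's complexity levels function as intended.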

2. Reliability of Classifications is Not Well Documented

Scant evidence that items can be reliably classified into the blueprint categories

Certain factors in an achievement domain may make these categorizations difficult
  For example, in mathematics a single real-world problem may involve algebra and number sense, as well as measurement content
  The item could be classified into three of the five strands.

Similarly, classifying items for mathematical complexity also can be difficult
  Abstract definitions of the various levels in many systems
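One way the reliability of such classifications could be documented is with an agreement index such as Cohen's kappa between independent reviewers assigning items to content strands. The strand labels and ratings below are hypothetical.

```python
# Minimal sketch, hypothetical ratings: Cohen's kappa for two reviewers
# classifying ten items into blueprint content strands.
import numpy as np

strands = ["number", "algebra", "geometry", "measurement", "data"]
rater_1 = ["algebra", "number", "algebra", "geometry", "data",
           "measurement", "algebra", "number", "geometry", "data"]
rater_2 = ["algebra", "number", "number", "geometry", "data",
           "algebra", "algebra", "number", "measurement", "data"]

codes = {s: i for i, s in enumerate(strands)}
a = np.array([codes[x] for x in rater_1])
b = np.array([codes[x] for x in rater_2])

observed = np.mean(a == b)
p_a = np.bincount(a, minlength=len(strands)) / len(a)   # rater 1 marginals
p_b = np.bincount(b, minlength=len(strands)) / len(b)   # rater 2 marginals
expected = np.sum(p_a * p_b)                            # chance agreement
kappa = (observed - expected) / (1 - expected)

print("observed agreement:", observed, " kappa:", round(kappa, 2))
```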

3. Unrepresentative Samples from Domain

Practical limitations on testing conditions may lead to unrepresentative samples of the content domain
  More objective item formats, such as multiple choice and limited constructed response, have long been favored
    Reliably and inexpensively scored

But these formats may not elicit the deeper levels of reasoning that experts believe should be assessed for the subject matter

4. Irrelevant Item Solving Processes

Using content specifications, along with item writer credentials and item quality control, may not be sufficient to assure high quality tests
  Leighton and Gierl (2007) view content specifications as one of three cognitive models for making inferences about examinees' thinking processes
  A limitation of test specifications as a cognitive model for such inferences is that no evidence is provided that examinees are in fact using the presumed skills and knowledge to solve items

NAEP Validity Study for Mathematics: Grade 4 and Grade 8
  Mathematicians examined items from NAEP and some state accountability tests

Results
  Small percentage of items deemed flawed (3-7%)
  Larger percentage of items deemed marginal (23-30%)
  Marginal items had construct-irrelevant difficulties
    Problems with pattern specifications
    Unduly complicated presentation
    Unclear or misleading language
    Excessively time-consuming processes
  Marginal items previously had survived both content-related and empirical methods of evaluation

Examples of Irrelevant Knowledge, Skills and Abilities

• Source
  • National Mathematics Advisory Panel (2008). Foundations for success: The final report of the National Mathematics Advisory Panel. Washington, DC: U.S. Department of Education.

• Method: logical-theoretical analysis by mathematicians & curriculum experts
  • Mathematics involves aspects of logical analysis, spatial ability and verbal reasoning, yet their role can be excessive

Dependence on Non-Mathematical Knowledge

Dependence on Logic, Not Mathematics

Excessive Dependence on Spatial Ability

Excessive Dependence on Reasoning and Minimal Mathematics

Implication for Educational Tests

Identifying irrelevant sources of item performance requires more than content-related evidence
  Latent process evidence is relevant
    E.g., methods include cognitive analysis (e.g., item difficulty modeling), verbal reports of examinees and factor analysis

External sources of evidence may provide needed safeguards
  Example: implications of the correlation of an algebra test with a test of English
    If this correlation is too high, it may suggest a failure in the system of internal evidence that supports test meaning (see the sketch below)
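The external safeguard just described can be sketched as a pair of correlations: the algebra test with an English test, and the algebra test with an algebra-relevant criterion. The scores below are fabricated, with the algebra items deliberately loaded on verbal skill to show the warning sign.

```python
# Minimal sketch, fabricated scores: a discriminant check on an algebra test.
import numpy as np

rng = np.random.default_rng(3)
n = 500
algebra_skill = rng.normal(size=n)
verbal_skill = rng.normal(size=n)

# Suppose the algebra items lean heavily on reading dense word problems
algebra_test = 0.6 * algebra_skill + 0.5 * verbal_skill + 0.3 * rng.normal(size=n)
english_test = verbal_skill + 0.3 * rng.normal(size=n)
algebra_grade = algebra_skill + 0.5 * rng.normal(size=n)

def r(a, b):
    return round(np.corrcoef(a, b)[0, 1], 2)

print("r(algebra test, English test)  =", r(algebra_test, english_test))
print("r(algebra test, algebra grade) =", r(algebra_test, algebra_grade))
# A correlation with English near (or above) the criterion correlation would
# question the internal evidence that supports the test's meaning.
```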

Construct Validity as a Universal System and a Unifying Concept

Features
  Consistent with the current Test Standards (1999)
  Consistent with many of Lissitz and Samuelsen's distinctions and elaborations

Validity Concept
  Universal
    All sources of evidence are included
    Appropriate for both educational and psychological tests
  Interactive
    Evidence in one category is influenced or informed by adequacy in the other categories

Categories of Evidence in the Validity System

• Eleven categories of evidence
• The categories are conceived for application to both educational and psychological tests
• Consistent with most validity frameworks and the current Test Standards (1999), it is postulated that tests differ in which categories in the system are most crucial to test meaning, depending on the intended use
• Even so, most categories of evidence are potentially relevant to a test

A Universal Validity System

[Diagram: the eleven categories of evidence, arranged from internal meaning to external significance]

Internal meaning: Logic/Theory, Latent Process Studies, Testing Conditions, Item Design Principles, Domain Structure, Test Specifications, Psychometric Properties, Scoring Models

External significance: Utility, Other Measures, Impact

Internal Categories of Evidence

Logical/Theoretical Analysis: Theory of the subject matter content, specification of areas and their interrelationships

Latent Process Studies: Studies on content interrelationships, impact of item design features on psychometric properties & response time, impact of various testing conditions, etc.

Testing Conditions: Available test administration methods, scoring mechanisms (raters, machine scoring, computer algorithms), testing time, locations, etc. Included because they determine the item types for which it is important to develop design principles

Item Design Principles: Scientific evidence and knowledge about how features of items impact the KSAs applied by examinees -- formats, item context, complexity and specific content as determining the relevant & irrelevant bases (KSAs) for item responses

Internal Categories of Evidence

Domain Structure: Specification of content areas and levels, as well as relative importance and interrelationships

Test Specifications: Blueprints specifying domain structure representation, constraints on item features, specification of testing conditions

Psychometric Properties: Item interrelationships, DIF, reliability, relationship of item psychometric properties to content & stimulus features

Scoring Models: Psychometric models and procedures to combine responses within and between items, weighting of items, item selection standards, relationship of scores to proficiency categories, etc. Decisions about dimensionality, guessing, elimination of poorly fitting items, etc. impact scores and their relationships (a minimal scoring sketch follows)
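As a purely illustrative sketch of the Scoring Models category, the code below obtains a Rasch-model maximum-likelihood proficiency estimate for one examinee from scored responses, assuming the item difficulties are already known. It is not a production scoring procedure.

```python
# Minimal sketch, invented values: Rasch maximum-likelihood scoring for one examinee.
import numpy as np

b = np.array([-1.5, -0.5, 0.0, 0.5, 1.0, 1.8])   # known item difficulties
x = np.array([1, 1, 1, 0, 1, 0])                  # scored responses (4 of 6 correct)

theta = 0.0
for _ in range(25):                               # Newton-Raphson on the log-likelihood
    p = 1 / (1 + np.exp(-(theta - b)))            # P(correct | theta) under the Rasch model
    grad = np.sum(x - p)                          # first derivative of the log-likelihood
    hess = -np.sum(p * (1 - p))                   # second derivative
    theta -= grad / hess

print("estimated proficiency (theta):", round(theta, 2))
print("expected score at theta:", round(np.sum(1 / (1 + np.exp(-(theta - b)))), 2))
```

Decisions such as model choice, dimensionality, and treatment of guessing would change the estimate, which is why Scoring Models is treated as its own category of evidence.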

External Categories of Evidence

Utility: Relationship of scores to external variables, criteria & categories

Other Measures: Relationship of scores to other tests of knowledge, skills and abilities

Impact: Consequences of test use, adverse impact, proficiency levels, etc.

The Universal System of Validity

• Test Specifications is the most essential category: it determines (with Scoring Models)
  • Representation of domain structure
  • Psychometric properties of the test
  • External relationships of test scores

• Preceding Test Specifications are categories that involve scientific evidence, knowledge and theory
  • Domain Structure
  • Item Design Principles

• These are in turn preceded by
  • Latent Process Studies
  • Logical/Theoretical Analysis
  • Testing Conditions

General Features of Validity System

Test meaning is determined by internal sources of information

Test significance is determined by external sources of information

Content aspects of the test are central to test meaning
  Test specifications, which include test content and test development procedures, have a central role in determining test meaning

Test specifications also determine the psychometric properties of tests, including reliability information

General Features of the Universal Validity System

A broad system of evidence is relevant to support Test Specifications
  Item Design Principles -- relevancy of examinees' responses to the intended domain
  Domain Structure -- regarded as a theory
  Other preceding evidence
    Latent Process Studies
    Logical/theoretical analyses of the domain
    Testing Conditions

General Features of the Universal Validity System

Interactions among components
  Internal evidence sets expectations for external evidence
  External evidence informs the adequacy of evidence from internal sources
  Potential inadequacies arise when
    Hypotheses are not confirmed
    There are unintended consequences of test use

System of evidence includes both theoretical and practical elements

Relevant to educational and psychological tests

The Universal System of Validity

• Example of Feedback
  • Speeded math test to emphasize automatic numerical processes
  • External evidence -- strong adverse impact
  • Internal evidence categories to question
    • Item Design
      • Relationship of item speededness to automaticity
    • Domain Structure
      • Heavy emphasis on the automaticity of numerical skills

Analysis of Categories

Other categories elaborate their distinctions
  "Psychometric Properties"
    Evidence in the Lissitz and Samuelsen "Reliability" category
    "Latent Process Studies" category as related to a specific test
  Scoring Models is a separate category
    Impact of decisions about dimensionality, guessing, elimination of poorly fitting items and so forth is highlighted for its impact on scores and their relationships
  Test Specifications category is construed broadly
    Includes test blueprints, item writer guides, item writer credentials, test administration procedures and so forth.

Application to Educational and Psychological Tests: Achievement

Current emphasis
  Test specifications
    Central to standards-based testing
  Domain structures
    Essential to blueprints
  Scoring models & psychometric properties
    State of the art in large-scale testing

Underemphasized areas
  Item design principles
    Research basis is emerging
  Latent process studies
    Important in establishing the construct-relevancy of student responses
  Logical/theoretical analysis
    Important in defining domain structure
  Implications of feedback from studies on
    Utility
    Other Measures
    Impact

Application to Educational and Psychological Tests: Achievement

Example: Feedback from external relationships
  Implications of negative evidence
  Speeded math test to emphasize automatic numerical processes
    External evidence -- strong adverse impact for certain groups
  Issues to question
    Item design
      Relationship of item speededness to automaticity
    Domain structure
      Heavy emphasis on the automaticity of numerical skills

Example: Item Design & Latent Process Studies
  Item response format for mathematics items
  Katz, I. R., Bennett, R. E., & Berger, A. E. (2000). Effects of response format on difficulty of SAT-Mathematics items: It's not the strategy. Journal of Educational Measurement, 37(1), 39-57.

Application to Educational and Psychological Tests: Personality

Current emphasis
  Logical/Theoretical Analysis
    I.e., personality theories
  Utility
    Prediction of job performance
  Other Measures
    Factor analytic studies

Underemphasized areas
  Test Specifications
  Domain Structure
  Item Design Principles
  Latent Process Studies

Application to Educational and Psychological Tests: Personality

• Test Specifications & Domain Structure
  • Multifaceted constructs

• Ignoring domain structure
  • Lack of convergent validity
  • Unbalanced or uncontrolled item set
    • Emphasizes the facet that is best represented if items are selected for internal consistency (see the sketch after this slide)
    • Item selection will not be consistent

• Example: Conscientiousness construct
  • Major subdivisions
    • Dependability, Achievement (Moutafi, Furnham & Crump, 2006)
    • Duty (-), Achievement Striving (+) (Moon, 2001)
      • Opposing relationships to commitment
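The item-selection point above can be illustrated with a small simulation: when one facet contributes more items to the pool, selecting items by corrected item-total correlation tends to retain mostly that facet. The facet labels, pool sizes, and data below are all simulated assumptions.

```python
# Minimal sketch, simulated data: internal-consistency selection drifts
# toward the facet that dominates the item pool.
import numpy as np

rng = np.random.default_rng(4)
n = 400
facet_a = rng.normal(size=n)             # e.g., dependability
facet_b = rng.normal(size=n)             # e.g., achievement striving
n_a, n_b = 14, 6                         # facet A dominates the pool

items = np.column_stack(
    [facet_a[:, None] + rng.normal(size=(n, n_a)),
     facet_b[:, None] + rng.normal(size=(n, n_b))])
facet_of = np.array(["A"] * n_a + ["B"] * n_b)

# Keep the 10 items with the highest corrected item-total correlations
total = items.sum(axis=1)
r_it = np.array([np.corrcoef(items[:, j], total - items[:, j])[0, 1]
                 for j in range(items.shape[1])])
kept = np.argsort(r_it)[-10:]

print("items kept from facet A:", int(np.sum(facet_of[kept] == "A")))
print("items kept from facet B:", int(np.sum(facet_of[kept] == "B")))
# Unless the domain structure is enforced in the test specifications,
# the "refined" scale over-represents facet A.
```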

Application to Educational and Psychological Tests: Personality

Test Specifications & Domain Structure

• Example of structure in personality
  • Facet theory to
    • Define domain membership
    • Define domain structure & observations
  • Roskam, E. & Broers, N. (1996). Constructing questionnaires: An application of facet design and item response theory to the study of lonesomeness. In G. Engelhard & M. Wilson (Eds.), Objective measurement: Theory into practice, Volume 3 (pp. 349-385). Norwood, NJ: Ablex Publishing.

Facet Theory Approach to Measure of Lonesomeness

Application to Educational and Psychological Tests: Personality

Item Design Principles & Latent Process Studies
  Most measures are self-report format
  The basis of self-report may involve strong construct-irrelevant aspects
    Tasks require judgments about the relevance of a statement to one's own behavior and then reliably summarizing

California Psychological Inventory items
  When in a group of people I usually do what the others want rather than make suggestions.
  There have been a few times when I have been very mean to another person.
  I am a good mixer.
  I am a better talker than listener.

Application to Educational and Psychological Tests: Personality

• Science of self-report is emerging and linked to cognitive psychology

• Stone, A. A., Turkkan, J. S., Bachrach, C.A., Jobe, J. B., Kurtzman, H. S. & Cain, V. S. (2000). The science of self-report. Mahwah, NJ: Erlbaum Publishers.

• Studies on how item and test design impact self-report accuracy
  – Self-reports under optimal conditions are biased
    • Daily diaries of dietary self-reports contain insufficient calories to sustain life
    • Smith, A. F., Jobe, J. B., & Mingay, D. M. (1991b). Retrieval from memory of dietary information. Applied Cognitive Psychology, 5, 269-296.

• Personality inventories are far less optimal for reliable reporting

Application to Educational and Psychological Tests: Personality

Mechanisms in self-report

Response styles
  Social desirability
  Acquiescence

Memory & Context
  When memory information is insufficient, other methods are applied
  Context
    Information earlier in the questionnaire
    Ambiguity of the issue discussed
    Moods evoked by earlier questions

Self-Report Context Effects

Application to Educational and Psychological Tests: Personality

Item Design Principles
  Lievens, F. & Sackett, P. (2007). Situational judgment tests in high-stakes settings: Issues and strategies with generating equivalent forms. Journal of Applied Psychology, 92, 1043-1055.

Application to Educational and Psychological Tests: Personality

• Integration of Item Design Principles & Logical/Theoretical Analysis & Latent Process Studies
  – Example: Test of Aggression
  • James, L. R., McIntyre, M. D., Glisson, C. A., & Green, P. D. (2005). A conditional reasoning measure for aggression. Organizational Research Methods, 8, 69-80.
  • Item design based on the hypothesis that responses to ambiguous scenarios involve justification mechanisms related to aggression

Sample Item with Hostile Attribution Bias for Keyed Response

Summary

History of validity shows changes in the concept
  Notion of types is still apparent

Construct validity is appropriate for educational tests
  Content aspect is not sufficient

Construct validity is a universal system of evidence relevant to diverse tests