Assessment Validation in the Context of High-Stakes Assessment


Katherine Ryan, University of Illinois

Including the perspectives of stakeholder groups (e.g., teachers, parents) can improve the validity of high-stakes assessment interpretations and uses. How stakeholder groups view high-stakes assessments and their uses may differ significantly from the views of state-level policy officials. The views of these stakeholders can contribute to identifying the strengths and weaknesses of the intended assessment interpretations and uses. This article proposes a process approach to validity that addresses assessment validation in the context of high-stakes assessment. The process approach includes a test evaluator or validator who considers the perspectives of five stakeholder groups at four different stages of assessment maturity in relationship to six aspects of construct validity. The tasks of the test evaluator and how stakeholders' views might be incorporated are illustrated at each stage of assessment maturity. How the test evaluator might make judgments about the merit of high-stakes assessment interpretations and uses is discussed.

Most states are enacting and implementing multilevel educational accountability systems that define content and performance standards that emphasize high achievement, including complex understanding of subject areas and higher order thinking (Kupermintz, Ennis, Hamilton, Talbert, & Snow, 1995; Ladd, 1996). These content standards (what students should know and be able to do in mathematics, reading, etc.) and performance standards (the level and quality of knowledge and skills in specific content areas) are accompanied by assessments. The assessments, representing the content and performance standards, are used for holding schools accountable to improve instruction, student learning, grade promotion, and certification.

When test results are used for potentially serious consequences like grade promotion, certification, or the award of salary increases, the assessment is characterized as "high stakes." As Kane (2001, p. 2) says, "Note that it is their consequences that insert the 'high stakes.'" The consequences of high-stakes assessments impact all students, teachers, and schools. While the goal of standards-based accountability is to improve teaching and learning for all, particular groups of students, teachers, and schools (e.g., low income) may be disproportionately affected by the consequences. Certainly, how stakeholders like students, teachers, and parents view high-stakes assessment interpretations and uses may differ significantly from how state-level policy officials, who are interested in educational outcomes, view them. Including the perspectives of stakeholder groups like school administrators, teachers, parents, and students in the assessment validation process can improve the validity of high-stakes assessment interpretations and uses. The views of these stakeholders can contribute to identifying the strengths and weaknesses of the intended assessment interpretations and uses.

In this article I propose a process approach to validity that addresses assessment validation in the context of high-stakes assessment. The process approach is linked to three themes: the notion of a "test evaluator," the links between evaluation and validity inquiry, and the role of the stakeholders in assessment validation. After I present the process approach, I briefly discuss how the test evaluator might formulate a judgment about the merit or value of assessment interpretations and uses. I conclude with some general comments on future directions for validation inquiry.

The Test Evaluator as a Public Scientist

Unlike the test developer, the evaluator holds no brief for or against the test, but rather is committed to serve all the persons having stakes in affairs the test may influence. Unlike the writer of test reviews, the evaluator undertakes independent research. Unlike the investigators who accumulate background knowledge while satisfying motives of their own, but like the program evaluator, the test evaluator is expected to produce a report in a limited amount of time. (Cronbach, 1989, p. 164)

Both Linn (1998) and Cronbach (1988, 1989) have suggested that a "test evaluator" or "validator" is needed in assessment validation. Explicitly acknowledging the political dimension in the work of the evaluator, Cronbach et al. (1980) characterized the evaluator as a "public scientist," prescribing that the evaluator serve the interests of the "public good."

Katherine Ryan is Associate Professor of Educational Psychology, University of Illinois, Champaign, IL 61820. Her specializations are educational evaluation and applied measurement.


In this article, I present the test evaluator's obligations and responsibilities in relationship to a multifaceted framework that represents a process approach to validation inquiry (see Fig. 1). I particularly emphasize the test evaluator's responsibilities in determining the validation questions while addressing the dilemma of the confirmationist bias (Cronbach, 1988; Haertel, 1999; Shepard, 1993). The confirmationist bias is the tendency to look for supporting evidence in the validation of assessment interpretations and uses instead of taking a more balanced view that examines both the strengths and weaknesses of intended interpretations and uses.

As illustrated in Figure 1, I propose including the perspectives of stakeholder groups and/or audiences (those who might be interested in the findings about the evaluation of assessment interpretations and uses) in the assessment validation process. Stakeholders are groups who have interests that are at stake in the assessment interpretations and uses. The perspectives of these groups may be considered at four different stages of assessment maturity in relationship to six aspects of construct validity. This validation approach, which is fundamentally anchored by "the value implications of score meaning as a basis for action and the social consequences of score use" (Messick, 1995, p. 741), should shape the practices of the test evaluator.

Because of the high stakes involved in testing for accountability purposes, it is best for the test evaluator to be located externally to both the assessment development and use. (There may be an individual directly connected to the assessment who is the "internal evaluator," or the tasks may be completed by several people holding different positions who are performing an "internal evaluation function.") The external test evaluator or validator is responsible to and for all stakeholders throughout the validation inquiry. This charge is particularly critical at the times the questions to be studied are being identified (Cronbach, 1989). Balancing rather than favoring one group's interests and ideology over another is a central activity for test evaluators. Acknowledging their own values and interests is important. To gain reliable understanding of stakeholders' perspectives, test evaluators will need to be close to, as opposed to distant from, the stakeholders.

Questions for potential study concerning possible assessment interpretations and uses are gathered from all stakeholders. These stakeholders are policymakers and government officials as well as less privileged groups. How and whether each question is studied is decided by a winnowing process involving negotiation and judgment in divergent and convergent phases of question development. Cronbach (1989) and Shepard (1993) propose four criteria to consider in prioritizing validity questions: uncertainty about the question, cost, criticality of information, and information yield from the study. Even when questions are not studied, bringing the issues to light is helpful in knowing what was not studied.
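To make the winnowing step concrete, here is a minimal, purely hypothetical sketch of how a test evaluator might score candidate validity questions against the four criteria. The questions, ratings, and weights are invented for illustration and are not part of Cronbach's or Shepard's proposals.

# Hypothetical sketch: ranking candidate validity questions on the four
# criteria named above (uncertainty, cost, criticality, information yield).
# All questions, ratings, and weights are invented for illustration.
from dataclasses import dataclass

@dataclass
class ValidityQuestion:
    text: str
    uncertainty: float   # how unsettled the answer is (0-1, higher = more uncertain)
    cost: float          # expense of studying it (0-1, higher = more costly)
    criticality: float   # how much decisions hinge on the answer (0-1)
    info_yield: float    # expected information gained from a study (0-1)

def priority(q: ValidityQuestion, weights=(0.3, 0.2, 0.3, 0.2)) -> float:
    """Simple weighted score; cost counts against a question."""
    w_unc, w_cost, w_crit, w_yield = weights
    return (w_unc * q.uncertainty - w_cost * q.cost
            + w_crit * q.criticality + w_yield * q.info_yield)

questions = [
    ValidityQuestion("Do exit-exam scores reflect higher order thinking?", 0.8, 0.6, 0.9, 0.7),
    ValidityQuestion("Does the test encourage teaching to the test?", 0.7, 0.4, 0.8, 0.6),
    ValidityQuestion("Are accommodations adequate for students with disabilities?", 0.5, 0.3, 0.9, 0.5),
]

# Rank candidate questions; low-ranked questions are still recorded so that
# what was not studied remains visible to stakeholders.
for q in sorted(questions, key=priority, reverse=True):
    print(f"{priority(q):5.2f}  {q.text}")

Even such a crude scoring exercise is only an aid to the negotiation described above, not a substitute for it.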

However, the test evaluator, by the nature of training and work, carries a confirmationist bias (Haertel, 1999). This bias is complex, involving management, government, administration, the professions, and the discourses surrounding high-stakes testing in society. Consequently, the test evaluator is situated in such a way that he or she is responsible for balancing many interests, some of which are interests in which he or she is vested. At the same time, the evaluator's external location, with links to outside institutions, technical skills, and knowledge of analytical frameworks, and the evaluator's experience with the use of empirical evidence are assets. They are a key part of the warrant that she or he brings to the validation inquiry. However, it is the role of the test validator, in specifying study questions, collecting data, and all other phases of the validation process, to bring a balanced perspective that avoids the confirmationist bias.

Links Between Evaluation and Validity Inquiry

Validation of a test or test use is evaluation. . . . Validation speaks to a diverse and potentially critical au- dience, therefore the argument must link concepts, evidence, social and personal consequences, and values. (Cronbach, 1988, p. 4)

FIGURE 1. A Process Approach to Validity Inquiry. [Figure not reproduced: it crosses the stakeholder groups facet (policymakers, school officials, teachers, parents and students, and illuminators) with the assessment maturity and validity criteria facets.]


In the brief sentences above, Cronbach introduces two key concepts that substantially alter how to think about assessment validation: the relationship between validation and evaluation and a diverse and potentially critical audience. The notion of validation as a construction of and an evaluation of the arguments for and against assessment interpretations and uses has been given serious consideration in modern conceptions of the validation process (Cronbach, 1988; Haertel, 1999; Kane, 1992; Linn, 1998; Messick, 1989, 1995; Shepard, 1993). The concept of "critical audience" has not been as clearly articulated; still, the notion of audience, stakeholder, and multiple perspectives in the validation process has been visualized (Haertel, 1999, 2001; Lane, Park, & Stone, 1998; Messick, 1995; Moss, 1998; Shepard, 1993).

Over a decade ago, Cronbach (1988) pointed out the parallels between evaluation inquiry and validation inquiry, especially their roles in shaping policy and practice. He proposed that some of the solutions that evolved in the more recent approaches in program evaluation theory and practice might be helpful in reconsidering the kind of inquiry needed in the validation of the interpretations and uses of assessments. In addition to conceptualizing validation of an assessment or assessment use as evaluation, he proposed a "validity argument" that corresponds to the logic of the "evaluation argument" (Cronbach, 1988; House, 1977). Cronbach proposed four principles to guide the development of the assessment validation argument: (a) the limitations of each interpretation are shaped by the degree of justification; (b) the interpretation can be a description, prediction, or a recommended decision; (c) the local users' inferences and policies (and test developers' interpretations) should be examined; and (d) the task of validation involves examining the strengths and weaknesses of the assessment interpretations and uses.

While the term evaluation is used within the literature on validation (e.g., Shepard, 1993), the meaning of evaluation in the validation inquiry context is not clearly defined. This is a key concept in constructing and examining the arguments for and against assessment interpretations and uses. I define evaluation here as a systematic examination of interpretations and uses occurring in and resulting from an assessment or accountability system. (Other indicators could also be examined as part of this systematic examination.) The evaluation is conducted to assist in (a) improving the assessment interpretations or uses and/or (b) making judgments about the merits or worth of these interpretations and uses. Validation inquiry is the overall evaluation of the intended and unintended interpretations and uses of test scores.

To make a judgment about the validity of assessment interpretations and uses, some criteria can be selected to justify the judgments. The validator or test evaluator presents an argument with evidence specifying the criteria employed in the evaluation and their justification. Determining criteria in an evaluation is not a straightforward process. Nor can all possible assessment interpretations and uses be studied. In the validity criteria facet of Figure 1, I present an example of the criteria that might be appropriate for validation of intended assessment interpretations and uses.

Validity Criteria Facet: Evidence to Be Collected

Theoretically, criteria in an evaluation can and should come from any number of sources (e.g., from an examination of what is being evaluated, stakeholders and audiences, research literature, etc.). Messick's (1995) theory of construct validity can be considered as providing general criteria for all educational and psychological assessments. For the purposes of this paper, I am adopting these criteria as the initial benchmark for the validity facet of the framework for implementing validation inquiries (see Fig. 1). There are other sources of criteria that are more or less similar, including the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 1999) and the five perspectives on the validity argument (Cronbach, 1988).

Messick criticized historical conceptualizations of validity for not addressing two major issues: ". . . the value implications of score meaning as a basis for action and the social consequences of score use" (Messick, 1995, p. 741). (See Messick, 1989, for a complete discussion of these issues and his argument concerning their centrality to validity.) Instead, he proposed a unified concept of validity based on an expanded theory of construct validity that "integrates considerations of content, criteria, and consequences into a construct framework for the empirical testing of rational hypotheses about score meaning and theoretically relevant relationships including those of an applied and a scientific nature" (Messick, 1995, p. 751).

He concluded that construct validity should incorporate any evidence that impacts the meaning and interpretation of the assessment scores (Messick, 1989, 1995). Validity is defined as an overall judgment of the extent to which the empirical evidence and theory support the adequacy and appropriateness of the interpretations from assessments for a specific use. Critical to this definition is the notion that validity is not a property of a test or assessment. Instead, validity is a characteristic of the meaning and interpretation of the assessment scores and any actions based on the assessment scores.

The following list provides brief definitions of terms within Messick's construct validation theory, presents potential sources of evidence, and illustrates how evaluators and stakeholders might collect evidence or be involved.

Content aspects include evidence of content relevance and representativeness. Establishing the boundaries of the domain to be assessed is critical in conceptualizing content considerations. Sources of evidence typically are results of job analysis, task analysis, logical analysis, and other forms of analysis conducted by expert judges. Stakeholders could participate by assisting in determining the boundaries of the construct and in collecting sources of evidence concerning the criticality or importance of particular dimensions derived from the task analysis.

Substantive aspects involve evidence supporting the theoretical and empirical analysis of the processes, strategies, and knowledge proposed to account for respondents' item and/or task performance on the assessment. Sources of evidence include analysis of individual responses or response processes through think-aloud protocols or simply asking respondents about their responses. Stakeholders can make judgments about whether the theoretical analysis supports students' item and/or task performance and what might be missing. Some stakeholder groups might participate in the analysis of think-aloud protocols.

Structural aspects are most similar to concerns relating to the internal structure of an assessment. Based on Loevinger's (1957) concept of structural fidelity, roughly speaking, structural considerations involve assessing how well the scoring structure parallels the construct domain. Sources of evidence involve structural considerations based on investigations of the interitem correlations and test dimensionality (a small illustrative sketch follows this list). Stakeholders can examine the structural dimensions of the assessment and address concerns or issues about how well the structure maps to the construct.

External aspects include the familiar types of convergent and discriminant evidence from multitrait-multimethod studies. Sources of evidence concerning the relevance of the criterion are also addressed in external considerations. Stakeholders can suggest and participate in studies investigating convergent and discriminant validity. They can also propose relevant criteria and participate in studies examining these criteria.

Generalizability aspects are concerned with the degree to which score meaning and use can be generalized to other populations, contexts, and tasks, including the test (assessment)-criterion relationship. Sources of evidence consist of prediction studies and other studies of how particular factors (e.g., type of assessment taker) might impact the assessment-criterion relationship. Stakeholders can suggest factors that they consider relevant for investigation.

Consequential aspects are concerned with score meaning and the intended and unintended consequences of assessment use. Sources of evidence within the high-stakes assessment context might involve a study of how the use of assessment scores for teacher salary increases impacts teachers' instruction of students. Intended consequences (e.g., more learning) and unintended consequences (e.g., teaching to the test) are examined. Questionnaires, classroom observations, and case studies are the most typical methods used to study consequences. Stakeholders can present their perspectives on how the consequences of high-stakes assessments are influencing teaching and learning. These issues can be investigated, providing sources of evidence about assessment consequences.
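As a hedged illustration of the structural-aspect evidence named above, the sketch below computes an inter-item correlation matrix and a rough eigenvalue-based dimensionality check on simulated item responses. The data and the particular checks are assumptions made for illustration, not procedures prescribed by Messick.

# Illustrative sketch (not a prescribed procedure): examining structural
# evidence via inter-item correlations and a crude dimensionality check.
# Item responses here are simulated; a real study would use examinee data.
import numpy as np

rng = np.random.default_rng(0)
n_examinees, n_items = 500, 10

# Simulate roughly unidimensional 0/1 item responses driven by one ability factor.
ability = rng.normal(size=n_examinees)
difficulty = np.linspace(-1.5, 1.5, n_items)
prob_correct = 1 / (1 + np.exp(-(ability[:, None] - difficulty[None, :])))
responses = (rng.random((n_examinees, n_items)) < prob_correct).astype(float)

# Inter-item correlation matrix.
item_corr = np.corrcoef(responses, rowvar=False)
print("Mean inter-item correlation:",
      round(item_corr[np.triu_indices(n_items, k=1)].mean(), 3))

# Eigenvalues of the correlation matrix as a rough dimensionality check:
# one dominant eigenvalue is consistent with a single intended dimension.
eigenvalues = np.linalg.eigvalsh(item_corr)[::-1]
print("Largest eigenvalues:", np.round(eigenvalues[:3], 2))

A stakeholder panel reviewing such output would still need to judge whether the observed structure maps onto the intended construct.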

Validation and Multiple Perspectives

In his 1988 and 1989 papers, Cronbach fundamentally shifted validation from a ". . . ritual performed behind the scene with the professional elite as witness and judge" (1988, p. 3) to audiences with multiple perspectives. Validation involves "activities that clarify for a relevant community what a measurement means and the limitations of each interpretation" (Cronbach, 1988, p. 3).

The notion of the audience, stakeholder, and multiple perspectives in the validation process, especially in relationship to values, is receiving attention (Cronbach, 1988; Haertel, 1999, 2001; Messick, 1995; Shepard, 1993). Values are fundamental to the meaning and outcomes of assessment (Messick, 1995). Messick (1995) presents a persuasive case for the examination of explicit and implicit values in score interpretation and use. He proposes that looking at both assessment interpretation and use from multiple perspectives is one approach to making tacit values visible in the validation process. In terms of score interpretation, this involves empirically and substantively examining the alternative ideologies and theories surrounding the construct of interest (Messick, 1995). The same strategy of multiple perspectives can also be used in examining assessment uses.

Open dialogue and debate will bring different value commitments concerning assessment use to light. However, while Messick (1995) is emphatic about the importance of multiple perspectives in evaluating the arguments for and against assessment interpretations and uses, whose values the multiple perspectives represent is not well defined.

The Role of Stakeholders and Audiences in Validity Arguments and Conclusions

Accountability for educational outcomes should be a shared responsibility of states, school districts, public officials, educators, parents, and students. High standards cannot be established and maintained merely by imposing them on students. (Heubert & Hauser, 1999, p. 5)

The role of stakeholders in the assessment validation process has only been minimally articulated. In the context of large-scale assessment programs, it is obvious that particular assessment interpretations and uses do have an impact on the stakeholders and audiences. For example, results from a recent investigation suggest high-school graduation tests increase the probability that the lowest achieving students will drop out (Jacob, 2001).

Including stakeholders' perspectives is one approach for addressing the confirmationist bias (Cronbach, 1988; Haertel, 1999; Shepard, 1993). The confirmationist bias violates one of the principles [(d), i.e., examining the strengths and weaknesses of the assessment interpretations and uses] that Cronbach proposes for developing the validity argument, cited in the section above. Haertel (1999) proposed moving beyond the "confirmationist bias to look for evidence against our intended assessment interpretations" (p. 6) in high-stakes assessment in his 1999 NCME Presidential Address. Specifically, he suggested that even talking about whether to include different groups of people in a discussion about the meaning and use of assessment interpretations was of value to the assessment community. He asked the assessment community to imagine surveying teachers, principals, and school superintendents to ask them if they regard test scores as adequate measures of what they are accomplishing and to listen to what they say (Haertel, 1999).

Stakeholder Facet: Who Are the Stakeholders and Audiences?

Who are potential stakeholders and audiences in a high-stakes assessment validation inquiry? An adaptation of Cronbach et al.'s (1980) policy-shaping community might be a good starting point for considering who these groups might be. The stakeholder facet in Figure 1 presents various groups who could be the stakeholders and audience in the high-stakes assessment validation. There are many other possible stakeholder groups, including combinations of groups. For instance, in Illinois there is a coalition between large school districts and the business community. The coalition is actively lobbying for increased accountability.


As illustrated in Figure 1, the different stakeholder groups and audiences within the assessment and/or accountability system include policymakers at the federal and/or state level, school officials (superintendents and principals), and those involved in direct instruction (teachers). Federal and state policymakers are concerned with educational outcomes and how they contribute to policy. Their interests in assessments are focused on the measurement of these educational outcomes. Superintendents and principals are responsible for the administration and daily operations of the educational programs. They are held accountable for seeing that there is progress toward meeting educational standards and outcomes that are now represented as increases in test scores and other indicators.

Teachers and other direct service personnel are responsible for instruction. There is a move toward holding teachers responsible for increases in assessment scores and other indicators by rewarding increases with pay increases and bonuses. Theoretically, teachers (and probably principals and superintendents) are interested in how and whether high-stakes assessments can adequately measure student learning.

Members of the public include two groups: individuals directly affected by the assessment or system consequences (parents, students) and illuminators (journalists, academicians) (Cronbach et al., 1980). Parents and students have a stake in how the assessment or accountability system validation might affect their interests, such as whether high-school diplomas are awarded based on assessment performance. The illuminators interpret and communicate information about the assessment, accountability system, and the validation inquiry. They are interested in disentangling, or perhaps entangling, issues surrounding the assessment and assessment validation inquiry.

Issues in Including Stakeholders' Perspectives. The logic of the validity argument becomes substantially more complex with the addition of stakeholders' and audiences' perspectives. Not only the nature of assessment interpretations and uses needs to be considered when determining appropriate criteria. Determining appropriate criteria from different points of view is of equal importance. There is a fundamental distinction between "good of a kind X, which means the thing fulfills a role, and good from a Y point of view" (House & Howe, 1999, p. 21). This distinction is at the heart of considering assessment interpretations and uses as good from a technical and functional perspective (good of a kind) and considering whether assessment interpretations and uses are good from the perspectives of groups whose interests are affected differentially (good from a stakeholder group's point of view). For example, developers of a high-stakes statewide assessment comprised of multiple-choice items may be able to point with pride to the technical quality of this assessment (good of a kind). However, in spite of the technical quality of the assessment, teachers (a stakeholder group) might argue assessment scores are not adequate measures of what students are accomplishing or of what they are teaching (good from a stakeholder group's point of view).

The nature of the assessment interpretations and uses, their functions, and what these are actually doing are the critical considerations in deciding who the stakeholders and audiences are and what the criteria are. It will not be surprising, given the multiple perspectives of audiences and stakeholders, that they may find different criteria acceptable, different arguments persuasive, and different evidence credible. Teachers, for example, may regard the results of think-aloud protocols of a small number of students' responses to standardized test items as evidence of critical thinking in mathematics. On the other hand, the legislature and policymakers might find such studies of little interest or unconvincing.

Assessment Maturity Facet: Who, How, and When to Include Stakeholders in the Assessment Validation Process

Who, how, and when to include stakeholders in the assessment validation process shifts substantially in relationship to the maturity of the assessment. The assessment maturity facet in Figure 1 proposes four stages in the maturity of an assessment: conceptualization, design, implementation, and operational. While five stakeholder groups and four stages of assessment maturity suggest there are multiple places and multiple groups of stakeholders to be considered in the validation inquiry, it is unlikely that each group will be involved in each phase of the validation inquiry. Nevertheless, this representation does reflect the potential political and social complexity of high-stakes assessment and accountability (see Fig. 1).

While in theory these stages are considered to be a linear sequential process, they are interconnected and overlapping. I propose considering the validation inquiry in four separate phases because which stakeholders are involved, the questions of interest, and the potential studies to be conducted shift substantially from one phase of the validation process to another. In addition, if one of the goals of validation inquiry is to assist in improving the assessment or accountability interpretations and uses, defining key points for a feedback mechanism in the validation process is critical for improvement to take place. It is difficult to make changes in how an assessment is conceptualized when it is fully operational. For example, changes in the assessment conceptualization based on an evaluation of the construct representation are much easier to accomplish before the assessment is fielded.

Conceptualization Phase. Foremost in the conceptualization stage is determining the purpose of the assessment, identifying who will be taking the test, and establishing the proposed interpretation and use, the specific context of use, construct analysis, assessment formats, and the assessment specifications. Interactions between constructs and formats are critical considerations in format selections (Willingham & Cole, 1997). At this stage, multiple alternatives are proposed, discussed, discarded, and perhaps revisited, with significant negotiations on the substantive features and details of the conceptualization before a set of assessment specifications is agreed upon. Rival hypotheses to be considered involving the meaning of low achievement scores include whether low scores are due to a lack of motivation, lack of competence, or emotional upset. Of equal importance at this phase of the validation process are key features of the construct of interest that are either not assessed well or not assessed at all in the context of large-scale assessment. The issue of group differences in test performance needs to be addressed in conceptualizing the assessment. In a hypothetical example for a high-school exit exam, the rate of success and failure changed substantially for males and females depending on the construct being assessed (Willingham & Cole, 1997).

Haertel (1999) presents an example of what a "theory" behind the conceptualization of a high-stakes assessment might look like (see Fig. 2). He identifies three purposes for large-scale assessments: (a) accountability, (b) contribution to the public discourse about educational concerns, and (c) testing to improve teaching and learning. As illustrated in Figure 2, a key assumption anchoring these purposes is that "test scores tell how well schools are performing" (Haertel, 1999, p. 5). This basic premise leads to two others (e.g., test scores show how much students know and can do), which in turn lead to five others (see Fig. 2). This "theory" represents intended assessment interpretations and uses that need to be validated. All stakeholder groups may not agree that these interpretations are justified.

Stakeholder groups. Figure 2 basically represents policymakers' views of educational accountability: if schools are managed well and doing a good job (managerial efficiency), then students will learn more and demonstrate this improvement by higher scores on standardized achievement tests (outcome). Stakeholder groups like teachers and school administrators may not agree that standardized test scores can adequately represent how much students know and can do. Teachers and other stakeholder groups may think that performance assessment tasks (e.g., an extended week-long project) are critical for representing student achievement. Parents, teachers, and school administrators may be particularly concerned about whether standardized test scores are good indicators for specific groups of students (e.g., low-achieving students, students with poor test-taking skills). The perspectives of policymakers, parents, teachers, school administrators, the public, and other stakeholders may not agree, but they do need consideration, for example, in the selection of constructs to be measured for high-school exit exams where group differences in test scores are at issue.

Tasks of the evaluator. The role of the validator in this phase involves examining all aspects of the assessment conceptualization process to determine if and how these activities were accomplished. Central to the evaluator's role is identifying the stakeholders, like teachers and administrators, and ensuring their point of view is included.

FIGURE 2. Unpacking the Premise that Test Scores Show How Well Schools Are Performing. (Reprinted, with permission, from Haertel, 1999.) [Figure not reproduced: it diagrams premises underlying the claim that test scores tell how well schools are performing, including that answering test items requires skills included in the curriculum, that skills demonstrated on tests can also be used on other tasks, and that proficiency on tested skills shows proficiency on untested skills.]


For instance, helping all stakeholder groups reveal their particular "theory" of what high-stakes assessment interpretations and uses are going to accomplish in this particular context is one of the test evaluator's tasks. The evaluator can do this by first presenting stakeholders with an illustration of what a "high-stakes assessment theory" is; Figure 2 would be a good example. Then the evaluator can ask them how their theory is the same and how it is different. Stakeholders' alternative theories may identify an area of weakness in the intended assessment interpretations and uses that needs to be justified. For instance, teachers may think high-stakes assessments lead to "teaching to the test." The test evaluator may decide this is a key issue that needs further investigation. This particular concern can be studied as part of the assessment validation process.

Design Phase. Based on the specifications developed in the conceptualization stage, samples of the best possible items are prepared for field testing. At this stage there are painstaking efforts to ensure the items are carefully mapped to the constructs and specifications identified in the conceptualization stage. A field trial takes place under realistic field conditions in an effort to predict what the assessment will look like when fully operationalized. The activities include choosing items and topics; editorial, sensitivity, and substantive reviews of sample assessment tasks in relationship to the specifications and the assessment overall; opportunity to learn; and statistical analyses [e.g., differential item functioning (DIF) analyses] ensuring the technical quality of the preliminary assessment.
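To make the statistical side of this phase more concrete, the following sketch runs a Mantel-Haenszel check, one standard DIF procedure, on a single simulated item, stratifying examinees by total score. The data, simulation details, and flagging rule are illustrative assumptions, not steps prescribed in this article.

# Illustrative sketch of a Mantel-Haenszel DIF check for one item,
# stratifying on total test score. Data are simulated for illustration;
# this is one common DIF procedure, not the only one a validator might use.
import numpy as np

rng = np.random.default_rng(1)
n = 2000
group = rng.integers(0, 2, n)            # 0 = reference group, 1 = focal group
total = rng.integers(0, 21, n)           # matching variable: total score 0-20
# Simulate a studied item whose difficulty depends only on total score
# (i.e., no DIF built in); a real analysis would use observed responses.
p_item = 1 / (1 + np.exp(-(total - 10) / 4))
item = (rng.random(n) < p_item).astype(int)

num, den = 0.0, 0.0
for k in np.unique(total):               # stratify by matched total score
    in_k = total == k
    a = np.sum((group == 0) & (item == 1) & in_k)  # reference correct
    b = np.sum((group == 0) & (item == 0) & in_k)  # reference incorrect
    c = np.sum((group == 1) & (item == 1) & in_k)  # focal correct
    d = np.sum((group == 1) & (item == 0) & in_k)  # focal incorrect
    n_k = a + b + c + d
    if n_k > 0:
        num += a * d / n_k
        den += b * c / n_k

alpha_mh = num / den
print("MH common odds ratio:", round(alpha_mh, 2))   # near 1.0 suggests little DIF
# ETS delta scale: D = -2.35 * ln(alpha); large |D| is commonly flagged for review.
print("ETS delta:", round(-2.35 * np.log(alpha_mh), 2))

Flagged items would then go to the editorial and sensitivity reviews described above rather than being removed mechanically.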

Taking into consideration the latest findings, which suggest that scores on multiple-choice tasks and, to a lesser extent, concrete items disproportionately favor male test takers, is critical in field trials. Think-aloud protocol studies investigating the substantive aspects of validity, whether the assessment tasks are engaging students in higher order thinking, are conducted routinely. Standard setting, a defining feature of standards-based educational reform, can be examined empirically and substantively at this phase of the validation process. The potential consequences of various cut scores that will be used to define "basic" and "proficient" can be studied at this phase of the validation process.

Stakeholder groups. Haertel (2001) calls for standard setting to be a more participatory process. In a participatory standard-setting process, stakeholders like teachers, school administrators, and the public can be involved in setting standards as part of the "due process" enacted by policymakers (e.g., the Superintendent of Instruction). The values, beliefs, and intentions of the panelists (e.g., stakeholders like teachers and representatives of the workforce) should be adequately reflected in the cut scores selected to represent the "basic" and "proficient" labels of the performance benchmarks. Other activities involving stakeholders might include teachers reviewing the results of think-aloud protocols to verify that the assessment tasks do assess higher order thinking skills.
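As a minimal sketch of how panelists' judgments might be summarized in such a participatory exercise, the code below takes invented panelist recommendations for a "proficient" cut score and reports their median and spread. The panelists, values, and disagreement tolerance are hypothetical, and this is not a procedure specified by Haertel (2001).

# Hypothetical sketch: summarizing panelists' recommended cut scores in a
# participatory standard-setting exercise. All values are invented, and a
# median with a spread check is only one simple way to summarize judgments.
from statistics import median, stdev

# Each panelist's recommended "proficient" cut score (percent correct).
panelists = {
    "teacher_1": 62, "teacher_2": 58, "administrator_1": 65,
    "parent_rep": 55, "workforce_rep": 70, "teacher_3": 60,
}

ratings = list(panelists.values())
cut = median(ratings)
spread = stdev(ratings)

print(f"Recommended 'proficient' cut score (median): {cut}")
print(f"Between-panelist standard deviation: {spread:.1f}")
if spread > 5:   # invented tolerance: large disagreement may call for more discussion
    print("Panelists diverge; further discussion of values and intentions may be warranted.")

The point of such a summary is not the number itself but making visible whether the panel's values and intentions are in fact reflected in the cut score that is finally adopted.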

Tasks of the evaluator. Mapping the "theory" behind the assessment conceptualization is not enough. The validator, in the design phase, examines whether the conceptualization of the assessment is actually being implemented. The tasks include examining all aspects of the assessment design process to identify if and how these activities were accomplished. The evaluator's role is to identify the stakeholders' interests in the process and ensure their points of view are part of the assessment design.

For example, stakeholders like teachers and others can participate in traditional activities such as review of the item pools for cultural sensitivity and irrelevant sources of difficulty. Including community members and perhaps parents in the validation of cut scores for high-school graduation exams is another example of how the validator could include stakeholder perspectives. The validator is also responsible for determining whether the cut scores are reliable.

Implementation Phase. In the implementation stage, the assessment is actually administered under conditions that are similar to those under which the assessment will be used for accountability purposes. Defining, ensuring, and assessing the effects of the proposed standard conditions and timing, and examining the accommodations that are suitable for students with disabilities, are carefully considered assessment administration issues. The routine technical work such as scoring, scaling, and equating is completed. The reliability of the assessment scores, subgroup performance, and comparability are assessed. Studies predicting the impact of actually implementing concrete consequences based on the assessment results are appropriate here: for students, groups of students, effects on instruction, and educational outcomes. These include the differential effects from various decision models and conditions of use, and an investigation of proposed criterion information.

Stakeholder groups. Stakeholder groups can be involved in studies investigating the meaning of assessment interpretations and uses from their particular perspectives. In general, as part of the accountability model, policymakers do assume that standardized test scores reflect what students know and can do. Whether teachers, principals, and parents think the assessment adequately reflects student achievement can be investigated. How students, parents, teachers, and superintendents and principals interpret the meaning of these assessment scores can also be examined. In addition, the perspectives of students, parents, teachers, and administrators concerning the social costs of particular accountability consequences that are being considered based on assessment scores can be investigated.

Tasks of the evaluator. The tasks of the validator include reviewing the technical and administrative aspects of the implementation phase. In addition, the validator may conduct several studies to investigate the meaning of assessment interpretations and uses from the perspectives of stakeholder groups, or perhaps review these kinds of studies where they are already completed. Conducting or reviewing studies examining the number of schools that would be put on warning lists based on particular performance levels, and comparing the impact of these various performance levels on individuals and specific subgroups, illustrate some of the kinds of activities engaging the test evaluator during this phase.
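One of the impact comparisons mentioned above can be sketched very simply: count how many schools would land on a warning list under alternative performance levels. The school-level data and thresholds below are simulated and invented for illustration only.

# Hypothetical sketch of one impact study described above: comparing how many
# schools would land on a "warning list" under two candidate performance levels.
# The school-level percentages and thresholds are simulated/invented.
import numpy as np

rng = np.random.default_rng(2)
n_schools = 400
# Simulated percent of students meeting standards per school.
pct_meeting = np.clip(rng.normal(55, 15, n_schools), 0, 100)

for threshold in (40, 50):   # candidate performance levels (percent meeting standards)
    flagged = pct_meeting < threshold
    print(f"Warning list at {threshold}%: {flagged.sum()} schools "
          f"({100 * flagged.mean():.0f}% of all schools)")

The same tabulation can be repeated for subgroups of schools or students so stakeholders can see who bears the consequences under each candidate performance level.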

The tasks for test evaluators include studying the social consequences of accountability. Studying social consequences is complex. As Cole and Zieky (2001) suggest, the same consequence will be viewed differently depending on people's values. They illustrate this point with an example from high-school graduation tests. When low-achieving students do not pass high-school graduation tests, some people or groups will consider this to be necessary for a high-school diploma to have meaning. Others will see this as unfair to the student and consider the failure to reflect other factors such as poor schooling or a lack of school funding.

Operational Phase. The assessment is administered routinely and is used for accountability purposes during the operational phase. At least in theory, it is at this phase that the notion of consequences plays out: for example, students attending a school that has not shown improvement based on test scores could be given a voucher to attend a school of their choice. Investigations examining the effects of the consequences and systemic effects for students, groups of students, and communities, and the effects on instruction and educational outcomes, are all potential studies of interest. Criterion studies, or a review of completed criterion studies, are conducted here.

Stakeholder groups. Policymakers, teachers, school administrators, parents, and other stakeholders can participate in selecting the particular criteria that need to be studied. Policymakers may be more interested in studying whether linking teachers' salary increases to students' achievement actually impacts students' test scores and how this link might change teachers' instructional practices. Other stakeholder groups like school administrators and teachers may be interested in examining whether there is an increase in teaching to the test or the influence of testing on tracking or promotion.

Tasks of the evaluator. The role of the test evaluator shifts significantly during the operational phase of the assessment maturity process, when a judgment about the merit or worth of the assessment system is to be made. That is, a fundamental judgment is made about the validity of the assessment interpretations and uses. While the test evaluator has synthesized information during all phases of the assessment validation, synthesizing evidence, concepts, values, consequences, and stakeholders' interests that are likely to be conflicting in order to make judgments about merit or worth is a complex task. To date, there are few formal rules or procedures for making such judgments (House, 1995; House & Howe, 1999).

A complete discussion of the philosophy and logic of making judgments of merit or worth is beyond the scope of this article. Nevertheless, I briefly highlight suggested approaches in program evaluation and present one option that could be used in judging the merit or worth of assessment interpretations and uses.

Judging the Merit or Value of Assessment Interpretations and Uses

Making a fundamental judgment about the validity of the assessment interpretations and uses is a significant challenge. Within the field of evaluation, how to synthesize diverse criteria and data to formulate judgments of merit, and how to integrate different stakeholder interests into this synthesis process, have received substantial attention (House, 1995; Scriven, 1994). There is no clear consensus on either issue: synthesizing the criteria or the integration of different stakeholder interests. There are alternatives for addressing the synthesis problem (House, 1995). These include numerical approaches which involve the quantification of judgments (Mehrens, 1990), a heuristic method classifying criteria as primary or secondary before synthesis (Scriven, 1994), and a synthesis approach (House, 1995). Mehrens argues that combining data is necessary only for summative purposes. In the context of the evaluation of teaching, he recommends either the conjunctive method (a multiple-cutoffs model) or the compensatory model (data are combined based on an algorithm that permits low scores on some measures to be compensated by scores on other measures). He notes the quantification of judgments is preferred over clinical methods (defined as eyeballing the data) when a criterion is available. Scriven (1994) criticizes a quantitative weight and sum approach for assuming that a single scale can be used for weights, performances, and number of criteria. To classify criteria as primary or secondary, he proposes a qualitative weight and sum approach where there are five levels of qualitative importance: essential, very important, important, just significant, and not significant. Secondary criteria are employed only after primary criteria (essential) are considered.
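To make the contrast between the conjunctive and compensatory models concrete, the sketch below applies both rules to the same hypothetical set of criterion scores. The criteria, weights, and cutoffs are invented for illustration and are not drawn from Mehrens (1990).

# Hypothetical sketch contrasting two of the synthesis models noted above:
# a conjunctive (multiple-cutoff) rule and a compensatory (weighted-sum) rule.
# Criterion scores, cutoffs, and weights are invented for illustration only.

scores = {"content_coverage": 0.80, "score_reliability": 0.65, "consequence_evidence": 0.90}
cutoffs = {"content_coverage": 0.70, "score_reliability": 0.70, "consequence_evidence": 0.60}
weights = {"content_coverage": 0.4, "score_reliability": 0.3, "consequence_evidence": 0.3}

# Conjunctive: every criterion must clear its own cutoff; no compensation allowed.
conjunctive_pass = all(scores[c] >= cutoffs[c] for c in scores)

# Compensatory: a weighted sum lets strength on one criterion offset weakness on another.
composite = sum(weights[c] * scores[c] for c in scores)
compensatory_pass = composite >= 0.70   # invented overall cutoff

print("Conjunctive rule:", "pass" if conjunctive_pass else "fail")
print(f"Compensatory composite = {composite:.2f}:", "pass" if compensatory_pass else "fail")

With these invented numbers the two rules disagree, which is exactly the kind of divergence that makes the choice of synthesis model a value-laden decision rather than a purely technical one.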

I propose synthesizing the criteria and data for making judgments about assessment interpretations and uses with House's (1995) synthesis approach. Critical to this approach is the notion that the decision is context specific rather than a law-like generalization about merit or worth. The "all things considered" synthesis "provides the most coherence possible from the information available from various sources" (p. 34). Coherence involves the fitting together of available evidence; the judgment can be disputed if critical evidence is omitted (House, 1995). The Standards for Educational and Psychological Testing (AERA et al., 1999) also cites the notion that a coherent account of evidence is a key to making a defensible judgment about test score interpretations and uses. I propose grounding the synthesis in dialectical reasoning to specifically address the issue of the confirmationist bias. By dialectical reasoning, I mean that the reasons given to support a judgment will be designed to particularly address the doubts or questions about the judgment that others have raised or might raise (Blair, 1995).

How to integrate different stakeholder interests into this synthesis process is an even greater challenge. Of course, theoretically, stakeholder and audience perspectives should be included in each phase of the validation process. Nevertheless, there are two approaches that might be considered in making judgments about the assessment validation process. Developing multiple syntheses, which reflect multiple judgments (e.g., constructing several value positions depending on different stakeholders' perspectives), is one possibility (Shadish, Cook, & Leviton, 1991). Arriving at a single qualified judgment is another possibility (House, 1995). The same kind of "intuitive to and fro" reasoning is the heuristic for arriving at a single judgment. The qualifications or contextual features of the validation process are also critical here. That is, judgments are specifically tailored to the assessment context at hand.

Conclusions

The framework presented in this article is an initial effort in developing a set of concrete strategies for implementing validation inquiries, addressing the concerns about how to include stakeholders as part of the validation process, and defining the tasks and activities of the test evaluator and stakeholders in the validation process. There are several important topics that were not addressed in this article or were only briefly discussed.

If the assessment community is going to move beyond the confirmationist bias, how to make judgments about the merit of assessment interpretations and uses is critical. While House's (1995) approach and others may be a good starting point, further theoretical and empirical work is needed. How to integrate stakeholders' perspectives in the synthesis process is problematic and makes for a substantially more complex process. Constructing several value positions depending on different stakeholders' perspectives is highly complicated.

Managing conflicting views and advice is a challenge. Some views will be based on misunderstandings. As Brennan (2001) suggests, teachers, school administrators, politicians, and the public are innocent about the complexities of an assessment program that meets the Standards for Educational and Psychological Testing (AERA et al., 1999). The evaluator has the opportunity, in this case, to be a teacher about the intricacies of assessment interpretations and uses. On the other hand, teachers, school administrators, the public, and other stakeholders do think group performance differences matter when considering issues like construct choices for high-school exit exams and providing students with an adequate opportunity to show what they know in high-stakes assessment contexts. As Cole and Zieky (2001) suggest, these stakeholders' views are reasonable, while a technical position maintaining otherwise is less so.

In efforts to draw attention to the Standards for Educational and Psychological Testing (AERA et al., 1999), Fremer (2000a) suggests that the concept of construct validation has been elevated to such a high level that it may be out of reach. While Green (2000) argued that construct validation is not out of reach, Fremer (2000b) and Green agree that concrete examples illustrating validation tasks would be useful. Further developments in validation theory and inquiry would benefit greatly from empirical studies where test evaluators or researchers actually implement and illustrate forms of validation inquiry. To date, many tasks of validation remain either undeveloped or are best recognized as forms of tacit knowledge.

References

American Educational Research Association (AERA), American Psychological Association (APA), & National Council on Measurement in Education (NCME). (1999). Standards for educational and psychological testing. Washington, DC: American Psychological Association.

Blair, J. A. (1995). Informal logic and reasoning in evaluation. In D. M. Fournier (Ed.), Reasoning in evaluation: Inferential leaps and links (pp. 71-80). New Directions for Evaluation, 68.

Brennan, R. L. (2001). Some problems, pitfalls, and paradoxes in educational measurement. Educational Measurement: Issues and Practice, 20(4), 6-18.

Cole, N. S., & Zieky, M. J. (2001). The new faces of fairness. Journal of Educational Measurement, 38(4), 369-382.

Cronbach, L. J. (1988). Five perspectives on validity argument. In H. Wainer & H. Braun (Eds.), Test validity (pp. 3-37). Hillsdale, NJ: Erlbaum.

Cronbach, L. J. (1989). Construct validation after thirty years. In R. L. Linn (Ed.), Intelligence: Measurement, theory, and public policy (pp. 147-171). Urbana: University of Illinois Press.

Cronbach, L. J., et al. (1980). Toward reform of program evaluation: Aims, methods, and institutional arrangements. San Francisco: Jossey-Bass.

Fremer, J. (2000a, September). Promoting high standards and the problem of construct validation. National Council on Measurement Newsletter, 8(3), 1.

Fremer, J. (2000b, December). My last (?) comment on construct validity. National Council on Measurement Newsletter, 8(4), 2.

Green, J. (2000, December). Standards for validation. National Council on Measurement Newsletter, 8(4), 8.

Haertel, E. H. (1999). Validity arguments for high-stakes testing: In search of evidence. Educational Measurement: Issues and Practice, 18(4), 5-9.

Haertel, E. H. (2001, April). Standard setting as a participatory process. Paper presented at the annual meeting of the American Educational Research Association, Seattle, WA.

Heubert, J. P., & Hauser, R. M. (Eds.). (1999). High stakes: Testing for tracking, promotion, and graduation. Washington, DC: National Academy Press.

House, E. R. (1977). The logic of the evaluation argument. Los Angeles: Center for the Study of Evaluation.

House, E. R. (1995). Putting things together coherently: Logic and justice. In D. M. Fournier (Ed.), Reasoning in evaluation: Inferential leaps and links (pp. 33-48). New Directions for Evaluation, 68.

House, E. R., & Howe, K. (1999). Values in evaluation and social research. Thousand Oaks, CA: Sage.

Jacob, B. A. (2001). Getting tough? The impact of high school graduation exams. Educational Evaluation and Policy Analysis, 23(2), 99-121.

Kane, M. T. (1992). An argument-based approach to validity. Psychological Bulletin, 112, 527-535.

Kane, M. T. (2001, April). The role of policy assumptions in validating high-stakes testing programs. Paper presented at the annual meeting of the American Educational Research Association, Seattle, WA.

Kupermintz, H., Ennis, M. M., Hamilton, L. S., Talbert, J. E., & Snow, R. E. (1995). Enhancing the validity and usefulness of large-scale assessments: I. NELS:88 mathematics achievement. American Educational Research Journal, 32, 525-554.

Ladd, H. F. (Ed.). (1996). Holding schools accountable: Performance-based reform in education. Washington, DC: Brookings Institution.

Lane, S., Park, C. S., & Stone, C. A. (1998). A framework for evaluating the consequences of assessment programs. Educational Measurement: Issues and Practice, 17(2), 24-27.

Linn, R. L. (1998). Partitioning responsibility for the evaluation of consequences of assessment programs. Educational Measurement: Issues and Practice, 17(2), 28-30.

Loevinger, J. (1957). Objective tests as instruments of psychological theory [Monograph]. Psychological Reports, 3, 635-694.

Mehrens, W. A. (1990). Combining evaluation data from multiple sources. In J. Millman & L. Darling-Hammond (Eds.), Teacher evaluation (pp. 322-344). Newbury Park, CA: Sage.

Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13-103). New York: American Council on Education and Macmillan.

Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons' responses and performances as scientific inquiry into score meaning. American Psychologist, 50(9), 741-749.

Moss, P. A. (1998). The role of consequences in validity theory. Educational Measurement: Issues and Practice, 17(2), 6-13.

Scriven, M. (1994). The final synthesis. Evaluation Practice, 15(3), 367-382.

Shadish, W. R., Cook, T. D., & Leviton, L. C. (1991). Foundations of program evaluation. Newbury Park, CA: Sage.

Shepard, L. (1993). Evaluating test validity. Review of Research in Education, 19, 405-450.

Willingham, W. W., & Cole, N. S. (1997). Gender and fair assessment. Hillsdale, NJ: Erlbaum.
