Assessment Validation in the Context of High-Stakes Assessment


Katherine Ryan, University of Illinois

Including the perspectives of stakeholder groups (e.g., teachers, parents) can improve the validity of high-stakes assessment interpretations and uses. How stakeholder groups view high-stakes assessments and their uses may differ significantly from the views of state-level policy officials. The views of these stakeholders can contribute to identifying the strengths and weaknesses of the intended assessment interpretations and uses. This article proposes a process approach to validity that addresses assessment validation in the context of high-stakes assessment. The process approach includes a test evaluator or validator who considers the perspectives of five stakeholder groups at four different stages of assessment maturity in relationship to six aspects of construct validity. The tasks of the test evaluator and how stakeholders' views might be incorporated are illustrated at each stage of assessment maturity. How the test evaluator might make judgments about the merit of high-stakes assessment interpretations and uses is discussed.

Most states are enacting and implementing multilevel educational accountability systems that define content and performance standards that emphasize high achievement, including complex understanding of subject areas and higher order thinking (Kupermintz, Ennis, Hamilton, Talbert, & Snow, 1995; Ladd, 1996). These content standards (what students should know and be able to do in mathematics, reading, etc.) and performance standards (the level and quality of knowledge and skills in specific content areas) are accompanied by assessments. The assessments, representing the content and performance standards, are used for holding schools accountable to improve instruction, student learning, grade promotion, and certification.

When test results are used for potentially serious consequences like grade promotion, certification, or the award of salary increases, the assessment is characterized as "high stakes." As Kane (2001, p. 2) says, "Note that it is their consequences that insert the 'high stakes.'" The consequences of high-stakes assessments impact all students, teachers, and schools. While the goal of standards-based accountability is to improve teaching and learning for all, particular groups of students, teachers, and schools (e.g., low income) may be disproportionately affected by the consequences. Certainly, how stakeholders like students, teachers, and parents view high-stakes assessment interpretations and uses may differ significantly from how state-level policy officials, who are interested in educational outcomes, view them. Including the perspectives of stakeholder groups like school administrators, teachers, parents, and students in the assessment validation process can improve the validity of high-stakes assessment interpretations and uses. The views of these stakeholders can contribute to identifying the strengths and weaknesses of the intended assessment interpretations and uses.

In this article I propose a process approach to validity that addresses assessment validation in the context of high-stakes assessment. The process approach is linked to three themes: the notion of a "test evaluator," the links between evaluation and validity inquiry, and the role of the stakeholders in assessment validation. After I present the process approach, I briefly discuss how the test evaluator might formulate a judgment about the merit or value of assessment interpretations and uses. I conclude with some general comments on future directions for validation inquiry.

The Test Evaluator as a Public Scientist

Unlike the test developer, the evaluator holds no brief for or against the test, but rather is committed to serve all the persons having stakes in affairs the test may influence. Unlike the writer of test reviews, the evaluator undertakes independent research. Unlike the investigators who accumulate background knowledge while satisfying motives of their own, but like the program evaluator, the test evaluator is expected to produce a report in a limited amount of time. (Cronbach, 1989, p. 164)

Both Linn (1998) and Cronbach (1988, 1989) have suggested that a "test evaluator" or "validator" is needed in assessment validation. Explicitly acknowledging the political dimension in the work of the evaluator, Cronbach et al. (1980) characterized the evaluator as a "public scientist," prescribing that the evaluator serve the interests of the "public good."

Katherine Ryan is Associate Professor of Educational Psychology, University of Illinois, Champaign, IL 61820. Her specializations are educational evaluation and applied measurement.


In this article, I present the test evaluator's obligations and responsibilities in relationship to a multifaceted framework that represents a process approach to validation inquiry (see Fig. 1). I particularly emphasize the test evaluator's responsibilities in determining the validation questions while addressing the dilemma of the confirmationist bias (Cronbach, 1988; Haertel, 1999; Shepard, 1993). The confirmationist bias is the tendency to look for supporting evidence in the validation of assessment interpretations and uses instead of taking a more balanced view that examines both the strengths and weaknesses of intended interpretations and uses.

As illustrated in Figure 1, I propose including the perspectives of stakeholder groups and/or audiences (those who might be interested in the findings about the evaluation of assessment interpretations and uses) in the assessment validation process. Stakeholders are groups who have interests that are at stake in the assessment interpretations and uses. The perspectives of these groups may be considered at four different stages of assessment maturity in relationship to six aspects of construct validity. This validation approach, which is fundamentally anchored by "the value implications of score meaning as a basis for action and the social consequences of score use" (Messick, 1995, p. 741), should shape the practices of the test evaluator.

Because of the high stakes involved in testing for accountability purposes, it is best for the test evaluator to be located externally to both the assessment development and use. (There may be an individual directly connected to the assessment who is the "internal evaluator," or the tasks may be completed by several people holding different positions who are performing an "internal evaluation function.") The external test evaluator or validator is responsible to and for all stakeholders throughout the validation inquiry. This charge is particularly critical at the times the questions to be studied are being identified (Cronbach, 1989). Balancing rather than favoring one group's interests and ideology over another is a central activity for test evaluators. Acknowledging their own values and interests is important. To gain reliable understanding of stakeholders' perspectives, test evaluators will need to be close to, as opposed to distant from, the stakeholders.

Questions for potential study concerning possible assessment interpretations and uses are gathered from all stakeholders. These stakeholders are policymakers and government officials as well as less privileged groups. How and whether each question is studied is decided by a winnowing process involving negotiation and judgment in divergent and convergent phases of question development. Cronbach (1989) and Shepard (1993) propose four criteria to consider in prioritizing validity questions: uncertainty about the question, cost, criticality of information, and information yield from the study. Even when questions are not studied, bringing the issues to light is helpful in knowing what was not studied.
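To make the winnowing step concrete, here is a minimal, purely hypothetical sketch of how a test evaluator might score candidate validity questions against the four criteria. The questions, ratings, and weights are invented for illustration and are not part of Cronbach's or Shepard's proposals.

# Hypothetical sketch: ranking candidate validity questions on the four
# criteria named above (uncertainty, cost, criticality, information yield).
# All questions, ratings, and weights are invented for illustration.
from dataclasses import dataclass

@dataclass
class ValidityQuestion:
    text: str
    uncertainty: float   # how unsettled the answer is (0-1, higher = more uncertain)
    cost: float          # expense of studying it (0-1, higher = more costly)
    criticality: float   # how much decisions hinge on the answer (0-1)
    info_yield: float    # expected information gained from a study (0-1)

def priority(q: ValidityQuestion, weights=(0.3, 0.2, 0.3, 0.2)) -> float:
    """Simple weighted score; cost counts against a question."""
    w_unc, w_cost, w_crit, w_yield = weights
    return (w_unc * q.uncertainty - w_cost * q.cost
            + w_crit * q.criticality + w_yield * q.info_yield)

questions = [
    ValidityQuestion("Do exit-exam scores reflect higher order thinking?", 0.8, 0.6, 0.9, 0.7),
    ValidityQuestion("Does the test encourage teaching to the test?", 0.7, 0.4, 0.8, 0.6),
    ValidityQuestion("Are accommodations adequate for students with disabilities?", 0.5, 0.3, 0.9, 0.5),
]

# Rank candidate questions; low-ranked questions are still recorded so that
# what was not studied remains visible to stakeholders.
for q in sorted(questions, key=priority, reverse=True):
    print(f"{priority(q):5.2f}  {q.text}")

Even such a crude scoring exercise is only an aid to the negotiation described above, not a substitute for it.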

However, the test evaluator, by the nature of training and work, carries a confirmationist bias (Haertel, 1999). This bias is complex, involving management, government, administration, the professions, and the discourses surrounding high-stakes testing in society. Consequently, the test evaluator is situated in such a way that he or she is responsible for balancing many interests, some of which are interests in which he or she is vested. At the same time, the evaluator's external location, with links to outside institutions, technical skills, and knowledge of analytical frameworks, and the evaluator's experience with the use of empirical evidence are assets. They are a key part of the warrant that she or he brings to the validation inquiry. However, it is the role of the test validator, in specifying study questions, collecting data, and all other phases of the validation process, to bring a balanced perspective that avoids the confirmationist bias.

Links Between Evaluation and Validity Inquiry

Validation of a test or test use is evaluation. . . . Validation speaks to a diverse and potentially critical au- dience, therefore the argument must link concepts, evidence, social and personal consequences, and values. (Cronbach, 1988, p. 4)

FIGURE 1. A Process Approach to Validity Inquiry. [Figure not reproduced: it crosses the stakeholder groups facet (policymakers, school officials, teachers, parents and students, and illuminators) with the assessment maturity and validity criteria facets.]


In the brief sentences above, Cronbach introduces two key concepts that substantially alter how to think about assessment validation: the relationship between validation and evaluation and a diverse and potentially critical audience. The notion of validation as a construction of and an evaluation of the arguments for and against assessment interpretations and uses has been given serious consideration in modern conceptions of the validation process (Cronbach, 1988; Haertel, 1999; Kane, 1992; Linn, 1998; Messick, 1989, 1995; Shepard, 1993). The concept of "critical audience" has not been as clearly articulated; still, the notion of audience, stakeholder, and multiple perspectives in the validation process has been visualized (Haertel, 1999, 2001; Lane, Park, & Stone, 1998; Messick, 1995; Moss, 1998; Shepard, 1993).

Over a decade ago, Cronbach (1988) pointed out the parallels between evaluation inquiry and validation inquiry, especially their roles in shaping policy and practice. He proposed that some of the solutions that evolved in the more recent approaches in program evaluation theory and practice might be helpful in reconsidering the kind of inquiry needed in the validation of the interpretations and uses of assessments. In addition to conceptualizing validation of an assessment or assessment use as evaluation, he proposed a "validity argument" that corresponds to the logic of the "evaluation argument" (Cronbach, 1988; House, 1977). Cronbach proposed four principles to guide the development of the assessment validation argument: (a) the limitations of each interpretation are shaped by the degree of justification; (b) the interpretation can be a description, prediction, or a recommended decision; (c) the local users' inferences and policies (and test developers' interpretations) should be examined; and (d) the task of validation involves examining the strengths and weaknesses of the assessment interpretations and uses.

While the term evaluation is used within the literature on validation (e.g., Shepard, 1993), the meaning of evaluation in the validation inquiry context is not clearly defined. This is a key concept in constructing and examining the arguments for and against assessment interpretations and uses. I define evaluation here as a systematic examination of interpretations and uses occurring in and resulting from an assessment or accountability system. (Other indicators could also be examined as part of this systematic examination.) The evaluation is conducted to assist in (a) improving the assessment interpretations or uses and/or (b) making judgments about the merits or worth of these interpretations and uses. Validation inquiry is the overall evaluation of the intended and unintended interpretations and uses of test scores.

To make a judgment about the validity of assessment interpretations and uses, some criteria can be selected to justify the judgments. The validator or test evaluator presents an argument with evidence specifying the criteria employed in the evaluation and their justification. Determining criteria in an evaluation is not a straightforward process. Nor can all possible assessment interpretations and uses be studied. In the validity criteria facet of Figure 1, I present an example of the criteria that might be appropriate for validation of intended assessment interpretations and uses.

Validity Criteria Facet: Evidence to Be Collected

Theoretically, criteria in an evaluation can and should come from any number of sources (e.g., from an examination of what is being evaluated, stakeholders and audiences, research literature, etc.). Messick's (1995) theory of construct validity can be considered as providing general criteria for all educational and psychological assessments. For the purposes of this paper, I am adopting these criteria as the initial benchmark for the validity facet of the framework for implementing validation inquiries (see Fig. 1). There are other sources of criteria that are more or less similar, including the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 1999) and the five perspectives on the validity argument (Cronbach, 1988).

Messick criticized historical conceptualizations of validity for not addressing two major issues: ". . . the value implications of score meaning as a basis for action and the social consequences of score use" (Messick, 1995, p. 741). (See Messick, 1989, for a complete discussion of these issues and his argument concerning their centrality to validity.) Instead, he proposed a unified concept of validity based on an expanded theory of construct validity that "integrates considerations of content, criteria, and consequences into a construct framework for the empirical testing of rational hypotheses about score meaning and theoretically relevant relationships including those of an applied and a scientific nature" (Messick, 1995, p. 751).

He concluded that construct validity should incorporate any evidence that impacts the meaning and interpretation of the assessment scores (Messick, 1989, 1995). Validity is defined as an overall judgment of the extent to which the empirical evidence and theory support the adequacy and appropriateness of the interpretations from assessments for a specific use. Critical to this definition is the notion that validity is not a property of a test or assessment. Instead, validity is a characteristic of the meaning and interpretation of the assessment scores and any actions based on the assessment scores.

The following list provides brief definitions of terms within Messick's construct validation theory, presents potential sources of evidence, and illustrates how evaluators and stakeholders might collect evidence or be involved.

Content aspects include evidence of content relevance and representativeness. Establishing the boundaries of the domain to be assessed is critical in conceptualizing content considerations. Sources of evidence typically are results of job analysis, task analysis, logical analysis, and other forms of analysis conducted by expert judges. Stakeholders could participate by assisting in determining the boundaries of the construct and in collecting sources of evidence concerning the criticality or importance of particular dimensions derived from the task analysis.

Substantive aspects involve evidence supporting the theoretical and empirical analysis of the processes, strategies, and knowledge proposed to account for respondents' item and/or task performance on the assessment. Sources of evidence include analysis of individual responses or response processes through think-aloud protocols or simply asking respondents about their responses. Stakeholders can make judgments about whether the theoretical analysis supports students' item and/or task performance and what might be missing. Some stakeholder groups might participate in the analysis of think-aloud protocols.

Structural aspects are most similar to concerns relating to the internal structure of an assessment. Based on Loevinger's (1957) concept of structural fidelity, roughly speaking, structural considerations involve assessing how well the scoring structure parallels the construct domain. Sources of evidence involve structural considerations based on investigations of the interitem correlations and test dimensionality (a small illustrative sketch follows this list). Stakeholders can examine the structural dimensions of the assessment and address concerns or issues about how well the structure maps to the construct.

External aspects include the familiar types of convergent and discriminant evidence from multitrait-multimethod studies. Sources of evidence concerning the relevance of the criterion are also addressed in external considerations. Stakeholders can suggest and participate in studies investigating convergent and discriminant validity. They can also propose relevant criteria and participate in studies examining these criteria.

Generalizability aspects are concerned with the degree to which score meaning and use can be generalized to other populations, contexts, and tasks, including the test (assessment)-criterion relationship. Sources of evidence consist of prediction studies and other studies of how particular factors (e.g., type of assessment taker) might impact the assessment-criterion relationship. Stakeholders can suggest factors that they consider relevant for investigation.

Consequential aspects are concerned with score meaning and the intended and unintended consequences of assessment use. Sources of evidence within the high-stakes assessment context might involve a study of how the use of assessment scores for teacher salary increases impacts teachers' instruction of students. Intended consequences (e.g., more learning) and unintended consequences (e.g., teaching to the test) are examined. Questionnaires, classroom observations, and case studies are the most typical methods used to study consequences. Stakeholders can present their perspectives on how the consequences of high-stakes assessments are influencing teaching and learning. These issues can be investigated, providing sources of evidence about assessment consequences.
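As a hedged illustration of the structural-aspect evidence named above, the sketch below computes an inter-item correlation matrix and a rough eigenvalue-based dimensionality check on simulated item responses. The data and the particular checks are assumptions made for illustration, not procedures prescribed by Messick.

# Illustrative sketch (not a prescribed procedure): examining structural
# evidence via inter-item correlations and a crude dimensionality check.
# Item responses here are simulated; a real study would use examinee data.
import numpy as np

rng = np.random.default_rng(0)
n_examinees, n_items = 500, 10

# Simulate roughly unidimensional 0/1 item responses driven by one ability factor.
ability = rng.normal(size=n_examinees)
difficulty = np.linspace(-1.5, 1.5, n_items)
prob_correct = 1 / (1 + np.exp(-(ability[:, None] - difficulty[None, :])))
responses = (rng.random((n_examinees, n_items)) < prob_correct).astype(float)

# Inter-item correlation matrix.
item_corr = np.corrcoef(responses, rowvar=False)
print("Mean inter-item correlation:",
      round(item_corr[np.triu_indices(n_items, k=1)].mean(), 3))

# Eigenvalues of the correlation matrix as a rough dimensionality check:
# one dominant eigenvalue is consistent with a single intended dimension.
eigenvalues = np.linalg.eigvalsh(item_corr)[::-1]
print("Largest eigenvalues:", np.round(eigenvalues[:3], 2))

A stakeholder panel reviewing such output would still need to judge whether the observed structure maps onto the intended construct.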

Validation and Multiple Perspectives

In his 1988 and 1989 papers, Cronbach fundamentally shifted validation from a ". . . ritual performed behind the scene with the professional elite as witness and judge" (1988, p. 3) to audiences with multiple perspectives. Validation involves "activities that clarify for a relevant community what a measurement means and the limitations of each interpretation" (Cronbach, 1988, p. 3).

The notion of the audience, stakeholder, and multiple perspectives in the validation process, especially in relationship to values, is receiving attention (Cronbach, 1988; Haertel, 1999, 2001; Messick, 1995; Shepard, 1993). Values are fundamental to the meaning and outcomes of assessment (Messick, 1995). Messick (1995) presents a persuasive case for the examination of explicit and implicit values in score interpretation and use. He proposes that looking at both assessment interpretation and use from multiple perspectives is one approach to making tacit values visible in the validation process. In terms of score interpretation, this involves empirically and substantively examining the alternative ideologies and theories surrounding the construct of interest (Messick, 1995). The same strategy of multiple perspectives can also be used in examining assessment uses.

Open dialogue and debate will bring different value commitments concerning assessment use to light. However, while Messick (1995) is emphatic about the importance of multiple perspectives in evaluating the arguments for and against assessment interpretations and uses, whose values the multiple perspectives represent is not well defined.

The Role of Stakeholders and Audiences in Validity Arguments and Conclusions

Accountability for educational outcomes should be a shared responsibility of states, school districts, public officials, educators, parents, and students. High standards cannot be established and maintained merely by imposing them on students. (Heubert & Hauser, 1999, p. 5)

The role of stakeholders in the assessment validation process has only been minimally articulated. In the context of large-scale assessment programs, it is obvious that particular assessment interpretations and uses do have an impact on the stakeholders and audiences. For example, results from a recent investigation suggest high-school graduation tests increase the probability that the lowest achieving students will drop out (Jacob, 2001).

Including stakeholders' perspectives is one approach for addressing the confirmationist bias (Cronbach, 1988; Haertel, 1999; Shepard, 1993). The confirmationist bias violates one of the principles [(d), i.e., examining the strengths and weaknesses of the assessment interpretations and uses] that Cronbach proposes for developing the validity argument, cited in the section above. Haertel (1999) proposed moving beyond the "confirmationist bias to look for evidence against our intended assessment interpretations" (p. 6) in high-stakes assessment in his 1999 NCME Presidential Address. Specifically, he suggested that even talking about whether to include different groups of people in a discussion about the meaning and use of assessment interpretations was of value to the assessment community. He asked the assessment community to imagine surveying teachers, principals, and school superintendents to ask them if they regard test scores as adequate measures of what they are accomplishing and to listen to what they say (Haertel, 1999).

Stakeholder Facet: Who Are the Stakeholders and Audiences?

Who are potential stakeholders and audiences in a high-stakes assessment validation inquiry? An adaptation of Cronbach et al.'s (1980) policy-shaping community might be a good starting point for considering who these groups might be. The stakeholder facet in Figure 1 presents various groups who could be the stakeholders and audience in the high-stakes assessment validation. There are many other possible stakeholder groups, including combinations of groups. For instance, in Illinois there is a coalition between large school districts and the business community. The coalition is actively lobbying for increased accountability.


As illustrated in Figure 1, the different stakeholder groups and audiences within the assessment and/or accountability system include policymakers at the federal and/or state level, school officials (superintendents and principals), and those involved in direct instruction (teachers). Federal and state policymakers are concerned with educational outcomes and how they contribute to policy. Their interests in assessments are focused on the measurement of these educational outcomes. Superintendents and principals are responsible for the administration and daily operations of the educational programs. They are held accountable for seeing that there is progress toward meeting educational standards and outcomes that are now represented as increases in test scores and other indicators.

Teachers and other direct service personnel are responsible for instruction. There is a move toward holding teachers responsible for increases in assessment scores and other indicators by rewarding increases with pay increases and bonuses. Theoretically, teachers (and probably principals and superintendents) are interested in how and whether high-stakes assessments can adequately measure student learning.

Members of the public include two groups: individuals directly affected by the assessment or system consequences (parents, students) and illuminators (journalists, academicians) (Cronbach et al., 1980). Parents and students have a stake in how the assessment or accountability system validation might affect their interests, such as whether high-school diplomas are awarded based on assessment performance. The illuminators interpret and communicate information about the assessment, accountability system, and the validation inquiry. They are interested in disentangling, or perhaps entangling, issues surrounding the assessment and assessment validation inquiry.

Issues in Including Stakeholders' Perspectives. The logic of the validity argument becomes substantially more complex with the addition of stakeholders' and audiences' perspectives. Not only the nature of assessment interpretations and uses needs to be considered when determining appropriate criteria. Determining appropriate criteria from different points of view is of equal importance. There is a fundamental distinction between "good of a kind X, which means the thing fulfills a role, and good from a Y point of view" (House & Howe, 1999, p. 21). This distinction is at the heart of considering assessment interpretations and uses as good from a technical and functional perspective (good of a kind) and considering whether assessment interpretations and uses are good from the perspectives of groups whose interests are affected differentially (good from a stakeholder group's point of view). For example, developers of a high-stakes statewide assessment comprised of multiple-choice items may be able to point with pride to the technical quality of this assessment (good of a kind). However, in spite of the technical quality of the assessment, teachers (a stakeholder group) might argue assessment scores are not adequate measures of what students are accomplishing or of what they are teaching (good from a stakeholder group's point of view).

The nature of the assessment interpretations and uses, their functions, and what these are actually doing are the critical considerations in deciding who the stakeholders and audiences are and what the criteria are. It will not be surprising, given the multiple perspectives of audiences and stakeholders, that they may find different criteria acceptable, different arguments persuasive, and different evidence credible. Teachers, for example, may regard the results of think-aloud protocols of a small number of students' responses to standardized test items as evidence of critical thinking in mathematics. On the other hand, the legislature and policymakers might find such studies of little interest or unconvincing.

Assessment Maturity Facet: Who, How, and When to Include Stakeholders in the Assessment Validation Process

Who, how, and when to include stakeholders in the assessment validation process shifts substantially in relationship to the maturity of the assessment. The assessment maturity facet in Figure 1 proposes four stages in the maturity of an assessment: conceptualization, design, implementation, and operational. While five stakeholder groups and four stages of assessment maturity suggest there are multiple places and multiple groups of stakeholders to be considered in the validation inquiry, it is unlikely that each group will be involved in each phase of the validation inquiry. Nevertheless, this representation does reflect the potential political and social complexity of high-stakes assessment and accountability (see Fig. 1).

While in theory these stages are considered to be a linear sequential process, they are interconnected and overlapping. I propose considering the validation inquiry in four separate phases because which stakeholders are involved, the questions of interest, and the potential studies to be conducted shift substantially from one phase of the validation process to another. In addition, if one of the goals of validation inquiry is to assist in improving the assessment or accountability interpretations and uses, defining key points for a feedback mechanism in the validation process is critical for improvement to take place. It is difficult to make changes in how an assessment is conceptualized when it is fully operational. For example, changes in the assessment conceptualization based on an evaluation of the construct representation are much easier to accomplish before the assessment is fielded.

Conceptualization Phase. Foremost in the conceptualization stage is determining the purpose of the assessment, identifying who will be taking the test, and establishing the proposed interpretation and use, the specific context of use, construct analysis, assessment formats, and the assessment specifications. Interactions between constructs and formats are critical considerations in format selections (Willingham & Cole, 1997). At this stage, multiple alternatives are proposed, discussed, discarded, and perhaps revisited, with significant negotiations on the substantive features and details of the conceptualization before a set of assessment specifications is agreed upon. Rival hypotheses to be considered involving the meaning of low achievement scores include whether low scores are due to a lack of motivation, lack of competence, or emotional upset. Of equal importance at this phase of the validation process are key features of the construct of interest that are either not assessed well or not assessed at all in the context of large-scale assessment. The issue of group differences in test performance needs to be addressed in conceptualizing the assessment. In a hypothetical example for a high-school exit exam, the rate of success and failure changed substantially for males and females depending on the construct being assessed (Willingham & Cole, 1997).

Haertel (1999) presents an example of what a "theory" behind the conceptualization of a high-stakes assessment might look like (see Fig. 2). He identifies three purposes for large-scale assessments: (a) accountability, (b) contribution to the public discourse about educational concerns, and (c) testing to improve teaching and learning. As illustrated in Figure 2, a key assumption anchoring these purposes is that "test scores tell how well schools are performing" (Haertel, 1999, p. 5). This basic premise leads to two others (e.g., test scores show how much students know and can do), which in turn lead to five others (see Fig. 2). This "theory" represents intended assessment interpretations and uses that need to be validated. All stakeholder groups may not agree that these interpretations are justified.

Stakeholder groups. Figure 2 basically represents policymakers' views of educational accountability: if schools are managed well and doing a good job (managerial efficiency), then students will learn more and demonstrate this improvement by higher scores on standardized achievement tests (outcome). Stakeholder groups like teachers and school administrators may not agree that standardized test scores can adequately represent how much students know and can do. Teachers and other stakeholder groups may think that performance assessment tasks (e.g., an extended week-long project) are critical for representing student achievement. Parents, teachers, and school administrators may be particularly concerned about whether standardized test scores are good indicators for specific groups of students (e.g., low-achieving students, students with poor test-taking skills). The perspectives of policymakers, parents, teachers, school administrators, the public, and other stakeholders may not agree, but they do need consideration, for example, in the selection of constructs to be measured for high-school exit exams where group differences in test scores are at issue.

Tasks of the evaluator. The role of the validator in this phase involves examining all aspects of the assessment conceptualization process to determine if and how these activities were accomplished. Central to the evaluator's role is identifying the stakeholders, like teachers and administrators, and ensuring their point of view is included.

FIGURE 2. Unpacking the Premise that Test Scores Show How Well Schools Are Performing. (Reprinted, with permission, from Haertel, 1999.) [Figure not reproduced: it diagrams premises underlying the claim that test scores tell how well schools are performing, including that answering test items requires skills included in the curriculum, that skills demonstrated on tests can also be used on other tasks, and that proficiency on tested skills shows proficiency on untested skills.]


For instance, helping all stakeholder groups reveal their particular "theory" of what high-stakes assessment interpretations and uses are going to accomplish in this particular context is one of the test evaluator's tasks. The evaluator can do this by first presenting stakeholders with an illustration of what a "high-stakes assessment theory" is; Figure 2 would be a good example. Then the evaluator can ask them how their theory is the same and how it is different. Stakeholders' alternative theories may identify an area of weakness in the intended assessment interpretations and uses that needs to be justified. For instance, teachers may think high-stakes assessments lead to "teaching to the test." The test evaluator may decide this is a key issue that needs further investigation. This particular concern can be studied as part of the assessment validation process.

Design Phase. Based on the specifications developed in the conceptualization stage, samples of the best possible items are prepared for field testing. At this stage there are painstaking efforts to ensure the items are carefully mapped to the constructs and specifications identified in the conceptualization stage. A field trial takes place under realistic field conditions in an effort to predict what the assessment will look like when fully operationalized. The activities include choosing items and topics; editorial, sensitivity, and substantive reviews of sample assessment tasks in relationship to the specifications and the assessment overall; opportunity to learn; and statistical analyses [e.g., differential item functioning (DIF) analyses] ensuring the technical quality of the preliminary assessment.
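To make the statistical side of this phase more concrete, the following sketch runs a Mantel-Haenszel check, one standard DIF procedure, on a single simulated item, stratifying examinees by total score. The data, simulation details, and flagging rule are illustrative assumptions, not steps prescribed in this article.

# Illustrative sketch of a Mantel-Haenszel DIF check for one item,
# stratifying on total test score. Data are simulated for illustration;
# this is one common DIF procedure, not the only one a validator might use.
import numpy as np

rng = np.random.default_rng(1)
n = 2000
group = rng.integers(0, 2, n)            # 0 = reference group, 1 = focal group
total = rng.integers(0, 21, n)           # matching variable: total score 0-20
# Simulate a studied item whose difficulty depends only on total score
# (i.e., no DIF built in); a real analysis would use observed responses.
p_item = 1 / (1 + np.exp(-(total - 10) / 4))
item = (rng.random(n) < p_item).astype(int)

num, den = 0.0, 0.0
for k in np.unique(total):               # stratify by matched total score
    in_k = total == k
    a = np.sum((group == 0) & (item == 1) & in_k)  # reference correct
    b = np.sum((group == 0) & (item == 0) & in_k)  # reference incorrect
    c = np.sum((group == 1) & (item == 1) & in_k)  # focal correct
    d = np.sum((group == 1) & (item == 0) & in_k)  # focal incorrect
    n_k = a + b + c + d
    if n_k > 0:
        num += a * d / n_k
        den += b * c / n_k

alpha_mh = num / den
print("MH common odds ratio:", round(alpha_mh, 2))   # near 1.0 suggests little DIF
# ETS delta scale: D = -2.35 * ln(alpha); large |D| is commonly flagged for review.
print("ETS delta:", round(-2.35 * np.log(alpha_mh), 2))

Flagged items would then go to the editorial and sensitivity reviews described above rather than being removed mechanically.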

Taking into consideration the latest findings, which suggest that scores on multiple-choice tasks and, to a lesser extent, concrete items disproportionately favor male test takers, is critical in field trials. Think-aloud protocol studies investigating the substantive aspects of validity, whether the assessment tasks are engaging students in higher order thinking, are conducted routinely. Standard setting, a defining feature of standards-based educational reform, can be examined empirically and substantively at this phase of the validation process. The potential consequences of various cut scores that will be used to define "basic" and "proficient" can be studied at this phase of the validation process.

Stakeholder groups. Haertel (2001) calls for standard setting to be a more participatory process. In a participatory standard-setting process, stakeholders like teachers, school administrators, and the public can be involved in setting standards as part of the "due process" enacted by policymakers (e.g., the Superintendent of Instruction). The values, beliefs, and intentions of the panelists (e.g., stakeholders like teachers and representatives of the workforce) should be adequately reflected in the cut scores selected to represent the "basic" and "proficient" labels of the performance benchmarks. Other activities involving stakeholders might include teachers reviewing the results of think-aloud protocols to verify that the assessment tasks do assess higher order thinking skills.
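As a minimal sketch of how panelists' judgments might be summarized in such a participatory exercise, the code below takes invented panelist recommendations for a "proficient" cut score and reports their median and spread. The panelists, values, and disagreement tolerance are hypothetical, and this is not a procedure specified by Haertel (2001).

# Hypothetical sketch: summarizing panelists' recommended cut scores in a
# participatory standard-setting exercise. All values are invented, and a
# median with a spread check is only one simple way to summarize judgments.
from statistics import median, stdev

# Each panelist's recommended "proficient" cut score (percent correct).
panelists = {
    "teacher_1": 62, "teacher_2": 58, "administrator_1": 65,
    "parent_rep": 55, "workforce_rep": 70, "teacher_3": 60,
}

ratings = list(panelists.values())
cut = median(ratings)
spread = stdev(ratings)

print(f"Recommended 'proficient' cut score (median): {cut}")
print(f"Between-panelist standard deviation: {spread:.1f}")
if spread > 5:   # invented tolerance: large disagreement may call for more discussion
    print("Panelists diverge; further discussion of values and intentions may be warranted.")

The point of such a summary is not the number itself but making visible whether the panel's values and intentions are in fact reflected in the cut score that is finally adopted.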

Tasks of the evaluator. Mapping the "theory" behind the assessment conceptualization is not enough. The validator, in the design phase, examines whether the conceptualization of the assessment is actually being implemented. The tasks include examining all aspects of the assessment design process to identify if and how these activities were accomplished. The evaluator's role is to identify the stakeholders' interests in the process and ensure their points of view are part of the assessment design.

For example, stakeholders like teachers and others can participate in traditional activities such as review of the item pools for cultural sensitivity and irrelevant sources of difficulty. Including community members and perhaps parents in the validation of cut scores for high-school graduation exams is another example of how the validator could include stakeholder perspectives. The validator is also responsible for determining whether the cut scores are reliable.

Implementation Phase. In the implementation stage, the assessment is actually administered under conditions that are similar to those under which the assessment will be used for accountability purposes. Defining, ensuring, and assessing the effects of the proposed standard conditions and timing, and examining the accommodations that are suitable for students with disabilities, are carefully considered assessment administration issues. The routine technical work such as scoring, scaling, and equating is completed. The reliability of the assessment scores, subgroup performance, and comparability are assessed. Studies predicting the impact of actually implementing concrete consequences based on the assessment results are appropriate here: for students, groups of students, effects on instruction, and educational outcomes. These include the differential effects from various decision models and conditions of use, and an investigation of proposed criterion information.

Stakeholder groups. Stakeholder groups can be involved in studies investigating the meaning of assessment interpretations and uses from their particular perspectives. In general, as part of the accountability model, policymakers do assume that standardized test scores reflect what students know and can do. Whether teachers, principals, and parents think the assessment adequately reflects student achievement can be investigated. How students, parents, teachers, and superintendents and principals interpret the meaning of these assessment scores can also be examined. In addition, the perspectives of students, parents, teachers, and administrators concerning the social costs of particular accountability consequences that are being considered based on assessment scores can be investigated.

Tasks of the evaluator. The tasks of the validator include reviewing the technical and administrative aspects of the implementation phase. In addition, the validator may conduct several studies to investigate the meaning of assessment interpretations and uses from the perspectives of stakeholder groups, or perhaps review these kinds of studies where they are already completed. Conducting or reviewing studies examining the number of schools that would be put on warning lists based on particular performance levels, and comparing the impact of these various performance levels on individuals and specific subgroups, illustrate some of the kinds of activities engaging the test evaluator during this phase.
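One of the impact comparisons mentioned above can be sketched very simply: count how many schools would land on a warning list under alternative performance levels. The school-level data and thresholds below are simulated and invented for illustration only.

# Hypothetical sketch of one impact study described above: comparing how many
# schools would land on a "warning list" under two candidate performance levels.
# The school-level percentages and thresholds are simulated/invented.
import numpy as np

rng = np.random.default_rng(2)
n_schools = 400
# Simulated percent of students meeting standards per school.
pct_meeting = np.clip(rng.normal(55, 15, n_schools), 0, 100)

for threshold in (40, 50):   # candidate performance levels (percent meeting standards)
    flagged = pct_meeting < threshold
    print(f"Warning list at {threshold}%: {flagged.sum()} schools "
          f"({100 * flagged.mean():.0f}% of all schools)")

The same tabulation can be repeated for subgroups of schools or students so stakeholders can see who bears the consequences under each candidate performance level.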

The tasks for test evaluators include studying the social consequences of accountability. Studying social consequences is complex. As Cole and Zieky (2001) suggest, the same consequence will be viewed differently depending on people's values. They illustrate this point with an example from high-school graduation tests. When low-achieving students do not pass high-school graduation tests, some people or groups will consider this to be necessary for a high-school diploma to have meaning. Others will see this as unfair to the student and consider the failure to reflect other factors such as poor schooling or a lack of school funding.

Operational Phase. The assessment is administered routinely and is used for accountability purposes during the operational phase. At least in theory, it is at this phase that the notion of consequences plays out: for example, students attending a school that has not shown improvement based on test scores could be given a voucher to attend a school of their choice. Investigations examining the effects of the consequences and systemic effects for students, groups of students, and communities, and the effects on instruction and educational outcomes, are all potential studies of interest. Criterion studies, or a review of completed criterion studies, are conducted here.

Stakeholder groups. Policymakers, teachers, school administrators, parents, and other stakeholders can participate in selecting the particular criteria that need to be studied. Policymakers may be more interested in studying whether linking teachers' salary increases to students' achievement actually impacts students' test scores and how this link might change teachers' instructional practices. Other stakeholder groups like school administrators and teachers may be interested in examining whether there is an increase in teaching to the test or the influence of testing on tracking or promotion.

Tasks of the evaluator. The role of the test evaluator shifts significantly during the operational phase of the assessment maturity process, when a judgment about the merit or worth of the assessment system is to be made. That is, a fundamental judgment is made about the validity of the assessment interpretations and uses. While the test evaluator has synthesized information during all phases of the assessment validation, synthesizing evidence, concepts, values, consequences, and stakeholders' interests that are likely to be conflicting in order to make judgments about merit or worth is a complex task. To date, there are few formal rules or procedures for making such judgments (House, 1995; House & Howe, 1999).

A complete discussion of the philosophy and logic of making judgments of merit or worth is beyond the scope of this article. Nevertheless, I briefly highlight suggested approaches in program evaluation and present one option that could be used in judging the merit or worth of assessment interpretations and uses.

Judging the Merit or Value of Assessment Interpretations and Uses

Making a fundamental judgment about the validity of the assessment interpretations and uses is a significant challenge. Within the field of evaluation, how to synthesize diverse criteria and data to formulate judgments of merit, and how to integrate different stakeholder interests into this synthesis process, have received substantial attention (House, 1995; Scriven, 1994). There is no clear consensus on either issue: synthesizing the criteria or the integration of different stakeholder interests. There are alternatives for addressing the synthesis problem (House, 1995). These include numerical approaches which involve the quantification of judgments (Mehrens, 1990), a heuristic method classifying criteria as primary or secondary before synthesis (Scriven, 1994), and a synthesis approach (House, 1995). Mehrens argues that combining data is necessary only for summative purposes. In the context of the evaluation of teaching, he recommends either the conjunctive method (a multiple-cutoffs model) or the compensatory model (data are combined based on an algorithm that permits low scores on some measures to be compensated by scores on other measures). He notes the quantification of judgments is preferred over clinical methods (defined as eyeballing the data) when a criterion is available. Scriven (1994) criticizes a quantitative weight and sum approach for assuming that a single scale can be used for weights, performances, and number of criteria. To classify criteria as primary or secondary, he proposes a qualitative weight and sum approach where there are five levels of qualitative importance: essential, very important, important, just significant, and not significant. Secondary criteria are employed only after primary criteria (essential) are considered.
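To make the contrast between the conjunctive and compensatory models concrete, the sketch below applies both rules to the same hypothetical set of criterion scores. The criteria, weights, and cutoffs are invented for illustration and are not drawn from Mehrens (1990).

# Hypothetical sketch contrasting two of the synthesis models noted above:
# a conjunctive (multiple-cutoff) rule and a compensatory (weighted-sum) rule.
# Criterion scores, cutoffs, and weights are invented for illustration only.

scores = {"content_coverage": 0.80, "score_reliability": 0.65, "consequence_evidence": 0.90}
cutoffs = {"content_coverage": 0.70, "score_reliability": 0.70, "consequence_evidence": 0.60}
weights = {"content_coverage": 0.4, "score_reliability": 0.3, "consequence_evidence": 0.3}

# Conjunctive: every criterion must clear its own cutoff; no compensation allowed.
conjunctive_pass = all(scores[c] >= cutoffs[c] for c in scores)

# Compensatory: a weighted sum lets strength on one criterion offset weakness on another.
composite = sum(weights[c] * scores[c] for c in scores)
compensatory_pass = composite >= 0.70   # invented overall cutoff

print("Conjunctive rule:", "pass" if conjunctive_pass else "fail")
print(f"Compensatory composite = {composite:.2f}:", "pass" if compensatory_pass else "fail")

With these invented numbers the two rules disagree, which is exactly the kind of divergence that makes the choice of synthesis model a value-laden decision rather than a purely technical one.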

I propose synthesizing the criteria and data for making judgments about assessment interpretations and uses with House's (1995) synthesis approach. Critical to this approach is the notion that the decision is context specific rather than a law-like generalization about merit or worth. The "all things considered" synthesis "provides the most coherence possible from the information available from various sources" (p. 34). Coherence involves the fitting together of available evidence; the judgment can be disputed if critical evidence is omitted (House, 1995). The Standards for Educational and Psychological Testing (AERA et al., 1999) also cites the notion that a coherent account of evidence is a key to making a defensible judgment about test score interpretations and uses. I propose grounding the synthesis in dialectical reasoning to specifically address the issue of the confirmationist bias. By dialectical reasoning, I mean that the reasons given to support a judgment will be designed to particularly address the doubts or questions about the judgment that others have raised or might raise (Blair, 1995).

How to integrate different stakeholder interests into this synthesis process is an even greater challenge. Of course, theoretically, stakeholder and audience perspectives should be included in each phase of the validation process. Nevertheless, there are two approaches that might be considered in making judgments about the assessment validation process. Developing multiple syntheses, which reflect multiple judgments (e.g., constructing several value positions depending on different stakeholders' perspectives), is one possibility (Shadish, Cook, & Leviton, 1991). Arriving at a single qualified judgment is another possibility (House, 1995). The same kind of "intuitive to and fro" reasoning is the heuristic for arriving at a single judgment. The qualifications or contextual features of the validation process are also critical here. That is, judgments are specifically tailored to the assessment context at hand.

Conclusions

The framework presented in this article is an initial effort in developing a set of concrete strategies for implementing validation inquiries, addressing the concerns about how to include stakeholders as part of the validation process, and defining the tasks and activities of the test evaluator and stakeholders in the validation process. There are several important topics that were not addressed in this article or were only briefly discussed.

If the assessment community is going to move beyond the confirmationist bias, how to make judgments about the merit of assessment interpretations and uses is critical. While House's (1995) approach and others may be a good starting point, further theoretical and empirical work is needed. How to integrate stakeholders' perspectives in the synthesis process is problematic and makes for a substantially more complex process. Constructing several value positions depending on different stakeholders' perspectives is highly complicated.

Managing conflicting views and advice is a challenge. Some views will be based on misunderstandings. As Brennan (2001) suggests, teachers, school administrators, politicians, and the public are innocent about the complexities of an assessment program that meets the Standards for Educational and Psychological Testing (AERA et al., 1999). The evaluator has the opportunity, in this case, to be a teacher about the intricacies of assessment interpretations and uses. On the other hand, teachers, school administrators, the public, and other stakeholders do think group performance differences matter when considering issues like construct choices for high-school exit exams and providing students with an adequate opportunity to show what they know in high-stakes assessment contexts. As Cole and Zieky (2001) suggest, these stakeholders' views are reasonable, while a technical position maintaining otherwise is less so.

In efforts to draw attention to the Standards for Educational and Psychological Testing (AERA et al., 1999), Fremer (2000a) suggests that the concept of construct validation has been elevated to such a high level that it may be out of reach. While Green (2000) argued that construct validation is not out of reach, Fremer (2000b) and Green agree that concrete examples illustrating validation tasks would be useful. Further developments in validation theory and inquiry would benefit greatly from empirical studies where test evaluators or researchers actually implement and illustrate forms of validation inquiry. To date, many tasks of validation remain either undeveloped or are best recognized as forms of tacit knowledge.

References

American Educational Research Association (AERA), American Psychological Association (APA), & National Council on Measurement in Education (NCME). (1999). Standards for educational and psychological testing. Washington, DC: American Psychological Association.

Blair, J. A. (1995). Informal logic and reasoning in evaluation. In D. M. Fournier (Ed.), Reasoning in evaluation: Inferential leaps and links (pp. 71-80). New Directions for Evaluation, 68.

Brennan, R. L. (2001). Some problems, pitfalls, and paradoxes in educational measurement. Educational Measurement: Issues and Practice, 20(4), 6-18.

Cole, N. S., & Zieky, M. J. (2001). The new faces of fairness. Journal of Educational Measurement, 38(4), 369-382.

Cronbach, L. J. (1988). Five perspectives on validity argument. In H. Wainer & H. Braun (Eds.), Test validity (pp. 3-37). Hillsdale, NJ: Erlbaum.

Cronbach, L. J. (1989). Construct validation after thirty years. In R. L. Linn (Ed.), Intelligence: Measurement, theory, and public policy (pp. 147-171). Urbana: University of Illinois Press.

Cronbach, L. J., et al. (1980). Toward reform of program evaluation: Aims, methods, and institutional arrangements. San Francisco: Jossey-Bass.

Fremer, J. (2000a, September). Promoting high standards and the problem of construct validation. National Council on Measurement Newsletter, 8(3), 1.

Fremer, J. (2000b, December). My last (?) comment on construct validity. National Council on Measurement Newsletter, 8(4), 2.

Green, J. (2000, December). Standards for validation. National Council on Measurement Newsletter, 8(4), 8.

Haertel, E. H. (1999). Validity arguments for high-stakes testing: In search of evidence. Educational Measurement: Issues and Practice, 18(4), 5-9.

Haertel, E. H. (2001, April). Standard setting as a participatory process. Paper presented at the annual meeting of the American Educational Research Association, Seattle, WA.

Heubert, J. P., & Hauser, R. M. (Eds.). (1999). High stakes: Testing for tracking, promotion, and graduation. Washington, DC: National Academy Press.

House, E. R. (1977). The logic of the evaluation argument. Los Angeles: Center for the Study of Evaluation.

House, E. R. (1995). Putting things together coherently: Logic and justice. In D. M. Fournier (Ed.), Reasoning in evaluation: Inferential leaps and links (pp. 33-48). New Directions for Evaluation, 68.

House, E. R., & Howe, K. (1999). Values in evaluation and social research. Thousand Oaks, CA: Sage.

Jacob, B. A. (2001). Getting tough? The impact of high school graduation exams. Educational Evaluation and Policy Analysis, 23(2), 99-121.

Kane, M. T. (1992). An argument-based approach to validity. Psychological Bulletin, 112, 527-535.

Kane, M. T. (2001, April). The role of policy assumptions in validating high-stakes testing programs. Paper presented at the annual meeting of the American Educational Research Association, Seattle, WA.

Kupermintz, H., Ennis, M. M., Hamilton, L. S., Talbert, J. E., & Snow, R. E. (1995). Enhancing the validity and usefulness of large-scale assessments: I. NELS:88 mathematics achievement. American Educational Research Journal, 32, 525-554.

Ladd, H. F. (Ed.). (1996). Holding schools accountable: Performance-based reform in education. Washington, DC: Brookings Institution.

Lane, S., Park, C. S., & Stone, C. A. (1998). A framework for evaluating the consequences of assessment programs. Educational Measurement: Issues and Practice, 17(2), 24-27.

Linn, R. L. (1998). Partitioning responsibility for the evaluation of consequences of assessment programs. Educational Measurement: Issues and Practice, 17(2), 28-30.

Loevinger, J. (1957). Objective tests as instruments of psychological theory [Monograph]. Psychological Reports, 3, 635-694.

Mehrens, W. A. (1990). Combining evaluation data from multiple sources. In J. Millman & L. Darling-Hammond (Eds.), Teacher evaluation (pp. 322-344). Newbury Park, CA: Sage.

Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13-103). New York: American Council on Education and Macmillan.

Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons' responses and performances as scientific inquiry into score meaning. American Psychologist, 50(9), 741-749.

Moss, P. A. (1998). The role of consequences in validity theory. Educational Measurement: Issues and Practice, 17(2), 6-13.

Scriven, M. (1994). The final synthesis. Evaluation Practice, 15(3), 367-382.

Shadish, W. R., Cook, T. D., & Leviton, L. C. (1991). Foundations of program evaluation. Newbury Park, CA: Sage.

Shepard, L. (1993). Evaluating test validity. Review of Research in Education, 19, 405-450.

Willingham, W. W., & Cole, N. S. (1997). Gender and fair assessment. Hillsdale, NJ: Erlbaum.
