We are pleased to present this, the seventh volume of published symposia, papers, and surveys of the National Association of Test Directors. The Association considers the promotion of discourse on testing matters from both theoretical and practical perspectives to be essential to its mission. The publication of this document is one important activity undertaken to address that goal. These papers were presented at the April 1991 meeting of the National Council on Measurement in Education in Chicago, IL. They reflect topics of major interest to members of the measurement and testing communities.

Ernest Bauer OAKLAND (MI) PUBLIC SCHOOLS

Peter Wolmut MULTNOMAH (OR) EDUCATION SERVICE DISTRICT

Co-editors

TABLE OF CONTENTS

Symposium I: More Authentic Assessment: Theory and Practice

Introduction

More Authentic Assessment: The Big Picture Walter E. Hathaway

A Survey of More Authentic Assessment Practices Joe B. Hansen

Portfolio Assessment: Issues Related to Aggregation Allan Olson

Large Scale Assessment: Focusing on What Students Really Know and Can Do Frank Horvath

An Inclusive Approach to Alternative Assessment Lew Pike

Discussion Barbara Presseissen

Symposium II: Measurement Issues in Performance Assessment

Performance Assessment: What's Out There and How Good Is It Really? Judy Arter
Performance Testing and Standardized Testing: What Are Their Proper Places? John Fremer
Reliability of Performance Assessments: Let's Make Sure We Account for the Errors Michael Trevisan


Discussion I Gilbert Sax

Discussion II Richard Stiggins

Symposium III: Local, State, National, and International Indicator Systems: Will We Know Where We Are If We Get There?

Introduction
The Council of Great City Schools Indicator Project Sharon Johnson-Lewis
The Michigan Educational Cost/Quality Indicator Project Edward Roeber
The International Indicator System Project Gary Phillips
Indicator Systems from a Local District Perspective Kevin Matter

AUTHORS AND EDITORS

JUDY ARTER, Director, Test Center, Northwest Regional Educ. Lab., 101 SW Main St., Ste. 101, Portland OR 97204
ERNEST BAUER, Director of Testing, Oakland Public Schools, 2100 Pontiac Lake Rd., Pontiac MI 48054
JOHN FREMER, Senior Development Leader, ETS, Mailstop 07-E, Princeton NJ 08541
WALTER E. HATHAWAY, Director of Research & Eval., Portland Public Schools, 501 N. Dixon St., Portland OR 97227
JOE B. HANSEN, Exec. Director, Eval. & Research, Colorado Springs School District 11, 1115 N. El Paso St., Colorado Springs CO 80903
FRANK G. HORVATH, Director, Student Evaluation, Alberta Education, 11160 Jasper Av., Edmonton ALTA T5K 0L2
SHARON JOHNSON-LEWIS, Moore & Associates, 25160 Lahser, Ste. 200, Southfield MI 48034
KEVIN MATTER, Coord., Testing & Research, Cherry Creek Public Schools, 4700 S. Yosemite, Englewood CO 80122
ALLAN OLSON, Executive Director, Northwest Evaluation Association, 5 Centerpointe Dr., Suite 100, Lake Oswego OR 97035
GARY PHILLIPS, Natl. Center for Educ. Statistics, US Dept. of Education, 555 New Jersey Ave. NW, Washington DC 20208
LEWIS W. PIKE, Mgr., Research & Analysis, Fairfax County Public Schools, 7423 Camp Alger Ave., Falls Church VA 22042
BARBARA PRESSEISEN, Director, National Networking, Research for Better Schools, 444 N. 3rd St., Philadelphia PA 19123-4107
EDWARD D. ROEBER, Supv., Assess. & Accreditation, Michigan Dept. of Education, POB 30008, Lansing MI 48909
GILBERT SAX, Prof. of Educ. Psychology, Univ. of Washington, Seattle WA 98195
RICHARD STIGGINS, Director of Assessment, Northwest Regional Educ. Lab., 101 SW Main St., Ste. 500, Portland OR 97204
MICHAEL TREVISAN, RMC Research Corp., 522 SW 5th Av., Ste. 1407, Portland OR 97204
PETER WOLMUT, Director, School Support Services, Multnomah ESD, 11611 NE Ainsworth, Portland OR 97220

ACKNOWLEDGEMENTS



The editors wish to express deepest appreciation to the members of the Board of the National Council on Measurement in Education for their continued support of National Association of Test Directors endeavors. Special thanks go to Nancy Rodgers, Camella Forster, Sue Aschim, and Charlene Smith for their assistance in producing this volume.


Symposium I
MORE AUTHENTIC ASSESSMENT: THEORY AND PRACTICE

Local district test directors are being pressed to move rapidly into assessment methodologies considered to be more "authentic". In this symposium organized by Joe Hansen [Colorado Springs Public Schools], Walter Hathaway [Portland (OR) Public Schools] provides some background on the impetus for authentic assessment, points out some of its advantages and disadvantages, and makes some proposals for its inclusion in testing programs. Joe Hansen presents the results of a national survey of authentic assessment practices, including names and addresses of persons in organizations using authentic assessment methods. Allan Olson [Northwest Evaluation Association] offers a white paper on issues related to the aggregation of portfolio assessment data. Frank Horvath [Alberta Department of Education] describes that province's attempts to incorporate performance assessment in its examinations. Lewis Pike [Fairfax (VA) County Schools] shares information about the development of performance tests in Mathematics. Barbara Presseisen [Research for Better Schools], in her role as discussant, provides her perspective on these issues.


MORE AUTHENTIC ASSESSMENT: THE BIG PICTURE*

Walter E. Hathaway
Portland Public Schools

There is a growing dilemma in the field of educational assessment. A schism is emerging from the conflict between two trends: educational reform and accountability versus school restructuring, teacher empowerment, and integrated curricular approaches. On the one hand, the increasing top-down press for educational accountability and productivity at all levels has led to a dramatic rise in system wide student assessment, most of it traditional, standardized testing. For example, all fifty states now have some form of standardized, statewide assessment. On the other hand, there has been a move to restructuring for enhanced school autonomy and teacher control. Concurrently, there has been a move toward holistic approaches to curriculum and evaluation which encompass a greater commitment to teaching higher order thinking skills. These concerns have brought about increasing resistance to traditional testing and interest in more "authentic" assessments. Thus the dilemma is that policy makers who want to evaluate the success of system wide educational reforms usually want traditional testing, while those educators interested in the most promising "grass roots" school and classroom educational reforms and restructuring prefer to forego traditional tests in favor of more authentic assessments.

The term "authentic" assessment as used throughout this paper means the gathering and evaluation of evidence produced in a naturalistic time frame and context these assessments reveal student performance on meaningful and challenging tasks as close as possible to the ones which the student and others are expected to actually do in the "real world". These assessments are often based upon performances, demonstrations, exhibitions, portfolios or projects. For example, if a student is being assessed in science, he or she may be asked to perform *a scientific experiment rather than take a traditional science test. If the student is expected to write persuasive essays, then he or she will write persuasive essays in a naturalistic time and context. And if the student is expected to ice skate proficiently, then he or she will be asked to ice skate and he or she will be evaluated in terms of bothdifficulty and performance.

There are several key characteristics of "authentic" assessment. It should: a) require a short chain of inference from the test performance to the real world competence (direct relevance), b) foster disciplined inquiry, c) challenge the student to integrate knowledge, and d) have value beyond evaluation (Archbald and Newmann, 1988).

The current attacks upon traditional standardized tests in favor of more authentic, "open" assessments by teachers and curriculum specialists seem quite ironic to those who recall that decades ago Terman and others led the development of standardized measurement in response to "grass roots" educators who found that open-ended assessments of the sorts now being promoted as cure-alls were inherently subjective, unreliable, inaccurate, inconsistent, and inequitable.

There are some serious problems with large scale applications of more "authentic" approaches to educational assessment, namely that they have not been proven to be as valid, reliable, objective, accurate or useful for system wide assessment as the standardized measures they would replace, even though they are many times more costly. The issue can be put this way:

Which, if any, of the new classroom focused and developed methods can produce cost effective information that will be useful not only for the instruction of individual students, but which can also be made sufficiently accurate, objective and discriminative as to be meaningfully aggregated to validly represent degrees and differences in student group performance and in program effectiveness at the school, state and national levels?


The National Commission on Testing and Public Policy (1990) estimates that in this country, mandatory standardized testing annually consumes some 20 million school days and between 700 and 900 million dollars. A great deal of this testing is new. Much of it is being done as part of recent educational reform efforts, as educational policy makers and taxpayers are increasingly insisting on a "results or outcomes based" approach to evaluating the results of educational expenditures.

At the same time as the policy makers are demanding more standardized testing for evaluating the success of educational reform efforts, American students consistently perform at levels far below those achieved in the majority of industrialized nations (Shanker, pg. 1). A number of people believe that an over reliance on standardized testing itself may be a primary factor in America's educational lag. "The U.S. is the only nation that relies on multiple choice tests for large-scale assessment," states Linda Darling-Hammond. "Most countries we compete with in Europe and Asia that out achieve us use essays, oral exams and exhibits of students' work" (Newsweek, Jan. 8, 1990). In response to growing concern over American students' poor international showing, members of the National Commission on Testing and Public Policy (1990) have recommended that alternative forms of assessment be adopted in American schools. More and more American educators, especially teachers and curriculum specialists, are demanding that testing become more "authentic", i.e., assess in a realistic and integral way meaningful skills and abilities, including those of higher order thinking and problem solving, that enable students to become successful, productive adults. To the proponents of more authentic assessment and to the opponents of standardized testing, many of the evaluation tools currently used in America's schools provide little worthwhile information, lack "authenticity" and, ultimately, may undermine and subvert the educational process itself.

This paper: a) examines some of the major criticisms being leveled at standardized tests and misuses of their results; b) describes and discusses the claimed advantages and disadvantages of more authentic assessment; and c) proposes a general direction that might be taken toward integrating traditional and newer forms of assessment.


Criticisms of Current Standardized Testing

The National Commission on Testing and Public Policy (1990) has identified key problems with standardized testing as it now exists:

1. Tests are imperfect and therefore potentially misleading as measures of individual performance in education and employment.
2. Some tests result in unfair treatment of individuals and groups.
3. Students are subjected to too much testing in this nation's schools.
4. Some testing practices in both education and employment undermine important social policies and institutions intended to develop or utilize human talent.
5. Tests have become instruments of public policy without sufficient public accountability (Commission Report, pg. 6).

Perhaps the biggest complaint leveled against standardized, objective achievement testing is that it fails to assess real mastery and therefore is of limited validity as an assessment of student learning. "A true test asks students to show what they know and can do, not to spout unrelated facts they have memorized the night before" (Horace, March 1990, pg. 1).

Traditional testing has long been criticized for "neglecting the kind of competence expressed in authentic, 'real life' situations beyond school -- speaking, writing, reading and solving mechanical, biological, or civic problems" (Archbald and Newmann, pg. vi).

These charges of invalidity and irrelevance are typically derived from assumptions rather than from empirical studies. In reality, there is considerable empirical research supporting the advantages of objective items, including the ability of well designed tests made up of such items to tap complex thinking: reasoning, evaluation of arguments, and the application of knowledge to new situations. For example, "...objective tests prove to be more valid predictors of the quality of essays written under proper conditions than do essay tests" (Anastasi, 1982, pg. 398-399). Much of the problem with current testing seems intimately linked to two testing assumptions, decomposability and decontextualization. These two assumptions underlie almost all current, traditional testing practices and are being challenged.

Early psychological theories were based on the assumption that thought was made up of a number of independent pieces of knowledge and that all skills could be broken down into smaller and more easily measurable components. Thus, if you wished to test whether a person was a skilled reader, you needed only determine whether they were able to perform the key subtasks that make up the skill of reading. This approach has been harshly criticized in recent years by proponents of holistic and integrated approaches to curriculum and instruction. They maintain that complex abilities cannot be defined solely by their components and that the whole is greater than the sum of its parts (e.g., Anderson, 1983). Thus, while there may indeed be a high correlation between student scores on multiple-choice verbal tests and their ability to perform a skill such as writing, it is feared that at least some students will typically be misclassified as poor writers or as having incomplete verbal skills after multiple choice testing when in fact a more "authentic" and holistic assessment might have more accurately indicated that they were competent.

The second major assumption apparent in almost all standardized achievement tests is that each component of a complex skill remains unaffected by the context in which it is used. In other words, if a student is able to perform decontextualized editing, a common element of standardized verbal tests, they will also be able to perform similar skills when editing their own work. However, studies have shown that there can be no absolute line drawn between data and its interpretation (e.g., Lakatos, 1978; Toulmin, 1972). In other words, the context in which a skill is performed is relevant; "knowledge and skill cannot be detached from their context of practice


and use" (Resnick, pg. 9). Decontextualization, then, is pointed to by critics as another factor that can hinder a test's ability to measure accurately the broad abilities it purports to.

Because traditional test results are narrow and fallible, some misclassification of who is and who is not competent to perform well in the real world is inevitable. According to the critics of traditional testing the burden of these misclassifications falls disproportionately on certain ethnic and linguistic minority groups, as well as on students who have special learning needs, styles, or difficulties.

The reasons posited for the observed disparity between minority and majority traditional test scores include culturally biased tests, differences in economics or education, and the limited power of existing tests to predict success. Whatever causes such disparity, the fact remains that many minorities are being denied opportunity. Whenever testing limits the choices of individuals in certain groups, our assessment practices must be reexamined. In addition, the opponents of current standardized testing claim that it also tends to discriminate against children with learning disabilities due to its rigid time limits and inflexible, multiple choice answers.

To the opponents of traditional, standardized testing, the investment in it is also excessive. They note that over 20 million school days a year are used simply for taking standardized tests. Even more importantly, it seems to them that tests are becoming more and more widely used for such controversial practices as kindergarten promotion and advancement from grade to grade, placement in "special learning" programs, and graduation from high school. Moreover:

From 1972 to 1985 the number of state testing programs skyrocketed from 1 to 34. Every state now has a mandated testing program of some kind.

Actual revenue from sales of tests and related services has been estimated at a half billion dollars per year.

The direct cost for state and local testing plus indirect teacher costs may be as high as 915 million dollars annually (Commission Report, pg. 7).

These figures, the critics say, fail to include another significant cost of so much standardized testing -- learning opportunity cost. Much of the time spent teaching the routine and lower-order thinking skills often present on standardized tests could be put to much better use. In an effort to improve increasingly "high stakes" test scores, many educators have resorted to spending inordinate amounts of class time actually "teaching to the test." Such huge fiscal and opportunity costs could be justified as legitimate educational expenses if they positively affected our school systems. However, the critics say the continuing trend of increased standardized testing has not created appreciable improvement in student performance (e.g., Shanker, pg. 1; Commission Report, pg. 18).

One of the primary functions of testing is to assess educational quality. In recent years local attention to reform has been directed by mandatory, "high stakes" testing. The danger in such testing is that when the stakes are raised, the pressure on schools to improve their scores leads to disastrous solutions and undermines the educational process itself. In Pennsylvania, for example, mandatory state test scores were made public in 1987. Immediately the test scores became a "benchmark" for comparison between Pennsylvania school systems -- the tests became "high stakes." Schools that performed poorly lamented that they would have to alter their curriculum for the following year. One superintendent explained, "We don't believe in the test that strongly, but we will be forced to see that all material is covered before the tests ... We won't be caught in the newspapers again" (Corbett and Wilson, 1989). Others involved in similar dilemmas agreed. "Teachers feel jerked around," a Maryland teacher confided. "The test dictates what I will teach in my classroom" (Corbett and Wilson, 1989). The charge of the opponents of standardized testing is, then, that more and more schools are becoming involved in "high stakes" testing and are therefore led to "teach to the test" in order to raise test scores. Improving test


results becomes more important than other teaching, learning, and societal responsibilities that are arguably more consequential.

The final criticism of traditional testing rests on the apparently fatal allure of one-point-in-time test scores in isolation from all other actual or possible kinds of evidence. Students are placed in Talented and Gifted Programs or remedial programs largely on the basis of their scores. Programs, policies, budgets and professionals all rise and fall with test scores. And this in the face of an almost universal commitment within education to using multiple indicators to support important educational decisions. When teachers spend valuable class time emphasizing test taking strategies, children learn those skill components and test strategies rather than higher level processes (N. Frederiksen, 1984). As Grant Wiggins, President of CLASS (Consultants for Learning Assessment and School Structure), points out, "What you test is what you get." "If we want to have quality assessment that creates quality work we need to test for the tasks we want kids to be good at" (from videotape "Multidimensional Assessment Strategies," 1990). This point of view of the tyranny of tests over teachers seems to regard them as something less than fully professional.

If "high-stakes" tests were adequate measures of students' performance, then they would serve to reinforce curriculum and aid learning. When imperfect tests become too important, however, say the critics of existing tests, school curriculum is actually debased because it focuses on simplistic multiple choice questions and test-taking skills (Koretz, 1988).

There are a number of advantages claimed for more authentic alternative assessment techniques: they measure directly what children should know; they emphasize higher thinking skills, personal judgment and collaboration; they urge children to become active participants in the learning process; and they allow educators to "teach to the test" without destroying validity.

There are also a number of disadvantages besides the earlier discussed ones of undemonstrated validity and reliability. These include:

high cost; difficulty in making results consistently quantifiable and aggregatable; and undemonstrated validity, reliability and comparability of the current subjective scoring systems.

Advantages

One of the greatest advantages claimed for "authentic" testing is that it can test what educators want children to know. Because "authentic" testing assumes neither decomposability nor decontextualization, skills can be tested "holistically" and in context. Holistic testing and test scoring define procedures in which the entire performance and its outcome are relevant in determining mastery. This theory stands in contrast to the scoring technique used on most existing tests, which is to count up the errors and subtract points on the basis of them. Context, too, is often important in "authentic" testing. According to the authentic assessment proponents, how students apply skills is intimately linked to where, when and how they apply them. Authentic assessment, by design, tends to take context into consideration.

Authentic assessment also emphasizes the skills of higher thinking and personal judgment, and allows collaboration. Performance tests can allow students to write, create, do original research, analyze, pose and solve problems. Much standardized testing fails to even approximate such tasks, according to some critics (e.g., Resnick, 1989; Shanker, 1990). Peter Elbow and Pat Belanoff, examiners of a progressive writing program at the State University of New York at Stony Brook, discovered that "authentic" assessment teaches students "that their reactions and opinions about serious matters deserve time and attention," whereas standardized tests often


stifle creativity and personal insight because the multiple choice format implies that all the students can do is choose (or guess) someone else's "right" answer (Resnick, 1989). Such a format does not allow the students to engage in interpretive activity and ultimately may leave the test-taker feeling powerless and uninvolved. "Authentic" assessment, however, is designed to create an environment in which students can "show" what they know, leaving the power in their hands and allowing them to utilize higher thinking skills (Horace, March 1990).

"Authentic" Assessment also helps children become more involved in their own learning process. Howard Gardner of Harvard's Project Zero claims that there are seven basic intelligences: linguistic, musical, spatial, logical/mathematical, bodily kinesthetic, interpersonal and intrapersonal. The majority of class time and standardized testing-are focused on only two of these intelligences: linguistic and logical/mathematical. Two very important intelligences, interpersonal and intrapersonal, are often neglected. Well developed intrapersonal intelligence is a common trait in successful individuals. Most authentic testing involves some form of self-criticism and personal evaluation, whether it be editing a piece of writing or critiquing a drawing. Most standardized testing, however, is thought to involve other peoples' work (editing someone else’s writing, solving problems using predetermined techniques, etc.) and actually discourages interpersonal intelligence (Resnick, 1989; Archbald and Newman, 1989). Interpersonal intelligence, the ability to relate with others, is also claimed to be fostered with "authentic" assessment.

Many educators also feel that new forms of assessment should be collaborative (e.g., Valencia, McGinley and Pearson, in press). In the world beyond school, students will usually have to work and create with others; rarely does someone in the "real world" create and perform without outside criticism and help. Collaborative assessment helps students develop their interpersonal intelligence and strengthens the bond between teacher and student (e.g., Valencia, McGinley and Pearson, in press; Elbow and Belanoff, 1986).

Arguably the most important advantage of "authentic" assessment is that it allows tests to be instructional. Rather than being an after-the-fact check-up on students' learning, "authentic" tests can reinforce the curriculum and establish genuine intellectual standards. Thus, teachers can "teach to the test" without undermining the validity of the test. In fact, with "authentic" assessment, teaching to the test is not only possible, it is desirable (Resnick, 1989). Such an attitude conflicts with the general assumption that "teaching to the test" is a poor practice. However, with current standardized testing "teaching to the test" is indeed problematic, mainly because of the concept of indicators. While a high verbal score on the SAT may be an indicator of how well a student will perform on an actual written composition, the student need only be able to perform well on multiple choice-type questions to indicate this ability. Such testing assumes that the student is being taught proper writing skills in the classroom. But, as pointed out earlier, students taking standardized tests need only be able to perform specific test exercises to score well (Cannell, 1989). If "authentic" testing measures the skills and abilities educators believe are crucial in performing beyond school, then teaching to the test will raise school standards, improve curriculum and benefit society (Wiggins in Education Week, 1989). Thus, a major claimed advantage of "authentic" testing is that it frees educators from spending time on minimal, reductionist, test-forced curriculum. Again, there might be alternate routes to this liberation, such as regulations and rules preventing the use of student test scores to evaluate teachers (and principals).

Disadvantages

Although the costs of standardized testing today are staggering, "authentic" assessment could prove to be many times more expensive. The need for increased professional time for assessment and such costly items as video cameras could increase assessment costs significantly. For example, in the R.O.P.E. (Rite Of Passage Experience) program used at Walden III, an alternative public school in Racine, Wisconsin, at least 10 hours of extra teacher time are needed for each graduating student (Horace, March 1990). In a school with a graduating


class of 500, that would amount to at least 5,000 more paid hours per year! At 20 dollars an hour, teaching costs alone would increase by $100,000 per school! While there are many different estimates of potential cost, it is clear that "authentic" assessment requires far more teacher and student time than computer scored multiple choice tests or even than emerging versions of traditional tests adapted to include some open ended items and other changes to respond to new curriculum and instructional programs. Because few states, districts or schools have utilized extensive amounts of "authentic" assessment except in pilot versions, actual costs remain unclear. One obvious way to conserve resources while using the new measures of assessment for policy making would be to assess overall system performance.
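Stated as a worked calculation (using the paper's own illustrative figures of 10 extra hours per graduating student and 20 dollars per hour, which are assumptions rather than measured costs):

\[
500 \text{ students} \times 10 \text{ hr/student} = 5{,}000 \text{ hr}, \qquad
5{,}000 \text{ hr} \times \$20/\text{hr} = \$100{,}000 \text{ per school per year.}
\]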

Further difficulties in "authentic" assessments stem from the problems encountered in attempts to make their results valid, reliable and comparable. Here the key issue is subjectivity in evaluating performances. It is difficult to assign a specific, adequately discriminating, scaled score or percentile to a more "authentic" assessment such as an essay, and it is even harder in the case of portfolio evaluations, etc., compared to having a computer count the number of wrong responses to the items on a well designed, objective, standardized test. Rarely does a scale on a performance-based assessment contain more than 10 points. In research done by the Portland Public Schools, we have demonstrated that the results of Direct Writing assessment done in a manner consistent with writing as a process are invariably prompt dependent. As a result, system wide assessment of writing performance cannot be validly compared over time. Thus, it is not possible to answer the question, "Is this eighth grade doing better or worse than last year's?" But this is just the sort of question policy makers need and want to have answers to so that they can modify policies, programs and resources in productive ways. "Authentic" assessments alone thus far cannot readily serve all the decision making needs of educational policy makers, planners, designers and resource allocators beyond the individual classroom. A sense of the nature and strengths and weaknesses of more authentic assessments can be gained by investigating some sample applications.

Walden III's R.O.P.E. Program

Walden III, an alternative public school in Racine, Wisconsin, has developed a program to address the issue of student preparation for life beyond school. In order to graduate, each senior must demonstrate mastery in 15 areas of knowledge and competence by completing and submitting a portfolio of work before a committee made up of staff members, another student in a lower grade, and an adult from the community. The portfolio includes: an autobiography, self-analysis, essays, artistic products, letters of recommendation, and various other indicators of mastery. The portfolio itself is presented by the student before the committee and carefully evaluated and approved before graduation can occur. Clearly, Walden III's program meets the first three criteria (direct relevance, disciplined inquiry and integration of knowledge) admirably. In addition, the fourth characteristic, value beyond evaluation, is fulfilled by the actual process of completing the portfolio itself. The student may spend more than two years working on the project outside of class; they have as long as they like, beginning in their junior year. All the time spent is both educational and self-directed, allowing the student to learn the responsibility and self-discipline which will be needed in college and in later life.

Key School, Indianapolis, Indiana

The Key School is the child of Professor Howard Gardner of Harvard's Project Zero. Located in Indianapolis, Indiana, the 5th grade school is one of the most progressive public schools in the nation. The school utilizes video cameras to tape all projects and oral tests that the students complete. A full time video technician helps keep a video file on each child which can be viewed by students, teachers and parents alike. The classroom environment is non-competitive and the school's philosophy is to build students' strengths rather than reinforce weaknesses. From all accounts the school has been very successful. However, the cost of all the equipment and


extra teaching time is very high. The Key School is experimental and there are few similar programs in existence for financial and logistical reasons. Clearly, it too meets all four of the "authentic" assessment criteria.

Michigan Educational Assessment Program (MEAP) – An Accommodation

The Michigan Educational Assessment Program (MEAP) was established in the late 1960's to provide information on the progress of Michigan students in the essential skills areas. However, when it was decided that these tests no longer provided Michigan educators with adequate feedback on the progress and status of Michigan basic skills education, a group of teachers and curriculum specialists designed the Michigan Essential Skills Reading Test. While the tests still use a multiple choice format, they are untimed. The test also attempts to measure attitudes about reading and self-perceptions of the test-takers. The passages read are long (e.g., 500-2,000 words) and the questions, although multiple choice, are designed to challenge the reader to construct meaning from the text. In addition, the test is designed to assess the familiarity the test-taker has with the reading selection topic. This contradicts the theory that reading assessment selections should be interest and curriculum neutral and context free; instead the test assesses the student's relevant prior knowledge and experience. These characteristics allow the MEAP test to meet the first three criteria outlined earlier (direct relevance, disciplined inquiry and integration of knowledge). The final and arguably most important criterion, value beyond evaluation, was tackled by the Michigan program also. Test result forms are designed in such a manner that the student, teacher and parent can immediately see not only the student's performance in individual areas but also the influence of each performance on other areas. For example, if topic familiarity is low, then lower scores on other sections might be a result of inadequate knowledge of the topic. If the self-perception section indicates that the child is uninterested, then the teacher or parent can immediately try to bring their interest level up. The Michigan Program is promising because it shows the degree and limits to which the tenets of more "authentic" assessment can be accommodated within the less resource intensive and well-established standardized, multiple choice format.

Some Possible Steps Forward

Three general recommendations for improving assessment in public and private schools in this nation seem to offer promise of successfully dealing with the current dilemma:

1. Integrate standardized test use as far as possible with more "authentic" forms of testing while respecting their differing aims;

2. Use multiple indicators in assessment; and
3. Use results in context.

The safe route between the Scylla of traditional "stifling," "limiting," "distorting," objective, multiple choice standardized testing and the Charybdis of alternative, "fuzzy," "ungeneralizable," "unaggregatable," subjective methods of assessment is to find ways to adapt and to get more useful information from the former while developing more useful and cost effective versions of the latter.

A possible interim, "compromise" overall accommodation between alternative, more authentic assessment and traditional standardized measures of academic achievement could go as follows:

1. Accept, promote and practice the belief that multiple measures are better than single ones, especially in measuring gain from one assessment to another versus one time levels of performance;

2. Embrace the desirability of more authentic measures whenever possible and cost effective;

3. Encourage and support development and use of more authentic assessment techniques by teachers in

their classrooms for assessing and monitoring their students' progress and their needs for further learning opportunities and experiences;


4. Work with curriculum and instructional professionals to modify standardized, multiple choice testing systems so that they:

a) Develop and include instruments which assess new dimensions such as context and prior knowledge as the curriculum being assessed suggests.

b) Develop and add to standardized assessment systems more "authentic" items, e.g., for Reading, obtain permission to use long passages of

connected, meaningful text from "published" materials and ask multiple questions, at least some of which tap higher order thinking skills; or for Mathematics, use everyday problems and permit use of calculators for all items except those which assess computation and estimation, etc.

c) Add open ended, extended, non-multiple choice items to standardized, multiple choice tests. For example, on a Mathematics test pose problems and give students time and space to work out the answers and a place on the answer sheet to code in their responses. Use information from such adaptations in an integrated fashion along with the traditional scores.

d) Add additional, more authentic "items" in systemwide assessments using a (matrix) sampling design in order to lend depth of insight into the meaning of large group, aggregate data while maintaining cost effectiveness (a minimal sketch of such a design follows this list).
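The following is a minimal sketch of how such a matrix sampling design might be set up. It is illustrative only: the item blocks, student roster, and rubric scores are hypothetical, and a real application would add sampling weights and error estimates. The idea is that each student takes only one small block of the authentic tasks, yet the aggregate estimate for a school or district covers the whole task pool at a fraction of the per-student testing time.

import random
from collections import defaultdict

# Hypothetical pool of "authentic" tasks, divided into small blocks.
# Each student takes only one block; the full pool is covered across
# the student sample rather than by every individual.
TASK_BLOCKS = {
    "A": ["persuasive essay", "data-table interpretation"],
    "B": ["science experiment write-up", "everyday math problem"],
    "C": ["oral presentation", "open-ended geometry problem"],
}

def assign_blocks(student_ids, blocks, seed=1991):
    """Rotate block assignments through a shuffled student roster."""
    rng = random.Random(seed)
    ids = list(student_ids)
    rng.shuffle(ids)
    labels = sorted(blocks)
    return {sid: labels[i % len(labels)] for i, sid in enumerate(ids)}

def aggregate(scores, assignments):
    """Average rubric scores within each block, then across blocks.
    Equal block sizes are assumed; a real design would weight blocks
    and report the sampling error of the aggregate estimate."""
    by_block = defaultdict(list)
    for sid, score in scores.items():
        by_block[assignments[sid]].append(score)
    block_means = {b: sum(v) / len(v) for b, v in by_block.items()}
    overall = sum(block_means.values()) / len(block_means)
    return block_means, overall

if __name__ == "__main__":
    students = [f"s{i:03d}" for i in range(30)]
    assignments = assign_blocks(students, TASK_BLOCKS)
    rng = random.Random(7)
    # Fabricated 0-6 rubric scores, purely to exercise the aggregation step.
    scores = {sid: rng.randint(2, 6) for sid in students}
    block_means, overall = aggregate(scores, assignments)
    print("Mean rubric score by block:", block_means)
    print("School-level estimate:", round(overall, 2))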

For long range solutions to our dilemma we must turn once again to research and development and to evaluation. Here we need to continue, extend and evaluate pilot efforts to develop and use in a cost effective way such non-traditional assessment approaches as portfolios, projects and performance assessments. We must pay attention to the problems of consistency of rating like responses over different raters and over time, as well as to gaining accurate information on scales of sufficient range to permit meaningful and necessary discriminations. Particular attention must be paid to the need for developing and reporting accurate portrayals of the degree of error and uncertainty of the estimates. At the same time, we should continue to work to adapt and evaluate traditional measures.
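One routine way to watch the consistency of ratings is to double-score a sample of responses and report how often two raters agree on the rubric scale. The short sketch below does this with fabricated scores and a hypothetical function name; it is not a prescribed procedure, only an illustration of the kind of figure (exact and within-one-point agreement) a district could report alongside performance assessment results.

def rater_agreement(ratings_a, ratings_b, tolerance=1):
    """Percent exact and within-tolerance agreement between two raters.
    The inputs are parallel lists of rubric scores (e.g., a 1-6 writing
    scale) assigned independently to the same set of student papers."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two equal-length, non-empty rating lists")
    n = len(ratings_a)
    exact = sum(a == b for a, b in zip(ratings_a, ratings_b))
    adjacent = sum(abs(a - b) <= tolerance for a, b in zip(ratings_a, ratings_b))
    return 100.0 * exact / n, 100.0 * adjacent / n

if __name__ == "__main__":
    # Fabricated double-scored essays on a 1-6 rubric, for illustration only.
    rater_1 = [4, 3, 5, 2, 6, 4, 3, 5, 4, 2]
    rater_2 = [4, 4, 5, 2, 5, 4, 2, 5, 4, 3]
    exact_pct, adjacent_pct = rater_agreement(rater_1, rater_2)
    print(f"Exact agreement:  {exact_pct:.0f}%")
    print(f"Within one point: {adjacent_pct:.0f}%")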

In both cases separately and jointly research the construct, content, predictive-criterion and face validity as well as the relevance of each type of measure and of reports of their results.

Another useful area in which to continue and expand psychometric research and development is computerized (adaptive) testing, especially cutting edge systems in the areas of artificial intelligence, expert systems, fuzzy logic and videodisc/computer interfaces. These systems hold promise, in the long run, of assessing the higher order thinking and problem solving skills that critics of traditional standardized testing say it fails to adequately assess.
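To make the adaptive idea concrete, the sketch below runs the simplest possible adaptive loop under a one-parameter (Rasch) item response model: after each answer, the ability estimate is nudged up or down and the next item is chosen to match it. The item bank, step-size rule, and simulated examinee are illustrative assumptions, not a description of any operational system, which would use maximum-likelihood or Bayesian scoring.

import math
import random

def p_correct(ability, difficulty):
    """Rasch model: probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def adaptive_test(item_bank, answer_fn, n_items=10, step=0.7):
    """Pick the unused item closest to the current ability estimate,
    score the response, and move the estimate toward the evidence."""
    theta = 0.0                      # start at the middle of the scale
    remaining = dict(item_bank)      # item id -> difficulty
    for _ in range(min(n_items, len(remaining))):
        item = min(remaining, key=lambda i: abs(remaining[i] - theta))
        difficulty = remaining.pop(item)
        correct = answer_fn(item, difficulty)
        # A shrinking step size gives a crude, self-stabilizing estimate.
        theta += step if correct else -step
        step *= 0.85
    return theta

if __name__ == "__main__":
    rng = random.Random(42)
    # Hypothetical bank of 20 items with difficulties spread over the scale.
    bank = {f"item{i:02d}": -2.0 + 4.0 * i / 19 for i in range(20)}
    true_ability = 0.8

    def simulated_examinee(item_id, difficulty):
        return rng.random() < p_correct(true_ability, difficulty)

    estimate = adaptive_test(bank, simulated_examinee)
    print(f"True ability {true_ability:+.2f}, estimated {estimate:+.2f}")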

None of these recommendations calls for an immediate, wholesale overhaul of existing testing procedures. Instead they urge immediate implementation and evaluation of the emerging assessment methodology in relatively low stakes areas (such as classroom assessment) and continued and extended research and development for implementation in relatively high stakes areas such as system wide assessment for evaluation, accountability and planning, and graduation competence certification. Assessment's fundamental purpose should remain helping children, their parents and their teachers receive useful feedback, both positive and negative, about the performance and needs of students and the education system which serves them. The controversy over standardized vs. "authentic" assessments unfortunately diverts attention from this more important mission. Tests and assessments are only tools; they can be either valuable or worthless depending on where, when and how they are used. The solution to our current dilemma is not as simple as saying "no more standardized tests." Perhaps we should be saying "no more closed doors" and "no more closed minds."

References

Anastasi, A. (1982). Psychological Testing, Fifth Edition. New York, NY: Macmillan Publishing Co., Inc.; London, England: Collier Macmillan Publishers.


Archbald, D.A., and Newmann, F.M. (1988). Beyond Standardized Testing. Reston, VA: National Association of Secondary School Principals.

Anderson, J.R. (1983). The Architecture of Cognition. Cambridge, MA: Harvard University Press.

Belanoff, P., and Elbow, P. (1986). Using Portfolios to Increase Collaboration and Community in a Writing Program. In Connolly, P., and Vilardi, T. (Eds.), New Methods in College Writing Programs. New York: MLA.

Brandt, R., (1989). On Misuse of Testing: A Conversation With George Madaus. Washington, D.C.: Educational Leadership.

Cannell, J.J. (1989). How Public Educators Cheat on Standardized Tests. Albuquerque, NM: Friends for Education.

College Board (1988). National College-Bound Seniors: 1988 Profile. Profiles of SAT and Achievement Test Takers: National, Ethnic, Sex Profile. New York: The College Board.

The Columbian, Sunday, October 29, 1989. "End of Standardized tests requested."

Corbett, H.D., and Wilson, B. (1989). Raising the Stakes in Statewide Mandatory Minimum Competency Testing. Philadelphia, PA: Research for Better Schools.

Costa, A.L. (1989). Re-assessing Assessment. Washington, D.C.: Educational Leadership.

Horace, March 1990. "Performance and Exhibitions: The Demonstration of Mastery." The Coalition of Essential Schools.

Kirst, M.W. (1991). Interview on Assessment Issues with Lorrie Shepard; Interview on Assessment Issues with James Popham. Washington, D.C.: Educational Researcher.

Koretz, D. (1988). Arriving in Lake Wobegon: Are Standardized Tests Exaggerating Achievement and Distorting Instruction? American Educator.

Lakatos, I. (1978). The Methodology of Scientific Research Programmes. Philosophical Papers, Volume 1. J. Worrall & G. Currie (Eds.). New York: Cambridge University Press.

Lewis, M., and Lindaman, A.D. (1989). How Do We Evaluate Student Writing? One District's Answer. Washington, D.C.: Educational Leadership.

Martinez, M.E., and Lipson, J.I. (1989). Assessment for Learning. Washington, D.C.: Educational Leadership.

Michigan State Board of Education (1989). Essential Skills Reading Test Blueprint. Lansing, Michigan. Unpublished.

"Multidimensional Assessment: Strategies for the classroom" Video tape #4 in the series ."Restructuring to Promote Learning in America's Schools, " 1990.1National Commission on Testing and Public Policy (1990). From Gatekeeper to Gatewav: transforming testing in America Chestnut Hill, PA. National Commission on Testing and Public Policy.Newell A., and Simon, H.A. (1972). Human Problem Sol Englewood Cliffs, NJ: Prentice Hall.

Newsweek, January 8, 1990. "Not as Easy as A, B or C."


Nickerson, R.S. (1989). New Directions in Educational Assessment. Washington, D.C.: Educational Researcher.

Resnick, L.B. (1989). Tests as Standards of Achievement in Schools. Paper prepared for the Educational Testing Service Conference, The Uses of Standardized Tests in American Education, New York.

Robinson, S.P. (1989). The Agenda for Reform in the Use of Standardized Tests: Achieving the Ideal of Inclusiveness. Princeton, NJ: Educational Testing Service.

Roeber, E., and Dutcher, P. (1989). Michigan's Innovative Assessment of Reading. Washington, D.C.: Educational Leadership.

Shanker, A. (1990). The Social and Educational Dilemmas of Test Use. New York: Educational Testing Service.

Shepard, L.A. (1989). Why We Need Better Assessments. Washington, D.C.: Educational Leadership.

Stiggins, R.J. (1987). Design and Development of Performance Assessments, in I.T.E.M.S., Fall 1987.

Toulmin, S.E. (1972). Human Understanding. Princeton, NJ: Princeton University Press.

Valencia, S.W., Pearson, P.D., Peters, C.W., and Wixson, K.K. (1989). Theory and Practice in Statewide Reading Assessment: Closing the Gap. Washington, D.C.: Educational Leadership.

Wolf, D.P. (1989). Portfolio Assessment: Sampling Student Work. Washington, D.C.: Educational Leadership.


A SURVEY OF MORE AUTHENTIC ASSESSMENT PRACTICES*

Joe B. Hansen
Colorado Springs School District 11

INTRODUCTION

The American education system is under intense scrutiny and pressure from virtually every political, educational, business or interest group in our society for its alleged shortcomings. Since the publication of A Nation At Risk: The Imperative for Educational Reform in 1983, the challenge to produce students who can compete intellectually in the global society has been the focus and concern of professional educators, politicians and others interested in the preservation of the American standard of living. The call for educational restructuring has grown from a whisper to a tumultuous roar. The burden on the public schools to cure the ills of society has grown in geometric proportions over the past two decades, while the willingness of the public to pay spiraling educational costs continues to diminish. Greater accountability for the educational tax dollar is being demanded by nearly everyone who has a stake in public education at the same time that the demand for increased academic achievement is being so emphatically stated. Schools of Choice, voucher systems, site based management, and numerous other "solutions" are being advocated by educational interest groups of all types. Education is becoming the top priority issue for the nation as we approach the twenty-first century.

The Need for More Authentic Assessment

As the pressure mounts for significant educational reform, the public is demanding more and stronger evidence that such reforms are working to produce students who can think, communicate effectively, solve complex problems and find solutions to the economic, social, political, environmental and other problems facing our society. As all of this has been occurring, educators, over the past decade, have become increasingly aware of the limitations of standardized tests as the means of assessing student performance and evaluating significant educational change. Numerous commissions, study groups and forums have been created to deal with what is perceived to be a major educational problem: the need for greater authenticity in the way we assess student learning and growth. One such commission, the National Commission on Testing and Public Policy (NCTPP), recently concluded a three year study of "... trends, practices and impacts of the use of standardized testing instruments and other forms of assessment in schools, the workplace, and the military." A key finding of this commission was that:

"Current testing, predominantly multiple choice in format, is over-relied upon, lacks adequate public accountability, sometimes leads to unfairness in the allocation of opportunities, -and too often undermines vital social policies.' (NCTPP, 1990, p. iX)

This broadly based commission included individuals with expertise, interests and experience in a wide variety of fields, including education, business, labor, law, assessment and measurement, and manpower development and training. The NCTPP study concluded that

"To help promote greater development of the talents of all our people, alternative forms of assessment must be developed and more critically judged and used, so that testing and assessment open gates of opportunity rather than close them off.,, (p. x.)

The commission identified five key limitations of testing that must be addressed:

1. Tests are imperfect and therefore potentially misleading as measures of individual performance in education and employment.
2. Some test uses result in unfair treatment of individuals and groups.
3. Students are subjected to too much testing in the nation's schools.
4. Some testing practices in both education and employment undermine important social policies and institutions intended to develop or utilize human talent.
5. Tests have become instruments of public policy without sufficient public accountability.

Whether one agrees with these findings and recommendations or not, they serve to articulate clearly the frustration of educators and education watchers with the limitations of conventional testing. They also serve to underscore the need for increased relevancy and authenticity in educational assessment tools and techniques.


In recent years, numerous authors have reviewed the deficiencies of standardized tests and called for greater authenticity in educational assessments (Archbald and Newmann, 1988; Cannell, 1989; Shepard, 1989; Neill and Medina, 1989; Wiggins, 1989, 1990). These authors and numerous others have called for the development and use of assessment techniques that are "more authentic." This means that the assessments used should have certain characteristics that simulate the conditions a student would experience in applying his/her knowledge or skill in a real world environment. Thus the student has an opportunity to demonstrate his/her knowledge, extend it and apply it. Authentic assessments, as defined by Archbald and Newmann (1988), must meet three criteria: 1) production of discourses, things or performances, 2) flexible use of time, and 3) collaboration. These criteria are based on the concept of "disciplined inquiry" in the real world environment, outside of the classroom.

In describing the Connecticut Common Core of Learning's "enriched performance tasks," Baron (1990) lists features such as: grounded in real world contexts; involve sustained work; based upon the most essential aspects of the discipline(s) being assessed; are broad in scope; blend essential content with essential practice; present non-routine, open-ended and sometimes loosely structured problems that require the student to both define the problem and determine a strategy for solving it; encourage group discussion and "brainstorming"; require students to make, explain, and defend their assumptions, predictions and estimates; stimulate students to make connections and generalizations that will increase their understanding of important concepts and processes; are accompanied by explicitly stated scoring criteria related to content, process, group skills, communication skills, and a variety of motivational dispositions and "habits of mind"; spur students to monitor themselves and to think about their progress; necessitate that students use a variety of skills both for acquiring information and for communicating their strategies, data, conclusions, and reflections.

"Authentic evaluation of educational achievement directly measures actual performance in the subject area. Standardized multiple choice tests, on the other hand, measure test taking skills directly, and everything else either indirectly or not at all." (FairTest, undated)

An authentic assessment task should be worthwhile, significant and meaningful, and should provide substantive information (Archbald and Newmann, 1988). Authentic assessments require students to demonstrate what they know and can do rather than to select a "correct" answer from a list of alternatives. In this sense they are performance based. They have diverse and varied formats that are derived from and relevant to the skill or knowledge being tested. For example, a direct writing assessment that evaluates a student's performance on a sample of the student's writing is an authentic assessment. They involve complex and sustained tasks that tend to call for higher order cognitive skills, rather than recall, recognition, or simple deductive reasoning. Some higher order cognitive skills described in the book Dimensions of Thinking: A Framework for Curriculum and Instruction (Marzano, Brandt, Hughes, Jones, Presseisen, Rankin, Suhor, 1988) that are associated with authentic assessments include: information gathering, generating, analyzing, integrating, and evaluating.

More recently, numerous organizations have emerged to join the battle to reduce our dependency on standardized testing and increase the authenticity of assessments, including the National Education Association (NEA), The National Center for Fair and Open Testing (FairTest), The National Urban Alliance (NUA) and the American Federation of Teachers (AFT).

The response to this overwhelming ground swell of concern regarding the need for better assessment tools has been widespread. The National Assessment of Educational Progress (NAEP), operated by Educational Testing Service (ETS), the largest test publisher in the United States, has responded by incorporating open ended, performance type items in the 1990 math and language assessments, with plans to expand the number of these items in the 1992 assessment. Other test publishers are following suit by offering commercial packages incorporating features of "authentic assessment." Additionally, numerous state assessment programs are incorporating more authentic techniques.

Whole Language Assessment


The trend toward "authentic assessment" has been paralleled by a call for greater authenticity in instruction in an effort to have instruction correspond more closely to emerging knowledge on human learning. The recognition of multiple forms of intelligence (Gardner, 1983) has had a pronounced effect on pedagogical thought in recent years, resulting in efforts to engage students in a variety of modes of instruction in an effort to develop the whole child and provide the student, an opportunity to exercise his or her strengths in the mastery of the instructional material experienced. Performance based assessments have both led and followed from the development of performance based approaches to learning.There is perhaps no discipline more fundamental to the development of the human intellect than reading. Without adequate reading skill, most knowledge remains locked away out of reach of those who would seek it. Few would argue that even in the electronic age, reading is not the most important of skills a student must acquire early in his/her schooling,,. if that student is to succeed in school, and indeed, in life. Therefore the teaching of reading has been a central concern of education from the beginning of the written word.The "whole language" approach to teaching reading is based on the precept that reading, writing and speaking are integrally linked to one another and to learn each is to learn something of the others. Whole language is a "constructivist" approach which advocates that understanding can best be accomplished through the, act of reading itself rather than through the development of discrete component skills, such as word recognition, phonics, spelling and so forth. This approach relies on the use of contextual cues, prior knowledge and potential helpers for the construction of meaning from written material (Valencia and Pearson, 1987.) It calls for the integration of reading throughout the curriculum in social studies, science, mathematics and other subject areas as a fundamental way of learning. It is as much a theory of learning as a theory of language development. (Harste, 1989). Whole language is said to integrate the holistic psychological research of Piaget, Vygotsky and schema theorists with the social, functional-linguistic research of Halliday. It goes beyond the positive, child-centered education movements of the past to integrate scientific concepts and theories of language processes, learning and cognitive development, teaching and curriculum into a practical philosophy to guide classroom decision making (Goodman, K., 1989). So whole language is much more than an instructional technique for language development, it is a theory, philosophy and a major educational movement that is gradually but with certainty, replacing traditional, basal reader, skills based approaches to reading and language instruction.This approach to instruction poses a challenge to educators to find appropriate assessment techniques for measuring student progress in reading and language development. It is antithetical to the concept of whole language instruction to use multiple choice tests to assess reading growth, since to do so would require the specification of sub-skills within the overall level of language skill development. Whole language assessment is based on "kid watching" (Goodman, K., 1989). This type of evaluation is a continuous ongoing, integral process in which teachers are involved in evaluating all aspects of the curriculum. 
It is based on a "double agenda" wherein students are learning through teacher-student interaction and there is continuous reflection on and questioning of the instructional process by both teacher and student. Teachers monitor student performance and ask such questions as "who is getting things done? How are students concepts and hypotheses changing? Who seems confused? How did things go in our discussion group? Did I organize the writing area so those who wanted to write could do so? 11 and so on (Goodman, Y., 1989.) The key techniques used by teachers for evaluating student growth and their own efficacy are "observing, interacting and analyzing."Teachers using the whole language approach have been on their own to devise and experiment with appropriate assessment tools and procedures during the course of instruction. Miscue analysis, portfolios, informal and formal observation techniques and other non-standardized approaches have been the standby assessment tools of the whole language teacher. While having face validity, these techniques are criticized for their subjectivity, lack of reliability and failure to yield aggregatable data. The Northwest Evaluation Association (NWEA) is one organization that has made a commitment to addressing some of the issues associated with whole' language assessment. The NWEA, has, over the past year, held a series of working retreats, involving educational


practitioners, evaluators and researchers in finding solutions to the problems of aggregating portfolio data. The results of this work to date look very promising.

Whole language assessment, therefore, is a special case of "authentic assessment." It is an area of intense need for which few effective tools have been well developed. Some research and development has taken place in a few of the state operated large scale assessment programs, such as those in Michigan, Illinois and California, but relatively few tools are available for use at the district or school level. Direct writing assessment is the formal measurement technique most often associated with whole language on a large scale, although it is not used exclusively for whole language assessment.

The Need for Information on Authentic Assessment Practices

Standardized test bashing has become so fashionable that it is becoming difficult to find anyone in education who is willing to defend standardized tests, regardless of the testing purpose. Such testing still has a valuable niche in providing a cost effective means of assessing the knowledge or skill level of large groups of students where the learning outcomes are well represented by the items on the test. It also has value in tracking individual growth, provided that the test and curriculum are well aligned and that there are enough items per significant learning outcome to ensure reasonable reliability. Perhaps it is time to slow down and take stock of where the authentic assessment movement is leading and assess the extent to which public education will be likely to benefit from this rapidly growing trend. A starting point for ascertaining where we are headed might be to first establish just where we are presently. This is why, in the early Spring of 1990, a broadly based survey was conducted of the state of the art in authentic assessment.

Purpose of the Study

The survey had two major objectives. The first was to gather information that would enable us to know the extent to which more authentic assessment techniques, including whole language assessments, were being implemented and by what organizations. Our second objective was to elicit materials from organizations engaged in authentic assessment efforts that could be catalogued into a compendium of authentic assessment tools and techniques. We felt that such a compendium could become a resource base for educators interested in developing or studying authentic assessment. A simple questionnaire, requiring minimal effort and respondent time, was designed to procure this information.

The Need for a Conceptual Framework

As researchers we felt a need for a conceptual framework (Miles and Huberman, 1985) to guide us in classifying and interpreting the responses we would obtain from the survey and to provide a schema for classifying the resource documents we would collect. Therefore we explored the authentic assessment literature in search of ideas for such a framework.

Basic precepts underlying a conceptual framework: Any attempt at classifying authentic assessment efforts must begin with the question of purpose: what is the purpose of the assessment in question? Several purposes for assessment have been identified and stand as conventions. One taxonomy of purpose can be adapted from Stiggins and Bridgeford (1985), based on research on classroom assessments. These researchers defined classroom testing in terms of five purposes based on the types of decisions that teachers make: diagnosing, grouping, grading, evaluating, and reporting. Distinctions are also made across types of tests, e.g., publishers', teacher made, performance, etc. This has been extended as follows: diagnosing strengths and weaknesses of individual pupils, diagnosing group needs, grouping students for instruction, determining the achievement potential of students, assigning grades to students, evaluating the instructional unit to see if it worked, communicating academic expectations, and controlling and motivating students. Other purpose factors identified by Stiggins (1987) as affecting assessment include identifying students for special services, communicating affective or behavioral expectations, and providing test taking experiences.


Yet another taxonomy of purpose is offered by Airasian (1984), in which two general levels of assessment are offered: 1.) externally mandated assessments, and 2.) teacher used assessments. The former includes high school graduation tests, preschool screening procedures, and tests to allocate resources, such as for a remedial reading program. These are typically mandated by an authority external to the school, such as the state school board or the legislature. To this list one could add statewide assessment programs and tests required by state accountability programs. These assessments serve the purpose of providing policy makers data for monitoring the progress of schools and districts on state mandated goals or curriculum. Airasian's second level, those assessments used by teachers in their classrooms, are the assessments used to guide their instruction, maintain order, and assign grades in their classrooms. These two levels of assessment have distinctively different characteristics. Externally mandated assessments involve tests, regulations, objectives and passing scores set by the external authority. Their purpose is often external and may include a moral or managerial dimension. They rely heavily on a single test score and a single cutoff point for decision making. Classroom assessments are performed on a day to day basis. They vary in form, e.g., standardized, non-standardized, teacher made (oral, observation, paper-pencil, essay or objective), performance checklists, rating forms, work samples, anecdotal records.

As the work of Stiggins and Airasian emphasizes, the purpose of assessment is a key determinant in deciding what type of assessment to employ in a particular situation. Obviously, purposes vary across a broad spectrum and carry implications based on whether they are internally focused or externally imposed.

Another dimension of assessment to be considered in developing a conceptual framework is the type of response required. Much of the criticism directed toward conventional assessment techniques is based on the belief that multiple choice, selected response formats limit the cognitive level of a student's response to that of simple recall, or at best, lower forms of reasoning associated with eliminating the least plausible answers. Selected response formats allegedly do not allow the student to express what he/she knows about the subject in his/her own way. Therefore a useful dimension for classifying authentic assessment efforts should be the extent to which they call for constructed rather than selected responses. This is not to say that all tests that call for selected responses are inherently "non-authentic," although some would argue to the contrary. It is true, however, that authentic assessments are generally perceived to be based on constructed rather than selected responses. Therefore assessments purporting to be more authentic but not requiring constructed responses would have to be subject to special circumstances.

A third property of authentic assessments could be considered to be the level at which they were intended to be interpreted and reported (e.g., individual pupil, classroom, school, district, state, or nation). Portfolios, for example, might be very useful at the individual pupil level, useful at the class level, perhaps somewhat less useful at the school level, and present ever increasing difficulties for interpretation as higher levels of aggregation are desired.
Some notable exceptions have been reported, and experimentation is under way in several places, such as the NAEP, Vermont State Assessment, and California State Assessment programs, to incorporate open-ended response formats, portfolios and other non-conventional techniques in data to be aggregated at state levels and higher. The issues of reliability, validity and cost effectiveness of portfolios are to be considered separately from their level of interpretation and reporting.

No conceptual framework for classifying responses to the call for authentic assessment should be considered unless it takes into account the subject matter or content area being assessed. It is obvious that some subjects lend themselves more readily to more authentic approaches than others do. Direct writing assessment, for example, has become the standard for assessing student writing skills.

These considerations led us to develop and discard, through an iterative process of trial and error, several conceptual frameworks before arriving at one that had the practical utility we sought. The structure we arrived at is illustrated in the chart shown in Table I. This framework contains five dimensions and could be represented graphically as a pentagon. Each dimension is somewhat independent of the others, although they frequently interact. In classifying the sample tests and descriptive documents, only the first two levels of each dimension were used. This was done because of the


lack of uniformity in the amount of detailed information in the descriptions of the assessment tools provided by the various submitters.

The Need to Examine Stages of Development in Authentic Assessment

Hall and Loucks (1977) created a model for tracking an innovation through several stages as it is implemented. This model, the Concerns Based Adoption Model (CBAM), allows the researcher to document the progress of the innovation in terms of the concerns and behaviors of the people who are attempting to implement it. While the CBAM offers an excellent means of tracking the progress of an innovation through its levels of implementation and use, it requires intensive on-site interviews with those persons most affected by the innovation. For our purposes, we needed a way of characterizing the level of implementation without conducting the on-site research required to document the process. For this reason we conceived a hypothetical process of implementation based on a hypothesized cycle of development that flows through five preoperational stages before the assessment tool or technique reaches a fully operational stage. The process is based on a set of assumptions that result in a linear flow model of developmental stages, somewhat analogous to the CBAM. The assumptions are as follows:

1. Prior to the adoption or implementation of authentic assessment, one of two conditions must exist: either a policy decision must be made to begin to explore the potential of authentic assessment, or some support base for this approach must exist within the organization.

2. Once the policy or support base is in existence, research, either formal or informal, may occur in an overt fashion.

3. The research will likely lead to a decision to either begin implementation or seek a new approach. If implementation is chosen, then training will need to be provided on how to implement the authentic assessment technique. The training could either be preceded or followed by a planning process for implementation.

4. Development of the assessment technique may occur concurrent with the planning/teacher training stage or follow from it. Obviously this must precede implementation. Development in this case could include the purchase or acquisition of a commercial product.

5. Implementation or pilot testing may begin once the development has occurred.

6. The authentic assessment technique may become operational only after the previous five stages have

occurred in some order or fashion.

This set of assumptions provides a heuristic model for characterizing the stage of development an organization has attained with respect to authentic assessment. We argue that the authentic assessment technique cannot become a routine or culturally assimilated practice until these stages have been addressed either directly or vicariously within the organization.
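By way of illustration only, the hypothesized stages can be thought of as an ordered sequence, with an organization characterized by the highest stage it reports (as is done later in Table III). The following minimal sketch, in Python, is ours and was not part of the original survey procedures; the stage names follow the assumptions above, but the class and function names are hypothetical.

    from enum import IntEnum

    class Stage(IntEnum):
        """Hypothesized developmental stages, in assumed order of progression."""
        POLICY_SUPPORT_BASE = 1
        RESEARCH = 2
        PLANNING_TRAINING = 3
        DEVELOPMENT = 4
        IMPLEMENTATION_PILOT = 5
        OPERATIONAL = 6

    def highest_stage_reported(stages_mentioned):
        """Return the highest stage an organization mentions, or None if unknown."""
        return max(stages_mentioned, default=None)

    # Example: a district reporting both planning/training and a pilot test
    # would be characterized at the implementation/pilot stage.
    print(highest_stage_reported([Stage.PLANNING_TRAINING, Stage.IMPLEMENTATION_PILOT]).name)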

METHOD

A simple questionnaire, consisting of two questions, was mailed out to individuals representing 433 educational organizations and businesses in the United States and Canada, and to the education ministries in several foreign countries. Recipients were identified from the mailing lists of the National Association of Test Directors (NATD), Directors of Evaluation (DRE), foreign ministries of education (FME) and the researchers' personal contacts in the measurement and evaluation professional community. The questionnaire, a copy of which is included in Appendix A of this report, asked the following questions:

1. Are you or your organization doing anything to respond to the call for "more authentic" educational measurement?

2. Are you or your organization doing anything to respond to the call for "more appropriate" assessment of the whole language approach to language arts instruction?


TABLE I. CONCEPTUAL FRAMEWORK FOR CLASSIFYING "AUTHENTIC ASSESSMENT" EFFORTS

Dimension I. Purpose of Assessment
A. Pupil needs analysis
   1.) Content needs - what to teach. (Where is the pupil in the curriculum-skill hierarchy?)
   2.) Process needs - how to teach. (Where is the pupil in terms of receptivity to the instructional processes being used, e.g., aural-visual, amount and frequency of practice, etc.?)
   3.) Cognitive functions, e.g., memory, reasoning, inference, transfer, etc.
   4.) Psychomotor needs and level of functionality
B. Selection of pupils for special programs (gifted, remedial, etc.)
C. Grouping for instruction
D. Certification of skills, knowledge, proficiency levels, mastery, etc.
E. Curriculum and instructional program evaluation
F. Policy guidance (district, state, national, etc.)

Dimension II. Scale of Assessment -- Level of Aggregation and Reporting
A. Individual pupil
B. Classroom or instructional unit
C. Grade level within school
D. School
E. District
F. State
G. National
H. International

Dimension III. Response Type
A. Constructed
   1.) Controlled
   2.) Open ended
   3.) Other
B. Selected
   1.) Multi-choice
   2.) True-false
   3.) Completion
   4.) Matching
   5.) Other

Dimension IV. Content Area
A. Language arts
B. Mathematics
C. Reading
D. Science
E. Social science
F. Other

Dimension V. Academic Level
A. Early childhood (pre-K - 2)
B. Primary (grades 1-3)


C. Intermediate (grades 4-6)
D. Junior high/middle (grades 6-9)
E. High school (grades 9-12)
F. Post high school
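To illustrate how a submitted assessment might be catalogued along the five dimensions of Table I, the following minimal sketch (in Python) uses a simple record structure. The field names and example values are our own shorthand, not the actual coding scheme used in the study.

    from dataclasses import dataclass

    @dataclass
    class AssessmentRecord:
        organization: str
        purpose: str         # Dimension I, e.g. "curriculum and instructional program evaluation"
        scale: str           # Dimension II, e.g. "classroom", "district", "state"
        response_type: str   # Dimension III, e.g. "constructed: open ended"
        content_area: str    # Dimension IV, e.g. "language arts"
        academic_level: str  # Dimension V, e.g. "intermediate (grades 4-6)"

    # Hypothetical catalogue entry for a district writing assessment.
    example = AssessmentRecord(
        organization="Hypothetical School District",
        purpose="curriculum and instructional program evaluation",
        scale="district",
        response_type="constructed: open ended",
        content_area="language arts",
        academic_level="intermediate (grades 4-6)",
    )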

Respondents were asked to describe the nature of their response if they answered "yes" to either question, and to provide the name, address and phone number of the appropriate contact person in their organization for possible follow up. The initial response level was somewhat disappointing at approximately 20 percent; therefore a limited follow up was planned. At that point we decided that we would focus the follow-up on local school districts that had not responded to the first mailing. In September, 1990, 125 questionnaires were mailed to a randomly selected sub-sample of the school districts that had not previously responded.

Response data were entered into a PC database so that they could be sorted by respondent group. Assessment materials received from respondents were reviewed and abstracts were developed, summarizing their key characteristics, including information on the submitting organization and a contact person within the organization. The abstracts and materials were then catalogued by the descriptors developed from the conceptual framework to create a library of assessment materials.

RESULTS

Scope of Study

The study targeted 433 educational organizations. The number of institutions targeted and the response rate by category are shown in Table II. Of the total target group, 187 institutions were public school districts in either the U.S. or Canada, representing 43.2 percent of the target group. Of the remaining groups, 13 percent were colleges or universities. Educational research and development agencies (public and private) comprised 7.2 percent of the total respondent group, while state and Canadian provincial education agencies comprised 6.7 percent. Professional associations and certification groups also comprised 6.7 percent of the respondent pool, and publishers made up 6.0 percent. A variety of "others" made up the rest. These included intermediate agencies and private schools, councils, commissions and educational forums, smaller private consulting firms, software firms and some other unique designations.

Response Rate

A total of 110 organizations responded positively to one or the other of the two questions asked, for a 25.4 percent rate of return after follow-up. Of those responding, 106, or 24 percent of the target population, responded positively to question one: "Are you or your organization doing anything to respond to the call for 'more authentic' educational measurement?" Forty one of the 106, or 39 percent, submitted materials for review. Seventy five organizations responded positively to question two: "Are you or anyone in your organization doing anything to respond to the call for 'more appropriate' assessment of the whole language approach to language arts instruction?" This represented 17.3 percent of the target group and 68 percent of the total number of respondents to either question. Sixty nine organizations responded positively to both questions, representing 16 percent of the total target group. These data are displayed in Table II.

Table II. Summary of responses to Authentic Assessment survey by organizational grouping.
Responses to Question One - "Authentic Assessment"

Organization/institution receiving questionnaire     Target     Percent    Number    Percent of Target    Percent of
                                                      Group N    of Total   "YES"     Group "YES"          Total "YES"
Public School Districts (U.S. & Canada)                 187       43.19       67         35.83               63.21
Colleges and Universities                                56       12.93        9         16.07                8.49
Educational R & D Organizations (public & private)       31        7.16        2          6.45                1.89
State Education Depts. & Canadian Provinces              29        6.70       15         51.72               14.15
Professional Associations - Certification Groups         29        6.70        3         10.34                2.83
Test Publishers                                          26        6.00        5         19.23                4.72
Councils, Commissions & Forums                            9        2.08        1         11.11                 .94
Intermediate Education Agencies & Private Schools         7        1.62        2         28.57                1.89
Other                                                    52       13.63        2          1.69                1.89
Total                                                   433      100.00      106          N.A.              100.01*

* Total exceeds 100.00 due to rounding.

Question One - "Authentic Assessment"

Composition of Respondent Group: As shown in Table II, 67, or 63 percent, of the total number of positive responses to question one were from public school districts. This figure represented 36 percent of the school districts targeted. Seven of 12 Canadian local districts responded positively, for a rate of 58 percent. State departments of education and Canadian provincial agencies reported higher levels of activity (52 percent) than did any other group identified in the study. Local districts, both American and Canadian, were second (36 percent). Of the five Canadian provincial agencies reporting, four were positive, for an 80 percent rate. Three of 29 professional associations responded positively, for a rate of 10 percent. Five of 25 publishers who responded were positive, for a 20 percent rate. Only 9 of the 56 respondents representing colleges and universities were positive, for a rate of 16 percent.
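The rates quoted above follow directly from the counts reported in Table II. As a purely illustrative check (the group labels and variable names below are ours), the two percentage columns can be recomputed as follows:

    # Illustrative arithmetic only, using counts reported in Table II.
    target = {"Public School Districts": 187, "State Depts. & Canadian Provinces": 29}
    yes = {"Public School Districts": 67, "State Depts. & Canadian Provinces": 15}
    total_yes = 106  # total positive responses to question one

    for group, n in target.items():
        pct_of_group = 100 * yes[group] / n          # percent of target group saying yes
        pct_of_all_yes = 100 * yes[group] / total_yes  # percent of all yes responses
        print(f"{group}: {pct_of_group:.1f}% of group; {pct_of_all_yes:.1f}% of all yes responses")

    # Public School Districts: 35.8% of group; 63.2% of all yes responses
    # State Depts. & Canadian Provinces: 51.7% of group; 14.2% of all yes responses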

Stage of Development: Positive respondents were at many different stages in the development of their authentic assessments, ranging from collecting information to refining operational systems. These data are displayed in Table III. Responses were approximately equally distributed among research (20 percent), development (22 percent), and operational systems (22 percent). Other stages included implementation/pilot (10 percent), developing a policy/support base (2 percent) and not specified (7 percent).

Table III. Stage of Development for Respondents Answering Yes on Question One - Authentic Assessment (highest level reported)

Stage of Development            Number    Percent
Develop Policy/Support Base        2        .02
Research                          21        .20
Planning/Training                 19        .18
Development                       23        .22
Implementation/Pilot              11        .10
Operational                       23        .22
Not Determined                     7        .07
Total                            106        100

Types of Assessments: Surprisingly, 32 percent of the total number of responses, including multiple statements from the same organization, did not specify the type of assessment they were working with. Direct writing was the most frequently reported type of assessment, at 24 percent of the total number of responses recorded. Classroom paper and pencil tests shared the least frequently reported assessment type with norm referenced, at 1 percent each. It is interesting to note that norm-referenced assessment was reported as an attempt at more authentic assessment by some respondents. These data are shown in Table IV.

Grade Levels: As indicated in Table V, fifty five of the 106 positive respondents on question one, or 52 percent, did not specify the grade levels at which their authentic assessment efforts were targeted. Twenty seven respondents, or 25 percent, identified both elementary and secondary grades as their focus. Elementary grades were the focus of 10 percent of the positive responders, while secondary grades were focused on by 8 percent. Only 2 percent were planning to conduct assessments at pre-K or kindergarten levels.

Table IV. Types of Authentic Assessment Reported

Assessment Type                Number    Percent
Standardized (non-specific)        4       .03
Criterion Referenced               5       .04
Norm Referenced                    1       .01
Performance Based                 20       .15
Direct Writing                    33       .24
Portfolio                         24       .18
Other                              4       .03
Not Specified                     44       .33
Paper & Pencil (non-specific)      1       .01
Total                            135       101

Content Area: Not surprisingly, the most frequently reported area of "authentic assessment" was language arts, including direct writing assessment. Forty nine of the 171 total responses to question one cited language as the content area in which they were working. The second most frequently cited content area was mathematics, with 30 responses. These data are shown in Table VI.

Question Two - "Whole Language"

Forty seven of the 187 public school districts targeted responded positively to question two, dealing with whole language assessment. As shown in Table VII, this represents 25 percent of the districts targeted and 62.67 percent of the 75 positive responses received. This was by far the highest percentage of positive responses for any group on question two. State departments and Canadian provincial ministries of education were next, with a 12 percent response. Colleges and universities followed with an 8 percent positive response.

PORTFOLIO ASSESSMENT: ISSUES RELATED TO AGGREGATION*

Allan Olson
Northwest Evaluation Association

Purpose of This White Paper

The NWEA White Paper on Aggregating Portfolio Data is intended to serve as a working document that summarizes key issues or concerns related to aggregating assessment information from portfolios. The paper represents the collective knowledge, readings, experiences, and intuition of many individuals directly involved in portfolios for instructional and/or assessment purposes, as well as individuals who are critically considering portfolios for similar purposes. This White Paper attempts to unify the large scale assessment issues which are only loosely connected or not even directly addressed in the current portfolio literature. The paper is intended to promote further discussion and exploration of the questions related to aggregating portfolio data, which in turn will result in the modification of this document.

The initial target audience for this White Paper was the group of individuals who would be participating in the August 1990 NWEA Working Retreat on Aggregating Portfolio Data. Those participants included assessment/evaluation specialists, curriculum and instruction specialists, and classroom teachers. Participants represented grades K-12 and a number of subject areas plus special programs. The August 1990 NWEA Working Retreat on Aggregating Portfolio Data was a more focused follow-up session that was recommended by one of the subgroups at the more general December 1989 NWEA Working Retreat on Portfolio Assessment.

Sections II, V, and VI of this White Paper served as the basis for some of the scheduled group discussions during the August 1990 Retreat. The goal of the retreat was to exit with a revised version of the paper that would even better reflect the collective knowledge, expertise, and questions about using portfolios for large scale assessment.

Page 26: We are pleased to Present this - The National …nationalassociationoftestdirectors.org/wp-content/... · Web viewThe implication is that the desire to aggregate must be acknowledged

The secondary target audience for the White Paper is participants attending the upcoming NWEA Third Annual Writing Assessment and Portfolio Institute. Although the institute will involve more participants than the working retreat, the basic composition of the group will be very similar.

Based upon input at the August Working Retreat, the original portfolio working definition was revised. For purposes of this White Paper, the following detailed working definition of portfolios will be used:

Definition of Portfolios

The portfolio is an appropriate tool to measure and reflect student performance. The portfolio process is an opportunity for students to assemble a purposeful collection of their work, in preparation or completed form, which illustrates their efforts, progress, or achievements.

A critical component of the portfolio process, and the portfolio itself, should be student participation in the selection of the portfolio's content. The student should be involved in this process of selecting the pieces to be included. Criteria for selecting student work for the portfolio, as well as criteria for judging the merit of the student work, must exist. Furthermore, evidence of student self-reflections (i.e., metacognitions) about the included content should be present in the portfolio; otherwise it is a folder, scrapbook or showcase, but not a portfolio.

Multiple purposes, uses, or levels of data aggregation must not be in conflict. Otherwise the portfolio (process) only complicates the performance description. Portfolios may have different purposes for different performance areas. It is important, therefore, that clear communication and agreement about the purpose(s) of the portfolio, its intended uses, and the levels (e.g., classroom, school, district, etc.) of data collection be established among all users. The portfolio is to be separate from the student's cumulative folder but may include test results and educator feedback.

Finally, the components of the portfolio (process) will depend on the grade level, course, skill area, scope, and/or purpose of its use. For example, the portfolio may be "representative" or "best" of a student's work, depending upon the purpose.
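The elements the working definition requires can be made concrete with a small record structure. The following sketch, in Python, is illustrative only; the field names are hypothetical and are not drawn from the White Paper itself.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class PortfolioEntry:
        work_sample: str           # a piece of work, in preparation or completed form
        selected_by_student: bool  # student participation in selection is required
        student_reflection: str    # self-reflection (metacognition) on the piece

    @dataclass
    class Portfolio:
        student: str
        purposes: List[str]            # agreed purposes/uses, e.g. ["show growth", "district reporting"]
        aggregation_levels: List[str]  # agreed levels of data collection, e.g. ["classroom", "district"]
        selection_criteria: str        # criteria for selecting student work
        judging_criteria: str          # criteria for judging merit of the work
        entries: List[PortfolioEntry] = field(default_factory=list)

        def is_portfolio(self) -> bool:
            """Under the working definition, a collection lacking student-selected pieces
            and self-reflections is a folder or scrapbook, not a portfolio."""
            return any(e.selected_by_student and e.student_reflection for e in self.entries)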

Revisions of the initial working definition of portfolios were based upon an abbreviated portfolio definition drafted at the August Working Retreat. The proposed abbreviated definition appears in Appendix A. The purpose would determine which definition would be most useful in a given situation.

Initial Development of this White Paper

On April 5, 1990, NWEA convened a small working group to brainstorm issues and concerns related to the use of portfolios for large scale assessment. The ten participants represented: (1) small, medium, and large districts in Oregon and Washington and the State Departments of Education from both states, (2) curriculum and assessment specialists, (3) portfolio advocates, and (4) individuals who were undecided about the value of portfolios for instructional and/or assessment purposes.

The long list of questions and thoughts generated about aggregating portfolio data from the April 5 session was analyzed to identify issue clusters. In addition, a computer search of the literature on portfolio assessment was completed for a second time. Issue clusters from these sources determined the major sections of this White Paper. Three participants in the brainstorming session assumed responsibility for drafting the initial White Paper.

Future Modifications/Expansions of this White Paper


The first modification to the NWEA White Paper on Aggregating Portfolio Data occurred as a result of reaction to the paper and additional input from the August 1990 NWEA Working Retreat on Aggregating Portfolio Data. The next anticipated revision will occur in late 1990 or early 1991, following two other scheduled NWEA activities: (1) the October 1990 NWEA Third Annual Writing and Portfolio Assessment Institute and (2) the Winter 1990 NWEA Second Annual Working Retreat on Portfolio Assessment. Thereafter this White Paper will continue to evolve as more is learned about large scale assessment using portfolios. The document must remain current enough on issues related to aggregating portfolio data to provide adequate support to potential users.

Beliefs Related to This White Paper

To better interpret the major sections of this paper, which focus on portfolio aggregation issues, it is necessary to understand the values and assumptions upon which it is based. These beliefs influenced not only which questions were included but also the related responses.

Beliefs

This White Paper is based upon the following set of beliefs:

- Instruction and assessment are best when integrated and should be driven by the same desired student outcomes.
- Portfolios have a unique definition and can exist with information from any curriculum area or be multi-disciplinary.
- Portfolio assessment is a performance-based evaluation as well as a unique process to obtain student information and self-evaluation data.
- The format of portfolio data can be oral/written, behavioral or visual, but critical elements of the definition and student data must be present to make it a portfolio.
- Portfolios can also include a variety of assessment information, like student work or drafts; student questionnaires; teacher checklists; test results, such as norm or criterion-referenced; and samples of conference notes.
- Student involvement in the selection and evaluation process is essential to the concept of the portfolio process.
- Nationally held evaluation standards apply to portfolio assessment like any other assessment strategy.
- Input from classroom practitioners that use portfolios should be used to help shape designs for portfolio data aggregation. This should help to establish a clear purpose for the assessment of portfolios via aggregation of portfolio data.
- The purpose of aggregating portfolio data should give clear direction for the improvement of future learning outcomes.
- One, some or all of a portfolio's components may be aggregated.
- Aggregation activities should intrude as little as possible on instruction and provide students, teachers, parents, administrators and other decision-makers with appropriate, helpful information in a timely manner.
- Multiple, or triangulated, measures of student performance are preferable to single measures.
- If an assessment/evaluation activity can simultaneously meet more than one common purpose, it is more desirable.
- Aggregation of portfolio data beyond the individual level should support, not hinder, the use of portfolios at the classroom level.
- Portfolio assessment can be part of an overall assessment/evaluation program or complement a currently existing assessment/evaluation program.

Issue Clusters Related to Aggregating Portfolio Data


Analysis of the issues, concerns, questions, and thoughts recorded during the April 5 Brainstorming Session revealed six clusters. Those clusters are:

1) Impact of "Newness" of Portfolios on Aggregating Portfolio Data2) Levels of Aggregation of Portfolio Data 3) Potential Conflicts for Portfolios Serving Both Purposes of Instruction/Indivi dual Assessment and

Large Scale Assessment 4) Potential Benefits of Portfolios Serving Both Purposes of Instruction/Indivi dual Assessment and Large

Scale Assessment 5) Using Appropriate Methodology to Aggregate Portfolio Data 6) Other Issues Related to Aggregating Portfolio Data

A separate section of this White Paper is devoted to each of the six issue clusters. For each cluster, key questions and related responses are provided to help clarify issues within the cluster related to the use of portfolios for large scale assessment. As more becomes known about aggregating portfolio data, the list of questions will expand and responses to existing questions may need to be modified.

ISSUE CLUSTER #1: Impact of "Newness" of Portfolios on Aggregating Portfolio Data

Q#1. Is the use of portfolios (outside the arts) so new that it is premature to explore if or how portfolio data can be aggregated?

No. Portfolios for instructional and individual assessment purposes are presently receiving much attention in the educational literature. While portfolios have existed for a number of years in some instructional areas, like the arts, they are currently being discussed and/or tried in a wider variety of subject areas and special programs from kindergarten through the college level. Having received an increasingly broader base of interest and support in the past few years from classroom teachers and curriculum specialists, portfolios have emerged as one of the new educational trends of the late 1980's and early 1990's. Clearly, portfolios are not so new that little or nothing is known about them.

Portfolios for large scale assessment purposes (beyond the individual student or classroom level), however, have only recently begun to be questioned and explored in depth. Some measurement and evaluation specialists contend that the instructional value and fidelity of portfolios should be conclusively proven before investing time and energy in determining if or how portfolio data can be aggregated. Others sharply disagree with that position, acknowledging the emerging roles of portfolios in instructional/individual assessment programs and wanting to be able to aggregate portfolio data where appropriate in a timely manner.

The same individuals believe that the use of portfolios presents an opportunity to integrate assessment and instruction and to possibly reduce the overall amount of "add-on" assessment activities which vie for precious classroom instructional time. From their viewpoint, delaying the aggregation of portfolio data could further jeopardize realizing those two major potential benefits. This latter position is held by most of the curriculum and evaluation specialists, the classroom teachers, and the administrators who have contributed to this paper, directly or indirectly, despite their different opinions about the degree to which they think portfolio data can be successfully aggregated. Their diversity has added to the richness of this paper.

Q#2. Are any portfolio projects well enough implemented as instructional models that sites exist for trying out potential aggregation methods/systems?

Unsure. Although some portfolio projects are fully implemented within instructional programs, they may or may not be suitable sites for field-testing portfolio aggregation schemas. As outlined later in Section XII, Using Appropriate Methodology to Aggregate Portfolio Data, aggregation requires certain levels of standardization within and across portfolios. Most existing, fully implemented portfolio projects that the authors have reviewed are not standardized enough across major portfolio components to lend themselves to aggregation beyond the individual or classroom level. The implication is that the desire to aggregate must be


acknowledged early in the portfolio project design phase so that aggregation is possible following portfolio implementation.

There are a number of portfolio projects currently being designed and implemented throughout the United States with the intent to aggregate. Those projects will provide the best sites for trying potential portfolio aggregation methods or systems.

Q#3. Do portfolio projects exist where aggregation of portfolio data beyond the individual level has occurred?

Unsure. To date, none of the portfolio projects reviewed in the United States have revealed evidence of aggregating portfolio data above the student or classroom level. This is not to say, however, that portfolio aggregation has not occurred. Projects involved in such efforts have simply not yet been well identified.

Q#4. Do portfolio projects exist where aggregation of portfolio data beyond the individual level is currently in progress for the first time?

Yes. There are a limited number of portfolio projects in which aggregation of portfolio data above the student or classroom level is currently in progress. The projects span K-12 and a variety of subject areas and special programs. Their initial aggregation activities are scheduled for release in the Spring and Summer of 1990. The collective experience of the individuals coordinating these portfolio aggregation efforts will provide much needed insight and feedback about proposed aggregation methods/systems.

IX. ISSUE CLUSTER #2. Levels of Aggregation of Portfolio Data

Q#1. Are the levels at which portfolio data could be aggregated theoretically parallel to the levels at which other more traditional student achievement data are currently aggregated?

Yes. Theoretically, portfolio data could be aggregated at the student, classroom, school, and district levels provided there are agreements about the inclusion of specific portfolio components and standardization about the conditions under which those portfolio components were produced by students. Aggregation at the state and national levels should also be possible provided that similar assumptions are met.

Q#2. Does the desired level and purpose for portfolio data aggregation affect the evaluation design?

Yes. Just as the level and purpose for aggregating traditional assessment data affect an evaluation design, so would be the case for portfolio data. For example, if the purpose of an evaluation was to collect information for district-level instructional program improvement, not individual student assessment, then sampling of students (their portfolios) would be appropriate. The same general guidelines that shape evaluation designs involving traditional assessment data should apply if the information source is portfolio data.

Q#3. Is there a conceptual continuum of alternatives for aggregating portfolio data?

Yes. Accepting that portfolios consist of distinct components, there are numerous alternatives for aggregating portfolio data, provided that each component to be aggregated has been adequately standardized. The continuum ranges from aggregating only one selected portfolio component to separately aggregating all portfolio components, as illustrated below.

Single Portfolio Component --- Multiple (But Partial) Portfolio Components --- All Portfolio Components

It is important to acknowledge the implications of a decision to separately aggregate across all portfolio components. Because each component to be aggregated must be standardized, the decision to aggregate across all components would therefore


standardize the entire portfolio, which would in turn eliminate the option to individualize parts of the portfolio. That restriction may or may not be a concern depending upon the instructional/individual assessment purpose of the portfolio.

In some cases it might be desirable to aggregate across components within a portfolio to derive one overall number or descriptor for a portfolio. This alternative is very purpose dependent and must be sensitively done. It would be appropriate in cases where there was a need to synthesize more information and where the richness lost as a result of such aggregation would not present a problem. The more diverse the components are, the more difficult this type of aggregation would probably be.

Another alternative that has been suggested for aggregating portfolio data has been to compile a single "composite" portfolio to be representative of the portfolios being aggregated. This alternative, for example, could result in a single classroom portfolio or school portfolio or district portfolio drawn from portions of student portfolios. Constructing such an aggregate composite portfolio would require careful consideration to ensure representativeness so that generalizations would be valid.
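The simplest point on the continuum, aggregating a single standardized component, can be illustrated with a short sketch. The scores, names, and structure below are hypothetical (a rubric-scored writing sample produced under common conditions is assumed as the standardized component); this is not a procedure taken from the White Paper.

    from statistics import mean

    # Hypothetical rubric scores (1-6) on one standardized portfolio component,
    # keyed by school, then classroom, then student.
    scores = {
        "School A": {"Room 101": [4, 5, 3], "Room 102": [5, 5, 4]},
        "School B": {"Room 201": [3, 4, 4]},
    }

    # Aggregate the single component upward: classroom, school, then district means.
    classroom_means = {
        (school, room): mean(vals)
        for school, rooms in scores.items()
        for room, vals in rooms.items()
    }
    school_means = {
        school: mean(v for vals in rooms.values() for v in vals)
        for school, rooms in scores.items()
    }
    district_mean = mean(v for rooms in scores.values()
                         for vals in rooms.values() for v in vals)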

ISSUE CLUSTER #3: Potential Conflicts for Portfolios Serving Both Purposes of Instruction/Individual Assessment and Large Scale Assessment

Q#1. Is there a concern of current and intended users of portfolios that large scale assessment needs will jeopardize the instructional value of portfolios?

Yes. Portfolios were created as part of the instructional process, where the dialogue between the evaluator and student is intimate and prized. Teachers and students make instructional decisions as they conference and compile the portfolio. There is much concern among teachers, curriculum specialists, and others that dialogue will be damaged as emphasis is placed on including a standardized set of components for evaluation purposes rather than using the portfolio to showcase a student's unique efforts. Large scale assessment needs could undercut the use of the portfolio for "self-study" as part of formative evaluation.

Q#2. Will the aggregation of portfolio data force standardization of portfolios which directly conflicts with the desire for portfolios to be individualized?

Possibly. It is the challenge and dilemma of portfolio assessment to devise a way to standardize what began as an instrument to demonstrate uniquely personal skills. However, some examples of existing portfolios include reading logs and standardized test results. Components such as these can be aggregated without sacrificing individuality. The difficulty comes when a piece of actual work, such as a piece of writing or art, is included. When specifications are created for the inclusion of these types of materials, there is the possibility that the specifications will force an unnatural standardization of products. Work in some of the Oregon school districts may shed light here. In collecting materials for writing portfolios, a few districts have specified that the portfolio must include pieces written to certain prompts or pieces that demonstrate specific writing tasks.

Q#3. Does the purpose of the portfolio determine what the portfolio components must be?

Yes. The components need to be selected to demonstrate the student's skills relative to the purpose of the portfolio. Many would argue that the value of the portfolio comes from students selecting the contents of the portfolio. The purpose for a portfolio may vary. A student would select different items to show growth, representative work, or best efforts. Clearly, items in a portfolio used only by an instructor and the student do not need to be as amenable to aggregation at the school or district level.

Q#4. Will aggregation requirements threaten student ownership of the portfolio?

Possibly. For purely individual or instructional use of the portfolio, the student may need to have the freedom to add or remove portfolio components. Here further aggregation may be a problem. When student ownership is threatened by standardization of what is or is not included, a key element of the portfolio process is threatened. It would be suspect to aggregate portfolios if the students did not feel ownership in them.


The actual physical ownership of the portfolio, or parts of it, can also become a problem. Traditionally, the school district or federally funded programs have reserved the right to retain students' standardized tests, writing competency exam papers, etc., for records. Such student products have formed the basis of much assessment. Because there has been little attachment by the parents or the students to traditional assessment items, this practice has seldom been challenged. This may change with portfolio assessment, particularly at the elementary level, where parents may desire to save their child's work. Students may also want to keep their work or be reluctant to trust that it will be assessed and returned intact. Another problem is the moving of portfolios across the grades as the child is promoted. Does the portfolio move intact, in part, or start anew? How is this standardized for aggregation?

Q#5. Will large scale assessment ultimately require multiple portfolios, e.g., one for instructional/individual assessment and separate ones for large scale assessment? Will this depend upon the aggregation level?

Possibly. Keeping two portfolios would probably be unmanageable. However, it is possible that a subset of a student's portfolio could be used for a specific purpose, but this would open the issue of sampling bias.

ISSUE CLUSTER #4: Potential Benefits of Portfolios Serving Both Purposes of Instruction/Individual Assessment and Large Scale Assessment

Q#1. Do portfolios support the integration of assessment and instruction?

Possibly. In general, portfolios have the potential to integrate assessment and instruction. However, the extent to which portfolios support the integration of assessment and instruction is determined by (1) the specific portfolio components, (2) the conditions under which the components are generated by or obtained from students, and (3) how portfolio components are used. In brief, portfolios do not of themselves insure a higher level of integration of assessment and instruction. The conscious design and careful implementation of portfolios whose components meet certain criteria can result in higher levels of integration. If portfolio assessment activities parallel or more closely model instructional activities, and there are clear expectations plus demonstrated practice to use portfolio assessment information to make instructional decisions, then integration is achieved.

Q#2. Since what is assessed is valued, will the use of portfolios for assessment communicate that a broader range of student performances is valued?

Probably. If the portfolio components to be assessed include student performances which have traditionally not been assessed, then a broader range of student performances will be valued. For example, if the metacognitive component of portfolios were to be rated or summarized in some appropriate manner, then the metacognitive instructional activities would become more publicly valued. On the other hand, if the metacognitive component exists in the portfolios but is not rated or summarized in some manner, then the use of portfolios would not result in an increased valuing of the metacognitive instructional activities. The selection of which portfolio components are to be assessed and aggregated will determine whether the particular portfolios communicate that a broader range of student performance is valued than is currently reflected by more traditional assessment measures.

Q#3. Can the use of portfolios for multiple assessment purposes eliminate redundant or "add on" assessment/evaluation activities?

Yes. Careful design of portfolio assessment components should enable these components to serve multiple assessment functions and, therefore, reduce the amount of redundant or "add on" assessment/evaluation activities. A portfolio component that can be used to supplement other individual assessment information, as well as be aggregated at the school and district level, can potentially eliminate the need to design and implement separate school and district level assessment activities to obtain the same information. The needs


for standardization of the portfolio components to enable aggregation must, however, not hinder use of the component for instructional/individual assessment purposes.

Q#4. Does the use of portfolios for assessment purposes increase the overall percentage of performance or authentic measures in a comprehensive evaluation program?

Probably. Portfolios vary greatly based upon their unique components. The representation of performance and authentic measures in current comprehensive evaluation programs also varies tremendously. Since most portfolios will include performance and/or authentic measures, the use of portfolios would generally represent an increase in the overall percentage of those measures in a comprehensive evaluation program where performance and/or authentic measures have either not been used or have been used only to a limited degree.

But the notion of "increase" is dependent upon not only what currently exists but also the extent to which the portfolio components meet the criteria for being performance and/or authentic measures. Implementation of portfolios, of itself, does not insure an increase in those measures. An analysis of current assessment activities and proposed portfolio components is necessary to determine whether portfolios represent an increased, decreased, or similar level of performance and/or authentic measures. Because most current evaluation programs include relatively few performance and/or authentic measures, the use of portfolios would generally represent an increase in those measures.

Q#5. Will the use of portfolios for large scale assessment purposes force the assessment of previously ignored student performances?

Possibly. Just because portfolios are aggregated for large scale assessment does not automatically force the assessment of student performances previously ignored. First, the assumption is that student performances are selected as portfolio components which have not been previously assessed. Second, the assumption is that portfolio components selected for aggregation represent student performances beyond that which has been traditionally aggregated. The decisions about (1) what the portfolio components will be and then (2) which components will be aggregated will determine whether portfolios will extend the range of student performances being assessed. Simply implementing portfolios will not automatically accomplish that end.

ISSUE CLUSTER #5: Using Appropriate Methodology to Aggregate Portfolio Data

Q#1. Do/should standards exist for aggregating portfolio data?

Yes. National standards have been established for conducting evaluations, although state, regional, or national standards developed specifically for portfolios are not currently available. The Standards for Evaluations of Educational Programs, Projects, and Materials (Joint Committee on Standards for Educational Evaluation, 1981) establish guidelines for developing and implementing evaluations for program improvement. The thirty standards cover four major areas of interest to evaluation workers:

(1) Utility of the evaluation to identify, select, interpret, and report the issues being studied in a timely way.

(2) Feasibility of completing an evaluation given the practical problems, costs (money or human resources), and political factors that must be addressed.

(3) Propriety of conducting an evaluation which is sensitive to the welfare of the persons under study as well as to the open and honest disclosure of the findings and limitations resulting from the evaluation.

(4) Accuracy of the methodology in revealing information that is relevant to the context being studied; both valid and reliable; systematically collected; and defensible in its effort to draw objective findings, be they qualitative or quantitative.

A summary of these four areas and their thirty standards for educational evaluation is shown in Appendix B.

The standards which seem to capture the critical issues of aggregating portfolio data are the following:


1) Aggregated portfolio assessment appears to serve the practical needs of the teacher and the student by drawing data and conclusions from actual samples of student performance. However, the procedures used to judge the data are often too complex, too expert-dependent, or too broad in scope to ensure timely and clear conclusions.

2) The cost in terms of human resource time to judge or aggregate portfolio data often overshadows the practical importance of the data.

3) Most noteworthy, however, are the issues of technically sound methods for revealing and conveying the richness of information found in portfolios. While aggregating components of a portfolio, e.g., a writing sample, may be accomplished with valid and reliable procedures, such procedures, if applied to the portfolio as a whole, can be too complex and thereby lose the meaning that each portfolio was intended to have.

4) The answer to this question of standards is that generic standards do exist for evaluation and can be applied to aggregating components of portfolios or the portfolio as a whole. If the latter is the central assessment task, then it carries with it the double-edged nature of having rich information but often too much complexity for quantification.

Q#2. Can portfolios be standardized to allow appropriate aggregation of data?

Yes. Portfolios can be developed with systematic procedures. Components of the portfolio can be specifically sampled under common procedures so as to guarantee justifiable conclusions from the analyses of the data. The problem is not necessarily in standardizing what goes into the portfolio or how each sample is judged, but rather in determining which components should be included and how individual components should be weighted.

One answer to standardizing the aggregation of portfolio data is to view it like grading. Course grades are singular representations of known behaviors, efforts, and performances examined, usually in a systematic manner, over a period of time. Students are apprised of the grading rubrics, and the teacher uses the rubrics to judge how effectively the student has mastered aspects of the class. As a result, grades can reflect goal attainment; change in skills; or the demonstration of student processes, like motivation, interest, or effort.

Aggregating portfolios can be like assigning grades. Standardizing the evaluation process across staff, albeit complex, can be set with minimal criteria. For example, a portfolio process might be standardized by answering the following:

1) What is to be included in the portfolio?
2) What time frame is to be used to collect components of the portfolio?
3) Are the portfolio components best products, random samples, early drafts, revisions, or results of other standardized assessments?
4) Who evaluates the portfolio components?
5) Is there a coding process to show what is in the portfolio?
6) Do portfolio components have different weights or levels of importance?

Once set, established portfolio standards allow further aggregation to occur for some or all of the components. To illustrate, if grades are to be evaluated to answer a question about the distribution of grades for boys and girls in a course area, then the assignments, method of evaluation, reliability of the teachers' ratings of performance, time frame, and other factors must be consistent for all students. If these variables are not standard, then the answer about the boy-girl distribution of grades may reflect non-gender causes for the grade distribution.

For portfolios the same is true. Judging an educational question by using all or most components of a portfolio forces all portfolios to be built in a similar if not identical manner. Once built, the portfolio data must be aggregated in the same way, using the same components and, as suggested, collected in the same manner.
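By way of illustration only, the following minimal sketch (in Python) shows how rated portfolio components might be combined into a single weighted score per student and then summarized across students once such standards are set. The component names, weights, and ratings are hypothetical, not drawn from any existing project.

    # Hypothetical sketch: aggregating rated portfolio components with fixed weights.
    # Component names, weights, and ratings are invented for illustration only.
    COMPONENT_WEIGHTS = {          # agreed upon before any portfolio is scored
        "best_writing_sample": 0.4,
        "problem_solving_task": 0.4,
        "self_reflection": 0.2,
    }

    def weighted_portfolio_score(ratings):
        """Combine one student's component ratings (e.g., on a 1-4 rubric) into a
        single weighted score, using only the components that were actually rated."""
        total_weight = sum(COMPONENT_WEIGHTS[c] for c in ratings)
        return sum(COMPONENT_WEIGHTS[c] * r for c, r in ratings.items()) / total_weight

    # Aggregation across students, e.g., for a school- or district-level report.
    students = {
        "student_01": {"best_writing_sample": 3, "problem_solving_task": 4, "self_reflection": 2},
        "student_02": {"best_writing_sample": 2, "problem_solving_task": 3, "self_reflection": 3},
    }
    scores = {name: weighted_portfolio_score(r) for name, r in students.items()}
    group_mean = sum(scores.values()) / len(scores)
    print(scores, round(group_mean, 2))

The point of the sketch is simply that the weights and the rating rubric must be fixed before scoring begins; otherwise the resulting group summary cannot be interpreted.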

Q#3. Can aggregation of portfolio data occur if portfolio contents, assignments, ratings, etc., have not been standardized?


Probably not. It depends on the purpose of the aggregation and on the evaluation question being asked.

To answer a specific portfolio question like "Have students' writing performance skills for self-selected best samples improved since last year?", the portfolio components must have been collected and assessed in the same way, at least at two different times. However, a question like "Are there differences among portfolios at grade 3?" may only require aggregation of data that are descriptive. Listing, tallying, or illustrating the status of portfolios may or may not require standardized procedures. Standardization will be required if external objectives or norms, like competency requirements for graduation, are applied to the portfolio. Standardization may not be required when only a snapshot of what is commonly included in a portfolio is sought.

In short, the evaluation question determines the need for standardized procedures.

Q#4. Can aggregation of portfolio data occur if the ratings of portfolio contents are made by judges with varying levels of expertness?

Yes. Judges must establish a reliable set of data-gathering instruments and/or procedures. More important, however, the rating process itself should be evaluated for level of implementation and reliability. Not until the latter are clearly shown to be acceptable is the information from portfolios via experts' ratings meaningful, regardless of expertness.

Degree of expertness should affect only how finely the evaluation criteria can be differentiated: more experienced judges can use more finely developed criteria. Therefore, when the pool of judges is heterogeneous in expertise, keep the criteria simple and less specialized in separating student skills.
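As one minimal illustration of evaluating the rating process itself, the sketch below (hypothetical scores, assuming two judges have rated the same set of components on a common 1-4 rubric) computes two simple agreement figures; a real study would supplement these with more formal reliability indices.

    # Hypothetical sketch: checking agreement between two judges who rated
    # the same portfolio components on a 1-4 rubric.  All scores are invented.
    judge_a = [3, 2, 4, 3, 1, 4, 2, 3]
    judge_b = [3, 3, 4, 2, 1, 4, 2, 3]

    exact = sum(a == b for a, b in zip(judge_a, judge_b)) / len(judge_a)
    within_one = sum(abs(a - b) <= 1 for a, b in zip(judge_a, judge_b)) / len(judge_a)

    print(f"exact agreement:  {exact:.0%}")       # 75% for these invented scores
    print(f"within one point: {within_one:.0%}")  # 100% for these invented scores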

Q#5. Can aggregated portfolio data be valid and reliable?

Yes. This is mostly true for components of a portfolio and may be true for assessing the portfolio as a whole.

There is always the pitfall of accepting the portfolio process as valid (having face validity) because of its appearance as a more authentic reflection of what a student can do. Validity in terms of evaluation is, however, much more. Validity of measurement means that sound conclusions can be drawn from the products and processes used to produce data. Validity is not just an inherent characteristic of the evaluation process or the instruments used to evaluate. It rests on the questions being asked, the context of the data, the characteristics of the data gathered, and the interpretation one can draw from the evidence provided alone, not from observer bias.

Portfolios provide a rich vehicle for establishing the validity of descriptions of student skills because they contain many samples of student products. Multiple measures are always better than one. Thus portfolio data, if aggregated across many samples, provide greater validity for concluding something about student achievement or process than generalizing from one sample, e.g., a test score or a late assignment, or from one's stereotyped perspective. An authentic student product or performance is a much better illustration of the valued skill than behavior shown through tangential educational estimates of skills, e.g., paper and pencil tests. Evidence from students actually reading aloud repeatedly over varied contexts is more likely to be valid than evidence of reading from a paper and pencil test. In the same way, reliability is improved by the repeated samples found in a portfolio.

The question of valid and reliable data for aggregated portfolio information is, however, more difficult than the issue of sampling a single student's skill. For example, measurement specialists know that one should never simply average a set of average scores, because the second level of averaging does not use all the information that went into the first set of averages, such as the number of scores in the denominator of each average. Thus a higher-level average calculated this way carries too much error into the conclusions it suggests.
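A small worked example of this pitfall, with invented class sizes and mean scores, is sketched below; averaging the two class means ignores the very different numbers of students behind them.

    # Hypothetical sketch of why averaging averages misleads.
    # Class A: 30 students with a mean score of 70.  Class B: 5 students with a mean of 90.
    n_a, mean_a = 30, 70.0
    n_b, mean_b = 5, 90.0

    mean_of_means = (mean_a + mean_b) / 2                        # 80.0 -- ignores class sizes
    weighted_mean = (n_a * mean_a + n_b * mean_b) / (n_a + n_b)  # about 72.9 -- the actual group mean
    print(mean_of_means, round(weighted_mean, 1))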


The same is possible when aggregating portfolio data across portfolios. When all the factors used to generate each portfolio are not known, aggregating the separate portfolios and drawing conclusions from them might include too much error. To eliminate this problem, validity and reliability issues at the aggregated level must be solved by establishing standard procedures at the portfolio level. Valid parts produce valid wholes; reliable assessment of the smallest part ensures reliable interpretation of results at the multi-part level.

Q#6. Can the whole of the portfolio be greater than the sum of its parts?

It seems so. The synergy of the pieces may overshadow the individual pieces of the portfolio. That is why the pieces must be as valid a reflection of the student and of the question being asked as possible. More conclusions might be drawn from the whole portfolio than the pieces would suggest if they were looked at in isolation. Thus one may hypothesize that aggregating portfolio data may tend to exacerbate either the best or the worst of the sum of the pieces in the portfolio.

Q#7. Does adequate methodology currently exist to aggregate portfolio data?

Possibly. Established as portfolios are in the arts and business communities, few models and practices have been successfully built in those realms to aggregate such data. In education the portfolio assessment process has the added challenge of more students, more varied student backgrounds and interests, and a broader context in which a portfolio can be designed. There are also more, and more varied, skill levels among the judges (teachers) who rate the portfolio contents.

Another reason for inadequate portfolio assessment methodology in education is that the complexity of aggregating portfolio data probably requires recording and analyzing tools previously unavailable to teachers and not part of their pre-service education. Computers, integrated data bases, assessment literacy, and graphically profiled representations of student data are just evolving in education and may hold future answers to this dilemma.

A further reason is that multivariate statistical techniques are often needed to adequately combine the varied data of a portfolio. Such techniques have not been readily used by educational personnel. Higher level statistics, such as multivariate analysis of variance and discriminant function analysis, are difficult to explain and use. Drawing conclusions seems much easier, and has more face validity, when one simply shows a student a model of what is expected and showcases what the student produced in comparison.

In spite of these limitations, several areas of science need to be explored as possible approaches to analyzing portfolio data, parts or the whole. These include topology, the part of mathematics that investigates the properties of the whole that are unaltered when the parts are examined on a one-to-one basis; economics and its use of indicators to represent the complex, changing, and correlational relationships of dynamic economic activity, e.g., the Composite Index of Leading Economic Indicators; typology and cluster analysis as used in describing social environments and personality traits; and profiling, a tool in quality schools and psychometric testing, which provides measures of similarity to known or derived standards.

The most meaningful use of these and possibly other approaches is that they can yield three important comparisons for portfolio users:

(1) The comparison of a student's portfolio with the portfolios of the large, varied group of students who produced portfolios under similar conditions.

(2) The comparison of the student's portfolio with the typical portfolio of the: a) group at large, b) group having a similar trait or set of traits, and c) group believed to be significantly different from the student because of the degree of the skill(s) possessed or not possessed.


(3) The comparison of the student's portfolio against a neutral point which reflects a skill change from a negative to a positive direction, or the reverse.

Any of the above comparisons could be made across time, thus producing evidence of change. Similar performance comparisons of student behaviors are made almost every two minutes by teachers in the regular classroom (Stiggins, 1990). Why should the components of a portfolio be any more complex than teacher judgements? The approaches listed above could be clues to analyzing these judgements as reflected in the student products found in the portfolio.
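As a purely illustrative sketch of comparison (2), the fragment below (invented components and ratings) computes a group's typical profile as the component-by-component mean and a simple distance of one student's profile from it; the profiling and cluster-analysis methods mentioned above would refine this considerably.

    # Hypothetical sketch: comparing one student's portfolio profile with the
    # "typical" profile of a reference group.  Components and ratings are invented.
    group_profiles = [      # each row: one student's ratings on the same ordered components
        [3, 2, 4],
        [2, 3, 3],
        [4, 3, 3],
    ]
    student = [4, 2, 2]

    typical = [sum(col) / len(col) for col in zip(*group_profiles)]        # component means
    distance = sum((s - t) ** 2 for s, t in zip(student, typical)) ** 0.5  # smaller = more typical
    print([round(t, 2) for t in typical], round(distance, 2))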

Q#8. Does current methodology support aggregating portfolios in some instructional areas better than others?

Yes. Methodology exists to assess portfolio components in the areas of writing, physical education, art, and music. However, this is not the case for mathematics, social studies, science, reading, and other areas not traditionally assessed using portfolios. Regardless of area, what is chiefly lacking is a methodology for aggregating the whole portfolio rather than its components.

Some special program areas currently being assessed and defined by student characteristics, e.g., ability, do have portfolio-type methodology in place. Typically, gifted and special education programs, in part because of the diversity of their participants, have evaluation models that include assessment of unique goals or outcomes for each student via portfolio-type procedures.

Q#9. Will new methods need to be developed to support the aggregation of portfolio data?

Possibly. Establishing standards for portfolio definition, component selection, performance evaluation criteria, and the weighting of the portfolio components selected for aggregation will set conditions suitable for using current statistical practices. Such classroom-level standards will also help the portfolio assessment process meet the aforementioned standards of good evaluation practice.

When diversity of purpose for the portfolio process is the issue, new methods will be needed. Currently, only single-subject evaluation designs are effective, though not efficient, in measuring growth or change in individual portfolios. One relatively new methodology that draws from single-subject research but attempts to aggregate findings across students is Goal Attainment Scaling (PEP, 1976). This procedure uses unique goals/outcomes for each student. By establishing criteria for each expected outcome, the scaling weights each goal, ascertains the level of actual versus expected outcome, and calculates a scale score. The scale score can be aggregated across students to produce an estimate of how well the group met individual goals or attained expected changes. Goal Attainment Scaling has possibilities for aggregating portfolios of either similar or dissimilar context and purpose.
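The sketch below gives a deliberately simplified, hypothetical Goal Attainment Scaling-style summary: each student's goals carry a weight and an attainment level on the conventional -2 to +2 scale (0 = expected outcome), and the weighted mean per student is then averaged across students. It is an assumption-laden illustration, not a reproduction of the cited procedure; the goals, weights, and levels are invented.

    # Hypothetical, simplified Goal Attainment Scaling-style summary.
    # Goals, weights, and attainment levels (-2..+2, with 0 = expected) are invented.
    students = {
        "student_01": [("writing_growth", 2, +1), ("reading_log", 1, 0)],
        "student_02": [("number_sense", 1, -1), ("oral_presentation", 2, +2)],
    }

    def gas_score(goals):
        """Weighted mean attainment level across one student's goals."""
        total_weight = sum(w for _, w, _ in goals)
        return sum(w * level for _, w, level in goals) / total_weight

    scores = {name: gas_score(goals) for name, goals in students.items()}
    group_estimate = sum(scores.values()) / len(scores)   # above 0: group exceeded expected outcomes
    print(scores, round(group_estimate, 2))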

Q#10. Does aggregation of portfolio data allow growth, progress, or improvement to be shown as well as no change or regression?

Yes. The Goal Attainment Scaling process mentioned above illustrates a manner of aggregating unique portfolio-like data which could show no progress. The procedure, however, assumes all components of importance are examined in the portfolio. If only the single "best" piece of a portfolio is sampled for aggregation and analysis, then improvement, or the lack of it, may not be evaluated.

Q#11. Are appropriate measurement methods currently being used to aggregate portfolio data in existing portfolio projects?

Doubtful. The authors have done repeated computer searches of the educational research to date (5/1/90) on large-scale portfolio assessment. Very little is available, and no articles have been found that analyze the effectiveness of various techniques. Some work is being attempted by publishers in the area of writing portfolios, but not with the intent of assessing the use of the portfolio as a whole. Individual district projects in which aggregation is occurring but has not been published may or may not be applying appropriate methodology.


Q#12. Would aggregated portfolio data become normative over a longer time?

Unknown. It is likely that performance in a complex, multifaceted form changes in much the same ways as do simpler forms of performance data. Traditional analyses of data that are standardized in their collection could become normative across ages, grades, or even gender and content area. Portfolios merely collect student work, which in turn provides a data source; it should not be assumed that such data would evolve any differently than non-portfolio data.

XIII. ISSUE CLUSTER #6: Other Issues Related To Aggregating Portfolio Data

Q#1. Is aggregating portfolio data cost effective?

Unsure. To date no data are available to help answer this question, and a number of factors must be considered in determining cost effectiveness. Since portfolios vary so greatly in their components, cost effectiveness will probably have to be judged on a case-by-case basis anyway. And while portfolio aggregation may reduce costs because it might eliminate redundant or "add on" assessment activities, many portfolio components will be more time intensive and, therefore, more expensive to score or summarize than the traditional measures currently being used. It will probably be some time before there is enough experience and cost information available from portfolio projects to begin answering the cost effectiveness question.

Q#2. Does technology exist which would help support portfolio implementation and aggregation?

In some cases. There is software to archive images of art or the written word, and this software can classify or codify any variable that a user defines through a set of criteria. Software and hardware to analyze writing samples have been available for some time; the depth of the factors analyzed is limited only by the size of the computer and the creativity of the user. Scanners, computers, and software are probably too expensive at the present time, however, to make mass portfolio assessment feasible.

Q#3. Will staff development activities and programs be needed to help support the appropriate use and interpretation of portfolios for instructional and assessment purposes?

Yes. Portfolio assessment is a new technique, and teachers and administrators are unlikely to have learned about it in preservice or previous inservice training. Inservice will be necessary to explain the rationale for, and the mechanics of, using the portfolio for both instruction and assessment purposes.

Bibliography

Joint Committee on Standards for Educational Evaluation, Standards for Evaluations of Educational Programs, Projects, and Materials, McGraw-Hill, New York, New York (1981).

Joint Committee on Testing Practices, Code of Fair Testing Practices in Education, American Psychological Association, 1200 12th Street NW, Washington, DC 20036 (1988).

Program Evaluation Resource Center, Applications of Goal Attainment Scaling, 501 Park Ave. S., Minneapolis, MN 55415 (1976).

Stiggins, Richard J., Classroom Assessment Training Program, Northwest Regional Educational Laboratory, 101 SW Main, Suite 500, Portland, Oregon 97204 (1990).

Appendix A

Portfolio Definition Abbreviated


A portfolio is a purposeful collection of student work that exhibits to the student (and/or others) the student's efforts, progress, or achievement in (a) given area(s). This collection must include:
* student participation in the selection of portfolio content;
* the criteria for selection;
* the criteria for judging merit; and
* evidence of student self-reflection.

Appendix B

SUMMARY OF THE EDUCATIONAL EVALUATION STANDARDS

A Utility Standards

The Utility Standards are intended to ensure that an evaluation will serve the practical information needs of given audiences. These standards are:

A1 Audience Identification
Audiences involved in or affected by the evaluation should be identified, so that their needs can be addressed.

A2 Evaluator Credibility
The persons conducting the evaluation should be both trustworthy and competent to perform the evaluation, so that their findings achieve maximum credibility and acceptance.

A3 Information Scope and Selection
Information collected should be of such scope and selected in such ways as to address pertinent questions about the object of the evaluation and be responsive to the needs and interests of specified audiences.

A4 Valuational Interpretation
The perspectives, procedures, and rationale used to interpret the findings should be carefully described, so that the bases for value judgments are clear.

A5 Report Clarity
The evaluation report should describe the object being evaluated and its context, and the purposes, procedures, and findings of the evaluation, so that the audiences will readily understand what was done, why it was done, what information was obtained, what conclusions were drawn, and what recommendations were made.

A6 Report Dissemination
Evaluation findings should be disseminated to clients and other right-to-know audiences, so that they can assess and use the findings.

A7 Report Timeliness
Release of reports should be timely, so that audiences can best use the reported information.

A8 Evaluation Impact
Evaluations should be planned and conducted in ways that encourage follow-through by members of the audiences.

B Feasibility Standards

The Feasibility Standards are intended to ensure that an evaluation will be realistic, prudent, diplomatic, and frugal. These standards are:

B1 Practical Procedures
The evaluation procedures should be practical, so that disruption is kept to a minimum and needed information can be obtained.

B2 Political Viability
The evaluation should be planned and conducted with anticipation of the different positions of various interest groups, so that their cooperation may be obtained, and so that possible attempts by any of these groups to curtail evaluation operations or to bias or misapply the results can be averted or counteracted.


B3 Cost Effectiveness
The evaluation should produce information of sufficient value to justify the resources expended.

C Propriety Standards

The Propriety Standards are intended to ensure that an evaluation will be conducted legally, ethically, and with due regard for the welfare of those involved in the evaluation, as well as those affected by its results. These standards are:

C1 Formal Obligation
Obligations of the formal parties to an evaluation (what is to be done, how, by whom, when) should be agreed to in writing, so that these parties are obligated to adhere to all conditions of the agreement or formally to renegotiate it.

C2 Conflict of Interest
Conflict of interest, frequently unavoidable, should be dealt with openly and honestly, so that it does not compromise the evaluation processes and results.

C3 Full and Frank Disclosure
Oral and written evaluation reports should be open, direct, and honest in the disclosure of pertinent findings, including the limitations of the evaluation.

C4 Public's Right to Know
The formal parties to an evaluation should respect and assure the public's right to know, within the limits of other related principles and statutes, such as those dealing with public safety and the right to privacy.

C5 Rights of Human Subjects
Evaluations should be designed and conducted, so that the rights and welfare of the human subjects are respected and protected.

C6 Human Interactions
Evaluators should respect human dignity and worth in their interactions with other persons associated with an evaluation.

C7 Balanced Reporting
The evaluation should be complete and fair in its presentation of strengths and weaknesses of the object under investigation, so that strengths can be built upon and problem areas addressed.

C8 Fiscal Responsibility
The evaluator's allocation and expenditure of resources should reflect sound accountability procedures and otherwise be prudent and ethically responsible.

D Accuracy Standards

The Accuracy Standards are intended to ensure that an evaluation will reveal and convey technically adequate information about the features of the object being studied that determine its worth or merit. These standards are:

D1 Object Identification
The object of the evaluation (program, project, material) should be sufficiently examined, so that the form(s) of the object being considered in the evaluation can be clearly identified.

D2 Context Analysis
The context in which the program, project, or material exists should be examined in enough detail, so that its likely influences on the object can be identified.

D3 Described Purposes and Procedures
The purposes and procedures of the evaluation should be monitored and described in enough detail, so that they can be identified and assessed.

D4 Defensible Information Sources


The sources of information should be described in enough detail, so that the adequacy of the information can be assessed.

D5 Valid Measurement
The information-gathering instruments and procedures should be chosen or developed and then implemented in ways that will assure that the interpretation arrived at is valid for the given use.

D6 Reliable Measurement
The information-gathering instruments and procedures should be chosen or developed and then implemented in ways that will assure that the information obtained is sufficiently reliable for the intended use.

D7 Systematic Data Control
The data collected, processed, and reported in an evaluation should be reviewed and corrected, so that the results of the evaluation will not be flawed.


LARGE SCALE ASSESSMENT: FOCUSING ON WHAT STUDENTS REALLY KNOW AND CAN DO

Frank G. Horvath
Alberta Department of Education

INTRODUCTION: GOOD ASSESSMENT SUPPORTS TEACHING AND LEARNING

The title of this symposium urges us to examine what is authentic about our assessment programs and how they can be made more authentic. For my purposes, authentic assessment is defined as assessment that, to the extent possible, provides students with an opportunity to show what they really know and can do.

In Alberta, we have been concerned about authentic assessment since 1980, when our provincial Student Evaluation Program was first announced in our legislature. We have been absolutely committed to being fair to students, but in all our assessment programs we also wanted to know exactly what students have learned and how well, in relation to expectations. That is why there is a close tie between our assessment programs and the province's intended curriculum.

The recent emergence of the term authentic in the assessment literature has been very useful in our work because it has helped us enrich our concept of validity, placing particular emphasis on situational variables. The concept of authenticity has caused us to examine once again how our assessment programs actually reflect learning in a real world environment, and how our programs contribute to good teaching and learning.

We have three major assessment programs in Alberta, with very different purposes. While we want assessment in each of the programs to be as authentic as possible, each is constrained to a degree by its purpose. At this time, the Diploma Examinations Program is the most constrained in its authenticity because its purpose is to certify individual students' achievement in specific courses at the end of high school. Since 50 percent of a student's final mark in a diploma examination course comes from the provincial examination, the questions must be tied to course objectives, which are not in themselves necessarily related to real world experiences. Further, test reliability is critical to ensure fairness and equitable treatment of students. Therefore, we exercise considerable control over the design, development, administration, and marking of these examinations.

The second program, the Achievement Testing Program, is intended to monitor student learning at grades three, six, and nine throughout the province. The tests are based on the intended curriculum generally, but the lived curriculum is important too in their design. Also, the focus is on group data. Thus the constraints are lesser; we are more free to construct the tests to fit the children's real world experiences in the core areas of language arts, mathematics, science, and social studies. Fidelity between the intended curriculum and our assessment remains an essential feature, nonetheless. We want to describe as fully as possible what students know and can do, not merely to report a single mean score.

Third are the diagnostic evaluation programs. Here we are the most free. The entire focus of the programs is on the students' day to day experiences and learning. The purpose is entirely to support teachers in their efforts to plan instruction according to the needs of individual students. Even in these programs, however, authenticity is constrained to some degree since we are concerned about the reliability of the criteria we develop for observing and analyzing student performance.

We are aware, too, that teachers and students are not the only ones to be served by our assessment programs.
Parents and the public, as well as school administrators, have the right to know how well students are learning what they are expected to learn. Policy-makers need reliable information about the impact of their decisions on the overall system. We aggregate students' results on our diploma examinations and achievement tests and provide school, district, and provincial level reports. This information is given to the superintendents for use within their districts. Thus while we are ensuring that the authenticity of our assessments enriches our information base, we take care to protect our publics' confidence in the information. Concerns about "soft data" need to be addressed, therefore, possibly through stringent controls on administration of authentic assessments. This paper, however, offers more on the content of authentic assessment than on the administrative processes that may be necessary to reassure skeptics.


I would like to discuss the ways in which our assessment programs encourage sound teaching and learning processes. Three key principles guide our work:

1. Assessment is based on clear expectations for student performance.
2. Assessment recognizes the central role of language in learning.
3. Assessment provides models for good teaching.

CLEAR EXPECTATIONS FOR STUDENT PERFORMANCE

In Alberta, provincial learning expectations form the basis for both instruction and evaluation. What we expect of students determines what we do at various points in their education, both to focus their learning and to assess how close they are to our expectations. In so far as expectations are concerned, "we" includes not just educators, but also parents, policy-makers, and other members of the public.

Our expectations are made explicit in the Alberta Program of Studies, a legal document which sets out what students must learn in each subject of study and at each level of learning. It provides teachers and school administrators throughout the province with the same starting point for the intended curriculum.

The learner expectations outlined in the Program of Studies are the basis on which we prepare both our test blueprints and our performance standards for diploma examinations and achievement tests. The blueprint for an examination or test ensures that the assessment design has fidelity with what students are expected to have learned, and the performance standards ensure that the assessment is consistent with how well students are expected to have learned it. In Alberta, both blueprints and performance standards are tied to the intended curriculum. (See Appendix A for a sample of performance standards and Appendix B for a sample blueprint.)

Our performance standards are widely shared among teachers well in advance of assessment, as are sample questions and criteria for scoring of answers. We encourage teachers to share these with students. The descriptive statements of desired performance indicate what underlies our acceptable standard and our standard of excellence. Teachers gain further clarification of the performance standards through involvement in centralized test development and marking activities. For example, teachers from all parts of the province meet in a central location to establish criteria for writing. They apply these criteria to students' papers under provincial direction.

The extensive involvement of teachers in our assessment programs helps us to make the necessary connection between instruction and assessment. Through their interaction with each other beyond their individual classrooms, teachers have an opportunity to develop a shared notion of learner expectations and standards. This may be why teachers overall support our Diploma Examinations Program and consider the examinations valuable. This was reflected in a recent Alberta Teachers' Association survey on the impact of diploma examinations on the teaching-learning process (de Luna, 1991). Responses to the teacher questionnaire indicated that teachers see the examinations as a help in their teaching and as a means of self-evaluation.

Our assessment programs are more authentic because we link provincial learner expectations and performance standards to the design and marking of our diploma examinations and achievement tests. When we report aggregate results, we report in relation to the acceptable standard and the standard of excellence. Our Examiners' Reports (see Appendix C) are written to reinforce the connection between expectations and results. We want teachers in particular to see consistency in provincial statements about what students should learn and how well.

All aspects of our Student Evaluation Program are open to public scrutiny. Our diploma examinations and achievement tests are released after administration, for example. In this way, we are held accountable for the alignment between curricula and evaluation and for the quality of our instruments and assessment processes.

THE ROLE OF LANGUAGE IN LEARNING AND ASSESSMENT

While most assessment focuses on outcomes (i.e., what and how well students have learned), authentic assessment also reflects how students have learned. This dimension adds to the validity of assessment information; ultimately it enhances learning by supporting and demonstrating for teachers aspects of sound teaching practice.


In Alberta, we have focused on language as central to learning. The importance of language in learning is not new to educators. Phrases such as "writing in the content areas", "writing across the curriculum", and "language for learning" are commonplace and reflect this awareness of the critical role of language.

Our understandings in this area have been greatly influenced by Vygotsky (1962), who theorized that speech becomes internalized and ultimately merges with thought. Vygotsky emphasized the contrast between what a child does independently and with assistance. His notion of the "zone of proximal development" revealed the importance of interaction - and thus language - in a child's learning.

Since the 1960s, the link between language and learning has been examined extensively and has begun to have a noticeable influence on curricula and teaching. Britton has no doubt been a major facilitator in these developments. Building on the ideas of Vygotsky, as well as others, Britton (1970) theorized about the critical role in learning of personal expressive language - the language of conversations, meant for expressing thoughts and feelings with no particular instrumental goal. Britton identified the pervasive contribution of this informal language to the development of deeper and deeper understandings and to the development of language itself.

Britton and many others who followed have helped educators move from merely an awareness that language is important in learning (a somewhat intuitive position) to a much more clearly articulated sense of its role. Class discussion, student projects, learning logs, and self-evaluation are no longer new. Their use is encouraged in Alberta not only by the educational literature but also by the teacher resource manuals which are developed by the provincial government and made available to teachers for a nominal fee. The prevalence of these new approaches in classrooms reflects the greater consciousness that exists in the education community about the need for students to work from their own experiences - to use language to organize their experiences and give them meaning; "we ... go back over events and interpret them, make sense of them in a way that we were unable to while they were taking place" (Britton, 1970, p. 19).

This consciousness and these new instructional approaches go hand in hand with developments in student assessment. In Alberta, we are emphasizing writing in our tests and examinations. All our school-leaving diploma examinations and many of our grades 3, 6, and 9 achievement tests include extended written-response questions. The 1990-91 examinations, for example, place the following percent emphasis on written response:

Mathematics and Chemistry - 20 percent
Biology, Physics, and Social Studies - 30 percent
English - 50 percent

The English examination most closely reflects the features of real writing situations. (See Appendix D.) A passage from prose or poetry serves as a prompt for two pieces of writing - a personal response followed by an essay of literary analysis. The personal writing allows for exploration and inner dialogue and facilitates the analysis required for the essay. Thus, even in the exam setting, students can use many of the writing processes they ordinarily use when writing independently.

The inclusion of written response in mathematics and the sciences is especially important because of the increasing emphasis on thinking processes in our curriculum, as elsewhere. (See Appendix E.)
We are moving as fast as we can to asking questions that have no single right answer - questions that encourage students to begin their responses with "... Well, that depends ..."

Besides written response, other types of student-constructed answers are required by some of our test questions. (See Appendix F.) For example, a math question for elementary students, related to making change, provides a picture of a variety of coins, requiring students to indicate a selection of coins they would expect to receive as change, given the details of purchase stated in the question. In another question, students can record data by shading in a bar to complete a graph. In the high school mathematics and chemistry examinations, students are required to write in their numerical responses to problems rather than just select the correct answer from four


alternatives. (See Appendix G.) The technology allowing responses such as these to be machine-scanned has had a somewhat revolutionary influence on the design of our paper/pencil tests; it has increased considerably the extent to which questions can represent classroom instruction and real-life situations.

ASSESSMENT PROVIDES MODELS FOR GOOD TEACHING

Among the principles informing our development of assessment tasks is the idea that assessment should focus teaching on what is important in learning. In language arts, for example, reading passages are selected from published works to reflect curriculum objectives. To the extent possible, complete literary selections are used and the questions focus on the author's message. Also considered are factors such as the likely interest of a passage to the students, the richness of the language the author uses, the portrayal of gender roles, and other features which make the passage a valuable piece of reading.

Multiple-choice questions in our tests are designed to assess students' ability to think through a problem or situation too. Most of the multiple-choice questions require students to go well beyond simple recall. We know from research on reading processes that multiple-choice questions are "authentic" reading activities (Farr, Pritchard, & Smitten, 1990). These questions require students to use language to analyze information and to pull together the parts based on what they already know. They must use their knowledge and experience in making judgments, just as they do in real problems both inside classrooms and outside.

In social studies, for example, questions are presented in relation to source data. (See Appendix H.) The source data are collected from newspapers, magazines, or historical documents on social issues or key social studies concepts. The "family" of inquiry questions leads the student through an analysis of the source data and the application of learning from the classroom. Whenever appropriate, we construct questions in family groups, so as to assess a particular set of skills within a certain context. Similarly in teaching, grouping of learnings around a specific theme or topic is a common practice.

In mathematics, we are field testing the use of a small kit of materials that could be attached to each grade 3 achievement test. The kit includes items such as paper clips, string, a ruler, and counters that the student would use in solving a problem or a set of problems. This adds an element of authenticity to the test and encourages the use of manipulatives in mathematics teaching.

At present we permit students unrestricted use of calculators in most parts of the grade 6 mathematics test (excluding the section on basic facts), and in all parts of the grade 9 and 12 examinations. Data sheets and formulas are provided for students as well.

To assist teachers in clarifying for students course expectations and performance standards, we publish sample questions and scoring criteria, as well as annotated samples of student work from previous examinations. In social studies, for example, we encourage teachers to share with students how they will be marked on their examination essays and the specific criteria for scoring. (See Appendix I.) These materials, as well as the tests, serve as a resource and support for good teaching.

In addition, we have developed assessment materials specifically designed to encourage diagnostic teaching. The purpose of these materials is different from that of tests and exams. They are intended primarily to allow a teacher to gain information about an individual student's learning. Here the "how" of learning is particularly in focus. Of special interest are our diagnostic programs and our piloting of performance-based assessment activities.

Diagnostic Programs

Two of our diagnostic programs are for teachers of elementary school students, one for reading and one for mathematics (Alberta Education, 1986; 1991). Both programs reflect the concepts and learning objectives presented in the Program of Studies for these areas. They provide strategies for identifying individual students' strengths and weaknesses in the cognitive processes central to reading and mathematics. Diagnostic criteria are valid for Alberta students and are reliable. Finally, the programs provide instructional strategies that help students build on their strengths while addressing learning processes needing attention.


The Diagnostic Reading Program provides teachers with strategies and materials to find out how well the student is using cognitive processes to construct meaning from a passage when he/she reads. The processes that are assessed are as follows: attending, associating, analyzing, predicting, inferring, synthesizing, and monitoring. Alternative instructional strategies, "whole language" in nature, are provided to address areas of strength and weakness in students' reading.

Extensive inservice was provided on the use of the Diagnostic Reading Program upon its completion in 1986. The program has helped teachers with whole language methodologies and has helped them to teach diagnostically. (See Appendix J.) In a follow-up study on implementation, teachers remarked on the usefulness of the program not only for classroom purposes but also for reporting to parents in more specific ways than they had previously. Students' performance in reading improved markedly in classrooms where the Diagnostic Reading Program is used (Alberta Education, 1989).

The Diagnostic Mathematics Program is just now available. This program is intended to help teachers unlock the mystery of mathematics for students. (See Appendix K.) The focus is on the "how" and "why" of mathematics rather than just the product or the "what". Assessment activities address understandings in the concrete, pictorial, and symbolic modes, and follow-up instruction makes heavy use of manipulatives. Teachers are being inserviced in the use of the program at present.

A third diagnostic evaluation program is nearing completion (Alberta Education, 1990a). It is designed for teachers of students in junior high school and high school (grades 7-10). The core of this program is how students use language to learn. We provide criteria to help teachers look at language use in science, social studies, and language arts. These criteria are based on a model of learning and communication which identifies six language processes basic to all subjects. (See Appendix L.)

The model reflects the movement from the familiar - from what is close to oneself - to new, more generalized understandings. Personal, informal language is seen as an essential means of making this connection and achieving insights. The processes of exploring, narrating, imagining, and empathizing reflect this, as well as the importance of feelings in the development of understanding. All of the processes are seen to interact as an individual uses language to develop an increasingly refined understanding of a concept or experience. However, the process of monitoring is seen as pervasive in learning and thus a particularly important partner to each of the other five processes.

We have developed checklists that highlight several key behaviors associated with each of the processes. (See Appendix M.) More detailed descriptions of these behaviors are provided in a set of descriptive scales which identify performance criteria for each process. (See Appendix N.) The criteria are categorized as strong, satisfactory, limited, and weak.

Both the checklists and the descriptive scales have been revised and refined through field testing. They provide valid and reliable criteria for observing student language and behavior in a classroom and for analyzing student work or tapes of student interaction. By using the criteria, teachers can gain information that is very useful in planning instruction so as to capitalize on students' strengths.

Teachers who have been involved in the pilot phase of this program have expressed surprise at some of the insights they have gained about individual students' learning. They have also expressed pleasure about the fact that the processes highlighted in the program model seem compatible with those found in subject models of human learning - for example, the scientific method and the problem-solving processes used to organize knowledge in mathematics and the sciences.

Performance-based Assessment Activities

In the interests of extending the range of student performance being assessed by our Achievement Testing Program, we are piloting several types of performance-based assessment in addition to written-response activities. Last spring, we carried out a pilot project on the use of portfolios to systematically gather information on student achievement. Although portfolio assessment is classroom based, we wanted to examine


the possibility that it can also be used at the provincial level to help us report on what students know and can do by the end of grades 3, 6, and 9.

Our pilot involved four classrooms - 120 students, both rural and urban. Specific assessment activities in math, science, social studies, and language arts were included. Teachers carried out the activities during May and June, fitting them into classroom work. Twenty-one grade 3 teachers representing a cross section of Alberta teachers read the portfolios, using holistic scoring guides as a basis for assessing the work. Both the teacher markers and the classroom teachers recorded their observations about the student work.

Teacher comments indicated that they liked the idea of using portfolios and learned a great deal about their students. They appreciated the chance to intentionally observe their students' process skills and to discuss what students were able to do. Many said they would do more of this kind of assessment in their classes.

In language arts, where students had been asked to pick a "best piece", teacher markers were disappointed with the choices. They were surprised that the writing was not better than that done on the provincial achievement tests. This result may have implications for how teachers use portfolios in class work. The result may also indicate a need for attention to students' self-evaluation or to the types of writing tasks and procedures that are used in the assessment.

We learned from this pilot. It showed us the importance of clarifying expectations about the assessment tasks at the outset. For example, we see the importance of asking for all kinds of writing, and we need to be sure the teachers understand the assessment task uniformly. As well, we got valuable feedback on scoring math work. For example, we discovered that giving marks for process is problematic, since some students can see an answer right away and can lose marks if they do not plod their way through a number of steps that are expected - i.e., if they do not use the "process" we expect.

This year we are extending the pilot project. Grade 3 teachers from twelve school districts in the province will collect selected items in a portfolio on each of their students. The assessment information will be collected in the course of regular instructional activities related to language arts, social studies, mathematics, and science. "Best pieces" of work will be selected by the student and teacher, and the focus will be on four interdisciplinary areas of learning: participation skills, oral communication skills, problem-solving skills, and process skills in science. The portfolios will also contain students' self-assessments, including written reflections on themselves as learners.

To help teachers collect the assessment information in a consistent way, we are providing checklists for observing oral communication and holistic scoring guides for evaluating problem-solving activities. Teachers will keep anecdotal records of students' participation skills as well.

In addition to this portfolio project at the grade 3 level, we are working with grade 6 and grade 9 teachers who will field-test specific assessment activities in science and mathematics. Also, grade 9 teachers are developing observation criteria for assessing group participation skills in social studies work. Although these are "add-on" activities for our Achievement Testing Program this year, teachers' responses have been very positive.
They see the project as a very useful means of gaining information about their students' strengths and weaknesses in a broad range of learnings.

CONCLUSION

The concept of authentic assessment has been very useful to us. It has served as an organizing concept for introspection and renewal. As a result, it has revealed more clearly than ever why certain aspects of our assessment programs are effective, and it has led to some exciting refinements. It has helped us reaffirm our key principles: Assessment is based on clear expectations for student performance. Assessment recognizes the central role of language in learning. Assessment provides models for good teaching.

What does the future hold for us? We know that our assessment programs will continue to improve. We want to focus more than ever in the future on what is essential to students' achieving all they are capable of, both inside and outside school. We are interested in describing achievement in terms of profiles that show the levels


of learning attained by students. We want to find ways of improving the reliability of our "softer" assessment strategies too. Authentic assessments, in the form of performance-based activities, are not likely to replace achievement tests, but we believe that such alternative approaches will be an effective way of adding to the information we get from well-designed tests.

APPENDICES A THROUGH N ARE AVAILABLE IN THE ORIGINAL HARD COPY ONLY AT THIS TIME.

REFERENCES

Alberta Education (1986). Diagnostic reading program. Edmonton, Alberta: Alberta Education, Student Evaluation Branch.

Alberta Education (1989). Diagnostic reading program survey. Available from Alberta Education, Student Evaluation Branch, Edmonton, Alberta.

Alberta Education (1990a). Diagnostic learning and communication processes program. Handbook 1: Integrating evaluation and instruction. Available from Alberta Education, Student Evaluation Branch, Edmonton, Alberta.

Alberta Education (1990b). Mathematics and sciences bulletin. Diploma examinations program: School year 1990-91. Edmonton, Alberta: Alberta Education, Student Evaluation Branch.

Alberta Education (1991). Diagnostic mathematics program. Edmonton, Alberta: Alberta Education, Student Evaluation Branch.

Britton, J. (1970). Language and learning. Harmondsworth, England: Penguin.

de Luna, P. (1991, February 12). Teachers want to keep diploma examinations. ATA News, p. 3.

Farr, R., Pritchard, R., & Smitten, B. (1990). A description of what happens when an examinee takes a multiple-choice reading comprehension test. Journal of Educational Measurement, 27, pp. 209-226.

Vygotsky, L.S. (1962). Thought and language (E. Hanfmann and G. Vakar, Trans.). Cambridge, MA: The MIT Press.


AN INCLUSIVE APPROACH TO ALTERNATIVE ASSESSMENT

Lew Pike Fairfax County Public Schools

Introduction

Perhaps the most pointed question for evaluating a school system's assessment program is, "What is its effect on instruction?" As we all know, growing support for "alternative assessment" is leading to major changes in assessment at national, state and local levels. This movement toward a broader, more innovative range of assessment alternatives should serve to improve instruction. But there is a risk that the introduction of alternative forms of assessment on a broad scale will create as many problems as it solves.

Fortunately, current proposals for reform are more constructive than the anti-testing stance of earlier propositions, such as NEA's call for a moratorium on all standardized testing. Rather than simply opposing standardized testing, most new proposals (see Archbald and Newmann, 1988; Mitchell, 1989; and Wiggins, 1989) include suggestions of what to use instead, and describe the proposed changes in the framework of developing a stronger link between testing and instruction.

That is the good news. The bad news is that most discussions of assessment alternatives feature obstinate advocacy of opposing views. This approach fosters all-or-nothing decisions, or no decision at all. There is little exploration of a middle ground, in which the objective is to improve assessment by combining the best attributes of both traditional and alternative approaches. The position taken in this paper is that the improvement of instruction through assessment reform will be much better served if an inclusive, pragmatic approach is taken, rather than one seeking to determine which of two almost diametrically opposed perspectives is "correct."

The purpose of this paper is to examine some of the issues involved in choosing between (or selecting from) "traditional" and "alternative" forms of assessment. A discussion of these issues will be followed by a brief description of how we have applied some of the concepts to assessing a revised mathematics curriculum in Fairfax County, Virginia.

Issues in Alternative Assessment

The following issues have emerged from consideration of assessment changes suggested by advocates of alternative assessment, and some concerns about these suggestions expressed by measurement specialists. Each issue will be discussed within the framework of designing assessment to support the improvement of instruction.

1. Criterion-referenced versus norm-referenced tests
2. Multiple-choice versus alternative item formats
3. System accountability versus classroom instructional use
4. Teacher involvement in assessment
5. Test bias

For each issue, an alternative-assessment proposal is presented, followed by the response of traditionally oriented measurement specialists and by a recommended course of action or decision rule for FCPS (or other school systems) based on a synthesis of the two perspectives.

Criterion-Referenced versus Norm-Referenced Tests

Alternative-Assessment Proposal

Replace norm-referenced tests with criterion-referenced ones. The rationale is that norm-referenced test content does not match the curriculum. In a symposium on "What kind of instruction should measurement be driving?" Grant Wiggins (1990) noted that "Test content influences instructional content, as teachers seek to insure the best possible scores."

Traditional-Measurement Response

Criterion-referenced tests should enhance the validity of scores indicating student achievement of a school system's curriculum objectives. However, they do not allow comparisons to external reference groups such as state and national samples.


Recommendations and Discussion

Use criterion-referenced tests to measure student achievement of a school system's curriculum objectives. Use norm-referenced tests if the intent is also to compare a school system's student scores to external reference groups.

Criterion-referenced tests should be used, wherever possible, to assess student achievement of a school system's instructional program. For a given content area and grade level, the most relevant criterion for evaluating the instructional program is the corresponding set of curriculum objectives. To support the instructional program effectively, assessment must be tied closely to these objectives.

It does not necessarily follow that norm-referenced tests should be eliminated. Given a test well matched to a system's curriculum, a separate norm-referenced test is also needed if assessment is to include comparisons to external (state or national) norms.

Multiple-Choice versus Alternative Item Formats

Alternative-Assessment Proposal

Replace multiple-choice items with completion items or performance tasks. The rationale is that multiple-choice items cannot adequately assess higher-order learning. In the symposium noted above, Wiggins (ECS-1990) pointed out that test format also has an effect. He asserted that traditional test items requiring only recall, recognition, or simple algorithms lead to instruction that is "increasingly atomistic and composed of unambiguous questions."

Traditional-Measurement Response

Tests composed of completion items and performance tasks are subjective, difficult to develop, time-consuming to administer, costly to score, often low in reliability, and provide a narrow sampling of the content domain. Further, multiple-choice items are not necessarily limited to the description given by Wiggins.

Recommendation and Discussion

For each objective being measured, select the item format that affords the best information within the limits of feasibility.

Given a well-defined curriculum, decisions about item format can be made separately for each objective. Some objectives can be measured adequately by multiple-choice questions, and others cannot. For assessment of math instruction in Fairfax, we have found the following guideline to be useful: when multiple-choice questions can be developed that curriculum and test specialists agree are valid for a given objective, then that is the preferred format. This is primarily because of their much greater cost-effectiveness, especially in the expenditure of classroom time. But when the validity of multiple-choice items for a given objective is questionable, completion items or other performance tasks must be used.

System Accountability versus Classroom Use of Assessment

Alternative-Assessment Proposal

Replace summative assessment (for system accountability) with formative assessment (for classroom instruction). The rationale is not altogether clear. It seems based in some part on the recognition that summative assessment is often not adequate for serving formative assessment needs and, for at least some advocates, on a belief that accountability assessment is inherently antithetical to good instruction.

Traditional-Measurement Response

Formative assessment has limited value for assessing accountability.

Recommendation and Discussion

Use both summative and formative assessment in the assessment system.

Neither type of assessment can replace the other. Each is important in its own right, and must be considered on its own merits. A fundamental question in designing an assessment system is: "How can the competing goals of assessment for classroom instruction and assessment for accountability be dealt with, when the primary goal is to support the improvement of instruction?"


The first part of the solution is to take a dual-track approach to assessment. The instructional track is directed primarily to teachers' needs for ongoing instructional decisions. The accountability track is concerned primarily with system-level accountability needs. Thus, for formative assessment, control should be primarily in teachers' hands, and the division role should be to provide training and support. For accountability, control should be primarily at the division level, with cooperation and assistance from teachers and school administrators. Having made this distinction, priorities can be assigned in a school system according to the assessment needs of each content area.

The second part is to recognize that work directed primarily to one level can be extended to address some needs at other levels. Tests for the FCPS mathematics curriculum, for example, are designed primarily for accountability, and provide total scores for that purpose. However, strand (subscore) information is also reported, which shows areas of relative strength and weakness. This information has direct value for improving classroom instruction. Similarly, it may be possible, but remains to be demonstrated, that portfolio assessment intended primarily for classroom use can be designed to allow aggregation of data across students and classes. Then the scores may have credibility for answering accountability questions.
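As a purely illustrative sketch (not drawn from the FCPS system; the strand names, item keying, and scores below are hypothetical), the following short program shows how a total score and a strand subscore profile might be computed from item-level results:

    # Illustrative sketch only: computing a total score and a strand (subscore)
    # profile from item-level results. Strand names and data are hypothetical.

    def strand_profile(item_strands, item_scores):
        """item_strands: one strand label per item.
        item_scores: one 0/1 (or partial-credit) score per item.
        Returns (total score, {strand: percent correct})."""
        totals, counts = {}, {}
        for strand, score in zip(item_strands, item_scores):
            totals[strand] = totals.get(strand, 0) + score
            counts[strand] = counts.get(strand, 0) + 1
        profile = {s: 100.0 * totals[s] / counts[s] for s in totals}
        return sum(item_scores), profile

    # Example: six items keyed to three hypothetical strands.
    strands = ["Number Concepts", "Operations", "Geometry",
               "Number Concepts", "Operations", "Geometry"]
    scores = [1, 0, 1, 1, 1, 0]
    total, profile = strand_profile(strands, scores)
    print(total, profile)
    # 4 {'Number Concepts': 100.0, 'Operations': 50.0, 'Geometry': 50.0}

The same total appears for accountability reporting, while the per-strand percentages show the relative strengths and weaknesses that are useful in the classroom.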

Teacher Involvement in Assessment

Alternative-Assessment Proposal

Rely on teacher-based assessment, rather than on externally imposed assessment. The stated rationale is that externally imposed assessment affords teachers little sense of "ownership."

Traditional-Measurement Response

Teachers are poorly trained for either formative or summative evaluation.

Recommendation and Discussion

Provide training to allow teachers to take an increasingly responsible role in assessment.

There is general agreement that teachers should play an increasing role in assessment in the 1990s. Teachers' changing roles in assessment were described in several reports at the ECS-1990 conference. Richard Stiggins predicted that teacher-initiated classroom assessment will be the main source of information for decisions affecting instruction. Dale Carlson, of the California Assessment Program (CAP), forecast a shift to teachers' involvement, ranging from the development of exercises to scoring open-ended items. At the same time, major concerns were expressed about teachers' willingness and preparation for expanded responsibilities in assessment. William Brown, of the North Carolina Department of Education, described the introduction in his state of "Instructionally Based Evaluation," in which teachers apply observation scales daily in reading and mathematics. He noted that teachers do not like it, and do not incorporate it in their instruction.

At the same conference, Stiggins discussed survey results showing that teachers receive very little instruction about assessment in most teacher preparation programs, and that the instruction provided rarely addresses practical classroom needs. As one solution to the problem, Stiggins provides assessment workshops for teachers, both directly and through a series of four videotapes from Northwest Regional Educational Laboratory. Another way of strengthening teacher motivation and training for assessment was reported by Tej Panday, of CAP. Teachers who scored open-ended items, recently introduced as a major part of CAP testing, received inservicing on related curriculum and assessment problems and issues. Most readily accepted this form of participation in, and instruction for, assessment.

Test Bias


Alternative-Assessment Proposal

Replace standardized testing with alternative-assessment measures. The rationale is that standardized test scores are biased against minorities.

Traditional-Measurement Response

Score differences by ethnic group are often greater for alternative assessment measures than for traditional standardized tests.

Recommendation and Discussion

Continue to use standardized test scores as indicators of student achievement, when item formats are appropriate to the objectives measured.

Critics of standardized tests often cite ethnic-group score differences as prima facie evidence that the tests are biased. Some advocates of alternative assessment suggest the use of alternative assessment as a solution to the problem. However, as such alternatives are being implemented, there is increasing evidence that the opposite may be true. H. D. Hoover recently predicted that fairness will become the biggest issue regarding alternative assessment (ECS-1990). He described studies showing that ethnic differences increase when a change is made from multiple-choice items to alternative measures. Even changes such as allowing greater time in essay-writing may have this effect. A National Assessment of Educational Progress (NAEP) report on its 1988 writing exam noted that increasing writing time from 15 to 30 minutes did raise scores, but by a wider margin for whites than for blacks or Hispanics.

Lower test scores do not necessarily mean that the net effect of introducing alternative assessment will be a negative one for minorities. Where such changes result in better measurement of achievement, particularly when there are also subscores pinpointing relative strengths and weaknesses, the effect can be improved instruction. Because it is better targeted, improved instruction can then, in turn, increase minority achievement. In that event, a short-term problem of lowered minority test scores could lead to highly significant long-term gains.

Summary of Issues in Alternative Assessment

Decisions about involvement in assessment reform are likely to vary by content area. The assessment program for each discipline should be reviewed to see how well it matches a system's corresponding curriculum objectives. This is particularly important for Mathematics and the Language Arts, which have undergone major curriculum changes, and for Science and Social Studies, which are beginning to experience similar changes.

While the alternative assessment issues may be stated simply, they are in fact very complicated. This complexity can be substantially reduced by taking an approach that is inclusive and pragmatic, rather than exclusive and ideological. Too often, the discussion about alternative assessment is presented from an advocacy stance. This easily leads to either-or decisions without considering how a combination of traditional and alternative assessment may best serve a system's assessment needs. While this is beginning to change, a pragmatic, inclusive approach -- i.e., one that considers the advantages and limitations of both traditional and alternative modes of assessment -- is still too often "The path less taken."

Our general recommendation, across the five issues, is to proceed vigorously with the exploration of alternative modes of assessment, but to approach large-scale implementation with caution. Adopt an inclusive and pragmatic approach, taking optimal advantage of the best of both multiple-choice and alternative formats to address the several problems.

Applying an Integrated Approach to Assessment Alternatives in Mathematics


In Fairfax County, we have taken an inclusive approach to alternative assessment over the past five years. To date, this has been a journey over seldom-traveled terrain. As an offspring of early settlers of Oregon Territory, I'm not sure our current journey would qualify us as trailblazers, but if I were Scott Peck I might well describe our route as "The Road Less Graveled."

What we have been working on is assessing a new Program of Studies (POS) in Elementary Mathematics for Fairfax. Both the revised POS and its assessment are based on guidelines given in the 1989 NCTM monograph, Curriculum and Evaluation Standards for School Mathematics. The revised curriculum departs from our earlier one in two important respects. First, it is much broader, ranging well beyond the traditional emphasis on computation skills, even in the earliest grades. Only two of the six strands (see Table 1), "Number Concepts" and "Operations," are focused on computation. The new POS further stipulates that "... the instruction of all students must include the concepts and skills of each of the program strands." The second major change is that math concepts are introduced through the use of manipulative materials, providing an opportunity to construct mathematics concepts through concrete experience, rather than beginning at the outset with abstract symbols.

Standardized tests such as the Iowa Test of Basic Skills (ITBS) are much more closely aligned to the old POS than to the new one. The effect of using these tests has been to impede implementation of the new POS, exemplifying all too well Grant Wiggins' (1990) observation that "Test content influences instructional content, as teachers seek to insure the best possible scores."

The charge given to the Office of Testing and Evaluation (OTE) was very simple: develop tests for the revised POS that will complement, rather than interfere with, the FCPS curriculum changes introduced to improve mathematics instruction.

The movement for assessment reform comes at an opportune time for Fairfax County Public Schools and other school systems, because curriculum reforms have been introduced for Mathematics, the Language Arts, and Science. Within each discipline, changes in assessment can complement curriculum changes to effect significant improvement in instruction.

The school system level is central to effecting assessment reform. It is at this level that major curriculum decisions are made and implemented, and where responsibility lies for ensuring that assessment serves to improve instruction. The call for assessment reform that parallels instructional reform is compelling; so too are the warnings of measurement specialists who counsel that such changes should be implemented with due concern about possible unanticipated side effects.

Issue-Based Test Specifications

To provide assessment that would strongly support the revised Mathematics POS, we gave priority to the following issue-based test specifications.

Test content must be curriculum-referenced

In contrast to the "curriculum-free" testing associated with norm-referenced testing, test content was tied directly to the POS objectives. Every item was written for a specific POS objective.

Item formats must match each objective

Most standardized testing uses only multiple-choice items, both for efficiency and to avoid the subjectivity inherent in scoring completion items and performance tasks. However, many objectives in the new POS require the demonstration of math concepts by such actions as manipulating physical materials and writing story problems. Test items were developed in the appropriate alternative format for each of these objectives.


To ensure validity, multiple-choice items were used only when we could reach agreement with mathematics specialists that such items could adequately measure achievement of the target objective. For example, multiple-choice items could be used for the objective, "Know that a polygon with three sides is a triangle," but could not be used for "Draw a triangle." For grades one and two, it was necessary to introduce a third response category, "teacher observation," for objectives such as "Skip-count forward" and "Use your blocks to demonstrate addition with regrouping."

To enhance objectivity and feasibility, we reached agreement with math specialists that, whenever validity constraints had been met, multiple-choice questions would have the highest priority, completion items second, and teacher observation third.
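As a purely illustrative aside (not part of the FCPS test specifications), this priority rule can be expressed as a simple decision function in which the validity judgments of curriculum and test specialists are taken as given flags; the function name and inputs are hypothetical:

    # Sketch of the format-priority rule described above. The validity judgments
    # themselves come from curriculum and test specialists; here they are simply
    # passed in as flags for illustration.

    def choose_format(mc_valid, completion_valid, observation_valid):
        """Return the highest-priority item format whose validity has been agreed on."""
        if mc_valid:
            return "multiple-choice"
        if completion_valid:
            return "completion"
        if observation_valid:
            return "teacher observation"
        return "no agreed-upon format"

    print(choose_format(True, True, True))     # multiple-choice
    print(choose_format(False, True, True))    # completion
    print(choose_format(False, False, True))   # teacher observation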

Assessment must include higher-order learning

Both the expansion of the curriculum to include such higher-order skills as estimation and probability, and the requirement that skills be demonstrated by, for example, writing a story to illustrate the equation 3 x 12 = 36, require instruction in higher-order learning. Test items or tasks were developed, with appropriate formats, to assess the achievement of higher-order learning.

Achievement of each strand must be assessed

A general objective of the new POS is that "... the instruction of all students must include the concepts and skills of each of the program strands." To assess this objective, both strand scores and total scores were obtained.

Item and test difficulty levels must be curriculum-based

Unlike the earlier POS-math tests, which were constructed to have a specified difficulty level of 80 percent correct, the difficulty levels of the new tests were not predetermined. Instead, test items were developed to reflect, as accurately as possible, the difficulty inherent in each objective. Then, we "let the chips fall where they may." Only in this way could the tests directly reflect how well the new curriculum is being learned.
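As a purely illustrative aside (the response data below are invented and not drawn from the FCPS tests), empirical item difficulty, the proportion of students answering each item correctly, is something that can simply be computed after administration rather than fixed in advance:

    # Minimal sketch: empirical item difficulty (proportion correct) per item.
    # The response matrix is invented for illustration.

    responses = [           # rows = students, columns = items; 1 = correct, 0 = incorrect
        [1, 0, 1, 1],
        [1, 1, 0, 1],
        [0, 0, 1, 1],
        [1, 1, 1, 1],
    ]

    n_students = len(responses)
    n_items = len(responses[0])

    difficulty = [sum(row[i] for row in responses) / n_students for i in range(n_items)]
    print(difficulty)       # [0.75, 0.5, 0.75, 1.0] for this invented data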

Validity and reliability must be demonstrated

To increase validity, the new POS assessment uses criterion-referenced measures and item formats appropriate to each objective. Much of the potential gain in validity could be attenuated by unreliability in the subjective scoring of completion items or teacher-observation tasks. To avoid this problem, items and scoring procedures were carefully designed and pretested. Satisfactory rater reliabilities were required for an item to be selected for use in operational tests.

Feasibility and cost-effectiveness must be ensured

From the outset, we were determined to avoid contributing to "too much testing." We set a limit of two hours of class time to assess a full year of mathematics instruction. We've come very close. We exceeded the limit only in grades one and two, and then by only about 20 minutes.

DISCUSSION

Barbara Presseisen, Research for Better Schools

It seems that Hanson's theme, that the topic of "authentic assessment" is a developmental endeavor, underlies much of the discussion in all the papers of this Symposium. We are learning what we mean as we do it; performance is key to our discussion. Purposeful performance is one of the major goals of the authentic assessment approach. In a sense, the topic falls into the much larger issue of shifting our priorities, if not our paradigm, for testing in American education. The many sessions being held this year at AERA that have identified alternative assessment themes show


that a dynamic investigation is underway as we question some of the basic assumptions of our trade. It is understandable that "camps" will appear with particular points of view in mind, as we pursue this innovative area. The papers in this session held many common views, in fact, yet there are several points that deserve further elucidation or deliberation.

The significance of the much older "constructivist" approach from the early studies of Piaget and his associates can be seen in this new assessment strategy. The emphasis on student learning, and assessment to serve that end, is related to this historic influence on current testing practices. With both a concern for student variation and an emphasis on the goal of student achievement, we have advanced the notion that we need to be able to show what students know and are able to do -- hence the emphasis on performance again. It is important to note that this emphasis is not just a recent, fly-by-night orientation.

The research reviewed reveals both the central role of language in learning and the importance of feedback that allows the learner to process his or her own thoughts during assessment. This relates testing to some of the research now current in instruction and assessment, such as the importance of Vygotsky's theory of social development and learning, and the significance of ways to develop metacognitive experiences. Self-reflection, such as Olson suggests, is very much a part of portfolio development and the learning associated with such metacognitive activity. It is interesting to note the success of the portfolio strategy in the arts and in writing development.

The larger significance of mediation -- direct intervention to help the learner -- thus characterizes a new kind of assessment. Some call it "dynamic"; others call these tests more thoughtful and reflective. Thus, short-answer exams are frowned upon because they pose too extreme a choice pattern, "either-or" positions that do not help students see the more "gray" areas of thinking and problem solving in situ. The relationships between these mediated understandings and an emphasis on knowledge that is really integrated into a student's understanding by use are the focus of several of the presenters' comments. Knowledge, they suggest, is not to be decomposed or decontextualized. This has obvious implications for the curriculum of a particular content area and the ways it is instructed.

Horvath's paper makes me wonder why the Canadians seem to have the content dimension so well under control. It seems their curriculum guidelines are more generally owned, and they are dealing with known cognitive operations in their testing practices. Some forms of authentic assessment may fit certain contents better -- I wonder why. There are things to be learned about performance from music, art, writing, even physical education. But that will not answer all our questions regarding science, social studies, and reading. Pike's comments on the mathematics curriculum also make me wonder about questions of standards, accountability, validity, and reliability in creating these new measures.

One area of general difficulty seems to be calling for clarification from all the participants. What difference will authentic assessments bring to the particular problems of students at risk of academic failure already in our schools? Will new testing practices make their lives even more difficult? And, unfortunately, no one examined the needs of urban school staffs to become informed, indeed involved, in authentic assessment endeavors. It may well be their experience that determines whether this new wave of testing has real success in inner-city schools.

The importance of staff development and teacher participation for real change in testing was mentioned by many of the presenters. Several papers show that it is teacher "give and take" over time, even years of development, that actually molded a successful new testing program. How many such successful experiments have there been? Who is collecting data on what made them successful or not?

And finally, the question of how to relate the assessment issues to the much larger "restructuring" topic came to mind. What does authentic assessment suggest as new roles or new relationships within a school, or across a district, or even throughout a state? How might concerns for new technology be related to these concerns? Are there new assumptions, different expectations, implications for staff development, allocation of resources? We may be opening Pandora's box -- at least our reflection on the issues that emerge can help us find opportunity for what follows. The Symposium in the long run left us with much food for further thought.


Symposium II MEASUREMENT ISSUES IN PERFORMANCE ASSESSMENT

The headlong rush for performance assessments is only occasionally met with reasoned analysis from the perspective of psychometric theory. This symposium, organized by Peter Wolmut [Multnomah (OR) ESD], examines some of these measurement issues. Judy Arter [Northwest Regional Educational Laboratory] addresses issues related to validity. John Framer [Educational Testing Service] cautions against acting as if the inclusion of performance assessments must mean the exclusion of selected-response methodologies. Michael Trevisan [Multnomah (OR) ESD] supports the use of generalizability theory as a robust system for obtaining reliability indicators for performance assessments. Discussants Gil Sax [University of Washington] and Rich Stiggins [Northwest Regional Educational Laboratory] add their thought-provoking reactions to the papers.

PERFORMANCE ASSESSMENT: WHAT'S OUT THERE AND HOW GOOD IS IT REALLY?

Judy Arter Northwest Regional Educational Laboratory

Introduction

The Test Center at NWREL, established as part of our OERI laboratory funding, is a lending library of assessment instruments and a source of technical assistance to educators in the Northwest. In support of our lending function, we make systematic collection efforts in several chosen topical areas each year. These result either in a "Consumer Guide" - a description and review of the assessment tools available in the area - or an annotated bibliography.

Test Center staff have made systematic collection efforts in the areas of assessment instruments for measuring higher-order thinking skills, school and classroom climate, self-concept, student motivation to learn, writing, speaking and listening, leadership, early childhood education, screening students into TAG programs, and alcohol/drug use surveys. Over the last two years we have also been gathering information on alternative assessment devices. Currently, we have in our collection over 75 titles in the area of using student portfolios for assessment, 25 titles relating to assessment alternatives in reading, and 15 titles about assessment alternatives in math. Annotated bibliographies of these alternative assessment devices are continually being updated and are available upon request from the author. Other Consumer Guides and annotated bibliographies are available from NWREL and ERIC.

All the instruments and articles in the Test Center, including the large collection of assessment alternatives, are available for inspection by educators in the Northwest on three-week loan, through the mail. This is an inspection service only; once a person has decided on an assessment tool, he or she is instructed to contact the author or publisher. This has been a free service, supported in the past by OERI, and soon to be supported in other ways. Last year we circulated over 2000 titles to over 500 individuals.

In addition to the lending library and Consumer Guide functions, Test Center staff provide technical assistance in assessment to an additional 200 callers a year. This assistance ranges from help with converting scores on norm-referenced tests, to helping people find sources of item banks, to more recent keen interest in assessment alternatives.

Based on the review of a large number of assessment instruments, the systematic effort to track down assessment instruments in a number of topical areas, and discussions with a large number of end users of assessment tools, I would like to offer the following observations about alternative assessment devices in general and performance assessments in particular. Bibliographies and listings of NWREL Test Center assessment alternative materials can be obtained by contacting the author.

Misconceptions About Performance and Other Alternative Assessments.


The message is not getting out clearly enough to test users that just doing "alternative assessment" does not automatically imply doing "good" assessment. Good assessment requires that we have a clear conception of the target we are trying to measure, that we have a clear purpose for the assessment, that we have chosen the assessment technique that best matches the target and the purpose, and that we have minimized factors that could lead to a misinterpretation of results (Stiggins, 1990).

Common misconceptions in the field are that (1) doing performance (and other alternative) assessments will automatically result in better assessment; (2) anything qualifies as an alternative assessment; (3) all structured-format tests are bad and only alternative assessment devices should be used; and (4) using alternative assessments will automatically solve all of our assessment problems. Here are some examples that demonstrate the existence of these misconceptions:

1. I was asked to comment on a Guidebook that was being developed to accompany a large-scale video conference on assessment to support restructuring. The original draft contained comments like: "A subgroup of performance-based assessments are called exhibits or exhibitions. Exhibitions are authentic and engaging 'tests' of students' intellectual ability, where students have opportunities to 'show off' what they know, and the control they have over a topic. Students must approximate an expert's ability to make informed judgments and to use knowledge effectively..."; and "Multidimensional assessments enable second language users and students with special needs to look at more naturalistic sources of information. They increase special needs students' incentive to learn, to take risks, and to overcome their own weaknesses..."

The feeling in the original draft was that doing performance assessments would automatically ensure that all the wonderful things listed in the above statements would come true. There were no cautionary notes, and there was no seeming realization that these things would only come true if the assessments are done well. (In all fairness, the sponsoring organization was also uncomfortable with the first draft and sent it out to a number of reviewers. The final draft is somewhat different.)

2. Anything seems to qualify as an alternative assessment. For example, NCREL (1990) includes many examples of new assessment strategies. A number are of this type: "At the end of a unit students write a paper for another class of students (younger, older, or the same age) explaining the concept. Example: Sixth graders write a book for fourth graders explaining the cycle of a star." (p. 16)

Why is this assessment? To qualify as assessment, criteria or a method for evaluating the final product are needed. How does the teacher know if the students did an adequate job of writing this book if there are no criteria? How do the students know how effective they were and what might be done differently next time? How can the product be critiqued?

This example also illustrates the misconception that "alternative" assessment automatically implies better assessment, and that alternatives will solve all our assessment problems. But how do we know that this task really elicits what the student knows and can do? How does the ability to write affect the student's ability to show understanding?

3. A lot of the portfolio literature also seems to reflect the misconception that "alternative" assessment is automatically better.
Many papers that describe portfolio systems do not include criteria for assessing either the individual entries in the portfolio or the portfolio as a whole. Additionally, although many portfolio systems require students to reflect on their own work, there are few examples of criteria to evaluate these metacognitions.

There are, of course, some notable exceptions, such as Vermont (1989 & 1990), Mumme (1990), and Juneau (1989). There are also many such rubrics for assessing writing samples. However, those using writing portfolios in instruction seem loath to use them - as if the process of evaluating a student performance diminishes its worth. (The problem might be that teachers think of evaluation as reducing a complex student performance to a simple number, when actually, having criteria means having an agreed-upon and systematic basis for knowing what to value in a performance.)

The Reality of Performance and Other "Alternative" Assessment Approaches


Performance and other alternative assessments certainly have a place in our assessment tool kit. They clearly have the potential to assess many types of things that are difficult to measure in fixed-response tests. The issue is not so much whether to use them as how to help users realize that they have to be good consumers of published tools, and knowledgeable developers of local and classroom assessments.

In actuality, if not done well and interpreted properly, performance and other alternative assessment devices can mislead as much as, if not more than, the results of traditional (i.e., fixed-choice) tests. As has been pointed out elsewhere (Rothman, 1990; Valencia, 1989), performance assessments are based on a small number of tasks (and therefore may not be a representative sample of what a student can do), and can be subject to the individual biases of those rating the performance. Additionally, the criteria used to assess performance may not reflect the most relevant or useful dimensions of a task, the tasks that a student is asked to do can make one wonder what it is that is "authentic" about performance assessment, and there may be things in the performance assessment that make a student unable to really demonstrate what they know or can do (Arter, 1989). Users may not understand these limitations and may, as a result, both misinterpret the results of performance assessments and design poor ones.

For example, in the Oregon writing assessment five different modes of writing are being assessed: personal narrative, descriptive, imaginative, persuasive, and expository. Prompts that invite these types of writing are randomly distributed in classrooms so that all modes are addressed in each classroom. However, any given student writes only one essay. A major effort is underway to inform users of the assessment results that one cannot make inferences about individual students' ability to write based on this one sample. Although this makes sense to people when it is pointed out, they seem to be almost universally surprised and disappointed that this performance assessment has such a limitation. There is certainly the potential for overgeneralizing the results.

Another example of a performance assessment that could mislead users is portions of The English Language Skills Profile (Hutchinson and Pollitt, 1987). One part of this assessment device is a structured discussion. The students are given an emergency scenario and are given 15 minutes to decide in a group what they are going to do. The discussion is tape-recorded and the students analyze the tape to assess the contributions of individuals to the discussion. The students categorize individual comments using a scheme that includes such things as managing the discussion, introducing new ideas, clarifying or summarizing ideas, seeking clarification, etc.

The question that arises with respect to this activity is the extent to which we elicit "real" student abilities. In other words, is this an authentic (valid) assessment? Does the task really reflect something we have to do in daily life? Would students be motivated in the same way to perform on this task as they would during a real situation in their lives? Are the behaviors elicited from students representative of their ability to discuss? What about discussions in larger or smaller groups? Or discussions with adults instead of peers?

Additionally, the discussion task requires a certain amount of reading on the part of the student. This is an example of how extraneous performance requirements might affect student performance on the dimension of interest. Do we have any information about how the ability to read or role-play might affect performance in the discussion?

A third example of how alternative assessment devices can be designed or used without thinking through the implications of their use comes from the area of oral communication. Most of the assessment devices on the market purport to measure ability to communicate. However, in actuality, the measures systematically leave out a large number of the communication contexts that would be necessary to include if we would truly like to be able to infer, in general, how well a student communicates (Arter, 1989). For example, consider speaking assessments. Most speaking assessments focus on rating a speech that a student gives. Is this really a good measure of how well a student communicates orally in general? What about interactive communication in which speakers and listeners take turns? Or communication with different types of groups (peers, teachers, parents, younger children, etc.) requiring different levels of formality? All communication occurs in a context.


If we don't systematically sample from the contexts in our assessment (and instruction), we don't really get a true picture of performance.

A fourth example of how users need to be careful with respect to alternative assessments comes from the area of portfolios. I am currently working with a school district to develop a composite health portfolio to demonstrate how much students are learning, and the degree to which health instruction is integrated with other subjects. (A composite portfolio is one that contains more than one student's work.) The teachers wanted to gather real work samples to show what students have learned, and wanted to gather examples of instructional units to show how teachers teach health. After discussing the types of displays that could be collected, the committee began gathering. After sharing what was gathered during the first round, it became abundantly clear that what we had was "the best of the best"; we could not answer two fundamental questions: Do all students learn this much? and Do all teachers do this? The question of how adequately the content of the portfolio represents what it is we want to show is of central importance (Valencia, 1989).

A final example is the Informal Writing Inventory (Giordano, 1986), which "provides structure for evaluating writing samples" to determine the "presence, degree, and, to a limited extent, the cause of writing disability." Compositions, elicited by means of 14 picture cards, are scored by comparing the number of technical errors (spelling, grammar, capitalization, incomplete sentences, etc.) to the number of errors that disrupt communication. Is this really the best measure of writing ability? This assessment does have criteria to judge performances, but does it have the right criteria?

I chose this final example because it is so extreme. But what about more subtle examples of criteria that might be inadequate? Like holistic scores on writing assessments? What about the relevance and quality of criteria that arise from different theoretical models? Or those that are developed by individuals who might not have an expert grasp of a subject area and direct experience with students?

Implications

We need to provide more assistance to users to ensure that performance and other alternative assessments are used well and developed properly. This is as important for using the results of large-scale assessment as it is for classroom use of a published instrument, or even for daily informal classroom assessment.

Most alternative assessment approaches have their greatest potential use in the classroom as an integral part of instruction. If teachers do not understand how they can be misled by poorly conceived tasks and fuzzy criteria, and how extraneous performance requirements can affect student performance, then their daily ability to make judgments about student needs and progress will be inadequate.

Additionally, there is the danger that if we allow users to rush into alternative assessments without thinking through their assessment needs, how alternatives fit into these needs, and what potential problems they might encounter, they could very likely be confused and disappointed when the alternative assessment does not fulfill their expectations of "fixing" all assessment problems. We want to avoid having people rush headlong into alternatives only to have them later rejected because they don't work.

Performance and other alternative assessment approaches are too useful a part of our assessment arsenal to allow this to happen. We need to be cautious about how we integrate them into large-scale assessment. We especially need to give proper guidance (and additional undergraduate training) to teachers and school administrators concerning what good assessment is, and how and when various types of assessment approaches are best used. And we need to educate the public about alternative ways of knowing, so that they can be good consumers of assessment information.

References


Arter, J.A. (1989). Assessing communication competence in speaking and listening: A consumer's guide. Northwest Regional Educational Laboratory, 101 S.W. Main, Suite 500, Portland, OR 97204.

Giordano, G. (1986). Informal Writing Inventory. Scholastic Testing Service, Inc.: Bensenville, IL.

Hutchinson, C., and Pollitt, A. (1987). The English Language Skills Profile, User's Guide - Assessing communicative competence in the English classroom. Macmillan Education: London.

Juneau School District (1989). Juneau Integrated Language Arts Portfolio for Grade 1. Juneau School District, 10014 Crazy Horse Drive, Juneau, AK.

Mumme, J. (1990). Portfolio assessment in mathematics. California Mathematics Project, University of California, Department of Mathematics, Santa Barbara, CA 93106.

North Central Regional Educational Laboratory (1990). Restructuring to promote learning in America's schools, a guidebook, Volume 4: Multidimensional assessment: Strategies for schools. North Central Regional Educational Laboratory, 295 Emroy Avenue, Elmhurst, IL 60126.

Rothman, R. (1990). New tests based on performance raise questions. Education Week, 10 (2), September 12, 1990.

Stiggins, R. (1990). Classroom assessment video workshop series. Northwest Regional Educational Laboratory, 101 S.W. Main, Suite 500, Portland, OR 97204.

Valencia, S. (1989). Assessing reading and writing: Building a more complete picture. University of Washington, Seattle, WA 98195.

Vermont State Department of Education (1989). Vermont portfolio assessment project. Montpelier, VT 05602.

Vermont State Department of Education (1990). Vermont mathematics portfolio. Montpelier, VT 05602.


CRITERIA FOR DESCRIBING, SELECTING AND REVIEWING ASSESSMENT TOOLS IN SPEAKING AND LISTENING

SUMMARY

Criterion 1: Content

We will describe:

1. The purposes/uses the author planned for the instrument.
2. General information about the instrument, such as the grade levels intended for use, number of levels, forms and items, test length, and administration requirements (training, equipment, etc.).
3. The task presented to the student, including the purpose, setting and audience for the communication, as well as the specific content presented to students and the skills the assessment is trying to cover. With respect to skills, we will indicate both the extent to which the assessment tool emphasizes linguistic versus communication competence and the specific skills covered.
4. The responses by which the student demonstrates his or her level of skill.
5. Who scores the responses or performances and the criteria by which they are scored.

The rating in this area will depend on how well materials accompanying the instrument provide the information necessary for users to match the instrument to their needs.

Excellent: The developer includes information on purposes, the population recommended for use, and limitations of the instrument for the use suggested; describes how the instrument could be used with atypical populations; defines measurement terms and uses language appropriate for the user; lists specialized skills needed to administer the instrument; describes the test development process; provides information on reliability and validity; and provides samples of questions, directions, answer sheets, manuals and score reports (Joint Committee on Testing Practices, 1988).
Good: Much of the information above is provided.
Fair: Some of the information above is provided.
Poor: Little of the information above is provided.

Criterion 2: Reliability

We will use the following criteria for judging the general adequacy of the reliability of instruments:

Excellent: Reliability of total test score .95 or above; reliabilities of subtest scores .90 or above.
Good: Reliability of total test score .85-.94; reliabilities of subtest scores .80 and above.
Fair: Reliability of total test score .75-.84; reliabilities of subtest scores .65 and above.
Poor: Reliability of total test score .74 or below; reliabilities of some subtest scores below .65.
Unknown: No information is provided.
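As a purely illustrative sketch (not part of the Consumer Guide itself; the function name and example coefficients are hypothetical), the cutoffs above can be applied in order, treating each band as a minimum threshold for the total score and for the lowest subtest score:

    # Sketch: applying the reliability cutoffs quoted above, in descending order.
    # A rating is assigned from the total-test coefficient and the weakest subtest.

    def reliability_rating(total, subtests):
        if total is None or not subtests:
            return "Unknown"
        if total >= 0.95 and min(subtests) >= 0.90:
            return "Excellent"
        if total >= 0.85 and min(subtests) >= 0.80:
            return "Good"
        if total >= 0.75 and min(subtests) >= 0.65:
            return "Fair"
        return "Poor"

    print(reliability_rating(0.92, [0.85, 0.88]))   # Good
    print(reliability_rating(0.78, [0.70, 0.66]))   # Fair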

Criterion 3: Validity

In the reviews of instruments, we describe the types of validity considerations and studies carried out by the author(s). This includes discussions of content, criterion and construct validity. Because they relate most directly to speaking and listening, we will pay particular attention to the validity issues discussed in the previous chapter: extent of sampling from contexts, artificial v. naturalistic tasks, assessing skills in isolation or in concert, tasks that require extraneous skills, sources of bias, degree of realism in the task and response, extraneous skills required for responding, correspondence between the task and scoring criteria, rater effects, and ecological validity.

For purposes of this Guide, ratings in the area of validity will be:

Excellent: There are many lines of evidence presented that the instrument measures what is claimed and can be used for the purposes proposed.
Good: Several lines of evidence are presented and these provide convincing evidence.
Fair: At least one study was completed and this provides convincing evidence.
Poor: Evidence that is provided is not convincing.
Unknown: No evidence is provided.

Criterion 4: Help with Interpretation and Use

Ratings in this area are:

Excellent: There are norms that are based on a large, representative sample of an appropriate reference group of students, or there are other useful standards for comparison (e.g., performance of various groups or judgments of mastery); there is help in how to use the results in instruction; there is a discussion of the possible uses and misuses of results; there are good score reports and they serve the intended use.
Good: There are appropriate norms and/or other standards of comparison. There is discussion in at least one other area mentioned above.
Fair: There is good assistance in at least one of the areas mentioned above.
Poor: The assistance that is provided is judged seriously lacking.
Unknown: No information is provided.

PERFORMANCE TESTING AND STANDARDIZED TESTING: WHAT ARE THEIR PROPER PLACES?

John Framer Educational Testing Service

There has been a good deal of speculation about the likely impact of performance assessment on both testing and classroom practice. I share the view that there will be a significant influence. Our experiences with direct measurement of writing support this prediction. Whether the impact is positive or negative, though, will depend on the quality of the assessments and on how the results are used. It is likely that pressures will develop to teach the specific performance tasks, just as there are pressures to teach the specific items included on multiple-choice tests. We can think of some positive benefits of such trends; e.g., it would be a very good outcome if virtually all students could properly set up and write a simple business letter. Overall, though, such pressures seem sure to have more negative than positive impacts.

As part of my preparation for this session I came across this historical example of how advance knowledge of the content of tests might influence a teacher's behavior. This was written in 1926.

It is proposed that teachers participate in the construction of the tests. The teachers gain a great deal from this work, particularly in becoming better acquainted with the subject matter in the course of study... The objection has been raised that teachers may be dishonest and take advantage of their knowledge of test content to make a good showing. If such should happen to be the case, it should not be difficult to detect the dishonesty. The experience in about twenty school systems in New York State did not result in a single instance of a teacher's being accused of dishonesty in this respect. If the superintendent has succeeded in developing a sufficient degree of professional spirit, the possibility of such dishonesty will be negligible (Orleans & Sealy, p.61).

RELIABILITY

I will turn now to the issue of reliability, a critical aspect of any measurement worthy of the name. There are two important aspects of reliability for any kind of measurement: reliability of the actual scoring process, and reliability in the sense of comparability of scores.

Scoring - The reliability of actual scoring process


When you have good quality control, reliability of scoring of multiple-choice tests is extremely high. Thus reliability is seldom a factor when selecting among tests from commercial publishers, because results tend to be uniformly good. On the other hand, reliability of scoring in performance assessment is an important issue. It's often not high at all. It is essential to have clear criteria and careful training, and, even then, people vary considerably in their ability to score reliably. Also, tasks vary enormously in the extent to which they facilitate reliable scoring.
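As a purely illustrative aside (the scores below are invented and this is not a procedure from any particular program), one simple check on rater consistency after training is to compute exact and adjacent agreement between two raters who scored the same set of performances:

    # Sketch with invented data: exact and adjacent (within one point) agreement
    # between two raters scoring the same performances on a 1-6 scale.

    rater_a = [4, 3, 5, 2, 6, 4, 3, 5]
    rater_b = [4, 4, 5, 2, 5, 3, 3, 6]

    pairs = list(zip(rater_a, rater_b))
    exact = sum(a == b for a, b in pairs) / len(pairs)
    adjacent = sum(abs(a - b) <= 1 for a, b in pairs) / len(pairs)

    print(f"exact agreement: {exact:.2f}, adjacent agreement: {adjacent:.2f}")
    # exact agreement: 0.50, adjacent agreement: 1.00

More refined indices (such as coefficients from generalizability theory) would also account for chance agreement and for task sampling, but even this simple check makes rater variation visible.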

Score Reliability - Reliability in the sense of comparability of scores. When you look at reliabilities of various total test scores reported for publishers' tests - such as the ITBS, Stanford, and CAT - you see numbers that help you evaluate how likely it is that a student would have earned about the same score on another form of the same test. When tests are highly reliable, as is typically the case with standardized tests, it tends not to make much difference whether you take Form A or Form B. This is because standardized tests tend to be composed of pretested questions whose characteristics are therefore known.

Performance tests pose very substantial reliability problems. They are typically composed of relatively few questions; comparability of tasks from form to form is hard to obtain, even with pretesting; and, for most interesting applications, it is difficult to obtain high agreement among judges or raters.

VALIDITY

Some proponents of performance assessment argue that the "validity" of such measures is obviously superior to that of standardized tests.

Sometimes this is a case of using the word "valid" as a personal synonym for "good" or for "something I like," without any particular reference to the meanings that trained test makers and psychometricians typically ascribe to validity.

In other instances, advocates of performance assessment are arguing that such approaches provide more appropriate and meaningful information to evaluators and decision makers. This argument is indeed a claim of greater validity. To what extent is this claim justified?

A problem with some of the uninformed discussion of the merits of performance assessment is the substantial weight being given to the appearance of the testing situation.

If we want to know whether students understand how to measure something and we can see that they are actually creating and using a measuring device, then some would argue that this actual performance task is giving us better (i.e., more valid) assessment than any set of structured, machine-scannable questions. Is this necessarily true? While many people would say "yes," the answer is very likely to be "no."

Just as a paper-and-pencil test labeled "science" (or "math applications") may prove to be mostly a reading test for those with low reading skills, a science performance test may actually mostly measure student confidence or motor coordination or the ability to figure out exactly what would please the teacher.

When we set exercises for students, we are providing an opportunity for summary information to be recorded. The measurement occurs when the teacher or other observer places student responses on some type of scale, categorizes the responses, or decides whether they meet or do not meet a standard.


In this setting, we can easily find that we are differentiating among students on some basis other than what we think we are measuring.

In each assessment setting, we need to ask ourselves what our primary information need is. Often we want a status assessment now so we can track progress over time. In many, if not most, assessment situations, some combination of very clearly test-like exercises and performance tasks is likely to lead to the most accurate:
-- classifications
-- placements on meaningful scales
-- inferences about likely misunderstandings

In most instances, though, it is an inference from the observed sample that is critical.

When my two younger children tried out for Little League baseball, the test was:
-- two swings of the bat
-- two balls hit to you: one ground ball, one fly ball
-- two throws to first base
-- one run around the bases

That's what my kids did. What the coaches did was assign an overall summary rating of "promise as a Little Leaguer." In this setting, as in most assessment settings, we want to go beyond what we actually observe. We want to make inferences to a larger domain of behavior. This has enormous implications because, as compared to standardized testing, performance assessment typically involves:
-- a smaller sample of tasks
-- tasks that are more memorable than multiple-choice questions
-- tasks that are probably very logical teaching exercises

At a minimum, we need to evaluate how performance on the particular tasks we use compares with performance on other tasks that are viewed as equivalent. (This is an example of gathering evidence that a test is meeting its intended purpose.)
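A minimal sketch of one way to gather such evidence appears below; the scores are invented, and the approach simply correlates students' totals on two task sets built to the same specifications, then uses the Spearman-Brown formula to project how a longer task sample might behave:

```python
# Minimal sketch, with invented numbers: how consistently do two
# supposedly equivalent task sets rank-order the same students?
import statistics

def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def spearman_brown(r, k):
    """Projected reliability if the task sample were k times longer,
    assuming added tasks behave like the current ones."""
    return k * r / (1 + (k - 1) * r)

# Hypothetical total scores for 8 students on two "equivalent" task sets.
set_a = [12, 9, 15, 7, 11, 14, 8, 10]
set_b = [10, 9, 14, 8, 12, 13, 7, 11]

r = pearson_r(set_a, set_b)
print(f"Alternate-task-set correlation: {r:.2f}")
print(f"Projected reliability with twice as many tasks: {spearman_brown(r, 2):.2f}")
```

A low correlation here is exactly the small-sample-of-tasks problem noted above: individual performance tasks carry a great deal of task-specific variance.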

This particular aspect of establishing test validity has a very long history. Here is an example that goes back to 1922:

Suppose that we give a group of pupils a test in arithmetical problems, and then, without arousing the suspicion of the pupils, arrange the situation so that these same pupils will meet these same arithmetical problems in their play life on the street, and suppose that the test and the observations upon the pupils' success with the play problems are reliable measures of each of these abilities and suppose, finally, that the correlation between the test and the observations is of only average closeness, does this condemn the test as not being a measure of real ability? Assuming that proper experimental precautions have been taken, this correlation certainly tells us that the test problems are a rough but not an accurate measure of play problems. But before we condemn the test we ought to correlate the pupils' score on play problems with their scores on those same problems when shopping for their mothers or some other practical situation. It is not known, but it is very possible that the correlation between different real-life situations is no closer than between the test and any one of these situations. In sum, it is even probable that there is no such thing as real ability, in the sense that we are discussing it, but that there are instead, many abilities differing somewhat one from another. It is hopeless to expect to find a test which will closely correlate with each of these life situations, wrapped about, as each is, with its own individuality or specificness (McCall, p.209).

PRACTICALITY

Let's turn now to practicality, an issue that will be cited by anyone who has actually managed an assessment program containing performance tasks.


Costs of performance assessment in dollars or staff time and other practical and logistical issues are frequently cited as major barriers to broadened use of performance assessment.

The cost and practicality of all types of assessment have always been key factors in the design and development of assessment programs. In recent years there has been a significant shift in sentiment and behavior.

My own experience in helping plan testing programs has often proceeded as follows:
-- People typically start with very expansive ideas: "Let's not be limited by costs."
-- People then begin choking on costs - literally; never give performance assessment costs to people while they are eating.
-- But people now seem willing to consider assessment costs several times (possibly 5 to 10 times) greater than their previous budgets.

The range of cost of performance tests is enormous - from free to many hundreds of dollars per student, depending on the source and type of assessment and on what costs are actually counted, e.g., teacher time.

Proponents of performance assessment are asking what it is worth to have:
-- teachers willing to pay attention to the results of assessment
-- better information on which to base decisions about programs

The cost issue appears to be much more one of establishing a cost/benefit ratio than of simply reckoning up absolute costs.
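As a back-of-the-envelope illustration of that cost/benefit framing (every figure below is hypothetical), one can compare a census performance component with a sampled one:

```python
# Back-of-the-envelope cost comparison; all figures are hypothetical.
students = 20_000
cost_per_student_mc = 4.00      # machine-scored battery, per student
cost_per_student_perf = 25.00   # performance task scored by trained raters
sample_fraction = 0.10          # score a 10% sample for program-level reporting

mc_only = students * cost_per_student_mc
census_perf = students * cost_per_student_perf
sampled_perf = students * sample_fraction * cost_per_student_perf

print(f"Multiple-choice only:       ${mc_only:,.0f}")
print(f"Census performance add-on:  ${census_perf:,.0f}")
print(f"Sampled performance add-on: ${sampled_perf:,.0f}")
```

The question the proponents raise is whether the gap between the first two lines buys enough additional instructional value to be worth it; the third line is one way to capture much of that value for program decisions at a fraction of the cost.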

SUMMARY RECOMMENDATIONS

I would like now to present a set of summary recommendations.

1) Consider Impact on Practice - When you choose test content and item types, consider the likely impact of your selections on how teachers and students will prepare for the test.

2) Realistic Beginnings - Look for ways to add performance assessment components to existing programs but within the confines of attainable budgets.

3) Writing - If you do not have direct measurement of writing, this is a good place to start. Do not assume, however, that it will be easy or inexpensive to do well. Also use a combination of different types of measurement approaches.

4) Target Your Use of Methods - Use performance assessment components where they are most needed to measure what is valued. Performance assessment is needed less for reading and some aspects of social studies than for writing and some aspects of science. It is also needed less for mathematics concepts than for mathematics applications.

5) Consider Sampling Use - If you have the option of using performance assessment on a sampling basis, perhaps as a supplement to your every-pupil program, pursue this possibility. You can get many of the benefits of acceptance without all the attendant costs. For some purposes you may be able to test all students and score all responses locally and only some centrally.

6) Use as Supplement - Do not consider performance assessment as the basic method for an accountability assessment program but as a supplement to standardized measures.

7) Learn from others - Very good materials exist on setting up a writing assessment program, for example. Experienced veterans are available who can help you deal with all aspects of design and implementation. A large supply of very good prompts has already been developed. Less is available in other domains but a number of thoughtful people are trying to provide guidance, e.g., Clare Burstall, Rick Stiggins, Grant Wiggins, staff from Connecticut Department of Education, and others.


8) Work with a Veteran - Try to plan and carry out your program in collaboration with someone who has practical experience in bringing to life an assessment program with significant performance components.

9) Extensive Pilots - Include in your program planning considerable opportunity to pilot materials and methods before you have to make your first operational reports.

10) Weigh Positives and Negatives - Don't be discouraged by the negative aspects of performance assessment but don't ignore them either. In each proposed application, evaluate the costs and gains and make a reasonable decision. This should virtually always be some combination of measurement methods.

CLOSING COMMENTS

To me one of the most striking features of the current interest in performance assessment is the intense enthusiasm of a number of its advocates. In my own professional experience, perhaps the closest prior example of such missionary spirit was the criterion-referenced testing movement. I have seen some of this same spirit in people working on computer-based testing.

I thought it might be instructive to look at some of the relatively early books on standardized testing in the U.S. to try to pick up some of the flavor of the beginning of that movement. Here are a couple of examples.

Without objective information of the kind which is obtainable only from standardized tests, the guidance of... a student can rest upon little more than guess work. One is tempted to put it more strongly and to say that "educational guidance without educational testing is professional quackery," as much so as in the case of the physician who refuses to employ the approved laboratory techniques in the diagnosis and treatment of diseases (Ruch & Stoddard, p.xviii).

In other days there were ordeals by fire, by water, by battle, and by examinations in academic subjects. The tests by fire, water, and battle were subject to accidental conditions and even to manipulation, so that the will of heaven was not always accurately divined. To a higher degree the ends of justice and of education were furthered by examinations in academic subjects. But for students, too commonly, examinations remained ordeals. To take an examination was a highly chanceful proceeding. Success or failure might turn on the answer to a single question. There were freaks of memory; there was variability in the interpretation of questions and of answers. Much besides scholarship was incidentally tested. A good but nervous student might fail, while a more phlegmatic student of duller wit might pass. It is the purpose of new, carefully constructed, objective tests to do more than give marks that merely distinguish the elect. It is their purpose to reveal the instructional needs of the teacher as well as the educational needs of students; to lighten the burden of the teacher, and to give teacher and students assurance that a valid and just test and not an ordeal has been applied. How they do it is the message of this book (Orleans & Sealy, back of inside title page).

The enthusiasm for standardized tests reflected in these statements was very widely shared and helped lead to the development of:
-- a measurement profession
-- national testing programs
-- commercially available tests that are household words, e.g., the Iowas, the Stanfords, CTBS, SAT, ACT, DAT

Although there are critics who wish the situation were otherwise, it is hard to imagine our society without standardized tests.

We now have very strong enthusiasm for a quite different orientation toward testing. Can the energy that the idea of performance assessment has generated be used to create positive and lasting change? That is our challenge. Can we do as much as those who laid the foundation for standardized testing in the 1920's and 1930's or will we be ground up by the challenges? Will we collaborate productively or merely spend our time pointing out the shortcomings in the "other side's" methods or products? I believe I know whose responsibility it is to help find the proper places for performance testing and standardized testing. I think you know also. Is there a mirror handy?

REFERENCES

McCall, W.A. (1922). How to Measure in Education. New York, NY: Macmillan.

Orleans, J.S. & Sealy, G.A. (1928). Objective Tests. Yonkers-on-Hudson, NY: World Book.

Ruch, G.M. & Stoddard, G.D. (1927). Tests and Measurements in High School Instruction. (Editor's Introduction by Lewis M. Terman, p. xviii.) Yonkers-on-Hudson, NY: World Book.


RELIABILITY OF PERFORMANCE ASSESSMENT: LET'S MAKE SURE WE ACCOUNT FOR THE ERRORS

Michael Trevisan Multnomah County ESD

Performance assessments are becoming part of the achievement data routinely collected by school districts and state departments throughout the country. Because of the nature of the scoring, this type of assessment provides unique situations for the assessment or evaluation specialist charged with providing evidence for the dependability of the measures. Reliability and measurement error must be defined somewhat differently than is typical for machine-scored standardized achievement tests. A major source of error in performance assessments, for example, is differences due to rater judgment; therefore, a reliability coefficient must often take this error into account. This paper explores some of the issues regarding the estimation of reliability for performance assessments and provides appropriate methodology.

Concern for the reliability of data from performance assessments and the dearth of existing reliability information has been voiced (Rothman, 1990; Suen, 1991). Caution, therefore, has been recommended before wholesale acceptance of this type of assessment is given. The psychometric quality of large-scale performance assessments conducted at the district or