
DOCUMENT RESUME

ED 186 457 TM 800 146

AUTHOR        Walker, Clinton B.; And Others
TITLE         CSE Criterion-Referenced Test Handbook.
INSTITUTION   California Univ., Los Angeles. Center for the Study of Evaluation.
SPONS AGENCY  National Inst. of Education (DHEW), Washington, D.C.
PUB DATE      79
CONTRACT      400-76-0029
NOTE          266p.

EDRS PRICE    MF01/PC11 Plus Postage.
DESCRIPTORS   Achievement Tests; Annotated Bibliographies; Cognitive Objectives; *Criterion Referenced Tests; Elementary Secondary Education; *Evaluation Criteria; Resource Materials; *Test Reviews; *Test Selection
IDENTIFIERS   Test Bibliographies

ABSTRACT
The bulk of this document consists of reviews of over 60 criterion-referenced tests, most of which are used to test elementary or secondary-level achievement in the basic skills. For each test review, the following information is given: description of test, price, field test data, administration, scoring, and other comments. The tests are rated according to three categories of criteria: (1) conceptual validity--domain descriptions, agreement, and representativeness; (2) field test validity--sensitivity, item uniformity, divergent validity, lack of bias, and consistency of scores; and (3) appropriateness and usability--clarity of instructions, item review, visible characteristics, ease of responding, informativeness, curriculum cross-referencing, flexibility, alternate form availability, administration, scoring, recordkeeping, decision rules, and comparative or normative data. Guidelines on aspects of test selection are given: locating tests, comparing tests' technical and practical features, and comparing tests for their curricular relevance. Appendices list resources for developing or purchasing criterion-referenced tests, sources of other test reviews, definitions of terms, and available tests which were not reviewed. A subject index to the reviewed tests, a directory of publishers, and a sample of an exemplary domain description are also included. (GDC)

***********************************************************************
*   Reproductions supplied by EDRS are the best that can be made     *
*   from the original document.                                       *
***********************************************************************


U.S. DEPARTMENT OF HEALTH, EDUCATION & WELFARE
NATIONAL INSTITUTE OF EDUCATION

THIS DOCUMENT HAS BEEN REPRODUCED EXACTLY AS RECEIVED FROM THE PERSON OR ORGANIZATION ORIGINATING IT. POINTS OF VIEW OR OPINIONS STATED DO NOT NECESSARILY REPRESENT OFFICIAL NATIONAL INSTITUTE OF EDUCATION POSITION OR POLICY.


"PERMISSION TO REPRODUCE THIS MATERIAL HAS BEEN GRANTED BY

TO THE EDUCATIONAL RESOURCES INFORMATION CENTER (ERIC)."

CSE CRITERION-REFERENCED TEST HANDBOOK


CENTER FOR THE STUDY OF EVALUATION

UNIVERSITY OF CALIFORNIA, LOS ANGELES


CSE CRITERION-REFERENCED TEST HANDBOOK

CSE Test Evaluation Project

Project Staff:

Clinton B. Walker, Director
Margeret Dotseth
Russell Hunter
Karen Ogg Smith
Lynnette Kampe
Guy Strickland
Shonagh Neafsey
Christine Garvey
Michael Bastone
Elizabeth Weinberger
Kathi Yohn
Laura Spooner Smith

Center for the Study of Evaluation
University of California, Los Angeles

1979


The Center for the Study of Evaluation is an educational research and development center established in 1966 by the U.S. Department of Health, Education and Welfare. This book and much of CSE work is supported by a contract with the National Institute of Education.

The mission of the Center is to conduct programmatic inquiry into the nature of testing and evaluation in public education.


TABLE OF CONTENTS

Acknowledgments

Foreword vii

Chapter 1. Introduction 1

Chapter 2. Basic Concepts and Issues 3

Chapter 3. Introduction to the Test Reviews 11

Chapter 4. CSE Criterion-Referenced Test Reviews 29

Chapter 5. How To Select Tests: Locating Tests and Comparing Their Technical and Practical Features 155

Chapter 6. How To Select Tests: Comparing Tests for Their Relevance to a Given Curriculum 173

Appendix A. Resources for Developing CRTs Locally and for Purchasing Ready-to-Order CRTs 195

Appendix B. Sources of Other Test Reviews 203

Appendix C. Glossary 205

Appendix D. Supplement to Chapter 3: Example of a Domain Description Which Would Receive a Level A Rating 217

Appendix E. Available Tests That Were Screened Out of the Pool of Measures Reviewed in This Volume 221

Index A. Names of Reviewed Tests 225

Index B. Tests by Subject Matter 229

Index C. Publishers' Names and Addresses 239

References 243


ACKNOWLEDGMENTS

The authors are happy to acknowledge the many contributors to this volume. A draft of the "Introduction to the Test Reviews" (Chapter 3) was reviewed during its development by Jason Millman, Professor of Education at Cornell University; Jack C. Merwin, Dean of the College of Education at the University of Minnesota; Albert H. Rouse, Jr., Department of Research and Development, Cincinnati School District; and members of the professional staff of CTB/McGraw-Hill.

Albert H. Rouse, Jr. also suggested some of the basic ideas which were used to develop the procedure for finding the test with the greatest relevance to a given curriculum (Chapter 6). A draft of that procedure was reviewed by Doris Morton, Master Teacher at Hawaiian Avenue Elementary School in Los Angeles; and comments on it were received as well from James Cox, Evaluation Consultant with the Office of the Los Angeles Superintendent of Schools. Lynn Lyons Morris of the Senior Research Staff at CSE made extensive improvements in Chapter 6.

Earlier versions of the text were reviewed by Jason Millman; Carolyn Denham, Associate Professor of Education at California State University, Long Beach; Jeffrey S. Davies, Coordinator of Research, Evaluation, and Testing, Ventura (CA) Unified School District; and Joan Herman and Rand Wilcox of the Senior Research Staff here at CSE. Robert Stake, Professor of Education at the University of Illinois, made critical comments on parts of the text while he was a Visiting Scholar at CSE. Howard Sullivan, Professor of Education at Arizona State University, also gave us many useful comments on the text.

Comments on the text were also received from James Block, Professor of Education at the University of California, Santa Barbara; Thomas Haladyna, Associate Research Professor in the Oregon State System of Higher Education; Joan Bollenbacher, Director of Testing Services, Cincinnati Public Schools; and Thomas J. Riley, Director of Research and Evaluation, Fresno (CA) County Department of Education.


We are deeply grateful for the improvement which each of these reviewers has brought to this volume. The reader should note that the reviewers did not agree with everything herein and that, in places, we did not take their good advice.

Our appreciation goes also to other CSE staff members for their patience, insight, and support. Laura Spooner-Smith and James Burry helped with organization and editing, and Marlene Henerson did extensive final editing. Much of the subject index was done by Diane Ornstein and Laura Spooner-Smith. Correspondence with reviewers and publishers, as well as drafts of the volume, were ably typed by Phyllis Burroughs, Donna Cuvelier, Irene Chow, and Allison Hendrick. Donna Cuvelier has our special thanks for doing the layout, formatting, and typing of the final manuscript.

This project was supported by the National Institute of Education (NIE) under Contract No. 400-76-0029. However, the opinions and findings expressed here do not necessarily reflect the position or policy of NIE, and no official endorsement by NIE should be inferred.


FOREWORD

CSE Criterion-Referenced Test Handbook is the sixth in a series of test evaluation books prepared by the Center for the Study of Evaluation (CSE). CSE is a federally funded research and development center associated with the Graduate School of Education at the University of California, Los Angeles (UCLA). In 1970, CSE published the first book of test reviews, CSE Elementary School Test Evaluations. In that volume and subsequent volumes, standardized, norm-referenced tests designed for use in schools were reviewed and rated. The present volume is the first in the series to deal with criterion-referenced tests.

In deciding which tests to review for this volume, CSE staff proceeded in two stages. First, we conducted a wide-ranging search for likely tests; we then screened the resulting pool of measures. We examined the catalogs of hundreds of test publishers, bibliographies of tests (listed separately under References), and test lists compiled during previous CSE projects. A retrospective and ongoing search of the Educational Resources Information Center (ERIC) system was conducted using the following subject headings: criterion-referenced, mastery, objectives-based, domain-referenced, content-standard, and universe defined tests. The retrospective search covered the past ten years of Current Index to Journals in Education and Research in Education and the past five years of Psychological Abstracts.

All leads to possible CRTs were pursued by letters of inquiry and, for non-respondents, follow-up letters. The letters of inquiry requested information on tests which fit any of the descriptors used in the ERIC search, which were designed for any of grades K-12, and which were available to test users apart from instructional materials. We then ordered sample materials for each test and later in the course of the project checked with publishers to ensure that we had on hand the most current and complete information to support each test. Seventy-seven commercial publishers and ninety-two non-commercial test developers (mostly school systems) were contacted as possible sources of CRTs. The thoroughness of the search was cross-checked and confirmed by the responses of a national sample of 421 school district staff members who replied to a survey question on CRT use in their districts.

The variety of available measures required that rules be developed for screening tests for inclusion in this volume.


Screening rules were developed both on theoretical grounds and in response to the idiosyncrasies of the available measures. The first screen was availability: only tests that are readily available to general test users were included. About 80 locally developed tests were excluded when this rule was applied. During the course of the project, some tests were dropped from the list because they were removed from the market. Developmental or experimental versions of tests which were available in only a single copy were not included, since such tests sometimes do not go into production or, when they are produced, often appear in a form quite different from the developmental version. Also excluded were three tests of a publisher which required the prospective buyer to visit the sales location.

The next screen resulted from our working definition of the concept of CRT. Since a technically strict definition would have excluded all of the available measures, a less stringent definition was used. The need to acquaint test users with the current set of approximations to CRTs dictated using the following four-part definition:

The measure was originally designed to indicate an absolute rather than a normative level of learning.

The measure was built around explicit objectives.

The test items are keyed to these objectives.

Scores are provided for each objective.

The first part of the definition excluded tests originally developed as norm-referenced tests to which objectives were later added. Also excluded were tests of typical performance such as attitude tests. Measures which did not meet this or other screens discussed below are listed in Appendix E.

Among the tests that were readily available, only those that were not embedded in a special curriculum were reviewed. This rule excluded tests which are sold only with curricular materials or which, although sold separately, are keyed to the content and organization of one curriculum. This rule was adopted since such tests are acquired mainly as a result of a decision about teaching materials. Our system for evaluating CRTs, described in Chapter 3, may still be applied locally to such tests as a part of the process of choosing among curricular series.


Another class of readily available tests that were not reviewed were the customized or made-to-order CRTs which a few publishers offer. Test users with sufficient funds could probably hire any test publisher or consulting firm to create CRTs for a specific curriculum. A listing of publishers who offer this service routinely is given in Appendix A.

Some possible CRTs were excluded on other grounds. Tests that would have to be duplicated by a photocopying method were screened out. Of those, the materials that are uncopyrighted are listed in Appendix A as Resources for Developing CRTs Locally. Behavior checklists were excluded (e.g., Can the child tie his shoes? Can the child skip?). Measures of behaviors that are usually the result of maturation or general experience were also excluded. Finally, tests with only one item per objective were excluded on the ground that they were not serious attempts at criterion-referenced measurement. Two exceptions to this rule were made owing to the likely attention these tests will receive as a result of extensive publisher promotion.

While the acquisition and screening of tests were taking place, project staff developed a set of standards for evaluating CRTs. Although some possible test features are not relevant in all tests or for all test uses, the need for test users to be able to compare tests dictated the development of one evaluative scheme for use across tests. An initial pool of 70 test features was developed on the basis of a review of the professional literature and test publishers' promotional materials. This large number was reduced by several methods. First, some judgments were combined, for example, test-retest and alternate form reliability. Next, features which were more relevant to NRTs than to CRTs were eliminated. Finally, test features which could only be evaluated with respect to a local testing situation (e.g., estimated time for test administration) were treated as descriptive rather than evaluative information.

A draft of the evaluative system was reviewed externally by authorities in CRM and then tried out on a sample of tests. The final version of the system, given in Chapter 3, reflects these reviews as well as the input from a national survey conducted by CSE of 530 test users on school district staffs. The system was also sent for comment to test developers whose products were being screened for this volume. Only two of the test publishers replied.

Each test was evaluated independently by two members of the project staff. These staff members were beyond the M.A. level in education and had extensive experience on previous test evaluation projects, in test use, and in evaluation. All evaluations were reviewed by a third judge who adjudicated any differences between the original evaluations. The project director then checked and edited all the test evaluations. This process resulted in a rate of agreement in evaluative judgments of 88.5%.
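The 88.5% figure is a simple percent-agreement statistic. As a rough, hypothetical sketch of how such a figure is computed (the feature ratings below are invented, not the project's data), the calculation looks like this:

```python
# Percent agreement between two independent raters; ratings are hypothetical.
def percent_agreement(ratings_a, ratings_b):
    """Share of features on which two raters assigned the same letter rating."""
    if len(ratings_a) != len(ratings_b):
        raise ValueError("Raters must rate the same set of features")
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return 100.0 * matches / len(ratings_a)

rater_1 = ["A", "C", "B", "A", "C", "A", "B", "C"]
rater_2 = ["A", "C", "B", "C", "C", "A", "B", "C"]
print(f"Agreement: {percent_agreement(rater_1, rater_2):.1f}%")  # 87.5%
```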

For all of the tests that survived screening, the complete reviews were sent in March, 1978, to the test developers for comment. In some cases, the test developers provided information that persuaded us to change some aspect of our review. In other cases, we were not persuaded by the publishers' feedback, but we report it with the test review. In all, sixteen publishers replied and twelve did not.

In the course of searching for CRTs, we unearthed a variety of resources which are potentially useful to the readers of this volume. These resources are described in the appendices.

Before final editing, the text was reviewed by school and district level educators, university faculty in education, and evaluators at CSE who are noted in the Acknowledgments.


CHAPTER 1
Introduction

This chapter provides a summary of the book's contents and makes suggestions for using the various parts of the book according to the reader's specific needs.

Testing influences our lives in many ways. When we were children, our grades in school, course of study, access to higher education, and even self-image were determined in part by our performance on tests. As adults we relive many of those experiences through the children in our lives. With our taxes, we pay for the education of children; and when the test results of educational programs are made public, we are consumers of the scores.

A growing awareness of the impact of testing has caused educators and researchers to look more critically at existing measurement tools. In particular, standardized norm-referenced tests--their social fairness, sensitivity to students' learning, and relevance to instructional decision making--have come under attack, leading in the extreme case to calls for a moratorium on testing in schools.

Some critics of these tests, in their search for more constructive remedies, have turned to the technology of programmed instruction. A major component of programmed instruction is frequent testing of small units of study. This approach to testing is seen to hold promise for meeting some of the major objections to the conventional methods of measurement. Since the test items in the programmed materials use the concepts and content of instruction, they have diagnostic usefulness. Their sensitivity to learning of the materials seemingly reduces their sensitivity to students' social backgrounds. They are thus seen as less biased, more "culture fair."

Criterion-referenced testing1 (CRT) is partly an outgrowth of this technology. As educators have come to recognize that testing, evaluation, or indeed all of educational management should better support the continuing renewal of instruction, the appeal of instructionally relevant tests has grown. Major test publishers have developed and marketed criterion-referenced tests, and nearly half of the school districts in the United States now report using such measures.2 The CSE Criterion-Referenced Test Handbook was undertaken in response to these developments in educational measurement.

1 A glossary can be found in Appendix C.

2 Dotseth, et al., 1978.


The Contents and Uses of This Book

The Criterion-Referenced Test Handbook is a collection of resources for educators who develop testing programs, use tests, or need merely to stay informed about advances in educational measurement. The work for this book was driven by two beliefs:

That testing should support instruction as directly as possible, and

That source materials on testing should be easy for test users to apply.

This volume is meant to function as an introduction to criterion-referenced testing and as a guide for selecting tests. Readers who want an introduction to the basic concepts of CRT can start with Chapter 2, which contrasts CRT with the standardized, norm-referenced approach, and proceed to Chapter 3, which introduces the test reviews. A framework for evaluating criterion-referenced tests is given here which describes the importance of 21 test characteristics. Basic sources are listed in the References for those who would read further on the subject of criterion-referenced testing. A Glossary designed to explain basic evaluation and measurement concepts in a non-technical manner is provided in Appendix C.

To survey the nature and quality of available CRTs, readers may refer to the evaluative and descriptive reviews that make up Chapter 4. Test selection can also begin here. Identification of likely tests starts by referring to these reviews which are indexed at the back of the book by test name (Index A), test content (Index B), and publisher's name (Index C). Index C also includes publishers' addresses for ordering the current year's test catalogs while Appendix B lists sources of other test reviews.

Secondary sources, such as test reviews and publishers' catalogs, are not sufficient, however, to tell which of several seemingly appropriate tests is best for a particular pupil population, curriculum, and testing need. To make such a choice effectively, test buyers need to study the different tests directly. Chapters 5 and 6 give step-by-step procedures for comparing tests first hand. In addition to giving an overview of the process of test selection, Chapter 5 guides the reader in comparing tests' practical and technical merits for the given testing situation. The single most important feature of tests, their specific relevance to the local curriculum, is finally evaluated by methods which are detailed in Chapter 6.

The guidelines in these last two chapters have much broader application than just to the tests reviewed in Chapter 4. They can be applied to any achievement tests, CRT or NRT (norm-referenced test), reviewed or not reviewed.

If no suitable tests emerge from the steps in Chapters 5 and 6, or if the reader begins with the intent to develop criterion-referenced tests locally, the references to item banks and test development guides in Appendix A will be helpful.


CHAPTER 2
Basic Concepts and Issues

This chapter introduces criterion-referenced testing by comparing it with standardized, norm-referenced testing. The points of contrast are the form and meaning of test scores, the methods used in developing the test, and the optimal test uses. The importance of curricular relevance in testing is stressed. The chapter concludes with a discussion of issues in criterion-referenced testing.

The Form and Meaning of Test Scores

Criterion-referenced testing (CRT) is informally contrasted with norm-referenced testing (NRT) in these terms: criterion-referenced tests (CRTs) are said to show what a person knows or can do, while norm-referenced tests (NRTs) show where a person ranks in a group of test takers. CRTs indicate how completely the student has learned a skill or body of information, while NRTs show where the student stands in comparison with other students--that is, compared to a norm group.

This informal contrast captures an essential difference between NRT and CRT, namely, how the test scores are interpreted. Scores on CRTs are based on a scale of 0 to 100% correct and are indicators of the test taker's thoroughness or completeness of learning (or knowledge, skill, or competency) in the domain being tested. Thus, CRT scores are supposed to be directly meaningful in terms of the degree of learning which the individual test taker possesses. Scores of other test takers do not enter into the criterion-referenced meaning of test results.

NRTs, on the other hand, yield raw scores which are converted to percentiles, grade equivalents, stanines, or other numbers referring to where a score stands among the scores of other test takers. Norm-referenced meaning tells how well one student did in comparison with a norm group; criterion-referenced meaning tells how well a student did compared with how well it is possible to do.
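The contrast can be made concrete with a small, hypothetical sketch (the scores and norm group below are invented, not taken from any reviewed test): the same raw score receives both a criterion-referenced interpretation (percent of the domain's items answered correctly) and a norm-referenced interpretation (percentile rank in a norm group).

```python
# Hypothetical illustration: one raw score, two interpretations.

def percent_correct(raw_score, n_items):
    """Criterion-referenced meaning: proportion of the tested items answered correctly."""
    return 100.0 * raw_score / n_items

def percentile_rank(raw_score, norm_group_scores):
    """Norm-referenced meaning: percent of the norm group scoring below this raw score."""
    below = sum(s < raw_score for s in norm_group_scores)
    return 100.0 * below / len(norm_group_scores)

n_items = 20
raw = 17
norm_group = [8, 10, 11, 12, 13, 14, 15, 15, 16, 18]  # invented norm-group scores

print(f"Percent correct: {percent_correct(raw, n_items):.0f}%")        # 85%
print(f"Percentile rank: {percentile_rank(raw, norm_group):.0f}th")    # 90th
```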

The difference in meaning between the two types of test scores is like the difference between a runner's time for finishing a race--a criterion-referenced test--and that runner's place (first, second, etc.)--a norm-referenced test. The two types of results are meaningful, but they give different information. In some cases, the two types of results will give a different impression of the same performance. For example, in a field like physics or gymnastics, where most people have limited knowledge or skill, one might score in the very high percentiles on a test of the general population while being far from learned or skillful. Conversely, in a skill where many people are well trained, such as driving a car, a very proficient test performance might earn a norm-referenced score in only the middle percentiles of the general driving public.

Current writings on testing abound with problems of terminology. To begin with, dictionaries do not recognize the word reference as a verb. Authors use the terms criterion, criterion-referenced, domain-referenced, norm-referenced, and standardized differently. Worse still, these authors rarely make clear whether they mean to reflect common usage or to improve it.

The term criterion, for example, is often used to mean cut score or lowest acceptable score. In this context, a CRT is a test with such a cutting score, where results are treated in pass-fail terms. Elsewhere the term criterion is defined as the specific skill or knowledge being measured and is used interchangeably with the term domain. The domain/criterion can be viewed as the larger, perhaps unlimited, set of potential test items from which the actual test items are drawn. In this context, a CRT is a test that gives domain estimates; that is, a CRT estimates the proportion of the domain which the test taker knows or can do. In this book, the terms criterion and criterion-referenced have the latter meaning.

Some authors use the term standardized tests to refer to any ready-made (or off-the-shelf) published tests. Others use the term to mean norm-referenced tests (NRTs). The term standardized test will be used in this book to refer to NRTs.

Methods Used in Developing the Tests

In principle, there can be both norm- and criterion-referenced meaning for scores on a single set of test questions. A number of tests reviewed in this book give both norms and absolute scores. There is reason to believe, however, that a test which is most effective for rank-ordering students is less effective as a direct index of their proficiency, and vice versa. The methods used in test development determine whether a set of questions will function better to locate pupils on a scale of learning or on a scale of other test takers. The two main differences in test development are these: how fully the behaviors to be tested are described, and how items are chosen to be on the test.

First, NRTs are generally designed to measure such broad educational goals as "reading comprehension" or "word attack skills." Theoretically, the behaviors to be tested on a CRT are described in much greater detail. The specifications for a CRT are, in theory, detailed enough to describe the content and format of all possible items on the test, thus describing the scale of learning which the test measures. A test is effective in indicating the degree of learning (i.e., in functioning as a CRT) only to the extent that it provides a clear description of what is to be learned. In current practice, few of the tests which are sold as CRTs report using such detailed test specifications. As a group, the existing CRTs achieve a clearer description of the behaviors to be tested than NRTs typically do in that they break down the broad educational goals into more specific skills. For example, on a CRT, reading comprehension may be divided into literal and interpretive comprehension; then interpretive comprehension may be further divided into separate tests dealing with cause and effect, paraphrasing, real vs. make believe, fact vs. opinion, relevance of statements, stated vs. unstated assumptions, analogy, predictions, and so on.

A second difference in test development is the way in which items are screened for inclusion on the test. Since a CRT is supposed to reveal the thoroughness of learning, it performs that function best when the test items are a representative sample of the material to be learned. Test items are thus selected for a CRT on the basis of whether they are congruent with the test's specifications, that is, the detailed description of the test's content and format.

NRTs, on the other hand, are intended mainly to discriminate or rank individuals; and items which do this best are selected for inclusion. A test gives the most consistent ranking when it produces a wide range of scores. Test items which are very easy do not help to differentiate test takers because everyone tends to get them right. Similarly, very hard items do not discriminate among test takers, for everyone tends to miss them. In order to produce the greatest and most consistent differences among people's test scores, one selects items for an NRT so that about half of the test takers get each one right. If there are test questions on material which is widely taught and widely learned, pupils are likely to do well on those items, and the items are likely to be rejected for use on an NRT because they do not discriminate among test takers.
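The two item-screening rules described above can be sketched as follows; the item statistics are invented and the rules are deliberately simplified (classical difficulty near .50 for an NRT, congruence with the domain specification for a CRT).

```python
# Illustrative item-screening rules, with invented field-test statistics.
# Each item records the proportion of try-out examinees answering correctly
# (its "difficulty" or p-value) and whether it matches the domain specification.

items = [
    {"id": 1, "p_value": 0.95, "fits_domain_spec": True},   # widely learned skill
    {"id": 2, "p_value": 0.52, "fits_domain_spec": True},
    {"id": 3, "p_value": 0.48, "fits_domain_spec": False},  # discriminates, but off-domain
    {"id": 4, "p_value": 0.10, "fits_domain_spec": True},   # very hard
]

# NRT-style screen: keep items of middling difficulty, which spread scores out.
nrt_pool = [it["id"] for it in items if 0.30 <= it["p_value"] <= 0.70]

# CRT-style screen: keep items congruent with the domain description,
# regardless of how easy or hard they turned out to be.
crt_pool = [it["id"] for it in items if it["fits_domain_spec"]]

print("Items kept for an NRT:", nrt_pool)  # [2, 3]
print("Items kept for a CRT:", crt_pool)   # [1, 2, 4]
```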

In other words, a test which is built to give the most consistent ranking of students (an NRT) will likely not give credit for those aspects of teaching and learning that are generally successful. A test that is built to give a representative measure of how much was learned (a CRT) will give rankings of students that are less consistent, but it should show the results of instruction more readily. Thus the same set of test items may yield both criterion-referenced and norm-referenced meaning, but it will do one of those functions better than the other, depending on how the items are chosen.1

1 Hambleton, et al., 1978.

One other difference between CRTs and NRTs should be noted: a CRT will typically give more subscores than an NRT of equal length. Strictly speaking, every CRT objective for which a separate score is provided is a separate CRT. A test booklet, then, which covers several CRT objectives is really several short tests.

Use of Tests for Purposes of Instruction and Program Evaluation

The differences between CRTs and NRTs in meaning of scores and in test development imply different optimal uses for each. The fact that CRTs are built around specific instructional objectives makes them especially useful to support instruction. A teacher, school, or district can test the objectives in a CRT battery which are relevant to the local program and avoid testing irrelevant skills. Instructional planning for groups of students can then be based on the patterns of specific skills and needs indicated in the test scores. Individual students' strengths and needs can also be diagnosed at the level of teachable skills so that individualized assignments can be made. Similarly, CRTs may be used during the school year to see how well students are progressing in the skills of the local curriculum, so that students may be advanced or helped as needed. In short, the results of CRTs can be directly related to teaching and learning activities and are thus a resource for planning and managing instruction.

The potential for relevance in CRTs has an important effect. Students' scores on CRTs are more likely to reflect the positive achievements which do take place in class than are the scores of a broad survey test which is designed to relate loosely to many varied curricula. This quality of "sensitivity to instruction" is especially timely in an age of educational accountability, since it is important to show as much as possible of the real learning which occurs. In fact, giving teachers and pupils credit for their accomplishments can be seen as a heretofore very underrated purpose of testing.

NRTs, which are designed to differentiate individuals, are most effective for selecting a limited number of very high (or very low) scoring individuals out of a larger pool of available students. They are also capable of giving the most reliable comparison of scores with a national norm. The individual test user will have to decide on the relative importance of these uses of tests--instructional support, giving credit, selection, and comparison with the nation. Guidelines for choosing tests to meet specific needs are given in more detail in Chapter 5.

The use of tests, whether NRT or CRT, in evaluation needs to be placed in context. To many educators, evaluation equals testing.2 In practice, good evaluation involves a wide variety of management and research techniques aimed at studying the effort, impact, and efficiency of programs at the stages of their preparation, start-up, and ongoing conduct.3 A major purpose of evaluation is to provide decision makers with information they need to make social programs work well.

2 Lyon, et al., 1979.

3 Tripodi, et al., 1978.

In this context, testing is only one part of educational evaluation. Testing can be used at the start of a program as one of several methods for determining curricular needs. During a program, testing can be used as one of several methods for monitoring students' progress in learning so as to help maintain program strengths and modify weaknesses. After a program has been in operation for a reasonable length of time, testing can be used as one of several methods for determining the longer range achievement of the students.

Even with these multiple uses of testing for evaluation, three reservations should be noted. First, it is clear that much inappropriate testing has been done in the name of evaluation.4 Both the ease of gathering test data and the demand for an accounting of program funds have encouraged an exaggerated reliance on test scores. Second, neither high nor low test scores in themselves are sufficient evidence of program effectiveness. Effectiveness can be judged only in the context of a program's goals and implementation. It would be wrong, for example, to say that an instructional practice or curriculum was ineffective on the basis of low test scores unless it was shown also that the practice or curriculum was adequately put into operation.

4 Baker, 1978.

The third reservation has to do with the meaning of test results: test scores are not as pure and meaningful as they seem. For a given set of test results, their apparent meaning depends on how they are reported. This point is true both for CRTs5 and NRTs.6

Keeping a perspective on the place of testing in program evaluation, one can more sensibly approach the issue of CRTs vs. NRTs. For credibility in the eyes of whoever commissions an evaluation, a test often needs to have been well validated. For instructional usefulness, a test needs to have close relevance to the program curriculum. At present, standardized tests are generally better validated by field trials than CRTs. But since NRTs are meant to survey a variety of programs, their curricular relevance varies. Also, the methods of selecting questions for NRTs make the items unrepresentative of many curricula. The test user may thus be in the position of choosing the lesser of evils in deciding between tests with strong field test data and tests with curricular relevance. The following section argues that curricular relevance should not be sacrificed when choosing tests.

5 Barta, et al., 1976.

6 Linn and Slinde, 1977.

The Importance of Curricular Relevance in Tests

Whether tests are being sought to support instruction directly or to support program evaluation, the single most important feature to consider is the degree to which the objectives of a test match the test user's curriculum. A test may have high reliability, good norms, and other technical virtues; but if the objectives which it tests are not a fair sample of what is being taught, then the test is not a valid measure of that curriculum. Diagnostic tests, for example, give usable information only if the skills on the test are the ones to be covered by the local program. In program evaluation, it is hard to demonstrate the effects of a program by pupils' scores on a test which includes many skills that the program does not attempt to teach. Tests of skills not taught by the local program are at best measures of transfer and at worst measures of I.Q. or general cultural advantage. Low scores on such tests may reveal more about the inappropriateness of the measure than about students' real learning.

Several recent studies show the hazards of using a test that is not closely related to the local curriculum. One study7 demonstrated that the content of certain standardized tests is not very standard. The authors found that a sample of NRTs of reading achievement reflect the vocabulary of different basal reading series unequally. That is, a given NRT will give better scores for knowing the vocabulary of one reading series than for knowing the vocabulary of others. For the seven reading series examined in the study, the grade level equivalent score that could be earned by knowing the series' specific vocabulary frequently varied by more than one whole grade depending solely on which test was used, a finding that the authors refer to as "curricular bias in tests."

7 Jenkins and Pany, 1976.

A second study dealt with reading comprehension.8 The authors compared the coverage of sixteen separate comprehension skills by three basal reading series and by two widely used norm-referenced tests. In one reading series the proportion of exercises on literal and inferential comprehension was 83% and 17%, respectively, but for the other two series it was about 42% and 58%. Two types of comprehension skills--cloze sentences and words in context--were covered in one or more reading series, but were not included in either test. The cloze sentence exercises represented 24% of the comprehension skills in one reading series, 51% in the second series, and 28% in the third. The words-in-context represented 1% of the comprehension exercises, 1%, and 36%, respectively. Thus the tests failed to credit important parts of these reading programs; and the oversight was unequal across programs.

In a third study,9 the authors found that four widely used standardized tests of fourth grade mathematics differed markedly from one another in their modes of presenting information and in the nature of the numerical materials used. For example, the proportion of test items using graphs, tables, or figures varied from 15% on one test to 43% on another. The proportion of items using integers varied from 39% to 66% across tests.

8 Armbruster, et al., 1977.

9 Floden, et al., 1979.

In these studies, rather specific skills or aspects of test content were compared. A fourth, more comprehensive study10 compared tests' coverage of broad objectives for the entire reading and math domains. For this analysis the reading domain was divided into nine non-overlapping objectives and the mathematical domain into thirteen. Coverage of the reading objectives by eight popular NRT series and of the math objectives by seven of the same series was reported for each grade from 1 to 12. The overall trend in the mass of data was that tests differ consistently and widely in the extent to which they emphasize, or even include, the rather general objectives in the two domains.

10 Hoepfner, 1978.

For the purposes of this discussion, the relevant result of the above-mentioned studies is the extent to which the percentage of items per test that are devoted to a given skill actually varies from test to test. The median range in these percentages was 42% for the three most commonly tested reading skills (namely, recognizing meanings of words, literal comprehension, and interpretative comprehension). That is, the test that had the greatest percentage of its items devoted to any one of those skills typically had 42% more of its items measuring that skill than did the test with the smallest percentage of its items devoted to that skill. For the math domain the variation was not as extreme, but still the percentage of items within a test which measured a given objective differed by at least 10% from test to test in 68 out of a possible 156 cases.
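The statistic being reported here--the median, across skills, of the range in item percentages across tests--can be illustrated with invented figures (these numbers are not from the cited studies):

```python
# Invented percentages of items devoted to each skill on several tests;
# the figures only illustrate the computation, not the cited studies' data.
from statistics import median

coverage = {
    "word meanings":              [10, 35, 52, 48],   # percent of items, per test
    "literal comprehension":      [20, 30, 62, 45],
    "interpretive comprehension": [15, 25, 40, 57],
}

ranges = {skill: max(p) - min(p) for skill, p in coverage.items()}
print(ranges)                                   # each skill's range across tests
print("Median range:", median(ranges.values())) # 42 in this invented example
```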

The four studies cited were based on an analysis of materials only, not of students' performance on tests. One further study11 on the effectiveness of traditional and innovative curricula looked at the effects of test content bias on actual test scores. A secondary analysis of more than 20 published research reports led the authors to the conclusion that:

What these studies show, apparently, is not that the new curricula are uniformly superior to the old ones, though this may be true, but rather that different curricula are associated with different patterns of achievement. Furthermore, these different patterns of achievement seem generally to follow patterns apparent in the curricula. Students using each curriculum do better than their fellow students on tests which include items not covered at all in the other curriculum or given less emphasis there. (p. 97)

11 Walker and Schaffarzik, 1974.

The first four studies show that the content of standardized tests differs and that such tests differ in their correspondence with any given curriculum. The conclusion that such variation in test content could bias the outcome of evaluations, irrespective of students' actual achievement, is confirmed by the last study cited. Thus, if students' scores are affected not only by their actual achievement but also by the mere choice of test, it is essential for tests to be selected so as to maximize their relevance to the local curriculum.

Since curricula differ and since the objectives of ready-made CRTs are not all the same, curricular relevance may be as much a problem for CRTs as for NRTs. In contrast with NRTs, however, CRTs give a separate score for each objective, thus making it easier to distinguish students' performance on program-relevant and program-irrelevant objectives. In some cases, scores on program-irrelevant test items may even be used as a baseline or control measure with which to compare students' achievement on skills that were actually taught.12

Issues in Criterion-Referenced Testing

In the area of CRT, there are many issues on which there is not a consensus. A few of these issues are included here to point out places where the test user may have to make some hard choices. More importantly, this selection of issues is meant to ward off premature complacency about CRT. Just as there are many basic disagreements about standardized testing, much remains to be discovered or decided about criterion-referenced testing.

...Practical issues

Three of these issues are quite practical. First, how shall minimum levels of acceptable performance be set? Ultimately, the choice of a cutting score is determined by the choosers' values; hence it is arbitrary. But the issue remains as to how the arbitrary nature of this process can be made more rational and more politically acceptable. Some methods for setting cut scores are described in the how-to-do-it volume by Hambleton and Eignor.13

12 Walker, 1978.

13 Hambleton and Eignor, 1979.
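As a purely illustrative sketch of how one well-known judgmental procedure for setting a cut score works (an Angoff-style averaging of judges' item ratings; not necessarily one of the methods presented by Hambleton and Eignor), the computation can be outlined with invented estimates:

```python
# Angoff-style cut-score sketch with invented judge estimates.
# Each judge estimates, for every item, the probability that a minimally
# acceptable ("borderline") student would answer it correctly.

judge_estimates = [
    [0.8, 0.6, 0.9, 0.5, 0.7],   # judge 1, one estimate per item
    [0.7, 0.5, 0.8, 0.6, 0.6],   # judge 2
    [0.9, 0.6, 0.8, 0.5, 0.7],   # judge 3
]

n_items = len(judge_estimates[0])
# Expected number of items correct for a borderline student, per judge.
per_judge_cut = [sum(est) for est in judge_estimates]
# Averaging across judges gives a recommended raw cut score.
cut_score = sum(per_judge_cut) / len(per_judge_cut)

print(f"Recommended cut score: {cut_score:.1f} of {n_items} items "
      f"({100 * cut_score / n_items:.0f}% correct)")   # 3.4 of 5 items (68%)
```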

Page 21: DOCOMENT RESUME - ERIC · v. DOCOMENT RESUME ED 186 457 TN BOO 146. AUTHOR. walker, Clinton R: And Others. TITLE. CSE Criterion-Referenced Test Handbook. INSTITUTION California Univ.,

Second, how can test scores be reported in a way that is both meaningful and efficient for a CRT that has many separate objectives? For the individual student, test results may exist for 20 or 30 objectives in each of several subjects. Likewise, in program evaluation a large number of objectives and grades may be studied. The problem in both cases involves combining data into a usable, summary form while still conveying significant information. Barta, Ahn, and Gastright discuss several methods for dealing with this problem.14

14 Barta, et al., 1976.
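One simple way to compress objective-by-objective results, offered here only as a hypothetical illustration and not as one of the methods those authors discuss, is to report, for each objective, how many students reached a pre-set mastery level:

```python
# Hypothetical summary of objective-level CRT results for a class.
# scores[student][objective] = percent of that objective's items answered correctly.

scores = {
    "student_1": {"obj_1": 90, "obj_2": 60, "obj_3": 85},
    "student_2": {"obj_1": 70, "obj_2": 95, "obj_3": 80},
    "student_3": {"obj_1": 100, "obj_2": 55, "obj_3": 40},
}
MASTERY = 80  # hypothetical pre-set mastery level, in percent correct

objectives = sorted({obj for s in scores.values() for obj in s})
for obj in objectives:
    n_mastered = sum(s[obj] >= MASTERY for s in scores.values())
    print(f"{obj}: {n_mastered} of {len(scores)} students at or above {MASTERY}%")
```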

Finally, how shall teachers use test scores to make decisions about students? Will tests be a supplement to teachers' judgments about the students or a central tool for decision making? On the one hand, a teacher knows far more about a student than any test can measure. In such cases a test may reveal only what the teacher already knows. Is the test, then, a valuable confirmation of teacher judgment or a costly redundancy? On the other hand, students may have unsuspected needs or strengths that the results of a good test may bring to light. Also, at any given point in the course of instruction, a teacher may need to know which students have reached a pre-set mastery level. In these cases, tests may have a major influence on instructional decisions. Just how to combine test and non-test sources of information to inform the decision making process is a persistent, practical issue for teachers and for people who want teachers to use test scores.

...Theoretical issues

Since this volume is meant to be practical rather than theoretical, these issues will merely be mentioned. The first is whether CRT scores have construct meaning or only a work sample meaning. In the former case, a CRT score is viewed as measuring an attribute or mental process of the test taker. The different items need to measure the same thing in this case. In the latter case, different items on a test may measure different task components.

A second issue is whether criterion-referenced testing can be applied meaningfully only to achievement or whether CRTs can effectively measure students' attitudes as well. Many writers equate criterion-referenced testing with mastery testing, which excludes measurement of attitudes.

A third issue deals with the importance of field test data for validating CRTs. Some experts argue that a CRT needs only to have a representative sample of items from a well defined domain of behavior in order to be valid. Others hold that field trials are needed for CRTs in order to establish the traditional types of validity. For any CRT that purports to measure psychological traits or processes, including attitudes, validation by field test would obviously be essential. In Chapter 3, this issue receives further attention in the discussion of test characteristics that should be evaluated when selecting tests.


CHAPTER 3
Introduction to the Test Reviews

This chapter introduces the form and content of the test reviews that comprise Chapter 4. First the descriptive component of the reviews is explained. Next the system for the evaluative component is outlined in the form of 21 questions to ask when judging CRTs. Each of the 21 test features and its levels of quality are then explained in detail.

Each test review in Chapter 4 consists of two sections--a description of the test and an evaluation of 21 of its features. The assignment of test characteristics to the descriptive or evaluative category is based on the following rationale: Test features which are likely to affect the test's merit uniformly for most test users are assigned to the evaluative category. Test features which are likely to have very different importance for different test users--cost of the test or format of test items, for example--are assigned to the descriptive category. The intended use of a testing system for such purposes as diagnosis, progress monitoring, program evaluation, and the like, is also described rather than evaluated. Descriptive characteristics affect a test's suitability for the individual user, but such information needs to be evaluated by each user according to local needs and resources.

THE DESCRIPTIVE SECTION OF THE TEST REVIEWS

The descriptive section of each review mentions the intended grades, number of levels, content, intended use, number of objectives, and number of items per objective. The availability of alternate forms is reported here. For any test where pupils do not respond on paper, that fact is noted. When the publisher offers supporting materials in addition to the basic test booklet, such as diagnostic and prescriptive aids, these materials are mentioned.

In the descriptive section, the word levels refers to levels of difficulty for which separate test forms are provided. Two testing systems may be designed for grades 2 through 7, one having separate test booklets for three broad levels and the other for six narrower ones. Test content is described in terms of broad subject labels such as reading: word attack, or math: geometry. Where the publishers have provided such labels, we have used theirs, modifying them only as needed for general familiarity. The reader may locate tests by subject headings in Index B.

Price information is reported in per-pupil terms for tests, answer sheets, and any other major components, for the smallest quantity in which they are available. Note that prices may decline as the size of purchase goes up and that prices change fairly often. The date of the price information is given, but the currency of the costs should be checked before making a purchase choice. Publishers readily provide current catalogs and ordering information. Addresses are given in Index C for the publishers whose tests are reviewed in Chapter 4.

Field test data, if given by the publisher, are described next. The size and composition of pupil populations tested and type of data reported are noted. Details of test administration, such as estimated testing time, special equipment needed, and the need for trained administrators are reported where relevant.

Descriptive information on scoring is given in terms of costs and types of scoring offered. Price information here is also quite changeable. A descriptive category called Comments is included for any additional information which does not readily fit in the other categories.


THE EVALUATIVE SECTION OF THE TEST REVIEWS

Each test is evaluated according to 21 dimensions or test features. These 21 features, summarized in question form in Table 1, pages 14-15, fall into three categories:

Measurement properties (features determining whether the test was constructed according to sound principles of educational measurement).

Appropriateness for the intended examinees (features determining the suitability of the test for the intended students).

Usability (features determining the ease with which the test can be administered, scored, and interpreted).

A fourth and critical category--relevance to the test user's curriculum--is not treated here, since the determination of such relevance can only be done with a detailed description of a specific curriculum in hand. Chapter 6 gives assistance in attending to this fourth area of concern.

In reviewing tests, one might compare them on a very large number of features. A recent national sample of school district staff specialists in curriculum, counseling, and testing rated 20 different test characteristics to be very important or crucial in picking tests.1 Even a set of 39 characteristics used earlier by CSE2 is far from complete. A variety of systems for rating tests are used by the books of test reviews listed in Appendix B. Also, a number of other authors3 have developed guidelines for comparing tests systematically. Since many test features are treated descriptively in this volume, the CSE system for evaluating CRTs looks at only the 21 test features presented in Table 1.

1 Dotseth, et al., 1978.

2 Hoepfner, et al., 1976.

Each question in Table 1 is accompanied by a brief summary of the levels of quality (or standards) which comprise the ratings. Either two or three levels of merit on each feature were used, depending in part on how many different degrees of quality were discernible. Levels were also chosen to try to discriminate among tests, even though this practice resulted in setting the cutting point for a maximum rating at a low level of quality for a few features. The reader should not infer that CSE advocates low standards in tests, but rather should understand that the standards were chosen to try to differentiate tests.

Why should a test user even consider using a test which does not consistently meet high standards of technical merit? Because one other characteristic--relevance to the local curriculum--is more important. Ideally a test buyer would be able to choose from a pool of tests one that is technically sound as well as closely related to the objectives of the local program. When this is not possible, curricular relevance is the less expendable of those two qualities. In this vein, Cronbach4 has said that precision in test scores is useless if the skills measured by the test are not relevant to the intended decisions.

3 Cronbach (1970: 186-192), Katz (1973), Popham (1978, Chapter 8), and Hambleton and Eignor (1978).

4 Cronbach, 1970: 152.

Ratings of test features in Chapter 4 are expressed in terms of letter grades. Letters are used instead of numbers to encourage test users to weigh the features according to the users' own needs rather than to add the ratings mechanically. Methods for weighing and combining such ratings for the purpose of comparing tests are described in Chapter 5. The letters A, B, and C are used, with A and C being assigned for test features that are divided into only two levels of merit.
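To make the idea of user-chosen weights concrete, here is a minimal sketch in Python of one way a test user might weight and combine letter ratings when comparing two tests. The letter-to-number mapping, the weights, and the sample ratings are illustrative assumptions for this sketch only, not part of the CSE procedure described in Chapter 5.

    # Illustrative only: one way a user might weight letter ratings.
    LETTER_VALUE = {"A": 2, "B": 1, "C": 0}   # assumed numeric mapping

    def weighted_score(ratings, weights):
        """Combine letter ratings using the user's own importance weights."""
        return sum(LETTER_VALUE[r] * weights.get(feature, 1.0)
                   for feature, r in ratings.items())

    # Hypothetical ratings on three of the 21 features for two tests.
    test_x = {"Domain Descriptions": "A", "Consistency": "C", "Scoring": "B"}
    test_y = {"Domain Descriptions": "B", "Consistency": "A", "Scoring": "A"}

    # A user who cares most about domain descriptions might weight it 3x.
    weights = {"Domain Descriptions": 3.0, "Consistency": 1.0, "Scoring": 1.0}

    print("Test X:", weighted_score(test_x, weights))
    print("Test Y:", weighted_score(test_y, weights))

The point of the sketch is only that the weighting is the user's decision; two users with different priorities can reach different rankings from the same letter grades.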

In the remainder of Chapter 3, the importance of each of the 21 test features is explained, and levels of merit (standards) are described in greater detail. Casual readers may attend to Table 1 and skip this more technical and detailed explanation of the evaluative criteria. Readers who are involved in selecting tests will profit from the detailed presentation.

NOTE: The information in Table 1 is provided on the inside back cover of this handbook for the convenience of the reader who wishes to refer to it while examining a test review.

TABLE 1
Key to the Evaluative Sections of CSE Test Reviews*

MEASUREMENT PROPERTIES: CONCEPTUAL VALIDITY

1. Domain Descriptions. How good (i.e., thorough and comprehensive) are the descriptions of the objectives or domains to be tested?
   A. Very good (objectives are thoroughly described)
   B. Adequate (objectives are stated behaviorally but not in detail)
   C. Poor (objectives are loosely described and subject to various interpretations)

2. Agreement. How well do the test items match their objectives?
   A. The match is confirmed by sound evidence
   C. Data are not provided or are not persuasive

3. Representativeness. How adequately do the items sample their objectives?
   A. Items are representative of domains
   C. Item selection is either unrepresentative or unreported

MEASUREMENT PROPERTIES: FIELD TEST VALIDITY

4. Sensitivity. Does conventional instruction lead to test-score gains?
   A. Test scores reflect instruction
   C. Data are not provided or are not persuasive

5. Item Uniformity. How similar are the scores on the different items for an objective?
   A. Some evidence of item uniformity is provided
   C. No data are provided

6. Divergent Validity. Are the scores for each objective relatively uninfluenced by other skills?
   A. Independence of skills is confirmed
   C. Data are not provided or are not persuasive

7. Lack of Bias. Are test scores unfairly affected by social group factors?
   A. Persuasive evidence of lack of bias is offered for at least two groups (e.g., women, specific ethnic groups)
   C. Data are not provided or are not persuasive

8. Consistency of Scores. Are scores on individual objectives consistent over time or over parallel test forms?
   A. Consistency of scores for objectives is shown over parallel forms or repeated testing
   C. Data are not provided

APPROPRIATENESS AND USABILITY

9. Clarity of Instructions. How clear and complete are the instructions to students?
   A. Instructions are clear, complete, and include sample items
   B. Either instructions or sample items are lacking
   C. Both are lacking

10. Item Review. Does the publisher report that items were either logically reviewed or field tested for quality?
   A. Yes
   C. No

*This system for evaluating CRTs is explained in detail in the text. For test features where only two levels of quality are distinguished, the letters A and C are used to indicate the levels.

TABLE 1 (continued)

11. Visible Characteristics. Are the layout and print easily readable?
   A. Print and layout are readable for more than 90% of objectives
   C. At least 10% of objectives have problems in readability

12. Ease of Responding. Is the format for recording answers appropriate for the intended students?
   A. Responding is easy for more than 90% of the objectives
   C. Lack of clarity, crowding, etc., make responding difficult in at least 10% of objectives

13. Informativeness. Does the test buyer have adequate information about the test before buying it?
   A. Yes
   C. No

14. Curriculum Cross-Referencing. Are the test objectives indexed to at least two series of relevant teaching materials?
   A. Yes
   C. No

15. Flexibility. Are many of the objectives tested at more than one level, and are single objectives easy to test separately?
   A. Objectives are varied, carry over across test levels, and are easy to test separately
   B. One feature is missing from variety, carry over, or separability
   C. Two or three of the features are missing

16. Alternate Forms. Are parallel forms available for each test?
   A. Yes
   C. No

17. Test Administration. Are the directions to the examiner clear, complete, and easy to use?
   A. Directions are clear, complete, and easy to use
   C. One or more of the above features are missing

18. Scoring. Are both machine scoring and easy hand scoring available?
   A. Yes
   B. Easy, objective hand scoring is available, but no machine scoring
   C. Hand scoring is not easy or objective; or only machine scoring is offered

19. Record Keeping. Does the publisher provide record forms that are keyed to test objectives and are easy to use?
   A. Yes
   C. They are not included or not keyed to test objectives

20. Decision Rules. Are well justified, easy-to-use rules given for making instructional decisions on the basis of test results?
   A. Yes
   C. Decision rules either are not given, not easy to use, or not justified

21. Comparative Data. Are scores of a representative reference group of students given for comparing with scores of pupils in the test user's program?
   A. National norms, criterion group data, or item difficulty values are provided
   C. These are not provided or are not clearly representative

MEASUREMENT PROPERTIES: CONCEPTUAL VALIDITY

A test score is not an end in itself; it is a sign or indicator of something more important. A score may give a prediction about the pupil's future performance, or it may give an estimate of how the test taker is likely to perform on a larger set of possible items from which the test items are drawn. In the latter case, that pool of possible test items is called the domain (or criterion). A pupil's score on a CRT thus gives an estimate of how the pupil is likely to perform with respect to the population of all such items.

One essential step in making the scores of a CRT meaningful is to describe the criterion pool of items clearly. First, a clear description enables teachers to teach the skill or attitude that is described. By providing a practical target for instruction, the description makes the score on such a test useful for diagnosis and prescription. Second, a clear test description can help to demystify testing by telling consumers of test results just what was tested. The description thus gives meaning to the score. In most of the tests covered in this book, the descriptions of the criterion behaviors take the form of instructional objectives. When "a test" is referred to, it means a group of items that provides a separate score. One CRT test form may thus contain several tests.

Since the items of a CRT are supposed to test the skill or attitude as set forth in the description of the criterion, the validity of a CRT depends on the extent to which the items actually fit the test description. This type of validity is often referred to as content validity, but that term is too narrow. Popham5 has suggested the phrase descriptive validity so as to apply not only to CRTs in the cognitive domain but also to those in the psychomotor and affective domains, where process or action may be more relevant than content. A description of the criterion that clearly specifies what should and should not be included is an essential link in determining whether a CRT has this type of validity.

1. Domain (or Criterion) Descriptions

Background

For test buyers, thorough domain descriptions have practical potential. A test user could compare the curricular relevance of two or more CRT batteries by seeing how well their domain descriptions match the local program, rather than by having to examine the test items directly. A full CRT description will consist of a set of instructions to the test writer that prescribes the content, format, and mode of responding for all of the possible test items. Directions for making up multiple choice options, for scoring free responses, and for sampling items from the criterion item pool will also be given. Much of this information goes beyond subject matter content.

It is obvious that detailed domain descriptions are technical documents, too lengthy and detailed in their entirety to be efficient either for planning instruction or for reporting grades. But the detailed descriptions can include brief statements for teachers and parents in a form like behavioral objectives.

5 Popham, 1978.


Levels of Quality

Level A.6 Content, format, response mode, and sampling rules are described thoroughly enough so that (a) different test writers should produce equivalent tests by following the description, or (b) for any test item or set of items, it is clear whether they fall inside or outside the intended domain. The names of three types of test description that are most adequate are item forms, amplified objectives, and domain specifications.

Level B. Content, format, and response mode are described, as in a behavioral objective. Rules for sampling items are not given, or there is so much slack in the limits of content, format, or response mode that differing tasks could still fit the description. Tests based on such descriptions are objectives-based.

Level C. The test is described in terms that give little indication of the content, format, and response mode of the test items. General skill category labels, such as reading comprehension, word attack skills, or basic mathematical operations, are at this vague level of description. Many different types of test items will fit a description as general as this one. Since these descriptions give little indication of what the criterion behaviors are, tests with such descriptions are scarcely criterion referenced.

6 Appendix D has an example of a domain description which would receive a level "A" rating.

2. Agreement of Items with their Test Descriptions

Background

The domain descriptions of feature #1 above are a test maker's intentions for constructing tests. It is still necessary to show that the intentions were carried out. Features #2 and #3 deal with this issue. Feature #2 asks whether the test items are accurately described by the test description. If they are not, then the items test something else and the test is invalid. Technical terms that are used to refer to the concept of agreement include item-objective congruence, content validity, and descriptive validity.

Levels of Quality

Level A. Sound evidence of agreement is offered and described in enough detail to evaluate. The test developer gives a detailed account of either how the items were generated from the description of the criterion behaviors or how qualified judges confirmed the fit of the individual items to the description.

Level C. No evidence of agreement is offered; or evidence is mentioned but not described in enough detail to evaluate; or evidence is described in detail but is flawed.

3. Representativeness of the Items

Background

Rarely is a test score of interest for its own sake. Test scores are used as observable indicators of more important things that are difficult or impossible to observe directly. For example, students' scores on any achievement test are used to indicate their mastery of a total set of possible questions on the subject matter. It is rarely possible to test the total set. Likewise, a person's performance on a test of intelligence or personality is used as an indicator of how the person will act in more natural situations. For a test score to be an accurate indicator, the test items must not be chosen in a biased manner. In other words, the items must be chosen in a way that allows for generalization from the test score to the intended total set of behaviors. If the selection process is biased, unplanned, or unrepresentative, then the total set of behaviors that the test score represents cannot be determined.

Levels of Quality

Level A. The test developer reports that the items were selected either randomly from the set of questions possible under this objective or, if there are components in the domain, by stratified random sampling.

Level C. No account is given of how the test questions were chosen from the set of questions possible under this objective; or items were selected in a biased or unrepresentative fashion. Items are not representative if the item selection process systematically excluded those that failed to discriminate high and low scoring individuals in a group of students who have a common instructional background.
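As a concrete illustration of the kind of item selection Level A calls for, here is a minimal sketch in Python of stratified random sampling from a domain with several components. The domain, component names, and item counts are invented for the example; a real test developer would sample from an actual criterion item pool.

    import random

    # Hypothetical item pool for one objective, organized by component
    # (e.g., categories of consonants in a phonics domain).
    domain = {
        "stops":      [f"stop_item_{i}" for i in range(40)],
        "liquids":    [f"liquid_item_{i}" for i in range(25)],
        "fricatives": [f"fricative_item_{i}" for i in range(35)],
    }

    def stratified_sample(pool, per_component, seed=0):
        """Draw the same number of items at random from each component."""
        rng = random.Random(seed)
        test_items = []
        for component, items in pool.items():
            test_items.extend(rng.sample(items, per_component))
        return test_items

    print(stratified_sample(domain, per_component=3))

Because every component contributes items and the draw within each component is random, a score on the resulting test can be generalized to the whole domain rather than to an unknown subset of it.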

MEASUREMENT PROPERTIES: FIELD TEST VALIDITY

Authorities in the field of CRT agree that conceptual validity is necessary for a good criterion-referenced test. They do not agree, however, on the necessity for empirical (data-based) validation of CRTs. CSE takes the position that the two types of validity are interdependent; both are necessary for confirming that a test measures what it claims to. Without validation by field trials, a test that appears to be conceptually sound may give measures that are not consistent (test feature #8), that do not reflect the relevant learning (#4), that are of an unintended mixture of behaviors (#5), that are affected by skills or attitudes other than the intended one (#6), and that are biased (#7). Without meeting the standards for Conceptual Validity, on the other hand, a test may be an unrepresentative measure (#3) of the wrong criterion (#2) or of no identified criterion at all (#1).

4. Sensitivity to Learning

Background

Students' scores on a test may or may not reflect their actual learning of the skills which the test purports to measure. To the extent that the scores do, the test is said to be sensitive to learning. This feature for judging the merits of tests is not universally accepted, in part because it is usually called sensitivity to instruction. The objection is that a test may not show any effects of instruction because the given instruction did not have any effect. Thus, when a small sample of students in a field test does not appear on a posttest to have benefited from instruction, that result is not necessarily the fault of the test.

The objection is well taken as far as it goes. However, consumers of tests need to know that the test does reflect positive effects of instruction in a fair proportion of classrooms. If it does not, either the test is insensitive or the test content is not teachable by current methods. In either case, such a test will not be useful.

Demonstration of sensitivity to learning under one form of instruction or with one type of pupil will not guarantee sensitivity to all forms of instruction or for all types of pupils. The test developer should describe the type(s) of instruction and pupil used in the field tests so that test buyers can decide if the test is likely to be sensitive in their own setting.

There are serious technical problems in measuring change, and there is not yet a consensus on how to prove a test's sensitivity. This test feature was evaluated here simply by asking: Does the test developer offer any evidence of a test's sensitivity that is free from the well established problems in measurement (e.g., unreliability, the effect of experiences outside the school)? Such data must be provided for each separately scored skill.
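The simplest kind of evidence a field test might supply is pretest and posttest performance on a single objective for pupils who received an ordinary course of instruction. Here is a minimal sketch in Python; the percent-correct scores are invented, and a real sensitivity study would also have to address the measurement problems just noted.

    # Invented pre/post percent-correct scores for one objective,
    # one entry per pupil, after an ordinary course of instruction.
    pre  = [30, 45, 20, 50, 40, 35, 25, 55]
    post = [70, 80, 45, 85, 75, 60, 50, 90]

    gains = [b - a for a, b in zip(pre, post)]
    mean_gain = sum(gains) / len(gains)

    # A crude screen: did most pupils gain, and is the average gain sizable?
    print(f"mean gain = {mean_gain:.1f} points")
    print(f"pupils who gained: {sum(g > 0 for g in gains)} of {len(gains)}")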

Levels of Quality

Level A. The test has been found to reflect learning in a representative sample of students following an ordinary (in terms of time, intensity, and resources usually available for the particular subject) course of instruction. The course of instruction is clearly aimed at the criterion behaviors. The well established problems in measurement are not present in the study.

Level C. No information is given on the sensitivity of the test to student gains; or evidence suggests that the test suffers from well established problems in measurement7; or the gains cited are not statistically dependable; or the successful teaching method was not described.

5. Item Uniformity

Background

This feature deals with whether a test (i.e., each separately scored set of items) measures a uniform, coherent skill or attitude. If the test does not, then it measures a mixture of things. A CRT that is a uniform measure is a better test, with the following exception. In some cases, the definition of the criterion behaviors identifies different components or levels of difficulty. For example, a phonics test might deal with consonants, the differing categories of consonants (e.g., stops, liquids, nasals, fricatives) being identified as components of that phonic skill. Such a test should show uniformity within each category, but not necessarily within the whole test of several categories. When such a test measures a mixture of things, the mix is planned. An accidental lack of uniformity results when the items unintentionally call for different skills or attitudes. It is a sign that the description of the criterion is defective, for the test does not measure what it purports to measure.

Uniformity or coherence of a CRT is shown by measures of the extent to which all items for a given skill function alike. The more a student's score on one test item is similar to his scores on the other items, the more uniformity the test has. In classical test-score theory, factor analysis, inter-item correlations, and part-whole correlations give measures of uniformity.

7 Campbell and Stanley, 1963.
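A minimal sketch in Python of one such index: the correlation of each item with the total score on the remaining items for the same objective (a part-whole correlation with the item removed). The right/wrong responses below are invented for the example.

    from statistics import correlation  # Python 3.10 or later

    # Invented responses of six pupils to the three items scored
    # under a single objective (1 = correct, 0 = incorrect).
    responses = [
        [1, 1, 1],
        [1, 1, 0],
        [0, 0, 0],
        [1, 1, 1],
        [0, 1, 0],
        [1, 0, 1],
    ]

    n_items = len(responses[0])
    for i in range(n_items):
        item_scores = [row[i] for row in responses]
        rest_scores = [sum(row) - row[i] for row in responses]  # item removed
        r = correlation(item_scores, rest_scores)
        print(f"item {i + 1}: item-rest correlation = {r:.2f}")

Items whose correlations are markedly lower than the others' are candidates for measuring something other than the intended skill.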

Levels of Quality

Level A. At this early stage in the development of CRTs, any numerical evidence of item uniformity will be accepted if it is reported for groups of items testing single objectives. The data must be based on students' responses to the test items.

Level C. No numerical evidence of item uniformity is given at the level of the individual objective, or only judgmental evidence is given.

6. Divergent Validity

Background

This feature deals with whether the scores on a test are relatively uninfluenced by achievements or attitudes that the test is not supposed to be measuring.8 If the scores on the test are relatively uninfluenced by other, unintended factors, then it is a test of something distinct and has divergent validity. For example, the more that scores on a test of reading comprehension are influenced by general knowledge, apart from the examinees' understanding of the test, the less the divergent validity of the test. For a math test to have divergent validity, its language must be simple enough so that pupil errors are not reading errors.

Divergent validity, or separateness, can be confirmed by traditional methods--factor analysis, correlation studies among measures of separate behaviors--or by experimental evidence that scores on a test respond to a relevant treatment while scores on certain other tests do not.

8 Campbell and Fiske, 1959.

Levels of Quality

Level A. Evidence of divergent validity is given showing the CRT's scores to be independent of scores on tests of other supposedly unrelated achievements or attitudes.

Level C. No evidence of divergence is offered; or the evidence is not detailed enough to judge; or there is evidence of contamination (for example, high correlations of CRT scores with I.Q. scores or scores of verbal aptitude).

7. Lack of Bias

Background

This feature is concerned with how different groups of students--for example, different ethnic groups--perform on a test. It does not deal with the surface content of test questions. Bias has been common enough in testing so that it is unwise to assume that it is absent from current tests. Hence a demonstration of lack of bias is required to confirm a test's validity for major social groups.

A test is biased for a given group of students if it does not permit them to demonstrate their skills or attitudes as completely as it permits other groups to do so. Such a test is invalid for that group. The subject of bias is surrounded with controversy, in part because social injustice for large numbers of students can result from biased tests.



Levels of Quality

Level A. Evidence of lack of bias is offered for at least two of the following groups: women, Blacks, and students from Spanish speaking backgrounds. Lack of sizable item-by-group interactions is one form of evidence. A second is similarity across groups of the other data for empirical validity (features #4-8).

Level C. No evidence of lack of bias is offered, or evidence is offered but not persuasive. A difference in the average scores of ethnic or other groups by itself will not be considered evidence of bias.
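As a rough illustration of what an item-by-group interaction screen looks like, here is a minimal sketch in Python: it compares each item's difficulty (proportion correct) across two groups after allowing for the overall difference between the groups. The responses, group labels, and flagging threshold are invented, and this kind of screen is only a starting point, not a substitute for a proper bias study.

    # Invented right/wrong responses (rows = pupils, columns = items)
    # and group membership labels. Illustration only.
    responses = [
        [1, 1, 1], [1, 1, 0], [1, 0, 1], [1, 1, 1],   # group 1
        [1, 1, 0], [1, 0, 0], [0, 1, 0], [1, 1, 0],   # group 2
    ]
    groups = [1, 1, 1, 1, 2, 2, 2, 2]

    def difficulties(group):
        """Proportion of the group answering each item correctly."""
        rows = [r for r, g in zip(responses, groups) if g == group]
        return [sum(col) / len(rows) for col in zip(*rows)]

    p1, p2 = difficulties(1), difficulties(2)
    overall_gap = sum(p1) / len(p1) - sum(p2) / len(p2)

    for i, (a, b) in enumerate(zip(p1, p2), start=1):
        interaction = (a - b) - overall_gap   # item gap beyond the overall gap
        flag = "  <- examine for possible bias" if abs(interaction) > 0.3 else ""
        print(f"item {i}: gap beyond overall = {interaction:+.2f}{flag}")

An item that is much harder for one group than the overall group difference would predict deserves a closer look; as the text notes, a simple difference in average scores is not by itself evidence of bias.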

8. Consistency of Scores

Background

A test is consistent if the difference in a student's scores on two occasions is due to a real change in achievement or, for affective measures, in attitude. If a student's scores change as a result of the vagueness of the instructions, variations in testing conditions, or other factors aside from real learning, then the test's scores are not consistent. Changes in scores due to irrelevant factors make the scores of any one occasion suspect. The more that a test's scores reflect real learning, and not irrelevant factors, the more consistent it is.

Consistency measures used with norm-referenced tests include estimates of test-retest reliability and alternate form reliability. The traditional reliability estimates often are not suitable for CRTs, and thus the use of the broader term consistency. When CRTs are to be used in a pass/fail fashion, consistency should be shown for the pass/fail judgments.

Consistency data are necessary to show that a test's scores are dependable, but not many such studies have yet been done on CRTs. In principle, consistency may vary over a wide range; but current CRTs differ more on whether they report consistency data at all than on the values reported. At this point, the reporting of any such data is seen as a positive step in test development and a step toward truth in packaging.
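A minimal sketch in Python of the two kinds of evidence credited under Level A below: the correlation of scores across two testings and, for pass/fail use, the proportion of pupils classified the same way both times. The scores and the cutting score are invented for the example.

    from statistics import correlation  # Python 3.10 or later

    # Invented scores of eight pupils on one objective, tested twice
    # (or tested once on each of two parallel forms).
    first  = [5, 2, 4, 1, 5, 3, 4, 2]
    second = [4, 2, 5, 1, 5, 2, 4, 3]
    cut = 3   # assumed mastery cutoff: 3 or more items correct = pass

    print("score consistency (r):", round(correlation(first, second), 2))

    same_decision = sum((a >= cut) == (b >= cut) for a, b in zip(first, second))
    print("consistent pass/fail decisions:", same_decision, "of", len(first))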

Levels of Quality

Level A. Data are reported on the consistency of students' scores. Either consistency of individuals' scores over repeated testing or consistency of individuals' scores on different forms of the test will be credited.

Level C. No consistency data are given.



APPROPRIATENESS AND USABILITY

The effects of features #9-12 would show up in the validity and consistency data for a test. Because little information is yet available on the measurement properties of CRTs, and because features #9-12 may cause problems in giving a test, they are treated separately here.

9. Clarity of Instructions to Students

Background

The instructions to students must describe all aspects of the task in language that is suited to the intended age or grade levels. Sample items that are both typical and clear should be given both for practice and clarification.

Levels of Quality

Level A. The instructions to the intended test takers are clear and complete, and a sample item is provided.

Level B. Either the instructions are not appropriate or the sample items are lacking.

Level C. The language of the instructions is too advanced or otherwise inappropriate for the intended group; or instructions are incomplete or hard to follow; or a sample item is not given.

10. Item Review

Background

Test items are appropriate if they are understandable, have at least one correct answer, give credit for all correct answers, do not give away the correct answer, and are otherwise free from technical flaws. Two kinds of evidence are considered here--namely, test developers' reports that the items were logically (i.e., judgmentally) reviewed, or that they were reviewed through field testing.

Levels of Quality

Level A. The test developer reports that item quality was reviewed independently of item writing.

Level C. The test developer offers no evidence that item quality was checked apart from the process of original item generation.

11. Visible Characteristics of Test Materials

Background

The visible characteristics of test materials should make it easy for students at the intended levels to use the materials. Tests were examined for the details of layout, organization, and clarity mentioned under Level C below.

Levels of Quality

Level A. More than 90% of the objectives are free of the flaws listed under Level C.

Level C. At least 10% of the objectives have one or more of these flaws: print or pictures are unclear, items are too close together, stems and responses are not clearly grouped, sequence of items is easy to lose, there is little blank space for math work, the page is cluttered, item numbers are not easy to pick out, information needed to answer a question is unnecessarily spread out.

12. Ease of Responding

Background

A test should be formatted so that students' scores are not affected by difficulties in recording their answers. Answer sheets or other spaces for responding are judged not only for the attributes mentioned in feature #11, but also for the amount of space provided for answers.

Levels of Quality

Level A. More than 90% of the objectives are free of the response material flaws described under Level C.

Level C. Answer sheets or other response sheets for at least 10% of the objectives have one or more of these flaws: print is unclear, items are too close together, item numbers are not easy to pick out, answer spaces are too small.

13. Informativeness of Materials for Prospective Buyers

Background

Some publishers make it easier for a prospective buyer to decide whether or not to purchase a test by providing complete, easy-to-use information on the materials. There are usually two stages in test purchase: ordering sample materials and ordering the testing package itself. Since the presence and quality of technical information on test development is covered in features #2-8, it will not be counted here.

The issue here is whether the prospective buyer knows what the testing package will consist of before investing in it. This feature is more important in weighing the more costly CRT systems, where the prospective user will be less willing to buy the system without first having an opportunity to examine it.

Levels of Quality

Level A. Either the whole system is available on approval, or the following possibilities are available to the prospective purchaser as part of the publisher's promotional effort: specimen sets of sample pages and instruction can be obtained; test copies can be purchased in any quantity; a complete listing of the test's objectives is provided before purchase; information on ordering of original and replacement materials is clear and complete; replacement materials may be ordered separately; information on returning unused materials is clear; pertinent information is available without buying the testing package--the instructions to students and test users, the physical characteristics of the test, instructions and materials for recording answers, the number of separate test forms available, decision standards and comparative data, time required for testing, the training needed to give or interpret the test.

Level C. Promotional materials do not give enough information for deciding whether to order specimen sets in one or more of the respects listed under Level A.

14. Curriculum Cross-Referencing

Background

A testing package is easier to coordinate with local curriculum and instruction if it includes an index relating its objectives to specific teaching materials. Such an index can be used to guide test selection and to help teachers locate alternative instructional materials. For either purpose the user will have to verify that the indexed instructional materials adequately cover the same skills as the test and the local curriculum.

Levels of Quality

Level A. Indexing of the test's objectives to two or more publishers' teaching materials is provided in detail (e.g., specific units in specific texts).

Level C. No curriculum cross-referencing is provided.

15. Flexibility in Choosing Objectives

Background

A test or testing system is adaptable to a range of local needs--for example, individualization--if it covers a variety of objectives and tests them on separate forms. A testing system which combines a fixed set of objectives does not give the user as much control over testing. Flexibility is also provided by a system which gives tests of the same core objectives at more than one test level. Such a system does not have the same test items on forms that differ only in the level marking, but has tests of the same skills with content and illustrations suited to the different levels.

Note that it is not fair to compare large-scale testing systems with smaller ones that do not try to cover the same range of skills or grades. If the user is looking for a highly specific test, this criterion may not be relevant. Also, the cost of such flexibility will be an important consideration to the test buyer.

Levels of Quality

Level A. All of the following features are present: variety in the objectives, separate forms, and grade level flexibility of core objectives.

Level B. One of the features mentioned under Level A is missing.

Level C. The test or system provides a narrow range of objectives and prints several of them together on the same test form. Core objectives are available in materials appropriate to only one grade level.

16. Alternate Forms

Background

When a testing system has alternate forms, the user can give independent retests to the same students. If retesting is done with the same form that was used for the original test, students' scores are likely to be influenced not only by their learning of the subject matter but also by specific memory of the first testing. This latter influence invalidates the retest scores. With alternate forms, pre- and post-testing or repeated posttesting can be done without this invalidating carryover effect.

Levels of Quality

Level A. Two or more forms with non-overlapping sets of items are available for each test.

Level C. Only one form is available for each test.


17. Test Administration

Background

A test is more practical if the instructions to the examiner are clear, complete, and well organized. With good instructions, the testing is not only easier, but the testing conditions are also more uniform.

Levels of Quality

Level A. Instructions leave little room for misunderstanding by the examiner and are complete and easy to use.

Level C. Instructions to the examiner are hard to find or follow. They are vague, ambiguous, not complete, not all in one place, or not logically ordered. Or, the copy in the manual is unclear.

18. Scoring

Background

A test is more practical if it can be scored easily and objectively and if the test user is not limited to one method of scoring. Hand scoring is easy if scoring templates or other well organized keys are provided.

Levels of Quality

Level A. Both machine and easy, objective hand scoring options are available.

Level B. Only hand scoring is available, but it is objective and easy.

Level C. Hand scoring is difficult, or arbitrary, or requires special training. Or, scoring requires the expense of special machines on site or the delay of sending students' responses out for scoring.

19. Record Keeping

Background

Good records of student performance are an important part of classroom management and of meeting accountability requirements. CRT systems have the potential for making record keeping burdensome because they often have large numbers of objectives. A testing system is more practical when it has forms for recording students' test scores that are easily keyed to the objectives, easy to maintain, and easy to interpret.

Levels of Quality

Level A. Usable forms for record keeping are provided.

Level C. Either teachers must create their own record forms, or the testing system's forms are not easily keyed to the objectives, easy to maintain, or easy to interpret.

20. Decision Rules

Background

Tests may be used to make decisions about students. Tests should be constructed in a way that allows decisions to be made with confidence and ease. The information for decision making should be easy to find, easy to use, and well justified. Although the choice of cutting scores for passing or mastery should be left to the local test user, the publisher should give an indication of the consequences of choosing different cutoffs.

Relative costs and gains will affect the choice of a cutting score. Where a prerequisite skill is being tested, it may be preferable to hold back a few students who have actually mastered it in order to avoid passing ones who have not. In other cases, holding students back may be more costly than advancing students before they attain mastery.

One aspect of test design that affects the decision rules has not been covered previously--namely, the number of items per test or per objective. For several reasons it is important to have more than just a few items per objective. First, there must be enough items so that occasional misreading of questions by students will not result in unwarranted failures. Second, there must be enough items so that chance effects, like guessing, do not result in unwarranted passing. Ideally there will be enough test questions so that three levels of attainment can be identified: clear pass, clear fail, and an area of uncertainty. Finally, a sufficient number of items on a test is a protection against misjudging individual students' scores in case students share an occasional answer.
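To illustrate the point about chance effects, here is a minimal sketch in Python computing the probability that a student who only guesses would nevertheless reach the cutting score. The number of items, number of answer choices, and cutting scores are assumed values chosen for the example.

    from math import comb

    def p_pass_by_guessing(n_items, n_choices, cut):
        """Probability of reaching the cutting score by blind guessing alone."""
        p = 1 / n_choices
        return sum(comb(n_items, k) * p**k * (1 - p)**(n_items - k)
                   for k in range(cut, n_items + 1))

    # Assumed example: 4-choice items, pass = all but one item correct.
    for n in (3, 5, 10):
        print(f"{n} items, cut {n - 1}: "
              f"P(pass by guessing) = {p_pass_by_guessing(n, 4, n - 1):.3f}")

With only three 4-choice items, a guesser passes by chance more than 15 percent of the time; with ten items the chance is negligible, which is one reason longer per-objective tests support more trustworthy decisions.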

Since statistics for CRTs are still largely under development, including the statistics for decision making, only two general levels of merit are applied here.

Levels of Quality

Level A. Decision rules that are easy to find and use are provided along with a rationale for their use.

Level C. Decision rules are not provided, or they are provided without justification, or they are hard to find or use.

21. Comparative Data

Background

Authorities disagree on whether the intent of criterion-referenced testing is undermined by providing comparative (that is, norm-referenced) interpretations of CRTs. But test scores are not easy to interpret, and the more information that can be provided about them, the easier it is to understand and explain them to others. Thus CRTs that offer both absolute and relative interpretations of scores are seen as more practical than ones that have only the former.

Test users should recall that NRTs are designed to provide stable rankings of students in that they consist of test items that spread out the scores of test takers. A well designed CRT (features #1-3) is likely to provide less stable rankings because items are sampled to be representative of the skills or attitudes, and because the number of items on a test (i.e., on a single objective) is likely to be smaller.9

Note that comparative data need not be percentile norms. Average percent correct could be given for various reference groups. Grade level equivalents are not acceptable owing to their many problems.10

9 Hambleton, et al., 1978.

10 APA, 1974; Tallmadge and Horst, 1976; Linn and Slinde, 1977.

Levels of Quality

Level A. Acceptable comparative data are based on the responses of at least several hundred students in a nationally representative sample. Percentile norms, data for well identified reference groups, or summaries of performance of students in the target grades are suitable. These data are easy to find and interpret in the user's manuals.

Level C. Comparative data are either not provided, or consist only of grade level equivalents, or are based on the responses of a small or unrepresentative sample of students. Or these data are hard to find and interpret.

CHAPTER 4
CSE Criterion-Referenced Test Reviews

A key summarizing the rating system used in the following test evaluations can be found on the inside back cover of this handbook for the convenience of the reader who wishes to refer to it while examining a test review.


ANALYSIS OF SKILLS (ASK) - Language Arts
Scholastic Testing Service, 1975

DESCRIPTION

The ASK-Language Arts is a six-level battery of tests designed for diagnosis and prescription in grades 2-8. It covers a total of 73 objectives in the following broad areas: capitalization and punctuation, usage, sentence knowledge, and elements of composition. There are 36 to 58 objectives per level, each with three multiple choice items. Items within each objective are "spiraled" so that the easier ones come earlier in the test form, the harder ones in the latter part of the test.

PRICES

Two systems for ordering materials and services are available, a conventional one and a "lease-score" system. For materials that are to be retained, the reusable test booklets are 62¢ each in packages of 20; answer sheets, 17¢ in sets of 50; examiners' manuals are 70¢; and a general manual is $2.00. Individual student record forms, which are included in the basic scoring service, are 12¢ each in sets of 20 when purchased separately.

Under the "lease-score" system, the purchaser returns all test materials to the publisher and pays only the costs of data processing, reporting, and transportation. Specimen sets are $2.00. Date of prices: 1978.

FIELD TEST DATA

The developer has unpublished data showing that each level of ASK-Language Arts was field tested on a median of 145 pupils. The data are mostly in the form of numbers and percents of pupils picking each response choice. The field test is not described.

ADMINISTRATION

The ASK tests are designed for group administration. Tests are untimed, but the publisher recommends scheduling two sessions for any one level and estimates the total testing time as 1 minute per item, i.e., 108 to 174 minutes for Levels 2-3 to 7-8.

SCORING

The basic leasing service costs $1.10 per pupil and includes three class record sheets, one set of individual student record forms and labels, one interpretative brochure per class, and a content outline "per level per class." The basic scoring cost for those who purchase materials is 78¢ per pupil and includes everything except test booklets and answer sheets.

ANALYSIS OF SKILLS (ASK) - Language Arts
Scholastic Testing Service, 1975

MEASUREMENT PROPERTIES

AOC 1. Description.
A 2. Agreement. No data.
A 3. Representativeness. No data.
A © 4. Sensitivity. No data.
A © 5. Item Uniformity. No data.
A © 6. Divergent Validity. No data.
A © 7. Bias. No data.
A 8. Consistency. No data.

APPROPRIATENESS AND USABILITY

CIB C 9. Instructions.
A 10. Item Review. It is not clear from the unpublished field test data that they were used for item revision.
C 11. Visibility.
C 12. Responding.
0 C 13. Informativeness.
A © 14. Curriculum Cross-Referencing.
A(E)C 15. Flexibility. Core objectives are covered, each at several levels. The intentional "spiraling" of items within objectives makes single objectives hard to test separately.
A 16. Alternate Forms.
() C 17. Administration.
A BC 18. Scoring. Hand scoring is difficult. Machine scoring is offered.
CI C 19. Record Keeping.
A 20. Decision Rules.
A 21. Comparative Data. When the scoring service is used, local and national norms for the content areas and for the total scores may be printed on the class record sheet by pupil. The composition of the national norm group is not described.

ANALYSIS OF SKILLS (ASK) - Mathematics
Scholastic Testing Service, 1974

DESCRIPTION

ASK-Math is a seven-level battery of diagnostic tests for pupils in grades 1-8 covering these categories of skills: computation, concepts and problem solving, and applications. There are 44 to 58 objectives per level with three multiple choice items per objective. Items within each objective are "spiraled" so that the easier ones come earlier in the test form, the harder ones in the latter part of the test.

PRICES

Two systems for ordering materials and services are available, a conventional purchase system and a "lease-score" system. For materials that are to be retained, reusable test booklets are 62¢ per pupil in packages of 20 and answer sheets 17¢ in sets of 50. The examiner's manuals, one for Level 1-2 and one for Levels 2-8, are 70¢, and a general manual is $2.00. Individual student record forms, which are included in the basic scoring service, are 12¢ each in sets of 20 when purchased separately.

Under the "lease-score" system, the purchaser returns all materials to the publisher and pays only the costs of processing, reporting, and transportation. Specimen sets are $2.00. Date of information: 1978.

FIELD TEST DATA

The developer has unpublished data showing that each level of this battery was normed on a median of over 1100 pupils. The field test is not described.

ADMINISTRATION

The ASK-Math tests are made for group administration. The tests are untimed, but the publisher estimates the total testing time to be 180 minutes for Level 1-2 and 130 minutes for each other level.

SCORING

The publisher does not recommend hand scoring. The lease-scoring service, at $1.43 for Level 1-2 and $1.10 for the other levels, provides individual pupil folders with score labels, an interpretive brochure, and objectives-based and normative scores for individuals and for the group.

ANALYSIS OF SKILLS (ASK) - Mathematics
Scholastic Testing Service, 1974

MEASUREMENT PROPERTIES

AC)C 1. Description.
A 2. Agreement. No data.
A © 3. Representativeness. No data.
A © 4. Sensitivity. No data.
A () 5. Item Uniformity. No data.
A © 6. Divergent Validity. No data.
A © 7. Bias. No data.
A © 8. Consistency. No data.

APPROPRIATENESS AND USABILITY

(DB C 9. Instructions. Instructions for Level 1-2 appear too complicated, but other levels do not.
A © 10. Item Review. It is not clear from the unpublished field test data that they were used for item revision.
A ED 11. Visibility. At Level 1-2, there is a problem of crowding and print size. Other levels appear satisfactory.
® C 12. Responding.
A ED 13. Informativeness. Prices and contents are not clearly laid out in the catalog.
A © 14. Curriculum Cross-Referencing.
A BC) 15. Flexibility. Although core objectives are covered at several levels, about one-third of the items are repeated across levels. The intentional "spiraling" of items within objectives makes single objectives hard to test separately.
A © 16. Alternate Forms.
c 17. Administration.
A BC 18. Scoring. The publisher describes hand scoring as difficult. Machine scoring is offered.
0 C 19. Record Keeping.
A ED 20. Decision Rules.
A ED 21. Comparative Data. Normative scores are offered, but the composition of the norm group is not described.

ANALYSIS OF SKILLS (ASK) - Reading
Scholastic Testing Service, 1974

DESCRIPTION

The ASK-Reading is a four-level battery of tests for pupils in grades 1-8 which cover the following major skill areas: word analysis, comprehension, and study skills. There are 43 to 48 objectives per level with three multiple choice items per objective.

PRICES

Two systems for ordering materials and services are available, a conventional purchase system and a "lease-score" system. For materials to be retained, reusable test booklets are 62¢ per pupil in packages of 20 and answer sheets 17¢ in sets of 50. Examiner's manuals are 70¢ and a general manual for reading and math is $2.00. Individual pupil record forms, which are included in the basic scoring service, are 12¢ each in sets of 20 when purchased separately.

Under the "lease-score" system, the purchaser returns all test materials to the publisher and pays only the costs of processing, reporting, and transportation. Specimen sets are $2.00. Date of information: 1978.

FIELD TEST DATA

The developer has unpublished data showing that each level of ASK-Reading was field tested on a median of about 200 pupils. The data are mostly in the form of number and percent of pupils picking each response choice. The field test is not described.

ADMINISTRATION

The ASK-Reading tests are made for group administration. The tests are untimed, but the publisher estimates the total testing time to be between 2 and 2½ hours per level.

SCORING

The publisher does not recommend hand scoring, but answer keys are provided in the manuals of directions for the lowest level of test, Level 1-2. The basic scoring service is $1.43 per pupil for Level 1-2 and $1.10 for the other levels.

ANALYSIS OF SKILLS (ASK) - Reading
Scholastic Testing Service, 1974

MEASUREMENT PROPERTIES

A0C 1. Description.
A 2. Agreement. No data.
A () 3. Representativeness. No data.
A () 4. Sensitivity. No data.
A () 5. Item Uniformity. No data.
A () 6. Divergent Validity. No data.
A 7. Bias. No data.
A C) 8. Consistency. No data.

APPROPRIATENESS AND USABILITY

(DB C 9. Instructions. Instructions for Level 1-2 appear too complicated, but the other levels do not.
A () 10. Item Review. It is not clear from the unpublished field test data that they were used for item revision.
0 C 11. Visibility.
0 C 12. Responding.
(A) C 13. Informativeness.
A () 14. Curriculum Cross-Referencing.
A(B)C 15. Flexibility. Overlap of objectives across levels is provided, but all items for a level are on one test form.
A 16. Alternate Forms.
0 C 17. Administration.
A 8© 18. Scoring. Hand scoring is discouraged. Machine scoring is available.
0 C 19. Record Keeping.
A 20. Decision Rules. Decision rules are given, but with little support.
A © 21. Comparative Data. When the scoring service is used, local and national norms are provided. Composition of the national norm group is not described.

BASIC ARITHMETIC SKILL EVALUATION (BASE) AND BASE II
Imperial International Learning Corporation, 1972

DESCRIPTION

BASE and BASE II are two diagnostic and prescriptive systems for math in grades 1-6 and 7-8 respectively. BASE has six levels of 16 to 23 objectives each and covers the following skill areas: numeration and operations with whole numbers, fractions, money, measurement, geometry, story problems, decimals, and percents. BASE II measures objectives in operations with integers, fractions, decimals and percents, and story problems. For both batteries, there are three multiple choice items per objective. Reference guides to prescriptive materials are a part of the system. The cards for posttesting individual pupils are a type of alternate form.

PRICES

The BASE system for grades 1-6 sells for $229 and includes for each grade level a cassette tape of instructions, a reference guide, consumable tests for 30 pupils, 30 student profile sheets, and a set of posttest cards. The price per grade level is $39.50. With BASE II, which costs $54.00 separately, the complete system is $269.00. The cost for replacing tests and profile cards for 30 pupils is $19.50 for each primary grade and $21.50 for BASE II. Date of information: 1978.

FIELD TEST DATA

Field testing of BASE II is mentioned but not described.

ADMINISTRATION

The BASE system is designed for group administration or self-administration, both with the aid of tape recorded instructions. Each level of the tests for grades 1-6 is estimated to take 1-1½ hours. Three class periods are suggested for giving BASE II.

SCORING

The carbonized answer sheets are self-scoring.

BASIC ARITHMETIC SKILL EVALUATION (BASE) AND BASE II
Imperial International Learning Corporation, 1972

MEASUREMENT PROPERTIES

A(E)C 1. Description. No data.
A 2. Agreement. No data.
A 3. Representativeness. No data.
A 4. Sensitivity. No data.
A 5. Item Uniformity. No data.
A 6. Divergent Validity. No data.
A 0 7. Bias. No data.
A 8. Consistency. No data.

APPROPRIATENESS AND USABILITY

(DB C 9. Instructions.
A () 10. Item Review.
C 11. Visibility.
0 12. Responding.
A C 13. Informativeness. The rating here will depend on the contents of the specimen set: Does it include a copy of the objectives?
0 C 14. Curriculum Cross-Referencing.
AGC 15. Flexibility. There is good carry over of objectives from level to level, but separate testing of objectives is not easy.
A 16. Alternate Forms. The posttest cards do not make group posttesting practical, although they may be considered an alternate form.
0 C 17. Administration.
AODC 18. Scoring. The print on the self-scoring duplicates is often very faint.
C 19. Record Keeping.
A 0 20. Decision Rules.
A © 21. Comparative Data.

BASIC WORD VOCABULARY TEST by Dr. Harold J. Dupuy
Dreier Educational Systems

DESCRIPTION

The Basic Word Vocabulary Test is a 123-item multiple choice test designed for test takers from 4th grade through Ph.D. level. Item stems are root words or simple phrases with a root word underlined. These words were selected from a 1% sample of words that are common to four major unabridged dictionaries. Eliminated from the sample were foreign, slang, archaic, and technical words. Response choices are single words or short phrases. Items are arranged in order of increasing difficulty.

PRICES

A package of 40 test booklets, the examiner's manual, and scoring key sells for $4.95. The specimen set, at $2.95, contains an examiner's manual and sample test for this and each of four other tests by Dreier. Date of information: 1978.

FIELD TEST DATA

A developmental pretest was done on 148 people ranging in age from 11 to 61. After revision, the final form was administered to 3,100 students in grades 1 through 12 in the public schools of Fairfax, Virginia. The examiner's manual gives percentiles and grade level equivalents for grades 3 through 12, and percentiles for college and graduate students. These latter scores were derived by extrapolation, since the norming population went only through the 12th grade. IQ-like scores based on vocabulary alone, called the Vocabulary Development Quotient, are also given. Raw scores provide estimates of pupils' mastery of the 12,300 word "basic vocabulary." The detailed technical manual, DHEW Publication No. (HRA)74-1334, is reprinted in ERIC as ED 094 373.

ADMINISTRATION

The BWVT can be used as a group test. Pupils read the test words to themselves and stop where indicated on the test form (e.g., 4th graders after 68 words). Estimated testing time is 20 minutes or less.

SCORING

Tests are scored by hand with an overlay or by machine. The user is invited to write Dreier for information on machine scoring.

BASIC WORD VOCABULARY TEST by Dr. Harold J. Dupuy
Dreier Educational Systems

MEASUREMENT PROPERTIES

OB C 1. Description. The criterion pool of words is described, and rules are given for generating distractors and correct answers.*
A q) 2. Agreement. Although the technical manual does not report that item-domain agreement was verified independently, the careful domain description makes it likely that agreement
3. Representativeness.
4. Sensitivity.
5. Item Uniformity. Correlations of scores on 40-item subsets of the total test ranged from .95 to .97.
A q) 6. Divergent Validity. The publisher's suggestion to get an IQ-like score indicates that the test is more an IQ test than an achievement test.
A q) 7. Bias. The technical manual says that tests like this should reveal the effects of cultural deprivation so that the need for remediation can be identified.
A 8. Consistency.

*The BWVT includes many very rare words, over 30% of the stem words not appearing in the Thorndike-Lorge word count. Although a criterion-referenced interpretation of this test is possible, the criterion pool of words is not a useful one for general education. A number of the words will be much less familiar in some regions of the U.S. than in others.

APPROPRIATENESS AND USABILITY

0B C 9. Instructions.

0 C 10. Item Review. Items were revised after field testing.

C 11. Visibility.

0 12. Responding.

c 13. Informativeness.

A 14. Curriculum Cross-Referencing.

A B C 15. Flexibility. For this one-objective test with a graded vocabulary, flexibility is not relevant.

A () 16. Alternate Forms. The technical manual gives three parallel forms which are subsamples of the complete test.

C 17. Administration.

OB C 18. Scoring. A transparent overlay or machine scoring may be used.

C 19. Record Keeping.

A q) 20. Decision Rules. The instructional implications of a score on this test are not clear.

A 21. Comparative Data. A well-defined but local sample of about 275 students per grade provided the norms.


BEGINNING ASSESSMENT TEST FOR READING J. B. Lippincott Company, 1975

DESCRIPTION

The tests in this battery measure the development of skills which are related to early reading instruction. The skills covered include vocabulary, visual and auditory discrimination, classification, rhyming, sequencing, riddles, letter recognition, sound-letter correspondences, picture-word and picture-sentence matching, spelling, sentence completion, oral production and comprehension, and color naming. A 41-item placement test measures 12 objectives with 2 to 6 items each. The comprehensive test measures 19 objectives with 6 to 26 items each. Responses are spoken, written, and selected from multiple choices.

PRICES

When purchased in the boxed set for 35 pupils, the cost is $21.78. This set includes consumable practice tests, placement tests, and comprehensive tests, manuals for each of these tests, and record forms. Date of information: 1978.

FIELD TEST DATA

Field testing is mentioned, but results are not reported.

ADMINISTRATION

These tests are designed for administration by a teacher to groups of 8 to 15 children, except for two subtests that require oral responses. Estimated testing time is 30-40 minutes for the placement test and 60 minutes for the comprehensive test.

SCORING

Hand scoring is done with replicas of the pupils' answer pages.


BEGINNING ASSESSMENT TEST FOR READING J. B. Lippincott Company, 1975

MEASUREMENT PROPERTIES

A(B)C 1. Description.

A 2. Agreement. No data.

A 0 3. Representativeness. No data.

A IQ 4. Sensitivity. No data.

A 0 5. Item Uniformity. No data.

A 0 6. Divergent Validity. No data.

A 0 7. Bias. No data, but attention was given to avoiding stereotypes in item development.

A 0 8. Consistency. No data.

APPROPRIATENESS AND USABILITY

OB C 9. Instructions.

(D C 10. Item Review. Items were screened on the basis of a small field test.

0 C 11. Visibility.

® C 12. Responding.

A 0 13. Informativeness. No specimen set.

A 0 14. Curriculum cross-referencing.

A® C 15. Flexibility. Not entirely relevant, since the test is designed for one level. Items for each objective are printed on different pages.

A 0 16. Alternate Forms.

0 C 17. Administration.

A(E)C 18. Scoring.


C 19. Record Keeping.

A 20. Decision Rules. Rules are provided, but without support. Interpretative guidance for prescription or placement is not given.

A 0 21. Comparative Data.


CARVER-DARBY CHUNKED READING TEST Revrac Publications, 1970

DESCRIPTION

The Carver-Darby is designed to measure reading rate and retention at the high school or adult level. The reader is given a one-page practice test with 20 multiple choice practice questions and then five similar scored passages and sets of questions. Each test item consists of a section of the text. The reader is asked to mark the one phrase or sentence where the meaning of the original passage has been altered. Three scores are given for the total pool of 100 questions: rate (number of answers given), efficiency (number of correct answers), and accuracy (efficiency divided by rate, times 100). The test has one level, for which two alternate forms are sold.
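
As a quick illustration of the score arithmetic just described, the short Python sketch below computes the three scores from a reader's answer counts; the function and variable names are ours, purely for illustration.

def chunked_reading_scores(answers_given, answers_correct):
    # Rate: number of answers given; efficiency: number correct;
    # accuracy: efficiency divided by rate, times 100.
    rate = answers_given
    efficiency = answers_correct
    accuracy = efficiency / rate * 100
    return rate, efficiency, accuracy

# Example: 80 of the 100 items attempted, 60 of them correct.
print(chunked_reading_scores(80, 60))   # (80, 60, 75.0)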

PRICES

Reusable test booklets cost 50¢ each in sets of 30. Answer sheets are 6¢ each by the hundred, and scoring templates 50¢ each. Individual pupil reports in sets of 100 are 8¢ each. The manual, with technical data and directions for administration, costs $4.00. A specimen set is offered at $6.00. Date of information: 1977.

FIELD TEST DATA

After a developmental field test on 60 college students, validation and reliability studies were carried out with 61 and 41 college students, respectively.

ADMINISTRATION

The test is administered by an examiner to groups under timed conditions. Administration time is 25 minutes.

SCORING

A hand-scoring stencil is available.

COMMENTS

The manual includes a detailed discussion of the review of the Carver-Darby in Buros' Seventh Mental Measurements Yearbook. Unlike the other tests reviewed here, the Carver-Darby is not built around instructional objectives. The design of the task is described in enough detail for the test to merit consideration as a criterion-referenced test, but our evaluative framework does not clearly fit the design of this unique test. The author intends to let the test go out of print when existing supplies are sold out.


CARVER-DARBY CHUNKED READING TEST Revrac Publications, 1970

MEASUREMENT PROPERTIES

ACC 1. Description. The construction and rationale of the test are well described. The authors state candidly that writing items for it is an art.

A q) 2. Agreement. No data.

A 3. Representativeness. Portions of text were selected at random to develop into test items, but it is not clear what population of information or skill the correct answers represent.

A 4. Sensitivity. No data.

0 C 5. Item Uniformity. Alternate form reliability is in the .7 to .8 range, all items presumably testing the same thing.

6. Divergent Validity. Factor analysis shows a distinction between the rate and accuracy scores.

A © 7. Bias. No data.

0 C 8. Consistency. Alternate form reliability in the range of .65 to .81 is reported for the three subscores.

APPROPRIATENESS AND USABILITY

(DB C 9. Instructions.

A 10. Item Review.

CI C 11. Visibility.

0 C 12. Responding.

C 13. Informativeness.

A (,) 14. Curriculum Cross-Referencing.

A B C 15. Flexibility. Not relevant.

C 16. Alternate Forms.

0 C 17. Administration.

A(g)C 18. Scoring. Hand scoring with a template.

C 19. Record Keeping. Individual pupil report sheets are available.

A q) 20. Decision Rules. Decision rules are given for categorizing readers into six types, but a solid argument for these types is not made.

A C) 21. Comparative Data. Data on the performance of 143 college students are given as a standard of comparison.


COOPER-McGUIRE DIAGNOSTIC WORD ANALYSIS TEST Croft Educational Services, Inc., 1972

DESCRIPTION

The Cooper-McGuire battery consists of diagnostic tests for primary to intermediate grades that measure the following categories of skills: phonetic analysis, structural analysis, and readiness. There are 32 objectives with an average of 15 items each. Item formats include multiple choice, oral response, and fill-ins. The test of each objective is printed on separate spirit masters for local duplication and scoring. Alternate forms of this battery are available. An optional curriculum index is offered.

PRICES

The book of spirit masters for one form of the tests costs $26.00. Prices per test per pupil will vary with the number of objectives tested and number of copies made from each spirit master. The administrator's manual, which contains scoring keys, costs $8.00. The test manual, with objectives and rationale, is $2.00. Class record charts are $2.00 each in sets of 20, and individual pupil record cards are 12¢ each in packs of 50. Cassettes for administering the tests are $29.00 per set. Transparent overlays for scoring are $89.00 per set. The price for the curriculum index is $49.00. Date of information: 1978.

ADMINISTRATION

Except for six individually administered objectives, the tests are designed for group administration by a teacher or for self-testing by cassette tape.

SCORING

Hand scoring is done with filled in pupil pages or with optional overlays.


COOPER-McGUIRE DIAGNOSTIC WORD ANALYSIS TEST Croft Educational Services, Inc., 1972

MEASUREMENT PROPERTIES

A(B)C 1. Description.

A 2. Agreement. No data.

A 3. Representativeness. No data.

A 4. Sensitivity. No data.

A 5. Item Uniformity. No data.

A 6. Divergent Validity. No data.

A 7. Bias. No data.

A 8. Consistency. No data.

APPROPRIATENESS AND USABILITY

A(5)C 9. Instructions. A number of the tests lack sample items.

A 10. Item Review.

A 11. Visibility. Several tests are too crowded for lower primary children.

C 12. Responding.

A 13. Informativeness. Specimen sets with full sets of objectives are not offered.

0 C 14. Curriculum Cross-Referencing.

(DB C 15. Flexibility. The test for each objective is printed on a separate spirit master.

c) C 16. Alternate Forms.

cD C 17. Administration.

A(DC 18. Scoring. Overlays are optionally available at extra expense. Reduced pupils' answer pages are in the administrator's manual. Machine scoring would not be appropriate.

C 19. Record Keeping.

A 0 20. Decision Rules. Given, but without support.

A (() 21. Comparative Data.


CRITERION-REFERENCED TESTS OF BASIC READING AND COMPUTATIONAL SKILLS Multi-Media Associates, 1975

DESCRIPTION

The EPIC battery consists of eight levels of tests of core skills in reading and math for grades K-6. Each level measures 25 objectives with 4 multiple choice items per objective. At each level, 15 of the objectives are in reading, 10 in math. The reading skills tested range from identifying letters and sequencing story pictures at the lowest level to study skills and interpretive comprehension at the highest. Math skills range from counting objects up through ratios and proportions.

PRICES

The reusable notebook with 25 answer sheets for individually testing pupils in grades K-2 costs $8.75 per grade. A test package for one level of EPIC at grades 3-6 costs $8.75 and includes reusable tests for 25 students, 25 machine and hand scorable answer sheets, an examiner's manual, an answer key, and an envelope for ordering machine scoring. In such packages the unit price per test is 35¢ per pupil. Specimen sets are $8.00 per level for K-2 and $2.00 for each upper level. Date of information: 1978.

FIELD TEST DATA

Each level was field tested on 17 to 47 pupils at that level in Tucson, Arizona. Some items were revised after the field test. For each item, difficulty levels are reported. Test-retest consistency is reported in terms of percent of response consistency at the item level and correlations at the level of the objective and the total test.

ADMINISTRATION

The lower three levels are designed for individual testing, the upper levels for group testing by the teacher. Estimated testing time per level is 30 minutes for reading and 40 minutes for math.

SCORING

Hand scoring may be done by key or template, or the answer sheets may be machine scored. The basic scoring service costs 80¢ per answer sheet and includes individual scores, group summary scores, school summaries, and district summaries. Learner needs assessment reports and classroom item analysis reports are also available.

COMMENTS

Publisher also offers a customized test development service.


CRITERION-REFERENCED TESTS OF BASIC READING AND COMPUTATIONAL SKILLS

Multi-Media Associates, 1975

MEASUREMENT PROPERTIES

A B0 1. Description. Although the objectives are numerous and fairly narrow, they are vaguely stated.

A qD 2. Agreement.

A 13 3. Representativeness.

A 4. Sensitivity.

A qD 5. Item Uniformity.

A qp 6. Divergent Validity.

A el 7. Bias.

C 8. Consistency. In small samples of pupils (17-47) the test-retest reliabilities at the level of the objective ranged from a median in the .40s at level 7 to a median in the .70s at levels 3 and 4.

APPROPRIATENESS AND USABILITY

A B® 9. Instructions. Instructions for the lower level tests are complex. Sample items are not provided for each objective.

0 C 10. Item Review. Items were revised on the basis of the field test.

C 11. Visibility.

C 12. Responding.

C 13. Informativeness.

A © 14. Curriculum Cross-Referencing.

A(E)C 15. Flexibility. There is considerable carry over of objectives from level to level, but all objectives for a level are tested in one booklet.

A © 16. Alternate Forms.

A 17. Administration. Vague.

OB C 18. Scoring. Templates and machine scoring options are available.

C 19. Record Keeping.

A 20. Decision Rules. Local option is offered, but a rationale for passing scores is not provided.

A qp 21. Comparative Data. Difficulty levels are provided for all items, but on very small, local samples.


CRITERION TEST OF BASIC SKILLS Academic Therapy Publications, 1976

DESCRIPTION

This battery contains individual tests of 18 reading objectives and 26 math objectives for diagnosing the basic skills of pupils in grades K-8. Reading objectives deal with letters (recognition, sounding, and writing), phonics, and sight words, there being an average of 13 items per objective. The math objectives deal with numbers and numerals, the four basic operations, money, time, supplying the missing symbol, fractions, decimals, and percents, there being an average of over six items per objective. Item formats for reading and math are oral and fill-in. The manual has 70 pages of suggested teaching activities and materials.

PRICES

The complete test package, which includes stimulus cards, 25 answer sheets for each of reading and math, the administrator's manual, and a pad of math problems sells for $17.00. Replacement answer sheets are 14¢ each in sets of 25. Date of information: 1977.

FIELD TEST DATA

Field testing is mentioned, but not described.

ADMINISTRATION

These tests are designed for administration to individuals by a teacher. Estimated testing time is 10-15 minutes for each of the six sections in reading and eleven sections in math.

SCORING

Scoring is done on the spot by circling correct responses and writing incorrect responses on the individual pupil record.

COMMENTS

A word list from the local text series is used for the sight word objective. Publisher feels that features 6 and 14 are not appropriate for evaluating this test.


CRITERION TEST OF BASIC SKILLS Academic Therapy Publications, 1976

MEASUREMENT PROPERTIES

A(E)C 1. Description.

A C) 2. Agreement. No data.

A () 3. Representativeness. No data.

A 4. Sensitivity. No data.

A 5. Item Uniformity. No data.

A © 6. Divergent Validity. No data.

A C) 7. Bias. No data.

A 8. Consistency. No data.

APPROPRIATENESS AND USABILITY

A(E)C 9. Instructions. Sample items are lacking.

0 C 10. Item Review. Revision on the basis of field test is reported.

0 C 11. Visibility.

0 C 12. Responding.

0 C 13. Informativeness.

A 14. Curriculum Cross-Referencing. There is an extensive activities guide, but no indexing of test materials.

(E)B C 15. Flexibility.

A (.) 16. Alternate Forms.

C 17. Administration.

A(E)C 18. Scoring.

0 C 19. Record Keeping.

A 20. Decision Rules. Three levels are identified, but without support.

A 21. Comparative Data.


DESIGN FOR MATH SKILL DEVELOPMENT NCS Educational Systems, 1975
by D. A. Kamp, et al.

DESCRIPTION

The Design for Math Skill Development is a seven-level system for instructional management in elementary math that is built around the following ten content strands: numeration and place value, addition and subtraction, multiplication and division, word problems involving the basic operations, fractions, geometry, measurement, money, time, and graphing. The number of objectives per level ranges from 14 at the first to 30 at the highest, objectives averaging eight multiple choice items each. Two alternate forms are available.

PRICES

Test booklets average 37¢ per pupil in packages of 35 for levels A-D (consumables) and $1.71 for levels E-G (reusable). Placement tests average 29¢ in packets of 35. Spirit masters for printing answer sheets are $3.00. Also available are the Teacher's Planning Guide for $4.25, Administrator Manual at $1.25 for each level, and a Teacher's Resource File for $21.00. Date of information: 1978.

SCORING

Scoring is by hand using keys in the examiner's manual.

ADMINISTRATION

The Design for Math Skill Development test is administered to groups or individuals. Testing time for an entire level will take from 60 minutes at the lowest level to 195 minutes at the upper end.


DESIGN FOR MATH SKILL DEVELOPMENT NCS Educational Systems, 1975
by D. A. Kamp, et al.

MEASUREMENT PROPERTIES

A0C 1. Description.

A qp 2. Agreement. No data.

A qD 3. Representativeness. No data.

A sg 4. Sensitivity. No data.

A qD 5. Item Uniformity. No data.

A (D 6. Divergent Validity. No data.

A qD 7. Bias. No data.

A g 8. Consistency. No data.

APPROPRIATENESS AND USABILITY

0 B C 9. Instructions.

A © 10. Item Review.

A © 11. Visibility. Tests at the lowest levels are crowded.

A 0 12. Responding. At the lowest levels, response spaces are small.

0 C 13. Informativeness.

0 C 14. Curriculum Cross-Referencing.

OB C 15. Flexibility. Separate test forms for each objective, with good carry over of objectives across levels.

0 C 16. Alternate Forms.

0 C 17. Administration.

A(B)C 18. Scoring. Scoring by answer key.

0 C 19. Record Keeping.

A © 20. Decision Rules.

A C) 21. Comparative Data.


DIAGNOSIS: AN INSTRUCTIONAL AID - MATHEMATICS Science Research Associates, 1972-73

DESCRIPTION

SRA's Diagnosis-Mathematics is a two-level battery of diagnostic tests for grades 1-6 that measure objectives in the following skill areas: computation, sets and numeration, operations, problem solving, measurement, and geometry. At each level there is a survey test and a series of diagnostic probe tests. The survey for Level A has 95 items testing 24 skill categories, while the 24 corresponding probe tests average 15 items each and have an average of about two items per objective. At Level B, the survey has 157 items testing 32 skill categories, and the 32 corresponding probe tests average 15 items each and 2 items per objective. All items are multiple choice. Alternate forms of the survey tests are optionally available. Two diagnostic labs, one for grades 1-4 and the other for grades 3-6, are available separately. These include diagnostic tests and prescriptive guides to basal texts and supplementary materials.

PRICES

A complete kit for a level lists at $80.00-$87.50 (school price--$60.00-$65.50). The kit for each level contains 30 copies of the survey test and of all the probes, the teacher's handbook, a guide to texts and materials, scoring overlays (Level A) or keys (Level B), etc. All test materials are consumable except the Level B surveys. Alternate forms of the surveys are available in sets of 30 for 28¢ per pupil list. Specimen sets for each level are $9.60 list ($7.20 school price). Date of information: 1977.

ADMINISTRATION

Tests are designed for group administration by an examiner or, in some cases, individual administration by the pupil.

SCORING

Level A is hand scored with overlay keys and Level B with strip keys.

COMMENTS

The teacher's handbook for Level B cross indexes the test items on several widely used norm-referenced tests to the SRA-Diagnosis probes and to sections of the SRA guide to texts and materials. A revised edition is expected to be on the market by 1980.


DIAGNOSIS: AN INSTRUCTIONAL AID - MATHEMATICS Science Research Associates, 1972-73

MEASUREMENT PROPERTIES

A® C 1. Description. Objectives are printed on the backs of the test forms.

A 2. Agreement. No data.

A 3. Representativeness. No data.

A 0 4. Sensitivity. No data.

A 5. Item Uniformity. No data.

A 6. Divergent Validity. No data.

A 7. Bias. No data.

A 8. Consistency. No data.

APPROPRIATENESS AND USABILITY

A® C 9. Instructions. Sample items are not provided in the probes.

A 10. Item Review.

C) C 11. Visibility.

C 12. Responding.

A () 13. Informativeness. The objectives are not completely listed in the specimen materials.

C 14. Curriculum Cross-Referencing.

A® C 15. Flexibility.

® 16. Alternate Forms. Alternate forms of the survey are sold. Probes come in only one form.

0 C 17. Administration.

ACDC 18. Scoring. Hand scoring is easy.

0 C 19. Record Keeping.

A G-) 20. Decision Rules.

A 21. Comparative Data.


DIAGNOSIS: AN INSTRUCTIONAL AID - READING Science Research Associates, 1974

DESCRIPTION

SRA's Diagnosis-Reading is a two-level battery of diagnostic tests for grades 1-6 that measure objectives in the following skill areas: phonetic analysis, structural analysis, comprehension, vocabulary, study skills, and use of sources. Each level (A=grades 1-4, B=grades 3-6) has a survey test of over 60 items and a series of over 30 diagnostic probes, each with an average of 20 items. On the probes, the minimum and usual number of items per objective is 2, there being 306 objectives at Level A and 224 at Level B. Item formats include multiple choice, matching, fill-in, and ordering. Alternate forms of the survey tests are optionally available. The classroom kit includes a guide to texts and other instructional materials. Two diagnostic labs, one for grades 1-4 and the other for grades 3-6, are available separately. These include diagnostic tests and prescriptive guides to basal texts and supplementary materials.

PRICES

A complete kit for Level A with 25 copies of the survey and of each of the probe tests, a guide to texts and materials, the teacher's handbook, cassettes for the phonetic tests, etc., lists for $159.50 (school price $119.50). The Level B kit lists for $116.75 (school price--$87.50). The alternate form of the survey for each level lists at 58¢ per pupil in sets of 25. Specimen sets for each level list at $9.60 ($7.20 school price). Date of information: 1977.

ADMINISTRATION

The SRA Diagnosis-Reading tests are made for a variety of modes of administration. The surveys are administered to groups by a teacher; Level A phonetics probes are given by cassette tape; and many of the other probes are self-administered under teacher supervision.

SCORING

Survey tests are scored by hand with a key, while the probes are self-scoring.


DIAGNOSIS: AN INSTRUCTIONAL AID - READING

Science Research Associates, 1974

MEASUREMENT PROPERTIES

ACIC 1. Description. Objectives are located on the inside of the self-scoring answer sheets.

A 0 2. Agreement. No data.

A 0 3. Representativeness. No data.

A 4. Sensitivity. No data.

A 5. Item Uniformity. No data.

A el 6. Divergent Validity. Nodata.

A 0 7. Bias. No data.

A 0 8. Consistency. No data.

APPROPRIATENESS AND USABILITY

0 B C 9. Instructions. The survey tests do not have sample items, but the probes do.

10. Item Review.

11. Visibility.

12. Responding.

13. Informativeness. The objectives are not completely listed in the specimen materials.

C 14. Curriculum Cross-Referencing.

A0 C 15. Flexibility. Each probe tests from 2 to 26 objectives.

16. Alternate Forms. Alternate forms of the survey are available for pre- and posttesting. Probes come in one form.

CD C 17. Administration.

A0C 18. Scoring. The surveys are scored with reduced pupil pages; probes are self-scoring.

C 19. Record Keeping.

A qD 20. Decision Rules.

A (ID 21. Comparative Data.


DIAGNOSTIC MATHEMATICS INVENTORY CTB/McGraw-Hill, 1975
by John K. Gessel

DESCRIPTION

A revision of the earlier Prescriptive Mathematics Inventory, the DMI is a seven-level diagnostic testing system for grades 1.5 through 7.5 plus. The following 11 categories of skills are covered: pre-operational concepts, counting, matching, addition of single digits, addition of integers with more than 1 digit, subtraction of integers, missing addends and factors, sequences and inequalities, measurement, plane figures, and inverse and place value. The DMI has from 37 to 179 multiple choice items per level, with each item testing a separate objective. The number of choices per test item ranges from 5 to 10. For math skills at this broad level of description--"measurement," "subtraction of whole numbers with regrouping," "segments, lines, rays"--the number of skills per level varies from 11 to 39 and the number of items per skill from 2 to 8.

To support classroom instruction, the following optional materials are available: interim tests for monitoring pupils' progress during the year, learning activities guides, guides indexing the DMI to math text series, and guides to non-text teaching materials.

PRICES

Test books come in packages of 35 with an Examiner's Manual. At the lower three levels, hand scorable test books are 38¢-49¢ per pupil, and machine scorable ones are 67¢-80¢. At the upper four levels, the reusable test books are 54¢-61¢ per pupil. Machine scored answer sheets are 13¢-16¢ per pupil in sets of 50; hand scorable ones are 20¢-40¢ in sets of 25. Consumable practice exercises for leveling pupils before giving the diagnostic tests are offered for 11¢ per pupil in sets of 35. The Teacher's Guide, serving all levels, is $3.25. Examination kits are $5.50 per level, $16.00 for an all-level kit. Date of information: 1979.

FIELD TEST DATA

A technical report was in preparation on the DMI while this volume was in progress. It has point-biserials for individual items, KR-20s, test-retest reliabilities, and item-difficulty data. Only small pieces of these data were available for our review.
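
For readers unfamiliar with the item statistics named above, the short sketch below shows how item difficulty and a point-biserial are conventionally computed from a pupils-by-items matrix of right/wrong responses. The data and variable names are invented for illustration; they are not taken from the DMI technical report.

import numpy as np

# Each row is one pupil's pattern of right (1) / wrong (0) answers on four items.
responses = np.array([[1, 1, 0, 1],
                      [1, 0, 0, 1],
                      [0, 1, 0, 0],
                      [1, 1, 1, 1]])

difficulty = responses.mean(axis=0)   # proportion of pupils answering each item correctly
totals = responses.sum(axis=1)        # each pupil's total score

# Point-biserial for an item: Pearson correlation between the 0/1 item column
# and the total score across pupils.
point_biserials = [round(float(np.corrcoef(responses[:, i], totals)[0, 1]), 2)
                   for i in range(responses.shape[1])]

print(difficulty)        # e.g., [0.75 0.75 0.25 0.75]
print(point_biserials)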

ADMINISTRATION

The DMI is a group test designed to be given by an examiner. Estimated testing time varies from two sessions of 40 minutes each at the lowest level to six sessions of 45 minutes each at the upper levels.

SCORING

Tests are machine scored at a cost of 70¢ to 97¢ per pupil. The basic service includes all of the following: responses to individual items, individuals' total scores, summaries by item of group responses, summaries by total scores of the group. For an additional cost of 15¢ and 25¢ per pupil respectively, group and individual diagnostic reports are available. Estimated norms are optionally available. Publisher says that the approximate time for returning scores to the user is 15 days from receipt.


DIAGNOSTIC MATHEMATICS INVENTORY
by John K. Gessel

CTB/McGraw-Hill, 1975

MEASUREMENT PROPERTIES

A B® 1. Description. Objectives for single items defeat the purpose of objectives--to describe skills rather than test items. At higher levels of item grouping, the "objectives" are vague.

A () 2. Agreement. No data.

A () 3. Representativeness. No data.

A q) 4. Sensitivity. No data.

A () 5. Item Uniformity. No data on broader "category" objectives. Not applicable to one-item objectives.

A 6. Divergent Validity. Nodata.

A qD 7. Bias. No evidence.

A qD 8. Consistency. No data. See notes under Field Test Data on the facing page.

APPROPRIATENESS AND USABILITY

B C 9. Instructions.

10. Item Review.

11. Visibility.

C 12. Responding.

C 13. Informativeness.

C 14. Curriculum Cross-Referencing.

AODC 15. Flexibility. Core objectives are tested at several levels, but testing of individual category objectives is not practical with machine scoring. Optional interim tests provide more flexibility.

A () 16. Alternate Forms.

0 C 17. Administration.

(DB C 18. Scoring.

C 19. Record Keeping.

A 20. Decision Rules. Rules are implied but not well supported.

C 21. Comparative Data. See notes under Field Test Data and Scoring on the facing page.


DOREN DIAGNOSTIC READING TEST OF WORD RECOGNITION SKILLS American Guidance Service, 1973
by Margaret Doren (Original copyright: 1956)

DESCRIPTION

The Doren is a group diagnostic test for children in the primary grades which covers the following skills: letter and word recognition, beginning and ending sounds, consonants and vowels, word roots, blending, rhyming, spelling, sight words, and guessing words in context. There are 33 objectives averaging 12 items per objective, for a total of 395 items. Items are both multiple choice and written.

PRICES

Consumable test booklets are 27¢ each in sets of 25 booklets. An overlay key is offered for $5.90, and the manual (dated 1973) is $2.35. The specimen set, at $3.00, includes a test booklet, manual, and class record sheet. Date of information: 1977.

FIELD TEST DATA

Total test scores are reported for a sample of approximately 40 pupils at each of levels 1-4. The recency of these data is not reported.

ADMINISTRATION

The test is administered by an examiner in a group setting. The catalog estimates total testing time to be one to three hours depending on class size and reading level.

SCORING

Scoring is done either with an optional template or a key in the manual.

COMMENTS

The test was first published in 1956, before the days of objectives-based testing, but it apparently was not developed as a norm-referenced test.


DOREN DIAGNOSTIC READING TEST OF WORD American Guidance Service, 1973

RECOGNITION SKILLS by Margaret Doren (Original copyright: 1956)

MEASUREMENT PROPERTIES

AOC 1. Description.

A () 2. Agreement.

A () 3. Representativeness.

A () 4. Sensitivity. No data.

A © 5. Item Uniformity.

A 0 6. Divergent Validity.

A © 7. Bias. No data.

A 0 8. Consistency. No data.

APPROPRIATENESS AND USABILITY

B C 9. Instructions.

A 0 10. Item Review.

11. Visibility.

C 12. Responding.

C 13. Informativeness.

A 14. Curriculum Cross-Referencing. The manual has a seven page section on remedial activities.

A BO 15. Flexibility.

A C) 16. Alternate Forms.

C) C 17. Administration.

A(TDC 18. Scoring.

C 19. Record Keeping.

A (D 20. Decision Rules. Easy-to-use decision rules are given, but they need justification.

A C) 21. Comparative Data. Ranges of scores are given by grade level for grades 1-4, but the field test population is limited to "four midwest suburban school districts."


EARLY CHILDHOOD ASSESSMENT Cooperative Educational Service, undated
by Robert Wendt and Robert Schramm

DESCRIPTION

The ECA is a battery of performance tasks to be used for locating children of 3 to 6 years along a developmental curriculum sequence. It has six levels, ranging from sensory-motor activities which are maturationally determined through integration, to symbolic activities of reading and math. There are 73 separately scored objectives with the number of tasks for each ranging from one to twelve. The median number of items per objective for the 23 reading and 17 math objectives is 4 and 3 respectively. The manual states that the assessment is not designed to be diagnostic or categorical, but rather to serve as an aid for locating the child's level. A prescriptive guide to activities for learning centers in early childhood education is available separately.

PRICES

Consumable scoring booklets are 50¢ each, available in any numbers. The administrator's manual is $2.25 and the prescriptive guide is $4.75. Date of information: 1977.

FIELD TEST DATA

Field testing is mentioned, but not described.

ADMINISTRATION

The ECA is administered or supervised by a person trained in individual testing. Although it is an individual test, procedures are described for testing larger numbers of children at the same time at a series of separate testing stations. The following equipment is optional: an audiometer, Telebinocular, Titmus Stereo apparatus, Good-lite Screening Instrument. Estimated testing time is 45-50 minutes per pupil.

SCORING

Many of the objectives are scored by observer's judgment. Guidelines for scoring are often subjective.

COMMENTS

The ECA is an ESEA Title III Project. A new edition is scheduled to be published before the release of this handbook.


EARLY CHILDHOOD ASSESSMENT Cooperative Educational Service, undated
by Robert Wendt and Robert Schramm

MEASUREMENT PROPERTIES

A(TDC 1. Description.

A C) 2. Agreement. No data.

A C) 3. Representativeness. No data.

A C) 4. Sensitivity. No data.

A C) 5. Item Uniformity. No data.

A C) 6. Divergent Validity. No data.

A C) 7. Bias. No data.

A C) 8. Consistency. No data.

APPROPRIATENESS AND USABILITY

AIDC 9. Instructions. Sample tasks are not consistently given.

A 10. Item Review.

® C 11. Visibility.

0 C 12. Responding. Generally an action by the child.

A 13. Informativeness. Program descriptors and an overview booklet are available on request at no cost.

A (D 14. Curriculum Cross-Referencing.

A(DC 15. Flexibility. Parts of the test may be given to any one child.

A 16. Alternate Forms.

A 17. Administration. Instructions vary in clarity. Examiner has to switch between manual and detailed response booklet.

A B0 18. Scoring. Scoring will vary with the observer's subjective standards of correctness.

C 19. Record Keeping. Pupil and class record sheets are provided.

A 0 20. Decision Rules. Given but not with any support.

A 21. Comparative Data.


EVERYDAY SKILLS TESTS: READING, TEST A; MATHEMATICS, TEST A CTB/McGraw-Hill, 1975

DESCRIPTION

The Everyday Skills Tests consist of a battery of two objectives-based tests (Tests A) in the reading and math skills that are useful for adults in their daily lives and two norm-referenced tests (Tests B) in computation and reference/graphic materials. There are 3 multiple choice items for each of 15 reading objectives and 9 math objectives in the A tests. Reading objectives deal with materials like labels, ingredients, want ads, tax forms, and the like. Math objectives deal with matters like cost comparisons, rates of interest, and time calculations.

PRICES

Reusable test booklets are 32¢ each for reading and 26¢ each for math, both in sets of 35. The examiner's manual is included. Booklets contain both the objectives-based A part and norm-referenced B part of each domain. Answer sheets are 9¢ each in sets of 50 and scoring stencils are $2.75 apiece. A specimen set is offered at $5.00. Date of information: 1978.

FIELD TEST DATA

The A Tests were field tested in a sample of schools in Florida. Difficulties are reported for each item for 6th, 8th, and 10th grade pupils in the sample. The median percent of correct responses to an item at the 10th grade level is 88 in reading and 67 in math.

ADMINISTRATION

These tests are designed for group administration by a teacher. Estimated testing time is at least 30 minutes for reading and 24 minutes for math, although the tests are untimed. The norm-referenced parts of the battery, which are timed, take another 30-40 minutes each.

SCORING

Scoring is done by hand key, optional stencil, or machine. The scoring service provides a class record and individual reports for 50¢ and 75¢ per pupil, respectively.

COMMENTS

Items for part B of each test come from Form R of the Comprehensive Test of Basic Skills.


EVERYDAY SKILLS TESTS: READING, TEST A; MATHEMATICS, TEST A CTB/McGraw-Hill, 1975

MEASUREMENT PROPERTIES

A(S)C 1. Description.

A 2. Agreement. No data.

A q) 3. Representativeness. No data.

A 0 4. Sensitivity. No data.

A 5. Item Uniformity. Three items per objective were selected from a set of five partly on the basis of inter-item correlations, but the correlations are not given.

A 6. Divergent Validity. No data.

A 7. Bias. No data.

A 8. Consistency. No data.

APPROPRIATENESS AND USABILITY

B C 9. Instructions.

C 10. Item Review.

0 C 11. Visibility.

0 C 12. Responding.

0 C 13. Informativeness.

A 14. Curriculum Cross-Referencing.

A B® 15. Flexibility.

A 16. Alternate Forms.

C 17. Administration.

OB C 18. Scoring.

0 C 19. Record Keeping.

A 20. Decision Rules.

A ® 21. Comparative Data. Item difficulties are reported but on a sample of only 200-435 pupils from Florida.


FOUNTAIN VALLEY TEACHER SUPPORT SYSTEM IN MATHEMATICS Richard L. Zweig Associates, 1972

DESCRIPTION

The Fountain Valley math tests are part of a nine-level diagnostic/prescriptive system which covers objectives for grades K-8 in the following areas: numbers and operations, geometry, measurement, applications, statistics and probability, sets, functions and graphs, logical thinking, and problem solving. The number of objectives per level ranges from 36 at the lowest to 135 at grade 6. Each test form contains the tests of several objectives, so there are 11 to 31 separate forms per level. The number of multiple choice items per objective ranges from two to twelve, with the average being at least three at all levels. Directions for all tests are given by cassette tape. An optional "teaching alternatives supplement" at each test level cross references the Fountain Valley objectives to the text and non-text instructional materials of 40 publishers.

PRICES

The total system for all grades (about 500 students) sells for about $2,750. An inservice (training) module is offered for $75.00. Modules containing a manual, a teaching alternatives supplement, and tape cassettes sell for from $83.50 to $203 depending on the level. Hand-scored test forms are 3¢ per pupil in sets of 50, while the self-scoring forms are about 11¢ each. Answer keys sell for from $11 to $31 per level depending on the number of tests for the level. Rather than have the system described fully in a catalog or specimen set, the publisher explains the system mostly through its sales representatives. Date of information: 1977.

ADMINISTRATION

The Fountain Valley math tests are administered to groups of pupils for the most part by cassette tape. Administration of the tests of numbers and operations by teachers is supported by a separate manual. The estimated testing time per test form ranges from six to twenty minutes.

SCORING

Overlays are provided for scoring the answer sheets.

COMMENTS

The keying of items to objectives is contained only in the scoring materials, not with the objectives in the manuals.


FOUNTAIN VALLEY TEACHER SUPPORT SYSTEM IN MATHEMATICS

Richard L. Zweig Associates, 1972

MEASUREMENT PROPERTIES

A® C 1. Description.

A CD 2. Agreement. No data.

A 0 3. Representativeness. No data.

A 0 4. Sensitivity. No data.

A CD 5. Item Uniformity. No data.

A 0 6. Divergent Validity. No data.

A CD 7. Bias. No data.

A 0 8. Consistency. No data.

APPROPRIATENESS AND USABILITY

A® C 9. Instructions. Sample items are not provided. The instructions at levels K and 1 may be too complex.

A 10. Item Review.

0 C 11. Visibility.

A qp 12. Responding. Answer sheets for the lowest two levels are crowded.

A 13. Informativeness.

0 C 14. Curriculum Cross-Referencing.

AOC 15. Flexibility. The test forms cover an average of three or four objectives each.

A 16. Alternate Forms.

C) C 17. Administration.

A® C 18. Scoring.

0 C 19. Record Keeping.

A 0 20. Decision Rules. Rules are provided without support.

A 0 21. Comparative Data.


FOUNTAIN VALLEY TEACHER SUPPORT SYSTEM IN READING Richard L. Zweig Associates, 1975

DESCRIPTION

The Fountain Valley reading tests are part of a six-level system for the management of reading instruction in grades K through 6 covering the following skill areas: phonic analysis, structural analysis, vocabulary development, comprehension, and study skills. The number of objectives varies from 125 at level K-1 to 33 at level 4. Each test form contains the items for several (i.e., 3 to 6, on the average) objectives. There are from two to twelve multiple choice items per objective with the average being about three items at all levels. A "teaching alternatives supplement" cross references the tests' objectives to the text and non-text instructional materials of over 70 publishers.

PRICES

The total system for all grades (about 500 students) sells for about $2,125. An inservice (training) module is offered for $75.00. Modules containing the manual, the teaching alternatives supplement, and tape cassettes vary from $100 to $51 per level. Hand-scored test forms are 3.5¢ per pupil in sets of 50, while the self-scoring forms are about 12¢ each. Answer keys sell for from $9 to $19 per level depending on the number of tests for the level. Rather than have the system described fully in a catalog or specimen set, the publisher explains the system mostly through its sales representatives. Date of information: 1977.

FIELD TEST DATA

Over 10,000 students in grades 1-6 took part in the field test in the Fountain Valley, California, School District. Results of the field test are not reported.

ADMINISTRATION

These are group tests which can be administered by cassette tape or orally by a teacher.

SCORING

Two scoring options are offered: hand scoring by template or self-scoring with special answer sheets. Estimated testing time per test form is from 5.5 to 20 minutes.

COMMENTS

The keying of items to objectives is contained only in the scoring materials, not with the objectives in the manuals.


FOUNTAIN VALLEY TEACHER SUPPORT SYSTEM IN READING

Richard L. Zweig Associates, 1975

MEASUREMENT PROPERTIES

A0C 1. Description.

A 0 2. Agreement.

A c) 3. Representativeness. Items were chosen according to how well they discriminated high and low scorers.

A 0 4. Sensitivity. The data that are given in a mimeographed technical report indicate changes in scores on standardized tests following introduction of the system.

A 0 5. Item Uniformity. No data.

A 0 6. Divergent Validity. No data.

A 0 7. Bias. No data.

A qD 8. Consistency. No data.

APPROPRIATENESS AND USABILITY

® B C 9. Instructions.

A 0 10. Item Review.

® C 11. Visibility.

O C 12. Responding. The answer sheets for the lower two levels are somewhat crowded.

A 0 13. Informativeness. Specimen sets are not offered. It is hard to figure out what the basic package is.

O C 14. Curriculum Cross-Referencing.

ACOC 15. Flexibility. Test forms are one page each and test 3 to 6 objectives.

A qp 16. Alternate Forms.

0 C 17. Administration.

A(E)C 18. Scoring. Two methods of easy local scoring are offered: template and self-scoring answer sheet.

0 C 19. Record Keeping.

A q.) 20. Decision Rules. Provided, but without support.

A q.) 21. Comparative Data.


GROUP PHONICS ANALYSIS TEST Dreier Educational Systems

DESCRIPTION

The Group Phonics Analysis Test is a 75-item diagnostic test of basic phonics skills for pupils in grades 1-3. The 11 stated objectives range from recognizing printed letters and numbers to dividing words into syllables. There are from 3 to 19 multiple choice items per objective, the mode being 3.

PRICES

The one-page consumable test forms are 17¢ each in packs of 40, which include the examiner's manual. The specimen set, at $2.95, contains an examiner's manual and sample test as well as samples of four other tests by Dreier. Date of information: 1978.

FIELD TEST DATA

Norms and reliabilities are based on a field test of 104 pupils in grades 1-3.

ADMINISTRATION

These untimed tests are designed for group administration.

SCORING

The answer sheet contains a pressure-sensitive self-scoring second page.


GROUP PHONICS ANALYSIS TEST Dreier Educational Systems

MEASUREMENT PROPERTIES

A B0 1. Description.

A 2. Agreement. No data.

A 3. Representativeness. No data.

A 0 4. Sensitivity. No data.

5. Item Uniformity. The KR-21 reliability for a sample of 104 pupils in grades 1-3 is .88.

CD C 6. Divergent Validity. Scores on this test have a low correlation with scores on a test of reading comprehension (r=.32).

A 7. Bias. No data.

A 8. Consistency. No data.

APPROPRIATENESS AND USABILITY

A0C 9. Instructions. Only two sample items are given.

A 10. Item Review. No report.

A qD 11. Visibility. See #12.

A 12. Responding. The self-scoring test form is too crowded for easy answering by primary level children.

0 C 13. Informativeness.

A 14. Curriculum Cross-Referencing,.

A Be 15. Flexibility.

A 16. Alternate Forms.

0 C 17. Administration.

A(B)C 18. Scoring. The test form has a pressure-sensitive self-scoring duplicate backing.

A © 19. Record Keeping.

A © 20. Decision Rules.

A ® 21. Comparative Data. Interquartile bands are given around the average scores for 1st through 6th grades. The norming population is small and the method for inferring norms for grades 4-6 not described.


INDIVIDUAL PUPIL MONITORING SYSTEM - Mathematics Houghton Mifflin, 1973-1974

DESCRIPTION

The IPMS-Mathematics is an eight-level system for continuously monitoring pupils' mastery of math objectives. The levels, corresponding roughly to grades 1-8, are each divided into three "assessment modules" aiming at one-third of a year of instruction. Tests for each objective are printed on separate pages. The number of objectives ranges from 48 at Level 1 to 64 at Level 8, the lower three levels having 5 multiple choice items per objective and the upper levels having 10. Two forms of the tests are available. In addition to the basic testing materials, resources for relating tests to instruction are optionally available, including one booklet at each level indexed to major math text series and guides to other learning materials and activities.

PRICES

Test booklets and individual pupil progress records come together in sets of 35 @ 43¢ per pupil, per module, per form for Level 1 and @ 57¢ for the other levels. Self-scoring answer sheets for Levels 3-8 sell for 13¢ each in sets of 100, and test booklets for those levels are reusable. The crayon for the self-scoring system is sold by the dozen at about 36¢ each. A Teacher's Kit containing a booklet of objectives, a set of classroom record forms, a teacher's guide, and a booklet indexing objectives to texts and teaching materials is available @ $10.80 for Level 1 and $5.40 for the other levels. Date of information: 1978.

FIELD TEST DATA

The publisher reports field testing each level of each form on a national sample of about 350 pupils for the purpose of leveling and selecting test items from a larger initial pool of items.

ADMINISTRATION

Directions for group administration are provided, but at the upper levels pupils may be taking different tests at the same time. Tests are untimed, and no time estimates are provided. Pupils may be tested on as little as one objective at a sitting.

SCORING

Self-scoring by means of a latent-image system can be used, or scoring can be done by template.

COMMENTS

Several adults who tested the latent-image answer sheet and crayon found that heavy hand pressure was needed to make the hidden answer appear.


INDIVIDUAL PUPIL MONITORING SYSTEM - Mathematics

Houghton Mifflin, 1973-1974

MEASUREMENT PROPERTIES

AOC 1. Description.

A CD 2. Agreement. A review is mentioned and the reviewers named, but the method is not described.

A CD 3. Representativeness. An item analysis is mentioned but not described.

A cp 4. Sensitivity. No data.

A 5. Item Uniformity. No data.

A CD 6. Divergent Validity. No data.

A cD 7. Bias. No data.

A cD 8. Consistency. No data.

APPROPRIATENESS AND USABILITY

ACC 9. Instructions. Instructions for higher levels are clear, but for lower level math concept items the vocabulary may be somewhat hard.

10. Item Review. Item selection was based in part on field test data.

11. Visibility.

12. Responding.

13. Informativeness. An informative specimen set is offered, but its contents should be described in the catalog.

C 14. Curriculum Cross-Referencing.

()B C 15. Flexibility.

C 16. Alternate Forms

C 17. Administration.

ACIC 18. Scoring. Two hand-scoring options are available.

0 C 19. Record Keeping.

A 20. Decision Rules.

A 0 21. Comparative Data.


INDIVIDUAL PUPIL MONITORING SYSTEM - Reading Houghton Mifflin, 1974

DESCRIPTION

The IPMS-Reading is a six-level system for managing instruction in reading. Designed for grades 1-6, it has from 43 to 63 objectives per level, each with five multiple choice test items. Tests for each objective are printed on separate pages. Two alternate forms of the tests are available. At every level there is one test booklet for each of these three groups of skills: word attack, vocabulary/comprehension, and discrimination/study skills. In addition to the basic testing materials, resources for relating tests to instruction are available, including indexes relating subtests to basal reading series (optional) and record keeping systems (included).

PRICES

Test booklets are 57¢ to 60¢ each, including an individual pupil record, in sets of 35. Booklets for the lower two levels are consumable. Self-scoring answer sheets for levels 3-6 are 13¢ each in sets of 100. Crayons for the self-scoring system are 36¢ each in sets of a dozen. Hand-scored answer sheets are about 5¢ in sets of 500. Teacher Kits @ $4.11 to $4.26 contain the following, which are also available separately: booklet of IPMS-Reading Objectives, Teacher's Guides, and Teacher's Management Record Booklet. For each of eight basal reading series, a separate cross-reference booklet is sold @ $1.98 to $3.30 per copy. The examination kit is $4.26. Date of information: 1978.

FIELD TEST DATA

A developmental field test is mentioned, but not described in any detail.

ADMINISTRATION

IPMS-Reading is administered to groups by an examiner. Time to administer these unspeeded tests will vary with the number of objectives tested at one sitting.

SCORING

Scoring can be done either by referring to answer keys at the back of the teacher's guide or by counting the correct items on the latent-image answer sheet.

COMMENTS

Several adults who tested the latent-image answer sheet and crayon found that heavy hand pressure was needed to make the hidden answer appear.


INDIVIDUAL PUPIL MONITORING SYSTEM - Reading Houghton Mifflin, 1974

MEASUREMENT PROPERTIES

A(B)C 1. Description.

A 2. Agreement. A review is mentioned, but not described.

A 3. Representativeness. No information.

A 4. Sensitivity. No data.

A 0 5. Item Uniformity. No data.

A qp 6. Divergent Validity. No data.

A 0 7. Bias. No data.

A 8. Consistency: No data.

APPROPRIATENESS AND USABILITY

A(E)C 9. Instructions. Sample items are included, but instructions for the lower levels appear too advanced in places.

A 10. Item Review. No data.

A 0 11. Visibility. At the lower levels items are crowded and response spaces are too small.

CD C 12. Responding.

CD C 13. Informativeness. An informative examination kit is offered, but its contents should be described in the catalog.

0 C 14. Curriculum Cross-Referencing.

C 15. Flexibility.

C 16. Alternate Forms.

C 17. Administration.

AOC 18. Scoring. Two methods of hand scoring are offered.

C 19. Record Keeping.

A 0 20. Decision Rules.

A 0 21. Comparative Data.


INDIVIDUALIZED CRITERION-REFERENCED TESTING - Math    Educational Progress, 1973

DESCRIPTION

The ICRT-Math is an eight-level battery of tests for math in grades 1-8. At each level, there are 4 or 5 separate test forms of 16 multiple choice items (8 objectives) each. Objectives deal with sets, numeration systems, the four basic operations, geometry, functions and graphs, applications, and measurement. Alternate forms of the battery are available. Indexing of test objectives to two curriculum series by the publisher is given in the manual. Other prescriptive resources are optionally available.

PRICES

Test booklets are sold in complete sets for a level. In packages of 10 pupils, the price per pupil per test booklet runs 25¢ to 32¢. Machine scorable test forms are available for level 1; otherwise test forms are reusable in conjunction with answer sheets. Answer cards (plus the machine scoring service) are $1.25 per pupil for an order of at least 100 pupils. The teacher's/administrator's manual is $4.50. Date of information: 1976.

FIELD TEST DATA

The two ICRT components, math and reading, were field tested in six districts in Orange County, California. Data from the field tests are not reported.

ADMINISTRATION

The ICRT-Math tests are made for group administration by a teacher or self-administration in the upper grades.

SCORING

Templates for hand scoring are available, and machine scoring is offered for 85¢ to $1.25 per pupil. The basic scoring service, which requires a minimum order for 100 pupils, includes prescriptive reports for individuals, an instructional grouping report for the class, a building summary, and a district summary. Estimated turnaround from receipt of materials is seven days.


INDIVIDUALIZED CRITERION-REFERENCED TESTING - Math    Educational Progress, 1973

MEASUREMENT PROPERTIES

AOC 1. Description.

A () 2. Agreement. No data.

A C) 3. Representativeness. No data.

A C) 4. Sensitivity. No data.

A C) 5. Item Uniformity. No data.

A C) 6. Divergent Validity. No data.

A 0 7. Bias. No data.

A () 8. Consistency. No data.

APPROPRIATENESS AND USABILITY

A0C 9. Instructions. Most of the test forms lack sample items.

A 0 10. Item Review.

0 C 11. Visibility.

A 12. Responding. Spaces for marking answers on machine scorable cards are small and crowded.

A CD 13. Informativeness. No specimen set.

() C 14. Curriculum Cross-Referencing. In the manual, test objectives are indexed to two of the publishers' series of materials. With machine scoring, three other publishers' materials will be indexed in the reports.

A(B)C 15. Flexibility. There is much carry over of objectives from level to level, but each test form has items for eight objectives.

0 C 16. Alternate Forms.

C 17. Administration.

GB C 18. Scoring. Hand scoring by template or machine scoring are available.

0 C 19. Record Keeping.

A (7) 20. Decision Rules. Provided, but with little justification for individual objectives. With only two items per objective, secure decisions about mastery of objectives are not possible.

A 21. Comparative Data.


INDIVIDUALIZED CRITERION-REFERENCED TESTING - Reading    Educational Progress, 1973

DESCRIPTION

The ICRT-Reading tests make up an eight-level battery for pupils in grades K-8. The skills tested include word attack, literal comprehension, and interpretative comprehension. For levels 1-8, each test booklet has 16 multiple choice items, two questions for each of 8 objectives. The number of test booklets per level ranges from nine at level 1 to four at levels 4, 5, 6, and 7-8. At the K level, there are eight booklets of five items each. Indexing of test objectives to two curriculum series of the publisher is given in the manual. Other prescriptive resources are optionally available. Alternate forms of this battery are available.

PRICES

The package of 10 copies of all test booklets for one form of a level sells for $12.50, which gives a unit price of 32¢ or less per booklet. For levels 1-8, booklets are reusable. Answer sheets are $1.25 each for an order of at least 100, which includes the cost of machine scoring. Consumable booklets for levels 1 and 2 are also offered. In conjunction with the scoring materials, the unit price for these consumables is $2.85 per pupil for the entire level. A template and a 50-page answer sheet pad for local hand scoring cost $1.50 each per level for all levels. The tests for levels 1-8 are also packaged in a kit of 144 large cards for individual testing, one objective per side. This kit, called Benchmarks, is $38.50. Date of information: 1976.

FIELD TEST DATA

The two ICRT components, math and reading, were field tested in six districts in Orange County, California. Data from the field test are not reported.

ADMINISTRATION

Test booklets are made for group administration by a teacher. The tests as packaged in Benchmarks are for individual testing.

SCORING

Templates for hand scoring and machine scoring services are available. For a minimum order of 100 answer sheets, the cost is $125.00. It includes prescriptive reports for individuals, an instructional grouping report for the class, a building summary, and a district summary. Estimated turnaround time is seven days from receipt of materials.


INDIVIDUALIZED CRITERION-REFERENCED TESTING - Reading    Educational Progress, 1973

MEASUREMENT PROPERTIES

A® C 1. Description.

A CD 2. Agreement. No data.

A CD 3. Representativeness. No data.

A CD 4. Sensitivity. No data.

A IQ 5. Item Uniformity. No data.

A qp 6. Divergent Validity. No data.

A CD 7. Bias. No data.

A CD 8. Consistency. No data.

APPROPRIATENESS AND USABILITY

AOC 9. Instructions. Sample items are lacking for most of the tests.

A CD 10. Item Review.

0 C 11. Visibility.

C 12. Responding.

A CD 13. Informativeness. No specimen set is offered.

C 14. Curriculum Cross-Referencing.

0 B C 15. Flexibility.

C 16. Alternate Forms.

(D C 17. Administration.

OB C 18. Scoring,.

C 19. Record Keeping.

A CD 20. Decision Rules. Provided, but with little justification for individual objectives. With only two items per objective, secure decisions about mastery of objectives are not possible.

A CD 21. Comparative Data.


INSTANT WORD RECOGNITION TEST Dreier Educational Systems, 1971

DESCRIPTION

This test measures pupils' sight recognition of a 600-word basic 1st-4th grade vocabulary. It contains 48 multiple choice items that are arranged in increasing difficulty. An alternate form of this test is available by using a second orally presented word list with the same answer sheet.

PRICES

Self-scoring test forms are 17¢ each in sets of 30. The administrator's manual, which contains the 600-word basic vocabulary as well as directions for both forms, is included with an order of tests. A specimen set containing samples of this and several other tests by the publisher is available for $2.95. Date of information: 1978.

FIELD TEST DATA

On the basis of a field test of 153 first graders, a mean score of 11.1 correct and a correlation of +.77 with a standardized test are reported.

ADMINISTRATION

Group administration by a teacher is intended.

SCORING

The pupils' answer sheets are self-scoring.

COMMENTS

No objective is stated as such, but the criterion pool of words is given in the manual.


INSTANT WORD RECOGNITION TEST Dreier Educational Systems, 1973

MEASUREMENT PROPERTIES

A BO 1. Description.

A 2. Agreement. No data.

A 0 3. Representativeness. No data.

A 0 4. Sensitivity. No data.

A 0 5. Item Uniformity. No data.

A () 6. Divergent Validity. No data.

A 0 7. Bias. No data.

A 0 8. Consistency. No data.

APPROPRIATENESS AND USABILITY

OB C 9. Instructions.

A () 10. Item Review.

11. Visibility. There are too many items per page for first graders.

C 12. Responding. Except for #11 above.

C 13. Informativeness. The promotional and sample materials tell what the test is like. But on looking at the test and manual, it is hard to know what a test score means.

A 14. Curriculum Cross-Referencing.

A B© 15. Flexibility.

C 16. Alternate Forms.

C 17. Administration.

AOC 18. Scoring.

A 0 19. Record Keeping. Only spaces for raw scores on the answer sheet are provided.

A QD 20. Decision Rules.

A 21. Comparative Data. The one average score given is based on a small sample.


KEYMATH DIAGNOSTIC ARITHMETIC TEST by Austin J. Connolly, William Nachtman, & E. Milo Pritchett

American Guidance Service, 1971 (Metric Supplement, 1976)

DESCRIPTION

KeyMath is a diagnostic battery intended for individually testing pupils in grades K-6. At the level of specific objectives, there are 209 objectives, each with one test item. Objectives (items) are grouped into subtests as follows: numeration, fractions, geometry and symbols, the four basic operations, mental computation, numerical reasoning, word problems, missing elements, money, measurement, and time. Subtests have 7 to 27 items that are arranged on a scale of progressive difficulty as determined by Rasch-Wright item analysis methods. Within subtests, items are grouped into "instructional clusters" of an average of 2 to 3 items. A 31-item metric supplement is also offered.

PRICES

A complete Test Kit is $26.50. The price of each component item, if ordered separately, is as follows: Reusable Easel-Kit at $21.50, examiner's manual at $2.85, and Diagnostic Records per package of 25 at $4.55. The Metric Supplement Manual and Test items sell for $4.25, and the response forms for it are $2.50 per package of 25. Date of information: 1978.

FIELD TEST DATA

Over 2000 pupils in a national sample were field tested, 1222 of them for norming KeyMath. Grade equivalents and W-scale values for total scores and for each individual item are given.

ADMINISTRATION

KeyMath is made for individual testing. Estimated testing time is 30 minutes per pupil.

SCORING

Stimulus pictures have correct answers printed on the flip side for immediate scoring and recording.

COMMENTS

Publisher says that the test may be used for remedial purposes above grade 6. Fall and spring percentile norms and normal curve equivalents for KeyMath were expected to be available by the time this volume is published.


KEYMATH DIAGNOSTIC ARITHMETIC TEST by Austin J. Connolly, William Nachtman, & E. Milo Pritchett

American Guidance Service, 1971 (Metric Supplement, 1976)

MEASUREMENT PROPERTIES

A B@ 1. Description. There are objectives for individual items, but not for clusters of items.

A @ 2. Agreement. No data.

A @ 3. Representativeness.

A @ 4. Sensitivity.

A @ 5. Item Uniformity. Split-half reliabilities for the 14 subtests range from .23 to .90 within grade, with the median (of all grades) ranging from .64 to .84. (A brief note on the split-half method follows item 8 below.)

A @ 6. Divergent Validity.

A @ 7. Bias.

A @ 8. Consistency. The publisher advises against interpreting its test-retest data as reliabilities owing to the long period which separated the two testings.
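Note on the split-half reliabilities cited in item 5: a split-half coefficient is ordinarily obtained by correlating scores on two halves of a subtest and stepping the result up to full length with the Spearman-Brown formula. The formula below is the standard correction, given here only as general background; it is not quoted from the KeyMath manual.

    r_{full} = \frac{2\, r_{half}}{1 + r_{half}}

where r_{half} is the correlation between the two half-test scores.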

APPROPRIATENESS AND USABILITY

AOC 9. Instructions. Sample items are not given.

(D C 10. Item Review.

(D C 11. Visibility.

C 12. Responding.

C 13. Informativeness. The system is available on approval.

A @ 14. Curriculum Cross-Referencing.

C 15. Flexibility.

A @ 16. Alternate Forms.

GD C 17. Administration.

AC1C 18. Scoring. Machine scoring is not relevant here. Scoring is done on the spot with pre-printed answers on backs of question cards.

A @ 19. Record Keeping. A graphic profile record is provided, but it is keyed to subtests and to individual items, not to instructional clusters.

A 20. Decision Rules. Provided, but without support. Decisions are for subtests, not for "instructional clusters" of items.

A cD 21. Comparative Data. The 1971 norms are given in grade equivalents. Only five school districts took part in the calibration study.


LANGUAGE AND THINKING PROGRAM: MASTERY LEARNING CRITERION TESTS    Follett Publishing Company, 1973

DESCRIPTION

The language and thinking tests measure children's proficiency in selecting pictures of familiar things in response to different categories of verbal instructions. The publisher says that the tests may be used for mastery testing or regrouping. Separate test booklets are provided for each of these groups of verbal concepts: classification, functions, directions/locations, colors/shapes/sizes, actions, and blends (i.e., combinations of two or more features). Designed for children from 3 to 7 years, the tests are almost entirely multiple choice. Each test booklet measures from 6 to 12 objectives, the number of items per objective ranging from two to eight.

PRICES

Consumable test booklets cost from 36¢ to 60¢ each, or $3.42 for the set of 7 (six concept areas plus a practice booklet). Reusable examiner's manuals for each test are from $1.14 to $1.83 each, or $9.96 for the set. Date of information: 1977.

FIELD TEST DATA

Data are not reported, but the commercially available edition of the test that was reviewed by CSE was the field research edition.

ADMINISTRATION

These are group tests which are given by an examiner.

SCORING

Scoring is by hand from keys in the examiner's manuals.

COMMENTS

These tests were developed as part of the Language and Thinking Program of CEMREL, Inc., but are sold separately.


LANGUAGE AND THINKING PROGRAM: MASTERY LEARNING CRITERION TESTS    Follett Publishing Company, 1973

MEASUREMENT PROPERTIES

A B® 1. Description. Several test objectives reflect two or more instructional objectives.

A () 2. Agreement. No data.

A @ 3. Representativeness. No data.

A @ 4. Sensitivity. No data.

A 0 5. Item Uniformity. No data.

A 0 6. Divergent Validity. No data.

A 0 7. Bias. No data.

A 0 8. Consistency. No data.

APPROPRIATENESS AND USABILITY

A B0 9. Instructions. The language of the instructions is advanced for pupils of this age.

A @ 10. Item Review.

0 C 11. Visibility.

0 C 12. Responding.

0 C 13. Informativeness.

A a 14. Curriculum Cross-Referencing. Tests are keyed to the specific language and thinking instructional package with which they were developed.

A® C 15. Flexibility. The different concept areas may be tested separately, but there is only one level for each.

A @ 16. Alternate Forms.

0 C 17. Administration.

ACID 18. Scoring.

0 C 19. Record Keeping.

A @ 20. Decision Rules.

A cp 21. Comparative Data. Not available for the Field Research Edition.


LANGUAGE ARTS: COMPOSITION, LIBRARY, AND LITERARY SKILLS (K-6)    Instructional Objectives Exchange, 1973

DESCRIPTION

This battery is a collection of multiple choice and fill-in tests measuring 16 objectives on composition, 10 on library skills, and 6 on literary skills. There are from five to ten items per objective. Tests for each objective are printed on spirit masters for local duplication and scoring. Two alternate forms of this collection are available.

PRICES

Each form of this test collection sells for $29.95, which includes the manual and record forms. The price per pupil will vary with the number of copies made from each spirit master and the number of objectives that are used. Date of information: 1979.

FIELD TEST DATA

Publisher reports that each test was tried out on at least five students in an elementary school in Los Angeles. Data are not reported.

ADMINISTRATION

These are group tests.

SCORING

Answer keys are provided in the manual for hand scoring.

COMMENTS

Leveling of the tests in this collection according to content, format, etc., is done locally by the test user. The publisher also offers a customized CRT service.


LANGUAGE ARTS: COMPOSITION, LIBRARY, AND LITERARY SKILLS (K-6)    Instructional Objectives Exchange, 1973

MEASUREMENT PROPERTIES

0 B C 1. Description. Amplified objectives: rules for sampling each domain are not given though.

A CD 2. Agreement. A review is reported but not described.

A CD 3. Representativeness. No data.

A CD 4. Sensitivity. No data.

A (g) 5. Item Uniformity. No data.

A 0 6. Divergent Validity. No data.

A 0 7. Bias. No data.

A 0 8. Consistency. No data.

APPROPRIATENESS AND USABILITY

0 B C 9. Instructions.

0 C 10. Item Review.

0 C 11. Visibility.

0 C 12. Responding.

0 C 13. Informativeness.

A @ 14. Curriculum Cross-Referencing..

GB C 15. Flexibility.

0 C 16. Alternate Forms.

0 C 17. Administration.

A BO 18. Scoring. The one-page scoring guide contains keys for all 32 tests in small print.

G C 19. Record Keeping.

A O 20. Decision Rules.

A O 21. Comparative Data.


LANGUAGE ARTS: MECHANICS AND USAGE (K-6)    Instructional Objectives Exchange, 1973

DESCRIPTION

This battery is a collection of fill-in and multiple choice tests measuring 10 objectives in mechanics (capitalization and punctuation) and 23 objectives in usage (plurals, possessives, modifiers, verb agreement, irregular verbs, and commonly confused words). There is an average of more than eight items per objective. Tests for each objective are printed on spirit masters for local duplication and scoring. Two alternate forms of this collection are available.

PRICES

Each form of this test collection sells for $29.95, which includes the manual and record forms. The price per pupil will vary with the number of copies made from each spirit master and the number of objectives that are used. Date of information: 1979.

FIELD TEST DATA

Publisher reports that each test was tried out on at least five students in an elementary school in Los Angeles. Data are not reported.

ADMINISTRATION

These are group tests.

SCORING

Answer keys are provided in the manual for hand scoring.

COMMENTS

Leveling of the tests in this collection according to content, format, etc., is done locally by the test user. The publisher also offers a customized CRT service.


LANGUAGE ARTS: MECHANICS AND USAGE (K-6)    Instructional Objectives Exchange, 1973

MEASUREMENT PROPERTIES

(DB C 1. Description. Amplified objectives: rules for sampling each domain are not given though.

A cp 2. Agreement. A review is reported, but not described.

A cD 3. Representativeness. No data.

A cD 4. Sensitivity. No data.

A CD 5. Item Uniformity. No data.

A cD 6. Divergent Validity. No data.

A 7. Bias. No data.

A 8. Consistency. No data.

APPROPRIATENESS AND USABILITY

0 B C 9. Instructions.

CD c 10. Item Review.

0 C 11. Visibility.

0 C 12. Responding.

0 C 13. Informativeness.

A 14. Curriculum Cross-Referencing.

B C 15. Flexibility.

0 C 16. Alternate Forms.

0 C 17. Administration

A BO 18. Scoring. Keys for all 33 objectives are printed on two pages in small type.

C 19. Record Keeping.

A 20. Decision Rules.

A 0 21. Comparative Data.


LANGUAGE ARTS: WORD FORMS AND SYNTAX (K-6)    Instructional Objectives Exchange, 1973

DESCRIPTION

This battery is a collection of selected response tests measuring 15 objectives dealing with word form and 27 objectives dealing with syntax. There are five to ten items per objective. Tests for each objective are printed on spirit masters for local duplication and scoring. Two alternate forms of this collection are available.

PRICES

Each form of this test collection sells for $29.95, which includes the manual and record forms. The price per pupil will vary with the number of copies made from each spirit master and the number of objectives that are used. Date of information: 1979.

FIELD TEST DATA

Publisher reports that each test was tried out on at least five pupils in an elementary school in Los Angeles. Data are not reported.

ADMINISTRATION

These are group tests.

SCORING

Answer keys are provided in the manual for hand scoring.

COMMENTS

Leveling of these tests according to content, format, etc., is done locally by the user. The publisher also offers a customized CRT service.


LANGUAGE ARTS: WORD FORMS AND SYNTAX (K-6)    Instructional Objectives Exchange, 1973

MEASUREMENT PROPERTIES

(3)B C 1. Description. Amplified objectives: rules for sampling each domain are not given though.

A 2. Agreement. A review is mentioned but not described.

A 3. Representativeness. No data.

A 4. Sensitivity. No data.

A 5. Item Uniformity. No data.

A 6. Divergent Validity. No data.

A 7. Bias. No data.

A 8. Consistency. No data.

APPROPRIATENESS AND USABILITY

(DB C 9. Instructions.

C 10. Item Review.

11. Visibility.

12. Responding.

C) .0 13. Informativeness.

A el 14. Curriculum Cross-Referencing.

(DB C 15. Flexibility.

0 16. Alternate Forms.

0 17. Administration.

A BO 18. Scoring. Keys for all 42 objectives are printed on three pages in small type.

0 C 19. Record Keeping.

A 20. Decision Rules.

A C) 21. Comparative Data.


MASTERY: AN EVALUATION TOOL (MATHEMATICS)    Science Research Associates, 1974-75

DESCRIPTION

Mastery (Math) is a nine-level battery of tests in math for grades K-8. There are 15 to 40 objectives per level with three multiple choice items per objective. The following skill areas are covered by the catalog (that is, ready-made) tests: for K-2--numbers and numerals, whole-number computation, measurement, sets, logical thinking, and geometry; for 3-8--whole numbers, fractional numbers, integers, rational and real numbers, geometry, measurement, sets, functions, graphing, statistics, probability, logic, and flow charts. Two alternate forms are available.

PRICES

Test booklets are 55¢ to 79¢ each per level in sets of 25, the lower three levels being consumable. Answer sheets are 13¢ each in sets of 100. An examiner's manual, which is included with an order of tests, is available separately for 70¢. Catalogues of Mastery (Math) objectives are available at $2.20 for the K-2 set and $3.55 for the 3-8 set. Specimen set for K-2 is $5.00 and for 3-9 it is $5.25. Date of information: 1977.

FIELD TEST DATA

A technical report is available from SRA giving item difficulties, item/test correlations, and KR-20s for each test level. Data come from "a cross-section of SRA test users." Numbers of test takers average about 3000 per level for form X and 475 for form Y.
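As background for the KR-20 values mentioned above, the standard Kuder-Richardson formula 20 for internal consistency is sketched below; it is a textbook definition, not material taken from the SRA technical report.

    KR\text{-}20 = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} p_i q_i}{\sigma_X^{2}}\right)

where k is the number of items, p_i is the proportion of examinees answering item i correctly, q_i = 1 - p_i, and \sigma_X^{2} is the variance of total scores.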

ADMINISTRATION

Mastery (Math) is a battery of group tests designed to be given by a teacher. Estimated testing time is three minutes per objective.

SCORING

Keys are provided for hand scoring and a machine scoring service is offered. For a price per pupil of 98¢ to $1.40 the user receives profiles for individual pupils and for the total group.

COMMENTS

The publisher offers a customized CRT service as well as catalog (ready-made) tests.


MASTERY: AN EVALUATION TOOL (MATHEMATICS)    Science Research Associates, 1974-75

MEASUREMENT PROPERTIES

A® C 1. Description.

A c) 2. Agreement. A review of items for their content validity is mentioned, but not described.

A C) 3. Representativeness. No data.

A e 4. Sensitivity. No data.

A 0 5. Item Uniformity. Point biserial correlations of items with the total test score have a median of about .4, and the KR-20s for test levels have a median of .95.

A cD 6. Divergent Validity. No data.

A 7. Bias. No data.

A e 8. Consistency. No data.

APPROPRIATENESS AND USABILITY

C 9. Instructions.

A e 10. Item Review. A review is mentioned, but not described in any detail.

C 11. Visibility.

C 12. Responding.

C 13. Informativeness. Tests are available on 30-day approval.

® c 14. Curriculum Cross-Referencing. Available separately.

A® C 15. Flexibility. Catalog tests cover similar objectives at several levels, but all objectives are in one booklet per level.

16. Alternate Forms.

17. Administration.

A BO 18. Scoring. Both machine and hand scoring are available, but hand scoring does not appear easy.

A 0 19. Record Keeping. If the scoring service is purchased, detailed records are provided.

C 20. Decision Rules. For each three-item objective, the probabilities of attaining scores of 0 to 3 by guessing are provided. (An illustrative calculation follows this list.)

A e 21. Comparative Data. Although the samples for the item-difficulty data in the technical report are large, the publisher does not claim that they are necessarily representative of the nation.
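Illustration for item 20 above. Assuming four-choice items (an assumption made only for this example; the review does not state the number of answer choices), the chance of guessing any one item correctly is 1/4, and the number of correct guesses X on a three-item objective is binomial:

    P(X = x) = \binom{3}{x}\left(\tfrac{1}{4}\right)^{x}\left(\tfrac{3}{4}\right)^{3 - x}, \qquad P(X = 3) = \tfrac{1}{64} \approx .016, \qquad P(X \ge 2) = \tfrac{10}{64} \approx .156

Figures of this kind show how a passing standard of 2 or 3 correct out of 3 keeps the chance of passing an objective purely by guessing fairly small.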


MASTERY: AN EVALUATION TOOL (SOBAR READING)    Science Research Associates, 1975

DESCRIPTION

SOBAR (System for Objective Based Assessment of Reading) is a ten-level battery for testing the following reading skills in grades K-9: letter recognition, phonic analysis, structural analysis, vocabulary, comprehension, and study skills. There are three multiple choice items per objective, the number of objectives ranging from 23 at level K to 35 at the upper levels. Two alternate forms are available.

PRICES

In sets of 25, test booklets range from 79¢ per pupil for the lower three levels (consumable) to 55¢ for the upper levels (reusable). The examiner's manual, which comes with an order of test booklets, may be bought separately for 70¢ depending on the level. Answer sheets are 13¢ each in packages of 100. Catalogs of SOBAR objectives cost approximately $2.95 each, there being a K-2 and a 3-9 catalog. Specimen set for K-2 sells for $5.00 and for 3-9 it is $5.25. Date of information: 1977.

FIELD TEST DATA

A technical report is available from SRA giving difficulty statistics for each item, point-biserials for each item, and KR-20s for each test level. Numbers of test takers averaged about 3200 per level for form L and 450 per level for form M.
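As background for the point-biserial statistics mentioned above, the standard formula for the point-biserial correlation of a dichotomously scored item with the total score is sketched below; it is a general definition, not a quotation from the SRA report.

    r_{pb} = \frac{M_1 - M_0}{s_X}\,\sqrt{p\,q}

where M_1 and M_0 are the mean total scores of pupils answering the item correctly and incorrectly, s_X is the standard deviation of total scores, p is the proportion answering the item correctly, and q = 1 - p.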

ADMINISTRATION

SOBAR is a battery of group tests to be administered by the teacher. Estimated testing time is three minutes per objective.

SCORING

Keys are provided for hand scoring, and a machine scoring service is offered. For a per pupil price of $.98-$1.40, the buyer receives profiles for individual pupils and for the group.

COMMENTS

In addition to the catalog (ready-made) tests, the publisher offers a customized CRT service.


MASTERY: AN EVALUATION TOOL (SOBAR READING)    Science Research Associates, 1975

MEASUREMENT PROPERTIES

A® C 1. Description.

A (0 2. Agreement. A review of items for their congruence with their objectives is mentioned but not described.

A 6D 3. Representativeness. No data.

A 6D 4. Sensitivity. No data.

A 6D 5. Item Uniformity. Point biserial correlations of items with total test scores are reported; KR-20s for test levels have a median of .94.

A 6D 6. Divergent Validity. No data.

A 6D 7. Bias. A review of the items for racial and sexual bias is mentioned but not described.

A © 8. Consistency. No data.

APPROPRIATENESS AND USABILITY

OB C 9. Instructions.

A 6D 10. Item Review. A review is mentioned but not described in any detail.

C 11. Visibility.

6D C 12. Responding.

6) C 13. Informativeness.

C 14. Curriculum Cross-Referencing. Available separately.

AOC 15. Flexibility. Catalog (ready made) tests are in one booklet per level. Objectives are covered at several levels.

6) C 16. Alternate Forms.

6) C 17. Administration.

A BC) 18. Scoring. Both machine and hand scoring are available, but hand scoring does not appear easy.

A 6D 19. Record Keeping. If the scoring service is purchased, detailed records are provided.

C 20. Decision Rules. For each three-item objective, the probabilities of a pupil getting scores of 0-3 by guessing are provided.

A 6D 21. Comparative Data. Although the samples for the item-difficulty data in the technical report are large, the publisher does not claim that they are necessarily representative of the nation.


MATH DIAGNOSTIC/PLACEMENT TESTS    U-SAIL (Utah System Approach to Individualized Learning Project), 1975

DESCRIPTION

The U-SAIL Math Tests make up a six-level battery for pupils in grades 1-6 on the following concepts: whole numbers, basic operations with integers, basic operations with fractions and decimals, sets, measurement, geometry, graphs and functions, ratio and proportion, and percent. There are 10 to 17 objectives per level with five multiple choice items per objective. These tests are part of a math curriculum which includes instructional materials and other resources for teachers.

PRICES

Consumable tests for the lower three levels range from 24¢ to 37¢ per pupil in sets of 35, and reusable tests for the upper levels range from 22¢ to 24¢ per pupil in the same quantity. The teacher's manual is 75¢. A complete set of all 35 copies of all the levels is $56.00. Date of information: 1978.

FIELD TEST DATA

U-SAIL provided CSE with some unpublished data on item difficulties and inter-item correlations within each objective for the lower four levels. The number of pupils per item was 223 to 249. It is these data that are referred to below in the comments on standards 5 and 21. Test data were used for revision of the materials.

ADMINISTRATION

U-SAIL tests are designed for group administration.

SCORING

Templates for hand scoring are provided with the test booklets.

COMMENTS

This test battery was developed by a consortium of school districts.


MATH DIAGNOSTIC/PLACEMENT TESTS    U-SAIL (Utah System Approach to Individualized Learning Project), 1975

MEASUREMENT PROPERTIES

A B@ 1. Description. For users of the U-SAIL math program, the ratings on test features #1-3 would be higher, since the items are systematically sampled from the domains that make up the curriculum. For the general test buyer, the scope and sequence chart gives only brief descriptions of the math objectives.

A @ 2. Agreement.

A @ 3. Representativeness.

A @ 4. Sensitivity. The unpublished data of pupil gains are not clearly free from well-known problems in measurement.

5. Item Uniformity. Part-whole correlations per objective are reported for the lower four levels. Most are in the .6 to .7 range.

A C) 6. Divergent Validity. No data.

A 7. Bias. Unpublished information from the developer refers to studies to ensure lack of bias, but details are lacking.

A 8. Consistency. The unpublished data provided by the developer were not complete enough to evaluate.

APPROPRIATENESS AND USABILITY

OB C 9. Instructions.

C 10. Item Review.

11. Visibility.

° 12. Responding.

A 13. Informativeness.

A 14. Curriculum Cross-Referencing. Although the developer does not provide a curriculum index for these tests, it states that many publishers of math programs do index their text series to the U-SAIL objectives.

A@C 15. Flexibility. Each objective is covered at only one level, but the use of more than one level of test with individual pupils is suggested.

A @ 16. Alternate Forms.

C 17. Administration.

A@C 18. Scoring. By hand template.

0 C 19. Record Keeping.

A q.) 20. Decision Rules. Three levels of attainment are described, but not supported.

A @ 21. Comparative Data. The publisher has some comparative data, but does not routinely provide them to test buyers. The pupils were from a geographically limited area.


MATHEMATICS: ELEMENTS, SYMBOLISM, AND MEASUREMENT (7-9)    Instructional Objectives Exchange, 1974

DESCRIPTION

This battery is a collection of multiple choice and fill-in tests dealing with 43 objectives in the following skill areas: integers, rational numbers, real numbers, numeration, measurement, and sentences and logic. The items for each objective are printed on separate spirit masters for local duplication and scoring. Number of items per objective ranges from 5 to 10. Two alternate forms of this collection are sold.

PRICES

Each form of this test collection sells for $29.95, which includes the manual and record forms. The price per pupil will vary with the number of copies made from each spirit master and the number of objectives that are used. Date of information: 1979.

FIELD TEST DATA

Preliminary field testing of these materials was done in two schools in Los Angeles.

ADMINISTRATION

These are group tests which may also be self-administered by pupils.

SCORING

Answer keys are provided in the manual for hand scoring.

COMMENTS

The publisher also offers a customized CRT service.


MATHEMATICS: ELEMENTS, SYMBOLISM, AND MEASUREMENT (7-9)    Instructional Objectives Exchange, 1974

MEASUREMENT PROPERTIES

OB C 1. Description. Amplified objectives are given for all tests, but rules for sampling the domains are not.

A 0 2. Agreement. Reviews of agreement are reported, but not described.

A 3. Representativeness. No data.

A 4. Sensitivity. No data.

A 5. Item Uniformity. No data.

A 6. Divergent Validity. No data.

A IQ 7. Bias. No data.

A 13 8. Consistency. No data.

APPROPRIATENESS AND USABILITY

GB C 9. Instructions.

C 10. Item Review.

C 11. Visibility.

C 12. Responding.

0 C 13. Informativeness.

A 0 14. Curriculum Cross-Referencing.

OB C 15. Flexibility. The test for each objective is printed on a separate spirit master.

cDC 16. Alternate Forms.

C 17. Administration.

A B© 18. Scoring. The print is small and crowded on the answer keys.

C 19. Record Keeping.

A 20. Decision Rules.

A 0 21. Comparative Data.


MATHEMATICS: GEOMETRY (K-6)    Instructional Objectives Exchange, 1973

DESCRIPTION

This battery is a collection of multiple choice and fill-in tests dealing with 36 geometry objectives. There are five items per objective, with the tests for each objective being printed on separate spirit masters for local duplication and scoring. Two alternate forms of this collection are available.

PRICES

Each form of this test collection sells for $29.95, which includes the manual and record forms. The price per pupil will vary with the number of copies made from each spirit master and the number of objectives that are used. Date of information: 1979.

FIELD TEST DATA

Preliminary field testing of these materials was done in two schools in Los Angeles.

ADMINISTRATION

These are group tests.

SCORING

Answer keys are provided in the manual for hand scoring.

COMMENTS

The publisher also offers a customized CRT service.


MATHEMATICS: GEOMETRY (K-6)    Instructional Objectives Exchange, 1973

MEASUREMENT PROPERTIES

(DB C 1. Description. Amplified objectives are given for all tests, but rules for sampling the domain are not.

A 2. Agreement. Reviews of agreement are reported, but not described.

A 3. Representativeness. No data.

A 0 4. Sensitivity. No data.

A cp 5. Item Uniformity. No data.

A 0 6. Divergent Validity. No data.

A 7. Bias. No data.

A 8. Consistency. No data.

APPROPRIATENESS AND USABILITY

(DB C 9. Instructions.

(D C 10. Item Review.

C 11. Visibility.

C 12. Responding.

(D C 13. Informativeness.

A © 14. Curriculum Cross-Referencing.

A Be 15. Flexibility. The test for each objective is printed on a separate spirit master.

C 16. Alternate Forms.

C 17. Administration.

A Be 18. Scoring. The print is small and crowded on the answer keys.

C 19. Record Keeping..

A © 20. Decision Rules.

A 21. Comparative Data.


MATHEMATICS: GEOMETRY, OPERATIONS, AND RELATIONS (7-9)    Instructional Objectives Exchange, 1973

DESCRIPTION

This battery is a collection of multiple choice and fill-in tests covering 48 objectives in the following skill areas: geometry, operations and properties, statistics, ratios and proportions, and graphs. There are at least five items per objective, the tests for each objective being printed on separate spirit masters for local duplication and scoring. Two alternate forms of this collection are sold.

PRICES

Each form of this test collection sells for $29.95, which includes the manual and record forms. The price per pupil will vary with the number of copies made from each spirit master and the number of objectives that are used. Date of information: 1979.

FIELD TEST DATA

Preliminary field testing of these materials was done in two schools in Los Angeles.

ADMINISTRATION

These tests may be administered to groups by an examiner, and may be self-administered by pupils.

SCORING

Answer keys are provided in the manual for hand scoring.

COMMENTS

The publisher also offers a customized CRT service.


MATHEMATICS: GEOMETRY, OPERATIONS, AND RELATIONS (7-9)    Instructional Objectives Exchange, 1973

MEASUREMENT PROPERTIES

(DB C 1. Description. Amplified objectives are given for all tests, but rules for sampling the domains are not.

A 2. Agreement. Reviews of agreement are reported, but not described.

A C) 3. Representativeness. No data.

A 4. Sensitivity. No data.

A 5. Item Uniformity. No data.

A 6. Divergent Validity. No data.

A 7. Bias. No data.

A 0 8. Consistency. No data.

APPROPRIATENESS AND USABILITY

(DB C 9. Instructions.

C 10. Item Review.

0 C 11. Visibility.

C 12. Responding.

0 C 13. Informativeness.

A 13 14. Curriculum Cross-Referencing.

OB C 15. Flexibility. The test for each objective is printed on a separate spirit master.

0 C 16. Alternate Forms.

0 C 17. Administration.

A Be 18. Scoring. The print is small and crowded on the answer keys.

0 C 19. Record Keeping.

A 20. Decision Rules.

A 21. Comparative Data.


MATHEMATICS: MEASUREMENT (K-6)    Instructional Objectives Exchange, 1973

DESCRIPTION

This battery is a collection of multiple choice and fill-in tests covering 38 elementary level objectives in measurement. There are five items per objective, the test for each objective being printed on separate spirit masters for local duplication and scoring. Two alternate forms of this collection are available.

PRICES

Each form of this test collection sells for $29.95, which includes the manual and record forms. The price per pupil will vary with the number of copies made from each spirit master and the number of objectives that are used.

Date of information: 1979.

FIELD TEST DATA

Preliminary field testing of these materials was done in two schools in Los Angeles.

ADMINISTRATION

These are group tests.

SCORING

Answer keys are provided in the manual for hand scoring.

COMMENTS

The publisher also offers a customized CRT service.


MATHEMATICS: MEASUREMENT (K-6)    Instructional Objectives Exchange, 1973

MEASUREMENT PROPERTIES

OB C 1. Description. Amplified objectives are given for all tests, but rules for sampling the domains are not.

A C) 2. Agreement. Reviews of agreement are reported, but not described.

A () 3. Representativeness. No data.

A 0 4. Sensitivity. No data.

A () 5. Item Uniformity. No data.

A () 6. Divergent Validity. No data.

A © 7. Bias. No data.

A C) 8. Consistency. No data.

APPROPRIATENESS AND USABILITY

OB C 9. Instructions.

0 C 10. Item Review.

0 C 11. Visibility.

0 C 12. Responding.

() C 13. Informativeness.

A C) 14. Curriculum Cross-Referencing.

OB C 15. Flexibility. The test for each objective is printed on a separate spirit master.

C 16. Alternate Forms.

0 C 17. Administration.

A BC) 18. Scoring. The print is small and crowded on the answer keys.

C) C 19. Record Keeping.

A 20. Decision Rules.

A 0 21. Comparative Data.


MATHEMATICS: NUMERATION AND RELATIONS (K-6)    Instructional Objectives Exchange, 1973

DESCRIPTION

This battery is a collection of multiple choice and fill-in tests covering 38 objectives in the following skill areas: numeration, ratios and proportions, graphs, statistics and probability, and logic. There are five to ten items per objective. The items for each objective are printed on separate spirit masters for local duplication and scoring. Two alternate forms of the collection are sold.

PRICES

Each form of this test collection sells for $29.95, which includes the manual and record forms. The price per pupil will vary with the number of copies made from each spirit master and the number of objectives that are used. Date of information: 1979.

FIELD TEST DATA

Preliminary field testing of the materials was done in two schools in Los Angeles.

ADMINISTRATION

These are group tests.

SCORING

Answer keys are provided in the manual for hand scoring.

COMMENTS

The publisher also offers a customized CRT service.


MATHEMATICS: NUMERATION AND RELATIONS (K-6)    Instructional Objectives Exchange, 1973

MEASUREMENT PROPERTIES

0 B C 1. Description. Amplified objectives are given for all tests, but rules for sampling the domains are not.

A 2. Agreement. Reviews of agreement are reported, but not described.

A () 3. Representativeness. No data.

A () 4. Sensitivity. No data.

A () 5. Item Uniformity. No data.

A () 6. Divergent Validity. No data.

A () 7. Bias. No data.

A () 8. Consistency. No data.

APPROPRIATENESS AND USABILITY

0 B C 9. Instructions.

C 10. Item Review.

C 11. Visibility.

0 C 12. Responding.

C 13. Informativeness.

A 14. Curriculum Cross-Referencing.

0 B C 15. Flexibility. The test for each objective is printed on a separate spirit master.

0 16. Alternate Forms.

0 C 17. Administration.

A BO 18. Scoring. The print is small and crowded on the answer keys.

0 C 19. Record Keeping.

A 0 20. Decision Rules.

A 0 21. Comparative Data.


MATHEMATICS: OPERATIONS AND PROPERTIES (K-6)    Instructional Objectives Exchange, 1974

DESCRIPTION

This battery is a collection of multiple choice and fill-in tests dealing with 40 objectives on the four basic operations--addition, subtraction, multiplication, and division--using integers, fractions, and decimals. There are five items per objective. The tests for each objective are printed on separate spirit masters for local duplication and scoring. Two alternate forms of this collection are available.

PRICES

Each form of this test collection sells for $29.95, which includes the manual and record forms. The price per pupil will vary with the number of copies made from each spirit master and the number of objectives that are used. Date of information: 1979.

FIELD TEST DATA

Preliminary field testing of the materials was done in two Los Angeles schools. After publication, performance data on 200 to 600 pupils per objective were gathered.

ADMINISTRATION

These are group tests.

SCORING

Answer keys are provided in the manual for hand scoring. Comparative data are given in the form of cumulative percentages of pupils at each of two to four grades attaining each possible score for each objective. Pupils were tested in the fall, so the publisher reports data for each group as year-end results for the previous grade.

COMMENTS

The publisher also offers a customized CRT service.


MATHEMATICS: OPERATIONS AND PROPERTIES (K-6)    Instructional Objectives Exchange, 1974

MEASUREMENT PROPERTIES

(DB C 1. Description. Amplified objectives are given for all tests, but rules for sampling the domains are not.

2. Agreement. Reviews of agreement are reported but not described.

3. Representativeness. No data.

4. Sensitivity. No data.

5. Item Uniformity. No data.

6. Divergent Validity. No data.

7. Bias. No data.

8. Consistency. No data.

APPROPRIATENESS AND USABILITY

OB C 9. Instructions.

C 10. Item Review.

0 C 11. Visibility.

0 C 12. Responding..

0 C 13. Informativeness.

A 14. Curriculum Cross-Referencing.

C)B C 15. Flexibility. The test for each objective is printed on a separate spirit master.

() C 16. Alternate Forms.

0 C 17. Administration.

A BC) 18. Scoring. The print on the answer keys is small and crowded.

0 C 19. Record Keeping.

A q) 20. Decision Rules.

A q) 21. Comparative Data. Data are provided, but the samples are not large and are all from urban settings in Southern California.


MATHEMATICS: SETS AND NUMBERS (K-6)    Instructional Objectives Exchange, 1973

DESCRIPTION

This battery is a collection of multiple choice and fill-in tests dealing with 35 objectives in the following skill areas: sets, whole numbers, and rational numbers. There are five items per objective. Tests for each objective are printed on separate spirit masters for local duplication and scoring. Two alternate forms of this collection are sold.

PRICES

Each form of this test collection sells for $29.95, which includes the manual and record forms. The price per pupil will vary with the number of copies made from each spirit master and the number of objectives that are used. Date of information: 1979.

FIELD TEST DATA

Preliminary field testing of these materials was done in two schools in Los Angeles.

ADMINISTRATION

These are group tests.

SCORING

Answer keys are provided in the manual for hand scoring.

COMMENTS

The publisher also offers a customized CRT service.


MATHEMATICS: SETS AND NUMBERS (K-6)    Instructional Objectives Exchange, 1973

MEASUREMENT PROPERTIES

OB C 1. Description. Amplified objectives are given for all tests, but rules for sampling the domains are not.

A 0 2. Agreement. Reviews of agreement are reported but not described.

A 0 3. Representativeness. No data.

A 0 4. Sensitivity. No data.

A 0 5. Item Uniformity. No data.

A 0 6. Divergent Validity. No data.

A 7. Bias. No data.

A 8. Consistency. No data.

APPROPRIATENESS AND USABILITY

0 B C 9. Instructions.

0 C 10. Item Review.

C 11. Visibility.

C 12. Responding.

0 C 13. Informativeness.

A 0 14. Curriculum Cross-Referencing.

(DB C 15. Flexibility. The test for each objective is printed on separate spirit masters.

® C 16. Alternate Forms.

® C 17. Administration.

A BO 18. Scoring. The print is small and crowded on the answer keys.

® C 19. Record Keeping.

A 20. Decision Rules.

A 21. Comparative Data.


MCGUIRE-BUMPUS DIAGNOSTIC COMPREHENSION TEST    Croft Educational Services, 1971-72

DESCRIPTION

The McGuire-Bumpus tests are a two-level battery for primary and intermediate pupils which measure the following types of reading comprehension skills: literal, interpretive, analytic, and critical. The number of objectives for each of these skill types is respectively 4, 3, 3, and 2 at each level, each objective having 12 multiple choice items. Tests are printed on spirit masters for local duplication and scoring. Alternate forms are available. An optional curriculum index is offered.

PRICES

The book of spirit masters for one form of the tests costs $26.00. Prices per test per pupil will vary with the number of objectives tested and number of copies made from each spirit master. The administrator's manual, which contains scoring keys, costs $8.00. Scoring overlays may be ordered at $89.00 for one test form. Class record charts are $2.00 each in sets of 20, and individual pupil records are 12¢ each in sets of 50. Cassettes for administering the tests are $29.00 per set. The curriculum index sells for $49.00. Date of information: 1978.

ADMINISTRATION

These tests are made for group administration by a teacher or for self-administration by cassette recorder.

SCORING

Hand scoring is done with answer keys in the manual or with optional overlays.

COMMENTS

The test battery by itself lacks explanatory and interpretive information.


MCGUIRE-BUMPUS DIAGNOSTIC COMPREHENSION TEST    Croft Educational Services, 1971-72

MEASUREMENT PROPERTIES

ACIDC 1. Description.

A el 2. Agreement. No data.

A cp 3. Representativeness. No data.

A C) 4. Sensitivity. No data.

A C) 5. Item Uniformity. No data.

A 6. Divergent Validity. No data.

A c) 7. Bias. No data.

A C) 8. Consistency. No data.

APPROPRIATENESS AND USABILITY

B C 9. Instructions.

A 10. Item Review.

C 11. Visibility.

C 12. Responding.

® C 13. Informativeness.

C 14. Curriculum Cross-Referencing.

AOC 15. Flexibility. Each objective is tested at two levels, but individual objectives are not separately testable.

G c 16. Alternate Forms.

c 17. Administration.

AOC 18. Scoring. Overlays are available. The keys in the manual are not so easy to use.

C 19. Record Keeping.

A C) 20. Decision Rules. Rules are given without support.

A 21. Comparative Data.


NEW MEXICO CAREER EDUCATION TEST SERIES by C. C. Healy & S. P. Klein    Monitor, 1973

DESCRIPTION

The New Mexico Career Education Test Series is a battery of tests dealing with career related attitudes, knowledge, and activities for pupils in grades 9-12. The four cognitive tests deal with these subjects: career planning, knowledge of occupations, job application procedures, and career development. Each of these tests has 20 to 25 multiple choice items divided among two or three sub-objectives. Two forms of the career planning test are offered.

PRICES

Reusable booklets for each of the tests are 24¢ per pupil in sets of 35. Answer sheets are 6¢ each in like sets. The examiner's manual for the series is $2.50 and separate answer keys are $1.00 per test. A specimen set is $3.75 for each test and $17.50 for the series. Date of information: 1978.

FIELD TEST DATA

Each of the tests was given to a sample of at least 500 ninth graders and 1200 twelfth graders in New Mexico. Item difficulties, point biserials, and norms are given for all tests.

ADMINISTRATION

These tests are designed to be given to groups. They are timed, taking 20 minutes each.

SCORING

Tests are scored by hand with templates.

COMMENTS

Eight of the items on the career development test measure an affective objective.


NEW MEXICO CAREER EDUCATION TEST SERIES by C. C. Healy & S. P. Klein    Monitor, 1973

MEASUREMENT PROPERTIES

A B® 1. Description.

A IQ 2. Agreement.

A @ 3. Representativeness.

A CO 4. Sensitivity. Small but statistically reliable differences in the scores of 9th and 12th graders are reported. Whether these differences are due to instruction cannot be determined from the data.

A 10) 5. Item Uniformity. Internal consistency measures range from .51 to .87 for the separate tests, but the data are for total tests, not for the separate objectives. An average of five items per test have correlations with the total test score of less than .3.

A @ 6. Divergent Validity.

A @ 7. Bias.

C 8. Consistency.

APPROPRIATENESS AND USABILITY

C)B C 9. Instructions.

A 0 10. Item Review.

A 11. Visibility. Print size in the test items is small.

C 12. Responding.

C 13. Informativeness.

A © 14. Curriculum Cross-Referencing.

AOC 15. Flexibility. The series has four separately sold components, each with 2-3 sub-objectives.

A 16. Alternate Forms. Onlyone of the tests hastwo forms.

0 C 17. Administration.

ACIC 18. Scoring. By template,

A 19. Record'Keeping.

A 20. Decision Rules.

A 21. Comparative Data. Norming samples range from 500 to 2500 pupils, all from New Mexico.


NEW MEXICO CONCEPTS OF ECOLOGY TEST Monitor, 1973

DESCRIPTION

The Concepts of Ecology Tests are a two-level battery of survey tests in ecology for grades 6-12. Each level has 20 items and deals with 5 to 7 "knowledge areas." There are 2 to 6 multiple choice items per knowledge area.

PRICES

Reusable test booklets are 24¢ per pupil in sets of 35 and answer sheets are 6¢ each in like sets. The examiner's manual is $1.50 and answer keys are $1.00 per level. A specimen set is available at $3.00 per level. Date of information: 1978.

FIELD TEST DATA

The lower level was field tested on 1,040 sixth grade students, the upper level on 2,389 12th graders, both groups in New Mexico. Difficulties and other statistics are reported for each item, as are internal consistencies and norms for the whole test.

ADMINISTRATION

These tests are designed for group administration. They are timed, taking 20 minutes each.

SCORING

Scoring is done by hand with a template.


NEW MEXICO CONCEPTS OF ECOLOGY TEST Monitor, 1973

MEASUREMENT PROPERTIES

A B @ 1. Description.

A ® 2. Agreement. No data.

A @ 3. Representativeness.

A ® 4. Sensitivity. An average superiority of about two items correct for 12th graders over 9th graders is reported, but that gain is not clearly attributable to instruction.

A q) 5. Item Uniformity. Internal consistencies of .67 and .74 are reported for the total test, but consistencies by "knowledge area" are not given. Four to five items per test have correlations with the total test score of less than .3.

A @ 6. Divergent Validity.

A @ 7. Bias. No data.

A @ 8. Consistency. No data.

APPROPRIATENESS AND USABILITY

(DB C 9. Instructions.

A @ 10. Item Review.

C 11. Visibility.

0 C 12. Responding.

0 C 13. Informativeness.

A 0 14. Curriculum Cross-Referencing.

A B C 15. Flexibility. Not clearly relevant. Four of the knowledge areas are tested on both levels.

A @ 16. Alternate Forms.

0 C 17. Administration.

A0 C 18. Scoring. By hand with templates.

A q) 19. Record Keeping.

A q) 20. Decision Rules.

A q) 21. Comparative Data. The norm samples are from one state, New Mexico.


NEW MEXICO CONSUMER MATHEMATICS TEST & CONSUMER RIGHTS AND RESPONSIBILITIES TEST

Monitor, 1973

DESCRIPTION

There are two New Mexico Consumer Tests, the Consumer Mathematics Test and the Consumer Rights and Responsibilities Test. Designed for pupils in grades 9-12, both contain 20 items. Clusters of generally three items deal with more specific topics such as insurance or unit prices.

PRICES

Reusable booklets for each test are 24¢ per pupil in sets of 35, and answer sheets are 6¢ in like sets. An examiner's manual for each test is $1.50, and the two answer keys are $1.00 each. Specimen sets for each test are $3.00. Date of information: 1978.

FIELD TEST DATA

Each test was field tested on over 800 ninth graders and 2400 twelfth graders in New Mexico. Difficulties and other statistics are reported for each item, as are norms and internal consistencies for the total test.

ADMINISTRATION

These are designed for group administration. Testing time is 20 minutes foreach.

SCORING

Templates are available for hand scoring.


NEW MEXICO CONSUMER MATHEMATICS TEST & CONSUMER RIGHTS AND RESPONSIBILITIES TEST

Monitor, 1973

MEASUREMENT PROPERTIES

A B() 1. Description.

A 2. Agreement.

A 0 3. Representativeness.

A 4. Sensitivity. An average superiority of about 2.5 items correct for 12th graders over 9th graders is reported, but that gain is not clearly attributable to instruction.

A 5. Item Uniformity. Internal consistencies of .62 to .75 for the total tests are reported, but consistencies within content clusters of items are not given. Several items on each test (e.g., three for Consumer Rights and Responsibilities at grade 12, eight for Consumer Math at grade 9) have correlations with the total test score of less than .3.

A 6. Divergent Validity.

A 0 7. Bias.

A q) 8. Consistency.

APPROPRIATENESS AND USABILITY

B C 9. Instructions.

A q) 10. Item Review.

0 C 11. Visibility.

0 C 12. Responding.

C 13. Informativeness.

A ap 14. Curriculum Cross-Referencing.

A B() 15. Flexibility..

A 16. Alternate Forms.

17. Administration.

AOC 18. Scoring. Templates forscoring are available.

A 0 19. Record Keeping.

A 0 20. Decision Rules.

A 0 21. Comparative Data. Normsfor pupils in NewMexico are given.


PRE-READING ASSESSMENT KIT CTB/McGraw-Hill Ryerson Limited,1972

DESCRIPTION

Pre-Reading Assessment Kit is designed as a "rough screening device for the classroom teacher" to use with children in kindergarten and first grade. Its tests measure skills in the following four areas: listening, symbol perception, experience vocabulary, and comprehension. The kit has tests at three levels of difficulty, the number of objectives ranging from three at the difficult level to eight at the easy one. Items are multiple choice, averaging ten per objective.

PRICES

A classroom set of consumable test forms for 32 pupils costs $1.67 per pupil and includes record forms and a manual. The manual is $2.40 separately. A specimen set is offered for $3.60. Date of information: 1977-78.

FIELD TEST DATA

Difficulty leveling was based on a pretest of 2864 first graders. It is likely that these pupils were Canadian.

ADMINISTRATION

These tests are made for group administration. Estimated time for each of the 18 subtests is 10 minutes.

SCORING

The manual contains keys for hand scoring.

COMMENTS

The manual suggests that tests like these are biased against children from limited English speaking or culturally disadvantaged backgrounds.


PRE-READING ASSESSMENT KIT CTB/McGraw-Hill Ryerson Limited,1972

MEASUREMENT PROPERTIES

A B() 1. Description.

A C) 2. Agreement. No data.

A CD 3. Representativeness. No data.

A cD 4. Sensitivity. No data.

A cD 5. Item Uniformity. No data.

A 6. Divergent Validity. No data.

A e 7. Bias. No data.

A 8. Consistency. No data.

APPROPRIATENESS AND USABILITY

C 9. Instructions.

A CD 10. Item Review.

C 11. Visibility.

0 C 12. Responding.

0 C 13. Informativeness.

A 14. Curriculum Cross-Referencing. Resource materials are identified for some portions of the test, but the information is not detailed.

OB C 15. Flexibility. Each subtest is on a separate form.

A cD 16. Alternate Forms.

0 C 17. Administration.

18. Scoring.

C 19. Record Keeping.

A 0) 20. Decision Rules. Possible interpretations of scores are discussed and suggestions are given for cutting scores. Support for the decisions is not given.

A 0 21. Comparative Data.


PRESCRIPTIVE READING INVENTORY    CTB/McGraw-Hill, 1972

DESCRIPTION

The PRI is a six-level system for testing the following areas of reading skill: recognition of sound and symbol, phonic analysis, structural analysis, translation (meanings of words and phrases), literal comprehension, interpretive comprehension, and critical comprehension. Levels 1 and 2, for K to 1.0 and K.5 to 2.0, have 10 objectives each. The upper four levels, aimed at grades 1.5 through 6.5, have 34 to 42 objectives per level with an average of 3 to 4 multiple choice items per objective. In addition to the booklet for testing each level, smaller interim tests are optionally available for monitoring progress during the school year. The Interpretive Handbook (included) has guidelines for integrating the PRI into instruction and suggestions for classroom activities for each objective. Guides indexing the PRI to basal reading series are optionally available.

PRICES

Test booklets in sets of 35 sell for various prices depending on whether they are reusable (for the upper two levels, 39¢ to 44¢ each), hand scorable (57¢), or machine scorable (71¢). Answer sheets are 10¢ each in packs of 50. Keys for hand scoring are 16¢ per pupil in sets of 35. One per pupil is needed. Included in the specimen set ($5.50 for each level, $11.00 for all levels) are test booklets, answer sheets, plus the following materials, with their separate prices in parentheses: examiner's manual ($2.50 per level), and an Interpretive Handbook ($3.25). A Technical Report is available for each level at $3.25. Date of information: 1979.

FIELD TEST DATA

A national tryout was conducted on an ethnically mixed national sample of 18,000 students. In the Technical Report, several analyses of these data are presented, including a comparison of difficulties for "standard" and Black samples of pupils. Reliability, validity, and sensitivity to instruction data are given. Data are also given for the study equating the PRI and the CAT-70.

ADMINISTRATION

The PRI is a group test. Time for testing an entire level is about three hours. The publisher recommends administering the lower two levels by cassettes.

SCORING

The basic scoring service, which costs 70¢ per pupil for answer sheets or 97¢ per pupil for scoring booklets, reports individual scores and group summary scores by objective. Estimates of normative scores are optionally available. Estimated reporting time is 15 days from receipt by the publisher. Hand scoring keys are provided in the Interpretive Handbook for all levels.


PRESCRIPTIVE READING INVENTORY CTB/McGraw-Hill, 1972

MEASUREMENT PROPERTIES

A(DC 1. Description.

A 0 2. Agreement. Item sensitivity data provide a very rough indication of degree of agreement.

A @ 3. Representativeness.

A @ 4. Sensitivity. Average item sensitivities of .20 - .38 per level are reported for the tryout version of the PRI using the index of Marx and Noll. Data are not reported at the level of the item or objective.

0 C 5. Item Uniformity. KR-20reliability coefficientsrange from .63 to .88.

A 6. Divergent Validity. Reported factor analyses do not support the separateness of the tested skills in a consistent fashion across test levels.

A @ 7. Bias.

0 8. Consistency. For the tests of 34 objectives, a type of alternate form reliability is reported, namely correlation of the scores for an objective with scores on a longer criterion test of the same objective. Seven to ten objectives from each level were sampled. Data are reported for each of 2-3 grades for each test level.

APPROPRIATENESS AND USABILITY

(DB C 9. Instructions.

GD C 10. Item Review. Item analy-sis and revision weredone after tryout.

0 C 11. Visibility.

C 12. Responding.

C 13. Informativeness.

0 C 14. Curriculum Cross-Referencing.

A0C 15. Flexibility. There is a good carryover of objectives across levels, but single objectives are not necessarily easy to test separately. Optional interim tests give more flexibility.

A @ 16. Alternate Forms.

C 17. Administration.

B C 18. Scoring.

GD C 19. Record Keeping.

A q) 20. Decision Rules. Threelevels of attainment areidentified, but withlittle justification.

C 21. Comparative Data. Pupils' performance on the PRI may be used to estimate their performance on the California Achievement Test in normative terms, when the publisher's scoring service is used.


READING: COMPREHENSION SKILLS (K-6)    Instructional Objectives Exchange, 1973

DESCRIPTION

This battery is a collection of multiple choice tests measuring 40 objectives in reading comprehension. The objectives deal with the following groups of skills: main idea (10 objectives), conclusions (10), sequence (7), context clues (9), punctuation (3), syntactical structures (4), affixes (2). The five to ten items per objective are printed on spirit masters for local duplication and scoring. Two alternate forms of this collection are sold.

PRICES

Each form of this collection sells for $29.95 which includes a manual and record forms. The price per pupil will vary with the number of copies made from each spirit master and the number of objectives that are used. Date of information: 1979.

FIELD TEST DATA

A formative field test is mentioned but not described. After publication, performance data were gathered on 81 to 737 pupils per objective (average: over 500).

ADMINISTRATION

These are group tests.

SCORING

Answer keys are provided in the manual for hand scoring.

COMMENTS

Leveling of tests in this collection according to content, format, field test data, etc., is done locally by the test user. The publisher also offers a customized CRT service.


READING: COMPREHENSION SKILLS Instructional Objectives Exchange,(K-6) 1973

MEASUREMENT PROPERTIES

C)B C 1. Description. Amplified objectives, but without rules for sampling the domains.

A 2. Agreement. No data.

A © 3. Representativeness. No data.

© 4. Sensitivity. No data.

A 0 5. Item Uniformity. No data.

A () 6. Divergent Validity. No data.

A c) 7. Bias. No data.

A (D 8. Consistency. No data.

APPROPRIATENESS AND USABILITY

9. Instructions.

10. Item Review.

11. Visibility.

C 12. Responding.

C 13. Informativeness.

© 14. Curriculum Cross-Referencing.

B C 15. Flexibility.

GD C 16. Alternate Forms.

C 17. Administration.

A B© 18. Scoring. The keys forall 40 objectives areprinted on one page insmall type.

C 19. Record Keeping.

A 20. Decision Rules.

A BC) 21. Comparative Data. Comparative data are given in the form of cumulative percentages of pupils attaining each possible score for each objective at each of several separate grades. The sample is all from Southern California.


READING: WORD ATTACK SKILLS (K-6)    Instructional Objectives Exchange, 1973

DESCRIPTION

This battery is a collection of multiple choice and oral response tests measuring 38 objectives in word attack. There are five to ten items per objective (mostly ten), items for each objective being printed on a separate spirit master for local duplication and scoring. Two alternate forms of this test are sold.

PRICES

Each form of this collection sells for $29.95. This price includes a manual and record forms. The price per pupil will vary with the number of copies made from each spirit master and the number of objectives that are used. Date of information: 1979.

FIELD TEST DATA

A small developmental field test is reported but not described. After publication, performance data were gathered on 81 to 713 pupils per objective (average: over 300).

ADMINISTRATION

These are group tests.

SCORING

Answer keys are provided in the manual for hand scoring.

COMMENTS

Leveling of the tests in this collection according to content, format, etc., is done locally by the test user. The publisher also offers a customized CRT service.


READING: WORD ATTACK SKILLS (K-6) Instructional Objectives Exchange,1973

MEASUREMENT PROPERTIES

OB C 1. Description. Amplified objectives, but without rules for sampling the domain.

A 2. Agreement. No data.

A 3. Representativeness. No data.

A 4. Sensitivity. No data.

A 5. Item Uniformity. No data.

A 6. Divergent Validity. No data.

A 0 7. Bias. No data.

A ® 8. Consistency. No data.

APPROPRIATENESS AND USABILITY

0 B C 9. Instructions.

c 10. Item Review.

c 11. Visibility.

0 C 12. Responding.

C 13. Informativeness.

A © 14. Curriculum Cross-Referencing.

()B C 15. Flexibility.

0 C 16. Alternate Forms.

0 C 17. Administration.

A B® 18. Scoring. The keys forall 38 objectives areprinted on one page insmall type.

C 19. Record KeepinK.

A 0 20. Decision Rules.

A q) 21. Comparative Data. Comparative data are given in the form of cumulative percentages of pupils attaining each possible score for each objective for three separate grades (on the average). The sample is all from Southern California.


REAL: READING/EVERYDAY ACTIVITIES IN LIFE    Cal Press, Inc., 1972

DESCRIPTION

REAL is a test of basic literacy skills for readers age 10 and above. It consists of 45 fill-in items, 5 each dealing with nine categories of common printed materials. For example, the category of "sets of directions" is tested by five items relating to a recipe for pizza which is given.

PRICES

Consumable test booklets are $1.00 each for orders of up to 100 copies. Cassette tapes for individual testing are $6.00 each. The Administrator's Manual, with technical information, is $6.50. A specimen set is available for $8.00. Date of information: 1977.

FIELD TEST DATA

After a developmental field test on 300 persons, mostly junior and senior high school students in inner city schools, REAL was revised and then normed on 434 disadvantaged Job Corps students of ages 18-21. Percentile norms, total test reliability (KR-20 = .93), point biserials for individual items, and item difficulties are given.
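The KR-20 coefficient reported above is a routine internal-consistency index for dichotomously scored tests. The following is a minimal sketch of the computation, using invented scores rather than the REAL norming data.

def kr20(score_matrix):
    # score_matrix: one row of 0/1 item scores per examinee.
    n_items = len(score_matrix[0])
    n_people = len(score_matrix)
    totals = [sum(row) for row in score_matrix]
    mean_total = sum(totals) / n_people
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_people
    pq = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in score_matrix) / n_people  # item difficulty
        pq += p * (1 - p)
    return (n_items / (n_items - 1)) * (1 - pq / var_total)

rows = [[1, 1, 1, 0, 1],
        [1, 0, 1, 0, 0],
        [0, 0, 1, 0, 0],
        [1, 1, 1, 1, 1]]
print(round(kr20(rows), 2))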

ADMINISTRATION

The REAL is administered to groups or individuals with the aid of cassette tapes and earphones.

SCORING

Scoring is done by hand using model answers in the manual.


REAL: READING/EVERYDAY ACTIVITIES IN LIFE    Cal Press, Inc., 1972

MEASUREMENT PROPERTIES

A B() 1. Description.

A 0 2. Agreement. Content validation procedures are alluded to but not described.

A 3. Representativeness. An effort to ensure representativeness is alluded to but not described.

A C) 4. Sensitivity.

A 0 5. Item Uniformity. The internal consistency data are not at the level of the objective.

A 6. Divergent Validity.

A 7. Bias.

A 8. Consistency.

APPROPRIATENESS AND USABILITY

0 B C 9. Instructions.

0 C 10. Item Review.

CD C 11. Visibility.

0 C 12. Responding.

CDC 13. Informativeness.

A 14. Curriculum Cross-Referencing.

A BC) 15. Flexibility.

A C) 16. Alternate Forms.

C) C 17. Administration.

A BC) 18. Scoring.

0 C 19. Record Keeping.

A 0 20. Decision Rules.

A 21. Comparative Data. The norm sample is small.


SIPAY WORD ANALYSIS TESTS by Edward R. Sipay    Educators Publishing Service, Inc., 1974

DESCRIPTION

The Sipay Word Analysis Tests consist of a 17-test diagnostic battery measuring word-analysis skills in these three broad areas: visual analysis, phonic analysis, and visual blending. The tests range in breadth from "visual analysis" with three subtests and a total of 99 items, to "vowel sounds of y" with 9 items. There are at least three items for each specific skill (e.g., contractions with not), the items all calling for oral responses. The first test is a 57-item diagnostic survey.

PRICES

This battery is sold for $73.00 in a kit which includes a manual, a "mini-manual" for each of the 17 tests, 12 answer sheets for each test, and a set of 756 stimulus cards. Answer sheets are available separately in sets of 12 for 15¢ to 60¢ depending on test length. Specimen sets are $2.50. Date of information: 1977.

ADMINISTRATION

The Sipay tests are made for administration to individuals by a teacher.

SCORING

The pupil's oral responses are scored by teacher judgment at the time ofresponding.

COMMENTS

The stimuli, when they are words or syllables, are chosen to be uncommon so that they are unlikely to be in children's sight vocabulary. The developer disagrees with our rating of feature #18 and says that many users do not find the directions for the examiner (feature #17) complicated.


SIPAY WORD ANALYSIS TESTSby Edward R. Sipay

Educators Publishing Service, Inc.,1974

MEASUREMENT PROPERTIES

AOC 1. Description.

A 2. Agreement. No data.

3. Representativeness. Principles for selecting stimuli are described in detail. A number of the domains are tested in full, not merely sampled.

A IQ 4. Sensitivity. No data.

A () 5. Item Uniformity. Nodata.

A 6. Divergent Validity. Nodata.

A e) 7. Bias. Although there are no field test data, specific instructions are given to avoid scoring dialect responses as incorrect.

A C) 8. Consistency. No data.

APPROPRIATENESS AND USABILITY

ACC 9. Instructions. Sample items are not given for about half of the tests.

A 10. Item Review.

C) C 11. Visibility.

C 12. Responding.

C) C 13. Informativeness.

A () 14. Curriculum Cross-Referencing.

OB C 15. Flexibility.

A 16. Alternate Forms.

A qD 17. Administration. The directions for administering, scoring, and interpreting results are complicated.

A B® 18. Scoring. The recording and scoring of responses is often complex and subjective.

C 19. Record Keeping.

A 20. Decision Rules. Cutoff points are given but without support.

A C) 21. Comparative Data.


SKILLS MONITORING SYSTEM: READING Harcourt Brace Jovanovich/The Psychological Corporation, 1975

DESCRIPTION

The SMS: Reading is a four-level instructional management system for reading which measures pupils' skills in word identification at a Grade 3 level (including visual perception, phonics, morphemic elements) and comprehension at 3rd, 4th, and 5th grade levels (including word meaning in context, literal meaning, interpretation, critical reading). Each level includes both "locator" or diagnostic tests of from 27 to 36 objectives, with two multiple choice items per objective, and "skill-minis" for the same number of objectives with 8 to 12 items per objective. Practice skill-minis are also available.

PRICES

At each level, a package of 35 skill locators with scoring key, class record, and teacher handbook is 55¢ to 69¢ per pupil for the machine scored form. Keys for hand scoring are $1.35 per level. Self-scoring skill-minis are 17¢ per pupil in sets of 16. A classroom set of materials is also sold. Specimen sets are $2.75 per level. Date of information: 1978.

FIELD TEST DATA

Publisher reports that the SMS: Reading was field tested on roughly 6000 pupils in 215 classrooms at grades 3, 4, and 5 in selected school systems.

ADMINISTRATION

These are designed as group tests.

SCORING

Machine scoring of the locator tests costs 75¢ per pupil. The locators and skill-minis may be scored by hand from a key, or the skill-minis may be ordered in a self-scoring form.

COMMENTS

An optional Teacher's Resource Notebook was in preparation in 1977. This will contain guidelines for instruction and an index of curricular resources.


SKILLS MONITORING SYSTEM: READING Harcourt Brace Jovanovich/The Psychological Corporation, 1975

MEASUREMENT PROPERTIES

A® C 1. Description.

0 C 2. Agreement. Judges sorted test items into homogeneous groups, wrote objectives for each group, then compared their objectives with the original ones. The level of detail in those objectives and the method of comparing objectives are not described.

A @ 3. Representativeness.

A @ 4. Sensitivity. No data.

C 5. Item Uniformity. Median KR-20s and ranges of KR-20s are reported for each test length in each level. Medians are mostly .73 - .83.

6. Divergent Validity. The evidence is not strong: low correlations (mostly <.4) among pairs of items measuring different objectives on the locator tests.

A 7. Bias. No data, but a review for bias is mentioned.

(6) 8. Consistency. A type of alternate form reliability is reported in very general terms: median tetrachoric correlations for each level between the mastery judgment on the locator for each objective and the corresponding judgment on the skill-mini. Values range from .67 to .73.

APPROPRIATENESS AND USABILITY

ACIC 9. Instructions. Sample items are generally not given.

C) c 10. Item Review. Item selection and revision were based on field test data.

11. Visibility.

12. Responding. Also, latent image format of minis gives instant feedback.

® C 13. Informativeness.

A © 14. Curriculum Cross-Referencing. In preparation.

B C 15. Flexibility.

A © 16. Alternate Forms.

6) c 17. Administration.

B C 18. Scoring.

® C 19. Record Keeping.

A 0 20. Decision Rules. Decision rules are given but without support.

A C) 21. Com arative Data.


SOCIAL STUDIES: AMERICAN GOVERNMENT (10-12)    Instructional Objectives Exchange, 1973

DESCRIPTION

This battery is a collection of multiple choice tests measuring 32 objectives in American government. An average of three to four of the objectives deal with each of the following topics: our colonial heritage, the Constitution, citizens' rights, politics, the Congress, the Executive, the Federal Judiciary, and state and local government. Each test item is printed on spirit masters for local duplication and scoring. Two alternate forms of this collection are available.

PRICES

Each form of this collection sells for $29.95. This price includes a manual and record forms. The price per pupil will vary with the number of copies made from each spirit master and the number of objectives that are used. Date of information: 1979.

FIELD TEST DATA

Field testing in one high school is mentioned but not described.

ADMINISTRATION

These are group tests for administration by a teacher or by oneself.

SCORING

Answer keys are provided in the manual for hand scoring.

COMMENTS

The publisher also offers a customized CRT service.


SOCIAL STUDIES: AMERICAN GOVERNMENT (10-12)    Instructional Objectives Exchange, 1973

MEASUREMENT PROPERTIES

C 1. Description. Amplifiedobjectives, but withoutrules for sampling thedomain.

A 0 2. Agreement. No data.

A 3. Representativeness. No data.

A 0 4. Sensitivity. No data.

A 0 5. Item Uniformity. No data.

A 0 6. Divergent Validity. No data.

A 7. Bias. No data.

A 8. Consistency. No data.

APPROPRIATENESS AND USABILITY

A B 9. Instructions. The language of instructions and stems may be difficult for the average high school student.

0 C 10. Item Review.

CD C 11. Visibility.

C 12. Responding.

® c 13. Informativeness.

A 14. Curriculum Cross-Referencing.

C)B C 15. Flexibility.

cD C 16. Alternate Forms.

C 17. Administration.

A B® 18. Scoring. Answers to all 32 tests are printed on one sheet in small type.

cD C 19. Record Keeping.

A 0 20. Decision Rules.

A 0 21. Comparative Data.


SRA SURVIVAL SKILLS IN READING AND MATH    Science Research Associates, 1976

DESCRIPTION

The SRA Survival Skills Test is a 120-item test of practical problems in reading and math for pupils at grade 6 and above. For each of the 20 objectives in reading and 20 in math, there are 3 multiple choice items.

PRICES

Reusable test booklets are 73¢ each in sets of 25 (55¢ to schools) and answer sheets are 13¢ each by the 100. The administrator's manual is 70¢. A review set is offered for $1.30. Date of information: 1977.

FIELD TEST DATA

A technical report is available from SRA giving item difficulties and item/test correlations. Data are reported for a median of 560 pupils per grade for grades 7-12.

ADMINISTRATION

This test may be administered to groups.

SCORING

Machine scoring is offered at a cost of 98¢ per pupil, which includes the cost of answer sheets.


SRA SURVIVAL SKILLS IN READING AND MATH    Science Research Associates, 1976

MEASUREMENT PROPERTIES

1. Description.

A cp 2. Agreement. No data.

A 3. Representativeness. Nodata.

A C) 4. Sensitivity. No data.

A c.) 5. Item Uniformity. Point biserial correlations for items have a median near .45 for reading and .5 for math as given in the technical report. These are correlations of item scores with total test scores, not with scores for each item's objectives.

A qD 6. Divergent Validity. No data.

A 7. Bias. No data.

A cD 8. Consistency. No data.

APPROPRIATENESS AND USABILITY

C 9. Instructions.

© 10. Item Review.

cD C 11. Visibility.

C 12. Responding.

cDC 13. Informativeness.

A 14. Curriculum Cross-Referencing.

A B.@ 15. Flexibility.

C 16. Alternate Forms.

C 17. Administration.

C)B C 18. Scoring.

A 19. Record Keeping. There isno form for recordingscores by hand.

20. Decision Rules. For each three-item objective, the probabilities of a pupil getting scores of 0-3 by guessing are given. (A short computational sketch follows this list.)

A 0- 21. Comparative Data. Item difficulties are given in a technical report, but the sample of pupils that was tested is not described.
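The guessing probabilities mentioned under feature 20 are ordinary binomial calculations. The sketch below assumes four-option multiple choice items (a chance level of 1/4 per item); the handbook does not state the number of options here, so that value is illustrative only.

from math import comb

def guessing_distribution(n_items=3, p_chance=0.25):
    # Probability of exactly k correct out of n_items by blind guessing.
    return [comb(n_items, k) * p_chance**k * (1 - p_chance)**(n_items - k)
            for k in range(n_items + 1)]

for k, prob in enumerate(guessing_distribution()):
    print(f"P({k} of 3 correct by guessing) = {prob:.3f}")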


STANFORD DIAGNOSTIC MATHEMATICS TEST by Leslie S. Beatty, et al.    Harcourt Brace Jovanovich/The Psychological Corporation, 1976

DESCRIPTION

The Stanford Diagnostic Mathematics Test is a four-level battery testing skills that are usually taught in grades 1 through 8. Each level consists of three tests, one each dealing with number system and numeration, computation, and applications (problem solving, applications, tables, and graphs). At each level there are 11 or 13 objectives, there being an average of 8 to 10 multiple choice items per objective. Alternate forms are available.

PRICES

Hand scorable test booklets are 43¢ per pupil in sets of 35, these being reusable at the upper three levels. Keys for scoring test booklets are $3.85 per level and for scoring answer sheets $1.40 per level. Machine scorable and hand scorable answer sheets are about 11¢ each in sets of 35. Practice tests for each level are also offered optionally. Administrators' manuals are $2.75 per level. A standard package containing materials for testing 35 students is sold. Specimen sets are available at $3.30 per level. Date of information: 1978.

FIELD TEST DATA

The Stanford Diagnostic Mathematics Test was field tested on a national sample of 23,000 students and normed on a stratified sample of 36,000 pupils in grades 2-12. Percentile ranks, stanines, and grade equivalent scores are given as well as item difficulties for pupils at several separate grade levels per test level.

SCORING

Tests may be machine scored or scored by hand with a template. The basic scoring service runs 80¢ to 85¢ per pupil for machine scoring. The publisher estimates a turnaround time for test results of 10 working days from receipt.


STANFORD DIAGNOSTIC MATHEMATICS TEST by Leslie S. Beatty, et al.    Harcourt Brace Jovanovich/The Psychological Corporation, 1976

MEASUREMENT PROPERTIES

A® C 1. Description.

A q) 2. Agreement. A review is mentioned but not described.

A 3. Representativeness. No data.

A 4. Sensitivity.

A 5. Item Uniformity. Internal consistencies are reported for whole subtests (30 items) but not for separate objectives.

A q.) 6. Divergent Validity. High subtest intercorrelations suggest that aptitude and achievement are not well separated.

A 7. Bias. No data, but editing to eliminate bias in the development of the tests is reported.

A 8. Consistency. Alternate form reliabilities for clusters of items representing two to three objectives are reported for pupils at two separate grades per test level. Median tetrachoric coefficient is above .8.

APPROPRIATENESS AND USABILITY

(DB C 9. Instructions.

C 10. Item Review.

C 11. Visibility.

6D C 12. Responding.

c 13. Informativeness.

A 14. Curriculum Cross-Referencing.

A® C 15. Flexibility. There is ample carryover of objectives from level to level, but objectives for one level are all in one booklet.

C 16. Alternate Forms.

17. Administration.

GB C 18. Scoring. Templates and machine scoring options are available.

6D C 19. Record Keeping,

C 20. Decision Rules. Passing scores were set after considering several factors (e.g., whether a skill is a basis for later skills), but the process of setting these scores is described in very general terms.

6DC 21. Comparative Data. See note on Field Test Data on facing page.


STANFORD DIAGNOSTIC READING TEST by B. Karlsen, R. Madden, & E. F. Gardner

Harcourt Brace Jovanovich/The Psychological Corporation, 1976

DESCRIPTION

The Stanford Diagnostic Reading Test is a four-level battery of tests designed to span grades 1.5 to 12.0. The following skill areas are covered: auditory discrimination, phonetic analysis, structural analysis, auditory vocabulary, word meaning, word parts, word reading, comprehension, rate, fast reading, and scanning/skimming. There are 17-25 objectives at each level with generally 6-8 multiple choice items per objective (range: 4 to 42 items). Alternate test forms are available. Publisher states that the SDRT places more emphasis on low achievers than is customary by including more than the usual proportion of easy questions. Guidelines for using the results for instructional and administrative purposes are given in the teacher's manual. A handbook referencing the tested skills to a variety of reading series is offered.

PRICES

Consumable test booklets are 43¢ each in sets of 35 for the lower two levels. Reusable booklets for the third and fourth levels are 43¢ and 48¢ in sets of 35. Answer sheets vary from 14¢ (hand scorable) to 28¢ (machine scored) in sets of 35. Scoring keys range between $3.00 and $3.60 per level, while each level of the manual for giving and interpreting the SDRT is $2.75. A specimen set is available at $3.30 for each level. Date of information: 1978.

FIELD TEST DATA

This revision of the SDRT was field tested on 24,000 pupils in grades 2-9 in 1974 and normed on a stratified national sample of 30,000 students in 1975. Percentile ranks, stanines, and grade equivalent scores are given as well as item difficulties for pupils at several different grade levels per test level.

ADMINISTRATION

The SDRT is a group-administered test battery. The estimated testing time for an entire level runs from 100 to 145 minutes.

SCORING

Scoring by hand template, key, or machine is available. The publisher's machine service costs from 85¢ to 90¢ per pupil. Publisher estimates a turnaround time for test results of 10 working days from receipt.


STANFORD DIAGNOSTIC READING TEST by B. Karlsen, R. Madden, & E. F. Gardner

Harcourt Brace Jovanovich/

The Psychological Corporation, 1976

MEASUREMENT PROPERTIES

AlOC 1. Description.

© 2. Agreement. No data.

A 3. Representativeness.

A 4. Sensitivity.

C 5. Item Uniformity.

A q-D 6. Divergent Validity. The subtests show high intercorrelations, which suggests they all measure the same thing.

A 7. Bias. Data are not given, but editing for bias during test development is reported.

CD C 8. Consistency. Alternate form reliability is generally above +.8.

APPROPRIATENESS AND USABILITY

B C 9. Instructions.

® c 10. Item Review. Items were selected on the basis of field test data.

C 11. Visibility.

C 12. Responding.

C 13. Informativeness.

C 14. Curriculum Cross-Referencing..

ACIC 15. Flexibility. There is a good overlap of objectives across levels but items for many objectives are intermixed, not grouped separately.

0 C 16. Alternate Forms.

C) C 17. Administration.

OB C 18. Scoring.

C 19. Record Keeping.

A 20. Decision Rules. "Progress indicator cutoff scores" are provided, but they are justified in only general terms. The publisher encourages local discretion in setting cutoffs. Use of normative scores for grouping is also explained.

0 C 21. Comparative Data. Based on the national norming sample, percentiles, stanines, grade equivalents, and scaled scores are given.


SURVEY OF READING SKILLS Dallas Independent School District,1973-74

DESCRIPTION

The Survey of Reading Skills is an eight-level battery of tests measuring objectives in the following categories: pre-reading skills, structural analysis, word meaning, and comprehension. A test booklet and examiner's manual are provided for each of levels K-6. Level S, for secondary students needing remedial instruction, has four test booklets. The number of objectives per level ranges from 40 at S to 15 for 6th, the average number of items per objective ranging from 4 to 7.

PRICES

The price for the Survey of Reading Skills is the current printing and postage costs. The test booklets are consumable. Date of information: 1977.

FIELD TEST DATA

The system has been field tested, but results are not provided with the test

ADMINISTRATION

The Survey of Reading Skills is designed for group administration, except for a second form of the K-level test.

SCORING

The tests are hand scored from keys in each examiner's manual.

COMMENTS

The difficulties of the levels are indicated by reference to specific texts in basal reading series. For example, Level II is aimed at the reading level of Secrets and Rewards; Level V at Images. The objectives themselves are commonly taught, not peculiar to this district.


SURVEY OF READING SKILLS    Dallas Independent School District, 1973-74

MEASUREMENT PROPERTIES

AGC 1. Description.

A 2. Agreement.

A 3. Representativeness.

A 0 4. Sensitivity.

A 5. Item Uniformity.

A (D 6. Divergent Validity.

A 7. Bias.

A © 8. Consistency.

APPROPRIATENESS AND USABILITY

B C 9. Instructions.

A 6D 10. Item Review.

A 11. Visibility. Graphics areoften unclear.

C 12. Responding.

A © 13. Informativeness.

A () 14. Curriculum Cross-Referencing.

ACC 15. Flexibility. There is ample carryover of objectives across levels, but they are all tested in one booklet at each level.

A 16. Alternate Forms.

C 17. Administration.

A BC) 18. Scoring. Hand scoring involves a complex chart/counting system.

cD C 19. Record Keeping.

A 0 20. Decision Rules. Rules are provided without support.

A qp 21. Comparative Data.


TESTS OF ACHIEVEMENT IN BASIC SKILLS - MATH    Educational and Industrial Testing Service, 1970-74

DESCRIPTION

TABS-Math is a seven-level battery of survey tests for pupils in grades K-12. There is one item per objective on all of the tests, objectives being grouped into the following clusters: arithmetic skills, geometry-measurement-application, and modern concepts. The number of items varies from 18 at Level K to 69 at the level for grades 4-6. The number of clusters per level is 3 or 2. Item formats are fill-in for levels K, 1, and 2, and multiple choice for the upper four levels. Alternate forms of this battery are available.

PRICES

Consumable test booklets for the lower four levels are 25¢ each in sets of 30. For the upper three levels, reusable booklets are 21¢ each in sets of 35 and answer sheets are 8¢ each in like sets. For each level the administrator's manual and answer key are each $1.50. A specimen set is offered at $2.25 per level. Date of information: 1977.

FIELD TEST DATA

The three test levels for grades 4-6, 7-9, and 10-12 were given preliminary tryouts and then were normed on national samples of 4500, 17,000, and 3500 pupils respectively. Means and standard deviations are reported for total test scores for four ability groups and three grade levels for each of those test levels. In addition, entry level item difficulties are given for all items at three grade levels.

ADMINISTRATION

The TABS are designed for group administration by a teacher.

SCORING

Hand scoring of the lower three levels is done with reduced pupil pages. Template and machine scoring are both offered for the upper three levels. The basic scoring service which costs 35¢ per pupil includes item and total scores for individuals and for classes.


TESTS OF ACHIEVEMENT IN BASIC SKILLS - MATH    Educational and Industrial Testing Service, 1970-74

MEASUREMENT PROPERTIES

A B© 1. Description. Objectives for single test items defeat the purpose of objectives, to describe skills and not single questions. The higher level clusters of items are described by extremely vague labels.

A 2. Agreement.

A 3. Representativeness.

A 4. Sensitivity. Reported gains from grade to grade are not clearly the result of relevant instruction.

A 5. Item Uniformity. Data reported are not for objectives or skill clusters.

A qD 6. Divergent Validity..

A 7. Bias.

A () 8. Consistency. At the level of the total test score, alternate form reliabilities are reported for two levels.

APPROPRIATENESS AND USABILITY

A Bc) 9. Instructions. Sample items are not provided, and the instructions for the lower levels are often unclear.

10. Item Review. Quality control reported for the upper three levels only.

C 11. Visibility.

C 12. Responding.

C 13. Informativeness.

A 14. Curriculum Cross-Referencing. TABS is indexed to a curricular series of the publisher.

AC) C 15. Flexibility. Good carryover of objectives across levels, but all are tested on one form.

C 16. Alternate Forms.

6) C 17. Administration.

(DB C 18. Scoring. Except at thelower three levels wherethe hand scoring mate-rials are reduced pupilpages.

C) C 19. Record Keeping.

A 0 20. Decision Rules.

0 21. Comparative Data. For the upper three levels, there are detailed comparative data; for the lower four levels, none.


TESTS OF ACHIEVEMENT IN BASIC SKILLS - READING AND LANGUAGE    Educational and Industrial Testing Service, 1975

DESCRIPTION

The TABS is a three-level battery for assessing pre-reading and reading skills in pupils in grades K-2. There are 38 to 52 objectives per level which deal with the following categories of skill: word analysis, language development, comprehension, and study skills. A few affective objectives are included as well. For each objective there are from 1 to 24 items, the average being close to 3. Item formats include multiple choice, matching, and fill-in. A diagnostic and instructional program is available optionally. Two parallel forms of TABS are sold.

PRICES

Consumable test booklets with answer sheets are available for one test form at one level at 26¢ per pupil in a set of 30. The manual and answer key for a level are $1.50 together. For any one level the specimen set, test booklet plus manual, is $4.50. Classroom sets of the teaching and testing materials are available on approval. Date of information: 1977.

ADMINISTRATION

TABS are designed for group administration.

SCORING

Answer keys for hand scoring are available.


TESTS OF ACHIEVEMENT IN BASIC SKILLS - READING AND LANGUAGE    Educational and Industrial Testing Service, 1975

MEASUREMENT PROPERTIES

A(E)C 1. Description. Although written in the form of behavioral objectives, many of the objectives are vague.

A 2. Agreement.

A 0) 3. Representativeness.

A qD 4. Sensitivity.

A c.) 5. Item Uniformity.

A 0) 6. Divergent Validit/.

A 0) 7. Bias.

A qD 8. Consistency.

APPROPRIATENESS AND USABILITY

A(B)C 9. Instructions. Sampleitems are not provided.

A 0) 10. Item Review.

11. Visibility.

C 12. Responding.

A () 13. Informativeness. Contents of the specimen set are not listed in the catalog. It is not clear which manuals are available.

A 0) 14. Curriculum Cross-Referencing. Keyed tothe publisher's owninstructional program.

A B® 15. Flexibility.

C 16. Alternate Forms.

(D C 17. Administration.

A BO) 18. Scoring. Answer keysare not consistentlyeasy to use. Some sub-jective judgments areinvolved in scoring.

(p C 19. Record Keeping.

A C) 20. Decision Rules. Rules are provided without support. Some decisions are based on one item.

A 21. Comparative Data.


WISCONSIN DESIGN FOR READING SKILL DEVELOPMENT: COMPREHENSION by Wayne Otto, Karlyn Kamm, et al.    NCS Educational Systems, 1977

DESCRIPTION

The Wisconsin Design is a seven-level battery of measures for diagnosing the status and monitoring the progress in reading comprehension of pupils in grades K through 6. The number of objectives per level ranges from 3 to 8, with at least 12 items per objective. Thirty-three of the objectives in the battery have multiple choice items; six ask for written responses; one asks for oral responses. Fifteen different types of literal and interpretive comprehension are tested in all. Alternate forms are available. Optional supporting materials include a teacher's planning guide and teacher's resource file. This battery is one part of a six-part instructional management system; the word attack and study skills tests are also reviewed in this volume.

PRICES

Consumable test booklets for the lower grades are 59¢ to 80¢ per pupil and reusable booklets for the upper levels are $1.71, both types coming in sets of 35 along with an administrator's manual. The tests for the lower levels are also available on spirit masters at $16.00 to $27.00 per level. Spirit masters for printing answer sheets are $3.00 each. Specimen sets are $6.00. The teacher's planning guide is $4.25 and the teacher's resource file is $41.50. Date of information: 1978.

FIELD TEST DATA

Each multiple choice objective was field tested on about 150 pupils fairly evenly drawn from schools labeled low average, average, or high average in reading comprehension. Constructed response items were field tested on 8 to 24 pupils.

ADMINISTRATION

These tests are made to be given in groups by a teacher. Although the tests are not timed, the estimated time for testing a single skill is about 10 minutes.

SCORING

Keys are provided for hand scoring of multiple choice items. Models of correct responses are given for the constructed response items.

COMMENTS

Data for test features #1, 2, 4, 5, and 8 were provided by the publisher after the original test review was completed. The ratings here for those features were made by one person (CBW). The technical reports cited are available from the University of Wisconsin R&D Center for Cognitive Learning.


WISCONSIN DESIGN FOR READING SKILL DEVELOPMENT: COMPREHENSION by Wayne Otto, Karlyn Kamm, et al.

NCS Educational Systems,1977

MEASUREMENT PROPERTIES

OB C 1. Description. Given in Working Paper #213, a preview of the final technical manual.

A 2. Agreement. A review for agreement is mentioned in Working Paper #213, but not described.

A 3. Representativeness. No data.

A 4. Sensitivity. Gains are reported in a paper by Karlyn Kamm, but spurious sources of increase are not clearly controlled.

A 5. Item Uniformity. Publisher expected to have data available by the time this volume is published.

A 6. Divergent Validity. No data.

A 7. Bias. No data.

A 8. Consistency. Publisher expected to have data available by the time this volume is published.

APPROPRIATENESS AND USABILITY

B C 9. Instructions.

10. Item Review.

C 11. Visibility.

12. Responding.

GD C 13. Informativeness.

14. Curriculum Cross-Referencing.

B C 15. Flexibility.

6DC 16. Alternate Forms.

() C 17. Administration.

AOC 18. Scoring. Hand scoringonly.

6D C 19. Record Keeping.

A 20. Decision Rules. Three levels of attainment are distinguished, but their rationale is not given in the test package. Publisher says that a dissertation by Demos deals with this issue.

A 0 21. Comparative Data.


WISCONSIN DESIGN FOR READING SKILL DEVELOPMENT: STUDY SKILLS by Wayne Otto, Karlyn Kamm, et al.

NCS Educational Systems,1973

DESCRIPTION

The Study Skills component of the Wisconsin Design is a seven-level battery of tests for pupils in grades K through 6. The major content strands deal with pictures and maps, graphs and tables, and reference materials. There are from 2 to 14 objectives per level, each with at least ten multiple choice items per objective. Alternate forms are available for most of the tests in this battery. This battery is one part of a six part instructional management system; the comprehension and word attack tests are reviewed in this volume. Optional supporting materials include a teacher's planning guide and teacher's resource file.

PRICES

Consumable test booklets are from 28¢ to 80¢ per pupil for the lower four levels and reusable booklets are $1.71 for the upper levels, both types coming in sets of 35 with an administrator's manual. Tests for the lower levels are also available on spirit masters at $6.00 to $28.00 per set, depending on the number of separate tests that make up the level. Machine scorable answer sheets for the upper levels are printed locally from spirit masters which cost $3.00; the teacher's planning guide is $4.25; and the resource file plus supplement is $61.00. Date of information: 1978.

FIELD TEST DATA

After pilot-testing the precommercial edition in 22 schools and revising it, publisher field tested this edition in three schools of average achievement level in Georgia. Over 1000 pupils provided data, 455 taking alternate forms of a subset of objectives, and 605 taking adjacent levels of the battery. A variety of data are given including, for each objective, average correct, frequency distributions, and internal consistencies. Alternate form reliabilities and inter-level correlations are reported in several ways. Data appear in Working Papers #190, #391, and #422 which are available from the University of Wisconsin R&D Center for Cognitive Learning.
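One of the statistics reported in several of these Wisconsin Design reviews is the consistency of mastery decisions across alternate forms. The sketch below shows one common way such a figure can be summarized: classify each pupil as master or nonmaster on each form with the same cutoff and report the proportion classified the same way on both forms. The scores and the 80% cutoff are hypothetical, not values from Working Paper #190.

def mastery_agreement(form_p, form_q, n_items, cutoff=0.8):
    # form_p, form_q: parallel lists of raw scores for the same pupils on two forms.
    agree = sum((p / n_items >= cutoff) == (q / n_items >= cutoff)
                for p, q in zip(form_p, form_q))
    return agree / len(form_p)

form_p = [9, 10, 6, 8, 7, 10]
form_q = [8, 10, 5, 9, 8, 9]
print(round(mastery_agreement(form_p, form_q, n_items=10), 2))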

ADMINISTRATION

The Wisconsin Design Study Skills tests are made for group administration. Working Paper #190 gives 14 minutes as the approximate average time for administering these untimed tests.

SCORING

Scoring is by hand key.

COMMENTS

Data for test features #1, 4, 5, 6, 8, and 21 were provided by the publisher after the original test review was completed. The judgments reported here for those features were made by one person (CBW).


WISCONSIN DESIGN FOR READING SKILL DEVELOPMENT: STUDY SKILLS by Wayne Otto, Karlyn Kamm, et al.    NCS Educational Systems, 1973

MEASUREMENT PROPERTIES

C 1. Description. Given in Working Paper #190.

A 2. Agreement. No data.

A 'CD 3. Representativeness. No data.

A 4. Sensitivity. Technical Reports #341 and #422 show gains in scores and levels, but spurious sources of increase are not clearly controlled.

0 C 5. Item Uniformity. Median internal consistency (Hoyt r) per objective per level is close to .74 for form P.

0 6. Divergent Validity. Intercorrelations of scores on pairs of objectives within a level are generally below .5. Intercorrelations of mastery decisions for all pairs of tests within levels and between adjacent levels are also given.

7. Bias. No data.

8. Consistency. Alternate form consistencies are given for only 24 objectives from the upper five test levels. These are in two forms: consistency of mastery decisions and of number correct. The median of the alternate form raw score correlations is r=.51 for these objectives.

APPROPRIATENESS AND USABILITY

@ B C 9. Instructions.

C 10. Item Review.

11. Visibility.

12. Responding.

13. Informativeness.

14. Curriculum Cross-Referencing.

15. Flexibility.

16. Alternate Forms. Avail-able for most of thetests.

C 17. Administration.

AOC 18. Scoring. Hand scoringonly.

C 19. Record Keeping. Classrecord sheets may needto be made locally, butindividual records areprovided.

A 20. Decision Rules. A mastery percentage is given, but not supported.

A 0 21. Comparative Data. Although a variety of data are given, Working Paper #190 says that they are not intended for use as norms. The sample of pupils is geographically limited.


WISCONSIN DESIGN FOR READING SKILL DEVELOPMENT: WORD ATTACK by Wayne Otto, Karlyn Kamm, et al.

NCS Educational Systems, 1972

DESCRIPTION

The Wisconsin Design Tests of word attack are a four-level battery for diagnosing the status and monitoring the progress of pupils in grades K-6. The objectives deal mainly with readiness, phonics, sight reading, and structural analysis. There are from six to sixteen objectives per level, each having at least fifteen multiple choice items. Alternate forms are available. Optional supporting materials include a teacher's planning guide and a teacher's resource file. This battery is one part of a six-part instructional management system; the comprehension and study skills tests are also reviewed in this volume.

PRICES

Consumable test booklets are 29¢ to 59¢ per pupil in either a self-scoring or hand-scoring format. Tests are also available on spirit masters at $10.50 to $18.00 per level. The examiner's manual, which is included with each set of consumable test booklets, is available separately for $1.60 per level; the teacher's planning guide is $4.25; and the resource file plus supplement is $61.00. Date of information: 1978.

FIELD TEST DATA

After pilot testing this battery in 23 schools and revising it, the publisher field tested this version in three schools in New York. A median of 152 pupils per test level provided data for both test forms, and a total of 113 pupils were tested on two adjacent levels of the battery. A variety of data are given including, for each objective, average correct, frequency distributions, and internal consistencies. Alternate form reliabilities and inter-level correlations are reported in several ways. Data appear in Working Paper #190 which is available from the University of Wisconsin R&D Center for Cognitive Learning.

ADMINISTRATION

Thirty-nine of the 45 objectives in this battery are designed for group testing by a teacher. Although the tests are untimed, the estimated time for testing a single skill averages 12 minutes (Working Paper #190).

SCORING

Scoring keys are provided.

COMMENTS

Data for test features #1, 5, 6, 8, and 21 were provided by the publisher after the original test review was completed. The judgments reported here for those features were made by one person (CBW).



WISCONSIN DESIGN FOR READING SKILL DEVELOPMENT: WORD ATTACK
by Wayne Otto, Karlyn Kamm, et al.

NCS Educational Systems, 1972

MEASUREMENT PROPERTIES

B C

A ®A

A ®0 C

1. Description. Given in Working Paper #190.

2. Agreement. No data.

3. Representativeness. No data.

4. Sensitivity. No data.

5. Item Uniformity. Internal consistencies for individual objectives have a median of about .77 (Hoyt r).

6. Divergent Validity. Intercorrelations of raw scores for pairs of objectives within each level are mostly below .5. Intercorrelations of mastery decisions for all pairs of tests within levels and between adjacent levels are also given.

A 7. Bias. No data.

A 8. Consistency. Median correlation between raw scores on alternate forms of single objectives is .64. Data are also given for consistency of mastery decisions across both forms of all objectives.

APPROPRIATENESS AND USABILITY

9. Instructions.

10. Item Review.

11. Visibility.

12. Responding.

13. Informativeness.

C 14. Curriculum Cross-Referencing.

C 15. Flexibility. Tests are available on spirit masters for separate duplication if desired.

16. Alternate Forms.

17. Administration.

18. Scoring. Hand scoring only.

C 19. Record Keeping.

A (s) 20. Decision Rules.

A e 21. Comparative Data. Although a variety of data are given, Working Paper #190 says that they are not intended for use as norms. The sample of pupils is geographically limited.




WOODCOCK READING MASTERY TESTS
by Richard W. Woodcock
American Guidance Services, Inc., 1973

DESCRIPTION

The Woodcock Test consists of 400 oral response items for measuring the following reading skills in grades K-12: letter identification (45 items), word identification (150), word attack (50), word comprehension (70 analogy items), and text comprehension (85 modified cloze items). In each skill area, items are arranged in ascending difficulty as determined by Rasch-Wright item analysis methods. Pupils work the test from their own basal level to their own ceiling. Alternate forms of the Woodcock are available.

PRICES

A complete set of materials for either form of the test costs $22.00. It includes the easel kit with all of the test items, the manual, and 25 forms for scoring and interpreting responses. Date of information: 1978.

FIELD TEST DATA

The final pool of 800 items (400 per form) was selected from an initial pool of over 2400 as a result of developmental testing. The final tests were normed on a fairly representative national sample of over 5000 pupils.

ADMINISTRATION

The Woodcock is an individual test which can be administered by a classroom teacher in an estimated time of 20 to 30 minutes.

SCORING

Individual responses are scored and recorded on the spot as the student speaks them. Correct answers are visible to the examiner on the backs of easel kit stimulus cards.

COMMENTS

Fall and spring percentile norms and normal curve equivalents for the Woodcock tests were expected to be available by the time this volume is published.


WOODCOCK READING MASTERY TESTS
by Richard W. Woodcock
American Guidance Services, Inc., 1973

MEASUREMENT PROPERTIES

ACC 1. Description. Although not stated in the usual form of behavioral objectives, the domains are described fairly well in the manual.

2. Agreement. No data.

3. Representativeness. Items were selected on statistical grounds.

4. Sensitivity. No data.

5. Item Uniformity. Split half reliabilities for 103 pupils on the five subtests vary from .79 to .99 at grade level "2.9." On the four tests of word- or text-level skills, they range from .83 to .98 at the "7.9" grade level for 102 pupils.

A 6. Divergent Validity. Tables 10-14 in the manual report correlations between subtests and of subtests with the total for other tests in the battery. They are rather dependent at the lower grade levels, over half the correlations being > .7. At the upper grade levels, relative independence is shown.

A 67) 7. Bias. No evidence.

8. Consistency. Reliabilities for retesting with the alternate form are .84 or better at the subtest level in 7 out of 10 cases reported.

APPROPRIATENESS AND USABILITY

C)B C 9. Instructions.

C) C 10. Item Review.

C 11. Visibility.

0 C 12. Responding.

0 C 13. Informativeness. Specimen sets are not offered, but the materials may be returned for refund within 30 days if they are in unused condition.

A @ 14. Curriculum Cross-Referencing.

C 15. Flexibility.

0 C 16. Alternate Forms.

17. Administration.

ACC 18. Scoring. Scoring of item responses is generally easy and objective, but may require some judgments of meaning. Converting the raw scores to derived scores requires some practice.

0 C 19. Record Keeping.

0 C 20. Decision Rules. The decision rules are like confidence intervals and predictions of success in using material at specific levels of difficulty.

C 21. Comparative Data.



CHAPTER 5
How To Select Tests: Locating Tests and Comparing Their Technical and Practical Features

This is the first of two chapters on selecting a test so that it will be suited to the needs of a particular program. This chapter describes procedures for locating and screening tests to arrive at a number that is workable for evaluating in detail. Methods for evaluating tests' technical and practical features, and comparing them according to these features, are then given. A major concern in test selection--finding the one which best matches a specific curriculum--is covered in detail in Chapter 6.

Ideally a test user would be able to identify the single best test for a given need (for example, diagnosis of word attack skills of third graders in the inner city) by consulting a reference book of test evaluations. A number of factors make this method unfeasible. For one, ongoing developments in testing cause a reference work to grow obsolete starting at the time when the research for the book stops. Second, not all features of a test are equally important to all test users, and a single test seldom excels in all features. Thus, it is necessary for individual users to weigh the various features according to their own needs and then to make overall comparisons. Finally, the single most important aspect of a test--its relevance to the test user's curriculum--can only be judged locally, by the people most familiar with that curriculum.

Before selecting a test, local program staff should decide whether testing is, in fact, the most effective response to their needs for information. This decision will depend on such issues as these:

What type of information is needed?

Who will receive the information?

What other methods are there to obtain the information?

What dollar costs are acceptable if the staff decides to proceed with a testing program?

What costs in pupil and staff time are acceptable?

How useful (i.e., timely and relevant) will the test scores be for the classroom teacher?



If testing turns out to be the preferred action, and if a specific test is not mandated by external authority, then test selection can proceed.

Having decided to test, most schools or districts seek to purchase ready-made tests--a logical first step. If suitable tests are not available, however, two other options can be considered. First, tests or testing systems may be created locally. The considerable cost in staff time for such a project may be substantially reduced through the use of such resources as skills continuums, objectives collections, and item banks (see Appendix A). The benefits of maintaining local control over testing may offset the costs of this option. Because the development of a test battery is a long-range project, this option should be followed only after careful consideration of the alternatives.

A second option is to hire a test developer to create a testing system. A number of publishers will custom-make CRTs. Appendix A lists some of these publishers. The tests they produce should be subject to the same evaluation procedures that would be applied to ready-made tests under consideration.

The procedures described in this chapter are meant to help you assess the merits of available NRTs, CRTs, or a mixture of the two, in your search for a test. Although one could use these procedures to examine a single test or test package, they are most useful for comparing two or more tests. It is not possible to say how much overall quality a single test must have in order to be "good enough," nor is it possible to determine that the match of a single test to a given curriculum is "close enough." One can only decide which of several tests is better.

Many potential test buyers will not have the personnel to follow all of the procedures in this chapter and the next one. We have included them so that test users can make decisions consciously rather than by oversight. Where test selection is carried out by a committee, as it is in a majority of school districts in the United States,1 it will be easier to evaluate tests thoroughly before choosing one.

Test selection involves a number of technical decisions, so it is essential2 that some of the people involved in the process have a knowledge of the principles of both criterion-referenced and norm-referenced testing. To maximize the instructional relevance of testing and to minimize the possible alienation resulting from it, it is also important to involve teachers and curriculum specialists--those most familiar with the students and the curriculum--at every step of test selection.

Finally, a word should be said about the importance of local field testing in test selection. Though not always possible, it is extremely helpful to try out a test in your own schools before deciding to adopt it. Teachers' and pupils' reactions to a test are very significant indicators not only of its appropriateness for your setting, but also of its quality and usability. Local test tryouts may serve either to screen out less desirable measures or to choose one out of a pool of finalists in the selection process.

1Dotseth, et al., 1978.

2APA, 1974.



HOW TO SELECT A TEST

IDENTIFY TESTS WHICH SEEM APPROPRIATE AND DO AN INITIAL SCREENING

Before starting to search for tests, you should be clear about your purposes in testing. For some purposes, certain characteristics of a test will be more important than others. A good understanding of what kind of information you want from the test will help you identify the test characteristics which are most important for your purposes.

Any purpose for testing is best described in terms of a type of decision which the test results are meant to influence. For example, a common purpose is to select a limited number of individuals from a large pool of available students, as in selecting for admission to a special program. Another purpose is to guide the planning of instruction by measuring students' current proficiency on a given set of skills. Still another is to make decisions about individual students by measuring how well they have mastered the objectives of a program.

Once the purpose for testing is made clear, you can develop a pool of available tests by means of a systematic search process. A good starting point for the search is the set of test reviews in Chapter 4 and in the reference works listed in Appendix B. Information in any reviews may need to be updated by referring to test publishers' current catalogs which are readily obtainable by mail.

At this point in the test selection process, you are working from descriptions of tests. As you look through these materials to exclude tests which do not respond to your specific needs, you are doing an initial sifting to arrive at a manageable number of tests for closer consideration.

To help you with this initial sifting, the following paragraphs mention several test uses and their implications for test selection.

Testing for diagnosis and prescription of the individual student

In order to be most usable for diagnosing individuals' strengths and needs, and for assigning lessons, a test must have these qualities:

Test items are keyed to clear and teachable objectives.

There are several items per objective.

Hand scoring is practical for quick use of results.

A score is given for each objective.

If scoring is by machine, the return of results to teachers is rapid, and score reports are easy to interpret.

Tests with only two or three items per objective will save testing time, but their consistency in identifying individual students' strengths and needs on particular objectives is lower than that of tests with more items per skill. Diagnostic tests may be packaged to allow testing only a small number of objectives at once, but usually they survey a large number of objectives in one test booklet.

It is up to the test buyer to decide what level of subject matter detail is needed in the test scores to support diagnosis and prescription.


Some educators believe that scores on fairly broad content areas such as vocabulary, word attack, and critical thinking are useful. Most classroom teachers feel that scores for objectives are needed at the level of a lesson or small number of lessons.

Testing to verify or monitor ongoing student progress

The traditional tool for monitoring students' learning is the teacher-made test. On the basis of the test scores, students are moved forward to the next lesson or are given more practice on the current one. A number of test publishers have produced batteries of many short tests which are meant for the same purpose. To be well suited for this purpose, a test battery must have these qualities:

Test items are keyed to clear and teachable objectives.

The test is packaged to allow testing a small number of objectives at one sitting, preferably one objective.

There is an adequate number of items per objective.

Hand scoring is practical for quick use of results.

A score is given for each objective.

These tests differ from diagnostic ones by covering a very small number of objectives in each test form to permit flexible, individualized testing of specific lessons as they are taught. Verification of student progress also requires a very reliable score on each skill so as to be sure of each student's degree of learning; a reliable score, in turn, requires fairly large numbers of items per objective.


Many instructional programs in reading and math have progress-monitoring test batteries as optional components. These batteries need to be evaluated before purchase just as carefully as any other tests.

Testing for program planning or needs assessment

When testing is done to identify the strengths and needs of a given curriculum, it can be thought of as diagnostic testing at the program level. Such tests should:

Survey the appropriate range of content and skills.

Give scores that allow planning decisions to be made.

Breadth of coverage is relevant here, not reliability at the level of the individual student. Thus the number of items per objective that individuals answer need not be large. Presumably, scores on tests for diagnosing individuals could be aggregated and used for this planning function, allowing the test to serve two purposes at once.

Testing for program evaluation or accountability

When testing is conducted to meet external requirements, those requirements may state which characteristics the test should have or even which test to use. Any required characteristics, such as the presence of national norms or of other field test data, can be used as screens in test selection. A growing number of CRTs provide norms along with the absolute scores. Testing for the purpose of program evaluation usually calls for the use of measures which survey a broad range of content and skills. If the choice of test is left to local discretion, then the test should also give scores that will support instructional decision making, at least at the program level if not at the classroom level. If instructional relevance is a lost cause, then tests for accountability or program evaluation can be chosen so as to minimize testing time.

Testing for other purposes

A few other obvious or surface features can be used for eliminating tests at this preliminary stage when you are working from the test descriptions--for example, availability of alternate test forms (for pre- and posttesting purposes). There will not be enough information in secondary sources to inform many other test selection decisions, although it may seem that there is. Take, for example, the need to select students for a special program. If an NRT is to be used, the appropriateness of the norm group is crucial. But information on norm groups is not available in many test reviews nor in most publishers' catalogs. In most cases, the test package itself will have to be examined directly in order to make judgments about other critical test features.

EXAMINE SPECIMEN SETS

Once some promising tests have been found in the secondary sources, specimen sets of these tests should be ordered. Further selection is then done by referring to actual test materials and manuals. At least two broad standards should be applied at this point.

Standard 1

First, the cultural appropriateness of each test's items for your student population should be judged. Some of the questions that will help you gauge the appropriateness of a test's items are these:

Are the concepts familiar to your students?

Is the dialect of the language familiar to your students?

Is the test's content free from social stereotypes?

Are the instructions to the student understandable?

This standard can be applied effectively by classroom teachers and curriculum coordinators who have a good sense of what is culturally suitable for the program's students.

Standard 2

The other standard is a rough measure of a test's relevance to the local curriculum. Because the objectives of most existing tests--CRTs as well as NRTs--are stated rather loosely, they may seem to fit any curriculum. In order to judge how well the test materials cover the skills of your specific program, you should examine the actual test items.

For each of the tests under consideration, identify and count the items which measure skills that are actually taught in your program at roughly the same level. Record that number and then calculate the proportion of items on each test that are relevant to your program. Compare the tests on these two figures--total number, and proportion of locally relevant items. Eliminate the tests which have markedly lower figures. This task can be effectively carried out only by persons who are very familiar with the curriculum as it is actually taught.
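As an invented illustration (the counts do not come from any reviewed test): if Test A has 90 items of which 54 measure locally taught skills, its two figures are 54 relevant items and a proportion of 54/90 = .60; if Test B has 120 items of which only 48 are relevant, its figures are 48 and 48/120 = .40. Test A would be retained and Test B would become a candidate for elimination.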


Page 169: DOCOMENT RESUME - ERIC · v. DOCOMENT RESUME ED 186 457 TN BOO 146. AUTHOR. walker, Clinton R: And Others. TITLE. CSE Criterion-Referenced Test Handbook. INSTITUTION California Univ.,

This initial method of comparing tests with the local curriculum, while useful, is not adequate for finding the one test which is best matched to your program, for the following reasons:

By matching test materials with the curriculum "as you remember it," you may overlook which and how many objectives of your own program each test fails to cover. In other words, since the focus is on test materials, skills in your program that are missing in the test battery will tend to be overlooked.

By making global judgments of the relevance of test items, you may not attend to a number of other factors that affect the appropriateness of the test materials, such as difficulty of test items, appropriateness of item formats, and the relative importance of the skills which each test covers and does not cover.

This initial method for judging tests' curricular relevance is only a broad screening device. In Chapter 6, a more thorough method is given which takes into account the other factors that were just noted.

As mentioned earlier, steps up to this point in test selection should quickly reduce the tests under consideration to a number that is practical to evaluate in detail. If the number of tests remaining at this point is too large for available staff to study closely, then other test features may be used as screens, or the previous features may be reapplied more stringently. On the other hand, if the remaining pool of tests provides no satisfying choices, then serious thought should be given to developing tests locally, modifying existing tests, or not testing at all.

COMPARE TESTS ACCORDING TO THEIR PRACTICAL AND TECHNICAL MERITS

The method for comparing tests' merits that is outlined here involves selecting test features to evaluate, making judgments about those features, converting the judgments into numbers, combining the numbers for each test, and comparing tests in terms of the numerical totals. These steps may at first seem too detailed and mechanical. Three points should be noted in this regard.

First, by assigning numbers at each stage of judgment and carrying them to the next stage, you ensure that information from earlier judgments is not lost. In other words, the component decisions all have an effect on the final ratings of each test. Second, the methods are explicit. Therefore, they are teachable, repeatable, and easy to adapt. Finally, as you follow the steps, you will find the procedures are harder to read about in the abstract than they are to apply in a practical situation and that they become quite easy with a little practice.

As a practical matter, it is desirable to have specific features of tests evaluated by staff members who have the special training and experience to evaluate them. Thus your specialists in testing could evaluate tests' statistical qualities while teachers and curriculum specialists could judge the features, such as directions to the pupils and quality of prescriptive aids, which require a knowledge of pupils and instructional materials.

Table 2 summarizes the steps in comparing tests feature by feature and serves as a checklist for carrying out these steps.



TABLE 2
Checklist of Steps for Comparing the Technical and Practical Merits of Tests

Step 1. Select test features to evaluate. (page 162)

Step 2. Rate the importance of the test features, and record the ratings on the Worksheet.* (page 162)

Step 3. Write the names of the tests to be compared at the top of the Worksheet, and duplicate the form for the test rater(s). (page 163)

Step 4. Find, in the sample materials for each test, the evidence for the first test feature. (page 168)

Step 5. Arrange the tests in descending order of merit on the given feature. Record these rankings (best, second, third...) in the respective columns of the Worksheet next to the name of the feature. (page 168)

Step 6. For tests which are equally good on a feature, give them the average of the ranks they would have earned if not equal. For tests which differ, but not by much, use the given rules of thumb. (page 168)

Step 7. Repeat Steps 4-6 for all other test features to be evaluated. (page 169)

Step 8. Summarize the tests' rankings by weighting them and then recording them in the "Final Results Table" at the upper right of the Worksheet. (page 169)

Step 9. Check that the total number of tallies per test in the "Final Results Table" is equal. (page 170)

Step 10. Compare the tests' profiles in the "Final Results Table." Eliminate tests that are markedly worse. Select the better ones for detailed analysis of their congruence with the local curriculum. (page 170)

*Figure 1, pages 164-165.


STEP 1. Select test features to evaluate.

Two lists of test features for comparing tests have been developed at the Center for the Study of Evaluation--one for use with norm-referenced tests,3 and the other for use with CRTs. The latter--the one used for test evaluations in this volume--is shown in Table 1, pages 14-15. Another list for evaluating CRTs was developed by Hambleton and Eignor.4

Any ready-made list should be edited by local staff. Such editing requires that the list be reviewed to determine if there are features you wish to add or omit. Features that should be omitted are:

Ones that do not make a test better or worse for meeting your testing needs. These are features which are irrelevant or are of negligible importance. For example, the two test features, curriculum cross-referencing and alternate forms, may be eliminated from the judging process when there is to be one-time survey testing for accountability purposes, with its broad normative scores and slow reporting of results.

Features that have already been used in a pass/fail fashion to narrow the pool of available tests. These are called exclusionary features. In screening tests to use for diagnosis, for example, you will already have excluded tests which do not provide scores for separate objectives.

3Hoepfner, et al., 1976.

4Hambleton and Eignor, 1978.


Some features may be used in both a pass/fail fashion and a comparative one. For example, tests with fewer than some minimum acceptable number of items per objective may be excluded in the earlier screening; then, when tests are compared feature by feature, tests with larger numbers of items per objective may be rated higher than tests with smaller numbers. In the same vein, tests which do not offer optional curriculum indexes may be screened out, and the remaining tests later compared on the quality of their curriculum indexes.

Figure 1, pages 164-165, is a worksheet for recording and summarizing judgments about individual test features. In the first column of the worksheet, write the names of the features to be evaluated. Figure 2, pages 166-167, shows a worked example of the worksheet with a small set of features chosen from Table 1, pages 14-15.


STEP 2. Rate the importance of the test features, and record the ratings on the worksheet.

A test's suitability for your needs depends more heavily on some of its features than on others. Three degrees of importance in features have already been recognized up to now:

Exclusionary features--ones that are so important that they are essential if a test is to meet your needs. These are used in a pass/fail fashion to exclude clearly unacceptable or irrelevant tests.

Irrelevant or unimportant features--ones that have just been eliminated from consideration because they do not make a test better or worse for your purposes.


Comparative features--all of those aspects of a test which make it more or less suitable. These include exclusionary features on which tests may still vary in quality even after they have met minimum levels of acceptability as mentioned under Step 1. Also included, of course, are various other test features you have deemed useful for judging the practical and technical merits of the tests under consideration.

Now judge the relative importance of these features and assign importance ratings, or weights, to them. We recommend a three-level weighting system like the following:

3 = most important
2 = average importance
1 = useful, but not so important

The later, overall rating of a test is influenced by the importance weight of each feature. The purpose of having exclusionary features for screening tests at first and then importance weights for adjusting the influence of features on the overall rating is this: It is necessary to keep the less important features from adding up in the final analysis to overcompensate for the absence of essential and more important ones. In other words, don't let the minor test features dominate the comparison of tests. As noted above, a feature that is of minor importance for one use may be essential for a different use.

The different audiences and users of the tests in your program should participate in making the importance ratings so that their needs and interests will be taken into account. We recommend that teachers have a major voice at this stage because they have a good sense of how tests may or may not be useful for instructional purposes, of how practical a test is to administer, and of the effects of testing on pupils' motivation and morale.

STEP 3. Write the names of the tests to be compared at the top of the worksheet, and duplicate the form for the test rater(s).

In the spaces at the top of the worksheet, enter the name, form, and level of each test to be evaluated. For ease in filling out the rest of the worksheet, write an abbreviation of each test's name in the column labeled "Abbreviated Name."

Make a photocopy of the form for each person (or team of persons) who will be evaluating the tests, keeping the original copy blank in case more clean duplicates are needed.


[Figure 1. CSE Worksheet for Comparing Tests' Technical and Practical Features. The blank form provides spaces for the month/year and rater(s); the names, forms, and levels of the tests being compared and their abbreviated names (Step 3); the test features to be evaluated (Step 1); the importance weights of the features--3 = very important, 2 = important, 1 = useful (Step 2); columns for recording the rankings of the tests on each feature, with ties averaged (Steps 5-7); a "Final Results Table" of weighted rankings, with columns for 1st through 5th place, the in-between places, and Zero/Not Acceptable (Steps 8-10); and a Notes column.]


[Figure 2. Worked Example of CSE Worksheet for Comparing Tests' Technical and Practical Features. The worked example, dated March 1979, compares Test A (primary level), Test B, and Test C on a small set of features drawn from Table 1, pages 14-15: 1. Domain descriptions; 2. Agreement (of items and their domains); 8. Consistency of scores (reliability--should be rated by a testing person); 10. Item review; 14. Curriculum cross-referencing; 16. Alternate forms; and 20. Decision rules. Importance weights, the rankings of the three tests on each feature (including a tie between Tests A and B on feature 16, averaged to 1.5), the weighted tallies in the Final Results Table, and the raters' notes are filled in by hand.]


STEP 4. Find, in the sample materials for each test, the evidence for the first test feature.

The specimen sets for many tests have an examiner's manual, a technical report, one complete test form for each test level, a complete set of answer sheets (if they are separate from the test forms), a complete set of scoring keys, examples of score reports, and any relevant stimulus cards, manipulanda, etc. Not all specimen sets are organized the same, and the evidence for any given test feature may be spread over several places.

The test rater should become familiar with the specimen sets, finding and noting the evidence for each feature which (s)he has the job of evaluating. If there appears to be no evidence for a given feature, that fact will be noted in the next step.

Find the evidence for the first test feature in all of the specimen sets.

STEP 5. Arrange the tests in descending order of merit on the given feature. Record these rankings (best, second, third...) in the respective columns of the worksheet next to the name of the feature.

Study the various tests' evidence for the given feature and decide which one (if any) is better than the others on that one dimension. Then decide which test is second best, and so on. For any tests which provide no evidence of merit on a feature, or else evidence of insufficient merit, rank them as zeroes on that characteristic. You will have to decide locally how little merit a test can have on a feature and still be worth a ranking above zero. For example, you may decide that reliabilities below .60 are as bad as having no reliability data at all. Then you would rank all tests with no reliability figures or with figures below .60 as zeroes, and give the remaining tests positive rankings.

For this first feature, write the tests' abbreviated names in the columns for their respective rankings. Make these entries on the same line as the name of the feature. Be sure to write the short names of the zero-rated tests in the zero column because this information is used later.

STEP 6. For tests which are equally good on a feature, give them the average of the ranks they would have earned if not equal. For tests which differ, but not by much, use the given rules of thumb.

Occasionally two or more tests will be equally good on a given feature so that they are tied in ranking. For these cases, it is necessary to have a standard method of recording the rankings. A method that is commonly used with such ordinal (rank order) data is to assign each of the tied tests the average of the ranks they would have occupied if they had not been tied.

An illustration of this appears on the worked example in Figure 2. For feature #16 (alternate forms), Tests A and B have received equal ratings of 1.5 (the average of ranks 1 and 2: (1+2)/2 = 1.5). On the line of the worksheet for that feature, a circle has been drawn that includes the spaces for the first and second places. The abbreviated names of the two tied tests have been written in the circle as has the rating of 1.5.


In the same vein, if three tests were tied for third place, you would circle the spaces for third, fourth, and fifth, write the tests' short names in the circle, and write in the average of 3, 4, and 5, which is 4.

In short, give each of the tied tests the average of the ranks which they would have earned if not tied.
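If local staff prefer to compute the tied ranks by machine rather than by hand, the averaging rule can be written in a few lines of code. The sketch below is only an illustration of the rule described above; the function name and the example ratings are invented for this purpose and are not taken from the worksheet.

```python
def rank_with_ties(merit_order):
    """Assign rankings (1 = best) to tests listed from best to worst,
    giving tied tests the average of the ranks they would have earned."""
    ranks = {}
    position = 1
    for tied_group in merit_order:          # each entry is a list of tests judged equal
        spanned = range(position, position + len(tied_group))
        average_rank = sum(spanned) / len(tied_group)
        for test in tied_group:
            ranks[test] = average_rank
        position += len(tied_group)
    return ranks

# Tests A and B tied for best, Test C third (hypothetical):
print(rank_with_ties([["A", "B"], ["C"]]))   # {'A': 1.5, 'B': 1.5, 'C': 3.0}
```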

A related difficulty in ranking tests arises when they differ, but only slightly, in their merits on a given feature. Here you need to decide, "How much of a difference in quality makes a difference?" One rule of thumb is that small differences in merit should result in different rankings for test features that are very important, but not for features that are less important. A second rule of thumb is that small differences in merit should result in the same ranking for features that are judged subjectively or on which different judges disagree a great deal. For features on which clear, objective determinations can be made, there is justification for assigning different rankings on small differences.

You will still have to decide locally how much of a difference in quality should be treated as an effective difference, but the two rules of thumb will make those decisions easier.

STEP 7. Repeat Steps 4-6 for all other test features to be evaluated.

Compare the tests one feature at a time, and record their rankings for a feature before going on to evaluate the next one. When problems or questions arise, note them in the right column of the worksheet. They can be resolved later by conferring with other test raters or consultants. The "Notes" column can also be used to record reasons for a given ranking.

Staff members with special expertise should be assigned specific features to evaluate, so one person will not be rating all of the features. For example, language specialists will evaluate the linguistic and cultural appropriateness of a test for a bilingual program; testing specialists will rate the statistical features, etc.

STEP 8. Summarize the tests' rankings by weighting them and then recording them in the "Final Results Table" at the upper right of the worksheet.

Start with the rankings of the first feature. For the test that is ranked Best, you will enter one, two, or three tallies in the first column of the "Final Results Table" according to whether the feature has an importance rating of 1, 2, or 3. That is, the test which is ranked Best on a Very Important feature will have three tallies entered in the 1st place column of the table. Two tests that are tied for second and third place on that feature (hence are both ranked 2.5) will each have three tallies entered in the column headed 2-3 of the "Final Results Table." Any other fractional rankings will be transferred to the in-between columns of the summary table. Another test which had no acceptable evidence for that same feature would have three tallies entered in the right hand column of the table. All tallies will be written on the line of the table opposite the respective tests' name.
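Readers who keep the worksheet on a computer could also let it accumulate the weighted tallies automatically. The following sketch is a minimal illustration of the tallying rule just described; the feature names, weights, and rankings are hypothetical and are not taken from Figure 2.

```python
from collections import defaultdict

# Hypothetical rankings: feature -> (importance weight, {test: rank}).
# A rank of 0 means "no acceptable evidence."
rankings = {
    "Domain descriptions": (3, {"A": 1, "B": 2, "C": 0}),
    "Alternate forms":     (1, {"A": 1.5, "B": 1.5, "C": 3}),
}

def column_label(rank):
    """Translate a (possibly fractional) rank into a Final Results Table column."""
    if rank == 0:
        return "Zero"
    if rank == int(rank):
        return f"{int(rank)}"                  # a whole-number place column
    return f"{int(rank)}-{int(rank) + 1}"      # an in-between column, e.g. 2-3 for 2.5

tallies = {test: defaultdict(int) for test in ("A", "B", "C")}
for weight, ranks in rankings.values():
    for test, rank in ranks.items():
        tallies[test][column_label(rank)] += weight   # one tally per weight point

print(dict(tallies["A"]))   # {'1': 3, '1-2': 1}
```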


STEP 9. Check that the total number of tallies per test in the "Final Results Table" is equal.

Check your entries in the "Final Results Table" by counting the number of tallies for each test. The total number of tallies should be the same for each test, and should equal the sum of the importance weights for the features which were evaluated. If this is not so, re-do Step 8 on a sheet of scratch paper column by column, instead of feature by feature. Again verify your work by seeing if the number of tallies is equal and correct.
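As an invented check: if four features were evaluated with importance weights of 3, 3, 2, and 1, every test should show exactly 3 + 3 + 2 + 1 = 9 tallies across its row of the table.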

The outcome of this step is a table of profiles for the tests showing how many first places, in-between first and second places, second places, etc., each test earned. It is these overall profiles which will be compared next as the index of tests' technical and practical quality.

STEP 10. Compare the tests' profiles in the "Final Results Table." Eliminate tests that are markedly worse. Select the better ones for detailed analysis of their congruence with the local curriculum.

Now compare the tests. Better tests have a greater part of their weighted ranks in the higher places, toward the left of the "Final Results Table." Tests of relatively lower quality and merit have a greater balance of their rankings in the Zero and other lower scores. Small differences between tests in the balance of high and low ranks should not be seen as significant, since the data do not come from precise measurement. At this stage of test selection, the purpose is to screen out tests that have markedly lower quality on the features which are relevant for your program.

If there is not an obvious break between the higher ranking and lower ranking tests, you may select and screen on the basis of your resources for carrying out the next step in test selection. That step involves studying tests item by item and judging the items' relevance to your curriculum. Since this analysis is quite detailed, you will want to carry it out on only a small set of tests. That consideration might lead you to select, say, the three top ranking tests in the "Final Results Table" for detailed curricular analysis. Retain the other tests in case the top three turn out to have too little relevance to your program.

Refer now to the "Final Results Table" to decide whether any of the tests under consideration are markedly better or worse in their overall rankings. Either the profile of tallies for each test may be compared, or the tallies may be converted to percentages5 if percentages are easier to understand.

5To transform the tallies into percentages, simply divide the total number of possible tallies (found in Step 9) into the number of tallies in each cell or box of the table. Record the numbers. The resulting figures are percentages of the total possible tallies which fall in each box. Adding across for each test, the percentages should sum to 100% (plus or minus rounding error).
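For instance (an invented illustration): if the importance weights of the evaluated features sum to 15 possible tallies per test, a test with 6 tallies in the 1st place column has 6/15 = 40% of its weighted rankings in first place.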


SUMMARY

The methods in this chapter are meant to help you find, screen, and evaluate tests to suit your special situation. The complex judgment about the relative quality of tests is approached systematically by breaking it into a number of simpler judgments, then combining the results. Since these procedures are judgmental and not precise, you should regard them as hints for comparing tests, not as hard and fast rules. Feel free to adapt them to your needs and resources.

The most important aspect of tests, their relevance to the local curriculum, remains to be evaluated at this point. Chapter 6 takes up this final step in selecting a test.


CHAPTER 6
How To Select Tests: Comparing Tests for Their Relevance to a Given Curriculum

The previous chapter contained instructions for screening tests according to their potential uses and their technical merits. The measures which remain after screening can now be evaluated for their responsiveness to the local curriculum. In this chapter, procedures are described for rating the importance, content relevance, and difficulty of the objectives covered by a test, then comparing the ratings of the various tests. Three indices for comparing a test's congruence with the program are described: an overall measure, the proportion of a program's objectives covered by the test, and the proportion of the test's items that are relevant to the program.

INTRODUCTION

Achievement tests should be chosen so as to be maximally relevant to the test user's program. If the match between test and program is poor, then the test scores will not be useful for diagnostic or prescriptive purposes. Nor will such scores be useful for accountability or program evaluation purposes. Tests with low relevance to a given curriculum will not give fair credit for the successful teaching and learning which occur.

The research reviewed in Chapter 2 strongly supports the conclusion that tests differ in their effects on students' scores according to how well or badly the objectives tested match the objectives taught. Care taken in selecting tests for their curricular relevance will be rewarded when the scores are useful for instructional decisions and when evaluation results give credit for the program's actual achievements.

This chapter gives step-by-step procedures for comparing tests' curricular relevance. The procedures involve making a series of judgments about program objectives and test materials, expressing these judgments as numbers, combining the numbers for a single test, then comparing the results across tests. Table 3 gives a checklist of the steps for evaluating curricular relevance.

Because the method described in this chapter is a detailed one, you may wish to employ it only for major test selection decisions. Questions that may help determine whether a test selection decision is a major one include these:


How many students will be tested?

How much class time will be required for testing?

Will the selected test be used repeatedly?

Will the test's results be highly visible (e.g., to the public and to higher authorities)?

Will the test results be used for decision making (e.g., about students, curriculum, teachers, or budget)?

The complexity of testing, both in terms of its relation to curriculum and in terms of numbers of people affected, requires the test selector to be very thorough and careful. In choosing a multilevel testing system, it is advisable to have each separate level of the test rated by teachers and curriculum specialists who are familiar with your program as it is actually taught. The objectives of most test batteries vary somewhat from level to level in content and in difficulty, so their appropriateness for your program may vary across levels as well.

The methods in this section ask you to compare test items with program objectives. There are several reasons for carrying out such a thorough analysis of tests before choosing one. First, these procedures help you to find the test that is most responsive to your purposes. Many tests are likely not to match your program well. Second, the procedures are explicit and easy to adapt to the constraints of your situation if you find yourself without sufficient time or resources to follow them exactly. Third, these procedures call attention to some aspects of tests which should not be overlooked, for example, the proportion of a test battery that is locally relevant, the proportion of the local curriculum which a test battery covers, the importance of the objectives covered, and the appropriateness of the test's difficulty for the program's students. Finally, the process of making numerical ratings at each stage of judgment and carrying them to the next stage ensures that information from earlier judgments is not forgotten or lost. As in the methods of Chapter 5, the component decisions all have an influence on the final rating of a test.

The methods described below deal with instructional objectives and with test items. Not every staff member is equally suited to use these methods. A number of educators are opposed to instructional objectives for various reasons. Many others do not have the patience or the style of thinking to deal with objectives. The best people for this task would not only be very familiar with the curriculum at the relevant level, but also have some skill in writing and recognizing objectives and a belief in the importance of curricular relevance in tests.

NOTE: In the following discussion, the word skill will sometimes be used interchangeably with the word objective.


TABLE 3
Checklist of Steps for Comparing Tests' Relevance to a Given Curriculum

Step 1. Prepare a listing of the objectives of the program component to be tested. (page 176)

Step 2. Write your listing of program objectives to be tested in Column 1 of the Test Relevance Rating Form, called the worksheet (Figure 3). (page 178)

Step 3. Record the number of program objectives in Box B on the final page of the worksheet. (page 178)

Step 4. Rate the importance of each program objective in your listing, and record these judgments in Column 3. (page 186)

Step 5. Duplicate the worksheet for all of the raters and all of the tests still under consideration. Fill in the identifying information for each test to be rated. (page 186)

Step 6. List/index all of the items on the test in Column 2, each on the same line as the program objective that is most closely related to it. (page 186)

Step 7. Record the number of items on the test in both Box A and Box C on the final page of the worksheet. (page 186)

Step 8. Judge how closely the test items correspond with the respective program objectives in format, content, and process; record these judgments in Column 4. (page 187)

Step 9. Rate the appropriateness of the difficulty of each test item, and record the ratings in Column 5. (page 190)

Step 10. For each program objective that has any acceptable test items, multiply the ratings in Columns 3, 4, and 5 for each item; record the products in Column 6. (page 191)

Step 11. Add all of the products from Step 10, and record the sum at the bottom of Column 6 and in Box A. (page 191)

Step 12. Record the number of acceptable test items in Box C. (page 192)

Step 13. Compute the summary indices of tests' congruence with the curriculum, and record them at the bottom of the last page and the top of the first page of the worksheet. (page 192)

Step 14. Compare the summary indices of the tests under consideration. Decide whether one test has markedly greater congruence with your curriculum. (page 192)
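Steps 10 through 13 are simple arithmetic, and a district that keeps its item-by-objective ratings in a short program or spreadsheet can compute the summary indices directly. The sketch below is only an illustration: the objective names, ratings, and counts are invented, and the "overall measure" is computed here simply as the Step 11 sum of products; the handbook's exact formulas for the three indices appear later in the chapter and should be followed where they differ.

```python
# Hypothetical worksheet data: each acceptable test item is recorded with the
# program objective it matches and the three ratings from Columns 3-5.
items = [
    # (objective, importance, correspondence, difficulty_appropriateness)
    ("initial consonant sounds", 3, 2, 2),
    ("initial consonant sounds", 3, 2, 1),
    ("short vowel sounds",       2, 1, 2),
]
program_objectives = 10      # Box B: objectives listed in Column 1
total_test_items = 45        # all items on the test, relevant or not

sum_of_products = sum(imp * corr * diff for _, imp, corr, diff in items)   # Steps 10-11
objectives_covered = len({objective for objective, *_ in items})
acceptable_items = len(items)                                              # Box C, Step 12

print("Overall (sum of products):", sum_of_products)
print("Proportion of program objectives covered:",
      objectives_covered / program_objectives)
print("Proportion of test items relevant to the program:",
      acceptable_items / total_test_items)
```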


STEP 1. Prepare a listing of the objectives of the program component to be tested.

To find the test which is most relevant and responsive to your program, it is necessary first to be very clear about the instructional objectives of the curriculum to be tested. Such clarity is attained by making an explicit listing or index of these objectives. The listing should be prepared carefully, for it will serve as the standard of curricular relevance with which test materials will be compared.

Preparing such a list may be complicated if there are differences between the operational classroom curriculum and the official, formal one. Another complication arises when the operational curriculum varies from one organizational unit to another (i.e., from class to class or site to site). If there is little commonality of objectives from unit to unit, it will not be possible to draw up a realistic single listing. In this case, a single test cannot give a responsive, representative measure for all units, and the quick screening method of determining curricular relevance (Chapter 5) may be the best you can do.

Suggestions are given here for drawing up your list of curricular objectives under two conditions:

When each subject area to be tested in the program has a uniform curriculum (even if there is a discrepancy between the operational curriculum and the official, formal one);

When the objectives for the given subject area vary from organizational unit to unit, but there is great commonality in the important objectives.

1A. When there is a uniform curriculum, list (or index) the objectives for the program component to be tested as follows:

(1) Write the objectives in enough detail so that later in the process it will be possible to judge with confidence how closely a given test item measures or matches an objective. If, for example, your program teaches division in working (i.e., radical) form, but a test gives its division problems in number sentence form, your listing of local math objectives should enable the test rater to detect this difference and judge its importance. In the same vein, the listing of your language arts curriculum should enable the test rater to judge how well the words on a vocabulary test correspond with the vocabulary words in your program. Since curricular objectives are often stated rather generally, it will often be necessary to refine these objectives in order to use them as a basis for judging relevance of test items.

(2) When it would be burdensome to prepare such a full statement of your curricular objectives, an alternative is to prepare an index of them in the form of page references to the relevant teaching and exercise materials used in the classroom. For each separately teachable and testable skill, list in one place all of the pages where the skill is taught and practiced. A name or other verbal label for each of these skills should accompany the page references. This page referencing of skills to teachable materials will enable test raters to compare test items directly with instructional content and activities--a later step in the curriculum-matching process.

The referencing method of listing local curricular objectives may be used either with or instead of the detailed method in 1A(1) above.

(3) In either instance above, it will help test raters to work with the listing if related objectives are grouped together. For example, a listing of fifth grade math objectives could be grouped under such headings as geometry, measurement, money, time, graphing, word problems, basic operations, and the like. For elementary reading, objectives could be grouped under such headings as phonics, structural analysis, sight words, vocabulary, comprehension, and the like. Subheadings can be used for smaller clusters of skills such as for the different basic arithmetic operations or the different types of comprehension skills which the curriculum covers. See Figure 4 for examples of subheads for grouping objectives.

(4) When the local curriculum is very detailed, your task of preparing a list of objectives can be simplified by combining small objectives. For example, if there are separate objectives for aural decoding of each speech sound in each of three positions within words--initial, medial, and final--this set of over 50 objectives could easily be reduced to six objectives dealing with consonants and vowels in each of the three positions. These six broader objectives would then be written in the listing instead of the many smaller ones. By combining very small, but closely related objectives, you can simplify the task of matching tests with curriculum without overlooking the more general skills which the specific skills comprise.

Two cautions should be noted regarding combining objectives. First, the amount of combining that is useful will vary with the intended use of the test. Combining will be of greater use for selecting survey tests than for selecting a battery of continuous progress tests. In the latter case, very detailed objectives, corresponding to individual lessons, might be needed. Second, it is possible to group too much. When objectives are broad and vague (e.g., critical thinking, word attack), their descriptions or labels do not make it clear what is being taught, learned, or tested. Such broad spectrum objectives do not describe the program skill in enough detail to allow the test rater to judge whether the relevant items measure the skill as it is taught.

(5) In cases where the formal, official curriculum and the operational classroom curriculum differ to any great degree, you will have to decide how to treat the differences. If the formal curriculum has not kept up with advances in classroom teaching, then it is reasonable to use the page referencing method in listing the program objectives. If, however, the formal curriculum accurately represents current program intentions, it is reasonable to follow the official formal objectives in preparing the listing. Other differences will need to be resolved on an individual basis.

1B. When the operational, classroom curriculum varies from site to site, but there is great commonality in the important objectives for the program component to be tested, make a listing of the common objectives as follows:

(1) Either compare listings of the separate classroom curricula and make a program listing out of the objectives that are common to the separate lists; or


(2) Give teachers of the different classroom level curricula a comprehensive listing of possible objectives for the appropriate level and subject. Ask the teachers to examine the master list and to check off the objectives which they actually teach at that level. Make a single program-wide listing out of the most commonly checked skills.

(3) Then go through the steps in 1A above to make this listing explicit, usable, and manageably short.

STEPS 2 and 3. Write your listing of program objectives to be tested in Column 1 of the Test Relevance Rating Form; record the number of objectives in Box B.

Contained in this chapter is a worksheet on which you can record the appropriate information as you follow the rating procedures. A blank version (Figure 3) and a worked example (Figure 4) of the worksheet are provided on the following pages.

Column 1 of the worksheet will contain your listing (or indexing) of the curricular component to be tested. This listing will be organized so that related objectives are grouped together under a common heading. Some of the smaller, more detailed objectives in your program may not appear separately in the listing because they have been grouped together into larger objectives.

Several sheets may be needed for listing or indexing the program component to be tested. Number the pages and draw a heavy line under the last program objective, writing END OF LISTING in bold letters. Count the number of objectives in Column 1 and enter this number as the denominator in Box B on the final page of the worksheet. Count only the objectives and not the names of curricular subareas or skill clusters. In Figure 4, there are 10 program objectives listed.


First sheet of ____ sheets

TEST NAME, LEVEL, AND FORM ____________    RATER ____________
PROGRAM SUBJECT AND LEVEL ____________     DATE ____________

OVERALL RATINGS (fill in last):
GRAND AVERAGE ____ (average congruence per item, ranging from 0-6)
INDEX OF COVERAGE ____ (proportion of program objectives measured by test)
INDEX OF RELEVANCE ____ (proportion of test that is relevant to program objectives)

The worksheet consists of a first sheet, continuation sheets ("Continuation sheet: page ____"), and a final sheet ("Final sheet: page ____"), each carrying blanks for the test name, level, and form; the rater; and the date. Every sheet has the same columns:

Column 1 (Step 2): Listing of program objectives
Column 2 (Step 6): Index of corresponding test items
Column 3 (Step 4): Importance of program objectives (1=minor, 2=important, 3=essential)
Column 4 (Step 8): Match between items and objectives (0=not acceptable, 1=adequate, 2=very close)
Column 5 (Step 9): Appropriateness of item difficulty (0=too hard or too easy, 1=acceptable)
Column 6 (Step 10): Combined judgments (products across Columns 3, 4, and 5)
Notes

The final sheet adds a line for clearly irrelevant items, a space for the sum of the numbers in the sixth column (Step 11; enter in Box A also), and the OVERALL RATINGS boxes (Step 13):

BOX A (GRAND AVERAGE): sum of numbers in Column 6 (Step 11) divided by total number of test items (Step 7)
BOX B (INDEX OF COVERAGE): number of program objectives adequately measured by test divided by total number of program objectives in Column 1 (Step 3)
BOX C (INDEX OF RELEVANCE): number of acceptable test items (Step 12) divided by total number of test items (Step 7)

Figure 3. CSE Test Relevance Rating Form


First sheet of 4 sheets

TEST NAME, LEVEL, FORM: All American Test of Reading Comprehension, brown level
RATER: Marion Choy
PROGRAM SUBJECT AND LEVEL: 6th grade reading comprehension
DATE: 1/15/xx

OVERALL RATINGS*: GRAND AVERAGE 2.1 (average congruence per item, ranging from 0-6); INDEX OF COVERAGE .70 (proportion of program objectives measured by test); INDEX OF RELEVANCE .63 (proportion of test that is relevant to program objectives)

Column 1 lists ten program objectives, grouped under curricular subareas and skill clusters:

WORD LEVEL OBJECTIVES (curricular subarea)

Word attack (skill cluster)

Affixes: In a list of words--some of which have prefixes, some others of which have suffixes, and some of which do not have affixes--pupils will underline the affixes. The affixes will be drawn from this list: re-, pre-, un-, mis-, dis-, -ness, -less, -ful, -ly, -y, -en, and -er (as in driver).

Compound words: Pupils will complete compound words by matching words in a left column with words in a right column.

Root words: Given a list of words, each containing an affix, the pupil will write the root word. Affixes will include verb markers for tense and progressive, comparatives, and superlatives, and the ones for the objective on affixes above.

Meaning (skill cluster)

Synonyms: Given a vocabulary word, the pupil will select from multiple choices the word or phrase which is a synonym.

Antonyms: Given a vocabulary word which has an opposite, the pupil will select its antonym from multiple choices.

PHRASE, SENTENCE, AND TEXT LEVEL OBJECTIVES (curricular subarea)

Meaning from context--words with one familiar meaning: Given sentences with one word omitted, pupils will select from multiple choices the one word whose meaning is most closely related to the context. Choices will be about the same length (± 2 letters) and at least two of them will start with the same letter.

Meaning from context--words with more than one familiar meaning: Given sentences with a multiple-meaning word underlined, the pupil will pick from multiple choices the definition of the word which fits the context.

Main idea: Given a story of 3-5 sentences, pupils will select the main idea, where the three distractors deal with particulars of the story or with generalizations from single particulars.

Inferences: Given a story in about three paragraphs, pupils will mark whether each of several supposed inferences from the story is probably true, probably false, or can't tell.

Meanings of colloquial phrases: Given a sentence with an idiomatic colloquial phrase underlined, pupils will select the literal phrase with the same meaning from multiple choices.

END OF LISTING

Column 2 indexes the test's 41 items against these objectives in groups (p. 1, #1-6; p. 2, #7-9; p. 2, #10-14; p. 3, #15-17; p. 4, #18-20; p. 4, #21-23; p. 5, #24-26; p. 5, #27-29; p. 6, #30-32; p. 7, #33-35; p. 8, #39-41), with items p. 7, #36-38 entered at the END OF LISTING as clearly irrelevant. Columns 3 through 5 record the rater's importance, match, and difficulty ratings for each indexed item, and Column 6 records the products of those ratings.

Final sheet totals:

Step 11 (sum of numbers in sixth column; enter in Box A also): 88
BOX A (GRAND AVERAGE 2.1): sum of numbers in Column 6 (Step 11), 88, divided by total number of test items (Step 7), 41
BOX B (INDEX OF COVERAGE .70): number of program objectives adequately measured by test, 7, divided by total number of program objectives in Column 1 (Step 3), 10
BOX C (INDEX OF RELEVANCE .63): number of acceptable test items (Step 12), 26, divided by total number of test items (Step 7), 41

*Note: These ratings will vary with your judgments of your pupils' abilities and the importance of the program objectives.

Figure 4. Worked Example of CSE Test Relevance Rating Form


STEPS 4 and 5. Rate the importance of each program objective. Duplicate the worksheet and fill in the identifying information for each test to be rated.

In Step 4, judgments are made about the importance of each of the objectives that is listed in Column 1. These judgments are then expressed in numbers, indicating degrees of importance, and are recorded in the third column.

For each of the program objectives, the test rater is to judge how important it is for students to attain. The number of degrees of importance you decide to use is a matter of local judgment, but three degrees (minor, important, and essential) offer a balance of convenience and contrast.

For each objective that is judged to be of minor importance, assign a rating of 1, and record the rating in the third column on the same line as the objective. A minor objective is one that could be omitted with little harm to student progress. Important objectives, ones that clearly contribute to progress or are worth learning for their own sake, are assigned a rating of 2. Essential objectives, ones that are prerequisites or are essential for student progress, are given a value of 3.

After judging the importance of each program objective and recording its importance rating in Column 3, check the ratings by comparing them with one another. That is, after judging all objectives separately, confirm the ratings by seeing if the ratings seem appropriate relative to one another.

On completing all of the steps up to this point, make enough copies of the partially filled-in CSE Test Relevance Rating Form to permit all of the raters to rate all of the tests under consideration. Keep the original form blank in case more copies are needed. For each test, fill in the blanks at the top of each page of the worksheet.

STEPS 6 and 7. List (or index) all of the items on the test in Column 2, each on the same line as the program objective that is most closely related to it. Record the number of test items in both Box A and Box C on the final page of the worksheet.

Look at each test item and decide which program objective in Column 1, if any, it seems to measure. For each item, write its number (or test page and number) in Column 2 opposite the relevant program objective. At this stage, be generous in judging whether an item is responsive to an objective; what is important here is to assemble for each objective all of the items that measure it, even remotely.

Try to pair each test item with only one program objective; but if an item seems to measure more than one program objective, write its number in Column 2 opposite each objective. Circle any repeated listing of a single item for later reference.

There will probably be some items on the test which do not correspond to any of the objectives in Column 1. List these items at the end of Column 2, next to the END OF LISTING in Column 1. Enter either the item number or page and number so that you and other test raters can compare your judgments about the items.

Ideally, you would be able to list or index a test's objectives rather than its items in Column 2 next to the relevant program objectives. In fact, the objectives of existing tests are not specific enough to serve as a basis for judging test relevance accurately.

Before going on, count the total number of items on the test being rated, and enter the number as the denominator in Boxes A and C on the final page of the worksheet. If you make this tally by counting numbers in Column 2, make sure not to count any item more than once. That is, do not count any circled (i.e., repeated) items.

STEP 8. Judge how closely the test items correspond with the respective program objectives in format, content, and process; record these judgments in Column 4.

The purpose of this step is to judge how relevant or sensitive each item is to the corresponding objective that your program teaches. Examine each test item, and judge how closely it corresponds to the respective program objective in format, content, and process tested. The correspondence may be not acceptable, adequate, or very close. For those degrees of match/mismatch, assign a score of 0, 1, or 2 respectively and record it in the fourth column.

If the item format (e.g., matching pictures and words) differs from the format of the relevant instruction and practice, decide whether that difference will interfere with your students' displaying their learning of the program skills on the test. If the test format is so unfamiliar as to make it very hard for students to show their learning of the program skill, then a zero rating should be recorded.

Attend also to the content and process which the item measures. For objectives dealing with specific knowledge (e.g., vocabulary), make your judgment according to how closely the content of the item samples the content of the instruction. For objectives dealing with processes (e.g., identifying the main idea), decide how well the process, as taught, matches the process needed to answer the item correctly.

Record the overall rating of format, content, and process in Column 4 as one number. For an item earning a zero rating, draw a horizontal line through the next two columns to show that it does not need to be rated further.

The issue of program and item content is illustrated by comparing the first program objective in the worked example with sample test items 1-6 in Figure 5. The objective reads as follows:

    Affixes: In a list of words--some of which have prefixes, some others of which have suffixes, and some of which do not have affixes--pupils will underline the affixes.

Sample test items 1-6 (in Figure 5) earn a congruency rating of 2 in the worked example in part because their content is completely congruent with the program objective on affixes.

For such judgments, you may need to set some arbitrary criteria such as the following:

90-100% congruence between item content and content described by the program objective rates a 2

80-90% congruence rates a 1

< 80% congruence rates a 0 as not acceptable
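If it helps raters apply such cut points uniformly, the conversion can be written down explicitly. The short Python sketch below is not part of the handbook's procedure; it simply restates the example criteria above, and the 80 and 90 percent cut points are the illustrative values a local program would set for itself.

    def congruence_rating(percent_congruence):
        # Convert a judged percent congruence between a test item and a
        # program objective into the Column 4 rating:
        # 2 = very close, 1 = adequate, 0 = not acceptable.
        # The 90 and 80 cut points are the example criteria from the text.
        if percent_congruence >= 90:
            return 2
        if percent_congruence >= 80:
            return 1
        return 0

    # An item judged about 95% congruent (as sample items 1-6 are with the
    # affixes objective) would rate a 2; one judged 70% congruent rates a 0.
    print(congruence_rating(95), congruence_rating(70))   # prints: 2 0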


DIRECTIONS: In the list of words below, draw a line under each prefix or suffix. Some of the words do not have a prefix or a suffix. A worked example is given in the box.

EXAMPLE: rewrite, happy, watchful

Draw a line under each prefix or suffix.

1. dislike
2. during
3. driver
4. people
5. quickly
6. refill

DIRECTIONS: Read each group of four words below. If all four words are compound words, circle Yes. If any word is not a compound, circle No. The first two are done for you.

EXAMPLE: Inkblot, screwdriver, pigskin, notebook -- Yes No
EXAMPLE: Hammer, teamwork, keychain, enemy -- Yes No

7. Afternoon, barefoot, walking, mailed -- Yes No
8. Fireplace, football, bedtime, icebox -- Yes No
9. Bookcase, ruler, raindrop, heavenly -- Yes No

DIRECTIONS: In each box below, a word on the left makes a bigger word with one word on the right. Draw a line to connect the two words that make a bigger word. The first box is a worked example for you.

EXAMPLE:
an        noon
after     fly
any       light
butter    body
flash     other

10-14. [Boxes of word pairs in the same matching format; legible words include eyebrow, grapefruit, and doorknob.]

Figure 5. Sample test items


DIRECTIONS: Read the following sentences.

    The next morning the two men came back for Brown Pet. Jack and Nancy ran to the barnyard. They wanted to tell the cow good-by. Mr. Stone said, "Your pet will be happy at the zoo."

If the sentence below could be true, check A. If the sentence is probably false, check B. If you can't say whether it is true or false, check C. The first question is done for you.

EXAMPLE: The men were going to take Brown Pet away.
    a. Probably true (checked)
    b. Probably false
    c. Can't say

33. Brown Pet was in the barnyard.
    a. Probably true
    b. Probably false
    c. Can't say

34. The men were taking Brown Pet to the zoo.
    a. Probably true
    b. Probably false
    c. Can't say

35. The men came for Brown Pet in the morning because it would take all day to get to the zoo.
    a. Probably true
    b. Probably false
    c. Can't say

Figure 6. Sample test items*

*Adapted from the Behavioral Objectives and Test Items bank, Glen Ellen, Illinois.


The issues of item format and item solution processes are illustrated by comparing the second program objective on compound words in the worked example with items 7-9 and 10-14 in Figure 5. The program objective reads as follows:

    Compound words: Pupils will complete compound words by matching words in a left column with words in a right column.

Items 10-14 fit that description. But items 7-9 present lines of four words and ask the student to circle Yes or No for each line. The latter format is different from the one used in the program and probably much less familiar.

Item format often affects the mental processes which a pupil must use for coming up with correct answers. In items 7-9, pupils need to be able to understand the concept of all four words and to keep it in mind while reading the words. They also need to break down each word in items 7-9, sometimes more than once, and judge whether each part is a real word:

    fi - replace
    fire - place

Some of the parts are real words and others are not. A student who uses an efficient method for doing these problems analyzes each word in the item until (s)he finds a non-compound. On finding a non-compound, (s)he will circle No and go to the next item directly. If all of the words in the item are compounds, the test taker circles Yes and goes on.

In contrast, the processes for solving items 10-14 involve remembering a word on the left, building possible compounds out of it with words on the right, judging each possible compound, continuing until a compound is recognized, and repeating the process until all of the words on the left are used.

If the difference between the program objective and the content/format/process of items 7-9 will interfere with your pupils' using their program skill to answer those items, assign a congruency rating of 1 or 0, depending on whether you judge the items to be acceptable reflections of the objectives, or unacceptable. In Figure 4, the differences in format and processes between sample items 7-9 and the program skill on compound words were judged to be unacceptable. Record the rating for each of the items in the column for Step 8 on the line where the respective items are indexed.

A second example of a difference between a program objective and a tested one occurs with the sample items on inferential comprehension in Figure 6. The program objective asks for stories which are about three paragraphs long. The items use a text which is rather short. If you think that the difference does not really change the objective, then you will want to assign a rating of 2 (very close) to the items and record it in the column for Step 8 on the lines where the respective items are indexed. If the difference in program and test text length does change the objective somewhat, then assign and record a lower congruency rating.

STEP 9. Rate the appropriateness of the difficulty of each test item, and record the ratings in Column 5.

The last judgment of test items involves rating the appropriateness of each item's level of difficulty. Difficulty judgments are expressed on a two-point scale where 0=too hard or too easy, and 1=acceptable. These judgments are then recorded in the fifth column of the worksheet. It will help in making these judgments to ask yourself these questions:

Is the item so easy that students who are unskilled on the program objective will answer it correctly much of the time?

Is the item so difficult that students who have mastered the program objective will miss it much of the time?

Whenever the answer is yes, the item should get a zero rating. For all such items, draw a horizontal line through the next column to the right.

As in Step 8, these judgments require you to study the test items. If it proves hard to separate judgments of item difficulty from those of format and content (Step 8), then this fifth column can be eliminated and the overall task simplified by one step. Teachers and curriculum specialists who are very familiar with your program as it is actually taught will be able to make these two types of judgments simultaneously with confidence. Anyone who is not intimately acquainted with the operational curriculum will have trouble with the process.

An alternative to judging items' difficulty is to use the test publisher's field test data. This option is open only for tests which give item difficulty figures based on the responses of an appropriate comparison group of pupils.

STEPS 10 and 11. For each program objective that has any acceptable test items, multiply the ratings in Columns 3, 4, and 5 for each item; record the products in Column 6. Then add all of the products, and record the sum at the bottom of Column 6 and in Box A.

A total rating for each test item is now reckoned by multiplying the importance value of the respective objective (Column 3) by the item's ratings for curricular match (Column 4) and difficulty (Column 5). Items getting unacceptable ratings in Columns 4 or 5 will already have been lined out in Column 6.

The numbers in Column 6 are summaries of the test raters' judgments about the importance, curricular relevance, and difficulty of the objectives covered by a test. These numbers range in possible value from 1 to 6. A rating of 6 would be received by a test item that:

Measures a very important program objective (rated 3 in Column 3)

Matches the objective closely in content and format (rated 2 in Column 4)

Has an acceptable level of difficulty (rated 1 in Column 5)

The overall rating for such an item then comes from multiplying across the form, 3 x 2 x 1 = 6, and is entered in Column 6.

After multiplying the ratings and recording them in the sixth column, check your arithmetic. Then add the numbers in this column, and record the sum at the bottom of the column. Also, write it in Box A as the numerator.
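The arithmetic of Steps 10 and 11 can also be checked mechanically. The following Python sketch is only an illustration, not part of the rating form itself; the three example rating triples are hypothetical and simply show how lined-out items (a 0 in Column 4 or Column 5) drop out of the Box A sum.

    def combined_judgment(importance, match, difficulty):
        # Column 6 entry for one item: the product of the objective's
        # importance (1-3, Column 3), the item's match rating (0-2, Column 4),
        # and its difficulty rating (0-1, Column 5). A zero product means the
        # item was lined out and contributes nothing to Column 6.
        product = importance * match * difficulty
        return product if product > 0 else None

    # Hypothetical ratings for three items (Column 3, Column 4, Column 5):
    ratings = [(3, 2, 1), (2, 1, 1), (2, 0, 1)]
    column_6 = [combined_judgment(*r) for r in ratings]
    box_a_numerator = sum(v for v in column_6 if v is not None)
    print(column_6, box_a_numerator)   # prints: [6, 2, None] 8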


STEP 12. Record the number of acceptable test items in Box C.

As a step toward finding the proportion of the test's items which are relevant to your program, count the number of acceptable items. These items are the ones which were not lined out in Column 6 (Step 10). In other words, count the number of numbers in Column 6, and record it as the numerator in Box C on the last page of the worksheet.

STEPS 13 and 14. Compute the summary indices, and use them to compare tests' congruence with your curriculum.

To summarize a test's curricular relevance, three indices are computed: the Grand Average, Index of Coverage, and Index of Relevance. The Grand Average, which may range in value from 0 to 6, describes the average, per test item, of the combined judgments of importance (Step 4), curricular match (Step 8), and item difficulty (Step 9). Compute the Grand Average by dividing the result of Step 11 by the total number of items on the test (Step 7). Record this number in Box A on the final page of the worksheet.

The Grand Average for a single test takes on meaning when compared with the same figure for other tests. The one test with the highest Grand Average does a better job of covering more of the important program objectives. This one comparison still does not indicate whether the highest rated test covers the program well enough. That judgment is aided by two other statistics on the worksheet, the Index of Coverage and the Index of Relevance.

The Index of Coverage tells how completely a test covers the program objectives listed in the first column. It is derived by dividing the number of objectives in Column 1 (Step 3) into the number of those objectives which the test measures adequately. Adequacy of measurement is determined by two factors: the number of test items per objective and their goodness of match to the objective. Test raters will have to use their discretion in deciding whether the number of items measuring an objective is sufficient. This decision, however, will be guided by the intended use of the test. One or two good items per objective might be enough for a survey test, but eight to ten might be a minimum for a battery of tests for monitoring progress. In counting items per objective, count only the ones which have an acceptable match with the program objective, that is, which get a numerical rating in the sixth column of 1 or higher.

While the Grand Average is based on test items, the Index of Coverage is based on numbers of objectives: the proportion of program objectives (Column 1) that are adequately measured. Its possible values range from a low of zero to a high of 1.0. If the value of the Index of Coverage for one test is .6, then 40% of the program objectives to be tested are not covered by the test. For tests that differ very little on the Grand Average, the one with the highest Index of Coverage would be preferable. This summary statistic is recorded in Box B.

The last summary figure for comparing tests is the Index of Relevance, which tells what proportion of the test is sufficiently relevant to your program. It is computed by dividing the total number of items on the test (Step 7) into the number of items that adequately match the program (Step 12). Those items are the ones that receive a numerical rating of 1 or higher in the sixth column of the rating form.

The Index of Relevance has possible values ranging from zero (totally unresponsive to the local program) to 1.0 (all of the test items are adequate measures of program objectives). On a test with a relevance rating of .75, a quarter of the items measure objectives that are either not part of your curriculum or are not at the right level of difficulty.

This third factor is important because selecting a test with a large percentage of items that are not relevant to your program means paying, both in time and money, for test materials that work against you. Your students may do poorly on objectives in the test which do not match your program, and the test results will not be very helpful for assigning program lessons. Enter the Index of Relevance figure in Box C.
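For readers who prefer to see the three computations side by side, here is a minimal Python sketch. It is not part of the handbook's worksheet; the input figures are simply those of the worked example in Figure 4 (a Column 6 sum of 88, 41 test items, 7 of 10 objectives adequately measured, and 26 acceptable items).

    def grand_average(sum_of_column_6, total_test_items):
        # Box A: average combined judgment per test item, ranging from 0 to 6.
        return sum_of_column_6 / total_test_items

    def index_of_coverage(objectives_adequately_measured, total_program_objectives):
        # Box B: proportion of program objectives the test measures adequately.
        return objectives_adequately_measured / total_program_objectives

    def index_of_relevance(acceptable_test_items, total_test_items):
        # Box C: proportion of the test's items that adequately match the program.
        return acceptable_test_items / total_test_items

    # Figures from the worked example (Figure 4):
    print(round(grand_average(88, 41), 1))       # 2.1
    print(round(index_of_coverage(7, 10), 2))    # 0.7
    print(round(index_of_relevance(26, 41), 2))  # 0.63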

Each of the three summary figures gives a different piece of information about a test. Since they are based on different types of information, it would not be meaningful to add them for a single summary judgment. The final choice of a single test will be based on a comparison, across several tests, of each of the summary figures. To facilitate this comparison, enter the three summary figures in the spaces provided at the top of the first page of the worksheet.

Other useful kinds of information can be derived from the CSE Test Relevance Rating Form. For example, the average importance of program objectives not covered in a test could be reckoned and compared as a supplement to the other three summary measures. Also, the entries in the sixth column of the worksheet can be used to guide the scoring and reporting of pupils' responses to a test. Items which are identified before the testing occurs as program-irrelevant can later be omitted from the analysis of scores. Total test scores could be reported, if required by higher authority, but the customized, program-relevant scores would provide an important context for interpreting the total scores.

ON INCREASING THE RELIABILITY OF THESE METHODS

The basis of the methods given in this chapter is human judgment, not precise physical measurement. These methods are an aid to judgment and memory, not an error-proof mechanism for measuring tests. Since the choice of tests is a social/political one which depends on knowledge of curriculum and pupils, it cannot be completely automated. These methods reduce the unreliability of judgment by providing some uniform rating scales (namely, importance of objectives, congruence of items with objectives, and difficulty of items) and uniform cutting points or criteria along these scales. Furthermore, the individual ratings are recorded as they are made and are combined in a uniform manner, rather than left unrecorded to be combined in an impressionistic and forgetful manner.

The users of these methods can increase their reliability further by several means. First, it will help to give the test raters some practice before having them do an operational comparison of tests' curricular relevance. A part of your program curriculum may be used, for familiarization, along with a real test. Next, it will help for test raters to discuss with one another the judgmental scales for the purpose of encouraging uniformity in applying the cutting points to the scales.

Third, it is important to have each level of a test rated independently by more than one person. Where two or more raters disagree, they may resolve their differences, or they may decide that they have well-founded differences of judgment and split the differences. A final, and essential, method for increasing the reliability of ratings is to have the job done for pay, not on your staff's time off. These methods are labor, not play, and they are a part of making your program function better.

Although the procedures in this chapter are detailed, they are easier to carry out than to read about. They are intended as a flexible prototype to be adapted to local needs and resources. The attention to detail will be rewarded by your choice of a test that comes closest to meeting your needs.


APPENDIX A
Resources for Developing CRTs Locally and for Purchasing Made-to-Order CRTs

Many school districts undertake to write their own test batteries to ensure that their unique testing needs will be met. There are several types of resources which can make local development of objectives-based tests feasible, if not easy.

The first are reference works on methods of item and test construction. These books are not specifically on criterion-referenced testing, but they are a great help in writing good test items. Books in Print lists such sources under subject headings like "educational tests and measurement." Second, there are works on creating CRT materials, a number of which are listed below.

Next, there are lists of objectives around which a curriculum, a continuum, or a testing battery may be built. Comprehensive sets of objectives in a variety of subject areas are sold separately from test materials by various publishers such as Commercial-Educational Distributing Services, Instructional Objectives Exchange, and Westinghouse Learning Corporation. In addition, many school districts have prepared curriculum guides or objectives lists which are uncopyrighted. A small sampling of these is included below.

The fourth resource for local test development is item banks, that is, pools of existing test questions. Along with the objectives lists, objectives-based item banks are listed below. The ones listed here are in the public domain and thus may be reproduced or modified locally. Even when the pre-existing materials are used only as models, they save much of the labor involved in thinking of possible objectives, selecting formats for test items, and developing distractors. The item banks listed in this Appendix are not included in the test reviews of this book because they did not meet all of the screening criteria.

Finally, there are publishers who provide made-to-order CRTs for purchase.


Sources on How To Develop CRT Materials

Baker, E. L. Beyond objectives: Domain-referenced tests for evaluation and instructional improvement. Educational Technology, 1974, 14(6), 10-16.

Gronlund, N. E. Preparing criterion-referenced tests for the classroom. New York: Macmillan, 1973.

Hambleton, R. K., & Eignor, D. R. A practitioner's guide to criterion-referenced test development, validation, and test score usage (2nd ed.), 1979. [Until these materials are published commercially, they are available from the Clearinghouse for Applied Performance Testing, Northwest Regional Educational Laboratory, 710 S.W. Second Avenue, Portland, Oregon 97204.]

Popham, W. J. Criterion-referenced measurement. Englewood Cliffs, NJ: Prentice Hall, 1978.

Roberson, D. R. Development and use of criterion-referenced tests. Austin, TX: Educational Systems Associates, 1975.

Sherman, M., & Zieky, M. (Eds.). Handbook for conducting task analysis and developing criterion-referenced tests of language skills. Princeton, NJ: Educational Testing Service, 1974.

Sullivan, H. J., Baker, R. L., & Schutz, R. E. Developing instructional specifications. In R. L. Baker & R. E. Schutz (Eds.), Instructional product development. New York: Van Nostrand Reinhold, 1971.

Sweezey, R. W., & Pearlstein, R. B. Guidebook for developing criterion-referenced tests. ERIC Document TM 005 377, 1976.

Tombari, M., & Mangino, E. How to write criterion-referenced tests for Spanish-English bilingual programs. Austin, TX: Dissemination and Assessment Center for Bilingual Education, 1978. [Write DACBE, 7703 N. Lamar Blvd., Austin, Texas 78752.]


Objectives Lists and Banks of Objectives-Based Test Items

Behavioral Objectives and Criterion-Referenced Test Items in Mathematics, K-6.

Uniondale Public Schools, Uniondale Union Free School District, Uniondale, New York 11553.

For each of the grades, there are two pamphlets, one with 80 or more objectives, the other with an item bank for testing those objectives.

Cost: cost of copying complete set approximately $20.00.

Behavioral Objectives and Test Items:

Language Arts (ERIC numbers ED 066 498 through 501)
Mathematics (ERIC numbers ED 066 494 through 497)
Social Studies (ERIC numbers ED 066 502 through 504)
Science (ERIC numbers ED 066 505 through 508)

Institute for Educational Research, 793 N. Main Street, Glen Ellen, Illinois 60137.

A bank of approximately 5,000 objectives and 27,000 accompanying test items was written by Chicago elementary and secondary school teachers in the course of their participation in workshops in the writing of behavioral objectives and test items. Objectives and items in each of the four content areas are available for primary, intermediate, junior high, and high school levels. A volume on measuring students' attitudes (ED 066 493) and an operational guide to the workshops (ED 066 492) are also available. Parts of the materials are also available through the Objectives and Items CO-OP, listed below. The Institute for Educational Research expects to have revised materials in the areas of math and language arts available in the fall of 1979 for purchase.

Behavioral Objectives Curriculum Guide, Mathematics, Grade 7.

Bucks County Public Schools, Routes #611 and #313, Doylestown, Pennsylvania 18901.


A framework for the development of a seventh grade mathematics program. The guide includes over 160 behavioral objectives with an assessment item and estimated learning time for each objective at three levels of difficulty.

Cost: $3.00.

Individualizing Mathematical Learning in the Elementary Schools:

An Ordered List of Mathematical Objectives, K-8
Test Items for Primary Mathematics
Test Items for Intermediate Mathematics

CCL Document Service, 1025 W. Johnson St., Madison, Wisconsin 53706.

Approximately 200 mathematics objectives are available along with 400 to 500 sample test items keyed to the objectives. The items and objectives were developed at the Wisconsin Research and Development Center at the University of Wisconsin in cooperation with the Wisconsin Department of Public Instruction. Although the test items have been out of print since mid-1976, single copies will be made upon request.

Cost: objectives, $1.00; primary items, $7.30; and intermediate items, $12.65.

Junior High Unified: Sequencing and Keying of Unified Studies; Test Specifications for Criterion-Referenced Testing; Achievement-Awareness Record for Language Arts. ERIC Document ED 116 193.

ERIC Document Reproduction Service, P.O. Box 190, Arlington, Virginia 22210.

This language arts curriculum guide for grades 7-9 was developed by the Shawnee Mission (Kansas) School District. It includes 50 objectives with sample test items on composition. Objectives without sample test items are given in the following areas (number of objectives in parentheses): syntax (81), listening and viewing (20), literature and reading (24), and speaking (18).

Cost: $6.01 for hard copy plus 66¢ postage.


Managing Readin' by Objectives

El Dorado County Office of Education (attn: Curriculum Clerk), 337 Placerville Drive, Placerville, California 95667.

This is a reading skills management system developed by teachers and district staff in El Dorado County, California. The 1971 edition is a revision that was based on teachers' classroom experience with the system. The biggest component is a bank of over 10,000 items keyed to over 600 objectives in the following four skill categories: language development (oral and written language, vocabulary), word analysis (sight words, phonics, morphology), comprehension (ten types), and study skills (twelve subareas). Items are divided into eleven levels from pre-reading through grade 8. Many of the individual tests will have to be recopied before duplicating, for example, where fill-in items are already filled in with the correct answer. Single copies of the item bank are sold as well as the optional resources for a complete testing and accountability system listed below. The manual includes informal measures for diagnosis.

Cost: The complete bank of Criterion Questions for all levels is $23.50. The manual/kit is $6.00; record sheets are 10¢ for individual pupils and 30¢ for the class chart; U-sort task cards are $36.00 per all-level set.

Mathematics Assessment Process Handbook of Objectives, K-9, 1973-74.

Greece Central School District, Greece, New York 14616. Available from the Clearinghouse for Applied Performance Testing, Northwest Regional Educational Laboratory, 710 S.W. Second Avenue, Portland, Oregon 97204.

Described as a minimum skill component of the total district mathematics curriculum, this system contains guidance on classroom management plus over 200 mathematics objectives, each with a sample item.

Cost: $13.40.


The Objectives and Items CO-OP:

Language Arts
Mathematics
Social Studies
Science
Vocational Education

The CO-OP, 413 Hills House North, University of Massachusetts, Amherst, Massachusetts 01002.

The CO-OP has collected over 10,000 objectives and 40,000 items for elementary and secondary school levels in 47 booklets developed by a number of school systems and state education departments for their own use. For example, the mathematics materials include those of Project SPPED developed for the New York State Education Department. Parts of the Behavioral Objectives and Test Items, listed above, are included in the CO-OP's materials. These booklets are described as varying in comprehensiveness and have not been edited by the CO-OP.

Cost: $1.00 to $58.00 per booklet; complete sets by content area--language arts, $141.50; mathematics, $384.50; social sciences, $58.00; science, $77.00; vocational education, $29.00.

Phoenix Minimal Objectives:

Minimal Mathematics Objectives, K-12, 1975
Proposed Minimal Reading Objectives, K-12, 1974
Proposed Minimal Writing Objectives, K-12, 1974

Curriculum and Instructional Development Services, Greater Phoenix Curriculum Council, 2526 W. Osborn Rd., Phoenix, Arizona 85017.

This system contains over 300 basic skills objectives, each with one or more sample items or suggestions for the writing of assessment items or tasks. Broad performance tasks, involving several skills, are frequently suggested to evaluate mastery of writing objectives.

Cost: mathematics, $2.00; reading, $6.00; and writing, $2.50.


Sample Assessment Exercises Manual for Proficiency Assessment:

Volume I: Sample Exercises
Volume II: Item Statistics for Grades 7, 9, and 11
Technical Assistance Guide for Proficiency Assessment

Cashier, State Department of Education, 515 L Street, Sacramento, California 95814.

The first volume gives item specifications and a pool of about 1500 test questions for three models of proficiency assessment: school context (reading, writing, and math), functional transfer (forms, maps, ads, directions, and measures), and applied performance. Volume II gives item statistics for most of these items along with a description of the field test and directions for reading and using the statistics. The Technical Assistance Guide has a variety of resources for setting up a proficiency assessment program.

Cost: $54.00 for Volumes I and II together. No charge for Guide.

Names and Publishers of Made-to-Order CRTs

Comprehensive Achievement Monitoring (CAM)
National Evaluation Systems
P.O. Box 226
Amherst, Massachusetts 01002

Customized Criterion-Referenced Tests
Multi-Media Associates, Inc.
P.O. Box 13052
4901 E. Fifth Street
Tucson, Arizona 85732

Customized Objective Monitoring Service
Houghton Mifflin Company
777 California Avenue
Palo Alto, California 94304

IOX Test Development Service
Instructional Objectives Exchange (IOX)
P.O. Box 24095
Los Angeles, California 90025

Mastery Custom Tests - Reading and Math
Science Research Associates
259 East Erie Street
Chicago, Illinois 60611

ORBIT
CTB/McGraw-Hill
Del Monte Research Park
Monterey, California 93940


APPENDIX B
Sources of Other Test Reviews

Buros' Mental Measurements Yearbook (The Gryphon Press) is the most familiar source of test reviews. Recently a number of other books specializing in reviews of educational tests have been published. Among these, the following three were funded by the National Institute of Education:

Hoepfner, R., et al. CSE secondary school test evaluations. Los Angeles: Center for the Study of Evaluation, 1974.

Hoepfner, R., et al. CSE elementary school test evaluations. Los Angeles: Center for the Study of Evaluation, 1976 (2nd edition).

Pletcher, P., Locks, N., Reynolds, D., & Sisson, B. A guide to assessment instruments for limited English speaking students. New York: Santillana, 1978.

The first two of these volumes deal exclusively with norm-referenced tests. Two other test review books were funded by the Office of Education, namely:

Tests of adult functional literacy. Portland, OR: Northwest Regional Educational Laboratory, 1975.

Assessment instruments for bilingual education. Los Angeles: National Dissemination and Assessment Center at California State University, Los Angeles, 1978.

A number of professional journals also carry reviews of tests that are relevant in educational settings:

Bilingual Resources
Educational and Psychological Measurement
Journal of Counseling Psychology
Journal of Educational Measurement
Journal of Special Education
Review of Educational Research

While the present volume was in preparation, articles comparing and evaluating specific criterion-referenced tests began to appear (Denham, 1977; Stallard, 1977a, 1977b; Hambleton and Eignor, 1978). Articles of this nature are indexed under appropriate subject, author, and title headings in Resources in Education and Current Index to Journals in Education.


APPENDIX C
Glossary

This section gives definitions for most of the technical terms used in this book. The focus is on basic terms dealing with criterion-referenced testing, many of which are relevant also to norm-referenced testing. The definitions are designed to introduce basic concepts in a non-technical manner.

Tests exist for a multitude of objectives, traits, and behaviors. In this glossary, we use the summary phrase "test of a skill or attitude" to indicate a test of anything from maximum performance (such as knowledge, skill, or achievement) to typical performance (such as attitude or trait).

Absolute score: A test score reporting the number or percentage of items correctly answered (cf. comparative information).

Alternate form: A second version of a test with the same format, content, and difficulty as the first version, but with different test items. Tests with alternate forms may be useful for assessing learning with a pretest-posttest procedure. Pupils' scores on a second testing are more valid when a second form is used because those scores are less influenced by students' specific memory for the content of the form used for the first testing.

Amplified objective: A form of test specification which consists of a behavioral objective, a sample test item, a description of possible item format, and a description of content that may be included in item stems and responses.

Assessment: The measurement of a thing's quality, amount, or effectiveness, such as an assessment of a student's learning.


Behavioral objective: A statement, usually in the following form,

    Given (specific materials), students will (perform specified responses),

which describes an outcome of instruction in terms of a testing situation. It is called a behavioral objective because it describes not only the test content or subject matter, but also the observable behavior which the student is supposed to exhibit in responding. Examples of observable behavior--

    select from multiple choice alternatives,
    write a 300-word essay,
    repeat aloud--

contrast with non-observable behaviors which are typical of more general educational goals--

    know, understand, appreciate, solve.

Bias: A flaw in test construction which causes the test scores to be unfairly influenced by the test takers' experience outside the classroom or by traits that are not responsive to experience in school.

Comparative information: Information that helps to interpret individual or group test scores by comparing them with the scores of other test takers. Some of the types of comparative information are percentiles, grade level equivalents, and scores of criterion groups.

Conceptual validity: A term coined for this book which refers to aspects of a criterion-referenced test's validity that are not determined by field testing. These include the quality of the test specification, the match between the items and their specifications, and representativeness of the items.


Concurrent validity: The validity of a test whose scores correlate highly with contemporaneous criterion behaviors (cf. criterion [c]). For example, a pencil and paper test of skills in auto repair has concurrent validity if pupils who earn higher scores on it also are more proficient in the criterion skill of repairing autos.

Confidence interval: A statistical estimate of the interval within which a score, if it were error-free, would probably fall. Interval estimates contrast with single point estimates, such as the average, and are assigned probabilities which are called "levels of confidence."

Consistency: A general term used in this book for the various types of reliability. The term is used rather than the term reliability to call attention to the fact that measurement specialists disagree on the usefulness of traditional reliability statistics for criterion-referenced tests.

Construct validity: When a test is purported to measure a construct (i.e., a trait, intellectual process, or other unobservable characteristic of test takers), and it does so, it has construct validity.

Content validity: The term used to describe the efficacy of a test which measures the content or subject matter it is intended to measure. This type of validity is usually confirmed by the judgment of subject matter specialists who examine the test specifications and test items (cf. descriptive validity).

Correlation: The degree and direction of linear relation between two variables. Positive correlations describe direct relations and negative ones describe inverse relations. The degree of relation increases as the numerical value of the correlation statistic departs from 0 and approaches +1.0 or -1.0.

Correlation coefficient: The number that describes a correlation.

Criterion [a] In this volume and in many writings on CRT,the pool of potential test items measuring thesame skill, objective, or attitude from which

a25

Page 215: DOCOMENT RESUME - ERIC · v. DOCOMENT RESUME ED 186 457 TN BOO 146. AUTHOR. walker, Clinton R: And Others. TITLE. CSE Criterion-Referenced Test Handbook. INSTITUTION California Univ.,

the actual items on a CRT are a sample. Thelarger set of possible items which the givenitems represent.

[b] In other contexts, the cutting score orpassing score. This general meaning of the termas a synonym for stamiord is misleading becausestandards may be expressed in absolute terms(e.g., 80% correct) or in comparative terms(e.g., 80th percentile). The latter is a norm-referenced standard, not a criterion-referencedone.

[c] The "real world" behavior or state whichsome types of test are designed to reflect. Forexample, a multiple choice test of compositionskills is intended to identify the test taker'sskills on the criterion of actual writing. Acollege entrance exam is meant to identify thefuture criterion of success in college.

Criterion group A well defined group of test takers whose typical score serves as a standard of comparison for other students' scores. For example, state science fair medalists might be a criterion group whose typical scores on a science test could serve as a standard of high achievement. A random sample of students from the population for which a test is intended would be a criterion group whose typical scores could serve as a standard of average performance.

Criterion-referenced test (CRT) A test designed so that the test items are "referenced to," or measure, the specific behaviors described in the criterion. The items for CRTs are supposed to be a representative sample of the criterion. CRTs are intended to show the extent to which a student possesses a particular skill or attitude.

Criterion validity A high correlation of scores on a test with a criterion (cf. criterion [c]) for which the test is supposed to be an indicator. Predictive and concurrent validity are types of criterion validity.

Curriculum cross-reference An index in which the items of a test are keyed to the pages or sections in published instructional materials that cover the same skills.


Cutting score The score which serves as a dividing line between categories of achievement such as mastery and non-mastery or passing and not passing.

Decision rules The rules for interpreting test scores in terms of categories of achievement, such as mastery/uncertain/non-mastery.
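
A decision rule can be as simple as comparing a raw score with one or two cutting scores; the sketch below uses hypothetical cut points on a ten-item test, not cut points taken from any reviewed instrument:

    def classify(score, non_mastery_cut=6, mastery_cut=8):
        """Apply a two-cutting-score decision rule (scores out of 10 items)."""
        if score >= mastery_cut:
            return "mastery"
        if score < non_mastery_cut:
            return "non-mastery"
        return "uncertain"

    print([classify(s) for s in [4, 6, 7, 9]])
    # ['non-mastery', 'uncertain', 'uncertain', 'mastery']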

Description See test specifications.

Descriptive validity A term used to describe the efficacy of a test whose items accurately reflect the content, behavior, and format called for in its specifications. The specifications are then a valid description of the items or tasks.

Diagnostic test A test that is designed to give information about a test taker's specific strengths and weaknesses within a subject area.

Discriminating power The degree to which a test item distinguishes test takers who get high total scores on the test from those who get low total scores. Items are selected for norm-referenced tests so as to have high discriminating power.

Divergent validity The validity that a test has when it measures the intended skill or attitude without being much affected by other, irrelevant skills, attitudes, or factors. For example, a math test lacks divergent validity if students' scores are greatly affected by the reading level of word problems. A test of reading comprehension lacks divergent validity if its scores are heavily influenced by pupils' general factual knowledge.

Domain [a] The population of possible test items or tasks from which actual test items are sampled (cf. criterion [a]).

[b] In other contexts, such as "cognitive domain" or "reading domain," the term refers to the general curricular area.

Domain-referenced test A test that is designed so that test items are "referenced to," or measure, an individual's mastery of the population of tasks in a domain. Such a test yields information about the proportion of tasks within the domain that the test taker has mastered.

Domain specification A form of test specification which describes in detail the characteristics of the total pool of potential items for measuring a specific skill or attitude. It is a technical document that deals with details of test content construction such as characteristics of distractors, rules for scoring, and rules for sampling items from the domain.

Factor analysis A variety of statistical methods for identifying the distinct factors (e.g., abilities or traits or interests) that are reliably measured in a set of tests given to the same test takers.

False negative The error of deciding that a student does not have mastery knowledge when (s)he actually does, i.e., failing to pass a deserving student.

False positive The error of deciding that a student has mastery knowledge when (s)he actually does not, i.e., passing an undeserving student.

Field test A tryout of a test in the actual conditions under which it will be used. Information from such tryouts is used to improve the test and establish norms and validity.

Grade equivalent score A form of derived score for NRTs which is supposed to tell, for any raw score, the grade level, in years and months, for which that raw score is the national average. Owing to the misleading nature of grade equivalent scores, the professional test standards (APA, 1974) discourage the use of grade equivalents.

Individualization Designing instruction to meet the particular needs of the individual student. Criterion-referenced measurement is useful for individualizing because it facilitates identification, by objective, of individual students' strengths and needs.

Inter-item correlation The correlation among items on the same test, taken to show the degree to which the items are measuring the same thing.


Item An individual task or question on a test.

Item analysis The process of looking at students' scores on test items to determine such things as the items' difficulty levels and consistency in discriminating between high and low scorers. Items are analyzed for the purpose of identifying those which are good and those which are poor.
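
Two statistics that commonly come out of an item analysis are an item's difficulty (proportion answering correctly) and its discrimination (how much better high scorers do on the item than low scorers). The sketch below uses invented right/wrong (1/0) data, purely for illustration:

    def item_difficulty(responses):
        """Proportion of test takers answering the item correctly (1/0 data)."""
        return sum(responses) / len(responses)

    def item_discrimination(responses, totals):
        """Difficulty gap between the top and bottom halves on total score."""
        order = sorted(range(len(totals)), key=lambda i: totals[i])
        half = len(order) // 2
        low = [responses[i] for i in order[:half]]
        high = [responses[i] for i in order[-half:]]
        return item_difficulty(high) - item_difficulty(low)

    item = [1, 0, 1, 1, 0, 1, 1, 0]      # one item's scores (invented)
    total = [9, 3, 8, 7, 2, 9, 6, 4]     # each student's total test score
    print(item_difficulty(item), item_discrimination(item, total))   # 0.625 0.75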

Item by group interaction The case where the items on a test which are hardest for one group of test takers (e.g., one race or one gender) are different from the items which another group finds hardest. A form of evidence for bias in the test.

Item form A type of test specification which states in complete detail the properties of items on a test. It does so by laying out a frame of text which is to be constant for all of the test items, then specifying the variable values that may go into specified slots in the frame, and rules for selecting among the possible variable values. An item form includes the instructions or additional information given to the test taker and describes the appropriate answer method. It also defines the correct responses.

Item generation The process of constructing test tasks, items, or questions.

Item-objective congruence The type of validity based on evidence that a test's items are consistent with its specifications.

Item uniformity The characteristic a test exhibits when all test items measure a uniform, coherent skill or attitude (when the skill or attitude itself is uniform). Item uniformity is determined by factor analysis, inter-item correlations, and item-test correlations.

Level Age or grade placement for which a test is designed.

Mastery score The score on a particular test which indicates that a test taker has reached a predetermined level of proficiency.


Mastery test A test designed to determine the extent to which test takers have learned or become proficient in a given unit, concept, topic, or skill.

Norms One type of comparative information for interpreting norm-referenced tests. Norms are usually given in the form of percentiles. They describe the ranking of each possible score among the students who were in the test's field tryouts, but do not indicate the absolute degree of skill or mastery that is exhibited by the scores.

Norm-referenced test In achievement testing, a test that is designed to survey the skills and knowledge common to most educational programs. This type of test yields information about how individual test takers' scores compare with the scores of the others who have also taken the test and provides only a very general description of the skills or attitudes being measured.

Objectives-based (or objectives-referenced) test A test designed so that the items assess specified objectives for the purpose of making a mastery/non-mastery decision about the test taker.

Percentile A number which indicates the percentage of scores which fall below a given test score. For example, a test taker in the 95th percentile scored higher than 95% of the students in the norm group. Small differences in raw scores sometimes make large differences in percentile ranking, especially in the middle percentiles. Percentiles thus should not be taken as a direct or absolute measure of learning.
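
As an illustration with an invented norm group, a percentile rank is simply the percentage of norm-group scores falling below a given raw score:

    def percentile_rank(score, norm_scores):
        """Percentage of norm-group scores that fall below the given score."""
        below = sum(1 for s in norm_scores if s < score)
        return 100 * below / len(norm_scores)

    norms = [12, 15, 15, 18, 20, 21, 22, 22, 23, 25,
             26, 27, 28, 30, 31, 33, 35, 36, 38, 40]   # invented norm group
    print(percentile_rank(30, norms))   # 65.0: higher than 65% of the norm group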

Practice effect The change in a test taker's score that is due to previous experience with the same or similar test rather than to a change in the skill or attitude to be measured.

Prescriptive Suggesting materials or activities for teaching and learning particular skills.

Program-embedded test A test that either is not sold apart from a body of curricular materials or that refers so closely to specific curricular materials that it would be unsuitable for testing students who had used other texts, practice exercises, etc.

r A symbol that stands for correlation coefficient.

Random sample A sample that is drawn from the total population (of students or schools or test items) so that every member of the population has an equal chance of being selected. This procedure is used to avoid bias in selecting the sample.

Reference group A well defined group whose scores are used as a standard of comparison.

Reliability The stability or consistency with which a test measures a skill or attitude. Absence of incidental fluctuations in score. Several types of reliability are distinguished: consistency of individuals' scores from one occasion to another (test-retest); consistency from one form of a test to another (alternate forms); and consistency among the items themselves (internal consistency or split half). Either the total test scores or the instructional decisions based on the test scores may be studied for their reliability.

Response materials The materials a test taker uses for recording answers to a test (e.g., test booklets, answer sheets).

Response mode The answer form a test requires (e.g., multiple choice, true-false, short answer, essay).

Response spaces Places provided on a test form or answer sheet for recording answers.

Sample item A sample test question given as part of the instructions to students to show them how to take the test.

Sampling plan (sampling rule) The selection procedure that is followed to ensure that a sample represents the total group from which it was drawn.

Sensitivity to learning A test's ability to detect an increase in the test taker's knowledge or skill.


Social fairness The quality a test exhibits when test content does not stereotype or disparage any social group (i.e., any race, language group, gender, etc.).

Specifications See test specifications.

Specimen set A collection of test materials that serve as a sample of the complete test package. Many publishers sell these materials to enable test users to decide whether to buy the entire testing system.

Standard [a] A degree or amount of quality, excellence, or attainment.

[b] A basis of comparison.

Standardized test [a] A norm-referenced test.

[b] A test that has been designed so that all testees take the test under similar conditions. This latter usage may lead to some confusion since it may include criterion-referenced tests, unlike meaning [a].

Statistically significant difference A difference in scores or numbers that is large enough as to be unlikely a result of mere chance.

Stem The question or stimulus part of a test item as opposed to the response choices or responses.

Stimulus The item stem and any other information, such as a graph or picture, that is used to pose the question in a test item and to elicit the response.

Stratified random sample A sample made by first dividing a population (of people or test items) into naturally occurring groups (strata), then sampling from each in proportion to its relative size.
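
A sketch of proportional stratified sampling (the strata and their sizes are hypothetical, for illustration only):

    import random

    def stratified_sample(strata, total_n):
        """Draw from each stratum in proportion to its share of the population."""
        population = sum(len(members) for members in strata.values())
        sample = []
        for members in strata.values():
            k = round(total_n * len(members) / population)
            sample.extend(random.sample(members, k))
        return sample

    # Hypothetical item pools for three objectives of different sizes
    strata = {"objective A": list(range(50)),
              "objective B": list(range(30)),
              "objective C": list(range(20))}
    print(len(stratified_sample(strata, 10)))   # 10 items: 5 + 3 + 2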

Template A scoring overlay with the pattern of correct answers perforated to facilitate hand scoring.

Test A tool for finding out how well students know a body of information, have mastery of a skill, or possess an attitude. The tool involves presenting some stimuli (or questions) to elicit responses from the students. Checklists and observation schedules are not considered tests in this context.

Test specifications The description of the set of possible items for a test and directions for sampling items from that set. This description tells what is to be measured, and how. It serves as directions to the test writer for constructing a test.

Validity A test or measure has validity if its scores mean what they are supposed to mean. There are different types of validity (cf. content, descriptive, criterion, and construct validity), each one verified in a somewhat different way.


APPENDIX D
Supplement to Chapter 3:
Example of a Domain Description Which Would Receive a Level A Rating

Domain Title

Applying principles of U.S. foreign policy.*

General Description

Given a description of a fictional international situation in which the United States may wish to act and the name of an American foreign policy document or pronouncement, the students will select from a list of choices the course of action that would most likely follow from the given document or pronouncement.

Sample Item

Directions: Below are some made-up stories about world events. Answer each question by picking a choice and writing its letter on the answer sheet.

Some Russian agents became members of the Christian Democratic Party in Chile. The party attacked the President's house and arrested him. The Russian agents set themselves up as President and Vice-President of Chile. Chile then asked to become an "affiliated republic" of the U.S.S.R.

Based on the Monroe Doctrine, what would the U.S. do?

a. Ignore the new status of Chile.
b. Warn Russia that its influence is to be withdrawn from Chile.
c. Refuse to recognize the new government of Chile because it came to power illegally.
d. Send arms to all groups in the country that swear to oppose communism.

*This domain by Clinton B. Walker is reprinted from Illustrative Test Specifications for the USDESEA Matrix of Educational Objectives. W. J. Popham, Project Director. Los Angeles: Educational Objectives and Measures, 1976.


Stimulus Attributes

1. The fictional passage will consist of 50 words or less followed by the name of a foreign policy pronouncement or document inserted into the question, "Based on the _____, what will the U.S. do?"

2. The policy named in the stimulus passage will be a document or pronouncement selected from the Domain Supplement.

3. Each passage will consist of two parts: a) a background description of an action taken by a foreign nation, and b) a statement of the action to which the foreign policy document or pronouncement is to be applied.

a. The background statement will be analogous to an historical situation which either preceded the document or pronouncement, or for which the document or pronouncement was used. For example, the Monroe Doctrine was laid down in response to European designs on American nations that were attempting to establish independence. A parallel case today might describe a European country trying to encroach on the sovereignty of such a country.

b. The statement of an action will be an action taken by a real foreign nation that conforms to one of the following categories:

1. Initiation of an international conflict.
2. Initiation of a civil conflict. This may include coups, revolutions, riots, protest marches, civil war, or a parliamentary crisis.

3. Initiation of an international relationship. This includes trade negotiations, friendship pacts, military alliances, and all classes of treaties.

4. Appeal for foreign aid to meet economic or military needs.

5. Development and stockpiling of military weapons.

4. All statements in the passage will refer to specific nations and events. Descriptions such as, "A nation is at war with another country," are not acceptable. The events described may be set in the present or past, as appropriate.


5. When the document or pronouncement mentioned in the stimulus passage is tied to a particular geographic region, countries named in the passage must belong to that region.

6. Passages will be written no higher than the 8th grade reading level.

Response Attributes

1. Students will mark the letter of one of the four given response alternatives.

2. The correct response will be a course of action that is governed by the main principles of the document or pronouncement named in the stem.

3. Response choices consist of the correct response and three distractors. Each choice will have the following characteristics:

a. Describe a specific course of action that refers to the people, nations, and actions in the stimulus passage.

b. Be brief phrases written to complete the understood subject, "The United States would . . ."

4. Distractors will be written to meet these additional criteria:

a. At least one distractor will describe an action derived from a different document or pronouncement selected from the Domain Supplement.

b. Distractors will be plausible courses of action, not fanciful.

Domain Supplement

Foreign Policy Documents and Pronouncements:

The following list of foreign policy pronouncements and documents was selected from Brockway, T., Basic Documents in United States Foreign Policy. Princeton, NJ: D. Van Nostrand Company, 1968. The documents were chosen on the basis of their historical impact or potential application to current events. The list appears in chronological order.


1. Washington's Farewell Address
2. The Monroe Doctrine
3. Webster on Revolutions Abroad
4. Open Door in China
5. The Platt Amendment
6. Roosevelt Corollary of the Monroe Doctrine
7. The Fourteen Points
8. The Washington Conference
9. The Japanese Exclusion Act
10. The Kellogg-Briand Pact
11. The Stimson Doctrine
12. Roosevelt's Quarantine Speech
13. The Atlantic Charter
14. The Connally Resolution
15. The Yalta Agreements
16. The Potsdam Agreement
17. United States Proposals for the International Control of Atomic Power
18. The Truman Doctrine
19. The Marshall Plan
20. The Point Four Program
21. The North Atlantic Treaty
22. American-Japanese Defense Pact
23. Atoms for Peace: Eisenhower's Proposal to the United Nations
24. The Eisenhower Doctrine
25. Alliance for Progress
26. Kennedy's Grand Design
27. Treaty on the Peaceful Uses of Outer Space


APPENDIX E
Available Tests That Were Screened Out of the Pool of Measures Reviewed in This Volume

CRTs that are embedded in a specific curriculum

Clues to Reading Progress
Educational Progress Corporation

Communication Skills Program
Ginn & Company

Competency Skills Test for Keys to Reading
Economy Company

Competency Skills Tests for Keys to Independence in Reading
Economy Company

Continuous Progress Laboratories
Educational Progress Corporation

Criterion Assessment Tests
J. B. Lippincott

Dale Avenue Project
Paterson (NJ) School District

Developing Mathematical Processes
Rand McNally

Developmental Syntax Program
Learning Concepts

Gaining Math Skills
McCormick-Mathers Publishing Company

Holt Basic Reading System
Holt, Rinehart, Winston

Individualized Mathematics Program
Educational & Industrial Testing Service

Individualized Mathematics System
Ginn & Company

Individualized Science Program CRTs
Imperial International Learning Corporation

Learning Staircase
Learning Concepts

Math Management System Placement Test
Clark County School District

Mathematics Around Us
Scott Foresman & Company

Mathematics Laboratory
McCormick-Mathers Publishing Company

Perceptual Skills Curriculum
Walker Educational Book Company

Progressive Achievement Tests
New Zealand Council for Educational Research

Project ACTIVE CRTs
ACTIVE, Ocean Township (NJ) Elementary School

Series m: Macmillan Math
Macmillan Company

Series r: Macmillan Reading
Macmillan Company

System 80
Borg-Warner Educational Systems

System for Teacher Evaluation of Prereading Skills
CTB/McGraw-Hill

Teaching Essential Language & Reading
Educational & Industrial Testing Service

SWRL Kindergarten Program
Ginn & Company

Tests received in response to our search, but screened out of the pool of tests to be reviewed*

Listed below are the names and publishers of tests which were screened out. The reasons for excluding each are given, keyed to the following list:

1. The test became unavailable before publication of this volume.

2. The skills measured are a usual result of maturation or general experience.

3. The test is not built around explicit objectives.

4. Items are not keyed to objectives.

5. There is only one item per objective.

6. Scores for the separate objectives are not given.

7. Scores are not interpreted in terms of proficiency or mastery.

8. The test was not designed as an objectives-based measure.

9. The test was not available to review in time for inclusion in this volume.

*No judgment about the merits of these tests is intended by their being excluded. Tests not meeting criteria 3 through 8 are not CRTs.


ACER Class Achievement Tests in Mathematics (3,8)
Australian Council for Educational Research

APPEL Test (1)
Insgroup (formerly EDCODYNE)

Assessment of Career Development (8)
American College Testing Program

Basic School Skills Inventory (5,7)
Follett Publishing Company

Boehm Test of Basic Concepts (3 or 5, 8)
Psychological Corporation

Brigance Diagnostic Inventory of Basic Skills (9)
Walker Educational Book Corporation

Cincinnati Mathematics Inventory (5)
Cincinnati Public Schools, Dept. of Research & Development

Composite Auditory Perception Test (8)
Alameda County (CA) School Dept.

Criterion-Referenced Tests for Reading and Writing in A Technology of Reading and Writing, Vol. 2 (9)
Academic Press

Delco Readiness Test (3,7)
Walter M. Rhoades

Development Test of Visual Motor Integration (2,8)
Follett Publishing Company

Diagnostic Skills Battery (9)
Scholastic Testing Service

Emporia State Algebra II Test (5,8)
Bureau of Educational Measurements, Emporia Kansas State College

Individual Phonics Criterion Test (5,7)
Dreier Educational Systems

Kraner Preschool Math Inventory (3)
Learning Concepts

NM Attitude Toward Work Test (7)
Monitor

Oral Reading Criterion Test (3,8)
Dreier Educational Systems

PIRAMID (1)
PIRAMID Consortium

Preschool Attainment Record (5)
American Guidance Service

Prescriptive Mathematics Inventory (1)
CTB/McGraw-Hill

Reading Management System, Diagnostic Step Tests (4,5,6)
Clark County School District

Reading Skills Survey Tests (5,6)
Economy Company

Self-Directed, Interpretative and Creative Reading (4,6)

Senior High Assessment of Reading Performance (SHARP) (9)
CTB/McGraw-Hill

SRA Reading Record (8)
Science Research Associates

Stanford Achievement Test (8)
Harcourt Brace Jovanovich

Stanford Test of Academic Skills (TASK) (8)
Harcourt Brace Jovanovich

Visual Analysis Test (8)
University of Pittsburgh


INDEX A
Names of Reviewed Tests

Name/Publisher/Level Page    Name/Publisher/Level Page

Analysis of Skills (ASK) - 30 Criterion-Referenced Tests of 46Language Arts

Scholastic Testing Serviceelementary and secondary

Basic Reading and Computa-tional Skills

Multi-Media Associateselementary

Analysis of Skills (ASK) , 32

Mathematics Criterion Test of Basic Skills 48Scholastic Testing Serviceelementary and secondary

Academic Therapy Publicationselementary and secondary

Analysis of Skills (ASK) -Reading

34 Design for Math Skill Develop-ment

50

Scholastic Testing Serviceelementary and secondary

NCS Educational Systemselementary

Basic Arithmetic Skill Evaluation (BASE) and BASE II

36 Diagnosis: An InstructionalAid - Mathematics

52

Imperial International LearningCorperation

elementary and secondary

Science Research Associateselementary

Diagnosis: An Instructional 54Basic Word Vocabulary Test 38 Aid - ReadingDreier Educational Systemselementary and secondary

Science Research Associateselementary

Beginning Assessment Test forReading

40 Diagnostic Mathematics Inven-tory

56

J. B. Lippincott Companyelementary

CTB/McGraw-Hillelementary and secondary

Carver-Darby Chunked ReadingTest

42 Doren Diagnostic Reading Testof Word Recognition Skills

58

Revrac Publicationssecondary

American Guidance Serviceelementary

Cooper-McGuire Diagnostic Word 44 Early Childhood Assessment 60

Analysis TestCroft Educational Serviceselementary

Cooperative Educational Serviceelementary


Name/Publisher/Level Page

Everyday Skills Tests:. Reading,Test A; Mathematics, Test A

CTB/McGraw-Hillelementary and secondary

Name/Publisher/Level Page

62 Language and Thinking Program: 82

Mastery Learning CriterionTests

Follett Publishing Companyelementary

Fountain Valley Teacher Sup- 64port System in Mathematics

Richard L. Zweig Associateselementary and secondary

Fountain Valley Teacher Sup- 66port System in Reading

Richard L. Zweig Associateselementary

Group Phonics Analysis Test 68
Dreier Educational Systems
elementary

Individual Pupil MonitoringSystem - Mathematics

Houghton Mifflinelementary and secondary

Individual Pupil Monitoring System - Reading
Houghton Mifflin
elementary

Individualized Criterion-Referenced Testing - Math

Educational Progresselementary and secondary

Language Arts: Composition, 84

Library, and Literary SkiZZsInstructional Objectives Exchangeelementary

Language Arts: Mechanics and 86Usage

Instructional Objectives Exchangeelementary

Language Arts: Word Forms and 88

SyntaxInstructional Objectives Exchange

70I

elementary

Mastery: An Evaluation Tool (Mathematics) 90
Science Research Associates

72 elementary and secondary

Mastery: An Evaluation Tool (SOBAR Reading) 92
Science Research Associates

74 elementary and secondary

Individualized Criterion- 76

Referenced Testing - ReadingEducational Progresselementary and secondary

Instant Word Recognition Test 78

Dreier Educational Systemselementary

KeyMath Diagnostic Arithmetic 80

TestAmerican Guidance Serviceelementary


Math Diagnostic/Placement Tests 94
U-SAIL

elementary

Mathematics: Elements, Symbol- 96

ism, and MeasurementInstructional Objectives Exchangesecondary

Mathematics: Geometry 98

Instructional Objectives Exchangeelementary


Name/Publisher/Level Page

Mathematics: Geometry, Operations, and Relations 100

Instructional Objectives Exchangesecondary

Mathematics: Measurement 102

Instructional Objectives Exchangeelementary

Mathematics: Numeration and Relations

Instructional Objectives Exchangeelementary

Mathematics: Operations andProperties

Instructional Objectives Exchangeelementary

Name/Publisher/Level Page

Pre-Reading Assessment Kit 118CTB/McGraw-Hill Ryerson Limitedelementary

Prescriptive Reading Inventory 120CTB/McGraw-Hillelementary

Reading: Comprehension SkiZZs 122104 Instructional Objectives Exchange

elementary

Reading: Word Attack Skills 124

Instructional Objectives Exchange106 elementary

Mathematics: Sets and Numbers 108Instructional Objectives Exchangeelementary

McGuire-Bumpus Diagnostic Com- 110prehension Test

Croft Educational Serviceselementary

New Mexico Career Education 112

TestMonitorsecondary

New Mexico Concepts of Ecology 114

TestMonitorelementary and secondary

New Mexico Consumer Mathematics 116

Test & Consumer Rights andResponsibilities Test

Monitorsecondary

REAL: Reading/Everyday Activities in Life 126
Cal Press, Inc.
secondary

Sipay Word Analysis Tests 128
Educators Publishing Service
elementary and secondary

Skills Monitoring System: Reading 130
Harcourt Brace Jovanovich/Psychological Corporation
elementary

Social Studies: American Government 132
Instructional Objectives Exchange
secondary

SRA Survival Skills in Reading and Math 134
Science Research Associates
elementary and secondary

Stanford Diagnostic Mathematics Test 136
Harcourt Brace Jovanovich/Psychological Corporation
elementary and secondary


Name/Publisher/Level Page

Stanford Diagnostic Reading Test 138
Harcourt Brace Jovanovich/Psychological Corporation
elementary and secondary

Survey of Reading Skills 140
Dallas Independent School District
elementary and secondary

Tests of Achievement in Basic Skills - Math 142
Educational and Industrial Testing Service
elementary and secondary

Tests of Achievement in Basic Skills - Reading and Language 144
Educational and Industrial Testing Service
elementary

Wisconsin Design for Reading Skill Development: Comprehension 146
NCS Educational Systems
elementary

Wisconsin Design for Reading Skill Development: Study Skills 148
NCS Educational Systems
elementary

Wisconsin Design for Reading Skill Development: Word Attack 150
NCS Educational Systems
elementary

Woodcock Reading Mastery Tests 152
American Guidance Service
elementary and secondary


INDEX B
Tests by Subject Matter

MATHEMATICS INDEX

UNDERSTANDING MATH CONCEPTS: numbers and sets; numeral systems and number principles; number relationships; and ordering numbers and symbols

Analysis of Skills (ASK) -Mathematics, Level 1-8

Basic Arithmetic Skill Eval-uation (BASE), Level 1-6

KeyMath Diagnostic Arithmetic 80Test, Level K-6

Mastery: An Evaluation Tool - 90Mathematics, Level K-8

94

104

Mathematics: Sets and Num- 108bers, Level K-6

32 Math Diagnostic/PlacementTests, Level 1-6

36 Mathematics: Numeration andRelations, Level K-6

Basic Arithmetic Skill Eval- 36uation IT (BASE II),Level 7-8

Criterion-Referenced Tests of 46Basic Reading and Computa-tional Skills, Level K-6

Design for Math Skill Devel-opment, Level K-12

Stanford Diagnostic Mathematics 136Test, Level 1-8

Tests of Achievement in BasicSkills: Mathematics,

50 Level K-12

Diagnosis: An Instructional 52

Aid - Mathematics, Level 1-6

Diagnostic Mathematics Inven-tory, Level 1.5-7.5+

Fountain Valley Teacher Sup-port System in Mathematics,Level K-8

Individual Pupil MonitoringSystem - Mathematics,Level 1-8

Individualized Criterion-Referenced Testing - Math,Level 1-8

142

PERFORMING ARITHMETIC OPERATIONS:whole number computations - addi-

56 tion, subtraction, multiplication,division

64 Analysis of Skills (ASK) -Mathematics, Level 1-8

32

Basic Arithmetic Skill Eval- 36

36

70 uation (BASE), Level 1-6

Basic Arithmetic Skill Eval-uation II (BASE II),

74 Level 7-8


Criterion-Referenced Test ofBasic Reading and Computa-tional Skills, Level K-6

Criterion Test of Basic Skills:Arithmetic, Level K-8

Design for Math Skill Devel-opment, Level K-12

Diagnosis: An InstructionalAid - Mathematics, Level 1-6

Diagnostic Mathematics Inven-tory, Level 1.5-7.5+

Everyday Skills Tests: Mathe-matics

Fountain Valley Teacher Sup-port System in Mathematics,Level K-8

Individual Pupil MonitoringSystem - Mathematics,Level 1-8

Individualized Criterion-Referenced Testing - Math,Level 1-8

KeyMath Diagnostic ArithmeticTest, Level 1-6

Mastery: An Evaluation Tool -Mathematics, Level K-8

Math Diagnostic/PlacementTests, Level 1-6

Mathematics: Operations andProperties, Level K-6

SRA Survival Skills in Readingand Mathematics, Level 6+

46 Tests of Achievement in Basic 142

Skills: Mathematics,Level K-2

48

50

52

56

62

64

PERFORMING ARITHMETIC OPERATIONS: fractions, decimals, and percentage computations - addition, subtraction, multiplication, division

Analysis of Skills (ASK) -Mathematics, Level 4-8

32

Basic Arithmetic Skill Eval- 36uation (BASE), Level 1-6

Basic Arithmetic Skill Eval- 36uation II (BASE II),Level 7-8

Criterion-Referenced Test ofBasic Reading and Computa-

70 tional Skills, Level K-6

74

46

Criterion Test of Basic Skills: 48Arithmetic, Level K-8

Design for Math Skill Devel- 50

opment, Level 2-12

80 Diagnosis: An Instructional 52

Aid - Mathematics, Level 1-6

90 Fountain Valley Teacher Sup- 64

port System in Mathematics,Level K-8

94

Individual Pupil Monitoring 70System - Mathematics,

106 Level 2-8

Individualized Criterion-134 Referenced Testing - Math,

Level 1-8

Stanford Diagnostic Mathematics 136

Test, Level 1-8

74

KeyMath Diagnostic Arithmetic 80Test, Level 3-6


Mastery: An Evaluation Tool -Mathematics, Level K-8

Math Diagnostic/PlacementTests, Level 1-6

Mathematics: Numerations andRelations, Level K-6

Mathematics: Operations andProperties, Level K-6

90 Individual Pupil Monitoring 70

System - Mathematics,Level 2-8

94

Individualized Criterion-Referenced Testing - Math,

104 Level 1-8

74

KeyMath Diagnostic Arithmetic 80106 Test, Level 3-6

SRA Survival Skills in Reading 134and Mathematics, Level 6+

Stanford Diagnostic Mathematics 136

Test, Level 1-8

Tests of Achievement in BasicSkills: Mathematics,Level 3-12

Mastery: An Evaluation Tool - 90Mathematics, Level K-8

Math Diagnostic/Placement 94

Tests, Level 1-6

New Mexico Consumer Mathemat- 116142 ics Test, Level 9-12

APPLYING MATHEMATICS: problem solving, word problems

Analysis of Skills (ASK) -Mathematics, Level 2-8

SRA Survival Skills in Reading 134and Mathematics, Level 6+

Stanford Diagnostic Mathematics 136

Test, Level 1-8

Tests of Achievement in Basic 142

32 Skills: Mathematics,Level 2-12

Basic Arithmetic Skill Eval- 36

uation (BASE), Level 1-6

Basic Arithmetic Skill Eval-uation II (BASE II),Level 7-8

36 GEOMETRY OPERATIONS AND RELATIONS

Design for Math Skill Devel- 50

opment, Level 2-12

Diagnosis: An Instructional 52

Aid Mathematics, Level 1-6

Everyday Skills Tests: Mathe-matics

Fountain Valley Teacher Sup-port System in Mathematics,Level K-8

Analysis of Skills (ASK) - 32

Mathematics, Level 3-8

Basic Arithmetic Skill Eval- 36

uation (BASE), Level 3-6

Design for Math Skill Devel- 50

opment, Level 1-3 (basic),62 4-12

Diagnosis: An Instructional 5264 Aid - Mathematics, Level 1-6


Fountain Valley Teacher Sup-port System in Mathematics,Level K-8

Individual Pupil MonitoringSystem - Mathematics,Level 3-8

Individualized Criterion-Referenced Testing - Math,Level 1-8

KeyMath Diagnostic ArithmeticTest, Level 3-6

Mastery: An Evaluation Tool -Mathematics, Level K-8

Math Diagnostic/PlacementTests, Level 1-6

Mathematics: Geometry,Level K-6

Mathematics: Geometry, Oper-ations, and Relations,Level 7-9

64 Criterion Test of Basic Skills: 48

Arithmetic, Level K-8

Design for Math Skill Devel-70 opment, Level 4-12

74

80

90

94

98

100

Stanford Diagnostic Mathematics 136

Test, Level 1-9

Tests of Achievement in BasicSkills: Mathematics,Level 2-12

Diagnosis: An InstructionalAid - Mathematics, Level 1-6

Diagnostic Mathematics Inven-tory, Level 1.5-7.5+

Fountain Valley Teacher Sup-port System in Mathematics,Level K-8

Individual Pupil MonitoringSystem - Mathematics,Level 2-8

Individualized Criterion-Referenced Testing - Math,Level 1-8

50

52

56

64

70

74

KeyMath Diagnostic Arithmetic 80

Test, Level 1-6

Mastery: An Evaluation Tool - 90

Mathematics, Level K-8

Math Diagnostic/Placement 94

142 Tests, Level 1-6

MEASUREMENT: weight, volume, length, angular, time, speed

Analysis of Skills (ASK) -Mathematics, Level 3-8, 1-2(common measure)

Basic Arithmetic Skill Eval-uation (BASE), Level 1-6

Mathematics: Elements, Sym- 96

bolism, and Measurement,Level 7-9

Mathematics: Measurement,Level K-6

32 SRA Survival Skills in Readingand Mathematics, Level 6+

Tests of Achievement in Basic36 Skills: Mathematics,

Level 2-12


102

134

142


USE OF TABLES, GRAPHS, STATISTICAL CONCEPTS

Analysis of Skills (ASK) - 32

Mathematics, Level 1-4(basic graphs), 5-8

Basic Arithmetic Skill Eval- 36

uation (BASE), Level 1-6

Design for Math Skill Devel- 50opment, Level 4-12

Fountain Valley Teacher Sup-port System in Mathematics,Level K-8

64

Individual Pupil Monitoring 70

System - Mathematics,Level 7-8

KeyMath Diagnostic Arithmetic 80Test, Level 3-6

Mastery: An Evaluation Tool - 90

Mathematics, Level K-8

Mathematics: Numeration and 104Relations, Level K-6

SRA Survival Skills in Reading 134

Stanford Diagnostic Mathematics 136Test, Level 1-8

Tests of Achievement in Basic 142Skills: Mathematics,Level 6-12


READING INDEX

AUDITORY COMPREHENSION SKILLS: Reception (listening)

Analysis of Skills (ASK) -Reading, Level 1-3 (wholeprogram 1-8)

34

Beginning Assessment Test for 40

Reading, Level K-1

Cooper-McGuire Diagnostic Word 44Analysis Test

Doren Diagnostic Reading Testof Word Recognition Skills,Level 1-4

Early Childhood Assessment,Level preschool-1

Group Phonics Analysis Test,Level 1-3

Individualized Criterion-Referenced Testing - Reading,Level K-8

Language and Thinking Program:Mastery Learning CriterionTests, Level preschool-1

Wisconsin Design for Reading 150

Skill Development: WordAttack, Level K-6

Woodcock Reading Mastery 152Tests, Level K-12

VISUAL COMPREHENSION SKILLS/WORD ATTACK SKILLS: reception and production (reading and writing)

58 Analysis of Skills (ASK) -Reading, Level 1-3 (wholeprogram 1-8)

60 Beginning Assessment Test forReading, Level K-1

68 Cooper-McGuire Diagnostic WordAnalysis Test

76 Crl.terion-Referenced Tests ofBasic Reading and Computa-tional Skills, Level K-6

82 Criterion Test of Basic Skills:Reading, Level K-8

Pre-Reading Assessment Kit, 118Level K-1

Prescriptive Reading Inven- 120

tory, Level 1.5-6.5

Stanford Diagnostic Reading 138

Test, Level 1.5-12

Survey of Reading Skills, 140Level K-8

Tests of Achievement in Basic 144

Skills: Reading andLanguage, Level K-2

34

40

44

46

48

Diagnosis: An Instructional 54

Aid - Reading, Level 1-6

Doren Diagnostic Reading Test 58

of Word Recognition Skills,Level 1-4

Fountain Valley Teacher Sup- 66

port System in Reading,Level K-6

Group Phonics Analysis Test, 68

Level 1-3

Individual Pupil Monitoring 72

System - Reading, Level 1-6


Individualized Criterion-Referenced Testing - Reading,Level K-8

Language and Thinking Program:Mastery Learning CriterionTests, Level preschool-1

Language Arts: Word Forms andSyntax, Level K-6

Mastery: An Evaluation Tool -SOBAR Reading, Level K-8

Pre-Reading Assessment Kit,Level K-1

Prescriptive Reading Inven-tory, Level 1.5-6.5

Reading: Word Attack Skills,Level K-6

Sipay Word Analysis Tests,Level 1-adult

Skills Monitoring System -Reading, Level 3-5

Stanford Diagnostic ReadingTest, Level 1.5-12

Survey of Reading Skills,Level K-8

Tests of Achievement in BasicSkills: Reading andLanguage, Level K-2

Wisconsin Design for ReadingSkill Development: WordAttack, Level K-6

Woodcock Reading MasteryTests, Level K-12

76 VOCABULARY/WORD RECOGNITION: auditory and visual

Analysis of-Skills (ASK) -82 Reading, Level 1-8

88

92

118

120

124

128

130

138

140

Basic Word Vocabulary Test,Level 4-adult

Criterion Test of Basic Skills:Reading, Level K-8

Diagnosis: An InstructionalAid - Reading, Level 1-6

Doren Diagnostic Reading Testof Word Recognition Skills,Level 1-4

34

38

48

54

58

Everyday Skills Tests: Reading 62

Fountain Valley Teacher Sup- 66

port System in Reading,Level K-6

Group Phonics Analysis Test, 68

Level 1-3

Individual Pupil Monitoring 72System - Reading, Level 1-6

Individualized Criterion- 76

Referenced Testing - ReadingLevel K-8

144 Instant Word Recognition Test, 78

Level 1-4

Language Arts: Mechanics and 86150 Usage, Level K-6

Language Arts: Word Forms 88

152and Syntax, Level K-6

Mastery: An Evaluation Tool - 92

SOBAR Reading, Level K-8


Pre-Reading Assessment Kit,Level K-1

Prescriptive Reading Inven-tory, Level 1.5-6.5

118 Diagnosis: An Instructional 54

Aid - Reading, Level 1-6

120 Everyday Skills Test: Reading 62

Reading: Word Attack Skills, 124Level K-6

Sipay Word Analysis Tests,Level 1-adult

Skills Monitoring System -Reading, Level 3-5

Fountain Valley Teacher Sup- 66port System in Reading,Level K-6

128 Individual Pupil Monitoring 72

System - Reading, Level 1-6

130 Individualized Criterion-Referenced Testing - Reading,Level 3-8

SRA Survival Skills in Reading 134

and Mathematics, Level 6+

Stanford Diagnostic Reading 138Test, Le7e1 1.5-12

Survey of Reading Skills,Level K-8

140

Tests of Achievement in Basic 144Skills: Reading andLanguage, Level K-2

Woodcock Reading MasteryTests, Level K-12

76

Mastery: An Evaluation Tool - 92SOBAR Reading, Level 3-9

McGuire -Bumpus Diagnostic Com- 110prehension Test

Prescriptive Reading Inven- 120tory, Level 1.5-6.5

Reading: ComprehensionSkills, Level K-6

122

152 REAL: Reading/Everyday Acti- 126vities in Life, Level 6+

READING COMPREHENSION: literal meaning (main idea)

Analysis of Skills (ASK) -Reading, Level 1-8

34

Carver-Darby Chunked Reading 42

Test, Level high school-adult(reading rate)

Criterion-Referenced Tests ofBasic Reading and Computa-tional Skills, Level K-6

Skills Monitoring System -Reading, Level 3-5

130

SRA Survival Skills in Reading 134

and Mathematics, Level 6+

Stanford Diagnostic Reading 138Test, Level 1.5-12

Survey of Reading Skills,Level 2-8

140

46 Tests of Achievement in Basic 144

Skills: Reading andLanguage, Level K-2


Wisconsin Design for ReadingSkill Development: Compre-hension, Level K-6

Woodcock Reading MasteryTests, Level K-12

146 Wisconsin Design for Reading 146

Skill Development: Compre-hension, Level K-6

152

READING COMPREHENSION: interpretative meaning

Analysis of Skills (ASK) -Reading, Level 3-8

34

Criterion-Referenced Tests of 46

Basic Reading and Computa-tional Skills, Level K-6

Diagnosis: An InstructionalAid - Reading, Level 1-6

Individualized Criterion-Referenced Testing - Reading,Level 3-8

Mastery: An Evaluation Tool -SOBAR Reading, -Level 3-9

McGuire-Bumpus Diagnostic Com-prehension Test

Prescriptive Reading Inven-tory, Level 2-6.5

Reading: ComprehensionSkills, Level K-6

Skills Monitoring System -Reading, Level 3-5

Stanford Diagnostic ReadingTest, Level 2.5-12

Survey of Reading Skills,Level 3-8

54

SKILLS: spelling, punctuation and grammatical skills

Analysis of Skills (ASK) - 30

Language Arts, Level 2-8

Doren Diagnostic Reading Test 58of Word Recognition Skills,Level 1-4

Language Arts: Mechanics and 86

Usage, Level K-6

REFERENCE STUDY SKILLS AND TECH-76 NIQUES

92

110

120

122

130

Analysis of Skills (ASK) - 34

Reading, Level 3-8

Criterion-Referenced Tests of 46

Basic Reading and Computa-tional Skills, Level K-6

Diagnosis: An Instructional 54

Aid - Reading, Level 1-6

Everyday Skills Test: Reading 62

Fountain Valley Teacher Sup-port System in Reading,Level K-6

66

Individual Pupil Monitoring 72

136 System - Reading, Level 1-6

Individualized Criterion-140 Referenced Testing - Reading,

Level 3-8


76

Page 244: DOCOMENT RESUME - ERIC · v. DOCOMENT RESUME ED 186 457 TN BOO 146. AUTHOR. walker, Clinton R: And Others. TITLE. CSE Criterion-Referenced Test Handbook. INSTITUTION California Univ.,

Language Arts: Composition,Library, and LiterarySkills, Level K-6

Mastery: An Evaluation Tool -SOBAR Reading, Level K-8

Tests of Achievement in BasicSkills: Reading andLanguage, Level K-2

Wisconsin Design for ReadingSkill Development: StudySkills, Level K-6

New Mexico Career EducationTest Series, Level 9-12

84 APPRECIATION OF READING (dictionaries, newspapers, books)

Analysis of Skills (ASK) - 34

92 Reading, Level 3-8

Individualized Criterion-144 Referenced Testing - Reading,

Level 6-8

76

Mastery: An Evaluation Tool - 92

148 SOBAR Reading, Level 6-9

Prescriptive Reading Inven- 120tory, Level 3-6.5

OTHER SUBJECTS INDEX

112 New Mexico Consumer Rights and 116

Responsibilities Test,Level 9-12

New Mexico Concepts of Ecology 114Test, Level 6-12


Social Studies: American 132

Government, Level 10-12


INDEX C
Publishers' Names and Addresses

Publisher

Academic Therapy Publications
1539 Fourth Street
P.O. Box 899
San Rafael, CA 94901

American Guidance Service (AGS)
Publishers' Building
Circle Pines, MN 55014

Cal Press, Inc.
76 Madison Avenue
New York, NY 10016

Cooperative Educational Service Agency #13
908 W. Main Street
Waupun, WI 53963

Croft Educational Services
4922 Harford Road
Baltimore, MD 21214

CTB/McGraw-Hill
Del Monte Research Park
Monterey, CA 93940

CTB/McGraw-Hill Ryerson Limited
330 Progress Avenue
Scarborough, Ontario
CANADA M1P 2Z5

Dallas Independent School District
ATTN: Mr. Dean Arrasmith
3801 Herschel Street
Dallas, TX 75219

Tests

Criterion Test of Basic Skills

Doren Diagnostic Reading Test ofWord Recognition Skills

KeyMath Diagnostic Arithmetic TestWoodcock Reading Mastery Tests

REAL: Reading/Everyday Activitiesin Life

Early Childhood Assessment

Page

49

58

80152

126

60

Cooper-McGuire Diagnostic Word 44

Analysis TestMcGuire-Bumpus Diagnostic Compre- 110

hension Test

Diagnostic Mathematics InventoryEveryday Skills Tests: Reading,Test A; Mathematics, Test A

Prescriptive Reading Inventory

Pre-Reading Assessment Kit

Survey of Reading Skills


56

62

120

118

140


Publisher

Dreier Educational Systems
P.O. Box 1291
Highland Park, NJ 08904

Educational and Industrial Testing Service (EdITS)
P.O. Box 7234
San Diego, CA 92107

Educational Progress
Educational Development Corporation
P.O. Box 45663
Tulsa, OK 74145

Educators Publishing Service
75 Moulton Street
Cambridge, MA 02138

Follett Publishing Co.
Department DM
1010 W. Washington Blvd.
Chicago, IL 60607

Harcourt Brace Jovanovich
(see The Psychological Corporation)

Houghton Mifflin
777 California Avenue
Palo Alto, CA 94304

Imperial International Learning Corp. (IIL)
P.O. Box 548, Route 50 South
Kankakee, IL 60901

Instructional Objectives Exchange (IOX)
P.O. Box 24095
Los Angeles, CA 90025

Tests Page

Basic Word Vocabulary Test
Group Phonics Analysis Test
Instant Word Recognition Test

Tests of Achievement in Basic Skills - Math

Tests of Achievement in Basic Skills - Reading and Language

Individualized Criterion-Referenced Testing - Math

Individualized Criterion-Referenced Testing - Reading

Sipay Word Analysis Tests

Language and Thinking Program: Mastery Learning Criterion Tests

3868

78

142

144

74

76

128

82

Individual Pupil Monitoring 70

System - MathematicsIndividual Pupil Monitoring 72

System - Reading

Basic Arithmetic Skill Evaluation 36

(BASE) and BASE II

Language Arts: Composition, Library, and Literary Skills

Language Arts: Mechanics and Usage

Language Arts: Word Forms and Syntax

Mathematics: Elements, Symbolism, and Measurement


84

86

88

96


Publisher

J. B. Lippincott CompanyEducational Publishing

DivisionEast Washington SquarePhiladelphia, PA 19105

MonitorP.O. Box 2337Hollywood, CA 90028

Multi-Media Associates, Inc.EPIC Criterion-Referenced Test

DivisionP.O. Box 130524901 E. Fifth StreetTucson, AZ 85732

NCS Educational Systems4401 West 76th StreetMinneapolis, MN 55435

Tests Page

Mathematics: Geometry 98

Mathematics: Geometry, Operations, and Relations 100
Mathematics: Measurement
Mathematics: Numeration and Relations
Mathematics: Operations and Properties 106
Mathematics: Sets and Numbers
Reading: Comprehension Skills
Reading: Word Attack Skills
Social Studies: American Government

102

104

Beginning Assessment Test for Reading

New Mexico Career Education Test
New Mexico Concepts of Ecology Test
New Mexico Consumer Mathematics Test & Consumer Rights and Responsibilities Test

Criterion-Referenced Tests of Basic Reading and Computational Skills

108122

124132

40

112

114

116

46

Design for Math Skill Development 50

Wisconsin Design for Reading Skill Development: Comprehension 146
Wisconsin Design for Reading Skill Development: Study Skills 148
Wisconsin Design for Reading Skill Development: Word Attack 150


Publisher

The Psychological CorporationA division of Harcourt BraceJovanovich

757 Third AvenueNew York, NY 10017

Revrac PublicationsDr. Ronald P. Carver10 W. Bridlespur DriveKansas City, MO 64114

Scholastic Testing Service,Inc. (STS)

480 Meyer RoadBensenville, IL 60106

Science Research Associates,Inc. (SRA)

259 East Erie StreetChicago, IL 60611

U-SAIL (Utah System Approachto Individualized Learning)

2971 Evergreen AvenueP.O. Box 9327Salt Lake City, UT 84109

Richard L. Zweig AssociatesTesting Division20800 Beach Blvd.P.O. Box 73Huntington Beach, CA 92648

Tests

Skills Monitoring System: Reading
Stanford Diagnostic Mathematics Test

Stanford Diagnostic Reading Test

Page

130
136

138

Carver-Darby Chunked Reading Test 42

Analysis of Skills (ASK) - Language Arts
Analysis of Skills (ASK) - Mathematics
Analysis of Skills (ASK) - Reading
Diagnosis: An Instructional Aid - Mathematics
Diagnosis: An Instructional Aid - Reading

Mastery: An Evaluation Tool (Mathematics)
Mastery: An Evaluation Tool (SOBAR Reading)
SRA Survival Skills in Reading and Math

Math Diagnostic/Placement Tests

Fountain Valley Teacher Support System in Mathematics
Fountain Valley Teacher Support System in Reading


30

32

34

52

54

90

92

134

94

64

66


REFERENCES

The references are listed under the following categories:

Those cited in the text

Those which provided lists of tests for review in this book

Recommended reading

TEXTUAL

American Psychological Association (APA). Standards for educational and psychological tests. Washington, DC: APA, 1974.

Armbruster, B. B., Stevens, R. J., & Rosenshine, B. Analyzing content coverage and emphasis: A study of three curricula and two tests. Technical Report #26. Urbana, IL: Center for the Study of Reading, University of Illinois, 1977.

Baker, E. L. Achievement testing in urban schools: New numbers. To be published by CEMREL, Inc., St. Louis, Missouri, in the Urban Education Monograph Series, Margaret Solomon (Ed.).

Barta, M. B., Ahn, J. R., & Gastright, J. F. Some problems in interpreting criterion-referenced test results in a program evaluation. Studies in Educational Evaluation, 1976, 2(3), 193-202.

Campbell, D. T., & Fiske, D. W. Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 1959, 56, 81-105.

Campbell, D. T., & Stanley, J. C. Experimental and quasi-experimental designs for research. Chicago: Rand McNally, 1963.

Cronbach, L. J. Essentials of psychological testing (3rd ed.). New York: Harper & Row, 1970.


Denham, C. H. Score reporting and item selection in selected criterion referenced and domain referenced tests. Paper given at the annual meeting of the National Council on Measurement in Education, New York, April, 1977.

Dotseth, M., Hunter, R., & Walker, C. B. Survey of test selectors' concerns and the test selection process. CSE Report #107. Los Angeles: Center for the Study of Evaluation, University of California, 1978.

Ebel, R. L. Essentials of educational measurement. Englewood Cliffs, NJ: Prentice-Hall, 1972.

Floden, R. E., Porter, A. C., Schmidt, W. H., & Freeman, D. J. Don't they all measure the same thing? Consequences of standardized test selection. In E. L. Baker & E. S. Quellmalz (Eds.), Educational testing and evaluation: Design, analysis, and policy. Beverly Hills, CA: Sage Publications, 1979.

Guion, R. M. Content validity--the source of my discontent. Applied Psychological Measurement, 1977, 1, 1-10.

Hambleton, R., & Eignor, D. Guidelines for evaluating criterion-referenced tests and test manuals. Paper delivered at the annual meeting of the American Educational Research Association, Toronto, March, 1978.

Hambleton, R. K., Swaminathan, H., Algina, J., & Coulson, D. Criterion-referenced testing and measurement: A review of the technical issues and development. Review of Educational Research, 1978, 48, 1-47.

Hoepfner, R. Achievement test selection for program evaluation. In M. J. Wargo & D. R. Green (Eds.), Achievement testing of disadvantaged and minority students for educational program evaluation. Monterey, CA: CTB/McGraw-Hill, 1978.

Hoepfner, R., et al. CSE secondary school test evaluations. Los Angeles: Center for the Study of Evaluation, University of California, 1974.

Hoepfner, R., et al. CSE elementary school test evaluations. Los Angeles: Center for the Study of Evaluation, University of California, 1976.

Jenkins, J. R., & Pany, D. Curriculum biases in reading achievement tests. Technical Report #16. Urbana, IL: Center for the Study of Reading, University of Illinois, 1976.


Katz, M. Selecting an achievement test. Princeton, NJ: Educational Testing Service, 1973.

Linn, R. L. Issues of validity in measurement for competency-based programs. Paper presented at the annual meeting of the National Council on Measurement in Education, New York, April, 1977.

Linn, R. L., & Slinde, J. A. The determination of the significance of change between pre- and posttesting periods. Review of Educational Research, 1977, 47(1), 121-150.

Lyon, C. D., Doscher, L., McGranahan, P., & Williams, R. Evaluation and school districts. Los Angeles: Center for the Study of Evaluation, University of California, 1978.

Messick, S. The standard problem: Meaning and values in measurement and evaluation. American Psychologist, 1975, 30, 955-966.

Popham, W. J. Criterion-referenced measurement. Englewood Cliffs, NJ: Prentice-Hall, 1978.

Shoemaker, D. M. Evaluating the effectiveness of competing instructional programs. Educational Researcher, 1972, 1(May), 5-12.

Stake, R. E. More subjective! Remarks made in an invited debate on the question, "Should educational evaluation be more objective or more subjective?" at the annual meeting of the American Educational Research Association, Toronto, March, 1978.

Stallard, C. Comparing objective based reading programs. Journal of Reading, 1977, 21(1), 36-44.

Stallard, C. Managing reading instruction: Comparative analysis of objective-based reading programs. Educational Technology, 1977, 17(12), 21-26.

Tallmadge, G. K., & Horst, D. P. A procedural guide for validating achievement gains in educational projects. Washington, DC: Government Printing Office, 1976. (GPO Stock Number 017-080-01516-1.)

Tripodi, T., Fellin, P., & Epstein, I. Differential social program evaluation. Itasca, IL: Peacock, 1978.


Walker, C. B. Control test items: A baseline measure for evaluating achievement. Paper presented at the annual meeting of the American Educational Research Association, Toronto, March, 1978.

Walker, D. F., & Schaffarzik, J. Comparing curricula. Review of Educational Research, 1974, 44, 83-112.

TEST LISTS

Barrett, J. E. (Ed.). Where behavioral objectives exist. Norton, MA: Project SPOKE, 1974.

Education programs that work. San Francisco: Far West Regional Laboratory for Educational Research and Development, 1975.

Gitlin, C. Review of commercially available criterion-referenced tests. Final Report, Contract No. DAJA37-75-C-1760. United States Dependents Schools, European Area. Los Angeles: Educational Objectives and Measures, February, 1976.

Keller, C. M. Criterion-referenced measures: A bibliography. Princeton, NJ: ERIC Clearinghouse on Tests, Measurement, and Evaluation, 1972. (ED 060 041, TM 001 124.)

Knapp, J. A collection of criterion-referenced tests. TM Report No. 31. Princeton, NJ: ERIC Clearinghouse on Tests, Measurement, and Evaluation, December, 1974. (ED 099 427.)

Rosen, P. (Ed.). Test collection bibliographies: Criterion-referenced measures. Princeton, NJ: Educational Testing Service, 1973. (ED 104 910, TM 004 362.) (Includes supplement dated August, 1974.)

Rosen, P. (Ed.). Test collection bulletin. Princeton, NJ: Educational Testing Service, 1975 (Vol. 9), and 1976 (Vol. 10).

Test library catalog (revised edition). Los Angeles: Los Angeles County Superintendent of Schools, Division of Program Evaluation, Research and Pupil Services, 1976.


RECOMMENDED READING

Airasian, P. W., & Madaus, G. F. Criterion-referenced testing in the classroom. Measurement and Education, 1972, 3, 73-88.

Baker, E. L. Cooperation and the state of the world in criterion-referenced tests. Educational Horizons, 1974, 52(4), 193-196.

Block, J. H. Criterion-referenced measurement: Potential. School Review, 1971, 79, 289-297.

Boehm, A. E. Criterion-referenced assessment for the teacher. Teachers College Record, 1973, 75(1), 117-126.

Carver, R. P. Two dimensions of tests: Psychometric and edumetric. American Psychologist, 1974, 29, 512-518.

Ebel, R. L. Criterion-referenced measurements: Limitations. School Review, 1971, 79, 282-288.

Ebel, R. L. Criterion-referenced and norm-referenced measurements. In Essentials of educational measurement. Englewood Cliffs, NJ: Prentice Hall, 1972, 83-86.

Ebel, R. L. Evaluation and educational objectives. Journal of Educational Measurement, 1973, 10(4), 273-279.

Esler, W. K., & Dziuban, C. D. Criterion referenced tests, some advantages and disadvantages for science instruction. Science Education, 1974, 58(2), 171-174.

Glaser, R. Instructional technology and the measurement of learning outcomes: Some questions. American Psychologist, 1963, 18, 519-521.

Glass, G. V. Standards and criteria. San Mateo, CA: San Mateo Educational Resource Center, 1977. (No. ID 005 555, 55 pages.)

Good, T. L., Biddle, B. J., & Brophy, J. E. Criterion-referenced testing. In Teachers make a difference. New York: Holt, Rinehart & Winston, 1975.

Gronlund, N. E. Preparing criterion-referenced tests for classroom instruction. New York: Macmillan Company, 1973.

Haladyna, T. The paradox of criterion-referenced measurement. (ERIC Number ED 126 155, April, 1976, 25 pages.)


Harsh, J. R. The forests, trees, branches and leaves, revisited: Norm, domain, objective and criterion-referenced assessments for educational assessment and evaluation. Association for Measurement and Evaluation in Guidance Monograph No. 1. Los Angeles: California Personnel and Guidance Association, February, 1974.

Hively, W. Domain referenced testing. Englewood Cliffs, NJ: Educational Technology Publications, 1974.

Hively, W. Introduction to domain-referenced testing. Educational Technology, 1974, 14(6), 5-10.

Hocker, R., Green, D. R., Ginsburg, N., & Hyman, H. The nature and uses of criterion-referenced and norm-referenced achievement tests. Special Report, Vol. 4, No. 3. Burlingame, CA: Association of California School Administrators, undated (probably 1975).

Martuza, V. R. Applying norm-referenced and criterion-referenced measurement in education. Boston: Allyn and Bacon, 1977.

Mehrens, W. A., & Lehmann, I. J. Norm- and criterion-referenced measurement. In Measurement and evaluation in education and psychology. New York: Holt, Rinehart & Winston, 1973, 63-76.

Millman, J. Criterion-referenced measurement. In W. J. Popham (Ed.), Evaluation in education: Current applications. Berkeley, CA: McCutchan, 1974. (Also available as a separate monograph.)

Millman, J. Program assessment, criterion-referenced tests, and things like that. Educational Horizons, 1974, 52(4), 188-192.

Popham, W. J. Criterion-referenced measurement. Englewood Cliffs, NJ: Prentice-Hall, 1978.

Popham, W. J. (Ed.). Criterion-referenced measurement: An introduction. Englewood Cliffs, NJ: Educational Technology Publications, 1971.

Popham, W. J., & Husek, T. R. Implications of criterion-referenced measurement. Journal of Educational Measurement, 1969, 6, 1-9.

Sanders, J. R., & Murray, S. L. Alternatives for achievement testing. Educational Technology, 1976, 16(3), 7 pages.


KEY TO THE EVALUATIVE SECTIONS OF CSE TEST REVIEWS*

MEASUREMENT PROPERTIES: CONCEPTUAL VALIDITY

1. Domain Descriptions. How good (i.e., thorough and comprehensive) are the descriptions of the objectives or domains to be tested?
   A. Very good (objectives are thoroughly described)
   B. Adequate (objectives are stated behaviorally but not in detail)
   C. Poor (objectives are loosely described and subject to various interpretations)

2. Agreement. How well do the test items match their objectives?
   A. The match is confirmed by sound evidence
   C. Data are not provided or are not persuasive

3. Representativeness. How adequately do the items sample their objectives?
   A. Items are representative of domains
   C. Item selection is either unrepresentative or unreported

MEASUREMENT PROPERTIES: FIELD TEST VALIDITY

4. Sensitivity. Does conventional instruction lead to test-score gains?
   A. Test scores reflect instruction
   C. Data are not provided or are not persuasive

5. Item Uniformity. How similar are the scores on the different items for an objective?
   A. Some evidence of item uniformity is provided
   C. No data are provided

6. Divergent Validity. Are the scores for each objective relatively uninfluenced by other skills?
   A. Independence of skills is confirmed
   C. Data are not provided or are not persuasive

7. Lack of Bias. Are test scores unfairly affected by social group factors?
   A. Persuasive evidence of lack of bias is offered for at least two groups (e.g., women, specific ethnic groups)
   C. Data are not provided or are not persuasive

8. Consistency of Scores. Are scores on individual objectives consistent over time or over parallel test forms?
   A. Consistency of scores for objectives is shown over parallel forms or repeated testing
   C. Data are not provided

APPROPRIATENESS AND USABILITY

9. Clarity of Instructions. How clear and complete are the instructions to students?
   A. Instructions are clear, complete, and include sample items
   B. Either instructions or sample items are lacking
   C. Both are lacking

10. Item Review. Does the publisher report that items were either logically reviewed or field tested for quality?
    A. Yes
    C. No

11. Visible Characteristics. Are the layout and print easily readable?
    A. Print and layout are readable for more than 90% of objectives
    C. At least 10% of objectives have problems in readability

12. Ease of Responding. Is the format for recording answers appropriate for the intended students?
    A. Responding is easy for more than 90% of the objectives
    C. Lack of clarity, crowding, etc., make responding difficult in at least 10% of objectives

13. Informativeness. Does the test buyer have adequate information about the test before buying it?
    A. Yes
    C. No

14. Curriculum Cross-Referencing. Are the test objectives indexed to at least two series of relevant teaching materials?
    A. Yes
    C. No

15. Flexibility. Are many of the objectives tested at more than one level, and are single objectives easy to test separately?
    A. Objectives are varied, carry over across test levels, and are easy to test separately
    B. One feature is missing from variety, carry over, or separability
    C. Two or three of the features are missing

16. Alternate Forms. Are parallel forms available for each test?
    A. Yes
    C. No

17. Test Administration. Are the directions to the examiner clear, complete, and easy to use?
    A. Directions are clear, complete, and easy to use
    C. One or more of the above features are missing

18. Scoring. Are both machine scoring and easy hand scoring available?
    A. Yes
    B. Easy, objective hand scoring is available, but no machine scoring
    C. Hand scoring is not easy or objective; or only machine scoring is offered

19. Record Keeping. Does the publisher provide record forms that are keyed to test objectives and are easy to use?
    A. Yes
    C. They are not included or not keyed to test objectives

20. Decision Rules. Are well justified, easy-to-use rules given for making instructional decisions on the basis of test results?
    A. Yes
    C. Decision rules either are not given, not easy to use, or not justified

21. Comparative Data. Are scores of a representative reference group of students given for comparing with scores of pupils in the test user's program?
    A. National norms, criterion group data, or item difficulty values are provided
    C. These are not provided or are not clearly representative

*This system for evaluating CRTs is explained in detail in the text. For test features where only two levels of quality are distinguished, the letters A and C are used to indicate the levels.
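The key above amounts to a small, fixed rating scheme: 21 named criteria, each graded with a letter, and a middle (B) level defined only for Domain Descriptions, Clarity of Instructions, Flexibility, and Scoring. Readers who tabulate their own ratings alongside the handbook's could record them in a simple structure like the sketch below; this is an illustrative sketch only, and the dictionaries and the validate_ratings helper are hypothetical conveniences, not something the handbook provides.

    # Illustrative sketch only: one way to record a test's ratings under the
    # 21-criterion key above. The structures here are hypothetical conveniences.

    CRITERIA = {
        # Measurement properties: conceptual validity
        1: "Domain Descriptions", 2: "Agreement", 3: "Representativeness",
        # Measurement properties: field test validity
        4: "Sensitivity", 5: "Item Uniformity", 6: "Divergent Validity",
        7: "Lack of Bias", 8: "Consistency of Scores",
        # Appropriateness and usability
        9: "Clarity of Instructions", 10: "Item Review",
        11: "Visible Characteristics", 12: "Ease of Responding",
        13: "Informativeness", 14: "Curriculum Cross-Referencing",
        15: "Flexibility", 16: "Alternate Forms", 17: "Test Administration",
        18: "Scoring", 19: "Record Keeping", 20: "Decision Rules",
        21: "Comparative Data",
    }

    # Per the key, only criteria 1, 9, 15, and 18 define a middle (B) level;
    # all other criteria are rated A or C.
    THREE_LEVEL = {1, 9, 15, 18}


    def validate_ratings(ratings):
        """Check that a {criterion number: letter} dict uses only levels the key defines."""
        for num, letter in ratings.items():
            if num not in CRITERIA:
                raise ValueError(f"Unknown criterion number: {num}")
            allowed = "ABC" if num in THREE_LEVEL else "AC"
            if letter not in allowed:
                raise ValueError(
                    f"Criterion {num} ({CRITERIA[num]}) allows only {', '.join(allowed)}"
                )
        return ratings


    if __name__ == "__main__":
        # Hypothetical ratings for an imaginary test, shown only to illustrate the format.
        example = validate_ratings({1: "B", 2: "C", 9: "A", 18: "B", 21: "C"})
        for num, letter in sorted(example.items()):
            print(f"{num:2d}. {CRITERIA[num]}: {letter}")

Run as a script, the sketch simply prints the rated criteria in numerical order, the same order in which the evaluative sections of each review are keyed.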
