ANNE ANASTASI
Professor of Psychology, Fordham University

Psychological Testing

MACMILLAN PUBLISHING CO., INC.
New York
Collier Macmillan Publishers
London
IN A revised edition, one expects both similarities and differences. This
edition shares with the earlier versions the objectives and basic approach
of the book. The primary goal of this text is still to contribute toward the
proper evaluation of psychological tests and the correct interpretation
and use of test results. This goal calls for several kinds of information:
( 1) an understanding of the major principles of test construction, (2)
psychological knowledge about the behavior being assessed, (3) sensi-
tivity to the social and ethical implications of test use, and (4) broad
familiarity with the types of available instruments and the sources of
information about tests. A minor innovation in the fourth edition is the
addition of a suggested outline for test evaluation (Appendix C).
In successive editions, it has been necessary to exercise more and more
restraint to keep the number of specific tests discussed in the book from
growing with the field-it has never been my intention to provide a
miniature Mental Measurements Yearbook! Nevertheless, I am aware
that principles of test construction and interpretation can be better
understood when applied to particular tests. Moreover, acquaintance with
the major types of available tests, together with an understanding of
their special contributions and limitations, is an essential component of
knowledge about contemporary testing. For these reasons, specific tests
are again examined and evaluated in Parts 3, 4, and 5. These tests have
been chosen either because they are outstanding examples with which
the student of testing should be familiar or because they illustrate some
special point of test construction or interpretation. In the text itself, the
principal focus is on types of tests rather than on specific instruments. At
the same time, Appendix E contains a classified list of over 250 tests,
including not only those cited in the text but also others added to provide
a more representative sample.
As for the differences-they loomed especially large during the prepa-
ration of this edition. Much that has happened in human society since
the mid-1960's has had an impact on psychological testing. Some of these
developments were briefly described in the last two chapters of the third
edition. Today they have become part of the mainstream of psychological
testing and have been accordingly incorporated in the appropriate
sections throughout the book. Recent changes in psychological testing that
are reflected in the present edition can be described on three levels:
(1) general orientation toward testing, (2) substantive and methodological
developments, and (3) "ordinary progress," such as the publication
of new tests and revision of earlier tests.
All rights reserved. No part of this book may be reproduced or
transmitted in any form or by any means, electronic or me-
chanical, including photocopying, recording, or any informa-
tion storage and retrieval system, without permission in writing
from the Publisher.
Earlier editions copyright 1954 and © 1961 by Macmillan
Publishing Co., Inc., and copyright © 1968 by Anne Anastasi.
MACMILLAN PUBLISHING Co., INC.
866 Third Avenue, New York, New York 10022
COLLIER MACMILLAN CANADA, LTD.
Library of Congress Cataloging in Publication Data
Anastasi, Anne, (date)
Psychological testing.
Bibliography: p.
Includes indexes.
1. Mental tests. 2. Personality tests. I. Title.
[DNLM: 1. Psychological tests. WM145 A534P]
BF431.A573 1976 153.9 75-2206
ISBN 0-02-302980-3
Preface
An example of changes on the first level is the increasing awareness of
the ethical, social, and legal implications of testing. In the present edition,
this topic has been expanded and treated in a separate chapter early
in the book (Ch. 3) and in Appendixes A and B. A cluster of related
developments represents a broadening of test uses. Besides the traditional
applications of tests in selection and diagnosis, increasing attention is
being given to administering tests for self-knowledge and self-development,
and to training individuals in the use of their own test results in
decision making (Chs. 3 and 4). In the same category are the continuing
replacement of global scores with multitrait profiles and the application
of classification strategies, whereby "everyone can be above average" in
one or more socially valued variables (Ch. 7). From another angle,
efforts are being made to modify traditional interpretations of test scores,
in both cognitive and noncognitive areas, in the light of accumulating
psychological knowledge. In this edition, Chapter 12 brings together
psychological issues in the interpretation of intelligence test scores,
touching on such problems as stability and change in intellectual level
over time; the nature of intelligence; and the testing of intelligence in
early childhood, in old age, and in different cultures. Another example
is provided by the increasing emphasis on situational specificity and
person-by-situation interactions in personality testing, stimulated in large
part by the social-learning theorists (Ch. 17).
The second level, covering substantive and methodological changes,
is illustrated by the impact of computers on the development, administration,
scoring, and interpretation of tests (see especially Chs. 4, 11, 13, 17,
18, 19). The use of computers in administering or managing instructional
programs has also stimulated the development of criterion-referenced
tests, although other conditions have contributed to the upsurge of
interest in such tests in education. Criterion-referenced tests are discussed
principally in Chapters 4, 5, and 14. Other types of instruments that have
risen to prominence and have received fuller treatment in the present
edition include: tests for identifying specific learning disabilities (Ch. 16),
inventories and other devices for use in behavior modification programs
(Ch. 20), instruments for assessing early childhood education (Ch. 14),
Piagetian "ordinal" scales (Chs. 10 and 14), basic education and
literacy tests for adults (Chs. 13 and 14), and techniques for the
assessment of environments (Ch. 20). Problems to be considered in the
assessment of minority groups, including the question of test bias, are
examined from different angles in Chapters 3, 7, 8, and 12.
On the third level, it may be noted that over 100 of the tests listed in
this edition have been either initially published or revised since the
publication of the preceding edition (1968). Major examples include the
McCarthy Scales of Children's Abilities, the WISC-R, the 1972 Stanford-Binet
norms (with all the resulting readjustments in interpretations),
Forms S and T of the DAT (including a computerized Career Planning
Program), the Strong-Campbell Interest Inventory (merged form of the
SVIB), and the latest revisions of the Stanford Achievement Test and the
Metropolitan Readiness Tests.
It is a pleasure to acknowledge the assistance received from many
sources in the preparation of this edition. The completion of the project
was facilitated by a one-semester Faculty Fellowship awarded by Ford-
ham University and by a grant from the Fordham University Research
Council covering principally the services of a research assistant. These
services were performed by Stanley Friedland with an unusual combination
of expertise, responsibility, and graciousness. I am indebted to the
many authors and test publishers who provided reprints, unpublished
manuscripts, specimen sets of tests, and answers to my innumerable inquiries
by mail and telephone. For assistance extending far beyond the
interests and responsibilities of any single publisher, I am especially
grateful to Anna Dragositz of Educational Testing Service and Blythe
Mitchell of Harcourt Brace Jovanovich, Inc. I want to acknowledge the
significant contribution of John T. Cowles of the University of Pittsburgh,
who assumed complete responsibility for the preparation of the Instructor's
Manual to accompany this text.
For informative discussions and critical comments on particular topics,
I want to convey my sincere thanks to William H. Angoff of Educational
Testing Service and to several members of the Fordham University Psy-
chology Department, including David R. Chabot, Marvin Reznikoff,
Reuben M. Schonebaum, and Warren W. Tryon. Grateful acknowledgment
is also made of the thoughtful recommendations submitted by
course instructors in response to the questionnaire distributed to current
users of the third edition. Special thanks in this connection are due to
Mary Carol Cahill for her extensive, constructive, and wide-ranging
suggestions. I wish to express my appreciation to Victoria Overton of
the Fordham University library staff for her efficient and courteous assistance
in bibliographic matters. Finally, I am happy to record the
contributions of my husband, John Porter Foley, Jr., who again participated
in the solution of countless problems at all stages in the preparation
of the book.
A.A.
CONTENTS

PART 1
CONTEXT OF PSYCHOLOGICAL TESTING

1. FUNCTIONS AND ORIGINS OF PSYCHOLOGICAL TESTING 3
Current uses of psychological tests
Early interest in classification and training of the mentally retarded 5
The first experimental psychologists 7
Contributions of Francis Galton 8
Cattell and the early "mental tests" 9
Binet and the rise of intelligence tests 10
Group testing 12
Aptitude testing 13
Standardized achievement tests 16
Measurement of personality 18
Sources of information about tests 20
2. NATURE AND USE OF PSYCHOLOGICAL TESTS
What is a psychological test? 23
Reasons for controlling the use of psychological tests
Test administration 32
Rapport 34
Test anxiety 37
Examiner and situational variables 39
Coaching, practice, and test sophistication 41

3. SOCIAL AND ETHICAL IMPLICATIONS
OF TESTING
User qualifications 45
Testing instruments and procedures 47
Protection of privacy 49
Confidentiality 52
Communicating test results 56
Testing and the civil rights of minorities 57
PART 2
PRINCIPLES OF PSYCHOLOGICAL TESTING

4. NORMS AND THE INTERPRETATION OF
TEST SCORES
Statistical concepts 68
Developmental norms 73
Within-group norms 77
Relativity of norms 88
Computer utilization in the interpretation of test scores 94
Criterion-referenced testing 96
5. RELIABILITY
The correlation coefficient 104
Types of reliability 110
Reliability of speeded tests 122
Dependence of reliability coefficients on the sample tested 125
Standard error of measurement 127
Reliability of criterion-referenced tests 131
6. VALIDITY: BASIC CONCEPTS
Content validity 134
Criterion-related validity 140
Construct validity 151
Overview 158
7. VALIDITY: MEASUREMENT AND
INTERPRETATION
Validity coefficient and error of estimate 163
Test validity and decision theory 167
Moderator variables 177
Combining information from different tests 180
Use of tests for classification decisions 186
Statistical analyses of test bias 191
8. ITEM ANALYSIS
Item difficulty 199
Item validity 206
Internal consistency 215
Item analysis of speeded tests 217
Cross validation 219
Item-group interaction 222
PART 3
TESTS OF GENERAL INTELLECTUAL LEVEL

9. INDIVIDUAL TESTS
Stanford-Binet Intelligence Scale 230
Wechsler Adult Intelligence Scale 245
Wechsler Intelligence Scale for Children 255
Wechsler Preschool and Primary Scale of Intelligence 260

10. TESTS FOR SPECIAL POPULATIONS
Infant and preschool testing 266
Testing the physically handicapped 281
Cross-cultural testing 287

11. GROUP TESTING
Group tests versus individual tests 299
Multilevel batteries 305
Tests for the college level and beyond 318
12. PSYCHOLOGICAL ISSUES IN INTELLIGENCE TESTING
Longitudinal studies of intelligence 327
Intelligence in early childhood 332
Problems in the testing of adult intelligence 337
Problems in cross-cultural testing 343
Nature of intelligence 349
PART 4
TESTS OF SEPARATE ABILITIES

13. MEASURING MULTIPLE APTITUDES
Factor analysis 362
Theories of trait organization 369
Multiple aptitude batteries 378
Measurement of creativity 388
14. EDUCATIONAL TESTING
Achievement tests: their nature and uses 398
General achievement batteries 403
Standardized tests in separate subjects 410
Teacher-made classroom tests 412
Diagnostic and criterion-referenced tests 417
Specialized prognostic tests 423
Assessment in early childhood education 425

15. OCCUPATIONAL TESTING
Validation of industrial tests 435
Short screening tests for industrial personnel 439
Special aptitude tests 442
Testing in the professions 458

16. CLINICAL TESTING
Diagnostic use of intelligence tests 465
Special tests for detecting cognitive dysfunction
Identifying specific learning disabilities 478
Clinical judgment 482
Report writing 487

PART 5
PERSONALITY TESTS

17. SELF-REPORT INVENTORIES
Content validation 494
Empirical criterion keying 496
Factor analysis in test development 506
Personality theory in test development 510
Test-taking attitudes and response sets 515
Situational specificity 521
Evaluation of personality inventories

18. MEASURES OF INTERESTS, ATTITUDES, AND VALUES 527
Interest inventories 528
Opinion and attitude measurement 543
Attitude scales 546
Assessment of values and related variables 552

19. PROJECTIVE TECHNIQUES
Nature of projective techniques 558
Inkblot techniques 559
Thematic Apperception Test and related instruments
Other projective techniques 569
Evaluation of projective techniques 576

20. OTHER ASSESSMENT TECHNIQUES
"Objective" performance tests 588
Situational tests 593
Self-concepts and personal constructs 598
Assessment techniques in behavior modification programs
Observer reports 606
Biographical inventories 614
The assessment of environments 616

APPENDIXES
B. Guidelines on Employee Selection Procedures (EEOC)
Guidelines for Reporting Criterion-Related and
Content Validity (OFCC)
PART 1
Context of
Psychological Testing

CHAPTER 1
Functions and Origins of
Psychological Testing
ANYONE reading this book today could undoubtedly illustrate what
is meant by a psychological test. It would be easy enough to recall
a test the reader himself has taken in school, in college, in the
armed services, in the counseling center, or in the personnel office. Or
perhaps the reader has served as a subject in an experiment in which
standardized tests were employed. This would certainly not have been the
case fifty years ago. Psychological testing is a relatively young branch of
one of the youngest of the sciences.
Basically, the function of psychological tests is to measure differences
between individuals or between the reactions of the same individual on
different occasions. One of the first problems that stimulated the develop-
ment of psychological tests was the identification of the mentally
retarded. To this day, the detection of intellectual deficiencies remains an
important application of certain types of psychological tests. Related
clinical uses of tests include the examination of the emotionally disturbed,
the delinquent, and other types of behavioral deviants. A strong impetus
to the early development of tests was likewise provided by problems
arising in education. At present, schools are among the largest test users.
The classification of children with reference to their ability to profit
from different types of school instruction, the identification of the
intellectually retarded on the one hand and the gifted on the other, the
diagnosis of academic failures, the educational and vocational counseling
of high school and college students, and the selection of applicants for
professional and other special schools are among the many educational
uses of tests.
The selection and classification of industrial personnel represent an-
other major application of psychological testing. From the assembly-line
operator or filing clerk to top management, there is scarcely a type of job
for which some kind of psychological test has not proved helpful in such
matters as hiring, job assignment, transfer, promotion, or termination.
To be sure, the effective employment of tests in many of these situations,
especially in connection with high-level jobs, usually requires that the
tests be used as an adjunct to skillful interviewing, so that test scores
may be properly interpreted in the light of other background information
about the individual. Nevertheless, testing constitutes an important part
of the total personnel program. A closely related application of psychological
testing is to be found in the selection and classification of military
personnel. From simple beginnings in World War I, the scope and
variety of psychological tests employed in military situations underwent
a phenomenal increase during World War II. Subsequently, research
on test development has been continuing on a large scale in all branches
of the armed services.
The use of tests in counseling has gradually broadened from a nar-
rowly defined guidance regarding educational and vocational plans to
an involvement with all aspects of the person's life. Emotional well-
being and effective interpersonal relations have become increasingly
prominent objectives of counseling. There is growing emphasis, too, on
the use of tests to enhance self-understanding and personal development.
Within this framework, test scores are part of the information given to
the individual as aids to his own decision-making processes.
It is clearly evident that psychological tests are currently being em-
ployed in the solution of a wide range of practical problems. One should
not, however, lose sight of the fact that such tests are also serving
important functions in basic research. Nearly all problems in differential
psychology, for example, require testing procedures as a means of gathering
data. As illustrations, reference may be made to studies on the nature and
extent of individual differences, the identification of psychological traits,
the measurement of group differences, and the investigation of biological
and cultural factors associated with behavioral differences. For all such
areas of research-and for many others-the precise measurement of
individual differences made possible by well-constructed tests is an
essential prerequisite. Similarly, psychological tests provide standardized
tools for investigating such varied problems as life-span developmental
changes within the individual, the relative effectiveness of different edu-
cational procedures, the outcomes of psychotherapy, the impact of
community programs, and the influence of noise on performance.
From the many different uses of psychological tests, it follows that some
knowledge of such tests is needed for an adequate understanding of most
fields of contemporary psychology. It is primarily with this end in view
that the present book has been prepared. The book is not designed to
make the individual either a skilled examiner and test administrator or
an expert on test construction. It is directed, not to the test specialist, but
to the general student of psychology. Some acquaintance with the
leading current tests is necessary in order to understand references to the use
of such tests in the psychological literature. And a proper evaluation and
interpretation of test results must ultimately rest on a knowledge of how
the tests were constructed, what they can be expected to accomplish, and
what are their peculiar limitations. Today a familiarity with tests is
required, not only by those who give or construct tests, but by the general
psychologist as well.
A brief overview of the historical antecedents and origins of psychologi-
cal testing will provide perspective and should aid in the understanding
of present-day tests.' The direction in which contemporary psychological
testing has been progressing can be clarified when considered in the light
of the precursors of such tests. The special limitations as well as the
advantages that characterize current tests likewise become more intel-
ligible when viewed against the background in which they originated.
The roots of testing are lost in antiquity. DuBois (1966) gives a provocative
and entertaining account of the system of civil service examinations
prevailing in the Chinese empire for some three thousand years.
Among the ancient Greeks, testing was an established adjunct to the
educational process. Tests were used to assess the mastery of physical as
well as intellectual skills. The Socratic method of teaching, with its
interweaving of testing and teaching, has much in common with today's
programmed learning. From their beginnings in the Middle Ages, European
universities relied on formal examinations in awarding degrees and
honors. To identify the major developments that shaped contemporary
testing, however, we need go no farther than the nineteenth century. It
is to these developments that we now turn.
EARLY INTEREST IN CLASSIFICATION AND
TRAINING OF THE MENTALLY RETARDED
The nineteenth century witnessed a strong awakening of interest in the
humane treatment of the mentally retarded and the insane. Prior to that
time, neglect, ridicule, and even torture had been the common lot of these
unfortunates. With the growing concern for the proper care of mental
¹ A more detailed account of the early origins of psychological tests can be found
in Goodenough (1949) and J. Peterson (1926). See also Boring (1950) and Murphy
and Kovach (1972) for more general background, DuBois (1970) for a brief but
comprehensive history of psychological testing, and Anastasi (1965) for historical
antecedents of the study of individual differences.
deviates came a realization that some uniform criteria for identifying and
classifying these cases were required. The establishment of many special
institutions for the care of the mentally retarded in both Europe and
America made the need for setting up admission standards and an ob-
jective system of classification especially urgent. First it was necessary to
differentiate between the insane and the mentally retarded. The former
manifested emotional disorders that might or might not be accompanied
by intellectual deterioration from an initially normal level; the latter were
characterized essentially by intellectual defect that had been present
from birth or early infancy. What is probably the first explicit statement
of this distinction is to be found in a two-volume work published in 1838
by the French physician Esquirol (1838), in which over one hundred
pages are devoted to mental retardation. Esquirol also pointed out that
there are many degrees of mental retardation, varying along a continuum
from normality to low-grade idiocy. In the effort to develop some system
for classifying the different degrees and varieties of retardation, Esquirol
tried several procedures but concluded that the individual's use of
language provides the most dependable criterion of his intellectual level. It
is interesting to note that current criteria of mental retardation are also
largely linguistic and that present-day intelligence tests are heavily
loaded with verbal content. The important part verbal ability plays in
our concept of intelligence will be repeatedly demonstrated in subsequent
chapters.
Of special significance are the contributions of another French
physician, Seguin, who pioneered in the training of the mentally retarded.
Having rejected the prevalent notion of the incurability of mental
retardation, Seguin (1866) experimented for many years with what he
termed the physiological method of training; and in 1837 he established
the first school devoted to the education of mentally retarded children.
In 1848 he emigrated to America, where his ideas gained wide
recognition. Many of the sense-training and muscle-training techniques
currently in use in institutions for the mentally retarded were originated by
Seguin. By these methods, severely retarded children are given intensive
exercise in sensory discrimination and in the development of motor
control. Some of the procedures developed by Seguin for this purpose were
eventually incorporated into performance or nonverbal tests of
intelligence. An example is the Seguin Form Board, in which the individual
is required to insert variously shaped blocks into the corresponding
recesses as quickly as possible.
More than half a century after the work of Esquirol and Seguin, the
French psychologist Alfred Binet urged that children who failed to
respond to normal schooling be examined before dismissal and, if con-
sidered educable, be assigned to special classes (T. H. Wolf, 1973). With
his fellow members of the Society for the Psychological Study of the
Child, Binet stimulated the Ministry of Public Instruction to take steps to
improve the condition of retarded children. A specific outcome was the
establishment of a ministerial commission for the study of retarded
children, to which Binet was appointed. This appointment was a momentous
event in the history of psychological testing, of which more will be said
later.
THE FIRST EXPERIMENTAL PSYCHOLOGISTS

The early experimental psychologists of the nineteenth century were
not, in general, concerned with the measurement of individual differences.
The principal aim of psychologists of that period was the formulation
of generalized descriptions of human behavior. It was the
uniformities rather than the differences in behavior that were the focus
of attention. Individual differences were either ignored or were accepted
as a necessary evil that limited the applicability of the generalizations.
Thus, the fact that one individual reacted differently from another when
observed under identical conditions was regarded as a form of error.
The presence of such error, or individual variability, rendered the
generalizations approximate rather than exact. This was the attitude
toward individual differences that prevailed in such laboratories as that
founded by Wundt at Leipzig in 1879, where many of the early experimental
psychologists received their training.
In their choice of topics, as in many other phases of their work, the
founders of experimental psychology reflected the influence of their
backgrounds in physiology and physics. The problems studied in their
laboratories were concerned largely with sensitivity to visual, auditory, and
other sensory stimuli and with simple reaction time. This emphasis on
sensory phenomena was in turn reflected in the nature of the first
psychological tests, as will be apparent in subsequent sections.
Still another way in which nineteenth-century experimental psychology
influenced the course of the testing movement may be noted. The early
psychological experiments brought out the need for rigorous control
of the conditions under which observations were made. For example, the
wording of directions given to the subject in a reaction-time experiment
might appreciably increase or decrease the speed of the subject's
response. Or again, the brightness or color of the surrounding field could
markedly alter the appearance of a visual stimulus. The importance of
making observations on all subjects under standardized conditions was
thus vividly demonstrated. Such standardization of procedure eventually
became one of the special earmarks of psychological tests.
CONTRIBUTIONS OF FRANCIS GALTON

It was the English biologist Sir Francis Galton who was primarily
responsible for launching the testing movement. A unifying factor in
Galton's numerous and varied research activities was his interest in
human heredity. In the course of his investigations on heredity, Galton
realized the need for measuring the characteristics of related and
unrelated persons. Only in this way could he discover, for example, the
exact degree of resemblance between parents and offspring, brothers and
sisters, cousins, or twins. With this end in view, Galton was instrumental
in inducing a number of educational institutions to keep systematic
anthropometric records on their students. He also set up an anthropometric
laboratory at the International Exposition of 1884 where, by paying
threepence, visitors could be measured in certain physical traits and
could take tests of keenness of vision and hearing, muscular strength,
reaction time, and other simple sensorimotor functions. When the exposition
closed, the laboratory was transferred to South Kensington Museum,
London, where it operated for six years. By such methods, the first
large, systematic body of data on individual differences in simple
psychological processes was gradually accumulated.
Galton himself devised most of the simple tests administered at his
anthropometric laboratory, many of which are still familiar either in their
original or in modified forms. Examples include the Galton bar for visual
discrimination of length, the Galton whistle for determining the highest
audible pitch, and graduated series of weights for measuring kinesthetic
discrimination. It was Galton's belief that tests of sensory discrimination
could serve as a means of gauging a person's intellect. In this respect, he
was partly influenced by the theories of Locke. Thus Galton wrote: "The
only information that reaches us concerning outward events appears to
pass through the avenue of our senses; and the more perceptive the senses
are of difference, the larger is the field upon which our judgment and
intelligence can act" (Galton, 1883, p. 27). Galton had also noted that
idiots tend to be defective in the ability to discriminate heat, cold, and
pain-an observation that further strengthened his conviction that sensory
discriminative capacity "would on the whole be highest among the
intellectually ablest" (Galton, 1883, p. 29).
Galton also pioneered in the application of rating-scale and questionnaire
methods, as well as in the use of the free association technique
subsequently employed for a wide variety of purposes. A further
contribution of Galton is to be found in his development of statistical methods
for the analysis of data on individual differences. Galton selected and
adapted a number of techniques previously derived by mathematicians.
These techniques he put in such form as to permit their use by the
mathematically untrained investigator who might wish to treat test
results quantitatively. He thereby extended enormously the application of
statistical procedures to the analysis of test data. This phase of Galton's
work has been carried forward by many of his students, the most eminent
of whom was Karl Pearson.
CATTELL AND THE EARLY "MENTAL TESTS"

An especially prominent position in the development of psychological
testing is occupied by the American psychologist James McKeen Cattell.
The newly established science of experimental psychology and the still
newer testing movement merged in Cattell's work. For his doctorate at
Leipzig, he completed a dissertation on individual differences in reaction
time, despite Wundt's resistance to this type of investigation. While
lecturing at Cambridge in 1888, Cattell's own interest in the measurement
of individual differences was reinforced by contact with Galton. On his
return to America, Cattell was active both in the establishment of
laboratories for experimental psychology and in the spread of the testing
movement.
In an article written by Cattell in 1890, the term "mental test" was
used for the first time in the psychological literature. This article
described a series of tests that were being administered annually to college
students in the effort to determine their intellectual level. The tests, which
had to be administered individually, included measures of muscular
strength, speed of movement, sensitivity to pain, keenness of vision and
of hearing, weight discrimination, reaction time, memory, and the like.
In his choice of tests, Cattell shared Galton's view that a measure of
intellectual functions could be obtained through tests of sensory
discrimination and reaction time. Cattell's preference for such tests was also
bolstered by the fact that simple functions could be measured with
precision and accuracy, whereas the development of objective measures
for the more complex functions seemed at that time a well-nigh hopeless
task.
Cattell's tests were typical of those to be found in a number of test series developed during the last decade of the nineteenth century. Such test series were administered to schoolchildren, college students, and miscellaneous adults. At the Columbian Exposition held in Chicago in 1893, Jastrow set up an exhibit at which visitors were invited to take tests of sensory, motor, and simple perceptual processes and to compare their skill with the norms (J. Peterson, 1926; Philippe, 1894). A few attempts to evaluate such early tests yielded very discouraging results. The individual's performance showed little correspondence from one test to another (Sharp, 1898-1899; Wissler, 1901), and it exhibited little or no relation to independent estimates of intellectual level based on teachers' ratings (Bolton, 1891-1892; J. A. Gilbert, 1894) or academic grades (Wissler, 1901).
A number of test series assembled by European psychologists of the period tended to cover somewhat more complex functions. Kraepelin (1895), who was interested primarily in the clinical examination of psychiatric patients, prepared a long series of tests to measure what he regarded as basic factors in the characterization of an individual. The tests, employing chiefly simple arithmetic operations, were designed to measure practice effects, memory, and susceptibility to fatigue and to distraction. A few years earlier, Oehrn (1889), a pupil of Kraepelin, had employed tests of perception, memory, association, and motor functions in an investigation on the interrelations of psychological functions. Another German psychologist, Ebbinghaus (1897), administered tests of arithmetic computation, memory span, and sentence completion to schoolchildren. The most complex of the three tests, sentence completion, was the only one that showed a clear correspondence with the children's scholastic achievement.
Like Kraepelin, the Italian psychologist Ferrari and his students were interested primarily in the use of tests with pathological cases (Guicciardi & Ferrari, 1896). The test series they devised ranged from physiological measures and motor tests to apprehension span and the interpretation of pictures. In an article published in France in 1895, Binet and Henri criticized most of the available test series as being too largely sensory and as concentrating unduly on simple, specialized abilities. They argued further that, in the measurement of the more complex functions, great precision is not necessary, since individual differences are larger in these functions. An extensive and varied list of tests was proposed, covering such functions as memory, imagination, attention, comprehension, suggestibility, aesthetic appreciation, and many others. In these tests we can recognize the trends that were eventually to lead to the development of the famous Binet intelligence scales.
Binet and his co-workers devoted many years to active and ingenious research on ways of measuring intelligence. Many approaches were tried, including even the measurement of cranial, facial, and hand form, and the analysis of handwriting. The results, however, led to a growing conviction that the direct, even though crude, measurement of complex intellectual functions offered the greatest promise. Then a specific situation arose that brought Binet's efforts to immediate practical fruition. In 1904, the Minister of Public Instruction appointed Binet to the previously cited commission to study procedures for the education of retarded children. It was in connection with the objectives of this commission that Binet, in collaboration with Simon, prepared the first Binet-Simon Scale (Binet & Simon, 1905).

This scale, known as the 1905 scale, consisted of 30 problems or tests arranged in ascending order of difficulty. The difficulty level was determined empirically by administering the tests to 50 normal children aged 3 to 11 years, and to some mentally retarded children and adults. The tests were designed to cover a wide variety of functions, with special emphasis on judgment, comprehension, and reasoning, which Binet regarded as essential components of intelligence. Although sensory and perceptual tests were included, a much greater proportion of verbal content was found in this scale than in most test series of the time. The 1905 scale was presented as a preliminary and tentative instrument, and no precise objective method for arriving at a total score was formulated.

In the second, or 1908, scale, the number of tests was increased, some unsatisfactory tests from the earlier scale were eliminated, and all tests were grouped into age levels on the basis of the performance of about 300 normal children between the ages of 3 and 13 years. Thus, in the 3-year level were placed all tests passed by 80 to 90 percent of normal 3-year-olds; in the 4-year level, all tests similarly passed by normal 4-year-olds; and so on to age 13. The child's score on the entire test could then be expressed as a mental level corresponding to the age of normal children whose performance he equaled. In the various translations and adaptations of the Binet scales, the term "mental age" was commonly substituted for "mental level." Since mental age is such a simple concept to grasp, the introduction of this term undoubtedly did much to popularize intelligence testing.* Binet himself, however, avoided the term "mental age" because of its unverified developmental implications and preferred the more neutral term "mental level" (T. H. Wolf, 1973).

* Goodenough (1949, pp. 50-51) notes that in 1887, 21 years before the appearance of the 1908 Binet-Simon Scale, S. E. Chaille published in the New Orleans Medical and Surgical Journal a series of tests for infants, arranged according to the age at which the tests are commonly passed. Partly because of the limited circulation of the journal and partly, perhaps, because the scientific community was not ready for it, the significance of this age-scale concept passed unnoticed at the time. Binet's own scale was influenced by the work of some of his contemporaries, notably Blin and Damaye, who prepared a set of oral questions from which they derived a single global score for each child (T. H. Wolf, 1973).

A third revision of the Binet-Simon Scale appeared in 1911, the year of Binet's untimely death. In this scale, no fundamental changes were introduced. Minor revisions and relocations of specific tests were instituted. More tests were added at several year levels, and the scale was extended to the adult level.

Even prior to the 1908 revision, the Binet-Simon tests attracted wide
attention among psychologists throughout the world. Translations and adaptations appeared in many languages. In America, a number of different revisions were prepared, the most famous of which is the one developed under the direction of L. M. Terman at Stanford University, and known as the Stanford-Binet (Terman, 1916). It was in this test that the intelligence quotient (IQ), or ratio between mental age and chronological age, was first used. The latest revision of this test is widely employed today and will be more fully considered in Chapter 9. Of special interest, too, is the first Kuhlmann-Binet revision, which extended the scale downward to the age level of 3 months (Kuhlmann, 1912). This scale represents one of the earliest efforts to develop preschool and infant tests of intelligence.
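The intelligence quotient just described is simple arithmetic: mental age divided by chronological age, times 100. A minimal sketch of the classical ratio follows; the function name and the sample ages are hypothetical, and modern tests report deviation IQs rather than this ratio.

```python
def ratio_iq(mental_age_months, chronological_age_months):
    """Classical ratio IQ: (mental age / chronological age) * 100.

    Both ages must be in the same unit (months here) so the ratio
    is unit-free.
    """
    return 100.0 * mental_age_months / chronological_age_months

# A child of 8 years (96 months) performing at the 10-year level (120 months):
print(ratio_iq(120, 96))  # -> 125.0
```

A child whose mental age equals his chronological age thus obtains an IQ of exactly 100, the definitional average.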
The Binet tests, as well as all their revisions, are individual scales in the sense that they can be administered to only one person at a time. Many of the tests in these scales require oral responses from the subject or necessitate the manipulation of materials. Some call for individual timing of responses. For these and other reasons, such tests are not adapted to group administration. Another characteristic of the Binet type of test is that it requires a highly trained examiner. Such tests are essentially clinical instruments, suited to the intensive study of individual cases.

Group testing, like the first Binet scale, was developed to meet a pressing practical need. When the United States entered World War I in 1917, a committee was appointed by the American Psychological Association to consider ways in which psychology might assist in the conduct of the war. This committee, under the direction of Robert M. Yerkes, recognized the need for the rapid classification of the million and a half recruits with respect to general intellectual level. Such information was relevant to many administrative decisions, including rejection or discharge from military service, assignment to different types of service, or admission to officer-training camps. It was in this setting that the first group intelligence test was developed. In this task, the Army psychologists drew on all available test materials, and especially on an unpublished group intelligence test prepared by Arthur S. Otis, which he turned over to the Army. A major contribution of Otis's test, which he designed while a student in one of Terman's graduate courses, was the introduction of multiple-choice and other "objective" item types.

The tests finally developed by the Army psychologists came to be known as the Army Alpha and the Army Beta. The former was designed for general routine testing; the latter was a nonlanguage scale employed with illiterates and with foreign-born recruits who were unable to take a test in English. Both tests were suitable for administration to large groups.

Shortly after the termination of World War I, the Army tests were released for civilian use. Not only did the Army Alpha and Army Beta themselves pass through many revisions, the latest of which are even now in use, but they also served as models for most group intelligence tests. The testing movement underwent a tremendous spurt of growth. Soon group intelligence tests were being devised for all ages and types of persons, from preschool children to graduate students. Large-scale testing programs, previously impossible, were now being launched with zestful optimism. Because group tests were designed as mass testing instruments, they not only permitted the simultaneous examination of large groups but also simplified the instructions and administration procedures so as to demand a minimum of training on the part of the examiner. Schoolteachers began to give intelligence tests to their classes. College students were routinely examined prior to admission. Extensive studies of special adult groups, such as prisoners, were undertaken. And soon the general public became IQ-conscious.

The application of such group intelligence tests far outran their technical improvement. That the tests were still crude instruments was often forgotten in the rush of gathering scores and drawing practical conclusions from the results. When the tests failed to meet unwarranted expectations, skepticism and hostility toward all testing often resulted. Thus, the testing boom of the twenties, based on the indiscriminate use of tests, may have done as much to retard as to advance the progress of psychological testing.
Although intelligence tests were originally designed to sample a wide variety of functions in order to estimate the individual's general intellectual level, it soon became apparent that such tests were quite limited in their coverage. Not all important functions were represented. In fact, most intelligence tests were primarily measures of verbal ability and, to a lesser extent, of the ability to handle numerical and other abstract and symbolic relations. Gradually psychologists came to recognize that the term "intelligence test" was a misnomer, since only certain aspects of intelligence were measured by such tests.
To be sure, the tests covered abilities that are of prime importance in our culture. But it was realized that more precise designations, in terms of the type of information these tests are able to yield, would be preferable. For example, a number of tests that would probably have been called intelligence tests during the twenties later came to be known as scholastic aptitude tests. This shift in terminology was made in recognition of the fact that many so-called intelligence tests measure that combination of abilities demanded by academic work.
Even prior to World War I, psychologists had begun to recognize the need for tests of special aptitudes to supplement the global intelligence tests. These special aptitude tests were developed particularly for use in vocational counseling and in the selection and classification of industrial and military personnel. Among the most widely used are tests of mechanical, clerical, musical, and artistic aptitudes.

The critical evaluation of intelligence tests that followed their widespread and indiscriminate use during the twenties also revealed another noteworthy fact: an individual's performance on different parts of such a test often showed marked variation. This was especially apparent on group tests, in which the items are commonly segregated into subtests of relatively homogeneous content. For example, a person might score relatively high on a verbal subtest and low on a numerical subtest, or vice versa. To some extent, such internal variability is also discernible on a test like the Stanford-Binet, in which, for example, all items involving words might prove difficult for a particular individual, whereas items employing pictures or geometric diagrams may place him at an advantage.

Test users, and especially clinicians, frequently utilized such intercomparisons in order to obtain more insight into the individual's psychological make-up. Thus, not only the IQ or other global score but also scores on subtests would be examined in the evaluation of the individual case. Such a practice is not to be generally recommended, however, because intelligence tests were not designed for the purpose of differential aptitude analysis. Often the subtests being compared contain too few items to yield a stable or reliable estimate of a specific ability. As a result, the obtained difference between subtest scores might be reversed if the individual were retested on a different day or with another form of the same test. If such intraindividual comparisons are to be made, tests are needed that are specially designed to reveal differences in performance in various functions.

While the practical application of tests demonstrated the need for differential aptitude tests, a parallel development in the study of trait organization was gradually providing the means for constructing such tests. Statistical studies on the nature of intelligence had been exploring the interrelations among scores obtained by many persons on a wide variety of different tests. Such investigations were begun by the English psychologist Charles Spearman (1904, 1927) during the first decade of the
present century. Subsequent methodological developments, based on the work of such American psychologists as T. L. Kelley (1928) and L. L. Thurstone (1935, 1947), as well as on that of other American and English investigators, have come to be known as "factor analysis."
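Factor analysis starts from exactly the kind of data Spearman examined: the intercorrelations among scores obtained by the same persons on many tests. A minimal sketch of computing such a correlation table is given below; the test names and all scores are invented for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

scores = {  # invented scores of five persons on three hypothetical tests
    "vocabulary": [10, 12, 14, 9, 15],
    "analogies":  [11, 13, 15, 8, 14],
    "digit span": [7, 9, 8, 6, 10],
}
names = list(scores)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        print(f"r({a}, {b}) = {pearson_r(scores[a], scores[b]):.2f}")
```

In practice one would then factor-analyze the resulting matrix to extract the relatively independent traits the text describes; here only the correlations themselves are computed.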
The contributions that the methods of factor analysis have made to test construction will be more fully examined and illustrated in Chapter 13. For the present, it will suffice to note that the data gathered by such procedures have indicated the presence of a number of relatively independent factors, or traits. Some of these traits were represented, in varying proportions, in the traditional intelligence tests. Verbal comprehension and numerical reasoning are examples of this type of trait. Others, such as spatial, perceptual, and mechanical aptitudes, were found more often in special aptitude tests than in intelligence tests.

One of the chief practical outcomes of factor analysis was the development of multiple aptitude batteries. These batteries are designed to provide a measure of the individual's standing in each of a number of traits. In place of a total score or IQ, a separate score is obtained for such traits as verbal comprehension, numerical aptitude, spatial visualization, arithmetic reasoning, and perceptual speed. Such batteries thus provide a suitable instrument for making the kind of intraindividual analysis, or differential diagnosis, that clinicians had been trying for many years to obtain, with crude and often erroneous results, from intelligence tests. These batteries also incorporate into a comprehensive and systematic testing program much of the information formerly obtained from special aptitude tests, since the multiple aptitude batteries cover some of the traits not ordinarily included in intelligence tests.
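The contrast between a global IQ and a battery profile can be sketched directly: each subtest raw score is expressed relative to its own normative mean and standard deviation, yielding one standardized score per trait rather than a single composite. The trait names follow the text, but the norms and the examinee's scores below are invented for illustration.

```python
# Normative means and standard deviations per trait (invented values).
norms = {
    "verbal comprehension":  (50.0, 10.0),
    "numerical aptitude":    (40.0, 8.0),
    "spatial visualization": (30.0, 6.0),
}

def profile(raw_scores):
    """Return a trait -> z-score profile for one examinee,
    standardizing each raw score against that trait's own norms."""
    return {trait: (raw_scores[trait] - mean) / sd
            for trait, (mean, sd) in norms.items()}

examinee = {"verbal comprehension": 60.0,
            "numerical aptitude": 36.0,
            "spatial visualization": 33.0}
for trait, z in profile(examinee).items():
    print(f"{trait}: z = {z:+.2f}")
```

The printed profile makes the intraindividual variation visible at a glance, which is precisely what a single global score conceals.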
Multiple aptitude batteries represent a relatively late development in the testing field. Nearly all have appeared since 1945. In this connection, the work of the military psychologists during World War II should also be noted. Much of the test research conducted in the armed services was based on factor analysis and was directed toward the construction of multiple aptitude batteries. In the Air Force, for example, special batteries were constructed for pilots, bombardiers, radio operators, range finders, and scores of other military specialists. A report of the batteries prepared in the Air Force alone occupies at least nine of the nineteen volumes devoted to the aviation psychology program during World War II (Army Air Forces, 1947-1948). Research along these lines is still in progress under the sponsorship of various branches of the armed services. A number of multiple aptitude batteries have likewise been developed for civilian use and are being widely applied in educational and vocational counseling and in personnel selection and classification. Examples of such batteries will be discussed in Chapter 13.

To avoid confusion, a point of terminology should be clarified. The term "aptitude test" has been traditionally employed to refer to tests measuring relatively homogeneous and clearly defined segments of ability, whereas the term "intelligence test" customarily refers to more heterogeneous tests yielding a single global score such as an IQ. Special aptitude tests typically measure a single aptitude. Multiple aptitude batteries measure a number of aptitudes but provide a profile of scores, one for each aptitude.
While psychologists were busy developing intelligence and aptitude tests, traditional school examinations were undergoing a number of technical improvements (Caldwell & Courtis, 1923; Ebel & Damrin, 1960). An important step in this direction was taken by the Boston public schools in 1845, when written examinations were substituted for the oral interrogation of students by visiting examiners. Commenting on this innovation, Horace Mann cited arguments remarkably similar to those used much later to justify the replacement of essay questions by objective multiple-choice items. The written examinations, Mann noted, put all students in a uniform situation, permitted a wider coverage of content, reduced the chance element in question choice, and eliminated the possibility of favoritism on the examiner's part.

After the turn of the century, the first standardized tests for measuring the outcomes of school instruction began to appear. Spearheaded by the work of E. L. Thorndike, these tests utilized measurement principles developed in the psychological laboratory. Examples include scales for rating the quality of handwriting and written compositions, as well as tests in spelling, arithmetic computation, and arithmetic reasoning. Still later came the achievement batteries, initiated by the publication of the first edition of the Stanford Achievement Test in 1923. Its authors were three early leaders in test development: Truman L. Kelley, Giles M. Ruch, and Lewis M. Terman. Foreshadowing many characteristics of modern testing, this battery provided comparable measures of performance in different school subjects, evaluated in terms of a single normative group.

At the same time, evidence was accumulating regarding the lack of agreement among teachers in grading essay tests. By 1930 it was widely recognized that essay tests were not only more time-consuming for examiners and examinees, but also yielded less reliable results than the "new type" of objective items. As the latter came into increasing use in standardized achievement tests, there was a growing emphasis on the design of items to test the understanding and application of knowledge and other broad educational objectives. The decade of the 1930s also witnessed the introduction of test-scoring machines, for which the new objective tests could be readily adapted.

The establishment of statewide, regional, and national testing programs was another noteworthy parallel development. Probably the best known of these programs is that of the College Entrance Examination Board (CEEB). Established at the turn of the century to reduce duplication in the examining of entering college freshmen, this program has undergone profound changes in its testing procedures and in the number and nature of participating colleges, changes that reflect intervening developments in both testing and education. In 1947, the testing functions of the CEEB were merged with those of the Carnegie Corporation and the American Council on Education to form Educational Testing Service (ETS). In subsequent years, ETS has assumed responsibility for a growing number of testing programs on behalf of universities, professional schools, government agencies, and other institutions. Mention should also be made of the American College Testing Program, established in 1959 to screen applicants to colleges not included in the CEEB program, and of several national testing programs for the selection of highly talented students for scholarship awards.

Achievement tests are used not only for educational purposes but also in the selection of applicants for industrial and government jobs. Mention has already been made of the systematic use of civil service examinations in the Chinese empire, dating from 1115 B.C. In modern times, selection of government employees by examination was introduced in European countries in the late eighteenth and early nineteenth centuries. The United States Civil Service Commission installed competitive examinations as a regular procedure in 1883 (Kavruck, 1956). Test construction techniques developed during and prior to World War I were introduced into the examination program of the United States Civil Service with the appointment of L. J. O'Rourke as director of the newly established research division in 1922.

As more and more psychologists trained in psychometrics participated in the construction of standardized achievement tests, the technical aspects of achievement tests increasingly came to resemble those of intelligence and aptitude tests. Procedures for constructing and evaluating all these tests have much in common. The increasing efforts to prepare achievement tests that would measure the attainment of broad educational goals, as contrasted to the recall of factual minutiae, also made the content of achievement tests resemble more closely that of intelligence tests. Today the difference between these two types of tests is chiefly one of degree of specificity of content and extent to which the test presupposes a designated course of prior instruction.
Another area of psychological testing is concerned with the affective or nonintellectual aspects of behavior. Tests designed for this purpose are commonly known as personality tests, although some psychologists prefer to use the term personality in a broader sense, to refer to the entire individual. Intellectual as well as nonintellectual traits would thus be included under this heading. In the terminology of psychological testing, however, the designation "personality test" most often refers to measures of such characteristics as emotional adjustment, interpersonal relations, motivation, interests, and attitudes.

An early precursor of personality testing may be recognized in Kraepelin's use of the free association test with abnormal patients. In this test, the subject is given specially selected stimulus words and is required to respond to each with the first word that comes to mind. Kraepelin (1892) also employed this technique to study the psychological effects of fatigue, hunger, and drugs and concluded that all these agents increase the relative frequency of superficial associations. Sommer (1894), also writing during the last decade of the nineteenth century, suggested that the free association test might be used to differentiate between the various forms of mental disorder. The free association technique has subsequently been utilized for a variety of testing purposes and is still currently employed. Mention should also be made of the work of Galton, Pearson, and Cattell in the development of standardized questionnaire and rating-scale techniques. Although originally devised for other purposes, these procedures were eventually employed by others in constructing some of the most common types of current personality tests.

The prototype of the personality questionnaire, or self-report inventory, is the Personal Data Sheet developed by Woodworth during World War I (DuBois, 1970; Symonds, 1931, ch. 5; Goldberg, 1971). This test was designed as a rough screening device for identifying seriously neurotic men who would be unfit for military service. The inventory consisted of a number of questions dealing with common neurotic symptoms, which the individual answered about himself. A total score was obtained by counting the number of symptoms reported. The Personal Data Sheet was not completed early enough to permit its operational use before the war ended. Immediately after the war, however, civilian forms were prepared, including a special form for use with children. The Woodworth Personal Data Sheet, moreover, served as a model for most subsequent emotional adjustment inventories. In some of these questionnaires, an attempt was made to subdivide emotional adjustment into more specific forms, such as home adjustment, school adjustment, and vocational adjustment. Other tests concentrated more intensively on a narrower area of behavior or were concerned with more distinctly social responses, such as dominance-submission in interpersonal contacts. A later development was the construction of tests for quantifying the expression of interests and attitudes. These tests, too, were based essentially on questionnaire techniques.

Another approach to the measurement of personality is through the application of performance or situational tests. In such tests, the subject has a task to perform whose purpose is often disguised. Most of these tests simulate everyday-life situations quite closely. The first extensive application of such techniques is to be found in the tests developed in the late twenties and early thirties by Hartshorne, May, and their associates (1928, 1929, 1930). This series, standardized on schoolchildren, was concerned with such behavior as cheating, lying, stealing, cooperativeness, and persistence. Objective, quantitative scores could be obtained on each of a large number of specific tests. A more recent illustration, for the adult level, is provided by the series of situational tests developed during World War II in the Assessment Program of the Office of Strategic Services (OSS, 1948). These tests were concerned with relatively complex and subtle social and emotional behavior and required rather elaborate facilities and trained personnel for their administration. The interpretation of the subject's responses, moreover, was relatively subjective.

Projective techniques represent a third approach to the study of personality and one that has shown phenomenal growth, especially among clinicians. In such tests, the subject is given a relatively unstructured task that permits wide latitude in its solution. The assumption underlying such methods is that the individual will project his characteristic modes of response into such a task. Like the performance and situational tests, projective techniques are more or less disguised in their purpose, thereby reducing the chances that the subject can deliberately create a desired impression. The previously cited free association test represents one of the earliest types of projective techniques. Sentence-completion tests have also been used in this manner. Other tasks commonly employed in projective techniques include drawing, arranging toys to create a scene, extemporaneous dramatic play, and interpreting pictures or inkblots.

All available types of personality tests present serious difficulties, both practical and theoretical. Each approach has its own special advantages and disadvantages. On the whole, personality testing has lagged far behind aptitude testing in its positive accomplishments. But such lack of progress is not to be attributed to insufficient effort. Research on the measurement of personality has attained impressive proportions since 1920, and many ingenious devices and technical improvements are under investigation. It is rather the special difficulties encountered in the measurement of personality that account for the slow advances in this area.
instruments can be found in A Sourcebook for Mental Health Measures (Comrey, Backer, & Glaser, 1973). Containing approximately 1,100 abstracts, this sourcebook includes tests, questionnaires, rating scales, and other devices for assessing both aptitude and personality variables in adults and children. Another similar reference is entitled Measures for Psychological Assessment (Chun, Cobb, & French, 1975). For each of 3,000 measures, this volume gives the original source as well as an annotated bibliography of the studies in which the measure was subsequently used. The entries were located through a search of 26 measurement-related journals for the years 1960 to 1970.
Information on assessment devices suitable for children from birth to 12 years is summarized in Tests and Measurements in Child Development: A Handbook (Johnson & Bommarito, 1971). Covering only tests not listed in the MMY, this handbook describes instruments located through an intensive journal search spanning a ten-year period. Selection criteria included availability of the test to professionals, adequate instructions for administration and scoring, sufficient length, and convenience of use (i.e., not requiring expensive or elaborate equipment). A still more specialized collection covers measures of social and emotional development applicable to children between the ages of 3 and 6 years (Walker, 1973).
Finanv, it should be noted that the most direct source of information
regardiI;!!: specific curr~ltksts is pro\'ided h~' the catalo~t1cs of tcst pub-
lIshers and b~' tht· mannal that accompani0s ('ach test. A comprehensive
list of test publishers, \\'ith addresses, can be found in the lates't Mell/al
M el/S/lTcmcnfs rearl)()ok~ For reach' reference, the namt's and nddrt'sses
of some of the largt'r .-\merican p'uhlishers and distributors of psycho-
logical tests are gi\'en in AppendiX D. Cltalog\1('s of current tests can be
obtained from each of these publishers on requcst. :\lanuals and speci-
men sets of tests can be purchased hy qualified users.
The test manual should provide the ('ssential infurmation required for
administering, scoring. and evaluating a particular test. In it should be
found full and detailed instructions, scoring key, norms, and data on re-
Iiahilit~, and validity. :\fo!'E'over, the manual should report the number
and nature of subjects on whom lIonns, reliahilit~·. and validity were
est~b~ished, the methods employed in computing indices of reliability and
valIdity, and the specific criteria against which validity was checked. In
~he e\'ent that the necessary information is too lengthy to fit conveniently
mto the manual, references to the printed sour<.:esin which such infor-
mation can be readily located should be given. The manual should, in
other. words, enable the test user to evaluate the ·test before choosing it
for IllS specific purpose. It might be added that ma~y test manuals still
fa!1 short of this goal. But some of the larger ancl more professionally
onented test publishers are giving increasillg attention to the preparation
Psychological testing is in a state of rapid chan~e. There are shifting
oriel;tations, a constant stream of new tests, revisc>dforms of old tests, and
additional data that mav refine or alter the interpretation of scores on
existing tests. The accelerating rate of <:hange, together with ~he vast
number uf available tests, makes it impracticable to sun'ey speCific tests
in any single text. \lore intensive coverage of testing instruments and
problems in special areas can be found in books dealing with the us~ of
tests in such fields as counseling. clinical practice, personnel selection,
and education. References to such publications are given in the appropri-
ate chapters of this book. In order to keep abreast of current develop-
ments, however, anyone working with tests needs to be familiar with
IlUoredirect sources of contemporary information about tests.
One of the most important sources is the series of Mental !Ifeasurements
)'eaTbooks (MMY) edited hy Buros (19i2). Th('sc yearbooks cover nearly
all commercially available psychological, educational, and vocational tests
published in English. The coverage is especially .complete .for paper-~nd-
pencil tests. Eaeh yearbook includes tests publIshed dunng a speCified
period, thus supplementing rather than supplanting the earlier yearbooks.
The Ser,enth Mental Measurements rear7JOok, for example, is concernedprincipally with tests appearing bet\\'een 1964 and 1~70. Tests. of con-
tinuing interest, however, may be reviewed r~peat('dly m StH.·cesSlyey~ar-
hooks, as nt'w data accumulate from pertment research. The earhest
publications in this series were merely bi~)liographies of tests: B~ginning
in ]9,38,however, the ),earbook assumed Its ('UlTt'I\t form, wlll(:h llldudes
critical reviews of most of the tests by one or more test experts, as well
as a complete list of published references pertailling to each lest. .Routine
information regarding poblisher, -price, forms, and age of subjects for
whom the test is suitable is also regularly giv('n.A comprehensive bibliography covering all types of published tests
available in English-speaking countries is provided by Te:~ts in Print(Buras, 1974). Two related sources are Reading Tests and Reviett;~
(Bums, 1968) and Personality Tests and Reviews (Buras, 11970). Both
include a numbeF'~9f tests not found in any volume of the MMY, as well
as master indexes'that facilitate the location of tests in the :\1\1Y. Reviews
of specific tests are also published in several Ilsychological and educa-
tional journals, such as the Journal of Educational Measurement and the
JOllrnal of Counseling Psyc1101ogy.Since I9iO several sourcebooks have appeared which provide informa-
tion about u~published or little known instruments, largely supplement-
ing the material listed in the MMY. A comprehensive survey of such
22 Context of Psyc11010gical Testing
ofmanuals that meet adequate scientific standards. An enlightened PU?-lie of test users provides the firmest assurance that such standal'ds wIll
be maintained and improved in the future.. .A succinct but comprehensive guide for the evaluatwn of psy~hologlcal
tests is to be found in Standards for Educational arul Psyc11010glCal Tests
(1974), published by the American Psychological As~ocia~ion. These
standards represent a summary of recommended practices 111 test con-
struction based on the current state of knowledge in the field. They are
concerned with the information about validity, reliability, norms, and
other test characteristics that ought to be reported in the manual. In their
latest revision, the Standards also provide a guide for the proper use of
tests and for the correct interpretation and application of test results.
Relevant portions of the StQnda~ds "ill.be cited in the following chapters,
in connection with the appropnate tOpICS.
CHAPTER 2

Nature and Use of
Psychological Tests
THE HISTORICAL introduction in Chapter 1 has already suggested
some of the many uses of psychological tests, as well as the wide
diversity of available tests. Although the general public may still
associate psychological tests most closely with "IQ tests" and with tests
designed to detect emotional disorders, these tests represent only a small
proportion of the available types of instruments. The major categories of
psychological tests will be discussed and illustrated in Parts 3, 4, and 5,
which cover tests of general intellectual level, traditionally called intelli-
gence tests; tests of separate abilities, including multiple aptitude bat-
teries, tests of special aptitudes, and achievement tests; and personality
tests, concerned with measures of emotional and motivational traits, in-
terpersonal behavior, interests, attitudes, and other noncognitive char-
acteristics.
In the face of such diversity in nature and purpose, what are the
common differentiating characteristics of psychological tests? How do
psychological tests differ from other methods of gathering information
about individuals? The answer is to be found in certain fundamental
features of both the construction and use of tests. It is with these features
that the present chapter is concerned.
BEHAVIOR SAMPLE. A psychological test is essentially an objective
and standardized measure of a sample of behavior. Psychological tests
are like tests in any other science, insofar as observations are made on a
small but carefully chosen sample of an individual's behavior. In this
respect, the psychologist proceeds in much the same way as the chemist
who tests a patient's blood or a community's water supply by analyzing
one or more samples of it. If the psychologist wishes to test the extent
of a child's vocabulary, a clerk's ability to perform arithmetic computa-
tions, or a pilot's eye-hand coordination, he examines their performance
with a representative set of words, arithmetic problems, or motor tests.
Whether or not the test adequately covers the behavior under con-
sideration obviously depends on the number and nature of items in the
sample. For example, an arithmetic test consisting of only five problems,
or one including only multiplication items, would be a poor measure of
the individual's computational skill. A vocabulary test composed entirely
of baseball terms would hardly provide a dependable estimate of a
child's total range of vocabulary.
The diagnostic or predictive value of a psychological test depends on
the degree to which it serves as an indicator of a relatively broad and
significant area of behavior. Measurement of the behavior sample directly
covered by the test is rarely, if ever, the goal of psychological testing.
The child's knowledge of a particular list of 50 words is not, in itself, of
great interest. Nor is the job applicant's performance on a specific set
of 20 arithmetic problems of much importance. If, however, it can be
demonstrated that there is a close correspondence between the child's
knowledge of the word list and his total mastery of vocabulary, or be-
tween the applicant's score on the arithmetic problems and his computa-
tional performance on the job, then the tests are serving their purpose.
It should be noted in this connection that the test items need not
resemble closely the behavior the test is to predict. It is only necessary
that an empirical correspondence be demonstrated between the two. The
degree of similarity between the test sample and the predicted behavior
may vary widely. At one extreme, the test may coincide completely with
a part of the behavior to be predicted. An example might be a foreign
vocabulary test in which the students are examined on 20 of the 50 new
words they have studied; another example is provided by the road test
taken prior to obtaining a driver's license. A lesser degree of similarity is
illustrated by many vocational aptitude tests administered prior to job
training, in which there is only a moderate resemblance between the
tasks performed on the job and those incorporated in the test. At the
other extreme one finds projective personality tests, such as the Rorschach
inkblot test, in which an attempt is made to predict from the subject's
associations to inkblots how he will react to other people, to emotionally
toned stimuli, and to other complex, everyday-life situations. Despite
their superficial differences, all these tests consist of samples of the indi-
vidual's behavior. And each must prove its worth by an empirically
demonstrated correspondence between the subject's performance on the
test and in other situations.
Whether the term "diagnosis" or the term "prediction" is employed in
this connection also represents a minor distinction. Prediction commonly
connotes a temporal estimate, the individual's future performance on a
job, for example, being forecast from his present test performance. In a
broader sense, however, even the diagnosis of present condition, such as
mental retardation or emotional disorder, implies a prediction of what
the individual will do in situations other than the present test. It is
logically simpler to consider all tests as behavior samples from which
predictions regarding other behavior can be made. Different types of
tests can then be characterized as variants of this basic pattern.
Another point that should be considered at the outset pertains to the
concept of capacity. It is entirely possible, for example, to devise a test
for predicting how well an individual can learn French before he has
even begun the study of French. Such a test would involve a sample of
the types of behavior required to learn the new language, but would in
itself presuppose no knowledge of French. It could then be said that
this test measures the individual's "capacity" or "potentiality" for learn-
ing French. Such terms should, however, be used with caution in refer-
ence to psychological tests. Only in the sense that a present behavior
sample can be used as an indicator of other, future behavior can we
speak of a test measuring "capacity." No psychological test can do more
than measure behavior. Whether such behavior can serve as an effective
index of other behavior can be determined only by empirical try-out.
STANDARDIZATION. It will be recalled that in the initial definition a psy-
chological test was described as a standardized measure. Standardization
implies uniformity of procedure in administering and scoring the test. If
the scores obtained by different individuals are to be comparable, testing
conditions must obviously be the same for all. Such a requirement is only
a special application of the need for controlled conditions in all scientific
observations. In a test situation, the single independent variable is
usually the individual being tested.
In order to secure uniformity of testing conditions, the test constructor
provides detailed directions for administering each newly developed test.
The formulation of such directions is a major part of the standardization
of a new test. Such standardization extends to the exact materials em-
ployed, time limits, oral instructions to subjects, preliminary demonstra-
tions, ways of handling queries from subjects, and every other detail of
the testing situation. Many other, more subtle factors may influence the
subject's performance on certain tests. Thus, in giving instructions or
presenting problems orally, consideration must be given to the rate of
speaking, tone of voice, inflection, pauses, and facial expression. In a
test involving the detection of absurdities, for example, the correct an-
swer may be given away by smiling or pausing when the crucial word
is read. Standardized testing procedure, from the examiner's point of
view, will be discussed further in a later section of this chapter dealing
with problems of test administration.
Another important step in the standardization of a test is the establish-
ment of norms. Psychological tests have no predetermined standards of
passing or failing; an individual's score is evaluated by comparing it with
the scores obtained by others. As its name implies, a norm is the normal
or average performance. Thus, if normal 8-year-old children complete
12 out of 50 problems correctly on a particular arithmetic reasoning test,
then the 8-year-old norm on this test corresponds to a score of 12. The
latter is known as the raw score on the test. It may be expressed as
number of correct items, time required to complete a task, number of
errors, or some other objective measure appropriate to the content of the
test. Such a raw score is meaningless until evaluated in terms of a suitable
set of norms.
In the process of standardizing a test, it is administered to a large,
representative sample of the type of subjects for whom it is designed.
This group, known as the standardization sample, serves to establish the
norms. Such norms indicate not only the average performance but also
the relative frequency of varying degrees of deviation above and below
the average. It is thus possible to evaluate different degrees of superiority
and inferiority. The specific ways in which such norms may be expressed
will be considered in Chapter 4. All permit the designation of the indi-
vidual's position with reference to the normative or standardization
sample.
It might also be noted that norms are established for personality tests
in essentially the same way as for aptitude tests. The norm on a person-
ality test is not necessarily the most desirable or "ideal" performance,
any more than a perfect or errorless score is the norm on an aptitude
test. On both types of tests, the norm corresponds to the performance of
typical or average individuals. On dominance-submission tests, for ex-
ample, the norm falls at an intermediate point representing the degree
of dominance or submission manifested by the average individual.
Similarly, in an emotional adjustment inventory, the norm does not
ordinarily correspond to a complete absence of unfavorable or mal-
adaptive responses, since a few such responses occur in the majority of
"normal" individuals in the standardization sample. It is thus apparent
that psychological tests, of whatever type, are based on empirically
established norms.
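The logic of evaluating a raw score against a standardization sample can be sketched in a few lines of code. All scores below are invented for illustration; the function simply locates a raw score within a hypothetical standardization sample, in the spirit of the percentile norms taken up in Chapter 4.

```python
# Sketch: locating a raw score within a standardization sample.
# The sample scores below are invented for illustration only.
standardization_sample = [5, 7, 8, 9, 10, 11, 12, 12, 13, 14, 15, 16, 18, 20, 22]

def percentile_rank(raw_score, sample):
    """Percentage of the standardization sample scoring below raw_score."""
    below = sum(1 for s in sample if s < raw_score)
    return 100.0 * below / len(sample)

# A raw score of 12 is meaningless in itself; compared with the sample,
# it falls near the middle of the distribution, while 20 falls well above.
middle = percentile_rank(12, standardization_sample)
high = percentile_rank(20, standardization_sample)
```

The same raw score would earn a very different percentile rank against a different standardization sample, which is why the composition of that sample matters so much.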
OBJECTIVE MEASUREMENT OF DIFFICULTY. Reference to the definition
of a psychological test with which this discussion opened will show that
such a test was characterized as an objective as well as a standardized
measure. In what specific ways are such tests objective? Some aspects of
the objectivity of psychological tests have already been touched on in
the discussion of standardization. Thus, the administration, scoring, and
interpretation of scores are objective insofar as they are independent of
the subjective judgment of the individual examiner. Any one individual
should theoretically obtain the identical score on a test regardless of who
happens to be his examiner. This is not entirely so, of course, since per-
fect standardization and objectivity have not been attained in practice.
But at least such objectivity is the goal of test construction and has been
achieved to a reasonably high degree in most tests.

There are other major ways in which psychological tests can be prop-
erly described as objective. The determination of the difficulty level of an
item or of a whole test is based on objective, empirical procedures. When
Binet and Simon prepared their original, 1905 scale for the measurement
of intelligence, they arranged the 30 items of the scale in order of in-
creasing difficulty. Such difficulty, it will be recalled, was determined by
trying out the items on 50 normal and a few mentally retarded children.
The items correctly solved by the largest number of children were, ipso
facto, taken to be the easiest; those passed by relatively few children were
regarded as more difficult items. By this procedure, an empirical order
of difficulty was established. This early example typifies the objective
measurement of difficulty level, which is now common practice in psycho-
logical test construction.

Not only the arrangement but also the selection of items for inclusion
in a test can be determined by the proportion of subjects in the trial
samples who pass each item. Thus, if there is a bunching of items at the
easy or difficult end of the scale, some items can be discarded. Similarly,
if items are sparse in certain portions of the difficulty range, new items
can be added to fill the gaps. More technical aspects of item analysis
will be considered in Chapter 8.

RELIABILITY. How good is this test? Does it really work? These ques-
tions could-and occasionally do-result in long hours of futile discus-
sion. Subjective opinions, hunches, and personal biases may lead, on the
one hand, to extravagant claims regarding what a particular test can
accomplish and, on the other hand, to stubborn rejection. The only way
questions such as these can be conclusively answered is by empirical
trial. The objective evaluation of psychological tests involves primarily
the determination of the reliability and the validity of the test in specified
situations.

As used in psychometrics, the term reliability always means consis-
tency. Test reliability is the consistency of scores obtained by the same
persons when retested with the identical test or with an equivalent form
of the test. If a child receives an IQ of 110 on Monday and an IQ of 80
when retested on Friday, it is obvious that little or no confidence can be
put in either score. Similarly, if in one set of 50 words an individual
identifies 40 correctly, whereas in another, supposedly equivalent set he
gets a score of only 20 right, then neither score can be taken as a de-
pendable index of his verbal comprehension. To be sure, in both illustra-
tions it is possible that only one of the two scores is in error, but this
could be demonstrated only by further retests. From the given data, we
can conclude only that both scores cannot be right. Whether one or
neither is an adequate estimate of the individual's ability in vocabulary
cannot be established without additional information.
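Retest consistency of this kind is ordinarily expressed as a correlation coefficient, the statistic behind the reliability measures treated in Chapter 5. A minimal sketch, with invented score pairs, shows the computation:

```python
# Sketch: test-retest reliability expressed as the Pearson correlation
# between two administrations of the same test. Scores are invented.
first_testing  = [110, 95, 102, 88, 120, 97, 105, 92]
second_testing = [108, 97, 100, 90, 118, 95, 107, 94]

def pearson_r(x, y):
    """Pearson product-moment correlation between two score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

# Here the two testings order the individuals almost identically, so the
# coefficient comes out close to +1; wildly inconsistent retest scores,
# like the IQs of 110 and 80 above, would pull it toward zero.
reliability = pearson_r(first_testing, second_testing)
```

The same computation serves for equivalent-form reliability; only the source of the second score list changes.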
Before a psychological test is released for general use, a thorough,
objective check of its reliability should be carried out. The different types
of test reliability, as well as methods of measuring each, will be con-
sidered in Chapter 5. Reliability can be checked with reference to
temporal fluctuations, the particular selection of items or behavior sample
constituting the test, the role of different examiners or scorers, and other
aspects of the testing situation. It is essential to specify the type of re-
liability and the method employed to determine it, because the same test
may vary in these different aspects. The number and nature of indi-
viduals on whom reliability was checked should likewise be reported.
With such information, the test user can predict whether the test will be
about equally reliable for the group with which he expects to use it, or
whether it is likely to be more reliable or less reliable.
VALIDITY. Undoubtedly the most important question to be asked about
any psychological test concerns its validity, i.e., the degree to which the
test actually measures what it purports to measure. Validity provides a
direct check on how well the test fulfills its function. The determination
of validity usually requires independent, external criteria of whatever the
test is designed to measure. For example, if a medical aptitude test is to
be used in selecting promising applicants for medical school, ultimate
success in medical school would be a criterion. In the process of validat-
ing such a test, it would be administered to a large group of students at
the time of their admission to medical school. Some measure of per-
formance in medical school would eventually be obtained for each stu-
dent on the basis of grades, ratings by instructors, success or failure in
completing training, and the like. Such a composite measure constitutes
the criterion with which each student's initial test score is to be correlated.
A high correlation, or validity coefficient, would signify that those indi-
viduals who scored high on the test had been relatively successful in
medical school, whereas those scoring low on the test had done poorly in
medical school. A low correlation would indicate little correspondence
between test score and criterion measure and hence poor validity for the
test. The validity coefficient enables us to determine how closely the
criterion performance could have been predicted from the test scores.
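Computationally, a validity coefficient is simply the correlation between test scores and the later criterion measure. The sketch below uses invented admission scores and criterion values, standing in for the medical-school example:

```python
# Sketch: a validity coefficient is the correlation between scores on the
# test and an independent criterion measure obtained later. All numbers
# here are invented for illustration.
test_at_admission = [55, 72, 60, 81, 47, 68, 75, 58]
criterion_measure = [2.1, 3.4, 2.6, 3.8, 1.9, 3.0, 3.5, 2.4]

def validity_coefficient(test, criterion):
    """Pearson correlation between test scores and criterion scores."""
    n = len(test)
    mt, mc = sum(test) / n, sum(criterion) / n
    cov = sum((t - mt) * (c - mc) for t, c in zip(test, criterion))
    var_t = sum((t - mt) ** 2 for t in test)
    var_c = sum((c - mc) ** 2 for c in criterion)
    return cov / (var_t * var_c) ** 0.5

# A coefficient near +1 means that high scorers tended to succeed on the
# criterion; a coefficient near 0 means the test predicts it poorly.
v = validity_coefficient(test_at_admission, criterion_measure)
```

Note that the criterion values here would be the composite measure described above (grades, instructor ratings, completion of training), gathered long after the test was given.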
In a similar manner, tests designed for other purposes can be validated
against appropriate criteria. A vocational aptitude test, for example, can
be validated against on-the-job success of a trial group of new employees.
A pilot aptitude battery can be validated against achievement in flight
training. Tests designed for broader and more varied uses are validated
against a number of criteria, and their validity can be established only by
the gradual accumulation of data from many different kinds of investiga-
tions.
The reader may have noticed an apparent paradox in the concept of
test validity. If it is necessary to follow up the subjects or in other ways
to obtain independent measures of what the test is trying to predict, why
not dispense with the test? The answer to this riddle is to be found in the
distinction between the validation group on the one hand and the groups
on which the test will eventually be employed for operational purposes
on the other. Before the test is ready for use, its validity must be estab-
lished on a representative sample of subjects. The scores of these persons
are not themselves employed for operational purposes but serve only in
the process of testing the test. If the test proves valid by this method, it
can then be used on other samples in the absence of criterion measures.
It might still be argued that we would need only to wait for the crite-
rion measure to mature, to become available, on any group in order to
obtain the information that the test is trying to predict. But such a pro-
cedure would be so wasteful of time and energy as to be prohibitive in
most instances. Thus, we could determine which applicants will succeed
on a job or which students will satisfactorily complete college by admit-
ting all who apply and waiting for subsequent developments! It is the
very wastefulness of this procedure-and its deleterious emotional im-
pact on individuals-that tests are designed to minimize. By means of
tests, the person's present level of prerequisite skills, knowledge, and
other relevant characteristics can be assessed with a determinable margin
of error. The more valid and reliable the test, the smaller will be this
margin of error.
The special problems encountered in determining the validity of dif-
ferent types of tests, as well as the specific criteria and statistical pro-
cedures employed, will be discussed in Chapters 6 and 7. One further
point, however, should be considered at this time. Validity tells us more
than the degree to which the test is fulfilling its function. It actually tells
us what the test is measuring. By studying the validation data, we can
objectively determine what the test is measuring. It would thus be more
accurate to define validity as the extent to which we know what the test
measures. The interpretation of test scores would undoubtedly be clearer
and less ambiguous if tests were regularly named in terms of the criterion
through which they had been validated. A tendency in this direction can
be recognized in such test labels as "scholastic aptitude test" and "per-
sonnel classification test" in place of the vague title "intelligence test."
REASONS FOR CONTROLLING THE USE OF
PSYCHOLOGICAL TESTS
"May I have a Stanford-Binet blank? My nephew has to take it next week for
admission to School X and I'd like to give him some practice so he can pass."

"To improve the reading program in our school, we need a culture-free IQ
test that measures each child's innate potential."

"Last night I answered the questions in an intelligence test published in a
magazine and I got an IQ of 80-I think psychological tests are silly."

"My roommate is studying psych. She gave me a personality test and I came
out neurotic. I've been too upset to go to class ever since."

"Last year you gave a new personality test to our employees for research pur-
poses. We would now like to have the scores for their personnel folders."
The above remarks are not imaginary. Each is based on a real incident,
and the list could easily be extended by any psychologist. Such remarks
illustrate potential misuses or misinterpretations of psychological tests in
such ways as to render the tests worthless or to hurt the individual. Like
any scientific instrument or precision tool, psychological tests must be
properly used to be effective. In the hands of either the unscrupulous or
the well-meaning but uninformed user, such tests can cause serious
damage.

There are two principal reasons for controlling the use of psychological
tests: (a) to prevent general familiarity with test content, which would
invalidate the test, and (b) to ensure that the test is used by a qualified
examiner. Obviously, if an individual were to memorize the correct re-
sponses on a test of color blindness, such a test would no longer be a
measure of color vision for him. Under these conditions, the test would
be completely invalidated. Test content clearly has to be restricted in
order to forestall deliberate efforts to fake scores.
In other cases, however, the effect of familiarity may be less obvious,
or the test may be invalidated in good faith by misinformed persons. A
schoolteacher, for example, may give her class special practice in prob-
lems closely resembling those on an intelligence test, "so that the pupils
will be well prepared to take the test." Such an attitude is simply a carry-
over from the usual procedure of preparing for a school examination.
When applied to an intelligence test, however, it is likely that such
specific training or coaching will raise the scores on the test without ap-
preciably affecting the broader area of behavior the test tries to sample.
Under such conditions, the validity of the test as a predictive instrument
is reduced.
The need for a qualified examiner is evident in each of the three major
aspects of the testing situation-selection of the test, administration and
scoring, and interpretation of scores. Tests cannot be chosen like lawn
mowers, from a mail-order catalogue. They cannot be evaluated by name,
author, or other easy marks of identification. To be sure, it requires no
psychological training to consider such factors as cost, bulkiness and ease
of transporting test materials, testing time required, and ease and rapidity
of scoring. Information on these practical points can usually be obtained
from a test catalogue and should be taken into account in planning a test-
ing program. For the test to serve its function, however, an evaluation of
its technical merits in terms of such characteristics as validity, reliability,
difficulty level, and norms is essential. Only in such a way can the test
user determine the appropriateness of any test for his particular purpose
and its suitability for the type of persons with whom he plans to use it.
The introductory discussion of test standardization earlier in this chap-
ter has already suggested the importance of a trained examiner. An ade-
quate realization of the need to follow instructions precisely, as well as a
thorough familiarity with the standard instructions, is required if the test
scores obtained by different examiners are to be comparable or if any one
individual's score is to be evaluated in terms of the published norms.
Careful control of testing conditions is also essential. Similarly, incorrect
or inaccurate scoring may render the test score worthless. In the absence
of proper checking procedures, scoring errors are far more likely to occur
than is generally realized.
The proper interpretation of test scores requires a thorough under-
standing of the test, the individual, and the testing conditions. What is
being measured can be objectively determined only by reference to the
specific procedures in terms of which the particular test was validated.
Other information, pertaining to reliability, nature of the group on which
norms were established, and the like, is likewise relevant. Some back-
ground data regarding the individual being tested are essential in inter-
preting any test score. The same score may be obtained by different per-
sons for very different reasons. The conclusions to be drawn from such
scores would therefore be quite dissimilar. Finally, some consideration
must also be given to special factors that may have influenced a particular
score, such as unusual testing conditions, temporary emotional or physical
state of the subject, and extent of the subject's previous experience with
tests.
The basic rationale of testing involves generalization from the behavior
sample observed in the testing situation to behavior manifested in other,
nontest situations. A test score should help us to predict how the client
will feel and act outside the clinic, how the student will achieve in col-
lege courses, and how the applicant will perform on the job. Any influ-
ences that are specific to the test situation constitute error variance and
reduce test validity. It is therefore important to identify any test-related
influences that may limit or impair the generalizability of test results.
A whole volume could easily be devoted to a discussion of desirable
procedures of test administration. But such a survey falls outside the
scope of the present book. Moreover, it is more practicable to acquire
such techniques within specific settings, because no one person would
normally be concerned with all forms of testing, from the examination
of infants to the clinical testing of psychotic patients or the administra-
tion of a mass testing program for military personnel. The present discus-
sion will therefore deal principally with the common rationale of test
administration rather than with specific questions of implementation. For
detailed suggestions regarding testing procedure, see Palmer (1970),
Sattler (1974), and Terman and Merrill (1960) for individual testing,
and Clemans (1971) for group testing.
ADVANCE PREPARATION OF EXAMINERS. The most important requirement
for good testing procedure is advance preparation. In testing there can
be no emergencies. Special efforts must therefore be made to foresee and
forestall emergencies. Only in this way can uniformity of procedure be
assured.
Advance preparation for the testing session takes many forms. Memorizing
the exact verbal instructions is essential in most individual testing.
Even in a group test in which the instructions are read to the subjects,
some previous familiarity with the statements to be read prevents mis-
reading and hesitation and permits a more natural, informal manner
during test administration. The preparation of test materials is another
important preliminary step. In individual testing and especially in the
administration of performance tests, such preparation involves the actual
layout of the necessary materials to facilitate subsequent use with a
minimum of search or fumbling. Materials should generally be placed on
a table near the testing table so that they are within easy reach of the
examiner but do not distract the subject. When apparatus is employed,
periodic checking and calibration may be necessary. In group
testing, all test blanks, answer sheets, special pencils, or other materials
needed should be carefully counted, checked, and arranged in advance
of the testing day.
Thorough familiarity with the specific testing procedure is another im-
portant prerequisite in both individual and group testing. For individual
testing, supervised training in the administration of the particular test is
usually essential. Depending upon the nature of the test and the type of
subjects to be examined, such training may require from a few demonstra-
tion and practice sessions to over a year of instruction. For group testing,
and especially in large-scale projects, such preparation may include
advance briefing of examiners and proctors, so that each is fully in-
formed about the functions he is to perform. In general, the examiner
reads the instructions, takes care of timing, and is in charge of the group
in anyone testing room. The proctors hand out and collect test materials,
make certain that subjects are following instructions, answer individual
questions of subjects within the limitations specified in the manual, and
prevent cheating.
TESTING CONDITIONS. Standardized procedure applies not only to verbal
instructions, timing, materials, and other aspects of the tests themselves
but also to the testing environment. Some attention should be given to
the selection of a suitable testing room. This room should be free from
undue noise and distraction and should provide adequate lighting, venti-
lation, and seating facilities. Special care should also be taken to prevent
interruptions during the test. Posting a
sign on the door to indicate that testing is in progress is effective, pro-
vided all personnel have learned that such a sign means no admittance
under any circumstances. In the testing of large groups, locking the doors
or posting an assistant outside each door may be necessary to prevent the
entrance of latecomers.
It is important to realize the extent to which testing conditions may
influence scores. Even apparently minor aspects of the testing situation
may appreciably alter performance. Such a factor as the use of desks or
of chairs with desk arms, for example, proved to be significant in a group
testing project with high school students, the groups using desks tending
to obtain higher scores (Kelley, 1943; Traxler & Hilkert, 1942). There is
also evidence to show that the answer sheet employed may affect test
scores (Bell, Hoff, & Hoyt, 1963). Since the establishment of in-
dependent test-scoring and data-processing agencies that provide their
own machine-scorable answer sheets, examiners sometimes administer
group tests with answer sheets other than those used in the standardiza-
tion sample. In the absence of empirical verification, the equivalence of
these answer sheets cannot be assumed. The Differential Aptitude Tests,
for example, may be administered with any of five different answer
sheets. On the Clerical Speed and Accuracy Test of this battery, separate
norms are provided for three of the five answer sheets, because they were
found to yield substantially different scores than those obtained with the
answer sheets used by the standardization sample.
In testing children below the fifth grade, the use of any separate answer
sheet may significantly lower test scores (Metropolitan Achievement Test
Special Report, 1975). At these grade levels, having the child mark the
answers in the test booklet itself is generally preferable.
Many other, more subtle testing conditions have been shown to affect
performance on ability as well as personality tests. Whether the ex-
aminer is a stranger or someone familiar to the subjects may make a
significant difference in test scores (Sacks, 1952; Tsudzuki, Hata, & Kuze,
1957). In another study, the general manner and behavior of the exam-
iner, as illustrated by smiling, nodding, and making such comments as
"good" or "fine," were shown to have a decided effect on test results
(Wickes, 1956). In a projective test requiring the subject to write stories
to fit given pictures, the presence of the examiner in the room tended to
inhibit the inclusion of strongly emotional content in the stories (Bern-
stein, 1956). In the administration of a typing test, job applicants typed
at a significantly faster rate when tested alone than when tested in groups
of two or more (Kirchner, 1966).
Examples could readily be multiplied. The implications are threefold.
First, follow standardized procedures to the minutest detail. It is the re-
sponsibility of the test author and publisher to describe such procedures
fully and clearly in the test manual. Second, record any unusual testing
conditions, however minor. Third, take testing conditions into account
when interpreting test results. In the intensive assessment of a person
through individual testing, an experienced examiner may occasionally de-
part from the standardized test procedure in order to elicit additional in-
formation for special reasons. When he does so, he can no longer in-
terpret the subject's responses in terms of the test norms. Under these
circumstances, the test stimuli are used only for qualitative exploration;
and the responses should be treated in the same way as any other infor-
mal behavioral observations or interview data.
RAPPORT. In psychometrics, the term "rapport" refers to the examiner's efforts
to arouse the subject's interest in the test, elicit his cooperation, and
ensure that he follows the standard test instructions. In ability tests, the
instructions call for careful concentration on the given tasks and for put-
ting forth one's best efforts to perform well; in personality inventories,
they call for frank and honest responses to questions about one's usual
behavior; in certain projective tests, they call for full reporting of associa-
tions evoked by the stimuli, without any censoring or editing of content.
Still other kinds of tests may require other approaches. But in all in-
stances, the examiner endeavors to motivate the subject to follow the
instructions as fully and conscientiously as he can.
The training of examiners covers techniques for the establishment of
rapport as well as those more directly related to test administration. In
establishing rapport, as in other testing procedures, uniformity of condi-
tions is essential for comparability of results. If a child is given a coveted
prize whenever he solves a test problem correctly, his performance can-
not be directly compared with the norms or with that of other children
who are motivated only with the standard verbal encouragement or
praise. Any deviation from standard motivating conditions for a particular
test should be noted and taken into account in interpreting performance.
Although rapport can be more fully established in individual testing,
steps can also be taken in group testing to motivate the subjects and re-
lieve their anxiety. Specific techniques for establishing rapport vary with
the nature of the test and with the age and other characteristics of the
subjects. In testing preschool children, special factors to be considered
include shyness with strangers, distractibility, and negativism. A friendly,
cheerful, and relaxed manner on the part of the examiner helps to reas-
sure the child. The shy, timid child needs more preliminary time to be-
come familiar with his surroundings. For this reason it is better for the
examiner not to be too demonstrative at the outset, but rather to wait
until the child is ready to make the first contact. Test periods should be
brief, and the tasks should be varied and intrinsically interesting to the
child. The testing should be presented to the child as a game and his
curiosity aroused before each new task is introduced. A certain flexibility
of procedure is necessary at this age level because of possible refusals,
loss of interest, and other manifestations of negativism.
Children in the first two or three grades of elementary school present
many of the same testing problems as the preschool child. The game ap-
proach is still the most effective way of arousing their interest in the test.
The older schoolchild can usually be motivated through an appeal to his
competitive spirit and his desire to do well on tests. When testing chil-
dren from educationally disadvantaged backgrounds or from different
cultures, however, the examiner cannot assume they will be motivated to
excel on academic tasks to the same extent as children in the standardiza-
tion sample. This problem and others pertaining to the testing of persons
with dissimilar experiential backgrounds will be considered further in
Chapters 3, 7, and 12.
. Special. motivational problems may be encountered in testing emo-
tionally disturbed persons, prisoners, or juvenile delinquents. Especially
when examined in an institutional setting, such persons are likely to
manifest a number of unfavorable attitudes, such as suspicion, insecurity,
fear, or cynical indifference. Abnormal conditions in their past experiences
are also likely to influence their test performance adversely. As a result
of early failures and frustrations in school, for example, they may have
developed feelings of hostility and inferiority toward academic tasks,
which the tests resemble. The experienced examiner makes special efforts
to establish rapport under these conditions. In any event, he must be
sensitive to these special difficulties and take them into account in inter-
preting and explaining test performance.
In testing any school-age child or adult, one should bear in mind that
every test presents an implied threat to the individual's prestige. Some
reassurance should therefore be given at the outset. It is helpful to ex-
plain, for example, that no one is expected to finish or to get all the items
correct. The individual might otherwise experience a mounting sense of
failure as he advances to the more difficult items or finds that he is un-
able to finish any subtest within the time allowed.
It is also desirable to eliminate the element of surprise from the test
situation as far as possible, because the unexpected and unknown are
likely to produce anxiety. Many group tests provide a preliminary ex-
planatory statement that is read to the group by the examiner. An even
better procedure is to announce the tests a few days in advance and to
give each subject a printed booklet that explains the purpose and nature
of the tests, offers general suggestions on how to take tests, and contains
a few sample items. Such explanatory booklets are regularly available to
participants in large-scale testing programs such as those conducted by
the College Entrance Examination Board (1974a, 1974b). The United
States Employment Service has likewise developed a booklet on how to
take tests, as well as a more extensive pretesting orientation technique
for use with culturally disadvantaged applicants unfamiliar with tests.
More general orientation booklets are also available, such as Meeting
the Test (Anderson, Katz, & Shimberg, 1965). A tape recording and two
booklets are combined in Test Orientation Procedure (TOP), designed
specifically for job applicants with little prior testing experience (Ben-
nett & Doppelt, 1967). The first booklet, used together with the tape,
provides general information on how to take tests; the second contains
practice tests. In the absence of a tape recorder, the examiner may read
the instructions from a printed script.
Adult testing presents some additional problems. Unlike the school-
child, the adult is not so likely to work hard at a task merely because it is
assigned to him. It therefore becomes more important to "sell" the pur-
pose of the tests to the adult, although high school and college students
also respond to such an appeal. Cooperation of the examinee can usually
be secured by convincing him that it is in his own interests to obtain a
valid score, i.e., a score correctly indicating what he can do rather than
overestimating or underestimating his abilities. Most persons will under-
stand that an incorrect decision, which might result from invalid test
scores, would mean subsequent failure, loss of time, and frustration for
them. This approach can serve not only to motivate the individual to
try his best on ability tests but also to reduce faking and encourage frank
reporting on personality inventories, because the examinee realizes that
he himself would otherwise be the loser. It is certainly not in the best
interests of the individual to be admitted to a course of study for which
he is not qualified or assigned to a job he cannot perform or that he
would find uncongenial.
Many of the practices designed to enhance rapport serve also to reduce
test anxiety. Procedures tending to dispel surprise and strangeness from
the testing situation and to reassure and encourage the subject should
certainly help to lower anxiety. The examiner's own manner and a well-
organized, smoothly running testing operation will contribute toward the
same goal. Individual differences in test anxiety have been studied with
both schoolchildren and college students (Gaudry & Spielberger, 1974;
Spielberger, 1972). Much of this research was initiated by Sarason and
his associates at Yale (Sarason, Davidson, Lighthall, Waite, & Ruebush,
1960). The first step was to construct a questionnaire to assess the indi-
vidual's test-taking attitudes. The children's form, for example, contains
items such as the following:
Do you worry a lot before taking a test?
When the teacher says she is going to find out how much you have learned,
does your heart begin to beat faster?
While you are taking a test, do you usually think you are not doing well?
Of primary interest is the finding that both school achievement and intel-
ligence test scores yielded significant negative correlations with test anx-
iety. Similar correlations have been found among college students (I. G.
Sarason, 1961). Longitudinal studies likewise revealed an inverse relation
between changes in anxiety level and changes in intelligence or achieve-
ment test performance (Hill & Sarason, 1966; Sarason, Hill, & Zimbardo, 1964).
Such findings, of course, do not indicate the direction of causal relation-
ships. It is possible that children develop test anxiety because they per-
form poorly on tests and have thus experienced failure and frustration in
previous test situations. In support of this interpretation is the finding
that within subgroups of high scorers on intelligence tests, the negative
correlation between anxiety level and test performance disappears
(Denny, 1966; Feldhusen & Klausmeier, 1962). On the other hand, there
is evidence suggesting that at least some of the relationship results from
the deleterious effects of anxiety on test performance. In one study
(Waite, Sarason, Lighthall, & Davidson, 1958), high-anxious and low-
anxious children equated in intelligence test scores were given repeated
trials in a learning task. Although initially equal in the learning test, the
low-anxious group improved significantly more than the high-anxious.
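The subgroup finding reported above, a negative anxiety-performance correlation in the full sample that fades among high scorers, is what restriction of range alone would produce under a "failure breeds anxiety" model. A simulation sketch (the model and all numbers are hypothetical):

```python
import random
import statistics

random.seed(1)

n = 20000
# Hypothetical model: anxiety grows out of past poor performance, so
# low-ability examinees accumulate failures and hence higher test anxiety.
ability = [random.gauss(0, 1) for _ in range(n)]
anxiety = [-0.5 * a + random.gauss(0, 1) for a in ability]
score = [a + random.gauss(0, 0.3) for a in ability]

r_all = statistics.correlation(anxiety, score)

# Restrict the sample to high scorers, as in the subgroup analyses.
high = [(x, s) for x, s in zip(anxiety, score) if s > 1.0]
r_high = statistics.correlation([x for x, _ in high],
                                [s for _, s in high])

print(f"full sample:  r = {r_all:.2f}")   # clearly negative
print(f"high scorers: r = {r_high:.2f}")  # much weaker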
Several investigators have compared test performance under conditions
designed to evoke "anxious" and "relaxed" states. Mandler and Sarason
(1952), for example, found that ego-involving instructions, such as telling
subjects that everyone is expected to finish in the time allotted, had a
beneficial effect on the performance of low-anxious subjects, but a dele-
terious effect on that of high-anxious subjects. Other studies have likewise
found an interaction between testing conditions and such individual char-
acteristics as anxiety level and achievement motivation (Lawrence, 1962;
Paul & Eriksen, 1964). It thus appears likely that the relation between
anxiety and test performance is nonlinear, a slight amount of anxiety
being beneficial while a large amount is detrimental. Individuals who are
customarily low-anxious benefit from test conditions that arouse some
anxiety, while those who are customarily high-anxious perform better
under more relaxed conditions.
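The nonlinear relation described above can be pictured with a toy inverted-U curve; the quadratic form, the peak location, and the constants are illustrative assumptions, not a fitted model:

```python
# Illustrative inverted-U: "a slight amount of anxiety is beneficial,
# while a large amount is detrimental."
def performance(anxiety, optimum=0.4, penalty=50.0):
    """Peak performance at a moderate anxiety level, falling off
    quadratically on either side (a toy model, not a fitted curve)."""
    return 100.0 - penalty * (anxiety - optimum) ** 2

levels = [i / 10 for i in range(11)]  # anxiety from 0.0 (relaxed) to 1.0
best = max(levels, key=performance)

for a in (0.0, 0.4, 1.0):
    print(f"anxiety {a:.1f} -> performance {performance(a):.1f}")
print("best performance at anxiety level", best)
```

On such a curve, raising anxiety helps an examinee sitting to the left of the peak and hurts one sitting to the right, which is the interaction with customary anxiety level that the studies report.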
It is undoubtedly true that a chronically high anxiety level will exert a
detrimental effect on school learning and intellectual development. Such
an effect, however, should be distinguished from the test-limited effects
with which this discussion is concerned. To what extent does test anxiety
make the individual's test performance unrepresentative of his customary
performance level in nontest situations? Because of the competitive pres-
sure experienced by college-bound high school seniors in America today,
it has been argued that performance on college admission tests may be
unduly affected by test anxiety. In a thorough and controlled investi-
gation of this question, French (1962) compared the performance of high
school students on a test given as part of the regular administration of
the SAT with performance on a parallel form of the test administered at
a different time under "relaxed" conditions. The instructions on the latter
occasion specified that the test was given for research purposes only and
scores would not be sent to any college. The results showed that per-
formance was no poorer during the standard administration than during
the relaxed administration. Moreover, the concurrent validity of the test
scores against high school course grades did not differ significantly under
the two conditions.
Comprehensive surveys of the effects of examiner and situational
variables on test scores have been prepared by S. B. Sarason (1954),
Masling (1960), Moriarty (1961, 1966), Sattler and Theye (1967),
Palmer (1970), and Sattler (1970, 1974). Although some effects have
been demonstrated with objective group tests, most of the data have been
obtained with either projective techniques or individual intelligence tests.
These extraneous factors are more likely to operate with unstructured and
ambiguous stimuli, as well as with difficult and novel tasks, than with
clearly defined and well-learned functions. In general, children are more
susceptible to examiner and situational influences than are adults; in the
examination of preschool children, the role of the examiner is especially
crucial. Emotionally disturbed and insecure persons of any age are also
more likely to be affected by such conditions than are well-adjusted persons.
There is considerable evidence that test results may vary systematically
as a function of the examiner (E. Cohen, 1965; Masling, 1960). These dif-
ferences may be related to personal characteristics of the examiner, such
as his age, sex, race, professional or socioeconomic status, training and
experience, personality characteristics, and appearance. Several studies of
these examiner variables, however, have yielded misleading or incon-
clusive results because the experimental designs failed to control or iso-
late the influence of different examiner or subject characteristics. Hence
the effects of two or more variables may be confounded.
The examiner's behavior before and during test administration has also
been shown to affect test results. For example, controlled investigations
have yielded significant differences in intelligence test performance as a
result of a "warm" versus a "cold" interpersonal relation between ex-
aminer and examinees, or a rigid and aloof versus a natural manner on
the part of the examiner (Exner, 1966; Masling, 1959). Moreover, there
may be significant interactions between examiner and examinee charac-
teristics, in the sense that the same examiner characteristic or testing man-
ner may have a different effect on different examinees as a function of
the examinee's own personality characteristics. Similar interactions may
occur with task variables, such as the nature of the test, the purpose of
the testing, and the instructions given to the subjects. Dyer (1973) adds
even more variables to this list, calling attention to the possible influence
of the test givers' and the test takers' diverse perceptions of the functions
and goals of testing.
Still another way in which an examiner may inadvertently affect the
examinee's responses is through his own expectations. This is simply a
special instance of the self-fulfilling prophecy (Rosenthal, 1966; Rosen-
thal & Rosnow, 1969). An experiment conducted with the Rorschach will
illustrate this effect (Masling, 1965). The examiners were 14 graduate
student volunteers, 7 of whom were told, among other things, that ex-
perienced examiners elicit more human than animal responses from the
subjects, while the other 7 were told that experienced examiners elicit
more animal than human responses. Under these conditions, the two
groups of examiners obtained significantly different ratios of animal to
human responses from their subjects. These differences occurred despite
the fact that neither examiners nor subjects reported awareness of any
influence attempt. Moreover, tape recordings of all testing sessions re-
vealed no evidence of verbal influence on the part of any examiner. The
examiners' expectations apparently operated through subtle postural and
facial cues to which the subjects responded.
Apart from the examiner, other aspects of the testing situation may
significantly affect test performance. Military recruits, for example, are
often examined shortly after induction, during a period of intense read-
justment to an unfamiliar and stressful situation. In one investigation
designed to test the effect of acclimatization to such a situation on test
performance, 2,724 recruits were given the Navy Classification Battery
during their ninth day at the Naval Training Center (Gordon & Alf,
1960). When their scores were compared with those obtained by 2,180
recruits tested at the conventional time, during their third day, the 9-day
group scored significantly higher on all subtests of the battery.
The examinees' activities immediately preceding the test may also af-
fect their performance, especially when such activities produce emotional
disturbance, fatigue, or other handicapping conditions. In an investiga-
tion with third- and fourth-grade schoolchildren, there was some evidence
to suggest that IQ on the Draw-a-Man Test was influenced by the chil-
dren's preceding classroom activity (McCarthy, 1944). On one occasion,
the class had been engaged in writing a composition on "The Best
Thing That Ever Happened to Me"; on the second occasion, they had
again been writing, but this time on "The Worst Thing That Ever Hap-
pened to Me." The IQ's on the second test, following what may have
been an emotionally depressing experience, averaged 4 or 5 points lower
than on the first test. These findings were corroborated in a later investi-
gation specifically designed to determine the effect of immediately pre-
ceding experience on the Draw-a-Man Test (Reichenberg-Hackett, 1953).
In this study, children who had had a gratifying experience involving the
successful solution of an interesting puzzle, followed by a reward of toys
and candy, showed more improvement in their test scores than those who
had undergone neutral or less gratifying experiences. Similar results were
obtained by W. E. Davis (1969a, 1969b) with college students. Per-
formance on an arithmetic reasoning test was significantly poorer when
preceded by a failure experience on a verbal comprehension test than it
was in a control group given no preceding test and in one that had taken
a standard verbal comprehension test under ordinary conditions.
Several studies have been concerned with the effects of feedback re-
garding test scores on the individual's subsequent test performance. In a
particularly well-designed investigation with seventh-grade students,
Bridgeman (1974) found that "success" feedback was followed by sig-
nificantly higher performance on a similar test than was "failure" feed-
back in subjects who had actually performed equally well to begin with.
This type of motivational feedback may operate largely through the goals
the subjects set for themselves in subsequent performance and may thus
represent another example of the self-fulfilling prophecy. Such general
motivational feedback, however, should not be confused with corrective
feedback, whereby the individual is informed about the specific items he
missed and given remedial instruction; under these conditions, feedback
is much more likely to improve the performance of initially low-scoring
persons.
The examples cited in this section illustrate the wide diversity of test-
related factors that may affect test scores. In the majority of well-admin-
istered testing programs, the influence of these factors is negligible for
practical purposes. Nevertheless, the skilled examiner is constantly on
guard to detect the possible operation of such factors and to minimize
their influence. When circumstances do not permit the control of these
conditions, the conclusions drawn from test performance should be
qualified.
In evaluating the effect of coaching or practice on test scores, a funda-
mental question is whether the improvement is limited to the specific
items included in the test or whether it extends to the broader area of
behavior that the test is designed to predict. The answer to this question
represents the difference between coaching and education. Obviously,
any educational experience the individual undergoes, either formal or in-
formal, in or out of school, should be reflected in his performance on tests
sampling the relevant aspects of behavior. Such broad influences will in
no way invalidate the test, since the test score presents an accurate pic-
ture of the individual's standing in the abilities under consideration. The
difference is, of course, one of degree. Influences cannot be classified as
either narrow or broad, but obviously vary widely in scope, from those
affecting only a single administration of a single test, through those affect-
ing performance on all items of a certain type, to those influencing the
individual's performance in the large majority of his activities. From the
standpoint of effective testing, however, a workable distinction can be
made. Thus, it can be stated that a test score is invalidated only when a
particular experience raises it without appreciably affecting the criterion
behavior that the test is designed to predict.
COACHING. The effects of coaching on test scores have been widely in-
vestigated. Many of these studies were conducted by British psycholo-
gists, with special reference to the effects of practice and coaching on the
tests formerly used in assigning 11-year-old children to different types of
secondary schools (Yates et al., 1953-1954). As might be expected, the
amount of improvement depends on the ability and earlier educational
experiences of the examinees, the nature of the tests, and the amount and
type of coaching provided. Individuals with deficient educational back-
grounds are more likely to benefit from special coaching than are those
who have had superior educational opportunities and are already pre-
pared to do well on the tests. It is obvious, too, that the closer the re-
semblance between test content and coaching material, the greater will
be the improvement in test scores. On the other hand, the more closely
instruction is restricted to specific test content, the less likely is improve-
ment to extend to criterion performance.
In America, the College Entrance Examination Board has been con-
cerned about the spread of ill-advised commercial coaching courses for
college applicants. To clarify the issues, the College Board conducted
several well-controlled experiments to determine the effects of coaching
on its Scholastic Aptitude Test and surveyed the results of similar studies
by other, independent investigators (Angoff, 1971b; College Entrance
Examination Board, 1968). These studies covered a variety of coaching
methods and included students in both public and private high schools;
one investigation was conducted with black students in 15 urban and
rural high schools in Tennessee. The conclusion from all these studies is
that intensive drill on items similar to those on the SAT is unlikely to
produce appreciably greater gains than occur when students are retested
with the SAT after a year of regular high school instruction.
On the basis of such research, the Trustees of the College Board issued
a formal statement about coaching, in which the following points were
made, among others (College Entrance Examination Board, 1968, pp. 8-9):

The results of the coaching studies which have thus far been completed in-
dicate that average increases of less than 10 points on a 600 point scale can
be expected. It is not reasonable to believe that admissions decisions can be
affected by such small changes in scores. This is especially true since the tests
are merely supplementary to the school record and other evidence taken into
account by admissions officers. . . . As the College Board uses the term, ap-
titude is not something fixed and impervious to influence by the way the child
lives and is taught. Rather, this particular Scholastic Aptitude Test is a meas-
ure of abilities that seem to grow slowly and stubbornly, profoundly influenced
by conditions at home and at school over the years, but not responding to
hasty attempts to relive a young lifetime.
It should also be noted that in its test construction procedures, the Col-
lege Board investigates the susceptibility of new item types to coaching
(Angoff, 1971b; Pike & Evans, 1972). Item types on which performance
can be appreciably raised by short-term drill or instruction of a narrowly
limited nature are not included in the operational forms of the tests.
PRACTICE. The effects of sheer repetition, or practice, on test per-
formance are similar to the effects of coaching, but usually less pro-
nounced. It should be noted that practice, as well as coaching, may alter
the nature of the test, since the subjects may employ different work meth-
ods in solving the same problems. Moreover, certain types of items may
be much easier when encountered a second time. An example is provided
by problems requiring insightful solutions which, once attained, can be
applied directly in solving the same or similar problems in a retest. Scores
on such tests, whether derived from a repetition of the identical test or
from a parallel form, should therefore be carefully scrutinized.
A number of studies have been concerned with the effects of the identical repetition of intelligence tests over periods ranging from a few days to several years (see Quereshi, 1968). Both adults and children, and both normal and mentally retarded persons have been employed. The studies have covered individual as well as group tests. All agree in showing significant mean gains on retests. Nor is improvement necessarily limited to the initial repetitions. Whether gains persist or level off in successive administrations seems to depend on the difficulty of the test and the ability level of the subjects. The implications of such findings are illustrated by the results obtained in annual retests of 3,500 schoolchildren with a variety of intelligence tests (Dearborn & Rothney, 1941). When the same test was readministered in successive years, the median IQ of the group rose from 102 to 113, but it dropped to 104 when another test was substituted. Because of the retest gains, the meaning of an IQ obtained on an initial and later trial proved to be quite different. For example, an IQ of 100 fell approximately at the average of the distribution on the initial trial, but in the lowest quarter on a retest. Such IQ's, though numerically identical and derived from the same test, might thus signify normal ability in the one instance and inferior ability in the other.
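The shift in meaning described here is easy to quantify as percentile ranks under the two group distributions. The sketch below uses the group medians quoted from Dearborn and Rothney, but the standard deviation of 15 and the normal approximation are assumptions of this illustration, not values reported in the study.

```python
from statistics import NormalDist

SD = 15  # conventional IQ standard deviation (an assumption of this sketch)

def percentile_rank(score, group_mean, sd=SD):
    """Fraction of the group scoring below `score`, normal approximation."""
    return NormalDist(mu=group_mean, sigma=sd).cdf(score)

# Group median ~102 on the initial trial, rising to ~113 on retest.
initial = percentile_rank(100, group_mean=102)
retest = percentile_rank(100, group_mean=113)

print(f"IQ 100, initial trial: {initial:.0%}")  # near the middle of the group
print(f"IQ 100, retest:        {retest:.0%}")   # in the lowest quarter
```

Under these assumptions the same numerical IQ of 100 stands near the group average on the initial trial but well within the lowest quarter on the retest, which is the point of the paragraph above.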
Gains in score are also found on retesting with parallel forms of the same test, although such gains tend in general to be smaller. Significant mean gains have been reported when alternate forms of a test were administered in immediate succession or after intervals ranging from one to three years (Angoff, 1971b; Droege, 1966; Peel, 1951, 1952). Similar results have been obtained with normal and intellectually gifted schoolchildren, high school and college students, and employee samples. Data on the distribution of gains to be expected on a retest with a parallel form should be provided in test manuals, and allowance for such gains should be made when interpreting test scores.
TEST SOPHISTICATION. The general problem of test sophistication should also be considered in this connection. The individual who has had extensive prior experience in taking psychological tests enjoys a certain advantage in test performance over one who is taking his first test (Heim & Wallace, 1949-1950; Millman, Bishop, & Ebel, 1965; Rodger, 1936). Part of this advantage stems from having overcome an initial feeling of strangeness, as well as from having developed more self-confidence and better test-taking attitudes. Part is the result of a certain amount of overlap in the type of content and functions covered by many tests. Specific familiarity with common item types and practice in the use of objective answer sheets may also improve performance slightly. It is particularly important to take test sophistication into account when comparing the scores obtained by children from different types of schools, where the extent of test-taking experience may have varied widely. Short orientation and practice sessions, as described earlier in this chapter, can be quite effective in equalizing test sophistication (Wahlstrom & Boersma, 1968).
CHAPTER 3
Social and Ethical Implications of Testing
IN ORDER to prevent the misuse of psychological tests, it has become necessary to erect a number of safeguards around both the tests themselves and the test scores. The distribution and use of psychological tests constitutes a major area in Ethical Standards of Psychologists, the code of professional ethics officially adopted by the American Psychological Association and reproduced in Appendix A. Principles 13, 14, and 15 are specifically directed to testing, being concerned with Test Security, Test Interpretation, and Test Publication. Other principles that, although broader in scope, are highly relevant to testing include 6 (Confidentiality), 7 (Client Welfare), and 9 (Impersonal Services). Some of the matters discussed in the Ethical Standards are closely related to points covered in the Standards for Educational and Psychological Tests (1974), cited in Chapter 1. For a fuller and richer understanding of the principles set forth in the Ethical Standards, the reader should consult two companion publications, the Casebook on Ethical Standards of Psychologists (1967) and Ethical Principles in the Conduct of Research with Human Participants (1973). Both report specific incidents to illustrate each principle. Special attention is given to marginal situations in which there may be a conflict of values, as between the advancement of science for human betterment and the protection of the rights and welfare of individuals.
The requirement that tests be used only by appropriately qualified examiners is one step toward protecting the individual against the improper use of tests. Of course, the necessary qualifications vary with the type of test. Thus, a relatively long period of intensive training and supervised experience is required for the proper use of individual intelligence tests and most personality tests, whereas a minimum of specialized psychological training is needed in the case of educational achievement
or vocational proficiency tests. It should also be noted that students who
take tests in class for instructional purposes are not usually equipped to
administer the tests to others or to interpret the scores properly.
The well-trained examiner chooses tests that are appropriate for both the particular purpose for which he is testing and the person to be examined. He is also cognizant of the available research literature on the chosen test and able to evaluate its technical merits with regard to such characteristics as norms, reliability, and validity. In administering the test, he is sensitive to the many conditions that affect test performance, such as those illustrated in Chapter 2. He draws conclusions or makes recommendations only after considering the test score (or scores) in the light of other pertinent information about the individual. Above all, he should be sufficiently knowledgeable about the science of human behavior to guard against unwarranted inferences in his interpretations of test scores. When tests are administered by psychological technicians or assistants, or by persons in other professions, it is essential that an adequately qualified psychologist be available, at least as a consultant, to provide the needed perspective for a proper interpretation of test performance.
Misconceptions about the nature and purpose of tests and misinterpretations of test results underlie many of the popular criticisms of psychological tests. In part, these difficulties arise from inadequate communication between psychometricians and their various publics: educators, parents, legislators, job applicants, and so forth. Probably the most common examples center on unfounded inferences from IQs. Not all misconceptions about tests, however, can be attributed to inadequate communication between psychologists and laymen. Psychological testing itself has tended to become dissociated from the mainstream of behavioral science (Anastasi, 1967). The growing complexity of the science of psychology has inevitably been accompanied by increasing specialization among psychologists. In this process, psychometricians have concentrated more and more on the technical refinements of test construction and have tended to lose contact with developments in other relevant specialties, such as learning, child development, individual differences, and behavior genetics. Thus, the technical aspects of test construction have tended to outstrip the psychological sophistication with which test results are interpreted. Test scores can be properly interpreted only in the light of all available knowledge regarding the behavior that the tests are designed to measure.
Who is a qualified psychologist? Obviously, with the diversification of the field and the consequent specialization of training, no psychologist is equally qualified in all areas. In recognition of this fact, the Ethical Standards specify: "The psychologist recognizes the boundaries of his competence and the limitations of his techniques and does not offer services or use techniques that fail to meet professional standards established in particular fields" (Appendix A, Principle 2c). A useful distinction is that between a psychologist working in an institutional setting, such as a school system, university, clinic, or government agency, and one engaged in independent practice. Because the independent practitioner is less subject to judgment and evaluation by knowledgeable colleagues than is the institutional psychologist, he needs to meet higher standards of professional qualifications. The same would be true of a psychologist responsible for the supervision of other institutional psychologists or one who serves as an expert consultant to institutional personnel.
A significant step, both in upgrading professional standards and in helping the public to identify qualified psychologists, was the enactment of state licensing and certification laws for psychologists. Nearly all states now have such laws. Although the terms "licensing" and "certification" are often used interchangeably, in psychology certification typically refers to legal protection of the title "psychologist," whereas licensing controls the practice of psychology. Licensing laws thus need to include a definition of the practice of psychology. In either type of law, the requirements are generally a PhD in psychology, a specified amount of supervised experience, and satisfactory performance on a qualifying examination. Violations of the APA ethics code constitute grounds for revoking a certificate or license. Although most states began with the simpler certification laws, there has been continuing movement toward licensing.
At a more advanced level, specialty certification within psychology is provided by the American Board of Professional Psychology (ABPP). Requiring a high level of training and experience within designated specialties, ABPP grants diplomas in such areas as clinical, counseling, industrial and organizational, and school psychology. The Biographical Directory of the APA contains a list of current diplomates in each specialty, which can also be obtained directly from ABPP. The principal function of ABPP is to provide information regarding qualified psychologists. As a privately constituted board within the profession, ABPP does not have the enforcement authority available to the agencies administering the state licensing and certification laws.
The purchase of tests is generally restricted to persons who meet certain minimal qualifications. The catalogues of major test publishers specify requirements that must be met by purchasers. Usually individuals with a master's degree in psychology or its equivalent qualify. Some publishers classify their tests into levels with reference to user qualifications, ranging from educational achievement and vocational proficiency tests, through group intelligence tests and interest inventories, to such clinical instruments as individual intelligence tests and most personality tests. Distinctions are also made between individual purchasers and authorized institutional purchasers of appropriate tests. Graduate students who may need a particular test for a class assignment or for research must have the order countersigned by their psychology instructor, who assumes responsibility for the proper use of the test.

Efforts to restrict the distribution of tests have a dual objective: security of test materials and prevention of misuse. The Ethical Standards state: "Access to such devices is limited to persons with professional interests who will safeguard their use" (Principle 13); "Test scores, like test materials, are released only to persons who are qualified to interpret and use them properly" (Principle 14). It should be noted that although test distributors make sincere efforts to implement these objectives, the control they are able to exert is necessarily limited. The major responsibility for the proper use of tests resides in the individual user or institution concerned. It is evident, for example, that an MA degree in psychology, or even a PhD, state license, and ABPP diploma, do not necessarily signify that the individual is qualified to use a particular test or that his training is relevant to the proper interpretation of the results obtained with that test.

Another professional responsibility concerns the marketing of psychological tests by authors and publishers. Tests should not be released prematurely for general use. Nor should any claims be made regarding the merits of a test in the absence of sufficient objective evidence. When a test is distributed early for research purposes only, this condition should be clearly specified and the distribution of the test restricted accordingly. The test manual should provide adequate data to permit an evaluation of the test itself, as well as full information regarding administration, scoring, and norms. The manual should be a factual exposition of what is known about the test rather than a selling device designed to put the test in a favorable light. It is the responsibility of the test author and publisher to revise tests and norms often enough to prevent obsolescence. The rapidity with which a test becomes outdated will, of course, vary with the nature of the test.

The content of tests should not be published in a newspaper, magazine, or popular book, either for descriptive purposes or for self-evaluation. Under these conditions, self-evaluation would not only be subject to such drastic errors as to be worthless, but it might also be psychologically injurious to the individual. Moreover, any publicity given to specific test items will tend to invalidate the future use of the test with other persons. It might also be added that presentation of test materials in this fashion tends to create an erroneous and distorted picture of testing in general. Such publicity may foster
either naive credulity or indiscriminate resistance on the part of the public toward all psychological testing.
Another unprofessional practice is testing by mail. An individual's performance on either aptitude or personality tests cannot be properly assessed by mailing test forms to him and having him return them by mail for scoring and interpretation. Not only does this procedure provide no control of testing conditions, but usually it also involves the interpretation of test scores in the absence of other pertinent information about the individual. Under these conditions, test results may be worse than useless.
A question arising particularly in connection with personality tests is that of invasion of privacy. Insofar as some tests of emotional, motivational, or attitudinal traits are necessarily disguised, the subject may reveal characteristics in the course of such a test without realizing that he is so doing. Although there are few available tests whose approach is subtle enough to fall into this category, the possibility of developing such indirect testing procedures imposes a grave responsibility on the psychologist who uses them. For purposes of testing effectiveness, it may be necessary to keep the examinee in ignorance of the specific ways in which his responses on any one test are to be interpreted. Nevertheless, a person should not be subjected to any testing program under false pretenses.
Of primary importance in this connection is the obligation to have a clear understanding with the examinee regarding the use that will be made of his test results. The following statement contained in Ethical Standards of Psychologists (Principle 7d) is especially germane to this problem:

The psychologist who asks that an individual reveal personal information in the course of interviewing, testing, or evaluation, or who allows such information to be divulged to him, does so only after making certain that the responsible person is fully aware of the purposes of the interview, testing, or evaluation and of the ways in which the information may be used.
Although concerns about the invasion of privacy have been expressed most commonly about personality tests, they logically apply to any type of test. Certainly any intelligence, aptitude, or achievement test may reveal limitations in skills and knowledge that an individual would rather not disclose. Moreover, any observation of an individual's behavior, as in an interview, casual conversation, or other personal encounter, may yield information about him that he would prefer to conceal and that he may reveal unwittingly. The fact that psychological tests have often been
singled out in discussions of the invasion of privacy probably reflects prevalent misconceptions about tests. If all tests were recognized as measures of behavior samples, with no mysterious powers to penetrate beyond behavior, popular fears and suspicion would be lessened.
It should also be noted that all behavior research, whether employing tests or other observational procedures, presents the possibility of invasion of privacy. Yet, as scientists, psychologists are committed to the goal of advancing knowledge about human behavior. Principle 1a in Ethical Standards of Psychologists (Appendix A) clearly spells out the psychologist's conviction "that society will be best served when he investigates where his judgment indicates investigation is needed." Several other principles, on the other hand, are concerned with the protection of privacy and the welfare of research subjects (see, e.g., 7d, 8a, 16). Conflicts of values may thus arise, which must be resolved in individual cases. Examples of such conflict resolutions can be found in the previously cited Ethical Principles in the Conduct of Research with Human Participants (1973).
The problem is obviously not simple; and it has been the subject of extensive deliberation by psychologists and other professionals. In a report entitled Privacy and Behavioral Research (1967), prepared for the Office of Science and Technology, the right to privacy is defined as "the right of the individual to decide for himself how much he will share with others his thoughts, his feelings, and the facts of his personal life" (p. 2). It is further characterized as "a right that is essential to insure dignity and freedom of self-determination" (p. 2). To safeguard personal privacy, no universal rules can be formulated; only general guidelines can be provided. In the application of these guidelines to specific cases, there is no substitute for the ethical awareness and professional responsibility of the individual psychologist. Solutions must be worked out in terms of the particular circumstances.
One relevant factor is the purpose for which the testing is conducted, whether for individual counseling, institutional decisions regarding selection and classification, or research. In clinical or counseling situations, the client is usually willing to reveal himself in order to obtain help with his problems. The clinician or examiner does not invade privacy where he is freely admitted. Even under these conditions, however, the client should be warned that in the course of the testing or interviewing he may reveal information about himself without realizing that he is so doing; or he may disclose feelings of which he himself is unaware.
When testing is conducted for institutional purposes, the examinee should be fully informed as to the use that will be made of his test scores. It is also desirable, however, to explain to the examinee that correct assessment will benefit him, since it is not to his advantage to be placed in a position where he will fail or which he will find uncongenial. The results of tests administered in a clinical or counseling situation, of course, should not be made available for institutional purposes, unless the examinee gives his consent.
When tests are given for research purposes, anonymity should be preserved as fully as possible and the procedures for ensuring such anonymity should be explained in advance to the subjects. Anonymity does not, however, solve the problem of protecting privacy in all research contexts. Some subjects may resent the disclosure of facts they consider personal, even when complete confidentiality of responses is assured. In most cases, however, cooperation of subjects may be elicited if they are convinced that the information is needed for the research in question and if they have sufficient confidence in the integrity and competence of the investigator. All research on human behavior, whether or not it utilizes tests, may present conflicts of values. Freedom of inquiry, which is essential to the progress of science, must be balanced against the protection of the individual. The investigator must be alert to the values involved and must carefully weigh alternative solutions (see Ethical Principles, 1973; Privacy and Behavioral Research, 1967; Ruebhausen & Brim, 1966).
Whatever the purposes of testing, the protection of privacy involves two key concepts: relevance and consent. The information that the individual is asked to reveal must be relevant to the stated purposes of the testing. An important implication of this principle is that all practicable efforts should be made to ascertain the validity of tests for the particular diagnostic or predictive purpose for which they are used. An instrument that is demonstrably valid for a given purpose is one that provides relevant information. It also behooves the examiner to make sure that test scores are correctly interpreted. An individual is less likely to feel that his privacy is being invaded by a test assessing his readiness for a particular educational program than by a test allegedly measuring his "innate intelligence."
The concept of informed consent also requires clarification; and its application in individual cases may call for the exercise of considerable judgment (Ethical Principles, 1973; Ruebhausen & Brim, 1966). The examinee should certainly be informed about the purpose of testing, the kinds of data sought, and the use that will be made of his scores. It is not implied, however, that he be shown the test items in advance or told how specific responses will be scored. Nor should the test items be shown to a parent, in the case of a minor. Such information would usually invalidate the test. Not only would the giving of this information seriously impair the usefulness of an ability test, but it would also tend to distort responses on many personality tests. For example, if an individual is told in advance that a self-report inventory will be scored with a dominance
the security of test content, the hazards of misunderstanding test scores, and the need of various persons to know the results.
There has been a growing awareness of the right of the individual himself to have access to the findings in his test report. He should also have the opportunity to comment on the contents of the report and if necessary to clarify or correct factual information. Counselors are now trying more and more to involve the client as an active participant in his own assessment. For these purposes, test results should be presented in a form that is readily understandable, free from technical jargon or labels, and oriented toward the immediate objective of the testing. Proper safeguards must be observed against misuse and misinterpretation of test findings (see Ethical Standards, Principle 14).
In the case of minors, one must also consider the parents' right of access to the child's test record. This presents a possible conflict with the child's own right to privacy, especially in the case of older children. In a searching analysis of the problem, Ruebhausen and Brim (1966, pp. 431-432) wrote: "Should not a child, even before the age of full legal responsibility, be accorded the dignity of a private personality? Considerations of healthy personal growth, buttressed with reasons of ethics, seem to command that this be done." The previously mentioned Guidelines (Russell Sage Foundation, 1970, p. 27) recommend that "when a student reaches the age of eighteen and no longer is attending high school, or is married (whether age eighteen or not)," he should have the right to deny parental access to his records. However, this recommendation is followed by the caution that school authorities check local state laws for possible legal difficulties in implementing such a policy.
Apart from these possible exceptions, the question is not whether to communicate test results to parents of a minor but how to do so. Parents normally have a legal right to information about their child; and it is usually desirable for them to have such information. In some cases, moreover, a child's academic or emotional difficulties may arise in part from parent-child relations. Under these conditions, the counselor's contact with the parents is of prime importance, both to fill in background data and to elicit parental cooperation.
Discussions of the confidentiality of test records have usually dealt with accessibility to a third person, other than the individual tested (or parent of a minor) and the examiner (Ethical Standards, Principle 6; Russell Sage Foundation, 1970). The underlying principle is that such records should not be released without the knowledge and consent of the individual.

When tests are administered in an institutional context, as in a school system, court, or employment setting, the individual should be informed at the time of testing regarding the purpose of the test, how the results
his responses are likely to be influenced by stereotyped (and often erroneous) ideas he may have about this trait, or by a false or distorted
In the testing of children, special questions arise with regard to parental consent. Guidelines for such testing are provided in the report sponsored by the Russell Sage Foundation, entitled Guidelines for the Collection, Maintenance, and Dissemination of Pupil Records. With reference to consent, the Guidelines differentiate between individual consent, given by the child, his parents, or both, and representational consent, given by the parents' legally elected or appointed representatives, such as a school board. While avoiding rigid prescriptions, the Guidelines cite intelligence and achievement tests as examples of the type of instrument for which representational consent should be sufficient, at the same time recommending individual consent for personality assessment. A helpful feature of the Guidelines is the inclusion of sample forms for obtaining written consent. There is also a selected bibliography on the ethical and legal aspects of school record keeping.

Experimental designs and procedures that protect the individual's right to decline to participate and that adequately safeguard his privacy, while yielding scientifically meaningful data, present a challenge to the psychologist's ingenuity. With proper rapport and the establishment of attitudes of mutual respect, however, the number of refusals to participate may be reduced to a negligible quantity. The technical difficulties of biased sampling and volunteer error may thus be avoided. Results from both national and statewide surveys suggest that this goal can be achieved, both in the testing of educational outcomes and in the more sensitive area of personality research (Holtzman, 1971; Womer, 1970). There is also some evidence that the number of respondents who regard a personality inventory as an invasion of privacy or consider some of the items offensive is significantly reduced when the inventory is preceded by a simple and forthright explanation of how items were selected and how scores will be interpreted (Fink & Butcher, 1972). From the standpoint of test validity, it should be added that such an explanation did not affect the mean profile of scores on the personality inventory.
CONFIDENTIALITY

Like the protection of privacy, to which it is related, the problem of confidentiality of test data is multifaceted. The fundamental question is: Who shall have access to test results? Several considerations influence the answer in particular situations. Among them are
the capacity to record faithfully, to maintain permanently, to retrieve promptly, and to communicate both widely and instantly.
and their availability to institutional personnel who have a legitimate need for them. Under these conditions, no further permission need be obtained at the time the test results are made available within the institution. A different situation exists when test results are requested by outsiders, as when a prospective employer or a college requests test results from a school system. In these instances, individual consent for the release of the data is required. The same requirement applies to tests administered in clinical and counseling contexts, or for research purposes. The previously cited Guidelines (Russell Sage Foundation, 1970) contain a sample form for the use of school systems in describing the transmission of pupil data.

Another problem pertains to the retention of records in institutions. On the one hand, longitudinal records can be very valuable, not only for research purposes but also for understanding and counseling the individual. As is so often the case, these advantages presuppose proper use and interpretation of test results. On the other hand, the availability of old records opens the way for such misuses as incorrect inferences from obsolete data and unauthorized access for other than the original testing purpose. It would be manifestly absurd, for example, to cite an IQ or a reading achievement score obtained by a child in the third grade when evaluating him for admission to college. Too much may have happened to him in the intervening years to make such early scores meaningful. Similarly, when records are retained for many years, there is danger that they may be used for purposes that the individual (or his parents) never suspected and would not have approved.

To prevent such misuses, when records are retained either for legitimate longitudinal use in the interest of the individual or for acceptable research purposes, access to them should be subject to unusually stringent controls. In the Guidelines (Russell Sage Foundation, 1970), school records are classified into three categories with regard to their retention. A major determining factor in this classification is the degree of objectivity and verifiability of the data; another is relevance to the educational objectives of the school. It would be well for any type of institution to formulate similar explicit policies regarding the destruction, retention, and accessibility of personal records.

The problems of maintenance, security, and accessibility of test results, and of all other personal data, have been magnified by the development of computerized data banks. In a preface to the Guidelines (Russell Sage Foundation, 1970, pp. 5-6), Ruebhausen wrote:
Modern science has introduced a new dimension into the issues of privacy. There was a time when among the strongest allies of privacy were the inefficiency of man, the fallibility of his memory, and the healing compassion that accompanied both the passing of time and the warmth of human recollection. Modern science has given us
The unprecedented advances in storing, processing, and retrieving data made possible by computers can be of inestimable service both in research and in the more immediate handling of social problems. The potential dangers of invasion of privacy and violation of confidentiality need to be faced squarely, constructively, and imaginatively. Rather than fearing the centralization and efficiency of complex computer systems, we should explore the possibility that these very characteristics may permit more effective procedures for protecting the security of individual records.
An example of what can be accomplished with adequate facilities is
provided by the Link system developed by the American Council on
Education (Astin & Boruch, 1970). In a longitudinal research program
on the effects of different types of college environments, questionnaires
were administered annually to several hundred thousand college fresh-
men. To permit the collection of follow-up data on the same persons
while preventing the identification of individual responses by anyone at
any future time, a three-file system of computer tapes was devised. The
first tape, containing each student's responses marked with an arbitrary
identification number, is readily accessible for research purposes. The
second tape, containing only the students' names and addresses with the
same identification numbers, was originally housed in a locked vault and
used only to print labels for follow-up mailings. After the preparation of
these tapes, the original questionnaires were destroyed.
This two-file system represents the traditional security system. It still
did not provide complete protection, since some staff members would
have access to both files. Moreover, such files are subject to judicial and
legislative subpoena. For these reasons, a third file was prepared. Known
as the Link file, it contained only the original identification numbers and
a new set of random numbers which were substituted for the original
identification numbers in the name and address file. The Link file was
deposited at a computer facility in a foreign country, with the agreement
that the file would never be released to anyone, including the American
Council on Education. Follow-up data tapes are sent to the foreign fa-
cility, which substitutes one set of code numbers for the other. With the
decoding files and the research data files under the control of different
organizations, no one can identify the responses of individuals in the
data files. Such elaborate precautions for the protection of confidentiality
obviously would not be feasible except in a large-scale computerized data
bank. The procedure could be simplified somewhat if the linking facility
were located in a domestic agency given adequate protection against
subpoena.
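The logic of the three-file scheme can be sketched in a few lines of code. This is a simplified illustration only, not the actual ACE system; the IDs, names, and code ranges are all invented:

```python
import random

# File 1 (research file): arbitrary ID -> responses; openly usable for research.
research_file = {101: {"responses": [3, 5, 2]}, 102: {"responses": [1, 4, 4]}}

names = {101: "A. Student", 102: "B. Student"}   # illustrative only

link_file = {}       # File 3: original ID -> new random code (held abroad)
address_file = {}    # File 2: new random code -> name, for mailing labels
new_codes = random.sample(range(10_000, 99_999), k=len(names))
for (orig_id, name), code in zip(names.items(), new_codes):
    link_file[orig_id] = code
    address_file[code] = name

def decode_followup(followup, link):
    """Performed only by the foreign facility: replace the random codes on
    incoming follow-up data with the original research-file IDs."""
    reverse = {code: orig for orig, code in link.items()}
    return {reverse[code]: data for code, data in followup.items()}

# Follow-up responses come back labeled with the new codes ...
followup = {link_file[101]: {"responses": [2, 5, 3]}}
# ... and only the link-file holder can translate them back.
print(decode_followup(followup, link_file))   # keyed by original ID 101
```

The point of the arrangement is that no single file holder can pair a name with a set of responses: the research file carries only arbitrary IDs, the address file carries only the new random codes, and the Link file that connects the two sets of numbers is held by an outside party.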
Psychologists have given much thought to the communication of test
results in a form that will be meaningful and useful. It is clear that the
information should not be transmitted routinely, but should be accom-
panied by interpretive explanations by a professionally trained person.
In communicating scores to parents, for example, a recommended
procedure is to arrange a group meeting at which a counselor or school
psychologist explains the purpose and nature of the tests, the sort of
conclusions that may reasonably be drawn from the results, and the
limitations of the data. Written reports about their own children may
then be distributed to the parents, and arrangements made for personal
interviews with any parents wishing to discuss the reports further. Re-
gardless of how they are transmitted, however, an important condition
is that test results should be presented in terms of descriptive perform-
ance levels rather than isolated numerical scores. This is especially true
of intelligence tests, which are more likely to be misinterpreted than are
achievement tests.

In communicating results to teachers, school administrators, employers,
and other appropriate persons, similar safeguards should be provided.
Levels of performance and qualitative descriptions in simple terms are
to be preferred over specific numerical scores, except when com-
municating with adequately trained professionals. Even well-educated
laymen have been known to confuse percentiles with percentage scores,
percentile ranks with IQ's, norms with standards, and interest ratings with
aptitude scores. But a more serious misinterpretation pertains to the con-
clusions drawn from test scores, even when their technical meaning is
correctly understood. A familiar example is the popular assumption that
the IQ indicates a fixed characteristic of the individual which prede-
termines his lifetime level of intellectual achievement.

In all communication of test results, it is desirable to take into account
the characteristics of the person who is to receive the information. This
applies not only to that person's general education and his knowledge
about psychology and testing, but also to his anticipated emotional
response to the information. In the case of a parent or teacher, for
example, personal emotional involvement with the child may interfere
with a calm and rational acceptance of factual information.

Last but by no means least is the problem of communicating test re-
sults to the individual himself, whether child or adult. The same general
safeguards against misinterpretation apply here as in communicating to
a third party. The person's emotional reaction to the information is
especially important, of course, when he is learning about his own assets
and shortcomings. When an individual is given his own test results, not
Social and Ethical Implications of Testing
only should the data be interpreted by a properly qualified person, but
facilities should also be available for counseling anyone who may become
emotionally disturbed by such information. For example, a college stu-
dent might become seriously discouraged when he learns of his poor
performance on a scholastic aptitude test. A gifted schoolchild might de-
velop habits of laziness and shiftlessness, or he might become uncoop-
erative and unmanageable, if he discovers that he is much brighter than
any of his associates. A severe personality disorder may be precipitated
when a maladjusted individual is given his score on a personality test.
Such detrimental effects may, of course, occur regardless of the correct-
ness or incorrectness of the score itself. Even when a test has been ac-
curately administered and scored and properly interpreted, a knowledge
of such a score without the opportunity to discuss it further may be
harmful to the individual.

Counseling psychologists have been especially concerned with the de-
velopment of effective ways of transmitting test information to their
clients (see, e.g., Goldman, 1971, Ch. 14-16). Although the details of
this process are beyond the scope of our present discussion, two major
guidelines are of particular interest. First, test-reporting is to be viewed
as an integral part of the counseling process and incorporated into the
total counselor-client relationship. Second, insofar as possible, test results
should be reported as answers to specific questions raised by the coun-
selee. An important consideration in counseling relates to the counselee's
acceptance of the information presented to him. The counseling situation
is such that if the individual rejects any information, for whatever rea-
sons, then that information is likely to be totally wasted.
THE SETTING. The decades since 1950 have witnessed an increasing
public concern with the rights of minorities,1 a concern that is reflected in
the enactment of civil rights legislation at both federal and state levels.
In connection with mechanisms for improving educational and vocational
opportunities of such groups, psychological testing has been a major
focus of attention. The psychological literature of the 1960s and early
1970s contains many discussions of the topic, whose impact ranges from
clarification to obfuscation. Among the more clarifying contributions are
several position papers by professional associations (see, e.g., American
Psychological Association, 1969; Cleary, Humphreys, Kendrick, & Wes-
1 Although women represent a statistical majority in the national population,
legally, occupationally, and in other ways they have shared many of the problems
of minorities. Hence, when the term "minority" is used in this section, it will be
understood to include women.
Context of Psychological Testing

man, 1975; Deutsch, Fishman, Kogan, North, & Whiteman, 1964; The
responsible use of tests, 1972). A brief but cogent paper by Flaugher
also helps to clear away some prevalent sources of confusion. Much
of the concern centers on the lowering of test scores by cultural condi-
tions that may have affected the development of aptitudes, interests,
motivation, attitudes, and other psychological characteristics of minority
group members. Some of the proposed solutions for the problem, how-
ever, reflect misunderstandings about the nature and function of psycho-
logical tests. Differences in the experiential backgrounds of groups or
individuals are inevitably manifested in test performance. Every psycho-
logical test measures a behavior sample. Insofar as culture affects be-
havior, its influence will and should be detected by tests. If we rule out
all cultural differentials from a test, we may thereby lower its validity as
a measure of the behavior domain it was designed to assess. In that case,
the test would fail to provide the kind of information needed to correct
the very conditions that impaired performance.

Because the testing of minorities represents a special case within the
broader problem of cross-cultural testing, the underlying theoretical
rationale and testing procedures are discussed more fully in Chapter 12.
A technical analysis of the concept of test bias is given in Chapter 7, in
connection with test validity. In the present chapter, our interest is
primarily in the basic issues and social implications of minority group
testing.
iarity with such objects. On the other hand, if the development of arith-
metic ability itself is more strongly fostered in one culture than in an-
other, scores on an arithmetic test should not eliminate or conceal such
a difference.
Another, more subtle way in which specific test content may spuriously
affect performance is through the examinee's emotional and attitudinal
responses. Stories or pictures portraying typical suburban middle-class
family scenes, for example, may alienate a child reared in a low-income
inner-city home. Exclusive representation of the physical features of a
single racial type in test illustrations may have a similar effect on mem-
bers of an ethnic minority. In the same vein, women's organizations have
objected to the perpetuation of sex stereotypes in test content, as in the
portrayal of male doctors or executives and female nurses or secretaries.
Certain words, too, may have acquired connotations that are offensive to
minority groups. As one test publisher aptly expressed it, "Until fairly
recently, most standardized tests were constructed by white middle-class
people, who sometimes clumsily violate the feelings of the test-taker
without even knowing it. In a way, one could say that we have been not
so much culture biased as we have been 'culture blind'" (Fitzgibbon,
1972, pp. 2-3).
The major test publishers now make special efforts to weed out in-
appropriate test content. Their own test construction staffs have become
sensitized to potentially offensive, culturally restricted, or stereotyped
material. Members of different ethnic groups participate either as regular
staff members or as consultants. And the reviewing of test content with
reference to possible minority implications is a regular step in the process
of test construction. An example of the application of these procedures
in item construction and revision is provided by the 1970 edition of the
Metropolitan Achievement Tests (Fitzgibbon, 1972; Harcourt Brace Jo-
vanovich, 1972).
TEST-RELATED FACTORS. In testing culturally diverse persons, it is im-
portant to differentiate between cultural factors that affect both test and
criterion behavior and those whose influence is restricted to the test. It is
the latter, test-related factors that reduce test validity. Examples of such
factors include previous experience in taking tests, motivation to perform
well on tests, rapport with the examiner, and any other variables af-
fecting performance on the particular test but irrelevant to the criterion
behavior under consideration. Special efforts should be made to reduce
the operation of these test-related factors when testing persons with dis-
similar cultural backgrounds. A desirable procedure is to provide ade-
quate test-taking orientation and preliminary practice, as illustrated by
the booklets and tape recordings cited in Chapter 2. Retesting with a
parallel form is also recommended with low-scoring examinees who have
had little or no prior testing experience.

Other conditions that may affect test scores, while being unrelated to
criterion performance, include specific test content. For example, the use
of names or pictures of objects unfamiliar in a particular cultural milieu
would obviously represent a test-restricted handicap. Ability to carry out
quantitative thinking does not depend upon famil-
INTERPRETATION AND USE OF TEST SCORES. By far the most important
considerations in the testing of culturally diverse groups-as in all testing
-pertain to the interpretation of test scores. The most frequent misgiv-
ings regarding the use of tests with minority group members stem from
misinterpretations of scores. If a minority examinee obtains a low score
on an aptitude test or a deviant score on a personality test, it is essential
to investigate why he did so. For example, an inferior score on an arith-
metic test could result from low test-taking motivation, poor reading
ability, or inadequate knowledge of arithmetic, among other reasons.
Some thought should also be given to the type of norms to be employed
in evaluating individual scores. Depending on the purpose of the testing,
the appropriate norms may be general norms or subgroup norms based on
Many bright, non-conforming pupils, with backgrounds different from those of
their teachers, make favorable showings on achievement tests, in contrast to
their low classroom marks. These are very often children whose cultural handi-
caps are most evident in their overt social and interpersonal behavior. Without
the intervention of standardized tests, many such children would be stigma-
tized by the adverse subjective ratings of teachers who tend to reward con-
formist behavior of middle-class character.
an IQ would thus serve to perpetuate their handicap. It is largely be-
cause implications of permanent status have become attached to the IQ
that in 1964 the use of group intelligence tests was discontinued in the
New York City public schools (H. B. Gilbert, 1966; Loretan, 1966). That
it proved necessary to discard the tests in order to eliminate the miscon-
ceptions about the fixity of the IQ is a revealing commentary on the
tenacity of the misconceptions. It should also be noted that the use of
individual intelligence tests like the Stanford-Binet, which are admin-
istered and interpreted by trained examiners and school psychologists,
was not eliminated. It was the mass testing and routine use of IQs by
relatively unsophisticated persons that was considered hazardous.

According to a popular misconception, the IQ is an index of innate
intellectual potential and represents a fixed property of the organism. As
will be seen in Chapter 12, this view is neither theoretically defensible
nor supported by empirical data. When properly interpreted, intelligence
test scores should not foster a rigid categorizing of persons. On the con-
trary, intelligence tests-and any other test-may be regarded as a map
on which the individual's present position can be located. When com-
bined with information about his experiential background, test scores
should facilitate effective planning for the optimal development of the
individual.
OBJECTIVITY OF TESTS. When social stereotypes and prejudice may dis-
tort interpersonal evaluations, tests provide a safeguard against fa-
voritism and arbitrary or capricious decisions. Commenting on the use of
tests in schools, Gardner (1961, pp. 48-49) wrote: "The tests couldn't see
whether the youngster was in rags or in tweeds, and they couldn't hear
the accents of the slum. The tests revealed intellectual gifts at every level
of the population."

In the same vein, the Guidelines for Testing Minority Group Children
(Deutsch et al., 1964, p. 139) contain the following observation:
With regard to personnel selection, the contribution of tests was aptly
characterized in the following words by John W. Macy, Jr., Chairman of
the United States Civil Service Commission (Testing and Public Policy,
1965, p. 883):
LEGAL REGULATIONS. A number of states enacted legislation and estab-
lished Fair Employment Practices Commissions (FEPC) to implement it,
prior to the development of such legal mechanisms at the federal level.
Among the states that did so, efforts have been made to pattern the
regulations after the federal model.2 The most pertinent federal
legislation is provided by the Equal Employment Opportunity Act (Title
VII of the Civil Rights Act of 1964 and its subsequent amendments).
Responsibility for implementation and enforcement is vested in the
Equal Employment Opportunity Commission (EEOC). When charges
are filed, the EEOC investigates the complaint and, if it finds the charges
to be justified, tries first to correct the situation through conferences and
voluntary compliance. If these procedures fail, EEOC may proceed to
hold hearings, issue cease and desist orders, and finally bring action in

2 A brief summary of the major legal developments since midcentury, including
legislative actions, executive orders, and court decisions, can be found in Fincher
(1973).
the federal courts. In states having an approved FEPC, the Commission
will defer to the local agency and will give its findings and conclusions
"substantial weight."
The Office of Federal Contract Compliance (OFCC) has the authority
to monitor the use of tests for employment purposes by government con-
tractors. Colleges and universities are among the institutions concerned
with OFCC regulations, because of their many research and training
grants from such federal sources as the Department of Health, Educa-
tion, and Welfare. Both EEOC and OFCC have drawn up guidelines re-
garding employee testing and other selection procedures, which are vir-
tually identical in substance. A copy of the EEOC Guidelines on Em-
ployee Selection Procedures is reproduced in Appendix B, together with
a 1974 amendment of the OFCC guidelines clarifying acceptable pro-
cedures for reporting test validity.3
Some major provisions in the EEOC Guidelines should be noted. The
Equal Employment Opportunity Act prohibits discrimination by em-
ployers, trade unions, or employment agencies on the basis of race, color,
religion, sex, or national origin. It is recognized that properly conducted
testing programs not only are acceptable under this Act but can also
contribute to the "implementation of nondiscriminatory personnel poli-
cies." Moreover, the same regulations specified for tests are also applied
to all other formal and informal selection procedures, such as educational
or work-history requirements, interviews, and application forms (Sec-
tions 2 and 13).
\Vhen the use of a test (or other selection procedure) results in a
significantly higher rejection rate for minority candidates than for non-
minority candidates, its utility must be justified by evidence of validity
for the job in question. In defining acceptable procedures for establish-
ing validity, the Guidelines make explicit reference to the Standards for
Educational and Psychological Tests (1974) prepared by the American
Psychological Association. A major portion of the Guidelines covers mini-
mum requirements for acceptable validation (Sections 5 to 9). The
reader may find it profitable to review these requirements after reading
the more detailed technical discussion of validity in Chapters 6 and 7 of
this book. It will be seen that the requirements are generally in line with
good psychometric practice.
In the final section, dealing with affirmative action, the Guidelines
point out that even when selection procedures have been satisfactorily
The necessity to measure characteristics of people that are related to job per-
formance is at the very root of the merit system, which is the basis for entry
into the career services of the Federal Government. Thus, over the years, the
service has had a vital interest in the development and application of psycho-
logical testing methods. I have no doubt that the widespread public confidence
in the objectivity of our procedures has in large part been fostered by the
public's perception of the fairness, the practicality, and the . . . of the
appraisal methods they must submit to.
The Guidelines on Employee Selection Procedures, prepared by the
Equal Employment Opportunity Commission (1970) as an aid in the
implementation of the Civil Rights Act, begin with the following state-
ment of purpose:

The guidelines in this part are based on the belief that properly validated and
standardized employee selection procedures can significantly contribute to the
implementation of nondiscriminatory personnel policies, as required by Title
VII. It is also recognized that professionally developed tests, when used in
conjunction with other tools of personnel assessment and complemented by
sound programs of job design, may significantly aid in the development and
maintenance of an efficient work force and, indeed, aid in the utilization and
conservation of human resources generally.
In summary, tests can be misused in testing culturally disadvantaged
persons-as in testing anyone else. When properly used, however, they
serve an important function in preventing irrelevant and unfair discrim-
ination. They also provide a quantitative index of the extent of cultural
handicap as a necessary first step in remedial programs.
3 In 1973, in the interest of simplification and improved coordination, the prepara-
tion of a set of uniform guidelines was undertaken by the Equal Employment Op-
portunity Coordinating Council, consisting of representatives of EEOC, the U.S.
Department of Justice, the U.S. Civil Service Commission, the U.S. Department of
Labor, and the U.S. Commission on Civil Rights. No uniform version has as yet
been adopted.
validated, if disproportionate rejection rates result for minorities, steps
should be taken to reduce this discrepancy as much as possible. Affirmative
action implies that an organization does more than merely avoiding dis-
criminatory practices. Psychologically, affirmative action programs may
be regarded as efforts to compensate for the residual effects of past social
inequities. Such effects may include deficiencies in aptitudes, job skills,
information, motivation, and other job-related behavior. They may also be
reflected in a person's reluctance to apply for a job not traditionally open
to minority candidates, or in his inexperience in job-seeking procedures.
Examples of affirmative actions in meeting these problems include re-
cruiting through media most likely to reach minorities; explicitly en-
couraging minority candidates to apply and following other recruiting
practices designed to counteract past stereotypes; and, when practicable,
providing special training programs for the acquisition of prerequisite
skills and knowledge.
PART 2
Principles of
Psychological Testing
CHAPTER 4
Norms and the
Interpretation of Test Scores
IN THE absence of additional interpretive data, a raw score on any
psychological test is meaningless. To say that an individual has
correctly solved 15 problems on an arithmetic reasoning test, or
identified 34 words in a vocabulary test, or successfully assembled a
mechanical object in 57 seconds conveys little or no information about
his standing in any of these functions. Nor do the familiar percentage
scores provide a satisfactory solution to the problem of interpreting test
scores. A score of 65 percent correct on one vocabulary test, for example,
might be equivalent to 30 percent correct on another, and to 80 percent
correct on a third. The difficulty level of the items making up each test
will, of course, determine the meaning of the score. Like all raw scores,
percentage scores can be interpreted only in terms of a clearly defined
and uniform frame of reference.
Scores on psychological tests are most commonly interpreted by ref-
erence to norms which represent the test performance of the stand-
ardization sample. The norms are thus empirically established by de-
termining what a representative group of persons actually do on the test.
Any individual's raw score is then referred to the distribution of scores
obtained by the standardization sample, to discover where he falls in that
distribution. Does his score coincide with the average performance of the
standardization group? Is he slightly below average? Or does he fall near
the upper end of the distribution?
In order to determine more precisely the individual's exact position
with reference to the standardization sample, the raw score is converted
into some relative measure. These derived scores are designed to serve a
dual purpose. First, they indicate the individual's relative standing in
the normative sample and thus permit an evaluation of his performance
in reference to other persons. Second, they provide comparable measures
that permit a direct comparison of the individual's performance on dif-
ferent tests. For example, if an individual has a raw score of 40 on a
vocabulary test and a raw score of 22 on an arithmetic reasoning test, we
know nothing about his relative performance on the two tests. Is he
better in vocabulary or in arithmetic, or equally good in both? Since
raw scores on different tests are usually expressed in different units, a
direct comparison of such scores is impossible. The difficulty level of the
particular test would also affect such a comparison between raw scores.
Derived scores, on the other hand, can be expressed in the same units
and referred to the same or to closely similar normative samples for
different tests. The individual's relative performance in many different
functions can thus be compared.

There are various ways in which raw scores may be converted to fulfill
the two objectives stated above. Fundamentally, however, derived scores
are expressed in one of two major ways: (1) developmental level at-
tained, or (2) relative position within a specified group. These types of
scores, together with some of their common variants, will be considered
in separate sections of this chapter. But first it will be necessary to ex-
amine certain elementary statistical concepts that underlie the develop-
ment and utilization of norms. The following section is included simply
to clarify the meaning of certain common statistical measures. Simplified
computational examples are given only for this purpose and not to pro-
vide training in statistical methods. For computational details and spe-
cific procedures to be followed in the practical application of these tech-
niques, the reader is referred to any recent textbook on psychological or
educational statistics.
TABLE 1
Frequency Distribution of Scores of 1,000 College Students
on a Code-Learning Test
(Data from Anastasi, 1934, p. 34)

Class Interval    Frequency
52-55                  1
48-51                  1
44-47                 20
40-43                 73
36-39                156
32-35                328
28-31                244
24-27                136
20-23                 28
16-19                  8
12-15                  3
8-11                   2
Total              1,000
The object of statistical method is to organize and summarize quanti-
tative data in order to facilitate their understanding. A list of 1,000
scores can be an overwhelming sight. In that form, it conveys little
meaning. A first step in bringing order into such a chaos of raw data is to
tabulate the scores into a frequency distribution, as illustrated in Table 1.
Such a distribution is prepared by grouping the scores into convenient
class intervals and tallying each score in the appropriate interval. When
all the scores have been entered, the tallies are counted to find the
frequency, or number of cases, in each class interval. The sums of these
frequencies equal the total number of cases in the group. Table 1 shows
the scores of 1,000 college students in a code-learning test in which one
set of artificial words, or nonsense syllables, was to be substituted for an-
other. The scores, giving number of correct syllables substituted during
a two-minute trial, ranged from 8 to 52. They have been grouped into
class intervals of 4 points, from 52-55 at the top of the distribution down
to 8-11. The frequency column reveals that two persons scored between
8 and 11, three between 12 and 15, eight between 16 and 19, and so on.
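The tallying procedure just described is easy to mimic in code. The interval width of 4 and the starting score of 8 follow the text; the short list of raw scores below is invented for the illustration:

```python
def frequency_distribution(scores, low, width):
    """Group scores into class intervals of the given width, starting at
    `low`, and count the frequency in each interval (as in Table 1)."""
    freqs = {}
    for s in scores:
        start = low + ((s - low) // width) * width   # lower bound of interval
        key = (start, start + width - 1)             # e.g. (8, 11)
        freqs[key] = freqs.get(key, 0) + 1
    # List the highest interval first, as in the printed table.
    return dict(sorted(freqs.items(), reverse=True))

scores = [9, 14, 41, 33, 33, 28, 41, 35, 22, 38]     # invented raw scores
for (lo, hi), n in frequency_distribution(scores, low=8, width=4).items():
    print(f"{lo}-{hi}: {n}")
```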
The information provided by a frequency distribution can also be
presented graphically in the form of a distribution curve. Figure 1 shows
the data of Table 1 in graphic form. On the baseline, or horizontal axis,
are the scores grouped into class intervals; on the vertical axis are the
frequencies, or number of cases falling within each class interval. The
graph has been plotted in two ways, both forms being in common use.
In the histogram, the height of the column erected over each class in-
terval corresponds to the number of persons scoring in that interval. We
can think of each individual as standing on another's shoulders to form
the column. In the frequency polygon, the number of persons in each
interval is indicated by a point placed in the center of the class interval
and across from the appropriate frequency. The successive points are
then joined by straight lines.
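The height-of-column idea behind the histogram can be shown with a crude text plot, using the frequencies of Table 1 (each # stands for roughly 10 cases):

```python
# Frequencies as read from Table 1, highest class interval first.
freqs = {"52-55": 1, "48-51": 1, "44-47": 20, "40-43": 73, "36-39": 156,
         "32-35": 328, "28-31": 244, "24-27": 136, "20-23": 28,
         "16-19": 8, "12-15": 3, "8-11": 2}

for interval, n in freqs.items():
    # One '#' per 10 cases; the longest bar marks the modal interval 32-35.
    print(f"{interval:>5} | {'#' * (n // 10)}")
```

The bars peak at the 32-35 interval and taper off toward both extremes, the same roughly bell-shaped pattern described for Figure 1.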
Except for minor irregularities, the distribution portrayed in Figure 1
resembles the bell-shaped normal curve. A mathematically determined,
perfect normal curve is reproduced in Figure 3. This type of curve has
important mathematical properties and provides the basis for many kinds
of statistical analyses. For the present purpose, however, only a few fea-
tures will be noted. Essentially, the curve indicates that the largest
number of cases cluster in the center of the range and that the number
The most obvious and familiar way of reporting variability is in terms of
the range between the highest and lowest score. The range, however, is
extremely crude and unstable, for it is determined by only two scores. A
single unusually high or low score would thus markedly affect its size. A
more precise method of measuring variability is based on the difference
between each individual's score and the mean of the group.

At this point it will be helpful to look at the example in Table 2, in
which the various measures under consideration have been computed on
a small group of scores. A small group was chosen to simplify the demon-
stration, although in actual practice we would rarely perform these com-
putations on so few cases. Table 2 serves also to introduce certain stand-
ard statistical symbols that should be noted for future reference. Original
raw scores are conventionally designated by a capital X, and a small x is
used to refer to deviations of each score from the group mean. The Greek
letter Σ means "sum of." It will be seen that the first column in Table 2
gives the data for the computation of mean and median. The mean is
40; the median is 40.5, falling midway between 40 and 41. Five cases
drops off gradually in both directions as the extremes are approached.
The curve is bilaterally symmetrical, with a single peak in the center.
Most distributions of human traits, from height and weight to aptitudes
and personality characteristics, approximate the normal curve. In gen-
eral, the larger the group, the more closely will the distribution resemble
the theoretical normal curve.
TABLE 2
Illustration of Central Tendency and Variability

Score (X)    Deviation (x)    Deviation Squared (x²)
48               +8                  64
47               +7                  49
43               +3                   9
41               +1                   1
41               +1                   1
40                0                   0
38               -2                   4
36               -4                  16
34               -6                  36
32               -8                  64
ΣX = 400     Σ|x| = 40           Σx² = 244

M = ΣX/N = 400/10 = 40
Median = 40.5 (midway between 40 and 41)
AD = Σ|x|/N = 40/10 = 4
Variance = σ² = Σx²/N = 244/10 = 24.40
SD or σ = √(Σx²/N) = √24.40 = 4.9
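The computations of Table 2 can be verified directly. The ten scores below are chosen to be consistent with the sums shown in the table (ΣX = 400, Σ|x| = 40, Σx² = 244):

```python
from statistics import mean, median

scores = [48, 47, 43, 41, 41, 40, 38, 36, 34, 32]   # consistent with Table 2

M = mean(scores)                                  # ΣX / N = 400 / 10
Md = median(scores)                               # midway between 40 and 41
devs = [s - M for s in scores]                    # deviations x; sum to zero
AD = sum(abs(x) for x in devs) / len(scores)      # average deviation, Σ|x| / N
variance = sum(x * x for x in devs) / len(scores) # σ² = Σx² / N
SD = variance ** 0.5                              # σ

print(M, Md, AD, variance, round(SD, 1))
```

Running this reproduces the values in the table: a mean of 40, a median of 40.5, an AD of 4, a variance of 24.40, and an SD of 4.9.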
FIG. 1. Distribution Curves: Frequency Polygon and Histogram.
(Data from Table 1.)
A group of scores can also be described in terms of some measure of
central tendency. Such a measure provides a single, most typical or repre-
sentative score to characterize the performance of the entire group. The
most familiar of these measures is the average, more technically known
as the mean (M). As is well known, this is found by adding all scores
and dividing the sum by the number of cases (N). Another measure of
central tendency is the mode, or most frequent score. In a frequency
distribution, the mode is the midpoint of the class interval with the
highest frequency. Thus, in Table 1, the mode falls midway between 32
and 35, being 33.5. It will be noted that this score corresponds to the
highest point on the distribution curve in Figure 1. A third measure of
central tendency is the median, or middlemost score when all scores
have been arranged in order of size. The median is the point that bisects
the distribution, half the cases falling above it and half below.Further description of a set of test scores is given by measures of varia-
, "', ..• 1. r ~ ••• ~"'t "f ;..,rl;"i"'l1~ 1 flifkrences around the central tendency.
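The three measures of central tendency just described can be checked with Python's statistics module; a minimal sketch, using ten scores consistent with Table 2's summary values:

```python
from statistics import mean, median, mode

scores = [48, 47, 43, 41, 41, 40, 38, 36, 34, 32]

assert mean(scores) == 40      # sum of 400 divided by N = 10
assert median(scores) == 40.5  # midway between the 5th and 6th scores in order
assert mode(scores) == 41      # the only score obtained by two persons
```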
(50 percent) are above the median and five below. There is little point in computing a mode in such a small group, since the cases do not show clear-cut clustering on any one score. Technically, however, 41 would represent the mode, because two persons obtained this score, while all other scores occur only once.

The second column shows how far each score deviates above or below the mean of 40. The sum of these deviations will always equal zero, because the positive and negative deviations around the mean necessarily balance, or cancel each other out (+20 - 20 = 0). If we ignore signs, however, we can average the absolute deviations, thus obtaining a measure known as the average deviation (AD). The symbol Σ|x| in the AD formula indicates that absolute values were summed, without regard to sign. Although of some descriptive value, the AD is not suitable for use in further mathematical analyses because of the arbitrary discarding of signs.
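The cancellation of signed deviations, and the AD, variance, and SD computations of Table 2, can be sketched directly (scores chosen to match Table 2's summary values):

```python
scores = [48, 47, 43, 41, 41, 40, 38, 36, 34, 32]
n = len(scores)
m = sum(scores) / n                    # mean = 40

deviations = [x - m for x in scores]
assert sum(deviations) == 0            # + and - deviations cancel out

ad = sum(abs(d) for d in deviations) / n        # average deviation: 40/10
variance = sum(d ** 2 for d in deviations) / n  # mean square deviation: 244/10
sd = variance ** 0.5                            # square root of the variance

assert ad == 4.0
assert variance == 24.4
assert round(sd, 1) == 4.9
```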
FIG. 3. Percentage Distribution of Cases in a Normal Curve.
different tests in terms of norms, as will be shown in the section on standard scores. The interpretation of the SD is especially clear-cut when applied to a normal or approximately normal distribution curve. In such a distribution, there is an exact relationship between the SD and the proportion of cases, as shown in Figure 3. On the baseline of this normal curve have been marked distances representing one, two, and three standard deviations above and below the mean. For instance, in the example given in Table 2, the mean would correspond to a score of 40, +1σ to 44.9 (40 + 4.9), +2σ to 49.8 (40 + 2 × 4.9), and so on. The percentage of cases that fall between the mean and +1σ in a normal curve is 34.13. Because the curve is symmetrical, 34.13 percent of the cases are likewise found between the mean and -1σ, so that between +1σ and -1σ on both sides of the mean there are 68.26 percent of the cases. Nearly all the cases (99.72 percent) fall within ±3σ from the mean. These relationships are particularly relevant in the interpretation of standard scores and percentiles, to be discussed in later sections.
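These proportions can be recomputed from the normal distribution function; a sketch using Python's statistics.NormalDist with the Table 2 mean and SD (the continuous curve gives 68.27, 95.45, and 99.73; the slightly smaller figures quoted above reflect truncated rather than rounded table values):

```python
from statistics import NormalDist

nd = NormalDist(mu=40, sigma=4.9)  # mean and SD from Table 2

# Proportion of cases falling within 1, 2, and 3 SDs of the mean.
for k in (1, 2, 3):
    pct = (nd.cdf(40 + k * 4.9) - nd.cdf(40 - k * 4.9)) * 100
    print(f"within ±{k} SD: {pct:.2f}%")  # 68.27%, 95.45%, 99.73%
```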
One way in which meaning can be attached to test scores is to indicate how far along the normal developmental path the individual has progressed. Thus an 8-year-old who performs as well as the average 10-year-old on an intelligence test may be described as having a mental age of 10; a mentally retarded adult who performs at the same level would likewise be assigned an MA of 10. In a different context, a fourth-grade child may be characterized as reaching the sixth-grade norm in a reading test and the third-grade norm in an arithmetic test. Other developmental systems utilize more highly qualitative descriptions of behavior in specific
FIG. 2. Frequency Distributions with the Same Mean but Different Variability.
A much more serviceable measure of variability is the standard deviation (symbolized by either SD or σ), in which the negative signs are legitimately eliminated by squaring each deviation. This procedure has been followed in the last column of Table 2. The sum of this column (Σx²) divided by the number of cases (N) is known as the variance, or mean square deviation, and is symbolized by σ². The variance has proved extremely useful in sorting out the contributions of different factors to individual differences in test performance. For the present purposes, however, the chief concern is with the SD, which is the square root of the variance, as shown in Table 2. This measure is commonly employed in comparing the variability of different groups. In Figure 2, for example, are two distributions having the same mean but differing in variability. The distribution with wider individual differences yields a larger SD than one with narrower individual differences. The SD also provides the basis for expressing an individual's scores on
functions ranging from sensorimotor activities to concept formation. However expressed, scores based on developmental norms tend to be psychometrically crude and do not lend themselves well to precise statistical treatment. Nevertheless, they have considerable appeal for descriptive purposes, especially in the intensive clinical study of individuals and for certain research purposes.
Norms and the Interpretation of Test Scores 75
readily visualized if we think of the individual's height as being expressed in terms of height age. The difference in inches between a height age of 3 and 4 years would be greater than that between a height age of 10 and 11. Owing to the progressive shrinkage of the MA unit, one year of acceleration or retardation at, let us say, age 5 represents a larger deviation from the norm than does one year of acceleration or retardation at age 10.
MENTAL AGE. In Chapter 1 it was noted that the term "mental age" was widely popularized through the various translations and adaptations of the Binet-Simon scales, although Binet himself had employed the more neutral term "mental level." In age scales such as the Binet and its revisions, items are grouped into year levels. For example, those items passed by the majority of 7-year-olds in the standardization sample are placed in the 7-year level, those passed by the majority of 8-year-olds are assigned to the 8-year level, and so forth. A child's score on the test will then correspond to the highest year level that he can successfully complete. In actual practice, the individual's performance shows a certain amount of scatter. In other words, the subject fails some tests below his mental age level and passes some above it. For this reason, it is customary to compute the basal age, i.e., the highest age at and below which all tests are passed. Partial credits, in months, are then added to this basal age for all tests passed at higher year levels. The child's mental age on the test is the sum of the basal age and the additional months of credit
earned at higher age levels.

Mental age norms have also been employed with tests that are not divided into year levels. In such a case, the subject's raw score is first determined. Such a score may be the total number of correct items on the whole test; or it may be based on time, on number of errors, or on some combination of such measures. The mean raw scores obtained by the children in each year group within the standardization sample constitute the age norms for such a test. The mean raw score of the 8-year-old children, for example, would represent the 8-year norm. If an individual's raw score is equal to the mean 8-year-old raw score, then his mental age on the test is 8 years. All raw scores on such a test can be transformed in a similar manner by reference to the age norms.

It should be noted that the mental age unit does not remain constant with age, but tends to shrink with advancing years. For example, a child who is one year retarded at age 4 will be approximately three years retarded at age 12. One year of mental growth from ages 3 to 4 is equivalent to three years of growth from ages 9 to 12. Since intellectual development progresses more rapidly at the earlier ages and gradually decreases as the individual approaches his mature limit, the mental age unit shrinks correspondingly with age. This relationship may be more
GRADE EQUIVALENTS. Scores on educational achievement tests are often interpreted in terms of grade equivalents. This practice is understandable because the tests are employed within an academic setting. To describe a pupil's achievement as equivalent to seventh-grade performance in spelling, eighth-grade in reading, and fifth-grade in arithmetic has the same popular appeal as the use of mental age in the traditional intelligence tests.

Grade norms are found by computing the mean raw score obtained by children in each grade. Thus, if the average number of problems solved correctly on an arithmetic test by the fourth graders in the standardization sample is 23, then a raw score of 23 corresponds to a grade equivalent of 4. Intermediate grade equivalents, representing fractions of a grade, are usually found by interpolation, although they can also be obtained directly by testing children at different times within the school year. Because the school year covers ten months, successive months can be expressed as decimals. For example, 4.0 refers to average performance at the beginning of the fourth grade (September testing), 4.5 refers to average performance at the middle of the grade (February testing), and so forth.
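The interpolation between grade norms can be sketched as follows. Only the fourth-grade norm of 23 comes from the example above; the other norm values and the helper name are invented for illustration:

```python
# Hypothetical grade norms: mean raw score earned by each grade's
# standardization group (grade 4 norm = 23, as in the text's example).
grade_norms = {3: 17, 4: 23, 5: 30, 6: 38}

def grade_equivalent(raw):
    """Linearly interpolate a raw score between adjacent grade norms."""
    grades = sorted(grade_norms)
    for lo, hi in zip(grades, grades[1:]):
        s_lo, s_hi = grade_norms[lo], grade_norms[hi]
        if s_lo <= raw <= s_hi:
            return lo + (raw - s_lo) / (s_hi - s_lo)
    raise ValueError("raw score outside the range of the norms")

assert grade_equivalent(23) == 4.0               # exactly the fourth-grade norm
assert round(grade_equivalent(26.5), 1) == 4.5   # midway to the grade 5 norm
```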
Despite their popularity, grade norms have several shortcomings. First, the content of instruction varies somewhat from grade to grade. Hence, grade norms are appropriate only for common subjects taught throughout the grade levels covered by the test. They are not generally applicable at the high school level, where many subjects may be studied for only one or two years. Even with subjects taught in each grade, however, the emphasis placed on different subjects may vary from grade to grade, and progress may therefore be more rapid in one subject than in another during a particular grade. In other words, grade units are obviously unequal and these inequalities occur irregularly in different subjects.
Grade norms are also subject to misinterpretation unless the test user keeps firmly in mind the manner in which they were derived. For example, if a fourth-grade child obtains a grade equivalent of 6.9 in arithmetic, it does not mean that he has mastered the arithmetic processes taught in the sixth grade. He undoubtedly obtained his score largely by superior performance in fourth-grade arithmetic. It certainly could not be assumed that he has the prerequisites for seventh-grade arithmetic.

Grade norms also tend to be incorrectly regarded as performance standards. A sixth-grade teacher, for example, may assume that all the pupils in her class should fall at or close to the sixth-grade norm in achievement tests. This misconception is certainly not surprising when grade norms are used. Yet individual differences within any one grade are such that the range of achievement test scores will inevitably extend over several grades.
ORDINAL SCALES. Another approach to developmental norms derives from research in child psychology. Empirical observation of behavior development in infants and young children led to the description of behavior typical of successive ages in such functions as locomotion, sensory discrimination, linguistic communication, and concept formation. An early example is provided by the work of Gesell and his associates at Yale (Gesell et al., 1940; Gesell & Amatruda, 1947; Halverson, 1933). The Gesell Developmental Schedules show the approximate developmental level in months that the child has attained in each of four major areas of behavior, namely, motor, adaptive, language, and personal-social. These levels are found by comparing the child's behavior with that typical of eight key ages, ranging from 4 weeks to 36 months.

Gesell and his co-workers emphasized the sequential patterning of early behavior development. They cited extensive evidence of uniformities of developmental sequences and an orderly progression of behavior changes. For example, the child's reactions toward a small object placed in front of him exhibit a characteristic chronological sequence in visual fixation and in hand and finger movements. Use of the entire hand in crude attempts at palmar prehension occurs at an earlier age than use of the thumb in opposition to the palm; this type of prehension is followed by use of the thumb and index finger in a more efficient pincer-like grasp of the object. Such sequential patterning was likewise observed in walking, stair climbing, and most of the sensorimotor development of the first few years. The scales developed by Gesell are ordinal in the sense that developmental stages follow in a constant order, each stage presupposing mastery of prerequisite behavior characteristic of earlier stages.

1 This usage of the term "ordinal scale" differs from that in statistics, in which an ordinal scale is simply one that permits a rank-ordering of individuals without knowledge about amount of difference between them; in the statistical sense, ordinal scales are contrasted to equal-unit interval scales. Ordinal scales of child development are usually designed on the model of a Guttman scale, or simplex, in which successful performance at one level implies success at all lower levels (Guttman, 1944). An
Since the 1960s, there has been a sharp upsurge of interest in the developmental theories of the Swiss child psychologist, Jean Piaget (see
Flavell, 1963; Ginsburg & Opper, 1969; Green, Ford, & Flamer, 1971).
Piaget's research has focused on the development of cognitive processes
from infancy to the midteens. He is concerned with specific concepts
rather than broad abilities. An example of such a concept, or schema, is
object permanence, whereby the child is aware of the identity and con-
tinuing existence of objects when they are seen from different angles
or are out of sight. Another widely studied concept is conservation, or
the recognition that an attribute remains constant over changes in per-
ceptual appearance, as when the same quantity of liquid is poured into
differently shaped containers, or when rods of the same length are placed
in different spatial arrangements.
Piagetian tasks have been used widely in research by developmental
psychologists and some have been organized into standardized scales,
to be discussed in Chapters 10 and 14 (Goldschmid & Bentler, 1968b;
Loretan, 1966; Pinard & Laurendeau, 1964; Uzgiris & Hunt, 1975). In accordance with Piaget's approach, these instruments are ordinal scales, in
which the attainment of one stage is contingent upon completion of the
earlier stages in the development of the concept. The tasks are designed
to reveal the dominant aspects of each developmental stage; only later
are empirical data gathered regarding the ages at which each stage is typically reached. In this respect, the procedure differs from that followed in constructing age scales, in which items are selected in the first
place on the basis of their differentiating between successive ages.
In summary, ordinal scales are designed to identify the stage reached
by the child in the development of specific behavior functions. Although
scores may be reported in terms of approximate age levels, such scores
are secondary to a qualitative description of the child's characteristic be-
havior. The ordinality of such scales refers to the uniform progression of
development through successive stages. Insofar as these scales typically
provide information about what the child is actually able to do (e.g.,
climbs stairs without assistance; recognizes identity in quantity of liquid
when poured into differently shaped containers), they share important
features with the criterion-referenced tests to be discussed in a later
section of this chapter.
Nearly all standardized tests now provide some form of within-group norms. With such norms, the individual's performance is evaluated in
extension of Guttman's analysis to include nonlinear hierarchies is described by Bart and Airasian (1974), with special reference to Piagetian scales.
terms of the performance of the most nearly comparable standardization group, as when comparing a child's raw score with that of children of the same chronological age or in the same school grade. Within-group scores have a uniform and clearly defined quantitative meaning and can be appropriately employed in most types of statistical analysis.
PERCENTILES. Percentile scores are expressed in terms of the percentage of persons in the standardization sample who fall below a given raw score. For example, if 28 percent of the persons obtain fewer than 15 problems correct on an arithmetic reasoning test, then a raw score of 15 corresponds to the 28th percentile (P28). A percentile indicates the individual's relative position in the standardization sample. Percentiles can also be regarded as ranks in a group of 100, except that in ranking it is customary to start counting at the top, the best person in the group receiving a rank of one. With percentiles, on the other hand, we begin counting at the bottom, so that the lower the percentile, the poorer the individual's standing.

The 50th percentile (P50) corresponds to the median, already discussed as a measure of central tendency. Percentiles above 50 represent above-average performance; those below 50 signify inferior performance. The 25th and 75th percentiles are known as the first and third quartile points (Q1 and Q3), because they cut off the lowest and highest quarters of the distribution. Like the median, they provide convenient landmarks for describing a distribution of scores and comparing it with other distributions.

Percentiles should not be confused with the familiar percentage scores. The latter are raw scores, expressed in terms of the percentage of correct items; percentiles are derived scores, expressed in terms of percentage of persons. A raw score lower than any obtained in the standardization sample would have a percentile rank of zero (P0); one higher than any score in the standardization sample would have a percentile rank of 100 (P100). These percentiles, however, do not imply a zero raw score and a perfect raw score.

Percentile scores have several advantages. They are easy to compute
and can be readily understood, even by relatively untrained persons. Moreover, percentiles are universally applicable. They can be used equally well with adults and children and are suitable for any type of test, whether it measures aptitude or personality variables.

The chief drawback of percentile scores arises from the marked inequality of their units, especially at the extremes of the distribution. If the distribution of raw scores approximates the normal curve, as is true of most test scores, then raw score differences near the median or center of the distribution are exaggerated in the percentile transformation,
whereas raw score differences near the ends of the distribution are greatly shrunk. This distortion of distances between scores can be seen in Figure 4. In a normal curve, it will be recalled, cases cluster closely at the center and scatter more widely as the extremes are approached. Consequently, any given percentage of cases near the center covers a shorter distance on the baseline than the same percentage near the ends of the distribution. In Figure 4, this discrepancy in the gaps between percentile ranks (PR) can readily be seen if we compare the distance between a PR of 40 and a PR of 50 with that between a PR of 10 and a PR of 20. Even more striking is the discrepancy between these distances and that between a PR of 10 and a PR of 1. (In a mathematically derived normal curve, zero percentile is not reached until infinity and hence cannot be shown on the graph.)
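Computing percentile ranks directly from a standardization sample makes the definition concrete; a sketch, with a schematic sample built to match the earlier arithmetic-reasoning example:

```python
def percentile_rank(raw, sample):
    """Percentage of the standardization sample falling below the raw score."""
    below = sum(1 for s in sample if s < raw)
    return 100 * below / len(sample)

# If 28 percent of the sample solve fewer than 15 problems correctly,
# a raw score of 15 falls at the 28th percentile.
sample = [14] * 28 + [16] * 72   # a schematic 100-person sample
assert percentile_rank(15, sample) == 28.0
```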
FIG. 4. Percentile Ranks in a Normal Distribution.
The same relationship can be seen from the opposite direction if we examine the percentile ranks corresponding to equal σ-distances from the mean of a normal curve. These percentile ranks are given under the graph in Figure 4. Thus, the percentile difference between the mean and +1σ is 34 (84 - 50). That between +1σ and +2σ is only 14 (98 - 84).
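These percentile ranks, and their unequal spacing, can be recomputed from the normal distribution function; a sketch, with ranks rounded to whole numbers as in Figure 4:

```python
from statistics import NormalDist

# Percentile rank at each whole sigma distance from the mean.
z_to_pr = {z: round(NormalDist().cdf(z) * 100) for z in (-3, -2, -1, 0, 1, 2, 3)}
# -> {-3: 0, -2: 2, -1: 16, 0: 50, 1: 84, 2: 98, 3: 100}

# Equal sigma distances cover very unequal percentile distances:
assert z_to_pr[1] - z_to_pr[0] == 34   # mean to +1 sigma: 84 - 50
assert z_to_pr[2] - z_to_pr[1] == 14   # +1 sigma to +2 sigma: 98 - 84
```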
It is apparent that percentiles show each individual's relative position in the normative sample but not the amount of difference between scores. If plotted on arithmetic probability paper, however, percentile scores can also provide a correct visual picture of the differences between scores. Arithmetic probability paper is a cross-section paper in which the vertical lines are spaced in the same way as the percentile points in a normal distribution (as in Figure 4), whereas the horizontal lines are uniformly spaced, or vice versa (as in Figure 5). Such normal percentile
of differences between standard scores derived by such a linear transformation corresponds exactly to that between the raw scores. All properties of the original distribution of raw scores are duplicated in the distribution of these standard scores. For this reason, any computations that can be carried out with the original raw scores can also be carried out with linear standard scores, without any distortion of results.

Linearly derived standard scores are often designated simply as "standard scores" or "z scores." To compute a z score, we find the difference between the individual's raw score and the mean of the normative group and then divide this difference by the SD of the normative group. Table 3 shows the computation of z scores for two individuals, one of whom falls 1 SD above the group mean, the other .40 SD below the mean. Any raw score that is exactly equal to the mean is equivalent to a z score of zero. It is apparent that such a procedure will yield derived scores that have a negative sign for all subjects falling below the mean. Moreover, because the total range of most groups extends no farther than about 3 SD's above and below the mean, such standard scores will have to be reported to at least one decimal place in order to provide sufficient differentiation among individuals.
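The z-score computation of Table 3, together with the convenience rescaling applied in scales such as the SAT's (mean 500, SD 100), can be sketched as below; the function names are ours:

```python
def z_score(raw, mean, sd):
    """Distance of a raw score from the normative mean, in SD units."""
    return (raw - mean) / sd

# Table 3's normative group: M = 60, SD = 5.
assert z_score(65, 60, 5) == 1.0    # John: 1 SD above the mean
assert z_score(58, 60, 5) == -0.4   # Bill: .40 SD below the mean

def rescale(z, new_mean, new_sd):
    """Linear transformation to a more convenient mean and SD."""
    return new_mean + z * new_sd

assert rescale(-1.0, 500, 100) == 400   # SAT-style scale
assert rescale(1.5, 500, 100) == 650
```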
FIG. 5. A Normal Percentile Chart. Percentiles are spaced so as to correspond to equal distances in a normal distribution. Compare the score distance between John and Mary with that between Ellen and Edgar; within both pairs, the percentile difference is 5 points. Jane and Dick differ by 10 percentile points, as do Bill and Debby.
TABLE 3
Computation of Standard Scores

z = (X - M)/SD

John's score: X1 = 65;  z1 = (65 - 60)/5 = +1.00
Bill's score: X2 = 58;  z2 = (58 - 60)/5 = -.40

charts can be used to plot the scores of different persons on the same profile or the scores of the same person on different tests. In either case, the interscore difference will be correctly represented. Many aptitude and achievement batteries now utilize this technique in their score profiles, which show the individual's performance in each test. An example is the Individual Report Form of the Differential Aptitude Tests, reproduced in Figure 13 (Ch. 5).
STANDARD SCORES. Current tests are making increasing use of standard scores, which are the most satisfactory type of derived score from most points of view. Standard scores express the individual's distance from the mean in terms of the standard deviation of the distribution.

Standard scores may be obtained by either linear or nonlinear transformations of the original raw scores. When found by a linear transformation, they retain the exact numerical relations of the original raw scores, because they are computed by subtracting a constant from each raw score and then dividing the result by another constant. The relative magnitude
Both the above conditions, viz., the occurrence of negative values and of decimals, tend to produce awkward numbers that are confusing and difficult to use for both computational and reporting purposes. For this reason, some further linear transformation is usually applied, simply to put the scores into a more convenient form. For example, the scores on the Scholastic Aptitude Test (SAT) of the College Entrance Examination Board are standard scores adjusted to a mean of 500 and an SD of 100. Thus a standard score of -1 on this test would be expressed as 400 (500 - 100 = 400). Similarly, a standard score of +1.5 would correspond to 650 (500 + 1.5 × 100 = 650). To convert an original standard score to the new scale, it is simply necessary to multiply the standard score by the
desired SD (100) and add it to or subtract it from the desired mean (500). Any other convenient values can be arbitrarily chosen for the new mean and SD. Scores on the separate subtests of the Wechsler Intelligence Scales, for instance, are converted to a distribution with a mean of 10 and an SD of 3. All such measures are examples of linearly transformed standard scores.

It will be recalled that one of the reasons for transforming raw scores into any derived scale is to render scores on different tests comparable. The linearly derived standard scores discussed in the preceding section will be comparable only when found from distributions that have approximately the same form. Under such conditions, a score corresponding to 1 SD above the mean, for example, signifies that the individual occupies the same position in relation to both groups. His score exceeds approximately the same percentage of persons in both distributions, and this percentage can be determined if the form of the distribution is known. If, however, one distribution is markedly skewed and the other normal, a z score of +1.00 might exceed only 50 percent of the cases in one group but would exceed 84 percent in the other.
In order to achieve comparability of scores from dissimilarly shaped distributions, nonlinear transformations may be employed to fit the scores to any specified type of distribution curve. The mental age and percentile scores described in earlier sections represent nonlinear transformations, but they are subject to other limitations already discussed. Although under certain circumstances another type of distribution may be more appropriate, the normal curve is usually employed for this purpose. One of the chief reasons for this choice is that most raw score distributions approximate the normal curve more closely than they do any other type of curve. Moreover, physical measures such as height and weight, which use equal-unit scales derived through physical operations, generally yield normal distributions. Another important advantage of the normal curve is that it has many useful mathematical properties, which facilitate further computations.
Normalized standard scores are standard scores expressed in terms of a distribution that has been transformed to fit a normal curve. Such scores can be computed by reference to tables giving the percentage of cases falling at different SD distances from the mean of a normal curve. First, the percentage of persons in the standardization sample falling at or above each raw score is found. This percentage is then located in the normal curve frequency table, and the corresponding normalized standard score is obtained. Normalized standard scores are expressed in the same form as linearly derived standard scores, viz., with a mean of zero and an SD of 1. Thus, a normalized score of zero indicates that the individual falls at the mean of a normal curve, excelling 50 percent of the group. A score of -1 means that he surpasses approximately 16 percent of the group; and a score of +1, that he surpasses 84 percent. These percentages correspond to a distance of 1 SD below and 1 SD above the mean of a normal curve, respectively, as can be seen by reference to the bottom line of Figure 4.

2 Partly for this reason and partly as a result of other theoretical considerations, it has frequently been argued that, by normalizing raw scores, an equal-unit scale could be developed for psychological measurement similar to the equal-unit scales of physical measurement. This, however, is a debatable point that involves certain questionable assumptions.
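The table-lookup procedure just described can be sketched with statistics.NormalDist, whose inv_cdf plays the role of the normal curve frequency table. The mid-percentile convention (counting half of the cases tied at the score itself) is our addition, used to avoid infinite z values at the extremes:

```python
from statistics import NormalDist

def normalized_z(raw, sample):
    """Normalized standard score: the z whose normal-curve percentage
    matches the raw score's standing in the standardization sample."""
    below = sum(1 for s in sample if s < raw)
    at = sum(1 for s in sample if s == raw)
    p = (below + at / 2) / len(sample)   # mid-percentile convention
    return NormalDist().inv_cdf(p)

sample = [48, 47, 43, 41, 41, 40, 38, 36, 34, 32]
z = normalized_z(40, sample)                    # just below the median of 40.5
assert round(NormalDist().cdf(z), 2) == 0.45    # excels 45 percent of the group
```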
Like linearly derived standard scores, normalized standard scores can be put into any convenient form. If the normalized standard score is multiplied by 10 and added to or subtracted from 50, it is converted into a T score, a type of score first proposed by McCall (1922). On this scale, a score of 50 corresponds to the mean, a score of 60 to 1 SD above the mean, and so forth. Another well-known transformation is represented by the stanine scale, developed by the United States Air Force during World War II. This scale provides a single-digit system of scores with a mean of 5 and an SD of approximately 2.³ The name stanine (a contraction of "standard nine") is based on the fact that the scores run from 1 to 9. The restriction of scores to single-digit numbers has certain computational advantages, for each score requires only a single column on computer punched cards.
TABLE 4
Normal Curve Percentages for Use in Stanine Conversion

Percentage    4    7    12    17    20    17    12    7    4
Stanine       1    2     3     4     5     6     7    8    9
Raw scores can readily be converted to stanines by arranging the original scores in order of size and then assigning stanines in accordance with the normal curve percentages reproduced in Table 4. For example, if the group consists of exactly 100 persons, the 4 lowest-scoring persons receive a stanine score of 1, the next 7 a score of 2, the next 12 a score of 3, and so on. When the group contains more or fewer than 100 cases, the number corresponding to each designated percentage is first computed, and these numbers of cases are then given the appropriate stanines.

3 Kaiser (1958) proposed a modification of the stanine scale that involves slight changes in the percentages and yields an SD of exactly 2, thus being easier to handle quantitatively. Other variants are the C scale (Guilford & Fruchter, 1973, Ch. 19), consisting of 11 units and also yielding an SD of 2, and the 10-unit sten scale, with 5 units above and 5 below the mean (Canfield, 1951).
for comparability of ratio IQ's throughout their age range. Chiefly for
this reason, the ratio IQ has been largely replaced by the so-called devi-
ation IQ, which is actually another variant of the familiar standard score.
The deviation IQ is a standard score with a mean of 100 and an SD
that approximates the SD of the Stanford-Binet IQ distribution. Al-
though the SD of the Stanford-Binet ratio IQ (last used in the 1937
edition) was not exactly constant at all ages, it fluctuated around a
median value slightly greater than 16. Hence, if an SD close to 16 is
chosen in reporting standard scores on a newly developed test, the result-
ing scores can be interpreted in the same way as Stanford-Binet ratio
IQ's. Since Stanford-Binet IQ's have been in use for many years, testers
and clinicians have become accustomed to interpreting and classifying
test performance in terms of such IQ levels. They have learned what to
expect from individuals with IQ's of 40, 70, 90, 130, and so forth. There
are therefore certain practical advantages in the use of a derived scale
that corresponds to the familiar distribution of Stanford-Binet IQ's.
Such a correspondence of score units can be achieved by the selection of
numerical values for the mean and SD that agree closely with those in the Stanford-Binet distribution.
It should be added that the use of the term "IQ" to designate such
standard scores may seem to be somewhat misleading. Such IQ's are not
derived by the same methods employed in finding traditional ratio IQ's.
They are not ratios of mental ages and chronological ages. The justifi-
cation lies in the general familiarity of the term "IQ," and in the fact
that such scores can be interpreted as IQ's provided that their SD
is approximately equal to that of previously known IQ's. Among the first
tests to express scores in terms of deviation IQ's were the Wechsler Intelligence Scales. In these tests, the mean is 100 and the SD 15. Deviation
IQ's are also used in a number of current group tests of intelligence
and in the latest revision of the Stanford-Binet itself.
With the increasing use of deviation IQ's, it is important to remember
that deviation IQ's from different tests are comparable only when they
employ the same or closely similar values for the SD. This value should
always be reported in the manual and carefully noted by the test user. If
a test maker chooses a different value for the SD in making up his devia-
tion IQ scale, the meaning of any given IQ on his test will be quite differ-
ent from its meaning on other tests. These discrepancies are illustrated in
Table 5, which shows the percentage of cases in normal distributions with
SD's from 12 to 18 who would obtain IQ's at different levels. These SD
values have actually been employed in the IQ scales of published tests.
Table 5 shows, for example, that an IQ of 70 cuts off the lowest 3.1 per-
cent when the SD is 16 (as in the Stanford-Binet); but it may cut off
as few as 0.7 percent (SD = 12) or as many as 5.1 percent (SD = 18).
An IQ of 70 has been used traditionally as a cutoff point for identifying
Principles of Psychological Testing
Thus, out of 200 cases, 8 would be assigned a stanine of 1 (4 percent of
200 = 8). With 150 cases, 6 would receive a stanine of 1 (4 percent of
150 = 6). For any group containing from 10 to 100 cases, Bartlett and
Edgerton (1966) have prepared a table whereby ranks can be directly
converted to stanines. Because of their practical as well as theoretical
advantages, stanines are being used increasingly, especially with aptitude
and achievement tests.
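The assignment of stanines from percentile ranks can be sketched as follows. The cumulative cutoffs (4, 11, 23, 40, 60, 77, 89, 96) follow directly from the standard 4-7-12-17-20-17-12-7-4 percent split of the normal curve; the boundary convention (a rank exactly on a cutoff goes to the lower stanine) is an assumption for illustration.

```python
def stanine(percentile_rank):
    """Stanine (1-9) for a percentile rank, using the standard
    4-7-12-17-20-17-12-7-4 percent split of the normal curve."""
    cumulative = [4, 11, 23, 40, 60, 77, 89, 96]  # upper limits of stanines 1-8
    for s, limit in enumerate(cumulative, start=1):
        if percentile_rank <= limit:
            return s
    return 9

print(stanine(50))   # middle 20 percent -> stanine 5
print(stanine(97))   # top 4 percent -> stanine 9
```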
Although normalized standard scores are the most satisfactory type of
score for the majority of purposes, there are nevertheless certain tech-
nical objections to normalizing all distributions routinely. Such a trans-
formation should be carried out only when the sample is large and rep-
resentative and when there is reason to believe that the deviation from
normality results from defects in the test rather than from characteristics
of the sample or from other factors affecting the behavior under consid-
eration. It should also be noted that when the original distribution of
raw scores approximates normality, the linearly derived standard scores
and the normalized standard scores will be very similar. Although the
methods of deriving these two types of scores are quite different, the
resulting scores will be nearly identical under such conditions. Obviously,
the process of normalizing a distribution that is already virtually normal
will produce little or no change. Whenever feasible, it is generally more
desirable to obtain a normal distribution of raw scores by proper adjust-
ment of the difficulty level of test items rather than by subsequently
normalizing a markedly nonnormal distribution. With an approximately
normal distribution of raw scores, the linearly derived standard scores
will serve the same purposes as normalized standard scores.
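The normalizing transformation itself can be sketched in a few lines: each raw score is assigned a percentile rank within the sample, and that percentile is then read back off the normal curve as a z value. The midpoint convention for tied scores is one common choice, assumed here for illustration.

```python
from statistics import NormalDist

def normalized_z(scores):
    """Normalized standard scores: each raw score is replaced by the
    normal deviate (z) having the same percentile rank, using the
    midpoint convention for tied scores."""
    n = len(scores)
    ordered = sorted(scores)
    table = {}
    for x in set(scores):
        below = sum(s < x for s in ordered)
        ties = ordered.count(x)
        pct = (below + 0.5 * ties) / n          # percentile as a proportion
        table[x] = round(NormalDist().inv_cdf(pct), 2)
    return [table[x] for x in scores]

print(normalized_z([10, 12, 12, 14, 20]))  # [-1.28, -0.25, -0.25, 0.52, 1.28]
```

Note that the markedly skewed raw scores (20 is far from the rest) come out symmetric: the transformation forces normality rather than preserving the raw-score shape, which is exactly the objection raised above when the skew reflects the sample rather than the test.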
THE DEVIATION IQ. In an effort to convert MA scores into a uniform
index of the individual's relative status, the ratio IQ (Intelligence
Quotient) was introduced in early intelligence tests. Such an IQ was
simply the ratio of mental age to chronological age, multiplied by 100 to
eliminate decimals (IQ = 100 × MA/CA). Obviously, if a child's MA
equals his CA, his IQ will be exactly 100. An IQ of 100 thus represents
normal or average performance. IQ's below 100 indicate retardation;
those above 100, acceleration.
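The ratio-IQ formula is simple enough to state directly; the rounding to a whole number is an assumption for illustration.

```python
def ratio_iq(mental_age, chronological_age):
    """Ratio IQ = 100 * MA / CA, rounded to a whole number."""
    return round(100 * mental_age / chronological_age)

print(ratio_iq(12, 10))   # MA ahead of CA -> 120 (acceleration)
print(ratio_iq(10, 10))   # MA equals CA  -> 100 (average)
print(ratio_iq(8, 10))    # MA behind CA  -> 80  (retardation)
```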
The apparent logical simplicity of the traditional ratio IQ, however,
proved deceptive. A major technical difficulty is that, unless the SD
of the IQ distribution remains approximately constant with age, IQ's
will not be comparable at different age levels. An IQ of 115 at age 10,
for example, may indicate the same degree of superiority as an IQ
of 125 at age 12, since both may fall at a distance of 1 SD from the
means of their respective age distributions. In actual practice, it proved
difficult to construct tests that met the psychometric requirements
TABLE 5
Percentage of Cases at Each IQ Interval in Normal Distributions with
Mean of 100 and Different Standard Deviations
(Courtesy Test Department, Harcourt Brace Jovanovich, Inc.)

IQ Interval     SD = 12   SD = 14   SD = 16   SD = 18
130 and above      0.7       1.6       3.1       5.1
120-129            4.3       6.3       7.5       8.5
110-119           15.2      16.0      15.8      15.4
100-109           29.8      26.1      23.6      21.0
 90- 99           29.8      26.1      23.6      21.0
 80- 89           15.2      16.0      15.8      15.4
 70- 79            4.3       6.3       7.5       8.5
Below 70           0.7       1.6       3.1       5.1
Total            100.0     100.0     100.0     100.0

(The middle range, IQ 90-109, thus includes 59.6, 52.2, 47.2, and 42.0
percent of the cases, respectively.)
mental retardation. The same discrepancies, of course, apply to IQ's of
130 and above, which might be used in selecting children for special
programs for the intellectually gifted. The IQ range between 90 and 110,
generally described as normal, may include as few as 42 percent or as
many as 59.6 percent of the population, depending on the test chosen. To
be sure, test publishers are making efforts to adopt the uniform SD of 16
in new tests and in new editions of earlier tests. There are still enough
variations among currently available tests, however, to make the checking
of the SD imperative.
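The percentages in Table 5 come straight from the normal curve. A sketch of the calculation for the "Below 70" row is given here; the computed values agree with the table to within a few tenths of a percent (the published entries were evidently rounded under a slightly different interval-boundary convention).

```python
from statistics import NormalDist

def pct_below(iq_cutoff, sd, mean=100.0):
    """Percent of a normal IQ distribution falling below a cutoff."""
    return 100 * NormalDist(mean, sd).cdf(iq_cutoff)

# Percent of cases below IQ 70 under each SD used in published tests.
for sd in (12, 14, 16, 18):
    print(f"SD = {sd}: {pct_below(70, sd):.1f}% below IQ 70")
```

The spread from under 1 percent (SD = 12) to around 5 percent (SD = 18) is the practical point of the table: the same cutoff IQ labels very different proportions of the population depending on the test's SD.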
[Figure 6 shows a normal curve with vertically aligned score scales: a
test-score baseline from -4σ to +4σ; z scores -4 to +4; T scores 10 to
90; CEEB scores 200 to 800; deviation IQ's (SD = 15) 55 to 145; stanines
1 to 9 with the percentages 4, 7, 12, 17, 20, 17, 12, 7, 4; and percentiles
1 to 99.]
FIG. 6. Relationships among Different Types of Test Scores in a Normal
Distribution.
INTERRELATIONSHIPS OF WITHIN-GROUP SCORES. At this stage in our dis-
cussion of derived scores, the reader may have become aware of a
rapprochement among the various types of scores. Percentiles have
gradually been taking on at least a graphic resemblance to normalized
standard scores. Linear standard scores are indistinguishable from
normalized standard scores if the original distribution of raw scores
closely approximates the normal curve. Finally, standard scores have be-
come IQ's and vice versa. In connection with the last point, a reexamina-
tion of the meaning of a ratio IQ on such a test as the Stanford-Binet will
show that these IQ's can themselves be interpreted as standard scores. If
we know that the distribution of Stanford-Binet ratio IQ's had a mean of
100 and an SD of approximately 16, we can conclude that an IQ of 116
falls at a distance of 1 SD above the mean and represents a standard
score of +1.00. Similarly, an IQ of 132 corresponds to a standard score
of +2.00, an IQ of 76 to a standard score of -1.50, and so forth. More-
over, a Stanford-Binet ratio IQ of 116 corresponds to a percentile rank
of approximately 84, because in a normal curve 84 percent of the cases
fall below +1.00 SD (Figure 4).
In Figure 6 are summarized the relationships that exist in a normal
distribution among the types of scores so far discussed in this chapter.
These include z scores, College Entrance Examination Board (CEEB)
scores, Wechsler deviation IQ's (SD = 15), T scores, stanines, and per-
centiles. Ratio IQ's on any test will coincide with the given deviation IQ
scale if they are normally distributed and have an SD of 15. Any other
normally distributed IQ could be added to the chart, provided we know
its SD. If the SD is 20, for instance, then an IQ of 120 corresponds to
+1 SD, an IQ of 80 to -1 SD, and so on.
In conclusion, the exact form in which scores are reported is dictated
largely by convenience, familiarity, and ease of developing norms. Stand-
ard scores in any form (including the deviation IQ) have generally
replaced other types of scores because of certain advantages they offer
with regard to test construction and statistical treatment of data. Most
types of within-group derived scores, however, are fundamentally similar
if carefully derived and properly interpreted. When certain statistical
conditions are met, each of these scores can be readily translated into
any of the others.
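The translations among within-group scores summarized in Figure 6 are all linear rescalings of the same z score, plus a normal-curve lookup for the percentile. A minimal sketch:

```python
from statistics import NormalDist

def from_z(z):
    """Translate a z score into the equivalent within-group scores of
    Figure 6: T score, CEEB score, Wechsler deviation IQ, percentile."""
    return {
        "T score": 50 + 10 * z,
        "CEEB score": 500 + 100 * z,
        "deviation IQ (SD = 15)": 100 + 15 * z,
        "percentile": round(100 * NormalDist().cdf(z), 1),
    }

print(from_z(1.0))
# {'T score': 60.0, 'CEEB score': 600.0, 'deviation IQ (SD = 15)': 115.0,
#  'percentile': 84.1}
```

The percentile conversion, unlike the others, assumes a normal distribution; for linear standard scores on a nonnormal distribution that step would not hold.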
Norms and the Interpretation of Test Scores
tests may differ in content despite their similar labels. So-called intelli-
gence tests provide many illustrations of this confusion. Although com-
monly described by the same blanket term, one of these tests may include
only verbal content, another may tap predominantly spatial aptitudes,
and still another may cover verbal, numerical, and spatial content in
about equal proportions. Second, the scale units may not be comparable.
As explained earlier in this chapter, if IQ's on one test have an SD of 12
and IQ's on another have an SD of 18, then an individual who received
an IQ of 112 on the first test is most likely to receive an IQ of 118 on the
second. Third, the composition of the standardization samples used in
establishing norms for different tests may vary. Obviously, the same indi-
vidual will appear to have performed better when compared with an
inferior group than when compared with a superior group.
Lack of comparability of either test content or scale units can usually
be detected by reference to the test itself or to the test manual. Differ-
ences in the respective normative samples, however, are more likely to
be overlooked. Such differences probably account for many otherwise un-
explained discrepancies in test results.
INTERTEST COMPARISONS. An IQ, or any other score, should always be
accompanied by the name of the test on which it was obtained. Test
scores cannot be properly interpreted in the abstract; they must be re-
ferred to particular tests. If the school records show that Bill Jones re-
ceived an IQ of 94 and Tom Brown an IQ of 110, such IQ's cannot be
accepted at face value without further information. The positions of
these two students might have been reversed by exchanging the par-
ticular tests that each was given in his respective school.
Similarly, an individual's relative standing in different functions may
be grossly misrepresented through lack of comparability of test norms.
Let us suppose that a student has been given a verbal comprehension
test and a spatial aptitude test to determine his relative standing in the
two fields. If the verbal ability test was standardized on a random sample
of high school students, while the spatial test was standardized on a
selected group of boys attending elective shop courses, the examiner
might erroneously conclude that the individual is much more able along
verbal than along spatial lines, when the reverse may actually be the case.
Still another example involves longitudinal comparisons of a single
individual's test performance over time. If a schoolchild's cumulative
record shows IQ's of 118, 115, and 101 at the fourth, fifth, and sixth
grades, the first question to ask before interpreting these changes is,
"What tests did he take on these three occasions?" The apparent decline
may reflect no more than the differences among the tests. In that case,
he would have obtained these scores even if the three tests had been
administered within a week of each other.
There are three principal reasons to account for systematic variations
among the scores obtained by the same individual on different tests. First,
THE NORMATIVE SAMPLE. Any norm, however expressed, is restricted
to the particular normative population from which it was derived. The
test user should never lose sight of the way in which norms are estab-
lished. Psychological test norms are in no sense absolute, universal, or
permanent. They merely represent the test performance of the subjects
constituting the standardization sample. In choosing such a sample, an
effort is usually made to obtain a representative cross section of the popu-
lation for which the test is designed.
In statistical terminology, a distinction is made between sample and
population. The former refers to the group of individuals actually tested.
The latter designates the larger, but similarly constituted, group from
which the sample is drawn. For example, if we wish to establish norms of
test performance for the population of 10-year-old, urban, public school
boys, we might test a carefully chosen sample of 500 10-year-old boys
attending public schools in several American cities. The sample would
be checked with reference to geographical distribution, socioeconomic
level, ethnic composition, and other relevant characteristics to ensure that
it was truly representative of the defined population.
In the development and application of test norms, considerable atten-
tion should be given to the standardization sample. It is apparent that the
sample on which the norms are based should be large enough to provide
stable values. Another, similarly chosen sample of the same population
should not yield norms that diverge appreciably from those obtained.
Norms with a large sampling error would obviously be of little value in
the interpretation of test scores.
Equally important is the requirement that the sample be representative
of the population under consideration. Subtle selective factors that might
make the sample unrepresentative should be carefully investigated. A
number of such selective factors are illustrated in institutional samples.
Because such samples are usually large and readily available for testing
purposes, they offer an alluring field for the accumulation of normative
data. The special limitations of these samples, however, should be care-
fully analyzed. Testing subjects in school, for example, will yield an in-
creasingly superior selection of cases in the successive grades, owing to
the progressive dropping out of the less able pupils. Nor does such
elimination affect different subgroups equally. For example, the rate of
selective elimination from school is greater for boys than for girls, and
it is greater in lower than in higher socioeconomic levels.
Selective factors likewise operate in other institutional samples, such
as prisoners, patients in mental hospitals, or institutionalized mental re-
tardates. Because of many special factors that determine institutionaliza-
tion itself, such groups are not representative of the entire population of
criminals, psychotics, or mental retardates. For example, mental retard-
ates with physical handicaps are more likely to be institutionalized than
are the physically fit. Similarly, the relative proportion of severely re-
tarded persons will be much greater in institutional samples than in the
total population.
Closely related to the question of representativeness of sample is the
need for defining the specific population to which the norms apply. Obvi-
ously, one way of ensuring that a sample is representative is to restrict
the population to fit the specifications of the available sample. For ex-
ample, if the population is defined to include only 14-year-old school-
children rather than all 14-year-old children, then a school sample would
be representative. Ideally, of course, the desired population should be
defined in advance in terms of the objectives of the test. Then a suitable
sample should be assembled. Practical obstacles in obtaining subjects,
however, may make this goal unattainable. In such a case, it is far better
to redefine the population more narrowly than to report norms on an ideal
population which is not adequately represented by the standardization
sample. In actual practice, very few tests are standardized on such broad
populations as is popularly assumed. No test provides norms for the
human species! And it is doubtful whether any tests give truly adequate
norms for such broadly defined populations as "adult American men,"
"10-year-old American children," and the like. Consequently, the samples
obtained by different test constructors often tend to be unrepresentative
of their alleged populations and biased in different ways. Hence, the
resulting norms are not comparable.
NATIONAL ANCHOR NORMS. One solution for the lack of comparability
of norms is to use an anchor test to work out equivalency tables for scores
on different tests. Such tables are designed to show what score in Test A
is equivalent to each score in Test B. This can be done by the equiper-
centile method, in which scores are considered equivalent when they
have equal percentiles in a given group. For example, if the 80th per-
centile in the same group corresponds to an IQ of 115 on Test A and to
an IQ of 120 on Test B, then Test A IQ 115 is considered to be equivalent
to Test B IQ 120. This approach has been followed to a limited extent
by some test publishers, who have prepared equivalency tables for a few
of their own tests (see, e.g., Lennon, 1966a).
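A crude sketch of the equipercentile idea: for each chosen percentile, pair the Test A score and the Test B score cutting off the same percentage of the same group. The percentile-lookup rule and the sample data below are simplified assumptions for illustration, not a production equating procedure (which would smooth the distributions).

```python
def equipercentile_table(scores_a, scores_b, percents=(10, 25, 50, 75, 90)):
    """Pair the Test A and Test B scores that cut off the same
    percentage of the same group (a crude equipercentile sketch)."""
    def score_at(scores, p):
        ordered = sorted(scores)
        idx = min(len(ordered) - 1, int(p / 100 * len(ordered)))
        return ordered[idx]
    return [(p, score_at(scores_a, p), score_at(scores_b, p)) for p in percents]

a = list(range(80, 120))   # hypothetical Test A scores for one group
b = list(range(85, 125))   # hypothetical Test B scores for the same group
print(equipercentile_table(a, b))
```

With Test B running uniformly 5 points higher, every row of the table pairs a Test A score with a Test B score 5 points above it, mirroring the Test A 115 / Test B 120 example in the text.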
More ambitious proposals have been made from time to time for cali-
brating each new test against a single anchor test, which has itself been
administered to a highly representative, national normative sample (Len-
non, 1966b). No single anchor test, of course, could be used in establish-
ing norms for all tests, regardless of content. What is required is a battery
of anchor tests, all administered to the same national sample. Each new
test could then be checked against the most nearly similar anchor test
in the battery.
The data gathered in Project TALENT (Flanagan et al., 1964) so far
come closest to providing such an anchor battery for a high school popu-
lation. Using a random sample of about 5 percent of the high schools in
this country, the investigators administered a two-day battery of specially
constructed aptitude, achievement, interest, and temperament tests to ap-
proximately 400,000 students in grades 9 through 12. Even with the avail-
ability of anchor data such as these, however, it must be recognized that
independently developed tests can never be regarded as completely inter-
changeable. At best, the use of national anchor norms would appreciably
reduce the lack of comparability among tests, but it would not elimi-
nate it.
The Project TALENT battery has been employed to calibrate several
test batteries in use by the Navy and Air Force (Dailey, Shaycoft, & Orr,
1962; Shaycoft, Neyman, & Dailey, 1962). The general procedure is to
administer both the Project TALENT battery and the tests to be cali-
brated to the same sample. Through correlational analysis, a composite of
Project TALENT tests is identified that is most nearly comparable to
each test to be normed. By means of the equipercentile method, tables
are then prepared giving the corresponding scores on the Project
TALENT composite and on the particular test. For several other bat-
teries, data have been gathered to identify the Project TALENT com-
4 For an excellent analysis of some of the technical difficulties involved in efforts
to achieve score comparability with different tests, see Angoff (1966, 1971a).
SPECIFIC NORMS. Another approach to the nonequivalence of existing
norms-and probably a more realistic one for most tests-is to standard-
ize tests on more narrowly defined populations, so chosen as to suit the
specific purposes of each test. In such cases, the limits of the normative
population should be clearly reported with the norms. Thus, the norms
might be said to apply to "employed clerical workers in large business
organizations" or to "first-year engineering students." For many testing
purposes, highly specific norms are desirable. Even when representative
norms are available for a broadly defined population, it is often helpful
to have separately reported subgroup norms. This is true whenever recog-
nizable subgroups yield appreciably different scores on a particular test.
The subgroups may be formed with respect to age, grade, type of curricu-
lum, sex, geographical region, urban or rural environment, socioeconomic
level, and many other factors. The use to be made of the test determines
the type of differentiation that is most relevant, as well as whether
general or specific norms are more appropriate.
Mention should also be made of local norms, often developed by the
test users themselves within a particular setting. The groups employed in
deriving such norms are even more narrowly defined than the subgroups
FIXED REFERENCE GROUP. Although most derived scores are computed
in such a way as to provide an immediate normative interpretation of test
performance, there are some notable exceptions. One type of non-
normative scale utilizes a fixed reference group in order to ensure
comparability and continuity of scores, without providing normative
evaluation of performance. With such a scale, normative interpretation
requires reference to independently collected norms from a suitable
population. Local or other specific norms are often used for this purpose.
One of the clearest examples of scaling in terms of a fixed reference
group is provided by the score scale of the College Board Scholastic
Aptitude Test (Angoff, 1962, 1971b). Between 1926 (when this test was
first administered) and 1941, SAT scores were expressed on a normative
scale, in terms of the mean and SD of the candidates taking the test at
each administration. As the number and variety of College Board member
colleges increased and the composition of the candidate population
changed, it was concluded that scale continuity should be maintained.
Otherwise, an individual's score would depend on the characteristics of
the group tested during a particular year. An even more urgent reason
for scale continuity stemmed from the observation that students taking
the SAT at certain times of the year performed more poorly than those
taking it at other times, owing to the differential operation of selective
factors. After 1941, therefore, all SAT scores were expressed in terms of
the mean and SD of the approximately 11,000 candidates who took the
test in 1941. These candidates constitute the fixed reference group em-
ployed in scaling all subsequent forms of the test. Thus, a score of 500 on
any form of the SAT corresponds to the mean of the 1941 sample; a score
of 600 falls 1 SD above that mean, and so forth.
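The fixed-reference-group rescaling itself is a simple linear map: the reference group's mean goes to 500 and each reference SD to 100. The raw-score mean and SD below are hypothetical stand-ins for the values that the anchor-item chain would supply for a given form.

```python
def fixed_reference_scale(raw, ref_mean, ref_sd, scale_mean=500, scale_sd=100):
    """Express a raw score on a fixed-reference-group scale: the
    reference group's mean maps to 500, each reference SD to 100."""
    return scale_mean + scale_sd * (raw - ref_mean) / ref_sd

# Hypothetical raw-score mean and SD implied for this form by the
# chain of anchor items back to the reference group.
print(fixed_reference_scale(55, ref_mean=45, ref_sd=10))   # 600.0
```

Note that nothing in this conversion refers to the current candidate group, which is exactly why such scores carry no normative meaning until compared with separately collected norms.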
To permit translation of raw scores on any form of the SAT into these
fixed-reference-group scores, a short anchor test (or set of common items)
is included in each form. Each new form is thereby linked to one or two
earlier forms, which in turn are linked with other forms by a chain of
items extending back to the 1941 form. These nonnormative SAT scores
can then be interpreted by comparison with any appropriate distribution
posite corresponding to each test in the battery (Cooley, 1965; Cooley &
Miller, 1965). These batteries include the General Aptitude Test Battery
of the United States Employment Service, the Differential Aptitude Tests,
and the Flanagan Aptitude Classification Tests.
Of particular interest is the Anchor Test Study conducted by the Edu-
cational Testing Service under the auspices of the U.S. Office of Edu-
cation (Jaeger, 1973). This study represents a systematic effort to provide
comparable and truly representative national norms for the seven most
widely used reading achievement tests for elementary schoolchildren.
Through an unusually well-controlled experimental design, over 300,000
fourth-, fifth-, and sixth-grade schoolchildren were examined in 50 states.
The anchor test consisted of the reading comprehension and vocabulary
subtests of the Metropolitan Achievement Test, for which new norms
were established in one phase of the project. In the equating phase of the
study, each child took the reading comprehension and vocabulary sub-
tests from two of the seven batteries, each battery being paired in turn
with every other battery. Some groups took parallel forms of the two sub-
tests from the same battery. In still other groups, all the pairings were
duplicated in reverse sequence, in order to control for order of ad-
ministration. From statistical analyses of all these data, score equivalency
tables for the seven tests were prepared by the equipercentile method. A
manual for interpreting scores is provided for use by school systems and
other interested persons (Loret, Seder, Bianchini, & Vale, 1974).
considered above. Thus, an employer may accumulate norms on appli-
cants for a given type of job within his company. A college admissions
office may develop norms on its own student population. Or a single
elementary school may evaluate the performance of individual pupils in
terms of its own score distribution. These local norms are more appropri-
ate than broad national norms for many testing purposes, such as the
prediction of subsequent job performance or college achievement, the
comparison of a child's relative achievement in different subjects, or
the measurement of an individual's progress over time.
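Computing a local norm is no more than finding a score's standing within the locally gathered distribution. A minimal sketch, with a hypothetical applicant sample and the midpoint convention for ties:

```python
def local_percentile_rank(score, local_scores):
    """Percentile rank of a score within a locally gathered score
    distribution (midpoint convention for tied scores)."""
    below = sum(s < score for s in local_scores)
    ties = sum(s == score for s in local_scores)
    return 100 * (below + 0.5 * ties) / len(local_scores)

applicants = [52, 61, 61, 70, 74, 78, 83, 88, 90, 95]  # hypothetical local sample
print(local_percentile_rank(74, applicants))   # 45.0
```

The same raw score of 74 could carry a very different percentile against a national sample, which is the comparability caution running through this section.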
of scores, such as that of a particular college, a type of college, a region,
etc. These specific norms are more useful in making college admission
decisions than would be annual norms based on the entire candidate
population. Any changes in the candidate population over time, more-
over, can be detected only with a fixed-score scale. It will be noted that
the principal difference between the fixed-reference-group scales under
consideration and the previously discussed scales based on national
anchor norms is that the latter require the choice of a single group that is
broadly representative and appropriate for normative purposes. Apart
from the practical difficulties in obtaining such a group and the need to
update the norms, it is likely that for many testing purposes such broad
norms are not required.
Scales built from a fixed reference group are analogous in one respect
to scales employed in physical measurement. In this connection, Angoff
(1962, pp. 32-33) writes:
There is hardly a person here who knows the precise original definition of the
length of the foot used in the measurement of height or distance, or which
king it was whose foot was originally agreed upon as the standard; on the
other hand, there is no one here who does not know how to evaluate lengths
and distances in terms of this unit. Our ignorance of the precise original mean-
ing or derivation of the foot does not lessen its usefulness to us in any way.
Its usefulness derives from the fact that it remains the same over time and
allows us to familiarize ourselves with it. Needless to say, precisely the same
considerations apply to other units of measurement: the inch, the mile, the
degree of Fahrenheit, and so on. In the field of psychological measurement it
is similarly reasonable to say that the original definition of the scale is or
should be of no consequence. What is of consequence is the maintenance of a
constant scale (which, in the case of a multiple-form testing program, is
achieved by rigorous form-to-form equating) and the provision of supple-
mentary normative data to aid in interpretation and in the formation of specific
decisions, data which would be revised from time to time as conditions warrant.
computer capabilities should serve "to free one's thinking from the con-
straints of the past."
Various testing innovations resulting from computer utilization will be
discussed under appropriate topics throughout the book. In the present
connection, we shall examine some applications of computers in the
interpretation of test scores. At the simplest level, most current tests, and
especially those designed for group administration, are now adapted for
computer scoring (Baker, 1971). Several test publishers, as well as inde-
pendent test-scoring organizations, are equipped to provide such scoring
services to test users. Although separate answer sheets are commonly
used for this purpose, optical scanning equipment available at some
scoring centers permits the reading of responses directly from test book-
lets. Many innovative possibilities, such as diagnostic scoring and path
analysis (recording a student's progress at various stages of learning)
have barely been explored.
At a somewhat more complex level, certain tests now provide facilities
for computer interpretation of test scores. In such cases, the computer
program associates prepared verbal statements with particular patterns
of test responses. This approach has been pursued with both personality
and aptitude tests. For example, with the Minnesota Multiphasic Per-
sonality Inventory (MMPI), to be discussed in Chapter 17, test users
may obtain computer printouts of diagnostic and interpretive statements
about the subject's personality tendencies and emotional condition,
together with the numerical scores. Similarly, the Differential Aptitude
Tests (see Ch. 13) provide a Career Planning Report, which includes
a profile of scores on the separate subtests as well as an interpretive
computer printout. The latter contains verbal statements that combine
the test data with information on interests and goals given by the
student on a Career Planning Questionnaire. These statements are
typical of what a counselor would say to the student in going over his
test results in an individual conference (Super, 1973).
Individualized interpretation of test scores at a still more complex level
is illustrated by interactive computer systems, in which the individual is
in direct contact with the computer by means of response stations and
in effect engages in a dialogue with the computer (J. A. Harris, 1973;
Holtzman, 1970; M. R. Katz, 1974; Super, 1970). This technique has been
investigated with regard to educational and vocational planning and de-
cision making. In such a situation, test scores are usually incorporated in
the computer data base, together with other information provided by the
student or client. Essentially, the computer combines all the available
information about the individual with stored information about educational
programs and occupations; and it utilizes all relevant facts and relations
in answering the individual's questions and aiding him in reaching de-
cisions. Examples of such interactive computer systems, in various stages
COMPUTER UTILIZATION IN THE INTERPRETATION
OF TEST SCORES
Computers have already made a significant impact upon every phase
of testing, from test construction to administration, scoring, reporting, and
interpretation. The obvious uses of computers-and those developed
earliest-represent simply an unprecedented increase in the speed with
which traditional data analyses and scoring processes can be carried out.
Far more important, however, are the adoption of new procedures and
the exploration of new approaches to psychological testing which would
have been impossible without the flexibility, speed, and data-processing
capabilities of computers. As Baker (1971, p. 227) succinctly puts it,
operational development, including IBM's Education and Career Exploration
System (ECES) and the System of Interactive Guidance and Information
(SIGI). Preliminary field trials show good acceptance of the systems by
high school students and their parents (Harris, 1973).

Test results also represent an integral part of the data utilized in
computer-assisted instruction (CAI). In order to present instructional
material appropriate to each student's current level of attainment, the
computer must repeatedly score and evaluate the student's responses to
the preceding material. On the basis of his response history, the student may
be routed to more advanced material, to further practice at the present
level, or to a remedial branch in which he receives instruction in more
elementary prerequisite material. Diagnostic analysis of errors may lead
to an instructional program designed to correct the specific learning
difficulties identified in individual cases.

A less costly and operationally more feasible variant of computer
utilization in learning is computer-managed instruction (CMI; see
Hambleton, 1974). In such systems, the learner does not interact directly
with the computer. The role of the computer is to assist the teacher in
individualized instruction, whether through instruction packages or more
conventional materials. The contribution of the computer is to process
the rather formidable mass of data accumulated daily regarding the
performance of each student in a classroom where each may be involved
in a different activity, and to use these data in prescribing the next
instructional step for each student. Examples of this application of
computers are provided by the University of Pittsburgh's IPI (Individually
Prescribed Instruction; see Cooley & Glaser, 1969; Glaser, 1968) and by
Project PLAN (Planning for Learning in Accordance with Needs),
developed by the American Institutes for Research (Flanagan, 1971;
Flanagan, Shanner, Brudner, & Marker, 1975). Project PLAN includes a
program of self-knowledge, individual development, and occupational
planning, as well as instruction in elementary and high school subjects.
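The branching routine that CAI systems apply after each unit test (advance, further practice, or a remedial branch) can be sketched in a short program. The percentage cutoffs and the function name here are illustrative assumptions, not details of any system cited in the text:

```python
# Illustrative sketch of CAI branching logic. The thresholds (85, 60)
# and names are invented for this example.

def next_step(percent_correct, has_persistent_errors):
    """Route a student after a unit test has been scored."""
    if has_persistent_errors:
        # Diagnostic analysis of errors triggers remedial instruction
        # in more elementary prerequisite material.
        return "remedial branch"
    if percent_correct >= 85:
        return "more advanced material"
    if percent_correct >= 60:
        return "further practice at present level"
    return "remedial branch"

print(next_step(90, False))   # more advanced material
print(next_step(70, False))   # further practice at present level
print(next_step(40, False))   # remedial branch
```

The essential point is that the computer, not the teacher, rescores the response record after every unit and selects the next instructional step.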
CRITERION-REFERENCED TESTING
NATURE AND USES. An approach to testing that has aroused a surge of
activity, particularly in education, is generally designated as "criterion-
referenced testing." First proposed by Glaser (1963), this term is still
used somewhat loosely, and its definition varies among different writers.
Moreover, several alternative terms are in common use, such as content-,
For a description of a widely used CAI system for teaching reading to first-
grade children, see R. C. Atkinson (1974).
Norms and the Interpretation of Test Scores 97
domain-, and objective-referenced. These terms are sometimes employed
as synonyms for criterion-referenced and sometimes with slightly different
connotations. "Criterion-referenced," however, seems to have gained
ascendancy, although it is not the most appropriate term.
Typically, criterion-referenced testing uses as its interpretive frame
of reference a specified content domain rather than a specified population
of persons. In this respect, it has been contrasted with the usual norm-
referenced testing, in which an individual's score is interpreted by com-
paring it with the scores obtained by others on the same test. In criterion-
referenced testing, for example, an examinee's test performance may be
reported in terms of the specific kinds of arithmetic operations he has
mastered, the estimated size of his vocabulary, the difficulty level of read-
ing matter he can comprehend (from comic books to literary classics),
or the chances of his achieving a designated performance level on an
external criterion (educational or vocational).
Thus far, criterion-referenced testing has found its major applications
in several recent innovations in education. Prominent among these are
computer-assisted, computer-managed, and other individualized, self-
paced instructional systems. In all these systems, testing is closely inte-
grated with instruction, being introduced before, during, and after
completion of each instructional unit to check on prerequisite skills,
diagnose possible learning difficulties, and prescribe subsequent instruc-
tional procedures. The previously cited Project PLAN and IPI are
examples of such programs.
From another angle, criterion-referenced tests are useful in broad sur-
veys of educational accomplishment, such as the National Assessment of
Educational Progress (Womer, 1970), and in meeting demands for edu-
cational accountability (Gronlund, 1974). From still another angle,
testing for the attainment of minimum requirements, as in qualifying for
a driver's license or a pilot's license, illustrates criterion-referenced
testing. Finally, familiarity with the concepts of criterion-referenced
testing can contribute to the improvement of the traditional, informal
tests prepared by teachers for classroom use. Gronlund (1973) provides
a helpful guide for this purpose, as well as a simple and well-balanced
introduction to criterion-referenced testing. A brief but excellent discus-
sion of the chief limitations of criterion-referenced tests is given by
Ebel (1972b).
CONTENT MEANING. The major distinguishing feature of criterion-
referenced testing (however defined and whether designated by this
term or by one of its synonyms) is its interpretation of test performance
in terms of content meaning. The focus is clearly on what the person can
do and what he knows, not on how he compares with others.
MASTERY TESTING. A second major feature almost always found in
criterion-referenced testing is the procedure of testing for mastery. Es-
sentially, this procedure yields an all-or-none score, indicating that the
individual has or has not attained the preestablished level of mastery.
When basic skills are tested, nearly complete mastery is generally ex-
pected (e.g., 80-85% correct items). A three-way distinction may also
be employed, including mastery, nonmastery, and an intermediate, doubt-
ful, or "review" interval.
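The mastery scoring just described amounts to comparing a percentage-correct score with preestablished cutoffs. A minimal sketch follows; the 85 percent mastery cutoff falls within the 80-85 percent range cited above, but the 70 percent lower bound of the "review" interval is an assumption made for illustration:

```python
# Three-way mastery classification. The cutoff values are illustrative
# assumptions; the text specifies only that nearly complete mastery
# (80-85% correct) is generally expected for basic skills.

def classify(num_correct, num_items, mastery_cut=0.85, review_cut=0.70):
    p = num_correct / num_items
    if p >= mastery_cut:
        return "mastery"
    if p >= review_cut:
        return "review"        # intermediate, doubtful interval
    return "nonmastery"

print(classify(18, 20))  # mastery
print(classify(15, 20))  # review
print(classify(10, 20))  # nonmastery
```

Note that the continuous percentage score is deliberately collapsed into an all-or-none (or three-way) report, which is why such tests need not discriminate among individuals.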
In connection with individualized instruction, some educators have
argued that, given enough time and suitable instructional methods, nearly
everyone can achieve complete mastery of the chosen instructional ob-
jectives. Individual differences would thus be manifested in learning
time rather than in final achievement, as in traditional educational testing
(Bloom, 1968; J. B. Carroll, 1963, 1970; Cooley & Glaser, 1969; Gagné,
1965). It follows that in mastery testing, individual differences in per-
formance are of little or no interest. Hence, as generally constructed,
criterion-referenced tests minimize individual differences. For example,
they include items passed or failed by all or nearly all examinees, al-
though such items are usually excluded from norm-referenced tests.

Mastery testing is regularly employed in the previously cited programs
for individualized instruction. It is also characteristic of published
criterion-referenced tests for basic skills, suitable for elementary schools.
Examples of such tests include the Prescriptive Reading Inventory and
Prescriptive Mathematics Inventory (California Test Bureau), the Skills
Monitoring System in Reading and in Study Skills (Harcourt Brace
Jovanovich), and Diagnosis: An Instructional Aid Series in Reading and
in Mathematics (Science Research Associates).
Beyond basic skills, mastery testing is inapplicable or insufficient. In
more advanced and less structured subjects, achievement is open-ended.
The individual may progress almost without limit in such functions as
understanding, critical thinking, appreciation, and originality. Moreover,
content coverage may proceed in many different directions, depending
upon the individual's abilities, interests, and goals, as well as local in-
structional facilities. Under these conditions, complete mastery is un-
realistic and unnecessary. Hence norm-referenced evaluation is generally
employed in such cases to assess degree of attainment. Some published
tests are so constructed as to permit both norm-referenced and criterion-
referenced applications. An example is the 1973 Edition of the Stanford
Achievement Test. While providing appropriate norms at each level, this
battery meets three important requirements of criterion-referenced tests:
specification of detailed instructional objectives, adequate coverage of
each objective with appropriate items, and wide range of item difficulty.

It should be noted that criterion-referenced testing is neither as new
A fundamental requirement in constructing this type of test is a clearly
defined domain of knowledge or skills to be assessed by the test. If scores
on such tests are to have communicable meaning, the content domain to be
sampled must be widely recognized as important. The selected domain
must then be subdivided into small units defined in performance terms.
In an educational context, these units correspond to behaviorally defined
instructional objectives, such as "multiplies three-digit by two-digit
numbers" or "identifies the misspelled word in which the final e is re-
tained when adding -ing." In the programs prepared for individualized
instruction, these objectives run to several hundred for a single school
subject. After the instructional objectives have been formulated, items are
prepared to sample each objective. This procedure is admittedly difficult
and time-consuming. Without such careful specification and control of
content, however, the results of criterion-referenced testing could de-
generate into an idiosyncratic and uninterpretable jumble.

When strictly applied, criterion-referenced testing is best adapted for
testing basic skills (as in reading and arithmetic) at elementary levels.
In these areas, instructional objectives can also be arranged in an ordinal
hierarchy, the acquisition of more elementary skills being prerequisite
to the acquisition of higher-level skills.6 It is impracticable and probably
undesirable, however, to formulate highly specific objectives for ad-
vanced levels of knowledge in less highly structured subjects. At these
levels, both the content and sequence of learning are likely to be much
more flexible.

On the other hand, in its emphasis on content meaning in the interpre-
tation of test scores, criterion-referenced testing may exert a salutary
effect on testing in general. The interpretation of intelligence test scores,
for example, would benefit from this approach. To describe a child's
intelligence test performance in terms of the specific intellectual skills
and knowledge it represents might help to counteract the confusions and
misconceptions that have become attached to the IQ. When stated in
these general terms, however, the criterion-referenced approach is
equivalent to interpreting test scores in the light of the demonstrated
validity of the particular test, rather than in terms of vague underlying
entities. Such an interpretation can certainly be combined with norm-
referenced scores.
6 Ideally, such tests follow the simplex model of a Guttman scale (see Popham &
Husek, 1969), as do the Piagetian ordinal scales discussed earlier in this chapter.
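The simplex pattern mentioned in this footnote can be sketched as a simple check on a response record; the function name and the 1/0 scoring convention are assumptions made for illustration. In an ideal Guttman scale, a person who passes any item also passes every easier item, so no pass should follow a failure when items are ordered from easiest to hardest:

```python
# Rough check of the simplex (Guttman) pattern: with items ordered
# easy -> hard, an ideal record contains no pass after the first failure.

def is_guttman_pattern(responses):
    """responses: 1 (pass) / 0 (fail) scores on items ordered easy -> hard."""
    seen_failure = False
    for r in responses:
        if r == 0:
            seen_failure = True
        elif seen_failure:      # a pass after a failure violates the scale
            return False
    return True

print(is_guttman_pattern([1, 1, 1, 0, 0]))  # True
print(is_guttman_pattern([1, 0, 1, 0, 0]))  # False
```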
7 As a result of this reduction in variability, the usual methods for finding test
reliability and validity are inapplicable to most criterion-referenced tests. Further
discussion of these points will be found in Chapters 5, 6, and 8.
one illustrated in Table 6. The data for this table were obtained from
171 high school boys enrolled in courses in American history. The pre-
dictor was the Verbal Reasoning test of the Differential Aptitude Tests,
administered early in the course. The criterion was end-of-course grades.
The correlation between test scores and criterion was .66.
TABLE 6
Expectancy Table Showing Relation between DAT Verbal Reasoning Test
Scores and Course Grades in American History for 171 Boys in Grade 11

(Adapted from Fifth Edition Manual for the Differential Aptitude Tests, Forms S and
T, p. 118. Reproduced by permission. Copyright 1973, 1974 by The Psychological
Corporation, New York, N.Y. All rights reserved.)
nor as clearly divorced from norm-referenced testing as some of its
proponents imply. Evaluating an individual's test performance in absolute
terms, such as by letter grades or percentage of correct items, is certainly
older than normative interpretations. More precise attempts to evaluate
test performance in terms of content meaning also antedate the
introduction of the term "criterion-referenced testing" (Ebel, 1962;
Flanagan, 1962; see also Anastasi, 1968, pp. 69-70). Other examples may
be found in early product scales for assessing the quality of handwriting,
compositions, or drawings by matching the individual's work sample
against a set of standard specimens. Ebel (1972b) observes, further-
more, that the concept of mastery in education, in the sense of all-or-
none learning of specific units, achieved considerable popularity in the
1920s and 1930s and was later abandoned.

A normative framework is implicit in all testing, regardless of how
scores are expressed (Angoff, 1974). The very choice of content or
skills to be measured is influenced by the examiner's knowledge of what
can be expected from human organisms at a particular developmental or
instructional stage. Such a choice presupposes information about what
other persons have done in similar situations. Moreover, by imposing
uniform cutoff scores on an ability continuum, mastery testing does not
thereby eliminate individual differences. To describe an individual's level
of reading comprehension as "the ability to understand the content of
the New York Times" still leaves room for a wide range of individual
differences in degree of understanding.7
Test          Number     Percentage Receiving Each Criterion Grade
Score         of Cases   Below 70   70-79   80-89   90 & above

40 & above       46                   15      22        63
30-39            36          6        39      39        17
20-29            43         12        63      21         5
Below 20         46         30        52      17
The first column of Table 6 shows the test scores, divided into four
class intervals; the number of students whose scores fall into each interval
is given in the second column. The remaining entries in each row of the
table indicate the percentage of cases within each test-score interval
who received each grade at the end of the course. Thus, of the 46 students
with scores of 40 or above on the Verbal Reasoning test, 15 percent re-
ceived grades of 70-79, 22 percent grades of 80-89, and 63 percent
grades of 90 or above. At the other extreme, of the 46 students scoring
below 20 on the test, 30 percent received grades below 70, 52 percent
between 70 and 79, and 17 percent between 80 and 89. Within the
limitations of the available data, these percentages represent the best
estimates of the probability that an individual will receive a given
criterion grade. For example, if a new student receives a test score of
34 (i.e., in the 30-39 interval), we would conclude that the probability
of his obtaining a grade of 90 or above is 17 out of 100; the probability
of his obtaining a grade between 80 and 89 is 39 out of 100; and so on.

In many practical situations, criteria can be dichotomized into "suc-
cess" and "failure" in a job, a course of study, or other undertaking. Under
these conditions, an expectancy chart can be prepared, showing the
probability of success or failure corresponding to each score interval.
Figure 7 is an example of such an expectancy chart. Based on a pilot
selection battery developed by the Air Force, this expectancy chart shows
EXPECTANCY TABLES. Test scores may also be interpreted in terms of
expected criterion performance, as in a training program or on a job.
This usage of the term "criterion" follows standard psychometric prac-
tice, as when a test is said to be validated against a particular criterion
(see Ch. 2). Strictly speaking, the term "criterion-referenced testing"
should refer to this type of performance interpretation, while the other
approaches discussed in this section can be more precisely described as
content-referenced. This terminology, in fact, is used in the APA test
standards (1974).

An expectancy table gives the probability of different criterion out-
comes for persons who obtain each test score. For example, if a student
obtains a score of 530 on the CEEB Scholastic Aptitude Test, what are
the chances that his freshman grade-point average in a specific college
will fall in the A, B, C, D, or F category? This type of information can
be obtained by examining the bivariate distribution of predictor scores
(SAT) plotted against criterion status (freshman grade-point average).
If the number of cases in each cell of such a bivariate distribution is
changed to a percentage, the result is an expectancy table, such as the
FIG. 7. Expectancy Chart Showing Relation between Performance on Pilot
Selection Battery and Elimination from Primary Flight Training.
(From Flanagan, 1947, p. 58.)
the percentage of men scoring within each stanine on the battery who
failed to complete primary flight training. It can be seen that 77 percent
of the men receiving a stanine of 1 were eliminated in the course of train-
ing, while only 4 percent of those at stanine 9 failed to complete the
training satisfactorily. Between these extremes, the percentage of failures
decreases consistently over the successive stanines. On the basis of this
expectancy chart, it could be predicted, for example, that approximately
40 percent of pilot cadets who obtain a stanine score of 4 will fail and
approximately 60 percent will satisfactorily complete primary flight train-
ing. Similar statements regarding the probability of success and failure
could be made about individuals who receive each stanine. Thus, an
individual with a stanine of 4 has a 60:40 or 3:2 chance of completing
primary flight training. Besides providing a criterion-referenced interpre-
tation of test scores, it can be seen that both expectancy tables and
expectancy charts give a general idea of the validity of a test in predict-
ing a given criterion.
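The percentages in an expectancy table such as Table 6 come from tallying the bivariate frequency of each (test-score interval, criterion outcome) cell and dividing by the row total. A minimal sketch of that bookkeeping, using a handful of invented score-grade pairs rather than the actual DAT data:

```python
# Building an expectancy table from (predictor interval, criterion grade)
# pairs. The data below are invented for illustration only.

from collections import Counter

pairs = [                       # (score interval, course grade)
    ("40+", "90+"), ("40+", "90+"), ("40+", "80-89"),
    ("30-39", "80-89"), ("30-39", "70-79"), ("30-39", "70-79"),
    ("Below 20", "Below 70"), ("Below 20", "70-79"),
]

cells = Counter(pairs)                          # bivariate cell frequencies
rows = Counter(interval for interval, _ in pairs)  # row totals

expectancy = {
    (interval, grade): round(100 * n / rows[interval])
    for (interval, grade), n in cells.items()
}

# e.g., 2 of the 3 invented students scoring 40+ earned grades of 90+:
print(expectancy[("40+", "90+")])   # 67
```

With a real sample, each percentage is read directly as the estimated probability of that criterion outcome for a new examinee in the given score interval.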
[Figure 7 also reports the number of men at each stanine: stanine 9,
21,474; 8, 19,444; 7, 32,129; 6, 39,398; 5, 34,975; 4, 23,699; 3, 11,209;
2, 2,139; 1, 904.]
CHAPTER 5
Reliability
RELIABILITY refers to the consistency of scores obtained by the
same persons when reexamined with the same test on different
occasions, or with different sets of equivalent items, or under
other variable examining conditions. This concept of reliability underlies
the computation of the error of measurement of a single score, whereby
we can predict the range of fluctuation likely to occur in a single indi-
vidual's score as a result of irrelevant, chance factors.
The concept of test reliability has been used to cover several aspects of
score consistency. In its broadest sense, test reliability indicates the extent
to which individual differences in test scores are attributable to "true"
differences in the characteristics under consideration and the extent to
which they are attributable to chance errors. To put it in more technical
terms, measures of test reliability make it possible to estimate what pro-
portion of the total variance of test scores is error variance. The crux of
the matter, however, lies in the definition of error variance. Factors that
might be considered error variance for one purpose would be classified
under true variance for another. For example, if we are interested in
measuring fluctuations of mood, then the day-by-day changes in scores
on a test of cheerfulness-depression would be relevant to the purpose of
the test and would hence be part of the true variance of the scores. If, on
the other hand, the test is designed to measure more permanent person-
ality characteristics, the same daily fluctuations would fall under the
heading of error variance.
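The partition of score variance just described can be illustrated numerically. In classical test theory, the reliability coefficient estimates the proportion of total variance that is true variance, so the remaining proportion is error variance; the figures below are assumed for the sketch:

```python
# Partitioning total score variance into true and error components.
# The reliability coefficient and total variance are assumed values.

reliability = 0.90          # assumed reliability coefficient
total_variance = 225.0      # assumed score variance (SD = 15)

true_variance = reliability * total_variance          # 202.5
error_variance = (1 - reliability) * total_variance   # 22.5

print(true_variance, error_variance)
```

Which factors count toward `error_variance`, of course, depends on the purpose of the test, as the mood example above shows.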
Essentially, any condition that is irrelevant to the purpose of the test
represents error variance. Thus, when the examiner tries to maintain
uniform testing conditions by controlling the testing environment, in-
structions, time limits, rapport, and other similar factors, he is reducing
error variance and making the test scores more reliable. Despite optimum
testing conditions, however, no test is a perfectly reliable instrument.
Hence, every test should be accompanied by a statement of its reliability.
Such a measure of reliability characterizes the test when administered
under standard conditions and given to subjects similar to those con-
stituting the normative sample. The characteristics of this sample should
therefore be specified, together with the type of reliability that was meas-
ured.
There could, of course, be as many varieties of test reliability as there
are conditions affecting test scores, since any such conditions might be
irrelevant for a certain purpose and would thus be classified as error vari-
ance. The types of reliability computed in actual practice, however, are
relatively few. In this chapter, the principal techniques for measuring the
reliability of test scores will be examined, together with the sources of
error variance identified by each. Since all types of reliability are con-
cerned with the degree of consistency or agreement between two inde-
pendently derived sets of scores, they can all be expressed in terms of a
correlation coefficient. Accordingly, the next section will consider some
of the basic characteristics of correlation coefficients, in order to clarify
their use and interpretation. More technical discussion of correlation, as
well as more detailed specifications of computing procedures, can be
found in any elementary textbook of educational or psychological statis-
tics, such as Guilford and Fruchter (1973).
FIG. 8. Bivariate Distribution for a Hypothetical Correlation of +1.00.
[Scatter diagram omitted: tally marks for 100 cases, plotted with Score
on Variable 1 on the horizontal axis and Score on Variable 2 on the
vertical axis, all falling along the diagonal.]
MEANING OF CORRELATION. Essentially, a correlation coefficient (r) ex-
presses the degree of correspondence, or relationship, between two sets
of scores. Thus, if the top-scoring individual in variable 1 also obtains the
top score in variable 2, the second-best individual in variable 1 is second
best in variable 2, and so on down to the poorest individual in the group,
then there would be a perfect correlation between variables 1 and 2.
Such a correlation would have a value of +1.00.

A hypothetical illustration of a perfect positive correlation is shown in
Figure 8. This figure presents a scatter diagram, or bivariate distribution.
Each tally mark in this diagram indicates the score of one individual in
both variable 1 (horizontal axis) and variable 2 (vertical axis). It will be
noted that all of the 100 cases in the group are distributed along a
diagonal running from the lower left- to the upper right-hand corner of
the diagram. Such a distribution indicates a perfect positive correlation
(+1.00), since it shows that each individual occupies the same relative
position in both variables. The closer the bivariate distribution of scores
approaches this diagonal, the higher will be the positive correlation.
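The perfect correlations described here can be checked numerically. The scores below are made up for illustration (each score on variable 2 is exactly twice the score on variable 1), and the helper function is a direct rendering of the Pearson product-moment formula in deviation form:

```python
# Pearson r computed from deviations about the means, as in the
# deviation-score method. Data are invented for illustration.

import statistics

def pearson_r(xs, ys):
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

var1 = [10, 20, 30, 40, 50]
var2 = [20, 40, 60, 80, 100]

print(pearson_r(var1, var2))        # 1.0  (perfect positive, as in Figure 8)
print(pearson_r(var1, var2[::-1]))  # -1.0 (complete reversal, as in Figure 9)
```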
Figure 9 illustrates a perfect negative correlation (-1.00). In this case,
there is a complete reversal of scores from one variable to the other. The
best individual in variable 1 is the poorest in variable 2 and vice versa,
this reversal being consistently maintained throughout the distribution. It
will be noted that, in this scatter diagram, all individuals fall on the
diagonal extending from the upper left- to the lower right-hand corner.
This diagonal runs in the reverse direction from that in Figure 8.

A zero correlation indicates complete absence of relationship, such as
might occur by chance. If each individual's name were pulled at random
out of a hat to determine his position in variable 1, and if the process
were repeated for variable 2, a zero or near-zero correlation would result.
Under these conditions, it would be impossible to predict an individual's
relative standing in variable 2 from a knowledge of his score in variable
1. The top-scoring subject in variable 1 might score high, low, or average
in variable 2. Some individuals might by chance score above average in
both variables, or below average in both; others might fall above average
in one variable and below in the other; still others might be above the
average in one and at the average in the second; and so forth. There
would be no regularity in the relationship from one individual to another.

The coefficients found in actual practice generally fall between these
extremes, having some value higher than zero but lower than 1.00. Corre-
lations between measures of abilities are nearly always positive, although
frequently low. When a negative correlation is obtained between two
such variables, it usually results from the way in which the scores are ex-
pressed. For example, if time scores are correlated with amount scores, a
negative correlation will probably result. Thus, if each subject's score on
an arithmetic computation test is recorded as the number of seconds re-
quired to complete all items, while his score on an arithmetic reasoning
test represents the number of problems correctly solved, a negative corre-
lation can be expected. In such a case, the poorest (i.e., slowest) individ-
. R l' bet \'een PerformanceCh t Showmg e atIon \ .,IG,7. Expectancy aT p.' . Flight Training.ejectionBattery and Elimination from I1maly
.{FromFlanagan, 1947, p. 58.)
: . ,thin each stanine on the battery who,thepercentage of men scormg \\ I . . . It b seen that 77 percent, J' fI' ht trammg can eailed to comp :t: pnmary. Ig f 1were eiiminated in the course of train-of the men receIVing a stamne 0 . 9 f 'led to complete the
1 '1 I 4 t of those at stamne aling, W 11 C on y percen es the ercentage of failuresh'aining satisfactorily, Between these ex.trcm ", Po the basis of this
. 1 the succeSSl'\'e stanmes. ndecreases consistent y over '. d f I that approximatelyexpectancy chart, it ~uld be predlCt,e , °t
re~amPcoe~eof 4 will fail and
f 'J t d t who obtam a s amne s40 percent 0 pI 0 ca e s 1 . flight train-
itpproximately 60 percent wil1;atis~~~tor~'~b~~~i~ye~fl:~:::~and failureing, Similar statements re~a~ m~ hp i each stanine. Thus, an
could be made about. indlvldua s w 6~.~~c:rv;:2 chance of completing
. individual. with a. s~amne o.f 4 has ~idin' a criterion-referenced interpre-
primary fhght trammg. Besldebspro thgt both expectancy tables and
. f t t scores it can e seen a d'tatlon 0 es .' I 'd f the validitv of a test in pre lct-expectancy charts glVe a genera 1 ea 0 J
ing a given criterion.
No. of
Men
9 21,474
S 19,444
7 32,129
6 39,398
5 34,975
4 '23,699
3 11,209
2 2,139
904
CHAPTER 5
Reliability
RLIABILITY refers to the consistency of scores obtained by the
same persons when reexamined with the same test on different
occasions, or with diHerent sets of equivalent items, or under
othel: variable examining conditions. This concept of reliability underlies
the computation of the error of measurement of a single score, whereby
we can predict the range of fluctuation likely to occur in a single indi-
vidual's score as a result of irrelevant, chance factors.
The concept of test reliability has been used to cover several aspects of
score consistency. In its broadest sense, test reliability indicates the extent
to which individual diHerences in test scores are attributable to "true"
differences in the characteristics under consideration and the extent to
which they are attributable to chance errors. To put it in more technical
terms, measures of test reliability make it possible to estimate what pro-
portion of the total variance of test scores is error variance. The crux of
the matter, however, lies in the definition of error variance, Factors that
might be considered error variance for one purpose would be classified
under true variance for another. For example, if we are interested in
measuring fluctuations of mood, then the day-by-day changes in scores
on a test of cheerfulness-depression would be relevant to the purpose of
the test and would hence be part of the true variance of the scores. If, on
the other hand, the test is designed to measure more permanent person-
ality characteristics, the same daily fluctuations would fall under the
heading of error variance.
Essentially, any condition that is irrelevant to the purpose of the test
represents error variance. Thus, when the examiner tries to maintain
uniform testing conditions by controlling the testing environment, in-
structions, time limits, rapport, and other similar factors, he is reducing
error variance and making the test scores more reliable. Despite optimum
testing conditions, however, no test is a perfectly reliable instrument.
Hence, every test should be accompanied by a statement of its reliability.
Such a measure of reliability characterizes the test when administered
under standard conditions and given to subjects similllr to those con-
stituting the normative sample. The characteristics of thiss~mple should
therefore be specified, together with the type of reliabIlity that was meas-
ured.
iflciplesof Psychological Testing
"~ould, of course, be as many varieties of test reliability as there
,jtionsaffecting test scores, since any such conditions might be
t for a certain purpose and would thus be classified as error vari-
e types of reliability computed in actual practice, however, are
few. In this chapter, the principal techniques for measuring the
'f}'of test scores will be examined, together with the sources of
illiance identified by each. Since all types of reliability are con-
,with the degree of consistency or agreement between two inde-
'flyderived sets of scores, they can all be expressed in tcrms of a
'on coefficient. Accordingly, the next section will consider some
basic characteristics of correlation cBefficients, in order to clarify
use and interpretation. ?\fore technical discussion of correlation, as
·as more detailed specifications of computing procedures, can be
,in any elementary textbook of educational or psychological statis-
; such as Guilford and Fruchter (1973),
9
I : ,- ; ",
!i ! ./Iff III
,
!.JHt-./Iff
.., II j
!mr ./Iffi
i#ff I ;
./Iff./lff'
T./Iff./lffl
./Iff!
./Iff./lff ,j'--
./Iff 11/ I: !
:./Iff./lff
i i
I,;
lilt I ! !, ,,
II I ,
I I,
0- 0- 0-
N
••:g 60-69
'g> 50-59co
~ 40-49v
'"
I N (""')0. ().. ()o.
gb b ';t'fl'?N t") ~ Si ~
SCore On Variable I
FIG. 8, Bivariate Distr'b t' fI U IOn or a Hypothetical Correlation of +1.00.
might OCcur by chance If each ind' 'd I'out of a hat to determ'ine hi .1:1 1I.as n~me \"ere pulled at random
, s pOsitIOn In vanahle 1 a d 'f thwere repeated for variable" ' n I e processUnder these conditions it -, alzderbo~r near~zero correlation would result.
, \Vou e ImpOSSible to d' t drelative standing in variable 2 from k pre. IC an in ividual's
~. The top-sl!Oring Subject in variable a1~~w~edge of l~,s SCore in variableIn variable 2. Some individ I 'h b g t Score 11lgh,low, or average
both vadables or below av:;'l s n~,gbt hY chhance score above average in. ' age In ot . ot ers mightf II b111 one variable and below in the oth .' '11 .a a Ove averageaverage in one and at th " .er, sh others mIght be above the
ld b e a\el:lge III the second and f hwou .e no regularity in the relationshi from '.. ,. so art, ThereThe coefficients fOund in t I ~ one mdl\ Idual to another.
extremes, having some value ~~ ~'l .p~achce generally fall between these
lations between measures of a~1.tt an zero but lower than 1,00. Corre-frequentlv low When a I,lies are nearly a-lways positive, althoug'h
,. negative con-el t' . b'such variables, it usually results from th a IOn.IS 0 .tamed between twopressed. For example if time e way III which the scores are ex-
, ' SCores are correlated withnegat.lYc correlation will probabl ' result. Th ';~:'~' , am.ou~t scores, aan anthmetic computation te t .) d d us, 1f~!ch sublect s score'On. d S IS recor e as the dumb f d
qUire to complete all items wh'l h' '~er a secon sre·t t ' I e IS Score on an arith t'es represents the number of hI '''.' me IC reasoningI t' pro ems correctly sol\!cd 'a Ion can be expected. In SUell 'h . :,<:,' a negatIve cone-
,a case, t e poorest (I.e., slowest) individ-
MEANING OF CORRELATION. Essentially, a correlation coefficient (r) expresses the degree of correspondence, or relationship, between two sets of scores. Thus, if the top-scoring individual in variable 1 also obtains the top score in variable 2, the second-best individual in variable 1 is second best in variable 2, and so on down to the poorest individual in the group, then there would be a perfect correlation between variables 1 and 2. Such a correlation would have a value of +1.00.

A hypothetical illustration of a perfect positive correlation is shown in Figure 8. This figure presents a scatter diagram, or bivariate distribution. Each tally mark in this diagram indicates the score of one individual in both variable 1 (horizontal axis) and variable 2 (vertical axis). It will be noted that all of the 100 cases in the group are distributed along a diagonal running from the lower left- to the upper right-hand corner of the diagram. Such a distribution indicates a perfect positive correlation (+1.00), since it shows that each individual occupies the same relative position in both variables. The closer the bivariate distribution of scores approaches this diagonal, the higher will be the positive correlation.
Figure 9 illustrates a perfect negative correlation (-1.00). In this case, there is a complete reversal of scores from one variable to the other. The best individual in variable 1 is the poorest in variable 2 and vice versa, this reversal being consistently maintained throughout the distribution. It will be noted that, in this scatter diagram, all individuals fall on the diagonal extending from the upper left- to the lower right-hand corner. This diagonal runs in the reverse direction from that in Figure 8. A zero correlation, in contrast, indicates a complete absence of relationship, such as
[Scatter diagram; horizontal axis: Score on Variable 1]
Reliability 107
tive. When some products are positive and some negative, the correlation will be close to zero.
In actual practice, it is not necessary to convert each raw score to a standard score before finding the cross-products, since this conversion can be made once for all after the cross-products have been added. There are many shortcuts for computing the Pearson correlation coefficient. The method demonstrated in Table 7 is not the quickest, but it illustrates the meaning of the correlation coefficient more clearly than other methods that utilize computational shortcuts. Table 7 shows the computation of a Pearson r for the arithmetic and reading scores of 10 children. Next to each child's name are his scores in the arithmetic test (X) and the reading test (Y). The sums and means of the 10 scores are given under the respective columns. The third column shows the deviation (x) of each arithmetic score from the arithmetic mean; and the fourth column, the deviation (y) of each reading score from the reading mean. These deviations are squared in the next two columns, and the sums of the squares are used in computing the standard deviations of the arithmetic and reading scores by the method described in Chapter 4. Rather than dividing each x and y by its corresponding σ to find standard scores, we
[Scatter diagram: tally marks falling along the diagonal from the upper left- to the lower right-hand corner; vertical axis: score intervals (40-49, 50-59, 60-69, ...)]
Fig. 9. Bivariate Distribution for a Hypothetical Correlation of -1.00.

TABLE 7
Computation of Pearson Product-Moment Correlation Coefficient

             Arith-  Read-
             metic   ing
Pupil          X       Y      x     y     x²    y²    xy
Bill          41      17     +1    -4      1    16    -4
Carol         38      28     -2    +7      4    49   -14
Geoffrey      48      22     +8    +1     64     1    +8
Ann           32      16     -8    -5     64    25   +40
Bob           34      18     -6    -3     36     9   +18
Jane          36      15     -4    -6     16    36   +24
Ellen         41      24     +1    +3      1     9    +3
Ruth          43      20     +3    -1      9     1    -3
Dick          47      23     +7    +2     49     4   +14
Mary          40      27      0    +6      0    36     0
Σ            400     210      0     0    244   186    86
M             40      21

σx = √(Σx²/N) = √(244/10) = √24.40 = 4.94
σy = √(Σy²/N) = √(186/10) = √18.60 = 4.31

r = Σxy / (N σx σy) = 86 / ((10)(4.94)(4.31)) = 86 / 212.91 = .40
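The computation laid out in Table 7 can be reproduced in a few lines of Python. This sketch is not part of the original text; it re-derives the entries of the table and also checks that the same r is obtained as the mean of the cross-products of standard scores, as the surrounding discussion describes.

```python
import math

# Arithmetic (X) and reading (Y) scores of the 10 children in Table 7
X = [41, 38, 48, 32, 34, 36, 41, 43, 47, 40]
Y = [17, 28, 22, 16, 18, 15, 24, 20, 23, 27]
N = len(X)

mean_x, mean_y = sum(X) / N, sum(Y) / N            # 40 and 21

# Deviations from the means (the x and y columns of Table 7)
x = [xi - mean_x for xi in X]
y = [yi - mean_y for yi in Y]
sum_xy = sum(a * b for a, b in zip(x, y))          # Σxy = 86

# Standard deviations from the sums of squared deviations
sigma_x = math.sqrt(sum(a * a for a in x) / N)     # √(244/10) ≈ 4.94
sigma_y = math.sqrt(sum(b * b for b in y) / N)     # √(186/10) ≈ 4.31

# r as given by the formula under Table 7
r = sum_xy / (N * sigma_x * sigma_y)

# r is also the mean of the cross-products of standard scores
r_z = sum((a / sigma_x) * (b / sigma_y) for a, b in zip(x, y)) / N

print(round(r, 2), round(r_z, 2))   # 0.4 0.4
```

Both routes give the same value, since dividing each deviation by its σ before multiplying is algebraically identical to dividing the summed cross-products by N σx σy once at the end.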
will have the numerically highest score on the first test, while the best individual will have the highest score on the second.

Correlation coefficients may be computed in various ways, depending on the nature of the data. The most common is the Pearson Product-Moment Correlation Coefficient. This correlation coefficient takes into account not only the person's position in the group, but also the amount of his deviation above or below the group mean. It will be recalled that when each individual's standing is expressed in terms of standard scores, persons falling above the average receive positive standard scores, while those below the average receive negative scores. Thus, an individual who is superior in both variables to be correlated would have two positive standard scores; one inferior in both would have two negative standard scores. If, now, we multiply each individual's standard score in variable 1 by his standard score in variable 2, all of the products will be positive, provided that each individual falls on the same side of the mean on both variables. The Pearson correlation coefficient is simply the mean of these products. It will have a high positive value when corresponding standard scores are of equal sign and of approximately equal amount in the two variables. When subjects above the average in one variable are below the average in the other, the corresponding cross-products will be negative. If the sum of the cross-products is negative, the correlation will be negative.
108 Principles of Psychological Testing
perform this division only once at the end, as shown in the correlation formula in Table 7. The cross-products in the last column (xy) have been found by multiplying the corresponding deviations in the x and y columns. To compute the correlation (r), the sum of these cross-products is divided by the number of cases (N) and by the product of the two standard deviations (σx σy).
STATISTICAL SIGNIFICANCE. The correlation of .40 found in Table 7 indicates a moderate degree of positive relationship between the arithmetic and reading scores. There is some tendency for those children doing well on the arithmetic test also to perform well on the reading test, and vice versa, although the relation is not close. If we are concerned only with the performance of these 10 children, we can accept this correlation as an adequate description of the degree of relation existing between the two variables in this group. In psychological research, however, we are usually interested in generalizing beyond the particular sample of individuals tested to the larger population which they represent. For example, we might want to know whether arithmetic and reading ability are correlated among American schoolchildren of the same age as those we tested. Obviously, the 10 cases actually examined would constitute a very inadequate sample of such a population. Another comparable sample of the same size might yield a much lower or a much higher correlation.

There are statistical procedures for estimating the probable fluctuation to be expected from sample to sample in the size of correlations, means, standard deviations, and any other group measures. The question usually asked about correlations, however, is simply whether the correlation is significantly greater than zero. In other words, if the correlation in the population is zero, could a correlation as high as that obtained in our sample have resulted from sampling error alone? When we say that a correlation is "significant at the 1 percent (.01) level," we mean the chances are no greater than one out of 100 that the population correlation is zero. Hence, we conclude that the two variables are truly correlated. Significance levels refer to the risk of error we are willing to take in drawing conclusions from our data. If a correlation is said to be significant at the .05 level, the probability of error is 5 out of 100. Most psychological research applies the .01 or the .05 levels, although other significance levels may be employed for special reasons.

The correlation of .40 found in Table 7 fails to reach significance even at the .05 level. As might have been anticipated, with only 10 cases it is difficult to establish a general relationship conclusively. With this size of sample, the smallest correlation significant at the .05 level is .63. Any correlation below that value simply leaves unanswered the question of
whether the two variables are correlated in the population from which
the sample was drawn.
The minimum correlations significant at the .01 and ,05 levels for
groups of different sizes can be found by consulting tables of the signifi-
cance of correlations in any statistics textbook. For interpretive purposes
in this book, however, only an understanding of the general concept is
required. Parenthetically, it might be added that significance levels can
be interpreted in a similar way when applied to other statistical measures.
For example, to say that the difference between two means is significant
at the .01 level indicates that we can conclude, with only one chance out
of 100 of being wrong, that a difference in the obtained direction would
be found if we tested the whole population from which our samples were
drawn. For instance, if in the sample tested the boys had obtained a significantly higher mean than the girls on a mechanical comprehension test, we could conclude that the boys would also excel in the total population.
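The conversion behind such tabled minimum values is not shown in the text. One common approach, sketched below with the critical t value taken from a standard table (an assumption, not from the original), converts r to a t statistic with N − 2 degrees of freedom:

```python
import math

# Two-tailed critical value of Student's t at the .05 level for
# df = N - 2 = 8, taken from a standard t table (assumed constant).
T_CRIT_05 = 2.306

def t_statistic(r, n):
    """t statistic for testing whether a correlation differs from zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

def min_significant_r(t_crit, n):
    """Smallest correlation reaching significance for the given critical t."""
    df = n - 2
    return t_crit / math.sqrt(t_crit ** 2 + df)

# The Table 7 correlation of .40 with N = 10 falls short of significance,
# and the minimum significant correlation works out to .63, as stated.
print(round(t_statistic(0.40, 10), 2))             # 1.23, below 2.306
print(round(min_significant_r(T_CRIT_05, 10), 2))  # 0.63
```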
THE RELIABILITY COEFFICIENT. Correlation coefficients have many uses in the analysis of psychological data. The measurement of test reliability
represents one application of such coefficients. An example of a reliability
coefficient, computed by the Pearson Product-Moment method, is to be
found in Figure 10. In this case, the scores of 104 persons on two equiva-
lent forms of a Word Fluency test' were correlated. In one form, the sub-
jects were given five minutes to write as many words as they could that
began with a given letter. The second form was identical, except that a
different letter was employed. The two letters were chosen by the test
authors as being approximately equal in difficulty for this purpose.
The correlation between the number of words written in the two forms of this test was found to be .72. This correlation is high and significant at the .01 level. With 104 cases, any correlation of .25 or higher is significant at this level. Nevertheless, the obtained correlation is somewhat lower than is desirable for reliability coefficients, which usually fall in the .80's or .90's. An examination of the scatter diagram in Figure 10 shows a
typical bivariate distribution of scores corresponding to a high positive
correlation. It will be noted that the tallies cluster c~ose to the diagonal
extending from the lower left- to the upper right-haridcorner; the trend
is definitely in this direction, although there is a certain amount of scatter
of individual entries. In the follOWing section, the uSe of the correlation
coefficient in computing different measures of test reliability will be con-
sidered. '
1. One of the subtests of the SRA Tests of Primary Mental Abilities for Ages 11 to 17. The data were obtained in an investigation by Anastasi and Drake (1954).
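The ".25 or higher" figure quoted above can be checked the same way minimum significant correlations are tabled: convert r to a t statistic with N − 2 degrees of freedom and solve for the smallest r reaching the critical t. The two-tailed .01 critical t of about 2.63 for roughly 100 degrees of freedom is an assumed table value, not from the original text.

```python
import math

T_CRIT_01 = 2.626   # two-tailed .01 critical t for df ≈ 100 (standard t table)
N = 104
df = N - 2

# t = r * sqrt(df) / sqrt(1 - r^2)  implies  r = t / sqrt(t^2 + df)
r_min = T_CRIT_01 / math.sqrt(T_CRIT_01 ** 2 + df)
print(round(r_min, 2))   # 0.25
```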
less susceptible the scores are to the random daily changes in the condi-
tion of the subject or of the testing environment.
When retest reliability is reported in a test manual, the interval over
which it was measured should always be specified. Since retest correla-
tions decrease progressively as this interval lengthens, there is not one
but an infinite number of retest reliability coefficients for any test. It is
also desirable to give some indication of relevant intervening experiences
of the subjects on whom reliability was measured, such as educational or
job experiences, counseling, psychotherapy, and so forth.
Apart from the desirability of reporting length of interval, what con-
siderations should guide the choice of interval? Illustrations could readily
be cited of tests showing high reliability over periods of a few days or
weeks, but whose scores reveal an almost complete lack of correspond-
ence when the interval is extended to as long as ten or fifteen years.
Many preschool intelligence tests, for example, yield moderately stable
measures within the preschool period, but are virtually useless as pre-
dictors of late childhood or adult IQ's. In actual practice, however, a
simple distinction can usually be made. Short-range, random fluctuations
that occur during intervals ranging from a few hours to a few months are
generally included under the error variance of the test score. Thus, in
checking this type of test reliability, an effort is made to keep the interval
short. In testing young children, the period should be even shorter than
for older persons, since at early ages progressive developmental changes
are discernible over a period of a month or even less. For any type of
person, the interval between retests should rarely exceed six months.
Any additional changes in the relative test performance of individuals
that occur over longer periods of time are apt to be cumulative and pro-
gressive rather than entirely random. Moreover, they are likely to charac-
terize a broader area of behavior than that covered by the test perform-
ance itself. Thus, one's general level of scholastic aptitude, mechanical
comprehension, or artistic judgment may have altered appreciably over
a ten-year period, owing to unusual intervening experiences. The indi-
vidual's status may have either risen or dropped appreciably in relation
to others of his own age, because of circumstances peculiar to his own
home, school, or community environment, or for other reasons such as
illness or emotional disturbance.
The extent to which such factors can affect an individual's psycho-
logical development provides an important problem for investigation.
This question, however, should not be confused with that of the reliabil-
ity of a particular test. When we measure the reliability of the Stanford-Binet, for example, we do not ordinarily correlate retest scores over a period of ten years, or even one year, but over a few weeks. To be sure, long-range retests have been conducted with such tests, but the results are generally discussed in terms of the predictability of adult intelligence
Fig. 10. A Reliability Coefficient of .72. (Data from Anastasi & Drake, 1954.)
TYPES OF RELIABILITY
TEST-RETEST RELIABILITY. The most obvious method for finding the reliability of test scores is by repeating the identical test on a second occasion. The reliability coefficient (rtt) in this case is simply the correlation between the scores obtained by the same persons on the two administrations of the test. The error variance corresponds to the random fluctuations of performance from one test session to the other. These variations may result in part from uncontrolled testing conditions, such as extreme changes in weather, sudden noises and other distractions, or a broken pencil point. To some extent, however, they arise from changes in the condition of the subject himself, as illustrated by illness, fatigue, emotional strain, worry, recent experiences of a pleasant or unpleasant nature, and the like. Retest reliability shows the extent to which scores on a test can be generalized over different occasions; the higher the reliability, the
[Scatter diagram for Figure 10: tallies clustering along the diagonal; horizontal axis: Score on Form 1, Word Fluency Test]
from childhood performance, rather than in terms of the reliability of a particular test. The concept of reliability is generally restricted to short-range, random changes that characterize the test performance itself rather than the entire behavior domain that is being tested.

It should be noted that different behavior functions may themselves vary in the extent of daily fluctuation they exhibit. For example, steadiness of delicate finger movements is undoubtedly more susceptible to slight changes in the person's condition than is verbal comprehension. If we wish to obtain an over-all estimate of the individual's habitual finger steadiness, we would probably require repeated tests on several days, whereas a single test session would suffice for verbal comprehension. Again we must fall back on an analysis of the purposes of the test and on a thorough understanding of the behavior the test is designed to predict.
Although apparently simple and straightforward, the test-retest technique presents difficulties when applied to most psychological tests. Practice will probably produce varying amounts of improvement in the retest scores of different individuals. Moreover, if the interval between retests is fairly short, the examinees may recall many of their former responses. In other words, the same pattern of right and wrong responses is likely to recur through sheer memory. Thus, the scores on the two administrations of the test are not independently obtained and the correlation between them will be spuriously high. The nature of the test itself may also change with repetition. This is especially true of problems involving reasoning or ingenuity. Once the subject has grasped the principle involved in the problem, or once he has worked out a solution, he can reproduce the correct response in the future without going through the intervening steps. Only tests that are not appreciably affected by repetition lend themselves to the retest technique. A number of sensory discrimination and motor tests would fall into this category. For the large majority of psychological tests, however, the retest technique is inappropriate.
ALTERNATE-FORM RELIABILITY. One way of avoiding the difficulties encountered in test-retest reliability is through the use of alternate forms of the test. The same persons can thus be tested with one form on the first occasion and with another, comparable form on the second. The correlation between the scores obtained on the two forms represents the reliability coefficient of the test. It will be noted that such a reliability coefficient is a measure of both temporal stability and consistency of response to different item samples (or test forms). This coefficient thus combines two types of reliability. Since both types are important for most
Reliability 113
testing purposes, however, alternate-form reliability provides a useful measure for evaluating many tests.

The concept of item sampling, or content sampling, underlies not only alternate-form reliability but also other types of reliability to be discussed shortly. It is therefore appropriate to examine it more closely. Everyone has probably had the experience of taking a course examination in which he felt he had a "lucky break" because many of the items covered the very topics he happened to have studied most carefully. On another occasion, he may have had the opposite experience, finding an unusually large number of items on areas he had failed to review. This familiar situation illustrates error variance resulting from content sampling. To what extent do scores on this test depend on factors specific to the particular selection of items? If a different investigator, working independently, were to prepare another test in accordance with the same specifications, how much would an individual's score differ on the two tests?

Let us suppose that a 40-item vocabulary test has been constructed as a measure of general verbal comprehension. Now suppose that a second list of 40 different words is assembled for the same purpose, and that the items are constructed with equal care to cover the same range of difficulty as the first test. The differences in the scores obtained by the same individuals on these two tests illustrate the type of error variance under consideration. Owing to fortuitous factors in the past experience of different individuals, the relative difficulty of the two lists will vary somewhat from person to person. Thus, the first list might contain a larger number of words unfamiliar to individual A than does the second list. The second list, on the other hand, might contain a disproportionately large number of words unfamiliar to individual B. If the two individuals are approximately equal in their overall word knowledge (i.e., in their "true scores"), B will nevertheless excel A on the first list, while A will excel B on the second. The relative standing of these two persons will therefore be reversed on the two lists, owing to chance differences in the selection of items.
Like test-retest reliability, alternate-form reliability should always be accompanied by a statement of the length of the interval between test administrations, as well as a description of relevant intervening experiences. If the two forms are administered in immediate succession, the resulting correlation shows reliability across forms only, not across occasions. The error variance in this case represents fluctuations in performance from one set of items to another, but not fluctuations over time.

In the development of alternate forms, care should of course be exercised to ensure that they are truly parallel. Fundamentally, parallel forms of a test should be independently constructed tests designed to meet the same specifications. The tests should contain the same number of items,
and the items should be expressed in the same form and should cover the same type of content. The range and level of difficulty of the items should also be equal. Instructions, time limits, illustrative examples, format, and all other aspects of the test must likewise be checked for comparability.

It should be added that the availability of parallel test forms is desirable for other reasons besides the determination of test reliability. Alternate forms are useful in follow-up studies or in investigations of the effects of some intervening experimental factor on test performance. The use of several alternate forms also provides a means of reducing the possibility of coaching or cheating.
Although much more widely applicable than test-retest reliability, alternate-form reliability also has certain limitations. In the first place, if the behavior functions under consideration are subject to a large practice effect, the use of alternate forms will reduce but not eliminate such an effect. To be sure, if all examinees were to show the same improvement with repetition, the correlation between their scores would remain unaffected, since adding a constant amount to each score does not alter the correlation coefficient. It is much more likely, however, that individuals will differ in amount of improvement, owing to extent of previous practice with similar material, motivation in taking the test, and other factors. Under these conditions, the practice effect represents another source of variance that will tend to reduce the correlation between the two test forms. If the practice effect is small, reduction will be negligible.
Another related question concerns the degree to which the nature of
the test will change with repetition. In certain types of ingenuity prob-
lems, for example, any item involving the same principle can be readily
solved by most subjects once they have worked out the solution to the
first. In such a case, changing the specific content of the items in the
second form would not suffice to eliminate this carry-over from the first
form. Finally, it should be added that alternate forms are unavailable for
many tests, because of the practical difficulties of constructing compara-
ble forms. For all these reasons, other techniques for estimating test re-
liability are often required.
Reliability 115
SPLIT-HALF RELIABILITY. From a single administration of one form of a test it is possible to arrive at a measure of reliability by various split-half procedures. In such a way, two scores are obtained for each person by dividing the test into comparable halves. It is apparent that split-half reliability provides a measure of consistency with regard to content sampling. Temporal stability of the scores does not enter into such reliability, because only one test session is involved. This type of reliability coefficient is sometimes called a coefficient of internal consistency, since only a single administration of a single form is required.²

To find split-half reliability, the first problem is how to split the test in order to obtain the most nearly comparable halves. Any test can be divided in many different ways. In most tests, the first half and the second half would not be comparable, owing to differences in nature and difficulty level of items, as well as to the cumulative effects of warming up, practice, fatigue, boredom, and any other factors varying progressively from the beginning to the end of the test. A procedure that is adequate for most purposes is to find the scores on the odd and even items of the test. If the items were originally arranged in an approximate order of difficulty, such a division yields very nearly equivalent half-scores. One precaution to be observed in making such an odd-even split pertains to groups of items dealing with a single problem, such as questions referring to a particular mechanical diagram or to a given passage in a reading test. In this case, a whole group of items should be assigned intact to one or the other half. Were the items in such a group to be placed in different halves of the test, the similarity of the half-scores would be spuriously inflated, since any single error in understanding of the problem might affect items in both halves.

Once the two half-scores have been obtained for each person, they may be correlated by the usual method. It should be noted, however, that this correlation actually gives the reliability of only a half test. For example, if the entire test consists of 100 items, the correlation is computed between two sets of scores each of which is based on only 50 items. In both test-retest and alternate-form reliability, on the other hand, each score is based on the full number of items in the test.

Other things being equal, the longer a test, the more reliable it will be. It is reasonable to expect that, with a larger sample of behavior, we can arrive at a more adequate and consistent measure. The effect that lengthening or shortening a test will have on its coefficient can be estimated by means of the Spearman-Brown formula, given below:

rₙₙ = n r₁₁ / [1 + (n − 1) r₁₁]

in which rₙₙ is the estimated coefficient, r₁₁ the obtained coefficient, and n is the number of times the test is lengthened or shortened. Thus, if the number of test items is increased from 25 to 100, n is 4; if it is decreased from 60 to 30, n is ½. The Spearman-Brown formula is widely used in determining reliability by the split-half method, many test manuals reporting reliability in this form. When applied to split-half reliability, the formula always involves doubling the length of the test. Under these conditions, it can be simplified as follows:

rₙₙ = 2 r₁₁ / (1 + r₁₁)

² Lengthening a test, however, will increase only its consistency in terms of content sampling, not its stability over time (see Cureton, 1965).
An alternate method for finding split-half reliability was developed by Rulon (1939). It requires only the variance of the differences between each person's scores on the two half-tests (σ²d) and the variance of total scores (σ²t); these two values are substituted in the following formula, which yields the reliability of the whole test directly:

r₁₁ = 1 − σ²d / σ²t

It is interesting to note the relationship of this formula to the definition of error variance. Any difference between a person's scores on the two half-tests represents chance error. The variance of these differences, divided by the variance of total scores, gives the proportion of error variance in the scores. When this error variance is subtracted from 1.00, it gives the proportion of "true" variance, which is equal to the reliability coefficient.
KUDER-RICHARDSON RELIABILITY. A fourth method for finding reliability, also utilizing a single administration of a single form, is based on the consistency of responses to all items in the test. This interitem consistency is influenced by two sources of error variance: (1) content sampling (as in alternate-form and split-half reliability); and (2) heterogeneity of the behavior domain sampled. The more homogeneous the domain, the higher the interitem consistency. For example, if one test includes only multiplication items, while another comprises addition, subtraction, multiplication, and division items, the former test will probably show more interitem consistency than the latter. In the latter, more heterogeneous test, one subject may perform better in subtraction than in any of the other arithmetic operations; another subject may score relatively well on the division items, but more poorly in addition, subtraction, and multiplication; and so on. A more extreme example would be represented by a test consisting of 40 vocabulary items, in contrast to one containing 10 vocabulary, 10 spatial relations, 10 arithmetic reasoning, and 10 perceptual speed items. In the latter test, there might be little or no relationship between an individual's performance on the different types of items.

It is apparent that test scores will be less ambiguous when derived from relatively homogeneous tests. Suppose that on the highly heterogeneous, 40-item test cited above, Smith and Jones both obtain a score of 20. Can we conclude that the performances of the two on this test were equal? Not at all. Smith may have correctly completed 10 vocabulary items, 10 perceptual speed items, and none of the arithmetic reasoning and spatial relations items. In contrast, Jones may have received a score of 20 by the successful completion of 5 perceptual speed, 5 spatial relations, 10 arithmetic reasoning, and no vocabulary items.
Many other combinations could obviously produce the same total score of 20. This score would have a very different meaning when obtained through such dissimilar combinations of items. In the relatively homogeneous vocabulary test, on the other hand, a score of 20 would probably mean that the subject had succeeded with approximately the first 20 words, if the items were arranged in ascending order of difficulty. He might have failed two or three easier words and correctly responded to two or three more difficult items beyond the 20th, but such individual variations are slight in comparison with those found in a more heterogeneous test.
A highly relevant question in this connection is whether the criterion that the test is trying to predict is itself relatively homogeneous or heterogeneous. Although homogeneous tests are to be preferred because their scores permit fairly unambiguous interpretation, a single homogeneous test is obviously not an adequate predictor of a highly heterogeneous criterion. Moreover, in the prediction of a heterogeneous criterion, the heterogeneity of test items would not necessarily represent error variance. Traditional intelligence tests provide a good example of heterogeneous tests designed to predict heterogeneous criteria. In such a case, however, it may be desirable to construct several relatively homogeneous tests, each measuring a different phase of the heterogeneous criterion. Thus, unambiguous interpretation of test scores could be combined with adequate criterion coverage.
The most common procedure for finding interitem consistency is that
developed by Kuder and Richardson (1937). As in the split-half methods,
interitem consistency is found from a single administration of a single
test. Rather than requiring two half-scores, however, such a technique is
based on an examination of performance on each item. Of the various
formulas derived in the original article, the most widely applicable, commonly known as "Kuder-Richardson formula 20," is the following:
r₁₁ = (n / (n − 1)) · (σ²t − Σpq) / σ²t ³
In this formula, r₁₁ is the reliability coefficient of the whole test, n is the number of items in the test, and σt the standard deviation of total scores on the test. The only new term in this formula, Σpq, is found by tabulating the proportion of persons who pass (p) and the proportion who do not pass (q) each item. The product of p and q is computed for each item, and these products are then added for all items, to give Σpq. Since in the process of test construction p is often routinely recorded in order to find the difficulty level of each item, this method of determining reliability involves little additional computation.

³ A simple derivation of this formula can be found in Ebel (1965, pp. 326–327).
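The tabulation just described can be sketched in a few lines of code (a toy example with made-up 0/1 item responses, not data from the text):

```python
def kr20(responses):
    """Kuder-Richardson formula 20 for items scored 0 (fail) or 1 (pass).
    responses: one list per person, with one 0/1 entry per item."""
    n = len(responses[0])                 # number of items
    people = len(responses)
    totals = [sum(r) for r in responses]  # each person's total score
    mean_t = sum(totals) / people
    var_t = sum((t - mean_t) ** 2 for t in totals) / people  # sigma_t squared
    sum_pq = 0.0
    for i in range(n):
        p = sum(r[i] for r in responses) / people  # proportion passing item i
        sum_pq += p * (1 - p)
    return (n / (n - 1)) * (var_t - sum_pq) / var_t

# Four persons, four dichotomous items:
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
]
print(round(kr20(responses), 4))  # 0.6667
```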
It can be shown mathematically that the Kuder-Richardson reliability coefficient is actually the mean of all split-half coefficients resulting from different splittings of a test (Cronbach, 1951).⁴ The ordinary split-half coefficient, on the other hand, is based on a planned split designed to yield equivalent sets of items. Hence, unless the test items are highly homogeneous, the Kuder-Richardson coefficient will be lower than the split-half reliability. An extreme example will serve to highlight the difference. Suppose we construct a 50-item test out of 25 different kinds of items, such that items 1 and 2 are vocabulary items, items 3 and 4 arithmetic reasoning, items 5 and 6 spatial orientation, and so on. The odd and even scores on this test could theoretically agree quite closely, thus yielding a high split-half reliability coefficient. The homogeneity of this test, however, would be very low, since there would be little consistency of performance among the entire set of 50 items. In this example, we would expect the Kuder-Richardson reliability to be much lower than the split-half reliability. It can be seen that the difference between Kuder-Richardson and split-half reliability coefficients may serve as a rough index of the heterogeneity of a test.

⁴ This is strictly true only when the split-half coefficients are found by the Rulon formula, not when they are found by correlation of halves and the Spearman-Brown formula (Novick & Lewis, 1967).

The Kuder-Richardson formula is applicable to tests whose items are scored as right or wrong, or according to some other all-or-none system. Some tests, however, may have multiple-scored items. On a personality inventory, for example, the respondent may receive a different numerical score on an item, depending on whether he checks "usually," "sometimes," "rarely," or "never." For such tests, a generalized formula has been derived, known as coefficient alpha (Cronbach, 1951; Novick & Lewis, 1967). In this formula, the value Σpq is replaced by Σσ²ᵢ, the sum of the variances of item scores. The procedure is to find the variance of all individuals' scores for each item and then to add these variances across all items. The complete formula for coefficient alpha is given below:

r₁₁ = (n / (n − 1)) · (σ²t − Σσ²ᵢ) / σ²t

A clear description of the computational layout for finding coefficient alpha can be found in Ebel (1965, pp. 326–330).

SCORER RELIABILITY. It should now be apparent that the different types of reliability vary in the factors they subsume under error variance. In one case, error variance covers temporal fluctuations; in another, it refers to differences between sets of parallel items; and in still another, it includes any interitem inconsistency. On the other hand, the factors excluded from measures of error variance are broadly of two types: (a) those factors whose variance should remain in the scores, since they are part of the true differences under consideration; and (b) those irrelevant factors that can be experimentally controlled. For example, it is not customary to report the error of measurement resulting when a test is administered under distracting conditions or with a longer or shorter time limit than that specified in the manual. Timing errors and serious distractions can be empirically eliminated from the testing situation. Hence, it is not necessary to report special reliability coefficients corresponding to "distraction variance" or "timing variance."

Similarly, most tests provide such highly standardized procedures for administration and scoring that error variance attributable to these factors is negligible. This is particularly true of group tests designed for mass testing and computer scoring. With such instruments, we need only to make certain that the prescribed procedures are carefully followed and adequately checked. With clinical instruments employed in intensive individual examinations, on the other hand, there is evidence of considerable "examiner variance." Through special experimental designs, it is possible to separate this variance from that attributable to temporal fluctuations in the subject's condition or to the use of alternate test forms.

One source of error variance that can be checked quite simply is scorer variance. Certain types of tests, notably tests of creativity and projective tests of personality, leave a good deal to the judgment of the scorer. With such tests, there is as much need for a measure of scorer reliability as there is for the more usual reliability coefficients. Scorer reliability can be found by having a sample of test papers independently scored by two examiners. The two scores thus obtained by each examinee are then correlated in the usual way, and the resulting correlation coefficient is a measure of scorer reliability. This type of reliability is commonly computed when subjectively scored instruments are employed in research. Test manuals should also report it when appropriate.
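The scorer-reliability computation described above is an ordinary Pearson correlation between the two examiners' scores. A self-contained sketch (the score values are invented for illustration):

```python
def pearson_r(x, y):
    """Pearson correlation between two sets of paired scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Two examiners' independent scores for the same five test papers:
scorer_a = [12, 15, 9, 20, 14]
scorer_b = [11, 16, 10, 19, 15]
print(round(pearson_r(scorer_a, scorer_b), 2))  # 0.96, high scorer agreement
```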
OVERVIEW. The different types of reliability coefficients discussed in this section are summarized in Tables 8 and 9. In Table 8, the operations followed in obtaining each type of reliability are classified with regard to number of test forms and number of testing sessions required. Table 9 shows the sources of variance treated as error variance by each procedure.

TABLE 8
Techniques for Measuring Reliability, in Relation to Test Form and Testing Session

                         Testing Sessions Required
Test Forms Required      One                           Two
One                      Split-Half                    Test-Retest
                         Kuder-Richardson
                         Scorer
Two                      Alternate-Form (Immediate)    Alternate-Form (Delayed)

TABLE 9
Sources of Error Variance in Relation to Reliability Coefficients

Type of Reliability Coefficient           Error Variance
Test-Retest                               Time sampling
Alternate-Form (Immediate)                Content sampling
Alternate-Form (Delayed)                  Time sampling and Content sampling
Split-Half                                Content sampling
Kuder-Richardson and Coefficient Alpha    Content sampling and Content heterogeneity
Scorer                                    Interscorer differences

Any reliability coefficient may be interpreted directly in terms of the percentage of score variance attributable to different sources. Thus, a reliability coefficient of .85 signifies that 85 percent of the variance in test scores depends on true variance in the trait measured and 15 percent depends on error variance (as operationally defined by the specific procedure followed). The statistically sophisticated reader may recall that it is the square of a correlation coefficient that represents proportion of common variance. Actually, the proportion of true variance in test scores is the square of the correlation between scores on a single form of the test and true scores free from chance errors. This correlation, known as the index of reliability,⁵ is equal to the square root of the reliability coefficient (√r₁₁). When the index of reliability is squared, the result is the reliability coefficient (r₁₁), which can therefore be interpreted directly as the percentage of true variance.

⁵ Derivations of the index of reliability, based on two different sets of assumptions, are given by Gulliksen (1950b, Chs. 2 and 3).

Experimental designs that yield more than one type of reliability coefficient for the same group permit the analysis of total score variance into different components. Let us consider the following hypothetical example. Forms A and B of a creativity test have been administered with a two-month interval to 100 sixth-grade children. The resulting alternate-form reliability is .70. From the responses of either form, a split-half reliability coefficient can also be computed.⁶ This coefficient, stepped up by the Spearman-Brown formula, is .80. Finally, a second scorer has rescored a random sample of 50 papers, from which a scorer reliability of .92 is obtained. The three reliability coefficients can now be analyzed to yield the error variances shown in Table 10 and Figure 11. It will be noted that by subtracting the error variance attributable to content sampling alone (split-half reliability) from the error variance attributable to both content and time sampling (alternate-form reliability), we find that .10 of the variance can be attributed to time sampling alone. Adding the error variances attributable to content sampling (.20), time sampling (.10), and interscorer difference (.08) gives a total error variance of .38 and hence a true variance of .62. These proportions, expressed in the more familiar percentage terms, are shown graphically in Figure 11.

⁶ For a better estimate of the coefficient of internal consistency, split-half correlations could be computed for each form and the two coefficients averaged by the appropriate statistical procedures.

TABLE 10
Analysis of Sources of Error Variance in a Hypothetical Test

From delayed alternate-form reliability: 1 − .70 = .30 (time sampling plus content sampling)
From split-half, Spearman-Brown reliability: 1 − .80 = .20 (content sampling)
    Difference: .10 (time sampling)
From scorer reliability: 1 − .92 = .08 (interscorer difference)
Total Measured Error Variance = .20 + .10 + .08 = .38
True Variance = 1 − .38 = .62
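The arithmetic of Table 10 can be reproduced directly (the three coefficients are the values from the hypothetical example above):

```python
alternate_form_delayed = 0.70  # error: time sampling plus content sampling
split_half_sb = 0.80           # error: content sampling (Spearman-Brown stepped up)
scorer = 0.92                  # error: interscorer difference

content = 1 - split_half_sb                    # content sampling alone
time = (1 - alternate_form_delayed) - content  # time sampling alone
interscorer = 1 - scorer                       # interscorer difference
total_error = content + time + interscorer
true_variance = 1 - total_error

print(round(content, 2), round(time, 2), round(interscorer, 2))  # 0.2 0.1 0.08
print(round(total_error, 2), round(true_variance, 2))            # 0.38 0.62
```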
Figure 11. Percentage Distribution of Score Variance in a Hypothetical Test. (Error variance, 38%: content sampling 20%, time sampling 10%, interscorer difference 8%. True variance, 62%: stable over time, consistent over forms, free from interscorer difference.)

RELIABILITY OF SPEEDED TESTS

Both in test construction and in the interpretation of test scores, an important distinction is that between the measurement of speed and of power. A pure speed test is one in which individual differences depend entirely on speed of performance. Such a test is constructed from items of uniformly low difficulty, all of which are well within the ability level of the persons for whom the test is designed. The time limit is made so short that no one can finish all the items. Under these conditions, each person's score reflects only the speed with which he worked. A pure power test, on the other hand, has a time limit long enough to permit everyone to attempt all items. The difficulty of the items is steeply graded, and the test includes some items too difficult for anyone to solve, so that no one can get a perfect score.

It will be noted that both speed and power tests are designed to prevent the achievement of perfect scores. The reason for such a precaution is that perfect scores are indeterminate, since it is impossible to know how much higher the individual's score would have been if more items, or more difficult items, had been included. To enable each individual to show fully what he is able to accomplish, the test must provide adequate ceiling, either in number of items or in difficulty level. An exception to this rule is found in mastery testing, as illustrated by the criterion-referenced tests discussed in Chapter 4. The purpose of such testing is not to establish the limits of what the individual can do, but to determine whether a preestablished performance level has or has not been reached.

In actual practice, the distinction between speed and power tests is one of degree, most tests depending on both power and speed in varying proportions. Information about these proportions is needed for each test in order not only to understand what the test measures but also to choose the proper procedures for evaluating its reliability. Single-trial reliability coefficients, such as those found by odd-even or Kuder-Richardson techniques, are inapplicable to speeded tests. To the extent that individual differences in test scores depend on speed of performance, reliability coefficients found by these methods will be spuriously high. An extreme example will help to clarify this point. Let us suppose that a 50-item test depends entirely on speed, so that individual differences in score are based wholly on number of items attempted, rather than on errors. Then, if individual A obtains a score of 44, he will obviously have 22 correct odd items and 22 correct even items. Similarly, individual B, with a score of 34, will have odd and even scores of 17 and 17, respectively. Consequently, except for accidental careless errors on a few items, the correlation between odd and even scores would be perfect, or +1.00. Such a correlation, however, is entirely spurious and provides no information about the reliability of the test.

An examination of the procedures followed in finding both split-half and Kuder-Richardson reliability will show that both are based on the consistency in number of errors made by the examinee. If, now, individual differences in test scores depend, not on errors, but on speed, the measure of reliability must obviously be based on consistency in speed of work. When test performance depends on a combination of speed and power, the single-trial reliability coefficient will fall below 1.00, but it will still be spuriously high. As long as individual differences in test scores are appreciably affected by speed, single-trial reliability coefficients cannot be properly interpreted.

What alternative procedures are available to determine the reliability of significantly speeded tests? If the test-retest technique is applicable, it would be appropriate. Similarly, equivalent-form reliability may be properly employed with speed tests. Split-half techniques may also be used, provided that the split is made in terms of time rather than in terms of items. In other words, the half-scores must be based on separately timed parts of the test. One way of effecting such a split is to administer two equivalent halves of the test with separate time limits. For example, the odd and even items may be separately printed on different pages, and each set of items given with one-half the time limit of the entire test. Such a procedure is tantamount to administering two equivalent forms of the test in immediate succession. Each form, however, is half as long as the test proper, while the subjects' scores are normally based on the whole test. For this reason, either the Spearman-Brown or some other appropriate formula should be used to find the reliability of the whole test.

If it is not feasible to administer the two half-tests separately, an alternative procedure is to divide the total time into quarters, and to find a score for each of the four quarters. This can easily be done by having the examinees mark the item on which they are working whenever the examiner gives a prearranged signal. The number of items correctly completed within the first and fourth quarters can then be combined to
represent one half-score, while those in the second and third quarters can be combined to yield the other half-score. Such a combination of quarters tends to balance out the cumulative effects of practice, fatigue, and other factors. This method is especially satisfactory when the items are not steeply graded in difficulty level.

When is a test appreciably speeded? Under what conditions must the special precautions discussed in this section be observed? Obviously, the mere employment of a time limit does not signify a speed test. If all subjects finish within the given time limit, speed of work plays no part in determining the scores. Percentage of persons who fail to complete the test might be taken as a crude index of speed versus power. Even when no one finishes the test, however, the role of speed may be negligible. For example, if everyone completes exactly 40 items of a 50-item test, individual differences with regard to speed are entirely absent, although no one had time to attempt all the items.

The essential question, of course, is: "To what extent are individual differences in test scores attributable to speed?" In more technical terms, we want to know what proportion of the total variance of test scores is speed variance. This proportion can be estimated roughly by finding the variance of number of items completed by different persons and dividing it by the variance of total test scores (σ²c/σ²t). In the example cited above, in which every individual finishes 40 items, the numerator of this fraction would be zero, since there are no individual differences in number of items completed (σ²c = 0). The entire index would thus equal zero in a pure power test. On the other hand, if the total test variance (σ²t) is attributable to individual differences in speed, the two variances will be equal and the ratio will be 1.00. Several more refined procedures have been developed for determining this proportion, but their detailed consideration falls beyond the scope of this book.⁷
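This rough index can be computed directly (illustrative data; `pvariance` is the population variance from the Python standard library):

```python
from statistics import pvariance

def speed_index(items_completed, total_scores):
    """Rough proportion of test-score variance attributable to speed:
    variance of number of items completed divided by variance of total scores."""
    return pvariance(items_completed) / pvariance(total_scores)

# Pure power: everyone completes exactly 40 items, so the numerator is zero.
print(speed_index([40, 40, 40, 40], [28, 31, 35, 26]))  # 0.0
# Pure speed: score equals number of items attempted, so the ratio is 1.0.
print(speed_index([30, 35, 42, 44], [30, 35, 42, 44]))  # 1.0
```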
An example of the effect of speed on single-trial reliability coefficients is provided by data collected in an investigation of the first edition of the SRA Tests of Primary Mental Abilities for Ages 11 to 17 (Anastasi & Drake, 1954). In this study, the reliability of each test was first determined by the usual odd-even procedure. These coefficients, given in the first row of Table 11, are closely similar to those reported in the test manual. Reliability coefficients were then computed by correlating scores on separately timed halves. These coefficients are shown in the second row of Table 11. Calculation of speed indexes showed that the Verbal Meaning test is primarily a power test, while the Reasoning test is somewhat more dependent on speed. The Space and Number tests proved to be highly speeded.

TABLE 11
Reliability Coefficients of Four of the SRA Tests of Primary Mental Abilities for Ages 11 to 17 (1st Edition)
(Data from Anastasi & Drake, 1954)

Reliability Coefficient Found by:    Verbal Meaning    Reasoning    Space    Number
Single-trial odd-even method              .94             .96        .90      .92
Separately timed halves                   .90             .87        .75      .83

It will be noted in Table 11 that, when properly computed, the reliability of the Space test is .75, in contrast to a spuriously high odd-even coefficient of .90. Similarly, the reliability of the Reasoning test drops from .96 to .87, and that of the Number test drops from .92 to .83. The reliability of the relatively unspeeded Verbal Meaning test, on the other hand, shows a negligible difference when computed by the two methods.
DEPENDENCE OF RELIABILITY COEFFICIENTS ON THE SAMPLE TESTED
⁷ See, e.g., Cronbach & Warrington (1951), Gulliksen (1950a, 1950b), Guttman (1955), Helmstadter & Ortmeyer (1953).
HETEROGENEITY. An important factor influencing the size of a reliability coefficient is the nature of the group on which reliability is measured. In the first place, any correlation coefficient is affected by the range of individual differences in the group. If every member of a group were alike in spelling ability, then the correlation of spelling with any other ability would be zero in that group. It would obviously be impossible, within such a group, to predict an individual's standing in any other ability from a knowledge of his spelling score.

Another, less extreme, example is provided by the correlation between two aptitude tests, such as a verbal comprehension and an arithmetic reasoning test. If these tests were administered to a highly homogeneous sample, such as a group of 300 college sophomores, the correlation between the two would probably be close to zero. There is little relationship, within such a selected sample of college students, between any individual's verbal ability and his numerical reasoning ability. On the other hand, were the tests to be given to a heterogeneous sample of 300 persons, ranging from institutionalized mentally retarded persons to college graduates, a high correlation would undoubtedly be obtained between the two tests. The mentally retarded would obtain poorer scores than the college graduates on both tests, and similar relationships would hold for other subgroups within this highly heterogeneous sample.
Examination of the hypothetical scatter diagram given in Figure 12 will further illustrate the dependence of correlation coefficients on the variability, or extent of individual differences, within the group. This scatter diagram shows a high positive correlation in the entire, heterogeneous group, since the entries are closely clustered about the diagonal extending from the lower left- to the upper right-hand corner. If, now, we consider only the subgroup falling within the small rectangle in the upper right-hand portion of the diagram, it is apparent that the correlation between the two variables is close to zero. Individuals falling within this restricted range in both variables represent a highly homogeneous group, as did the college sophomores mentioned above.
Like all correlation coefficients, reliability coefficients depend on the variability of the sample within which they are found. Thus, if the reliability coefficient reported in a test manual was determined in a group ranging from fourth-grade children to high school students, it cannot be assumed that the reliability would be equally high within, let us say, an eighth-grade sample. When a test is to be used to discriminate individual
Reliability 127
differences within a more homogeneous sample than the standardization
group, the reliabi~ity ~oefficient should be redetermined on such a sample.
Formulas for estimating the reliability coefficient to be expected when
the standard deviation of the group is increased or decreased are avail-
able in elementary statistics textbooks. It is preferable, however, to re-
compute the reliability coefficient empirically on a group comparable to
that on which the test is to be used. For tests designed to cover a wide
range ~f age or abil.ity, the test manual should report separate reliability
coeffiCIents for relatively homogeneous subgroups within the standardiza-tion sample.
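The textbook formula alluded to above can be sketched in a few lines. It assumes, as the standard derivation does, that the error variance SD²(1 − r) stays constant while the group's total variance changes; the numbers below are hypothetical, not from the text.

```python
def estimated_reliability(r_old, sd_old, sd_new):
    """Estimate the reliability coefficient expected in a group whose
    standard deviation differs from that of the original sample.

    Assumes the error variance, sd**2 * (1 - r), remains constant
    while the true-score variance changes with group heterogeneity.
    """
    error_variance = sd_old ** 2 * (1.0 - r_old)
    return 1.0 - error_variance / sd_new ** 2

# A test with reliability .90 in a wide-range sample (SD = 20) would be
# expected to show a much lower coefficient in a narrower sample (SD = 10):
print(round(estimated_reliability(0.90, 20.0, 10.0), 2))  # 0.6
```

This makes concrete why a coefficient determined on a heterogeneous standardization group overstates the reliability to be expected within a single grade.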
[Scatter diagram for Fig. 12: Score on Variable 2 plotted against Score on Variable 1. Entries cluster along the diagonal for the full, heterogeneous group; a small rectangle in the upper right marks the restricted-range subgroup.]
ABILITY LEVEL. Not only does the reliability coefficient vary with the extent of individual differences in the sample, but it may also vary between groups differing in average ability level. These differences, moreover, cannot usually be predicted or estimated by any statistical formula, but can be discovered only by empirical tryout of the test on groups differing in age or ability level. Such differences in the reliability of a single test may arise from the fact that a slightly different combination of abilities is measured at different difficulty levels of the test. Or it may result from the statistical properties of the scale itself, as in the Stanford-Binet (Pinneau, 1961, Ch. 5). Thus, for different ages and for different IQ levels, the reliability coefficient of the Stanford-Binet varies from .83 to .98. In other tests, reliability may be relatively low for the younger and less able groups, since their scores are unduly influenced by guessing. Under such circumstances, the particular test should not be employed at these levels.

It is apparent that every reliability coefficient should be accompanied by a full description of the type of group on which it was determined. Special attention should be given to the variability and the ability level of the sample. The reported reliability coefficient is applicable only to samples similar to that on which it was computed. A desirable and growing practice in test construction is to fractionate the standardization sample into more homogeneous subgroups, with regard to age, sex, grade level, occupation, and the like, and to report separate reliability coefficients for each subgroup. Under these conditions, the reliability coefficients are more likely to be applicable to the samples with which the test is to be used in actual practice.
Fig. 12. The Effect of Restricted Range upon a Correlation Coefficient.

INTERPRETATION OF INDIVIDUAL SCORES. The reliability of a test may be expressed in terms of the standard error of measurement (σmeas), also called the standard error of a score. This measure is particularly well suited to the interpretation of individual scores. For many testing purposes, it is therefore more useful than the reliability coefficient. The standard error of measurement can be easily computed from the reliability coefficient of the test, by the following formula:

σmeas = SD √(1 − r11)

in which SD is the standard deviation of the test scores and r11 the reliability coefficient, both computed on the same group. For example, if deviation IQ's on a particular intelligence test have a standard deviation of 15 and a reliability coefficient of .89, the σmeas of an IQ on this test is: 15√(1 − .89) = 15√.11 = 15(.33) = 5.
To understand what the σmeas tells us about a score, let us suppose that we had a set of 100 IQ's obtained with the above test by a single boy, Jim. Because of the types of chance errors discussed in this chapter, these scores will vary, falling into a normal distribution around Jim's true score. The mean of this distribution of 100 scores can be taken as the true score and the standard deviation of the distribution can be taken as the σmeas. Like any standard deviation, this standard error can be interpreted in terms of the normal curve frequencies discussed in Chapter 4 (see Figure 3). It will be recalled that between the mean and ±1σ there are approximately 68 percent of the cases in a normal curve. Thus, we can conclude that the chances are roughly 2:1 (or 68:32) that Jim's IQ on this test will fluctuate between ±1σmeas, or 5 points on either side of his true IQ. If his true IQ is 110, we would expect him to score between 105 and 115 about two-thirds (68 percent) of the time.
If we want to be more certain of our prediction, we can choose higher odds than 2:1. Reference to Figure 3 in Chapter 4 shows that ±3σ covers 99.7 percent of the cases. It can be ascertained from normal curve frequency tables that a distance of 2.58σ on either side of the mean includes exactly 99 percent of the cases. Hence, the chances are 99:1 that Jim's IQ will fall within 2.58σmeas, or (2.58)(5) = 13 points, on either side of his true IQ. We can thus state at the 99 percent confidence level (with only one chance of error out of 100) that Jim's IQ on any single administration of the test will lie between 97 and 123 (110 − 13 and 110 + 13). If Jim were given 100 equivalent tests, his IQ would fall outside this band of values only once.

In actual practice, of course, we do not have the true scores, but only the scores obtained in a single test administration. Under these circumstances, we could try to follow the above reasoning in the reverse direction. If an individual's obtained score is unlikely to deviate by more than 2.58σmeas from his true score, we could argue that his true score must lie within 2.58σmeas of his obtained score. Although we cannot assign a probability to this statement for any given obtained score, we can say that the statement would be correct for 99 percent of all the cases.
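The computations in these paragraphs are easy to reproduce. A minimal sketch, using the text's IQ example (SD = 15, r11 = .89, true IQ 110):

```python
import math

def sem(sd, reliability):
    """Standard error of measurement: SD * sqrt(1 - r11)."""
    return sd * math.sqrt(1.0 - reliability)

def score_band(score, sd, reliability, z=1.0):
    """Range of +/- z standard errors of measurement around a score."""
    half_width = z * sem(sd, reliability)
    return (score - half_width, score + half_width)

s = sem(15, 0.89)
print(round(s, 2))                         # 4.97; the text rounds to 5
print(score_band(110, 15, 0.89, z=2.58))   # roughly 97 to 123
```

The same `score_band` call with the default z = 1.0 reproduces the 2:1 (68 percent) band of 105 to 115.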
On the basis of this reasoning, Gulliksen (1950b, pp. 17-20) proposed that the standard error of measurement be used as illustrated above to estimate the reasonable limits of the true score for persons with any given obtained score. It is in terms of such "reasonable limits" that the error of measurement is customarily interpreted in psychological testing, and it will be so interpreted in this book.

The standard error of measurement and the reliability coefficient are obviously alternative ways of expressing test reliability. Unlike the reliability coefficient, the error of measurement is independent of the variability of the group on which it is computed. Expressed in terms of individual scores, it remains unchanged when found in a homogeneous or a heterogeneous group. On the other hand, being reported in score units, the error of measurement will not be directly comparable from test to test. The usual problems of comparability of units would thus arise when errors of measurement are reported in terms of arithmetic problems, words in a vocabulary test, and the like. Hence, if we want to compare the reliability of different tests, the reliability coefficient is the better measure. To interpret individual scores, the standard error of measurement is more appropriate.
INTERPRETATION OF SCORE DIFFERENCES. It is particularly important to consider test reliability and errors of measurement when evaluating the differences between two scores. Thinking in terms of the range within which each score may fluctuate serves as a check against overemphasizing small differences between scores. Such caution is desirable both when comparing test scores of different persons and when comparing the scores of the same individual in different abilities. Similarly, changes in scores following instruction or other experimental variables need to be interpreted in the light of errors of measurement.

A frequent question about test scores concerns the individual's relative standing in different areas. Is Jane more able along verbal than along numerical lines? Does Tom have more aptitude for mechanical than for verbal activities? If Jane scored higher on the verbal than on the numerical subtests on an aptitude battery and Tom scored higher on the mechanical than on the verbal, how sure can we be that they would still do so on a retest with another form of the battery? In other words, could the score differences have resulted merely from the chance selection of specific items in the particular verbal, numerical, and mechanical tests employed?
Because of the growing interest in the interpretation of score profiles, test publishers have been developing report forms that permit the evaluation of scores in terms of their errors of measurement. An example is the Individual Report Form for use with the Differential Aptitude Tests, reproduced in Figure 13. On this form, percentile scores on each subtest of the battery are plotted as one-inch bars, with the obtained percentile at the center. Each percentile bar corresponds to a distance of approximately 1½ to 2 standard errors on either side of the obtained score.8 Hence the assumption that the individual's true score falls within the bar is correct about 90 percent of the time. In interpreting the profiles, test users are advised not to attach importance to differences between scores whose percentile bars overlap, especially if they overlap by more than half their length. In the profile illustrated in Figure 13, for example, the difference between the Verbal Reasoning and Numerical Ability scores probably reflects a genuine difference in ability level; that between Mechanical Reasoning and Space Relations probably does not; the difference between Abstract Reasoning and Mechanical Reasoning is in the doubtful range.

[Fig. 13. Score Profile on the Differential Aptitude Tests, Illustrating Use of Percentile Bands. (Fig. 2, Fifth Edition Manual, p. 73. Reproduced by permission. Copyright © 1973, 1974 by The Psychological Corporation, New York, N.Y. All rights reserved.)]

8 Because the reliability coefficient (and hence the σmeas) varies somewhat with subtest, grade, and sex, the actual ranges covered by the one-inch lines are not identical, but they are sufficiently close to permit uniform interpretations for practical purposes.

It is well to bear in mind that the standard error of the difference between two scores is larger than the error of measurement of either of the two scores. This follows from the fact that this difference is affected by the chance errors present in both scores. The standard error of the difference between two scores can be found from the standard errors of measurement of the two scores by the following formula:

σdiff = √(σmeas1² + σmeas2²)

in which σdiff is the standard error of the difference between the two scores, and σmeas1 and σmeas2 are the standard errors of measurement of the separate scores. By substituting SD√(1 − r11) for σmeas1 and SD√(1 − r22) for σmeas2, we may rewrite the formula directly in terms of reliability coefficients, as follows:

σdiff = SD √(2 − r11 − r22)

In this substitution, the same SD was used for tests 1 and 2, since their scores would have to be expressed in terms of the same scale before they could be compared.

We may illustrate the above procedure with the Verbal and Performance IQ's on the Wechsler Adult Intelligence Scale (WAIS). The split-half reliabilities of these scores are .96 and .93, respectively. WAIS deviation IQ's have a mean of 100 and an SD of 15. Hence the standard error of the difference between these two scores can be found as follows:

σdiff = 15√(2 − .96 − .93) = 4.95

To determine how large a score difference could be obtained by chance at the .05 level, we multiply the standard error of the difference (4.95) by 1.96. The result is 9.70, or approximately 10 points. Thus the difference between an individual's WAIS Verbal and Performance IQ should be at least 10 points to be significant at the .05 level.
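The WAIS computation above follows directly from the formula; in this sketch the rounding differs trivially from the text, which uses √.11 ≈ .33:

```python
import math

def se_difference(sd, r1, r2):
    """Standard error of the difference between two scores on the
    same scale: SD * sqrt(2 - r11 - r22)."""
    return sd * math.sqrt(2.0 - r1 - r2)

# WAIS example from the text: SD = 15, split-half reliabilities .96 and .93.
sd_diff = se_difference(15, 0.96, 0.93)
print(round(sd_diff, 2))        # 4.97 (the text, using sqrt(.11) = .33, gets 4.95)
print(round(1.96 * sd_diff))    # 10: minimum significant difference at the .05 level
```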
RELIABILITY OF CRITERION-REFERENCED TESTS

It will be recalled from Chapter 4 that criterion-referenced tests usually (but not necessarily) evaluate performance in terms of mastery rather than degree of achievement. A major statistical implication of mastery testing is a reduction in variability of scores among persons. Theoretically, if everyone continues training until the skill is mastered, variability is reduced to zero. Not only is low variability a result of the way such tests are used; it is also built into the tests through the construction and choice of items, as will be shown in Chapter 8.

In an earlier section of this chapter, we saw that any correlation, including reliability coefficients, is affected by the variability of the group in which it is computed. As the variability of the sample decreases, so does the correlation coefficient. Obviously, then, it would be inappropriate to assess the reliability of most criterion-referenced tests by the usual procedures.9 Under these conditions, even a highly stable and internally consistent test could yield a reliability coefficient near zero.
In the construction of criterion-referenced tests, two important questions are: (1) How many items must be used for reliable assessment of each of the specific instructional objectives covered by the test? (2) What proportion of items must be correct for the reliable establishment of mastery? In much current testing, these two questions have been answered by judgmental decisions. Efforts are under way, however, to develop appropriate statistical techniques that will provide objective, empirical answers (see, e.g., Ferguson & Novick, 1973; Glaser & Nitko, 1971; Hambleton & Novick, 1973; Livingston, 1972; Millman, 1974). A few examples will serve to illustrate the nature and scope of these efforts.

The two questions about number of items and cutoff score can be incorporated into a single hypothesis, amenable to testing within the framework of decision theory and sequential analysis (Glaser & Nitko, 1971; Lindgren & McElrath, 1969; Wald, 1947). Specifically, we wish to test the hypothesis that the examinee has achieved the required level of mastery in the content domain or instructional objective sampled by the test items. Sequential analysis consists in taking observations one at a time and deciding after each observation whether to: (1) accept the hypothesis; (2) reject the hypothesis; or (3) make additional observations. Thus the number of observations (in this case, number of items) needed to reach a reliable conclusion is itself determined during the process of testing. Rather than being presented with a fixed, predetermined number of items, the examinee continues taking the test until a mastery or nonmastery decision is reached. At that point, testing is discontinued and the student is either directed to the next instructional level or returned to the nonmastered level for further study. With the computer facilities described in an earlier chapter, such sequential decision procedures are feasible and can reduce total testing time while yielding reliable estimates of mastery (Glaser & Nitko, 1971).

9 For fuller discussion of special statistical procedures required for the construction and evaluation of criterion-referenced tests, see Glaser and Nitko (1971), Hambleton and Novick (1973), Millman (1974), and Popham and Husek (1969). A set of tables for determining the minimum number of items required for establishing mastery at specified levels is provided by Millman (1972, 1973).
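As a concrete illustration of this sequential logic, the sketch below applies Wald's sequential probability ratio test to a stream of item responses. The mastery and nonmastery proportions (p1, p0) and the error rates are hypothetical choices for illustration, not values from the text:

```python
import math

def sprt_mastery(responses, p0=0.60, p1=0.85, alpha=0.05, beta=0.05):
    """Wald sequential probability ratio test applied to mastery testing.

    Illustrative sketch only: p0 (nonmastery proportion-correct), p1
    (mastery proportion-correct), and the error rates are hypothetical.
    Returns 'mastery', 'nonmastery', or 'continue' (more items needed).
    """
    upper = math.log((1 - beta) / alpha)   # decide mastery above this
    lower = math.log(beta / (1 - alpha))   # decide nonmastery below this
    llr = 0.0                              # running log-likelihood ratio
    for correct in responses:
        if correct:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "mastery"
        if llr <= lower:
            return "nonmastery"
    return "continue"

print(sprt_mastery([1, 1, 1, 1, 1, 1, 1, 1, 1]))  # mastery
```

Note how the number of items administered is not fixed in advance: a string of consistent responses ends testing early, while mixed responses keep it going.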
Some investigators have been exploring the use of Bayesian estimation techniques, which lend themselves well to the kind of decisions required by mastery testing. Because of the large number of specific instructional objectives to be tested, criterion-referenced tests typically provide only a small number of items for each objective. To supplement this limited information, procedures have been developed for incorporating collateral data from the student's previous performance history, as well as from the test results of other students (Ferguson & Novick, 1973; Hambleton & Novick, 1973).
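One simple form such Bayesian estimation can take is a beta-binomial model, in which the prior distribution is the natural place to fold in collateral information. This is an illustrative sketch, not a reconstruction of the procedures of Ferguson and Novick; the prior and mastery level are hypothetical:

```python
import math

def beta_tail(a, b, x, steps=20000):
    """P(p > x) for a Beta(a, b) variable, by midpoint integration
    (kept dependency-free; scipy.stats.beta.sf would do the same job)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    width = (1.0 - x) / steps
    total = 0.0
    for i in range(steps):
        p = x + (i + 0.5) * width
        total += math.exp(log_norm + (a - 1.0) * math.log(p)
                          + (b - 1.0) * math.log(1.0 - p)) * width
    return total

def mastery_probability(correct, attempted, prior_a=1.0, prior_b=1.0,
                        mastery_level=0.80):
    """Posterior probability that the examinee's true proportion-correct
    exceeds the mastery level, under a Beta(prior_a, prior_b) prior.

    The prior is where collateral data (the student's history, other
    students' results) could be incorporated; values here are hypothetical.
    """
    return beta_tail(prior_a + correct,
                     prior_b + attempted - correct,
                     mastery_level)

# Eight of ten items correct under a uniform prior:
print(round(mastery_probability(8, 10), 3))  # about 0.38
```

With only ten items, the posterior remains quite uncertain, which is exactly why collateral data are attractive as a sharper prior.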
When flexible, individually tailored procedures are impracticable, more traditional techniques can be utilized to assess the reliability of a given test. For example, mastery decisions reached at a prerequisite instructional level can be checked against performance at the next instructional level. Is there a sizeable proportion of students who reached or exceeded the cutoff score on the mastery test at the lower level and failed to achieve mastery at the next level within a reasonable period of instructional time? Does an analysis of their difficulties suggest that they had not truly mastered the prerequisite skills? If so, these findings would strongly suggest that the mastery test was unreliable. Either the addition of more items or the establishment of a higher cutoff score would seem to be indicated. Another procedure for determining the reliability of a mastery test is to administer two parallel forms to the same individuals and note the percentage of persons for whom the same decision (mastery or nonmastery) is reached on both forms (Hambleton & Novick, 1973).
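The parallel-forms check just described reduces to a simple tally. A sketch with hypothetical scores and cutoff:

```python
def decision_consistency(form_a_scores, form_b_scores, cutoff):
    """Percentage of examinees classified the same way (mastery or
    nonmastery) by two parallel forms, given a common cutoff score."""
    pairs = list(zip(form_a_scores, form_b_scores))
    same = sum((a >= cutoff) == (b >= cutoff) for a, b in pairs)
    return 100.0 * same / len(pairs)

# Hypothetical scores of six students on two parallel forms, cutoff 8:
print(decision_consistency([9, 7, 10, 8, 5, 9],
                           [8, 8, 9, 6, 4, 10], 8))  # 4 of 6 decisions agree
```

Unlike a correlation coefficient, this index remains meaningful even when score variability is near zero, which is why it suits mastery tests.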
In the development of several criterion-referenced tests, Educational Testing Service has followed an empirical procedure to set standards of mastery. This procedure involves administering the test in classes one grade below and one grade above the grade where the particular concept or skill is taught. The dichotomization can be further refined by using teacher judgments to exclude any cases in the lower grade known to have mastered the concept or skill and any cases in the higher grade who have demonstrably failed to master it. A cutting score, in terms of number or percentage of correct items, is then selected that best discriminates between the two groups.
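Selecting the cutting score that best discriminates the two grade groups can be sketched as a search over candidate cutoffs, scoring each by the number of correct classifications it yields. The score distributions below are hypothetical:

```python
def best_cutting_score(lower_grade_scores, upper_grade_scores):
    """Choose the cutoff (number of correct items) that best separates a
    lower-grade (presumed nonmastery) group from an upper-grade (presumed
    mastery) group, by maximizing correct classifications."""
    candidates = range(0, max(upper_grade_scores) + 2)

    def hits(cut):
        below = sum(s < cut for s in lower_grade_scores)
        at_or_above = sum(s >= cut for s in upper_grade_scores)
        return below + at_or_above

    return max(candidates, key=hits)

# Hypothetical numbers correct on a 10-item test:
lower = [3, 4, 5, 5, 6, 6]   # grade below the one where the skill is taught
upper = [7, 8, 8, 9, 9, 10]  # grade above
print(best_cutting_score(lower, upper))  # 7
```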
All statistical procedures for use with criterion-referenced tests are in an exploratory stage. Much remains to be done, in both theoretical development and empirical tryouts, before the most effective methodology for different testing situations can be formulated.
CHAPTER 6

Validity: Basic Concepts

THE VALIDITY of a test concerns what the test measures and how well it does so. In this connection, we should guard against accepting the test name as an index of what the test measures. Test names provide short, convenient labels for identification purposes. Most test names are far too broad and vague to furnish meaningful clues to the behavior area covered, although increasing efforts are being made to use more specific and operationally definable test names. The trait measured by a given test can be defined only through an examination of the objective sources of information and empirical operations utilized in establishing its validity (Anastasi, 1950). Moreover, the validity of a test cannot be reported in general terms. No test can be said to have "high" or "low" validity in the abstract. Its validity must be determined with reference to the particular use for which the test is being considered.

Fundamentally, all procedures for determining test validity are concerned with the relationships between performance on the test and other independently observable facts about the behavior characteristics under consideration. The specific methods employed for investigating these relationships are numerous and have been described by various names. In the Standards for Educational and Psychological Tests (1974), these procedures are classified under three principal categories: content, criterion-related, and construct validity. Each of these types of validation procedures will be considered in one of the following sections, and the relations among them will be examined in a concluding section. Techniques for analyzing and interpreting validity data with reference to practical decisions will be discussed in Chapter 7.

CONTENT VALIDITY

NATURE. Content validity involves essentially the systematic examination of the test content to determine whether it covers a representative sample of the behavior domain to be measured. Such a validation procedure is commonly used in evaluating achievement tests. This type of test is designed to measure how well the individual has mastered a specific skill or course of study. It might thus appear that mere inspection of the content of the test should suffice to establish its validity for such a purpose. A test of multiplication, spelling, or bookkeeping would seem to be valid by definition if it consists of multiplication, spelling, or bookkeeping items, respectively.

The solution, however, is not so simple as it appears to be.1 One difficulty is that of adequately sampling the item universe. The behavior domain to be tested must be systematically analyzed to make certain that all major aspects are covered by the test items, and in the correct proportions. For example, a test can easily become overloaded with those aspects of the field that lend themselves more readily to the preparation of objective items. The domain under consideration should be fully described in advance, rather than being defined after the test has been prepared. A well-constructed achievement test should cover the objectives of instruction, not just its subject matter. Content must therefore be broadly defined to include major objectives, such as the application of principles and the interpretation of data, as well as factual knowledge. Moreover, content validity depends on the relevance of the individual's test responses to the behavior area under consideration, rather than on the apparent relevance of item content. Mere inspection of the test may fail to reveal the processes actually used by examinees in taking the test.

It is also important to guard against any tendency to overgeneralize regarding the domain sampled by the test. For instance, a multiple-choice spelling test may measure the ability to recognize correctly and incorrectly spelled words. But it cannot be assumed that such a test also measures ability to spell correctly from dictation, frequency of misspellings in written compositions, and other aspects of spelling ability (Ahlstrom, 1964; Knoell & Harris, 1952). Still another difficulty arises from the possible inclusion of irrelevant factors in the test scores. For example, a test designed to measure proficiency in such areas as mathematics or mechanics may be unduly influenced by the ability to understand verbal directions or by speed of performing simple, routine tasks.

1 Further discussions of content validity from several angles can be found in Ebel (1956), Huddleston (1956), and Lennon (1956).

SPECIFIC PROCEDURES. Content validity is built into a test from the outset through the choice of appropriate items. For educational tests, the preparation of items is preceded by a thorough and systematic examination of relevant course syllabi and textbooks, as well as by consultation
with subject-matter experts. On the basis of the information thus gathered, test specifications are drawn up for the item writers. These specifications should show the content areas or topics to be covered, the instructional objectives or processes to be tested, and the relative importance of individual topics and processes. On this basis, the number of items of each kind to be prepared on each topic can be established. A convenient way to set up such specifications is in terms of a two-way table, with processes across the top and topics in the left-hand column (see Table, Ch. 14). Not all cells in such a table, of course, need to have items, since certain processes may be unsuitable or irrelevant for certain topics. It might be added that such a specification table will also prove helpful in the preparation of teacher-made examinations for classroom use in any subject.
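Such a two-way specification table is straightforward to represent and check programmatically. The topics, processes, and item counts below are hypothetical, chosen only to illustrate the structure:

```python
# Hypothetical blueprint for a 40-item arithmetic achievement test:
# rows are topics, columns are instructional processes, cells give
# the number of items to be written for that combination.
blueprint = {
    "Fractions":   {"Knowledge": 4, "Comprehension": 4, "Application": 4},
    "Decimals":    {"Knowledge": 3, "Comprehension": 4, "Application": 3},
    "Percent":     {"Knowledge": 3, "Comprehension": 3, "Application": 4},
    "Measurement": {"Knowledge": 2, "Comprehension": 3, "Application": 3},
}

def total_items(table):
    """Total number of items specified across all topic/process cells."""
    return sum(sum(row.values()) for row in table.values())

def items_per_process(table):
    """Column totals: how many items test each process overall."""
    totals = {}
    for row in table.values():
        for process, n in row.items():
            totals[process] = totals.get(process, 0) + n
    return totals

print(total_items(blueprint))       # 40
print(items_per_process(blueprint))
```

Checking row and column totals against the intended relative importance of topics and processes is exactly the use to which such a table is put during item writing.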
In listing objectives to be covered in an educational achievement test, the test constructor can be guided by the extensive survey of educational objectives given in the Taxonomy of Educational Objectives (Bloom et al., 1956; Krathwohl et al., 1964). Prepared by a group of specialists in educational measurement, this handbook also provides examples of many types of items designed to test each objective. Two volumes are available, covering cognitive and affective domains, respectively. The major categories given in the cognitive domain include knowledge (in the sense of remembered facts, terms, methods, principles, etc.), comprehension, application, analysis, synthesis, and evaluation. The classification of affective objectives, concerned with the modification of attitudes, interests, values, and appreciation, includes five major categories: receiving, responding, valuing, organization, and characterization.
The discussion of content validity in the manual of an achievement test should include information on the content areas and the skills or objectives covered by the test, with some indication of the number of items in each category. In addition, the procedures followed in selecting categories and classifying items should be described. If subject-matter experts participated in the test-construction process, their number and professional qualifications should be stated. If they served as judges in classifying items, the directions they were given should be reported, as well as the extent of agreement among judges. Because curricula and course content change over time, it is particularly desirable to give the dates when subject-matter experts were consulted. Information should likewise be provided about number and nature of course syllabi and textbooks surveyed, including publication dates.
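The extent of agreement among judges can be summarized, at its simplest, as the percentage of items classified identically; a chance-corrected index such as Cohen's kappa is also commonly reported. The item classifications below are hypothetical:

```python
def percent_agreement(judge_a, judge_b):
    """Percentage of items that two judges assign to the same category."""
    matches = sum(a == b for a, b in zip(judge_a, judge_b))
    return 100.0 * matches / len(judge_a)

# Hypothetical classifications of eight items by two judges:
a = ["comprehension", "knowledge", "application", "knowledge",
     "comprehension", "application", "knowledge", "analysis"]
b = ["comprehension", "knowledge", "application", "comprehension",
     "comprehension", "application", "knowledge", "application"]
print(percent_agreement(a, b))  # 6 of 8 items classified alike
```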
A number of empirical procedures may also be followed in order to supplement the content validation of an achievement test. Both total scores and performance on individual items can be checked for grade progress. In general, those items are retained that show the largest gains in the percentages of children passing them from the lower to the upper grades. Figure 14 shows a portion of a table from the manual of the Sequential Tests of Educational Progress, Series II (STEP). For every item in each test in this achievement battery, the information provided includes its classification with regard to learning skill and type of material, as well as the percentage of children in the normative sample who marked the right answer to the item in each of the grades for which that level of the test is designed. The 30 items included in Figure 14 represent one part of the Reading test for Level 3, which covers grades 7 to 9.

[Fig. 14: table listing, for each STEP Reading item, its learning-skill and material-type classification and the percentage of the normative sample passing it in grades 7, 8, and 9.]
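The grade-progress check can be sketched as follows. The percentages passing and the retention threshold are hypothetical, not taken from the STEP manual:

```python
def grade_gain(percent_passing_by_grade):
    """Gain in percentage passing from the lowest to the highest grade
    for one item, e.g. {7: 48, 8: 63, 9: 74} -> 26."""
    grades = sorted(percent_passing_by_grade)
    return (percent_passing_by_grade[grades[-1]]
            - percent_passing_by_grade[grades[0]])

def retain_items(items, min_gain=10):
    """Keep items whose percentage passing rises by at least min_gain
    percentage points across grades (hypothetical threshold)."""
    return [name for name, pcts in items.items()
            if grade_gain(pcts) >= min_gain]

# Hypothetical percentages passing in grades 7, 8, and 9:
items = {
    "item 1": {7: 48, 8: 63, 9: 74},  # clear grade progress
    "item 2": {7: 70, 8: 72, 9: 71},  # nearly flat
    "item 3": {7: 35, 8: 50, 9: 66},  # clear grade progress
}
print(retain_items(items))  # ['item 1', 'item 3']
```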
Other supplementary procedures that may be employed, when appropriate, include analyses of the types of errors commonly made on a test and observation of the work methods employed by examinees. The latter can be done by testing students individually with instructions to "think aloud" while solving each problem. The contribution of speed can be checked by noting how many persons fail to finish the test or by one of the more refined methods discussed in Chapter 5. To detect the possible irrelevant influence of ability to read instructions on test performance, scores on the test can be correlated with scores on a reading comprehension test. On the other hand, if the test is designed to measure reading comprehension, giving the questions without the reading passage on which they are based will show how many could be answered simply from the examinees' prior information or other irrelevant cues.
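The first speed check mentioned, counting examinees who fail to finish, can be sketched as follows; the counts are hypothetical:

```python
def percent_not_finishing(items_attempted_per_examinee, n_items):
    """Rough speed check: percentage of examinees who did not reach the
    last item (trailing omissions counted as unreached)."""
    unfinished = sum(attempted < n_items
                     for attempted in items_attempted_per_examinee)
    return 100.0 * unfinished / len(items_attempted_per_examinee)

# Hypothetical: items attempted by ten examinees on a 50-item test.
attempted = [50, 50, 47, 50, 44, 50, 50, 39, 50, 48]
print(percent_not_finishing(attempted, 50))  # 4 of 10 did not finish
```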
Validity: Basic Concepts 1.39
into the initial stages of constructing any test, eventual validation of apti-
tude or personality tests requires empirical verification by the procedures
to be described in the following sections. These tests bear less intrinsic
resemblance to the behavior domain they are trying to sample than do
achievement tests. Consequently, the content of aptitude and personality
tests can do little more than reveal the hypotheses that led the test con-
structor to choose a certain type of content for measuring a specified
trait. Such hypotheses need to be empirically confirmed to estabiish thevalidity of the test.
Unlike achievement tests, aptitude and personality tests are not based
on a specified course of instruction or uniform set of prior experiences
from which test content can be drawn. Hence, in the latter tests, indi-
viduals are likely to vary more in the work methods or psycholOgical
processes employed in responding to the same test items. The identical
test might thus measure different functions in different persons. Under
these conditions, it would be virtually impossible to determine the psy-
chological functions measured by the tcst from an inspection of its
content. For example, college graduates might solve a problem in verbal
or mathematical terms, while a,mechanic would arrive at the same solu-
tion in terms of spatial visualization. Or a test measuring arithmetic
reasoning among high scho.ol freshmen might measure only individual
differences in speed of computation when given to college" students. A
specific illustration of the dangers of relying on content analysis of apti-
tude tests is provided by a study conducted with a digit-symbol substitu-
tion ~est"(Burik, 1950). This test, generally regarded as a typical "code-
learmng test, was found to measure chiefly motor speed in a group ofhigh school students.
APPLICATIONS. Especially when bolstered by such empirical checks as
those illustrated above, content validity provides an adequate technique for evaluating achievement tests. It permits us to answer two questions that are basic to the validity of an achievement test: (1) Does the test cover a representative sample of the specified skills and knowledge? (2) Is test performance reasonably free from the influence of irrelevant variables?
Content validity is particularly appropriate for the criterion-referenced tests described in Chapter 4. Because performance on these tests is interpreted in terms of content meaning, it is obvious that content validity is a prime requirement for their effective use. Content validation is also applicable to certain occupational tests designed for employee selection and classification, to be discussed in Chapter 15. This type of validation is suitable when the test is an actual job sample or otherwise calls for the same skills and knowledge required on the job. In such cases, a thorough job analysis should be carried out in order to demonstrate the close resemblance between the job activities and the test.
For aptitude and personality tests, on the other hand, content validity
is usually inappropriate and may, in fact, be misleading. Although con-
siderations of relevance and effectiveness of content must obviously enter
FACE VALIDITY. Content validity should not be confused with face validity. The latter is not validity in the technical sense; it refers, not to what the test actually measures, but to what it appears superficially to measure. Face validity pertains to whether the test "looks valid" to the examinees who take it, the administrative personnel who decide on its use, and other technically untrained observers. Fundamentally, the question of face validity concerns rapport and public relations. Although common usage of the term validity in this connection may make for confusion, face validity itself is a desirable feature of tests. For example, when tests originally designed for children and developed within a classroom setting were first extended for adult use, they frequently met with resistance and criticism because of their lack of face validity. Certainly if test content appears irrelevant, inappropriate, silly, or childish, the result will be poor cooperation, regardless of the actual validity of the
Validity: Basic Concepts 141
sonnel to occupational training programs represent examples of the sort
of decisions requiring a knowledge of the predictive validity of tests.
Other examples include the use of tests to screen out applicants likely
to develop emotional disorders in stressful environments and the use of
tests to identify psychiatric patients most likely to benefit from a particular therapy.
In a number of instances, concurrent validity is found merely as a substitute for predictive validity. It is frequently impracticable to extend validation procedures over the time required for predictive validity or to obtain a suitable preselection sample for testing purposes. As a compromise solution, therefore, tests are administered to a group on whom criterion data are already available. Thus, the test scores of college students may be compared with their cumulative grade-point average at the time of testing, or those of employees compared with their current job success.
For certain uses of psychological tests, on the other hand, concurrent validity is the most appropriate type and can be justified in its own right. The logical distinction between predictive and concurrent validity is based, not on time, but on the objectives of testing. Concurrent validity is relevant to tests employed for diagnosis of existing status, rather than prediction of future outcomes. The difference can be illustrated by asking: "Is Smith neurotic?" (concurrent validity) and "Is Smith likely to become neurotic?" (predictive validity).
Because the criterion for concurrent validity is always available at the time of testing, we might ask what function is served by the test in such situations. Basically, such tests provide a simpler, quicker, or less expensive substitute for the criterion data. For example, if the criterion consists of continuous observation of a patient during a two-week hospitalization period, a test that could sort out normals from neurotic and doubtful cases would appreciably reduce the number of persons requiring such extensive observation.
140 Principles of Psychological Testing
test. Especially in adult testing, it is not sufficient for a test to be objectively valid. It also needs face validity to function effectively in practical situations.

Face validity can often be improved by merely reformulating test items in terms that appear relevant and plausible in the particular setting in which they will be used. For example, if a test of simple arithmetic reasoning is constructed for use with machinists, the items should be worded in terms of machine operations rather than in terms of "how many oranges can be purchased for 36 cents" or other traditional schoolbook problems. Similarly, an arithmetic test for naval personnel can be expressed in naval terminology, without necessarily altering the functions measured. To be sure, face validity should never be regarded as a substitute for objectively determined validity. It cannot be assumed that improving the face validity of a test will improve its objective validity. Nor can it be assumed that when a test is modified so as to increase its face validity, its objective validity remains unaltered. The validity of the test in its final form will always need to be directly checked.
Criterion-related validity indicates the effectiveness of a test in predicting an individual's behavior in specified situations. For this purpose, performance on the test is checked against a criterion, i.e., a direct and independent measure of that which the test is designed to predict. Thus, for a mechanical aptitude test, the criterion might be subsequent job performance as a machinist; for a scholastic aptitude test, it might be college grades; and for a neuroticism test, it might be associates' ratings or other available information on the subjects' behavior in various life situations.
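The check of test performance against a criterion is conventionally summarized by a correlation coefficient. The following is a minimal sketch, not drawn from the text: the function name and the aptitude-score and job-rating data are hypothetical, and a validity coefficient is computed as the Pearson correlation between test scores and criterion measures.

```python
from math import sqrt

def validity_coefficient(test_scores, criterion):
    """Pearson correlation between test scores and an independent criterion."""
    n = len(test_scores)
    mean_x = sum(test_scores) / n
    mean_y = sum(criterion) / n
    cov = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(test_scores, criterion))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in test_scores))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in criterion))
    return cov / (sd_x * sd_y)

# Hypothetical example: mechanical aptitude scores checked against
# later job-performance ratings as machinists.
scores = [52, 61, 45, 70, 58, 66]
job_ratings = [3.1, 3.6, 2.8, 4.2, 3.3, 3.9]
r = validity_coefficient(scores, job_ratings)
```

A coefficient near zero would indicate that the test tells little about the criterion; the closer it approaches 1.00, the more effectively the test predicts.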
CONCURRENT AND PREDICTIVE VALIDITY. The criterion measure against which test scores are validated may be obtained at approximately the same time as the test scores or after a stated interval. The APA test Standards (1974) differentiate between concurrent and predictive validity on the basis of these time relations between criterion and test. The term "prediction" can be used in the broader sense, to refer to prediction from the test to any criterion situation, or in the more limited sense of prediction over a time interval. It is in the latter sense that it is used in the expression "predictive validity." The information provided by predictive validity is most relevant to tests used in the selection and classification of personnel. Hiring job applicants, selecting students for admission to college or professional schools, and assigning military per-
CRITERION CONTAMINATION. An essential precaution in finding the validity of a test is to make certain that the test scores do not themselves influence any individual's criterion status. For example, if a college instructor or a foreman in an industrial plant knows that a particular individual scored very poorly on an aptitude test, such knowledge might influence the grade given to the student or the rating assigned to the worker. Or a high-scoring person might be given the benefit of the doubt when academic grades or on-the-job ratings are being prepared. Such influences would obviously raise the correlation between test scores and criterion in a manner that is entirely spurious or artificial.
This possible source of error in test validation is known as criterion contamination, since the criterion ratings become "contaminated" by the knowledge of the test scores. To prevent the operation of such an error, it is absolutely essential that no person who participates in the assignment of criterion ratings have any knowledge of the examinees' test scores. For this reason, test scores employed in "testing the test" must be kept strictly confidential. It is sometimes difficult to convince teachers, employers, military officers, and other line personnel that such a precaution is essential. In their urgency to utilize all available information for practical decisions, such persons may fail to realize that the test scores must be put aside until the criterion data mature and validity can be determined.
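The spurious inflation produced by criterion contamination can be shown with a small numerical sketch. All scores below, and the 0.02 "nudge" factor, are hypothetical: when each rater adjusts a rating toward a test score that rater has seen, the obtained coefficient rises even though the examinees' actual criterion behavior is unchanged.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

test_scores = [50, 62, 44, 71, 57, 65, 48, 60]
# Ratings assigned by judges who never saw the test scores.
blind_ratings = [3.0, 3.2, 3.4, 4.1, 2.9, 3.3, 3.5, 3.6]
# Contaminated ratings: each judge nudges the rating toward the known score.
contaminated = [r + 0.02 * (s - 57) for r, s in zip(blind_ratings, test_scores)]

r_blind = pearson(test_scores, blind_ratings)
r_contaminated = pearson(test_scores, contaminated)
```

The contaminated coefficient exceeds the blind one, which is precisely the spurious rise in correlation the text warns against.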
selected group than elementary school graduates, the relation between amount of education and scholastic aptitude is far from perfect. Especially at the higher educational levels, economic, social, motivational, and other nonintellectual factors may influence the continuation of the individual's education. Moreover, with such concurrent validation it is difficult to disentangle cause-and-effect relations. To what extent are the obtained differences in intelligence test scores simply the result of the varying amount of education? And to what extent could the test have predicted individual differences in subsequent educational progress? These questions can be answered only when the test is administered before the criterion data have matured, as in predictive validation.
In the development of special aptitude tests, a frequent type of criterion is based on performance in specialized training. For example, mechanical aptitude tests may be validated against final achievement in shop courses. Various business school courses, such as stenography, typing, or bookkeeping, provide criteria for aptitude tests in these areas. Similarly, performance in music or art schools has been employed in validating music or art aptitude tests. Several professional aptitude tests have been validated in terms of achievement in schools of law, medicine, dentistry, engineering, and other areas. In the case of custom-made tests designed for use within a specific testing program, training records are a frequent source of criterion data. An outstanding illustration is the validation of Air Force pilot selection tests against performance in basic flight training. Performance in training programs is also commonly used as a criterion for test validation in other military occupational specialties and in some industrial validation studies.
Among the specific indices of training performance employed for criterion purposes may be mentioned achievement tests administered on completion of training, formally assigned grades, instructors' ratings, and successful completion of training versus elimination from the program. Multiple aptitude batteries have often been checked against grades in specific high school or college courses, in order to determine their validity as differential predictors. For example, scores on a verbal comprehension test may be compared with grades in English courses, spatial visualization scores with geometry grades, and so forth.
In connection with the use of training records in general as criterion measures, a useful distinction is that between intermediate and ultimate criteria. In the development of an Air Force pilot-selection test or a medical aptitude test, for example, the ultimate criteria would be combat performance and eventual achievement as a practicing physician, respectively. Obviously it would require a long time for such criterion data to mature. It is doubtful, moreover, whether a truly ultimate criterion is ever obtained in actual practice. Finally, even were such an ultimate criterion available, it would probably be subject to many uncontrolled
COMMON CRITERIA. Any test may be validated against as many criteria as there are specific uses for it. Any method for assessing behavior in any situation could provide a criterion measure for some particular purpose. The criteria employed in finding the validities reported in test manuals, however, fall into a few common categories. Among the criteria most frequently employed in validating intelligence tests is some index of academic achievement. It is for this reason that such tests have often been more precisely described as measures of scholastic aptitude. The specific indices used as criterion measures include school grades, achievement test scores, promotion and graduation records, special honors and awards, and teachers' or instructors' ratings for "intelligence." Insofar as such ratings given within an academic setting are likely to be heavily influenced by the individual's scholastic performance, they may be properly classed with the criterion of academic achievement.

The various indices of academic achievement have provided criterion data at all educational levels, from the primary grades to college and graduate school. Although employed principally in the validation of general intelligence tests, they have also served as criteria for certain multiple-aptitude and personality tests. In the validation of any of these types of tests for use in the selection of college students, for example, a common criterion is freshman grade-point average. This measure is the average grade in all courses taken during the freshman year, each grade being weighted by the number of course points for which it was received.
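The weighting just described amounts to a credit-weighted mean. A minimal sketch follows; the course grades and point values are hypothetical, assuming a conventional 4-point grade scale.

```python
def grade_point_average(courses):
    """courses: list of (grade_value, course_points) pairs.
    Each grade is weighted by the number of course points it carries."""
    weighted_total = sum(grade * points for grade, points in courses)
    total_points = sum(points for _, points in courses)
    return weighted_total / total_points

# Hypothetical freshman year: A (3 points), B (4), C (3), B (2).
freshman_year = [(4.0, 3), (3.0, 4), (2.0, 3), (3.0, 2)]
gpa = grade_point_average(freshman_year)
```

Note that the B in the four-point course pulls the average more than the B in the two-point course, which is exactly the weighting the criterion requires.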
A variant of the criterion of academic achievement frequently employed with out-of-school adults is the amount of education the individual completed. It is expected that in general the more intelligent individuals continue their education longer, while the less intelligent drop out of school earlier. The assumption underlying this criterion is that the educational ladder serves as a progressively selective influence, eliminating those incapable of continuing beyond each step. Although it is undoubtedly true that college graduates, for example, represent a more highly
factors that would render it relatively useless. For example, it would be difficult to evaluate the relative degree of success of physicians practicing different specialties and in different parts of the country. For these reasons, such intermediate criteria as performance records at some stage of training are frequently employed as criterion measures.

For many purposes, the most satisfactory type of criterion measure is that based on follow-up records of actual job performance. This criterion has been used to some extent in the validation of general intelligence as well as personality tests, and to a larger extent in the validation of special aptitude tests. It is a common criterion in the validation of custom-made tests for specific jobs. The "jobs" in question may vary widely in both level and kind, including work in business, industry, the professions, and the armed services. Most measures of job performance, although probably not representing ultimate criteria, at least provide good intermediate criteria for many testing purposes. In this respect they are to be preferred to training records. On the other hand, the measurement of job performance does not permit as much uniformity of conditions as is possible during training. Moreover, since it usually involves a longer follow-up, the criterion of job performance is likely to entail a loss in the number of available subjects. Because of the variation in the nature of nominally similar jobs in different organizations, test manuals reporting validity data against job criteria should describe not only the specific criterion measures employed but also the job duties performed by the workers.

Validation by the method of contrasted groups generally involves a
composite criterion that reflects the cumulative and uncontrolled selective influences of everyday life. This criterion is ultimately based on survival within a particular group versus elimination therefrom. For example, in the validation of an intelligence test, the scores obtained by institutionalized mentally retarded children may be compared with those obtained by schoolchildren of the same age. In this case, the multiplicity of factors determining commitment to an institution for the mentally retarded constitutes the criterion. Similarly, the validity of a musical aptitude or a mechanical aptitude test may be checked by comparing the scores obtained by students enrolled in a music school or an engineering school, respectively, with the scores of unselected high school or college students.
To be sure, contrasted groups can be selected on the basis of any criterion, such as school grades, ratings, or job performance, by simply choosing the extremes of the distribution of criterion measures. The contrasted groups included in the present category, however, are distinct groups that have gradually become differentiated through the operation of the multiple demands of daily living. The criterion under consideration is thus more complex and less clearly definable than those previously discussed.
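For the music-school example above, the question under contrasted-group validation is simply whether the score distributions of the two naturally formed groups pull apart. A minimal sketch follows; all scores are hypothetical.

```python
def mean(xs):
    return sum(xs) / len(xs)

# Hypothetical aptitude-test scores for two naturally contrasted groups.
music_students = [78, 85, 81, 90, 74, 88]   # enrolled in a music school
unselected = [55, 62, 70, 58, 66, 60]       # unselected students, same age range

# Difference between group means.
gap = mean(music_students) - mean(unselected)
# How many unselected students reach even the lowest music-school score?
n_overlap = sum(1 for score in unselected if score >= min(music_students))
```

A large mean gap with little overlap supports the test's validity against this composite, everyday-life criterion; heavy overlap between the distributions would argue against it.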
The method of contrasted groups is used quite commonly in the validation of personality tests. Thus, in validating a test of social traits, the test performance of salesmen or executives, on the one hand, may be compared with that of clerks or engineers, on the other. The assumption underlying such a procedure is that, with reference to many social traits, individuals who have entered and remained in such occupations as selling or executive work will as a group excel persons in such fields as clerical work or engineering. Similarly, college students who have engaged in many extracurricular activities may be compared with those who have participated in none during a comparable period of college attendance. Occupational groups have frequently been used in the development and validation of interest tests, such as the Strong Vocational Interest Blank, as well as in the preparation of attitude scales. Other groups sometimes employed in the validation of attitude scales include political, religious, geographical, or other special groups generally known to represent distinctly different points of view on certain issues.
In the development of certain personality tests, psychiatric diagnosis is used both as a basis for the selection of items and as evidence of test validity. Psychiatric diagnosis may serve as a satisfactory criterion provided that it is based on prolonged observation and detailed case history, rather than on a cursory psychiatric interview or examination. In the latter case, there is no reason to expect the psychiatric diagnosis to be superior to the test score itself as an indication of the individual's emotional condition. Such a psychiatric diagnosis could not be regarded as a criterion measure, but rather as an indicator or predictor whose own validity would have to be determined.
Mention has already been made, in connection with other criterion categories, of certain types of ratings by school teachers, instructors in specialized courses, and job supervisors. To these can be added ratings by officers in military situations, ratings of students by school counselors, and ratings by co-workers, classmates, fellow club-members, and other groups of associates. The ratings discussed earlier represented merely a subsidiary technique for obtaining information regarding such criteria as academic achievement, performance in specialized training, or job success. We are now considering the use of ratings as the very core of the criterion measure. Under these circumstances, the ratings themselves define the criterion. Moreover, such ratings are not restricted to the evaluation of specific achievement, but involve a personal judgment by an observer regarding any of the variety of traits that psychological tests attempt to measure. Thus, the subjects in the validation sample might be rated on such characteristics as dominance, mechanical ingenuity, originality, leadership, or honesty.
Ratings have been employed in the validation of almost every type of test. They are particularly useful in providing criteria for personality
tests, since objective criteria are much more difficult to find in this area. This is especially true of distinctly social traits, in which ratings based on personal contact may constitute the most logically defensible criterion. Although ratings may be subject to many judgmental errors, when obtained under carefully controlled conditions they represent a valuable source of criterion data. Techniques for improving the accuracy of ratings and for reducing common types of errors will be considered in Chapter 20.
Finally, correlations between a new test and previously available tests are frequently cited as evidence of validity. When the new test is an abbreviated or simplified form of a currently available test, the latter can properly be regarded as a criterion measure. Thus, a paper-and-pencil test might be validated against a more elaborate and time-consuming performance test whose validity had previously been established. Or a group test might be validated against an individual test. The Stanford-Binet, for example, has repeatedly served as a criterion in validating group tests. In such a case, the new test may be regarded at best as a crude approximation of the earlier one. It should be noted that unless the new test represents a simpler or shorter substitute for the earlier test, the use of the latter as a criterion is indefensible.
SPECIFICITY OF CRITERIA. Criterion-related validity is most appropriate for local validation studies, in which the effectiveness of a test for a specific program is to be assessed. This is the approach followed, for example, when a given company wishes to evaluate a test for selecting applicants for one of its jobs or when a given college wishes to determine how well an academic aptitude test can predict the course performance of its students. Criterion-related validity can be best characterized as the practical validity of a test in a specified situation. This type of validation represents applied research, as distinguished from basic research, and as such it provides results that are less generalizable than the results of other procedures.
That criterion-related validity may be quite specific has been demonstrated repeatedly. Figure 15 gives examples of the wide variation in the correlations of a single type of test with criteria of job proficiency. The first graph shows the distribution of 72 correlations found between intelligence test scores and measures of the job proficiency of general clerks; the second graph summarizes in similar fashion 191 correlations between finger dexterity tests and the job proficiency of benchworkers. Although in both instances the correlations tend to cluster in a particular range of validity, the variation among individual studies is considerable. The validity coefficient may be high and positive in one study and negligible or even substantially negative in another.
Validity: Basic Concepts 147
Similar variation with regard to the prediction of course grades is illustrated in Figure 16. This figure shows the distribution of correlations obtained between grades in mathematics and scores on each of the subtests of the Differential Aptitude Tests. Thus, for the Numerical Ability test (NA), the largest number of validity coefficients among boys fell between .50 and .59; but the correlations obtained in different mathematics courses and in different schools ranged from .22 to .75. Equally wide differences were found with the other subtests and, it might be added, with grades in other subjects not included in Figure 16.
[Figure 15 shows two frequency distributions plotted over the range -1.00 to +1.00: 72 validity coefficients for general clerks on intelligence tests, and 191 coefficients for benchworkers on finger dexterity tests, each against job-proficiency criteria.]

FIG. 15. Examples of Variation in Validity Coefficients of Given Tests for Particular Jobs. (Adapted from Ghiselli, 1966, p. 29.)
Some of the variation in validity coefficients against job criteria reported in Figure 15 results from differences among the specific tests employed in different studies to measure intelligence or finger dexterity. In the results of both Figures 15 and 16, moreover, some variation is attributable to differences in the homogeneity and level of the groups tested. The range of validity coefficients found, however, is far wider than could be explained in these terms. Differences in the criteria themselves are undoubtedly a major reason for the variation observed among validity coefficients. Thus, the duties of office clerks or benchworkers may differ widely among companies or among departments in the same company. Similarly, courses in the same subject may differ in content, teaching method, instructor characteristics, bases for evaluating student achievement, and numerous other ways. Consequently, what appears to be the same criterion may represent very different combinations of traits in different situations.

Criteria may also vary over time in the same situation. For example, the validity coefficient of a test against job training criteria often differs from its validity against job performance criteria (Ghiselli, 1966). There is evidence that the traits required for successful performance of a given job, or even a single task, vary with the job experience of the individual (Fleishman & Fruchter, 1960). There is also evidence that job criteria themselves change over time, for such reasons as the changing nature of jobs, organizational shifts, individual advancement in rank, and other temporal conditions (MacKinney, 1967; Prien, 1966). It is well known, of course, that educational curricula and course content change over time. In other words, the criteria most commonly used in validating intelligence and aptitude tests, namely job performance and educational achievement, are dynamic rather than static. It follows that criterion-related validity is itself subject to temporal changes.
SYNTHETIC VALIDITY. Criteria not only differ across situations and over time, but they are also likely to be complex (see, e.g., Richards, Taylor, Price, & Jacobsen, 1965). Success on a job, in school, or in other activities of daily life depends not on one trait but on many traits. Hence, practical criteria are likely to be multifaceted. Several different indicators or measures of job proficiency or academic achievement could thus be used in validating a test. Since these measures may tap different traits or combinations of traits, it is not surprising to find that they yield different validity coefficients for any given test. When different criterion measures are obtained for the same individuals, their intercorrelations are often quite low. For instance, accident records or absenteeism may show virtually no relation to productivity or error data for the same job (Seashore, Indik, & Georgopoulos, 1960). These differences, of course, are reflected in the validity coefficients of any given test against different criterion measures. Thus, a test may fail to correlate significantly with supervisors' ratings of job proficiency and yet show appreciable validity in predicting who will resign and who will be promoted at a later date (Albright, Smith, & Glennon, 1959).
Because of criterion complexity, validating a test against a composite criterion of job proficiency, academic achievement, or other similar accomplishments may be of questionable value and is certainly of limited generality. If different subcriteria are relatively independent, a more effective procedure is to validate each test against that aspect of the criterion it is best designed to measure. An analysis of these more specific relationships lends meaning to the test scores in terms of the multiple dimensions of criterion behavior (Dunnette, 1963; Ebel, 1961; S. R. Wallace, 1965). For example, one test might prove to be a valid predictor of a clerk's perceptual speed and accuracy in handling detail work, another of his ability to spell correctly, and still another of his ability to resist distraction.
If, now, we return to the practical question of evaluating a test or combination of tests for effectiveness in predicting a complex criterion such as success on a given job, we are faced with the necessity of conducting a separate validation study in each local situation and repeating it at frequent intervals. This is admittedly a desirable procedure and one that is often recommended in test manuals. In many situations, however, it is not feasible to follow this procedure because of well-nigh insurmountable practical obstacles. Even if adequately trained personnel are available to carry out the necessary research, most criterion-related validity studies conducted in industry are likely to prove unsatisfactory for
FIG. 16. Graphic Summary of Validity Coefficients of the Differential Aptitude Tests (Forms S and T) for Course Grades in Mathematics. The accompanying numbers in each column indicate the number of coefficients in the range given at the left. (From Fifth Edition Manual, p. 82. Reproduced by permission. Copyright © 1975 by The Psychological Corporation, New York, N.Y. All rights reserved.)
at least three reasons. First, it is difficult to obtain dependable and sufficiently comprehensive criterion data. Second, the number of employees engaged in the same or closely similar jobs within a company is often too small for significant statistical results. Third, correlations will very probably be lowered by restriction of range through preselection, since only those persons actually hired can be followed up on the job.
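The effect of the third obstacle, restriction of range through preselection, can be illustrated with a small simulation. The data below are synthetic, and the selection rule (hiring the top half on the test) is an arbitrary choice for illustration:

```python
# Simulate how preselection lowers an observed validity coefficient.
# Test and criterion scores are generated with a true correlation of
# about .60; we then correlate them only within the "hired" group
# (top half on the test), as a follow-up study must.
import random
random.seed(42)

def corr(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

test = [random.gauss(0, 1) for _ in range(5000)]
# criterion = .6 * test + noise, so the full-range correlation is near .60
criterion = [0.6 * t + 0.8 * random.gauss(0, 1) for t in test]

cutoff = sorted(test)[len(test) // 2]          # hire the top half on the test
hired = [(t, c) for t, c in zip(test, criterion) if t >= cutoff]

r_full = corr(test, criterion)
r_hired = corr([t for t, _ in hired], [c for _, c in hired])
print(r_full > r_hired)   # restriction of range lowers the correlation
```

The hired group is more homogeneous on the test than the full applicant group, so the correlation computed within it understates the test's validity for the whole applicant pool.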
For all the reasons discussed above, personnel psychologists have shown increasing interest in a technique known as synthetic validity. First introduced by Lawshe (1952), the concept of synthetic validity has been defined by Balma (1959, p. 395) as "the inferring of validity in a specific situation from a systematic analysis of job elements, a determination of test validity for these elements, and a combination of elemental validities into a whole." Several procedures have been developed for gathering the needed empirical data and for combining these data to obtain an estimate of synthetic validity for a particular complex criterion (see, e.g., Guion, 1965; Lawshe & Balma, 1966, Ch. 14; McCormick, 1959; Primoff, 1959, 1975). Essentially, the process involves three steps: (1) detailed job analysis to identify the job elements and their relative weights; (2) analysis and empirical study of each test to determine the extent to which it measures proficiency in performing each of these job elements; and (3) finding the validity of each test for the given job synthetically from the weights of these elements in the job and in the test.

In a long-term research program conducted with U.S. Civil Service job
applicants, Primoff (1975) has developed the J-coefficient (for "job-coefficient") as an index of synthetic validity. Among the special features of this procedure are the listing of job elements expressed in terms of worker behavior and the rating of the relative importance of these elements in each job by supervisors and job incumbents. Correlations between test scores and self-ratings on job elements are found in total applicant samples (not subject to the preselection of employed workers). Various checking procedures are followed to ensure stability of correlations and weights derived from self-ratings, as well as adequacy of criterion coverage. For these purposes, data are obtained from different samples of applicant populations. The final estimate of the correlation between test and job performance is found from the correlation of each job element with the particular job and the weight of the same element in the given test.¹ There is evidence that the J-coefficient has proved
¹ The statistical procedures are essentially an adaptation of multiple regression equations, to be discussed in Chapter 7. For each job element, its correlation with the job is multiplied by its weight in the test, and these products are added across all appropriate job elements.
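The computation described in this footnote can be sketched numerically. The job-element names, their correlations with the job, and their weights in the test below are all invented for illustration; Primoff's full procedure involves additional checks not shown here:

```python
# Sketch of the J-coefficient computation: for each job element,
# multiply its correlation with the job by its weight in the test,
# then sum across elements. All values are hypothetical.

job_elements = {
    # element: (correlation of element with job, weight of element in test)
    "checking records":    (0.45, 0.50),
    "sorting mail":        (0.60, 0.30),
    "dealing with public": (0.30, 0.20),
}

j_coefficient = sum(r_job * w_test for r_job, w_test in job_elements.values())
print(round(j_coefficient, 3))   # 0.465 for these hypothetical values
```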
helpful in improving the employment opportunities of minority applicants and persons with little formal education, because of its concentration on job-relevant skills (Primoff, 1975).
A different application of synthetic validity, especially suitable for use in a small company with few employees in each type of job, is described by Guion (1965). The study was carried out in a company having 48 employees, each of whom was doing a job that was appreciably different from the jobs of the other employees. Detailed job analyses nevertheless revealed seven job elements common to many jobs. Each employee was rated on the job elements appropriate to his job; and these ratings were then checked against the employees' scores on each test in a trial battery. On the basis of these analyses, a separate battery could be "synthesized" for each job by combining the two best tests for each of the job elements demanded by that job. When the batteries thus assembled were applied to a subsequently hired group of 13 employees, the results showed considerable promise. Because of the small number of cases, these results are only suggestive. The study was conducted primarily to demonstrate a model for the utilization of synthetic validity.
The two examples of synthetic validity were cited only to illustrate the scope of possible applications of these techniques. For a description of the actual procedures followed, the reader is referred to the original sources. In summary, the concept of synthetic validity can be implemented in different ways to fit the practical exigencies of different situations. It offers a promising approach to the problem of complex and changing criteria; and it permits the assembling of test batteries to fit the requirements of specific jobs and the determination of test validity in many contexts where adequate criterion-related validation studies are impracticable.
The construct validity of a test is the extent to which the test may be said to measure a theoretical construct or trait. Examples of such constructs are intelligence, mechanical comprehension, verbal fluency, speed of walking, neuroticism, and anxiety. Focusing on a broader, more enduring, and more abstract kind of behavioral description than the previously discussed types of validity, construct validation requires the gradual accumulation of information from a variety of sources. Any data throwing light on the nature of the trait under consideration and the conditions affecting its development and manifestations are grist for this validity mill. Illustrations of specific techniques suitable for construct validation will be considered below.
DEVELOPMENTAL CHANGES. A major criterion employed in the validation
of a number of intelligence tests is age differentiation. Such tests as the Stanford-Binet and most preschool tests are checked against chronological age to determine whether the scores show a progressive increase with advancing age. Since abilities are expected to increase with age during childhood, it is argued that the test scores should likewise show such an increase, if the test is valid. The very concept of an age scale of intelligence, as initiated by Binet, is based on the assumption that "intelligence" increases with age, at least until maturity.

The criterion of age differentiation, of course, is inapplicable to any functions that do not exhibit clear-cut and consistent age changes. In the area of personality measurement, for example, it has found limited use. Moreover, it should be noted that, even when applicable, age differentiation is a necessary but not a sufficient condition for validity. Thus, if the test scores fail to improve with age, such a finding probably indicates that the test is not a valid measure of the abilities it was designed to sample. On the other hand, to prove that a test measures something that increases with age does not define the area covered by the test very precisely. A measure of height or weight would also show regular age increments, although it would obviously not be designated as an intelligence test.

A final point should be emphasized regarding the interpretation of the age criterion. A psychological test validated against such a criterion measures behavior characteristics that increase with age under the conditions existing in the type of environment in which the test was standardized. Because different cultures may stimulate and foster the development of dissimilar behavior characteristics, it cannot be assumed that the criterion of age differentiation is a universal one. Like all other criteria, it is circumscribed by the particular cultural setting in which it is derived.

Developmental analyses are also basic to the construct validation of
the Piagetian ordinal scales cited in Chapter 4. A fundamental assumption of such scales is the sequential patterning of development, such that the attainment of earlier stages in concept development is prerequisite to the acquisition of later conceptual skills. There is thus an intrinsic hierarchy in the content of these scales. The construct validation of ordinal scales should therefore include empirical data on the sequential invariance of the successive steps. This involves checking the performance of children at different levels in the development of any tested concept, such as conservation or object permanence. Do children who demonstrate mastery of the concept at a given level also exhibit mastery at the lower levels? Insofar as criterion-referenced tests are also frequently designed according to a hierarchical pattern of learned skills, they, too, can utilize empirical evidence of hierarchical invariance in their validation.
CORRELATIONS WITH OTHER TESTS. Correlations between a new test and similar earlier tests are sometimes cited as evidence that the new test measures approximately the same general area of behavior as other tests designated by the same name, such as "intelligence tests" or "mechanical aptitude tests." Unlike the correlations found in criterion-related validity, these correlations should be moderately high, but not too high. If the new test correlates too highly with an already available test, without such added advantages as brevity or ease of administration, then the new test represents needless duplication.

Correlations with other tests are employed in still another way to demonstrate that the new test is relatively free from the influence of certain irrelevant factors. For example, a special aptitude test or a personality test should have a negligible correlation with tests of general intelligence or scholastic aptitude. Similarly, reading comprehension should not appreciably affect performance on such tests. Thus, correlations with tests of general intelligence, reading, or verbal comprehension are sometimes reported as indirect or negative evidence of validity. In these cases, high correlations would make the test suspect. Low correlations, however, would not in themselves insure validity. It will be noted that this use of correlations with other tests is similar to one of the supplementary techniques described under content validity.
FACTOR ANALYSIS. Of particular relevance to construct validity is factor analysis, a statistical procedure for the identification of psychological traits. Essentially, factor analysis is a refined technique for analyzing the interrelationships of behavior data. For example, if 20 tests have been given to 300 persons, the first step is to compute the correlations of each test with every other. An inspection of the resulting table of 190 correlations may itself reveal certain clusters among the tests, suggesting the location of common traits. Thus, if such tests as vocabulary, analogies, opposites, and sentence completion have high correlations with each other and low correlations with all other tests, we could tentatively infer the presence of a verbal comprehension factor. Because such an inspectional analysis of a correlation table is difficult and uncertain, however, more precise statistical techniques have been developed to locate the common factors required to account for the obtained correlations. These techniques of factor analysis will be examined further in Chapter 13, together with multiple aptitude tests developed by means of factor analysis.
In the process of factor analysis, the number of variables or categories in terms of which each individual's performance can be described is reduced from the number of original tests to a relatively small number of factors, or common traits. In the example cited above, five or six factors might suffice to account for the intercorrelations among the 20 tests. Each individual might thus be described in terms of his scores in the five or six factors, rather than in terms of the original 20 scores. A major purpose of factor analysis is to simplify the description of behavior by reducing the number of categories from an initial multiplicity of test variables to a few common traits, or factors.

After the factors have been identified, they can be utilized in describing the factorial composition of a test. Each test can thus be characterized in terms of the major factors determining its scores, together with the weight or loading of each factor and the correlation of the test with each factor. Such a correlation is known as the factorial validity of the test. Thus, if the verbal comprehension factor has a weight of .66 in a vocabulary test, the factorial validity of this vocabulary test as a measure of the trait of verbal comprehension is .66. It should be noted that factorial validity is essentially the correlation of the test with whatever is common to a group of tests or other indices of behavior. The set of variables analyzed can, of course, include both test and nontest data. Ratings and other criterion measures can thus be utilized, along with other tests, to explore the factorial validity of a particular test and to define the common traits it measures.
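As a rough numerical sketch of factorial validity (not a full factor-analysis routine), the loadings of several tests on a single common factor can be approximated from the first principal component of their correlation matrix. The simulated "tests" below and their true loadings are invented for illustration:

```python
# Sketch: estimate loadings on one common factor from the correlation
# matrix of a few tests. Simulated data: three "verbal" tests share a
# common factor, one "speed" test does not.
import numpy as np
rng = np.random.default_rng(0)

n = 2000
factor = rng.normal(size=n)                    # common verbal factor
scores = np.column_stack([
    0.8 * factor + 0.6 * rng.normal(size=n),   # vocabulary
    0.7 * factor + 0.7 * rng.normal(size=n),   # analogies
    0.6 * factor + 0.8 * rng.normal(size=n),   # sentence completion
    rng.normal(size=n),                        # perceptual speed (unrelated)
])

R = np.corrcoef(scores, rowvar=False)          # 4 x 4 correlation matrix
eigvals, eigvecs = np.linalg.eigh(R)           # eigh returns ascending order
loadings = eigvecs[:, -1] * np.sqrt(eigvals[-1])  # largest component
loadings *= np.sign(loadings.sum())            # fix the arbitrary sign

# The three verbal tests load substantially on the common factor;
# the speed test does not.
print(np.round(loadings, 2))
```

Each loading plays the role of the factorial validity discussed above: the correlation of that test with the common factor.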
Another application of the criterion of internal consistency involves the correlation of subtest scores with total score. Many intelligence tests, for instance, consist of separately administered subtests (such as vocabulary, arithmetic, picture completion, etc.) whose scores are combined in finding the total test score. In the construction of such tests, the scores on each subtest are often correlated with total score, and any subtest whose correlation with total score is too low is eliminated. The correlations of the remaining subtests with total score are then reported as evidence of the internal consistency of the entire instrument.

It is apparent that internal consistency correlations, whether based on items or subtests, are essentially measures of homogeneity. Because it helps to characterize the behavior domain or trait sampled by the test, the degree of homogeneity of a test has some relevance to its construct validity. Nevertheless, the contribution of internal consistency data to test validation is very limited. In the absence of data external to the test itself, little can be learned about what a test measures.
INTERNAL CONSISTENCY. In the published descriptions of certain tests, especially in the area of personality, the statement is made that the test has been validated by the method of internal consistency. The essential characteristic of this method is that the criterion is none other than the total score on the test itself. Sometimes an adaptation of the contrasted group method is used, extreme groups being selected on the basis of the total test score. The performance of the upper criterion group on each test item is then compared with that of the lower criterion group. Items that fail to show a significantly greater proportion of "passes" in the upper than in the lower criterion group are considered invalid, and are either eliminated or revised. Correlational procedures may also be employed for this purpose. For example, the biserial correlation between "pass-fail" on each item and total test score can be computed. Only those items yielding significant item-test correlations would be retained. A test whose items were selected by this method can be said to show internal consistency, since each item differentiates in the same direction as the entire test.
EFFECT OF EXPERIMENTAL VARIABLES ON TEST SCORES. A further source of data for construct validation is provided by experiments on the effect of selected variables on test scores. In checking the validity of a criterion-referenced test for use in an individualized instructional program, for example, one approach is through a comparison of pretest and posttest scores. The rationale of such a test calls for low scores on the pretest, administered before the relevant instruction, and high scores on the posttest. This relationship can also be checked for individual items in the test (Popham, 1971). Ideally, the largest proportion of examinees should fail an item on the pretest and pass it on the posttest. Items that are commonly failed on both tests are too difficult, and those passed on both tests too easy, for the purposes of such a test. If a sizeable proportion of examinees pass an item on the pretest and fail it on the posttest, there is obviously something wrong with the item, or the instruction, or both.
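This screening logic can be expressed as a simple classification rule. The proportions and cutoff values below are hypothetical choices for illustration, not standards from the text:

```python
# Classify items of a criterion-referenced test by their pretest and
# posttest pass rates, following the rationale described above.

def classify(pre_pass, post_pass, reversal):
    """pre_pass/post_pass: proportion passing the item on each test;
    reversal: proportion who passed the pretest but failed the posttest."""
    if reversal > 0.10:
        return "faulty item or instruction"
    if pre_pass > 0.80 and post_pass > 0.80:
        return "too easy"
    if pre_pass < 0.20 and post_pass < 0.20:
        return "too difficult"
    if pre_pass < 0.30 and post_pass > 0.70:
        return "functioning as intended"
    return "review"

items = {
    "item 1": (0.10, 0.85, 0.02),   # low pretest, high posttest: ideal
    "item 2": (0.90, 0.95, 0.01),   # passed by nearly all on both tests
    "item 3": (0.05, 0.10, 0.01),   # failed by nearly all on both tests
    "item 4": (0.40, 0.45, 0.25),   # many pass-then-fail reversals
}
for name, stats in items.items():
    print(name, "->", classify(*stats))
```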
A test designed to measure anxiety-proneness can be administered to subjects who are subsequently put through a situation designed to arouse anxiety, such as taking an examination under distracting and stressful conditions. The initial anxiety test scores can then be correlated with physiological and other indices of anxiety expression during and after the examination. A different hypothesis regarding an anxiety test could be evaluated by administering the test before and after an anxiety-arousing experience and seeing whether test scores rise significantly on the retest. Positive findings from such an experiment would indicate that the test scores reflect current anxiety level. In a similar way, experiments can be designed to test any other hypothesis regarding the trait measured by a given test.
TABLE 12
A Hypothetical Multitrait-Multimethod Matrix
(From Campbell & Fiske, 1959, p. 82.)
CONVERGENT AND DISCRIMINANT VALIDATION. In a thoughtful analysis of construct validation, D. T. Campbell (1960) points out that in order to demonstrate construct validity we must show not only that a test correlates highly with other variables with which it should theoretically correlate, but also that it does not correlate significantly with variables from which it should differ. In an earlier article, Campbell and Fiske (1959) described the former process as convergent validation and the latter as discriminant validation. Correlation of a mechanical aptitude test with subsequent grades in a shop course would be an example of convergent validation. For the same test, discriminant validity would be illustrated by a low and insignificant correlation with scores on a reading comprehension test, since reading ability is an irrelevant variable in a test designed to measure mechanical aptitude.

It will be recalled that the requirement of low correlation with irrelevant variables was discussed in connection with supplementary and precautionary procedures followed in content validation. Discriminant validation is also especially relevant to the validation of personality tests, in which irrelevant variables may affect scores in a variety of ways.
Campbell and Fiske (1959) proposed a systematic experimental design for the dual approach of convergent and discriminant validation, which they called the multitrait-multimethod matrix. Essentially, this procedure requires the assessment of two or more traits by two or more methods. A hypothetical example provided by Campbell and Fiske will serve to illustrate the procedure. Table 12 shows all possible correlations among the scores obtained when three traits are each measured by three methods. The three traits could represent three personality characteristics, such as (A) dominance, (B) sociability, and (C) achievement motivation. The three methods could be (1) a self-report inventory, (2) a projective technique, and (3) associates' ratings. Thus, A1 would indicate dominance scores on the self-report inventory, A2 dominance scores on the projective test, C3 associates' ratings on achievement motivation, and so forth.
[Table 12 presents the hypothetical multitrait-multimethod matrix of correlations among three traits (A, B, C), each measured by three methods (1, 2, 3); the numerical entries are not reproduced here.]

Note: Letters A, B, C refer to traits, subscripts 1, 2, 3 to methods. Validity coefficients (monotrait-heteromethod) are the three diagonal sets of boldface numbers; reliability coefficients (monotrait-monomethod) are the numbers in parentheses along the principal diagonal. Solid triangles enclose heterotrait-monomethod correlations; broken triangles enclose heterotrait-heteromethod correlations.

The hypothetical correlations given in Table 12 include reliability coefficients (in parentheses, along the principal diagonal) and validity coefficients (in boldface, along three shorter diagonals). In these validity coefficients, the scores obtained for the same trait by different methods are correlated; each measure is thus being checked against other, independent measures of the same trait, as in the familiar validation procedure. The table also includes correlations between different traits measured by the same method (in solid triangles) and correlations between different traits measured by different methods (in broken triangles). For satisfactory construct validity, the validity coefficients should obviously be higher than the correlations between different traits measured by different methods; they should also be higher than the correlations between different traits measured by the same method. For example, the correlation between dominance scores from a self-report inventory and dominance scores from a projective test should be higher than the correlation between dominance and sociability scores from a self-report inventory. If the latter correlation, representing common method variance, were high, it might indicate, for example, that a person's scores on this inventory are unduly affected by some irrelevant common factor, such as ability to understand the questions or desire to make oneself appear in a favorable light on all traits.
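The comparisons required of a multitrait-multimethod matrix can be sketched programmatically. The correlations below are invented, keyed by (trait, method) pairs as in Table 12, and the example uses two methods rather than three for brevity:

```python
# Check the Campbell-Fiske requirements on a small set of hypothetical
# correlations keyed by ((trait, method), (trait, method)) pairs.
from itertools import combinations

traits, methods = ["A", "B", "C"], [1, 2]
measures = [(t, m) for m in methods for t in traits]

corr = {}
for (t1, m1), (t2, m2) in combinations(measures, 2):
    if t1 == t2:            # monotrait-heteromethod: a validity coefficient
        corr[(t1, m1), (t2, m2)] = 0.55
    elif m1 == m2:          # heterotrait-monomethod (shared method variance)
        corr[(t1, m1), (t2, m2)] = 0.35
    else:                   # heterotrait-heteromethod
        corr[(t1, m1), (t2, m2)] = 0.20

validities = [r for ((t1, _), (t2, _)), r in corr.items() if t1 == t2]
hetero = [r for ((t1, _), (t2, _)), r in corr.items() if t1 != t2]

# Every validity coefficient should exceed every correlation between
# different traits, whether measured by the same or different methods.
print(min(validities) > max(hetero))   # prints True for these values
```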
Fiske (1973) has added still another set of correlations that should be checked, especially in the construct validation of personality tests. These correlations involve the same trait measured by the same method, but with a different test. For example, two investigators may each prepare a self-report inventory designed to assess endurance. Yet the endurance scores obtained with the two inventories may show quite different patterns of correlations with measures of other personality traits.
[Table 13, not reproduced here, lists the types of validity with an illustrative question for each.]
at a higher educational level, as when selecting high school students for college admission, it needs to be evaluated against the criterion of subsequent college performance rather than in terms of its content validity.
The examples given in Table 13 focus on the differences among the various types of validation procedures. Further consideration of these procedures, however, shows that content, criterion-related, and construct validity do not correspond to distinct or logically coordinate categories.
On the contrary, construct validity is a comprehensive concept, which
includes the other types. All the specific techniques for establishing con-
tent and criterion-related validity, discussed in earlier sections of this
chapter, could have been listed again under construct validity. Comparing
the test performance of contrasted groups, such as neurotics and normals,
is one way of checking the construct validity of a test designed to meas-
ure emotional adjustment, anxiety, or other postulated traits. Comparing
the test scores of institutionalized mental retardates with those of normal
schoolchildren is one way to investigate the construct validity of an
intelligence test. The correlations of a mechanical aptitude test with per-
formance in shop courses and in a wide variety of jobs contribute to our
understanding of the construct measured by the test. Validity against
various practical criteria is commonly reported in test manuals to aid the
potential user in understanding what a test measures. Although he may
not be directly concerned with the prediction of any of the specific cri-
teria employed, by examining such criteria the test user is able to build
up a concept of the behavior domain sampled by the test.
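The contrasted-groups procedure just described lends itself to a simple numerical sketch. The scores, group sizes, and effect-size summary below are hypothetical illustrations, not data from the text:

```python
# Sketch of a contrasted-groups check of construct validity (hypothetical data):
# a test of emotional adjustment should separate a normal group from a
# clinically anxious group if it measures the postulated construct.
from statistics import mean, stdev

normals = [52, 48, 55, 60, 47, 58, 50, 54]   # adjustment scores, normal group
patients = [38, 41, 35, 44, 39, 36, 42, 40]  # adjustment scores, clinical group

diff = mean(normals) - mean(patients)
# pooled SD for a rough standardized effect size (Cohen's d)
pooled = ((stdev(normals) ** 2 + stdev(patients) ** 2) / 2) ** 0.5
d = diff / pooled

print(f"mean difference = {diff:.1f}, effect size d = {d:.1f}")
# A large difference in the predicted direction supports the construct
# interpretation; its absence would count against it.
```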
Content validity likewise enters into both the construction and the
subsequent evaluation of all tests. In assembling items for any new test,
the test constructor is guided by hypotheses regarding the relations be-
tween the type of content he chooses and the behavior he wishes to
measure. All the techniques of criterion-related validation, as well as the
other techniques discussed under construct validation, represent ways of
testing such hypotheses. As for the test user, he too relies in part on
content validity in evaluating any test. For example, he may check the
vocabulary in an emotional adjustment inventory to determine whether
some of the words are too difficult for the persons he plans to test; he
may conclude that. the scores on a particular test depend too much on
speed for his purposes; or he may notice that an intelligence test de-
veloped twenty years ago contains many obsolescent items unsuitable for
use today. All these observations about content are relevant to the con-
struct validity of a test. In fact, there is no information provided by any
validation procedure that is not relevant to construct validity.
The term construct validity was officially introduced into the psy-
chometrist's lexicon in 1954 in the Technical Recommendations for Psy-
chological Tests and Diagnostic Techniques, which constituted the first
edition of the current APA test Standards (1974). Although the validation
Principles of Psychological Testing
conditions, it cannot be concluded that both inventories measure the same
personality construct of endurance.

It might be noted that within the framework of the multitrait-multi-
method matrix, reliability represents agreement between two measures of
the same trait obtained through maximally similar methods, such as
parallel forms of the same test; validity represents agreement between
two measures of the same trait obtained by maximally different methods,
such as test scores and supervisor's ratings. Since similarity and difference
of methods are matters of degree, theoretically reliability and validity can
be regarded as falling along a single continuum. Ordinarily, however, the
techniques actually employed to measure reliability and validity cor-
respond to easily identifiable regions of this continuum.
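The multitrait-multimethod comparisons described above can be sketched numerically. In this minimal Python illustration (all scores hypothetical), the monotrait-heteromethod correlation, which should be high, is compared with a heterotrait-monomethod correlation, which should be lower:

```python
# Minimal multitrait-multimethod check (hypothetical data).
# Convergent validity: the same trait measured by different methods should
# correlate higher than different traits measured by the same method.
from math import sqrt

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical scores for six persons.
dom_inventory = [12, 15, 9, 18, 11, 16]   # dominance, self-report inventory
dom_projective = [10, 14, 8, 17, 12, 15]  # dominance, projective test
soc_inventory = [7, 9, 14, 8, 13, 10]     # sociability, self-report inventory

validity = pearson(dom_inventory, dom_projective)   # monotrait-heteromethod
method_var = pearson(dom_inventory, soc_inventory)  # heterotrait-monomethod

print(f"same trait, different methods: {validity:.2f}")
print(f"different traits, same method: {method_var:.2f}")
# For the matrix to support construct validity, the first should exceed the second.
```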
We have considered several ways of asking, "How valid is this test?"
To point up the distinctive features of the different types of validity, let
us apply each in turn to a test consisting of 50 assorted arithmetic prob-
lems. Four ways in which this test might be employed, together with the
type of validation procedure appropriate to each, are illustrated in Table
13. This example highlights the fact that the choice of validation pro-
cedure depends on the use to be made of the test scores. The same test,
when employed for different purposes, should be validated in different
ways. If an achievement test is used to predict subsequent performance
TABLE 13
Validation of a Single Arithmetic Test for Different Purposes

Testing purpose: Achievement test in elementary school arithmetic
  Illustrative question: How much has Dick learned in the past?
  Type of validity: Content

Testing purpose: Aptitude test to predict performance in high school mathematics
  Illustrative question: How well will Jim learn in the future?
  Type of validity: Criterion-related: predictive

Testing purpose: Technique for diagnosing learning disabilities
  Illustrative question: Does Bill's performance show specific disabilities?
  Type of validity: Criterion-related: concurrent

Testing purpose: Measure of logical reasoning
  Illustrative question: How can we describe Henry's psychological functioning?
  Type of validity: Construct
procedures subsumed under construct validity were not new at the time,
the discussions of construct validation that followed served to make the
implications of these procedures more explicit and to provide a systematic
rationale for their use. Construct validation has focused attention on the
role of psychological theory in test construction and on the need to
formulate hypotheses that can be proved or disproved in the validation
process. It is particularly appropriate in the evaluation of tests for use
in research.
In practical contexts, construct validation is suitable for investigating
the validity of the criterion measures used in traditional criterion-related
test validation (see, e.g., James, 1973). Through an analysis of the cor-
relations of different criterion measures with each other and with other
relevant variables, and through factorial analyses of such data, one can
learn more about the meaning of a particular criterion. In some instances,
the results of such a study may lead to modification or replacement of the
criterion chosen to validate a test. Under any circumstances, the results
will enrich the interpretation of the test validation study.
Another practical application of construct validation is in the evalu-
ation of tests in situations that do not permit acceptable criterion-related
validation studies, as in the local validation of some personnel tests for
industrial use. The difficulties encountered in these situations were dis-
cussed earlier in this chapter, in connection with synthetic validity. Con-
struct validation offers another alternative approach that could be fol-
lowed in evaluating the appropriateness of published tests for a particular
job. Like synthetic validation, this approach requires a systematic job
analysis, followed by a description of worker qualifications expressed in
terms of relevant behavioral constructs. If, now, the test has been sub-
jected to sufficient research prior to publication, the data cited in the
manual should permit a specification of the principal constructs measured
by the test. This information could be used directly in assessing the
relevance of the test to the required job functions, if the correspondence
of constructs is clear enough; or it could serve as a basis for computing
a J-coefficient or some other quantitative index of synthetic validity.
Construct validation has also stimulated the search for novel ways of
gathering validity data. Although the principal techniques employed in
investigating construct validity have long been familiar, the field of
operation has been expanded to admit a wider variety of procedures.
This very multiplicity of data-gathering techniques, however, presents
certain hazards. It is possible for a test constructor to try a large number
of different validation procedures, a few of which will yield positive re-
sults by chance. If these confirmatory results were then to be reported
without mention of all the validity probes that yielded negative results, a
very misleading impression about the validity of a test could be created.
Another possible danger in the application of construct validation is that
it may open the way for subjective, unverified assertions about test
validity. Since construct validity is such a broad and loosely defined con-
cept, it has been widely misunderstood. Some textbook writers and test
constructors seem to perceive it as content validity expressed in terms of
psychological trait names. Hence, they present as construct validity purely
subjective accounts of what they believe (or hope) the test measures.

A further source of possible confusion arises from a statement that
construct validation "is involved whenever a test is to be interpreted as
a measure of some attribute or quality which is not 'operationally de-
fined'" (Cronbach & Meehl, 1955, p. 282). Appearing in the first detailed
published analysis of the concept of construct validity, this statement was
often incorrectly accepted as justifying a claim for construct validity in
the absence of data. That the authors of the statement did not intend
such an interpretation is illustrated by their own insistence, in the same
article, that "unless the network makes contact with observations . . .
construct validation cannot be claimed" (p. 291). In the same connection,
they criticize tests for which "a finespun network of rationalizations has
been offered as if it were validation" (p. 291). Actually, the theoretical
construct, trait, or behavior domain measured by a particular test can
be adequately defined only in the light of data gathered in the process of
validating that test. Such a definition would take into account the vari-
ables with which the test correlated significantly, as well as the conditions
found to affect its scores and the groups that differ significantly in such
scores. These procedures are entirely in accord with the positive contri-
butions made by the concept of construct validity. It is only through
the empirical investigation of the relationships of test scores to other
external data that we can discover what a test measures.
CHAPTER 7

Validity: Measurement and Interpretation
MEASUREMENT OF RELATIONSHIP. A validity coefficient is a correlation
between test score and criterion measure. Because it provides a single
numerical index of test validity, it is commonly used in test manuals to
report the validity of a test against each criterion for which data are
available. The data used in computing any validity coefficient can also
be expressed in the form of an expectancy table or expectancy chart,
illustrated in Chapter 4. In fact, such tables and charts provide a con-
venient way to show what a validity coefficient means for the person
tested. It will be recalled that expectancy charts give the probability that
an individual who obtains a certain score on the test will attain a speci-
fied level of criterion performance. For example, with Table 6 (Ch. 4,
p. 101), if we know a student's score on the DAT Verbal Reasoning test,
"",e can look up the chances that he will earn a particular grade in a
hIgh school course. The same data yield a validity coefficient of .66
When both test and criterion variables are continuous, as in this example,
the familiar Pearson Product-Moment Correlation Coefficient is appli-
cable. Other types of correlation coefficients can be computed when the
data are expressed in different forms, as when a two-fold pass-fail cri-
terion is employed (e.g., Fig. 7, Ch. 4). The specific procedures for
computing these different kinds of correlations can be found in any
standard statistics text.
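An expectancy table of the kind just mentioned can be built mechanically from paired scores and outcomes. A minimal sketch with hypothetical data (the score bands and sample are assumptions, not the DAT data cited above):

```python
# Sketch of an expectancy table: probability of criterion "success"
# within each test-score interval (hypothetical data).

def expectancy_table(scores, successes, bands):
    """bands: list of (low, high) score intervals, inclusive of low."""
    table = {}
    for low, high in bands:
        in_band = [s for sc, s in zip(scores, successes) if low <= sc < high]
        table[(low, high)] = sum(in_band) / len(in_band) if in_band else None
    return table

scores = [22, 35, 41, 48, 55, 61, 67, 72, 78, 83, 88, 93]
passed = [0,  0,  0,  1,  0,  1,  1,  1,  1,  1,  1,  1]  # 1 = met criterion

bands = [(0, 40), (40, 60), (60, 80), (80, 101)]
for (low, high), p in expectancy_table(scores, passed, bands).items():
    print(f"scores {low}-{high - 1}: chance of success = {p:.0%}")
```

As in a published expectancy chart, the probabilities rise with score level; with real data each band would rest on far more cases than this toy sample.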
CHAPTER 6 was concerned with different concepts of validity and
their appropriateness for various testing functions; this chapter
deals with quantitative expressions of validity and their interpre-
tation. The test user is concerned with validity at either or both of two
stages. First, when considering the suitability of a test for his purposes,
he examines available validity data reported in the test manual or other
published sources. Through such information, he arrives at a tentative
concept of what psychological functions the test actually measures, and
he judges the relevance of such functions to his proposed use of the test.
In effect, when a test user relies on published validation data, he IS dea.l-
ing with construct validity, regardless of the specific procedures used in
gathering the data. As we have seen in Chapter 6, the criteria employed
in published studies cannot be assumed to be identical with those the
test user wants to predict. Jobs bearing the same title in two different
companies are rarely identical. Two courses in freshman English taught
in different colleges may be quite dissimilar.
Because of the specificity of each criterion, test users are usually ad-
vised to check the validity of any chosen test against local criteria when-
ever possible. Although published data may strongly suggest that a given
test should have high validity in a particular situation, direct corrobora-
tion is always desirable. The determination of validity against specific
local criteria represents the second stage in the test user's evaluation of
validity. The techniques to be discussed in this chapter are especially
relevant to the analysis of validity data obtained by the test user himself.
Most of them are also useful, however, in understanding and interpreting
the validity data reported in test manuals.
CONDITIONS AFFECTING VALIDITY COEFFICIENTS. As in the case of reli-
ability, it is essential to specify the nature of the group on which a
validity coefficient is found. The same test may measure different func-
tions when given to individuals who differ in age, sex, educational level,
occupation, or any other relevant characteristic. Persons with different
experiential backgrounds, for example, may utilize different work meth-
ods to solve the same test problem. Consequently, a test could have high
validity in predicting a particular criterion in one population, and little
or no validity in another. Or it might be a valid measure of different
functions in the two populations. Thus, unless the validation sample is
representative of the population on which the test is to be used, validity
should be redetermined on a more appropriate sample.
The question of sample heterogeneity is relevant to the measurement
of validity, as it is to the measurement of reliability, since both charac-
teristics are commonly reported in terms of correlation coefficients. It
will be recalled that, other things being equal, the wider the range of
scores, the higher will be the correlation. This fact should be kept in
mind when interpreting the validity coefficients given in test manuals.
A special difficulty encountered in many validation samples arises from
preselection. For example, a new test that is being validated for job selec-
tion may be administered to a group of newly hired employees on whom
criterion measures of job performance will eventually be available. It is
likely, however, that such employees represent a superior selection of all
those who applied for the job. Hence, the range of such a group in both
test scores and criterion measures will be curtailed at the lower end of the
distribution. The effect of such preselection will therefore be to lower the
validity coefficient. In the subsequent use of the test, when it is admin-
istered to all applicants for selection purposes, the validity can be ex-
pected to be somewhat higher.
" Validity coefficients may also change over time because of changing
.'selection standards. An example is provided by a comparison of validity
,coefficients compll.ted over a 3D-year interval with Yale students (Burn-
"ham, 1965). Correlations were found between a predictive index based
, on College Entrance Examination Board tests and high school records,
f onthe one hand, and average freshman grades, on the other. This correla-
tion dropped from .11 to .52 over the 30 years. An examination of the
r' bivariate distributions dearly reveals the reason for this drop. Because of
~higher admissibn standards, the later class was more homogeneous than
.:the earlier class in both predictor and criterion performance. Conse-
quently, the correlation was lower in the later group, although the ac-
t curacy with whkh individuals' grades were predicted showed little
ch~nge. In other words, the observed drop in correlation did not indicate
. that the predictors were less va-lid than they had been 30 years earlier.
Had the difference$ in group homogeneity been ignored, it might have
" been 'Wrongly concluded that this was the case.
For the proper interpretation of a validity coefficient, attention should
also be given to the form of the relationship between test and criterion.
The computation of a Pearson correlation coefficient assumes that the re-
lationship is linear and uniform throughout the range. There is evidence
that in certain situations, however, these conditions may not be met
(Fisher, 1959; Kahneman & Ghiselli, 1962). Thus, a particular job may
require a minimum level of reading comprehension, to enable employees
to read instruction manuals, labels, and the like. Once this minimum is
exceeded, however, further increments in reading ability may be un-
related to degree of job success. This would be an example of a nonlinear
relation between test and job performance. An examination of the bivari-
ate distribution or scatter diagram obtained by plotting reading compre-
hension scores against criterion measures would show a rise in job per-
formance up to the minimal required reading ability and a leveling off
beyond that point. Hence, the entries would cluster around a curve rather
than a straight line.

In other situations, the line of best fit may be a straight line, but the
individual entries may deviate farther around this line at the upper than
at the lower end of the scale. Suppose that performance on a scholastic
aptitude test is a necessary but not a sufficient condition for successful
achievement in a course. That is, the low-scoring students will perform
poorly in the course; but among the high-scoring students, some will per-
form well in the course and others will perform poorly because of low
motivation. In this situation, there will be wider variability of criterion
performance among the high-scoring than among the low-scoring stu-
dents. This condition in a bivariate distribution is known as hetero-
scedasticity. The Pearson correlation assumes homoscedasticity, or equal
variability throughout the range of the bivariate distribution. In the
present example, the bivariate distribution would be fan-shaped, wide
at the upper end and narrow at the lower end. An examination of the
bivariate distribution itself will usually give a good indication of the
nature of the relationship between test and criterion. Expectancy tables
and expectancy charts also correctly reveal the relative effectiveness of
the test at different levels.
MAGNITUDE OF A VALIDITY COEFFICIENT. How high should a validity
coefficient be? No general answer to this question is possible, since the
interpretation of a validity coefficient must take into account a number
of concomitant circumstances. The obtained correlation, of course, should
be high enough to be statistically significant at some acceptable level,
such as the .01 or .05 levels discussed in Chapter 5. In other words, before
drawing any conclusions about the validity of a test, we should be rea-
sonably certain that the obtained validity coefficient could not have arisen
through chance fluctuations of sampling from a true correlation of zero.

Having established a significant correlation between test scores and
criterion, however, we need to evaluate the size of the correlation in the
light of the uses to be made of the test. If we wish to predict an indi-
vidual's exact criterion score, such as the grade-point average a student
will receive in college, the validity coefficient may be interpreted in terms
of the standard error of estimate, which is analogous to the error of
measurement discussed in connection with reliability. It will be recalled
that the error of measurement indicates the margin of error to be ex-
pected in an individual's score as a result of the unreliability of the test.
Similarly, the error of estimate shows the margin of error to be expected
in the individual's predicted criterion score, as a result of the imperfect
validity of the test.
The error of estimate is found by the following formula:

σ_est = σ_y √(1 − r²_xy)
in which r²_xy is the square of the validity coefficient and σ_y is the standard
deviation of the criterion scores. It will be noted that if the validity were
perfect (r_xy = 1.00), the error of estimate would be zero. On the other
hand, with a test having zero validity, the error of estimate is as large as
the standard deviation of the criterion distribution (σ_est = σ_y √(1 − 0) = σ_y).
Under these conditions, the prediction is no better than a guess; and
the range of prediction error is as wide as the entire distribution of
criterion scores. Between these two extremes are to be found the errors
of estimate corresponding to tests of varying validity.
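The formula and its two limiting cases can be verified directly; a short sketch in Python (the criterion standard deviation of 10 is an arbitrary illustration):

```python
# Standard error of estimate: sigma_est = sigma_y * sqrt(1 - r_xy**2).
from math import sqrt

def error_of_estimate(r_xy, sigma_y):
    return sigma_y * sqrt(1 - r_xy ** 2)

sigma_y = 10.0  # SD of criterion scores (illustrative value)
print(error_of_estimate(1.00, sigma_y))  # perfect validity: error is zero
print(error_of_estimate(0.00, sigma_y))  # zero validity: error equals sigma_y
print(error_of_estimate(0.80, sigma_y))  # .80 validity: error is 60% of sigma_y
```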
Reference to the formula for σ_est will show that the term √(1 − r²_xy)
serves to indicate the size of the error relative to the error that would
result from a mere guess, i.e., with zero validity. In other words, if
√(1 − r²_xy) is equal to 1.00, the error of estimate is as large as it would be
if we were to guess the subject's score. The predictive improvement at-
tributable to the use of the test would thus be nil. If the validity co-
efficient is .80, then √(1 − r²_xy) is equal to .60, and the error is 60 percent
as large as it would be by chance. To put it differently, the use of such a
test enables us to predict the individual's criterion performance with a
margin of error that is 40 percent smaller than it would be if we were to
guess.

It would thus appear that even with a validity of .80, which is unusually
high, the error of predicted scores is considerable. If the primary function
of psychological tests were to predict each individual's exact position in
the criterion distribution, the outlook would be quite discouraging. When
examined in the light of the error of estimate, most tests do not appear
very efficient. In most testing situations, however, it is not necessary to
predict the specific criterion performance of individual cases, but rather
to determine which individuals will exceed a certain minimum standard
of performance, or cutoff point, in the criterion. What are the chances
that Mary Greene will graduate from medical school, that Tom Higgins
will pass a course in calculus, or that Beverly Bruce will succeed as an
astronaut? Which applicants are likely to be satisfactory clerks, salesmen,
or machine operators? Such information is useful not only for group
selection but also for individual career planning. For example, it is ad-
vantageous for a student to know that he has a good chance of passing
all courses in law school, even if we are unable to estimate with certainty
whether his grade average will be 74 or 81.

A test may appreciably improve predictive efficiency if it shows any
significant correlation with the criterion, however low. Under certain cir-
cumstances, even validities as low as .20 or .30 may justify inclusion of
the test in a selection program. For many testing purposes, evaluation of
tests in terms of the error of estimate is unrealistically stringent. Consid-
eration must be given to other ways of evaluating the contribution of a
test, which take into account the types of decisions to be made from the
scores. Some of these procedurcs will be illustrated in the following sec-
tion.
BASIC APPROACH. Let us suppose that 100 applicants have been given
an aptitude test and followed up until each could be evaluated for suc-
cess on a certain job. Figure 17 shows the bivariate distribution of test
scores and measures of job success for the 100 subjects. The correlation
between these two variables is slightly below .70. The minimum accept-
able job performance, or criterion cutoff point, is indicated in the diagram
by a heavy horizontal line. The 40 cases falling below this line would
represent job failures; the 60 above the line, job successes. If all 100 appli-
cants are hired, therefore, 60 percent will succeed on the job. Similarly,
if a smaller number were hired at random, without reference to test
scores, the proportion of successes would probably be close to 60 percent.
Suppose, however, that the test scores are used to select the 45 most
promising applicants out of the 100 (selection ratio = .45). In such a
case, the 45 individuals falling to the right of the heavy vertical line
would be chosen. Within this group of 45, it can be seen that there are 7
job failures, or false acceptances, falling below the heavy horizontal line,
and 38 job successes. Hence, the percentage of job successes is now 84
rather than 60 (i.e., 38/45 = .84). This increase is attributable to the use
of the test as a screening instrument. It will be noted that errors in pre-
dicted criterion score that do not affect the decision can be ignored.
Only those prediction errors that cross the cutoff line and hence place
the individual in the wrong category will reduce the selective effective-
ness of the test.
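The four cells of Figure 17 (valid and false acceptances, valid and false rejections) can be tallied mechanically from any bivariate sample. A minimal sketch with ten hypothetical applicants rather than the 100 of Figure 17:

```python
# Tallying the four decision outcomes of a cutoff-based selection
# (the paired scores below are hypothetical, not the Figure 17 data).
def selection_outcomes(pairs, test_cutoff, criterion_cutoff):
    """pairs: (test_score, criterion_score) tuples; returns the four cells."""
    cells = {"valid_accept": 0, "false_accept": 0,
             "false_reject": 0, "valid_reject": 0}
    for test, crit in pairs:
        accepted = test >= test_cutoff
        success = crit >= criterion_cutoff
        if accepted and success:
            cells["valid_accept"] += 1
        elif accepted:
            cells["false_accept"] += 1   # hired, but fails on the job
        elif success:
            cells["false_reject"] += 1   # rejected, but would have succeeded
        else:
            cells["valid_reject"] += 1
    return cells

pairs = [(30, 40), (45, 55), (50, 48), (55, 70), (60, 52),
         (65, 75), (70, 58), (75, 80), (80, 72), (85, 90)]
cells = selection_outcomes(pairs, test_cutoff=60, criterion_cutoff=60)
accepted = cells["valid_accept"] + cells["false_accept"]
print(cells)
print(f"success rate among accepted: {cells['valid_accept'] / accepted:.0%}")
```

With these toy figures, selection by the test raises the success rate among those accepted above the 50 percent base rate of the whole group, just as the 60-to-84 rise does in Figure 17.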
For a complete evaluation of the effectiveness of the test as a screening
instrument, another category of cases in Figure 17 must also be examined.
This is the category of false rejections, comprising the 22 persons who
score below the cutoff point on the test but above the criterion cutoff.
From these data we would estimate that 22 percent of the total applicant
sample are potential job successes who will be lost if the test is used as a
screening device with the present cutoff point. These false rejects in a
personnel selection situation correspond to the false positives in clinical
evaluations. The latter term has been adopted from medical practice, in
which a test for a pathological condition is reported as positive if the
condition is present and negative if the patient is normal. A false positive
thus refers to a case in which the test erroneously indicates the presence
of a pathological condition, as when brain damage is indicated in an
individual who is actually normal. This terminology is likely to be con-
fusing unless we remember that in clinical practice a positive result on a
test denotes pathology and unfavorable diagnosis, whereas in personnel
selection a positive result conventionally refers to a favorable prediction
regarding job performance, academic achievement, and the like.
In setting a cutoff score on a test, attention should be given to the
percentage of false rejects (or false positives) as well as to the percent-
ages of successes and failures within the selected group. In certain situ-
ations, the cutoff point should be set sufficiently high to exclude all but
a few possible failures. This would be the case when the job is of such
a nature that a poorly qualified worker could cause serious loss or dam-
age. An example would be a commercial airline pilot. Under other
circumstances, it may be more important to admit as many qualified
persons as possible, at the risk of including more failures. In the latter
case, the number of false rejects can be reduced by the choice of a lower
cutoff score. Other factors that normally determine the position of the
cutoff score include the available personnel supply, the number of job
"
Validity: Measurement and Interpretation 169
openings, and the urgency or speed with which they must be filled.
In many personnel decisions, the selection ratio is determined by the
practical demands of the situation. Because of supply and demand in
filling job openings, for example, it may be necessary to hire the top 40
percent of applicants in one case and the top 75 percent in another.
When the selection ratio is not externally imposed, the cutting score on
a test can be set at that point giving the maximum differentiation be-
tween criterion groups. This can be done roughly by comparing the
distribution of test scores in the two criterion groups. More precise math-
ematical procedures for setting optimal cutting scores have also been
worked out (Darlington & Stauffer, 1966; Guttman & Raju, 1965; Rorer,
Hoffman, La Forge, & Hsieh, 1966). These procedures make it possible to
take into account other relevant parameters, such as the relative serious-
ness of false rejections and false acceptances.
In the terminology of decision theory, the example given in Figure 17
illustrates a simple strategy, or plan for deciding which applicants to ac-
cept and which to reject. In more general terms, a strategy is a technique
for utilizing information in order to reach a decision about individuals. In
this case, the strategy was to accept the 45 persons with the highest test
scores. The increase in percentage of successful employees from 60 to 84
could be used as a basis for estimating the net benefit resulting from the
use of the test.

Statistical decision theory was developed by Wald (1950) with special
reference to the decisions required in the inspection and quality control
of industrial products. Many of its implications for the construction and
interpretation of psychological tests have been systematically worked out
by Cronbach and Gleser (1965). Essentially, decision theory is an at-
tempt to put the decision-making process into mathematical form, so that
available information may be used to arrive at the most effective decision
under specified circumstances. The mathematical procedures employed
in decision theory are often quite complex, and few are in a form per-
mitting their immediate application to practical testing problems. Some
of the basic concepts of decision theory, however, are proving helpful in
the reformulation and clarification of certain questions about tests. A few
of these ideas were introduced into testing before the formal develop-
ment of statistical decision theory and were later recognized as fitting
into that framework.
[Figure 17: bivariate distribution of test scores (horizontal axis) and job performance (vertical axis), divided by the criterion cutoff into job successes and job failures, and by the test cutoff into rejected and selected applicants.]

FIG. 17. Increase in the Proportion of "Successes" Resulting from the Use of a Selection Test.
PREDICTION OF OUTCOMES. A precursor of decision theory in psychologi-
cal testing is to be found in the Taylor-Russell tables (1939), which per-
mit a determination of the net gain in selection accuracy attributable to
the use of the test. The information required includes the validity co-
[Table 14 (Taylor & Russell, 1939): proportion of successes expected among selected applicants, for a base rate of .60, as a function of test validity and selection ratio.]
selected after the use of the test. Thus, the difference between .60 and
any one table entry shows the increase in proportion of successful selec-
tions attributable to the test.
Obviously, if the selection ratio were 100 percent, that is, if all appli-
cants had to be accepted, no test, however valid, could improve the
selection process. Reference to Table 14 shows that, when as many as 95
percent of applicants must be admitted, even a test with perfect validity
(r = 1.00) would raise the proportion of successful persons by only 3 per-
cent (.60 to .63). On the other hand, when only 5 percent of applicants
need to be chosen, a test with a validity coefficient of only .30 can raise
the percentage of successful applicants selected from 60 to 82. The rise
from 60 to 82 represents the incremental validity of the test (Sechrest,
1963), or the increase in predictive validity attributable to the test. It
indicates the contribution the test makes to the selection of individuals
who will meet the minimum standards in criterion performance. In ap-
plying the Taylor-Russell tables, of course, test validity should be com-
puted on the same sort of group used to estimate percentage of prior
successes. In other words, the contribution of the test is not evaluated
against chance success unless applicants were previously selected by
chance, a most unlikely circumstance. If applicants had been selected
on the basis of previous job history, letters of recommendation, and inter-
views, the contribution of the test should be evaluated on the basis of
what the test adds to these previous selection procedures.
The incremental validity resul~~~ from the use of a test depends not
only on the selection ratio but l\~'()ll the base rate. In the previously
illustrated job selection situation, the base rale refers to the proportion of
successful employees prior to the introduction of the test for selection
purposes. Table 14 shows the anticipated outcomes when the base rate
is .60. For other base rates, we need to consult the other appropriate
tables in the cited reference (Taylor & Russell, 1939). Let us consider
an example in which test validity is .40 and the selection ratio is 70 per-
cent. Under these conditions, what would be the contribution or incre-
mental validity of the test if we begin with a base rate of 50 percent?
And what would be the contribution if we begin with more extreme base
rates of 10 and 90 percent? Reference to the appropriate Taylor-Russell
tables for these base rates shows that the percentage of successful em-
ployees would rise from 50 to 75 in the Hrst case; from 10 to 21 in the
second; and from 9 to 99 in the third. Thus, the improvement in percent-
age of successful employees attributable tQ .the use of the test is 25 whenthe base rate was 50, but only 11 and 9 when the b,ase rates were more
extreme. .
The implications of extreme base rates are of specia~,,interest in clinical
psychology, where the base rate refe~ to' the frequency of the patho-
lOgical condition to be diagnosed in the, p.qpulation tested (Buchwald,
o Principles of Psychological Testing
cient of the test, the proportion of applicants who m~~t be acclep~e~
lection ratio), and the proportion of successfu~ app lc~n~ :: ~~r:ethout the use of the test (base rate). A change many 0 t I"
ctorscan alter the predictive efficiency of the test.For urposes of illustration, one of the Taylor-Russell tables has been
e rod~eed in Table 14. This table is designed for us~ when the base
.aie or ercenta e of successful applicants selected pnor to the use of
he test 1s 60. Ot~er tables are prOVided by Taylor and Russe~l for ~t~~r
base ra~es Across the top of the table are given different va ues ~ .e
selection ;atio, and along the side are the tes~ validities. The entnes 111
the' body of the table indicate the proportion of successful· persons
TABLE 14 i ( f G'Proportionof "Successes" Expected through the Use 0 Test 0 lven
Validityand Given Selection Ratio, for Base Rate .60.
(FromTaylor and Russell, 1959, p. 576) . =-"~':~~,",7"'J2'-':UliH~~'.:>,JI;~~,.:~!M.r ••_.:::·..:;.':5.~~~
Selection Ratio
.30 .40 .50 .60 .70 .80 .90 .95
.75
.80
.85
.90
.951.00
.991.00
1.00
1.001.001.00
.99
.991.00
1.001.00
1.00
.96
.98
.991.001.001.00
.93
.95
.97
.991.00
1.00
.90
.92
.95
.97
.99
1.00
.60 .60 .60
.61 .61 .61
.63 .62 .61
.64 .63 .62
.65 .64 .63
.66 .65 .63
.68 .66 .64
.69 .67 .65
.70 .68 .66
.72 .69 .66
.73 .70 .67
.75 .71 .68
.76 .73 .69
.78 .74 .70
.80 .75 .71
.71.72
.73
.74
.75
.75
.86
.88
.91
.94
.971.00
.81 .77
.83 .78
.86 .80
.88 .82.92 .841.00 .86
.62 .61
.62. .6J
.63 .62
.63.62
.64 .62
.64 .62
.64 .62
.65 .63
.65 .63
.66 .63
.66 .63
.66 .63
.66 .63
.67 .63
.67 .63
.67 .63
Princillies of PSljcllological Testing
1965; Cureton, 1957a; Meehl & Rosen, 1955; J. S. Wiggins, 1973). For example, if 5 percent of the intake population of a clinic has organic brain damage, then 5 percent is the base rate of brain damage in this population. Although the introduction of any valid test will improve predictive or diagnostic accuracy, the improvement is greatest when the base rates are closest to 50 percent. With the extreme base rates found with rare pathological conditions, however, the improvement may be negligible. Under these conditions, the use of a test may prove to be unjustified when the cost of its administration and scoring is taken into account. In a clinical situation, this cost would include the time of professional personnel that might otherwise be spent on the treatment of additional cases (Buchwald, 1965). The number of false positives, or normal individuals incorrectly classified as pathological, would of course increase this overall cost in a clinical situation.

When the seriousness of a rare condition makes its diagnosis urgent,
.. tests of moderate validity may be employed in an early stage of sequential
decisions. For example, all cases might first be screened with an easily
administered test of moderate validity. If the cutoff score is set high
enough (high scores being favorable), there will be few false negatives
but many false positives, or normals diagnosed as pathological. The latter
can then be detected through a more intensive individual examination
given to all cases diagnosed as positive by the test. This solution would
be appropriate, for instance, when available facilities make the intensive
individual examination of all cases impracticable.
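A Taylor-Russell entry can also be approximated numerically, since the tables assume that test score and criterion are bivariate normal with correlation equal to the validity coefficient. The sketch below makes that assumption explicit; the function names and the simple trapezoidal integration are illustrative conveniences, not part of the original tables.

```python
import math

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def normal_quantile(p):
    """Inverse of normal_cdf, by bisection (adequate for a sketch)."""
    lo, hi = -8.0, 8.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if normal_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def taylor_russell(validity, selection_ratio, base_rate, steps=4000):
    """Proportion of 'successes' among those selected, for a test whose
    correlation with the criterion is `validity` (0 <= validity < 1)."""
    x0 = normal_quantile(1 - selection_ratio)  # test cutoff score
    y0 = normal_quantile(1 - base_rate)        # criterion success cutoff
    s = math.sqrt(1 - validity ** 2)
    # P(success and selected): integrate phi(x) * P(success | test = x)
    # over x > x0 by the trapezoidal rule, then divide by P(selected).
    h = 8 / steps
    total = 0.0
    for i in range(steps + 1):
        x = x0 + i * h
        weight = 0.5 if i in (0, steps) else 1.0
        density = math.exp(-x * x / 2) / math.sqrt(2 * math.pi)
        total += weight * density * normal_cdf((validity * x - y0) / s)
    return total * h / selection_ratio
```

For the case cited in the text, `taylor_russell(0.30, 0.05, 0.60)` comes out close to the tabled .82, while zero validity leaves the success proportion at the .60 base rate.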
RELATION OF VALIDITY TO MEAN OUTPUT LEVEL. In many practical situ-
ations, what is wanted is an estimate of the effect of the selection test,
not on percentage of persons exceeding the minimum performance, but
on overall output of the selected persons. How does the actual level of
job proficiency or criterion achievement of the workers hired on the
basis of the test compare with that of the total applicant sample that
would have been hired without the test? Following the work of Taylor
and Russell, several investigators addressed themselves to this question
(Brogden, 1946; Brown & Ghiselli, 1953; Jarrett, 1948; Richardson, 1944).
Brogden (1946) first demonstrated that the expected increase in output is directly proportional to the validity of the test. Thus, the improvement resulting from the use of a test of validity .50 is 50 percent as great as the improvement expected from a test of perfect validity.

The relation between test validity and expected rise in criterion achievement can be readily seen in Table 15.¹ Expressing criterion scores
1 A table including more values for both selection ratios and validity coefficients
was prepared by Naylor and Shine (1965).
TABLE 15
[The body of Table 15 did not survive scanning. It gives the expected mean standard criterion score of the selected group for each combination of test validity (down the side) and selection ratio (across the top).]
as standard scores with a mean of zero and an SD of 1.00, this table gives the expected mean criterion score of workers selected with a test of given validity and with a given selection ratio. In this context, the base output mean, corresponding to the performance of applicants selected without use of the test, is given in the column for zero validity. Using a test with zero validity is equivalent to using no test at all. To illustrate the use of the table, let us assume that the highest scoring 20 percent of the applicants are hired (selection ratio = .20) by means of a test whose validity coefficient is .50. Reference to Table 15 shows that the mean criterion performance of this group is .70 SD above the expected base mean of an untested sample. With the same 20 percent selection ratio and a perfect test (validity coefficient = 1.00), the mean criterion score of the accepted applicants would be 1.40, just twice what it would be with the test of validity .50. Similar direct linear relations will be found if other mean criterion performances are compared within any row of Table 15. For instance, with a selection ratio of 60 percent, a validity of .25 yields a mean criterion score of .16, while a validity of .50 yields a mean of .32. Again, doubling the validity doubles the output rise.

The evaluation of test validity in terms of either mean predicted output or proportion of persons exceeding a minimum criterion cutoff is obviously much more favorable than an evaluation based on the previously discussed error of estimate. The reason for the difference is that prediction errors that do not affect decisions are irrelevant to the selection situation. For example, if Smith and Jones are both superior workers and are both hired on the basis of the test, it does not matter if the test shows Smith to be better than Jones while in job performance Jones excels Smith.
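Under the same bivariate-normal assumption, the mean standard criterion score of the selected group is simply the validity coefficient multiplied by the mean standard test score of the top scorers, which makes Brogden's proportionality easy to verify. A minimal sketch (the function names are illustrative):

```python
import math

def upper_cutoff(selection_ratio):
    """Standard score x0 such that P(X > x0) equals the selection ratio."""
    lo, hi = -8.0, 8.0
    for _ in range(100):
        mid = (lo + hi) / 2
        p_above = 0.5 * (1 - math.erf(mid / math.sqrt(2)))
        if p_above > selection_ratio:
            lo = mid  # too many would be selected: raise the cutoff
        else:
            hi = mid
    return (lo + hi) / 2

def mean_criterion_score(validity, selection_ratio):
    """Expected mean standard criterion score of the selected group:
    validity times the mean test score of the top selection_ratio cases."""
    x0 = upper_cutoff(selection_ratio)
    density_at_cutoff = math.exp(-x0 * x0 / 2) / math.sqrt(2 * math.pi)
    return validity * density_at_cutoff / selection_ratio
```

With validity .50 and selection ratio .20 this reproduces the .70 cited above, and doubling the validity doubles the result, as Brogden showed.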
THE ROLE OF VALUES IN DECISION THEORY. It is characteristic of decision theory that tests are evaluated in terms of their effectiveness in a specific situation. Such evaluation takes into account not only the validity of the test in predicting a particular criterion but also a number of other parameters, including base rate and selection ratio. Another important parameter is the relative utility of expected outcomes, the judged favorableness or unfavorableness of each outcome. The lack of adequate systems for assigning values to outcomes in terms of a uniform utility scale is one of the chief obstacles to the application of decision theory. In industrial decisions, a dollar-and-cents value can frequently be assigned to different outcomes. Even in such cases, however, certain outcomes pertaining to good will, public relations, and employee morale are difficult to assess in monetary terms. Educational decisions must take into account institutional goals, social values, and other relatively intangible factors. Individual decisions, as in counseling, must consider the individual's preferences and value system. It has been repeatedly pointed out, however, that decision theory did not introduce the problem of values into the decision process, but merely made it explicit. Values have always entered into decisions, but they were not heretofore clearly recognized or systematically handled.
In choosing a decision strategy, the goal is to maximize expected utilities across all outcomes. Reference to the schematic representation of a simple decision strategy in Figure 18 will help to clarify the procedure. This diagram shows the decision strategy illustrated in Figure 17, in which a single test is administered to a group of applicants and the decision to accept or reject an applicant is made on the basis of a cutoff score on the test. There are four possible outcomes, including valid and false acceptances and valid and false rejections. The probability of each outcome can be found from the number of persons in each of the four sections of Figure 17. Since there were 100 applicants in that example, these numbers divided by 100 give the probabilities of the four outcomes listed in Figure 18. The other data needed are the utilities of the different outcomes, expressed on a common scale. The expected overall utility of the strategy could then be found by multiplying the probability of each outcome by the utility of the outcome, adding these products for the four outcomes, and subtracting a value corresponding to the cost of the test.² This last term highlights the fact that a test of low validity would be more likely to be retained if it is short, inexpensive, easily administered by relatively untrained personnel, and suitable for group administration. An individual test requiring a trained examiner or expensive equipment would need a higher validity to justify its use.
Decision: Administer test and apply cutoff score

Outcome            Probability
Valid Acceptance       .38
False Acceptance       .07
Valid Rejection        .33
False Rejection        .22

FIG. 18. A Simple Decision Strategy.
2 For a fictitious example illustrating all steps in these computations, see J. S. Wiggins (1973), pp. 257-274.
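The computation just described can be written out directly from the probabilities in Figure 18. The utility values and testing cost below are hypothetical placeholders on a common scale, since the text leaves them unspecified.

```python
# Outcome probabilities from Figure 18 (100 applicants in the example)
probabilities = {
    "valid acceptance": 0.38,
    "false acceptance": 0.07,
    "valid rejection": 0.33,
    "false rejection": 0.22,
}

# Hypothetical utilities; a real application would require the explicit
# value judgments discussed in the text.
utilities = {
    "valid acceptance": 1.0,
    "false acceptance": -1.0,
    "valid rejection": 0.5,
    "false rejection": -0.5,
}
testing_cost = 0.05  # hypothetical cost of administering the test

# Expected overall utility: sum of probability times utility for each
# outcome, minus the cost of testing.
expected_utility = sum(
    probabilities[outcome] * utilities[outcome] for outcome in probabilities
) - testing_cost
print(expected_utility)
```

Comparing this figure across alternative strategies (including selecting without the test) is the decision-theoretic criterion described above.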
SEQUENTIAL STRATEGIES AND ADAPTIVE TREATMENTS. In some situations, the effectiveness of a test may be increased through the use of more complex decision strategies, which take still more parameters into account. Two examples will serve to illustrate these possibilities. First, tests may be used to make sequential rather than terminal decisions. With the simple decision strategy illustrated in Figures 17 and 18, all decisions to accept or reject are treated as terminal. Figure 19, on the other hand, shows a two-stage sequential decision. Test A could be a short and easily administered screening test. On the basis of performance on this test, individuals would be sorted into three categories, including those clearly accepted or rejected, as well as an intermediate "uncertain" group to be examined further with more intensive techniques, represented by Test B. On the basis of the second-stage testing, this group would be sorted into accepted and rejected categories.

Such sequential testing can also be employed within a single testing session, to maximize the effective use of testing time (DeWitt & Weiss, 1974; Linn, Rock, & Cleary, 1969; Weiss & Betz, 1973). Although applicable to paper-and-pencil printed group tests, sequential testing is particularly well suited for computer testing. Essentially, the sequence of items or item groups within the test is determined by the examinee's own performance. For example, everyone might begin with a set of items of intermediate difficulty. Those who score poorly are routed to easier items; those who score well, to more difficult items. Such branching may occur repeatedly at several stages. The principal effect is that each examinee attempts only those items suited to his ability level, rather than trying all items. Sequential testing models will be discussed further in Chapter 11, in connection with the utilization of computers in group testing.

Another strategy, suitable for the diagnosis of psychological disorders, is to use only two categories, but to test further all cases classified as positives (i.e., possibly pathological) by the preliminary screening test. This is the strategy cited earlier in this section, in connection with the use of tests to diagnose pathological conditions with very low base rates.

It should also be noted that many personnel decisions are in effect sequential, although they may not be so perceived. Incompetent employees hired because of prediction errors can usually be discharged after a probationary period; failing students can be dropped from college at several stages. In such situations, it is only adverse selection decisions that are terminal. To be sure, incorrect selection decisions that are later rectified may be costly in terms of several value systems. But they are often less costly than terminal wrong decisions.

A second condition that may alter the effectiveness of a psychological test is the availability of alternative treatments and the possibility of adapting treatments to individual characteristics. An example would be the utilization of different training procedures for workers at different aptitude levels, or the introduction of compensatory educational programs for students with certain educational disabilities. Under these conditions, the decision strategy followed in individual cases should take into account available data on the interaction of initial test score and differential treatment. When adaptive treatments are utilized, the success rate is likely to be substantially improved. Because the assignment of individuals to alternative treatments is essentially a classification rather than a selection problem, more will be said about the required methodology in a later section on classification decisions.

The examples cited illustrate a few of the ways in which the concepts and rationale of decision theory can assist in the evaluation of psychological tests for specific testing purposes. Essentially, decision theory has served to focus attention on the complexity of factors that determine the contribution a given test can make in a particular situation. The validity coefficient alone cannot indicate whether or not a test should be used, since it is only one of the factors to be considered in evaluating the impact of the test on the efficacy of the total decision process.³
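The two-stage strategy of Figure 19 can be sketched in a few lines. The cutoff scores and the stand-in for the intensive Test B below are invented for illustration.

```python
def two_stage_decision(score_a, accept_cutoff, reject_cutoff, test_b):
    """Sequential strategy: Test A screens everyone; only the intermediate
    'uncertain' group is examined further with the intensive Test B."""
    if score_a >= accept_cutoff:
        return "accept"  # clearly accepted at stage one
    if score_a < reject_cutoff:
        return "reject"  # clearly rejected at stage one
    # uncertain group: the second-stage examination decides
    return "accept" if test_b() else "reject"

# Hypothetical cutoffs: accept at 70 or above, reject below 40,
# and refer everyone in between to Test B.
decision = two_stage_decision(55, 70, 40, test_b=lambda: True)
```

Because Test B is administered only to the uncertain group, the intensive examination is reserved for the cases in which it can change the decision.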
DIFFERENTIALLY PREDICTABLE SUBSETS OF PERSONS. The validity of a test for a given criterion may vary among subgroups differing in personal characteristics. The classic psychometric model assumes that prediction errors are characteristic of the test rather than of the person and that these errors are randomly distributed among persons. With the flexibility of approach ushered in by decision theory, there has been increasing exploration of prediction models involving interaction between persons and

3 For a fuller discussion of the implications of decision theory for test use, see J. S. Wiggins (1973), Ch. 6, and at a more technical level, Cronbach and Gleser (1965).
tests. Such interaction implies that the same test may be a better predictor for certain classes or subsets of persons than it is for others. For example, a given test may be a better predictor of criterion performance for men than for women, or a better predictor for applicants from a lower than for applicants from a higher socioeconomic level. In these examples, sex and socioeconomic level are known as moderator variables, since they moderate the validity of the test (Saunders, 1956).

When computed in a total group, the validity coefficient of a test may be too low to be of much practical value in prediction. But when recomputed in subsets of individuals differing in some identifiable characteristic, validity may be high in one subset and negligible in another. The test could thus be used effectively in making decisions regarding persons in the first group but not in the second. Perhaps another test or some other assessment device could be found that is an effective predictor in the second group.

A moderator variable is some characteristic of persons that makes it possible to predict the predictability of different individuals with a given instrument. It may be a demographic variable, such as sex, age, educational level, or socioeconomic background; or it may be a score on another test. Interests and motivation often function as moderator variables. Thus, if an applicant has little interest in a job, he will probably perform poorly regardless of his scores on relevant aptitude tests. Among such persons, the correlation between aptitude test scores and job performance would be low. For individuals who are interested and highly motivated, on the other hand, the correlation between aptitude test score and job success may be quite high.
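Checking for a moderator variable amounts to recomputing the test-criterion correlation within each subgroup. A sketch of the mechanics, with made-up scores that illustrate the pattern rather than the results of any study cited here:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def validity_by_subgroup(records):
    """Validity coefficient computed separately within each subgroup.
    records: iterable of (moderator_level, test_score, criterion_score)."""
    groups = {}
    for level, test, criterion in records:
        tests, criteria = groups.setdefault(level, ([], []))
        tests.append(test)
        criteria.append(criterion)
    return {level: pearson_r(t, c) for level, (t, c) in groups.items()}
```

Applied to real data, a high coefficient in one subgroup alongside a negligible one in another is the signature of a moderator variable.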
EMPIRICAL EXAMPLES OF MODERATOR VARIABLES. Evidence for the operation of moderator variables comes from a variety of sources. In a survey of several hundred correlation coefficients between aptitude test scores and academic grades, H. G. Seashore (1962) found higher correlations for women than for men in the large majority of instances. The same trend was found in high school and college, although the trend was more pronounced at the college level. The data do not indicate the reason for this sex difference in the predictability of academic achievement, but it may be interesting to speculate about it in the light of other known sex differences. If women students in general tend to be more conforming and more inclined to accept the values and standards of the school situation, their class achievement will probably depend largely on their abilities. If, on the other hand, men students tend to concentrate their efforts on those activities (in or out of school) that arouse their individual interests, these interest differences would introduce additional variance in their course achievement and would make it more difficult to predict achievement from test scores. Whatever the reason for the difference, sex does appear to function as a moderator variable in the predictability of academic grades from aptitude test scores.
A number of investigations have been specially designed to assess the
role of moderator variables in the prediction of academic achievement.
Several studies (Frederiksen & Gilbert, 1960; Frederiksen & Melville, 1954; Stricker, 1966) tested the hypothesis that the more compulsive students, identified through two tests of compulsivity, would put a great
deal of effort into their course work, regardless of their interest in the
courses, but that the effort of the less compulsive students would depend
on their interest. Since effort will be reflected in grades, the correlation
between the appropriate interest test scores and grades should be higher
among noncompulsive than among compulsive students. This hypothesis
was confirmed in several groups of male engineering students, but not
among liberal arts students of either sex. Moreover, lack of agreement
among different indicators of compulsivity casts doubt on the generality
of the construct that was being measured.
In another study (Grooms & Endler, 1960), the college grades of the
more anxious students correlated higher (r = .63) with aptitude and
achievement test scores than did the grades of the less anxious students
(r = .19). A different approach is illustrated by Berdie (1961), who in-
vestigated the relation between intraindividual variability on a test and
the predictive validity of the same test. It was hypothesized that a given test will be a better predictor for those individuals who perform more
consistently in different parts of the test-and whose total scores are thus
more reliable. Although the hypothesis was partially confirmed, the re-
lation proved to be more complex than anticipated (Berdie, 1969).
In a different context, there is evidence that self-report personality in-
ventories may have higher validity for some types of neurotics than for
others (Fulkerson, 1959). The characteristic behavior of the two types
tends to make one type careful and accurate in reporting symptoms, the
other careless and evasive. The individual who is characteristically precise and careful about details, who tends to worry about his problems, and who uses intellectualization as a primary defense is likely to provide a more accurate picture of his emotional difficulties on a self-report inventory than is the impulsive, careless individual who tends to avoid expressing unpleasant thoughts and emotions and who uses denial as a primary defense.
Ghiselli (1956, 1960a, 1960b, 1963, 1968; Ghiselli & Sanders, 1967) has extensively explored the role of moderator variables in industrial situations. In a study of taxi drivers (Ghiselli, 1956), the correlation between an aptitude test and a job-performance criterion in the total applicant sample was only .220. The group was then sorted into thirds on the basis of their scores on an occupational interest test. When the validity of the
aptitude test was recomputed within the third whose occupational interest level was most appropriate for the job, it rose to .664.

A technique employed by Ghiselli in much of his research consists in finding for each individual the absolute difference (D) between his actual and his predicted criterion scores. The smaller the value of D, the more predictable is the individual's criterion score. A predictability scale is then developed by comparing the item responses of two contrasted subgroups selected on the basis of their D scores. The predictability scale is subsequently applied to a new sample, to identify highly predictable and poorly predictable subgroups, and the validity of the original test is compared in these two subgroups. This approach has shown considerable promise as a means of identifying persons for whom a test will be a good or a poor predictor. An extension of the same procedure has been developed to determine in advance which of two tests will be a better predictor for each individual (Ghiselli, 1960a).

Other investigators (Dunnette, 1972; Hobert & Dunnette, 1967) have argued that Ghiselli's D index, based on the absolute amount of prediction error without regard to direction of error, may obscure important individual differences. Alternative procedures, involving separate analyses of overpredicted and underpredicted cases, have accordingly been proposed.

At this time, the identification and use of moderator variables are still in an exploratory phase. Considerable caution is required to avoid methodological pitfalls (see, e.g., Abrahams & Alf, 1972a, 1972b; Dunnette, 1972; Ghiselli, 1972; Velicer, 1972a, 1972b). The results are usually quite specific to the situations in which they were obtained. And it is important to check the extent to which the use of moderators actually improves the prediction that could be achieved through other, more direct means (Pinder, 1973).

For the prediction of practical criteria, not one but several tests are generally required. Most criteria are complex, the criterion measure depending on a number of different traits. A single test designed to measure such a criterion would thus have to be highly heterogeneous. It has already been pointed out, however, that a relatively homogeneous test, measuring largely a single trait, is more satisfactory because it yields less ambiguous scores (Ch. 5). Hence, it is usually preferable to use a combination of several relatively homogeneous tests, each covering a different aspect of the criterion, rather than a single test consisting of a hodgepodge of many different kinds of items.

When a number of specially selected tests are employed together to predict a single criterion, they are known as a test battery. The chief problem arising in the use of such batteries concerns the way in which scores on the different tests are to be combined in arriving at a decision regarding each individual. The statistical procedures followed for this purpose are of two major types, namely, multiple regression equation and multiple cutoff scores.

When tests are administered in the intensive study of individual cases, as in clinical diagnosis, counseling, or the evaluation of high-level executives, it is a common practice for the examiner to utilize test scores without further statistical analysis. In preparing a case report and in making recommendations, the examiner relies on judgment, past experience, and theoretical rationale to interpret score patterns and integrate findings from different tests. Such clinical use of test scores will be discussed further in Chapter 16.

MULTIPLE REGRESSION EQUATION. The multiple regression equation yields a predicted criterion score for each individual on the basis of his scores on all the tests in the battery. The following regression equation illustrates the application of this technique to predicting a student's achievement in high school mathematics courses from his scores on verbal (V), numerical (N), and reasoning (R) tests:

Mathematics Achievement = .21V + .21N + .32R + 1.35

In this equation, the student's stanine score on each of the three tests is multiplied by the corresponding weight given in the equation. The sum of these products, plus a constant (1.35), gives the student's predicted stanine position in mathematics courses.

Suppose that Bill Jones receives the following stanine scores:

Verbal      6
Numerical   4
Reasoning   8

The estimated mathematics achievement of this student is found as follows:

Math. Achiev. = (.21)(6) + (.21)(4) + (.32)(8) + 1.35 = 6.01

Bill's predicted stanine is approximately 6. It will be recalled (Ch. 4) that a stanine of 5 represents average performance. Bill would thus be expected to do somewhat better than average in mathematics courses. His very superior performance in the reasoning test (R = 8) and his above-average score on the verbal test (V = 6) compensate for his poor score in speed and accuracy of computation (N = 4).
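The computation for Bill Jones can be verified in a few lines, using the weights (.21, .21, .32) and the constant 1.35 from the worked computation above.

```python
def predicted_math_stanine(verbal, numerical, reasoning):
    """Regression equation from the text:
    Mathematics Achievement = .21V + .21N + .32R + 1.35 (stanine scores)."""
    return 0.21 * verbal + 0.21 * numerical + 0.32 * reasoning + 1.35

# Bill Jones: V = 6, N = 4, R = 8
score = predicted_math_stanine(6, 4, 8)
print(round(score, 2))
```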
Specific techniques for the computation of regression equations can be