ANNE ANASTASI
Professor of Psychology, Fordham University

Psychological Testing

MACMILLAN PUBLISHING CO., INC.
New York
Collier Macmillan Publishers
London
IN A revised edition, one expects both similarities and differences. This
edition shares with the earlier versions the objectives and basic approach
of the book. The primary goal of this text is still to contribute toward the
proper evaluation of psychological tests and the correct interpretation
and use of test results. This goal calls for several kinds of information:
( 1) an understanding of the major principles of test construction, (2)
psychological knowledge about the behavior being assessed, (3) sensi-
tivity to the social and ethical implications of test use, and (4) broad
familiarity with the types of available instruments and the sources of
information about tests. A minor innovation in the fourth edition is the
addition of a suggested outline for test evaluation (Appendix C).
In successive editions, it has been necessary to exercise more and more
restraint to keep the number of specific tests discussed in the book from
growing with the field-it has never been my intention to provide a
miniature Mental Measurements Yearbook! Nevertheless, I am aware
that principles of test construction and interpretation can be better
understood when applied to particular tests. Moreover, acquaintance with
the major types of available tests, together with an understanding of
their special contributions and limitations, is an essential component of
knowledge about contemporary testing. For these reasons, specific tests
are again examined and evaluated in Parts 3, 4, and 5. These tests have
been chosen either because they are outstanding examples with which
the student of testing should be familiar or because they illustrate some
special point of test construction or interpretation. In the text itself, the
principal focus is on types of tests rather than on specific instruments. At
the same time, Appendix E contains a classified list of over 250 tests,
including not only those cited in the text but also others added to provide
a more representative sample.
As for the differences-they loomed especially large during the prepa-
ration of this edition. Much that has happened in human society since
the mid-1960's has had an impact on psychological testing. Some of these
developments were briefly described in the last two chapters of the third
edition. Today they have become part of the mainstream of psychological
testing and have been accordingly incorporated in the appropriate
sections throughout the book. Recent changes in psychological testing that
are reflected in the present edition can be described on three levels:
(1) general orientation toward testing, (2) substantive and methodological
developments, and (3) "ordinary progress," such as the publication
of new tests and revision of earlier tests.
All rights reserved. No part of this book may be reproduced or
transmitted in any form or by any means, electronic or me-
chanical, including photocopying, recording, or any informa-
tion storage and retrieval system, without permission in writing
from the Publisher.
Earlier editions copyright 1954 and © 1961 by Macmillan
Publishing Co., Inc., and copyright © 1968 by Anne Anastasi.
MACMILLAN PUBLISHING Co., INC.
866 Third Avenue, New York, New York 10022
COLLIER MACMILLAN CANADA, LTD.
Library of Congress Cataloging in Publication Data
Anastasi, Anne, (date)
Psychological testing.
Bibliography: p.
Includes indexes.
1. Mental tests. 2. Personality tests. I. Title.
[DNLM: 1. Psychological tests. WM145 A534P]
BF431.A573 1976 153.9 75-2206
ISBN 0-02-302980-3
Preface
An example of changes on the first level is the increasing awareness of
the ethical, social, and legal implications of testing. In the present edition,
this topic has been expanded and treated in a separate chapter early
in the book (Ch. 3) and in Appendixes A and B. A cluster of related
developments represents a broadening of test uses. Besides the traditional
applications of tests in selection and diagnosis, increasing attention is
being given to administering tests for self-knowledge and self-development,
and to training individuals in the use of their own test results in
decision making (Chs. 3 and 4). In the same category are the continuing
replacement of global scores with multitrait profiles and the application
of classification strategies, whereby "everyone can be above average" in
one or more socially valued variables (Ch. 7). From another angle,
efforts are being made to modify traditional interpretations of test scores,
in both cognitive and noncognitive areas, in the light of accumulating
psychological knowledge. In this edition, Chapter 12 brings together
psychological issues in the interpretation of intelligence test scores,
touching on such problems as stability and change in intellectual level
over time; the nature of intelligence; and the testing of intelligence in
early childhood, in old age, and in different cultures. Another example
is provided by the increasing emphasis on situational specificity and
person-by-situation interactions in personality testing, stimulated in large
part by the social-learning theorists (Ch. 17).
The second level, covering substantive and methodological changes,
is illustrated by the impact of computers on the development, administration,
scoring, and interpretation of tests (see especially Chs. 4, 11, 13, 17,
18, 19). The use of computers in administering or managing instructional
programs has also stimulated the development of criterion-referenced
tests, although other conditions have contributed to the upsurge of
interest in such tests in education. Criterion-referenced tests are discussed
principally in Chapters 4, 5, and 14. Other types of instruments that have
risen to prominence and have received fuller treatment in the present
edition include: tests for identifying specific learning disabilities (Ch. 16),
inventories and other devices for use in behavior modification programs
(Ch. 20), instruments for assessing early childhood education (Ch. 14),
Piagetian "ordinal" scales (Chs. 10 and 14), basic education and
literacy tests for adults (Chs. 13 and 14), and techniques for the
assessment of environments (Ch. 20). Problems to be considered in the
assessment of minority groups, including the question of test bias, are
examined from different angles in Chapters 3, 7, 8, and 12.
On the third level, it may be noted that over 100 of the tests listed in
this edition have been either initially published or revised since the
publication of the preceding edition (1968). Major examples include the
McCarthy Scales of Children's Abilities, the WISC-R, the 1972 Stanford-Binet
norms (with all the resulting readjustments in interpretations),
Forms S and T of the DAT (including a computerized Career Planning
Program), the Strong-Campbell Interest Inventory (merged form of the
SVIB), and the latest revisions of the Stanford Achievement Test and the
Metropolitan Readiness Tests.
It is a pleasure to acknowledge the assistance received from many
sources in the preparation of this edition. The completion of the project
was facilitated by a one-semester Faculty Fellowship awarded by Ford-
ham University and by a grant from the Fordham University Research
Council covering principally the services of a research assistant. These
services were performed by Stanley Friedland with an unusual combination
of expertise, responsibility, and graciousness. I am indebted to the
many authors and test publishers who provided reprints, unpublished
manuscripts, specimen sets of tests, and answers to my innumerable inquiries
by mail and telephone. For assistance extending far beyond the
interests and responsibilities of any single publisher, I am especially
grateful to Anna Dragositz of Educational Testing Service and Blythe
Mitchell of Harcourt Brace Jovanovich, Inc. I want to acknowledge the
significant contribution of John T. Cowles of the University of Pittsburgh,
who assumed complete responsibility for the preparation of the Instructor's
Manual to accompany this text.
For informative discussions and critical comments on particular topics,
I want to convey my sincere thanks to William H. Angoff of Educational
Testing Service and to several members of the Fordham University Psy-
chology Department, including David R. Chabot, Marvin Reznikoff,
Reuben M. Schonebaum, and Warren W. Tryon. Grateful acknowledgment
is also made of the thoughtful recommendations submitted by
course instructors in response to the questionnaire distributed to current
users of the third edition. Special thanks in this connection are due to
Mary Carol Cahill for her extensive, constructive, and wide-ranging
suggestions. I wish to express my appreciation to Victoria Overton of
the Fordham University library staff for her efficient and courteous assistance
in bibliographic matters. Finally, I am happy to record the
contributions of my husband, John Porter Foley, Jr., who again participated
in the solution of countless problems at all stages in the preparation
of the book.
A.A.
CONTENTS

PART 1
CONTEXT OF PSYCHOLOGICAL TESTING

1. FUNCTIONS AND ORIGINS OF PSYCHOLOGICAL TESTING 3
Current uses of psychological tests
Early interest in classification and training of the mentally retarded 5
The first experimental psychologists 7
Contributions of Francis Galton 8
Cattell and the early "mental tests" 9
Binet and the rise of intelligence tests 10
Group testing 12
Aptitude testing 13
Standardized achievement tests 16
Measurement of personality 18
Sources of information about tests 20
2. NATURE AND USE OF PSYCHOLOGICAL TESTS
What is a psychological test? 23
Reasons for controlling the use of psychological tests
Test administration 32
Rapport 34
Test anxiety 37
Examiner and situational variables 39
Coaching, practice, and test sophistication 41

3. SOCIAL AND ETHICAL IMPLICATIONS
OF TESTING
User qualifications 45
Testing instruments and procedures 47
Protection of privacy 49
Confidentiality 52
Communicating test results 56
Testing and the civil rights of minorities 57
PART 2
PRINCIPLES OF PSYCHOLOGICAL TESTING

4. NORMS AND THE INTERPRETATION OF
TEST SCORES
Statistical concepts 68
Developmental norms 73
Within-group norms 77
Relativity of norms 88
Computer utilization in the interpretation of test scores 94
Criterion-referenced testing 96
5. RELIABILITY
The correlation coefficient 104
Types of reliability 110
Reliability of speeded tests 122
Dependence of reliability coefficients on the sample tested 125
Standard error of measurement 127
Reliability of criterion-referenced tests 131
6. VALIDITY: BASIC CONCEPTS
Content validity 134
Criterion-related validity 140
Construct validity 151
Overview 158
7. VALIDITY: MEASUREMENT AND
INTERPRETATION
Validity coefficient and error of estimate 163
Test validity and decision theory 167
Moderator variables 177
Combining information from different tests 180
Use of tests for classification decisions 186
Statistical analyses of test bias 191
8. ITEM ANALYSIS
Item difficulty 199
Item validity 206
Internal consistency 215
Item analysis of speeded tests 217
Cross validation 219
Item-group interaction 222
PART 3
TESTS OF GENERAL INTELLECTUAL LEVEL

9. INDIVIDUAL TESTS
Stanford-Binet Intelligence Scale 230
Wechsler Adult Intelligence Scale 245
Wechsler Intelligence Scale for Children 255
Wechsler Preschool and Primary Scale of Intelligence 260

10. TESTS FOR SPECIAL POPULATIONS
Infant and preschool testing 266
Testing the physically handicapped 281
Cross-cultural testing 287

11. GROUP TESTING
Group tests versus individual tests 299
Multilevel batteries 305
Tests for the college level and beyond 318
12. PSYCHOLOGICAL ISSUES IN INTELLIGENCE TESTING
Longitudinal studies of intelligence 327
Intelligence in early childhood 332
Problems in the testing of adult intelligence 337
Problems in cross-cultural testing 343
Nature of intelligence 349
PART 4
TESTS OF SEPARATE ABILITIES

13. MEASURING MULTIPLE APTITUDES
Factor analysis 362
Theories of trait organization 369
Multiple aptitude batteries 378
Measurement of creativity 388
14. EDUCATIONAL TESTING
Achievement tests: their nature and uses 398
General achievement batteries 403
Standardized tests in separate subjects 410
Teacher-made classroom tests 412
Diagnostic and criterion-referenced tests 417
Specialized prognostic tests 423
Assessment in early childhood education 425

15. OCCUPATIONAL TESTING
Validation of industrial tests 435
Short screening tests for industrial personnel 439
Special aptitude tests 442
Testing in the professions 458

16. CLINICAL TESTING
Diagnostic use of intelligence tests 465
Special tests for detecting cognitive dysfunction
Identifying specific learning disabilities 478
Clinical judgment 482
Report writing 487

PART 5
PERSONALITY TESTS

17. SELF-REPORT INVENTORIES
Content validation 494
Empirical criterion keying 496
Factor analysis in test development 506
Personality theory in test development 510
Test-taking attitudes and response sets 515
Situational specificity 521
Evaluation of personality inventories

18. MEASURES OF INTERESTS, ATTITUDES, AND VALUES 527
Interest inventories 528
Opinion and attitude measurement 543
Attitude scales 546
Assessment of values and related variables 552

19. PROJECTIVE TECHNIQUES
Nature of projective techniques 558
Inkblot techniques 559
Thematic Apperception Test and related instruments
Other projective techniques 569
Evaluation of projective techniques 576

20. OTHER ASSESSMENT TECHNIQUES
"Objective" performance tests 588
Situational tests 593
Self-concepts and personal constructs 598
Assessment techniques in behavior modification programs
Observer reports 606
Biographical inventories 614
The assessment of environments 616

APPENDIXES
B. Guidelines on Employee Selection Procedures (EEOC)
Guidelines for Reporting Criterion-Related and
Content Validity (OFCC)
PART 1
Context of
Psychological Testing

CHAPTER 1
Functions and Origins of
Psychological Testing
ANYONE reading this book today could undoubtedly illustrate what
is meant by a psychological test. It would be easy enough to recall
a test the reader himself has taken in school, in college, in the
armed services, in the counseling center, or in the personnel office. Or
perhaps the reader has served as a subject in an experiment in which
standardized tests were employed. This would certainly not have been the
case fifty years ago. Psychological testing is a relatively young branch of
one of the youngest of the sciences.
Basically, the function of psychological tests is to measure differences
between individuals or between the reactions of the same individual on
different occasions. One of the first problems that stimulated the develop-
ment of psychological tests was the identification of the mentally
retarded. To this day, the detection of intellectual deficiencies remains an
important application of certain types of psychological tests. Related
clinical uses of tests include the examination of the emotionally disturbed,
the delinquent, and other types of behavioral deviants. A strong impetus
to the early development of tests was likewise provided by problems
arising in education. At present, schools are among the largest test users.
The classification of children with reference to their ability to profit
from different types of school instruction, the identification of the
intellectually retarded on the one hand and the gifted on the other, the
diagnosis of academic failures, the educational and vocational counseling
of high school and college students, and the selection of applicants for
professional and other special schools are among the many educational
uses of tests.
The selection and classification of industrial personnel represent an-
other major application of psychological testing. From the assembly-line
operator or filing clerk to top management, there is scarcely a type of job
for which some kind of psychological test has not proved helpful in such
matters as hiring, job assignment, transfer, promotion, or termination.
To be sure, the effective employment of tests in many of these situations,
especially in connection with high-level jobs, usually requires that the
tests be used as an adjunct to skillful interviewing, so that test scores
may be properly interpreted in the light of other background information
about the individual. Nevertheless, testing constitutes an important part
of the total personnel program. A closely related application of psychological
testing is to be found in the selection and classification of military
personnel. From simple beginnings in World War I, the scope and
variety of psychological tests employed in military situations underwent
a phenomenal increase during World War II. Subsequently, research
on test development has been continuing on a large scale in all branches
of the armed services.
The use of tests in counseling has gradually broadened from a nar-
rowly defined guidance regarding educational and vocational plans to
an involvement with all aspects of the person's life. Emotional well-
being and effective interpersonal relations have become increasingly
prominent objectives of counseling. There is growing emphasis, too, on
the use of tests to enhance self-understanding and personal development.
Within this framework, test scores are part of the information given to
the individual as aids to his own decision-making processes.
It is clearly evident that psychological tests are currently being em-
ployed in the solution of a wide range of practical problems. One should
not, however, lose sight of the fact that such tests are also serving
important functions in basic research. Nearly all problems in differential
psychology, for example, require testing procedures as a means of gathering
data. As illustrations, reference may be made to studies on the nature and
extent of individual differences, the identification of psychological traits,
the measurement of group differences, and the investigation of biological
and cultural factors associated with behavioral differences. For all such
areas of research-and for many others-the precise measurement of
individual differences made possible by well-constructed tests is an
essential prerequisite. Similarly, psychological tests provide standardized
tools for investigating such varied problems as life-span developmental
changes within the individual, the relative effectiveness of different edu-
cational procedures, the outcomes of psychotherapy, the impact of
community programs, and the influence of noise on performance.
From the many different uses of psychological tests, it follows that some
knowledge of such tests is needed for an adequate understanding of most
fields of contemporary psychology. It is primarily with this end in view
that the present book has been prepared. The book is not designed to
make the individual either a skilled examiner and test administrator or
an expert on test construction. It is directed, not to the test specialist, but
to the general student of psychology. Some acquaintance with the
leading current tests is necessary in order to understand references to the use
of such tests in the psychological literature. And a proper evaluation and
interpretation of test results must ultimately rest on a knowledge of how
the tests were constructed, what they can be expected to accomplish, and
what are their peculiar limitations. Today a familiarity with tests is
required, not only by those who give or construct tests, but by the general
psychologist as well.
A brief overview of the historical antecedents and origins of psychologi-
cal testing will provide perspective and should aid in the understanding
of present-day tests.' The direction in which contemporary psychological
testing has been progressing can be clarified when considered in the light
of the precursors of such tests. The special limitations as well as the
advantages that characterize current tests likewise become more intel-
ligible when viewed against the background in which they originated.
The roots of testing are lost in antiquity. DuBois (1966) gives a provocative
and entertaining account of the system of civil service examinations
prevailing in the Chinese empire for some three thousand years.
Among the ancient Greeks, testing was an established adjunct to the
educational process. Tests were used to assess the mastery of physical as
well as intellectual skills. The Socratic method of teaching, with its
interweaving of testing and teaching, has much in common with today's
programmed learning. From their beginnings in the Middle Ages, European
universities relied on formal examinations in awarding degrees and
honors. To identify the major developments that shaped contemporary
testing, however, we need go no farther than the nineteenth century. It
is to these developments that we now turn.
EARLY INTEREST IN CLASSIFICATION AND
TRAINING OF THE MENTALLY RETARDED
The nineteenth century witnessed a strong awakening of interest in the
humane treatment of the mentally retarded and the insane. Prior to that
time, neglect, ridicule, and even torture had been the common lot of these
unfortunates. With the growing concern for the proper care of mental
¹ A more detailed account of the early origins of psychological tests can be found
in Goodenough (1949) and J. Peterson (1926). See also Boring (1950) and Murphy
and Kovach (1972) for more general background, DuBois (1970) for a brief but
comprehensive history of psychological testing, and Anastasi (1965) for historical
antecedents of the study of individual differences.
deviates came a realization that some uniform criteria for identifying and
classifying these cases were required. The establishment of many special
institutions for the care of the mentally retarded in both Europe and
America made the need for setting up admission standards and an ob-
jective system of classification especially urgent. First it was necessary to
differentiate between the insane and the mentally retarded. The former
manifested emotional disorders that might or might not be accompanied
by intellectual deterioration from an initially normal level; the latter were
characterized essentially by intellectual defect that had been present
from birth or early infancy. What is probably the first explicit statement
of this distinction is to be found in a two-volume work published in 1838
by the French physician Esquirol (1838), in which over one hundred
pages are devoted to mental retardation. Esquirol also pointed out that
there are many degrees of mental retardation, varying along a continuum
from normality to low-grade idiocy. In the effort to develop some system
for classifying the different degrees and varieties of retardation, Esquirol
tried several procedures but concluded that the individual's use of
language provides the most dependable criterion of his intellectual level. It
is interesting to note that current criteria of mental retardation are also
largely linguistic and that present-day intelligence tests are heavily
loaded with verbal content. The important part verbal ability plays in
our concept of intelligence will be repeatedly demonstrated in subsequent
chapters.
Of special significance are the contributions of another French
physician, Seguin, who pioneered in the training of the mentally retarded.
Having rejected the prevalent notion of the incurability of mental
retardation, Seguin (1866) experimented for many years with what he
termed the physiological method of training; and in 1837 he established
the first school devoted to the education of mentally retarded children.
In 1848 he emigrated to America, where his ideas gained wide
recognition. Many of the sense-training and muscle-training techniques
currently in use in institutions for the mentally retarded were originated by
Seguin. By these methods, severely retarded children are given intensive
exercise in sensory discrimination and in the development of motor
control. Some of the procedures developed by Seguin for this purpose were
eventually incorporated into performance or nonverbal tests of
intelligence. An example is the Seguin Form Board, in which the individual
is required to insert variously shaped blocks into the corresponding
recesses as quickly as possible.
More than half a century after the work of Esquirol and Seguin, the
French psychologist Alfred Binet urged that children who failed to
respond to normal schooling be examined before dismissal and, if con-
sidered educable, be assigned to special classes (T. H. Wolf, 1973). With
his fellow members of the Society for the Psychological Study of the
Child, Binet stimulated the Ministry of Public Instruction to take steps to
improve the condition of retarded children. A specific outcome was the
establishment of a ministerial commission for the study of retarded
children, to which Binet was appointed. This appointment was a momentous
event in the history of psychological testing, of which more will be said
later.
THE FIRST EXPERIMENTAL PSYCHOLOGISTS

The early experimental psychologists of the nineteenth century were
not, in general, concerned with the measurement of individual differences.
The principal aim of psychologists of that period was the formulation
of generalized descriptions of human behavior. It was the
uniformities rather than the differences in behavior that were the focus
of attention. Individual differences were either ignored or were accepted
as a necessary evil that limited the applicability of the generalizations.
Thus, the fact that one individual reacted differently from another when
observed under identical conditions was regarded as a form of error.
The presence of such error, or individual variability, rendered the
generalizations approximate rather than exact. This was the attitude
toward individual differences that prevailed in such laboratories as that
founded by Wundt at Leipzig in 1879, where many of the early experimental
psychologists received their training.
In their choice of topics, as in many other phases of their work, the
founders of experimental psychology reflected the influence of their
backgrounds in physiology and physics. The problems studied in their
laboratories were concerned largely with sensitivity to visual, auditory, and
other sensory stimuli and with simple reaction time. This emphasis on
sensory phenomena was in turn reflected in the nature of the first
psychological tests, as will be apparent in subsequent sections.
Still another way in which nineteenth-century experimental psychology
influenced the course of the testing movement may be noted. The early
psychological experiments brought out the need for rigorous control
of the conditions under which observations were made. For example, the
wording of directions given to the subject in a reaction-time experiment
might appreciably increase or decrease the speed of the subject's
response. Or again, the brightness or color of the surrounding field could
markedly alter the appearance of a visual stimulus. The importance of
making observations on all subjects under standardized conditions was
thus vividly demonstrated. Such standardization of procedure eventually
became one of the special earmarks of psychological tests.
CONTRIBUTIONS OF FRANCIS GALTON

It was the English biologist Sir Francis Galton who was primarily
responsible for launching the testing movement. A unifying factor in
Galton's numerous and varied research activities was his interest in
human heredity. In the course of his investigations on heredity, Galton
realized the need for measuring the characteristics of related and
unrelated persons. Only in this way could he discover, for example, the
exact degree of resemblance between parents and offspring, brothers and
sisters, cousins, or twins. With this end in view, Galton was instrumental
in inducing a number of educational institutions to keep systematic
anthropometric records on their students. He also set up an anthropometric
laboratory at the International Exposition of 1884 where, by paying
threepence, visitors could be measured in certain physical traits and
could take tests of keenness of vision and hearing, muscular strength,
reaction time, and other simple sensorimotor functions. When the exposition
closed, the laboratory was transferred to South Kensington Museum,
London, where it operated for six years. By such methods, the first
large, systematic body of data on individual differences in simple
psychological processes was gradually accumulated.
Galton himself devised most of the simple tests administered at his
anthropometric laboratory, many of which are still familiar either in their
original or in modified forms. Examples include the Galton bar for visual
discrimination of length, the Galton whistle for determining the highest
audible pitch, and graduated series of weights for measuring kinesthetic
discrimination. It was Galton's belief that tests of sensory discrimination
could serve as a means of gauging a person's intellect. In this respect, he
was partly influenced by the theories of Locke. Thus Galton wrote: "The
only information that reaches us concerning outward events appears to
pass through the avenue of our senses; and the more perceptive the senses
are of difference, the larger is the field upon which our judgment and
intelligence can act" (Galton, 1883, p. 27). Galton had also noted that
idiots tend to be defective in the ability to discriminate heat, cold, and
pain-an observation that further strengthened his conviction that sensory
discriminative capacity "would on the whole be highest among the
intellectually ablest" (Galton, 1883, p. 29).
Galton also pioneered in the application of rating-scale and questionnaire
methods, as well as in the use of the free association technique
subsequently employed for a wide variety of purposes. A further
contribution of Galton is to be found in his development of statistical methods
for the analysis of data on individual differences. Galton selected and
adapted a number of techniques previously derived by mathematicians.
These techniques he put in such form as to permit their use by the
mathematically untrained investigator who might wish to treat test
results quantitatively. He thereby extended enormously the application of
statistical procedures to the analysis of test data. This phase of Galton's
work has been carried forward by many of his students, the most eminent
of whom was Karl Pearson.
CATTELL AND THE EARLY "MENTAL TESTS"

An especially prominent position in the development of psychological
testing is occupied by the American psychologist James McKeen Cattell.
The newly established science of experimental psychology and the still
newer testing movement merged in Cattell's work. For his doctorate at
Leipzig, he completed a dissertation on individual differences in reaction
time, despite Wundt's resistance to this type of investigation. While
lecturing at Cambridge in 1888, Cattell's own interest in the measurement
of individual differences was reinforced by contact with Galton. On his
return to America, Cattell was active both in the establishment of
laboratories for experimental psychology and in the spread of the testing
movement.
In an article written by Cattell in 1890, the term "mental test" was
used for the first time in the psychological literature. This article
described a series of tests that were being administered annually to college
students in the effort to determine their intellectual level. The tests, which
had to be administered individually, included measures of muscular
strength, speed of movement, sensitivity to pain, keenness of vision and
of hearing, weight discrimination, reaction time, memory, and the like.
In his choice of tests, Cattell shared Galton's view that a measure of
intellectual functions could be obtained through tests of sensory
discrimination and reaction time. Cattell's preference for such tests was also
bolstered by the fact that simple functions could be measured with
precision and accuracy, whereas the development of objective measures
for the more complex functions seemed at that time a well-nigh hopeless
task.
Cattell's tests were typical of those to be found in a number of test series developed during the last decade of the nineteenth century. Such test series were administered to schoolchildren, college students, and miscellaneous adults. At the Columbian Exposition held in Chicago in 1893, Jastrow set up an exhibit at which visitors were invited to take tests of sensory, motor, and simple perceptual processes and to compare their skill with the norms (J. Peterson, 1926; Philippe, 1894). A few attempts to evaluate such early tests yielded very discouraging results. The individual's performance showed little correspondence from one test to another (Sharp, 1898-1899; Wissler, 1901), and it exhibited little or no relation to independent estimates of intellectual level based on teachers' ratings (Bolton, 1891-1892; J. A. Gilbert, 1894) or academic grades (Wissler, 1901).
A number of test series assembled by European psychologists of the period tended to cover somewhat more complex functions. Kraepelin (1895), who was interested primarily in the clinical examination of psychiatric patients, prepared a long series of tests to measure what he regarded as basic factors in the characterization of an individual. The tests, employing chiefly simple arithmetic operations, were designed to measure practice effects, memory, and susceptibility to fatigue and to distraction. A few years earlier, Oehrn (1889), a pupil of Kraepelin, had employed tests of perception, memory, association, and motor functions in an investigation on the interrelations of psychological functions. Another German psychologist, Ebbinghaus (1897), administered tests of arithmetic computation, memory span, and sentence completion to schoolchildren. The most complex of the three tests, sentence completion, was the only one that showed a clear correspondence with the children's scholastic achievement.
Like Kraepelin, the Italian psychologist Ferrari and his students were interested primarily in the use of tests with pathological cases (Guicciardi & Ferrari, 1896). The test series they devised ranged from physiological measures and motor tests to apprehension span and the interpretation of pictures. In an article published in France in 1895, Binet and Henri criticized most of the available test series as being too largely sensory and as concentrating unduly on simple, specialized abilities. They argued further that, in the measurement of the more complex functions, great precision is not necessary, since individual differences are larger in these functions. An extensive and varied list of tests was proposed, covering such functions as memory, imagination, attention, comprehension, suggestibility, aesthetic appreciation, and many others. In these tests we can recognize the trends that were eventually to lead to the development of the famous Binet intelligence scales.
Binet and his co-workers devoted many years to active and ingenious research on ways of measuring intelligence. Many approaches were tried, including even the measurement of cranial, facial, and hand form, and the analysis of handwriting. The results, however, led to a growing conviction that the direct, even though crude, measurement of complex intellectual functions offered the greatest promise. Then a specific situation arose that brought Binet's efforts to immediate practical fruition. In 1904, the Minister of Public Instruction appointed Binet to the previously cited commission to study procedures for the education of retarded children. It was in connection with the objectives of this commission that Binet, in collaboration with Simon, prepared the first Binet-Simon Scale (Binet & Simon, 1905).

This scale, known as the 1905 scale, consisted of 30 problems or tests arranged in ascending order of difficulty. The difficulty level was determined empirically by administering the tests to 50 normal children aged 3 to 11 years, and to some mentally retarded children and adults. The tests were designed to cover a wide variety of functions, with special emphasis on judgment, comprehension, and reasoning, which Binet regarded as essential components of intelligence. Although sensory and perceptual tests were included, a much greater proportion of verbal content was found in this scale than in most test series of the time. The 1905 scale was presented as a preliminary and tentative instrument, and no precise objective method for arriving at a total score was formulated.

In the second, or 1908, scale, the number of tests was increased, some unsatisfactory tests from the earlier scale were eliminated, and all tests were grouped into age levels on the basis of the performance of about 300 normal children between the ages of 3 and 13 years. Thus, in the 3-year level were placed all tests passed by 80 to 90 percent of normal 3-year-olds; in the 4-year level, all tests similarly passed by normal 4-year-olds; and so on to age 13. The child's score on the entire test could then be expressed as a mental level corresponding to the age of normal children whose performance he equaled. In the various translations and adaptations of the Binet scales, the term "mental age" was commonly substituted for "mental level." Since mental age is such a simple concept to grasp, the introduction of this term undoubtedly did much to popularize intelligence testing.* Binet himself, however, avoided the term "mental age" because of its unverified developmental implications and preferred the more neutral term "mental level" (T. H. Wolf, 1973).

* Goodenough (1949, pp. 50-51) notes that in 1887, 21 years before the appearance of the 1908 Binet-Simon Scale, S. E. Chaille published in the New Orleans Medical and Surgical Journal a series of tests for infants, arranged according to the age at which the tests are commonly passed. Partly because of the limited circulation of the journal and partly, perhaps, because the scientific community was not ready for it, the significance of this age-scale concept passed unnoticed at the time. Binet's own scale was influenced by the work of some of his contemporaries, notably Blin and Damaye, who prepared a set of oral questions from which they derived a single global score for each child (T. H. Wolf, 1973).

A third revision of the Binet-Simon Scale appeared in 1911, the year of Binet's untimely death. In this scale, no fundamental changes were introduced. Minor revisions and relocations of specific tests were instituted. More tests were added at several year levels, and the scale was extended to the adult level.

Even prior to the 1908 revision, the Binet-Simon tests attracted wide
attention among psychologists throughout the world. Translations and adaptations appeared in many languages. In America, a number of different revisions were prepared, the most famous of which is the one developed under the direction of L. M. Terman at Stanford University, and known as the Stanford-Binet (Terman, 1916). It was in this test that the intelligence quotient (IQ), or ratio between mental age and chronological age, was first used. The latest revision of this test is widely employed today and will be more fully considered in Chapter 9. Of special interest, too, is the first Kuhlmann-Binet revision, which extended the scale downward to the age level of 3 months (Kuhlmann, 1912). This scale represents one of the earliest efforts to develop preschool and infant tests of intelligence.
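The intelligence quotient just described is simple arithmetic: mental age divided by chronological age, times 100. A minimal sketch of the classical ratio follows; the function name and the sample ages are hypothetical, and modern tests report deviation IQs rather than this ratio.

```python
def ratio_iq(mental_age_months, chronological_age_months):
    """Classical ratio IQ: (mental age / chronological age) * 100.

    Both ages must be in the same unit (months here) so the ratio
    is unit-free.
    """
    return 100.0 * mental_age_months / chronological_age_months

# A child of 8 years (96 months) performing at the 10-year level (120 months):
print(ratio_iq(120, 96))  # -> 125.0
```

A child whose mental age equals his chronological age thus obtains an IQ of exactly 100, the definitional average.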
The Binet tests, as well as all their revisions, are individual scales in the sense that they can be administered to only one person at a time. Many of the tests in these scales require oral responses from the subject or necessitate the manipulation of materials. Some call for individual timing of responses. For these and other reasons, such tests are not adapted to group administration. Another characteristic of the Binet type of test is that it requires a highly trained examiner. Such tests are essentially clinical instruments, suited to the intensive study of individual cases.

Group testing, like the first Binet scale, was developed to meet a pressing practical need. When the United States entered World War I in 1917, a committee was appointed by the American Psychological Association to consider ways in which psychology might assist in the conduct of the war. This committee, under the direction of Robert M. Yerkes, recognized the need for the rapid classification of the million and a half recruits with respect to general intellectual level. Such information was relevant to many administrative decisions, including rejection or discharge from military service, assignment to different types of service, or admission to officer-training camps. It was in this setting that the first group intelligence test was developed. In this task, the Army psychologists drew on all available test materials, and especially on an unpublished group intelligence test prepared by Arthur S. Otis, which he turned over to the Army. A major contribution of Otis's test, which he designed while a student in one of Terman's graduate courses, was the introduction of multiple-choice and other "objective" item types.

The tests finally developed by the Army psychologists came to be known as the Army Alpha and the Army Beta. The former was designed for general routine testing; the latter was a nonlanguage scale employed with illiterates and with foreign-born recruits who were unable to take a test in English. Both tests were suitable for administration to large groups.

Shortly after the termination of World War I, the Army tests were released for civilian use. Not only did the Army Alpha and Army Beta themselves pass through many revisions, the latest of which are even now in use, but they also served as models for most group intelligence tests. The testing movement underwent a tremendous spurt of growth. Soon group intelligence tests were being devised for all ages and types of persons, from preschool children to graduate students. Large-scale testing programs, previously impossible, were now being launched with zestful optimism. Because group tests were designed as mass testing instruments, they not only permitted the simultaneous examination of large groups but also simplified the instructions and administration procedures so as to demand a minimum of training on the part of the examiner. Schoolteachers began to give intelligence tests to their classes. College students were routinely examined prior to admission. Extensive studies of special adult groups, such as prisoners, were undertaken. And soon the general public became IQ-conscious.

The application of such group intelligence tests far outran their technical improvement. That the tests were still crude instruments was often forgotten in the rush of gathering scores and drawing practical conclusions from the results. When the tests failed to meet unwarranted expectations, skepticism and hostility toward all testing often resulted. Thus, the testing boom of the twenties, based on the indiscriminate use of tests, may have done as much to retard as to advance the progress of psychological testing.
Although intelligence tests were originally designed to sample a wide variety of functions in order to estimate the individual's general intellectual level, it soon became apparent that such tests were quite limited in their coverage. Not all important functions were represented. In fact, most intelligence tests were primarily measures of verbal ability and, to a lesser extent, of the ability to handle numerical and other abstract and symbolic relations. Gradually psychologists came to recognize that the term "intelligence test" was a misnomer, since only certain aspects of intelligence were measured by such tests.
To be sure, the tests covered abilities that are of prime importance in our culture. But it was realized that more precise designations, in terms of the type of information these tests are able to yield, would be preferable. For example, a number of tests that would probably have been called intelligence tests during the twenties later came to be known as scholastic aptitude tests. This shift in terminology was made in recognition of the fact that many so-called intelligence tests measure that combination of abilities demanded by academic work.
Even prior to World War I, psychologists had begun to recognize the need for tests of special aptitudes to supplement the global intelligence tests. These special aptitude tests were developed particularly for use in vocational counseling and in the selection and classification of industrial and military personnel. Among the most widely used are tests of mechanical, clerical, musical, and artistic aptitudes.

The critical evaluation of intelligence tests that followed their widespread and indiscriminate use during the twenties also revealed another noteworthy fact: an individual's performance on different parts of such a test often showed marked variation. This was especially apparent on group tests, in which the items are commonly segregated into subtests of relatively homogeneous content. For example, a person might score relatively high on a verbal subtest and low on a numerical subtest, or vice versa. To some extent, such internal variability is also discernible on a test like the Stanford-Binet, in which, for example, all items involving words might prove difficult for a particular individual, whereas items employing pictures or geometric diagrams may place him at an advantage.

Test users, and especially clinicians, frequently utilized such intercomparisons in order to obtain more insight into the individual's psychological make-up. Thus, not only the IQ or other global score but also scores on subtests would be examined in the evaluation of the individual case. Such a practice is not to be generally recommended, however, because intelligence tests were not designed for the purpose of differential aptitude analysis. Often the subtests being compared contain too few items to yield a stable or reliable estimate of a specific ability. As a result, the obtained difference between subtest scores might be reversed if the individual were retested on a different day or with another form of the same test. If such intraindividual comparisons are to be made, tests are needed that are specially designed to reveal differences in performance in various functions.

While the practical application of tests demonstrated the need for differential aptitude tests, a parallel development in the study of trait organization was gradually providing the means for constructing such tests. Statistical studies on the nature of intelligence had been exploring the interrelations among scores obtained by many persons on a wide variety of different tests. Such investigations were begun by the English psychologist Charles Spearman (1904, 1927) during the first decade of the
present century. Subsequent methodological developments, based on the work of such American psychologists as T. L. Kelley (1928) and L. L. Thurstone (1935, 1947), as well as on that of other American and English investigators, have come to be known as "factor analysis."
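Factor analysis starts from exactly the kind of data Spearman examined: the intercorrelations among scores obtained by the same persons on many tests. A minimal sketch of computing such a correlation table is given below; the test names and all scores are invented for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

scores = {  # invented scores of five persons on three hypothetical tests
    "vocabulary": [10, 12, 14, 9, 15],
    "analogies":  [11, 13, 15, 8, 14],
    "digit span": [7, 9, 8, 6, 10],
}
names = list(scores)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        print(f"r({a}, {b}) = {pearson_r(scores[a], scores[b]):.2f}")
```

In practice one would then factor-analyze the resulting matrix to extract the relatively independent traits the text describes; here only the correlations themselves are computed.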
The contributions that the methods of factor analysis have made to test construction will be more fully examined and illustrated in Chapter 13. For the present, it will suffice to note that the data gathered by such procedures have indicated the presence of a number of relatively independent factors, or traits. Some of these traits were represented, in varying proportions, in the traditional intelligence tests. Verbal comprehension and numerical reasoning are examples of this type of trait. Others, such as spatial, perceptual, and mechanical aptitudes, were found more often in special aptitude tests than in intelligence tests.

One of the chief practical outcomes of factor analysis was the development of multiple aptitude batteries. These batteries are designed to provide a measure of the individual's standing in each of a number of traits. In place of a total score or IQ, a separate score is obtained for such traits as verbal comprehension, numerical aptitude, spatial visualization, arithmetic reasoning, and perceptual speed. Such batteries thus provide a suitable instrument for making the kind of intraindividual analysis, or differential diagnosis, that clinicians had been trying for many years to obtain, with crude and often erroneous results, from intelligence tests. These batteries also incorporate into a comprehensive and systematic testing program much of the information formerly obtained from special aptitude tests, since the multiple aptitude batteries cover some of the traits not ordinarily included in intelligence tests.
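The contrast between a global IQ and a battery profile can be sketched directly: each subtest raw score is expressed relative to its own normative mean and standard deviation, yielding one standardized score per trait rather than a single composite. The trait names follow the text, but the norms and the examinee's scores below are invented for illustration.

```python
# Normative means and standard deviations per trait (invented values).
norms = {
    "verbal comprehension":  (50.0, 10.0),
    "numerical aptitude":    (40.0, 8.0),
    "spatial visualization": (30.0, 6.0),
}

def profile(raw_scores):
    """Return a trait -> z-score profile for one examinee,
    standardizing each raw score against that trait's own norms."""
    return {trait: (raw_scores[trait] - mean) / sd
            for trait, (mean, sd) in norms.items()}

examinee = {"verbal comprehension": 60.0,
            "numerical aptitude": 36.0,
            "spatial visualization": 33.0}
for trait, z in profile(examinee).items():
    print(f"{trait}: z = {z:+.2f}")
```

The printed profile makes the intraindividual variation visible at a glance, which is precisely what a single global score conceals.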
Multiple aptitude batteries represent a relatively late development in the testing field. Nearly all have appeared since 1945. In this connection, the work of the military psychologists during World War II should also be noted. Much of the test research conducted in the armed services was based on factor analysis and was directed toward the construction of multiple aptitude batteries. In the Air Force, for example, special batteries were constructed for pilots, bombardiers, radio operators, range finders, and scores of other military specialists. A report of the batteries prepared in the Air Force alone occupies at least nine of the nineteen volumes devoted to the aviation psychology program during World War II (Army Air Forces, 1947-1948). Research along these lines is still in progress under the sponsorship of various branches of the armed services. A number of multiple aptitude batteries have likewise been developed for civilian use and are being widely applied in educational and vocational counseling and in personnel selection and classification. Examples of such batteries will be discussed in Chapter 13.

To avoid confusion, a point of terminology should be clarified. The term "aptitude test" has been traditionally employed to refer to tests measuring relatively homogeneous and clearly defined segments of ability, whereas the term "intelligence test" customarily refers to more heterogeneous tests yielding a single global score such as an IQ. Special aptitude tests typically measure a single aptitude. Multiple aptitude batteries measure a number of aptitudes but provide a profile of scores, one for each aptitude.
While psychologists were busy developing intelligence and aptitude tests, traditional school examinations were undergoing a number of technical improvements (Caldwell & Courtis, 1923; Ebel & Damrin, 1960). An important step in this direction was taken by the Boston public schools in 1845, when written examinations were substituted for the oral interrogation of students by visiting examiners. Commenting on this innovation, Horace Mann cited arguments remarkably similar to those used much later to justify the replacement of essay questions by objective multiple-choice items. The written examinations, Mann noted, put all students in a uniform situation, permitted a wider coverage of content, reduced the chance element in question choice, and eliminated the possibility of favoritism on the examiner's part.

After the turn of the century, the first standardized tests for measuring the outcomes of school instruction began to appear. Spearheaded by the work of E. L. Thorndike, these tests utilized measurement principles developed in the psychological laboratory. Examples include scales for rating the quality of handwriting and written compositions, as well as tests in spelling, arithmetic computation, and arithmetic reasoning. Still later came the achievement batteries, initiated by the publication of the first edition of the Stanford Achievement Test in 1923. Its authors were three early leaders in test development: Truman L. Kelley, Giles M. Ruch, and Lewis M. Terman. Foreshadowing many characteristics of modern testing, this battery provided comparable measures of performance in different school subjects, evaluated in terms of a single normative group.

At the same time, evidence was accumulating regarding the lack of agreement among teachers in grading essay tests. By 1930 it was widely recognized that essay tests were not only more time-consuming for examiners and examinees, but also yielded less reliable results than the "new type" of objective items. As the latter came into increasing use in standardized achievement tests, there was a growing emphasis on the design of items to test the understanding and application of knowledge and other broad educational objectives. The decade of the 1930s also witnessed the introduction of test-scoring machines, for which the new objective tests could be readily adapted.

The establishment of statewide, regional, and national testing programs was another noteworthy parallel development. Probably the best known of these programs is that of the College Entrance Examination Board (CEEB). Established at the turn of the century to reduce duplication in the examining of entering college freshmen, this program has undergone profound changes in its testing procedures and in the number and nature of participating colleges, changes that reflect intervening developments in both testing and education. In 1947, the testing functions of the CEEB were merged with those of the Carnegie Corporation and the American Council on Education to form Educational Testing Service (ETS). In subsequent years, ETS has assumed responsibility for a growing number of testing programs on behalf of universities, professional schools, government agencies, and other institutions. Mention should also be made of the American College Testing Program, established in 1959 to screen applicants to colleges not included in the CEEB program, and of several national testing programs for the selection of highly talented students for scholarship awards.

Achievement tests are used not only for educational purposes but also in the selection of applicants for industrial and government jobs. Mention has already been made of the systematic use of civil service examinations in the Chinese empire, dating from 1115 B.C. In modern times, selection of government employees by examination was introduced in European countries in the late eighteenth and early nineteenth centuries. The United States Civil Service Commission installed competitive examinations as a regular procedure in 1883 (Kavruck, 1956). Test construction techniques developed during and prior to World War I were introduced into the examination program of the United States Civil Service with the appointment of L. J. O'Rourke as director of the newly established research division in 1922.

As more and more psychologists trained in psychometrics participated in the construction of standardized achievement tests, the technical aspects of achievement tests increasingly came to resemble those of intelligence and aptitude tests. Procedures for constructing and evaluating all these tests have much in common. The increasing efforts to prepare achievement tests that would measure the attainment of broad educational goals, as contrasted to the recall of factual minutiae, also made the content of achievement tests resemble more closely that of intelligence tests. Today the difference between these two types of tests is chiefly one of degree of specificity of content and extent to which the test presupposes a designated course of prior instruction.
Another area of psychological testing is concerned with the affective or nonintellectual aspects of behavior. Tests designed for this purpose are commonly known as personality tests, although some psychologists prefer to use the term personality in a broader sense, to refer to the entire individual. Intellectual as well as nonintellectual traits would thus be included under this heading. In the terminology of psychological testing, however, the designation "personality test" most often refers to measures of such characteristics as emotional adjustment, interpersonal relations, motivation, interests, and attitudes.

An early precursor of personality testing may be recognized in Kraepelin's use of the free association test with abnormal patients. In this test, the subject is given specially selected stimulus words and is required to respond to each with the first word that comes to mind. Kraepelin (1892) also employed this technique to study the psychological effects of fatigue, hunger, and drugs and concluded that all these agents increase the relative frequency of superficial associations. Sommer (1894), also writing during the last decade of the nineteenth century, suggested that the free association test might be used to differentiate between the various forms of mental disorder. The free association technique has subsequently been utilized for a variety of testing purposes and is still currently employed. Mention should also be made of the work of Galton, Pearson, and Cattell in the development of standardized questionnaire and rating-scale techniques. Although originally devised for other purposes, these procedures were eventually employed by others in constructing some of the most common types of current personality tests.

The prototype of the personality questionnaire, or self-report inventory, is the Personal Data Sheet developed by Woodworth during World War I (DuBois, 1970; Symonds, 1931, ch. 5; Goldberg, 1971). This test was designed as a rough screening device for identifying seriously neurotic men who would be unfit for military service. The inventory consisted of a number of questions dealing with common neurotic symptoms, which the individual answered about himself. A total score was obtained by counting the number of symptoms reported. The Personal Data Sheet was not completed early enough to permit its operational use before the war ended. Immediately after the war, however, civilian forms were prepared, including a special form for use with children. The Woodworth Personal Data Sheet, moreover, served as a model for most subsequent emotional adjustment inventories. In some of these questionnaires, an attempt was made to subdivide emotional adjustment into more specific forms, such as home adjustment, school adjustment, and vocational adjustment. Other tests concentrated more intensively on a narrower area of behavior or were concerned with more distinctly social responses, such as dominance-submission in interpersonal contacts. A later development was the construction of tests for quantifying the expression of interests and attitudes. These tests, too, were based essentially on questionnaire techniques.

Another approach to the measurement of personality is through the application of performance or situational tests. In such tests, the subject has a task to perform whose purpose is often disguised. Most of these tests simulate everyday-life situations quite closely. The first extensive application of such techniques is to be found in the tests developed in the late twenties and early thirties by Hartshorne, May, and their associates (1928, 1929, 1930). This series, standardized on schoolchildren, was concerned with such behavior as cheating, lying, stealing, cooperativeness, and persistence. Objective, quantitative scores could be obtained on each of a large number of specific tests. A more recent illustration, for the adult level, is provided by the series of situational tests developed during World War II in the Assessment Program of the Office of Strategic Services (OSS, 1948). These tests were concerned with relatively complex and subtle social and emotional behavior and required rather elaborate facilities and trained personnel for their administration. The interpretation of the subject's responses, moreover, was relatively subjective.

Projective techniques represent a third approach to the study of personality and one that has shown phenomenal growth, especially among clinicians. In such tests, the subject is given a relatively unstructured task that permits wide latitude in its solution. The assumption underlying such methods is that the individual will project his characteristic modes of response into such a task. Like the performance and situational tests, projective techniques are more or less disguised in their purpose, thereby reducing the chances that the subject can deliberately create a desired impression. The previously cited free association test represents one of the earliest types of projective techniques. Sentence-completion tests have also been used in this manner. Other tasks commonly employed in projective techniques include drawing, arranging toys to create a scene, extemporaneous dramatic play, and interpreting pictures or inkblots.

All available types of personality tests present serious difficulties, both practical and theoretical. Each approach has its own special advantages and disadvantages. On the whole, personality testing has lagged far behind aptitude testing in its positive accomplishments. But such lack of progress is not to be attributed to insufficient effort. Research on the measurement of personality has attained impressive proportions since 1920, and many ingenious devices and technical improvements are under investigation. It is rather the special difficulties encountered in the measurement of personality that account for the slow advances in this area.
instruments can be found in A Sourcebook for Mental Health Measures (Comrey, Backer, & Glaser, 1973). Containing approximately 1,100 abstracts, this sourcebook includes tests, questionnaires, rating scales, and other devices for assessing both aptitude and personality variables in adults and children. Another similar reference is entitled Measures for Psychological Assessment (Chun, Cobb, & French, 1975). For each of 3,000 measures, this volume gives the original source as well as an annotated bibliography of the studies in which the measure was subsequently used. The entries were located through a search of 26 measurement-related journals for the years 1960 to 1970.
Information on assessment devices suitable for children from birth to 12 years is summarized in Tests and Measurements in Child Development: A Handbook (Johnson & Bommarito, 1971). Covering only tests not listed in the MMY, this handbook describes instruments located through an intensive journal search spanning a ten-year period. Selection criteria included availability of the test to professionals, adequate instructions for administration and scoring, sufficient length, and convenience of use (i.e., not requiring expensive or elaborate equipment). A still more specialized collection covers measures of social and emotional development applicable to children between the ages of 3 and 6 years (Walker, 1973).
Finanv, it should be noted that the most direct source of information
regardiI;!!: specific curr~ltksts is pro\'ided h~' the catalo~t1cs of tcst pub-
lIshers and b~' tht· mannal that accompani0s ('ach test. A comprehensive
list of test publishers, \\'ith addresses, can be found in the lates't Mell/al
M el/S/lTcmcnfs rearl)()ok~ For reach' reference, the namt's and nddrt'sses
of some of the largt'r .-\merican p'uhlishers and distributors of psycho-
logical tests are gi\'en in AppendiX D. Cltalog\1('s of current tests can be
obtained from each of these publishers on requcst. :\lanuals and speci-
men sets of tests can be purchased hy qualified users.
The test manual should provide the ('ssential infurmation required for
administering, scoring. and evaluating a particular test. In it should be
found full and detailed instructions, scoring key, norms, and data on re-
Iiahilit~, and validity. :\fo!'E'over, the manual should report the number
and nature of subjects on whom lIonns, reliahilit~·. and validity were
est~b~ished, the methods employed in computing indices of reliability and
valIdity, and the specific criteria against which validity was checked. In
~he e\'ent that the necessary information is too lengthy to fit conveniently
mto the manual, references to the printed sour<.:esin which such infor-
mation can be readily located should be given. The manual should, in
other. words, enable the test user to evaluate the ·test before choosing it
for IllS specific purpose. It might be added that ma~y test manuals still
fa!1 short of this goal. But some of the larger ancl more professionally
onented test publishers are giving increasillg attention to the preparation
Psychological testing is in a state of rapid chan~e. There are shifting
oriel;tations, a constant stream of new tests, revisc>dforms of old tests, and
additional data that mav refine or alter the interpretation of scores on
existing tests. The accelerating rate of <:hange, together with ~he vast
number uf available tests, makes it impracticable to sun'ey speCific tests
in any single text. \lore intensive coverage of testing instruments and
problems in special areas can be found in books dealing with the us~ of
tests in such fields as counseling. clinical practice, personnel selection,
and education. References to such publications are given in the appropri-
ate chapters of this book. In order to keep abreast of current develop-
ments, however, anyone working with tests needs to be familiar with
IlUoredirect sources of contemporary information about tests.
One of the most important sources is the series of Mental !Ifeasurements
)'eaTbooks (MMY) edited hy Buros (19i2). Th('sc yearbooks cover nearly
all commercially available psychological, educational, and vocational tests
published in English. The coverage is especially .complete .for paper-~nd-
pencil tests. Eaeh yearbook includes tests publIshed dunng a speCified
period, thus supplementing rather than supplanting the earlier yearbooks.
The Ser,enth Mental Measurements rear7JOok, for example, is concernedprincipally with tests appearing bet\\'een 1964 and 1~70. Tests. of con-
tinuing interest, however, may be reviewed r~peat('dly m StH.·cesSlyey~ar-
hooks, as nt'w data accumulate from pertment research. The earhest
publications in this series were merely bi~)liographies of tests: B~ginning
in ]9,38,however, the ),earbook assumed Its ('UlTt'I\t form, wlll(:h llldudes
critical reviews of most of the tests by one or more test experts, as well
as a complete list of published references pertailling to each lest. .Routine
information regarding poblisher, -price, forms, and age of subjects for
whom the test is suitable is also regularly giv('n.A comprehensive bibliography covering all types of published tests
available in English-speaking countries is provided by Te:~ts in Print(Buras, 1974). Two related sources are Reading Tests and Reviett;~
(Bums, 1968) and Personality Tests and Reviews (Buras, 11970). Both
include a numbeF'~9f tests not found in any volume of the MMY, as well
as master indexes'that facilitate the location of tests in the :\1\1Y. Reviews
of specific tests are also published in several Ilsychological and educa-
tional journals, such as the Journal of Educational Measurement and the
JOllrnal of Counseling Psyc1101ogy.Since I9iO several sourcebooks have appeared which provide informa-
tion about u~published or little known instruments, largely supplement-
ing the material listed in the MMY. A comprehensive survey of such
22 Context of Psyc11010gical Testing
ofmanuals that meet adequate scientific standards. An enlightened PU?-lie of test users provides the firmest assurance that such standal'ds wIll
be maintained and improved in the future.. .A succinct but comprehensive guide for the evaluatwn of psy~hologlcal
tests is to be found in Standards for Educational arul Psyc11010glCal Tests
(1974), published by the American Psychological As~ocia~ion. These
standards represent a summary of recommended practices 111 test con-
struction based on the current state of knowledge in the field. They are
concerned with the information about validity, reliability, norms, and
other test characteristics that ought to be reported in the manual. In their
latest revision, the Standards also provide a guide for the proper use of
tests and for the correct interpretation and application of test results.
Relevant portions of the StQnda~ds "ill.be cited in the following chapters,
in connection with the appropnate tOpICS.
CHAPTER 2

Nature and Use of
Psychological Tests
THE HISTORICAL introduction in Chapter 1 has already suggested
some of the many uses of psychological tests, as well as the wide
diversity of available tests. Although the general public may still
associate psychological tests most closely with "IQ tests" and with tests
designed to detect emotional disorders, these tests represent only a small
proportion of the available types of instruments. The major categories of
psychological tests will be discussed and illustrated in Parts 3, 4, and 5,
which cover tests of general intellectual level, traditionally called intelli-
gence tests; tests of separate abilities, including multiple aptitude bat-
teries, tests of special aptitudes, and achievement tests; and personality
tests, concerned with measures of emotional and motivational traits, in-
terpersonal behavior, interests, attitudes, and other noncognitive char-
acteristics.
In the face of such diversity in nature and purpose, what are the
common differentiating characteristics of psychological tests? How do
psychological tests differ from other methods of gathering information
about individuals? The answer is to be found in certain fundamental
features of both the construction and use of tests. It is with these features
that the present chapter is concerned.
BEHAVIOR SAMPLE. A psychological test is essentially an objective
and standardized measure of a sample of behavior. Psychological tests
are like tests in any other science, insofar as observations are made on a
small but carefully chosen sample of an individual's behavior. In this
respect, the psychologist proceeds in much the same way as the chemist
who tests a patient's blood or a community's water supply by analyzing
one or more samples of it. If the psychologist wishes to test the extent
of a child's vocabulary, a clerk's ability to perform arithmetic computa-
tions, or a pilot's eye-hand coordination, he examines their performance
with a representative set of words, arithmetic problems, or motor tests.
Whether or not the test adequately covers the behavior under con-
sideration obviously depends on the number and nature of items in the
sample. For example, an arithmetic test consisting of only five problems,
or one including only multiplication items, would be a poor measure of
the individual's computational skill. A vocabulary test composed entirely
of baseball terms would hardly provide a dependable estimate of a
child's total range of vocabulary.
The diagnostic or predictive value of a psychological test depends on
the degree to which it serves as an indicator of a relatively broad and
significant area of behavior. Measurement of the behavior sample directly
covered by the test is rarely, if ever, the goal of psychological testing.
The child's knowledge of a particular list of 50 words is not, in itself, of
great interest. Nor is the job applicant's performance on a specific set
of 20 arithmetic problems of much importance. If, however, it can be
demonstrated that there is a close correspondence between the child's
knowledge of the word list and his total mastery of vocabulary, or be-
tween the applicant's score on the arithmetic problems and his computa-
tional performance on the job, then the tests are serving their purpose.
It should be noted in this connection that the test items need not
resemble closely the behavior the test is to predict. It is only necessary
that an empirical correspondence be demonstrated between the two. The
degree of similarity between the test sample and the predicted behavior
may vary widely. At one extreme, the test may coincide completely with
a part of the behavior to be predicted. An example might be a foreign
vocabulary test in which the students are examined on 20 of the 50 new
words they have studied; another example is provided by the road test
taken prior to obtaining a driver's license. A lesser degree of similarity is
illustrated by many vocational aptitude tests administered prior to job
training, in which there is only a moderate resemblance between the
tasks performed on the job and those incorporated in the test. At the
other extreme one finds projective personality tests, such as the Rorschach
inkblot test, in which an attempt is made to predict from the subject's
associations to inkblots how he will react to other people, to emotionally
toned stimuli, and to other complex, everyday-life situations. Despite
their superficial differences, all these tests consist of samples of the indi-
vidual's behavior. And each must prove its worth by an empirically
demonstrated correspondence between the subject's performance on the
test and in other situations.
Whether the term "diagnosis" or the term "prediction" is employed in
this connection also represents a minor distinction. Prediction commonly
connotes a temporal estimate, the individual's future performance on a
job, for example, being forecast from his present test performance. In a
broader sense, however, even the diagnosis of present condition, such as
mental retardation or emotional disorder, implies a prediction of what
the individual will do in situations other than the present test. It is
logically simpler to consider all tests as behavior samples from which
predictions regarding other behavior can be made. Different types of
tests can then be characterized as variants of this basic pattern.
Another point that should be considered at the outset pertains to the
concept of capacity. It is entirely possible, for example, to devise a test
for predicting how well an individual can learn French before he has
even begun the study of French. Such a test would involve a sample of
the types of behavior required to learn the new language, but would in
itself presuppose no knowledge of French. It could then be said that
this test measures the individual's "capacity" or "potentiality" for learn-
ing French. Such terms should, however, be used with caution in refer-
ence to psychological tests. Only in the sense that a present behavior
sample can be used as an indicator of other, future behavior can we
speak of a test measuring "capacity." No psychological test can do more
than measure behavior. Whether such behavior can serve as an effective
index of other behavior can be determined only by empirical try-out.
STANDARDIZATION. It will be recalled that in the initial definition a psy-
chological test was described as a standardized measure. Standardization
implies uniformity of procedure in administering and scoring the test. If
the scores obtained by different individuals are to be comparable, testing
conditions must obviously be the same for all. Such a requirement is only
a special application of the need for controlled conditions in all scientific
observations. In a test situation, the single independent variable is
usually the individual being tested.
In order to secure uniformity of testing conditions, the test constructor
provides detailed directions for administering each newly developed test.
The formulation of such directions is a major part of the standardization
of a new test. Such standardization extends to the exact materials em-
ployed, time limits, oral instructions to subjects, preliminary demonstra-
tions, ways of handling queries from subjects, and every other detail of
the testing situation. Many other, more subtle factors may influence the
subject's performance on certain tests. Thus, in giving instructions or
presenting problems orally, consideration must be given to the rate of
speaking, tone of voice, inflection, pauses, and facial expression. In a
test involving the detection of absurdities, for example, the correct an-
swer may be given away by smiling or pausing when the crucial word
is read. Standardized testing procedure, from the examiner's point of
view, will be discussed further in a later section of this chapter dealing
with problems of test administration.
Another important step in the standardization of a test is the establish-
ment of norms. Psychological tests have no predetermined standards of
passing or failing; an individual's score is evaluated by comparing it with
the scores obtained by others. As its name implies, a norm is the normal
or average performance. Thus, if normal 8-year-old children complete
12 out of 50 problems correctly on a particular arithmetic reasoning test,
then the 8-year-old norm on this test corresponds to a score of 12. The
latter is known as the raw score on the test. It may be expressed as
number of correct items, time required to complete a task, number of
errors, or some other objective measure appropriate to the content of the
test. Such a raw score is meaningless until evaluated in terms of a suitable
set of norms.
In the process of standardizing a test, it is administered to a large,
representative sample of the type of subjects for whom it is designed.
This group, known as the standardization sample, serves to establish the
norms. Such norms indicate not only the average performance but also
the relative frequency of varying degrees of deviation above and below
the average. It is thus possible to evaluate different degrees of superiority
and inferiority. The specific ways in which such norms may be expressed
will be considered in Chapter 4. All permit the designation of the indi-
vidual's position with reference to the normative or standardization
sample.
It might also be noted that norms are established for personality tests
in essentially the same way as for aptitude tests. The norm on a person-
ality test is not necessarily the most desirable or "ideal" performance,
any more than a perfect or errorless score is the norm on an aptitude
test. On both types of tests, the norm corresponds to the performance of
typical or average individuals. On dominance-submission tests, for ex-
ample, the norm falls at an intermediate point representing the degree
of dominance or submission manifested by the average individual.
Similarly, in an emotional adjustment inventory, the norm does not
ordinarily correspond to a complete absence of unfavorable or mal-
adaptive responses, since a few such responses occur in the majority of
"normal" individuals in the standardization sample. It is thus apparent
that psychological tests, of whatever type, are based on empirically
established norms.
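The logic of evaluating a raw score against a standardization sample can be sketched in a few lines of code. All scores below are invented for illustration; the function simply locates a raw score within a hypothetical standardization sample, in the spirit of the percentile norms taken up in Chapter 4.

```python
# Sketch: locating a raw score within a standardization sample.
# The sample scores below are invented for illustration only.
standardization_sample = [5, 7, 8, 9, 10, 11, 12, 12, 13, 14, 15, 16, 18, 20, 22]

def percentile_rank(raw_score, sample):
    """Percentage of the standardization sample scoring below raw_score."""
    below = sum(1 for s in sample if s < raw_score)
    return 100.0 * below / len(sample)

# A raw score of 12 is meaningless in itself; compared with the sample,
# it falls near the middle of the distribution, while 20 falls well above.
middle = percentile_rank(12, standardization_sample)
high = percentile_rank(20, standardization_sample)
```

The same raw score would earn a very different percentile rank against a different standardization sample, which is why the composition of that sample matters so much.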
OBJECTIVE MEASUREMENT OF DIFFICULTY. Reference to the definition
of a psychological test with which this discussion opened will show that
such a test was characterized as an objective as well as a standardized
measure. In what specific ways are such tests objective? Some aspects of
the objectivity of psychological tests have already been touched on in
the discussion of standardization. Thus, the administration, scoring, and
interpretation of scores are objective insofar as they are independent of
the subjective judgment of the individual examiner. Any one individual
should theoretically obtain the identical score on a test regardless of who
happens to be his examiner. This is not entirely so, of course, since per-
fect standardization and objectivity have not been attained in practice.
But at least such objectivity is the goal of test construction and has been
achieved to a reasonably high degree in most tests.

There are other major ways in which psychological tests can be prop-
erly described as objective. The determination of the difficulty level of an
item or of a whole test is based on objective, empirical procedures. When
Binet and Simon prepared their original, 1905 scale for the measurement
of intelligence, they arranged the 30 items of the scale in order of in-
creasing difficulty. Such difficulty, it will be recalled, was determined by
trying out the items on 50 normal and a few mentally retarded children.
The items correctly solved by the largest number of children were, ipso
facto, taken to be the easiest; those passed by relatively few children were
regarded as more difficult items. By this procedure, an empirical order
of difficulty was established. This early example typifies the objective
measurement of difficulty level, which is now common practice in psycho-
logical test construction.

Not only the arrangement but also the selection of items for inclusion
in a test can be determined by the proportion of subjects in the trial
samples who pass each item. Thus, if there is a bunching of items at the
easy or difficult end of the scale, some items can be discarded. Similarly,
if items are sparse in certain portions of the difficulty range, new items
can be added to fill the gaps. More technical aspects of item analysis
will be considered in Chapter 8.

RELIABILITY. How good is this test? Does it really work? These ques-
tions could-and occasionally do-result in long hours of futile discus-
sion. Subjective opinions, hunches, and personal biases may lead, on the
one hand, to extravagant claims regarding what a particular test can
accomplish and, on the other hand, to stubborn rejection. The only way
questions such as these can be conclusively answered is by empirical
trial. The objective evaluation of psychological tests involves primarily
the determination of the reliability and the validity of the test in specified
situations.

As used in psychometrics, the term reliability always means consis-
tency. Test reliability is the consistency of scores obtained by the same
persons when retested with the identical test or with an equivalent form
of the test. If a child receives an IQ of 110 on Monday and an IQ of 80
when retested on Friday, it is obvious that little or no confidence can be
put in either score. Similarly, if in one set of 50 words an individual
identifies 40 correctly, whereas in another, supposedly equivalent set he
gets a score of only 20 right, then neither score can be taken as a de-
pendable index of his verbal comprehension. To be sure, in both illustra-
tions it is possible that only one of the two scores is in error, but this
could be demonstrated only by further retests. From the given data, we
can conclude only that both scores cannot be right. Whether one or
neither is an adequate estimate of the individual's ability in vocabulary
cannot be established without additional information.
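Retest consistency of this kind is ordinarily expressed as a correlation coefficient, the statistic behind the reliability measures treated in Chapter 5. A minimal sketch, with invented score pairs, shows the computation:

```python
# Sketch: test-retest reliability expressed as the Pearson correlation
# between two administrations of the same test. Scores are invented.
first_testing  = [110, 95, 102, 88, 120, 97, 105, 92]
second_testing = [108, 97, 100, 90, 118, 95, 107, 94]

def pearson_r(x, y):
    """Pearson product-moment correlation between two score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

# Here the two testings order the individuals almost identically, so the
# coefficient comes out close to +1; wildly inconsistent retest scores,
# like the IQs of 110 and 80 above, would pull it toward zero.
reliability = pearson_r(first_testing, second_testing)
```

The same computation serves for equivalent-form reliability; only the source of the second score list changes.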
Before a psychological test is released for general use, a thorough,
objective check of its reliability should be carried out. The different types
of test reliability, as well as methods of measuring each, will be con-
sidered in Chapter 5. Reliability can be checked with reference to
temporal fluctuations, the particular selection of items or behavior sample
constituting the test, the role of different examiners or scorers, and other
aspects of the testing situation. It is essential to specify the type of re-
liability and the method employed to determine it, because the same test
may vary in these different aspects. The number and nature of indi-
viduals on whom reliability was checked should likewise be reported.
With such information, the test user can predict whether the test will be
about equally reliable for the group with which he expects to use it, or
whether it is likely to be more reliable or less reliable.
VALIDITY. Undoubtedly the most important question to be asked about
any psychological test concerns its validity, i.e., the degree to which the
test actually measures what it purports to measure. Validity provides a
direct check on how well the test fulfills its function. The determination
of validity usually requires independent, external criteria of whatever the
test is designed to measure. For example, if a medical aptitude test is to
be used in selecting promising applicants for medical school, ultimate
success in medical school would be a criterion. In the process of validat-
ing such a test, it would be administered to a large group of students at
the time of their admission to medical school. Some measure of per-
formance in medical school would eventually be obtained for each stu-
dent on the basis of grades, ratings by instructors, success or failure in
completing training, and the like. Such a composite measure constitutes
the criterion with which each student's initial test score is to be correlated.
A high correlation, or validity coefficient, would signify that those indi-
viduals who scored high on the test had been relatively successful in
medical school, whereas those scoring low on the test had done poorly in
medical school. A low correlation would indicate little correspondence
between test score and criterion measure and hence poor validity for the
test. The validity coefficient enables us to determine how closely the
criterion performance could have been predicted from the test scores.
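Computationally, a validity coefficient is simply the correlation between test scores and the later criterion measure. The sketch below uses invented admission scores and criterion values, standing in for the medical-school example:

```python
# Sketch: a validity coefficient is the correlation between scores on the
# test and an independent criterion measure obtained later. All numbers
# here are invented for illustration.
test_at_admission = [55, 72, 60, 81, 47, 68, 75, 58]
criterion_measure = [2.1, 3.4, 2.6, 3.8, 1.9, 3.0, 3.5, 2.4]

def validity_coefficient(test, criterion):
    """Pearson correlation between test scores and criterion scores."""
    n = len(test)
    mt, mc = sum(test) / n, sum(criterion) / n
    cov = sum((t - mt) * (c - mc) for t, c in zip(test, criterion))
    var_t = sum((t - mt) ** 2 for t in test)
    var_c = sum((c - mc) ** 2 for c in criterion)
    return cov / (var_t * var_c) ** 0.5

# A coefficient near +1 means that high scorers tended to succeed on the
# criterion; a coefficient near 0 means the test predicts it poorly.
v = validity_coefficient(test_at_admission, criterion_measure)
```

Note that the criterion values here would be the composite measure described above (grades, instructor ratings, completion of training), gathered long after the test was given.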
In a similar manner, tests designed for other purposes can be validated
against appropriate criteria. A vocational aptitude test, for example, can
be validated against on-the-job success of a trial group of new employees.
A pilot aptitude battery can be validated against achievement in flight
training. Tests designed for broader and more varied uses are validated
against a number of criteria, and their validity can be established only by
the gradual accumulation of data from many different kinds of investiga-
tions.
The reader may have noticed an apparent paradox in the concept of
test validity. If it is necessary to follow up the subjects or in other ways
to obtain independent measures of what the test is trying to predict, why
not dispense with the test? The answer to this riddle is to be found in the
distinction between the validation group on the one hand and the groups
on which the test will eventually be employed for operational purposes
on the other. Before the test is ready for use, its validity must be estab-
lished on a representative sample of subjects. The scores of these persons
are not themselves employed for operational purposes but serve only in
the process of testing the test. If the test proves valid by this method, it
can then be used on other samples in the absence of criterion measures.
It might still be argued that we would need only to wait for the crite-
rion measure to mature, to become available, on any group in order to
obtain the information that the test is trying to predict. But such a pro-
cedure would be so wasteful of time and energy as to be prohibitive in
most instances. Thus, we could determine which applicants will succeed
on a job or which students will satisfactorily complete college by admit-
ting all who apply and waiting for subsequent developments! It is the
very wastefulness of this procedure-and its deleterious emotional im-
pact on individuals-that tests are designed to minimize. By means of
tests, the person's present level of prerequisite skills, knowledge, and
other relevant characteristics can be assessed with a determinable margin
of error. The more valid and reliable the test, the smaller will be this
margin of error.
The special problems encountered in determining the validity of dif-
ferent types of tests, as well as the specific criteria and statistical pro-
cedures employed, will be discussed in Chapters 6 and 7. One further
point, however, should be considered at this time. Validity tells us more
than the degree to which the test is fulfilling its function. It actually tells
us what the test is measuring. By studying the validation data, we can
objectively determine what the test is measuring. It would thus be more
accurate to define validity as the extent to which we know what the test
measures. The interpretation of test scores would undoubtedly be clearer
and less ambiguous if tests were regularly named in terms of the criterion
through which they had been validated. A tendency in this direction can
be recognized in such test labels as "scholastic aptitude test" and "per-
sonnel classification test" in place of the vague title "intelligence test."
REASONS FOR CONTROLLING THE USE OF
PSYCHOLOGICAL TESTS
"May I have a Stanford-Binet blank? My nephew has to take it next week for
admission to School X and I'd like to give him some practice so he can pass."

"To improve the reading program in our school, we need a culture-free IQ
test that measures each child's innate potential."

"Last night I answered the questions in an intelligence test published in a
magazine and I got an IQ of 80-I think psychological tests are silly."

"My roommate is studying psych. She gave me a personality test and I came
out neurotic. I've been too upset to go to class ever since."

"Last year you gave a new personality test to our employees for research pur-
poses. We would now like to have the scores for their personnel folders."
The above remarks are not imaginary. Each is based on a real incident,
and the list could easily be extended by any psychologist. Such remarks
illustrate potential misuses or misinterpretations of psychological tests in
such ways as to render the tests worthless or to hurt the individual. Like
any scientific instrument or precision tool, psychological tests must be
properly used to be effective. In the hands of either the unscrupulous or
the well-meaning but uninformed user, such tests can cause serious
damage.

There are two principal reasons for controlling the use of psychological
tests: (a) to prevent general familiarity with test content, which would
invalidate the test, and (b) to ensure that the test is used by a qualified
examiner. Obviously, if an individual were to memorize the correct re-
sponses on a test of color blindness, such a test would no longer be a
measure of color vision for him. Under these conditions, the test would
be completely invalidated. Test content clearly has to be restricted in
order to forestall deliberate efforts to fake scores.
In other cases, however, the effect of familiarity may be less obvious,
or the test may be invalidated in good faith by misinformed persons. A
schoolteacher, for example, may give her class special practice in prob-
lems closely resembling those on an intelligence test, "so that the pupils
will be well prepared to take the test." Such an attitude is simply a carry-
over from the usual procedure of preparing for a school examination.
When applied to an intelligence test, however, it is likely that such
specific training or coaching will raise the scores on the test without ap-
preciably affecting the broader area of behavior the test tries to sample.
Under such conditions, the validity of the test as a predictive instrument
is reduced.
The need for a qualified examiner is evident in each of the three major
aspects of the testing situation-selection of the test, administration and
scoring, and interpretation of scores. Tests cannot be chosen like lawn
mowers, from a mail-order catalogue. They cannot be evaluated by name,
author, or other easy marks of identification. To be sure, it requires no
psychological training to consider such factors as cost, bulkiness and ease
of transporting test materials, testing time required, and ease and rapidity
of scoring. Information on these practical points can usually be obtained
from a test catalogue and should be taken into account in planning a test-
ing program. For the test to serve its function, however, an evaluation of
its technical merits in terms of such characteristics as validity, reliability,
difficulty level, and norms is essential. Only in such a way can the test
user determine the appropriateness of any test for his particular purpose
and its suitability for the type of persons with whom he plans to use it.
The introductory discussion of test standardization earlier in this chap-
ter has already suggested the importance of a trained examiner. An ade-
quate realization of the need to follow instructions precisely, as well as a
thorough familiarity with the standard instructions, is required if the test
scores obtained by different examiners are to be comparable or if any one
individual's score is to be evaluated in terms of the published norms.
Careful control of testing conditions is also essential. Similarly, incorrect
or inaccurate scoring may render the test score worthless. In the absence
of proper checking procedures, scoring errors are far more likely to occur
than is generally realized.
The proper interpretation of test scores requires a thorough under-
standing of the test, the individual, and the testing conditions. What is
being measured can be objectively determined only by reference to the
specific procedures in terms of which the particular test was validated.
Other information, pertaining to reliability, nature of the group on which
norms were established, and the like, is likewise relevant. Some back-
ground data regarding the individual being tested are essential in inter-
preting any test score. The same score may be obtained by different per-
sons for very different reasons. The conclusions to be drawn from such
scores would therefore be quite dissimilar. Finally, some consideration
must also be given to special factors that may have influenced a particular
score, such as unusual testing conditions, temporary emotional or physical
state of the subject, and extent of the subject's previous experience with
tests.
The basic rationale of testing involves generalization from the behavior
sample observed in the testing situation to behavior manifested in other,
nontest situations. A test score should help us to predict how the client
will feel and act outside the clinic, how the student will achieve in col-
lege courses, and how the applicant will perform on the job. Any influ-
ences that are specific to the test situation constitute error variance and
reduce test validity. It is therefore important to identify any test-related
influences that may limit or impair the generalizability of test results.
A whole volume could easily be devoted to a discussion of desirable
procedures of test administration. But such a survey falls outside the
scope of the present book. Moreover, it is more practicable to acquire
such techniques within specific settings, because no one person would
normally be concerned with all forms of testing, from the examination
of infants to the clinical testing of psychotic patients or the administra-
tion of a mass testing program for military personnel. The present discus-
sion will therefore deal principally with the common rationale of test
administration rather than with specific questions of implementation. For
detailed suggestions regarding testing procedure, see Palmer (1970),
Sattler (1974), and Terman and Merrill (1960) for individual testing,
and Clemans (1971) for group testing.
ADVANCE PREPARATION OF EXAMINERS. The most important requirement
for good testing procedure is advance preparation. In testing there can
be no emergencies. Special efforts must therefore be made to foresee and
forestall emergencies. Only in this way can uniformity of procedure be
assured.
Advance preparation for the testing session takes many forms. Memorizing
the exact verbal instructions is essential in most individual testing.
Even in a group test in which the instructions are read to the subjects,
some previous familiarity with the statements to be read prevents mis-
reading and hesitation and permits a more natural, informal manner
during test administration. The preparation of test materials is another
important preliminary step. In individual testing and especially in the
administration of performance tests, such preparation involves the actual
layout of the necessary materials to facilitate subsequent use with a
minimum of search or fumbling. Materials should generally be placed on
a table near the testing table so that they are within easy reach of the
examiner but do not distract the subject. When apparatus is employed,
periodic checking and calibration may be necessary. In group
testing, all test blanks, answer sheets, special pencils, or other materials
needed should be carefully counted, checked, and arranged in advance
of the testing day.
Thorough familiarity with the specific testing procedure is another im-
portant prerequisite in both individual and group testing. For individual
testing, supervised training in the administration of the particular test is
usually essential. Depending upon the nature of the test and the type of
subjects to be examined, such training may require from a few demonstra-
tion and practice sessions to over a year of instruction. For group testing,
and especially in large-scale projects, such preparation may include
advance briefing of examiners and proctors, so that each is fully in-
formed about the functions he is to perform. In general, the examiner
reads the instructions, takes care of timing, and is in charge of the group
in anyone testing room. The proctors hand out and collect test materials,
make certain that subjects are following instructions, answer individual
questions of subjects within the limitations specified in the manual, and
prevent cheating.
TESTING CONDITIONS. Standardized procedure applies not only to verbal
instructions, timing, materials, and other aspects of the tests themselves
but also to the testing environment. Some attention should be given to
the selection of a suitable testing room. This room should be free from
undue noise and distraction and should provide adequate lighting, venti-
lation, and seating facilities. Special care should also be taken to prevent
interruptions during the test. Posting a
sign on the door to indicate that testing is in progress is effective, pro-
vided all personnel have learned that such a sign means no admittance
under any circumstances. In the testing of large groups, locking the doors
or posting an assistant outside each door may be necessary to prevent the
entrance of latecomers.
It is important to realize the extent to which testing conditions may
influence scores. Even apparently minor aspects of the testing situation
may appreciably alter performance. Such a factor as the use of desks or
of chairs with desk arms, for example, proved to be significant in a group
testing project with high school students, the groups using desks tending
to obtain higher scores (Kelley, 1943; Traxler & Hilkert, 1942). There is
also evidence to show that the answer sheet employed may affect test
scores (Bell, Hoff, & Hoyt, 1963). Since the establishment of in-
dependent test-scoring and data-processing agencies that provide their
own machine-scorable answer sheets, examiners sometimes administer
group tests with answer sheets other than those used in the standardiza-
tion sample. In the absence of empirical verification, the equivalence of
these answer sheets cannot be assumed. The Differential Aptitude Tests,
for example, may be administered with any of five different answer
sheets. On the Clerical Speed and Accuracy Test of this battery, separate
norms are provided for three of the five answer sheets, because they were
found to yield substantially different scores than those obtained with the
answer sheets used by the standardization sample.
In testing children below the fifth grade, the use of any separate answer
sheet may significantly lower test scores (Metropolitan Achievement Test
Special Report, 1975). At these grade levels, having the child mark the
answers in the test booklet itself is generally preferable.
Many other, more subtle testing conditions have been shown to affect
performance on ability as well as personality tests. Whether the ex-
aminer is a stranger or someone familiar to the subjects may make a
significant difference in test scores (Sacks, 1952; Tsudzuki, Hata, & Kuze,
1957). In another study, the general manner and behavior of the exam-
iner, as illustrated by smiling, nodding, and making such comments as
"good" or "fine," were shown to have a decided effect on test results
(Wickes, 1956). In a projective test requiring the subject to write stories
to fit given pictures, the presence of the examiner in the room tended to
inhibit the inclusion of strongly emotional content in the stories (Bern-
stein, 1956). In the administration of a typing test, job applicants typed
at a significantly faster rate when tested alone than when tested in groups
of two or more (Kirchner, 1966).
Examples could readily be multiplied. The implications are threefold.
First, follow standardized procedures to the minutest detail. It is the re-
sponsibility of the test author and publisher to describe such procedures
fully and clearly in the test manual. Second, record any unusual testing
conditions, however minor. Third, take testing conditions into account
when interpreting test results. In the intensive assessment of a person
through individual testing, an experienced examiner may occasionally de-
part from the standardized test procedure in order to elicit additional in-
formation for special reasons. When he does so, he can no longer in-
terpret the subject's responses in terms of the test norms. Under these
circumstances, the test stimuli are used only for qualitative exploration;
and the responses should be treated in the same way as any other infor-
mal behavioral observations or interview data.
RAPPORT. In psychometrics, the term "rapport" refers to the examiner's efforts
to arouse the subject's interest in the test, elicit his cooperation, and
ensure that he follows the standard test instructions. In ability tests, the
instructions call for careful concentration on the given tasks and for put-
ting forth one's best efforts to perform well; in personality inventories,
they call for frank and honest responses to questions about one's usual
behavior; in certain projective tests, they call for full reporting of associa-
tions evoked by the stimuli, without any censoring or editing of content.
Still other kinds of tests may require other approaches. But in all in-
stances, the examiner endeavors to motivate the subject to follow the
instructions as fully and conscientiously as he can.
The training of examiners covers techniques for the establishment of
rapport as well as those more directly related to test administration. In
establishing rapport, as in other testing procedures, uniformity of condi-
tions is essential for comparability of results. If a child is given a coveted
prize whenever he solves a test problem correctly, his performance can-
not be directly compared with the norms or with that of other children
who are motivated only with the standard verbal encouragement or
praise. Any deviation from standard motivating conditions for a particular
test should be noted and taken into account in interpreting performance.
Although rapport can be more fully established in individual testing,
steps can also be taken in group testing to motivate the subjects and re-
lieve their anxiety. Specific techniques for establishing rapport vary with
the nature of the test and with the age and other characteristics of the
subjects. In testing preschool children, special factors to be considered
include shyness with strangers, distractibility, and negativism. A friendly,
cheerful, and relaxed manner on the part of the examiner helps to reas-
sure the child. The shy, timid child needs more preliminary time to be-
come familiar with his surroundings. For this reason it is better for the
examiner not to be too demonstrative at the outset, but rather to wait
until the child is ready to make the first contact. Test periods should be
brief, and the tasks should be varied and intrinsically interesting to the
child. The testing should be presented to the child as a game and his
curiosity aroused before each new task is introduced. A certain flexibility
of procedure is necessary at this age level because of possible refusals,
loss of interest, and other manifestations of negativism.
Children in the first two or three grades of elementary school present
many of the same testing problems as the preschool child. The game ap-
proach is still the most effective way of arousing their interest in the test.
The older schoolchild can usually be motivated through an appeal to his
competitive spirit and his desire to do well on tests. When testing chil-
dren from educationally disadvantaged backgrounds or from different
cultures, however, the examiner cannot assume they will be motivated to
excel on academic tasks to the same extent as children in the standardiza-
tion sample. This problem and others pertaining to the testing of persons
with dissimilar experiential backgrounds will be considered further in
Chapters 3, 7, and 12.
. Special. motivational problems may be encountered in testing emo-
tionally disturbed persons, prisoners, or juvenile delinquents. Especially
when examined in an institutional setting, such persons are likely to
manifest a number of unfavorable attitudes, such as suspicion, insecurity,
fear, or cynical indifference. Abnormal conditions in their past experiences
are also likely to influence their test performance adversely. As a result
of early failures and frustrations in school, for example, they may have
developed feelings of hostility and inferiority toward academic tasks,
which the tests resemble. The experienced examiner makes special efforts
to establish rapport under these conditions. In any event, he must be
sensitive to these special difficulties and take them into account in inter-
preting and explaining test performance.
In testing any school-age child or adult, one should bear in mind that
every test presents an implied threat to the individual's prestige. Some
reassurance should therefore be given at the outset. It is helpful to ex-
plain, for example, that no one is expected to finish or to get all the items
correct. The individual might otherwise experience a mounting sense of
failure as he advances to the more difficult items or finds that he is un-
able to finish any subtest within the time allowed.
It is also desirable to eliminate the element of surprise from the test
situation as far as possible, because the unexpected and unknown are
likely to produce anxiety. Many group tests provide a preliminary ex-
planatory statement that is read to the group by the examiner. An even
better procedure is to announce the tests a few days in advance and to
give each subject a printed booklet that explains the purpose and nature
of the tests, offers general suggestions on how to take tests, and contains
a few sample items. Such explanatory booklets are regularly available to
participants in large-scale testing programs such as those conducted by
the College Entrance Examination Board (1974a, 1974b). The United
States Employment Service has likewise developed a booklet on how to
take tests, as well as a more extensive pretesting orientation technique
for use with culturally disadvantaged applicants unfamiliar with tests.
More general orientation booklets are also available, such as Meeting
the Test (Anderson, Katz, & Shimberg, 1965). A tape recording and two
booklets are combined in Test Orientation Procedure (TOP), designed
specifically for job applicants with little prior testing experience (Ben-
nett & Doppelt, 1967). The first booklet, used together with the tape,
provides general information on how to take tests; the second contains
practice tests. In the absence of a tape recorder, the examiner may read
the instructions from a printed script.
Adult testing presents some additional problems. Unlike the school-
child, the adult is not so likely to work hard at a task merely because it is
assigned to him. It therefore becomes more important to "sell" the pur-
pose of the tests to the adult, although high school and college students
also respond to such an appeal. Cooperation of the examinee can usually
be secured by convincing him that it is in his own interests to obtain a
valid score, i.e., a score correctly indicating what he can do rather than
overestimating or underestimating his abilities. Most persons will under-
stand that an incorrect decision, which might result from invalid test
scores, would mean subsequent failure, loss of time, and frustration for
them. This approach can serve not only to motivate the individual to
try his best on ability tests but also to reduce faking and encourage frank
reporting on personality inventories, because the examinee realizes that
he himself would otherwise be the loser. It is certainly not in the best
interests of the individual to be admitted to a course of study for which
he is not qualified or assigned to a job he cannot perform or that he
would find uncongenial.
Many of the practices designed to enhance rapport serve also to reduce
test anxiety. Procedures tending to dispel surprise and strangeness from
the testing situation and to reassure and encourage the subject should
certainly help to lower anxiety. The examiner's own manner and a well-
organized, smoothly running testing operation will contribute toward the
same goal. Individual differences in test anxiety have been studied with
both schoolchildren and college students (Gaudry & Spielberger, 1974;
Spielberger, 1972). Much of this research was initiated by Sarason and
his associates at Yale (Sarason, Davidson, Lighthall, Waite, & Ruebush,
1960). The first step was to construct a questionnaire to assess the indi-
vidual's test-taking attitudes. The children's form, for example, contains
items such as the following:
Do you worry a lot before taking a test?
When the teacher says she is going to find out how much you have learned,
does your heart begin to beat faster?
While you are taking a test, do you usually think you are not doing well?
Of primary interest is the finding that both school achievement and intel-
ligence test scores yielded significant negative correlations with test anx-
iety. Similar correlations have been found among college students (I. G.
Sarason, 1961). Longitudinal studies likewise revealed an inverse relation
between changes in anxiety level and changes in intelligence or achieve-
ment test performance (Hill & Sarason, 1966; Sarason, Hill, & Zimbardo, 1964).
Such findings, of course, do not indicate the direction of causal relation-
ships. It is possible that children develop test anxiety because they per-
form poorly on tests and have thus experienced failure and frustration in
previous test situations. In support of this interpretation is the finding
that within subgroups of high scorers on intelligence tests, the negative
correlation between anxiety level and test performance disappears
(Denny, 1966; Feldhusen & Klausmeier, 1962). On the other hand, there
is evidence suggesting that at least some of the relationship results from
the deleterious effects of anxiety on test performance. In one study
(Waite, Sarason, Lighthall, & Davidson, 1958), high-anxious and low-
anxious children equated in intelligence test scores were given repeated
trials in a learning task. Although initially equal in the learning test, the
low-anxious group improved significantly more than the high-anxious.
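The subgroup finding reported above, a negative anxiety-performance correlation in the full sample that fades among high scorers, is what restriction of range alone would produce under a "failure breeds anxiety" model. A simulation sketch (the model and all numbers are hypothetical):

```python
import random
import statistics

random.seed(1)

n = 20000
# Hypothetical model: anxiety grows out of past poor performance, so
# low-ability examinees accumulate failures and hence higher test anxiety.
ability = [random.gauss(0, 1) for _ in range(n)]
anxiety = [-0.5 * a + random.gauss(0, 1) for a in ability]
score = [a + random.gauss(0, 0.3) for a in ability]

r_all = statistics.correlation(anxiety, score)

# Restrict the sample to high scorers, as in the subgroup analyses.
high = [(x, s) for x, s in zip(anxiety, score) if s > 1.0]
r_high = statistics.correlation([x for x, _ in high],
                                [s for _, s in high])

print(f"full sample:  r = {r_all:.2f}")   # clearly negative
print(f"high scorers: r = {r_high:.2f}")  # much weaker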
Several investigators have compared test performance under conditions
designed to evoke "anxious" and "relaxed" states. Mandler and Sarason
(1952), for example, found that ego-involving instructions, such as telling
subjects that everyone is expected to finish in the time allotted, had a
beneficial effect on the performance of low-anxious subjects, but a dele-
terious effect on that of high-anxious subjects. Other studies have likewise
found an interaction between testing conditions and such individual char-
acteristics as anxiety level and achievement motivation (Lawrence, 1962;
Paul & Eriksen, 1964). It thus appears likely that the relation between
anxiety and test performance is nonlinear, a slight amount of anxiety
being beneficial while a large amount is detrimental. Individuals who are
customarily low-anxious benefit from test conditions that arouse some
anxiety, while those who are customarily high-anxious perform better
under more relaxed conditions.
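The nonlinear relation described above can be pictured with a toy inverted-U curve; the quadratic form, the peak location, and the constants are illustrative assumptions, not a fitted model:

```python
# Illustrative inverted-U: "a slight amount of anxiety is beneficial,
# while a large amount is detrimental."
def performance(anxiety, optimum=0.4, penalty=50.0):
    """Peak performance at a moderate anxiety level, falling off
    quadratically on either side (a toy model, not a fitted curve)."""
    return 100.0 - penalty * (anxiety - optimum) ** 2

levels = [i / 10 for i in range(11)]  # anxiety from 0.0 (relaxed) to 1.0
best = max(levels, key=performance)

for a in (0.0, 0.4, 1.0):
    print(f"anxiety {a:.1f} -> performance {performance(a):.1f}")
print("best performance at anxiety level", best)
```

On such a curve, raising anxiety helps an examinee sitting to the left of the peak and hurts one sitting to the right, which is the interaction with customary anxiety level that the studies report.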
It is undoubtedly true that a chronically high anxiety level will exert a
detrimental effect on school learning and intellectual development. Such
an effect, however, should be distinguished from the test-limited effects
with which this discussion is concerned. To what extent does test anxiety
make the individual's test performance unrepresentative of his customary
performance level in nontest situations? Because of the competitive pres-
sure experienced by college-bound high school seniors in America today,
it has been argued that performance on college admission tests may be
unduly affected by test anxiety. In a thorough and controlled investi-
gation of this question, French (1962) compared the performance of high
school students on a test given as part of the regular administration of
the SAT with performance on a parallel form of the test administered at
a different time under "relaxed" conditions. The instructions on the latter
occasion specified that the test was given for research purposes only and
scores would not be sent to any college. The results showed that per-
formance was no poorer during the standard administration than during
the relaxed administration. Moreover, the concurrent validity of the test
scores against high school course grades did not differ significantly under
the two conditions.
Comprehensive surveys of the effects of examiner and situational
variables on test scores have been prepared by S. B. Sarason (1954),
Masling (1960), Moriarty (1961, 1966), Sattler and Theye (1967),
Palmer (1970), and Sattler (1970, 1974). Although some effects have
been demonstrated with objective group tests, most of the data have been
obtained with either projective techniques or individual intelligence tests.
These extraneous factors are more likely to operate with unstructured and
ambiguous stimuli, as well as with difficult and novel tasks, than with
clearly defined and well-learned functions. In general, children are more
susceptible to examiner and situational influences than are adults; in the
examination of preschool children, the role of the examiner is especially
crucial. Emotionally disturbed and insecure persons of any age are also
more likely to be affected by such conditions than are well-adjusted persons.
There is considerable evidence that test results may vary systematically
as a function of the examiner (E. Cohen, 1965; Masling, 1960). These dif-
ferences may be related to personal characteristics of the examiner, such
as his age, sex, race, professional or socioeconomic status, training and
experience, personality characteristics, and appearance. Several studies of
these examiner variables, however, have yielded misleading or incon-
clusive results because the experimental designs failed to control or iso-
late the influence of different examiner or subject characteristics. Hence
the effects of two or more variables may be confounded.
The examiner's behavior before and during test administration has also
been shown to affect test results. For example, controlled investigations
have yielded significant differences in intelligence test performance as a
result of a "warm" versus a "cold" interpersonal relation between ex-
aminer and examinees, or a rigid and aloof versus a natural manner on
the part of the examiner (Exner, 1966; Masling, 1959). Moreover, there
may be significant interactions between examiner and examinee charac-
teristics, in the sense that the same examiner characteristic or testing man-
ner may have a different effect on different examinees as a function of
the examinee's own personality characteristics. Similar interactions may
occur with task variables, such as the nature of the test, the purpose of
the testing, and the instructions given to the subjects. Dyer (1973) adds
even more variables to this list, calling attention to the possible influence
of the test givers' and the test takers' diverse perceptions of the functions
and goals of testing.
Still another way in which an examiner may inadvertently affect the
examinee's responses is through his own expectations. This is simply a
special instance of the self-fulfilling prophecy (Rosenthal, 1966; Rosen-
thal & Rosnow, 1969). An experiment conducted with the Rorschach will
illustrate this effect (Masling, 1965). The examiners were 14 graduate
student volunteers, 7 of whom were told, among other things, that ex-
perienced examiners elicit more human than animal responses from the
subjects, while the other 7 were told that experienced examiners elicit
more animal than human responses. Under these conditions, the two
groups of examiners obtained significantly different ratios of animal to
human responses from their subjects. These differences occurred despite
the fact that neither examiners nor subjects reported awareness of any
influence attempt. Moreover, tape recordings of all testing sessions re-
vealed no evidence of verbal influence on the part of any examiner. The
examiners' expectations apparently operated through subtle postural and
facial cues to which the subjects responded.
Apart from the examiner, other aspects of the testing situation may
significantly affect test performance. Military recruits, for example, are
often examined shortly after induction, during a period of intense read-
justment to an unfamiliar and stressful situation. In one investigation
designed to test the effect of acclimatization to such a situation on test
performance, 2,724 recruits were given the Navy Classification Battery
during their ninth day at the Naval Training Center (Gordon & Alf,
1960). When their scores were compared with those obtained by 2,180
recruits tested at the conventional time, during their third day, the 9-day
group scored significantly higher on all subtests of the battery.
The examinees' activities immediately preceding the test may also af-
fect their performance, especially when such activities produce emotional
disturbance, fatigue, or other handicapping conditions. In an investiga-
tion with third- and fourth-grade schoolchildren, there was some evidence
to suggest that IQ on the Draw-a-Man Test was influenced by the chil-
dren's preceding classroom activity (McCarthy, 1944). On one occasion,
the class had been engaged in writing a composition on "The Best
Thing That Ever Happened to Me"; on the second occasion, they had
again been writing, but this time on "The Worst Thing That Ever Hap-
pened to Me." The IQ's on the second test, following what may have
been an emotionally depressing experience, averaged 4 or 5 points lower
than on the first test. These findings were corroborated in a later investi-
gation specifically designed to determine the effect of immediately pre-
ceding experience on the Draw-a-Man Test (Reichenberg-Hackett, 1953).
In this study, children who had had a gratifying experience involving the
successful solution of an interesting puzzle, followed by a reward of toys
and candy, showed more improvement in their test scores than those who
had undergone neutral or less gratifying experiences. Similar results were
obtained by W. E. Davis (1969a, 1969b) with college students. Per-
formance on an arithmetic reasoning test was significantly poorer when
preceded by a failure experience on a verbal comprehension test than it
was in a control group given no preceding test and in one that had taken
a standard verbal comprehension test under ordinary conditions.
Several studies have been concerned with the effects of feedback re-
garding test scores on the individual's subsequent test performance. In a
particularly well-designed investigation with seventh-grade students,
Bridgeman (1974) found that "success" feedback was followed by sig-
nificantly higher performance on a similar test than was "failure" feed-
back in subjects who had actually performed equally well to begin with.
This type of motivational feedback may operate largely through the goals
the subjects set for themselves in subsequent performance and may thus
represent another example of the self-fulfilling prophecy. Such general
motivational feedback, however, should not be confused with corrective
feedback, whereby the individual is informed about the specific items he
missed and given remedial instruction; under these conditions, feedback
is much more likely to improve the performance of initially low-scoring
persons.
The examples cited in this section illustrate the wide diversity of test-
related factors that may affect test scores. In the majority of well-admin-
istered testing programs, the influence of these factors is negligible for
practical purposes. Nevertheless, the skilled examiner is constantly on
guard to detect the possible operation of such factors and to minimize
their influence. When circumstances do not permit the control of these
conditions, the conclusions drawn from test performance should be
qualified.
In evaluating the effect of coaching or practice on test scores, a funda-
mental question is whether the improvement is limited to the specific
items included in the test or whether it extends to the broader area of
behavior that the test is designed to predict. The answer to this question
represents the difference between coaching and education. Obviously,
any educational experience the individual undergoes, either formal or in-
formal, in or out of school, should be reflected in his performance on tests
sampling the relevant aspects of behavior. Such broad influences will in
no way invalidate the test, since the test score presents an accurate pic-
ture of the individual's standing in the abilities under consideration. The
difference is, of course, one of degree. Influences cannot be classified as
either narrow or broad, but obviously vary widely in scope, from those
affecting only a single administration of a single test, through those affect-
ing performance on all items of a certain type, to those influencing the
individual's performance in the large majority of his activities. From the
standpoint of effective testing, however, a workable distinction can be
made. Thus, it can be stated that a test score is invalidated only when a
particular experience raises it without appreciably affecting the criterion
behavior that the test is designed to predict.
COACHING. The effects of coaching on test scores have been widely in-
vestigated. Many of these studies were conducted by British psycholo-
gists, with special reference to the effects of practice and coaching on the
tests formerly used in assigning 11-year-old children to different types of
secondary schools (Yates et al., 1953-1954). As might be expected, the
amount of improvement depends on the ability and earlier educational
experiences of the examinees, the nature of the tests, and the amount and
type of coaching provided. Individuals with deficient educational back-
grounds are more likely to benefit from special coaching than are those
who have had superior educational opportunities and are already pre-
pared to do well on the tests. It is obvious, too, that the closer the re-
semblance between test content and coaching material, the greater will
be the improvement in test scores. On the other hand, the more closely
instruction is restricted to specific test content, the less likely is improve-
ment to extend to criterion performance.
In America, the College Entrance Examination Board has been con-
cerned about the spread of ill-advised commercial coaching courses for
college applicants. To clarify the issues, the College Board conducted
several well-controlled experiments to determine the effects of coaching
on its Scholastic Aptitude Test and surveyed the results of similar studies
by other, independent investigators (Angoff, 1971b; College Entrance
Examination Board, 1968). These studies covered a variety of coaching
methods and included students in both public and private high schools;
one investigation was conducted with black students in 15 urban and
rural high schools in Tennessee. The conclusion from all these studies is
that intensive drill on items similar to those on the SAT is unlikely to
produce appreciably greater gains than occur when students are retested
with the SAT after a year of regular high school instruction.
On the basis of such research, the Trustees of the College Board issued
a formal statement about coaching, in which the following points were
made, among others (College Entrance Examination Board, 1968, pp. 8-9):

The results of the coaching studies which have thus far been completed in-
dicate that average increases of less than 10 points on a 600 point scale can
be expected. It is not reasonable to believe that admissions decisions can be
affected by such small changes in scores. This is especially true since the tests
are merely supplementary to the school record and other evidence taken into
account by admissions officers. . . . As the College Board uses the term, ap-
titude is not something fixed and impervious to influence by the way the child
lives and is taught. Rather, this particular Scholastic Aptitude Test is a meas-
ure of abilities that seem to grow slowly and stubbornly, profoundly influenced
by conditions at home and at school over the years, but not responding to
hasty attempts to relive a young lifetime.
It should also be noted that in its test construction procedures, the Col-
lege Board investigates the susceptibility of new item types to coaching
(Angoff, 1971b; Pike & Evans, 1972). Item types on which performance
can be appreciably raised by short-term drill or instruction of a narrowly
limited nature are not included in the operational forms of the tests.
PRACTICE. The effects of sheer repetition, or practice, on test per-
formance are similar to the effects of coaching, but usually less pro-
nounced. It should be noted that practice, as well as coaching, may alter
the nature of the test, since the subjects may employ different work meth-
ods in solving the same problems. Moreover, certain types of items may
be much easier when encountered a second time. An example is provided
by problems requiring insightful solutions which, once attained, can be
applied directly in solving the same or similar problems in a retest. Scores
on such tests, whether derived from a repetition of the identical test or
from a parallel form, should therefore be carefully scrutinized.
A number of studies have been concerned with the effects of the identical repetition of intelligence tests over periods ranging from a few days to several years (see Quereshi, 1968). Both adults and children, and both normal and mentally retarded persons have been employed. The studies have covered individual as well as group tests. All agree in showing significant mean gains on retests. Nor is improvement necessarily limited to the initial repetitions. Whether gains persist or level off in successive administrations seems to depend on the difficulty of the test and the ability level of the subjects. The implications of such findings are illustrated by the results obtained in annual retests of 3,500 schoolchildren with a variety of intelligence tests (Dearborn & Rothney, 1941). When the same test was readministered in successive years, the median IQ of the group rose from 102 to 113, but it dropped to 104 when another test was substituted. Because of the retest gains, the meaning of an IQ obtained on an initial and later trial proved to be quite different. For example, an IQ of 100 fell approximately at the average of the distribution on the initial trial, but in the lowest quarter on a retest. Such IQ's, though numerically identical and derived from the same test, might thus signify normal ability in the one instance and inferior ability in the other.
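The shift in meaning described here is easy to quantify as percentile ranks under the two group distributions. The sketch below uses the group medians quoted from Dearborn and Rothney, but the standard deviation of 15 and the normal approximation are assumptions of this illustration, not values reported in the study.

```python
from statistics import NormalDist

SD = 15  # conventional IQ standard deviation (an assumption of this sketch)

def percentile_rank(score, group_mean, sd=SD):
    """Fraction of the group scoring below `score`, normal approximation."""
    return NormalDist(mu=group_mean, sigma=sd).cdf(score)

# Group median ~102 on the initial trial, rising to ~113 on retest.
initial = percentile_rank(100, group_mean=102)
retest = percentile_rank(100, group_mean=113)

print(f"IQ 100, initial trial: {initial:.0%}")  # near the middle of the group
print(f"IQ 100, retest:        {retest:.0%}")   # in the lowest quarter
```

Under these assumptions the same numerical IQ of 100 stands near the group average on the initial trial but well within the lowest quarter on the retest, which is the point of the paragraph above.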
Gains in score are also found on retesting with parallel forms of the same test, although such gains tend in general to be smaller. Significant mean gains have been reported when alternate forms of a test were administered in immediate succession or after intervals ranging from one to three years (Angoff, 1971b; Droege, 1966; Peel, 1951, 1952). Similar results have been obtained with normal and intellectually gifted schoolchildren, high school and college students, and employee samples. Data on the distribution of gains to be expected on a retest with a parallel form should be provided in test manuals, and allowance for such gains should be made when interpreting test scores.
TEST SOPHISTICATION. The general problem of test sophistication should also be considered in this connection. The individual who has had extensive prior experience in taking psychological tests enjoys a certain advantage in test performance over one who is taking his first test (Heim & Wallace, 1949-1950; Millman, Bishop, & Ebel, 1965; Rodger, 1936). Part of this advantage stems from having overcome an initial feeling of strangeness, as well as from having developed more self-confidence and better test-taking attitudes. Part is the result of a certain amount of overlap in the type of content and functions covered by many tests. Specific familiarity with common item types and practice in the use of objective answer sheets may also improve performance slightly. It is particularly important to take test sophistication into account when comparing the scores obtained by children from different types of schools, where the extent of test-taking experience may have varied widely. Short orientation and practice sessions, as described earlier in this chapter, can be quite effective in equalizing test sophistication (Wahlstrom & Boersma, 1968).
CHAPTER 3
Social and Ethical Implications of Testing
IN ORDER to prevent the misuse of psychological tests, it has become necessary to erect a number of safeguards around both the tests themselves and the test scores. The distribution and use of psychological tests constitutes a major area in Ethical Standards of Psychologists, the code of professional ethics officially adopted by the American Psychological Association and reproduced in Appendix A. Principles 13, 14, and 15 are specifically directed to testing, being concerned with Test Security, Test Interpretation, and Test Publication. Other principles that, although broader in scope, are highly relevant to testing include 6 (Confidentiality), 7 (Client Welfare), and 9 (Impersonal Services). Some of the matters discussed in the Ethical Standards are closely related to points covered in the Standards for Educational and Psychological Tests (1974), cited in Chapter 1. For a fuller and richer understanding of the principles set forth in the Ethical Standards, the reader should consult two companion publications, the Casebook on Ethical Standards of Psychologists (1967) and Ethical Principles in the Conduct of Research with Human Participants (1973). Both report specific incidents to illustrate each principle. Special attention is given to marginal situations in which there may be a conflict of values, as between the advancement of science for human betterment and the protection of the rights and welfare of individuals.
The requirement that tests be used only by appropriately qualified examiners is one step toward protecting the individual against the improper use of tests. Of course, the necessary qualifications vary with the type of test. Thus, a relatively long period of intensive training and supervised experience is required for the proper use of individual intelligence tests and most personality tests, whereas a minimum of specialized psychological training is needed in the case of educational achievement
or vocational proficiency tests. It should also be noted that students who
take tests in class for instructional purposes are not usually equipped to
administer the tests to others or to interpret the scores properly.
The well-trained examiner chooses tests that are appropriate for both the particular purpose for which he is testing and the person to be examined. He is also cognizant of the available research literature on the chosen test and able to evaluate its technical merits with regard to such characteristics as norms, reliability, and validity. In administering the test, he is sensitive to the many conditions that affect test performance, such as those illustrated in Chapter 2. He draws conclusions or makes recommendations only after considering the test score (or scores) in the light of other pertinent information about the individual. Above all, he should be sufficiently knowledgeable about the science of human behavior to guard against unwarranted inferences in his interpretations of test scores. When tests are administered by psychological technicians or assistants, or by persons in other professions, it is essential that an adequately qualified psychologist be available, at least as a consultant, to provide the needed perspective for a proper interpretation of test performance.
Misconceptions about the nature and purpose of tests and misinterpretations of test results underlie many of the popular criticisms of psychological tests. In part, these difficulties arise from inadequate communication between psychometricians and their various publics: educators, parents, legislators, job applicants, and so forth. Probably the most common examples center on unfounded inferences from IQs. Not all misconceptions about tests, however, can be attributed to inadequate communication between psychologists and laymen. Psychological testing itself has tended to become dissociated from the mainstream of behavioral science (Anastasi, 1967). The growing complexity of the science of psychology has inevitably been accompanied by increasing specialization among psychologists. In this process, psychometricians have concentrated more and more on the technical refinements of test construction and have tended to lose contact with developments in other relevant specialties, such as learning, child development, individual differences, and behavior genetics. Thus, the technical aspects of test construction have tended to outstrip the psychological sophistication with which test results are interpreted. Test scores can be properly interpreted only in the light of all available knowledge regarding the behavior that the tests are designed to measure.
Who is a qualified psychologist? Obviously, with the diversification of the field and the consequent specialization of training, no psychologist is equally qualified in all areas. In recognition of this fact, the Ethical Standards specify: "The psychologist recognizes the boundaries of his competence and the limitations of his techniques and does not offer services or use techniques that fail to meet professional standards established in particular fields" (Appendix A, Principle 2c). A useful distinction is that between a psychologist working in an institutional setting, such as a school system, university, clinic, or government agency, and one engaged in independent practice. Because the independent practitioner is less subject to judgment and evaluation by knowledgeable colleagues than is the institutional psychologist, he needs to meet higher standards of professional qualifications. The same would be true of a psychologist responsible for the supervision of other institutional psychologists or one who serves as an expert consultant to institutional personnel.
A significant step, both in upgrading professional standards and in helping the public to identify qualified psychologists, was the enactment of state licensing and certification laws for psychologists. Nearly all states now have such laws. Although the terms "licensing" and "certification" are often used interchangeably, in psychology certification typically refers to legal protection of the title "psychologist," whereas licensing controls the practice of psychology. Licensing laws thus need to include a definition of the practice of psychology. In either type of law, the requirements are generally a PhD in psychology, a specified amount of supervised experience, and satisfactory performance on a qualifying examination. Violations of the APA ethics code constitute grounds for revoking a certificate or license. Although most states began with the simpler certification laws, there has been continuing movement toward licensing.
At a more advanced level, specialty certification within psychology is provided by the American Board of Professional Psychology (ABPP). Requiring a high level of training and experience within designated specialties, ABPP grants diplomas in such areas as clinical, counseling, industrial and organizational, and school psychology. The Biographical Directory of the APA contains a list of current diplomates in each specialty, which can also be obtained directly from ABPP. The principal function of ABPP is to provide information regarding qualified psychologists. As a privately constituted board within the profession, ABPP does not have the enforcement authority available to the agencies administering the state licensing and certification laws.
The purchase of tests is generally restricted to persons who meet certain minimal qualifications. The catalogues of major test publishers specify requirements that must be met by purchasers. Usually individuals with a master's degree in psychology or its equivalent qualify. Some publishers classify their tests into levels with reference to user qualifications, ranging from educational achievement and vocational proficiency tests, through group intelligence tests and interest inventories, to such clinical instruments as individual intelligence tests and most personality tests. Distinctions are also made between individual purchasers and authorized institutional purchasers of appropriate tests. Graduate students who may need a particular test for a class assignment or for research must have the order countersigned by their psychology instructor, who assumes responsibility for the proper use of the test.

Efforts to restrict the distribution of tests have a dual objective: security of test materials and prevention of misuse. The Ethical Standards state: "Access to such devices is limited to persons with professional interests who will safeguard their use" (Principle 13); "Test scores, like test materials, are released only to persons who are qualified to interpret and use them properly" (Principle 14). It should be noted that although test distributors make sincere efforts to implement these objectives, the control they are able to exert is necessarily limited. The major responsibility for the proper use of tests resides in the individual user or institution concerned. It is evident, for example, that an MA degree in psychology, or even a PhD, state license, and ABPP diploma, do not necessarily signify that the individual is qualified to use a particular test or that his training is relevant to the proper interpretation of the results obtained with that test.

Another professional responsibility concerns the marketing of psychological tests by authors and publishers. Tests should not be released prematurely for general use. Nor should any claims be made regarding the merits of a test in the absence of sufficient objective evidence. When a test is distributed early for research purposes only, this condition should be clearly specified and the distribution of the test restricted accordingly. The test manual should provide adequate data to permit an evaluation of the test itself, as well as full information regarding administration, scoring, and norms. The manual should be a factual exposition of what is known about the test rather than a selling device designed to put the test in a favorable light. It is the responsibility of the test author and publisher to revise tests and norms often enough to prevent obsolescence. The rapidity with which a test becomes outdated will, of course, vary with the nature of the test.

The content of tests should not be published in a newspaper, magazine, or popular book, either for descriptive purposes or for self-evaluation. Under these conditions, self-evaluation would not only be subject to such drastic errors as to be worthless, but it might also be psychologically injurious to the individual. Moreover, any publicity given to specific test items will tend to invalidate the future use of the test with other persons. It might also be added that presentation of test materials in this fashion tends to create an erroneous and distorted picture of testing in general. Such publicity may foster
either naive credulity or indiscriminate resistance on the part of the public toward all psychological testing.
Another unprofessional practice is testing by mail. An individual's performance on either aptitude or personality tests cannot be properly assessed by mailing test forms to him and having him return them by mail for scoring and interpretation. Not only does this procedure provide no control of testing conditions, but usually it also involves the interpretation of test scores in the absence of other pertinent information about the individual. Under these conditions, test results may be worse than useless.
A question arising particularly in connection with personality tests is that of invasion of privacy. Insofar as some tests of emotional, motivational, or attitudinal traits are necessarily disguised, the subject may reveal characteristics in the course of such a test without realizing that he is so doing. Although there are few available tests whose approach is subtle enough to fall into this category, the possibility of developing such indirect testing procedures imposes a grave responsibility on the psychologist who uses them. For purposes of testing effectiveness, it may be necessary to keep the examinee in ignorance of the specific ways in which his responses on any one test are to be interpreted. Nevertheless, a person should not be subjected to any testing program under false pretenses.
Of primary importance in this connection is the obligation to have a clear understanding with the examinee regarding the use that will be made of his test results. The following statement contained in Ethical Standards of Psychologists (Principle 7d) is especially germane to this problem:

The psychologist who asks that an individual reveal personal information in the course of interviewing, testing, or evaluation, or who allows such information to be divulged to him, does so only after making certain that the responsible person is fully aware of the purposes of the interview, testing, or evaluation and of the ways in which the information may be used.
Although concerns about the invasion of privacy have been expressed most commonly about personality tests, they logically apply to any type of test. Certainly any intelligence, aptitude, or achievement test may reveal limitations in skills and knowledge that an individual would rather not disclose. Moreover, any observation of an individual's behavior, as in an interview, casual conversation, or other personal encounter, may yield information about him that he would prefer to conceal and that he may reveal unwittingly. The fact that psychological tests have often been
singled out in discussions of the invasion of privacy probably reflects prevalent misconceptions about tests. If all tests were recognized as measures of behavior samples, with no mysterious powers to penetrate beyond behavior, popular fears and suspicion would be lessened.
It should also be noted that all behavior research, whether employing tests or other observational procedures, presents the possibility of invasion of privacy. Yet, as scientists, psychologists are committed to the goal of advancing knowledge about human behavior. Principle 1a in Ethical Standards of Psychologists (Appendix A) clearly spells out the psychologist's conviction "that society will be best served when he investigates where his judgment indicates investigation is needed." Several other principles, on the other hand, are concerned with the protection of privacy and the welfare of research subjects (see, e.g., 7d, 8a, 16). Conflicts of values may thus arise, which must be resolved in individual cases. Examples of such conflict resolutions can be found in the previously cited Ethical Principles in the Conduct of Research with Human Participants (1973).
The problem is obviously not simple; and it has been the subject of extensive deliberation by psychologists and other professionals. In a report entitled Privacy and Behavioral Research (1967), prepared for the Office of Science and Technology, the right to privacy is defined as "the right of the individual to decide for himself how much he will share with others his thoughts, his feelings, and the facts of his personal life" (p. 2). It is further characterized as "a right that is essential to insure dignity and freedom of self-determination" (p. 2). To safeguard personal privacy, no universal rules can be formulated; only general guidelines can be provided. In the application of these guidelines to specific cases, there is no substitute for the ethical awareness and professional responsibility of the individual psychologist. Solutions must be worked out in terms of the particular circumstances.
One relevant factor is the purpose for which the testing is conducted, whether for individual counseling, institutional decisions regarding selection and classification, or research. In clinical or counseling situations, the client is usually willing to reveal himself in order to obtain help with his problems. The clinician or examiner does not invade privacy where he is freely admitted. Even under these conditions, however, the client should be warned that in the course of the testing or interviewing he may reveal information about himself without realizing that he is so doing; or he may disclose feelings of which he himself is unaware.
When testing is conducted for institutional purposes, the examinee should be fully informed as to the use that will be made of his test scores. It is also desirable, however, to explain to the examinee that correct assessment will benefit him, since it is not to his advantage to be placed in a position where he will fail or which he will find uncongenial. The results of tests administered in a clinical or counseling situation, of course, should not be made available for institutional purposes, unless the examinee gives his consent.
When tests are given for research purposes, anonymity should be preserved as fully as possible and the procedures for ensuring such anonymity should be explained in advance to the subjects. Anonymity does not, however, solve the problem of protecting privacy in all research contexts. Some subjects may resent the disclosure of facts they consider personal, even when complete confidentiality of responses is assured. In most cases, however, cooperation of subjects may be elicited if they are convinced that the information is needed for the research in question and if they have sufficient confidence in the integrity and competence of the investigator. All research on human behavior, whether or not it utilizes tests, may present conflicts of values. Freedom of inquiry, which is essential to the progress of science, must be balanced against the protection of the individual. The investigator must be alert to the values involved and must carefully weigh alternative solutions (see Ethical Principles, 1973; Privacy and Behavioral Research, 1967; Ruebhausen & Brim, 1966).
Whatever the purposes of testing, the protection of privacy involves two key concepts: relevance and consent. The information that the individual is asked to reveal must be relevant to the stated purposes of the testing. An important implication of this principle is that all practicable efforts should be made to ascertain the validity of tests for the particular diagnostic or predictive purpose for which they are used. An instrument that is demonstrably valid for a given purpose is one that provides relevant information. It also behooves the examiner to make sure that test scores are correctly interpreted. An individual is less likely to feel that his privacy is being invaded by a test assessing his readiness for a particular educational program than by a test allegedly measuring his "innate intelligence."
The concept of informed consent also requires clarification; and its application in individual cases may call for the exercise of considerable judgment (Ethical Principles, 1973; Ruebhausen & Brim, 1966). The examinee should certainly be informed about the purpose of testing, the kinds of data sought, and the use that will be made of his scores. It is not implied, however, that he be shown the test items in advance or told how specific responses will be scored. Nor should the test items be shown to a parent, in the case of a minor. Such information would usually invalidate the test. Not only would the giving of this information seriously impair the usefulness of an ability test, but it would also tend to distort responses on many personality tests. For example, if an individual is told in advance that a self-report inventory will be scored with a dominance
the security of test content, the hazards of misunderstanding test scores, and the need of various persons to know the results.
There has been a growing awareness of the right of the individual himself to have access to the findings in his test report. He should also have the opportunity to comment on the contents of the report and if necessary to clarify or correct factual information. Counselors are now trying more and more to involve the client as an active participant in his own assessment. For these purposes, test results should be presented in a form that is readily understandable, free from technical jargon or labels, and oriented toward the immediate objective of the testing. Proper safeguards must be observed against misuse and misinterpretation of test findings (see Ethical Standards, Principle 14).
In the case of minors, one must also consider the parents' right of access to the child's test record. This presents a possible conflict with the child's own right to privacy, especially in the case of older children. In a searching analysis of the problem, Ruebhausen and Brim (1966, pp. 431-432) wrote: "Should not a child, even before the age of full legal responsibility, be accorded the dignity of a private personality? Considerations of healthy personal growth, buttressed with reasons of ethics, seem to command that this be done." The previously mentioned Guidelines (Russell Sage Foundation, 1970, p. 27) recommend that "when a student reaches the age of eighteen and no longer is attending high school, or is married (whether age eighteen or not)," he should have the right to deny parental access to his records. However, this recommendation is followed by the caution that school authorities check local state laws for possible legal difficulties in implementing such a policy.
Apart from these possible exceptions, the question is not whether to communicate test results to parents of a minor but how to do so. Parents normally have a legal right to information about their child; and it is usually desirable for them to have such information. In some cases, moreover, a child's academic or emotional difficulties may arise in part from parent-child relations. Under these conditions, the counselor's contact with the parents is of prime importance, both to fill in background data and to elicit parental cooperation.
Discussions of the confidentiality of test records have usually dealt with accessibility to a third person, other than the individual tested (or parent of a minor) and the examiner (Ethical Standards, Principle 6; Russell Sage Foundation, 1970). The underlying principle is that such records should not be released without the knowledge and consent of the individual.

When tests are administered in an institutional context, as in a school system, court, or employment setting, the individual should be informed at the time of testing regarding the purpose of the test, how the results
his responses are likely to be influenced by stereotyped (and often erroneous) ideas he may have about this trait, or by a false or distorted
In the testing of children, special questions arise with regard to parental consent. Guidelines for such testing are provided in the report sponsored by the Russell Sage Foundation, entitled Guidelines for the Collection, Maintenance, and Dissemination of Pupil Records. With reference to consent, the Guidelines differentiate between individual consent, given by the child, his parents, or both, and representational consent, given by the parents' legally elected or appointed representatives, such as a school board. While avoiding rigid prescriptions, the Guidelines cite intelligence and achievement tests as examples of the type of instrument for which representational consent should be sufficient, at the same time recommending individual consent for personality assessment. A helpful feature of the Guidelines is the inclusion of sample forms for obtaining written consent. There is also a selected bibliography on the ethical and legal aspects of school record keeping.

Experimental designs and procedures that protect the individual's right to decline to participate and that adequately safeguard his privacy, while yielding scientifically meaningful data, present a challenge to the psychologist's ingenuity. With proper rapport and the establishment of attitudes of mutual respect, however, the number of refusals to participate may be reduced to a negligible quantity. The technical difficulties of biased sampling and volunteer error may thus be avoided. Results from both national and statewide surveys suggest that this goal can be achieved, both in the testing of educational outcomes and in the more sensitive area of personality research (Holtzman, 1971; Womer, 1970). There is also some evidence that the number of respondents who regard a personality inventory as an invasion of privacy or consider some of the items offensive is significantly reduced when the inventory is preceded by a simple and forthright explanation of how items were selected and how scores will be interpreted (Fink & Butcher, 1972). From the standpoint of test validity, it should be added that such an explanation did not affect the mean profile of scores on the personality inventory.
CONFIDENTIALITY

Like the protection of privacy, to which it is related, the problem of confidentiality of test data is multifaceted. The fundamental question is: Who shall have access to test results? Several considerations influence the answer in particular situations. Among them are
the capacity to record faithfully, to maintain permanently, to retrieve promptly, and to communicate both widely and instantly.
and their availability to institutional personnel who have a legitimate need for them. Under these conditions, no further permission need be obtained at the time the test results are made available within the institution. A different situation exists when test results are requested by outsiders, as when a prospective employer or a college requests test results from a school system. In these instances, individual consent for the release of the data is required. The same requirement applies to tests administered in clinical and counseling contexts, or for research purposes. The previously cited Guidelines (Russell Sage Foundation, 1970) contain a sample form for the use of school systems in describing the transmission of pupil data.

Another problem pertains to the retention of records in institutions. On the one hand, longitudinal records can be very valuable, not only for research purposes but also for understanding and counseling the individual. As is so often the case, these advantages presuppose proper use and interpretation of test results. On the other hand, the availability of old records opens the way for such misuses as incorrect inferences from obsolete data and unauthorized access for other than the original testing purpose. It would be manifestly absurd, for example, to cite an IQ or a reading achievement score obtained by a child in the third grade when evaluating him for admission to college. Too much may have happened to him in the intervening years to make such early scores meaningful. Similarly, when records are retained for many years, there is danger that they may be used for purposes that the individual (or his parents) never suspected and would not have approved.

To prevent such misuses, when records are retained either for legitimate longitudinal use in the interest of the individual or for acceptable research purposes, access to them should be subject to unusually stringent controls. In the Guidelines (Russell Sage Foundation, 1970), school records are classified into three categories with regard to their retention. A major determining factor in this classification is the degree of objectivity and verifiability of the data; another is relevance to the educational objectives of the school. It would be well for any type of institution to formulate similar explicit policies regarding the destruction, retention, and accessibility of personal records.

The problems of maintenance, security, and accessibility of test results, and of all other personal data, have been magnified by the development of computerized data banks. In a preface to the Guidelines (Russell Sage Foundation, 1970, pp. 5-6), Ruebhausen wrote:
Modern science has introduced a new dimension into the issues of privacy. There was a time when among the strongest allies of privacy were the inefficiency of man, the fallibility of his memory, and the healing compassion that accompanied both the passing of time and the warmth of human recollection. Modern science has given us
The unprecedented advances in storing, processing, and retrieving data made possible by computers can be of inestimable service both in research and in the more immediate handling of social problems. The potential dangers of invasion of privacy and violation of confidentiality need to be faced squarely, constructively, and imaginatively. Rather than fearing the centralization and efficiency of complex computer systems, we should explore the possibility that these very characteristics may permit more effective procedures for protecting the security of individual records.
An example of what can be accomplished with adequate facilities is
provided by the Link system developed by the American Council on
Education (Astin & Boruch, 1970). In a longitudinal research program
on the effects of different types of college environments, questionnaires
were administered annually to several hundred thousand college fresh-
men. To permit the collection of follow-up data on the same persons
while preventing the identification of individual responses by anyone at
any future time, a three-file system of computer tapes was devised. The
first tape, containing each student's responses marked with an arbitrary
identification number, is readily accessible for research purposes. The
second tape, containing only the students' names and addresses with the
same identification numbers, was originally housed in a locked vault and
used only to print labels for follow-up mailings. After the preparation of
these tapes, the original questionnaires were destroyed.
This two-file system represents the traditional security system. It still
did not provide complete protection, since some staff members would
have access to both files. Moreover, such files are subject to judicial and
legislative subpoena. For these reasons, a third file was prepared. Known
as the Link file, it contained only the original identification numbers and
a new set of random numbers which were substituted for the original
identification numbers in the name and address file. The Link file was
deposited at a computer facility in a foreign country, with the agreement
that the file would never be released to anyone, including the American
Council on Education. Follow-up data tapes are sent to the foreign fa-
cility, which substitutes one set of code numbers for the other. With the
decoding files and the research data files under the control of different
organizations, no one can identify the responses of individuals in the
data files. Such elaborate precautions for the protection of confidentiality
obviously would not be feasible except in a large-scale computerized data
bank. The procedure could be simplified somewhat if the linking facility
were located in a domestic agency given adequate protection against
subpoena.
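The logic of the three-file scheme can be sketched in a few lines of code. This is a simplified illustration only, not the actual ACE system; the IDs, names, and code ranges are all invented:

```python
import random

# File 1 (research file): arbitrary ID -> responses; openly usable for research.
research_file = {101: {"responses": [3, 5, 2]}, 102: {"responses": [1, 4, 4]}}

names = {101: "A. Student", 102: "B. Student"}   # illustrative only

link_file = {}       # File 3: original ID -> new random code (held abroad)
address_file = {}    # File 2: new random code -> name, for mailing labels
new_codes = random.sample(range(10_000, 99_999), k=len(names))
for (orig_id, name), code in zip(names.items(), new_codes):
    link_file[orig_id] = code
    address_file[code] = name

def decode_followup(followup, link):
    """Performed only by the foreign facility: replace the random codes on
    incoming follow-up data with the original research-file IDs."""
    reverse = {code: orig for orig, code in link.items()}
    return {reverse[code]: data for code, data in followup.items()}

# Follow-up responses come back labeled with the new codes ...
followup = {link_file[101]: {"responses": [2, 5, 3]}}
# ... and only the link-file holder can translate them back.
print(decode_followup(followup, link_file))   # keyed by original ID 101
```

The point of the arrangement is that no single file holder can pair a name with a set of responses: the research file carries only arbitrary IDs, the address file carries only the new random codes, and the Link file that connects the two sets of numbers is held by an outside party.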
Psychologists have given much thought to the communication of test
results in a form that will be meaningful and useful. It is clear that the
information should not be transmitted routinely, but should be accom-
panied by interpretive explanations by a professionally trained person.
In communicating scores to parents, for example, a recommended
procedure is to arrange a group meeting at which a counselor or school
psychologist explains the purpose and nature of the tests, the sort of
conclusions that may reasonably be drawn from the results, and the
limitations of the data. Written reports about their own children may
then be distributed to the parents, and arrangements made for personal
interviews with any parents wishing to discuss the reports further. Re-
gardless of how they are transmitted, however, an important condition
is that test results should be presented in terms of descriptive perform-
ance levels rather than isolated numerical scores. This is especially true
of intelligence tests, which are more likely to be misinterpreted than are
achievement tests.

In communicating results to teachers, school administrators, employers,
and other appropriate persons, similar safeguards should be provided.
Levels of performance and qualitative descriptions in simple terms are
to be preferred over specific numerical scores, except when com-
municating with adequately trained professionals. Even well-educated
laymen have been known to confuse percentiles with percentage scores,
percentile ranks with IQ's, norms with standards, and interest ratings with
aptitude scores. But a more serious misinterpretation pertains to the con-
clusions drawn from test scores, even when their technical meaning is
correctly understood. A familiar example is the popular assumption that
the IQ indicates a fixed characteristic of the individual which prede-
termines his lifetime level of intellectual achievement.

In all communication of test results, it is desirable to take into account
the characteristics of the person who is to receive the information. This
applies not only to that person's general education and his knowledge
about psychology and testing, but also to his anticipated emotional
response to the information. In the case of a parent or teacher, for
example, personal emotional involvement with the child may interfere
with a calm and rational acceptance of factual information.

Last but by no means least is the problem of communicating test re-
sults to the individual himself, whether child or adult. The same general
safeguards against misinterpretation apply here as in communicating to
a third party. The person's emotional reaction to the information is
especially important, of course, when he is learning about his own assets
and shortcomings. When an individual is given his own test results, not
Social and Ethical Implications of Testing
only should the data be interpreted by a properly qualified person, but
facilities should also be available for counseling anyone who may become
emotionally disturbed by such information. For example, a college stu-
dent might become seriously discouraged when he learns of his poor
performance on a scholastic aptitude test. A gifted schoolchild might de-
velop habits of laziness and shiftlessness, or he might become uncoop-
erative and unmanageable, if he discovers that he is much brighter than
any of his associates. A severe personality disorder may be precipitated
when a maladjusted individual is given his score on a personality test.
Such detrimental effects may, of course, occur regardless of the correct-
ness or incorrectness of the score itself. Even when a test has been ac-
curately administered and scored and properly interpreted, a knowledge
of such a score without the opportunity to discuss it further may be
harmful to the individual.

Counseling psychologists have been especially concerned with the de-
velopment of effective ways of transmitting test information to their
clients (see, e.g., Goldman, 1971, Ch. 14-16). Although the details of
this process are beyond the scope of our present discussion, two major
guidelines are of particular interest. First, test-reporting is to be viewed
as an integral part of the counseling process and incorporated into the
total counselor-client relationship. Second, insofar as possible, test results
should be reported as answers to specific questions raised by the coun-
selee. An important consideration in counseling relates to the counselee's
acceptance of the information presented to him. The counseling situation
is such that if the individual rejects any information, for whatever rea-
sons, then that information is likely to be totally wasted.
THE SETTING. The decades since 1950 have witnessed an increasing
public concern with the rights of minorities,1 a concern that is reflected in
the enactment of civil rights legislation at both federal and state levels.
In connection with mechanisms for improving educational and vocational
opportunities of such groups, psychological testing has been a major
focus of attention. The psychological literature of the 1960s and early
1970s contains many discussions of the topic, whose impact ranges from
clarification to obfuscation. Among the more clarifying contributions are
several position papers by professional associations (see, e.g., American
Psychological Association, 1969; Cleary, Humphreys, Kendrick, & Wes-
1 Although women represent a statistical majority in the national population,
legally, occupationally, and in other ways they have shared many of the problems
of minorities. Hence, when the term "minority" is used in this section, it will be
understood to include women.
Context of Psychological Testing

man, 1975; Deutsch, Fishman, Kogan, North, & Whiteman, 1964; The
responsible use of tests, 1972). A brief but cogent paper by Flaugher
also helps to clear away some prevalent sources of confusion. Much
of the concern centers on the lowering of test scores by cultural condi-
tions that may have affected the development of aptitudes, interests,
motivation, attitudes, and other psychological characteristics of minority
group members. Some of the proposed solutions for the problem, how-
ever, reflect misunderstandings about the nature and function of psycho-
logical tests. Differences in the experiential backgrounds of groups or
individuals are inevitably manifested in test performance. Every psycho-
logical test measures a behavior sample. Insofar as culture affects be-
havior, its influence will and should be detected by tests. If we rule out
all cultural differentials from a test, we may thereby lower its validity as
a measure of the behavior domain it was designed to assess. In that case,
the test would fail to provide the kind of information needed to correct
the very conditions that impaired performance.

Because the testing of minorities represents a special case within the
broader problem of cross-cultural testing, the underlying theoretical
rationale and testing procedures are discussed more fully in Chapter 12.
A technical analysis of the concept of test bias is given in Chapter 7, in
connection with test validity. In the present chapter, our interest is
primarily in the basic issues and social implications of minority group
testing.
iarity with such objects. On the other hand, if the development of arith-
metic ability itself is more strongly fostered in one culture than in an-
other, scores on an arithmetic test should not eliminate or conceal such
a difference.
Another, more subtle way in which specific test content may spuriously
affect performance is through the examinee's emotional and attitudinal
responses. Stories or pictures portraying typical suburban middle-class
family scenes, for example, may alienate a child reared in a low-income
inner-city home. Exclusive representation of the physical features of a
single racial type in test illustrations may have a similar effect on mem-
bers of an ethnic minority. In the same vein, women's organizations have
objected to the perpetuation of sex stereotypes in test content, as in the
portrayal of male doctors or executives and female nurses or secretaries.
Certain words, too, may have acquired connotations that are offensive to
minority groups. As one test publisher aptly expressed it, "Until fairly
recently, most standardized tests were constructed by white middle-class
people, who sometimes clumsily violate the feelings of the test-taker
without even knowing it. In a way, one could say that we have been not
so much culture biased as we have been 'culture blind'" (Fitzgibbon,
1972, pp. 2-3).
The major test publishers now make special efforts to weed out in-
appropriate test content. Their own test construction staffs have become
sensitized to potentially offensive, culturally restricted, or stereotyped
material. Members of different ethnic groups participate either as regular
staff members or as consultants. And the reviewing of test content with
reference to possible minority implications is a regular step in the process
of test construction. An example of the application of these procedures
in item construction and revision is provided by the 1970 edition of the
Metropolitan Achievement Tests (Fitzgibbon, 1972; Harcourt Brace Jo-
vanovich, 1972).
TEST-RELATED FACTORS. In testing culturally diverse persons, it is im-
portant to differentiate between cultural factors that affect both test and
criterion behavior and those whose influence is restricted to the test. It is
the latter, test-related factors that reduce test validity. Examples of such
factors include previous experience in taking tests, motivation to perform
well on tests, rapport with the examiner, and any other variables af-
fecting performance on the particular test but irrelevant to the criterion
behavior under consideration. Special efforts should be made to reduce
the operation of these test-related factors when testing persons with dis-
similar cultural backgrounds. A desirable procedure is to provide ade-
quate test-taking orientation and preliminary practice, as illustrated by
the booklets and tape recordings cited in Chapter 2. Retesting with a
parallel form is also recommended with low-scoring examinees who have
had little or no prior testing experience.

Other conditions that may affect test scores, while being unrelated to
criterion performance, include specific test content. For example, the use
of names or pictures of objects unfamiliar in a particular cultural milieu
would obviously represent a test-restricted handicap. Ability to carry out
quantitative thinking does not depend upon famil-
INTERPRETATION AND USE OF TEST SCORES. By far the most important
considerations in the testing of culturally diverse groups-as in all testing
-pertain to the interpretation of test scores. The most frequent misgiv-
ings regarding the use of tests with minority group members stem from
misinterpretations of scores. If a minority examinee obtains a low score
on an aptitude test or a deviant score on a personality test, it is essential
to investigate why he did so. For example, an inferior score on an arith-
metic test could result from low test-taking motivation, poor reading
ability, or inadequate knowledge of arithmetic, among other reasons.
Some thought should also be given to the type of norms to be employed
in evaluating individual scores. Depending on the purpose of the testing,
the appropriate norms may be general norms or subgroup norms based on
Many bright, non-conforming pupils, with backgrounds different from those of
their teachers, make favorable showings on achievement tests, in contrast to
their low classroom marks. These are very often children whose cultural handi-
caps are most evident in their overt social and interpersonal behavior. Without
the intervention of standardized tests, many such children would be stigma-
tized by the adverse subjective ratings of teachers who tend to reward con-
formist behavior of middle-class character.
an IQ would thus serve to perpetuate their handicap. It is largely be-
cause implications of permanent status have become attached to the IQ
that in 1964 the use of group intelligence tests was discontinued in the
New York City public schools (H. B. Gilbert, 1966; Loretan, 1966). That
it proved necessary to discard the tests in order to eliminate the miscon-
ceptions about the fixity of the IQ is a revealing commentary on the
tenacity of the misconceptions. It should also be noted that the use of
individual intelligence tests like the Stanford-Binet, which are admin-
istered and interpreted by trained examiners and school psychologists,
was not eliminated. It was the mass testing and routine use of IQs by
relatively unsophisticated persons that was considered hazardous.

According to a popular misconception, the IQ is an index of innate
intellectual potential and represents a fixed property of the organism. As
will be seen in Chapter 12, this view is neither theoretically defensible
nor supported by empirical data. When properly interpreted, intelligence
test scores should not foster a rigid categorizing of persons. On the con-
trary, intelligence tests-and any other test-may be regarded as a map
on which the individual's present position can be located. When com-
bined with information about his experiential background, test scores
should facilitate effective planning for the optimal development of the
individual.
OBJECTIVITY OF TESTS. When social stereotypes and prejudice may dis-
tort interpersonal evaluations, tests provide a safeguard against fa-
voritism and arbitrary or capricious decisions. Commenting on the use of
tests in schools, Gardner (1961, pp. 48-49) wrote: "The tests couldn't see
whether the youngster was in rags or in tweeds, and they couldn't hear
the accents of the slum. The tests revealed intellectual gifts at every level
of the population."

In the same vein, the Guidelines for Testing Minority Group Children
(Deutsch et al., 1964, p. 139) contain the following observation:
With regard to personnel selection, the contribution of tests was aptly
characterized in the following words by John W. Macy, Jr., Chairman of
the United States Civil Service Commission (Testing and Public Policy,
1965, p. 883):
LEGAL REGULATIONS. A number of states enacted legislation and estab-
lished Fair Employment Practices Commissions (FEPC) to implement it,
prior to the development of such legal mechanisms at the federal level.
Among the states that did so, efforts have been made to pattern the
regulations after the federal model.2 The most pertinent federal
legislation is provided by the Equal Employment Opportunity Act (Title
VII of the Civil Rights Act of 1964 and its subsequent amendments).
Responsibility for implementation and enforcement is vested in the
Equal Employment Opportunity Commission (EEOC). When charges
are filed, the EEOC investigates the complaint and, if it finds the charges
to be justified, tries first to correct the situation through conferences and
voluntary compliance. If these procedures fail, EEOC may proceed to
hold hearings, issue cease and desist orders, and finally bring action in

2 A brief summary of the major legal developments since midcentury, including
legislative actions, executive orders, and court decisions, can be found in Fincher
(1973).
the federal courts. In states having an approved FEPC, the Commission
will defer to the local agency and will give its findings and conclusions
"substantial weight."
The Office of Federal Contract Compliance (OFCC) has the authority
to monitor the use of tests for employment purposes by government con-
tractors. Colleges and universities are among the institutions concerned
with OFCC regulations, because of their many research and training
grants from such federal sources as the Department of Health, Educa-
tion, and Welfare. Both EEOC and OFCC have drawn up guidelines re-
garding employee testing and other selection procedures, which are vir-
tually identical in substance. A copy of the EEOC Guidelines on Em-
ployee Selection Procedures is reproduced in Appendix B, together with
a 1974 amendment of the OFCC guidelines clarifying acceptable pro-
cedures for reporting test validity.3
Some major provisions in the EEOC Guidelines should be noted. The
Equal Employment Opportunity Act prohibits discrimination by em-
ployers, trade unions, or employment agencies on the basis of race, color,
religion, sex, or national origin. It is recognized that properly conducted
testing programs not only are acceptable under this Act but can also
contribute to the "implementation of nondiscriminatory personnel poli-
cies." Moreover, the same regulations specified for tests are also applied
to all other formal and informal selection procedures, such as educational
or work-history requirements, interviews, and application forms (Sec-
tions 2 and 13).
\Vhen the use of a test (or other selection procedure) results in a
significantly higher rejection rate for minority candidates than for non-
minority candidates, its utility must be justified by evidence of validity
for the job in question. In defining acceptable procedures for establish-
ing validity, the Guidelines make explicit reference to the Standards for
Educational and Psychological Tests (1974) prepared by the American
Psychological Association. A major portion of the Guidelines covers mini-
mum requirements for acceptable validation (Sections 5 to 9). The
reader may find it profitable to review these requirements after reading
the more detailed technical discussion of validity in Chapters 6 and 7 of
this book. It will be seen that the requirements are generally in line with
good psychometric practice.
In the final section, dealing with affirmative action, the Guidelines
point out that even when selection procedures have been satisfactorily
The necessity to measure characteristics of people that are related to job per-
formance is at the very root of the merit system, which is the basis for entry
into the career services of the Federal Government. Thus, over the years, the
service has had a vital interest in the development and application of psycho-
logical testing methods. I have no doubt that the widespread public confidence
in the objectivity of our procedures has in large part been fostered by the
public's perception of the fairness, the practicality, and the . . . of the
appraisal methods they must submit to.
The Guidelines on Employee Selection Procedures, prepared by the
Equal Employment Opportunity Commission (1970) as an aid in the
implementation of the Civil Rights Act, begin with the following state-
ment of purpose:

The guidelines in this part are based on the belief that properly validated and
standardized employee selection procedures can significantly contribute to the
implementation of nondiscriminatory personnel policies, as required by Title
VII. It is also recognized that professionally developed tests, when used in
conjunction with other tools of personnel assessment and complemented by
sound programs of job design, may significantly aid in the development and
maintenance of an efficient work force and, indeed, aid in the utilization and
conservation of human resources generally.
In summary, tests can be misused in testing culturally disadvantaged
persons-as in testing anyone else. When properly used, however, they
serve an important function in preventing irrelevant and unfair discrim-
ination. They also provide a quantitative index of the extent of cultural
handicap as a necessary first step in remedial programs.
3 In 1973, in the interest of simplification and improved coordination, the prepara-
tion of a set of uniform guidelines was undertaken by the Equal Employment Op-
portunity Coordinating Council, consisting of representatives of EEOC, the U.S.
Department of Justice, the U.S. Civil Service Commission, the U.S. Department of
Labor, and the U.S. Commission on Civil Rights. No uniform version has as yet
been adopted.
validated, if disproportionate rejection rates result for minorities, steps
should be taken to reduce this discrepancy as much as possible. Affirmative
action implies that an organization does more than merely avoiding dis-
criminatory practices. Psychologically, affirmative action programs may
be regarded as efforts to compensate for the residual effects of past social
inequities. Such effects may include deficiencies in aptitudes, job skills,
information, motivation, and other job-related behavior. They may also be
reflected in a person's reluctance to apply for a job not traditionally open
to minority candidates, or in his inexperience in job-seeking procedures.
Examples of affirmative actions in meeting these problems include re-
cruiting through media most likely to reach minorities; explicitly en-
couraging minority candidates to apply and following other recruiting
practices designed to counteract past stereotypes; and, when practicable,
providing special training programs for the acquisition of prerequisite
skills and knowledge.
PART 2
Principles of
Psychological Testing
CHAPTER 4
Norms and the
Interpretation of Test Scores
IN THE absence of additional interpretive data, a raw score on any
psychological test is meaningless. To say that an individual has
correctly solved 15 problems on an arithmetic reasoning test, or
identified 34 words in a vocabulary test, or successfully assembled a
mechanical object in 57 seconds conveys little or no information about
his standing in any of these functions. Nor do the familiar percentage
scores provide a satisfactory solution to the problem of interpreting test
scores. A score of 65 percent correct on one vocabulary test, for example,
might be equivalent to 30 percent correct on another, and to 80 percent
correct on a third. The difficulty level of the items making up each test
will, of course, determine the meaning of the score. Like all raw scores,
percentage scores can be interpreted only in terms of a clearly defined
and uniform frame of reference.
Scores on psychological tests are most commonly interpreted by ref-
erence to norms which represent the test performance of the stand-
ardization sample. The norms are thus empirically established by de-
termining what a representative group of persons actually do on the test.
Any individual's raw score is then referred to the distribution of scores
obtained by the standardization sample, to discover where he falls in that
distribution. Does his score coincide with the average performance of the
standardization group? Is he slightly below average? Or does he fall near
the upper end of the distribution?
In order to determine more precisely the individual's exact position
with reference to the standardization sample, the raw score is converted
into some relative measure. These derived scores are designed to serve a
dual purpose. First, they indicate the individual's relative standing in
the normative sample and thus permit an evaluation of his performance
in reference to other persons. Second, they provide comparable measures
that permit a direct comparison of the individual's performance on dif-
ferent tests. For example, if an individual has a raw score of 40 on a
vocabulary test and a raw score of 22 on an arithmetic reasoning test, we
know nothing about his relative performance on the two tests. Is he
better in vocabulary or in arithmetic, or equally good in both? Since
raw scores on different tests are usually expressed in different units, a
direct comparison of such scores is impossible. The difficulty level of the
particular test would also affect such a comparison between raw scores.
Derived scores, on the other hand, can be expressed in the same units
and referred to the same or to closely similar normative samples for
different tests. The individual's relative performance in many different
functions can thus be compared.

There are various ways in which raw scores may be converted to fulfill
the two objectives stated above. Fundamentally, however, derived scores
are expressed in one of two major ways: (1) developmental level at-
tained, or (2) relative position within a specified group. These types of
scores, together with some of their common variants, will be considered
in separate sections of this chapter. But first it will be necessary to ex-
amine certain elementary statistical concepts that underlie the develop-
ment and utilization of norms. The following section is included simply
to clarify the meaning of certain common statistical measures. Simplified
computational examples are given only for this purpose and not to pro-
vide training in statistical methods. For computational details and spe-
cific procedures to be followed in the practical application of these tech-
niques, the reader is referred to any recent textbook on psychological or
educational statistics.
TABLE 1
Frequency Distribution of Scores of 1,000 College Students
on a Code-Learning Test
(Data from Anastasi, 1934, p. 34)

Class Interval    Frequency
52-55                  1
48-51                  1
44-47                 20
40-43                 73
36-39                156
32-35                328
28-31                244
24-27                136
20-23                 28
16-19                  8
12-15                  3
8-11                   2
Total              1,000
The object of statistical method is to organize and summarize quanti-
tative data in order to facilitate their understanding. A list of 1,000
scores can be an overwhelming sight. In that form, it conveys little
meaning. A first step in bringing order into such a chaos of raw data is to
tabulate the scores into a frequency distribution, as illustrated in Table 1.
Such a distribution is prepared by grouping the scores into convenient
class intervals and tallying each score in the appropriate interval. When
all the scores have been entered, the tallies are counted to find the
frequency, or number of cases, in each class interval. The sums of these
frequencies equal the total number of cases in the group. Table 1 shows
the scores of 1,000 college students in a code-learning test in which one
set of artificial words, or nonsense syllables, was to be substituted for an-
other. The scores, giving number of correct syllables substituted during
a two-minute trial, ranged from 8 to 52. They have been grouped into
class intervals of 4 points, from 52-55 at the top of the distribution down
to 8-11. The frequency column reveals that two persons scored between
8 and 11, three between 12 and 15, eight between 16 and 19, and so on.
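The tallying procedure just described is easy to mimic in code. The interval width of 4 and the starting score of 8 follow the text; the short list of raw scores below is invented for the illustration:

```python
def frequency_distribution(scores, low, width):
    """Group scores into class intervals of the given width, starting at
    `low`, and count the frequency in each interval (as in Table 1)."""
    freqs = {}
    for s in scores:
        start = low + ((s - low) // width) * width   # lower bound of interval
        key = (start, start + width - 1)             # e.g. (8, 11)
        freqs[key] = freqs.get(key, 0) + 1
    # List the highest interval first, as in the printed table.
    return dict(sorted(freqs.items(), reverse=True))

scores = [9, 14, 41, 33, 33, 28, 41, 35, 22, 38]     # invented raw scores
for (lo, hi), n in frequency_distribution(scores, low=8, width=4).items():
    print(f"{lo}-{hi}: {n}")
```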
The information provided by a frequency distribution can also be
presented graphically in the form of a distribution curve. Figure 1 shows
the data of Table 1 in graphic form. On the baseline, or horizontal axis,
are the scores grouped into class intervals; on the vertical axis are the
frequencies, or number of cases falling within each class interval. The
graph has been plotted in two ways, both forms being in common use.
In the histogram, the height of the column erected over each class in-
terval corresponds to the number of persons scoring in that interval. We
can think of each individual as standing on another's shoulders to form
the column. In the frequency polygon, the number of persons in each
interval is indicated by a point placed in the center of the class interval
and across from the appropriate frequency. The successive points are
then joined by straight lines.
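The height-of-column idea behind the histogram can be shown with a crude text plot, using the frequencies of Table 1 (each # stands for roughly 10 cases):

```python
# Frequencies as read from Table 1, highest class interval first.
freqs = {"52-55": 1, "48-51": 1, "44-47": 20, "40-43": 73, "36-39": 156,
         "32-35": 328, "28-31": 244, "24-27": 136, "20-23": 28,
         "16-19": 8, "12-15": 3, "8-11": 2}

for interval, n in freqs.items():
    # One '#' per 10 cases; the longest bar marks the modal interval 32-35.
    print(f"{interval:>5} | {'#' * (n // 10)}")
```

The bars peak at the 32-35 interval and taper off toward both extremes, the same roughly bell-shaped pattern described for Figure 1.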
Except for minor irregularities, the distribution portrayed in Figure 1
resembles the bell-shaped normal curve. A mathematically determined,
perfect normal curve is reproduced in Figure 3. This type of curve has
important mathematical properties and provides the basis for many kinds
of statistical analyses. For the present purpose, however, only a few fea-
tures will be noted. Essentially, the curve indicates that the largest
number of cases cluster in the center of the range and that the number
The most obvious and familiar way of reporting variability is in terms of
the range between the highest and lowest score. The range, however, is
extremely crude and unstable, for it is determined by only two scores. A
single unusually high or low score would thus markedly affect its size. A
more precise method of measuring variability is based on the difference
between each individual's score and the mean of the group.

At this point it will be helpful to look at the example in Table 2, in
which the various measures under consideration have been computed on
a small group of scores. A small group was chosen to simplify the demon-
stration, although in actual practice we would rarely perform these com-
putations on so few cases. Table 2 serves also to introduce certain stand-
ard statistical symbols that should be noted for future reference. Original
raw scores are conventionally designated by a capital X, and a small x is
used to refer to deviations of each score from the group mean. The Greek
letter Σ means "sum of." It will be seen that the first column in Table 2
gives the data for the computation of mean and median. The mean is
40; the median is 40.5, falling midway between 40 and 41. Five cases
drops off gradually in both directions as the extremes are approached.
The curve is bilaterally symmetrical, with a single peak in the center.
Most distributions of human traits, from height and weight to aptitudes
and personality characteristics, approximate the normal curve. In gen-
eral, the larger the group, the more closely will the distribution resemble
the theoretical normal curve.
TABLE 2
Illustration of Central Tendency and Variability

Score (X)    Deviation (x)    Deviation Squared (x²)
48               +8                  64
47               +7                  49
43               +3                   9
41               +1                   1
41               +1                   1
40                0                   0
38               -2                   4
36               -4                  16
34               -6                  36
32               -8                  64
ΣX = 400     Σ|x| = 40           Σx² = 244

M = ΣX/N = 400/10 = 40
Median = 40.5 (midway between 40 and 41)
AD = Σ|x|/N = 40/10 = 4
Variance = σ² = Σx²/N = 244/10 = 24.40
SD or σ = √(Σx²/N) = √24.40 = 4.9
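The computations of Table 2 can be verified directly. The ten scores below are chosen to be consistent with the sums shown in the table (ΣX = 400, Σ|x| = 40, Σx² = 244):

```python
from statistics import mean, median

scores = [48, 47, 43, 41, 41, 40, 38, 36, 34, 32]   # consistent with Table 2

M = mean(scores)                                  # ΣX / N = 400 / 10
Md = median(scores)                               # midway between 40 and 41
devs = [s - M for s in scores]                    # deviations x; sum to zero
AD = sum(abs(x) for x in devs) / len(scores)      # average deviation, Σ|x| / N
variance = sum(x * x for x in devs) / len(scores) # σ² = Σx² / N
SD = variance ** 0.5                              # σ

print(M, Md, AD, variance, round(SD, 1))
```

Running this reproduces the values in the table: a mean of 40, a median of 40.5, an AD of 4, a variance of 24.40, and an SD of 4.9.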
FIG. 1. Distribution Curves: Frequency Polygon and Histogram.
(Data from Table 1.)
A group of scores can also be described in terms of some measure of
central tendency. Such a measure provides a single, most typical or repre-
sentative score to characterize the performance of the entire group. The
most familiar of these measures is the average, more technically known
as the mean (M). As is well known, this is found by adding all scores
and dividing the sum by the number of cases (N). Another measure of
central tendency is the mode, or most frequent score. In a frequency
distribution, the mode is the midpoint of the class interval with the
highest frequency. Thus, in Table 1, the mode falls midway between 32
and 35, being 33.5. It will be noted that this score corresponds to the
highest point on the distribution curve in Figure 1. A third measure of
central tendency is the median, or middlemost score when all scores
have been arranged in order of size. The median is the point that bisects
the distribution, half the cases falling above it and half below.Further description of a set of test scores is given by measures of varia-
, "', ..• 1. r ~ ••• ~"'t "f ;..,rl;"i"'l1~ 1 flifkrences around the central tendency.
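The three measures of central tendency just described can be checked with Python's statistics module; a minimal sketch, using ten scores consistent with Table 2's summary values:

```python
from statistics import mean, median, mode

scores = [48, 47, 43, 41, 41, 40, 38, 36, 34, 32]

assert mean(scores) == 40      # sum of 400 divided by N = 10
assert median(scores) == 40.5  # midway between the 5th and 6th scores in order
assert mode(scores) == 41      # the only score obtained by two persons
```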
(50 percent) are above the median and five below. There is little point in computing a mode in such a small group, since the cases do not show clear-cut clustering on any one score. Technically, however, 41 would represent the mode, because two persons obtained this score, while all other scores occur only once.

The second column shows how far each score deviates above or below the mean of 40. The sum of these deviations will always equal zero, because the positive and negative deviations around the mean necessarily balance, or cancel each other out (+20 - 20 = 0). If we ignore signs, however, we can average the absolute deviations, thus obtaining a measure known as the average deviation (AD). The symbol Σ|x| in the AD formula indicates that absolute values were summed, without regard to sign. Although of some descriptive value, the AD is not suitable for use in further mathematical analyses because of the arbitrary discarding of signs.
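The cancellation of signed deviations, and the AD, variance, and SD computations of Table 2, can be sketched directly (scores chosen to match Table 2's summary values):

```python
scores = [48, 47, 43, 41, 41, 40, 38, 36, 34, 32]
n = len(scores)
m = sum(scores) / n                    # mean = 40

deviations = [x - m for x in scores]
assert sum(deviations) == 0            # + and - deviations cancel out

ad = sum(abs(d) for d in deviations) / n        # average deviation: 40/10
variance = sum(d ** 2 for d in deviations) / n  # mean square deviation: 244/10
sd = variance ** 0.5                            # square root of the variance

assert ad == 4.0
assert variance == 24.4
assert round(sd, 1) == 4.9
```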
FIG. 3. Percentage Distribution of Cases in a Normal Curve.
different tests in terms of norms, as will be shown in the section on standard scores. The interpretation of the SD is especially clear-cut when applied to a normal or approximately normal distribution curve. In such a distribution, there is an exact relationship between the SD and the proportion of cases, as shown in Figure 3. On the baseline of this normal curve have been marked distances representing one, two, and three standard deviations above and below the mean. For instance, in the example given in Table 2, the mean would correspond to a score of 40, +1σ to 44.9 (40 + 4.9), +2σ to 49.8 (40 + 2 × 4.9), and so on. The percentage of cases that fall between the mean and +1σ in a normal curve is 34.13. Because the curve is symmetrical, 34.13 percent of the cases are likewise found between the mean and -1σ, so that between +1σ and -1σ on both sides of the mean there are 68.26 percent of the cases. Nearly all the cases (99.72 percent) fall within ±3σ from the mean. These relationships are particularly relevant in the interpretation of standard scores and percentiles, to be discussed in later sections.
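These proportions can be recomputed from the normal distribution function; a sketch using Python's statistics.NormalDist with the Table 2 mean and SD (the continuous curve gives 68.27, 95.45, and 99.73; the slightly smaller figures quoted above reflect truncated rather than rounded table values):

```python
from statistics import NormalDist

nd = NormalDist(mu=40, sigma=4.9)  # mean and SD from Table 2

# Proportion of cases falling within 1, 2, and 3 SDs of the mean.
for k in (1, 2, 3):
    pct = (nd.cdf(40 + k * 4.9) - nd.cdf(40 - k * 4.9)) * 100
    print(f"within ±{k} SD: {pct:.2f}%")  # 68.27%, 95.45%, 99.73%
```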
One way in which meaning can be attached to test scores is to indicate how far along the normal developmental path the individual has progressed. Thus an 8-year-old who performs as well as the average 10-year-old on an intelligence test may be described as having a mental age of 10; a mentally retarded adult who performs at the same level would likewise be assigned an MA of 10. In a different context, a fourth-grade child may be characterized as reaching the sixth-grade norm in a reading test and the third-grade norm in an arithmetic test. Other developmental systems utilize more highly qualitative descriptions of behavior in specific
FIG. 2. Frequency Distributions with the Same Mean but Different Variability.
A much more serviceable measure of variability is the standard deviation (symbolized by either SD or σ), in which the negative signs are legitimately eliminated by squaring each deviation. This procedure has been followed in the last column of Table 2. The sum of this column (Σx²) divided by the number of cases (N) is known as the variance, or mean square deviation, and is symbolized by σ². The variance has proved extremely useful in sorting out the contributions of different factors to individual differences in test performance. For the present purposes, however, the chief concern is with the SD, which is the square root of the variance, as shown in Table 2. This measure is commonly employed in comparing the variability of different groups. In Figure 2, for example, are two distributions having the same mean but differing in variability. The distribution with wider individual differences yields a larger SD than one with narrower individual differences. The SD also provides the basis for expressing an individual's scores on
functions ranging from sensorimotor activities to concept formation. However expressed, scores based on developmental norms tend to be psychometrically crude and do not lend themselves well to precise statistical treatment. Nevertheless, they have considerable appeal for descriptive purposes, especially in the intensive clinical study of individuals and for certain research purposes.
Norms and the Interpretation of Test Scores 75
readily visualized if we think of the individual's height as being expressed in terms of height age. The difference in inches between a height age of 3 and 4 years would be greater than that between a height age of 10 and 11. Owing to the progressive shrinkage of the MA unit, one year of acceleration or retardation at, let us say, age 5 represents a larger deviation from the norm than does one year of acceleration or retardation at age 10.
MENTAL AGE. In Chapter 1 it was noted that the term "mental age" was widely popularized through the various translations and adaptations of the Binet-Simon scales, although Binet himself had employed the more neutral term "mental level." In age scales such as the Binet and its revisions, items are grouped into year levels. For example, those items passed by the majority of 7-year-olds in the standardization sample are placed in the 7-year level, those passed by the majority of 8-year-olds are assigned to the 8-year level, and so forth. A child's score on the test will then correspond to the highest year level that he can successfully complete. In actual practice, the individual's performance shows a certain amount of scatter. In other words, the subject fails some tests below his mental age level and passes some above it. For this reason, it is customary to compute the basal age, i.e., the highest age at and below which all tests are passed. Partial credits, in months, are then added to this basal age for all tests passed at higher year levels. The child's mental age on the test is the sum of the basal age and the additional months of credit
earned at higher age levels.

Mental age norms have also been employed with tests that are not divided into year levels. In such a case, the subject's raw score is first determined. Such a score may be the total number of correct items on the whole test; or it may be based on time, on number of errors, or on some combination of such measures. The mean raw scores obtained by the children in each year group within the standardization sample constitute the age norms for such a test. The mean raw score of the 8-year-old children, for example, would represent the 8-year norm. If an individual's raw score is equal to the mean 8-year-old raw score, then his mental age on the test is 8 years. All raw scores on such a test can be transformed in a similar manner by reference to the age norms.

It should be noted that the mental age unit does not remain constant with age, but tends to shrink with advancing years. For example, a child who is one year retarded at age 4 will be approximately three years retarded at age 12. One year of mental growth from ages 3 to 4 is equivalent to three years of growth from ages 9 to 12. Since intellectual development progresses more rapidly at the earlier ages and gradually decreases as the individual approaches his mature limit, the mental age unit shrinks correspondingly with age. This relationship may be more
GRADE EQUIVALENTS. Scores on educational achievement tests are often interpreted in terms of grade equivalents. This practice is understandable because the tests are employed within an academic setting. To describe a pupil's achievement as equivalent to seventh-grade performance in spelling, eighth-grade in reading, and fifth-grade in arithmetic has the same popular appeal as the use of mental age in the traditional intelligence tests.

Grade norms are found by computing the mean raw score obtained by children in each grade. Thus, if the average number of problems solved correctly on an arithmetic test by the fourth graders in the standardization sample is 23, then a raw score of 23 corresponds to a grade equivalent of 4. Intermediate grade equivalents, representing fractions of a grade, are usually found by interpolation, although they can also be obtained directly by testing children at different times within the school year. Because the school year covers ten months, successive months can be expressed as decimals. For example, 4.0 refers to average performance at the beginning of the fourth grade (September testing), 4.5 refers to average performance at the middle of the grade (February testing), and so forth.
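The interpolation between grade norms can be sketched as follows. Only the fourth-grade norm of 23 comes from the example above; the other norm values and the helper name are invented for illustration:

```python
# Hypothetical grade norms: mean raw score earned by each grade's
# standardization group (grade 4 norm = 23, as in the text's example).
grade_norms = {3: 17, 4: 23, 5: 30, 6: 38}

def grade_equivalent(raw):
    """Linearly interpolate a raw score between adjacent grade norms."""
    grades = sorted(grade_norms)
    for lo, hi in zip(grades, grades[1:]):
        s_lo, s_hi = grade_norms[lo], grade_norms[hi]
        if s_lo <= raw <= s_hi:
            return lo + (raw - s_lo) / (s_hi - s_lo)
    raise ValueError("raw score outside the range of the norms")

assert grade_equivalent(23) == 4.0               # exactly the fourth-grade norm
assert round(grade_equivalent(26.5), 1) == 4.5   # midway to the grade 5 norm
```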
Despite their popularity, grade norms have several shortcomings. First, the content of instruction varies somewhat from grade to grade. Hence, grade norms are appropriate only for common subjects taught throughout the grade levels covered by the test. They are not generally applicable at the high school level, where many subjects may be studied for only one or two years. Even with subjects taught in each grade, however, the emphasis placed on different subjects may vary from grade to grade, and progress may therefore be more rapid in one subject than in another during a particular grade. In other words, grade units are obviously unequal and these inequalities occur irregularly in different subjects.
Grade norms are also subject to misinterpretation unless the test user keeps firmly in mind the manner in which they were derived. For example, if a fourth-grade child obtains a grade equivalent of 6.9 in arithmetic, it does not mean that he has mastered the arithmetic processes taught in the sixth grade. He undoubtedly obtained his score largely by superior performance in fourth-grade arithmetic. It certainly could not be assumed that he has the prerequisites for seventh-grade arithmetic.

Grade norms also tend to be incorrectly regarded as performance standards. A sixth-grade teacher, for example, may assume that all the pupils in her class should fall at or close to the sixth-grade norm in achievement tests. This misconception is certainly not surprising when grade norms are used. Yet individual differences within any one grade are such that the range of achievement test scores will inevitably extend over several grades.
ORDINAL SCALES. Another approach to developmental norms derives from research in child psychology. Empirical observation of behavior development in infants and young children led to the description of behavior typical of successive ages in such functions as locomotion, sensory discrimination, linguistic communication, and concept formation. An early example is provided by the work of Gesell and his associates at Yale (Gesell et al., 1940; Gesell & Amatruda, 1947; Halverson, 1933). The Gesell Developmental Schedules show the approximate developmental level in months that the child has attained in each of four major areas of behavior, namely, motor, adaptive, language, and personal-social. These levels are found by comparing the child's behavior with that typical of eight key ages, ranging from 4 weeks to 36 months.

Gesell and his co-workers emphasized the sequential patterning of early behavior development. They cited extensive evidence of uniformities of developmental sequences and an orderly progression of behavior changes. For example, the child's reactions toward a small object placed in front of him exhibit a characteristic chronological sequence in visual fixation and in hand and finger movements. Use of the entire hand in crude attempts at palmar prehension occurs at an earlier age than use of the thumb in opposition to the palm; this type of prehension is followed by use of the thumb and index finger in a more efficient pincer-like grasp of the object. Such sequential patterning was likewise observed in walking, stair climbing, and most of the sensorimotor development of the first few years. The scales developed by Gesell are ordinal in the sense that developmental stages follow in a constant order, each stage presupposing mastery of prerequisite behavior characteristic of earlier stages.

1 This usage of the term "ordinal scale" differs from that in statistics, in which an ordinal scale is simply one that permits a rank-ordering of individuals without knowledge about amount of difference between them; in the statistical sense, ordinal scales are contrasted to equal-unit interval scales. Ordinal scales of child development are usually designed on the model of a Guttman scale, or simplex, in which successful performance at one level implies success at all lower levels (Guttman, 1944). An
Since the 1960s, there has been a sharp upsurge of interest in the developmental theories of the Swiss child psychologist, Jean Piaget (see
Flavell, 1963; Ginsburg & Opper, 1969; Green, Ford, & Flamer, 1971).
Piaget's research has focused on the development of cognitive processes
from infancy to the midteens. He is concerned with specific concepts
rather than broad abilities. An example of such a concept, or schema, is
object permanence, whereby the child is aware of the identity and con-
tinuing existence of objects when they are seen from different angles
or are out of sight. Another widely studied concept is conservation, or
the recognition that an attribute remains constant over changes in per-
ceptual appearance, as when the same quantity of liquid is poured into
differently shaped containers, or when rods of the same length are placed
in different spatial arrangements.
Piagetian tasks have been used widely in research by developmental
psychologists and some have been organized into standardized scales,
to be discussed in Chapters 10 and 14 (Goldschmid & Bentler, 1968b;
Loretan, 1966; Pinard & Laurendeau, 1964; Uzgiris & Hunt, 1975). In accordance with Piaget's approach, these instruments are ordinal scales, in
which the attainment of one stage is contingent upon completion of the
earlier stages in the development of the concept. The tasks are designed
to reveal the dominant aspects of each developmental stage; only later
are empirical data gathered regarding the ages at which each stage is typically reached. In this respect, the procedure differs from that followed in constructing age scales, in which items are selected in the first
place on the basis of their differentiating between successive ages.
In summary, ordinal scales are designed to identify the stage reached
by the child in the development of specific behavior functions. Although
scores may be reported in terms of approximate age levels, such scores
are secondary to a qualitative description of the child's characteristic be-
havior. The ordinality of such scales refers to the uniform progression of
development through successive stages. Insofar as these scales typically
provide information about what the child is actually able to do (e.g.,
climbs stairs without assistance; recognizes identity in quantity of liquid
when poured into differently shaped containers), they share important
features with the criterion-referenced tests to be discussed in a later
section of this chapter.
Nearly all standardized tests now provide some form of within-group norms. With such norms, the individual's performance is evaluated in
extension of Guttman's analysis to include nonlinear hierarchies is described by Bart and Airasian (1974), with special reference to Piagetian scales.
terms of the performance of the most nearly comparable standardization group, as when comparing a child's raw score with that of children of the same chronological age or in the same school grade. Within-group scores have a uniform and clearly defined quantitative meaning and can be appropriately employed in most types of statistical analysis.
PERCENTILES. Percentile scores are expressed in terms of the percentage of persons in the standardization sample who fall below a given raw score. For example, if 28 percent of the persons obtain fewer than 15 problems correct on an arithmetic reasoning test, then a raw score of 15 corresponds to the 28th percentile (P28). A percentile indicates the individual's relative position in the standardization sample. Percentiles can also be regarded as ranks in a group of 100, except that in ranking it is customary to start counting at the top, the best person in the group receiving a rank of one. With percentiles, on the other hand, we begin counting at the bottom, so that the lower the percentile, the poorer the individual's standing.

The 50th percentile (P50) corresponds to the median, already discussed as a measure of central tendency. Percentiles above 50 represent above-average performance; those below 50 signify inferior performance. The 25th and 75th percentiles are known as the first and third quartile points (Q1 and Q3), because they cut off the lowest and highest quarters of the distribution. Like the median, they provide convenient landmarks for describing a distribution of scores and comparing it with other distributions.

Percentiles should not be confused with the familiar percentage scores. The latter are raw scores, expressed in terms of the percentage of correct items; percentiles are derived scores, expressed in terms of percentage of persons. A raw score lower than any obtained in the standardization sample would have a percentile rank of zero (P0); one higher than any score in the standardization sample would have a percentile rank of 100 (P100). These percentiles, however, do not imply a zero raw score and a perfect raw score.

Percentile scores have several advantages. They are easy to compute
and can be readily understood, even by relatively untrained persons. Moreover, percentiles are universally applicable. They can be used equally well with adults and children and are suitable for any type of test, whether it measures aptitude or personality variables.

The chief drawback of percentile scores arises from the marked inequality of their units, especially at the extremes of the distribution. If the distribution of raw scores approximates the normal curve, as is true of most test scores, then raw score differences near the median or center of the distribution are exaggerated in the percentile transformation,
whereas raw score differences near the ends of the distribution are greatly shrunk. This distortion of distances between scores can be seen in Figure 4. In a normal curve, it will be recalled, cases cluster closely at the center and scatter more widely as the extremes are approached. Consequently, any given percentage of cases near the center covers a shorter distance on the baseline than the same percentage near the ends of the distribution. In Figure 4, this discrepancy in the gaps between percentile ranks (PR) can readily be seen if we compare the distance between a PR of 40 and a PR of 50 with that between a PR of 10 and a PR of 20. Even more striking is the discrepancy between these distances and that between a PR of 10 and a PR of 1. (In a mathematically derived normal curve, zero percentile is not reached until infinity and hence cannot be shown on the graph.)
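Computing percentile ranks directly from a standardization sample makes the definition concrete; a sketch, with a schematic sample built to match the earlier arithmetic-reasoning example:

```python
def percentile_rank(raw, sample):
    """Percentage of the standardization sample falling below the raw score."""
    below = sum(1 for s in sample if s < raw)
    return 100 * below / len(sample)

# If 28 percent of the sample solve fewer than 15 problems correctly,
# a raw score of 15 falls at the 28th percentile.
sample = [14] * 28 + [16] * 72   # a schematic 100-person sample
assert percentile_rank(15, sample) == 28.0
```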
FIG. 4. Percentile Ranks in a Normal Distribution.
The same relationship can be seen from the opposite direction if we examine the percentile ranks corresponding to equal σ-distances from the mean of a normal curve. These percentile ranks are given under the graph in Figure 4. Thus, the percentile difference between the mean and +1σ is 34 (84 - 50). That between +1σ and +2σ is only 14 (98 - 84).
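These percentile ranks, and their unequal spacing, can be recomputed from the normal distribution function; a sketch, with ranks rounded to whole numbers as in Figure 4:

```python
from statistics import NormalDist

# Percentile rank at each whole sigma distance from the mean.
z_to_pr = {z: round(NormalDist().cdf(z) * 100) for z in (-3, -2, -1, 0, 1, 2, 3)}
# -> {-3: 0, -2: 2, -1: 16, 0: 50, 1: 84, 2: 98, 3: 100}

# Equal sigma distances cover very unequal percentile distances:
assert z_to_pr[1] - z_to_pr[0] == 34   # mean to +1 sigma: 84 - 50
assert z_to_pr[2] - z_to_pr[1] == 14   # +1 sigma to +2 sigma: 98 - 84
```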
It is apparent that percentiles show each individual's relative position in the normative sample but not the amount of difference between scores. If plotted on arithmetic probability paper, however, percentile scores can also provide a correct visual picture of the differences between scores. Arithmetic probability paper is a cross-section paper in which the vertical lines are spaced in the same way as the percentile points in a normal distribution (as in Figure 4), whereas the horizontal lines are uniformly spaced, or vice versa (as in Figure 5). Such normal percentile
of differences between standard scores derived by such a linear transformation corresponds exactly to that between the raw scores. All properties of the original distribution of raw scores are duplicated in the distribution of these standard scores. For this reason, any computations that can be carried out with the original raw scores can also be carried out with linear standard scores, without any distortion of results.

Linearly derived standard scores are often designated simply as "standard scores" or "z scores." To compute a z score, we find the difference between the individual's raw score and the mean of the normative group and then divide this difference by the SD of the normative group. Table 3 shows the computation of z scores for two individuals, one of whom falls 1 SD above the group mean, the other .40 SD below the mean. Any raw score that is exactly equal to the mean is equivalent to a z score of zero. It is apparent that such a procedure will yield derived scores that have a negative sign for all subjects falling below the mean. Moreover, because the total range of most groups extends no farther than about 3 SD's above and below the mean, such standard scores will have to be reported to at least one decimal place in order to provide sufficient differentiation among individuals.
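The z-score computation of Table 3, together with the convenience rescaling applied in scales such as the SAT's (mean 500, SD 100), can be sketched as below; the function names are ours:

```python
def z_score(raw, mean, sd):
    """Distance of a raw score from the normative mean, in SD units."""
    return (raw - mean) / sd

# Table 3's normative group: M = 60, SD = 5.
assert z_score(65, 60, 5) == 1.0    # John: 1 SD above the mean
assert z_score(58, 60, 5) == -0.4   # Bill: .40 SD below the mean

def rescale(z, new_mean, new_sd):
    """Linear transformation to a more convenient mean and SD."""
    return new_mean + z * new_sd

assert rescale(-1.0, 500, 100) == 400   # SAT-style scale
assert rescale(1.5, 500, 100) == 650
```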
FIG. 5. A Normal Percentile Chart. Percentiles are spaced so as to correspond to equal distances in a normal distribution. Compare the score distance between John and Mary with that between Ellen and Edgar; within both pairs, the percentile difference is 5 points. Jane and Dick differ by 10 percentile points, as do Bill and Debby.
TABLE 3
Computation of Standard Scores

z = (X - M)/SD

John's score: X1 = 65;  z1 = (65 - 60)/5 = +1.00
Bill's score: X2 = 58;  z2 = (58 - 60)/5 = -.40

charts can be used to plot the scores of different persons on the same profile or the scores of the same person on different tests. In either case, the interscore difference will be correctly represented. Many aptitude and achievement batteries now utilize this technique in their score profiles, which show the individual's performance in each test. An example is the Individual Report Form of the Differential Aptitude Tests, reproduced in Figure 13 (Ch. 5).
STANDARD SCORES. Current tests are making increasing use of standard scores, which are the most satisfactory type of derived score from most points of view. Standard scores express the individual's distance from the mean in terms of the standard deviation of the distribution.

Standard scores may be obtained by either linear or nonlinear transformations of the original raw scores. When found by a linear transformation, they retain the exact numerical relations of the original raw scores, because they are computed by subtracting a constant from each raw score and then dividing the result by another constant. The relative magnitude
Both the above conditions, viz., the occurrence of negative values and of decimals, tend to produce awkward numbers that are confusing and difficult to use for both computational and reporting purposes. For this reason, some further linear transformation is usually applied, simply to put the scores into a more convenient form. For example, the scores on the Scholastic Aptitude Test (SAT) of the College Entrance Examination Board are standard scores adjusted to a mean of 500 and an SD of 100. Thus a standard score of -1 on this test would be expressed as 400 (500 - 100 = 400). Similarly, a standard score of +1.5 would correspond to 650 (500 + 1.5 × 100 = 650). To convert an original standard score to the new scale, it is simply necessary to multiply the standard score by the
desired SD (100) and add it to or subtract it from the desired mean (500). Any other convenient values can be arbitrarily chosen for the new mean and SD. Scores on the separate subtests of the Wechsler Intelligence Scales, for instance, are converted to a distribution with a mean of 10 and an SD of 3. All such measures are examples of linearly transformed standard scores.

It will be recalled that one of the reasons for transforming raw scores into any derived scale is to render scores on different tests comparable. The linearly derived standard scores discussed in the preceding section will be comparable only when found from distributions that have approximately the same form. Under such conditions, a score corresponding to 1 SD above the mean, for example, signifies that the individual occupies the same position in relation to both groups. His score exceeds approximately the same percentage of persons in both distributions, and this percentage can be determined if the form of the distribution is known. If, however, one distribution is markedly skewed and the other normal, a z score of +1.00 might exceed only 50 percent of the cases in one group but would exceed 84 percent in the other.
In order to achieve comparability of scores from dissimilarly shaped distributions, nonlinear transformations may be employed to fit the scores to any specified type of distribution curve. The mental age and percentile scores described in earlier sections represent nonlinear transformations, but they are subject to other limitations already discussed. Although under certain circumstances another type of distribution may be more appropriate, the normal curve is usually employed for this purpose. One of the chief reasons for this choice is that most raw score distributions approximate the normal curve more closely than they do any other type of curve. Moreover, physical measures such as height and weight, which use equal-unit scales derived through physical operations, generally yield normal distributions. Another important advantage of the normal curve is that it has many useful mathematical properties, which facilitate further computations.
Normalized standard scores are standard scores expressed in terms of a distribution that has been transformed to fit a normal curve. Such scores can be computed by reference to tables giving the percentage of cases falling at different SD distances from the mean of a normal curve. First, the percentage of persons in the standardization sample falling at or above each raw score is found. This percentage is then located in the normal curve frequency table, and the corresponding normalized standard score is obtained. Normalized standard scores are expressed in the same form as linearly derived standard scores, viz., with a mean of zero and an SD of 1. Thus, a normalized score of zero indicates that the individual falls at the mean of a normal curve, excelling 50 percent of the group. A score of -1 means that he surpasses approximately 16 percent of the group; and a score of +1, that he surpasses 84 percent. These percentages correspond to a distance of 1 SD below and 1 SD above the mean of a normal curve, respectively, as can be seen by reference to the bottom line of Figure 4.

2 Partly for this reason and partly as a result of other theoretical considerations, it has frequently been argued that, by normalizing raw scores, an equal-unit scale could be developed for psychological measurement similar to the equal-unit scales of physical measurement. This, however, is a debatable point that involves certain questionable assumptions.
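The table-lookup procedure just described can be sketched with statistics.NormalDist, whose inv_cdf plays the role of the normal curve frequency table. The mid-percentile convention (counting half of the cases tied at the score itself) is our addition, used to avoid infinite z values at the extremes:

```python
from statistics import NormalDist

def normalized_z(raw, sample):
    """Normalized standard score: the z whose normal-curve percentage
    matches the raw score's standing in the standardization sample."""
    below = sum(1 for s in sample if s < raw)
    at = sum(1 for s in sample if s == raw)
    p = (below + at / 2) / len(sample)   # mid-percentile convention
    return NormalDist().inv_cdf(p)

sample = [48, 47, 43, 41, 41, 40, 38, 36, 34, 32]
z = normalized_z(40, sample)                    # just below the median of 40.5
assert round(NormalDist().cdf(z), 2) == 0.45    # excels 45 percent of the group
```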
Like linearly derived standard scores, normalized standard scores can be put into any convenient form. If the normalized standard score is multiplied by 10 and added to or subtracted from 50, it is converted into a T score, a type of score first proposed by McCall (1922). On this scale, a score of 50 corresponds to the mean, a score of 60 to 1 SD above the mean, and so forth. Another well-known transformation is represented by the stanine scale, developed by the United States Air Force during World War II. This scale provides a single-digit system of scores with a mean of 5 and an SD of approximately 2.³ The name stanine (a contraction of "standard nine") is based on the fact that the scores run from 1 to 9. The restriction of scores to single-digit numbers has certain computational advantages, for each score requires only a single column on computer punched cards.
TABLE 4
Normal Curve Percentages for Use in Stanine Conversion

Percentage    4    7    12    17    20    17    12    7    4
Stanine       1    2     3     4     5     6     7    8    9
Raw scores can readily be converted to stanines by arranging the original scores in order of size and then assigning stanines in accordance with the normal curve percentages reproduced in Table 4. For example, if the group consists of exactly 100 persons, the 4 lowest-scoring persons receive a stanine score of 1, the next 7 a score of 2, the next 12 a score of 3, and so on. When the group contains more or fewer than 100 cases, the number corresponding to each designated percentage is first computed, and these numbers of cases are then given the appropriate stanines.

3 Kaiser (1958) proposed a modification of the stanine scale that involves slight changes in the percentages and yields an SD of exactly 2, thus being easier to handle quantitatively. Other variants are the C scale (Guilford & Fruchter, 1973, Ch. 19), consisting of 11 units and also yielding an SD of 2, and the 10-unit sten scale, with 5 units above and 5 below the mean (Canfield, 1951).
for comparability of ratio IQ's throughout their age range. Chiefly for
this reason, the ratio IQ has been largely replaced by the so-called devi-
ation IQ, which is actually another variant of the familiar standard score.
The deviation IQ is a standard score with a mean of 100 and an SD
that approximates the SD of the Stanford-Binet IQ distribution. Al-
though the SD of the Stanford-Binet ratio IQ (last used in the 1937
edition) was not exactly constant at all ages, it fluctuated around a
median value slightly greater than 16. Hence, if an SD close to 16 is
chosen in reporting standard scores on a newly developed test, the result-
ing scores can be interpreted in the same way as Stanford-Binet ratio
IQ's. Since Stanford-Binet IQ's have been in use for many years, testers
and clinicians have become accustomed to interpreting and classifying
test performance in terms of such IQ levels. They have learned what to
expect from individuals with IQ's of 40, 70, 90, 130, and so forth. There
are therefore certain practical advantages in the use of a derived scale
that corresponds to the familiar distribution of Stanford-Binet IQ's.
Such a correspondence of score units can be achieved by the selection of
numerical values for the mean and SD that agree closely with those in the Stanford-Binet distribution.
It should be added that the use of the term "IQ" to designate such
standard scores may seem to be somewhat misleading. Such IQ's are not
derived by the same methods employed in finding traditional ratio IQ's.
They are not ratios of mental ages and chronological ages. The justifi-
cation lies in the general familiarity of the term "IQ," and in the fact
that such scores can be interpreted as IQ's provided that their SD
is approximately equal to that of previously known IQ's. Among the first
tests to express scores in terms of deviation IQ's were the Wechsler Intelligence Scales. In these tests, the mean is 100 and the SD 15. Deviation
IQ's are also used in a number of current group tests of intelligence
and in the latest revision of the Stanford-Binet itself.
With the increasing use of deviation IQ's, it is important to remember
that deviation IQ's from different tests are comparable only when they
employ the same or closely similar values for the SD. This value should
always be reported in the manual and carefully noted by the test user. If
a test maker chooses a different value for the SD in making up his devia-
tion IQ scale, the meaning of any given IQ on his test will be quite differ-
ent from its meaning on other tests. These discrepancies are illustrated in
Table 5, which shows the percentage of cases in normal distributions with
SD's from 12 to 18 who would obtain IQ's at different levels. These SD
values have actually been employed in the IQ scales of published tests.
Table 5 shows, for example, that an IQ of 70 cuts off the lowest 3.1 per-
cent when the SD is 16 (as in the Stanford-Binet); but it may cut off
as few as 0.7 percent (SD = 12) or as many as 5.1 percent (SD = 18).
An IQ of 70 has been used traditionally as a cutoff point for identifying
Principles of Psychological Testing
Thus, out of 200 cases, 8 would be assigned a stanine of 1 (4 percent of
200 = 8). With 150 cases, 6 would receive a stanine of 1 (4 percent of
150 = 6). For any group containing from 10 to 100 cases, Bartlett and
Edgerton (1966) have prepared a table whereby ranks can be directly
converted to stanines. Because of their practical as well as theoretical
advantages, stanines are being used increasingly, especially with aptitude
and achievement tests.
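The assignment of stanines from percentile ranks can be sketched as follows. The cumulative cutoffs (4, 11, 23, 40, 60, 77, 89, 96) follow directly from the standard 4-7-12-17-20-17-12-7-4 percent split of the normal curve; the boundary convention (a rank exactly on a cutoff goes to the lower stanine) is an assumption for illustration.

```python
def stanine(percentile_rank):
    """Stanine (1-9) for a percentile rank, using the standard
    4-7-12-17-20-17-12-7-4 percent split of the normal curve."""
    cumulative = [4, 11, 23, 40, 60, 77, 89, 96]  # upper limits of stanines 1-8
    for s, limit in enumerate(cumulative, start=1):
        if percentile_rank <= limit:
            return s
    return 9

print(stanine(50))   # middle 20 percent -> stanine 5
print(stanine(97))   # top 4 percent -> stanine 9
```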
Although normalized standard scores are the most satisfactory type of
score for the majority of purposes, there are nevertheless certain tech-
nical objections to normalizing all distributions routinely. Such a trans-
formation should be carried out only when the sample is large and rep-
resentative and when there is reason to believe that the deviation from
normality results from defects in the test rather than from characteristics
of the sample or from other factors affecting the behavior under consid-
eration. It should also be noted that when the original distribution of
raw scores approximates normality, the linearly derived standard scores
and the normalized standard scores will be very similar. Although the
methods of deriving these two types of scores are quite different, the
resulting scores will be nearly identical under such conditions. Obviously,
the process of normalizing a distribution that is already virtually normal
will produce little or no change. Whenever feasible, it is generally more
desirable to obtain a normal distribution of raw scores by proper adjust-
ment of the difficulty level of test items rather than by subsequently
normalizing a markedly nonnormal distribution. With an approximately
normal distribution of raw scores, the linearly derived standard scores
will serve the same purposes as normalized standard scores.
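The normalizing transformation itself can be sketched in a few lines: each raw score is assigned a percentile rank within the sample, and that percentile is then read back off the normal curve as a z value. The midpoint convention for tied scores is one common choice, assumed here for illustration.

```python
from statistics import NormalDist

def normalized_z(scores):
    """Normalized standard scores: each raw score is replaced by the
    normal deviate (z) having the same percentile rank, using the
    midpoint convention for tied scores."""
    n = len(scores)
    ordered = sorted(scores)
    table = {}
    for x in set(scores):
        below = sum(s < x for s in ordered)
        ties = ordered.count(x)
        pct = (below + 0.5 * ties) / n          # percentile as a proportion
        table[x] = round(NormalDist().inv_cdf(pct), 2)
    return [table[x] for x in scores]

print(normalized_z([10, 12, 12, 14, 20]))  # [-1.28, -0.25, -0.25, 0.52, 1.28]
```

Note that the markedly skewed raw scores (20 is far from the rest) come out symmetric: the transformation forces normality rather than preserving the raw-score shape, which is exactly the objection raised above when the skew reflects the sample rather than the test.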
THE DEVIATION IQ. In an effort to convert MA scores into a uniform
index of the individual's relative status, the ratio IQ (Intelligence
Quotient) was introduced in early intelligence tests. Such an IQ was
simply the ratio of mental age to chronological age, multiplied by 100 to
eliminate decimals (IQ = 100 × MA/CA). Obviously, if a child's MA
equals his CA, his IQ will be exactly 100. An IQ of 100 thus represents
normal or average performance. IQ's below 100 indicate retardation;
those above 100, acceleration.
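The ratio-IQ formula is simple enough to state directly; the rounding to a whole number is an assumption for illustration.

```python
def ratio_iq(mental_age, chronological_age):
    """Ratio IQ = 100 * MA / CA, rounded to a whole number."""
    return round(100 * mental_age / chronological_age)

print(ratio_iq(12, 10))   # MA ahead of CA -> 120 (acceleration)
print(ratio_iq(10, 10))   # MA equals CA  -> 100 (average)
print(ratio_iq(8, 10))    # MA behind CA  -> 80  (retardation)
```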
The apparent logical simplicity of the traditional ratio IQ, however,
proved deceptive. A major technical difficulty is that, unless the SD
of the IQ distribution remains approximately constant with age, IQ's
will not be comparable at different age levels. An IQ of 115 at age 10,
for example, may indicate the same degree of superiority as an IQ
of 125 at age 12, since both may fall at a distance of 1 SD from the
means of their respective age distributions. In actual practice, it proved
difficult to construct tests that met the psychometric requirements
TABLE 5
Percentage of Cases at Each IQ Interval in Normal Distributions with
Mean of 100 and Different Standard Deviations
(Courtesy Test Department, Harcourt Brace Jovanovich, Inc.)

IQ Interval     SD = 12   SD = 14   SD = 16   SD = 18
130 and above      0.7       1.6       3.1       5.1
120-129            4.3       6.3       7.5       8.5
110-119           15.2      16.0      15.8      15.4
100-109           29.8      26.1      23.6      21.0
 90- 99           29.8      26.1      23.6      21.0
 80- 89           15.2      16.0      15.8      15.4
 70- 79            4.3       6.3       7.5       8.5
Below 70           0.7       1.6       3.1       5.1
Total            100.0     100.0     100.0     100.0

(The middle range, IQ 90-109, thus includes 59.6, 52.2, 47.2, and 42.0
percent of the cases, respectively.)
mental retardation. The same discrepancies, of course, apply to IQ's of
130 and above, which might be used in selecting children for special
programs for the intellectually gifted. The IQ range between 90 and 110,
generally described as normal, may include as few as 42 percent or as
many as 59.6 percent of the population, depending on the test chosen. To
be sure, test publishers are making efforts to adopt the uniform SD of 16
in new tests and in new editions of earlier tests. There are still enough
variations among currently available tests, however, to make the checking
of the SD imperative.
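The percentages in Table 5 come straight from the normal curve. A sketch of the calculation for the "Below 70" row is given here; the computed values agree with the table to within a few tenths of a percent (the published entries were evidently rounded under a slightly different interval-boundary convention).

```python
from statistics import NormalDist

def pct_below(iq_cutoff, sd, mean=100.0):
    """Percent of a normal IQ distribution falling below a cutoff."""
    return 100 * NormalDist(mean, sd).cdf(iq_cutoff)

# Percent of cases below IQ 70 under each SD used in published tests.
for sd in (12, 14, 16, 18):
    print(f"SD = {sd}: {pct_below(70, sd):.1f}% below IQ 70")
```

The spread from under 1 percent (SD = 12) to around 5 percent (SD = 18) is the practical point of the table: the same cutoff IQ labels very different proportions of the population depending on the test's SD.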
[Figure 6 shows a normal curve with vertically aligned score scales: a
test-score baseline from -4σ to +4σ; z scores -4 to +4; T scores 10 to
90; CEEB scores 200 to 800; deviation IQ's (SD = 15) 55 to 145; stanines
1 to 9 with the percentages 4, 7, 12, 17, 20, 17, 12, 7, 4; and percentiles
1 to 99.]
FIG. 6. Relationships among Different Types of Test Scores in a Normal
Distribution.
INTERRELATIONSHIPS OF WITHIN-GROUP SCORES. At this stage in our dis-
cussion of derived scores, the reader may have become aware of a
rapprochement among the various types of scores. Percentiles have
gradually been taking on at least a graphic resemblance to normalized
standard scores. Linear standard scores are indistinguishable from
normalized standard scores if the original distribution of raw scores
closely approximates the normal curve. Finally, standard scores have be-
come IQ's and vice versa. In connection with the last point, a reexamina-
tion of the meaning of a ratio IQ on such a test as the Stanford-Binet will
show that these IQ's can themselves be interpreted as standard scores. If
we know that the distribution of Stanford-Binet ratio IQ's had a mean of
100 and an SD of approximately 16, we can conclude that an IQ of 116
falls at a distance of 1 SD above the mean and represents a standard
score of +1.00. Similarly, an IQ of 132 corresponds to a standard score
of +2.00, an IQ of 76 to a standard score of -1.50, and so forth. More-
over, a Stanford-Binet ratio IQ of 116 corresponds to a percentile rank
of approximately 84, because in a normal curve 84 percent of the cases
fall below +1.00 SD (Figure 4).
In Figure 6 are summarized the relationships that exist in a normal
distribution among the types of scores so far discussed in this chapter.
These include z scores, College Entrance Examination Board (CEEB)
scores, Wechsler deviation IQ's (SD = 15), T scores, stanines, and per-
centiles. Ratio IQ's on any test will coincide with the given deviation IQ
scale if they are normally distributed and have an SD of 15. Any other
normally distributed IQ could be added to the chart, provided we know
its SD. If the SD is 20, for instance, then an IQ of 120 corresponds to
+1 SD, an IQ of 80 to -1 SD, and so on.
In conclusion, the exact form in which scores are reported is dictated
largely by convenience, familiarity, and ease of developing norms. Stand-
ard scores in any form (including the deviation IQ) have generally
replaced other types of scores because of certain advantages they offer
with regard to test construction and statistical treatment of data. Most
types of within-group derived scores, however, are fundamentally similar
if carefully derived and properly interpreted. When certain statistical
conditions are met, each of these scores can be readily translated into
any of the others.
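The translations among within-group scores summarized in Figure 6 are all linear rescalings of the same z score, plus a normal-curve lookup for the percentile. A minimal sketch:

```python
from statistics import NormalDist

def from_z(z):
    """Translate a z score into the equivalent within-group scores of
    Figure 6: T score, CEEB score, Wechsler deviation IQ, percentile."""
    return {
        "T score": 50 + 10 * z,
        "CEEB score": 500 + 100 * z,
        "deviation IQ (SD = 15)": 100 + 15 * z,
        "percentile": round(100 * NormalDist().cdf(z), 1),
    }

print(from_z(1.0))
# {'T score': 60.0, 'CEEB score': 600.0, 'deviation IQ (SD = 15)': 115.0,
#  'percentile': 84.1}
```

The percentile conversion, unlike the others, assumes a normal distribution; for linear standard scores on a nonnormal distribution that step would not hold.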
Norms and the Interpretation of Test Scores
tests may differ in content despite their similar labels. So-called intelli-
gence tests provide many illustrations of this confusion. Although com-
monly described by the same blanket term, one of these tests may include
only verbal content, another may tap predominantly spatial aptitudes,
and still another may cover verbal, numerical, and spatial content in
about equal proportions. Second, the scale units may not be comparable.
As explained earlier in this chapter, if IQ's on one test have an SD of 12
and IQ's on another have an SD of 18, then an individual who received
an IQ of 112 on the first test is most likely to receive an IQ of 118 on the
second. Third, the composition of the standardization samples used in
establishing norms for different tests may vary. Obviously, the same indi-
vidual will appear to have performed better when compared with an
inferior group than when compared with a superior group.
Lack of comparability of either test content or scale units can usually
be detected by reference to the test itself or to the test manual. Differ-
ences in the respective normative samples, however, are more likely to
be overlooked. Such differences probably account for many otherwise un-
explained discrepancies in test results.
INTERTEST COMPARISONS. An IQ, or any other score, should always be
accompanied by the name of the test on which it was obtained. Test
scores cannot be properly interpreted in the abstract; they must be re-
ferred to particular tests. If the school records show that Bill Jones re-
ceived an IQ of 94 and Tom Brown an IQ of 110, such IQ's cannot be
accepted at face value without further information. The positions of
these two students might have been reversed by exchanging the par-
ticular tests that each was given in his respective school.
Similarly, an individual's relative standing in different functions may
be grossly misrepresented through lack of comparability of test norms.
Let us suppose that a student has been given a verbal comprehension
test and a spatial aptitude test to determine his relative standing in the
two fields. If the verbal ability test was standardized on a random sample
of high school students, while the spatial test was standardized on a
selected group of boys attending elective shop courses, the examiner
might erroneously conclude that the individual is much more able along
verbal than along spatial lines, when the reverse may actually be the case.
Still another example involves longitudinal comparisons of a single
individual's test performance over time. If a schoolchild's cumulative
record shows IQ's of 118, 115, and 101 at the fourth, fifth, and sixth
grades, the first question to ask before interpreting these changes is,
"What tests did he take on these three occasions?" The apparent decline
may reflect no more than the differences among the tests. In that case,
he would have obtained these scores even if the three tests had been
administered within a week of each other.
There are three principal reasons to account for systematic variations
among the scores obtained by the same individual on different tests. First,
THE NORMATIVE SAMPLE. Any norm, however expressed, is restricted
to the particular normative population from which it was derived. The
test user should never lose sight of the way in which norms are estab-
lished. Psychological test norms are in no sense absolute, universal, or
permanent. They merely represent the test performance of the subjects
constituting the standardization sample. In choosing such a sample, an
effort is usually made to obtain a representative cross section of the popu-
lation for which the test is designed.
In statistical terminology, a distinction is made between sample and
population. The former refers to the group of individuals actually tested.
The latter designates the larger, but similarly constituted, group from
which the sample is drawn. For example, if we wish to establish norms of
test performance for the population of 10-year-old, urban, public school
boys, we might test a carefully chosen sample of 500 10-year-old boys
attending public schools in several American cities. The sample would
be checked with reference to geographical distribution, socioeconomic
level, ethnic composition, and other relevant characteristics to ensure that
it was truly representative of the defined population.
In the development and application of test norms, considerable atten-
tion should be given to the standardization sample. It is apparent that the
sample on which the norms are based should be large enough to provide
stable values. Another, similarly chosen sample of the same population
should not yield norms that diverge appreciably from those obtained.
Norms with a large sampling error would obviously be of little value in
the interpretation of test scores.
Equally important is the requirement that the sample be representative
of the population under consideration. Subtle selective factors that might
make the sample unrepresentative should be carefully investigated. A
number of such selective factors are illustrated in institutional samples.
Because such samples are usually large and readily available for testing
purposes, they offer an alluring field for the accumulation of normative
data. The special limitations of these samples, however, should be care-
fully analyzed. Testing subjects in school, for example, will yield an in-
creasingly superior selection of cases in the successive grades, owing to
the progressive dropping out of the less able pupils. Nor does such
elimination affect different subgroups equally. For example, the rate of
selective elimination from school is greater for boys than for girls, and
it is greater in lower than in higher socioeconomic levels.
Selective factors likewise operate in other institutional samples, such
as prisoners, patients in mental hospitals, or institutionalized mental re-
tardates. Because of many special factors that determine institutionaliza-
tion itself, such groups are not representative of the entire population of
criminals, psychotics, or mental retardates. For example, mental retard-
ates with physical handicaps are more likely to be institutionalized than
are the physically fit. Similarly, the relative proportion of severely re-
tarded persons will be much greater in institutional samples than in the
total population.
Closely related to the question of representativeness of sample is the
need for defining the specific population to which the norms apply. Obvi-
ously, one way of ensuring that a sample is representative is to restrict
the population to fit the specifications of the available sample. For ex-
ample, if the population is defined to include only 14-year-old school-
children rather than all 14-year-old children, then a school sample would
be representative. Ideally, of course, the desired population should be
defined in advance in terms of the objectives of the test. Then a suitable
sample should be assembled. Practical obstacles in obtaining subjects,
however, may make this goal unattainable. In such a case, it is far better
to redefine the population more narrowly than to report norms on an ideal
population which is not adequately represented by the standardization
sample. In actual practice, very few tests are standardized on such broad
populations as is popularly assumed. No test provides norms for the
human species! And it is doubtful whether any tests give truly adequate
norms for such broadly defined populations as "adult American men,"
"10-year-old American children," and the like. Consequently, the samples
obtained by different test constructors often tend to be unrepresentative
of their alleged populations and biased in different ways. Hence, the
resulting norms are not comparable.
NATIONAL ANCHOR NORMS. One solution for the lack of comparability
of norms is to use an anchor test to work out equivalency tables for scores
on different tests. Such tables are designed to show what score in Test A
is equivalent to each score in Test B. This can be done by the equiper-
centile method, in which scores are considered equivalent when they
have equal percentiles in a given group. For example, if the 80th per-
centile in the same group corresponds to an IQ of 115 on Test A and to
an IQ of 120 on Test B, then Test A IQ 115 is considered to be equivalent
to Test B IQ 120. This approach has been followed to a limited extent
by some test publishers, who have prepared equivalency tables for a few
of their own tests (see, e.g., Lennon, 1966a).
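A crude sketch of the equipercentile idea: for each chosen percentile, pair the Test A score and the Test B score cutting off the same percentage of the same group. The percentile-lookup rule and the sample data below are simplified assumptions for illustration, not a production equating procedure (which would smooth the distributions).

```python
def equipercentile_table(scores_a, scores_b, percents=(10, 25, 50, 75, 90)):
    """Pair the Test A and Test B scores that cut off the same
    percentage of the same group (a crude equipercentile sketch)."""
    def score_at(scores, p):
        ordered = sorted(scores)
        idx = min(len(ordered) - 1, int(p / 100 * len(ordered)))
        return ordered[idx]
    return [(p, score_at(scores_a, p), score_at(scores_b, p)) for p in percents]

a = list(range(80, 120))   # hypothetical Test A scores for one group
b = list(range(85, 125))   # hypothetical Test B scores for the same group
print(equipercentile_table(a, b))
```

With Test B running uniformly 5 points higher, every row of the table pairs a Test A score with a Test B score 5 points above it, mirroring the Test A 115 / Test B 120 example in the text.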
More ambitious proposals have been made from time to time for cali-
brating each new test against a single anchor test, which has itself been
administered to a highly representative, national normative sample (Len-
non, 1966b). No single anchor test, of course, could be used in establish-
ing norms for all tests, regardless of content. What is required is a battery
of anchor tests, all administered to the same national sample. Each new
test could then be checked against the most nearly similar anchor test
in the battery.
The data gathered in Project TALENT (Flanagan et al., 1964) so far
come closest to providing such an anchor battery for a high school popu-
lation. Using a random sample of about 5 percent of the high schools in
this country, the investigators administered a two-day battery of specially
constructed aptitude, achievement, interest, and temperament tests to ap-
proximately 400,000 students in grades 9 through 12. Even with the avail-
ability of anchor data such as these, however, it must be recognized that
independently developed tests can never be regarded as completely inter-
changeable. At best, the use of national anchor norms would appreciably
reduce the lack of comparability among tests, but it would not elimi-
nate it.
The Project TALENT battery has been employed to calibrate several
test batteries in use by the Navy and Air Force (Dailey, Shaycoft, & Orr,
1962; Shaycoft, Neyman, & Dailey, 1962). The general procedure is to
administer both the Project TALENT battery and the tests to be cali-
brated to the same sample. Through correlational analysis, a composite of
Project TALENT tests is identified that is most nearly comparable to
each test to be normed. By means of the equipercentile method, tables
are then prepared giving the corresponding scores on the Project
TALENT composite and on the particular test. For several other bat-
teries, data have been gathered to identify the Project TALENT com-
4 For an excellent analysis of some of the technical difficulties involved in efforts
to achieve score comparability with different tests, see Angoff (1966, 1971a).
SPECIFIC NORMS. Another approach to the nonequivalence of existing
norms-and probably a more realistic one for most tests-is to standard-
ize tests on more narrowly defined populations, so chosen as to suit the
specific purposes of each test. In such cases, the limits of the normative
population should be clearly reported with the norms. Thus, the norms
might be said to apply to "employed clerical workers in large business
organizations" or to "first-year engineering students." For many testing
purposes, highly specific norms are desirable. Even when representative
norms are available for a broadly defined population, it is often helpful
to have separately reported subgroup norms. This is true whenever recog-
nizable subgroups yield appreciably different scores on a particular test.
The subgroups may be formed with respect to age, grade, type of curricu-
lum, sex, geographical region, urban or rural environment, socioeconomic
level, and many other factors. The use to be made of the test determines
the type of differentiation that is most relevant, as well as whether
general or specific norms are more appropriate.
Mention should also be made of local norms, often developed by the
test users themselves within a particular setting. The groups employed in
deriving such norms are even more narrowly defined than the subgroups
FIXED REFERENCE GROUP. Although most derived scores are computed
in such a way as to provide an immediate normative interpretation of test
performance, there are some notable exceptions. One type of non-
normative scale utilizes a fixed reference group in order to ensure
comparability and continuity of scores, without providing normative
evaluation of performance. With such a scale, normative interpretation
requires reference to independently collected norms from a suitable
population. Local or other specific norms are often used for this purpose.
One of the clearest examples of scaling in terms of a fixed reference
group is provided by the score scale of the College Board Scholastic
Aptitude Test (Angoff, 1962, 1971b). Between 1926 (when this test was
first administered) and 1941, SAT scores were expressed on a normative
scale, in terms of the mean and SD of the candidates taking the test at
each administration. As the number and variety of College Board member
colleges increased and the composition of the candidate population
changed, it was concluded that scale continuity should be maintained.
Otherwise, an individual's score would depend on the characteristics of
the group tested during a particular year. An even more urgent reason
for scale continuity stemmed from the observation that students taking
the SAT at certain times of the year performed more poorly than those
taking it at other times, owing to the differential operation of selective
factors. After 1941, therefore, all SAT scores were expressed in terms of
the mean and SD of the approximately 11,000 candidates who took the
test in 1941. These candidates constitute the fixed reference group em-
ployed in scaling all subsequent forms of the test. Thus, a score of 500 on
any form of the SAT corresponds to the mean of the 1941 sample; a score
of 600 falls 1 SD above that mean, and so forth.
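The fixed-reference-group rescaling itself is a simple linear map: the reference group's mean goes to 500 and each reference SD to 100. The raw-score mean and SD below are hypothetical stand-ins for the values that the anchor-item chain would supply for a given form.

```python
def fixed_reference_scale(raw, ref_mean, ref_sd, scale_mean=500, scale_sd=100):
    """Express a raw score on a fixed-reference-group scale: the
    reference group's mean maps to 500, each reference SD to 100."""
    return scale_mean + scale_sd * (raw - ref_mean) / ref_sd

# Hypothetical raw-score mean and SD implied for this form by the
# chain of anchor items back to the reference group.
print(fixed_reference_scale(55, ref_mean=45, ref_sd=10))   # 600.0
```

Note that nothing in this conversion refers to the current candidate group, which is exactly why such scores carry no normative meaning until compared with separately collected norms.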
To permit translation of raw scores on any form of the SAT into these
fixed-reference-group scores, a short anchor test (or set of common items)
is included in each form. Each new form is thereby linked to one or two
earlier forms, which in turn are linked with other forms by a chain of
items extending back to the 1941 form. These nonnormative SAT scores
can then be interpreted by comparison with any appropriate distribution
posite corresponding to each test in the battery (Cooley, 1965; Cooley &
Miller, 1965). These batteries include the General Aptitude Test Battery
of the United States Employment Service, the Differential Aptitude Tests,
and the Flanagan Aptitude Classification Tests.
Of particular interest is the Anchor Test Study conducted by the Edu-
cational Testing Service under the auspices of the U.S. Office of Edu-
cation (Jaeger, 1973). This study represents a systematic effort to provide
comparable and truly representative national norms for the seven most
widely used reading achievement tests for elementary schoolchildren.
Through an unusually well-controlled experimental design, over 300,000
fourth-, fifth-, and sixth-grade schoolchildren were examined in 50 states.
The anchor test consisted of the reading comprehension and vocabulary
subtests of the Metropolitan Achievement Test, for which new norms
were established in one phase of the project. In the equating phase of the
study, each child took the reading comprehension and vocabulary sub-
tests from two of the seven batteries, each battery being paired in turn
with every other battery. Some groups took parallel forms of the two sub-
tests from the same battery. In still other groups, all the pairings were
duplicated in reverse sequence, in order to control for order of ad-
ministration. From statistical analyses of all these data, score equivalency
tables for the seven tests were prepared by the equipercentile method. A
manual for interpreting scores is provided for use by school systems and
other interested persons (Loret, Seder, Bianchini, & Vale, 1974).
considered above. Thus, an employer may accumulate norms on appli-
cants for a given type of job within his company. A college admissions
office may develop norms on its own student population. Or a single
elementary school may evaluate the performance of individual pupils in
terms of its own score distribution. These local norms are more appropri-
ate than broad national norms for many testing purposes, such as the
prediction of subsequent job performance or college achievement, the
comparison of a child's relative achievement in different subjects, or
the measurement of an individual's progress over time.
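Computing a local norm is no more than finding a score's standing within the locally gathered distribution. A minimal sketch, with a hypothetical applicant sample and the midpoint convention for ties:

```python
def local_percentile_rank(score, local_scores):
    """Percentile rank of a score within a locally gathered score
    distribution (midpoint convention for tied scores)."""
    below = sum(s < score for s in local_scores)
    ties = sum(s == score for s in local_scores)
    return 100 * (below + 0.5 * ties) / len(local_scores)

applicants = [52, 61, 61, 70, 74, 78, 83, 88, 90, 95]  # hypothetical local sample
print(local_percentile_rank(74, applicants))   # 45.0
```

The same raw score of 74 could carry a very different percentile against a national sample, which is the comparability caution running through this section.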
of scores, such as that of a particular college, a type of college, a region,
etc. These specific norms are more useful in making college admission
decisions than would be annual norms based on the entire candidate
population. Any changes in the candidate population over time, more-
over, can be detected only with a fixed-score scale. It will be noted that
the principal difference between the fixed-reference-group scales under
consideration and the previously discussed scales based on national
anchor norms is that the latter require the choice of a single group that is
broadly representative and appropriate for normative purposes. Apart
from the practical difficulties in obtaining such a group and the need to
update the norms, it is likely that for many testing purposes such broad
norms are not required.
Scales built from a fixed reference group are analogous in one respect
to scales employed in physical measurement. In this connection, Angoff
(1962, pp. 32-33) writes:
There is hardly a person here who knows the precise original definition of the
length of the foot used in the measurement of height or distance, or which
king it was whose foot was originally agreed upon as the standard; on the
other hand, there is no one here who does not know how to evaluate lengths
and distances in terms of this unit. Our ignorance of the precise original mean-
ing or derivation of the foot does not lessen its usefulness to us in any way.
Its usefulness derives from the fact that it remains the same over time and
allows us to familiarize ourselves with it. Needless to say, precisely the same
considerations apply to other units of measurement: the inch, the mile, the
degree of Fahrenheit, and so on. In the field of psychological measurement it
is similarly reasonable to say that the original definition of the scale is or
should be of no consequence. What is of consequence is the maintenance of a
constant scale (which, in the case of a multiple-form testing program, is
achieved by rigorous form-to-form equating) and the provision of supple-
mentary normative data to aid in interpretation and in the formation of specific
decisions, data which would be revised from time to time as conditions warrant.
computer capabilities should serve "to free one's thinking from the con-
straints of the past."
Various testing innovations resulting from computer utilization will be
discussed under appropriate topics throughout the book. In the present
connection, we shall examine some applications of computers in the
interpretation of test scores. At the simplest level, most current tests, and
especially those designed for group administration, are now adapted for
computer scoring (Baker, 1971). Several test publishers, as well as inde-
pendent test-scoring organizations, are equipped to provide such scoring
services to test users. Although separate answer sheets are commonly
used for this purpose, optical scanning equipment available at some
scoring centers permits the reading of responses directly from test book-
lets. Many innovative possibilities, such as diagnostic scoring and path
analysis (recording a student's progress at various stages of learning)
have barely been explored.
At a somewhat more complex level, certain tests now provide facilities
for computer interpretation of test scores. In such cases, the computer
program associates prepared verbal statements with particular patterns
of test responses. This approach has been pursued with both personality
and aptitude tests. For example, with the Minnesota Multiphasic Per-
sonality Inventory (MMPI), to be discussed in Chapter 17, test users
may obtain computer printouts of diagnostic and interpretive statements
about the subject's personality tendencies and emotional condition,
together with the numerical scores. Similarly, the Differential Aptitude
Tests (see Ch. 13) provide a Career Planning Report, which includes
a profile of scores on the separate subtests as well as an interpretive
computer printout. The latter contains verbal statements that combine
the test data with information on interests and goals given by the
student on a Career Planning Questionnaire. These statements are
typical of what a counselor would say to the student in going over his
test results in an individual conference (Super, 1973).
Individualized interpretation of test scores at a still more complex level
is illustrated by interactive computer systems, in which the individual is
in direct contact with the computer by means of response stations and
in effect engages in a dialogue with the computer (J. A. Harris, 1973;
Holtzman, 1970; M. R. Katz, 1974; Super, 1970). This technique has been
investigated with regard to educational and vocational planning and de-
cision making. In such a situation, test scores are usually incorporated in
the computer data base, together with other information provided by the
student or client. Essentially, the computer combines all the available
information about the individual with stored information about educational
programs and occupations; and it utilizes all relevant facts and relations
in answering the individual's questions and aiding him in reaching de-
cisions. Examples of such interactive computer systems, in various stages
COMPUTER UTILIZATION IN THE INTERPRETATION
OF TEST SCORES
Computers have already made a significant impact upon every phase
of testing, from test construction to administration, scoring, reporting, and
interpretation. The obvious uses of computers-and those developed
earliest-represent simply an unprecedented increase in the speed with
which traditional data analyses and scoring processes can be carried out.
Far more important, however, are the adoption of new procedures and
the exploration of new approaches to psychological testing which would
have been impossible without the flexibility, speed, and data-processing
capabilities of computers. As Baker (1971, p. 227) succinctly puts it,
operational development, including IBM's Education and Career Exploration
System (ECES) and the System of Interactive Guidance and Information
(SIGI). Preliminary field trials show good acceptance of the systems by
high school students and their parents (Harris, 1973).

Test results also represent an integral part of the data utilized in
computer-assisted instruction (CAI). In order to present instructional
material appropriate to each student's current level of attainment, the
computer must repeatedly score and evaluate the student's responses to
the preceding material. On the basis of his response history, the student may
be routed to more advanced material, to further practice at the present
level, or to a remedial branch in which he receives instruction in more
elementary prerequisite material. Diagnostic analysis of errors may lead
to an instructional program designed to correct the specific learning
difficulties identified in individual cases.

A less costly and operationally more feasible variant of computer
utilization in learning is computer-managed instruction (CMI; see
Hambleton, 1974). In such systems, the learner does not interact directly
with the computer. The role of the computer is to assist the teacher in
individualized instruction, whether through instruction packages or more
conventional materials. The contribution of the computer is to process
the rather formidable mass of data accumulated daily regarding the
performance of each student in a classroom where each may be involved
in a different activity, and to use these data in prescribing the next
instructional step for each student. Examples of this application of
computers are provided by the University of Pittsburgh's IPI (Individually
Prescribed Instruction; see Cooley & Glaser, 1969; Glaser, 1968) and by
Project PLAN (Planning for Learning in Accordance with Needs),
developed by the American Institutes for Research (Flanagan, 1971;
Flanagan, Shanner, Brudner, & Marker, 1975). Project PLAN includes a
program of self-knowledge, individual development, and occupational
planning, as well as instruction in elementary and high school subjects.
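The branching routine that CAI systems apply after each unit test (advance, further practice, or a remedial branch) can be sketched in a short program. The percentage cutoffs and the function name here are illustrative assumptions, not details of any system cited in the text:

```python
# Illustrative sketch of CAI branching logic. The thresholds (85, 60)
# and names are invented for this example.

def next_step(percent_correct, has_persistent_errors):
    """Route a student after a unit test has been scored."""
    if has_persistent_errors:
        # Diagnostic analysis of errors triggers remedial instruction
        # in more elementary prerequisite material.
        return "remedial branch"
    if percent_correct >= 85:
        return "more advanced material"
    if percent_correct >= 60:
        return "further practice at present level"
    return "remedial branch"

print(next_step(90, False))   # more advanced material
print(next_step(70, False))   # further practice at present level
print(next_step(40, False))   # remedial branch
```

The essential point is that the computer, not the teacher, rescores the response record after every unit and selects the next instructional step.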
CRITERION-REFERENCED TESTING
NATURE AND USES. An approach to testing that has aroused a surge of
activity, particularly in education, is generally designated as "criterion-
referenced testing." First proposed by Glaser (1963), this term is still
used somewhat loosely, and its definition varies among different writers.
Moreover, several alternative terms are in common use, such as content-,
For a description of a widely used CAI system for teaching reading to first-
grade children, see R. C. Atkinson (1974).
Norms and the Interpretation of Test Scores 97
domain-, and objective-referenced. These terms are sometimes employed
as synonyms for criterion-referenced and sometimes with slightly different
connotations. "Criterion-referenced," however, seems to have gained
ascendancy, although it is not the most appropriate term.
Typically, criterion-referenced testing uses as its interpretive frame
of reference a specified content domain rather than a specified population
of persons. In this respect, it has been contrasted with the usual norm-
referenced testing, in which an individual's score is interpreted by com-
paring it with the scores obtained by others on the same test. In criterion-
referenced testing, for example, an examinee's test performance may be
reported in terms of the specific kinds of arithmetic operations he has
mastered, the estimated size of his vocabulary, the difficulty level of read-
ing matter he can comprehend (from comic books to literary classics),
or the chances of his achieving a designated performance level on an
external criterion (educational or vocational).
Thus far, criterion-referenced testing has found its major applications
in several recent innovations in education. Prominent among these are
computer-assisted, computer-managed, and other individualized, self-
paced instructional systems. In all these systems, testing is closely inte-
grated with instruction, being introduced before, during, and after
completion of each instructional unit to check on prerequisite skills,
diagnose possible learning difficulties, and prescribe subsequent instruc-
tional procedures. The previously cited Project PLAN and IPI are
examples of such programs.
From another angle, criterion-referenced tests are useful in broad sur-
veys of educational accomplishment, such as the National Assessment of
Educational Progress (Womer, 1970), and in meeting demands for edu-
cational accountability (Gronlund, 1974). From still another angle,
testing for the attainment of minimum requirements, as in qualifying for
a driver's license or a pilot's license, illustrates criterion-referenced
testing. Finally, familiarity with the concepts of criterion-referenced
testing can contribute to the improvement of the traditional, informal
tests prepared by teachers for classroom use. Gronlund (1973) provides
a helpful guide for this purpose, as well as a simple and well-balanced
introduction to criterion-referenced testing. A brief but excellent discus-
sion of the chief limitations of criterion-referenced tests is given by
Ebel (1972b).
CONTENT MEANING. The major distinguishing feature of criterion-
referenced testing (however defined and whether designated by this
term or by one of its synonyms) is its interpretation of test performance
in terms of content meaning. The focus is clearly on what the person can
do and what he knows, not on how he compares with others.
MASTERY TESTING. A second major feature almost always found in
criterion-referenced testing is the procedure of testing for mastery. Es-
sentially, this procedure yields an all-or-none score, indicating that the
individual has or has not attained the preestablished level of mastery.
When basic skills are tested, nearly complete mastery is generally ex-
pected (e.g., 80-85% correct items). A three-way distinction may also
be employed, including mastery, nonmastery, and an intermediate, doubt-
ful, or "review" interval.
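The mastery scoring just described amounts to comparing a percentage-correct score with preestablished cutoffs. A minimal sketch follows; the 85 percent mastery cutoff falls within the 80-85 percent range cited above, but the 70 percent lower bound of the "review" interval is an assumption made for illustration:

```python
# Three-way mastery classification. The cutoff values are illustrative
# assumptions; the text specifies only that nearly complete mastery
# (80-85% correct) is generally expected for basic skills.

def classify(num_correct, num_items, mastery_cut=0.85, review_cut=0.70):
    p = num_correct / num_items
    if p >= mastery_cut:
        return "mastery"
    if p >= review_cut:
        return "review"        # intermediate, doubtful interval
    return "nonmastery"

print(classify(18, 20))  # mastery
print(classify(15, 20))  # review
print(classify(10, 20))  # nonmastery
```

Note that the continuous percentage score is deliberately collapsed into an all-or-none (or three-way) report, which is why such tests need not discriminate among individuals.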
In connection with individualized instruction, some educators have
argued that, given enough time and suitable instructional methods, nearly
everyone can achieve complete mastery of the chosen instructional ob-
jectives. Individual differences would thus be manifested in learning
time rather than in final achievement, as in traditional educational testing
(Bloom, 1968; J. B. Carroll, 1963, 1970; Cooley & Glaser, 1969; Gagné,
1965). It follows that in mastery testing, individual differences in per-
formance are of little or no interest. Hence, as generally constructed,
criterion-referenced tests minimize individual differences. For example,
they include items passed or failed by all or nearly all examinees, al-
though such items are usually excluded from norm-referenced tests.

Mastery testing is regularly employed in the previously cited programs
for individualized instruction. It is also characteristic of published
criterion-referenced tests for basic skills, suitable for elementary schools.
Examples of such tests include the Prescriptive Reading Inventory and
Prescriptive Mathematics Inventory (California Test Bureau), the Skills
Monitoring System in Reading and in Study Skills (Harcourt Brace
Jovanovich), and Diagnosis: An Instructional Aid Series in Reading and
in Mathematics (Science Research Associates).
Beyond basic skills, mastery testing is inapplicable or insufficient. In
more advanced and less structured subjects, achievement is open-ended.
The individual may progress almost without limit in such functions as
understanding, critical thinking, appreciation, and originality. Moreover,
content coverage may proceed in many different directions, depending
upon the individual's abilities, interests, and goals, as well as local in-
structional facilities. Under these conditions, complete mastery is un-
realistic and unnecessary. Hence norm-referenced evaluation is generally
employed in such cases to assess degree of attainment. Some published
tests are so constructed as to permit both norm-referenced and criterion-
referenced applications. An example is the 1973 Edition of the Stanford
Achievement Test. While providing appropriate norms at each level, this
battery meets three important requirements of criterion-referenced tests:
specification of detailed instructional objectives, adequate coverage of
each objective with appropriate items, and wide range of item difficulty.

It should be noted that criterion-referenced testing is neither as new
A fundamental requirement in constructing this type of test is a clearly
defined domain of knowledge or skills to be assessed by the test. If scores
on such tests are to have communicable meaning, the content domain to be
sampled must be widely recognized as important. The selected domain
must then be subdivided into small units defined in performance terms.
In an educational context, these units correspond to behaviorally defined
instructional objectives, such as "multiplies three-digit by two-digit
numbers" or "identifies the misspelled word in which the final e is re-
tained when adding -ing." In the programs prepared for individualized
instruction, these objectives run to several hundred for a single school
subject. After the instructional objectives have been formulated, items are
prepared to sample each objective. This procedure is admittedly difficult
and time-consuming. Without such careful specification and control of
content, however, the results of criterion-referenced testing could de-
generate into an idiosyncratic and uninterpretable jumble.

When strictly applied, criterion-referenced testing is best adapted for
testing basic skills (as in reading and arithmetic) at elementary levels.
In these areas, instructional objectives can also be arranged in an ordinal
hierarchy, the acquisition of more elementary skills being prerequisite
to the acquisition of higher-level skills.6 It is impracticable and probably
undesirable, however, to formulate highly specific objectives for ad-
vanced levels of knowledge in less highly structured subjects. At these
levels, both the content and sequence of learning are likely to be much
more flexible.

On the other hand, in its emphasis on content meaning in the interpre-
tation of test scores, criterion-referenced testing may exert a salutary
effect on testing in general. The interpretation of intelligence test scores,
for example, would benefit from this approach. To describe a child's
intelligence test performance in terms of the specific intellectual skills
and knowledge it represents might help to counteract the confusions and
misconceptions that have become attached to the IQ. When stated in
these general terms, however, the criterion-referenced approach is
equivalent to interpreting test scores in the light of the demonstrated
validity of the particular test, rather than in terms of vague underlying
entities. Such an interpretation can certainly be combined with norm-
referenced scores.
6 Ideally, such tests follow the simplex model of a Guttman scale (see Popham &
Husek, 1969), as do the Piagetian ordinal scales discussed earlier in this chapter.
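The simplex pattern mentioned in this footnote can be sketched as a simple check on a response record; the function name and the 1/0 scoring convention are assumptions made for illustration. In an ideal Guttman scale, a person who passes any item also passes every easier item, so no pass should follow a failure when items are ordered from easiest to hardest:

```python
# Rough check of the simplex (Guttman) pattern: with items ordered
# easy -> hard, an ideal record contains no pass after the first failure.

def is_guttman_pattern(responses):
    """responses: 1 (pass) / 0 (fail) scores on items ordered easy -> hard."""
    seen_failure = False
    for r in responses:
        if r == 0:
            seen_failure = True
        elif seen_failure:      # a pass after a failure violates the scale
            return False
    return True

print(is_guttman_pattern([1, 1, 1, 0, 0]))  # True
print(is_guttman_pattern([1, 0, 1, 0, 0]))  # False
```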
7 As a result of this reduction in variability, the usual methods for finding test
reliability and validity are inapplicable to most criterion-referenced tests. Further
discussion of these points will be found in Chapters 5, 6, and 8.
one illustrated in Table 6. The data for this table were obtained from
171 high school boys enrolled in courses in American history. The pre-
dictor was the Verbal Reasoning test of the Differential Aptitude Tests,
administered early in the course. The criterion was end-of-course grades.
The correlation between test scores and criterion was .66.
TABLE 6
Expectancy Table Showing Relation between DAT Verbal Reasoning Test
Scores and Course Grades in American History for 171 Boys in Grade 11

(Adapted from Fifth Edition Manual for the Differential Aptitude Tests, Forms S and
T, p. 118. Reproduced by permission. Copyright 1973, 1974 by The Psychological
Corporation, New York, N.Y. All rights reserved.)
nor as clearly divorced from norm-referenced testing as some of its
proponents imply. Evaluating an individual's test performance in absolute
terms, such as by letter grades or percentage of correct items, is certainly
older than normative interpretations. More precise attempts to evaluate
test performance in terms of content meaning also antedate the
introduction of the term "criterion-referenced testing" (Ebel, 1962;
Flanagan, 1962; see also Anastasi, 1968, pp. 69-70). Other examples may
be found in early product scales for assessing the quality of handwriting,
compositions, or drawings by matching the individual's work sample
against a set of standard specimens. Ebel (1972b) observes, further-
more, that the concept of mastery in education, in the sense of all-or-
none learning of specific units, achieved considerable popularity in the
1920s and 1930s and was later abandoned.

A normative framework is implicit in all testing, regardless of how
scores are expressed (Angoff, 1974). The very choice of content or
skills to be measured is influenced by the examiner's knowledge of what
can be expected from human organisms at a particular developmental or
instructional stage. Such a choice presupposes information about what
other persons have done in similar situations. Moreover, by imposing
uniform cutoff scores on an ability continuum, mastery testing does not
thereby eliminate individual differences. To describe an individual's level
of reading comprehension as "the ability to understand the content of
the New York Times" still leaves room for a wide range of individual
differences in degree of understanding.7
Test          Number     Percentage Receiving Each Criterion Grade
Score         of Cases   Below 70   70-79   80-89   90 & above

40 & above       46                   15      22        63
30-39            36          6        39      39        17
20-29            43         12        63      21         5
Below 20         46         30        52      17
The first column of Table 6 shows the test scores, divided into four
class intervals; the number of students whose scores fall into each interval
is given in the second column. The remaining entries in each row of the
table indicate the percentage of cases within each test-score interval
who received each grade at the end of the course. Thus, of the 46 students
with scores of 40 or above on the Verbal Reasoning test, 15 percent re-
ceived grades of 70-79, 22 percent grades of 80-89, and 63 percent
grades of 90 or above. At the other extreme, of the 46 students scoring
below 20 on the test, 30 percent received grades below 70, 52 percent
between 70 and 79, and 17 percent between 80 and 89. Within the
limitations of the available data, these percentages represent the best
estimates of the probability that an individual will receive a given
criterion grade. For example, if a new student receives a test score of
34 (i.e., in the 30-39 interval), we would conclude that the probability
of his obtaining a grade of 90 or above is 17 out of 100; the probability
of his obtaining a grade between 80 and 89 is 39 out of 100; and so on.

In many practical situations, criteria can be dichotomized into "suc-
cess" and "failure" in a job, a course of study, or other undertaking. Under
these conditions, an expectancy chart can be prepared, showing the
probability of success or failure corresponding to each score interval.
Figure 7 is an example of such an expectancy chart. Based on a pilot
selection battery developed by the Air Force, this expectancy chart shows
EXPECTANCY TABLES. Test scores may also be interpreted in terms of
expected criterion performance, as in a training program or on a job.
This usage of the term "criterion" follows standard psychometric prac-
tice, as when a test is said to be validated against a particular criterion
(see Ch. 2). Strictly speaking, the term "criterion-referenced testing"
should refer to this type of performance interpretation, while the other
approaches discussed in this section can be more precisely described as
content-referenced. This terminology, in fact, is used in the APA test
standards (1974).

An expectancy table gives the probability of different criterion out-
comes for persons who obtain each test score. For example, if a student
obtains a score of 530 on the CEEB Scholastic Aptitude Test, what are
the chances that his freshman grade-point average in a specific college
will fall in the A, B, C, D, or F category? This type of information can
be obtained by examining the bivariate distribution of predictor scores
(SAT) plotted against criterion status (freshman grade-point average).
If the number of cases in each cell of such a bivariate distribution is
changed to a percentage, the result is an expectancy table, such as the
FIG. 7. Expectancy Chart Showing Relation between Performance on Pilot
Selection Battery and Elimination from Primary Flight Training.
(From Flanagan, 1947, p. 58.)
the percentage of men scoring within each stanine on the battery who
failed to complete primary flight training. It can be seen that 77 percent
of the men receiving a stanine of 1 were eliminated in the course of train-
ing, while only 4 percent of those at stanine 9 failed to complete the
training satisfactorily. Between these extremes, the percentage of failures
decreases consistently over the successive stanines. On the basis of this
expectancy chart, it could be predicted, for example, that approximately
40 percent of pilot cadets who obtain a stanine score of 4 will fail and
approximately 60 percent will satisfactorily complete primary flight train-
ing. Similar statements regarding the probability of success and failure
could be made about individuals who receive each stanine. Thus, an
individual with a stanine of 4 has a 60:40 or 3:2 chance of completing
primary flight training. Besides providing a criterion-referenced interpre-
tation of test scores, it can be seen that both expectancy tables and
expectancy charts give a general idea of the validity of a test in predict-
ing a given criterion.
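The percentages in an expectancy table such as Table 6 come from tallying the bivariate frequency of each (test-score interval, criterion outcome) cell and dividing by the row total. A minimal sketch of that bookkeeping, using a handful of invented score-grade pairs rather than the actual DAT data:

```python
# Building an expectancy table from (predictor interval, criterion grade)
# pairs. The data below are invented for illustration only.

from collections import Counter

pairs = [                       # (score interval, course grade)
    ("40+", "90+"), ("40+", "90+"), ("40+", "80-89"),
    ("30-39", "80-89"), ("30-39", "70-79"), ("30-39", "70-79"),
    ("Below 20", "Below 70"), ("Below 20", "70-79"),
]

cells = Counter(pairs)                          # bivariate cell frequencies
rows = Counter(interval for interval, _ in pairs)  # row totals

expectancy = {
    (interval, grade): round(100 * n / rows[interval])
    for (interval, grade), n in cells.items()
}

# e.g., 2 of the 3 invented students scoring 40+ earned grades of 90+:
print(expectancy[("40+", "90+")])   # 67
```

With a real sample, each percentage is read directly as the estimated probability of that criterion outcome for a new examinee in the given score interval.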
[Figure 7 also reports the number of men at each stanine: stanine 9,
21,474; 8, 19,444; 7, 32,129; 6, 39,398; 5, 34,975; 4, 23,699; 3, 11,209;
2, 2,139; 1, 904.]
CHAPTER 5
Reliability
RELIABILITY refers to the consistency of scores obtained by the
same persons when reexamined with the same test on different
occasions, or with different sets of equivalent items, or under
other variable examining conditions. This concept of reliability underlies
the computation of the error of measurement of a single score, whereby
we can predict the range of fluctuation likely to occur in a single indi-
vidual's score as a result of irrelevant, chance factors.
The concept of test reliability has been used to cover several aspects of
score consistency. In its broadest sense, test reliability indicates the extent
to which individual differences in test scores are attributable to "true"
differences in the characteristics under consideration and the extent to
which they are attributable to chance errors. To put it in more technical
terms, measures of test reliability make it possible to estimate what pro-
portion of the total variance of test scores is error variance. The crux of
the matter, however, lies in the definition of error variance. Factors that
might be considered error variance for one purpose would be classified
under true variance for another. For example, if we are interested in
measuring fluctuations of mood, then the day-by-day changes in scores
on a test of cheerfulness-depression would be relevant to the purpose of
the test and would hence be part of the true variance of the scores. If, on
the other hand, the test is designed to measure more permanent person-
ality characteristics, the same daily fluctuations would fall under the
heading of error variance.
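The partition of score variance just described can be illustrated numerically. In classical test theory, the reliability coefficient estimates the proportion of total variance that is true variance, so the remaining proportion is error variance; the figures below are assumed for the sketch:

```python
# Partitioning total score variance into true and error components.
# The reliability coefficient and total variance are assumed values.

reliability = 0.90          # assumed reliability coefficient
total_variance = 225.0      # assumed score variance (SD = 15)

true_variance = reliability * total_variance          # 202.5
error_variance = (1 - reliability) * total_variance   # 22.5

print(true_variance, error_variance)
```

Which factors count toward `error_variance`, of course, depends on the purpose of the test, as the mood example above shows.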
Essentially, any condition that is irrelevant to the purpose of the test
represents error variance. Thus, when the examiner tries to maintain
uniform testing conditions by controlling the testing environment, in-
structions, time limits, rapport, and other similar factors, he is reducing
error variance and making the test scores more reliable. Despite optimum
testing conditions, however, no test is a perfectly reliable instrument.
Hence, every test should be accompanied by a statement of its reliability.
Such a measure of reliability characterizes the test when administered
under standard conditions and given to subjects similar to those con-
stituting the normative sample. The characteristics of this sample should
therefore be specified, together with the type of reliability that was meas-
ured.
There could, of course, be as many varieties of test reliability as there
are conditions affecting test scores, since any such conditions might be
irrelevant for a certain purpose and would thus be classified as error vari-
ance. The types of reliability computed in actual practice, however, are
relatively few. In this chapter, the principal techniques for measuring the
reliability of test scores will be examined, together with the sources of
error variance identified by each. Since all types of reliability are con-
cerned with the degree of consistency or agreement between two inde-
pendently derived sets of scores, they can all be expressed in terms of a
correlation coefficient. Accordingly, the next section will consider some
of the basic characteristics of correlation coefficients, in order to clarify
their use and interpretation. More technical discussion of correlation, as
well as more detailed specifications of computing procedures, can be
found in any elementary textbook of educational or psychological statis-
tics, such as Guilford and Fruchter (1973).
FIG. 8. Bivariate Distribution for a Hypothetical Correlation of +1.00.
[Scatter diagram omitted: tally marks for 100 cases, plotted with Score
on Variable 1 on the horizontal axis and Score on Variable 2 on the
vertical axis, all falling along the diagonal.]
MEANING OF CORRELATION. Essentially, a correlation coefficient (r) ex-
presses the degree of correspondence, or relationship, between two sets
of scores. Thus, if the top-scoring individual in variable 1 also obtains the
top score in variable 2, the second-best individual in variable 1 is second
best in variable 2, and so on down to the poorest individual in the group,
then there would be a perfect correlation between variables 1 and 2.
Such a correlation would have a value of +1.00.

A hypothetical illustration of a perfect positive correlation is shown in
Figure 8. This figure presents a scatter diagram, or bivariate distribution.
Each tally mark in this diagram indicates the score of one individual in
both variable 1 (horizontal axis) and variable 2 (vertical axis). It will be
noted that all of the 100 cases in the group are distributed along a
diagonal running from the lower left- to the upper right-hand corner of
the diagram. Such a distribution indicates a perfect positive correlation
(+1.00), since it shows that each individual occupies the same relative
position in both variables. The closer the bivariate distribution of scores
approaches this diagonal, the higher will be the positive correlation.
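The perfect correlations described here can be checked numerically. The scores below are made up for illustration (each score on variable 2 is exactly twice the score on variable 1), and the helper function is a direct rendering of the Pearson product-moment formula in deviation form:

```python
# Pearson r computed from deviations about the means, as in the
# deviation-score method. Data are invented for illustration.

import statistics

def pearson_r(xs, ys):
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

var1 = [10, 20, 30, 40, 50]
var2 = [20, 40, 60, 80, 100]

print(pearson_r(var1, var2))        # 1.0  (perfect positive, as in Figure 8)
print(pearson_r(var1, var2[::-1]))  # -1.0 (complete reversal, as in Figure 9)
```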
Figure 9 illustrates a perfect negative correlation (-1.00). In this case,
there is a complete reversal of scores from one variable to the other. The
best individual in variable 1 is the poorest in variable 2 and vice versa,
this reversal being consistently maintained throughout the distribution. It
will be noted that, in this scatter diagram, all individuals fall on the
diagonal extending from the upper left- to the lower right-hand corner.
This diagonal runs in the reverse direction from that in Figure 8.

A zero correlation indicates complete absence of relationship, such as
might occur by chance. If each individual's name were pulled at random
out of a hat to determine his position in variable 1, and if the process
were repeated for variable 2, a zero or near-zero correlation would result.
Under these conditions, it would be impossible to predict an individual's
relative standing in variable 2 from a knowledge of his score in variable
1. The top-scoring subject in variable 1 might score high, low, or average
in variable 2. Some individuals might by chance score above average in
both variables, or below average in both; others might fall above average
in one variable and below in the other; still others might be above the
average in one and at the average in the second; and so forth. There
would be no regularity in the relationship from one individual to another.

The coefficients found in actual practice generally fall between these
extremes, having some value higher than zero but lower than 1.00. Corre-
lations between measures of abilities are nearly always positive, although
frequently low. When a negative correlation is obtained between two
such variables, it usually results from the way in which the scores are ex-
pressed. For example, if time scores are correlated with amount scores, a
negative correlation will probably result. Thus, if each subject's score on
an arithmetic computation test is recorded as the number of seconds re-
quired to complete all items, while his score on an arithmetic reasoning
test represents the number of problems correctly solved, a negative corre-
lation can be expected. In such a case, the poorest (i.e., slowest) individ-
. R l' bet \'een PerformanceCh t Showmg e atIon \ .,IG,7. Expectancy aT p.' . Flight Training.ejectionBattery and Elimination from I1maly
.{FromFlanagan, 1947, p. 58.)
: . ,thin each stanine on the battery who,thepercentage of men scormg \\ I . . . It b seen that 77 percent, J' fI' ht trammg can eailed to comp :t: pnmary. Ig f 1were eiiminated in the course of train-of the men receIVing a stamne 0 . 9 f 'led to complete the
1 '1 I 4 t of those at stamne aling, W 11 C on y percen es the ercentage of failuresh'aining satisfactorily, Between these ex.trcm ", Po the basis of this
. 1 the succeSSl'\'e stanmes. ndecreases consistent y over '. d f I that approximatelyexpectancy chart, it ~uld be predlCt,e , °t
re~amPcoe~eof 4 will fail and
f 'J t d t who obtam a s amne s40 percent 0 pI 0 ca e s 1 . flight train-
itpproximately 60 percent wil1;atis~~~tor~'~b~~~i~ye~fl:~:::~and failureing, Similar statements re~a~ m~ hp i each stanine. Thus, an
could be made about. indlvldua s w 6~.~~c:rv;:2 chance of completing
. individual. with a. s~amne o.f 4 has ~idin' a criterion-referenced interpre-
primary fhght trammg. Besldebspro thgt both expectancy tables and
. f t t scores it can e seen a d'tatlon 0 es .' I 'd f the validitv of a test in pre lct-expectancy charts glVe a genera 1 ea 0 J
ing a given criterion.
No. of
Men
9 21,474
S 19,444
7 32,129
6 39,398
5 34,975
4 '23,699
3 11,209
2 2,139
904
CHAPTER 5
Reliability
RLIABILITY refers to the consistency of scores obtained by the
same persons when reexamined with the same test on different
occasions, or with diHerent sets of equivalent items, or under
othel: variable examining conditions. This concept of reliability underlies
the computation of the error of measurement of a single score, whereby
we can predict the range of fluctuation likely to occur in a single indi-
vidual's score as a result of irrelevant, chance factors.
The concept of test reliability has been used to cover several aspects of
score consistency. In its broadest sense, test reliability indicates the extent
to which individual diHerences in test scores are attributable to "true"
differences in the characteristics under consideration and the extent to
which they are attributable to chance errors. To put it in more technical
terms, measures of test reliability make it possible to estimate what pro-
portion of the total variance of test scores is error variance. The crux of
the matter, however, lies in the definition of error variance, Factors that
might be considered error variance for one purpose would be classified
under true variance for another. For example, if we are interested in
measuring fluctuations of mood, then the day-by-day changes in scores
on a test of cheerfulness-depression would be relevant to the purpose of
the test and would hence be part of the true variance of the scores. If, on
the other hand, the test is designed to measure more permanent person-
ality characteristics, the same daily fluctuations would fall under the
heading of error variance.
Essentially, any condition that is irrelevant to the purpose of the test
represents error variance. Thus, when the examiner tries to maintain
uniform testing conditions by controlling the testing environment, in-
structions, time limits, rapport, and other similar factors, he is reducing
error variance and making the test scores more reliable. Despite optimum
testing conditions, however, no test is a perfectly reliable instrument.
Hence, every test should be accompanied by a statement of its reliability.
Such a measure of reliability characterizes the test when administered
under standard conditions and given to subjects similllr to those con-
stituting the normative sample. The characteristics of thiss~mple should
therefore be specified, together with the type of reliabIlity that was meas-
ured.
iflciplesof Psychological Testing
"~ould, of course, be as many varieties of test reliability as there
,jtionsaffecting test scores, since any such conditions might be
t for a certain purpose and would thus be classified as error vari-
e types of reliability computed in actual practice, however, are
few. In this chapter, the principal techniques for measuring the
'f}'of test scores will be examined, together with the sources of
illiance identified by each. Since all types of reliability are con-
,with the degree of consistency or agreement between two inde-
'flyderived sets of scores, they can all be expressed in tcrms of a
'on coefficient. Accordingly, the next section will consider some
basic characteristics of correlation cBefficients, in order to clarify
use and interpretation. ?\fore technical discussion of correlation, as
·as more detailed specifications of computing procedures, can be
,in any elementary textbook of educational or psychological statis-
; such as Guilford and Fruchter (1973),
9
I : ,- ; ",
!i ! ./Iff III
,
!.JHt-./Iff
.., II j
!mr ./Iffi
i#ff I ;
./Iff./lff'
T./Iff./lffl
./Iff!
./Iff./lff ,j'--
./Iff 11/ I: !
:./Iff./lff
i i
I,;
lilt I ! !, ,,
II I ,
I I,
0- 0- 0-
N
••:g 60-69
'g> 50-59co
~ 40-49v
'"
I N (""')0. ().. ()o.
gb b ';t'fl'?N t") ~ Si ~
SCore On Variable I
FIG. 8, Bivariate Distr'b t' fI U IOn or a Hypothetical Correlation of +1.00.
might OCcur by chance If each ind' 'd I'out of a hat to determ'ine hi .1:1 1I.as n~me \"ere pulled at random
, s pOsitIOn In vanahle 1 a d 'f thwere repeated for variable" ' n I e processUnder these conditions it -, alzderbo~r near~zero correlation would result.
, \Vou e ImpOSSible to d' t drelative standing in variable 2 from k pre. IC an in ividual's
~. The top-sl!Oring Subject in variable a1~~w~edge of l~,s SCore in variableIn variable 2. Some individ I 'h b g t Score 11lgh,low, or average
both vadables or below av:;'l s n~,gbt hY chhance score above average in. ' age In ot . ot ers mightf II b111 one variable and below in the oth .' '11 .a a Ove averageaverage in one and at th " .er, sh others mIght be above the
ld b e a\el:lge III the second and f hwou .e no regularity in the relationshi from '.. ,. so art, ThereThe coefficients fOund in t I ~ one mdl\ Idual to another.
extremes, having some value ~~ ~'l .p~achce generally fall between these
lations between measures of a~1.tt an zero but lower than 1,00. Corre-frequentlv low When a I,lies are nearly a-lways positive, althoug'h
,. negative con-el t' . b'such variables, it usually results from th a IOn.IS 0 .tamed between twopressed. For example if time e way III which the scores are ex-
, ' SCores are correlated withnegat.lYc correlation will probabl ' result. Th ';~:'~' , am.ou~t scores, aan anthmetic computation te t .) d d us, 1f~!ch sublect s score'On. d S IS recor e as the dumb f d
qUire to complete all items wh'l h' '~er a secon sre·t t ' I e IS Score on an arith t'es represents the number of hI '''.' me IC reasoningI t' pro ems correctly sol\!cd 'a Ion can be expected. In SUell 'h . :,<:,' a negatIve cone-
,a case, t e poorest (I.e., slowest) individ-
MEANING OF CORRELATION. Essentially, a correlation coefficient (r) expresses the degree of correspondence, or relationship, between two sets of scores. Thus, if the top-scoring individual in variable 1 also obtains the top score in variable 2, the second-best individual in variable 1 is second best in variable 2, and so on down to the poorest individual in the group, then there would be a perfect correlation between variables 1 and 2. Such a correlation would have a value of +1.00.

A hypothetical illustration of a perfect positive correlation is shown in Figure 8. This figure presents a scatter diagram, or bivariate distribution. Each tally mark in this diagram indicates the score of one individual in both variable 1 (horizontal axis) and variable 2 (vertical axis). It will be noted that all of the 100 cases in the group are distributed along a diagonal running from the lower left- to the upper right-hand corner of the diagram. Such a distribution indicates a perfect positive correlation (+1.00), since it shows that each individual occupies the same relative position in both variables. The closer the bivariate distribution of scores approaches this diagonal, the higher will be the positive correlation.
Figure 9 illustrates a perfect negative correlation (-1.00). In this case, there is a complete reversal of scores from one variable to the other. The best individual in variable 1 is the poorest in variable 2 and vice versa, this reversal being consistently maintained throughout the distribution. It will be noted that, in this scatter diagram, all individuals fall on the diagonal extending from the upper left- to the lower right-hand corner. This diagonal runs in the reverse direction from that in Figure 8. A zero correlation, in contrast, indicates a complete absence of relationship, such as
[Scatter diagram; horizontal axis: Score on Variable 1]
Reliability 107
tive. When some products are positive and some negative, the correlation will be close to zero.
In actual practice, it is not necessary to convert each raw score to a standard score before finding the cross-products, since this conversion can be made once for all after the cross-products have been added. There are many shortcuts for computing the Pearson correlation coefficient. The method demonstrated in Table 7 is not the quickest, but it illustrates the meaning of the correlation coefficient more clearly than other methods that utilize computational shortcuts. Table 7 shows the computation of a Pearson r for the arithmetic and reading scores of 10 children. Next to each child's name are his scores in the arithmetic test (X) and the reading test (Y). The sums and means of the 10 scores are given under the respective columns. The third column shows the deviation (x) of each arithmetic score from the arithmetic mean; and the fourth column, the deviation (y) of each reading score from the reading mean. These deviations are squared in the next two columns, and the sums of the squares are used in computing the standard deviations of the arithmetic and reading scores by the method described in Chapter 4. Rather than dividing each x and y by its corresponding σ to find standard scores, we
[Scatter diagram: tally marks falling along the diagonal from the upper left- to the lower right-hand corner; vertical axis: score intervals (40-49, 50-59, 60-69, ...)]
Fig. 9. Bivariate Distribution for a Hypothetical Correlation of -1.00.

TABLE 7
Computation of Pearson Product-Moment Correlation Coefficient

             Arith-  Read-
             metic   ing
Pupil          X       Y      x     y     x²    y²    xy
Bill          41      17     +1    -4      1    16    -4
Carol         38      28     -2    +7      4    49   -14
Geoffrey      48      22     +8    +1     64     1    +8
Ann           32      16     -8    -5     64    25   +40
Bob           34      18     -6    -3     36     9   +18
Jane          36      15     -4    -6     16    36   +24
Ellen         41      24     +1    +3      1     9    +3
Ruth          43      20     +3    -1      9     1    -3
Dick          47      23     +7    +2     49     4   +14
Mary          40      27      0    +6      0    36     0
Σ            400     210      0     0    244   186    86
M             40      21

σx = √(Σx²/N) = √(244/10) = √24.40 = 4.94
σy = √(Σy²/N) = √(186/10) = √18.60 = 4.31

r = Σxy / (N σx σy) = 86 / ((10)(4.94)(4.31)) = 86 / 212.91 = .40
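The computation laid out in Table 7 can be reproduced in a few lines of Python. This sketch is not part of the original text; it re-derives the entries of the table and also checks that the same r is obtained as the mean of the cross-products of standard scores, as the surrounding discussion describes.

```python
import math

# Arithmetic (X) and reading (Y) scores of the 10 children in Table 7
X = [41, 38, 48, 32, 34, 36, 41, 43, 47, 40]
Y = [17, 28, 22, 16, 18, 15, 24, 20, 23, 27]
N = len(X)

mean_x, mean_y = sum(X) / N, sum(Y) / N            # 40 and 21

# Deviations from the means (the x and y columns of Table 7)
x = [xi - mean_x for xi in X]
y = [yi - mean_y for yi in Y]
sum_xy = sum(a * b for a, b in zip(x, y))          # Σxy = 86

# Standard deviations from the sums of squared deviations
sigma_x = math.sqrt(sum(a * a for a in x) / N)     # √(244/10) ≈ 4.94
sigma_y = math.sqrt(sum(b * b for b in y) / N)     # √(186/10) ≈ 4.31

# r as given by the formula under Table 7
r = sum_xy / (N * sigma_x * sigma_y)

# r is also the mean of the cross-products of standard scores
r_z = sum((a / sigma_x) * (b / sigma_y) for a, b in zip(x, y)) / N

print(round(r, 2), round(r_z, 2))   # 0.4 0.4
```

Both routes give the same value, since dividing each deviation by its σ before multiplying is algebraically identical to dividing the summed cross-products by N σx σy once at the end.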
will have the numerically highest score on the first test, while the best individual will have the highest score on the second.

Correlation coefficients may be computed in various ways, depending on the nature of the data. The most common is the Pearson Product-Moment Correlation Coefficient. This correlation coefficient takes into account not only the person's position in the group, but also the amount of his deviation above or below the group mean. It will be recalled that when each individual's standing is expressed in terms of standard scores, persons falling above the average receive positive standard scores, while those below the average receive negative scores. Thus, an individual who is superior in both variables to be correlated would have two positive standard scores; one inferior in both would have two negative standard scores. If, now, we multiply each individual's standard score in variable 1 by his standard score in variable 2, all of the products will be positive, provided that each individual falls on the same side of the mean on both variables. The Pearson correlation coefficient is simply the mean of these products. It will have a high positive value when corresponding standard scores are of equal sign and of approximately equal amount in the two variables. When subjects above the average in one variable are below the average in the other, the corresponding cross-products will be negative. If the sum of the cross-products is negative, the correlation will be negative.
108 Principles of Psychological Testing
perform this division only once at the end, as shown in the correlation formula in Table 7. The cross-products in the last column (xy) have been found by multiplying the corresponding deviations in the x and y columns. To compute the correlation (r), the sum of these cross-products is divided by the number of cases (N) and by the product of the two standard deviations (σx σy).
STATISTICAL SIGNIFICANCE. The correlation of .40 found in Table 7 indicates a moderate degree of positive relationship between the arithmetic and reading scores. There is some tendency for those children doing well on the arithmetic test also to perform well on the reading test, and vice versa, although the relation is not close. If we are concerned only with the performance of these 10 children, we can accept this correlation as an adequate description of the degree of relation existing between the two variables in this group. In psychological research, however, we are usually interested in generalizing beyond the particular sample of individuals tested to the larger population which they represent. For example, we might want to know whether arithmetic and reading ability are correlated among American schoolchildren of the same age as those we tested. Obviously, the 10 cases actually examined would constitute a very inadequate sample of such a population. Another comparable sample of the same size might yield a much lower or a much higher correlation.

There are statistical procedures for estimating the probable fluctuation to be expected from sample to sample in the size of correlations, means, standard deviations, and any other group measures. The question usually asked about correlations, however, is simply whether the correlation is significantly greater than zero. In other words, if the correlation in the population is zero, could a correlation as high as that obtained in our sample have resulted from sampling error alone? When we say that a correlation is "significant at the 1 percent (.01) level," we mean the chances are no greater than one out of 100 that the population correlation is zero. Hence, we conclude that the two variables are truly correlated. Significance levels refer to the risk of error we are willing to take in drawing conclusions from our data. If a correlation is said to be significant at the .05 level, the probability of error is 5 out of 100. Most psychological research applies the .01 or the .05 levels, although other significance levels may be employed for special reasons.

The correlation of .40 found in Table 7 fails to reach significance even at the .05 level. As might have been anticipated, with only 10 cases it is difficult to establish a general relationship conclusively. With this size of sample, the smallest correlation significant at the .05 level is .63. Any correlation below that value simply leaves unanswered the question of
whether the two variables are correlated in the population from which
the sample was drawn.
The minimum correlations significant at the .01 and ,05 levels for
groups of different sizes can be found by consulting tables of the signifi-
cance of correlations in any statistics textbook. For interpretive purposes
in this book, however, only an understanding of the general concept is
required. Parenthetically, it might be added that significance levels can
be interpreted in a similar way when applied to other statistical measures.
For example, to say that the difference between two means is significant
at the .01 level indicates that we can conclude, with only one chance out
of 100 of being wrong, that a difference in the obtained direction would
be found if we tested the whole population from which our samples were
drawn. For instance, if in the sample tested the boys had obtained a significantly higher mean than the girls on a mechanical comprehension test, we could conclude that the boys would also excel in the total population.
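The conversion behind such tabled minimum values is not shown in the text. One common approach, sketched below with the critical t value taken from a standard table (an assumption, not from the original), converts r to a t statistic with N − 2 degrees of freedom:

```python
import math

# Two-tailed critical value of Student's t at the .05 level for
# df = N - 2 = 8, taken from a standard t table (assumed constant).
T_CRIT_05 = 2.306

def t_statistic(r, n):
    """t statistic for testing whether a correlation differs from zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

def min_significant_r(t_crit, n):
    """Smallest correlation reaching significance for the given critical t."""
    df = n - 2
    return t_crit / math.sqrt(t_crit ** 2 + df)

# The Table 7 correlation of .40 with N = 10 falls short of significance,
# and the minimum significant correlation works out to .63, as stated.
print(round(t_statistic(0.40, 10), 2))             # 1.23, below 2.306
print(round(min_significant_r(T_CRIT_05, 10), 2))  # 0.63
```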
THE RELIABILITY COEFFICIENT. Correlation coefficients have many uses in the analysis of psychological data. The measurement of test reliability
represents one application of such coefficients. An example of a reliability
coefficient, computed by the Pearson Product-Moment method, is to be
found in Figure 10. In this case, the scores of 104 persons on two equiva-
lent forms of a Word Fluency test' were correlated. In one form, the sub-
jects were given five minutes to write as many words as they could that
began with a given letter. The second form was identical, except that a
different letter was employed. The two letters were chosen by the test
authors as being approximately equal in difficulty for this purpose.
The correlation between the number of words written in the two forms of this test was found to be .72. This correlation is high and significant at the .01 level. With 104 cases, any correlation of .25 or higher is significant at this level. Nevertheless, the obtained correlation is somewhat lower than is desirable for reliability coefficients, which usually fall in the .80's or .90's. An examination of the scatter diagram in Figure 10 shows a
typical bivariate distribution of scores corresponding to a high positive
correlation. It will be noted that the tallies cluster c~ose to the diagonal
extending from the lower left- to the upper right-haridcorner; the trend
is definitely in this direction, although there is a certain amount of scatter
of individual entries. In the follOWing section, the uSe of the correlation
coefficient in computing different measures of test reliability will be con-
sidered. '
1. One of the subtests of the SRA Tests of Primary Mental Abilities for Ages 11 to 17. The data were obtained in an investigation by Anastasi and Drake (1954).
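The ".25 or higher" figure quoted above can be checked the same way minimum significant correlations are tabled: convert r to a t statistic with N − 2 degrees of freedom and solve for the smallest r reaching the critical t. The two-tailed .01 critical t of about 2.63 for roughly 100 degrees of freedom is an assumed table value, not from the original text.

```python
import math

T_CRIT_01 = 2.626   # two-tailed .01 critical t for df ≈ 100 (standard t table)
N = 104
df = N - 2

# t = r * sqrt(df) / sqrt(1 - r^2)  implies  r = t / sqrt(t^2 + df)
r_min = T_CRIT_01 / math.sqrt(T_CRIT_01 ** 2 + df)
print(round(r_min, 2))   # 0.25
```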
less susceptible the scores are to the random daily changes in the condi-
tion of the subject or of the testing environment.
When retest reliability is reported in a test manual, the interval over
which it was measured should always be specified. Since retest correla-
tions decrease progressively as this interval lengthens, there is not one
but an infinite number of retest reliability coefficients for any test. It is
also desirable to give some indication of relevant intervening experiences
of the subjects on whom reliability was measured, such as educational or
job experiences, counseling, psychotherapy, and so forth.
Apart from the desirability of reporting length of interval, what con-
siderations should guide the choice of interval? Illustrations could readily
be cited of tests showing high reliability over periods of a few days or
weeks, but whose scores reveal an almost complete lack of correspond-
ence when the interval is extended to as long as ten or fifteen years.
Many preschool intelligence tests, for example, yield moderately stable
measures within the preschool period, but are virtually useless as pre-
dictors of late childhood or adult IQ's. In actual practice, however, a
simple distinction can usually be made. Short-range, random fluctuations
that occur during intervals ranging from a few hours to a few months are
generally included under the error variance of the test score. Thus, in
checking this type of test reliability, an effort is made to keep the interval
short. In testing young children, the period should be even shorter than
for older persons, since at early ages progressive developmental changes
are discernible over a period of a month or even less. For any type of
person, the interval between retests should rarely exceed six months.
Any additional changes in the relative test performance of individuals
that occur over longer periods of time are apt to be cumulative and pro-
gressive rather than entirely random. Moreover, they are likely to charac-
terize a broader area of behavior than that covered by the test perform-
ance itself. Thus, one's general level of scholastic aptitude, mechanical
comprehension, or artistic judgment may have altered appreciably over
a ten-year period, owing to unusual intervening experiences. The indi-
vidual's status may have either risen or dropped appreciably in relation
to others of his own age, because of circumstances peculiar to his own
home, school, or community environment, or for other reasons such as
illness or emotional disturbance.
The extent to which such factors can affect an individual's psycho-
logical development provides an important problem for investigation.
This question, however, should not be confused with that of the reliabil-
ity of a particular test. When we measure the reliability of the Stanford-Binet, for example, we do not ordinarily correlate retest scores over a period of ten years, or even one year, but over a few weeks. To be sure, long-range retests have been conducted with such tests, but the results are generally discussed in terms of the predictability of adult intelligence
Fig. 10. A Reliability Coefficient of .72. (Data from Anastasi & Drake, 1954.)
TYPES OF RELIABILITY
TEST-RETEST RELIABILITY. The most obvious method for finding the reliability of test scores is by repeating the identical test on a second occasion. The reliability coefficient (rtt) in this case is simply the correlation between the scores obtained by the same persons on the two administrations of the test. The error variance corresponds to the random fluctuations of performance from one test session to the other. These variations may result in part from uncontrolled testing conditions, such as extreme changes in weather, sudden noises and other distractions, or a broken pencil point. To some extent, however, they arise from changes in the condition of the subject himself, as illustrated by illness, fatigue, emotional strain, worry, recent experiences of a pleasant or unpleasant nature, and the like. Retest reliability shows the extent to which scores on a test can be generalized over different occasions; the higher the reliability, the
[Scatter diagram for Figure 10: tallies clustering along the diagonal; horizontal axis: Score on Form 1, Word Fluency Test]
from childhood performance, rather than in terms of the reliability of a particular test. The concept of reliability is generally restricted to short-range, random changes that characterize the test performance itself rather than the entire behavior domain that is being tested.

It should be noted that different behavior functions may themselves vary in the extent of daily fluctuation they exhibit. For example, steadiness of delicate finger movements is undoubtedly more susceptible to slight changes in the person's condition than is verbal comprehension. If we wish to obtain an over-all estimate of the individual's habitual finger steadiness, we would probably require repeated tests on several days, whereas a single test session would suffice for verbal comprehension. Again we must fall back on an analysis of the purposes of the test and on a thorough understanding of the behavior the test is designed to predict.
Although apparently simple and straightforward, the test-retest technique presents difficulties when applied to most psychological tests. Practice will probably produce varying amounts of improvement in the retest scores of different individuals. Moreover, if the interval between retests is fairly short, the examinees may recall many of their former responses. In other words, the same pattern of right and wrong responses is likely to recur through sheer memory. Thus, the scores on the two administrations of the test are not independently obtained and the correlation between them will be spuriously high. The nature of the test itself may also change with repetition. This is especially true of problems involving reasoning or ingenuity. Once the subject has grasped the principle involved in the problem, or once he has worked out a solution, he can reproduce the correct response in the future without going through the intervening steps. Only tests that are not appreciably affected by repetition lend themselves to the retest technique. A number of sensory discrimination and motor tests would fall into this category. For the large majority of psychological tests, however, the retest technique is inappropriate.
ALTERNATE-FORM RELIABILITY. One way of avoiding the difficulties encountered in test-retest reliability is through the use of alternate forms of the test. The same persons can thus be tested with one form on the first occasion and with another, comparable form on the second. The correlation between the scores obtained on the two forms represents the reliability coefficient of the test. It will be noted that such a reliability coefficient is a measure of both temporal stability and consistency of response to different item samples (or test forms). This coefficient thus combines two types of reliability. Since both types are important for most
Reliability 113
testing purposes, however, alternate-form reliability provides a useful measure for evaluating many tests.

The concept of item sampling, or content sampling, underlies not only alternate-form reliability but also other types of reliability to be discussed shortly. It is therefore appropriate to examine it more closely. Everyone has probably had the experience of taking a course examination in which he felt he had a "lucky break" because many of the items covered the very topics he happened to have studied most carefully. On another occasion, he may have had the opposite experience, finding an unusually large number of items on areas he had failed to review. This familiar situation illustrates error variance resulting from content sampling. To what extent do scores on this test depend on factors specific to the particular selection of items? If a different investigator, working independently, were to prepare another test in accordance with the same specifications, how much would an individual's score differ on the two tests?

Let us suppose that a 40-item vocabulary test has been constructed as a measure of general verbal comprehension. Now suppose that a second list of 40 different words is assembled for the same purpose, and that the items are constructed with equal care to cover the same range of difficulty as the first test. The differences in the scores obtained by the same individuals on these two tests illustrate the type of error variance under consideration. Owing to fortuitous factors in the past experience of different individuals, the relative difficulty of the two lists will vary somewhat from person to person. Thus, the first list might contain a larger number of words unfamiliar to individual A than does the second list. The second list, on the other hand, might contain a disproportionately large number of words unfamiliar to individual B. If the two individuals are approximately equal in their overall word knowledge (i.e., in their "true scores"), B will nevertheless excel A on the first list, while A will excel B on the second. The relative standing of these two persons will therefore be reversed on the two lists, owing to chance differences in the selection of items.
Like test-retest reliability, alternate-form reliability should always be accompanied by a statement of the length of the interval between test administrations, as well as a description of relevant intervening experiences. If the two forms are administered in immediate succession, the resulting correlation shows reliability across forms only, not across occasions. The error variance in this case represents fluctuations in performance from one set of items to another, but not fluctuations over time.

In the development of alternate forms, care should of course be exercised to ensure that they are truly parallel. Fundamentally, parallel forms of a test should be independently constructed tests designed to meet the same specifications. The tests should contain the same number of items,
and the items should be expressed in the same form and should cover the same type of content. The range and level of difficulty of the items should also be equal. Instructions, time limits, illustrative examples, format, and all other aspects of the test must likewise be checked for comparability.

It should be added that the availability of parallel test forms is desirable for other reasons besides the determination of test reliability. Alternate forms are useful in follow-up studies or in investigations of the effects of some intervening experimental factor on test performance. The use of several alternate forms also provides a means of reducing the possibility of coaching or cheating.
Although much more widely applicable than test-retest reliability, alternate-form reliability also has certain limitations. In the first place, if the behavior functions under consideration are subject to a large practice effect, the use of alternate forms will reduce but not eliminate such an effect. To be sure, if all examinees were to show the same improvement with repetition, the correlation between their scores would remain unaffected, since adding a constant amount to each score does not alter the correlation coefficient. It is much more likely, however, that individuals will differ in amount of improvement, owing to extent of previous practice with similar material, motivation in taking the test, and other factors. Under these conditions, the practice effect represents another source of variance that will tend to reduce the correlation between the two test forms. If the practice effect is small, reduction will be negligible.
Another related question concerns the degree to which the nature of
the test will change with repetition. In certain types of ingenuity prob-
lems, for example, any item involving the same principle can be readily
solved by most subjects once they have worked out the solution to the
first. In such a case, changing the specific content of the items in the
second form would not suffice to eliminate this carry-over from the first
form. Finally, it should be added that alternate forms are unavailable for
many tests, because of the practical difficulties of constructing compara-
ble forms. For all these reasons, other techniques for estimating test re-
liability are often required.
Reliability 115
SPLIT-HALF RELIABILITY. From a single administration of one form of a test it is possible to arrive at a measure of reliability by various split-half procedures. In such a way, two scores are obtained for each person by dividing the test into comparable halves. It is apparent that split-half reliability provides a measure of consistency with regard to content sampling. Temporal stability of the scores does not enter into such reliability, because only one test session is involved. This type of reliability coefficient is sometimes called a coefficient of internal consistency, since only a single administration of a single form is required.²

To find split-half reliability, the first problem is how to split the test in order to obtain the most nearly comparable halves. Any test can be divided in many different ways. In most tests, the first half and the second half would not be comparable, owing to differences in nature and difficulty level of items, as well as to the cumulative effects of warming up, practice, fatigue, boredom, and any other factors varying progressively from the beginning to the end of the test. A procedure that is adequate for most purposes is to find the scores on the odd and even items of the test. If the items were originally arranged in an approximate order of difficulty, such a division yields very nearly equivalent half-scores. One precaution to be observed in making such an odd-even split pertains to groups of items dealing with a single problem, such as questions referring to a particular mechanical diagram or to a given passage in a reading test. In this case, a whole group of items should be assigned intact to one or the other half. Were the items in such a group to be placed in different halves of the test, the similarity of the half-scores would be spuriously inflated, since any single error in understanding of the problem might affect items in both halves.

Once the two half-scores have been obtained for each person, they may be correlated by the usual method. It should be noted, however, that this correlation actually gives the reliability of only a half test. For example, if the entire test consists of 100 items, the correlation is computed between two sets of scores each of which is based on only 50 items. In both test-retest and alternate-form reliability, on the other hand, each score is based on the full number of items in the test.

Other things being equal, the longer a test, the more reliable it will be. It is reasonable to expect that, with a larger sample of behavior, we can arrive at a more adequate and consistent measure. The effect that lengthening or shortening a test will have on its coefficient can be estimated by means of the Spearman-Brown formula, given below:

rₙₙ = n r₁₁ / [1 + (n − 1) r₁₁]

in which rₙₙ is the estimated coefficient, r₁₁ the obtained coefficient, and n is the number of times the test is lengthened or shortened. Thus, if the number of test items is increased from 25 to 100, n is 4; if it is decreased from 60 to 30, n is ½. The Spearman-Brown formula is widely used in determining reliability by the split-half method, many test manuals reporting reliability in this form. When applied to split-half reliability, the formula always involves doubling the length of the test. Under these conditions, it can be simplified as follows:

rₙₙ = 2 r₁₁ / (1 + r₁₁)

² Lengthening a test, however, will increase only its consistency in terms of content sampling, not its stability over time (see Cureton, 1965).
An alternate method for finding split-half reliability was developed by Rulon (1939). It requires only the variance of the differences between each person's scores on the two half-tests (σ²d) and the variance of total scores (σ²t); these two values are substituted in the following formula, which yields the reliability of the whole test directly:

r₁₁ = 1 − σ²d / σ²t

It is interesting to note the relationship of this formula to the definition of error variance. Any difference between a person's scores on the two half-tests represents chance error. The variance of these differences, divided by the variance of total scores, gives the proportion of error variance in the scores. When this error variance is subtracted from 1.00, it gives the proportion of "true" variance, which is equal to the reliability coefficient.
KUDER-RICHARDSON RELIABILITY. A fourth method for finding reliability, also utilizing a single administration of a single form, is based on the consistency of responses to all items in the test. This interitem consistency is influenced by two sources of error variance: (1) content sampling (as in alternate-form and split-half reliability); and (2) heterogeneity of the behavior domain sampled. The more homogeneous the domain, the higher the interitem consistency. For example, if one test includes only multiplication items, while another comprises addition, subtraction, multiplication, and division items, the former test will probably show more interitem consistency than the latter. In the latter, more heterogeneous test, one subject may perform better in subtraction than in any of the other arithmetic operations; another subject may score relatively well on the division items, but more poorly in addition, subtraction, and multiplication; and so on. A more extreme example would be represented by a test consisting of 40 vocabulary items, in contrast to one containing 10 vocabulary, 10 spatial relations, 10 arithmetic reasoning, and 10 perceptual speed items. In the latter test, there might be little or no relationship between an individual's performance on the different types of items.

It is apparent that test scores will be less ambiguous when derived from relatively homogeneous tests. Suppose that on the highly heterogeneous, 40-item test cited above, Smith and Jones both obtain a score of 20. Can we conclude that the performances of the two on this test were equal? Not at all. Smith may have correctly completed 10 vocabulary items, 10 perceptual speed items, and none of the arithmetic reasoning and spatial relations items. In contrast, Jones may have received a score of 20 by the successful completion of 5 perceptual speed, 5 spatial relations, 10 arithmetic reasoning, and no vocabulary items.
Many other combinations could obviously produce the same total score of 20. This score would have a very different meaning when obtained through such dissimilar combinations of items. In the relatively homogeneous vocabulary test, on the other hand, a score of 20 would probably mean that the subject had succeeded with approximately the first 20 words, if the items were arranged in ascending order of difficulty. He might have failed two or three easier words and correctly responded to two or three more difficult items beyond the 20th, but such individual variations are slight in comparison with those found in a more heterogeneous test.
A highly relevant question in this connection is whether the criterion that the test is trying to predict is itself relatively homogeneous or heterogeneous. Although homogeneous tests are to be preferred because their scores permit fairly unambiguous interpretation, a single homogeneous test is obviously not an adequate predictor of a highly heterogeneous criterion. Moreover, in the prediction of a heterogeneous criterion, the heterogeneity of test items would not necessarily represent error variance. Traditional intelligence tests provide a good example of heterogeneous tests designed to predict heterogeneous criteria. In such a case, however, it may be desirable to construct several relatively homogeneous tests, each measuring a different phase of the heterogeneous criterion. Thus, unambiguous interpretation of test scores could be combined with adequate criterion coverage.
The most common procedure for finding interitem consistency is that
developed by Kuder and Richardson (1937). As in the split-half methods,
interitem consistency is found from a single administration of a single
test. Rather than requiring two half-scores, however, such a technique is
based on an examination of performance on each item. Of the various
formulas derived in the original article, the most widely applicable, commonly known as "Kuder-Richardson formula 20," is the following:
r₁₁ = (n / (n − 1)) · (σ²t − Σpq) / σ²t ³
In this formula, r₁₁ is the reliability coefficient of the whole test, n is the number of items in the test, and σt the standard deviation of total scores on the test. The only new term in this formula, Σpq, is found by tabulating the proportion of persons who pass (p) and the proportion who do not pass (q) each item. The product of p and q is computed for each item, and these products are then added for all items, to give Σpq. Since in the process of test construction p is often routinely recorded in order to find the difficulty level of each item, this method of determining reliability involves little additional computation.

³ A simple derivation of this formula can be found in Ebel (1965, pp. 326–327).
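The tabulation just described can be sketched in a few lines of code (a toy example with made-up 0/1 item responses, not data from the text):

```python
def kr20(responses):
    """Kuder-Richardson formula 20 for items scored 0 (fail) or 1 (pass).
    responses: one list per person, with one 0/1 entry per item."""
    n = len(responses[0])                 # number of items
    people = len(responses)
    totals = [sum(r) for r in responses]  # each person's total score
    mean_t = sum(totals) / people
    var_t = sum((t - mean_t) ** 2 for t in totals) / people  # sigma_t squared
    sum_pq = 0.0
    for i in range(n):
        p = sum(r[i] for r in responses) / people  # proportion passing item i
        sum_pq += p * (1 - p)
    return (n / (n - 1)) * (var_t - sum_pq) / var_t

# Four persons, four dichotomous items:
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
]
print(round(kr20(responses), 4))  # 0.6667
```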
It can be shown mathematically that the Kuder-Richardson reliability coefficient is actually the mean of all split-half coefficients resulting from different splittings of a test (Cronbach, 1951).⁴ The ordinary split-half coefficient, on the other hand, is based on a planned split designed to yield equivalent sets of items. Hence, unless the test items are highly homogeneous, the Kuder-Richardson coefficient will be lower than the split-half reliability. An extreme example will serve to highlight the difference. Suppose we construct a 50-item test out of 25 different kinds of items, such that items 1 and 2 are vocabulary items, items 3 and 4 arithmetic reasoning, items 5 and 6 spatial orientation, and so on. The odd and even scores on this test could theoretically agree quite closely, thus yielding a high split-half reliability coefficient. The homogeneity of this test, however, would be very low, since there would be little consistency of performance among the entire set of 50 items. In this example, we would expect the Kuder-Richardson reliability to be much lower than the split-half reliability. It can be seen that the difference between Kuder-Richardson and split-half reliability coefficients may serve as a rough index of the heterogeneity of a test.

⁴ This is strictly true only when the split-half coefficients are found by the Rulon formula, not when they are found by correlation of halves and the Spearman-Brown formula (Novick & Lewis, 1967).

The Kuder-Richardson formula is applicable to tests whose items are scored as right or wrong, or according to some other all-or-none system. Some tests, however, may have multiple-scored items. On a personality inventory, for example, the respondent may receive a different numerical score on an item, depending on whether he checks "usually," "sometimes," "rarely," or "never." For such tests, a generalized formula has been derived, known as coefficient alpha (Cronbach, 1951; Novick & Lewis, 1967). In this formula, the value Σpq is replaced by Σσ²ᵢ, the sum of the variances of item scores. The procedure is to find the variance of all individuals' scores for each item and then to add these variances across all items. The complete formula for coefficient alpha is given below:

r₁₁ = (n / (n − 1)) · (σ²t − Σσ²ᵢ) / σ²t

A clear description of the computational layout for finding coefficient alpha can be found in Ebel (1965, pp. 326–330).

SCORER RELIABILITY. It should now be apparent that the different types of reliability vary in the factors they subsume under error variance. In one case, error variance covers temporal fluctuations; in another, it refers to differences between sets of parallel items; and in still another, it includes any interitem inconsistency. On the other hand, the factors excluded from measures of error variance are broadly of two types: (a) those factors whose variance should remain in the scores, since they are part of the true differences under consideration; and (b) those irrelevant factors that can be experimentally controlled. For example, it is not customary to report the error of measurement resulting when a test is administered under distracting conditions or with a longer or shorter time limit than that specified in the manual. Timing errors and serious distractions can be empirically eliminated from the testing situation. Hence, it is not necessary to report special reliability coefficients corresponding to "distraction variance" or "timing variance."

Similarly, most tests provide such highly standardized procedures for administration and scoring that error variance attributable to these factors is negligible. This is particularly true of group tests designed for mass testing and computer scoring. With such instruments, we need only to make certain that the prescribed procedures are carefully followed and adequately checked. With clinical instruments employed in intensive individual examinations, on the other hand, there is evidence of considerable "examiner variance." Through special experimental designs, it is possible to separate this variance from that attributable to temporal fluctuations in the subject's condition or to the use of alternate test forms.

One source of error variance that can be checked quite simply is scorer variance. Certain types of tests, notably tests of creativity and projective tests of personality, leave a good deal to the judgment of the scorer. With such tests, there is as much need for a measure of scorer reliability as there is for the more usual reliability coefficients. Scorer reliability can be found by having a sample of test papers independently scored by two examiners. The two scores thus obtained by each examinee are then correlated in the usual way, and the resulting correlation coefficient is a measure of scorer reliability. This type of reliability is commonly computed when subjectively scored instruments are employed in research. Test manuals should also report it when appropriate.
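The scorer-reliability computation described above is an ordinary Pearson correlation between the two examiners' scores. A self-contained sketch (the score values are invented for illustration):

```python
def pearson_r(x, y):
    """Pearson correlation between two sets of paired scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Two examiners' independent scores for the same five test papers:
scorer_a = [12, 15, 9, 20, 14]
scorer_b = [11, 16, 10, 19, 15]
print(round(pearson_r(scorer_a, scorer_b), 2))  # 0.96, high scorer agreement
```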
OVERVIEW. The different types of reliability coefficients discussed in this section are summarized in Tables 8 and 9. In Table 8, the operations followed in obtaining each type of reliability are classified with regard to number of test forms and number of testing sessions required. Table 9 shows the sources of variance treated as error variance by each procedure.

TABLE 8
Techniques for Measuring Reliability, in Relation to Test Form and Testing Session

                         Testing Sessions Required
Test Forms Required      One                           Two
One                      Split-Half                    Test-Retest
                         Kuder-Richardson
                         Scorer
Two                      Alternate-Form (Immediate)    Alternate-Form (Delayed)

TABLE 9
Sources of Error Variance in Relation to Reliability Coefficients

Type of Reliability Coefficient           Error Variance
Test-Retest                               Time sampling
Alternate-Form (Immediate)                Content sampling
Alternate-Form (Delayed)                  Time sampling and Content sampling
Split-Half                                Content sampling
Kuder-Richardson and Coefficient Alpha    Content sampling and Content heterogeneity
Scorer                                    Interscorer differences

Any reliability coefficient may be interpreted directly in terms of the percentage of score variance attributable to different sources. Thus, a reliability coefficient of .85 signifies that 85 percent of the variance in test scores depends on true variance in the trait measured and 15 percent depends on error variance (as operationally defined by the specific procedure followed). The statistically sophisticated reader may recall that it is the square of a correlation coefficient that represents proportion of common variance. Actually, the proportion of true variance in test scores is the square of the correlation between scores on a single form of the test and true scores free from chance errors. This correlation, known as the index of reliability,⁵ is equal to the square root of the reliability coefficient (√r₁₁). When the index of reliability is squared, the result is the reliability coefficient (r₁₁), which can therefore be interpreted directly as the percentage of true variance.

⁵ Derivations of the index of reliability, based on two different sets of assumptions, are given by Gulliksen (1950b, Chs. 2 and 3).

Experimental designs that yield more than one type of reliability coefficient for the same group permit the analysis of total score variance into different components. Let us consider the following hypothetical example. Forms A and B of a creativity test have been administered with a two-month interval to 100 sixth-grade children. The resulting alternate-form reliability is .70. From the responses of either form, a split-half reliability coefficient can also be computed.⁶ This coefficient, stepped up by the Spearman-Brown formula, is .80. Finally, a second scorer has rescored a random sample of 50 papers, from which a scorer reliability of .92 is obtained. The three reliability coefficients can now be analyzed to yield the error variances shown in Table 10 and Figure 11. It will be noted that by subtracting the error variance attributable to content sampling alone (split-half reliability) from the error variance attributable to both content and time sampling (alternate-form reliability), we find that .10 of the variance can be attributed to time sampling alone. Adding the error variances attributable to content sampling (.20), time sampling (.10), and interscorer difference (.08) gives a total error variance of .38 and hence a true variance of .62. These proportions, expressed in the more familiar percentage terms, are shown graphically in Figure 11.

⁶ For a better estimate of the coefficient of internal consistency, split-half correlations could be computed for each form and the two coefficients averaged by the appropriate statistical procedures.

TABLE 10
Analysis of Sources of Error Variance in a Hypothetical Test

From delayed alternate-form reliability: 1 − .70 = .30 (time sampling plus content sampling)
From split-half, Spearman-Brown reliability: 1 − .80 = .20 (content sampling)
    Difference: .10 (time sampling)
From scorer reliability: 1 − .92 = .08 (interscorer difference)
Total Measured Error Variance = .20 + .10 + .08 = .38
True Variance = 1 − .38 = .62
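The arithmetic of Table 10 can be reproduced directly (the three coefficients are the values from the hypothetical example above):

```python
alternate_form_delayed = 0.70  # error: time sampling plus content sampling
split_half_sb = 0.80           # error: content sampling (Spearman-Brown stepped up)
scorer = 0.92                  # error: interscorer difference

content = 1 - split_half_sb                    # content sampling alone
time = (1 - alternate_form_delayed) - content  # time sampling alone
interscorer = 1 - scorer                       # interscorer difference
total_error = content + time + interscorer
true_variance = 1 - total_error

print(round(content, 2), round(time, 2), round(interscorer, 2))  # 0.2 0.1 0.08
print(round(total_error, 2), round(true_variance, 2))            # 0.38 0.62
```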
Figure 11. Percentage Distribution of Score Variance in a Hypothetical Test. (Error variance, 38%: content sampling 20%, time sampling 10%, interscorer difference 8%. True variance, 62%: stable over time, consistent over forms, free from interscorer difference.)

RELIABILITY OF SPEEDED TESTS

Both in test construction and in the interpretation of test scores, an important distinction is that between the measurement of speed and of power. A pure speed test is one in which individual differences depend entirely on speed of performance. Such a test is constructed from items of uniformly low difficulty, all of which are well within the ability level of the persons for whom the test is designed. The time limit is made so short that no one can finish all the items. Under these conditions, each person's score reflects only the speed with which he worked. A pure power test, on the other hand, has a time limit long enough to permit everyone to attempt all items. The difficulty of the items is steeply graded, and the test includes some items too difficult for anyone to solve, so that no one can get a perfect score.

It will be noted that both speed and power tests are designed to prevent the achievement of perfect scores. The reason for such a precaution is that perfect scores are indeterminate, since it is impossible to know how much higher the individual's score would have been if more items, or more difficult items, had been included. To enable each individual to show fully what he is able to accomplish, the test must provide adequate ceiling, either in number of items or in difficulty level. An exception to this rule is found in mastery testing, as illustrated by the criterion-referenced tests discussed in Chapter 4. The purpose of such testing is not to establish the limits of what the individual can do, but to determine whether a preestablished performance level has or has not been reached.

In actual practice, the distinction between speed and power tests is one of degree, most tests depending on both power and speed in varying proportions. Information about these proportions is needed for each test in order not only to understand what the test measures but also to choose the proper procedures for evaluating its reliability. Single-trial reliability coefficients, such as those found by odd-even or Kuder-Richardson techniques, are inapplicable to speeded tests. To the extent that individual differences in test scores depend on speed of performance, reliability coefficients found by these methods will be spuriously high. An extreme example will help to clarify this point. Let us suppose that a 50-item test depends entirely on speed, so that individual differences in score are based wholly on number of items attempted, rather than on errors. Then, if individual A obtains a score of 44, he will obviously have 22 correct odd items and 22 correct even items. Similarly, individual B, with a score of 34, will have odd and even scores of 17 and 17, respectively. Consequently, except for accidental careless errors on a few items, the correlation between odd and even scores would be perfect, or +1.00. Such a correlation, however, is entirely spurious and provides no information about the reliability of the test.

An examination of the procedures followed in finding both split-half and Kuder-Richardson reliability will show that both are based on the consistency in number of errors made by the examinee. If, now, individual differences in test scores depend, not on errors, but on speed, the measure of reliability must obviously be based on consistency in speed of work. When test performance depends on a combination of speed and power, the single-trial reliability coefficient will fall below 1.00, but it will still be spuriously high. As long as individual differences in test scores are appreciably affected by speed, single-trial reliability coefficients cannot be properly interpreted.

What alternative procedures are available to determine the reliability of significantly speeded tests? If the test-retest technique is applicable, it would be appropriate. Similarly, equivalent-form reliability may be properly employed with speed tests. Split-half techniques may also be used, provided that the split is made in terms of time rather than in terms of items. In other words, the half-scores must be based on separately timed parts of the test. One way of effecting such a split is to administer two equivalent halves of the test with separate time limits. For example, the odd and even items may be separately printed on different pages, and each set of items given with one-half the time limit of the entire test. Such a procedure is tantamount to administering two equivalent forms of the test in immediate succession. Each form, however, is half as long as the test proper, while the subjects' scores are normally based on the whole test. For this reason, either the Spearman-Brown or some other appropriate formula should be used to find the reliability of the whole test.

If it is not feasible to administer the two half-tests separately, an alternative procedure is to divide the total time into quarters, and to find a score for each of the four quarters. This can easily be done by having the examinees mark the item on which they are working whenever the examiner gives a prearranged signal. The number of items correctly completed within the first and fourth quarters can then be combined to
represent one half-score, while those in the second and third quarters can be combined to yield the other half-score. Such a combination of quarters tends to balance out the cumulative effects of practice, fatigue, and other factors. This method is especially satisfactory when the items are not steeply graded in difficulty level.

When is a test appreciably speeded? Under what conditions must the special precautions discussed in this section be observed? Obviously, the mere employment of a time limit does not signify a speed test. If all subjects finish within the given time limit, speed of work plays no part in determining the scores. Percentage of persons who fail to complete the test might be taken as a crude index of speed versus power. Even when no one finishes the test, however, the role of speed may be negligible. For example, if everyone completes exactly 40 items of a 50-item test, individual differences with regard to speed are entirely absent, although no one had time to attempt all the items.

The essential question, of course, is: "To what extent are individual differences in test scores attributable to speed?" In more technical terms, we want to know what proportion of the total variance of test scores is speed variance. This proportion can be estimated roughly by finding the variance of number of items completed by different persons and dividing it by the variance of total test scores (σ²c/σ²t). In the example cited above, in which every individual finishes 40 items, the numerator of this fraction would be zero, since there are no individual differences in number of items completed (σ²c = 0). The entire index would thus equal zero in a pure power test. On the other hand, if the total test variance (σ²t) is attributable to individual differences in speed, the two variances will be equal and the ratio will be 1.00. Several more refined procedures have been developed for determining this proportion, but their detailed consideration falls beyond the scope of this book.⁷
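This rough index can be computed directly (illustrative data; `pvariance` is the population variance from the Python standard library):

```python
from statistics import pvariance

def speed_index(items_completed, total_scores):
    """Rough proportion of test-score variance attributable to speed:
    variance of number of items completed divided by variance of total scores."""
    return pvariance(items_completed) / pvariance(total_scores)

# Pure power: everyone completes exactly 40 items, so the numerator is zero.
print(speed_index([40, 40, 40, 40], [28, 31, 35, 26]))  # 0.0
# Pure speed: score equals number of items attempted, so the ratio is 1.0.
print(speed_index([30, 35, 42, 44], [30, 35, 42, 44]))  # 1.0
```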
An example of the effect of speed on single-trial reliability coefficients is provided by data collected in an investigation of the first edition of the SRA Tests of Primary Mental Abilities for Ages 11 to 17 (Anastasi & Drake, 1954). In this study, the reliability of each test was first determined by the usual odd-even procedure. These coefficients, given in the first row of Table 11, are closely similar to those reported in the test manual. Reliability coefficients were then computed by correlating scores on separately timed halves. These coefficients are shown in the second row of Table 11. Calculation of speed indexes showed that the Verbal Meaning test is primarily a power test, while the Reasoning test is somewhat more dependent on speed. The Space and Number tests proved to be highly speeded.

TABLE 11
Reliability Coefficients of Four of the SRA Tests of Primary Mental Abilities for Ages 11 to 17 (1st Edition)
(Data from Anastasi & Drake, 1954)

Reliability Coefficient Found by:    Verbal Meaning    Reasoning    Space    Number
Single-trial odd-even method              .94             .96        .90      .92
Separately timed halves                   .90             .87        .75      .83

It will be noted in Table 11 that, when properly computed, the reliability of the Space test is .75, in contrast to a spuriously high odd-even coefficient of .90. Similarly, the reliability of the Reasoning test drops from .96 to .87, and that of the Number test drops from .92 to .83. The reliability of the relatively unspeeded Verbal Meaning test, on the other hand, shows a negligible difference when computed by the two methods.
DEPENDENCE OF RELIABILITY COEFFICIENTS ON THE SAMPLE TESTED
⁷ See, e.g., Cronbach & Warrington (1951), Gulliksen (1950a, 1950b), Guttman (1955), Helmstadter & Ortmeyer (1953).
HETEROGENEITY. An important factor influencing the size of a reliability coefficient is the nature of the group on which reliability is measured. In the first place, any correlation coefficient is affected by the range of individual differences in the group. If every member of a group were alike in spelling ability, then the correlation of spelling with any other ability would be zero in that group. It would obviously be impossible, within such a group, to predict an individual's standing in any other ability from a knowledge of his spelling score.

Another, less extreme, example is provided by the correlation between two aptitude tests, such as a verbal comprehension and an arithmetic reasoning test. If these tests were administered to a highly homogeneous sample, such as a group of 300 college sophomores, the correlation between the two would probably be close to zero. There is little relationship, within such a selected sample of college students, between any individual's verbal ability and his numerical reasoning ability. On the other hand, were the tests to be given to a heterogeneous sample of 300 persons, ranging from institutionalized mentally retarded persons to college graduates, a high correlation would undoubtedly be obtained between the two tests. The mentally retarded would obtain poorer scores than the college graduates on both tests, and similar relationships would hold for other subgroups within this highly heterogeneous sample.
Examination of the hypothetical scatter diagram given in Figure 12 will further illustrate the dependence of correlation coefficients on the variability, or extent of individual differences, within the group. This scatter diagram shows a high positive correlation in the entire, heterogeneous group, since the entries are closely clustered about the diagonal extending from the lower left- to the upper right-hand corner. If, now, we consider only the subgroup falling within the small rectangle in the upper right-hand portion of the diagram, it is apparent that the correlation between the two variables is close to zero. Individuals falling within this restricted range in both variables represent a highly homogeneous group, as did the college sophomores mentioned above.
Like all correlation coefficients, reliability coefficients depend on the variability of the sample within which they are found. Thus, if the reliability coefficient reported in a test manual was determined in a group ranging from fourth-grade children to high school students, it cannot be assumed that the reliability would be equally high within, let us say, an eighth-grade sample. When a test is to be used to discriminate individual
Reliability 127
differences within a more homogeneous sample than the standardization
group, the reliabi~ity ~oefficient should be redetermined on such a sample.
Formulas for estimating the reliability coefficient to be expected when
the standard deviation of the group is increased or decreased are avail-
able in elementary statistics textbooks. It is preferable, however, to re-
compute the reliability coefficient empirically on a group comparable to
that on which the test is to be used. For tests designed to cover a wide
range ~f age or abil.ity, the test manual should report separate reliability
coeffiCIents for relatively homogeneous subgroups within the standardiza-tion sample.
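The textbook formula alluded to above can be sketched in a few lines. It assumes, as the standard derivation does, that the error variance SD²(1 − r) stays constant while the group's total variance changes; the numbers below are hypothetical, not from the text.

```python
def estimated_reliability(r_old, sd_old, sd_new):
    """Estimate the reliability coefficient expected in a group whose
    standard deviation differs from that of the original sample.

    Assumes the error variance, sd**2 * (1 - r), remains constant
    while the true-score variance changes with group heterogeneity.
    """
    error_variance = sd_old ** 2 * (1.0 - r_old)
    return 1.0 - error_variance / sd_new ** 2

# A test with reliability .90 in a wide-range sample (SD = 20) would be
# expected to show a much lower coefficient in a narrower sample (SD = 10):
print(round(estimated_reliability(0.90, 20.0, 10.0), 2))  # 0.6
```

This makes concrete why a coefficient determined on a heterogeneous standardization group overstates the reliability to be expected within a single grade.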
[Scatter diagram for Fig. 12: Score on Variable 2 plotted against Score on Variable 1. Entries cluster along the diagonal for the full, heterogeneous group; a small rectangle in the upper right marks the restricted-range subgroup.]
ABILITY LEVEL. Not only does the reliability coefficient vary with the extent of individual differences in the sample, but it may also vary between groups differing in average ability level. These differences, moreover, cannot usually be predicted or estimated by any statistical formula, but can be discovered only by empirical tryout of the test on groups differing in age or ability level. Such differences in the reliability of a single test may arise from the fact that a slightly different combination of abilities is measured at different difficulty levels of the test. Or it may result from the statistical properties of the scale itself, as in the Stanford-Binet (Pinneau, 1961, Ch. 5). Thus, for different ages and for different IQ levels, the reliability coefficient of the Stanford-Binet varies from .83 to .98. In other tests, reliability may be relatively low for the younger and less able groups, since their scores are unduly influenced by guessing. Under such circumstances, the particular test should not be employed at these levels.

It is apparent that every reliability coefficient should be accompanied by a full description of the type of group on which it was determined. Special attention should be given to the variability and the ability level of the sample. The reported reliability coefficient is applicable only to samples similar to that on which it was computed. A desirable and growing practice in test construction is to fractionate the standardization sample into more homogeneous subgroups, with regard to age, sex, grade level, occupation, and the like, and to report separate reliability coefficients for each subgroup. Under these conditions, the reliability coefficients are more likely to be applicable to the samples with which the test is to be used in actual practice.
Fig. 12. The Effect of Restricted Range upon a Correlation Coefficient.

INTERPRETATION OF INDIVIDUAL SCORES. The reliability of a test may be expressed in terms of the standard error of measurement (σmeas), also called the standard error of a score. This measure is particularly well suited to the interpretation of individual scores. For many testing purposes, it is therefore more useful than the reliability coefficient. The standard error of measurement can be easily computed from the reliability coefficient of the test, by the following formula:

σmeas = SD √(1 − r11)

in which SD is the standard deviation of the test scores and r11 the reliability coefficient, both computed on the same group. For example, if deviation IQ's on a particular intelligence test have a standard deviation of 15 and a reliability coefficient of .89, the σmeas of an IQ on this test is: 15√(1 − .89) = 15√.11 = 15(.33) = 5.
To understand what the σmeas tells us about a score, let us suppose that we had a set of 100 IQ's obtained with the above test by a single boy, Jim. Because of the types of chance errors discussed in this chapter, these scores will vary, falling into a normal distribution around Jim's true score. The mean of this distribution of 100 scores can be taken as the true score and the standard deviation of the distribution can be taken as the σmeas. Like any standard deviation, this standard error can be interpreted in terms of the normal curve frequencies discussed in Chapter 4 (see Figure 3). It will be recalled that between the mean and ±1σ there are approximately 68 percent of the cases in a normal curve. Thus, we can conclude that the chances are roughly 2:1 (or 68:32) that Jim's IQ on this test will fluctuate between ±1σmeas, or 5 points on either side of his true IQ. If his true IQ is 110, we would expect him to score between 105 and 115 about two-thirds (68 percent) of the time.
If we want to be more certain of our prediction, we can choose higher odds than 2:1. Reference to Figure 3 in Chapter 4 shows that ±3σ covers 99.7 percent of the cases. It can be ascertained from normal curve frequency tables that a distance of 2.58σ on either side of the mean includes exactly 99 percent of the cases. Hence, the chances are 99:1 that Jim's IQ will fall within 2.58σmeas, or (2.58)(5) = 13 points, on either side of his true IQ. We can thus state at the 99 percent confidence level (with only one chance of error out of 100) that Jim's IQ on any single administration of the test will lie between 97 and 123 (110 − 13 and 110 + 13). If Jim were given 100 equivalent tests, his IQ would fall outside this band of values only once.

In actual practice, of course, we do not have the true scores, but only the scores obtained in a single test administration. Under these circumstances, we could try to follow the above reasoning in the reverse direction. If an individual's obtained score is unlikely to deviate by more than 2.58σmeas from his true score, we could argue that his true score must lie within 2.58σmeas of his obtained score. Although we cannot assign a probability to this statement for any given obtained score, we can say that the statement would be correct for 99 percent of all the cases.
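The computations in these paragraphs are easy to reproduce. A minimal sketch, using the text's IQ example (SD = 15, r11 = .89, true IQ 110):

```python
import math

def sem(sd, reliability):
    """Standard error of measurement: SD * sqrt(1 - r11)."""
    return sd * math.sqrt(1.0 - reliability)

def score_band(score, sd, reliability, z=1.0):
    """Range of +/- z standard errors of measurement around a score."""
    half_width = z * sem(sd, reliability)
    return (score - half_width, score + half_width)

s = sem(15, 0.89)
print(round(s, 2))                         # 4.97; the text rounds to 5
print(score_band(110, 15, 0.89, z=2.58))   # roughly 97 to 123
```

The same `score_band` call with the default z = 1.0 reproduces the 2:1 (68 percent) band of 105 to 115.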
On the basis of this reasoning, Gulliksen (1950b, pp. 17-20) proposed that the standard error of measurement be used as illustrated above to estimate the reasonable limits of the true score for persons with any given obtained score. It is in terms of such "reasonable limits" that the error of measurement is customarily interpreted in psychological testing, and it will be so interpreted in this book.

The standard error of measurement and the reliability coefficient are obviously alternative ways of expressing test reliability. Unlike the reliability coefficient, the error of measurement is independent of the variability of the group on which it is computed. Expressed in terms of individual scores, it remains unchanged when found in a homogeneous or a heterogeneous group. On the other hand, being reported in score units, the error of measurement will not be directly comparable from test to test. The usual problems of comparability of units would thus arise when errors of measurement are reported in terms of arithmetic problems, words in a vocabulary test, and the like. Hence, if we want to compare the reliability of different tests, the reliability coefficient is the better measure. To interpret individual scores, the standard error of measurement is more appropriate.
INTERPRETATION OF SCORE DIFFERENCES. It is particularly important to consider test reliability and errors of measurement when evaluating the differences between two scores. Thinking in terms of the range within which each score may fluctuate serves as a check against overemphasizing small differences between scores. Such caution is desirable both when comparing test scores of different persons and when comparing the scores of the same individual in different abilities. Similarly, changes in scores following instruction or other experimental variables need to be interpreted in the light of errors of measurement.

A frequent question about test scores concerns the individual's relative standing in different areas. Is Jane more able along verbal than along numerical lines? Does Tom have more aptitude for mechanical than for verbal activities? If Jane scored higher on the verbal than on the numerical subtests on an aptitude battery and Tom scored higher on the mechanical than on the verbal, how sure can we be that they would still do so on a retest with another form of the battery? In other words, could the score differences have resulted merely from the chance selection of specific items in the particular verbal, numerical, and mechanical tests employed?
Because of the growing interest in the interpretation of score profiles, test publishers have been developing report forms that permit the evaluation of scores in terms of their errors of measurement. An example is the Individual Report Form for use with the Differential Aptitude Tests, reproduced in Figure 13. On this form, percentile scores on each subtest of the battery are plotted as one-inch bars, with the obtained percentile at the center. Each percentile bar corresponds to a distance of approximately 1½ to 2 standard errors on either side of the obtained score.8 Hence the assumption that the individual's true score falls within the bar is correct about 90 percent of the time. In interpreting the profiles, test users are advised not to attach importance to differences between scores whose percentile bars overlap, especially if they overlap by more than half their length. In the profile illustrated in Figure 13, for example, the difference between the Verbal Reasoning and Numerical Ability scores probably reflects a genuine difference in ability level; that between Mechanical Reasoning and Space Relations probably does not; the difference between Abstract Reasoning and Mechanical Reasoning is in the doubtful range.

[Fig. 13. Score Profile on the Differential Aptitude Tests, Illustrating Use of Percentile Bands. (Fig. 2, Fifth Edition Manual, p. 73. Reproduced by permission. Copyright © 1973, 1974 by The Psychological Corporation, New York, N.Y. All rights reserved.)]

8 Because the reliability coefficient (and hence the σmeas) varies somewhat with subtest, grade, and sex, the actual ranges covered by the one-inch lines are not identical, but they are sufficiently close to permit uniform interpretations for practical purposes.

It is well to bear in mind that the standard error of the difference between two scores is larger than the error of measurement of either of the two scores. This follows from the fact that this difference is affected by the chance errors present in both scores. The standard error of the difference between two scores can be found from the standard errors of measurement of the two scores by the following formula:

σdiff = √(σmeas1² + σmeas2²)

in which σdiff is the standard error of the difference between the two scores, and σmeas1 and σmeas2 are the standard errors of measurement of the separate scores. By substituting SD√(1 − r11) for σmeas1 and SD√(1 − r22) for σmeas2, we may rewrite the formula directly in terms of reliability coefficients, as follows:

σdiff = SD √(2 − r11 − r22)

In this substitution, the same SD was used for tests 1 and 2, since their scores would have to be expressed in terms of the same scale before they could be compared.

We may illustrate the above procedure with the Verbal and Performance IQ's on the Wechsler Adult Intelligence Scale (WAIS). The split-half reliabilities of these scores are .96 and .93, respectively. WAIS deviation IQ's have a mean of 100 and an SD of 15. Hence the standard error of the difference between these two scores can be found as follows:

σdiff = 15√(2 − .96 − .93) = 4.95

To determine how large a score difference could be obtained by chance at the .05 level, we multiply the standard error of the difference (4.95) by 1.96. The result is 9.70, or approximately 10 points. Thus the difference between an individual's WAIS Verbal and Performance IQ should be at least 10 points to be significant at the .05 level.
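The WAIS computation above follows directly from the formula; in this sketch the rounding differs trivially from the text, which uses √.11 ≈ .33:

```python
import math

def se_difference(sd, r1, r2):
    """Standard error of the difference between two scores on the
    same scale: SD * sqrt(2 - r11 - r22)."""
    return sd * math.sqrt(2.0 - r1 - r2)

# WAIS example from the text: SD = 15, split-half reliabilities .96 and .93.
sd_diff = se_difference(15, 0.96, 0.93)
print(round(sd_diff, 2))        # 4.97 (the text, using sqrt(.11) = .33, gets 4.95)
print(round(1.96 * sd_diff))    # 10: minimum significant difference at the .05 level
```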
RELIABILITY OF CRITERION-REFERENCED TESTS

It will be recalled from Chapter 4 that criterion-referenced tests usually (but not necessarily) evaluate performance in terms of mastery rather than degree of achievement. A major statistical implication of mastery testing is a reduction in variability of scores among persons. Theoretically, if everyone continues training until the skill is mastered, variability is reduced to zero. Not only is low variability a result of the way such tests are used; it is also built into the tests through the construction and choice of items, as will be shown in Chapter 8.

In an earlier section of this chapter, we saw that any correlation, including reliability coefficients, is affected by the variability of the group in which it is computed. As the variability of the sample decreases, so does the correlation coefficient. Obviously, then, it would be inappropriate to assess the reliability of most criterion-referenced tests by the usual procedures.9 Under these conditions, even a highly stable and internally consistent test could yield a reliability coefficient near zero.
In the construction of criterion-referenced tests, two important questions are: (1) How many items must be used for reliable assessment of each of the specific instructional objectives covered by the test? (2) What proportion of items must be correct for the reliable establishment of mastery? In much current testing, these two questions have been answered by judgmental decisions. Efforts are under way, however, to develop appropriate statistical techniques that will provide objective, empirical answers (see, e.g., Ferguson & Novick, 1973; Glaser & Nitko, 1971; Hambleton & Novick, 1973; Livingston, 1972; Millman, 1974). A few examples will serve to illustrate the nature and scope of these efforts.

The two questions about number of items and cutoff score can be incorporated into a single hypothesis, amenable to testing within the framework of decision theory and sequential analysis (Glaser & Nitko, 1971; Lindgren & McElrath, 1969; Wald, 1947). Specifically, we wish to test the hypothesis that the examinee has achieved the required level of mastery in the content domain or instructional objective sampled by the test items. Sequential analysis consists in taking observations one at a time and deciding after each observation whether to: (1) accept the hypothesis; (2) reject the hypothesis; or (3) make additional observations. Thus the number of observations (in this case, number of items) needed to reach a reliable conclusion is itself determined during the process of testing. Rather than being presented with a fixed, predetermined number of items, the examinee continues taking the test until a mastery or nonmastery decision is reached. At that point, testing is discontinued and the student is either directed to the next instructional level or returned to the nonmastered level for further study. With the computer facilities described in an earlier chapter, such sequential decision procedures are feasible and can reduce total testing time while yielding reliable estimates of mastery (Glaser & Nitko, 1971).

9 For fuller discussion of special statistical procedures required for the construction and evaluation of criterion-referenced tests, see Glaser and Nitko (1971), Hambleton and Novick (1973), Millman (1974), and Popham and Husek (1969). A set of tables for determining the minimum number of items required for establishing mastery at specified levels is provided by Millman (1972, 1973).
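As a concrete illustration of this sequential logic, the sketch below applies Wald's sequential probability ratio test to a stream of item responses. The mastery and nonmastery proportions (p1, p0) and the error rates are hypothetical choices for illustration, not values from the text:

```python
import math

def sprt_mastery(responses, p0=0.60, p1=0.85, alpha=0.05, beta=0.05):
    """Wald sequential probability ratio test applied to mastery testing.

    Illustrative sketch only: p0 (nonmastery proportion-correct), p1
    (mastery proportion-correct), and the error rates are hypothetical.
    Returns 'mastery', 'nonmastery', or 'continue' (more items needed).
    """
    upper = math.log((1 - beta) / alpha)   # decide mastery above this
    lower = math.log(beta / (1 - alpha))   # decide nonmastery below this
    llr = 0.0                              # running log-likelihood ratio
    for correct in responses:
        if correct:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "mastery"
        if llr <= lower:
            return "nonmastery"
    return "continue"

print(sprt_mastery([1, 1, 1, 1, 1, 1, 1, 1, 1]))  # mastery
```

Note how the number of items administered is not fixed in advance: a string of consistent responses ends testing early, while mixed responses keep it going.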
Some investigators have been exploring the use of Bayesian estimation techniques, which lend themselves well to the kind of decisions required by mastery testing. Because of the large number of specific instructional objectives to be tested, criterion-referenced tests typically provide only a small number of items for each objective. To supplement this limited information, procedures have been developed for incorporating collateral data from the student's previous performance history, as well as from the test results of other students (Ferguson & Novick, 1973; Hambleton & Novick, 1973).
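One simple form such Bayesian estimation can take is a beta-binomial model, in which the prior distribution is the natural place to fold in collateral information. This is an illustrative sketch, not a reconstruction of the procedures of Ferguson and Novick; the prior and mastery level are hypothetical:

```python
import math

def beta_tail(a, b, x, steps=20000):
    """P(p > x) for a Beta(a, b) variable, by midpoint integration
    (kept dependency-free; scipy.stats.beta.sf would do the same job)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    width = (1.0 - x) / steps
    total = 0.0
    for i in range(steps):
        p = x + (i + 0.5) * width
        total += math.exp(log_norm + (a - 1.0) * math.log(p)
                          + (b - 1.0) * math.log(1.0 - p)) * width
    return total

def mastery_probability(correct, attempted, prior_a=1.0, prior_b=1.0,
                        mastery_level=0.80):
    """Posterior probability that the examinee's true proportion-correct
    exceeds the mastery level, under a Beta(prior_a, prior_b) prior.

    The prior is where collateral data (the student's history, other
    students' results) could be incorporated; values here are hypothetical.
    """
    return beta_tail(prior_a + correct,
                     prior_b + attempted - correct,
                     mastery_level)

# Eight of ten items correct under a uniform prior:
print(round(mastery_probability(8, 10), 3))  # about 0.38
```

With only ten items, the posterior remains quite uncertain, which is exactly why collateral data are attractive as a sharper prior.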
When flexible, individually tailored procedures are impracticable, more traditional techniques can be utilized to assess the reliability of a given test. For example, mastery decisions reached at a prerequisite instructional level can be checked against performance at the next instructional level. Is there a sizeable proportion of students who reached or exceeded the cutoff score on the mastery test at the lower level and failed to achieve mastery at the next level within a reasonable period of instructional time? Does an analysis of their difficulties suggest that they had not truly mastered the prerequisite skills? If so, these findings would strongly suggest that the mastery test was unreliable. Either the addition of more items or the establishment of a higher cutoff score would seem to be indicated. Another procedure for determining the reliability of a mastery test is to administer two parallel forms to the same individuals and note the percentage of persons for whom the same decision (mastery or nonmastery) is reached on both forms (Hambleton & Novick, 1973).
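The parallel-forms check just described reduces to a simple tally. A sketch with hypothetical scores and cutoff:

```python
def decision_consistency(form_a_scores, form_b_scores, cutoff):
    """Percentage of examinees classified the same way (mastery or
    nonmastery) by two parallel forms, given a common cutoff score."""
    pairs = list(zip(form_a_scores, form_b_scores))
    same = sum((a >= cutoff) == (b >= cutoff) for a, b in pairs)
    return 100.0 * same / len(pairs)

# Hypothetical scores of six students on two parallel forms, cutoff 8:
print(decision_consistency([9, 7, 10, 8, 5, 9],
                           [8, 8, 9, 6, 4, 10], 8))  # 4 of 6 decisions agree
```

Unlike a correlation coefficient, this index remains meaningful even when score variability is near zero, which is why it suits mastery tests.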
In the development of several criterion-referenced tests, Educational Testing Service has followed an empirical procedure to set standards of mastery. This procedure involves administering the test in classes one grade below and one grade above the grade where the particular concept or skill is taught. The dichotomization can be further refined by using teacher judgments to exclude any cases in the lower grade known to have mastered the concept or skill and any cases in the higher grade who have demonstrably failed to master it. A cutting score, in terms of number or percentage of correct items, is then selected that best discriminates between the two groups.
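Selecting the cutting score that best discriminates the two grade groups can be sketched as a search over candidate cutoffs, scoring each by the number of correct classifications it yields. The score distributions below are hypothetical:

```python
def best_cutting_score(lower_grade_scores, upper_grade_scores):
    """Choose the cutoff (number of correct items) that best separates a
    lower-grade (presumed nonmastery) group from an upper-grade (presumed
    mastery) group, by maximizing correct classifications."""
    candidates = range(0, max(upper_grade_scores) + 2)

    def hits(cut):
        below = sum(s < cut for s in lower_grade_scores)
        at_or_above = sum(s >= cut for s in upper_grade_scores)
        return below + at_or_above

    return max(candidates, key=hits)

# Hypothetical numbers correct on a 10-item test:
lower = [3, 4, 5, 5, 6, 6]   # grade below the one where the skill is taught
upper = [7, 8, 8, 9, 9, 10]  # grade above
print(best_cutting_score(lower, upper))  # 7
```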
All statistical procedures for use with criterion-referenced tests are in an exploratory stage. Much remains to be done, in both theoretical development and empirical tryouts, before the most effective methodology for different testing situations can be formulated.
CHAPTER 6

Validity: Basic Concepts

THE VALIDITY of a test concerns what the test measures and how well it does so. In this connection, we should guard against accepting the test name as an index of what the test measures. Test names provide short, convenient labels for identification purposes. Most test names are far too broad and vague to furnish meaningful clues to the behavior area covered, although increasing efforts are being made to use more specific and operationally definable test names. The trait measured by a given test can be defined only through an examination of the objective sources of information and empirical operations utilized in establishing its validity (Anastasi, 1950). Moreover, the validity of a test cannot be reported in general terms. No test can be said to have "high" or "low" validity in the abstract. Its validity must be determined with reference to the particular use for which the test is being considered.

Fundamentally, all procedures for determining test validity are concerned with the relationships between performance on the test and other independently observable facts about the behavior characteristics under consideration. The specific methods employed for investigating these relationships are numerous and have been described by various names. In the Standards for Educational and Psychological Tests (1974), these procedures are classified under three principal categories: content, criterion-related, and construct validity. Each of these types of validation procedures will be considered in one of the following sections, and the relations among them will be examined in a concluding section. Techniques for analyzing and interpreting validity data with reference to practical decisions will be discussed in Chapter 7.

CONTENT VALIDITY

NATURE. Content validity involves essentially the systematic examination of the test content to determine whether it covers a representative sample of the behavior domain to be measured. Such a validation procedure is commonly used in evaluating achievement tests. This type of test is designed to measure how well the individual has mastered a specific skill or course of study. It might thus appear that mere inspection of the content of the test should suffice to establish its validity for such a purpose. A test of multiplication, spelling, or bookkeeping would seem to be valid by definition if it consists of multiplication, spelling, or bookkeeping items, respectively.

The solution, however, is not so simple as it appears to be.1 One difficulty is that of adequately sampling the item universe. The behavior domain to be tested must be systematically analyzed to make certain that all major aspects are covered by the test items, and in the correct proportions. For example, a test can easily become overloaded with those aspects of the field that lend themselves more readily to the preparation of objective items. The domain under consideration should be fully described in advance, rather than being defined after the test has been prepared. A well-constructed achievement test should cover the objectives of instruction, not just its subject matter. Content must therefore be broadly defined to include major objectives, such as the application of principles and the interpretation of data, as well as factual knowledge. Moreover, content validity depends on the relevance of the individual's test responses to the behavior area under consideration, rather than on the apparent relevance of item content. Mere inspection of the test may fail to reveal the processes actually used by examinees in taking the test.

It is also important to guard against any tendency to overgeneralize regarding the domain sampled by the test. For instance, a multiple-choice spelling test may measure the ability to recognize correctly and incorrectly spelled words. But it cannot be assumed that such a test also measures ability to spell correctly from dictation, frequency of misspellings in written compositions, and other aspects of spelling ability (Ahlstrom, 1964; Knoell & Harris, 1952). Still another difficulty arises from the possible inclusion of irrelevant factors in the test scores. For example, a test designed to measure proficiency in such areas as mathematics or mechanics may be unduly influenced by the ability to understand verbal directions or by speed of performing simple, routine tasks.

1 Further discussions of content validity from several angles can be found in Ebel (1956), Huddleston (1956), and Lennon (1956).

SPECIFIC PROCEDURES. Content validity is built into a test from the outset through the choice of appropriate items. For educational tests, the preparation of items is preceded by a thorough and systematic examination of relevant course syllabi and textbooks, as well as by consultation
with subject-matter experts. On the basis of the information thus gathered, test specifications are drawn up for the item writers. These specifications should show the content areas or topics to be covered, the instructional objectives or processes to be tested, and the relative importance of individual topics and processes. On this basis, the number of items of each kind to be prepared on each topic can be established. A convenient way to set up such specifications is in terms of a two-way table, with processes across the top and topics in the left-hand column (see Table, Ch. 14). Not all cells in such a table, of course, need to have items, since certain processes may be unsuitable or irrelevant for certain topics. It might be added that such a specification table will also prove helpful in the preparation of teacher-made examinations for classroom use in any subject.
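Such a two-way specification table is straightforward to represent and check programmatically. The topics, processes, and item counts below are hypothetical, chosen only to illustrate the structure:

```python
# Hypothetical blueprint for a 40-item arithmetic achievement test:
# rows are topics, columns are instructional processes, cells give
# the number of items to be written for that combination.
blueprint = {
    "Fractions":   {"Knowledge": 4, "Comprehension": 4, "Application": 4},
    "Decimals":    {"Knowledge": 3, "Comprehension": 4, "Application": 3},
    "Percent":     {"Knowledge": 3, "Comprehension": 3, "Application": 4},
    "Measurement": {"Knowledge": 2, "Comprehension": 3, "Application": 3},
}

def total_items(table):
    """Total number of items specified across all topic/process cells."""
    return sum(sum(row.values()) for row in table.values())

def items_per_process(table):
    """Column totals: how many items test each process overall."""
    totals = {}
    for row in table.values():
        for process, n in row.items():
            totals[process] = totals.get(process, 0) + n
    return totals

print(total_items(blueprint))       # 40
print(items_per_process(blueprint))
```

Checking row and column totals against the intended relative importance of topics and processes is exactly the use to which such a table is put during item writing.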
In listing objectives to be covered in an educational achievement test, the test constructor can be guided by the extensive survey of educational objectives given in the Taxonomy of Educational Objectives (Bloom et al., 1956; Krathwohl et al., 1964). Prepared by a group of specialists in educational measurement, this handbook also provides examples of many types of items designed to test each objective. Two volumes are available, covering cognitive and affective domains, respectively. The major categories given in the cognitive domain include knowledge (in the sense of remembered facts, terms, methods, principles, etc.), comprehension, application, analysis, synthesis, and evaluation. The classification of affective objectives, concerned with the modification of attitudes, interests, values, and appreciation, includes five major categories: receiving, responding, valuing, organization, and characterization.
The discussion of content validity in the manual of an achievement test should include information on the content areas and the skills or objectives covered by the test, with some indication of the number of items in each category. In addition, the procedures followed in selecting categories and classifying items should be described. If subject-matter experts participated in the test-construction process, their number and professional qualifications should be stated. If they served as judges in classifying items, the directions they were given should be reported, as well as the extent of agreement among judges. Because curricula and course content change over time, it is particularly desirable to give the dates when subject-matter experts were consulted. Information should likewise be provided about number and nature of course syllabi and textbooks surveyed, including publication dates.
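The extent of agreement among judges can be summarized, at its simplest, as the percentage of items classified identically; a chance-corrected index such as Cohen's kappa is also commonly reported. The item classifications below are hypothetical:

```python
def percent_agreement(judge_a, judge_b):
    """Percentage of items that two judges assign to the same category."""
    matches = sum(a == b for a, b in zip(judge_a, judge_b))
    return 100.0 * matches / len(judge_a)

# Hypothetical classifications of eight items by two judges:
a = ["comprehension", "knowledge", "application", "knowledge",
     "comprehension", "application", "knowledge", "analysis"]
b = ["comprehension", "knowledge", "application", "comprehension",
     "comprehension", "application", "knowledge", "application"]
print(percent_agreement(a, b))  # 6 of 8 items classified alike
```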
A number of empirical procedures may also be followed in order to supplement the content validation of an achievement test. Both total scores and performance on individual items can be checked for grade progress. In general, those items are retained that show the largest gains in the percentages of children passing them from the lower to the upper grades. Figure 14 shows a portion of a table from the manual of the Sequential Tests of Educational Progress, Series II (STEP). For every item in each test in this achievement battery, the information provided includes its classification with regard to learning skill and type of material, as well as the percentage of children in the normative sample who marked the right answer to the item in each of the grades for which that level of the test is designed. The 30 items included in Figure 14 represent one part of the Reading test for Level 3, which covers grades 7 to 9.

[Fig. 14: table listing, for each STEP Reading item, its learning-skill and material-type classification and the percentage of the normative sample passing it in grades 7, 8, and 9.]
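The grade-progress check can be sketched as follows. The percentages passing and the retention threshold are hypothetical, not taken from the STEP manual:

```python
def grade_gain(percent_passing_by_grade):
    """Gain in percentage passing from the lowest to the highest grade
    for one item, e.g. {7: 48, 8: 63, 9: 74} -> 26."""
    grades = sorted(percent_passing_by_grade)
    return (percent_passing_by_grade[grades[-1]]
            - percent_passing_by_grade[grades[0]])

def retain_items(items, min_gain=10):
    """Keep items whose percentage passing rises by at least min_gain
    percentage points across grades (hypothetical threshold)."""
    return [name for name, pcts in items.items()
            if grade_gain(pcts) >= min_gain]

# Hypothetical percentages passing in grades 7, 8, and 9:
items = {
    "item 1": {7: 48, 8: 63, 9: 74},  # clear grade progress
    "item 2": {7: 70, 8: 72, 9: 71},  # nearly flat
    "item 3": {7: 35, 8: 50, 9: 66},  # clear grade progress
}
print(retain_items(items))  # ['item 1', 'item 3']
```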
Other supplementary procedures that may be employed, when appropriate, include analyses of the types of errors commonly made on a test and observation of the work methods employed by examinees. The latter can be done by testing students individually with instructions to "think aloud" while solving each problem. The contribution of speed can be checked by noting how many persons fail to finish the test or by one of the more refined methods discussed in Chapter 5. To detect the possible irrelevant influence of ability to read instructions on test performance, scores on the test can be correlated with scores on a reading comprehension test. On the other hand, if the test is designed to measure reading comprehension, giving the questions without the reading passage on which they are based will show how many could be answered simply from the examinees' prior information or other irrelevant cues.
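The first speed check mentioned, counting examinees who fail to finish, can be sketched as follows; the counts are hypothetical:

```python
def percent_not_finishing(items_attempted_per_examinee, n_items):
    """Rough speed check: percentage of examinees who did not reach the
    last item (trailing omissions counted as unreached)."""
    unfinished = sum(attempted < n_items
                     for attempted in items_attempted_per_examinee)
    return 100.0 * unfinished / len(items_attempted_per_examinee)

# Hypothetical: items attempted by ten examinees on a 50-item test.
attempted = [50, 50, 47, 50, 44, 50, 50, 39, 50, 48]
print(percent_not_finishing(attempted, 50))  # 4 of 10 did not finish
```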
Validity: Basic Concepts 1.39
into the initial stages of constructing any test, eventual validation of apti-
tude or personality tests requires empirical verification by the procedures
to be described in the following sections. These tests bear less intrinsic
resemblance to the behavior domain they are trying to sample than do
achievement tests. Consequently, the content of aptitude and personality
tests can do little more than reveal the hypotheses that led the test con-
structor to choose a certain type of content for measuring a specified
trait. Such hypotheses need to be empirically confirmed to estabiish thevalidity of the test.
Unlike achievement tests, aptitude and personality tests are not based
on a specified course of instruction or uniform set of prior experiences
from which test content can be drawn. Hence, in the latter tests, indi-
viduals are likely to vary more in the work methods or psycholOgical
processes employed in responding to the same test items. The identical
test might thus measure different functions in different persons. Under
these conditions, it would be virtually impossible to determine the psy-
chological functions measured by the tcst from an inspection of its
content. For example, college graduates might solve a problem in verbal
or mathematical terms, while a,mechanic would arrive at the same solu-
tion in terms of spatial visualization. Or a test measuring arithmetic
reasoning among high scho.ol freshmen might measure only individual
differences in speed of computation when given to college" students. A
specific illustration of the dangers of relying on content analysis of apti-
tude tests is provided by a study conducted with a digit-symbol substitu-
tion ~est"(Burik, 1950). This test, generally regarded as a typical "code-
learmng test, was found to measure chiefly motor speed in a group ofhigh school students.
APPLICATIONS. Especially when bolstered by such empirical checks as
those illustrated above, content validity provides an adequate technique for evaluating achievement tests. It permits us to answer two questions that are basic to the validity of an achievement test: (1) Does the test cover a representative sample of the specified skills and knowledge? (2) Is test performance reasonably free from the influence of irrelevant variables?
Content validity is particularly appropriate for the criterion-referenced tests described in Chapter 4. Because performance on these tests is interpreted in terms of content meaning, it is obvious that content validity is a prime requirement for their effective use. Content validation is also applicable to certain occupational tests designed for employee selection and classification, to be discussed in Chapter 15. This type of validation is suitable when the test is an actual job sample or otherwise calls for the same skills and knowledge required on the job. In such cases, a thorough job analysis should be carried out in order to demonstrate the close resemblance between the job activities and the test.
For aptitude and personality tests, on the other hand, content validity
is usually inappropriate and may, in fact, be misleading. Although con-
siderations of relevance and effectiveness of content must obviously enter
FACE VALIDITY. Content validity should not be confused with face validity. The latter is not validity in the technical sense; it refers, not to what the test actually measures, but to what it appears superficially to measure. Face validity pertains to whether the test "looks valid" to the examinees who take it, the administrative personnel who decide on its use, and other technically untrained observers. Fundamentally, the question of face validity concerns rapport and public relations. Although common usage of the term validity in this connection may make for confusion, face validity itself is a desirable feature of tests. For example, when tests originally designed for children and developed within a classroom setting were first extended for adult use, they frequently met with resistance and criticism because of their lack of face validity. Certainly if test content appears irrelevant, inappropriate, silly, or childish, the result will be poor cooperation, regardless of the actual validity of the
Validity: Basic Concepts 141
sonnel to occupational training programs represent examples of the sort
of decisions requiring a knowledge of the predictive validity of tests.
Other examples include the use of tests to screen out applicants likely
to develop emotional disorders in stressful environments and the use of
tests to identify psychiatric patients most likely to benefit from a particular therapy.
In a number of instances, concurrent validity is found merely as a substitute for predictive validity. It is frequently impracticable to extend validation procedures over the time required for predictive validity or to obtain a suitable preselection sample for testing purposes. As a compromise solution, therefore, tests are administered to a group on whom criterion data are already available. Thus, the test scores of college students may be compared with their cumulative grade-point average at the time of testing, or those of employees compared with their current job success.
For certain uses of psychological tests, on the other hand, concurrent validity is the most appropriate type and can be justified in its own right. The logical distinction between predictive and concurrent validity is based, not on time, but on the objectives of testing. Concurrent validity is relevant to tests employed for diagnosis of existing status, rather than prediction of future outcomes. The difference can be illustrated by asking: "Is Smith neurotic?" (concurrent validity) and "Is Smith likely to become neurotic?" (predictive validity).
Because the criterion for concurrent validity is always available at the time of testing, we might ask what function is served by the test in such situations. Basically, such tests provide a simpler, quicker, or less expensive substitute for the criterion data. For example, if the criterion consists of continuous observation of a patient during a two-week hospitalization period, a test that could sort out normals from neurotic and doubtful cases would appreciably reduce the number of persons requiring such extensive observation.
140 Principles of Psychological Testing
test. Especially in adult testing, it is not sufficient for a test to be objectively valid. It also needs face validity to function effectively in practical situations.

Face validity can often be improved by merely reformulating test items in terms that appear relevant and plausible in the particular setting in which they will be used. For example, if a test of simple arithmetic reasoning is constructed for use with machinists, the items should be worded in terms of machine operations rather than in terms of "how many oranges can be purchased for 36 cents" or other traditional schoolbook problems. Similarly, an arithmetic test for naval personnel can be expressed in naval terminology, without necessarily altering the functions measured. To be sure, face validity should never be regarded as a substitute for objectively determined validity. It cannot be assumed that improving the face validity of a test will improve its objective validity. Nor can it be assumed that when a test is modified so as to increase its face validity, its objective validity remains unaltered. The validity of the test in its final form will always need to be directly checked.
Criterion-related validity indicates the effectiveness of a test in predicting an individual's behavior in specified situations. For this purpose, performance on the test is checked against a criterion, i.e., a direct and independent measure of that which the test is designed to predict. Thus, for a mechanical aptitude test, the criterion might be subsequent job performance as a machinist; for a scholastic aptitude test, it might be college grades; and for a neuroticism test, it might be associates' ratings or other available information on the subjects' behavior in various life situations.
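The check of test performance against a criterion is conventionally summarized by a correlation coefficient. The following is a minimal sketch, not drawn from the text: the function name and the aptitude-score and job-rating data are hypothetical, and a validity coefficient is computed as the Pearson correlation between test scores and criterion measures.

```python
from math import sqrt

def validity_coefficient(test_scores, criterion):
    """Pearson correlation between test scores and an independent criterion."""
    n = len(test_scores)
    mean_x = sum(test_scores) / n
    mean_y = sum(criterion) / n
    cov = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(test_scores, criterion))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in test_scores))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in criterion))
    return cov / (sd_x * sd_y)

# Hypothetical example: mechanical aptitude scores checked against
# later job-performance ratings as machinists.
scores = [52, 61, 45, 70, 58, 66]
job_ratings = [3.1, 3.6, 2.8, 4.2, 3.3, 3.9]
r = validity_coefficient(scores, job_ratings)
```

A coefficient near zero would indicate that the test tells little about the criterion; the closer it approaches 1.00, the more effectively the test predicts.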
CONCURRENT AND PREDICTIVE VALIDITY. The criterion measure against which test scores are validated may be obtained at approximately the same time as the test scores or after a stated interval. The APA test Standards (1974) differentiate between concurrent and predictive validity on the basis of these time relations between criterion and test. The term "prediction" can be used in the broader sense, to refer to prediction from the test to any criterion situation, or in the more limited sense of prediction over a time interval. It is in the latter sense that it is used in the expression "predictive validity." The information provided by predictive validity is most relevant to tests used in the selection and classification of personnel. Hiring job applicants, selecting students for admission to college or professional schools, and assigning military per-
CRITERION CONTAMINATION. An essential precaution in finding the validity of a test is to make certain that the test scores do not themselves influence any individual's criterion status. For example, if a college instructor or a foreman in an industrial plant knows that a particular individual scored very poorly on an aptitude test, such knowledge might influence the grade given to the student or the rating assigned to the worker. Or a high-scoring person might be given the benefit of the doubt when academic grades or on-the-job ratings are being prepared. Such influences would obviously raise the correlation between test scores and criterion in a manner that is entirely spurious or artificial.
This possible source of error in test validation is known as criterion contamination, since the criterion ratings become "contaminated" by the knowledge of the test scores. To prevent the operation of such an error, it is absolutely essential that no person who participates in the assignment of criterion ratings have any knowledge of the examinees' test scores. For this reason, test scores employed in "testing the test" must be kept strictly confidential. It is sometimes difficult to convince teachers, employers, military officers, and other line personnel that such a precaution is essential. In their urgency to utilize all available information for practical decisions, such persons may fail to realize that the test scores must be put aside until the criterion data mature and validity can be determined.
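The spurious inflation produced by criterion contamination can be shown with a small numerical sketch. All scores below, and the 0.02 "nudge" factor, are hypothetical: when each rater adjusts a rating toward a test score that rater has seen, the obtained coefficient rises even though the examinees' actual criterion behavior is unchanged.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

test_scores = [50, 62, 44, 71, 57, 65, 48, 60]
# Ratings assigned by judges who never saw the test scores.
blind_ratings = [3.0, 3.2, 3.4, 4.1, 2.9, 3.3, 3.5, 3.6]
# Contaminated ratings: each judge nudges the rating toward the known score.
contaminated = [r + 0.02 * (s - 57) for r, s in zip(blind_ratings, test_scores)]

r_blind = pearson(test_scores, blind_ratings)
r_contaminated = pearson(test_scores, contaminated)
```

The contaminated coefficient exceeds the blind one, which is precisely the spurious rise in correlation the text warns against.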
selected group than elementary school graduates, the relation between amount of education and scholastic aptitude is far from perfect. Especially at the higher educational levels, economic, social, motivational, and other nonintellectual factors may influence the continuation of the individual's education. Moreover, with such concurrent validation it is difficult to disentangle cause-and-effect relations. To what extent are the obtained differences in intelligence test scores simply the result of the varying amount of education? And to what extent could the test have predicted individual differences in subsequent educational progress? These questions can be answered only when the test is administered before the criterion data have matured, as in predictive validation.
In the development of special aptitude tests, a frequent type of criterion is based on performance in specialized training. For example, mechanical aptitude tests may be validated against final achievement in shop courses. Various business school courses, such as stenography, typing, or bookkeeping, provide criteria for aptitude tests in these areas. Similarly, performance in music or art schools has been employed in validating music or art aptitude tests. Several professional aptitude tests have been validated in terms of achievement in schools of law, medicine, dentistry, engineering, and other areas. In the case of custom-made tests designed for use within a specific testing program, training records are a frequent source of criterion data. An outstanding illustration is the validation of Air Force pilot selection tests against performance in basic flight training. Performance in training programs is also commonly used as a criterion for test validation in other military occupational specialties and in some industrial validation studies.
Among the specific indices of training performance employed for criterion purposes may be mentioned achievement tests administered on completion of training, formally assigned grades, instructors' ratings, and successful completion of training versus elimination from the program. Multiple aptitude batteries have often been checked against grades in specific high school or college courses, in order to determine their validity as differential predictors. For example, scores on a verbal comprehension test may be compared with grades in English courses, spatial visualization scores with geometry grades, and so forth.
In connection with the use of training records in general as criterion measures, a useful distinction is that between intermediate and ultimate criteria. In the development of an Air Force pilot-selection test or a medical aptitude test, for example, the ultimate criteria would be combat performance and eventual achievement as a practicing physician, respectively. Obviously it would require a long time for such criterion data to mature. It is doubtful, moreover, whether a truly ultimate criterion is ever obtained in actual practice. Finally, even were such an ultimate criterion available, it would probably be subject to many uncontrolled
COMMON CRITERIA. Any test may be validated against as many criteria as there are specific uses for it. Any method for assessing behavior in any situation could provide a criterion measure for some particular purpose. The criteria employed in finding the validities reported in test manuals, however, fall into a few common categories. Among the criteria most frequently employed in validating intelligence tests is some index of academic achievement. It is for this reason that such tests have often been more precisely described as measures of scholastic aptitude. The specific indices used as criterion measures include school grades, achievement test scores, promotion and graduation records, special honors and awards, and teachers' or instructors' ratings for "intelligence." Insofar as such ratings given within an academic setting are likely to be heavily influenced by the individual's scholastic performance, they may be properly classed with the criterion of academic achievement.

The various indices of academic achievement have provided criterion data at all educational levels, from the primary grades to college and graduate school. Although employed principally in the validation of general intelligence tests, they have also served as criteria for certain multiple-aptitude and personality tests. In the validation of any of these types of tests for use in the selection of college students, for example, a common criterion is freshman grade-point average. This measure is the average grade in all courses taken during the freshman year, each grade being weighted by the number of course points for which it was received.
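The weighting just described amounts to a credit-weighted mean. A minimal sketch follows; the course grades and point values are hypothetical, assuming a conventional 4-point grade scale.

```python
def grade_point_average(courses):
    """courses: list of (grade_value, course_points) pairs.
    Each grade is weighted by the number of course points it carries."""
    weighted_total = sum(grade * points for grade, points in courses)
    total_points = sum(points for _, points in courses)
    return weighted_total / total_points

# Hypothetical freshman year: A (3 points), B (4), C (3), B (2).
freshman_year = [(4.0, 3), (3.0, 4), (2.0, 3), (3.0, 2)]
gpa = grade_point_average(freshman_year)
```

Note that the B in the four-point course pulls the average more than the B in the two-point course, which is exactly the weighting the criterion requires.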
A variant of the criterion of academic achievement frequently employed with out-of-school adults is the amount of education the individual completed. It is expected that in general the more intelligent individuals continue their education longer, while the less intelligent drop out of school earlier. The assumption underlying this criterion is that the educational ladder serves as a progressively selective influence, eliminating those incapable of continuing beyond each step. Although it is undoubtedly true that college graduates, for example, represent a more highly
factors that would render it relatively useless. For example, it would be difficult to evaluate the relative degree of success of physicians practicing different specialties and in different parts of the country. For these reasons, such intermediate criteria as performance records at some stage of training are frequently employed as criterion measures.

For many purposes, the most satisfactory type of criterion measure is that based on follow-up records of actual job performance. This criterion has been used to some extent in the validation of general intelligence as well as personality tests, and to a larger extent in the validation of special aptitude tests. It is a common criterion in the validation of custom-made tests for specific jobs. The "jobs" in question may vary widely in both level and kind, including work in business, industry, the professions, and the armed services. Most measures of job performance, although probably not representing ultimate criteria, at least provide good intermediate criteria for many testing purposes. In this respect they are to be preferred to training records. On the other hand, the measurement of job performance does not permit as much uniformity of conditions as is possible during training. Moreover, since it usually involves a longer follow-up, the criterion of job performance is likely to entail a loss in the number of available subjects. Because of the variation in the nature of nominally similar jobs in different organizations, test manuals reporting validity data against job criteria should describe not only the specific criterion measures employed but also the job duties performed by the workers.

Validation by the method of contrasted groups generally involves a
composite criterion that reflects the cumulative and uncontrolled selective influences of everyday life. This criterion is ultimately based on survival within a particular group versus elimination therefrom. For example, in the validation of an intelligence test, the scores obtained by institutionalized mentally retarded children may be compared with those obtained by schoolchildren of the same age. In this case, the multiplicity of factors determining commitment to an institution for the mentally retarded constitutes the criterion. Similarly, the validity of a musical aptitude or a mechanical aptitude test may be checked by comparing the scores obtained by students enrolled in a music school or an engineering school, respectively, with the scores of unselected high school or college students.
To be sure, contrasted groups can be selected on the basis of any criterion, such as school grades, ratings, or job performance, by simply choosing the extremes of the distribution of criterion measures. The contrasted groups included in the present category, however, are distinct groups that have gradually become differentiated through the operation of the multiple demands of daily living. The criterion under consideration is thus more complex and less clearly definable than those previously discussed.
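For the music-school example above, the question under contrasted-group validation is simply whether the score distributions of the two naturally formed groups pull apart. A minimal sketch follows; all scores are hypothetical.

```python
def mean(xs):
    return sum(xs) / len(xs)

# Hypothetical aptitude-test scores for two naturally contrasted groups.
music_students = [78, 85, 81, 90, 74, 88]   # enrolled in a music school
unselected = [55, 62, 70, 58, 66, 60]       # unselected students, same age range

# Difference between group means.
gap = mean(music_students) - mean(unselected)
# How many unselected students reach even the lowest music-school score?
n_overlap = sum(1 for score in unselected if score >= min(music_students))
```

A large mean gap with little overlap supports the test's validity against this composite, everyday-life criterion; heavy overlap between the distributions would argue against it.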
The method of contrasted groups is used quite commonly in the validation of personality tests. Thus, in validating a test of social traits, the test performance of salesmen or executives, on the one hand, may be compared with that of clerks or engineers, on the other. The assumption underlying such a procedure is that, with reference to many social traits, individuals who have entered and remained in such occupations as selling or executive work will as a group excel persons in such fields as clerical work or engineering. Similarly, college students who have engaged in many extracurricular activities may be compared with those who have participated in none during a comparable period of college attendance. Occupational groups have frequently been used in the development and validation of interest tests, such as the Strong Vocational Interest Blank, as well as in the preparation of attitude scales. Other groups sometimes employed in the validation of attitude scales include political, religious, geographical, or other special groups generally known to represent distinctly different points of view on certain issues.
In the development of certain personality tests, psychiatric diagnosis is used both as a basis for the selection of items and as evidence of test validity. Psychiatric diagnosis may serve as a satisfactory criterion provided that it is based on prolonged observation and detailed case history, rather than on a cursory psychiatric interview or examination. In the latter case, there is no reason to expect the psychiatric diagnosis to be superior to the test score itself as an indication of the individual's emotional condition. Such a psychiatric diagnosis could not be regarded as a criterion measure, but rather as an indicator or predictor whose own validity would have to be determined.
Mention has already been made, in connection with other criterion categories, of certain types of ratings by school teachers, instructors in specialized courses, and job supervisors. To these can be added ratings by officers in military situations, ratings of students by school counselors, and ratings by co-workers, classmates, fellow club-members, and other groups of associates. The ratings discussed earlier represented merely a subsidiary technique for obtaining information regarding such criteria as academic achievement, performance in specialized training, or job success. We are now considering the use of ratings as the very core of the criterion measure. Under these circumstances, the ratings themselves define the criterion. Moreover, such ratings are not restricted to the evaluation of specific achievement, but involve a personal judgment by an observer regarding any of the variety of traits that psychological tests attempt to measure. Thus, the subjects in the validation sample might be rated on such characteristics as dominance, mechanical ingenuity, originality, leadership, or honesty.
Ratings have been employed in the validation of almost every type of test. They are particularly useful in providing criteria for personality
tests, since objective criteria are much more difficult to find in this area. This is especially true of distinctly social traits, in which ratings based on personal contact may constitute the most logically defensible criterion. Although ratings may be subject to many judgmental errors, when obtained under carefully controlled conditions they represent a valuable source of criterion data. Techniques for improving the accuracy of ratings and for reducing common types of errors will be considered in Chapter 20.
Finally, correlations between a new test and previously available tests are frequently cited as evidence of validity. When the new test is an abbreviated or simplified form of a currently available test, the latter can properly be regarded as a criterion measure. Thus, a paper-and-pencil test might be validated against a more elaborate and time-consuming performance test whose validity had previously been established. Or a group test might be validated against an individual test. The Stanford-Binet, for example, has repeatedly served as a criterion in validating group tests. In such a case, the new test may be regarded at best as a crude approximation of the earlier one. It should be noted that unless the new test represents a simpler or shorter substitute for the earlier test, the use of the latter as a criterion is indefensible.
SPECIFICITY OF CRITERIA. Criterion-related validity is most appropriate for local validation studies, in which the effectiveness of a test for a specific program is to be assessed. This is the approach followed, for example, when a given company wishes to evaluate a test for selecting applicants for one of its jobs or when a given college wishes to determine how well an academic aptitude test can predict the course performance of its students. Criterion-related validity can be best characterized as the practical validity of a test in a specified situation. This type of validation represents applied research, as distinguished from basic research, and as such it provides results that are less generalizable than the results of other procedures.
That criterion-related validity may be quite specific has been demonstrated repeatedly. Figure 15 gives examples of the wide variation in the correlations of a single type of test with criteria of job proficiency. The first graph shows the distribution of 72 correlations found between intelligence test scores and measures of the job proficiency of general clerks; the second graph summarizes in similar fashion 191 correlations between finger dexterity tests and the job proficiency of benchworkers. Although in both instances the correlations tend to cluster in a particular range of validity, the variation among individual studies is considerable. The validity coefficient may be high and positive in one study and negligible or even substantially negative in another.
Validity: Basic Concepts 147
Similar variation with regard to the prediction of course grades is illustrated in Figure 16. This figure shows the distribution of correlations obtained between grades in mathematics and scores on each of the subtests of the Differential Aptitude Tests. Thus, for the Numerical Ability test (NA), the largest number of validity coefficients among boys fell between .50 and .59; but the correlations obtained in different mathematics courses and in different schools ranged from .22 to .75. Equally wide differences were found with the other subtests and, it might be added, with grades in other subjects not included in Figure 16.
[Figure 15 shows two frequency distributions plotted over the range -1.00 to +1.00: 72 validity coefficients for general clerks on intelligence tests, and 191 coefficients for benchworkers on finger dexterity tests, each against job-proficiency criteria.]

FIG. 15. Examples of Variation in Validity Coefficients of Given Tests for Particular Jobs. (Adapted from Ghiselli, 1966, p. 29.)
Some of the variation in validity coefficients against job criteria reported in Figure 15 results from differences among the specific tests employed in different studies to measure intelligence or finger dexterity. In the results of both Figures 15 and 16, moreover, some variation is attributable to differences in the homogeneity and level of the groups tested. The range of validity coefficients found, however, is far wider than could be explained in these terms. Differences in the criteria themselves are undoubtedly a major reason for the variation observed among validity coefficients. Thus, the duties of office clerks or benchworkers may differ widely among companies or among departments in the same company. Similarly, courses in the same subject may differ in content, teaching method, instructor characteristics, bases for evaluating student achievement, and numerous other ways. Consequently, what appears to be the same criterion may represent very different combinations of traits in different situations.

Criteria may also vary over time in the same situation. For example, the validity coefficient of a test against job training criteria often differs from its validity against job performance criteria (Ghiselli, 1966). There is evidence that the traits required for successful performance of a given job, or even a single task, vary with the job experience of the individual (Fleishman & Fruchter, 1960). There is also evidence that job criteria themselves change over time, for such reasons as the changing nature of jobs, organizational shifts, individual advancement in rank, and other temporal conditions (MacKinney, 1967; Prien, 1966). It is well known, of course, that educational curricula and course content change over time. In other words, the criteria most commonly used in validating intelligence and aptitude tests, namely job performance and educational achievement, are dynamic rather than static. It follows that criterion-related validity is itself subject to temporal changes.
SYNTHETIC VALIDITY. Criteria not only differ across situations and over time, but they are also likely to be complex (see, e.g., Richards, Taylor, Price, & Jacobsen, 1965). Success on a job, in school, or in other activities of daily life depends not on one trait but on many traits. Hence, practical criteria are likely to be multifaceted. Several different indicators or measures of job proficiency or academic achievement could thus be used in validating a test. Since these measures may tap different traits or combinations of traits, it is not surprising to find that they yield different validity coefficients for any given test. When different criterion measures are obtained for the same individuals, their intercorrelations are often quite low. For instance, accident records or absenteeism may show virtually no relation to productivity or error data for the same job (Seashore, Indik, & Georgopoulos, 1960). These differences, of course, are reflected in the validity coefficients of any given test against different criterion measures. Thus, a test may fail to correlate significantly with supervisors' ratings of job proficiency and yet show appreciable validity in predicting who will resign and who will be promoted at a later date (Albright, Smith, & Glennon, 1959).
Because of criterion complexity, validating a test against a composite criterion of job proficiency, academic achievement, or other similar accomplishments may be of questionable value and is certainly of limited generality. If different subcriteria are relatively independent, a more effective procedure is to validate each test against that aspect of the criterion it is best designed to measure. An analysis of these more specific relationships lends meaning to the test scores in terms of the multiple dimensions of criterion behavior (Dunnette, 1963; Ebel, 1961; S. R. Wallace, 1965). For example, one test might prove to be a valid predictor of a clerk's perceptual speed and accuracy in handling detail work, another of his ability to spell correctly, and still another of his ability to resist distraction.
If, now, we return to the practical question of evaluating a test or combination of tests for effectiveness in predicting a complex criterion such as success on a given job, we are faced with the necessity of conducting a separate validation study in each local situation and repeating it at frequent intervals. This is admittedly a desirable procedure and one that is often recommended in test manuals. In many situations, however, it is not feasible to follow this procedure because of well-nigh insurmountable practical obstacles. Even if adequately trained personnel are available to carry out the necessary research, most criterion-related validity studies conducted in industry are likely to prove unsatisfactory for
FIG. 16. Graphic Summary of Validity Coefficients of the Differential Aptitude Tests (Forms S and T) for Course Grades in Mathematics. The accompanying numbers in each column indicate the number of coefficients in the range given at the left. (From Fifth Edition Manual, p. 82. Reproduced by permission. Copyright © 1975 by The Psychological Corporation, New York, N.Y. All rights reserved.)
at least three reasons. First, it is difficult to obtain dependable and sufficiently comprehensive criterion data. Second, the number of employees engaged in the same or closely similar jobs within a company is often too small for significant statistical results. Third, correlations will very probably be lowered by restriction of range through preselection, since only those persons actually hired can be followed up on the job.
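The effect of the third obstacle, restriction of range through preselection, can be illustrated with a small simulation. The data below are synthetic, and the selection rule (hiring the top half on the test) is an arbitrary choice for illustration:

```python
# Simulate how preselection lowers an observed validity coefficient.
# Test and criterion scores are generated with a true correlation of
# about .60; we then correlate them only within the "hired" group
# (top half on the test), as a follow-up study must.
import random
random.seed(42)

def corr(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

test = [random.gauss(0, 1) for _ in range(5000)]
# criterion = .6 * test + noise, so the full-range correlation is near .60
criterion = [0.6 * t + 0.8 * random.gauss(0, 1) for t in test]

cutoff = sorted(test)[len(test) // 2]          # hire the top half on the test
hired = [(t, c) for t, c in zip(test, criterion) if t >= cutoff]

r_full = corr(test, criterion)
r_hired = corr([t for t, _ in hired], [c for _, c in hired])
print(r_full > r_hired)   # restriction of range lowers the correlation
```

The hired group is more homogeneous on the test than the full applicant group, so the correlation computed within it understates the test's validity for the whole applicant pool.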
For all the reasons discussed above, personnel psychologists have shown increasing interest in a technique known as synthetic validity. First introduced by Lawshe (1952), the concept of synthetic validity has been defined by Balma (1959, p. 395) as "the inferring of validity in a specific situation from a systematic analysis of job elements, a determination of test validity for these elements, and a combination of elemental validities into a whole." Several procedures have been developed for gathering the needed empirical data and for combining these data to obtain an estimate of synthetic validity for a particular complex criterion (see, e.g., Guion, 1965; Lawshe & Balma, 1966, Ch. 14; McCormick, 1959; Primoff, 1959, 1975). Essentially, the process involves three steps: (1) detailed job analysis to identify the job elements and their relative weights; (2) analysis and empirical study of each test to determine the extent to which it measures proficiency in performing each of these job elements; and (3) finding the validity of each test for the given job synthetically from the weights of these elements in the job and in the test.

In a long-term research program conducted with U.S. Civil Service job
applicants, Primoff (1975) has developed the J-coefficient (for "job-coefficient") as an index of synthetic validity. Among the special features of this procedure are the listing of job elements expressed in terms of worker behavior and the rating of the relative importance of these elements in each job by supervisors and job incumbents. Correlations between test scores and self-ratings on job elements are found in total applicant samples (not subject to the preselection of employed workers). Various checking procedures are followed to ensure stability of correlations and weights derived from self-ratings, as well as adequacy of criterion coverage. For these purposes, data are obtained from different samples of applicant populations. The final estimate of the correlation between test and job performance is found from the correlation of each job element with the particular job and the weight of the same element in the given test.¹ There is evidence that the J-coefficient has proved
¹ The statistical procedures are essentially an adaptation of multiple regression equations, to be discussed in Chapter 7. For each job element, its correlation with the job is multiplied by its weight in the test, and these products are added across all appropriate job elements.
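The computation described in this footnote can be sketched numerically. The job-element names, their correlations with the job, and their weights in the test below are all invented for illustration; Primoff's full procedure involves additional checks not shown here:

```python
# Sketch of the J-coefficient computation: for each job element,
# multiply its correlation with the job by its weight in the test,
# then sum across elements. All values are hypothetical.

job_elements = {
    # element: (correlation of element with job, weight of element in test)
    "checking records":    (0.45, 0.50),
    "sorting mail":        (0.60, 0.30),
    "dealing with public": (0.30, 0.20),
}

j_coefficient = sum(r_job * w_test for r_job, w_test in job_elements.values())
print(round(j_coefficient, 3))   # 0.465 for these hypothetical values
```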
helpful in improving the employment opportunities of minority applicants and persons with little formal education, because of its concentration on job-relevant skills (Primoff, 1975).
A different application of synthetic validity, especially suitable for use in a small company with few employees in each type of job, is described by Guion (1965). The study was carried out in a company having 48 employees, each of whom was doing a job that was appreciably different from the jobs of the other employees. Detailed job analyses nevertheless revealed seven job elements common to many jobs. Each employee was rated on the job elements appropriate to his job; and these ratings were then checked against the employees' scores on each test in a trial battery. On the basis of these analyses, a separate battery could be "synthesized" for each job by combining the two best tests for each of the job elements demanded by that job. When the batteries thus assembled were applied to a subsequently hired group of 13 employees, the results showed considerable promise. Because of the small number of cases, these results are only suggestive. The study was conducted primarily to demonstrate a model for the utilization of synthetic validity.
The two examples of synthetic validity were cited only to illustrate the scope of possible applications of these techniques. For a description of the actual procedures followed, the reader is referred to the original sources. In summary, the concept of synthetic validity can be implemented in different ways to fit the practical exigencies of different situations. It offers a promising approach to the problem of complex and changing criteria; and it permits the assembling of test batteries to fit the requirements of specific jobs and the determination of test validity in many contexts where adequate criterion-related validation studies are impracticable.
The construct validity of a test is the extent to which the test may be said to measure a theoretical construct or trait. Examples of such constructs are intelligence, mechanical comprehension, verbal fluency, speed of walking, neuroticism, and anxiety. Focusing on a broader, more enduring, and more abstract kind of behavioral description than the previously discussed types of validity, construct validation requires the gradual accumulation of information from a variety of sources. Any data throwing light on the nature of the trait under consideration and the conditions affecting its development and manifestations are grist for this validity mill. Illustrations of specific techniques suitable for construct validation will be considered below.
DEVELOPMENTAL CHANGES. A major criterion employed in the validation
of a number of intelligence tests is age differentiation. Such tests as the Stanford-Binet and most preschool tests are checked against chronological age to determine whether the scores show a progressive increase with advancing age. Since abilities are expected to increase with age during childhood, it is argued that the test scores should likewise show such an increase, if the test is valid. The very concept of an age scale of intelligence, as initiated by Binet, is based on the assumption that "intelligence" increases with age, at least until maturity.

The criterion of age differentiation, of course, is inapplicable to any functions that do not exhibit clear-cut and consistent age changes. In the area of personality measurement, for example, it has found limited use. Moreover, it should be noted that, even when applicable, age differentiation is a necessary but not a sufficient condition for validity. Thus, if the test scores fail to improve with age, such a finding probably indicates that the test is not a valid measure of the abilities it was designed to sample. On the other hand, to prove that a test measures something that increases with age does not define the area covered by the test very precisely. A measure of height or weight would also show regular age increments, although it would obviously not be designated as an intelligence test.

A final point should be emphasized regarding the interpretation of the age criterion. A psychological test validated against such a criterion measures behavior characteristics that increase with age under the conditions existing in the type of environment in which the test was standardized. Because different cultures may stimulate and foster the development of dissimilar behavior characteristics, it cannot be assumed that the criterion of age differentiation is a universal one. Like all other criteria, it is circumscribed by the particular cultural setting in which it is derived.

Developmental analyses are also basic to the construct validation of
the Piagetian ordinal scales cited in Chapter 4. A fundamental assumption of such scales is the sequential patterning of development, such that the attainment of earlier stages in concept development is prerequisite to the acquisition of later conceptual skills. There is thus an intrinsic hierarchy in the content of these scales. The construct validation of ordinal scales should therefore include empirical data on the sequential invariance of the successive steps. This involves checking the performance of children at different levels in the development of any tested concept, such as conservation or object permanence. Do children who demonstrate mastery of the concept at a given level also exhibit mastery at the lower levels? Insofar as criterion-referenced tests are also frequently designed according to a hierarchical pattern of learned skills, they, too, can utilize empirical evidence of hierarchical invariance in their validation.
CORRELATIONS WITH OTHER TESTS. Correlations between a new test and similar earlier tests are sometimes cited as evidence that the new test measures approximately the same general area of behavior as other tests designated by the same name, such as "intelligence tests" or "mechanical aptitude tests." Unlike the correlations found in criterion-related validity, these correlations should be moderately high, but not too high. If the new test correlates too highly with an already available test, without such added advantages as brevity or ease of administration, then the new test represents needless duplication.

Correlations with other tests are employed in still another way to demonstrate that the new test is relatively free from the influence of certain irrelevant factors. For example, a special aptitude test or a personality test should have a negligible correlation with tests of general intelligence or scholastic aptitude. Similarly, reading comprehension should not appreciably affect performance on such tests. Thus, correlations with tests of general intelligence, reading, or verbal comprehension are sometimes reported as indirect or negative evidence of validity. In these cases, high correlations would make the test suspect. Low correlations, however, would not in themselves insure validity. It will be noted that this use of correlations with other tests is similar to one of the supplementary techniques described under content validity.
FACTOR ANALYSIS. Of particular relevance to construct validity is factor analysis, a statistical procedure for the identification of psychological traits. Essentially, factor analysis is a refined technique for analyzing the interrelationships of behavior data. For example, if 20 tests have been given to 300 persons, the first step is to compute the correlations of each test with every other. An inspection of the resulting table of 190 correlations may itself reveal certain clusters among the tests, suggesting the location of common traits. Thus, if such tests as vocabulary, analogies, opposites, and sentence completion have high correlations with each other and low correlations with all other tests, we could tentatively infer the presence of a verbal comprehension factor. Because such an inspectional analysis of a correlation table is difficult and uncertain, however, more precise statistical techniques have been developed to locate the common factors required to account for the obtained correlations. These techniques of factor analysis will be examined further in Chapter 13, together with multiple aptitude tests developed by means of factor analysis.
In the process of factor analysis, the number of variables or categories in terms of which each individual's performance can be described is reduced from the number of original tests to a relatively small number of factors, or common traits. In the example cited above, five or six factors might suffice to account for the intercorrelations among the 20 tests. Each individual might thus be described in terms of his scores in the five or six factors, rather than in terms of the original 20 scores. A major purpose of factor analysis is to simplify the description of behavior by reducing the number of categories from an initial multiplicity of test variables to a few common traits, or factors.

After the factors have been identified, they can be utilized in describing the factorial composition of a test. Each test can thus be characterized in terms of the major factors determining its scores, together with the weight or loading of each factor and the correlation of the test with each factor. Such a correlation is known as the factorial validity of the test. Thus, if the verbal comprehension factor has a weight of .66 in a vocabulary test, the factorial validity of this vocabulary test as a measure of the trait of verbal comprehension is .66. It should be noted that factorial validity is essentially the correlation of the test with whatever is common to a group of tests or other indices of behavior. The set of variables analyzed can, of course, include both test and nontest data. Ratings and other criterion measures can thus be utilized, along with other tests, to explore the factorial validity of a particular test and to define the common traits it measures.
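As a rough numerical sketch of factorial validity (not a full factor-analysis routine), the loadings of several tests on a single common factor can be approximated from the first principal component of their correlation matrix. The simulated "tests" below and their true loadings are invented for illustration:

```python
# Sketch: estimate loadings on one common factor from the correlation
# matrix of a few tests. Simulated data: three "verbal" tests share a
# common factor, one "speed" test does not.
import numpy as np
rng = np.random.default_rng(0)

n = 2000
factor = rng.normal(size=n)                    # common verbal factor
scores = np.column_stack([
    0.8 * factor + 0.6 * rng.normal(size=n),   # vocabulary
    0.7 * factor + 0.7 * rng.normal(size=n),   # analogies
    0.6 * factor + 0.8 * rng.normal(size=n),   # sentence completion
    rng.normal(size=n),                        # perceptual speed (unrelated)
])

R = np.corrcoef(scores, rowvar=False)          # 4 x 4 correlation matrix
eigvals, eigvecs = np.linalg.eigh(R)           # eigh returns ascending order
loadings = eigvecs[:, -1] * np.sqrt(eigvals[-1])  # largest component
loadings *= np.sign(loadings.sum())            # fix the arbitrary sign

# The three verbal tests load substantially on the common factor;
# the speed test does not.
print(np.round(loadings, 2))
```

Each loading plays the role of the factorial validity discussed above: the correlation of that test with the common factor.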
Another application of the criterion of internal consistency involves the correlation of subtest scores with total score. Many intelligence tests, for instance, consist of separately administered subtests (such as vocabulary, arithmetic, picture completion, etc.) whose scores are combined in finding the total test score. In the construction of such tests, the scores on each subtest are often correlated with total score, and any subtest whose correlation with total score is too low is eliminated. The correlations of the remaining subtests with total score are then reported as evidence of the internal consistency of the entire instrument.

It is apparent that internal consistency correlations, whether based on items or subtests, are essentially measures of homogeneity. Because it helps to characterize the behavior domain or trait sampled by the test, the degree of homogeneity of a test has some relevance to its construct validity. Nevertheless, the contribution of internal consistency data to test validation is very limited. In the absence of data external to the test itself, little can be learned about what a test measures.
INTERNAL CONSISTENCY. In the published descriptions of certain tests, especially in the area of personality, the statement is made that the test has been validated by the method of internal consistency. The essential characteristic of this method is that the criterion is none other than the total score on the test itself. Sometimes an adaptation of the contrasted group method is used, extreme groups being selected on the basis of the total test score. The performance of the upper criterion group on each test item is then compared with that of the lower criterion group. Items that fail to show a significantly greater proportion of "passes" in the upper than in the lower criterion group are considered invalid, and are either eliminated or revised. Correlational procedures may also be employed for this purpose. For example, the biserial correlation between "pass-fail" on each item and total test score can be computed. Only those items yielding significant item-test correlations would be retained. A test whose items were selected by this method can be said to show internal consistency, since each item differentiates in the same direction as the entire test.
EFFECT OF EXPERIMENTAL VARIABLES ON TEST SCORES. A further source of data for construct validation is provided by experiments on the effect of selected variables on test scores. In checking the validity of a criterion-referenced test for use in an individualized instructional program, for example, one approach is through a comparison of pretest and posttest scores. The rationale of such a test calls for low scores on the pretest, administered before the relevant instruction, and high scores on the posttest. This relationship can also be checked for individual items in the test (Popham, 1971). Ideally, the largest proportion of examinees should fail an item on the pretest and pass it on the posttest. Items that are commonly failed on both tests are too difficult, and those passed on both tests too easy, for the purposes of such a test. If a sizeable proportion of examinees pass an item on the pretest and fail it on the posttest, there is obviously something wrong with the item, or the instruction, or both.
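This screening logic can be expressed as a simple classification rule. The proportions and cutoff values below are hypothetical choices for illustration, not standards from the text:

```python
# Classify items of a criterion-referenced test by their pretest and
# posttest pass rates, following the rationale described above.

def classify(pre_pass, post_pass, reversal):
    """pre_pass/post_pass: proportion passing the item on each test;
    reversal: proportion who passed the pretest but failed the posttest."""
    if reversal > 0.10:
        return "faulty item or instruction"
    if pre_pass > 0.80 and post_pass > 0.80:
        return "too easy"
    if pre_pass < 0.20 and post_pass < 0.20:
        return "too difficult"
    if pre_pass < 0.30 and post_pass > 0.70:
        return "functioning as intended"
    return "review"

items = {
    "item 1": (0.10, 0.85, 0.02),   # low pretest, high posttest: ideal
    "item 2": (0.90, 0.95, 0.01),   # passed by nearly all on both tests
    "item 3": (0.05, 0.10, 0.01),   # failed by nearly all on both tests
    "item 4": (0.40, 0.45, 0.25),   # many pass-then-fail reversals
}
for name, stats in items.items():
    print(name, "->", classify(*stats))
```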
A test designed to measure anxiety-proneness can be administered to subjects who are subsequently put through a situation designed to arouse anxiety, such as taking an examination under distracting and stressful conditions. The initial anxiety test scores can then be correlated with physiological and other indices of anxiety expression during and after the examination. A different hypothesis regarding an anxiety test could be evaluated by administering the test before and after an anxiety-arousing experience and seeing whether test scores rise significantly on the retest. Positive findings from such an experiment would indicate that the test scores reflect current anxiety level. In a similar way, experiments can be designed to test any other hypothesis regarding the trait measured by a given test.
TABLE 12
A Hypothetical Multitrait-Multimethod Matrix
(From Campbell & Fiske, 1959, p. 82.)
CONVERGENT AND DISCRIMINANT VALIDATION. In a thoughtful analysis of construct validation, D. T. Campbell (1960) points out that in order to demonstrate construct validity we must show not only that a test correlates highly with other variables with which it should theoretically correlate, but also that it does not correlate significantly with variables from which it should differ. In an earlier article, Campbell and Fiske (1959) described the former process as convergent validation and the latter as discriminant validation. Correlation of a mechanical aptitude test with subsequent grades in a shop course would be an example of convergent validation. For the same test, discriminant validity would be illustrated by a low and insignificant correlation with scores on a reading comprehension test, since reading ability is an irrelevant variable in a test designed to measure mechanical aptitude.

It will be recalled that the requirement of low correlation with irrelevant variables was discussed in connection with supplementary and precautionary procedures followed in content validation. Discriminant validation is also especially relevant to the validation of personality tests, in which irrelevant variables may affect scores in a variety of ways.
Campbell and Fiske (1959) proposed a systematic experimental design for the dual approach of convergent and discriminant validation, which they called the multitrait-multimethod matrix. Essentially, this procedure requires the assessment of two or more traits by two or more methods. A hypothetical example provided by Campbell and Fiske will serve to illustrate the procedure. Table 12 shows all possible correlations among the scores obtained when three traits are each measured by three methods. The three traits could represent three personality characteristics, such as (A) dominance, (B) sociability, and (C) achievement motivation. The three methods could be (1) a self-report inventory, (2) a projective technique, and (3) associates' ratings. Thus, A1 would indicate dominance scores on the self-report inventory, A2 dominance scores on the projective test, C3 associates' ratings on achievement motivation, and so forth.
[Table 12 presents the hypothetical multitrait-multimethod matrix of correlations among three traits (A, B, C), each measured by three methods (1, 2, 3); the numerical entries are not reproduced here.]

Note: Letters A, B, C refer to traits, subscripts 1, 2, 3 to methods. Validity coefficients (monotrait-heteromethod) are the three diagonal sets of boldface numbers; reliability coefficients (monotrait-monomethod) are the numbers in parentheses along the principal diagonal. Solid triangles enclose heterotrait-monomethod correlations; broken triangles enclose heterotrait-heteromethod correlations.

The hypothetical correlations given in Table 12 include reliability coefficients (in parentheses, along the principal diagonal) and validity coefficients (in boldface, along three shorter diagonals). In these validity coefficients, the scores obtained for the same trait by different methods are correlated; each measure is thus being checked against other, independent measures of the same trait, as in the familiar validation procedure. The table also includes correlations between different traits measured by the same method (in solid triangles) and correlations between different traits measured by different methods (in broken triangles). For satisfactory construct validity, the validity coefficients should obviously be higher than the correlations between different traits measured by different methods; they should also be higher than the correlations between different traits measured by the same method. For example, the correlation between dominance scores from a self-report inventory and dominance scores from a projective test should be higher than the correlation between dominance and sociability scores from a self-report inventory. If the latter correlation, representing common method variance, were high, it might indicate, for example, that a person's scores on this inventory are unduly affected by some irrelevant common factor, such as ability to understand the questions or desire to make oneself appear in a favorable light on all traits.
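The comparisons required of a multitrait-multimethod matrix can be sketched programmatically. The correlations below are invented, keyed by (trait, method) pairs as in Table 12, and the example uses two methods rather than three for brevity:

```python
# Check the Campbell-Fiske requirements on a small set of hypothetical
# correlations keyed by ((trait, method), (trait, method)) pairs.
from itertools import combinations

traits, methods = ["A", "B", "C"], [1, 2]
measures = [(t, m) for m in methods for t in traits]

corr = {}
for (t1, m1), (t2, m2) in combinations(measures, 2):
    if t1 == t2:            # monotrait-heteromethod: a validity coefficient
        corr[(t1, m1), (t2, m2)] = 0.55
    elif m1 == m2:          # heterotrait-monomethod (shared method variance)
        corr[(t1, m1), (t2, m2)] = 0.35
    else:                   # heterotrait-heteromethod
        corr[(t1, m1), (t2, m2)] = 0.20

validities = [r for ((t1, _), (t2, _)), r in corr.items() if t1 == t2]
hetero = [r for ((t1, _), (t2, _)), r in corr.items() if t1 != t2]

# Every validity coefficient should exceed every correlation between
# different traits, whether measured by the same or different methods.
print(min(validities) > max(hetero))   # prints True for these values
```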
Fiske (1973) has added still another set of correlations that should be checked, especially in the construct validation of personality tests. These correlations involve the same trait measured by the same method, but with a different test. For example, two investigators may each prepare a self-report inventory designed to assess endurance. Yet the endurance scores obtained with the two inventories may show quite different patterns of correlations with measures of other personality traits.
[Table 13, not reproduced here, lists the types of validity with an illustrative question for each.]
at a higher educational level, as when selecting high school students for college admission, it needs to be evaluated against the criterion of subsequent college performance rather than in terms of its content validity.
The examples given in Table 13 focus on the differences among the various types of validation procedures. Further consideration of these procedures, however, shows that content, criterion-related, and construct validity do not correspond to distinct or logically coordinate categories.
On the contrary, construct validity is a comprehensive concept, which
includes the other types. All the specific techniques for establishing con-
tent and criterion-related validity, discussed in earlier sections of this
chapter, could have been listed again under construct validity. Comparing
the test performance of contrasted groups, such as neurotics and normals,
is one way of checking the construct validity of a test designed to meas-
ure emotional adjustment, anxiety, or other postulated traits. Comparing
the test scores of institutionalized mental retardates with those of normal
schoolchildren is one way to investigate the construct validity of an
intelligence test. The correlations of a mechanical aptitude test with per-
formance in shop courses and in a wide variety of jobs contribute to our
understanding of the construct measured by the test. Validity against
various practical criteria is commonly reported in test manuals to aid the
potential user in understanding what a test measures. Although he may
not be directly concerned with the prediction of any of the specific cri-
teria employed, by examining such criteria the test user is able to build
up a concept of the behavior domain sampled by the test.
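The contrasted-groups procedure just described lends itself to a simple numerical sketch. The scores, group sizes, and effect-size summary below are hypothetical illustrations, not data from the text:

```python
# Sketch of a contrasted-groups check of construct validity (hypothetical data):
# a test of emotional adjustment should separate a normal group from a
# clinically anxious group if it measures the postulated construct.
from statistics import mean, stdev

normals = [52, 48, 55, 60, 47, 58, 50, 54]   # adjustment scores, normal group
patients = [38, 41, 35, 44, 39, 36, 42, 40]  # adjustment scores, clinical group

diff = mean(normals) - mean(patients)
# pooled SD for a rough standardized effect size (Cohen's d)
pooled = ((stdev(normals) ** 2 + stdev(patients) ** 2) / 2) ** 0.5
d = diff / pooled

print(f"mean difference = {diff:.1f}, effect size d = {d:.1f}")
# A large difference in the predicted direction supports the construct
# interpretation; its absence would count against it.
```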
Content validity likewise enters into both the construction and the
subsequent evaluation of all tests. In assembling items for any new test,
the test constructor is guided by hypotheses regarding the relations be-
tween the type of content he chooses and the behavior he wishes to
measure. All the techniques of criterion-related validation, as well as the
other techniques discussed under construct validation, represent ways of
testing such hypotheses. As for the test user, he too relies in part on
content validity in evaluating any test. For example, he may check the
vocabulary in an emotional adjustment inventory to determine whether
some of the words are too difficult for the persons he plans to test; he
may conclude that. the scores on a particular test depend too much on
speed for his purposes; or he may notice that an intelligence test de-
veloped twenty years ago contains many obsolescent items unsuitable for
use today. All these observations about content are relevant to the con-
struct validity of a test. In fact, there is no information provided by any
validation procedure that is not relevant to construct validity.
The term construct validity was officially introduced into the psy-
chometrist's lexicon in 1954 in the Technical Recommendations for Psy-
chological Tests and Diagnostic Techniques, which constituted the first
edition of the current APA test Standards (1974). Although the validation
Principles of Psychological Testing
conditions, it cannot be concluded that both inventories measure the same
personality construct of endurance.

It might be noted that within the framework of the multitrait-multi-
method matrix, reliability represents agreement between two measures of
the same trait obtained through maximally similar methods, such as
parallel forms of the same test; validity represents agreement between
two measures of the same trait obtained by maximally different methods,
such as test scores and supervisor's ratings. Since similarity and difference
of methods are matters of degree, theoretically reliability and validity can
be regarded as falling along a single continuum. Ordinarily, however, the
techniques actually employed to measure reliability and validity cor-
respond to easily identifiable regions of this continuum.
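The multitrait-multimethod comparisons described above can be sketched numerically. In this minimal Python illustration (all scores hypothetical), the monotrait-heteromethod correlation, which should be high, is compared with a heterotrait-monomethod correlation, which should be lower:

```python
# Minimal multitrait-multimethod check (hypothetical data).
# Convergent validity: the same trait measured by different methods should
# correlate higher than different traits measured by the same method.
from math import sqrt

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical scores for six persons.
dom_inventory = [12, 15, 9, 18, 11, 16]   # dominance, self-report inventory
dom_projective = [10, 14, 8, 17, 12, 15]  # dominance, projective test
soc_inventory = [7, 9, 14, 8, 13, 10]     # sociability, self-report inventory

validity = pearson(dom_inventory, dom_projective)   # monotrait-heteromethod
method_var = pearson(dom_inventory, soc_inventory)  # heterotrait-monomethod

print(f"same trait, different methods: {validity:.2f}")
print(f"different traits, same method: {method_var:.2f}")
# For the matrix to support construct validity, the first should exceed the second.
```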
We have considered several ways of asking, "How valid is this test?"
To point up the distinctive features of the different types of validity, let
us apply each in turn to a test consisting of 50 assorted arithmetic prob-
lems. Four ways in which this test might be employed, together with the
type of validation procedure appropriate to each, are illustrated in Table
13. This example highlights the fact that the choice of validation pro-
cedure depends on the use to be made of the test scores. The same test,
when employed for different purposes, should be validated in different
ways. If an achievement test is used to predict subsequent performance
TABLE 13
Validation of a Single Arithmetic Test for Different Purposes

Testing purpose: Achievement test in elementary school arithmetic
  Illustrative question: How much has Dick learned in the past?
  Type of validity: Content

Testing purpose: Aptitude test to predict performance in high school mathematics
  Illustrative question: How well will Jim learn in the future?
  Type of validity: Criterion-related: predictive

Testing purpose: Technique for diagnosing learning disabilities
  Illustrative question: Does Bill's performance show specific disabilities?
  Type of validity: Criterion-related: concurrent

Testing purpose: Measure of logical reasoning
  Illustrative question: How can we describe Henry's psychological functioning?
  Type of validity: Construct
procedures subsumed under construct validity were not new at the time,
the discussions of construct validation that followed served to make the
implications of these procedures more explicit and to provide a systematic
rationale for their use. Construct validation has focused attention on the
role of psychological theory in test construction and on the need to
formulate hypotheses that can be proved or disproved in the validation
process. It is particularly appropriate in the evaluation of tests for use
in research.
In practical contexts, construct validation is suitable for investigating
the validity of the criterion measures used in traditional criterion-related
test validation (see, e.g., James, 1973). Through an analysis of the cor-
relations of different criterion measures with each other and with other
relevant variables, and through factorial analyses of such data, one can
learn more about the meaning of a particular criterion. In some instances,
the results of such a study may lead to modification or replacement of the
criterion chosen to validate a test. Under any circumstances, the results
will enrich the interpretation of the test validation study.
Another practical application of construct validation is in the evalu-
ation of tests in situations that do not permit acceptable criterion-related
validation studies, as in the local validation of some personnel tests for
industrial use. The difficulties encountered in these situations were dis-
cussed earlier in this chapter, in connection with synthetic validity. Con-
struct validation offers another alternative approach that could be fol-
lowed in evaluating the appropriateness of published tests for a particular
job. Like synthetic validation, this approach requires a systematic job
analysis, followed by a description of worker qualifications expressed in
terms of relevant behavioral constructs. If, now, the test has been sub-
jected to sufficient research prior to publication, the data cited in the
manual should permit a specification of the principal constructs measured
by the test. This information could be used directly in assessing the
relevance of the test to the required job functions, if the correspondence
of constructs is clear enough; or it could serve as a basis for computing
a J-coefficient or some other quantitative index of synthetic validity.
Construct validation has also stimulated the search for novel ways of
gathering validity data. Although the principal techniques employed in
investigating construct validity have long been familiar, the field of
operation has been expanded to admit a wider variety of procedures.
This very multiplicity of data-gathering techniques, however, presents
certain hazards. It is possible for a test constructor to try a large number
of different validation procedures, a few of which will yield positive re-
sults by chance. If these confirmatory results were then to be reported
without mention of all the validity probes that yielded negative results, a
very misleading impression about the validity of a test could be created.
Another possible danger in the application of construct validation is that
it may open the way for subjective, unverified assertions about test
validity. Since construct validity is such a broad and loosely defined con-
cept, it has been widely misunderstood. Some textbook writers and test
constructors seem to perceive it as content validity expressed in terms of
psychological trait names. Hence, they present as construct validity purely
subjective accounts of what they believe (or hope) the test measures.

A further source of possible confusion arises from a statement that
construct validation "is involved whenever a test is to be interpreted as
a measure of some attribute or quality which is not 'operationally de-
fined'" (Cronbach & Meehl, 1955, p. 282). Appearing in the first detailed
published analysis of the concept of construct validity, this statement was
often incorrectly accepted as justifying a claim for construct validity in
the absence of data. That the authors of the statement did not intend
such an interpretation is illustrated by their own insistence, in the same
article, that "unless the network makes contact with observations . . .
construct validation cannot be claimed" (p. 291). In the same connection,
they criticize tests for which "a finespun network of rationalizations has
been offered as if it were validation" (p. 291). Actually, the theoretical
construct, trait, or behavior domain measured by a particular test can
be adequately defined only in the light of data gathered in the process of
validating that test. Such a definition would take into account the vari-
ables with which the test correlated significantly, as well as the conditions
found to affect its scores and the groups that differ significantly in such
scores. These procedures are entirely in accord with the positive contri-
butions made by the concept of construct validity. It is only through
the empirical investigation of the relationships of test scores to other
external data that we can discover what a test measures.
CHAPTER 7

Validity: Measurement and Interpretation
MEASUREMENT OF RELATIONSHIP. A validity coefficient is a correlation
between test score and criterion measure. Because it provides a single
numerical index of test validity, it is commonly used in test manuals to
report the validity of a test against each criterion for which data are
available. The data used in computing any validity coefficient can also
be expressed in the form of an expectancy table or expectancy chart,
illustrated in Chapter 4. In fact, such tables and charts provide a con-
venient way to show what a validity coefficient means for the person
tested. It will be recalled that expectancy charts give the probability that
an individual who obtains a certain score on the test will attain a speci-
fied level of criterion performance. For example, with Table 6 (Ch. 4,
p. 101), if we know a student's score on the DAT Verbal Reasoning test,
"",e can look up the chances that he will earn a particular grade in a
hIgh school course. The same data yield a validity coefficient of .66
When both test and criterion variables are continuous, as in this example,
the familiar Pearson Product-Moment Correlation Coefficient is appli-
cable. Other types of correlation coefficients can be computed when the
data are expressed in different forms, as when a two-fold pass-fail cri-
terion is employed (e.g., Fig. 7, Ch. 4). The specific procedures for
computing these different kinds of correlations can be found in any
standard statistics text.
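An expectancy table of the kind just mentioned can be built mechanically from paired scores and outcomes. A minimal sketch with hypothetical data (the score bands and sample are assumptions, not the DAT data cited above):

```python
# Sketch of an expectancy table: probability of criterion "success"
# within each test-score interval (hypothetical data).

def expectancy_table(scores, successes, bands):
    """bands: list of (low, high) score intervals, inclusive of low."""
    table = {}
    for low, high in bands:
        in_band = [s for sc, s in zip(scores, successes) if low <= sc < high]
        table[(low, high)] = sum(in_band) / len(in_band) if in_band else None
    return table

scores = [22, 35, 41, 48, 55, 61, 67, 72, 78, 83, 88, 93]
passed = [0,  0,  0,  1,  0,  1,  1,  1,  1,  1,  1,  1]  # 1 = met criterion

bands = [(0, 40), (40, 60), (60, 80), (80, 101)]
for (low, high), p in expectancy_table(scores, passed, bands).items():
    print(f"scores {low}-{high - 1}: chance of success = {p:.0%}")
```

As in a published expectancy chart, the probabilities rise with score level; with real data each band would rest on far more cases than this toy sample.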
CHAPTER 6 was concerned with different concepts of validity and
their appropriateness for various testing functions; this chapter
deals with quantitative expressions of validity and their interpre-
tation. The test user is concerned with validity at either or both of two
stages. First, when considering the suitability of a test for his purposes,
he examines available validity data reported in the test manual or other
published sources. Through such information, he arrives at a tentative
concept of what psychological functions the test actually measures, and
he judges the relevance of such functions to his proposed use of the test.
In effect, when a test user relies on published validation data, he IS dea.l-
ing with construct validity, regardless of the specific procedures used in
gathering the data. As we have seen in Chapter 6, the criteria employed
in published studies cannot be assumed to be identical with those the
test user wants to predict. Jobs bearing the same title in two different
companies are rarely identical. Two courses in freshman English taught
in different colleges may be quite dissimilar.
Because of the specificity of each criterion, test users are usually ad-
vised to check the validity of any chosen test against local criteria when-
ever possible. Although published data may strongly suggest that a given
test should have high validity in a particular situation, direct corrobora-
tion is always desirable. The determination of validity against specific
local criteria represents the second stage in the test user's evaluation of
validity. The techniques to be discussed in this chapter are especially
relevant to the analysis of validity data obtained by the test user himself.
Most of them are also useful, however, in understanding and interpreting
the validity data reported in test manuals.
CONDITIONS AFFECTING VALIDITY COEFFICIENTS. As in the case of reli-
ability, it is essential to specify the nature of the group on which a
validity coefficient is found. The same test may measure different func-
tions when given to individuals who differ in age, sex, educational level,
occupation, or any other relevant characteristic. Persons with different
experiential backgrounds, for example, may utilize different work meth-
ods to solve the same test problem. Consequently, a test could have high
validity in predicting a particular criterion in one population, and little
or no validity in another. Or it might be a valid measure of different
functions in the two populations. Thus, unless the validation sample is
representative of the population on which the test is to be used, validity
should be redetermined on a more appropriate sample.
The question of sample heterogeneity is relevant to the measurement
of validity, as it is to the measurement of reliability, since both charac-
teristics are commonly reported in terms of correlation coefficients. It
will be recalled that, other things being equal, the wider the range of
scores, the higher will be the correlation. This fact should be kept in
mind when interpreting the validity coefficients given in test manuals.
A special difficulty encountered in many validation samples arises from
preselection. For example, a new test that is being validated for job selec-
tion may be administered to a group of newly hired employees on whom
criterion measures of job performance will eventually be available. It is
likely, however, that such employees represent a superior selection of all
those who applied for the job. Hence, the range of such a group in both
test scores and criterion measures will be curtailed at the lower end of the
distribution. The effect of such preselection will therefore be to lower the
validity coefficient. In the subsequent use of the test, when it is admin-
istered to all applicants for selection purposes, the validity can be ex-
pected to be somewhat higher.
" Validity coefficients may also change over time because of changing
.'selection standards. An example is provided by a comparison of validity
,coefficients compll.ted over a 3D-year interval with Yale students (Burn-
"ham, 1965). Correlations were found between a predictive index based
, on College Entrance Examination Board tests and high school records,
f onthe one hand, and average freshman grades, on the other. This correla-
tion dropped from .11 to .52 over the 30 years. An examination of the
r' bivariate distributions dearly reveals the reason for this drop. Because of
~higher admissibn standards, the later class was more homogeneous than
.:the earlier class in both predictor and criterion performance. Conse-
quently, the correlation was lower in the later group, although the ac-
t curacy with whkh individuals' grades were predicted showed little
ch~nge. In other words, the observed drop in correlation did not indicate
. that the predictors were less va-lid than they had been 30 years earlier.
Had the difference$ in group homogeneity been ignored, it might have
" been 'Wrongly concluded that this was the case.
For the proper interpretation of a validity coefficient, attention should
also be given to the form of the relationship between test and criterion.
The computation of a Pearson correlation coefficient assumes that the re-
lationship is linear and uniform throughout the range. There is evidence
that in certain situations, however, these conditions may not be met
(Fisher, 1959; Kahneman & Ghiselli, 1962). Thus, a particular job may
require a minimum level of reading comprehension, to enable employees
to read instruction manuals, labels, and the like. Once this minimum is
exceeded, however, further increments in reading ability may be un-
related to degree of job success. This would be an example of a nonlinear
relation between test and job performance. An examination of the bivari-
ate distribution or scatter diagram obtained by plotting reading compre-
hension scores against criterion measures would show a rise in job per-
formance up to the minimal required reading ability and a leveling off
beyond that point. Hence, the entries would cluster around a curve rather
than a straight line.

In other situations, the line of best fit may be a straight line, but the
individual entries may deviate farther around this line at the upper than
at the lower end of the scale. Suppose that performance on a scholastic
aptitude test is a necessary but not a sufficient condition for successful
achievement in a course. That is, the low-scoring students will perform
poorly in the course; but among the high-scoring students, some will per-
form well in the course and others will perform poorly because of low
motivation. In this situation, there will be wider variability of criterion
performance among the high-scoring than among the low-scoring stu-
dents. This condition in a bivariate distribution is known as hetero-
scedasticity. The Pearson correlation assumes homoscedasticity, or equal
variability throughout the range of the bivariate distribution. In the
present example, the bivariate distribution would be fan-shaped, wide
at the upper end and narrow at the lower end. An examination of the
bivariate distribution itself will usually give a good indication of the
nature of the relationship between test and criterion. Expectancy tables
and expectancy charts also correctly reveal the relative effectiveness of
the test at different levels.
MAGNITUDE OF A VALIDITY COEFFICIENT. How high should a validity
coefficient be? No general answer to this question is possible, since the
interpretation of a validity coefficient must take into account a number
of concomitant circumstances. The obtained correlation, of course, should
be high enough to be statistically significant at some acceptable level,
such as the .01 or .05 levels discussed in Chapter 5. In other words, before
drawing any conclusions about the validity of a test, we should be rea-
sonably certain that the obtained validity coefficient could not have arisen
through chance fluctuations of sampling from a true correlation of zero.

Having established a significant correlation between test scores and
criterion, however, we need to evaluate the size of the correlation in the
light of the uses to be made of the test. If we wish to predict an indi-
vidual's exact criterion score, such as the grade-point average a student
will receive in college, the validity coefficient may be interpreted in terms
of the standard error of estimate, which is analogous to the error of
measurement discussed in connection with reliability. It will be recalled
that the error of measurement indicates the margin of error to be ex-
pected in an individual's score as a result of the unreliability of the test.
Similarly, the error of estimate shows the margin of error to be expected
in the individual's predicted criterion score, as a result of the imperfect
validity of the test.
The error of estimate is found by the following formula:

σ_est = σ_y √(1 − r²_xy)
in which r²_xy is the square of the validity coefficient and σ_y is the standard
deviation of the criterion scores. It will be noted that if the validity were
perfect (r_xy = 1.00), the error of estimate would be zero. On the other
hand, with a test having zero validity, the error of estimate is as large as
the standard deviation of the criterion distribution (σ_est = σ_y √(1 − 0) = σ_y).
Under these conditions, the prediction is no better than a guess; and
the range of prediction error is as wide as the entire distribution of
criterion scores. Between these two extremes are to be found the errors
of estimate corresponding to tests of varying validity.
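The formula and its two limiting cases can be verified directly; a short sketch in Python (the criterion standard deviation of 10 is an arbitrary illustration):

```python
# Standard error of estimate: sigma_est = sigma_y * sqrt(1 - r_xy**2).
from math import sqrt

def error_of_estimate(r_xy, sigma_y):
    return sigma_y * sqrt(1 - r_xy ** 2)

sigma_y = 10.0  # SD of criterion scores (illustrative value)
print(error_of_estimate(1.00, sigma_y))  # perfect validity: error is zero
print(error_of_estimate(0.00, sigma_y))  # zero validity: error equals sigma_y
print(error_of_estimate(0.80, sigma_y))  # .80 validity: error is 60% of sigma_y
```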
Reference to the formula for σ_est will show that the term √(1 − r²_xy)
serves to indicate the size of the error relative to the error that would
result from a mere guess, i.e., with zero validity. In other words, if
√(1 − r²_xy) is equal to 1.00, the error of estimate is as large as it would be
if we were to guess the subject's score. The predictive improvement at-
tributable to the use of the test would thus be nil. If the validity co-
efficient is .80, then √(1 − r²_xy) is equal to .60, and the error is 60 percent
as large as it would be by chance. To put it differently, the use of such a
test enables us to predict the individual's criterion performance with a
margin of error that is 40 percent smaller than it would be if we were to
guess.

It would thus appear that even with a validity of .80, which is unusually
high, the error of predicted scores is considerable. If the primary function
of psychological tests were to predict each individual's exact position in
the criterion distribution, the outlook would be quite discouraging. When
examined in the light of the error of estimate, most tests do not appear
very efficient. In most testing situations, however, it is not necessary to
predict the specific criterion performance of individual cases, but rather
to determine which individuals will exceed a certain minimum standard
of performance, or cutoff point, in the criterion. What are the chances
that Mary Greene will graduate from medical school, that Tom Higgins
will pass a course in calculus, or that Beverly Bruce will succeed as an
astronaut? Which applicants are likely to be satisfactory clerks, salesmen,
or machine operators? Such information is useful not only for group
selection but also for individual career planning. For example, it is ad-
vantageous for a student to know that he has a good chance of passing
all courses in law school, even if we are unable to estimate with certainty
whether his grade average will be 74 or 81.

A test may appreciably improve predictive efficiency if it shows any
significant correlation with the criterion, however low. Under certain cir-
cumstances, even validities as low as .20 or .30 may justify inclusion of
the test in a selection program. For many testing purposes, evaluation of
tests in terms of the error of estimate is unrealistically stringent. Consid-
eration must be given to other ways of evaluating the contribution of a
test, which take into account the types of decisions to be made from the
scores. Some of these procedurcs will be illustrated in the following sec-
tion.
BASIC APPROACH. Let us suppose that 100 applicants have been given
an aptitude test and followed up until each could be evaluated for suc-
cess on a certain job. Figure 17 shows the bivariate distribution of test
scores and measures of job success for the 100 subjects. The correlation
between these two variables is slightly below .70. The minimum accept-
able job performance, or criterion cutoff point, is indicated in the diagram
by a heavy horizontal line. The 40 cases falling below this line would
represent job failures; the 60 above the line, job successes. If all 100 appli-
cants are hired, therefore, 60 percent will succeed on the job. Similarly,
if a smaller number were hired at random, without reference to test
scores, the proportion of successes would probably be close to 60 percent.
Suppose, however, that the test scores are used to select the 45 most
promising applicants out of the 100 (selection ratio = .45). In such a
case, the 45 individuals falling to the right of the heavy vertical line
would be chosen. Within this group of 45, it can be seen that there are 7
job failures, or false acceptances, falling below the heavy horizontal line,
and 38 job successes. Hence, the percentage of job successes is now 84
rather than 60 (i.e., 38/45 = .84). This increase is attributable to the use
of the test as a screening instrument. It will be noted that errors in pre-
dicted criterion score that do not affect the decision can be ignored.
Only those prediction errors that cross the cutoff line and hence place
the individual in the wrong category will reduce the selective effective-
ness of the test.
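The four cells of Figure 17 (valid and false acceptances, valid and false rejections) can be tallied mechanically from any bivariate sample. A minimal sketch with ten hypothetical applicants rather than the 100 of Figure 17:

```python
# Tallying the four decision outcomes of a cutoff-based selection
# (the paired scores below are hypothetical, not the Figure 17 data).
def selection_outcomes(pairs, test_cutoff, criterion_cutoff):
    """pairs: (test_score, criterion_score) tuples; returns the four cells."""
    cells = {"valid_accept": 0, "false_accept": 0,
             "false_reject": 0, "valid_reject": 0}
    for test, crit in pairs:
        accepted = test >= test_cutoff
        success = crit >= criterion_cutoff
        if accepted and success:
            cells["valid_accept"] += 1
        elif accepted:
            cells["false_accept"] += 1   # hired, but fails on the job
        elif success:
            cells["false_reject"] += 1   # rejected, but would have succeeded
        else:
            cells["valid_reject"] += 1
    return cells

pairs = [(30, 40), (45, 55), (50, 48), (55, 70), (60, 52),
         (65, 75), (70, 58), (75, 80), (80, 72), (85, 90)]
cells = selection_outcomes(pairs, test_cutoff=60, criterion_cutoff=60)
accepted = cells["valid_accept"] + cells["false_accept"]
print(cells)
print(f"success rate among accepted: {cells['valid_accept'] / accepted:.0%}")
```

With these toy figures, selection by the test raises the success rate among those accepted above the 50 percent base rate of the whole group, just as the 60-to-84 rise does in Figure 17.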
For a complete evaluation of the effectiveness of the test as a screening
instrument, another category of cases in Figure 17 must also be examined.
This is the category of false rejections, comprising the 22 persons who
score below the cutoff point on the test but above the criterion cutoff.
From these data we would estimate that 22 percent of the total applicant
sample are potential job successes who will be lost if the test is used as a
screening device with the present cutoff point. These false rejects in a
personnel selection situation correspond to the false positives in clinical
evaluations. The latter term has been adopted from medical practice, in
which a test for a pathological condition is reported as positive if the
condition is present and negative if the patient is normal. A false positive
thus refers to a case in which the test erroneously indicates the presence
of a pathological condition, as when brain damage is indicated in an
individual who is actually normal. This terminology is likely to be con-
fusing unless we remember that in clinical practice a positive result on a
test denotes pathology and unfavorable diagnosis, whereas in personnel
selection a positive result conventionally refers to a favorable prediction
regarding job performance, academic achievement, and the like.
In setting a cutoff score on a test, attention should be given to the
percentage of false rejects (or false positives) as well as to the percent-
ages of successes and failures within the selected group. In certain situ-
ations, the cutoff point should be set sufficiently high to exclude all but
a few possible failures. This would be the case when the job is of such
a nature that a poorly qualified worker could cause serious loss or dam-
age. An example would be a commercial airline pilot. Under other
circumstances, it may be more important to admit as many qualified
persons as possible, at the risk of including more failures. In the latter
case, the number of false rejects can be reduced by the choice of a lower
cutoff score. Other factors that normally determine the position of the
cutoff score include the available personnel supply, the number of job
"
Validity: Measurement and Interpretation 169
openings, and the urgency or speed with which they must be filled.
In many personnel decisions, the selection ratio is determined by the
practical demands of the situation. Because of supply and demand in
filling job openings, for example, it may be necessary to hire the top 40
percent of applicants in one case and the top 75 percent in another.
When the selection ratio is not externally imposed, the cutting score on
a test can be set at that point giving the maximum differentiation be-
tween criterion groups. This can be done roughly by comparing the
distribution of test scores in the two criterion groups. More precise math-
ematical procedures for setting optimal cutting scores have also been
worked out (Darlington & Stauffer, 1966; Guttman & Raju, 1965; Rorer,
Hoffman, La Forge, & Hsieh, 1966). These procedures make it possible to
take into account other relevant parameters, such as the relative serious-
ness of false rejections and false acceptances.
In the terminology of decision theory, the example given in Figure 17
illustrates a simple strategy, or plan for deciding which applicants to ac-
cept and which to reject. In more general terms, a strategy is a technique
for utilizing information in order to reach a decision about individuals. In
this case, the strategy was to accept the 45 persons with the highest test
scores. The increase in percentage of successful employees from 60 to 84
could be used as a basis for estimating the net benefit resulting from the
use of the test.

Statistical decision theory was developed by Wald (1950) with special
reference to the decisions required in the inspection and quality control
of industrial products. Many of its implications for the construction and
interpretation of psychological tests have been systematically worked out
by Cronbach and Gleser (1965). Essentially, decision theory is an at-
tempt to put the decision-making process into mathematical form, so that
available information may be used to arrive at the most effective decision
under specified circumstances. The mathematical procedures employed
in decision theory are often quite complex, and few are in a form per-
mitting their immediate application to practical testing problems. Some
of the basic concepts of decision theory, however, are proving helpful in
the reformulation and clarification of certain questions about tests. A few
of these ideas were introduced into testing before the formal develop-
ment of statistical decision theory and were later recognized as fitting
into that framework.
[Figure 17: bivariate distribution of test scores (horizontal axis) and job performance (vertical axis), divided by the criterion cutoff into job successes and job failures, and by the test cutoff into rejected and selected applicants.]

FIG. 17. Increase in the Proportion of "Successes" Resulting from the Use of a Selection Test.
PREDICTION OF OUTCOMES. A precursor of decision theory in psychologi-
cal testing is to be found in the Taylor-Russell tables (1939), which per-
mit a determination of the net gain in selection accuracy attributable to
the use of the test. The information required includes the validity co-
[Table 14 (Taylor & Russell, 1939): proportion of successes expected among selected applicants, for a base rate of .60, as a function of test validity and selection ratio.]
selected after the use of the test. Thus, the difference between .60 and
any one table entry shows the increase in proportion of successful selec-
tions attributable to the test.
Obviously, if the selection ratio were 100 percent, that is, if all appli-
cants had to be accepted, no test, however valid, could improve the
selection process. Reference to Table 14 shows that, when as many as 95
percent of applicants must be admitted, even a test with perfect validity
(r = 1.00) would raise the proportion of successful persons by only 3 per-
cent (.60 to .63). On the other hand, when only 5 percent of applicants
need to be chosen, a test with a validity coefficient of only .30 can raise
the percentage of successful applicants selected from 60 to 82. The rise
from 60 to 82 represents the incremental validity of the test (Sechrest,
1963), or the increase in predictive validity attributable to the test. It
indicates the contribution the test makes to the selection of individuals
who will meet the minimum standards in criterion performance. In ap-
plying the Taylor-Russell tables, of course, test validity should be com-
puted on the same sort of group used to estimate percentage of prior
successes. In other words, the contribution of the test is not evaluated
against chance success unless applicants were previously selected by
chance, a most unlikely circumstance. If applicants had been selected
on the basis of previous job history, letters of recommendation, and inter-
views, the contribution of the test should be evaluated on the basis of
what the test adds to these previous selection procedures.
The incremental validity resul~~~ from the use of a test depends not
only on the selection ratio but l\~'()ll the base rate. In the previously
illustrated job selection situation, the base rale refers to the proportion of
successful employees prior to the introduction of the test for selection
purposes. Table 14 shows the anticipated outcomes when the base rate
is .60. For other base rates, we need to consult the other appropriate
tables in the cited reference (Taylor & Russell, 1939). Let us consider
an example in which test validity is .40 and the selection ratio is 70 per-
cent. Under these conditions, what would be the contribution or incre-
mental validity of the test if we begin with a base rate of 50 percent?
And what would be the contribution if we begin with more extreme base
rates of 10 and 90 percent? Reference to the appropriate Taylor-Russell
tables for these base rates shows that the percentage of successful em-
ployees would rise from 50 to 75 in the Hrst case; from 10 to 21 in the
second; and from 9 to 99 in the third. Thus, the improvement in percent-
age of successful employees attributable tQ .the use of the test is 25 whenthe base rate was 50, but only 11 and 9 when the b,ase rates were more
extreme. .
The implications of extreme base rates are of specia~,,interest in clinical
psychology, where the base rate refe~ to' the frequency of the patho-
lOgical condition to be diagnosed in the, p.qpulation tested (Buchwald,
o Principles of Psychological Testing
cient of the test, the proportion of applicants who m~~t be acclep~e~
lection ratio), and the proportion of successfu~ app lc~n~ :: ~~r:ethout the use of the test (base rate). A change many 0 t I"
ctorscan alter the predictive efficiency of the test.For urposes of illustration, one of the Taylor-Russell tables has been
e rod~eed in Table 14. This table is designed for us~ when the base
.aie or ercenta e of successful applicants selected pnor to the use of
he test 1s 60. Ot~er tables are prOVided by Taylor and Russe~l for ~t~~r
base ra~es Across the top of the table are given different va ues ~ .e
selection ;atio, and along the side are the tes~ validities. The entnes 111
the' body of the table indicate the proportion of successful· persons
TABLE 14 i ( f G'Proportionof "Successes" Expected through the Use 0 Test 0 lven
Validityand Given Selection Ratio, for Base Rate .60.
(FromTaylor and Russell, 1959, p. 576) . =-"~':~~,",7"'J2'-':UliH~~'.:>,JI;~~,.:~!M.r ••_.:::·..:;.':5.~~~
Selection Ratio
.30 .40 .50 .60 .70 .80 .90 .95
.75
.80
.85
.90
.951.00
.991.00
1.00
1.001.001.00
.99
.991.00
1.001.00
1.00
.96
.98
.991.001.001.00
.93
.95
.97
.991.00
1.00
.90
.92
.95
.97
.99
1.00
.60 .60 .60
.61 .61 .61
.63 .62 .61
.64 .63 .62
.65 .64 .63
.66 .65 .63
.68 .66 .64
.69 .67 .65
.70 .68 .66
.72 .69 .66
.73 .70 .67
.75 .71 .68
.76 .73 .69
.78 .74 .70
.80 .75 .71
.71.72
.73
.74
.75
.75
.86
.88
.91
.94
.971.00
.81 .77
.83 .78
.86 .80
.88 .82.92 .841.00 .86
.62 .61
.62. .6J
.63 .62
.63.62
.64 .62
.64 .62
.64 .62
.65 .63
.65 .63
.66 .63
.66 .63
.66 .63
.66 .63
.67 .63
.67 .63
.67 .63
Princillies of PSljcllological Testing
1965; Cureton, 1957a; Meehl & Rosen, 1955; J. S. Wiggins, 1973). For example, if 5 percent of the intake population of a clinic has organic brain damage, then 5 percent is the base rate of brain damage in this population. Although the introduction of any valid test will improve predictive or diagnostic accuracy, the improvement is greatest when the base rates are closest to 50 percent. With the extreme base rates found with rare pathological conditions, however, the improvement may be negligible. Under these conditions, the use of a test may prove to be unjustified when the cost of its administration and scoring is taken into account. In a clinical situation, this cost would include the time of professional personnel that might otherwise be spent on the treatment of additional cases (Buchwald, 1965). The number of false positives, or normal individuals incorrectly classified as pathological, would of course increase this overall cost in a clinical situation.

When the seriousness of a rare condition makes its diagnosis urgent,
.. tests of moderate validity may be employed in an early stage of sequential
decisions. For example, all cases might first be screened with an easily
administered test of moderate validity. If the cutoff score is set high
enough (high scores being favorable), there will be few false negatives
but many false positives, or normals diagnosed as pathological. The latter
can then be detected through a more intensive individual examination
given to all cases diagnosed as positive by the test. This solution would
be appropriate, for instance, when available facilities make the intensive
individual examination of all cases impracticable.
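A Taylor-Russell entry can also be approximated numerically, since the tables assume that test score and criterion are bivariate normal with correlation equal to the validity coefficient. The sketch below makes that assumption explicit; the function names and the simple trapezoidal integration are illustrative conveniences, not part of the original tables.

```python
import math

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def normal_quantile(p):
    """Inverse of normal_cdf, by bisection (adequate for a sketch)."""
    lo, hi = -8.0, 8.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if normal_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def taylor_russell(validity, selection_ratio, base_rate, steps=4000):
    """Proportion of 'successes' among those selected, for a test whose
    correlation with the criterion is `validity` (0 <= validity < 1)."""
    x0 = normal_quantile(1 - selection_ratio)  # test cutoff score
    y0 = normal_quantile(1 - base_rate)        # criterion success cutoff
    s = math.sqrt(1 - validity ** 2)
    # P(success and selected): integrate phi(x) * P(success | test = x)
    # over x > x0 by the trapezoidal rule, then divide by P(selected).
    h = 8 / steps
    total = 0.0
    for i in range(steps + 1):
        x = x0 + i * h
        weight = 0.5 if i in (0, steps) else 1.0
        density = math.exp(-x * x / 2) / math.sqrt(2 * math.pi)
        total += weight * density * normal_cdf((validity * x - y0) / s)
    return total * h / selection_ratio
```

For the case cited in the text, `taylor_russell(0.30, 0.05, 0.60)` comes out close to the tabled .82, while zero validity leaves the success proportion at the .60 base rate.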
RELATION OF VALIDITY TO MEAN OUTPUT LEVEL. In many practical situ-
ations, what is wanted is an estimate of the effect of the selection test,
not on percentage of persons exceeding the minimum performance, but
on overall output of the selected persons. How does the actual level of
job proficiency or criterion achievement of the workers hired on the
basis of the test compare with that of the total applicant sample that
would have been hired without the test? Following the work of Taylor
and Russell, several investigators addressed themselves to this question
(Brogden, 1946; Brown & Ghiselli, 1953; Jarrett, 1948; Richardson, 1944).
Brogden (1946) first demonstrated that the expected increase in output is directly proportional to the validity of the test. Thus, the improvement resulting from the use of a test of validity .50 is 50 percent as great as the improvement expected from a test of perfect validity.

The relation between test validity and expected rise in criterion achievement can be readily seen in Table 15.¹ Expressing criterion scores
1 A table including more values for both selection ratios and validity coefficients
was prepared by Naylor and Shine (1965).
TABLE 15
[The body of Table 15 did not survive scanning. It gives the expected mean standard criterion score of the selected group for each combination of test validity (down the side) and selection ratio (across the top).]
as standard scores with a mean of zero and an SD of 1.00, this table gives the expected mean criterion score of workers selected with a test of given validity and with a given selection ratio. In this context, the base output mean, corresponding to the performance of applicants selected without use of the test, is given in the column for zero validity. Using a test with zero validity is equivalent to using no test at all. To illustrate the use of the table, let us assume that the highest scoring 20 percent of the applicants are hired (selection ratio = .20) by means of a test whose validity coefficient is .50. Reference to Table 15 shows that the mean criterion performance of this group is .70 SD above the expected base mean of an untested sample. With the same 20 percent selection ratio and a perfect test (validity coefficient = 1.00), the mean criterion score of the accepted applicants would be 1.40, just twice what it would be with the test of validity .50. Similar direct linear relations will be found if other mean criterion performances are compared within any row of Table 15. For instance, with a selection ratio of 60 percent, a validity of .25 yields a mean criterion score of .16, while a validity of .50 yields a mean of .32. Again, doubling the validity doubles the output rise.

The evaluation of test validity in terms of either mean predicted output or proportion of persons exceeding a minimum criterion cutoff is obviously much more favorable than an evaluation based on the previously discussed error of estimate. The reason for the difference is that prediction errors that do not affect decisions are irrelevant to the selection situation. For example, if Smith and Jones are both superior workers and are both hired on the basis of the test, it does not matter if the test shows Smith to be better than Jones while in job performance Jones excels Smith.
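Under the same bivariate-normal assumption, the mean standard criterion score of the selected group is simply the validity coefficient multiplied by the mean standard test score of the top scorers, which makes Brogden's proportionality easy to verify. A minimal sketch (the function names are illustrative):

```python
import math

def upper_cutoff(selection_ratio):
    """Standard score x0 such that P(X > x0) equals the selection ratio."""
    lo, hi = -8.0, 8.0
    for _ in range(100):
        mid = (lo + hi) / 2
        p_above = 0.5 * (1 - math.erf(mid / math.sqrt(2)))
        if p_above > selection_ratio:
            lo = mid  # too many would be selected: raise the cutoff
        else:
            hi = mid
    return (lo + hi) / 2

def mean_criterion_score(validity, selection_ratio):
    """Expected mean standard criterion score of the selected group:
    validity times the mean test score of the top selection_ratio cases."""
    x0 = upper_cutoff(selection_ratio)
    density_at_cutoff = math.exp(-x0 * x0 / 2) / math.sqrt(2 * math.pi)
    return validity * density_at_cutoff / selection_ratio
```

With validity .50 and selection ratio .20 this reproduces the .70 cited above, and doubling the validity doubles the result, as Brogden showed.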
THE ROLE OF VALUES IN DECISION THEORY. It is characteristic of decision theory that tests are evaluated in terms of their effectiveness in a specific situation. Such evaluation takes into account not only the validity of the test in predicting a particular criterion but also a number of other parameters, including base rate and selection ratio. Another important parameter is the relative utility of expected outcomes, the judged favorableness or unfavorableness of each outcome. The lack of adequate systems for assigning values to outcomes in terms of a uniform utility scale is one of the chief obstacles to the application of decision theory. In industrial decisions, a dollar-and-cents value can frequently be assigned to different outcomes. Even in such cases, however, certain outcomes pertaining to good will, public relations, and employee morale are difficult to assess in monetary terms. Educational decisions must take into account institutional goals, social values, and other relatively intangible factors. Individual decisions, as in counseling, must consider the individual's preferences and value system. It has been repeatedly pointed out, however, that decision theory did not introduce the problem of values into the decision process, but merely made it explicit. Values have always entered into decisions, but they were not heretofore clearly recognized or systematically handled.
In choosing a decision strategy, the goal is to maximize expected utilities across all outcomes. Reference to the schematic representation of a simple decision strategy in Figure 18 will help to clarify the procedure. This diagram shows the decision strategy illustrated in Figure 17, in which a single test is administered to a group of applicants and the decision to accept or reject an applicant is made on the basis of a cutoff score on the test. There are four possible outcomes, including valid and false acceptances and valid and false rejections. The probability of each outcome can be found from the number of persons in each of the four sections of Figure 17. Since there were 100 applicants in that example, these numbers divided by 100 give the probabilities of the four outcomes listed in Figure 18. The other data needed are the utilities of the different outcomes, expressed on a common scale. The expected overall utility of the strategy could then be found by multiplying the probability of each outcome by the utility of the outcome, adding these products for the four outcomes, and subtracting a value corresponding to the cost of the test.² This last term highlights the fact that a test of low validity would be more likely to be retained if it is short, inexpensive, easily administered by relatively untrained personnel, and suitable for group administration. An individual test requiring a trained examiner or expensive equipment would need a higher validity to justify its use.
Decision: Administer test and apply cutoff score

Outcome            Probability
Valid Acceptance       .38
False Acceptance       .07
Valid Rejection        .33
False Rejection        .22

FIG. 18. A Simple Decision Strategy.
2 For a fictitious example illustrating all steps in these computations, see J. S. Wiggins (1973), pp. 257-274.
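The computation just described can be written out directly from the probabilities in Figure 18. The utility values and testing cost below are hypothetical placeholders on a common scale, since the text leaves them unspecified.

```python
# Outcome probabilities from Figure 18 (100 applicants in the example)
probabilities = {
    "valid acceptance": 0.38,
    "false acceptance": 0.07,
    "valid rejection": 0.33,
    "false rejection": 0.22,
}

# Hypothetical utilities; a real application would require the explicit
# value judgments discussed in the text.
utilities = {
    "valid acceptance": 1.0,
    "false acceptance": -1.0,
    "valid rejection": 0.5,
    "false rejection": -0.5,
}
testing_cost = 0.05  # hypothetical cost of administering the test

# Expected overall utility: sum of probability times utility for each
# outcome, minus the cost of testing.
expected_utility = sum(
    probabilities[outcome] * utilities[outcome] for outcome in probabilities
) - testing_cost
print(expected_utility)
```

Comparing this figure across alternative strategies (including selecting without the test) is the decision-theoretic criterion described above.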
SEQUENTIAL STRATEGIES AND ADAPTIVE TREATMENTS. In some situations, the effectiveness of a test may be increased through the use of more complex decision strategies, which take still more parameters into account. Two examples will serve to illustrate these possibilities. First, tests may be used to make sequential rather than terminal decisions. With the simple decision strategy illustrated in Figures 17 and 18, all decisions to accept or reject are treated as terminal. Figure 19, on the other hand, shows a two-stage sequential decision. Test A could be a short and easily administered screening test. On the basis of performance on this test, individuals would be sorted into three categories, including those clearly accepted or rejected, as well as an intermediate "uncertain" group to be examined further with more intensive techniques, represented by Test B. On the basis of the second-stage testing, this group would be sorted into accepted and rejected categories.

Such sequential testing can also be employed within a single testing session, to maximize the effective use of testing time (DeWitt & Weiss, 1974; Linn, Rock, & Cleary, 1969; Weiss & Betz, 1973). Although applicable to paper-and-pencil printed group tests, sequential testing is particularly well suited for computer testing. Essentially, the sequence of items or item groups within the test is determined by the examinee's own performance. For example, everyone might begin with a set of items of intermediate difficulty. Those who score poorly are routed to easier items; those who score well, to more difficult items. Such branching may occur repeatedly at several stages. The principal effect is that each examinee attempts only those items suited to his ability level, rather than trying all items. Sequential testing models will be discussed further in Chapter 11, in connection with the utilization of computers in group testing.

Another strategy, suitable for the diagnosis of psychological disorders, is to use only two categories, but to test further all cases classified as positives (i.e., possibly pathological) by the preliminary screening test. This is the strategy cited earlier in this section, in connection with the use of tests to diagnose pathological conditions with very low base rates.

It should also be noted that many personnel decisions are in effect sequential, although they may not be so perceived. Incompetent employees hired because of prediction errors can usually be discharged after a probationary period; failing students can be dropped from college at several stages. In such situations, it is only adverse selection decisions that are terminal. To be sure, incorrect selection decisions that are later rectified may be costly in terms of several value systems. But they are often less costly than terminal wrong decisions.

A second condition that may alter the effectiveness of a psychological test is the availability of alternative treatments and the possibility of adapting treatments to individual characteristics. An example would be the utilization of different training procedures for workers at different aptitude levels, or the introduction of compensatory educational programs for students with certain educational disabilities. Under these conditions, the decision strategy followed in individual cases should take into account available data on the interaction of initial test score and differential treatment. When adaptive treatments are utilized, the success rate is likely to be substantially improved. Because the assignment of individuals to alternative treatments is essentially a classification rather than a selection problem, more will be said about the required methodology in a later section on classification decisions.

The examples cited illustrate a few of the ways in which the concepts and rationale of decision theory can assist in the evaluation of psychological tests for specific testing purposes. Essentially, decision theory has served to focus attention on the complexity of factors that determine the contribution a given test can make in a particular situation. The validity coefficient alone cannot indicate whether or not a test should be used, since it is only one of the factors to be considered in evaluating the impact of the test on the efficacy of the total decision process.³
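The two-stage strategy of Figure 19 can be sketched in a few lines. The cutoff scores and the stand-in for the intensive Test B below are invented for illustration.

```python
def two_stage_decision(score_a, accept_cutoff, reject_cutoff, test_b):
    """Sequential strategy: Test A screens everyone; only the intermediate
    'uncertain' group is examined further with the intensive Test B."""
    if score_a >= accept_cutoff:
        return "accept"  # clearly accepted at stage one
    if score_a < reject_cutoff:
        return "reject"  # clearly rejected at stage one
    # uncertain group: the second-stage examination decides
    return "accept" if test_b() else "reject"

# Hypothetical cutoffs: accept at 70 or above, reject below 40,
# and refer everyone in between to Test B.
decision = two_stage_decision(55, 70, 40, test_b=lambda: True)
```

Because Test B is administered only to the uncertain group, the intensive examination is reserved for the cases in which it can change the decision.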
DIFFERENTIALLY PREDICTABLE SUBSETS OF PERSONS. The validity of a test for a given criterion may vary among subgroups differing in personal characteristics. The classic psychometric model assumes that prediction errors are characteristic of the test rather than of the person and that these errors are randomly distributed among persons. With the flexibility of approach ushered in by decision theory, there has been increasing exploration of prediction models involving interaction between persons and

3 For a fuller discussion of the implications of decision theory for test use, see J. S. Wiggins (1973), Ch. 6, and at a more technical level, Cronbach and Gleser (1965).
tests. Such interaction implies that the same test may be a better predictor for certain classes or subsets of persons than it is for others. For example, a given test may be a better predictor of criterion performance for men than for women, or a better predictor for applicants from a lower than for applicants from a higher socioeconomic level. In these examples, sex and socioeconomic level are known as moderator variables, since they moderate the validity of the test (Saunders, 1956).

When computed in a total group, the validity coefficient of a test may be too low to be of much practical value in prediction. But when recomputed in subsets of individuals differing in some identifiable characteristic, validity may be high in one subset and negligible in another. The test could thus be used effectively in making decisions regarding persons in the first group but not in the second. Perhaps another test or some other assessment device could be found that is an effective predictor in the second group.

A moderator variable is some characteristic of persons that makes it possible to predict the predictability of different individuals with a given instrument. It may be a demographic variable, such as sex, age, educational level, or socioeconomic background; or it may be a score on another test. Interests and motivation often function as moderator variables. Thus, if an applicant has little interest in a job, he will probably perform poorly regardless of his scores on relevant aptitude tests. Among such persons, the correlation between aptitude test scores and job performance would be low. For individuals who are interested and highly motivated, on the other hand, the correlation between aptitude test score and job success may be quite high.
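Checking for a moderator variable amounts to recomputing the test-criterion correlation within each subgroup. A sketch of the mechanics, with made-up scores that illustrate the pattern rather than the results of any study cited here:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def validity_by_subgroup(records):
    """Validity coefficient computed separately within each subgroup.
    records: iterable of (moderator_level, test_score, criterion_score)."""
    groups = {}
    for level, test, criterion in records:
        tests, criteria = groups.setdefault(level, ([], []))
        tests.append(test)
        criteria.append(criterion)
    return {level: pearson_r(t, c) for level, (t, c) in groups.items()}
```

Applied to real data, a high coefficient in one subgroup alongside a negligible one in another is the signature of a moderator variable.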
EMPIRICAL EXAMPLES OF MODERATOR VARIABLES. Evidence for the operation of moderator variables comes from a variety of sources. In a survey of several hundred correlation coefficients between aptitude test scores and academic grades, H. G. Seashore (1962) found higher correlations for women than for men in the large majority of instances. The same trend was found in high school and college, although the trend was more pronounced at the college level. The data do not indicate the reason for this sex difference in the predictability of academic achievement, but it may be interesting to speculate about it in the light of other known sex differences. If women students in general tend to be more conforming and more inclined to accept the values and standards of the school situation, their class achievement will probably depend largely on their abilities. If, on the other hand, men students tend to concentrate their efforts on those activities (in or out of school) that arouse their individual interests, these interest differences would introduce additional variance in their course achievement and would make it more difficult to predict achievement from test scores. Whatever the reason for the difference, sex does appear to function as a moderator variable in the predictability of academic grades from aptitude test scores.
A number of investigations have been specially designed to assess the
role of moderator variables in the prediction of academic achievement.
Several studies (Frederiksen & Gilbert, 1960; Frederiksen & Melville, 1954; Stricker, 1966) tested the hypothesis that the more compulsive students, identified through two tests of compulsivity, would put a great
deal of effort into their course work, regardless of their interest in the
courses, but that the effort of the less compulsive students would depend
on their interest. Since effort will be reflected in grades, the correlation
between the appropriate interest test scores and grades should be higher
among noncompulsive than among compulsive students. This hypothesis
was confirmed in several groups of male engineering students, but not
among liberal arts students of either sex. Moreover, lack of agreement
among different indicators of compulsivity casts doubt on the generality
of the construct that was being measured.
In another study (Grooms & Endler, 1960), the college grades of the
more anxious students correlated higher (r = .63) with aptitude and
achievement test scores than did the grades of the less anxious students
(r = .19). A different approach is illustrated by Berdie (1961), who in-
vestigated the relation between intraindividual variability on a test and
the predictive validity of the same test. It was hypothesized that a given test will be a better predictor for those individuals who perform more
consistently in different parts of the test-and whose total scores are thus
more reliable. Although the hypothesis was partially confirmed, the re-
lation proved to be more complex than anticipated (Berdie, 1969).
In a different context, there is evidence that self-report personality in-
ventories may have higher validity for some types of neurotics than for
others (Fulkerson, 1959). The characteristic behavior of the two types
tends to make one type careful and accurate in reporting symptoms, the
other careless and evasive. The individual who is characteristically precise and careful about details, who tends to worry about his problems, and who uses intellectualization as a primary defense is likely to provide a more accurate picture of his emotional difficulties on a self-report inventory than is the impulsive, careless individual who tends to avoid expressing unpleasant thoughts and emotions and who uses denial as a primary defense.
Ghiselli (1956, 1960a, 1960b, 1963, 1968; Ghiselli & Sanders, 1967) has extensively explored the role of moderator variables in industrial situations. In a study of taxi drivers (Ghiselli, 1956), the correlation between an aptitude test and a job-performance criterion in the total applicant sample was only .220. The group was then sorted into thirds on the basis of their scores on an occupational interest test. When the validity of the
aptitude test was recomputed within the third whose occupational interest level was most appropriate for the job, it rose to .664.

A technique employed by Ghiselli in much of his research consists in finding for each individual the absolute difference (D) between his actual and his predicted criterion scores. The smaller the value of D, the more predictable is the individual's criterion score. A predictability scale is then developed by comparing the item responses of two contrasted subgroups selected on the basis of their D scores. The predictability scale is subsequently applied to a new sample, to identify highly predictable and poorly predictable subgroups, and the validity of the original test is compared in these two subgroups. This approach has shown considerable promise as a means of identifying persons for whom a test will be a good or a poor predictor. An extension of the same procedure has been developed to determine in advance which of two tests will be a better predictor for each individual (Ghiselli, 1960a).

Other investigators (Dunnette, 1972; Hobert & Dunnette, 1967) have argued that Ghiselli's D index, based on the absolute amount of prediction error without regard to direction of error, may obscure important individual differences. Alternative procedures, involving separate analyses of overpredicted and underpredicted cases, have accordingly been proposed.

At this time, the identification and use of moderator variables are still in an exploratory phase. Considerable caution is required to avoid methodological pitfalls (see, e.g., Abrahams & Alf, 1972a, 1972b; Dunnette, 1972; Ghiselli, 1972; Velicer, 1972a, 1972b). The results are usually quite specific to the situations in which they were obtained. And it is important to check the extent to which the use of moderators actually improves the prediction that could be achieved through other, more direct means (Pinder, 1973).

For the prediction of practical criteria, not one but several tests are generally required. Most criteria are complex, the criterion measure depending on a number of different traits. A single test designed to measure such a criterion would thus have to be highly heterogeneous. It has already been pointed out, however, that a relatively homogeneous test, measuring largely a single trait, is more satisfactory because it yields less ambiguous scores (Ch. 5). Hence, it is usually preferable to use a combination of several relatively homogeneous tests, each covering a different aspect of the criterion, rather than a single test consisting of a hodgepodge of many different kinds of items.

When a number of specially selected tests are employed together to predict a single criterion, they are known as a test battery. The chief problem arising in the use of such batteries concerns the way in which scores on the different tests are to be combined in arriving at a decision regarding each individual. The statistical procedures followed for this purpose are of two major types, namely, multiple regression equation and multiple cutoff scores.

When tests are administered in the intensive study of individual cases, as in clinical diagnosis, counseling, or the evaluation of high-level executives, it is a common practice for the examiner to utilize test scores without further statistical analysis. In preparing a case report and in making recommendations, the examiner relies on judgment, past experience, and theoretical rationale to interpret score patterns and integrate findings from different tests. Such clinical use of test scores will be discussed further in Chapter 16.

MULTIPLE REGRESSION EQUATION. The multiple regression equation yields a predicted criterion score for each individual on the basis of his scores on all the tests in the battery. The following regression equation illustrates the application of this technique to predicting a student's achievement in high school mathematics courses from his scores on verbal (V), numerical (N), and reasoning (R) tests:

Mathematics Achievement = .21V + .21N + .32R + 1.35

In this equation, the student's stanine score on each of the three tests is multiplied by the corresponding weight given in the equation. The sum of these products, plus a constant (1.35), gives the student's predicted stanine position in mathematics courses.

Suppose that Bill Jones receives the following stanine scores:

Verbal      6
Numerical   4
Reasoning   8

The estimated mathematics achievement of this student is found as follows:

Math. Achiev. = (.21)(6) + (.21)(4) + (.32)(8) + 1.35 = 6.01

Bill's predicted stanine is approximately 6. It will be recalled (Ch. 4) that a stanine of 5 represents average performance. Bill would thus be expected to do somewhat better than average in mathematics courses. His very superior performance in the reasoning test (R = 8) and his above-average score on the verbal test (V = 6) compensate for his poor score in speed and accuracy of computation (N = 4).
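The computation for Bill Jones can be verified in a few lines, using the weights (.21, .21, .32) and the constant 1.35 from the worked computation above.

```python
def predicted_math_stanine(verbal, numerical, reasoning):
    """Regression equation from the text:
    Mathematics Achievement = .21V + .21N + .32R + 1.35 (stanine scores)."""
    return 0.21 * verbal + 0.21 * numerical + 0.32 * reasoning + 1.35

# Bill Jones: V = 6, N = 4, R = 8
score = predicted_math_stanine(6, 4, 8)
print(round(score, 2))
```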
Specific techniques for the computation of regression equations can be