Test Construction-RV Etc McAnulty
7/17/2019 Test Construction-RV Etc McAnulty
http://slidepdf.com/reader/full/test-construction-rv-etc-mcanulty 1/7
The discussion of the construction, evaluation and application of
psychological tests is beyond the scope of this course. However, issues of
the reliability and validity of a psychological test are parallel to concerns
that one may have about any measure. A psychological test is simply an
approach to measurement often used in psychology.
So here we provide a brief overview of some of the traditional ways of
thinking about the reliability and validity of tests. Be warned, though, that
when we discuss the threats to validity in experimentation we shall be
using a rather different conceptual framework.
I. Test construction: Introduction and Overview
II. Reliability
III. Validity
IV. Item Analysis
V. Test Interpretation
I. Test construction: Introduction and Overview
A. Definition of Psychological Tests:
"an objective and standardized measure of a sample of behavior."
B. Standards of Test Construction and Test Use:
A good test should be reliable and valid. Concerns relating to standards
include user qualification, security of test content, confidentiality of test
results, and the prevention of the misuse of tests and results.
C. Test Characteristics and Response sets.
Characteristics include
i. a test of maximum performance (e.g., achievement test) which tells us
what a person can do.
ii. a test of typical performance (e.g., personality test) which tells us what a
person usually does.
iii. a speed test, in which response rate is assessed.
iv. a mastery test, which assesses whether or not the person can attain a pre-specified
mastery level of performance.
Response sets include: social desirability (giving responses that are
perceived to be socially acceptable), acquiescence (agreeing or disagreeing
with everything), and deviation (giving unusual or uncommon responses).
All of the above can threaten the validity of a given set of results.
II. Reliability
A. Classical test theory states:
a test is reliable a) to the degree that it is free from error and provides
information about examinees’ "true" test scores and b) to the degree that it
provides repeatable, consistent results.
B. Methods of estimating reliability
[we did not go over this as it involves calculating a reliability coefficient--a
correlation coefficient--which we have not discussed yet]
C. Standard error of measurement.
[we skipped over, as we are not ready to discuss yet]
D. Factors affecting the reliability coefficient
Any factor which reduces score variability or increases measurement error
will also reduce the reliability coefficient. For example, all other things being
equal, short tests are less reliable than long ones, very easy and very difficult
tests are less reliable than moderately difficult tests, and tests where
examinees’ scores are affected by guessing (e.g., true-false) have lowered
reliability coefficients.
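The effect of test length can be illustrated with the Spearman-Brown prophecy formula. This is a standard classical test theory result that these notes do not cover, so it is brought in here only as a sketch. The function name and the starting reliability of 0.60 are assumptions made for the example.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predicted reliability when a test is lengthened by `length_factor`.

    Assumes the added (or removed) items are comparable to the existing ones.
    """
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


# Hypothetical starting reliability, chosen only for illustration.
original = 0.60
doubled = spearman_brown(original, 2.0)  # test made twice as long
halved = spearman_brown(original, 0.5)   # test cut to half its length

print(f"half length: {halved:.3f}, original: {original:.3f}, double length: {doubled:.3f}")
```

Under these assumptions the predicted reliability rises when the test is lengthened and falls when it is shortened, which matches the claim that short tests are less reliable than long ones.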
III. Validity
Three major categories:
content, criterion-related, and construct validity
1) content validity:
A test has content validity if it measures knowledge of the content domain
it was designed to cover. Another way of saying this is that content validity
concerns, primarily, how adequately and representatively the test items
sample the content area to be measured. For example, a comprehensive math
achievement test would lack content validity if good scores depended primarily
on knowledge of English, or if it only had questions about one aspect of math
(e.g., algebra). Content validity is primarily an issue for educational tests,
certain industrial tests, and other tests of content knowledge like the
Psychology Licensing Exam.
Expert judgement (not statistics) is the primary method used to determine
whether a test has content validity. Nevertheless, the test should have a high
correlation with other tests that purport to sample the same content domain.
This is different from face validity: face validity is when a test appears valid
to examinees who take it, personnel who administer it and other untrained
observers. Face validity is not validity in the technical sense; i.e., just because
a test has face validity does not mean it will be valid in the technical sense
of the word. "Just because it looks valid doesn’t mean it is."
2) criterion-related validity:
Criterion-related validity is a concern for tests that are designed to predict
someone’s status on an external criterion measure. A test has criterion-related
validity if it is useful for predicting a person’s behavior in a specified
situation.
2a) Concurrent vs. predictive validation:
First, the term "validation" refers to the procedures used to determine how
valid a predictor is. There are two types.
In concurrent validation, the predictor and criterion data are collected at or
about the same time. This kind of validation is appropriate for tests designed
to assess a person’s current criterion status. It suits diagnostic screening
tests, where the goal is to diagnose a current condition.
In predictive validation, the predictor scores are collected first and criterion
data are collected at some later point. This is appropriate for tests
designed to assess a person’s future status on a criterion.
2b) Standard Error of estimate
The standard error of estimate (σ_est) is used to estimate the range in which
a person’s true score on a criterion is likely to fall, given his/her score as
estimated by a predictor.
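As a rough sketch of how such a range is obtained, the usual textbook formula is σ_est = SD_criterion × sqrt(1 − r²), where r is the criterion-related validity coefficient. The notes do not state this formula, so it is supplied here as an assumption. The function names, the example numbers, and the use of ±1.96 for an approximate 95% band (which assumes roughly normal prediction errors) are also assumptions for illustration.

```python
import math


def standard_error_of_estimate(sd_criterion: float, validity: float) -> float:
    """sigma_est = SD of the criterion * sqrt(1 - r^2)."""
    return sd_criterion * math.sqrt(1 - validity ** 2)


def criterion_band(predicted: float, se_est: float, z: float = 1.96):
    """Range around a predicted criterion score (about 95% when z = 1.96)."""
    return predicted - z * se_est, predicted + z * se_est


# Hypothetical values: criterion SD of 10, validity coefficient of 0.6,
# and a predicted criterion score of 50.
se = standard_error_of_estimate(sd_criterion=10.0, validity=0.6)
low, high = criterion_band(predicted=50.0, se_est=se)

print(f"sigma_est = {se:.2f}; likely range = {low:.2f} to {high:.2f}")
```

Note that the band narrows as the validity coefficient approaches 1 and widens to the full spread of the criterion as it approaches 0.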
2c) Decision-Making
In many cases when using predictor tests, the goal is to predict whether or
not a person will meet or exceed a minimum standard of criterion
performance — the criterion cutoff point. When a predictor is to be used in
this manner, the goal of the validation study is to set an optimal predictor
cutoff score; an examinee who scores at or above the predictor cutoff is
predicted to score at or above the criterion cutoff.
We then get four possible outcomes:
True positive (valid acceptance): accurately identified by the predictor as
meeting the criterion standard.
False positive (false acceptance): incorrectly identified by the predictor
as meeting the criterion standard.
True negative (valid rejection): accurately identified by the predictor as not
meeting the criterion standard.
False negative (invalid rejection): incorrectly identified by the predictor as
not meeting the criterion standard; the person meets it, even though the
predictor indicated s/he wouldn’t.
2d) Factors affecting the criterion-related validity coefficient:
This is about factors that potentially affect the magnitude of the criterion-
related validity coefficient. We will not go into this as it relates to
correlation.
3) Construct validity
A test has construct validity if it accurately measures a theoretical,
non-observable construct or trait. The construct validity of a test is worked out
over a period of time on the basis of an accumulation of evidence. There are
a number of ways to establish construct validity.
Two methods of establishing a test’s construct validity are
convergent/divergent validation and factor analysis.
3a) Convergent/divergent validation
A test has convergent validity if it has a high correlation with another test
that measures the same construct. By contrast, a test’s divergent validity is
demonstrated through a low correlation with a test that measures a different
construct. Note this is the only case when a low correlation coefficient (between
two tests that measure different traits) provides evidence of high validity.
The multitrait-multimethod matrix is one way to assess a test’s convergent
and divergent validity. We will not get into this right now.
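Short of a full multitrait-multimethod matrix, the basic comparison can be sketched with two correlations. The test names and scores below are invented for illustration, and the Pearson correlation is written out by hand so the example does not depend on any particular library.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


# Hypothetical scores for the same six examinees on three tests.
new_anxiety_test = [12, 15, 9, 20, 17, 11]
established_anxiety_test = [14, 16, 10, 22, 18, 12]  # same construct
vocabulary_test = [29, 33, 32, 31, 27, 28]           # different construct

convergent_r = pearson_r(new_anxiety_test, established_anxiety_test)
divergent_r = pearson_r(new_anxiety_test, vocabulary_test)

print(f"convergent r = {convergent_r:.2f}, divergent r = {divergent_r:.2f}")
```

With these made-up data the new test correlates highly with the established test of the same construct and close to zero with the unrelated one, which is the pattern that would count as evidence of construct validity.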
3b) Factor analysis
Factor analysis is a complex statistical procedure which is conducted for a
variety of purposes, one of which is to assess the construct validity of a test
or a number of tests. We will get there.
3c) Other methods of assessing construct validity:
One is to assess the test’s internal consistency. That is, if a test has construct
validity, scores on the individual test items should correlate highly with the
total test score. This is evidence that the test is measuring a single construct.
Another is developmental change. Tests measuring certain constructs can be
shown to have construct validity if the scores on the tests show predictable
developmental changes over time.
A third is experimental intervention. That is, if a test has construct validity,
scores should change following an experimental manipulation, in the direction
predicted by the theory underlying the construct.
4) Relationship between reliability and validity
If a test is unreliable, it cannot be valid.
For a test to be valid, it must be reliable.
However, just because a test is reliable does not mean it will be valid.
Reliability is a necessary but not sufficient condition for validity!
IV. Item Analysis
There are a variety of techniques for performing an item analysis, which is
often used, for example, to determine which items will be kept for the final
version of a test. Item analysis is used to help "build" reliability and validity
"into" the test from the start. Item analysis can be both qualitative and
quantitative. The former focuses on issues related to the content of the test,
e.g., content validity. The latter primarily includes measurement of item
difficulty and item discrimination.
Item difficulty: an item’s difficulty level is usually measured in terms of the
percentage of examinees who answer the item correctly. This percentage is
referred to as the item difficulty index, or "p".
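A minimal sketch of the calculation, assuming items are scored 1 (correct) or 0 (incorrect). The score matrix is made up, and p is expressed as a proportion between 0 and 1 rather than a percentage.

```python
def item_difficulty(responses):
    """Proportion of examinees answering one item correctly (0/1 scoring)."""
    return sum(responses) / len(responses)


# Hypothetical data: rows are examinees, columns are items.
scores = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [1, 0, 0],
]

# zip(*scores) turns the rows into per-item columns.
p_values = [item_difficulty(column) for column in zip(*scores)]

for number, p in enumerate(p_values, start=1):
    print(f"item {number}: p = {p:.2f}")
```

Note that a higher p means an easier item: in this example every examinee answers the first item correctly, while only one in four answers the third.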
Item discrimination refers to the degree to which items differentiate among
examinees in terms of the characteristic being measured (e.g., between high
and low scorers). This can be measured in many ways. One method is to
correlate item responses with the total test score; items with the highest
correlation with the total score are retained for the final version of the test.
This would be appropriate when a test measures only one attribute and
internal consistency is important.
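The item-total method might look like the following sketch. The data are invented, and for simplicity each item is correlated with a total that still includes the item itself (an "uncorrected" item-total correlation); that simplification is an assumption of this example, not something the notes specify.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


# Hypothetical 0/1 scores: rows are examinees, columns are items.
scores = [
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 0],
    [0, 1, 0, 1],
    [0, 0, 0, 1],
    [1, 1, 1, 0],
]

totals = [sum(row) for row in scores]
item_total_r = [pearson_r(list(column), totals) for column in zip(*scores)]

for number, r in enumerate(item_total_r, start=1):
    print(f"item {number}: item-total r = {r:.2f}")
```

In this made-up data set the last item correlates negatively with the total score, so it would be the first candidate for removal.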
Another way is a discrimination index (D).
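The notes do not define D. One common definition, assumed here, is the difference between the proportion of high scorers and the proportion of low scorers who answer the item correctly, with the two groups taken from the top and bottom of the total-score distribution. The group fraction, the data, and the function name are all illustrative assumptions.

```python
def discrimination_index(item_scores, total_scores, fraction=0.27):
    """D = p(upper group) - p(lower group) for one 0/1-scored item.

    Groups are the top and bottom `fraction` of examinees by total score.
    """
    n = len(total_scores)
    group_size = max(1, round(n * fraction))
    order = sorted(range(n), key=lambda i: total_scores[i])  # ascending by total
    lower, upper = order[:group_size], order[-group_size:]
    p_upper = sum(item_scores[i] for i in upper) / group_size
    p_lower = sum(item_scores[i] for i in lower) / group_size
    return p_upper - p_lower


# Hypothetical data for ten examinees.
totals = [9, 8, 8, 7, 5, 5, 4, 3, 2, 1]
item = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]

d = discrimination_index(item, totals, fraction=0.3)
print(f"D = {d:.2f}")
```

Under this definition D ranges from -1 to +1; values near +1 mean the item separates high and low scorers well, and negative values mean low scorers do better on the item than high scorers.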
V. Interpretation of Test Scores
1. Norm-referenced interpretation:
involves comparing an examinee’s score to that of others in a normative
sample. It provides an indication of where the examinee stands in relation to
others who have taken the test. Norm-referenced scores include
developmental scores (e.g., mental age scores and grade equivalents), which
indicate how far along the normal developmental path an individual has
progressed, and within-group norms, which provide a comparison to the
scores of other individuals whom the examinee most resembles.
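Two widely used within-group norms are the percentile rank and the z-score. The notes do not name them, so the sketch below is offered only as an illustration. The normative sample is invented, and percentile rank is taken here as the percentage of the sample scoring strictly below the examinee, which is one of several conventions.

```python
import statistics


def percentile_rank(score, norm_sample):
    """Percentage of the normative sample scoring below `score`."""
    below = sum(1 for s in norm_sample if s < score)
    return 100 * below / len(norm_sample)


def z_score(score, norm_sample):
    """Distance from the normative mean in standard deviation units."""
    mean = statistics.mean(norm_sample)
    sd = statistics.pstdev(norm_sample)  # treats the sample as the full norm group
    return (score - mean) / sd


# Hypothetical normative sample and examinee score.
norm_sample = [40, 45, 50, 50, 55, 60, 65, 70, 75, 90]
examinee_score = 70

rank = percentile_rank(examinee_score, norm_sample)
z = z_score(examinee_score, norm_sample)
print(f"percentile rank = {rank:.0f}, z = {z:.2f}")
```

Both numbers say where the examinee stands relative to the group and nothing about whether the score meets any fixed standard.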
2. Criterion-referenced interpretation:
involves interpreting an examinee’s test score in terms of an external pre-
established standard of performance.
http://www.mathcs.duq.edu/~packer/Courses/Psy624/test.html