Test Construction-RV Etc McAnulty

The discussion of the construction, evaluation and application of psychological tests is beyond the scope of this course. However, issues of the reliability and validity of a psychological test parallel the concerns that one may have about any measure. A psychological test is simply an approach to measurement often used in psychology.

So here we provide a brief overview of some of the traditional ways of thinking about the reliability and validity of tests. Be warned, though, that when we discuss the threats to validity in experimentation we shall be using a rather different conceptual framework.

I. Test construction: Introduction and Overview

II. Reliability

III. Validity

IV. Item Analysis

V. Test Interpretation

I. Test construction: Introduction and Overview

A. Definition of Psychological Tests:

"an objective and standardized measure of a sample of behavior."

B. Standards of Test Construction and Test Use:

A good test should be reliable and valid. Concerns relating to standards include user qualification, security of test content, confidentiality of test results, and the prevention of the misuse of tests and results.



C. Test Characteristics and Response sets.

Characteristics include:

i. a test of maximum performance (e.g., achievement test), which tells us what a person can do.

ii. a test of typical performance (e.g., personality test), which tells us what a person usually does.

iii. a speed test, in which response rate is assessed.

iv. a mastery test, which assesses whether or not the person can attain a pre-specified mastery level of performance.

Response sets include: social desirability (giving responses that are perceived to be socially acceptable), acquiescence (agreeing or disagreeing with everything), and deviation (giving unusual or uncommon responses).

All of the above can threaten the validity of a given set of results.

II. Reliability

A. Classical test theory states:

A test is reliable a) to the degree that it is free from error and provides information about examinees’ "true" test scores, and b) to the degree that it provides repeatable, consistent results.
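Point a) can be illustrated with a small sketch. It assumes the usual classical test theory decomposition (observed score = true score + error) and treats reliability as the share of observed-score variance that is true-score variance. The function name and the data are hypothetical, and in practice true scores are never observed directly, so this is only an illustration of the idea.

```python
from statistics import pvariance

def reliability_ratio(true_scores, errors):
    """Proportion of observed-score variance that is true-score variance."""
    # Classical test theory assumption: observed score = true score + error.
    observed = [t + e for t, e in zip(true_scores, errors)]
    return pvariance(true_scores) / pvariance(observed)

# Hypothetical data: five examinees, errors uncorrelated with true scores.
true_scores = [10, 12, 14, 16, 18]
error_free = [0, 0, 0, 0, 0]
noisy = [2, -2, 0, -2, 2]

perfect = reliability_ratio(true_scores, error_free)  # test free from error
reduced = reliability_ratio(true_scores, noisy)       # error lowers the ratio
```

With no error the ratio is 1; adding error variance pulls it below 1.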

B. Methods of estimating reliability

[We did not go over this, as it involves calculating a reliability coefficient (a correlation coefficient), which we have not discussed yet.]

C. Standard error of measurement.

[We skipped this, as we are not ready to discuss it yet.]

D. Factors affecting the reliability coefficient



Any factor which reduces score variability or increases measurement error will also reduce the reliability coefficient. For example, all other things being equal, short tests are less reliable than long ones, very easy and very difficult tests are less reliable than moderately difficult tests, and tests where examinees’ scores are affected by guessing (e.g., true-false) have lowered reliability coefficients.
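The effect of test length can be sketched with the Spearman-Brown formula. That formula is not covered in these notes; it is brought in here only as a standard illustration, and it assumes the added (or removed) items are comparable to the existing ones. The function name and the starting reliability of 0.60 are hypothetical.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when test length is multiplied by length_factor."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

current = 0.60                          # hypothetical reliability of the current test
doubled = spearman_brown(current, 2)    # twice as many comparable items
halved = spearman_brown(current, 0.5)   # half as many items
```

Under these assumptions, doubling the test raises the predicted reliability to about 0.75, while halving it lowers the prediction to about 0.43.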

III. Validity

Three major categories:

content, criterion-related, and construct validity

1) content validity:

A test has content validity if it measures knowledge of the content domain it was designed to measure. Put another way, content validity concerns how adequately and representatively the test items sample the content area to be measured. For example, a comprehensive math achievement test would lack content validity if good scores depended primarily on knowledge of English, or if it only had questions about one aspect of math (e.g., algebra). Content validity is primarily an issue for educational tests, certain industrial tests, and other tests of content knowledge such as the Psychology Licensing Exam.

Expert judgement (not statistics) is the primary method used to determine whether a test has content validity. Nevertheless, the test should correlate highly with other tests that purport to sample the same content domain.

This is different from face validity: a test has face validity when it appears valid to examinees who take it, personnel who administer it, and other untrained observers. Face validity is not validity in the technical sense; i.e., just because a test has face validity does not mean it will be valid in the technical sense of the word. "Just because it looks valid doesn’t mean it is."

2) criterion-related validity:

Criterion-related validity is a concern for tests that are designed to predict someone’s status on an external criterion measure. A test has criterion-related validity if it is useful for predicting a person’s behavior in a specified situation.

2a) Concurrent vs. predictive validation:

First, the term "validation" refers to the procedures used to determine how valid a predictor is. There are two types.

In concurrent validation, the predictor and criterion data are collected at or about the same time. This kind of validation is appropriate for tests designed to assess a person’s current criterion status. It is well suited to diagnostic screening tests, where the goal is to diagnose.

In predictive validation, the predictor scores are collected first and criterion data are collected at some later point. This is appropriate for tests designed to assess a person’s future status on a criterion.

2b) Standard Error of estimate

The standard error of estimate (σ_est) is used to estimate the range in which a person’s true score on a criterion is likely to fall, given the score estimated for him/her from a predictor.
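As a rough sketch of how such a range could be computed: a commonly used formula takes the criterion's standard deviation and multiplies it by the square root of (1 minus the squared validity coefficient). The notes do not give this formula, so treat it as an assumption, along with the function names, the example numbers, and the use of 1.96 for an approximate 95% band (which assumes normally distributed prediction errors).

```python
import math

def standard_error_of_estimate(criterion_sd, validity_coefficient):
    """Criterion SD scaled by sqrt(1 - r^2), r being the predictor-criterion correlation."""
    return criterion_sd * math.sqrt(1 - validity_coefficient ** 2)

def prediction_band(predicted, see, z=1.96):
    """Range around a predicted criterion score (about 95% when z = 1.96)."""
    return (predicted - z * see, predicted + z * see)

# Hypothetical values: criterion SD of 10, validity coefficient of 0.60.
see = standard_error_of_estimate(10, 0.60)
low, high = prediction_band(50, see)
```

Note that a validity coefficient of 0 leaves the band as wide as the criterion's own spread, while a coefficient of 1 shrinks it to a single point.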

2c) Decision-Making

In many cases when using predictor tests, the goal is to predict whether or not a person will meet or exceed a minimum standard of criterion performance (the criterion cutoff point). When a predictor is to be used in this manner, the goal of the validation study is to set an optimal predictor cutoff score; an examinee who scores at or above the predictor cutoff is predicted to score at or above the criterion cutoff.

This yields four possible outcomes:

True positive (valid acceptance): accurately identified by the predictor as meeting the criterion standard.

False positive (false acceptance): incorrectly identified by the predictor as meeting the criterion standard.

True negative (valid rejection): accurately identified by the predictor as not meeting the criterion standard.

False negative (false rejection): meets the criterion standard, even though the predictor indicated s/he wouldn’t.
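The four outcomes can be tallied with a short sketch. The function name, the scores, and both cutoffs are made up for illustration; "at or above the cutoff" is treated as passing, as in the text.

```python
def classify_outcomes(predictor, criterion, predictor_cutoff, criterion_cutoff):
    """Count the four decision outcomes for paired predictor/criterion scores."""
    counts = {"true_positive": 0, "false_positive": 0,
              "true_negative": 0, "false_negative": 0}
    for p, c in zip(predictor, criterion):
        accepted = p >= predictor_cutoff      # predicted to meet the standard
        succeeded = c >= criterion_cutoff     # actually met the standard
        if accepted and succeeded:
            counts["true_positive"] += 1
        elif accepted:
            counts["false_positive"] += 1
        elif succeeded:
            counts["false_negative"] += 1
        else:
            counts["true_negative"] += 1
    return counts

# Hypothetical data: six applicants, test scores and later performance ratings.
predictor = [55, 72, 64, 80, 45, 68]
criterion = [3.1, 3.6, 2.4, 3.9, 2.0, 2.8]
outcomes = classify_outcomes(predictor, criterion,
                             predictor_cutoff=65, criterion_cutoff=3.0)
```

Raising the predictor cutoff would tend to trade false positives for false negatives, which is why the validation study looks for an optimal cutoff.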

2d) Factors affecting the criterion-related validity coefficient:

This concerns factors that can affect the magnitude of the criterion-related validity coefficient. We will not go into this, as it relates to correlation.

3) Construct validity

A test has construct validity if it accurately measures a theoretical, non-observable construct or trait. The construct validity of a test is worked out over a period of time on the basis of an accumulation of evidence. There are a number of ways to establish construct validity; two of them are convergent/divergent validation and factor analysis.

3a) Convergent/divergent validation

A test has convergent validity if it has a high correlation with another test that measures the same construct. By contrast, a test’s divergent validity is demonstrated through a low correlation with a test that measures a different construct. Note this is the only case when a low correlation coefficient (between two tests that measure different traits) provides evidence of high validity.
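A minimal sketch of the comparison, assuming the Pearson correlation is the coefficient in question. The helper name, the test labels, and all scores are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)

# Hypothetical scores for the same five examinees on three tests.
new_test = [10, 12, 14, 16, 18]
same_construct = [11, 12, 15, 15, 19]      # established test of the same trait
different_construct = [7, 3, 5, 3, 7]      # test of an unrelated trait

convergent_r = pearson_r(new_test, same_construct)       # should be high
divergent_r = pearson_r(new_test, different_construct)   # should be low
```

Here the pattern that supports construct validity is a high convergent coefficient together with a near-zero divergent one.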

The multitrait-multimethod matrix is one way to assess a test’s convergent

and divergent validity. We will not get into this right now.

3b) Factor analysis

Factor analysis is a complex statistical procedure which is conducted for a variety of purposes, one of which is to assess the construct validity of a test or a number of tests. We will get there.

3c) Other methods of assessing construct validity:

We can assess the test’s internal consistency. That is, if a test has construct validity, scores on the individual test items should correlate highly with the total test score. This is evidence that the test is measuring a single construct.

We can also look at developmental changes. Tests measuring certain constructs can be shown to have construct validity if the scores on the tests show predictable developmental changes over time.

Finally, there is experimental intervention. That is, if a test has construct validity, scores should change following an experimental manipulation, in the direction predicted by the theory underlying the construct.

4) Relationship between reliability and validity

If a test is unreliable, it cannot be valid.

For a test to be valid, it must be reliable.

However, just because a test is reliable does not mean it will be valid.

Reliability is a necessary but not sufficient condition for validity!
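One way to make "necessary but not sufficient" concrete is a standard classical test theory result that these notes do not state: the validity coefficient cannot exceed the square root of the product of the predictor's and the criterion's reliabilities. The sketch below assumes that result; the function name and the example reliabilities are hypothetical.

```python
import math

def max_validity(predictor_reliability, criterion_reliability=1.0):
    """Upper bound on the validity coefficient implied by the two reliabilities."""
    return math.sqrt(predictor_reliability * criterion_reliability)

ceiling_moderate = max_validity(0.64)   # reliability 0.64, error-free criterion
ceiling_zero = max_validity(0.0)        # a completely unreliable test
```

A completely unreliable test has a ceiling of zero, so it cannot be valid. A high ceiling, on the other hand, only says how valid the test could be; the actual validity coefficient may still be anywhere below it, including zero.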

IV. Item Analysis

There are a variety of techniques for performing an item analysis, which is often used, for example, to determine which items will be kept for the final version of a test. Item analysis is used to help "build" reliability and validity "into" the test from the start. Item analysis can be both qualitative and quantitative. The former focuses on issues related to the content of the test, e.g., content validity. The latter primarily includes measurement of item difficulty and item discrimination.

Item difficulty: an item’s difficulty level is usually measured in terms of the percentage of examinees who answer the item correctly. This percentage is referred to as the item difficulty index, or "p".
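A minimal sketch of the calculation, assuming items are scored 1 (correct) or 0 (incorrect). The function name and the response data are hypothetical, and p is returned as a proportion (multiply by 100 for a percentage).

```python
def item_difficulty(responses):
    """Proportion of examinees answering each item correctly.

    responses: one row per examinee, one 0/1 entry per item.
    """
    n_examinees = len(responses)
    n_items = len(responses[0])
    return [sum(row[i] for row in responses) / n_examinees
            for i in range(n_items)]

# Hypothetical data: four examinees, three items.
responses = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [1, 0, 0],
]
p_values = item_difficulty(responses)
```

Note that a higher p means an easier item: here the first item is answered correctly by everyone and the third by only one examinee in four.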

Item discrimination refers to the degree to which items differentiate among examinees in terms of the characteristic being measured (e.g., between high and low scorers). This can be measured in many ways. One method is to correlate item responses with the total test score; items with the highest correlation with the total score are retained for the final version of the test. This would be appropriate when a test measures only one attribute and internal consistency is important.
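The item-total approach can be sketched as follows. This is an illustration under stated assumptions: items are scored 0/1, the Pearson correlation is used, the total includes the item itself (an "uncorrected" item-total correlation), and no item or total is constant across examinees. The function names and data are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)

def item_total_correlations(responses):
    """Correlate each item's scores with examinees' total test scores."""
    totals = [sum(row) for row in responses]
    n_items = len(responses[0])
    return [pearson_r([row[i] for row in responses], totals)
            for i in range(n_items)]

# Hypothetical data: five examinees, three items.
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
    [0, 0, 0],
]
correlations = item_total_correlations(responses)
```

In this made-up data the third item correlates least with the total, so it would be the first candidate for removal.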



Another way is a discrimination index (D).
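The notes do not define D. A common definition, assumed here, is the proportion answering the item correctly in the highest-scoring group minus the proportion in the lowest-scoring group. The 27% group size is a widely used convention, not something taken from these notes, and the function name and data are hypothetical.

```python
def discrimination_index(item_scores, total_scores, fraction=0.27):
    """D = proportion correct in the upper group minus that in the lower group."""
    n = len(total_scores)
    group_size = max(1, round(n * fraction))
    # Rank examinees from highest to lowest total score.
    ranked = sorted(range(n), key=lambda i: total_scores[i], reverse=True)
    upper = ranked[:group_size]
    lower = ranked[-group_size:]
    p_upper = sum(item_scores[i] for i in upper) / group_size
    p_lower = sum(item_scores[i] for i in lower) / group_size
    return p_upper - p_lower

# Hypothetical data: ten examinees, their totals and 0/1 scores on one item.
total_scores = [9, 8, 8, 7, 6, 5, 4, 3, 2, 1]
item_scores = [1, 1, 1, 1, 0, 1, 0, 0, 1, 0]
d_index = discrimination_index(item_scores, total_scores)
```

Under this definition D ranges from -1 to +1; values near +1 mean high scorers get the item right far more often than low scorers.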

V. Interpretation of Test Scores

1. Norm-referenced interpretation:

Involves comparing an examinee’s score to those of others in a normative sample. It provides an indication of where the examinee stands in relation to others who have taken the test. Norm-referenced scores include developmental scores (e.g., mental age scores and grade equivalents), which indicate how far along the normal developmental path an individual has progressed, and within-group norms, which provide a comparison to the scores of other individuals whom the examinee most resembles.

2. Criterion-referenced interpretation:

Involves interpreting an examinee’s test score in terms of an external, pre-established standard of performance.
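The two kinds of interpretation can be contrasted in a short sketch. The percentile rank here is defined as the percentage of the normative sample scoring below the examinee, which is one of several conventions; the z-score uses the sample's mean and standard deviation. All function names, the normative sample, and the cutoff are hypothetical.

```python
from statistics import mean, pstdev

def percentile_rank(score, norm_sample):
    """Norm-referenced: percentage of the normative sample scoring below."""
    below = sum(1 for s in norm_sample if s < score)
    return 100 * below / len(norm_sample)

def z_score(score, norm_sample):
    """Norm-referenced: distance from the sample mean in SD units."""
    return (score - mean(norm_sample)) / pstdev(norm_sample)

def meets_standard(score, cutoff):
    """Criterion-referenced: compare the score to a pre-established standard."""
    return score >= cutoff

# Hypothetical normative sample and examinee score.
norm_sample = [50, 60, 70, 80, 90]
score = 80

rank = percentile_rank(score, norm_sample)
z = z_score(score, norm_sample)
passed = meets_standard(score, cutoff=85)
```

The same score of 80 is above average relative to this sample yet falls short of the (made-up) standard of 85, which shows why the two interpretations can lead to different conclusions.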

http://www.mathcs.duq.edu/~packer/Courses/Psy624/test.html
