Post on 15-Oct-2014
How to combine scores across multiple questions to form a total scale score (modified
and shortened, from Chapter 19, Warner, 2007)
19.6 Methods for the Computation of Summated Scales
19.6.1 Implicit Assumption: All Items Measure the Same Construct and Are Scored in
the Same Direction
When we add together scores on a list of measures or questions, we implicitly assume that
all these scores measure the same underlying construct and that all the questions or items are
scored in the same direction.
Consider the first assumption, the assumption that all items measure the same construct.
What information would be obtained by a set of numbers that measured completely unrelated
things? A sum of X1 = height, X2 = agreeableness, and X3 = number of pairs of shoes owned
by a person would be a meaningless number because the scores that are combined are not
measures of the same underlying latent variable. In general, it does not make sense to
summarize information across a set of X measured variables by summing them unless they
are highly correlated with each other, and both the pattern of correlations and the nature of
the items are consistent with the interpretation that all the individual X items are slightly
different ways of measuring the same underlying latent variable (e.g., depression).
The items included in psychological tests such as the CESD scale are typically written so
that they assess slightly different aspects of a complex variable such as depression (e.g., low
self-esteem, fatigue, sadness). To evaluate empirically whether these items can reasonably be
interpreted as measures of the same underlying latent variable or construct, we look for
reasonably large correlations among the scores on the items. If the scores on a set of
measurements or test items are highly correlated with each other, this evidence is consistent
with the belief that the items may all be measures of the same underlying construct. However,
high correlations among items can arise for other reasons and are not necessarily proof that
the items measure the same underlying construct; for example, they may occur due to
sampling error or may arise because the items have some kind of measurement artifact in
common, such as a strong social desirability bias. The most widely reported method of
evaluating reliability for summated scales, Cronbach's alpha, is based on the mean of the
inter-item correlations.
19.6.2 Reverse-Worded Questions
Consider the second assumption: the assumption that all items are scored in the same
direction. In the CESD scale in the appendix to this chapter, most of the items are worded in
such a way that a higher score indicates a greater degree of depression. For example, for
Question 3, “I felt that I could not shake off the blues even with help from my family or
friends,” the response that corresponds to 4 points (“I felt this way most of the time, 5–7 days
per week”) indicates a higher level of depression than the response that corresponds to 1
point (“I felt this way rarely or none of the time”). However, a few of the items (Numbers 4,
8, 12, and 16) are reverse worded. Question 4 asks how frequently the respondent “felt that I
was just as good as other people.” The response to this question that would indicate the
highest level of depression corresponds to the lowest frequency of occurrence (1 = Rarely or
none of the time). When reverse-worded items are included in a multiple-item measure, the
scoring on these items must be recoded before we sum scores across items, such that a high
score on every item corresponds to the same thing, that is, a higher level of depression.
When I name my SPSS variables, I generally give names that help me to remember what
scale each question belongs to, which item number, and whether or not it is reverse scored.
So, for example, when my survey included the 20-item (question) CESD scale, I named the
items dep1, dep2, dep3, etc. However, when a question is reverse worded and needs to be
recoded before it is used in a reliability analysis (such as Cronbach's alpha) or summed
with other items, I initially give the variable a name such as "revdep4".
When self-report methods are used, it is often desirable to include some reverse-worded
questions. Self-report responses are prone to many types of bias, including yea-saying or
nay-saying bias (some respondents tend to agree or disagree with all items), and social
desirability bias (many people tend to report behaviors and attitudes that they believe are
socially desirable); see Converse and Presser (1999) for further discussion. To avoid the yea-
saying bias, some scales include reverse-worded items. For example, the CESD scale
includes statements about feelings and behaviors, and respondents are asked to rate how
frequently they experience each of these, using a scale from 1 (rarely or none of the time, less
than 1 day a week) to 4 (most or all of the time, 5–7 days a week).
It is generally preferable to report final scores for a scale scored in a direction such that a
higher score corresponds to “more” of the attitude or ability that the test is supposed to
measure. For example, it is easier to talk about scores on a depression scale, and to interpret
correlations of the depression scale with other variables, if a higher score corresponds to
more severe depression. (If a depression scale were scored such that a high score corresponded
to a low level of depression, then scores on the depression scale would correlate negatively
with other measures of negative mood such as anxiety; this would be confusing for the data
analyst and the reader.) Most of the items on the CESD scale are worded such that a high
frequency of reported occurrence corresponds to a higher level of depression. For example, a
high reported frequency of occurrence for the item “I had crying spells” corresponds to a
higher level of depression. However, a few of the CESD scale items were reverse worded, for
example, “I enjoyed life.” For these reverse-worded items, a score of 1 or 2 indicating a low
frequency of occurrence corresponds to a higher level of depression. Before combining
scores across items that are worded in different directions (such that for some items, a high
score corresponds to more depression, and for other items, a low score corresponds to more
depression), it is necessary to recode the direction of scoring on reverse-worded items so that
a higher score always corresponds to a higher level of depression. Items 4, 8, 12, and 16 in
the appendix were reverse worded. Scores on these reverse-worded items must be recoded
when we form a sum of the scores across all 20 items to serve as an overall measure of
depression.
In the following example, revdep4 is the name of the SPSS variable that corresponds to
the reverse-worded depression item “I felt that I was just as good as other people” (item
number 4 on the CESD scale). One simple method to reverse the scoring on this item (so that
a higher score corresponds to more depression) is as follows: Create a new variable (dep4)
that is equal to 5 - revdep4. If you take a value that is one unit higher than the highest
possible score on a measure (in this case, because the possible scores are 1, 2, 3, and 4, we
use the value 5), and then subtract each person’s score from that reference value, this reverses
the direction of scoring. This can be done in SPSS by making the following menu selections:
<Transform> <Compute>.
In the dialog box for the Compute procedure (see Figure 19.5), the name of the new
variable or Target Variable (dep4) is placed in the left-hand side box. The equation to
compute a score for this new variable as a function of the score on an existing variable is
placed in the right-hand side box titled Numeric Expression (in this case, the numeric
expression is 5 - revdep4).
Insert Figure 19.5
Figure 19.5 Computing a Reverse-Scored Variable for Dep4
It is also helpful to create a variable with a different name for the reverse-coded score for
each item (e.g., dep4 is the reverse-coded score on revdep4). If you change the direction of
scoring by changing the original values and retain the original variable name (as in this
example, dep4 = 5 - dep4), it is easy to lose track of which items have and have not already
been reverse scored.
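The same reverse-scoring step can be sketched outside SPSS; here is a minimal Python/pandas version, using the chapter's variable names (revdep4, dep4) with made-up response values:

```python
import pandas as pd

# Hypothetical responses to the reverse-worded item (possible scores: 1-4).
df = pd.DataFrame({"revdep4": [1, 2, 3, 4]})

# Subtract each score from (highest possible score + 1) = 5,
# so that 1 -> 4, 2 -> 3, 3 -> 2, and 4 -> 1.
df["dep4"] = 5 - df["revdep4"]

print(df["dep4"].tolist())  # [4, 3, 2, 1]
```

Keeping dep4 as a new column, rather than overwriting revdep4, preserves the record of which items have been recoded, as recommended above.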
Based on a preliminary examination of the data, the researcher evaluates whether the two
assumptions required for simple summated scales are satisfied (i.e., scores on all the items are
positively intercorrelated, and it makes sense to interpret all the items as measures of the
same underlying construct or variable).
19.6.3 Sum of Raw Scores
After recoding any reverse-worded items, you can create a total score for each scale by
summing scores across items as shown in Figure 19.7. In this first example, a score for
selected items from the CESD scale was computed by summing the scores on Items 1
through 5 (with Item 4 reverse scored). The <Transform> and <Compute> menu selections
open the SPSS Compute dialog window that appears in Figure 19.7. The name of the new
variable (in this example, briefcesd) is placed in the left-hand side window under the Target
Variable. The equation that specifies which scores are summed is placed in the Numeric
Expression window. To form a score that is the sum of items named dep1 to dep5 (but using
dep4 instead of revdep4), you can use the following numeric expression:
briefcesd = dep1 + dep2 + dep3 + dep4 + dep5. (19.6)
Insert Figure 19.7
Figure 19.7 Computation of a Brief 5-Item Version of the Depression Scale
If an individual has a missing score on one or more individual items, use of this
computation, briefcesd = dep1 + dep2 + dep3 + dep4 + dep5, will result in a system missing
code for the new scale total score. In this dataset, one participant had a system missing code
on revdep4 and dep4; therefore, the number of scores is reduced from N = 98 in the entire
SPSS data file to N = 97 for analyses that involve the variable briefcesd. If you want to
obtain a score for people who have missing values on some items, you can use the “MEAN”
function in the SPSS Compute dialog window (see Figure 19.8); this returns the mean score,
based on all non-missing items. For example, if a person is missing a score on dep2, the
numeric expression mean(dep1, dep2, dep3, dep4, dep5) will return the mean for all
available scores on Items dep1, dep3, dep4, and dep5. If you want to put the total score back
into the units that you would have obtained by summing items, multiply this mean by the
number of items in the scale (in this case, the number of items was 5).
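Both scoring strategies (the strict sum and the mean of available items rescaled to sum units) can be sketched in Python/pandas; the data below are hypothetical, and NaN stands in for SPSS's system missing code:

```python
import numpy as np
import pandas as pd

# Hypothetical responses for three people; NaN marks a missing item.
items = pd.DataFrame({
    "dep1": [2, 3, 1],
    "dep2": [2, np.nan, 1],
    "dep3": [2, 4, 1],
    "dep4": [3, 3, 1],
    "dep5": [1, 2, 1],
})

# Strict sum: any missing item makes the total missing (like dep1 + ... + dep5).
strict_total = items.sum(axis=1, skipna=False)

# Mean of available items, rescaled to sum units (like MEAN(...) * 5).
rescaled_total = items.mean(axis=1) * items.shape[1]

print(strict_total.tolist())    # [10.0, nan, 5.0]
print(rescaled_total.tolist())  # [10.0, 15.0, 5.0]
```

The second person is missing dep2, so the strict sum returns a missing value, while the rescaled mean uses the four available items (mean 3.0, rescaled to 15.0).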
19.6.4 Sum of z Scores
Summing raw scores may be reasonable when the items are all scored using the same
response alternatives or all measured in the same units. However, there are occasions when
researchers want to combine information across variables that are measured in quite different
units. Suppose a sociologist wants to create an overall index of socioeconomic status (SES)
by combining information about the following measures: annual income in dollars, years of
education, and occupational prestige rated on a scale from 0 to 100. If raw scores (in dollars,
years, and points) were summed, the value of the total score would be dominated by the value
of annual income. If we want to give these three factors (income, education, and occupational
prestige) equal weight when we combine them, we can convert each variable to a z score or
standard score and, then, form a unit-weighted composite of these z scores:
ztotal = zX1 + zX2 + … + zXp. (19.7)
To create a composite of z scores on income, education, and occupational prestige so as to
summarize information about SES, you could compute SES = zincome + zeducation + zoccupationprestige.
You could also use the Mean function to obtain a mean of z scores for the items in a scale.
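A sketch of this unit-weighted z-score composite in Python/pandas, using hypothetical SES data (income in dollars, education in years, prestige on a 0-100 scale):

```python
import pandas as pd

# Hypothetical SES components measured in very different units.
ses = pd.DataFrame({
    "income":    [30000, 55000, 90000, 42000],
    "education": [12, 16, 20, 14],
    "prestige":  [35, 60, 80, 45],
})

# Convert each column to z scores (sample SD, ddof=1, as SPSS computes them),
# then sum so that each component carries equal weight.
z = (ses - ses.mean()) / ses.std(ddof=1)
ses_total = z.sum(axis=1)
```

Summing raw dollars, years, and prestige points instead would let income dominate the total, which is exactly the problem the z-score composite avoids.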
19.7 Assessment of Internal Homogeneity for Multiple-Item Measures
The internal consistency reliability of a multiple-item scale tells us the degree to which the
items on the scale measure the same thing. If the items on a test all measure the same
underlying construct or variable, and if all items are scored in the same direction, then the
correlations among all the items should be positive.
19.7.2 Cronbach’s Alpha Reliability Coefficient: Conceptual Basis
We can summarize information about positive intercorrelations between the items on a
multiple-item test by calculating a Cronbach’s alpha reliability. The Cronbach’s alpha has
become the most popular form of reliability assessment for multiple-item scales. As seen in
an earlier section, as we sum a larger number of items for each participant, the expected value
of ei approaches 0, while the value of p × T increases. In theory, as the number of items (p)
included in a scale increases, assuming other characteristics of the data remain the same, the
reliability of the measure (the size of the p × T component compared with the size of the e
component) also increases. The Cronbach’s alpha provides a reliability coefficient that tells
us, in theory, how reliable our estimate of the “stable” entity that we are trying to measure is,
when we combine scores from p test items (or behaviors or ratings by judges). The
Cronbach’s alpha uses the mean of all the inter-item correlations (for all pairs of items or
measures) to assess the stability or consistency of measurement.
The Cronbach’s alpha can be understood as a generalization of the Spearman-Brown
prophecy formula; we calculate the mean inter-item correlation (r̄) to assess the degree of
agreement among individual test items, and then we predict the reliability coefficient for a p-item
test from the correlations among all these single-item measures. Another possible
interpretation of the Cronbach’s alpha is that it is, essentially, the average of all possible split
half reliabilities. Here is one formula for the Cronbach's alpha from Carmines and Zeller (1979,
p. 44):
α = p r̄ / [1 + (p - 1) r̄], (19.11)
where p is the number of items on the test and r̄ is the mean of the inter-item correlations.
The size of the Cronbach’s alpha depends on the following two factors:
As p (the number of items included in the composite scale) increases, and assuming that
r̄ stays the same, the value of the Cronbach's alpha increases.
As r̄ (the mean of the correlations among items or measures) increases, assuming that
the number of items p remains the same, the Cronbach's alpha increases.
It follows that we can increase the reliability of a scale by adding more items (but only if
doing so does not decrease r̄, the mean inter-item correlation) or by modifying items to
increase r̄ (either by dropping items with low item-total correlations or by writing new items
that correlate highly with existing items). There is a trade-off: If the inter-item correlation is
high, we may be able to construct a reasonably reliable scale with few items, and of course, a
brief scale is less costly to use and less cumbersome to administer than a long scale. Note that
all items must be scored in the same direction prior to summing. Items that are scored in the
opposite direction relative to other items on the scale would have negative correlations with
other items, and this would reduce the magnitude of the mean inter-item correlation.
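Equation 19.11 and these two factors are easy to verify numerically; the following sketch uses made-up values of p and the mean inter-item correlation:

```python
def alpha_from_mean_r(p, mean_r):
    """Standardized Cronbach's alpha from Eq. 19.11: p*r / (1 + (p - 1)*r)."""
    return p * mean_r / (1 + (p - 1) * mean_r)

# With 5 items and a mean inter-item correlation of .30:
print(round(alpha_from_mean_r(5, 0.30), 3))   # 0.682

# Holding mean r at .30 but lengthening the test to 20 items raises alpha:
print(round(alpha_from_mean_r(20, 0.30), 3))  # 0.896

# Holding p at 5 but raising mean r to .50 also raises alpha:
print(round(alpha_from_mean_r(5, 0.50), 3))   # 0.833
```

This illustrates the trade-off described above: a short scale can still be reliable if its items correlate strongly with one another.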
Researchers usually hope to be able to construct a reasonably reliable scale that does not
have an excessively large number of items. Many published measures of attitudes or
personality traits include between 4 and 20 items for each trait. Ability or achievement tests
(such as IQ) may require much larger numbers of measurements to produce reliable results.
Note that when the items are all dichotomous (such as true/false), the Cronbach’s alpha
may still be used to assess the homogeneity of response across items. In this situation, it is
sometimes called a Kuder-Richardson 20 (KR-20) reliability coefficient. However, the
Cronbach’s alpha is not appropriate for use with items that have categorical responses with
more than two categories.
19.7.3 Cronbach’s Alpha for Five Selected CESD Scale Items
Ninety-seven students filled out the 20-item CESD scale (items shown in the appendix to
this chapter) as part of a survey. The names given to these 20 items in the SPSS data file that
appears in Table 19.2 were dep1 to dep20. Questions 4, 8, 12, and 16 were reverse worded,
and therefore, it was necessary to recode the scores on these items. The recoded values were
placed in variables with the names dep4, dep8, dep12, and dep16. The SPSS reliability
procedure was used to assess the internal consistency reliability of their responses. The value
of the Cronbach’s alpha is an index of the internal consistency reliability of the depression
score formed by summing the first 5 items. In this first example, only the first 5 items (dep1,
dep2, dep3, dep4, and dep5) were included. To run SPSS reliability, the following menu
selections were made, starting from the top level menu for the SPSS data worksheet (see
Figure 19.11): <Analyze> <Scale> <Reliability>.
The reliability procedure dialog box appears in Figure 19.12. The names of the 5 items on
the CESD scale were moved into the variable list for this procedure. The Statistics button was
clicked to request additional output; the Reliability Analysis: Statistics window appears in
Figure 19.13. In this example, “Scale if item deleted” in the “Descriptives for” box and
“Correlations” in the “Inter-Item” box were checked. The syntax for this procedure appears in
Figure 19.14, and the output appears in Figure 19.15.
Insert Figure 19.11
Figure 19.11 SPSS Menu Selections for the Reliability Procedure
Insert Figure 19.12
Figure 19.12 SPSS Reliability Analysis for 5 CESD Scale Items: Dep1, Dep2, Dep3,
Dep4, and Dep5
Insert Figure 19.13
Figure 19.13 Statistics Selected for SPSS Reliability Analysis
Insert Figure 19.14
Figure 19.14 SPSS Syntax for Reliability Analysis
Insert Figure 19.15
Figure 19.15 SPSS Output From the First Reliability Procedure
NOTE: Scale: BriefCESD.
The Reliability Statistics panel in Figure 19.15 reports two versions of the Cronbach’s
alpha statistic for the entire scale including all 5 items. For the sum dep1 + dep2 + dep3 +
dep4 + dep5, the Cronbach's alpha estimates the proportion of the variance in this total that is
due to p × T, the part of the score that is stable or consistent for each participant across all 5
items. A score can be formed by summing raw scores (the sum of dep1, dep2, dep3, dep4,
and dep5), z scores, or standardized scores (zdep1 + zdep2 + … + zdep5). The first value, .59, is
the reliability for the scale formed by summing raw scores; the second value, .61, is the
reliability for the scale formed by summing z scores across items. In this example, these two
versions of the Cronbach’s alpha (raw score and standardized score) are nearly identical.
They generally differ from each other more when the items that are included in the sum are
measured using different scales with different variances (as in the earlier example of an SES
scale based on a sum of income, occupational prestige, and years of education).
Recall that the Cronbach’s alpha, like other reliability coefficients, can be interpreted as a
proportion of variance. Approximately 60% of the variance in the total score for depression,
which is obtained by summing the z scores on Items 1 through 5 from the CESD scale, is
shared across these 5 items. A Cronbach’s reliability coefficient of .61 would be considered
unacceptably poor reliability in most research situations. Subsequent sections describe two
different things researchers can do that may improve the Cronbach’s alpha reliability:
deleting poor items or increasing the number of items.
A correlation matrix appears under the heading “Inter-Item Correlation Matrix.” This
reports the correlations between all possible pairs of items. If all items measure the same
underlying construct, and if all items are scored in the same direction, then all the correlations
in this matrix should be positive and reasonably large. Note that the same item that had a
small loading on the depression factor in the preceding FA (trouble concentrating) also
tended to have low or even negative correlations with the other 4 items. The Item-Total
Statistics table shows how the statistics associated with the scale formed by summing all five
items would change if each individual item were deleted from the scale. The Corrected Item-
Total Correlation for each item is its correlation with the sum of the other 4 items in the scale;
for example, for dep1, the correlation of dep1 with the "corrected total" (dep2 + dep3 + dep4 +
dep5) is shown. This total is called "corrected" because the score for dep1 is not included
when we assess how dep1 is related to the total. If an individual item is a “good” measure,
then it should be strongly related to the sum of all other items in the scale; conversely, a low
item-total correlation is evidence that an individual item does not seem to measure the same
construct as other items in the scale. The item that has the lowest item-total correlation with
the other items is, once again, the question about trouble concentrating. This low item-total
correlation is yet another piece of evidence that this item does not seem to measure the “same
thing” as the other 4 items in this scale.
The last column in the Item-Total Statistics table reports Cronbach's Alpha if Item
Deleted; that is, what is the Cronbach’s alpha for the scale if each individual item is deleted?
For the item that corresponded to the question trouble concentrating, deletion of this item
from the scale would increase the Cronbach's alpha to .70. Sometimes the deletion of an item
that has low correlations with other items on the scale results in an increase in reliability. In
this example, we can obtain slightly better reliability for the scale if we drop the item trouble
concentrating, which tends to have small correlations with other items on this depression
scale; the sum of the remaining 4 items has a Cronbach's alpha of .70, which represents slightly
better reliability.
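These item-total and alpha-if-item-deleted diagnostics can be reproduced outside SPSS. The sketch below uses hypothetical 1-4 responses in which dep5 is deliberately constructed to be a "poor" item, weakly (even negatively) related to the rest, playing the role of the trouble concentrating item:

```python
import pandas as pd

def cronbach_alpha(df):
    """Raw-score Cronbach's alpha: (p/(p-1)) * (1 - sum of item variances / variance of total)."""
    p = df.shape[1]
    return p / (p - 1) * (1 - df.var(ddof=1).sum() / df.sum(axis=1).var(ddof=1))

# Hypothetical responses; dep5 is built to correlate poorly with the other items.
items = pd.DataFrame({
    "dep1": [1, 2, 3, 4, 2, 3],
    "dep2": [1, 2, 4, 4, 2, 3],
    "dep3": [2, 2, 3, 4, 1, 3],
    "dep4": [1, 3, 3, 4, 2, 2],
    "dep5": [4, 1, 2, 1, 3, 2],
})

for col in items.columns:
    rest = items.drop(columns=col).sum(axis=1)
    item_total_r = items[col].corr(rest)          # corrected item-total correlation
    alpha_if_deleted = cronbach_alpha(items.drop(columns=col))
    print(f"{col}: r(item, rest) = {item_total_r:.2f}, alpha if deleted = {alpha_if_deleted:.2f}")
```

In this made-up data, dep5 shows a negative corrected item-total correlation, and alpha rises when it is dropped, mirroring the pattern described for the trouble concentrating item.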
19.7.4 Improving Cronbach’s Alpha by Dropping a “Poor” Item
The SPSS reliability procedure was performed on the reduced set of 4 items: dep1, dep2,
dep3, and dep4. The output from this second reliability analysis (in Figure 19.16) shows that
the reduced 4-item scale had Cronbach's alpha reliabilities of .703 (for the sum of raw scores)
and .712 (for the sum of z scores). A review of the column headed "Cronbach's Alpha if
Item Deleted” in the new Item-Total Statistics table indicates that the reliability of the scale
would become lower if any additional items were deleted from the scale. Thus, we have
obtained slightly better reliability from the 4-item version of the scale (Figure 19.16) than for
a 5-item version of the scale (Figure 19.15). The 4-item scale had better reliability because
the mean inter-item correlation was higher after the item trouble concentrating was deleted.
Insert Figure 19.16
Figure 19.16 Output for the Second Reliability Analysis: Scale Reduced to Four Items
NOTE: Item trouble concentrating has been dropped.
19.7.5 Improving the Cronbach’s Alpha by Increasing the Number of Items
Other factors being equal, Cronbach’s alpha reliability tends to increase as p, the number
of items in the scale, increases. For example, we obtain a higher Cronbach’s alpha when we
use all 20 items in the full-length CESD scale than when we examine just the first 5 items.
The output from the SPSS reliability procedure for the full 20-item CESD scale (with Items
4, 8, 12, and 16 reverse scored) appears in Figure 19.17. For the full scale formed by
summing scores across all 20 items, the Cronbach's alpha was .88.
Insert Figure 19.17
Figure 19.17 SPSS Output: Cronbach’s Alpha Reliability for the 20-Item CESD Scale
19.7.6 A Few Other Methods of Reliability Assessment for Multiple-Item Measures
19.7.6.1 Split-Half Reliability
A split-half reliability for a scale with p items is obtained by dividing the items into two
sets (each with p/2 items). This can be done randomly or systematically; for example, the first
set might consist of odd-numbered items and the second set might consist of even-numbered
items. Separate scores are obtained for the sum of the Set 1 items (X1) and the sum of the Set
2 items (X2), and a Pearson r (r12) is calculated between X1 and X2. However, this r12
correlation between X1 and X2 is the reliability for a test with only p/2 items; if we want to
know the reliability for the full test that consists of twice as many items (all p items, in this
example), we can “predict” the reliability of the longer test using the Spearman-Brown
prophecy formula (Carmines & Zeller, 1979):
rXX = 2 r12 / (1 + r12), (19.12)
where r12 is the correlation between the scores based on split-half versions of the test
(each with p/2 items), and rXX is the reliability for a score based on all p items.
Depending on the way in which items are divided into sets, the value of the split-half
reliability can vary. The Cronbach’s alpha can be interpreted as the mean of all possible
different split-half reliabilities.
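The split-half procedure can be sketched in Python/pandas, using an odd/even split and Eq. 19.12 to step the half-test correlation up to full-test length; the data frame is assumed to hold one recoded item per column, and the responses below are hypothetical:

```python
import pandas as pd

def split_half_reliability(items):
    """Odd/even split-half reliability, stepped up via Spearman-Brown (Eq. 19.12)."""
    odd_half = items.iloc[:, 0::2].sum(axis=1)   # items 1, 3, 5, ...
    even_half = items.iloc[:, 1::2].sum(axis=1)  # items 2, 4, 6, ...
    r12 = odd_half.corr(even_half)
    return 2 * r12 / (1 + r12)

# Hypothetical four-item scale:
items = pd.DataFrame({
    "dep1": [1, 2, 3, 4, 2],
    "dep2": [1, 3, 3, 4, 2],
    "dep3": [2, 2, 4, 3, 1],
    "dep4": [1, 2, 3, 4, 3],
})
print(round(split_half_reliability(items), 2))
```

A different assignment of items to halves would generally give a somewhat different value, which is the variability the Cronbach's alpha averages over.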
19.7.6.2 Parallel Forms Reliability
Sometimes it is desirable to have two versions of a test that include different questions
but that yield comparable information; these are called parallel forms. Parallel forms of a test,
such as the Eysenck Personality Inventory, are often designated Form A and Form B. Parallel
forms are particularly useful in repeated measures studies where we would like to test some
ability or attitude on two occasions, but we want to avoid repeating exactly the same
questions. Parallel forms reliability is similar to split-half reliability, except that when
parallel forms are developed, more attention is paid to matching items so that the two forms
contain similar types of questions. For example, consider Eysenck’s Extraversion scale. Both
Form A and Form B include similar numbers of items that assess each aspect of extraversion
—for instance, enjoyment of social gatherings, comfort in talking with strangers, sensation
seeking, and so forth. A Pearson r between scores on Form A and Form B is a typical way of
assessing reliability; in addition, however, a researcher wants scores on Form A and Form B
to yield the same means, variances, and so forth, so these should also be assessed.
19.9 Validity Assessment
Validity of a measurement essentially refers to whether the measurement really measures
what it purports to measure. In psychological and educational measurement, the degree to
which scores on a measure correspond to the underlying construct that the measure is
supposed to assess is called construct validity. (Some textbooks used to list construct
validity as one of several types of measurement validity; in recent years, many authors use
the term construct validity to subsume all the forms of validity assessment described below.)
For some types of measurement (such as direct measurements of simple physical
characteristics), validity is reasonably self-evident. If a researcher uses a tape measure to
obtain information about people’s heights (whether the measurements are reported in
centimeters, inches, feet, or other units), the researcher does not need to go to great lengths to
persuade readers that this type of measurement is valid. However, there are many situations
where the characteristic of interest is not directly observable, and researchers can only obtain
indirect information about it. For example, we cannot directly observe intelligence (or
depression); but we may infer that a person is intelligent (or depressed) if he or she gives
certain types of responses to large numbers of questions that researchers agree are diagnostic
of intelligence (or depression). A similar problem arises in medicine, for example, in the
assessment of blood pressure. Arterial blood pressure could be measured directly by shunting
the blood flow out of the person’s artery through a pressure measurement system, but this
procedure is invasive (and generally, less invasive measures are preferred). The commonly
used method of blood pressure assessment uses an arm cuff; the cuff is inflated until the
pressure in the cuff is high enough to occlude the blood flow; a human listener (or a
microphone attached to a computerized system) listens for sounds in the brachial artery while
the cuff is deflated. At the point when the sounds of blood flow are detectable (the Korotkoff
sounds), the pressure on the arm cuff is read, and this number is used as the index of systolic
blood pressure—that is, the blood pressure at the point in the cardiac cycle when the heart is
pumping blood into the artery. The point of this example is that this common blood pressure
measurement method is quite indirect; research had to be done to establish that measurements
taken in this manner were highly correlated with measurements obtained more directly by
shunting blood from a major artery into a pressure detection system. Similarly, it is possible
to take satellite photographs and use the colors in these images to make inferences about the
type of vegetation on the ground, but it is necessary to do validity studies to demonstrate that
the type of vegetation that is identified using satellite images corresponds to the type of
vegetation that is seen when direct observations are made at ground level.
As these examples illustrate, it is quite common in many fields (such as psychology,
medicine, and natural resources) for researchers to use rather indirect assessment methods—
either because the variable in question cannot be directly observed or because direct
observation would be too invasive or too costly.
In cases such as these, whether the measurements are made through self-report
questionnaires, by human observers, or by automated systems, validity cannot be assumed;
we need to obtain evidence to show that measurements are valid.
For self-report questionnaire measurements, two types of evidence are used to assess
validity. One type of evidence concerns the content of the questionnaire (content or face
validity); the other type of evidence involves correlations of scores on the questionnaire with
other variables (criterion-oriented validity).
19.9.1 Content and Face Validity
Both content and face validity are concerned with the content of the test or survey items.
Content validity involves the question whether test items represent all theoretical dimensions
or content areas. For example, if depression is theoretically defined to include low self-
esteem, feelings of hopelessness, thoughts of suicide, lack of pleasure, and physical
symptoms of fatigue, then a content-valid test of depression should include items that assess
all these symptoms. Content validity may be assessed by mapping out the test contents in a
systematic way and matching them to elements of a theory or by having expert judges decide
whether the content coverage is complete.
A related issue is whether the instrument has face validity; that is, does it appear to
measure what it says it measures? Face validity is sometimes desirable, when it is helpful for
test takers to be able to see the relevance of the measurements to their concerns, as in some
evaluation research studies where participants need to feel that their concerns are being taken
into account.
If a test is an assessment of knowledge (e.g., knowledge about dietary guidelines for
blood glucose management for diabetic patients), then content validity is crucial. Test
questions should be systematically chosen so that they provide reasonably complete coverage
of the information (e.g., What are the desirable goals for the proportions and amounts of
carbohydrate, protein, and fat in each meal? When blood sugar is tested before and after
meals, what ranges of values would be considered normal?).
When a psychological test is intended for use as a clinical diagnosis (of depression, for
instance), clinical source books such as the Diagnostic and Statistical Manual of Mental
Disorders (DSM-IV) might be used to guide item selection, to ensure that all relevant facets
of depression are covered. More generally, a well-developed theory (about ability,
personality, mood, or whatever else is being measured) can help a researcher map out the
domain of behaviors, beliefs, or feelings that questions should cover to have a content-valid
and comprehensive measure.
However, sometimes it is important that test takers not be able to guess the
purpose of the assessment, particularly in situations where participants might be motivated to
“fake good,” “fake bad,” lie, or give deceptive responses. There are two types of
psychological tests that (intentionally) do not have high face validity: projective tests and
empirically keyed objective tests. One well-known example of a projective test is the
Rorschach test, in which people are asked to say what they see when they look at ink blots; a
diagnosis of psychopathology is made if responses are bizarre. Another is the Thematic
Apperception Test, in which people are asked to tell stories in response to ambiguous
pictures; these stories are scored for themes such as need for achievement and need for
affiliation. In projective tests, it is usually not obvious to participants what motives are being
assessed, and because of this, test takers should not be able to engage in impression
management or faking. Thus, projective tests intentionally have low face validity.
Some widely used psychological tests were constructed using empirical keying methods;
that is, test items were chosen because the responses to those questions were empirically
related to a psychiatric diagnosis (such as depression), even though the question did not
appear to have anything to do with depression. For example, persons diagnosed with
depression tend to respond “False” to the MMPI (Minnesota Multiphasic Personality
Inventory) item “I sometimes tease animals”; this item was included in the MMPI depression
scale because the response was (weakly) empirically related to a diagnosis of depression,
although the item does not appear face valid as a question about depression (Wiggins, 1973).
Face validity can be problematic; people do not always agree about what underlying
characteristic(s) a test question measures. Gergen, Hepburn, and Fisher (1986) demonstrated
that when items taken from one psychological test (the Rotter Internal/External Locus of
Control scale) were presented to people out of context and people were asked to say what
trait they thought the questions assessed, they generated a wide variety of responses.
19.9.2 Criterion-Oriented Types of Validity
Content validity and face validity are assessed by looking inside a test to see what
material it contains and what the questions appear to measure. Criterion-oriented validity is
assessed by examining correlations of scores on the test with scores on other variables that
should be related to it if the test really measures what it purports to measure. If the CESD
scale really is a valid measure of depression, for example, scores on this scale should be
correlated with scores on other existing measures of depression that are thought to be valid,
and they should predict behaviors that are known or theorized to be associated with
depression.
19.9.2.1 Convergent Validity
Convergent validity is assessed by checking to see if scores on a new test of some
characteristic X correlate highly with scores on existing tests that are believed to be valid
measures of that same characteristic. For example, do scores on a new brief IQ test correlate
highly with scores on well-established IQ tests such as the WAIS or the Stanford-Binet? Are
scores on the CESD scale closely related to scores on other depression measures such as the
BDI? If a new measure of a construct has reasonably high correlations with existing measures
that are generally viewed as valid, this is evidence of convergent validity.
19.9.2.2 Discriminant Validity
Equally important, scores on X should not correlate with things the test is not supposed to
measure (discriminant validity). For instance, researchers sometimes try to demonstrate that
scores on a new test are not contaminated by social desirability bias by showing that these
scores are not significantly correlated with scores on the Crowne-Marlowe Social Desirability
scale or other measures of social desirability bias.
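The logic of the two preceding subsections can be sketched in a few lines of code. This is an illustration only: the scale scores below are invented, and the variable names (new_scale, established, social_des) are hypothetical stand-ins for a new depression measure, an established depression measure, and a social desirability scale.

```python
from statistics import mean, stdev

def pearson_r(x, y):
    """Pearson correlation between two lists of scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (stdev(x) * stdev(y))

# Invented scores for six respondents (illustration only).
new_scale = [10, 14, 9, 20, 13, 17]     # new depression measure
established = [11, 15, 10, 19, 12, 18]  # existing, validated depression measure
social_des = [20, 21, 19, 20, 22, 18]   # social desirability scale

r_convergent = pearson_r(new_scale, established)   # should be high
r_discriminant = pearson_r(new_scale, social_des)  # should be near zero
```

With these made-up numbers, the convergent correlation comes out high and the discriminant correlation near zero, which is the pattern a test developer hopes to see.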
19.9.2.3 Concurrent Validity
As the name suggests, concurrent validity is evaluated by obtaining correlations between
scores on the test and current behaviors or current group memberships. For example, if
persons who are currently clinically diagnosed with depression have higher mean scores on
the CESD scale than persons who are not currently diagnosed with depression, this would be
one type of evidence for concurrent validity.
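A minimal sketch of that group comparison follows; the CESD totals are invented for illustration, and in practice the difference would be evaluated with a significance test and an effect size rather than by eyeballing the means.

```python
from statistics import mean, stdev

# Invented CESD totals: currently diagnosed vs. not currently diagnosed.
diagnosed = [52, 47, 58, 44, 50]
not_diagnosed = [28, 31, 25, 34, 30]

# Concurrent validity evidence: the diagnosed group should score higher.
mean_diff = mean(diagnosed) - mean(not_diagnosed)

# A simple standardized effect size (Cohen's d with a pooled SD):
pooled_sd = ((stdev(diagnosed) ** 2 + stdev(not_diagnosed) ** 2) / 2) ** 0.5
cohens_d = mean_diff / pooled_sd
```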
19.9.2.4 Predictive Validity
Another way of assessing validity is to ask whether scores on the test predict future
behaviors or group membership. For example, are scores on the CESD scale higher for
persons who later commit suicide than for people who do not commit suicide?
19.9.3 Construct Validity: Summary
Many types of evidence (including content, convergent, discriminant, concurrent, and
predictive validity) may be required to establish that a measure has strong construct validity
—that is, that it really measures what the test developer says it measures, and it predicts the
behaviors and group memberships that it should be able to predict. Westen and Rosenthal
(2003) suggested that researchers should compare a matrix of obtained validity coefficients or
correlations with a target matrix of predicted correlations and compute a summary statistic to
describe how well the observed pattern of correlations matches the predicted pattern. This
provides a way of quantifying information about construct validity based on many different
kinds of evidence.
Although the preceding examples have used psychological tests, validity questions
certainly arise in other domains of measurement. For example, referring to the example
discussed earlier, when the colors in satellite images are used to make inferences about the
types and amounts of vegetation on the ground, are those inferences correct? Indirect
assessments are sometimes used because they are less invasive (e.g., as discussed earlier, it is
less invasive to use an inflatable arm cuff to measure blood pressure) and sometimes because
they are less expensive (broad geographical regions can be surveyed more quickly by taking
satellite photographs than by having observers on the ground). Whenever indirect methods of
assessment are used, validity assessment is required.
Multiple-item assessments of some variables (such as depression) may be useful or even
necessary to achieve validity as well as reliability. How can we best combine information
from multiple measures? This brings us back to a theme that has arisen repeatedly throughout
the book; that is, we can often summarize the information in a set of p variables or items by
creating a weighted linear composite or, sometimes, just a unit weight sum of scores for the
set of p variables.
19.10 Typical Scale Development Study
If an existing multiple-item measure is available for the variable of interest, such as
depression, it is usually preferable to employ an existing measure for which we have good
evidence about reliability and validity. However, occasionally, a researcher would like to
develop a measure for some construct that has not been measured before or develop a
different way of measuring a construct for which the existing tests are flawed. An outline of a
typical research process for scale development appears in Figure 19.19. In this section, the
steps included in this diagram are discussed briefly. Although the examples provided involve
self-report questionnaire data, comparable issues are involved in combining physiological
measures or observational data.
Figure 19.19 Possible Steps in the Development of a Multiple-Item Scale
19.10.1 Generating and Modifying the Pool of Items or Measures
When a researcher sets out to develop a measure for a new construct (for which there are
no existing measures) or a different measure in a research domain where other measures have
been developed, the first step is the generation of a pool of “candidate” items. There are many
ways in which this can be done. For example, to develop a set of self-report items to measure
“Machiavellianism” (a cynical, manipulative attitude toward people), Christie and Geis
(1970) drew on the writings of Machiavelli for some items (and also on statements by P. T.
Barnum, another notable cynic). To develop measures of love, Rubin (1970) drew on writings
about love that ranged from the works of classic poets to the lyrics of popular songs. In some
cases, items are borrowed from existing measures; for example, a number of research scales
have used items that are part of the MMPI. However, there are copyright restrictions on the
use of items that are part of published tests.
Brainstorming by experts, and interviews, focus groups, or open-ended questions with
members of the population who are the focus of assessment can also provide useful ideas
about items. For example, to develop a measure of college student life space, including
numbers and types of material possessions, Brackett (2004) interviewed student informants,
visited dormitory rooms, and examined merchandise catalogs popular in that age group.
A theory can be extremely helpful as guidance in initial item development. The early
interview and self-report measures of the global Type A behavior pattern drew on a
developing theory that suggested that persons prone to cardiovascular disease tend to be
competitive, time urgent, job-involved, and hostile. The behaviors that were identified for
coding in the interview thus included interrupting the interviewer and loud or explosive
speech. The self-report items on the Jenkins Activity Survey, a self-report measure of global
Type A behavior, included questions about eating fast, never having time to get a haircut, and
being unwilling to lose in games even when playing checkers with a child (Jenkins, Zyzanski,
& Rosenman, 1979).
It is useful for the researcher to try to anticipate the factors that will emerge when these
items are pretested and FA is performed. If a researcher wants to measure satisfaction with
health care, and the researcher believes that there are three separate components to
satisfaction (evaluation of practitioner competence, satisfaction with rapport or “bedside
manner,” and issues of cost and convenience), then he or she should pause and evaluate
whether the survey includes sufficient items to measure each of these three components.
Keeping in mind that a minimum of 4 to 5 items is generally desired for each factor or scale
and that not all candidate items may turn out to be good measures, it may be helpful to have
something like 8 or 10 candidate items that correspond to each construct or factor that the
researcher wants to measure.
19.10.2 Administer Survey to Participants
The survey containing all the candidate items should be pilot tested on a relatively small
sample of participants; it may be desirable to interview or debrief participants to find out
whether items seemed clear and plausible and whether response alternatives covered all the
options people might want to report. A pilot test can also help the researcher judge how long
it will take for participants to complete the survey. After making any changes judged
necessary based on the initial pilot tests, the survey should be administered to a sample that is
large enough to be used for FA (see Chapter 18 for sample size recommendations). Ideally,
these participants should vary substantially on the characteristics that the scales are supposed
to measure (because a restricted range of scores on T, the component of the X measures that
taps stable individual differences among participants, will lead to lower inter-item
correlations and lower scale reliabilities).
19.10.3 Factor Analyze Items to Assess the Number and Nature of Latent Variables or
Constructs
Using the methods described in Chapter 18, FA can be performed on the scores. If the
number of factors that are obtained and the nature of the factors (i.e., the groups of variables
that have high loadings on each factor) are consistent with the researcher’s expectations, then
the researcher may want to go ahead and form one scale that corresponds to each factor. If the
FA does not turn out as expected, for example, if the number of factors is different from what
was anticipated or if the pattern of variables that load on each factor is not as expected, the
researcher needs to make a decision. If the researcher wants to make the FA more consistent
with a priori theoretical constructs, it may be necessary to go back to Step 1 to revise, add,
and drop items. If the researcher sees patterns in the data that were not anticipated from
theoretical evaluations (but the patterns make sense), he or she may want to use the empirical
factor solution (instead of the original conceptual model) as a basis for grouping items into
scales. Also, if a factor that was not anticipated emerges in the FA, but there are only a few
items to represent that factor, the researcher may want to add or revise items to obtain a better
set of questions for the new factor.
In practice, a researcher may have to go through these first three steps several times; that
is, the researcher may run FA, modify items, gather additional data, and run a new FA several
times until the results of the FA are clear, and the factors correspond to meaningful groups of
items that can be summed to form scales.
Note that some scales are developed based on the predictive utility of items rather than on
the factor structure; for these, DA (rather than FA) might be the data reduction method of
choice. For example, items included in the Jenkins Activity Survey (Jenkins et al., 1979)
were selected because they were useful predictors of a person having a future heart attack.
19.10.4 Development of Summated Scales
After FA (or DA), the researcher may want to form scales by combining scores on
multiple measures or items. There are numerous options at this point.
1. One or several scales may be created (depending on whether the survey or test measures
just one construct or several separate constructs).
2. Composition of scales (i.e., selection of items) may be dictated by conceptual grouping
of items or by empirical groups of items that emerge from FA. In most scale
development research, researchers hope that the items that are grouped to form scales
can be justified both conceptually and empirically.
3. Scales may involve combining raw scores or standardized scores (z scores) on multiple
items. Usually, if the variables use drastically different measurement units (as in the
example above where an SES index was formed by combining income, years of
education, and occupational prestige rating), z scores are used to ensure that each
variable has equal importance.
4. Scales may be based on sums or means of scores across items.
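Options 3 and 4 above can be illustrated with a short sketch. The SES values below are invented, echoing the income/education/prestige example mentioned in option 3; the point is that when items are on drastically different scales, a raw-score sum would be dominated by the variable with the largest numbers, so z scores equalize the indicators' weights.

```python
from statistics import mean, stdev

def zscores(values):
    """Standardize a list of raw scores to z scores."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# Invented values for four respondents, in very different units.
income = [32000, 58000, 41000, 75000]  # dollars per year
education = [12, 16, 14, 18]           # years of schooling
prestige = [35, 60, 48, 72]            # occupational prestige rating

# z-score composite (option 3): each indicator gets equal weight.
ses_index = [i + e + p for i, e, p in
             zip(zscores(income), zscores(education), zscores(prestige))]

# When all items share one response scale, a plain sum or mean of raw
# scores (option 4) is the usual choice instead.
```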
19.10.5 Assess Scale Reliability
At a minimum, the internal consistency of each scale is assessed, usually by obtaining a
Cronbach’s alpha. Test-retest reliability should also be assessed if the construct is something
that is expected to remain reasonably stable across time (such as a personality trait), but high
test-retest reliability is not a requirement for measures of things that are expected to be
unstable across time (such as moods).
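For readers who want to see the computation behind the output, Cronbach's alpha can be obtained directly from the item variances and the variance of the total score. This is a minimal sketch with invented item scores, not the book's SPSS procedure.

```python
from statistics import variance

def cronbach_alpha(items):
    """Cronbach's alpha; `items` holds one list of scores per item,
    all lists covering the same respondents in the same order."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]
    sum_item_var = sum(variance(item) for item in items)
    return (k / (k - 1)) * (1 - sum_item_var / variance(totals))

# Invented scores for three items and five respondents.
item1 = [1, 2, 2, 3, 4]
item2 = [1, 3, 2, 4, 4]
item3 = [2, 2, 3, 3, 4]
alpha = cronbach_alpha([item1, item2, item3])  # about .92 for these data
```

Because the three made-up items covary strongly, alpha here is high; with weakly related items the sum of item variances approaches the variance of the total, and alpha drops toward zero.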
19.10.6 Assess Scale Validity
If there are existing measures of the same theoretical construct, the researcher assesses
convergent validity by checking to see whether scores on the new measure are reasonably
highly correlated with scores on existing measures. If the researcher has defined the construct
as something that should be independent of verbal ability or not influenced by social
desirability, the researcher should assess discriminant validity by making sure that
correlations with measures of verbal ability and social desirability are close to 0. To assess
concurrent and predictive validity, the researcher checks whether scores on the scale predict
the current or future group memberships and behaviors that, in theory, they should predict. For
example, scores on Zick Rubin’s Love Scale (Rubin, 1970) were evaluated to see if they
predicted self-rated likelihood that the relationship would lead to marriage and whether
scores predicted which dating couples would split up and which ones would stay together
within the year or two following the initial survey.
19.10.7 Iterative Process
At any point in this process, if results are not satisfactory, the researcher may “cycle
back” to an earlier point in the process; for example, if the factors that emerge from FA are
not clear or if internal consistency reliability of scales is low, the researcher may want to
generate new items and collect more data. In addition, particularly for scales that will be used
in clinical diagnosis or selection decisions, normative data are required; that is, the mean,
variance, and distribution shape of scores must be evaluated based on a large number of
people (at least several thousand). This provides test users with a basis for evaluation. For
example, for the BDI (Beck et al., 1961), the following interpretations for scores have been
suggested based on normative data for thousands of test takers: scores from 5 to 9, normal
mood variations; 10 to 18, mild to moderate depression; 19 to 29, moderate to severe
depression; and 30 to 63, severe depression. Scores of 4 or below on the BDI may be
interpreted as possible denial of depression or faking good; it is very unusual for people to
have scores that are this low on the BDI.
19.10.8 Create Final Scale
When all the criteria for good quality measurement appear to be satisfied (i.e., the data
analyst has obtained a reasonably brief list of items or measurements that appears to provide
reliable and valid information about the construct of interest), a final version of the scale may
be created. Often such scales are first published as tables or appendixes in journal articles. A
complete report for a newly developed scale should include the instructions for the test
respondents (e.g., what period of time should the test taker think about when reporting
frequency of behaviors or feelings?); a complete list of items, statements, or questions; the
specific response alternatives; indication whether any items need to be reverse coded; and
scoring instructions. Usually, the scoring procedure consists of reversing the direction of
scores for any reverse-worded items and then summing the raw scores across all items for
each scale. If subsequent research provides additional evidence that the scale is reliable and
valid, and if the scale measures something that has a reasonably wide application, at some
point, the test author may copyright the test and perhaps have it distributed on a fee per use
basis by a test publishing company. Of course, as years go by, the contents of some test items
may become dated. Therefore, periodic revisions may be required to keep test item wording
current.
19.11 Summary
To summarize, measurements need to be reliable. Unreliable measurement creates
two problems. Low reliability may imply that the measure is not valid (if a measure does
not detect anything consistently, it does not make much sense to ask what it is measuring). In
addition, when researchers conduct statistical analyses, such as correlations, to assess how
scores on an X variable are related to scores on other variables, the relationship of X to other
variables becomes weaker as the reliability of X becomes smaller; the attenuation of
correlation due to unreliability of measurement was discussed in Chapter 7. To put it more
plainly, when a researcher has unreliable measures, relationships between variables usually
appear to be weaker. It is also essential for measures to be valid: If a measure is not valid,
then the study does not provide information about the theoretical constructs that are of real
interest. It is also desirable for measures to be sensitive to individual differences, unbiased,
relatively inexpensive, not very invasive, and not highly reactive.
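The attenuation effect mentioned above has a simple classical form: the observed correlation is the true correlation multiplied by the square root of the product of the two reliabilities, and the same relation can be inverted to "disattenuate" an observed correlation. A sketch, with illustrative values:

```python
import math

def attenuated_r(r_true, rel_x, rel_y):
    """Expected observed correlation when X and Y contain measurement error."""
    return r_true * math.sqrt(rel_x * rel_y)

def disattenuated_r(r_observed, rel_x, rel_y):
    """Spearman's classic correction for attenuation."""
    return r_observed / math.sqrt(rel_x * rel_y)

# A true correlation of .50, measured with reliability .70 on each
# variable, is expected to appear as only .35:
r_obs = attenuated_r(0.50, 0.70, 0.70)
```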
Research methods textbooks point out that each type of measurement method (such as
direct observation of behavior, self-report, physiological or physical measurements, and
archival data) has strengths and weaknesses. For example, self-report is generally low cost,
but such reports may be biased by social desirability (i.e., people report attitudes and
behaviors that they believe are socially desirable, instead of honestly reporting their actual
attitudes and behaviors). When it is possible to do so, a study can be made much stronger by
including multiple types of measurements (this is called “triangulation” of measurement). For
example, if a researcher wants to measure anxiety, it would be desirable to include direct
observation of behavior (e.g., “um”s and “ah”s in speech and rapid blinking), self-report
(answers to questions that ask about subjective anxiety), and physiological measures (such as
heart rates and cortisol levels). If an experimental manipulation has similar effects on anxiety
when it is assessed using behavioral, self-report, and physiological outcomes, the researcher
can be more confident that the outcome of the study is not attributable to a methodological
weakness associated with one form of measurement, such as self-report.
The development of a new measure can require a substantial amount of time and effort. It
is relatively easy to demonstrate reliability for a new measurement, but the evaluation of
validity is far more difficult and the validity of a measure can be a matter of controversy.
When possible, researchers may prefer to use existing measures for which data on reliability
and validity are already available.
For psychological testing, a useful online resource is the American Psychological
Association FAQ on testing: www.apa.org/science/testing.html.
Another useful resource is a directory of published research tests on the Educational
Testing Service (ETS) Test Link site www.ets.org/testcoll/index.html, which has information
on about 20,000 published psychological tests.
Although most of the variables used as examples in this chapter were self-report
measures, the issues discussed in this chapter (concerning reliability, validity, sensitivity,
bias, cost effectiveness, invasiveness, and reactivity) are relevant for other types of data,
including physical measurements, medical tests, and observations of behavior.
Appendix: The CESD Scale
INSTRUCTIONS: Using the scale below, please circle the number before each statement
which best describes how often you felt or behaved this way DURING THE PAST WEEK.
1 Rarely or none of the time (less than 1 day)
2 Some or a little of the time (1–2 days)
3 Occasionally or a moderate amount of time (3–4 days)
4 Most of the time (5–7 days)
The total CESD depression score is the sum of the scores on the following twenty
questions with Items 4, 8, 12, and 16 reverse scored.
1. I was bothered by things that usually don’t bother me.
2. I did not feel like eating; my appetite was poor.
3. I felt that I could not shake off the blues even with help from my family or friends.
4. I felt that I was just as good as other people. (reverse worded)
5. I had trouble keeping my mind on what I was doing.
6. I felt depressed.
7. I felt that everything I did was an effort.
8. I felt hopeful about the future. (reverse worded)
9. I thought my life had been a failure.
10. I felt fearful.
11. My sleep was restless.
12. I was happy. (reverse worded)
13. I talked less than usual.
14. I felt lonely.
15. People were unfriendly.
16. I enjoyed life. (reverse worded)
17. I had crying spells.
18. I felt sad.
19. I felt that people dislike me.
20. I could not get “going.”
A total score on CESD is obtained by reversing the direction of scoring on the four
reverse-worded items (4, 8, 12, and 16), so that a higher score on all items corresponds to a
higher level of depression, and then summing the scores across all 20 items.
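The scoring rule just described is straightforward to automate. This sketch follows the 1-to-4 response alternatives printed above, so reversing a reverse-worded item maps 1 to 4 and 2 to 3, that is, reversed = 5 minus the raw response.

```python
REVERSE_ITEMS = {4, 8, 12, 16}  # 1-based numbers of the reverse-worded items

def cesd_total(responses):
    """Total CESD score from a list of 20 raw responses (each 1-4),
    given in questionnaire order."""
    if len(responses) != 20:
        raise ValueError("expected 20 item responses")
    total = 0
    for item, score in enumerate(responses, start=1):
        # On a 1-4 scale, reversing maps 1<->4 and 2<->3: reversed = 5 - score.
        total += (5 - score) if item in REVERSE_ITEMS else score
    return total

# With this scoring, totals range from 20 (least depressed) to 80.
least_depressed = [4 if i in REVERSE_ITEMS else 1 for i in range(1, 21)]
```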
Appendix Source: Radloff, L. S. (1977). The CESD Scale: A self-report depression scale for
research in the general population. Applied Psychological Measurement, 1, 385–401.
WWW Links: Resources on Psychological Measurement
American Psychological Association www.apa.org/science/testing.html
Goldberg’s International Personality Item Pool—royalty-free versions of scales that
measure “Big Five” personality traits
http://ipip.ori.org/ipip/
Mental Measurements Yearbook Test Reviews online
http://buros.unl.edu/buros/jsp/search.jsp
PsychWeb information on psychological tests
www.psychweb.com/tests/psych_tests
Figure 19.5 Computing a Recoded Variable (Dep4) From the Reverse Scored Item Revdep4
Figure 19.7 Computation of Brief Five-Item Version of Depression Scale: Adding Scores Across Items Using Plus Signs
Figure 19.8 Combining Scores from Five Items Using the SPSS MEAN Function (Multiplied By Number of Items)
Figure 19.11 SPSS Menu Selections for Reliability Procedure
Figure 19.12 SPSS Reliability Analysis for Five CESD Items: Dep1, Dep2, Dep3, Dep4, Dep5
NOTE: Dep4 is the recoded version of revdep4, corrected so that the direction of scoring is the same as for other items on the scale.
Figure 19.13 Statistics Selected for SPSS Reliability Analysis
Figure 19.14 SPSS Syntax for Reliability Analysis
Figure 19.15
SPSS Output from First Reliability Procedure for Scale: Briefcesd
Reliability Statistics
Cronbach's Alpha    Cronbach's Alpha Based on Standardized Items    N of Items
      .585                           .614                                5

Inter-Item Correlation Matrix
         dep1     dep2     dep3     dep4     dep5
dep1    1.000     .380     .555     .302     .062
dep2     .380    1.000     .394     .193     .074
dep3     .555     .394    1.000     .446     .115
dep4     .302     .193     .446    1.000    -.129
dep5     .062     .074     .115    -.129    1.000

Item-Total Statistics
        Scale Mean if   Scale Variance if   Corrected Item-     Squared Multiple   Cronbach's Alpha
        Item Deleted    Item Deleted        Total Correlation   Correlation        if Item Deleted
dep1       5.6701           5.786               .511                .341               .455
dep2       5.7010           5.941               .398                .195               .504
dep3       5.6082           4.845               .615                .434               .365
dep4       7.4742           5.710               .294                .237               .562
dep5       4.8247           7.042               .032                .055               .703
Figure 19.16
Output for the Second Reliability Analysis: Scale Reduced to Four Items
NOTE: dep5, "Trouble Concentrating", has been dropped.

Reliability Statistics
Cronbach's Alpha    Cronbach's Alpha Based on Standardized Items    N of Items
      .703                           .712                                4

Inter-Item Correlation Matrix
         dep1     dep2     dep3     dep4
dep1    1.000     .380     .555     .302
dep2     .380    1.000     .394     .193
dep3     .555     .394    1.000     .446
dep4     .302     .193     .446    1.000

Item-Total Statistics
        Scale Mean if   Scale Variance if   Corrected Item-     Squared Multiple   Cronbach's Alpha
        Item Deleted    Item Deleted        Total Correlation   Correlation        if Item Deleted
dep1       3.1753           4.625               .541                .341               .617
dep2       3.2062           4.811               .407                .194               .686
dep3       3.1134           3.810               .633                .419               .542
dep4       4.9794           4.166               .410                .204               .702
Figure 19.17
SPSS Output: Cronbach Alpha Reliability for 20 Item CES-D Scale
Scale: CESDTotal
Case Processing Summary
                         N        %
Cases   Valid           94     95.9
        Excluded(a)      4      4.1
        Total           98    100.0
a. Listwise deletion based on all variables in the procedure.

Reliability Statistics
Cronbach's Alpha    N of Items
      .880              20
Figure 19.19 Possible Steps in the Development of a Multiple-Item Scale