Investigating Content and Construct Representation of a Common-item Design When Creating a Vertically Scaled Test

Paper presented at the annual meeting of the National Council on Measurement in Education, New Orleans, LA

M. Assunta Hardy, Brigham Young University
Michael J. Young, Pearson
Qing Yi, Pearson
Richard R. Sudweeks, Brigham Young University
Damon L. Bahr, Brigham Young University

April 2011
The authors thank Dr. Eula Ewing Monroe from Brigham Young University for providing elementary mathematics content-area expertise. To obtain a copy of this report, send an email request to M. Assunta Hardy, Department of Instructional Psychology & Technology, Brigham Young University, Provo, Utah 84602. E-mail: [email protected].
Abstract
According to the equating guidelines, a set of common items should be a mini version of the total
test in terms of content and statistical representation (Kolen & Brennan, 2004). Differences
between vertical scaling and equating would suggest that these guidelines may not apply to
vertical scaling in the same way that they apply to equating. This study investigated how well the
guideline of content and construct representation was maintained while evaluating two stability
assessment criteria (Robust z and 0.3-logit difference). The results indicated that linking sets that
were not totally representative of the full test forms produced different vertical scales than the
linking sets that were most representative of the full test forms. The results also showed that
large disparities in the composition of linking sets produced statistically significant differences in
the growth patterns of the resulting vertical scales, but small disparities in the composition of
linking sets produced very similar vertical scales. Overall, the Robust z procedure was a more
conservative approach to flagging unstable items.
Introduction
The common-item design (CID) is a data collection plan widely used in creating a
vertical scale. Examinees’ performance on the common items across test forms is used to
indicate the amount of growth that occurs from grade to grade (Kolen & Brennan, 2004).
Different decisions regarding the structure of the design and the composition of the linking item
set may lead to different vertical scales (Camilli, Yamamoto, & Wang, 1993; Harris, 2007; Loyd
& Hoover, 1980; Williams, Pommerich, & Thissen, 1998; Yen, 1986).
The literature on test score equating provides some guidelines for constructing a test that
includes common items as a method of collecting data (Kolen & Brennan, 2004). According to
the guidelines, the set of common items should be a mini version of the total test in terms of
content and statistical representation. Appropriately selecting common items for the linking set
ensures that the common items represent the total test sufficiently.
Potential common items are identified when adjacent test forms are constructed, but the
common items that become part of the final linking set are those common items that are
reasonably stable in difficulty across forms. The equating literature also provides several criteria
for screening common items. Different criteria may result in different sets of linking items.
The research on equating has produced helpful guidelines for selecting and screening
common items, yet the differences between vertical scaling and equating would suggest that
these guidelines may not apply to vertical scaling in the same way that they apply to equating.
Through the equating process, the examinees’ location estimates are adjusted to account for
differences in difficulty between the test forms and placed onto a common metric. In vertical
scaling, the examinee groups that are administered the level tests are assumed to be different in
ability. The sets of test questions on the different test forms are deliberately designed to assess different levels of achievement.
In a review of the literature, Cook and Petersen (1987) concluded that when groups differ
in level of ability, special care must be taken when selecting the set of common items for the
anchor test. Content representativeness of the items is an important concern and can seriously
affect conventional equating results (Cook, Eignor, & Taft, 1985; Klein & Jarjoura, 1985). In the
context of vertical scaling, since the examinee groups are expected to differ in their level of
achievement and the test forms differ in difficulty level, shifts in construct and content
specifications tested across test forms can occur simply by design.
In equating, when common items are screened for stability, the item-difficulty estimates for the common items are expected to be almost the same because the two test forms are designed to be interchangeable. In vertical scaling, however, the item difficulty estimates from two test forms across adjacent grades are expected to differ somewhat. Common items should typically be easier for the students at the higher grade level and more difficult for the students at the lower grade level, and this should be reflected in the item difficulty estimates.
Background and Rationale
Given the differences between equating and vertical scaling, this study investigated how
well the guideline of content and construct representation was maintained in the context of
creating a vertical scale while evaluating two stability assessment criteria. This was
accomplished by analyzing data gathered using the CID illustrated in Figure 1. This CID is a
variation of a design proposed by Sudweeks et al. (2008) and it involved constructing a separate
test for two mathematical constructs (Geometry and Measurement) intended to assess
achievement relative to objectives in the Utah Core Curriculum for several adjacent grade levels.
                           Curricular Grade Level of Items
Grade Level of Examinees   G1   G2   G3   G4   G5   G6   G7   G8
3                          a    b    c    d    e
4                               b    c    d    e    f
5                                    c    d    e    f    g
6                                         d    e    f    g    h

Figure 1. An illustration of the common-item design used to collect the response data.
Using this CID, students in grades 3, 4, 5, and 6 were administered test forms made up of
test questions intended to assess achievement along a continuum. The test questions included in
each form were classified into sets, according to the curricular grade level of the items, and
labeled alphabetically to display the progression along the continuum. That is, moving from left
to right, the item blocks contained test questions that became progressively more difficult as the
content tested became more complex (G1 through G8).
The students’ responses from the two tests were combined into one data set to investigate
the guideline of content and construct representation. In this study, content representation refers
to how well the curriculum assessed by the linking set matches the curriculum assessed by
adjacent grade level tests and construct representation refers to how well a specific content area
assessed (e.g., Geometry) by the linking set matches the content area assessed by adjacent grade
level tests. Due to the unique structure of the data, many common items were drawn upon for
linking purposes.
Different combinations of common items were included in the linking set to create
different vertical scales, thereby altering the degree of content and construct representativeness.
First, since the item blocks in the CID were intended to assess achievement relative to objectives
in the Utah Core Curriculum for particular grade levels, content representation could be
investigated by varying the grade level targeted by the common-item block. Second, since the
students’ responses were gathered from two tests that were developed and administered
separately and later combined, construct representation could be investigated by altering the
common items included in the linking set depending on the specific content area assessed by the
common items.
Screening common items. The screening criteria reviewed in this study are those used with tests calibrated under the Rasch IRT model: (a) the Robust z statistic and (b) the 0.3-logit difference.
The Robust z statistic is a z-score-like statistic that is not affected by outliers. A conventional z statistic is computed from the mean and standard deviation; however, both the mean and standard deviation are sensitive to outliers. Instead, the Robust z statistic, developed by Huynh as part of the South Carolina Basic Skills Assessment (Huynh, Gleaton, & Seaman, 1992), uses the median and the interquartile range, which are insensitive to outliers.
For each potential common item, the Robust z is defined as

z = \frac{(b_{iB} - b_{iT}) - Md}{0.74 \times IQR}, \qquad (1)

where b_{iB} is the b parameter value for common item i for the base grade, b_{iT} is the b parameter value for common item i for the grade being transformed, Md is the median of the item-difficulty differences for all potential linking items, and IQR is the interquartile range of those differences.
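To make the computation concrete, here is a minimal Python sketch of Equation 1. The function name, the use of NumPy, and the data layout are illustrative assumptions, not part of the original analysis.

```python
import numpy as np

def robust_z(b_base, b_transformed):
    """Robust z (Equation 1) for each potential common item.

    b_base        : Rasch difficulty estimates from the base-grade calibration
    b_transformed : estimates for the same items from the grade being transformed
    """
    d = np.asarray(b_base) - np.asarray(b_transformed)  # b_iB - b_iT
    md = np.median(d)                                   # Md: median difference
    q75, q25 = np.percentile(d, [75, 25])
    iqr = q75 - q25                                     # IQR of the differences
    return (d - md) / (0.74 * iqr)
```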
The 0.3-logit difference criterion is based on a fixed difference in difficulty parameter estimates for common items from two test forms. The average standard error of a Rasch item difficulty is around 0.15 logits for achievement tests calibrated on the results of 500 examinees. Under the traditional 95% confidence interval, two standard errors amount to 0.3 logits; thus, the 0.3-logit criterion represents two standard errors.
Two studies have compared stability assessment procedures (Huynh & Rawls, 2009;
Miller, Rotou, & Twing, 2004), but these analyses were conducted in the context of equating.
Given that these procedures are also used in the context of vertical scaling, this study sought to
understand how the two procedures differed when the common items were screened for the
purpose of establishing linking sets to construct a vertical scale.
Purpose of Study
The purpose of this study was fourfold:
1. Combine the response data from two tests into one data set (Geometry test with the
Measurement test) and vertically scale the combined items.
2. Compare the effects of varying the content representativeness of the linking sets in
creating the vertical scale by altering the grade level targeted by the common items.
3. Compare the effects of varying the construct representativeness of the linking sets in
creating the vertical scale by altering the content area composition of the common items.
4. Evaluate two procedures for assessing the stability of the common items.
Research Questions
More specifically, the following three research questions were investigated in this study
for a test measuring students’ proficiency levels in Geometry and Measurement:
1. How did the resulting vertical scales vary in terms of grade-to-grade growth and within-
grade variability across the four consecutive grades when three different grade-level
targets (on-level and/or out-of-level common-item blocks) were used in the linking
process?
2. How did the resulting vertical scales vary in terms of grade-to-grade growth and within-
grade variability across the four consecutive grades when three different sets of content
area linking items were used for each combined data set?
3. How did the resulting vertical scales vary in terms of grade-to-grade growth and within-
grade variability across the four consecutive grades when two stability assessment
procedures (Robust z and 0.3-logit difference) were used to select the common items?
This vertical scaling study made use of a unique data set, which allowed the above
questions to be investigated. This study provides a rare opportunity to use operational data to
address important issues in vertical scaling such as the grade-level targeting of common-item
linking sets and their content composition. The findings of this study could clarify how the
equating guidelines on common-item selection transfer to the process of creating a vertical scale.
Method
Common-item Test Design (CID)
The CID used in this study encompassed the following four premises:
1. A separate test would be developed for each mathematical construct.
2. The state indicators selected for the test blueprint would measure understandings and
skills that were developmentally appropriate for students at each grade level.
3. The skills and understandings specified for the various grades successively increased in
cognitive complexity in grade-level order.
4. For each test, the collective set of ordered skills and understandings aggregated across
grades defined a single developmental continuum representing progressive levels of
attainment of a single underlying construct.
Therefore for each mathematical construct, the test consisted of eight blocks of items, labeled a
through h, intended to assess achievement along a continuum relative to objectives in the Utah
Core Curriculum for grade levels ranging from G1 to G8 (see Figure 1). For example, the item
block labeled a represented a set of items that assessed achievement for G1, and so forth.
One test form was constructed for each grade level (grades 3, 4, 5, and 6) and each form
contained five different blocks of items. For example, for students in the third grade, a test form
was constructed including item blocks a through e. The items in these five blocks were intended
to assess achievement along a continuum for grade levels ranging from G1 to G5. The same
procedure was taken in constructing the forms for grades 4, 5, and 6. In other words, at each
grade level students were administered blocks of items which included: (a) items assessing
objectives targeted one and two grades below the students’ classified grade level, (b) items
assessing objectives targeted at the students’ classified grade level, and (c) items assessing
objectives targeted one and two grades above the students’ classified grade level. This
assignment of items was done to minimize ceiling or floor effects for students who were either above or below the average ability level in their respective grades, without penalizing the average student.
Each item block consisted of eight items for the Geometry test and nine items for the
Measurement test. Therefore, each test form consisted of 40 items for the Geometry test and 45
items for the Measurement test.
Potential common-item links. Four of the six total possible blocks of items (67%)
across any two adjacent grades were purposely designed to be common-item blocks and could
subsequently be used in the linking set. Therefore, adjacent test forms shared 32 items in
common (8 items × 4 blocks) for the Geometry test and 36 items in common (9 items × 4 blocks)
for the Measurement test.
Sample
In the spring of 2009, the two tests were administered to approximately 2,270 students in
grades 3, 4, 5, and 6 in 15 schools from five districts on two separate days. Students that were
present during each of the two testing days participated in the study.
Most students were administered both tests. A total of 2,263 students responded to the
items in the Geometry test on one day (see Table 1). A total of 2,268 students responded to the
items in the Measurement test on another day. A total of 2,098 of the same students completed both the Geometry and Measurement tests.
Table 1
Number of Student Participants by Test and Grade

Grade    Geometry    Measurement    Geometry & Measurement
3        631         612            594
4        541         541            518
5        609         607            567
6        482         439            419
Total    2,263       2,199          2,098
Data
Since many of the same students took both tests, the students' response data were combined into one data set for the purpose of this analysis. Consequently, the data included a total of 2,098 students' responses to 85 items for students in grades 3, 4, 5, and 6 (Table 2).
Table 2
Assignment of Items for the Geometry and Measurement Tests Combined

Grade Level of    Curricular Grade Level of Items                     Total Items
Examinees         G1    G2    G3    G4    G5    G6    G7    G8        Per Form
3                 17    17    17    17    17                          85
4                       17    17    17    17    17                    85
5                             17    17    17    17    17              85
6                                   17    17    17    17    17        85
Variations of the Linking Set
Table 3 summarizes the testing conditions used in this study. In total, 18 vertical scales
were constructed (3 testing content representation 3 testing construct representation 2
stability assessment procedures). The common items selected for linking were manipulated to
test the content and construct representation of the common-item set relative to the total test. In
addition, two stability assessment procedures were used separately to screen the items selected,
in which only the stable ones remained part of the linking set to create the vertical scales.
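For clarity, the 3 × 3 × 2 crossing can be enumerated directly; the following is a minimal Python sketch with assumed condition labels.

```python
from itertools import product

grade_targets = ("on- and out-of-level", "on-level", "out-of-level")
content_areas = ("Geometry & Measurement", "Geometry", "Measurement")
screens = ("Robust z", "0.3-logit difference")

# Each tuple identifies one of the 18 vertical scales constructed in the study.
conditions = list(product(grade_targets, content_areas, screens))
assert len(conditions) == 18
```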
Table 3
Summary of Testing Conditions

Condition Tested            Observed Set of Measures
Content Representation      1. On-level and out-of-level common items
                            2. On-level common items only
                            3. Out-of-level common items only
Construct Representation    1. Geometry and Measurement common items
                            2. Geometry common items only
                            3. Measurement common items only
Stability Assessment        1. Robust z
                            2. 0.3-logit difference
Two different approaches were taken to manipulate the composition of the linking set: (a)
grade-level-targeted common items, and (b) content-area-specific common items. By selecting
common items based on the grade level targeted by the items, content representation of the
common items relative to the total test could be investigated.
Three variations of grade-level-targeted common item sets were used. First, all possible
common items across adjacent grades, also referred to as on-level and out-of-level linking items,
were selected as potential linking items to be included in the linking sets. In Figure 2, the three
bold rectangles each identify four common-item blocks that represent the on- and out-of-level
linking items across two adjacent grades. Table 2 summarizes the manner in which the items
were distributed across grade levels. With four common-item blocks selected, the data included a
total of 68 potential common items (17 items × 4 blocks) across any two adjacent grades.
A second variation involved only the on-level linking items. The on-level linking items
are defined as the common items across adjacent grades that assess objectives corresponding to
the students’ classified level. In Figure 3, the three bold squares each identify two common-item
blocks that represent the on-level linking items across adjacent grades. The data included a total
of 34 potential common items (17 items × 2 blocks) across any two adjacent grades.
The third variation involved only the out-of-level linking items. The out-of-level linking
items are defined as the common items across adjacent grades that assess objectives above and
below the students’ classified level. In Figure 4, the three pairs of bold rectangles identify two
common-item blocks that represent the out-of-level linking items across adjacent grades. The
data included a total of 34 potential common items (17 items × 2 blocks) across adjacent grades.
Figure 2. On-level and out-of-level common items included in the linking set.

Figure 3. Only on-level common items included in the linking set.

Figure 4. Only out-of-level common items included in the linking set.
The second approach used to alter the composition of the linking set involved selecting
content-area-specific common items. By selecting common items based on the mathematical
construct assessed by the items, construct representation of the common items relative to the
total test could be investigated.
Three variations of content-area-specific common item sets were used to create the
vertical scales. The data set was composed of students’ responses to a relatively even number of
items from both mathematical constructs. Therefore, the common items included in the linking
set were (a) items assessing both mathematical constructs, (b) items assessing the Geometry
construct, and (c) items assessing the Measurement construct.
Table 4 outlines the number of common items for each variation of the linking set. The
table indicates the total number of potential common items across any two adjacent grades by the
items’ targeted grade level and by the items’ content area. For example, when only the common
items assessing Geometry were included in the common-item set, the set comprised 16 on-level and 16 out-of-level common items, for a total of 32.
Table 4
Total Number of Potential Common Items Across Any Two Adjacent Grades by Grade-Level Target and Content Area for the Geometry and Measurement Data

                             Grade-level-targeted Common Items
Content-area-specific        On-level &
Common Items                 Out-of-level    On-level    Out-of-level
Geometry & Measurement       68              34          34
Geometry                     32              16          16
Measurement                  36              18          18
Analysis
Table 5 summarizes the scaling process used in this study. The same IRT scaling method
was applied in creating the vertical scales for all variations of the linking set using two stability
assessment procedures.
Rasch scaling. Given the widespread use of the Rasch model in large-scale assessment and the small sample sizes involved in this study (Lord, 1983), the Rasch model was used to analyze the student response data. Items that were not included in a student's test booklet were coded as Not Presented for that student. Items that were included in a student's booklet but not reached were also coded as Not Presented, so students were not penalized for unreached items. Omitted items were coded as incorrect.
The WINSTEPS software (Linacre, 2006) was used to estimate the item and proficiency
parameters. Using item centering, the item parameters for each level test (grades 3, 4, 5, and 6)
were estimated separately.
Table 5
Summary of the Scaling Process
Element Description of Element
Scaling Method Item Response Theory (IRT)
Computer Software WINSTEPS (Linacre, 2006)
IRT Scaling Model Rasch Model
Calibration Method Separate calibration
Person Ability Estimation Joint Maximum Likelihood Estimate (JMLE)
Stability Assessment Robust z and 0.3-logit difference
Base Grade for Linking Grade 4
Scale Transformation Mean/Mean method
When separate calibration is used, because of the indeterminacy of IRT scales, each level
test (i.e., grades 3, 4, 5, and 6) is set to have a mean of 0 and a standard deviation of 1.
Therefore, some linear transformation procedure was needed to place all grades onto a common
scale. Prior to the linear transformations, the parameter estimates for the common items were
assessed to determine how stable the item parameters were across the adjacent grades. Common
items identified as unstable were removed from the linking sets.
The Robust z statistic and the 0.3-logit difference were computed for each potential
common item for every variation of the linking set. According to our CID (Figure 1), common
items appeared in multiple test forms. Item stability, or instability, was defined only between any
two forms; therefore, an item could be classified as stable in one pair of test forms and unstable
in another pair of test forms. Items that were labeled as unstable under each procedure were
excluded from the common-item sets.
Robust z procedure. Once the item difficulties were obtained from the separate
calibration procedure in WINSTEPS, stable and unstable common items across each pair of
adjacent grades (G3/G4, G4/G5, and G5/G6) were identified using the Robust z procedure
(Huynh & Rawls, 2009).
In this study, alpha was set at 10 percent and the positive critical value for z* was 1.645.
Potential common items with a Robust z statistic smaller than z* in absolute value were
identified as stable and kept as part of the linking set. Items with a Robust z statistic greater than or equal to z* in absolute value were identified as unstable and were excluded from the linking set.
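Assuming the robust_z helper sketched in the Screening section, and hypothetical arrays of difficulty estimates and item identifiers, the decision rule might be expressed as follows.

```python
# Keep items whose Robust z falls inside the critical value (alpha = .10).
z = robust_z(b_grade4_common, b_grade3_common)   # e.g., the G3/G4 link
stable_mask = np.abs(z) < 1.645
linking_set = [item for item, keep in zip(common_item_ids, stable_mask) if keep]
```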
0.3-logit difference procedure. The 0.3-logit difference procedure involves a simple
computation, but a variant of the procedure was used in this study. In common-item equating of
two Rasch-calibrated tests, the absolute value of the item-difficulty difference is computed for
each common item (Miller et al., 2004). Since the two test forms are expected to be
interchangeable, either item-difficulty estimate could be subtracted from the other to compute the
difference. Taking the absolute value of the difference would result in the same value. Once the
absolute difference is computed, only those common items with an absolute difference in Rasch
difficulty estimate less than 0.3 logit are described as being stable and included in the linking
process. The unstable common items are dropped from the linking set.
Since this study involved multiple Rasch-calibrated tests that were vertically scaled, only
the item-difficulty difference was computed for each common item for adjacent grades. This
difference was computed by subtracting the item-difficulty estimate of the lower grade from the
item-difficulty estimate of the higher grade. Since in vertical scaling the item difficulty estimates
from two test forms across adjacent grades are expected to differ somewhat, a negative difference is desirable (b_{n-1} > b_n, where n indexes the grade). The item difficulty estimate for a common
item taken by students at the lower grade should be greater than the item difficulty estimate for
the same item taken by students at the higher grade. Taking the absolute value of a negative
difference could falsely identify a stable item as unstable. Therefore in the context of this study,
the 0.3-logit difference procedure only took the item-difficulty difference into consideration for
each common item for each pair of grades being linked.
Similar to the Robust z procedure, the item difficulties obtained from the separate
calibration procedure using the WINSTEPS software were applied to identify stable and unstable
common items across each pair of adjacent grades (G3/G4, G4/G5, and G5/G6) for the 0.3-logit
difference procedure. The item-difficulty difference was computed for each potential common
item and only those common items with a difference in Rasch difficulty estimate less than 0.3
logit were described as being stable and included in the linking process.
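A minimal sketch of this signed variant, under the same illustrative assumptions as the earlier snippets:

```python
import numpy as np

def stable_by_logit_difference(b_lower_grade, b_higher_grade, criterion=0.3):
    """Signed 0.3-logit screen adapted for vertical scaling.

    The difference is b_higher - b_lower; negative values are expected
    (common items should look easier to the higher grade), so the signed
    difference, not its absolute value, is compared with the criterion.
    """
    d = np.asarray(b_higher_grade) - np.asarray(b_lower_grade)
    return d < criterion  # True marks a stable item
```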
Mean/mean method of linking. Once the unstable items were deleted from the linking
sets for both stability assessment procedures, the remaining common items for each linking set
were used in the scale transformation phase. The mean/mean method was used to
transform the estimates onto a common scale. The additive or equating constant was computed
for each pair of adjacent grades for each vertical scale. Since the vertical scales encompassed
four grade levels, three additive constants (G3/G4, G4/G5, and G5/G6) were computed for each
vertical scale. Subsequently, the appropriate equating constant was added to the parameter and
proficiency estimates of each level test to transform the estimates to the base-grade scale for each
vertical scale. In this study, G4 was designated as the base level for the common scale.
Following the scale transformations, all the grades were rescaled so that the common scale had a
mean of 50.0 and a standard deviation of 10.0.
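A sketch of the mean/mean transformation and the final rescaling described above; the function names are assumptions and the toy numbers are invented solely to show the mechanics.

```python
import numpy as np

def mean_mean_constant(b_common_base, b_common_other):
    """Additive constant that moves the other grade onto the base-grade scale."""
    return float(np.mean(b_common_base) - np.mean(b_common_other))

def rescale(theta, target_mean=50.0, target_sd=10.0):
    """Linear rescaling of the common scale to a mean of 50 and an SD of 10."""
    theta = np.asarray(theta, dtype=float)
    return target_mean + target_sd * (theta - theta.mean()) / theta.std()

# Toy example: link grade 3 to the grade 4 base scale.
b_common_on_g4 = np.array([-0.42, 0.10, 0.55, 1.02])  # invented estimates
b_common_on_g3 = np.array([0.51, 1.00, 1.46, 1.98])
constant_g3g4 = mean_mean_constant(b_common_on_g4, b_common_on_g3)  # about -0.93
```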
Evaluation Criteria
The properties used to compare the scaling results included (a) grade-to-grade growth, (b)
grade-to-grade variability, and (c) separation of grade distributions (Kolen & Brennan, 2004).
These properties were compared by computing the following statistics: means, medians, standard
deviations, interquartile ranges, and effect sizes. The observed differences were used to assess
the impact that different choices about the linking set have on the resulting vertical scales.
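These comparison statistics are straightforward to compute; the following sketch assumes a simple dictionary layout for the scaled scores.

```python
import numpy as np

def grade_summary(scores_by_grade):
    """Mean, median, SD, and IQR of scaled scores for each grade.

    scores_by_grade: dict mapping a grade label to an array of scaled scores.
    """
    summary = {}
    for grade, scores in scores_by_grade.items():
        s = np.asarray(scores, dtype=float)
        q75, q25 = np.percentile(s, [75, 25])
        summary[grade] = {"mean": s.mean(), "median": np.median(s),
                          "sd": s.std(ddof=1), "iqr": q75 - q25}
    return summary
```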
Results
This study investigated three sets of grade-level-targeted common items, three sets of
content-area-specific common items, and two item stability procedures. A total of 18 vertical
scales were created. Appendix A enumerates the vertical scales and the testing conditions used to
construct the individual scales.
Robust z versus 0.3-logit difference
Table 6 reports the number of stable items identified in each testing condition for both the
Robust z and 0.3-logit difference procedures. Overall, the Robust z procedure was a more
conservative approach to flagging unstable items. The common items identified as unstable using
the 0.3-logit difference procedure were also identified as unstable using the Robust z procedure.
In addition, the Robust z procedure identified on average nine percent more items as unstable.
Regardless of the stability assessment procedure, the remaining common items in each
linking set represented at least 80 percent of the pool of linking items, except for three cases
under the Robust z procedure (see Table 6). First, when Measurement on- and out-of-level
common items were screened using the Robust z procedure, only 28 of the 36 common items
were retained, representing 78 percent of the linking pool. Second, only 26 of the 34 Geometry
and Measurement on-level common items were retained, which represented 77 percent of the
linking pool. And third, only 13 of the 18 Measurement on-level common items were retained,
which represented 72 percent of the linking pool.
Table 6
Number and Percentage of Stable Items by Grade-level-targeted Common Items, Content-area-specific Common Items, and Stability Assessment Procedure
Robust z 0.3-logit difference
Content Area by Level G3/G4 G4/G5 G5/G6 G3/G4 G4/G5 G5/G6
On- and Out-of-Level
Geometry & Measurement 61 (90%) 59 (87%) 65 (96%) 68 (100%) 67 (99%) 67 (99%)
Geometry 30 (94%) 30 (94%) 31 (97%) 32 (100%) 32 (100%) 32 (100%)
Measurement 28 (78%) 35 (97%) 33 (92%) 36 (100%) 35 (97%) 35 (97%)
On-Level
Geometry & Measurement 26 (77%) 31 (91%) 33 (97%) 34 (100%) 34 (100%) 33 (97%)
Geometry 14 (88%) 14 (88%) 14 (88%) 16 (100%) 16 (100%) 16 (100%)
Measurement 13 (72%) 16 (89%) 17 (94%) 18 (100%) 18 (100%) 17 (94%)
Out-of-Level
Geometry & Measurement 31 (91%) 32 (94%) 31 (91%) 34 (100%) 33 (97%) 34 (100%)
Geometry 15 (94%) 16 (100%) 14 (88%) 16 (100%) 16 (100%) 16 (100%)
Measurement 16 (89%) 16 (89%) 16 (89%) 18 (100%) 17 (94%) 18 (100%)
Table 7 displays the equating constants used to link across two adjacent grades for both
stability assessment procedures. Since the vertical scales encompassed four grade levels, three
additive constants (G3/G4, G4/G5, and G5/G6) were computed for each vertical scale. A fourth
column, representing the sum of the additive constants for G4/G5 and G5/G6, was included to
illustrate the magnitude of the combined transformations required to link G6 to G4. Comparing
across the Robust z and 0.3-logit difference procedures, only four of the 27 equating constants
were the same, indicating that in four cases, the same items were retained in the linking pool for
both procedures.
Table 7
Equating Constants used to Link Across Two Adjacent Grades by Grade-level-targeted Common Items, Content-area-specific Common Items, and Stability Assessment Procedure
Equating Constant
                        Robust z                                0.3-logit difference
Content Area by Level   G3/G4   G4/G5   G5/G6   G4/5 + G5/6    G3/G4   G4/G5   G5/G6   G4/5 + G5/6
On- and Out-of-Level
Geometry & Measurement  -0.916  0.840   0.697   1.537          -0.957  0.821   0.729   1.550
Geometry -0.883 0.674 0.761 1.435 -0.890 0.691 0.796 1.486
Measurement -1.022 0.940 0.615 1.555 -1.017 0.940 0.667 1.607
On-Level
Geometry & Measurement -1.067 1.041 0.800 1.841 -1.053 1.012 0.800 1.813
Geometry -0.944 0.814 0.923 1.737 -0.955 0.843 0.814 1.658
Measurement -1.064 1.195 0.787 1.982 -1.140 1.163 0.787 1.950
Out-of-Level
Geometry & Measurement -0.816 0.591 0.568 1.159 -0.861 0.624 0.659 1.283
Geometry -0.871 0.538 0.638 1.176 -0.824 0.538 0.777 1.315
Measurement -0.764 0.644 0.469 1.113 -0.894 0.705 0.554 1.259
The differences observed in the two stability assessment procedures were not evident in
the resulting vertical scales. The vertical scales constructed using the linking-item sets identified
by the Robust z procedure exhibited very similar grade-to-grade growth and within-grade
variability as the vertical scales constructed using the linking-item sets identified by the 0.3-logit
difference. The similarities in the vertical scales are depicted in Figures 5, 6, and 7. The graphs
on the left of each figure represent the results obtained when the Robust z procedure was used to
screen the common items and the graphs on the right represent the results obtained when the 0.3-
logit difference procedure was used.
Differences in Within-grade Variability from Grade to Grade
The same response data were used to create the 18 vertical scales. The differences between
the vertical scales depended on the common items used to calculate the equating constant, which
allowed the scores to be placed on the base scale. Therefore, the shape and spread of the theta
distributions at each grade remained constant across the 18 conditions. The distributions only
shifted up or down at each grade depending on the equating constant used.
Table 8 reports the standard deviation and interquartile range for each grade. The spread
was the same for each of the 18 vertical scales. The overall pattern in grade-to-grade variability
was a decrease in dispersion from grade 3 to 4, followed by greater variability in the scores as
the grades increased.
Table 8
Within-Grade Dispersion of Scaled Scores by Grade
Measure of Dispersion    Grade 3    Grade 4    Grade 5    Grade 6
Standard Deviation 8.82 8.62 8.97 9.60
Interquartile Range 11.63 11.20 11.40 12.10
The pattern of within-grade variability in students’ scaled scores is depicted graphically
in Figures 5, 6 and 7. The six graphs in Figure 5 summarize the variability within grade when
both on- and out-of-level common items were used in the linking set. The six graphs in Figure 6
summarize the variability within grade when on-level common items were used in the linking
set, and the six graphs in Figure 7 summarize the variability within grade when out-of-level
common items were used in the linking set. The two top graphs depict the results when both the
Geometry and Measurement common items were used, the two graphs in the middle row depict
the results when only the Geometry common items were used, and the two bottom graphs depict
the results when only the Measurement common items were used.
Each column in each of the graphs represents the distribution of scaled scores for the
students in one grade. The five points in each column describe the location of the 10th, 25th, 50th,
75th, and 90th percentiles in the distribution of scores for that grade. The horizontal lines connect
the same percentile in adjacent grades and show the pattern of accelerated or decelerated growth
from grade to grade for students at the specified percentile (discussed in the next section).
The interquartile range – the distance between the 25th and 75th percentiles – provides an
index of within-grade variability that is insensitive to the influence of outliers. In all 18
vertical scales there was an increase in the interquartile range of the scores for the higher grades
(fourth through sixth grade). The increased spread at the higher grade levels was more evident at
the upper percentile (90th) of the respective grade-level distributions.
Differences in Grade-to-Grade Growth
Median grade-to-grade growth for the three grade-level-targeted common-item sets.
The solid horizontal black line (labeled P50) in each of the 18 graphs in Figures 5, 6 and 7
displays the median proficiency estimate for the students in each grade. The graphs also show the
pattern of increasing growth in students’ achievement across grades. Since the fourth grade was
used as base grade, the median scaled score for the fourth graders in the graphs for each vertical
scale is constant at 58.3.
On- and out-of-level common items. The six graphs in Figure 5 summarize the average
increase in achievement from grade to grade when on- and out-of-level common items were used
in the linking set for the Robust z and the 0.3-logit difference procedure. The two top graphs
summarize the results obtained from using on- and out-of-level common items from both the
Geometry and Measurement items, the two graphs in the middle row summarize the results obtained from using only on- and out-of-level Geometry common items, and the two bottom graphs summarize the results obtained from using only on- and out-of-level Measurement common items.

Figure 5. Differences in grade-to-grade growth across corresponding percentile points for on- and out-of-level common items by content-area-specific common items and stability assessment procedure.

Figure 6. Differences in grade-to-grade growth across corresponding percentile points for on-level common items by content-area-specific common items and stability assessment procedure.

Figure 7. Differences in grade-to-grade growth across corresponding percentile points for out-of-level common items by content-area-specific common items and stability assessment procedure.
The overall growth pattern depicted in the six graphs indicated a linear increase in
median performance from grade to grade when both the Geometry and Measurement common
items were used in the linking set and when only the Measurement common items were used.
The greatest median growth from grade to grade was observed when only the Measurement common items made up the linking set. A nonlinear increase in median performance from grade to grade was observed when only the Geometry common items were used in the linking set.
The relatively flat pattern of growth in the vertical scale when only the Geometry
common items were used appears between grades four and five. This growth pattern between the
fourth and fifth grade was exhibited in a similar study conducted by Sudweeks et al. (2008) in
which two calibration methods were used to calibrate a different set of Geometry items that were
administered to a different set of students. Sudweeks et al. concluded that since the relative lack
of average growth from fourth to fifth grade was manifest in the results of both calibration
methods, this pattern was not an artifact of the calibration method, but could be attributed to: (a)
one or more characteristics of the test items, (b) differences in the Geometry curriculum, (c) the
characteristics of the students, and/or (d) the nature of the instruction provided to the students.
This pattern of decelerated growth between grades four and five was evident in both studies. The
findings of this study support the conclusion that this pattern is due to reasons other than the
psychometric properties of the Geometry items.
On-level common items. The six graphs in Figure 6 summarize the average increase in
achievement from grade to grade when on-level common items were used in the linking set for
the Robust z and the 0.3-logit difference procedure. The two top graphs summarize the results
obtained from using on-level common items from both the Geometry and Measurement items,
the two graphs in the middle row summarize the results obtained from using only on-level
Geometry common items, and the two bottom graphs summarize the results obtained from using
only on-level Measurement common items.
The overall growth pattern depicted in the six graphs indicated a linear increase in
median performance from grade to grade when on-level common items from both the Geometry
and Measurement item pool were used in the linking set. A greater linear increase in median
performance was observed when the on-level common items from the Measurement item pool
made up the linking set. The least amount of median grade-to-grade growth was observed when
only the Geometry common items were used in the linking set. Again, a nonlinear increase in
median performance was observed when the linking set was made up of only on-level Geometry
common items.
Out-of-level common items. The six graphs in Figure 7 summarize the average increase
in achievement from grade to grade when out-of-level common items were used in the linking
set for both stability assessment procedures. The two top graphs summarize the results obtained
from using out-of-level common items from both the Geometry and Measurement items, the two
graphs in the middle row summarize the results obtained from using only out-of-level Geometry
common items, and the two bottom graphs summarize the results obtained from using only out-
of-level Measurement common items.
The overall growth pattern depicted in the six graphs indicated a nonlinear increase in
median performance when out-of-level common items were used in the linking set regardless of
the content area assessed by those common items. The greatest increase, however, was observed when the out-of-level Measurement common items made up the linking set. Similar median grade-to-grade growth was observed both when the Geometry and Measurement common items and when only the Geometry common items were used in the linking set.
Median grade-to-grade growth for the three content-area-specific common-item
sets. The median proficiency estimate for the students in each grade was compared according to
the content area assessed by the linking set when the grade-level target changed. The pattern of
growth is displayed across the individual graphs illustrated in Figures 5, 6 and 7.
Geometry and measurement common items. The two top graphs in Figure 5 summarize
the average increase in achievement from grade to grade when items assessing both Geometry
and Measurement were included in the on- and out-of-level common-item set. The two top
graphs in Figure 6 summarize the average increase in achievement from grade to grade when
items assessing both Geometry and Measurement were included in the on-level common-item
set. The two top graphs in Figure 7 summarize the average increase in achievement from grade
to grade when items assessing both Geometry and Measurement were included in the out-of-
level common-item set.
The growth patterns depicted in the six top graphs across Figures 5, 6 and 7 indicated that
when items assessing both content areas were included in the linking set, the greatest linear
increase was exhibited when the items were on-level-targeted common items. When the
Geometry and Measurement common items were taken from the on- and out-of-level common-
item pool, the median performance from grade to grade exhibited linear growth, but it was not as
great. The least amount of grade-to-grade growth (nonlinear) was depicted in the vertical scale
that was constructed using out-of-level Geometry and Measurement common items.
Geometry only common items. The two graphs in the middle row in Figure 5 summarize
the average increase in achievement from grade to grade when items assessing only Geometry
were included in the on- and out-of-level common-item set. The two graphs in the middle row in
Figure 6 summarize the average increase in achievement from grade to grade when items
assessing only Geometry were included in the on-level common-item set. The two graphs in the
middle row in Figure 7 summarize the average increase in achievement from grade to grade
when items assessing only Geometry were included in the out-of-level common-item set.
The growth patterns depicted in the six graphs in the middle rows across Figures 5, 6 and
7 indicated that when items assessing only Geometry content were included in the linking set, the
pattern of median performance from grade to grade was nonlinear. The greatest increase was
exhibited when the items were on-level-targeted common items. The least amount of growth was
depicted in the vertical scale that was constructed using out-of-level Geometry common items.
Measurement only common items. The two bottom graphs in Figure 5 summarize the
average increase in achievement from grade to grade when items assessing only Measurement
were included in the on- and out-of-level common-item set. The two bottom graphs in Figure 6
summarize the average increase in achievement from grade to grade when items assessing only
Measurement were included in the on-level common-item set. The two bottom graphs in Figure
7 summarize the average increase in achievement from grade to grade when items assessing only
Measurement were included in the out-of-level common-item set.
The growth patterns depicted in the six bottom graphs across Figures 5, 6 and 7 indicated
that when items assessing only Measurement content were included in the linking set, the pattern
of median performance from grade to grade was linear when on- and out-of-level common items
and when on-level common items were used, but nonlinear when only out-of-level common
items were used. The greatest increase was exhibited when the Measurement items were on-
level-targeted common items. When the on- and out-of-level Measurement items were used, the
median performance was not as great. The least amount of growth was depicted in the vertical
scale that was constructed using out-of-level Measurement common items.
Mean grade-to-grade growth. Figure 8 displays the mean proficiency estimate for the
students in each grade and the pattern of increasing growth in students’ achievement across
grades for the 18 vertical scales. Two styles of lines were used to distinguish between the vertical
scales according to stability assessment procedure. The dotted lines represent the vertical scales
constructed using the Robust z procedure and the solid lines represent the vertical scales
constructed using the 0.3-logit difference procedure.
Three colors were used to distinguish between the vertical scales according to grade-
level-targeted common items. The blue lines represent the vertical scales that were created using
on- and out-of-level common items, the burgundy lines represent the vertical scales that were
created using on-level common items, and the green lines represent the vertical scales that were
created using out-of-level common items.
Three shapes identifying the mean growth at each grade were used to distinguish between
the vertical scales according to content-area-specific common items. The triangles identify the
vertical scales that were created using linking items that assessed both Geometry and
Measurement content. The circles identify the vertical scales that were created using linking
items that assessed only Geometry content and the squares identify the vertical scales that were
created using linking items that assessed only Measurement content.
Except for the vertical scales that were created using out-of-level common items,
particularly at the transition from grade 5 to 6, the growth pattern depicted in Figure 8
consistently indicated similar grade-to-grade growth for both stability assessment procedures.
The differences between the means were tested and the results indicated that the differences were
not statistically significant (see Appendix B).
Figure 8 further illustrates that the greatest grade-to-grade growth was displayed when
on-level common items (represented by the burgundy lines) were used in the linking set to create the
vertical scale. The least grade-to-grade growth was displayed when out-of-level common items
(represented by the green lines) were used in the linking set to create the vertical scale. The
vertical scales that included only common items that assessed Measurement content (represented
by the square) in the linking set exhibited the greatest grade-to-grade growth and the vertical
scales that included only common items that assessed Geometry content (represented by the
circle) exhibited the least grade-to-grade growth. The differences between the means were
statistically significant at each grade (see Appendix B).
Separation of Grade Distributions
Due to the difference in sample sizes in the four grades, this analysis used weighted
variances to calculate the effect size indices. According to Young (2006) the variances of the
groups being compared should be weighted by their respective sample sizes.
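The paper does not reproduce Young's (2006) formula, so the following Python sketch uses a sample-size-weighted pooled variance as one plausible reading; the exact weighting, and all names, should be treated as assumptions.

```python
import numpy as np

def weighted_effect_size(scores_lower_grade, scores_upper_grade):
    """Between-grade effect size with variances weighted by sample size
    (an assumed reading of Young, 2006, not a verified formula)."""
    x = np.asarray(scores_lower_grade, dtype=float)
    y = np.asarray(scores_upper_grade, dtype=float)
    n_x, n_y = len(x), len(y)
    pooled_var = (n_x * x.var(ddof=1) + n_y * y.var(ddof=1)) / (n_x + n_y)
    return (y.mean() - x.mean()) / np.sqrt(pooled_var)
```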
Between grade effect size indices. The effect size estimates computed for the different
scale score distributions are reported in Table 9 according to the composition of the linking sets
for each stability assessment procedure. The results indicated that the effect sizes produced by
the 18 vertical scales for corresponding grade-to-grade transitions were different, but four
distinct patterns were evident.
Figure 8. Mean growth from grade to grade by grade-level-targeted and content-area-specific common items and by stability assessment procedure.
First, the greatest growth was displayed at the transition between grades 3 and 4. This
increase was followed by a decrease in growth from grades 4 to 5 and another increase from
grades 5 to 6. Second, the vertical scales created using on-level common items demonstrated the
greatest growth at each grade-to-grade transition compared to the vertical scales created using
out-of-level common items or on- and out-of-level common items. Third, the largest effect sizes
were generally exhibited at each grade-to-grade transition when only items assessing
Measurement content or items assessing both Geometry and Measurement content were used in
the on- and out-of-level linking sets and in the on-level linking sets. Fourth, the decelerated
growth demonstrated in the vertical scales that used the Geometry only common items also
indicated low effect sizes for the transition from grade 4 to 5. These results support previous
findings.
Table 9
Effect Sizes Computed for Different Scale Score Distributions by Grade-level-targeted Common Items, Content-area-specific Common Items, and Stability Assessment Procedure
Effect Size
Robust z 0.3-logit difference
Content Area by Level G3/G4 G4/G5 G5/G6 G3/G4 G4/G5 G5/G6
On- and Out-of-Level
Geometry & Measurement 0.7722 0.4366 0.5728 0.8188 0.4155 0.6068
Geometry 0.7339 0.2488 0.6419 0.7416 0.2673 0.6794
Measurement 0.8934 0.5510 0.4838 0.8874 0.5510 0.5404
On-Level
Geometry & Measurement 0.9447 0.6650 0.6845 0.9287 0.6328 0.6845
Geometry 0.8042 0.4078 0.8172 0.8164 0.4406 0.6997
Measurement 0.9412 0.8403 0.6702 1.0284 0.8037 0.6702
Out-of-Level
Geometry & Measurement 0.6569 0.1541 0.4332 0.7089 0.1916 0.5314
Geometry 0.7198 0.0941 0.5087 0.6667 0.0941 0.6592
Measurement 0.5980 0.2141 0.3257 0.7464 0.2833 0.4178
Discussion and Conclusions
Content and Construct Representation Should Be Maintained
The importance of common-item sets reflecting the content of the full test forms was
stressed in the equating literature, particularly when the nonrandom groups in a common-item
equating design perform differentially (Cook, 2007; Cook, Eignor, & Taft, 1985, 1988; Cook &
Petersen, 1987; Klein & Jarjoura, 1985). Our results indicated that linking sets that were not
totally representative of the full test forms produced different vertical scales than the linking sets
that were most representative of the full test forms. The vertically scaled scores produced by the
nonrepresentative linking sets did not adequately correspond to the students’ achievement level
for the full test forms. Therefore, these findings suggest that content and construct representation
should also be maintained in the context of vertical scaling in order to capture a realistic
representation of students’ growth from grade to grade.
The importance of how common items are selected cannot be overemphasized. Despite
the progressive nature of vertical scales, in that students’ achievement levels and test forms’
difficulty levels are expected to advance from grade to grade, the tests used in this study were
systematically assembled to minimize content or construct shifts from one grade to the next. That
is, the Geometry and Measurement tests were each strategically designed to assess skills and
understandings across grades along a single developmental continuum. Even so, this study revealed differences in the vertical scales depending on the linking sets used.
This approach of focusing on a single developmental continuum when constructing a test
is not commonly seen in practice. Thus it would seem reasonable to assume that the relative
emphasis given to different content areas (or constructs) changes from grade to grade much more
in end-of-level state tests, thereby increasing the probability of shifts in content areas and/or
constructs assessed in the linking sets used to construct the vertical scale. Based on this
assumption and the results of this study, it could be concluded that common-item selection is
particularly important when creating a vertical scale, especially when the vertically scaled scores
are used in value-added models to estimate the contributions that individual teachers and schools
make to students’ learning.
Large Versus Small Disparities in the Linking Set
According to Kolen and Brennan (2004), students’ performance on the items included in
the linking set influences the amount of grade-to-grade growth exhibited in the resulting vertical
scale. In other words, different linking sets result in different vertical scales. The findings of this
study showed that when the linking sets differed considerably, the growth patterns in the
resulting vertical scales differed as well.
Linking sets made up of common items assessing different curricular grade levels and
different mathematical constructs resulted in different vertical scales. The vertical scales that
differed most from one another were the vertical scales that were constructed using only on-level
or out-of-level common items assessing one content area (Geometry or Measurement). The
growth patterns of some of the vertical scales did not differ as much from one another. These
were the vertical scales constructed using linking sets that contained some items in common
(e.g., a linking set included both groups of grade-level-targeted common items and/or both
groups of content-area-specific common items). In either case, this study showed that when the
linking sets varied according to grade level and content area, the mean differences at each grade
were statistically significant.
Conversely, this study also showed that when the linking sets contained many of the
same common items, the small differences that existed between the linking sets were not as
evident in the growth patterns of the resulting vertical scales. In particular, the linking sets used
to compare the two stability assessment procedures were very similar. On average, the linking
sets consisting of items screened using the Robust z procedure contained only nine percent fewer
common items than the linking sets consisting of items screened using the 0.3-logit difference
procedure (see Table 6). When comparing the growth patterns of the vertical scales created using
the Robust z approach to those created using the 0.3-logit difference approach, the results of this
study suggest that small differences in the composition of the linking sets do not carry over to
the resulting vertical scales.
These findings suggest that practitioners should pay particular attention to changes in the
composition of the linking set as vertical scales are maintained over the years. Small changes
should not have a great influence on the students’ growth patterns, but larger changes in the
linking set over time may artificially influence the grade-to-grade growth revealed by the
resulting vertical scales.
Robust z Versus 0.3-logit Difference
Applying these two stability assessment procedures in a vertical scaling context proved
informative: while the Robust z procedure could be used in the same manner in which it is used
in equating, a variation of the 0.3-logit difference procedure was needed to ensure that items
were not mistakenly identified as unstable. This study proposed and documented a method for
using the 0.3-logit difference procedure when screening common items for the purpose of
creating a vertical scale.
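To make the two screening criteria concrete, the sketch below illustrates how each rule is
commonly operationalized in the Rasch linking literature (cf. Huynh & Rawls, 2009). It is a
minimal sketch, not the procedure used in this study: the 1.645 cutoff for the Robust z statistic
and the mean-centering step in the 0.3-logit rule are illustrative assumptions, since operational
choices vary across programs.

```python
import numpy as np

def robust_z_flags(diffs, cutoff=1.645):
    """Flag unstable common items with the Robust z procedure.

    diffs: differences in Rasch difficulty estimates for the same items
    on two adjacent forms (one value per common item). Each centered
    difference is divided by a robust spread estimate; 0.74 * IQR
    approximates the standard deviation under normality. The 1.645
    cutoff is an illustrative assumption.
    """
    diffs = np.asarray(diffs, dtype=float)
    iqr = np.percentile(diffs, 75) - np.percentile(diffs, 25)
    z = (diffs - np.median(diffs)) / (0.74 * iqr)
    return np.abs(z) > cutoff

def logit_difference_flags(diffs, threshold=0.3):
    """Flag items whose difficulty shift exceeds 0.3 logits.

    Differences are mean-centered first so the two forms share a common
    scale origin; this centering step is an assumption about how the
    rule would be applied, not a detail taken from the study.
    """
    diffs = np.asarray(diffs, dtype=float)
    centered = diffs - diffs.mean()
    return np.abs(centered) > threshold

# Hypothetical difficulty shifts (in logits) for ten common items
d = np.array([0.05, -0.12, 0.31, 0.02, -0.45, 0.08, -0.03, 0.19, -0.07, 0.11])
print(robust_z_flags(d))          # flags the two largest outlying shifts
print(logit_difference_flags(d))  # flags shifts beyond 0.3 logits
```

On this hypothetical data both rules flag the same two items, mirroring the high agreement
between the two procedures reported above; the rules diverge mainly for moderate shifts, where
the Robust z statistic's robust scaling makes it the more conservative screen.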
The results of this study support Huynh and Rawls’ (2009) conclusion that either the
Robust z procedure or the 0.3-logit difference procedure could be used to identify stable items,
since most of the items under consideration were classified identically by both procedures. This
vertical scaling study also demonstrated that both procedures resulted in very similar increases in
achievement from year to year. Given the similarities, we concur with Huynh and Rawls that the
Robust z is the recommended procedure because it is a more conservative approach.
On-level Testing
The study also revealed that the vertical scales constructed using the on-level common
items consistently produced the largest increase in achievement from year to year. The vertical
scales constructed using the on- and out-of-level common items consistently exhibited less
grade-to-grade growth. This would suggest that students’ performance on the out-of-level
common items lowered the overall test scores. Based on these findings, it bears repeating that
students perform better when tested on content in which they have most recently received
instruction; therefore, the test items should be on-level.
References
Camilli, G., Yamamoto, K., & Wang, M. (1993). Scale shrinkage in vertical equating. Applied
Psychological Measurement, 17, 379-388.
Cook, L. L. (2007). Practical problems in equating test scores: A practitioner's perspective. In
N. J. Dorans, M. Pommerich, & P. W. Holland (Eds.), Linking and aligning scores and
scales. New York: Springer.
Cook, L. L., Eignor, D. R., & Taft, H. L. (1985). A comparative study of curriculum effects on
the stability of IRT and conventional item parameter estimates (RR-85-38). Princeton, NJ:
Educational Testing Service.
Cook, L. L., Eignor, D. R., & Taft, H. L. (1988). A comparative study of the effects of recency
of instruction on the stability of IRT and conventional item parameter estimates. Journal
of Educational Measurement, 25(1), 31-45.
Cook, L. L., & Petersen, N. S. (1987). Problems related to the use of conventional and item
response theory equating methods in less than optimal circumstances. Applied
Psychological Measurement, 11, 225-244.
Harris, D. J. (2007). Practical issues in vertical scaling. In N. J. Dorans, M. Pommerich, & P. W.
Holland (Eds.), Linking and aligning scores and scales (pp. 233-251). New York:
Springer.
Huynh, H., Gleaton, J., & Seaman, S. P. (1992). Technical documentation for the South Carolina
high school exit examination of reading and mathematics: Paper No. 2 (2nd ed.).
Columbia, SC: University of South Carolina, College of Education.
Huynh, H., & Rawls, A. (2009). A comparison between robust z and 0.3-logit difference
procedures in assessing stability of linking items for the Rasch model. In E. V. Smith Jr.
& G. E. Stone (Eds.), Applications of Rasch measurement in criterion-referenced testing:
Practice analysis to score reporting. Maple Grove, MN: JAM Press.
Klein, L. W., & Jarjoura, D. (1985). The importance of content representation for common-item
equating with nonrandom groups. Journal of Educational Measurement, 22, 197-206.
Kolen, M. J., & Brennan, R. L. (2004). Test equating, scaling, and linking: Methods and
practices (2nd ed.). New York: Springer.
Linacre, J. M. (2006). User’s guide to WINSTEPS® computer program. Chicago: Winsteps.com.
Lord, F. M. (1983). Small N justifies the Rasch model. In D. J. Weiss (Ed.), New horizons in
testing (pp. 51-62). New York: Academic Press.
Loyd, B. H., & Hoover, H. D. (1980). Vertical equating using the Rasch model. Journal of
Educational Measurement, 17, 179-193.
Miller, G. E., Rotou, O., & Twing, J. S. (2004). Evaluation of the .3 logits screening criterion in
common item equating. Journal of Applied Measurement, 5(2), 172-177.
Sudweeks, R. R., Hardy, M. A., Bullough, R. V., Jr., Bahr, D. L., Monroe, E. E., Thayn, S., &
McEwen, M. (2008, March). Constructing vertically scaled mathematics tests for tracking
student growth in value-added studies of teacher effectiveness. Paper presented at the
annual meeting of the National Council on Measurement in Education, New York, NY.
Williams, V. S. L., Pommerich, M., & Thissen, D. (1998). A comparison of developmental
scales based on Thurstone methods and item response theory. Journal of Educational
Measurement, 35, 93-107.
Yen, W. M. (1986). The choice of scale for educational measurement: An IRT perspective.
Journal of Educational Measurement, 23, 399-425.
Young, M. J. (2006). Vertical scales. In S. M. Downing & T. M. Haladyna (Eds.), Handbook of
test development (pp. 469-485). Mahwah, NJ: Erlbaum.
Appendix A
Vertical Scales by ID Code
No.  Vertical Scale ID   Grade Level            Mathematical Construct      Stability Assessment
 1   OnOut_GM_RobZ       On- and Out-of-Level   Geometry and Measurement    Robust z
 2   OnOut_G_RobZ        On- and Out-of-Level   Geometry only               Robust z
 3   OnOut_M_RobZ        On- and Out-of-Level   Measurement only            Robust z
 4   On_GM_RobZ          On-Level               Geometry and Measurement    Robust z
 5   On_G_RobZ           On-Level               Geometry only               Robust z
 6   On_M_RobZ           On-Level               Measurement only            Robust z
 7   Out_GM_RobZ         Out-of-Level           Geometry and Measurement    Robust z
 8   Out_G_RobZ          Out-of-Level           Geometry only               Robust z
 9   Out_M_RobZ          Out-of-Level           Measurement only            Robust z
10   OnOut_GM_0.3LD      On- and Out-of-Level   Geometry and Measurement    0.3-Logit Difference
11   OnOut_G_0.3LD       On- and Out-of-Level   Geometry only               0.3-Logit Difference
12   OnOut_M_0.3LD       On- and Out-of-Level   Measurement only            0.3-Logit Difference
13   On_GM_0.3LD         On-Level               Geometry and Measurement    0.3-Logit Difference
14   On_G_0.3LD          On-Level               Geometry only               0.3-Logit Difference
15   On_M_0.3LD          On-Level               Measurement only            0.3-Logit Difference
16   Out_GM_0.3LD        Out-of-Level           Geometry and Measurement    0.3-Logit Difference
17   Out_G_0.3LD         Out-of-Level           Geometry only               0.3-Logit Difference
18   Out_M_0.3LD         Out-of-Level           Measurement only            0.3-Logit Difference
Appendix B
Three-Way ANOVA Tables for Grades 3, 5, and 6
Table B1
Three-Way ANOVA for Grade 3 (Experimental Method)

Source                                     Sum of Squares      df   Mean Square        F    Sig.
Main Effects (Combined)                           8681.29       5       1736.26    22.32   .0000
  Grade-level-targeted Common Items (G)           7064.28       2       3532.14    45.42   .0000
  Content-area-specific Common Items (C)          1421.13       2        710.57     9.14   .0001
  Stability Assessment Procedure (SAP)             195.88       1        195.88     2.52   .1125
2-Way Interactions (Combined)                     1443.13       8        180.39     2.32   .0175
  G * C                                           1143.48       4        285.87     3.68   .0054
  G * SAP                                           38.01       2         19.01      .24   .7832
  C * SAP                                          261.64       2        130.82     1.68   .1860
3-Way Interaction (G * C * SAP)                    360.13       4         90.03     1.16   .3274
Model                                            10484.56      17        616.74     7.93   .0000
Residual                                        830156.01   10674         77.77
Total                                           840640.57   10691         78.63

Note. Score by G, C, SAP.
Table B2
Three-Way ANOVA for Grade 5 (Experimental Method)

Source                                     Sum of Squares      df   Mean Square        F    Sig.
Main Effects (Combined)                          38408.09       5       7681.62    95.52   .0000
  Grade-level-targeted Common Items (G)          27885.87       2      13942.93   173.38   .0000
  Content-area-specific Common Items (C)         10510.90       2       5255.45    65.35   .0000
  Stability Assessment Procedure (SAP)              11.32       1         11.32      .14   .7075
2-Way Interactions (Combined)                     1426.45       8        178.31     2.22   .0234
  G * C                                           1327.26       4        331.81     4.13   .0024
  G * SAP                                           81.63       2         40.82      .51   .6020
  C * SAP                                           17.55       2          8.78      .11   .8966
3-Way Interaction (G * C * SAP)                    118.76       4         29.69      .37   .8307
Model                                            39953.30      17       2350.19    29.22   .0000
Residual                                        819304.43   10188         80.42
Total                                           859257.72   10205         84.20

Note. Score by G, C, SAP.
Table B3
Three-Way ANOVA for Grade 6 (Experimental Method)

Source                                     Sum of Squares      df   Mean Square        F    Sig.
Main Effects (Combined)                          49061.96       5       9812.39   106.55   .0000
  Grade-level-targeted Common Items (G)          47194.28       2      23597.14   256.23   .0000
  Content-area-specific Common Items (C)          1523.05       2        761.52     8.27   .0003
  Stability Assessment Procedure (SAP)             344.63       1        344.63     3.74   .0531
2-Way Interactions (Combined)                     3358.85       8        419.86     4.56   .0000
  G * C                                           2289.71       4        572.43     6.22   .0001
  G * SAP                                         1054.07       2        527.03     5.72   .0033
  C * SAP                                           15.07       2          7.53      .08   .9214
3-Way Interaction (G * C * SAP)                     45.67       4         11.42      .12   .9739
Model                                            52466.48      17       3086.26    33.51   .0000
Residual                                        692906.71    7524         92.09
Total                                           745373.19    7541         98.84

Note. Score by G, C, SAP.
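For readers who wish to explore analyses of this form, the sketch below shows one way a
three-way ANOVA like those in Tables B1-B3 could be fit with Python's statsmodels package.
This is a minimal sketch under stated assumptions, not the study's actual analysis: the file name
and column names are hypothetical, and statsmodels' Type II decomposition may differ slightly
from the SPSS experimental method used to produce the tables above.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Hypothetical long-format data: one row per student score under each
# linking-set condition. Illustrative columns: score, g_items
# (grade-level targeting of common items), c_items (content area of
# common items), and sap (stability assessment procedure).
df = pd.read_csv("grade3_scores.csv")  # assumed file, not study data

# Full factorial model with all two- and three-way interactions;
# C(...) treats each factor as categorical.
model = ols("score ~ C(g_items) * C(c_items) * C(sap)", data=df).fit()

# Type II sums of squares; the main-effect and interaction rows
# correspond to the G, C, and SAP rows of Tables B1-B3, though the
# sums-of-squares decomposition differs from the experimental method.
print(sm.stats.anova_lm(model, typ=2))
```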