Investigating Content and Construct Representation of a Common-item Design When Creating a Vertically Scaled Test

Paper presented at the annual meeting of the National Council on Measurement in Education, New Orleans, LA

M. Assunta Hardy, Brigham Young University
Michael J. Young, Pearson
Qing Yi, Pearson
Richard R. Sudweeks, Brigham Young University
Damon L. Bahr, Brigham Young University

April 2011

INVESTIGATING CONTENT AND CONSTRUCT REPRESENTATION 1

The authors thank Dr. Eula Ewing Monroe from Brigham Young University for providing elementary mathematics content-area expertise. To obtain a copy of this report, send an email request to M. Assunta Hardy, Department of Instructional Psychology & Technology, Brigham Young University, Provo, Utah 84602. E-mail: [email protected].

Abstract

According to the equating guidelines, a set of common items should be a mini version of the total

test in terms of content and statistical representation (Kolen & Brennan, 2004). Differences

between vertical scaling and equating would suggest that these guidelines may not apply to

vertical scaling in the same way that they apply to equating. This study investigated how well the

guideline of content and construct representation was maintained while evaluating two stability

assessment criteria (Robust z and 0.3-logit difference). The results indicated that linking sets that

were not totally representative of the full test forms produced different vertical scales than the

linking sets that were most representative of the full test forms. The results also showed that

large disparities in the composition of linking sets produced statistically significant differences in

the growth patterns of the resulting vertical scales, but small disparities in the composition of

linking sets produced very similar vertical scales. Overall, the Robust z procedure was a more

conservative approach to flagging unstable items.

Introduction

The common-item design (CID) is a data collection plan widely used in creating a

vertical scale. Examinees’ performance on the common items across test forms is used to

indicate the amount of growth that occurs from grade to grade (Kolen & Brennan, 2004).

Different decisions regarding the structure of the design and the composition of the linking item

set may lead to different vertical scales (Camilli, Yamamoto, & Wang, 1993; Harris, 2007; Loyd

& Hoover, 1980; Williams, Pommerich, & Thissen, 1998; Yen, 1986).

The literature on test score equating provides some guidelines for constructing a test that

includes common items as a method of collecting data (Kolen & Brennan, 2004). According to

the guidelines, the set of common items should be a mini version of the total test in terms of

content and statistical representation. Appropriately selecting common items for the linking set

ensures that the common items represent the total test sufficiently.

Potential common items are identified when adjacent test forms are constructed, but the

common items that become part of the final linking set are those common items that are

reasonably stable in difficulty across forms. The equating literature also provides several criteria

for screening common items. Different criteria may result in different sets of linking items.

The research on equating has produced helpful guidelines for selecting and screening

common items, yet the differences between vertical scaling and equating would suggest that

these guidelines may not apply to vertical scaling in the same way that they apply to equating.

Through the equating process, the examinees’ location estimates are adjusted to account for

differences in difficulty between the test forms and placed onto a common metric. In vertical

scaling, the examinee groups that are administered the level tests are assumed to be different in

ability. The set of test questions from one test form to the other are deliberately designed to

assess different levels of achievement.

In a review of the literature, Cook and Petersen (1987) concluded that when groups differ

in level of ability, special care must be taken when selecting the set of common items for the

anchor test. Content representativeness of the items is an important concern and can seriously

affect conventional equating results (Cook, Eignor, & Taft, 1985; Klein & Jarjoura, 1985). In the

context of vertical scaling, since the examinee groups are expected to differ in their level of

achievement and the test forms differ in difficulty level, shifts in construct and content

specifications tested across test forms can occur simply by design.

In equating, the two test forms are expected to be interchangeable, so when common items are

screened for stability, their item-difficulty estimates are expected to be almost the same.

However, in vertical scaling the item difficulty estimates from two test forms across adjacent

grades are expected to differ somewhat. Common items should typically be easier for the students

at the higher grade level and more difficult for the students at the lower grade level, and this

should be reflected in the item difficulty estimates.

Background and Rationale

Given the differences between equating and vertical scaling, this study investigated how

well the guideline of content and construct representation was maintained in the context of

creating a vertical scale while evaluating two stability assessment criteria. This was

accomplished by analyzing data gathered using the CID illustrated in Figure 1. This CID is a

variation of a design proposed by Sudweeks et al. (2008) and it involved constructing a separate

test for two mathematical constructs (Geometry and Measurement) intended to assess

achievement relative to objectives in the Utah Core Curriculum for several adjacent grade levels.

Grade Level of     Curricular Grade Level of Items
Examinees          G1   G2   G3   G4   G5   G6   G7   G8
3                  a    b    c    d    e
4                       b    c    d    e    f
5                            c    d    e    f    g
6                                 d    e    f    g    h

Figure 1. An illustration of the common-item design used to collect the response data.

Using this CID, students in grades 3, 4, 5, and 6 were administered test forms made up of

test questions intended to assess achievement along a continuum. The test questions included in

each form were classified into sets, according to the curricular grade level of the items, and

labeled alphabetically to display the progression along the continuum. That is, moving from left

to right, the item blocks contained test questions that became progressively more difficult as the

content tested became more complex (G1 through G8).

The students’ responses from the two tests were combined into one data set to investigate

the guideline of content and construct representation. In this study, content representation refers

to how well the curriculum assessed by the linking set matches the curriculum assessed by

adjacent grade level tests and construct representation refers to how well a specific content area

assessed (e.g., Geometry) by the linking set matches the content area assessed by adjacent grade

level tests. Due to the unique structure of the data, many common items were drawn upon for

linking purposes.

Different combinations of common items were included in the linking set to create

different vertical scales, thereby altering the degree of content and construct representativeness.

First, since the item blocks in the CID were intended to assess achievement relative to objectives

in the Utah Core Curriculum for particular grade levels, content representation could be

investigated by varying the grade level targeted by the common-item block. Second, since the

students’ responses were gathered from two tests that were developed and administered

separately and later combined, construct representation could be investigated by altering the

common items included in the linking set depending on the specific content area assessed by the

common items.

Screening common items. The screening criteria reviewed in this study are those for

which tests are calibrated using the Rasch IRT model: (a) the Robust z statistic and (b) the 0.3-

logit difference.

The Robust z statistic is a z-score-like statistic that is not affected by outliers. The

conventional z statistic is computed using the mean and standard deviation; however, both the

mean and standard deviation are sensitive to outliers. Instead, the Robust z statistic,

developed by Huynh as part of the South Carolina Basic Skills Assessment (Huynh, Gleaton, &

Seaman, 1992), uses the median and the interquartile range, which are insensitive to outliers.

For each potential common item, the Robust z is defined as

\[
z_i = \frac{b_{iB} - b_{iT} - M_d}{0.74\,(IQR)} \qquad (1)
\]

where b_{iB} is the b parameter value for common item i for the base grade, b_{iT} is the b parameter

value for common item i for the grade being transformed, M_d is the median item difficulty

difference of all potential linking items, and IQR is the interquartile range of the difference of all

linking items.
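Equation 1 can be illustrated with a minimal Python sketch. The function name `robust_z`, the input lists, and the quartile convention (the "inclusive" method of the standard library) are assumptions made for illustration; the source does not state how the quartiles were computed, and a different convention would change the IQR slightly.

```python
from statistics import median, quantiles

def robust_z(b_base, b_target):
    """Return one Robust z value per common item (Equation 1).

    b_base, b_target: equal-length lists of Rasch difficulties for the same
    items, from the base grade and from the grade being transformed.
    """
    diffs = [bb - bt for bb, bt in zip(b_base, b_target)]
    m_d = median(diffs)
    # Assumed quartile convention; other conventions give slightly different IQRs.
    q1, _, q3 = quantiles(diffs, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("IQR is zero; the Robust z is undefined")
    return [(d - m_d) / (0.74 * iqr) for d in diffs]

# Hypothetical difficulties for five common items (not data from the study).
b_base = [1.0, 2.0, 3.0, 4.0, 5.0]
b_target = [0.0, 0.5, 1.0, 1.5, 0.0]
z_values = robust_z(b_base, b_target)
```

In this made-up example the last item has a difference far from the median difference, so its Robust z is large, while the median and IQR themselves are barely affected by it.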

The 0.3-logit difference criterion is based on a fixed difference in difficulty parameter

estimates for common items from two test forms. The average standard error of an item Rasch

difficulty was around 0.15 logits for achievement tests which were calibrated on the test results

of 500 examinees. Using the traditional 95% confidence interval, two standard errors would

result in 0.3 logits. Thus, the 0.3 logits criterion represents two standard errors.
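The arithmetic behind the threshold can be written out explicitly, using the approximate standard error of 0.15 logits quoted above. The symbol SE(b_i) is introduced here only for illustration.

```latex
\[
2 \times SE(b_i) \approx 2 \times 0.15 = 0.30 \text{ logits},
\qquad
1.96 \times 0.15 = 0.294 \approx 0.3 \text{ logits}
\]
```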

Two studies have compared stability assessment procedures (Huynh & Rawls, 2009;

Miller, Rotou, & Twing, 2004), but these analyses were conducted in the context of equating.

Given that these procedures are also used in the context of vertical scaling, this study sought to

understand how the two procedures differed when the common items were screened for the

purpose of establishing linking sets to construct a vertical scale.

Purpose of Study

The purpose of this study was fourfold:

1. Combine the response data from two tests into one data set (Geometry test with the

Measurement test) and vertically scale the combined items.

2. Compare the effects of varying the content representativeness of the linking sets in

creating the vertical scale by altering the grade level targeted by the common items.

3. Compare the effects of varying the construct representativeness of the linking sets in

creating the vertical scale by altering the content area composition of the common items.

4. Evaluate two procedures for assessing the stability of the common items.

Research Questions

More specifically, the following three research questions were investigated in this study

for a test measuring students’ proficiency levels in Geometry and Measurement:

1. How did the resulting vertical scales vary in terms of grade-to-grade growth and within-

grade variability across the four consecutive grades when three different grade-level

targets (on-level and/or out-of-level common-item blocks) were used in the linking

process?

2. How did the resulting vertical scales vary in terms of grade-to-grade growth and within-

grade variability across the four consecutive grades when three different sets of content

area linking items were used for each combined data set?

3. How did the resulting vertical scales vary in terms of grade-to-grade growth and within-

grade variability across the four consecutive grades when two stability assessment

procedures (Robust z and 0.3-logit difference) were used to select the common items?

This vertical scaling study made use of a unique data set, which allowed the above

questions to be investigated. This study provides a rare opportunity to use operational data to

address important issues in vertical scaling such as the grade-level targeting of common-item

linking sets and their content composition. The findings of this study could clarify how the

equating guidelines on common-item selection transfer to the process of creating a vertical scale.

Method

Common-item Test Design (CID)

The CID used in this study encompassed the following four premises:

1. A separate test would be developed for each mathematical construct.

2. The state indicators selected for the test blueprint would measure understandings and

skills that were developmentally appropriate for students at each grade level.

3. The skills and understandings specified for the various grades successively increased in

cognitive complexity in grade-level order.

4. For each test, the collective set of ordered skills and understandings aggregated across

grades defined a single developmental continuum representing progressive levels of

attainment of a single underlying construct.

Therefore, for each mathematical construct, the test consisted of eight blocks of items, labeled a

through h, intended to assess achievement along a continuum relative to objectives in the Utah

Core Curriculum for grade levels ranging from G1 to G8 (see Figure 1). For example, the item

block labeled a represented a set of items that assessed achievement for G1, and so forth.

One test form was constructed for each grade level (grades 3, 4, 5, and 6) and each form

contained five different blocks of items. For example, for students in the third grade, a test form

was constructed including item blocks a through e. The items in these five blocks were intended

to assess achievement along a continuum for grade levels ranging from G1 to G5. The same

procedure was followed in constructing the forms for grades 4, 5, and 6. In other words, at each

grade level students were administered blocks of items which included: (a) items assessing

objectives targeted one and two grades below the students’ classified grade level, (b) items

assessing objectives targeted at the students’ classified grade level, and (c) items assessing

objectives targeted one and two grades above the students’ classified grade level. This

assignment of items was done to minimize ceiling or floor effects for students who were either

above or below the average student’s ability level in their respective grades without penalizing

the average student.

Each item block consisted of eight items for the Geometry test and nine items for the

Measurement test. Therefore, each test form consisted of 40 items for the Geometry test and 45

items for the Measurement test.

Potential common-item links. Four of the six total possible blocks of items (67%)

across any two adjacent grades were purposely designed to be common-item blocks and could

subsequently be used in the linking set. Therefore, adjacent test forms shared 32 items in

common (8 items × 4 blocks) for the Geometry test and 36 items in common (9 items × 4 blocks)

for the Measurement test.
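The block structure and the resulting counts can be checked with a short Python sketch. The names `BLOCKS`, `form_blocks`, and `common_blocks` are hypothetical; the sketch simply encodes the design as described (each form spans two grades below through two grades above the examinees' grade).

```python
BLOCKS = "abcdefgh"  # block a targets G1, ..., block h targets G8
ITEMS_PER_BLOCK = {"Geometry": 8, "Measurement": 9}

def form_blocks(grade):
    """Blocks on the form for an examinee grade (3 to 6): two below through two above."""
    return [BLOCKS[g - 1] for g in range(grade - 2, grade + 3)]

def common_blocks(lower_grade):
    """Blocks shared by the forms for lower_grade and lower_grade + 1."""
    upper = set(form_blocks(lower_grade + 1))
    return [b for b in form_blocks(lower_grade) if b in upper]

for grade in (3, 4, 5):
    shared = common_blocks(grade)
    counts = {test: n * len(shared) for test, n in ITEMS_PER_BLOCK.items()}
    print(f"G{grade}/G{grade + 1}: blocks {shared}, common items {counts}")
```

Each adjacent pair of forms covers six distinct blocks, four of which are shared, which gives the 32 Geometry and 36 Measurement common items reported above.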

Sample

In the spring of 2009, the two tests were administered to approximately 2,270 students in

grades 3, 4, 5, and 6 in 15 schools from five districts on two separate days. Students who were

present during each of the two testing days participated in the study.

Most students were administered both tests. A total of 2,263 students responded to the

items in the Geometry test on one day (see Table 1). A total of 2,199 students responded to the

items in the Measurement test on another day. A total of 2,098 students completed both the

Geometry and Measurement tests.

Table 1

Number of Student Participants by Test and Grade

         Test
Grade    Geometry    Measurement    Geometry & Measurement
3        631         612            594
4        541         541            518
5        609         607            567
6        482         439            419
Total    2,263       2,199          2,098

Data

Since many of the same students took both tests, the students’ response data were

combined into one data set for the purpose of this analysis. Consequently, the data included a

total of 2,098 students’ responses to 85 items for students in grades 3, 4, 5, and 6 (Table 2).

Table 2

Assignment of Items for the Geometry and Measurement Tests Combined

Grade Level of     Curricular Grade Level of Items             Total Number of Items
Examinees          G1   G2   G3   G4   G5   G6   G7   G8       Per Test Form
3                  17   17   17   17   17                      85
4                       17   17   17   17   17                 85
5                            17   17   17   17   17            85
6                                 17   17   17   17   17       85

Variations of the Linking Set

Table 3 summarizes the testing conditions used in this study. In total, 18 vertical scales

were constructed (3 content representation conditions × 3 construct representation conditions × 2

stability assessment procedures). The common items selected for linking were manipulated to

test the content and construct representation of the common-item set relative to the total test. In

addition, two stability assessment procedures were used separately to screen the items selected,

in which only the stable ones remained part of the linking set to create the vertical scales.

Table 3

Summary of Testing Conditions

Condition Tested            Observed Set of Measures
Content Representation      1. On-level and out-of-level common items
                            2. On-level common items only
                            3. Out-of-level common items only
Construct Representation    1. Geometry and Measurement common items
                            2. Geometry common items only
                            3. Measurement common items only
Stability Assessment        1. Robust z
                            2. 0.3-logit difference
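The fully crossed set of conditions in Table 3 can be enumerated with a brief Python sketch. The list names are hypothetical labels for the conditions in the table.

```python
from itertools import product

CONTENT = ["on-level and out-of-level", "on-level only", "out-of-level only"]
CONSTRUCT = ["Geometry and Measurement", "Geometry only", "Measurement only"]
STABILITY = ["Robust z", "0.3-logit difference"]

# One vertical scale per combination of the three factors.
conditions = list(product(CONTENT, CONSTRUCT, STABILITY))
print(len(conditions))  # 18
```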

Two different approaches were taken to manipulate the composition of the linking set: (a)

grade-level-targeted common items, and (b) content-area-specific common items. By selecting

common items based on the grade level targeted by the items, content representation of the

common items relative to the total test could be investigated.

Three variations of grade-level-targeted common item sets were used. First, all possible

common items across adjacent grades, also referred to as on-level and out-of-level linking items,

were selected as potential linking items to be included in the linking sets. In Figure 2, the three

bold rectangles each identify four common-item blocks that represent the on- and out-of-level

linking items across two adjacent grades. Table 2 summarizes the manner in which the items

were distributed across grade levels. With four common-item blocks selected, the data included a

total of 68 potential common items (17 items × 4 item blocks) across any two adjacent grades.

A second variation involved only the on-level linking items. The on-level linking items

are defined as the common items across adjacent grades that assess objectives corresponding to

the students’ classified level. In Figure 3, the three bold squares each identify two common-item

blocks that represent the on-level linking items across adjacent grades. The data included a total

of 34 potential common items (17 items × 2 blocks) across any two adjacent grades.

The third variation involved only the out-of-level linking items. The out-of-level linking

items are defined as the common items across adjacent grades that assess objectives above and

below the students’ classified level. In Figure 4, the three pairs of bold rectangles identify two

common-item blocks that represent the out-of-level linking items across adjacent grades. The

data included a total of 34 potential common items (17 items × 2 blocks) across adjacent grades.
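The three grade-level-targeted variations can be sketched in Python as follows. This reflects one reading of the definitions above: for a pair of adjacent grades, the on-level blocks are the two blocks targeting those grades, and the out-of-level blocks are the two remaining common blocks (one grade below the lower grade and one grade above the upper grade). The function `linking_blocks` and its `variation` labels are hypothetical.

```python
BLOCKS = "abcdefgh"     # block a targets G1, ..., block h targets G8
ITEMS_PER_BLOCK = 17    # 8 Geometry + 9 Measurement items in the combined data

def linking_blocks(lower_grade, variation):
    """Common blocks linking lower_grade with lower_grade + 1 (lower_grade in 3..5)."""
    upper_grade = lower_grade + 1
    # Blocks targeting the two examinee grades themselves.
    on_level = [BLOCKS[lower_grade - 1], BLOCKS[upper_grade - 1]]
    # Remaining common blocks: one grade below the lower grade, one above the upper grade.
    out_of_level = [BLOCKS[lower_grade - 2], BLOCKS[upper_grade]]
    if variation == "on":
        return on_level
    if variation == "out":
        return out_of_level
    if variation == "both":
        return sorted(on_level + out_of_level)
    raise ValueError("variation must be 'on', 'out', or 'both'")

for variation in ("both", "on", "out"):
    blocks = linking_blocks(3, variation)
    print(variation, blocks, len(blocks) * ITEMS_PER_BLOCK)
```

For grades 3 and 4, this yields blocks b through e (68 items), blocks c and d (34 items), and blocks b and e (34 items), matching the counts given in the text.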

Grade Level of     Curricular Grade Level of Items
Examinees          G1   G2   G3   G4   G5   G6   G7   G8
3                  a    b    c    d    e
4                       b    c    d    e    f
5                            c    d    e    f    g
6                                 d    e    f    g    h

Figure 2. On-level and out-of-level common items included in the linking set.

Grade Level of     Curricular Grade Level of Items
Examinees          G1   G2   G3   G4   G5   G6   G7   G8
3                  a    b    c    d    e
4                       b    c    d    e    f
5                            c    d    e    f    g
6                                 d    e    f    g    h

Figure 3. Only on-level common items included in the linking set.

Grade Level of     Curricular Grade Level of Items
Examinees          G1   G2   G3   G4   G5   G6   G7   G8
3                  a    b    c    d    e
4                       b    c    d    e    f
5                            c    d    e    f    g
6                                 d    e    f    g    h

Figure 4. Only out-of-level common items included in the linking set.

The second approach used to alter the composition of the linking set involved selecting

content-area-specific common items. By selecting common items based on the mathematical

construct assessed by the items, construct representation of the common items relative to the

total test could be investigated.

Three variations of content-area-specific common item sets were used to create the

vertical scales. The data set was composed of students’ responses to a relatively even number of

items from both mathematical constructs. Therefore, the common items included in the linking

set were (a) items assessing both mathematical constructs, (b) items assessing the Geometry

construct, and (c) items assessing the Measurement construct.

Table 4 outlines the number of common items for each variation of the linking set. The

table indicates the total number of potential common items across any two adjacent grades by the

items’ targeted grade level and by the items’ content area. For example, when only the common

items assessing Geometry were included in the common-item set, the set comprised 16 on-level

and 16 out-of-level common items, for a total of 32.

Table 4

Total Number of Potential Common Items Across Any Two Adjacent Grades by Grade-Level Target and Content Area for the Geometry and Measurement Data

Content-area-specific       Grade-level-targeted Common Items
Common Items                On-level & Out-of-level    On-level    Out-of-level
Geometry & Measurement      68                         34          34
Geometry                    32                         16          16
Measurement                 36                         18          18

Analysis

Table 5 summarizes the scaling process used in this study. The same IRT scaling method

was applied in creating the vertical scales for all variations of the linking set using two stability

assessment procedures.

Rasch scaling. Due to the widespread use of the Rasch model in large-scale assessment

and the fact that our study involved small sample sizes (Lord, 1983), the Rasch model was used

to analyze the student response data. Items that were not included in the students’ test booklets

were coded as Not Presented for those students. Items that were included in a student’s test booklet

but were not reached by that examinee were also coded as Not Presented. Students were

not penalized for not reaching items. Items that were omitted were coded as incorrect.
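The response-coding rules above can be illustrated with a small Python sketch. Two details are assumptions rather than statements from the source: not-reached items are taken to be the blank items after the last answered item in the booklet, and omitted items are blanks that occur before it. The marker `NOT_PRESENTED` and the function `code_responses` are hypothetical names.

```python
NOT_PRESENTED = None  # hypothetical marker for the "Not Presented" code

def code_responses(responses, key, all_items):
    """Return {item_id: 1, 0, or NOT_PRESENTED} for one examinee.

    responses: list of (item_id, answer) in booklet order; answer is None if blank.
    key: dict mapping item_id to the correct answer.
    all_items: every item id in the combined data set.
    """
    answered = [pos for pos, (_, ans) in enumerate(responses) if ans is not None]
    last_answered = answered[-1] if answered else -1
    # Items absent from the booklet keep the Not Presented code.
    coded = {item: NOT_PRESENTED for item in all_items}
    for pos, (item, ans) in enumerate(responses):
        if pos > last_answered:
            coded[item] = NOT_PRESENTED          # not reached: no penalty
        elif ans is None:
            coded[item] = 0                      # omitted: scored incorrect
        else:
            coded[item] = int(ans == key[item])  # answered: scored against the key
    return coded

# Hypothetical booklet with four items; item i5 is not in this booklet.
key = {"i1": "A", "i2": "C", "i3": "D", "i4": "A", "i5": "B"}
responses = [("i1", "A"), ("i2", None), ("i3", "B"), ("i4", None)]
coded = code_responses(responses, key, ["i1", "i2", "i3", "i4", "i5"])
```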

The WINSTEPS software (Linacre, 2006) was used to estimate the item and proficiency

parameters. Using item centering, the item parameters for each level test (grades 3, 4, 5, and 6)

were estimated separately.

Table 5

Summary of the Scaling Process

Element                      Description of Element
Scaling Method               Item Response Theory (IRT)
Computer Software            WINSTEPS (Linacre, 2006)
IRT Scaling Model            Rasch Model
Calibration Method           Separate calibration
Person Ability Estimation    Joint Maximum Likelihood Estimate (JMLE)
Stability Assessment         Robust z and 0.3-logit difference
Base Grade for Linking       Grade 4
Scale Transformation         Mean/Mean method

When separate calibration is used, because of the indeterminacy of IRT scales, each level

test (e.g., grades 3, 4, 5, and 6) is set to have a mean of 0 and a standard deviation of 1.

Therefore, some linear transformation procedure was needed to place all grades onto a common

scale. Prior to the linear transformations, the parameter estimates for the common items were

assessed to determine how stable the item parameters were across the adjacent grades. Common

items identified as unstable were removed from the linking sets.

The Robust z statistic and the 0.3-logit difference were computed for each potential

common item for every variation of the linking set. According to our CID (Figure 1), common

items appeared in multiple test forms. Item stability, or instability, was defined only between any

two forms; therefore, an item could be classified as stable in one pair of test forms and unstable

in another pair of test forms. Items that were labeled as unstable under each procedure were

excluded from the common-item sets.

Robust z procedure. Once the item difficulties were obtained from the separate

calibration procedure in WINSTEPS, stable and unstable common items across each pair of

adjacent grades (G3/G4, G4/G5, and G5/G6) were identified using the Robust z procedure

(Huynh & Rawls, 2009).

In this study, alpha was set at 10 percent and the positive critical value for z* was 1.645.

Potential common items with a Robust z statistic smaller than z* in absolute value were

identified as stable and kept as part of the linking set. Items with a Robust z statistic

greater than or equal to z* in absolute value were identified as unstable and were excluded from

the linking set.
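The flagging rule can be sketched in Python as follows. The sketch assumes that the 10 percent alpha is split across both tails, which is consistent with a critical value of 1.645; the function `split_by_robust_z` and the item labels are hypothetical.

```python
from statistics import NormalDist

ALPHA = 0.10
# Two-sided critical value of the standard normal distribution, about 1.645.
Z_CRIT = NormalDist().inv_cdf(1 - ALPHA / 2)

def split_by_robust_z(z_by_item, z_crit=Z_CRIT):
    """Split items into (stable, unstable) using |z| < z_crit as the stability rule."""
    stable = [item for item, z in z_by_item.items() if abs(z) < z_crit]
    unstable = [item for item, z in z_by_item.items() if abs(z) >= z_crit]
    return stable, unstable

# Hypothetical Robust z values for three common items.
stable, unstable = split_by_robust_z({"item1": 0.2, "item2": -1.7, "item3": 1.1})
```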

0.3-logit difference procedure. The 0.3-logit difference procedure involves a simple

computation, but a variant of the procedure was used in this study. In common-item equating of

two Rasch-calibrated tests, the absolute value of the item-difficulty difference is computed for

each common item (Miller et al., 2004). Since the two test forms are expected to be

interchangeable, either item-difficulty estimate could be subtracted from the other to compute the

difference. Taking the absolute value of the difference would result in the same value. Once the

absolute difference is computed, only those common items with an absolute difference in Rasch

difficulty estimate less than 0.3 logit are described as being stable and included in the linking

process. The unstable common items are dropped from the linking set.

Since this study involved multiple Rasch-calibrated tests that were vertically scaled, only

the signed item-difficulty difference, rather than its absolute value, was computed for each common item for adjacent grades. This

difference was computed by subtracting the item-difficulty estimate of the lower grade from the

item-difficulty estimate of the higher grade. Since in vertical scaling the item difficulty estimates

from two equated test forms across adjacent grades are expected to differ somewhat, a negative

difference is desirable (b_{n-1} > b_n, where n = grade). The item difficulty estimate for a common

item taken by students at the lower grade should be greater than the item difficulty estimate for

the same item taken by students at the higher grade. Taking the absolute value of a negative

difference could falsely identify a stable item as unstable. Therefore in the context of this study,

the 0.3-logit difference procedure only took the item-difficulty difference into consideration for

each common item for each pair of grades being linked.

Similar to the Robust z procedure, the item difficulties obtained from the separate

calibration procedure using the WINSTEPS software were applied to identify stable and unstable

common items across each pair of adjacent grades (G3/G4, G4/G5, and G5/G6) for the 0.3-logit

difference procedure. The item-difficulty difference was computed for each potential common

item and only those common items with a difference in Rasch difficulty estimate less than 0.3

logit were described as being stable and included in the linking process.
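The contrast between the conventional absolute-difference rule and the signed variant described above can be sketched in Python. The function names and the example difficulties are hypothetical, and the sketch assumes the signed rule has no lower bound, since none is stated in the text.

```python
THRESHOLD = 0.3  # logits

def is_stable_absolute(b_lower, b_higher, threshold=THRESHOLD):
    """Conventional equating rule: absolute difference below the threshold."""
    return abs(b_higher - b_lower) < threshold

def is_stable_signed(b_lower, b_higher, threshold=THRESHOLD):
    """Variant described above: higher-grade minus lower-grade estimate below the threshold."""
    return (b_higher - b_lower) < threshold

# Hypothetical item that is much easier for the higher grade (difference of about -0.7).
b_lower, b_higher = 0.9, 0.2
print(is_stable_absolute(b_lower, b_higher))  # False
print(is_stable_signed(b_lower, b_higher))    # True
```

An item that becomes easier at the higher grade, as expected in vertical scaling, is flagged by the absolute rule but retained by the signed rule.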

The Robust z statistic and the 0.3-logit difference were computed for each potential

common item. Many common items appeared on more than two test forms; therefore, item

stability or instability for those common items was determined for each pair of test forms. As a

result, the same common item could be considered stable in one pair of test forms and

unstable in another pair of test forms.
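
A pair-by-pair screen of this kind might look like the sketch below. It uses the Robust z statistic as it is commonly formulated in the equating literature: the difference minus the median of the differences, divided by 0.74 times the interquartile range. The cutoff of 1.645, the item IDs, and the difference values are assumptions made for illustration, and the quartile interpolation rule is that of Python's `statistics.quantiles`, which may differ from the one used in the study.

```python
from statistics import median, quantiles

Z_CUTOFF = 1.645  # assumed cutoff; substitute the value defined for the study


def robust_z(differences):
    """Return the Robust z value for each difficulty difference."""
    q1, _, q3 = quantiles(differences, n=4, method="inclusive")
    center = median(differences)
    scale = 0.74 * (q3 - q1)
    return [(d - center) / scale for d in differences]


def flag_unstable(diff_by_item, cutoff=Z_CUTOFF):
    """Return the items whose |Robust z| exceeds the cutoff."""
    items = sorted(diff_by_item)
    z_values = robust_z([diff_by_item[i] for i in items])
    return [i for i, z in zip(items, z_values) if abs(z) > cutoff]


# Invented difficulty differences (logits) for the same eight common items
# in two pairs of adjacent grades.
pairs = {
    "G3/G4": {"i1": -0.9, "i2": -0.8, "i3": -0.7, "i4": -0.6,
              "i5": -0.5, "i6": -0.4, "i7": -0.3, "i8": 1.5},
    "G4/G5": {"i1": -0.9, "i2": -0.8, "i3": -0.7, "i4": -0.6,
              "i5": -0.5, "i6": -0.4, "i7": -0.3, "i8": -0.2},
}

flagged = {pair: flag_unstable(diffs) for pair, diffs in pairs.items()}
print(flagged)
```

Here the hypothetical item i8 is flagged in the G3/G4 pair but retained in the G4/G5 pair, which is the situation the paragraph above describes.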

Mean/mean method of linking. Once the unstable items were deleted from the linking

sets for both stability assessment procedures, the remaining common items for each linking set

were used in the scale transformation phase. The mean/mean method was used to

transform the estimates onto a common scale. The additive or equating constant was computed

for each pair of adjacent grades for each vertical scale. Since the vertical scales encompassed

four grade levels, three additive constants (G3/G4, G4/G5, and G5/G6) were computed for each

vertical scale. Subsequently, the appropriate equating constant was added to the parameter and

proficiency estimates of each level test to transform the estimates to the base-grade scale for each

vertical scale. In this study, G4 was designated as the base level for the common scale.

Following the scale transformations, all the grades were rescaled so that the common scale had a

mean of 50.0 and a standard deviation of 10.0.
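
The linking and rescaling steps can be sketched as follows, under stated assumptions. For Rasch-calibrated forms the mean/mean transformation reduces to an additive shift, taken here as the mean common-item difficulty on the target scale minus the mean on the source scale. That sign convention, the pooling of all grades before rescaling, the use of the population standard deviation, and every data value are assumptions for illustration only.

```python
from statistics import mean, pstdev


def additive_constant(b_target, b_source):
    """Mean/mean constant that places `b_source` estimates on the `b_target` scale."""
    common = sorted(b_target.keys() & b_source.keys())
    return mean(b_target[i] for i in common) - mean(b_source[i] for i in common)


# Invented difficulties (logits) from separate calibrations; adjacent grades
# share the stable common items that survived screening.
g3 = {"a": 0.9, "b": 0.5}
g4 = {"a": 0.0, "b": -0.4, "c": 0.8, "d": 0.6}
g5 = {"c": 0.0, "d": -0.2, "e": 0.7, "f": 0.5}
g6 = {"e": 0.0, "f": -0.2}

c34 = additive_constant(g4, g3)  # moves G3 onto the G4 scale
c45 = additive_constant(g4, g5)  # moves G5 onto the G4 scale
c56 = additive_constant(g5, g6)  # moves G6 onto the G5 scale

# G6 reaches the G4 base scale by chaining two constants.
shift = {3: c34, 4: 0.0, 5: c45, 6: c45 + c56}

# Invented proficiency estimates (logits) on each grade's own scale.
thetas = {
    3: [-0.6, 0.0, 0.6],
    4: [-0.5, 0.1, 0.7],
    5: [-0.4, 0.0, 0.5],
    6: [-0.3, 0.2, 0.6],
}
linked = {g: [t + shift[g] for t in ts] for g, ts in thetas.items()}

# Linear rescaling so the pooled distribution has mean 50 and SD 10.
pooled = [t for ts in linked.values() for t in ts]
m, s = mean(pooled), pstdev(pooled)
scaled = {g: [50.0 + 10.0 * (t - m) / s for t in ts] for g, ts in linked.items()}
all_scaled = [x for xs in scaled.values() for x in xs]
print(round(c34, 3), round(c45, 3), round(c56, 3))
```

With these invented values the G3/G4 constant is negative and the G4/G5 and G5/G6 constants are positive, because each common item is estimated as easier in the higher grade's calibration.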

Evaluation Criteria

The properties used to compare the scaling results included (a) grade-to-grade growth, (b)

grade-to-grade variability, and (c) separation of grade distributions (Kolen & Brennan, 2004).

These properties were compared by computing the following statistics: means, medians, standard

deviations, interquartile ranges, and effect sizes. The observed differences were used to assess

the impact that different choices about the linking set have on the resulting vertical scales.

Results

This study investigated three sets of grade-level-targeted common items, three sets of

content-area-specific common items, and two item stability procedures. A total of 18 vertical

scales were created. Appendix A enumerates the vertical scales and the testing conditions used to

construct the individual scales.

Robust z versus 0.3-logit difference

Table 6 reports the number of stable items identified in each testing condition for both the

Robust z and 0.3-logit difference procedures. Overall, the Robust z procedure was a more

conservative approach to flagging unstable items. The common items identified as unstable using

the 0.3-logit difference procedure were also identified as unstable using the Robust z procedure.

In addition, the Robust z procedure identified on average nine percent more items as unstable.

Regardless of the stability assessment procedure, the remaining common items in each

linking set represented at least 80 percent of the pool of linking items, except for three cases

under the Robust z procedure (see Table 6). First, when Measurement on- and out-of-level

common items were screened using the Robust z procedure, only 28 of the 36 common items

were retained, representing 78 percent of the linking pool. Second, only 26 of the 34 Geometry

and Measurement on-level common items were retained, which represented 77 percent of the

linking pool. And third, only 13 of the 18 Measurement on-level common items were retained,

which represented 72 percent of the linking pool.

Table 6

Number and Percentage of Stable Items by Grade-level-targeted Common Items, Content-area-specific Common Items, and Stability Assessment Procedure

                            Robust z                         0.3-logit difference
Content Area by Level       G3/G4      G4/G5      G5/G6      G3/G4      G4/G5      G5/G6
On- and Out-of-Level
  Geometry & Measurement    61 (90%)   59 (87%)   65 (96%)   68 (100%)  67 (99%)   67 (99%)
  Geometry                  30 (94%)   30 (94%)   31 (97%)   32 (100%)  32 (100%)  32 (100%)
  Measurement               28 (78%)   35 (97%)   33 (92%)   36 (100%)  35 (97%)   35 (97%)
On-Level
  Geometry & Measurement    26 (77%)   31 (91%)   33 (97%)   34 (100%)  34 (100%)  33 (97%)
  Geometry                  14 (88%)   14 (88%)   14 (88%)   16 (100%)  16 (100%)  16 (100%)
  Measurement               13 (72%)   16 (89%)   17 (94%)   18 (100%)  18 (100%)  17 (94%)
Out-of-Level
  Geometry & Measurement    31 (91%)   32 (94%)   31 (91%)   34 (100%)  33 (97%)   34 (100%)
  Geometry                  15 (94%)   16 (100%)  14 (88%)   16 (100%)  16 (100%)  16 (100%)
  Measurement               16 (89%)   16 (89%)   16 (89%)   18 (100%)  17 (94%)   18 (100%)

Table 7 displays the equating constants used to link across two adjacent grades for both

stability assessment procedures. Since the vertical scales encompassed four grade levels, three

additive constants (G3/G4, G4/G5, and G5/G6) were computed for each vertical scale. A fourth

column, representing the sum of the additive constants for G4/G5 and G5/G6, was included to

illustrate the magnitude of the combined transformations required to link G6 to G4. Comparing

across the Robust z and 0.3-logit difference procedures, only four of the 27 equating constants

were the same, indicating that in four cases, the same items were retained in the linking pool for

both procedures.

Table 7

Equating Constants used to Link Across Two Adjacent Grades by Grade-level-targeted Common Items, Content-area-specific Common Items, and Stability Assessment Procedure

Equating Constant
                            Robust z                                  0.3-logit difference
Content Area by Level       G3/G4    G4/G5    G5/G6    G4/5 + G5/6    G3/G4    G4/G5    G5/G6    G4/5 + G5/6
On- and Out-of-Level
  Geometry & Measurement    -0.916   0.840    0.697    1.537          -0.957   0.821    0.729    1.550
  Geometry                  -0.883   0.674    0.761    1.435          -0.890   0.691    0.796    1.486
  Measurement               -1.022   0.940    0.615    1.555          -1.017   0.940    0.667    1.607
On-Level
  Geometry & Measurement    -1.067   1.041    0.800    1.841          -1.053   1.012    0.800    1.813
  Geometry                  -0.944   0.814    0.923    1.737          -0.955   0.843    0.814    1.658
  Measurement               -1.064   1.195    0.787    1.982          -1.140   1.163    0.787    1.950
Out-of-Level
  Geometry & Measurement    -0.816   0.591    0.568    1.159          -0.861   0.624    0.659    1.283
  Geometry                  -0.871   0.538    0.638    1.176          -0.824   0.538    0.777    1.315
  Measurement               -0.764   0.644    0.469    1.113          -0.894   0.705    0.554    1.259

The differences observed in the two stability assessment procedures were not evident in

the resulting vertical scales. The vertical scales constructed using the linking-item sets identified

by the Robust z procedure exhibited grade-to-grade growth and within-grade

variability very similar to those of the vertical scales constructed using the linking-item sets identified by the 0.3-logit

difference procedure. The similarities in the vertical scales are depicted in Figures 5, 6, and 7. The graphs

on the left of each figure represent the results obtained when the Robust z procedure was used to

screen the common items and the graphs on the right represent the results obtained when the 0.3-

logit difference procedure was used.

Differences in Within-grade Variability from Grade to Grade

The same response data were used to create the 18 vertical scales. The difference between

the vertical scales depended on the common items used to calculate the equating constant, which

allowed the scores to be placed on the base scale. Therefore, the shape and spread of the theta

distributions at each grade remained constant across the 18 conditions. The distributions only

shifted up or down at each grade depending on the equating constant used.
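
A short numerical check illustrates why an additive constant leaves the spread untouched. The proficiency values below are invented, the two constants are simply examples of the magnitudes reported in Table 7, and the check is made on the logit scale before any rescaling.

```python
from math import isclose
from statistics import pstdev, quantiles


def spread(scores):
    """Return (standard deviation, interquartile range) of the scores."""
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    return pstdev(scores), q3 - q1


thetas = [-1.2, -0.4, 0.1, 0.3, 0.9, 1.6]  # invented proficiency estimates
base_sd, base_iqr = spread(thetas)

# Shift the same distribution by two different equating constants.
results = [spread([t + constant for t in thetas]) for constant in (0.538, 0.843)]

for sd, iqr in results:
    print(isclose(sd, base_sd, abs_tol=1e-9), isclose(iqr, base_iqr, abs_tol=1e-9))
```

Both measures of dispersion agree with the unshifted values up to floating-point tolerance; only the location of the distribution changes.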

Table 8 reports the standard deviation and interquartile range for each grade. The spread

was the same for each of the 18 vertical scales. The overall pattern in grade-to-grade variability

was a decrease in dispersion from grade 3 to 4, followed by greater variability in the scores as

the grades increased.

Table 8

Within-Grade Dispersion of Scaled Scores by Grade

                         Grade
Measure of Dispersion    3        4        5        6
Standard Deviation       8.82     8.62     8.97     9.60
Interquartile Range      11.63    11.20    11.40    12.10

The pattern of within-grade variability in students’ scaled scores is depicted graphically

in Figures 5, 6 and 7. The six graphs in Figure 5 summarize the variability within grade when

both on- and out-of-level common items were used in the linking set. The six graphs in Figure 6

summarize the variability within grade when on-level common items were used in the linking

set, and the six graphs in Figure 7 summarize the variability within grade when out-of-level

common items were used in the linking set. The two top graphs depict the results when both the

Geometry and Measurement common items were used, the two graphs in the middle row depict

the results when only the Geometry common items were used, and the two bottom graphs depict

the results when only the Measurement common items were used.

Each column in each of the graphs represents the distribution of scaled scores for the

students in one grade. The five points in each column describe the location of the 10th, 25th, 50th,

75th, and 90th percentiles in the distribution of scores for that grade. The horizontal lines connect

the same percentile in adjacent grades and show the pattern of accelerated or decelerated growth

from grade to grade for students at the specified percentile (discussed in the next section).

The interquartile range – the distance between the 25th and 75th percentiles – provides an

index of within-grade variability that is insensitive to the influence of outliers. In all 18

vertical scales, there was an increase in the interquartile range of the scores for the higher grades

(fourth through sixth grade). The increased spread at the higher grade levels was more evident at

the upper percentile (90th) of the respective grade-level distributions.
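
The five percentile points plotted for each grade, together with the interquartile range, could be computed along these lines. The scores are invented, the function name `percentile_summary` is hypothetical, and the percentile interpolation rule is that of Python's `statistics.quantiles`, which is an assumption rather than the rule used in the study.

```python
from statistics import quantiles


def percentile_summary(scores):
    """Return the 10th, 25th, 50th, 75th and 90th percentiles and the IQR."""
    # n=20 yields 19 cut points at 5%, 10%, ..., 95%.
    cuts = quantiles(scores, n=20, method="inclusive")
    p10, p25, p50, p75, p90 = (cuts[i] for i in (1, 4, 9, 14, 17))
    return {"P10": p10, "P25": p25, "P50": p50, "P75": p75, "P90": p90,
            "IQR": p75 - p25}


scores = list(range(101))  # invented scaled scores 0, 1, ..., 100
summary = percentile_summary(scores)
print(summary)
```

Applying the function to each grade's scaled scores gives one column of points per grade; connecting equal percentiles across grades gives the growth lines shown in the figures.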

Differences in Grade-to-Grade Growth

Median grade-to-grade growth for the three grade-level-targeted common-item sets.

The solid horizontal black line (labeled P50) in each of the 18 graphs in Figures 5, 6 and 7

displays the median proficiency estimate for the students in each grade. The graphs also show the

pattern of increasing growth in students’ achievement across grades. Since the fourth grade was

used as the base grade, the median scaled score for the fourth graders in the graphs for each vertical

scale is constant at 58.3.

On- and out-of-level common items. The six graphs in Figure 5 summarize the average

increase in achievement from grade to grade when on- and out-of-level common items were used

in the linking set for the Robust z and the 0.3-logit difference procedure. The two top graphs

summarize the results obtained from using on- and out-of-level common items from both the

Geometry and Measurement items, the two graphs in the middle row summarize the results

obtained from using only on- and out-of-level Geometry common items, and the two bottom

graphs summarize the results obtained from using only on- and out-of-level Measurement

common items.

Figure 5. Differences in grade-to-grade growth across corresponding percentile points for on- and out-of-level common items by content-area-specific common items and stability assessment procedure.

Figure 6. Differences in grade-to-grade growth across corresponding percentile points for on-level common items by content-area-specific common items and stability assessment procedure.

Figure 7. Differences in grade-to-grade growth across corresponding percentile points for out-of-level common items by content-area-specific common items and stability assessment procedure.

The overall growth pattern depicted in the six graphs indicated a linear increase in

median performance from grade to grade when both the Geometry and Measurement common

items were used in the linking set and when only the Measurement common items were used.

The greater increase in median performance from grade to grade was observed when only the Measurement

common items made up the linking set. A nonlinear increase in median performance from grade

to grade was observed when only the Geometry common items were used in the linking set.

The relatively flat pattern of growth in the vertical scale when only the Geometry

common items were used appears between grades four and five. This growth pattern between the

fourth and fifth grade was exhibited in a similar study conducted by Sudweeks et al. (2008) in

which two calibration methods were used to calibrate a different set of Geometry items that were

administered to a different set of students. Sudweeks et al. concluded that since the relative lack

of average growth from fourth to fifth grade was manifest in the results of both calibration

methods, this pattern was not an artifact of the calibration method, but could be attributed to: (a)

one or more characteristics of the test items, (b) differences in the Geometry curriculum, (c) the

characteristics of the students, and/or (d) the nature of the instruction provided to the students.

This pattern of decelerated growth between grades four and five was evident in both studies. The

findings of this study support the conclusion that this pattern is due to reasons other than the

psychometric properties of the Geometry items.

On-level common items. The six graphs in Figure 6 summarize the average increase in

achievement from grade to grade when on-level common items were used in the linking set for

the Robust z and the 0.3-logit difference procedure. The two top graphs summarize the results

obtained from using on-level common items from both the Geometry and Measurement items,

the two graphs in the middle row summarize the results obtained from using only on-level

Geometry common items, and the two bottom graphs summarize the results obtained from using

only on-level Measurement common items.

The overall growth pattern depicted in the six graphs indicated a linear increase in

median performance from grade to grade when on-level common items from both the Geometry

and Measurement item pool were used in the linking set. A greater linear increase in median

performance was observed when the on-level common items from the Measurement item pool

made up the linking set. The least amount of median grade-to-grade growth was observed when

only the Geometry common items were used in the linking set. Again, a nonlinear increase in

median performance was observed when the linking set was made up of only on-level Geometry

common items.

Out-of-level common items. The six graphs in Figure 7 summarize the average increase

in achievement from grade to grade when out-of-level common items were used in the linking

set for both stability assessment procedures. The two top graphs summarize the results obtained

from using out-of-level common items from both the Geometry and Measurement items, the two

graphs in the middle row summarize the results obtained from using only out-of-level Geometry

common items, and the two bottom graphs summarize the results obtained from using only out-

of-level Measurement common items.

The overall growth pattern depicted in the six graphs indicated a nonlinear increase in

median performance when out-of-level common items were used in the linking set regardless of

the content area assessed by those common items. The greatest increase, however, was observed

when the out-of-level Measurement common items made up the linking set. Similar median

grade-to-grade growth was observed when both the Geometry and Measurement common items

were used and when only the Geometry common items were used in the linking set.

Median grade-to-grade growth for the three content-area-specific common-item

sets. The median proficiency estimate for the students in each grade was compared according to

the content area assessed by the linking set when the grade-level target changed. The pattern of

growth is displayed across the individual graphs illustrated in Figures 5, 6 and 7.

Geometry and measurement common items. The two top graphs in Figure 5 summarize

the average increase in achievement from grade to grade when items assessing both Geometry

and Measurement were included in the on- and out-of-level common-item set. The two top

graphs in Figure 6 summarize the average increase in achievement from grade to grade when

items assessing both Geometry and Measurement were included in the on-level common-item

set. The two top graphs in Figure 7 summarize the average increase in achievement from grade

to grade when items assessing both Geometry and Measurement were included in the out-of-

level common-item set.

The growth patterns depicted in the six top graphs across Figures 5, 6 and 7 indicated that

when items assessing both content areas were included in the linking set, the greatest linear

increase was exhibited when the items were on-level-targeted common items. When the

Geometry and Measurement common items were taken from the on- and out-of-level common-

item pool, the median performance from grade to grade exhibited linear growth, but it was not as

great. The least amount of grade-to-grade growth (nonlinear) was depicted in the vertical scale

that was constructed using out-of-level Geometry and Measurement common items.

Geometry only common items. The two graphs in the middle row in Figure 5 summarize

the average increase in achievement from grade to grade when items assessing only Geometry

were included in the on- and out-of-level common-item set. The two graphs in the middle row in

Figure 6 summarize the average increase in achievement from grade to grade when items

assessing only Geometry were included in the on-level common-item set. The two graphs in the

middle row in Figure 7 summarize the average increase in achievement from grade to grade

when items assessing only Geometry were included in the out-of-level common-item set.

The growth patterns depicted in the six graphs in the middle rows across Figures 5, 6 and

7 indicated that when items assessing only Geometry content were included in the linking set, the

pattern of median performance from grade to grade was nonlinear. The greatest increase was

exhibited when the items were on-level-targeted common items. The least amount of growth was

depicted in the vertical scale that was constructed using out-of-level Geometry common items.

Measurement only common items. The two bottom graphs in Figure 5 summarize the

average increase in achievement from grade to grade when items assessing only Measurement

were included in the on- and out-of-level common-item set. The two bottom graphs in Figure 6

summarize the average increase in achievement from grade to grade when items assessing only

Measurement were included in the on-level common-item set. The two bottom graphs in Figure

7 summarize the average increase in achievement from grade to grade when items assessing only

Measurement were included in the out-of-level common-item set.

The growth patterns depicted in the six bottom graphs across Figures 5, 6 and 7 indicated

that when items assessing only Measurement content were included in the linking set, the pattern

of median performance from grade to grade was linear when on- and out-of-level common items

and when on-level common items were used, but nonlinear when only out-of-level common

items were used. The greatest increase was exhibited when the Measurement items were on-

level-targeted common items. When the on- and out-of-level Measurement items were used, the

increase in median performance was not as great. The least amount of growth was depicted in the vertical

scale that was constructed using out-of-level Measurement common items.

Mean grade-to-grade growth. Figure 8 displays the mean proficiency estimate for the

students in each grade and the pattern of increasing growth in students’ achievement across

grades for the 18 vertical scales. Two styles of lines were used to distinguish between the vertical

scales according to stability assessment procedure. The dotted lines represent the vertical scales

constructed using the Robust z procedure and the solid lines represent the vertical scales

constructed using the 0.3-logit difference procedure.

Three colors were used to distinguish between the vertical scales according to grade-

level-targeted common items. The blue lines represent the vertical scales that were created using

on- and out-of-level common items, the burgundy lines represent the vertical scales that were

created using on-level common items, and the green lines represent the vertical scales that were

created using out-of-level common items.

Three shapes identifying the mean growth at each grade were used to distinguish between

the vertical scales according to content-area-specific common items. The triangles identify the

vertical scales that were created using linking items that assessed both Geometry and

Measurement content. The circles identify the vertical scales that were created using linking

items that assessed only Geometry content and the squares identify the vertical scales that were

created using linking items that assessed only Measurement content.

Except for the vertical scales that were created using out-of-level common items,

particularly at the transition from grade 5 to 6, the growth pattern depicted in Figure 8

consistently indicated similar grade-to-grade growth for both stability assessment procedures.

The differences between the means were tested and the results indicated that the differences were

not statistically significant (see Appendix B).

Figure 8 further illustrates that the greatest grade-to-grade growth was displayed when

on-level common items (represented by the burgundy lines) were used in the linking set to create the

vertical scale. The least grade-to-grade growth was displayed when out-of-level common items

(represented by the green lines) were used in the linking set to create the vertical scale. The

vertical scales that included only common items that assessed Measurement content (represented

by the square) in the linking set exhibited the greatest grade-to-grade growth and the vertical

scales that included only common items that assessed Geometry content (represented by the

circle) exhibited the least grade-to-grade growth. The differences between the means were

statistically significant at each grade (see Appendix B).

Separation of Grade Distributions

Due to the difference in sample sizes in the four grades, this analysis used weighted

variances to calculate the effect size indices. According to Young (2006) the variances of the

groups being compared should be weighted by their respective sample sizes.
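
One common way to form such an index is to divide the difference between adjacent-grade means by a pooled standard deviation whose variances are weighted by sample size. The sketch below follows that form. The exact weighting prescribed by Young (2006) is not reproduced in this section, so the formula, the function name `effect_size`, and the scores are all assumptions for illustration.

```python
from math import sqrt
from statistics import mean, pvariance


def effect_size(lower, upper):
    """Mean difference between adjacent grades in pooled-SD units.

    Each grade's variance is weighted by its sample size before pooling.
    """
    n_l, n_u = len(lower), len(upper)
    pooled_var = (n_l * pvariance(lower) + n_u * pvariance(upper)) / (n_l + n_u)
    return (mean(upper) - mean(lower)) / sqrt(pooled_var)


# Invented scaled scores for two adjacent grades with unequal sample sizes.
grade_low = [40.0, 50.0, 60.0]
grade_high = [50.0, 60.0, 70.0, 80.0]

es = effect_size(grade_low, grade_high)
print(round(es, 4))
```

With equal sample sizes this reduces to the familiar form that averages the two variances; with unequal sizes the larger grade contributes more to the pooled standard deviation.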

Between-grade effect size indices. The effect size estimates computed for the different

scale score distributions are reported in Table 9 according to the composition of the linking sets

for each stability assessment procedure. The results indicated that the effect sizes produced by

the 18 vertical scales for corresponding grade-to-grade transitions were different, but four

distinct patterns were evident.

Figure 8. Mean growth from grade to grade by grade-level-targeted and content-area-specific common items and by stability assessment procedure

First, the greatest growth was displayed at the transition between grades 3 and 4. This

increase was followed by a decrease in growth from grades 4 to 5 and another increase from

grades 5 to 6. Second, the vertical scales created using on-level common items demonstrated the

greatest growth at each grade-to-grade transition compared to the vertical scales created using

out-of-level common items or on- and out-of-level common items. Third, the largest effect sizes

were generally exhibited at each grade-to-grade transition when only items assessing

Measurement content or items assessing both Geometry and Measurement content were used in

the on- and out-of-level linking sets and in the on-level linking sets. Fourth, the decelerated

growth demonstrated in the vertical scales that used the Geometry only common items also

indicated low effect sizes for the transition from grade 4 to 5. These results support previous

findings.

Table 9

Effect Sizes Computed for Different Scale Score Distributions by Grade-level-targeted Common Items, Content-area-specific Common Items, and Stability Assessment Procedure

Effect Size
                            Robust z                   0.3-logit difference
Content Area by Level       G3/G4    G4/G5    G5/G6    G3/G4    G4/G5    G5/G6
On- and Out-of-Level
  Geometry & Measurement    0.7722   0.4366   0.5728   0.8188   0.4155   0.6068
  Geometry                  0.7339   0.2488   0.6419   0.7416   0.2673   0.6794
  Measurement               0.8934   0.5510   0.4838   0.8874   0.5510   0.5404
On-Level
  Geometry & Measurement    0.9447   0.6650   0.6845   0.9287   0.6328   0.6845
  Geometry                  0.8042   0.4078   0.8172   0.8164   0.4406   0.6997
  Measurement               0.9412   0.8403   0.6702   1.0284   0.8037   0.6702
Out-of-Level
  Geometry & Measurement    0.6569   0.1541   0.4332   0.7089   0.1916   0.5314
  Geometry                  0.7198   0.0941   0.5087   0.6667   0.0941   0.6592
  Measurement               0.5980   0.2141   0.3257   0.7464   0.2833   0.4178

Discussion and Conclusions

Content and Construct Representation Should Be Maintained

The importance of common-item sets reflecting the content of the full test forms has been

stressed in the equating literature, particularly when the nonrandom groups in a common-item

equating design perform differentially (Cook, 2007; Cook, Eignor, & Taft, 1985, 1988; Cook &

Petersen, 1987; Klein & Jarjoura, 1985). Our results indicated that linking sets that were not

totally representative of the full test forms produced different vertical scales than the linking sets

that were most representative of the full test forms. The vertically scaled scores produced by the

nonrepresentative linking sets did not adequately correspond to the students’ achievement level

for the full test forms. Therefore, these findings suggest that content and construct representation

should also be maintained in the context of vertical scaling in order to capture a realistic

representation of students’ growth from grade to grade.

The importance of how common items are selected cannot be overemphasized. Despite

the progressive nature of vertical scales, in that students’ achievement levels and test forms’

difficulty levels are expected to advance from grade to grade, the tests used in this study were

systematically assembled to minimize content or construct shifts from one grade to the next. That

is, the Geometry and Measurement tests were each strategically designed to assess skills and

understandings across grades along a single developmental continuum. Despite this design, this

study revealed differences in the vertical scales depending on the linking sets used.

This approach of focusing on a single developmental continuum when constructing a test

is not commonly seen in practice. Thus it would seem reasonable to assume that the relative

emphasis given to different content areas (or constructs) changes from grade to grade much more

in end-of-level state tests, thereby increasing the probability of shifts in content areas and/or

constructs assessed in the linking sets used to construct the vertical scale. Based on this

assumption and the results of this study, it could be concluded that common-item selection is

particularly important when creating a vertical scale, especially when the vertically scaled scores

are used in value-added models to estimate the contributions that individual teachers and schools

make to students’ learning.

Large Versus Small Disparities in the Linking Set

According to Kolen and Brennan (2004), students’ performance on the items included in

the linking set influences the amount of grade-to-grade growth exhibited in the resulting vertical

scale. In other words, different linking sets result in different vertical scales. The findings of this

study showed that when the linking sets differed considerably, the growth patterns in the

resulting vertical scales differed as well.

Linking sets made up of common items assessing different curricular grade levels and

different mathematical constructs resulted in different vertical scales. The vertical scales that

differed most from one another were the vertical scales that were constructed using only on-level

or out-of-level common items assessing one content area (Geometry or Measurement). The

growth patterns of some of the vertical scales did not differ as much from one another. These

were the vertical scales constructed using linking sets that contained some items in common

(e.g., a linking set included both groups of grade-level-targeted common items and/or both

groups of content-area-specific common items). In either case, this study showed that when the

linking sets varied according to grade level and content area, the mean differences at each grade

were statistically significant.

Conversely, this study also showed that when the linking sets contained many of the

same common items, the small differences that existed between the linking sets were not as

evident in the growth patterns of the resulting vertical scales. In particular, the linking sets used

to compare the two stability assessment procedures were very similar. On average, the linking

sets consisting of items screened using the Robust z procedure contained only nine percent fewer

common items than the linking sets consisting of items screened using the 0.3-logit difference

procedure (see Table 6). When comparing the growth patterns of the vertical scales created using

the Robust z approach to those created using the 0.3-logit difference approach, the results of this

study suggest that small differences in the composition of the linking sets do not transfer over to

the resulting vertical scales.

These findings suggest that practitioners should pay particular attention to changes in the

composition of the linking set as vertical scales are maintained over the years. Small changes

should not have a great influence on the students’ growth patterns, but larger changes in the

linking set over time may artificially influence the grade-to-grade growth revealed by the

resulting vertical scales.

Robust z Versus 0.3-Logit Difference

Applying the Robust z and 0.3-logit difference stability assessment procedures in the context of a

vertical scaling study was informative: it revealed that, while the Robust z procedure could be

utilized in the same manner in which it is used in equating, a variation of the 0.3-logit difference

procedure was needed to ensure that items were not mistakenly identified as unstable. This study

proposed and documented a method of using the 0.3-logit difference procedure when screening

common items for the purpose of creating a vertical scale.

The results of this study support Huynh and Rawls’ (2009) conclusion that either the

Robust z procedure or the 0.3-logit difference procedure could be used to identify stable items,

since most of the items under consideration were classified identically by both procedures. This

vertical scaling study also demonstrated that both procedures resulted in very similar increases in

achievement from year to year. Given the similarities, we concur with Huynh and Rawls that the

Robust z is the recommended procedure because it is a more conservative approach.
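
The two screening rules can be sketched as follows. This is a minimal illustration of the procedures as they are commonly described for equating, not the variation proposed in this study for vertical scaling. The scaling constant 0.74, the critical value 1.645, the centering of the 0.3-logit rule on the mean difference, the choice of quartile method, and the difficulty differences themselves are all assumptions made for the example.

```python
from statistics import mean, median, quantiles


def robust_z(diffs):
    """Robust z for each difficulty difference: (d - median) / (0.74 * IQR)."""
    q1, _, q3 = quantiles(diffs, n=4, method="inclusive")
    iqr = q3 - q1
    med = median(diffs)
    return [(d - med) / (0.74 * iqr) for d in diffs]


def stable_by_robust_z(diffs, critical=1.645):
    """True for items whose absolute robust z does not exceed the critical value."""
    return [abs(z) <= critical for z in robust_z(diffs)]


def stable_by_logit_difference(diffs, threshold=0.3):
    """True for items within `threshold` logits of the mean difference.

    Centering on the mean plays the role of a mean/mean linking shift
    (an assumption of this sketch).
    """
    center = mean(diffs)
    return [abs(d - center) <= threshold for d in diffs]


# Hypothetical differences in Rasch difficulty (logits) between two
# calibrations of the same eight common items; the last item has drifted.
diffs = [0.05, -0.10, 0.12, 0.00, -0.08, 0.15, -0.05, 0.90]

flags_z = stable_by_robust_z(diffs)
flags_ld = stable_by_logit_difference(diffs)

for i, (d, ok_z, ok_ld) in enumerate(zip(diffs, flags_z, flags_ld), start=1):
    print(f"item {i}: diff={d:+.2f}  robust z stable={ok_z}  0.3-logit stable={ok_ld}")
```

In this invented example both rules retain the first seven items and flag only the last one, which mirrors the kind of agreement between procedures described above.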

Test On-Level

The study also revealed that the vertical scales constructed using the on-level common

items consistently produced the largest increase in achievement from year to year. The vertical

scales constructed using the on- and out-of-level common items consistently exhibited less

grade-to-grade growth. This would suggest that students’ performance on the out-of-level

common items lowered the overall test scores. Based on these findings, it can be reiterated that

students perform better when tested on content in which they have most recently been instructed and

that test items should therefore be on-level.

References

Camilli, G., Yamamoto, K., & Wang, M. (1993). Scale shrinkage in vertical equating. Applied

Psychological Measurement, 17, 379-388.

Cook, L. L. (2007). Practical problems in equating test scores: A practitioner’s perspective. In

N. J. Dorans, M. Pommerich, & P. W. Holland (Eds.), Linking and aligning scores and

scales. New York: Springer.

Cook, L. L., Eignor, D. R., & Taft, H. L. (1985). A comparative study of curriculum effects on

the stability of IRT and conventional item parameter estimates (RR-85-38). Princeton, NJ:

Educational Testing Service.

Cook, L. L., Eignor, D. R., & Taft, H. L. (1988). A comparative study of the effects of recency of

instruction on the stability of IRT and conventional item parameter estimates. Journal of

Educational Measurement, 25(1), 31-45.

Cook, L. L., & Petersen, N. S. (1987). Problems related to the use of conventional and item

response theory equating methods in less than optimal circumstances. Applied

Psychological Measurement, 11, 225-244.

Harris, D. J. (2007). Practical issues in vertical scaling. In N. J. Dorans, M. Pommerich, & P. W.

Holland (Eds.), Linking and aligning scores and scales (pp. 233-251). New York:

Springer.

Huynh, H., Gleaton, J., & Seaman, S. P. (1992). Technical documentation for the South Carolina

high school exit examination of reading and mathematics: Paper No. 2 (2nd ed.).

Columbia, SC: University of South Carolina, College of Education.

Huynh, H., & Rawls, A. (2009). A comparison between robust z and 0.3-logit difference

procedures in assessing stability of linking items for the Rasch model. In E. V.

Smith, Jr., & G. E. Stone (Eds.), Applications of Rasch measurement in criterion-

referenced testing: Practice analysis to score reporting. Maple Grove, MN: JAM

Press.

Klein, L. W., & Jarjoura, D. (1985). The importance of content representation for common-item

equating with nonrandom groups. Journal of Educational Measurement, 22, 197-206.

Kolen, M. J., & Brennan, R. L. (2004). Test equating, scaling, and linking: Methods and

practices (2nd ed.). New York: Springer.

Linacre, J. M. (2006). User’s guide to WINSTEPS® computer program. Chicago: Winsteps.com.

Lord, F. M. (1983). Small N justifies the Rasch model. In D. J. Weiss (Ed.), New horizons in

testing (pp. 51-62). New York: Academic Press.

Loyd, B. H., & Hoover, H. D. (1980). Vertical equating using the Rasch model. Journal of

Educational Measurement, 17, 179-193.

Miller, G. E., Rotou, O., & Twing, J. S. (2004). Evaluation of the .3 logits screening criterion in

common item equating. Journal of Applied Measurement, 5(2), 172-177.

Sudweeks, R. R., Hardy, M. A., Bullough, R. V., Jr., Bahr, D. L., Monroe, E. E., Thayn, S., &

McEwen, M. (2008, March). Constructing vertically scaled mathematics test for tracking

student growth in value-added studies of teacher effectiveness. Paper presented at the

annual meeting of the National Council on Measurement in Education, New York City,

New York.

Williams, V. S. L., Pommerich, M., & Thissen, D. (1998). A comparison of developmental

scales based on Thurstone methods and item response theory. Journal of Educational

Measurement, 35, 93-107.

Yen, W. M. (1986). The choice of scale for educational measurement: An IRT perspective.

Journal of Educational Measurement, 23, 399-425.

Young, M. J. (2006). Vertical scales. In S. M. Downing & T. M. Haladyna (Eds.), Handbook of

test development (pp. 469-485). Mahwah, NJ: Erlbaum.

Appendix A

Vertical Scales by ID Code

                        Grade Level                           Mathematical Construct               Stability Assessment
No.  Vertical Scale ID  On- and       On-Level  Out-of-Level  Geometry and  Geometry  Measurement  Robust z  0.3-Logit
                        Out-of-Level                          Measurement   only      only                   Difference
1    OnOut_GM_RobZ      X                                     X                                    X
2    OnOut_G_RobZ       X                                                   X                      X
3    OnOut_M_RobZ       X                                                             X            X
4    On_GM_RobZ                       X                       X                                    X
5    On_G_RobZ                        X                                     X                      X
6    On_M_RobZ                        X                                               X            X
7    Out_GM_RobZ                                X             X                                    X
8    Out_G_RobZ                                 X                           X                      X
9    Out_M_RobZ                                 X                                     X            X
10   OnOut_GM_0.3LD     X                                     X                                              X
11   OnOut_G_0.3LD      X                                                   X                                X
12   OnOut_M_0.3LD      X                                                             X                      X
13   On_GM_0.3LD                      X                       X                                              X
14   On_G_0.3LD                       X                                     X                                X
15   On_M_0.3LD                       X                                               X                      X
16   Out_GM_0.3LD                               X             X                                              X
17   Out_G_0.3LD                                X                           X                                X
18   Out_M_0.3LD                                X                                     X                      X
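
The ID codes in Appendix A follow from fully crossing the three design factors (3 grade-level conditions x 3 construct conditions x 2 stability assessment procedures). The following short sketch reproduces that enumeration; the variable names are illustrative assumptions, and the ordering is chosen only to match the numbering in the table.

```python
from itertools import product

# Factor levels, abbreviated as in the vertical scale ID codes.
GRADE_LEVELS = ["OnOut", "On", "Out"]   # on- and out-of-level, on-level, out-of-level
CONSTRUCTS = ["GM", "G", "M"]           # Geometry and Measurement, Geometry, Measurement
PROCEDURES = ["RobZ", "0.3LD"]          # Robust z, 0.3-logit difference

# The procedure varies slowest and the construct fastest, as in the table.
scale_ids = [
    f"{grade}_{construct}_{procedure}"
    for procedure, grade, construct in product(PROCEDURES, GRADE_LEVELS, CONSTRUCTS)
]

for number, scale_id in enumerate(scale_ids, start=1):
    print(number, scale_id)
```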

Appendix B

Three Way ANOVA Tables for Grades 3, 5, and 6

Table B1

3-Way ANOVA(a) for Grade 3

                                            Experimental Method
                                            Sum of Squares     df  Mean Square        F    Sig.
Score
  Main Effects (Combined)                          8681.29      5      1736.26    22.32   .0000
    Grade-level-targeted Common Items (G)          7064.28      2      3532.14    45.42   .0000
    Content-area-specific Common Items (C)         1421.13      2       710.57     9.14   .0001
    Stability Assessment Procedure (SAP)            195.88      1       195.88     2.52   .1125
  2-Way Interactions (Combined)                    1443.13      8       180.39     2.32   .0175
    G * C                                          1143.48      4       285.87     3.68   .0054
    G * SAP                                          38.01      2        19.01      .24   .7832
    C * SAP                                         261.64      2       130.82     1.68   .1860
  3-Way Interactions G * C * SAP                    360.13      4        90.03     1.16   .3274
  Model                                           10484.56     17       616.74     7.93   .0000
  Residual                                       830156.01  10674        77.77
  Total                                          840640.57  10691        78.63

(a) Score by G, C, SAP
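
As a reading aid for the ANOVA tables, the following sketch recomputes each F ratio in Table B1 from the reported sums of squares and degrees of freedom, using the standard relations MS = SS / df and F = MS(effect) / MS(residual). The figures are transcribed from Table B1; the variable names are assumptions of the sketch, and small discrepancies would be expected only from rounding in the reported values.

```python
# (sum of squares, df, reported F) for each effect in Table B1 (Grade 3).
effects = {
    "G": (7064.28, 2, 45.42),
    "C": (1421.13, 2, 9.14),
    "SAP": (195.88, 1, 2.52),
    "G * C": (1143.48, 4, 3.68),
    "G * SAP": (38.01, 2, 0.24),
    "C * SAP": (261.64, 2, 1.68),
    "G * C * SAP": (360.13, 4, 1.16),
}
residual_ss, residual_df = 830156.01, 10674

# Mean square error: the denominator of every F ratio in the table.
ms_residual = residual_ss / residual_df


def f_ratio(ss, df):
    """F = (SS_effect / df_effect) / MS_residual."""
    return (ss / df) / ms_residual


computed = {name: f_ratio(ss, df) for name, (ss, df, _) in effects.items()}

for name, (ss, df, reported) in effects.items():
    print(f"{name:<12} MS={ss / df:9.2f}  F={computed[name]:6.2f}  (reported {reported:.2f})")

# The three main-effect sums of squares add up to the combined main-effects row.
main_effects_ss = sum(effects[name][0] for name in ("G", "C", "SAP"))
print(f"Main effects (combined) SS = {main_effects_ss:.2f}")
```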

Table B2

3-Way ANOVA(a) for Grade 5

                                            Experimental Method
                                            Sum of Squares     df  Mean Square        F    Sig.
Score
  Main Effects (Combined)                         38408.09      5      7681.62    95.52   .0000
    Grade-level-targeted Common Items (G)         27885.87      2     13942.93   173.38   .0000
    Content-area-specific Common Items (C)        10510.90      2      5255.45    65.35   .0000
    Stability Assessment Procedure (SAP)             11.32      1        11.32      .14   .7075
  2-Way Interactions (Combined)                    1426.45      8       178.31     2.22   .0234
    G * C                                          1327.26      4       331.81     4.13   .0024
    G * SAP                                          81.63      2        40.82      .51   .6020
    C * SAP                                          17.55      2         8.78      .11   .8966
  3-Way Interactions G * C * SAP                    118.76      4        29.69      .37   .8307
  Model                                           39953.30     17      2350.19    29.22   .0000
  Residual                                       819304.43  10188        80.42
  Total                                          859257.72  10205        84.20

(a) Score by G, C, SAP

Table B3

3-Way ANOVA(a) for Grade 6

                                            Experimental Method
                                            Sum of Squares     df  Mean Square        F    Sig.
Score
  Main Effects (Combined)                         49061.96      5      9812.39   106.55   .0000
    Grade-level-targeted Common Items (G)         47194.28      2     23597.14   256.23   .0000
    Content-area-specific Common Items (C)         1523.05      2       761.52     8.27   .0003
    Stability Assessment Procedure (SAP)            344.63      1       344.63     3.74   .0531
  2-Way Interactions (Combined)                    3358.85      8       419.86     4.56   .0000
    G * C                                          2289.71      4       572.43     6.22   .0001
    G * SAP                                        1054.07      2       527.03     5.72   .0033
    C * SAP                                          15.07      2         7.53      .08   .9214
  3-Way Interactions G * C * SAP                     45.67      4        11.42      .12   .9739
  Model                                           52466.48     17      3086.26    33.51   .0000
  Residual                                       692906.71   7524        92.09
  Total                                          745373.19   7541        98.84

(a) Score by G, C, SAP