Journal of Educational and Behavioral Statistics-2015-Liang-5-34


A Quasi-Parametric Method for Fitting Flexible Item Response Functions

Longjuan Liang

Educational Testing Service

Michael W. Browne

Ohio State University

If standard two-parameter item response functions are employed in the analysis

of a test with some newly constructed items, it can be expected that, for some

items, the item response function (IRF) will not fit the data well. This lack of fit

can also occur when standard IRFs are fitted to personality or psychopathology

items. When investigating reasons for misfit, it is helpful to compare item

response curves (IRCs) visually to detect outlier items. This is only feasible if

the IRF employed is sufficiently flexible to display deviations in shape from the

norm. A quasi-parametric IRF that can be made arbitrarily flexible by increas-

ing the number of parameters is proposed for this purpose. To take capitaliza-

tion on chance into account, the use of Akaike information criterion or Bayesian

information criterion goodness of approximation measures is recommended for

suggesting the number of parameters to be retained. These measures balance

the effect on fit of random error of estimation against systematic error of

approximation. Computational aspects are considered and efficacy of the

methodology developed is demonstrated.

Keywords: item response theory; flexible item response function; monotonic polynomial

1. Introduction

When dichotomous items of an ability test are being analyzed, the most

widely employed item response functions (IRFs) have two parameters. Although

these IRFs have been found to be useful in general, they lack flexibility and there

are situations where they fail to fit some items. When this happens, it could be

either that the items have flaws or the data have characteristics that cannot be

handled by the IRF. In this situation, it is helpful to have access to a flexible IRF

that yields an item response curve (IRC) that will display differences in shape

between items.

Journal of Educational and Behavioral Statistics

2015, Vol. 40, No. 1, pp. 5–34

DOI: 10.3102/1076998614556816

© 2014 AERA. http://jebs.aera.net


A number of articles on flexible IRFs have appeared. For example, Drasgow,

Levine, Williams, McLaughlin, and Candell (1989) describe and illustrate the

use of multilinear formula score theory for nonparametric IRFs; Ramsay and

Winsberg (1991) use monotonic spline basis functions and calculate maximum

marginal likelihood (MML) estimates for the item parameters; Meijer and

Baneke (2004) discuss the use of nonparametric methods when analyzing psy-

chopathology and personality items. Other approaches for improving goodness

of fit of item response models are also possible. For example, Woods and Thissen (2006) employ the two-parameter logistic (2PL) function for the IRF in all items but replace the standard normal ability distribution with a spline-based density. A

Bayesian approach to nonparametric item response modeling has been recently

developed by Duncan and MacEachern (2008, 2013). Although very good esti-

mates of nonstandard IRCs are produced, a considerable amount of computation

is required.

In his seminal article on nonparametric item response theory, Ramsay (1991)

suggested use of a nonparametric regression of item scores on normalized ability

surrogates using kernel smoothing as IRC. This approach is simple and robust

and allows the shape of the IRC to vary freely from one item to another. It has

made a substantial impact. Kernel smoothing does not constrain the IRF to be

monotonic, so that it can provide option response curves for incorrect options.

When the correct option of ability items is being analyzed, however, it is desir-

able to be able to constrain the IRCs to be monotonic. Lee (2007) investigated

this matter and used isotonic regression in conjunction with Ramsay’s approach

to obtain monotonic IRFs. A disadvantage of these nonparametric approaches is

that the IRF is not readily portable to scores of examinees that are not in the cali-

bration sample so that scoring test results for future examinees is difficult.

This article presents an IRF that is ‘‘quasi-parametric’’ (Ramsay, 1991, p. 613)

in the sense that it employs parameters that are intended solely for the provision of

a graphical representation of the IRC and not for interpretation in terms of some

underlying psychological process. This approach can guarantee monotonicity while keeping the IRC flexible, and it facilitates the use of existing parametric Bayesian expected a posteriori (EAP) estimates for the stochastic ability parameter. Our proposed filtered monotonic polynomial (FMP) IRF is the composition of a logistic function and a monotonic polynomial. Results concerning

cumulative distribution functions (cdfs) given by Elphinstone (1983) show that the

FMP IRF may be used to approximate any IRF with a continuous derivative arbi-

trarily closely by increasing the number of parameters in the monotonic polyno-

mial. Thus, this FMP IRF not only is flexible but is formulated as an algebraic

expression that is easily portable to future examinees not in the calibration sample.

When the FMP function is required to approximate some "population" IRF with few parameters, it can happen that the approximating FMP IRF will require more parameters than the (usually unknown) population IRF. When samples are

very large, and sampling error can be disregarded, the necessity for many FMP


parameters would not matter. In smaller samples where sampling error needs to

be considered, an attempt to use many FMP parameters could result in appreci-

able capitalization on chance. In general, it is preferable to retain fewer para-

meters in smaller samples than in large samples (see Browne, 2000; Cudeck &

Henly, 1991). We shall implement this principle by using the Akaike information criterion (AIC; Akaike, 1973) or the Bayesian information criterion (BIC; Schwarz, 1978) as guides when choosing the number of FMP parameters for an item. These

are goodness of approximation measures that make no assumption of an exactly

correct model in the population, take sample size into account, limit the number

of parameters when the sample size is small, and allow more parameters as the

sample size increases.

The FMP approach involves no assumption that the number of item para-

meters is the same for all items, so that the shape of the IRC can vary from one

item to another, as is the case with the nonparametric regression approaches.

Because the usual two-parameter logistic (2PL) IRF is a special case of the FMP family of IRFs,

it can be fitted at the same time as more flexible IRFs for comparative purposes.

Furthermore, the FMP requirement of monotonicity for the IRF may be discarded

to result in a filtered unconstrained polynomial (FUP) procedure that can assist

the diagnosis of nonmonotonic items. Although this article concentrates on

extensions to the computationally convenient 2PL IRF, basic theory is presented

in a manner that can be extended to other IRF families.

Unlike the 2PL, the one-parameter logistic (1PL) IRF, or Rasch model, constrains an item parameter (discrimination) to equality across items (cf. Thissen & Orlando, 2001, p. 76, equation 3). Consequently, the 1PL does not fit into the FMP computational framework, which estimates item parameters successively, one item at a time, rather than concurrently for all items. Furthermore, the fundamental philosophy of the FMP

approach is to seek a model that fits given data as well as possible and contradicts

that of the Rasch model which requires that data should fit a given model to satisfy

mandatory measurement requirements (cf. Thissen & Orlando, 2001, pp. 90–91).

The following section gives a brief review of the parametric and nonpara-

metric approaches for estimating the IRF, including the joint maximum like-

lihood (JML) and MML parametric estimation methods. Thereafter, we

introduce the filtered polynomial IRF estimation method and consider the

choice of the number of parameters using the AIC information theoretical

approach. Subsequently, we present results from simulation studies and an

example with actual data. A summary of findings and conclusions of the

research is provided in the closing part of the article. Details concerning

parameter estimation are given in Online Appendices A and B.

2. Item Response Theory

Consider an $N \times n$ data matrix, $\mathbf{Y}$, with typical element $y_{si}$, which represents the response of examinee $s$ to item $i$, with $y_{si} = 1$ if the response is correct and $y_{si} = 0$ otherwise. The responses of all examinees to item $i$ are contained in the $N \times 1$ vector, $\mathbf{y}_{\cdot i}$, formed from column $i$ of $\mathbf{Y}$. A row $s$ of $\mathbf{Y}$ provides the response pattern for examinee $s$ and will be denoted by the $1 \times n$ vector, $\mathbf{y}'_{s\cdot}$, with $i$th element $y_{si}$. Thus, a column of $\mathbf{Y}$ represents scores of $N$ examinees on an item and a row represents scores of an examinee on $n$ items.

We assume that there is a single latent trait, $\theta$, that influences an examinee's response to each item. The IRF for item $i$,

$$P_i(\theta) = \mathrm{Prob}(y_i = 1 \mid \theta), \qquad (1)$$

gives the probability that an examinee with ability $\theta$ will give the "correct" answer to item $i$. Because $P_i(\theta)$ represents a probability, it must be bounded below by 0 and above by 1. With ability or achievement tests, it also makes sense to assume that the probability of passing an item increases as $\theta$ increases, so that the IRF will be monotonically increasing, bounded below by 0 and above by 1. Any IRF $P_i(\theta)$ that decreases as $\theta$ increases would be symptomatic of an unusual item.

The vectors, $\mathbf{y}_{s\cdot}$, are regarded as independent realizations of a random vector $\mathbf{y}$. For each examinee $s$, there corresponds a realization $\theta_s$ of the latent trait, $\theta$, that represents examinee ability. We assume local independence, that is, that conditionally on $\theta = \theta_s$ the elements of $\mathbf{y}$ are independently distributed. Consequently, the probability of a specific response pattern $\mathbf{y}_{s\cdot}$ given $\theta_s$ is

$$\mathrm{Prob}(\mathbf{y}_{s\cdot} \mid \theta = \theta_s) = \prod_{i=1}^{n} P_{si}^{\,y_{si}} (1 - P_{si})^{(1 - y_{si})}, \qquad (2)$$

where

$$P_{si} = P_i(\theta_s). \qquad (3)$$

2.1. Parametric IRFs

In addition to the abilities, $\theta_s$, parametric IRFs involve additional item-specific parameters. As examples, we shall consider two well-known IRFs, each with two parameters: an item discrimination parameter $a_i$ and an item difficulty parameter $b_i$. The normal ogive IRF for item $i$ is given (e.g., Lord & Novick, 1968, p. 366) by

$$P_i(\theta) = \int_{-\infty}^{m_i(\theta)} \frac{1}{\sqrt{2\pi}} \exp(-z^2/2)\, dz \qquad (4)$$

and the two-parameter logistic (2PL) IRF (Birnbaum, 1968) by

$$P_i(\theta) = \frac{1}{1 + \exp\{-m_i(\theta)\}}, \qquad (5)$$

where $m_i(\theta)$ is a linear function of $\theta$:

$$m_i(\theta) = a_i(\theta - b_i). \qquad (6)$$

Both IRFs are monotonic increasing if $a_i > 0$ and are bounded by 0 and 1. They are similar but differ in the scale of $a_i$. This difference between the two IRFs can be reduced substantially by replacing the function $m_i(\theta)$ in Equation 5 by the rescaled function $1.702\, m_i(\theta)$.$^{1}$

Early in the development of item response theory, the normal ogive IRF was used predominantly but was replaced subsequently by the 2PL, which is computationally more convenient. Two methods are best known for obtaining parameter estimates using the 2PL. The first was originally suggested by Birnbaum (1968). The unobservable ability variables, $\theta_s$, are regarded as parameters to be estimated rather than as realizations of a latent variable with a prespecified normal distribution. A likelihood function is maximized jointly with respect to the item parameters $a_i, b_i$, $i = 1, \ldots, n$, and the ability parameters $\theta_s$, $s = 1, \ldots, N$, using an alternating iterative algorithm. This method of estimation is referred to as JML estimation. Consistency of the estimates has never been proved (e.g., Baker, 1992, pp. 104–105) and "tuning" of the algorithm is necessary (Baker, 1992, p. 112).

The second method of estimation, introduced by Bock and Lieberman (1970) for the 2PL, treated ability as a latent trait with a specified normal distribution and maximized the marginal likelihood for item parameters alone, integrating out the ability variable, $\theta$. A Newton–Raphson algorithm was proposed and gave acceptable results for a small number of items but was not practical for many items. Significant improvements were provided by Bock and Aitkin (1981), who approximated the density of $\theta$ by a step function that facilitated the use of an expectation-maximization algorithm (EM algorithm). This method of estimation is known as MML and is now frequently employed.

The MML estimation method has advantages over JML in that it can obtain estimates of the item parameters without estimating ability parameters, $\theta_s$, $s = 1, \ldots, N$, at the same time. Estimates for the latent traits, $\theta_s$, may be obtained subsequently using a Bayesian method.

2.2. Ramsay’s Nonparametric IRF

Ramsay (1991, 2000) introduced a nonparametric approach to estimating an IRC using kernel smoothing. This approach requires a surrogate ability value, $\tilde\theta_s$, for each examinee. All examinees are ranked according to total test score and $\tilde\theta_r$ is defined to be the estimated quantile of the standard normal distribution corresponding to rank $r$. A smoothed estimate of the IRF is given by

$$\hat P_i(\theta) = \frac{\sum_{r=1}^{N} K\!\left(\frac{\tilde\theta_r - \theta}{h}\right) y_{ri}}{\sum_{r=1}^{N} K\!\left(\frac{\tilde\theta_r - \theta}{h}\right)}, \qquad (7)$$

where $y_{ri}$ is the score on item $i$ of the examinee with rank $r$. The symmetric nonnegative weighting function

$$K(z) = (2\pi)^{-1/2} \exp(-z^2/2), \quad \text{where } z = (\tilde\theta_r - \theta)/h, \qquad (8)$$

is known as the Gaussian kernel. It will have a maximum when $z = 0$ and decrease toward zero as $|z|$ increases. An increase in the bandwidth $h$ will result in slower changes of the function $\hat P_i(\theta)$ but also increase bias of the function. Rapid changes or wiggles due to sampling fluctuation decrease as $N$ increases, so that bias can be reduced by reducing $h$ as $N$ increases. In the computer program TestGraf (Ramsay, 2000), the default of $h = 1.1 N^{-0.2}$ is employed, so that $h$ is a function of $N$ alone and decreases as $N$ increases.
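The kernel-smoothed estimate of Equation 7 can be sketched in a few lines of Python. This is a minimal illustration and not TestGraf's implementation: the function names are hypothetical, the mapping of rank $r$ to the $r/(N+1)$ normal quantile is an assumption (the text only says "estimated quantile"), and ties in total score are broken by original order rather than randomly.

```python
import math
from statistics import NormalDist


def normal_quantile_surrogates(total_scores):
    """Rank examinees by total score and map ranks to standard normal quantiles.

    Assumption: rank r (1..N) is mapped to the r/(N+1) quantile. Ties are
    broken by original order (sorted() is stable), not randomly.
    """
    n_examinees = len(total_scores)
    order = sorted(range(n_examinees), key=lambda s: total_scores[s])
    surrogates = [0.0] * n_examinees
    for rank, s in enumerate(order, start=1):
        surrogates[s] = NormalDist().inv_cdf(rank / (n_examinees + 1))
    return surrogates


def gaussian_kernel(z):
    """Equation 8: Gaussian kernel."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def kernel_irf(theta, surrogates, item_scores, h=None):
    """Equation 7: kernel-smoothed estimate of P_i(theta).

    If no bandwidth is supplied, the default h = 1.1 * N**(-0.2) quoted in
    the text is used.
    """
    n_examinees = len(surrogates)
    if h is None:
        h = 1.1 * n_examinees ** (-0.2)
    weights = [gaussian_kernel((t - theta) / h) for t in surrogates]
    return sum(w * y for w, y in zip(weights, item_scores)) / sum(weights)
```

Calling `kernel_irf` on a grid of `theta` values traces out the estimated IRC for one item; because the estimate is a weighted average of 0/1 scores, it stays within [0, 1] but is not forced to be monotonic.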

It is possible to use the nonparametric IRF defined by Equation 7 to compute maximum likelihood estimates of the ability parameters $\theta_s$, $s = 1, \ldots, N$; feed them back into the process for reranking the examinees, obtaining new surrogate variables $\tilde\theta_s$, $s = 1, \ldots, N$; and carry out an iterative procedure. In the TestGraf (Ramsay, 2000) program, this can be done, but a manual intervention at each iteration is required. This precludes use of the iterative procedure in random sampling experiments. If no iterations are carried out, the original surrogate variable values, $\tilde\theta_s$, are output as estimates of the ability variables.

3. Quasi-parametric IRFs

The extension of the IRF for the 2PL in Equation 5 to yield IRFs that are

simultaneously both flexible and parametric will now be considered.

Elphinstone (1983, 1985) proposed a monotonic polynomial–based approach

for estimating an unknown univariate distribution function. Sinnott (1997) sub-

sequently named it the ‘‘filtered polynomial’’ distribution estimation method and

extended it to a multivariate setting. Here, the general methodology provided by

Elphinstone (1983) will be adapted to estimate an IRF of unknown functional

form. The likelihood function appropriate here for estimating an unknown IRF is different from that used by Elphinstone (1985) for estimating a distribution function of unknown functional form.

3.1. Filtered Polynomials

The IRF $P_i(\theta)$ yields the probability that an examinee with ability $\theta$ will answer a specified item, $i$, correctly. Unless otherwise stated, each IRF to be considered here is assumed (i) to be monotonic increasing, (ii) to be bounded by 0 and 1, and (iii) to have a continuous first derivative with respect to $\theta$, implying that the IRF is also continuous. Suppose that the functional form of some "true" IRF $\tilde P_i(\theta)$ is not known, but a known scalar-valued function, $H(m)$, of a scalar-valued argument, $m$, satisfies the three requirements of an IRF specified previously: for example, either the logistic function

$$H(m) = \frac{1}{1 + \exp(-m)}, \qquad (9)$$

or the normal ogive

$$H(m) = \int_{-\infty}^{m} \frac{1}{\sqrt{2\pi}} \exp(-z^2/2)\, dz, \qquad (10)$$

would be suitable.

It is known (e.g., Elphinstone, 1983, p. 167) that there exists at least one continuous monotonic function $\tilde m_i(\theta)$ such that

$$\tilde P_i(\theta) = H(\tilde m_i(\theta)). \qquad (11)$$

This monotonic function, $\tilde m_i(\theta)$, is, in general, not of a known functional form. It may, however, be approximated arbitrarily closely by a polynomial $m_i(\theta)$ of odd degree, $2k_i + 1$, where $k_i \ge 0$, if $k_i$ is made sufficiently large (Elphinstone, 1983, section 4). Thus,

$$m_i(\theta) = b_{0i} + b_{1i}\theta + b_{2i}\theta^2 + \cdots + b_{2k_i+1,i}\,\theta^{2k_i+1} \approx \tilde m_i(\theta), \qquad (12)$$

with $2k_i + 2$ parameters represented by the vector $\mathbf{b}_i' = (b_{0i}, b_{1i}, \ldots, b_{2k_i+1,i})$. A reparameterization of $\mathbf{b}_i$ that is used to ensure that the polynomial in Equation 12 is monotonic will be described in Subsection 3.2.

Any population IRF $\tilde P_i(\theta)$ in Equation 11 that is of an unknown functional form may be approximated arbitrarily closely by the IRF of known functional form

$$P_i(\theta) = H(m_i(\theta)), \qquad (13)$$

if $k_i$ is sufficiently large. That is, the IRF is the composition of the filter with the monotonic polynomial, $P = H \circ m$. Thus, the "filter" $H(\cdot)$ transforms the unbounded monotonic polynomial, $m_i(\theta)$, in Equation 12 into a monotonic IRC, $P_i(\theta)$, that is bounded by 0 and 1. (This terminology is motivated by an analogous situation in signal processing in which a potentially unbounded signal is transformed into a bounded signal through a device known as a "filter.") The IRF defined in Equation 13 will consequently be referred to as an FMP model. If no constraints are imposed on the coefficients in Equation 12 to ensure that $m_i(\theta)$ is monotonic, the filtered function in Equation 13 will still be bounded by 0 and 1 but need not be monotonic. The resulting model will then be referred to as a filtered unconstrained polynomial (FUP) model.

The logistic function in Equation 9 will be used henceforth as a filter because it is algebraically convenient to do so. Use of the normal ogive as a filter would give essentially the same results but leads to algebraic expressions that are more complicated and less easily evaluated. Substitution of the polynomial in Equation 12 into the logistic filter in Equation 9 yields the IRF

$$P_i(\theta) = P(\theta \mid \mathbf{b}_i) = \frac{1}{1 + \exp\{-(b_{0i} + b_{1i}\theta + b_{2i}\theta^2 + \cdots + b_{2k_i+1,i}\,\theta^{2k_i+1})\}}, \qquad (14)$$

which applies to both the FUP and the FMP models. The difference is that the coefficient vector, $\mathbf{b}_i$, is unconstrained for the FUP model, whereas constraints are applied to $\mathbf{b}_i$ to ensure monotonicity of the IRC for the FMP model. These constraints are applied by means of reparameterizations that will be described in Subsection 3.2.

Because $k_i$ can vary from one item to another, the shapes of the IRC for different items may be different. When $k_i = 0$, the IRF in Equation 14 is equivalent to the 2PL IRF of Equations 5 and 6 with $b_{0i} = -a_i b_i$ and $b_{1i} = a_i$. The filter, $H(\cdot)$, may be any monotonic function that is bounded by zero and one and has a continuous first derivative. It is also desirable that $H(\cdot)$ should have the same domain as the domain hypothesized for the unknown $\tilde P_i(\theta)$, so that the approximating IRF $P_i(\theta)$ and the approximated IRF $\tilde P_i(\theta)$ have domains that match. The filter has the same mathematical properties as a statistical cdf, so that alternative cdfs to those in Equations 9 and 10 could be tried as a filter.

Consequently, in situations where it is plausible to restrict $\theta$ to the nonnegative real line, the gamma ogive could be tried as a filter. If $\theta$ is assumed to be contained in a closed interval, the beta ogive could be employed. It is worth bearing in mind that filters that are close in shape to that of the unknown IRF $\tilde P_i(\theta)$ will need a lower degree for the monotonic polynomial than those that are more dissimilar. The choice of filter is not always critical, however, because one can compensate for an inappropriate filter to some extent by increasing the degree of the monotonic polynomial. There are, however, practical limits to the degree of the monotonic polynomial because computational instabilities are associated with polynomial models of high degree.
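As a small illustration of Equation 14, the following Python sketch evaluates a logistic-filtered polynomial and checks numerically that the $k_i = 0$ case coincides with the 2PL of Equations 5 and 6. The function names and the numerical values of the discrimination and difficulty parameters are illustrative assumptions, not taken from the article.

```python
import math


def polynomial(theta, b):
    """Evaluate m(theta) = b[0] + b[1]*theta + ... by Horner's rule."""
    value = 0.0
    for coef in reversed(b):
        value = value * theta + coef
    return value


def filtered_polynomial_irf(theta, b):
    """Equation 14: logistic filter applied to the polynomial m(theta)."""
    return 1.0 / (1.0 + math.exp(-polynomial(theta, b)))


def two_pl_irf(theta, a, difficulty):
    """Equations 5 and 6: 2PL with m(theta) = a * (theta - difficulty)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - difficulty)))


# With k_i = 0, the coefficients b = (-a * difficulty, a) reproduce the 2PL.
a, difficulty = 1.2, 0.5  # illustrative values only
for theta in (-2.0, 0.0, 0.5, 2.0):
    fmp_value = filtered_polynomial_irf(theta, [-a * difficulty, a])
    assert abs(fmp_value - two_pl_irf(theta, a, difficulty)) < 1e-12
```

The same `filtered_polynomial_irf` function serves for the FUP and FMP models; only the way the coefficient list `b` is produced differs.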

3.2. Monotonicity Constraints

A necessary condition for the polynomial, $m_i(\theta)$, given in Equation 12 to be monotonic is that it be of odd degree, $2k_i + 1$. Here, we shall employ a parameterization of an odd-degree polynomial that ensures that it is monotonic. The key ideas were contained in a single formula that was presented by Ramsay (1977, p. 108) in the context of monotonic transformations to additivity. These were developed in detail by Elphinstone (1983, section 4) in the context of distribution estimation.

A necessary and sufficient condition for $m_i(\theta)$ to be monotonic is that its first derivative be a nonnegative polynomial

$$p_i(\theta) = \frac{d}{d\theta} m_i(\theta) = a_{0i} + a_{1i}\theta + \cdots + a_{2k_i,i}\,\theta^{2k_i} \ge 0 \quad \text{for all } \theta \qquad (15)$$

and must consequently be of even degree, $2k_i$. This polynomial, $p_i(\theta)$, has $2k_i + 1$ coefficients that will be represented by the vector $\mathbf{a}_i' = (a_{0i}, a_{1i}, \ldots, a_{2k_i,i})$. Given the nonnegative polynomial $p_i(\theta)$ in Equation 15, the corresponding monotonic polynomial $m_i(\theta)$ in Equation 12 is obtained from the indefinite integral

$$m_i(\theta) = \xi_i + \int p_i(\theta)\, d\theta, \qquad (16)$$

where $\xi_i$ is the constant of integration. Consequently, the relationships between the coefficients of $m_i(\theta)$ and those of $p_i(\theta)$ are given by

$$b_{0i} = \xi_i \quad \text{and} \quad b_{j,i} = \frac{a_{j-1,i}}{j} \quad \text{for } j = 1, 2, \ldots, 2k_i + 1. \qquad (17)$$

The polynomial $p_i(\theta)$ in Equation 15 needs to be evaluated subject to the requirement that $p_i(\theta) \ge 0$ for all admissible $\theta$. This may be accomplished by using the following reparameterization of $p_i(\theta)$ (Elphinstone, 1983, p. 173):

$$p_i^{+}(\theta) = \begin{cases} \lambda_i \prod_{j=1}^{k_i} \left[ 1 - 2\alpha_{j,i}\theta + (\alpha_{j,i}^2 + \beta_{j,i})\theta^2 \right], & k_i > 0, \\ \lambda_i, & k_i = 0. \end{cases} \qquad (18)$$

The $2k_i + 1$ coefficients $\lambda_i, \alpha_{1,i}, \beta_{1,i}, \ldots, \alpha_{k_i,i}, \beta_{k_i,i}$ of $p_i^{+}(\theta)$ are required to satisfy the $k_i + 1$ inequality constraints

$$\lambda_i \ge 0 \quad \text{and} \quad \beta_{j,i} \ge 0, \quad j = 1, \ldots, k_i. \qquad (19)$$

Then, given the parameter vector

$$\boldsymbol{\gamma}_i' = (\xi_i, \lambda_i, \alpha_{1,i}, \beta_{1,i}, \ldots, \alpha_{k_i,i}, \beta_{k_i,i}) \qquad (20)$$

for $p_i^{+}(\theta)$, the procedure described in Online Appendix A may be used to compute the corresponding parameter vector

$$\mathbf{a}_i' = (a_{0i}, a_{1i}, \ldots, a_{2k_i,i}) \qquad (21)$$

for $p_i(\theta)$ that will ensure that $p_i(\theta) = p_i^{+}(\theta) > 0$. This procedure makes use of a recurrence relation. When $\mathbf{a}_i$ has been obtained, Equation 17 may be used to obtain the parameter vector

$$\mathbf{b}_i' = (b_{0i}, b_{1i}, \ldots, b_{2k_i+1,i}) \qquad (22)$$

that ensures that $m_i(\theta)$ in Equation 12 will be monotonic increasing in $\theta$. Thus, the complicated inequality constraints on $\mathbf{b}_i$ that are required for monotonicity of the polynomial in Equation 12 are imposed by means of a double reparameterization: The parameter vector $\mathbf{b}_i$ is a function (17) of $\mathbf{a}_i$, which in turn is a function of the parameter vector $\boldsymbol{\gamma}_i$ that satisfies the simple linear inequality constraints in Equation 19.
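The double reparameterization can be sketched as follows. The recurrence relation of Online Appendix A is not reproduced here; as a stand-in, this sketch simply multiplies out the quadratic factors of Equation 18 and then applies Equation 17. The function names and the argument layout (separate lists for the $\alpha_j$ and $\beta_j$) are assumptions made for illustration.

```python
def poly_multiply(p, q):
    """Multiply two polynomials given as coefficient lists, lowest degree first."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, p_coef in enumerate(p):
        for j, q_coef in enumerate(q):
            out[i + j] += p_coef * q_coef
    return out


def gamma_to_b(xi, lam, alphas, betas):
    """Map gamma = (xi, lambda, alpha_1, beta_1, ...) to the coefficients b of m(theta).

    Step 1 expands Equation 18 to get the coefficients a of p(theta).
    Step 2 integrates term by term (Equation 17) to get b.
    Requires lam >= 0 and every beta_j >= 0 (Equation 19).
    """
    if lam < 0 or any(beta < 0 for beta in betas):
        raise ValueError("lambda and all beta_j must be nonnegative")
    a = [float(lam)]  # k = 0: p(theta) = lambda
    for alpha, beta in zip(alphas, betas):
        # Each factor 1 - 2*alpha*theta + (alpha^2 + beta)*theta^2 has
        # discriminant -4*beta <= 0, so it is never negative.
        a = poly_multiply(a, [1.0, -2.0 * alpha, alpha * alpha + beta])
    # Equation 17: b_0 = xi and b_j = a_{j-1} / j.
    return [float(xi)] + [a[j - 1] / j for j in range(1, len(a) + 1)]
```

For example, `gamma_to_b(-0.6, 1.2, [], [])` returns the two coefficients of a 2PL-type linear polynomial, while supplying one `(alpha, beta)` pair yields the four coefficients of a monotonic cubic.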

As an alternative to the reparameterization approach employed here, an approach due to Hawkins (1994) for dealing with a monotonic polynomial by applying equality constraints at judiciously chosen values of $\theta$ would be worth investigating.

3.3. Parameter Estimation

A two-stage estimation method based on Ramsay's (1991) procedure will be employed to estimate the item parameters and the abilities. Stage 1 is to obtain surrogate values, $\tilde\theta_s$, $s = 1, \ldots, N$, for the examinees' abilities, $\theta_s$. In Ramsay's procedure, these surrogates are the quantiles of a standard normal distribution based on ranked total test scores. A problem with ranking test scores is that ties can occur very frequently, especially for a short test with many examinees. In Ramsay's TestGraf, ranks are randomly assigned to the tied test scores. To avoid this need for random rank assignment, first principal component scores are used here to assign ranks. Component scores are obtained from the left singular vector corresponding to the largest singular value of the centered data matrix $(\mathbf{Y} - \mathbf{1}\bar{\mathbf{y}}')$. If the sum of elements of the corresponding right singular vector is negative, both the left and right singular vectors are reflected. In addition to eliminating the occurrence of tied ranks, the first principal component scores optimally summarize the data matrix $\mathbf{Y}$ in one dimension. The principal component score ranks are transformed to the quantiles, $q_i$, of a standard normal distribution to yield the $N \times 1$ vector, $\tilde{\boldsymbol{\theta}}$, of ability surrogates. This normalization of surrogate ability scores provides an identification constraint (Ramsay, 1991, p. 614) required for the model.
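Stage 1 can be sketched in plain Python as below. Several choices here are assumptions rather than the authors' implementation: the leading singular vector is found by simple power iteration (a library SVD routine would normally be used), rank $r$ is mapped to the $r/(N+1)$ normal quantile, and any component scores that still coincide (for example, identical response patterns) are ordered by their position in the data.

```python
import math
from statistics import NormalDist


def first_component_scores(Y, n_iter=500):
    """Scores proportional to the left singular vector for the largest singular
    value of the column-centered data matrix, found by power iteration.

    Assumes the centered matrix is not identically zero.
    """
    n_rows, n_cols = len(Y), len(Y[0])
    means = [sum(row[i] for row in Y) / n_rows for i in range(n_cols)]
    C = [[row[i] - means[i] for i in range(n_cols)] for row in Y]
    v = [1.0] * n_cols  # starting guess for the right singular vector
    for _ in range(n_iter):
        u = [sum(c * w for c, w in zip(row, v)) for row in C]  # u = C v
        v = [sum(C[s][i] * u[s] for s in range(n_rows)) for i in range(n_cols)]  # v = C'u
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    u = [sum(c * w for c, w in zip(row, v)) for row in C]
    if sum(v) < 0:  # reflect so that larger scores correspond to more correct answers
        u = [-x for x in u]
    return u


def ability_surrogates(Y):
    """Rank the component scores and convert ranks to standard normal quantiles."""
    scores = first_component_scores(Y)
    n_rows = len(scores)
    order = sorted(range(n_rows), key=lambda s: scores[s])
    theta = [0.0] * n_rows
    for rank, s in enumerate(order, start=1):
        theta[s] = NormalDist().inv_cdf(rank / (n_rows + 1))
    return theta
```

Here `Y` is a list of rows of 0/1 item scores, one row per examinee, and the returned list holds one surrogate per examinee in the same order.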

In the second stage, after the vector, $\tilde{\boldsymbol{\theta}}$, of normalized ability surrogates is available, the conditional maximum likelihood estimates, $\hat{\boldsymbol{\gamma}}_i$, $i = 1, \ldots, n$, of the item parameter vectors, given $\tilde{\boldsymbol{\theta}}$, are obtained. Because of the assumption of local independence, these estimates may be obtained one item at a time by minimizing the scaled negative log-likelihood objective function:

$$F_i = -N^{-1} \ln L(\boldsymbol{\gamma}_i \mid \mathbf{y}_{\cdot i}, \tilde{\boldsymbol{\theta}}) = -N^{-1} \sum_{s=1}^{N} \left\{ y_{si} \ln(P_{si}) + (1 - y_{si}) \ln(1 - P_{si}) \right\}, \qquad (23)$$

where $P_{si} = P_i(\tilde\theta_s)$. (The scaling by $N^{-1}$ in Equation 23 is convenient because it avoids dependence of the magnitude of $F_i$ on sample size.) Computational details are given in Online Appendix B.
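The objective function of Equation 23 is straightforward to evaluate for a given coefficient vector. The sketch below only evaluates $F_i$; the minimization algorithm of Online Appendix B is not reproduced. The function names are hypothetical, and the clipping of probabilities away from 0 and 1 is a numerical safeguard added for this illustration.

```python
import math


def fmp_probability(theta, b):
    """Equation 14 for polynomial coefficients b (lowest degree first)."""
    m = 0.0
    for coef in reversed(b):
        m = m * theta + coef
    return 1.0 / (1.0 + math.exp(-m))


def scaled_negative_loglik(b, item_scores, surrogates):
    """Equation 23: F_i for one item, given 0/1 scores and ability surrogates."""
    total = 0.0
    for y, theta in zip(item_scores, surrogates):
        # Clip to avoid log(0); this guard is an assumption of the sketch.
        p = min(max(fmp_probability(theta, b), 1e-12), 1.0 - 1e-12)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(item_scores)
```

In a full implementation, `b` would be computed from the constrained parameter vector of Equation 20, and a numerical optimizer would minimize `scaled_negative_loglik` over that vector, one item at a time.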

This procedure may be viewed as a modified version of the JML estimation method that is truncated after the first iteration. In the initial stages of our research, a full JML iterative process for jointly estimating the FMP item parameters, $\boldsymbol{\gamma}$, and abilities, $\boldsymbol{\theta}$, by maximum likelihood was tried out on data that were randomly generated according to an FMP model. After obtaining the conditional maximum likelihood estimates $\hat{\boldsymbol{\gamma}} = (\hat{\boldsymbol{\gamma}}_1', \ldots, \hat{\boldsymbol{\gamma}}_n')'$ by minimizing Equation 23, each examinee's ability parameter $\theta_s$, $s = 1, \ldots, N$, was estimated, one at a time, by minimizing the scaled conditional negative log-likelihood objective function

$$-N^{-1} \ln L(\theta_s \mid \mathbf{y}_{s\cdot}, \hat{\boldsymbol{\gamma}}) = -N^{-1} \sum_{i=1}^{n} \left\{ y_{si} \ln(P_{si}) + (1 - y_{si}) \ln(1 - P_{si}) \right\}, \quad s = 1, \ldots, N, \qquad (24)$$

with respect to $\theta_s$. The estimates obtained were then ranked and normalized to replace the surrogates and iterative cycles were continued until convergence.

This procedure was not found satisfactory in the present context and was consequently discarded. During iteration, item parameter estimates often drifted away from the known values. This tendency increased as $k$ was increased. Concurrently, there was a tendency for the ability estimates, $\hat\theta_s$, to drift away from the randomly generated, and therefore known, $\theta_s$, as the cycling procedure continued, whether or not convergence occurred. Thus, the iterated JML estimate of $\boldsymbol{\gamma}$ was less satisfactory than the currently used surrogate-based estimate. Further evidence that this type of iterative algorithm is unsatisfactory will be found in Subsection 4.2.

Rather than regarding the abilities, $\theta_s$, as parameters estimated by minimizing Equation 24, they are therefore regarded here as realizations of a random variable and the EAP approach (cf. Bock & Moustaki, 2007, Subsection 5.3) is used to obtain Bayesian estimates. To be consistent with the normalization of surrogates in the first stage, the standard normal density $\varphi(\theta)$ is employed for $\theta$, so that the expected value of the a posteriori distribution of abilities is given by

$$E(\theta \mid \mathbf{y}_{s\cdot}, \boldsymbol{\gamma}) = \frac{\int_{-\infty}^{\infty} \prod_{i=1}^{n} P_i(\theta)^{y_{si}} \{1 - P_i(\theta)\}^{1 - y_{si}}\, \theta\, \varphi(\theta)\, d\theta}{\int_{-\infty}^{\infty} \prod_{i=1}^{n} P_i(\theta)^{y_{si}} \{1 - P_i(\theta)\}^{1 - y_{si}}\, \varphi(\theta)\, d\theta}. \qquad (25)$$

This expected value is estimated by replacing item parameters by estimates in Equation 25 and approximating the two integrals involved using rectangular quadrature to obtain

$$\hat\theta_s = \hat{E}(\theta \mid \mathbf{y}_{s\cdot}, \hat{\boldsymbol{\gamma}}) = \frac{\sum_{r=1}^{Q} \prod_{i=1}^{n} [P_i(\ddot\theta_r)]^{y_{si}} \{1 - P_i(\ddot\theta_r)\}^{1 - y_{si}}\, \varphi(\ddot\theta_r)\, \ddot\theta_r}{\sum_{r=1}^{Q} \prod_{i=1}^{n} [P_i(\ddot\theta_r)]^{y_{si}} \{1 - P_i(\ddot\theta_r)\}^{1 - y_{si}}\, \varphi(\ddot\theta_r)}, \quad s = 1, \ldots, N, \qquad (26)$$

where $\ddot\theta_r$, $r = 1, \ldots, Q$, are equally spaced points on the closed interval $[-4, 4]$.
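The quadrature formula of Equation 26 translates directly into code. The following is a minimal sketch, assuming that each item is described by its polynomial coefficients as in Equation 14; the number of quadrature points (81) is an arbitrary illustrative choice, since the text does not state $Q$ here, and the function names are hypothetical.

```python
import math


def irf(theta, b):
    """Equation 14 for polynomial coefficients b (lowest degree first)."""
    m = 0.0
    for coef in reversed(b):
        m = m * theta + coef
    return 1.0 / (1.0 + math.exp(-m))


def eap_estimate(response_pattern, item_coefs, n_points=81, lower=-4.0, upper=4.0):
    """Equation 26: posterior mean of theta by rectangular quadrature.

    response_pattern: 0/1 scores of one examinee on the n items.
    item_coefs: one coefficient list per item.
    """
    step = (upper - lower) / (n_points - 1)
    numerator = 0.0
    denominator = 0.0
    for r in range(n_points):
        theta = lower + r * step
        # Standard normal density up to a constant; the constant cancels
        # between numerator and denominator.
        weight = math.exp(-0.5 * theta * theta)
        for y, b in zip(response_pattern, item_coefs):
            p = irf(theta, b)
            weight *= p if y == 1 else 1.0 - p
        numerator += weight * theta
        denominator += weight
    return numerator / denominator
```

Because the formula depends on the items only through $P_i(\theta)$, the same function can score examinees who were not in the calibration sample once item coefficients are available.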

3.4. Choice of the Number of Parameters

In Subsection 3.1, a hypothetical "true" IRF, $\tilde P_i(\theta)$, for item $i$ is specified. Because its functional form is unknown, it cannot be estimated directly but can be approximated arbitrarily closely by an IRF, $P_i(\theta)$, of known functional form (Equation 13) by using a sufficient number of parameters. In this situation, it is not possible to provide a goodness-of-fit test with a null hypothesis involving an algebraic specification for a "true" model, $\tilde P_i(\theta)$. The AIC (Akaike, 1973; Burnham & Anderson, 2004, pp. 266–268) is helpful under these circumstances, however. It may be regarded as an estimate of an expected cross-validation criterion using the Kullback–Leibler measure of the distance between two distributions (De Leeuw, 1992) and is based on information theory (Burnham & Anderson, 2004, section 2, pp. 264–266) rather than on classical statistical inference. The AIC is evaluated for each model in a set of candidate models for item $i$, where each model is obtained by varying the number of parameters in the approximating IRF, $P_i(\theta)$. The candidate model yielding the smallest AIC tentatively suggests the number of parameters to be employed. No statistical test, null hypothesis, or significance level is involved.

The computing procedure described in Online Appendix B produces a sequence of nested FMP models with $k_i = 0, 1, 2, \ldots, k_{\max}$ because the final iterated parameter values for one model are employed in the definition of good starting values for the next model. This sequence of models also provides a convenient candidate set for the AIC. Because the unknown "true" model (11) cannot be included in this candidate set, the aim of the analysis can only be "model approximation" and not "model verification."

The AIC is defined as $\mathrm{AIC} = -2 \ln L + 2q$ (e.g., Burnham & Anderson, 2004, p. 268), where $L$ is the likelihood function and $q$ represents the number of estimable parameters for a model in the candidate set. Only the rank order of models according to the AIC is employed in the selection process. Consequently, all values of the AIC in the candidate set may be multiplied by the same positive constant without affecting any conclusions. The AIC increases without bound as $N$ increases, but this problem may be corrected by multiplying the AIC for each candidate model by $N^{-1}$. The scaled AIC for item $i$ then is

$$\mathrm{AIC}_i = -2 N^{-1} \ln L(\hat{\boldsymbol{\gamma}}_i \mid \mathbf{y}_{\cdot i}, \tilde{\boldsymbol{\theta}}) + \frac{2 q_i}{N} = 2\bar{L} + \frac{2}{N} q_i, \qquad (27)$$

where $\bar{L} = N^{-1} \sum_{s=1}^{N} L_{si}$,

$$L_{si} = L(\hat{\boldsymbol{\gamma}}_i \mid y_{si}, \tilde\theta_s) = -\left\{ y_{si} \ln P_{si} + (1 - y_{si}) \ln(1 - P_{si}) \right\} > 0, \quad s = 1, \ldots, N, \qquad (28)$$

is the contribution of examinee $s$ to the negative log likelihood, $P_{si} = P_i(\tilde\theta_s)$, and

$$q_i = 2k_i + 2 \qquad (29)$$

is the number of parameters. Because $\bar{L}$ is the mean of the identically distributed $L_{si}$, $s = 1, \ldots, N$, its expected value $E(\bar{L})$ remains constant as $N$ increases.

The negative log likelihood, N 1 ln L b γ ijy#i; ~ , is minimized with respectto the parameter vector γ i so that it decreases as the number of parameters, qi,

increases. Given N , therefore, the first term in Equation 27 decreases and the sec-

ond term, (2/ N )qi, increases as qi increases and, as a result, acts as a penalty onAICi. If sample size, N , is very large, however, the effect of an increase of qi on

the penalty (2/ N )qi will be negligibly small and the candidate model with the

largest number of parameters will yield the smallest AICi. Thus, the AIC tends

to favor IRFs with few parameters when samples are small, thereby avoiding

overfitting, and to favor IRFs with many parameters in large samples when over-

fitting is not an issue. Use of the AIC is not intended to provide an estimate of

some correct number of parameters in a population but rather to lead to a model,

possibly with few parameters, that will predict optimally outside the calibration

sample. Examples of the effect of sample size on the AIC in the analysis of covariance structures are given in Browne (2000, Subsection 4.8).

The theoretical justification for the penalty term, $(2/N)q_i$, of AIC involves an assumption that the likelihood function is correctly specified, so that the item parameter estimates in $\hat{\gamma}$ are maximum likelihood. This is not the case in the present situation because item parameter estimates are obtained by regarding the latent ability variables $\theta_s$ as observed quantities, whereas in practice, they are unobservable and replaced by surrogates $\tilde{\theta}_s$. Consequently, the item parameter estimates may only be regarded as some sort of pseudo-maximum likelihood. For this reason, it is best to regard $\mathrm{AIC}_i$ in Equation 27 as a pseudo-AIC, having the same formula as a legitimate AIC but being applied under other assumptions. This pseudo-AIC still has the property of penalizing models with many parameters when the sample size is small, but the value of the penalty may not be optimal.

The BIC proposed by Schwarz (1978) is similar to the AIC but has a different

penalty term. Burnham and Anderson (2004) compare the AIC and BIC and point

out that the BIC is not related to information theory. After scaling by $N^{-1}$, the BIC becomes

$$\mathrm{BIC}_i = 2\bar{\mathcal{L}} + \frac{\ln N}{N} q_i. \tag{30}$$

Again the BIC penalty term $(\ln N / N) q_i$ tends toward zero as $N$ increases. Also, as used here, the BIC is in effect a pseudo-BIC. Characteristics of the AIC and BIC that are shared by the pseudo-AIC and pseudo-BIC are as follows.

Both the AIC and BIC favor a small number of parameters in “small” samples but can favor many parameters in “large” samples. If $N \geq 8$, the number of parameters indicated by the BIC will not be greater than that indicated by the AIC.
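The sample-size threshold of 8 can be checked by comparing the per-parameter penalties of the scaled AIC and BIC. The short derivation below is a sketch that uses only the two penalty terms, $(2/N)q_i$ and $(\ln N / N)q_i$:

```latex
\frac{\ln N}{N}\, q_i \;\ge\; \frac{2}{N}\, q_i
\quad\Longleftrightarrow\quad \ln N \ge 2
\quad\Longleftrightarrow\quad N \ge e^{2} \approx 7.39 .
```

For any integer sample size of 8 or more, each additional parameter is therefore penalized at least as heavily by the BIC as by the AIC, which is why the BIC cannot indicate more parameters than the AIC over the same nested candidate set.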

Neither the AIC nor the BIC is intended to suggest a ‘‘correct’’ model. Rather,

they give an indication of the number of parameters to use in order to give a good

approximation to an unspecified ‘‘true’’ IRF taking sample size into account. In


small samples where coefficient estimates are contaminated by error, fewer poly-

nomial coefficients should be used than in large samples where estimates will be

more accurate (cf. Browne, 2000).
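The selection rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the mean negative log likelihoods are made-up numbers rather than results from the article, and the FMP fitting step itself is not shown.

```python
import math

def scaled_aic(mean_neg_loglik, q, n):
    """Scaled AIC: 2 * Lbar + (2 / N) * q."""
    return 2.0 * mean_neg_loglik + (2.0 / n) * q

def scaled_bic(mean_neg_loglik, q, n):
    """Scaled BIC: 2 * Lbar + (ln N / N) * q."""
    return 2.0 * mean_neg_loglik + (math.log(n) / n) * q

def select_k(mean_neg_logliks, n, criterion):
    """Return the k whose model minimizes the criterion.

    mean_neg_logliks[k] is the mean negative log likelihood of the model
    with index k, which has q = 2k + 2 parameters.
    """
    scores = [criterion(lbar, 2 * k + 2, n)
              for k, lbar in enumerate(mean_neg_logliks)]
    return min(range(len(scores)), key=scores.__getitem__)

# Hypothetical fit values for k = 0..4 (illustrative only, not from the study).
lbar_by_k = [0.5200, 0.5100, 0.5090, 0.5085, 0.5083]

k_aic_small = select_k(lbar_by_k, 300, scaled_aic)
k_bic_small = select_k(lbar_by_k, 300, scaled_bic)
k_aic_large = select_k(lbar_by_k, 100000, scaled_aic)
```

With these made-up values, the AIC picks k = 1 and the BIC picks k = 0 at N = 300, while at N = 100,000 the penalty is negligible and the AIC picks the largest model, k = 4.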

The choice of the number of parameters $q = 2k + 2$ using the AIC or BIC plays a similar role when using the FMP or FUP models as the choice of the bandwidth $h$ in Ramsay’s nonparametric IRF. However, $q$ and $h$ operate in opposite directions. Flexibility of the FMP IRC increases as the positive integer $q$ increases, whereas flexibility of the nonparametric IRC increases as the positive real number $h$ decreases toward zero. The use of $h = 1.1 N^{-0.2}$ for nonparametric IRC smoothing and $k$ for flexibility of the FMP IRC have similar aims. Both are

intended to guard against using overflexible IRCs if sample sizes are small so that

random sampling fluctuations are large and overfitting can occur.
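As a quick numerical check of the bandwidth rule $h = 1.1 N^{-0.2}$, the sketch below evaluates it for the two sample sizes used later in the article. The function name is a hypothetical label chosen here.

```python
def default_bandwidth(n_examinees):
    """Bandwidth rule h = 1.1 * N ** (-0.2)."""
    return 1.1 * n_examinees ** -0.2

h_2000 = default_bandwidth(2000)  # about 0.24
h_379 = default_bandwidth(379)    # about 0.34
```

The bandwidth shrinks slowly with sample size, so larger samples are smoothed less and the resulting curve is allowed to be more flexible.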

4. Numerical Studies

Two simulation studies are reported to illustrate properties of the FMP

approach in comparison with other approaches. These are followed by a numer-

ical illustration of the effect of alternative identification conditions for the FMP

model.

4.1. Simulation Study Design

This section deals with notation and with common aspects of the simulations. FMP_$k$ will represent an FMP model with index $k$ yielding a monotonic polynomial of degree $2k + 1$ in Equation 12 and an IRF with $q = 2k + 2$ parameters in Equation 14. In particular, the IRF for FMP_0 is a reparameterization of the

IRF for the 2PL, so that the two models are equivalent, even if estimation meth-

ods differ.

In both of the simulation studies, 100 random samples were generated. Each

sample consisted of 2,000 examinees’ responses on 20 items. Sets of 20 items

had IRF of the same algebraic form. Population parameter values were chosen

by generating them from specified distributions. Appropriate details will be given in subsections 4.2 and 4.3.

Ability variables, $\theta_s$, generated for Subsections 4.2 and 4.3, were independently distributed according to the normal distribution with mean 0 and variance 1. Given an IRF, $P(\theta)$, and an examinee ability value, $\theta_s$, the examinee’s response, $y_s$, was computed by drawing a random number, $u_s$, from a uniform distribution $U[0, 1]$ and defining the response by $y_s = 1$ when $u_s < P(\theta_s)$ and $y_s = 0$ otherwise.

Performance of the models was evaluated from the following two perspectives: (i) for each item $i$, the closeness of the estimated IRC, $\hat{P}_i(\theta)$, to the chosen population IRC, $P_i(\theta)$, and (ii) closeness of the $N$ estimated abilities, $\hat{\theta}_s$, estimated from Equation 26 to the actual randomly generated abilities, $\theta_s$, employed to produce the data. The root integrated mean square error (RIMSE; Ramsay,



1991, p. 621) was used as a measure of the closeness of the estimated IRC, $\hat{P}_i(\theta)$, to the IRC, $P_i(\theta)$, used for data generation of item $i$, $i = 1, \ldots, n$. This is defined as

$$\mathrm{RIMSE}^{(i)}_{\mathrm{IRC}} = \left[ \frac{\sum_{r=1}^{R} \{\hat{P}_i(\ddot{\theta}_r) - P_i(\ddot{\theta}_r)\}^2 \, \varphi(\ddot{\theta}_r)}{\sum_{r=1}^{R} \varphi(\ddot{\theta}_r)} \right]^{1/2}, \tag{31}$$

where the $\ddot{\theta}_r$, $r = 1, 2, \ldots, R$, are evaluation points that are equally spaced on $[-3.5, 3.5]$, and $\varphi(\cdot)$ represents the density function of a standard normal distribution. Closeness of the estimates, $\hat{\theta}_s$, to the actually generated abilities, $\theta_s$, was evaluated using the root mean square error

$$\mathrm{RMSE}_{\theta} = \left[ \frac{\sum_{s=1}^{N} (\hat{\theta}_s - \theta_s)^2}{N} \right]^{1/2}, \tag{32}$$

where $N$ is the number of examinees.
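The two accuracy measures defined above can be sketched as follows. This is a minimal illustration, not the authors' program: the function names are hypothetical, and the number of evaluation points (71 here) is an assumption because the article does not state the value of R.

```python
import math

def normal_pdf(x):
    """Density of the standard normal distribution."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def rimse_irc(p_hat, p_true, n_points=71, lo=-3.5, hi=3.5):
    """Density-weighted root mean square gap between two IRCs.

    p_hat and p_true are callables mapping an ability value to a probability.
    The grid is equally spaced on [lo, hi]; n_points is an assumed value of R.
    """
    step = (hi - lo) / (n_points - 1)
    grid = [lo + r * step for r in range(n_points)]
    weights = [normal_pdf(t) for t in grid]
    num = sum(w * (p_hat(t) - p_true(t)) ** 2 for t, w in zip(grid, weights))
    return math.sqrt(num / sum(weights))

def rmse_theta(theta_hat, theta):
    """Root mean square error between estimated and generated abilities."""
    n = len(theta)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(theta_hat, theta)) / n)
```

Because of the normal-density weights, discrepancies between curves near the center of the ability range count more heavily than discrepancies in the tails, where few examinees are located.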

Because the rank order of ability estimates is often regarded as more important than their actual values, the Spearman rank correlation coefficient, $r_{\hat{\theta},\theta}$, between the estimates, $\hat{\theta}_s$, and the actually generated ability variables, $\theta_s$, was also used as a measure of equivalence of the estimated and actual ability variables.
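The data-generation step and the rank correlation described in this subsection can be sketched as below. Several things are assumptions made for illustration: the logistic form of the 2PL (Equations 5 and 6 are not reproduced here and may be parameterized differently), the three item parameter pairs, and all function names.

```python
import math
import random

def irf_2pl(theta, a, b):
    """Assumed 2PL form 1 / (1 + exp(-a * (theta - b)))."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def simulate_responses(n_examinees, items, rng):
    """Draw abilities from N(0, 1); score 1 when a uniform draw is below P(theta)."""
    thetas = [rng.gauss(0.0, 1.0) for _ in range(n_examinees)]
    data = [[1 if rng.random() < irf_2pl(t, a, b) else 0 for (a, b) in items]
            for t in thetas]
    return thetas, data

def ranks(values):
    """Ranks 1..n; ties are broken by position, not averaged."""
    order = sorted(range(len(values)), key=values.__getitem__)
    out = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        out[idx] = rank
    return out

def spearman(x, y):
    """Spearman rank correlation as the Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mean = (len(x) + 1) / 2.0
    cov = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    var = sum((a - mean) ** 2 for a in rx)
    return cov / var

# Hypothetical (a, b) pairs; the study drew 20 such pairs at random.
items = [(1.5, -1.0), (1.2, 0.0), (1.7, 1.0)]
thetas, data = simulate_responses(2000, items, random.Random(0))
```

The simple tie handling in `ranks` is adequate for continuous ability values; averaged ranks would be needed if many tied values were expected.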

4.2. Simulation Study 1

Comparison of FMP_0 With MML and JML for the 2PL Model

The 2PL and FMP_0 IRCs are equivalent. This section compares our FMP_0

estimates of this IRC with two alternative estimates: the MML estimates

obtained using MULTILOG (Thissen, Chen, & Bock, 2003) and the JML esti-

mates obtained using the TESTAT module of the SYSTAT software package

(Version 10.2). MML estimates were chosen as a gold standard for comparison

with FMP_0 estimates because they appear to be the most widely employed. As

pointed out in Subsection 3.3, the FMP_0 estimation procedure may be regarded

as a JML estimation procedure truncated after the first iteration. Although JML

estimates are often regarded less favorably than MML estimates, JML estimates

provided by an independently written commercial program were also included to

demonstrate that the FMP_0 estimates, although related, do not have the same

suboptimal performance as the JML estimates.

Data for this simulation study were generated using the parameterization of

the 2PL IRF defined by Equations 5 and 6. Population parameter values for each

of the 20 items were chosen randomly. Discrimination parameters, $a_j$, $j = 1, \ldots, 20$, were drawn from a uniform distribution, $a \sim U[1.1, 1.8]$, and the difficulty parameters, $b_j$, from a normal distribution, $b \sim N(0, 1)$, truncated at $-2.5$ and $+2.5$.

In Figure 1, three different estimates of the same IRC are plotted for 4 selected

items in one of the samples. The FMP_0 estimate of the IRC was obtained using

our FMP computer program, the MML estimate of the IRC with MULTILOG

Version 7 (Thissen et al., 2003), and the JML estimate of the IRC with the TES-

TAT module of the SYSTAT (Version 10.2) software package. For each item, the

population IRC is also shown. In general, the MML estimated curve almost coin-

cides with the population curve, and the FMP_0 estimated curve is very slightly

further away. This suggests that the FMP_0 surrogate-based IRF estimates ($k = 0$) are almost as good as the MML estimates. In all four diagrams in Figure 1, the

JML estimated curve is clearly further away from the population curve than is the

FMP_0 estimated curve. This indicates superiority of the FMP_0 item parameter

estimates over the JML estimates. In view of the fact that the FMP procedure is

a JML algorithm terminated after the first iteration, it appears that the further itera-

tion is harmful rather than helpful. This finding is concordant with comments in

Subsection 3.3 and is not surprising because of difficulties associated with maxi-

mum likelihood estimation when the number of parameters increases as the num-

ber of examinees increases (Neyman & Scott, 1948).

[Figure 1 image: four panels (Items 1 to 4), each plotting the true, FMP_0, MML, and JML curves. Horizontal axis: ability, θ (-4 to 4); vertical axis: probability (0 to 1).]

FIGURE 1. Comparisons of estimated IRCs among FMP, MML, and JML. IRC = item response curve; FMP = filtered monotonic polynomial; JML = joint maximum likelihood.

Plots of pairs of measures of closeness of estimated abilities, $\hat{\theta}_s$, $s = 1, \ldots, 100$, to the actual randomly generated abilities, $\theta_s$, are given in Figure 2. In this figure, the left plot compares FMP_0 with MML and JML in terms of $\mathrm{RMSE}_{\theta}$ and the right plot compares the rank correlations. MML estimates have very slightly smaller (better) $\mathrm{RMSE}_{\theta}$’s than the FMP_0 estimates obtained using Equation 26. The rank correlations from FMP_0 estimates and from MML estimates are very close. The FMP_0 ability estimates produce smaller (better) $\mathrm{RMSE}_{\theta}$ values and higher (better) rank

correlation values than the JML estimates. This finding is in agreement with

the discussion in Subsection 3.3.

Means and standard deviations (in parentheses) of accuracy measures over the

100 generated samples are shown in Table 1. As an overall measure of accuracy

of estimated IRCs, the average $\mathrm{RIMSE}_{\mathrm{IRC}}$ (see Equation 31)

$$\overline{\mathrm{RIMSE}}_{\mathrm{IRC}} = \frac{1}{20} \sum_{i=1}^{20} \mathrm{RIMSE}^{(i)}_{\mathrm{IRC}} \tag{33}$$

was used. Mean accuracy measures, $\overline{\mathrm{RIMSE}}_{\mathrm{IRC}}$, are shown in the first row. The $\overline{\mathrm{RIMSE}}_{\mathrm{IRC}}$ measures for FMP_0 and MML are quite close, the measure for MML being smaller (better), as can be expected. The $\overline{\mathrm{RIMSE}}_{\mathrm{IRC}}$ measure for JML is clearly inferior to (higher than) those of FMP_0 and MML. This observation is concordant with the trends visible in Figure 1.

The second and third rows give the mean $\mathrm{RMSE}_{\theta}$ and rank correlation measures of accuracy of the ability variable estimates, $\hat{\theta}$, provided by the three estimation procedures. Accuracy as measured by mean $\mathrm{RMSE}_{\theta}$ is essentially the same for FMP_0 and MML and is somewhat inferior for JML. The mean rank correlation is essentially the same for the three methods.

[Figure 2 image: two scatterplots with separate markers for MML and JML. Left panel: FMP_0 RMSE for abilities (horizontal axis, 0.36 to 0.42) against MML or JML (vertical axis). Right panel: FMP_0 rank correlations for abilities (horizontal axis, 0.91 to 0.94) against MML or JML (vertical axis).]

FIGURE 2. Comparisons of $\mathrm{RMSE}_{\theta}$’s for FMP_0, MML, and JML. RMSE = root mean square error; FMP = filtered monotonic polynomial; JML = joint maximum likelihood.

The overall impression given by this simulation study is that the FMP_0 estimates are nearly as accurate as MML estimates and are clearly more accurate than JML despite the fact that FMP_0 may be regarded as JML truncated after one iteration (also see subsection 3.3).

4.3. Simulation Study 2

This simulation study represents the type of situation for which the FMP

model is intended (see subsection 3.1). The true IRF, $\tilde{P}_i(\theta)$, is unknown to the user and is approximated by the FMP in Equation 13.

In this simulation study, the true IRF was chosen to be the cdf of a mixture of

two normal distributions:

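$$\tilde{P}(\theta \mid \pi, \mu_1, \sigma_1, \mu_2, \sigma_2) = \pi \, \Phi(\theta \mid \mu_1, \sigma_1) + (1 - \pi) \, \Phi(\theta \mid \mu_2, \sigma_2) \tag{34}$$

where $\pi$ is the selection probability and $\Phi(\theta \mid \mu, \sigma)$ represents the cdf of a normal distribution with mean $\mu$ and variance $\sigma$. Values of these parameters for each of the $n = 20$ items were generated randomly using the distributions: $\pi \sim U[0.3, 0.7]$, $\mu_1 \sim N(-1.5, 0.1)$, $\sigma_1 \sim N(1, 0.1)$, $\mu_2 \sim N(1.0, 0.1)$, and $\sigma_2 \sim N(0.4, 0.1)$.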

The Ramsay TestGraf model, with the default bandwidth $h = 1.1 \times 2000^{-0.2} = 0.24$, and FMP_$k$ models with $k = 0, \ldots, 4$ were fitted to 100 random samples with $N = 2{,}000$ and $n = 20$. Thus, the simplest FMP model was FMP_0 (2PL) with 2 parameters and the most complex FMP_4 with 10 parameters, while the “true” model, treated as unknown in all analyses, had 5 parameters.
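A five-parameter IRF of the mixture type used as the “true” model in this study can be sketched as below. This is an illustration under assumptions: the function and argument names are hypothetical, the scale argument of the normal cdf is treated here as a standard deviation (the text calls it a variance), and the example values are illustrative rather than the values drawn in the study.

```python
import math

def normal_cdf(x, mu, sigma):
    """Normal cdf with mean mu and standard deviation sigma (assumed scale)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def mixture_irf(theta, weight, mu1, sigma1, mu2, sigma2):
    """Two-component mixture cdf:
    weight * Phi(theta | mu1, sigma1) + (1 - weight) * Phi(theta | mu2, sigma2).
    """
    return (weight * normal_cdf(theta, mu1, sigma1)
            + (1.0 - weight) * normal_cdf(theta, mu2, sigma2))

# Illustrative parameter values only.
grid = [-3.0 + 0.25 * r for r in range(25)]
curve = [mixture_irf(t, 0.5, -1.5, 1.0, 1.0, 0.4) for t in grid]
```

Because each component is a cdf and the weights are nonnegative, the resulting curve is monotonic increasing and bounded by 0 and 1, but it need not have the symmetric S-shape of a 2PL curve.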

Figure 3 shows IRCs for 4 of the items estimated from one of the samples. Values for $k_{\mathrm{AIC}}$ ($k$ suggested by AIC) are shown in the lower right-hand corner. All $k_{\mathrm{AIC}}$ turned out to be equal to 1, not only in the 4 selected items but also in the remaining 16 items. The true curve, TestGraf curve, and the FMP curves for $k_{\mathrm{AIC}}$ are shown. It is difficult to compare the TestGraf and FMP curves because they use different quantities ($h$ and $k$) to control smoothness. In this study, however, TestGraf tends to fit the straight lower part of the true curve better and FMP the sharp curve in the upper half.

TABLE 1.
Means and Standard Deviations of Accuracy Measures.

Measure               FMP_0           MML             JML
RIMSE_IRC (→ 0)       0.024 (0.001)   0.014 (0.001)   0.076 (0.001)
RMSE_θ (→ 0)          0.382 (0.011)   0.379 (0.011)   0.403 (0.011)
Rank Corr(θ) (→ 1)    0.928 (0.006)   0.928 (0.006)   0.924 (0.006)

Note. FMP = filtered monotonic polynomial; MML = maximum marginal likelihood; JML = joint maximum likelihood; RIMSE = root integrated mean square error; RMSE = root mean square error.

Figure 4 compares the $\mathrm{RMSE}_{\theta}$ fit measures for estimated ability $\hat{\theta}$ between TestGraf and FMP_$k_{\mathrm{AIC}}$. The top figure shows that the $\mathrm{RMSE}_{\theta}$’s of FMP_$k_{\mathrm{AIC}}$ are better (closer to zero) than those of TestGraf in this simulation study. In the bottom figure, the rank correlations for FMP_$k_{\mathrm{AIC}}$ are again better (closer to 1) than those for TestGraf. It should be borne in mind, however, that the default choice of normalized total test scores for ability estimates was used in TestGraf. The alternative iterative facility requires a user intervention at each iteration and therefore is not practical in simulation experiments. A possible explanation for the poorer results of TestGraf is that sample size is $N = 2{,}000$ and number of items is $n = 20$, which would lead to many ties in the total scores used to provide ranks for the normalization process. These ties are resolved in TestGraf by generating random orderings within ties (cf. subsection 2.2).

[Figure 3 image: four panels (Items 13 to 16), each plotting the True, TestGraf, and FMP_k1_AIC curves. Horizontal axis: ability, θ (-2 to 2); vertical axis: probability (0 to 1).]

FIGURE 3. Comparisons of the estimated IRCs ($N = 2{,}000$). IRC = item response curve.

Table 2 summarizes accuracy measures for TestGraf, FMP_$k_{\mathrm{AIC}}$, and FMP_$k_{\mathrm{BIC}}$. Entries are means and standard deviations (in parentheses) calculated over items and random samples (i.e., $20 \times 100 = 2{,}000$ observations). The first row shows that the three estimation methods yielded essentially the same $\overline{\mathrm{RIMSE}}_{\mathrm{IRC}}$, so that there was little to choose in overall accuracy of the three methods for approximating the chosen true IRCs. When abilities, $\theta$, are estimated, the situation changes. It can be seen from row 2 that the mean $\mathrm{RMSE}_{\theta}$ was essentially the same for FMP_$k_{\mathrm{AIC}}$ and FMP_$k_{\mathrm{BIC}}$, but these were noticeably better (smaller) than that for TestGraf. Again row 3 shows that the FMP_$k_{\mathrm{AIC}}$ and FMP_$k_{\mathrm{BIC}}$ yielded essentially the same Rank_Corr($\theta$), which was noticeably better (larger) than that for TestGraf.

To evaluate how the FMP model performs with a smaller sample size, the

FMP IRF was also fitted to the first 300 of the 2,000 simulated examinees for

each of the 100 simulated samples. Table 3 summarizes the same information

as in Table 2, but with a sample size of $N = 300$ instead of $N = 2{,}000$.

[Figure 4 image: two scatterplots. Top panel: RMSE for abilities, AIC-selected model (horizontal axis, 0.45 to 0.75) against TestGraf (vertical axis). Bottom panel: rank correlation for abilities, AIC-selected model (horizontal axis, 0.70 to 0.85) against TestGraf (vertical axis).]

FIGURE 4. Comparisons between TestGraf and FMP_$k_{\mathrm{AIC}}$ of accuracy measures of $\hat{\theta}$. FMP = filtered monotonic polynomial.

As can be expected, the IRC fit measures, $\overline{\mathrm{RIMSE}}_{\mathrm{IRC}}$, in Tables 2 and 3 indicate less accuracy of estimates when sample size drops from 2,000 to 300. On the other hand, interpretation of IRCs is hardly affected by the reduction of sample size. Figure 5 shows IRCs based on $N = 300$ for the same 4 items plotted in Figure 3 for $N = 2{,}000$. Comparison of Figures 5 and 3 suggests that conclusions drawn from IRCs based on $N = 300$ do not differ much from those drawn from the corresponding IRCs based on $N = 2{,}000$.

Differences in ability fit measures, $\mathrm{RMSE}_{\theta}$ and Rank_Corr($\theta$), between Tables 2 and 3 seem sufficiently small to be disregarded.

4.4. A Numerical Experiment to Investigate the Assumption of a Normal Distribution for $\theta$

As pointed out by Ramsay (1991, p. 614, equation 6), a change of distribution for $\theta$ does not affect model fit, provided that it is accompanied by an appropriate change in the IRF. Thus, the assumption of a normal distribution for $\theta$ is an identification condition for the data generation process when the IRFs are unconstrained. The distribution of $\theta$ cannot be estimated unless constraints are imposed on the functional form of the IRFs (cf. Woods & Thissen, 2006). It is not possible to simultaneously estimate the density of $\theta$ and the item IRFs.

The FMP methodology proposed here for estimating an IRF specifies a $N(0, 1)$ distribution of $\theta$ for identification purposes and uses normalized surrogate abilities (see subsection 3.3). Furthermore, when the EAP procedure (Equation 26) for obtaining ability estimates, $\hat{\theta}$, is employed, a $N(0, 1)$ is assumed again as the prior distribution for $\theta$.

TABLE 2.
Means and Standard Deviations of RMSEs for TestGraf and FMP.

Measure               TestGraf        FMP_k_AIC       FMP_k_BIC
RIMSE_IRC (→ 0)       0.041 (0.003)   0.042 (0.003)   0.042 (0.004)
RMSE_θ (→ 0)          0.707 (0.048)   0.481 (0.012)   0.482 (0.012)
Rank Corr(θ) (→ 1)    0.769 (0.037)   0.834 (0.012)   0.835 (0.012)

Note. N = 2,000. FMP = filtered monotonic polynomial; RIMSE = root integrated mean square error; RMSE = root mean square error.

TABLE 3.
Means and Standard Deviations of RMSEs for TestGraf and FMP.

Measure               TestGraf        FMP_k_AIC       FMP_k_BIC
RIMSE_IRC (→ 0)       0.069 (0.008)   0.064 (0.009)   0.075 (0.007)
RMSE_θ (→ 0)          0.695 (0.076)   0.492 (0.021)   0.492 (0.021)
Rank Corr(θ) (→ 1)    0.763 (0.050)   0.828 (0.027)   0.829 (0.027)

Note. N = 300. FMP = filtered monotonic polynomial; RIMSE = root integrated mean square error; RMSE = root mean square error.

We shall demonstrate by means of a numerical example that if

i. in an artificially constructed population, the generation distribution used for $\theta$ is nonnormal (e.g., bimodal) and simultaneously all IRFs are generated as 2PL, and

ii. the identification condition that $\theta$ is normal is used in the FMP estimation procedure by normalizing the surrogates, then

iii. the resulting unconstrained estimates of the item IRFs are not 2PL.

This result is stated at the population level but is investigated here using two

finite, but very large ($N = 100{,}000$) data sets, regarded as finite pseudo-populations. These are used to demonstrate the effect of changing the distribution chosen for $\theta$ without changing the IRCs for $n = 20$ items. In one data set, referred to as DS-B, the distribution used for generating $\theta$ is chosen to be symmetric and

strongly bimodal with a mean of 0 and a standard deviation of 1. This bimodal

distribution is generated by the mixture of a N ð2; 51=2Þ and an independent N ð2; 51=2Þ with a probability of .5 for selecting each component distribution.For the other data set, referred to as DS-N, the distribution used for generating y

is $N(0, 1)$. Item scores, $y_s$, for the two data sets are generated as described in subsection 4.1 using a 2PL (FMP_0) IRF defined by Equations 5 and 6 for each item.

[Figure 5 image: four panels (Items 13 to 16), each plotting the True, TestGraf, and AIC-selected FMP curves (legend FMP_k1_AIC in three panels, FMP_k2_AIC in one). Horizontal axis: ability, θ (-2 to 2); vertical axis: probability (0 to 1).]

FIGURE 5. Comparisons of the estimated IRCs ($N = 300$). IRC = item response curve.

Item parameter values are equal across the two data sets for each of the 20

items. The only difference in the generation process for the two data sets is that

bimodal $\theta$’s are used for DS-B and normal $\theta$’s for DS-N. Superimposed kernel-smoothed density functions for $\theta$ in the two data sets are shown in Figure 6.

Both data sets are then analyzed in the same way using the FMP method described in Subsection 3.3. For item parameter estimation in both DS-B and DS-N, $k_i = 2$ is chosen for all items to yield equally flexible IRCs in the two data sets. Thus, the $\theta$ distribution identification condition used when generating DS-N matches the $\theta$ distribution identification condition made in its analysis. However, the $\theta$ distribution identification condition used when generating DS-B conflicts with the $\theta$ distribution identification condition made in its analysis. Because the two data sets employ the same items and are analyzed in exactly the same manner, any differences in estimated IRFs can be attributed to the conflict of identification conditions for the distribution of $\theta$ in the analysis of DS-B.

Superimposed IRCs obtained from the two data sets are shown for 4 of the items in Figure 7. In all four figures, the estimated B-IRC does not coincide with the estimated N-IRC although the same IRF was used at the generation stage. This is due to the conflict in identification conditions on the distribution of $\theta$ in DS-B at the generation stage with those at the estimation stage. There is no such conflict in DS-N. The distortion of B-IRC in the two figures in the first row occurs with items of medium difficulty and is hardly noticeable. In the second row, the distortion is more visible and occurs with items of high and of low difficulty. Thus, the difference in identification conditions on $\theta$ employed at the generation and estimation stages can affect different types of items in different ways.

[Figure 6 image: superimposed density curves for the bimodal θ and normal θ data sets. Horizontal axis: ability, θ (-4 to 4); vertical axis: density (0 to 0.5).]

FIGURE 6. Superimposed normal and bimodal densities for $\theta$.

Without knowledge of the generation process, the two B-IRCs in the second row could easily be misinterpreted as indicating that a 2PL IRF is inappropriate for the data. There is, however, a way of detecting (without prior knowledge) a difference between the $\theta$ distribution at the generation stage and the known assumption of a normal distribution at the estimation stage. This is to obtain EAP estimates, $\hat{\theta}_s$, of the abilities using Equation 26 and estimate their density using kernel smoothing. Figure 8 shows kernel-smoothed plots of densities for these ability estimates obtained from data sets N and B. The density of $\hat{\theta}$ from DS-B is clearly bimodal, although not as noticeably as that in Figure 6, and the density of $\hat{\theta}$ from DS-N is essentially normal.

In summary, Figures 7 and 8 indicate that a conflict of distribution assumptions affects both the estimated IRFs and the distribution of ability estimates.

5. An Example Using Actual Data

In the previous section, the FMP approach was shown to be useful by means of

simulation studies. Here, FMP and FUP models will be applied to an actual data

set that is included with the TestGraf distribution (Ramsay, 2000). The FMP and

FUP results will be compared with those from the TestGraf program that does not

impose monotonicity requirements on the estimated IRCs. The data set consists of 379 students’ responses to an examination with 100 four-option multiple-choice questions which was given for Psychology 101, an introductory psychology course.

[Figure 7 image: four panels, each plotting the estimated IRC for the bimodal θ and normal θ data sets, with $k = 2$. Horizontal axis: ability, θ (-4 to 4); vertical axis: probability (0 to 1).]

FIGURE 7. Examples of estimated IRCs when density of $\theta$ is either bimodal or normal. IRC = item response curve.

The data in the original file were recoded dichotomously with a “1” for a

correct response and a “0” otherwise. Missing responses were treated as incorrect responses. To decide on the degree of the polynomial, models were fitted sequentially with $k = 0, 1, \ldots, 4$ yielding corresponding polynomials of degree 1, 3, . . . , 9 and the optimal values of $k$ suggested by both the AIC and BIC were recorded. This was done independently for the FMP and FUP. The default value $h = 1.1 \times 379^{-0.2} = 0.34$ of the bandwidth was employed for TestGraf.

FMP, FUP, and TestGraf IRFs may all be regarded as different regressions with the probability of passing an item as dependent variable on the surrogate ability score, $\tilde{\theta}$, as independent variable. In order to provide a graphical representation of the relationship between data and the IRCs, reference points were plotted on the same graph as the IRCs. To obtain these points, a truncated ability range of $[-3, 3]$ was first divided into 12 intervals of length .5. Corresponding to each interval, a single reference point $(\theta, p)$ was obtained with $\theta$ equal to the midpoint of the interval and $p$ equal to the proportion of examinees with surrogate abilities, $\hat{\theta}_s$, in the interval who correctly answered the item. If any interval was empty, the corresponding reference point was omitted. These reference points are valid for the FMP and FUP IRCs for all $k$ because the same surrogate abilities are used. For convenience, these reference points could also be used for the IRC from TestGraf that uses different surrogate values.
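The construction of the reference points can be sketched as follows. This is a minimal illustration, not the authors' code: the function name is hypothetical, and the treatment of interval endpoints (half-open intervals, with abilities outside the truncated range dropped) is an assumption the article does not spell out.

```python
def reference_points(abilities, responses, lo=-3.0, hi=3.0, width=0.5):
    """Return (midpoint, proportion correct) for each non-empty ability interval.

    abilities: surrogate ability values, one per examinee.
    responses: 0/1 item scores in the same order.
    """
    n_bins = int(round((hi - lo) / width))
    counts = [0] * n_bins
    correct = [0] * n_bins
    for theta, y in zip(abilities, responses):
        if lo <= theta < hi:  # assumed: half-open range, outliers dropped
            b = int((theta - lo) // width)
            counts[b] += 1
            correct[b] += y
    # Empty intervals are omitted, as described in the text.
    return [(lo + (b + 0.5) * width, correct[b] / counts[b])
            for b in range(n_bins) if counts[b] > 0]

# Tiny made-up example: two examinees in the lowest interval, three near zero.
points = reference_points([-2.9, -2.8, 0.1, 0.2, 0.3, 5.0], [0, 1, 1, 1, 0, 1])
```

With the default settings the range is split into 12 intervals, and the made-up example yields two points, one at midpoint -2.75 and one at midpoint 0.25.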

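[Figure 8 image: superimposed kernel-smoothed densities of the EAP estimates for the bimodal θ and normal θ data sets, with $k = 2$. Horizontal axis: EAP estimate of θ (-4 to 4); vertical axis: density (0 to 0.5).]

FIGURE 8. Superimposed densities of estimates $\hat{\theta}$ from DS-B and DS-N.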

IRC plots for some selected non-2PL items from the introductory psychology test are shown in Figure 9. For each item, three IRCs are plotted: (i) TestGraf, (ii) FMP_$k_{\mathrm{AIC}}$, where $k_{\mathrm{AIC}}$ yields the lowest AIC for FMP, and (iii) FUP_$k_{\mathrm{AIC}}$, where $k_{\mathrm{AIC}}$ yields the lowest AIC for FUP. (In the legend, FMP-k1 stands for FMP_$k_{\mathrm{AIC}} = 1$ and so on.) For each item, the reference points are represented by small circles.

The following observations may be made from Figure 9. The unconstrained FUP_$k_{\mathrm{AIC}}$ and TestGraf curves tend to be similar, but the FUP curves tend to undulate more smoothly and the TestGraf curves to wiggle more. It is difficult to say whether or not this difference is due to inherent properties of the two fitting methods or to the different criteria, $k$ and $h$, for controlling flexibility in FUP and TestGraf (cf. Items 69 and 96). Also it is of interest to inspect closeness of monotonic FMP_$k_{\mathrm{AIC}}$ curves to nonmonotonic FUP_$k_{\mathrm{AIC}}$ and TestGraf curves. Note that Item 96 appears to be a problematic item. The IRCs from both TestGraf and FUP show that the probability of correctly answering this item decreases as ability increases. With the constraint of monotonicity, the FMP IRC comes out as a flat line.

[Figure 9 image: nine panels (Items 3, 5, 13, 22, 24, 39, 69, 74, and 96), each plotting the TestGraf curve, the AIC-selected FMP curve (legend FMP-k0 or FMP-k1), and the AIC-selected FUP curve (legend FUP-k1 or FUP-k2). Horizontal axis: ability (-3 to 3); vertical axis: probability (0 to 1).]

FIGURE 9. Estimated IRCs for Psychology 101 data. IRC = item response curve.

Figure 10 provides plots of the estimates $\hat{\theta}$ of abilities from TestGraf against those from FMP_$k_{\mathrm{AIC}}$ and FMP_$k_{\mathrm{BIC}}$. The two plots are similar and in both cases, estimated abilities from TestGraf are slightly lower than those from FMP at low values and slightly higher at high values. In both cases, the TestGraf $\theta$ estimates are close to the FMP $\theta$ estimates.

6. Summary and Conclusions

General filtered polynomial (FMP/FUP) approaches for constructing a flex-

ible IRF have been developed. The model is quasi-parametric because the para-

meters involved are not intended for interpretation. Their main function is to

define a flexible IRF that simultaneously (i) produces graphical displays of

deviations from the usually assumed S-shape and (ii) is easily portable to future

examinees not present in the calibration sample. Although the usual property of

monotonicity of an IRF is imposed in FMP, the monotonicity constraints are dis-

carded in FUP to provide a filtered unconstrained polynomial IRC that need not

be monotonic but is still bounded by 0 and 1.

The IRCs developed are intended for visual inspection to obtain diagnostic

information about deviant items. This will be helpful for detecting unsatisfactory

items when constructing ability tests. Another potential application will be in the

analysis of psychopathology scales (Meijer & Baneke, 2004) where the usual

assumptions made for ability tests are no longer applicable. Furthermore, the

FUP facility will be useful for providing option response curves for incorrect

options in multioption tests.

FIGURE 10. Comparison of $\hat{\theta}$’s from TestGraf and FMP (Psychology 101 data).

Computational procedures have been developed for estimation purposes and a computer program, FMP, has been written in FORTRAN 90 (see Note 2). Monotonicity constraints are imposed by means of a reparameterization. This methodology has been tried out in two simulation studies and on an actual example and found to compare favorably with existing methods. In Simulation Study 1 where the true IRC was a 2PL (or, equivalently, FMP_0), the FMP IRCs were very close to those from

the gold standard MML and were clearly superior to those from JML. This is

reassuring because the FMP_0 algorithm may be regarded as a first iteration

of JML and difficulties with JML are recognized. In Simulation Study 2, where

a nonstandard IRF was used for the generating model, the FMP approach yielded

as good an approximation to the actual generating IRF as the well-known non-

parametric method, implemented in the program TestGraf, and clearly more

accurate estimates of the abilities $\theta_s$. In the actual example, the current approach

compares favorably with TestGraf but has the additional advantages of being

able to produce either monotonic increasing or nonmonotonic IRCs as well as

easily portable IRFs. Although the current article has concentrated on the use

of a logistic filter, the theory presented can easily be adapted to the use of other

filters such as those derived from normal, beta, or gamma ogives.

Acknowledgments

The authors are grateful to Michael Edwards, Steven MacEachern, the editor, and the

reviewers for their thought provoking comments and helpful suggestions.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research,

authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported in part by NSF grant

SES-0437251. It was carried out in partial fulfillment of the requirements for the first

author’s PhD degree in quantitative psychology at the Ohio State University with the

second author as advisor.

Notes

1. Hayley (1952) suggested multiplication of the logit by D = 1.702 to approximate the normal ogive.

2. The program, FMP, is being prepared for distribution on the Internet. Please

address all inquiries to the first author.
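The scaling constant in Note 1 can be checked numerically. The sketch below, which uses only the Python standard library, compares the standard normal ogive with a logistic curve whose argument is multiplied by D; the function names and the grid over [-6, 6] are illustrative choices, not part of the original article.

```python
import math

def normal_ogive(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def logistic(x):
    """Standard logistic cumulative distribution function."""
    return 1.0 / (1.0 + math.exp(-x))

def max_gap(d, lo=-6.0, hi=6.0, steps=1201):
    """Largest absolute difference between Phi(x) and logistic(d * x) on a grid."""
    grid = (lo + (hi - lo) * i / (steps - 1) for i in range(steps))
    return max(abs(normal_ogive(x) - logistic(d * x)) for x in grid)

gap_scaled = max_gap(1.702)   # logit multiplied by D = 1.702
gap_unscaled = max_gap(1.0)   # unscaled logit, for comparison
```

On this grid `gap_scaled` stays below 0.01, whereas the unscaled logistic differs from the normal ogive by more than 0.1 at some points.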

Supplementary Material

The online appendices are available at http://jeb.sagepub.com/supplemental.

References

Akaike, H. (1973). Information theory and an extension of the maximum likelihood

principle. In B. N. Petrov & F. Csaki (Eds.), Second international symposium on information theory (pp. 267–281). Budapest, Hungary: Akademiai Kiado.

Baker, F. B. (1992). Item response theory: Parameter estimation techniques. New York,

NY: Marcel Dekker.



Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee’s

ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores

(pp. 399–402). Reading, MA: Addison-Wesley.

Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item

parameters: Application of an EM algorithm. Psychometrika, 46, 443–459.

Bock, R. D., & Lieberman, M. (1970). Fitting a response model for n dichotomously

scored items. Psychometrika, 35, 179–197.

Bock, R. D., & Moustaki, I. (2007). Item response theory in a general framework. In C. R.

Rao & S. Sinharay (Eds.), Handbook of statistics, volume 26: Psychometrics (pp.

469–514). Amsterdam, The Netherlands: North-Holland.

Browne, M. W. (2000). Cross-validation methods. Journal of Mathematical Psychology,

44, 108–132.

Burnham, K. P., & Anderson, D. R. (2004). Multimodel inference: Understanding AIC

and BIC in model selection. Sociological Methods and Research, 33, 261–304.

Cudeck, R., & Henly, S. J. (1991). Model selection in covariance structures analysis

and the “problem” of sample size: A clarification. Psychological Bulletin, 109,

512–519.

De Leeuw, J. (1992). Introduction to Akaike (1973) information theory and an extension

of the maximum likelihood principle. In S. Kotz & N. L. Johnson (Eds.),

Breakthroughs in statistics (Vol. 1, pp. 599–609). London, England: Springer-Verlag.

Drasgow, F., Levine, M. V., Williams, B., McLaughlin, M. E., & Candell, G. L. (1989).

Modeling incorrect responses to multiple-choice items with multilinear formula score

theory. Applied Psychological Measurement, 13, 285–299.

Duncan, K. A., & MacEachern, S. N. (2008). Nonparametric Bayesian modeling for item

response. Statistical Modeling, 8, 41–66.

Duncan, K. A., & MacEachern, S. N. (2013). Nonparametric Bayesian modeling for item

response with a three parameter logistic prior mean. In M. C. Edwards & R. C.

MacCallum (Eds.), Current topics in the theory and application of latent variable

methods. New York, NY: Routledge.

Elphinstone, C. D. (1983). A target distribution model for nonparametric density

estimation. Communications in Statistics - Theory and Methods, 12, 161–198.

Elphinstone, C. D. (1985). A method of distribution and density estimation (Unpublished

dissertation). University of South Africa, Pretoria, South Africa.

Hayley, D. C. (1952). Estimation of the dosage mortality relationship when the dose is subject to error (Technical Report No. 15). Stanford, CA: Stanford University, Applied

Mathematics and Statistics Laboratory.

Hawkins, D. M. (1994). Fitting monotonic polynomials to data. Computational Statistics,

9, 233–247.

Lee, Y.-S. (2007). A comparison of methods for nonparametric estimation of item

characteristic curves for binary items. Applied Psychological Measurement, 31, 121–134.

Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading,

MA: Addison-Wesley.

Meijer, R. R., & Baneke, J. J. (2004). Analyzing psychopathology items: A case for

nonparametric item response theory modeling. Psychological Methods, 9, 354–368.

Neyman, J., & Scott, E. L. (1948). Consistent estimates based on partially consistent

observations. Econometrica, 16, 1–32.


Ramsay, J. O. (1977). Monotonic weighted power transformations to additivity.

Psychometrika, 42, 83–109.

Ramsay, J. O. (1991). Kernel smoothing approaches to nonparametric item characteristic

curve estimation. Psychometrika, 56, 611–630.

Ramsay, J. O. (2000). TestGraf: A program for the graphical analysis of multiple choice test and questionnaire data [Computer program and manual]. Retrieved from

http://www.psych.mcgill.ca/faculty/ramsay/ramsay.html

Ramsay, J. O., & Winsberg, S. (1991). Maximum marginal likelihood estimation for

semiparametric item analysis. Psychometrika, 56, 365–379.

Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6,

461–464.

Sinnott, L. T. (1997). Filtered polynomial density approximations and their application to

discriminant analysis (MS Thesis). The Ohio State University, Columbus, OH.

Thissen, D., Chen, W.-H., & Bock, R. D. (2003). Multilog (version 7) [Computer

software]. Lincolnwood, IL: Scientific Software International.

Thissen, D., & Orlando, M. (2001). Item response theory for items scored in two

categories. In D. Thissen & H. Wainer (Eds.), Test scoring (pp. 73–140). Mahwah, NJ:

Lawrence Erlbaum.

Woods, C. M., & Thissen, D. (2006). Item response theory with estimation of the latent

population distribution using spline-based densities. Psychometrika, 71, 281–301.

Authors

LONGJUAN LIANG is a psychometric manager at Educational Testing Service, Rosedale Rd, Princeton, NJ 08822; e-mail: [email protected]. Her research interests
