
PSYCHOMETRIKA--VOL. 55, NO. 4, 577-601 DECEMBER 1990

ON THE SAMPLING THEORY FOUNDATIONS OF ITEM RESPONSE THEORY MODELS

PAUL W. HOLLAND

EDUCATIONAL TESTING SERVICE

Item response theory (IRT) models are now in common use for the analysis of dichotomous item responses. This paper examines the sampling theory foundations for statistical inference in these models. The discussion includes: some history on the "stochastic subject" versus the random sampling interpretations of the probability in IRT models; the relationship between three versions of maximum likelihood estimation for IRT models; estimating 0 versus estimating 0-predictors; IRT models and loglinear models; the identifiability of IRT models; and the role of robustness and Bayesian statistics from the sampling theory perspective.

Key words: stochastic subjects, marginal maximum likelihood (MML), conditional maximum likelihood (CML), unconditional maximum likelihood (UML), joint maximum likelihood (JML), probability simplex, loglinear models, robustness.

1. Introduction

This paper is concerned with the statistical foundations of the probability models, based on item response theory (IRT), that are used to analyze educational and psy- chological test data. By "statistical foundations", I mean those aspects of IRT models used to draw statistical inferences from test data; that is, parameter estimates, their standard errors, and goodness-of-fit tests. The primary emphasis of this paper is on a formulation of these foundations that uses the random sampling of examinees from a population as its only source of probability. Many of the ideas in this paper have been discussed in one form or another by other writers so that its main contribution is the consistency with which the sampling theory approach is pursued. The remainder of this paper is organized as follows. Section 2 sets up the basic notation and discusses some relevant history. Section 3 discusses loglinear models for test data. In Section 4, I consider the geometric structure of IRT models. Section 5 shows the relation between three types of IRT likelihood functions and Section 6 introduces the idea of an ability predictor to replace ability estimates. Section 7 briefly sketches a few related topics.

2. Basic Notation and the Underlying Ideas

I will assume that our attention is focused on a specific test, T. Throughout this discussion, T will be considered as given and fixed. By a "test" I mean a specific set of questions with specific directions, given under standardized conditions of timing,

A presidential address can serve many different functions. This one is a report of investigations I started at least ten years ago to understand what IRT was all about. It is a decidedly one-sided view, but I hope it stimulates controversy and further research. I have profited from discussions of this material with many people including: Brian Junker, Charles Lewis, Nicholas Longford, Robert Mislevy, Ivo Molenaar, Donald Rock, Donald Rubin, Lynne Steinberg, Martha Stocking, William Stout, Dorothy Thayer, David Thissen, Wim van der Linden, Howard Wainer, and Marilyn Wingersky. Of course, none of them is responsible for any errors or misstatements in this paper. The research was supported in part by the Cognitive Science Program, Office of Naval Research under Contract No. N00014-87-K-0730 and by the Program Statistics Research Project of Educational Testing Service.

Requests for reprints should be sent to Paul W. Holland, Educational Testing Service, Rosedale Road 21-T, Princeton, NJ 08541.

0033-3123/90/1200-pa90 $00.75/0 © 1990 The Psychometric Society


item presentation, and so forth. If any of these elements change, the resulting test is different from T. For my purposes here, this rough description of a test should be sufficient.

The test, T, is made up of J test questions or items which I will index by the letter j (with or without primes or subscripts, as needed). Each item is assumed to have a "correct" answer and we use the indicator variable x_j to denote right or wrong answers to item j of T,

x_j = 1 if item j is answered correctly; x_j = 0 if item j is answered incorrectly. (1)

Let me make my first simplifying assumption right here. I will not consider the possibility of unanswered questions on T in this paper, except briefly in section 7. Omitted or "not reached" responses are important considerations in real tests and are a crucial feature of computerized adaptive tests, but they are beyond the scope of this paper. The only values of x_j considered here are 0 and 1, although little effort is required to extend most of my comments to the case where x_j is polytomous.

The pattern of correct and incorrect responses to the J items that an examinee might produce upon taking the test T will be denoted by the response vector or response pattern x, where

x = (x_1, ..., x_J). (2)

There are 2^J possible values of x.

Up to this point, there is little that is different from other developments of IRT except for the fact that the response vector x does not have a subscript on it to indicate the examinee who produced it. This is a characteristic of the notation I will use: x merely indexes all of the 2^J possible response patterns.
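As a small illustration of this indexing (my own sketch, not from the paper; the test length J = 3 is a made-up choice), the 2^J response patterns can be enumerated directly:

```python
from itertools import product

# Enumerate the 2^J possible response patterns x = (x_1, ..., x_J)
# in lexicographic order; J = 3 is an illustrative choice.
J = 3
patterns = list(product([0, 1], repeat=J))

print(len(patterns))   # 2^3 = 8 patterns
print(patterns[0])     # (0, 0, 0): every item answered incorrectly
print(patterns[-1])    # (1, 1, 1): every item answered correctly
```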

We now come to an important difference between this and other developments of IRT. Let C denote a given population of potential examinees. If T were administered to everyone in C, a certain proportion of them, p(x), would produce the response vector x. The values of the p(x) are proportions and, as such, they satisfy these two conditions:

p(x) ≥ 0, (3)

and

Σ_x p(x) = 1, (4)

where Σ_x denotes a summation over all of the 2^J values of x. Let p denote the 2^J-dimensional vector with coordinates p(x) in some (e.g., lexicographic) order:

p = (p(x)). (5)

If a person is sampled at random from C and tested with T, the probability that this randomly selected examinee will produce response pattern x is exactly p(x). This probability is simply a consequence of what we mean by random sampling from C. It is convenient to let X denote the response pattern of a randomly sampled examinee from C. Then p(x) = Prob (X = x).

When N people are sampled without replacement from C and tested with T, let n(x) denote the number who produce response vector x. Then let n denote the 2^J-dimensional vector with coordinates n(x):


n = (n(x)). (6)

Hence,

Σ_x n(x) = N.

The vector n is a 2^J-dimensional contingency table representation of the item response data from the N examinees and it has an approximate multinomial distribution with parameters N and p, provided that N is small relative to the size of C. (The exact distribution of n is multivariate hypergeometric.) I now make a second simplifying assumption: C is very large relative to N so that we may ignore the fact that n does not have an exact multinomial distribution. In general, N will be known, and p will be unknown. The likelihood function (i.e., the probability of the observed data) is based on the multinomial distribution and is

∏_x p(x)^{n(x)}, (7)

and its logarithm, the log likelihood function, is

L = Σ_x n(x) log p(x). (8)

This log likelihood function and its relationship to other IRT likelihood functions are discussed extensively in section 5.
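As a concrete sketch of (7) and (8) (my own illustration; the response vectors below are hypothetical, not from the paper), the counts n(x) and the multinomial log likelihood can be computed directly:

```python
from collections import Counter
from math import log

# Hypothetical response vectors for N = 6 examinees on a J = 2 item test.
data = [(1, 1), (1, 0), (0, 0), (1, 1), (1, 0), (1, 1)]
N = len(data)

# n(x): the number of examinees producing each response pattern.
n = Counter(data)

# Here p is taken to be the observed proportions n/N, which maximize the
# multinomial likelihood; any probability vector p could be substituted.
p = {x: n_x / N for x, n_x in n.items()}

# L = sum_x n(x) log p(x), summing only over patterns with n(x) > 0.
L = sum(n_x * log(p[x]) for x, n_x in n.items())
print(round(L, 4))
```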

At this point, I want to emphasize that the only source of probability that I will use in this paper is random sampling from C. There are no stochastic mechanisms operating except the sampling process by which examinees are selected from C and tested with T. I shall make no probabilistic assumptions about how an examinee arrives at his or her particular response vector. I only assume that an examinee, if tested with T, will produce some response vector. This assumption, along with the random sampling from C, will be the sole probability basis for statistical inferences. This is the random sampling perspective and it should be distinguished from the stochastic subject perspective discussed later in this section.

It is important to remember that performance on a test can depend on many factors in addition to the actual test questions--for example, the conditions under which the test was given, the conditions of motivation or pressure impinging on the examinee at the time of testing, and his or her previous experience with tests like T. In this paper, all of these factors are intended to be absorbed into either the precise definition of T or of the examinee population, C. It should be emphasized that I do not make the deterministic assumption that if an examinee were retested with T then the same response pattern would be produced by that examinee. Even if retesting were done under identical conditions, the examinee is no longer the same person in the sense that relevant experiences (i.e., prior exposure to T) have occurred that had not occurred prior to the first testing. In short, the tested sample of examinees has changed from a sample from C to a sample from a new population, C'. Thus, we explicitly do not include individual variability in test performance over time as a source of probability in defining p(x). Rather, p(x) is merely the proportion of C who would produce response vector x if tested with T. Individual variability over time is absorbed into the definition of the population C. As the population C "ages" in some important ways, it changes and so do the values of p(x). This approach runs the risk of defining the population, C, so narrowly that it is of little general interest--all of the high school juniors who took the SAT for the first time on Saturday, June 2, 1990. However, I believe that such a level of specificity is necessary for a precise statistical foundation for IRT models.

So far, most of the features of IRT models have not made their appearance--for example, item response functions, latent traits, item parameters, and so on. It is now time to remedy this. We have defined the basic data vector, n, to which the data collection process (i.e., random sampling) gives a multinomial distribution with a known value for N but an unknown value for p. The next step is to build a model for p. From the random sampling perspective, this is the purpose of IRT.

In general, a model for p is a restriction on the possible values of p. Let Ω_J denote the set of all possible 2^J-dimensional probability vectors; that is, Ω_J is the probability simplex defined by:

Ω_J = { all vectors q = (q(x)) such that q(x) ≥ 0 and Σ_x q(x) = 1 }.

The vector p is a point in Ω_J but, aside from this, p is not restricted, as yet, in any way. The vector of observed proportions of examinees producing each response pattern is n/N, a point in Ω_J. For this reason Ω_J may be called the "data space" to distinguish it from the "parameter space" introduced in sections 3 and 4. Unfortunately, this distinction may cause confusion because p is also a point in Ω_J and p is a parameter of the multinomial distribution that governs the statistical properties of n/N.
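A minimal membership check for the simplex Ω_J, in the sense of (3) and (4), can be written as follows (illustrative only; the numerical tolerance is an arbitrary choice):

```python
# Check that a vector q = (q(x)) lies in the probability simplex:
# nonnegative coordinates, as in (3), summing to one, as in (4).
def in_simplex(q, tol=1e-12):
    return all(qx >= 0.0 for qx in q) and abs(sum(q) - 1.0) <= tol

print(in_simplex([0.25, 0.25, 0.25, 0.25]))   # a point in Omega_J for J = 2
print(in_simplex([0.7, 0.4, -0.1]))           # violates (3), so not in Omega_J
```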

A model for p is a subset, M, of the data space (M ⊆ Ω_J). In this framework, all IRT models correspond to various subsets of Ω_J. These IRT-subsets are all specified by special cases of (9), below:

p(x) = ∫ ∏_{j=1}^{J} P_j(θ)^{x_j} Q_j(θ)^{1−x_j} dF(θ). (9)

In (9), P_j(θ) = 1 − Q_j(θ) is the item response function (IRF, or, in earlier usage, "item characteristic curve", ICC), θ is the latent trait or "ability", and F is the cumulative distribution function of θ over the population C. In (9), I have used the Stieltjes form of the integral (i.e., the dF(θ) notation). This allows for complete mathematical generality. If F(θ) is a differentiable distribution function, then its derivative f(θ) = F'(θ) is the density function of θ and dF(θ) may be replaced by f(θ)dθ in (9). Finally, in (9) θ may be a scalar or a vector. For clarity, let θ be a scalar unless I state otherwise. Unidimensionality is not an important restriction on much of my discussion.

The usual IRT models correspond to specific parametric choices of the IRFs and/or of F. For example, Lawley (1943), Tucker (1946), Lord (1952), Lord and Novick (1968), and Bock and Lieberman (1970) all study an explicit form of (9) with P_j(θ) given by the "normal ogive" model,

P_j(θ) = Φ(a_j(θ − b_j)),

where Φ(z) is the standard normal distribution function, b_j and a_j are location and scale parameters, and F(θ) is given by the standard normal distribution, F(θ) = Φ(θ).

Birnbaum (1967) examines another version of (9) in which P_j(θ) is given by the logistic model

P_j(θ) = LGT(a_j(θ − b_j)), (10)

where LGT(z) is the logistic distribution function,


LGT(z) = e^z / (1 + e^z), (11)

and F(θ) is also given by the logistic distribution, F(θ) = LGT(θ).

Thissen (1982) considers the case of (9) in which P_j(θ) is given by the one-parameter logistic (or Rasch) model,

P_j(θ) = LGT(θ − b_j), (12)

and F(θ) is the normal distribution function with mean zero but unknown variance, F(θ) = Φ(θ/σ). Tjur (1982) and Cressie and Holland (1983) consider the case of (9) in which P_j(θ) is given by the one-parameter logistic model (12) and F(θ) ranges over all possible distribution functions. Holland and Rosenbaum (1986) and Stout (1987) consider the case of (9) in which P_j(θ) is only required to be monotone increasing in θ and F(θ) is any distribution function.
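To make (9) concrete, the following numerical sketch (my own, not from the paper) evaluates p(x) under the Rasch model (12) with F taken to be the standard normal distribution; the item difficulties b_j are made-up values. The 2^J resulting values of p(x) should sum to 1, confirming that (9) yields a point in the probability simplex:

```python
import numpy as np
from itertools import product

b = np.array([-1.0, 0.0, 1.5])   # hypothetical difficulties b_j, J = 3
J = len(b)

def LGT(z):
    # Logistic distribution function, as in (11).
    return 1.0 / (1.0 + np.exp(-z))

# Replace dF(theta) by f(theta) d(theta) with f the standard normal density,
# and approximate the integral by a Riemann sum on a fine grid.
theta = np.linspace(-8.0, 8.0, 2001)
dtheta = theta[1] - theta[0]
f = np.exp(-theta**2 / 2.0) / np.sqrt(2.0 * np.pi)

def p_of_x(x):
    """Marginal probability p(x) via (9), with p(x|theta) as in (13)."""
    P = LGT(theta[:, None] - b)   # P_j(theta) under the Rasch model (12)
    like = np.prod(np.where(np.array(x) == 1, P, 1.0 - P), axis=1)
    return np.sum(like * f) * dtheta

total = sum(p_of_x(x) for x in product([0, 1], repeat=J))
print(round(total, 6))   # the 2^J values of p(x) sum to (approximately) 1
```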

One may view (9) in at least two ways. On the one hand, it is a formula that gives legitimate values for p(x) (i.e., if P_j(θ) is restricted to lie in the interval [0, 1] and F(θ) is a distribution function, then p(x) satisfies (3) and (4)), and as such it defines a model, M, for p. The model M will depend on the restrictions that are put on the set of IRFs and on F. From this first point of view the origin of (9) does not really matter. It is satisfactory in so far as it gives rise to models that fit data. On the other hand, it is important to be able to give reasons why p(x) might satisfy (9). Furthermore, if (9) fails to fit the data for a particular family of IRFs and choice of F, it is important to be able to expand the model in reasonable ways so as to improve the fit. A rationale for (9) may aid in choosing new parameters to add to the model that are sensible and improve the fit, and therefore the utility, of the model, M.

The most common rationales for (9) may be divided into two types which I shall call, respectively, the "random sampling" and the "stochastic subject" rationales. Both focus on the integrand of (9),

p(x|θ) = ∏_{j=1}^{J} P_j(θ)^{x_j} Q_j(θ)^{1−x_j}. (13)

Both rationales strive to give meaning to the notion that (13) gives the conditional "probability" that an examinee with ability θ will produce response vector x when tested with T. Both rationales interpret P_j(θ) as the conditional "probability" that an examinee with ability θ will answer item j correctly; and they both assume that local independence (i.e., (13)) is the proper way to combine these individual item "probabilities" to obtain the "probability" for the entire response vector. They differ in the way that P_j(θ) is interpreted as a probability.

The Random Sampling Rationale

The "random sampling" rationale is explicitly stated in Birnbaum's chapter in Lord and Novick (1968). Birnbaum (1967) says

Item scores . . . are related to an ability θ by functions that give the probability of each possible score on an item for a randomly selected examinee of given ability. These functions are . . . the item characteristic curve(s) . . . P_g(θ) . . . (p. 397)

In the "random sampling" rationale, θ defines strata, or subpopulations of the population C, with the same ability. The meaning of P_j(θ) is the proportion of people in the θ-th stratum of C who will answer item j correctly if tested by T. This is what is meant by "the probability that a randomly sampled subject with ability θ will correctly answer item j".

The random sampling rationale also appears in other applications of latent variable models. In his description of latent structure analysis applied to a measure of "ethnocentricity", Lazarsfeld (1950) describes the trace line (which is the latent-structure equivalent of the IRF) in terms very close to my definition of p(x) except that it is conditional on θ:

the proportion of people with a given degree of ethnocentricity who make a positive response to an item. (p. 370)

The rationale for local independence within the random sampling point of view is associated with the general way that latent traits are viewed as "explaining" a set of data. Lazarsfeld (1950) describes this notion of explanation in detail:

It is possible to formulate mathematically what is meant if we say that an underlying continuum accounts for the interrelationship of two test items . . . Such a formulation reduces to this idea: If people have the same position on the underlying x-continuum, then their answers to the two questions will be unrelated; the probability they will answer two questions positively is then the product of the probabilities that they will answer each question alone positively. (p. 369)

In general then, a latent trait "explains" the intercorrelations among a set of variables in a population if, conditionally given the value of the latent trait, all of the variables are mutually independent (i.e., if local independence holds). This notion of explanation and some of its limitations are discussed in Holland and Rosenbaum (1986).
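This notion of "explaining" association can be illustrated with a small simulation (my own construction; the Rasch difficulties and the normal ability distribution are assumptions for the demo): two locally independent items are positively associated marginally over C, but nearly independent within a narrow θ stratum:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000                      # large sample so proportions are stable

# Ability theta sampled from C, here taken to be standard normal.
theta = rng.standard_normal(N)

def LGT(z):
    # Logistic distribution function.
    return 1.0 / (1.0 + np.exp(-z))

# Two Rasch items with hypothetical difficulties 0.5 and -0.5; given theta
# the responses are generated independently (local independence holds).
x1 = (rng.random(N) < LGT(theta - 0.5)).astype(float)
x2 = (rng.random(N) < LGT(theta + 0.5)).astype(float)

# Marginally over the whole population the items are positively correlated ...
marginal_corr = np.corrcoef(x1, x2)[0, 1]

# ... but within a narrow theta stratum the correlation is near zero.
stratum = np.abs(theta) < 0.1
stratum_corr = np.corrcoef(x1[stratum], x2[stratum])[0, 1]

print(round(marginal_corr, 3), round(stratum_corr, 3))
```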

The random sampling rationale fits in very naturally with the definition I have given for p(x). If we randomly sample an examinee from C then, given this examinee's θ-stratum in C, we automatically randomly sample the examinee from that θ-stratum of C.

One serious limitation with the random sampling rationale is that it does not suggest any choice of the form of the IRFs. Without a more substantive interpretation of P_j(θ), it is difficult to see how one might be led to any of the item parameters that are in common usage. The important exception is the item difficulty parameter. These parameters allow p(x) to fit all of the one-dimensional marginal proportions exactly, and from general considerations of models for 2^J-dimensional contingency tables, one might argue that this is an absolute necessity for any model that aspires to fit real data. We return to this point again in Sections 3 and 4. In any event, while the random sampling rationale for P_j(θ) is neatly consistent with the assumption that examinees are randomly sampled from C, it does not lead to specific choices of the form of P_j(θ).

The Stochastic Subject Rationale

The "stochastic subject" rationale for (9) views the performance of an individual examinee on each item in T as inherently unpredictable for various reasons. As a mathematical model for this unpredictability, the responses of the examinees are assumed to be governed by a stochastic mechanism that is characterized by an ability parameter θ and by parameters that describe the characteristics of the items. In the stochastic subject rationale, θ is a person parameter and varies from examinee to examinee in C. The discussions in the literature as to the meaning of the stochastic mechanisms that produce responses are generally tautological: subjects are stochastic because human behavior is uncertain. For example, Rasch (1960) justifies his use of a probability model as follows:

. . . we return to . . . the description of certain human acts by a model of chance . . . Even if we know a person to be very capable, we cannot be sure that he will solve a certain difficult problem, nor even a much easier one. There is always a possibility that he fails--he may be tired or his attention is led astray, or some other excuse may be given. And a person of slight ability may hit upon the correct solution of a difficult problem. Furthermore, if the problem is neither "too easy" nor "too difficult" for a certain person, the outcome is quite unpredictable. But we may in any case attempt to describe the situation by ascribing to every person a probability of solving each problem correctly, and this probability will be our indicator of "how easily" the problem is solved. The probability that a very able person solves a very easy problem is near unity, but not necessarily equal to 1, and the probability that a person of small ability solves a difficult problem is very near to 0. (p. 73)

Along these same lines, Lord and Novick (1968) refer to the propensity distribution of a single examinee's responses to a test in the following terms:

Most students taking college entrance examinations are convinced that how they do on a particular day depends to some extent on "how they feel that day". A student who receives scores which he considers surprisingly low often attributes this unfortunate circumstance to a physical or psychological indisposition or to some more serious temporary state of affairs not related to the fact that he is taking the test that day. To provide a mathematical model for such cyclic variations, we conceive initially of a sequence of independent observations . . . and consider some effects, such as the subject's ability, to be constant, and others, such as the transient state of the person, to be random. We then consider the distribution that might be obtained over a sequence of such statistically independent measurements if each were governed by the propensity distribution . . . The propensity distribution is a hypothetical one because . . . it is not usually possible in psychology to obtain more than a few independent observations. Even though this distribution cannot in general be determined, the concept will prove useful. (p. 30)

The stochastic subject interpretation of P_j(θ) is related to the probabilistic learning models described in Bush and Mosteller (1955). In describing the development of mathematical models in psychology, these authors argue that the stochastic subject view is an inescapable fact of life:

These advances indicated a growing awareness that performance is an unpredictable thing, that choices and decisions are an ineradicable feature of intelligent behavior, and that the data we gather from psychological experiments are inescapably statistical in character. Given these basic facts of the theoretical psychologist's life, statistical theories in psychology would seem to have come to stay. (p. 336)

A similar view is expressed in Samejima (1983).

There may be an enormous number of factors eliciting his or her specific overt reactions to a stimulus, and, therefore, it is suitable, even necessary, to handle the situation in terms of the probabilistic relationship between the two. (p. 159)

I believe that no completely satisfactory justification of the "stochastic subject" is possible, but I also believe that most users think intuitively about IRT models in terms of stochastic subjects. It has great heuristic value even though no one readily admits to the belief that there is some sort of random mechanism within each subject generating his or her response vector by (mental) flips of biased coins. A simple example of this heuristic value arises in a commonly given rationale for local independence. If it is plausible to believe that T is such that the response of an examinee to one question will not influence his or her answer to some other question, then it is not difficult to accept the view that the responses from such a stochastic subject will exhibit local independence. "Guessing parameters" provide another example of the heuristic value of stochastic subjects. It is difficult to imagine how guessing parameters would have been conceived without the aid of a stochastic subject interpretation of P_j(θ). Finally, if θ represents an "ability to solve a problem" then it is not difficult to suppose, for stochastic subjects and tests with "correct answers", that P_j(θ) ought to increase in θ--as it does for most of the common parametric models.

I view both the random sampling and the stochastic subject rationales for (9) as useful tools for understanding IRT models. The stochastic subject is a powerful metaphor that aids our intuition in the construction of models (i.e., choosing M ⊆ Ω_J). The random sampling rationale gives a firm logical basis for statistical inference in these models. Neither of these two important roles should be ignored.

Lord (1974) discusses an additional interpretation of P_j(θ). In his words:

P_ia is most simply interpreted as the probability that the examinee a will give the right answer to a randomly chosen item having the same ICC as item i. (p. 249)

While this interpretation of P_j(θ) has a clear sampling theory basis, I did not include it in my discussion because it does not apply to the case assumed here in which the test T is a fixed set of questions. It would apply to a testing situation in which every question presented to every examinee was sampled afresh from one of J item pools. Such applications can arise with computerized testing. Lord also mentions briefly the difference between the random sampling and stochastic subject interpretations of P_j(θ). Continuing the above quotation he says:

An alternative interpretation is that P_i(θ_a) is the probability that item i will be answered correctly by a randomly chosen examinee of ability level θ = θ_a. (p. 250)

This is an explicit statement of the random sampling rationale. Then he goes on to say that:

These interpretations tell us nothing about the probability that a specified examinee will answer a specific item correctly. (p. 250)

Here, Lord clearly allows for the possibility that a specific examinee might behave as a stochastic subject in certain circumstances--which, in that paper, concern responses to previously omitted items.

Lazarsfeld (1959) gives the following, rather graphic, description of a stochastic subject--quoted at length in Lord and Novick (1968, pp. 29-30)--only nine years after his equally clear description of the random sampling rationale mentioned above:


Suppose we ask an individual, Mr. Brown, repeatedly whether he is in favor of the United Nations; suppose further that after each question we "wash his brains" and ask him the same question again. Because Mr. Brown is not certain as to how he feels about the United Nations, he will sometimes give a favorable and sometimes an unfavorable answer. Having gone through this procedure many times, we then compute the proportion of times Mr. Brown was in favor of the United Nations. (pp. 493-494)

Thus writers on IRT models have been eclectic in their interpretation of the meaning of the "probability" that the IRF is supposed to represent. More often, they are silent and make no effort to interpret it at all; for example, Lawley (1943), Tucker (1946), Lord (1952), Samejima (1969, 1972), Bock and Lieberman (1970), Wright and Douglas (1977), Wright and Stone (1979), and Andersen (1980). To discuss the statistical foundations of IRT models without ambiguity or confusion, we can afford to be neither silent nor eclectic. In this paper I adopt the view that (9) and assumptions on the IRFs and F(θ) define locally independent IRT models for p(x). I will use the random sampling rationale for the meaning of Pj(θ) whenever that is important, and will only make use of particular choices of the IRFs to define subsets M ⊆ Ω_J without regard to the substantive interpretation of these choices of IRFs in terms of stochastic subjects.

Lewis (1985, 1990) gives a Bayesian analysis of dichotomous item responses that presumes neither the random sampling nor the stochastic subject rationale. His approach provides an alternative perspective on the statistical foundations of IRT models to the one developed here.

But We Don't Sample People at Random!

Let me briefly address this potential objection to the position I have adopted in this paper. In some practical situations, we are unable to specify a population of potential examinees, let alone randomly sample from such a population. Nonetheless, in building a foundation for statistical inference it is important to begin with a simple situation in which the statistical issues are relatively clear-cut. Once the problems are understood in such a context, we may then move on to more complex situations in which the sampling may be biased or for which the idea of a population of examinees may be meaningless. That is the methodological path that I will follow here. The theory of estimation and testing for the multinomial distribution is thoroughly understood in a variety of situations--a basic paper is Birch (1964)--and for this reason I believe it is an appropriate place to begin building the statistical foundations for item response theory models.

I hope that the rest of this paper proves my point. Making the population of potential examinees an explicit part of the model runs counter to most developments of IRT models, which start with an individual (stochastic?) subject and build a probability model for his or her response vector. While the approach of starting with an individual subject may appear to obviate the need for ever mentioning an examinee population, it is my opinion that this is an illusion. Item parameters and subject abilities are always estimated relative to a population, even if this fact may be obscured by the mathematical properties of the models used. Hence, I view as unattainable the goal of many psychometric researchers to remove the effect of the examinee population from the analysis of test data (i.e., "sample-free item analysis", Wright and Douglas, 1977). The effect of the population will always be there; the best that we can hope for is that it is small enough to be ignored.


3. Loglinear Models for p(x)

It is useful to consider, briefly, some alternatives to the IRT models for p(x) defined, in general, by (9). The class of loglinear models is such an alternative. These models are specified by equations of the form

log p(x) = α₀ + Σ_{j=1}^R β_j b_j(x),    (14)

where α₀ is a normalizing constant to insure that Σ_x p(x) = 1; {β_j} are the loglinear model parameters; and the {b_j(x)} are known functions of x. Examples of b_j(x) that arise are

b(x) = x_i,

b(x) = x_i x_j,  i < j,

b(x) = x_i x_j x_k,  i < j < k,

b(x) = Σ_{j=1}^J x_j = x₊,    (15)

b(x) = (x₊)², b(x) = (x₊)³, ...,

b(x) = δ_t(x),  where δ_t(x) = 1 if x₊ = t, and δ_t(x) = 0 otherwise.

A saturated model is a loglinear model for which R = 2^J − 1 and the 2^J − 1 functions {b_j(x)} are linearly independent. A saturated model completely spans the entire probability simplex, Ω_J, and, therefore, for any p ∈ Ω_J there is a unique choice of loglinear parameters, {β_j}, such that p satisfies (14). Saturated models are of little use when J is large except to indicate that by adding enough parameters, any set of data can be fit or "explained" arbitrarily well by a loglinear model.
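As a concrete illustration of why a saturated model spans the whole simplex, the following sketch (Python; the variable names and the use of subset-product basis functions are my own illustrative choices) builds the 2^J − 1 linearly independent functions for J = 3 and solves (14) exactly for an arbitrary strictly positive p:

```python
import itertools
import math

import numpy as np

J = 3
patterns = list(itertools.product([0, 1], repeat=J))   # all 2^J response vectors

# Saturated basis: one subset product b_S(x) = prod_{i in S} x_i per nonempty S
subsets = [s for r in range(1, J + 1) for s in itertools.combinations(range(J), r)]
B = np.array([[math.prod(x[i] for i in s) for s in subsets] for x in patterns])

# An arbitrary strictly positive point p in the simplex Omega_J
rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(2 ** J))

# Solve log p(x) = alpha_0 + sum_j beta_j b_j(x) for the 2^J coefficients
A = np.hstack([np.ones((2 ** J, 1)), B])               # full-rank 8 x 8 design
coef = np.linalg.solve(A, np.log(p))
alpha0, beta = coef[0], coef[1:]

# The saturated model reproduces p exactly
assert np.allclose(np.exp(alpha0 + B @ beta), p)
```

The exact fit for any positive p is precisely the sense in which a saturated model "explains" any data set.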

An important class of loglinear models are the hierarchical or ANOVA-type models defined by

log p(x) = α₀ + Σ_j α_j x_j + Σ_{i<j} γ_{ij} x_i x_j + Σ_{i<j<k} λ_{ijk} x_i x_j x_k + ⋯    (16)

In (16) the sums of products, x_i x_j x_k, et cetera, continue up to some level, say to products of t distinct factors, and the model is then said to include only interactions of order t or less. Important special cases of these hierarchical loglinear models are the model of statistical independence,

log p(x) = α₀ + Σ_j α_j x_j,    (17)

and the second-order exponential model of Tsao (1967),

log p(x) = α₀ + Σ_j α_j x_j + Σ_{i<j} γ_{ij} x_i x_j.    (18)


Loglinear models are examples of exponential family models and, as such, they enjoy many elegant "likelihood properties". For example, under the multinomial approximation mentioned in section 2, the log likelihood function for models of the form (14) is

L = Σ_x n(x) log p(x) = N α₀ + Σ_{j=1}^R β_j (Σ_x b_j(x) n(x)).    (19)

From (19) it follows that the sufficient statistics for these models are N and {Σ_x b_j(x) n(x)}, j = 1, ..., R. In addition, maximum likelihood estimation for these models corresponds to the method of moments in which the fitted probabilities, p̂(x), satisfy the well-known "moment matching" conditions:

Σ_x b_j(x) N p̂(x) = Σ_x b_j(x) n(x),  j = 1, ..., R.    (20)

The right-hand-side of (20) is a sufficient statistic of the model while the left-hand-side is the maximum likelihood estimate of its expected value under the model; hence, the phrase "moment matching".

When the b_j-functions include the indicator functions x₁, x₂, ..., x_J, the moment matching conditions insure that the sample marginal proportions, p̂_j = Σ_x x_j n(x)/N, equal the fitted marginal proportions Σ_x x_j p̂(x). For example, any loglinear model that includes the model of statistical independence (17) as a special case will have the property that the fitted probabilities, p̂(x), from maximum likelihood estimation, will have univariate marginal proportions that exactly match the observed marginal proportions correct, p̂_j. If the random vector X has a joint distribution specified by p(x), then the marginal proportion correct, p_j = P(X_j = 1), does not reflect any of the dependence in the distribution of X. Because the primary goal of models for the distribution of X is to account for its dependence, it is usual for the model of statistical independence to be a submodel of most useful probability models for p(x). This inclusion of the model of statistical independence as a submodel insures that each univariate margin in the data can be fit exactly by the model. For IRT models this is the role played by item difficulty parameters.
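The moment-matching property (20) can be checked numerically for the simplest case. In the sketch below (illustrative Python; the data are arbitrary 0/1 draws, not generated from any IRT model), the maximum likelihood fit of the independence model (17) reduces to taking the sample marginals, and (20) holds for b_j(x) = x_j:

```python
import itertools

import numpy as np

rng = np.random.default_rng(1)
J, N = 4, 500
U = rng.integers(0, 2, size=(N, J))            # N observed response vectors

patterns = list(itertools.product([0, 1], repeat=J))
n = np.array([np.sum(np.all(U == np.array(x), axis=1)) for x in patterns])

# MLE under the independence model (17): fitted p_j are the sample marginals
p_hat = U.mean(axis=0)
p_fit = np.array([np.prod([p_hat[j] if x[j] else 1 - p_hat[j] for j in range(J)])
                  for x in patterns])

# Moment-matching condition (20) for b_j(x) = x_j
for j in range(J):
    lhs = sum(x[j] * N * p for x, p in zip(patterns, p_fit))
    rhs = sum(x[j] * cnt for x, cnt in zip(patterns, n))
    assert abs(lhs - rhs) < 1e-8
```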

One virtue of loglinear models is the straightforward way that the parameter vector β = (β₁, ..., β_R) is mapped via (14) into a point, p(β) = (p(x; β)), in the simplex Ω_J. The region of Ω_J that is specified by a loglinear model of the form (14), in which the {b_j(x)} are linearly independent functions of x, is an R-dimensional "surface" or manifold in (2^J − 1)-dimensional space. For example, the model of statistical independence (17) is a J-dimensional manifold in Ω_J, while the second-order exponential model (18) is a (J + (J choose 2))-dimensional manifold in Ω_J. All of the models for p considered here correspond to smooth, high dimensional "surfaces" or manifolds in Ω_J. For example, a curved 2-dimensional surface in n-dimensional space is a 2-dimensional manifold. My discussion of manifolds will be intuitive, but I believe that it is essentially correct and capable of more precise formalization into the language of differential geometry--a task I leave to others. The dimensionality of the model M as a subset of Ω_J governs the degrees of freedom of chi-square test statistics and represents the number of identifiable model parameters.

It is useful to distinguish between the parameter space of a model and the corresponding subset M of the data space, Ω_J. In many cases the parameter space will be a simple region of Euclidean space and its image, M, a manifold of the same dimension as the parameter space that is embedded in the much higher dimensional data space,


Ω_J. This is true for a loglinear model in which the b_j-functions are linearly independent and β is free to vary over a parameter space consisting of all of Euclidean R-space. As we shall see, IRT models are sometimes more complicated than this.

4. The Structure of IRT Models

This section is concerned with the structure, mainly the dimension, of IRT models when viewed as subsets of the data space, Ω_J. These subsets of Ω_J are obtained from (9) by letting the IRFs, {P_j(θ)}, and the ability distribution, F(θ), range over specific classes of functions such as those mentioned in section 2.

Before discussing IRT models more generally, it is useful to describe two important subsets of Ω_J: the independence model and the Guttman scales.

The Independence Model

If, in (9), P_j(θ) does not depend on θ, for example,

P_j(θ) = p_j,    (21)

then p(x) has the form

p(x) = Π_j p_j^{x_j} q_j^{1−x_j},    (22)

where p_j = 1 − q_j is the (marginal) proportion of examinees in C who get item j correct. Taking logs and reparameterizing (22) shows it to be equivalent to the loglinear model in (17) with

α_j = log (p_j/q_j),  j = 1, ..., J.    (23)

It is clear that, as each of the p_j in (22) range over the closed interval [0, 1], the vector p = (p(x)) traces out a J-dimensional connected subset of Ω_J. This subset, M_IND, is the model of statistical independence mentioned in section 3. The parameter space for M_IND is the J-dimensional unit cube, K_J, defined by

K_J = {(p₁, ..., p_J): 0 ≤ p_j ≤ 1, j = 1, ..., J}.

M_IND is the image of K_J under the mapping specified by (22).

The Guttman Scales

If, in (9), P_j(θ) is a 0-to-1 step function with a jump at θ = t_j,

P_j(θ) = 1 if θ ≥ t_j, and P_j(θ) = 0 if θ < t_j,    (24)

and F is a fixed, continuous distribution function, then the resulting values of p(x) may be described as follows. Suppose {g_j} is a permutation of {1, ..., J} that orders the jump points,

t_{g₁} ≤ t_{g₂} ≤ ⋯ ≤ t_{g_J}.    (25)

The marginal proportions correct, p_j, are given by

p_j = ∫ P_j(θ) dF(θ) = ∫_{t_j}^∞ dF(θ) = 1 − F(t_j).    (26)


The permutation {g_j} orders these proportions from highest (easiest item) to lowest (hardest item). We have 1 ≥ p_{g₁} ≥ ⋯ ≥ p_{g_J} ≥ 0. Next, for j = 0, 1, ..., J, let y^(j) be the 0/1 vector (y₁^(j), ..., y_J^(j)) such that

y_{g_i}^(j) = 1 if i ≤ j, and y_{g_i}^(j) = 0 if i > j.    (27)

If (24) holds, then all response vectors, x, except for the J + 1 vectors, {y^(j)}, must have p(x) = 0. The values of p(y^(j)) are given by

p(y^(0)) = 1 − p_{g₁},

p(y^(j)) = p_{g_j} − p_{g_{j+1}},  j = 1, ..., J − 1,    (28)

and

p(y^(J)) = p_{g_J}.

This probability distribution puts positive probability on at most J + 1 response vectors and is called a Guttman scale after Guttman (1941, 1950). Like M_IND, the Guttman scales, p ∈ M_GUT, are completely determined by their marginal proportions correct, p₁, ..., p_J. As in M_IND, the parameter space for M_GUT is the J-dimensional unit cube K_J, defined earlier. As the p_j vary over the interval [0, 1], p = (p(x)), given by (28), traces out a J-dimensional boundary of Ω_J. It is a boundary because at most J + 1 coordinates of p are nonzero for p ∈ M_GUT.
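A small numerical sketch may help fix ideas: given arbitrary marginal proportions correct, the Guttman probabilities (28) form a proper distribution concentrated on J + 1 response vectors and reproduce those marginals exactly. (Python; the particular p values are illustrative.)

```python
import numpy as np

p = np.array([0.9, 0.7, 0.4, 0.15])     # marginal proportions correct, one per item
g = np.argsort(-p)                      # permutation ordering items easiest to hardest
ps = p[g]                               # p_{g_1} >= ... >= p_{g_J}
J = len(p)

# Probabilities of the J+1 admissible vectors y^(0), ..., y^(J), from (28)
probs = np.concatenate([[1 - ps[0]], ps[:-1] - ps[1:], [ps[-1]]])
assert np.isclose(probs.sum(), 1.0)     # a proper distribution on J+1 points

# In the g-ordering, y^(j) answers exactly the j easiest items correctly
Y = np.array([[1 if i < j else 0 for i in range(J)] for j in range(J + 1)])

# The marginal proportion correct for item g_i recovers ps[i]
assert np.allclose(probs @ Y, ps)
```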

All IRT models may be thought of as being "in between" M_IND (which has no dependence) and M_GUT (which has "perfect" dependence). Both M_IND and M_GUT are characterized by their one-dimensional marginal proportions correct (p₁, p₂, ..., p_J); that is, they are J-dimensional manifolds in Ω_J. M_IND is a smoothly curved manifold while M_GUT is a piecewise linear manifold because it is part of the J-dimensional boundary of Ω_J.

Parametric IRT Models

Sometimes it is easy to guess the dimension of an IRT model from a count of the number of item and ability distribution parameters minus the number of constraints on them. The usual parametric forms for Pj(O) and F(O) may be defined by

P_j(θ) = P₀(a_j(θ − b_j); c_j),    (29)

and

F(θ) = F₀((θ − μ)/σ; ν),    (30)

where P₀ and F₀ are specified functions, c_j denotes any additional item parameters beyond the usual location and scale parameters, b_j and a_j, and ν represents any additional ability distribution parameters beyond the location and scale, μ and σ. When a_j, b_j, μ and σ vary freely, it is easy to see that θ in (9) may be transformed to θ′ = (θ − μ)/σ, thereby eliminating μ and σ as free parameters. As an example, we consider the model used by Thissen (1982) in which P_j(θ) has the one-parameter logistic form (12) and F(θ) = Φ(θ/σ). By a change of scale in F, this is equivalent to the model in which F(θ) = Φ(θ) and P_j(θ) has the 2-parameter logistic form (10) in which all of the a_j parameters are equal, a_j = a. By a suitable choice of sequence π^(t) = (a^(t), b₁^(t), ..., b_J^(t)), we can arrange for p(π^(t)) in this model to converge to either a point in M_IND (let


a^(t) → 0) or to a point in M_GUT (let a^(t) → ∞). In addition, by choosing the b_j^(t) correctly, we can force equality of the marginal distributions for all t,

p_j = ∫ LGT(a^(t)(θ − b_j^(t))) dΦ(θ),

for all π^(t). Thus, for any choice of marginal proportions correct (p₁, ..., p_J) in K_J there is a curve, indexed by the a-parameter, that moves continuously from a point on M_IND with these marginal proportions to the corresponding point on M_GUT with the same marginal proportions. This curve traces out all of the points in Thissen's model that have the specified marginal proportions correct. Hence, Thissen's model is a (J + 1)-dimensional manifold in Ω_J. If F is changed to a different, but fixed distribution (with nonzero variance), we get a different (J + 1)-dimensional manifold. At present, I do not know of any relationship between these various (J + 1)-dimensional manifolds that correspond to various versions of the Rasch model with essentially different ability distributions.
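The curve from M_IND to M_GUT described above can be traced numerically. The sketch below is illustrative Python: the θ grid, the bisection tolerances, and the particular marginals are ad hoc choices, not part of the model. It holds the marginals fixed by solving for b_j at each slope a, and checks the two limits:

```python
import itertools

import numpy as np

# Rectangle-rule weights approximating dPhi(theta) on a wide grid
theta = np.linspace(-8, 8, 4001)
w = np.exp(-theta**2 / 2) / np.sqrt(2 * np.pi) * (theta[1] - theta[0])

def lgt(z):
    return 1.0 / (1.0 + np.exp(-z))

p_target = np.array([0.8, 0.5, 0.3])            # fixed marginal proportions correct
J = len(p_target)
patterns = list(itertools.product([0, 1], repeat=J))

def model_p(a):
    # choose b_j by bisection so each marginal integrates to p_target[j]
    b = np.zeros(J)
    for j in range(J):
        lo, hi = -2000.0, 2000.0
        for _ in range(80):
            mid = (lo + hi) / 2
            if np.sum(lgt(a * (theta - mid)) * w) > p_target[j]:
                lo = mid                        # marginal too high: raise difficulty
            else:
                hi = mid
        b[j] = (lo + hi) / 2
    P = lgt(a * (theta[:, None] - b[None, :]))  # common-slope logistic IRFs
    return np.array([np.sum(w * np.prod(P**np.array(x) * (1 - P)**(1 - np.array(x)),
                                        axis=1)) for x in patterns])

# a near 0: the curve approaches the independence point with these marginals
p_ind = np.array([np.prod([p_target[j] if x[j] else 1 - p_target[j]
                           for j in range(J)]) for x in patterns])
assert np.allclose(model_p(0.01), p_ind, atol=1e-3)

# a large: the curve approaches the Guttman scale -- at most J+1 patterns get mass
assert np.sum(model_p(50.0) > 1e-3) <= J + 1
```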

If there are K item parameters for each item (including location and scale) and L parameters for F, including location and scale, then the total number of parameters appearing in the formula for p(x) is KJ + L − 2. Is the dimension of the corresponding subset M of Ω_J also KJ + L − 2? In some cases the answer is clearly yes and in others this is less clear. Birch (1964) gives an important condition that gives insight into this question. Let all of the item and ability distribution parameters be combined into a single vector parameter, π, that ranges over a parameter space of D dimensions. Let the transformation that maps π into the corresponding point p ∈ Ω_J be denoted by

p(π) = (p(x; π)),    (31)

where

p(x; π) = ∫ Π_j P₀(a_j(θ − b_j); c_j)^{x_j} Q₀(a_j(θ − b_j); c_j)^{1−x_j} dF₀(θ; ν).    (32)

I let (32) define what I mean by a parametric IRT model. I will assume that F₀(θ; ν) is a family of distribution functions whose location and scale have been fixed. Examples are F₀(θ) = Φ(θ), or F₀(0; ν) = 1/2 and F₀(1; ν) = 3/4 for all ν. Birch's condition, given in the context of proving the consistency and asymptotic normality of maximum likelihood estimates for the multinomial distribution, is that for any ε > 0, there exists a δ > 0 such that if

‖π − π*‖ > ε, then ‖p(π) − p(π*)‖ > δ,    (33)

where ‖·‖ denotes Euclidean distance. Birch's condition prevents two values π, π*, that are not near each other in the parameter space, from giving rise to two points in the model M that are close to each other.

Birch's condition (33) gives some useful insight into the structure of models in which P_j(θ) has the standard 3-parameter form,

P_j(θ) = P(θ; π_j) = c_j + (1 − c_j) P₀(a_j(θ − b_j)),    (34)

where P₀ is a continuous cumulative distribution function and F(θ) = F₀(θ) is a fixed distribution function. We envision moving the various parameters smoothly over the parameter space and seeing what happens to p(π) in Ω_J. Suppose {π^(t)} is a sequence


of parameter values that converges to a parameter value π. Suppose further that in this limit the IRFs are flat (i.e., they do not depend on θ),

lim_{t→∞} P_j(θ; π^(t)) = p_j.    (35)

In the case of (34), (35) implies that we must have

lim_{t→∞} a_j^(t) = 0,

lim_{t→∞} c_j^(t) = c_j,    (36)

and

lim_{t→∞} a_j^(t) b_j^(t) = A_j,

where

c_j + (1 − c_j) P₀(−A_j) = p_j.    (37)

Equations (36) and (37) show that many sequences of parameter values {π^(t)} that are not near each other in the parameter space give rise to points in M that are all close to the same point in M_IND, because

Π_j p_j^{x_j} q_j^{1−x_j}    (38)

is always a point in M_IND. This is a violation of Birch's condition (33). It is easy to show that if, in (34), we restrict the parameter c_j to be a fixed value, such as 0 or 1/5, for all j, then A_j in (37) is uniquely determined by the marginal proportion p_j. This will prevent the violation of Birch's condition that can occur if the c_j are allowed to vary freely. Furthermore, if we look at sequences of parameter values that approach the Guttman scale (i.e., for which a_j is large), then the phenomenon we have just described cannot occur. This type of analysis suggests that the well-known problem of the identifiability of the c-parameter in the 3-parameter logistic model,

P_j(θ) = c_j + (1 − c_j) LGT(a_j(θ − b_j)),    (39)

occurs primarily when a_j is small, and is eliminated by choosing an a priori fixed value for all the c_j. Both of these facts are used in practice and this analysis shows why they are true.
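The flat-IRF argument can be made concrete: in the limit (36)-(37) an item contributes only its marginal p_j, and many (c_j, A_j) pairs satisfy (37) for the same p_j, while fixing c_j pins A_j down. A small sketch (Python; the value p_j = 0.6 and the candidate c values are arbitrary illustrations):

```python
import numpy as np

def lgt(z):
    return 1.0 / (1.0 + np.exp(-z))

# Flat-IRF limit (37) with P_0 = LGT: p_j = c_j + (1 - c_j) * LGT(-A_j).
p_j = 0.6
pairs = []
for c in [0.0, 0.1, 0.25, 0.5]:            # any c < p_j admits a solution
    q = (p_j - c) / (1 - c)                # required value of LGT(-A_j)
    A = -np.log(q / (1 - q))               # invert the logistic
    assert np.isclose(c + (1 - c) * lgt(-A), p_j)
    pairs.append((c, A))

# Distinct parameter values map to the same point: Birch's condition (33) fails
assert len({round(A, 6) for _, A in pairs}) == len(pairs)

# Fixing c (say c = 0) makes A_j unique: A_j = -logit(p_j)
A_fixed = -np.log(p_j / (1 - p_j))
assert np.isclose(lgt(-A_fixed), p_j)
```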

Semiparametric IRT Models

When the P_j(θ) have a parametric form like (29), but F(θ) is allowed to vary over all distribution functions, the resulting models may be called semiparametric after Oakes (1988). Tjur (1982) and Cressie and Holland (1983) examine the structure of the semiparametric Rasch model in which P_j(θ) has the 1-parameter logistic (Rasch) form specified in (12). They show that for this case, p(x) has the form of a loglinear model


log p(x) = α₀ + Σ_{j=1}^J α_j x_j + Σ_{k=2}^J γ_k δ_k(x),    (40)

where δ_k(x) is the function in (15) and {α_j}, {γ_k} are parameters, and they show that the γ_k are subject to a system of inequality constraints but that the {α_j} are not. Equation (40) shows that the dimension of M_RASCH for the semiparametric Rasch model is J + J − 2 because the inequality constraints do not lower the dimensionality of the {γ_k}. However, Holland (1990) gives results that suggest that for this model, the inequality constraints mentioned above become very tight as J → ∞ and that M_RASCH approximates a (J + 1)-dimensional manifold. We showed earlier that for Thissen's model, in which P_j has the Rasch form and F(θ) = Φ(θ/σ), the dimension of M is J + 1. These two facts suggest that, at least when J is large, the dimensionality of the semiparametric models may not be easy to guess from intuitive analyses. The development of a better method for reliably computing these dimensions is a useful line of future research. The work of de Leeuw and Verhelst (1986), Follman (1988), Levine (1989), and Lindsay, Clogg, and Grego (in press), which all emphasize discrete versions of F, may prove useful here.

The Nonparametric IRT Models

When P_j(θ) and F(θ) are both allowed to vary over nonparametric classes of functions, the resulting IRT models may be called nonparametric. Examples of analyses of these models are Levine (1989), Holland and Rosenbaum (1986), Stout (1987, 1990), and Junker (1988, 1989, in press). I know of no descriptions of M for any nonparametric IRT models beyond the partial conditions given in Holland (1981), Rosenbaum (1984), and Holland and Rosenbaum (1986). The results given there do not suggest that the dimension of M is reduced. However, the conjecture of Holland (1990) is that when J is large, and F and the {P_j} are sufficiently well-behaved, and θ is of dimension D, then M is approximately a (D + 1)J dimensional manifold represented by a second-order exponential model of the form

log p(x) = α₀ + Σ_j α_j x_j + Σ_{i<j} γ_{ij} x_i x_j,    (41)

where (γ_{ij}) is of rank D. Evidently, this is an area in which we know little and need to know more.

5. Which Likelihood Function?

One of the curiosities of estimating IRT models is the number of different procedures that all claim to result in "maximum likelihood estimates". A computer program that computed unconditional maximum likelihood (UML) or joint maximum likelihood (JML) estimates was described as early as Lord (1967), and the technique of marginal maximum likelihood (MML) estimation was described in Bock (1967). Conditional maximum likelihood (CML) estimation was suggested by Rasch (1960) and its properties established in Andersen (1970). Hence, for at least 20 years IRT modelers have had three types of IRT estimation procedures, all called "maximum likelihood", to choose from.

When the data are the item responses from a random sample of N examinees from a much larger population, the log-likelihood function for any IRT model is determined from the multinomial distribution and is


L = Σ_x n(x) log p(x).    (42)

Let p(x; π, F) denote the integral in (9), where π denotes all of the item parameters considered as a large vector and F is the distribution of ability. I assume that F has been centered and scaled in some way to eliminate the location/scale identification problem described in the discussion of parametric IRT models. In practice, this will result in (a) F being a fixed distribution, such as N(0, 1), or (b) F being a member of a nonlocation/scale parametric family of distributions, such as F(θ; ν) = (LGT(θ))^ν, or (c) F being a member of a nonparametric family of distributions all subject to the same location/scale constraints.

Maximum likelihood estimation, from the perspective of this paper, consists of maximizing L in (42) over all legitimate π and F; that is, an MLE of (π, F) satisfies

L(π̂, F̂) ≥ L(π, F), for all legitimate (π, F).    (43)

The maximum likelihood estimate (π̂, F̂) defined by (43) is exactly the MML estimate of (π, F) computed in a parametric and semiparametric setting by Bock and his colleagues (Bock, 1967; Bock & Aitkin, 1981; Bock & Lieberman, 1970), and by Levine and his colleagues (1989) in the nonparametric setting. I would simply call (π̂, F̂) the "maximum likelihood estimate" of (π, F), but traditions are strong and it is unlikely that the redundant adjective, "marginal", will ever be dropped. In the semiparametric setting, F̂ is not unique, but this is not important for my discussion here.
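For readers who want to see (43) in computational terms, the following sketch evaluates the marginal (MML) log-likelihood of a Rasch model with F fixed at N(0, 1) by Gauss-Hermite quadrature, and checks that simulated data prefer the generating difficulties to a permuted alternative. (Illustrative Python; the quadrature order, sample size, and difficulty values are arbitrary choices, not part of any published program.)

```python
import itertools

import numpy as np
from numpy.polynomial.hermite import hermgauss

def lgt(z):
    return 1.0 / (1.0 + np.exp(-z))

# Gauss-Hermite nodes transformed so that sum(w * f(theta)) ~ E_F[f(theta)], F = N(0,1)
nodes, weights = hermgauss(41)
theta = np.sqrt(2) * nodes
w = weights / np.sqrt(np.pi)

def marginal_loglik(b, n_counts, patterns):
    # L of (42) with p(x; pi, F) from (9), computed by quadrature
    P = lgt(theta[:, None] - b[None, :])     # Rasch IRFs at the quadrature nodes
    L = 0.0
    for x, cnt in zip(patterns, n_counts):
        if cnt == 0:
            continue
        xa = np.array(x)
        px = np.sum(w * np.prod(P**xa * (1 - P)**(1 - xa), axis=1))
        L += cnt * np.log(px)
    return L

# Simulate Rasch data from known difficulties
rng = np.random.default_rng(2)
b_true = np.array([-1.0, 0.0, 1.0])
N = 2000
abil = rng.standard_normal(N)
U = (rng.random((N, 3)) < lgt(abil[:, None] - b_true[None, :])).astype(int)
patterns = list(itertools.product([0, 1], repeat=3))
n_counts = [np.sum(np.all(U == np.array(x), axis=1)) for x in patterns]

# The marginal likelihood prefers the generating difficulties to a reversed guess
assert marginal_loglik(b_true, n_counts, patterns) > \
       marginal_loglik(np.array([1.0, 0.0, -1.0]), n_counts, patterns)
```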

What is the relationship between MML, UML/JML, and CML? It is a useful fact that both the UML/JML and the CML estimates of π can be viewed as approximations to the MML estimate of π just described. I consider each of these relationships in turn.

Relationship of UML to MML

The "likelihood function" that is maximized to obtain the UML/JML estimate of π may be written as

L_U(π, {θ_i}) = Σ_{i=1}^N Σ_j log [P₀(θ_i; π_j)^{u_{ij}} Q₀(θ_i; π_j)^{1−u_{ij}}],    (44)

where

P_j(θ) = P₀(θ; π_j)

is a parametric IRF with item parameter vector, π_j, and u_{ij} denotes the response of subject i to item j. The ability parameters, θ_i, i = 1, ..., N, are regarded as free parameters subject to a location/scale constraint such as

Σ_i θ_i = 0,

and UML/JML programs, such as LOGIST, proceed by maximizing L_U in (44) jointly in π and {θ_i}. UML/JML estimation is called joint maximum likelihood estimation (JML) because it results in estimates for both item parameters {π_j} and ability parameters {θ_i}, jointly. The motivation for (44) is, as far as I can tell, a purely "stochastic subject" interpretation of (13). Stochastic subject i meets item j and has probability


P(θ_i; π_j) of answering it correctly. All item/person combinations are statistically independent. Such reasoning leads to (44).

It is straightforward to show that if {θ̂_i} maximize L_U(π, {θ_i}), then θ̂_i = θ̂_i′ for any two examinees, i and i′, with the same response pattern, x. Hence, we may denote the UML/JML estimate of θ_i as θ̂(x), and we may rewrite the UML/JML likelihood function as

L_U(π, θ(·)) = Σ_x n(x) log [Π_j P₀(θ(x); π_j)^{x_j} Q₀(θ(x); π_j)^{1−x_j}].    (45)

The objective of UML/JML estimation may be viewed as maximizing L_U(π, θ(·)) over all legitimate values of π and all functions, θ(·), that assign θ-values to response patterns (i.e., θ(x)).

We may put the MML likelihood function from (43) into a form that is similar to (45) as follows. The MML likelihood function is

L_M(π, F) = Σ_x n(x) log [∫ Π_j P₀(θ; π_j)^{x_j} Q₀(θ; π_j)^{1−x_j} dF(θ)].    (46)

Now the integrand in (46),

Π_j P₀(θ; π_j)^{x_j} Q₀(θ; π_j)^{1−x_j},    (47)

is a smooth function of θ, and its integral may be viewed as an expectation with respect to the distribution, F. Hence, for each value of x, π and F, there is, by the mean-value theorem, a value θ(x; π, F) such that

∫ Π_j P₀(θ; π_j)^{x_j} Q₀(θ; π_j)^{1−x_j} dF(θ) = Π_j P₀(θ(x; π, F); π_j)^{x_j} Q₀(θ(x; π, F); π_j)^{1−x_j}.    (48)

If we substitute (48) into (46), we obtain

L_M(π, F) = Σ_x n(x) log [Π_j P₀(θ(x; π, F); π_j)^{x_j} Q₀(θ(x; π, F); π_j)^{1−x_j}].    (49)

Hence, we see that the relation between UML/JML and MML is that they both may be viewed as maximizing the same function, namely L_U(π, θ(·)). In the case of UML/JML, θ(·) is free to vary over all functions of response vectors during this maximization, whereas in MML, θ(·) is restricted to be a "mean-value" point, θ(·; π, F), for some allowable distribution function F. Hence, we may view UML/JML estimation as MML estimation in which certain complicated constraints on θ(x) are relaxed. This agrees with de Leeuw and Verhelst's (1986) reference to UML as "unconstrained" maximum likelihood.

There are at least two benefits to viewing UML/JML as an approximation to MML. The first is the realization that the maximized UML/JML likelihood is larger than the maximized MML likelihood, because a constrained maximum must be smaller than an unconstrained one. I suspect that this results in overly optimistic estimates of standard errors for item parameters from UML/JML programs. Secondly, we may ask


how the UML/JML θ̂(x) differs from the θ(x; π̂, F̂) that would be obtained from an MML program that actually implemented the constrained maximization--a fairly unlikely implementation given the complexity of the constraints. This question is fairly easy to answer. The UML/JML θ̂(x) will maximize, in θ, the function

Π_j P₀(θ; π̂_j)^{x_j} Q₀(θ; π̂_j)^{1−x_j},    (50)

where π̂_j is the UML/JML estimate of π_j. The MML θ(x; π̂, F̂) is a point along the θ axis where the function in (50) takes on its mean value with respect to the MML estimates F̂ and π̂. In general, θ̂(x) and θ(x; π̂, F̂) will differ. However, when the number of items, J, is large, the function (50) will be very peaked; in this case, θ̂(x) and θ(x; π̂, F̂) will be nearly equal. Thus, we expect UML/JML and MML to yield similar estimates of π as J → ∞. There is some empirical evidence to support this in Mislevy and Stocking (1989).
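The peakedness claim can be illustrated numerically: for a Rasch model, the width of the region where (50) exceeds half its maximum shrinks roughly like 1/√J, which is why the mode and the mean-value point come together for long tests. (Python sketch; the difficulty spread and the choice of response pattern are illustrative.)

```python
import numpy as np

def lgt(z):
    return 1.0 / (1.0 + np.exp(-z))

# Fine grid over theta, wide enough to contain the half-maximum region
theta = np.linspace(-6, 6, 12001)

def halfwidth(J):
    # Rasch IRFs with difficulties spread over [-1.5, 1.5]; the examinee
    # answers the easier half of the items correctly
    b = np.linspace(-1.5, 1.5, J)
    x = (b < 0).astype(float)
    P = lgt(theta[:, None] - b[None, :])
    logf = (x * np.log(P) + (1 - x) * np.log(1 - P)).sum(axis=1)
    f = np.exp(logf - logf.max())          # the function (50), scaled to max 1
    inside = theta[f >= 0.5]
    return inside[-1] - inside[0]

# The function (50) concentrates as J grows (roughly like 1/sqrt(J))
assert halfwidth(50) < halfwidth(5) / 2
```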

Relationship of CML to MML

Conditional maximum likelihood estimation is a general technique for optimal estimation of parameters in the presence of many nuisance parameters (Andersen, 1970). In the context of IRT models, the application of CML has been to the estimation of item parameters for the Rasch model in the presence of individual ability parameters; that is, the UML likelihood, L_U, defined in (44). I will give an alternative interpretation of the CML likelihood function to show that it, too, can be viewed as an approximation to the MML likelihood (46). I will do this in a context more general than the Rasch model, but with the realization that the Rasch model provides the only known application of these ideas in an IRT context.

Let T(x) be any real (or vector-valued) function of response vectors. In the Rasch model case, T(x) = x₊ = Σ_{j=1}^J x_j. Let s_t be the probability function of T,

s_t = s_t(π, F) = Prob (T(X) = t) = Σ_{x: T(x)=t} p(x).    (51)

Also, let r_t be the sample frequency of T = t,

r_t = Σ_{x: T(x)=t} n(x).    (52)

Finally, let the conditional distribution of x given T(x) = t be denoted

p(x | T = t) = p(x | T = t, π, F) = Prob (X = x | T(X) = t),    (53)

where X is the random 0/1 vector with distribution p(x). We have the following basic identity,

p(x) = p(x | T = t) s_t,    (54)

and hence, the MML likelihood function may be written as

L_M(π, F) = Σ_x n(x) log (p(x | T = t) s_t)

= Σ_t r_t log s_t(π, F) + Σ_x n(x) log p(x | T = t, π, F).    (55)


We write the two terms on the right hand side of (55) as

L_C(π, F) = Σ_t Σ_{x: T(x)=t} n(x) log p(x | T = t, π, F),    (56)

and

L*(π, F) = Σ_t r_t log s_t(π, F).    (57)

And hence, we have

L_M(π, F) = L_C(π, F) + L*(π, F).    (58)

Now comes the useful part. Suppose that conditioning on T eliminates F from the conditional distribution of x, and that T has a marginal distribution that does not depend on π:

CML1: p(x | T = t, π, F) = p(x | T = t, π),    (59)

and

CML2: s_t(π, F) = s_t(F).    (60)

If both CML1 and CML2 hold, then the MML likelihood, L_M(π, F), splits into two terms via (58),

L_M(π, F) = L_C(π) + L*(F).    (61)

When this happens, the MML estimate of π can be obtained by maximizing the CML likelihood L_C(π), and MML estimates of F can be obtained by maximizing the "marginal" likelihood function L*(F).

For the Rasch model, in which F is an arbitrary distribution function, CML1 holds exactly, but CML2 only holds approximately. Hence, the version of (61) that obtains, in practice, for the Rasch model is

L_M(π, F) = L_C(π) + L*(π, F).    (62)

I regard CML estimation as approximate MML estimation in which L*(π, F) is ignored in (62). De Leeuw and Verhelst (1986) show that this approximation improves as J → ∞.
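For the Rasch model, CML1 can be verified directly: the conditional distribution of x given x₊ = t reduces to a ratio of products of ε_j = exp(−b_j) to an elementary symmetric function, whatever F may be. The sketch below (illustrative Python; the item difficulties and the two discrete ability distributions are arbitrary choices) checks this for two very different choices of F:

```python
import itertools

import numpy as np

def lgt(z):
    return 1.0 / (1.0 + np.exp(-z))

b = np.array([-0.5, 0.2, 1.0])           # Rasch difficulties: P_j = LGT(theta - b_j)
J = len(b)
eps = np.exp(-b)
patterns = list(itertools.product([0, 1], repeat=J))

def p_x(F_points, F_weights):
    # marginal p(x) of (9) under a discrete ability distribution F
    P = lgt(F_points[:, None] - b[None, :])
    out = {}
    for x in patterns:
        xa = np.array(x)
        out[x] = float(np.sum(F_weights * np.prod(P**xa * (1 - P)**(1 - xa), axis=1)))
    return out

def conditional(px, t):
    tot = sum(v for x, v in px.items() if sum(x) == t)
    return {x: v / tot for x, v in px.items() if sum(x) == t}

# Two very different ability distributions
c1 = conditional(p_x(np.array([-2.0, 0.0, 2.0]), np.array([0.3, 0.4, 0.3])), t=1)
c2 = conditional(p_x(np.array([0.5]), np.array([1.0])), t=1)

# CML1: both match prod_j eps_j^{x_j} / gamma_1(eps), with gamma_1 = sum_j eps_j
gamma1 = sum(np.prod(eps[list(s)]) for s in itertools.combinations(range(J), 1))
for x in c1:
    cml = np.prod(eps**np.array(x)) / gamma1
    assert np.isclose(c1[x], cml) and np.isclose(c2[x], cml)
```

The conditional probabilities are identical under both choices of F, which is exactly why the CML likelihood L_C is free of F for the Rasch model.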

In summary, UML/JML and CML may both be viewed as approximations to MML in which either a constraint on the maximization is ignored or some part of the likelihood function is forgotten.

6. Estimates of Ability or of Ability Predictors?

The stochastic subject point of view gives rise to the log likelihood function, $L_U(\pi, \{\theta_i\})$, given in (44), whereas the random sampling point of view leads to the log likelihood function, $L_M(\pi, F)$, given in (46). In the stochastic subject formulation, an examinee's ability is a parameter, $\theta_i$, to be estimated, whereas in the random sampling formulation, the distribution of ability, $F$, is a "function-valued parameter" to be estimated. From the random sampling perspective, $\theta$ in (46) is simply a variable of integration and cannot be estimated.

It was observed, in section 5, that the ability estimates that emerge from maximizing $L_U(\pi, \{\theta_i\})$ may be viewed as a function $\hat{\theta}(x)$ that assigns $\theta$-values to response vectors, $x$. This function depends on the data, $\{n(x)\}$, only through the item parameter estimates, $\hat{\pi}$, so we may denote it $\hat{\theta}(x) = \hat{\theta}(x; \hat{\pi})$. As the number of examinees, $N$, increases, $\hat{\pi}$ converges to some value $\pi^*$ and $\hat{\theta}(x)$ will converge to the function $\theta^*(x) = \hat{\theta}(x; \pi^*)$, assuming sufficient smoothness of the IRFs, and so forth. When $n(x) = 0$, $\hat{\theta}(x)$ is undetermined by the UML/JML likelihood, (45). As $N \to \infty$, the probability that $n(x) = 0$ goes to zero for any fixed $x$. Haberman (1977) discusses a relevant asymptotic theory for this problem.

As stated earlier, from the random sampling perspective, it makes no sense to "estimate" $\theta$. Rather, it is the distribution of $\theta$ over the population of examinees, $F$, that can be estimated. Is there some way to incorporate the idea of estimating $\theta$ into the random sampling perspective? I believe there is, and the key idea is that of an ability predictor.

Suppose $p(x) = p(x; \pi, F)$ is known and is specified by a parametric or semiparametric IRT model in which all the item parameters are collected into a single vector parameter, $\pi$, and the ability distribution is denoted by $F$. We may consider the posterior ability distribution given the response vector $x$,

$$ dF(\theta \mid x) = \frac{\displaystyle \prod_{j=1}^{J} P(\theta; \pi_j)^{x_j} Q(\theta; \pi_j)^{1 - x_j} \, dF(\theta)}{p(x; \pi, F)}. \tag{63} $$

The posterior distribution of $\theta$ summarizes our knowledge of $\theta$ from the response vector of a randomly selected examinee from $C$.
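For a discretized $F$, the posterior (63) is straightforward to evaluate on a grid. The sketch below is illustrative only: the logistic IRFs, the item parameters $a_j$, $b_j$, and the standard normal $F$ are all assumptions, not quantities from the paper.

```python
import numpy as np

theta = np.linspace(-4.0, 4.0, 81)       # grid carrying a discrete F
f = np.exp(-0.5 * theta ** 2)
f = f / f.sum()                          # roughly standard normal weights

a = np.array([1.0, 1.5, 0.8])            # assumed discriminations
b = np.array([-0.5, 0.0, 1.0])           # assumed difficulties

def posterior(x):
    """dF(theta | x): the likelihood of x times F, normalized by p(x; pi, F)."""
    like = np.ones_like(theta)
    for j in range(len(x)):
        pj = 1.0 / (1.0 + np.exp(-a[j] * (theta - b[j])))
        like = like * (pj if x[j] else 1.0 - pj)
    post = like * f
    return post / post.sum()

post = posterior((1, 1, 0))
print(float(np.sum(theta * post)))       # posterior mean of theta given x
```

The normalizing sum plays the role of $p(x; \pi, F)$ in the denominator of (63).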

An ability predictor is a function that maps response vectors into $\theta$-values. An optimal ability predictor is one that minimizes the expectation of a loss function over the posterior distribution of $\theta$ given $x$. More carefully, suppose $L(\theta, a)$ is a loss function that measures the prediction loss when $\theta$ has a given value and the value of the predictor of $\theta$ is $a$. Examples of loss functions are:

$$ L(\theta, a) = (\theta - a)^2, \tag{64} $$

$$ L(\theta, a) = |\theta - a|, \tag{65} $$

or

$$ L(\theta, a) = \begin{cases} 0 & \text{if } |\theta - a| < \epsilon, \\ 1 & \text{if } |\theta - a| \ge \epsilon. \end{cases} \tag{66} $$

Suppose that $a_L(x)$ is the optimal ability predictor that minimizes the posterior expectation of $L(\theta, a)$ for each $x$, given as

$$ \int L(\theta, a_L(x)) \, dF(\theta \mid x) = \inf_a \int L(\theta, a) \, dF(\theta \mid x). \tag{67} $$

In this setting, we can interpret various "ability estimators" as optimal ability predictors. If $L(\theta, a)$ is given by (64), then $a_L(x)$ is the "expectation a posteriori" or EAP ability estimator,

$$ a_L(x) = \int \theta \, dF(\theta \mid x) = E(\theta \mid X = x). \tag{68} $$


If $L(\theta, a)$ is given by (65), then $a_L(x)$ is a posterior median of $dF(\theta \mid x)$. Finally, if $L(\theta, a)$ is given by (66) for a suitable $\epsilon$, then $a_L(x)$ is a type of posterior mode of $dF(\theta \mid x)$, and is, therefore, closely related to the usual "maximum likelihood estimator" of $\theta$, obtained by maximizing $L_U$.

Note that $a_L(x)$ must depend on both $\pi$ and $F$,

$$ a_L(x) = a_L(x; \pi, F), \tag{69} $$

and hence, the problem of estimating ability becomes, in the sampling theory framework, the problem of estimating optimal ability predictors. Once estimates of $\pi$ and $F$ are in hand, the obvious estimate of $a_L(x)$ is

$$ \hat{a}_L(x) = a_L(x; \hat{\pi}, \hat{F}). \tag{70} $$
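On a grid, the three optimal predictors corresponding to the loss functions (64)-(66) are the posterior mean, a posterior median, and a posterior mode, respectively. The posterior used below is an arbitrary illustrative shape, not derived from the paper.

```python
import numpy as np

theta = np.linspace(-4.0, 4.0, 801)
post = np.exp(-0.5 * (theta - 0.7) ** 2 / 0.25)   # assumed posterior shape
post = post / post.sum()

# Squared-error loss (64): the predictor is the posterior mean (EAP), as in (68).
eap = float(np.sum(theta * post))

# Absolute-error loss (65): the predictor is a posterior median.
cdf = np.cumsum(post)
median = float(theta[int(np.searchsorted(cdf, 0.5))])

# 0-1 loss (66) with small epsilon: the predictor is a posterior mode.
mode = float(theta[int(np.argmax(post))])

print(eap, median, mode)   # all close to 0.7 for this symmetric posterior
```

For a symmetric unimodal posterior the three predictors coincide; they separate when the posterior is skewed, which is the typical case for short tests.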

How should optimal ability predictors be compared? The conditional expectation of $a_L(X)$ given $\theta$ is

$$ \mu_L(\theta) = E(a_L(X) \mid \theta) = \sum_x a_L(x) p(x \mid \theta), \tag{71} $$

where $p(x \mid \theta)$ is given by (13). The bias of $a_L$ is, therefore, defined by

$$ \mathrm{BIAS}_L(\theta) = \mu_L(\theta) - \theta = E(a_L(X) - \theta \mid \theta). \tag{72} $$

The mean-square error of $a_L$ is

$$ \mathrm{MSE}_L(\theta) = E((a_L(X) - \theta)^2 \mid \theta). \tag{73} $$

The average of $\mathrm{MSE}_L(\theta)$ over $F$ is minimized for the $a_L(x)$ defined by (68).
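The conditional mean (71), bias (72), and MSE (73) of a predictor can be computed exactly in a small model by summing over all $2^J$ response vectors. The sketch below uses the EAP predictor with three assumed Rasch items and a discretized standard normal $F$; none of these numbers come from the paper.

```python
import numpy as np
from itertools import product

theta_grid = np.linspace(-4.0, 4.0, 81)
f = np.exp(-0.5 * theta_grid ** 2)
f = f / f.sum()                          # discretized N(0, 1) ability weights
b = np.array([-1.0, 0.0, 1.0])           # assumed Rasch difficulties

def p_x_given_theta(x, th):
    """p(x | theta); works for a scalar th or for the whole grid."""
    p = 1.0
    for j, xj in enumerate(x):
        pj = 1.0 / (1.0 + np.exp(-(th - b[j])))
        p = p * (pj if xj else 1.0 - pj)
    return p

def eap(x):
    """EAP predictor a_L(x) of (68), via the grid posterior."""
    post = p_x_given_theta(x, theta_grid) * f
    return float(np.sum(theta_grid * post) / post.sum())

def bias_and_mse(th):
    """BIAS_L(theta) of (72) and MSE_L(theta) of (73) for the EAP predictor."""
    xs = list(product([0, 1], repeat=len(b)))
    px = np.array([p_x_given_theta(x, th) for x in xs])
    aL = np.array([eap(x) for x in xs])
    mu = float(np.sum(px * aL))          # mu_L(theta) of (71)
    return mu - th, float(np.sum(px * (aL - th) ** 2))

print(bias_and_mse(0.0))   # bias vanishes at theta = 0 by symmetry here
print(bias_and_mse(2.0))   # EAP shrinks toward 0, so the bias is negative
```

The negative bias at large $\theta$ is the familiar shrinkage of posterior-mean predictors toward the center of $F$.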

7. Final Comments

My purpose has been to contrast the random sampling view of IRT models with the stochastic subject view and to show how the former provides a solid foundation for statistical inference in IRT modeling. So far, I have discussed model identification, likelihood functions, and ability predictors. I will end this section by briefly touching on a few more topics that the random sampling view can illuminate.

Robustness

In studying the robustness of a statistical procedure, we assess its sensitivity to violations of the model. A model, $M \subseteq \Omega_J$, is wrong when $p$, the true probability vector that generates the sample data, is not in $M$. When $p$ is not in $M$ but we act as though it is and estimate $p$ by the maximum likelihood estimate $\hat{p}_M$ that assumes $p$ is in $M$, then $\hat{p}_M$ will, in large samples, converge to a point $p_M$ that is the "closest" point in $M$ to the true $p$. This notion of "closeness" is that of information distance,

$$ I(p, p_M) = \inf_{q \in M} I(p, q), \tag{74} $$

where $I(p, q)$ is the information distance function

$$ I(p, q) = \sum_x p(x) \log \bigl( p(x)/q(x) \bigr). \tag{75} $$
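The information distance (75) and the projection (74) can be illustrated with a deliberately simple model $M$. The "model" below, two independent items with a common success probability, is an invented toy chosen so that the closest point $p_M$ is easy to find by a grid search; it is not an IRT model from the paper.

```python
import numpy as np

def info_dist(p, q):
    """I(p, q) = sum_x p(x) log(p(x)/q(x)); terms with p(x) = 0 contribute 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def model_point(q):
    """A point of M: two independent items, both with P(correct) = q.
    Coordinates are ordered x = (0,0), (0,1), (1,0), (1,1)."""
    return np.array([(1 - q) ** 2, (1 - q) * q, q * (1 - q), q ** 2])

# A true p that is NOT in M: the two item responses are strongly correlated.
p = np.array([0.4, 0.1, 0.1, 0.4])

grid = np.linspace(0.01, 0.99, 99)
dists = [info_dist(p, model_point(q)) for q in grid]
q_star = float(grid[int(np.argmin(dists))])
print(q_star)   # prints 0.5: p_M matches p's expected proportion correct
```

Note that $I(p, p_M) > 0$ here, which is exactly the situation in which the robustness question arises: the fitted $\hat{p}_M$ converges to $p_M$, not to $p$.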


If $M$ is specified by an IRT model, but $p$ is not in $M$, the main question of robustness concerns how conclusions based on $\hat{p}_M$ or $p_M$ relate to what these conclusions would have been had they been based on $p$. For example, what does an ability predictor predict if the IRT model is wrong? Is it a meaningful "score" for the test?

Bayesian Methods

There are two different types of Bayesian models that are easily differentiated in this framework. In the first case, there is a basic model $M \subseteq \Omega_J$ and the prior distribution for $p$ is concentrated only on $M$. This case arises in parametric IRT models in which priors are put on all item and ability distribution parameters. In the second case, the prior for $p$ puts positive probability over all of $\Omega_J$, not just on $M$. In this second case there might be a basic model $M$ at which the prior for $p$ is "centered", in the sense that the prior density is highest over $M$. The advantage of the second type of prior is that it cannot be shown to be wrong when $N$ is large. One way to specify this second type of prior is by adding random components to loglinear models for $p$, as in Leonard (1975). I suspect that this second type of prior leads to a more satisfactory Bayesian analysis of IRT models than the first type does. It is a promising area for future work.

Computerized Adaptive Tests

In a computerized adaptive test (CAT), items are selected adaptively for each examinee and presented sequentially so as to obtain as efficient an estimate of the examinee's ability as possible, based on as few test items as possible. CATs are based on the idea that items that are too hard or too easy for an examinee are wasted insofar as efficient ability estimation is concerned (Wainer, 1990). One way that CATs may be put in terms of the sampling framework of this paper is the following. Let $x$ denote the response vector for all the items in the item pool. A CAT divides $x$ into two portions, denoted $(\mathrm{cat}(x), \mathrm{rest}(x))$, such that $\mathrm{cat}(x)$ are the item responses that are actually observed in the CAT and $\mathrm{rest}(x)$ are those that are not observed. Note that in this formulation, I have assumed that the items in an individual's CAT are selected deterministically from the pool based on the prior responses given by the examinee. Adding "randomization" to the successive item choices (to avoid security problems, etc.) can also be accommodated in this formulation. The idea of a CAT is to define $\mathrm{cat}(x)$ in such a way that, for a randomly selected response vector $X$, the distribution of $\mathrm{rest}(X)$ given $\mathrm{cat}(X)$ is as "uninformative" as possible. One way for $\mathrm{rest}(X)$ to be uninformative is for it to have a distribution concentrated on a single vector. This will happen for very able examinees, for whom $\mathrm{rest}(X)$ will consist entirely of correct responses. A more general way for $\mathrm{rest}(X)$ to be uninformative is for the conditional variance of the number of correct responses in $\mathrm{rest}(X)$, given $\mathrm{cat}(X)$, to be as small as possible. These ideas can be used to develop a theory of CATs that is not highly tied to IRT models.
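The conditional-variance criterion just described can be sketched numerically. The example below is hypothetical: a four-item Rasch pool with a fixed (non-adaptive) two-item "cat" section, invented difficulties, and a discretized normal $F$; a real CAT would choose $\mathrm{cat}(x)$ adaptively.

```python
import numpy as np
from itertools import product

theta = np.linspace(-4.0, 4.0, 81)
f = np.exp(-0.5 * theta ** 2)
f = f / f.sum()                          # discretized N(0, 1) ability weights
b = np.array([-1.0, 0.0, 0.5, 1.5])      # invented pool difficulties
# Items 0-1 form cat(x); items 2-3 form rest(x) in this fixed split.

def p_joint(x):
    """p(x) for a full response vector over the pool."""
    like = np.ones_like(theta)
    for j, xj in enumerate(x):
        pj = 1.0 / (1.0 + np.exp(-(theta - b[j])))
        like = like * (pj if xj else 1.0 - pj)
    return float(np.sum(f * like))

def cond_var_rest(cat):
    """Var(number correct in rest(X) | cat(X) = cat)."""
    rests = list(product([0, 1], repeat=2))
    pr = np.array([p_joint(tuple(cat) + r) for r in rests])
    pr = pr / pr.sum()                   # conditional distribution of rest(X)
    s = np.array([sum(r) for r in rests])
    m = float(np.sum(pr * s))
    return float(np.sum(pr * (s - m) ** 2))

for cat in product([0, 1], repeat=2):
    print(cat, round(cond_var_rest(cat), 4))
```

A CAT-style comparison would evaluate this variance under alternative splits of the pool and prefer the split (or adaptive rule) that leaves $\mathrm{rest}(X)$ least informative.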

References

Andersen, E. B. (1970). Asymptotic properties of conditional maximum likelihood estimators. Journal of the Royal Statistical Society, Series B, 32, 283-301.

Andersen, E. B. (1980). Discrete statistical models with social science applications. Amsterdam: North Holland.

Birch, M. W. (1964). A new proof of the Pearson-Fisher theorem. Annals of Mathematical Statistics, 35, 817-824.

Birnbaum, Z. W. (1967). Statistical theory for logistic mental test models with a prior distribution of ability (ETS Research Bulletin RB-67-12). Princeton, NJ: Educational Testing Service.

Bock, R. D. (1967, March). Fitting a response model for n dichotomous items. Paper read at the Psychometric Society Meeting, Madison, WI.

Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46, 443-459.

Bock, R. D., & Lieberman, M. (1970). Fitting a response model for n dichotomously scored items. Psychometrika, 35, 179-197.

Bush, R. R., & Mosteller, F. (1955). Stochastic models for learning. New York: Wiley.

Cressie, N., & Holland, P. W. (1983). Characterizing the manifest probabilities of latent trait models. Psychometrika, 48, 129-141.

de Leeuw, J., & Verhelst, N. (1986). Maximum likelihood estimation in generalized Rasch models. Journal of Educational Statistics, 11, 183-196.

Follmann, D. A. (1988). Consistent estimation in the Rasch model based on nonparametric margins. Psychometrika, 53, 553-562.

Guttman, L. (1941). The quantification of a class of attributes: A theory and method of scale construction. In P. Horst et al. (Eds.), The prediction of personal adjustment (pp. 319-348). New York: Social Science Research Council.

Guttman, L. (1950). The basis for scalogram analysis. In S. A. Stouffer et al. (Eds.), Studies in social psychology in World War II, Vol. 4, Measurement and prediction (pp. 60-90). Princeton, NJ: Princeton University Press.

Haberman, S. J. (1977). Maximum likelihood estimates in exponential response models. Annals of Statistics, 5, 815-841.

Holland, P. W. (1981). When are item response models consistent with observed data? Psychometrika, 46, 79-92.

Holland, P. W. (1990). The Dutch Identity: A new tool for the study of item response models. Psychometrika, 55, 5-18.

Holland, P. W., & Rosenbaum, P. R. (1986). Conditional association and unidimensionality in monotone latent variable models. Annals of Statistics, 14, 1523-1543.

Junker, B. W. (1988). Statistical aspects of a new latent trait model. Unpublished doctoral dissertation, University of Illinois at Urbana-Champaign, Department of Statistics.

Junker, B. W. (1989). Conditional association, essential independence and local independence. Unpublished manuscript, University of Illinois at Urbana-Champaign, Department of Statistics.

Junker, B. W. (in press). Essential independence and likelihood-based ability estimation for polytomous items. Psychometrika.

Lawley, D. N. (1943). On problems connected with item selection and test construction. Proceedings of the Royal Society of Edinburgh, 61, 273-287.

Lazarsfeld, P. F. (1950). The logical and mathematical foundations of latent structure analysis. In S. A. Stouffer et al. (Eds.), Studies in social psychology in World War II, Vol. 4, Measurement and prediction (pp. 362-412). Princeton, NJ: Princeton University Press.

Lazarsfeld, P. F. (1959). Latent structure analysis. In S. Koch (Ed.), Psychology: A study of a science, Volume 3 (pp. 476-543). New York: McGraw-Hill.

Leonard, T. (1975). Bayesian estimation methods for two-way contingency tables. Journal of the Royal Statistical Society, Series B, 37, 23-37.

Levine, M. V. (1989). Ability distribution, pattern probabilities and quasidensities (Final Report). Champaign, IL: University of Illinois, Model Based Measurement Laboratory.

Lewis, C. (1985). Developments in nonparametric ability estimation. In D. J. Weiss (Ed.), Proceedings of the 1982 IRT/CAT conference (pp. 105-122). Minneapolis, MN: University of Minnesota.

Lewis, C. (1990). A discrete, ordinal IRT model. Paper presented at the Annual Meeting of the American Educational Research Association, Boston, MA.

Lindsay, B., Clogg, C. C., & Grego, J. (in press). Semiparametric estimation in the Rasch model and related exponential response models, including a simple latent class model for item analysis. Journal of the American Statistical Association.

Lord, F. M. (1952). A theory of test scores. Psychometrika Monograph No. 7, 17(4, Pt. 2).

Lord, F. M. (1967). An analysis of the Verbal Scholastic Aptitude Test using Birnbaum's three-parameter logistic model (ETS Research Bulletin RB-67-34). Princeton, NJ: Educational Testing Service.

Lord, F. M. (1974). Estimation of latent ability and item parameters when there are omitted responses. Psychometrika, 39, 247-264.

Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.

Mislevy, R., & Stocking, M. (1989). A consumer's guide to LOGIST and BILOG. Applied Psychological Measurement, 13, 57-75.

Oakes, D. (1988). Semi-parametric models. In S. Kotz & N. L. Johnson (Eds.), Encyclopedia of statistical science, Volume 8 (pp. 367-369). New York: Wiley.

Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen: Nielson and Lydiche (for Danmarks Paedagogiske Institut).

Rosenbaum, P. R. (1984). Testing the conditional independence and monotonicity assumptions of item response theory. Psychometrika, 49, 425-436.

Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph No. 17, 33(4, Pt. 2).

Samejima, F. (1972). A general model for free response data. Psychometrika Monograph No. 18, 34(4, Pt. 2).

Samejima, F. (1983). Some methods and approaches of estimating the operating characteristics of discrete item responses. In H. Wainer & S. Messick (Eds.), Principals [sic] of modern psychological measurement (pp. 154-182). Hillsdale, NJ: Lawrence Erlbaum Associates.

Stout, W. (1987). A nonparametric approach for assessing latent trait unidimensionality. Psychometrika, 52, 589-617.

Stout, W. (1990). A new item response theory modeling approach with applications to unidimensionality assessment and ability estimation. Psychometrika, 55, 293-325.

Thissen, D. (1982). Marginal maximum likelihood estimation for the one-parameter logistic model. Psychometrika, 47, 175-186.

Tjur, T. (1982). A connection between Rasch's item analysis model and a multiplicative Poisson model. Scandinavian Journal of Statistics, 9, 23-30.

Tsao, R. (1967). A second order exponential model for multidimensional dichotomous contingency tables with applications in medical diagnosis. Unpublished doctoral dissertation, Harvard University, Department of Statistics.

Tucker, L. R. (1946). Maximum validity of a test with equivalent items. Psychometrika, 11, 1-14.

Wainer, H., et al. (1990). Computerized adaptive testing: A primer. Hillsdale, NJ: Lawrence Erlbaum Associates.

Wright, B. D. (1977). Solving measurement problems with the Rasch model. Journal of Educational Measurement, 14, 97-116.

Wright, B. D., & Douglas, G. A. (1977). Best procedures for sample-free item analysis. Applied Psychological Measurement, 1, 281-295.

Wright, B. D., & Stone, M. H. (1979). Best test design. Chicago: Mesa Press.