
The Up-and-Down Method in Computerized

Adaptive Testing ∗

Hung Chen, National Taiwan University

Cheng-Der Fuh, Academia Sinica

Jia-Fang Yang, National Taiwan University

April 17, 2002

Abstract

As an item selection method, recursive maximum likelihood estimation (R-MLE) has been commonly used in computerized adaptive tests (CAT). Based on the idea of maximizing item information, it selects the item that maximizes the Fisher information at the currently estimated trait level (θ), and it also yields an estimate of the examinee's ability θ at the end of the test. Although R-MLE has been widely studied and used in item response theory, two major issues arise in application. First, the need to repeatedly compute maximum likelihood estimates (MLE) makes it computationally inefficient. Second, the uncertainty in the likelihood function due to model misspecification and data uncertainty, especially at early stages, makes it statistically less efficient. In this paper, we introduce the up-and-down method, as an approximation of the R-MLE, to select test items in CAT. More specifically, we propose an a-stratified multistage up-and-down method and investigate the statistical properties of the resulting estimate of the ability parameter θ. A bootstrap method for Markov chains is used to approximate the confidence interval for θ. It is argued that the up-and-down selection procedure should be used at least at the early stages of a test, or when the number of test items is not large. Results from pilot simulation studies show that there is little difference between the up-and-down method and the R-MLE method in terms of the accuracy of point estimation and the construction of confidence intervals for the trait level θ. Hence, the up-and-down method can be regarded as an alternative to the classical R-MLE, with the advantages of easy implementation and a more tractable statistical analysis.

Key Words: Up-and-down method, computerized adaptive tests, bootstrap, Markov chains, recursive maximum likelihood estimate, experimental design.

∗This research was supported in part by grants from the National Science Council of the Republic of China.


1 Introduction

In the traditional paper-and-pencil (P&P) test, all examinees take the same test without

considering the difference among their abilities. Hence, each examinee may be required to

answer some test items that do not match the examinee’s ability. It is intuitively clear that

an examinee is measured most effectively when the test items are neither too difficult nor

too easy for him or her. Computerized adaptive testing (CAT) was therefore proposed in Lord (1970, 1971), Owen (1975), and Weiss (1976), among others, to construct such an individualized test. It is now implemented in the Graduate Record Examination (GRE), the Graduate Management Admission Test, and the National Council Licensure Examination for Nurses in the United States to obtain a better gauge of an examinee's ability.

To construct an individualized test, the ability (θ) estimate is updated after the ad-

ministration of each item, and the next optimal item is selected from an item bank until a

prespecified number of items is administered. Items are selected to match the examinee’s

estimated θ according to an item response theory (IRT) model that is assumed to describe

an examinee's response behavior. Setting aside nonstatistical issues such as content balancing, the standard approach to item selection in CAT has been to select, as the next item administered, the item with the maximum Fisher item information at the examinee's current estimated ability level (Lord, 1980, pp. 151-153). This method is the so-called recursive maximum likelihood estimation (R-MLE) method. The major advantage of CAT is that it provides more efficient trait estimates with fewer items than are required in conventional tests. However, methodological as well as theoretical developments in CAT appear to be rather limited.

Although R-MLE has been widely studied and used in item response theory, two major issues arise in application. First, the need to repeatedly compute maximum likelihood estimates (MLE) makes it computationally inefficient. Second, the uncertainty in the likelihood function due to model misspecification and data uncertainty, especially at early stages, makes it statistically less efficient. Therefore, in this paper we consider a nonparametric item selection rule, the up-and-down method, which originated in Dixon and Mood (1948), to avoid the repeated computation of the MLE just mentioned and the issue of model dependency. This selection rule was also considered in Lord (1971) as an alternative to another nonparametric item selection rule, the Robbins-Monro method, to avoid preparing and storing too many test items.

To describe how an examinee's ability determines his or her response behavior on a test item, an IRT model starts with a mathematical model of how the response depends on the level of ability; this relationship is given by the item response function. The reader


is referred to Lord (1980) for details. In this paper, we consider only dichotomous item responses, for which the item response function is defined as the probability p (or p(θ)) of getting a correct response to the item, where θ is the so-called ability parameter. It is expected that

p(θ) is a monotone increasing function of θ. Two forms of item response functions have been

commonly used, the logistic and the probit models. It is pointed out in Birnbaum (1968)

that a properly scaled logistic model differs from the probit model by less than 1% and that

the former is much easier to work with mathematically. Furthermore, these two models provide

similar results for most practical works (Lord, 1980). Therefore, we only consider the logistic

model in this paper. The two-parameter logistic model (2-PLM) posits that the distribution

of the random observation Y , representing a 0 (incorrect answer) or 1 (correct answer) score

on a test item, has the form

$$p(\theta) := P(Y = 1\mid\theta) = \frac{e^{a(\theta-b)}}{1+e^{a(\theta-b)}}, \qquad (1)$$

where a and b are known item parameters for discrimination and difficulty, respectively. The

1-PLM can be obtained by setting a = 1 for all items.
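As a concrete illustration of (1), the short sketch below (in Python; the paper's own simulations were written in C++, so this is not the authors' code) evaluates the 2-PLM item response function and the standard Fisher item information a²p(θ)(1−p(θ)); the item parameters in the example are arbitrary.

```python
import math

def p_correct(theta, a, b):
    """Item response function (1): probability of a correct answer under the 2-PLM."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Standard Fisher item information of a 2-PLM item: a^2 * p * (1 - p)."""
    p = p_correct(theta, a, b)
    return a ** 2 * p * (1.0 - p)

if __name__ == "__main__":
    # Hypothetical item with discrimination a = 1.2 and difficulty b = 0.5,
    # evaluated at a few ability levels.
    for theta in (-1.0, 0.0, 0.5, 1.0):
        print(theta, p_correct(theta, 1.2, 0.5), item_information(theta, 1.2, 0.5))
```

The information peaks when θ equals the difficulty b, which is why matching b to the current ability estimate plays a central role in what follows.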

For CAT with R-MLE, it selects an item to maximize precision in estimating an ex-

aminee’s θ. Items with high a values have high information, provided that b is close to θ.

Consequently, items with high a values tend to be exposed more frequently than items with

low information. In fact, it was reported in Lord and Wingersky (1984) that a and b param-

eter estimates are often positively correlated. This phenomenon was confirmed again by Chang, Qian, and Yin (2001) in an analysis of 360 items from a retired item bank of a GRE quantitative test; they found the correlation between a and b to be 0.44.

For the purpose of controlling item exposure rate and simplifying test implementation,

we propose to use the up-and-down method as the test item selection method. For the up-and-down method, the selection of the next test item depends only on the response to the current test item. In contrast, the next test item selected by the R-MLE method depends on all past responses of the examinee. The proposed procedure can be described as follows.

0. Analyze the distribution of (a, b) of the item bank.

1. Determine the strategy for updating a and b. Two updating strategies will be considered.

Method 1. When a and b are not correlated, we update a and b alternately.

Method 2. When a and b are correlated, find r, the correlation between a and b in the item bank. Define a new parameter c_i = a + r(b_i − b), and update (a, b) through c.

2. Assume the kth test item has been administered.


2a. The level of the next test item, c, is increased by one unit when the response is 1; otherwise, it is decreased by one unit.

Method 1. The difficulty level of the next test item, b, is increased by one unit when the response is 1; otherwise, it is decreased by one unit.

Method 2. We start at a low discrimination level. The discrimination level of the next test item, a, is increased by one unit when two consecutive responses are both 1 or both 0; the difficulty level is kept the same at that point. Otherwise, we follow 2a.

3. Repeat Step 2 until the test is finished.

Let n_k be the number of observations at stage k for k = 1, · · · , K, and note that n_1 + · · · + n_K equals the test length. When the test is finished, the method of maximum likelihood is applied to estimate the examinee's ability parameter θ.
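A minimal sketch of the selection loop under Method 1 (difficulty moves one step up or down after each response), assuming a fixed step size, a simulated examinee at true ability theta_true, and a hypothetical list of stage lengths; carrying the current difficulty from one stage to the next and the per-stratum discrimination values are our simplifying assumptions, not the authors' implementation. The recorded (a, b, y) triples would later be fed to the likelihood of Section 2.

```python
import math
import random

def p_correct(theta, a, b):
    """2-PLM probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def up_and_down_stage(theta_true, a_k, n_k, b_start, step, rng):
    """One stage of the a-stratified up-and-down method (Method 1): after a
    correct response the next difficulty goes up one step, otherwise down."""
    b = b_start
    items = []
    for _ in range(n_k):
        y = 1 if rng.random() < p_correct(theta_true, a_k, b) else 0
        items.append((a_k, b, y))
        b = b + step if y == 1 else b - step
    return items, b

def run_test(theta_true, a_levels, stage_lengths, step=0.1, seed=0):
    """Administer K stages, one discrimination stratum per stage, carrying the
    current difficulty level from stage to stage (an illustrative choice)."""
    rng = random.Random(seed)
    data, b = [], 0.0          # difficulty of the first item is set to 0
    for a_k, n_k in zip(a_levels, stage_lengths):
        stage_items, b = up_and_down_stage(theta_true, a_k, n_k, b, step, rng)
        data.extend(stage_items)
    return data

# Example: three strata with increasing discrimination, 10 items each.
responses = run_test(theta_true=1.0, a_levels=[0.8, 1.0, 1.2], stage_lengths=[10, 10, 10])
```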

Earlier in the literature, the idea of a-stratified multistage computerized adaptive testing was proposed by Chang and Ying (1999), based on the R-MLE method. We refer to it as a-stratified multistage R-MLE CAT.

1. Partition the item bank into K levels according to the a-parameter values of items. The

first item stratum contains items with smallest a’s, the next stratum contains items

with the second smallest a's, and so on.

2. Accordingly, partition the test into K stages.

3. Select n_k items from the kth stratum by matching the item difficulty parameter b with the updated estimate of θ, then administer the items.

4. Repeat Step 3 for k = 1, 2, . . . , K.

This procedure groups items with similar a values together and, at each stage, selects items that maximize the Fisher information from the corresponding stratum. As a remark, for the 2-PLM, maximizing item information is equivalent to matching b with the current estimate of θ. This stratification would, therefore, decrease the exposure rates of high-a items and increase the exposure rates of low-a items. The reader is referred to Chang and Ying (1999) for details.

The up-and-down method just described can also be regarded as an approximation of the classical R-MLE in the following sense. First, we apply the idea of a moving block to calculate the MLE at each stage; that is, we use only the most recent m (for instance, m = 10) observations. The up-and-down method corresponds to choosing m = 1. Next, with only one item response used, the R-MLE would choose b as large as possible if the examinee's response is 1, and as small as possible if the response is 0. In this case,


we may put a prior distribution on b and then select the difficulty level according to this distribution. The up-and-down method corresponds to a degenerate prior with a fixed step size.

In contrast to the R-MLE method, the up-and-down method requires solving the likelihood equation only once. Its primary advantage over the R-MLE method is that it relieves the burden of repeatedly computing the MLE in the course of conducting the test. Although the local independence assumption is violated in this case due to the sequential design, the item selection can be described by a Markov chain (see Section 2 for details). Therefore, the statistical properties of the resulting estimate of θ are much easier to analyze than those based on the R-MLE method. Note that the theoretical basis for using maximum information to select items in CAT is that the R-MLE method should lead to a substantial gain in efficiency, so it is expected that the up-and-down estimator is not as statistically efficient as the R-MLE estimator. However, it has been reported in the literature that the expected efficiency gain may not be realized, because an estimate of θ is used in place of θ itself in calculating the information. Indeed, Chang and Ying (1996) argued that it may be advantageous not to use item information at the early stages of a CAT so as to avoid efficiency loss due to poor estimation of θ based on a small number of items. This motivates the simulation studies conducted in Section 3. Based on the results of those studies, the loss of statistical efficiency can be small.

The up-and-down method was first proposed in Dixon and Mood (1948) for testing the sensitivity of explosives to shock. In that setting, the experiment is conducted by dropping a weight on specimens of the same explosive mixture from various heights. In CAT, those heights correspond to the difficulty levels of the test items, and the sensitivity level to shock corresponds to the examinee's ability parameter. Further research on the up-and-down method can

be found in Wetherill (1963), Wetherill and Glazebrook (1986) and the references therein.

In this paper, we will focus our study on the up-and-down method in computerized

adaptive testing. We also address how the performance of the proposed method depends on the choice of step size and on the choice of the first few items. Since the allowed number of test items is not large, a bootstrap method is proposed to evaluate the variance of the resulting estimate.

To get a better understanding of the proposed a-stratified multistage computerized

adaptive testing, in Section 2, we describe the likelihood function and the associated esti-

mation induced by the up-and-down method. Additional theoretical study of the Markov chain induced by the up-and-down method, and of the asymptotic behavior of the MLE, is deferred to Section 5. Empirical pilot simulation studies comparing the R-MLE and the up-and-down method are reported in Section 3. In Section 4, we make concluding remarks and discuss possible further research.


2 Likelihood Function

Assume that the two parameters (a and b) of each item in the item bank are known, having been calibrated on a sample of examinees. In this section, we describe the asymptotic

behavior of the maximum likelihood estimate (MLE) of θ when the test items are selected

by the up-and-down method.

In item response theory, it is assumed that an examinee's responses to different items in a test are statistically independent. For the up-and-down method used in the kth stage of computerized adaptive testing, the difficulty level of the (N + 1)th test item is increased (or decreased) by one unit according to whether the Nth response is correct or not. Hence, the

contribution of likelihood function for the 2-PLM during the kth stage is given by

$$L_{n_k}(\theta) = f(X_0)\prod_{i=0}^{n_k} f(Y_i\mid X_i) = f(X_0)\prod_{i=0}^{n_k} [p_i(\theta)]^{Y_i}\,[1-p_i(\theta)]^{1-Y_i}, \qquad (2)$$

where f(X0) is determined by the design for choosing the difficulty level of the first test item

and

$$p_i(\theta) = P(Y_i = 1\mid\theta) = \frac{e^{a_k(\theta-b_i)}}{1+e^{a_k(\theta-b_i)}}.$$

Here X_i = b_i, and f(X_0) is independent of θ. Since the a values are similar within the kth stage of a-stratified multistage computerized adaptive testing, we will treat a as a fixed value (= 1) in each stage in the following discussion. For simplicity, the difficulty parameter b of the first test item is set to 0 from here on. In this case, f(X_0) = 1.

In contrast to the R-MLE, an estimate of θ is not needed during the selection of test items; we only need to calculate θ_n once, at the end of the test. Here θ_n is the maximum likelihood estimate based on the observations x_0, y_0, · · · , x_n, y_n, i.e., θ_n is the maximizer of the likelihood function $\prod_{k=1}^{K} L_{n_k}(\theta)$. Usually, there is no explicit solution for θ_n, and hence a numerical algorithm, such as the Newton-Raphson method, is employed to approximate its value.
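As an illustration of this single maximization at the end of the test, the sketch below applies Newton-Raphson to the 2-PLM log-likelihood implied by (2); the score and curvature expressions are the standard ones for the logistic model, and the starting value and tolerance are arbitrary choices.

```python
import math

def p_correct(theta, a, b):
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def mle_newton_raphson(items, theta0=0.0, tol=1e-8, max_iter=100):
    """Maximize the 2-PLM log-likelihood of (a_i, b_i, y_i) triples over theta.

    Score:     sum_i a_i * (y_i - p_i(theta))
    Curvature: -sum_i a_i**2 * p_i(theta) * (1 - p_i(theta))
    If all responses are identical the MLE does not exist (theta drifts to
    +/- infinity); max_iter bounds the search in that degenerate case."""
    theta = theta0
    for _ in range(max_iter):
        score, hessian = 0.0, 0.0
        for a, b, y in items:
            p = p_correct(theta, a, b)
            score += a * (y - p)
            hessian -= a * a * p * (1.0 - p)
        step = score / hessian          # Newton step (hessian < 0)
        theta -= step
        if abs(step) < tol:
            break
    return theta

# Example with three hypothetical items:
# mle_newton_raphson([(1.0, 0.0, 1), (1.0, 0.1, 0), (1.0, 0.0, 1)])
```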

Although the likelihood function given in (2) is identical to the likelihood function of a fixed design (in which X_0, X_1, . . . , X_n are determined before the administration of the test), it is apparent that Y_i depends on Y_{i−1}, . . . , Y_1 in the up-and-down method. Therefore, we cannot apply standard likelihood theory directly to the maximizer of $\prod_{k=1}^{K} L_{n_k}(\theta)$. Instead, we will utilize the Markovian structure imposed by the up-and-down method to study the asymptotic behavior of θ_n.

For an examinee with ability level θ, consider the data set (x,y) = {(x0, y0), . . . , (xn, yn)}

produced by the up-and-down method, where xt is the difficulty level for the tth selected item

and yt is the corresponding response value. Recall that yt is 0 or 1, representing “incorrect

answer” or “correct answer,” respectively. Assume the step size of the up-and-down method


is ∆. Observe that

$$P(X_{i+1} = X_i + \Delta \mid (X_k, Y_k),\, 0 \le k \le i) = P(Y_i = 1\mid X_i) = e^{\theta-X_i}/[1 + e^{\theta-X_i}],$$
$$P(X_{i+1} = X_i - \Delta \mid (X_k, Y_k),\, 0 \le k \le i) = P(Y_i = 0\mid X_i) = 1/[1 + e^{\theta-X_i}].$$

It follows easily that {X_t, t = 0, 1, · · · , n} forms a Markov chain on the state space {b_j = x_0 + j∆, j ∈ Z} with transition probability matrix P = (p_{x_1,x_2}) such that p_{x_1,x_1+∆} + p_{x_1,x_1−∆} = 1.

Or,

$$p_{x_1,x_2} := P\{X_2 = x_2 \mid X_1 = x_1\} = \begin{cases} e^{\theta-x_1}/[1 + e^{\theta-x_1}], & x_2 = x_1 + \Delta,\\ 1/[1 + e^{\theta-x_1}], & x_2 = x_1 - \Delta. \end{cases} \qquad (3)$$

We have just demonstrated that the data set {(X_t, Y_t), t = 0, 1, · · · , n} can be reproduced from the ordered sequence of difficulty levels {X_t, t = 0, 1, · · · , n} and vice versa. Moreover, the latter can be formulated as a Markov chain with the specific transition probability matrix P described in (3). It will be shown in Section 5 that this Markov chain has a stationary distribution π, that is, πP = π. Therefore, we will fully exploit the Markovian structure (3) to investigate the asymptotic properties of the maximum likelihood estimate θ_n.

For simplicity, we assume that X_0 = 0, ∆ = 1, and b_i = i from here on. Then the state space of X_t formed by the difficulty levels is {b_i = i, i ∈ Z}, the difficulty levels in their natural order. That is,

$$p_{x_1,x_2} = \begin{cases} e^{\theta-x_1}/[1 + e^{\theta-x_1}], & x_2 = x_1 + 1,\\ 1/[1 + e^{\theta-x_1}], & x_2 = x_1 - 1. \end{cases} \qquad (4)$$

Since, for given observations x_k, k = 1, · · · , n,
$$\frac{1}{n}\frac{\partial^2}{\partial\theta^2} L_n(\theta) = \frac{1}{n}\sum_{k=1}^{n} \frac{-e^{\theta-x_k}}{[1+e^{\theta-x_k}]^{2}} < 0 \quad \text{for any } \theta,$$
L_n(θ) is a concave function and the maximum likelihood estimate is the unique root of the score equation
$$\frac{1}{n}\frac{\partial}{\partial\theta} L_n(\theta) = \frac{1}{n}\sum_{k=1}^{n} \frac{\partial}{\partial\theta} g(x_{k-1}, x_k; \theta) = 0,$$
where g(X_1, X_2; θ) := log f(X_1, X_2; θ). Denote the maximum likelihood estimate by θ_n. To derive the asymptotic results, we follow the argument of showing that

• The score equation has a root near θ0. It will be established by showing that

$$\frac{1}{n}\sum_{t=1}^{n}\frac{\partial}{\partial\theta}g(X_{t-1},X_t;\theta)\Big|_{\theta=\theta_0} \longrightarrow 0 \quad\text{in probability as } n\to\infty. \qquad (5)$$


• The slope of the score equation is negative. We will show that

$$\frac{1}{n}\sum_{t=1}^{n}\frac{\partial^2}{\partial\theta^2}g(X_{t-1},X_t;\theta)\Big|_{\theta=\theta_0} \longrightarrow -I \quad\text{in probability as } n\to\infty. \qquad (6)$$

Here

$$I = E_{\theta_0}\Big[\frac{\partial}{\partial\theta}g(X_1,X_2;\theta)\Big|_{\theta=\theta_0}\Big]^2 = -E_{\theta_0}\Big\{\frac{\partial^2}{\partial\theta^2}g(X_1,X_2;\theta_0)\Big\} < \infty. \qquad (7)$$

• Show that $\sqrt{n}(\theta_n - \theta_0)$ is asymptotically normal with mean 0 and asymptotic variance 1/I.

The idea of the proof is to approximate the score function by an additive function of the Markov chain. To prove that (5) and (6) hold, we need a law of large numbers, and to prove asymptotic normality we need a central limit theorem for additive functions of the Markov chain. In these proofs, we will use the regeneration method to represent the additive function of the Markov chain as a sum of independent and identically distributed blocks. Wald's equations for Markov chains will also be applied to reduce the moment conditions on the regeneration epochs. We now state the main result of this paper. For technical details, refer to Section 5.

Theorem 1 Let {X_n, n ≥ 0} be a Markov chain on the countable state space S = {· · · , −2, −1, 0, 1, 2, · · ·}, with transition probability p_{ij} defined as in (4) for i, j ∈ S. Then

1) θ_n = θ(x_0, · · · , x_n) converges in probability to θ_0;

2) $\sqrt{n}(\theta_n - \theta_0) \longrightarrow N(0, 1/I)$ in distribution, where $I = E_{\theta_0}\big[\frac{\partial}{\partial\theta}\log f(X_1, X_2;\theta_0)\big]^2$ is the Fisher information, E_{θ_0} denotes the expectation E_π under the true parameter θ_0, and f(X_1, X_2; θ) denotes the transition probability of the Markov chain (4) under parameter θ.
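As a side remark, combining (7) with the conditional expectation computed in Section 5.2 yields an explicit expression for the Fisher information in terms of the stationary distribution π of (4):
$$I = \sum_{x\in S}\pi(x)\,\frac{e^{\theta_0-x}}{(1+e^{\theta_0-x})^{2}} = \sum_{x\in S}\pi(x)\,p(x)\{1-p(x)\}, \qquad p(x) = \frac{e^{\theta_0-x}}{1+e^{\theta_0-x}},$$
that is, I is the average item information at the difficulty levels visited under π. This is only a restatement of (7); the asymptotic variance 1/(nI) (cf. the AVar columns in Section 3) can be evaluated from this expression.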

3 Empirical Studies for the Maximum Likelihood Estimate

In order to understand the limitations of the proposed method, we conduct a small-scale simulation study.

3.1 Design of the simulation study

To evaluate the performance of the up-and-down method, we compare it to the recursive

maximum likelihood estimation method in the setting of a-stratified multistage computerized

adaptive testing. Since the up-and-down method is a nonparametric method that does not use all the information contained in the data, it is expected that the resulting estimate of θ is not as efficient as the R-MLE. The first simulation experiment is conducted to evaluate this efficiency loss.


Since the discrimination parameter a is fixed in each stratum, we compare only the accuracy of estimating the ability parameter θ in the 2-PLM with a = 1. Using the R-MLE as a benchmark, we compare the up-and-down estimate with the R-MLE estimate in terms of bias, variance, and mean square error. The results are shown in Tables 3.1 to 3.4.

In this study, the number of test items n varies from 10 to 200. Recall that the difficulty parameter b of the initial test item is set to 0 in this paper. The choice of a small n such as n = 10 is used to illustrate how bad the up-and-down method can be when the difficulty of the initial test item does not match the ability of the examinee. The choice of n = 200 is used to illustrate the efficiency loss of the up-and-down method.

For the sequential selection of test items, the R-MLE item selector chooses as the (k+1)th test item the one whose b maximizes the Fisher information at θ equal to the current estimate θ_k. For the up-and-down method, we need to determine the step size ∆ at the beginning of the test. In this study, the step size ∆ is set to the constant .1. When the response to the kth test item is 0, the difficulty level of the (k + 1)th test item is reduced by one step; if the response is 1, the difficulty level is increased by one step. For the ability parameter θ of the examinee, due to symmetry, we consider three nonnegative levels, θ = 0, θ = 1, and θ = 2. The choice of θ = 2 is used to test the limits of the up-and-down method when the initial choice of b is far from the examinee's ability, namely 20 steps away from the unknown θ. The range of the latent ability or skill level is set to be within (−3, 3), as usually assumed in the literature.
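For comparison with the up-and-down selector, the following is a minimal sketch of the benchmark maximum-information selection step within one stratum: since, as noted in Section 1, maximizing item information for the 2-PLM with a common a amounts to matching b with the current estimate of θ, the selector simply picks the unused item whose difficulty is closest to that estimate. The item-bank layout, tie-breaking rule, and variable names are illustrative assumptions, not the implementation behind the tables.

```python
def select_next_item(b_values, used, theta_hat):
    """Maximum-information (R-MLE) selection in a stratum with common
    discrimination: choose the unused item whose difficulty b is closest
    to the current ability estimate theta_hat."""
    best_idx, best_dist = None, float("inf")
    for idx, b in enumerate(b_values):
        if idx in used:
            continue
        dist = abs(b - theta_hat)
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx

# Example: a hypothetical stratum with difficulties on a grid.
bank_b = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]
next_item = select_next_item(bank_b, used=set(), theta_hat=0.3)  # index of b = 0.5
```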

To assess the accuracy of the estimates further, we also compare the two methods in terms of interval estimation of θ. When the sample size is moderate to large, the asymptotic results in Theorem 1 imply that the normal approximation provides a 'good' confidence interval with nominal coverage probability 95%. When the sample size is small and the normal approximation is not adequate, we introduce two alternative interval estimates based on bootstrap approximations (percentile and bootstrap-t). The bootstrap replication size for the ordinary bootstrap confidence intervals is B = 1,000. The true 95% range (t_{.025}, t_{.975}) is given for reference; it was obtained from the appropriate quantiles of the empirical distributions based on a large simulation with 10,000 replications. The second simulation experiment is conducted to evaluate these proposals.

Computations were performed using C++ programs on a SPARCstation 10 at the Department of Mathematics, National Taiwan University. The pseudo-random numbers were generated using IMSL routines. All tests were compared on the basis of the same random numbers, and samples of different sizes were nested.


3.2 Simulation experiment I

For each assumed ability level θ_0, we repeat the experiment 1,000 times. Let θ_{ni}, i = 1, 2, · · · , 1,000, be the estimate of θ_0 obtained in the ith experiment. We consider the following summary statistics:
$$\mathrm{Bias} = \frac{1}{1000}\sum_{i=1}^{1000}(\theta_{ni}-\theta_0),\qquad \mathrm{MSE} = \frac{1}{1000}\sum_{i=1}^{1000}(\theta_{ni}-\theta_0)^2,\qquad \mathrm{VAR} = \frac{1}{1000}\sum_{i=1}^{1000}(\theta_{ni}-\bar\theta)^2,$$
where $\bar\theta$ is the sample average. By Theorem 1, $\sqrt{n}(\theta_n - \theta_0) \longrightarrow N(0, 1/I)$, where I is the Fisher information determined by the Markov chain with transition probability (3) with ∆ = .1. To evaluate whether the asymptotic results give a good approximation, we also report the asymptotic variance (AVar) of θ_n.
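The AVar values can be reproduced numerically from Theorem 1. The sketch below computes the stationary distribution of the chain (3) on a truncated grid via detailed balance (the chain is a nearest-neighbour birth-death chain and hence reversible) and then evaluates I = Σ_x π(x) p(x)(1 − p(x)) (see the remark following Theorem 1) and 1/(nI); the grid, truncation half-width, and function names are our arbitrary choices, and for θ = 0 and ∆ = .1 the result should be roughly consistent with the AVar column of Table 3.1.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fisher_information(theta, step=0.1, x0=0.0, half_width=200):
    """Approximate I = E_pi[ p(X)(1 - p(X)) ] for the chain (3), where
    p(x) = sigmoid(theta - x) is the probability of a correct response at
    difficulty x (a = 1).  The stationary weights on the truncated grid
    {x0 + j*step, |j| <= half_width} come from detailed balance; keep the
    grid to a moderate range (here about +/- 20) so the transition
    probabilities stay away from 0 in double precision."""
    xs = [x0 + j * step for j in range(-half_width, half_width + 1)]
    logw = [0.0]                                  # log of unnormalized weights
    for j in range(len(xs) - 1):
        up = sigmoid(theta - xs[j])               # P(move up from x_j)
        down = sigmoid(xs[j + 1] - theta)         # P(move down from x_{j+1})
        logw.append(logw[-1] + math.log(up) - math.log(down))
    m = max(logw)
    w = [math.exp(lw - m) for lw in logw]         # stable normalization
    total = sum(w)
    return sum((wi / total) * sigmoid(theta - x) * (1.0 - sigmoid(theta - x))
               for wi, x in zip(w, xs))

I = fisher_information(theta=0.0)
avar = {n: 1.0 / (n * I) for n in (10, 20, 50, 100)}   # asymptotic variance 1/(n*I)
```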

Let $\tilde\theta_n$ denote the R-MLE based on the observations. It is known (cf. Chang and Ying, 1999) that
$$\sqrt{I_{\mathrm{RMLE}}(\tilde\theta_n)}\,(\tilde\theta_n - \theta_0) \longrightarrow N(0, 1)\ \text{in distribution}, \qquad (8)$$
where
$$I_{\mathrm{RMLE}}(\tilde\theta_n) = \sum_{i=1}^{n} \frac{a_i^2\, e^{a_i(\tilde\theta_n - b_i)}}{[1 + e^{a_i(\tilde\theta_n - b_i)}]^2}.$$
Let $\tilde\theta_{ni}$, i = 1, 2, · · · , 1,000, be the R-MLE of θ_0 obtained in the ith experiment. We also report the empirical mean and empirical variance of $I_{\mathrm{RMLE}}(\tilde\theta_{ni})$, defined as
$$\mathrm{Ave}(I_{\mathrm{RMLE}}) = \frac{1}{1000}\sum_{i=1}^{1000} I_{\mathrm{RMLE}}(\tilde\theta_{ni}) \quad\text{and}\quad \mathrm{Var}(I_{\mathrm{RMLE}}) = \frac{1}{1000}\sum_{i=1}^{1000}\big(I_{\mathrm{RMLE}}(\tilde\theta_{ni}) - \bar I_{\mathrm{RMLE}}\big)^2,$$
where $\bar I_{\mathrm{RMLE}}$ denotes the sample average of the $I_{\mathrm{RMLE}}(\tilde\theta_{ni})$.

Tables 3.1-3.3 show that the MSEs and the empirical variances of the R-MLE and the up-and-down methods are close. In particular, when the initial choice of b = 0 matches θ = 0, the up-and-down method performs well for all choices of n. When the initial choice of b = 0 does not match θ, the up-and-down method can be quite bad when n is small. In summary, the accuracy of the point estimates given by the two methods differs only marginally, and hence the up-and-down method can be used as an alternative to the efficient R-MLE method when the implementation of the R-MLE method is an issue and the number of test items is large. Also, the asymptotic variance of θ_n is close to VAR.

This suggests that the asymptotic results provide a good approximation. See Section 3.3 for

further discussion on interval estimation.

Table 3.1 Comparison of the up-and-down and R-MLE estimates when θ0 = 0


n Methods Bias VAR MSE Ave(In) Var(In) AVar

10 R-MLE -0.001 0.473 0.473 2.304 0.017

up-and-down -0.037 0.550 0.551 2.340 0.047 0.410

20 R-MLE 0.007 0.212 0.212 4.707 0.044

up-and-down 0.011 0.206 0.206 4.861 0.054 0.205

30 R-MLE -0.002 0.146 0.146 7.128 0.078

up-and-down -0.037 0.139 0.140 7.358 0.066 0.137

40 R-MLE -0.006 0.113 0.113 9.545 0.120

up-and-down -0.006 0.115 0.115 9.844 0.109 0.103

50 R-MLE -0.019 0.085 0.086 12.011 0.135

up-and-down 0.002 0.079 0.079 12.322 0.162 0.082

100 R-MLE -0.002 0.042 0.042 24.331 0.274

up-and-down -0.008 0.041 0.041 24.609 0.650 0.041

150 R-MLE 0.005 0.027 0.027 36.766 0.303

up-and-down 0.002 0.026 0.026 36.826 1.448 0.027

200 R-MLE 0.002 0.020 0.020 49.161 0.383

up-and-down -0.006 0.022 0.022 49.044 2.624 0.021


Table 3.2 Comparison of the up-and-down and R-MLE estimates when θ0 = 1

n Methods Bias VAR MSE Ave(In) Var(In) AVar

10 R-MLE 0.005 0.464 0.464 2.235 0.031

up-and-down 0.141 1.335 1.355 1.973 0.240 0.410

20 R-MLE -0.002 0.224 0.224 4.615 0.057

up-and-down 0.014 0.241 0.241 4.356 0.258 0.205

30 R-MLE 0.023 0.145 0.145 7.018 0.104

up-and-down 0.028 0.150 0.151 6.781 0.253 0.137

40 R-MLE -0.016 0.104 0.104 9.482 0.121

up-and-down 0.001 0.110 0.110 9.276 0.248 0.103

50 R-MLE -0.002 0.084 0.084 11.933 0.146

up-and-down 0.001 0.089 0.089 11.734 0.297 0.082

100 R-MLE -0.006 0.044 0.044 24.241 0.295

up-and-down 0.0003 0.040 0.040 23.998 0.660 0.041

150 R-MLE 0.003 0.027 0.027 36.684 0.309

up-and-down -0.001 0.027 0.027 36.231 1.457 0.027

200 R-MLE -0.003 0.021 0.021 49.095 0.323

up-and-down -0.004 0.022 0.022 48.438 2.536 0.021

Table 3.3 Comparison of the up-and-down and R-MLE estimates when θ0 = 2

n Methods Bias VAR MSE Ave(In) Var(In) AVar

10 R-MLE 0.040 0.523 0.525 2.015 0.059

up-and-down 1.233 11.820 13.340 1.234 0.451 0.410

20 R-MLE 0.013 0.242 0.242 4.399 0.073

up-and-down 0.099 0.689 0.699 3.119 0.637 0.205

30 R-MLE 0.037 0.164 0.165 6.818 0.122

up-and-down 0.014 0.195 0.195 5.400 0.589 0.137

40 R-MLE 0.016 0.107 0.107 9.256 0.146

up-and-down 0.001 0.128 0.128 7.796 0.536 0.103

50 R-MLE -0.017 0.089 0.089 11.707 0.159

up-and-down 0.009 0.099 0.099 10.205 0.519 0.082

100 R-MLE 0.007 0.040 0.041 24.057 0.265

up-and-down 0.003 0.045 0.045 22.470 0.776 0.041

150 R-MLE 0.010 0.028 0.028 36.429 0.320

up-and-down -0.006 0.029 0.029 34.742 1.488 0.027

200 R-MLE -0.001 0.020 0.020 48.890 0.372

up-and-down 0.003 0.022 0.022 46.899 2.513 0.021


Next we compare the test items selected by the R-MLE and the up-and-down method. For the purpose of illustration, we randomly pick one realization and report it in Table 3.4. It shows that the test items chosen by the R-MLE method are close to those chosen by the up-and-down method, except for the first few test items. In Table 3.4, we denote by bR and bUD the item difficulty parameters chosen by the two methods, and let D be the difference between two successive item difficulty parameters from the R-MLE method.


Table 3.4 Successive selected test items

item bR D bUD item bR D bUD item bR D bUD

0 0.000 0 1 0.500 0.500 0.1 2 0.250 0.250 0.2

3 0.950 0.700 0.1 4 0.424 0.526 0 5 0.839 0.415 0.1

6 0.494 0.345 0.2 7 0.200 0.294 0.3 8 0.456 0.257 0.2

9 0.229 0.227 0.1 10 0.433 0.204 0 11 0.619 0.185 -0.1

12 0.791 0.172 -0.2 13 0.632 0.159 -0.3 14 0.487 0.145 -0.4

15 0.351 0.136 -0.5 16 0.478 0.127 -0.6 17 0.359 0.119 -0.5

18 0.246 0.113 -0.4 19 0.353 0.107 -0.3 20 0.251 0.102 -0.2

21 0.153 0.098 -0.3 22 0.247 0.093 -0.4 23 0.157 0.089 -0.3

24 0.243 0.085 -0.4 25 0.161 0.082 -0.3 26 0.082 0.079 -0.2

27 0.158 0.076 -0.1 28 0.231 0.073 0 29 0.301 0.070 0.1

30 0.233 0.068 0 31 0.168 0.066 0.1 32 0.231 0.064 0.2

33 0.169 0.062 0.1 34 0.229 0.060 0.2 35 0.287 0.058 0.3

36 0.343 0.056 0.2 37 0.398 0.055 0.3 38 0.451 0.053 0.4

39 0.503 0.052 0.3 40 0.453 0.051 0.4 41 0.502 0.049 0.5

42 0.454 0.048 0.4 43 0.501 0.047 0.3 44 0.455 0.046 0.2

45 0.410 0.045 0.1 46 0.454 0.044 0.2 47 0.411 0.043 0.1

48 0.369 0.042 0 49 0.328 0.041 -0.1 50 0.287 0.040 0

51 0.248 0.049 0.1 52 0.287 0.039 0.2 53 0.248 0.038 0.3

54 0.286 0.037 0.2 55 0.323 0.037 0.3 56 0.287 0.036 0.2

57 0.322 0.035 0.1 58 0.287 0.035 0.2 59 0.253 0.034 0.1

60 0.219 0.034 0 61 0.186 0.033 0.1 62 0.219 0.033 0.2

63 0.187 0.032 0.3 64 0.155 0.032 0.4 65 0.124 0.031 0.3

66 0.154 0.031 0.4 67 0.124 0.030 0.5 68 0.154 0.030 0.4

69 0.183 0.029 0.3 70 0.154 0.029 0.4 71 0.126 0.029 0.3

72 0.154 0.028 0.4 73 0.126 0.028 0.5 74 0.099 0.028 0.4

75 0.071 0.027 0.3 76 0.044 0.027 0.2 77 0.018 0.027 0.1

78 0.044 0.026 0.2 79 0.070 0.026 0.1 80 0.044 0.026 0.2

81 0.019 0.025 0.1 82 0.044 0.025 0.2 83 0.019 0.025 0.1

84 -0.005 0.024 0.2 85 -0.029 0.024 0.3 86 -0.005 0.024 0.2

87 0.018 0.024 0.1 88 0.042 0.023 0 89 0.064 0.023 0.1

90 0.087 0.023 0 91 0.065 0.022 0.1 92 0.043 0.022 0

93 0.021 0.022 0.1 94 -0.001 0.022 0.2 95 0.020 0.022 0.3

96 -0.001 0.021 0.4 97 -0.022 0.021 0.3 98 -0.043 0.021 0.2

99 -0.064 0.021 0.1


3.3 Simulation experiment II

In simulation experiment I, it was found that AVar is close to VAR for the up-and-down method. We now evaluate the nominal 95% confidence interval for θ_0 based on Theorem 1. Again, consider θ_0 = 0, 1, 2. For each θ, we now simulate 10,000 times. The criteria we consider are the coverage probability (CP) and the average length (AL) of the corresponding confidence interval.

Tables 3.5 and 3.6 give the true coverage probabilities for the up-and-down and R-MLE estimates, which are found to be close to the nominal coverage probability of 95% derived from the asymptotic approximation.

Table 3.5 Normal approximation by the up-and-down method

θ = 0 θ = 1 θ = 2

n CP AL n CP AL n CP AL

50 0.9457 1.116984 50 0.9505 1.144526 50 0.9500 1.228273

100 0.9471 0.790488 100 0.9477 0.800057 100 0.9493 0.827105

150 0.9525 0.646108 150 0.9496 0.651195 150 0.9514 0.665727

Table 3.6 Asymptotic normality by R-MLE method

θ = 0 θ = 1 θ = 2

n CP AL n CP AL n CP AL

50 0.9502 1.132100 50 0.9520 1.135626 50 0.9488 1.146214

100 0.9499 0.794528 100 0.9495 0.795699 100 0.9484 0.799639

150 0.9552 0.646676 150 0.9499 0.647239 150 0.9493 0.649403

When n is small and the choice of b does not match θ0, Tables 3.1 to 3.3 show that AVar

is not close to VAR. This motivates us to use the bootstrap method to give an alternative

approximation. The second and the third approximated confidence intervals are obtained by

two bootstrap methods (percentile and bootstrap-t). By the Markov chain representation in

(3) and (4), the parametric bootstrap algorithm can be easily implemented. To be specific, let

x = {x0, · · ·xn} be a realization of the Markov chain {Xt; t ≥ 0} with transition probability

P = (pi,j(θ)), where θ is the unknown parameter. Let θn be the maximum likelihood estimate

(MLE) of θ. To approximate the sampling distribution H_n of $R(\mathbf{x}, \theta) := \sqrt{n}(\theta_n - \theta)$, the bootstrap method can be done as follows.

1. Let x* = {x*_0, · · · , x*_n} denote a Markov chain realization of n steps based on (p_{i,j}(θ_n)). Call this a bootstrap sample, and let θ*_n be the MLE of θ based on x*.


2. Approximate the sampling distribution H_n of R(x, θ) by the conditional distribution H*_n of $R(\mathbf{x}^*, \theta_n) = \sqrt{n}(\theta^*_n - \theta_n)$ given x.

The difficulty in implementing the bootstrap method lies in Step 2. Here, we approximate the bootstrap distribution by Monte Carlo simulation, as usual.
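A sketch of the two steps above in the simplified setting of Section 2 (a = 1, fixed step size ∆): bootstrap chains are generated from the transition probabilities (4) evaluated at the MLE, the MLE is recomputed on each bootstrap chain (the responses are read off from the direction of each move, as noted in Section 2), and the replicates approximate H_n. The Newton-Raphson details, B, and the seed are arbitrary choices rather than the authors' implementation; the percentile interval described next can then be read from the empirical quantiles of the bootstrap estimates θ*_n.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def simulate_chain(theta, n, step=0.1, x0=0.0, rng=None):
    """Generate a realization x_0, ..., x_n of the difficulty chain (3)/(4)
    under parameter theta: up one step after a correct response, else down."""
    rng = rng or random.Random()
    xs = [x0]
    for _ in range(n):
        up = rng.random() < sigmoid(theta - xs[-1])
        xs.append(xs[-1] + step if up else xs[-1] - step)
    return xs

def mle_from_chain(xs, theta0=0.0, tol=1e-8, max_iter=100):
    """Newton-Raphson MLE of theta; the response to the item at difficulty
    x_i is recovered from the direction of the move to x_{i+1}."""
    pairs = [(xs[i], 1 if xs[i + 1] > xs[i] else 0) for i in range(len(xs) - 1)]
    theta = theta0
    for _ in range(max_iter):
        score = sum(y - sigmoid(theta - b) for b, y in pairs)
        hess = -sum(sigmoid(theta - b) * (1.0 - sigmoid(theta - b)) for b, _ in pairs)
        step_nr = score / hess
        theta -= step_nr
        if abs(step_nr) < tol:
            break
    return theta

def bootstrap_replicates(xs, B=1000, step=0.1, seed=1):
    """Parametric bootstrap (Steps 1-2): resample chains of the same length
    from (4) at theta_hat and recompute the MLE on each bootstrap sample."""
    rng = random.Random(seed)
    n = len(xs) - 1
    theta_hat = mle_from_chain(xs)
    stars = []
    for _ in range(B):
        xs_star = simulate_chain(theta_hat, n, step=step, x0=xs[0], rng=rng)
        stars.append(mle_from_chain(xs_star, theta0=theta_hat))
    return theta_hat, sorted(stars)

# Example: the 95% percentile interval is read from the 2.5% and 97.5%
# empirical quantiles of the sorted bootstrap estimates.
# theta_hat, stars = bootstrap_replicates(simulate_chain(theta=1.0, n=50))
# lower, upper = stars[int(0.025 * len(stars))], stars[int(0.975 * len(stars)) - 1]
```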

For the percentile bootstrap confidence interval, we generate B = 1,000 bootstrap samples x* according to the above bootstrap algorithm, and the replications θ*_n are computed. Let G be the cumulative distribution function of θ*_n. The 1 − 2α percentile interval is defined by the α and 1 − α percentiles of G:
$$[\theta_l, \theta_u] = [G^{-1}(\alpha), G^{-1}(1-\alpha)] = [\theta_n^{*(\alpha)}, \theta_n^{*(1-\alpha)}].$$

The bootstrap-t method estimates the percentiles of a studentized statistic T = (θ_n − θ)/σ_n by bootstrapping, where σ_n² is a variance estimator. In our simulation, we use I_n to estimate the variance. For each sample, 1,000 bootstrap values T* = (θ*_n − θ_n)/I*_n are generated. The 95% central interval is then [θ_n − T_u · I_n, θ_n − T_l · I_n], where T_l and T_u are the 2.5 and 97.5 empirical percentiles based on the bootstrap samples.

The simulation results are summarized in Tables 3.7-3.10 which show that the true

coverage probabilities for both methods (R-MLE and up-and-down method) are close to the

nominal coverage probability 95%.

The first interval estimate uses the asymptotic result in Chang and Ying (1999), which states that
$$\sqrt{I_{\mathrm{RMLE}}(\theta_n)}\,(\theta_n - \theta) \longrightarrow N(0, 1) \quad\text{in distribution.}$$
Let z_α denote the 100α percentile of a standard normal distribution. Then
$$\Big(\theta_n - z_{0.025}\,\frac{1}{\sqrt{I_n(\theta_n)}},\ \theta_n + z_{0.025}\,\frac{1}{\sqrt{I_n(\theta_n)}}\Big)$$
gives an approximate 95% confidence interval. For this nominal 95% confidence interval, we study its true coverage probability
$$P\big(z_{0.025} \le \sqrt{I_n(\theta_n)}\,(\theta_n - \theta) \le z_{0.975}\big).$$

Table 3.7 Percentile by the up-and-down method

θ = 0 θ = 1 θ = 2

n CP AL n CP AL n CP AL

50 0.9529 1.123377 50 0.9498 1.151801 50 0.9505 1.237174

100 0.9506 0.792974 100 0.9488 0.803081 100 0.9465 0.829987

150 0.9514 0.647380 150 0.9496 0.652618 150 0.9492 0.667122


Table 3.8 Bootstrap-t by the up-and-down method

θ = 0 θ = 1 θ = 2

n CP AL n CP AL n CP AL

50 0.9557 1.132068 50 0.9560 1.161960 50 0.9590 1.255542

100 0.9472 0.795060 100 0.9540 0.804977 100 0.9484 0.831613

150 0.9489 0.648007 150 0.9459 0.653667 150 0.9516 0.668290

Table 3.9 Percentile by the R-MLE method

θ = 0 θ = 1 θ = 2

n CP AL n CP AL n CP AL

50 0.9498 1.144303 50 0.9465 1.148002 50 0.9580 1.157604

100 0.9496 0.798792 100 0.9471 0.800395 100 0.9482 0.803923

150 0.9498 0.648822 150 0.9499 0.649643 150

Table 3.10 Bootstrap-t by the R-MLE method

θ = 0 θ = 1 θ = 2

n CP AL n CP AL n CP AL

50 0.9505 1.147494 50 0.9510 1.14946 50 0.9536 1.158084

100 0.9517 0.799233 100 0.9540 0.804977 100 0.9491 0.804242

150 0.9508 0.649507 150 0.9459 0.653667 150

Tables 3.8-3.10 show that the bootstrap methods give reasonably accurate interval estimates for the unknown parameter θ. This is not incidental, since it is known that the bootstrap method is second-order accurate. Theoretical justification of these results will be published in a separate paper.

4 Concluding Remarks and Further Research

In this paper, we introduce the up-and-down method as an item selection rule in computerized adaptive testing (CAT), based on the two-parameter logistic model (2-PLM). In particular, we study the a-stratified multistage case in more detail. We conduct a simulation experiment to compare the accuracy of the parameter estimates obtained by the up-and-down method and by the recursive maximum likelihood estimation (R-MLE) method. The results suggest that the up-and-down method has the potential to be an alternative to the commonly used R-MLE method. Asymptotic behavior of the MLE based on the up-and-down method is also investigated.

The simulation results show that there is little difference between the up-and-down method and the R-MLE method in terms of the accuracy of point estimation and the construction of confidence intervals. Regarding statistical efficiency, although the selection of test items in the up-and-down method does not utilize all the information contained in the data, the simulation results suggest that it does not lose much information. The performance of the up-and-down method is almost the same as that of the R-MLE for a reasonable range of the test length n. Moreover, from a computational point of view, the up-and-down method is much easier to implement than the R-MLE, both in the item selection procedure and in the number of times θ must be estimated. In addition, the maximum likelihood estimate of θ obtained from the up-and-down method is consistent and asymptotically normal.

Some problems remain to be solved. First, the issue of model sensitivity is an interesting problem. Since human behavior is quite complex, there is doubt about using the logistic model or the probit model to describe an examinee's response. How will the performance of the up-and-down method and the R-MLE method be affected when the model is misspecified? Second, in computerized criterion-referenced testing, the emphasis is on classification; a study comparing the performance of both item selection rules in that setting deserves to be pursued. Third, the problem of multistage testing in computerized adaptive testing is also a challenging task and deserves further study.

5 Asymptotic Analysis

For tests of reasonable length, the simulation results in Section 3 show that the performance of the R-MLE and the up-and-down method is not very different. In this section, we study the asymptotic behavior of the MLE θ_n of the ability parameter θ when the test items are selected by the up-and-down method. We first show that the data {(X_t, Y_t), t = 0, 1, · · · , n} can be reproduced from the ordered sequence of difficulty levels {X_t, t = 0, 1, · · · , n} and vice versa. Next, we recognize that the latter can be formulated as a Markov chain with a specific transition probability matrix P. Therefore, the likelihood function defined in (2) can be expressed as the likelihood function of a parametric Markov chain. To establish the consistency of the resulting estimator, we will show that the test items {X_t, t = 0, · · · , n} selected by the up-and-down method form an irreducible Markov chain. For interval estimation, we will first show that the induced Markov chain is positive recurrent, and then apply the regeneration method to derive the asymptotic distribution of the maximum likelihood estimator.


5.1 Technical Results

Denote by P_π the probability measure under which the initial distribution is the stationary distribution π, and let E_π be the expectation under P_π. In order to employ the regeneration method to establish consistency and asymptotic normality, we need to show that the Markov chain (4) is irreducible and positive recurrent.

Consider a Markov chain {X_n, n ≥ 0} on the countable state space S = {· · · , −2, −1, 0, 1, 2, · · ·}, with transition probability p_{ij} for i, j ∈ S. Denote
$$f^*_{i,i} := \sum_{n=1}^{\infty} P\{X_\nu(\omega) \ne i,\ 0 < \nu < n;\ X_n(\omega) = i \mid X_0(\omega) = i\}.$$
A state i ∈ S is called recurrent if $f^*_{i,i} = 1$. Assuming that state i is recurrent, let T_i be the first regeneration time of X_n to state i, that is,
$$T_i = \begin{cases} \inf\{n \ge 1 : X_n = i\}, & \text{if such an } n \text{ exists},\\ \infty, & \text{otherwise.}\end{cases}$$
A recurrent state i is called positive if and only if E(T_i) < ∞. Irreducibility of the Markov chain implies that if one state is positive recurrent, then all states are positive recurrent.

Now we state the key theorem used in this section.

Theorem 2 Let {X_n, n ≥ 0} be an ergodic (irreducible, aperiodic and positive recurrent) Markov chain on the countable state space S = {· · · , −2, −1, 0, 1, 2, · · ·}, with stationary distribution π. Let h be a real-valued function on the state space S, and suppose E_π(|h|) < ∞. The following hold.

(a) $\sum_{t=1}^{N} h(X_t)/N$ converges to $E_\pi\{h(X_1)\}$ in probability.

(b) If $E_\pi(|h|^2) < \infty$ and $\sigma^2 := \mathrm{Var}\big(\sum_{t=1}^{T_{x_0}} h(X_t)\big) < \infty$, then
$$\frac{\sqrt{N}}{\sigma\sqrt{\pi(x_0)}}\left\{\frac{\sum_{t=1}^{N} h(X_t)}{N} - \frac{E_\pi\{h(X_1)\}}{\pi(x_0)}\right\} \longrightarrow N(0,1)\ \text{in distribution.} \qquad (9)$$

To prove that there exists a stationary probability distribution for the Markov chain (4), we need to show that the Markov chain (4) induced by the up-and-down method is ergodic. By the definition of (4), it is easy to see that irreducibility and aperiodicity hold. To prove positive recurrence, we need the following theorem.

Theorem 3 Let {X_n, n ≥ 0} be a Markov chain on the countable state space S = {· · · , −2, −1, 0, 1, 2, · · ·}, with transition probability p_{ij} for i, j ∈ S. Let 0 < α_i < 1 and β_i = 1 − α_i be given numbers such that
$$p_{i,i+1} = \alpha_i,\quad p_{i,i-1} = \beta_i \quad \text{for } i \ge 0; \qquad p_{i,i-1} = \alpha_i,\quad p_{i,i+1} = \beta_i \quad \text{for } i < 0.$$


(a) The state 0 is recurrent, i.e., $f^*_{0,0} = 1$, if and only if
$$\sum_{r\ge 1}\frac{\beta_1\times\cdots\times\beta_r}{\alpha_1\times\cdots\times\alpha_r} = \infty, \qquad \sum_{r\ge 1}\frac{\beta_{-1}\times\cdots\times\beta_{-r}}{\alpha_{-1}\times\cdots\times\alpha_{-r}} = \infty.$$

(b) The recurrent state 0 is positive if and only if
$$\sum_{r\ge 1}\frac{\alpha_1\cdots\alpha_{r-1}}{\beta_1\cdots\beta_{r-1}\beta_r} < \infty, \qquad \sum_{r\ge 1}\frac{\alpha_{-1}\cdots\alpha_{-(r-1)}}{\beta_{-1}\cdots\beta_{-(r-1)}\beta_{-r}} < \infty.$$

Remark. Results of Theorem 3 can be found in Chung (1967), in which only the one-

sided Markov chain is studied. The argument there can be generalized easily to the above

results for the two-sided Markov chain.

After showing that the Markov chain generated by the up-and-down method is positive

recurrent, Theorem 2 can then be used to derive the asymptotic results in Theorem 1.

Theorem 4 The Markov chain with transition probability defined in (4) is positive recurrent.

Proof. Let n_0 be the integer satisfying n_0 − 1 ≤ θ < n_0. Denote
$$p_{n_0+i,\,n_0+i+1} = \alpha_i,\quad p_{n_0+i,\,n_0+i-1} = \beta_i \quad \text{for } i \ge 0,$$
$$p_{n_0+i,\,n_0+i-1} = \alpha_i,\quad p_{n_0+i,\,n_0+i+1} = \beta_i \quad \text{for } i < 0.$$
Note that the logistic curve is monotone increasing; hence β_i/α_i = e^{(n_0+i)−θ} > 1 for i ≥ 0. We conclude that
$$\sum_{r\ge 1}\frac{\beta_1\times\cdots\times\beta_r}{\alpha_1\times\cdots\times\alpha_r} = \infty, \qquad \sum_{r\ge 1}\frac{\beta_{-1}\times\cdots\times\beta_{-r}}{\alpha_{-1}\times\cdots\times\alpha_{-r}} = \infty.$$
It follows from Theorem 3(a) that $f^*_{n_0,n_0} = 1$ and n_0 is a recurrent state. ♦

Next we show that E(T_{x_0}) < ∞ for all x_0. Recall that β_i > 1/2, α_i/β_i < 1, and α_i is monotone decreasing for i ≥ 0. We have
$$\sum_{r\ge 1}\frac{\alpha_1\times\cdots\times\alpha_{r-1}}{\beta_1\times\cdots\times\beta_r} < \infty.$$
By a similar argument, we have
$$\sum_{r\ge 1}\frac{\alpha_{-1}\times\cdots\times\alpha_{-(r-1)}}{\beta_{-1}\times\cdots\times\beta_{-r}} < \infty.$$
Theorem 3(b) hence implies that x_0 is positive recurrent. Since the Markov chain (4) is irreducible, this implies that all states in S are positive recurrent. That is, the Markov chain is an irreducible, aperiodic and positive recurrent Markov chain. Then the vector ϕ = (· · · , 1/E(T_{−1}), 1/E(T_0), 1/E(T_1), · · ·) is a stationary probability distribution for (4).

Remark. Note that the up-and-down method is a nonparametric method of selecting

test items. Hence, as long as the item response function is continuous, strictly monotone

increasing, and ranges over (0, 1), Theorem 4 continues to hold.
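The positive recurrence established in Theorem 4, and the identity π(x_0) = 1/E(T_{x_0}) used in Section 5.3, can be checked empirically. The sketch below simulates the chain (4) (step size 1, a = 1), records the return times to the starting state, and compares the empirical visit frequency with the reciprocal of the average return time; the run length, seed, and θ value are arbitrary choices for illustration only.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def check_positive_recurrence(theta=0.7, n_steps=200000, seed=0):
    """Simulate the chain (4) started at x = 0, collect return times to 0,
    and compare 1/mean(return time) with the empirical visit frequency."""
    rng = random.Random(seed)
    x, visits, return_times, last_visit = 0, 0, [], 0
    for t in range(1, n_steps + 1):
        x = x + 1 if rng.random() < sigmoid(theta - x) else x - 1
        if x == 0:
            visits += 1
            return_times.append(t - last_visit)
            last_visit = t
    freq = visits / n_steps                                    # estimates pi(0)
    inv_mean_T = 1.0 / (sum(return_times) / len(return_times)) # estimates 1/E(T_0)
    return freq, inv_mean_T

print(check_positive_recurrence())
```

The two returned numbers should be close to each other, and the finiteness of the average return time is the empirical counterpart of positive recurrence.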


5.2 Asymptotic behavior of the MLE

By using the results in Theorem 2, we will prove our main results, the weak consistency and

asymptotic normality of the MLE θn. One major contribution here is the characterization

of the Fisher information in Theorem 1, which can be used to construct a confidence interval for the true parameter θ_0.

Proof of Theorem 1. The proof will follow the argument outlined in Section 2. First,

we show that (5) holds. Note that X0 = 0, and

$$E_{\theta_0}\Big\{\frac{\partial}{\partial\theta}g(X_1,X_2;\theta_0)\,\Big|\,X_1\Big\} = E_{\theta_0}\Big\{\frac{\partial f(X_1,X_2;\theta_0)/\partial\theta}{f(X_1,X_2;\theta_0)}\,\Big|\,X_1\Big\} = \sum_{x_2\in S}\frac{\partial}{\partial\theta} f(X_1,x_2;\theta_0).$$
Differentiating both sides of the equation $\sum_{x_2\in S} f(x_1,x_2;\theta) = 1$ with respect to θ leads to
$$\sum_{x_2\in S}\frac{\partial}{\partial\theta} f(x_1,x_2;\theta) = 0.$$
This implies that $E_{\theta_0}\{\frac{\partial}{\partial\theta}g(X_1,X_2;\theta_0)\} = 0$.

Since
$$\frac{\partial}{\partial\theta}g(x_1,x_2;\theta) = \begin{cases} 1/(1+e^{\theta-x_1}), & x_2 = x_1 + 1,\\ -e^{\theta-x_1}/(1+e^{\theta-x_1}), & x_2 = x_1 - 1,\end{cases}$$
we have
$$E_{\theta_0}\Big\{\Big|\frac{\partial}{\partial\theta}g(X_1,X_2;\theta_0)\Big|\,\Big|\,X_1\Big\} \le \frac{1}{2}.$$

By Theorem 2(a), (5) holds since
$$\frac{1}{n}\sum_{t=1}^{n}\frac{\partial}{\partial\theta}g(X_{t-1},X_t;\theta)\Big|_{\theta=\theta_0} \longrightarrow E_{\theta_0}\Big\{\frac{\partial}{\partial\theta}g(X_1,X_2;\theta_0)\Big\} = 0 \quad\text{in probability as } n\to\infty.$$

Next, we show that (6) holds. Differentiating $\sum_{x_2\in S} f(x_1,x_2;\theta) = 1$ twice with respect to θ leads to
$$\sum_{x_2\in S}\frac{\partial^2}{\partial\theta^2} f(x_1,x_2;\theta) = 0,$$
and
$$E_{\theta_0}\Big\{\frac{\partial^2}{\partial\theta^2}g(X_1,X_2;\theta_0)\,\Big|\,X_1\Big\} = E_{\theta_0}\Big\{\frac{\partial^2 f/\partial\theta^2}{f} - \frac{(\partial f/\partial\theta)^2}{f^2}\,\Big|\,X_1\Big\} = E_{\theta_0}\Big\{\frac{\partial^2 f/\partial\theta^2}{f}\,\Big|\,X_1\Big\} - E_{\theta_0}\Big\{\frac{(\partial f/\partial\theta)^2}{f^2}\,\Big|\,X_1\Big\} = -E_{\theta_0}\Big\{\Big[\frac{\partial}{\partial\theta}g(X_1,X_2;\theta_0)\Big]^2\,\Big|\,X_1\Big\}.$$
Again, for every given x_1 ∈ S, we have
$$E_{\theta_0}\Big\{\Big[\frac{\partial}{\partial\theta}g(X_1,X_2;\theta_0)\Big]^2\,\Big|\,X_1 = x_1\Big\} = \frac{e^{\theta_0-x_1}}{(1+e^{\theta_0-x_1})^2} \le \frac{1}{4}.$$

Therefore (7) holds.


We also need to calculate $E_{\theta_0}\{|\frac{\partial^2}{\partial\theta^2}g(X_1,X_2;\theta_0)|\}$. For $x_2 = x_1 + 1$ or $x_2 = x_1 - 1$, we have
$$\Big|\frac{\partial^2}{\partial\theta^2}g(X_1,X_2;\theta_0)\Big| = \frac{e^{\theta_0-x_1}}{[1+e^{\theta_0-x_1}]^2} \le \frac{1}{4},$$
and this implies that
$$E_{\theta_0}\Big\{\Big|\frac{\partial^2}{\partial\theta^2}g(X_1,X_2;\theta_0)\Big|\Big\} < \infty.$$

It follows from Theorem 2(a) that (6) holds since
$$\frac{1}{n}\sum_{k=1}^{n}\frac{\partial^2}{\partial\theta^2}g(X_{k-1},X_k;\theta)\Big|_{\theta=\theta_0} \longrightarrow E_{\theta_0}\Big\{\frac{\partial^2}{\partial\theta^2}g(X_1,X_2;\theta_0)\Big\} = -I \quad\text{in probability as } n\to\infty.$$

Denote
$$G(x_1,x_2) := \sup_{\theta\in\mathbb{R}}\Big|\frac{\partial^3}{\partial\theta^3}g(x_1,x_2;\theta)\Big| = \sup_{\theta\in\mathbb{R}}\Big|\frac{e^{\theta-x_1}(1-e^{\theta-x_1})}{(1+e^{\theta-x_1})^3}\Big| < 1.$$

There exists a constant M such that

$$\lim_{n\to\infty}\frac{1}{n}\sum_{t=1}^{n}G(X_{t-1},X_t) = M \quad\text{in probability.} \qquad (10)$$

By the mean value theorem, for some |α| < 1, we have

$$\frac{1}{n}\frac{\partial}{\partial\theta}L_n(\theta) = \frac{1}{n}\sum_{t=1}^{n}\frac{\partial}{\partial\theta}g(x_{t-1},x_t;\theta) = \frac{1}{n}\sum_{t=1}^{n}\frac{\partial}{\partial\theta}g(x_{t-1},x_t;\theta_0) + \frac{1}{n}(\theta-\theta_0)\sum_{t=1}^{n}\frac{\partial^2}{\partial\theta^2}g(x_{t-1},x_t;\theta_0) + \frac{\alpha}{2n}(\theta-\theta_0)^2\sum_{t=1}^{n}G(x_{t-1},x_t).$$

Let S* denote the collection of (x_1, · · · , x_n) satisfying
$$\Big|\frac{1}{n}\sum_{t=1}^{n}\frac{\partial}{\partial\theta}g(x_{t-1},x_t;\theta_0)\Big| < \delta^2,\qquad \frac{1}{n}\sum_{t=1}^{n}\frac{\partial^2}{\partial\theta^2}g(x_{t-1},x_t;\theta_0) < -I/2, \qquad\text{and}\qquad \frac{1}{n}\sum_{t=1}^{n}G(x_{t-1},x_t) < 2M.$$
It follows from (5), (6), and (10) that, for all δ, ε > 0, there exists an n_0 such that P(S*) > 1 − ε when n > n_0(δ, ε).

For θ = θ_0 ± δ, choose δ < ½·I/(M + 1); then
$$\frac{1}{n}\frac{\partial}{\partial\theta}L_n(\theta)\Big|_{\theta=\theta_0+\delta} \le \delta^2 - \tfrac{1}{2}(I\cdot\delta) + M\delta^2 < 0,$$
if (x_1, · · · , x_n) ∈ S*. By the same argument, we have
$$\frac{1}{n}\frac{\partial}{\partial\theta}L_n(\theta)\Big|_{\theta=\theta_0-\delta} > 0.$$

Since $\frac{1}{n}\frac{\partial}{\partial\theta}L_n(\theta)$ is continuous, for any δ, ε > 0 the likelihood equation will, with probability exceeding 1 − ε, have a root belonging to (θ_0 − δ, θ_0 + δ) as long as n > n_0(δ, ε). We conclude that
$$\theta_n \longrightarrow \theta_0 \quad\text{in probability.} \qquad (11)$$


To prove 2), we first characterize the asymptotic variance $\mathrm{Var}\big(\sum_{t=1}^{T_{x_0}}\frac{\partial}{\partial\theta}g(X_{t-1},X_t;\theta_0)\big)$, where $T_{x_0}$ is the first regeneration time to state $x_0$. Recall that $E_{\theta_0}\{\frac{\partial}{\partial\theta}g(X_1,X_2;\theta_0)\} = 0$ and $E_{\theta_0}\{(\frac{\partial}{\partial\theta}g(X_1,X_2;\theta_0))^2\} = I$.

$$\begin{aligned}
\mathrm{Var}\Big(\sum_{t=1}^{T_{x_0}}\frac{\partial}{\partial\theta}g(X_{t-1},X_t;\theta_0)\Big)
&= E_{\theta_0}\Big(\sum_{t=1}^{T_{x_0}}\frac{\partial}{\partial\theta}g(X_{t-1},X_t;\theta_0)\Big)^2 - \Big[E_{\theta_0}\Big(\sum_{t=1}^{T_{x_0}}\frac{\partial}{\partial\theta}g(X_{t-1},X_t;\theta_0)\Big)\Big]^2\\
&= E_{\theta_0}\sum_{t=1}^{T_{x_0}}\Big[\frac{\partial}{\partial\theta}g(X_{t-1},X_t;\theta_0)\Big]^2 + 2\sum_{t'>t}E_{\theta_0}\Big(\frac{\partial}{\partial\theta}g(X_t,X_{t+1};\theta_0)\,\frac{\partial}{\partial\theta}g(X_{t'},X_{t'+1};\theta_0)\Big)\\
&= \frac{1}{\pi(x_0)}E_{\theta_0}\Big\{\Big(\frac{\partial}{\partial\theta}g(X_1,X_2;\theta_0)\Big)^2\Big\} + 2\sum_{t'>t}E_{\theta_0}\Big[E_{\theta_0}\Big(\frac{\partial}{\partial\theta}g(X_t,X_{t+1};\theta_0)\,\frac{\partial}{\partial\theta}g(X_{t'},X_{t'+1};\theta_0)\,\Big|\,X_t,X_{t+1},X_{t'}\Big)\Big]\\
&= \frac{I}{\pi(x_0)} + 2\sum_{t'>t}E_{\theta_0}\Big[\frac{\partial}{\partial\theta}g(X_t,X_{t+1};\theta_0)\,E_{\theta_0}\Big[\frac{\partial}{\partial\theta}g(X_{t'},X_{t'+1};\theta_0)\,\Big|\,X_{t'}\Big]\Big]\\
&= \frac{I}{\pi(x_0)}.
\end{aligned}$$

By Theorem 2, we have

$$\frac{1}{\sqrt{n}}\sum_{t=1}^{n}\frac{\partial}{\partial\theta}g(X_{t-1},X_t;\theta_0) \longrightarrow N(0, I) \quad\text{in distribution.}$$

Note that the score equation $n^{-1}\frac{\partial}{\partial\theta}L_n(\theta_n) = 0$ can be written as
$$0 = \frac{1}{n}\sum_{t=1}^{n}\frac{\partial}{\partial\theta}g(x_{t-1},x_t;\theta_0) + \frac{1}{n}(\theta_n-\theta_0)\sum_{t=1}^{n}\frac{\partial^2}{\partial\theta^2}g(x_{t-1},x_t;\theta_0) + \frac{\alpha}{2n}(\theta_n-\theta_0)^2\sum_{t=1}^{n}G(x_{t-1},x_t).$$

We have

$$\sqrt{n}(\theta_n-\theta_0) = -\frac{n^{-1/2}\sum_{t=1}^{n}\frac{\partial}{\partial\theta}g(X_{t-1},X_t;\theta_0)}{n^{-1}\sum_{t=1}^{n}\frac{\partial^2}{\partial\theta^2}g(X_{t-1},X_t;\theta_0) + (\alpha/2)(\theta_n-\theta_0)\,n^{-1}\sum_{t=1}^{n}G(X_{t-1},X_t)} \longrightarrow N(0, I^{-1}) \quad\text{in distribution.} \ \diamond$$

5.3 Proof of Theorem 2

Proof of (a). For simplicity, take x_0 to be the state 0, which is positive recurrent, and denote by $m := \sum_{j=1}^{N} I_{x_0}(X_j)$ the number of visits to state $x_0$ up to time N. It is known (cf. Chung, 1967) that $m/N \to \pi(x_0)$ in probability, where $\pi(x_0) = 1/E(T_{x_0})$. Let $T^k_{x_0}$ be the kth regeneration time to state $x_0$, and denote
$$\eta_j(h) := \sum_{i=T^{j-1}_{x_0}+1}^{T^{j}_{x_0}} h(X_i)$$


as the jth regeneration epoch. Note that {η_j(h), j = 1, · · · , m} form i.i.d. blocks due to the strong Markov property of the underlying Markov chain. Write

$$\frac{1}{N}\sum_{j=1}^{N}h(X_j) = \frac{1}{N}\sum_{j=T^{m}_{x_0}+1}^{N}h(X_j) + \frac{1}{N}\Big(\sum_{j=1}^{m}\eta_j(h) - \sum_{j=1}^{[N\pi]}\eta_j(h)\Big) + \frac{1}{N}\sum_{j=1}^{[N\pi]}\eta_j(h) := I_1 + I_2 + I_3. \qquad (12)$$

By the law of large numbers for i.i.d. random variables, we have

$$\frac{1}{N}\sum_{j=1}^{[N\pi(x_0)]}\eta_j(h) = \frac{[N\pi(x_0)]}{N}\cdot\frac{1}{[N\pi(x_0)]}\sum_{j=1}^{[N\pi(x_0)]}\eta_j(h) \longrightarrow \pi(x_0)\,E_\pi(\eta) = E_\pi(h) \quad\text{in probability.}$$

Next, we show that both I1 and I2 converge to zero in probability. For any ε > 0,

$$P_\pi\Big\{\Big|\sum_{j=T^{m}_{x_0}+1}^{N}h(X_j)\Big| > \varepsilon N\Big\} \le P_\pi\Big\{\sum_{j=T^{m}_{x_0}+1}^{N}|h(X_j)| > \varepsilon N\Big\} \le P_\pi\Big\{\sum_{j=T^{m}_{x_0}+1}^{T^{m+1}_{x_0}}|h(X_j)| > \varepsilon N\Big\} = P_\pi\{\eta_1(|h|) > \varepsilon N\} \le \frac{E_\pi[\eta_1(|h|)]}{\varepsilon N} = \frac{E_\pi(|h|)}{\varepsilon\,\pi(x_0)\,N}.$$

The last inequality follows from the Markov inequality, and $E_\pi(|h|)/[\varepsilon\,\pi(x_0)\,N] \to 0$ as N → ∞ since E_π(|h|) < ∞. This implies that I_1 → 0 in probability.

Since m/N → π(x_0) in probability, for all ε > 0 there exists N_0 such that for N > N_0, $P_\pi\{|m - [N\pi(x_0)]| > N\varepsilon^2\} < \varepsilon$. Then, for N > N_0, we have
$$\begin{aligned}
&P_\pi\Big\{\Big|\sum_{j=1}^{m}\eta_j(h) - \sum_{j=1}^{[N\pi(x_0)]}\eta_j(h)\Big| > \varepsilon N\Big\}\\
&\quad\le P_\pi\big(|m - [N\pi(x_0)]| > N\varepsilon^2\big) + P_\pi\Big\{\max_{|r-[N\pi(x_0)]|\le\varepsilon^2 N}\Big|\sum_{j=[N\pi(x_0)]+1}^{r}\eta_j(h)\Big| > \varepsilon N\Big\}\\
&\quad< \varepsilon + 2P_\pi\Big\{\max_{1\le r\le\varepsilon^2 N}\Big|\sum_{j=1}^{r}\eta_j(h)\Big| > \varepsilon N\Big\} \le \varepsilon + 2P_\pi\Big\{\sum_{j=1}^{[\varepsilon^2 N]}|\eta_j(h)| > \varepsilon N\Big\}\\
&\quad< \varepsilon + \frac{2\varepsilon^2 N\,E(|\eta_1|)}{\varepsilon N} = \Big(1 + \frac{2E_\pi(|h|)}{\pi(x_0)}\Big)\varepsilon.
\end{aligned}$$

Therefore, I2 → 0 in probability. We conclude the proof of (a).

Proof of (b). Using the same argument as in the proof of (a), we have
$$\begin{aligned}
\frac{\sqrt{N}}{\sigma\sqrt{\pi(x_0)}}\left(\frac{\sum_{j=1}^{N}h(X_j)}{N} - \frac{E_\pi[h(X_1)]}{\pi(x_0)}\right)
&= \frac{1}{\sigma\sqrt{N\pi(x_0)}}\sum_{j=T^{m}_{x_0}+1}^{N}h(X_j) + \frac{1}{\sigma\sqrt{N\pi(x_0)}}\Big(\sum_{j=1}^{m}\eta_j(h) - \sum_{j=1}^{[N\pi(x_0)]}\eta_j(h)\Big)\\
&\quad + \frac{\sqrt{N}}{\sigma\sqrt{\pi(x_0)}}\left(\frac{\sum_{j=1}^{[N\pi(x_0)]}\eta_j(h)}{N} - \frac{E_\pi[h(X_1)]}{\pi(x_0)}\right) := II_1 + II_2 + II_3. \qquad (13)
\end{aligned}$$


First, we consider II_3. Note that the η_j are i.i.d. random blocks. Under the conditions $E_\pi(|h|^2) < \infty$ and $\sigma^2 := \mathrm{Var}\big(\sum_{t=1}^{T_{x_0}} h(X_t)\big) < \infty$, the standard central limit theorem for i.i.d. random variables gives
$$II_3 = \frac{\sqrt{N\pi(x_0)}}{\sigma}\left(\frac{\sum_{j=1}^{[N\pi(x_0)]}\eta_j(h)}{N\pi(x_0)} - \frac{E_\pi[h(X_1)]}{\pi(x_0)}\right) \longrightarrow N(0,1) \quad\text{in distribution.} \qquad (14)$$

It remains to show that II_1 and II_2 converge to zero in probability. For any ε > 0, we have
$$P_\pi\Big\{\Big|\sum_{j=T^{m}_{x_0}+1}^{N}h(X_j)\Big| > \varepsilon\sigma\sqrt{N\pi(x_0)}\Big\} \le P_\pi\Big\{\sum_{j=T^{m}_{x_0}+1}^{N}|h(X_j)| > \varepsilon\sigma\sqrt{N\pi(x_0)}\Big\} \le P_\pi\Big\{\sum_{j=T^{m}_{x_0}+1}^{T^{m+1}_{x_0}}|h(X_j)| > \varepsilon\sigma\sqrt{N\pi(x_0)}\Big\} = P_\pi\{\eta_1(|h|) > \varepsilon\sigma\sqrt{N\pi(x_0)}\} \le \frac{E_\pi[\eta_1(|h|)]}{\varepsilon\sigma\sqrt{N\pi(x_0)}} = \frac{E_\pi(|h|)}{\varepsilon\,\pi(x_0)\,\sigma\sqrt{N\pi(x_0)}}. \qquad (15)$$

Hence, II1 converges to 0 in probability as N →∞ by assumption.

Since m/N → π(x_0) in probability, for all ε > 0 there exists N_0 such that for N > N_0, $P_\pi(|m - [N\pi(x_0)]| > N\varepsilon^3) < \varepsilon$. Clearly, for such N, we have
$$\begin{aligned}
&P_\pi\Big\{\Big|\sum_{j=1}^{m}\eta_j(h) - \sum_{j=1}^{[N\pi(x_0)]}\eta_j(h)\Big| > \varepsilon\sigma\sqrt{N\pi(x_0)}\Big\}\\
&\quad\le P_\pi\big(|m - [N\pi(x_0)]| > N\varepsilon^3\big) + P_\pi\Big\{\max_{|r-[N\pi(x_0)]|\le\varepsilon^3 N}\Big|\sum_{j=[N\pi(x_0)]+1}^{r}\eta_j(h)\Big| > \varepsilon\sigma\sqrt{N\pi(x_0)}\Big\}\\
&\quad< \varepsilon + 2P_\pi\Big\{\max_{1\le r\le\varepsilon^3 N}\Big|\sum_{j=1}^{r}\eta_j(h)\Big| > \varepsilon\sigma\sqrt{N\pi(x_0)}\Big\}\\
&\quad< \varepsilon + \frac{2\varepsilon^3 N\sigma^2}{\varepsilon^2\sigma^2 N\pi(x_0)} = \varepsilon\Big(1 + \frac{2}{\pi(x_0)}\Big).
\end{aligned}$$

This proves that II2 converges to 0 in probability as N → ∞. We conclude the proof of

(b). ♦


References

[1] Anderson, T. W. and Goodman, L. A. (1957). Statistical inference about Markov chains. Ann. Math. Statist. 28, 89-110.

[2] Athreya, K. B. and Fuh, C. D. (1993). Central limit theorem for a double array of Harris chains. Sankhya A 55, 1-11.

[3] Basawa, I. V. and Prakasa Rao, B. L. S. (1980). Statistical Inference for Stochastic Processes. London: Academic Press.

[4] Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In F. M. Lord & M. R. Novick, Statistical Theories of Mental Test Scores. Reading, MA: Addison-Wesley.

[5] Chang, H. and Ying, Z. (1996). A global information approach to computerized adaptive testing. Applied Psychological Measurement 20, 213-229.

[6] Chang, H. and Ying, Z. (1999). a-Stratified multistage computerized adaptive testing. Applied Psychological Measurement 23, 211-222.

[7] Chang, H. and Ying, Z. (19??). Nonlinear sequential designs for logistic item response theory models with application to computerized adaptive tests. The Annals of Statistics, ?, ?-?.

[8] Chao, M. T. and Fuh, C. D. (2001). Bootstrap methods for the up-and-down test on pyrotechnics sensitivity analysis. Statistica Sinica 11, 1-21.

[9] Chung, K. L. (1967). Markov Chains with Stationary Transition Probabilities. New York: Springer.

[10] Derman, C. (1957). Non-parametric up-and-down experimentation. Ann. Math. Statist. 28, 795-798.

[11] Dixon, W. J. and Mood, A. M. (1948). A method for obtaining and analyzing sensitivity data. J. Amer. Statist. Assoc. 43, 109-126.

[12] Fuh, C. D. and Zhang, C. H. (2000). Poisson equation, maximal inequalities and r-quick convergence for Markov random walks. Stochastic Processes and their Applications 87, 53-67.

[13] Lord, F. M. (1970). Some test theory for tailored testing. In W. H. Holtzman (Ed.), Computer-Assisted Instruction, Testing and Guidance. New York: Harper and Row.

[14] Lord, F. M. (1971). Robbins-Monro procedures for tailored testing. Educational and Psychological Measurement 31, 3-31.

[15] Lord, F. M. (1980). Applications of Item Response Theory to Practical Testing Problems. Hillsdale, NJ: Lawrence Erlbaum.

[16] Owen, R. J. (1975). A Bayesian sequential procedure for quantal response in the context of adaptive mental testing. Journal of the American Statistical Association 70, 351-356.

[17] Stocking, M. L. and Lewis, C. (1995). A New Method of Controlling Item Exposure in Computerized Adaptive Testing (Research Report 95-25). Princeton, NJ: Educational Testing Service.

[18] Sympson, J. B. and Hetter, R. D. (1985). Controlling item-exposure rates in computerized adaptive testing. Proceedings of the 27th Annual Meeting of the Military Testing Association, pp. 973-977. San Diego, CA: Navy Personnel Research and Development Center.

[19] Thomas, E. V. (1994). Evaluating the ignition sensitivity of thermal-battery heat pellets. Technometrics 36, 273-282.

[20] Wainer, H. (1990). Computerized Adaptive Testing: A Primer. Hillsdale, NJ: Erlbaum.

[21] Weiss, D. J. (1976). Adaptive testing research in Minnesota: Overview, recent results, and future directions. In C. L. Clark (Ed.), Proceedings of the First Conference on Computerized Adaptive Testing, pp. 24-35. Washington, DC: United States Civil Service Commission.

[22] Wetherill, G. B. (1963). Sequential estimation of quantal response curves (with discussion). J. Roy. Statist. Soc. Ser. B 25, 1-48.

[23] Wetherill, G. B. and Glazebrook, K. D. (1986). Sequential Methods in Statistics. London: Chapman and Hall.