The Up-and-Down Method in Computerized Adaptive Testing∗
Hung Chen, National Taiwan University
Cheng-Der Fuh, Academia Sinica
Jia-Fang Yang, National Taiwan University
April 17, 2002
Abstract
As an item selection method, recursive maximum likelihood estimation (R-MLE) has been commonly used in computerized adaptive testing (CAT). Based on the idea of item information, it selects the item that maximizes the Fisher information at the currently estimated trait level (θ), and it also yields an estimate of the ability θ of the examinee at the end of the test. Although the R-MLE has been studied and used in most applications of item response theory, two major issues arise in practice. First, the repeated calculation of maximum likelihood estimates (MLEs) required in implementation makes it computationally inefficient. Second, the uncertainty of the likelihood function due to model misspecification and data uncertainty, especially at early stages, makes it statistically inefficient. In this paper, we introduce the up-and-down method, as an approximation of the R-MLE, to select test items in CAT. More specifically, we propose an a-stratified multistage up-and-down method and investigate the statistical properties of the resulting estimate of the ability parameter θ. A bootstrap method for Markov chains is used to approximate the confidence interval for θ. It is argued that the up-and-down selection procedure should be used at least at the early stages of a test, or when the number of test items is not large. Results from pilot simulation studies show little difference between the up-and-down method and the R-MLE method in terms of the accuracy of point estimation and the construction of confidence intervals for the trait level θ. Hence, the up-and-down method can be regarded as an alternative to the classical R-MLE, with the advantages of easy implementation and a more tractable statistical analysis.
Key Words: Up-and-down method, computerized adaptive tests, bootstrap, Markov chains, recursive maximum likelihood estimate, experimental design.
∗This research was supported in part by grants from the National Science Council of the Republic of China.
1 Introduction
In the traditional paper-and-pencil (P&P) test, all examinees take the same test without
considering the difference among their abilities. Hence, each examinee may be required to
answer some test items that do not match the examinee’s ability. It is intuitively clear that
an examinee is measured most effectively when the test items are neither too difficult nor
too easy for him/her. Computerized adaptive testing (CAT) was thus proposed, in Lord (1970, 1971), Owen (1975), and Weiss (1976), among others, to construct such an individualized test. It is now implemented in the Graduate Record Examination (GRE), the Graduate Management Admission Test, and the National Council Licensure Examination for Nurses in the United States to obtain a better gauge of each examinee's ability.
To construct an individualized test, the ability (θ) estimate is updated after the ad-
ministration of each item, and the next optimal item is selected from an item bank until a
prespecified number of items is administered. Items are selected to match the examinee’s
estimated θ according to an item response theory (IRT) model that is assumed to describe
an examinee's response behavior. Setting aside nonstatistical issues such as content balancing, the standard approach to item selection in CAT has been to administer, as the next item, the item with the maximum Fisher item information (Lord, 1980, pp. 151-153) at the examinee's currently estimated ability level. This method is the so-called recursive maximum likelihood estimation (R-MLE) method. The major advantage of CAT is that it provides more efficient trait estimates with fewer items than are required in conventional tests. However, methodological as well as theoretical developments in CAT remain rather limited.
Although the R-MLE has been studied and used in most applications of item response theory, two major issues arise in practice. First, the repeated calculation of maximum likelihood estimates (MLEs) required in implementation makes it computationally inefficient. Second, the uncertainty of the likelihood function due to model misspecification and data uncertainty, especially at early stages, makes it statistically inefficient. Therefore, in this paper, we consider a nonparametric item selection rule, the up-and-down method, originated by Dixon and Mood (1948), to avoid the aforementioned repeated computation of MLEs and the issue of model dependency. This selection rule was also considered in Lord (1971) as an alternative to another nonparametric item selection rule, the Robbins-Monro method, to avoid preparing and storing too many test items.
To relate an examinee's ability to his/her response behavior on a test item, an IRT model starts with a mathematical model of how the response depends on the level of ability; this relationship is given by the item response function. The reader is referred to Lord (1980) for details. In this paper, we consider only dichotomous item responses, in which the item response function is defined as the probability p (or p(θ)) of getting
a correct response to the item, where θ is the so-called ability parameter. It is expected that
p(θ) is a monotone increasing function of θ. Two forms of item response functions have been
commonly used, the logistic and the probit models. It is pointed out in Birnbaum (1968)
that a properly scaled logistic model differs from the probit model by less than 1% and that
the former is much easier to work with mathematically. Furthermore, these two models provide
similar results for most practical works (Lord, 1980). Therefore, we only consider the logistic
model in this paper. The two-parameter logistic model (2-PLM) posits that the distribution
of the random observation Y , representing a 0 (incorrect answer) or 1 (correct answer) score
on a test item, has the form
p(θ) := P(Y = 1 | θ) = e^{a(θ−b)} / [1 + e^{a(θ−b)}],   (1)
where a and b are known item parameters for discrimination and difficulty, respectively. The
1-PLM can be obtained by setting a = 1 for all items.
CAT with the R-MLE selects each item to maximize the precision in estimating an examinee's θ. Items with high a values have high information, provided that b is close to θ.
Consequently, items with high a values tend to be exposed more frequently than items with
low information. In fact, it was reported in Lord and Wingersky (1984) that a and b parameter estimates are often positively correlated. This phenomenon was confirmed again by
analyzing 360 items from a retired item bank of a GRE quantitative test in Chang, Qian,
and Yin (2001). They also found that the correlation between a and b is 0.44.
For the purpose of controlling item exposure rates and simplifying test implementation, we propose to use the up-and-down method as the test item selection method. In the up-and-down method, the selection of the next test item depends only on the response to the current test item. In contrast, the next test item selected by the R-MLE method depends on all past responses of the examinee. The proposed procedure can be described as follows.
0. Analyze the distribution of (a, b) in the item bank.
1. Determine the strategy for updating a and b. Two updating strategies will be considered.
Method 1. When a and b are not correlated, we update a and b alternately.
Method 2. When a and b are correlated, find r, the correlation between a and b in the item bank. Define a new parameter c_i = a + r(b_i − b). Update (a, b) based on c.
2. Assume the kth test item has been administered.
2a. The level of the next test item, c, is increased by one unit when the response is 1. Otherwise, it is decreased by one unit.
Method 1. The difficulty level of the next test item, b, is increased by one unit when the response is 1. Otherwise, it is decreased by one unit.
Method 2. We start with a low discrimination level. The discrimination level of the next test item, a, is increased by one unit when two consecutive responses are both 1 or both 0; the difficulty level stays the same at this step. Otherwise, we follow 2a.
3. Repeat Step 2 until the test is finished.
Let n_k be the number of observations at stage k for k = 1, · · · , K, and note that n_1 + · · · + n_K equals the test length. When the test is finished, the method of maximum likelihood is applied to estimate the examinee's ability parameter θ.
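As a sketch, Method 1 within a single stratum can be coded as follows; the response is simulated from the 1-PLM, and the step size, starting difficulty, and seed are our illustrative choices:

```python
import math
import random

def up_and_down(theta, n_items, step=0.1, b0=0.0, rng=None):
    """Method 1 of the up-and-down selection rule: the difficulty b of
    the next item moves one step up after a correct response (1) and
    one step down after an incorrect response (0).
    Returns the administered difficulties and the simulated responses."""
    rng = rng or random.Random(0)
    b, bs, ys = b0, [], []
    for _ in range(n_items):
        p = 1.0 / (1.0 + math.exp(-(theta - b)))  # 1-PLM response probability
        y = 1 if rng.random() < p else 0
        bs.append(b)
        ys.append(y)
        b = b + step if y == 1 else b - step
    return bs, ys
```

Unlike R-MLE item selection, no estimate of θ is computed inside the loop; the MLE is computed once, at the end of the test.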
Earlier in the literature, the idea of a-stratified multistage computerized adaptive testing was proposed by Chang and Ying (1999), based on the R-MLE method. It is called a-stratified multistage R-MLE in CAT.
1. Partition the item bank into K levels according to the a-parameter values of the items. The first item stratum contains the items with the smallest a's, the next stratum contains the items with the second smallest a's, and so on.
2. Accordingly, partition the test into K stages.
3. Select n_k items from the kth stratum by matching the item difficulty parameter b with the updated estimator θ̂_n, then administer the items.
4. Repeat Step 3 for k = 1, 2, . . . , K.
In this procedure, items with similar a values are grouped together, and items are selected to maximize Fisher information within the corresponding level at each stage. As a remark, for the 2-PLM, maximizing item information is equivalent to matching b with θ̂_n. This stratification would, therefore, decrease the exposure rates of high-a items and increase the exposure rates of low-a items. The reader is referred to Chang and Ying (1999) for details.
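Step 1 of the stratification amounts to sorting the bank by a and cutting it into K blocks; a sketch (the toy item bank in the test is illustrative only):

```python
def stratify_by_a(bank, K):
    """Partition an item bank of (a, b) pairs into K strata of
    (near-)equal size, ordered from smallest to largest a."""
    items = sorted(bank, key=lambda item: item[0])  # sort by discrimination a
    n, r = divmod(len(items), K)
    strata, start = [], 0
    for k in range(K):
        size = n + (1 if k < r else 0)  # spread any remainder over early strata
        strata.append(items[start:start + size])
        start += size
    return strata
```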
The up-and-down method just mentioned can also be regarded as an approximation of the classical R-MLE in the following sense. First, we apply the idea of a moving block to calculate the MLE at each stage; that is, we use only the most recent m (for instance, m = 10) data points to do the job. The up-and-down method corresponds to choosing m = 1. Next, with only one item response used, the R-MLE will choose b as large as possible if the examinee's response is 1, and as small as possible if the response is 0. In this case, we may put a prior distribution on b and then select the difficulty level b according to this distribution. The up-and-down method corresponds to a degenerate prior with a fixed step size.
In contrast to the R-MLE method, the up-and-down method requires solving the likelihood equation only once. Its primary advantage over the R-MLE method is that it relieves the burden of repeated computation of MLEs in the course of conducting the test. Although the local independence assumption is violated in this case due to the sequential design, the selection of items can be described by a Markov chain (see Section 2 for details). Therefore, the statistical properties of the resulting estimate of θ are much easier to analyze than those based on the R-MLE method. Note that the theoretical basis for using maximum information to select items in CAT is that the R-MLE method should lead to a substantial gain in efficiency, so the resulting estimator is expected to be less statistically efficient than the R-MLE estimator. However, it is reported in the literature that the expected efficiency gain may not be realized, because θ̂ is used in place of θ in calculating the information. Indeed, Chang and Ying (1996) argued that it may be advantageous not to use item information at the early stages of a CAT, so as to avoid efficiency loss due to poor estimation of θ based on a small number of items. This motivates the simulation studies conducted in Section 3. Based on the results of those studies, the loss of statistical efficiency can be small.
The up-and-down method was first proposed in Dixon and Mood (1948) for testing the sensitivity of explosives to shock. In that setting, the experimental investigation is conducted by dropping a weight on specimens of the same explosive mixture from various heights. In CAT, the heights correspond to the difficulty levels of test items, and the sensitivity level to shock corresponds to the examinee's ability parameter. Further research on the up-and-down method can
be found in Wetherill (1963), Wetherill and Glazebrook (1986) and the references therein.
In this paper, we focus our study on the up-and-down method in computerized adaptive testing. We also address how the performance of the proposed methods depends on the choice of step size and on the choice of the first few items. Since the allowed number of test items is not large, a bootstrap method is proposed to evaluate the variance of the resulting estimate.
To get a better understanding of the proposed a-stratified multistage computerized adaptive testing, we describe in Section 2 the likelihood function and the associated estimation induced by the up-and-down method. Additional theoretical studies of the Markov chain induced by the up-and-down method, and the asymptotic behavior of the MLE, are deferred to Section 5. Empirical pilot simulation studies comparing the R-MLE and the up-and-down method are reported in Section 3. In Section 4, we make concluding remarks and discuss possible further research.
2 Likelihood Function
Assume that the two parameters (a and b) for each item in the item bank are known, having been calibrated from a test given to a sample of examinees. In this section, we describe the asymptotic behavior of the maximum likelihood estimate (MLE) of θ when the test items are selected by the up-and-down method.
In item response theory, it is assumed that an examinee's responses to different items in a test are statistically independent. For the up-and-down method used in the kth stage of computerized adaptive testing, the difficulty level of the (N + 1)th test item is increased (or decreased) by one unit according to whether the Nth response is correct or not. Hence, the contribution to the likelihood function for the 2-PLM during the kth stage is given by
L_{n_k}(θ) = f(X_0) ∏_{i=0}^{n_k} f(Y_i | X_i) = f(X_0) ∏_{i=0}^{n_k} [p_i(θ)]^{Y_i} [1 − p_i(θ)]^{1−Y_i},   (2)
where f(X_0) is determined by the design for choosing the difficulty level of the first test item and
p_i(θ) = P(Y_i = 1 | θ) = e^{a_k(θ−b_i)} / [1 + e^{a_k(θ−b_i)}].
Here X_i = b_i, and f(X_0) is independent of θ. Since the a values are similar within the kth stage of a-stratified multistage computerized adaptive testing, we will treat a as a fixed value (= 1) in each stage in the following discussion. For simplicity, the difficulty parameter b of the first test item is set to 0 hereafter. In this case, f(X_0) = 1.
In contrast to the R-MLE, θ̂_n is not needed in the selection process of test items. We only need to calculate θ̂_n once, at the end of the test. Here θ̂_n is the maximum likelihood estimate based on the observations x_0, y_0, · · · , x_n, y_n (i.e., θ̂_n is the maximizer of the likelihood function ∏_{k=1}^{K} L_{n_k}(θ)). Usually, there is no explicit solution for θ̂_n, and hence a numerical algorithm, such as the Newton-Raphson method, can be employed to approximate the value of θ̂_n.
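A Newton-Raphson sketch for the final ability estimate, under the a = 1 simplification used above (function names are ours): the score at θ is ∑_i (y_i − p_i(θ)) and the negative second derivative is ∑_i p_i(θ)(1 − p_i(θ)).

```python
import math

def mle_theta(bs, ys, theta0=0.0, tol=1e-10, max_iter=100):
    """Newton-Raphson solution of the score equation for the ability
    MLE under the 1-PLM.  bs: administered difficulties, ys: 0/1
    responses.  Note the MLE does not exist when all responses agree."""
    theta = theta0
    for _ in range(max_iter):
        ps = [1.0 / (1.0 + math.exp(-(theta - b))) for b in bs]
        score = sum(y - p for y, p in zip(ys, ps))
        info = sum(p * (1.0 - p) for p in ps)  # observed Fisher information
        delta = score / info
        theta += delta
        if abs(delta) < tol:
            break
    return theta
```

For example, three items at b = 0 with responses (1, 1, 0) give p(θ̂) = 2/3, i.e. θ̂ = log 2.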
Although the likelihood function given in (2) is identical to the likelihood function of a fixed design (in which X_0, X_1, . . . , X_n are determined before the administration of the test), it is apparent that Y_i depends on Y_{i−1}, . . . , Y_1 in the up-and-down method. Therefore, we cannot apply standard likelihood results to the maximizer of ∏_{k=1}^{K} L_{n_k}(θ). Instead, we will utilize the Markovian structure imposed by the up-and-down method to study the asymptotic behavior of θ̂_n.
For an examinee with ability level θ, consider the data set (x, y) = {(x_0, y_0), . . . , (x_n, y_n)} produced by the up-and-down method, where x_t is the difficulty level of the tth selected item and y_t is the corresponding response value. Recall that y_t is 0 or 1, representing an "incorrect answer" or a "correct answer," respectively. Assume the step size of the up-and-down method is ∆. Observe that
P(X_{i+1} = X_i + ∆ | (X_k, Y_k), 0 ≤ k ≤ i) = P(Y_i = 1 | X_i) = e^{θ−X_i} / [1 + e^{θ−X_i}],
P(X_{i+1} = X_i − ∆ | (X_k, Y_k), 0 ≤ k ≤ i) = P(Y_i = 0 | X_i) = 1 / [1 + e^{θ−X_i}].
It follows easily that {X_t, t = 0, 1, · · · , n} forms a Markov chain on the state space {b_j = x_0 + j∆, j ∈ Z} with transition probability matrix P = (p_{x_1,x_2}) such that p_{x_1,x_1+∆} + p_{x_1,x_1−∆} = 1. That is,
p_{x_1,x_2} := P{X_2 = x_2 | X_1 = x_1} =
  e^{θ−x_1} / [1 + e^{θ−x_1}],   x_2 = x_1 + ∆,
  1 / [1 + e^{θ−x_1}],           x_2 = x_1 − ∆.   (3)
We have just demonstrated that the data set {(X_t, Y_t), t = 0, 1, · · · , n} can be recovered from the ordered set of difficulty levels {X_t, t = 0, 1, · · · , n} and vice versa. Moreover, the latter can be formulated as a Markov chain with the specific transition probability matrix P described in (3). It will be shown in Section 5 that this Markov chain has a stationary distribution π; that is, πP = π. Therefore, we will fully exploit the Markovian structure (3) to investigate the asymptotic properties of the maximum likelihood estimate θ̂_n.
For simplicity, we assume hereafter that X_0 = 0, ∆ = 1, and b_i = i. Then the state space of X_t formed by the difficulty levels is {b_i = i, i ∈ Z}, which lists the difficulty levels in their natural order. That is,
p_{x_1,x_2} =
  e^{θ−x_1} / [1 + e^{θ−x_1}],   x_2 = x_1 + 1,
  1 / [1 + e^{θ−x_1}],           x_2 = x_1 − 1.   (4)
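The chain (4) is easy to simulate directly; the sketch below (names and seed are our choices) generates the difficulty-level random walk:

```python
import math
import random

def simulate_chain(theta, n, x0=0, rng=None):
    """Simulate n transitions of the Markov chain (4): from state x the
    chain moves to x + 1 with probability e^(theta - x)/(1 + e^(theta - x)),
    and to x - 1 otherwise."""
    rng = rng or random.Random(1)
    xs = [x0]
    for _ in range(n):
        x = xs[-1]
        p_up = 1.0 / (1.0 + math.exp(-(theta - x)))
        xs.append(x + 1 if rng.random() < p_up else x - 1)
    return xs
```

Because the up-probability exceeds 1/2 below θ and falls below 1/2 above θ, the chain is mean-reverting around the ability level; this is the source of the stationary distribution π used in Section 5.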
Since, for given observations x_k, k = 1, · · · , n,
(1/n) ∂²/∂θ² log L_n(θ) = (1/n) ∑_{k=1}^{n} −e^{θ−x_k} / [1 + e^{θ−x_k}]² < 0   for any θ,
log L_n(θ) is a concave function, and the maximum likelihood estimate is the unique root of the score equation
(1/n) ∂/∂θ log L_n(θ) = (1/n) ∑_{k=1}^{n} ∂/∂θ g(x_{k−1}, x_k; θ) = 0,
where g(X_1, X_2; θ) := log f(X_1, X_2; θ). Denote the maximum likelihood estimate by θ̂_n. To derive the asymptotic results, we proceed by showing the following:
• The score equation has a root near θ_0. This will be established by showing that
(1/n) ∑_{t=1}^{n} ∂/∂θ g(X_{t−1}, X_t; θ) |_{θ=θ_0} → 0 in probability as n → ∞.   (5)
• The slope of the score equation is negative. We will show that
(1/n) ∑_{t=1}^{n} ∂²/∂θ² g(X_{t−1}, X_t; θ) |_{θ=θ_0} → −I in probability as n → ∞.   (6)
Here
I = E_{θ_0} [∂/∂θ g(X_1, X_2; θ) |_{θ=θ_0}]² = −E_{θ_0} {∂²/∂θ² g(X_1, X_2; θ_0)} < ∞.   (7)
• √n(θ̂_n − θ_0) is asymptotically normal with mean 0 and asymptotic variance 1/I.
The idea of the proof is to approximate the score function by an additive functional of the Markov chain. To prove that (5) and (6) hold, we need a law of large numbers, and to prove the asymptotic normality, we need a central limit theorem for additive functionals of the Markov chain. In these proofs, we will use the regeneration method to represent an additive functional of the Markov chain as a sum of independent and identically distributed random variables. Wald's equations for Markov chains will also be applied to reduce the moment conditions on the regeneration epochs. We now state the main result of this paper. For technical details, refer to Section 5.
Theorem 1 Let {X_n, n ≥ 0} be a Markov chain on the countable state space S = {· · · , −2, −1, 0, 1, 2, · · ·}, with transition probability p_{ij} defined as in (4) for i, j ∈ S. Then
1) θ̂_n = θ̂(x_0, · · · , x_n) converges in probability to θ_0;
2) √n(θ̂_n − θ_0) −→ N(0, 1/I) in distribution, where I = E_{θ_0} [∂/∂θ log f(X_1, X_2; θ_0)]² is the Fisher information, E_{θ_0} denotes the expectation E_π under the true parameter θ_0, and f(X_1, X_2; θ) denotes the transition probability under parameter θ of the Markov chain (4).
3 Empirical Studies for the Maximum Likelihood Estimate
In order to understand the limitations of the proposed method, we conduct a small-scale simulation study.
3.1 Design of the simulation study
To evaluate the performance of the up-and-down method, we compare it to the recursive maximum likelihood estimation method in the setting of a-stratified multistage computerized adaptive testing. Since the up-and-down method is a nonparametric method that does not use all the information contained in the data, it is expected that the resulting estimate θ̂_n is not as efficient as the R-MLE θ̃_n. The first simulation experiment is conducted to evaluate this efficiency loss. Since the discrimination parameter a is fixed in each stratum, we simply compare the accuracy of the ability parameter θ in the 2-PLM with a = 1. Using the R-MLE as a benchmark, we compare θ̂_n to θ̃_n in terms of bias, variance, and mean square error. The results are shown in Tables 3.1 to 3.4.
In this study, the number of test items n varies from 10 to 200. Recall that the difficulty parameter b of the initial test item is set to 0 in this paper. The choice of a small n, such as n = 10, is used to illustrate how bad the up-and-down method can be when the difficulty of the initial test item does not match the ability of the examinee. The choice of n = 200 is used to illustrate the efficiency loss of the up-and-down method.
For the sequential selection of test items, the R-MLE item selector chooses as the (k + 1)th test item the one whose b maximizes the Fisher information at θ = θ̃_k. For the up-and-down method, we need to determine the step size ∆ at the beginning of the test. In this study, the step size ∆ is set to the constant 0.1. When the response to the kth test item is 0, the difficulty level of the (k + 1)th test item is reduced by one step size; if the response is 1, the difficulty level is increased by one step size. For the ability parameter θ of the examinee, due to symmetry, we consider three nonnegative levels: θ = 0, θ = 1, and θ = 2. The choice of θ = 2 is used to test the limit of the up-and-down method when the initial choice of b is far from the examinee's ability, namely 20 steps away from the unknown θ. The range of latent ability or skill level is set to be within (−3, 3), as usually assumed in the literature.
For the accuracy of the estimate θ̂_n, we also compare θ̂_n to θ̃_n in terms of interval estimation of θ. When the sample size is moderate to large, based on the asymptotic results in Theorem 1, the normal approximation will provide a 'good' confidence interval with nominal coverage probability 95%. When the sample size is small and the normal approximation is not good, we introduce two alternative interval estimates based on bootstrap approximations (percentile and bootstrap-t). The bootstrap replication size for the ordinary bootstrap confidence intervals is B = 1,000. The true 95% central interval (t_{.025}, t_{.975}) is given for reference; its endpoints were obtained from the appropriate quantiles of the empirical distributions based on a large simulation with 10,000 replications. The second simulation experiment is conducted to evaluate the above proposals.
Computations were performed using C++ programs on a SPARCstation 10 at the Department of Mathematics, National Taiwan University. The pseudo-random numbers were generated using IMSL routines. All tests were compared on the basis of the same random numbers, and samples of different sizes were nested.
3.2 Simulation experiment I
For each assumed ability level θ_0, we repeat the experiment 1,000 times. Let θ̂_{ni}, i = 1, 2, · · · , 1,000, be the estimate of θ_0 obtained in the ith experiment. We consider the following summary statistics:
Bias = (1/1000) ∑_{i=1}^{1000} (θ̂_{ni} − θ_0),   MSE = (1/1000) ∑_{i=1}^{1000} (θ̂_{ni} − θ_0)²,   VAR = (1/1000) ∑_{i=1}^{1000} (θ̂_{ni} − θ̄_n)²,
where θ̄_n is the sample average of the θ̂_{ni}. By Theorem 1, √n(θ̂_n − θ_0) −→ N(0, 1/I), where I is the Fisher information determined by the Markov chain with transition probability (3) with ∆ = 0.1. To evaluate whether the asymptotic results give a good approximation, we also report the asymptotic variance (AVar) of θ̂_n.
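The summary statistics above can be computed as a direct transcription; the identity MSE = VAR + Bias² is a useful check:

```python
def summary_stats(estimates, theta0):
    """Bias, VAR, and MSE of replicated estimates of theta0,
    as defined in Section 3.2."""
    m = len(estimates)
    mean = sum(estimates) / m
    bias = mean - theta0
    var = sum((e - mean) ** 2 for e in estimates) / m
    mse = sum((e - theta0) ** 2 for e in estimates) / m
    return bias, var, mse
```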
Let θ̃_n denote the R-MLE based on the observations. It is known (cf. Chang and Ying, 1999) that
√(I_RMLE(θ̃_n)) (θ̃_n − θ_0) −→ N(0, 1) in distribution,   (8)
where
I_RMLE(θ̃_n) = ∑_{i=1}^{n} a_i² e^{a_i(θ̃_n−b_i)} / [1 + e^{a_i(θ̃_n−b_i)}]².
Let θ̃_{ni}, i = 1, 2, · · · , 1,000, be the R-MLE of θ_0 obtained in the ith experiment. We also report the empirical mean and empirical variance of I_RMLE(θ̃_{ni}), which are defined as
Ave(I_RMLE) = (1/1000) ∑_{i=1}^{1000} I_RMLE(θ̃_{ni})   and   Var(I_RMLE) = (1/1000) ∑_{i=1}^{1000} (I_RMLE(θ̃_{ni}) − Ī_RMLE)²,
where Ī_RMLE denotes the sample average of the I_RMLE(θ̃_{ni}).
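The information I_RMLE is a sum of a_i² p_i (1 − p_i) terms over the administered items and can be sketched as:

```python
import math

def info_rmle(theta, items):
    """Test information at theta for administered items (a_i, b_i):
    the sum of a_i^2 * p_i * (1 - p_i) over the items."""
    total = 0.0
    for a, b in items:
        p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
        total += a * a * p * (1.0 - p)
    return total
```

For instance, a single item with a = 1 and b = θ contributes 1 · (1/2)(1/2) = 0.25.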
Tables 3.1-3.3 show that the MSEs and the empirical variances of the R-MLE and the up-and-down methods are close. In particular, when the initial choice of b = 0 matches θ = 0, the up-and-down method performs well for every choice of n. When the initial choice of b = 0 does not match θ, the up-and-down method can be quite bad when n is small. In summary, the accuracy of the point estimates given by these two methods differs only marginally; hence, the up-and-down method can be used as an alternative to the efficient R-MLE method when the implementation of the R-MLE method is an issue and the number of test items is large. Also, the asymptotic variance of θ̂_n is close to VAR. This suggests that the asymptotic results provide a good approximation. See Section 3.3 for further discussion on interval estimation.
Table 3.1 Comparison of θ̂_n and θ̃_n when θ_0 = 0
n Methods Bias VAR MSE Ave(In) Var(In) AVar
10 R-MLE -0.001 0.473 0.473 2.304 0.017
up-and-down -0.037 0.550 0.551 2.340 0.047 0.410
20 R-MLE 0.007 0.212 0.212 4.707 0.044
up-and-down 0.011 0.206 0.206 4.861 0.054 0.205
30 R-MLE -0.002 0.146 0.146 7.128 0.078
up-and-down -0.037 0.139 0.140 7.358 0.066 0.137
40 R-MLE -0.006 0.113 0.113 9.545 0.120
up-and-down -0.006 0.115 0.115 9.844 0.109 0.103
50 R-MLE -0.019 0.085 0.086 12.011 0.135
up-and-down 0.002 0.079 0.079 12.322 0.162 0.082
100 R-MLE -0.002 0.042 0.042 24.331 0.274
up-and-down -0.008 0.041 0.041 24.609 0.650 0.041
150 R-MLE 0.005 0.027 0.027 36.766 0.303
up-and-down 0.002 0.026 0.026 36.826 1.448 0.027
200 R-MLE 0.002 0.020 0.020 49.161 0.383
up-and-down -0.006 0.022 0.022 49.044 2.624 0.021
Table 3.2 Comparison of θ̂_n and θ̃_n when θ_0 = 1
n Methods Bias VAR MSE Ave(In) Var(In) AVar
10 R-MLE 0.005 0.464 0.464 2.235 0.031
up-and-down 0.141 1.335 1.355 1.973 0.240 0.410
20 R-MLE -0.002 0.224 0.224 4.615 0.057
up-and-down 0.014 0.241 0.241 4.356 0.258 0.205
30 R-MLE 0.023 0.145 0.145 7.018 0.104
up-and-down 0.028 0.150 0.151 6.781 0.253 0.137
40 R-MLE -0.016 0.104 0.104 9.482 0.121
up-and-down 0.001 0.110 0.110 9.276 0.248 0.103
50 R-MLE -0.002 0.084 0.084 11.933 0.146
up-and-down 0.001 0.089 0.089 11.734 0.297 0.082
100 R-MLE -0.006 0.044 0.044 24.241 0.295
up-and-down 0.0003 0.040 0.040 23.998 0.660 0.041
150 R-MLE 0.003 0.027 0.027 36.684 0.309
up-and-down -0.001 0.027 0.027 36.231 1.457 0.027
200 R-MLE -0.003 0.021 0.021 49.095 0.323
up-and-down -0.004 0.022 0.022 48.438 2.536 0.021
Table 3.3 Comparison of θ̂_n and θ̃_n when θ_0 = 2
n Methods Bias VAR MSE Ave(I) Var(I) AVar
10 R-MLE 0.040 0.523 0.525 2.015 0.059
up-and-down 1.233 11.820 13.340 1.234 0.451 0.410
20 R-MLE 0.013 0.242 0.242 4.399 0.073
up-and-down 0.099 0.689 0.699 3.119 0.637 0.205
30 R-MLE 0.037 0.164 0.165 6.818 0.122
up-and-down 0.014 0.195 0.195 5.400 0.589 0.137
40 R-MLE 0.016 0.107 0.107 9.256 0.146
up-and-down 0.001 0.128 0.128 7.796 0.536 0.103
50 R-MLE -0.017 0.089 0.089 11.707 0.159
up-and-down 0.009 0.099 0.099 10.205 0.519 0.082
100 R-MLE 0.007 0.040 0.041 24.057 0.265
up-and-down 0.003 0.045 0.045 22.470 0.776 0.041
150 R-MLE 0.010 0.028 0.028 36.429 0.320
up-and-down -0.006 0.029 0.029 34.742 1.488 0.027
200 R-MLE -0.001 0.020 0.020 48.890 0.372
up-and-down 0.003 0.022 0.022 46.899 2.513 0.021
Next we compare the test items selected by the R-MLE and the up-and-down methods. For the purpose of illustration, we randomly pick one realization and report it in Table 3.4. It shows that the test items chosen by the R-MLE method are close to those of the up-and-down method except for the first few test items. In Table 3.4, b_R and b_UD denote the item difficulty parameters chosen by the two methods, and D is the difference between two successive item difficulty parameters from the R-MLE method.
Table 3.4 Successive selected test items
item bR D bUD item bR D bUD item bR D bUD
0 0.000 0 1 0.500 0.500 0.1 2 0.250 0.250 0.2
3 0.950 0.700 0.1 4 0.424 0.526 0 5 0.839 0.415 0.1
6 0.494 0.345 0.2 7 0.200 0.294 0.3 8 0.456 0.257 0.2
9 0.229 0.227 0.1 10 0.433 0.204 0 11 0.619 0.185 -0.1
12 0.791 0.172 -0.2 13 0.632 0.159 -0.3 14 0.487 0.145 -0.4
15 0.351 0.136 -0.5 16 0.478 0.127 -0.6 17 0.359 0.119 -0.5
18 0.246 0.113 -0.4 19 0.353 0.107 -0.3 20 0.251 0.102 -0.2
21 0.153 0.098 -0.3 22 0.247 0.093 -0.4 23 0.157 0.089 -0.3
24 0.243 0.085 -0.4 25 0.161 0.082 -0.3 26 0.082 0.079 -0.2
27 0.158 0.076 -0.1 28 0.231 0.073 0 29 0.301 0.070 0.1
30 0.233 0.068 0 31 0.168 0.066 0.1 32 0.231 0.064 0.2
33 0.169 0.062 0.1 34 0.229 0.060 0.2 35 0.287 0.058 0.3
36 0.343 0.056 0.2 37 0.398 0.055 0.3 38 0.451 0.053 0.4
39 0.503 0.052 0.3 40 0.453 0.051 0.4 41 0.502 0.049 0.5
42 0.454 0.048 0.4 43 0.501 0.047 0.3 44 0.455 0.046 0.2
45 0.410 0.045 0.1 46 0.454 0.044 0.2 47 0.411 0.043 0.1
48 0.369 0.042 0 49 0.328 0.041 -0.1 50 0.287 0.040 0
51 0.248 0.049 0.1 52 0.287 0.039 0.2 53 0.248 0.038 0.3
54 0.286 0.037 0.2 55 0.323 0.037 0.3 56 0.287 0.036 0.2
57 0.322 0.035 0.1 58 0.287 0.035 0.2 59 0.253 0.034 0.1
60 0.219 0.034 0 61 0.186 0.033 0.1 62 0.219 0.033 0.2
63 0.187 0.032 0.3 64 0.155 0.032 0.4 65 0.124 0.031 0.3
66 0.154 0.031 0.4 67 0.124 0.030 0.5 68 0.154 0.030 0.4
69 0.183 0.029 0.3 70 0.154 0.029 0.4 71 0.126 0.029 0.3
72 0.154 0.028 0.4 73 0.126 0.028 0.5 74 0.099 0.028 0.4
75 0.071 0.027 0.3 76 0.044 0.027 0.2 77 0.018 0.027 0.1
78 0.044 0.026 0.2 79 0.070 0.026 0.1 80 0.044 0.026 0.2
81 0.019 0.025 0.1 82 0.044 0.025 0.2 83 0.019 0.025 0.1
84 -0.005 0.024 0.2 85 -0.029 0.024 0.3 86 -0.005 0.024 0.2
87 0.018 0.024 0.1 88 0.042 0.023 0 89 0.064 0.023 0.1
90 0.087 0.023 0 91 0.065 0.022 0.1 92 0.043 0.022 0
93 0.021 0.022 0.1 94 -0.001 0.022 0.2 95 0.020 0.022 0.3
96 -0.001 0.021 0.4 97 -0.022 0.021 0.3 98 -0.043 0.021 0.2
99 -0.064 0.021 0.1
3.3 Simulation experiment II
In simulation experiment I, it was found that AVar is close to VAR for the up-and-down method. We now evaluate the nominal 95% confidence interval for θ_0 based on Theorem 1. Again, consider θ_0 = 0, 1, 2. For each θ, we now simulate 10,000 times. The criteria we consider are the coverage probability (CP) and average length (AL) of the corresponding confidence interval.
Tables 3.5 and 3.6 give the true coverage probabilities for θ̂_n and θ̃_n, which are found to be close to the nominal coverage probability 95% derived from the asymptotic approximation.
Table 3.5 Normal approximation by the up-and-down method
θ = 0 θ = 1 θ = 2
n CP AL n CP AL n CP AL
50 0.9457 1.116984 50 0.9505 1.144526 50 0.9500 1.228273
100 0.9471 0.790488 100 0.9477 0.800057 100 0.9493 0.827105
150 0.9525 0.646108 150 0.9496 0.651195 150 0.9514 0.665727
Table 3.6 Asymptotic normality by R-MLE method
θ = 0 θ = 1 θ = 2
n CP AL n CP AL n CP AL
50 0.9502 1.132100 50 0.9520 1.135626 50 0.9488 1.146214
100 0.9499 0.794528 100 0.9495 0.795699 100 0.9484 0.799639
150 0.9552 0.646676 150 0.9499 0.647239 150 0.9493 0.649403
When n is small and the choice of b does not match θ_0, Tables 3.1 to 3.3 show that AVar is not close to VAR. This motivates us to use the bootstrap method to give an alternative approximation. The second and third approximate confidence intervals are obtained by two bootstrap methods (percentile and bootstrap-t). With the Markov chain representation in (3) and (4), the parametric bootstrap algorithm can be easily implemented. To be specific, let x = {x_0, · · · , x_n} be a realization of the Markov chain {X_t; t ≥ 0} with transition probability P = (p_{i,j}(θ)), where θ is the unknown parameter. Let θ̂_n be the maximum likelihood estimate (MLE) of θ. To approximate the sampling distribution H_n of R(x, θ) := √n(θ̂_n − θ), the bootstrap method proceeds as follows.
1. Let x* = {x*_0, · · · , x*_n} denote a Markov chain realization of n steps based on (p_{i,j}(θ̂_n)). Call this a bootstrap sample, and let θ̂*_n be the MLE of θ based on x*.
2. Approximate the sampling distribution H_n of R(x, θ) by the conditional distribution H*_n of R(x*, θ̂_n) = √n(θ̂*_n − θ̂_n) given x.
The difficulty in implementing the bootstrap method lies on Step 2. Here, we approximate
the bootstrap distribution by using Monte Carlo simulation as usual.
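As a concrete sketch of this algorithm, the code below simulates the induced birth-death chain, computes the MLE by bisection, and draws parametric bootstrap replications. The upward-transition probability e^{θ−x}/(1 + e^{θ−x}) (a correct answer leads to a harder item) is our assumption for illustration, chosen to match the derivatives of g used in Section 5; it is not necessarily the paper's exact transition matrix (4).

```python
import math
import random

def p_up(theta, x):
    # Assumed probability of moving to the harder item x + 1; this logistic
    # form matches the derivatives of g used in Section 5.
    return 1.0 / (1.0 + math.exp(x - theta))

def simulate_chain(theta, n, x0=0, rng=random):
    # One realization {x_0, ..., x_n} of the induced Markov chain.
    xs = [x0]
    for _ in range(n):
        xs.append(xs[-1] + 1 if rng.random() < p_up(theta, xs[-1]) else xs[-1] - 1)
    return xs

def mle(xs):
    # The score sum_t (u_t - p_up(theta, x_{t-1})), with u_t = 1 for an
    # upward step, is strictly decreasing in theta: solve by bisection.
    steps = [(xs[t - 1], xs[t] > xs[t - 1]) for t in range(1, len(xs))]
    def score(theta):
        return sum((1.0 if up else 0.0) - p_up(theta, x) for x, up in steps)
    lo, hi = -15.0, 15.0
    for _ in range(50):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if score(mid) > 0.0 else (lo, mid)
    return 0.5 * (lo + hi)

rng = random.Random(2002)
theta0, n, B = 0.0, 150, 200
x = simulate_chain(theta0, n, rng=rng)
theta_hat = mle(x)

# Steps 1-2: draw bootstrap chains from p_{i,j}(theta_hat) and collect
# replications of R(x*, theta_hat) = sqrt(n) (theta*_n - theta_hat).
boot_R = [math.sqrt(n) * (mle(simulate_chain(theta_hat, n, rng=rng)) - theta_hat)
          for _ in range(B)]
```

The empirical distribution of boot_R then serves as the Monte Carlo approximation of H∗n.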
For the percentile bootstrap confidence interval, we repeatedly generate 1,000 bootstrap
samples x∗ according to the above bootstrap algorithm, and the replications θ∗n are computed.
Let G be the cumulative distribution function of θ∗n. The 1 − 2α percentile interval is defined
by the α and 1 − α percentiles of G:
[θl, θu] = [G−1(α), G−1(1 − α)] = [θ∗n(α), θ∗n(1−α)].
The bootstrap-t method estimates the percentiles of a studentized statistic T = (θn − θ)/σn
by bootstrapping, where σ2n is the variance estimator; in our simulation, we use In to estimate
the variance. For each sample, 1,000 bootstrap values T∗ = (θ∗n − θn)/I∗n are generated. The
95% central interval is then [θn − Tu · In, θn − Tl · In], where Tl and Tu are the 2.5% and 97.5%
empirical percentiles based on the bootstrap samples.
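Once replications are available, both intervals reduce to empirical percentiles. The helpers below are an illustrative sketch: the order-statistic quantile rule and the stand-in grid of replicates are our choices, not the paper's.

```python
import math

def quantile(values, q):
    # Simple order-statistic quantile: smallest value with at least a
    # fraction q of the sample at or below it.
    s = sorted(values)
    k = max(0, min(len(s) - 1, math.ceil(q * len(s)) - 1))
    return s[k]

def percentile_interval(boot_thetas, alpha=0.025):
    # [G^{-1}(alpha), G^{-1}(1 - alpha)] from the bootstrap cdf G.
    return quantile(boot_thetas, alpha), quantile(boot_thetas, 1 - alpha)

def bootstrap_t_interval(theta_hat, scale, boot_T, alpha=0.025):
    # [theta_hat - T_u * scale, theta_hat - T_l * scale], where T_l and T_u
    # are the alpha and 1 - alpha empirical percentiles of T*.
    t_l, t_u = quantile(boot_T, alpha), quantile(boot_T, 1 - alpha)
    return theta_hat - t_u * scale, theta_hat - t_l * scale

# Stand-in replicates on an even grid from -0.50 to 0.50, for illustration.
boot = [(i - 50) / 100.0 for i in range(101)]
lo, hi = percentile_interval(boot)
```

With real replicates θ∗n in place of the grid, `percentile_interval` gives [θl, θu], and `bootstrap_t_interval(theta_hat, scale, boot_T)` gives the studentized interval.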
The simulation results are summarized in Tables 3.7-3.10, which show that the true
coverage probabilities for both methods (R-MLE and the up-and-down method) are close to
the nominal coverage probability 95%.
The first confidence interval uses the asymptotic result of Chang and Ying (1999), which
states that √(IRMLE(θn)) (θn − θ) −→ N(0, 1) in distribution.
Let zα denote the 100α percentile of the standard normal distribution. Then
(θn − z0.975/√(In(θn)), θn + z0.975/√(In(θn)))
gives an approximate 95% confidence interval. For this nominal 95% confidence interval, we
study its true coverage probability
P(z0.025 ≤ √(In(θn)) (θn − θ) ≤ z0.975).
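A minimal sketch of this Wald-type interval, assuming (i) the logistic transition rule consistent with Section 5 and (ii) that In(θn) is the Fisher information accumulated along the realized path, ∑t pt(1 − pt); both are our assumptions for illustration.

```python
import math
import random

Z975 = 1.959964  # upper 97.5% point of the standard normal

def p_up(theta, x):
    # Assumed logistic probability of moving to the harder item x + 1.
    return 1.0 / (1.0 + math.exp(x - theta))

def simulate_chain(theta, n, x0=0, rng=random):
    xs = [x0]
    for _ in range(n):
        xs.append(xs[-1] + 1 if rng.random() < p_up(theta, xs[-1]) else xs[-1] - 1)
    return xs

def mle(xs):
    # The score sum_t (u_t - p_t) is strictly decreasing in theta; bisection.
    steps = [(xs[t - 1], xs[t] > xs[t - 1]) for t in range(1, len(xs))]
    def score(theta):
        return sum((1.0 if up else 0.0) - p_up(theta, x) for x, up in steps)
    lo, hi = -15.0, 15.0
    for _ in range(50):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if score(mid) > 0.0 else (lo, mid)
    return 0.5 * (lo + hi)

def wald_interval(xs):
    th = mle(xs)
    # Accumulated Fisher information I_n(th) = sum_t p_t (1 - p_t),
    # our assumed form of I_n along the realized path.
    info = sum(p_up(th, x) * (1.0 - p_up(th, x)) for x in xs[:-1])
    half = Z975 / math.sqrt(info)
    return th - half, th + half

rng = random.Random(7)
lo, hi = wald_interval(simulate_chain(0.0, 150, rng=rng))
```

For n = 150 the resulting lengths are of the same order as the AL column of Table 3.5.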
Table 3.7 Percentile by the up-and-down method
θ = 0 θ = 1 θ = 2
n CP AL n CP AL n CP AL
50 0.9529 1.123377 50 0.9498 1.151801 50 0.9505 1.237174
100 0.9506 0.792974 100 0.9488 0.803081 100 0.9465 0.829987
150 0.9514 0.647380 150 0.9496 0.652618 150 0.9492 0.667122
Table 3.8 Bootstrap-t by the up-and-down method
θ = 0 θ = 1 θ = 2
n CP AL n CP AL n CP AL
50 0.9557 1.132068 50 0.9560 1.161960 50 0.9590 1.255542
100 0.9472 0.795060 100 0.9540 0.804977 100 0.9484 0.831613
150 0.9489 0.648007 150 0.9459 0.653667 150 0.9516 0.668290
Table 3.9 Percentile by the R-MLE method
θ = 0 θ = 1 θ = 2
n CP AL n CP AL n CP AL
50 0.9498 1.144303 50 0.9465 1.148002 50 0.9580 1.157604
100 0.9496 0.798792 100 0.9471 0.800395 100 0.9482 0.803923
150 0.9498 0.648822 150 0.9499 0.649643 150
Table 3.10 Bootstrap-t by the R-MLE method
θ = 0 θ = 1 θ = 2
n CP AL n CP AL n CP AL
50 0.9505 1.147494 50 0.9510 1.14946 50 0.9536 1.158084
100 0.9517 0.799233 100 0.9540 0.804977 100 0.9491 0.804242
150 0.9508 0.649507 150 0.9459 0.653667 150
Tables 3.8-3.10 show that the bootstrap methods give reasonably accurate interval
estimation for the unknown parameter θ. This is not accidental, since the bootstrap method
is known to be second-order accurate. Theoretical justification of these results will be
published in a separate paper.
4 Concluding Remarks and Further Research
In this paper, we introduce the up-and-down method as an item selection rule in computerized
adaptive testing (CAT), based on the 2-parameter logistic model (2-PLM). In particular, we
study the a-stratified multistage case in more detail. We conduct a simulation experiment
to compare the accuracy of parameter estimates obtained by the up-and-down method and
the recursive maximum likelihood estimation (R-MLE) method. The results suggest that the
up-and-down method has the potential to be an alternative to the commonly used R-MLE
method. Asymptotic behaviors of the MLE based on the up-and-down method are also
investigated.
The simulation results show that there is little difference between the up-and-down
method and the R-MLE method in terms of the accuracy of point estimation and the
construction of confidence intervals. Regarding statistical efficiency, although the selection
of test items in the up-and-down method does not utilize all the information contained in
the data, the simulation results suggest that it does not lose much information. The
performance of the up-and-down method is almost the same as that of R-MLE for a
reasonable range of the test length n. Moreover, from a computational point of view, the
up-and-down method is much easier to implement than the R-MLE, both in the item
selection procedure and in the number of times θ must be estimated. In addition, the
maximum likelihood estimate of θ obtained from the up-and-down method is consistent and
asymptotically normal.
Some problems remain to be solved. First, the issue of model sensitivity is an
interesting problem. Since human behavior is quite complex, there is reason to doubt
whether the logistic model or the probit model adequately describes an examinee's response.
How will the performance of the up-and-down method and the R-MLE method be affected
when the model is misspecified? Second, in computerized criterion-referenced testing, the
emphasis is on classification. A comparison of the performance of both item selection rules
in that setting deserves to be pursued further. Third, the multistage problem in computerized
adaptive testing is also a challenging task and deserves further study.
5 Asymptotic Analysis
With a reasonable number of test items, the simulation results in Section 3 show that the
performances of the R-MLE and the up-and-down method do not differ much. In this
section, we study the asymptotic behavior of the MLE θn of the ability parameter θ when
the test items are selected by the up-and-down method. We first show that the data
{(Xt, Yt), t = 0, 1, · · · , n} can be reproduced from the ordered set of difficulty levels
{Xt, t = 0, 1, · · · , n} and vice versa. Next, we recognize that the latter can be formulated
as a Markov chain with a specific transition probability matrix P. Therefore, the likelihood
function defined in (2) can be expressed as a likelihood function derived from a parametric
Markov chain. To establish the consistency of the resulting estimator, we show that the test
items {Xt, t = 0, · · · , n} selected by the up-and-down method form an irreducible Markov
chain. For interval estimation, we first show that the induced Markov chain is positive
recurrent, and then apply the regeneration method to derive the asymptotic distribution of
the maximum likelihood estimator.
5.1 Technical Results
Denote by Pπ the probability measure under which the initial distribution is the stationary
distribution π, and let Eπ be the expectation under Pπ. In order to employ the regeneration
method to establish consistency and asymptotic normality, we need to show that the Markov
chain (4) is irreducible and positive recurrent.
Consider a Markov chain {Xn, n ≥ 0} on the countable state space S = {· · · , −2, −1, 0, 1,
2, · · ·}, with transition probabilities pij for i, j ∈ S. Denote
f∗i,i := ∑_{n=1}^{∞} P{Xν ≠ i, 0 < ν < n; Xn = i | X0 = i}.
A state i ∈ S is called recurrent if f∗i,i = 1. Assuming that state i is recurrent, let Ti be the
first regeneration time of the chain to state i, that is,
Ti = inf{n ≥ 1 : Xn = i}, with Ti = ∞ if no such n exists.
A recurrent state i is called positive if and only if E(Ti) < ∞. The irreducibility of the
Markov chain implies that if one state is positive recurrent, then all states are positive
recurrent.
Now we state the key theorem used in this section.
Theorem 2 Let {Xn, n ≥ 0} be an ergodic (irreducible, aperiodic and positive recurrent)
Markov chain on the countable state space S = {· · · , −2, −1, 0, 1, 2, · · ·}, with stationary
distribution π. Let h be a real-valued function on the state space S. Suppose Eπ(|h|) < ∞.
Then the following hold.
(a) ∑_{t=1}^{N} h(Xt)/N converges to Eπ{h(X1)} in probability.
(b) If Eπ(|h|2) < ∞ and σ2 := Var(∑_{t=1}^{Tx0} h(Xt)) < ∞, then
[√N/(σ√π(x0))] {∑_{t=1}^{N} h(Xt)/N − Eπ{h(X1)}} −→ N(0, 1) in distribution. (9)
To prove that there exists a stationary probability distribution for the Markov chain (4),
we need to show that the chain induced by the up-and-down method is ergodic. By the
definition of (4), it is easy to see that irreducibility and aperiodicity hold. To prove positive
recurrence, we need the following theorems.
Theorem 3 Let {Xn, n ≥ 0} be a Markov chain on the countable state space S = {· · · , −2, −1, 0,
1, 2, · · ·}, with transition probabilities pij for i, j ∈ S. Let 0 < αi < 1 and βi = 1 − αi be given
numbers such that
pi,i+1 = αi, pi,i−1 = βi, for i ≥ 0;
pi,i−1 = αi, pi,i+1 = βi, for i < 0.
(a) The state 0 is recurrent, i.e., f∗0,0 = 1, if and only if
∑_{r≥1} (β1 × · · · × βr)/(α1 × · · · × αr) = ∞ and ∑_{r≥1} (β−1 × · · · × β−r)/(α−1 × · · · × α−r) = ∞.
(b) The recurrent state 0 is positive if and only if
∑_{r≥1} (α1 · · · αr−1)/(β1 · · · βr−1βr) < ∞ and ∑_{r≥1} (α−1 · · · α−(r−1))/(β−1 · · · β−(r−1)β−r) < ∞.
Remark. The results of Theorem 3 can be found in Chung (1967), where only the one-sided
Markov chain is studied. The argument there generalizes easily to the two-sided Markov
chain above.
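These two series criteria can be examined numerically for a logistic up-and-down chain. In the sketch below, αi = 1/(1 + e^i) is our illustrative choice (θ = 0, states relabeled so that 0 is nearest the ability): the terms of the recurrence series in (a) grow without bound, while the positive-recurrence series in (b) converges.

```python
import math

def alpha(i):
    # Assumed upward probability from state i >= 1 (logistic, theta = 0).
    return 1.0 / (1.0 + math.exp(i))

def beta(i):
    return 1.0 - alpha(i)

# Partial products beta_1...beta_r / (alpha_1...alpha_r): the terms of the
# recurrence series in Theorem 3(a); here each ratio beta_r/alpha_r = e^r,
# so the terms explode and the series diverges (state 0 is recurrent).
rec_terms, prod = [], 1.0
for r in range(1, 15):
    prod *= beta(r) / alpha(r)
    rec_terms.append(prod)

# Positive-recurrence series of Theorem 3(b):
# sum_r alpha_1...alpha_{r-1} / (beta_1...beta_{r-1} beta_r), which converges.
pos_sum, prod = 0.0, 1.0
for r in range(1, 60):
    pos_sum += prod / beta(r)
    prod *= alpha(r) / beta(r)
```

By symmetry the negative-index series behave identically, so both conditions of Theorem 3 hold for this chain.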
After showing that the Markov chain generated by the up-and-down method is positive
recurrent, Theorem 2 can be used to derive the asymptotic results in Theorem 1.
Theorem 4 The Markov chain with transition probability defined in (4) is positive recurrent.
Proof. Let n0 be the integer satisfying n0 − 1 ≤ θ < n0. Denote
pn0+i,n0+i+1 = αi, pn0+i,n0+i−1 = βi, for i ≥ 0,
pn0+i,n0+i−1 = αi, pn0+i,n0+i+1 = βi, for i < 0.
Note that the logistic curve is monotone increasing. Hence βi/αi > 1 for i ≥ 0. We conclude
that
∑_{r≥1} (β1 × · · · × βr)/(α1 × · · · × αr) = ∞ and ∑_{r≥1} (β−1 × · · · × β−r)/(α−1 × · · · × α−r) = ∞.
It follows from Theorem 3(a) that f∗n0,n0 = 1, so that n0 is a recurrent state.
Next we show that E(Tx0) < ∞ for all x0. Recall that βi > 1/2 and αi/βi < 1, and that αi
is monotone decreasing for i ≥ 0. We have
∑_{r≥1} (α1 × · · · × αr−1)/(β1 × · · · × βr) < ∞.
By a similar argument, we have
∑_{r≥1} (α−1 × · · · × α−(r−1))/(β−1 × · · · × β−r) < ∞.
Theorem 3(b) hence implies that x0 is positive recurrent. Since the Markov chain (4) is
irreducible, all states in S are positive recurrent. That is, the Markov chain is irreducible,
aperiodic and positive recurrent. Then the vector
ϕ = (· · · , 1/E(T−1), 1/E(T0), 1/E(T1), · · ·)
is a stationary probability distribution for (4). ♦
Remark. Note that the up-and-down method is a nonparametric method of selecting
test items. Hence, as long as the item response function is continuous, strictly monotone
increasing, and ranges over (0, 1), Theorem 4 remains valid.
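Theorem 4, together with the identity π(i) = 1/E(Ti), can be checked by simulation. The sketch below assumes the logistic upward-transition probability of Section 5 and compares three quantities that should agree: the long-run frequency of state 0, the reciprocal of the mean return time to 0, and the stationary probability of 0 computed from detailed balance.

```python
import math
import random

def p_up(theta, x):
    # Assumed logistic upward-transition probability (cf. Section 5).
    return 1.0 / (1.0 + math.exp(x - theta))

theta, N = 0.0, 200000
rng = random.Random(42)
x, count0, returns, last_visit = 0, 0, [], 0
for t in range(1, N + 1):
    x = x + 1 if rng.random() < p_up(theta, x) else x - 1
    if x == 0:                       # a return to state 0
        count0 += 1
        returns.append(t - last_visit)
        last_visit = t

freq0 = count0 / N                               # long-run frequency of 0
pi0_hat = 1.0 / (sum(returns) / len(returns))    # 1 / E(T_0)

# Stationary distribution from detailed balance of the birth-death chain:
# pi(i+1)/pi(i) = p_up(theta, i) / (1 - p_up(theta, i+1)).
w = {0: 1.0}
for i in range(0, 30):
    w[i + 1] = w[i] * p_up(theta, i) / (1.0 - p_up(theta, i + 1))
for i in range(0, -30, -1):
    w[i - 1] = w[i] * (1.0 - p_up(theta, i)) / p_up(theta, i - 1)
pi0 = w[0] / sum(w.values())
```

All three quantities agree to within simulation error, illustrating that the chain is positive recurrent with π(0) = 1/E(T0).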
5.2 Asymptotic Behavior of the MLE
By using the results in Theorem 2, we prove our main results, the weak consistency and
asymptotic normality of the MLE θn. One major contribution here is the characterization
of the Fisher information in Theorem 1, which can be used to construct a confidence
interval for the true parameter θ0.
Proof of Theorem 1. The proof follows the argument outlined in Section 2. First,
we show that (5) holds. Note that X0 = 0, and
Eθ0{(∂/∂θ)g(X1, X2; θ0) | X1} = Eθ0{[∂f(X1, X2; θ0)/∂θ]/f(X1, X2; θ0) | X1}
= ∑_{x2∈S} (∂/∂θ)f(X1, x2; θ0).
Differentiating both sides of the equation ∑_{x2∈S} f(x1, x2; θ) = 1 with respect to θ leads to
∑_{x2∈S} (∂/∂θ)f(x1, x2; θ) = 0.
This implies that Eθ0{(∂/∂θ)g(X1, X2; θ0)} = 0.
Since
(∂/∂θ)g(x1, x2; θ) = 1/(1 + eθ−x1) if x2 = x1 + 1, and −eθ−x1/(1 + eθ−x1) if x2 = x1 − 1,
we have
Eθ0{|(∂/∂θ)g(X1, X2; θ0)| | X1} ≤ 1/4.
By Theorem 2(a), (5) holds, since
(1/n) ∑_{t=1}^{n} (∂/∂θ)g(Xt−1, Xt; θ)|θ=θ0 −→ Eθ0{(∂/∂θ)g(X1, X2; θ0)} = 0 in probability.
Next, we show that (6) holds. Differentiating ∑_{x2∈S} f(x1, x2; θ) = 1 twice with
respect to θ leads to
∑_{x2∈S} (∂2/∂θ2)f(x1, x2; θ) = 0,
and hence
Eθ0{(∂2/∂θ2)g(X1, X2; θ0) | X1} = Eθ0{(∂2f/∂θ2)/f − (∂f/∂θ)2/f2 | X1}
= Eθ0{(∂2f/∂θ2)/f | X1} − Eθ0{(∂f/∂θ)2/f2 | X1}
= −Eθ0{[(∂/∂θ)g(X1, X2; θ0)]2 | X1}.
Again, for every given x1 ∈ S, we have
Eθ0{[(∂/∂θ)g(X1, X2; θ0)]2 | X1 = x1} = eθ0−x1/(1 + eθ0−x1)2 ≤ 1/4.
Therefore (7) holds.
We also need to calculate Eθ0{|(∂2/∂θ2)g(X1, X2; θ0)|}. For x2 = x1 + 1 or x2 = x1 − 1, we
have
|(∂2/∂θ2)g(X1, X2; θ0)| = eθ0−x1/(1 + eθ0−x1)2 ≤ 1/4,
and this implies that
Eθ0{|(∂2/∂θ2)g(X1, X2; θ0)|} < ∞.
It follows from Theorem 2(a) that (6) holds, since
(1/n) ∑_{k=1}^{n} (∂2/∂θ2)g(Xk−1, Xk; θ)|θ=θ0 −→ Eθ0{(∂2/∂θ2)g(X1, X2; θ0)} = −I in probability.
Denote
G(x1, x2) := supθ∈R |(∂3/∂θ3)g(x1, x2; θ)| = supθ∈R |eθ−x1(1 − eθ−x1)/(1 + eθ−x1)3| < 1.
There exists a constant M such that
(1/n) ∑_{t=1}^{n} G(Xt−1, Xt) −→ M in probability. (10)
By the mean value theorem, for some |α| < 1, we have
(1/n)(∂/∂θ)Ln(θ) = (1/n) ∑_{t=1}^{n} (∂/∂θ)g(xt−1, xt; θ)
= (1/n) ∑_{t=1}^{n} (∂/∂θ)g(xt−1, xt; θ0) + (θ − θ0)(1/n) ∑_{t=1}^{n} (∂2/∂θ2)g(xt−1, xt; θ0)
+ [α/(2n)](θ − θ0)2 ∑_{t=1}^{n} G(xt−1, xt).
Let S∗ denote the collection of (x1, · · · , xn) satisfying
|(1/n) ∑_{t=1}^{n} (∂/∂θ)g(xt−1, xt; θ0)| < δ2, (1/n) ∑_{t=1}^{n} (∂2/∂θ2)g(xt−1, xt; θ0) < −I/2,
and (1/n) ∑_{t=1}^{n} G(xt−1, xt) < 2M.
It follows from (5), (7) and (10) that, for all δ, ε > 0, there exists an n0(δ, ε) such that
P(S∗) > 1 − ε when n > n0(δ, ε).
For θ = θ0 ± δ with δ < I/[2(M + 1)], we have
(1/n)(∂/∂θ)Ln(θ)|θ=θ0+δ ≤ δ2 − (1/2)Iδ + Mδ2 < 0
if (x1, · · · , xn) ∈ S∗. By the same argument, we have
(1/n)(∂/∂θ)Ln(θ)|θ=θ0−δ > 0.
Since (1/n)(∂/∂θ)Ln(θ) is continuous, for any δ, ε > 0 the likelihood equation has, with
probability exceeding 1 − ε, a root in (θ0 − δ, θ0 + δ) as long as n > n0(δ, ε). We conclude
that
θn −→ θ0 in probability. (11)
To prove 2), we first characterize the asymptotic variance Var(∑_{t=1}^{Tx0} (∂/∂θ)g(Xt−1, Xt; θ0)),
where Tx0 is the first regeneration time to state x0. Recall that Eθ0{(∂/∂θ)g(X1, X2; θ0)} = 0
and Eθ0{[(∂/∂θ)g(X1, X2; θ0)]2} = I. Then
Var(∑_{t=1}^{Tx0} (∂/∂θ)g(Xt−1, Xt; θ0))
= Eθ0(∑_{t=1}^{Tx0} (∂/∂θ)g(Xt−1, Xt; θ0))2 − [Eθ0(∑_{t=1}^{Tx0} (∂/∂θ)g(Xt−1, Xt; θ0))]2
= Eθ0 ∑_{t=1}^{Tx0} [(∂/∂θ)g(Xt−1, Xt; θ0)]2
+ 2 ∑_{t′>t} Eθ0[(∂/∂θ)g(Xt−1, Xt; θ0)(∂/∂θ)g(Xt′−1, Xt′; θ0)]
= [1/π(x0)] Eθ0{[(∂/∂θ)g(X1, X2; θ0)]2}
+ 2 ∑_{t′>t} Eθ0{(∂/∂θ)g(Xt−1, Xt; θ0) Eθ0[(∂/∂θ)g(Xt′−1, Xt′; θ0) | Xt′−1]}
= I/π(x0),
since the inner conditional expectation vanishes. By Theorem 2, we have
(1/√n) ∑_{t=1}^{n} (∂/∂θ)g(Xt−1, Xt; θ0) −→ N(0, I) in distribution.
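The identity Var(∑_{t=1}^{Tx0} (∂/∂θ)g) = I/π(x0) can be verified by simulation: cut a long path into regeneration blocks at the returns to x0 = 0 and compare the sample variance of the block score sums with I/π(x0) estimated along the same path. The logistic transition rule below is our assumption for the sketch, matching the derivatives of g in this section.

```python
import math
import random

def p_up(theta, x):
    # Assumed logistic upward-transition probability (cf. Section 5).
    return 1.0 / (1.0 + math.exp(x - theta))

theta0, N = 0.0, 200000
rng = random.Random(314)
x, block_sum, blocks, visits0, info = 0, 0.0, [], 0, 0.0
for _ in range(N):
    p = p_up(theta0, x)
    info += p * (1.0 - p)                  # E[(dg/dtheta)^2 | X] terms
    up = rng.random() < p
    block_sum += (1.0 - p) if up else -p   # dg/dtheta along the path
    x = x + 1 if up else x - 1
    if x == 0:                             # regeneration: return to x0 = 0
        blocks.append(block_sum)
        block_sum = 0.0
        visits0 += 1

I_hat = info / N           # per-step Fisher information I
pi0_hat = visits0 / N      # estimate of pi(x0)
mean_b = sum(blocks) / len(blocks)
var_blocks = sum((b - mean_b) ** 2 for b in blocks) / len(blocks)
# var_blocks should be close to I_hat / pi0_hat, and mean_b close to 0.
```

The block sums have mean close to 0 and variance close to I/π(x0), in line with the computation above.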
Note that the score equation (1/n)(∂/∂θ)Ln(θn) = 0 can be written as
0 = (1/n) ∑_{t=1}^{n} (∂/∂θ)g(xt−1, xt; θ0) + (θn − θ0)(1/n) ∑_{t=1}^{n} (∂2/∂θ2)g(xt−1, xt; θ0)
+ [α/(2n)](θn − θ0)2 ∑_{t=1}^{n} G(xt−1, xt).
We have
√n(θn − θ0) = −[(1/√n) ∑_{t=1}^{n} (∂/∂θ)g(Xt−1, Xt; θ0)] / [(1/n) ∑_{t=1}^{n} (∂2/∂θ2)g(Xt−1, Xt; θ0)
+ (α/2)(θn − θ0)(1/n) ∑_{t=1}^{n} G(Xt−1, Xt)]
−→ N(0, I−1) in distribution. ♦
5.3 Proof of Theorem 2
Proof of (a). For simplicity, take x0 to be the state 0, which is positive recurrent, and
denote by m := ∑_{j=1}^{N} Ix0(Xj) the number of visits to state x0 up to time N. It is known
(cf. Chung, 1967) that m/N → π(x0) in probability, where π(x0) = 1/E(Tx0). Let T^k_{x0} be
the kth regeneration time to state x0, and denote
ηj(h) := ∑_{i=T^{j−1}_{x0}+1}^{T^j_{x0}} h(Xi)
as the jth regeneration epoch. Note that {ηj(h), j = 1, · · · , m} forms i.i.d. blocks due to the
strong Markov property of the underlying Markov chain. Write
(1/N) ∑_{j=1}^{N} h(Xj) = (1/N) ∑_{j=T^m_{x0}+1}^{N} h(Xj)
+ (1/N)[∑_{j=1}^{m} ηj(h) − ∑_{j=1}^{[Nπ(x0)]} ηj(h)] + (1/N) ∑_{j=1}^{[Nπ(x0)]} ηj(h)
:= I1 + I2 + I3. (12)
By the law of large numbers for i.i.d. random variables, we have
(1/N) ∑_{j=1}^{[Nπ(x0)]} ηj(h) = {[Nπ(x0)]/N}{1/[Nπ(x0)]} ∑_{j=1}^{[Nπ(x0)]} ηj(h)
−→ π(x0)Eπ[η1(h)] = Eπ(h) in probability.
Next, we show that both I1 and I2 converge to zero in probability. For any ε > 0,
Pπ{|∑_{j=T^m_{x0}+1}^{N} h(Xj)| > εN} ≤ Pπ{∑_{j=T^m_{x0}+1}^{N} |h(Xj)| > εN}
≤ Pπ{∑_{j=T^m_{x0}+1}^{T^{m+1}_{x0}} |h(Xj)| > εN}
= Pπ{η1(|h|) > εN} ≤ Eπ[η1(|h|)]/(εN) = Eπ(|h|)/[επ(x0)N].
The last inequality follows from the Markov inequality, and Eπ(|h|)/[επ(x0)N] → 0 as
N → ∞ since Eπ(|h|) < ∞. This implies that I1 → 0 in probability.
Since m/N −→ π(x0) in probability, for all ε > 0 there exists N0 such that for N > N0,
Pπ{|m − [Nπ(x0)]| > Nε2} < ε. Then, for N > N0, we have
Pπ{|∑_{j=1}^{m} ηj(h) − ∑_{j=1}^{[Nπ(x0)]} ηj(h)| > εN}
≤ Pπ{|m − [Nπ(x0)]| > Nε2} + Pπ{max_{|r−[Nπ(x0)]|≤ε2N} |∑_{j=[Nπ(x0)]+1}^{r} ηj(h)| > εN}
< ε + 2Pπ{max_{1≤r≤ε2N} |∑_{j=1}^{r} ηj(h)| > εN}
≤ ε + 2Pπ{∑_{j=1}^{[ε2N]} |ηj(h)| > εN}
≤ ε + 2ε2N Eπ[η1(|h|)]/(εN) = [1 + 2Eπ(|h|)/π(x0)]ε.
Therefore, I2 → 0 in probability. We conclude the proof of (a).
Proof of (b). Using the same argument as in the proof of (a), we have
[√N/(σ√π(x0))] {∑_{j=1}^{N} h(Xj)/N − Eπ[h(X1)]}
= [1/(σ√(Nπ(x0)))] ∑_{j=T^m_{x0}+1}^{N} h(Xj)
+ [1/(σ√(Nπ(x0)))] [∑_{j=1}^{m} ηj(h) − ∑_{j=1}^{[Nπ(x0)]} ηj(h)]
+ [√N/(σ√π(x0))] {∑_{j=1}^{[Nπ(x0)]} ηj(h)/N − Eπ[h(X1)]}
:= II1 + II2 + II3. (13)
First, we consider II3. Note that the ηj are i.i.d. random blocks. Under the conditions
Eπ(|h|2) < ∞ and σ2 := Var(∑_{t=1}^{Tx0} h(Xt)) < ∞, the standard central limit theorem
for i.i.d. random variables gives
II3 = [√(Nπ(x0))/σ] {∑_{j=1}^{[Nπ(x0)]} ηj(h)/[Nπ(x0)] − Eπ[h(X1)]/π(x0)}
−→ N(0, 1) in distribution. (14)
It remains to show that II1 and II2 converge to zero in probability. For any ε > 0,
we have
Pπ{|∑_{j=T^m_{x0}+1}^{N} h(Xj)| > εσ√(Nπ(x0))} ≤ Pπ{∑_{j=T^m_{x0}+1}^{N} |h(Xj)| > εσ√(Nπ(x0))}
≤ Pπ{∑_{j=T^m_{x0}+1}^{T^{m+1}_{x0}} |h(Xj)| > εσ√(Nπ(x0))}
= Pπ{η1(|h|) > εσ√(Nπ(x0))} ≤ Eπ[η1(|h|)]/[εσ√(Nπ(x0))] = Eπ(|h|)/[επ(x0)σ√(Nπ(x0))]. (15)
Hence, II1 converges to 0 in probability as N → ∞, by assumption.
Since m/N −→ π(x0) in probability, for all ε > 0 there exists N0 such that for N > N0,
Pπ{|m − [Nπ(x0)]| > Nε3} < ε. Clearly, for such N, we have
Pπ{|∑_{j=1}^{m} ηj(h) − ∑_{j=1}^{[Nπ(x0)]} ηj(h)| > εσ√(Nπ(x0))}
≤ Pπ{|m − [Nπ(x0)]| > Nε3} + Pπ{max_{|r−[Nπ(x0)]|≤ε3N} |∑_{j=[Nπ(x0)]+1}^{r} ηj(h)| > εσ√(Nπ(x0))}
< ε + 2Pπ{max_{1≤r≤ε3N} |∑_{j=1}^{r} ηj(h)| > εσ√(Nπ(x0))}
< ε + 2ε3Nσ2/[ε2σ2Nπ(x0)] = ε[1 + 2/π(x0)].
This proves that II2 converges to 0 in probability as N → ∞. We conclude the proof of
(b). ♦
References
[1] Anderson, T. W. and Goodman, L. A. (1957). Statistical inference about Markov chains. Ann. Math. Statist. 28, 89-110.
[2] Athreya, K. B. and Fuh, C. D. (1993). Central limit theorem for a double array of Harris chains. Sankhya A 55, 1-11.
[3] Basawa, I. V. and Prakasa Rao, B. L. S. (1980). Statistical Inference for Stochastic Processes. London: Academic Press.
[4] Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In F. M. Lord and M. R. Novick, Statistical Theories of Mental Test Scores. Reading, MA: Addison-Wesley.
[5] Chang, H. and Ying, Z. (1996). A global information approach to computerized adaptive testing. Applied Psychological Measurement 20, 213-229.
[6] Chang, H. and Ying, Z. (1999). A-stratified multistage computerized adaptive testing. Applied Psychological Measurement 23, 211-222.
[7] Chang, H. and Ying, Z. (19??). Nonlinear sequential designs for logistic item response theory models with application to computerized adaptive tests. The Annals of Statistics, ?, ?-?.
[8] Chao, M. T. and Fuh, C. D. (2001). Bootstrap methods for the up-and-down test on pyrotechnics sensitivity analysis. Statistica Sinica 11, 1-21.
[9] Chung, K. L. (1967). Markov Chains with Stationary Transition Probabilities. New York: Springer.
[10] Derman, C. (1957). Non-parametric up-and-down experimentation. Ann. Math. Statist. 28, 795-798.
[11] Dixon, W. J. and Mood, A. M. (1948). A method for obtaining and analyzing sensitivity data. J. Amer. Statist. Assoc. 43, 109-126.
[12] Fuh, C. D. and Zhang, C. H. (2000). Poisson equation, maximal inequalities and r-quick convergence for Markov random walks. Stochastic Processes and their Applications 87, 53-67.
[13] Lord, F. M. (1970). Some test theory for tailored testing. In W. H. Holtzman (Ed.), Computer-Assisted Instruction, Testing and Guidance. New York: Harper and Row.
[14] Lord, F. M. (1971). Robbins-Monro procedures for tailored testing. Educational and Psychological Measurement 31, 3-31.
[15] Lord, F. M. (1980). Applications of Item Response Theory to Practical Testing Problems. Hillsdale, NJ: Lawrence Erlbaum.
[16] Owen, R. J. (1975). A Bayesian sequential procedure for quantal response in the context of adaptive mental testing. Journal of the American Statistical Association 70, 351-356.
[17] Stocking, M. L. and Lewis, C. (1995). A new method of controlling item exposure in computerized adaptive testing (Research Report 95-25). Princeton, NJ: Educational Testing Service.
[18] Sympson, J. B. and Hetter, R. D. (1985). Controlling item-exposure rates in computerized adaptive testing. Proceedings of the 27th Annual Meeting of the Military Testing Association, pp. 973-977. San Diego, CA: Navy Personnel Research and Development Center.
[19] Thomas, E. V. (1994). Evaluating the ignition sensitivity of thermal-battery heat pellets. Technometrics 36, 273-282.
[20] Wainer, H. (1990). Computerized Adaptive Testing: A Primer. Hillsdale, NJ: Erlbaum.
[21] Weiss, D. J. (1976). Adaptive testing research in Minnesota: Overview, recent results, and future directions. In C. L. Clark (Ed.), Proceedings of the First Conference on Computerized Adaptive Testing, pp. 24-35. Washington, DC: United States Civil Service Commission.
[22] Wetherill, G. B. (1963). Sequential estimation of quantal response curves (with discussion). J. Roy. Statist. Soc. Ser. B 25, 1-48.
[23] Wetherill, G. B. and Glazebrook, K. D. (1986). Sequential Methods in Statistics. London: Chapman and Hall.