Bayesian Item Response Theory: Methods and Applications
University of Connecticut, OpenCommons@UConn
Doctoral Dissertations, University of Connecticut Graduate School
8-2-2019

Bayesian Item Response Theory: Methods and Applications
Yang Liu, University of Connecticut - Storrs, [email protected]

Recommended Citation: Liu, Yang, "Bayesian Item Response Theory: Methods and Applications" (2019). Doctoral Dissertations. 2274. https://opencommons.uconn.edu/dissertations/2274
Bayesian Item Response Theory: Methods and Applications
Yang Liu, Ph.D.
University of Connecticut, 2019
Item response theory (IRT) models play a critical role in psychometric studies
for the design and analysis of examinations. IRT models mainly consider the
relationship among the correctness of item responses, an individual's latent
ability, the difficulty of each item, and other potential factors such as guessing.
In this dissertation, we develop Bayesian modeling methods and model selection
techniques under the IRT model framework. For Bayesian model comparison, the
Bayes factor is a widely used tool, which requires computation of the marginal
likelihoods. For complex models such as the IRT models, the marginal likelihoods
are not analytically available. There are a variety of Monte Carlo methods for
estimating or computing the marginal likelihoods, though some of them may not be
feasible for IRT models due to the high dimensionality of the parameter space. We
review several different Monte Carlo methods for marginal likelihood computation
under classic IRT models, develop the “best” implementation of these methods for
the IRT models, and apply these methods to a real dataset for comparison of the
classic one-parameter IRT model and two-parameter IRT model.
With increasing availability of computerized testing, observations are often col-
lected at irregular and variable time points. We adopt a dynamic IRT model based
on the one-parameter IRT model to accommodate this data structure. A hierar-
chical layer on the dynamic IRT model is built to capture the relationship between
the “growth factor” and the characteristics of individuals. We use the Bayes factor
to perform variable selection on the covariates linked to the growth, and develop a
Monte Carlo approach to compute the Bayes factors for all model pairs using a sin-
gle Markov chain Monte Carlo (MCMC) output. We also show the model selection
consistency of the Bayes factor under certain conditions.
Additionally, to allow more flexibility, we propose a nonparametric model and
embed a monotone shape constraint on the mean latent growth trend. Further, we
develop a partially collapsed Gibbs sampling algorithm coupled with a reversible
jump MCMC technique to sample the dimension-varying parameters from their
corresponding posterior distribution.
Bayesian Item Response Theory: Methods and Applications
Yang Liu
B.S., Zhejiang University, China, 2014
A Dissertation
Submitted in Partial Fulfillment of the
Requirements for the Degree of
Doctor of Philosophy
at the
University of Connecticut
2019
Copyright by
Yang Liu
2019
APPROVAL PAGE
Doctor of Philosophy Dissertation
Bayesian Item Response Theory: Methods and Applications
Presented by
Yang Liu, B.S.
Major Advisor
Ming-Hui Chen
Major Advisor
Xiaojing Wang
Associate Advisor
Dipak K. Dey
University of Connecticut
2019
To my mother Liqing Yang, my father Junxing Liu, and Xi Wang.
ACKNOWLEDGEMENTS
First and foremost, I would like to express my deep gratitude to my major
advisors, Prof. Ming-Hui Chen and Prof. Xiaojing Wang, for their tremendous
support throughout all these years. They led me into the world of Bayesian
statistics, guided me into various research areas of statistics, and inspired me
to pursue my career as a researcher. They are my wonderful role models: kind,
humble, diligent in life, and always patient with me. I am truly blessed to have
them as my advisors and mentors for my Ph.D. study.
I am also very grateful to my associate advisor, Prof. Dipak K. Dey, who is
always supportive. He encourages me and provides guidance and insightful advice
for both my research and student activities.
My sincere thanks also go to Prof. Lynn Kuo, Prof. Zhiyi Chi, and Prof. Haiying
Wang for serving as my general exam committee.
I would like to thank Dr. Donghui Zhang and Dr. Xiwen Ma for their great
support and advice on research and professional development. Many thanks to Dr.
Naitee Ting for his invaluable career advice and guidance.
Many thanks are owed to my fellow graduate students and the faculty members
who accompanied and supported me during this journey. I would also like to thank
Ms. Tracy Burke, Mr. Anthony Luis, and Ms. Megan Petsa for their helpful
administrative assistance.
Lastly, and most importantly, I want to dedicate this dissertation to my mother
Liqing and my father Junxing, and my love, Xi, who always encourage and support
me to pursue the life and career of my choice.
TABLE OF CONTENTS
Chapter 1: Introduction 1
1.1 Classic Item Response Theory Models and Marginal Likelihood Com-
putation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Dynamic Item Response Theory Models and Extensions . . . . . . . 3
1.3 Motivating Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.3.1 The English Examination Dataset . . . . . . . . . . . . . . . 9
1.3.2 The EdSphere Dataset . . . . . . . . . . . . . . . . . . . . . . 10
1.4 Dissertation Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Chapter 2: A Comparison of Monte Carlo Methods for Computing
Marginal Likelihoods of Item Response Theory Models 14
2.1 Classic Item Response Theory Models . . . . . . . . . . . . . . . . . 14
2.2 Monte Carlo Methods . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2.1 The Prior-Based Monte Carlo (PMC) Method . . . . . . . . . 18
2.2.2 The Harmonic Mean (HM) Method . . . . . . . . . . . . . . 18
2.2.3 The Chen Method . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2.4 The Chib Method . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2.5 The Stepping Stone Method (SSM) . . . . . . . . . . . . . . . 25
2.3 Analysis of English Examination Data . . . . . . . . . . . . . . . . . 27
2.3.1 Results of Marginal Likelihood Computation . . . . . . . . . 28
2.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
Chapter 3: Variable Selection in Dynamic Item Response Theory
Models via Bayes Factors with a Single MCMC Output 34
3.1 Proposed Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2 Computation of Bayes Factors with a Single MCMC Output . . . . . 36
3.3 Model Selection Consistency . . . . . . . . . . . . . . . . . . . . . . 40
3.4 Simulation Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.5 Real Data Application . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
Chapter 4: Bayesian Nonparametric Monotone Regression of Dy-
namic Latent Traits in IRT Models 51
4.1 Nonparametric Monotone Regression Model . . . . . . . . . . . . . . 51
4.2 Bayesian Computation Scheme . . . . . . . . . . . . . . . . . . . . . 57
4.2.1 Prior Specification and Posterior Distribution . . . . . . . . . 57
4.2.2 The MCMC Sampling Schemes . . . . . . . . . . . . . . . . . 60
4.3 Simulation Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.3.1 An Example Using the Mixture of TSP Functions as the
Latent Trajectory . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.3.2 An Example of the Logistic Curve as the Latent Trajectory . 67
4.4 Application to the EdSphere Data . . . . . . . . . . . . . . . . . . . 71
4.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
Chapter 5: Preliminary Results of Model Selection by DIC 77
5.1 Model Selection via DIC under Hierarchical Model . . . . . . . . . . 77
5.1.1 The Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.1.2 Choose θ as the Parameters in Focus . . . . . . . . . . . . . . 80
5.1.3 Choose τ as the Parameters in Focus . . . . . . . . . . . . . 82
5.1.4 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . 84
Appendix A: Posterior Propriety and MCMC Sampling Scheme 87
Bibliography 96
LIST OF TABLES
1.1 Illustration of the unequally spaced response data structure. . . . . . 11
2.1 Results of DIC and WAIC for the 1PL model and the 2PL model. . . 27
2.2 Log marginal likelihoods obtained by each method and corresponding
Monte Carlo standard errors in parentheses. . . . . . . . . . . . . . . 30
2.3 Time usage to obtain the estimates by each method under 1PL and
2PL IRT models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.1 Percentage of simulations in which the true model is ranked as the
“best” model by the Bayes factor approach. . . . . . . . . . . . . . . 47
3.2 Percentage of simulations in which the true model is ranked as the
top 3 “best” model by the Bayes factor approach. . . . . . . . . . . . 47
3.3 Estimated log Bayes factor of each model to the full model via the
proposed Monte Carlo approach. . . . . . . . . . . . . . . . . . . . . 48
4.1 Simulated parameters for each individual of TSP functions. . . . . . . 63
4.2 Coverage probabilities of φ, θi’s, τi’s, βi,0’s and βi,1’s by 100 indepen-
dent simulations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.1 Model identification counts in 100 simulations per setting of DIC1,
DICθ4 , and DICτ4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.2 PD results in 100 simulations per setting of DIC1, DICθ4 , and DICτ4 . 86
A.1 Characteristics for each selected individual in the EdSphere dataset. . 95
LIST OF FIGURES
2.1 Posterior medians and 95% credible intervals of aj’s under the 2PL
model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.2 Estimated item characteristic curves for the items 1, 4, 5, 8, 20. . . . 29
3.1 Trace plots of β, g and ψ with 3000 MCMC samples. The red line
refers to the true value. . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.1 Illustration of several TSP functions with different γ and b values. . . 56
4.2 The estimation of latent trajectories for the 1st, 2nd, 3rd, and 6th subjects 65
4.3 Posterior median and 95% CIs of τi’s in TSP mixture based simulation. 66
4.4 The simulation results of posterior median and 95% CIs of βi,0’s and
βi,1’s in TSP mixture based latent trajectories. . . . . . . . . . . . . . 66
4.5 The estimation of latent trajectories for the 4th, 6th, 7th, and 8th
subjects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.6 Posterior median and 95% CIs of τi’s in logistic curve simulation. . . 70
4.7 Posterior median and 95% CIs of βi,0’s and βi,1’s in logistic curve
simulation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.8 The estimation of latent trajectories for the 2nd, 3rd, 6th and 7th
subjects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.9 Posterior median and 95% CIs of τi’s. . . . . . . . . . . . . . . . . . . 75
4.10 Posterior median and 95% CIs of βi,0’s and βi,1’s. . . . . . . . . . . . 75
A.1 A pseudo code of the RJ-MCMC algorithm . . . . . . . . . . . . . . . 91
Chapter 1
Introduction
1.1 Classic Item Response Theory Models and Marginal Likelihood
Computation
In psychometrics, item response theory (IRT) models play a critical role in
the design and analysis of tests. The classic item response theory models link the
correctness of the answer (a dichotomous random variable) for each item with the
latent ability of an individual, the item difficulty, and potentially other parameters
through different types of link functions. The widely used Rasch model (Rasch, 1960)
refers to the one-parameter IRT model with the logistic link (1PL model). The two-
parameter logistic IRT model (2PL model) adds the discrimination parameter to the
1PL model to allow different scaling of each item. Other extensions and variations of
the IRT models include the three-parameter IRT model, which adds the parameter
for guessing compared to the two-parameter IRT model (Harris, 1989); multi-level
IRT models (Bock and Mislevy, 1989; Fox, 2005), which introduce extra layers for
the ability parameter; and the dynamic IRT model (Wang et al., 2013), which is able
to handle irregular and time-varying observations and estimate the latent trajectory
of ability over a time period.
Due to the complexity of IRT models, Bayesian methods have become increasingly
popular for the analysis and parameter estimation of IRT models (Mislevy, 1986;
Cao and Stokes, 2008; Karabatsos, 2016). The Bayes factor (Kass and Raftery,
1995) is a commonly used Bayesian model comparison criterion, which requires
proper prior distributions and computation of the marginal likelihoods. Since IRT
models often use the probit or logit link, the marginal likelihood is not analytically
available. Owing to recent advances in Bayesian computation, there is a rich
literature on Monte Carlo methods for estimating/computing the marginal likelihoods,
including, for example, a Prior-Based Monte Carlo (PMC) approach using a random
sample from the prior distribution, the Harmonic Mean (HM) approach (Newton
and Raftery, 1994) using a Markov chain Monte Carlo (MCMC) sample from the
posterior distribution, the Chib’s method (Chib, 1995) using Gibbs samples from the
posterior distribution and the conditional posterior distributions, a hybrid Laplace
approximation and Monte Carlo method (Lewis and Raftery, 1997), and a Monte
Carlo method of Chen (2005) using the latent structure of the model and a single
MCMC sample from the posterior distribution. Interestingly, Lartillot and Philippe
(2006) and Friel and Pettitt (2008) independently developed the thermodynamic
integration method and the power posterior method. Each of these methods is
a variation of the path sampling method of Gelman and Meng (1998). Xie et al.
(2011) proposed the stepping stone method (SSM), which is another variation of the
path sampling method. A more general version of SSM was developed by Fan et al.
(2010). More recently, Wang et al. (2018) developed the partition weighted kernel
(PWK) method, which is an extension of the inflated density ratio (IDR) method
of Petris and Tardella (2003) and Petris and Tardella (2007). To the best of our
knowledge, most of these methods have not yet been applied to computing marginal
likelihoods under the IRT models. Also note that some of these MC methods may
not be feasible, or may be difficult to implement, for computing the marginal
likelihoods of the IRT models due to the high dimensionality of the model
parameters. There may be different ways to implement the same MC method,
depending on how we arrange the model parameters. It would be interesting to
discuss how to apply some of these MC methods in the “best possible” way.
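To fix ideas before Chapter 2, the simplest of these estimators, the PMC method, approximates M(D) = ∫ L(θ|D)π(θ)dθ by averaging the likelihood over draws from the prior. The sketch below illustrates this on a toy 1PL-style setting with a single subject and known item difficulties; all data and settings are purely illustrative, not taken from the English examination dataset.

```python
import numpy as np

# Prior-based Monte Carlo (PMC) sketch:
#   M(D) = integral of L(theta | D) * pi(theta) d(theta)
#        ~ (1/S) * sum_s L(theta_s | D),  theta_s drawn from the prior.
# Toy 1PL-style setting: one subject with ability theta ~ N(0, 1)
# answering J items with known (illustrative) difficulties.
rng = np.random.default_rng(42)

b = np.array([-1.0, 0.0, 1.0, 2.0])   # known item difficulties
y = np.array([1.0, 1.0, 0.0, 0.0])    # observed correctness

S = 50_000
thetas = rng.standard_normal(S)                      # draws from the N(0, 1) prior
eta = thetas[:, None] - b[None, :]                   # theta_s - b_j, shape (S, J)
loglik = np.sum(y * eta - np.log1p(np.exp(eta)), axis=1)

# Average the likelihoods on the log scale (log-sum-exp for stability).
log_marglik = np.logaddexp.reduce(loglik) - np.log(S)
print(f"PMC estimate of log M(D): {log_marglik:.3f}")
```

As noted above, averaging over the prior becomes inefficient in high dimensions, because few prior draws land where the likelihood is large; this is precisely why the alternative methods reviewed in Chapter 2 are of interest.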
1.2 Dynamic Item Response Theory Models and Extensions
Longitudinal studies play a prominent role in investigating temporal changes
in a construct of interest; such analyses are often referred to as growth curve
analysis in the social and behavioral sciences. The advent of computerized testing
and online rating brings an entirely new way for social and behavioral researchers
to collect longitudinal data. Test takers have much more freedom to choose their
test times than before. The responses are often collected at variable and irregular
time points across individuals. The randomness of responses may further create
sparsity in certain periods, such as the summer or winter holidays.
Traditionally, two models are widely used for studying individual changes.
One is latent growth curve modeling (Bollen and Curran, 2006) and the other
is multi-level modeling, or hierarchical linear modeling (Raudenbush and Bryk, 2002).
However, when outcomes are observed at individually-varying and irregularly-spaced
time points, the inferences from these two traditional models for studying
individual changes may become problematic because the parametric structures in
those models are not easily adjusted (Geiser et al., 2013). Furthermore, in
computerized testing/surveys in education, the manifest responses are usually
dichotomous, ordinal, or nominal, while the latent traits to be inferred are often
continuous. This makes the inference even more difficult because of the information
loss in the discretization procedure of the underlying response variables.
We focus on an extension of the classic IRT model framework to model longitudi-
nal dichotomous data collected at irregular and variable time points. As mentioned
previously, there are two widely used approaches, i.e., the multidimensional and
multilevel approaches, which are available in the current literature of IRT models
for the analysis of longitudinal data. First, for the multidimensional approach, a
multidimensional IRT model is used to represent the change of an ability as an initial
ability and one or more potential modifiers in unidimensional or multidimensional
tests (Embretson, 1991; te Marvelde et al., 2006; Cho et al., 2013). However, this
approach allows little variation of items on different occasions and it often requires
the individuals to take the same tests. These drawbacks prevent us from extending
their methods to analyze a time series of computerized testing data.
Second, for the multilevel approach, the first-level is often assumed to follow a
classic IRT model, while in the higher-level, there are two common ideas to model
the growth. One idea is to assume the growth of a latent trait to be a parametric
function of time, such as a linear or polynomial regression of the time variable
with fixed or random coefficients. This idea is a variation of the latent growth curve
modeling in the analysis of binary/categorical longitudinal data (Albers et al., 1989;
Tan et al., 1999; Johnson and Raudenbush, 2006; Verhagen and Fox, 2013; Hsieh
et al., 2013). Another idea is to employ Markov chain models, where the changes of
a latent trait over time are assumed to be dependent on its previous value or status
(Martin and Quinn, 2002; Park, 2011; Bartolucci et al., 2011; Kim and Camilli,
2014). However, there are many instances in which neither of these two ideas would
be adequate to describe the growth (Bollen and Curran, 2004). One such instance
is computerized testing, especially when the time lapses between tests are
unequally spaced across individuals as well as within individuals.
To tackle the challenges in computerized testing for modeling the growth of
latent traits, Wang et al. (2013) proposed a dynamic model by combining the ideas
of parametric functions of time as well as Markov chain models to describe the
growth. They imbedded IRT models into a new class of state space models for
analyzing longitudinal data located at individually-varying and irregularly-spaced
time points. In particular, an individually-varying parameter which can be called as
the “growth factor” at the second level of their proposed dynamic IRT model, may
be related to the characteristics of each individual, such as gender, grade, etc. We
6
build up another hierarchical layer based upon the dynamic IRT model discussed
in Wang et al. (2013), to capture relationship between the “growth factor” and a
few candidate covariates. The Bayesian modeling framework is also adopted here
for the extended dynamic IRT model.
Variable selection in the higher layer is a natural and important problem in this
model. Many Bayesian variable selection techniques are available in the literature,
such as the deviance information criterion (DIC) (Spiegelhalter et al., 2002),
the Bayesian lasso (Park and Casella, 2008), and the previously mentioned Bayes
factor approach. Many of these criteria were initially discussed within a linear model
framework, and they may be applied to hierarchical models such as the dynamic
IRT model under consideration. However, it remains unclear whether some of their
attractive model selection properties, such as model selection consistency, still
hold under hierarchical models.
Fernandez et al. (2001) and Liang et al. (2008) discussed some attractive model
selection properties of the Bayes factor when the g-prior (Zellner, 1986) is specified
for the regression coefficients under the linear model framework. In particular, Liang
et al. (2008) showed that when the Zellner-Siow prior (Zellner and Siow, 1980) is
assumed, i.e., an Inv-Gamma(1/2, n/2) prior is assigned to g, the Bayes factor
approach has several attractive properties, including posterior model selection
consistency. Li and Clyde (2018) and Wu et al. (2016) discussed the model selection
properties of mixtures of g-priors under the generalized linear model framework.
These results motivate us to adopt the Bayes factor approach for variable selection
in the added layer under the dynamic IRT model.
However, due to the complex structure of the dynamic IRT model and the high
dimensionality of the parameter space, no analytical forms are available for the
Bayes factors of the dynamic IRT models (while analytical forms are available
for the linear regression model discussed in Liang et al. (2008) when g is fixed).
Inspired by Chen (2005), we develop a Monte Carlo approach to compute the Bayes
factors under the dynamic IRT model using a single MCMC output, which is a
feasible approach when the number of covariates is moderate. For example, 20
covariates lead to 2^20 = 1,048,576 possible models even if no interaction terms
are considered. Model selection properties will be discussed in Chapter 3.
Nevertheless, assuming a particular functional relationship for the growth can in
general be restrictive and difficult to justify. Instead, a nonparametric
model can be much more flexible in describing changes of latent traits and can avoid
the errors of model misspecification.
Based on the results in Wang et al. (2013), we find that the trajectory
of a subject's reading ability often grows quickly in the initial period but slows
down as it approaches maturation. Overall, the ability exhibits an increasing
trend but often has a flat region at the end. This shape of the growth trajectory
is consistent with prior beliefs and experiences of practitioners.
In various scientific applications, assumptions about the shape of the trajectory,
such as monotonicity, convexity, or concavity, may be proposed ahead of the analysis
to aid the modeling process and enhance interpretability (Erickson, 1976; Paul
et al., 2016; Rodrigues et al., 2018). This motivates us to incorporate shape
constraints as prior information in nonparametric modeling, and the use of
shape information may improve the efficiency and accuracy of the nonparametric
estimates.
From the Bayesian perspective, nonparametric regression with monotonicity
constraints has already been considered in the literature. For instance, Gelfand and Kuo
(1991) used an ordered Dirichlet process prior to impose the monotone constraint.
Neelon and Dunson (2004) imposed a piecewise linear regression, with an autore-
gressive prior for the parameters of basis functions. Shively et al. (2009) as well as
Brezger and Steiner (2008) implemented restricted splines to ensure monotonicity.
McKay Curtis and Ghosh (2011) used Bernstein polynomials with restrictions on
parameter space, while Choi et al. (2016) extended this idea by allowing incorpo-
ration of uncertainty of the constraints through the prior. Lin and Dunson (2014)
proposed Gaussian process projection to perform shape-constrained inference and
applied the method to multivariate surface estimation. Wang and Berger (2016)
imposed the constraints on the derivative process of the original Gaussian process
to estimate shape-constrained functions. However, typical nonparametric models
are often less interpretable compared with parametric models. Enlightened by the
idea of Bornkamp and Ickstadt (2009), we develop a Bayesian dynamic IRT model
with a monotone shape constraint on the mean latent growth trend. A
nonparametric model for the latent growth is adopted to allow more flexibility, while
interpretability is still retained to some degree.
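The monotone construction can be sketched numerically. Following the general idea of Bornkamp and Ickstadt (2009), a nondecreasing mean trend is built as a nonnegative-weighted mixture of cumulative distribution functions, which is monotone by construction. The toy sketch below uses logistic CDFs and illustrative weights and locations; it is not necessarily the exact basis or parameterization used in Chapter 4.

```python
import numpy as np

# Sketch: any nondecreasing curve on [0, T] can be approximated by
# intercept + slope * sum_k w_k * F_k(t), where each F_k is a CDF,
# the weights w_k are nonnegative and sum to 1, and slope >= 0.
# Monotonicity then holds automatically, since each CDF is nondecreasing.

def logistic_cdf(t, loc, scale):
    """CDF of a logistic distribution: a smooth, monotone basis function."""
    return 1.0 / (1.0 + np.exp(-(t - loc) / scale))

def monotone_curve(t, weights, locs, scales, intercept=0.0, slope=1.0):
    """Nondecreasing trend built as a weighted mixture of logistic CDFs."""
    basis = np.array([logistic_cdf(t, l, s) for l, s in zip(locs, scales)])
    return intercept + slope * np.dot(weights, basis)

t_grid = np.linspace(0.0, 10.0, 200)
weights = np.array([0.5, 0.3, 0.2])    # nonnegative, summing to 1
locs = np.array([2.0, 5.0, 8.0])       # hypothetical component locations
scales = np.array([0.5, 1.0, 0.8])
curve = monotone_curve(t_grid, weights, locs, scales, intercept=-1.0, slope=3.0)

# The resulting trend is monotone nondecreasing by construction.
assert np.all(np.diff(curve) >= 0)
```

A curve of this form also matches the empirical shape described above: it rises quickly where the component CDFs are steep and flattens out as they saturate near the end of the range.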
1.3 Motivating Datasets
1.3.1 The English Examination Dataset
The first dataset consists of the English exam results of the 2017-2018 academic
year from No. 11 Middle School of Wuhan, Bingjiang Campus, a public
middle school in the Jiang'an district of Wuhan, China. This exam is a midterm English
exam for grade 8 students in fall 2017. A total of 468 students attended this
exam. The official difficulty coefficient of this exam from the Education
Bureau of the Jiang'an district is 0.67, which means that the average score of all the
students who attended is 67% of full marks.
The classic IRT model framework seems suitable for the data structure here. If
guessing is not considered, it remains unclear whether the 2PL model should be
preferred over the 1PL model. We consider applying the Bayes factor (Kass and
Raftery, 1995) approach for model comparison and develop the “best implemen-
tation” of several Monte Carlo approaches in Chapter 2 to compute the marginal
likelihoods of the IRT models.
1.3.2 The EdSphere Dataset
The second motivating dataset is the EdSphere dataset provided by MetaMetrics
Company and Highroad Learning Company. EdSphere is a personalized literacy
learning platform that continuously collects data about student performance and
strategic behaviors each time he or she reads an article. During a typical
reading test, a student selects from a system-generated list of articles whose
difficulty levels lie in a range targeted to the current estimate of the student's ability.
Then, for the selected article, a subset of words from the article are eligible to be
clozed, that is, removed and replaced by a blank. The computer, following a pre-
scribed protocol, randomly selects a sample of the eligible words to be clozed and
presents the article to the student with these words clozed. The question items pro-
duced by this procedure are randomized items. They are single-use items generated
at the time of an encounter between a student and an article. If another student
selects the same article to read, a new set of clozed words is selected. As a conse-
quence, the reoccurrence of individual items among students is highly improbable,
so obtaining empirical estimates of item parameters is not feasible.
The difficulty levels of the items in the reading test are provided by MetaMet-
rics using proprietary data and methods. The ensemble mean and the variance of
difficulty level for the items in a test are known due to the test design of the
EdSphere learning platform.
Currently, the EdSphere dataset consists of 16,949 students from a school district
in Mississippi who participated in the EdSphere learning platform over 5 years
(2007-2011). The students were in different grades and could enter and leave the program
at different times. They were free to take tests on different days and had different
time lapses between tests. An illustration of this kind of data structure is included
in Table 1.1.
Table 1.1: Illustration of the unequally spaced response data structure.
Date        Day 1    Day 2    Day 3    ...  Day M    ...  Day N    ...  Day X     ...
Examinee A  1 test   N/A      3 tests  ...  5 tests  ...  N/A      ...  Leave     ...
Examinee B  N/A      Enter    1 test   ...  N/A      ...  4 tests  ...  10 tests  ...
Examinee C  4 tests  2 tests  1 test   ...  N/A      ...  N/A      ...  1 test    ...
This design yields longitudinal observations located at individually-varying and
irregularly-spaced time points and suggests that we need to model the changes of
latent traits with a dynamic structure. Further, in the spirit of the EdSphere test
design, we can imagine that factors such as overall comprehension, emotional status,
and others would exist and undermine the local independence assumption of IRT models,
as mentioned in Wang et al. (2013). Therefore, we need to extend the classic IRT
models to accommodate modern computerized (adaptive) testing (not merely
the EdSphere datasets), which has the distinctive features described above, i.e., randomized
items, longitudinal observations, and local dependence.
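For concreteness, the unequally spaced structure in Table 1.1 can be represented without any padding by keeping one list of (day, test count) records per examinee. The sketch below mirrors the illustrative table, not actual EdSphere records.

```python
from collections import defaultdict

# Hedged sketch: store irregular longitudinal responses as sparse
# (examinee, day, number_of_tests) records; days with no activity simply
# have no record, so individually varying test days need no alignment.
records = [
    ("A", 1, 1), ("A", 3, 3),
    ("B", 3, 1),
    ("C", 1, 4), ("C", 2, 2), ("C", 3, 1),
]

by_examinee = defaultdict(list)
for examinee, day, n_tests in records:
    by_examinee[examinee].append((day, n_tests))

# Time lapses between consecutive test days differ across and within
# examinees, which is exactly what a dynamic IRT model must accommodate.
lapses = {e: [later[0] - earlier[0] for earlier, later in zip(days, days[1:])]
          for e, days in by_examinee.items()}
print(lapses)
```

Here examinee A has a 2-day gap between tests while examinee C tests daily; a dynamic model can use these individual-specific lapses directly when propagating the latent trait between occasions.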
1.4 Dissertation Outline
The rest of the dissertation is organized as follows. In Chapter 2, we review the PMC
method, the HM method, the Chib method, the Chen method, and the SSM, and
provide an in-depth discussion of how these methods can be applied to the IRT
models. Our focus will be on the “best possible” implementation of these
methods for the IRT models. Full implementation details of each reviewed MC
method are provided. We apply these MC methods to the English examination
dataset and present the estimated marginal likelihoods by each method along with
the computation costs.
In Chapter 3, we propose a dynamic IRT model that models the relationship
between the growth trend and individual characteristics, and we perform variable
selection among the candidate covariates using the Bayes factor. To overcome the
computational burden, enlightened by Chen (2005), we develop a convenient Monte
Carlo approach to compute the Bayes factors using a single MCMC output from the
posterior under the full model. Additionally, we show the posterior model selection
consistency of the Bayes factors under the Zellner-Siow prior with certain conditions.
Simulation studies are carried out to examine the model selection performance of
the proposed Bayes factor approach in finite sample situations. The proposed
method is also applied to the EdSphere dataset.
In Chapter 4, we propose a flexible Bayesian monotone nonparametric regression
of the latent traits within the dynamic IRT model framework. Specifically, we
impose the monotone shape constraint by using a mixture of cumulative distribution
functions. We set the dimension of the mixture as random and develop a partially
collapsed Gibbs sampling algorithm in conjunction with the reversible jump MCMC
approach to generate the posterior samples. Results for the simulation studies and
the EdSphere data application are discussed. The details of the MCMC sampling
algorithm are included in Appendix A.

In Chapter 5, we discuss some of our preliminary results of applying the DIC for
model selection under hierarchical models.
Chapter 2
A Comparison of Monte Carlo Methods for
Computing Marginal Likelihoods of Item
Response Theory Models
2.1 Classic Item Response Theory Models
Suppose that there are n subjects and each of them answers J items (questions).
Let yij represent the correctness of item j answered by subject i, where yij equals
1 if the answer is correct and 0 otherwise for j = 1, 2, . . . , J and i = 1, 2, . . . , n. The
classic one-parameter logistic (1PL) model assumes
\[
P(y_{ij} = 1 \mid \theta, b) = \frac{\exp(\theta_i - b_j)}{1 + \exp(\theta_i - b_j)}, \tag{2.1}
\]
where θi denotes the latent ability of subject i and bj is the difficulty of item j. The
two-parameter logistic (2PL) IRT model adds the discrimination parameter aj to
the 1PL model, and assumes
\[
P(y_{ij} = 1 \mid \theta, b, a) = \frac{\exp\{a_j(\theta_i - b_j)\}}{1 + \exp\{a_j(\theta_i - b_j)\}}. \tag{2.2}
\]
Here aj > 0 allows different degrees of distinction for different items. We use
θ = (θ1, θ2, . . . , θn)′, b = (b1, b2, . . . , bJ)′, and a = (a1, a2, . . . , aJ)′ to represent the
parameter vectors of all θi’s, bj’s, and aj’s, respectively.
Denote the data as D = {yij : i = 1, 2, . . . , n, j = 1, 2, . . . , J}. Since the one-parameter
model is a special case of the two-parameter model with discrimination
parameters aj = 1 for j = 1, 2, . . . , J, to avoid redundancy, we will discuss the details
of the implementation of each Monte Carlo method in the next section only for the
two-parameter model.
For the two-parameter model (2.2), the data likelihood can be expressed as
$$L(\theta, b, a \mid D) = \prod_{i=1}^{n}\prod_{j=1}^{J} \frac{\exp\{y_{ij}a_j(\theta_i - b_j)\}}{1 + \exp\{a_j(\theta_i - b_j)\}}. \qquad (2.3)$$
Similar to Natesan et al. (2016) and Luo and Jiao (2018), the priors for the model parameters are specified as follows: (i) θi ∼ N(0, 1) independently for i = 1, 2, ..., n, to ensure the identifiability of the model parameters θi's and aj's; (ii) given µb and ψ, a hierarchical prior is used for b: bj ∼ N(µb, ψ⁻¹) for j = 1, 2, ..., J; (iii) aj ∼ U(0.5, 2.5) independently for j = 1, 2, ..., J (log-normal priors are also considered in a sensitivity analysis); and (iv) µb ∼ N(0, s₀²ψ⁻¹) and ψ ∼ Gamma(1, 1), where s₀ is a pre-specified constant. We set s₀ = 2 in our empirical study, which yields a moderately noninformative prior for µb.
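To make this prior specification concrete, the following sketch draws one parameter set from the priors above and simulates a response matrix under the 2PL model (2.2); the variable names and the sizes n and J are illustrative, not taken from the dissertation's code.

```python
import numpy as np

rng = np.random.default_rng(0)
n, J, s0 = 200, 20, 2.0

# One draw from the priors of Section 2.1 (shape/scale parameterization for Gamma).
psi = rng.gamma(shape=1.0, scale=1.0)             # psi ~ Gamma(1, 1)
mu_b = rng.normal(0.0, s0 / np.sqrt(psi))         # mu_b ~ N(0, s0^2 psi^{-1})
b = rng.normal(mu_b, 1.0 / np.sqrt(psi), size=J)  # b_j ~ N(mu_b, psi^{-1})
a = rng.uniform(0.5, 2.5, size=J)                 # a_j ~ U(0.5, 2.5)
theta = rng.normal(0.0, 1.0, size=n)              # theta_i ~ N(0, 1)

# Simulate correctness indicators under the 2PL model (2.2).
eta = a[None, :] * (theta[:, None] - b[None, :])
p = 1.0 / (1.0 + np.exp(-eta))
y = rng.binomial(1, p)
print(y.shape)  # (200, 20)
```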
The full posterior distribution can be written as

$$\begin{aligned}
\pi(\theta, b, a, \mu_b, \psi \mid D) = \frac{1}{C(D)} &\left[\prod_{i=1}^{n}\prod_{j=1}^{J} \frac{\exp\{y_{ij}a_j(\theta_i - b_j)\}}{1 + \exp\{a_j(\theta_i - b_j)\}}\right] (2\pi)^{-(n+J+1)/2} \exp\left(-\frac{\sum_{i=1}^{n}\theta_i^2}{2}\right) \\
&\times \psi^{\frac{J+1}{2}} \exp\left(-\frac{\psi\sum_{j=1}^{J}(b_j - \mu_b)^2}{2}\right) s_0^{-1} \exp\left(-\frac{\psi\mu_b^2}{2s_0^2}\right) \exp(-\psi) \\
&\times 2^{-J} \prod_{j=1}^{J} 1\{0.5 < a_j < 2.5\},
\end{aligned} \qquad (2.4)$$
where C(D) is the normalizing constant of the full posterior distribution and the indicator function 1{E} = 1 if E is true and 0 otherwise. Since

$$\int \pi(\theta, b, a, \mu_b, \psi)\, d\theta\, db\, da\, d\mu_b\, d\psi = 1,$$

the prior distributions are proper and, hence, C(D) reduces to the marginal likelihood M(D) given by

$$M(D) = \int L(\theta, b, a \mid D)\, \pi(\theta, b, a, \mu_b, \psi)\, d\theta\, db\, da\, d\mu_b\, d\psi. \qquad (2.5)$$
There are some convenient properties embedded in the full conditional distributions associated with this posterior structure. For instance, given (b, a, D), the distributions of the θi's are conditionally independent. The conditional independence property also holds for a and b in a similar fashion when considering their full conditional posteriors. Specifically, we have
$$\pi(\theta_i \mid b, a, D) \propto \prod_{j=1}^{J} \frac{\exp\{y_{ij}a_j(\theta_i - b_j)\}}{1 + \exp\{a_j(\theta_i - b_j)\}} \exp(-\theta_i^2/2), \qquad (2.6)$$

$$\pi(b_j \mid \theta, a, \mu_b, \psi, D) \propto \prod_{i=1}^{n} \frac{\exp\{y_{ij}a_j(\theta_i - b_j)\}}{1 + \exp\{a_j(\theta_i - b_j)\}} \exp\left(-\frac{\psi(b_j - \mu_b)^2}{2}\right), \qquad (2.7)$$

$$\pi(a_j \mid \theta, b, D) \propto \prod_{i=1}^{n} \frac{\exp\{y_{ij}a_j(\theta_i - b_j)\}}{1 + \exp\{a_j(\theta_i - b_j)\}}\, 1\{0.5 < a_j < 2.5\}. \qquad (2.8)$$
Although the normalizing constants corresponding to the above three conditional
distributions are not analytically available, log π(θi|b,a, D), log π(aj|θ, b, D), and
log π(bj|θ,a, µb, ψ,D) are concave. However, the analytical forms for π(µb|b, ψ,D)
and π(ψ|b, D) are available. From the full posterior (2.4), we have
$$\begin{aligned}
\pi(\mu_b \mid b, a, \psi, D) &\propto \exp\left(-\frac{\psi\sum_{j=1}^{J}(b_j - \mu_b)^2}{2}\right) \exp\left(-\frac{\psi\mu_b^2}{2s_0^2}\right) \\
&\propto \exp\left(-\frac{\psi(J + 1/s_0^2)\left(\mu_b - \frac{\sum_{j=1}^{J} b_j}{J + 1/s_0^2}\right)^2}{2}\right),
\end{aligned}$$

thus $\mu_b \mid b, a, \psi, D \sim N\!\left(\frac{\sum_{j=1}^{J} b_j}{J + 1/s_0^2},\, \{\psi(J + 1/s_0^2)\}^{-1}\right)$. For π(ψ|b, a, D), we have

$$\begin{aligned}
\pi(\psi \mid b, a, D) &= \int \pi(\mu_b, \psi \mid b, a, D)\, d\mu_b \\
&\propto \exp(-\psi)\, \psi^{J/2} \exp\left(\frac{\psi(\sum_{j=1}^{J} b_j)^2}{2(J + s_0^{-2})}\right) \exp\left(-\frac{\psi\sum_{j=1}^{J} b_j^2}{2}\right) \\
&\propto \psi^{J/2} \exp\left[-\psi\left(1 + \sum_{j=1}^{J} b_j^2/2 - \frac{(\sum_{j=1}^{J} b_j)^2}{2(J + s_0^{-2})}\right)\right].
\end{aligned}$$

Thus $\psi \mid b, a, D \sim \mathrm{Gamma}\!\left(J/2 + 1,\, \left(1 + \sum_{j=1}^{J} b_j^2/2 - \frac{(\sum_{j=1}^{J} b_j)^2}{2(J + s_0^{-2})}\right)^{-1}\right)$, where the second argument is the scale parameter. These are attractive and useful properties, which lead to an efficient and convenient implementation of certain Monte Carlo methods discussed in the next section.
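As a sketch of how these conjugate conditionals can be used inside a Gibbs sampler, the helpers below draw µb and ψ from the closed forms just derived; the function names are ours, and numpy's gamma sampler uses the shape/scale parameterization, matching the form above.

```python
import numpy as np

def update_mu_b(b, psi, s0, rng):
    """Draw mu_b | b, psi ~ N( sum(b)/(J + 1/s0^2), [psi (J + 1/s0^2)]^{-1} )."""
    J = len(b)
    prec = J + 1.0 / s0**2
    mean = b.sum() / prec
    return rng.normal(mean, 1.0 / np.sqrt(psi * prec))

def update_psi(b, s0, rng):
    """Draw psi | b ~ Gamma(J/2 + 1, scale = rate^{-1}) with
    rate = 1 + sum(b^2)/2 - (sum b)^2 / (2 (J + s0^{-2}))."""
    J = len(b)
    rate = 1.0 + 0.5 * (b**2).sum() - b.sum()**2 / (2.0 * (J + s0**-2))
    return rng.gamma(shape=J / 2.0 + 1.0, scale=1.0 / rate)

rng = np.random.default_rng(1)
b = rng.normal(0.0, 1.0, size=30)
psi = update_psi(b, 2.0, rng)
mu_b = update_mu_b(b, psi, 2.0, rng)
```

By the Cauchy-Schwarz inequality the rate above is always at least 1, so the scale is well defined for any b.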
2.2 Monte Carlo Methods
From Section 2.1, we see that the dimension of the parameters θ, a, and b grows with n or J and is generally high. Therefore, we consider only a few selected Monte Carlo methods discussed in Section 1.1 for computing the marginal likelihood under the 2PL model.
2.2.1 The Prior-Based Monte Carlo (PMC) Method
By the definition of the marginal likelihood (2.5), the PMC estimator can be constructed as

$$\hat{M}(D)_{PMC} = \frac{1}{S}\sum_{s=1}^{S} L(\theta^{(s)}, b^{(s)}, a^{(s)} \mid D) = \frac{1}{S}\sum_{s=1}^{S}\left[\prod_{i=1}^{n}\prod_{j=1}^{J} \frac{\exp\{y_{ij}a_j^{(s)}(\theta_i^{(s)} - b_j^{(s)})\}}{1 + \exp\{a_j^{(s)}(\theta_i^{(s)} - b_j^{(s)})\}}\right],$$

where {(θ⁽ˢ⁾, b⁽ˢ⁾, a⁽ˢ⁾, µb⁽ˢ⁾, ψ⁽ˢ⁾), s = 1, 2, ..., S} is a random sample of size S from the joint prior distribution of (θ, b, a, µb, ψ) specified in Section 2.1. This estimator has a finite variance since the likelihood function satisfies $\prod_{i=1}^{n}\prod_{j=1}^{J} \frac{\exp\{y_{ij}a_j^{(s)}(\theta_i^{(s)} - b_j^{(s)})\}}{1 + \exp\{a_j^{(s)}(\theta_i^{(s)} - b_j^{(s)})\}} \le 1$. When S is large enough, $\hat{M}(D)_{PMC}$ converges to M(D) in probability by the law of large numbers. However, a well-known issue with this method is that the samples rarely come from the high-likelihood region (Lartillot and Philippe, 2006). Thus, unless the sample size S is "very" large, $\hat{M}(D)_{PMC}$ generally results in a poor estimate of the true marginal likelihood.
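A minimal sketch of the PMC estimator, computed on the log scale to avoid underflow of the likelihood; the toy data y_toy and the sample size S are illustrative.

```python
import numpy as np
from scipy.special import logsumexp

def log_lik_2pl(y, theta, b, a):
    """Log of the 2PL likelihood (2.3) at one parameter draw."""
    eta = a[None, :] * (theta[:, None] - b[None, :])
    return float(np.sum(y * eta - np.log1p(np.exp(eta))))

def pmc_log_marginal(y, S, s0=2.0, seed=0):
    """Prior-based MC estimate of log M(D): log of the average likelihood
    over S i.i.d. draws from the prior of Section 2.1."""
    rng = np.random.default_rng(seed)
    n, J = y.shape
    logL = np.empty(S)
    for s in range(S):
        psi = rng.gamma(1.0, 1.0)
        mu_b = rng.normal(0.0, s0 / np.sqrt(psi))
        b = rng.normal(mu_b, 1.0 / np.sqrt(psi), size=J)
        a = rng.uniform(0.5, 2.5, size=J)
        theta = rng.normal(0.0, 1.0, size=n)
        logL[s] = log_lik_2pl(y, theta, b, a)
    return logsumexp(logL) - np.log(S)  # log of (1/S) sum_s L^{(s)}

y_toy = (np.random.default_rng(1).random((15, 5)) < 0.6).astype(int)
est = pmc_log_marginal(y_toy, S=2000)
```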
2.2.2 The Harmonic Mean (HM) Method
Using the harmonic mean approach (Newton and Raftery, 1994), we have the identity

$$\frac{1}{M(D)} = \int \frac{\pi(\theta, b, a, \mu_b, \psi)}{M(D)}\, d\theta\, db\, da\, d\mu_b\, d\psi = \int L(\theta, b, a \mid D)^{-1}\, \frac{L(\theta, b, a \mid D)\, \pi(\theta, b, a, \mu_b, \psi)}{M(D)}\, d\theta\, db\, da\, d\mu_b\, d\psi. \qquad (2.9)$$

Therefore the HM estimator of M(D) can be constructed as

$$\hat{M}(D)_{HM} = \left[\frac{1}{S}\sum_{s=1}^{S} \frac{1}{L(\theta^{(s)}, b^{(s)}, a^{(s)} \mid D)}\right]^{-1} = \left\{\frac{1}{S}\sum_{s=1}^{S}\left[\prod_{i=1}^{n}\prod_{j=1}^{J} \frac{\exp\{y_{ij}a_j^{(s)}(\theta_i^{(s)} - b_j^{(s)})\}}{1 + \exp\{a_j^{(s)}(\theta_i^{(s)} - b_j^{(s)})\}}\right]^{-1}\right\}^{-1},$$

where {(θ⁽ˢ⁾, b⁽ˢ⁾, a⁽ˢ⁾), s = 1, 2, ..., S} is an MCMC sample from the joint posterior distribution π(θ, b, a|D). As discussed in DiCiccio et al. (1997) and Xie et al. (2011), the HM estimator is not guaranteed to have finite variance; whether it does depends on the specification of the prior distributions.
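Because the likelihood values underflow in double precision, the HM estimator is best evaluated on the log scale; a minimal sketch, assuming the posterior log-likelihood draws are already available:

```python
import numpy as np
from scipy.special import logsumexp

def hm_log_marginal(log_lik_draws):
    """Harmonic-mean estimate of log M(D) from posterior log-likelihood draws.
    log { [ (1/S) sum_s exp(-logL_s) ]^{-1} } = log S - logsumexp(-logL)."""
    log_lik_draws = np.asarray(log_lik_draws, dtype=float)
    S = len(log_lik_draws)
    return np.log(S) - logsumexp(-log_lik_draws)

# If all draws have the same log-likelihood -100, the estimate equals -100
# up to floating-point rounding.
val = hm_log_marginal([-100.0, -100.0, -100.0])
```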
2.2.3 The Chen Method
Based on the idea in Chen (2005), we have the identity

$$\begin{aligned}
L(b^*, a^* \mid D) &= \int L(\theta, b^*, a^* \mid D)\, \pi(\theta)\, d\theta \\
&= \int L(\theta, b^*, a^* \mid D)\, \pi(\theta)\, g(b, a, \mu_b, \psi \mid \theta)\, d\theta\, db\, da\, d\mu_b\, d\psi \\
&= M(D) \int \frac{L(\theta, b, a \mid D)\, \pi(\theta, b, a, \mu_b, \psi)}{M(D)}\, \frac{g(b, a, \mu_b, \psi \mid \theta)}{\pi(b, a, \mu_b, \psi)}\, \frac{L(\theta, b^*, a^* \mid D)}{L(\theta, b, a \mid D)}\, d\theta\, db\, da\, d\mu_b\, d\psi \\
&= M(D)\, E\left[\frac{g(b, a, \mu_b, \psi \mid \theta)}{\pi(b, a, \mu_b, \psi)}\, \frac{L(\theta, b^*, a^* \mid D)}{L(\theta, b, a \mid D)} \,\Big|\, D\right],
\end{aligned} \qquad (2.10)$$

for any fixed b* and a*, where the expectation is taken with respect to the joint posterior distribution of (θ, b, a, µb, ψ), and g is any normalized density function of (b, a, µb, ψ).
Using the MCMC sample {(θ⁽ˢ⁾, b⁽ˢ⁾, a⁽ˢ⁾, µb⁽ˢ⁾, ψ⁽ˢ⁾), s = 1, 2, ..., S} from the joint posterior distribution π(θ, b, a, µb, ψ|D), an MC estimate of M(D) is given by

$$\hat{M}(D) = L(b^*, a^* \mid D)\left[\frac{1}{S}\sum_{s=1}^{S} \frac{g(b^{(s)}, a^{(s)}, \mu_b^{(s)}, \psi^{(s)} \mid \theta^{(s)})}{\pi(b^{(s)}, a^{(s)}, \mu_b^{(s)}, \psi^{(s)})}\, \frac{L(\theta^{(s)}, b^*, a^* \mid D)}{L(\theta^{(s)}, b^{(s)}, a^{(s)} \mid D)}\right]^{-1}. \qquad (2.11)$$
In (2.10) and (2.11), the optimal choice for g is

$$g_0 = \pi(b, a, \mu_b, \psi \mid \theta, D) = \pi(\mu_b, \psi \mid b, a, D)\, \pi(a \mid \theta, b, D)\, \pi(b \mid \theta, D), \qquad (2.12)$$

which minimizes $\mathrm{Var}\left(\frac{g(b^{(s)}, a^{(s)}, \mu_b^{(s)}, \psi^{(s)} \mid \theta^{(s)})}{\pi(b^{(s)}, a^{(s)}, \mu_b^{(s)}, \psi^{(s)})}\, \frac{L(\theta^{(s)}, b^*, a^* \mid D)}{L(\theta^{(s)}, b^{(s)}, a^{(s)} \mid D)}\right)$ according to Chen (2005). However, since no analytical form is available for the normalized distribution π(b|θ, D), it may be too costly to compute π(b|θ, D) for each sample of θ. A "sub-optimal" alternative for g is

$$g_1 = \pi(b, a, \mu_b, \psi \mid \theta^*, D) = \pi(\psi \mid b, a, D)\, \pi(\mu_b \mid b, a, \psi, D)\, \pi(a \mid \theta^*, b, D)\, \pi(b \mid \theta^*, D),$$
where π(ψ|b, a, D) and π(µb|b, a, ψ, D) can be computed using the closed forms derived in Section 2.1. We need to obtain the normalizing constant for each π(aj|θ*, b, D), whose kernel is given in (2.8). π(b|θ*, D) can be obtained using the conditional marginal density estimator (CMDE) (Gelfand et al., 1992); writing $\tilde{b}$ for the integration variable to distinguish it from the fixed argument b, we have

$$\begin{aligned}
\pi(b \mid \theta^*, D) &= \int \pi(b \mid \theta^*, a, \mu_b, \psi, D)\, \pi(\tilde{b}, a, \mu_b, \psi \mid \theta^*, D)\, da\, d\mu_b\, d\psi\, d\tilde{b} \\
&= \int \prod_{j=1}^{J}\left[C_{b_j}^{-1} \prod_{i=1}^{n} \frac{\exp\{y_{ij}a_j(\theta_i^* - b_j)\}}{1 + \exp\{a_j(\theta_i^* - b_j)\}} \exp\left(-\frac{\psi(b_j - \mu_b)^2}{2}\right)\right] \pi(\tilde{b}, a, \mu_b, \psi \mid \theta^*, D)\, da\, d\mu_b\, d\psi\, d\tilde{b},
\end{aligned}$$

where $C_{b_j} = \int \prod_{i=1}^{n} \frac{\exp\{y_{ij}a_j(\theta_i^* - b_j)\}}{1 + \exp\{a_j(\theta_i^* - b_j)\}} \exp\left(-\frac{\psi(b_j - \mu_b)^2}{2}\right) db_j$ is the normalizing constant for the conditional distribution π(bj|θ*, a, µb, ψ, D), which can be computed numerically. Thus we can draw a batch of MCMC samples $\{(\tilde{b}^{(s)}, a^{(s)}, \mu_b^{(s)}, \psi^{(s)})\}$ from π(b, a, µb, ψ|θ*, D) and obtain π(b|θ*, D) for each b from the original MCMC samples from π(θ, b, a, µb, ψ|D). However, this approach requires two MCMC outputs, which is computationally expensive.
Another "fast" alternative choice of g is g₂ = π(µb, ψ|b, a, D) h₁(a) h₂(b), where $h_1(a) = \prod_{j=1}^{J} h_{1j}(a_j)$ and h₁ⱼ is the density of N(2.5, (2/3)²) truncated to [0.5, 2.5]. Here $h_2(b) = \prod_{j=1}^{J} h_{2j}(b_j)$, where h₂ⱼ is the density of $N(\hat{b}_j, \nu_j^2)$, $\hat{b}_j$ is the posterior median of bⱼ, and νⱼ is a constant, set here to 1/6 of the range of the posterior samples of bⱼ. By choosing g₂ over g₁, we only need the MCMC output from π(θ, b, a, µb, ψ|D), and the computational cost is greatly reduced. For L(b*, a*|D), we can compute it by multiplying n one-dimensional integrals together:

$$L(b^*, a^* \mid D) = \int L(\theta, b^*, a^* \mid D)\, \pi(\theta)\, d\theta \qquad (2.13)$$
$$= \prod_{i=1}^{n} \int L(\theta_i, b^*, a^* \mid D)\, \pi(\theta_i)\, d\theta_i. \qquad (2.14)$$
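Since π(θi) is standard normal, each of the n one-dimensional integrals in (2.14) can be approximated by Gauss-Hermite quadrature; the helper below is our own sketch (with an illustrative node count) and returns log L(b*, a*|D).

```python
import numpy as np
from scipy.special import logsumexp

def log_L_items(y, b_star, a_star, n_quad=41):
    """log L(b*, a* | D): for each subject i, integrate the 2PL likelihood
    against the N(0, 1) prior of theta_i via Gauss-Hermite quadrature,
    then sum the n per-subject log-integrals, as in (2.13)-(2.14)."""
    x, w = np.polynomial.hermite.hermgauss(n_quad)
    nodes = np.sqrt(2.0) * x                   # quadrature nodes for N(0, 1)
    log_w = np.log(w) - 0.5 * np.log(np.pi)    # matching log-weights
    # log P(y_ij | theta = node_k) summed over items: shape (n_quad, n)
    eta = a_star[None, None, :] * (nodes[:, None, None] - b_star[None, None, :])
    logp = (y[None, :, :] * eta - np.log1p(np.exp(eta))).sum(axis=2)
    return float(logsumexp(log_w[:, None] + logp, axis=0).sum())

# One subject, one item with a* = 1, b* = 0, y = 1: the integral is
# E[logistic(theta)] = 0.5 by symmetry, so the result is log(0.5).
val = log_L_items(np.array([[1]]), np.array([0.0]), np.array([1.0]))
```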
So far we have introduced three different Monte Carlo approaches and all of them
require only a single MC or MCMC output to compute the marginal likelihood. The
following two approaches require multiple MCMC outputs.
2.2.4 The Chib Method
Motivated by the idea in Chib (1995), we can make use of the identity

$$\log M(D) = \log L(D \mid \theta^*, b^*, a^*) + \log \pi(\theta^*, b^*, a^*, \mu_b^*, \psi^*) - \log \pi(\theta^*, b^*, a^*, \mu_b^*, \psi^* \mid D) \qquad (2.15)$$

for any fixed (θ*, b*, a*, µb*, ψ*). Since the analytical forms of the data likelihood and the prior distribution are available, we only need to estimate π(θ*, b*, a*, µb*, ψ*|D).

There are different choices of how to compute π(θ*, b*, a*, µb*, ψ*|D), though not all of them are efficient. Considering the conditional independence properties of the θi's, bj's, and aj's discussed in Section 2.1, along with the fact that in practice the number of test subjects is usually greater than the number of items, i.e., n > J, we can compute π(θ*, b*, a*, µb*, ψ*|D) by

$$\begin{aligned}
\log \pi(\theta^*, b^*, a^*, \mu_b^*, \psi^* \mid D) = {}& \log \pi(\theta^* \mid b^*, a^*, \mu_b^*, \psi^*, D) + \log \pi(\mu_b^* \mid b^*, a^*, \psi^*, D) \\
&+ \log \pi(\psi^* \mid b^*, a^*, D) + \log \pi(a^* \mid b^*, D) + \log \pi(b^* \mid D).
\end{aligned} \qquad (2.16)$$
For the first term on the right side of (2.16), based on (2.6), we have

$$\log \pi(\theta^* \mid b^*, a^*, \mu_b^*, \psi^*, D) = \sum_{i=1}^{n} \log \pi(\theta_i^* \mid b^*, a^*, \mu_b^*, \psi^*, D) = \sum_{i=1}^{n} \log\left(\frac{1}{c_{i,0}} \prod_{j=1}^{J}\left[\frac{\exp\{y_{ij}a_j^*(\theta_i^* - b_j^*)\}}{1 + \exp\{a_j^*(\theta_i^* - b_j^*)\}}\right] \exp\{-(\theta_i^*)^2/2\}\right), \qquad (2.17)$$

where c_{i,0} is the normalizing constant for the distribution π(θi|b*, a*, µb*, ψ*, D). It can be obtained numerically as $c_{i,0} = \int_{-\infty}^{\infty} \prod_{j=1}^{J}\left[\frac{\exp\{y_{ij}a_j^*(\theta_i - b_j^*)\}}{1 + \exp\{a_j^*(\theta_i - b_j^*)\}}\right] \exp\{-\theta_i^2/2\}\, d\theta_i$; the numerical integration can be done in many existing software packages such as R and MATLAB.
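For instance, in Python (an illustration under our notation; the dissertation used R and MATLAB), c_{i,0} can be computed with a one-dimensional adaptive quadrature:

```python
import numpy as np
from scipy.integrate import quad

def c_i0(y_i, b_star, a_star):
    """Normalizing constant c_{i,0}: the integral over theta of subject i's
    2PL likelihood kernel times exp(-theta^2 / 2)."""
    def kernel(theta):
        eta = a_star * (theta - b_star)
        return np.exp(np.sum(y_i * eta - np.log1p(np.exp(eta))) - 0.5 * theta**2)
    val, _err = quad(kernel, -np.inf, np.inf)
    return val

# One item with a* = 1, b* = 0 and a correct answer: the integral equals
# 0.5 * sqrt(2 * pi) by the symmetry logistic(t) + logistic(-t) = 1.
c = c_i0(np.array([1]), np.array([0.0]), np.array([1.0]))
```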
For π(µb*|b*, a*, ψ*, D) and π(ψ*|b*, a*, D) in (2.16), we have the closed forms derived in Section 2.1 and just need to plug in the (b*, a*, µb*, ψ*) values. For π(b*|D) in (2.16), using the importance-weighted marginal density estimator (IWMDE) idea (Chen, 1994), we have

$$\begin{aligned}
\pi(b^* \mid D) &= \int \pi(b^*, \theta, a, \mu_b, \psi \mid D)\, w(b \mid \theta, a, \mu_b, \psi)\, d\theta\, db\, da\, d\mu_b\, d\psi \\
&= \int \frac{\pi(b^*, \theta, a, \mu_b, \psi \mid D)}{\pi(b, \theta, a, \mu_b, \psi \mid D)}\, w(b \mid \theta, a, \mu_b, \psi)\, \pi(b, \theta, a, \mu_b, \psi \mid D)\, d\theta\, db\, da\, d\mu_b\, d\psi \\
&= \int \frac{\pi(b^* \mid \theta, a, \mu_b, \psi, D)}{\pi(b \mid \theta, a, \mu_b, \psi, D)}\, w(b \mid \theta, a, \mu_b, \psi)\, \pi(b, \theta, a, \mu_b, \psi \mid D)\, d\theta\, db\, da\, d\mu_b\, d\psi,
\end{aligned}$$

where w(b|θ, a, µb, ψ) can be any conditional density satisfying $\int w(b \mid \theta, a, \mu_b, \psi)\, db = 1$, as long as its support is contained within the support of π(b|θ, a, µb, ψ, D). The optimal choice for w is the conditional posterior distribution π(b|θ, a, µb, ψ, D), and the estimator then coincides with the CMDE. With the optimal choice of w, we have

$$\begin{aligned}
\pi(b^* \mid D) &= \int \pi(b^* \mid \theta, a, \mu_b, \psi, D)\, \pi(\theta, b, a, \mu_b, \psi \mid D)\, d\theta\, db\, da\, d\mu_b\, d\psi \\
&= \int \prod_{j=1}^{J}\left[U_{b_j}^{-1} \prod_{i=1}^{n} \frac{\exp\{y_{ij}a_j(\theta_i - b_j^*)\}}{1 + \exp\{a_j(\theta_i - b_j^*)\}} \exp\left(-\frac{\psi(b_j^* - \mu_b)^2}{2}\right)\right] \pi(\theta, b, a, \mu_b, \psi \mid D)\, d\theta\, db\, da\, d\mu_b\, d\psi,
\end{aligned} \qquad (2.18)$$

where $U_{b_j} = \int \prod_{i=1}^{n} \frac{\exp\{y_{ij}a_j(\theta_i - b_j)\}}{1 + \exp\{a_j(\theta_i - b_j)\}} \exp\left(-\frac{\psi(b_j - \mu_b)^2}{2}\right) db_j$ is the normalizing constant for π(bj|θ, a, µb, ψ, D).
For π(a*|b*, D) in (2.16), similarly we have

$$\begin{aligned}
\pi(a^* \mid b^*, D) &= \int \frac{\pi(\theta, a^*, \mu_b, \psi \mid b^*, D)}{\pi(\theta, a, \mu_b, \psi \mid b^*, D)}\, w(a \mid \theta, b^*)\, \pi(\theta, a, \mu_b, \psi \mid b^*, D)\, d\theta\, da\, d\mu_b\, d\psi \\
&= \int \frac{\pi(\theta, b^*, a^*, \mu_b, \psi \mid D)}{\pi(\theta, b^*, a, \mu_b, \psi \mid D)}\, w(a \mid \theta, b^*)\, \pi(\theta, a, \mu_b, \psi \mid b^*, D)\, d\theta\, da\, d\mu_b\, d\psi,
\end{aligned}$$

where π(θ, a, µb, ψ|b*, D) is a conditional posterior density whose samples can be generated by fixing b at b* during the sampling process. The weight function w(a|θ, b) is a normalized density. According to the result in Chen (1994), the optimal choice is w = π(a|θ, b*, D), which can be written as

$$w(a \mid \theta, b^*) = \pi(a \mid \theta, b^*, D) = \prod_{j=1}^{J}\left[C_{a_j}^{-1} \prod_{i=1}^{n} \frac{\exp\{y_{ij}a_j(\theta_i - b_j^*)\}}{1 + \exp\{a_j(\theta_i - b_j^*)\}}\right],$$

where $C_{a_j} = \int_{0.5}^{2.5} \prod_{i=1}^{n} \frac{\exp\{y_{ij}a_j(\theta_i - b_j^*)\}}{1 + \exp\{a_j(\theta_i - b_j^*)\}}\, da_j$ is the normalizing constant of π(aj|θ, b*, D). Therefore we have

$$\pi(a^* \mid b^*, D) = \int \prod_{j=1}^{J} C_{a_j}^{-1} \prod_{i=1}^{n}\prod_{j=1}^{J} \frac{\exp\{y_{ij}a_j^*(\theta_i - b_j^*)\}}{1 + \exp\{a_j^*(\theta_i - b_j^*)\}}\, \pi(\theta, a, \mu_b, \psi \mid b^*, D)\, d\theta\, da\, d\mu_b\, d\psi. \qquad (2.19)$$

Hence we can take MCMC samples from π(θ, a, µb, ψ|b*, D) and obtain π(a*|b*, D) by taking the sample average of $\prod_{j=1}^{J} C_{a_j}^{-1} \prod_{i=1}^{n}\prod_{j=1}^{J} \frac{\exp\{y_{ij}a_j^*(\theta_i - b_j^*)\}}{1 + \exp\{a_j^*(\theta_i - b_j^*)\}}$.
Now we see that to compute (2.16), we need two MCMC outputs in total, from
π(θ,a, µb, ψ|b∗, D) and π(θ,a, µb, ψ, b|D). An extension of the Chib method is to
utilize the Metropolis-Hastings (MH) output to compute the marginal likelihood
(Chib and Jeliazkov, 2001), which requires knowing the proposal distributions used
in the MH sampling algorithm. Since we assume that the MCMC samples are
obtained using the existing software (such as NIMBLE in R), without knowing its
sampling mechanism, this approach may not be feasible to implement for the IRT
models.
2.2.5 The Stepping Stone Method (SSM)
Lartillot and Philippe (2006) and Friel and Pettitt (2008) independently developed the thermodynamic integration (TI) method, which is also known as the power posterior method. As mentioned in Lartillot and Philippe (2006), the TI method has discretization error and requires more time for sampling from multiple power posteriors. The stepping stone method (Xie et al., 2011) uses a similar idea without discretization error; its basic idea is outlined as follows.
Denote

$$L(D \mid \theta, b, a, \beta_k) = L(D \mid \theta, b, a)^{\beta_k} = \prod_{i=1}^{n}\prod_{j=1}^{J} \frac{\exp\{\beta_k y_{ij}a_j(\theta_i - b_j)\}}{[1 + \exp\{a_j(\theta_i - b_j)\}]^{\beta_k}}. \qquad (2.20)$$

Let $q_{\beta_k} = L(D \mid \theta, b, a, \beta_k)\, \pi(\theta, b, a, \mu_b, \psi)$ be the kernel of the power posterior density associated with βₖ, and let $p_{\beta_k} = q_{\beta_k}/C_{\beta_k}$ be the normalized power posterior density, where $C_{\beta_k}$ is the corresponding normalizing constant.

Let {βₖ, k = 0, 1, ..., K} be a non-decreasing sequence between 0 and 1 such that β₀ = 0 and β_K = 1. The marginal likelihood can be expressed as

$$M(D) \equiv r_{SS} = \frac{C_{\beta_K}}{C_{\beta_0}} = \prod_{k=1}^{K} \frac{C_{\beta_k}}{C_{\beta_{k-1}}} \equiv \prod_{k=1}^{K} r_{SS,k}.$$
We can compute each r_{SS,k} by

$$\begin{aligned}
r_{SS,k} &= \frac{\int L(D \mid \theta, b, a)^{\beta_k}\, \pi(\theta, b, a, \mu_b, \psi)\, d\theta\, db\, da\, d\mu_b\, d\psi}{C_{\beta_{k-1}}} \\
&= \int L(D \mid \theta, b, a)^{\beta_k - \beta_{k-1}}\, \frac{L(D \mid \theta, b, a)^{\beta_{k-1}}\, \pi(\Theta)}{C_{\beta_{k-1}}}\, d\Theta \\
&= E_{p_{\beta_{k-1}}}\!\left[L(D \mid \theta, b, a)^{\beta_k - \beta_{k-1}}\right],
\end{aligned} \qquad (2.21)$$

where Θ = (θ, b, a, µb, ψ). Then we can sample from $p_{\beta_{k-1}}$ and compute the normalizing constant ratio r_{SS,k} as the sample average of $L(D \mid \theta, b, a)^{\beta_k - \beta_{k-1}}$, and the log marginal likelihood can be estimated as

$$\log \hat{M}(D)_{SS} = \log(\hat{r}_{SS}) = \sum_{k=1}^{K} \log(\hat{r}_{SS,k}). \qquad (2.22)$$

In practice, according to Xie et al. (2011), a viable choice for $\{\beta_k\}_{k=0}^{K}$ is the K + 1 evenly spaced quantiles on [0, 1] of the Beta(0.3, 1.0) distribution. In Section 2.3.1, we choose a moderate K = 20; therefore we need in total K + 1 = 21 MCMC outputs from the corresponding power posteriors to carry out the marginal likelihood computation.
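A sketch of the β schedule and of the combination step (2.21)-(2.22), assuming the per-power-posterior log-likelihood draws are already available:

```python
import numpy as np
from scipy.stats import beta
from scipy.special import logsumexp

def beta_schedule(K, a=0.3, b=1.0):
    """K + 1 evenly spaced quantiles of Beta(a, b), so beta_0 = 0 and beta_K = 1."""
    return beta.ppf(np.linspace(0.0, 1.0, K + 1), a, b)

def ss_log_marginal(log_lik_by_power, betas):
    """Combine draws into log r_SS = sum_k log r_{SS,k} as in (2.21)-(2.22).
    log_lik_by_power[k] holds log L(D|.) draws from p_{beta_k}; the ratio
    r_{SS,k} uses the draws from p_{beta_{k-1}}, averaged in log space."""
    total = 0.0
    for k in range(1, len(betas)):
        d = np.asarray(log_lik_by_power[k - 1], dtype=float)
        total += logsumexp((betas[k] - betas[k - 1]) * d) - np.log(len(d))
    return total

bs = beta_schedule(20)
# If every draw has the same log-likelihood c, the telescoping sum returns c.
check = ss_log_marginal([[-50.0] * 10] * 5, beta_schedule(5))
```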
The generalized stepping stone method (Fan et al., 2010) extends the stepping stone method by introducing a reference distribution when constructing the power posterior, and it reduces to the original stepping stone method when the prior is chosen as the reference distribution. The generalized SS estimator can be advantageous compared with the original SS method when the reference distribution approximates the posterior distribution well, which is difficult to achieve here due to the high dimension of the parameter space. Thus we do not consider implementing the generalized SS method for the IRT models.
Table 2.1: Results of DIC and WAIC for the 1PL model and the 2PL model.
Model   DIC        WAIC
1PL     23262.08   23434.48
2PL     22404.74   22636.92
2.3 Analysis of English Examination Data
We extract 50 objective questions from the English Examination dataset introduced in Chapter 1.3 for the analysis. The overall accuracy across all items is 0.607. The item with the highest accuracy (0.929) is item 10, and the item with the lowest accuracy (0.177) is item 45.
We fit the 2PL model to the English examination data introduced in Chapter 1; the posterior estimates for the discrimination parameters a are displayed in Figure 2.1. Quite a few of the 95% credible intervals for the aj's do not include 1, which implies that the 2PL model may be a better fit for the data. Using the posterior medians for a and b, the estimated item characteristic curves (ICC) for items 1, 4, 5, 8, and 20 are displayed in Figure 2.2. The difficulty parameter bj determines the location of the curve on the θ axis, with P(y = 1|θ = bj, bj, aj) = 0.5 regardless of the discrimination parameter aj. The posterior medians of the discrimination parameters for these items are a1 = 2.38, a4 = 2.04, a5 = 1.45, a8 = 0.99, and a20 = 0.65, which lead to the different increasing trends of the ICCs in Figure 2.2.
We also obtain the widely applicable information criterion (WAIC) (Watanabe,
2010) and deviance information criterion (DIC) (Spiegelhalter et al., 2002) values
for the 1PL and 2PL models in Table 2.1. Both information criteria favor the 2PL
model over the 1PL model.
[Figure 2.1: Posterior medians and 95% credible intervals of aj's under the 2PL model.]
2.3.1 Results of Marginal Likelihood Computation
The marginal likelihood of an IRT model is intrinsically small, and certain techniques are required to control numerical precision when implementing each of the five Monte Carlo methods discussed in Section 2.2. Table 2.2 shows the log marginal likelihoods of the 1PL and 2PL models, and Table 2.3 summarizes the time used for sampling and computing separately for each method. A MacBook computer with a 2.2 GHz Intel Core i7 processor was used for the sampling process with NIMBLE in R, and the calculations of each estimator were carried out in MATLAB on a Windows computer with a 3.3 GHz Intel Core i5 processor. For
the PMC approach, 500,000 i.i.d. samples are taken from the prior distribution to obtain the point estimate $\hat{M}(D)_{PMC,0}$. We obtain another 100 independent point estimates $\hat{M}(D)_{PMC,i}$, i = 1, ..., 100, each calculated with 50,000 i.i.d. samples, and estimate the Monte Carlo standard error as $\sqrt{\frac{1}{100}\sum_{i=1}^{100}\left(\hat{M}(D)_{PMC,i} - \hat{M}(D)_{PMC,0}\right)^2}$.

[Figure 2.2: Estimated item characteristic curves for items 1, 4, 5, 8, and 20.]
For the other Monte Carlo approaches, 65,000 samples are drawn with a burn-in of 5,000 samples for each MCMC output. First, we obtain a point estimate $\hat{M}(D)_{*,0}$ using all 60,000 MCMC samples, where * denotes any particular method among the HM method, the Chen method, the Chib method, and the SSM. Then we divide the 60,000 Monte Carlo samples into 100 non-overlapping batches, each with 600 samples, and obtain 100 point estimates of the marginal likelihood $\hat{M}(D)_{*,i}$, i = 1, ..., 100, one from each batch. Finally, the Monte Carlo standard error is obtained as $\sqrt{\frac{1}{100}\sum_{i=1}^{100}\left(\hat{M}(D)_{*,i} - \hat{M}(D)_{*,0}\right)^2}$.
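The batch-based standard error above is straightforward to compute; a small helper (ours, not from the dissertation's code):

```python
import numpy as np

def batch_mc_se(point_estimates, overall_estimate):
    """Monte Carlo standard error from batch estimates around the overall
    estimate: sqrt( (1/B) * sum_i (M_i - M_0)^2 ) for B batches."""
    d = np.asarray(point_estimates, dtype=float) - overall_estimate
    return float(np.sqrt(np.mean(d**2)))

print(batch_mc_se([1.0, 3.0], 2.0))  # 1.0
```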
Table 2.2: Log marginal likelihoods obtained by each method and corresponding Monte Carlo standard errors in parentheses.

Model   PMC         HM          Chib        Chen        SSM
1PL     -18485.98   -11557.61   -12184.06   -12183.06   -12183.10
        (872.99)    (21.51)     (0.22)      (1.71)      (13.20)
2PL     -16249.33   -11135.98   -11878.45   -11867.16   -11881.45
        (254.73)    (24.89)     (0.64)      (3.33)      (14.64)

Table 2.3: Time usage to obtain the estimates by each method under the 1PL and 2PL IRT models.

Model   Time        PMC     HM      Chib    Chen    SSM
1PL     Sampling    4.2s    1088s   1103s   1092s   4.8h
        Computing   450s    65s     13.6h   314s    401s
2PL     Sampling    4.5s    1273s   1785s   1267s   6.5h
        Computing   472s    70s     20.7h   360s    440s

Here "s" denotes "second" and "h" refers to "hour".
From Table 2.2, we can see that the point estimates of the log marginal likelihood for the 2PL model are much larger than those of the 1PL model for all methods, which is consistent with our earlier finding from Figure 2.1 that the 2PL model should fit the data better than the 1PL model. Although we do not know the "true values" of the marginal likelihoods for the 1PL and 2PL models, Table 2.2 shows that the prior-based estimators have large Monte Carlo standard errors and that their point estimates are quite far from those computed by the other methods, despite the 500,000 i.i.d. samples used. Compared with the PMC estimates, the point estimates produced by the HM method are much closer to those of the Chib method, the Chen method, and the SSM, though they are still off by a few hundred on the log scale.
The Chib method, the Chen method, and the SSM have similar point estimates for the marginal likelihoods. Among these three methods, the estimated Monte Carlo standard errors of the Chib method are rather small compared with the other two; however, the computation time for the Chib method is noticeably longer. The Chen method requires the least computing time (only one MCMC output) among the three.
To examine the sensitivity of the marginal likelihood under the 2PL model with respect to the specification of π(aj), we use the Chib method to compute the marginal likelihoods with three log-normal priors, LN(0.5, 1), LN(0.5, 1/5), and LN(0.5, 25), where LN denotes the log-normal distribution. We keep 4,000 MCMC samples after a burn-in of 8,000 samples for each MCMC output; the log marginal likelihoods under the 2PL model are -11954, -11959, and -11952, corresponding to LN(0.5, 1), LN(0.5, 1/5), and LN(0.5, 25), respectively. The results still prefer the 2PL model over the 1PL model.
2.4 Discussion
Throughout the data analysis, we assume the MCMC samples are obtained from existing software without knowing the mechanism by which they are sampled. We have observed some quite interesting results for these five Monte Carlo methods. The prior-based method has a consistency guarantee by the law of large numbers, the simplest form among the five methods, and the shortest time per sample; nevertheless, its results are not satisfactory, at least under our specific setting. As mentioned in Lewis and Raftery (1997), "unfortunately it is generally necessary to obtain a very large number of draws from the prior distribution" before it becomes a good estimator. All its benefits may be eclipsed if we need one million or more i.i.d. samples in each run to obtain reliable estimates. For the HM approach, despite the potential infinite-variance problem, the HM estimates are noticeably better than those of the PMC. For the Chen method, though the implementation of its original form is resource demanding, a "fast" alternative can be used with a single MCMC output, and it produces results very similar to those of the Chib method and the SSM. The Chib method and the SSM both require multiple MCMC outputs to compute the marginal likelihood for IRT models, and they produce very close point estimates. Both take longer than the other three methods, and the computational time of the SSM depends on the choice of the number of power posteriors, K.
The computational cost comparison might be fairer if we chose the number of iterations for each method such that the standard errors of the estimated log marginal likelihoods are comparable. We applied the Chib method to the 2PL model with 6,000 samples: the log marginal likelihood and the corresponding Monte Carlo standard error are -11878.3 and 5.61, and the computing time is about 7,600 seconds.
The results might be sensitive to the choice of a* and b* when applying the Chib method and the Chen method, and it would be interesting to examine in future work how robust the results are to different choices of a* and b*.
Chapter 3
Variable Selection in Dynamic Item Response
Theory Models via Bayes Factors with a Single
MCMC Output
3.1 Proposed Model
We follow the basic infrastructure of the dynamic IRT model proposed in Wang et al. (2013) and build a dynamic IRT model to accommodate the data structure of the Edsphere dataset. At the first level of the model, the correctness of the items is linked to the latent ability of each individual, the difficulty of the items, and certain random effects. At the second level, the mean of the latent ability on the current day is assumed to be the sum of the latent ability on the previous day and an increment related to the time lapse. We want to investigate the relationship between the individual "growth factor" and the characteristics of test takers, such as socio-economic status, gender, grade, etc. We propose the following dynamic IRT model, which incorporates this covariate information through ci:
$$\text{Level 1: } \Pr(X_{i,t,s,l} = 1 \mid \theta_{i,t}, \eta_{i,t,s}, d_{i,t,s,l}) = \Phi(\theta_{i,t} - d_{i,t,s,l} + \eta_{i,t,s}), \qquad (3.1)$$
$$\text{Level 2: } \theta_{i,t} = \theta_{i,t-1} + c_i(1 - \rho\theta_{i,t-1})\Delta_{i,t}^{+} + \omega_{i,t}, \quad \omega_{i,t} \sim N(0, \phi^{-1}\Delta_{i,t}), \qquad (3.2)$$
$$\text{Level 3: } \log(c) \sim N(1\alpha + Z\beta, \psi^{-1}I_n), \qquad (3.3)$$
where X_{i,t,s,l} represents the correctness of the l-th item in the s-th test on the t-th day taken by subject i, for i = 1, 2, ..., n, t = 1, 2, ..., T_i, s = 1, 2, ..., S_{i,t}, and l = 1, 2, ..., K_{i,t,s}; d_{i,t,s,l} denotes the corresponding item difficulty. θ_{i,t} denotes the latent ability of individual i on test day t, η is the exam-wise random effect, and we assume η_{i,t} = (η_{i,t,1}, ..., η_{i,t,S_{i,t}})′ ∼ N(0, τ_i⁻¹I) for i = 1, ..., n, t = 1, ..., T_i, and d_{i,t,s,l} ∼ N(a_{i,t,s}, σ²), where both a_{i,t,s} and σ are known from the test design.

At level 2, ρ is a known positive constant, and ∆_{i,t} is the time lapse between the t-th and the (t − 1)-th test days of subject i, t = 1, ..., T_i. The term ∆⁺_{i,t} = min(∆_{Tmax}, ∆_{i,t}) is the time lapse truncated at a pre-specified maximum time interval ∆_{Tmax}; it reflects the fact that a student's ability may not grow proportionally to the time lapse when the gap exceeds a certain number of days, as may happen in practice when students take a vacation. We set ∆_{Tmax} = 14 days, as in Wang et al. (2013).

At level 3, Z is the n × p design matrix of the covariates, and we assume each column has been centered, i.e., 1′Z = 0. We use α to denote the intercept and β the p × 1 vector of coefficients for the covariates.
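To illustrate the level-2 dynamics, the sketch below simulates one subject's latent-ability path; all parameter values (ρ, φ, c_i, the visit days) are illustrative, not estimates from the Edsphere data.

```python
import numpy as np

def simulate_theta_path(c_i, days, rho=0.05, phi=4.0, dt_max=14, theta0=0.0, seed=0):
    """Simulate one subject's latent-ability path under level 2 of (3.2):
    theta_t = theta_{t-1} + c_i (1 - rho*theta_{t-1}) * min(dt, dt_max) + noise,
    with noise ~ N(0, dt / phi). Parameter values here are illustrative."""
    rng = np.random.default_rng(seed)
    theta = [theta0]
    for t in range(1, len(days)):
        dt = days[t] - days[t - 1]
        drift = c_i * (1.0 - rho * theta[-1]) * min(dt, dt_max)
        theta.append(theta[-1] + drift + rng.normal(0.0, np.sqrt(dt / phi)))
    return np.array(theta)

# A 25-day gap between the last two visits is truncated at dt_max = 14 days.
path = simulate_theta_path(c_i=0.05, days=[0, 2, 5, 30])
print(path.shape)  # (4,)
```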
We aim to perform variable selection among the covariates at level 3, i.e., to find all non-zero elements in β, while treating all other model structures as common setups. Liang et al. (2008) showed some attractive variable selection properties of the Bayes factor with mixtures of g-priors in the linear model setting, which motivates us to apply the Bayes factor with mixtures of g-priors in our hierarchical model setup. As discussed in Liang et al. (2008), the marginal likelihoods of all models have closed-form expressions when g is fixed, and the Bayes factor can be computed via Laplace approximation when the Inv-Gamma(1/2, n/2) prior is imposed on g, i.e., when the Zellner-Siow prior is used. Under our hierarchical model framework, however, analytical forms are available for neither the marginal likelihoods nor the Bayes factors, whether g is fixed or random, and it does not seem feasible to use the Laplace approximation to obtain the Bayes factor either. Therefore a tractable computational algorithm is needed to obtain the Bayes factors.
3.2 Computation of Bayes Factors with a Single MCMC Output
To compute marginal likelihoods that are not analytically available, there are many Monte Carlo methods, such as the harmonic mean method (Newton and Raftery, 1994), the Chib method (Chib, 1995), and the stepping stone method (Xie et al., 2011), to name a few. Due to the complex structure and high dimensionality of the parameter space in the dynamic IRT model, computing the marginal likelihood of each model with existing Monte Carlo methods is a difficult task. Motivated by Chen (2005), and taking advantage of the nested model structure at level 3, a computationally viable alternative is to obtain the Bayes factor of each model against the full model using only a single MCMC output.
To formally define the models under comparison, we introduce the following notation. Let MN be the null model in which all β coefficients are 0, i.e., β = 0, and let MF denote the full model in which all β coefficients are considered non-zero. Let Mγ denote the model with β₋γ = 0, where (βγ, β₋γ) = β. Let pγ be the length of βγ, so that pγ + p₋γ = p for any γ. The setups at levels 1 and 2 are common to all models.

Priors for all model parameters are specified as follows: (i) assign τi ∼ Γ(0.01, 100) independently for i = 1, 2, ..., n, and φ ∼ Γ(1, 1000); (ii) use the Zellner-Siow prior discussed in the null-based approach of Liang et al. (2008): π(α, ψ) ∝ ψ⁻¹, g ∼ Inv-Gamma(1/2, n/2), and βγ ∼ N(0, gψ⁻¹(Zγ′Zγ)⁻¹). A uniform prior on the model space is used.
Based on the linear structure at level 3, we have the identity

$$\pi(c \mid \alpha, \beta_\gamma, \psi, \mathcal{M}_\gamma) = \pi(c \mid \alpha, \beta_\gamma, \beta_{-\gamma} = 0, \psi, \mathcal{M}_F), \qquad (3.4)$$

which leads to the following proposition.
Proposition 3.2.1. For Mγ ≠ MN, the Bayes factor of model Mγ to the full model MF under the dynamic IRT model framework is

$$BF_H[\mathcal{M}_\gamma : \mathcal{M}_F] = E\left[\frac{\pi(\beta_\gamma \mid \psi, g, \mathcal{M}_\gamma)}{\pi(\beta_\gamma, \beta_{-\gamma} = 0 \mid \psi, g, \mathcal{M}_F)}\, \pi(\beta_{-\gamma} = 0 \mid c, \alpha, \beta_\gamma, \psi, g, D, \mathcal{M}_F)\right], \qquad (3.5)$$

and the Bayes factor of the null model MN to the full model MF under the dynamic IRT model framework is

$$BF_H[\mathcal{M}_N : \mathcal{M}_F] = E\left[\frac{\pi(\beta = 0 \mid c, \alpha, \psi, g, D, \mathcal{M}_F)}{\pi(\beta = 0 \mid \psi, g, \mathcal{M}_F)\, \pi(g)}\right], \qquad (3.6)$$

where both expectations are taken with respect to π(c, α, βγ, ψ, g|D, MF), the posterior distribution under the full model.
Proof. Denote P = {θ, d, η, τ, φ}. Using the identity (3.4), the marginal likelihood of Mγ can be expressed as

$$\begin{aligned}
M(D \mid \mathcal{M}_\gamma) = {}& \int L(\theta, d, \eta \mid D)\, \pi(d)\, \pi(\eta \mid \tau)\, \pi(\tau)\, \pi(\theta \mid c, \phi)\, \pi(\phi)\, \pi(c \mid \alpha, \beta_\gamma, \psi, \mathcal{M}_\gamma)\, \pi(\alpha)\, \pi(\psi) \\
&\times \pi(\beta_\gamma \mid \psi, g, \mathcal{M}_\gamma)\, \pi(g)\, dP\, dc\, d\alpha\, d\beta_\gamma\, d\psi\, dg \\
= {}& \int L(\theta, d, \eta \mid D)\, \pi(d)\, \pi(\eta \mid \tau)\, \pi(\tau)\, \pi(\theta \mid c, \phi)\, \pi(\phi)\, \pi(c \mid \alpha, \beta_\gamma, \beta_{-\gamma} = 0, \psi, \mathcal{M}_F)\, \pi(\alpha) \\
&\times \pi(\psi)\, \pi(g)\, \pi(\beta_\gamma, \beta_{-\gamma} = 0 \mid \psi, g, \mathcal{M}_F)\, \frac{\pi(\beta_\gamma \mid \psi, g, \mathcal{M}_\gamma)}{\pi(\beta_\gamma, \beta_{-\gamma} = 0 \mid \psi, g, \mathcal{M}_F)}\, dP\, dc\, d\alpha\, d\beta_\gamma\, d\psi\, dg \\
= {}& M(D \mid \mathcal{M}_F) \int \pi(P, c, \alpha, \beta_\gamma, \beta_{-\gamma} = 0, \psi, g \mid D, \mathcal{M}_F)\, \frac{\pi(\beta_\gamma \mid \psi, g, \mathcal{M}_\gamma)}{\pi(\beta_\gamma, \beta_{-\gamma} = 0 \mid \psi, g, \mathcal{M}_F)}\, dP\, dc\, d\alpha\, d\beta_\gamma\, d\psi\, dg \\
= {}& M(D \mid \mathcal{M}_F) \int \frac{\pi(\beta_\gamma \mid \psi, g, \mathcal{M}_\gamma)}{\pi(\beta_\gamma, \beta_{-\gamma} = 0 \mid \psi, g, \mathcal{M}_F)}\, \pi(\beta_{-\gamma} = 0 \mid c, \alpha, \beta_\gamma, \psi, g, D, \mathcal{M}_F) \\
&\times \pi(P, c, \alpha, \beta_\gamma, \psi, g \mid D, \mathcal{M}_F)\, dP\, dc\, d\alpha\, d\beta_\gamma\, d\psi\, dg \\
= {}& M(D \mid \mathcal{M}_F)\, E\left[\frac{\pi(\beta_\gamma \mid \psi, g, \mathcal{M}_\gamma)}{\pi(\beta_\gamma, \beta_{-\gamma} = 0 \mid \psi, g, \mathcal{M}_F)}\, \pi(\beta_{-\gamma} = 0 \mid c, \alpha, \beta_\gamma, \psi, g, D, \mathcal{M}_F)\right],
\end{aligned}$$

which leads to the Bayes factor

$$BF_H[\mathcal{M}_\gamma : \mathcal{M}_F] = \frac{M(D \mid \mathcal{M}_\gamma)}{M(D \mid \mathcal{M}_F)} = E\left[\frac{\pi(\beta_\gamma \mid \psi, g, \mathcal{M}_\gamma)}{\pi(\beta_\gamma, \beta_{-\gamma} = 0 \mid \psi, g, \mathcal{M}_F)}\, \pi(\beta_{-\gamma} = 0 \mid c, \alpha, \beta_\gamma, \psi, g, D, \mathcal{M}_F)\right]. \qquad (3.7)$$
For the null model, similarly we have

$$\begin{aligned}
M(D \mid \mathcal{M}_N) = {}& \int L(\theta, d, \eta \mid D)\, \pi(d)\, \pi(\eta \mid \tau)\, \pi(\tau)\, \pi(\theta \mid c, \phi)\, \pi(\phi)\, \pi(c \mid \alpha, \psi, \mathcal{M}_N)\, \pi(\alpha)\, \pi(\psi)\, dP\, dc\, d\alpha\, d\psi \qquad (3.8) \\
= {}& \int L(\theta, d, \eta \mid D)\, \pi(d)\, \pi(\eta \mid \tau)\, \pi(\tau)\, \pi(\theta \mid c, \phi)\, \pi(\phi)\, \pi(c \mid \alpha, \beta = 0, \psi, \mathcal{M}_F)\, \pi(\alpha) \\
&\times \pi(\psi)\, \pi(g)\, \pi(\beta = 0 \mid \psi, g, \mathcal{M}_F)\, \frac{1}{\pi(\beta = 0 \mid \psi, g, \mathcal{M}_F)\, \pi(g)}\, dP\, dc\, d\alpha\, d\psi\, dg \\
= {}& M(D \mid \mathcal{M}_F) \int \frac{1}{\pi(\beta = 0 \mid \psi, g, \mathcal{M}_F)\, \pi(g)}\, \pi(\beta = 0 \mid c, \alpha, \psi, g, D, \mathcal{M}_F)\, \pi(P, c, \alpha, \psi, g \mid D, \mathcal{M}_F)\, dP\, dc\, d\alpha\, d\psi\, dg \\
= {}& M(D \mid \mathcal{M}_F)\, E\left[\frac{\pi(\beta = 0 \mid c, \alpha, \psi, g, D, \mathcal{M}_F)}{\pi(\beta = 0 \mid \psi, g, \mathcal{M}_F)\, \pi(g)}\right],
\end{aligned}$$

and thus

$$BF_H[\mathcal{M}_N : \mathcal{M}_F] = \frac{M(D \mid \mathcal{M}_N)}{M(D \mid \mathcal{M}_F)} = E\left[\frac{\pi(\beta = 0 \mid c, \alpha, \psi, g, D, \mathcal{M}_F)}{\pi(\beta = 0 \mid \psi, g, \mathcal{M}_F)\, \pi(g)}\right]. \qquad (3.9)$$
We use the symbol BF_H above to emphasize that the Bayes factor is defined under the hierarchical framework of the dynamic IRT model. Based on the results in Proposition 3.2.1, we can compute the Bayes factor of any model Mγ to the full model MF using one single MCMC output from the posterior based on the full model. For any arbitrary pair of models (Mγ, Mγ′), the Bayes factor can be obtained by

$$BF_H[\mathcal{M}_\gamma : \mathcal{M}_{\gamma'}] = \frac{M(D \mid \mathcal{M}_\gamma)}{M(D \mid \mathcal{M}_F)} \Big/ \frac{M(D \mid \mathcal{M}_{\gamma'})}{M(D \mid \mathcal{M}_F)} = \frac{BF_H[\mathcal{M}_\gamma : \mathcal{M}_F]}{BF_H[\mathcal{M}_{\gamma'} : \mathcal{M}_F]}. \qquad (3.10)$$
Regarding the quantity inside the expectation of equation (3.5), with β ordered as (βγ, β₋γ), we have

$$\begin{aligned}
&\frac{\pi(\beta_\gamma \mid \psi, g, \mathcal{M}_\gamma)}{\pi(\beta_\gamma, \beta_{-\gamma} = 0 \mid \psi, g, \mathcal{M}_F)}\, \pi(\beta_{-\gamma} = 0 \mid c, \alpha, \beta_\gamma, \psi, g, D, \mathcal{M}_F) \\
&= \frac{(2\pi)^{-p_\gamma/2}\, \det\{g\psi^{-1}(Z_\gamma'Z_\gamma)^{-1}\}^{-1/2}\, \exp\{-\frac{1}{2}g^{-1}\psi\, \beta_\gamma'(Z_\gamma'Z_\gamma)\beta_\gamma\}}{(2\pi)^{-p/2}\, \det\{g\psi^{-1}(Z'Z)^{-1}\}^{-1/2}\, \exp\{-\frac{1}{2}g^{-1}\psi\, \beta_\gamma'(Z_\gamma'Z_\gamma)\beta_\gamma\}}\, \pi(\beta_{-\gamma} = 0 \mid c, \alpha, \beta_\gamma, \psi, g, D, \mathcal{M}_F) \\
&= (2\pi)^{(p - p_\gamma)/2}\, (g\psi^{-1})^{(p - p_\gamma)/2} \left\{\frac{\det(Z_\gamma'Z_\gamma)}{\det(Z'Z)}\right\}^{1/2} \pi(\beta_{-\gamma} = 0 \mid c, \alpha, \beta_\gamma, \psi, g, D, \mathcal{M}_F),
\end{aligned} \qquad (3.11)$$

where π(β₋γ = 0|c, α, βγ, ψ, g, D, MF) is the conditional normal density evaluated at β₋γ = 0, which can be computed with existing results for conditional normal distributions.
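The conditional normal factor can be evaluated with the standard partitioned-Gaussian formulas; a generic helper is sketched below (building the posterior mean m and covariance Sigma of β from (c, α, ψ, g) is model-specific and omitted, so the names and inputs here are illustrative).

```python
import numpy as np
from scipy.stats import multivariate_normal

def cond_normal_density_at_zero(m, Sigma, idx_given, beta_given):
    """For a joint N(m, Sigma), return the density of the remaining block,
    evaluated at 0, conditional on the 'given' block taking value beta_given."""
    p = len(m)
    idx_rest = np.setdiff1d(np.arange(p), idx_given)
    S11 = Sigma[np.ix_(idx_given, idx_given)]
    S21 = Sigma[np.ix_(idx_rest, idx_given)]
    S22 = Sigma[np.ix_(idx_rest, idx_rest)]
    sol = np.linalg.solve(S11, beta_given - m[idx_given])
    cond_mean = m[idx_rest] + S21 @ sol
    cond_cov = S22 - S21 @ np.linalg.solve(S11, S21.T)
    return multivariate_normal.pdf(np.zeros(len(idx_rest)), cond_mean, cond_cov)

# Independent components: conditioning changes nothing, so the value at 0
# is the standard normal density 1/sqrt(2*pi).
val = cond_normal_density_at_zero(np.zeros(2), np.eye(2),
                                  np.array([0]), np.array([1.0]))
```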
For the quantity inside the expectation of equation (3.6), we have

$$\frac{\pi(\beta = 0 \mid c, \alpha, \psi, g, D, \mathcal{M}_F)}{\pi(\beta = 0 \mid \psi, g, \mathcal{M}_F)\, \pi(g)} = (g + 1)^{p/2} \exp\left\{-\frac{\psi}{2}\, \frac{g}{g + 1}\, \log(c)'Z(Z'Z)^{-1}Z'\log(c)\right\} \frac{\Gamma(1/2)}{(n/2)^{1/2}}\, g^{3/2} e^{n/(2g)}. \qquad (3.12)$$

Hence we are able to compute the Bayes factors for all model pairs conveniently using a single MCMC output from the posterior of the full model.
3.3 Model Selection Consistency
A major motivation for using the Bayes factor as the variable selection tool is the set of attractive model selection properties proved in Liang et al. (2008) under the linear model setup, namely $\mathrm{plim}_n\, P(\mathcal{M}_\gamma \mid D) = 1$ when Mγ is the true underlying model. Here "plim" denotes convergence in probability with respect to the sampling distribution under the true model Mγ. The posterior probability of a model is given by

$$P(\mathcal{M}_\gamma \mid D) = \frac{P(D \mid \mathcal{M}_\gamma)\, \pi(\mathcal{M}_\gamma)}{\sum_{\gamma'} P(D \mid \mathcal{M}_{\gamma'})\, \pi(\mathcal{M}_{\gamma'})} = \frac{\pi(\mathcal{M}_\gamma)}{\sum_{\gamma'} \pi(\mathcal{M}_{\gamma'})\, BF[\mathcal{M}_{\gamma'} : \mathcal{M}_\gamma]}.$$

Since p is fixed, Liang et al. (2008) equivalently proved that

$$\mathrm{plim}_n\, BF[\mathcal{M}_{\gamma'} : \mathcal{M}_\gamma] = 0 \quad \forall\, \mathcal{M}_{\gamma'} \neq \mathcal{M}_\gamma \qquad (3.13)$$

when the Zellner-Siow prior is used and the following condition is satisfied:

$$\lim_{n\to\infty} \frac{\beta_\gamma^{T} Z_\gamma^{T}(I - P_{\gamma'})Z_\gamma\beta_\gamma}{n} = b_\gamma \in (0, \infty) \qquad (3.14)$$

for any model Mγ′ that does not contain Mγ, where Pγ′ is the projection matrix onto the column space of Zγ′.
It is important to investigate whether the posterior model selection consistency
still holds under the dynamic IRT model. Under the hierarchical framework of
the proposed dynamic IRT model, log(c) at level 3 mimics the role of the response
variable in a linear regression model, but it is a latent variable rather than an
observed response. The proof of model selection consistency in Liang et al.
(2008) is therefore not directly applicable here.
Under similar assumptions, we obtain the posterior model selection consistency
stated in Theorem 3.3.1 below.
Theorem 3.3.1. Assume that the marginal likelihood of any model Mγ defined under
the dynamic IRT model framework is finite for any given n. Assume that condition
(3.14) holds and p is fixed. Denote the true model by Mγ and the observed data by
D = X. Then for any model Mγ′ ≠ Mγ, we have
\[
\operatorname{plim}_n \mathrm{BF}[\mathcal{M}_{\gamma'} : \mathcal{M}_\gamma \mid D] = 0. \tag{3.15}
\]
Proof. For convenience, let P = {θ, d, η, τ, φ}, and let Qγ, Qγ′ denote the parameters at
level 3 for models Mγ and Mγ′, respectively. The parameter set Q includes α, β, ψ,
and contains g if the model is not the null model. By definition,
\[
\begin{aligned}
\mathrm{BF}_H[\mathcal{M}_{\gamma'} : \mathcal{M}_\gamma \mid D]
&= \frac{M(D \mid \mathcal{M}_{\gamma'})}{M(D \mid \mathcal{M}_\gamma)} \\
&= \frac{\int L(\theta, d, \eta \mid D)\pi(d)\pi(\eta \mid \tau)\pi(\tau)\pi(\theta \mid c, \phi)\pi(\phi)\,\pi(c \mid Q_{\gamma'})\pi(Q_{\gamma'})\, dQ_{\gamma'}\, dc\, dP}
        {\int L(\theta, d, \eta \mid D)\pi(d)\pi(\eta \mid \tau)\pi(\tau)\pi(\theta \mid c, \phi)\pi(\phi)\,\pi(c \mid Q_\gamma)\pi(Q_\gamma)\, dQ_\gamma\, dc\, dP} \\
&= \frac{\int L(\theta, d, \eta \mid D)\pi(d)\pi(\eta \mid \tau)\pi(\tau)\pi(\theta \mid c, \phi)\pi(\phi)
         \left\{\int \pi(c \mid Q_{\gamma'})\pi(Q_{\gamma'})\, dQ_{\gamma'}\right\} dc\, dP}
        {\int L(\theta, d, \eta \mid D)\pi(d)\pi(\eta \mid \tau)\pi(\tau)\pi(\theta \mid c, \phi)\pi(\phi)
         \left\{\int \pi(c \mid Q_\gamma)\pi(Q_\gamma)\, dQ_\gamma\right\} dc\, dP}.
\end{aligned}
\]
Letting
\[
\begin{aligned}
h(P, c \mid D) &= L(\theta, d, \eta \mid D)\pi(d)\pi(\eta \mid \tau)\pi(\tau)\pi(\theta \mid c, \phi)\pi(\phi), \\
\mathrm{ML}(\mathcal{M}_{\gamma'} \mid c) &= \int \pi(c \mid Q_{\gamma'})\pi(Q_{\gamma'})\, dQ_{\gamma'}, \\
\mathrm{ML}(\mathcal{M}_\gamma \mid c) &= \int \pi(c \mid Q_\gamma)\pi(Q_\gamma)\, dQ_\gamma,
\end{aligned}
\]
we then have
\[
\mathrm{BF}_H[\mathcal{M}_{\gamma'} : \mathcal{M}_\gamma \mid D]
= \frac{\int h(P, c \mid D)\,\mathrm{ML}(\mathcal{M}_\gamma \mid c)\,
        \dfrac{\mathrm{ML}(\mathcal{M}_{\gamma'} \mid c)}{\mathrm{ML}(\mathcal{M}_\gamma \mid c)}\, dc\, dP}
       {\int h(P, c \mid D)\,\mathrm{ML}(\mathcal{M}_\gamma \mid c)\, dc\, dP}.
\]
Since
\[
\pi(c \mid Q_\gamma) = \prod_{i=1}^n \frac{1}{c_i}\, \pi(\log(c_i) \mid Q_\gamma)
= \frac{1}{\prod_{i=1}^n c_i}\, \pi(\log(c) \mid Q_\gamma)
\]
and similarly
\[
\pi(c \mid Q_{\gamma'}) = \prod_{i=1}^n \frac{1}{c_i}\, \pi(\log(c_i) \mid Q_{\gamma'})
= \frac{1}{\prod_{i=1}^n c_i}\, \pi(\log(c) \mid Q_{\gamma'}),
\]
we have
\[
\frac{\mathrm{ML}(\mathcal{M}_{\gamma'} \mid c)}{\mathrm{ML}(\mathcal{M}_\gamma \mid c)}
= \frac{\int \pi(\log(c) \mid Q_{\gamma'})\pi(Q_{\gamma'})\, dQ_{\gamma'}}
       {\int \pi(\log(c) \mid Q_\gamma)\pi(Q_\gamma)\, dQ_\gamma}
= \mathrm{BF}_L[\mathcal{M}_{\gamma'} : \mathcal{M}_\gamma \mid \log(c)], \tag{3.16}
\]
where we use BF_L[Mγ′ : Mγ | log(c)] to denote the Bayes factor of model Mγ′ to
model Mγ under the linear model framework, with log(c) serving as the response.
To prove plim_n BF[Mγ′ : Mγ | D] = 0 for any Mγ′ ≠ Mγ, we need to show that for
all ε, δ > 0, there exists a constant n_{ε,δ} such that
\[
P(\mathrm{BF}[\mathcal{M}_{\gamma'} : \mathcal{M}_\gamma \mid D] > \varepsilon) < \delta
\]
for all n ≥ n_{ε,δ}.
When Mγ is the true model, we have log(c) ∼ N(1α + Zβγ, ψ^{-1}I_n). Along with
condition (3.14), by Theorem 3 in Liang et al. (2008),
\[
\operatorname{plim}_n \mathrm{BF}_L[\mathcal{M}_{\gamma'} : \mathcal{M}_\gamma \mid \log(c)] = 0. \tag{3.17}
\]
Therefore, for any ε, δ > 0, there exists n_{ε,δ} such that
P(ML(Mγ′ | c)/ML(Mγ | c) > ε) < δ for all n ≥ n_{ε,δ} under Mγ, where
log(c) ∼ N(1α + Zβγ, ψ^{-1}I_n).
Let ML(Mγ′ | c)/ML(Mγ | c) − ε = ∆_{γ′,γ,ε}(c). We have
\[
\begin{aligned}
&P(\mathrm{BF}_H[\mathcal{M}_{\gamma'} : \mathcal{M}_\gamma \mid D] > \varepsilon) \\
&= P\!\left(\frac{\int h(P, c \mid D)\,\mathrm{ML}(\mathcal{M}_\gamma \mid c)
      \left\{\frac{\mathrm{ML}(\mathcal{M}_{\gamma'} \mid c)}{\mathrm{ML}(\mathcal{M}_\gamma \mid c)} - \varepsilon\right\} dc\, dP}
      {\int h(P, c \mid D)\,\mathrm{ML}(\mathcal{M}_\gamma \mid c)\, dc\, dP} > 0\right) \\
&= P\!\left(\int h(P, c \mid D)\,\mathrm{ML}(\mathcal{M}_\gamma \mid c)
      \left\{\frac{\mathrm{ML}(\mathcal{M}_{\gamma'} \mid c)}{\mathrm{ML}(\mathcal{M}_\gamma \mid c)} - \varepsilon\right\} dc\, dP > 0\right).
\end{aligned}
\]
The expression R(D) = ∫ h(P, c | D) ML(Mγ | c){ML(Mγ′ | c)/ML(Mγ | c) − ε} dc dP
is a function of the data D, and hence it is random with respect to the sampling
distribution. Consider an exhaustive partition of the domain of (D, c) as follows:
S1 = {(D, c) : ∆_{γ′,γ,ε} > 0}, S2 = {(D, c) : ∆_{γ′,γ,ε} ≤ 0}. Then we have
\[
\begin{aligned}
P(\mathrm{BF}_H[\mathcal{M}_{\gamma'} : \mathcal{M}_\gamma \mid D] > \varepsilon)
&= P(R(D) > 0) \\
&= P(R(D) > 0 \mid S_1)P(S_1) + P(R(D) > 0 \mid S_2)P(S_2) \\
&= P\!\left(\int_{\Delta_{\gamma',\gamma,\varepsilon} > 0}
      h(P, c \mid D)\,\mathrm{ML}(\mathcal{M}_\gamma \mid c)\,\Delta_{\gamma',\gamma,\varepsilon}\, dc\, dP > 0\right) P(S_1) \\
&\quad+ P\!\left(\int_{\Delta_{\gamma',\gamma,\varepsilon} \le 0}
      h(P, c \mid D)\,\mathrm{ML}(\mathcal{M}_\gamma \mid c)\,\Delta_{\gamma',\gamma,\varepsilon}\, dc\, dP > 0\right) P(S_2) \\
&= 1 \times P(S_1) + 0 \times P(S_2) \\
&= P(S_1) \\
&= P(\Delta_{\gamma',\gamma,\varepsilon} > 0) \\
&< \delta \quad \text{under model } \mathcal{M}_\gamma
\end{aligned}
\]
holds for all n > n_{ε,δ}; hence plim_n BF[Mγ′ : Mγ | D] = 0 for any Mγ′ ≠ Mγ.
3.4 Simulation Studies
Though the proposed Bayes factor approach is supported by Theorem 3.3.1 for
posterior model selection consistency, it is desirable to examine the small-sample
performance of this approach. We design and carry out simulation studies as follows:
(i) Consider 3 setups for the number of individuals: n = 40, 60, 80;
(ii) Consider 2 setups for the number of distinct test days: Ti = 100, 120 for each
individual i, i = 1, 2, . . . , n.
(iii) Set the number of tests per day as Si,t = 4 and the number of items per test as Ki,t,s = 10
for i = 1, 2, . . . , n, t = 1, 2, . . . , Ti, s = 1, 2, . . . , Si,t.
(iv) Set ∆i,t = t + 10 for t ≤ Ti/2, and ∆i,t = t − 10 for Ti/2 + 1 ≤ t ≤ Ti.
(v) Set α = −5, ρ = 0.118, ψ = 100, σ² = 0.7333², φ^{-1/2} = 0.0055, and τ_i^{-1/2} i.i.d. ∼
U(0.2, 0.4).
(vi) Set p = 5 and consider 3 setups for the β coefficients:
β = (0, 0, 0, 0, 0), (0.1, 0, 0.2, 0, −0.1), (0.2, 0, 0.4, 0, 0.2),
which include the case in which the null model is the true model to test the "type
I error", and two other cases with different magnitudes for the non-zero coefficients
to test the "power".
(vii) n × p random numbers are independently generated from N(0, 0.2²) to form
an n × p matrix; each column is then centered to mean 0 and scaled to unit
variance to build the design matrix Z.
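Step (vii) can be sketched as follows. The seed and the use of the sample standard deviation for scaling are illustrative assumptions, not details taken from the dissertation.

```python
import random
import statistics

def make_design_matrix(n, p, sd=0.2, seed=1):
    """Draw n*p independent N(0, sd^2) entries, then center each column to
    mean 0 and scale it to unit (sample) variance, as in step (vii)."""
    rng = random.Random(seed)
    Z = [[rng.gauss(0.0, sd) for _ in range(p)] for _ in range(n)]
    for j in range(p):
        col = [Z[i][j] for i in range(n)]
        mu, s = statistics.mean(col), statistics.stdev(col)
        for i in range(n):
            Z[i][j] = (Z[i][j] - mu) / s
    return Z

Z = make_design_matrix(40, 5)  # e.g., the n = 40, p = 5 scenario
```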
For each simulation scenario, 100 independent simulations are carried out, 5000
MCMC samples are drawn from the posterior distribution of the full model MF,
and the first 2000 samples are discarded as burn-in. Trace plots
of β, g and ψ from a simulation with n = 40, Ti = 100, β = (0.1, 0, 0.2, 0, −0.1) are
displayed in Figure 3.1.
For each simulation scenario, we record how frequently the true model is ranked
as the “best” model (with the largest log marginal likelihood), and how frequently
Figure 3.1: Trace plots of β, g and ψ with 3000 MCMC samples. The red line refers to the true value.
it is ranked among the top 3 "best" models. The results are recorded in Tables 3.1 and
3.2. We can see that the Bayes factor approach performs well when n ≥ 60
in both the "type I error" and "power" studies, and the performance improves as n
increases. In terms of selecting the top 3 models, when the true model has non-zero
coefficients, the Bayes factor approach ranks the true model among the top 3 over 90%
of the time even when n = 40 in this simulation study. The finite-sample performance
of the proposed Bayes factor approach is thus quite good.
Table 3.1: Percentage of simulations in which the true model is ranked as the "best"
model by the Bayes factor approach.

(n, Ti)     β = (0, 0, 0, 0, 0)   (0.1, 0, 0.2, 0, −0.1)   (0.2, 0, 0.4, 0, −0.2)
(40, 100)         0.48                   0.60                     0.70
(40, 120)         0.62                   0.66                     0.76
(60, 100)         0.80                   0.69                     0.78
(60, 120)         0.82                   0.79                     0.86
(80, 100)         0.93                   0.79                     0.85
(80, 120)         0.91                   0.85                     0.87
Table 3.2: Percentage of simulations in which the true model is ranked among the top 3
"best" models by the Bayes factor approach.

(n, Ti)     β = (0, 0, 0, 0, 0)   (0.1, 0, 0.2, 0, −0.1)   (0.2, 0, 0.4, 0, 0.2)
(40, 100)         0.51                   0.91                     0.97
(40, 120)         0.64                   0.95                     0.96
(60, 100)         0.83                   0.98                     0.98
(60, 120)         0.84                   0.98                     0.98
(80, 100)         0.93                   1.00                     0.98
(80, 120)         0.92                   0.99                     0.97
Table 3.3: Estimated log Bayes factor of each model to the full model via the
proposed Monte Carlo approach.

Model          Log Bayes factor compared to MF
Null model           −243.13
Z1                   −318.54
Z2                   −210.93
Z1, Z2                −17.49
Z3                   −256.72
Z1, Z3               −378.92
Z2, Z3               −116.77
Z1, Z2, Z3              0
3.5 Real Data Application
We apply the proposed method to the EdSphere dataset introduced in Chapter
1. 35 students are selected for the analysis, and 3 candidate covariates are used:
gender, economic status and grade, denoted by Z1–Z3, respectively. For
gender, we let Z1 = 1 for male and Z1 = 0 for female. There are two categories for
economic status, i.e., disadvantaged and non-disadvantaged, based on lunch status,
homelessness status, and other district-level indicators of economic situation, and we
let Z2 = 1 for disadvantaged and 0 otherwise. For the grade variable Z3, we
use the grade information on the first day of the study period for each student.
Among these 35 students, the number of test days ranges from 45 to 122, the maximum
number of tests per day is 18, and the maximum number of questions per test is 78.
We implement the MCMC procedure with 300,000 iterations and discard the
first 50,000 samples as burn-in. The posterior medians of the regression coefficients
are β = (−0.249, 0.428, −0.109)′. The estimated log Bayes factor of each model to the
full model MF is shown in Table 3.3.
The model with the largest Bayes factor compared to the full model is the full
model itself, MF : Z1, Z2, Z3; i.e., gender, economic status, and grade all play a
role in explaining the growth factor ci. The Bayes factor of the full model
MF to the 2nd "preferred" model M1,2 : Z1, Z2, i.e., e^{17.49}, is considered "very
strong" evidence in favor of the full model (Kass and Raftery, 1995). A comprehensive
study with more subjects and interaction terms will be carried out in future work
to substantiate the findings.
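The Kass and Raftery (1995) scale referenced here categorizes evidence by the size of 2 log BF. A small sketch of that lookup, with thresholds taken from their commonly cited table, applied to the log Bayes factor from Table 3.3:

```python
def kass_raftery(log_bf):
    """Categorize evidence for the favored model from 2*log(BF),
    per the Kass and Raftery (1995) scale."""
    two_log_bf = 2.0 * log_bf
    if two_log_bf < 2:
        return "not worth more than a bare mention"
    if two_log_bf < 6:
        return "positive"
    if two_log_bf < 10:
        return "strong"
    return "very strong"

# Full model vs. the runner-up {Z1, Z2}: log BF = 17.49 (Table 3.3).
print(kass_raftery(17.49))  # very strong
```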
3.6 Discussion
From the previous analysis results, we can see that Bayes factor seems to be a
reliable tool for the variable selection task in the proposed dynamic IRT model, if the
Zellner-Siow prior is used and certain assumptions are met. Note that generally it is
difficult to compute the marginal likelihoods of hierarchical models with multi-levels
if no analytical forms are available, and the usage of Bayes factor may be limited. If
one wishes to perform variable selection under a hierarchical model, at only a single
layer which consists of the linear structure similar to level 3 in the dynamic IRT
model, Proposition 3.2.1 may be applied in a similar fashion to compute the Bayes
factors conveniently. The proof for posterior model selection consistency may still
hold if similar assumptions as the ones in Theorem 3.3.1 are made, and the layer
under consideration links with other layers in certain ways, e.g., the parameters in
such layer are not connected to other layers.
There are also other model selection criteria for hierarchical models, such as
the deviance information criterion (DIC) (Spiegelhalter et al., 2002) and the widely
applicable information criterion (WAIC) (Watanabe, 2010). When applying the DIC
for model selection, the "focus" of the parameters of interest may affect the results.
One may integrate out all "intermediate" parameters to obtain the likelihood with
respect to the parameters of interest, which is usually challenging in hierarchical
models such as IRT models, or take the likelihood to be just the data likelihood in
the first layer and treat the "intermediate" parameters the same as the parameters
in the outermost layers. It is unclear which implementation may lead to better
results. It would be interesting to compare the model selection performance of
these criteria in future work.
Chapter 4
Bayesian Nonparametric Monotone Regression of
Dynamic Latent Traits in IRT Models
4.1 Nonparametric Monotone Regression Model
In this chapter, we keep the discussion focused on the one-parameter IRT model,
which links latent ability and item difficulty to the correctness of each item. The
idea could be similarly implemented for two-parameter/three-parameter IRT models
or other continuous, ordinal, and nominal latent variable models. As introduced
in Chapter 1, in the classic one-parameter IRT model, the latent ability of each
individual and the difficulty level of each item are the two key components, and
both are assumed to be static. In the EdSphere dataset, however, the item responses
are longitudinal, and thus the latent abilities of individuals and the item difficulty
levels both vary with time. In addition, in computerized testing each test taker is
allowed to take tests at any time they wish. Moreover, they can take more than one
exam per day. Therefore, following the discussion of Wang et al. (2013), we extend
the classic one-parameter IRT model as follows.
The proposed shape-constrained IRT model involves two levels, i.e.,
Level 1: Pr(Xi,j,s,k = 1 | θi,j, ηi,j,s, di,j,s,k) = F(θi,j − di,j,s,k + ηi,j,s), (4.1)
Level 2: θi,j = fi(ti,j) + ωi,j. (4.2)
At the first level, Equation (4.1) extends the classic one-parameter IRT model to
the scenario of computerized testing, where θi,j represents the i-th person's ability
(latent trait) on the j-th day, assuming a person's ability is constant over a given
day; di,j,s,k is the difficulty of the k-th item in the s-th test on the j-th day taken by
subject i; ηi,j,s accounts for the random effects that cannot be explained by the
person's ability and the item difficulty in the s-th test on the j-th day for the i-th
person; F(·) is the cumulative distribution function (CDF) of a continuous random
variable; and i = 1, . . . , n, j = 1, . . . , Ti, s = 1, . . . , Si,j and k = 1, . . . , Ki,j,s. For
the EdSphere dataset, since the ensemble mean and variance of the item difficulties
in a test are known quantities due to the test design, we assume
di,j,s,k = ai,j,s + υi,j,s,k, (4.3)
where υi,j,s,k ∼ N(0, σ²), and ai,j,s and σ are known. Further, we assume that for each
individual i, the random effects ηi,j,s i.i.d. ∼ N(0, τ_i^{-1}) for j = 1, . . . , Ti and
s = 1, . . . , Si,j, where τi is a precision parameter that varies across individuals.
For the link function F^{-1}(·), we will use F^{-1}(·) = Φ^{-1}(·) (called the normal ogive
or probit link), where Φ^{-1}(·) is the inverse of the standard normal CDF; this
link eases the computation for Bayesian analysis.
In the second level (4.2), ωi,j represents the random residual, ωi,j ∼ N(0, φ^{-1}∆i,j),
where φ is an unknown precision parameter and ∆i,j is the time lapse
between the j-th test day and the (j − 1)-th test day of subject i, j = 1, . . . , Ti.
We assume the variance of ωi,j is proportional to the time lapse because this implies
that the uncertainty about one's ability becomes larger when he/she does not take
tests for a while. The function fi(·) is the i-th individual's latent trajectory, which
is assumed to be a continuous function, where ti,j is the time location, i.e., the actual
test day on which an examinee takes a test in the study period. When we model fi(·),
some prior information is often available. For example, psychologists and educators
can assume that the mean trend of a student's reading ability should be growing, or
at least not decreasing, during one school year. Over a longer time span, however, such
a monotonicity assumption may not hold. We choose to impose such prior beliefs
on fi(·) since they apply to a large part of the potential applications, and we
may be able to check the reasonableness of this assumption from the data fit.
We will utilize the idea of Bornkamp and Ickstadt (2009) to model the unknown
latent trajectory fi(·) flexibly and conveniently with a monotone shape constraint.
First, for each individual i, i = 1, . . . , n, we rescale the original time units into [0, 1]
by subtracting the minimum time value of the i-th individual and then dividing by
the time range of individual i. After such rescaling, we continue
to use ti,j as the notation for time for convenience, and assume
fi(ti,j) = βi,0 + βi,1 f_i^0(ti,j), i = 1, . . . , n, j = 1, . . . , Ti, (4.4)
where f_i^0(·) is the CDF of a bounded and continuous random variable on [0, 1]. Thus
fi(·) is an increasing function if βi,1 ≥ 0 and a decreasing function otherwise. Moreover,
fi(·) can accommodate flat regions if f_i^0(·) is chosen properly. This is because
the CDF of a random variable reaches its plateau when the variable approaches
the bound of its domain. Under the current time-rescaling mechanism, and since
0 ≤ f_i^0(ti,j) ≤ 1 for each individual i, the intercept βi,0 can be interpreted as the
initial level of the latent ability for the i-th individual, while βi,0 + βi,1 can be treated
as the maximum level that the individual can reach during the study period. Second,
to make the modeling of f_i^0(·) more flexible, we introduce nonparametric ideas:
\[
f_i^0(t_{i,j}) = \int_\Xi F(t_{i,j}, \xi)\, G_i(d\xi)
= \sum_{\ell=1}^{L_i} \pi_{i,\ell}\, F(t_{i,j}, \xi_{i,\ell}),
\qquad \xi_{i,\ell} \overset{\text{i.i.d.}}{\sim} P_0, \tag{4.5}
\]
where we model f_i^0(·) as a discrete mixture of parametric CDFs. In Equation (4.5),
F(·, ξ) is a CDF with parameters ξ on the parameter space Ξ; Gi is a discrete
probability measure on Ξ assigned the general discrete random measure prior
introduced by Ongaro and Cattaneo (2004), i.e., G_i(dξ) = Σ_{ℓ=1}^{L_i} π_{i,ℓ} δ_{ξ_{i,ℓ}}(dξ), where
δ is the Dirac delta function; the ξ_{i,ℓ}'s are independent and identically distributed
realizations from a continuous distribution P_0 on Ξ, and are assumed to be independent
of the π_{i,ℓ}'s and L_i's; the π_{i,ℓ}'s satisfy Σ_{ℓ=1}^{L_i} π_{i,ℓ} = 1, and L_i has its support on
the positive integers. Also, note that common choices of P_0 in Equation (4.5) will help
us borrow strength among the individual latent trajectories. The construction of Equation
(4.5) actually includes many popular discrete random probability measures in
the current literature as special cases, such as the Dirichlet process, general stick-
breaking processes, etc. In addition, this construction is very flexible in modeling
monotone functions since it does not impose any restrictive structure on the parameters
(i.e., the weights π_{i,ℓ} and the parameters ξ_{i,ℓ}) as the other methods mentioned in
the Introduction do.
In Equation (4.5), a proper choice of the base distribution function F(·, ξ) is the
key to successfully modeling fi(·). A typical requirement is that convex combinations
of the functions F(·, ξ_{i,1}), F(·, ξ_{i,2}), . . ., for any i = 1, . . . , n, can approximate
any continuous CDF on [0, 1]. The beta distribution functions
(i.e., the regularized incomplete beta functions) satisfy this requirement; however,
extensive computation is usually associated with the beta distribution functions
since they have no closed form. To balance the computational burden and
the adequacy of approximation, we consider the CDF of the two-sided power (TSP)
distribution (c.f. Van Dorp and Kotz (2002)) for F(·, ξ):
\[
F(t_{i,j}, \xi) =
\begin{cases}
b\left(\dfrac{t_{i,j}}{b}\right)^{\gamma}, & \text{if } 0 \le t_{i,j} \le b, \\[6pt]
1 - (1-b)\left(\dfrac{1-t_{i,j}}{1-b}\right)^{\gamma}, & \text{if } b \le t_{i,j} \le 1,
\end{cases}
\qquad (b, \gamma) \in [0, 1] \times \mathbb{R}^{+},
\]
which has two key parameters (b, γ) and is a viable alternative to the beta distribution
functions. Here, define ξ = (b, γ). Bornkamp and Ickstadt (2009) proved that convex
combinations of TSP CDFs are capable of approximating any continuous CDF
on [0, 1].
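A direct transcription of the TSP CDF and the finite mixture in Equation (4.5) might look like the following sketch; the handling of the boundary case b = 0 is an implementation detail, not something specified in the text.

```python
def tsp_cdf(t, b, gamma):
    """CDF of the two-sided power (TSP) distribution on [0, 1] with
    parameters (b, gamma); see Van Dorp and Kotz (2002)."""
    assert 0.0 <= t <= 1.0 and 0.0 <= b <= 1.0 and gamma > 0
    if t <= b:  # rising branch below b (returns 0 at t = 0)
        return b * (t / b) ** gamma if b > 0 else 0.0
    return 1.0 - (1.0 - b) * ((1.0 - t) / (1.0 - b)) ** gamma

def tsp_mixture_cdf(t, weights, params):
    """f_i^0(t) = sum_l pi_l * F(t, (b_l, gamma_l)), as in Equation (4.5)."""
    return sum(w * tsp_cdf(t, b, g) for w, (b, g) in zip(weights, params))
```

Both branches agree at t = b (each equals b), so the mixture is a continuous, monotonically increasing function from 0 to 1.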
We illustrate several examples of TSP functions in Figure 4.1, from which
we can see that γ controls the steepness of the curve. When γ is small, the pace of
increase is comparatively slow, and when γ becomes larger, the increasing trend
becomes steeper. The parameter b is the unique mode of the TSP distribution when
γ > 1. A TSP function with a small b and a large γ describes a curve that increases
steeply in the beginning and stabilizes afterward, which could be viewed as a 'fast
learner' growth curve, while a TSP function with a large b and a small γ can be viewed
as the growth trend of an individual who improves steadily at a slower pace. Other
scenarios can be interpreted accordingly and will vary with different values of b
and γ.
Figure 4.1: Illustration of several TSP functions with different γ and b values (γ = 2, 10, 30, 50; b = 0.2, 0.5, 0.8).
The monotone regression can be treated as the sum of two parts: 1) an intercept
parameter (interpretable as one's initial ability), and 2) the product of a scale
parameter (interpretable as the maximum ability that one can gain during the study
period) with a continuous function of the time variable (which is monotonically
increasing over the study period and can easily capture the plateau effect of one's
ability at the end). Another potential advantage of our proposed approach is
that the parameters of the underlying base functions, which constitute the monotone
function, have no constraints on the domain of the parameter space. The lack of
constraints on the domain of the parameter space makes the model more flexible
and reduces the computational burden. In addition, the base functions themselves
can be directly learned from the data.
4.2 Bayesian Computation Scheme
The hierarchical model of (4.1) and (4.2) can accommodate the complex structure
of computerized testing (e.g., the EdSphere dataset) and also allows the incorporation
of prior information. However, because of the complexity of the model considered,
we have to resort to Markov chain Monte Carlo (MCMC) computational techniques
for the analysis. A byproduct of Bayesian inference is that the uncertainties in all
quantities are combined in the overall assessment of inferential uncertainty.
4.2.1 Prior Specification and Posterior Distribution
Before starting the Bayesian inference, we first have to specify the prior distributions
of the unknowns in the model. For the parameters φ, τi's, βi,0's and βi,1's in Equations
(4.1), (4.2) and (4.3), due to a lack of scientific knowledge, we use the following
objective priors: π(φ) ∝ φ^{-3/2}, π(τi) ∝ τ_i^{-3/2}, π(βi,0) ∝ 1 and π(βi,1) ∝ 1,
for i = 1, . . . , n. The objective priors used for π(φ) and π(τi) are recommended in
Wang et al. (2013).
According to Lemma 1 in Bornkamp and Ickstadt (2009), the prior distributions
of the ξi,ℓ's determine the mean and prior correlation structure of f_i^0(·). Without
expert information, a uniform distribution on a finite subset of the parameter space Ξ
for the ξi,ℓ's is a reasonable choice to start with. Theoretically, we would like to elicit
an unbounded prior for Li. One viable choice is the zero-truncated Poisson distribution
with rate parameter λ > 0, whose prior mean is λe^λ/(e^λ − 1). The larger the λ value
is, the more components there are in the TSP mixture. For the prior of
πi = (πi,1, . . . , πi,Li)′, a natural choice is a symmetric Dirichlet distribution with
common parameter ρ > 0. Notice that the prior variability of f_i^0(·) increases when
Li or ρ gets smaller (see Lemma 1 in Bornkamp and Ickstadt (2009)). Hence, in
practice, the values of λ and ρ are chosen according to the desired prior variability of
f_i^0(·) and the expected number of jumps in the model response.
Using the priors specified above, we can derive the posterior of the unknowns in
the proposed model as shown in Equation (A.1) of Appendix A. We can show that this
posterior is proper (see details in Appendix A); the statistical inference based on
sampling from this posterior is therefore legitimate. To facilitate the computation of
the posterior of the unknowns, we implement the idea of data augmentation (Albert and
Chib, 1993) by introducing a latent variable Zi,j,s,k for each dichotomous response
Xi,j,s,k, i.e., defining Zi,j,s,k ∼ N(θi,j − ai,j,s + ηi,j,s, 1 + σ²) I(Zi,j,s,k > 0) if Xi,j,s,k = 1,
and Zi,j,s,k ∼ N(θi,j − ai,j,s + ηi,j,s, 1 + σ²) I(Zi,j,s,k ≤ 0) if Xi,j,s,k = 0. Therefore, the
two-level hierarchical model (4.1) and (4.2) can be simplified as
Zi,j,s,k = fi(ti,j) + ωi,j − di,j,s,k + ηi,j,s + εi,j,s,k, (4.6)
with εi,j,s,k i.i.d. ∼ N(0, 1). Our computation schemes for drawing samples from
the joint posterior distribution of the unknowns then derive from this data-augmented
model.
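A minimal sketch of this augmentation step for a single response, assuming inverse-CDF sampling from the truncated normal (the mean θi,j − ai,j,s + ηi,j,s and variance 1 + σ² follow the definition above; the numeric values and the use of `statistics.NormalDist` are illustrative conveniences):

```python
import random
from statistics import NormalDist

def draw_augmented_z(x, mean, var, rng=random.Random(0)):
    """Draw Z ~ N(mean, var) truncated to (0, inf) if x == 1, or to
    (-inf, 0] if x == 0, via inverse-CDF sampling (in the spirit of the
    Albert and Chib, 1993, augmentation)."""
    nd = NormalDist(mean, var ** 0.5)
    p0 = nd.cdf(0.0)                 # normal mass below the truncation point
    u = rng.random()
    if x == 1:                       # sample from the upper tail
        return nd.inv_cdf(p0 + u * (1.0 - p0))
    return nd.inv_cdf(u * p0)        # sample from the lower tail

# Illustrative values: theta - a + eta = 0.3, variance 1 + sigma^2.
z = draw_augmented_z(1, mean=0.3, var=1.0 + 0.7333 ** 2)
assert z > 0
```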
Let Λi = (Li, πi,1, . . . , πi,Li, ξi,1, . . . , ξi,Li)′, with each ξi,ℓ = (bi,ℓ, γi,ℓ) being the
parameters of the ℓ-th TSP mixture component for the i-th subject and πi,ℓ the
assigned weight of the ℓ-th mixture component for the i-th subject,
for i = 1, . . . , n and ℓ = 1, . . . , Li. Thus, Λi contains all information about the
TSP mixture for the i-th individual, and we write Λ = {Λ1, . . . , Λn}
and ξ = {ξ1,1, . . . , ξ1,L1, . . . , ξn,1, . . . , ξn,Ln} for the sets of variables over all
n subjects.
Similarly, we use the bold notations θ, β, Z, η, τ to denote the sets of the corresponding
variables introduced in Section 4.1 over all indices, i.e., β = {β1,0, β1,1, . . . , βn,0, βn,1},
η = {ηi,j,s : i = 1, . . . , n, j = 1, . . . , Ti, s = 1, . . . , Si,j}, τ = {τ1, . . . , τn},
Z = {Zi,j,s,k : i = 1, . . . , n, j = 1, . . . , Ti, s = 1, . . . , Si,j, k = 1, . . . , Ki,j,s},
θ = {θ1,1, . . . , θ1,T1, . . . , θn,1, . . . , θn,Tn}. Similarly, define X = {Xi,j,s,k : i = 1, . . . , n,
j = 1, . . . , Ti, s = 1, . . . , Si,j, k = 1, . . . , Ki,j,s}. Then, the joint posterior distribution
of the parameters θ, β, Λ, Z, η, τ, φ given the data X is derived as:
\[
\begin{aligned}
&f(\theta, \beta, \Lambda, Z, \eta, \tau, \phi \mid X) \tag{4.7}\\
&\propto f(X \mid Z)\, f(Z \mid \theta, \eta)\, f(\theta \mid \beta, \Lambda, \phi)\, f(\eta \mid \tau)\,
  \pi(\beta)\pi(\Lambda)\pi(\phi)\pi(\tau) \\
&\propto \prod_{i=1}^{n} \prod_{j=1}^{T_i} \prod_{s=1}^{S_{i,j}} \prod_{k=1}^{K_{i,j,s}}
  \big[ I(Z_{i,j,s,k} > 0) I(X_{i,j,s,k} = 1) + I(Z_{i,j,s,k} \le 0) I(X_{i,j,s,k} = 0) \big] \\
&\qquad\times \sqrt{\frac{\psi_{i,j,s,k}}{2\pi}}
  \exp\left\{ -\frac{\psi_{i,j,s,k} (Z_{i,j,s,k} - \theta_{i,j} + a_{i,j,s} - \eta_{i,j,s})^2}{2} \right\} \\
&\qquad\times \prod_{i=1}^{n} \prod_{j=1}^{T_i} \sqrt{\frac{\phi}{2\pi \Delta_{i,j}}}
  \exp\left\{ -\frac{\phi\, [\theta_{i,j} - \beta_{i,0} - \beta_{i,1} f_i^0(t_{i,j})]^2}{2 \Delta_{i,j}} \right\} \\
&\qquad\times \prod_{i=1}^{n} \prod_{j=1}^{T_i} \left( \frac{\tau_i}{2\pi} \right)^{S_{i,j}/2}
  \exp\left\{ -\frac{\tau_i \sum_{s=1}^{S_{i,j}} \eta_{i,j,s}^2}{2} \right\}
  \left[ \prod_{i=1}^{n} \pi(\tau_i)\pi(\beta_i)\pi(\Lambda_i) \right] \pi(\phi),
\end{aligned}
\]
where ψi,j,s,k = (1 + σ²)^{-1}, and π(τi), π(βi), π(Λi) and π(φ) denote the priors specified
at the beginning of this subsection.
4.2.2 The MCMC Sampling Schemes
The key part in developing an MCMC algorithm to draw samples from the joint
posterior distribution (4.7) is the estimation of the latent trajectories f_i^0(ti,j) in
Equation (4.4). Notice that once Λ is known, the f_i^0(ti,j)'s are fully determined. Since
the dimension of Λ can vary, we employ a reversible jump Markov chain Monte
Carlo (RJ-MCMC) sampling scheme (Green and Hastie, 2009). Further, to reduce
the correlation as well as to achieve faster convergence of the MCMC samples of the
parameters, we implement the idea of partially collapsed Gibbs sampling (Van Dyk
and Park, 2008). Thus, at the q-th iteration, we perform the sampling procedure of
the unknowns in the order below:
1. Sample Z(q) from f(Z(q) | θ(q−1),η(q−1)), which is a truncated normal distri-
bution for the full conditional distribution of each individual Zi,j,s,k given the
rest;
2. Sample η(q) from f(η(q) | θ(q−1), τ (q−1),Z(q)), which is a normal distribution
for the full conditional distribution of each ηi,j,s given the rest;
3. Sample τ (q) from f(τ (q) | η(q)), which is a gamma distribution for the full
conditional distribution of each τi given the rest;
4. Sample Λ(q) from f(Λ(q) | θ(q−1)), which has no closed form; moreover, the
dimension of Λ varies at each iteration, and thus we employ a Metropolis–
Hastings algorithm within the RJ-MCMC to sample the full conditional distribution
of each Λi given the rest;
5. Sample φ(q) from f(φ(q) | θ(q−1), Λ(q)), which is a gamma distribution for the
full conditional distribution of φ given the rest;
6. Sample β(q) from f(β(q) | θ(q−1),Λ(q), φ(q)), which is a multivariate normal
distribution for the full conditional distribution of β given the rest;
7. Sample θ(q) from f(θ(q) | β(q),Λ(q),Z(q),η(q), φ(q)), which is a normal distri-
bution for the full conditional distribution of each individual θi,j given the
rest.
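Steps 1–7 above can be organized as one pass of a sampler loop. The sketch below is a minimal control-flow skeleton only: the `update_*` stubs are illustrative stand-ins for the full-conditional draws detailed in Appendix A (which are not reproduced here), so only the ordering and state handling are shown.

```python
# Skeleton of the partially collapsed Gibbs sampler (Steps 1-7).
# Each stub stands in for a real full-conditional draw from Appendix A;
# here the stubs only record that the step ran, so the loop is runnable.

def make_stub(name):
    def update(state):
        state[name + "_updated"] = True  # placeholder for the actual draw
        return state
    return update

# The sampling order of Section 4.2.2: Z, eta, tau, Lambda, phi, beta, theta.
steps = [make_stub(n) for n in ("Z", "eta", "tau", "Lambda", "phi", "beta", "theta")]

def run_mcmc(n_iter, state=None):
    state = {} if state is None else state
    draws = []
    for _ in range(n_iter):
        for step in steps:           # one full pass = Steps 1-7
            state = step(state)
        draws.append(dict(state))    # store a snapshot of the iteration
    return draws

draws = run_mcmc(3)
```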
The details of each sampling step are described in Appendix A. The MCMC
sampling loops through Steps 1–7 and repeats until convergence.
The initial values (i.e., at the 0-th iteration) of the parameters chosen in the
simulations and the application are: θ_{i,j}^{(0)}'s drawn from N(0, 1); β_{i,0}^{(0)}'s
and β_{i,1}^{(0)}'s set to zero; η_{i,j,s}^{(0)}'s set to zero; φ^{(0)} = 200; and τ_i^{(0)}'s
set to 6 for i = 1, . . . , n. We will discuss specifically how to choose the initial values
of Λ_i^{(0)} = (L_i^{(0)}, π_i^{(0)}, b_i^{(0)}, γ_i^{(0)}) for i = 1, . . . , n in the different
examples. Convergence is evaluated informally by looking at trace plots and other
diagnostic procedures discussed in Chen et al. (2012).
Then, statistical inferences are made straightforwardly from the MCMC samples.
For example, an estimate and 95% credible interval (CI) for the latent trajectory
of one's ability θ_i = (θ_{i,1}, . . . , θ_{i,T_i})′ can be obtained from the median, 2.5%, and
97.5% empirical quantiles of the corresponding MCMC realizations of each θ_{i,j}, for
j = 1, . . . , Ti. In the examples, these are graphed as functions of time t_{i,j}, so that the
dynamic changes of an examinee are apparent.
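This quantile summary can be sketched as follows; the nearest-rank quantile rule and the simulated draws are illustrative assumptions standing in for the actual MCMC output.

```python
import random

def ci_summary(samples, lo=0.025, hi=0.975):
    """Posterior median and equal-tailed credible interval from MCMC draws,
    using simple nearest-rank empirical quantiles."""
    s = sorted(samples)
    n = len(s)
    q = lambda p: s[min(n - 1, int(p * n))]
    return q(0.5), (q(lo), q(hi))

rng = random.Random(42)
draws = [rng.gauss(1.5, 0.2) for _ in range(4000)]  # stand-in for theta_{i,j} draws
med, (l, u) = ci_summary(draws)
assert l < med < u
```

Applying this to the retained draws of each θi,j gives the point estimate and the 95% credible band plotted over ti,j.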
4.3 Simulation Examples
To validate the inference procedure of MCMC schemes and show the success of
using monotone shape constraints in the nonparametric modeling, two simulation
studies are conducted. In the first simulated example, we know the true underlying
curve f 0i (·)’s that generate the latent trajectory of one’s ability, while in the second
simulated example, we have no information about the true underlying curve f 0i (·)’s.
4.3.1 An Example Using the Mixture of TSP Functions as the Latent
Trajectory
In this section, we apply our proposed method to a simulated dataset that uses
a mixture of TSP functions as the true latent trajectory. We consider 10 test
takers, i.e., n = 10, each of whom is examined on 60 different test days, that is,
Ti = 60, i = 1, . . . , 10. On each distinct test day, there are 4 examinations
for each individual, so Si,j = 4 for i = 1, . . . , 10, j = 1, . . . , Ti, and there are 10
questions (or items) in each test, i.e., Ki,j,s = 10 for i = 1, . . . , 10, j = 1, . . . , Ti and
Table 4.1: Simulated parameters for each individual of the TSP functions.

       L        π            b             γ
Λ1     2   [0.5, 0.5]   [0.5, 0.1]    [3, 5]
Λ2     2   [0.2, 0.8]   [0.4, 0.7]    [2, 8]
Λ3     1       1           0.15         10
Λ4     2   [0.3, 0.7]   [0.3, 0.5]    [5, 13]
Λ5     1       1           0.3           4
Λ6     1       1           0.5          17
Λ7     2   [0.1, 0.9]   [0.2, 0.3]    [3, 15]
Λ8     1       1           0.3           5
Λ9     2   [0.6, 0.4]   [0.1, 0.5]    [8, 5]
Λ10    2   [0.7, 0.3]   [0.2, 0.45]   [4, 10]
s = 1, . . . , Si,j. For each person i, we assume the time lapse between two consecutive
test days is a function of j, set to ∆i,j = j + 10 for j = 1, . . . , Ti/2 and
∆i,j = j − 10 for j = Ti/2 + 1, . . . , Ti, with i = 1, . . . , 10.
We set the true values of the model parameters as below:
(a) φ^{-1/2} = 0.05; thus, the corresponding standard deviation of the random
component ωi,j in Equation (4.2) is 0.05√∆i,j.
(b) τ^{-1/2} = (0.361, 0.286, 0.322, 0.362, 0.359, 0.302, 0.347, 0.325, 0.360, 0.378)′, where
each element represents the standard deviation of the random effects ηi,j,s for the i-th
person, respectively.
(c) σ² = 0.7333², which is chosen based on the test design of the EdSphere data.
The parameters specified for the mixture components of the TSP functions are given
in Table 4.1.
We use the priors specified for the unknown parameters of the proposed model
in Section 4.2.1. In particular, for this simulated example, we use a zero-truncated
Poisson prior for each Li with λ = 2, which implies we expect no more than 3
jumps, since each of the TSP mixtures in the simulation consists of at most
2 components. Without available scientific information, we employ a symmetric
Dirichlet distribution with parameter ρ = 1 as the prior for the πi,ℓ's, assign a
uniform(0, 1) prior for each bi,ℓ and specify a uniform(1, 50) prior for each γi,ℓ.
Model fitting is done with the MCMC algorithm described in Section 4.2.2,
based on 50,000 iterations in total. The first 10,000 samples are discarded as burn-in,
and only every 20th value is retained for our inference so as to reduce dependence.
After this burn-in period, the MCMC samples mix well and the trace plots of the
MCMC samples of the parameters indicate convergence.
In Figure 4.2, we display the results for 4 representative individuals among the 10
simulated individuals in the example. These 4 individuals have noticeable differences
in the shapes of their corresponding trajectories. The growth curve of the first subject
increases steadily without an obvious transition, while the second individual
in Figure 4.2 has obvious turning points: he/she grows slowly during the
initial period, then grows rapidly in the middle and reaches a plateau in the end.
The third and 6th subjects in Figure 4.2 show phenomena similar to the
second individual but with different lengths of the three stages in the study.
In each subfigure of Figure 4.2, the blue dotted line represents the true underlying function fi(·), the black dots are the simulated values of θi,j for j = 1, · · · , Ti, and the red dotted line is the posterior median estimate of fi(·), with the red dashed lines representing the corresponding 2.5% and 97.5% credible bands. We can see that the fitted trajectory captures the trend well even though the number of distinct test dates is comparatively small (60). Simulations with larger sample sizes have been tested and the results were generally better; those results are not reported here to save space.
[Four panels, (a) 1st, (b) 2nd, (c) 3rd and (d) 6th individual, each plotting θi against time T and showing the fitted trajectory, the generated θ values, the true underlying function, and the 95% credible band.]
Figure 4.2: The estimation of latent trajectories for the 1st, 2nd, 3rd and 6th subjects
We also calculate the posterior estimates of all other unknowns. The posterior median of φ−1/2 is 0.0561, with a 95% CI of [0.0491, 0.0634], which includes the true value of φ−1/2 (i.e., 0.05). The posterior medians and 95% CIs of the other parameters are displayed in Figure 4.3 and Figure 4.4. We can see that their respective true values, marked by red stars, are all included in the 95% CIs.
[Posterior medians, true values, and 95% credible intervals of τ1^(−1/2), . . . , τ10^(−1/2).]
Figure 4.3: Posterior median and 95% CIs of τi's in TSP mixture based simulation.
[Posterior medians, true values, and 95% credible intervals of βi,0's and βi,1's, i = 1, . . . , 10.]
Figure 4.4: The simulation results of posterior median and 95% CIs of βi,0's and βi,1's in TSP mixture based latent trajectories.
To account for randomness in the simulations, another 100 independent datasets are simulated to check frequentist coverage probabilities, with the same parameter setup but different random seeds. For each dataset, we again run the MCMC sampling for 50,000 iterations, discard the first 10,000 samples as burn-in, and thin the chains by keeping every 20-th sample. The results are shown in Table 4.2. We can see that the coverage probabilities of the 95% CIs for all model parameters are equal or very close to the nominal 95% level. Thus, while the inferential method is Bayesian, it appears to yield sets with good frequentist coverage.
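The coverage check described above can be sketched as follows; this toy version replaces the actual MCMC output with draws from a known, well-calibrated normal posterior, so the empirical coverage should land near the nominal 95% (all names are illustrative):

```python
import random

random.seed(1)

def covers(draws, truth, level=0.95):
    """Does the equal-tailed credible interval of `draws` contain `truth`?"""
    s = sorted(draws)
    lo = s[int(len(s) * (1 - level) / 2)]
    hi = s[int(len(s) * (1 + level) / 2) - 1]
    return lo <= truth <= hi

truth, hits = 0.0, 0
for _ in range(100):                                   # 100 replications
    center = random.gauss(truth, 1.0)                  # sampling variability
    draws = [random.gauss(center, 1.0) for _ in range(2000)]
    hits += covers(draws, truth)

print(hits / 100)   # empirical coverage of the 95% intervals
```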
Table 4.2: Coverage probabilities of φ, θi's, τi's, βi,0's and βi,1's by 100 independent simulations.

Parameter  Coverage    Parameter  Coverage
τ1     0.940    τ2     0.940
τ3     0.960    τ4     0.920
τ5     0.940    τ6     0.900
τ7     0.900    τ8     0.920
τ9     0.880    τ10    0.940
θ1     0.944    θ2     0.932
θ3     0.950    θ4     0.975
θ5     0.969    θ6     0.965
θ7     0.971    θ8     0.931
θ9     0.945    θ10    0.963
β1,0   0.960    β2,0   0.920
β3,0   0.960    β4,0   1.000
β5,0   0.980    β6,0   0.980
β7,0   1.000    β8,0   0.940
β9,0   0.960    β10,0  0.960
β1,1   0.960    β2,1   0.960
β3,1   0.960    β4,1   1.000
β5,1   0.980    β6,1   0.960
β7,1   0.960    β8,1   1.000
β9,1   0.980    β10,1  1.000
φ      0.940
4.3.2 An Example of the Logistic Curve as the Latent Trajectory
In this section, we apply our proposed method to true latent trajectories that are not TSP-based. The setup is the same as in the previous simulation except that the true trajectories fi(·), i = 1, . . . , 10, become the logistic curves below:
f1(t) = −0.5 + 2/(exp(−10t+ 5) + 1); f2(t) = −0.5 + 2/(exp(−5t+ 2) + 1);
f3(t) = 0.5 + 1/(exp(−10t+ 2.5) + 1); f4(t) = 0.5 + 2/(exp(−12t+ 3) + 1);
f5(t) = 2.5 + 1/(exp(−5t+ 3) + 1); f6(t) = −0.5 + 2/(exp(−10t+ 2) + 1);
f7(t) = 1/(exp(−10t+ 2) + 1); f8(t) = 0.5 + 2/(exp(−4t+ 1) + 1);
f9(t) = −1.5 + 2/(exp(−5t+ 1) + 1); f10(t) = 1/(exp(−2t+ 1) + 1).
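All ten curves above have the common form c + a/(exp(−bt + d) + 1) with a, b > 0, and are therefore strictly increasing in t; a quick sketch evaluating the first one:

```python
import math

def logistic_curve(t, c, a, b, d):
    """Shifted and scaled logistic curve c + a / (exp(-b*t + d) + 1)."""
    return c + a / (math.exp(-b * t + d) + 1)

# f1(t) = -0.5 + 2/(exp(-10t + 5) + 1) from the list above
f1 = lambda t: logistic_curve(t, c=-0.5, a=2.0, b=10.0, d=5.0)

grid = [i / 100 for i in range(101)]
vals = [f1(t) for t in grid]

print(round(f1(0.5), 6))                           # midpoint value: 0.5
print(all(x < y for x, y in zip(vals, vals[1:])))  # strictly increasing: True
```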
The motivation for using logistic curves as the true latent trajectories is that logistic curves have been widely applied in growth curve analysis. Model fitting is again based on 200,000 iterations in total; the first 40,000 samples are discarded as burn-in and every 20-th value of the remaining MCMC samples is retained to reduce dependence. We use a zero-truncated Poisson prior for each Li with λ = 3, since a few more TSP components are expected to be needed than in the previous example to fit the logistic curves. As before, we use a symmetric Dirichlet distribution with parameter ρ = 1 for the πi,ℓ's, assign a U(0, 1) prior for each bi,ℓ, and specify a U(1, 50) prior for each γi,ℓ. The priors for the remaining unknowns are specified in the same fashion as in Section 4.3.1.
Figure 4.5 displays the results for 4 selected individuals, where the blue dots represent the true underlying growth curve, the black dots are the simulated values of θi,j for j = 1, · · · , Ti, the red dots correspond to the posterior median estimates of one's ability, and the red dashed lines indicate the 95% credible band of the estimates. Figures 4.5(b) and 4.5(c) show that the true growth curves of the 4-th and 6-th examinees both increase steadily at the beginning and reach a plateau after half of the study period; our estimated growth curves (red dots) capture the trend of the truth (blue dots) very well, and all true values (blue dots) are within the 95% CIs of our estimates (red dashed lines). For the 2-nd and 8-th subjects, seen in Figures 4.5(a) and 4.5(d), the growth curves are strictly increasing over the study period, and our proposed method likewise captures the underlying trend well in these situations.
[Four panels, (a) 2nd, (b) 4th, (c) 6th and (d) 8th individual, each plotting θi against time T and showing the fitted trajectory, the generated θ values, the true underlying function, and the 95% credible band.]
Figure 4.5: The estimation of latent trajectories for the 2nd, 4th, 6th and 8th subjects
In addition, we calculate the posterior estimates of all other unknowns. The posterior median of φ−1/2 is 0.0539, with a 95% CI of [0.0471, 0.0610], which includes its true value 0.05. The posterior medians and 95% CIs of the τi's, βi,0's and βi,1's are displayed in Figure 4.6 and Figure 4.7, in which we can see that the true simulated values of the τi's are all inside their corresponding 95% CIs. Note that in the logistic curve simulations, the true values of the βi,0's and βi,1's are unknown, so we are unable to compare the truth with the corresponding 95% CIs for the βi,0's and βi,1's.
[Posterior medians, true values, and 95% credible intervals of τ1^(−1/2), . . . , τ10^(−1/2).]
Figure 4.6: Posterior median and 95% CIs of τi's in logistic curve simulation.
[Posterior medians and 95% credible intervals of βi,0's and βi,1's, i = 1, . . . , 10.]
Figure 4.7: Posterior median and 95% CIs of βi,0's and βi,1's in logistic curve simulation.
4.4 Application to the EdSphere Data
Since our approach has successfully estimated the trend of fi(·) from simulated data and recovered the true parameter values well, we now apply our two-level hierarchical model to the EdSphere data. Due to the limitation of our computer's RAM (8 GB), a sample of 10 individuals from the EdSphere database was randomly chosen for illustration purposes. The characteristics of these individuals are described in Table A.1. There are two goals for the analysis of the EdSphere datasets. One is to assess the appropriateness of the local independence assumption for this type of data; the other is to understand the growth in students' ability by retrospectively producing the estimated growth trajectories of their latent abilities, incorporating the monotonicity assumption.
The prior specification is the same as in the aforementioned simulation examples, except that we use a zero-truncated Poisson prior for each Li with λ = 1, which corresponds to a prior belief from MetaMetrics researchers that about 2 different TSP functions suffice to explain the changes in a response trajectory. Such a prior assumption can be washed out if the data carry strong information indicating that more mixture components of TSP functions are needed to explain one's ability growth. In addition, to examine the sensitivity of the prior specification for Li, we tried other λ values; the resulting estimates of the latent trajectories are almost the same as those shown in Figure 4.8.
We run 500,000 iterations in total, discard the first 1/5 of the samples as burn-in, and, to reduce dependence of the MCMC samples, keep only every 20-th value. We checked the trace plots of the model parameters to assess convergence of the MCMC samples and observed good mixing after the burn-in and thinning process. Figure 4.8 shows the estimated trajectories of 4 individuals, which represent four types of growth curves we typically observe in the data.
In Figure 4.8, the red dots indicate the estimated posterior median of one's latent ability from our proposed method and the red dashed lines denote the corresponding 95% credible band of one's latent trajectory; the blue dots represent the estimated trajectory from MetaMetrics used in the current EdSphere implementation (which assumes an AR(1) model for θi,j and an observation equation like Model (4.1) but without the random effect ηi,j,s); and the black dots correspond to estimates of one's ability obtained by equating the expected score at a person's ability to the observed score (these can roughly be thought of as the raw test scores put on the same scale as the θi,t's). In Figure 4.8, we can see that our estimated latent trajectory of ability growth (i.e., the red dots for the posterior median of the θi,t's) displays a much smoother monotone increasing trend than MetaMetrics' estimate, which oscillates up and down continuously.
Noticeably, in Figure 4.8, the latent growth curve of the second individual (Figure 4.8(a)), a sixth-grader, increases sharply at about the 300-th day and reaches a plateau before the 450-th day, while the 6-th individual (Figure 4.8(c)) experiences moderate growth during most of the period before stabilizing by the end of the study. The third and 7-th students, in Figures 4.8(b) and 4.8(d) respectively, are both in Grade 2 and share a similar steadily increasing growth shape over the entire study period. However, as seen in Figure 4.8, the growth rate (or learning speed) of the 7-th individual is much faster than that of the third individual, since the study period of the 7-th individual is much shorter than that of the 3rd individual. We can also compare the posterior medians and corresponding 95% CIs of β7,1 and β3,1 (which represent the magnitude of one's ability growth during the study period) in Figure 4.10 to validate the difference between the growth rates of these two individuals. The results in Figure 4.8 inform us that the timing at which students reach the 'learning ceiling' (i.e., plateau), as well as their 'learning speed', differs across individuals. Further investigation or clustering of students based on the shape of the growth curves might help teachers tailor educational practice or assignments to each individual student.
Moreover, we are able to summarize the results for the other parameters in the model. The posterior median of φ−1/2 is 0.0566, and its 95% CI is [0.0301, 0.0797]. The posterior estimates and 95% CIs for τ and β are summarized in Figure 4.9 and Figure 4.10, respectively. In Figure 4.9, all τi values and their corresponding CIs are far away from zero except τ2^(−1/2) and τ3^(−1/2), which suggests that local dependence indeed exists in the EdSphere datasets. In Figure 4.10, all βi,1 values as well as their corresponding 95% CIs are above 0, which shows the data supports the
[Four panels, (a) 2nd, (b) 3rd, (c) 6th and (d) 7th individual, each plotting the trajectory of latent ability θi against time T and showing the posterior median, the MetaMetrics estimate, the raw scores, and the 95% credible band.]
Figure 4.8: The estimation of latent trajectories for the 2nd, 3rd, 6th and 7th subjects
belief that the growth curve of one's reading ability is always increasing. Note that our method does not require any additional restrictions on the model parameters; thus, the parameter values are fully determined by the data. We can also see that the values of the βi,0's and βi,1's vary considerably across individuals.
4.5 Conclusion
In this chapter, we have proposed a Bayesian nonparametric two-level hierarchical model for the analysis of one's latent ability growth in educational testing under longitudinal scenarios. The advantage of our method is its ability to incorporate the
[Posterior medians and 95% credible intervals of τ1^(−1/2), . . . , τ10^(−1/2) for the EdSphere data.]
Figure 4.9: Posterior median and 95% CIs of τi's.
[Posterior medians and 95% credible intervals of βi,0's and βi,1's, i = 1, . . . , 10, for the EdSphere data.]
Figure 4.10: Posterior median and 95% CIs of βi,0's and βi,1's.
monotonic shape constraints into the estimation of latent trajectories. Due to the flexibility of our nonparametric method, we are able to fit any monotonic continuous curve without further restrictions on the parameter estimates. The fact that none of the 95% CIs for the slopes βi,1 includes zero therefore indicates that monotonicity is supported by the EdSphere datasets.
The latent trajectories of ability growth estimated from our approach can help educators and practitioners better understand the behaviors of students in the study, such as their growth patterns (continuous increase, sharp increase, etc.) and the timing at which a student's learning ability reaches its ceiling (i.e., the timing of reaching the plateau). Further studies on clustering these behavioral patterns of individuals would guide the design of educational practice or teaching tailored to individuals. In addition, since the evidence of local dependence is generally strong in the analysis of the EdSphere datasets, we conclude that the use of random effects to model the local dependence appears both necessary and successful.
As the MCMC computation is time-consuming and resource-demanding in our current approach, our next goal is to improve computational efficiency by developing big-data schemes, such as ideas similar to Wei et al. (2017), to make parallel computing possible so that our approach can conveniently be applied to the entire datasets. With improved computational efficiency, we may be able to develop a computable evaluation criterion to assess the fit of the proposed nonparametric model using predictive score ideas (Gneiting and Raftery, 2007). Another potential extension is to explore covariates, including grade and others, and their relationships to the growth of one's ability trajectory. This direction would facilitate grouping individuals by similar characteristics and encourage the development of personalized education.
Chapter 5
Preliminary Results of Model Selection by DIC
5.1 Model Selection via DIC under Hierarchical Model
We have discussed the Bayes factor approach for model selection in Chapters 2
and 3 under the IRT models, which are essentially hierarchical models. In future
research, we want to investigate the model selection performance of the deviance
information criterion (DIC) (Spiegelhalter et al., 2002) and its variants for hierar-
chical models, and potentially develop a reliable model selection criterion for the
IRT models.
The DIC is initially defined by Spiegelhalter et al. (2002) as

DIC = −4 E_{θ|y}{log f(y|θ)} + 2 log f(y|θ̄),    (5.1)

where θ̄ = E(θ|y) is generally taken as the posterior mean. The quantity

PD = −2 E_{θ|y}{log f(y|θ)} + 2 log f(y|θ̄)    (5.2)

is the effective number of model parameters with the "focus" on Θ, θ ∈ Θ.
As briefly discussed at the end of Chapter 3, when one applies the DIC in a hierarchical model setting, it needs to be clearly defined which parameters are "in focus". In addition, different versions of the DIC are proposed by Celeux et al. (2006) for settings with missing data. In particular, denoting the missing data by Z, complete-data DICs are proposed in Celeux et al. (2006), and the commonly used DIC4 is defined as

DIC4 = −4 E_{θ,Z|y}{log f(y, Z|θ)} + 2 E_{Z|y}{log f(y, Z | E_θ(θ|y, Z))},    (5.3)

which requires computing the posterior mean of θ for each value of Z drawn from the posterior. When the analytical form of E_θ(θ|y, Z) is available, this computation is not difficult.
For item response theory models, the first layer usually links a few latent variables to the observed responses via a logistic or probit function. If one wishes to use the DIC to perform model selection under the IRT model framework, e.g., for the dynamic IRT model proposed in Chapter 3, it is not easy to decide which parameters should be "in focus". If only the β coefficients and other hyperparameters, such as ψ, φ, etc., are considered the focus, complex integrations are needed to obtain the data likelihood with respect to these focused parameters. If instead all parameters are considered the focus, the log data likelihood involves only the first layer. Other choices of parameter focus may be proposed as well, which could lead to different results.
Besides the choice of the parameters in focus, if augmented data are introduced to facilitate the sampling process (Albert and Chib, 1993), one may also use a complete-data DIC criterion such as DIC4. It is also possible to treat some of the intermediate variables as "missing data" in the definition of the complete-data DIC4 if they are not parameters "in focus", which may lead to interesting results for the PD, i.e., the "effective" number of parameters. This motivates us to investigate the performance of the original DIC (called DIC1 by Celeux et al. (2006)) and DIC4 under the hierarchical model framework.
5.1.1 The Model
We start with the following simple hierarchical model:

P(yij = 1) = Φ(Xijβ + θi),  i = 1, . . . , n, j = 1, . . . , m,    (5.4)
θi ∼ N(0, σ1²) for i = 1, . . . , [n/2],
θi ∼ N(0, σ2²) for i = [n/2] + 1, . . . , n,

where Xij = (Xij1, . . . , Xijk), β = (β1, . . . , βk)′, and [n/2] is the largest integer less than or equal to n/2.
Denote τ1 = 1/σ1² and τ2 = 1/σ2². We want to select between the model with equal variances, i.e., σ1² = σ2², and the model with unequal variances, i.e., σ1² ≠ σ2². Following Albert and Chib (1993), we introduce the augmented data Zij ∼ N(Xijβ + θi, 1), with Zij > 0 if Yij = 1 and Zij ≤ 0 if Yij = 0. We assign noninformative priors for τ and β as follows: π(τi) ∝ τi^(−1), i = 1, 2, and
π(β) ∝ 1.
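To make the augmentation scheme concrete, here is a minimal sketch of the resulting Gibbs sampler. For brevity it uses an intercept-only design with β held at its true value, so only the Z, θ and τ updates of the full sampler are shown; all names are illustrative:

```python
import math, random
from statistics import NormalDist

random.seed(42)
nd = NormalDist()

# --- Simulate data from a stripped-down version of model (5.4) ---
n, m, beta = 40, 10, 1.0
sig2 = [1.0] * (n // 2) + [4.0] * (n // 2)           # unequal variances
theta_true = [random.gauss(0, math.sqrt(s)) for s in sig2]
Y = [[int(random.random() < nd.cdf(beta + th)) for _ in range(m)]
     for th in theta_true]

def rtruncnorm(mu, positive):
    """Inverse-CDF draw from N(mu, 1) truncated to (0, inf) or (-inf, 0]."""
    p0 = nd.cdf(-mu)                                  # P(draw <= 0)
    u = random.random()
    q = p0 + u * (1 - p0) if positive else u * p0
    return mu + nd.inv_cdf(min(max(q, 1e-12), 1 - 1e-12))

# --- Gibbs iterations (beta fixed at its true value for brevity) ---
theta, tau = [0.0] * n, [1.0, 1.0]
for _ in range(200):
    # Z_ij | rest ~ N(beta + theta_i, 1) truncated by the sign of Y_ij
    Z = [[rtruncnorm(beta + theta[i], Y[i][j] == 1) for j in range(m)]
         for i in range(n)]
    # theta_i | rest ~ N(sum_j(Z_ij - beta) / (m + tau_q), 1 / (m + tau_q))
    for i in range(n):
        t = tau[0] if i < n // 2 else tau[1]
        mu = sum(Z[i][j] - beta for j in range(m)) / (m + t)
        theta[i] = random.gauss(mu, 1 / math.sqrt(m + t))
    # tau_q | rest ~ Gamma(shape n/4, scale 2 / sum theta_i^2)
    s1 = sum(th * th for th in theta[: n // 2])
    s2 = sum(th * th for th in theta[n // 2:])
    tau[0] = random.gammavariate(n / 4, 2 / s1)
    tau[1] = random.gammavariate(n / 4, 2 / s2)

print(1 / tau[0], 1 / tau[1])   # final draws of sigma_1^2 and sigma_2^2
```

Note that `random.gammavariate` is parameterized by shape and scale, so the scale 2/Σθi² matches the rate Σθi²/2 of the full conditional derived below.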
Thus the joint likelihood function can be written as

P(Y, Z, β, θ, X, τ1, τ2)
= P(Y|Z) P(Z|β, θ, X) P(θ|τ1, τ2) π(β) π(τ1) π(τ2)
= (2π)^(−(mn+n)/2) ∏_{i=1}^{n} ∏_{j=1}^{m} [I(Yij = 1)I(Zij > 0) + I(Yij = 0)I(Zij ≤ 0)] exp(−(Zij − Xijβ − θi)²/2)
  × τ1^(n/4−1) τ2^(n/4−1) exp(−[τ1 Σ_{i=1}^{n/2} θi² + τ2 Σ_{i=n/2+1}^{n} θi²]/2),

where Y represents the vector of all Yij's and Z represents the vector of all Zij's.
5.1.2 Choose θ as the Parameters in Focus
Mimicking the definition of DIC4 by Celeux et al. (2006), if we treat only θ as the parameters in focus and fix all others, we denote

DIC(Y, Z, β, τ)
= 2PD(Y, Z, β, τ) + D(Y, Z, β, τ | θ̄)
= −4 E_θ{log f(Y, Z, β, τ | θ) | Y, Z, β, τ} + 2 log f{Y, Z, β, τ | E_θ(θ | Y, Z, β, τ)},    (5.5)
where

θ̄ = E_θ(θ | Y, Z, β, τ),
D(Y, Z, β, τ | θ̄) = −2 log f{Y, Z, β, τ | E_θ(θ | Y, Z, β, τ)},
PD(Y, Z, β, τ) = −2 E_θ{log f(Y, Z, β, τ | θ) | Y, Z, β, τ} + 2 log f{Y, Z, β, τ | E_θ(θ | Y, Z, β, τ)},

and, hence, we define the complete-data DIC4 as

DIC4^θ(Y) = E_{Z,β,τ}(DIC(Y, Z, β, τ) | Y) = E_{Z,β,τ}{2PD(Y, Z, β, τ) + D(Y, Z, β, τ | θ̄)}.
In the first term of Equation (5.5),

log f(Y, Z, β, τ | θ) = −((mn + n)/2) log(2π) − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{m} (Zij − Xijβ − θi)².

In the second term, with

E_θ(θi | Y, Z, β, τ) = Σ_{j=1}^{m} (Zij − Xijβ)/(m + τ1), i = 1, . . . , n/2;
E_θ(θi | Y, Z, β, τ) = Σ_{j=1}^{m} (Zij − Xijβ)/(m + τ2), i = n/2 + 1, . . . , n,

we have

−2 log f{Y, Z, β, τ | E_θ(θ | Y, Z, β, τ)}
= (mn + n) log(2π) + Σ_{i=1}^{n} Σ_{j=1}^{m} (Zij − Xijβ)²
  − Σ_{i=1}^{n/2} (m + 2τ1)[Σ_{j=1}^{m}(Zij − Xijβ)]²/(m + τ1)²
  − Σ_{i=n/2+1}^{n} (m + 2τ2)[Σ_{j=1}^{m}(Zij − Xijβ)]²/(m + τ2)².
The full conditional distribution of θi given Y, Z, β, τ is

θi | Y, Z, β, τ ∼ N( Σ_{j=1}^{m} (Zij − Xijβ)/(m + τq), 1/(m + τq) ),

where q = 1 if i ≤ n/2 and q = 2 otherwise.
To compute the mean deviance D̄ = −2 E_θ{log f(Y, Z, β, τ | θ) | Y, Z, β, τ}, we have

−2 E_θ{log f(Y, Z, β, τ | θ) | Y, Z, β, τ}
= E_{θ|Y,Z,β,τ}{ Σ_{i=1}^{n} [ Σ_{j=1}^{m} (Zij − Xijβ)² − 2 Σ_{j=1}^{m} (Zij − Xijβ)θi + mθi² ] } + (mn + n) log(2π)
= Σ_{i=1}^{n} Σ_{j=1}^{m} (Zij − Xijβ)²
  + Σ_{i=1}^{n/2} [ m/(m + τ1) − (m + 2τ1)[Σ_{j=1}^{m}(Zij − Xijβ)]²/(m + τ1)² ]
  + Σ_{i=n/2+1}^{n} [ m/(m + τ2) − (m + 2τ2)[Σ_{j=1}^{m}(Zij − Xijβ)]²/(m + τ2)² ] + (mn + n) log(2π)
= −2 log f{Y, Z, β, τ | E_θ(θ | Y, Z, β, τ)} + mn/(2(m + τ1)) + mn/(2(m + τ2)).

Thus PD(Y, Z, β, τ) = mn/(2(m + τ1)) + mn/(2(m + τ2)), and the PD in the complete-data DIC4 is obtained as E_{Z,β,τ}{PD(Y, Z, β, τ)}.
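This closed form can be checked numerically: each θi has full-conditional variance 1/(m + τq), so the gap between the posterior-mean deviance and the deviance at the posterior mean is Σi m/(m + τq) = mn/(2(m + τ1)) + mn/(2(m + τ2)). A Monte Carlo sketch with arbitrary residuals rij standing in for Zij − Xijβ (the additive log(2π) constant cancels in the difference and is omitted; this is a numerical check of the algebra, not part of the simulation code):

```python
import math, random

random.seed(0)
n, m = 20, 8
tau = (1.0, 4.0)

# Arbitrary residuals r[i][j] standing in for Z_ij - X_ij * beta.
r = [[random.gauss(0, 1) for _ in range(m)] for _ in range(n)]

def tq(i):
    return tau[0] if i < n // 2 else tau[1]

def deviance(theta):                 # up to the additive log(2pi) constant
    return sum((r[i][j] - theta[i]) ** 2 for i in range(n) for j in range(m))

post_mean = [sum(r[i]) / (m + tq(i)) for i in range(n)]
post_sd = [1 / math.sqrt(m + tq(i)) for i in range(n)]

# Monte Carlo estimate of E[D(theta)] under the full conditional of theta.
draws = 20_000
mean_dev = sum(
    deviance([random.gauss(post_mean[i], post_sd[i]) for i in range(n)])
    for _ in range(draws)
) / draws

p_d_mc = mean_dev - deviance(post_mean)
p_d_closed = m * n / (2 * (m + tau[0])) + m * n / (2 * (m + tau[1]))
print(p_d_mc, p_d_closed)   # agree up to Monte Carlo error
```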
5.1.3 Choose τ as the Parameters in Focus
Similarly, if we focus on τ, we can define DIC4 as follows. First we define

DIC(Y, Z, β, θ)
= 2PD(Y, Z, β, θ) + D(Y, Z, β, θ | τ̄)
= −4 E_τ{log f(Y, Z, β, θ | τ) | Y, Z, β, θ} + 2 log f{Y, Z, β, θ | E_τ(τ | Y, Z, β, θ)},    (5.6)

where

τ̄ = E(τ | Y, Z, β, θ),
D(Y, Z, β, θ | τ̄) = −2 log f{Y, Z, β, θ | E_τ(τ | Y, Z, β, θ)},
PD(Y, Z, β, θ) = −2 E_τ{log f(Y, Z, β, θ | τ) | Y, Z, β, θ} + 2 log f{Y, Z, β, θ | E_τ(τ | Y, Z, β, θ)},

and the complete-data DIC is defined as

DIC4^τ(Y) = E_{Z,β,θ}(DIC(Y, Z, β, θ) | Y) = E_{Z,β,θ}[2PD(Y, Z, β, θ) + D(Y, Z, β, θ | τ̄)].
In the first term of Equation (5.6),

log f(Y, Z, β, θ | τ)
= −((mn + n)/2) log(2π) − Σ_{i=1}^{n} Σ_{j=1}^{m} (Zij − Xijβ − θi)²/2 + (n/4)[log(τ1) + log(τ2)]
  − [τ1 Σ_{i=1}^{n/2} θi² + τ2 Σ_{i=n/2+1}^{n} θi²]/2;

note that this conditional likelihood does not contain the prior for τ. In the second term,

E_τ(τ | Y, Z, β, θ) = ( (n/2)/Σ_{i=1}^{n/2} θi² , (n/2)/Σ_{i=n/2+1}^{n} θi² )′.
Regarding the computation of the mean deviance D̄ = −2 E_τ{log f(Y, Z, β, θ | τ) | Y, Z, β, θ}, notice that if a random variable γ ∼ Γ(α, 1), then

dΓ(α)/dα = ∫_0^∞ log(x) x^(α−1) exp(−x) dx = Γ(α) E{log(γ)},

thus E{log(γ)} = Γ′(α)/Γ(α). Since

τ1 | Y, Z, β, θ ∼ Γ(n/4, 2/Σ_{i=1}^{n/2} θi²),  τ2 | Y, Z, β, θ ∼ Γ(n/4, 2/Σ_{i=n/2+1}^{n} θi²),
we have

E_{τ|Y,Z,β,θ}{log(τ1)}
= E_{τ|Y,Z,β,θ}{log(τ1 Σ_{i=1}^{n/2} θi² / 2)} − log(Σ_{i=1}^{n/2} θi² / 2)
= Γ′(n/4)/Γ(n/4) − log(Σ_{i=1}^{n/2} θi² / 2)
= ψ(n/4) − log(Σ_{i=1}^{n/2} θi² / 2),

where ψ(x) = Γ′(x)/Γ(x) is the digamma function (a built-in function in Matlab). Similarly we can compute E_{τ|Y,Z,β,θ}{log(τ2)}.
Thus we obtain PD(Y, Z, β, θ) = n{log(n/2) − log(2) − ψ(n/4)} for the case when τ1 ≠ τ2, which is a constant that depends only on n. Hence the PD corresponding to the complete-data DIC4 is equal to PD(Y, Z, β, θ).
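Since log(x) − ψ(x) ≈ 1/(2x) for large x, this constant n{log(n/2) − log(2) − ψ(n/4)} is approximately n · (2/n) = 2, matching the PD values reported for DIC4^τ in Table 5.2. A quick numerical check, approximating ψ by a central difference of the log-gamma function because the Python standard library has `math.lgamma` but no digamma (an illustration, not part of the simulation code):

```python
import math

def digamma(x, h=1e-5):
    """Central-difference approximation of psi(x) = d/dx log Gamma(x)."""
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2 * h)

def p_d_tau_focus(n):
    """P_D of the tau-focused complete-data DIC4 for n subjects."""
    return n * (math.log(n / 2) - math.log(2) - digamma(n / 4))

for n in (40, 80, 100):
    print(n, round(p_d_tau_focus(n), 3))   # values slightly above 2
```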
5.1.4 Simulation Results
In the simulation study, we set k = 3 and let X1 = 1 be the intercept, X2 ~iid N(1, 4), and X3 ~iid N(0, 1). We set the true value of β to (1, −1, 1.5)′. We consider two setups for the variance parameters of θ: σ1² = σ2² = 1, and (σ1², σ2²) = (1, 4).
Several sample size combinations (n,m) are considered: (100, 8), (80, 10), (40, 20)
and (100, 20). For each simulation setup, 100 independent simulations are con-
ducted. 8000 MCMC samples are drawn via the Gibbs sampling algorithm and the
first 40% of the draws are discarded for burn-in.
We also consider DIC1, treating all parameters as the parameters in focus, for comparison. Model selection results and the PDs are summarized in Tables 5.1 and 5.2; the reported PDs are median values across the 100 independent simulations, rounded to the nearest tenth. From Table 5.1 we see that DIC4^θ performs much worse than the other two DICs when the true model has σ1² ≠ σ2², and DIC4^τ does slightly better than DIC1. However, when σ1² = σ2² is true, the selection results of DIC4^τ and DIC4^θ are close, but both are worse than DIC1. The performance of DIC1 seems worse when (n, m) changes from (100, 8) to (100, 20) in this case, which needs further investigation.
In terms of PD, the effective number of parameters, note that the PDs of DIC4^τ are identical (after rounding to the nearest tenth) to the number of parameters in focus: when σ1² ≠ σ2², the PD is about 2, and when σ1² = σ2², the PD is equal to 1. This seems like an intuitively desirable result, at least under a non-hierarchical model setup; whether it remains desirable under the hierarchical framework is unclear.
Since the DIC can be decomposed as the sum of the deviance and the PD, it would be interesting to investigate whether combining the deviance of DIC1 with the PD of DIC4 under the hierarchical model framework leads to a measure that outperforms DIC1.
Table 5.1: Model identification counts in 100 simulations per setting of DIC1, DIC4^θ, and DIC4^τ.

                 Data from (σ1², σ2²) = (1, 4)     Data from (σ1², σ2²) = (1, 1)
(n, m)           Select σ1²≠σ2²  Select σ1²=σ2²    Select σ1²=σ2²  Select σ1²≠σ2²
DIC1
(100, 8)               95              5                 77              23
(80, 10)               90             10                 84              16
(40, 20)               77             23                 71              29
(100, 20)              94              6                 67              33
DIC4^θ
(100, 8)               80             20                 38              62
(80, 10)               75             25                 37              63
(40, 20)               62             38                 39              61
(100, 20)              56             44                 57              43
DIC4^τ
(100, 8)               90             10                 36              64
(80, 10)               98              2                 43              57
(40, 20)               84             16                 46              54
(100, 20)              93              7                 62              38
Table 5.2: PD results in 100 simulations per setting of DIC1, DIC4^θ, and DIC4^τ.

                 Data from (σ1², σ2²) = (1, 4)     Data from (σ1², σ2²) = (1, 1)
(n, m)           Fit σ1²≠σ2²    Fit σ1²=σ2²       Fit σ1²=σ2²    Fit σ1²≠σ2²
PD for DIC1
(100, 8)             79.0           82.1               72.3           72.1
(80, 10)             66.5           69.1               61.7           62.2
(40, 20)             38.0           38.7               36.5           36.7
(100, 20)            91.0           93.2               87.2           87.4
PD for DIC4^θ
(100, 8)             92.7           94.6               88.8           88.1
(80, 10)             75.1           76.7               72.8           72.2
(40, 20)             38.7           39.1               38.0           37.9
(100, 20)            96.9           97.8               95.1           95.0
PD for DIC4^τ
(100, 8)              2.0            1.0                1.0            2.0
(80, 10)              2.0            1.0                1.0            2.0
(40, 20)              2.0            1.0                1.0            2.0
(100, 20)             2.0            1.0                1.0            2.0
Appendix A
Posterior Propriety and MCMC Sampling Scheme
Under the mild assumptions stated in Theorem 1 of Wang et al. (2013), we can show that the objective priors introduced in Section 4.2.1 lead to a proper posterior. The proof of posterior propriety follows directly from the lemmas in Appendix C of Wang et al. (2013) by rewriting Model (4.1)–(4.2) as the probit model without data augmentation:

Pr(Xi,j,s,k = 1 | βi,0, βi,1, f⁰i(ti,j), ωi,j, di,j,s,k, ηi,j,s) = Φ(βi,0 + βi,1 f⁰i(ti,j) + ωi,j − di,j,s,k + ηi,j,s).
Further, we redefine Xi,j,s,k = −1 to indicate that a student answers a question incorrectly. Then, following the notation given in Equation (4.7), provided that Λ and the data X are known, the joint posterior distribution of the parameters β, η, τ, φ is:
f(β, η, τ, φ | Λ, X)
∝ ∏_{i=1}^{n} ∏_{j=1}^{Ti} ∏_{s=1}^{Si,j} ∏_{k=1}^{Ki,j,s} Φ[ Xi,j,s,k (βi,0 + βi,1 f⁰i(ti,j) + ωi,j − ai,j,s + ηi,j,s + υi,j,s,k) ]
  × ∏_{i=1}^{n} ∏_{j=1}^{Ti} ∏_{s=1}^{Si,j} ∏_{k=1}^{Ki,j,s} (1/(√(2π) σ)) exp(−υi,j,s,k²/(2σ²))
  × ∏_{i=1}^{n} ∏_{j=2}^{Ti} √(φ/(2π∆i,j)) exp(−φ ωi,j²/(2∆i,j))
  × ∏_{i=1}^{n} ∏_{j=1}^{Ti} (τi/(2π))^(Si,j/2) exp(−τi Σ_{s=1}^{Si,j} ηi,j,s² / 2)
  × [ ∏_{i=1}^{n} π(τi) π(βi,0) π(βi,1) ] π(φ),    (A.1)
where π(φ) ∝ φ^(−3/2), π(τi) ∝ τi^(−3/2), π(βi,0) ∝ 1 and π(βi,1) ∝ 1. Hence, using ideas similar to those used to derive Lemma 2, Lemma 3 and Lemma 4 of Appendix C in Wang et al. (2013), we can show that the joint posterior distribution above is proper. Since the prior we impose on Λ is proper and f(β, η, τ, Λ, φ|X) = f(β, η, τ, φ|Λ, X) ∏_{i=1}^{n} π(Λi), we conclude that the joint posterior distribution of the parameters β, η, τ, Λ, φ given the data X is proper. Therefore, it is legitimate to use the objective priors specified in Section 4.2.1 and base our statistical inference on the resulting posterior distribution.
Next, we give the detailed derivation of each step of the MCMC sampling scheme; we follow the sampling procedure mentioned earlier in Section 4.2.2 and run the MCMC from Step A.1 to Step A.7 until convergence.
A.1 Sampling of Z: Truncated Normal Distribution
Given θ, η, the augmented variables Zi,j,s,k are sampled from the truncated normal distributions below:

Zi,j,s,k ∼ N+(θi,j − ai,j,s + ηi,j,s, 1 + σ²), if Xi,j,s,k = 1;
Zi,j,s,k ∼ N−(θi,j − ai,j,s + ηi,j,s, 1 + σ²), if Xi,j,s,k = 0.
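These truncated-normal draws can be implemented with the inverse-CDF method; a minimal sketch (the function name and the numerical values are illustrative):

```python
import math, random
from statistics import NormalDist

random.seed(7)
nd = NormalDist()

def rtrunc_norm(mu, var, positive):
    """Draw from N(mu, var) restricted to (0, inf) if positive, else (-inf, 0]."""
    sd = math.sqrt(var)
    p0 = nd.cdf((0 - mu) / sd)                 # mass below the truncation point
    u = random.random()
    q = p0 + u * (1 - p0) if positive else u * p0
    q = min(max(q, 1e-12), 1 - 1e-12)          # guard the extreme tails
    return mu + sd * nd.inv_cdf(q)

# Step A.1 for one response: mean theta_ij - a_ijs + eta_ijs, variance 1 + sigma^2
mu, var = 0.8, 1 + 0.7333 ** 2
z_correct = [rtrunc_norm(mu, var, True) for _ in range(1000)]    # X_ijsk = 1
z_wrong = [rtrunc_norm(mu, var, False) for _ in range(1000)]     # X_ijsk = 0

print(min(z_correct) > 0, max(z_wrong) <= 0)
```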
A.2 Sampling of η: Normal Distribution
Given θ, τ, Z, the full conditional distribution of ηi,j,s is N(ζi,j,s, δi,j,s), where

ζi,j,s = Σ_{k=1}^{Ki,j,s} ψi,j,s,k (Zi,j,s,k − θi,j + ai,j,s) / (Σ_{l=1}^{Ki,j,s} ψi,j,s,l + τi),  and  δi,j,s = ( Σ_{k=1}^{Ki,j,s} ψi,j,s,k + τi )^(−1).
A.3 Sampling of τ : Gamma Distribution
Given η, the full conditional distribution of τi is

τi | η ∼ Ga( (Σ_{j=1}^{Ti} Si,j − 1)/2 , Σ_{j=1}^{Ti} Σ_{s=1}^{Si,j} ηi,j,s² / 2 ),

where Ga(·, ·) denotes the gamma distribution, (Σ_{j=1}^{Ti} Si,j − 1)/2 is the shape parameter, and Σ_{j=1}^{Ti} Σ_{s=1}^{Si,j} ηi,j,s² / 2 is the rate parameter.
A.4 Sampling of Λ: Metropolis-Hastings within RJ-MCMC Algorithm
Given θ^(q−1) (here we add the iteration index as a superscript to make the discussion clearer for Subsection A.4), it is easy to derive the full conditional distribution of Λ^(q) as follows:
f(Λ^(q) | θ^(q−1))
∝ |X′des D^(−1) Xdes|^(−1/2) |D|^(−1/2) Γ( (Σ_{i=1}^{n} Ti − 3n − 1)/2 )
  × [ (θ^(q−1) − Xdes β̂)′ D^(−1) (θ^(q−1) − Xdes β̂) / 2 ]^( −(Σ_{i=1}^{n} Ti − 3n − 1)/2 ) ∏_{i=1}^{n} π(Λi),    (A.2)

where Γ(·) is the gamma function, Xdes = diag(x1, · · · , xn) is a block diagonal matrix whose block xi has rows (1, f⁰i(ti,1)), . . . , (1, f⁰i(ti,Ti)), D = diag(D1, · · · , Dn) is also block diagonal with Di = diag(∆i,1, · · · , ∆i,Ti), π(Λi) = π(Li)π(ξi)π(πi) is the prior distribution of Λi, and β̂ = (X′des D^(−1) Xdes)^(−1) X′des D^(−1) θ^(q−1). In our study, we use π(Li) = λ^Li / (Li! (e^λ − 1)), the zero-truncated Poisson prior with parameter λ; π(ξi) = (1/(γU − γL))^Li, which corresponds to U(0, 1) priors for the bi,ℓ's and U(γL, γU) priors for the γi,ℓ's; and π(πi) = (Γ(Li ρ)/Γ(ρ)^Li) ∏_{ℓ=1}^{Li} πi,ℓ^(ρ−1), a symmetric Dirichlet distribution.
For each subject i = 1, . . . , n, we develop a RJ-MCMC to sample Λi from its full conditional distribution f(Λi | θ, Λ−i), where Λ−i = ∪_{j≠i} Λj. At the (q + 1)-th iteration, the RJ-MCMC procedure is divided into three parts:

1. UPDATE, when the dimension index Li does not change, i.e., Li^(q+1) = Li^(q);
2. ADD, when the dimension Li increases to Li + 1, i.e., Li^(q+1) = Li^(q) + 1;
3. REMOVE, when the dimension Li decreases to Li − 1, i.e., Li^(q+1) = Li^(q) − 1.
While q < M:
    Draw u from U(0, 1);
    if u < PU(Li^(q)): obtain Λi^(q+1) with the UPDATE step;
    else if PU(Li^(q)) ≤ u < PU(Li^(q)) + PA(Li^(q)): obtain Λi^(q+1) with the ADD step;
    else: obtain Λi^(q+1) with the REMOVE step.
Figure A.1: A pseudo code of the RJ-MCMC algorithm
We assign probabilities $P_A(L)$, $P_R(L)$, and $P_U(L)$ to decide whether the next step adds, removes, or updates when the current dimension is $L$. In our program, we use equal probabilities when $L \ge 2$, i.e., $P_A(L) = P_R(L) = P_U(L) = 1/3$. Since the dimension must remain positive, when $L = 1$ we set $P_R(1) = 0$ and $P_A(1) = P_U(1) = 1/2$.

The entire algorithm is shown in Figure A.1, where $M$ is the total number of iterations. The following subsections describe each step in detail.
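The move-selection logic above can be sketched as follows; `move_probs` and `choose_move` are hypothetical helper names, not part of the dissertation's code:

```python
import random

random.seed(42)

def move_probs(L):
    """P_U, P_A, P_R for current dimension L: equal thirds, but no REMOVE at L = 1."""
    if L == 1:
        return 0.5, 0.5, 0.0
    return 1.0 / 3, 1.0 / 3, 1.0 / 3

def choose_move(L):
    """Pick UPDATE / ADD / REMOVE by comparing a uniform draw to cumulative probabilities."""
    pU, pA, pR = move_probs(L)
    u = random.random()
    if u < pU:
        return "UPDATE"
    elif u < pU + pA:
        return "ADD"
    return "REMOVE"

moves = [choose_move(3) for _ in range(1000)]
```

At $L = 1$ the cumulative probability $P_U(1) + P_A(1) = 1$, so a REMOVE can never be selected, which enforces the positivity constraint on the dimension.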
A.4.1 UPDATE step

1. Compute $\dfrac{p(\Lambda_i^*, \Lambda_{-i}^{(q)} \mid \theta^{(q)})}{p(\Lambda^{(q)} \mid \theta^{(q)})}$ using Equation (A.2).

2. For the weights $\pi$, we omit the subscript $i$ for brevity in the discussion and use a Dirichlet proposal distribution $q(\pi^* \mid \pi)$ for $\pi^*$:
\[
q(\boldsymbol{\pi}^* \mid \boldsymbol{\pi}) = \frac{\Gamma\!\left(\sum_{j=1}^{L_i} \alpha_j\right)}{\prod_{j=1}^{L_i} \Gamma(\alpha_j)} \prod_{j=1}^{L_i} (\pi_j^*)^{\alpha_j - 1},
\]
where $\alpha_j = h_0 \pi_j + 1$ with $h_0 = 10$, and similarly a Dirichlet proposal distribution $q(\pi \mid \pi^*)$ for $\pi$,
\[
q(\boldsymbol{\pi} \mid \boldsymbol{\pi}^*) = \frac{\Gamma\!\left(\sum_{j=1}^{L_i} \alpha_j^*\right)}{\prod_{j=1}^{L_i} \Gamma(\alpha_j^*)} \prod_{j=1}^{L_i} \pi_j^{\alpha_j^* - 1},
\]
where $\alpha_j^* = h_0 \pi_j^* + 1$ with $h_0 = 10$.

3. Assume $\gamma$ is bounded on the interval $[\gamma_L, \gamma_U]$, where we again omit the subscripts $i$ and $\ell$ for brevity. First, standardize $\gamma$ by
\[
m = \frac{\gamma - \gamma_L}{\gamma_U - \gamma_L}.
\]
For the transformed $\gamma$, a beta distribution proposal for $m^*$ is employed:
\[
q(m^* \mid m) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} (m^*)^{\alpha - 1} (1 - m^*)^{\beta - 1},
\]
where $\alpha = m(h_1 + 2) + 1$ and $\beta = h_1 + 2 - m(h_1 + 2) + 1$ with $h_1 = 10$. Then $\gamma^* = (\gamma_U - \gamma_L) m^* + \gamma_L$. Similarly, we use a beta distribution proposal for $m$:
\[
q(m \mid m^*) = \frac{\Gamma(\alpha^* + \beta^*)}{\Gamma(\alpha^*)\Gamma(\beta^*)} m^{\alpha^* - 1} (1 - m)^{\beta^* - 1},
\]
where $\alpha^* = (h_1 + 2) m^* + 1$ and $\beta^* = h_1 + 2 - (h_1 + 2) m^* + 1$ with $h_1 = 10$.

4. For $b$, we suppress the subscripts in this discussion and also use a beta distribution proposal for $b^*$:
\[
q(b^* \mid b) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} (b^*)^{\alpha - 1} (1 - b^*)^{\beta - 1},
\]
where $\alpha = b(h_2 + 2) + 1$ and $\beta = h_2 + 2 - b(h_2 + 2)$ with $h_2 = 10$. Similarly, the reverse proposal for $b$ is also a beta distribution,
\[
q(b \mid b^*) = \frac{\Gamma(\alpha^* + \beta^*)}{\Gamma(\alpha^*)\Gamma(\beta^*)} b^{\alpha^* - 1} (1 - b)^{\beta^* - 1},
\]
where $\alpha^* = (h_2 + 2) b^* + 1$ and $\beta^* = h_2 + 2 - (h_2 + 2) b^*$ with $h_2 = 10$.

5. Compute the proposal distribution ratio $\dfrac{f(\Lambda_i^{(q)} \mid \Lambda_i^*)}{f(\Lambda_i^* \mid \Lambda_i^{(q)})}$.

6. Obtain the Metropolis–Hastings ratio
\[
\mathrm{MH}(\Lambda_i^*, \Lambda_i^{(q)}) = \frac{p(\Lambda_i^*, \Lambda_{-i}^{(q)} \mid \theta^{(q)})}{p(\Lambda^{(q)} \mid \theta^{(q)})} \cdot \frac{f(\Lambda_i^{(q)} \mid \Lambda_i^*)}{f(\Lambda_i^* \mid \Lambda_i^{(q)})}.
\]

7. Draw $u$ from U(0, 1); take $\Lambda_i^{(q+1)} = \Lambda_i^*$ if $u < \mathrm{MH}(\Lambda_i^*, \Lambda_i^{(q)})$, and $\Lambda_i^{(q+1)} = \Lambda_i^{(q)}$ otherwise.
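Step 3 can be sketched as a small helper that standardizes $\gamma$, draws $m^*$ from the stated beta proposal, and maps back to the original scale. `propose_gamma` is a hypothetical name, with $h_1 = 10$ as in the text:

```python
import random

random.seed(7)
H1 = 10  # concentration constant h_1 from the text

def propose_gamma(gamma, gL, gU):
    """One beta-walk proposal for gamma on [gL, gU], as in Step 3 of the UPDATE step."""
    m = (gamma - gL) / (gU - gL)          # standardize to (0, 1)
    a = m * (H1 + 2) + 1                  # alpha = m(h1 + 2) + 1
    b = (H1 + 2) - m * (H1 + 2) + 1       # beta  = h1 + 2 - m(h1 + 2) + 1
    m_star = random.betavariate(a, b)     # draw m* ~ Beta(alpha, beta)
    return (gU - gL) * m_star + gL, m_star

gamma_new, m_star = propose_gamma(0.3, 0.0, 1.0)
```

The reverse density $q(m \mid m^*)$, needed for the proposal ratio in Step 5, is evaluated the same way with $\alpha^*$ and $\beta^*$ computed from $m^*$.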
A.4.2 ADD step

1. Sample $(\pi^*, \xi^*)$ from a proposal distribution $q(\pi, \xi)$: specifically, sample $\pi^*$ from U(0, 1), $b^*$ from U(0, 1), and $\gamma^*$ from U($\gamma_L$, $\gamma_U$).

2. Let $r$ be the index at which to insert the new variables; sample $r$ from a discrete uniform distribution on $\{1, 2, \cdots, L_i^{(q)} + 1\}$.

3. Set
\[
\Lambda_i^* = \big(L_i^{(q)} + 1,\ \pi_{i,1}^{(q)}(1 - \pi^*), \cdots, \pi_{i,r-1}^{(q)}(1 - \pi^*),\ \pi^*,\ \pi_{i,r}^{(q)}(1 - \pi^*), \cdots, \pi_{i,L_i^{(q)}}^{(q)}(1 - \pi^*),\ \xi_{i,1}^{(q)}, \cdots, \xi_{i,r-1}^{(q)},\ \xi^*,\ \xi_{i,r}^{(q)}, \cdots, \xi_{i,L_i^{(q)}}^{(q)}\big)'.
\]

4. Calculate the posterior conditional distribution ratio
\[
\frac{p(\Lambda_i^* \mid \theta^{(q)}, \Lambda_{-i}^{(q)})}{p(\Lambda_i^{(q)} \mid \theta^{(q)}, \Lambda_{-i}^{(q)})} = \frac{p(\Lambda_i^*, \Lambda_{-i}^{(q)} \mid \theta^{(q)})}{p(\Lambda^{(q)} \mid \theta^{(q)})}
\]
by Equation (A.2).

5. Define
\[
\mathrm{MH}(\Lambda_i^*, \Lambda_i^{(q)}) = \frac{p(\Lambda_i^*, \Lambda_{-i}^{(q)} \mid \theta^{(q)})}{p(\Lambda^{(q)} \mid \theta^{(q)})} \cdot \frac{P_R(L_i^{(q)} + 1)}{P_A(L_i^{(q)})\, q(\pi^*, \xi^*)}\,(1 - \pi^*)^{L_i^{(q)}},
\]
where $P_R(L_i^{(q)} + 1)$ and $P_A(L_i^{(q)})$ can be computed from their definitions in the paragraph above Figure A.1, the proposal in the ratio is $q(\pi^*, \xi^*) = \frac{1}{\gamma_U - \gamma_L}$, and the Jacobian in the RJ-MCMC is $J = (1 - \pi^*)^{L_i^{(q)}}$.

6. Draw $u$ from U(0, 1); take $\Lambda_i^{(q+1)} = \Lambda_i^*$ if $u < \mathrm{MH}(\Lambda_i^*, \Lambda_i^{(q)})$, and $\Lambda_i^{(q+1)} = \Lambda_i^{(q)}$ otherwise.
A.4.3 REMOVE step

1. Let $r$ be the index of the existing variables to remove; sample $r$ from a discrete uniform distribution on $\{1, 2, \cdots, L_i^{(q)}\}$.

2. Set
\[
\Lambda_i^* = \left(L_i^{(q)} - 1,\ \frac{\pi_{i,1}^{(q)}}{1 - \pi_{i,r}^{(q)}}, \frac{\pi_{i,2}^{(q)}}{1 - \pi_{i,r}^{(q)}}, \cdots, \frac{\pi_{i,r-1}^{(q)}}{1 - \pi_{i,r}^{(q)}}, \frac{\pi_{i,r+1}^{(q)}}{1 - \pi_{i,r}^{(q)}}, \cdots, \frac{\pi_{i,L_i^{(q)}}^{(q)}}{1 - \pi_{i,r}^{(q)}},\ \xi_{i,1}^{(q)}, \cdots, \xi_{i,r-1}^{(q)}, \xi_{i,r+1}^{(q)}, \cdots, \xi_{i,L_i^{(q)}}^{(q)}\right)'.
\]

3. Calculate $\dfrac{p(\Lambda_i^*, \Lambda_{-i}^{(q)} \mid \theta^{(q)})}{p(\Lambda^{(q)} \mid \theta^{(q)})}$ as in Step 4 of A.4.2.

4. Define
\[
\mathrm{MH}(\Lambda_i^*, \Lambda_i^{(q)}) = \frac{p(\Lambda_i^*, \Lambda_{-i}^{(q)} \mid \theta^{(q)})}{p(\Lambda^{(q)} \mid \theta^{(q)})} \cdot \frac{P_A(L_i^{(q)} - 1)\, q(\pi_{i,r}^{(q)}, \xi_{i,r}^{(q)})}{P_R(L_i^{(q)})}\,(1 - \pi_{i,r}^{(q)})^{1 - L_i^{(q)}},
\]
where the proposal in the ratio is $q(\pi_{i,r}^{(q)}, \xi_{i,r}^{(q)}) = \frac{1}{\gamma_U - \gamma_L}$ and the Jacobian for the RJ-MCMC is $J = (1 - \pi_{i,r}^{(q)})^{1 - L_i^{(q)}}$.

5. Draw $u$ from U(0, 1); take $\Lambda_i^{(q+1)} = \Lambda_i^*$ if $u < \mathrm{MH}(\Lambda_i^*, \Lambda_i^{(q)})$, and $\Lambda_i^{(q+1)} = \Lambda_i^{(q)}$ otherwise.
By cycling through the steps in A.4.1–A.4.3 for each subject, we complete the sampling of $\Lambda_i^{(q+1)}$, $i = 1, 2, \cdots, n$.
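The REMOVE reweighting in Step 2 is the inverse of the ADD reweighting: position $r$ is deleted and the remaining weights are divided by $(1 - \pi_{i,r}^{(q)})$. A minimal sketch checking the round trip (both helper names are hypothetical):

```python
def remove_component(pi, r):
    """Delete the weight at position r and renormalize the rest by (1 - pi_r)."""
    pi_r = pi[r - 1]
    rest = pi[: r - 1] + pi[r:]
    return [p / (1.0 - pi_r) for p in rest], pi_r

def add_component(pi, r, pi_star):
    """Insert pi_star at position r, shrinking existing weights by (1 - pi_star)."""
    scaled = [p * (1.0 - pi_star) for p in pi]
    return scaled[: r - 1] + [pi_star] + scaled[r - 1:]

# REMOVE inverts ADD: inserting pi_star = 0.4 at r = 2 and then removing
# position 2 recovers the original weights.
pi = [0.2, 0.5, 0.3]
pi_back, pi_r = remove_component(add_component(pi, 2, 0.4), 2)
```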
A.5 Sampling of φ: Gamma Distribution

Given $\boldsymbol{\theta}, \boldsymbol{\Lambda}$, we sample $\phi$ from a gamma distribution:
\[
\phi \mid \boldsymbol{\theta}, \boldsymbol{\Lambda} \sim \mathrm{Ga}\!\left(\frac{\sum_{i=1}^{n} T_i - 3n - 1}{2},\ \frac{(\boldsymbol{\theta} - X_{des}\boldsymbol{\beta})'(\boldsymbol{\theta} - X_{des}\boldsymbol{\beta})}{2}\right), \tag{A.3}
\]
where $(\sum_{i=1}^{n} T_i - 3n - 1)/2$ is the shape parameter and $(\boldsymbol{\theta} - X_{des}\boldsymbol{\beta})'(\boldsymbol{\theta} - X_{des}\boldsymbol{\beta})/2$ is the rate parameter.
A.6 Sampling of β: Multivariate Normal Distribution

Given $\boldsymbol{\theta}, \boldsymbol{\Lambda}, \phi$, we sample $\boldsymbol{\beta}$ from a multivariate normal distribution:
\[
\boldsymbol{\beta} \mid \boldsymbol{\theta}, \boldsymbol{\Lambda}, \phi \sim \mathcal{N}\!\left(\boldsymbol{\beta},\ (X_{des}' \Sigma^{-1} X_{des})^{-1}\right), \tag{A.4}
\]
where the mean $\boldsymbol{\beta}$ is as defined in Section A.4, $\Sigma_i = \phi^{-1} D_i$, and $\Sigma = \mathrm{diag}(\Sigma_1, \cdots, \Sigma_n)$ is a block diagonal matrix.
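Drawing $\beta$ is a standard multivariate normal update. A minimal sketch with hypothetical dimensions and data; since $\Sigma = \phi^{-1} D$, the mean computed with $\Sigma^{-1}$ coincides with $\beta$ as defined in Section A.4:

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical dimensions: N stacked observations, p regression coefficients.
N, p = 10, 4
X = rng.normal(size=(N, p))            # stands in for X_des
phi = 2.0
D = np.diag(rng.uniform(0.5, 1.5, N))
Sigma = D / phi                        # Sigma = diag(Sigma_1, ..., Sigma_n), Sigma_i = phi^{-1} D_i
theta = rng.normal(size=N)

Sinv = np.linalg.inv(Sigma)
precision = X.T @ Sinv @ X                                 # X'_des Sigma^{-1} X_des
mean = np.linalg.solve(precision, X.T @ Sinv @ theta)      # generalized least squares mean
beta = rng.multivariate_normal(mean, np.linalg.inv(precision))
```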
A.7 Sampling of θ: Normal Distribution

Given all other parameters, we sample $\theta_{i,j}$ from a normal distribution $\mathcal{N}(\mu_{i,j}, \Omega_{i,j})$ for $i = 1, \cdots, n$ and $j = 1, \cdots, T_i$, where
\[
\mu_{i,j} = \frac{\sum_{s=1}^{S_{i,j}} \sum_{k=1}^{K_{i,j,s}} \psi_{i,j,s,k}\,(Z_{i,j,s,k} + a_{i,j,s} - \eta_{i,j,s}) + \frac{\phi}{\Delta_{i,j}}\left[\beta_{i,0} + \beta_{i,1} f_i^0(t_{i,j})\right]}{\sum_{s=1}^{S_{i,j}} \sum_{k=1}^{K_{i,j,s}} \psi_{i,j,s,k} + \frac{\phi}{\Delta_{i,j}}},
\]
\[
\Omega_{i,j}^{-1} = \sum_{s=1}^{S_{i,j}} \sum_{k=1}^{K_{i,j,s}} \psi_{i,j,s,k} + \frac{\phi}{\Delta_{i,j}}.
\]
A.8 The Characteristics of the EdSphere Datasets

The characteristics of the 10 selected subjects are displayed in Table A.1.

Table A.1: Characteristics of each selected individual in the EdSphere dataset.

ID   Total tests   #Test days   Max tests/day   Range of items/test   Max gap   Grade
 1       335           185            9                4-21              115       1
 2       139           115            3                2-26              177       6
 3       161           117            5                2-29              194       2
 4       219           118           10                2-22              204       1
 5       234           129            9                2-20              196       1
 6       217           113           13                2-23              150       1
 7       221           128           10                1-25              101       2
 8       266           118            9                3-21              125       1
 9       185           120            6                2-21              105       3
10       232           103           11                2-24              112       4