Bayesian Item Response Theory: Methods and Applications
University of Connecticut, OpenCommons@UConn
Doctoral Dissertations, University of Connecticut Graduate School
8-2-2019

Bayesian Item Response Theory: Methods and Applications
Yang Liu, University of Connecticut - Storrs, [email protected]

Recommended Citation: Liu, Yang, "Bayesian Item Response Theory: Methods and Applications" (2019). Doctoral Dissertations. 2274. https://opencommons.uconn.edu/dissertations/2274
Bayesian Item Response Theory: Methods and Applications
Yang Liu, Ph.D.
University of Connecticut, 2019
Item response theory (IRT) models play a critical role in psychometric studies
for the design and analysis of examinations. IRT models mainly consider the
relationship among the correctness of item responses, an individual's latent
ability, the difficulty of each item, and other potential factors such as guessing.
In this dissertation, we develop Bayesian modeling methods and model selection
techniques under the IRT model framework. For Bayesian model comparison, the
Bayes factor is a widely used tool, which requires computation of the marginal
likelihoods. For complex models such as the IRT models, the marginal likelihoods
are not analytically available. There are a variety of Monte Carlo methods for
estimating or computing the marginal likelihoods, though some of them may not be
feasible for IRT models due to the high dimensionality of the parameter space. We
review several different Monte Carlo methods for marginal likelihood computation
under classic IRT models, develop the “best” implementation of these methods for
the IRT models, and apply these methods to a real dataset for comparison of the
classic one-parameter IRT model and two-parameter IRT model.
With increasing availability of computerized testing, observations are often col-
lected at irregular and variable time points. We adopt a dynamic IRT model based
on the one-parameter IRT model to accommodate this data structure. A hierar-
chical layer on the dynamic IRT model is built to capture the relationship between
the “growth factor” and the characteristics of individuals. We use the Bayes factor
to perform variable selection on the covariates linked to the growth, and develop a
Monte Carlo approach to compute the Bayes factors for all model pairs using a sin-
gle Markov chain Monte Carlo (MCMC) output. We also show the model selection
consistency of the Bayes factor under certain conditions.
Additionally, to allow more flexibility, we propose a nonparametric model and
embed a monotone shape constraint on the mean latent growth trend. Further, we
develop a partially collapsed Gibbs sampling algorithm coupled with a reversible
jump MCMC technique to sample the dimension-varying parameters from their
corresponding posterior distribution.
Bayesian Item Response Theory: Methods and Applications
Yang Liu
B.S., Zhejiang University, China, 2014
A Dissertation
Submitted in Partial Fulfillment of the
Requirements for the Degree of
Doctor of Philosophy
at the
University of Connecticut
2019
Copyright by
Yang Liu
2019
APPROVAL PAGE
Doctor of Philosophy Dissertation
Bayesian Item Response Theory: Methods and Applications
Presented by
Yang Liu, B.S.
Major Advisor
Ming-Hui Chen
Major Advisor
Xiaojing Wang
Associate Advisor
Dipak K. Dey
University of Connecticut
2019
To my mother Liqing Yang, my father Junxing Liu, and Xi Wang.
ACKNOWLEDGEMENTS
First and foremost, I would like to express my deep gratitude to my major
advisors, Prof. Ming-Hui Chen and Prof. Xiaojing Wang, for their tremendous
support throughout all these years. They led me into the world of Bayesian
statistics, guided me into various research areas of statistics, and inspired me
to pursue my career as a researcher. They are my wonderful role models: kind,
humble, diligent in life, and always patient with me. I am truly blessed to have
them as my advisors and mentors for my Ph.D. study.
I am also very grateful to my associate advisor, Prof. Dipak K. Dey, who is
always supportive. He encourages me and provides guidance and insightful advice
for both my research and student activities.
My sincere thanks also go to Prof. Lynn Kuo, Prof. Zhiyi Chi, and Prof. Haiying
Wang for serving as my general exam committee.
I would like to thank Dr. Donghui Zhang and Dr. Xiwen Ma for their great
support and advice on research and professional development. Many thanks to Dr.
Naitee Ting for his invaluable career advice and guidance.
Many thanks are owed to my fellow graduate students and the faculty members
who accompanied and supported me during this journey. I would also like to thank
Ms. Tracy Burke, Mr. Anthony Luis, and Ms. Megan Petsa for their helpful
administrative assistance.
Lastly, and most importantly, I want to dedicate this dissertation to my mother
Liqing and my father Junxing, and my love, Xi, who always encourage and support
me to pursue the life and career of my choice.
TABLE OF CONTENTS
Chapter 1: Introduction 1
1.1 Classic Item Response Theory Models and Marginal Likelihood Com-
putation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Dynamic Item Response Theory Models and Extensions . . . . . . . 3
1.3 Motivating Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.3.1 The English Examination Dataset . . . . . . . . . . . . . . . 9
1.3.2 The EdSphere Dataset . . . . . . . . . . . . . . . . . . . . . . 10
1.4 Dissertation Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Chapter 2: A Comparison of Monte Carlo Methods for Computing
Marginal Likelihoods of Item Response Theory Models 14
2.1 Classic Item Response Theory Models . . . . . . . . . . . . . . . . . 14
2.2 Monte Carlo Methods . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2.1 The Prior-Based Monte Carlo (PMC) Method . . . . . . . . . 18
2.2.2 The Harmonic Mean (HM) Method . . . . . . . . . . . . . . 18
2.2.3 The Chen Method . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2.4 The Chib Method . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2.5 The Stepping Stone Method (SSM) . . . . . . . . . . . . . . . 25
2.3 Analysis of English Examination Data . . . . . . . . . . . . . . . . . 27
2.3.1 Results of Marginal Likelihood Computation . . . . . . . . . 28
2.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
Chapter 3: Variable Selection in Dynamic Item Response Theory
Models via Bayes Factors with a Single MCMC Output 34
3.1 Proposed Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2 Computation of Bayes Factors with a Single MCMC Output . . . . . 36
3.3 Model Selection Consistency . . . . . . . . . . . . . . . . . . . . . . 40
3.4 Simulation Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.5 Real Data Application . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
Chapter 4: Bayesian Nonparametric Monotone Regression of Dy-
namic Latent Traits in IRT Models 51
4.1 Nonparametric Monotone Regression Model . . . . . . . . . . . . . . 51
4.2 Bayesian Computation Scheme . . . . . . . . . . . . . . . . . . . . . 57
4.2.1 Prior Specification and Posterior Distribution . . . . . . . . . 57
4.2.2 The MCMC Sampling Schemes . . . . . . . . . . . . . . . . . 60
4.3 Simulation Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.3.1 An Example Using the Mixture of TSP Functions as the
Latent Trajectory . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.3.2 An Example of the Logistic Curve as the Latent Trajectory . 67
4.4 Application to the EdSphere Data . . . . . . . . . . . . . . . . . . . 71
4.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
Chapter 5: Preliminary Results of Model Selection by DIC 77
5.1 Model Selection via DIC under Hierarchical Model . . . . . . . . . . 77
5.1.1 The Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.1.2 Choose θ as the Parameters in Focus . . . . . . . . . . . . . . 80
5.1.3 Choose τ as the Parameters in Focus . . . . . . . . . . . . . 82
5.1.4 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . 84
Appendix A: Posterior Propriety and MCMC Sampling Scheme 87
Bibliography 96
LIST OF TABLES
1.1 Illustration of the unequally spaced response data structure. . . . . . 11
2.1 Results of DIC and WAIC for the 1PL model and the 2PL model. . . 27
2.2 Log marginal likelihoods obtained by each method and corresponding
Monte Carlo standard errors in parentheses. . . . . . . . . . . . . . . 30
2.3 Time usage to obtain the estimates by each method under 1PL and
2PL IRT models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.1 Percentage of simulations in which the true model is ranked as the
“best” model by the Bayes factor approach. . . . . . . . . . . . . . . 47
3.2 Percentage of simulations in which the true model is ranked as the
top 3 “best” model by the Bayes factor approach. . . . . . . . . . . . 47
3.3 Estimated log Bayes factor of each model to the full model via the
proposed Monte Carlo approach. . . . . . . . . . . . . . . . . . . . . 48
4.1 Simulated parameters for each individual of TSP functions. . . . . . . 63
4.2 Coverage probabilities of φ, θi’s, τi’s, βi,0’s and βi,1’s by 100 indepen-
dent simulations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.1 Model identification counts in 100 simulations per setting of DIC1,
DICθ4 , and DICτ4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.2 PD results in 100 simulations per setting of DIC1, DICθ4 , and DICτ4 . 86
A.1 Characteristics for each selected individual in the EdSphere dataset. . 95
LIST OF FIGURES
2.1 Posterior medians and 95% credible intervals of aj’s under the 2PL
model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.2 Estimated item characteristic curves for the items 1, 4, 5, 8, 20. . . . 29
3.1 Trace plots of β, g and ψ with 3000 MCMC samples. The red line
refers to the true value. . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.1 Illustration of several TSP functions with different γ and b values. . . 56
4.2 The estimation of latent trajectories for the 1st, 2nd, 3rd, and 6th subjects 65
4.3 Posterior median and 95% CIs of τi’s in TSP mixture based simulation. 66
4.4 The simulation results of posterior median and 95% CIs of βi,0’s and
βi,1’s in TSP mixture based latent trajectories. . . . . . . . . . . . . . 66
4.5 The estimation of latent trajectories for the 4th, 6th, 7th, and 8th
subjects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.6 Posterior median and 95% CIs of τi’s in logistic curve simulation. . . 70
4.7 Posterior median and 95% CIs of βi,0’s and βi,1’s in logistic curve
simulation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.8 The estimation of latent trajectories for the 2nd, 3rd, 6th and 7th
subjects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.9 Posterior median and 95% CIs of τi’s. . . . . . . . . . . . . . . . . . . 75
4.10 Posterior median and 95% CIs of βi,0’s and βi,1’s. . . . . . . . . . . . 75
A.1 A pseudo code of the RJ-MCMC algorithm . . . . . . . . . . . . . . . 91
Chapter 1
Introduction
1.1 Classic Item Response Theory Models and Marginal Likelihood
Computation
In psychometrics, item response theory (IRT) models play a critical role in
the design and analysis of tests. The classic item response theory models link the
correctness of the answer (a dichotomous random variable) for each item with the
latent ability of an individual, the item difficulty, and potentially other parameters
through different types of link functions. The widely used Rasch model (Rasch, 1960)
refers to the one-parameter IRT model with the logistic link (1PL model). The two-
parameter logistic IRT model (2PL model) adds the discrimination parameter to the
1PL model to allow different scaling of each item. Other extensions and variations of
the IRT models include the three-parameter IRT model, which adds the parameter
for guessing compared to the two-parameter IRT model (Harris, 1989); multi-level
IRT models (Bock and Mislevy, 1989; Fox, 2005), which introduce extra layers for
the ability parameter; and the dynamic IRT model (Wang et al., 2013), which is able
to handle irregular and time-varying observations and estimate the latent trajectory
of ability over a time period.
Due to the complexity of IRT models, Bayesian methods have become increasingly
popular for the analysis and parameter estimation of IRT models (Mislevy, 1986;
Cao and Stokes, 2008; Karabatsos, 2016). The Bayes factor (Kass and Raftery,
1995) is a commonly used Bayesian model comparison criterion, which requires
proper prior distributions and computation of the marginal likelihoods. Since IRT
models often use the probit or logit link, the marginal likelihood is not analytically
available. Owing to recent advances in Bayesian computation, there is a rich
literature on Monte Carlo methods for estimating/computing the marginal likelihoods,
including, for example, a Prior-Based Monte Carlo (PMC) approach using a random
sample from the prior distribution, the Harmonic Mean (HM) approach (Newton
and Raftery, 1994) using a Markov chain Monte Carlo (MCMC) sample from the
posterior distribution, the Chib’s method (Chib, 1995) using Gibbs samples from the
posterior distribution and the conditional posterior distributions, a hybrid Laplace
approximation and Monte Carlo method (Lewis and Raftery, 1997), and a Monte
Carlo method of Chen (2005) using the latent structure of the model and a single
MCMC sample from the posterior distribution. Interestingly, Lartillot and Philippe
(2006) and Friel and Pettitt (2008) independently developed the thermodynamic
integration method and the power posterior method. Each of these methods is
a variation of the path sampling method of Gelman and Meng (1998). Xie et al.
(2011) proposed the stepping stone method (SSM), which is another variation of the
path sampling method. A more general version of SSM was developed by Fan et al.
(2010). More recently, Wang et al. (2018) developed the partition weighted kernel
(PWK) method, which is an extension of the inflated density ratio (IDR) method
of Petris and Tardella (2003) and Petris and Tardella (2007). To the best of our
knowledge, most of these methods have not yet been applied to computing marginal
likelihoods under the IRT models. Also note that some of these MC methods may
not be feasible, or may be difficult to implement, for computing the marginal
likelihoods of the IRT models due to the high dimensionality of the model
parameters. There may be different ways to implement the same MC method,
depending on how we arrange the model parameters. It would be interesting to
discuss how to apply some of these MC methods in the “best possible” way.
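To fix ideas before Chapter 2, the simplest of these estimators, the PMC method, approximates M(D) = ∫ L(θ|D)π(θ)dθ by averaging the likelihood over draws from the prior. The sketch below illustrates this on a toy 1PL-style setting with a single subject and known item difficulties; all data and settings are purely illustrative, not taken from the English examination dataset.

```python
import numpy as np

# Prior-based Monte Carlo (PMC) sketch:
#   M(D) = integral of L(theta | D) * pi(theta) d(theta)
#        ~ (1/S) * sum_s L(theta_s | D),  theta_s drawn from the prior.
# Toy 1PL-style setting: one subject with ability theta ~ N(0, 1)
# answering J items with known (illustrative) difficulties.
rng = np.random.default_rng(42)

b = np.array([-1.0, 0.0, 1.0, 2.0])   # known item difficulties
y = np.array([1.0, 1.0, 0.0, 0.0])    # observed correctness

S = 50_000
thetas = rng.standard_normal(S)                      # draws from the N(0, 1) prior
eta = thetas[:, None] - b[None, :]                   # theta_s - b_j, shape (S, J)
loglik = np.sum(y * eta - np.log1p(np.exp(eta)), axis=1)

# Average the likelihoods on the log scale (log-sum-exp for stability).
log_marglik = np.logaddexp.reduce(loglik) - np.log(S)
print(f"PMC estimate of log M(D): {log_marglik:.3f}")
```

As noted above, averaging over the prior becomes inefficient in high dimensions, because few prior draws land where the likelihood is large; this is precisely why the alternative methods reviewed in Chapter 2 are of interest.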
1.2 Dynamic Item Response Theory Models and Extensions
Longitudinal studies play a prominent role in investigating temporal changes
in a construct of interest; such analyses are often referred to as growth curve
analysis in the social and behavioral sciences. The advent of computerized testing
and online rating brings an entirely new way for social and behavioral researchers
to collect longitudinal data. Test takers have much more freedom to choose their
test times than before. The responses are often collected at variable and irregular
time points across individuals. The randomness of responses may further create
sparsity in certain periods, such as the summer or winter holidays.
Traditionally, two models are widely used for studying individual changes.
One is latent growth curve modeling (Bollen and Curran, 2006) and the other
is multi-level modeling, or hierarchical linear modeling (Raudenbush and Bryk, 2002).
However, when outcomes are observed at individually-varying and irregularly-spaced
time points, the inferences from these two traditional models for studying
individual changes may become problematic because the parametric structures in
those models are not easily adjusted (Geiser et al., 2013). Furthermore, in
computerized testing/surveys in education, the manifest responses are usually
dichotomous, ordinal, or nominal, while the latent traits to be inferred are often
continuous. This makes the inference even more difficult because of the information
loss in the discretization procedure of the underlying response variables.
We focus on an extension of the classic IRT model framework to model longitudi-
nal dichotomous data collected at irregular and variable time points. As mentioned
previously, there are two widely used approaches, i.e., the multidimensional and
multilevel approaches, which are available in the current literature of IRT models
for the analysis of longitudinal data. First, for the multidimensional approach, a
multidimensional IRT model is used to represent the change of an ability as an initial
ability and one or more potential modifiers in unidimensional or multidimensional
tests (Embretson, 1991; te Marvelde et al., 2006; Cho et al., 2013). However, this
approach allows little variation of items on different occasions and it often requires
the individuals to take the same tests. These drawbacks prevent us from extending
their methods to analyze a time series of computerized testing data.
Second, for the multilevel approach, the first-level is often assumed to follow a
classic IRT model, while in the higher-level, there are two common ideas to model
the growth. One idea is to assume the growth of a latent trait to be a parametric
function of time, such as a linear or polynomial regression of the time variable
with fixed or random coefficients. This idea is a variation of the latent growth curve
modeling in the analysis of binary/categorical longitudinal data (Albers et al., 1989;
Tan et al., 1999; Johnson and Raudenbush, 2006; Verhagen and Fox, 2013; Hsieh
et al., 2013). Another idea is to employ Markov chain models, where the changes of
a latent trait over time are assumed to be dependent on its previous value or status
(Martin and Quinn, 2002; Park, 2011; Bartolucci et al., 2011; Kim and Camilli,
2014). However, there are many instances in which neither of these two ideas would
be adequate to describe the growth (Bollen and Curran, 2004). One such instance
is computerized testing, especially when the time lapses between tests are
unequally spaced across individuals as well as within individuals.
To tackle the challenges in computerized testing for modeling the growth of
latent traits, Wang et al. (2013) proposed a dynamic model by combining the ideas
of parametric functions of time as well as Markov chain models to describe the
growth. They imbedded IRT models into a new class of state space models for
analyzing longitudinal data located at individually-varying and irregularly-spaced
time points. In particular, an individually-varying parameter which can be called as
the “growth factor” at the second level of their proposed dynamic IRT model, may
be related to the characteristics of each individual, such as gender, grade, etc. We
6
build up another hierarchical layer based upon the dynamic IRT model discussed
in Wang et al. (2013), to capture relationship between the “growth factor” and a
few candidate covariates. The Bayesian modeling framework is also adopted here
for the extended dynamic IRT model.
Variable selection in the higher layer is a natural and important problem in this
model. Many Bayesian variable selection techniques are available in the literature,
such as the deviance information criterion (DIC) (Spiegelhalter et al., 2002),
the Bayesian lasso (Park and Casella, 2008), and the previously mentioned Bayes
factor approach. Many of these criteria were initially discussed within a linear model
framework, and they may be applied to hierarchical models such as the dynamic
IRT model under consideration. However, it remains unclear whether some of their
attractive model selection properties, such as model selection consistency, still
hold under hierarchical models.
Fernandez et al. (2001) and Liang et al. (2008) discussed some attractive model
selection properties of the Bayes factor when the g-prior (Zellner, 1986) is specified
for the regression coefficients under the linear model framework. In particular, Liang
et al. (2008) showed that when the Zellner-Siow prior (Zellner and Siow, 1980) is
assumed, i.e., an Inv-Gamma(1/2, n/2) prior is assigned to g, the Bayes factor
approach has several attractive properties, including posterior model selection
consistency. Li and Clyde (2018) and Wu et al. (2016) discussed the model selection
properties of mixtures of g-priors under the generalized linear model framework.
These results motivate us to adopt the Bayes factor approach for variable selection
in the added layer under the dynamic IRT model.
However, due to the complex structure of the dynamic IRT model and the high
dimensionality of the parameter space, no analytical forms are available for the
Bayes factors of the dynamic IRT models (while analytical forms are available
for the linear regression model discussed in Liang et al. (2008) when g is fixed).
Inspired by Chen (2005), we develop a Monte Carlo approach to compute the Bayes
factors under the dynamic IRT model using a single MCMC output, which is a
feasible approach when the number of covariates is moderate. For example, 20
covariates lead to 2^20 = 1,048,576 possible models even if no interaction terms
are considered. Model selection properties will be discussed in Chapter 3.
Nevertheless, assuming a particular functional relationship for the growth can in
general be restrictive and difficult to justify. Instead, a nonparametric
model can be much more flexible in describing changes of latent traits and can avoid
the errors of model misspecification.
Based on the results in Wang et al. (2013), we find that the trajectory
of a subject's reading ability often grows quickly in the initial period but slows
down as it approaches maturation. Overall, the ability exhibits an increasing
trend but often has a flat region at the end. This shape of the growth trajectory
is consistent with prior beliefs and experiences of practitioners.
In various scientific applications, assumptions about the shape of the trajectory,
such as monotonicity, convexity, or concavity, may be proposed ahead of the analysis
to aid the modeling process and enhance interpretability (Erickson, 1976; Paul
et al., 2016; Rodrigues et al., 2018). This motivates us to incorporate shape
constraints as prior information in nonparametric modeling, and the use of
shape information may improve the efficiency and accuracy of the nonparametric
estimates.
From the Bayesian perspective, nonparametric regression with monotonicity
constraints has already been considered in the literature. For instance, Gelfand and Kuo
(1991) used an ordered Dirichlet process prior to impose the monotone constraint.
Neelon and Dunson (2004) imposed a piecewise linear regression, with an autore-
gressive prior for the parameters of basis functions. Shively et al. (2009) as well as
Brezger and Steiner (2008) implemented restricted splines to ensure monotonicity.
McKay Curtis and Ghosh (2011) used Bernstein polynomials with restrictions on
parameter space, while Choi et al. (2016) extended this idea by allowing incorpo-
ration of uncertainty of the constraints through the prior. Lin and Dunson (2014)
proposed Gaussian process projection to perform shape-constrained inference and
applied the method to multivariate surface estimation. Wang and Berger (2016)
imposed the constraints on the derivative process of the original Gaussian process
to estimate shape-constrained functions. However, typical nonparametric models
are often less interpretable compared with parametric models. Enlightened by the
idea of Bornkamp and Ickstadt (2009), we develop a Bayesian dynamic IRT model
with a monotone shape constraint on the mean latent growth trend. A
nonparametric model for the latent growth is adopted to allow more flexibility, while
interpretability is still retained to some degree.
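The monotone construction can be sketched numerically. Following the general idea of Bornkamp and Ickstadt (2009), a nondecreasing mean trend is built as a nonnegative-weighted mixture of cumulative distribution functions, which is monotone by construction. The toy sketch below uses logistic CDFs and illustrative weights and locations; it is not necessarily the exact basis or parameterization used in Chapter 4.

```python
import numpy as np

# Sketch: any nondecreasing curve on [0, T] can be approximated by
# intercept + slope * sum_k w_k * F_k(t), where each F_k is a CDF,
# the weights w_k are nonnegative and sum to 1, and slope >= 0.
# Monotonicity then holds automatically, since each CDF is nondecreasing.

def logistic_cdf(t, loc, scale):
    """CDF of a logistic distribution: a smooth, monotone basis function."""
    return 1.0 / (1.0 + np.exp(-(t - loc) / scale))

def monotone_curve(t, weights, locs, scales, intercept=0.0, slope=1.0):
    """Nondecreasing trend built as a weighted mixture of logistic CDFs."""
    basis = np.array([logistic_cdf(t, l, s) for l, s in zip(locs, scales)])
    return intercept + slope * np.dot(weights, basis)

t_grid = np.linspace(0.0, 10.0, 200)
weights = np.array([0.5, 0.3, 0.2])    # nonnegative, summing to 1
locs = np.array([2.0, 5.0, 8.0])       # hypothetical component locations
scales = np.array([0.5, 1.0, 0.8])
curve = monotone_curve(t_grid, weights, locs, scales, intercept=-1.0, slope=3.0)

# The resulting trend is monotone nondecreasing by construction.
assert np.all(np.diff(curve) >= 0)
```

A curve of this form also matches the empirical shape described above: it rises quickly where the component CDFs are steep and flattens out as they saturate near the end of the range.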
1.3 Motivating Datasets
1.3.1 The English Examination Dataset
The first dataset consists of the English exam results of the 2017-2018 academic
year from No. 11 Middle School of Wuhan, Bingjiang Campus, a public
middle school in the Jiang'an district of Wuhan, China. This exam is a midterm English
exam for grade 8 students in fall 2017. A total of 468 students attended this
exam. The official difficulty coefficient of this exam from the Education
Bureau of the Jiang'an district is 0.67, which means that the average score of all the
students who attended is 67% of full marks.
The classic IRT model framework seems suitable for the data structure here. If
guessing is not considered, it remains unclear whether the 2PL model should be
preferred over the 1PL model. We consider applying the Bayes factor (Kass and
Raftery, 1995) approach for model comparison and develop the “best implemen-
tation” of several Monte Carlo approaches in Chapter 2 to compute the marginal
likelihoods of the IRT models.
1.3.2 The EdSphere Dataset
The second motivating dataset is the EdSphere dataset provided by MetaMetrics
Company and Highroad Learning Company. EdSphere is a personalized literacy
learning platform that continuously collects data about student performance and
strategic behaviors each time he or she reads an article. During a typical
reading test, a student selects from a system-generated list of articles whose
difficulty levels lie in a range targeted to the current estimate of the student's ability.
Then, for the selected article, a subset of words from the article are eligible to be
clozed, that is, removed and replaced by a blank. The computer, following a pre-
scribed protocol, randomly selects a sample of the eligible words to be clozed and
presents the article to the student with these words clozed. The question items pro-
duced by this procedure are randomized items. They are single-use items generated
at the time of an encounter between a student and an article. If another student
selects the same article to read, a new set of clozed words is selected. As a conse-
quence, the reoccurrence of individual items among students is highly improbable,
so obtaining empirical estimates of item parameters is not feasible.
The difficulty levels of the items in the reading test are provided by MetaMet-
rics using proprietary data and methods. The ensemble mean and the variance of
difficulty level for the items in a test are known due to the test design of the
EdSphere learning platform.
Currently, the EdSphere dataset consists of 16,949 students from a school district
in Mississippi who participated in the EdSphere learning platform over 5 years
(2007-2011). The students were in different grades and could enter and leave the program
at different times. They were free to take tests on different days and had different
time lapses between tests. An illustration of this kind of data structure is included
in Table 1.1.
Table 1.1: Illustration of the unequally spaced response data structure.
Date        Day 1    Day 2    Day 3    ...  Day M    ...  Day N    ...  Day X     ...
Examinee A  1 test   N/A      3 tests  ...  5 tests  ...  N/A      ...  Leave     ...
Examinee B  N/A      Enter    1 test   ...  N/A      ...  4 tests  ...  10 tests  ...
Examinee C  4 tests  2 tests  1 test   ...  N/A      ...  N/A      ...  1 test    ...
This design yields longitudinal observations located at individually-varying and
irregularly-spaced time points and suggests that we need to model the changes of
latent traits with a dynamic structure. Further, in the spirit of the EdSphere test
design, we can imagine that factors such as overall comprehension, emotional status,
and others would exist and undermine the local independence assumption of IRT models,
as mentioned in Wang et al. (2013). Therefore, we need to extend the classic IRT
models to accommodate modern computerized (adaptive) testing (not merely
the EdSphere datasets), which has the distinctive features described above, i.e., randomized
items, longitudinal observations, and local dependence.
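For concreteness, the unequally spaced structure in Table 1.1 can be represented without any padding by keeping one list of (day, test count) records per examinee. The sketch below mirrors the illustrative table, not actual EdSphere records.

```python
from collections import defaultdict

# Hedged sketch: store irregular longitudinal responses as sparse
# (examinee, day, number_of_tests) records; days with no activity simply
# have no record, so individually varying test days need no alignment.
records = [
    ("A", 1, 1), ("A", 3, 3),
    ("B", 3, 1),
    ("C", 1, 4), ("C", 2, 2), ("C", 3, 1),
]

by_examinee = defaultdict(list)
for examinee, day, n_tests in records:
    by_examinee[examinee].append((day, n_tests))

# Time lapses between consecutive test days differ across and within
# examinees, which is exactly what a dynamic IRT model must accommodate.
lapses = {e: [later[0] - earlier[0] for earlier, later in zip(days, days[1:])]
          for e, days in by_examinee.items()}
print(lapses)
```

Here examinee A has a 2-day gap between tests while examinee C tests daily; a dynamic model can use these individual-specific lapses directly when propagating the latent trait between occasions.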
1.4 Dissertation Outline
The rest of the dissertation is organized as follows. In Chapter 2, we review the PMC
method, the HM method, the Chib method, the Chen method, and the SSM, and
provide an in-depth discussion of how these methods can be applied to the IRT
models. Our focus will be on the “best possible” implementation of these
methods for the IRT models. Full implementation details of each reviewed MC
method are provided. We apply these MC methods to the English examination
dataset and present the estimated marginal likelihoods by each method along with
the computation costs.
In Chapter 3, we propose a dynamic IRT model that models the relationship
between the growth trend and individual characteristics, and we perform variable
selection among the candidate covariates using the Bayes factor. To overcome the
computational burden, enlightened by Chen (2005), we develop a convenient Monte
Carlo approach to compute the Bayes factors using a single MCMC output from the
posterior under the full model. Additionally, we show the posterior model selection
consistency of the Bayes factors under the Zellner-Siow prior with certain conditions.
Simulation studies are carried out to examine the model selection performance of
the proposed Bayes factor approach in finite sample situations. The proposed
method is also applied to the EdSphere dataset.
In Chapter 4, we propose a flexible Bayesian monotone nonparametric regression
of the latent traits within the dynamic IRT model framework. Specifically, we
impose the monotone shape constraint by using a mixture of cumulative distribution
functions. We set the dimension of the mixture as random and develop a partially
collapsed Gibbs sampling algorithm in conjunction with the reversible jump MCMC
approach to generate the posterior samples. Results for the simulation studies and
the EdSphere data application are discussed. The details of the MCMC sampling
algorithm are included in Appendix A.

In Chapter 5, we discuss some of our preliminary results of applying the DIC for
model selection under hierarchical models.
Chapter 2
A Comparison of Monte Carlo Methods for
Computing Marginal Likelihoods of Item
Response Theory Models
2.1 Classic Item Response Theory Models
Suppose that there are n subjects and each of them answers J items (questions).
Let yij represent the correctness of item j answered by subject i, where yij equals
1 if the answer is correct and 0 otherwise for j = 1, 2, . . . , J and i = 1, 2, . . . , n. The
classic one-parameter logistic (1PL) model assumes
\[
P(y_{ij} = 1 \mid \theta, b) = \frac{\exp(\theta_i - b_j)}{1 + \exp(\theta_i - b_j)}, \tag{2.1}
\]
where θi denotes the latent ability of subject i and bj is the difficulty of item j. The
two-parameter logistic (2PL) IRT model adds the discrimination parameter aj to
the 1PL model, and assumes
\[
P(y_{ij} = 1 \mid \theta, b, a) = \frac{\exp\{a_j(\theta_i - b_j)\}}{1 + \exp\{a_j(\theta_i - b_j)\}}. \tag{2.2}
\]
Here aj > 0 allows different degrees of distinction for different items. We use
θ = (θ1, θ2, . . . , θn)′, b = (b1, b2, . . . , bJ)′, and a = (a1, a2, . . . , aJ)′ to represent the
parameter vectors of all θi’s, bj’s, and aj’s, respectively.
Denote the data as D = {yij : i = 1, 2, . . . , n, j = 1, 2, . . . , J}. Since the one-parameter
model is a special case of the two-parameter model with discrimination
parameters aj = 1 for j = 1, 2, . . . , J, to avoid redundancy, we will discuss the details
of the implementation of each Monte Carlo method in the next section only for the
two-parameter model.
For the two-parameter model (2.2), the data likelihood can be expressed as
$$L(\theta, b, a \mid D) = \prod_{i=1}^{n}\prod_{j=1}^{J} \frac{\exp\{y_{ij}a_j(\theta_i - b_j)\}}{1 + \exp\{a_j(\theta_i - b_j)\}}. \qquad (2.3)$$
Similar to Natesan et al. (2016) and Luo and Jiao (2018), the priors for the model parameters are specified as follows: (i) θi ∼ N(0, 1) independently for i = 1, 2, ..., n, to ensure the identifiability of the model parameters θi's and aj's; (ii) given µb and ψ, a hierarchical prior is used for b: bj ∼ N(µb, ψ⁻¹) for j = 1, 2, ..., J; (iii) aj ∼ U(0.5, 2.5) independently for j = 1, 2, ..., J (log-normal priors are also considered in a sensitivity analysis); and (iv) µb ∼ N(0, s₀²ψ⁻¹) and ψ ∼ Gamma(1, 1), where s₀ is a pre-specified constant. We set s₀ = 2 in our empirical study, which yields a moderately noninformative prior for µb.
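To make this prior specification concrete, the following sketch draws one parameter set from the priors above and simulates a response matrix under the 2PL model (2.2); the variable names and the sizes n and J are illustrative, not taken from the dissertation's code.

```python
import numpy as np

rng = np.random.default_rng(0)
n, J, s0 = 200, 20, 2.0

# One draw from the priors of Section 2.1 (shape/scale parameterization for Gamma).
psi = rng.gamma(shape=1.0, scale=1.0)             # psi ~ Gamma(1, 1)
mu_b = rng.normal(0.0, s0 / np.sqrt(psi))         # mu_b ~ N(0, s0^2 psi^{-1})
b = rng.normal(mu_b, 1.0 / np.sqrt(psi), size=J)  # b_j ~ N(mu_b, psi^{-1})
a = rng.uniform(0.5, 2.5, size=J)                 # a_j ~ U(0.5, 2.5)
theta = rng.normal(0.0, 1.0, size=n)              # theta_i ~ N(0, 1)

# Simulate correctness indicators under the 2PL model (2.2).
eta = a[None, :] * (theta[:, None] - b[None, :])
p = 1.0 / (1.0 + np.exp(-eta))
y = rng.binomial(1, p)
print(y.shape)  # (200, 20)
```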
The full posterior distribution can be written as

$$\begin{aligned}
\pi(\theta, b, a, \mu_b, \psi \mid D) = \frac{1}{C(D)} &\left[\prod_{i=1}^{n}\prod_{j=1}^{J} \frac{\exp\{y_{ij}a_j(\theta_i - b_j)\}}{1 + \exp\{a_j(\theta_i - b_j)\}}\right] (2\pi)^{-(n+J+1)/2} \exp\left(-\frac{\sum_{i=1}^{n}\theta_i^2}{2}\right) \\
&\times \psi^{\frac{J+1}{2}} \exp\left(-\frac{\psi\sum_{j=1}^{J}(b_j - \mu_b)^2}{2}\right) s_0^{-1} \exp\left(-\frac{\psi\mu_b^2}{2s_0^2}\right) \exp(-\psi) \\
&\times 2^{-J} \prod_{j=1}^{J} 1\{0.5 < a_j < 2.5\},
\end{aligned} \qquad (2.4)$$
where C(D) is the normalizing constant of the full posterior distribution and the indicator function 1{E} = 1 if E is true and 0 otherwise. Since

$$\int \pi(\theta, b, a, \mu_b, \psi)\, d\theta\, db\, da\, d\mu_b\, d\psi = 1,$$

the prior distributions are proper and, hence, C(D) reduces to the marginal likelihood M(D) given by

$$M(D) = \int L(\theta, b, a \mid D)\, \pi(\theta, b, a, \mu_b, \psi)\, d\theta\, db\, da\, d\mu_b\, d\psi. \qquad (2.5)$$
There are some convenient properties embedded in the full conditional distributions associated with this posterior structure. For instance, given (b, a, D), the distributions of the θi's are conditionally independent. The conditional independence property also holds for a and b in a similar fashion when considering their full conditional posteriors. Specifically, we have
$$\pi(\theta_i \mid b, a, D) \propto \prod_{j=1}^{J} \frac{\exp\{y_{ij}a_j(\theta_i - b_j)\}}{1 + \exp\{a_j(\theta_i - b_j)\}} \exp(-\theta_i^2/2), \qquad (2.6)$$

$$\pi(b_j \mid \theta, a, \mu_b, \psi, D) \propto \prod_{i=1}^{n} \frac{\exp\{y_{ij}a_j(\theta_i - b_j)\}}{1 + \exp\{a_j(\theta_i - b_j)\}} \exp\left(-\frac{\psi(b_j - \mu_b)^2}{2}\right), \qquad (2.7)$$

$$\pi(a_j \mid \theta, b, D) \propto \prod_{i=1}^{n} \frac{\exp\{y_{ij}a_j(\theta_i - b_j)\}}{1 + \exp\{a_j(\theta_i - b_j)\}}\, 1\{0.5 < a_j < 2.5\}. \qquad (2.8)$$
Although the normalizing constants corresponding to the above three conditional
distributions are not analytically available, log π(θi|b,a, D), log π(aj|θ, b, D), and
log π(bj|θ,a, µb, ψ,D) are concave. However, the analytical forms for π(µb|b, ψ,D)
and π(ψ|b, D) are available. From the full posterior (2.4), we have
$$\begin{aligned}
\pi(\mu_b \mid b, a, \psi, D) &\propto \exp\left(-\frac{\psi\sum_{j=1}^{J}(b_j - \mu_b)^2}{2}\right) \exp\left(-\frac{\psi\mu_b^2}{2s_0^2}\right) \\
&\propto \exp\left(-\frac{\psi(J + 1/s_0^2)\left(\mu_b - \frac{\sum_{j=1}^{J} b_j}{J + 1/s_0^2}\right)^2}{2}\right),
\end{aligned}$$

thus $\mu_b \mid b, a, \psi, D \sim N\!\left(\frac{\sum_{j=1}^{J} b_j}{J + 1/s_0^2},\, \{\psi(J + 1/s_0^2)\}^{-1}\right)$. For π(ψ|b, a, D), we have

$$\begin{aligned}
\pi(\psi \mid b, a, D) &= \int \pi(\mu_b, \psi \mid b, a, D)\, d\mu_b \\
&\propto \exp(-\psi)\, \psi^{J/2} \exp\left(\frac{\psi(\sum_{j=1}^{J} b_j)^2}{2(J + s_0^{-2})}\right) \exp\left(-\frac{\psi\sum_{j=1}^{J} b_j^2}{2}\right) \\
&\propto \psi^{J/2} \exp\left[-\psi\left(1 + \sum_{j=1}^{J} b_j^2/2 - \frac{(\sum_{j=1}^{J} b_j)^2}{2(J + s_0^{-2})}\right)\right].
\end{aligned}$$

Thus $\psi \mid b, a, D \sim \mathrm{Gamma}\!\left(J/2 + 1,\, \left(1 + \sum_{j=1}^{J} b_j^2/2 - \frac{(\sum_{j=1}^{J} b_j)^2}{2(J + s_0^{-2})}\right)^{-1}\right)$, where the second argument is the scale parameter. These are attractive and useful properties, which lead to an efficient and convenient implementation of certain Monte Carlo methods discussed in the next section.
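As a sketch of how these conjugate conditionals can be used inside a Gibbs sampler, the helpers below draw µb and ψ from the closed forms just derived; the function names are ours, and numpy's gamma sampler uses the shape/scale parameterization, matching the form above.

```python
import numpy as np

def update_mu_b(b, psi, s0, rng):
    """Draw mu_b | b, psi ~ N( sum(b)/(J + 1/s0^2), [psi (J + 1/s0^2)]^{-1} )."""
    J = len(b)
    prec = J + 1.0 / s0**2
    mean = b.sum() / prec
    return rng.normal(mean, 1.0 / np.sqrt(psi * prec))

def update_psi(b, s0, rng):
    """Draw psi | b ~ Gamma(J/2 + 1, scale = rate^{-1}) with
    rate = 1 + sum(b^2)/2 - (sum b)^2 / (2 (J + s0^{-2}))."""
    J = len(b)
    rate = 1.0 + 0.5 * (b**2).sum() - b.sum()**2 / (2.0 * (J + s0**-2))
    return rng.gamma(shape=J / 2.0 + 1.0, scale=1.0 / rate)

rng = np.random.default_rng(1)
b = rng.normal(0.0, 1.0, size=30)
psi = update_psi(b, 2.0, rng)
mu_b = update_mu_b(b, psi, 2.0, rng)
```

By the Cauchy-Schwarz inequality the rate above is always at least 1, so the scale is well defined for any b.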
2.2 Monte Carlo Methods
From Section 2.1, we see that the dimension of the parameters θ, a, and b grows with n or J and is generally high. Therefore, we consider only a few selected Monte Carlo methods discussed in Section 1.1 for computing the marginal likelihood under the 2PL model.
2.2.1 The Prior-Based Monte Carlo (PMC) Method
By the definition of the marginal likelihood (2.5), the PMC estimator can be constructed as

$$\hat{M}(D)_{PMC} = \frac{1}{S}\sum_{s=1}^{S} L(\theta^{(s)}, b^{(s)}, a^{(s)} \mid D) = \frac{1}{S}\sum_{s=1}^{S}\left[\prod_{i=1}^{n}\prod_{j=1}^{J} \frac{\exp\{y_{ij}a_j^{(s)}(\theta_i^{(s)} - b_j^{(s)})\}}{1 + \exp\{a_j^{(s)}(\theta_i^{(s)} - b_j^{(s)})\}}\right],$$

where {(θ⁽ˢ⁾, b⁽ˢ⁾, a⁽ˢ⁾, µb⁽ˢ⁾, ψ⁽ˢ⁾), s = 1, 2, ..., S} is a random sample of size S from the joint prior distribution of (θ, b, a, µb, ψ) specified in Section 2.1. This estimator has a finite variance since the likelihood function satisfies $\prod_{i=1}^{n}\prod_{j=1}^{J} \frac{\exp\{y_{ij}a_j^{(s)}(\theta_i^{(s)} - b_j^{(s)})\}}{1 + \exp\{a_j^{(s)}(\theta_i^{(s)} - b_j^{(s)})\}} \le 1$. When S is large enough, $\hat{M}(D)_{PMC}$ converges to M(D) in probability by the law of large numbers. However, a well-known issue with this method is that the samples rarely come from the high-likelihood region (Lartillot and Philippe, 2006). Thus, unless the sample size S is "very" large, $\hat{M}(D)_{PMC}$ generally results in a poor estimate of the true marginal likelihood.
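A minimal sketch of the PMC estimator, computed on the log scale to avoid underflow of the likelihood; the toy data y_toy and the sample size S are illustrative.

```python
import numpy as np
from scipy.special import logsumexp

def log_lik_2pl(y, theta, b, a):
    """Log of the 2PL likelihood (2.3) at one parameter draw."""
    eta = a[None, :] * (theta[:, None] - b[None, :])
    return float(np.sum(y * eta - np.log1p(np.exp(eta))))

def pmc_log_marginal(y, S, s0=2.0, seed=0):
    """Prior-based MC estimate of log M(D): log of the average likelihood
    over S i.i.d. draws from the prior of Section 2.1."""
    rng = np.random.default_rng(seed)
    n, J = y.shape
    logL = np.empty(S)
    for s in range(S):
        psi = rng.gamma(1.0, 1.0)
        mu_b = rng.normal(0.0, s0 / np.sqrt(psi))
        b = rng.normal(mu_b, 1.0 / np.sqrt(psi), size=J)
        a = rng.uniform(0.5, 2.5, size=J)
        theta = rng.normal(0.0, 1.0, size=n)
        logL[s] = log_lik_2pl(y, theta, b, a)
    return logsumexp(logL) - np.log(S)  # log of (1/S) sum_s L^{(s)}

y_toy = (np.random.default_rng(1).random((15, 5)) < 0.6).astype(int)
est = pmc_log_marginal(y_toy, S=2000)
```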
2.2.2 The Harmonic Mean (HM) Method
Using the harmonic mean approach (Newton and Raftery, 1994), we have the identity

$$\frac{1}{M(D)} = \int \frac{\pi(\theta, b, a, \mu_b, \psi)}{M(D)}\, d\theta\, db\, da\, d\mu_b\, d\psi = \int L(\theta, b, a \mid D)^{-1}\, \frac{L(\theta, b, a \mid D)\, \pi(\theta, b, a, \mu_b, \psi)}{M(D)}\, d\theta\, db\, da\, d\mu_b\, d\psi. \qquad (2.9)$$

Therefore the HM estimator of M(D) can be constructed as

$$\hat{M}(D)_{HM} = \left[\frac{1}{S}\sum_{s=1}^{S} \frac{1}{L(\theta^{(s)}, b^{(s)}, a^{(s)} \mid D)}\right]^{-1} = \left\{\frac{1}{S}\sum_{s=1}^{S}\left[\prod_{i=1}^{n}\prod_{j=1}^{J} \frac{\exp\{y_{ij}a_j^{(s)}(\theta_i^{(s)} - b_j^{(s)})\}}{1 + \exp\{a_j^{(s)}(\theta_i^{(s)} - b_j^{(s)})\}}\right]^{-1}\right\}^{-1},$$

where {(θ⁽ˢ⁾, b⁽ˢ⁾, a⁽ˢ⁾), s = 1, 2, ..., S} is an MCMC sample from the joint posterior distribution π(θ, b, a|D). As discussed in DiCiccio et al. (1997) and Xie et al. (2011), the HM estimator is not guaranteed to have finite variance; whether it does depends on the specification of the prior distributions.
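Because the likelihood values underflow in double precision, the HM estimator is best evaluated on the log scale; a minimal sketch, assuming the posterior log-likelihood draws are already available:

```python
import numpy as np
from scipy.special import logsumexp

def hm_log_marginal(log_lik_draws):
    """Harmonic-mean estimate of log M(D) from posterior log-likelihood draws.
    log { [ (1/S) sum_s exp(-logL_s) ]^{-1} } = log S - logsumexp(-logL)."""
    log_lik_draws = np.asarray(log_lik_draws, dtype=float)
    S = len(log_lik_draws)
    return np.log(S) - logsumexp(-log_lik_draws)

# If all draws have the same log-likelihood -100, the estimate equals -100
# up to floating-point rounding.
val = hm_log_marginal([-100.0, -100.0, -100.0])
```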
2.2.3 The Chen Method
Based on the idea in Chen (2005), we have the identity

$$\begin{aligned}
L(b^*, a^* \mid D) &= \int L(\theta, b^*, a^* \mid D)\, \pi(\theta)\, d\theta \\
&= \int L(\theta, b^*, a^* \mid D)\, \pi(\theta)\, g(b, a, \mu_b, \psi \mid \theta)\, d\theta\, db\, da\, d\mu_b\, d\psi \\
&= M(D) \int \frac{L(\theta, b, a \mid D)\, \pi(\theta, b, a, \mu_b, \psi)}{M(D)}\, \frac{g(b, a, \mu_b, \psi \mid \theta)}{\pi(b, a, \mu_b, \psi)}\, \frac{L(\theta, b^*, a^* \mid D)}{L(\theta, b, a \mid D)}\, d\theta\, db\, da\, d\mu_b\, d\psi \\
&= M(D)\, E\left[\frac{g(b, a, \mu_b, \psi \mid \theta)}{\pi(b, a, \mu_b, \psi)}\, \frac{L(\theta, b^*, a^* \mid D)}{L(\theta, b, a \mid D)} \,\Big|\, D\right],
\end{aligned} \qquad (2.10)$$

for any fixed b* and a*, where the expectation is taken with respect to the joint posterior distribution of (θ, b, a, µb, ψ), and g is any normalized density function of (b, a, µb, ψ).
Using the MCMC sample {(θ⁽ˢ⁾, b⁽ˢ⁾, a⁽ˢ⁾, µb⁽ˢ⁾, ψ⁽ˢ⁾), s = 1, 2, ..., S} from the joint posterior distribution π(θ, b, a, µb, ψ|D), an MC estimate of M(D) is given by

$$\hat{M}(D) = L(b^*, a^* \mid D)\left[\frac{1}{S}\sum_{s=1}^{S} \frac{g(b^{(s)}, a^{(s)}, \mu_b^{(s)}, \psi^{(s)} \mid \theta^{(s)})}{\pi(b^{(s)}, a^{(s)}, \mu_b^{(s)}, \psi^{(s)})}\, \frac{L(\theta^{(s)}, b^*, a^* \mid D)}{L(\theta^{(s)}, b^{(s)}, a^{(s)} \mid D)}\right]^{-1}. \qquad (2.11)$$
In (2.10) and (2.11), the optimal choice for g is

$$g_0 = \pi(b, a, \mu_b, \psi \mid \theta, D) = \pi(\mu_b, \psi \mid b, a, D)\, \pi(a \mid \theta, b, D)\, \pi(b \mid \theta, D), \qquad (2.12)$$

which minimizes $\mathrm{Var}\left(\frac{g(b^{(s)}, a^{(s)}, \mu_b^{(s)}, \psi^{(s)} \mid \theta^{(s)})}{\pi(b^{(s)}, a^{(s)}, \mu_b^{(s)}, \psi^{(s)})}\, \frac{L(\theta^{(s)}, b^*, a^* \mid D)}{L(\theta^{(s)}, b^{(s)}, a^{(s)} \mid D)}\right)$ according to Chen (2005). However, since no analytical form is available for the normalized distribution π(b|θ, D), it may be too costly to compute π(b|θ, D) for each sample of θ. A "sub-optimal" alternative for g is

$$g_1 = \pi(b, a, \mu_b, \psi \mid \theta^*, D) = \pi(\psi \mid b, a, D)\, \pi(\mu_b \mid b, a, \psi, D)\, \pi(a \mid \theta^*, b, D)\, \pi(b \mid \theta^*, D),$$
where π(ψ|b, a, D) and π(µb|b, a, ψ, D) can be computed using the closed forms derived in Section 2.1. We need to obtain the normalizing constant for each π(aj|θ*, b, D), whose kernel is given in (2.8). π(b|θ*, D) can be obtained using the conditional marginal density estimator (CMDE) (Gelfand et al., 1992); writing $\tilde{b}$ for the integration variable to distinguish it from the fixed argument b, we have

$$\begin{aligned}
\pi(b \mid \theta^*, D) &= \int \pi(b \mid \theta^*, a, \mu_b, \psi, D)\, \pi(\tilde{b}, a, \mu_b, \psi \mid \theta^*, D)\, da\, d\mu_b\, d\psi\, d\tilde{b} \\
&= \int \prod_{j=1}^{J}\left[C_{b_j}^{-1} \prod_{i=1}^{n} \frac{\exp\{y_{ij}a_j(\theta_i^* - b_j)\}}{1 + \exp\{a_j(\theta_i^* - b_j)\}} \exp\left(-\frac{\psi(b_j - \mu_b)^2}{2}\right)\right] \pi(\tilde{b}, a, \mu_b, \psi \mid \theta^*, D)\, da\, d\mu_b\, d\psi\, d\tilde{b},
\end{aligned}$$

where $C_{b_j} = \int \prod_{i=1}^{n} \frac{\exp\{y_{ij}a_j(\theta_i^* - b_j)\}}{1 + \exp\{a_j(\theta_i^* - b_j)\}} \exp\left(-\frac{\psi(b_j - \mu_b)^2}{2}\right) db_j$ is the normalizing constant for the conditional distribution π(bj|θ*, a, µb, ψ, D), which can be computed numerically. Thus we can draw a batch of MCMC samples $\{(\tilde{b}^{(s)}, a^{(s)}, \mu_b^{(s)}, \psi^{(s)})\}$ from π(b, a, µb, ψ|θ*, D) and obtain π(b|θ*, D) for each b from the original MCMC samples from π(θ, b, a, µb, ψ|D). However, this approach requires two MCMC outputs, which is computationally expensive.
Another "fast" alternative choice of g is g₂ = π(µb, ψ|b, a, D) h₁(a) h₂(b), where $h_1(a) = \prod_{j=1}^{J} h_{1j}(a_j)$ and h₁ⱼ is the density of N(2.5, (2/3)²) truncated to [0.5, 2.5]. Here $h_2(b) = \prod_{j=1}^{J} h_{2j}(b_j)$, where h₂ⱼ is the density of $N(\hat{b}_j, \nu_j^2)$, $\hat{b}_j$ is the posterior median of bⱼ, and νⱼ is a constant, set here to 1/6 of the range of the posterior samples of bⱼ. By choosing g₂ over g₁, we only need the MCMC output from π(θ, b, a, µb, ψ|D), and the computational cost is greatly reduced. For L(b*, a*|D), we can compute it by multiplying n one-dimensional integrals together:

$$L(b^*, a^* \mid D) = \int L(\theta, b^*, a^* \mid D)\, \pi(\theta)\, d\theta \qquad (2.13)$$
$$= \prod_{i=1}^{n} \int L(\theta_i, b^*, a^* \mid D)\, \pi(\theta_i)\, d\theta_i. \qquad (2.14)$$
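Since π(θi) is standard normal, each of the n one-dimensional integrals in (2.14) can be approximated by Gauss-Hermite quadrature; the helper below is our own sketch (with an illustrative node count) and returns log L(b*, a*|D).

```python
import numpy as np
from scipy.special import logsumexp

def log_L_items(y, b_star, a_star, n_quad=41):
    """log L(b*, a* | D): for each subject i, integrate the 2PL likelihood
    against the N(0, 1) prior of theta_i via Gauss-Hermite quadrature,
    then sum the n per-subject log-integrals, as in (2.13)-(2.14)."""
    x, w = np.polynomial.hermite.hermgauss(n_quad)
    nodes = np.sqrt(2.0) * x                   # quadrature nodes for N(0, 1)
    log_w = np.log(w) - 0.5 * np.log(np.pi)    # matching log-weights
    # log P(y_ij | theta = node_k) summed over items: shape (n_quad, n)
    eta = a_star[None, None, :] * (nodes[:, None, None] - b_star[None, None, :])
    logp = (y[None, :, :] * eta - np.log1p(np.exp(eta))).sum(axis=2)
    return float(logsumexp(log_w[:, None] + logp, axis=0).sum())

# One subject, one item with a* = 1, b* = 0, y = 1: the integral is
# E[logistic(theta)] = 0.5 by symmetry, so the result is log(0.5).
val = log_L_items(np.array([[1]]), np.array([0.0]), np.array([1.0]))
```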
So far we have introduced three different Monte Carlo approaches and all of them
require only a single MC or MCMC output to compute the marginal likelihood. The
following two approaches require multiple MCMC outputs.
2.2.4 The Chib Method
Motivated by the idea in Chib (1995), we can make use of the identity

$$\log M(D) = \log L(D \mid \theta^*, b^*, a^*) + \log \pi(\theta^*, b^*, a^*, \mu_b^*, \psi^*) - \log \pi(\theta^*, b^*, a^*, \mu_b^*, \psi^* \mid D) \qquad (2.15)$$

for any fixed (θ*, b*, a*, µb*, ψ*). Since the analytical forms of the data likelihood and the prior distribution are available, we only need to estimate π(θ*, b*, a*, µb*, ψ*|D).

There are different choices of how to compute π(θ*, b*, a*, µb*, ψ*|D), though not all of them are efficient. Considering the conditional independence properties of the θi's, bj's, and aj's discussed in Section 2.1, along with the fact that in practice the number of test subjects is usually greater than the number of items, i.e., n > J, we can compute π(θ*, b*, a*, µb*, ψ*|D) by

$$\begin{aligned}
\log \pi(\theta^*, b^*, a^*, \mu_b^*, \psi^* \mid D) = {}& \log \pi(\theta^* \mid b^*, a^*, \mu_b^*, \psi^*, D) + \log \pi(\mu_b^* \mid b^*, a^*, \psi^*, D) \\
&+ \log \pi(\psi^* \mid b^*, a^*, D) + \log \pi(a^* \mid b^*, D) + \log \pi(b^* \mid D).
\end{aligned} \qquad (2.16)$$
For the first term on the right side of (2.16), based on (2.6), we have

$$\log \pi(\theta^* \mid b^*, a^*, \mu_b^*, \psi^*, D) = \sum_{i=1}^{n} \log \pi(\theta_i^* \mid b^*, a^*, \mu_b^*, \psi^*, D) = \sum_{i=1}^{n} \log\left(\frac{1}{c_{i,0}} \prod_{j=1}^{J}\left[\frac{\exp\{y_{ij}a_j^*(\theta_i^* - b_j^*)\}}{1 + \exp\{a_j^*(\theta_i^* - b_j^*)\}}\right] \exp\{-(\theta_i^*)^2/2\}\right), \qquad (2.17)$$

where c_{i,0} is the normalizing constant for the distribution π(θi|b*, a*, µb*, ψ*, D). It can be obtained numerically as $c_{i,0} = \int_{-\infty}^{\infty} \prod_{j=1}^{J}\left[\frac{\exp\{y_{ij}a_j^*(\theta_i - b_j^*)\}}{1 + \exp\{a_j^*(\theta_i - b_j^*)\}}\right] \exp\{-\theta_i^2/2\}\, d\theta_i$; the numerical integration can be done in many existing software packages such as R and MATLAB.
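For instance, in Python (an illustration under our notation; the dissertation used R and MATLAB), c_{i,0} can be computed with a one-dimensional adaptive quadrature:

```python
import numpy as np
from scipy.integrate import quad

def c_i0(y_i, b_star, a_star):
    """Normalizing constant c_{i,0}: the integral over theta of subject i's
    2PL likelihood kernel times exp(-theta^2 / 2)."""
    def kernel(theta):
        eta = a_star * (theta - b_star)
        return np.exp(np.sum(y_i * eta - np.log1p(np.exp(eta))) - 0.5 * theta**2)
    val, _err = quad(kernel, -np.inf, np.inf)
    return val

# One item with a* = 1, b* = 0 and a correct answer: the integral equals
# 0.5 * sqrt(2 * pi) by the symmetry logistic(t) + logistic(-t) = 1.
c = c_i0(np.array([1]), np.array([0.0]), np.array([1.0]))
```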
For π(µb*|b*, a*, ψ*, D) and π(ψ*|b*, a*, D) in (2.16), we have the closed forms derived in Section 2.1 and just need to plug in the (b*, a*, µb*, ψ*) values. For π(b*|D) in (2.16), using the importance-weighted marginal density estimator (IWMDE) idea (Chen, 1994), we have

$$\begin{aligned}
\pi(b^* \mid D) &= \int \pi(b^*, \theta, a, \mu_b, \psi \mid D)\, w(b \mid \theta, a, \mu_b, \psi)\, d\theta\, db\, da\, d\mu_b\, d\psi \\
&= \int \frac{\pi(b^*, \theta, a, \mu_b, \psi \mid D)}{\pi(b, \theta, a, \mu_b, \psi \mid D)}\, w(b \mid \theta, a, \mu_b, \psi)\, \pi(b, \theta, a, \mu_b, \psi \mid D)\, d\theta\, db\, da\, d\mu_b\, d\psi \\
&= \int \frac{\pi(b^* \mid \theta, a, \mu_b, \psi, D)}{\pi(b \mid \theta, a, \mu_b, \psi, D)}\, w(b \mid \theta, a, \mu_b, \psi)\, \pi(b, \theta, a, \mu_b, \psi \mid D)\, d\theta\, db\, da\, d\mu_b\, d\psi,
\end{aligned}$$

where w(b|θ, a, µb, ψ) can be any conditional density satisfying $\int w(b \mid \theta, a, \mu_b, \psi)\, db = 1$, as long as its support is contained within the support of π(b|θ, a, µb, ψ, D). The optimal choice for w is the conditional posterior distribution π(b|θ, a, µb, ψ, D), and the estimator then coincides with the CMDE. With the optimal choice of w, we have

$$\begin{aligned}
\pi(b^* \mid D) &= \int \pi(b^* \mid \theta, a, \mu_b, \psi, D)\, \pi(\theta, b, a, \mu_b, \psi \mid D)\, d\theta\, db\, da\, d\mu_b\, d\psi \\
&= \int \prod_{j=1}^{J}\left[U_{b_j}^{-1} \prod_{i=1}^{n} \frac{\exp\{y_{ij}a_j(\theta_i - b_j^*)\}}{1 + \exp\{a_j(\theta_i - b_j^*)\}} \exp\left(-\frac{\psi(b_j^* - \mu_b)^2}{2}\right)\right] \pi(\theta, b, a, \mu_b, \psi \mid D)\, d\theta\, db\, da\, d\mu_b\, d\psi,
\end{aligned} \qquad (2.18)$$

where $U_{b_j} = \int \prod_{i=1}^{n} \frac{\exp\{y_{ij}a_j(\theta_i - b_j)\}}{1 + \exp\{a_j(\theta_i - b_j)\}} \exp\left(-\frac{\psi(b_j - \mu_b)^2}{2}\right) db_j$ is the normalizing constant for π(bj|θ, a, µb, ψ, D).
For π(a*|b*, D) in (2.16), similarly we have

$$\begin{aligned}
\pi(a^* \mid b^*, D) &= \int \frac{\pi(\theta, a^*, \mu_b, \psi \mid b^*, D)}{\pi(\theta, a, \mu_b, \psi \mid b^*, D)}\, w(a \mid \theta, b^*)\, \pi(\theta, a, \mu_b, \psi \mid b^*, D)\, d\theta\, da\, d\mu_b\, d\psi \\
&= \int \frac{\pi(\theta, b^*, a^*, \mu_b, \psi \mid D)}{\pi(\theta, b^*, a, \mu_b, \psi \mid D)}\, w(a \mid \theta, b^*)\, \pi(\theta, a, \mu_b, \psi \mid b^*, D)\, d\theta\, da\, d\mu_b\, d\psi,
\end{aligned}$$

where π(θ, a, µb, ψ|b*, D) is a conditional posterior density whose samples can be generated by fixing b at b* during the sampling process. The weight function w(a|θ, b) is a normalized density. According to the result in Chen (1994), the optimal choice is w = π(a|θ, b*, D), which can be written as

$$w(a \mid \theta, b^*) = \pi(a \mid \theta, b^*, D) = \prod_{j=1}^{J}\left[C_{a_j}^{-1} \prod_{i=1}^{n} \frac{\exp\{y_{ij}a_j(\theta_i - b_j^*)\}}{1 + \exp\{a_j(\theta_i - b_j^*)\}}\right],$$

where $C_{a_j} = \int_{0.5}^{2.5} \prod_{i=1}^{n} \frac{\exp\{y_{ij}a_j(\theta_i - b_j^*)\}}{1 + \exp\{a_j(\theta_i - b_j^*)\}}\, da_j$ is the normalizing constant of π(aj|θ, b*, D). Therefore we have

$$\pi(a^* \mid b^*, D) = \int \prod_{j=1}^{J} C_{a_j}^{-1} \prod_{i=1}^{n}\prod_{j=1}^{J} \frac{\exp\{y_{ij}a_j^*(\theta_i - b_j^*)\}}{1 + \exp\{a_j^*(\theta_i - b_j^*)\}}\, \pi(\theta, a, \mu_b, \psi \mid b^*, D)\, d\theta\, da\, d\mu_b\, d\psi. \qquad (2.19)$$

Hence we can take MCMC samples from π(θ, a, µb, ψ|b*, D) and obtain π(a*|b*, D) by taking the sample average of $\prod_{j=1}^{J} C_{a_j}^{-1} \prod_{i=1}^{n}\prod_{j=1}^{J} \frac{\exp\{y_{ij}a_j^*(\theta_i - b_j^*)\}}{1 + \exp\{a_j^*(\theta_i - b_j^*)\}}$.
Now we see that to compute (2.16), we need two MCMC outputs in total, from
π(θ,a, µb, ψ|b∗, D) and π(θ,a, µb, ψ, b|D). An extension of the Chib method is to
utilize the Metropolis-Hastings (MH) output to compute the marginal likelihood
(Chib and Jeliazkov, 2001), which requires knowing the proposal distributions used
in the MH sampling algorithm. Since we assume that the MCMC samples are
obtained using the existing software (such as NIMBLE in R), without knowing its
sampling mechanism, this approach may not be feasible to implement for the IRT
models.
2.2.5 The Stepping Stone Method (SSM)
Lartillot and Philippe (2006) and Friel and Pettitt (2008) independently developed the thermodynamic integration (TI) method, which is also known as the power posterior method. As mentioned in Lartillot and Philippe (2006), the TI method has discretization error and requires more time for sampling from multiple power posteriors. The stepping stone method (Xie et al., 2011) uses a similar idea without discretization error; its basic idea is outlined as follows.
Denote

$$L(D \mid \theta, b, a, \beta_k) = L(D \mid \theta, b, a)^{\beta_k} = \prod_{i=1}^{n}\prod_{j=1}^{J} \frac{\exp\{\beta_k y_{ij}a_j(\theta_i - b_j)\}}{[1 + \exp\{a_j(\theta_i - b_j)\}]^{\beta_k}}. \qquad (2.20)$$

Let $q_{\beta_k} = L(D \mid \theta, b, a, \beta_k)\, \pi(\theta, b, a, \mu_b, \psi)$ be the kernel of the power posterior density associated with βₖ, and let $p_{\beta_k} = q_{\beta_k}/C_{\beta_k}$ be the normalized power posterior density, where $C_{\beta_k}$ is the corresponding normalizing constant.

Let {βₖ, k = 0, 1, ..., K} be a non-decreasing sequence between 0 and 1 such that β₀ = 0 and β_K = 1. The marginal likelihood can be expressed as

$$M(D) \equiv r_{SS} = \frac{C_{\beta_K}}{C_{\beta_0}} = \prod_{k=1}^{K} \frac{C_{\beta_k}}{C_{\beta_{k-1}}} \equiv \prod_{k=1}^{K} r_{SS,k}.$$
We can compute each r_{SS,k} by

$$\begin{aligned}
r_{SS,k} &= \frac{\int L(D \mid \theta, b, a)^{\beta_k}\, \pi(\theta, b, a, \mu_b, \psi)\, d\theta\, db\, da\, d\mu_b\, d\psi}{C_{\beta_{k-1}}} \\
&= \int L(D \mid \theta, b, a)^{\beta_k - \beta_{k-1}}\, \frac{L(D \mid \theta, b, a)^{\beta_{k-1}}\, \pi(\Theta)}{C_{\beta_{k-1}}}\, d\Theta \\
&= E_{p_{\beta_{k-1}}}\!\left[L(D \mid \theta, b, a)^{\beta_k - \beta_{k-1}}\right],
\end{aligned} \qquad (2.21)$$

where Θ = (θ, b, a, µb, ψ). Then we can sample from $p_{\beta_{k-1}}$ and compute the normalizing constant ratio r_{SS,k} as the sample average of $L(D \mid \theta, b, a)^{\beta_k - \beta_{k-1}}$, and the log marginal likelihood can be estimated as

$$\log \hat{M}(D)_{SS} = \log(\hat{r}_{SS}) = \sum_{k=1}^{K} \log(\hat{r}_{SS,k}). \qquad (2.22)$$

In practice, according to Xie et al. (2011), a viable choice for $\{\beta_k\}_{k=0}^{K}$ is the K + 1 evenly spaced quantiles on [0, 1] of the Beta(0.3, 1.0) distribution. In Section 2.3.1, we choose a moderate K = 20; therefore we need in total K + 1 = 21 MCMC outputs from the corresponding power posteriors to carry out the marginal likelihood computation.
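A sketch of the β schedule and of the combination step (2.21)-(2.22), assuming the per-power-posterior log-likelihood draws are already available:

```python
import numpy as np
from scipy.stats import beta
from scipy.special import logsumexp

def beta_schedule(K, a=0.3, b=1.0):
    """K + 1 evenly spaced quantiles of Beta(a, b), so beta_0 = 0 and beta_K = 1."""
    return beta.ppf(np.linspace(0.0, 1.0, K + 1), a, b)

def ss_log_marginal(log_lik_by_power, betas):
    """Combine draws into log r_SS = sum_k log r_{SS,k} as in (2.21)-(2.22).
    log_lik_by_power[k] holds log L(D|.) draws from p_{beta_k}; the ratio
    r_{SS,k} uses the draws from p_{beta_{k-1}}, averaged in log space."""
    total = 0.0
    for k in range(1, len(betas)):
        d = np.asarray(log_lik_by_power[k - 1], dtype=float)
        total += logsumexp((betas[k] - betas[k - 1]) * d) - np.log(len(d))
    return total

bs = beta_schedule(20)
# If every draw has the same log-likelihood c, the telescoping sum returns c.
check = ss_log_marginal([[-50.0] * 10] * 5, beta_schedule(5))
```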
The generalized stepping stone method (Fan et al., 2010) extends the stepping stone method by introducing a reference distribution when constructing the power posterior, and it reduces to the original stepping stone method when the prior is chosen as the reference distribution. The generalized SS estimator can be advantageous compared with the original SS method when the reference distribution approximates the posterior distribution well, which is difficult to achieve here due to the high dimension of the parameter space. Thus we do not consider implementing the generalized SS method for the IRT models.
Table 2.1: Results of DIC and WAIC for the 1PL model and the 2PL model.
Model   DIC        WAIC
1PL     23262.08   23434.48
2PL     22404.74   22636.92
2.3 Analysis of English Examination Data
We extract 50 objective questions from the English Examination dataset introduced in Chapter 1.3 for the analysis. The overall accuracy across all items is 0.607. The item with the highest accuracy (0.929) is item 10, and the item with the lowest accuracy (0.177) is item 45.
We fit the 2PL model to the English examination data introduced in Chapter 1; the posterior estimates for the discrimination parameters a are displayed in Figure 2.1. Quite a few of the 95% credible intervals for the aj's do not include 1, which implies that the 2PL model may be a better fit for the data. Using the posterior medians for a and b, the estimated item characteristic curves (ICC) for items 1, 4, 5, 8, and 20 are displayed in Figure 2.2. The difficulty parameter bj determines the location of the curve on the θ axis, with P(y = 1|θ = bj, bj, aj) = 0.5 regardless of the discrimination parameter aj. The posterior medians of the discrimination parameters for these items are a1 = 2.38, a4 = 2.04, a5 = 1.45, a8 = 0.99, and a20 = 0.65, which lead to the different increasing trends of the ICCs in Figure 2.2.
We also obtain the widely applicable information criterion (WAIC) (Watanabe,
2010) and deviance information criterion (DIC) (Spiegelhalter et al., 2002) values
for the 1PL and 2PL models in Table 2.1. Both information criteria favor the 2PL
model over the 1PL model.
[Figure 2.1: Posterior medians and 95% credible intervals of aj's under the 2PL model.]
2.3.1 Results of Marginal Likelihood Computation
The marginal likelihood of an IRT model is intrinsically small, and certain techniques are required to control numerical precision when implementing each of the five Monte Carlo methods discussed in Section 2.2. Table 2.2 shows the log marginal likelihoods of the 1PL and 2PL models, and Table 2.3 summarizes the time used for sampling and computing separately for each method. A MacBook computer with a 2.2 GHz Intel Core i7 processor was used for the sampling process with NIMBLE in R, and the calculations of each estimator were carried out in MATLAB on a Windows computer with a 3.3 GHz Intel Core i5 processor. For
the PMC approach, 500,000 i.i.d. samples are taken from the prior distribution to obtain the point estimate $\hat{M}(D)_{PMC,0}$. We obtain another 100 independent point estimates $\hat{M}(D)_{PMC,i}$, i = 1, ..., 100, each calculated with 50,000 i.i.d. samples, and estimate the Monte Carlo standard error as $\sqrt{\frac{1}{100}\sum_{i=1}^{100}\left(\hat{M}(D)_{PMC,i} - \hat{M}(D)_{PMC,0}\right)^2}$.

[Figure 2.2: Estimated item characteristic curves for items 1, 4, 5, 8, and 20.]
For the other Monte Carlo approaches, 65,000 samples are drawn with a burn-in of 5,000 samples for each MCMC output. First, we obtain a point estimate $\hat{M}(D)_{*,0}$ using all 60,000 MCMC samples, where * denotes any particular method among the HM method, the Chen method, the Chib method, and the SSM. Then we divide the 60,000 Monte Carlo samples into 100 non-overlapping batches, each with 600 samples, and obtain 100 point estimates of the marginal likelihood $\hat{M}(D)_{*,i}$, i = 1, ..., 100, one from each batch. Finally, the Monte Carlo standard error is obtained as $\sqrt{\frac{1}{100}\sum_{i=1}^{100}\left(\hat{M}(D)_{*,i} - \hat{M}(D)_{*,0}\right)^2}$.
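The batch-based standard error above is straightforward to compute; a small helper (ours, not from the dissertation's code):

```python
import numpy as np

def batch_mc_se(point_estimates, overall_estimate):
    """Monte Carlo standard error from batch estimates around the overall
    estimate: sqrt( (1/B) * sum_i (M_i - M_0)^2 ) for B batches."""
    d = np.asarray(point_estimates, dtype=float) - overall_estimate
    return float(np.sqrt(np.mean(d**2)))

print(batch_mc_se([1.0, 3.0], 2.0))  # 1.0
```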
Table 2.2: Log marginal likelihoods obtained by each method and corresponding Monte Carlo standard errors in parentheses.

Model   PMC         HM          Chib        Chen        SSM
1PL     -18485.98   -11557.61   -12184.06   -12183.06   -12183.10
        (872.99)    (21.51)     (0.22)      (1.71)      (13.20)
2PL     -16249.33   -11135.98   -11878.45   -11867.16   -11881.45
        (254.73)    (24.89)     (0.64)      (3.33)      (14.64)

Table 2.3: Time usage to obtain the estimates by each method under the 1PL and 2PL IRT models.

Model   Time        PMC     HM      Chib    Chen    SSM
1PL     Sampling    4.2s    1088s   1103s   1092s   4.8h
        Computing   450s    65s     13.6h   314s    401s
2PL     Sampling    4.5s    1273s   1785s   1267s   6.5h
        Computing   472s    70s     20.7h   360s    440s

Here "s" denotes "second" and "h" refers to "hour".
From Table 2.2, we can see that the point estimates of the log marginal likelihood for the 2PL model are much larger than those of the 1PL model for all methods, which is consistent with our earlier finding from Figure 2.1 that the 2PL model should fit the data better than the 1PL model. Although we do not know the "true values" of the marginal likelihoods for the 1PL and 2PL models, Table 2.2 shows that the prior-based estimators have large Monte Carlo standard errors and that their point estimates are quite far from those computed by the other methods, despite the 500,000 i.i.d. samples used. Compared with the PMC estimates, the point estimates produced by the HM method are much closer to those of the Chib method, the Chen method, and the SSM, though they are still off by a few hundred on the log scale.
The Chib method, the Chen method, and the SSM have similar point estimates for the marginal likelihoods. Among these three methods, the estimated Monte Carlo standard errors of the Chib method are rather small compared with the other two; however, the computation time for the Chib method is noticeably longer. The Chen method requires the least computing time (only one MCMC output) among the three.
To examine the sensitivity of the marginal likelihood under the 2PL model with respect to the specification of π(aj), we use the Chib method to compute the marginal likelihoods with three log-normal priors, LN(0.5, 1), LN(0.5, 1/5), and LN(0.5, 25), where LN denotes the log-normal distribution. We keep 4,000 MCMC samples after a burn-in of 8,000 samples for each MCMC output; the log marginal likelihoods under the 2PL model are -11954, -11959, and -11952, corresponding to LN(0.5, 1), LN(0.5, 1/5), and LN(0.5, 25), respectively. The results still prefer the 2PL model over the 1PL model.
2.4 Discussion
Throughout the data analysis, we assume the MCMC samples are obtained from existing software without knowing the mechanism by which they are sampled. We have observed some quite interesting results for these five Monte Carlo methods. The prior-based method has a consistency guarantee by the law of large numbers, the simplest form among the five methods, and the shortest time per sample; nevertheless, its results are not satisfactory, at least under our specific setting. As mentioned in Lewis and Raftery (1997), "unfortunately it is generally necessary to obtain a very large number of draws from the prior distribution" before it becomes a good estimator. All its benefits may be eclipsed if we need one million or more i.i.d. samples in each run to obtain reliable estimates. For the HM approach, despite the potential infinite-variance problem, the HM estimates are noticeably better than those of the PMC. For the Chen method, though the implementation of its original form is resource demanding, a "fast" alternative can be used with a single MCMC output, and it produces results very similar to those of the Chib method and the SSM. The Chib method and the SSM both require multiple MCMC outputs to compute the marginal likelihood for IRT models, and they produce very close point estimates. Both take longer than the other three methods, and the computational time of the SSM depends on the choice of the number of power posteriors, K.
The computational cost comparison might be fairer if we chose the number of iterations for each method such that the standard errors of the estimated log marginal likelihoods are comparable. We applied the Chib method to the 2PL model with 6,000 samples: the log marginal likelihood and the corresponding Monte Carlo standard error are -11878.3 and 5.61, and the computing time is about 7,600 seconds.
The results might be sensitive to the choice of a* and b* when applying the Chib method and the Chen method, and it would be interesting to examine in future work how robust the results are to different choices of a* and b*.
Chapter 3
Variable Selection in Dynamic Item Response
Theory Models via Bayes Factors with a Single
MCMC Output
3.1 Proposed Model
We follow the basic infrastructure of the dynamic IRT model proposed in Wang et al. (2013) and build a dynamic IRT model to accommodate the data structure of the Edsphere dataset. At the first level of the model, the correctness of the items is linked to the latent ability of each individual, the difficulty of the items, and certain random effects. At the second level, the mean of the latent ability on the current day is assumed to be the sum of the latent ability on the previous day and an increment related to the time lapse. We want to investigate the relationship between the individual "growth factor" and the characteristics of test takers, such as socio-economic status, gender, grade, etc. We propose the following dynamic IRT model, which incorporates this covariate information through ci:
$$\text{Level 1: } \Pr(X_{i,t,s,l} = 1 \mid \theta_{i,t}, \eta_{i,t,s}, d_{i,t,s,l}) = \Phi(\theta_{i,t} - d_{i,t,s,l} + \eta_{i,t,s}), \qquad (3.1)$$
$$\text{Level 2: } \theta_{i,t} = \theta_{i,t-1} + c_i(1 - \rho\theta_{i,t-1})\Delta_{i,t}^{+} + \omega_{i,t}, \quad \omega_{i,t} \sim N(0, \phi^{-1}\Delta_{i,t}), \qquad (3.2)$$
$$\text{Level 3: } \log(c) \sim N(1\alpha + Z\beta, \psi^{-1}I_n), \qquad (3.3)$$
where X_{i,t,s,l} represents the correctness of the l-th item in the s-th test on the t-th day taken by subject i, for i = 1, 2, ..., n, t = 1, 2, ..., T_i, s = 1, 2, ..., S_{i,t}, and l = 1, 2, ..., K_{i,t,s}; d_{i,t,s,l} denotes the corresponding item difficulty. θ_{i,t} denotes the latent ability of individual i on test day t, η is the exam-wise random effect, and we assume η_{i,t} = (η_{i,t,1}, ..., η_{i,t,S_{i,t}})′ ∼ N(0, τ_i⁻¹I) for i = 1, ..., n, t = 1, ..., T_i, and d_{i,t,s,l} ∼ N(a_{i,t,s}, σ²), where both a_{i,t,s} and σ are known from the test design.

At level 2, ρ is a known positive constant, and ∆_{i,t} is the time lapse between the t-th and the (t − 1)-th test days of subject i, t = 1, ..., T_i. The term ∆⁺_{i,t} = min(∆_{Tmax}, ∆_{i,t}) is the time lapse truncated at a pre-specified maximum time interval ∆_{Tmax}; it reflects the fact that a student's ability may not grow proportionally to the time lapse when the gap exceeds a certain number of days, as may happen in practice when students take a vacation. We set ∆_{Tmax} = 14 days, as in Wang et al. (2013).

At level 3, Z is the n × p design matrix of the covariates, and we assume each column has been centered, i.e., 1′Z = 0. We use α to denote the intercept and β the p × 1 vector of coefficients for the covariates.
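To illustrate the level-2 dynamics, the sketch below simulates one subject's latent-ability path; all parameter values (ρ, φ, c_i, the visit days) are illustrative, not estimates from the Edsphere data.

```python
import numpy as np

def simulate_theta_path(c_i, days, rho=0.05, phi=4.0, dt_max=14, theta0=0.0, seed=0):
    """Simulate one subject's latent-ability path under level 2 of (3.2):
    theta_t = theta_{t-1} + c_i (1 - rho*theta_{t-1}) * min(dt, dt_max) + noise,
    with noise ~ N(0, dt / phi). Parameter values here are illustrative."""
    rng = np.random.default_rng(seed)
    theta = [theta0]
    for t in range(1, len(days)):
        dt = days[t] - days[t - 1]
        drift = c_i * (1.0 - rho * theta[-1]) * min(dt, dt_max)
        theta.append(theta[-1] + drift + rng.normal(0.0, np.sqrt(dt / phi)))
    return np.array(theta)

# A 25-day gap between the last two visits is truncated at dt_max = 14 days.
path = simulate_theta_path(c_i=0.05, days=[0, 2, 5, 30])
print(path.shape)  # (4,)
```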
We aim to perform variable selection among the covariates at level 3, i.e., to find all non-zero elements in β, while treating all other model structures as common setups. Liang et al. (2008) showed some attractive variable selection properties of the Bayes factor with mixtures of g-priors in the linear model setting, which motivates us to apply the Bayes factor with mixtures of g-priors in our hierarchical model setup. As discussed in Liang et al. (2008), the marginal likelihoods of all models have closed-form expressions when g is fixed, and the Bayes factor can be computed via Laplace approximation when the Inv-Gamma(1/2, n/2) prior is imposed on g, i.e., when the Zellner-Siow prior is used. Under our hierarchical model framework, however, analytical forms are available for neither the marginal likelihoods nor the Bayes factors, whether g is fixed or random, and it does not seem feasible to use the Laplace approximation to obtain the Bayes factor either. Therefore a tractable computational algorithm is needed to obtain the Bayes factors.
3.2 Computation of Bayes Factors with a Single MCMC Output
To compute marginal likelihoods that are not analytically available, there are many Monte Carlo methods, such as the harmonic mean method (Newton and Raftery, 1994), the Chib method (Chib, 1995), and the stepping stone method (Xie et al., 2011), to name a few. Due to the complex structure and high dimensionality of the parameter space in the dynamic IRT model, computing the marginal likelihood of each model with existing Monte Carlo methods is a difficult task. Motivated by Chen (2005), and taking advantage of the nested model structure at level 3, a computationally viable alternative is to obtain the Bayes factor of each model against the full model using only a single MCMC output.
To formally define the models under comparison, we introduce the following notation. Let MN be the null model in which all β coefficients are 0, i.e., β = 0, and let MF denote the full model in which all β coefficients are considered non-zero. Let Mγ denote the model with β₋γ = 0, where (βγ, β₋γ) = β. Let pγ be the length of βγ, so that pγ + p₋γ = p for any γ. The setups at levels 1 and 2 are common to all models.

Priors for all model parameters are specified as follows: (i) assign τi ∼ Γ(0.01, 100) independently for i = 1, 2, ..., n, and φ ∼ Γ(1, 1000); (ii) use the Zellner-Siow prior discussed in the null-based approach of Liang et al. (2008): π(α, ψ) ∝ ψ⁻¹, g ∼ Inv-Gamma(1/2, n/2), and βγ ∼ N(0, gψ⁻¹(Zγ′Zγ)⁻¹). A uniform prior on the model space is used.
Based on the linear structure at level 3, we have the identity

$$\pi(c \mid \alpha, \beta_\gamma, \psi, \mathcal{M}_\gamma) = \pi(c \mid \alpha, \beta_\gamma, \beta_{-\gamma} = 0, \psi, \mathcal{M}_F), \qquad (3.4)$$

which leads to the following proposition.
Proposition 3.2.1. For Mγ ≠ MN, the Bayes factor of model Mγ to the full model MF under the dynamic IRT model framework is

$$BF_H[\mathcal{M}_\gamma : \mathcal{M}_F] = E\left[\frac{\pi(\beta_\gamma \mid \psi, g, \mathcal{M}_\gamma)}{\pi(\beta_\gamma, \beta_{-\gamma} = 0 \mid \psi, g, \mathcal{M}_F)}\, \pi(\beta_{-\gamma} = 0 \mid c, \alpha, \beta_\gamma, \psi, g, D, \mathcal{M}_F)\right], \qquad (3.5)$$

and the Bayes factor of the null model MN to the full model MF under the dynamic IRT model framework is

$$BF_H[\mathcal{M}_N : \mathcal{M}_F] = E\left[\frac{\pi(\beta = 0 \mid c, \alpha, \psi, g, D, \mathcal{M}_F)}{\pi(\beta = 0 \mid \psi, g, \mathcal{M}_F)\, \pi(g)}\right], \qquad (3.6)$$

where both expectations are taken with respect to π(c, α, βγ, ψ, g|D, MF), the posterior distribution under the full model.
Proof. Denote P = {θ, d, η, τ, φ}. Using the identity (3.4), the marginal likelihood of Mγ can be expressed as

$$\begin{aligned}
M(D \mid \mathcal{M}_\gamma) = {}& \int L(\theta, d, \eta \mid D)\, \pi(d)\, \pi(\eta \mid \tau)\, \pi(\tau)\, \pi(\theta \mid c, \phi)\, \pi(\phi)\, \pi(c \mid \alpha, \beta_\gamma, \psi, \mathcal{M}_\gamma)\, \pi(\alpha)\, \pi(\psi) \\
&\times \pi(\beta_\gamma \mid \psi, g, \mathcal{M}_\gamma)\, \pi(g)\, dP\, dc\, d\alpha\, d\beta_\gamma\, d\psi\, dg \\
= {}& \int L(\theta, d, \eta \mid D)\, \pi(d)\, \pi(\eta \mid \tau)\, \pi(\tau)\, \pi(\theta \mid c, \phi)\, \pi(\phi)\, \pi(c \mid \alpha, \beta_\gamma, \beta_{-\gamma} = 0, \psi, \mathcal{M}_F)\, \pi(\alpha) \\
&\times \pi(\psi)\, \pi(g)\, \pi(\beta_\gamma, \beta_{-\gamma} = 0 \mid \psi, g, \mathcal{M}_F)\, \frac{\pi(\beta_\gamma \mid \psi, g, \mathcal{M}_\gamma)}{\pi(\beta_\gamma, \beta_{-\gamma} = 0 \mid \psi, g, \mathcal{M}_F)}\, dP\, dc\, d\alpha\, d\beta_\gamma\, d\psi\, dg \\
= {}& M(D \mid \mathcal{M}_F) \int \pi(P, c, \alpha, \beta_\gamma, \beta_{-\gamma} = 0, \psi, g \mid D, \mathcal{M}_F)\, \frac{\pi(\beta_\gamma \mid \psi, g, \mathcal{M}_\gamma)}{\pi(\beta_\gamma, \beta_{-\gamma} = 0 \mid \psi, g, \mathcal{M}_F)}\, dP\, dc\, d\alpha\, d\beta_\gamma\, d\psi\, dg \\
= {}& M(D \mid \mathcal{M}_F) \int \frac{\pi(\beta_\gamma \mid \psi, g, \mathcal{M}_\gamma)}{\pi(\beta_\gamma, \beta_{-\gamma} = 0 \mid \psi, g, \mathcal{M}_F)}\, \pi(\beta_{-\gamma} = 0 \mid c, \alpha, \beta_\gamma, \psi, g, D, \mathcal{M}_F) \\
&\times \pi(P, c, \alpha, \beta_\gamma, \psi, g \mid D, \mathcal{M}_F)\, dP\, dc\, d\alpha\, d\beta_\gamma\, d\psi\, dg \\
= {}& M(D \mid \mathcal{M}_F)\, E\left[\frac{\pi(\beta_\gamma \mid \psi, g, \mathcal{M}_\gamma)}{\pi(\beta_\gamma, \beta_{-\gamma} = 0 \mid \psi, g, \mathcal{M}_F)}\, \pi(\beta_{-\gamma} = 0 \mid c, \alpha, \beta_\gamma, \psi, g, D, \mathcal{M}_F)\right],
\end{aligned}$$

which leads to the Bayes factor

$$BF_H[\mathcal{M}_\gamma : \mathcal{M}_F] = \frac{M(D \mid \mathcal{M}_\gamma)}{M(D \mid \mathcal{M}_F)} = E\left[\frac{\pi(\beta_\gamma \mid \psi, g, \mathcal{M}_\gamma)}{\pi(\beta_\gamma, \beta_{-\gamma} = 0 \mid \psi, g, \mathcal{M}_F)}\, \pi(\beta_{-\gamma} = 0 \mid c, \alpha, \beta_\gamma, \psi, g, D, \mathcal{M}_F)\right]. \qquad (3.7)$$
For the null model, similarly we have

$$\begin{aligned}
M(D \mid \mathcal{M}_N) = {}& \int L(\theta, d, \eta \mid D)\, \pi(d)\, \pi(\eta \mid \tau)\, \pi(\tau)\, \pi(\theta \mid c, \phi)\, \pi(\phi)\, \pi(c \mid \alpha, \psi, \mathcal{M}_N)\, \pi(\alpha)\, \pi(\psi)\, dP\, dc\, d\alpha\, d\psi \qquad (3.8) \\
= {}& \int L(\theta, d, \eta \mid D)\, \pi(d)\, \pi(\eta \mid \tau)\, \pi(\tau)\, \pi(\theta \mid c, \phi)\, \pi(\phi)\, \pi(c \mid \alpha, \beta = 0, \psi, \mathcal{M}_F)\, \pi(\alpha) \\
&\times \pi(\psi)\, \pi(g)\, \pi(\beta = 0 \mid \psi, g, \mathcal{M}_F)\, \frac{1}{\pi(\beta = 0 \mid \psi, g, \mathcal{M}_F)\, \pi(g)}\, dP\, dc\, d\alpha\, d\psi\, dg \\
= {}& M(D \mid \mathcal{M}_F) \int \frac{1}{\pi(\beta = 0 \mid \psi, g, \mathcal{M}_F)\, \pi(g)}\, \pi(\beta = 0 \mid c, \alpha, \psi, g, D, \mathcal{M}_F)\, \pi(P, c, \alpha, \psi, g \mid D, \mathcal{M}_F)\, dP\, dc\, d\alpha\, d\psi\, dg \\
= {}& M(D \mid \mathcal{M}_F)\, E\left[\frac{\pi(\beta = 0 \mid c, \alpha, \psi, g, D, \mathcal{M}_F)}{\pi(\beta = 0 \mid \psi, g, \mathcal{M}_F)\, \pi(g)}\right],
\end{aligned}$$

and thus

$$BF_H[\mathcal{M}_N : \mathcal{M}_F] = \frac{M(D \mid \mathcal{M}_N)}{M(D \mid \mathcal{M}_F)} = E\left[\frac{\pi(\beta = 0 \mid c, \alpha, \psi, g, D, \mathcal{M}_F)}{\pi(\beta = 0 \mid \psi, g, \mathcal{M}_F)\, \pi(g)}\right]. \qquad (3.9)$$
We use the symbol BF_H above to emphasize that the Bayes factor is defined under the hierarchical framework of the dynamic IRT model. Based on the results in Proposition 3.2.1, we can compute the Bayes factor of any model Mγ to the full model MF using one single MCMC output from the posterior based on the full model. For any arbitrary pair of models (Mγ, Mγ′), the Bayes factor can be obtained by

$$BF_H[\mathcal{M}_\gamma : \mathcal{M}_{\gamma'}] = \frac{M(D \mid \mathcal{M}_\gamma)}{M(D \mid \mathcal{M}_F)} \Big/ \frac{M(D \mid \mathcal{M}_{\gamma'})}{M(D \mid \mathcal{M}_F)} = \frac{BF_H[\mathcal{M}_\gamma : \mathcal{M}_F]}{BF_H[\mathcal{M}_{\gamma'} : \mathcal{M}_F]}. \qquad (3.10)$$
Regarding the quantity inside the expectation of equation (3.5), with β ordered as (βγ, β₋γ), we have

$$\begin{aligned}
&\frac{\pi(\beta_\gamma \mid \psi, g, \mathcal{M}_\gamma)}{\pi(\beta_\gamma, \beta_{-\gamma} = 0 \mid \psi, g, \mathcal{M}_F)}\, \pi(\beta_{-\gamma} = 0 \mid c, \alpha, \beta_\gamma, \psi, g, D, \mathcal{M}_F) \\
&= \frac{(2\pi)^{-p_\gamma/2}\, \det\{g\psi^{-1}(Z_\gamma'Z_\gamma)^{-1}\}^{-1/2}\, \exp\{-\frac{1}{2}g^{-1}\psi\, \beta_\gamma'(Z_\gamma'Z_\gamma)\beta_\gamma\}}{(2\pi)^{-p/2}\, \det\{g\psi^{-1}(Z'Z)^{-1}\}^{-1/2}\, \exp\{-\frac{1}{2}g^{-1}\psi\, \beta_\gamma'(Z_\gamma'Z_\gamma)\beta_\gamma\}}\, \pi(\beta_{-\gamma} = 0 \mid c, \alpha, \beta_\gamma, \psi, g, D, \mathcal{M}_F) \\
&= (2\pi)^{(p - p_\gamma)/2}\, (g\psi^{-1})^{(p - p_\gamma)/2} \left\{\frac{\det(Z_\gamma'Z_\gamma)}{\det(Z'Z)}\right\}^{1/2} \pi(\beta_{-\gamma} = 0 \mid c, \alpha, \beta_\gamma, \psi, g, D, \mathcal{M}_F),
\end{aligned} \qquad (3.11)$$

where π(β₋γ = 0|c, α, βγ, ψ, g, D, MF) is the conditional normal density evaluated at β₋γ = 0, which can be computed with existing results for conditional normal distributions.
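The conditional normal factor can be evaluated with the standard partitioned-Gaussian formulas; a generic helper is sketched below (building the posterior mean m and covariance Sigma of β from (c, α, ψ, g) is model-specific and omitted, so the names and inputs here are illustrative).

```python
import numpy as np
from scipy.stats import multivariate_normal

def cond_normal_density_at_zero(m, Sigma, idx_given, beta_given):
    """For a joint N(m, Sigma), return the density of the remaining block,
    evaluated at 0, conditional on the 'given' block taking value beta_given."""
    p = len(m)
    idx_rest = np.setdiff1d(np.arange(p), idx_given)
    S11 = Sigma[np.ix_(idx_given, idx_given)]
    S21 = Sigma[np.ix_(idx_rest, idx_given)]
    S22 = Sigma[np.ix_(idx_rest, idx_rest)]
    sol = np.linalg.solve(S11, beta_given - m[idx_given])
    cond_mean = m[idx_rest] + S21 @ sol
    cond_cov = S22 - S21 @ np.linalg.solve(S11, S21.T)
    return multivariate_normal.pdf(np.zeros(len(idx_rest)), cond_mean, cond_cov)

# Independent components: conditioning changes nothing, so the value at 0
# is the standard normal density 1/sqrt(2*pi).
val = cond_normal_density_at_zero(np.zeros(2), np.eye(2),
                                  np.array([0]), np.array([1.0]))
```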
For the quantity inside the expectation of equation (3.6), we have

$$\frac{\pi(\beta = 0 \mid c, \alpha, \psi, g, D, \mathcal{M}_F)}{\pi(\beta = 0 \mid \psi, g, \mathcal{M}_F)\, \pi(g)} = (g + 1)^{p/2} \exp\left\{-\frac{\psi}{2}\, \frac{g}{g + 1}\, \log(c)'Z(Z'Z)^{-1}Z'\log(c)\right\} \frac{\Gamma(1/2)}{(n/2)^{1/2}}\, g^{3/2} e^{n/(2g)}. \qquad (3.12)$$

Hence we are able to compute the Bayes factors for all model pairs conveniently using a single MCMC output from the posterior of the full model.
3.3 Model Selection Consistency
A major motivation for using the Bayes factor as the variable selection tool is the set of attractive model selection properties proved in Liang et al. (2008) under the linear model setup, namely $\mathrm{plim}_n\, P(\mathcal{M}_\gamma \mid D) = 1$ when Mγ is the true underlying model. Here "plim" denotes convergence in probability with respect to the sampling distribution under the true model Mγ. The posterior probability of a model is given by

$$P(\mathcal{M}_\gamma \mid D) = \frac{P(D \mid \mathcal{M}_\gamma)\, \pi(\mathcal{M}_\gamma)}{\sum_{\gamma'} P(D \mid \mathcal{M}_{\gamma'})\, \pi(\mathcal{M}_{\gamma'})} = \frac{\pi(\mathcal{M}_\gamma)}{\sum_{\gamma'} \pi(\mathcal{M}_{\gamma'})\, BF[\mathcal{M}_{\gamma'} : \mathcal{M}_\gamma]}.$$

Since p is fixed, Liang et al. (2008) equivalently proved that

$$\mathrm{plim}_n\, BF[\mathcal{M}_{\gamma'} : \mathcal{M}_\gamma] = 0 \quad \forall\, \mathcal{M}_{\gamma'} \neq \mathcal{M}_\gamma \qquad (3.13)$$

when the Zellner-Siow prior is used and the following condition is satisfied:

$$\lim_{n\to\infty} \frac{\beta_\gamma^{T} Z_\gamma^{T}(I - P_{\gamma'})Z_\gamma\beta_\gamma}{n} = b_\gamma \in (0, \infty) \qquad (3.14)$$

for any model Mγ′ that does not contain Mγ, where Pγ′ is the projection matrix onto the column space of Zγ′.
It is important to investigate whether the posterior model selection consistency
still holds under the dynamic IRT model. Under the hierarchical framework of
the proposed dynamic IRT model, log(c) at level 3 mimics the role of the response
variable in a linear regression model, but it is a latent variable rather than an
observed response. The proof of model selection consistency in Liang et al.
(2008) is therefore not directly applicable here.
Under similar assumptions, we obtain the posterior model selection consistency
stated in Theorem 3.3.1 below.
Theorem 3.3.1. Assume that the marginal likelihood of any model Mγ defined under
the dynamic IRT model framework is finite for any given n. Assume that condition
(3.14) holds and p is fixed. Denote the true model by Mγ and the observed data by
D = X. Then for any model Mγ′ ≠ Mγ, we have
\[
\operatorname{plim}_n \mathrm{BF}[\mathcal{M}_{\gamma'} : \mathcal{M}_\gamma \mid D] = 0. \tag{3.15}
\]
Proof. For convenience, let P = {θ, d, η, τ, φ}, and let Qγ, Qγ′ denote the parameters at
level 3 for models Mγ and Mγ′, respectively. The parameter set Q includes α, β, ψ,
and contains g if the model is not the null model. By definition,
\[
\begin{aligned}
\mathrm{BF}_H[\mathcal{M}_{\gamma'} : \mathcal{M}_\gamma \mid D]
&= \frac{M(D \mid \mathcal{M}_{\gamma'})}{M(D \mid \mathcal{M}_\gamma)} \\
&= \frac{\int L(\theta, d, \eta \mid D)\pi(d)\pi(\eta \mid \tau)\pi(\tau)\pi(\theta \mid c, \phi)\pi(\phi)\,\pi(c \mid Q_{\gamma'})\pi(Q_{\gamma'})\, dQ_{\gamma'}\, dc\, dP}
        {\int L(\theta, d, \eta \mid D)\pi(d)\pi(\eta \mid \tau)\pi(\tau)\pi(\theta \mid c, \phi)\pi(\phi)\,\pi(c \mid Q_\gamma)\pi(Q_\gamma)\, dQ_\gamma\, dc\, dP} \\
&= \frac{\int L(\theta, d, \eta \mid D)\pi(d)\pi(\eta \mid \tau)\pi(\tau)\pi(\theta \mid c, \phi)\pi(\phi)
         \left\{\int \pi(c \mid Q_{\gamma'})\pi(Q_{\gamma'})\, dQ_{\gamma'}\right\} dc\, dP}
        {\int L(\theta, d, \eta \mid D)\pi(d)\pi(\eta \mid \tau)\pi(\tau)\pi(\theta \mid c, \phi)\pi(\phi)
         \left\{\int \pi(c \mid Q_\gamma)\pi(Q_\gamma)\, dQ_\gamma\right\} dc\, dP}.
\end{aligned}
\]
Letting
\[
\begin{aligned}
h(P, c \mid D) &= L(\theta, d, \eta \mid D)\pi(d)\pi(\eta \mid \tau)\pi(\tau)\pi(\theta \mid c, \phi)\pi(\phi), \\
\mathrm{ML}(\mathcal{M}_{\gamma'} \mid c) &= \int \pi(c \mid Q_{\gamma'})\pi(Q_{\gamma'})\, dQ_{\gamma'}, \\
\mathrm{ML}(\mathcal{M}_\gamma \mid c) &= \int \pi(c \mid Q_\gamma)\pi(Q_\gamma)\, dQ_\gamma,
\end{aligned}
\]
we then have
\[
\mathrm{BF}_H[\mathcal{M}_{\gamma'} : \mathcal{M}_\gamma \mid D]
= \frac{\int h(P, c \mid D)\,\mathrm{ML}(\mathcal{M}_\gamma \mid c)\,
        \dfrac{\mathrm{ML}(\mathcal{M}_{\gamma'} \mid c)}{\mathrm{ML}(\mathcal{M}_\gamma \mid c)}\, dc\, dP}
       {\int h(P, c \mid D)\,\mathrm{ML}(\mathcal{M}_\gamma \mid c)\, dc\, dP}.
\]
Since
\[
\pi(c \mid Q_\gamma) = \prod_{i=1}^n \frac{1}{c_i}\, \pi(\log(c_i) \mid Q_\gamma)
= \frac{1}{\prod_{i=1}^n c_i}\, \pi(\log(c) \mid Q_\gamma)
\]
and similarly
\[
\pi(c \mid Q_{\gamma'}) = \prod_{i=1}^n \frac{1}{c_i}\, \pi(\log(c_i) \mid Q_{\gamma'})
= \frac{1}{\prod_{i=1}^n c_i}\, \pi(\log(c) \mid Q_{\gamma'}),
\]
we have
\[
\frac{\mathrm{ML}(\mathcal{M}_{\gamma'} \mid c)}{\mathrm{ML}(\mathcal{M}_\gamma \mid c)}
= \frac{\int \pi(\log(c) \mid Q_{\gamma'})\pi(Q_{\gamma'})\, dQ_{\gamma'}}
       {\int \pi(\log(c) \mid Q_\gamma)\pi(Q_\gamma)\, dQ_\gamma}
= \mathrm{BF}_L[\mathcal{M}_{\gamma'} : \mathcal{M}_\gamma \mid \log(c)], \tag{3.16}
\]
where we use BF_L[Mγ′ : Mγ | log(c)] to denote the Bayes factor of model Mγ′ to
model Mγ under the linear model framework, with log(c) serving as the response.
To prove plim_n BF[Mγ′ : Mγ | D] = 0 for any Mγ′ ≠ Mγ, we need to show that for
all ε, δ > 0, there exists a constant n_{ε,δ} such that
\[
P(\mathrm{BF}[\mathcal{M}_{\gamma'} : \mathcal{M}_\gamma \mid D] > \varepsilon) < \delta
\]
for all n ≥ n_{ε,δ}.
When Mγ is the true model, we have log(c) ∼ N(1α + Zβγ, ψ^{-1}I_n). Along with
condition (3.14), by Theorem 3 in Liang et al. (2008),
\[
\operatorname{plim}_n \mathrm{BF}_L[\mathcal{M}_{\gamma'} : \mathcal{M}_\gamma \mid \log(c)] = 0. \tag{3.17}
\]
Therefore, for any ε, δ > 0, there exists n_{ε,δ} such that
P(ML(Mγ′ | c)/ML(Mγ | c) > ε) < δ for all n ≥ n_{ε,δ} under Mγ, where
log(c) ∼ N(1α + Zβγ, ψ^{-1}I_n).
Let ML(Mγ′ | c)/ML(Mγ | c) − ε = ∆_{γ′,γ,ε}(c). We have
\[
\begin{aligned}
&P(\mathrm{BF}_H[\mathcal{M}_{\gamma'} : \mathcal{M}_\gamma \mid D] > \varepsilon) \\
&= P\!\left(\frac{\int h(P, c \mid D)\,\mathrm{ML}(\mathcal{M}_\gamma \mid c)
      \left\{\frac{\mathrm{ML}(\mathcal{M}_{\gamma'} \mid c)}{\mathrm{ML}(\mathcal{M}_\gamma \mid c)} - \varepsilon\right\} dc\, dP}
      {\int h(P, c \mid D)\,\mathrm{ML}(\mathcal{M}_\gamma \mid c)\, dc\, dP} > 0\right) \\
&= P\!\left(\int h(P, c \mid D)\,\mathrm{ML}(\mathcal{M}_\gamma \mid c)
      \left\{\frac{\mathrm{ML}(\mathcal{M}_{\gamma'} \mid c)}{\mathrm{ML}(\mathcal{M}_\gamma \mid c)} - \varepsilon\right\} dc\, dP > 0\right).
\end{aligned}
\]
The expression R(D) = ∫ h(P, c | D) ML(Mγ | c){ML(Mγ′ | c)/ML(Mγ | c) − ε} dc dP
is a function of the data D, and hence it is random with respect to the sampling
distribution. Consider an exhaustive partition of the domain of (D, c) as follows:
S1 = {(D, c) : ∆_{γ′,γ,ε} > 0}, S2 = {(D, c) : ∆_{γ′,γ,ε} ≤ 0}. Then we have
\[
\begin{aligned}
P(\mathrm{BF}_H[\mathcal{M}_{\gamma'} : \mathcal{M}_\gamma \mid D] > \varepsilon)
&= P(R(D) > 0) \\
&= P(R(D) > 0 \mid S_1)P(S_1) + P(R(D) > 0 \mid S_2)P(S_2) \\
&= P\!\left(\int_{\Delta_{\gamma',\gamma,\varepsilon} > 0}
      h(P, c \mid D)\,\mathrm{ML}(\mathcal{M}_\gamma \mid c)\,\Delta_{\gamma',\gamma,\varepsilon}\, dc\, dP > 0\right) P(S_1) \\
&\quad+ P\!\left(\int_{\Delta_{\gamma',\gamma,\varepsilon} \le 0}
      h(P, c \mid D)\,\mathrm{ML}(\mathcal{M}_\gamma \mid c)\,\Delta_{\gamma',\gamma,\varepsilon}\, dc\, dP > 0\right) P(S_2) \\
&= 1 \times P(S_1) + 0 \times P(S_2) \\
&= P(S_1) \\
&= P(\Delta_{\gamma',\gamma,\varepsilon} > 0) \\
&< \delta \quad \text{under model } \mathcal{M}_\gamma
\end{aligned}
\]
holds for all n > n_{ε,δ}; hence plim_n BF[Mγ′ : Mγ | D] = 0 for any Mγ′ ≠ Mγ.
3.4 Simulation Studies
Though the proposed Bayes factor approach is supported by Theorem 3.3.1 for
posterior model selection consistency, it is desirable to examine the small-sample
performance of this approach. We design and carry out simulation studies as follows:
(i) Consider 3 setups for the number of individuals: n = 40, 60, 80;
(ii) Consider 2 setups for the number of distinct test days: Ti = 100, 120 for each
individual i, i = 1, 2, . . . , n.
(iii) Set the number of tests per day as Si,t = 4 and the number of items per test as Ki,t,s = 10
for i = 1, 2, . . . , n, t = 1, 2, . . . , Ti, s = 1, 2, . . . , Si,t.
(iv) Set ∆i,t = t + 10 for t ≤ Ti/2, and ∆i,t = t − 10 for Ti/2 + 1 ≤ t ≤ Ti.
(v) Set α = −5, ρ = 0.118, ψ = 100, σ² = 0.7333², φ^{-1/2} = 0.0055, and τ_i^{-1/2} i.i.d. ∼
U(0.2, 0.4).
(vi) Set p = 5 and consider 3 setups for the β coefficients:
β = (0, 0, 0, 0, 0), (0.1, 0, 0.2, 0, −0.1), (0.2, 0, 0.4, 0, 0.2),
which include the case in which the null model is the true model to test the "type
I error", and two other cases with different magnitudes for the non-zero coefficients
to test the "power".
(vii) n × p random numbers are independently generated from N(0, 0.2²) to form
an n × p matrix; each column is then centered to mean 0 and scaled to unit
variance to build the design matrix Z.
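Step (vii) can be sketched as follows. The seed and the use of the sample standard deviation for scaling are illustrative assumptions, not details taken from the dissertation.

```python
import random
import statistics

def make_design_matrix(n, p, sd=0.2, seed=1):
    """Draw n*p independent N(0, sd^2) entries, then center each column to
    mean 0 and scale it to unit (sample) variance, as in step (vii)."""
    rng = random.Random(seed)
    Z = [[rng.gauss(0.0, sd) for _ in range(p)] for _ in range(n)]
    for j in range(p):
        col = [Z[i][j] for i in range(n)]
        mu, s = statistics.mean(col), statistics.stdev(col)
        for i in range(n):
            Z[i][j] = (Z[i][j] - mu) / s
    return Z

Z = make_design_matrix(40, 5)  # e.g., the n = 40, p = 5 scenario
```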
For each simulation scenario, 100 independent simulations are carried out, 5000
MCMC samples are drawn from the posterior distribution of the full model MF,
and the first 2000 samples are discarded as burn-in. Trace plots
of β, g and ψ from a simulation with n = 40, Ti = 100, β = (0.1, 0, 0.2, 0, −0.1) are
displayed in Figure 3.1.
For each simulation scenario, we record how frequently the true model is ranked
as the “best” model (with the largest log marginal likelihood), and how frequently
Figure 3.1: Trace plots of β, g and ψ with 3000 MCMC samples. The red line refers to the true value.
it is ranked among the top 3 "best" models. The results are recorded in Tables 3.1 and
3.2. We can see that the Bayes factor approach performs well when n ≥ 60
in both the "type I error" and "power" studies, and the performance improves as n
increases. In terms of selecting the top 3 models, when the true model has non-zero
coefficients, the Bayes factor approach ranks the true model among the top 3 over 90%
of the time even when n = 40 in this simulation study. The finite-sample performance
of the proposed Bayes factor approach is thus quite good.
Table 3.1: Percentage of simulations in which the true model is ranked as the "best"
model by the Bayes factor approach.

(n, Ti)     β = (0, 0, 0, 0, 0)   (0.1, 0, 0.2, 0, −0.1)   (0.2, 0, 0.4, 0, −0.2)
(40, 100)         0.48                   0.60                     0.70
(40, 120)         0.62                   0.66                     0.76
(60, 100)         0.80                   0.69                     0.78
(60, 120)         0.82                   0.79                     0.86
(80, 100)         0.93                   0.79                     0.85
(80, 120)         0.91                   0.85                     0.87
Table 3.2: Percentage of simulations in which the true model is ranked among the top 3
"best" models by the Bayes factor approach.

(n, Ti)     β = (0, 0, 0, 0, 0)   (0.1, 0, 0.2, 0, −0.1)   (0.2, 0, 0.4, 0, 0.2)
(40, 100)         0.51                   0.91                     0.97
(40, 120)         0.64                   0.95                     0.96
(60, 100)         0.83                   0.98                     0.98
(60, 120)         0.84                   0.98                     0.98
(80, 100)         0.93                   1.00                     0.98
(80, 120)         0.92                   0.99                     0.97
Table 3.3: Estimated log Bayes factor of each model to the full model via the
proposed Monte Carlo approach.

Model          Log Bayes factor compared to MF
Null model           −243.13
Z1                   −318.54
Z2                   −210.93
Z1, Z2                −17.49
Z3                   −256.72
Z1, Z3               −378.92
Z2, Z3               −116.77
Z1, Z2, Z3              0
3.5 Real Data Application
We apply the proposed method to the EdSphere dataset introduced in Chapter
1. 35 students are selected for the analysis, and 3 candidate covariates are used:
gender, economic status and grade, denoted by Z1–Z3, respectively. For
gender, we let Z1 = 1 for male and Z1 = 0 for female. There are two categories for
economic status, i.e., disadvantaged and non-disadvantaged, based on lunch status,
homelessness status, and other district-level indicators of economic situation, and we
let Z2 = 1 for disadvantaged and 0 otherwise. For the grade variable Z3, we
use the grade information on the first day of the study period for each student.
Among these 35 students, the number of test days ranges from 45 to 122, the maximum
number of tests per day is 18, and the maximum number of questions per test is 78.
We implement the MCMC procedure with 300,000 iterations and discard the
first 50,000 samples as burn-in. The posterior medians of the regression coefficients
are β = (−0.249, 0.428, −0.109)′. The estimated log Bayes factor of each model to the
full model MF is shown in Table 3.3.
The model with the largest Bayes factor compared to the full model is the full
model itself, MF : Z1, Z2, Z3; i.e., gender, economic status, and grade all play a
role in explaining the growth factor ci. The Bayes factor of the full model
MF to the 2nd "preferred" model M1,2 : Z1, Z2, i.e., e^{17.49}, is considered "very
strong" evidence in favor of the full model (Kass and Raftery, 1995). A comprehensive
study with more subjects and interaction terms will be carried out in future work
to substantiate the findings.
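The Kass and Raftery (1995) scale referenced here categorizes evidence by the size of 2 log BF. A small sketch of that lookup, with thresholds taken from their commonly cited table, applied to the log Bayes factor from Table 3.3:

```python
def kass_raftery(log_bf):
    """Categorize evidence for the favored model from 2*log(BF),
    per the Kass and Raftery (1995) scale."""
    two_log_bf = 2.0 * log_bf
    if two_log_bf < 2:
        return "not worth more than a bare mention"
    if two_log_bf < 6:
        return "positive"
    if two_log_bf < 10:
        return "strong"
    return "very strong"

# Full model vs. the runner-up {Z1, Z2}: log BF = 17.49 (Table 3.3).
print(kass_raftery(17.49))  # very strong
```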
3.6 Discussion
From the previous analysis results, we can see that Bayes factor seems to be a
reliable tool for the variable selection task in the proposed dynamic IRT model, if the
Zellner-Siow prior is used and certain assumptions are met. Note that generally it is
difficult to compute the marginal likelihoods of hierarchical models with multi-levels
if no analytical forms are available, and the usage of Bayes factor may be limited. If
one wishes to perform variable selection under a hierarchical model, at only a single
layer which consists of the linear structure similar to level 3 in the dynamic IRT
model, Proposition 3.2.1 may be applied in a similar fashion to compute the Bayes
factors conveniently. The proof for posterior model selection consistency may still
hold if similar assumptions as the ones in Theorem 3.3.1 are made, and the layer
under consideration links with other layers in certain ways, e.g., the parameters in
such layer are not connected to other layers.
There are also other model selection criteria for hierarchical models, such as
the deviance information criterion (DIC) (Spiegelhalter et al., 2002) and the widely
applicable information criterion (WAIC) (Watanabe, 2010). When applying the DIC
for model selection, the "focus" of the parameters of interest may affect the results.
One may integrate out all "intermediate" parameters to obtain the likelihood with
respect to the parameters of interest, which is usually challenging in hierarchical
models such as IRT models, or take the likelihood to be just the data likelihood in
the first layer and treat the "intermediate" parameters the same as the parameters
in the outermost layers. It is unclear which implementation may lead to better
results. It would be interesting to compare the model selection performance of
these criteria in future work.
Chapter 4
Bayesian Nonparametric Monotone Regression of
Dynamic Latent Traits in IRT Models
4.1 Nonparametric Monotone Regression Model
In this chapter, we keep the discussion focused on the one-parameter IRT model,
which links latent ability and item difficulty to the correctness of each item. The
idea could be similarly implemented for two-parameter/three-parameter IRT models
or other continuous, ordinal, and nominal latent variable models. As introduced
in Chapter 1, in the classic one-parameter IRT model, the latent ability of each
individual and the difficulty level of each item are the two key components, and
both are assumed to be static. In the EdSphere dataset, however, the item responses
are longitudinal, and thus the latent abilities of individuals and the item difficulty
levels both vary with time. In addition, in computerized testing each test taker is
allowed to take tests at any time they wish. Moreover, they can take more than one
exam per day. Therefore, following the discussion of Wang et al. (2013), we extend
the classic one-parameter IRT model as follows.
The proposed shape-constrained IRT model involves two levels, i.e.,
Level 1: Pr(Xi,j,s,k = 1 | θi,j, ηi,j,s, di,j,s,k) = F(θi,j − di,j,s,k + ηi,j,s), (4.1)
Level 2: θi,j = fi(ti,j) + ωi,j. (4.2)
At the first level, Equation (4.1) extends the classic one-parameter IRT model to
the scenario of computerized testing, where θi,j represents the i-th person's ability
(latent trait) on the j-th day, assuming a person's ability is constant over a given
day; di,j,s,k is the difficulty of the k-th item in the s-th test on the j-th day taken by
subject i; ηi,j,s accounts for the random effects that cannot be explained by the
person's ability and the item difficulty in the s-th test on the j-th day for the i-th
person; F(·) is the cumulative distribution function (CDF) of a continuous random
variable; and i = 1, . . . , n, j = 1, . . . , Ti, s = 1, . . . , Si,j and k = 1, . . . , Ki,j,s. For
the EdSphere dataset, since the ensemble mean and variance of the item difficulties
in a test are known quantities due to the test design, we assume
di,j,s,k = ai,j,s + υi,j,s,k, (4.3)
where υi,j,s,k ∼ N(0, σ²), and ai,j,s and σ are known. Further, we assume that for each
individual i, the random effects ηi,j,s i.i.d. ∼ N(0, τ_i^{-1}) for j = 1, . . . , Ti and
s = 1, . . . , Si,j, where τi is a precision parameter that varies across individuals.
For the link function F^{-1}(·), we will use F^{-1}(·) = Φ^{-1}(·) (called the normal ogive
or probit link), where Φ^{-1}(·) is the inverse of the standard normal CDF; this
link eases the computation for Bayesian analysis.
In the second level (4.2), ωi,j represents the random residual, ωi,j ∼ N(0, φ^{-1}∆i,j),
where φ is an unknown precision parameter and ∆i,j is the time lapse
between the j-th test day and the (j − 1)-th test day of subject i, j = 1, . . . , Ti.
We assume the variance of ωi,j is proportional to the time lapse because this implies
that the uncertainty about one's ability becomes larger when he/she does not take
tests for a while. The function fi(·) is the i-th individual's latent trajectory, which
is assumed to be a continuous function, where ti,j is the time location, i.e., the actual
test day on which an examinee takes a test in the study period. When we model fi(·),
some prior information is often available. For example, psychologists and educators
can assume that the mean trend of a student's reading ability should be growing, or
at least not decreasing, during one school year. Over a longer time span, however, such
a monotonicity assumption may not hold. We choose to impose such prior beliefs
on fi(·) since they apply to a large part of the potential applications, and we
may be able to check the reasonableness of this assumption from the data fit.
We will utilize the idea of Bornkamp and Ickstadt (2009) to model the unknown
latent trajectory fi(·) flexibly and conveniently with a monotone shape constraint.
First, for each individual i, i = 1, . . . , n, we rescale the original time units into [0, 1]
by subtracting the minimum time value of the i-th individual and then dividing by
the time range of individual i. After such rescaling, we continue
to use ti,j as the notation for time for convenience, and assume
fi(ti,j) = βi,0 + βi,1 f_i^0(ti,j), i = 1, . . . , n, j = 1, . . . , Ti, (4.4)
where f_i^0(·) is the CDF of a bounded and continuous random variable on [0, 1]. Thus
fi(·) is an increasing function if βi,1 ≥ 0 and a decreasing function otherwise. Moreover,
fi(·) can accommodate flat regions if f_i^0(·) is chosen properly. This is because
the CDF of a random variable reaches its plateau when the variable approaches
the bound of its domain. Under the current time-rescaling mechanism, and since
0 ≤ f_i^0(ti,j) ≤ 1 for each individual i, the intercept βi,0 can be interpreted as the
initial level of the latent ability for the i-th individual, while βi,0 + βi,1 can be treated
as the maximum level that the individual can reach during the study period. Second,
to make the modeling of f_i^0(·) more flexible, we introduce nonparametric ideas:
\[
f_i^0(t_{i,j}) = \int_\Xi F(t_{i,j}, \xi)\, G_i(d\xi)
= \sum_{\ell=1}^{L_i} \pi_{i,\ell}\, F(t_{i,j}, \xi_{i,\ell}),
\qquad \xi_{i,\ell} \overset{\text{i.i.d.}}{\sim} P_0, \tag{4.5}
\]
where we model f_i^0(·) as a discrete mixture of parametric CDFs. In Equation (4.5),
F(·, ξ) is a CDF with parameters ξ on the parameter space Ξ; Gi is a discrete
probability measure on Ξ assigned the general discrete random measure prior
introduced by Ongaro and Cattaneo (2004), i.e., G_i(dξ) = Σ_{ℓ=1}^{L_i} π_{i,ℓ} δ_{ξ_{i,ℓ}}(dξ), where
δ is the Dirac delta function; the ξ_{i,ℓ}'s are independent and identically distributed
realizations from a continuous distribution P_0 on Ξ, and are assumed to be independent
of the π_{i,ℓ}'s and L_i's; the π_{i,ℓ}'s satisfy Σ_{ℓ=1}^{L_i} π_{i,ℓ} = 1, and L_i has its support on
the positive integers. Also, note that common choices of P_0 in Equation (4.5) will help
us borrow strength among the individual latent trajectories. The construction of Equation
(4.5) actually includes many popular discrete random probability measures in
the current literature as special cases, such as the Dirichlet process, general stick-
breaking processes, etc. In addition, this construction is very flexible in modeling
monotone functions since it does not impose any restrictive structure on the parameters
(i.e., the weights π_{i,ℓ} and the parameters ξ_{i,ℓ}) as the other methods mentioned in
the Introduction do.
In Equation (4.5), a proper choice of the base distribution function F(·, ξ) is the
key to successfully modeling fi(·). A typical requirement is that convex combinations
of the functions F(·, ξ_{i,1}), F(·, ξ_{i,2}), . . ., for any i = 1, . . . , n, can approximate
any continuous CDF on [0, 1]. The beta distribution functions
(i.e., the regularized incomplete beta functions) satisfy this requirement; however,
extensive computation is usually associated with the beta distribution functions
since they have no closed form. To balance the computational burden and
the adequacy of approximation, we consider the CDF of the two-sided power (TSP)
distribution (c.f. Van Dorp and Kotz (2002)) for F(·, ξ):
\[
F(t_{i,j}, \xi) =
\begin{cases}
b\left(\dfrac{t_{i,j}}{b}\right)^{\gamma}, & \text{if } 0 \le t_{i,j} \le b, \\[6pt]
1 - (1-b)\left(\dfrac{1-t_{i,j}}{1-b}\right)^{\gamma}, & \text{if } b \le t_{i,j} \le 1,
\end{cases}
\qquad (b, \gamma) \in [0, 1] \times \mathbb{R}^{+},
\]
which has two key parameters (b, γ) and is a viable alternative to the beta distribution
functions. Here, define ξ = (b, γ). Bornkamp and Ickstadt (2009) proved that convex
combinations of TSP CDFs are capable of approximating any continuous CDF
on [0, 1].
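A direct transcription of the TSP CDF and the finite mixture in Equation (4.5) might look like the following sketch; the handling of the boundary case b = 0 is an implementation detail, not something specified in the text.

```python
def tsp_cdf(t, b, gamma):
    """CDF of the two-sided power (TSP) distribution on [0, 1] with
    parameters (b, gamma); see Van Dorp and Kotz (2002)."""
    assert 0.0 <= t <= 1.0 and 0.0 <= b <= 1.0 and gamma > 0
    if t <= b:  # rising branch below b (returns 0 at t = 0)
        return b * (t / b) ** gamma if b > 0 else 0.0
    return 1.0 - (1.0 - b) * ((1.0 - t) / (1.0 - b)) ** gamma

def tsp_mixture_cdf(t, weights, params):
    """f_i^0(t) = sum_l pi_l * F(t, (b_l, gamma_l)), as in Equation (4.5)."""
    return sum(w * tsp_cdf(t, b, g) for w, (b, g) in zip(weights, params))
```

Both branches agree at t = b (each equals b), so the mixture is a continuous, monotonically increasing function from 0 to 1.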
We illustrate several examples of TSP functions in Figure 4.1, from which
we can see that γ controls the steepness of the curve. When γ is small, the pace of
increase is comparatively slow, and when γ becomes larger, the increasing trend
becomes steeper. The parameter b is the unique mode of the TSP distribution when
γ > 1. A TSP function with a small b and a large γ describes a curve that increases
steeply in the beginning and stabilizes afterward, which could be viewed as a 'fast
learner' growth curve, while a TSP function with a large b and a small γ can be viewed
as the growth trend of an individual who improves steadily at a slower pace. Other
scenarios can be interpreted accordingly and will vary with different values of b
and γ.
Figure 4.1: Illustration of several TSP functions with different γ and b values (γ = 2, 10, 30, 50; b = 0.2, 0.5, 0.8).
The monotone regression can be treated as the sum of two parts: 1) an intercept
parameter (interpretable as one's initial ability), and 2) the product of a scale
parameter (interpretable as the maximum ability that one can gain during the study
period) with a continuous function of the time variable (which is monotonically
increasing over the study period and can easily capture the plateau effect of one's
ability at the end). Another potential advantage of our proposed approach is
that the parameters of the underlying base functions, which constitute the monotone
function, have no constraints on the domain of the parameter space. The lack of
constraints on the domain of the parameter space makes the model more flexible
and reduces the computational burden. In addition, the base functions themselves
can be directly learned from the data.
4.2 Bayesian Computation Scheme
The hierarchical model of (4.1) and (4.2) can accommodate the complex structure
of computerized testing (e.g., the EdSphere dataset) and also allows the incorporation
of prior information. However, because of the complexity of the model considered,
we have to resort to Markov chain Monte Carlo (MCMC) computational techniques
for the analysis. A byproduct of Bayesian inference is that the uncertainties in all
quantities are combined in the overall assessment of inferential uncertainty.
4.2.1 Prior Specification and Posterior Distribution
Before starting the Bayesian inference, we first have to specify the prior distributions
of the unknowns in the model. For the parameters φ, τi's, βi,0's and βi,1's in Equations
(4.1), (4.2) and (4.3), due to a lack of scientific knowledge, we use the following
objective priors: π(φ) ∝ φ^{-3/2}, π(τi) ∝ τ_i^{-3/2}, π(βi,0) ∝ 1 and π(βi,1) ∝ 1,
for i = 1, . . . , n. The objective priors used for π(φ) and π(τi) are recommended in
Wang et al. (2013).
According to Lemma 1 in Bornkamp and Ickstadt (2009), the prior distributions
of the ξi,ℓ's determine the mean and prior correlation structure of f_i^0(·). Without
expert information, a uniform distribution on a finite subset of the parameter space Ξ
for the ξi,ℓ's is a reasonable choice to start with. Theoretically, we would like to elicit
an unbounded prior for Li. One viable choice is the zero-truncated Poisson distribution
with rate parameter λ > 0, whose prior mean is λe^λ/(e^λ − 1). The larger the λ value
is, the more components there are in the TSP mixture. For the prior of
πi = (πi,1, . . . , πi,Li)′, a natural choice is a symmetric Dirichlet distribution with
common parameter ρ > 0. Notice that the prior variability of f_i^0(·) increases when
Li or ρ gets smaller (see Lemma 1 in Bornkamp and Ickstadt (2009)). Hence, in
practice, the values of λ and ρ are chosen according to the desired prior variability of
f_i^0(·) and the expected number of jumps in the model response.
Using the priors specified above, we can derive the posterior of the unknowns in
the proposed model as shown in Equation (A.1) of Appendix A. We can show that this
posterior is proper (see details in Appendix A); the statistical inference based on
sampling from this posterior is therefore legitimate. To facilitate the computation of
the posterior of the unknowns, we implement the idea of data augmentation (Albert and
Chib, 1993) by introducing a latent variable Zi,j,s,k for each dichotomous response
Xi,j,s,k, i.e., defining Zi,j,s,k ∼ N(θi,j − ai,j,s + ηi,j,s, 1 + σ²) I(Zi,j,s,k > 0) if Xi,j,s,k = 1,
and Zi,j,s,k ∼ N(θi,j − ai,j,s + ηi,j,s, 1 + σ²) I(Zi,j,s,k ≤ 0) if Xi,j,s,k = 0. Therefore, the
two-level hierarchical model (4.1) and (4.2) can be simplified as
Zi,j,s,k = fi(ti,j) + ωi,j − di,j,s,k + ηi,j,s + εi,j,s,k, (4.6)
with εi,j,s,k i.i.d. ∼ N(0, 1). Our computation schemes for drawing samples from
the joint posterior distribution of the unknowns then derive from this data-augmented
model.
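A minimal sketch of this augmentation step for a single response, assuming inverse-CDF sampling from the truncated normal (the mean θi,j − ai,j,s + ηi,j,s and variance 1 + σ² follow the definition above; the numeric values and the use of `statistics.NormalDist` are illustrative conveniences):

```python
import random
from statistics import NormalDist

def draw_augmented_z(x, mean, var, rng=random.Random(0)):
    """Draw Z ~ N(mean, var) truncated to (0, inf) if x == 1, or to
    (-inf, 0] if x == 0, via inverse-CDF sampling (in the spirit of the
    Albert and Chib, 1993, augmentation)."""
    nd = NormalDist(mean, var ** 0.5)
    p0 = nd.cdf(0.0)                 # normal mass below the truncation point
    u = rng.random()
    if x == 1:                       # sample from the upper tail
        return nd.inv_cdf(p0 + u * (1.0 - p0))
    return nd.inv_cdf(u * p0)        # sample from the lower tail

# Illustrative values: theta - a + eta = 0.3, variance 1 + sigma^2.
z = draw_augmented_z(1, mean=0.3, var=1.0 + 0.7333 ** 2)
assert z > 0
```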
Let Λi = (Li, πi,1, . . . , πi,Li, ξi,1, . . . , ξi,Li)′, with each ξi,ℓ = (bi,ℓ, γi,ℓ) being the
parameters of the ℓ-th TSP mixture component for the i-th subject and πi,ℓ the
assigned weight of the ℓ-th mixture component for the i-th subject,
for i = 1, . . . , n and ℓ = 1, . . . , Li. Thus, Λi contains all information about the
TSP mixture for the i-th individual, and we write Λ = {Λ1, . . . , Λn}
and ξ = {ξ1,1, . . . , ξ1,L1, . . . , ξn,1, . . . , ξn,Ln} for the sets of variables over all
n subjects.
Similarly, we use the bold notations θ, β, Z, η, τ to denote the sets of the corresponding
variables introduced in Section 4.1 over all indices, i.e., β = {β1,0, β1,1, . . . , βn,0, βn,1},
η = {ηi,j,s : i = 1, . . . , n, j = 1, . . . , Ti, s = 1, . . . , Si,j}, τ = {τ1, . . . , τn},
Z = {Zi,j,s,k : i = 1, . . . , n, j = 1, . . . , Ti, s = 1, . . . , Si,j, k = 1, . . . , Ki,j,s},
θ = {θ1,1, . . . , θ1,T1, . . . , θn,1, . . . , θn,Tn}. Similarly, define X = {Xi,j,s,k : i = 1, . . . , n,
j = 1, . . . , Ti, s = 1, . . . , Si,j, k = 1, . . . , Ki,j,s}. Then, the joint posterior distribution
of the parameters θ, β, Λ, Z, η, τ, φ given the data X is derived as:
\[
\begin{aligned}
&f(\theta, \beta, \Lambda, Z, \eta, \tau, \phi \mid X) \tag{4.7}\\
&\propto f(X \mid Z)\, f(Z \mid \theta, \eta)\, f(\theta \mid \beta, \Lambda, \phi)\, f(\eta \mid \tau)\,
  \pi(\beta)\pi(\Lambda)\pi(\phi)\pi(\tau) \\
&\propto \prod_{i=1}^{n} \prod_{j=1}^{T_i} \prod_{s=1}^{S_{i,j}} \prod_{k=1}^{K_{i,j,s}}
  \big[ I(Z_{i,j,s,k} > 0) I(X_{i,j,s,k} = 1) + I(Z_{i,j,s,k} \le 0) I(X_{i,j,s,k} = 0) \big] \\
&\qquad\times \sqrt{\frac{\psi_{i,j,s,k}}{2\pi}}
  \exp\left\{ -\frac{\psi_{i,j,s,k} (Z_{i,j,s,k} - \theta_{i,j} + a_{i,j,s} - \eta_{i,j,s})^2}{2} \right\} \\
&\qquad\times \prod_{i=1}^{n} \prod_{j=1}^{T_i} \sqrt{\frac{\phi}{2\pi \Delta_{i,j}}}
  \exp\left\{ -\frac{\phi\, [\theta_{i,j} - \beta_{i,0} - \beta_{i,1} f_i^0(t_{i,j})]^2}{2 \Delta_{i,j}} \right\} \\
&\qquad\times \prod_{i=1}^{n} \prod_{j=1}^{T_i} \left( \frac{\tau_i}{2\pi} \right)^{S_{i,j}/2}
  \exp\left\{ -\frac{\tau_i \sum_{s=1}^{S_{i,j}} \eta_{i,j,s}^2}{2} \right\}
  \left[ \prod_{i=1}^{n} \pi(\tau_i)\pi(\beta_i)\pi(\Lambda_i) \right] \pi(\phi),
\end{aligned}
\]
where ψi,j,s,k = (1 + σ²)^{-1}, and π(τi), π(βi), π(Λi) and π(φ) denote the priors specified
at the beginning of this subsection.
4.2.2 The MCMC Sampling Schemes
The key part in developing an MCMC algorithm to draw samples from the joint
posterior distribution (4.7) is the estimation of the latent trajectories f_i^0(ti,j) in
Equation (4.4). Notice that once Λ is known, the f_i^0(ti,j)'s are fully determined. Since
the dimension of Λ can vary, we employ a reversible jump Markov chain Monte
Carlo (RJ-MCMC) sampling scheme (Green and Hastie, 2009). Further, to reduce
the correlation as well as to achieve faster convergence of the MCMC samples of the
parameters, we implement the idea of partially collapsed Gibbs sampling (Van Dyk
and Park, 2008). Thus, at the q-th iteration, we perform the sampling procedure of
the unknowns in the order below:
1. Sample Z(q) from f(Z(q) | θ(q−1),η(q−1)), which is a truncated normal distri-
bution for the full conditional distribution of each individual Zi,j,s,k given the
rest;
2. Sample η(q) from f(η(q) | θ(q−1), τ (q−1),Z(q)), which is a normal distribution
for the full conditional distribution of each ηi,j,s given the rest;
3. Sample τ (q) from f(τ (q) | η(q)), which is a gamma distribution for the full
conditional distribution of each τi given the rest;
4. Sample Λ(q) from f(Λ(q) | θ(q−1)), which has no closed form; moreover, the
dimension of Λ varies at each iteration, and thus we employ a Metropolis–
Hastings algorithm within the RJ-MCMC to sample the full conditional distribution
of each Λi given the rest;
5. Sample φ(q) from f(φ(q) | θ(q−1), Λ(q)), which is a gamma distribution for the
full conditional distribution of φ given the rest;
6. Sample β(q) from f(β(q) | θ(q−1),Λ(q), φ(q)), which is a multivariate normal
distribution for the full conditional distribution of β given the rest;
7. Sample θ(q) from f(θ(q) | β(q),Λ(q),Z(q),η(q), φ(q)), which is a normal distri-
bution for the full conditional distribution of each individual θi,j given the
rest.
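Steps 1–7 above can be organized as one pass of a sampler loop. The sketch below is a minimal control-flow skeleton only: the `update_*` stubs are illustrative stand-ins for the full-conditional draws detailed in Appendix A (which are not reproduced here), so only the ordering and state handling are shown.

```python
# Skeleton of the partially collapsed Gibbs sampler (Steps 1-7).
# Each stub stands in for a real full-conditional draw from Appendix A;
# here the stubs only record that the step ran, so the loop is runnable.

def make_stub(name):
    def update(state):
        state[name + "_updated"] = True  # placeholder for the actual draw
        return state
    return update

# The sampling order of Section 4.2.2: Z, eta, tau, Lambda, phi, beta, theta.
steps = [make_stub(n) for n in ("Z", "eta", "tau", "Lambda", "phi", "beta", "theta")]

def run_mcmc(n_iter, state=None):
    state = {} if state is None else state
    draws = []
    for _ in range(n_iter):
        for step in steps:           # one full pass = Steps 1-7
            state = step(state)
        draws.append(dict(state))    # store a snapshot of the iteration
    return draws

draws = run_mcmc(3)
```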
The details of each sampling step are described in Appendix A. The MCMC
sampling loops through Steps 1–7 and repeats until convergence.
The initial values (i.e., at the 0-th iteration) of the parameters chosen in the
simulations and the application are: θ_{i,j}^{(0)}'s drawn from N(0, 1); β_{i,0}^{(0)}'s
and β_{i,1}^{(0)}'s set to zero; η_{i,j,s}^{(0)}'s set to zero; φ^{(0)} = 200; and τ_i^{(0)}'s
set to 6 for i = 1, . . . , n. We will discuss specifically how to choose the initial values
of Λ_i^{(0)} = (L_i^{(0)}, π_i^{(0)}, b_i^{(0)}, γ_i^{(0)}) for i = 1, . . . , n in the different
examples. Convergence is evaluated informally by looking at trace plots and other
diagnostic procedures discussed in Chen et al. (2012).
Then, statistical inferences are made straightforwardly from the MCMC samples.
For example, an estimate and 95% credible interval (CI) for the latent trajectory
of one's ability θ_i = (θ_{i,1}, . . . , θ_{i,T_i})′ can be obtained from the median, 2.5%, and
97.5% empirical quantiles of the corresponding MCMC realizations of each θ_{i,j}, for
j = 1, . . . , Ti. In the examples, these are graphed as functions of time t_{i,j}, so that the
dynamic changes of an examinee are apparent.
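This quantile summary can be sketched as follows; the nearest-rank quantile rule and the simulated draws are illustrative assumptions standing in for the actual MCMC output.

```python
import random

def ci_summary(samples, lo=0.025, hi=0.975):
    """Posterior median and equal-tailed credible interval from MCMC draws,
    using simple nearest-rank empirical quantiles."""
    s = sorted(samples)
    n = len(s)
    q = lambda p: s[min(n - 1, int(p * n))]
    return q(0.5), (q(lo), q(hi))

rng = random.Random(42)
draws = [rng.gauss(1.5, 0.2) for _ in range(4000)]  # stand-in for theta_{i,j} draws
med, (l, u) = ci_summary(draws)
assert l < med < u
```

Applying this to the retained draws of each θi,j gives the point estimate and the 95% credible band plotted over ti,j.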
4.3 Simulation Examples
To validate the inference procedure of MCMC schemes and show the success of
using monotone shape constraints in the nonparametric modeling, two simulation
studies are conducted. In the first simulated example, we know the true underlying
curve f 0i (·)’s that generate the latent trajectory of one’s ability, while in the second
simulated example, we have no information about the true underlying curve f 0i (·)’s.
4.3.1 An Example Using the Mixture of TSP Functions as the Latent
Trajectory
In this section, we apply our proposed method to a simulated dataset that uses
a mixture of TSP functions as the true latent trajectory. We consider 10 test
takers, i.e., n = 10, each of whom is examined on 60 different test days, that is,
Ti = 60, i = 1, . . . , 10. On each distinct test day, there are 4 examinations
for each individual, so Si,j = 4 for i = 1, . . . , 10, j = 1, . . . , Ti, and there are 10
questions (or items) in each test, i.e., Ki,j,s = 10 for i = 1, . . . , 10, j = 1, . . . , Ti and
Table 4.1: Simulated parameters for each individual of the TSP functions.

       L        π            b             γ
Λ1     2   [0.5, 0.5]   [0.5, 0.1]    [3, 5]
Λ2     2   [0.2, 0.8]   [0.4, 0.7]    [2, 8]
Λ3     1       1           0.15         10
Λ4     2   [0.3, 0.7]   [0.3, 0.5]    [5, 13]
Λ5     1       1           0.3           4
Λ6     1       1           0.5          17
Λ7     2   [0.1, 0.9]   [0.2, 0.3]    [3, 15]
Λ8     1       1           0.3           5
Λ9     2   [0.6, 0.4]   [0.1, 0.5]    [8, 5]
Λ10    2   [0.7, 0.3]   [0.2, 0.45]   [4, 10]
s = 1, . . . , Si,j. For each person i, we assume the time lapse between two consecutive
test days is a function of j, set to ∆i,j = j + 10 for j = 1, . . . , Ti/2 and
∆i,j = j − 10 for j = Ti/2 + 1, . . . , Ti, with i = 1, . . . , 10.
We set the true values of the model parameters as below:
(a) φ^{-1/2} = 0.05; thus, the corresponding standard deviation of the random
component ωi,j in Equation (4.2) is 0.05√∆i,j.
(b) τ^{-1/2} = (0.361, 0.286, 0.322, 0.362, 0.359, 0.302, 0.347, 0.325, 0.360, 0.378)′, where
each element represents the standard deviation of the random effects ηi,j,s for the i-th
person, respectively.
(c) σ² = 0.7333², which is chosen based on the test design of the EdSphere data.
The parameters specified for the mixture components of the TSP functions are given
in Table 4.1.
We use the priors specified for the unknown parameters of the proposed model
in Section 4.2.1. In particular, for this simulated example, we use a zero-truncated
Poisson prior for each Li with λ = 2, which implies we expect no more than 3
jumps, since each of the TSP mixtures in the simulation consists of at most
2 components. Without available scientific information, we employ a symmetric
Dirichlet distribution with parameter ρ = 1 as the prior for the πi,ℓ's, assign a
uniform(0, 1) prior for each bi,ℓ and specify a uniform(1, 50) prior for each γi,ℓ.
Model fitting is done with the MCMC algorithm described in Section 4.2.2,
based on 50,000 iterations in total. The first 10,000 samples are discarded as burn-in,
and only every 20th value is retained for our inference so as to reduce dependence.
After this burn-in period, the MCMC samples mix well and the trace plots of the
MCMC samples of the parameters indicate convergence.
In Figure 4.2, we display the results for 4 representative individuals among the 10
simulated individuals in the example. These 4 individuals have noticeable differences
in the shapes of their corresponding trajectories. The growth curve of the first subject
increases steadily without an obvious transition, while the second individual
in Figure 4.2 has obvious turning points: he/she grows slowly during the
initial period, then grows rapidly in the middle and reaches a plateau in the end.
The third and 6th subjects in Figure 4.2 show phenomena similar to the
second individual but with different lengths of the three stages in the study.
In each subfigure of Figure 4.2, the blue dotted line represents the true underlying function fi(·), the black dots are the simulated values of θi,j for j = 1, · · · , Ti, and the red dotted line is the posterior median estimate of fi(·), with the red dashed lines representing the corresponding 2.5% and 97.5% credible bands. We can see that the fitted trajectory captures the trend well even though the number of distinct test dates is comparatively small (60). Simulations with larger sample sizes have been tested and the results were generally better; those results are not reported here to save space.
[Four panels, (a) 1st, (b) 2nd, (c) 3rd and (d) 6th individual, each plotting θi against time T and showing the fitted trajectory, the generated θ values, the true underlying function, and the 95% credible band.]
Figure 4.2: The estimation of latent trajectories for the 1st, 2nd, 3rd and 6th subjects
We also calculate the posterior estimates of all other unknowns. The posterior median of φ−1/2 is 0.0561, with a 95% CI of [0.0491, 0.0634], which includes the true value of φ−1/2 (i.e., 0.05). The posterior medians and 95% CIs of the other parameters are displayed in Figure 4.3 and Figure 4.4. We can see that their respective true values, marked by red stars, are all included in the 95% CIs.
[Posterior medians, true values, and 95% credible intervals of τ1^(−1/2), . . . , τ10^(−1/2).]
Figure 4.3: Posterior median and 95% CIs of τi's in TSP mixture based simulation.
[Posterior medians, true values, and 95% credible intervals of βi,0's and βi,1's, i = 1, . . . , 10.]
Figure 4.4: The simulation results of posterior median and 95% CIs of βi,0's and βi,1's in TSP mixture based latent trajectories.
To account for randomness in the simulations, another 100 independent datasets are simulated to check frequentist coverage probabilities, with the same parameter setup but different random seeds. For each dataset, we again run the MCMC sampling for 50,000 iterations, discard the first 10,000 samples as burn-in, and thin the chains by keeping every 20-th sample. The results are shown in Table 4.2. We can see that the coverage probabilities of the 95% CIs for all model parameters are equal or very close to the nominal 95% level. Thus, while the inferential method is Bayesian, it appears to yield sets with good frequentist coverage.
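The coverage check described above can be sketched as follows; this toy version replaces the actual MCMC output with draws from a known, well-calibrated normal posterior, so the empirical coverage should land near the nominal 95% (all names are illustrative):

```python
import random

random.seed(1)

def covers(draws, truth, level=0.95):
    """Does the equal-tailed credible interval of `draws` contain `truth`?"""
    s = sorted(draws)
    lo = s[int(len(s) * (1 - level) / 2)]
    hi = s[int(len(s) * (1 + level) / 2) - 1]
    return lo <= truth <= hi

truth, hits = 0.0, 0
for _ in range(100):                                   # 100 replications
    center = random.gauss(truth, 1.0)                  # sampling variability
    draws = [random.gauss(center, 1.0) for _ in range(2000)]
    hits += covers(draws, truth)

print(hits / 100)   # empirical coverage of the 95% intervals
```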
Table 4.2: Coverage probabilities of φ, θi's, τi's, βi,0's and βi,1's by 100 independent simulations.

Parameter  Coverage    Parameter  Coverage
τ1     0.940    τ2     0.940
τ3     0.960    τ4     0.920
τ5     0.940    τ6     0.900
τ7     0.900    τ8     0.920
τ9     0.880    τ10    0.940
θ1     0.944    θ2     0.932
θ3     0.950    θ4     0.975
θ5     0.969    θ6     0.965
θ7     0.971    θ8     0.931
θ9     0.945    θ10    0.963
β1,0   0.960    β2,0   0.920
β3,0   0.960    β4,0   1.000
β5,0   0.980    β6,0   0.980
β7,0   1.000    β8,0   0.940
β9,0   0.960    β10,0  0.960
β1,1   0.960    β2,1   0.960
β3,1   0.960    β4,1   1.000
β5,1   0.980    β6,1   0.960
β7,1   0.960    β8,1   1.000
β9,1   0.980    β10,1  1.000
φ      0.940
4.3.2 An Example of the Logistic Curve as the Latent Trajectory
In this section, we apply our proposed method to true latent trajectories that are not TSP-based. The setup is the same as in the previous simulation except that the true trajectories fi(·), i = 1, . . . , 10, become the logistic curves below:
f1(t) = −0.5 + 2/(exp(−10t+ 5) + 1); f2(t) = −0.5 + 2/(exp(−5t+ 2) + 1);
f3(t) = 0.5 + 1/(exp(−10t+ 2.5) + 1); f4(t) = 0.5 + 2/(exp(−12t+ 3) + 1);
f5(t) = 2.5 + 1/(exp(−5t+ 3) + 1); f6(t) = −0.5 + 2/(exp(−10t+ 2) + 1);
f7(t) = 1/(exp(−10t+ 2) + 1); f8(t) = 0.5 + 2/(exp(−4t+ 1) + 1);
f9(t) = −1.5 + 2/(exp(−5t+ 1) + 1); f10(t) = 1/(exp(−2t+ 1) + 1).
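All ten curves above have the common form c + a/(exp(−bt + d) + 1) with a, b > 0, and are therefore strictly increasing in t; a quick sketch evaluating the first one:

```python
import math

def logistic_curve(t, c, a, b, d):
    """Shifted and scaled logistic curve c + a / (exp(-b*t + d) + 1)."""
    return c + a / (math.exp(-b * t + d) + 1)

# f1(t) = -0.5 + 2/(exp(-10t + 5) + 1) from the list above
f1 = lambda t: logistic_curve(t, c=-0.5, a=2.0, b=10.0, d=5.0)

grid = [i / 100 for i in range(101)]
vals = [f1(t) for t in grid]

print(round(f1(0.5), 6))                           # midpoint value: 0.5
print(all(x < y for x, y in zip(vals, vals[1:])))  # strictly increasing: True
```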
The motivation for using logistic curves as the true latent trajectories is that logistic curves have been widely applied in growth curve analysis. Model fitting is again based on 200,000 iterations in total; the first 40,000 samples are discarded as burn-in and every 20-th value of the remaining MCMC samples is retained to reduce dependence. We use a zero-truncated Poisson prior for each Li with λ = 3, since a few more TSP components are expected to be needed than in the previous example to fit the logistic curves. As before, we use a symmetric Dirichlet distribution with parameter ρ = 1 for the πi,ℓ's, assign a U(0, 1) prior for each bi,ℓ, and specify a U(1, 50) prior for each γi,ℓ. The priors for the remaining unknowns are specified in the same fashion as in Section 4.3.1.
Figure 4.5 displays the results for 4 selected individuals, where the blue dots represent the true underlying growth curve, the black dots are the simulated values of θi,j for j = 1, · · · , Ti, the red dots correspond to the posterior median estimates of one's ability, and the red dashed lines indicate the 95% credible band of the estimates. Figures 4.5(b) and 4.5(c) show that the true growth curves of the 4-th and 6-th examinees both increase steadily at the beginning and reach a plateau after half of the study period; our estimated growth curves (red dots) capture the trend of the truth (blue dots) very well, and all true values (blue dots) are within the 95% CIs of our estimates (red dashed lines). For the 2-nd and 8-th subjects, seen in Figures 4.5(a) and 4.5(d), the growth curves are strictly increasing over the study period, and our proposed method likewise captures the underlying trend well in these situations.
[Four panels, (a) 2nd, (b) 4th, (c) 6th and (d) 8th individual, each plotting θi against time T and showing the fitted trajectory, the generated θ values, the true underlying function, and the 95% credible band.]
Figure 4.5: The estimation of latent trajectories for the 2nd, 4th, 6th and 8th subjects
In addition, we calculate the posterior estimates of all other unknowns. The posterior median of φ−1/2 is 0.0539, with a 95% CI of [0.0471, 0.0610], which includes its true value 0.05. The posterior medians and 95% CIs of the τi's, βi,0's and βi,1's are displayed in Figure 4.6 and Figure 4.7, in which we can see that the true simulated values of the τi's are all inside their corresponding 95% CIs. Note that in the logistic curve simulations, the true values of the βi,0's and βi,1's are unknown, so we are unable to compare the truth with the corresponding 95% CIs for the βi,0's and βi,1's.
[Posterior medians, true values, and 95% credible intervals of τ1^(−1/2), . . . , τ10^(−1/2).]
Figure 4.6: Posterior median and 95% CIs of τi's in logistic curve simulation.
[Posterior medians and 95% credible intervals of βi,0's and βi,1's, i = 1, . . . , 10.]
Figure 4.7: Posterior median and 95% CIs of βi,0's and βi,1's in logistic curve simulation.
4.4 Application to the EdSphere Data
Since our approach has successfully estimated the trend of fi(·) from simulated data and recovered the true parameter values well, we now apply our two-level hierarchical model to the EdSphere data. Due to the limitation of our computer's RAM (8 GB), a sample of 10 individuals from the EdSphere database was randomly chosen for illustration purposes. The characteristics of these individuals are described in Table A.1. There are two goals for the analysis of the EdSphere datasets. One is to assess the appropriateness of the local independence assumption for this type of data; the other is to understand the growth in students' ability by retrospectively producing the estimated growth trajectories of their latent abilities, incorporating the monotonicity assumption.
The prior specification is the same as in the aforementioned simulation examples, except that we use a zero-truncated Poisson prior for each Li with λ = 1, which corresponds to a prior belief from MetaMetrics researchers that about 2 different TSP functions suffice to explain the changes in a response trajectory. Such a prior assumption can be washed out if the data carry strong information indicating that more mixture components of TSP functions are needed to explain one's ability growth. In addition, to examine the sensitivity of the prior specification for Li, we tried other λ values; the resulting estimates of the latent trajectories are almost the same as those shown in Figure 4.8.
We run 500,000 iterations in total, discard the first 1/5 of the samples as burn-in, and, to reduce dependence of the MCMC samples, keep only every 20-th value. We checked the trace plots of the model parameters to assess convergence of the MCMC samples and observed good mixing after the burn-in and thinning process. Figure 4.8 shows the estimated trajectories of 4 individuals, which represent four types of growth curves we typically observe in the data.
In Figure 4.8, the red dots indicate the estimated posterior median of one's latent ability from our proposed method and the red dashed lines denote the corresponding 95% credible band of one's latent trajectory; the blue dots represent the estimated trajectory from MetaMetrics used in the current EdSphere implementation (which assumes an AR(1) model for θi,j and an observation equation like Model (4.1) but without the random effect ηi,j,s); and the black dots correspond to estimates of one's ability obtained by equating the expected score at a person's ability to the observed score (these can roughly be thought of as the raw test scores put on the same scale as the θi,t's). In Figure 4.8, we can see that our estimated latent trajectory of ability growth (i.e., the red dots for the posterior median of the θi,t's) displays a much smoother monotone increasing trend than MetaMetrics' estimate, which oscillates up and down continuously.
Noticeably, in Figure 4.8, the latent growth curve of the second individual (Figure 4.8(a)), a sixth-grader, increases sharply at about the 300-th day and reaches a plateau before the 450-th day, while the 6-th individual (Figure 4.8(c)) experiences moderate growth during most of the period before stabilizing by the end of the study. The third and 7-th students, in Figures 4.8(b) and 4.8(d) respectively, are both in Grade 2 and share a similar steadily increasing growth shape over the entire study period. However, as seen in Figure 4.8, the growth rate (or learning speed) of the 7-th individual is much faster than that of the third individual, since the study period of the 7-th individual is much shorter than that of the 3rd individual. We can also compare the posterior medians and corresponding 95% CIs of β7,1 and β3,1 (which represent the magnitude of one's ability growth during the study period) in Figure 4.10 to validate the difference between the growth rates of these two individuals. The results in Figure 4.8 inform us that the timing at which students reach the 'learning ceiling' (i.e., plateau), as well as their 'learning speed', differs across individuals. Further investigation or clustering of students based on the shape of the growth curves might help teachers tailor educational practice or assignments to each individual student.
Moreover, we are able to summarize the results for the other parameters in the model. The posterior median of φ−1/2 is 0.0566, and its 95% CI is [0.0301, 0.0797]. The posterior estimates and 95% CIs for τ and β are summarized in Figure 4.9 and Figure 4.10, respectively. In Figure 4.9, all τi values and their corresponding CIs are far away from zero except τ2^(−1/2) and τ3^(−1/2), which suggests that local dependence indeed exists in the EdSphere datasets. In Figure 4.10, all βi,1 values as well as their corresponding 95% CIs are above 0, which shows the data supports the
[Four panels, (a) 2nd, (b) 3rd, (c) 6th and (d) 7th individual, each plotting the trajectory of latent ability θi against time T and showing the posterior median, the MetaMetrics estimate, the raw scores, and the 95% credible band.]
Figure 4.8: The estimation of latent trajectories for the 2nd, 3rd, 6th and 7th subjects
belief that the growth curve of one's reading ability is always increasing. Note that our method does not require any additional restrictions on the model parameters; thus, the parameter values are fully determined by the data. We can also see that the values of the βi,0's and βi,1's vary considerably across individuals.
4.5 Conclusion
In this chapter, we have proposed a Bayesian nonparametric two-level hierarchical model for the analysis of one's latent ability growth in educational testing under longitudinal scenarios. The advantage of our method is its ability to incorporate the
[Posterior medians and 95% credible intervals of τ1^(−1/2), . . . , τ10^(−1/2) for the EdSphere data.]
Figure 4.9: Posterior median and 95% CIs of τi's.
[Posterior medians and 95% credible intervals of βi,0's and βi,1's, i = 1, . . . , 10, for the EdSphere data.]
Figure 4.10: Posterior median and 95% CIs of βi,0's and βi,1's.
monotonic shape constraints into the estimation of latent trajectories. Due to the flexibility of our nonparametric method, we are able to fit any monotonic continuous curve without further restrictions on the parameter estimates. The fact that none of the 95% CIs for the slopes βi,1 includes zero therefore indicates that monotonicity is supported by the EdSphere datasets.
The latent trajectories of ability growth estimated from our approach can help educators and practitioners better understand the behaviors of students in the study, such as their growth patterns (continuous increase, sharp increase, etc.) and the timing at which a student's learning ability reaches its ceiling (i.e., the timing of reaching the plateau). Further studies on clustering these behavioral patterns of individuals would guide the design of educational practice or teaching tailored to individuals. In addition, since the evidence of local dependence is generally strong in the analysis of the EdSphere datasets, we conclude that the use of random effects to model the local dependence appears both necessary and successful.
As the MCMC computation is time-consuming and resource-demanding in our current approach, our next goal is to improve computational efficiency by developing big-data schemes, such as ideas similar to Wei et al. (2017), to make parallel computing possible so that our approach can conveniently be applied to the entire datasets. With improved computational efficiency, we may be able to develop a computable evaluation criterion to assess the fit of the proposed nonparametric model using predictive score ideas (Gneiting and Raftery, 2007). Another potential extension is to explore covariates, including grade and others, and their relationships to the growth of one's ability trajectory. This direction would facilitate grouping individuals by similar characteristics and encourage the development of personalized education.
Chapter 5
Preliminary Results of Model Selection by DIC
5.1 Model Selection via DIC under Hierarchical Model
We have discussed the Bayes factor approach for model selection in Chapters 2
and 3 under the IRT models, which are essentially hierarchical models. In future
research, we want to investigate the model selection performance of the deviance
information criterion (DIC) (Spiegelhalter et al., 2002) and its variants for hierar-
chical models, and potentially develop a reliable model selection criterion for the
IRT models.
The DIC is initially defined by Spiegelhalter et al. (2002) as

DIC = −4 E_{θ|y}{log f(y|θ)} + 2 log f(y|θ̄),    (5.1)

where θ̄ = E(θ|y) is generally taken as the posterior mean. The quantity

PD = −2 E_{θ|y}{log f(y|θ)} + 2 log f(y|θ̄)    (5.2)

is the effective number of model parameters with the "focus" on Θ, θ ∈ Θ.
As briefly discussed at the end of Chapter 3, when one applies the DIC in a hierarchical model setting, it needs to be clearly defined which parameters are "in focus". In addition, different versions of the DIC are proposed by Celeux et al. (2006) for settings with missing data. In particular, denoting the missing data by Z, complete-data DICs are proposed in Celeux et al. (2006), and the commonly used DIC4 is defined as

DIC4 = −4 E_{θ,Z|y}{log f(y, Z|θ)} + 2 E_{Z|y}{log f(y, Z | E_θ(θ|y, Z))},    (5.3)

which requires computing the posterior mean of θ for each value of Z drawn from the posterior. When the analytical form of E_θ(θ|y, Z) is available, this computation is not difficult.
For item response theory models, the first layer usually links a few latent variables to the observed responses via a logistic or probit function. If one wishes to use the DIC to perform model selection under the IRT model framework, e.g., for the dynamic IRT model proposed in Chapter 3, it is not easy to decide which parameters should be "in focus". If only the β coefficients and other hyperparameters, such as ψ, φ, etc., are considered the focus, complex integrations are needed to obtain the data likelihood with respect to these focused parameters. If instead all parameters are considered the focus, the log data likelihood involves only the first layer. Other choices of parameter focus may be proposed as well, which could lead to different results.
Besides the choice of the parameters in focus, if augmented data are introduced to facilitate the sampling process (Albert and Chib, 1993), one may also use a complete-data DIC criterion such as DIC4. It is also possible to treat some of the intermediate variables as "missing data" in the definition of the complete-data DIC4 if they are not parameters "in focus", which may lead to interesting results for the PD, i.e., the "effective" number of parameters. This motivates us to investigate the performance of the original DIC (called DIC1 by Celeux et al. (2006)) and DIC4 under the hierarchical model framework.
5.1.1 The Model
We start with the following simple hierarchical model:

P(yij = 1) = Φ(Xijβ + θi),  i = 1, . . . , n, j = 1, . . . , m,    (5.4)
θi ∼ N(0, σ1²) for i = 1, . . . , [n/2],
θi ∼ N(0, σ2²) for i = [n/2] + 1, . . . , n,

where Xij = (Xij1, . . . , Xijk), β = (β1, . . . , βk)′, and [n/2] is the largest integer less than or equal to n/2.
Denote τ1 = 1/σ1² and τ2 = 1/σ2². We want to select between the model with equal variances, i.e., σ1² = σ2², and the model with unequal variances, i.e., σ1² ≠ σ2². Following Albert and Chib (1993), we introduce the augmented data Zij ∼ N(Xijβ + θi, 1), with Zij > 0 if Yij = 1 and Zij ≤ 0 if Yij = 0. We assign noninformative priors for τ and β as follows: π(τi) ∝ τi^(−1), i = 1, 2, and
π(β) ∝ 1.
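To make the augmentation scheme concrete, here is a minimal sketch of the resulting Gibbs sampler. For brevity it uses an intercept-only design with β held at its true value, so only the Z, θ and τ updates of the full sampler are shown; all names are illustrative:

```python
import math, random
from statistics import NormalDist

random.seed(42)
nd = NormalDist()

# --- Simulate data from a stripped-down version of model (5.4) ---
n, m, beta = 40, 10, 1.0
sig2 = [1.0] * (n // 2) + [4.0] * (n // 2)           # unequal variances
theta_true = [random.gauss(0, math.sqrt(s)) for s in sig2]
Y = [[int(random.random() < nd.cdf(beta + th)) for _ in range(m)]
     for th in theta_true]

def rtruncnorm(mu, positive):
    """Inverse-CDF draw from N(mu, 1) truncated to (0, inf) or (-inf, 0]."""
    p0 = nd.cdf(-mu)                                  # P(draw <= 0)
    u = random.random()
    q = p0 + u * (1 - p0) if positive else u * p0
    return mu + nd.inv_cdf(min(max(q, 1e-12), 1 - 1e-12))

# --- Gibbs iterations (beta fixed at its true value for brevity) ---
theta, tau = [0.0] * n, [1.0, 1.0]
for _ in range(200):
    # Z_ij | rest ~ N(beta + theta_i, 1) truncated by the sign of Y_ij
    Z = [[rtruncnorm(beta + theta[i], Y[i][j] == 1) for j in range(m)]
         for i in range(n)]
    # theta_i | rest ~ N(sum_j(Z_ij - beta) / (m + tau_q), 1 / (m + tau_q))
    for i in range(n):
        t = tau[0] if i < n // 2 else tau[1]
        mu = sum(Z[i][j] - beta for j in range(m)) / (m + t)
        theta[i] = random.gauss(mu, 1 / math.sqrt(m + t))
    # tau_q | rest ~ Gamma(shape n/4, scale 2 / sum theta_i^2)
    s1 = sum(th * th for th in theta[: n // 2])
    s2 = sum(th * th for th in theta[n // 2:])
    tau[0] = random.gammavariate(n / 4, 2 / s1)
    tau[1] = random.gammavariate(n / 4, 2 / s2)

print(1 / tau[0], 1 / tau[1])   # final draws of sigma_1^2 and sigma_2^2
```

Note that `random.gammavariate` is parameterized by shape and scale, so the scale 2/Σθi² matches the rate Σθi²/2 of the full conditional derived below.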
Thus the joint likelihood function can be written as

P(Y, Z, β, θ, X, τ1, τ2)
= P(Y|Z) P(Z|β, θ, X) P(θ|τ1, τ2) π(β) π(τ1) π(τ2)
= (2π)^(−(mn+n)/2) ∏_{i=1}^{n} ∏_{j=1}^{m} [I(Yij = 1)I(Zij > 0) + I(Yij = 0)I(Zij ≤ 0)] exp(−(Zij − Xijβ − θi)²/2)
  × τ1^(n/4−1) τ2^(n/4−1) exp(−[τ1 Σ_{i=1}^{n/2} θi² + τ2 Σ_{i=n/2+1}^{n} θi²]/2),

where Y represents the vector of all Yij's and Z represents the vector of all Zij's.
5.1.2 Choose θ as the Parameters in Focus
Mimicking the definition of DIC4 by Celeux et al. (2006), if we treat only θ as the parameters in focus and fix all others, we denote

DIC(Y, Z, β, τ)
= 2PD(Y, Z, β, τ) + D(Y, Z, β, τ | θ̄)
= −4 E_θ{log f(Y, Z, β, τ | θ) | Y, Z, β, τ} + 2 log f{Y, Z, β, τ | E_θ(θ | Y, Z, β, τ)},    (5.5)
where

θ̄ = E_θ(θ | Y, Z, β, τ),
D(Y, Z, β, τ | θ̄) = −2 log f{Y, Z, β, τ | E_θ(θ | Y, Z, β, τ)},
PD(Y, Z, β, τ) = −2 E_θ{log f(Y, Z, β, τ | θ) | Y, Z, β, τ} + 2 log f{Y, Z, β, τ | E_θ(θ | Y, Z, β, τ)},

and, hence, we define the complete-data DIC4 as

DIC4^θ(Y) = E_{Z,β,τ}(DIC(Y, Z, β, τ) | Y) = E_{Z,β,τ}{2PD(Y, Z, β, τ) + D(Y, Z, β, τ | θ̄)}.
In the first term of Equation (5.5),

log f(Y, Z, β, τ | θ) = −((mn + n)/2) log(2π) − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{m} (Zij − Xijβ − θi)².

In the second term, with

E_θ(θi | Y, Z, β, τ) = Σ_{j=1}^{m} (Zij − Xijβ)/(m + τ1), i = 1, . . . , n/2;
E_θ(θi | Y, Z, β, τ) = Σ_{j=1}^{m} (Zij − Xijβ)/(m + τ2), i = n/2 + 1, . . . , n,

we have

−2 log f{Y, Z, β, τ | E_θ(θ | Y, Z, β, τ)}
= (mn + n) log(2π) + Σ_{i=1}^{n} Σ_{j=1}^{m} (Zij − Xijβ)²
  − Σ_{i=1}^{n/2} (m + 2τ1)[Σ_{j=1}^{m}(Zij − Xijβ)]²/(m + τ1)²
  − Σ_{i=n/2+1}^{n} (m + 2τ2)[Σ_{j=1}^{m}(Zij − Xijβ)]²/(m + τ2)².
The full conditional distribution of θi given Y, Z, β, τ is

θi | Y, Z, β, τ ∼ N( Σ_{j=1}^{m} (Zij − Xijβ)/(m + τq), 1/(m + τq) ),

where q = 1 if i ≤ n/2 and q = 2 otherwise.
To compute the mean deviance D̄ = −2 E_θ{log f(Y, Z, β, τ | θ) | Y, Z, β, τ}, we have

−2 E_θ{log f(Y, Z, β, τ | θ) | Y, Z, β, τ}
= E_{θ|Y,Z,β,τ}{ Σ_{i=1}^{n} [ Σ_{j=1}^{m} (Zij − Xijβ)² − 2 Σ_{j=1}^{m} (Zij − Xijβ)θi + mθi² ] } + (mn + n) log(2π)
= Σ_{i=1}^{n} Σ_{j=1}^{m} (Zij − Xijβ)²
  + Σ_{i=1}^{n/2} [ m/(m + τ1) − (m + 2τ1)[Σ_{j=1}^{m}(Zij − Xijβ)]²/(m + τ1)² ]
  + Σ_{i=n/2+1}^{n} [ m/(m + τ2) − (m + 2τ2)[Σ_{j=1}^{m}(Zij − Xijβ)]²/(m + τ2)² ] + (mn + n) log(2π)
= −2 log f{Y, Z, β, τ | E_θ(θ | Y, Z, β, τ)} + mn/(2(m + τ1)) + mn/(2(m + τ2)).

Thus PD(Y, Z, β, τ) = mn/(2(m + τ1)) + mn/(2(m + τ2)), and the PD in the complete-data DIC4 is obtained as E_{Z,β,τ}{PD(Y, Z, β, τ)}.
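This closed form can be checked numerically: each θi has full-conditional variance 1/(m + τq), so the gap between the posterior-mean deviance and the deviance at the posterior mean is Σi m/(m + τq) = mn/(2(m + τ1)) + mn/(2(m + τ2)). A Monte Carlo sketch with arbitrary residuals rij standing in for Zij − Xijβ (the additive log(2π) constant cancels in the difference and is omitted; this is a numerical check of the algebra, not part of the simulation code):

```python
import math, random

random.seed(0)
n, m = 20, 8
tau = (1.0, 4.0)

# Arbitrary residuals r[i][j] standing in for Z_ij - X_ij * beta.
r = [[random.gauss(0, 1) for _ in range(m)] for _ in range(n)]

def tq(i):
    return tau[0] if i < n // 2 else tau[1]

def deviance(theta):                 # up to the additive log(2pi) constant
    return sum((r[i][j] - theta[i]) ** 2 for i in range(n) for j in range(m))

post_mean = [sum(r[i]) / (m + tq(i)) for i in range(n)]
post_sd = [1 / math.sqrt(m + tq(i)) for i in range(n)]

# Monte Carlo estimate of E[D(theta)] under the full conditional of theta.
draws = 20_000
mean_dev = sum(
    deviance([random.gauss(post_mean[i], post_sd[i]) for i in range(n)])
    for _ in range(draws)
) / draws

p_d_mc = mean_dev - deviance(post_mean)
p_d_closed = m * n / (2 * (m + tau[0])) + m * n / (2 * (m + tau[1]))
print(p_d_mc, p_d_closed)   # agree up to Monte Carlo error
```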
5.1.3 Choose τ as the Parameters in Focus
Similarly, if we focus on τ, we can define DIC4 as follows. First we define

DIC(Y, Z, β, θ)
= 2PD(Y, Z, β, θ) + D(Y, Z, β, θ | τ̄)
= −4 E_τ{log f(Y, Z, β, θ | τ) | Y, Z, β, θ} + 2 log f{Y, Z, β, θ | E_τ(τ | Y, Z, β, θ)},    (5.6)

where

τ̄ = E(τ | Y, Z, β, θ),
D(Y, Z, β, θ | τ̄) = −2 log f{Y, Z, β, θ | E_τ(τ | Y, Z, β, θ)},
PD(Y, Z, β, θ) = −2 E_τ{log f(Y, Z, β, θ | τ) | Y, Z, β, θ} + 2 log f{Y, Z, β, θ | E_τ(τ | Y, Z, β, θ)},

and the complete-data DIC is defined as

DIC4^τ(Y) = E_{Z,β,θ}(DIC(Y, Z, β, θ) | Y) = E_{Z,β,θ}[2PD(Y, Z, β, θ) + D(Y, Z, β, θ | τ̄)].
In the first term of Equation (5.6),

log f(Y, Z, β, θ | τ)
= −((mn + n)/2) log(2π) − Σ_{i=1}^{n} Σ_{j=1}^{m} (Zij − Xijβ − θi)²/2 + (n/4)[log(τ1) + log(τ2)]
  − [τ1 Σ_{i=1}^{n/2} θi² + τ2 Σ_{i=n/2+1}^{n} θi²]/2;

note that this conditional likelihood does not contain the prior for τ. In the second term,

E_τ(τ | Y, Z, β, θ) = ( (n/2)/Σ_{i=1}^{n/2} θi² , (n/2)/Σ_{i=n/2+1}^{n} θi² )′.
Regarding the computation of the mean deviance D̄ = −2 E_τ{log f(Y, Z, β, θ | τ) | Y, Z, β, θ}, notice that if a random variable γ ∼ Γ(α, 1), then

dΓ(α)/dα = ∫_0^∞ log(x) x^(α−1) exp(−x) dx = Γ(α) E{log(γ)},

thus E{log(γ)} = Γ′(α)/Γ(α). Since

τ1 | Y, Z, β, θ ∼ Γ(n/4, 2/Σ_{i=1}^{n/2} θi²),  τ2 | Y, Z, β, θ ∼ Γ(n/4, 2/Σ_{i=n/2+1}^{n} θi²),
we have

E_{τ|Y,Z,β,θ}{log(τ1)}
= E_{τ|Y,Z,β,θ}{log(τ1 Σ_{i=1}^{n/2} θi² / 2)} − log(Σ_{i=1}^{n/2} θi² / 2)
= Γ′(n/4)/Γ(n/4) − log(Σ_{i=1}^{n/2} θi² / 2)
= ψ(n/4) − log(Σ_{i=1}^{n/2} θi² / 2),

where ψ(x) = Γ′(x)/Γ(x) is the digamma function (a built-in function in Matlab). Similarly we can compute E_{τ|Y,Z,β,θ}{log(τ2)}.
Thus we obtain PD(Y, Z, β, θ) = n{log(n/2) − log(2) − ψ(n/4)} for the case when τ1 ≠ τ2, which is a constant that depends only on n. Hence the PD corresponding to the complete-data DIC4 is equal to PD(Y, Z, β, θ).
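Since log(x) − ψ(x) ≈ 1/(2x) for large x, this constant n{log(n/2) − log(2) − ψ(n/4)} is approximately n · (2/n) = 2, matching the PD values reported for DIC4^τ in Table 5.2. A quick numerical check, approximating ψ by a central difference of the log-gamma function because the Python standard library has `math.lgamma` but no digamma (an illustration, not part of the simulation code):

```python
import math

def digamma(x, h=1e-5):
    """Central-difference approximation of psi(x) = d/dx log Gamma(x)."""
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2 * h)

def p_d_tau_focus(n):
    """P_D of the tau-focused complete-data DIC4 for n subjects."""
    return n * (math.log(n / 2) - math.log(2) - digamma(n / 4))

for n in (40, 80, 100):
    print(n, round(p_d_tau_focus(n), 3))   # values slightly above 2
```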
5.1.4 Simulation Results
In the simulation study, we set k = 3 and let X1 = 1 be the intercept, X2 ~iid N(1, 4), and X3 ~iid N(0, 1). We set the true value of β to (1, −1, 1.5)′. We consider two setups for the variance parameters of θ: σ1² = σ2² = 1, and (σ1², σ2²) = (1, 4).
Several sample size combinations (n,m) are considered: (100, 8), (80, 10), (40, 20)
and (100, 20). For each simulation setup, 100 independent simulations are con-
ducted. 8000 MCMC samples are drawn via the Gibbs sampling algorithm and the
first 40% of the draws are discarded for burn-in.
We also consider DIC1, treating all parameters as the parameters in focus, for comparison. Model selection results and the PDs are summarized in Tables 5.1 and 5.2; the reported PDs are median values across the 100 independent simulations, rounded to the nearest tenth. From Table 5.1 we see that DIC4^θ performs much worse than the other two DICs when the true model has σ1² ≠ σ2², and DIC4^τ does slightly better than DIC1. However, when σ1² = σ2² is true, the selection results of DIC4^τ and DIC4^θ are close, but both are worse than DIC1. The performance of DIC1 seems worse when (n, m) changes from (100, 8) to (100, 20) in this case, which needs further investigation.
In terms of PD, the effective number of parameters, note that the PDs of DIC4^τ are identical (after rounding to the nearest tenth) to the number of parameters in focus: when σ1² ≠ σ2², the PD is about 2, and when σ1² = σ2², the PD is equal to 1. This seems like an intuitively desirable result, at least under a non-hierarchical model setup; whether it remains desirable under the hierarchical framework is unclear.
Since the DIC can be decomposed as the sum of the deviance and the PD, it would be interesting to investigate whether combining the deviance of DIC1 with the PD of DIC4 under the hierarchical model framework leads to a measure that outperforms DIC1.
Table 5.1: Model identification counts in 100 simulations per setting of DIC1, DIC4^θ, and DIC4^τ.

                 Data from (σ1², σ2²) = (1, 4)     Data from (σ1², σ2²) = (1, 1)
(n, m)           Select σ1²≠σ2²  Select σ1²=σ2²    Select σ1²=σ2²  Select σ1²≠σ2²
DIC1
(100, 8)               95              5                 77              23
(80, 10)               90             10                 84              16
(40, 20)               77             23                 71              29
(100, 20)              94              6                 67              33
DIC4^θ
(100, 8)               80             20                 38              62
(80, 10)               75             25                 37              63
(40, 20)               62             38                 39              61
(100, 20)              56             44                 57              43
DIC4^τ
(100, 8)               90             10                 36              64
(80, 10)               98              2                 43              57
(40, 20)               84             16                 46              54
(100, 20)              93              7                 62              38
Table 5.2: PD results in 100 simulations per setting of DIC1, DIC4^θ, and DIC4^τ.

                 Data from (σ1², σ2²) = (1, 4)     Data from (σ1², σ2²) = (1, 1)
(n, m)           Fit σ1²≠σ2²    Fit σ1²=σ2²       Fit σ1²=σ2²    Fit σ1²≠σ2²
PD for DIC1
(100, 8)             79.0           82.1               72.3           72.1
(80, 10)             66.5           69.1               61.7           62.2
(40, 20)             38.0           38.7               36.5           36.7
(100, 20)            91.0           93.2               87.2           87.4
PD for DIC4^θ
(100, 8)             92.7           94.6               88.8           88.1
(80, 10)             75.1           76.7               72.8           72.2
(40, 20)             38.7           39.1               38.0           37.9
(100, 20)            96.9           97.8               95.1           95.0
PD for DIC4^τ
(100, 8)              2.0            1.0                1.0            2.0
(80, 10)              2.0            1.0                1.0            2.0
(40, 20)              2.0            1.0                1.0            2.0
(100, 20)             2.0            1.0                1.0            2.0
Appendix A
Posterior Propriety and MCMC Sampling Scheme
Under the mild assumptions stated in Theorem 1 of Wang et al. (2013), we can show that the objective priors introduced in Section 4.2.1 lead to a proper posterior. The proof of posterior propriety follows directly from the lemmas in Appendix C of Wang et al. (2013) by rewriting Model (4.1)–(4.2) as the probit model without data augmentation:

Pr(Xi,j,s,k = 1 | βi,0, βi,1, f⁰i(ti,j), ωi,j, di,j,s,k, ηi,j,s) = Φ(βi,0 + βi,1 f⁰i(ti,j) + ωi,j − di,j,s,k + ηi,j,s).
Further, we redefine Xi,j,s,k = −1 to indicate that a student answers a question incorrectly. Then, following the notation given in Equation (4.7), provided that Λ and the data X are known, the joint posterior distribution of the parameters β, η, τ, φ is:
f(β, η, τ, φ | Λ, X)
∝ ∏_{i=1}^{n} ∏_{j=1}^{Ti} ∏_{s=1}^{Si,j} ∏_{k=1}^{Ki,j,s} Φ[ Xi,j,s,k (βi,0 + βi,1 f⁰i(ti,j) + ωi,j − ai,j,s + ηi,j,s + υi,j,s,k) ]
  × ∏_{i=1}^{n} ∏_{j=1}^{Ti} ∏_{s=1}^{Si,j} ∏_{k=1}^{Ki,j,s} (1/(√(2π) σ)) exp(−υi,j,s,k²/(2σ²))
  × ∏_{i=1}^{n} ∏_{j=2}^{Ti} √(φ/(2π∆i,j)) exp(−φ ωi,j²/(2∆i,j))
  × ∏_{i=1}^{n} ∏_{j=1}^{Ti} (τi/(2π))^(Si,j/2) exp(−τi Σ_{s=1}^{Si,j} ηi,j,s² / 2)
  × [ ∏_{i=1}^{n} π(τi) π(βi,0) π(βi,1) ] π(φ),    (A.1)
where π(φ) ∝ φ^(−3/2), π(τi) ∝ τi^(−3/2), π(βi,0) ∝ 1 and π(βi,1) ∝ 1. Hence, using ideas similar to those used to derive Lemma 2, Lemma 3 and Lemma 4 of Appendix C in Wang et al. (2013), we can show that the joint posterior distribution above is proper. Since the prior we impose on Λ is proper and f(β, η, τ, Λ, φ|X) = f(β, η, τ, φ|Λ, X) ∏_{i=1}^{n} π(Λi), we conclude that the joint posterior distribution of the parameters β, η, τ, Λ, φ given the data X is proper. Therefore, it is legitimate to use the objective priors specified in Section 4.2.1 and base our statistical inference on the resulting posterior distribution.
Next, we give the detailed derivation of each step of the MCMC sampling scheme; we follow the sampling procedure mentioned earlier in Section 4.2.2 and run the MCMC from Step A.1 to Step A.7 until convergence.
A.1 Sampling of Z: Truncated Normal Distribution
Given θ, η, the augmented variables Zi,j,s,k are sampled from the truncated normal distributions below:

Zi,j,s,k ∼ N+(θi,j − ai,j,s + ηi,j,s, 1 + σ²), if Xi,j,s,k = 1;
Zi,j,s,k ∼ N−(θi,j − ai,j,s + ηi,j,s, 1 + σ²), if Xi,j,s,k = 0.
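These truncated-normal draws can be implemented with the inverse-CDF method; a minimal sketch (the function name and the numerical values are illustrative):

```python
import math, random
from statistics import NormalDist

random.seed(7)
nd = NormalDist()

def rtrunc_norm(mu, var, positive):
    """Draw from N(mu, var) restricted to (0, inf) if positive, else (-inf, 0]."""
    sd = math.sqrt(var)
    p0 = nd.cdf((0 - mu) / sd)                 # mass below the truncation point
    u = random.random()
    q = p0 + u * (1 - p0) if positive else u * p0
    q = min(max(q, 1e-12), 1 - 1e-12)          # guard the extreme tails
    return mu + sd * nd.inv_cdf(q)

# Step A.1 for one response: mean theta_ij - a_ijs + eta_ijs, variance 1 + sigma^2
mu, var = 0.8, 1 + 0.7333 ** 2
z_correct = [rtrunc_norm(mu, var, True) for _ in range(1000)]    # X_ijsk = 1
z_wrong = [rtrunc_norm(mu, var, False) for _ in range(1000)]     # X_ijsk = 0

print(min(z_correct) > 0, max(z_wrong) <= 0)
```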
A.2 Sampling of η: Normal Distribution
Given θ, τ, Z, the full conditional distribution of ηi,j,s is N(ζi,j,s, δi,j,s), where

ζi,j,s = Σ_{k=1}^{Ki,j,s} ψi,j,s,k (Zi,j,s,k − θi,j + ai,j,s) / (Σ_{l=1}^{Ki,j,s} ψi,j,s,l + τi),  and  δi,j,s = ( Σ_{k=1}^{Ki,j,s} ψi,j,s,k + τi )^(−1).
A.3 Sampling of τ : Gamma Distribution
Given η, the full conditional distribution of τi is

τi | η ∼ Ga( (Σ_{j=1}^{Ti} Si,j − 1)/2 , Σ_{j=1}^{Ti} Σ_{s=1}^{Si,j} ηi,j,s² / 2 ),

where Ga(·, ·) denotes the gamma distribution, (Σ_{j=1}^{Ti} Si,j − 1)/2 is the shape parameter, and Σ_{j=1}^{Ti} Σ_{s=1}^{Si,j} ηi,j,s² / 2 is the rate parameter.
A.4 Sampling of Λ: Metropolis-Hastings within RJ-MCMC Algorithm
Given θ^(q−1) (here we add the iteration index as a superscript to make the discussion clearer for Subsection A.4), it is easy to derive the full conditional distribution of Λ^(q) as follows:
f(Λ^(q) | θ^(q−1))
∝ |X′des D^(−1) Xdes|^(−1/2) |D|^(−1/2) Γ( (Σ_{i=1}^{n} Ti − 3n − 1)/2 )
  × [ (θ^(q−1) − Xdes β̂)′ D^(−1) (θ^(q−1) − Xdes β̂) / 2 ]^( −(Σ_{i=1}^{n} Ti − 3n − 1)/2 ) ∏_{i=1}^{n} π(Λi),    (A.2)

where Γ(·) is the gamma function, Xdes = diag(x1, · · · , xn) is a block diagonal matrix whose block xi has rows (1, f⁰i(ti,1)), . . . , (1, f⁰i(ti,Ti)), D = diag(D1, · · · , Dn) is also block diagonal with Di = diag(∆i,1, · · · , ∆i,Ti), π(Λi) = π(Li)π(ξi)π(πi) is the prior distribution of Λi, and β̂ = (X′des D^(−1) Xdes)^(−1) X′des D^(−1) θ^(q−1). In our study, we use π(Li) = λ^Li / (Li! (e^λ − 1)), the zero-truncated Poisson prior with parameter λ; π(ξi) = (1/(γU − γL))^Li, which corresponds to U(0, 1) priors for the bi,ℓ's and U(γL, γU) priors for the γi,ℓ's; and π(πi) = (Γ(Li ρ)/Γ(ρ)^Li) ∏_{ℓ=1}^{Li} πi,ℓ^(ρ−1), a symmetric Dirichlet distribution.
For each subject i = 1, . . . , n, we develop a RJ-MCMC to sample Λi from its full conditional distribution f(Λi | θ, Λ−i), where Λ−i = ∪_{j≠i} Λj. At the (q + 1)-th iteration, the RJ-MCMC procedure is divided into three parts:

1. UPDATE, when the dimension index Li does not change, i.e., Li^(q+1) = Li^(q);
2. ADD, when the dimension Li increases to Li + 1, i.e., Li^(q+1) = Li^(q) + 1;
3. REMOVE, when the dimension Li decreases to Li − 1, i.e., Li^(q+1) = Li^(q) − 1.
While q < M:
    Draw u from U(0, 1);
    if u < PU(Li^(q)): obtain Λi^(q+1) with the UPDATE step;
    else if PU(Li^(q)) ≤ u < PU(Li^(q)) + PA(Li^(q)): obtain Λi^(q+1) with the ADD step;
    else: obtain Λi^(q+1) with the REMOVE step.
Figure A.1: A pseudo code of the RJ-MCMC algorithm
We assign probabilities $P_A(L)$, $P_R(L)$, and $P_U(L)$ to decide whether the next step adds, removes, or updates when the current dimension is $L$. In our program, we use equal probabilities when $L \ge 2$, i.e., $P_A(L) = P_R(L) = P_U(L) = 1/3$. Since the dimension must remain positive, when $L = 1$ we set $P_R(1) = 0$ and $P_A(1) = P_U(1) = 1/2$.

The entire algorithm is shown in Figure A.1, where $M$ is the total number of iterations. The following subsections describe each step in detail.
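The move-selection logic above can be sketched as follows; `move_probs` and `choose_move` are hypothetical helper names, not part of the dissertation's code:

```python
import random

random.seed(42)

def move_probs(L):
    """P_U, P_A, P_R for current dimension L: equal thirds, but no REMOVE at L = 1."""
    if L == 1:
        return 0.5, 0.5, 0.0
    return 1.0 / 3, 1.0 / 3, 1.0 / 3

def choose_move(L):
    """Pick UPDATE / ADD / REMOVE by comparing a uniform draw to cumulative probabilities."""
    pU, pA, pR = move_probs(L)
    u = random.random()
    if u < pU:
        return "UPDATE"
    elif u < pU + pA:
        return "ADD"
    return "REMOVE"

moves = [choose_move(3) for _ in range(1000)]
```

At $L = 1$ the cumulative probability $P_U(1) + P_A(1) = 1$, so a REMOVE can never be selected, which enforces the positivity constraint on the dimension.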
A.4.1 UPDATE step

1. Compute $\dfrac{p(\Lambda_i^*, \Lambda_{-i}^{(q)} \mid \theta^{(q)})}{p(\Lambda^{(q)} \mid \theta^{(q)})}$ using Equation (A.2).

2. For the weights $\pi$, we omit the subscript $i$ for brevity in the discussion and use a Dirichlet proposal distribution $q(\pi^* \mid \pi)$ for $\pi^*$:
\[
q(\boldsymbol{\pi}^* \mid \boldsymbol{\pi}) = \frac{\Gamma\!\left(\sum_{j=1}^{L_i} \alpha_j\right)}{\prod_{j=1}^{L_i} \Gamma(\alpha_j)} \prod_{j=1}^{L_i} (\pi_j^*)^{\alpha_j - 1},
\]
where $\alpha_j = h_0 \pi_j + 1$ with $h_0 = 10$, and similarly a Dirichlet proposal distribution $q(\pi \mid \pi^*)$ for $\pi$,
\[
q(\boldsymbol{\pi} \mid \boldsymbol{\pi}^*) = \frac{\Gamma\!\left(\sum_{j=1}^{L_i} \alpha_j^*\right)}{\prod_{j=1}^{L_i} \Gamma(\alpha_j^*)} \prod_{j=1}^{L_i} \pi_j^{\alpha_j^* - 1},
\]
where $\alpha_j^* = h_0 \pi_j^* + 1$ with $h_0 = 10$.

3. Assume $\gamma$ is bounded on the interval $[\gamma_L, \gamma_U]$, where we again omit the subscripts $i$ and $\ell$ for brevity. First, standardize $\gamma$ by
\[
m = \frac{\gamma - \gamma_L}{\gamma_U - \gamma_L}.
\]
For the transformed $\gamma$, a beta distribution proposal for $m^*$ is employed:
\[
q(m^* \mid m) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} (m^*)^{\alpha - 1} (1 - m^*)^{\beta - 1},
\]
where $\alpha = m(h_1 + 2) + 1$ and $\beta = h_1 + 2 - m(h_1 + 2) + 1$ with $h_1 = 10$. Then $\gamma^* = (\gamma_U - \gamma_L) m^* + \gamma_L$. Similarly, we use a beta distribution proposal for $m$:
\[
q(m \mid m^*) = \frac{\Gamma(\alpha^* + \beta^*)}{\Gamma(\alpha^*)\Gamma(\beta^*)} m^{\alpha^* - 1} (1 - m)^{\beta^* - 1},
\]
where $\alpha^* = (h_1 + 2) m^* + 1$ and $\beta^* = h_1 + 2 - (h_1 + 2) m^* + 1$ with $h_1 = 10$.

4. For $b$, we suppress the subscripts in this discussion and also use a beta distribution proposal for $b^*$:
\[
q(b^* \mid b) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} (b^*)^{\alpha - 1} (1 - b^*)^{\beta - 1},
\]
where $\alpha = b(h_2 + 2) + 1$ and $\beta = h_2 + 2 - b(h_2 + 2)$ with $h_2 = 10$. Similarly, the reverse proposal for $b$ is also a beta distribution,
\[
q(b \mid b^*) = \frac{\Gamma(\alpha^* + \beta^*)}{\Gamma(\alpha^*)\Gamma(\beta^*)} b^{\alpha^* - 1} (1 - b)^{\beta^* - 1},
\]
where $\alpha^* = (h_2 + 2) b^* + 1$ and $\beta^* = h_2 + 2 - (h_2 + 2) b^*$ with $h_2 = 10$.

5. Compute the proposal distribution ratio $\dfrac{f(\Lambda_i^{(q)} \mid \Lambda_i^*)}{f(\Lambda_i^* \mid \Lambda_i^{(q)})}$.

6. Obtain the Metropolis–Hastings ratio
\[
\mathrm{MH}(\Lambda_i^*, \Lambda_i^{(q)}) = \frac{p(\Lambda_i^*, \Lambda_{-i}^{(q)} \mid \theta^{(q)})}{p(\Lambda^{(q)} \mid \theta^{(q)})} \cdot \frac{f(\Lambda_i^{(q)} \mid \Lambda_i^*)}{f(\Lambda_i^* \mid \Lambda_i^{(q)})}.
\]

7. Draw $u$ from U(0, 1); take $\Lambda_i^{(q+1)} = \Lambda_i^*$ if $u < \mathrm{MH}(\Lambda_i^*, \Lambda_i^{(q)})$, and $\Lambda_i^{(q+1)} = \Lambda_i^{(q)}$ otherwise.
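Step 3 can be sketched as a small helper that standardizes $\gamma$, draws $m^*$ from the stated beta proposal, and maps back to the original scale. `propose_gamma` is a hypothetical name, with $h_1 = 10$ as in the text:

```python
import random

random.seed(7)
H1 = 10  # concentration constant h_1 from the text

def propose_gamma(gamma, gL, gU):
    """One beta-walk proposal for gamma on [gL, gU], as in Step 3 of the UPDATE step."""
    m = (gamma - gL) / (gU - gL)          # standardize to (0, 1)
    a = m * (H1 + 2) + 1                  # alpha = m(h1 + 2) + 1
    b = (H1 + 2) - m * (H1 + 2) + 1       # beta  = h1 + 2 - m(h1 + 2) + 1
    m_star = random.betavariate(a, b)     # draw m* ~ Beta(alpha, beta)
    return (gU - gL) * m_star + gL, m_star

gamma_new, m_star = propose_gamma(0.3, 0.0, 1.0)
```

The reverse density $q(m \mid m^*)$, needed for the proposal ratio in Step 5, is evaluated the same way with $\alpha^*$ and $\beta^*$ computed from $m^*$.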
A.4.2 ADD step

1. Sample $(\pi^*, \xi^*)$ from a proposal distribution $q(\pi, \xi)$: specifically, sample $\pi^*$ from U(0, 1), $b^*$ from U(0, 1), and $\gamma^*$ from U($\gamma_L$, $\gamma_U$).

2. Let $r$ be the index at which to insert the new variables; sample $r$ from a discrete uniform distribution on $\{1, 2, \cdots, L_i^{(q)} + 1\}$.

3. Set
\[
\Lambda_i^* = \big(L_i^{(q)} + 1,\ \pi_{i,1}^{(q)}(1 - \pi^*), \cdots, \pi_{i,r-1}^{(q)}(1 - \pi^*),\ \pi^*,\ \pi_{i,r}^{(q)}(1 - \pi^*), \cdots, \pi_{i,L_i^{(q)}}^{(q)}(1 - \pi^*),\ \xi_{i,1}^{(q)}, \cdots, \xi_{i,r-1}^{(q)},\ \xi^*,\ \xi_{i,r}^{(q)}, \cdots, \xi_{i,L_i^{(q)}}^{(q)}\big)'.
\]

4. Calculate the posterior conditional distribution ratio
\[
\frac{p(\Lambda_i^* \mid \theta^{(q)}, \Lambda_{-i}^{(q)})}{p(\Lambda_i^{(q)} \mid \theta^{(q)}, \Lambda_{-i}^{(q)})} = \frac{p(\Lambda_i^*, \Lambda_{-i}^{(q)} \mid \theta^{(q)})}{p(\Lambda^{(q)} \mid \theta^{(q)})}
\]
by Equation (A.2).

5. Define
\[
\mathrm{MH}(\Lambda_i^*, \Lambda_i^{(q)}) = \frac{p(\Lambda_i^*, \Lambda_{-i}^{(q)} \mid \theta^{(q)})}{p(\Lambda^{(q)} \mid \theta^{(q)})} \cdot \frac{P_R(L_i^{(q)} + 1)}{P_A(L_i^{(q)})\, q(\pi^*, \xi^*)}\,(1 - \pi^*)^{L_i^{(q)}},
\]
where $P_R(L_i^{(q)} + 1)$ and $P_A(L_i^{(q)})$ can be computed from their definitions in the paragraph above Figure A.1, the proposal in the ratio is $q(\pi^*, \xi^*) = \frac{1}{\gamma_U - \gamma_L}$, and the Jacobian in the RJ-MCMC is $J = (1 - \pi^*)^{L_i^{(q)}}$.

6. Draw $u$ from U(0, 1); take $\Lambda_i^{(q+1)} = \Lambda_i^*$ if $u < \mathrm{MH}(\Lambda_i^*, \Lambda_i^{(q)})$, and $\Lambda_i^{(q+1)} = \Lambda_i^{(q)}$ otherwise.
A.4.3 REMOVE step

1. Let $r$ be the index of the existing variables to remove; sample $r$ from a discrete uniform distribution on $\{1, 2, \cdots, L_i^{(q)}\}$.

2. Set
\[
\Lambda_i^* = \left(L_i^{(q)} - 1,\ \frac{\pi_{i,1}^{(q)}}{1 - \pi_{i,r}^{(q)}}, \frac{\pi_{i,2}^{(q)}}{1 - \pi_{i,r}^{(q)}}, \cdots, \frac{\pi_{i,r-1}^{(q)}}{1 - \pi_{i,r}^{(q)}}, \frac{\pi_{i,r+1}^{(q)}}{1 - \pi_{i,r}^{(q)}}, \cdots, \frac{\pi_{i,L_i^{(q)}}^{(q)}}{1 - \pi_{i,r}^{(q)}},\ \xi_{i,1}^{(q)}, \cdots, \xi_{i,r-1}^{(q)}, \xi_{i,r+1}^{(q)}, \cdots, \xi_{i,L_i^{(q)}}^{(q)}\right)'.
\]

3. Calculate $\dfrac{p(\Lambda_i^*, \Lambda_{-i}^{(q)} \mid \theta^{(q)})}{p(\Lambda^{(q)} \mid \theta^{(q)})}$ as in Step 4 of A.4.2.

4. Define
\[
\mathrm{MH}(\Lambda_i^*, \Lambda_i^{(q)}) = \frac{p(\Lambda_i^*, \Lambda_{-i}^{(q)} \mid \theta^{(q)})}{p(\Lambda^{(q)} \mid \theta^{(q)})} \cdot \frac{P_A(L_i^{(q)} - 1)\, q(\pi_{i,r}^{(q)}, \xi_{i,r}^{(q)})}{P_R(L_i^{(q)})}\,(1 - \pi_{i,r}^{(q)})^{1 - L_i^{(q)}},
\]
where the proposal in the ratio is $q(\pi_{i,r}^{(q)}, \xi_{i,r}^{(q)}) = \frac{1}{\gamma_U - \gamma_L}$ and the Jacobian for the RJ-MCMC is $J = (1 - \pi_{i,r}^{(q)})^{1 - L_i^{(q)}}$.

5. Draw $u$ from U(0, 1); take $\Lambda_i^{(q+1)} = \Lambda_i^*$ if $u < \mathrm{MH}(\Lambda_i^*, \Lambda_i^{(q)})$, and $\Lambda_i^{(q+1)} = \Lambda_i^{(q)}$ otherwise.
By cycling through the steps in A.4.1–A.4.3 for each subject, we complete the sampling of $\Lambda_i^{(q+1)}$, $i = 1, 2, \cdots, n$.
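The REMOVE reweighting in Step 2 is the inverse of the ADD reweighting: position $r$ is deleted and the remaining weights are divided by $(1 - \pi_{i,r}^{(q)})$. A minimal sketch checking the round trip (both helper names are hypothetical):

```python
def remove_component(pi, r):
    """Delete the weight at position r and renormalize the rest by (1 - pi_r)."""
    pi_r = pi[r - 1]
    rest = pi[: r - 1] + pi[r:]
    return [p / (1.0 - pi_r) for p in rest], pi_r

def add_component(pi, r, pi_star):
    """Insert pi_star at position r, shrinking existing weights by (1 - pi_star)."""
    scaled = [p * (1.0 - pi_star) for p in pi]
    return scaled[: r - 1] + [pi_star] + scaled[r - 1:]

# REMOVE inverts ADD: inserting pi_star = 0.4 at r = 2 and then removing
# position 2 recovers the original weights.
pi = [0.2, 0.5, 0.3]
pi_back, pi_r = remove_component(add_component(pi, 2, 0.4), 2)
```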
A.5 Sampling of φ: Gamma Distribution

Given $\boldsymbol{\theta}, \boldsymbol{\Lambda}$, we sample $\phi$ from a gamma distribution:
\[
\phi \mid \boldsymbol{\theta}, \boldsymbol{\Lambda} \sim \mathrm{Ga}\!\left(\frac{\sum_{i=1}^{n} T_i - 3n - 1}{2},\ \frac{(\boldsymbol{\theta} - X_{des}\boldsymbol{\beta})'(\boldsymbol{\theta} - X_{des}\boldsymbol{\beta})}{2}\right), \tag{A.3}
\]
where $(\sum_{i=1}^{n} T_i - 3n - 1)/2$ is the shape parameter and $(\boldsymbol{\theta} - X_{des}\boldsymbol{\beta})'(\boldsymbol{\theta} - X_{des}\boldsymbol{\beta})/2$ is the rate parameter.
A.6 Sampling of β: Multivariate Normal Distribution

Given $\boldsymbol{\theta}, \boldsymbol{\Lambda}, \phi$, we sample $\boldsymbol{\beta}$ from a multivariate normal distribution:
\[
\boldsymbol{\beta} \mid \boldsymbol{\theta}, \boldsymbol{\Lambda}, \phi \sim \mathcal{N}\!\left(\boldsymbol{\beta},\ (X_{des}' \Sigma^{-1} X_{des})^{-1}\right), \tag{A.4}
\]
where the mean $\boldsymbol{\beta}$ is as defined in Section A.4, $\Sigma_i = \phi^{-1} D_i$, and $\Sigma = \mathrm{diag}(\Sigma_1, \cdots, \Sigma_n)$ is a block diagonal matrix.
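Drawing $\beta$ is a standard multivariate normal update. A minimal sketch with hypothetical dimensions and data; since $\Sigma = \phi^{-1} D$, the mean computed with $\Sigma^{-1}$ coincides with $\beta$ as defined in Section A.4:

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical dimensions: N stacked observations, p regression coefficients.
N, p = 10, 4
X = rng.normal(size=(N, p))            # stands in for X_des
phi = 2.0
D = np.diag(rng.uniform(0.5, 1.5, N))
Sigma = D / phi                        # Sigma = diag(Sigma_1, ..., Sigma_n), Sigma_i = phi^{-1} D_i
theta = rng.normal(size=N)

Sinv = np.linalg.inv(Sigma)
precision = X.T @ Sinv @ X                                 # X'_des Sigma^{-1} X_des
mean = np.linalg.solve(precision, X.T @ Sinv @ theta)      # generalized least squares mean
beta = rng.multivariate_normal(mean, np.linalg.inv(precision))
```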
A.7 Sampling of θ: Normal Distribution

Given all other parameters, we sample $\theta_{i,j}$ from a normal distribution $\mathcal{N}(\mu_{i,j}, \Omega_{i,j})$ for $i = 1, \cdots, n$ and $j = 1, \cdots, T_i$, where
\[
\mu_{i,j} = \frac{\sum_{s=1}^{S_{i,j}} \sum_{k=1}^{K_{i,j,s}} \psi_{i,j,s,k}\,(Z_{i,j,s,k} + a_{i,j,s} - \eta_{i,j,s}) + \frac{\phi}{\Delta_{i,j}}\left[\beta_{i,0} + \beta_{i,1} f_i^0(t_{i,j})\right]}{\sum_{s=1}^{S_{i,j}} \sum_{k=1}^{K_{i,j,s}} \psi_{i,j,s,k} + \frac{\phi}{\Delta_{i,j}}},
\]
\[
\Omega_{i,j}^{-1} = \sum_{s=1}^{S_{i,j}} \sum_{k=1}^{K_{i,j,s}} \psi_{i,j,s,k} + \frac{\phi}{\Delta_{i,j}}.
\]
A.8 The Characteristics of the EdSphere Datasets

The characteristics of the 10 selected subjects are displayed in Table A.1.

Table A.1: Characteristics of each selected individual in the EdSphere dataset.

ID   Total tests   #Test days   Max tests/day   Range of items/test   Max gap   Grade
 1       335           185            9                4-21              115       1
 2       139           115            3                2-26              177       6
 3       161           117            5                2-29              194       2
 4       219           118           10                2-22              204       1
 5       234           129            9                2-20              196       1
 6       217           113           13                2-23              150       1
 7       221           128           10                1-25              101       2
 8       266           118            9                3-21              125       1
 9       185           120            6                2-21              105       3
10       232           103           11                2-24              112       4