
Model Diagnostics for Bayesian Networks

Sandip Sinharay
Educational Testing Service

Bayesian networks are frequently used in educational assessments primarily for learning about students’ knowledge and skills. There is a lack of work on assessing the fit of Bayesian networks. This article employs the posterior predictive model checking method, a popular Bayesian model checking tool, to assess the fit of simple Bayesian networks. A number of aspects of model fit, those of usual interest to practitioners, are assessed using various diagnostic tools. This article suggests a direct data display for assessing overall fit, suggests several diagnostics for assessing item fit, suggests a graphical approach to examine whether the model can explain the association among the items, and suggests a version of the Mantel–Haenszel statistic for assessing differential item functioning. Limited simulation studies and a real data application demonstrate the effectiveness of the suggested model diagnostics.

Keywords: discrepancy measure, Mantel–Haenszel statistic, p-values, posterior predictive model checking

1. Introduction

Bayesian inference networks or Bayesian networks (BIN or BN; Pearl, 1988) are frequently used in educational assessments (Mislevy, 1995; Mislevy, Almond, Yan, & Steinberg, 2001; Mislevy, Steinberg, & Almond, 2003; Weaver & Junker, 2003), especially in the context of evidence-centered design (ECD; Almond & Mislevy, 1999), for learning about students’ knowledge and skills given information about their performance in an assessment.

A BN is an appropriate tool when an examination of the items in an assessment suggests that solving each item requires at least a certain level of one or more of a number of prespecified skills. A BN first assumes that for each skill, each examinee has a corresponding discrete proficiency variable that states the skill level of the examinee. The next step in a BN is to describe a probability model for the

Journal of Educational and Behavioral Statistics
Spring 2006, Vol. 31, No. 1, pp. 1–33


This research was supported by Educational Testing Service. The author thanks Shelby Haberman, Russell Almond, Robert Mislevy, Paul Holland, Andrew Gelman, Hal Stern, Matthias von Davier, the two anonymous referees, and Howard Wainer for their invaluable advice. The author also thanks Andrew Gelman and Michael Friendly for their help in creating some of the graphical plots. The author is grateful to Kikumi Tatsuoka for providing the mixed number subtraction data set. The author gratefully acknowledges the help of Rochelle Stern and Kim Fryer with proofreading. Any opinions expressed in this article are those of the author and not of Educational Testing Service.


response of an examinee to an item based on his or her attainment of the skills required to solve the item. For example, a simple BN might assume that the skill variables are binary (i.e., one either has the skill or does not), that the examinees who have all the required skills for item j have a certain probability π1j of solving the item, while those who do not have the required skills have probability π0j (where π0j < π1j) of solving the item. Finally, a BN specifies prior distributions over the skill variables (usually with the help of a graphic model; Pearl, 1988) and over the item parameters.

Model checking for BNs is not straightforward. For even a moderately large number of items, the possible number of response patterns is huge, and the standard χ2 test of goodness of fit does not apply directly. Standard item response theory (IRT) model diagnostics do not apply to BNs because of the presence of discrete and multivariate proficiency variables in the latter models. As a result, the use of statistical diagnostic tools in this context is notably lacking (Williamson, Almond, & Mislevy, 2000).

Recently, Yan, Mislevy, and Almond (2003) and Sinharay, Almond, and Yan (2004) suggested a number of model checking tools for BNs, including item fit plots and an item-fit test statistic. However, a number of questions remain unanswered in this area. Hambleton (1989) comments that the assessment of fit of IRT models usually involves the collection of a wide variety of evidence about model fit, and then making an informed judgment about model fit and the usefulness of a model with a particular set of data. Hambleton’s comment applies equally well to BNs, but is yet to be addressed completely for these models. For example, there is no work yet assessing differential item functioning (DIF) for BNs. In summary, there is significant scope for further research in this area.

The posterior predictive model checking (PPMC) method (Rubin, 1984) is a popular Bayesian model checking tool because of its simplicity, strong theoretical basis, and obvious intuitive appeal. The method primarily consists of comparing the observed data with replicated data—those predicted by the model—using a number of discrepancy measures; a discrepancy measure is like a test statistic in a frequentist analysis that measures the difference between an aspect of the observed data set and a replicated data set. This article applies the PPMC method to assess several aspects of fit of BNs.

First, this article suggests a direct data display (like those suggested by Gelman, Carlin, Stern, & Rubin, 2003) as a quick overall check of model fit. This article then uses the PPMC method to examine item fit for BNs using a number of discrepancy measures.

Sinharay and Johnson (2003) and Sinharay (2005), applying the PPMC method, find the odds ratios (e.g., Agresti, 2002) corresponding to the responses to pairs of items to be useful discrepancy measures in detecting misfit of unidimensional IRT models. This article uses the PPP-values for the odds ratios to check whether the BNs can explain the association among the items adequately.

This article then assesses DIF using the Mantel–Haenszel test suggested by Holland (1985). The test uses “matching groups” based on raw scores of examinees;


however, these groups may not be homogeneous in the context of BNs. Therefore, this article applies the PPMC method using a discrepancy based on a version of the Mantel–Haenszel test that uses the equivalence class memberships (rather than raw scores) to form the “matching groups.”

The BNs deal with latent variables; hence there is a possibility of nonidentifiability or weak identifiability of some parameters. Weak identifiability may lead to nonapplicability of standard asymptotic theory, problems with MCMC convergence, and misleading inferences. Therefore, a diagnostic for model identifiability, although not directly a model-checking tool, is a necessary tool for assessing the performance of a BN. This article uses plots of prior versus posterior distributions of the model parameters to assess their identifiability.

Section 2 introduces the BNs. Section 3 introduces the mixed-number subtraction example (Tatsuoka, 1990) that will form the basis of a number of analyses later. Section 4 provides a brief discussion of the PPMC method and the PPP-value. Section 5 introduces the discrepancy measures to be used for assessing different aspects of model fit. Section 6 applies the direct data display to the mixed-number data and a simulated data set. Section 7 applies the suggested item fit measures to the mixed-number subtraction data set and a simulated data set. Section 8 discusses results of examining association among item pairs. Section 9 discusses results of the DIF analysis. Section 10 assesses identifiability of the parameters. The conclusions of this study are discussed in Section 11.

2. Bayesian Networks

Increasingly, users of educational assessments want more than a single summary statistic out of an assessment. They would like to see a profile of the state of acquisition of a variety of knowledge, skills, and proficiencies for each learner. One technique for profile scoring is the use of a BN.

The idea of a Bayesian inference network (BIN)—also known as a Bayesian network or Bayes net (BN) (Pearl, 1988)—comes primarily from the theory of graphical models, where it is defined as an acyclic directed graph in which each node represents a random variable, or uncertain quantity, which can take two or more possible values. To a statistician, BNs are statistical models that describe a number of observations from a process with the help of a number of latent (or unobserved) categorical variables. In the context of educational assessments, BNs are appropriate when the task analysis of an assessment suggests that the satisfactory completion of any task requires an examinee to have at least a certain level of a number of skills. The unobserved (or latent) variables, representing the level of the skills in the examinees, affect the probability distribution of the observed responses in a BN. BNs are so called because they use Bayes’s rule for inference. Mislevy (1995), Almond and Mislevy (1999), Mislevy et al. (2001), Mislevy et al. (2003), and Weaver and Junker (2003) show how BNs may be useful in psychometrics.

Consider an assessment with I examinees and J items. Suppose Xij denotes the response of the i-th examinee to the j-th item. In general, Xij can be a vector-valued quantity; that is, a single “task” could consist of multiple “items.” It can contain


polytomous outcomes as well. In our example, each item produces a single dichotomous outcome, which is 1 if the response is correct and 0 if it is incorrect. Denote the ability/skill variable vector for the i-th examinee by θi = (θi1, θi2, . . . , θip)′, with θik denoting the skill level of Examinee i for Skill k, and the item parameter vector for the j-th item by πj = (πj1, πj2, . . . , πjm)′. A standard assumption is that of local independence, that is, independence of the responses conditional on the skills. Suppose that a cognitive analysis of the items produced a Q-matrix (e.g., Tatsuoka, 1990), an incidence matrix showing for each item the skills required to solve that item. The next step is to define P(Xij = 1 | θi, πj). As an example of a simple BN, assume that solving the j-th item needs a number of skills, define θi to consist of the indicators of attainment of the skills, let πj = (πj0, πj1)′, and choose

$$P(X_{ij} = 1 \mid \boldsymbol{\theta}_i, \boldsymbol{\pi}_j) = \begin{cases} \pi_{j1} & \text{if the examinee has the skills required to solve item } j, \\ \pi_{j0} & \text{otherwise.} \end{cases}$$

To perform a Bayesian analysis of a BN, one needs a prior (population) distribution p(θ | λ) on the ability variables θ. Mislevy (1995) showed that forming a graphic model (Lauritzen & Spiegelhalter, 1988; Pearl, 1988) over the skill variables provides a convenient way to specify this distribution.

BNs are similar to the deterministic inputs, noisy “and” gate (DINA) model of Junker and Sijtsma (2001) (the only difference between the two being the graphic model for p(θ | λ) in a BN), the noisy inputs, deterministic “and” gate (NIDA) model of Maris (1999), the higher-order latent trait models (de la Torre & Douglas, 2004), and the fusion model of Hartz, Roussos, and Stout (2002), all of which fall in the family of cognitive diagnostic models.

3. The Mixed-Number Subtraction Example

Here, we introduce what will be a running example used throughout this article, one regarding a test of mixed-number subtraction. This example is grounded in a cognitive analysis of middle school students’ solutions of mixed-number subtraction problems. The focus here is on the students learning to use the following method (discussed in, e.g., Tatsuoka, Linn, Tatsuoka, & Yamamoto, 1988):

Method B: Separate mixed numbers into whole number and fractional parts; subtract as two subproblems, borrowing one from the minuend whole number if necessary; and then simplify and reduce if necessary.

As is usual with a typical application of a BN, the analysis started with a cognitive analysis (e.g., Tatsuoka, 1984) of such items to determine the “attributes,” which are important for solving different kinds of problems. The cognitive analysis mapped out a flowchart for applying Method B to a universe of fraction–subtraction problems. A number of key procedures appear, a subset of which is required to solve a given problem according to its structure. To simplify the model, we eliminate the items for which the fractions do not have a common denominator (leaving us with


15 items). The procedures are as follows: (a) Skill 1: Basic fraction subtraction; (b) Skill 2: Simplify/reduce fraction or mixed number; (c) Skill 3: Separate whole number from fraction; (d) Skill 4: Borrow one from the whole number in a given mixed number; and (e) Skill 5: Convert a whole number to a fraction. The cognitive analysis identified Skill 3 as a prerequisite of Skill 4; that is, no students have Skill 4 without also having Skill 3. Thus, there are only 24 possible combinations of the five skills that a given student can possess.

We consider the responses of 325 examinees from a data set discussed in, for example, Tatsuoka (1984) and Tatsuoka et al. (1988). This data set played a central role in the development of the rule space methodology (e.g., Tatsuoka, 1990). The 325 examinees considered here were found to have used Method B (Mislevy, 1995). Table 1 lists 15 items from the data set that will be studied here, characterized by the skills they require in the column “Skills,” which also represents the Q-matrix. An “x” for a skill denotes that the skill is required by the item, while an “o” denotes the opposite. For example, Item 9 requires Skills 1, 3, and 4. The table also shows the proportion-correct scores (“Prop. Corr.”) for each item. The items are ordered with respect to an increasing order of difficulty (or decreasing order of proportion-correct score). The last five columns of the table are discussed later. Many rows of the Q-matrix are identical, corresponding to a group of items that require the same set of skills to solve. Following the terminology of evidence-centered design (ECD; Almond & Mislevy, 1999), we call the patterns corresponding to the rows evidence models (EM); these are shown in Column 4 of the table.

Sinharay et al. (2004) discussed the fact that Items 5 and 6 may not require any mixed number subtraction skills; for example, an examinee may solve Item 5 just by noticing that any number minus the same number results in zero.

Certain patterns of skills will be indistinguishable on the basis of the results of this test. For example, because every item requires Skill 1, the 12 profiles (i.e., skill patterns) that lack Skill 1 are indistinguishable on the basis of these data. Similar logic reveals that there are only nine identifiable equivalence classes of student profiles. This property of the test design will manifest itself later in the analysis. Table 2 describes the classes by relating them to the EMs. Often, distinctions among members of the same equivalence class are instructionally irrelevant. For example, students judged to be in Equivalence Class 1 would all be assigned remedial work in basic subtraction, so no further distinction is necessary.
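As an illustration (not part of the original analysis), the following sketch derives the 24 admissible skill patterns and the nine equivalence classes directly from the skill requirements of the six evidence models listed in Table 1; all names in the code are illustrative.

```python
# Sketch: derive the admissible skill patterns and equivalence classes
# from the evidence-model skill requirements of Table 1.
from itertools import product

# Skills required by each evidence model (EM), read off the Q-matrix in Table 1.
EM_SKILLS = {1: {1}, 2: {1, 2}, 3: {1, 3}, 4: {1, 3, 4}, 5: {1, 3, 4, 5}, 6: {1, 2, 3, 4}}

def admissible(profile):
    """Skill 3 is a prerequisite of Skill 4: drop profiles with Skill 4 but not Skill 3."""
    return not (profile[3] == 1 and profile[2] == 0)    # indices 2 and 3 hold Skills 3 and 4

profiles = [p for p in product((0, 1), repeat=5) if admissible(p)]
print(len(profiles))                                    # 24 possible skill combinations

def solvable_ems(profile):
    """Evidence models whose skill requirements this profile satisfies."""
    owned = {k + 1 for k, v in enumerate(profile) if v == 1}
    return frozenset(em for em, req in EM_SKILLS.items() if req <= owned)

# Profiles that solve exactly the same evidence models cannot be told apart by the test.
classes = {}
for p in profiles:
    classes.setdefault(solvable_ems(p), []).append(p)
print(len(classes))                                     # 9 equivalence classes, as in Table 2
```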

Mislevy (1995) analyzed the data using a BN. Let θi = (θi1, . . . , θi5) denote the attribute/skill vector for Examinee i. In the context of ECD, where the BNs have found most use, the θik are called proficiency variables. These are latent variables describing the knowledge, skills, and abilities of the examinees we wish to draw inferences about. (The term variable is used to emphasize its person-specific nature.)

Mixed-Number Subtraction Link Models

The “link models” in ECD terminology state the likelihood distribution of the data given the model parameters and latent variables. It is assumed that a participant needs


TABLE 1
Skill Requirements and Some P-Values for the Mixed-Number Subtraction Problems

                                  Skills                  Ppbis  PDj^{eq,χ}  PDj^{raw,χ}  PDj^{raw,G}
No.  Text of the Item             (12345)  EM  Prop. Corr.  rpbis   2LC      2LC          2LC          2LC

 1   6/7 − 4/7                    xoooo    1      .79       .50     .40      .74          .26          .19
 5   3/4 − 3/4                    xoooo    1      .70       .52     .00*     .00*         .00*         .00*
 4   11/8 − 1/8                   xxooo    2      .71       .56     .10      .43          .06          .23
 2   3 4/5 − 3 2/5                xoxoo    3      .75       .59     .14      .28          .05          .03*
 3   4 5/7 − 1 4/7                xoxoo    3      .74       .61     .02*     .19          .03*         .02*
 6   3 7/8 − 2                    xoxoo    3      .69       .23     .08      .30          .00*         .00*
 9   3 1/2 − 2 3/2                xoxxo    4      .37       .80     .05      .06          .07          .11
10   4 1/3 − 2 4/3                xoxxo    4      .37       .76     .45      .56          .42          .30
11   7 3/5 − 4/5                  xoxxo    4      .34       .74     .36      .49          .06          .08
14   4 1/3 − 1 5/3                xoxxo    4      .31       .77     .43      .41          .30          .28
 7   4 1/10 − 2 8/10              xoxxo    4      .41       .70     .08      .04*         .01*         .04*
 8   2 − 1/3                      xoxxx    5      .38       .71     .00*     .00*         .00*         .03*
12   3 − 2 1/5                    xoxxx    5      .33       .66     .04*     .04*         .00*         .00*
15   7 − 1 4/3                    xoxxx    5      .26       .75     .06      .52          .30          .38
13   4 4/12 − 2 7/12              xxxxo    6      .31       .78     .23      .49          .53          .52


to have mastered all of the skills required for an item in order to solve it. In an application, however, we will get false positive and false negative results. The link model described below follows that intuition. The BN applied by Mislevy (1995), referred to as the 2-parameter latent class (2LC) model hereafter, uses two parameters for each item, the true-positive and false-positive probabilities, in

$$P(X_{ij} = 1 \mid \boldsymbol{\theta}_i, \boldsymbol{\pi}_j) = \begin{cases} \pi_{j1} & \text{if examinee } i \text{ mastered all the skills needed to solve item } j, \\ \pi_{j0} & \text{otherwise.} \end{cases} \qquad (1)$$

Suppose the j-th item uses Evidence Model s, s = 1, 2, . . . , 6. Although s is determined by the item, this notation does not reflect that. Let δi(s) be the 0/1 indicator denoting whether examinee i has mastered the skills needed for tasks using Evidence Model s or not. The quantity δi(s) for any examinee is completely determined by the values of θ1, θ2, . . . , θ5 for that examinee. The likelihood of the response of the i-th examinee to the j-th item is then expressed as

$$X_{ij} \mid \boldsymbol{\pi}_j, \boldsymbol{\theta}_i \sim \mathrm{Bernoulli}\!\left(\pi_{j\,\delta_i(s)}\right). \qquad (2)$$

The model relies on the assumption of local independence; that is, given the proficiency θi, the responses of an examinee to the different items are assumed independent. The probability πj1 represents a “true-positive” probability for the item; that is, it is the probability of getting the item right for students who have mastered all of the required skills. The probability πj0 represents a “false-positive” probability; it is the probability of getting the item right for students who have yet to master at least one of the required skills. The probabilities πj0 and πj1 are allowed to differ over j (i.e., from item to item). However, we use the same independent priors for all items:

$$\pi_{j0} \sim \mathrm{Beta}(3.5,\ 23.5), \qquad \pi_{j1} \sim \mathrm{Beta}(23.5,\ 3.5). \qquad (3)$$
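As an illustrative sketch only (the article itself fits the model with WinBUGS), the following code simulates a response matrix from the link model in Equations 1–3; the skill-profile matrix and Q-matrix used here are random placeholders, and all names are assumptions of the sketch.

```python
# Sketch: simulate responses from the 2LC link model (Equations 1-3).
import numpy as np

rng = np.random.default_rng(0)
n_examinees, n_items = 325, 15
theta = rng.integers(0, 2, size=(n_examinees, 5))   # placeholder skill profiles
Q = rng.integers(0, 2, size=(n_items, 5))           # placeholder Q-matrix

# Equation 3: independent priors pi_j0 ~ Beta(3.5, 23.5), pi_j1 ~ Beta(23.5, 3.5).
pi0 = rng.beta(3.5, 23.5, size=n_items)             # false-positive probabilities
pi1 = rng.beta(23.5, 3.5, size=n_items)             # true-positive probabilities

# delta[i, j] = 1 if examinee i has every skill that item j requires (Equation 1).
delta = (theta[:, None, :] >= Q[None, :, :]).all(axis=2)

# Equation 2: X_ij ~ Bernoulli(pi_{j, delta_i(s)}).
p = np.where(delta, pi1, pi0)
X = rng.binomial(1, p)                               # one simulated 325 x 15 response matrix
```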


TABLE 2
Description of Each Equivalence Class, With the Evidence Models Whose Items Should Be Solved by Examinees of That Equivalence Class

Equivalence                              EM
Class        Description             1    2    3    4    5    6

1            No Skill 1
2            Only Skill 1            x
3            Skills 1 & 3            x         x
4            Skills 1, 3, & 4        x         x    x
5            Skills 1, 3, 4, & 5     x         x    x    x
6            Skills 1 & 2            x    x
7            Skills 1, 2, & 3        x    x    x
8            Skills 1, 2, 3, & 4     x    x    x    x         x
9            All Skills              x    x    x    x    x    x


Mixed-Number Subtraction Proficiency Model

The prior distribution of the proficiency variables is known as the proficiency model in the ECD literature. In a BN, the prior distribution P(θ | λ) is expressed as a discrete Bayesian network (this is where the name of the model comes from) or graphic model (Lauritzen & Spiegelhalter, 1988; Pearl, 1988). Prior analyses revealed that Skill 3 is a prerequisite to Skill 4. A three-level auxiliary variable θWN incorporates this constraint. Level 0 of θWN corresponds to the participants who have mastered neither skill; Level 1 represents participants who have mastered Skill 3 but not Skill 4; Level 2 represents participants who have mastered both skills. The dependence relationship (see, e.g., Figure 2 of Sinharay et al., 2004, for a graphic representation of the relationship) among the skill variables provided by the expert analysis corresponds to the factorization

$$p(\boldsymbol{\theta} \mid \boldsymbol{\lambda}) = p(\theta_3 \mid \theta_{WN})\, p(\theta_4 \mid \theta_{WN})\, p(\theta_{WN} \mid \theta_1, \theta_2, \theta_5, \boldsymbol{\lambda}_{WN})\, p(\theta_5 \mid \theta_1, \theta_2, \boldsymbol{\lambda}_5)\, p(\theta_2 \mid \theta_1, \boldsymbol{\lambda}_2)\, p(\theta_1 \mid \lambda_1).$$

The parameters λ are defined as follows:

$$\begin{aligned}
\lambda_1 &= P(\theta_1 = 1),\\
\lambda_{2,m} &= P(\theta_2 = 1 \mid \theta_1 = m), \quad m = 0, 1,\\
\lambda_{5,m} &= P(\theta_5 = 1 \mid \theta_1 + \theta_2 = m), \quad m = 0, 1, 2,\\
\lambda_{WN,m,n} &= P(\theta_{WN} = n \mid \theta_1 + \theta_2 + \theta_5 = m), \quad m = 0, 1, 2, 3;\ n = 0, 1, 2.
\end{aligned}$$

Finally, one requires prior distributions P(λ). We assume that λ1, λ2, λ5, and λWN are a priori independent. The natural conjugate priors for the components of λ are either Beta or Dirichlet distributions. The hyperparameters are chosen, as in Mislevy et al. (2001), so that they sum to 27 (relatively strong numbers given the sample size of 325). With such a complex latent structure, strong priors such as those here are necessary to prevent problems with identifiability. These must be supported by relatively expensive elicitation from the experts. The prior probabilities give a chance of 87% of acquiring a skill when the previous skills are mastered and 13% of acquiring the same skill when the previous skills are not mastered. They are given below:

$$\begin{aligned}
\lambda_1 &\sim \mathrm{Beta}(23.5,\ 3.5);\\
\lambda_{2,0} &\sim \mathrm{Beta}(3.5,\ 23.5), \quad \lambda_{2,1} \sim \mathrm{Beta}(23.5,\ 3.5);\\
\lambda_{5,0} &\sim \mathrm{Beta}(3.5,\ 23.5), \quad \lambda_{5,1} \sim \mathrm{Beta}(13.5,\ 13.5), \quad \lambda_{5,2} \sim \mathrm{Beta}(23.5,\ 3.5);\\
\boldsymbol{\lambda}_{WN,m} &= (\lambda_{WN,m,0},\ \lambda_{WN,m,1},\ \lambda_{WN,m,2}),\\
\boldsymbol{\lambda}_{WN,0} &\sim \mathrm{Dirichlet}(15, 7, 5), \quad \boldsymbol{\lambda}_{WN,1} \sim \mathrm{Dirichlet}(11, 9, 7),\\
\boldsymbol{\lambda}_{WN,2} &\sim \mathrm{Dirichlet}(7, 9, 11), \quad \boldsymbol{\lambda}_{WN,3} \sim \mathrm{Dirichlet}(5, 7, 15).
\end{aligned}$$
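The following sketch (not taken from the article) draws a single skill profile from the proficiency model above; the λ values are placeholders set to the prior means implied by the hyperparameters rather than estimates from the data.

```python
# Sketch: draw one skill profile from the proficiency model.
import numpy as np

rng = np.random.default_rng(1)
lam1 = 23.5 / 27                                    # P(theta1 = 1)
lam2 = {0: 3.5 / 27, 1: 23.5 / 27}                  # P(theta2 = 1 | theta1 = m)
lam5 = {0: 3.5 / 27, 1: 13.5 / 27, 2: 23.5 / 27}    # P(theta5 = 1 | theta1 + theta2 = m)
lamWN = {0: [15/27, 7/27, 5/27], 1: [11/27, 9/27, 7/27],
         2: [7/27, 9/27, 11/27], 3: [5/27, 7/27, 15/27]}  # P(thetaWN | theta1 + theta2 + theta5 = m)

theta1 = rng.binomial(1, lam1)
theta2 = rng.binomial(1, lam2[theta1])
theta5 = rng.binomial(1, lam5[theta1 + theta2])
thetaWN = rng.choice(3, p=lamWN[theta1 + theta2 + theta5])
theta3 = int(thetaWN >= 1)                          # Levels 1 and 2 of thetaWN imply Skill 3
theta4 = int(thetaWN == 2)                          # Level 2 implies Skill 4 as well
profile = (theta1, theta2, theta3, theta4, theta5)
```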


Fitting the 2LC Model to the Mixed-Number Subtraction Data

The WinBUGS software (Spiegelhalter, Thomas, & Best, 1999) is used to fit the model to the data set using an MCMC algorithm. For details of the model fitting, see, for example, Sinharay et al. (2004). Five chains of size 2,000 each are used, after a burn-in of 1,000 each, resulting in a final posterior sample size of 10,000.

4. Posterior Predictive Model Checking Method

Let p(y | ω) denote the likelihood distribution for a statistical model applied to data (examinee responses in this context) y, where ω denotes all the parameters in the model. Let p(ω) be the prior distribution on the parameters. Then the posterior distribution of ω is

$$p(\boldsymbol{\omega} \mid \mathbf{y}) \equiv \frac{p(\mathbf{y} \mid \boldsymbol{\omega})\, p(\boldsymbol{\omega})}{\int p(\mathbf{y} \mid \boldsymbol{\omega})\, p(\boldsymbol{\omega})\, d\boldsymbol{\omega}}.$$

The posterior predictive model checking (PPMC) method (Rubin, 1984, building on the work in Guttman, 1967) suggests checking a model using the posterior predictive distribution of replicated data yrep,

$$p(\mathbf{y}^{rep} \mid \mathbf{y}) = \int p(\mathbf{y}^{rep} \mid \boldsymbol{\omega})\, p(\boldsymbol{\omega} \mid \mathbf{y})\, d\boldsymbol{\omega}, \qquad (4)$$

as a reference distribution for the observed data y. The basis of the PPMC method is to compare the observed data y to its reference distribution (Equation 4). In practice, test quantities or discrepancy measures D(y, ω) are defined (Gelman, Meng, & Stern, 1996), and the posterior distribution of D(y, ω) is compared to the posterior predictive distribution of D(yrep, ω), with any significant difference between them indicating a model failure. A researcher may use D(y, ω) = D(y), a discrepancy measure depending on the data only, if appropriate.

A popular summary of the comparison is the tail-area probability or posterior predictive p-value (PPP-value), the Bayesian counterpart of the classical p-value (Bayesian p-value):

$$p = P\left[D(\mathbf{y}^{rep}, \boldsymbol{\omega}) \ge D(\mathbf{y}, \boldsymbol{\omega}) \mid \mathbf{y}\right] = \iint I\left[D(\mathbf{y}^{rep}, \boldsymbol{\omega}) \ge D(\mathbf{y}, \boldsymbol{\omega})\right] p(\mathbf{y}^{rep} \mid \boldsymbol{\omega})\, p(\boldsymbol{\omega} \mid \mathbf{y})\, d\mathbf{y}^{rep}\, d\boldsymbol{\omega}, \qquad (5)$$

where I[A] denotes the indicator function for the event A. Because of the difficulty in dealing with Equations 4 or 5 analytically for all but simple problems, Rubin (1984) suggests simulating replicated data sets from the posterior predictive distribution in practical applications of the PPMC method. One draws N simulations ω1, ω2, . . . , ωN from the posterior distribution p(ω | y) of ω, and draws yrep,n from the likelihood distribution p(y | ωn), n = 1, 2, . . . , N. The


process results in N draws from the joint posterior distribution p(yrep, ω | y) and, equivalently, from p(yrep | y). The expression in Equation 5 implies that the proportion of the N replications for which D(yrep,n, ωn) exceeds D(y, ωn) provides an estimate of the PPP-value. Extreme PPP-values (close to 0, or 1, or both, depending on the nature of the discrepancy measure) indicate model misfit.
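This Monte Carlo estimate can be written schematically as below; the posterior draws, the data simulator, and the discrepancy function are assumed to be supplied by the user, and all names are illustrative.

```python
# Schematic Monte Carlo estimate of the PPP-value in Equation 5.
def ppp_value(y, omega_draws, simulate_data, discrepancy):
    """Proportion of replications in which D(y_rep, omega) >= D(y, omega)."""
    exceed = 0
    for omega in omega_draws:                # omega^1, ..., omega^N drawn from p(omega | y)
        y_rep = simulate_data(omega)         # y_rep,n drawn from p(y | omega^n)
        if discrepancy(y_rep, omega) >= discrepancy(y, omega):
            exceed += 1
    return exceed / len(omega_draws)
```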

Gelman et al. (1996) suggest that the preferable way to perform PPMC is to compare the realized discrepancies D(y, ωn) and the replicated/predicted discrepancies D(yrep,n, ωn) by plotting the pairs {D(y, ωn), D(yrep,n, ωn)}, n = 1, 2, . . . , N, in a scatter plot.

Researchers like Robins, van der Vaart, and Ventura (2000) have shown that PPP-values are conservative (i.e., they often fail to detect model misfit), even asymptotically, for some choices of discrepancy measure, e.g., when the discrepancy measure is not centered. However, a conservative diagnostic with reasonable power is better than tests with unknown properties or poor Type I error rates, because evidence of misfit often leads to removal of test items from the item pool, and it may be to the advantage of the administrators to fail to reject the hypothesis for a borderline item and thus retain the item for future use. Gelman et al. (1996) comment that the PPMC method is useful if one thinks of the current model as a plausible ending point, with modifications to be made only if substantial lack of fit is found, and we are dealing with exactly such applications here.

Empirical and theoretical studies suggest that PPP-values generally have reasonable long-run frequentist properties (Gelman et al., 1996, p. 754). The PPMC method combines well with the Markov chain Monte Carlo (MCMC) algorithms (e.g., Gelman et al., 2003).

5. The Suggested Model Diagnostics

A discrepancy measure that many will be tempted to apply here is the χ2-type measure

$$\sum \frac{\left[X_{ij} - E(X_{ij} \mid \boldsymbol{\pi}_j, \boldsymbol{\theta}_i)\right]^2}{\mathrm{Var}(X_{ij} \mid \boldsymbol{\pi}_j, \boldsymbol{\theta}_i)},$$

where the summation is over the examinees, the items, or both. Although measures of this type are intuitive and useful for other types of models, this measure was found not to be useful by Yan et al. (2003) for the 2LC model. The presence of the latent variable vector θi in the 2LC model ensures that the residual for an individual cannot be too unusual, and even an inadequate model predicts the χ2-type quantity adequately; the same phenomenon is observed by Gelman, Goegebeur, Tuerlinckx, and van Mechelen (2000) in a discrete data regression model. The χ2-type measure was also found to be of little use when computed using E(Xij | πj, λ) and Var(Xij | πj, λ), i.e., when the expectation and variance of Xij are computed by marginalizing over the distribution of θi. Therefore, the χ2-type measure is not considered any further in


this article. The remainder of this section describes the measures that were found useful in this research.

Direct Data Display

A suitable display showing the observed data and a few replicated data sets may be a powerful tool for providing a rough idea about model fit by revealing interesting patterns of differences between the observed and replicated data (e.g., Gelman et al., 2003). For large data sets, however, plotting all data points is prohibitive—but plots for a sample (for example, a few examinees each from the low-scoring, high-scoring, and middle groups) are possible.

Item Fit Analysis

This article first examines if the fitted model can predict a set of simple and natural quantities, the point biserial correlations (e.g., Lord, 1980, p. 33), that is, the simple correlations between the item scores and the raw (number-correct) scores of the examinees. Sinharay and Johnson (2003) and Sinharay (2005) apply a similar approach successfully to unidimensional IRT models.
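A minimal sketch of this discrepancy is given below; because it depends on the data only, D(y, ω) = D(y), and the names are illustrative.

```python
# Sketch: point biserial correlation of item j with the raw (number-correct) scores.
import numpy as np

def point_biserial(X, j):
    """X: I x J 0/1 response matrix (NumPy array); j: item index."""
    raw = X.sum(axis=1)
    return np.corrcoef(X[:, j], raw)[0, 1]
```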

Item Fit Measures Based on Equivalence Class Membership

Here, this article examines item fit measures based on the equivalence class memberships of the examinees. We group the examinees according to the equivalence classes because the equivalence class membership of an examinee, although latent, is a natural quantity in this context. The natural groups are then the nine equivalence classes defined in Table 2. Equivalence Class 1 represents no skills and Equivalence Class 9 represents all skills. The classes in between are roughly ordered in order of increasing skills; however, the ordering is only partial. Even though the memberships are unknown (latent) quantities, the Markov chain Monte Carlo (MCMC) algorithm used for fitting the model produces values of the latent abilities in each iteration, which determine the equivalence class memberships and allow the grouping.

The first discrepancy measure examined is the proportion of examinees Õjk in equivalence class k, k = 1, 2, . . . , K = 9, who answer item j correctly, j = 1, 2, . . . , J. The quantity Õjk is not a truly observed quantity (reflected in the notation), but depends on the parameter values of the model (that is why these quantities are referred to as discrepancy measures here, as in Gelman et al., 1996). They become known if one knows the equivalence class memberships of the examinees, which happens in each iteration of the MCMC algorithm. For a replicated data set, there is one such proportion, denoted Õjk^{rep}. For each combination of item j and equivalence class k, comparison of the values of Õjk and Õjk^{rep} over the iterations of the MCMC provides an idea regarding the fit of the item. This article computes the PPP-value for each combination of item and equivalence class. Even though the Õjk depend on the latent equivalence class memberships and are unknown, the PPMC method integrates (sums, in this context) out the unknown quantities with respect to their posterior distribution to provide natural Bayesian p-values. These p-values provide


useful information about the fit of the items, and may suggest possible improvements of the model. For example, significant p-values for most items for Equivalence Class 1 (examinees with no Skill 1) will mean that the model is not flexible enough to explain examinees in that class and that one should add a component to the model to address that issue (this happens in the mixed-number example later).

The above-mentioned item fit p-values provide useful feedback about item fit, but it will be useful to summarize the fit information for each item into one plot or one number, preferably a p-value. To achieve that, this article uses two test statistics (χ2-type and G2-type), somewhat similar to those in Sinharay (2003), as discrepancy measures.

Consider for this discussion that the values of the θis and πjs are known (which is true for each iteration of the MCMC algorithm, the generated values of the θis and πjs acting as known values), which means, for example, that the equivalence class memberships of all examinees are known. For each item, one can then compute Õjk and Õjk^{rep}, k = 1, 2, . . . , K, which are described above. Let Nk, k = 1, 2, . . . , K, denote the number of examinees in Equivalence Class k. Given the θis and πjs, the expected probability of an examinee in Equivalence Class k answering Item j correctly, Ejk, is the suitable πjδi(s). For example, for examinees in Equivalence Class 1 (those without Skill 1), Ej1 = πj0 for all j, because these examinees do not have the necessary skills to solve any item. On the other hand, for examinees in Equivalence Class 2 (with Skill 1 only), Ej2 = πj1 I[j ∈ {1, 5}] + πj0 I[j ∉ {1, 5}], because these examinees have the necessary skills to solve Items 1 and 5 only. Let us denote the set of all item parameters and ability variables, that is, the set of all θis and πjs, as ω.

The χ2-type measure for item j, denoted henceforth as Dj^{eq,χ}(y, ω), is given by

$$D_j^{eq,\chi}(\mathbf{y}, \boldsymbol{\omega}) = \sum_{k=1}^{K} \frac{N_k\left(\tilde{O}_{jk} - E_{jk}\right)^2}{E_{jk}\left(N_k - E_{jk}\right)}. \qquad (6)$$

The G2-type measure, denoted henceforth as Dj^{eq,G}(y, ω), is given by

$$D_j^{eq,G}(\mathbf{y}, \boldsymbol{\omega}) = 2 \sum_{k=1}^{K}\left[\tilde{O}_{jk} \log\!\left(\frac{\tilde{O}_{jk}}{E_{jk}}\right) + \left(N_k - \tilde{O}_{jk}\right) \log\!\left(\frac{N_k - \tilde{O}_{jk}}{N_k - E_{jk}}\right)\right]. \qquad (7)$$

Each of these two statistics summarizes the fit information for an item, with large values indicating poor fit. A comparison between the posterior distribution of Dj^{eq,χ}(y, ω) [or Dj^{eq,G}(y, ω)] and the posterior predictive distribution of Dj^{eq,χ}(yrep, ω) [or Dj^{eq,G}(yrep, ω)] provides a summary regarding the fit of item j. The comparison can be done using a graphical plot (e.g., Figure 4). Another convenient summary is the posterior predictive p-value (PPP-value) for the discrepancy measure, which can be estimated using the process described in Section 4. Here, a PPP-value very close to 0 indicates a problem (indicating that the variability in the data set appears unusually large compared to that predicted by the model).


Adding the discrepancy measures over the items, an overall discrepancy measure for assessing the fit of the model can be obtained as:

$$D_{overall}^{eq,\chi}(\mathbf{y}, \boldsymbol{\omega}) = \sum_j D_j^{eq,\chi}(\mathbf{y}, \boldsymbol{\omega}). \qquad (8)$$

The PPP-value corresponding to this measure will provide an idea about the overall fit of the model to the data set.

The computation of these PPP-values does not need any approximation (e.g., a rough χ2 approximation, as in Sinharay et al., 2004), and hence results in natural Bayesian p-values summarizing the fit of the items. Cells (item-group combinations) with small frequencies (especially for low or high raw scores) are not a problem with the PPMC approach because it does not need a χ2 assumption. However, for more stability of the discrepancy measures and the PPP-values, equivalence classes with too few examinees can be pooled. Examining the distribution of the values of Dj^{eq,χ}(yrep,m, ωm), m = 1, 2, . . . , M, is a way to check for stability. These quantities should look like draws from a χ2-distribution with (K − d − 1) d.f., where K is the number of equivalence classes for the problem and d is the adjustment in the d.f. due to pooling, if there is a sufficient number of examinees for each class—departure from the χ2 distribution will point to instability of the discrepancy measure. It is possible to examine the mean, variance, and histogram of the replicated discrepancies to ensure the stability of the measure.

Item Fit Measures Based on Raw Scores

The latent quantities (equivalence class memberships), on which the above item fit measures are based, may themselves be estimated with error, especially for a small data set, and hence may lead to a test with poor error rates. Therefore, this part examines item fit using examinee groups based on observed raw scores (as in Orlando & Thissen, 2000). Even though the raw scores do not have a clear interpretation in the context of the multidimensional 2LC model, they are natural quantities for any assessment with dichotomous items. The approach used here to assess item fit is very similar to that in Sinharay (2003).

The first discrepancy examined is the proportion of examinees in raw score group k, k = 1, 2, . . . , (J − 1), who answer item j correctly, j = 1, 2, . . . , J. Denote the observed proportion of examinees in group k answering item j correctly as Ojk. For each replicated data set, there is one (replicated) proportion correct for the j-th item and k-th group, denoted Ojk^{rep}. For each item j, comparison of the values of Ojk and Ojk^{rep}, k = 1, 2, . . . , (J − 1), provides an idea regarding the fit of the item (Oj0 = 0 and OjJ = 1, so they are not included). One way to make the comparison is to compute the posterior predictive p-value (PPP-value) for each item-group combination. This article suggests the use of item fit plots to make the comparison. The approach will be described in detail with the data examples later.


The item fit plots promise to provide useful feedback about item fit, but it will be useful to summarize the fit information for each item into one number, preferably a p-value. To achieve that, this article uses two test statistics (χ2-type and G2-type) similar to those suggested by Orlando and Thissen (2000) as discrepancy measures. Because the measures depend on Ejk = E(Ojk | π, λ), the next paragraph describes the computation of Ejk.

The computation of Ejk, the expected proportion of examinees in group k answering item j correctly, is not trivial. Orlando and Thissen (2000) used the recursive approach of Lord and Wingersky (1984) to compute the expectations in the context of IRT models, and this article uses a slight variation of that; a brief description of the approach follows. Let Tj(θ) be the probability of success on Item j for proficiency θ (remember that for the 2LC model, this will be either πj1 or πj0, depending on θ and the skill requirements of the item). Let Sk(θ) denote the probability of obtaining a raw score of k on a J-item test by an examinee with proficiency variable θ. For convenience, the dependence of Sk(θ) on π and λ is not reflected in the notation. Further, denote by Sk^{*j}(θ) the probability of a raw score k at proficiency θ on items 1, 2, . . . , j − 1, j + 1, . . . , J (i.e., omitting Item j from the set of all items). Then,

$$E_{jk} = \frac{\sum_{\boldsymbol{\theta}} T_j(\boldsymbol{\theta})\, S_{k-1}^{*j}(\boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid \boldsymbol{\lambda})}{\sum_{\boldsymbol{\theta}} S_k(\boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid \boldsymbol{\lambda})}, \qquad (9)$$

where the summation is over all possible values of θ (of which there are 24 for this problem) and p(θ | λ) is the prior distribution on θ. For an IRT model, the summation in Equation 9 is replaced by an integration (Orlando & Thissen, 2000). For computing Sk^{*j}(θ) and Sk(θ), this article uses the recursive approach of Lord and Wingersky (1984) and Orlando and Thissen (2000); the approach involves the terms Tj(θ). The quantity Ejk does not depend on the latent proficiency variables θi.

The suggested χ2-type measure, denoted henceforth as Dj^{raw,χ}(y, ω), is given by

$$D_j^{raw,\chi}(\mathbf{y}, \boldsymbol{\omega}) = \sum_{k=1}^{J-1} N_k \frac{\left(O_{jk} - E_{jk}\right)^2}{E_{jk}\left(1 - E_{jk}\right)}. \qquad (10)$$

The G2-type measure, denoted henceforth as Dj^{raw,G}(y, ω), is given by

$$D_j^{raw,G}(\mathbf{y}, \boldsymbol{\omega}) = 2 \sum_{k=1}^{J-1} N_k \left[O_{jk} \log\!\left(\frac{O_{jk}}{E_{jk}}\right) + \left(1 - O_{jk}\right) \log\!\left(\frac{1 - O_{jk}}{1 - E_{jk}}\right)\right]. \qquad (11)$$

Each of the above two statistics summarizes the fit information for an item, with large values indicating poor fit. One can use Σj Dj^{raw,χ}(y, ω) or Σj Dj^{raw,G}(y, ω) to provide an idea regarding the overall fit of the model.


As discussed for the equivalence class-based measures earlier, a researcher can use graphical plots or PPP-values for Dj^{raw,χ}(y, ω) [or Dj^{raw,G}(y, ω)] to summarize item fit, and these measures can be monitored for stability and examinee groups can be pooled, if required.

Examining Association Among the Items

Consider an item pair in an assessment. Denote by nk,k′ the number of individuals scoring k on the first item and k′ on the second item, k, k′ = 0, 1. We use

$$OR = \frac{n_{11}\, n_{00}}{n_{10}\, n_{01}},$$

the sample odds ratio (see, for example, Agresti, 2002, p. 45), referred to as the odds ratio hereafter, as a test statistic with the PPMC method. Chen and Thissen (1997) apply these to detect violation of the local independence assumption of simple IRT models. Sinharay and Johnson (2003) and Sinharay (2005), applying the PPMC method, find the odds ratios to be useful discrepancy measures in detecting misfit of simple IRT models. Examining the model prediction of odds ratios, which are measures of association, should help a researcher to detect if the fitted model can adequately explain the association among the test items. The BNs assume the ability variables to be multidimensional and attempt to explain the associations among items; therefore, examining how the model failed to reproduce the odds ratios might also provide some insight regarding any associations that the model failed to capture and might lead to an improved model.
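A minimal sketch of the odds ratio as a data-only discrepancy is given below; the 0.5 continuity correction that guards against empty cells is an assumption of the sketch, not part of the definition above.

```python
# Sketch: sample odds ratio for the responses to a pair of items.
import numpy as np

def odds_ratio(x1, x2, eps=0.5):
    """x1, x2: 0/1 response vectors for the two items (NumPy arrays)."""
    x1, x2 = np.asarray(x1), np.asarray(x2)
    n11 = np.sum((x1 == 1) & (x2 == 1))
    n00 = np.sum((x1 == 0) & (x2 == 0))
    n10 = np.sum((x1 == 1) & (x2 == 0))
    n01 = np.sum((x1 == 0) & (x2 == 1))
    return ((n11 + eps) * (n00 + eps)) / ((n10 + eps) * (n01 + eps))
```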

Differential Item Functioning

Holland (1985) suggested the Mantel–Haenszel test statistic for studying DIF in the context of item response data. The test has the strength that it does not require fitting any model and is straightforward to apply.

Suppose one is interested in examining if a given item j shows DIF over a “focal group” F and a “reference group” R. In an application of the Mantel–Haenszel test, the examinees are divided into K matching groups, typically based on their raw scores. For an item, the data from the k-th matched group of reference and focal group members are arranged as a 2 × 2 table, as shown in Table 3. The Mantel–Haenszel


TABLE 3
Table for Mantel–Haenszel Statistic

Group        Right on an Item    Wrong on an Item    Total

Reference    Ak                  Bk                  nRk
Focal        Ck                  Dk                  nFk
Total        Rk                  Wk                  n+k


test statistic for an item is then given by

$$X_{MH}^2 = \frac{\left(\left|\sum_k A_k - \sum_k \mu_k\right| - 0.5\right)^2}{\sum_k \sigma_k^2}, \qquad (12)$$

where μk = nRk Rk / n+k and σk^2 = nRk nFk Rk Wk / [n+k^2 (n+k − 1)]. This article applies the above test statistic to determine DIF for BNs.
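A sketch of Equation 12, given the Table 3 cell counts for each matching group, follows; the names are illustrative.

```python
# Sketch: Mantel-Haenszel statistic (Equation 12) from per-group 2 x 2 tables.
import numpy as np

def mantel_haenszel(A, B, C, D):
    """A, B, C, D: arrays over matching groups k holding the Table 3 cell counts
    (reference right, reference wrong, focal right, focal wrong)."""
    A, B, C, D = map(lambda a: np.asarray(a, dtype=float), (A, B, C, D))
    nR, nF = A + B, C + D                  # reference and focal group sizes
    R, W = A + C, B + D                    # numbers right and wrong in each group
    n = nR + nF                            # group totals n_{+k}
    mu = nR * R / n
    var = nR * nF * R * W / (n ** 2 * (n - 1))
    return (abs(A.sum() - mu.sum()) - 0.5) ** 2 / var.sum()
```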

However, the test employs matched groups based on raw scores of examinees, which are not the most natural quantities in the context of BNs. No DIF in an application of a BN means that the success probabilities of the reference group and the focal group are the same conditional on the skill variables, and raw scores do not necessarily contain all information about the skills; a test incorporating this fact may perform better than the Mantel–Haenszel test.

One simple alternative to the Mantel–Haenszel test is to fit the model separately to the two groups (focal and reference) and compare the estimates of the πjs for the two groups using, for example, a normal or a χ2-test (as in Lord, 1980, in testing for DIF in IRT models). The approach may be costly in a setting such as this one, where an MCMC algorithm is used to fit the model, especially if an investigator is interested in a number of different focal groups and/or reference groups. Furthermore, the approach may be problematic if one or both of the two groups are small—the estimates obtained by fitting a BN to a small data set may not be stable. This line of work is not pursued further in this article.

This article examines a discrepancy measure based on the Mantel–Haenszel test statistic, forming matched groups with respect to the latent skills. This approach does not require running the MCMC more than once. Matching with respect to skills (or, equivalently, equivalence classes) is a natural concept in this context; however, the problem in doing so is that the skills are unknown. Fortunately, the PPMC method and the draws from the MCMC algorithm provide a way to get around that problem.

Suppose we know the skill variables for each examinee and hence know their equivalence class memberships (this will be true for each iteration of MCMC, the generated values acting as the known values). For each equivalence class, we can compute a table such as Table 3 and can compute a discrepancy such as Equation 12. Let us denote the measure for Item j as Dj^{MH}(y, ω). It is straightforward to obtain an estimate of the PPP-value corresponding to Dj^{MH}(y, ω), by computing the proportion of times Dj^{MH}(yrep,m, ωm) exceeds Dj^{MH}(y, ωm), where ωm, m = 1, 2, . . . , M, is a posterior sample and yrep,m, m = 1, 2, . . . , M, are the corresponding posterior predictive data sets.

Discussion

One disadvantage of these techniques is the conservativeness of the PPMC methods. However, as argued earlier, a conservative test is better than a test with unknown properties, and the conservativeness of the PPMC technique may be a boon here. Another disadvantage is that these techniques are based on the MCMC algorithm and hence are computation-intensive. Another issue, which might make some potential users uneasy, is that these techniques are Bayesian in nature and use


prior distributions. In the absence of substantial prior information, one can use noninformative prior distributions to reduce the effect of the prior distribution on the inference. Alternatively, one can perform sensitivity analysis to detect the effect of the prior distributions.

However, the disadvantages are somewhat compensated by the advantages that (a) the suggested techniques provide natural probability statements from a Bayesian point of view, and (b) computations for the techniques require neither complicated theoretical derivations nor extensive simulation studies once an MCMC algorithm has been used to fit the model.

Assessing Identifiability of Model Parameters

The BNs deal with latent classes (see, e.g., Sinharay et al., 2004, and the references therein) and hence there is a possibility of nonidentifiability or weak identifiability, which means that the data provide little information about some parameters. Lindley (1971) argues that even a weakly identifiable Bayesian model may be valid in the sense that it can still describe the data via its identifiable parameters and information supplied a priori. On the other hand, although weakly identifiable models can be valid, they are not optimal for summarizing population characteristics. By failing to recognize weak identifiability, one may make conclusions based on a weakly identified parameter when one has no direct evidence beyond prior beliefs to do so (Garrett & Zeger, 2000). Furthermore, such a model increases the chances of nonapplicability of standard asymptotic theory (e.g., as follows from Haberman, 1981) and problems with MCMC convergence. Therefore, a diagnostic for model identifiability, although not directly a model-checking tool, is a necessary tool for assessing the performance of a BN.

This article uses plots of prior versus posterior distributions of the model parameters πs and λs to assess their identifiability. The closer the two densities, the less information the data provide about the parameter (because the posterior ∝ likelihood × prior), and the weaker is the identifiability of the parameter. It is possible to use numerical measures such as the percentage overlap between the prior and the posterior (e.g., Garrett & Zeger, 2000), but this article does not use those measures.
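As an illustration of the idea (the article relies on the plots rather than on a numerical summary), the following sketch approximates the prior–posterior overlap for a single item parameter from its posterior draws; the Beta prior, the grid, and the kernel density estimate are assumptions of the sketch.

```python
# Sketch: approximate overlap between a Beta prior and the posterior of one parameter.
import numpy as np
from scipy import stats

def prior_posterior_overlap(draws, prior_a=3.5, prior_b=23.5, grid_size=512):
    """draws: posterior draws of, e.g., a false-positive probability pi_j0."""
    grid = np.linspace(0.001, 0.999, grid_size)
    prior = stats.beta.pdf(grid, prior_a, prior_b)
    post = stats.gaussian_kde(draws)(grid)            # kernel density estimate of the posterior
    step = grid[1] - grid[0]
    return np.minimum(prior, post).sum() * step       # area under the lower of the two curves
```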

The following sections apply the suggested model checking tools to the mixed-number subtraction data set and a data set simulated from the 2LC model.

6. Direct Data Display

Figure 1 presents a direct data display for the model fit discussed in the section on fitting the 2LC model to the mixed-number subtraction data. The figure shows the mixed-number subtraction data and eight replicated data sets (chosen randomly) from the 10,000 obtained during the fitting of the model. The plot is somewhat similar to Figures 6.6 and 6.7 of Gelman et al. (2003), which are used to assess the fit of a logistic regression model. Figure 1 shows that there are certain patterns in the observed data that are not present in the replicated data sets. For example, the examinees with the lowest scores could answer only two items correctly (interestingly, these are Items 5 and 6, the idiosyncrasies of which were discussed earlier—both of these can be solved without any knowledge of mixed-number subtraction). In the replicated data sets,


however, these examinees get other items correct as well. The smartest examinees get all items correct in the observed data, which is not true for the replicated data sets. Furthermore, the weakest half of the examinees rarely get any hard items (those belonging to Evidence Models 4 and 5) correct in the observed data, which is not true in the replicated data sets. These patterns suggest that the 2LC model is not entirely satisfactory for the data set.

To examine the performance of the 2LC model when data come from the same model, a limited simulation study is performed. The above analysis of the mixed-number data using the 2LC model produces 10,000 posterior predictive data sets; each of these can be considered to have been generated from the 2LC model. We randomly pick one of them, fit the 2LC model to the data, and generate 10,000 posterior predictive data sets as before.

Figure 2 shows the data simulated from the 2LC model and eight replicated data sets from its analysis. The data set on the left seems to fit in with the replicated data sets on the right. The direct data display does not indicate any deficiency of the model, rightfully so.

7. Item Fit Analysis

Fit of the 2LC Model to the Mixed-Number Subtraction Data

This section examines, using various measures, item fit for the 2LC model with the mixed-number subtraction data set.


FIGURE 1. Mixed-number subtraction data and eight replicated data sets. Note. The left column in Figure 1 shows all observed responses, with a little black box indicating a correct response. The items, sorted according to decreasing proportion correct, are shown along the x-axis; the examinees, sorted according to increasing raw scores, are shown along the y-axis. Right columns show eight replicated data sets from the 2LC model. There are some clear differences between the observed and replicated data sets.


Point Biserial Correlations

Columns 6 and 7 in Table 1 show the observed point biserials (denoted rpbis) and the p-values (denoted Ppbis 2LC) corresponding to each item. The p-values significant at the 5% level are marked with an asterisk. The values show that the 2LC model underpredicts most of the point biserial correlations for this data set, and significantly so for Items 3, 5, 8, and 12.

Figure 3 shows a histogram for the predicted average biserial correlations (from the 10,000 replicated data sets). A vertical line shows the observed average (averaged


FIGURE 2. 2LC data and eight replicated data sets.

FIGURE 3. Observed and predicted average biserial correlations when the 2LC model is fitted to the mixed-number subtraction data.


over all items), which appears improbable under the 2LC model (PPP-value of .00). This indicates some inadequacy of the 2LC model for these data.

Measures Based on Equivalence Classes

Figure 4 shows, for Items 12 and 13, plots of the realized discrepancy Dj^{eq,χ}(y, ω) against the predicted discrepancy Dj^{eq,χ}(yrep, ω). A diagonal line is there for convenience—points consistently below the diagonal (showing that the realized discrepancy is mostly larger than the predicted discrepancy) indicate a problem. The plot toward the left shows that the model cannot explain the responses for the item adequately, the model-predicted discrepancy being almost always smaller than the realized discrepancy. The PPP-value is .04. The plot toward the right shows the opposite—the model seems to reproduce the discrepancy and hence explain the responses for the item adequately (PPP-value = .49).

Column 8 in Table 1 shows the item fit p-values (denoted "$P_{D_j^{eq,\chi}(y, \omega)}$ 2LC") for the measure $D_j^{eq,\chi}(y, \omega)$. Because the p-values for the $\chi^2$-type and $G^2$-type measures are very close, we show and discuss the results for the former only. The p-value for assessing overall fit, that corresponding to the discrepancy measure (Equation 8), is .01 and indicates that the model cannot explain the data set adequately.

Figure 5 shows the p-values for each item for each equivalence class. The plot is like one used in Murdoch and Chow (1996), where elliptical glyphs represent correlations. For a correlation ρ, an ellipse is drawn with the shape of the contour of a bivariate normal distribution with unit variances and correlation ρ. Here, we plot the quantity 2 × (p-value − .5) for each combination of item and equivalence class. A p-value close to zero (one) is transformed to −1 (1) and is represented by almost a line with slope −1 (1); one such example is provided by Item 5, Class 9 (Item 8, Class 1). The figure provides more insight regarding items with extreme p-values. For example, the 2LC model often results in inaccurate predictions for Equivalence Class 1, with a number of extreme p-values, and does not perform too well for Equivalence Class 9 either. It is also relevant here that the posterior means of the numbers of examinees in the equivalence classes are 52, 9, 12, 2, 3, 12, 120, 34, and 80, respectively. An extreme p-value for a larger class (e.g., Class 1, 7, or 8) is more serious, while the same for small classes, such as Classes 4 and 5, should not be of much concern. The figure shows how the model fails for Items 5, 7, 8, and 12, those with extreme p-values for $D_j^{eq,\chi}(y, \omega)$. The p-values corroborate the evidence found in Sinharay et al. (2004) that the model overpredicts the scores of individuals with low proficiency and underpredicts those of individuals with high proficiency.

FIGURE 5. Item fit p-values based on equivalence classes when the 2LC model is fit to the mixed-number data.
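A glyph plot of this kind can be drawn with any standard graphics library. The sketch below is illustrative only (it is not the code behind Figure 5); it maps each p-value to 2 × (p − .5) and draws the contour ellipse of a standard bivariate normal distribution with that quantity as its correlation, so that extreme p-values degenerate to line segments with slope ±1.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse

def pvalue_glyph_plot(pvals):
    """pvals: items-by-classes array of PPP-values; each cell is drawn as the
    contour ellipse of a bivariate normal with correlation 2 * (p - .5)."""
    n_items, n_classes = pvals.shape
    fig, ax = plt.subplots()
    for i in range(n_items):
        for c in range(n_classes):
            rho = 2.0 * (pvals[i, c] - 0.5)
            # Contour of N(0, [[1, rho], [rho, 1]]): axes along +/-45 degrees,
            # with lengths proportional to sqrt(1 + rho) and sqrt(1 - rho).
            ax.add_patch(Ellipse((c, i), width=np.sqrt(1 + rho),
                                 height=np.sqrt(1 - rho), angle=45, fill=False))
    ax.set_xlim(-1, n_classes)
    ax.set_ylim(-1, n_items)
    ax.set_xlabel("equivalence class")
    ax.set_ylabel("item")
    ax.set_aspect("equal")
    plt.show()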

Measures Based on Raw Scores

The Item Fit Plots. Figure 6 provides item fit plots for all items for this example. The horizontal axis of each plot denotes the groups (i.e., the raw scores) of the examinees. The vertical axis represents the proportions correct for the groups. For any group, a point denotes the observed proportion correct, and a box represents the distribution of the replicated proportions correct for that group. A line joins the observed proportions for ease of viewing. The whiskers of the box stretch to the 5th and 95th percentiles of the empirical distribution, and a notch near the middle of the box denotes the median. The width of the box is proportional to the square root of the observed number of examinees in the group (the observed number in a group provides some idea about the severity of a difference between the observed and replicated proportions; a significant difference for a large group is more severe than one for a small group). For any item, too many observed proportions lying far from the center of the replicated values or lying outside the range spanned by the whiskers (i.e., lying outside a 90% prediction interval) indicate a failure of the model to explain the responses to the item. In Figure 6, the plots for all items other than 1, 10, 13, 14, and 15 provide such examples, with a number of proportions (out of a total of 14) lying outside the 90% prediction interval and a few others lying far from the center of the box.

FIGURE 6. Item fit plots when the 2LC model is fitted to the mixed-number subtraction data.

Besides providing an overall idea about the fit for an item, the suggested item fit plot also provides some idea about the region where the misfit occurs (e.g., for low- or high-scoring individuals), the direction in which the misfit occurs (e.g., whether the model overestimates or underestimates performance in the discrepant regions), and certain aspects of the item, such as its difficulty. One can create the plots with 95% prediction intervals as well.

The item fit plots point to an interesting feature of the BNs: the model-predicted proportion correct occasionally goes down as the raw score increases (e.g., for Items 4, 5, 8, and 10), an outcome of the multidimensional nature of the model.
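An item fit plot of this general form can be assembled directly from the observed matrix and the replicated matrices. The following sketch is a simplified illustration rather than the article's own construction: it recomputes the raw-score groups within each replicated data set and shows the 5th-to-95th-percentile band of the replicated proportions correct instead of the box-and-notch display described above.

import numpy as np
import matplotlib.pyplot as plt

def prop_correct_by_score(y, j):
    """Proportion correct on item j within each raw-score group."""
    scores = y.sum(axis=1)
    max_score = y.shape[1]
    return np.array([y[scores == s, j].mean() if np.any(scores == s) else np.nan
                     for s in range(max_score + 1)])

def item_fit_plot(y_obs, y_reps, j):
    obs = prop_correct_by_score(y_obs, j)
    rep = np.array([prop_correct_by_score(y, j) for y in y_reps])
    groups = np.arange(len(obs))
    keep = ~np.isnan(obs)
    lo = np.nanpercentile(rep, 5, axis=0)      # 90% prediction band
    hi = np.nanpercentile(rep, 95, axis=0)
    med = np.nanmedian(rep, axis=0)
    plt.fill_between(groups[keep], lo[keep], hi[keep], alpha=0.3,
                     label="90% prediction interval")
    plt.plot(groups[keep], med[keep], linestyle="--", label="replicated median")
    plt.plot(groups[keep], obs[keep], marker="o", label="observed")
    plt.xlabel("raw score")
    plt.ylabel("proportion correct, item %d" % (j + 1))
    plt.legend()
    plt.show()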

The $\chi^2$-Type Discrepancy Measures and Bayesian p-Values. The last two columns in Table 1 show the item fit p-values (denoted "$P_{D_j^{raw,\chi}(y, \omega)}$ 2LC" and "$P_{D_j^{raw,G}(y, \omega)}$ 2LC") corresponding to the discrepancy measures based on raw scores. The table shows that the model cannot adequately explain the responses for Items 2, 3, 5, 6, 7, 8, and 12. It is interesting that this set includes Items 5, 7, 8, and 12, which were found misfitting using the measures based on the equivalence classes, along with three additional items. These differences might be attributable to the problems mentioned earlier about diagnostics based on latent quantities (what is called "observed" there is actually estimated, and the estimation error may affect the results of the diagnostic) and probably indicate the superiority of the measures based on raw scores over those based on the equivalence classes; this issue needs further research. The p-value corresponding to the overall fit measure $\sum_j D_j^{raw,\chi}(y, \omega)$ is .00, indicating a poor overall fit of the 2LC model.
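The exact forms of $D_j^{raw,\chi}(y, \omega)$ and of the overall measure in Equation 8 are given earlier in the article. Purely as an illustration of how a $\chi^2$-type, raw-score-based discrepancy feeds into the PPMC machinery, the sketch below assumes a form in the spirit of Orlando and Thissen (2000), comparing observed and model-predicted proportions correct within raw-score groups for one posterior draw of the parameters.

import numpy as np

def chisq_raw_discrepancy(y, p_success, j, eps=1e-6):
    """A chi-square-type discrepancy for item j based on raw-score groups.
    y         : 0/1 response matrix (examinees x items)
    p_success : model-based success probabilities for every examinee and item,
                computed from one posterior draw of the parameters
    """
    scores = y.sum(axis=1)
    d = 0.0
    for s in np.unique(scores):
        grp = scores == s
        o = y[grp, j].mean()                  # observed proportion correct
        e = p_success[grp, j].mean()          # model-predicted proportion correct
        e = min(max(e, eps), 1.0 - eps)
        d += grp.sum() * (o - e) ** 2 / (e * (1.0 - e))
    return d

# The PPP-value for item j is the proportion of posterior draws for which the
# discrepancy computed on the replicated data exceeds the one computed on the
# observed data; summing over j gives an overall measure analogous to Equation 8.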

Fit of the 2LC Model to a Data Set Simulated From the 2LC Model

As before, to examine the performance of the 2LC model when data come from the same model, we randomly pick one of the posterior predictive data sets obtained while fitting the 2LC model to the mixed-number data set and use WinBUGS to fit the 2LC model to that data set.

The 2LC model explains the point biserials adequately because none of the p-values corresponding to the biserials for the items is significant at the 5% level. The p-values corresponding to the average and the variance of the biserials are not significant either.

The p-value corresponding to the overall discrepancy measure (Equation 8) is .56; therefore, no evidence of model misfit is found.


The item fit p-value for $D_j^{eq,\chi}(y, \omega)$ is not significant for any item, which is expected because the model fitted is the correct model. There are a few extreme p-values for the item-equivalence class combinations, but these can be attributed to chance, and all but one of them are for the small equivalence classes (Classes 2–6).

The item fit p-values corresponding to $D_j^{raw,\chi}(y, \omega)$ and $D_j^{raw,G}(y, \omega)$ show that the model cannot adequately explain Item 5 (p-values .01 and .00, respectively), which can be attributed to chance. The p-values for a few other items are low as well, although not significant.

8. Examining Association Among the Items

Figure 7, like Figure 5 earlier, shows the PPP-values, obtained by fitting the 2LC model to the mixed-number data set, corresponding to the odds ratio for each item pair. Only the upper diagonal is shown in the plot because the plot is symmetric with respect to the diagonal. The plotting order of the items is determined by the number of extreme p-values for them. The plot indicates that the 2LC model underpredicts the associations among the items too often. The model performs especially poorly for item pairs involving Items 3, 5, 7, 8, and 12. Remember that four of these five were found problematic in the item fit analysis earlier. The 2LC model could not adequately capture the dependence structure among the items.

FIGURE 7. PPP-values for odds ratios for the 2LC model fit to the data.
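The pairwise odds ratios and their PPP-values can be computed as in the sketch below (illustrative code; the 0.5 added to each cell of the 2 × 2 table to keep the ratio finite is an assumption made here, not a detail taken from the article).

import numpy as np

def odds_ratio(y, j, k):
    """Sample odds ratio between items j and k, with 0.5 added to each cell."""
    n11 = np.sum((y[:, j] == 1) & (y[:, k] == 1)) + 0.5
    n00 = np.sum((y[:, j] == 0) & (y[:, k] == 0)) + 0.5
    n10 = np.sum((y[:, j] == 1) & (y[:, k] == 0)) + 0.5
    n01 = np.sum((y[:, j] == 0) & (y[:, k] == 1)) + 0.5
    return (n11 * n00) / (n10 * n01)

def odds_ratio_ppp(y_obs, y_reps):
    """Upper-triangular matrix of PPP-values, one for each item pair."""
    n_items = y_obs.shape[1]
    ppp = np.full((n_items, n_items), np.nan)
    for j in range(n_items):
        for k in range(j + 1, n_items):
            obs = odds_ratio(y_obs, j, k)
            rep = np.array([odds_ratio(y, j, k) for y in y_reps])
            ppp[j, k] = np.mean(rep >= obs)
    return ppp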

When the 2LC model is fitted to a data set simulated from the 2LC model (and analyzed earlier, during the item fit analysis), a similar plot (not shown here) has no significant p-value at all, indicating that the 2LC model adequately predicts the associations among the items of a data set generated from the 2LC model.

9. Differential Item Functioning

Analysis of a Simulated Data Set With No DIF

Unfortunately, no demographic information regarding the examinees taking the mixed-number subtraction test is available. Therefore, this article applies the DIF measures to data sets simulated from the 2LC model; randomly chosen subsets of the simulated data are used as the focal group and the reference group. The two rows of numbers under "No DIF" in Table 4 provide the p-values corresponding to the Mantel–Haenszel (MH) statistic (Holland, 1985) and the measure $D_j^{MH}(y, \omega)$ when 150 randomly chosen examinees (from a total of 325) form the focal group and the remaining form the reference group. The numbers show considerable similarity between the two sets of p-values. Most importantly, no item has a p-value significant at the 5% level under either analysis, which is expected because no DIF is present.

To study the issue further, the above process is repeated 100 times, each time treating a different random sample of 150 examinees as the focal group and the remaining examinees as the reference group. The overall proportion of times the Mantel–Haenszel statistic is significant at the 5% level (the Type I error rate) is 0.05, right at the nominal significance level. The same proportion is 0.04 for $D_j^{MH}(y, \omega)$, probably pointing to the slight conservativeness of the measure.
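For reference, the classical Mantel–Haenszel chi-square for one studied item, stratifying examinees on a matching score, can be computed as in the sketch below (illustrative code; stratification on the rest score and the 0.5 continuity correction are standard choices but are assumptions made here, and the posterior predictive analogue $D_j^{MH}(y, \omega)$ defined earlier in the article is not reproduced).

import numpy as np
from scipy.stats import chi2

def mantel_haenszel(y, j, focal, strata=None):
    """Mantel-Haenszel chi-square (with continuity correction) for item j.
    y: 0/1 response matrix; focal: boolean vector marking the focal group;
    strata: matching variable (defaults to the rest score, i.e., the raw
    score computed without item j)."""
    if strata is None:
        strata = np.delete(y, j, axis=1).sum(axis=1)
    a_sum = e_sum = v_sum = 0.0
    for s in np.unique(strata):
        idx = strata == s
        ref = idx & ~focal
        n_r, n_f, t = ref.sum(), (idx & focal).sum(), idx.sum()
        m1 = y[idx, j].sum()            # number correct in the stratum
        m0 = t - m1
        if t < 2 or n_r == 0 or n_f == 0:
            continue                    # stratum carries no information
        a_sum += y[ref, j].sum()        # correct responses in the reference group
        e_sum += n_r * m1 / t
        v_sum += n_r * n_f * m1 * m0 / (t ** 2 * (t - 1))
    if v_sum == 0:
        return np.nan, np.nan
    stat = (abs(a_sum - e_sum) - 0.5) ** 2 / v_sum
    return stat, chi2.sf(stat, df=1)    # statistic and its p-value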

Analysis of Simulated Data With DIF

As a next step, this article analyzes a data set under conditions where DIF is present. The starting point is a data set simulated from the 2LC model (analyzed earlier).

TABLE 4
P-Values for DIF Analysis for a Simulated Data Set With No DIF and Another Simulated Data Set With DIF for Item 13

Item:                        1    2    3    4    5    6    7    8    9   10   11   12   13   14   15
No DIF:
  MH                       .57  .82  .53  .32  .69  .20  .27  .48  .64  .59  .66  .99  .75  .15  .14
  $D_j^{MH}(y, \omega)$    .61  .60  .61  .57  .79  .14  .37  .36  .85  .56  .85  .82  .78  .19  .27
With DIF:
  MH                       .93  .34  .67  .32  .46  .29  .27  .67  .48  .59  .66  .77  .01  .15  .14
  $D_j^{MH}(y, \omega)$    .73  .29  .57  .57  .89  .15  .41  .51  .81  .53  .87  .71  .00  .20  .28


Table 4 shows that the p-value for Item 13 is quite high for both methods, so DIF is artificially introduced for that item. The overall proportion correct for the item is 0.31. A sample of 150 examinees (the same sample corresponding to the "No DIF" case) is chosen randomly to act as the focal group, while the remaining examinees form the reference group. Of the wrong responses (score 0) to Item 13 in the focal group, 15 are reset to score 1. To ensure that the overall ability of the focal group does not increase by this process, a score of 1 on another randomly chosen item for each of these 15 examinees is reset to 0. This process artificially introduces a difference in proportion correct of 0.1 for Item 13 between the two groups. The data set is analyzed using the two methods, and the same focal and reference groups employed for the data generation are used in analyzing the data.
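The manipulation just described can be written compactly; the sketch below is an illustration of the scheme (not the article's code), flipping 15 incorrect responses to the studied item to correct within the focal group and compensating by flipping one correct response on another randomly chosen item for each of those examinees, so that raw scores are unchanged.

import numpy as np

def inject_dif(y, focal, item, n_flip=15, seed=0):
    """Return a copy of y with DIF introduced for `item` (0-based index)."""
    rng = np.random.default_rng(seed)
    y = y.copy()
    wrong = np.where(focal & (y[:, item] == 0))[0]
    chosen = rng.choice(wrong, size=n_flip, replace=False)
    y[chosen, item] = 1                      # make the item easier for the focal group
    for i in chosen:                         # keep each examinee's raw score unchanged
        others = np.where(y[i] == 1)[0]
        others = others[others != item]      # assumes at least one other correct response
        y[i, rng.choice(others)] = 0
    return y

# e.g., y_dif = inject_dif(y_sim, focal, item=12) introduces DIF for Item 13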

The two rows of numbers under "With DIF" in Table 4 provide the p-values corresponding to the Mantel–Haenszel test and the measure $D_j^{MH}(y, \omega)$. The results are encouraging. The only p-value significant at the 5% level under both methods is that for Item 13, the one for $D_j^{MH}(y, \omega)$ being slightly more extreme than that for the Mantel–Haenszel statistic.

Furthermore, a few more data sets are simulated under conditions where DIF is present. The data-generating scheme is very similar to that employed above, except that the number of wrong responses to Item 13 that are reset to correct responses in the focal group is reduced gradually from 15 to lower numbers. The DIF p-values for Item 13 for both statistics increase as an outcome, the p-value for the Mantel–Haenszel test statistic always being slightly larger than that for $D_j^{MH}(y, \omega)$. For example, when 11 scores of 0 are reset to 1, the p-value for the Mantel–Haenszel test statistic is .051, while that for $D_j^{MH}(y, \omega)$ is .039.

From the above limited simulations, both the Mantel–Haenszel DIF statistic and $D_j^{MH}(y, \omega)$ seem to perform reasonably.

10. Identifiability of the Model Parameters

Figure 8 shows the prior distributions and posterior distributions for all the πs obtained by fitting the 2LC model to the mixed-number subtraction data. For any item, the hollow circles denote a random sample from the joint prior distribution of $\pi_{j0}$ (plotted along the x-axis) and $\pi_{j1}$ (plotted along the y-axis), while the solid circles denote a random sample from the joint posterior distribution of $\pi_{j0}$ and $\pi_{j1}$. Figure 9 plots the prior distributions versus the posterior distributions for the λs.

The plots indicate identifiability problems with the 2LC model in regard to the mixed-number subtraction data set. From Figure 8, it is clear that the $\pi_{j0}$s are barely identifiable (with a similar horizontal spread of the prior and posterior samples) for the first three or four items (which are the easiest); one reason for this is the presence of few examinees who do not have the required skills for these items but answer them correctly. Figure 8 also indicates that the prior distributions of $\pi_{j0}$ for Items 5 and 6 (whose idiosyncrasies were discussed earlier) are not in agreement with the corresponding posteriors.
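This identifiability check amounts to overlaying a sample from the prior on a sample from the posterior for each parameter pair. The sketch below is a minimal illustration; the independent Beta priors used to generate the prior sample are placeholders for whatever priors the fitted model actually uses, and post_pi0 and post_pi1 stand for the retained MCMC draws of $\pi_{j0}$ and $\pi_{j1}$ for one item.

import numpy as np
import matplotlib.pyplot as plt

def prior_posterior_overlay(post_pi0, post_pi1, a=1.0, b=1.0, n_prior=1000, seed=0):
    """Overlay a prior sample (hollow circles) on a posterior sample
    (solid circles) for one item's (pi_j0, pi_j1), as in Figure 8.
    Heavy overlap of the two clouds signals a weakly identified parameter."""
    rng = np.random.default_rng(seed)
    prior0 = rng.beta(a, b, n_prior)     # placeholder independent Beta priors
    prior1 = rng.beta(a, b, n_prior)
    plt.scatter(prior0, prior1, facecolors="none", edgecolors="gray", s=10,
                label="prior draws")
    plt.scatter(post_pi0, post_pi1, color="black", s=10, label="posterior draws")
    plt.xlabel("pi_j0")
    plt.ylabel("pi_j1")
    plt.xlim(0, 1)
    plt.ylim(0, 1)
    plt.legend()
    plt.show()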

FIGURE 8. Prior sample (hollow circles) and posterior sample (solid circles) for the π's for the 2LC model fit to the data.

FIGURE 9. Prior vs. posterior distributions of λs for the 2LC model fit to the data. Note. Prior distributions are denoted by dashed lines and posterior distributions by solid lines. The x-axis, not labeled, ranges from 0 to 1.


Figure 9 shows more severe problems with the model applied. Of the 18 λs plotted, the prior and posterior distributions almost completely overlap for six ($\lambda_{2,0}$, $\lambda_{5,0}$, $\lambda_{WN,0,0}$, $\lambda_{WN,0,1}$, $\lambda_{WN,0,2}$, $\lambda_{WN,1,0}$). The two distributions are not too far apart for a few others ($\lambda_{5,1}$, $\lambda_{WN,1,1}$, $\lambda_{WN,1,2}$, $\lambda_{WN,2,1}$, $\lambda_{WN,2,2}$). The nonidentifiability of the parameters indicates that, for example, there is little information (because of the lack of such an item in the test) about the chance of having Skill 2 in the absence of Skill 1.

The findings above point to an undesirable characteristic of the BNs. It is apparent that although a model may seem attractive from a cognitive perspective, its parameters may not be well identified. A user of BNs and similar models, such as DINA (Junker & Sijtsma, 2001), NIDA (Maris, 1999), higher-order latent trait models (de la Torre & Douglas, 2004), and the fusion model (Hartz, Roussos, & Stout, 2002), should therefore exercise caution. The best option is to design an assessment that appropriately captures the necessary skills that interest the researcher and then apply a model whose parameters are well identified. If that is not possible, one must be careful in drawing conclusions from such models to avoid inferences based on weakly identified parameters.

11. Conclusions

There is a lack of statistical techniques for assessing the fit of Bayesian networks (BNs). Yan et al. (2003) and Sinharay et al. (2004) suggest a number of diagnostic measures for such models, but there is substantial scope for further investigation in the area. This article shows how to use the PPMC method to assess the fit of simple BNs.

First, a direct data display, comparing the observed data and a number of predicted data sets, is suggested as a quick check of the fit of the model. The display is found to provide useful information about the fit of the model.

Second, the PPP-values corresponding to the point biserial correlations (a natural quantity for any assessment with dichotomous items) are shown to provide useful feedback about model fit.

Third, this article suggests two sets of item fit diagnostics, one set using examinee groups based on equivalence class membership (which is determined by the skills) and another set using groups based on the raw scores of the examinees. It is possible to examine item fit plots associated with the raw-score-based item fit diagnostics. Limited simulation studies and real data analyses demonstrate the usefulness of the item fit plots and the corresponding PPP-values. In combination with the item fit plots suggested in Sinharay et al. (2004), the measures based on equivalence classes promise to provide useful feedback regarding item fit for BNs. The item fit diagnostics based on raw scores seem to have more power than those based on equivalence classes.

Next, a graphical approach to examining the prediction of the odds ratios among item pairs provides useful model fit information regarding interactions between items. This article also investigates the issue of DIF in the context of BNs, using the Mantel–Haenszel statistic (Holland, 1985) and a measure based on posterior predictive checks. Both statistics seem to perform reasonably well for BNs in very limited simulations, but the issue needs further investigation.

Finally, this article suggests assessing the identifiability of the model parameters, by comparing their prior and posterior distributions, as a part of any analysis fitting a BN. Such comparisons show that the 2LC model, applied to the mixed-number subtraction data set in a number of publications, has severe identifiability problems. This phenomenon warns users of BNs and other similar models to be cautious; in an application, it is highly undesirable to draw conclusions based on weakly identified parameters.

Overall, this article suggests useful model diagnostics for BNs to address the areas of model fit that are of usual concern to psychometricians. The PPP-values are known to be conservative (see, e.g., Robins et al., 2000), so the usual concern is primarily about the power of the method; the results from this study are encouraging from that viewpoint. The measures detect the misfit of the 2LC model although the data set is rather small, supporting the findings from Sinharay et al. (2004) that the model is inadequate for the data set. The diagnostics promise to be useful not only to the users of BNs, but also to those using other similar cognitive diagnostic models, such as the DINA (Junker & Sijtsma, 2001) and NIDA (Maris, 1999) models, the higher-order latent trait models (de la Torre & Douglas, 2004), and the fusion model (Hartz et al., 2002). Most of these models are fitted using the MCMC algorithm and, hence, the PPMC method with the suggested measures is a natural candidate for a model checking tool.

Gelman et al. (2003, p. 176) comment that finding extreme p-values and rejecting a model is never the end of an analysis, and that the departures of the test quantities from the posterior predictive distributions will often suggest improvements of the model or places to check the data. If the 2LC model were the final model to be applied in this application, then removing Items 3, 5, 8, and 12 (found problematic in most of the analyses with the model) from the data set and fitting the model to the remaining 11 items results in an adequate model fit for all the aspects examined here. It is also possible to expand the model. Figures 1 and 5 suggest that the model fails to fit the weakest and strongest examinees; one reason could be the use of a dichotomous distribution for the probability of a correct response to an item. That is, the model divides the examinees into two groups based on whether or not they have mastered all the necessary skills for solving an item and assigns the same success probabilities to all examinees within a group. This may be too restrictive. For example, those who have mastered all five skills discussed earlier may have a higher chance of success than those who have mastered the skills necessary for an item, but not all skills. Sinharay et al. (2004) expand the 2LC model to suggest the 4-parameter latent class (4LC) model; the model has four parameters for each item ($\pi_{j0}$, $\pi_{j1}$, $\pi_{j2}$, $\pi_{j3}$), one each for examinees who (a) have not mastered Skill 1 (which is required for every item in the data set; see Table 1), (b) have mastered Skill 1 but lack one or more skills necessary for the item, (c) have mastered all five skills, and (d) have not mastered all skills but have mastered the necessary skills for the item. The expanded model seems to fit the data more satisfactorily, and quite adequately if Items 8 and 12 are removed from the data set.

However, there are a number of issues that need further examination. Extensive simulations examining the Type I error and power of the suggested measures may reveal valuable information. For large numbers of items and/or skills, the number of equivalence classes may be large; it will be interesting to examine whether it is feasible to use some of these methods within reasonable time limits. This article discusses diagnostics in the context of a simple BN; application or extension of these to a more complicated BN (e.g., one where the skill variables or responses might be polytomous) is not straightforward, and this issue needs further research. Finally, evaluating the practical consequences of misfit of a BN is another area for future investigation.

References

Agresti, A. (2002). Categorical data analysis (2nd ed.). New York: Wiley.

Almond, R. G., & Mislevy, R. J. (1999). Graphical models and computerized adaptive testing. Applied Psychological Measurement, 23, 223–238.

Chen, W., & Thissen, D. (1997). Local dependence indexes for item pairs using item response theory. Journal of Educational and Behavioral Statistics, 22(3), 265–289.

de la Torre, J., & Douglas, J. (2004). Higher-order latent trait models for cognitive diagnosis. Psychometrika, 69, 333–353.

Garrett, E. S., & Zeger, S. L. (2000). Latent class model diagnosis. Biometrics, 56, 1055–1067.

Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2003). Bayesian data analysis. New York: Chapman & Hall.

Gelman, A., Goegebeur, Y., Tuerlinckx, F., & van Mechelen, I. (2000). Diagnostic checks for discrete-data regression models using posterior predictive simulations. Applied Statistics, 49, 247–268.

Gelman, A., Meng, X. L., & Stern, H. S. (1996). Posterior predictive assessment of model fitness via realized discrepancies (with discussion). Statistica Sinica, 6, 733–807.

Guttman, I. (1967). The use of the concept of a future observation in goodness-of-fit problems. Journal of the Royal Statistical Society B, 29, 83–100.

Haberman, S. J. (1981). Tests for independence in two-way contingency tables based on canonical correlation and on linear-by-linear interaction. Annals of Statistics, 9, 1178–1186.

Hambleton, R. K. (1989). Principles and selected applications of item response theory. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 143–200). New York: Macmillan.

Hartz, S., Roussos, L., & Stout, W. (2002). Skills diagnosis: Theory and practice. User manual for Arpeggio software. Princeton, NJ: ETS.

Holland, P. W. (1985). On the study of differential item performance without IRT. Proceedings of the 27th annual conference of the Military Testing Association (Vol. I, pp. 282–287), San Diego.

Junker, B., & Sijtsma, K. (2001). Cognitive assessment models with few assumptions, and connections with non-parametric item response theory. Applied Psychological Measurement, 23(3), 258–272.

Lauritzen, S. L., & Spiegelhalter, D. J. (1988). Local computations with probabilities on graphical structures and their applications to expert systems (with discussion). Journal of the Royal Statistical Society, Series B, 50, 157–224.

Lindley, D. V. (1971). Bayesian statistics: A review. Philadelphia: Society for Industrial and Applied Mathematics.

Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum Associates.

Lord, F. M., & Wingersky, M. S. (1984). Comparison of IRT true-score and equipercentile observed-score "equatings." Applied Psychological Measurement, 8, 453–461.

Maris, E. (1999). Estimating multiple classification latent class models. Psychometrika, 64, 187–212.

Mislevy, R. J. (1995). Probability-based inference in cognitive diagnosis. In P. Nichols, S. Chipman, & R. Brennan (Eds.), Cognitively diagnostic assessment (pp. 43–71). Hillsdale, NJ: Erlbaum.

Mislevy, R. J., Almond, R. G., Yan, D., & Steinberg, L. S. (2001). Bayes nets in educational assessment: Where the numbers come from. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, 437–446.

Mislevy, R. J., Steinberg, L. S., & Almond, R. G. (2003). On the structure of educational assessment. Measurement: Interdisciplinary Research and Perspectives, 1(1), 3–62.

Murdoch, D. J., & Chow, E. D. (1996). A graphical display of large correlation matrices. The American Statistician, 50, 178–180.

Orlando, M., & Thissen, D. (2000). Likelihood-based item-fit indices for dichotomous item response theory models. Applied Psychological Measurement, 24, 50–64.

Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. San Mateo, CA: Morgan Kaufmann.

Robins, J. M., van der Vaart, A., & Ventura, V. (2000). The asymptotic distribution of p-values in composite null models. Journal of the American Statistical Association, 95, 1143–1172.

Rubin, D. B. (1984). Bayesianly justifiable and relevant frequency calculations for the applied statistician. Annals of Statistics, 12, 1151–1172.

Sinharay, S. (2003). Bayesian item fit analysis for dichotomous item response theory models (ETS RR-03-34). Princeton, NJ: ETS.

Sinharay, S. (2005). Assessing fit of unidimensional item response theory models using a Bayesian approach. Journal of Educational Measurement, 42(4), 375–394.

Sinharay, S., Almond, R. G., & Yan, D. (2004). Model checking for models with discrete proficiency variables in educational assessment (ETS RR-04-04). Princeton, NJ: ETS.

Sinharay, S., & Johnson, M. S. (2003). Simulation studies applying posterior predictive model checking for assessing fit of the common item response theory models (ETS RR-03-28). Princeton, NJ: ETS.

Spiegelhalter, D. J., Thomas, A., & Best, N. G. (1999). WinBUGS version 1.2 user manual [Computer software manual]. Cambridge, UK: MRC Biostatistics Unit.

Tatsuoka, K. K. (1984). Analysis of errors in fraction addition and subtraction problems (NIE Final Report for Grant No. NIE-G-81-002). Urbana, IL: University of Illinois, Computer-based Education Research.

Tatsuoka, K. K. (1990). Toward an integration of item response theory and cognitive error diagnosis. In N. Frederiksen, R. Glaser, A. Lesgold, & M. G. Shafto (Eds.), Diagnostic monitoring of skill and knowledge acquisition (pp. 453–488). Hillsdale, NJ: Erlbaum.

Tatsuoka, K. K., Linn, R. L., Tatsuoka, M. M., & Yamamoto, K. (1988). Differential item functioning resulting from the use of different solution strategies. Journal of Educational Measurement, 25(4), 301–319.

Weaver, R. L., & Junker, B. (2003). Model specification for cognitive assessment of proportional reasoning (Carnegie Mellon University Technical Report 777). Available from http://www/stat.cmu.edu/tr.

Williamson, D. M., Almond, R. G., & Mislevy, R. J. (2000). Model criticism of Bayesian networks with latent variables. Uncertainty in Artificial Intelligence Proceedings 2000, 634–643.

Yan, D., Mislevy, R. J., & Almond, R. G. (2003). Design and analysis in a cognitive assessment (ETS RR-03-32). Princeton, NJ: ETS.

Author

SANDIP SINHARAY is Research Scientist, Educational Testing Service, Rosedale Road, Princeton, NJ 08541; [email protected]. His areas of specialization are Bayesian statistics, model checking and model selection, educational statistics, and statistical computing.

Manuscript received April 7, 2004
Revision received August 19, 2004

Accepted November 23, 2004
