
PSYCHOMETRIKA 2013, DOI: 10.1007/s11336-013-9379-4

HIERARCHICAL BAYESIAN MODELING FOR TEST THEORY WITHOUT AN ANSWER KEY

ZITA ORAVECZ, ROYCE ANDERS, AND WILLIAM H. BATCHELDER

UNIVERSITY OF CALIFORNIA, IRVINE

Cultural Consensus Theory (CCT) models have been applied extensively across research domains in the social and behavioral sciences in order to explore shared knowledge and beliefs. CCT models operate on response data in which the answer key is latent. The current paper enhances the application of these models by developing the appropriate specifications for hierarchical Bayesian inference. A primary contribution is the methodology for integrating the use of covariates into CCT models. More specifically, both person- and item-related parameters are introduced as random effects that can respectively account for patterns of inter-individual and inter-item variability.

Key words: Cultural Consensus Theory, Bayesian statistics, hierarchical model, covariate modeling.

1. Introduction

Cultural Consensus Theory (CCT) explores the shared knowledge of respondents while relying on formal cognitive and measurement models (Batchelder & Romney, 1988; Romney & Batchelder, 1999; Romney, Weller, & Batchelder, 1986). A typical data set for CCT analysis involves a set of respondents answering questions that pertain to some shared knowledge domain. The knowledge domain need not be factual, and may often be a cultural knowledge domain. For example, domains have involved illness beliefs (Hruschka, Kalim, Edmonds, & Sibley, 2008; Baer, Weller, de Alba Garcia, Glazer, Trotter, Pachter, et al., 2003; Weller, Baer, Pachter, Trotter, Glazer, de Alba Garcia, et al., 1999), judgments of personality traits (Iannucci & Romney, 1990), and national consciousness (Yoshino, 1989). In each application, CCT models assess the assumption of a single consensus answer key, and infer this consensus-based truth, which is shared by the respondents, while also accounting for their level of knowledge and guessing tendencies. When compared to simple, commonly practiced aggregation techniques, CCT-based information aggregation has proven superior (see, e.g., Weller, Pachter, Trotter, & Baer, 1993).

The focus of the present paper is to enhance application techniques for the General Condorcet Model (GCM; Batchelder & Romney, 1988; Karabatsos & Batchelder, 2003), which is a CCT model for dichotomous True/False data. A special case of the GCM, in which only the latent answer key and person-specific abilities are estimated, while the rest of the parameters are fixed at set values, is currently the most widely used CCT model (recent applications are in Bimler, 2013; Miller, 2011; Hopkins, 2011). However, these other parameters can model important aspects of decision making, such as person-specific acquiescence (guessing) bias and differential item difficulty. For instance, some people may be biased to respond 'True' rather than 'False' more often when guessing. Moreover, it is unrealistic to assume that a questionnaire composed for any given topic has items that are all equally difficult, or all equally salient for the group of respondents.

Requests for reprints should be sent to Zita Oravecz, UCI, Department of Cognitive Sciences, 3213 Social & Behavioral Sciences Gateway Building, Irvine, CA 92697-5100, USA. E-mail: [email protected]

© 2013 The Psychometric Society


The present paper demonstrates how GCMs can be extended hierarchically so that inter-individual and inter-item differences can be analyzed in terms of random effects. Models that can capture individual differences in various psychological traits have been extensively applied in the field of personality psychology (hierarchical/multilevel models; see Raudenbush & Bryk, 2002; Snijders & Bosker, 1999). However, individual variability in psychologically interpretable parameters of cognitive models has only been in focus since Batchelder and Riefer (1999) and Riefer, Knapp, Batchelder, Bamber, and Manifold (2002), so the number of cognitive psychometric models has only recently started to increase. For example, Klauer (2010), Lee (2011), Rouder and Lu (2005), Scheibehenne, Rieskamp, and Wagenmakers (2013), Smith and Batchelder (2010), and Wetzels, Vandekerckhove, Tuerlinckx, and Wagenmakers (2010) have shown how cognitive models can benefit from taking into account individual variation. Advantages of modeling items as random effects are demonstrated by De Boeck (2008). Generally speaking, in random-effect modeling the item- or person-specific traits are considered to be a sample from a population to which one aims to generalize the results.

Additionally, we introduce the methodology for incorporating person and item covariates into the GCM. While covariate modeling is often utilized in fields such as personality psychology and educational measurement, adding covariate information to cognitive models has not been mainstream practice. We aim to demonstrate this technique and its merits so that it can later be generalized to other types of CCT models, or incorporated into other cognitive models.

Finally, two techniques are described and compared for modeling the population distribution of probability variables in CCT models. The first is a hierarchical Gaussian approach, which involves the logit transformation of the unit-scale parameters in the GCM. This hierarchical Gaussian structure is newly formulated for CCT models and is shown to provide a flexible framework. The second technique avoids the logit transformation of probability parameters by using the beta distribution as a population distribution (instead of the Gaussian) to directly model the parameters that lie between 0 and 1, as described for parameters in similar models (Batchelder & Anders, 2012; Smith & Batchelder, 2010). In addition, the incorporation of person and item covariates is developed for each of these approaches, and illustrated with an application to real data.

The paper is organized as follows. First, a summary of the most important properties of the GCM is provided. Then the extension of adding a population level is developed to yield the hierarchical General Condorcet Model (HGCM). Next, a straightforward way of introducing covariate information for the HGCM is shown. Then statistical inference for the HGCM is derived in the Bayesian framework. Following these specifications, the usefulness of these extensions is demonstrated on a real data set pertaining to judgments of grammaticality for various types of English phrases. Finally, the discussion reflects on the results and the properties of the two hierarchical modeling choices that were applied.

2. Properties of the General Condorcet Model

In a GCM setting, a set of respondents answer a number of True/False questions which measure aspects of the same underlying knowledge space. While in the present paper the focus is on dichotomous data, the GCM has been expanded to other response options as well (e.g., Batchelder, Strashny, & Romney, 2010).

2.1. Data

Assume that each person i = 1, . . . , N answers 'True' or 'False' to each of a set of k = 1, . . . , M questions. Then the data set Y consists of N × M answers, typically coded in the

Page 3: ,R ANDERS AND WILLIAM H. BATCHELDER UNIVERSITY OF ...zoravecz/bayes/data/Articles/HBmodelingforG… · PSYCHOMETRIKA 2013 DOI: 10.1007/S11336-013-9379-4 HIERARCHICAL BAYESIAN MODELING

ZITA ORAVECZ, ROYCE ANDERS, AND WILLIAM H. BATCHELDER

following way:

$$
Y_{ik} = \begin{cases} 1 & \text{if } i \text{ responds `True' to item } k,\\ 0 & \text{if } i \text{ responds `False' to item } k. \end{cases}
$$

It is important to note that the GCM works with the complete N × M data of dichotomous responses, not just the marginals.

2.2. The Parametrization of the General Condorcet Model

For an axiom-based description of the GCM, please consult Karabatsos and Batchelder (2003). Here the model is instead described by a thorough interpretation of its parameters. First, the model specifies item-specific parameters, Zk ∈ {0, 1}, which represent the 'correct' (consensus) answers to the items, and they are dichotomously coded to correspond to the response data:

$$
Z_k = \begin{cases} 1 & \text{if item } k \text{ is `True',}\\ 0 & \text{if item } k \text{ is `False'.} \end{cases}
$$

Based on the answer key Z = (Zk), latent hit and false-alarm rates can be defined as Hik = Pr(Yik = 1 | Zk = 1) and Fik = Pr(Yik = 1 | Zk = 0). So far, the model can be viewed as a general signal detection model, where the correct answers are latent parameters and the hit and false-alarm parameters are heterogeneous over both items and respondents. The GCM re-parameterizes the hit and false-alarm rates with the double high threshold (DHT) model (see, e.g., Macmillan & Creelman, 2005; Morey, 2011) to describe the cognitive processes of the respondents. The DHT model assumes that a person either knows and responds with the correct answer to an item k with probability Dik, or, with probability 1 − Dik, the response is made by guessing. The guess is driven by the person-specific guessing bias parameter gi, which is the probability of guessing 'True' when the answer is not known. Based on the DHT model, the hit and false-alarm rates are parameterized as

$$
\forall i:\quad H_{ik} = D_{ik} + (1 - D_{ik})\,g_i, \qquad F_{ik} = (1 - D_{ik})\,g_i, \tag{1}
$$

where 0 < Dik, gi < 1. As pointed out in Batchelder and Romney (1988), it is necessary that Hik ≥ Fik for all i and k to identify the model. The DHT model described in Equation (1) satisfies this constraint by definition. Moreover, in Appendix A the formal proof of identifiability for the GCM is derived.
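As a concrete illustration, the DHT mapping of Equation (1) can be sketched numerically as follows; the function name and the example values are ours, chosen for illustration, and are not part of the model specification:

```python
import numpy as np

def hit_false_alarm(D, g):
    """Double high threshold mapping of Equation (1).

    D : (N, M) array of knowledge probabilities D_ik
    g : (N,)   array of person-specific guessing biases g_i
    Returns the hit rates H_ik and false-alarm rates F_ik.
    """
    g = g[:, None]            # broadcast each person's bias over items
    H = D + (1.0 - D) * g     # know the answer, or guess 'True'
    F = (1.0 - D) * g         # a false alarm can only arise from guessing
    return H, F

# Illustrative values: two respondents, two items.
D = np.array([[0.0, 0.5], [0.8, 0.2]])
g = np.array([0.4, 0.6])
H, F = hit_false_alarm(D, g)
```

Note that the mapping gives H_ik − F_ik = D_ik, so the hit rate can never fall below the false-alarm rate, which is exactly the identifiability constraint discussed above.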

The underlying process model of the GCM is illustrated by a tree diagram in Figure 1. The first split in the tree represents the latent state of item k, and the remaining splits describe the response model. There are two detection thresholds, one for items with Zk = 1 and one for items with Zk = 0; in this way, the items are never incorrectly identified (or 'Detected') as 'True' or 'False'.

A key assumption of the GCM is that the response random variables, Y = (Yik), satisfy a special conditional independence assumption given by

$$
\Pr(Y = y_{N\times M} \mid Z, H, F) = \prod_{i=1}^{N} \prod_{k=1}^{M} \Pr(Y_{ik} \mid Z_k, H_{ik}, F_{ik}), \tag{2}
$$


FIGURE 1. Processing tree of the General Condorcet Model for an item k.

for all possible realizations of y. Given the parameterization in Equation (1), it is easily seen that the likelihood function of the model becomes

$$
\begin{aligned}
L(Z, D, G \mid Y = y_{N\times M}) = \prod_{i=1}^{N} \prod_{k=1}^{M} &\big[D_{ik} + (1 - D_{ik})g_i\big]^{Y_{ik}Z_k} \big[(1 - D_{ik})g_i\big]^{Y_{ik}(1 - Z_k)} \\
&\times \big[(1 - D_{ik})(1 - g_i)\big]^{(1 - Y_{ik})Z_k} \big[D_{ik} + (1 - D_{ik})(1 - g_i)\big]^{(1 - Y_{ik})(1 - Z_k)}.
\end{aligned} \tag{3}
$$

Note that since the terms in the exponent positions of Equation (3) are dichotomous 1-0 variables, for every combination of them only one of the exponents is 1 and the remaining equal 0.
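This observation makes the likelihood simple to compute: each response cell contributes a single probability term, selected by the pair (Y_ik, Z_k). A minimal sketch of Equation (3) in Python follows; the function name and the example values are illustrative assumptions of ours:

```python
import numpy as np

def gcm_loglik(Y, Z, D, g):
    """Log-likelihood of Equation (3) for the GCM.

    Y : (N, M) 0/1 responses;  Z : (M,) 0/1 answer key;
    D : (N, M) knowledge probabilities;  g : (N,) guessing biases.
    """
    g = g[:, None]
    H = D + (1.0 - D) * g               # Pr(Y=1 | Z=1), Equation (1)
    F = (1.0 - D) * g                   # Pr(Y=1 | Z=0)
    # Probability of responding 'True' to item k given the answer key:
    p_true = np.where(Z[None, :] == 1, H, F)
    # Each cell contributes p_true if Y=1 and 1 - p_true if Y=0; this is
    # Equation (3) with only the single active exponent retained per cell.
    p = np.where(Y == 1, p_true, 1.0 - p_true)
    return np.log(p).sum()

# Illustrative evaluation on a tiny data set:
Y = np.array([[1, 0], [1, 1]])
Z = np.array([1, 0])
D = np.array([[0.5, 0.5], [0.9, 0.9]])
g = np.array([0.5, 0.5])
ll = gcm_loglik(Y, Z, D, g)
```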

A direct consequence of the single answer key assumption is that the correlation between two respondents over items equals the product of each respondent's correlation with the answer key; formally, for all 1 ≤ i ≠ j ≤ N:

$$
\rho(Y_{iK}, Y_{jK}) = \rho(Y_{iK}, Z_K)\,\rho(Y_{jK}, Z_K), \tag{4}
$$

where K is a random variable that selects a random item index, so that Pr(K = k) = 1/M for all k, and

$$
\rho(Y_{iK}, Y_{jK}) = \frac{E(Y_{iK} Y_{jK}) - E(Y_{iK})\,E(Y_{jK})}{\sqrt{\operatorname{Var}(Y_{iK})\operatorname{Var}(Y_{jK})}}. \tag{5}
$$

The term in Equation (5) and its counterparts for ρ(YiK, ZK) and ρ(YjK, ZK) are easily calculated using the properties of conditional expectation (see, e.g., Batchelder & Anders, 2012, p. 318, for details). Equation (4) leads to the consequence that for all distinct respondents denoted i, j, m, n,

$$
\rho(Y_{iK}, Y_{jK})\,\rho(Y_{mK}, Y_{nK}) = \rho(Y_{iK}, Y_{nK})\,\rho(Y_{mK}, Y_{jK}), \tag{6}
$$

which is a form of Spearman's law of tetrads (Spearman, 1904). The law of tetrads is the basis for Spearman's two-factor model of intelligence. In our case, this result occurs because there is a single answer key, Z = (Zk)1×M, behind the respondent-by-respondent correlations over items, and the two factors are the respondents' abilities and the shared consensus answer key. In a later section on posterior predictive model checking, we use this result to test whether the single latent answer key assumption of the GCM is met in an example application.
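The tetrad identity of Equation (6) can be checked numerically. The sketch below simulates data from a deliberately simplified single-answer-key process (each respondent matches the key with a person-specific probability; this simplification is ours and is not the full GCM) and evaluates one tetrad difference:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate data with one latent answer key: respondent i matches the key
# Z_k with probability p_i, independently over items.
N, M = 4, 2000
Z = rng.integers(0, 2, size=M)
p = np.array([0.9, 0.8, 0.7, 0.6])          # per-respondent match rates
match = rng.random((N, M)) < p[:, None]
Y = np.where(match, Z, 1 - Z)

R = np.corrcoef(Y)      # respondent-by-respondent correlation matrix
# Spearman tetrad for the distinct respondents (i, j, m, n) = (0, 1, 2, 3):
tetrad = R[0, 1] * R[2, 3] - R[0, 3] * R[2, 1]
```

Under the single-key assumption the tetrad difference is zero up to sampling noise; systematic departures would signal a violation, which is how the posterior predictive check mentioned above uses this result.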


Following a proposal for introducing item difficulty by Batchelder and Romney (1988), Karabatsos and Batchelder (2003) specified Dik, the probability that a person i knows the correct answer to item k, as a function of both the person's ability and the item's difficulty in the following way:

$$
D_{ik} = \frac{\theta_i(1 - \delta_k)}{\theta_i(1 - \delta_k) + \delta_k(1 - \theta_i)}, \tag{7}
$$

where θi ∈ [0, 1] is the ability parameter belonging to informant i, and δk ∈ [0, 1] denotes the item difficulty of question k.
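A short numerical sketch of Equation (7) follows; the function name and example values are illustrative. It also makes the Rasch-type structure of the formula visible: on the logit scale the knowledge probability decomposes additively into ability minus difficulty.

```python
import numpy as np

def knowledge_prob(theta, delta):
    """Equation (7): probability D_ik that informant i knows the
    consensus answer to item k, from ability theta_i and item
    difficulty delta_k (both on the unit scale)."""
    t = theta[:, None]          # (N, 1)
    d = delta[None, :]          # (1, M)
    return t * (1 - d) / (t * (1 - d) + d * (1 - t))

theta = np.array([0.3, 0.5, 0.9])      # informant abilities
delta = np.array([0.5, 0.1, 0.9])      # item difficulties (0.5 = neutral)
D = knowledge_prob(theta, delta)
# At neutral difficulty (delta = 0.5) the formula returns the ability
# itself, and logit(D_ik) = logit(theta_i) - logit(delta_k) throughout.
```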

From a parameter interpretation point of view, formulating the probability of knowing the correct answer as a function of ability and item difficulty (Equation (7)) shows a correspondence between the GCM and one of the Rasch model parameterizations (Fischer & Molenaar, 1995; Rasch, 1960). If the GCM did not have the guessing bias parameter1 (i.e., if the lowest level of the decision tree in Figure 1 were deleted), it would result in a model similar to the Rasch model, with the exception of having an unknown answer key. However, in our formulation, the item-difficulty and ability parameters are kept on the unit scale, which is not typical in psychometric test theory. Since the GCM has a person-specific guessing bias probability, keeping the ability and item difficulty on the unit scale as well results in a convenient parameter interpretation framework typical of threshold models of signal detection.

In the proposed model, the ability, item-difficulty, guessing bias, and answer key parameters are estimated at the same time. The ability and item-difficulty scales directly relate to the probability of knowing the culturally accepted consensus-based answer, which is represented by the answer key. That is to say, the ability of a person is relative to how well his/her responses generally match the consensus truth of the group. Certainly, items tapping into some knowledge domain would be heterogeneous with respect to how difficult it is to know the consensus on different aspects of the knowledge domain represented by the items. Even if the consensus is weak, the model specifies and estimates an answer key, in terms of which we can interpret the ability and item-difficulty parameters. This way, the level of consensus knowledge informs us about how reliable the answer key estimate is (higher consensus, more stable estimate), while the posterior probability of the answer key estimate for each item reflects the degree of uncertainty itemwise.

To summarize, the model introduced above has 2 × N person-specific parameters, namely the ability parameters, θ = (θi)1×N, and the guessing bias parameters, G = (gi)1×N. Also, it has 2 × M item-specific parameters: the answer key for each item, Z = (Zk)1×M, and the item-difficulty parameter for each item, δ = (δk)1×M. Except for the answer key, which takes discrete values of 0 or 1, all of the parameters are on the unit scale.

Originally, statistical inference for the GCM was carried out in the classical inferential framework (Batchelder & Romney, 1988; Batchelder, Kumbasar, & Boyd, 1997). Later, Karabatsos and Batchelder (2003) derived inference for different versions of the GCM (including one with item difficulty, see later) in the non-hierarchical Bayesian framework. Recently, Oravecz, Vandekerckhove, and Batchelder (in press) developed a user-friendly software package with a graphical user interface that can estimate non-hierarchical GCMs in the Bayesian framework. An extension of the GCM allowing cultural truth to be continuous, as in fuzzy logic, rather than two-valued can be found in Batchelder and Anders (2012).

1The guessing bias parameter in the GCM is a person-specific, cognitive latent variable, which should not be confused with the item-specific guessing rate parameter of the three-parameter logistic model (Birnbaum, 1968).


3. Hierarchical Model Formulation for the GCM

The modeling framework proposed for the GCM involves a hierarchical model structure that allows for the pooling of information across participants as well as items. In this setting, the person and item parameters are treated as random variables. These unit-scale parameters can be transformed onto the real line, and normal (Gaussian) distributions can be chosen to model their population distributions. Although this technique is generally used when dealing with unit-scale parameters (see, e.g., De Boeck & Wilson, 2004), an alternative method that does not involve transforming the parameters, namely the use of the beta distribution, is also described in this section. The advantages and disadvantages of these two modeling approaches will also be discussed.

The hierarchical modeling framework allows for a straightforward way to involve covariate information in the model, without needing to resort to a two-stage analysis (i.e., first estimating model parameters, and then exploring their association with the covariates through correlation or regression analysis). Incorporating predictors in models is not yet a widely explored technique in the field of cognitive psychology, although in many cases it can be done in a straightforward manner and can result in interesting findings; this will be demonstrated later in the Application section.

3.1. Assigning Normal Population Distributions

Unit-scale parameters can be transformed onto the real line by a link function. For the GCM, the suggested link function is the logit transform (other types, like the probit transform, are also possible), which is defined by logit(x) = log[x/(1 − x)], where x ∈ [0, 1] is the variable to which the logistic function is applied. On the logit scale, pre-transformed values less than 0.5 correspond to negative values, while values greater than 0.5 correspond to positive values. The population distributions are modeled on the transformed scale.

Person-Specific Parameters. In the case of the logit-transformed ability parameters (θi) and guessing biases (gi), a bivariate normal distribution can be assigned as their population distribution:

$$
\begin{bmatrix} \operatorname{logit}(\theta_i) \\ \operatorname{logit}(g_i) \end{bmatrix} \sim \mathrm{Normal}_2\big(\boldsymbol{\mu}_{l(\theta g)}, \boldsymbol{\Sigma}_{l(\theta g)}\big), \tag{8}
$$

where Σl(θg) is a 2 × 2 covariance matrix representing the variation and covariation in the person-specific ability (θi) and guessing bias (gi) parameters of the population sample on the logit scale, while the vector μl(θg) contains the population means for these two variables. An easily derivable and scale-independent parameter based on Σl(θg) is the correlation between ability and guessing bias, which will be denoted as ρl(θg). The l in the subindex indicates that these parameters are under the logit transform.
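A small simulation sketch of the population model in Equation (8) follows; all numerical values are assumptions chosen for illustration, not estimates from any data set:

```python
import numpy as np

rng = np.random.default_rng(7)

# Illustrative population parameters on the logit scale:
mu = np.array([1.0, 0.0])            # mean logit-ability, mean logit-bias
sd = np.array([0.8, 0.5])
rho = 0.3                            # correlation rho_l(theta g)
Sigma = np.array([[sd[0] ** 2, rho * sd[0] * sd[1]],
                  [rho * sd[0] * sd[1], sd[1] ** 2]])

# Draw person-specific parameters, then map back to the unit scale:
draws = rng.multivariate_normal(mu, Sigma, size=10_000)
theta = 1.0 / (1.0 + np.exp(-draws[:, 0]))   # inverse logit of ability
g = 1.0 / (1.0 + np.exp(-draws[:, 1]))       # inverse logit of guessing bias
```

The covariance term lets the model express, for instance, that more knowledgeable respondents tend to have a different guessing bias, which the beta-population approach of Section 3.2 cannot easily accommodate.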

Item-Specific Parameters. Since the item-difficulty parameter, δk, is also on the unit scale, like the informant response-probability parameters discussed above, one can apply the logit transformation and assign a population distribution to it in the same manner. These two procedures are formally written as

$$
\operatorname{logit}(\delta_k) \sim \mathrm{Normal}\big(\mu_{l(\delta)}, \sigma^2_{l(\delta)}\big). \tag{9}
$$

In order to handle model identification in Equation (7), the population mean of the logit-scaled item difficulties can be set to zero (μl(δ) = 0) in Equation (9). Then the population mean of the ability parameter would represent the average probability (on the logit scale) that a respondent would know the correct answer. Alternatively, the population mean of the abilities can also be


set to 0, and then the population mean of the item difficulty would be estimated. In either case, the ability and item-difficulty parameters are interdependent and should be interpreted relative to each other.

Finally, the answer key parameters can have a hierarchical structure as well. Typically in the hierarchical application, it is assumed that the answer key items, Zk, are generated hierarchically by a Bernoulli process with a specific hyperprior, π. That is,

$$
Z_k \sim \mathrm{Bernoulli}(\pi), \tag{10}
$$

where π ∈ [0, 1] is the probability of an answer key item being 'True'. The majority of GCM applications (see the summary in Weller, 2007), which are not hierarchical, fix the probability parameter of this Bernoulli process, π, to 0.5; this setting is less flexible, as it designates an a priori equal chance for every latent answer key parameter to be either 'True' or 'False'.

Incorporating Covariates. In some applications, covariate information on the participants, such as demographic data on age, gender, nationality, or education, and/or responses on personality questionnaires, is available in the data. Adding predictors can enhance model application, as this information can be used to explore connections between these covariates and the latent cognitive parameters of the GCM. Covariate modeling is already a well-established technique in the areas of personality psychology and psychometric test theory (see, e.g., De Boeck & Wilson, 2004; Gelman & Hill, 2007).

The score of respondent i on covariate c (c ∈ 1, . . . , C) is denoted as xic. All respondent-specific covariate scores are collected into a vector of length C + 1, denoted as xi = (xi0, xi1, xi2, . . . , xiC)T, typically with intercept xi0 = 1. Continuous covariate scores should always be standardized for the GCM to improve the numerical stability of the model. Categorical variables should be dummy-coded. In order to avoid identification issues and to preserve the interpretation of the intercept as a population grand mean, we consider it good practice to also standardize categorical and binary dummy-coded variables. If there are categorical variables that represent exclusive categories, for example if different exclusive groups are modeled, care should be taken that the dummy-coded variables do not themselves form an intercept variable (i.e., either the intercept should be omitted entirely, or the categorical variable should be represented by a number of dummy variables equal to the number of exclusive categories minus one; the dummy variables will then indicate contrast from the reference category). Also, it should be noted that in the current formulation of the model, it is implicitly assumed that the variance-covariance structure is equal across different groups of respondents (or items; see below).

Assume that all regression coefficients (including an intercept) for the ability are collected in a vector, βθ, and all regression coefficients for the guessing bias are collected in another vector, βg. Consequently, the population mean receives a person index, i, and can be written as a dot product of the predictors and regression coefficients:

$$
\mu_{l(\theta_i)} = \mathbf{x}_i^{T} \boldsymbol{\beta}_{l(\theta)}, \tag{11}
$$

$$
\mu_{l(g_i)} = \mathbf{x}_i^{T} \boldsymbol{\beta}_{l(g)}. \tag{12}
$$

This is a general formulation that includes covariates, but in their absence the same reasoning holds, and βl(θ) and/or βl(g) reduce to a scalar, which is interpreted as a population mean.
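The construction of the design matrix and the population means of Equations (11) and (12) can be sketched as follows; the covariates, their values, and the coefficient vector are hypothetical and serve only to illustrate the standardization and regression steps described above:

```python
import numpy as np

# Hypothetical person covariates: age (continuous) and a dummy-coded
# group indicator; both standardized, as recommended in the text.
age = np.array([23.0, 35.0, 41.0, 29.0, 52.0])
group = np.array([0.0, 1.0, 1.0, 0.0, 1.0])

def standardize(x):
    return (x - x.mean()) / x.std()

N = len(age)
X = np.column_stack([np.ones(N),          # intercept column, x_i0 = 1
                     standardize(age),
                     standardize(group)])

# Illustrative regression coefficients beta_l(theta) (assumed values):
beta_theta = np.array([1.2, 0.4, -0.2])
mu_theta = X @ beta_theta                 # Equation (11): one mean per person
```

With no covariates, X reduces to the intercept column and every person shares the single population mean, exactly as noted in the text.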

Moreover, covariate information on the item side can be introduced as well; although item covariates are less often collected, one may gather this information by examining the characteristics of the items in the questionnaire that gave rise to the data. For example, in the case of a questionnaire exploring beliefs about a certain illness, the items might belong to topic groups such as causes, symptoms, and treatments. Other possible predictors may come from the grammatical form of the question, for example whether or not it is written as a negation. This information can be added as item covariates, denoted by h (h = 1, . . . , H). Then the value for the


covariate h on item k will be denoted as vhk, and these H covariates are included in the vector vk = (v1k, v2k, . . . , vHk)T. Note that there is no intercept in this formulation, in order to preserve model identification. As the goal is to keep the population mean fixed to 0, all types of covariates should be standardized.

Similarly to the response-probability parameters, the population mean for item difficulty can be written as

$$
\mu_{l(\delta_k)} = \mathbf{v}_k^{T} \boldsymbol{\beta}_{l(\delta)}. \tag{13}
$$

The population mean is again decomposed into regression coefficients, βl(δ), and covariates, vk. If there is no covariate information included or available on the items, then βl(δ) should be set to 0.

With the novel model specification that includes covariates, it is suggested here to apply the model as specified above for handling both cases of data, with or without covariate information. In this way, in cases without covariate information, the population grand mean is represented either by an intercept (β0l(θ) and β0l(g)) for the person-specific parameters, or by 0 for the mean of the item difficulties.

3.2. An Alternative: Using the Beta Distribution to Model the Population

A previous hierarchical extension of the GCM in Batchelder and Anders (2012), without item difficulty and covariate information, used the beta as the population distribution for each parameter on the unit scale. In their application, the beta distribution allows one to carry out inference on the unit scale, which is the original scale of the probability parameters, so the interpretation remains tied to it. While statistical inference practices in psychometric test theory typically involve transforming the unit-scale parameters and employing the Gaussian distribution, some authors have instead argued for and implemented the use of the beta distribution (e.g., Merkle, Smithson, & Verkuilen, 2011; Batchelder & Anders, 2012). In this subsection, the beta-distribution-based formulation of the HGCM, with the inclusion of covariates and item difficulty, is presented and compared to the normal-distribution-based approach.

Person-Specific Parameters. The person-specific parameters on the unit scale, θi and gi, can be assigned a beta population distribution. In standard notation, the beta distribution, Beta(a, b), is usually parameterized by two non-negative parameters, a and b, which indicate whether the mass of the distribution is pulled toward 1 or toward 0, respectively. If the probability variable θi is assumed to be beta distributed, its probability density function is written as

$$
f(\theta_i; a_\theta, b_\theta) = \frac{\Gamma(a_\theta + b_\theta)}{\Gamma(a_\theta)\,\Gamma(b_\theta)}\, \theta_i^{a_\theta - 1} (1 - \theta_i)^{b_\theta - 1},
$$

where Γ stands for the gamma function. Now, for the sake of modeling a population mean and spread, the distribution can be reparameterized in terms of a mean μθ ∈ [0, 1] and a parameter that acts as a precision, or inverse variance, τθ > 0 (e.g., Kruschke, 2011). In this setting, the mean and precision are

$$
\mu_\theta = \frac{a_\theta}{a_\theta + b_\theta}, \qquad \tau_\theta = a_\theta + b_\theta.
$$

Then the mean and variance of this beta distribution are, respectively,

$$
\mu_\theta \quad \text{and} \quad \frac{\mu_\theta(1 - \mu_\theta)}{1 + \tau_\theta}.
$$

Using this parameterization of the beta, hierarchical distributions of the untransformed person-specific parameters can be set as

$$
\theta_i \sim \mathrm{Beta}\big(\mu_\theta \tau_\theta,\, (1 - \mu_\theta)\tau_\theta\big). \tag{14}
$$
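The mean/precision reparameterization behind Equation (14) can be sketched in a few lines; the function name and the population values are assumptions chosen for illustration:

```python
import numpy as np

def beta_ab(mu, tau):
    """Map the mean/precision parameterization (mu, tau) onto the
    standard Beta(a, b) shape parameters, as used in Equation (14)."""
    return mu * tau, (1.0 - mu) * tau

rng = np.random.default_rng(11)
mu_theta, tau_theta = 0.7, 10.0        # assumed population mean and precision
a, b = beta_ab(mu_theta, tau_theta)    # mu*tau = 7.0, (1 - mu)*tau = 3.0
theta = rng.beta(a, b, size=50_000)    # simulated person-specific abilities
```

The draws have mean μθ and variance μθ(1 − μθ)/(1 + τθ), matching the moment expressions given above.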


The distribution for the guessing bias parameter is obtained by substituting subscript θ with g in the above equation.

Item-Specific Parameters. Using similar developments as discussed for the person-specific parameters, the unit-scale item parameter, item difficulty δk, can be modeled with a beta distribution in the same way, where

δk ∼ Beta(μδτδ, (1 − μδ)τδ). (15)

In the unit-scale setting, neutral item difficulty is a value of 0.5, and in order to identify the model at the hierarchical level, one sets μδ = 0.5.

Incorporating Covariates To introduce covariates for the beta distribution, a design similar to the one discussed for the normal distribution is employed, with the modification that the regression structure is positioned on the unit scale, so that each population mean is properly located in [0,1]. As the logit was used earlier to transform values from the unit scale to the real line, its inverse, logit−1(x) = exp(x)/(1 + exp(x)), is the natural transform for mapping real-line values back into the unit scale. In this way, the covariate information for the person- and item-specific parameters is set similarly as before (see details in the part on modeling with normal distributions), with the exception of the inverse logit transformation:

μθi = logit−1(xiT βθ). (16)

To obtain the population mean for the guessing bias and item-difficulty parameters, subscript θ in Equation (16) is replaced by g or δ (with the additional change of i to k).
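A minimal sketch of the regression structure in Equation (16) follows; the function names and the numerical values are ours (hypothetical), chosen only to illustrate how a covariate vector (with a leading 1 for the intercept) is mapped through the inverse logit to a population mean in (0, 1):

```python
import math

def inv_logit(x):
    """logit^{-1}(x) = exp(x)/(1 + exp(x)), mapping the real line to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def population_mean(covariates, weights):
    """mu = logit^{-1}(x^T beta); covariates include a leading 1 for the intercept."""
    return inv_logit(sum(x * b for x, b in zip(covariates, weights)))

# Hypothetical numbers: an intercept of 1.75 and an age effect of 0.61
# (on the logit scale) for a person with standardized age 0.5.
print(population_mean([1.0, 0.5], [1.75, 0.61]))
```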

3.3. Summary of the Two Population Modeling Approaches

Modeling with beta distributions has some potential disadvantages. First, estimating large values of the τ parameter is problematic. A characteristic property of the beta distribution parameterized by μ and τ is that sharp changes in the variance result from changes in τ between low values, roughly 0 to 8, with increasingly limited returns from further increases. As a result, large underlying τ values are easily overestimated when a diffuse prior is used. This overestimation issue may be handled with the use of a tighter prior for the τ parameter around a lower range of values (see, e.g., Batchelder & Anders, 2012), such as a Gamma(4,2), which is used here in the demonstrated application of the model on a real data set. Second, dependencies between the ability and guessing bias parameters at the hierarchical level cannot easily be incorporated into a hierarchical beta population distribution.
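The variance behavior described above can be seen directly from the variance formula given earlier, μ(1 − μ)/(1 + τ); the short sketch below (our own illustration) tabulates the variance at μ = 0.5 for increasing τ:

```python
# For fixed mu, the variance of Beta(mu*tau, (1-mu)*tau) is
# mu*(1 - mu)/(1 + tau): steep for small tau, nearly flat for large tau.
def beta_var(mu, tau):
    return mu * (1.0 - mu) / (1.0 + tau)

mu = 0.5
for tau in (0, 2, 8, 40, 200):
    print(tau, beta_var(mu, tau))
# The drop from tau = 0 to tau = 8 is far larger than from tau = 8 to
# tau = 200, which is why a diffuse prior lets large tau values drift
# upward almost unchecked by the likelihood.
```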

However, applying the beta distribution has the advantage of modeling the parameter values on the probability scale. GCM parameters are interpreted cognitively as probability parameters, as originated in the theory of signal detection, and their psychological interpretation relates to this framework. With regard to modeling the logit transforms of the person-specific parameters with the normal distribution, the population parameter estimates of these variables are only interpretable on the transformed variable scale, not on the original unit scale.2 Also, the beta distribution exhibits somewhat greater flexibility in fitting population trends existent in the data. While

2Although if returning to an interpretation of the estimated values on their original unit scale is of interest, these values can be approximated through the transformation of variables technique (see, e.g., Mood, Graybill, & Boes, 1974). However, these approximations do not account for the limits of the unit scale, and therefore, in the case of large variances on the transformed scale, these approximations are biased towards the extremes.


the normal distribution has zero skewness, the beta distribution has the potential advantage that it can accommodate population distributions that may be skewed in one direction.

To summarize, we propose the hierarchical normal distribution-based GCM (HGCMN) as a more advantageous alternative to the GCM with beta population distributions (HGCMB), especially when incorporating covariance between parameters. In order to relate to previous developments in the area of GCM modeling, we will also demonstrate statistical inference, including covariate modeling, with the HGCMB.

4. Bayesian Statistical Inference

We analyze the hierarchical models within the Bayesian framework (Gelman, Carlin, Stern, & Rubin, 2004; Kruschke, 2011). Classical, maximum likelihood statistical inference for hierarchical models is not a trivial task. For the model presented here, statistical inference with maximum likelihood would involve a high-dimensional integration over the numerous random-effect distributions. Since most of these integrals have no closed-form solutions, they would have to be approximated by finite sums, which is computationally prohibitive for models with a large number of parameters. In the Bayesian paradigm, explicit integration over the random effects is avoided because inference is based on the full joint posterior distribution of the parameters (and not on the marginal). Parameters in the Bayesian framework have a probability distribution, which offers an intuitively appealing way of thinking about uncertainty and the knowledge one has about the parameters. Moreover, the Bayesian framework offers a coherent method for making decisions.

An advantage of Bayesian statistical inference is that sampling algorithms may easily be applied to sample from the posterior density of the parameters. The posterior density represents the probability distribution of the parameters given the data, and it is directly proportional to the product of the likelihood of the data (given the parameters) and the prior distribution of the parameters. Formally, p(ξ |Y) ∝ p(Y|ξ)p(ξ), where ξ stands for the vector of all parameters in the model and Y for the data (the normalization constant, p(Y), does not depend on the parameters and is therefore not considered). The prior distribution incorporates prior knowledge about the parameters. In the absence of unambiguous prior knowledge, there are a number of default priors that have been suggested in the Bayesian literature. The more data one acquires, the less influential the prior becomes on the posterior.
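The proportionality relation above can be made concrete with a toy grid approximation (our own illustration, not the sampler used in the paper): for a single Bernoulli probability with a Uniform(0, 1) prior, evaluating likelihood times prior on a fine grid recovers the analytic posterior mean of the conjugate Beta posterior:

```python
# Grid approximation of p(pi | Y), proportional to p(Y | pi) p(pi), for a
# single Bernoulli probability pi with a Uniform(0, 1) prior: a toy
# illustration of the proportionality relation.
def posterior_mean_grid(successes, trials, grid_size=10000):
    xs = [i / grid_size for i in range(grid_size + 1)]
    # Unnormalized posterior: likelihood * prior (the prior is constant here).
    w = [x ** successes * (1 - x) ** (trials - successes) for x in xs]
    return sum(x * wi for x, wi in zip(xs, w)) / sum(w)

# With 7 successes in 10 trials the exact posterior is Beta(8, 4),
# whose mean is 8/12.
print(posterior_mean_grid(7, 10))
```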

Since the presented models yield high-dimensional posteriors (due to the large number of parameters), we opt for Markov chain Monte Carlo (MCMC, see e.g., Robert & Casella, 2004) methods to draw values from the posteriors. These algorithms perform iterative sampling: values are drawn from approximate distributions, and the approximation to the posterior improves as the number of samples increases. Two freely available software packages, namely JAGS (Plummer, 2011) and WinBUGS (Lunn, Thomas, Best, & Spiegelhalter, 2000), can perform such computations. Appendix C.1 provides scripts for the estimation of the model parameters, written for JAGS; these can easily be translated into WinBUGS as well. With JAGS, results may be easily investigated by using complementary software such as R or MATLAB, both of which contain freely accessible packages for the programs to communicate.

4.1. Prior Specifications

In the Bayesian context, the priors on the person- and item-specific parameters (θi, gi, δk, and Zk) of the HGCMN are defined by their population distributions in Equations (8), (9), and (10). These population distributions have free parameters that are estimated from the data, and here


diffuse or vague prior distributions for these population parameters are assigned (Gelman et al., 2004). More specifically, a diffuse normal prior is assigned for the vector of regression weights:

β l(θ) ∼ NormalJ+1(0,10IJ+1), (17)

where I stands for the identity matrix. Subscript l(θ) can be substituted with l(g) to denote the guessing bias population distribution priors, or with g when using the beta population distribution. The population covariance matrix for the bivariate normal distribution of the logit transforms of θi and gi is modeled through an inverse-Wishart density with the identity matrix as a scale matrix and with 3 degrees of freedom:

Σl(θg) ∼ Inverse-Wishart(I2,3). (18)

The derivation of the priors for the item-specific l(δk) follows the same principle:

β l(δ) ∼ NormalH+1(0,10IH+1). (19)

The population variance of the item difficulty is assigned an inverse gamma prior distribution:

σ2δ ∼ Inverse-Gamma(0.01, 0.01). (20)

As for the prior on the Bernoulli probability for the answer key, Zk, a uniform prior distribution is assigned:

π ∼ Uniform(0,1). (21)

The conditional posterior density of all model parameters can be found in Appendix B.

5. Application: Judging Grammaticality of Sentences

To illustrate the amount of information one can gain by applying the HGCMs to real data, the models are applied to a data set in Sprouse, Fukuda, Ono, and Kluender (2011), which involved dichotomous judgments of whether a sentence is grammatically acceptable (grammatical or not). This topic may be considered especially fit for HGCM analysis: although there are rules to determine the grammaticality of phrases in English, the language is constantly evolving, it varies somewhat from region to region, and it is the users of the language who form its rules. Thus a model that can estimate the underlying consensus answers while taking into account abilities, item difficulties, and guessing tendencies is especially well suited to investigate shared syntactic rules.

Participants examined a variety of sentences, which were classified into several grammatical classes by linguists, and they responded whether or not each sentence type is grammatical. A major focus of the study was to measure so-called syntactic ‘island’ effects on assessments of grammaticality. In particular, syntactic islands relate to what is known as ‘wh-movement’ or ‘wh-extraction’: the ability to introduce a ‘wh’ question word such as ‘who’, ‘what’, ‘where’, and ‘which’ at the beginning of a sentence, and still retain grammaticality by rearranging the other words. For example: ‘She should stop talking about syntax’ can be rearranged with a wh-extraction as ‘What should she stop talking about ___?’ The underscore represents the canonical position of the word that the ‘wh’ replaced by extraction. In many cases, one can introduce a ‘wh’ question away from its canonical position, or even manipulate this length further by introducing more words, yet still retain grammaticality, such as ‘What do you think she should stop talking about ___?’ Now in contrast, a syntactic ‘island’ is a phrase in a sentence where generally,


one cannot make a wh-extraction away from its canonical position while still retaining grammaticality. For example: ‘She should stop talking about syntax because it is confusing to me’ is grammatical, but when introducing any ‘wh’ question word, the resultant ‘island’ clause is ungrammatical: such as, ‘what should she stop talking about because it is confusing to me’.3

It is noted that some cases of wh-extractions out of particular island types are accepted as grammatical by some and not by others. The question for consensus analysis here by the HGCMs is to determine the consensus belief in the grammaticality of wh-extractions out of islands.

In the study described in Sprouse et al. (2011), data pertaining to the sentence types described above were obtained from a survey containing two conditions: sentences with an ‘island’ and without an ‘island.’ Respondents were recruited from the Amazon Mechanical Turk website (Buhrmester, Kwang, & Gosling, 2011) and were paid $3 for their participation. The total sample size was N = 102 (one out of 103 was dropped due to missing covariates), while the total number of items was M = 64: 32 items for each condition of non-island versus island, with various combinations explained below.4 An additional factor that was studied was the distance from the ‘wh’ to its canonical position, such as: ‘Who thinks that John bought a car?’ (short) versus ‘What do you think that John bought ___?’ (long). In both conditions, half of the sentences were short and half were long. In the analysis below, these two indicators (‘island’ and ‘long’) are used as predictors of item difficulty.

There were four versions of the questionnaire, containing different tokens for each of the conditions; from a linguistic point of view, the tokens share exactly the same properties. Thus the questionnaire types were collapsed to retain a non-sparse response matrix; however, the questionnaire version was added as an indicator variable in the analysis to test whether respondents replying to different versions of the questionnaire exhibit different levels of ability or different guessing tendencies, which might indicate inequality among the questionnaire versions. Finally, two other person covariates, namely gender and age, are also added in the analysis.

With the HGCM, ‘island’ effects can be investigated in terms of the consensus answer key as well as item difficulty. The former is determined by examining the model’s latent answer key estimates on both ‘island’ and ‘non-island’ items, and these estimates take into account the cognitive variables of decision making, such as ability and guessing bias. Second, while differential item difficulty is also accounted for in the HGCMs using a parameter that is estimated in the context of the full model, these island versus non-island effects are well summarized by involving an indicator on the item side: that is, a covariate coded by whether or not an item contains an ‘island.’

5.1. Results of Fitting HGCMs

Both the HGCMN and HGCMB with five person-specific standardized covariates (age, gender, and three dummy-coded variables indicating questionnaire version) and two item-specific covariates (indicating island/no-island and long/short) were fit to the data. The results are based on analysis with JAGS running six chains, in which the retained number of iterations from each chain was 4000, resulting in a final posterior sample size of 24,000. All results were based on chains that passed the R̂ convergence test (Gelman et al., 2004) and visual assessment.
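For readers unfamiliar with the R̂ diagnostic, the basic potential scale reduction factor can be sketched as below (our simplified version of the classical Gelman–Rubin formula; JAGS companion packages compute a refined variant internally):

```python
def gelman_rubin(chains):
    """Basic potential scale reduction factor R-hat for equal-length chains."""
    m = len(chains)
    n = len(chains[0])
    means = [sum(c) / n for c in chains]
    grand = sum(means) / m
    # Between-chain variance B and mean within-chain variance W.
    B = n / (m - 1) * sum((mu - grand) ** 2 for mu in means)
    W = sum(sum((x - mu) ** 2 for x in c) / (n - 1)
            for c, mu in zip(chains, means)) / m
    var_plus = (n - 1) / n * W + B / n
    return (var_plus / W) ** 0.5

# Chains exploring the same region give R-hat near 1.
print(gelman_rubin([[1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 5.0]]))
```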

3This is an example of an ‘adjunct’ island. In this data set, there were four types of islands investigated: adjunct, subject, whether, and complex noun-phrase islands (see Sprouse et al., 2011).

4In particular, there are four types of phrases: adjunct, subject, whether, and complex noun-phrases. The 64 items in the questionnaire are composed of each phrase type having eight tokens as islands and eight tokens as non-islands. In each eight-token set, four tokens had long distances between the wh-word and its canonical position, while four had short distances. Since the data set is used here for demonstrational purposes, for the sake of simplicity, these tokens were collapsed within each condition (see Sprouse et al., 2011 for details).


With respect to the answer key estimates (Zk), the posterior mean estimate of their population distribution hyperparameter, π, was 0.75 (posterior std: 0.05), indicating that items with a consensus answer “grammatically acceptable” were more likely in the questionnaires than ungrammatical items. The Zk posterior median estimates indeed showed this ratio: all (short and long) ‘non-islands’ (32 items) were classified as grammatical according to the model, as were all short ‘islands’ (16 items). For each of these estimates, the posterior standard deviation was practically 0, indicating a high level of certainty. The long ‘island’ items (16) were all classified as non-grammatical. In the HGCMN, in eight out of 16 cases the posterior standard deviation was again 0, and in the remaining eight cases there was a very small amount of uncertainty (posterior standard deviation smaller than 0.08 for each item). As for the HGCMB, there was likewise no uncertainty in the 48 grammatical estimates and just a very small amount of uncertainty (posterior standard deviation smaller than 0.09) in the 16 ungrammatical estimates. These findings are interesting because the raw data showed a lack of agreement for many items (there was only one item on which all respondents agreed). For example, one of the largest disagreements was on a long, island structure: the response split was 44 ‘True’ and 58 ‘False’ in terms of grammaticality judgments. Despite the nearly even split, the HGCMN classified the underlying answer key as “non-grammatical” with high certainty (std ≤ 0.04). The advantage of fitting an HGCM is that the full response pattern is evaluated in the model; the information across items is therefore aggregated using endogenously estimated differential weights on each respondent’s answers.
To summarize, the consensus of the respondents was that short/long non-island and short island clauses were grammatical, and long island clauses were ungrammatical (the short island clauses retained their grammaticality as the wh-extraction involved no distance from the canonical location). These findings are consistent with previous findings based on simple statistics (e.g., Sprouse et al., 2011).
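The idea that ability-weighted aggregation can overturn a simple majority, as in the 44/58 split above, can be illustrated with a toy sketch (this is not the GCM's actual inference; the weights and votes below are hypothetical):

```python
# Toy illustration (not the GCM's actual posterior computation): when
# responses are weighted by estimated informant ability, a knowledgeable
# minority can overturn a simple majority vote.
def weighted_vote(responses, weights):
    """responses: 1 = 'True', 0 = 'False'; returns the weighted consensus."""
    true_mass = sum(w for r, w in zip(responses, weights) if r == 1)
    false_mass = sum(w for r, w in zip(responses, weights) if r == 0)
    return 1 if true_mass > false_mass else 0

responses = [1, 1, 0, 0, 0]            # simple majority says 'False'
abilities = [0.9, 0.9, 0.2, 0.2, 0.2]  # but the two 'True' voters know more
print(weighted_vote(responses, abilities))
```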

With respect to the predictors, the results from both models are shown in Table 1. As can be seen, the models are consistent in their findings. The intercept is 1.75 (β0l(θ)) for the HGCMN and 1.33 (β0θ) for the HGCMB, and these estimates suggest that the population is quite knowledgeable and shows a high level of consensus on the grammaticality of these sentences. For an item with difficulty 0 (that is, the population mean difficulty in our model), the probability of any given informant knowing the correct answer can be approximated by taking the inverse logit of the population mean of the abilities (β0l(θ)). However, this approximation works well only for cases where the population variance is small. In our case this turns out to be approximately 0.8, which suggests that the population is rather knowledgeable in general.

We investigated whether the questionnaire types could affect performance. As mentioned previously, questionnaire 1 was coded as a baseline, and then three person-specific covariates were made to indicate if a different questionnaire was filled out by the participant. As shown in Table 1, while all population posterior mean estimates for these coefficients are negative (−0.21, −0.06, −0.32), the magnitude of these negative effects is low, and the corresponding 95 % credible intervals (CI) are comparatively wide, providing no substantial evidence that the ability or guessing bias estimates would differ remarkably as a function of these indicators.

In addition to the questionnaire type, the two other person-specific covariates used to predict ability and guessing tendencies were age and gender. From Table 1 we can see that, with respect to age and ability, the posterior mean estimate is clearly positive (0.61), with a relatively narrow 95 % CI of (0.29, 0.93), suggesting that age is positively related to ability, with older respondents performing better. The rest of the coefficients for ability and guessing bias have very low magnitudes and high posterior variance, providing no evidence for effects.

As for the item covariates, the two predictors respectively corresponded to whether the item was an island (structural effect) and whether the wh-word distance from the canonical position was long (length effect). As can be seen in Table 1, both of these predictors have large magnitudes (0.51 and 1.11, respectively), and while their corresponding CIs are not especially narrow, they do indicate a connection between these predictors and item difficulty: the wh-word distance has a larger positive effect on item difficulty, while the island structure has a smaller, but still positive, effect.

TABLE 1.
Results on the regression coefficients based on the HGCMN and HGCMB.

                                            HGCMN                          HGCMB
                                     Posterior  CI percentiles      Posterior  CI percentiles
Model parameter   Covariate          mean       2.5 %     97.5 %    mean       2.5 %     97.5 %
Ability           Intercept           1.75       1.32      2.20      1.33       0.99      1.68
                  Age                 0.61       0.29      0.93      0.45       0.19      0.77
                  Gender             −0.10      −0.43      0.21     −0.09      −0.33      0.14
                  Questionnaire 2    −0.21      −0.61      0.19     −0.16      −0.48      0.15
                  Questionnaire 3    −0.06      −0.46      0.34     −0.06      −0.38      0.25
                  Questionnaire 4    −0.32      −0.73      0.08     −0.24      −0.55      0.05
Guessing bias     Intercept          −0.37      −1.13      0.22     −0.66      −1.01      0.06
                  Age                −0.15      −0.57      0.28     −0.16      −0.49      0.19
                  Gender              0.06      −0.33      0.45      0.08      −0.21      0.36
                  Questionnaire 2    −0.18      −0.72      0.36     −0.16      −0.54      0.23
                  Questionnaire 3    −0.30      −0.83      0.23     −0.39      −0.68      0.12
                  Questionnaire 4    −0.08      −0.61      0.43     −0.06      −0.44      0.31
Item-difficulty   Structure (‘island’) 0.51      0.13      0.93      0.47       0.16      0.85
                  Length (‘long’)     1.11       0.73      1.52      0.92       0.63      1.29

As can be seen from the various estimates, the HGCMN and HGCMB delivered very similar results. The models functionally differ from one another with respect to their population distribution types: in the HGCMN, the logit transformation is applied to all unit-scale parameters and a normal distribution is assumed as the population distribution, whereas the HGCMB models the unit-scale parameters on their original scale. Despite these differences, in this case the same conclusions could be drawn from the estimates of either model. Also, when a large number of random samples was generated from the population parameter estimates of the HGCMB and the HGCMN, and the inverse logit transformation was applied to the latter values, the actual shapes of the two types of population distributions did not differ remarkably.

Another noteworthy difference between the two hierarchical models is the ability of the HGCMN to directly model the dependence between the person-specific parameters via the multivariate normal parametrization. In this application, the dependency was modeled by ρθg, which turned out to be −0.18 (std = 0.14). Through this parameter, the possible dependency between a person’s ability (θ) and guessing bias (g) parameters is expressed directly in the estimation.

6. Model Fit

Model fit can be investigated in both absolute and relative terms. For the former, posterior predictive model checks (see, for example, Gelman et al., 2004) can be used. For the latter, the Deviance Information Criterion (DIC; Spiegelhalter, Best, Carlin, & van der Linde, 2002) is suggested, though others may be used.

6.1. Posterior Predictive Model Checking

Posterior predictive model checks (PPC) are set up by selecting a statistic that reflects an important feature of the real data, calculating that same statistic for many replicated data sets (based


on the model and parameter estimates), and then comparing the statistic of the real data with the ones generated from the replicated data. If the real data statistic does not appear to be consistent with the distribution of statistics generated from the replicated data, the proposed model is considered to provide a poor description of the data. Posterior predictive checks have been criticized for being too optimistic (Dey, Gelfand, Swartz, & Vlachos, 1998); therefore, although we offer two tests here, further improvements are desirable in this area.
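The general PPC recipe just described can be sketched in a few lines (our own generic skeleton, not the paper's code; the function names are hypothetical):

```python
# Generic skeleton of a posterior predictive check: compare a statistic of
# the real data against its distribution over replicated data sets.
def ppc_pvalue(statistic, real_data, replicated_data_sets):
    t_real = statistic(real_data)
    t_reps = [statistic(rep) for rep in replicated_data_sets]
    # Proportion of replicated statistics at least as large as the real one;
    # values near 0 or 1 flag misfit on this feature of the data.
    return sum(t >= t_real for t in t_reps) / len(t_reps)

# Toy usage with the sum of responses as the statistic.
reps = [[1, 0, 1], [1, 1, 1], [0, 0, 1], [1, 0, 0]]
print(ppc_pvalue(sum, [1, 0, 1], reps))
```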

One Answer Key An important assumption of the model is that all persons share the same latent answer key. As discussed earlier, this is expressed by the single-factor structure of the correlation matrix obtained from correlating the responses of informants across items. This property follows from the proven theorem that the correlation between any two informants is equal to the product of each informant’s correlation with the latent answer key, as formally written in Equation (6).

Thus the model check for the assumption that all persons share the same latent answer key involves verifying that the person-by-person correlation matrix has a single-factor structure, and this is achieved for the GCM with a factor-analytic approach. Typically, standard minimum residual factor analysis (MINRES, Comrey, 1962) is utilized, from which one obtains the eigenvalues of the person-by-person correlation matrix. The single-factor structure is checked by observing the pattern of the first and subsequent eigenvalues of the correlation matrix and seeing how they decline. A one-factor solution is supported by a sharp decline after the first eigenvalue, with the rest of the eigenvalues following a linearly decreasing trend.

The posterior predictive check for the single answer key property of the data is carried out by assessing how closely the eigenvalue series of the correlation matrices of the posterior predictive data resemble the eigenvalue series of the correlation matrix of the real data. In particular, this graphical PPC involves plotting the many series of eigenvalues from the posterior predictive data against the series from the real data, in order to assess whether the posterior predictive data mimic the eigenvalue trend of the real data.

Figure 2 displays the results of this test carried out for the grammaticality data set for both the HGCMN (left panel) and the HGCMB (right panel). The continuous gray lines depict the eigenvalue

FIGURE 2. PPC: Eigenvalue curves for both the HGCMN (left) and the HGCMB (right) (grammaticality data set). The gray lines depict the eigenvalue series generated from the posterior predictive data, while the black line is the series from the real data.


series generated from 1000 sets of the posterior predictive data, while the black line is the series from the real data. Since the black line closely overlaps the gray area in each plot, the figure shows an appropriate fit of the eigenvalue trend in both cases, and thus the posterior predictive data have a similar single-factor structure as the real data.

Item Heterogeneity Another important property of the data concerns the marginal frequencies across items. We use a measure called the Variance Dispersion Index (VDI; see, e.g., Batchelder & Anders, 2012) to represent the variation in responses over informants on each item. The VDI is calculated by first computing the variance in responses over informants on each item (the variance of each kth column), and then taking the variance of these column variances. Formally,

VDI(X) = ∑k=1..M V²k / M − (∑k=1..M Vk / M)², (22)

where

Vk = Pk(1 − Pk), and Pk = ∑i=1..N Xik / N.
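Equation (22) translates directly into a few lines of code; the following sketch (our own illustration) computes the VDI of a small binary response matrix:

```python
# Direct translation of Equation (22): VDI is the variance, across items,
# of the per-item response variances Vk = Pk(1 - Pk).
def vdi(X):
    """X: list of N informant rows, each a list of M binary responses."""
    N, M = len(X), len(X[0])
    P = [sum(X[i][k] for i in range(N)) / N for k in range(M)]
    V = [p * (1 - p) for p in P]
    return sum(v ** 2 for v in V) / M - (sum(V) / M) ** 2

X = [[1, 1, 0],
     [1, 0, 0],
     [1, 1, 1],
     [1, 0, 1]]
print(vdi(X))   # column variances 0, 0.25, 0.25 -> VDI = 1/72
```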

The VDI-based PPC involves calculating the VDI of the real data and checking whether it falls within the 95th percentile of the distribution of VDI statistics calculated from model-based simulated data sets. To further illustrate the strength of the VDI check, an HGCMN with homogeneous items (all δk set to 0.5) was also fit to the grammaticality data. In Figure 3, the VDI model checks are displayed for the HGCMN (left panel) with heterogeneous (continuous line) and homogeneous (dotted line) item-difficulty, and for the HGCMB (right panel) with heterogeneous item-difficulty. The black lines depict the VDI statistic calculated from the real data. The VDI checks are satisfied for both models with heterogeneous item-difficulty. As the dotted line in Figure 3 illustrates, the VDI check was not passed for the item-homogeneous version of the model.

FIGURE 3. PPC: VDI distributions based on data generated by the HGCMN (left panel) with heterogeneous (continuous line) and homogeneous (dotted line) item-difficulty, and by the heterogeneous HGCMB (right panel), are depicted with gray curves, while the black lines indicate the VDI value of the real data set.


TABLE 2.
DIC results on different HGCMs fit to the grammaticality data set.

Model type                               DIC
HGCMN with covariates                   3842
HGCMB with covariates                   3869
HGCMN no covariates                     3876
HGCMN with neutral bias                 4116
HGCMN with neutral item-difficulty      4309

Deviance Information Criterion The DIC evaluates goodness-of-fit in relative terms. As it measures model fit in terms of deviance between the model and the data, models with smaller DIC values should be able to predict the same type of data better. Table 2 shows the DIC values for different versions of the HGCM on the grammaticality data: the HGCMN, the HGCMB, the HGCMN where the guessing bias was set to the same, neutral level (gi = 0.5) for all persons, and finally the HGCMN in which item-homogeneity (neutral item-difficulty) was assumed. In all these models covariate information was included. In addition, the HGCMN was estimated without covariate information as well.

As can be seen, the HGCMN with its full parametrization seems to be the best-fitting model for this data set. The second-best fitting model is the HGCMB, while the HGCMN without covariate information still appears to do better than the versions in which random effects were turned into fixed ones.

7. Discussion

This paper provides multiple extensions to the estimation techniques of the popular General Condorcet Model published originally by Batchelder and Romney (1988). Through the developments of this paper, the GCM can be embedded in a hierarchical inference framework in which covariates of both persons and items can be incorporated into the estimation; the accompanying code for these developments is provided as online supplements.

We developed two approaches to estimating the model parameters: employing the normal distribution on the transformed parameters (the HGCMN), or employing the beta distribution on the unit scale (the HGCMB). A number of parameter recovery studies (though not reported in this paper) suggested that both models do well in recovering parameters from data simulated by the identical model, as well as by the other model: that is, the HGCMN did well in recovering equivalent parameters generated by the HGCMB, and vice versa.

A possible disadvantage of the beta-distribution-based modeling approach is that possible connections or dependencies between abilities and guessing biases cannot be modeled directly, whereas they are incorporated in the normal-based model. However, in the Bayesian modeling framework, such dependencies can still be discovered by post-processing the person-specific posterior distributions of these two parameters (such post-hoc parameters are sometimes called structural or derived parameters; see Jackman, 2009; Congdon, 2003). For example, the directly modeled correlation of ability and bias for the informants in the HGCMN was ρθg = −0.18 (std = 0.14). Approximate measures can ultimately be obtained with the beta model by correlating the posterior samples of the person-specific ability and guessing bias parameters at each iteration, which results in a posterior distribution of the derived correlation between the two parameters for the HGCMB. The posterior mean of this derived correlation parameter turns out to be −0.15 (std = 0.08), which is very close to the measure delivered by the HGCMN. When estimating parameters in the HGCMB, the constrained prior information of independent ability

Page 18: ,R ANDERS AND WILLIAM H. BATCHELDER UNIVERSITY OF ...zoravecz/bayes/data/Articles/HBmodelingforG… · PSYCHOMETRIKA 2013 DOI: 10.1007/S11336-013-9379-4 HIERARCHICAL BAYESIAN MODELING

PSYCHOMETRIKA

and guessing bias affected the estimation process, which can therefore mitigate the effect of apossibly existing dependence between these two parameters. This is because by not modelingcorrelations, independence is assumed, and therefore the parameter estimates are biased towardsthat (especially the population parameters).
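The derived-correlation computation described above can be sketched in a few lines of Python with NumPy. The posterior draws here are simulated placeholders (independent beta draws), not the paper's actual MCMC samples; only the post-processing step is illustrated:

```python
import numpy as np

# Hypothetical posterior draws: S MCMC iterations for I informants.
# theta_draws[s, i] = ability of informant i at iteration s;
# g_draws[s, i] = guessing bias of informant i at iteration s.
rng = np.random.default_rng(0)
S, I = 2000, 30
theta_draws = rng.beta(4, 2, size=(S, I))
g_draws = rng.beta(2, 2, size=(S, I))

# Derived parameter: correlate ability and bias across informants at
# each iteration, yielding a posterior distribution of the correlation.
corr_draws = np.array(
    [np.corrcoef(theta_draws[s], g_draws[s])[0, 1] for s in range(S)]
)

# Summarize as a posterior mean and standard deviation.
print(corr_draws.mean(), corr_draws.std())
```

Because the placeholder draws are independent, the derived correlation here centers near zero; with real HGCM_B samples it would recover a dependence like the one reported above.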

While the current application relied on a data set with a relatively large number of respondents and items, we emphasize that the HGCM also does well in recovering the answer key and other parameters with a relatively small number of informants. For example, Batchelder and Anders (2012) show excellent parameter recoveries using a model similar to the HGCM_B with as few as N = 6 informants.

Acknowledgements

Work on this paper was supported by grants to the authors from the Army Research Office (ARO) and from the Oak Ridge Institute for Science and Education (ORISE). We would like to thank Jon Sprouse for making available to us his grammaticality data set.

We would also like to thank the four anonymous reviewers and Joachim Vandekerckhove for their useful comments.

Appendix A. Proof of Identifiability for the General Condorcet Model

Let the parameters of the GCM be (D, G, Z), with parameter spaces, respectively, (0,1)^N, (0,1)^N, and {0,1}^M. Let Ω be the parameter space of (D, G, Z). Let Y be an N × M matrix of 1s and 0s. Let S be the space of all Y. Let h : Ω → Π, where Π is the space of all probability distributions over Y. Let p(Y | (D, G, Z)) be a particular probability density function in Π.

Definition. In this context, a model is identified if h is one-to-one, meaning that two different sets of parameters necessarily produce different probability distributions over Y.

Observation 1. The model is not identified if we allow all items to be false or if we allow all items to be true. Let Z be all 1s and Z′ be all 0s; then, so long as

\[
\forall i:\quad D_i + (1 - D_i)\, g_i = (1 - D_i')\, g_i',
\]

the model gives identical probabilities.
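Observation 1 can be checked numerically. The following sketch, with hypothetical D and g values, constructs a second parameter set satisfying the displayed equality and confirms that the two parameter sets imply the same response probabilities:

```python
import numpy as np

# Hypothetical ability and guessing-bias values for 5 informants.
D = np.array([0.3, 0.5, 0.7, 0.2, 0.9])
g = np.array([0.4, 0.6, 0.5, 0.8, 0.3])

# P(Y_ik = 1) under the GCM (no item difficulty):
# if Z_k = 1: respond True by knowing, or by guessing True;
# if Z_k = 0: respond True only by guessing True.
p_true_z1 = D + (1 - D) * g          # all items keyed True (Z all 1s)

# Construct (D', g') with D' = 0 and g' = D + (1 - D) g, so that
# (1 - D') g' matches the probability above when all items are keyed False.
D_prime = np.zeros_like(D)
g_prime = D + (1 - D) * g
p_true_z0 = (1 - D_prime) * g_prime  # all items keyed False (Z' all 0s)

# The two parameter sets imply identical response distributions.
assert np.allclose(p_true_z1, p_true_z0)
```

Since D + (1 − D)g always lies in (0, 1), such a second parameter set exists for any (D, g), which is exactly why the two extreme answer keys must be excluded.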

Observation 2. If we exclude these two extremes from the space of possible Z's, the model is identified. We need to show that

\[
\forall Y \in S:\quad p(Y \mid D, G, Z) = p(Y \mid D', G', Z') \;\Longrightarrow\; D = D',\ G = G',\ Z = Z'.
\]

Suppose Z = Z′. Then there are k and l with Z_k = 1 = Z′_k and Z_l = 0 = Z′_l. From these we have, for all informants i,

\[
D_i + (1 - D_i)\, g_i = D_i' + (1 - D_i')\, g_i'
\quad\text{and}\quad
(1 - D_i)\, g_i = (1 - D_i')\, g_i'.
\]

From these we have D_i = D_i' and g_i = g_i'. So if the model is not identified, it must be that Z ≠ Z′. Pick k (without loss of generality, since primed and unprimed can be swapped) with Z_k = 1 and Z′_k = 0. For all informants i we have

\[
D_i + (1 - D_i)\, g_i = (1 - D_i')\, g_i'. \tag{A.1}
\]


Next pick j such that Z′_j = 1; this is possible because we have eliminated the case where Z′ can be all zeros. If Z_j = 1, then for all i,

\[
D_i + (1 - D_i)\, g_i = D_i' + (1 - D_i')\, g_i',
\]

and coupled with Equation (A.1), this is not possible. On the other hand, if Z_j = 0, then for all informants i we have

\[
(1 - D_i)\, g_i = D_i' + (1 - D_i')\, g_i',
\]

and coupled with Equation (A.1), this is not possible. Thus the model is identified.

Appendix B. Posterior Distributions of the HGCMs

The posterior distribution given the data for the HGCM with Gaussian population distributions can be derived in the following way. For notational convenience, all of the person- and item-specific parameters are respectively collected into corresponding vectors (i.e., θ, g, δ, and Z). First, the conditional posterior for the model in which the probability variables have population distributions assigned on the transformed variable scale is derived as

\[
\begin{aligned}
\Pr(&\theta, \beta_{\theta}, \sigma^2_{\theta}, g, \beta_{g}, \sigma^2_{g}, \delta, \beta_{\delta}, \sigma^2_{\delta}, Z, \pi \mid Y) \\
\propto{}& \prod_{i=1}^{I} \prod_{k=1}^{K}
\bigl(Z_k D_{ik} + (1 - D_{ik})\, g_i\bigr)^{(Y_{ik} \equiv 1)}
\bigl(-Z_k D_{ik} + D_{ik} + (1 - D_{ik})(1 - g_i)\bigr)^{(Y_{ik} \equiv 0)} \\
&\times \prod_{i=1}^{I} \mathrm{Normal}_2\!\left(
\begin{bmatrix}\mathrm{logit}(\theta_i)\\ \mathrm{logit}(g_i)\end{bmatrix}
\,\middle|\,
\begin{bmatrix}\beta_{l(\theta)}\\ \beta_{l(g)}\end{bmatrix},
\Sigma_{l(\theta g)}\right) \\
&\times \prod_{k=1}^{K} \mathrm{Normal}\bigl(\mathrm{logit}(\delta_k) \mid \beta_{l(\delta)}, \sigma^2_{\delta}\bigr)
\prod_{k=1}^{K} \mathrm{Bernoulli}(Z_k \mid \pi) \\
&\times \mathrm{Normal}_{J+1}(\beta_{l(\theta)} \mid \mathbf{0}, 10\,\mathbf{I}_{J+1})\,
\mathrm{Normal}_{J+1}(\beta_{l(g)} \mid \mathbf{0}, 1000\,\mathbf{I}_{J+1}) \\
&\times \mathrm{Normal}_{H+1}(\beta_{l(\delta)} \mid \mathbf{0}, 10\,\mathbf{I}_{H+1})\,
\mathrm{Uniform}(\pi \mid 0, 1) \\
&\times \text{Inverse-Wishart}(\Sigma_{l(\theta g)} \mid \mathbf{I}_2, 3)\,
\text{Inverse-Gamma}(\sigma^2_{\delta} \mid 0.01, 0.01).
\end{aligned}
\]

After the proportionality sign, the first double product describes the likelihood of the parameters given the data, based on Equations (3). It is followed by the products of the population densities of the person-specific parameters, as specified in Equations (8). The next line describes the population densities of the item-specific parameters as in Equations (9) and (10). Finally, the last two lines multiply all of the above by the prior densities chosen in Equations (17), (19), (21), and (20).
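As a sketch of how the likelihood part of this posterior (the first double product) can be evaluated, the following Python fragment computes the GCM log-likelihood for a small hypothetical data set. The function and all values here are illustrative placeholders, not quantities from the paper:

```python
import numpy as np

def gcm_loglik(Y, D, g, Z):
    """Log-likelihood of a GCM response matrix Y (informants x items),
    given hit probabilities D (informants x items), guessing biases g,
    and a latent answer key Z; corresponds to the first double product
    in the posterior above (hypothetical helper)."""
    # P(Y_ik = 1) = Z_k * D_ik + (1 - D_ik) * g_i;
    # its complement recovers the (Y_ik = 0) factor term by term.
    p1 = Z[None, :] * D + (1 - D) * g[:, None]
    return float(np.sum(Y * np.log(p1) + (1 - Y) * np.log(1 - p1)))

# Hypothetical small example: 3 informants, 4 items.
rng = np.random.default_rng(1)
D = rng.uniform(0.2, 0.9, size=(3, 4))
g = np.array([0.4, 0.5, 0.6])
Z = np.array([1, 0, 1, 1])
Y = (rng.uniform(size=(3, 4)) < Z[None, :] * D + (1 - D) * g[:, None]).astype(int)
ll = gcm_loglik(Y, D, g, Z)
print(ll)
```

Note that 1 − p1 expands to D_ik(1 − Z_k) + (1 − D_ik)(1 − g_i), which is exactly the second factor in the double product.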

The only modification for the prior settings of the HGCM with the beta population distributions concerns the variance parameters. As θ_i and g_i are sampled univariately, priors have to be set for their 'precision' parameters (which determine their population variances). As is typically done for precision parameters, a moderately diffuse Gamma distribution can be chosen, where

τθ ∼ Gamma(1,0.1). (B.1)

Then τ_g, as well as τ_δ for item difficulty, are set similarly.

Note that the HGCM_B differs only in terms of the population distributions for the item- and person-specific parameters. As discussed earlier, the beta distribution is parameterized in terms of regression coefficients and a precision parameter, as specified in Equations (16) and (B.1), and the posterior is written as

\[
\begin{aligned}
\Pr(&\theta, \beta_{\theta}, \tau_{\theta}, g, \beta_{g}, \tau_{g}, \delta, \beta_{\delta}, \tau_{\delta}, Z, \pi \mid Y) \\
\propto{}& \prod_{i=1}^{I} \prod_{k=1}^{K}
\bigl(Z_k D_{ik} + (1 - D_{ik})\, g_i\bigr)^{(Y_{ik} \equiv 1)}
\bigl(-Z_k D_{ik} + D_{ik} + (1 - D_{ik})(1 - g_i)\bigr)^{(Y_{ik} \equiv 0)} \\
&\times \prod_{i=1}^{I} \mathrm{Beta}(\theta_i \mid \beta_{\theta}, \tau_{\theta})
\prod_{i=1}^{I} \mathrm{Beta}(g_i \mid \beta_{g}, \tau_{g}) \\
&\times \prod_{k=1}^{K} \mathrm{Beta}(\delta_k \mid \beta_{\delta}, \tau_{\delta})
\prod_{k=1}^{K} \mathrm{Bernoulli}(Z_k \mid \pi) \\
&\times \mathrm{Normal}_{J+1}(\beta_{\theta} \mid \mathbf{0}, 10\,\mathbf{I}_{J+1})\,
\mathrm{Normal}_{J+1}(\beta_{g} \mid \mathbf{0}, 1000\,\mathbf{I}_{J+1}) \\
&\times \mathrm{Normal}_{H+1}(\beta_{\delta} \mid \mathbf{0}, 10\,\mathbf{I}_{H+1})\,
\mathrm{Uniform}(\pi \mid 0, 1) \\
&\times \mathrm{Gamma}(\tau_{\theta} \mid 1, 0.1)\,
\mathrm{Gamma}(\tau_{g} \mid 1, 0.1)\,
\mathrm{Gamma}(\tau_{\delta} \mid 1, 0.1).
\end{aligned}
\]
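The mean-precision parameterization of the beta population distributions used above, Beta(μτ, (1 − μ)τ) for a mean μ in (0, 1) and precision τ > 0, can be illustrated with a short Python sketch; the μ and τ values below are hypothetical:

```python
import numpy as np

def beta_mean_precision(mu, tau, size, rng):
    """Draw from a beta distribution parameterized by mean mu and
    precision tau, i.e. Beta(mu * tau, (1 - mu) * tau), matching the
    population distributions of the HGCM_B (hypothetical values)."""
    return rng.beta(mu * tau, (1 - mu) * tau, size=size)

rng = np.random.default_rng(2)
draws = beta_mean_precision(mu=0.7, tau=20.0, size=100_000, rng=rng)

# The sample mean approaches mu, and the variance approaches
# mu * (1 - mu) / (tau + 1) = 0.01: larger tau concentrates the draws.
print(draws.mean(), draws.var())
```

This makes the regression structure transparent: covariates shift μ through the inverse-logit link, while τ controls how tightly individual parameters cluster around that population mean.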

Appendix C. JAGS Code for the HGCMs

C.1. Normal Population Distributions

model{
  # GCM likelihood: D[i,k] combines ability theta[i] and difficulty delta[k]
  for (i in 1:n){
    for (k in 1:m){
      D[i,k] <- (theta[i]*(1-delta[k]))/
                (theta[i]*(1-delta[k])+(1-theta[i])*delta[k])
      pY[i,k] <- g[i] - D[i,k]*g[i] + D[i,k]*Z[k]
      Y[i,k] ~ dbern(pY[i,k])
    }
  }

  # Person-specific parameters: bivariate normal on the logit scale,
  # with covariate-predicted means
  for (i in 1:n){
    mean_theta[i] <- Xtheta[i,]%*%coeff_theta
    mean_g[i] <- Xg[i,]%*%coeff_g
    mean_thetag[i,1] <- mean_theta[i]
    mean_thetag[i,2] <- mean_g[i]
    logit_thetag[i,1:2] ~ dmnorm(mean_thetag[i,1:2],precM_thetag)
    theta[i] <- ilogit(logit_thetag[i,1])
    g[i] <- ilogit(logit_thetag[i,2])
  }

  # Item-specific parameters: latent answer key and logit-normal difficulty
  for (k in 1:m){
    Z[k] ~ dbern(PI)
    delta[k] <- ilogit(logit_delta[k])
    mean_delta[k] <- Xdelta[k,]%*%coeff_delta
    logit_delta[k] ~ dnorm(mean_delta[k],prec_delta)
  }

  # Priors on the regression coefficients
  for (cov in 1:nrofthetacov){coeff_theta[cov] ~ dnorm(0, 0.01)}
  for (cov in 1:nrofgcov){coeff_g[cov] ~ dnorm(0, 0.01)}

  # Derived variances and correlation of ability and bias
  cov_thetag[1:2,1:2] <- inverse(precM_thetag[1:2,1:2])
  var_theta <- cov_thetag[1,1]
  var_g <- cov_thetag[2,2]
  corr_thetag <- cov_thetag[1,2]/(sqrt(var_theta)*sqrt(var_g))

  # Wishart prior on the precision matrix
  precM_thetag[1:2,1:2] ~ dwish(ID[1:2,1:2], 3)
  ID[1,1] <- 1
  ID[2,2] <- 1
  ID[1,2] <- 0
  ID[2,1] <- 0

  for (cov in 1:nrofdeltacov){coeff_delta[cov] ~ dnorm(0, 0.01)}
  prec_delta ~ dgamma(0.01, 0.01)
  var_delta <- 1/prec_delta

  PI ~ dunif(0, 1)
}
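As a sanity check on the deterministic part of the model above, the following Python sketch (with hypothetical parameter values) reproduces the D[i,k] and pY[i,k] computations for a single informant-item pair:

```python
# Hypothetical values for one informant and one item.
theta = 0.8   # informant ability
delta = 0.6   # item difficulty
g = 0.4       # guessing bias
Z = 1         # latent answer key value for the item

# Rasch-type combination of ability and difficulty, as in the JAGS code.
D = (theta * (1 - delta)) / (theta * (1 - delta) + (1 - theta) * delta)

# P(respond True) = g + D*(Z - g): a hit when knowing, a guess otherwise.
pY = g - D * g + D * Z

print(round(D, 4), round(pY, 4))  # prints 0.7273 0.8364
```

With Z = 1 this reduces to D + (1 − D)g, and with Z = 0 to (1 − D)g, matching the likelihood factors in Appendix B.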

C.2. Beta Population Distributions

model{
  # GCM likelihood: identical to the normal version above
  for (i in 1:n){
    for (k in 1:m){
      D[i,k] <- (theta[i]*(1-delta[k]))/
                (theta[i]*(1-delta[k])+(1-theta[i])*delta[k])
      pY[i,k] <- g[i] - D[i,k]*g[i] + D[i,k]*Z[k]
      Y[i,k] ~ dbern(pY[i,k])
    }
  }

  # Person-specific parameters: beta population distributions with
  # covariate-predicted means and precision parameters
  for (i in 1:n){
    mean_theta[i] <- ilogit(Xtheta[i,]%*%coeff_theta)
    mean_g[i] <- ilogit(Xg[i,]%*%coeff_g)
    theta[i] ~ dbeta(mean_theta[i]*prec_theta,
                     (1-mean_theta[i])*prec_theta)
    g[i] ~ dbeta(mean_g[i]*prec_g,
                 (1-mean_g[i])*prec_g)
  }

  # Item-specific parameters: latent answer key and beta difficulty
  for (k in 1:m){
    Z[k] ~ dbern(PI)
    mean_delta[k] <- ilogit(Xdelta[k,]%*%coeff_delta)
    delta[k] ~ dbeta(mean_delta[k]*prec_delta,
                     (1-mean_delta[k])*prec_delta)
  }

  # Priors on the regression coefficients
  for (cov in 1:nrofthetacov){coeff_theta[cov] ~ dnorm(0, 0.01)}
  for (cov in 1:nrofgcov){coeff_g[cov] ~ dnorm(0, 0.01)}
  for (cov in 1:nrofdeltacov){coeff_delta[cov] ~ dnorm(0, 0.01)}

  # Priors on the precision parameters
  prec_theta_root ~ dgamma(1, 0.1)
  prec_theta <- pow(prec_theta_root,2)
  prec_g_root ~ dgamma(1, 0.1)
  prec_g <- pow(prec_g_root,2)
  prec_delta_root ~ dgamma(1, 0.1)
  prec_delta <- pow(prec_delta_root,2)

  PI ~ dunif(0, 1)
}

References

Baer, R.D., Weller, S.C., de Alba Garcia, J.G., Glazer, M., Trotter, R., Pachter, L., et al. (2003). A cross-cultural approach to the study of the folk illness nervios. Culture, Medicine and Psychiatry, 27, 315–337.

Batchelder, W.H., & Anders, R. (2012). Cultural consensus theory: comparing different concepts of cultural truth. Journal of Mathematical Psychology, 56, 316–332.

Batchelder, W.H., Kumbasar, E., & Boyd, J. (1997). Consensus analysis of three-way social network data. The Journal of Mathematical Sociology, 22, 29–58.

Batchelder, W.H., & Riefer, D.M. (1999). Theoretical and empirical review of multinomial process tree modeling. Psychonomic Bulletin & Review, 6(1), 57–86.

Batchelder, W.H., & Romney, A.K. (1988). Test theory without an answer key. Psychometrika, 53, 71–92.

Batchelder, W.H., Strashny, A., & Romney, A. (2010). Cultural Consensus Theory: aggregating continuous responses in a finite interval.

Bimler, D. (2013). Two applications of the points-of-view model to subject variations in sorting data. Quality and Quantity, 47(2), 775–790.

Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In F.M. Lord & M.R. Novick (Eds.), Statistical theories of mental test scores. Reading: Addison-Wesley.

Buhrmester, M., Kwang, T., & Gosling, S.D. (2011). Amazon's Mechanical Turk: a new source of inexpensive, yet high-quality, data? Perspectives on Psychological Science.

Comrey, A.L. (1962). The minimum residual method of factor analysis. Psychological Reports, 11, 15–18.

Congdon, P. (2003). Applied Bayesian modelling (Vol. 394). New York: Wiley.

De Boeck, P. (2008). Random item IRT models. Psychometrika, 73, 533–559.

De Boeck, P., & Wilson, M. (2004). Explanatory item response models: a generalized linear and nonlinear approach. New York: Springer.

Dey, D.K., Gelfand, A.E., Swartz, T.B., & Vlachos, P.K. (1998). A simulation-intensive approach for checking hierarchical models. Test, 7(2), 325–346.

Fischer, G., & Molenaar, I. (1995). Rasch models: foundations, recent developments, and applications. New York: Springer.

Gelman, A., Carlin, J., Stern, H., & Rubin, D. (2004). Bayesian data analysis. New York: Chapman & Hall.

Gelman, A., & Hill, J. (2007). Data analysis using regression and multilevel/hierarchical models. Cambridge: Cambridge University Press.

Hopkins, A. (2011). Use of network centrality measures to explain individual levels of herbal remedy cultural competence among the Yucatec Maya in Tabi, Mexico. Field Methods, 23(3), 307–328.

Hruschka, D.J., Kalim, N., Edmonds, J., & Sibley, L. (2008). When there is more than one answer key: cultural theories of postpartum hemorrhage in Matlab, Bangladesh. Field Methods, 20, 315–337.

Iannucci, A., & Romney, A. (1990). Consensus in the judgment of personality traits among friends and acquaintances. Journal of Quantitative Anthropology, 4, 279–295.

Jackman, S. (2009). Bayesian analysis for the social sciences. New York: Wiley.

Karabatsos, G., & Batchelder, W.H. (2003). Markov chain estimation methods for test theory without an answer key. Psychometrika, 68, 373–389.

Klauer, K. (2010). Hierarchical multinomial processing tree models: a latent-trait approach. Psychometrika, 75, 70–98.

Kruschke, J.K. (2011). Doing Bayesian data analysis: a tutorial with R and BUGS. New York: Academic Press.

Lee, M.D. (2011). How cognitive modeling can benefit from hierarchical Bayesian models. Journal of Mathematical Psychology, 55, 1–7.

Lunn, D.J., Thomas, A., Best, N., & Spiegelhalter, D. (2000). WinBUGS—a Bayesian modelling framework: concepts, structure, and extensibility. Statistics and Computing, 10, 325–337.

Macmillan, N.A., & Creelman, C.D. (2005). Detection theory: a user's guide (2nd ed.). Mahwah: Erlbaum.

Merkle, E.C., Smithson, M., & Verkuilen, J. (2011). Hierarchical models of simple mechanisms underlying confidence in decision making. Journal of Mathematical Psychology, 55, 57–67.

Miller, E. (2011). Maternal health and knowledge and infant health outcomes in the Ariaal people of northern Kenya. Social Science & Medicine, 73(8), 1266–1274.

Mood, A.M., Graybill, F.A., & Boes, D.C. (1974). Introduction to the theory of statistics. New York: McGraw-Hill.

Morey, R.D. (2011). A Bayesian hierarchical model for the measurement of working memory capacity. Journal of Mathematical Psychology, 55(1), 8–24.

Oravecz, Z., Vandekerckhove, J., & Batchelder, W.H. (in press). Bayesian cultural consensus theory. Field Methods.

Plummer, M. (2011). rjags: Bayesian graphical models using MCMC (R package version 2.2.0-3). http://CRAN.R-project.org/package=rjags.

Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen: Danish Institute for Educational Research.

Raudenbush, S.W., & Bryk, A.S. (2002). Hierarchical linear models: applications and data analysis methods. Newbury Park: Sage.

Riefer, D.M., Knapp, B.R., Batchelder, W.H., Bamber, D., & Manifold, V. (2002). Cognitive psychometrics: assessing storage and retrieval deficits in special populations with multinomial processing tree models. Psychological Assessment, 14(2), 184.

Robert, C.P., & Casella, G. (2004). Monte Carlo statistical methods. New York: Springer.

Romney, A.K., & Batchelder, W.H. (1999). Cultural consensus theory. In R. Wilson & F. Keil (Eds.), The MIT encyclopedia of the cognitive sciences (pp. 208–209). Cambridge: MIT Press.

Romney, A.K., Weller, S.C., & Batchelder, W.H. (1986). Culture as consensus: a theory of culture and informant accuracy. American Anthropologist, 88(2).

Rouder, J., & Lu, J. (2005). An introduction to Bayesian hierarchical models with an application in the theory of signal detection. Psychonomic Bulletin & Review, 12, 573–604.

Scheibehenne, B., Rieskamp, J., & Wagenmakers, E.J. (2013). Testing adaptive toolbox models: a Bayesian hierarchical approach. Psychological Review, 120, 39–64.

Smith, J.B., & Batchelder, W.H. (2010). Beta-MPT: multinomial processing tree models for addressing individual differences. Journal of Mathematical Psychology, 54(1), 167–183.

Snijders, T., & Bosker, R. (1999). Multilevel analysis: an introduction to basic and advanced multilevel modeling. Thousand Oaks: Sage.

Spearman, C.E. (1904). 'General intelligence' objectively determined and measured. The American Journal of Psychology, 15, 72–101.

Spiegelhalter, D.J., Best, N.G., Carlin, B.P., & van der Linde, A. (2002). Bayesian measures of model complexity and fit (with discussion). Journal of the Royal Statistical Society, Series B, 64, 583–640.

Sprouse, J., Fukuda, S., Ono, H., & Kluender, R. (2011). Reverse island effects and the backward search for a licensor in multiple wh-questions. Syntax, 14(2), 179–203.

Weller, S.C. (2007). Cultural consensus theory: applications and frequently asked questions. Field Methods, 19, 339–368.

Weller, S.C., Baer, R.D., Pachter, L.M., Trotter, R., Glazer, M., de Alba Garcia, J.G., et al. (1999). Latino beliefs about diabetes. Diabetes Care, 22, 722–728.

Weller, S.C., Pachter, L.M., Trotter, R.T., & Baer, R.D. (1993). Empacho in four Latino groups: a study of intra- and inter-cultural variation in beliefs. Medical Anthropology, 15(2), 109–136.

Wetzels, R., Vandekerckhove, J., Tuerlinckx, F., & Wagenmakers, E.J. (2010). Bayesian parameter estimation in the expectancy valence model of the Iowa gambling task. Journal of Mathematical Psychology, 54, 14–27.

Yoshino, R. (1989). An extension of the "test theory without answer key" by Batchelder and Romney and its application to an analysis of data on national consciousness. Proceedings of the Institute of Statistical Mathematics, 37, 171–188 (in Japanese).

Manuscript Received: 14 DEC 2012