A Brief and Friendly Introduction to Mixed-Effects Models in … · 2009. 4. 30. · A Brief and...

A Brief and Friendly Introduction toMixed-Effects Models in Psycholinguistics

b1 b2 bM· · ·

· · ·

x11 x1n1

y11 y1n1

· · ·

x21 x2n2

y21 y2n2

· · ·

xM1 xMnM

yM1 yMnM

· · ·

Cluster-specificparameters

(“random effects”)

Shared parameters(“fixed effects”)

Parameters governinginter-cluster variability

Roger Levy (modified by Florian Jaeger)

UC San DiegoDepartment of Linguistics

25 March 2009

Goals of this talk

I Briefly review generalized linear models and how to use them

I Give a precise description of multi-level models

I Show how to draw inferences using a multi-level model (fittingthe model)

I Discuss how to interpret model parameter estimatesI Fixed effectsI Random effects

I Briefly discuss multi-level logit models

Reviewing generalized linear models I

Goal: model the effects of predictors (independent variables) X ona response (dependent variable) Y .

The picture:

yn· · ·

Predictors

Model parameters

Response

The picture:

yn· · ·

Predictors

Model parameters

Response

The picture:

yn· · ·

Predictors

Model parameters

Response

The picture:

yn· · ·

Predictors

Model parameters

Response

Reviewing GLMs II

Assumptions of the generalized linear model (GLM):

1. Predictors {Xi} influence Y through the mediation of a linearpredictor η;

2. η is a linear combination of the {Xi}:

η = α + β1X1 + · · ·+ βNXN (linear predictor)

3. η determines the predicted mean µ of Y

η = l(µ) (link function)

4. There is some noise distribution of Y around the predictedmean µ of Y :

P(Y = y ;µ)

Reviewing GLMs II

P(Y = y ;µ)

Reviewing GLMs II

P(Y = y ;µ)

Reviewing GLMs II

P(Y = y ;µ)

Reviewing GLMs II

P(Y = y ;µ)

Reviewing GLMs III

Linear regression, which underlies ANOVA, is a kind of generalizedlinear model.

I The predicted mean is just the linear predictor:

η = l(µ) = µ

I Noise is normally (=Gaussian) distributed around 0 withstandard deviation σ:

ε ∼ N(0, σ)

I This gives us the traditional linear regression equation:

Predicted Mean µ = η︷︸︸︷α + β1X1 + · · ·+ βnXn +

Noise∼N(0,σ)︷︸︸︷ε

Reviewing GLMs III

η = l(µ) = µ

ε ∼ N(0, σ)

Reviewing GLMs III

η = l(µ) = µ

ε ∼ N(0, σ)

Reviewing GLMs III

η = l(µ) = µ

ε ∼ N(0, σ)

Reviewing GLMs IV

Logistic regression, too, is a kind of generalized linear model.

I The linear predictor:

η = α + β1X1 + · · ·+ βnXn

I The link function g is the logit transform:

E(Y) = p = g−1(η)⇔

g(p) = lnp

1− p= η = α + β1X1 + · · ·+ βnXn (1)

I The distribution around the mean is taken to be binomial.

Reviewing GLMs IV

η = α + β1X1 + · · ·+ βnXn

E(Y) = p = g−1(η)⇔

g(p) = lnp

1− p= η = α + β1X1 + · · ·+ βnXn (1)

Reviewing GLMs IV

η = α + β1X1 + · · ·+ βnXn

E(Y) = p = g−1(η)⇔

g(p) = lnp

1− p= η = α + β1X1 + · · ·+ βnXn (1)

Reviewing GLMs IV

η = α + β1X1 + · · ·+ βnXn

E(Y) = p = g−1(η)⇔

g(p) = lnp

1− p= η = α + β1X1 + · · ·+ βnXn (1)

Reviewing GLMs V

Predicted Mean︷︸︸︷α + β1X1 + · · · + βnXn +

I How do we fit the parameters βi and σ (choose modelcoefficients)?

I There are two major approaches (deeply related, yet different)in widespread use:

I The principle of maximum likelihood: pick parameter valuesthat maximize the probability of your data Y

choose {βi} and σ that make the likelihoodP(Y |{βi}, σ) as large as possible

I Bayesian inference: put a probability distribution on the modelparameters and update it on the basis of what parameters bestexplain the data

Reviewing GLMs V

P({βi}, σ|Y ) =P(Y |{βi}, σ)

Prior︷︸︸︷P({βi}, σ)

Reviewing GLMs V

P({βi}, σ|Y ) =

Likelihood︷︸︸︷P(Y |{βi}, σ)

Reviewing GLMs VI: a simple example

I You are studying non-word RTs in a lexical-decision task

tpozt Word or non-word?houze Word or non-word?

I Non-words with different neighborhood densities∗ should havedifferent average RT ∗(= number of neighbors of edit-distance 1)

I A simple model: assume that neighborhood density has alinear effect on average RT, and trial-level noise is normallydistributed∗ ∗(n.b. wrong–RTs are skewed—but not horrible.)

I If xi is neighborhood density, our simple model is

RTi = α + βxi +

∼N(0,σ)︷︸︸︷εi

I We need to draw inferences about α, β, and σ

I e.g., “Does neighborhood density affects RT?”→ is β reliablynon-zero?

tpozt Word or non-word?

houze Word or non-word?

RTi = α + βxi +

∼N(0,σ)︷︸︸︷εi

RTi = α + βxi +

∼N(0,σ)︷︸︸︷εi

RTi = α + βxi +

∼N(0,σ)︷︸︸︷εi

RTi = α + βxi +

∼N(0,σ)︷︸︸︷εi

RTi = α + βxi +

∼N(0,σ)︷︸︸︷εi

RTi = α + βxi +

∼N(0,σ)︷︸︸︷εi

RTi = α + βxi +

∼N(0,σ)︷︸︸︷εi

Reviewing GLMs VII

I We’ll use length-4 nonword data from (Bicknell et al., 2008)(thanks!), such as:

Few neighbors Many neighborsgaty peme rixy lish pait yine

I There’s a wide range of neighborhood density:

Number of neighbors

0 2 4 6 8 10 12

Reviewing GLMs VII

I We’ll use length-4 nonword data from (Bicknell et al., 2008)(thanks!), such as:

Few neighbors Many neighborsgaty peme rixy lish pait yine

I There’s a wide range of neighborhood density:

Number of neighbors

0 2 4 6 8 10 12

Reviewing GLMs VIII: maximum-likelihood model fitting

RTi = α + βXi +

∼N(0,σ)︷︸︸︷εi

I Here’s a translation of our simple model into R:

RT ∼ 1 + x

I The noise is implicit in asking R to fit a linear model

I (We can omit the 1; R assumes it unless otherwise directed)

I Example of fitting via maximum likelihood: one subject fromBicknell et al. (2008)

> m <- glm(RT ~ neighbors, d, family="gaussian")

> summary(m)

[...]Estimate Std. Error t value Pr(>|t|)

(Intercept) 382.997 26.837 14.271 <2e-16 ***neighbors 4.828 6.553 0.737 0.466

> sqrt(summary(m)[["dispersion"]])

[1] 107.2248

Gaussian noise, implicit intercept

RTi = α + βXi +

∼N(0,σ)︷︸︸︷εi

RT ∼ 1 + xI The noise is implicit in asking R to fit a linear model

> summary(m)

[1] 107.2248

RTi = α + βXi +

∼N(0,σ)︷︸︸︷εi

RT ∼ 1 + xI The noise is implicit in asking R to fit a linear model

> summary(m)

[1] 107.2248

RTi = α + βXi +

∼N(0,σ)︷︸︸︷εi

RT ∼ xI The noise is implicit in asking R to fit a linear model

> summary(m)

[1] 107.2248

RTi = α + βXi +

∼N(0,σ)︷︸︸︷εi

> summary(m)

[1] 107.2248

RTi = α + βXi +

∼N(0,σ)︷︸︸︷εi

> summary(m)

[1] 107.2248

RTi = α + βXi +

∼N(0,σ)︷︸︸︷εi

> summary(m)

[1] 107.2248

RTi = α + βXi +

∼N(0,σ)︷︸︸︷εi

> summary(m)

[1] 107.2248

RTi = α + βXi +

∼N(0,σ)︷︸︸︷εi

> summary(m)

[1] 107.2248

Reviewing GLMs: maximum-likelihood fitting IX

Intercept 383.00neighbors 4.83σ̂ 107.22

I Estimated coefficients arewhat underlies “best linearfit” plots

●●

0 2 4 6 8 10 12

Neighbors

●●

0 2 4 6 8 10 12

Neighbors

●●

0 2 4 6 8 10 1230

0Neighbors

Multi-level Models

I But of course experiments don’t have just one participant

I Different participants may have different idiosyncratic behavior

I And items may have idiosyncratic properties too

I We’d like to take these into account, and perhaps investigatethem directly too.

I This is what multi-level (hierarchical, mixed-effects) models arefor!

Multi-level Models II

I Recap of the graphical picture of a single-level model:

yn· · ·

Predictors

Model parameters

Response

Multi-level Models III: the new graphical picture

b1 b2 bM· · ·

· · ·

x11 x1n1

y11 y1n1

· · ·

x21 x2n2

y21 y2n2

· · ·

xM1 xMnM

yM1 yMnM

· · ·

b1 b2 bM· · ·

· · ·

x11 x1n1

y11 y1n1

· · ·

x21 x2n2

y21 y2n2

· · ·

xM1 xMnM

yM1 yMnM

· · ·

b1 b2 bM· · ·

· · ·

x11 x1n1

y11 y1n1

· · ·

x21 x2n2

y21 y2n2

· · ·

xM1 xMnM

yM1 yMnM

· · ·

b1 b2 bM· · ·

· · ·

x11 x1n1

y11 y1n1

· · ·

x21 x2n2

y21 y2n2

· · ·

xM1 xMnM

yM1 yMnM

· · ·

b1 b2 bM· · ·

· · ·

x11 x1n1

y11 y1n1

· · ·

x21 x2n2

y21 y2n2

· · ·

xM1 xMnM

yM1 yMnM

· · ·

Multi-level Models IV

An example of a multi-level model:

I Back to your lexical-decision experiment

I Non-words with different neighborhood densities should havedifferent average decision time

I Additionally, different participants in your study may alsohave:

I different overall decision speedsI differing sensitivity to neighborhood density

I You want to draw inferences about all these things at thesame time

Multi-level Models V: Model construction

I Once again we’ll assume for simplicity that the number ofword neighbors x has a linear effect on mean reading time,and that trial-level noise is normally distributed∗

I Random effects, starting simple: let each participant i haveidiosyncratic differences in reading speed

RTij = α + βxij +

∼N(0,σb)︷︸︸︷bi +

Noise∼N(0,σε)︷︸︸︷εij

I In R, we’d write this relationship as

I Once again we can leave off the 1, and the noise term εij isimplicit

RTij = α + βxij +

∼N(0,σb)︷︸︸︷bi +

RTij = α + βxij +

∼N(0,σb)︷︸︸︷bi +

RT ∼ 1 + x + (1 | participant)

RTij = α + βxij +

∼N(0,σb)︷︸︸︷bi +

RT ∼ 1 + x + (1 | participant)

RTij = α + βxij +

∼N(0,σb)︷︸︸︷bi +

RT ∼ x + (1 | participant)

Multi-level Models VI: simulating data

RTij = α + βxij +

∼N(0,σb)︷︸︸︷bi +

I Simulation of trial-level data can be invaluable for achievingdeeper understanding of the data

## simulate some data

> sigma.b <- 125 # inter-subject variation larger than

> sigma.e <- 40 # intra-subject, inter-trial variation

> alpha <- 500

> beta <- 12

> M <- 6 # number of participants

> n <- 50 # trials per participant

> b <- rnorm(M, 0, sigma.b) # individual differences

> nneighbors <- rpois(M*n,3) + 1 # generate num. neighbors

> subj <- rep(1:M,n)

> RT <- alpha + beta * nneighbors + # simulate RTs!

b[subj] + rnorm(M*n,0,sigma.e) #

Multi-level Models VI: simulating data

RTij = α + βxij +

∼N(0,σb)︷︸︸︷bi +

I Simulation of trial-level data can be invaluable for achievingdeeper understanding of the data

## simulate some data

> sigma.b <- 125 # inter-subject variation larger than

> sigma.e <- 40 # intra-subject, inter-trial variation

> alpha <- 500

> beta <- 12

> M <- 6 # number of participants

> n <- 50 # trials per participant

> b <- rnorm(M, 0, sigma.b) # individual differences

> nneighbors <- rpois(M*n,3) + 1 # generate num. neighbors

> subj <- rep(1:M,n)

> RT <- alpha + beta * nneighbors + # simulate RTs!

b[subj] + rnorm(M*n,0,sigma.e) #

Multi-level Models VII: simulating data

●●

● ●

●●

●● ●

●●

● ●●

●●

●● ●

●●

●●●

●●

0 2 4 6 8

# Neighbors

I Participant-level clustering is easily visible

I This reflects the fact that (simulated) inter-participantvariation (125ms) is larger than (simulated) inter-trialvariation (40ms)

I And the (simulated) effects of neighborhood density are alsovisible

●●

● ●

●●

●● ●

●●

● ●●

●●

●● ●

●●

●●●

●●

0 2 4 6 8

# Neighbors

Subject intercepts (alpha + beta[subj])

I Participant-level clustering is easily visible

I This reflects the fact that (simulated) inter-participantvariation (125ms) is larger than (simulated) inter-trialvariation (40ms)

●●

● ●

●●

●● ●

●●

● ●●

●●

●● ●

●●

●●●

●●

0 2 4 6 8

# Neighbors

I Participant-level clustering is easily visibleI This reflects the fact that (simulated) inter-participant

variation (125ms) is larger than (simulated) inter-trialvariation (40ms)

●●

● ●

●●

●● ●

●●

● ●●

●●

●● ●

●●

●●●

●●

0 2 4 6 8

# Neighbors

I Participant-level clustering is easily visibleI This reflects the fact that (simulated) inter-participant

variation (125ms) is larger than (simulated) inter-trialvariation (40ms)

Statistical inference with multi-level models

RTij = α + βxij +

∼N(0,σb)︷︸︸︷bi +

I It’s insightful to create a model and simulate data (in thiscase to understand the type of model, but also more generallyto understand your data) . . .

I but, more commonly, we have data and we need to infer amodel

I Specifically, the “fixed-effect” parameters α, β, and σε, plus theparameter governing inter-subject variation, σb

I e.g., hypothesis tests about effects of neighborhood density:can we reliably infer that β is {non-zero, positive, . . . }?

I Fortunately, we can use the same principles as before to dothis:

I The principle of maximum likelihoodI Or Bayesian inference

RTij = α + βxij +

∼N(0,σb)︷︸︸︷bi +

RTij = α + βxij +

∼N(0,σb)︷︸︸︷bi +

Fitting a multi-level model using maximum likelihood

RTij = α + βxij +

∼N(0,σb)︷︸︸︷bi +

> m <- lmer(time ~ neighbors.centered +

(1 | participant),dat,REML=F)

> print(m,corr=F)

[...]Random effects:Groups Name Variance Std.Dev.participant (Intercept) 4924.9 70.177Residual 19240.5 138.710Number of obs: 1760, groups: participant, 44

Fixed effects:Estimate Std. Error t value

(Intercept) 583.787 11.082 52.68neighbors.centered 8.986 1.278 7.03

σ̂ε

RTij = α + βxij +

∼N(0,σb)︷︸︸︷bi +

> print(m,corr=F)

σ̂ε

RTij = α + βxij +

∼N(0,σb)︷︸︸︷bi +

> print(m,corr=F)

σ̂ε

RTij = α + βxij +

∼N(0,σb)︷︸︸︷bi +

> print(m,corr=F)

σ̂ε

RTij = α + βxij +

∼N(0,σb)︷︸︸︷bi +

> print(m,corr=F)

σ̂ε

Interpreting parameter estimates

Intercept 583.79neighbors.centered 8.99σ̂b 70.18σ̂ε 138.7

I The fixed effects are interpreted just as in a traditionalsingle-level model:

I The “average” RT for a non-word in this study is 583.79msI Every extra neighbor increases “average” RT by 8.99ms

I Inter-trial variability σε also has the same interpretationI Inter-trial variability for a given participant is Gaussian,

centered around the participant+word-specific mean withstandard deviation 138.7ms

I Inter-participant variability σb is what’s new:I Variability in average RT in the population from which the

participants were drawn has standard deviation 70.18ms

I The “average” RT for a non-word in this study is 583.79ms

I Every extra neighbor increases “average” RT by 8.99ms

I Inter-trial variability σε also has the same interpretation

I Inter-trial variability for a given participant is Gaussian,centered around the participant+word-specific mean withstandard deviation 138.7ms

I Inter-participant variability σb is what’s new:

I Variability in average RT in the population from which theparticipants were drawn has standard deviation 70.18ms

Inferences about cluster-level parameters

RTij = α + βxij +

∼N(0,σb)︷︸︸︷bi +

I What about the participants’ idiosyncracies themselves—thebi?

I We can also draw inferences about these—you may haveheard about BLUPs

I To understand these: committing to fixed-effect andrandom-effect parameter estimates determines a conditionalprobability distribution on participant-specific effects:

P(bi |α̂, β̂, σ̂b, σ̂ε,X)

I The BLUPs are the conditional modes of the bi s—the choicesthat maximize the above probability

RTij = α + βxij +

∼N(0,σb)︷︸︸︷bi +

RTij = α + βxij +

∼N(0,σb)︷︸︸︷bi +

RTij = α + βxij +

∼N(0,σb)︷︸︸︷bi +

Inferences about cluster-level parameters II

I The BLUP participant-specific “average” RTs for this datasetare black lines on the base of this graph

400 500 600 700 800

Millseconds

I The solid line is a guess at their distribution

I The dotted line is the distribution predicted by the model forthe population from which the participants are drawn

I Reasonably close correspondence

400 500 600 700 800

Millseconds

I The solid line is a guess at their distributionI The dotted line is the distribution predicted by the model for

the population from which the participants are drawn

I Reasonably close correspondence

400 500 600 700 800

Millseconds

I The solid line is a guess at their distributionI The dotted line is the distribution predicted by the model for

the population from which the participants are drawnI Reasonably close correspondence

Inference about cluster-level parameters III

I Participants may also have idiosyncratic sensitivities toneighborhood density

I Incorporate by adding cluster-level slopes into the model:

RTij = α + βxij +

∼N(0,Σb)︷︸︸︷b1i + b2i xij +

I In R (once again we can omit the 1’s):

RT ∼ 1 + x + (1 + x | participant)

> lmer(RT ~ neighbors.centered +

(neighbors.centered | participant), dat,REML=F)

[...]Random effects:Groups Name Variance Std.Dev. Corrparticipant (Intercept) 4928.625 70.2042

neighbors.centered 19.421 4.4069 -0.307Residual 19107.143 138.2286

These three numbers jointlycharacterize Σ̂b

RTij = α + βxij +

∼N(0,Σb)︷︸︸︷b1i + b2i xij +

RTij = α + βxij +

∼N(0,Σb)︷︸︸︷b1i + b2i xij +

RTij = α + βxij +

∼N(0,Σb)︷︸︸︷b1i + b2i xij +

RTij = α + βxij +

∼N(0,Σb)︷︸︸︷b1i + b2i xij +

Inference about cluster-level parameters IV

450 500 550 600 650 700 750

Intercept

●●

● ●

●●

Participants

I Correlation visible in participant-specific BLUPsI Participants who were faster overall also tend to be more

affected by neighborhood density

Σ̂ =

(70.20 −0.3097−0.3097 4.41

450 500 550 600 650 700 750

Intercept

●●

● ●

●●

Participants

Σ̂ =

(70.20 −0.3097−0.3097 4.41

450 500 550 600 650 700 750

Intercept

●●

● ●

●●

Participants

Σ̂ =

(70.20 −0.3097−0.3097 4.41

450 500 550 600 650 700 750

Intercept

●●

● ●

●●

Participants

I Correlation visible in participant-specific BLUPs

I Participants who were faster overall also tend to be moreaffected by neighborhood density

Σ̂ =

(70.20 −0.3097−0.3097 4.41

450 500 550 600 650 700 750

Intercept

●●

● ●

●●

Participants

Σ̂ =

(70.20 −0.3097−0.3097 4.41

Advantages of Multilevel Approach I

I There are several respects in which multi-level models gobeyond AN(C)OVA:

1. They handle imbalanced datasets quite well in contrast toAN(C)OVA (see e.g. Dixon, 2008; Baayen et al., 2008).

2. They handle empty cells3. Group-level means are fitted considering the number of

observations in the group in a rational way (cf. shrinkage)4. Multilevel models have more power.5. Like for any regression model, they make it easier to

investigate effect direction and even shape,

2. They handle empty cells

3. Group-level means are fitted considering the number ofobservations in the group in a rational way (cf. shrinkage)

4. Multilevel models have more power.5. Like for any regression model, they make it easier to

observations in the group in a rational way (cf. shrinkage)

4. Multilevel models have more power.5. Like for any regression model, they make it easier to

observations in the group in a rational way (cf. shrinkage)4. Multilevel models have more power.

5. Like for any regression model, they make it easier toinvestigate effect direction and even shape,

Advantages of Multilevel Approach II

6. You can use non-linear linking functions (e.g., logit models forbinary-choice data)

7. You can cross cluster-level (random) effects

I Every trial belongs to both a participant cluster and an itemcluster (and could belong to even more clusters, e.g. socialvariables, experimental groups)

I You can build a single unified model for inferences from yourdata

I ANOVA requires separate by-participants and by-itemsanalyses (quasi-F ′ is quite conservative)

The logit link function for categorical data

I Much psycholinguistic data is categorical rather thancontinuous:

I Yes/no answers to alternations questionsI Speaker choice: (realized (that) her goals were unattainable)I Cloze continuations, and so forth. . .

I Linear models inappropriate; they predict continuous values

I We can stay within the multi-level generalized linear modelsframework but use different link functions and noisedistributions to analyze categorical data

I e.g., the logit model (Agresti, 2002; Jaeger, 2008)

ηij = α + βXij + bi (linear predictor)

ηij = logµij

1− µij(link function)

P(Y = y ;µij) = µij (binomial noise distribution)

ηij = logµij

The logit link function for categorical data II

I We’ll look at the effect of neighborhood density on correctidentification as non-word in Bicknell et al. (2008)

I Assuming that any effect is linear in log-odds space (seeJaeger (2008) for discussion)

> lmer(correct ~ neighbors.centered + (1 | participant),

dat, family="binomial")

[...]Random effects:Groups Name Variance Std.Dev.participant (Intercept) 0.9243 0.9614Number of obs: 1760, groups: participant, 44

Fixed effects:Estimate Std. Error z value Pr(>|z|)

(Intercept) 3.16310 0.18998 16.649 < 2e-16 ***neighbors.centered -0.18207 0.03483 -5.228 1.72e-07 ***

α̂—participants usually right

σ̂b (note there is no

σ̂ε for logit models)

β̂—effect small compared with inter-subject variation

Summary

I Multi-level models may seem strange and foreign

I But all you really need to understand them is three basicthings

I Generalized linear modelsI The principle of maximum likelihoodI Bayesian inference

Reviewing GLMs III: Bayesian model fitting

P({βi}, σ|Y ) =

I Alternative tomaximum-likelihood:Bayesian model fitting

I Simple (uniform, non-informative)

prior: all combinations of(α, β, σ) equally probable

I Multiply by likelihood →posterior probabilitydistribution over (α, β, σ)

I Bound the region of highestposterior probabilitycontaining 95% ofprobability density → HPDconfidence region

0.00 0.06

P(ββ|Y)

pMCMC = 0.46

200 300 400 500 600

Intercept

<αα̂,ββ̂>

200 300 400 500 600

Intercept

<αα̂,ββ̂>

200 300 400 500 600

Intercept

<αα̂,ββ̂>

200 300 400 500 600

P(αα

I pMCMC (Baayen et al., 2008)is 1 minus the largestpossible symmetricconfidence interval wholly onone side of 0

P({βi}, σ|Y ) =

0.00 0.06

P(ββ|Y)

pMCMC = 0.46

200 300 400 500 600

Intercept

<αα̂,ββ̂>

200 300 400 500 600

Intercept

<αα̂,ββ̂>

200 300 400 500 600

Intercept

<αα̂,ββ̂>

200 300 400 500 600

P(αα

P({βi}, σ|Y ) =

0.00 0.06

P(ββ|Y)

pMCMC = 0.46

200 300 400 500 600

Intercept

<αα̂,ββ̂>

200 300 400 500 600

Intercept

<αα̂,ββ̂>

200 300 400 500 600

Intercept

<αα̂,ββ̂>

200 300 400 500 600

P(αα

P({βi}, σ|Y ) =

0.00 0.06

P(ββ|Y)

pMCMC = 0.46

200 300 400 500 600

Intercept

<αα̂,ββ̂>

200 300 400 500 600

Intercept

<αα̂,ββ̂>

200 300 400 500 600

Intercept

<αα̂,ββ̂>

200 300 400 500 600

P(αα

P({βi}, σ|Y ) =

0.00 0.06

P(ββ|Y)

pMCMC = 0.46

200 300 400 500 600

Intercept

<αα̂,ββ̂>

200 300 400 500 600

Intercept

<αα̂,ββ̂>

200 300 400 500 600

Intercept

<αα̂,ββ̂>

200 300 400 500 600

P(αα

P({βi}, σ|Y ) =

0.00 0.06

P(ββ|Y)

pMCMC = 0.46

200 300 400 500 600

Intercept

<αα̂,ββ̂>

200 300 400 500 600

Intercept

<αα̂,ββ̂>

200 300 400 500 600

Intercept

<αα̂,ββ̂>

200 300 400 500 6000.

P(αα

P({βi}, σ|Y ) =

0.00 0.06

P(ββ|Y)

pMCMC = 0.46

200 300 400 500 600

Intercept

<αα̂,ββ̂>

200 300 400 500 600

Intercept

<αα̂,ββ̂>

200 300 400 500 600

Intercept

<αα̂,ββ̂>

200 300 400 500 6000.

P(αα

Bayesian inference for multilevel models

P({βi}, σb, σε|Y ) =

Likelihood︷︸︸︷P(Y |{βi}, σb, σε)

Prior︷︸︸︷P({βi}, σb, σε)

I We can also use Bayes’rule to draw inferencesabout fixed effects

I Computationally morechallenging than withsingle-level regression;Markov-chain MonteCarlo (MCMC) samplingtechniques allow us toapproximate it

560 580 600 620

Intercept

560 580 600 620

P(αα

P(αα|Y)

560 580 600 620

Intercept

560 580 600 620

P(αα

P(αα|Y)

560 580 600 620

Intercept

560 580 600 620

P(αα

P(αα|Y)

References I

Agresti, A. (2002). Categorical Data Analysis. Wiley, second edition.

Baayen, R. H., Davidson, D. J., and Bates, D. M. (2008). Mixed-effectsmodeling with crossed random effects for subjects and items. Journalof Memory and Language. In press.

Bicknell, K., Elman, J. L., Hare, M., McRae, K., and Kutas, M. (2008).Online expectations for verbal arguments conditional on eventknowledge. In Proceedings of the 30th Annual Conference of theCognitive Science Society, pages 2220–2225.

Jaeger, T. F. (2008). Categorical data analysis: Away from ANOVAs(transformation or not) and towards logit mixed models. Journal ofMemory and Language. In press.

A Brief and Friendly Introduction to Mixed-Effects Models in … · 2009. 4. 30. · A Brief and...

Documents

Transcript of A Brief and Friendly Introduction to Mixed-Effects Models in … · 2009. 4. 30. · A Brief and...