A Brief and Friendly Introduction to Mixed-Effects Models in … · 2009. 4. 30. · A Brief and...
Transcript of A Brief and Friendly Introduction to Mixed-Effects Models in … · 2009. 4. 30. · A Brief and...
A Brief and Friendly Introduction toMixed-Effects Models in Psycholinguistics
θ
Σb
b1 b2 bM· · ·
· · ·
x11 x1n1
y11 y1n1
· · ·
· · ·
x21 x2n2
y21 y2n2
· · ·
· · ·
xM1 xMnM
yM1 yMnM
· · ·
· · ·
Cluster-specificparameters
(“random effects”)
Shared parameters(“fixed effects”)
Parameters governinginter-cluster variability
Roger Levy (modified by Florian Jaeger)
UC San DiegoDepartment of Linguistics
25 March 2009
Goals of this talk
I Briefly review generalized linear models and how to use them
I Give a precise description of multi-level models
I Show how to draw inferences using a multi-level model (fittingthe model)
I Discuss how to interpret model parameter estimatesI Fixed effectsI Random effects
I Briefly discuss multi-level logit models
Reviewing generalized linear models I
Goal: model the effects of predictors (independent variables) X ona response (dependent variable) Y .
The picture:
θ
x1
y1
x2
y2
xn
yn· · ·
Predictors
Model parameters
Response
Reviewing generalized linear models I
Goal: model the effects of predictors (independent variables) X ona response (dependent variable) Y .
The picture:
θ
x1
y1
x2
y2
xn
yn· · ·
Predictors
Model parameters
Response
Reviewing generalized linear models I
Goal: model the effects of predictors (independent variables) X ona response (dependent variable) Y .
The picture:
θ
x1
y1
x2
y2
xn
yn· · ·
Predictors
Model parameters
Response
Reviewing generalized linear models I
Goal: model the effects of predictors (independent variables) X ona response (dependent variable) Y .
The picture:
θ
x1
y1
x2
y2
xn
yn· · ·
Predictors
Model parameters
Response
Reviewing GLMs II
Assumptions of the generalized linear model (GLM):
1. Predictors {Xi} influence Y through the mediation of a linearpredictor η;
2. η is a linear combination of the {Xi}:
η = α + β1X1 + · · ·+ βNXN (linear predictor)
3. η determines the predicted mean µ of Y
η = l(µ) (link function)
4. There is some noise distribution of Y around the predictedmean µ of Y :
P(Y = y ;µ)
Reviewing GLMs II
Assumptions of the generalized linear model (GLM):
1. Predictors {Xi} influence Y through the mediation of a linearpredictor η;
2. η is a linear combination of the {Xi}:
η = α + β1X1 + · · ·+ βNXN (linear predictor)
3. η determines the predicted mean µ of Y
η = l(µ) (link function)
4. There is some noise distribution of Y around the predictedmean µ of Y :
P(Y = y ;µ)
Reviewing GLMs II
Assumptions of the generalized linear model (GLM):
1. Predictors {Xi} influence Y through the mediation of a linearpredictor η;
2. η is a linear combination of the {Xi}:
η = α + β1X1 + · · ·+ βNXN (linear predictor)
3. η determines the predicted mean µ of Y
η = l(µ) (link function)
4. There is some noise distribution of Y around the predictedmean µ of Y :
P(Y = y ;µ)
Reviewing GLMs II
Assumptions of the generalized linear model (GLM):
1. Predictors {Xi} influence Y through the mediation of a linearpredictor η;
2. η is a linear combination of the {Xi}:
η = α + β1X1 + · · ·+ βNXN (linear predictor)
3. η determines the predicted mean µ of Y
η = l(µ) (link function)
4. There is some noise distribution of Y around the predictedmean µ of Y :
P(Y = y ;µ)
Reviewing GLMs II
Assumptions of the generalized linear model (GLM):
1. Predictors {Xi} influence Y through the mediation of a linearpredictor η;
2. η is a linear combination of the {Xi}:
η = α + β1X1 + · · ·+ βNXN (linear predictor)
3. η determines the predicted mean µ of Y
η = l(µ) (link function)
4. There is some noise distribution of Y around the predictedmean µ of Y :
P(Y = y ;µ)
Reviewing GLMs III
Linear regression, which underlies ANOVA, is a kind of generalizedlinear model.
I The predicted mean is just the linear predictor:
η = l(µ) = µ
I Noise is normally (=Gaussian) distributed around 0 withstandard deviation σ:
ε ∼ N(0, σ)
I This gives us the traditional linear regression equation:
Y =
Predicted Mean µ = η︷ ︸︸ ︷α + β1X1 + · · ·+ βnXn +
Noise∼N(0,σ)︷︸︸︷ε
Reviewing GLMs III
Linear regression, which underlies ANOVA, is a kind of generalizedlinear model.
I The predicted mean is just the linear predictor:
η = l(µ) = µ
I Noise is normally (=Gaussian) distributed around 0 withstandard deviation σ:
ε ∼ N(0, σ)
I This gives us the traditional linear regression equation:
Y =
Predicted Mean µ = η︷ ︸︸ ︷α + β1X1 + · · ·+ βnXn +
Noise∼N(0,σ)︷︸︸︷ε
Reviewing GLMs III
Linear regression, which underlies ANOVA, is a kind of generalizedlinear model.
I The predicted mean is just the linear predictor:
η = l(µ) = µ
I Noise is normally (=Gaussian) distributed around 0 withstandard deviation σ:
ε ∼ N(0, σ)
I This gives us the traditional linear regression equation:
Y =
Predicted Mean µ = η︷ ︸︸ ︷α + β1X1 + · · ·+ βnXn +
Noise∼N(0,σ)︷︸︸︷ε
Reviewing GLMs III
Linear regression, which underlies ANOVA, is a kind of generalizedlinear model.
I The predicted mean is just the linear predictor:
η = l(µ) = µ
I Noise is normally (=Gaussian) distributed around 0 withstandard deviation σ:
ε ∼ N(0, σ)
I This gives us the traditional linear regression equation:
Y =
Predicted Mean µ = η︷ ︸︸ ︷α + β1X1 + · · ·+ βnXn +
Noise∼N(0,σ)︷︸︸︷ε
Reviewing GLMs IV
Logistic regression, too, is a kind of generalized linear model.
I The linear predictor:
η = α + β1X1 + · · ·+ βnXn
I The link function g is the logit transform:
E(Y) = p = g−1(η)⇔
g(p) = lnp
1− p= η = α + β1X1 + · · ·+ βnXn (1)
I The distribution around the mean is taken to be binomial.
Reviewing GLMs IV
Logistic regression, too, is a kind of generalized linear model.
I The linear predictor:
η = α + β1X1 + · · ·+ βnXn
I The link function g is the logit transform:
E(Y) = p = g−1(η)⇔
g(p) = lnp
1− p= η = α + β1X1 + · · ·+ βnXn (1)
I The distribution around the mean is taken to be binomial.
Reviewing GLMs IV
Logistic regression, too, is a kind of generalized linear model.
I The linear predictor:
η = α + β1X1 + · · ·+ βnXn
I The link function g is the logit transform:
E(Y) = p = g−1(η)⇔
g(p) = lnp
1− p= η = α + β1X1 + · · ·+ βnXn (1)
I The distribution around the mean is taken to be binomial.
Reviewing GLMs IV
Logistic regression, too, is a kind of generalized linear model.
I The linear predictor:
η = α + β1X1 + · · ·+ βnXn
I The link function g is the logit transform:
E(Y) = p = g−1(η)⇔
g(p) = lnp
1− p= η = α + β1X1 + · · ·+ βnXn (1)
I The distribution around the mean is taken to be binomial.
Reviewing GLMs V
Y =
Predicted Mean︷ ︸︸ ︷α + β1X1 + · · · + βnXn +
Noise∼N(0,σ)︷︸︸︷ε
I How do we fit the parameters βi and σ (choose modelcoefficients)?
I There are two major approaches (deeply related, yet different)in widespread use:
I The principle of maximum likelihood: pick parameter valuesthat maximize the probability of your data Y
choose {βi} and σ that make the likelihoodP(Y |{βi}, σ) as large as possible
I Bayesian inference: put a probability distribution on the modelparameters and update it on the basis of what parameters bestexplain the data
Reviewing GLMs V
Y =
Predicted Mean︷ ︸︸ ︷α + β1X1 + · · · + βnXn +
Noise∼N(0,σ)︷︸︸︷ε
I How do we fit the parameters βi and σ (choose modelcoefficients)?
I There are two major approaches (deeply related, yet different)in widespread use:
I The principle of maximum likelihood: pick parameter valuesthat maximize the probability of your data Y
choose {βi} and σ that make the likelihoodP(Y |{βi}, σ) as large as possible
I Bayesian inference: put a probability distribution on the modelparameters and update it on the basis of what parameters bestexplain the data
Reviewing GLMs V
Y =
Predicted Mean︷ ︸︸ ︷α + β1X1 + · · · + βnXn +
Noise∼N(0,σ)︷︸︸︷ε
I How do we fit the parameters βi and σ (choose modelcoefficients)?
I There are two major approaches (deeply related, yet different)in widespread use:
I The principle of maximum likelihood: pick parameter valuesthat maximize the probability of your data Y
choose {βi} and σ that make the likelihoodP(Y |{βi}, σ) as large as possible
I Bayesian inference: put a probability distribution on the modelparameters and update it on the basis of what parameters bestexplain the data
Reviewing GLMs V
Y =
Predicted Mean︷ ︸︸ ︷α + β1X1 + · · · + βnXn +
Noise∼N(0,σ)︷︸︸︷ε
I How do we fit the parameters βi and σ (choose modelcoefficients)?
I There are two major approaches (deeply related, yet different)in widespread use:
I The principle of maximum likelihood: pick parameter valuesthat maximize the probability of your data Y
choose {βi} and σ that make the likelihoodP(Y |{βi}, σ) as large as possible
I Bayesian inference: put a probability distribution on the modelparameters and update it on the basis of what parameters bestexplain the data
P({βi}, σ|Y ) =P(Y |{βi}, σ)
Prior︷ ︸︸ ︷P({βi}, σ)
P(Y )
Reviewing GLMs V
Y =
Predicted Mean︷ ︸︸ ︷α + β1X1 + · · · + βnXn +
Noise∼N(0,σ)︷︸︸︷ε
I How do we fit the parameters βi and σ (choose modelcoefficients)?
I There are two major approaches (deeply related, yet different)in widespread use:
I The principle of maximum likelihood: pick parameter valuesthat maximize the probability of your data Y
choose {βi} and σ that make the likelihoodP(Y |{βi}, σ) as large as possible
I Bayesian inference: put a probability distribution on the modelparameters and update it on the basis of what parameters bestexplain the data
P({βi}, σ|Y ) =
Likelihood︷ ︸︸ ︷P(Y |{βi}, σ)
Prior︷ ︸︸ ︷P({βi}, σ)
P(Y )
Reviewing GLMs VI: a simple example
I You are studying non-word RTs in a lexical-decision task
tpozt Word or non-word?houze Word or non-word?
I Non-words with different neighborhood densities∗ should havedifferent average RT ∗(= number of neighbors of edit-distance 1)
I A simple model: assume that neighborhood density has alinear effect on average RT, and trial-level noise is normallydistributed∗ ∗(n.b. wrong–RTs are skewed—but not horrible.)
I If xi is neighborhood density, our simple model is
RTi = α + βxi +
∼N(0,σ)︷︸︸︷εi
I We need to draw inferences about α, β, and σ
I e.g., “Does neighborhood density affects RT?”→ is β reliablynon-zero?
Reviewing GLMs VI: a simple example
I You are studying non-word RTs in a lexical-decision task
tpozt Word or non-word?
houze Word or non-word?
I Non-words with different neighborhood densities∗ should havedifferent average RT ∗(= number of neighbors of edit-distance 1)
I A simple model: assume that neighborhood density has alinear effect on average RT, and trial-level noise is normallydistributed∗ ∗(n.b. wrong–RTs are skewed—but not horrible.)
I If xi is neighborhood density, our simple model is
RTi = α + βxi +
∼N(0,σ)︷︸︸︷εi
I We need to draw inferences about α, β, and σ
I e.g., “Does neighborhood density affects RT?”→ is β reliablynon-zero?
Reviewing GLMs VI: a simple example
I You are studying non-word RTs in a lexical-decision task
tpozt Word or non-word?houze Word or non-word?
I Non-words with different neighborhood densities∗ should havedifferent average RT ∗(= number of neighbors of edit-distance 1)
I A simple model: assume that neighborhood density has alinear effect on average RT, and trial-level noise is normallydistributed∗ ∗(n.b. wrong–RTs are skewed—but not horrible.)
I If xi is neighborhood density, our simple model is
RTi = α + βxi +
∼N(0,σ)︷︸︸︷εi
I We need to draw inferences about α, β, and σ
I e.g., “Does neighborhood density affects RT?”→ is β reliablynon-zero?
Reviewing GLMs VI: a simple example
I You are studying non-word RTs in a lexical-decision task
tpozt Word or non-word?houze Word or non-word?
I Non-words with different neighborhood densities∗ should havedifferent average RT ∗(= number of neighbors of edit-distance 1)
I A simple model: assume that neighborhood density has alinear effect on average RT, and trial-level noise is normallydistributed∗ ∗(n.b. wrong–RTs are skewed—but not horrible.)
I If xi is neighborhood density, our simple model is
RTi = α + βxi +
∼N(0,σ)︷︸︸︷εi
I We need to draw inferences about α, β, and σ
I e.g., “Does neighborhood density affects RT?”→ is β reliablynon-zero?
Reviewing GLMs VI: a simple example
I You are studying non-word RTs in a lexical-decision task
tpozt Word or non-word?houze Word or non-word?
I Non-words with different neighborhood densities∗ should havedifferent average RT ∗(= number of neighbors of edit-distance 1)
I A simple model: assume that neighborhood density has alinear effect on average RT, and trial-level noise is normallydistributed∗ ∗(n.b. wrong–RTs are skewed—but not horrible.)
I If xi is neighborhood density, our simple model is
RTi = α + βxi +
∼N(0,σ)︷︸︸︷εi
I We need to draw inferences about α, β, and σ
I e.g., “Does neighborhood density affects RT?”→ is β reliablynon-zero?
Reviewing GLMs VI: a simple example
I You are studying non-word RTs in a lexical-decision task
tpozt Word or non-word?houze Word or non-word?
I Non-words with different neighborhood densities∗ should havedifferent average RT ∗(= number of neighbors of edit-distance 1)
I A simple model: assume that neighborhood density has alinear effect on average RT, and trial-level noise is normallydistributed∗ ∗(n.b. wrong–RTs are skewed—but not horrible.)
I If xi is neighborhood density, our simple model is
RTi = α + βxi +
∼N(0,σ)︷︸︸︷εi
I We need to draw inferences about α, β, and σ
I e.g., “Does neighborhood density affects RT?”→ is β reliablynon-zero?
Reviewing GLMs VI: a simple example
I You are studying non-word RTs in a lexical-decision task
tpozt Word or non-word?houze Word or non-word?
I Non-words with different neighborhood densities∗ should havedifferent average RT ∗(= number of neighbors of edit-distance 1)
I A simple model: assume that neighborhood density has alinear effect on average RT, and trial-level noise is normallydistributed∗ ∗(n.b. wrong–RTs are skewed—but not horrible.)
I If xi is neighborhood density, our simple model is
RTi = α + βxi +
∼N(0,σ)︷︸︸︷εi
I We need to draw inferences about α, β, and σ
I e.g., “Does neighborhood density affects RT?”→ is β reliablynon-zero?
Reviewing GLMs VI: a simple example
I You are studying non-word RTs in a lexical-decision task
tpozt Word or non-word?houze Word or non-word?
I Non-words with different neighborhood densities∗ should havedifferent average RT ∗(= number of neighbors of edit-distance 1)
I A simple model: assume that neighborhood density has alinear effect on average RT, and trial-level noise is normallydistributed∗ ∗(n.b. wrong–RTs are skewed—but not horrible.)
I If xi is neighborhood density, our simple model is
RTi = α + βxi +
∼N(0,σ)︷︸︸︷εi
I We need to draw inferences about α, β, and σ
I e.g., “Does neighborhood density affects RT?”→ is β reliablynon-zero?
Reviewing GLMs VII
I We’ll use length-4 nonword data from (Bicknell et al., 2008)(thanks!), such as:
Few neighbors Many neighborsgaty peme rixy lish pait yine
I There’s a wide range of neighborhood density:
Number of neighbors
Fre
quen
cy
0 2 4 6 8 10 12
02
46
8
pait
Reviewing GLMs VII
I We’ll use length-4 nonword data from (Bicknell et al., 2008)(thanks!), such as:
Few neighbors Many neighborsgaty peme rixy lish pait yine
I There’s a wide range of neighborhood density:
Number of neighbors
Fre
quen
cy
0 2 4 6 8 10 12
02
46
8
pait
Reviewing GLMs VIII: maximum-likelihood model fitting
RTi = α + βXi +
∼N(0,σ)︷︸︸︷εi
I Here’s a translation of our simple model into R:
RT ∼ 1 + x
I The noise is implicit in asking R to fit a linear model
I (We can omit the 1; R assumes it unless otherwise directed)
I Example of fitting via maximum likelihood: one subject fromBicknell et al. (2008)
> m <- glm(RT ~ neighbors, d, family="gaussian")
> summary(m)
[...]Estimate Std. Error t value Pr(>|t|)
(Intercept) 382.997 26.837 14.271 <2e-16 ***neighbors 4.828 6.553 0.737 0.466
> sqrt(summary(m)[["dispersion"]])
[1] 107.2248
Gaussian noise, implicit intercept
α̂
β̂
σ̂
Reviewing GLMs VIII: maximum-likelihood model fitting
RTi = α + βXi +
∼N(0,σ)︷︸︸︷εi
I Here’s a translation of our simple model into R:
RT ∼ 1 + xI The noise is implicit in asking R to fit a linear model
I (We can omit the 1; R assumes it unless otherwise directed)
I Example of fitting via maximum likelihood: one subject fromBicknell et al. (2008)
> m <- glm(RT ~ neighbors, d, family="gaussian")
> summary(m)
[...]Estimate Std. Error t value Pr(>|t|)
(Intercept) 382.997 26.837 14.271 <2e-16 ***neighbors 4.828 6.553 0.737 0.466
> sqrt(summary(m)[["dispersion"]])
[1] 107.2248
Gaussian noise, implicit intercept
α̂
β̂
σ̂
Reviewing GLMs VIII: maximum-likelihood model fitting
RTi = α + βXi +
∼N(0,σ)︷︸︸︷εi
I Here’s a translation of our simple model into R:
RT ∼ 1 + xI The noise is implicit in asking R to fit a linear model
I (We can omit the 1; R assumes it unless otherwise directed)
I Example of fitting via maximum likelihood: one subject fromBicknell et al. (2008)
> m <- glm(RT ~ neighbors, d, family="gaussian")
> summary(m)
[...]Estimate Std. Error t value Pr(>|t|)
(Intercept) 382.997 26.837 14.271 <2e-16 ***neighbors 4.828 6.553 0.737 0.466
> sqrt(summary(m)[["dispersion"]])
[1] 107.2248
Gaussian noise, implicit intercept
α̂
β̂
σ̂
Reviewing GLMs VIII: maximum-likelihood model fitting
RTi = α + βXi +
∼N(0,σ)︷︸︸︷εi
I Here’s a translation of our simple model into R:
RT ∼ xI The noise is implicit in asking R to fit a linear model
I (We can omit the 1; R assumes it unless otherwise directed)
I Example of fitting via maximum likelihood: one subject fromBicknell et al. (2008)
> m <- glm(RT ~ neighbors, d, family="gaussian")
> summary(m)
[...]Estimate Std. Error t value Pr(>|t|)
(Intercept) 382.997 26.837 14.271 <2e-16 ***neighbors 4.828 6.553 0.737 0.466
> sqrt(summary(m)[["dispersion"]])
[1] 107.2248
Gaussian noise, implicit intercept
α̂
β̂
σ̂
Reviewing GLMs VIII: maximum-likelihood model fitting
RTi = α + βXi +
∼N(0,σ)︷︸︸︷εi
I Here’s a translation of our simple model into R:
RT ∼ xI The noise is implicit in asking R to fit a linear model
I (We can omit the 1; R assumes it unless otherwise directed)
I Example of fitting via maximum likelihood: one subject fromBicknell et al. (2008)
> m <- glm(RT ~ neighbors, d, family="gaussian")
> summary(m)
[...]Estimate Std. Error t value Pr(>|t|)
(Intercept) 382.997 26.837 14.271 <2e-16 ***neighbors 4.828 6.553 0.737 0.466
> sqrt(summary(m)[["dispersion"]])
[1] 107.2248
Gaussian noise, implicit intercept
α̂
β̂
σ̂
Reviewing GLMs VIII: maximum-likelihood model fitting
RTi = α + βXi +
∼N(0,σ)︷︸︸︷εi
I Here’s a translation of our simple model into R:
RT ∼ xI The noise is implicit in asking R to fit a linear model
I (We can omit the 1; R assumes it unless otherwise directed)
I Example of fitting via maximum likelihood: one subject fromBicknell et al. (2008)
> m <- glm(RT ~ neighbors, d, family="gaussian")
> summary(m)
[...]Estimate Std. Error t value Pr(>|t|)
(Intercept) 382.997 26.837 14.271 <2e-16 ***neighbors 4.828 6.553 0.737 0.466
> sqrt(summary(m)[["dispersion"]])
[1] 107.2248
Gaussian noise, implicit intercept
α̂
β̂
σ̂
Reviewing GLMs VIII: maximum-likelihood model fitting
RTi = α + βXi +
∼N(0,σ)︷︸︸︷εi
I Here’s a translation of our simple model into R:
RT ∼ xI The noise is implicit in asking R to fit a linear model
I (We can omit the 1; R assumes it unless otherwise directed)
I Example of fitting via maximum likelihood: one subject fromBicknell et al. (2008)
> m <- glm(RT ~ neighbors, d, family="gaussian")
> summary(m)
[...]Estimate Std. Error t value Pr(>|t|)
(Intercept) 382.997 26.837 14.271 <2e-16 ***neighbors 4.828 6.553 0.737 0.466
> sqrt(summary(m)[["dispersion"]])
[1] 107.2248
Gaussian noise, implicit intercept
α̂
β̂
σ̂
Reviewing GLMs VIII: maximum-likelihood model fitting
RTi = α + βXi +
∼N(0,σ)︷︸︸︷εi
I Here’s a translation of our simple model into R:
RT ∼ xI The noise is implicit in asking R to fit a linear model
I (We can omit the 1; R assumes it unless otherwise directed)
I Example of fitting via maximum likelihood: one subject fromBicknell et al. (2008)
> m <- glm(RT ~ neighbors, d, family="gaussian")
> summary(m)
[...]Estimate Std. Error t value Pr(>|t|)
(Intercept) 382.997 26.837 14.271 <2e-16 ***neighbors 4.828 6.553 0.737 0.466
> sqrt(summary(m)[["dispersion"]])
[1] 107.2248
Gaussian noise, implicit intercept
α̂
β̂
σ̂
Reviewing GLMs VIII: maximum-likelihood model fitting
RTi = α + βXi +
∼N(0,σ)︷︸︸︷εi
I Here’s a translation of our simple model into R:
RT ∼ xI The noise is implicit in asking R to fit a linear model
I (We can omit the 1; R assumes it unless otherwise directed)
I Example of fitting via maximum likelihood: one subject fromBicknell et al. (2008)
> m <- glm(RT ~ neighbors, d, family="gaussian")
> summary(m)
[...]Estimate Std. Error t value Pr(>|t|)
(Intercept) 382.997 26.837 14.271 <2e-16 ***neighbors 4.828 6.553 0.737 0.466
> sqrt(summary(m)[["dispersion"]])
[1] 107.2248
Gaussian noise, implicit intercept
α̂
β̂
σ̂
Reviewing GLMs: maximum-likelihood fitting IX
Intercept 383.00neighbors 4.83σ̂ 107.22
I Estimated coefficients arewhat underlies “best linearfit” plots
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
0 2 4 6 8 10 12
300
400
500
600
700
Neighbors
RT
(m
s)
Reviewing GLMs: maximum-likelihood fitting IX
Intercept 383.00neighbors 4.83σ̂ 107.22
I Estimated coefficients arewhat underlies “best linearfit” plots
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
0 2 4 6 8 10 12
300
400
500
600
700
Neighbors
RT
(m
s)
Reviewing GLMs: maximum-likelihood fitting IX
Intercept 383.00neighbors 4.83σ̂ 107.22
I Estimated coefficients arewhat underlies “best linearfit” plots
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
0 2 4 6 8 10 1230
040
050
060
070
0Neighbors
RT
(m
s)
Multi-level Models
I But of course experiments don’t have just one participant
I Different participants may have different idiosyncratic behavior
I And items may have idiosyncratic properties too
I We’d like to take these into account, and perhaps investigatethem directly too.
I This is what multi-level (hierarchical, mixed-effects) models arefor!
Multi-level Models II
I Recap of the graphical picture of a single-level model:
θ
x1
y1
x2
y2
xn
yn· · ·
Predictors
Model parameters
Response
Multi-level Models III: the new graphical picture
θ
Σb
b1 b2 bM· · ·
· · ·
x11 x1n1
y11 y1n1
· · ·
· · ·
x21 x2n2
y21 y2n2
· · ·
· · ·
xM1 xMnM
yM1 yMnM
· · ·
· · ·
Cluster-specificparameters
(“random effects”)
Shared parameters(“fixed effects”)
Parameters governinginter-cluster variability
Multi-level Models III: the new graphical picture
θ
Σb
b1 b2 bM· · ·
· · ·
x11 x1n1
y11 y1n1
· · ·
· · ·
x21 x2n2
y21 y2n2
· · ·
· · ·
xM1 xMnM
yM1 yMnM
· · ·
· · ·
Cluster-specificparameters
(“random effects”)
Shared parameters(“fixed effects”)
Parameters governinginter-cluster variability
Multi-level Models III: the new graphical picture
θ
Σb
b1 b2 bM· · ·
· · ·
x11 x1n1
y11 y1n1
· · ·
· · ·
x21 x2n2
y21 y2n2
· · ·
· · ·
xM1 xMnM
yM1 yMnM
· · ·
· · ·
Cluster-specificparameters
(“random effects”)
Shared parameters(“fixed effects”)
Parameters governinginter-cluster variability
Multi-level Models III: the new graphical picture
θ
Σb
b1 b2 bM· · ·
· · ·
x11 x1n1
y11 y1n1
· · ·
· · ·
x21 x2n2
y21 y2n2
· · ·
· · ·
xM1 xMnM
yM1 yMnM
· · ·
· · ·
Cluster-specificparameters
(“random effects”)
Shared parameters(“fixed effects”)
Parameters governinginter-cluster variability
Multi-level Models III: the new graphical picture
θ
Σb
b1 b2 bM· · ·
· · ·
x11 x1n1
y11 y1n1
· · ·
· · ·
x21 x2n2
y21 y2n2
· · ·
· · ·
xM1 xMnM
yM1 yMnM
· · ·
· · ·
Cluster-specificparameters
(“random effects”)
Shared parameters(“fixed effects”)
Parameters governinginter-cluster variability
Multi-level Models IV
An example of a multi-level model:
I Back to your lexical-decision experiment
tpozt Word or non-word?houze Word or non-word?
I Non-words with different neighborhood densities should havedifferent average decision time
I Additionally, different participants in your study may alsohave:
I different overall decision speedsI differing sensitivity to neighborhood density
I You want to draw inferences about all these things at thesame time
Multi-level Models IV
An example of a multi-level model:
I Back to your lexical-decision experiment
tpozt Word or non-word?houze Word or non-word?
I Non-words with different neighborhood densities should havedifferent average decision time
I Additionally, different participants in your study may alsohave:
I different overall decision speedsI differing sensitivity to neighborhood density
I You want to draw inferences about all these things at thesame time
Multi-level Models IV
An example of a multi-level model:
I Back to your lexical-decision experiment
tpozt Word or non-word?houze Word or non-word?
I Non-words with different neighborhood densities should havedifferent average decision time
I Additionally, different participants in your study may alsohave:
I different overall decision speedsI differing sensitivity to neighborhood density
I You want to draw inferences about all these things at thesame time
Multi-level Models V: Model construction
I Once again we’ll assume for simplicity that the number ofword neighbors x has a linear effect on mean reading time,and that trial-level noise is normally distributed∗
I Random effects, starting simple: let each participant i haveidiosyncratic differences in reading speed
RTij = α + βxij +
∼N(0,σb)︷︸︸︷bi +
Noise∼N(0,σε)︷︸︸︷εij
I In R, we’d write this relationship as
I Once again we can leave off the 1, and the noise term εij isimplicit
Multi-level Models V: Model construction
I Once again we’ll assume for simplicity that the number ofword neighbors x has a linear effect on mean reading time,and that trial-level noise is normally distributed∗
I Random effects, starting simple: let each participant i haveidiosyncratic differences in reading speed
RTij = α + βxij +
∼N(0,σb)︷︸︸︷bi +
Noise∼N(0,σε)︷︸︸︷εij
I In R, we’d write this relationship as
I Once again we can leave off the 1, and the noise term εij isimplicit
Multi-level Models V: Model construction
I Once again we’ll assume for simplicity that the number ofword neighbors x has a linear effect on mean reading time,and that trial-level noise is normally distributed∗
I Random effects, starting simple: let each participant i haveidiosyncratic differences in reading speed
RTij = α + βxij +
∼N(0,σb)︷︸︸︷bi +
Noise∼N(0,σε)︷︸︸︷εij
I In R, we’d write this relationship as
RT ∼ 1 + x + (1 | participant)
I Once again we can leave off the 1, and the noise term εij isimplicit
Multi-level Models V: Model construction
I Once again we’ll assume for simplicity that the number ofword neighbors x has a linear effect on mean reading time,and that trial-level noise is normally distributed∗
I Random effects, starting simple: let each participant i haveidiosyncratic differences in reading speed
RTij = α + βxij +
∼N(0,σb)︷︸︸︷bi +
Noise∼N(0,σε)︷︸︸︷εij
I In R, we’d write this relationship as
RT ∼ 1 + x + (1 | participant)
I Once again we can leave off the 1, and the noise term εij isimplicit
Multi-level Models V: Model construction
I Once again we’ll assume for simplicity that the number ofword neighbors x has a linear effect on mean reading time,and that trial-level noise is normally distributed∗
I Random effects, starting simple: let each participant i haveidiosyncratic differences in reading speed
RTij = α + βxij +
∼N(0,σb)︷︸︸︷bi +
Noise∼N(0,σε)︷︸︸︷εij
I In R, we’d write this relationship as
RT ∼ x + (1 | participant)
I Once again we can leave off the 1, and the noise term εij isimplicit
Multi-level Models VI: simulating data
RTij = α + βxij +
∼N(0,σb)︷︸︸︷bi +
Noise∼N(0,σε)︷︸︸︷εij
I Simulation of trial-level data can be invaluable for achievingdeeper understanding of the data
## simulate some data
> sigma.b <- 125 # inter-subject variation larger than
> sigma.e <- 40 # intra-subject, inter-trial variation
> alpha <- 500
> beta <- 12
> M <- 6 # number of participants
> n <- 50 # trials per participant
> b <- rnorm(M, 0, sigma.b) # individual differences
> nneighbors <- rpois(M*n,3) + 1 # generate num. neighbors
> subj <- rep(1:M,n)
> RT <- alpha + beta * nneighbors + # simulate RTs!
b[subj] + rnorm(M*n,0,sigma.e) #
Multi-level Models VI: simulating data
RTij = α + βxij +
∼N(0,σb)︷︸︸︷bi +
Noise∼N(0,σε)︷︸︸︷εij
I Simulation of trial-level data can be invaluable for achievingdeeper understanding of the data
## simulate some data
> sigma.b <- 125 # inter-subject variation larger than
> sigma.e <- 40 # intra-subject, inter-trial variation
> alpha <- 500
> beta <- 12
> M <- 6 # number of participants
> n <- 50 # trials per participant
> b <- rnorm(M, 0, sigma.b) # individual differences
> nneighbors <- rpois(M*n,3) + 1 # generate num. neighbors
> subj <- rep(1:M,n)
> RT <- alpha + beta * nneighbors + # simulate RTs!
b[subj] + rnorm(M*n,0,sigma.e) #
Multi-level Models VII: simulating data
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
● ●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●● ●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●
●●
●
●●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
● ●●
●
●
●
●
●●
●
●
●
●● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●
●
●
●●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●
●
●
●
0 2 4 6 8
400
600
800
1000
# Neighbors
RT
I Participant-level clustering is easily visible
I This reflects the fact that (simulated) inter-participantvariation (125ms) is larger than (simulated) inter-trialvariation (40ms)
I And the (simulated) effects of neighborhood density are alsovisible
Multi-level Models VII: simulating data
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
● ●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●● ●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●
●●
●
●●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
● ●●
●
●
●
●
●●
●
●
●
●● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●
●
●
●●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●
●
●
●
0 2 4 6 8
400
600
800
1000
# Neighbors
RT
Subject intercepts (alpha + beta[subj])
I Participant-level clustering is easily visible
I This reflects the fact that (simulated) inter-participantvariation (125ms) is larger than (simulated) inter-trialvariation (40ms)
I And the (simulated) effects of neighborhood density are alsovisible
Multi-level Models VII: simulating data
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
● ●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●● ●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●
●●
●
●●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
● ●●
●
●
●
●
●●
●
●
●
●● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●
●
●
●●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●
●
●
●
0 2 4 6 8
400
600
800
1000
# Neighbors
RT
Subject intercepts (alpha + beta[subj])
I Participant-level clustering is easily visibleI This reflects the fact that (simulated) inter-participant
variation (125ms) is larger than (simulated) inter-trialvariation (40ms)
I And the (simulated) effects of neighborhood density are alsovisible
Multi-level Models VII: simulating data
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
● ●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●● ●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●
●●
●
●●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
● ●●
●
●
●
●
●●
●
●
●
●● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●
●
●
●●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●
●
●
●
0 2 4 6 8
400
600
800
1000
# Neighbors
RT
Subject intercepts (alpha + beta[subj])
I Participant-level clustering is easily visibleI This reflects the fact that (simulated) inter-participant
variation (125ms) is larger than (simulated) inter-trialvariation (40ms)
I And the (simulated) effects of neighborhood density are alsovisible
Statistical inference with multi-level models
RTij = α + βxij +
∼N(0,σb)︷︸︸︷bi +
Noise∼N(0,σε)︷︸︸︷εij
I It’s insightful to create a model and simulate data (in thiscase to understand the type of model, but also more generallyto understand your data) . . .
I but, more commonly, we have data and we need to infer amodel
I Specifically, the “fixed-effect” parameters α, β, and σε, plus theparameter governing inter-subject variation, σb
I e.g., hypothesis tests about effects of neighborhood density:can we reliably infer that β is {non-zero, positive, . . . }?
I Fortunately, we can use the same principles as before to dothis:
I The principle of maximum likelihoodI Or Bayesian inference
Statistical inference with multi-level models
RTij = α + βxij +
∼N(0,σb)︷︸︸︷bi +
Noise∼N(0,σε)︷︸︸︷εij
I It’s insightful to create a model and simulate data (in thiscase to understand the type of model, but also more generallyto understand your data) . . .
I but, more commonly, we have data and we need to infer amodel
I Specifically, the “fixed-effect” parameters α, β, and σε, plus theparameter governing inter-subject variation, σb
I e.g., hypothesis tests about effects of neighborhood density:can we reliably infer that β is {non-zero, positive, . . . }?
I Fortunately, we can use the same principles as before to dothis:
I The principle of maximum likelihoodI Or Bayesian inference
Statistical inference with multi-level models
RTij = α + βxij +
∼N(0,σb)︷︸︸︷bi +
Noise∼N(0,σε)︷︸︸︷εij
I It’s insightful to create a model and simulate data (in thiscase to understand the type of model, but also more generallyto understand your data) . . .
I but, more commonly, we have data and we need to infer amodel
I Specifically, the “fixed-effect” parameters α, β, and σε, plus theparameter governing inter-subject variation, σb
I e.g., hypothesis tests about effects of neighborhood density:can we reliably infer that β is {non-zero, positive, . . . }?
I Fortunately, we can use the same principles as before to dothis:
I The principle of maximum likelihoodI Or Bayesian inference
Fitting a multi-level model using maximum likelihood
RTij = α + βxij +
∼N(0,σb)︷︸︸︷bi +
Noise∼N(0,σε)︷︸︸︷εij
> m <- lmer(time ~ neighbors.centered +
(1 | participant),dat,REML=F)
> print(m,corr=F)
[...]Random effects:Groups Name Variance Std.Dev.participant (Intercept) 4924.9 70.177Residual 19240.5 138.710Number of obs: 1760, groups: participant, 44
Fixed effects:Estimate Std. Error t value
(Intercept) 583.787 11.082 52.68neighbors.centered 8.986 1.278 7.03
α̂
β̂
σ̂b
σ̂ε
Fitting a multi-level model using maximum likelihood
RTij = α + βxij +
∼N(0,σb)︷︸︸︷bi +
Noise∼N(0,σε)︷︸︸︷εij
> m <- lmer(time ~ neighbors.centered +
(1 | participant),dat,REML=F)
> print(m,corr=F)
[...]Random effects:Groups Name Variance Std.Dev.participant (Intercept) 4924.9 70.177Residual 19240.5 138.710Number of obs: 1760, groups: participant, 44
Fixed effects:Estimate Std. Error t value
(Intercept) 583.787 11.082 52.68neighbors.centered 8.986 1.278 7.03
α̂
β̂
σ̂b
σ̂ε
Fitting a multi-level model using maximum likelihood
RTij = α + βxij +
∼N(0,σb)︷︸︸︷bi +
Noise∼N(0,σε)︷︸︸︷εij
> m <- lmer(time ~ neighbors.centered +
(1 | participant),dat,REML=F)
> print(m,corr=F)
[...]Random effects:Groups Name Variance Std.Dev.participant (Intercept) 4924.9 70.177Residual 19240.5 138.710Number of obs: 1760, groups: participant, 44
Fixed effects:Estimate Std. Error t value
(Intercept) 583.787 11.082 52.68neighbors.centered 8.986 1.278 7.03
α̂
β̂
σ̂b
σ̂ε
Fitting a multi-level model using maximum likelihood
RTij = α + βxij +
∼N(0,σb)︷︸︸︷bi +
Noise∼N(0,σε)︷︸︸︷εij
> m <- lmer(time ~ neighbors.centered +
(1 | participant),dat,REML=F)
> print(m,corr=F)
[...]Random effects:Groups Name Variance Std.Dev.participant (Intercept) 4924.9 70.177Residual 19240.5 138.710Number of obs: 1760, groups: participant, 44
Fixed effects:Estimate Std. Error t value
(Intercept) 583.787 11.082 52.68neighbors.centered 8.986 1.278 7.03
α̂
β̂
σ̂b
σ̂ε
Fitting a multi-level model using maximum likelihood
RTij = α + βxij +
∼N(0,σb)︷︸︸︷bi +
Noise∼N(0,σε)︷︸︸︷εij
> m <- lmer(time ~ neighbors.centered +
(1 | participant),dat,REML=F)
> print(m,corr=F)
[...]Random effects:Groups Name Variance Std.Dev.participant (Intercept) 4924.9 70.177Residual 19240.5 138.710Number of obs: 1760, groups: participant, 44
Fixed effects:Estimate Std. Error t value
(Intercept) 583.787 11.082 52.68neighbors.centered 8.986 1.278 7.03
α̂
β̂
σ̂b
σ̂ε
Interpreting parameter estimates
Intercept 583.79neighbors.centered 8.99σ̂b 70.18σ̂ε 138.7
I The fixed effects are interpreted just as in a traditionalsingle-level model:
I The “average” RT for a non-word in this study is 583.79msI Every extra neighbor increases “average” RT by 8.99ms
I Inter-trial variability σε also has the same interpretationI Inter-trial variability for a given participant is Gaussian,
centered around the participant+word-specific mean withstandard deviation 138.7ms
I Inter-participant variability σb is what’s new:I Variability in average RT in the population from which the
participants were drawn has standard deviation 70.18ms
Interpreting parameter estimates
Intercept 583.79neighbors.centered 8.99σ̂b 70.18σ̂ε 138.7
I The fixed effects are interpreted just as in a traditionalsingle-level model:
I The “average” RT for a non-word in this study is 583.79msI Every extra neighbor increases “average” RT by 8.99ms
I Inter-trial variability σε also has the same interpretationI Inter-trial variability for a given participant is Gaussian,
centered around the participant+word-specific mean withstandard deviation 138.7ms
I Inter-participant variability σb is what’s new:I Variability in average RT in the population from which the
participants were drawn has standard deviation 70.18ms
Interpreting parameter estimates
Intercept 583.79neighbors.centered 8.99σ̂b 70.18σ̂ε 138.7
I The fixed effects are interpreted just as in a traditionalsingle-level model:
I The “average” RT for a non-word in this study is 583.79ms
I Every extra neighbor increases “average” RT by 8.99ms
I Inter-trial variability σε also has the same interpretationI Inter-trial variability for a given participant is Gaussian,
centered around the participant+word-specific mean withstandard deviation 138.7ms
I Inter-participant variability σb is what’s new:I Variability in average RT in the population from which the
participants were drawn has standard deviation 70.18ms
Interpreting parameter estimates
Intercept 583.79neighbors.centered 8.99σ̂b 70.18σ̂ε 138.7
I The fixed effects are interpreted just as in a traditionalsingle-level model:
I The “average” RT for a non-word in this study is 583.79msI Every extra neighbor increases “average” RT by 8.99ms
I Inter-trial variability σε also has the same interpretationI Inter-trial variability for a given participant is Gaussian,
centered around the participant+word-specific mean withstandard deviation 138.7ms
I Inter-participant variability σb is what’s new:I Variability in average RT in the population from which the
participants were drawn has standard deviation 70.18ms
Interpreting parameter estimates
Intercept 583.79neighbors.centered 8.99σ̂b 70.18σ̂ε 138.7
I The fixed effects are interpreted just as in a traditionalsingle-level model:
I The “average” RT for a non-word in this study is 583.79msI Every extra neighbor increases “average” RT by 8.99ms
I Inter-trial variability σε also has the same interpretation
I Inter-trial variability for a given participant is Gaussian,centered around the participant+word-specific mean withstandard deviation 138.7ms
I Inter-participant variability σb is what’s new:I Variability in average RT in the population from which the
participants were drawn has standard deviation 70.18ms
Interpreting parameter estimates
Intercept 583.79neighbors.centered 8.99σ̂b 70.18σ̂ε 138.7
I The fixed effects are interpreted just as in a traditionalsingle-level model:
I The “average” RT for a non-word in this study is 583.79msI Every extra neighbor increases “average” RT by 8.99ms
I Inter-trial variability σε also has the same interpretationI Inter-trial variability for a given participant is Gaussian,
centered around the participant+word-specific mean withstandard deviation 138.7ms
I Inter-participant variability σb is what’s new:I Variability in average RT in the population from which the
participants were drawn has standard deviation 70.18ms
Interpreting parameter estimates
Intercept 583.79neighbors.centered 8.99σ̂b 70.18σ̂ε 138.7
I The fixed effects are interpreted just as in a traditionalsingle-level model:
I The “average” RT for a non-word in this study is 583.79msI Every extra neighbor increases “average” RT by 8.99ms
I Inter-trial variability σε also has the same interpretationI Inter-trial variability for a given participant is Gaussian,
centered around the participant+word-specific mean withstandard deviation 138.7ms
I Inter-participant variability σb is what’s new:
I Variability in average RT in the population from which theparticipants were drawn has standard deviation 70.18ms
Interpreting parameter estimates
Intercept 583.79neighbors.centered 8.99σ̂b 70.18σ̂ε 138.7
I The fixed effects are interpreted just as in a traditionalsingle-level model:
I The “average” RT for a non-word in this study is 583.79msI Every extra neighbor increases “average” RT by 8.99ms
I Inter-trial variability σε also has the same interpretationI Inter-trial variability for a given participant is Gaussian,
centered around the participant+word-specific mean withstandard deviation 138.7ms
I Inter-participant variability σb is what’s new:I Variability in average RT in the population from which the
participants were drawn has standard deviation 70.18ms
Inferences about cluster-level parameters
RTij = α + βxij +
∼N(0,σb)︷︸︸︷bi +
Noise∼N(0,σε)︷︸︸︷εij
I What about the participants’ idiosyncracies themselves—thebi?
I We can also draw inferences about these—you may haveheard about BLUPs
I To understand these: committing to fixed-effect andrandom-effect parameter estimates determines a conditionalprobability distribution on participant-specific effects:
P(bi |α̂, β̂, σ̂b, σ̂ε,X)
I The BLUPs are the conditional modes of the bi s—the choicesthat maximize the above probability
Inferences about cluster-level parameters
RTij = α + βxij +
∼N(0,σb)︷︸︸︷bi +
Noise∼N(0,σε)︷︸︸︷εij
I What about the participants’ idiosyncracies themselves—thebi?
I We can also draw inferences about these—you may haveheard about BLUPs
I To understand these: committing to fixed-effect andrandom-effect parameter estimates determines a conditionalprobability distribution on participant-specific effects:
P(bi |α̂, β̂, σ̂b, σ̂ε,X)
I The BLUPs are the conditional modes of the bi s—the choicesthat maximize the above probability
Inferences about cluster-level parameters
RTij = α + βxij +
∼N(0,σb)︷︸︸︷bi +
Noise∼N(0,σε)︷︸︸︷εij
I What about the participants’ idiosyncracies themselves—thebi?
I We can also draw inferences about these—you may haveheard about BLUPs
I To understand these: committing to fixed-effect andrandom-effect parameter estimates determines a conditionalprobability distribution on participant-specific effects:
P(bi |α̂, β̂, σ̂b, σ̂ε,X)
I The BLUPs are the conditional modes of the bi s—the choicesthat maximize the above probability
Inferences about cluster-level parameters
RTij = α + βxij +
∼N(0,σb)︷︸︸︷bi +
Noise∼N(0,σε)︷︸︸︷εij
I What about the participants’ idiosyncracies themselves—thebi?
I We can also draw inferences about these—you may haveheard about BLUPs
I To understand these: committing to fixed-effect andrandom-effect parameter estimates determines a conditionalprobability distribution on participant-specific effects:
P(bi |α̂, β̂, σ̂b, σ̂ε,X)
I The BLUPs are the conditional modes of the bi s—the choicesthat maximize the above probability
Inferences about cluster-level parameters II
I The BLUP participant-specific “average” RTs for this datasetare black lines on the base of this graph
400 500 600 700 800
0.00
00.
002
0.00
40.
006
Millseconds
Pro
babi
lity
dens
ity
I The solid line is a guess at their distribution
I The dotted line is the distribution predicted by the model forthe population from which the participants are drawn
I Reasonably close correspondence
Inferences about cluster-level parameters II
I The BLUP participant-specific “average” RTs for this datasetare black lines on the base of this graph
400 500 600 700 800
0.00
00.
002
0.00
40.
006
Millseconds
Pro
babi
lity
dens
ity
I The solid line is a guess at their distributionI The dotted line is the distribution predicted by the model for
the population from which the participants are drawn
I Reasonably close correspondence
Inferences about cluster-level parameters II
I The BLUP participant-specific “average” RTs for this datasetare black lines on the base of this graph
400 500 600 700 800
0.00
00.
002
0.00
40.
006
Millseconds
Pro
babi
lity
dens
ity
I The solid line is a guess at their distributionI The dotted line is the distribution predicted by the model for
the population from which the participants are drawnI Reasonably close correspondence
Inference about cluster-level parameters III
I Participants may also have idiosyncratic sensitivities toneighborhood density
I Incorporate by adding cluster-level slopes into the model:
RTij = α + βxij +
∼N(0,Σb)︷ ︸︸ ︷b1i + b2i xij +
Noise∼N(0,σε)︷︸︸︷εij
I In R (once again we can omit the 1’s):
RT ∼ 1 + x + (1 + x | participant)
> lmer(RT ~ neighbors.centered +
(neighbors.centered | participant), dat,REML=F)
[...]Random effects:Groups Name Variance Std.Dev. Corrparticipant (Intercept) 4928.625 70.2042
neighbors.centered 19.421 4.4069 -0.307Residual 19107.143 138.2286
These three numbers jointlycharacterize Σ̂b
Inference about cluster-level parameters III
I Participants may also have idiosyncratic sensitivities toneighborhood density
I Incorporate by adding cluster-level slopes into the model:
RTij = α + βxij +
∼N(0,Σb)︷ ︸︸ ︷b1i + b2i xij +
Noise∼N(0,σε)︷︸︸︷εij
I In R (once again we can omit the 1’s):
RT ∼ 1 + x + (1 + x | participant)
> lmer(RT ~ neighbors.centered +
(neighbors.centered | participant), dat,REML=F)
[...]Random effects:Groups Name Variance Std.Dev. Corrparticipant (Intercept) 4928.625 70.2042
neighbors.centered 19.421 4.4069 -0.307Residual 19107.143 138.2286
These three numbers jointlycharacterize Σ̂b
Inference about cluster-level parameters III
I Participants may also have idiosyncratic sensitivities toneighborhood density
I Incorporate by adding cluster-level slopes into the model:
RTij = α + βxij +
∼N(0,Σb)︷ ︸︸ ︷b1i + b2i xij +
Noise∼N(0,σε)︷︸︸︷εij
I In R (once again we can omit the 1’s):
RT ∼ 1 + x + (1 + x | participant)
> lmer(RT ~ neighbors.centered +
(neighbors.centered | participant), dat,REML=F)
[...]Random effects:Groups Name Variance Std.Dev. Corrparticipant (Intercept) 4928.625 70.2042
neighbors.centered 19.421 4.4069 -0.307Residual 19107.143 138.2286
These three numbers jointlycharacterize Σ̂b
Inference about cluster-level parameters III
I Participants may also have idiosyncratic sensitivities toneighborhood density
I Incorporate by adding cluster-level slopes into the model:
RTij = α + βxij +
∼N(0,Σb)︷ ︸︸ ︷b1i + b2i xij +
Noise∼N(0,σε)︷︸︸︷εij
I In R (once again we can omit the 1’s):
RT ∼ 1 + x + (1 + x | participant)
> lmer(RT ~ neighbors.centered +
(neighbors.centered | participant), dat,REML=F)
[...]Random effects:Groups Name Variance Std.Dev. Corrparticipant (Intercept) 4928.625 70.2042
neighbors.centered 19.421 4.4069 -0.307Residual 19107.143 138.2286
These three numbers jointlycharacterize Σ̂b
Inference about cluster-level parameters III
I Participants may also have idiosyncratic sensitivities toneighborhood density
I Incorporate by adding cluster-level slopes into the model:
RTij = α + βxij +
∼N(0,Σb)︷ ︸︸ ︷b1i + b2i xij +
Noise∼N(0,σε)︷︸︸︷εij
I In R (once again we can omit the 1’s):
RT ∼ 1 + x + (1 + x | participant)
> lmer(RT ~ neighbors.centered +
(neighbors.centered | participant), dat,REML=F)
[...]Random effects:Groups Name Variance Std.Dev. Corrparticipant (Intercept) 4928.625 70.2042
neighbors.centered 19.421 4.4069 -0.307Residual 19107.143 138.2286
These three numbers jointlycharacterize Σ̂b
Inference about cluster-level parameters IV
450 500 550 600 650 700 750
02
46
810
Intercept
neig
hbor
s
●
●●
●
●
●
● ●
●
●
●●
●
●
●
●
●
●●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
Participants
I Correlation visible in participant-specific BLUPsI Participants who were faster overall also tend to be more
affected by neighborhood density
Σ̂ =
(70.20 −0.3097−0.3097 4.41
)
Inference about cluster-level parameters IV
450 500 550 600 650 700 750
02
46
810
Intercept
neig
hbor
s
●
●●
●
●
●
● ●
●
●
●●
●
●
●
●
●
●●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
Participants
I Correlation visible in participant-specific BLUPsI Participants who were faster overall also tend to be more
affected by neighborhood density
Σ̂ =
(70.20 −0.3097−0.3097 4.41
)
Inference about cluster-level parameters IV
450 500 550 600 650 700 750
02
46
810
Intercept
neig
hbor
s
●
●●
●
●
●
● ●
●
●
●●
●
●
●
●
●
●●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
Participants
I Correlation visible in participant-specific BLUPsI Participants who were faster overall also tend to be more
affected by neighborhood density
Σ̂ =
(70.20 −0.3097−0.3097 4.41
)
Inference about cluster-level parameters IV
450 500 550 600 650 700 750
02
46
810
Intercept
neig
hbor
s
●
●●
●
●
●
● ●
●
●
●●
●
●
●
●
●
●●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
Participants
I Correlation visible in participant-specific BLUPs
I Participants who were faster overall also tend to be moreaffected by neighborhood density
Σ̂ =
(70.20 −0.3097−0.3097 4.41
)
Inference about cluster-level parameters IV
450 500 550 600 650 700 750
02
46
810
Intercept
neig
hbor
s
●
●●
●
●
●
● ●
●
●
●●
●
●
●
●
●
●●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
Participants
I Correlation visible in participant-specific BLUPsI Participants who were faster overall also tend to be more
affected by neighborhood density
Σ̂ =
(70.20 −0.3097−0.3097 4.41
)
Advantages of Multilevel Approach I
I There are several respects in which multi-level models gobeyond AN(C)OVA:
1. They handle imbalanced datasets quite well in contrast toAN(C)OVA (see e.g. Dixon, 2008; Baayen et al., 2008).
2. They handle empty cells3. Group-level means are fitted considering the number of
observations in the group in a rational way (cf. shrinkage)4. Multilevel models have more power.5. Like for any regression model, they make it easier to
investigate effect direction and even shape,
Advantages of Multilevel Approach I
I There are several respects in which multi-level models gobeyond AN(C)OVA:
1. They handle imbalanced datasets quite well in contrast toAN(C)OVA (see e.g. Dixon, 2008; Baayen et al., 2008).
2. They handle empty cells3. Group-level means are fitted considering the number of
observations in the group in a rational way (cf. shrinkage)4. Multilevel models have more power.5. Like for any regression model, they make it easier to
investigate effect direction and even shape,
Advantages of Multilevel Approach I
I There are several respects in which multi-level models gobeyond AN(C)OVA:
1. They handle imbalanced datasets quite well in contrast toAN(C)OVA (see e.g. Dixon, 2008; Baayen et al., 2008).
2. They handle empty cells
3. Group-level means are fitted considering the number ofobservations in the group in a rational way (cf. shrinkage)
4. Multilevel models have more power.5. Like for any regression model, they make it easier to
investigate effect direction and even shape,
Advantages of Multilevel Approach I
I There are several respects in which multi-level models gobeyond AN(C)OVA:
1. They handle imbalanced datasets quite well in contrast toAN(C)OVA (see e.g. Dixon, 2008; Baayen et al., 2008).
2. They handle empty cells3. Group-level means are fitted considering the number of
observations in the group in a rational way (cf. shrinkage)
4. Multilevel models have more power.5. Like for any regression model, they make it easier to
investigate effect direction and even shape,
Advantages of Multilevel Approach I
I There are several respects in which multi-level models gobeyond AN(C)OVA:
1. They handle imbalanced datasets quite well in contrast toAN(C)OVA (see e.g. Dixon, 2008; Baayen et al., 2008).
2. They handle empty cells3. Group-level means are fitted considering the number of
observations in the group in a rational way (cf. shrinkage)4. Multilevel models have more power.
5. Like for any regression model, they make it easier toinvestigate effect direction and even shape,
Advantages of Multilevel Approach I
I There are several respects in which multi-level models gobeyond AN(C)OVA:
1. They handle imbalanced datasets quite well in contrast toAN(C)OVA (see e.g. Dixon, 2008; Baayen et al., 2008).
2. They handle empty cells3. Group-level means are fitted considering the number of
observations in the group in a rational way (cf. shrinkage)4. Multilevel models have more power.5. Like for any regression model, they make it easier to
investigate effect direction and even shape,
Advantages of Multilevel Approach II
6. You can use non-linear linking functions (e.g., logit models forbinary-choice data)
7. You can cross cluster-level (random) effects
I Every trial belongs to both a participant cluster and an itemcluster (and could belong to even more clusters, e.g. socialvariables, experimental groups)
I You can build a single unified model for inferences from yourdata
I ANOVA requires separate by-participants and by-itemsanalyses (quasi-F ′ is quite conservative)
Advantages of Multilevel Approach II
6. You can use non-linear linking functions (e.g., logit models forbinary-choice data)
7. You can cross cluster-level (random) effects
I Every trial belongs to both a participant cluster and an itemcluster (and could belong to even more clusters, e.g. socialvariables, experimental groups)
I You can build a single unified model for inferences from yourdata
I ANOVA requires separate by-participants and by-itemsanalyses (quasi-F ′ is quite conservative)
Advantages of Multilevel Approach II
6. You can use non-linear linking functions (e.g., logit models forbinary-choice data)
7. You can cross cluster-level (random) effects
I Every trial belongs to both a participant cluster and an itemcluster (and could belong to even more clusters, e.g. socialvariables, experimental groups)
I You can build a single unified model for inferences from yourdata
I ANOVA requires separate by-participants and by-itemsanalyses (quasi-F ′ is quite conservative)
Advantages of Multilevel Approach II
6. You can use non-linear linking functions (e.g., logit models forbinary-choice data)
7. You can cross cluster-level (random) effects
I Every trial belongs to both a participant cluster and an itemcluster (and could belong to even more clusters, e.g. socialvariables, experimental groups)
I You can build a single unified model for inferences from yourdata
I ANOVA requires separate by-participants and by-itemsanalyses (quasi-F ′ is quite conservative)
The logit link function for categorical data
I Much psycholinguistic data is categorical rather thancontinuous:
I Yes/no answers to alternations questionsI Speaker choice: (realized (that) her goals were unattainable)I Cloze continuations, and so forth. . .
I Linear models inappropriate; they predict continuous values
I We can stay within the multi-level generalized linear modelsframework but use different link functions and noisedistributions to analyze categorical data
I e.g., the logit model (Agresti, 2002; Jaeger, 2008)
ηij = α + βXij + bi (linear predictor)
ηij = logµij
1− µij(link function)
P(Y = y ;µij) = µij (binomial noise distribution)
The logit link function for categorical data
I Much psycholinguistic data is categorical rather thancontinuous:
I Yes/no answers to alternations questionsI Speaker choice: (realized (that) her goals were unattainable)I Cloze continuations, and so forth. . .
I Linear models inappropriate; they predict continuous values
I We can stay within the multi-level generalized linear modelsframework but use different link functions and noisedistributions to analyze categorical data
I e.g., the logit model (Agresti, 2002; Jaeger, 2008)
ηij = α + βXij + bi (linear predictor)
ηij = logµij
1− µij(link function)
P(Y = y ;µij) = µij (binomial noise distribution)
The logit link function for categorical data
I Much psycholinguistic data is categorical rather thancontinuous:
I Yes/no answers to alternations questionsI Speaker choice: (realized (that) her goals were unattainable)I Cloze continuations, and so forth. . .
I Linear models inappropriate; they predict continuous values
I We can stay within the multi-level generalized linear modelsframework but use different link functions and noisedistributions to analyze categorical data
I e.g., the logit model (Agresti, 2002; Jaeger, 2008)
ηij = α + βXij + bi (linear predictor)
ηij = logµij
1− µij(link function)
P(Y = y ;µij) = µij (binomial noise distribution)
The logit link function for categorical data
I Much psycholinguistic data is categorical rather thancontinuous:
I Yes/no answers to alternations questionsI Speaker choice: (realized (that) her goals were unattainable)I Cloze continuations, and so forth. . .
I Linear models inappropriate; they predict continuous values
I We can stay within the multi-level generalized linear modelsframework but use different link functions and noisedistributions to analyze categorical data
I e.g., the logit model (Agresti, 2002; Jaeger, 2008)
ηij = α + βXij + bi (linear predictor)
ηij = logµij
1− µij(link function)
P(Y = y ;µij) = µij (binomial noise distribution)
The logit link function for categorical data
I Much psycholinguistic data is categorical rather thancontinuous:
I Yes/no answers to alternations questionsI Speaker choice: (realized (that) her goals were unattainable)I Cloze continuations, and so forth. . .
I Linear models inappropriate; they predict continuous values
I We can stay within the multi-level generalized linear modelsframework but use different link functions and noisedistributions to analyze categorical data
I e.g., the logit model (Agresti, 2002; Jaeger, 2008)
ηij = α + βXij + bi (linear predictor)
ηij = logµij
1− µij(link function)
P(Y = y ;µij) = µij (binomial noise distribution)
The logit link function for categorical data
I Much psycholinguistic data is categorical rather thancontinuous:
I Yes/no answers to alternations questionsI Speaker choice: (realized (that) her goals were unattainable)I Cloze continuations, and so forth. . .
I Linear models inappropriate; they predict continuous values
I We can stay within the multi-level generalized linear modelsframework but use different link functions and noisedistributions to analyze categorical data
I e.g., the logit model (Agresti, 2002; Jaeger, 2008)
ηij = α + βXij + bi (linear predictor)
ηij = logµij
1− µij(link function)
P(Y = y ;µij) = µij (binomial noise distribution)
The logit link function for categorical data
I Much psycholinguistic data is categorical rather thancontinuous:
I Yes/no answers to alternations questionsI Speaker choice: (realized (that) her goals were unattainable)I Cloze continuations, and so forth. . .
I Linear models inappropriate; they predict continuous values
I We can stay within the multi-level generalized linear modelsframework but use different link functions and noisedistributions to analyze categorical data
I e.g., the logit model (Agresti, 2002; Jaeger, 2008)
ηij = α + βXij + bi (linear predictor)
ηij = logµij
1− µij(link function)
P(Y = y ;µij) = µij (binomial noise distribution)
The logit link function for categorical data II
I We’ll look at the effect of neighborhood density on correctidentification as non-word in Bicknell et al. (2008)
I Assuming that any effect is linear in log-odds space (seeJaeger (2008) for discussion)
> lmer(correct ~ neighbors.centered + (1 | participant),
dat, family="binomial")
[...]Random effects:Groups Name Variance Std.Dev.participant (Intercept) 0.9243 0.9614Number of obs: 1760, groups: participant, 44
Fixed effects:Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.16310 0.18998 16.649 < 2e-16 ***neighbors.centered -0.18207 0.03483 -5.228 1.72e-07 ***
α̂—participants usually right
σ̂b (note there is no
σ̂ε for logit models)
β̂—effect small compared with inter-subject variation
The logit link function for categorical data II
I We’ll look at the effect of neighborhood density on correctidentification as non-word in Bicknell et al. (2008)
I Assuming that any effect is linear in log-odds space (seeJaeger (2008) for discussion)
> lmer(correct ~ neighbors.centered + (1 | participant),
dat, family="binomial")
[...]Random effects:Groups Name Variance Std.Dev.participant (Intercept) 0.9243 0.9614Number of obs: 1760, groups: participant, 44
Fixed effects:Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.16310 0.18998 16.649 < 2e-16 ***neighbors.centered -0.18207 0.03483 -5.228 1.72e-07 ***
α̂—participants usually right
σ̂b (note there is no
σ̂ε for logit models)
β̂—effect small compared with inter-subject variation
The logit link function for categorical data II
I We’ll look at the effect of neighborhood density on correctidentification as non-word in Bicknell et al. (2008)
I Assuming that any effect is linear in log-odds space (seeJaeger (2008) for discussion)
> lmer(correct ~ neighbors.centered + (1 | participant),
dat, family="binomial")
[...]Random effects:Groups Name Variance Std.Dev.participant (Intercept) 0.9243 0.9614Number of obs: 1760, groups: participant, 44
Fixed effects:Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.16310 0.18998 16.649 < 2e-16 ***neighbors.centered -0.18207 0.03483 -5.228 1.72e-07 ***
α̂—participants usually right
σ̂b (note there is no
σ̂ε for logit models)
β̂—effect small compared with inter-subject variation
The logit link function for categorical data II
I We’ll look at the effect of neighborhood density on correctidentification as non-word in Bicknell et al. (2008)
I Assuming that any effect is linear in log-odds space (seeJaeger (2008) for discussion)
> lmer(correct ~ neighbors.centered + (1 | participant),
dat, family="binomial")
[...]Random effects:Groups Name Variance Std.Dev.participant (Intercept) 0.9243 0.9614Number of obs: 1760, groups: participant, 44
Fixed effects:Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.16310 0.18998 16.649 < 2e-16 ***neighbors.centered -0.18207 0.03483 -5.228 1.72e-07 ***
α̂—participants usually right
σ̂b (note there is no
σ̂ε for logit models)
β̂—effect small compared with inter-subject variation
The logit link function for categorical data II
I We’ll look at the effect of neighborhood density on correctidentification as non-word in Bicknell et al. (2008)
I Assuming that any effect is linear in log-odds space (seeJaeger (2008) for discussion)
> lmer(correct ~ neighbors.centered + (1 | participant),
dat, family="binomial")
[...]Random effects:Groups Name Variance Std.Dev.participant (Intercept) 0.9243 0.9614Number of obs: 1760, groups: participant, 44
Fixed effects:Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.16310 0.18998 16.649 < 2e-16 ***neighbors.centered -0.18207 0.03483 -5.228 1.72e-07 ***
α̂—participants usually right
σ̂b (note there is no
σ̂ε for logit models)
β̂—effect small compared with inter-subject variation
The logit link function for categorical data II
I We’ll look at the effect of neighborhood density on correctidentification as non-word in Bicknell et al. (2008)
I Assuming that any effect is linear in log-odds space (seeJaeger (2008) for discussion)
> lmer(correct ~ neighbors.centered + (1 | participant),
dat, family="binomial")
[...]Random effects:Groups Name Variance Std.Dev.participant (Intercept) 0.9243 0.9614Number of obs: 1760, groups: participant, 44
Fixed effects:Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.16310 0.18998 16.649 < 2e-16 ***neighbors.centered -0.18207 0.03483 -5.228 1.72e-07 ***
α̂—participants usually right
σ̂b (note there is no
σ̂ε for logit models)
β̂—effect small compared with inter-subject variation
Summary
I Multi-level models may seem strange and foreign
I But all you really need to understand them is three basicthings
I Generalized linear modelsI The principle of maximum likelihoodI Bayesian inference
Reviewing GLMs III: Bayesian model fitting
P({βi}, σ|Y ) =
Likelihood︷ ︸︸ ︷P(Y |{βi}, σ)
Prior︷ ︸︸ ︷P({βi}, σ)
P(Y )
I Alternative tomaximum-likelihood:Bayesian model fitting
I Simple (uniform, non-informative)
prior: all combinations of(α, β, σ) equally probable
I Multiply by likelihood →posterior probabilitydistribution over (α, β, σ)
I Bound the region of highestposterior probabilitycontaining 95% ofprobability density → HPDconfidence region
−10
−5
0
5
10
15
20
25
0.00 0.06
P(ββ|Y)
ββ
pMCMC = 0.46
200 300 400 500 600
−10
05
1020
Intercept
Nei
ghbo
rs
●
<αα̂,ββ̂>
200 300 400 500 600
−10
05
1020
Intercept
Nei
ghbo
rs
●
<αα̂,ββ̂>
200 300 400 500 600
−10
05
1020
Intercept
Nei
ghbo
rs
●
<αα̂,ββ̂>
200 300 400 500 600
0.00
0
alpha
P(αα
|Y)
I pMCMC (Baayen et al., 2008)is 1 minus the largestpossible symmetricconfidence interval wholly onone side of 0
Reviewing GLMs III: Bayesian model fitting
P({βi}, σ|Y ) =
Likelihood︷ ︸︸ ︷P(Y |{βi}, σ)
Prior︷ ︸︸ ︷P({βi}, σ)
P(Y )
I Alternative tomaximum-likelihood:Bayesian model fitting
I Simple (uniform, non-informative)
prior: all combinations of(α, β, σ) equally probable
I Multiply by likelihood →posterior probabilitydistribution over (α, β, σ)
I Bound the region of highestposterior probabilitycontaining 95% ofprobability density → HPDconfidence region
−10
−5
0
5
10
15
20
25
0.00 0.06
P(ββ|Y)
ββ
pMCMC = 0.46
200 300 400 500 600
−10
05
1020
Intercept
Nei
ghbo
rs
●
<αα̂,ββ̂>
200 300 400 500 600
−10
05
1020
Intercept
Nei
ghbo
rs
●
<αα̂,ββ̂>
200 300 400 500 600
−10
05
1020
Intercept
Nei
ghbo
rs
●
<αα̂,ββ̂>
200 300 400 500 600
0.00
0
alpha
P(αα
|Y)
I pMCMC (Baayen et al., 2008)is 1 minus the largestpossible symmetricconfidence interval wholly onone side of 0
Reviewing GLMs III: Bayesian model fitting
P({βi}, σ|Y ) =
Likelihood︷ ︸︸ ︷P(Y |{βi}, σ)
Prior︷ ︸︸ ︷P({βi}, σ)
P(Y )
I Alternative tomaximum-likelihood:Bayesian model fitting
I Simple (uniform, non-informative)
prior: all combinations of(α, β, σ) equally probable
I Multiply by likelihood →posterior probabilitydistribution over (α, β, σ)
I Bound the region of highestposterior probabilitycontaining 95% ofprobability density → HPDconfidence region
−10
−5
0
5
10
15
20
25
0.00 0.06
P(ββ|Y)
ββ
pMCMC = 0.46
200 300 400 500 600
−10
05
1020
Intercept
Nei
ghbo
rs
●
<αα̂,ββ̂>
200 300 400 500 600
−10
05
1020
Intercept
Nei
ghbo
rs
●
<αα̂,ββ̂>
200 300 400 500 600
−10
05
1020
Intercept
Nei
ghbo
rs
●
<αα̂,ββ̂>
200 300 400 500 600
0.00
0
alpha
P(αα
|Y)
I pMCMC (Baayen et al., 2008)is 1 minus the largestpossible symmetricconfidence interval wholly onone side of 0
Reviewing GLMs III: Bayesian model fitting
P({βi}, σ|Y ) =
Likelihood︷ ︸︸ ︷P(Y |{βi}, σ)
Prior︷ ︸︸ ︷P({βi}, σ)
P(Y )
I Alternative tomaximum-likelihood:Bayesian model fitting
I Simple (uniform, non-informative)
prior: all combinations of(α, β, σ) equally probable
I Multiply by likelihood →posterior probabilitydistribution over (α, β, σ)
I Bound the region of highestposterior probabilitycontaining 95% ofprobability density → HPDconfidence region
−10
−5
0
5
10
15
20
25
0.00 0.06
P(ββ|Y)
ββ
pMCMC = 0.46
200 300 400 500 600
−10
05
1020
Intercept
Nei
ghbo
rs
●
<αα̂,ββ̂>
200 300 400 500 600
−10
05
1020
Intercept
Nei
ghbo
rs
●
<αα̂,ββ̂>
200 300 400 500 600
−10
05
1020
Intercept
Nei
ghbo
rs
●
<αα̂,ββ̂>
200 300 400 500 600
0.00
0
alpha
P(αα
|Y)
I pMCMC (Baayen et al., 2008)is 1 minus the largestpossible symmetricconfidence interval wholly onone side of 0
Reviewing GLMs III: Bayesian model fitting
P({βi}, σ|Y ) =
Likelihood︷ ︸︸ ︷P(Y |{βi}, σ)
Prior︷ ︸︸ ︷P({βi}, σ)
P(Y )
I Alternative tomaximum-likelihood:Bayesian model fitting
I Simple (uniform, non-informative)
prior: all combinations of(α, β, σ) equally probable
I Multiply by likelihood →posterior probabilitydistribution over (α, β, σ)
I Bound the region of highestposterior probabilitycontaining 95% ofprobability density → HPDconfidence region
−10
−5
0
5
10
15
20
25
0.00 0.06
P(ββ|Y)
ββ
pMCMC = 0.46
200 300 400 500 600
−10
05
1020
Intercept
Nei
ghbo
rs
●
<αα̂,ββ̂>
200 300 400 500 600
−10
05
1020
Intercept
Nei
ghbo
rs
●
<αα̂,ββ̂>
200 300 400 500 600
−10
05
1020
Intercept
Nei
ghbo
rs
●
<αα̂,ββ̂>
200 300 400 500 600
0.00
0
alpha
P(αα
|Y)
I pMCMC (Baayen et al., 2008)is 1 minus the largestpossible symmetricconfidence interval wholly onone side of 0
Reviewing GLMs III: Bayesian model fitting
P({βi}, σ|Y ) =
Likelihood︷ ︸︸ ︷P(Y |{βi}, σ)
Prior︷ ︸︸ ︷P({βi}, σ)
P(Y )
I Alternative tomaximum-likelihood:Bayesian model fitting
I Simple (uniform, non-informative)
prior: all combinations of(α, β, σ) equally probable
I Multiply by likelihood →posterior probabilitydistribution over (α, β, σ)
I Bound the region of highestposterior probabilitycontaining 95% ofprobability density → HPDconfidence region
−10
−5
0
5
10
15
20
25
0.00 0.06
P(ββ|Y)
ββ
pMCMC = 0.46
200 300 400 500 600
−10
05
1020
Intercept
Nei
ghbo
rs
●
<αα̂,ββ̂>
200 300 400 500 600
−10
05
1020
Intercept
Nei
ghbo
rs
●
<αα̂,ββ̂>
200 300 400 500 600
−10
05
1020
Intercept
Nei
ghbo
rs
●
<αα̂,ββ̂>
200 300 400 500 6000.
000
alpha
P(αα
|Y)
I pMCMC (Baayen et al., 2008)is 1 minus the largestpossible symmetricconfidence interval wholly onone side of 0
Reviewing GLMs III: Bayesian model fitting
P({βi}, σ|Y ) =
Likelihood︷ ︸︸ ︷P(Y |{βi}, σ)
Prior︷ ︸︸ ︷P({βi}, σ)
P(Y )
I Alternative tomaximum-likelihood:Bayesian model fitting
I Simple (uniform, non-informative)
prior: all combinations of(α, β, σ) equally probable
I Multiply by likelihood →posterior probabilitydistribution over (α, β, σ)
I Bound the region of highestposterior probabilitycontaining 95% ofprobability density → HPDconfidence region
−10
−5
0
5
10
15
20
25
0.00 0.06
P(ββ|Y)
ββ
pMCMC = 0.46
200 300 400 500 600
−10
05
1020
Intercept
Nei
ghbo
rs
●
<αα̂,ββ̂>
200 300 400 500 600
−10
05
1020
Intercept
Nei
ghbo
rs
●
<αα̂,ββ̂>
200 300 400 500 600
−10
05
1020
Intercept
Nei
ghbo
rs
●
<αα̂,ββ̂>
200 300 400 500 6000.
000
alpha
P(αα
|Y)
I pMCMC (Baayen et al., 2008)is 1 minus the largestpossible symmetricconfidence interval wholly onone side of 0
Bayesian inference for multilevel models
P({βi}, σb, σε|Y ) =
Likelihood︷ ︸︸ ︷P(Y |{βi}, σb, σε)
Prior︷ ︸︸ ︷P({βi}, σb, σε)
P(Y )
I We can also use Bayes’rule to draw inferencesabout fixed effects
I Computationally morechallenging than withsingle-level regression;Markov-chain MonteCarlo (MCMC) samplingtechniques allow us toapproximate it
560 580 600 620
68
1012
Intercept
neig
hbor
s
560 580 600 620
0.00
αα
P(αα
|Y)
4
6
8
10
12
14
0.00
P(αα|Y)
ββ
Bayesian inference for multilevel models
P({βi}, σb, σε|Y ) =
Likelihood︷ ︸︸ ︷P(Y |{βi}, σb, σε)
Prior︷ ︸︸ ︷P({βi}, σb, σε)
P(Y )
I We can also use Bayes’rule to draw inferencesabout fixed effects
I Computationally morechallenging than withsingle-level regression;Markov-chain MonteCarlo (MCMC) samplingtechniques allow us toapproximate it
560 580 600 620
68
1012
Intercept
neig
hbor
s
560 580 600 620
0.00
αα
P(αα
|Y)
4
6
8
10
12
14
0.00
P(αα|Y)
ββ
Bayesian inference for multilevel models
P({βi}, σb, σε|Y ) =
Likelihood︷ ︸︸ ︷P(Y |{βi}, σb, σε)
Prior︷ ︸︸ ︷P({βi}, σb, σε)
P(Y )
I We can also use Bayes’rule to draw inferencesabout fixed effects
I Computationally morechallenging than withsingle-level regression;Markov-chain MonteCarlo (MCMC) samplingtechniques allow us toapproximate it
560 580 600 620
68
1012
Intercept
neig
hbor
s
560 580 600 620
0.00
αα
P(αα
|Y)
4
6
8
10
12
14
0.00
P(αα|Y)
ββ
References I
Agresti, A. (2002). Categorical Data Analysis. Wiley, second edition.
Baayen, R. H., Davidson, D. J., and Bates, D. M. (2008). Mixed-effectsmodeling with crossed random effects for subjects and items. Journalof Memory and Language. In press.
Bicknell, K., Elman, J. L., Hare, M., McRae, K., and Kutas, M. (2008).Online expectations for verbal arguments conditional on eventknowledge. In Proceedings of the 30th Annual Conference of theCognitive Science Society, pages 2220–2225.
Jaeger, T. F. (2008). Categorical data analysis: Away from ANOVAs(transformation or not) and towards logit mixed models. Journal ofMemory and Language. In press.