Interpretable Linear Modeling (Transcript)
LM, GLM, MLM
Overview
Linear Modeling
Generalized Linear Modeling
Multilevel Modeling
• Linear models (LM) are statistical prediction models with high interpretability
• They are best combined with a small number of interpretable features
• They allow you to quantify the size and direction of each feature's effect
• They allow you to isolate the unique effect of each feature
• LM incorporates regression, ANOVA, t-tests, and F-tests
• LM allows for multiple X (predictor or independent) variables
• LM allows for multiple Y (explained or dependent) variables
• LM assumes the Y variables are normally distributed (if not → GLM)
• LM assumes the data points are independent (if not → MLM)
What is linear modeling?
• Linear regression is often presented with the following notation:

y_i = α + β_1·x_1i + ⋯ + β_k·x_ki + ε_i

• y is a vector of observations on the explained variable
• x is a vector of observations on the predictor variable
• α is the intercept
• β are the slopes
• ε is a vector of normally distributed and independent residuals
Simple linear regression
• Instead, we will use the following (equivalent) notation
• This notation will make it easier to generalize LM
y_i ~ Normal(μ_i, σ)

μ_i = α + β_1·x_1i + ⋯ + β_k·x_ki
• This can be read as: y_i is normally distributed around μ_i with SD σ
• μ_i relates y_i to x_i and defines the "best fit" prediction line
• σ captures the residuals or errors that arise when y_i ≠ μ_i (see the simulation sketch below)
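As a rough illustration of this notation, here is a minimal R sketch that simulates data from such a model (all parameter values here are made up):

set.seed(1)
n     <- 150   # number of observations
alpha <- 5.0   # intercept (made up)
beta  <- 1.2   # slope (made up)
sigma <- 1.8   # residual SD (made up)
x  <- runif(n, min = 0, max = 3)       # predictor values
mu <- alpha + beta * x                 # the "best fit" prediction line
y  <- rnorm(n, mean = mu, sd = sigma)  # y_i ~ Normal(mu_i, sigma)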
Simple linear regression
[Plot: a simple linear regression fit, annotating the linear model line and the likelihood distribution around it]
• The goal is to find the parameter values that minimize the residuals
• In linear modeling, this means estimating α, β, and σ
• This can be accomplished in several different ways:
  • Maximum likelihood (i.e., minimize the residual sum of squares)
  • Bayesian approximation (e.g., maximum a posteriori or MCMC)
• I will provide code for use in R's lme4 and in Python's statsmodels (see the R sketch below)
  • These both use maximum likelihood estimation by default
  • I also highly recommend MCMC (especially for MLM)
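Using the simulated x and y from above, a minimal sketch of the default maximum likelihood workflow in R (the statsmodels version is analogous):

# For normally distributed residuals, least squares is equivalent
# to maximum likelihood estimation
fit <- lm(y ~ x)
coef(fit)     # point estimates of alpha and beta
confint(fit)  # 95% CIs like those in the tables that follow
sigma(fit)    # estimated residual SD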
Estimating the regression coefficients
Simulated Data of 150 YouTube Book Review Videos
Simulated LM example dataset
y_i ~ Normal(μ_i, σ)

μ_i = α
review_score ~ 1
Example of LM with intercept-only
Variable Estimate 95% CI
α (Intercept) 6.51 [6.20, 6.83]
σ (Residual SD) 1.94
Deviance = 563.16, Adjusted R² = 0.000
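A sketch of this intercept-only fit in R, assuming a hypothetical data frame `reviews` holding the simulated variables:

fit0 <- lm(review_score ~ 1, data = reviews)
coef(fit0)      # alpha is just the sample mean of review_score
confint(fit0)   # 95% CI for alpha
deviance(fit0)  # residual sum of squares, reported above as Deviance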
y_i ~ Normal(μ_i, σ)

μ_i = α + β·x_i
review_score ~ 1 + is_critic
Example of LM with binary predictor
Variable Estimate 95% CI
α (Intercept) 6.85 [6.46, 7.24]
β (is_critic) −0.87 [−1.50, −0.24]
σ (Residual SD) 1.90
Deviance = 536.15, Adjusted R² = 0.042
y_i ~ Normal(μ_i, σ)

μ_i = α + β·x_i
review_score ~ 1 + smile_rate
Example of LM with continuous predictor
Variable Estimate 95% CI
α (Intercept) 5.40 [4.90, 5.90]
β (smile_rate) 1.15 [0.72, 1.57]
σ (Residual SD) 1.78
Deviance = 471.43, Adjusted R² = 0.157
y_i ~ Normal(μ_i, σ)

μ_i = α + β_1·x_1i + β_2·x_2i
review_score ~ 1 + smile_rate + is_critic
Example of LM with two predictors
Variable Estimate 95% CI
α (Intercept) 5.71 [5.13, 6.29]
β_1 (smile_rate) 1.07 [0.65, 1.49]
β_2 (is_critic) −0.61 [−1.20, −0.01]
σ (Residual SD) 1.77
Deviance = 458.68, Adjusted R² = 0.174
y_i ~ Normal(μ_i, σ)

μ_i = α + β_1·x_1i + β_2·x_2i + β_3·x_1i·x_2i
review_score ~ 1 + smile_rate * is_critic
Example of LM with interaction effect
Variable Estimate 95% CI
α (Intercept) 5.96 [5.31, 6.61]
β_1 (smile_rate) 0.83 [0.33, 1.34]
β_2 (is_critic) −1.31 [−2.32, −0.30]
β_3 (Interaction) 0.79 [−0.13, 1.70]
σ (Residual SD) 1.76
Deviance = 449.88, Adjusted R² = 0.185
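A sketch fitting this whole sequence of models in R (same hypothetical `reviews` data frame) and comparing the nested ones:

fit0 <- lm(review_score ~ 1, data = reviews)
fit1 <- lm(review_score ~ 1 + is_critic, data = reviews)
fit2 <- lm(review_score ~ 1 + smile_rate, data = reviews)
fit3 <- lm(review_score ~ 1 + smile_rate + is_critic, data = reviews)
fit4 <- lm(review_score ~ 1 + smile_rate * is_critic, data = reviews)
anova(fit0, fit1, fit3, fit4)  # F-tests for each nested expansion
summary(fit4)$adj.r.squared    # adjusted R², as reported on the slides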
• Generalized linear modeling (GLM) is an extension of LM
• It allows LM to accommodate non-normally distributed Y variables
• Examples include variables that are: binary, discrete, and bounded
• This is accomplished using GLM families and link functions
• Families model the likelihood function using a specific distribution
• Link functions connect the linear model to the mean of the likelihood
• Link functions also ensure that the mean is the "right" kind of number (see the short demo below)
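In R, each family object bundles a distribution with its link, and you can inspect the link and inverse link directly. A minimal demo:

fam <- binomial(link = "logit")
fam$linkfun(0.5)  # logit(0.5) = log(0.5 / (1 - 0.5)) = 0
fam$linkinv(0)    # inverse logit maps back to a probability: 0.5
poisson(link = "log")$linkinv(2)  # exp(2): the mean stays positive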
What is generalized linear modeling?
y_i ~ Family(μ_i, …)

link(μ_i) = α + β_1·x_1i + ⋯ + β_k·x_ki
Common GLM families and link functions
Family     Uses                                 Supports                  Link Function
Normal     Linear data                          ℝ: (−∞, ∞)                Identity: μ
Gamma      Non-negative data (reaction times)   ℝ: (0, ∞)                 Inverse or Power: μ⁻¹
Poisson    Count data                           ℤ: 0, 1, 2, …             Log: log(μ)
Binomial   Binary or categorical data           ℤ: {0, 1} or ℤ: [0, K)    Logit: log(μ / (1 − μ))
Simulated GLM example dataset
Simulated Data of 150 Romantic Couples' Interactions
y_i ~ Normal(μ_i, σ)

μ_i = α + β·x_i
divorced ~ 1 + satisfaction
Example of LM for binary data
Variable Estimate 95% CI
α (Intercept) 0.98 [0.89, 1.07]
β (satisfaction) −0.22 [−0.25, −0.19]
σ (Residual SD) 0.304
Deviance = 13.70, Adjusted R² = 0.617
y_i ~ Binomial(1, μ_i)

logit(μ_i) = α + β·x_i

divorced ~ 1 + satisfaction
family = binomial(link = "logit")
Example of GLM for binary data
Variable Estimate 95% CI
α (Intercept) 3.52 [2.45, 4.87]
β (satisfaction) −1.78 [−2.40, −1.30]
Deviance = 82.04, Pseudo R² = 0.664
Note: Results are in transformed (i.e., logit) units
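A sketch of this fit in R, plus back-transformation out of logit units (assuming a hypothetical `couples` data frame):

fit <- glm(divorced ~ 1 + satisfaction, data = couples,
           family = binomial(link = "logit"))
coef(fit)             # logit-scale estimates, as in the table above
exp(coef(fit))        # odds ratios, which are easier to interpret
plogis(coef(fit)[1])  # predicted probability of divorce when satisfaction = 0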
y_i ~ Normal(μ_i, σ)

μ_i = α + β·x_i
interruptions ~ 1 + satisfaction
Example of LM for count data
Variable Estimate 95% CI
α (Intercept) 17.11 [15.76, 18.45]
β (satisfaction) −3.95 [−4.38, −3.52]
σ (Residual SD) 4.61
Deviance = 3143.62, Adjusted R² = 0.688
y_i ~ Poisson(μ_i)

log(μ_i) = α + β·x_i

interruptions ~ 1 + satisfaction

family = poisson(link = "log")
Example of GLM for count data
Variable Estimate 95% CI
α (Intercept) 3.16 [3.08, 3.24]
β (satisfaction) −0.77 [−0.83, −0.72]
Deviance = 325.24, Pseudo R² = 0.895
Note: Results are in transformed (i.e., log) units
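A sketch of the Poisson fit in R, again back-transforming out of log units (same hypothetical `couples` data frame):

fit <- glm(interruptions ~ 1 + satisfaction, data = couples,
           family = poisson(link = "log"))
coef(fit)       # log-scale estimates, as in the table above
exp(coef(fit))  # rate ratios: multiplicative change in the expected count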
• MLM allows (G)LM to handle data points that are dependent or clustered
• Dependency is problematic for (G)LM because data within a cluster are likely to be similar
• Effects (e.g., intercepts and slopes) may also differ between clusters
What is multilevel modeling?
[Diagram: clustered data. Person 1, Person 2, and Person 3 each contribute Obs. 1-3, and each observation has an X and a Y value]
• One approach would be to estimate effects across all clusters (complete pooling)
  • But this would require all clusters to be nearly identical, which they may not be
• Another would be to estimate effects for each cluster separately (no pooling)
  • But this would ignore similarities between clusters and may overfit the data
• The MLM approach tries to get the best of both approaches (partial pooling)
  • MLM tries to learn the similarities and the differences between clusters
  • It estimates effects for each individual cluster separately
  • But these effects are assumed to be drawn from a population of clusters
  • MLM explicitly models this population and the cluster-specific variation (see the shrinkage sketch below)
What is multilevel modeling?
• MLM comes with several important benefits over single-level models:
  • Improved estimates for repeat sampling
  • Improved estimates for imbalanced sampling
  • Estimates of variation across clusters
  • Avoiding averaging and retaining variation
  • Better accuracy and inference!
• Many have argued that "multilevel regression deserves to be the default approach"
What is multilevel modeling?
y_i ~ Normal(μ_i, σ)

μ_i = α_CLUSTER[i]

α_CLUSTER ~ Normal(α, σ_CLUSTER)
• α captures the overall intercept (i.e., the mean of all clusters' intercepts)
• α_CLUSTER[i] captures each cluster's own intercept (its deviation from α)
• σ_CLUSTER captures how much variability there is across clusters' intercepts
• We now have an intercept for every cluster and a "population" of intercepts
• The model learns about each cluster's intercept from the population of intercepts
• This pooling within parameters is very helpful for imbalanced sampling
Varying intercepts
Actual MLM example dataset
Real Data of 751 Smiles from 136 Subjects (Girard, Shandar, et al. 2019)
Complete pooling (underfitting)
y_i ~ Normal(μ_i, σ)

μ_i = α
smile_int ~ 1
Variable Estimate 95% CI
α (Intercept) 2.14 [2.10, 2.19]
σ (Residual SD) 0.66
Deviance = 1495.8
No pooling (overfitting)
y_i ~ Normal(μ_i, σ)

μ_i = α_PERSON[i]
smile_int ~ 1 + factor(subject)
Variable Estimate 95% CI
α_1 (Intercept for P1) 2.14 [1.93, 2.89]
α_2 (Intercept for P2) 2.17 [1.49, 2.85]
… … …
σ (Residual SD) 0.60
Deviance = 1210.10
Partial pooling (ideal fit)
y_i ~ Normal(μ_i, σ)

μ_i = α_PERSON[i]

α_PERSON ~ Normal(α, σ_PERSON)
smile_int ~ 1 + (1 | subject)
Variable Estimate 95% CI
α (Intercept) 2.15 [2.09, 2.21]
σ_PERSON (Intercept SD) 0.27
σ (Residual SD) 0.60
Deviance = 1458.81
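A sketch of this varying-intercepts fit with lme4, assuming a hypothetical data frame `smiles` containing these variables:

library(lme4)
fit <- lmer(smile_int ~ 1 + (1 | subject), data = smiles)
fixef(fit)    # the overall intercept alpha
ranef(fit)    # each subject's (partially pooled) deviation from alpha
VarCorr(fit)  # intercept SD across subjects and the residual SD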
y_i ~ Normal(μ_i, σ)

μ_i = α_CLUSTER[i] + β_CLUSTER[i]·x_i

[α_CLUSTER, β_CLUSTER] ~ MVNormal([α, β], Σ)

Σ = diag(σ_α, σ_β) · R · diag(σ_α, σ_β)

• We can pool and learn from the covariance between varying intercepts and slopes
• We do this with a 2D normal distribution with means α and β and covariance matrix Σ
• We build Σ through matrix multiplication of the SDs and the correlation matrix R
• This is more complex but results in even better pooling: within and across parameters
Varying intercepts and slopes
MLM with varying intercepts and slopes
y_i ~ Normal(μ_i, σ)

μ_i = α_PERSON[i] + β_PERSON[i]·x_i

[α_PERSON, β_PERSON] ~ MVNormal([α, β], Σ)

Σ = diag(σ_α, σ_β) · R · diag(σ_α, σ_β)
smile_int ~ 1 + amused_sr + (1 + amused_sr | subject)
Deviance = 1458.81
Variable Estimate 95% CI
α (Intercept) 1.98 [1.91, 2.06]
β (amused_sr) 0.12 [0.09, 0.15]
σ_α (Intercept SD) 0.24 [0.17, 0.32]
σ_β (Slope SD) 0.04 [0.00, 0.08]
R (Intercept-Slope corr) 0.30 [−0.69, 0.96]
σ (Residual SD) 0.57 [0.54, 0.60]
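A sketch of the varying-intercepts-and-slopes fit with lme4 (same hypothetical `smiles` data frame):

library(lme4)
fit <- lmer(smile_int ~ 1 + amused_sr + (1 + amused_sr | subject),
            data = smiles)
fixef(fit)    # overall intercept alpha and slope beta
VarCorr(fit)  # sigma_alpha, sigma_beta, their correlation, and the residual SD
coef(fit)     # per-subject intercepts and slopes after partial pooling
confint(fit, method = "profile")  # profile CIs like those in the table above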