Binary and Multinomial Logistic Regression, stat 557, Heike Hofmann


Page 1

Binary and Multinomial Logistic Regression

stat 557, Heike Hofmann

Page 2

Outline

• Logistic Regression:

• model checking by grouping

• Model selection

• scores

• Intro to Multinomial Regression

Page 3

Example: Happiness Data

> summary(happy)
           happy            year           age         sex       
 not too happy: 5629   Min.   :1972   Min.   :18.00   female:28581  
 pretty happy :25874   1st Qu.:1982   1st Qu.:31.00   male  :22439  
 very happy   :14800   Median :1990   Median :43.00                 
 NA's         : 4717   Mean   :1990   Mean   :45.43                 
                       3rd Qu.:2000   3rd Qu.:58.00                 
                       Max.   :2006   Max.   :89.00                 
                                      NA's   :  184                 
          marital                degree                   finrela          health     
 divorced     : 6131   bachelor      : 6918   above average    : 8536   excellent:11951  
 married      :27998   graduate      : 3253   average          :23363   fair     : 7149  
 never married:10064   high school   :26307   below average    :10909   good     :17227  
 separated    : 1781   junior college: 2601   far above average:  898   poor     : 2164  
 widowed      : 5032   lt high school:11777   far below average: 2438   NA's     :12529  
 NA's         :   14   NA's          :  164   NA's             : 4876                    

only consider the extremes: “very happy” and “not too happy” individuals

Page 4

[Figure: mosaic plot of happy by sex (female, male)]

prodplot(data=happy, ~ happy+sex, c("vspine", "hspine"), na.rm=T, subset=level==2)
# almost perfect independence
# try a model

happy.sex <- glm(happy~sex, family=binomial(), data=happy)
summary(happy.sex)

Call:
glm(formula = happy ~ sex, family = binomial(), data = happy)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.6060  -1.6054   0.8027   0.8031   0.8031  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.96613    0.02075  46.551   <2e-16 ***
sexmale      0.00130    0.03162   0.041    0.967    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 24053  on 20428  degrees of freedom
Residual deviance: 24053  on 20427  degrees of freedom
AIC: 24057

Number of Fisher Scoring iterations: 4

Page 5

• Deviance difference is asymptotically χ2 distributed

• Null hypothesis of independence cannot be rejected

> anova(happy.sex)
Analysis of Deviance Table

Model: binomial, link: logit

Response: happy

Terms added sequentially (first to last)

     Df  Deviance Resid. Df Resid. Dev
NULL                  20428      24053
sex   1 0.0016906     20427      24053

> confint(happy.sex)
Waiting for profiling to be done...
                  2.5 %     97.5 %
(Intercept)  0.92557962 1.00693875
sexmale     -0.06064378 0.06332427
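The deviance drop for sex (0.0016906 on 1 df, from the anova table) can be converted to a p-value by hand; a quick sketch (anova(happy.sex, test="Chisq") reports the same test):

```r
# chi-square tail probability of the deviance difference, df = 1
dev_diff <- 0.0016906
p <- pchisq(dev_diff, df = 1, lower.tail = FALSE)
round(p, 3)  # 0.967, matching the Wald p-value for sexmale
```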

Page 6

Age and Happiness

[Figures: histograms of age (20-80) filled by happy ("not too happy", "very happy"), shown as counts (0-400) and as proportions (0-1)]

qplot(age, geom="histogram", fill=happy, binwidth=1, data=happy)

qplot(age, geom="histogram", fill=happy, binwidth=1, position="fill", data=happy)

# research paper claims that happiness is u-shaped
happy.age <- glm(happy~poly(age,2), family=binomial(), data=na.omit(happy[,c("age","happy")]))

Page 7

> summary(happy.age)

Call:
glm(formula = happy ~ poly(age, 2), family = binomial(), data = na.omit(happy[, c("age", "happy")]))

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.6400  -1.5480   0.7841   0.8061   0.8707  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)    0.96850    0.01571  61.660  < 2e-16 ***
poly(age, 2)1  6.41183    2.22171   2.886  0.00390 ** 
poly(age, 2)2 -7.81568    2.21981  -3.521  0.00043 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 23957  on 20351  degrees of freedom
Residual deviance: 23936  on 20349  degrees of freedom
AIC: 23942

Number of Fisher Scoring iterations: 4

[Figure: proportion of "very happy" vs "not too happy" by age]

Page 8

# effect of age
X <- data.frame(cbind(age=20:85))
X$pred <- predict(happy.age, newdata=X, type="response")
qplot(age, pred, data=X) + ylim(c(0,1))

[Figures: predicted probability of "very happy" by age (20-80), alongside the observed age proportions]

> anova(happy.age)
Analysis of Deviance Table

Model: binomial, link: logit

Response: happy

Terms added sequentially (first to last)

             Df Deviance Resid. Df Resid. Dev
NULL                         20351      23957
poly(age, 2)  2   20.739     20349      23936

Page 9

[Figure: predicted probabilities (pred3) by age, colored by sex (female, male), with error bars]

# effect of age
X <- data.frame(expand.grid(age=20:85, sex=c("female","male")))
preds <- predict(happy.age, newdata=X, type="response", se.fit=T)
X$pred <- preds$fit
X$pred.se <- preds$se.fit

limits <- aes(ymax = pred + pred.se, ymin = pred - pred.se)
qplot(age, pred, data=X, size=I(1)) + ylim(c(0,1)) + 
  geom_point(aes(age, pred2), size=1, colour="blue") + 
  geom_errorbar(limits) + geom_errorbar(limits2, colour="blue") + 
  geom_point(aes(x=age, y=happy/(happy+not), colour=sex), data=happy.age.df)

> anova(midlife.sex)
Analysis of Deviance Table

Model: binomial, link: logit

Response: happy

Terms added sequentially (first to last)

                 Df Deviance Resid. Df Resid. Dev
NULL                             20351      23957
poly(age, 4)      4   59.021     20347      23898
sex               1    0.000     20346      23898
poly(age, 4):sex  4   37.554     20342      23860

Page 10

Problems with Deviance

• if X is continuous, the deviance no longer has a χ2 distribution. The approximation is violated two-fold:

• if we regard X as categorical (with many categories), we end up with a contingency table that has many small cells, which means that the χ2 approximation does not hold.

• increasing the sample size most likely increases the number of distinct values of X. The corresponding contingency table changes size (an asymptotic distribution for the smaller contingency table does not exist).

Page 11

... but

• Differences in deviances between models that are only a few degrees of freedom apart are still asymptotically χ2 distributed
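Such deviance differences between nested fits can be tested with anova(..., test = "Chisq"); a self-contained sketch on simulated data (the names m0, m1 are mine, not from the slides):

```r
set.seed(1)
x <- runif(500, 20, 80)
y <- rbinom(500, 1, plogis(-2 + 0.05 * x))
m0 <- glm(y ~ 1, family = binomial())  # null model
m1 <- glm(y ~ x, family = binomial())  # one extra parameter
# deviance difference, asymptotically chi-square on 1 df
anova(m0, m1, test = "Chisq")
pchisq(deviance(m0) - deviance(m1), df = 1, lower.tail = FALSE)
```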

Page 12

[Figure: predicted probabilities (pred3) by age, colored by sex (female, male), with error bars]

# effect of age
X <- data.frame(expand.grid(age=20:85, sex=c("female","male")))
preds <- predict(happy.age, newdata=X, type="response", se.fit=T)
X$pred <- preds$fit
X$pred.se <- preds$se.fit

limits <- aes(ymax = pred + pred.se, ymin = pred - pred.se)
qplot(age, pred, data=X, size=I(1)) + ylim(c(0,1)) + 
  geom_point(aes(age, pred2), size=1, colour="blue") + 
  geom_errorbar(limits) + geom_errorbar(limits2, colour="blue") + 
  geom_point(aes(x=age, y=happy/(happy+not), colour=sex), data=happy.age.df)

> anova(midlife.sex)
Analysis of Deviance Table

Model: binomial, link: logit

Response: happy

Terms added sequentially (first to last)

                 Df Deviance Resid. Df Resid. Dev
NULL                             20351      23957
poly(age, 4)      4   59.021     20347      23898
sex               1    0.000     20346      23898
poly(age, 4):sex  4   37.554     20342      23860

Page 13

Model Checking by Grouping

• Group data along estimates, e.g. such that groups are approximately equal in size.

• Partition the smallest n1 estimates into group 1, the second smallest batch of n2 estimates into group 2, and so on. With g groups, we get the Hosmer-Lemeshow test statistic:

Problem with deviance: if X is continuous, the deviance no longer has a χ2 distribution. The approximation assumptions are violated two-fold: even if we regard X as categorical (with many categories), we end up with a contingency table that has many small cells, which means that the χ2 approximation does not hold. Secondly, if we increase the sample size, the number of distinct values of X most likely increases too, which makes the corresponding contingency table change size (so we cannot even talk about an asymptotic distribution for the smaller contingency table, as it no longer exists once the sample size is larger).

Model Checking by Grouping  To get around the problems with the distribution assumption of G2, we can group the data along estimates, e.g. by partitioning on estimates such that groups are approximately equal in size. Partitioning the estimates is done by size: we group the smallest n1 estimates into group 1, the second smallest batch of n2 estimates into group 2, and so on. With g groups, we get the Hosmer-Lemeshow test statistic

$$\sum_{i=1}^{g} \frac{\left( \sum_{j=1}^{n_i} y_{ij} - \sum_{j=1}^{n_i} \hat{\pi}_{ij} \right)^2}{\left( \sum_{j=1}^{n_i} \hat{\pi}_{ij} \right) \left( 1 - \sum_{j} \hat{\pi}_{ij} / n_i \right)} \sim \chi^2_{g-2}.$$
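The statistic can be computed directly from a fitted glm. A minimal sketch (the function name hl_test and the default g = 10 are my choices; packages such as ResourceSelection provide tested implementations):

```r
# Hosmer-Lemeshow: group by fitted probability, compare observed vs expected
hl_test <- function(y, pihat, g = 10) {
  grp <- cut(pihat,
             breaks = quantile(pihat, probs = seq(0, 1, length.out = g + 1)),
             include.lowest = TRUE)
  obs <- tapply(y, grp, sum)       # sum_j y_ij per group
  exp <- tapply(pihat, grp, sum)   # sum_j pihat_ij per group
  n   <- tapply(y, grp, length)    # n_i
  stat <- sum((obs - exp)^2 / (exp * (1 - exp / n)))
  c(statistic = stat, df = g - 2,
    p.value = pchisq(stat, df = g - 2, lower.tail = FALSE))
}

# usage on simulated data
set.seed(2)
x <- rnorm(1000)
y <- rbinom(1000, 1, plogis(0.3 + 0.8 * x))
fit <- glm(y ~ x, family = binomial())
hl_test(y, fitted(fit))
```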

4.4 Effects of Coding

Let X be a nominal variable with I categories. An appropriate model would then be:

$$\log \frac{\pi(x)}{1 - \pi(x)} = \alpha + \beta_i,$$

where βi is the effect of the ith category of X on the log odds, i.e. one effect is estimated for each category. This means that the above model is overparameterized (the “last” category can be explained in terms of the others). To make the solution unique again, we have to use an additional constraint. In R, β1 = 0 by default. Whenever one of the effects is fixed to zero, this is called a contrast coding, as it allows a comparison of all the other effects to the baseline effect. For effect coding, the constraint is on the sum of all effects of a variable:

$\sum_i \beta_i = 0$. For a binary variable, the two effects are then the negatives of each other.

Predictions and inference are independent of the specific coding used and are not affected by changes made in the coding.
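This invariance is easy to check in R: fit the same model under contrast coding (contr.treatment, β1 = 0) and effect coding (contr.sum, Σi βi = 0). A sketch on toy data (all names are mine):

```r
set.seed(3)
f <- factor(sample(c("a", "b", "c"), 300, replace = TRUE))
y <- rbinom(300, 1, c(0.3, 0.5, 0.7)[as.integer(f)])
m_treat <- glm(y ~ f, family = binomial(),
               contrasts = list(f = "contr.treatment"))  # beta_1 = 0
m_sum   <- glm(y ~ f, family = binomial(),
               contrasts = list(f = "contr.sum"))        # sum of betas = 0
coef(m_treat)                              # different estimates ...
coef(m_sum)
all.equal(fitted(m_treat), fitted(m_sum))  # ... identical predictions: TRUE
```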

Example: Alcohol and Malformation  Alcohol during pregnancy is believed to be associated with congenital malformation. The following numbers are from an observational study: after three months of pregnancy, questions on the average number of daily alcoholic beverages were asked; at birth the infant was checked for malformations:

  Alcohol  malformed  absent  P(malformed)
1    0         48      17066        0.0028
2   <1         38      14464        0.0026
3  1-2          5        788        0.0063
4  3-5          1        126        0.0079
5  ≥ 6          1         37        0.0263

Models m1 and m2 are the same in terms of statistical behavior: deviance, predictions and inference will yield the same numbers. The variable Alcohol is recoded for the second model, giving different estimates for the levels.

Alcohol <- factor(c("0","<1","1-2","3-5",">=6"), levels=c("0","<1","1-2","3-5",">=6"))
malformed <- c(48,38,5,1,1)
absent <- c(17066,14464,788,126,37)
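Both models can be fitted directly from these counts. The slides do not show the exact recoding used for m2, so reversing the level order stands in for it here:

```r
Alcohol   <- factor(c("0","<1","1-2","3-5",">=6"),
                    levels = c("0","<1","1-2","3-5",">=6"))
malformed <- c(48, 38, 5, 1, 1)
absent    <- c(17066, 14464, 788, 126, 37)

m1 <- glm(cbind(malformed, absent) ~ Alcohol, family = binomial())
# recode the factor (here: reversed level order): estimates change,
# but deviance, predictions and inference do not
Alcohol2 <- factor(Alcohol, levels = rev(levels(Alcohol)))
m2 <- glm(cbind(malformed, absent) ~ Alcohol2, family = binomial())

c(deviance(m1), deviance(m2))      # both ~0: saturated model
all.equal(fitted(m1), fitted(m2))  # TRUE
```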


Page 14

Problems with Grouping

• Different groupings might (and will) lead to different decisions w.r.t. model fit

• Hosmer et al. (1997): “A comparison of goodness-of-fit tests for the logistic regression model” (on Blackboard)

Page 15

Model Selection

?

Theory for the relationship between response and outcome is well developed; the model is fitted because we want to fine-tune the dependency structure

Ideal Situation:

Page 16

Model Selection

?

After an initial data check, visually inspect the relationship between the response and potential co-variates; include the strongest co-variates first, build up from there, and check whether additions are significant improvements

Exploratory Modelling

Page 17

Model Selection

Include/Exclude variables based on goodness-of-fit criteria such as AIC, adjusted R2, ...

Stepwise Modelling (not recommended by itself)

In Practice: combination of all three methods

Page 18

(Forward) Selection

• Results are often not easy to interpret - questionable value?

Step:  AIC=18176
cbind(happy, not) ~ sex + poly(age, 4) + marital + degree + finrela + 
    degree:finrela + poly(age, 4):degree + poly(age, 4):finrela + 
    sex:finrela + sex:degree

                       Df Deviance   AIC
<none>                      16714  18176
+ sex:marital           4   16707  18177
+ marital:degree       16   16688  18182
+ poly(age, 4):marital 16   16688  18182
+ sex:poly(age, 4)      4   16714  18184
+ marital:finrela      16   16693  18187

Page 19

(Forward) Selection

Step:  AIC=18176
cbind(happy, not) ~ sex + poly(age, 4) + marital + degree + finrela + 
    degree:finrela + poly(age, 4):degree + poly(age, 4):finrela + 
    sex:finrela + sex:degree

                        Df Deviance   AIC
<none>                       16714  18176
- sex:degree             4   16722  18176
+ sex:marital            4   16707  18177
- sex:finrela            4   16724  18178
+ marital:degree        16   16688  18182
+ poly(age, 4):marital  16   16688  18182
+ sex:poly(age, 4)       4   16714  18184
+ marital:finrela       16   16693  18187
- poly(age, 4):finrela  16   16759  18189
- poly(age, 4):degree   16   16766  18196
- degree:finrela        16   16774  18204
- marital                4   18232  19686
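The stepwise search itself is run with step(); a minimal self-contained sketch on simulated data (the variable names x1-x3 are mine, not the happiness covariates):

```r
set.seed(6)
d <- data.frame(x1 = rnorm(400), x2 = rnorm(400), x3 = rnorm(400))
d$y <- rbinom(400, 1, plogis(0.5 * d$x1 - 0.8 * d$x2))  # x3 is pure noise
m0 <- glm(y ~ 1, family = binomial(), data = d)
# add/drop terms within the given scope, judged by AIC
m_step <- step(m0, scope = ~ x1 + x2 + x3, direction = "both", trace = 0)
formula(m_step)
```

As the slide warns, such AIC-driven selection should be combined with visual checks rather than used on its own.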

Page 20

Investigate Interactions

• Financial Relation / Gender

[Figure: mosaic plot of happy by sex within finrela]

prodplot(happy, ~happy+sex+finrela, c("vspine","hspine","hspine"), subset=level==3)


Page 21

Investigate Interactions

• Financial Relation / Gender

[Figure: mosaic plot of happy by finrela within sex]

prodplot(happy, ~happy+finrela+sex, c("vspine","hspine","hspine"), subset=level==3)

Page 22

Effect plots

[Figure: effect plots of predicted probability (pred) vs. age (20-90), faceted by degree (bachelor, graduate, high school, junior college, lt high school) and marital status (divorced, married, never married, separated, widowed), with separate rows for female and male, colored by finrela (far below average to far above average)]

Page 23

Standardized Residuals

• Standardize by dividing by √(1 − hii), where hii are the leverage values:

• Hat matrix is result of iterative weighted fitting,

• with the weights determined by the link:

Page 24

Diagnostics

• Residual Plots

• Predictive Power (corresponds to R2)

• Deletion Statistics (Belsley, Kuh and Welsch (1980), Cook and Weisberg (1982)): dfbeta, dffits, covratio, cooks.distance

Page 25

Example: Alcohol during pregnancy

• Observational study: at 3 months of pregnancy, expectant mothers were asked about their average daily alcohol consumption; the infant was checked for malformation at birth


Page 26

Saturated Model
glm(formula = cbind(malformed, absent) ~ Alcohol, family = binomial())

Deviance Residuals: [1] 0 0 0 0 0

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -5.87364    0.14454 -40.637   <2e-16 ***
Alcohol<1   -0.06819    0.21743  -0.314   0.7538    
Alcohol1-2   0.81358    0.47134   1.726   0.0843 .  
Alcohol3-5   1.03736    1.01431   1.023   0.3064    
Alcohol>=6   2.26272    1.02368   2.210   0.0271 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance:  6.2020e+00  on 4  degrees of freedom
Residual deviance: -3.0775e-13  on 0  degrees of freedom
AIC: 28.627

Number of Fisher Scoring iterations: 4

Page 27

‘Linear’ Effect
glm(formula = cbind(malformed, absent) ~ as.numeric(Alcohol), family = binomial())

Deviance Residuals: 
      1        2        3        4        5  
 0.7302  -1.1983   0.9636   0.4272   1.1692  

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)    
(Intercept)          -6.2089     0.2873 -21.612   <2e-16 ***
as.numeric(Alcohol)   0.2278     0.1683   1.353    0.176    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 6.2020  on 4  degrees of freedom
Residual deviance: 4.4473  on 3  degrees of freedom
AIC: 27.074

Number of Fisher Scoring iterations: 5

levels: 1,2,3,4,5

Page 28

‘Linear’ Effect
glm(formula = cbind(malformed, absent) ~ as.numeric(Alcohol), family = binomial())

Deviance Residuals: 
      1        2        3        4        5  
 0.5921  -0.8801   0.8865  -0.1449   0.1291  

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)    
(Intercept)          -5.9605     0.1154 -51.637   <2e-16 ***
as.numeric(Alcohol)   0.3166     0.1254   2.523   0.0116 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 6.2020  on 4  degrees of freedom
Residual deviance: 1.9487  on 3  degrees of freedom
AIC: 24.576

Number of Fisher Scoring iterations: 4

levels: 0,0.5,1.5,4,7

Page 29

Scores

• Scores of categorical variables critically influence a model

• usually, scores will be given by data experts

• various choices, e.g. midpoints of interval variables

• assume default scores are values 1 to n
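The two ‘linear’ effect fits above can be reproduced from the malformation counts, showing how strongly the score choice matters (a sketch; the score vectors are the ones from pages 27 and 28):

```r
malformed <- c(48, 38, 5, 1, 1)
absent    <- c(17066, 14464, 788, 126, 37)
scores1   <- 1:5                    # default scores: category numbers
scores2   <- c(0, 0.5, 1.5, 4, 7)  # interval midpoints
m_s1 <- glm(cbind(malformed, absent) ~ scores1, family = binomial())
m_s2 <- glm(cbind(malformed, absent) ~ scores2, family = binomial())
round(c(deviance(m_s1), deviance(m_s2)), 4)  # 4.4473 vs 1.9487
```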

Page 30

Multinomial Models

• Response Y is categorical with J > 2 categories

• define πj(x) = P(Y=j | X=x)

• Baseline Category Model: pick one reference category i, express logits with respect to this reference:

5 Logit Models for Multinomial Responses

Let response variable Y be a nominal variable with J > 2 categories.

5.1 Baseline Category Logit Models

Pick one “special” category i, e.g. i = 1, i = J, or i the largest category of Y. Then define

$$\log \frac{\pi_j(x)}{\pi_i(x)} = \alpha_j + \beta_j^\top x \quad \text{for all } j = 1, \ldots, J \text{ and all } x.$$

It is enough to just look at the J − 1 differences; for categories a and b we get a comparison by

$$\log \frac{\pi_a(x)}{\pi_b(x)} = \log \frac{\pi_a(x)}{\pi_i(x)} - \log \frac{\pi_b(x)}{\pi_i(x)} = (\alpha_a - \alpha_b) + (\beta_a - \beta_b)^\top x.$$
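With nnet's multinom, this differenced comparison can be read off the coefficient matrix by subtracting rows; a sketch on simulated data (the category names are mine):

```r
library(nnet)
set.seed(4)
x <- rnorm(300)
Y <- factor(sample(c("i", "a", "b"), 300, replace = TRUE),
            levels = c("i", "a", "b"))  # "i" acts as baseline
fit <- multinom(Y ~ x, trace = FALSE)
co <- coef(fit)        # rows "a", "b": logits vs baseline "i"
co["a", ] - co["b", ]  # (alpha_a - alpha_b, beta_a - beta_b)
```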

Haberman: G2 and X2 are χ2 distributed if the data are categorical and not sparse; if the data are sparse or continuous, deviance differences between nested models are still χ2 distributed, provided the models differ in few parameters.

Example: Alligator - Food Choice  219 alligators were examined with respect to their primary food choice (fish, invertebrates, birds, reptiles, other). Explanatory variables are lake (4 categories), size (< 2.3m, > 2.3m) and gender. The full model then has the form

$$\log \frac{\pi_j(x)}{\pi_F(x)} = \alpha_j + \beta^L_{lj} + \beta^S_{sj} + \beta^G_{gj} + \beta^{LS}_{lsj} + \beta^{LG}_{lgj} + \beta^{SG}_{sgj} + \beta^{LSG}_{lsgj}, \quad \text{for } j = 1, \ldots, 4;$$

the number of parameters we estimate is then (in the above order):

$$(1 + 3 + 1 + 1 + 3 + 3 + 1 + 3) \cdot 4 = 16 \cdot 4 = 64.$$

The full model has 0 degrees of freedom:

> library(nnet)
> options(contrasts=c("contr.treatment","contr.poly"))
> fitS <- multinom(food~lake*size*gender, data=table.7.1)  # saturated model
# weights:  85 (64 variable)
initial  value 352.466903 
iter  10 value 261.200857
iter  20 value 245.788420
iter  30 value 244.090612
iter  40 value 243.812122
iter  50 value 243.801212
final  value 243.800899 
converged

Since we only have 219 observations on 80 cells, we deal with a sparse table problem. We are only able tocompare differences of deviances.

fit0 <- multinom(food~1, data=table.7.1)          # null
fit1 <- multinom(food~gender, data=table.7.1)     # G
fit2 <- multinom(food~size, data=table.7.1)       # S
fit3 <- multinom(food~lake, data=table.7.1)       # L
fit4 <- multinom(food~size+lake, data=table.7.1)  # L + S
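Deviance differences between such nested multinom fits can be compared by hand; a sketch on simulated data, since the alligator table table.7.1 itself is not reproduced here:

```r
library(nnet)
set.seed(5)
lake <- factor(sample(paste0("lake", 1:4), 400, replace = TRUE))
size <- factor(sample(c("small", "large"), 400, replace = TRUE))
food <- factor(sample(c("fish", "invert", "other"), 400, replace = TRUE))
fit0 <- multinom(food ~ 1, trace = FALSE)            # null
fit4 <- multinom(food ~ size + lake, trace = FALSE)  # L + S
ddev <- deviance(fit0) - deviance(fit4)
df   <- length(coef(fit4)) - length(coef(fit0))      # extra parameters
pchisq(ddev, df, lower.tail = FALSE)                 # chi-square comparison
```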


Page 31

Multinomial Model

• Choices for baseline: largest category gives most stable results

• R picks first level
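Releveling the response before fitting changes the baseline; for the alligator data this would be relevel(food, ref = "fish"). A toy sketch (data simulated, names mine) showing that predictions are unaffected by the choice:

```r
library(nnet)
set.seed(7)
food <- factor(sample(c("fish", "invert", "bird"), 300, replace = TRUE,
                      prob = c(0.5, 0.3, 0.2)))
size <- factor(sample(c("small", "large"), 300, replace = TRUE))
m_default <- multinom(food ~ size, trace = FALSE)  # baseline: "bird" (first level)
food2  <- relevel(food, ref = "fish")              # largest category as baseline
m_fish <- multinom(food2 ~ size, trace = FALSE)
# coefficients differ, fitted class probabilities agree
p1 <- fitted(m_default)
p2 <- fitted(m_fish)[, colnames(p1)]
max(abs(p1 - p2))
```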

Page 32

Example: Alligator Food

• 219 alligators from four lakes in Florida were examined with respect to their primary food choice: fish, invertebrates, birds, reptiles, other.

• Additionally, size of alligators (≤2.3m, >2.3m) and gender were recorded.

Page 33

> summary(alligator)
       ID            food        size      gender        lake   
 Min.   :  1.0   bird  :13   <2.3:124   f: 89   george  :63  
 1st Qu.: 55.5   fish  :94   >2.3: 95   m:130   hancock :55  
 Median :110.0   invert:61                      oklawaha:48  
 Mean   :110.0   other :32                      trafford:53  
 3rd Qu.:164.5   rep   :19                                   
 Max.   :219.0                                               

[Figure: mosaic plot of food choice (fish, bird, invert, other, rep) by size (<2.3, >2.3)]

Page 34


xtabs(~lake + food, data = alligator)

[Table: xtabs counts of food (fish, bird, invert, other, rep) by lake (george, hancock, oklawaha, trafford)]

Page 35

xtabs(~gender + food, data = alligator)

[Table: xtabs counts of food (fish, bird, invert, other, rep) by gender (f, m)]


Page 36

library(nnet)

• Brian Ripley’s nnet package allows fitting multinomial models:

library(nnet)
alli.main <- multinom(food~lake+size+gender, data=alligator)

Page 37

> summary(alli.main)
Call:
multinom(formula = food ~ lake + size + gender, data = alligator)

Coefficients:
       (Intercept) lakehancock lakeoklawaha laketrafford   size>2.3    genderm
bird    -2.4321397   0.5754699  -0.55020075     1.237216  0.7300740 -0.6064035
invert   0.1690702  -1.7805555   0.91304120     1.155722 -1.3361658 -0.4629388
other   -1.4309095   0.7667093   0.02603021     1.557820 -0.2905697 -0.2524299
rep     -3.4161432   1.1296426   2.53024945     3.061087  0.5571846 -0.6276217

Std. Errors:
       (Intercept) lakehancock lakeoklawaha laketrafford  size>2.3   genderm
bird     0.7706720   0.7952303    1.2098680    0.8661052 0.6522657 0.6888385
invert   0.3787475   0.6232075    0.4761068    0.4927795 0.4111827 0.3955162
other    0.5381162   0.5685673    0.7777958    0.6256868 0.4599317 0.4663546
rep      1.0851582   1.1928075    1.1221413    1.1297557 0.6466092 0.6852750

Residual Deviance: 537.8655 
AIC: 585.8655 
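summary() for multinom prints coefficients and standard errors but no test statistics; Wald z-values and p-values can be computed by hand from the printed numbers (intercepts copied from the output above):

```r
beta <- c(bird = -2.4321397, invert = 0.1690702,
          other = -1.4309095, rep = -3.4161432)  # (Intercept) column
se   <- c(bird = 0.7706720, invert = 0.3787475,
          other = 0.5381162, rep = 1.0851582)
z <- beta / se
p <- 2 * pnorm(-abs(z))  # two-sided Wald p-values
round(cbind(z = z, p = p), 3)
```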