Advanced Section #5: Generalized Linear Models: Logistic ... · •Whether a coin flip is heads or...

30
CS109A Introduction to Data Science Pavlos Protopapas, Kevin Rader, and Chris Tanner Advanced Section #5: Generalized Linear Models: Logistic Regression and Beyond 1 Nick Stern

Transcript of Advanced Section #5: Generalized Linear Models: Logistic ... · •Whether a coin flip is heads or...

Page 1: Advanced Section #5: Generalized Linear Models: Logistic ... · •Whether a coin flip is heads or tails (Bernoulli) •Counts of species in a given area (Poisson) •Time between

CS109A Introduction to Data SciencePavlos Protopapas, Kevin Rader, and Chris Tanner

Advanced Section #5:Generalized Linear Models:

Logistic Regression and Beyond

1

Nick Stern

Page 2: Advanced Section #5: Generalized Linear Models: Logistic ... · •Whether a coin flip is heads or tails (Bernoulli) •Counts of species in a given area (Poisson) •Time between

CS109A, PROTOPAPAS, RADER

Outline

1. Motivation

• Limitations of linear regression

2. Anatomy

• Exponential Dispersion Family (EDF)

• Link function

3. Maximum Likelihood Estimation for GLM’s

• Fischer Scoring

2

Page 3: Advanced Section #5: Generalized Linear Models: Logistic ... · •Whether a coin flip is heads or tails (Bernoulli) •Counts of species in a given area (Poisson) •Time between

CS109A, PROTOPAPAS, RADER

Motivation

3

Page 4: Advanced Section #5: Generalized Linear Models: Logistic ... · •Whether a coin flip is heads or tails (Bernoulli) •Counts of species in a given area (Poisson) •Time between

CS109A, PROTOPAPAS, RADER

Motivation

4

Linear regression framework:

𝑦" = 𝑥"%𝛽 + 𝜖"

Assumptions:

1. Linearity: Linear relationship between expected value and predictors

2. Normality: Residuals are normally distributed about expected value

3. Homoskedasticity: Residuals have constant variance 𝜎*

4. Independence: Observations are independent of one another

Page 5: Advanced Section #5: Generalized Linear Models: Logistic ... · •Whether a coin flip is heads or tails (Bernoulli) •Counts of species in a given area (Poisson) •Time between

CS109A, PROTOPAPAS, RADER

Motivation

5

Expressed mathematically…

• Linearity

𝔼 𝑦" = 𝑥"%𝛽

• Normality

𝑦" ∼ 𝒩(𝑥"%𝛽, 𝜎*)

• Homoskedasticity

𝜎* (instead of) 𝜎"*

• Independence

𝑝 𝑦"|𝑦3 = 𝑝(𝑦") for 𝑖 ≠ 𝑗

Page 6: Advanced Section #5: Generalized Linear Models: Logistic ... · •Whether a coin flip is heads or tails (Bernoulli) •Counts of species in a given area (Poisson) •Time between

CS109A, PROTOPAPAS, RADER

Motivation

6

What happens when our assumptions break down?

Page 7: Advanced Section #5: Generalized Linear Models: Logistic ... · •Whether a coin flip is heads or tails (Bernoulli) •Counts of species in a given area (Poisson) •Time between

CS109A, PROTOPAPAS, RADER

Motivation

7

We have options within the framework of linear regression

Transform X or Y

(Polynomial Regression)

Nonlinearity

Weight observations

(WLS Regression)

Heteroskedasticity

Page 8: Advanced Section #5: Generalized Linear Models: Logistic ... · •Whether a coin flip is heads or tails (Bernoulli) •Counts of species in a given area (Poisson) •Time between

CS109A, PROTOPAPAS, RADER

Motivation

8

But assuming Normality can be pretty limiting…

Consider modeling the following random variables:

• Whether a coin flip is heads or tails (Bernoulli)

• Counts of species in a given area (Poisson)

• Time between stochastic events that occur w/ constant rate (gamma)

• Vote counts for multiple candidates in a poll (multinomial)

Page 9: Advanced Section #5: Generalized Linear Models: Logistic ... · •Whether a coin flip is heads or tails (Bernoulli) •Counts of species in a given area (Poisson) •Time between

CS109A, PROTOPAPAS, RADER

Motivation

9

We can extend the framework for linear regression.

Enter:

Generalized Linear Models

Relaxes:

• Normality assumption

• Homoskedasticity assumption

Page 10: Advanced Section #5: Generalized Linear Models: Logistic ... · •Whether a coin flip is heads or tails (Bernoulli) •Counts of species in a given area (Poisson) •Time between

CS109A, PROTOPAPAS, RADER

Motivation

10

Page 11: Advanced Section #5: Generalized Linear Models: Logistic ... · •Whether a coin flip is heads or tails (Bernoulli) •Counts of species in a given area (Poisson) •Time between

CS109A, PROTOPAPAS, RADER

Anatomy

11

Page 12: Advanced Section #5: Generalized Linear Models: Logistic ... · •Whether a coin flip is heads or tails (Bernoulli) •Counts of species in a given area (Poisson) •Time between

CS109A, PROTOPAPAS, RADER

Anatomy

12

Two adjustments must be made to turn LM into GLM

1. Assume response variable comes from a family of distributions called the exponential dispersion family (EDF).

2. The relationship between expected value and predictors is expressed through a link function.

Page 13: Advanced Section #5: Generalized Linear Models: Logistic ... · •Whether a coin flip is heads or tails (Bernoulli) •Counts of species in a given area (Poisson) •Time between

CS109A, PROTOPAPAS, RADER

Anatomy – EDF Family

13

The EDF family contains: Normal, Poisson, gamma, and more!

The probability density function looks like this:

𝑓 𝑦"|𝜃" = exp𝑦"𝜃" − 𝑏 𝜃"

𝜙"+ 𝑐 𝑦", 𝜙"

Where

𝜃 - “canonical parameter”𝜙 - “dispersion parameter”𝑏 𝜃 - “cumulant function”𝑐 𝑦, 𝜙 - “normalization factor”

Page 14: Advanced Section #5: Generalized Linear Models: Logistic ... · •Whether a coin flip is heads or tails (Bernoulli) •Counts of species in a given area (Poisson) •Time between

CS109A, PROTOPAPAS, RADER

Anatomy – EDF Family

14

Example: representing Bernoulli distribution in EDF form.

PDF of a Bernoulli random variable:

𝑓 𝑦" 𝑝" = 𝑝"@A 1 − 𝑝" C D @A

Taking the log and then exponentiating (to cancel each other out) gives:

𝑓 𝑦" 𝑝" = exp 𝑦" log 𝑝" + 1 − 𝑦" log 1 − 𝑝"

Rearranging terms…

𝑓 𝑦" 𝑝" = exp 𝑦" log𝑝"

1 − 𝑝"+ log 1 − 𝑝"

Page 15: Advanced Section #5: Generalized Linear Models: Logistic ... · •Whether a coin flip is heads or tails (Bernoulli) •Counts of species in a given area (Poisson) •Time between

CS109A, PROTOPAPAS, RADER

Anatomy – EDF Family

15

Comparing:

𝑓 𝑦" 𝑝" = exp 𝑦" log𝑝"

1 − 𝑝"+ log 1 − 𝑝" 𝑓 𝑦"|𝜃" = exp

𝑦"𝜃" − 𝑏 𝜃"𝜙"

+ 𝑐 𝑦", 𝜙"vs.

Choosing:

𝜃" = log𝑝"

1 − 𝑝"𝜙" = 1

𝑏(𝜃") = log 1 + 𝑒IA

𝑐(𝑦", 𝜙") = 0

And we recover the EDF form of the Bernoulli distribution

Page 16: Advanced Section #5: Generalized Linear Models: Logistic ... · •Whether a coin flip is heads or tails (Bernoulli) •Counts of species in a given area (Poisson) •Time between

CS109A, PROTOPAPAS, RADER

Anatomy – EDF Family

16

The EDF family has some useful properties. Namely:

1. 𝔼 𝑦" ≡ 𝜇" = 𝑏N 𝜃"

2. 𝑉𝑎𝑟 𝑦" = 𝜙"𝑏NN 𝜃"(the proofs for these identities are in the notes)

Plugging in the values we obtained for Bernoulli, we get back:

𝔼 𝑦" = 𝑝" , 𝑉𝑎𝑟 𝑦" = 𝑝"(1 − 𝑝")

Page 17: Advanced Section #5: Generalized Linear Models: Logistic ... · •Whether a coin flip is heads or tails (Bernoulli) •Counts of species in a given area (Poisson) •Time between

CS109A, PROTOPAPAS, RADER

Anatomy – Link Function

17

Time to talk about the link function

Page 18: Advanced Section #5: Generalized Linear Models: Logistic ... · •Whether a coin flip is heads or tails (Bernoulli) •Counts of species in a given area (Poisson) •Time between

CS109A, PROTOPAPAS, RADER

Anatomy – Link Function

18

Recall from linear regression that:

𝜇" = 𝑥"%𝛽

Does this work for the Bernoulli distribution?

𝜇" = 𝑝" = 𝑥"%𝛽

Solution: wrap the expectation in a function called the link function:

𝑔 𝜇" = 𝑥"%𝛽 ≡ 𝜂"

*For the Bernoulli distribution, the link function is the “logit” function (hence “logistic” regression)

Page 19: Advanced Section #5: Generalized Linear Models: Logistic ... · •Whether a coin flip is heads or tails (Bernoulli) •Counts of species in a given area (Poisson) •Time between

CS109A, PROTOPAPAS, RADER

Anatomy – Link Function

19

Link functions are a choice, not a property. A good choice is:

1. Differentiable (implies “smoothness”)

2. Monotonic (guarantees invertibility)

1. Typically increasing so that 𝜇 increases w/ 𝜂

3. Expands the range of 𝜇 to the entire real line

Example: Logit function for Bernoulli

𝑔 𝜇" = 𝑔 𝑝" = log𝑝"

1 − 𝑝"

Page 20: Advanced Section #5: Generalized Linear Models: Logistic ... · •Whether a coin flip is heads or tails (Bernoulli) •Counts of species in a given area (Poisson) •Time between

CS109A, PROTOPAPAS, RADER

Anatomy – Link Function

20

Logit function for Bernoulli looks familiar…

𝑔 𝑝" = log𝑝"

1 − 𝑝"= 𝜃"

Choosing the link function by setting 𝜃" = 𝜂" gives us what is known as the “canonical link function.” Note:

𝜇" = 𝑏N 𝜃" → 𝜃" = 𝑏NDC(𝜇")(derivative of cumulant function must be invertible)

This choice of link, while not always effective, has some nice properties. Take STAT 149 to find out more!

Page 21: Advanced Section #5: Generalized Linear Models: Logistic ... · •Whether a coin flip is heads or tails (Bernoulli) •Counts of species in a given area (Poisson) •Time between

CS109A, PROTOPAPAS, RADER

Anatomy – Link Function

21

Here are some more examples (fun exercises at home)

Distribution 𝒇(𝒚𝒊|𝜽𝒊) Mean Function 𝝁𝒊 = 𝒃N(𝜽𝒊) Canonical Link 𝜽𝒊 = 𝒈(𝝁𝒊)

Normal 𝜃" 𝜇"

Bernoulli/Binomial 𝑒IA1 + 𝑒IA

log𝜇"

1 − 𝜇"

Poisson 𝑒IA log(𝜇")

Gamma−1𝜃"

−1𝜇"

Inverse Gaussian −2𝜃"DC*

−12𝜇"*

Page 22: Advanced Section #5: Generalized Linear Models: Logistic ... · •Whether a coin flip is heads or tails (Bernoulli) •Counts of species in a given area (Poisson) •Time between

CS109A, PROTOPAPAS, RADER

Maximum Likelihood Estimation

22

Page 23: Advanced Section #5: Generalized Linear Models: Logistic ... · •Whether a coin flip is heads or tails (Bernoulli) •Counts of species in a given area (Poisson) •Time between

CS109A, PROTOPAPAS, RADER

Maximum Likelihood Estimation

23

Recall from linear regression – we can estimate our parameters, 𝜃, by choosing those that maximize the likelihood, 𝐿 𝑦 𝜃), of the data, where:

𝐿 𝑦 𝜃 =^"

_

𝑝 𝑦" 𝜃"

In words: likelihood is the probability of observing a set of “N”independent datapoints, given our assumptions about the generative process.

Page 24: Advanced Section #5: Generalized Linear Models: Logistic ... · •Whether a coin flip is heads or tails (Bernoulli) •Counts of species in a given area (Poisson) •Time between

CS109A, PROTOPAPAS, RADER

Maximum Likelihood Estimation

24

For GLM’s we can plug in the PDF of the EDF family:

𝐿 𝑦 𝜃 =^"`C

_

exp𝑦"𝜃" − 𝑏 𝜃"

𝜙"+ 𝑐 𝑦", 𝜙"

How do we maximize this? Differentiate w.r.t. 𝜃 and set equal to 0. Taking the log first simplifies our life:

ℓ 𝑦 𝜃 = b"`C

_𝑦"𝜃" − 𝑏 𝜃"

𝜙"+ b

"`C

_

𝑐 𝑦", 𝜙"

Page 25: Advanced Section #5: Generalized Linear Models: Logistic ... · •Whether a coin flip is heads or tails (Bernoulli) •Counts of species in a given area (Poisson) •Time between

CS109A, PROTOPAPAS, RADER

Maximum Likelihood Estimation

25

Through lots of calculus & algebra (see notes), we can obtain the following form for the derivative of the log-likelihood:

ℓN 𝑦 𝜃 =b"`C

_1

𝑉𝑎𝑟 𝑦"𝜕𝜇"𝜕𝛽 (𝑦" − 𝜇")

Setting this sum equal to 0 gives us the generalized estimating equations:

b"`C

_1

𝑉𝑎𝑟 𝑦"𝜕𝜇"𝜕𝛽 (𝑦" − 𝜇") = 0

Page 26: Advanced Section #5: Generalized Linear Models: Logistic ... · •Whether a coin flip is heads or tails (Bernoulli) •Counts of species in a given area (Poisson) •Time between

CS109A, PROTOPAPAS, RADER

Maximum Likelihood Estimation

26

When we use the canonical link, this simplifies to the normal equations:

b"`C

_𝑦" − 𝜇" 𝑥"%

𝜙"= 0

Let’s attempt to solve the normal equations for the Bernoulli distribution. Plugging in 𝜇" and 𝜙" we get:

b"`C

_

𝑦" −𝑒dA

ef

1 − 𝑒dAef𝑥"% = 0

Page 27: Advanced Section #5: Generalized Linear Models: Logistic ... · •Whether a coin flip is heads or tails (Bernoulli) •Counts of species in a given area (Poisson) •Time between

CS109A, PROTOPAPAS, RADER

Maximum Likelihood Estimation

27

Sad news: we can’t isolate 𝛽 analytically.

Page 28: Advanced Section #5: Generalized Linear Models: Logistic ... · •Whether a coin flip is heads or tails (Bernoulli) •Counts of species in a given area (Poisson) •Time between

CS109A, PROTOPAPAS, RADER

Maximum Likelihood Estimation

28

Good news: we can approximate it numerically. One choice ofalgorithm is the Fisher Scoring algorithm.

In order to find the 𝜃 that maximizes the log-likelihood, ℓ(𝑦|𝜃):

1. Pick a starting value for our parameter, 𝜃g.

2. Iteratively update this value as follows:

𝜃"hC = 𝜃" −ℓN(𝜃")

𝔼 ℓNN 𝜃"

In words: perform gradient ascent with a learning rate inversely proportional to the expected curvature of the function at that point.

Page 29: Advanced Section #5: Generalized Linear Models: Logistic ... · •Whether a coin flip is heads or tails (Bernoulli) •Counts of species in a given area (Poisson) •Time between

CS109A, PROTOPAPAS, RADER

Maximum Likelihood Estimation

29

Here are the results of implementing the Fisher Scoring algorithm for simple logistic regression in python:

DEMO

Page 30: Advanced Section #5: Generalized Linear Models: Logistic ... · •Whether a coin flip is heads or tails (Bernoulli) •Counts of species in a given area (Poisson) •Time between

CS109A, PROTOPAPAS, RADER

Questions?

30