
CS109A Introduction to Data Science
Pavlos Protopapas, Kevin Rader, and Chris Tanner

Advanced Section #5: Generalized Linear Models: Logistic Regression and Beyond

Nick Stern


Outline

1. Motivation

• Limitations of linear regression

2. Anatomy

• Exponential Dispersion Family (EDF)

• Link function

3. Maximum Likelihood Estimation for GLMs

• Fisher Scoring


Motivation


Linear regression framework:

𝑦" = 𝑥"%𝛽 + 𝜖"

Assumptions:

1. Linearity: Linear relationship between expected value and predictors

2. Normality: Residuals are normally distributed about expected value

3. Homoskedasticity: Residuals have constant variance $\sigma^2$

4. Independence: Observations are independent of one another


Expressed mathematically…

• Linearity

$\mathbb{E}[y_i] = x_i^T \beta$

• Normality

$y_i \sim \mathcal{N}(x_i^T \beta, \sigma^2)$

• Homoskedasticity

$\sigma^2$ (instead of $\sigma_i^2$)

• Independence

$p(y_i \mid y_j) = p(y_i)$ for $i \neq j$


What happens when our assumptions break down?


We have options within the framework of linear regression:

• Nonlinearity: transform X or Y (polynomial regression)

• Heteroskedasticity: weight observations (WLS regression)


But assuming Normality can be pretty limiting…

Consider modeling the following random variables:

• Whether a coin flip is heads or tails (Bernoulli)

• Counts of species in a given area (Poisson)

• Time between stochastic events that occur w/ constant rate (gamma)

• Vote counts for multiple candidates in a poll (multinomial)


We can extend the framework of linear regression.

Enter: Generalized Linear Models (GLMs), which relax:

• the Normality assumption

• the Homoskedasticity assumption


Anatomy


Two adjustments must be made to turn a linear model into a GLM:

1. Assume the response variable comes from a family of distributions called the exponential dispersion family (EDF).

2. Express the relationship between the expected value and the predictors through a link function.


Anatomy – EDF Family


The EDF family contains the Normal, Poisson, gamma, and more!

The probability density function looks like this:

𝑓 𝑦"|𝜃" = exp𝑦"𝜃" − 𝑏 𝜃"

𝜙"+ 𝑐 𝑦", 𝜙"

Where

𝜃 - “canonical parameter”𝜙 - “dispersion parameter”𝑏 𝜃 - “cumulant function”𝑐 𝑦, 𝜙 - “normalization factor”
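As a quick sanity check on this form, here is a small sketch (not from the slides) that codes the EDF log-density generically and instantiates it for the Normal distribution, using the standard identifications $\theta = \mu$, $\phi = \sigma^2$, $b(\theta) = \theta^2/2$, and $c(y, \phi) = -\tfrac{1}{2}\left(y^2/\phi + \log(2\pi\phi)\right)$:

```python
import numpy as np

def edf_logpdf(y, theta, phi, b, c):
    """Log-density of the exponential dispersion family:
    log f(y | theta) = (y*theta - b(theta)) / phi + c(y, phi)."""
    return (y * theta - b(theta)) / phi + c(y, phi)

# Normal(mu, sigma^2): theta = mu, phi = sigma^2
mu, sigma2, y = 1.5, 4.0, 0.7
lp = edf_logpdf(
    y, theta=mu, phi=sigma2,
    b=lambda t: t**2 / 2,
    c=lambda y, phi: -(y**2 / phi + np.log(2 * np.pi * phi)) / 2,
)
# Agrees with the usual Normal log-density:
lp_direct = -(y - mu)**2 / (2 * sigma2) - np.log(2 * np.pi * sigma2) / 2
print(np.isclose(lp, lp_direct))   # True
```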


Example: representing the Bernoulli distribution in EDF form.

PDF of a Bernoulli random variable:

𝑓 𝑦" 𝑝" = 𝑝"@A 1 − 𝑝" C D @A

Taking the log and then exponentiating (to cancel each other out) gives:

𝑓 𝑦" 𝑝" = exp 𝑦" log 𝑝" + 1 − 𝑦" log 1 − 𝑝"

Rearranging terms…

𝑓 𝑦" 𝑝" = exp 𝑦" log𝑝"

1 − 𝑝"+ log 1 − 𝑝"


Comparing:

𝑓 𝑦" 𝑝" = exp 𝑦" log𝑝"

1 − 𝑝"+ log 1 − 𝑝" 𝑓 𝑦"|𝜃" = exp

𝑦"𝜃" − 𝑏 𝜃"𝜙"

+ 𝑐 𝑦", 𝜙"vs.

Choosing:

𝜃" = log𝑝"

1 − 𝑝"𝜙" = 1

𝑏(𝜃") = log 1 + 𝑒IA

𝑐(𝑦", 𝜙") = 0

And we recover the EDF form of the Bernoulli distribution
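One step worth spelling out is where $b(\theta_i) = \log(1 + e^{\theta_i})$ comes from. Inverting the choice of $\theta_i$:

$$\theta_i = \log\frac{p_i}{1 - p_i} \;\Rightarrow\; p_i = \frac{e^{\theta_i}}{1 + e^{\theta_i}} \;\Rightarrow\; 1 - p_i = \frac{1}{1 + e^{\theta_i}},$$

so the leftover term in the Bernoulli form is

$$\log(1 - p_i) = -\log(1 + e^{\theta_i}) = -b(\theta_i),$$

matching the $-b(\theta_i)/\phi_i$ term in the EDF form (with $\phi_i = 1$).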


The EDF family has some useful properties. Namely:

1. $\mathbb{E}[y_i] \equiv \mu_i = b'(\theta_i)$

2. $\mathrm{Var}(y_i) = \phi_i \, b''(\theta_i)$

(the proofs for these identities are in the notes)

Plugging in the values we obtained for Bernoulli, we get back:

𝔼 𝑦" = 𝑝" , 𝑉𝑎𝑟 𝑦" = 𝑝"(1 − 𝑝")


Anatomy – Link Function


Time to talk about the link function


Recall from linear regression that:

𝜇" = 𝑥"%𝛽

Does this work for the Bernoulli distribution?

$$\mu_i = p_i = x_i^T \beta$$

Not in general: $p_i$ must lie in $[0, 1]$, while $x_i^T \beta$ can take any value on the real line.

Solution: wrap the expectation in a function called the link function:

𝑔 𝜇" = 𝑥"%𝛽 ≡ 𝜂"

*For the Bernoulli distribution, the link function is the “logit” function (hence “logistic” regression)


Link functions are a choice, not a property. A good choice is:

1. Differentiable (implies “smoothness”)

2. Monotonic (guarantees invertibility)

• Typically increasing, so that $\mu$ increases with $\eta$

3. Maps the range of $\mu$ onto the entire real line

Example: Logit function for Bernoulli

𝑔 𝜇" = 𝑔 𝑝" = log𝑝"

1 − 𝑝"
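To see that this choice is invertible, solve $\eta_i = g(p_i)$ for $p_i$:

$$\eta_i = \log\frac{p_i}{1 - p_i} \;\Rightarrow\; p_i = \frac{e^{\eta_i}}{1 + e^{\eta_i}} = \frac{1}{1 + e^{-\eta_i}},$$

which is the sigmoid function, mapping the whole real line back into $(0, 1)$.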


Logit function for Bernoulli looks familiar…

𝑔 𝑝" = log𝑝"

1 − 𝑝"= 𝜃"

Choosing the link function by setting $\theta_i = \eta_i$ gives us what is known as the “canonical link function.” Note:

$$\mu_i = b'(\theta_i) \;\Rightarrow\; \theta_i = b'^{-1}(\mu_i)$$

(the derivative of the cumulant function must be invertible)

This choice of link, while not always effective, has some nice properties. Take STAT 149 to find out more!


Here are some more examples (fun exercises at home)

Distribution $f(y_i \mid \theta_i)$ | Mean Function $\mu_i = b'(\theta_i)$ | Canonical Link $\theta_i = g(\mu_i)$
---|---|---
Normal | $\theta_i$ | $\mu_i$
Bernoulli/Binomial | $\frac{e^{\theta_i}}{1 + e^{\theta_i}}$ | $\log\frac{\mu_i}{1 - \mu_i}$
Poisson | $e^{\theta_i}$ | $\log(\mu_i)$
Gamma | $-\frac{1}{\theta_i}$ | $-\frac{1}{\mu_i}$
Inverse Gaussian | $(-2\theta_i)^{-1/2}$ | $-\frac{1}{2\mu_i^2}$
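In the spirit of those at-home exercises, here is a quick numerical check (a sketch, not from the slides) that each canonical link inverts its mean function, i.e. $g(b'(\theta)) = \theta$:

```python
import numpy as np

# (mean function b'(theta), canonical link g(mu)) pairs from the table
pairs = {
    "Normal":           (lambda t: t,                           lambda m: m),
    "Bernoulli":        (lambda t: np.exp(t) / (1 + np.exp(t)), lambda m: np.log(m / (1 - m))),
    "Poisson":          (lambda t: np.exp(t),                   lambda m: np.log(m)),
    "Gamma":            (lambda t: -1 / t,                      lambda m: -1 / m),
    "Inverse Gaussian": (lambda t: (-2 * t) ** -0.5,            lambda m: -1 / (2 * m**2)),
}

theta = -0.7   # a test value valid for every family (negative, as Gamma/IG require)
for name, (mean_fn, link) in pairs.items():
    print(name, np.isclose(link(mean_fn(theta)), theta))   # True for each row
```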


Maximum Likelihood Estimation


Recall from linear regression: we can estimate our parameters, $\theta$, by choosing those that maximize the likelihood, $L(y \mid \theta)$, of the data, where:

$$L(y \mid \theta) = \prod_{i=1}^{N} p(y_i \mid \theta_i)$$

In words: the likelihood is the probability of observing a set of $N$ independent datapoints, given our assumptions about the generative process.


For GLMs we can plug in the PDF of the EDF family:

$$L(y \mid \theta) = \prod_{i=1}^{N} \exp\left(\frac{y_i \theta_i - b(\theta_i)}{\phi_i} + c(y_i, \phi_i)\right)$$

How do we maximize this? Differentiate w.r.t. $\theta$ and set equal to 0. Taking the log first simplifies our life:

$$\ell(y \mid \theta) = \sum_{i=1}^{N} \frac{y_i \theta_i - b(\theta_i)}{\phi_i} + \sum_{i=1}^{N} c(y_i, \phi_i)$$


Through lots of calculus & algebra (see notes), we can obtain the following form for the derivative of the log-likelihood:

$$\ell'(y \mid \theta) = \sum_{i=1}^{N} \frac{1}{\mathrm{Var}(y_i)} \frac{\partial \mu_i}{\partial \beta}\,(y_i - \mu_i)$$

Setting this sum equal to 0 gives us the generalized estimating equations:

b"`C

_1

𝑉𝑎𝑟 𝑦"𝜕𝜇"𝜕𝛽 (𝑦" − 𝜇") = 0


When we use the canonical link, this simplifies to the normal equations:

b"`C

_𝑦" − 𝜇" 𝑥"%

𝜙"= 0

Let’s attempt to solve the normal equations for the Bernoulli distribution. Plugging in $\mu_i$ and $\phi_i = 1$ we get:

b"`C

_

𝑦" −𝑒dA

ef

1 − 𝑒dAef𝑥"% = 0


Sad news: we can’t isolate $\beta$ analytically.


Good news: we can approximate it numerically. One choice of algorithm is the Fisher Scoring algorithm.

In order to find the $\theta$ that maximizes the log-likelihood, $\ell(y \mid \theta)$:

1. Pick a starting value for our parameter, $\theta_0$.

2. Iteratively update this value as follows:

$$\theta_{t+1} = \theta_t - \frac{\ell'(\theta_t)}{\mathbb{E}[\ell''(\theta_t)]}$$

In words: perform gradient ascent with a learning rate inversely proportional to the expected curvature of the log-likelihood at that point. (The negative expected curvature, $-\mathbb{E}[\ell'']$, is the Fisher information, which gives the algorithm its name.)


Here are the results of implementing the Fisher Scoring algorithm for simple logistic regression in Python:

DEMO
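The live demo itself is not reproduced in this transcript. For the canonical (logit) link, $-\mathbb{E}[\ell''] = X^T W X$ with $W_{ii} = \mu_i(1 - \mu_i)$, so Fisher scoring coincides with Newton-Raphson here. Below is a minimal sketch of what such an implementation might look like; the function name, simulated data, and parameter values are illustrative assumptions, not the original demo code.

```python
import numpy as np

def fisher_scoring(X, y, n_iter=25, tol=1e-10):
    """Fit logistic regression by Fisher scoring.

    X: (N, p) design matrix (include a column of ones for an intercept).
    y: (N,) binary responses in {0, 1}.
    """
    beta = np.zeros(X.shape[1])            # starting value: beta_0 = 0
    for _ in range(n_iter):
        eta = X @ beta                     # linear predictor eta_i = x_i^T beta
        mu = 1.0 / (1.0 + np.exp(-eta))    # inverse link: mu_i = sigmoid(eta_i)
        W = mu * (1.0 - mu)                # Var(y_i) = mu_i (1 - mu_i), phi_i = 1
        score = X.T @ (y - mu)             # l'(beta): left side of the normal equations
        fisher_info = X.T @ (W[:, None] * X)   # -E[l''(beta)] = X^T W X
        step = np.linalg.solve(fisher_info, score)
        beta = beta + step                 # beta_{t+1} = beta_t + I^{-1} score
        if np.max(np.abs(step)) < tol:     # stop once updates become negligible
            break
    return beta

# Illustrative usage on simulated data:
rng = np.random.default_rng(0)
N = 500
x = rng.normal(size=N)
X = np.column_stack([np.ones(N), x])       # intercept + one predictor
p = 1.0 / (1.0 + np.exp(-(X @ np.array([-0.5, 2.0]))))
y = rng.binomial(1, p)
print(fisher_scoring(X, y))                # should land near [-0.5, 2.0]
```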


Questions?
