
Bayesian Classification Methods

Suchit Mehrotra

North Carolina State University

[email protected]

October 24, 2014


How do you define probability?

There exists a fundamental difference between how Frequentists and Bayesians view probability:

Frequentist: Think of probability as choosing a random point in a Venn diagram and determining the percentage of times the point would fall in a certain set as you repeat this process an infinite number of times. Seeks to be "objective".

Bayesian: The probability assigned to an event varies from person to person. It is the condition under which a person would make a bet with their own money. Reflects a state of belief.


Conditional Probability and Bayes Rule

The fundamental mechanism through which a Bayesian updates their beliefs about a certain event is Bayes Rule. Let A and B be events.

Conditional Probability:

P(A|B) = P(A ∩ B) / P(B),  where P(B) > 0  =⇒  P(A ∩ B) = P(B|A)P(A)

Bayes Rule:

P(A|B) = P(B|A)P(A) / P(B)
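As a quick numeric illustration of the rule (the diagnostic-test numbers below are assumed for this sketch, not from the slides), Bayes Rule converts P(B|A) into P(A|B):

# Illustrative numbers, assumed for this sketch
p_A <- 0.01             # P(A): prevalence of a condition
p_B_given_A <- 0.95     # P(B|A): probability of a positive test given A
p_B_given_notA <- 0.05  # P(B|not A): false positive rate

# Law of total probability for P(B), then Bayes Rule for P(A|B)
p_B <- p_B_given_A * p_A + p_B_given_notA * (1 - p_A)
p_A_given_B <- p_B_given_A * p_A / p_B
p_A_given_B             # roughly 0.16, despite the accurate test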


Posterior Distributions

Let θ be a parameter of interest and y be observed data. The Frequentist approach considers θ to be fixed and the data to be random, whereas Bayesians view θ as a random variable and y as fixed.

The posterior distribution for θ is an application of Bayes Rule:

π(θ|y) = fY|θ(y|θ) fΘ(θ) / fY(y)

fΘ(θ) encapsulates our prior belief about the parameter. Our prior beliefs are then updated by the data we observe.

Since θ is treated as a random variable, we can now make probability statements about it.


Primary Goals

Frequentist:

Calculate point estimates for the parameters and use standard errors to calculate confidence intervals.

Bayesian:

Derive the posterior distribution of the parameter.

Summarize this distribution with information about the mean, median, mode, quantiles, variance, etc.

Use credible intervals to show ranges which contain the parameter with high probability.


Bayesian Learning

The Bayesian framework provides us with a mathematically rigorous way to update our beliefs as we gather new information.

There are no restrictions on what can constitute prior information. Therefore, we can use the posterior distribution from time 1 as our prior for time 2.

π1(θ|y1) ∝ fY|θ(y1|θ) fΘ(θ)

π2(θ|y1, y2) ∝ fY|θ(y2|θ) π1(θ|y1)

...

πn(θ|y1, . . . , yn) ∝ fY|θ(yn|θ) πn−1(θ|y1, . . . , yn−1)
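A minimal sketch of this updating chain, assuming a conjugate Beta prior with Bernoulli (0/1) data so that each posterior is available in closed form (toy data, not from the slides):

# Sequential updating: yesterday's posterior is today's prior.
# With a Beta(a, b) prior and 0/1 data y, the posterior is
# Beta(a + sum(y), b + length(y) - sum(y)).
a <- 1; b <- 1                                       # flat Beta(1, 1) prior
batches <- list(c(1, 0, 1), c(1, 1, 1, 0), c(0, 1))  # assumed toy batches y1, y2, y3

for (y in batches) {
  a <- a + sum(y)
  b <- b + length(y) - sum(y)
  cat(sprintf("posterior: Beta(%g, %g), mean = %.3f\n", a, b, a / (a + b)))
}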


Overview of Differences [Gill (2008)]

Interpretation of Probability

Frequentist: Observed result from an infinite series of trials performed or imagined under identical conditions. The probabilistic quantity of interest is p(data|H0).

Bayesian: Probability is the researcher/observer "degree of belief" before or after the data are observed. The probabilistic quantity of interest is p(θ|data).

What is Fixed and Variable

Frequentist: Data are an iid random sample from a continuous stream. Parameters are fixed by nature.

Bayesian: Data are observed and so fixed by the sample generated. Parameters are unknown and described distributionally.

How are Results Summarized?

Frequentist: Point estimates and standard errors. 95% confidence intervals indicating that 19/20 times the interval covers the true parameter value, on average.

Bayesian: Descriptions of posteriors such as means and quantiles. Highest posterior density intervals indicating the region of highest probability.


Linear Regression (Frequentist)

Consider the linear model y = Xβ + e, where X is an n × k matrix with rank k, β is a k × 1 vector of coefficients, and y is an n × 1 vector of responses. e is a vector of errors, distributed iid N(0, σ²I).

Frequentist regression seeks point estimates by maximizing the likelihood function with respect to β and σ².

L(β, σ² | X, y) = (2πσ²)^(−n/2) exp[ −(1/(2σ²)) (y − Xβ)ᵀ(y − Xβ) ]

The unbiased OLS estimates are:

β̂ = (XᵀX)⁻¹Xᵀy        σ̂² = (y − Xβ̂)ᵀ(y − Xβ̂) / (n − k)
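A quick sketch verifying these closed forms against lm() on simulated data (the simulation and variable names are mine, for illustration):

set.seed(1)
n <- 100; k <- 3
X <- cbind(1, matrix(rnorm(n * (k - 1)), n, k - 1))  # n x k design with intercept
beta <- c(2, -1, 0.5)
y <- X %*% beta + rnorm(n, sd = 1.5)

beta.hat   <- solve(t(X) %*% X, t(X) %*% y)          # (X'X)^{-1} X'y
sigma2.hat <- sum((y - X %*% beta.hat)^2) / (n - k)  # unbiased variance estimate

cbind(manual = as.vector(beta.hat),
      lm = coef(lm(y ~ X - 1)))                      # the two columns agree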


Bayesian Regression

Bayesian regression, on the other hand, finds the posterior distribution of the parameters β and σ².

π(β, σ² | X, y) ∝ L(β, σ² | X, y) fβ,σ²(β, σ²)
              ∝ L(β, σ² | X, y) fβ|σ²(β | σ²) fσ²(σ²)

Setup           Prior                       Posterior
Conjugate       β | σ² ∼ N(b, σ²)           β | X ∼ t(n + a − k)
                σ² ∼ IG(a, b)               σ² | X ∼ IG(n + a − k, ½ σ̂²(n + a − k))
Uninformative   β ∝ c over [−∞, ∞]          β | X ∼ t(n − k)
                σ² ∝ 1/σ over [0, ∞]        σ² | X ∼ IG(½(n − k − 1), ½ σ̂²(n − k))

Gill (2008)

Additionally, the posterior distribution of β conditioned on σ² in the non-informative case is:

β | σ², y ∼ N(β̂, σ²Vβ)  where  Vβ = (XᵀX)⁻¹
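This conditional structure suggests a simple composition sampler for the non-informative case: draw σ² from its marginal posterior, then β | σ² from the normal above. A minimal sketch (my own function and variable names; the blinreg function used in the next example is based on this same two-step scheme):

# sigma^2 | y is scaled inverse-chi^2 with n - k df; then beta | sigma^2, y
# is N(beta.hat, sigma^2 * V) with V = (X'X)^{-1}
posterior_draws <- function(X, y, m = 5000) {
  n <- nrow(X); k <- ncol(X)
  V        <- solve(t(X) %*% X)
  beta.hat <- V %*% t(X) %*% y
  s2       <- sum((y - X %*% beta.hat)^2) / (n - k)
  sigma2   <- (n - k) * s2 / rchisq(m, df = n - k)   # inverse-chi^2 draws
  beta <- t(sapply(sigma2, function(s)
    beta.hat + t(chol(s * V)) %*% rnorm(k)))         # beta | sigma^2 ~ N(beta.hat, s * V)
  list(beta = beta, sigma = sqrt(sigma2))
}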


Example: Median Home Prices

From the MASS library: data regarding 506 housing values in suburbs of Boston.

Fit a model with medv (median value of owner-occupied homes) as the response and four predictors.

Implementation with the blinreg function in the LearnBayes package.

library(MASS)        # Boston data
library(LearnBayes)  # blinreg

boston.fit <- lm(medv ~ crim + indus + chas + rm, data = Boston,
                 x = TRUE, y = TRUE)
theta.sample <- blinreg(boston.fit$y, boston.fit$x, 100000)
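From here, posterior summaries are just summaries of the draws. Assuming blinreg's return value is a list with a matrix of β draws ($beta) and a vector of σ draws ($sigma), 95% credible intervals are:

# 95% credible intervals from the simulated posterior
apply(theta.sample$beta, 2, quantile, probs = c(0.025, 0.975))
quantile(theta.sample$sigma, probs = c(0.025, 0.975))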


Example: Posterior Samples of Parameters

[Figure: histograms of the posterior samples for β1 (Crime), β2 (Industry), β3 (Charles River), β4 (Rooms), and σ (Std. Deviation).]


Example: Confidence Intervals for Parameters

Bayesian regression with non-informative priors leads to estimates similar to OLS.

OLS Confidence Intervals:

              2.5 %   97.5 %
(Intercept)  -26.43   -15.29
crim          -0.26    -0.12
indus         -0.35    -0.18
chas           2.48     6.65
rm             6.62     8.25

Bayesian Credible Intervals:

        Intercept    crim   indus   chas     rm
2.5%       -26.42   -0.26   -0.35   2.48   6.62
97.5%      -15.30   -0.12   -0.18   6.65   8.25


Logistic Regression

Like Linear Regression, Logistic Regression is also a likelihood maximization problem in the frequentist setup. When the number of parameters is two, the log-likelihood function is:

ℓ(β0, β1 | y) = β0 ∑ᵢ yᵢ + β1 ∑ᵢ xᵢyᵢ − ∑ᵢ log(1 + e^(β0 + β1xᵢ)),  with sums over i = 1, …, n
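As a check, this log-likelihood is direct to code and maximize; a sketch with simulated data (mine, for illustration) that reproduces glm():

# Two-parameter logistic log-likelihood, exactly as written above
loglik <- function(b, x, y) {
  b[1] * sum(y) + b[2] * sum(x * y) - sum(log(1 + exp(b[1] + b[2] * x)))
}

set.seed(2)
x <- rnorm(200)                             # toy predictor
y <- rbinom(200, 1, plogis(-0.5 + 1.2 * x))

fit <- optim(c(0, 0), loglik, x = x, y = y,
             control = list(fnscale = -1))  # fnscale = -1 makes optim maximize
rbind(optim = fit$par, glm = coef(glm(y ~ x, family = binomial)))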

In the Bayesian setting, we incorporate prior information and find the posterior distribution of the parameters:

π(β | X, y) ∝ L(β | X, y) fβ(β)

The standard "weakly-informative" prior used in the arm package is the Student-t distribution with 1 degree of freedom (the Cauchy distribution).


Logistic Regression: Example with Cancer Data Set

Goal: Run Logistic Regression on the cancer data set from the Universityof Wisconsin using the bayesglm function in the arm package.

library(arm)  # bayesglm

cancer.bayesglm <- bayesglm(malignant ~ ., data = cancer,
                            family = binomial(link = "logit"))

Using this function, you can get point estimates (posterior modes) for your parameters and their standard errors. These estimates will be the same as the output given by glm in most cases without substantive prior information.

                Estimate   Std. Error
(Intercept)      -8.2409       1.2842
id                0.0000       0.0000
thickness         0.4522       0.1270
unif.size         0.0063       0.1676
unif.shape        0.3642       0.1876
adhesion          0.2002       0.1067
cell.size         0.1460       0.1458
bare.nuclei1     -1.1499       0.8526

Some estimates from bayesglm


Logistic Regression: Example with Cancer Data Set

[Figure: fitted probability of malignant (0 to 1) as a function of thickness (2.5 to 10).]


Logistic Regression: Separation

Separation occurs when a particular linear combination of predictors is always associated with a particular response.

Can lead to maximum likelihood estimates of ∞ for some parameters!

A common fix is to drop predictors from the model.

Can lead to the best predictors being dropped from the model [Zorn (2005)].

A 2008 paper by Gelman, Jakulin, Pittau & Su proposes a Bayesian fix.

Use the t-distribution as the "weakly-informative" prior. They created bayesglm in the arm (Applied Regression and Multilevel Modeling) package with the t-distribution as the default prior.

Using bayesglm instead of glm can solve this problem!
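A small sketch reproducing the phenomenon on simulated data (assumed here for illustration; glm() warns and returns inflated estimates, bayesglm() does not):

library(arm)

set.seed(3)
x <- rnorm(60, mean = 2, sd = 2)
y <- as.numeric(x > 2)                              # perfectly separated at x = 2

glm.fit      <- glm(y ~ x, family = binomial)       # huge coefficients and SEs
bayesglm.fit <- bayesglm(y ~ x, family = binomial)  # Cauchy prior keeps them finite

rbind(glm = coef(glm.fit), bayesglm = coef(bayesglm.fit))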


Logistic Regression: Separation Example

[Figure: separated data; y is 0 for all x below a threshold and 1 above it, with x ranging from −2 to 6.]


Logistic Regression: Separation Example

This problem is fixed using bayesglm:

[Figure: the same separated data with the smooth fitted probability curve from bayesglm.]


Logistic Regression: Separation Example

glm fit:

               Estimate   Std. Error   z value   Pr(>|z|)
(Intercept)   -164.0925   69589.0320     -0.00     0.9981
x               75.6372   32263.2265      0.00     0.9981

bayesglm fit:

               Estimate   Std. Error
(Intercept)     -4.0144       0.9054
x                1.6192       0.3867


Naive Bayes: Overview

Like Linear Discriminant Analysis, Naive Bayes uses Bayes Rule to determine the probability that a particular outcome is in class Cℓ.

P(Y = Cℓ | X) = P(X | Y = Cℓ) P(Y = Cℓ) / P(X)

P(Y = Cℓ | X) is the posterior probability of the class given a set of predictors.

P(Y = Cℓ) is the prior probability of the outcome.

P(X) is the probability of the predictors.

P(X | Y = Cℓ) is the probability of observing a set of predictors given that Y belongs to class ℓ.


Naive Bayes: Independence Assumption

P(X | Y = Cℓ) is extremely difficult to calculate without a strong set of assumptions. The Naive Bayes model simplifies this calculation with the assumption that all predictors are mutually independent given Y.

P(Y = Cℓ | X1, …, Xn) = P(Y = Cℓ) P(X1, …, Xn | Y = Cℓ) / P(X1, …, Xn)
                      ∝ P(Y = Cℓ) P(X1, …, Xn | Y = Cℓ)
                      ∝ P(Y = Cℓ) P(X1 | Y = Cℓ) ⋯ P(Xn | Y = Cℓ)
                      ∝ P(Y = Cℓ) ∏ᵢ₌₁ⁿ P(Xi | Y = Cℓ)


Naive Bayes: Class Probabilities and Assignment

The classifier assigns the class which has the highest posterior probability:

argmax over Cℓ of  P(Y = Cℓ) ∏ᵢ₌₁ⁿ P(Xi = xi | Y = Cℓ)

We need to assume probabilities for the class priors. We have two basic options:

Equiprobable classes: P(Cℓ) = 1 / (number of classes)

Calculation from the training set: P(Cℓ) = (number of samples in class ℓ) / (total number of samples)

A third option is to use information based on prior beliefs about the probability that Y = Cℓ.
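A by-hand sketch of this argmax on the iris data, using Gaussian class-conditionals and training-set priors (illustrative; variable names are mine):

# log P(C) + sum_i log P(x_i | C), maximized over classes C
X   <- iris[, 1:4]
cls <- levels(iris$Species)
prior <- table(iris$Species) / nrow(iris)       # training-set class priors

log_post <- sapply(cls, function(cl) {
  Xc <- X[iris$Species == cl, ]
  log(prior[cl]) + rowSums(sapply(seq_along(X), function(j)
    dnorm(X[[j]], mean(Xc[[j]]), sd(Xc[[j]]), log = TRUE)))
})
pred <- cls[max.col(log_post)]                  # argmax over classes
table(actual = iris$Species, predicted = pred)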


Naive Bayes: Distributional Assumptions

Even though the independence assumption simplifies the classifier, we still need to make assumptions about the predictors.

In the case of continuous predictors, the class probabilities are calculated using a Normal Distribution by default.

The classifier estimates µℓ and σℓ for each predictor:

P(X = x | Y = Cℓ) = (1 / (σℓ√(2π))) exp[ −(x − µℓ)² / (2σℓ²) ]

Other choices include:

Multinomial

Bernoulli

Semi-supervised methods


Naive Bayes: Simple Example

Classify the Iris dataset using the NaiveBayes function in the klaR package.
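A minimal sketch of the fit (object names are mine; NaiveBayes uses Gaussian class-conditionals by default):

library(klaR)

iris.nb   <- NaiveBayes(Species ~ ., data = iris)
iris.pred <- predict(iris.nb, iris[, 1:4])       # predicted classes and posteriors
table(iris$Species, iris.pred$class)             # confusion matrix on the next slide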

[Figure: Iris Data — pairs plot of Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width.]


Naive Bayes: Simple Example

[Figure: estimated class-conditional densities of Petal.Length and Petal.Width for each species.]

             setosa   versicolor   virginica
setosa           50            0           0
versicolor        0           47           3
virginica         0            3          47


Naive Bayes: Non-Parametric Estimation

cancer.nb <- NaiveBayes(malignant ~ ., data = cancer, usekernel = TRUE)

usekernel = TRUE (97% correct):

       0     1
0    445     9
1     12   232

usekernel = FALSE (96% correct):

       0     1
0    434     6
1     23   235

[Figure: kernel density estimates of thickness (2 to 10) by class.]


Naive Bayes: Priors

Bayesian approaches allow for the inclusion of prior information into the model. In this context, priors are applied to the probability of belonging to a group.

P(0) = 0.1; P(1) = 0.9:

       0     1
0    442     3
1     15   238

P(0) = 0.25; P(1) = 0.75:

       0     1
0    443     4
1     14   237

P(0) = 0.5; P(1) = 0.5:

       0     1
0    444     7
1     13   234

P(0) = 0.9; P(1) = 0.1:

       0     1
0    445    18
1     12   223
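klaR's NaiveBayes accepts the class prior directly, so the matrices above correspond to refitting with different prior vectors. A sketch (the cancer data frame and its factor ordering are assumed):

library(klaR)

# prior follows the order of the factor levels of malignant: c(P(0), P(1))
cancer.nb.10 <- NaiveBayes(malignant ~ ., data = cancer,
                           prior = c(0.1, 0.9), usekernel = TRUE)
pred <- predict(cancer.nb.10, cancer)$class
table(actual = cancer$malignant, predicted = pred)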


Naive Bayes: Sepsis Example

Fit a Naive Bayes model on data regarding whether or not a child develops sepsis based on three predictors. Data from research done at Kaiser Permanente (courtesy of Dr. David Draper at UCSC).

Response is whether or not a child developed sepsis (66,846 observations).

This is an extremely bad outcome for a child. They either die or go into a permanent coma.

Two predictors, ANC and I2T, are derived from a blood test called the Complete Blood Count.

The third predictor is the baby's age in hours at which the blood test results were obtained.

In this scenario, our goal is likely to predict accurately which children will get sepsis. The overall classification rate is likely not as important.


Naive Bayes: Sepsis Example

[Figure: I2T vs. ANC scatterplots in three panels: Age ≤ 1, 1 < Age ≤ 4, and 4 < Age.]


Naive Bayes: Sepsis Example

As can be seen in the confusion matrices below, the overall classification rate of around 98.6% does not make up for the fact that we correctly classify the sepsis cases only 11% of the time.

usekernel = FALSE:

        0      1
0   65887    216
1     715     28

usekernel = TRUE:

        0      1
0   66589    239
1      13      5


Naive Bayes: Pros and Cons

Advantages:

Due to the independence assumption, there is no curse of dimensionality! We can fit models with p > n without any issues.

The independence assumption simplifies the math and speeds up the model-fitting process.

We can include prior information to increase accuracy.

We have the option of not making distributional assumptions about each predictor.

Disadvantages:

If the independence assumptions do not hold, this model can perform poorly.

The non-parametric approach can overfit the data.

Our prior information could be wrong.


Logistic Regression with Sepsis Data

How does Logistic regression fare with the Sepsis data?

Cutoff: 0.1, % Correct: 99.51%

        0      1
0   66510    233
1      92     11

Cutoff: 0.05, % Correct: 99.14%

        0      1
0   66245    217
1     357     27

Cutoff: 0.025, % Correct: 98.20%

        0      1
0   65582    180
1    1020     64

Cutoff: 0.01, % Correct: 94.91%

        0      1
0   63339    135
1    3263    109
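A sketch of how such a cutoff sweep might be produced (the sepsis data frame and its column names here are hypothetical):

# Assumed: data frame 'sepsis' with 0/1 response y and predictors ANC, I2T, age
sepsis.glm <- glm(y ~ ANC + I2T + age, data = sepsis, family = binomial)
p.hat <- predict(sepsis.glm, type = "response")

for (cutoff in c(0.1, 0.05, 0.025, 0.01)) {
  pred <- as.numeric(p.hat > cutoff)          # classify as sepsis above the cutoff
  cat(sprintf("Cutoff: %.3f, %% Correct: %.2f%%\n",
              cutoff, 100 * mean(pred == sepsis$y)))
  print(table(actual = sepsis$y, predicted = pred))
}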


The End

Questions?
