Transcript of nlp.chonbuk.ac.kr/PGM2019/slides_jbnu/BML.pdf (Chonbuk National University, 2019-03-13)

Page 1:

Bayesian Machine Learning

Seung-Hoon Na

Chonbuk National University

Page 2:

Bayesian Concept Learning

• Likelihood: p(D|h)

• Prior: 𝑝(ℎ)

– The mechanism by which background knowledge is brought to bear on a problem

• Without a prior, rapid learning is impossible

– Subjectivity: controversial, but quite useful

• Humans have not only different priors but also different hypothesis spaces

• Posterior: 𝑝(ℎ|𝐷)

• Posterior predictive distribution: Bayesian model averaging

Page 3:

Bayesian Concept Learning

• Number game [Tenenbaum ’99]

– Empirical predictive distribution averaged over 8 humans in the number game

Page 4:

Bayesian Concept Learning

• Number game: Prior, Likelihood, Posterior

Page 5:

Bayesian Concept Learning

• Number game: Predictive distributions for the model using the full hypothesis space

Page 6:

Beta-Binomial Model

• Want to estimate the probability of heads

• 𝑋𝑖 ∼ 𝐵𝑒𝑟(𝜃)

• Likelihood of D: p(D|θ) = θ^N1 (1 − θ)^N0

– 𝑁1: the number of heads in 𝐷

– 𝑁0, 𝑁1: Sufficient statistics

• p(D|θ) = p(s(D)|θ) = p(N1 heads in N trials | θ) = p((N1, N) | θ)

The probability of heads when tossing a biased coin

How biased is the coin?

Likelihood of the sufficient statistics
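Since the likelihood depends on D only through the sufficient statistics (N1, N0), any two datasets with the same counts yield the same likelihood. A minimal sketch:

```python
# Bernoulli likelihood: p(D|theta) = theta^N1 * (1 - theta)^N0.
# D enters only through the sufficient statistics (N1, N0).
def bernoulli_likelihood(data, theta):
    n1 = sum(data)           # N1: number of heads
    n0 = len(data) - n1      # N0: number of tails
    return theta ** n1 * (1 - theta) ** n0

# Two datasets with the same counts have the same likelihood.
assert bernoulli_likelihood([1, 1, 0], 0.6) == bernoulli_likelihood([0, 1, 1], 0.6)
```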

Page 7:

Beta-Binomial Model

• Conjugate prior: when the prior and the posterior have the same functional form

• Beta distribution: Conjugate prior for the Bernoulli

The parameters of the prior are called hyperparameters

Page 8:

Beta-Binomial Model

• Posterior

• Posterior mean and mode

• The posterior mean is a convex combination of the prior mean and the MLE

α0 = a + b: the equivalent sample size of the prior
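The convex-combination property can be checked numerically; this sketch assumes a Beta(a, b) prior and the standard update to Beta(a + N1, b + N0):

```python
# Beta(a, b) prior + Bernoulli likelihood -> Beta(a + N1, b + N0) posterior.
def posterior_mean(a, b, n1, n0):
    return (a + n1) / (a + b + n1 + n0)

a, b, n1, n0 = 2, 2, 3, 7
alpha0, n = a + b, n1 + n0          # alpha0: equivalent sample size of the prior
lam = alpha0 / (alpha0 + n)         # weight on the prior mean
convex = lam * (a / alpha0) + (1 - lam) * (n1 / n)   # prior mean vs. MLE
assert abs(posterior_mean(a, b, n1, n0) - convex) < 1e-12
```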

Page 9:

Dirichlet-Multinomial Model

• Generalizes to estimating the probability that a die with K sides comes up as face k

• Likelihood

• Prior

• Posterior

Generalizes from coin flips to estimating the probabilities of die faces

Page 10:

Dirichlet-Multinomial Model

• The MAP estimate

• The MLE

• Posterior predictive

Equivalent sample size

Page 11:

Dirichlet-Multinomial Model

• Language models using bag of words

– Bayesian method for smoothing: provides non-zero probability for unseen words

Mary had a little lamb, little lamb, little lamb, Mary had a little lamb, its fleece as white as snow

Posterior predictive

𝛼𝑗 = 1
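With a uniform Dirichlet prior (αj = 1), the posterior predictive reduces to add-one (Laplace) smoothing. A sketch on the slide's example sentence; the vocabulary below is chosen for illustration and includes an unseen word "big":

```python
# Dirichlet-multinomial posterior predictive with alpha_j = 1:
# p(word j | D) = (N_j + 1) / (N + K), i.e. add-one smoothing.
from collections import Counter

text = ("mary had a little lamb little lamb little lamb "
        "mary had a little lamb its fleece as white as snow")
vocab = ["mary", "lamb", "little", "big", "fleece", "white", "snow",
         "had", "a", "its", "as"]          # "big" never occurs in the text
counts = Counter(text.split())
n = sum(counts[w] for w in vocab)          # total observed words
k = len(vocab)
pred = {w: (counts[w] + 1) / (n + k) for w in vocab}

assert pred["big"] > 0                     # unseen word still gets mass
assert abs(sum(pred.values()) - 1) < 1e-12
```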

Page 12:

Naïve Bayes Classification

• Assume the features are conditionally independent given the class label.

• The class-conditional density depends on the type of each feature

– Real-valued features

– Binary features

– Categorical features

the class conditional density

Page 13:

Naïve Bayes Classification

• MLE for naïve Bayes classification

• The MLE for the class prior

• The MLE for the class-conditional probabilities with binary features
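The MLE formulas can be sketched directly: the class prior is the empirical class frequency, and each class-conditional probability is the fraction of class-c examples with the feature on. The toy data below is illustrative:

```python
# MLE for a binary-feature naive Bayes classifier (a minimal sketch):
# pi_c = N_c / N,  theta_jc = N_jc / N_c.
def nb_mle(X, y, n_classes):
    n = len(X)
    pi, theta = [], []
    for c in range(n_classes):
        Xc = [x for x, label in zip(X, y) if label == c]
        pi.append(len(Xc) / n)                         # class prior: N_c / N
        theta.append([sum(x[j] for x in Xc) / len(Xc)  # N_jc / N_c
                      for j in range(len(X[0]))])
    return pi, theta

X = [[1, 0], [1, 1], [0, 1], [0, 0]]
y = [0, 0, 1, 1]
pi, theta = nb_mle(X, y, 2)
assert pi == [0.5, 0.5]
assert theta[0] == [1.0, 0.5]   # class 0: feature 0 always on, feature 1 half on
```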

Page 14:

Bayesian Naïve Bayes

• MLE overfitting

• Bayesian approach: solution to overfitting

– The factored prior

– The factored posterior

Page 15:

Bayesian Naïve Bayes

• The predictive posterior

Page 16:

Exercise

• Classifying documents using bag of words

• http://nlp.jbnu.ac.kr/BML/BML_assignment_3.pdf

𝑥𝑖𝑗 = 1 iff word 𝑗 occurs in document 𝑖, otherwise 𝑥𝑖𝑗 = 0.

Bernoulli product model, or the binary independence model.

Page 17:

Gaussian Models

Page 18:

Gaussian Models

• MLE for a Gaussian

– If we have N iid samples 𝒙𝑖 ∼ 𝑁(𝝁, 𝚺), then the MLE for the parameters is given by

– That is, the MLE is just the empirical mean and empirical covariance. In the univariate case, we get the familiar results.

From N samples following a Gaussian distribution, the MLE parameters (mean, covariance) are the empirical mean and covariance computed from the samples.
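A quick numerical check that the Gaussian MLE is the empirical mean and the biased (N-normalized) empirical covariance:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))       # N iid samples (here from a standard Gaussian)

mu_hat = X.mean(axis=0)                               # empirical mean
Sigma_hat = (X - mu_hat).T @ (X - mu_hat) / len(X)    # empirical covariance (divide by N)

# Matches NumPy's biased (N-normalized) covariance estimate.
assert np.allclose(Sigma_hat, np.cov(X.T, bias=True))
```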

Page 19:

Matrix Differentiation

Useful identities, used to derive, e.g., the MLE for a Gaussian

Page 20:

Gaussian discriminant analysis

• Define the class-conditional densities in a generative classifier:

• Classify a new test vector:

If Σc is diagonal, this is equivalent to naïve Bayes

Called "discriminant" analysis, but it is a generative model

Page 21:

Quadratic discriminant analysis (QDA)

• The posterior over class labels:
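The class posterior is p(y=c|x) ∝ πc N(x|μc, Σc). A minimal 1-D sketch, with two symmetric classes chosen for illustration:

```python
import math

def gauss_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def qda_posterior(x, priors, mus, variances):
    # p(y=c|x) proportional to pi_c * N(x | mu_c, var_c), then normalize.
    scores = [p * gauss_pdf(x, m, v) for p, m, v in zip(priors, mus, variances)]
    z = sum(scores)
    return [s / z for s in scores]

# Symmetric classes: a point midway between the means gets posterior 0.5 each.
post = qda_posterior(0.0, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
assert abs(post[0] - 0.5) < 1e-12
```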

Page 22:

Linear discriminant analysis (LDA)

• The covariance matrices are tied or shared across classes: Σ𝑐 = Σ

When the covariance matrix is shared across classes, the decision boundary is linear

Page 23:

Linear discriminant analysis (LDA)

Page 24:

Multi-class logistic regression

• Directly obtain the class posterior:

Discriminative model
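The class posterior in multi-class logistic regression is a softmax of linear scores; a minimal sketch with hypothetical weights W:

```python
import math

def softmax(scores):
    m = max(scores)                          # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def class_posterior(x, W):
    # One (hypothetical) weight vector per class; scores are w_c . x.
    return softmax([sum(wj * xj for wj, xj in zip(w, x)) for w in W])

p = class_posterior([1.0, 2.0], [[0.1, 0.2], [0.3, -0.1], [0.0, 0.0]])
assert abs(sum(p) - 1.0) < 1e-12             # a valid class posterior
```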

Page 25:

Joint Gaussian Distribution

• Suppose x = (x1, x2) is jointly Gaussian with parameters

• Then the marginals are given by

• and the posterior conditional is given by
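The standard conditional formulas, x1|x2 ~ N(μ1 + Σ12 Σ22⁻¹(x2 − μ2), Σ11 − Σ12 Σ22⁻¹ Σ21), sketched in the 2-D case where each block is a scalar (numbers are illustrative):

```python
# Illustrative 2-D joint Gaussian with scalar blocks.
mu1, mu2 = 0.0, 1.0
S11, S12, S22 = 2.0, 0.8, 1.0

# Marginal of x1 is simply N(mu1, S11).
# Conditional of x1 given x2 = 2.0:
x2 = 2.0
mu_cond = mu1 + S12 / S22 * (x2 - mu2)        # conditioning shifts the mean
var_cond = S11 - S12 * S12 / S22              # conditioning shrinks the variance
assert abs(mu_cond - 0.8) < 1e-12
assert abs(var_cond - 1.36) < 1e-9
```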

Page 26:

Joint Gaussian Distribution

• Linear Gaussian systems

• Bayes rule for linear Gaussian systems

Page 27:

Bayesian Inference for Gaussian

• Inferring an unknown vector from noisy measurements

• Consider N vector-valued observations

• Gaussian prior

Effective observation

Page 28:

Bayesian Inference for Gaussian

• Inferring an unknown vector from noisy measurements

The sensor noise covariance Σy is known but x is unknown

Page 29:

The Wishart distribution

• The generalization of the Gamma distribution to positive definite matrices

• Inverse Wishart distribution

Degrees of freedom; scale matrix

used to model uncertainty in covariance matrices

The Wishart distribution is used as a prior over precision matrices

Page 30:

Bayesian Inference for MVN: p(μ|D, Σ)

• Likelihood

• Prior of 𝝁

• Posterior distribution of 𝝁

N sample data
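In the scalar case, the posterior for μ with known observation variance combines precisions additively; a minimal sketch with illustrative numbers:

```python
# Posterior for a Gaussian mean with known observation variance (1-D):
# precisions add, and the posterior mean weights the prior mean and the
# sample mean by their precisions.
def gauss_mean_posterior(mu0, var0, xbar, var_obs, n):
    prec_post = 1 / var0 + n / var_obs
    var_post = 1 / prec_post
    mu_post = var_post * (mu0 / var0 + n * xbar / var_obs)
    return mu_post, var_post

mu_post, var_post = gauss_mean_posterior(0.0, 1.0, 2.0, 1.0, 4)
assert abs(mu_post - 1.6) < 1e-12    # sample mean 2.0 shrunk toward prior mean 0.0
assert abs(var_post - 0.2) < 1e-12   # more data -> lower posterior variance
```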

Page 31:

Bayesian Inference for MVN: p(Σ|D, μ)

• Likelihood

• Prior of Σ: the inverse Wishart distribution

• Posterior distribution

Page 32:

Bayesian Inference for MVN: p(μ, Σ|D)

• Likelihood

Page 33:

Bayesian Inference for MVN: p(μ, Σ|D)

• Consider a prior

– But this is not conjugate to the likelihood

– Instead Semi-conjugate or conditionally conjugate

• p(μ|Σ) and p(Σ|μ) are individually conjugate

Page 34:

Bayesian Inference for MVN: p(μ, Σ|D)

• A full conjugate prior: Normal-inverse-Wishart (NIW)

Compare with the likelihood expression

Page 35:

Bayesian Inference for MVN: p(μ, Σ|D)

• Posterior: the posterior given N observations x1 … xN is also NIW
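The NIW hyperparameter updates for the mean-related parameters can be sketched in the scalar case; the scale-matrix update is omitted for brevity, and the numbers are illustrative:

```python
# NIW posterior updates for the mean-related hyperparameters (scalar sketch):
# kappa_N = kappa0 + N,  m_N = (kappa0*m0 + N*xbar) / kappa_N,  nu_N = nu0 + N.
def niw_update(m0, kappa0, nu0, xbar, n):
    kappa_n = kappa0 + n
    m_n = (kappa0 * m0 + n * xbar) / kappa_n    # prior mean pulled toward xbar
    nu_n = nu0 + n
    return m_n, kappa_n, nu_n

m_n, kappa_n, nu_n = niw_update(0.0, 1.0, 3.0, 2.0, 4)
assert abs(m_n - 1.6) < 1e-12      # between prior mean 0.0 and sample mean 2.0
```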

Page 36:

Bayesian Inference for MVN: p(μ, Σ|D)

Page 37:

Bayesian Statistics

• Use the posterior distribution to summarize everything we know about a set of unknown variables