2D1431 Machine Learning
Bayesian Learning
Outline
- Bayes theorem
- Maximum likelihood (ML) hypothesis
- Maximum a posteriori (MAP) hypothesis
- Naïve Bayes classifier
- Bayes optimal classifier
- Bayesian belief networks
- Expectation maximization (EM) algorithm
Handwritten character classification
Gray-level pictures: object classification
Gray-level pictures: human action classification
Literature & Software

- T. Mitchell: chapter 6
- S. Russell & P. Norvig, “Artificial Intelligence – A Modern Approach”: chapters 14+15
- R.O. Duda, P.E. Hart, D.G. Stork, “Pattern Classification, 2nd ed.”: chapters 2+3
- David Heckerman, “A Tutorial on Learning with Bayesian Belief Networks”: http://ftp.research.microsoft.com/pub/tr/tr-95-06.pdf
- Bayes Net Toolbox for Matlab (free), Kevin Murphy: http://www.cs.berkeley.edu/~murphyk/Bayes/bnt.html
Bayes Theorem
P(h|D) = P(D|h) P(h) / P(D)
- P(h|D): posterior probability of the hypothesis h given the data D (posterior)
- P(D|h): probability of the data D given the hypothesis h (likelihood of the data)
- P(h): prior probability of the hypothesis h (prior)
- P(D): prior probability of the data D (evidence)
Bayes Theorem
P(h|D) = P(D|h) P(h) / P(D)
posterior = likelihood x prior / evidence
By observing the data D we can convert the prior probability P(h) into the a posteriori probability (posterior) P(h|D).
The posterior is the probability that h holds after the data D has been observed.
The evidence P(D) can be viewed merely as a scale factor that guarantees that the posterior probabilities sum to one.
Choosing Hypotheses

P(h|D) = P(D|h) P(h) / P(D)

Generally we want the most probable hypothesis given the training data: the maximum a posteriori hypothesis hMAP.

hMAP = argmax h∈H P(h|D)
     = argmax h∈H P(D|h) P(h) / P(D)
     = argmax h∈H P(D|h) P(h)

If the hypothesis priors are equally likely, P(hi) = P(hj), then one can choose the maximum likelihood (ML) hypothesis

hML = argmax h∈H P(D|h)
Bayes Theorem Example

A patient takes a lab test and the result is positive. The test returns a correct positive (⊕) result in 98% of the cases in which the disease is actually present, and a correct negative (⊖) result in 97% of the cases in which the disease is not present. Furthermore, 0.8% of the entire population have the disease.

Hypotheses: disease, ¬disease
Priors P(h): P(disease) = 0.008, P(¬disease) = 0.992
Likelihoods P(D|h): P(⊕|disease) = 0.98, P(⊖|disease) = 0.02, P(⊕|¬disease) = 0.03, P(⊖|¬disease) = 0.97

Maximum posterior argmax P(h|D):
P(disease|⊕) ∝ P(⊕|disease) P(disease) = 0.0078
P(¬disease|⊕) ∝ P(⊕|¬disease) P(¬disease) = 0.0298
Normalizing:
P(disease|⊕) = 0.0078/(0.0078+0.0298) = 0.21
P(¬disease|⊕) = 0.0298/(0.0078+0.0298) = 0.79
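The arithmetic of this example can be checked with a few lines of Python (the variable names are ours):

```python
# Sketch of the lab-test example above, using the stated rates.
p_disease = 0.008                      # prior P(disease)
p_pos_given_disease = 0.98             # likelihood P(+|disease)
p_pos_given_no_disease = 0.03          # likelihood P(+|~disease)

# Unnormalized posteriors: P(h|+) is proportional to P(+|h) P(h)
joint_disease = p_pos_given_disease * p_disease                # 0.0078
joint_no_disease = p_pos_given_no_disease * (1 - p_disease)    # 0.0298

evidence = joint_disease + joint_no_disease                    # P(+), the scale factor
print(joint_disease / evidence)        # P(disease|+)  ~ 0.21
print(joint_no_disease / evidence)     # P(~disease|+) ~ 0.79
```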
Basic Formulas for Probabilities

- Product rule: P(A ∧ B) = P(A|B) P(B) = P(B|A) P(A)
- Sum rule: P(A ∨ B) = P(A) + P(B) − P(A ∧ B)
- Theorem of total probability: if A1, A2, …, An are mutually exclusive events with Σi P(Ai) = 1, then P(B) = Σi P(B|Ai) P(Ai)
Bayes Theorem Example
P(x1,x2|μ1,μ2,σ) = 1/(2πσ²) exp(−Σi (xi−μi)²/2σ²)

h = {μ1, μ2, σ}

D = {x1,…,xm}
Gaussian Probability Function
P(D|μ1,μ2,σ) = Πm P(xm|μ1,μ2,σ)

Maximum likelihood hypothesis hML:

hML = argmax μ1,μ2,σ P(D|μ1,μ2,σ)

Trick: maximize the log-likelihood

log P(D|μ1,μ2,σ) = Σm log P(xm|μ1,μ2,σ)
= Σm log [1/(2πσ²) exp(−Σi (xmi−μi)²/2σ²)]
= −M log(2πσ²) − Σm Σi (xmi−μi)²/2σ²
Gaussian Probability Function
∂ log P(D|μ1,μ2,σ)/∂μi = 0
⟹ Σm (xmi − μi) = 0
⟹ μiML = 1/M Σm xmi = E[xmi]

∂ log P(D|μ1,μ2,σ)/∂σ = 0
⟹ σ²ML = Σm Σi (xmi − μi)²/2M = E[Σi (xmi − μi)²]/2

Maximum likelihood hypothesis hML = {μiML, σML}
Maximum Likelihood Hypothesis
μML = (0.20, −0.14), σML = 1.42
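These closed-form estimates are easy to verify numerically; a minimal numpy sketch on synthetic data (the sample, seed, and true parameters are invented, chosen to echo the values above):

```python
import numpy as np

# Hypothetical sample from an isotropic 2D Gaussian; data are illustrative only.
rng = np.random.default_rng(0)
X = rng.normal(loc=[0.2, -0.14], scale=1.4, size=(500, 2))  # M x 2 examples

M = X.shape[0]
mu_ml = X.mean(axis=0)                      # mu_i^ML = (1/M) sum_m x_mi
# sigma^2_ML = sum_m sum_i (x_mi - mu_i)^2 / (2M)   (2 = number of dimensions)
sigma2_ml = ((X - mu_ml) ** 2).sum() / (2 * M)
print(mu_ml, np.sqrt(sigma2_ml))            # close to (0.2, -0.14) and 1.4
```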
Bayes Decision Rule

[Figure: x = examples of class c1, o = examples of class c2, with fitted Gaussians {μ1,σ1} and {μ2,σ2}]
Bayes Decision Rule
Assume we have two Gaussian distributions associated with two separate classes c1, c2:

P(x|ci) = P(x|μi,σi) = 1/(2πσi²) exp(−Σj (xj−μij)²/2σi²)

Bayes decision rule (maximum posterior probability): decide c1 if P(c1|x) > P(c2|x), otherwise decide c2. If P(c1) = P(c2), use the maximum likelihood P(x|ci); else use the maximum posterior P(ci|x) ∝ P(x|ci) P(ci).
Bayes Decision Rule

[Figure: decision regions assigned to c1 and c2 by the rule above]
Two-Category Case

Discriminant function: if g(x) > 0 then c1 else c2

g(x) = P(c1|x) − P(c2|x)
     = P(x|c1) P(c1) − P(x|c2) P(c2)

or equivalently

g(x) = log P(c1|x) − log P(c2|x)
     = log P(x|c1)/P(x|c2) + log P(c1)/P(c2)

For Gaussian probability functions with identical σi:

g(x) = (x−μ2)²/2σ² − (x−μ1)²/2σ² + log P(c1) − log P(c2)

The decision surface is a line/hyperplane.
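The final expression translates directly into code; a sketch under the slide's assumptions (isotropic Gaussians with shared σ; the function name is ours):

```python
import numpy as np

def g(x, mu1, mu2, sigma, p_c1, p_c2):
    """Log-odds discriminant for two isotropic Gaussians with shared sigma.
    Decide c1 if g(x) > 0, else c2."""
    x, mu1, mu2 = np.asarray(x), np.asarray(mu1), np.asarray(mu2)
    return (((x - mu2) ** 2).sum() / (2 * sigma ** 2)
            - ((x - mu1) ** 2).sum() / (2 * sigma ** 2)
            + np.log(p_c1) - np.log(p_c2))

# A point halfway between equally likely classes lies on the decision surface:
print(g([0.0, 0.0], [1.0, 0.0], [-1.0, 0.0], 1.0, 0.5, 0.5))   # 0.0
```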
Learning a Real-Valued Function

Consider a real-valued target function f and noisy training examples <xi,di>:

di = f(xi) + ei

where ei is a random variable drawn from a Gaussian distribution with zero mean. The maximum likelihood hypothesis hML is the one that minimizes the sum of squared errors:

hML = argmin h∈H Σi (di − h(xi))²

[Figure: target function f, noisy examples, and hypothesis hML]
Learning a Real-Valued Function

hML = argmax h∈H P(D|h)
    = argmax h∈H Πi P(di|h)
    = argmax h∈H Πi (2πσ²)^(−1/2) exp(−(di−h(xi))²/2σ²)

Maximizing the logarithm log P(D|h):

hML = argmax h∈H Σi [−½ log(2πσ²) − (di−h(xi))²/2σ²]
    = argmax h∈H Σi −(di − h(xi))²
    = argmin h∈H Σi (di − h(xi))²
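The equivalence can be checked concretely with an ordinary least-squares fit; here the target f(x) = 2x + 1 and the noise level are invented for illustration:

```python
import numpy as np

# Noisy samples of a hypothetical target f(x) = 2x + 1 with Gaussian noise e_i.
rng = np.random.default_rng(1)
x = np.linspace(0, 1, 50)
d = 2 * x + 1 + rng.normal(scale=0.1, size=x.shape)   # d_i = f(x_i) + e_i

# For h(x) = w1*x + w0, minimizing sum_i (d_i - h(x_i))^2 gives h_ML;
# np.polyfit solves exactly this least-squares problem.
w1, w0 = np.polyfit(x, d, deg=1)
print(w1, w0)   # close to the true 2 and 1
```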
Learning to Predict Probabilities
Predicting the survival probability of a patient. Training examples <xi,di>, where di is 0 or 1. Objective: train a neural network to output the probability h(xi) = P(di=1) given xi.

Maximum likelihood hypothesis:

hML = argmax h∈H Σi di ln h(xi) + (1−di) ln (1−h(xi))

i.e. maximize this sum, the negative cross entropy between di and h(xi).

Weight update rule for synapses wk to the output neuron h(xi):

wk ← wk + η Σi (di − h(xi)) xik

Compare to the standard BP weight update rule:

wk ← wk + η Σi h(xi)(1 − h(xi)) (di − h(xi)) xik
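A side-by-side sketch of the two batch rules (the learning rate η, array shapes, and function names are our assumptions):

```python
import numpy as np

def ml_update(w, X, d, h, eta=0.1):
    """ML (cross-entropy) rule: w_k += eta * sum_i (d_i - h(x_i)) x_ik.
    X: (m, k) inputs, d: (m,) targets, h: (m,) network outputs."""
    return w + eta * X.T @ (d - h)

def squared_error_update(w, X, d, h, eta=0.1):
    """Standard BP rule, with the extra sigmoid-derivative factor h(1-h)."""
    return w + eta * X.T @ (h * (1 - h) * (d - h))
```

The only difference is the h(xi)(1−h(xi)) factor, which the cross-entropy objective cancels against the sigmoid derivative.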
Most Probable Classification
So far we have sought the most probable hypothesis hMAP. But what is the most probable classification of a new instance x given the data D? hMAP(x) is not necessarily the most probable classification, although it is often a sufficiently good approximation of it.

Consider three possible hypotheses:
P(h1|D) = 0.4, P(h2|D) = 0.3, P(h3|D) = 0.3
Given a new instance x: h1(x) = +, h2(x) = −, h3(x) = −

hMAP(x) = h1(x) = +, but the most probable classification is −:
P(+) = P(h1|D) = 0.4
P(−) = P(h2|D) + P(h3|D) = 0.6
Bayes Optimal Classifier
cmax = argmax cj∈C Σ hi∈H P(cj|hi) P(hi|D)

Example:
P(h1|D) = 0.4, P(h2|D) = 0.3, P(h3|D) = 0.3
P(+|h1) = 1, P(−|h1) = 0
P(+|h2) = 0, P(−|h2) = 1
P(+|h3) = 0, P(−|h3) = 1

Therefore
Σ hi∈H P(+|hi) P(hi|D) = 0.4
Σ hi∈H P(−|hi) P(hi|D) = 0.6
argmax cj∈C Σ hi∈H P(cj|hi) P(hi|D) = −
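The same sum can be computed directly; a minimal sketch of the example (the representation is our choice):

```python
# Hypothesis posteriors and per-hypothesis classifications from the slide.
p_h_given_D = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
p_c_given_h = {"h1": {"+": 1.0, "-": 0.0},
               "h2": {"+": 0.0, "-": 1.0},
               "h3": {"+": 0.0, "-": 1.0}}

# score(c) = sum_h P(c|h) P(h|D)
score = {c: sum(p_c_given_h[h][c] * p_h_given_D[h] for h in p_h_given_D)
         for c in ("+", "-")}
print(max(score, key=score.get), score)   # '-' with 0.6
```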
MAP vs. Bayes Method
The maximum a posteriori hypothesis estimates a single point hMAP in the hypothesis space H. The Bayes method instead estimates and uses the complete distribution P(h|D). The difference appears when MAP and the Bayes method are used for inference on unseen instances and one compares the distributions P(x|D):

MAP: P(x|D) = P(x|hMAP) with hMAP = argmax h∈H P(h|D)
Bayes: P(x|D) = Σ hi∈H P(x|hi) P(hi|D)

For reasonable prior distributions P(h), the MAP and Bayes solutions are equivalent in the asymptotic limit of infinite training data D.
Naïve Bayes Classifier
- popular, simple learning algorithm
- moderate or large training set available
- assumption: attributes that describe instances are conditionally independent given the classification (in practice it works surprisingly well even if the assumption is violated)

Applications: diagnosis; text classification (20 newsgroups, 1000 documents per newsgroup, classification accuracy 89%)
Naïve Bayes Classifier

Assume a discrete target function f: X → C, where each instance x is described by attributes <a1,a2,…,an>. The most probable value of f(x) is:

cMAP = argmax cj∈C P(cj|<a1,a2,…,an>)
     = argmax cj∈C P(<a1,a2,…,an>|cj) P(cj) / P(<a1,a2,…,an>)
     = argmax cj∈C P(<a1,a2,…,an>|cj) P(cj)

Naïve Bayes assumption: P(<a1,a2,…,an>|cj) = Πi P(ai|cj)

cNB = argmax cj∈C P(cj) Πi P(ai|cj)
Naïve Bayes Learning Algorithm
Naïve_Bayes_Learn(examples):
  for each target value cj: estimate P(cj)
  for each value ai of each attribute a: estimate P(ai|cj)

Classify_New_Instance(x):
  cNB = argmax cj∈C P(cj) Π ai∈x P(ai|cj)
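A runnable sketch of this pseudocode, with our own data representation (each example is a dict of attribute values plus a class label); probabilities are estimated as raw frequencies here, so see the m-estimate below for the nc = 0 case:

```python
from collections import Counter, defaultdict

def naive_bayes_learn(examples):
    """examples: list of (attrs_dict, class_label) pairs."""
    class_counts = Counter(c for _, c in examples)
    priors = {c: class_counts[c] / len(examples) for c in class_counts}
    # counts[c][(attr, value)] = number of class-c examples with attr=value
    counts = defaultdict(Counter)
    for attrs, c in examples:
        counts[c].update(attrs.items())
    likelihoods = {c: {av: k / class_counts[c] for av, k in counts[c].items()}
                   for c in counts}
    return priors, likelihoods

def classify_new_instance(x, priors, likelihoods):
    def score(c):
        p = priors[c]
        for av in x.items():          # product over attribute values of x
            p *= likelihoods[c].get(av, 0.0)
        return p
    return max(priors, key=score)
```

On the PlayTennis data of the next slide, this should reproduce the 0.005 vs. 0.021 scores and hence cNB = no.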
Naïve Bayes Example
Consider PlayTennis and the new instance
<Outlook=sunny, Temp=cool, Humidity=high, Wind=strong>

Compute cNB = argmax cj∈C P(cj) Π ai∈x P(ai|cj)

playtennis (9+, 5−): P(yes) = 9/14, P(no) = 5/14
wind=strong (3+, 3−): P(strong|yes) = 3/9, P(strong|no) = 3/5
…
P(yes) P(sun|yes) P(cool|yes) P(high|yes) P(strong|yes) = 0.005
P(no) P(sun|no) P(cool|no) P(high|no) P(strong|no) = 0.021
⟹ cNB = no
Estimating Probabilities
What if none (nc = 0) of the training instances with target value cj have attribute value ai? Then

P(ai|cj) = nc/n = 0 and P(cj) Π ai∈x P(ai|cj) = 0

Solution: Bayesian estimate for P(ai|cj):

P(ai|cj) = (nc + mp)/(n + m)

- n: number of training examples for which c = cj
- nc: number of examples for which c = cj and a = ai
- p: prior estimate of P(ai|cj)
- m: weight given to the prior (number of “virtual” examples)
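As a one-function sketch (names ours), the m-estimate simply replaces the raw frequency:

```python
def m_estimate(nc, n, p, m):
    """Bayesian (m-)estimate of P(ai|cj): (nc + m*p) / (n + m)."""
    return (nc + m * p) / (n + m)

# e.g. nc = 0 observed among n = 9 examples, uniform prior p = 1/3 over
# 3 attribute values, m = 3 "virtual" examples:
print(m_estimate(0, 9, 1/3, 3))   # 0.083..., instead of 0
```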
Bayesian Belief Networks
- the naïve assumption of conditional independence is too restrictive
- the full probability distribution is intractable due to lack of data
- Bayesian belief networks describe conditional independence among subsets of variables
- this allows combining prior knowledge about causal relationships among variables with observed data
Conditional Independence

Definition: X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given the value of Z; that is, if

∀ xi,yj,zk: P(X=xi|Y=yj, Z=zk) = P(X=xi|Z=zk)

or more compactly, P(X|Y,Z) = P(X|Z).

Example: Thunder is conditionally independent of Rain given Lightning:
P(Thunder|Rain, Lightning) = P(Thunder|Lightning)
Notice: P(Thunder|Rain) ≠ P(Thunder)

Naïve Bayes uses conditional independence to justify:
P(X,Y|Z) = P(X|Y,Z) P(Y|Z) = P(X|Z) P(Y|Z)
Bayesian Belief Network

A network represents a set of conditional independence assertions: each node is conditionally independent of its non-descendants, given its immediate predecessors (directed acyclic graph).

[Figure: network with nodes Storm, BusTourGroup, Lightning, Campfire, Thunder, Forestfire]

Conditional probability table for Campfire:

|           | S,B | S,¬B | ¬S,B | ¬S,¬B |
|-----------|-----|------|------|-------|
| Campfire  | 0.4 | 0.1  | 0.8  | 0.2   |
| ¬Campfire | 0.6 | 0.9  | 0.2  | 0.8   |
Bayesian Belief Network

A network represents the joint probability distribution over all variables, e.g. P(Storm, BusTourGroup, Lightning, Campfire, Thunder, Forestfire):

P(y1,…,yn) = Π i=1..n P(yi|Parents(Yi))

The joint distribution is fully defined by the graph plus the conditional probabilities P(yi|Parents(Yi)) (e.g. P(C|S,B) in the table above).
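To make the factorization concrete, a small sketch computing one joint probability for the Campfire fragment; P(C|S,B) is taken from the table above, while P(S) and P(B) are invented purely for illustration:

```python
# Joint probability via P(y1,...,yn) = prod_i P(yi|Parents(Yi)).
p_storm = 0.1           # hypothetical prior P(S)
p_bus = 0.2             # hypothetical prior P(B)
p_campfire = {(True, True): 0.4, (True, False): 0.1,
              (False, True): 0.8, (False, False): 0.2}   # P(C|S,B) from the table

# P(S=true, B=false, C=true) = P(S) * P(~B) * P(C|S,~B)
print(p_storm * (1 - p_bus) * p_campfire[(True, False)])   # 0.008
```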
Expectation Maximization (EM)

When to use: data is only partially observable
- unsupervised clustering: target value unobservable
- supervised learning: some instance attributes unobservable

Applications:
- training Bayesian belief networks
- unsupervised clustering
- learning hidden Markov models
Generating Data from a Mixture of Gaussians

Each instance x is generated by:
1. choosing one of the k Gaussians at random
2. generating an instance according to that Gaussian
EM for Estimating k Means

Given:
- instances from X generated by a mixture of k Gaussians
- unknown means <μ1,…,μk> of the k Gaussians
- we don't know which instance xi was generated by which Gaussian

Determine: maximum likelihood estimates of <μ1,…,μk>

Think of the full description of each instance as yi = <xi, zi1, zi2>, where zij is 1 if xi was generated by the j-th Gaussian; xi is observable, zij is unobservable.
EM for Estimating k Means

EM algorithm: pick a random initial h = <μ1,μ2>, then iterate:

E step: calculate the expected value E[zij] of each hidden variable zij, assuming the current hypothesis h = <μ1,μ2> holds:

E[zij] = p(x=xi|μ=μj) / Σ n=1..2 p(x=xi|μ=μn)
       = exp(−(xi−μj)²/2σ²) / Σ n=1..2 exp(−(xi−μn)²/2σ²)

M step: calculate a new maximum likelihood hypothesis h' = <μ1',μ2'>, assuming the value taken on by each hidden variable zij is its expected value E[zij] calculated in the E step. Replace h = <μ1,μ2> by h' = <μ1',μ2'>:

μj = Σ i=1..m E[zij] xi / Σ i=1..m E[zij]
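A minimal numpy sketch of these E and M steps for k = 2 one-dimensional Gaussians with known, shared σ (the data and parameter names are invented):

```python
import numpy as np

def em_two_means(x, sigma=1.0, iters=50, seed=0):
    """EM for the means of a mixture of two 1D Gaussians with known,
    shared sigma, following the E/M steps above."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=2, replace=False)      # random initial h = <mu1, mu2>
    for _ in range(iters):
        # E step: E[z_ij] proportional to exp(-(x_i - mu_j)^2 / 2 sigma^2)
        w = np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2 * sigma ** 2))
        w /= w.sum(axis=1, keepdims=True)          # normalize over the 2 Gaussians
        # M step: mu_j = sum_i E[z_ij] x_i / sum_i E[z_ij]
        mu = (w * x[:, None]).sum(axis=0) / w.sum(axis=0)
    return mu

# Hypothetical data from two Gaussians with true means -2 and 3:
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 1, 100), rng.normal(3, 1, 100)])
print(em_two_means(x))   # close to (-2, 3), in some order
```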
EM Algorithm
- converges to a local maximum likelihood and provides estimates of the hidden variables zij
- in fact, a local maximum of E[ln P(Y|h)]
- Y is the complete (observable plus unobservable) data
- the expected value is taken over the possible values of the unobserved variables in Y
General EM Problem

Given:
- observed data X = {x1,…,xm}
- unobserved data Z = {z1,…,zm}
- parameterized probability distribution P(Y|h), where Y = {y1,…,ym} is the full data with yi = <xi,zi>, and h are the parameters

Determine: h that (locally) maximizes E[ln P(Y|h)]

Applications:
- training Bayesian belief networks
- unsupervised clustering
- hidden Markov models
General EM Method

Define a likelihood function Q(h'|h), which calculates Y = X ∪ Z using the observed X and the current parameters h to estimate Z:

Q(h'|h) = E[ln P(Y|h') | h, X]

EM algorithm:

Estimation (E) step: calculate Q(h'|h) using the current hypothesis h and the observed data X to estimate the probability distribution over Y:

Q(h'|h) = E[ln P(Y|h') | h, X]

Maximization (M) step: replace hypothesis h by the hypothesis h' that maximizes this Q function:

h ← argmax h'∈H Q(h'|h)