Introduction to Bayesian Learning
Transcript of "Introduction to Bayesian Learning" (pages.di.unipi.it/bacciu/wp-content/uploads/sites/12/2016/11/ml_lect1.pdf)
Course Information / Introduction
Introduction to Bayesian Learning
Davide Bacciu
Dipartimento di Informatica, Università di Pisa
[email protected]
Apprendimento Automatico: Fondamenti - A.A. 2016/2017
Outline
Lecture 1 - Introduction: probabilistic reasoning and learning (Bayesian Networks)
Lecture 2 - Parameter Learning (I): learning fully observed Bayesian models
Lecture 3 - Parameter Learning (II): learning with hidden variables
If we have time, we will also cover some application examples of Bayesian learning and Bayesian Networks.
Course Information
Reference Webpage:
http://pages.di.unipi.it/bacciu/teaching/apprendimento-automatico-fondamenti/
Here you can find: course information, lecture slides, articles and course materials.
Office Hours
Where? Dipartimento di Informatica, 2nd floor, room 367
When? Monday 17-19, or basically anytime if you send me an email beforehand.
Lecture Outline
1 Introduction
2 Probability Theory: Probabilities and Random Variables; Bayes Theorem and Independence
3 Learning as Bayesian Inference: Hypothesis Selection; Bayesian Classifiers
4 Bayesian Networks: Representation; Learning and Inference
Bayesian Learning
Why Bayesian? Easy, because of the frequent use of Bayes theorem...
- Bayesian Inference: a powerful approach to probabilistic reasoning
- Bayesian Networks: an expressive model for describing probabilistic relationships

Why bother?
- The real world is uncertain: data (noisy measurements and partial knowledge) and beliefs (concepts and their relationships)
- Probability as a measure of our beliefs: a conceptual framework for describing uncertainty in world representation
- Learning and reasoning become matters of probabilistic inference
- Probabilistic weighting of the hypotheses
Probability Theory / Learning as Bayesian Inference
Part I
Probability and Learning
Probabilities and Random Variables / Bayes Theorem and Independence
Random Variables
A Random Variable (RV) is a function describing the outcome of a random process by assigning unique values to all possible outcomes of the experiment

Random Process =⇒ Coin Toss

Discrete RV =⇒ X = 0 if heads, 1 if tails

The sample space S of a random process is the set of all possible outcomes, e.g. S = {heads, tails}. An event e is a subset e ⊆ S, i.e. a set of outcomes, that may or may not occur as a result of the experiment.

Random variables are the building blocks for representing our world.
Probability Functions
A probability function P(X = x) ∈ [0, 1] (P(x) in short) measures the probability of the RV X attaining the value x, i.e. the probability of event x occurring. If the random process is described by a set of RVs X1, . . . , XN, then the joint probability writes

P(X1 = x1, . . . , XN = xN) = P(x1 ∧ · · · ∧ xN)

Definition (Sum Rule)
The probabilities of all the events must sum to one:

∑_x P(X = x) = 1
Product Rule and Conditional Probabilities
Definition (Product Rule a.k.a. Chain Rule)
P(x1, . . . , xi, . . . , xN | y) = ∏_{i=1}^{N} P(xi | x1, . . . , xi−1, y)
P(x | y) is the conditional probability of x given y. It reflects the fact that the realization of an event y may affect the occurrence of x.

Marginalization: the sum and product rules together yield the complete probability equation

P(X1 = x1) = ∑_{x2} P(X1 = x1, X2 = x2) = ∑_{x2} P(X1 = x1 | X2 = x2) P(X2 = x2)
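As a sanity check, the sum rule, product rule and marginalization can be verified numerically on a tiny joint distribution; the numbers below are invented for illustration.

```python
# Toy joint distribution P(X1, X2) over two binary RVs (invented numbers).
joint = {
    (0, 0): 0.3, (0, 1): 0.2,
    (1, 0): 0.1, (1, 1): 0.4,
}

# Sum rule: probabilities over all joint events sum to 1.
assert abs(sum(joint.values()) - 1.0) < 1e-12

# Marginalization: P(X1 = x1) is the sum over x2 of P(X1 = x1, X2 = x2).
def marginal_x1(x1):
    return sum(p for (a, b), p in joint.items() if a == x1)

# Product rule: P(x1, x2) = P(x1 | x2) P(x2), so the conditional is a ratio.
def conditional_x1_given_x2(x1, x2):
    p_x2 = sum(p for (a, b), p in joint.items() if b == x2)
    return joint[(x1, x2)] / p_x2

print(marginal_x1(0))  # 0.5
```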
Bayes Rule
Given hypothesis hi ∈ H and observations d

P(hi | d) = P(d | hi) P(hi) / P(d) = P(d | hi) P(hi) / ∑_j P(d | hj) P(hj)

- P(hi) is the prior probability of hi
- P(d | hi) is the conditional probability of observing d given that hypothesis hi is true (likelihood)
- P(d) is the marginal probability of d
- P(hi | d) is the posterior probability that the hypothesis is true, given the data and the previous belief about the hypothesis
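A minimal sketch of Bayes rule over a finite hypothesis space; the priors and likelihoods below are toy numbers, not from the slides.

```python
# Bayes rule: posterior ∝ likelihood × prior, normalized by the marginal P(d).
priors = [0.5, 0.5]        # P(h1), P(h2) (illustrative)
likelihoods = [0.8, 0.2]   # P(d | h1), P(d | h2) (illustrative)

# Marginal P(d) = sum_j P(d | hj) P(hj), the denominator of Bayes rule.
p_d = sum(l * p for l, p in zip(likelihoods, priors))

# Posterior P(hi | d) for each hypothesis.
posteriors = [l * p / p_d for l, p in zip(likelihoods, priors)]
print(posteriors)  # [0.8, 0.2]
```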
So, Is This All About Bayes???
Frequentists Vs Bayesians
Figure: Credit goes to Mike West @ Duke University
Independence and Conditional Independence
Two RVs X and Y are independent if knowledge about X does not change the uncertainty about Y, and vice versa

I(X, Y) ⇔ P(X, Y) = P(X | Y) P(Y) = P(Y | X) P(X) = P(X) P(Y)

Two RVs X and Y are conditionally independent given Z if, once Z is known, knowledge about X does not change the uncertainty about Y (and vice versa)

I(X, Y | Z) ⇔ P(X, Y | Z) = P(X | Y, Z) P(Y | Z) = P(Y | X, Z) P(X | Z) = P(X | Z) P(Y | Z)
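Conditional independence can be checked directly on a joint distribution built as P(Z) P(X|Z) P(Y|Z); all numbers below are invented for illustration.

```python
import itertools

# A joint P(X, Y, Z) in which X and Y are conditionally independent given Z
# by construction: P(x, y, z) = P(z) P(x | z) P(y | z).
p_z = {0: 0.4, 1: 0.6}
p_x_given_z = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}  # p_x_given_z[z][x]
p_y_given_z = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.9, 1: 0.1}}

joint = {(x, y, z): p_z[z] * p_x_given_z[z][x] * p_y_given_z[z][y]
         for x, y, z in itertools.product([0, 1], repeat=3)}

def cond(x, y, z):
    """P(X=x, Y=y | Z=z) recovered from the joint."""
    pz = sum(p for (a, b, c), p in joint.items() if c == z)
    return joint[(x, y, z)] / pz

def cond_x(x, z):
    pz = sum(p for (a, b, c), p in joint.items() if c == z)
    return sum(p for (a, b, c), p in joint.items() if a == x and c == z) / pz

def cond_y(y, z):
    pz = sum(p for (a, b, c), p in joint.items() if c == z)
    return sum(p for (a, b, c), p in joint.items() if b == y and c == z) / pz

# I(X, Y | Z): P(X, Y | Z) = P(X | Z) P(Y | Z) for every instantiation.
for x, y, z in itertools.product([0, 1], repeat=3):
    assert abs(cond(x, y, z) - cond_x(x, z) * cond_y(y, z)) < 1e-12
```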
Representing Probabilities with Discrete Data
Joint Probability Distribution Table
| X1 | … | Xi | … | Xn | P(X1, …, Xn) |
|----|---|----|---|----|--------------|
| x′1 | … | x′i | … | x′n | P(x′1, …, x′n) |
| ⋮ | | ⋮ | | ⋮ | ⋮ |
| xˡ1 | … | xˡi | … | xˡn | P(xˡ1, …, xˡn) |
Describes P(X1, . . . ,Xn) for all the RV instantiations x1, . . . , xn
In general, any probability of interest can be obtained starting from the Joint Probability Distribution P(X1, . . . , Xn)
Hypothesis Selection / Bayesian Classifiers
Wrapping Up....
We know how to represent the world and the observations
- Random Variables =⇒ X1, . . . , XN
- Joint Probability Distribution =⇒ P(X1 = x1, . . . , XN = xN)

We have rules for manipulating the probabilistic knowledge
- Sum-Product
- Marginalization
- Bayes
- Conditional Independence

It is about time that we do some learning... in other words, let's discover the values for P(X1 = x1, . . . , XN = xN)
Bayesian Learning
Statistical learning approaches calculate the probability of each hypothesis hi given the data d, and select hypotheses / make predictions on that basis. Bayesian learning makes predictions using all the hypotheses, weighted by their probabilities

P(X | D = d) = ∑_i P(X | d, hi) P(hi | d) = ∑_i P(X | hi) · P(hi | d)

i.e. the new prediction is the sum of the per-hypothesis predictions P(X | hi), each weighted by its posterior P(hi | d).
Single Hypothesis Approximation
Computational and Analytical Tractability Issue
Bayesian learning requires a (possibly infinite) summation over the whole hypothesis space

Maximum a-Posteriori (MAP) predicts P(X | hMAP) using the most likely hypothesis hMAP given the training data

hMAP = argmax_{h∈H} P(h | d) = argmax_{h∈H} P(d | h) P(h) / P(d) = argmax_{h∈H} P(d | h) P(h)

Assuming uniform priors P(hi) = P(hj) yields the Maximum Likelihood (ML) estimate P(X | hML)

hML = argmax_{h∈H} P(d | h)
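A toy illustration of how MAP and ML can disagree; the priors and likelihoods are hypothetical numbers chosen to make the point.

```python
# hMAP maximizes P(d|h)P(h); hML drops the prior. Invented numbers.
hypotheses = ["h1", "h2", "h3"]
prior = {"h1": 0.7, "h2": 0.2, "h3": 0.1}       # P(h)
likelihood = {"h1": 0.1, "h2": 0.5, "h3": 0.9}  # P(d | h)

h_map = max(hypotheses, key=lambda h: likelihood[h] * prior[h])
h_ml = max(hypotheses, key=lambda h: likelihood[h])

# A strong prior can override a higher likelihood: hMAP and hML differ.
print(h_map, h_ml)  # h2 h3
```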
All Too Abstract?
Let’s go to the Cinema!!!
How do I choose the next movie (prediction)?
- I might ask my friends for their favorite choice given their personal taste (hypothesis)
- Select the movie

Bayesian advice? Take a vote over all the friends' suggestions, weighted by their cinema attendance and taste.
MAP advice? Ask the friend who goes often to the cinema and whose taste I trust.
ML advice? Ask the friend who goes most often to the cinema.
The Candy Box Problem
A candy manufacturer produces 5 types of candy boxes (hypotheses) that are indistinguishable in the darkness of the cinema

- h1: 100% cherry flavor
- h2: 75% cherry and 25% lime flavor
- h3: 50% cherry and 50% lime flavor
- h4: 25% cherry and 75% lime flavor
- h5: 100% lime flavor

Given a sequence of candies d = d1, . . . , dN extracted from a box and reinserted (observations), what is the most likely flavor for the next candy (prediction)?
Candy Box Problem: Hypothesis Posterior
First, we need to compute the posterior for each hypothesis (Bayes)

P(hi | d) = α P(d | hi) P(hi)

The manufacturer is kind enough to provide us with the production shares (prior) for the 5 boxes

(P(h1), P(h2), P(h3), P(h4), P(h5)) = (0.1, 0.2, 0.4, 0.2, 0.1)

The data likelihood can be computed under the assumption that observations are independently and identically distributed (i.i.d.)

P(d | hi) = ∏_{j=1}^{N} P(dj | hi)
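The i.i.d. likelihood can be computed directly from the per-box flavor proportions stated in the problem:

```python
# P(lime | hi) for boxes h1..h5, from the problem statement.
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]

def likelihood(d, p):
    """i.i.d. likelihood P(d | h): product of per-candy probabilities.
    d is a list of 'lime' / 'cherry' observations; p is P(lime | h)."""
    out = 1.0
    for candy in d:
        out *= p if candy == "lime" else (1.0 - p)
    return out

d = ["lime", "lime"]
print([likelihood(d, p) for p in p_lime])  # [0.0, 0.0625, 0.25, 0.5625, 1.0]
```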
Candy Box Problem: Hypothesis Posterior Computation
Suppose that the box is of type h5 and consider a sequence of 10 observed lime candies.

Figure: posterior probabilities P(h1 | d), . . . , P(h5 | d) as a function of the number of samples in d.

| Hyp | d0 | d1 | d2 |
|-----|-----|-----|------|
| h1 | 0.1 | 0 | 0 |
| h2 | 0.2 | 0.1 | 0.03 |
| h3 | 0.4 | 0.4 | 0.30 |
| h4 | 0.2 | 0.3 | 0.35 |
| h5 | 0.1 | 0.2 | 0.31 |

P(hi | d) = α P(hi) P(d = l | hi)^N
The posteriors of the false hypotheses eventually vanish.
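A short sketch reproducing the posterior table above (up to rounding), using the stated priors and per-box lime probabilities:

```python
# Posterior over the five boxes after N observed lime candies:
# P(hi | d) = alpha * P(hi) * P(lime | hi)^N.
priors = [0.1, 0.2, 0.4, 0.2, 0.1]
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]

def posterior(n_limes):
    unnorm = [p * q ** n_limes for p, q in zip(priors, p_lime)]
    alpha = 1.0 / sum(unnorm)       # normalizes over the hypotheses
    return [alpha * u for u in unnorm]

for n in range(3):
    print(n, [round(p, 2) for p in posterior(n)])
# 0 [0.1, 0.2, 0.4, 0.2, 0.1]
# 1 [0.0, 0.1, 0.4, 0.3, 0.2]
# 2 [0.0, 0.04, 0.31, 0.35, 0.31]
```

The slide's d2 column truncates rather than rounds; the exact values after two limes are (0, 1/26, 8/26, 9/26, 8/26).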
Candy Box Problem: Comparing Predictions
Bayesian learning seeks

P(d11 = l | d1 = l, . . . , d10 = l) = ∑_{i=1}^{5} P(d11 = l | hi) P(hi | d)
Figure: probability that the next candy is lime, as a function of the number of samples in d.
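The Bayesian prediction can be computed in a few lines from the candy-box priors and per-box lime probabilities:

```python
# Bayesian prediction of the next candy: average the per-hypothesis
# prediction P(lime | hi) under the posterior P(hi | d).
priors = [0.1, 0.2, 0.4, 0.2, 0.1]
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]

def posterior(n_limes):
    unnorm = [p * q ** n_limes for p, q in zip(priors, p_lime)]
    s = sum(unnorm)
    return [u / s for u in unnorm]

def p_next_lime(n_limes):
    # sum_i P(lime | hi) P(hi | d), with d = n_limes observed limes
    return sum(q * w for q, w in zip(p_lime, posterior(n_limes)))

print(round(p_next_lime(10), 3))  # 0.973 — after 10 limes, almost surely lime
```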
Observations
- Both ML and MAP are point estimates, since they only make predictions based on the most likely hypothesis
- MAP predictions are approximately Bayesian if P(X | d) ∼ P(X | hMAP)
- MAP and Bayesian predictions become closer as more data becomes available
- ML is a good approximation to MAP if the dataset is large and there are no a-priori preferences on the hypotheses
- ML is fully data-driven: for large datasets, the influence of the prior becomes irrelevant
- ML has problems with small datasets
Regularization
P(h) introduces a preference across hypotheses and penalizes complexity
- Complex hypotheses have a lower prior probability
- The hypothesis prior embodies the trade-off between complexity and degree of fit

MAP hypothesis hMAP

max_h P(d | h) P(h) ≡ min_h −log2 P(d | h) − log2 P(h)

The two terms measure the number of bits required to specify the data given the hypothesis and the hypothesis itself: MAP chooses the hypothesis that provides maximum compression. MAP is a regularization of the ML estimation.
Bayes Optimal Classifier
Suppose we want to classify a new observation as one of the classes cj ∈ C. The most probable Bayesian classification is

c*OPT = argmax_{cj∈C} P(cj | d) = argmax_{cj∈C} ∑_i P(cj | hi) P(hi | d)

Bayes Optimal Classifier
No other classification method (using the same hypothesis space and prior knowledge) can outperform a Bayes Optimal Classifier on average.
Gibbs Algorithm
The Bayes Optimal Classifier provides the best results, but can be extremely expensive.

Use approximated methods =⇒ the Gibbs Algorithm
1. Choose a hypothesis hG from H at random, drawing it from the current posterior probability distribution P(hi | d)
2. Use hG to predict the classification of the next instance

Under certain conditions, the expected number of misclassifications is at most twice that of a Bayes Optimal classifier.
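A minimal sketch of the two steps above, with an invented posterior and invented per-hypothesis classifications:

```python
import random

# Gibbs algorithm: draw ONE hypothesis from the posterior and classify
# with it, instead of averaging over all hypotheses (illustrative numbers).
posterior = {"h1": 0.1, "h2": 0.6, "h3": 0.3}           # P(hi | d), invented
predict = {"h1": "cherry", "h2": "lime", "h3": "lime"}  # each h's prediction

def gibbs_classify(rng):
    # Step 1: sample hG ~ P(hi | d); step 2: classify with hG.
    h_g = rng.choices(list(posterior), weights=list(posterior.values()))[0]
    return predict[h_g]

rng = random.Random(0)
print(gibbs_classify(rng))  # 'lime' with probability 0.9 over the draw
```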
Naive Bayes Classifier
One of the simplest, yet most popular, tools based on a strong probabilistic assumption. Consider the setting:
- Target classification function f : X −→ C
- Each instance x ∈ X is described by a set of attributes x = ⟨a1, . . . , al, . . . , aL⟩

Seek the MAP classification

cNB = argmax_{cj∈C} P(cj | a1, . . . , aL)
Naive Bayes Assumption
The MAP classification rewrites

cNB = argmax_{cj∈C} P(cj | a1, . . . , aL) = argmax_{cj∈C} P(a1, . . . , aL | cj) P(cj)

Naive Bayes: assume conditional independence between the attributes al given the classification cj

P(a1, . . . , aL | cj) = ∏_{l=1}^{L} P(al | cj)

Naive Bayes Classification

cNB = argmax_{cj∈C} P(cj) ∏_{l=1}^{L} P(al | cj)
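The decision rule can be sketched on invented class priors and per-class attribute probabilities; working in log-space avoids numerical underflow when L is large.

```python
import math

# Toy Naive Bayes spam filter (all probabilities invented for illustration).
priors = {"spam": 0.4, "ham": 0.6}                       # P(cj)
p_word = {                                               # P(al | cj)
    "spam": {"offer": 0.5, "meeting": 0.1, "free": 0.4},
    "ham":  {"offer": 0.1, "meeting": 0.7, "free": 0.2},
}

def classify(words):
    scores = {}
    for c in priors:
        # log P(cj) + sum_l log P(al | cj): same argmax as the product form.
        scores[c] = math.log(priors[c]) + sum(math.log(p_word[c][w]) for w in words)
    return max(scores, key=scores.get)

print(classify(["offer", "free"]))  # spam
print(classify(["meeting"]))        # ham
```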
Observations on Naive Bayes
Use it when
- dealing with moderate or large training sets
- the attributes that describe the instances are plausibly conditionally independent given the class

In reality, conditional independence is often violated... but Naive Bayes does not seem to know it: it often works surprisingly well!
- Text classification
- Medical diagnosis
- Recommendation systems

Next class we'll see how to train a Naive Bayes classifier.
Bayesian Networks / Summary
Part II
Bayesian Networks
Representation / Learning and Inference
Representing Conditional Independence
Bayes Optimal Classifier (BOC)
- No independence information between RVs
- Computationally expensive

Naive Bayes (NB)
- Full independence given the class
- Extremely restrictive assumption

Figure: example Bayesian Network with nodes Earthquake, Burglary, Radio Announcement, Alarm and Neighbour Call.

Bayesian Networks (BNs) describe conditional independence between subsets of RVs by a graphical model

I(X, Y | Z) ⇔ P(X, Y | Z) = P(X | Z) P(Y | Z)

They combine a-priori information concerning RV dependencies with observed data.
Bayesian Network - Directed Representation
Directed Acyclic Graph (DAG)
- Nodes represent random variables: shaded ⇒ observed, empty ⇒ unobserved
- Edges describe the conditional independence relationships
- Every variable is conditionally independent of its non-descendants, given its parents

Conditional Probability Tables (CPT) local to each node describe the probability distribution given its parents

P(Y1, . . . , YN) = ∏_{i=1}^{N} P(Yi | pa(Yi))
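The factorization can be illustrated on the burglary-alarm example network, assuming the usual structure (Burglary → Alarm ← Earthquake → Radio, Alarm → NeighbourCall); all CPT numbers below are invented for illustration.

```python
# Joint probability via the BN factorization:
# P(B, E, R, A, C) = P(B) P(E) P(R | E) P(A | B, E) P(C | A).
# Every CPT entry here is an invented illustrative number.
p_b = {True: 0.01, False: 0.99}                      # P(Burglary)
p_e = {True: 0.02, False: 0.98}                      # P(Earthquake)
p_r_given_e = {True: 0.9, False: 0.001}              # P(Radio=True | E)
p_a_given_be = {(True, True): 0.95, (True, False): 0.94,
                (False, True): 0.29, (False, False): 0.001}  # P(Alarm=True | B, E)
p_c_given_a = {True: 0.9, False: 0.05}               # P(Call=True | A)

def joint(b, e, r, a, c):
    pr = p_r_given_e[e] if r else 1 - p_r_given_e[e]
    pa = p_a_given_be[(b, e)] if a else 1 - p_a_given_be[(b, e)]
    pc = p_c_given_a[a] if c else 1 - p_c_given_a[a]
    return p_b[b] * p_e[e] * pr * pa * pc

# Burglary, no earthquake, no radio report, alarm rings, neighbour calls.
print(joint(True, False, False, True, True))  # ≈ 0.0083
```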
Naive Bayes as a Bayesian Network
Figure: the Naive Bayes classifier drawn as a Bayesian Network with class node c and attribute nodes a1, a2, . . . , aL, and the equivalent plate-notation diagram (a plate over the L attributes, replicated for D instances).

- The Naive Bayes classifier can be represented as a Bayesian Network
- Plate notation gives a more compact representation
- It allows specifying more (Bayesian) details of the model
Plate Notation
Figure: Bayesian Naive Bayes in plate notation, with observed attributes al, class c, parameters βj and class prior π; plates replicate over the L attributes, C classes and D instances.

- Boxes denote replication for a number of times denoted by the letter in the corner
- Shaded nodes are observed variables
- Empty nodes denote unobserved latent variables
- Black seeds identify constant terms (e.g. the prior distribution π over the classes)
Inference in Bayesian Networks
Inference: how can one determine the distribution of the values of one or several network variables, given the observed values of others?

- Bayesian Networks contain all the needed information (the joint probability)
- Inference is an NP-hard problem in general
- It solves easily for a single unobserved variable... otherwise:
  - exact inference on small DAGs
  - approximated inference (variational methods)
  - simulating the network (Monte Carlo)
Learning with Bayesian Networks
| Data \ Structure | Fixed Structure (X → Y, learn P(Y|X)) | Fixed Variables (X, Y, learn P(X, Y)) |
|---|---|---|
| Complete | Naive Bayes; calculate frequencies (ML) | Discover dependencies from the data: structure search, independence tests |
| Incomplete | Latent variables: EM algorithm (ML); MCMC, VBEM (Bayesian) | Difficult problem: Structural EM |
| | Parameter Learning | Structure Learning |
Take Home Messages
- Bayesian techniques formulate learning as probabilistic inference
- ML learning selects the hypothesis that maximizes the data likelihood
- MAP learning selects the most likely hypothesis given the data (a regularization of ML)
- Conditional independence relations can be expressed by a Bayesian Network
- Parameter Learning (next lecture)
- Structure Learning