Learning with Bayesian Networks

Author: David Heckerman
Presented by: Yan Zhang (2006), Jeremy Gould (2013), Chip Galusha (2014)


My name is ___ and this presentation is on Bayesian networks, from David Heckerman's "A Tutorial on Learning with Bayesian Networks."

Outline
Bayesian Approach: Bayesian vs. classical probability methods; Bayes' Theorem; examples
Bayesian Network: Structure; Inference; Learning Probabilities; Dealing with Unknowns; Learning the Network Structure; two-coin-toss example
Conclusions
Exam Questions

Bayesian networks (BNs) were briefly introduced to us earlier in the semester by Prof. Wei. A BN is a graphical model for probabilistic relationships among a set of variables. It has become a popular representation for encoding uncertain knowledge in many applications, including medical diagnosis systems and gene regulatory networks. The paper I am presenting today is a very good entry-level tutorial which introduces many topics in BNs. In my talk today I will focus on several fundamental topics, which are outlined here. First, I will review the Bayesian approach, compare the Bayesian method with the classical probability method, and give an example as an illustration of the Bayesian method. After that, I will spend most of the time on several BN topics, including structure, inference in a BN, and how to learn parameters and structures in BNs. In the end I will provide a summary and discuss our exam questions.

Bayesian vs. the Classical Approach
Bayesian statistical methods start with existing 'prior' beliefs and update them using data to give 'posterior' beliefs, which may be used as the basis for inferential decisions and probability assessments.

Classical probability refers to the true or actual probability of the event and is not concerned with observed behavior.

If we compare the Bayesian method with the classical method, we will notice these differences. Let me give you an example. I have a coin here. If I toss it, what is the probability of the outcome being a head? Bayesian statisticians and classical statisticians reason about this question differently. Classical statisticians cannot answer this question before performing precise measurements of the coin's physical properties. One thing that is certain for them is that this probability is fixed in the range 0 to 1; no matter what observations we make, this probability will not change. Bayesian statisticians, however, will answer this question according to their personal belief.

Example: Is this Man a Martian Spy?

Example
We start with two concepts:
Hypothesis (H): he either is or is not a Martian spy.
Data (D): some set of information about the subject; perhaps financial data, phone records, maybe we bugged his office.

Example
The frequentist says: given a hypothesis (he IS a Martian spy), there is a probability P of seeing this data:

P( D | H )

(Considers absolute ground truth; the uncertainty/noise is in the data.)

The Bayesian says: given this data, there is a probability P of this hypothesis being true:

P( H | D )

(This probability indicates our level of belief in the hypothesis.)

This is why a frequentist can't really say "There is a 70% chance of rain tomorrow." From his perspective it either will or will not rain. The Bayesian, on the other hand, is free to say "I am 70% confident that it will rain tomorrow."

Bayesian vs. the Classical Approach
The Bayesian approach restricts its prediction to the next (N+1) occurrence of an event given the observed previous (N) events.

The classical approach is to predict the likelihood of any given event regardless of the number of occurrences.

NOTE: The Bayesian approach can be updated as new data are observed.

Bayes' Theorem

p(x | y) = p(y | x) p(x) / p(y)

where p(y) = Σx p(y | x) p(x) if x is discrete, or p(y) = ∫ p(y | x) p(x) dx if x is continuous.

In both cases, p(y) is a marginal distribution and can be thought of as a normalizing constant, which allows us to rewrite the above as:

p(x | y) ∝ p(y | x) p(x)

I think all of you have already seen Bayes' theorem before. It states that the probability of a hypothesis A given the observed evidence B can be calculated from this equation. If A is a continuous variable, p(B) can be calculated by integrating over A; if A is a discrete variable, p(B) can be calculated by summing over all possible values of A. Instead of A and B, we sometimes use other notations that are more meaningful in different contexts.
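As a small illustration of the theorem applied to the Martian-spy example, here is a minimal sketch; the prior and the two likelihoods are invented for illustration and are not taken from the slides.

```python
# Hypothetical numbers for the Martian-spy example: a prior belief that he is a
# spy, and the likelihood of the observed data under each hypothesis.
p_spy = 0.01                 # prior P(H): he is a Martian spy
p_data_given_spy = 0.70      # P(D | H): chance of seeing this data if he is a spy
p_data_given_not = 0.05      # P(D | not H): chance of seeing this data otherwise

# Marginal probability of the data: p(D) = sum over hypotheses of P(D|H) P(H).
p_data = p_data_given_spy * p_spy + p_data_given_not * (1 - p_spy)

# Bayes' theorem: P(H | D) = P(D | H) P(H) / P(D).
p_spy_given_data = p_data_given_spy * p_spy / p_data
print(f"P(H | D) = {p_spy_given_data:.3f}")   # about 0.124 with these numbers
```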

Example: Coin Toss
I want to toss a coin n = 100 times. Let's denote the random variable X as the outcome of one flip:
p(X = head) = θ, p(X = tail) = 1 − θ

Before doing this experiment we have some belief in our mind: the prior probability. Let's assume that this belief has a Beta distribution (a common assumption):

p(θ) = Beta(θ | α, β) = [Γ(α + β) / (Γ(α) Γ(β))] θ^(α−1) (1 − θ)^(β−1)

In the following example, I will show you how the Bayesian method works in this coin-toss problem.

α = β = 5 says: I believe that if I flip a coin 10 times I will see 5 heads and 5 tails.

Example: Coin Toss
If we assume a fair coin we can fix α = β = 5, which gives:

(Hopefully, what you were expecting!)

[Figure: the Beta(5, 5) prior distribution, symmetric about θ = 0.5]
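To make the prior concrete, here is a minimal sketch that evaluates the Beta(5, 5) density at a few points using only the standard library; the helper beta_pdf is our own, not something from the slides.

```python
import math

def beta_pdf(theta, a, b):
    """Beta density: Gamma(a+b)/(Gamma(a)Gamma(b)) * theta^(a-1) * (1-theta)^(b-1)."""
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * theta ** (a - 1) * (1 - theta) ** (b - 1)

alpha, beta = 5, 5           # "10 imagined flips: 5 heads and 5 tails"
for theta in [0.1, 0.3, 0.5, 0.7, 0.9]:
    print(f"p(theta={theta:.1f}) = {beta_pdf(theta, alpha, beta):.3f}")
# The density peaks at theta = 0.5, matching the belief that the coin is fair.
```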


Γ(n) = (n−1)! for integer n.

Example: Coin Toss
Now I can run my experiment. As I go, I can update my beliefs based on the observed heads (h) and tails (t). Applying Bayes' law to obtain the posterior, we have:

p(θ | D, ξ) = Beta(θ | α + h, β + t)

p(θ | D, ξ) is the probability of θ given the data D and the state of information ξ.

h and t are the numbers of observed heads and tails.

Notice I'm basically just changing alpha and beta.

Example: Coin Toss
Since we're assuming a Beta distribution, this becomes our posterior probability. Supposing that we observed h = 45 and t = 55, we would get:

p(θ | D, ξ) = Beta(θ | 50, 60)


Example: Coin Toss

[Figure: the Beta(5, 5) prior and the Beta(50, 60) posterior]

Dashed is the prior belief; solid is the belief modified by the empirical data.

Integration
To find the probability that X_{n+1} = heads, we could also integrate over all possible values of θ to find the average value of θ, which yields:

p(X_{n+1} = heads | D) = ∫ θ p(θ | D) dθ = (α + h) / (α + β + h + t)

This might be necessary if we were working with a distribution with a less obvious expected value.

Remember slide 8?
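Tying the update and the prediction together, here is a minimal sketch for the numbers used in the slides (α = β = 5, h = 45, t = 55); the predictive probability is the posterior mean of θ.

```python
alpha, beta = 5, 5           # prior pseudo-counts: Beta(5, 5)
h, t = 45, 55                # observed heads and tails

# Conjugate update: the posterior is again a Beta distribution, with the
# pseudo-counts simply incremented by the observed counts.
post_alpha, post_beta = alpha + h, beta + t      # Beta(50, 60)

# Predictive probability of the next flip being heads: the posterior mean,
# E[theta | D] = (alpha + h) / (alpha + beta + h + t).
p_next_head = post_alpha / (post_alpha + post_beta)
print(f"posterior: Beta({post_alpha}, {post_beta})")
print(f"P(X_101 = heads | D) = {p_next_head:.3f}")   # 50/110, about 0.455
```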

This is basically what we did before, except we already know the result of doing this to a Beta distribution.

More than Two Outcomes
In the previous example, we used a Beta distribution to encode the states of the random variable. This was possible because there were only two states/outcomes of the variable X. In general, if the observed variable X is discrete, having r possible states {1, ..., r}, the likelihood function is given by:

p(X = x^k | θ, ξ) = θk,  k = 1, ..., r

In this general case we can use a Dirichlet distribution instead:

p(θ | ξ) = Dir(θ | α1, ..., αr) = [Γ(α) / Πk Γ(αk)] Πk θk^(αk − 1), where α = Σk αk.
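As a minimal sketch of the same conjugate update for an r-state variable, with hypothetical pseudo-counts and observed counts (the Dirichlet plays the role the Beta played above):

```python
# A three-state variable (r = 3) with Dirichlet pseudo-counts alpha_k.
prior_counts = [2.0, 2.0, 2.0]      # hypothetical prior: no state favoured
observed     = [14, 3, 8]           # hypothetical counts N_k from the data

# Conjugate update: posterior Dirichlet parameters are alpha_k + N_k.
post_counts = [a + n for a, n in zip(prior_counts, observed)]

# Predictive probability of each state = posterior mean of theta_k.
total = sum(post_counts)
for k, c in enumerate(post_counts, start=1):
    print(f"P(X = state {k} | D) = {c / total:.3f}")
```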

Just using a different distribution, really. The rest of the math is the same. Could use Gaussian, Gamma, whatever.

Vocabulary Review
Prior Probability, P(θ | ξ): the probability of a particular value of θ given no observed data (our previous belief).

Posterior Probability, P(θ | D): the probability of a particular value of θ given that D has been observed (our final belief about θ).

Observed Probability or Likelihood, P(D | θ): the likelihood of the sequence of coin tosses D being observed given that θ takes a particular value.

P(D): the raw (marginal) probability of D.

Bayesian Advantages
It turns out that the Bayesian technique permits us to do some very useful things from a mining perspective!
1. We can use the chain rule with Bayesian probabilities:
P(A, B, C) = P(A | B, C) P(B | C) P(C)
This isn't something we can easily do with classical probability! (A small numeric check of this factorization appears after this section.)
2. As we've already seen, the Bayesian model permits us to update our beliefs based on new data.

On the right-hand side of the equation we have the joint probability distribution. We can say, with Bayes' rule, that P(A, B) = P(A | B) P(B). This will be useful downstream from here. Note, as before, we're removing the denominator since it's regarded as a constant.

Outline
Bayesian Approach: Bayes' Theorem; Bayesian vs. classical probability methods; coin-toss example
Bayesian Network: Structure; Inference; Learning Probabilities; Dealing with Unknowns; Learning the Network Structure; two-coin-toss example
Conclusions
Exam Questions

I have finished the review of the Bayesian approach. Now I am going to introduce our main topic: Bayesian networks.

Bayesian Network
To create a Bayesian network we will ultimately need three things:
A set of variables X = {X1, ..., Xn}
A network structure S
Conditional probability tables (CPTs)
Note that when we start we may not have any of these things, or a given element may be incomplete! Probabilities encoded by a Bayesian network may be Bayesian or physical (or both).

1) The first step is identifying variables for the model. We consider the problem at hand and determine a reasonable set of variables.
2) We must identify the goals of modeling (prediction vs. explanation vs. exploration).
3) Build a directed acyclic graph that encodes assertions of conditional dependence and independence.
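The numeric check of the chain-rule factorization promised above; all probabilities here are invented for illustration.

```python
from itertools import product

# Hypothetical conditional probabilities for three binary variables (0/1).
p_c = {1: 0.3, 0: 0.7}                                    # P(C)
p_b_given_c = {(1, 1): 0.6, (0, 1): 0.4,
               (1, 0): 0.2, (0, 0): 0.8}                  # P(B | C), keys (b, c)
p_a_given_bc = {(1, 1, 1): 0.9, (1, 1, 0): 0.1,
                (1, 0, 1): 0.1, (1, 0, 0): 0.9}           # P(A=1 | B, C), keys (1, b, c)

def joint(a, b, c):
    """Chain rule: P(A, B, C) = P(A | B, C) * P(B | C) * P(C)."""
    pa1 = p_a_given_bc[(1, b, c)]
    p_a = pa1 if a == 1 else 1.0 - pa1
    return p_a * p_b_given_c[(b, c)] * p_c[c]

total = sum(joint(a, b, c) for a, b, c in product([0, 1], repeat=3))
print(f"sum over all assignments = {total:.6f}")   # 1.000000, as a joint must
```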

A Little Notation
S: the network structure
S^h: the hypothesis corresponding to network structure S
S^c: a complete network structure
Xi: a variable and its corresponding node
Pai: the variables or nodes corresponding to the parents of node Xi
D: a data set

A little notation before we dive into the networks.

Let's start with a simple case where we are given all three things: a credit fraud network designed to determine the probability of credit card fraud.

Bayesian Network: Detecting Credit Card Fraud
The problem in this example is detecting whether or not a credit card transaction is fraudulent. The variables, which correspond to the nodes, are Fraud (1, 0), Gas (gas purchased in the last 24 hours; 1, 0), Jewelry (jewelry purchased in the last 24 hours; 1, 0), and the Age and Sex of the card holder. To determine the structure, we have to order the variables somehow (our order: F, A, S, G, J); this process has many drawbacks, as we may fail to reveal many conditional independencies among the variables. The parent nodes pa1, ..., pan here are fraud, age, and sex.

Bayesian Network: Setup
Correctly identify goals
Identify many possible relevant observations
Determine what subset of those observations is worth modeling
Organize observations into variables and order them

Set of Variables
Each node represents a random variable. (Let's assume discrete for now.)

Each variable is represented by a node.

Network Structure
Each edge/arc represents a conditional dependence between variables.

The edges represent the conditional dependencies.

Conditional Probability Table
Each entry quantifies a conditional dependency.

These entries are the local prior probabilities p(xi | pai).

Since we've been given the network structure, we can easily see the conditional dependencies:
P(A | F) = P(A)
P(S | F, A) = P(S)
P(G | F, A, S) = P(G | F)
P(J | F, A, S, G) = P(J | F, A, S)

Conditional Dependencies
Need to be careful with the order!

Note that the absence of an edge indicates conditional independence:
P(A | G) = P(A)

Important Note: The presence of a cycle will render one or more of the relationships intractable!

Intractability = chicken and egg. A Bayesian network must be a DAG. In our example, if we added an arc from Age to Gas, we would obtain a structure with one undirected cycle (F-G-A-J-F).

Inference
Now suppose we want to calculate (infer) our confidence level in a hypothesis on the fraud variable f given some knowledge about the other variables. This can be directly calculated via:

p(f | a, s, g, j) = p(f, a, s, g, j) / Σf' p(f', a, s, g, j)

(Kind of messy.) We have constructed the BN, but we want to determine various probabilities of interest from the model, e.g., what is the probability of fraud given observations of the other variables? The first step is just a basic law of probability: the joint over the marginal. The mess just gets worse as the number of variables increases.

Inference
Fortunately, we can use the chain rule to simplify! This simplification is especially powerful when the network is sparse, which is frequently the case in real-world problems.

This shows how we can use a Bayesian Network to infer a probability not stored directly in the model.
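As a minimal sketch of such inference for the fraud network, the code below enumerates the joint using the factorization p(f) p(a) p(s) p(g|f) p(j|f,a,s). The CPT values are invented (the real ones were in the slide's figure), and Age and Sex are reduced to binary variables to keep the example short.

```python
# Hypothetical CPTs for the fraud network; F, A, S, G, J are all binary here,
# and none of these numbers come from the slides.
p_f = {1: 0.001, 0: 0.999}                       # P(Fraud)
p_a = {1: 0.25, 0: 0.75}                         # P(Age)  -- no parents
p_s = {1: 0.5, 0: 0.5}                           # P(Sex)  -- no parents
p_g_given_f = {1: 0.2, 0: 0.01}                  # P(Gas=1 | Fraud)
p_j_given_fas = {(1, a, s): 0.05 for a in (0, 1) for s in (0, 1)}
p_j_given_fas.update({(0, 1, 1): 0.0005, (0, 1, 0): 0.004,
                      (0, 0, 1): 0.002, (0, 0, 0): 0.01})   # P(Jewelry=1 | F, A, S)

def joint(f, a, s, g, j):
    """P(f, a, s, g, j) = P(f) P(a) P(s) P(g|f) P(j|f, a, s)."""
    pg = p_g_given_f[f] if g == 1 else 1 - p_g_given_f[f]
    pj = p_j_given_fas[(f, a, s)] if j == 1 else 1 - p_j_given_fas[(f, a, s)]
    return p_f[f] * p_a[a] * p_s[s] * pg * pj

# Inference by enumeration: P(Fraud=1 | a, s, g, j) = joint / marginal.
def p_fraud_given(a, s, g, j):
    num = joint(1, a, s, g, j)
    den = sum(joint(f, a, s, g, j) for f in (0, 1))
    return num / den

# E.g. a card holder who bought both gas and jewelry in the last 24 hours:
print(f"P(Fraud | evidence) = {p_fraud_given(a=0, s=0, g=1, j=1):.4f}")
```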

Using Bayes' rule, which we know and love, we can rewrite the probability of fraud. In a BN, we know that the joint distribution is the product of each variable's probability given its parents (write the formula on the board). Factor out P(A)P(S) and cancel (terms not dependent on f can come out of the sum), giving:

p(f | a, s, g, j) = p(f) p(g | f) p(j | f, a, s) / Σf' p(f') p(g | f') p(j | f', a, s)

Now for the data mining! So far we haven't added much value to the data, so let's take advantage of the Bayesian model's ability to update our beliefs and learn from new data. First we'll rewrite our joint probability distribution in a more compact form.

The configuration vector relates to the local node parents.

Learning Probabilities in a Bayesian Network
First we need to make two assumptions:
There is no missing data (i.e., the data accurately describe the distribution)

The parameter vectors are independent (generally a good assumption, at least locally).

Parameter independence. The problem of learning probabilities in a BN can now be stated simply: given a random sample D, compute the posterior distribution of the parameters θ.

X and Y are independent: P(X, Y) = P(X)P(Y). Independent for any given data set: P(X, Y | D) = P(X | D)P(Y | D).

Learning Probabilities in a Bayesian Network
If these assumptions hold, we can express the posterior as a product of independent local terms:

p(θ | D, S^h) = Πi Πj p(θij | D, S^h)

where θij are the parameters of node Xi for the j-th configuration of its parents.
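Under these assumptions the learning problem decomposes into one small conjugate update per (node, parent-configuration) pair, just like the single-coin case. A minimal sketch with hypothetical pseudo-counts and data counts:

```python
# One independent Beta/Dirichlet update per (node, parent-configuration) pair.
# Keys: (node, parent_config); values: pseudo-counts over the node's states.
prior = {("Gas", ("Fraud=1",)): [1, 1],       # hypothetical uniform pseudo-counts
         ("Gas", ("Fraud=0",)): [1, 1]}

# Hypothetical counts from the data D for the same (node, parent_config) pairs.
counts = {("Gas", ("Fraud=1",)): [7, 3],      # [n(Gas=1), n(Gas=0)] when Fraud=1
          ("Gas", ("Fraud=0",)): [12, 988]}   # when Fraud=0

posterior = {}
for key in prior:
    # Parameter independence lets us update each theta_ij separately.
    posterior[key] = [a + n for a, n in zip(prior[key], counts[key])]
    total = sum(posterior[key])
    mean = [round(c / total, 3) for c in posterior[key]]
    print(key, "posterior counts:", posterior[key], "posterior mean:", mean)
```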

What if we are missing data? Distinguish whether the missing data are dependent on the variable states or independent of state. If independent of state:
Monte Carlo methods -> Gibbs sampling
Gaussian approximations

Dealing with Unknowns
Now we know how to use our network to infer conditional relationships and how to update our network with new data. But what if we aren't given a well-defined network? We could start with a missing or incomplete:
Set of variables
Conditional relationship data
Network structure

The three original components.

Unknown Variable Set
Our goal when choosing variables is to:

Organize observations into variables having mutually exclusive and collectively exhaustive states.

This is a problem shared by all data mining algorithms: What should we measure and why? There is not and probably cannot be an algorithmic solution to this problem as arriving at any solution requires intelligent and creative thought.

Heckerman recommends the use of domain knowledge and some statistical aids, but admits that the problem remains non-trivial.

Unknown Conditional Relationships
This can be easy.

So long as we can generate a plausible initial belief about a conditional relationship we can simply start with our assumption and let our data refine our model via the mechanism shown in the Learning Probabilities in a Bayesian Network slide.

Unknown Conditional Relationships
However, when our ignorance becomes serious enough that we no longer even know what is dependent on what, we segue into the unknown-structure scenario.

Learning the Network Structure
Sometimes the conditional relationships are not obvious. In this case we are uncertain about the network structure: we don't know where the edges should be.

Now, let us consider the problem of learning about both the structure and the probabilities of a BN given data. Assuming we think the structure can be improved, we must be uncertain about the network structure that encodes the physical joint probability distribution for X.

This is a problem so hard it could be used in cryptography!

Learning the Network Structure
Theoretically, we can use a Bayesian approach to get the posterior distribution of the network structure:

p(S^h | D) = p(S^h) p(D | S^h) / p(D)

Unfortunately, the number of possible network structures increases exponentially with n, the number of nodes. We're basically asking ourselves to consider every possible graph with n nodes!

Learning the Network Structure
Two main methods for shortening the search for a network model:
Model selection: select a good model (i.e., a network structure) from all possible models, and use it as if it were the correct model.

Selective model averaging: select a manageable number of good models from among all possible models and pretend that these models are exhaustive.

The math behind both techniques is quite involved, so I'm afraid we'll have to content ourselves with a toy example today.

This tutorial introduced the criteria for model selection; I am not going to get into them here. For those of you who are interested, you can read through them in detail.

Two Coin Toss Example
Experiment: flip two coins and observe the outcome.
Propose two network structures: S^h1 or S^h2.
Assume P(S^h1) = P(S^h2) = 0.5.
After observing some data, which model is more accurate for this collection of data?

S^h1: no edge between X1 and X2; p(H) = p(T) = 0.5 for each coin.
S^h2: an edge from X1 to X2; p(H) = p(T) = 0.5 for X1, and P(X2 | X1) given by p(H|H) = 0.1, p(T|H) = 0.9, p(H|T) = 0.9, p(T|T) = 0.1.

Here I am going to give you an example to show how to choose the network structure given observed data. I made up this example, which is far from the complexity of a realistic problem; I just want to give you a feeling for how it works.

This is a score-based assessment.

The assumption is that each model is equally likely a priori.

Two Coin Toss Example
Observed data (10 tosses of the pair):

Toss   X1   X2
1      T    T
2      T    H
3      H    T
4      H    T
5      T    H
6      H    T
7      T    H
8      T    H
9      H    T
10     H    T

Recall that for S^h1, P(X1 | X2) = P(X1) and P(X2 | X1) = P(X2), since there is no edge between the variables (they are independent).

Note that each pair of terms in the product corresponds to one data point.
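As a sketch of how the comparison works out, the code below scores both structures on the data above, assuming (as the slide's numbers suggest) that each structure's parameters are fixed at the values shown rather than integrated over; with equal structure priors, the posterior follows from Bayes' rule.

```python
data = [("T", "T"), ("T", "H"), ("H", "T"), ("H", "T"), ("T", "H"),
        ("H", "T"), ("T", "H"), ("T", "H"), ("H", "T"), ("H", "T")]

# S^h1: X1 and X2 are independent fair coins.
def likelihood_s1(d):
    p = 1.0
    for x1, x2 in d:
        p *= 0.5 * 0.5
    return p

# S^h2: X1 is a fair coin; X2 depends on X1 via the CPT from the slide.
p_x2_given_x1 = {("H", "H"): 0.1, ("H", "T"): 0.9,     # keys are (x1, x2)
                 ("T", "H"): 0.9, ("T", "T"): 0.1}

def likelihood_s2(d):
    p = 1.0
    for x1, x2 in d:
        p *= 0.5 * p_x2_given_x1[(x1, x2)]
    return p

l1, l2 = likelihood_s1(data), likelihood_s2(data)
prior = 0.5                      # P(S^h1) = P(S^h2) = 0.5
posterior_s2 = prior * l2 / (prior * l1 + prior * l2)
print(f"P(D|S^h1) = {l1:.3g}, P(D|S^h2) = {l2:.3g}")
print(f"P(S^h2|D) = {posterior_s2:.4f}")   # S^h2 explains this data far better
```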

Two Coin Toss Example (slides 46-48: the worked computation of each structure's score)

Outline
Bayesian Approach: Bayes' Theorem; Bayesian vs. classical probability methods; coin-toss example
Bayesian Network: Structure; Inference; Learning Probabilities; Learning the Network Structure; two-coin-toss example
Conclusions
Exam Questions

Conclusions
Bayesian method
Bayesian network: Structure; Inference
Learning with Bayesian Networks
Dealing with Unknowns

First, I reviewed Bayes' theorem and compared the Bayesian method with the classical method. I showed a coin-toss example to illustrate how to encode personal belief as prior information, and how to use observed data to update that belief. Second, I introduced several fundamental topics in BNs: the structure of a BN (variables, structure, CPTs), how to make inferences in a BN, and how to update the local probabilities in the CPTs. I briefly mentioned how to refine the network structure: model selection and model averaging. And we walked through an example of model averaging.

Question 1: What are Bayesian networks?
A graphical model that encodes probabilistic relationships among variables of interest.

Question 2: Compare the Bayesian and classical approaches to probability (any one point).

Bayesian approach:
+ Reflects an expert's knowledge
+ The belief keeps updating as new data arrive
- Arbitrary (more subjective)
Wants P( H | D )

Classical probability:
+ Objective and unbiased
- Needs repeated trials
Wants P( D | H )

Bayesian: we can use prior information combined with observed data, which allows for constant updates of probability assessments. Negative: prior selection can be subjective. Classical: unbiased. Negative: needs repeated trials. (E.g., what is the probability that the Wizards win the NBA championship? A Bayesian could make a subjective assessment; a frequentist would need repeated sampling.)

Question 3: Mention at least one advantage of Bayesian networks.
Handle incomplete data sets by encoding dependencies
Learn about causal relationships
Combine domain knowledge and data
Avoid overfitting

We didn't really get into overfitting, but the Bayesian approach of averaging over parameters and models helps avoid it.

The End
Any Questions?