Oliver Schulte Machine Learning 726 Bayes Nets and Probabilities.


  • Slide 1
  • Oliver Schulte Machine Learning 726 Bayes Nets and Probabilities
  • Slide 2
  • Bayes Nets: General Points
    - Represent domain knowledge.
    - Allow for uncertainty.
    - Complete representation of probabilistic knowledge.
    - Represent causal relations.
    - Fast answers to two types of queries:
      - Probabilistic: what is the probability that a patient has strep throat given that they have a fever?
      - Relevance: is fever relevant to having strep throat?
  • Slide 3
  • Bayes Net Links: Judea Pearl's Turing Award; see UBC's AISpace.
  • Slide 4
  • Probability Reasoning (With Bayes Nets)
  • Slide 5
  • Random Variables. A random variable has a probability associated with each of its values. A basic statement assigns a value to a random variable.

    | Variable | Value  | Probability |
    |----------|--------|-------------|
    | Weather  | Sunny  | 0.7         |
    | Weather  | Rainy  | 0.2         |
    | Weather  | Cloudy | 0.08        |
    | Weather  | Snow   | 0.02        |
    | Cavity   | True   | 0.2         |
    | Cavity   | False  | 0.8         |
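A minimal sketch (mine, not from the slides) of these two distributions as Python dicts, using the table's numbers:

```python
# Each random variable maps its values to probabilities (numbers from the slide).
weather = {"Sunny": 0.7, "Rainy": 0.2, "Cloudy": 0.08, "Snow": 0.02}
cavity = {True: 0.2, False: 0.8}

# Sanity check: each distribution must sum to 1 over the variable's values.
for name, dist in [("Weather", weather), ("Cavity", cavity)]:
    assert abs(sum(dist.values()) - 1.0) < 1e-9
    print(name, dist)
```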
  • Slide 6
  • Probability for Sentences. A sentence or query is formed by applying and, or, not recursively to basic statements. Sentences also have probabilities assigned to them.

    | Sentence                                 | Probability |
    |------------------------------------------|-------------|
    | P(Cavity = false AND Toothache = false)  | 0.72        |
    | P(Cavity = true OR Toothache = false)    | 0.08        |
  • Slide 7
  • Probability Notation. Probability theorists often write A, B instead of A ∧ B (as in Prolog). If the intended random variables are clear from context, they are often omitted.

    | Shorthand                            | Full Notation                          |
    |--------------------------------------|----------------------------------------|
    | P(Cavity = false, Toothache = false) | P(Cavity = false ∧ Toothache = false)  |
    | P(false, false)                      | P(Cavity = false ∧ Toothache = false)  |
  • Slide 8
  • Axioms of Probability. For any formulas A, B:
    1. 0 ≤ P(A) ≤ 1
    2. P(true) = 1 and P(false) = 0
    3. P(A ∨ B) = P(A) + P(B) - P(A ∧ B)
    4. P(A) = P(B) if A and B are logically equivalent.
    Formulas are considered as sets of complete assignments.
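A worked instance of the disjunction axiom (axiom 3), using the cavity numbers from the surrounding slides:

```latex
\begin{align*}
P(\mathit{Cavity} \lor \mathit{Toothache})
  &= P(\mathit{Cavity}) + P(\mathit{Toothache}) - P(\mathit{Cavity} \land \mathit{Toothache}) \\
  &= 0.2 + 0.2 - 0.12 = 0.28 .
\end{align*}
```

Consistently, the next slide gives P(NOT (Cavity OR Toothache)) = 1 - 0.28 = 0.72.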
  • Slide 9
  • Rule 1: Logical Equivalence

    | Expression                     | Equivalent expression            | Probability |
    |--------------------------------|----------------------------------|-------------|
    | P(NOT (NOT Cavity))            | P(Cavity)                        | 0.2         |
    | P(NOT (Cavity OR Toothache))   | P(Cavity = F AND Toothache = F)  | 0.72        |
    | P(NOT (Cavity AND Toothache))  | P(Cavity = F OR Toothache = F)   | 0.88        |
  • Slide 10
  • The Logical Equivalence Pattern

    | Identity                                                   | Probability |
    |------------------------------------------------------------|-------------|
    | P(NOT (NOT Cavity)) = P(Cavity)                            | 0.2         |
    | P(NOT (Cavity OR Toothache)) = P(Cavity = F AND Toothache = F)  | 0.72   |
    | P(NOT (Cavity AND Toothache)) = P(Cavity = F OR Toothache = F)  | 0.88   |

    Rule 1: logically equivalent expressions have the same probability.
  • Slide 11
  • Rule 2: Marginalization

    | Term 1                          | Term 2                              | Result (spot the relation) |
    |---------------------------------|-------------------------------------|----------------------------|
    | P(Cavity, Toothache) = 0.12     | P(Cavity, Toothache = F) = 0.08     | P(Cavity) = 0.2            |
    | P(Cavity, Toothache) = 0.12     | P(Cavity = F, Toothache) = 0.08     | P(Toothache) = 0.2         |
    | P(Cavity = F, Toothache) = 0.08 | P(Cavity = F, Toothache = F) = 0.72 | P(Cavity = F) = 0.8        |
  • Slide 12
  • The Marginalization Pattern

    | Identity                                                                | Numbers            |
    |-------------------------------------------------------------------------|--------------------|
    | P(Cavity, Toothache) + P(Cavity, Toothache = F) = P(Cavity)             | 0.12 + 0.08 = 0.2  |
    | P(Cavity, Toothache) + P(Cavity = F, Toothache) = P(Toothache)          | 0.12 + 0.08 = 0.2  |
    | P(Cavity = F, Toothache) + P(Cavity = F, Toothache = F) = P(Cavity = F) | 0.08 + 0.72 = 0.8  |
  • Slide 13
  • Prove the Pattern: Marginalization. Theorem: P(A) = P(A, B) + P(A, not B). Proof:
    1. A is logically equivalent to (A and B) or (A and not B).
    2. P(A) = P((A and B) or (A and not B)) = P(A and B) + P(A and not B) - P((A and B) and (A and not B)), by the disjunction rule.
    3. (A and B) and (A and not B) is logically equivalent to false, so P((A and B) and (A and not B)) = 0.
    4. So step 2 implies P(A) = P(A and B) + P(A and not B).
  • Slide 14
  • Completeness of Bayes Nets. A probabilistic query system is complete if it can compute a probability for every sentence. Proposition: a Bayes net is complete. The proof has two steps:
    1. Any system that encodes the joint distribution is complete.
    2. A Bayes net encodes the joint distribution.
  • Slide 15
  • The Joint Distribution
  • Slide 16
  • Assigning Probabilities to Sentences. A complete assignment is a conjunctive sentence that assigns a value to each random variable. The joint probability distribution specifies a probability for each complete assignment. A joint distribution therefore determines a probability for every sentence. How? Spot the pattern.
  • Slide 17
  • Probabilities for Sentences: Spot the Pattern

    | Sentence                                 | Probability |
    |------------------------------------------|-------------|
    | P(Cavity = false AND Toothache = false)  | 0.72        |
    | P(Cavity = true OR Toothache = false)    | 0.08        |
    | P(Toothache = false)                     | 0.8         |
  • Slide 18
  • Inference by enumeration
  • Slide 19
  • Inference by enumeration. Marginalization: for any sentence A, sum the joint probabilities over the complete assignments where A is true. Example: P(toothache) = 0.108 + 0.012 + 0.016 + 0.064 = 0.2.
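A minimal Python sketch of enumeration; the four toothache entries are the ones on the slide, and the remaining four cells are filled in from the standard AIMA dentistry joint (an assumption, since they are not shown in this transcript):

```python
# Joint distribution over (toothache, catch, cavity); the eight entries sum to 1.
joint = {
    (True,  True,  True ): 0.108, (True,  True,  False): 0.016,
    (True,  False, True ): 0.012, (True,  False, False): 0.064,
    (False, True,  True ): 0.072, (False, True,  False): 0.144,
    (False, False, True ): 0.008, (False, False, False): 0.576,
}

def prob(event):
    """Sum the joint over all complete assignments where the event holds."""
    return sum(p for world, p in joint.items() if event(world))

# P(toothache) = 0.108 + 0.012 + 0.016 + 0.064 = 0.2
print(prob(lambda w: w[0]))
```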
  • Slide 20
  • Completeness Proof for Joint Distribution. Theorem (from propositional logic): every sentence is logically equivalent to a disjunction A_1 or A_2 or ... or A_k, where the A_i are complete assignments.
    1. All of the A_i are mutually exclusive (pairwise joint probability 0). Why?
    2. So if S is equivalent to A_1 or A_2 or ... or A_k, then P(S) = Σ_i P(A_i), where each P(A_i) is given by the joint distribution.
  • Slide 21
  • Bayes Nets and The Joint Distribution
  • Slide 22
  • Example: Complete Bayesian Network
  • Slide 23
  • The Story. You have a new burglar alarm installed at home. It's reliable at detecting burglary but also responds to earthquakes. You have two neighbors who promise to call you at work when they hear the alarm. John always calls when he hears the alarm, but sometimes confuses the telephone ringing with the alarm. Mary listens to loud music and sometimes misses the alarm.
  • Slide 24
  • Computing the Joint Distribution. A Bayes net provides a compact, factored representation of a joint distribution. In words, the joint probability of a complete assignment is computed as follows. For each node X_i:
    1. Find the assigned value x_i.
    2. Find the values y_1, ..., y_k assigned to the parents of X_i.
    3. Look up the conditional probability P(x_i | y_1, ..., y_k) in the Bayes net.
    Then multiply all of these conditional probabilities together.
  • Slide 25
  • Product Formula Example: Burglary. Query: what is the joint probability that all variables are true? P(M, J, A, E, B) = P(M|A) P(J|A) P(A|E,B) P(E) P(B) = 0.7 × 0.9 × 0.95 × 0.002 × 0.001 ≈ 1.2 × 10^-6.
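The same computation as a short Python sketch, with the conditional probability values the slide uses:

```python
# Product formula for the complete assignment where every variable is true.
p_b, p_e = 0.001, 0.002   # P(B), P(E): priors for Burglary and Earthquake
p_a_given_be = 0.95       # P(A | B, E): alarm given burglary and earthquake
p_j_given_a = 0.9         # P(J | A): John calls given alarm
p_m_given_a = 0.7         # P(M | A): Mary calls given alarm

p = p_m_given_a * p_j_given_a * p_a_given_be * p_e * p_b
print(p)  # ~1.197e-06
```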
  • Slide 26
  • Compactness of Bayesian Networks. Consider n binary variables. An unconstrained joint distribution requires O(2^n) probabilities. With a Bayesian network in which each node has at most k parents, we need only O(n · 2^k) probabilities. Example: the full unconstrained joint with n = 30 needs 2^30 probabilities; a Bayesian network with n = 30 and k = 4 needs 480 probabilities.
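Working out the example's numbers:

```latex
2^{30} = 1{,}073{,}741{,}824
\qquad \text{vs.} \qquad
n \cdot 2^{k} = 30 \cdot 2^{4} = 480 .
```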
  • Slide 27
  • Summary: Why are Bayes nets useful?
    - Graph structure supports:
      - Modular representation of knowledge
      - Local, distributed algorithms for inference and learning
      - Intuitive (possibly causal) interpretation
    - The factored representation may have exponentially fewer parameters than the full joint P(X_1, ..., X_n), which means:
      - lower sample complexity (less data for learning)
      - lower time complexity (less time for inference)
  • Slide 28
  • Is it Magic? How can the Bayes net reduce the number of parameters? By exploiting conditional independencies. Why does the product formula work?
    1. The Bayes net's topological (graphical) semantics: the graph by itself entails conditional independencies.
    2. The chain rule.
  • Slide 29
  • Conditional Probabilities and Independence
  • Slide 30
  • Conditional Probabilities: Intro. Given (A) that a die comes up with an odd number, what is the probability that (B) the number is
    1. a 2?
    2. a 3?
    Answer: the number of cases that satisfy both A and B, out of the number of cases that satisfy A:
    1. #faces with (odd and 2) / #faces with odd = 0/3 = 0.
    2. #faces with (odd and 3) / #faces with odd = 1/3.
  • Slide 31
  • Conditional Probs ctd. Suppose that 50 students are taking 310, and 30 of them are women. Given (A) that a student is taking 310, what is the probability that (B) they are a woman? Answer: #students who take 310 and are women / #students in 310 = 30/50 = 3/5. Notation: this conditional probability is written P(B|A).
  • Slide 32
  • Conditional Ratios: Spot the Pattern

    | Quantity 1                       | Quantity 2                                      | Conditional                                            |
    |----------------------------------|-------------------------------------------------|--------------------------------------------------------|
    | P(Student takes 310) = 50/15,000 | P(Student takes 310 and is a woman) = 30/15,000 | P(Student is a woman \| Student takes 310) = 3/5       |
    | P(die comes up odd) = 1/2        | P(die comes up 3) = 1/6                         | P(3 \| odd) = 1/3                                      |
  • Slide 33
  • Conditional Probs: The Ratio Pattern

    | Identity                                                                                            | Numbers                       |
    |-----------------------------------------------------------------------------------------------------|-------------------------------|
    | P(Student takes 310 and is a woman) / P(Student takes 310) = P(Student is a woman \| Student takes 310) | (30/15,000) / (50/15,000) = 3/5 |
    | P(die comes up 3) / P(die comes up odd) = P(3 \| odd)                                               | (1/6) / (1/2) = 1/3           |

    P(A|B) = P(A and B) / P(B). Important!
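A quick Python check of the ratio pattern on the class example (the 15,000 total is the slide's figure):

```python
# P(A | B) = P(A and B) / P(B), with B = "takes 310" and A = "is a woman".
total = 15_000
p_takes_310 = 50 / total
p_takes_310_and_woman = 30 / total

print(p_takes_310_and_woman / p_takes_310)  # 0.6, i.e. 3/5
```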
  • Slide 34
  • Conditional Probabilities: Motivation. Much knowledge can be represented as implications B_1, ..., B_k ⇒ A. Conditional probabilities are a probabilistic version of reasoning about what follows from given conditions. Cognitive science: our minds store implicational knowledge.
  • Slide 35
  • The Product Rule: Spot the Pattern

    | Factor 1            | Factor 2                          | Result (spot the relation)       |
    |---------------------|-----------------------------------|----------------------------------|
    | P(Cavity) = 0.2     | P(Toothache \| Cavity) = 0.6      | P(Cavity, Toothache) = 0.12      |
    | P(Cavity = F) = 0.8 | P(Toothache \| Cavity = F) = 0.1  | P(Toothache, Cavity = F) = 0.08  |
    | P(Toothache) = 0.2  | P(Cavity \| Toothache) = 0.6      | P(Cavity, Toothache) = 0.12      |
  • Slide 36
  • The Product Rule Pattern

    | Identity                                                           | Numbers          |
    |--------------------------------------------------------------------|------------------|
    | P(Cavity) × P(Toothache \| Cavity) = P(Cavity, Toothache)          | 0.2 × 0.6 = 0.12 |
    | P(Cavity = F) × P(Toothache \| Cavity = F) = P(Toothache, Cavity = F) | 0.8 × 0.1 = 0.08 |
    | P(Toothache) × P(Cavity \| Toothache) = P(Cavity, Toothache)       | 0.2 × 0.6 = 0.12 |
  • Slide 37
  • Independence. A and B are independent iff P(A|B) = P(A), or P(B|A) = P(B), or P(A, B) = P(A) P(B). Suppose that Weather is independent of the Cavity scenario. Then the joint distribution decomposes: P(Toothache, Catch, Cavity, Weather) = P(Toothache, Catch, Cavity) P(Weather). Absolute independence is powerful but rare. Dentistry is a large field with hundreds of variables, none of which are independent. What to do?
  • Slide 38
  • Exercise: prove that the three definitions of independence are equivalent (assuming all probabilities are positive). A and B are independent iff
    1. P(A|B) = P(A), or
    2. P(B|A) = P(B), or
    3. P(A, B) = P(A) P(B).
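One standard route through the exercise, sketched (assuming P(A), P(B) > 0):

```latex
\text{(1)} \Rightarrow \text{(3)}: \quad
  P(A, B) = P(A \mid B)\, P(B) = P(A)\, P(B). \\
\text{(3)} \Rightarrow \text{(2)}: \quad
  P(B \mid A) = \frac{P(A, B)}{P(A)} = \frac{P(A)\, P(B)}{P(A)} = P(B). \\
\text{(2)} \Rightarrow \text{(1)}: \quad \text{symmetric to the step above.}
```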
  • Slide 39
  • Conditional independence. If I have a cavity, the probability that the probe catches in it doesn't depend on whether I have a toothache: (1) P(catch | toothache, cavity) = P(catch | cavity). The same independence holds if I haven't got a cavity: (2) P(catch | toothache, ¬cavity) = P(catch | ¬cavity). Catch is conditionally independent of Toothache given Cavity: P(Catch | Toothache, Cavity) = P(Catch | Cavity). The equivalences for plain independence also hold for conditional independence, e.g.: P(Toothache | Catch, Cavity) = P(Toothache | Cavity); P(Toothache, Catch | Cavity) = P(Toothache | Cavity) P(Catch | Cavity). Conditional independence is our most basic and robust form of knowledge about uncertain environments.
  • Slide 40
  • Bayes Nets Graphical Semantics
  • Slide 41
  • Common Causes: Spot the Pattern. Graph: Catch ← Cavity → Toothache. Catch is independent of Toothache given Cavity.
  • Slide 42
  • Burglary Example. JohnCalls, MaryCalls are conditionally independent given Alarm.
  • Slide 43
  • Spot the Pattern: Chain Scenario. MaryCalls is independent of Burglary given Alarm. JohnCalls is independent of Earthquake given Alarm.
  • Slide 44
  • The Markov Condition. A Bayes net is constructed so that each variable is conditionally independent of its nondescendants given its parents. The graph alone (without specified probabilities) entails these conditional independencies. Causal interpretation: each parent is a direct cause.
  • Slide 45
  • Derivation of the Product Formula
  • Slide 46
  • The Chain Rule. We can always write P(a, b, c, ..., z) = P(a | b, c, ..., z) P(b, c, ..., z) (product rule). Repeatedly applying this idea yields P(a, b, c, ..., z) = P(a | b, c, ..., z) P(b | c, ..., z) ··· P(z). Order the variables so that children come before parents; then, given its parents, each node is independent of its other ancestors by the topological independence, and the factors simplify to P(a, b, c, ..., z) = Π_x P(x | parents(x)).
  • Slide 47
  • Example in Burglary Network.
    P(M, J, A, E, B)
    = P(M | J, A, E, B) P(J, A, E, B)
    = P(M | A) P(J, A, E, B)
    = P(M | A) P(J | A, E, B) P(A, E, B)
    = P(M | A) P(J | A) P(A, E, B)
    = P(M | A) P(J | A) P(A | E, B) P(E, B)
    = P(M | A) P(J | A) P(A | E, B) P(E) P(B)
    The simplification steps apply the Bayes net topological independence (marked in colour on the original slide).
  • Slide 48
  • Explaining Away
  • Slide 49
  • Common Effects: Spot the Pattern. Graph: Influenza → Bronchitis ← Smokes. Influenza and Smokes are independent; given Bronchitis, they become dependent. Graph: Battery Age → Battery Voltage ← Charging System OK. Battery Age and Charging System are independent; given Battery Voltage, they become dependent.
  • Slide 50
  • Conditioning on Children. Graph: A → C ← B. Independent causes: A and B are independent. Explaining-away effect: given C, observing A makes B less likely, e.g. Bronchitis in the UBC Simple Diagnostic Problem. A and B are (marginally) independent but become dependent once C is known.
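A small numerical sketch of explaining away on the collider A → C ← B. All the CPT numbers below are illustrative assumptions of mine, not values from the slides:

```python
from itertools import product

p_a = {1: 0.3, 0: 0.7}  # assumed prior P(A)
p_b = {1: 0.3, 0: 0.7}  # assumed prior P(B)
# Assumed P(C = 1 | A, B): either cause alone makes C likely.
p_c1 = {(0, 0): 0.001, (0, 1): 0.9, (1, 0): 0.9, (1, 1): 0.99}

def joint(a, b, c):
    """P(a, b, c) by the product formula for the net A -> C <- B."""
    pc = p_c1[(a, b)]
    return p_a[a] * p_b[b] * (pc if c == 1 else 1.0 - pc)

def cond(query, evidence):
    """P(query | evidence); both map variable names 'A','B','C' to 0/1."""
    idx = {"A": 0, "B": 1, "C": 2}
    num = den = 0.0
    for world in product([0, 1], repeat=3):
        p = joint(*world)
        if all(world[idx[v]] == val for v, val in evidence.items()):
            den += p
            if all(world[idx[v]] == val for v, val in query.items()):
                num += p
    return num / den

print(cond({"B": 1}, {"C": 1}))          # ~0.59: observing C raises belief in B
print(cond({"B": 1}, {"C": 1, "A": 1}))  # ~0.32: also observing A explains C away
```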
  • Slide 51
  • D-separation. Let A, B, and C be non-intersecting subsets of nodes in a directed graph. A path from A to B is blocked if it contains a node such that either
    a) the arrows on the path meet head-to-tail or tail-to-tail at the node, and the node is in the set C, or
    b) the arrows meet head-to-head at the node, and neither the node nor any of its descendants is in the set C.
    If all paths from A to B are blocked, A is said to be d-separated from B by C. If A is d-separated from B by C, then the joint distribution over all the variables in the graph satisfies A ⊥ B | C.
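A worked application (mine) to the burglary network, writing B, E, A, J, M for Burglary, Earthquake, Alarm, JohnCalls, MaryCalls:

```latex
\text{Tail-to-tail: the path } J \leftarrow A \rightarrow M
  \text{ is blocked by } C = \{A\},
  \text{ so } J \perp M \mid A. \\
\text{Head-to-head: the path } B \rightarrow A \leftarrow E
  \text{ is blocked by } C = \emptyset
  \text{ (neither } A \text{ nor its descendants are in } C\text{)},
  \text{ so } B \perp E.
```

Conditioning on A, or on a descendant of A such as J, unblocks the second path: this is the explaining-away situation from the previous slides.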
  • Slide 52
  • D-separation: Example
  • Slide 53
  • Mathematical Analysis. Theorem: if A and B have no common ancestors and neither is a descendant of the other, then they are independent of each other. Proof for our example (the graph A → C ← B):
    P(a, b) = Σ_c P(a, b, c) = Σ_c P(a) P(b) P(c | a, b) = P(a) P(b) Σ_c P(c | a, b) = P(a) P(b).
  • Slide 54
  • Bayes Theorem
  • Slide 55
  • Abductive Reasoning. Implications are often causal, from cause to effect: Burglary → Alarm, Cavity → Toothache. Many important queries are diagnostic, from effect to cause.
  • Slide 56
  • Bayes Theorem: Another Example. A doctor knows the following. The disease meningitis causes the patient to have a stiff neck 50% of the time. The prior probability that someone has meningitis is 1/50,000. The prior probability that someone has a stiff neck is 1/20. Question: knowing that a person has a stiff neck, what is the probability that they have meningitis?
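Working the numbers through Bayes' theorem (the result matches the diagnosis table two slides below):

```latex
P(M \mid S) = \frac{P(S \mid M)\, P(M)}{P(S)}
            = \frac{(1/2)(1/50{,}000)}{1/20}
            = \frac{1}{5{,}000} = 0.0002 .
```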
  • Slide 57
  • Spot the Pattern: Diagnosis

    | P(Cause)                 | P(Effect \| Cause)                | P(Effect)            | P(Cause \| Effect)                     |
    |--------------------------|-----------------------------------|----------------------|----------------------------------------|
    | P(Cavity) = 0.2          | P(Toothache \| Cavity) = 0.6      | P(Toothache) = 0.2   | P(Cavity \| Toothache) = 0.6           |
    | P(Wumpus) = 0.2          | P(Stench \| Wumpus) = 0.6         | P(Stench) = 0.2      | P(Wumpus \| Stench) = 0.6              |
    | P(Meningitis) = 1/50,000 | P(Stiff Neck \| Meningitis) = 1/2 | P(Stiff Neck) = 1/20 | P(Meningitis \| Stiff Neck) = 1/5,000  |
  • Slide 58
  • Spot the Pattern: Diagnosis

    | Identity                                                                              | Numbers                          |
    |---------------------------------------------------------------------------------------|----------------------------------|
    | P(Cavity) × P(Toothache \| Cavity) / P(Toothache) = P(Cavity \| Toothache)            | 0.2 × 0.6 / 0.2 = 0.6            |
    | P(Wumpus) × P(Stench \| Wumpus) / P(Stench) = P(Wumpus \| Stench)                     | 0.2 × 0.6 / 0.2 = 0.6            |
    | P(Meningitis) × P(Stiff Neck \| Meningitis) / P(Stiff Neck) = P(Meningitis \| Stiff Neck) | (1/50,000) × (1/2) / (1/20) = 1/5,000 |
  • Slide 59
  • Explain the Pattern: Bayes Theorem. Exercise: prove Bayes' theorem, P(A | B) = P(B | A) P(A) / P(B).
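A sketch of the proof, using the ratio definition of conditional probability (and assuming P(B) > 0):

```latex
P(A \mid B)\, P(B) = P(A, B) = P(B \mid A)\, P(A)
\quad \Longrightarrow \quad
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)} .
```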
  • Slide 60
  • On Bayes Theorem. P(a | b) = P(b | a) P(a) / P(b). Useful for assessing diagnostic probability from causal probability: P(Cause | Effect) = P(Effect | Cause) P(Cause) / P(Effect).
    - Likelihood P(Effect | Cause): how well does the cause explain the effect?
    - Prior P(Cause): how plausible is the explanation before any evidence?
    - Evidence term / normalization constant P(Effect): how surprising is the evidence?