INFERENCE IN BAYESIAN NETWORKS

AGENDA

- Reading off independence assumptions
- Efficient inference in Bayesian Networks
  - Top-down inference
  - Variable elimination
  - Monte-Carlo methods

SOME APPLICATIONS OF BN

- Medical diagnosis
- Troubleshooting of hardware/software systems
- Fraud/uncollectible debt detection
- Data mining
- Analysis of genetic sequences
- Data interpretation, computer vision, image understanding

MORE COMPLICATED SINGLY-CONNECTED BELIEF NET

[Figure: belief network with nodes including Battery, SparkPlugs, Starts]

[Figure: network with variable Region = {Sky, Tree, Grass, Rock}]

[Figure: BN to evaluate insurance risks]

BN FROM LAST LECTURE

[Figure: Burglary and Earthquake (causes) → Alarm → JohnCalls and MaryCalls (effects)]

Directed acyclic graph

Intuitive meaning of arc from x to y:

“x has direct influence on y”

ARCS DO NOT NECESSARILY ENCODE CAUSALITY!

Two different BNs can encode the same joint probability distribution

READING OFF INDEPENDENCE RELATIONSHIPS

Given B, does the value of A affect the probability of C? That is, does P(C|B,A) differ from P(C|B)?

No! C’s parent (B) is given, and so C is independent of its non-descendants (A): P(C|B,A) = P(C|B)

Independence is symmetric: C ⊥ A | B ⇒ A ⊥ C | B

WHAT DOES THE BN ENCODE?

A node is independent of its non-descendants, given its parents

How about Burglary ⊥ Earthquake | Alarm? No! Why?
P(B,E|A) = P(A|B,E)P(B,E)/P(A) = 0.00075, whereas P(B|A)P(E|A) = 0.086

How about Burglary ⊥ Earthquake | JohnCalls? No! Why?
Knowing JohnCalls affects the probability of Alarm, which makes Burglary and Earthquake dependent
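The two numbers in the Burglary/Earthquake/Alarm comparison can be checked with a short script. This is a minimal sketch, assuming the priors P(B) = 0.001 and P(E) = 0.002 and the P(A|B,E) table used in the calculations in these slides; the helper names `prior` and `joint_bea` are illustrative only.

```python
# CPT values from the alarm network used throughout these slides.
P_B = 0.001          # P(Burglary)
P_E = 0.002          # P(Earthquake)
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}  # P(Alarm | B, E)

def prior(p_true, value):
    """Probability that a Boolean variable with P(true)=p_true takes `value`."""
    return p_true if value else 1.0 - p_true

def joint_bea(b, e):
    """P(B=b, E=e, Alarm=true), using the BN factorization."""
    return P_A[(b, e)] * prior(P_B, b) * prior(P_E, e)

values = (True, False)
p_a = sum(joint_bea(b, e) for b in values for e in values)      # P(A)
p_be_given_a = joint_bea(True, True) / p_a                       # P(B,E | A)
p_b_given_a = sum(joint_bea(True, e) for e in values) / p_a      # P(B | A)
p_e_given_a = sum(joint_bea(b, True) for b in values) / p_a      # P(E | A)

# If B and E were independent given A, these two numbers would be equal.
print(p_be_given_a, p_b_given_a * p_e_given_a)
```

The first value comes out near 0.00075 and the second near 0.086, so the conditional independence does not hold.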

INDEPENDENCE RELATIONSHIPS

Rough intuition (this holds for tree-like graphs, polytrees):
- Evidence on the (directed) road between two variables makes them independent
- Evidence on an “A” node makes descendants independent
- Evidence on a “V” node, or below the V, makes the ancestors of the variables dependent (otherwise they are independent)

Formal property in general case: D-separation implies independence (see R&N)

BENEFITS OF SPARSE MODELS

Modeling
- Fewer relationships need to be encoded (either through understanding or statistics)
- Large networks can be built up from smaller ones

Intuition
- Dependencies/independencies between variables can be inferred through network structures

Tractable inference

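[Figure: Burglary, Earthquake → Alarm → JohnCalls, MaryCalls, with conditional probability tables]

P(B) = 0.001, P(E) = 0.002 (the priors used in the calculations in these slides)

B  E  | P(A|B,E)
T  T  | 0.95
T  F  | 0.94
F  T  | 0.29
F  F  | 0.001

A  | P(J|A)
T  | 0.90
F  | 0.05

A  | P(M|A)
T  | 0.70
F  | 0.01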

TOP-DOWN INFERENCE

Suppose we want to compute P(Alarm)

1. P(A) = Σ_{b,e} P(A,b,e)
2. P(A) = Σ_{b,e} P(A|b,e)P(b)P(e)
3. P(A) = P(A|B,E)P(B)P(E) + P(A|B,¬E)P(B)P(¬E) + P(A|¬B,E)P(¬B)P(E) + P(A|¬B,¬E)P(¬B)P(¬E)
4. P(A) = 0.95*0.001*0.002 + 0.94*0.001*0.998 + 0.29*0.999*0.002 + 0.001*0.999*0.998 = 0.00252

Now, suppose we want to compute P(MaryCalls)

1. P(M) = P(M|A)P(A) + P(M|¬A)P(¬A)
2. P(M) = 0.70*0.00252 + 0.01*(1-0.00252) = 0.0117
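The top-down computation of P(Alarm) and then P(MaryCalls) can be sketched in a few lines. This is only an illustration, assuming the CPT values from these slides; the function names `p_alarm` and `p_mary_calls` are made up for the example.

```python
# CPT values from the alarm network in these slides.
P_B, P_E = 0.001, 0.002
P_A_GIVEN = {(True, True): 0.95, (True, False): 0.94,
             (False, True): 0.29, (False, False): 0.001}  # P(A | B, E)
P_M_GIVEN_A = {True: 0.70, False: 0.01}                    # P(M | A)

def prob(p_true, value):
    """P(X=value) for a Boolean X with P(X=true)=p_true."""
    return p_true if value else 1.0 - p_true

def p_alarm():
    """P(A) = sum over b, e of P(A|b,e) P(b) P(e)."""
    return sum(P_A_GIVEN[(b, e)] * prob(P_B, b) * prob(P_E, e)
               for b in (True, False) for e in (True, False))

def p_mary_calls():
    """P(M) = P(M|A) P(A) + P(M|not A) P(not A), reusing P(A) from above."""
    pa = p_alarm()
    return P_M_GIVEN_A[True] * pa + P_M_GIVEN_A[False] * (1.0 - pa)

print(round(p_alarm(), 5), round(p_mary_calls(), 4))
```

Working top-down means each node's marginal is computed once from its parents' marginals and then reused by its children.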

TOP-DOWN INFERENCE WITH EVIDENCE

Suppose we want to compute P(Alarm|Earthquake)

Write P(A|e) for P(Alarm|Earthquake)

1. P(A|e) = Σ_b P(A,b|e)
2. P(A|e) = Σ_b P(A|b,e)P(b) [B and E have no common ancestors, so P(b|e) = P(b)]
3. P(A|e) = 0.95*0.001 + 0.29*0.999 = 0.29066
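With evidence on Earthquake, only Burglary remains to be summed out. A minimal sketch, assuming the P(A|B,E) table and P(B) = 0.001 from these slides (the function name is illustrative):

```python
P_B = 0.001  # P(Burglary)
P_A_GIVEN = {(True, True): 0.95, (True, False): 0.94,
             (False, True): 0.29, (False, False): 0.001}  # P(A | B, E)

def p_alarm_given_earthquake(e=True):
    """P(A | E=e) = sum over b of P(A | b, e) P(b)."""
    return sum(P_A_GIVEN[(b, e)] * (P_B if b else 1.0 - P_B)
               for b in (True, False))

print(p_alarm_given_earthquake(True))
```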

TOP-DOWN INFERENCE

- Only works if the graph of ancestors of a variable is a polytree
- Evidence given on ancestor(s) of the query variable
- Efficient: O(d·2^k) time, where d is the number of ancestors of a variable, with k a bound on # of parents
- Evidence on an ancestor cuts off influence of portion of graph above evidence node

QUERYING THE BN

The BN gives P(T|C)
What about P(C|T)?

[Figure: Cavity → Toothache]

C  | P(T|C)
T  | 0.4
F  | 0.01111

BAYES’ RULE

P(A,B) = P(A|B) P(B) = P(B|A) P(A)

So… P(A|B) = P(B|A) P(A) / P(B)

APPLYING BAYES’ RULE

Let A be a cause, B be an effect, and let’s say we know P(B|A) and P(A) (conditional probability tables)

What’s P(B)?
P(B) = Σ_a P(B,A=a) [marginalization]
P(B,A=a) = P(B|A=a)P(A=a) [conditional probability]
So, P(B) = Σ_a P(B|A=a) P(A=a)

What’s P(A|B)?
P(A|B) = P(B|A)P(A)/P(B) [Bayes’ rule]
P(B) = Σ_a P(B|A=a) P(A=a) [from the previous step]
So, P(A|B) = P(B|A)P(A) / [Σ_a P(B|A=a) P(A=a)]

HOW DO WE READ THIS?

P(A|B) = P(B|A)P(A) / [Σ_a P(B|A=a) P(A=a)]
[An equation that holds for all values A can take on, and all values B can take on]

P(A=a|B=b) = P(B=b|A=a)P(A=a) / [Σ_a P(B=b|A=a) P(A=a)]

Are these the same a? No: the a inside the sum is a bound summation index, distinct from the a on the left-hand side. Rename it:

P(A=a|B=b) = P(B=b|A=a)P(A=a) / [Σ_a’ P(B=b|A=a’) P(A=a’)]

Be careful about indices!

QUERYING THE BN

The BN gives P(T|C)
What about P(C|T)?

P(Cavity|Toothache) = P(Toothache|Cavity) P(Cavity) / P(Toothache) [Bayes’ rule]

Denominator computed by summing out numerator over Cavity and ¬Cavity

Querying a BN is just applying Bayes’ rule on a larger scale…
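As a small worked instance of the Cavity/Toothache query: the slides give P(T|C) = 0.4 and P(T|¬C) = 0.01111 but the prior P(Cavity) is not in this transcript, so the value 0.1 below is purely an assumption for illustration, as is the function name `posterior`.

```python
def posterior(prior, lik_if_true, lik_if_false):
    """P(cause | effect) by Bayes' rule for a Boolean cause.

    The denominator sums the numerator over cause=true and cause=false.
    """
    numerator = lik_if_true * prior
    evidence = numerator + lik_if_false * (1.0 - prior)
    return numerator / evidence

P_T_GIVEN_C = 0.4          # from the CPT on the slide
P_T_GIVEN_NOT_C = 0.01111  # from the CPT on the slide
ASSUMED_P_CAVITY = 0.1     # hypothetical prior, NOT from the slides

p_cavity_given_toothache = posterior(ASSUMED_P_CAVITY, P_T_GIVEN_C, P_T_GIVEN_NOT_C)
print(p_cavity_given_toothache)
```

Under that assumed prior the posterior is about 0.8; a different prior would give a different answer.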

PERFORMING INFERENCE

- Variables X
- Have evidence set E=e, query variable Q
- Want to compute the posterior probability distribution over Q, given E=e
- Let the non-evidence variables be Y (= X \ E)
- Straightforward method:
  1. Compute joint P(Y,E=e)
  2. Marginalize to get P(Q,E=e)
  3. Divide by P(E=e) to get P(Q|E=e)

INFERENCE IN THE ALARM EXAMPLE

P(J|MaryCalls) = ?? (query Q: JohnCalls; evidence E=e: MaryCalls)

1. P(J,A,B,E,MaryCalls) = P(J|A)P(MaryCalls|A)P(A|B,E)P(B)P(E)
   using P(x1,x2,…,xn) = Π_{i=1,…,n} P(xi|parents(Xi))
   Full joint distribution table: 2^4 entries

2. P(J,MaryCalls) = Σ_{a,b,e} P(J,A=a,B=b,E=e,MaryCalls)
   2 entries: one for JohnCalls, the other for ¬JohnCalls

3. P(J|MaryCalls) = P(J,MaryCalls)/P(MaryCalls) = P(J,MaryCalls)/(Σ_j P(j,MaryCalls))
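The three steps for P(J|MaryCalls) can be sketched as brute-force enumeration. This assumes the CPT values from these slides; `joint` and `p_john_given_mary` are illustrative names.

```python
from itertools import product

# CPT values from the alarm network in these slides.
P_B, P_E = 0.001, 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}  # P(A | B, E)
P_J = {True: 0.90, False: 0.05}                      # P(J | A)
P_M = {True: 0.70, False: 0.01}                      # P(M | A)

def pr(p_true, value):
    """P(X=value) for a Boolean X with P(X=true)=p_true."""
    return p_true if value else 1.0 - p_true

def joint(b, e, a, j, m):
    """Step 1: one entry of the joint, as a product of CPT entries."""
    return (pr(P_B, b) * pr(P_E, e) * pr(P_A[(b, e)], a)
            * pr(P_J[a], j) * pr(P_M[a], m))

def p_john_given_mary():
    # Step 2: P(J, m), summing out a, b, e for each value of J.
    p_jm = {j: sum(joint(b, e, a, j, True)
                   for b, e, a in product((True, False), repeat=3))
            for j in (True, False)}
    # Step 3: normalize by P(m) = sum over j of P(j, m).
    return p_jm[True] / (p_jm[True] + p_jm[False])

print(round(p_john_given_mary(), 4))
```

The loop touches all 2^4 assignments of the non-evidence variables, which is exactly the cost that variable elimination tries to avoid.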

HOW EXPENSIVE?

P(X) = P(x1,x2,…,xn) = Π_{i=1,…,n} P(xi|parents(Xi))

Straightforward method:
1. Use above to compute P(Y,E=e)
2. P(Q,E=e) = Σ_y1 … Σ_yk P(Y,E=e)
3. P(E=e) = Σ_q P(Q,E=e)

Step 1: O(2^(n-|E|)) entries!
Step 3: normalization factor, no big deal once we have P(Q,E=e)

Can we do better?

VARIABLE ELIMINATION

Consider linear network X1 → X2 → X3

P(X) = P(X1) P(X2|X1) P(X3|X2)

P(X3) = Σ_x1 Σ_x2 P(x1) P(x2|x1) P(X3|x2)
      = Σ_x2 P(X3|x2) Σ_x1 P(x1) P(x2|x1)   [rearrange equation]
      = Σ_x2 P(X3|x2) P(x2)                 [P(x2) computed for each value of X2]

Cache P(x2) for both values of X3!

How many * and + saved?
*: 2*4*2=16 vs 4+4=8
+: 2*3=6 vs 2+1=3

Can lead to huge gains in larger networks
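A minimal sketch of the chain example, comparing the naive double sum with the rearranged form that caches P(x2). The CPT numbers below are hypothetical (the slides give none for this network), and the function names are illustrative.

```python
# Hypothetical CPTs for a Boolean chain X1 -> X2 -> X3 (values are 0 or 1).
P_X1 = [0.4, 0.6]                          # P(X1=0), P(X1=1)
P_X2_GIVEN_X1 = [[0.7, 0.3], [0.2, 0.8]]   # row: x1, column: x2
P_X3_GIVEN_X2 = [[0.9, 0.1], [0.5, 0.5]]   # row: x2, column: x3

def p_x3_naive(x3):
    """Double sum over x1 and x2 of the full product, redone for each x3."""
    return sum(P_X1[x1] * P_X2_GIVEN_X1[x1][x2] * P_X3_GIVEN_X2[x2][x3]
               for x1 in (0, 1) for x2 in (0, 1))

def p_x3_eliminated():
    """Eliminate X1 first, cache P(x2), then eliminate X2."""
    # Inner sum: P(x2) = sum over x1 of P(x1) P(x2|x1), computed once per x2.
    p_x2 = [sum(P_X1[x1] * P_X2_GIVEN_X1[x1][x2] for x1 in (0, 1))
            for x2 in (0, 1)]
    # Outer sum reuses the cached P(x2) for both values of X3.
    return [sum(P_X3_GIVEN_X2[x2][x3] * p_x2[x2] for x2 in (0, 1))
            for x3 in (0, 1)]

print(p_x3_eliminated(), [p_x3_naive(0), p_x3_naive(1)])
```

Both routes give the same distribution; the second simply does the X1 sum once instead of once per value of X3.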

VE IN ALARM EXAMPLE

P(E|j,m) = P(E,j,m)/P(j,m)

P(E,j,m) = Σ_a Σ_b P(E) P(b) P(a|E,b) P(j|a) P(m|a)
         = P(E) Σ_b P(b) Σ_a P(a|E,b) P(j|a) P(m|a)
         = P(E) Σ_b P(b) P(j,m|E,b)    [compute for all values of E,b]
         = P(E) P(j,m|E)               [compute for all values of E]
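The elimination order in this derivation (first a, then b) can be sketched directly with small dictionaries as intermediate factors. The CPT values are the ones from these slides; the names `f1`, `f2`, and `p_earthquake_given_calls` are illustrative.

```python
# CPT values from the alarm network in these slides.
P_B, P_E = 0.001, 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}  # P(A | B, E), keyed (b, e)
P_J = {True: 0.90, False: 0.05}                      # P(j | A)
P_M = {True: 0.70, False: 0.01}                      # P(m | A)
TF = (True, False)

def pr(p_true, value):
    """P(X=value) for a Boolean X with P(X=true)=p_true."""
    return p_true if value else 1.0 - p_true

def p_earthquake_given_calls():
    """Posterior over E given JohnCalls=true and MaryCalls=true."""
    # Eliminate a: f1(e, b) = sum_a P(a|e,b) P(j|a) P(m|a) = P(j,m | e,b)
    f1 = {(e, b): sum(pr(P_A[(b, e)], a) * P_J[a] * P_M[a] for a in TF)
          for e in TF for b in TF}
    # Eliminate b: f2(e) = sum_b P(b) f1(e, b) = P(j,m | e)
    f2 = {e: sum(pr(P_B, b) * f1[(e, b)] for b in TF) for e in TF}
    # Multiply in P(e) to get P(e, j, m), then normalize by P(j, m).
    p_ejm = {e: pr(P_E, e) * f2[e] for e in TF}
    z = sum(p_ejm.values())
    return {e: p / z for e, p in p_ejm.items()}

print(round(p_earthquake_given_calls()[True], 3))
```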

WHAT ORDER TO PERFORM VE?

- For tree-like BNs (polytrees), order so parents come before children
- # of entries in each intermediate probability table is 2^(# of parents of a node)
- If the number of parents of a node is bounded, then VE is linear time!
- Other networks: intermediate factors may become large

NON-POLYTREE NETWORKS

P(D) = Σ_a Σ_b Σ_c P(a)P(b|a)P(c|a)P(D|b,c)
     = Σ_b Σ_c P(D|b,c) Σ_a P(a)P(b|a)P(c|a)

No more simplifications…

APPROXIMATE INFERENCE TECHNIQUES

Based on the idea of Monte Carlo simulation

Basic idea: To estimate the probability of a coin flipping heads, I can flip it a huge number of times and count the fraction of heads observed

Conditional simulation: To estimate the probability P(H) that a coin picked out of bucket B flips heads, I can:
1. Pick a coin C out of B (occurs with probability P(C))
2. Flip C and observe whether it flips heads (occurs with probability P(H|C))
3. Put C back and repeat from step 1 many times
4. Return the fraction of heads observed (estimate of P(H))
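The four steps of the bucket experiment can be sketched as a short simulation. The bucket contents below are hypothetical (the slides do not specify any), and the function names are illustrative.

```python
import random

# Hypothetical bucket: each entry is (P(C), P(H | C)) for one kind of coin.
BUCKET = [(0.5, 0.5), (0.3, 0.9), (0.2, 0.1)]

def exact_p_heads(bucket):
    """P(H) = sum over coins of P(C) P(H | C), for comparison."""
    return sum(p_coin * p_heads for p_coin, p_heads in bucket)

def estimate_p_heads(bucket, n, rng):
    """Steps 1-4: pick a coin, flip it, repeat n times, return fraction of heads."""
    weights = [p_coin for p_coin, _ in bucket]
    heads = 0
    for _ in range(n):
        _, p_heads = rng.choices(bucket, weights=weights, k=1)[0]  # step 1
        if rng.random() < p_heads:                                  # step 2
            heads += 1
    return heads / n                                                # step 4

rng = random.Random(0)
print(exact_p_heads(BUCKET), estimate_p_heads(BUCKET, 20000, rng))
```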

APPROXIMATE INFERENCE: MONTE-CARLO SIMULATION

Sample from the joint distribution

B=0 E=0 A=0 J=1 M=0
B=0 E=0 A=0 J=0 M=0
B=1 E=0 A=1 J=1 M=0

As more samples are generated, the distribution of the samples approaches the joint distribution!

Inference: given evidence E=e (e.g., J=1)
Remove the samples that conflict (here, the second sample, which has J=0)

Distribution of remaining samples approximates the conditional distribution!
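Sampling from the joint and discarding conflicting samples can be sketched as follows. This is an illustration only, assuming the CPT values from these slides; `sample_joint` and `rejection_estimate` are hypothetical helper names.

```python
import random

# CPT values from the alarm network in these slides.
P_B, P_E = 0.001, 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}  # P(A | B, E)
P_J = {True: 0.90, False: 0.05}                      # P(J | A)
P_M = {True: 0.70, False: 0.01}                      # P(M | A)

def sample_joint(rng):
    """Draw one full assignment, sampling parents before children."""
    b = rng.random() < P_B
    e = rng.random() < P_E
    a = rng.random() < P_A[(b, e)]
    j = rng.random() < P_J[a]
    m = rng.random() < P_M[a]
    return {"B": b, "E": e, "A": a, "J": j, "M": m}

def rejection_estimate(query, evidence, n, rng):
    """Estimate P(query=true | evidence) from the samples that match the evidence."""
    kept = hits = 0
    for _ in range(n):
        s = sample_joint(rng)
        if any(s[var] != val for var, val in evidence.items()):
            continue  # sample conflicts with the evidence: remove it
        kept += 1
        hits += s[query]
    return hits / kept if kept else float("nan")

print(rejection_estimate("A", {"J": True}, 100000, random.Random(1)))
```

Only about 5% of the samples have J=1 here, so most of the work is thrown away; rarer evidence makes this worse.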

HOW MANY SAMPLES?

Error of estimate, for n samples, is on average O(1/√n)

Variance-reduction techniques

RARE EVENT PROBLEM:

What if some events are really rare (e.g., burglary & earthquake)?
# of samples must be huge to get a reasonable estimate

Solution: likelihood weighting
- Enforce that each sample agrees with evidence
- While generating a sample, keep track of the ratio of
  (how likely the sampled value is to occur in the real world) / (how likely you were to generate the sampled value)

LIKELIHOOD WEIGHTING

Suppose evidence Alarm & MaryCalls
Sample B,E with P=0.5

Sample 1:
B=0 E=1                w=0.008
B=0 E=1 A=1            w=0.0023
B=0 E=1 A=1 M=1 J=1    w=0.0016

A=1 is enforced, and the weight updated to reflect the likelihood that this occurs

Sample 2:
B=0 E=0                w=3.988
B=0 E=0 A=1            w=0.004
B=0 E=0 A=1 M=1 J=1    w=0.0028

Sample 3:
B=1 E=0 A=1            w=0.00375
B=1 E=0 A=1 M=1 J=1    w=0.0026

Sample 4:
B=1 E=1 A=1 M=1 J=1    w=5.3e-6

Weighted samples (N=4):
B=0 E=1 A=1 M=1 J=1    w=0.0016
B=0 E=0 A=1 M=1 J=1    w=0.0028
B=1 E=0 A=1 M=1 J=1    w=0.0026
B=1 E=1 A=1 M=1 J=1    w=5.3e-6

N=4 gives P(B|A,M) ≈ 0.372 (sum of weights with B=1 divided by sum of all weights)
Exact inference gives P(B|A,M) ≈ 0.374
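The weighting scheme in this walkthrough can be sketched as below. It follows the slide's variant, where B and E are drawn with probability 0.5 rather than from their priors, and it assumes the CPT values from these slides. JohnCalls is left out because it is neither evidence nor the query and does not change the weight. The names `weight` and `estimate_p_burglary` are illustrative.

```python
import random

# CPT values from the alarm network in these slides.
P_B, P_E = 0.001, 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}  # P(A | B, E)
P_M = {True: 0.70, False: 0.01}                      # P(M | A)
PROPOSAL = 0.5  # B and E are each drawn true with probability 0.5

def pr(p_true, value):
    """P(X=value) for a Boolean X with P(X=true)=p_true."""
    return p_true if value else 1.0 - p_true

def weight(b, e):
    """Weight of a sample with B=b, E=e when Alarm=1 and MaryCalls=1 are enforced."""
    w = pr(P_B, b) / PROPOSAL   # real-world likelihood / proposal likelihood
    w *= pr(P_E, e) / PROPOSAL
    w *= P_A[(b, e)]            # likelihood of the enforced evidence Alarm=1
    w *= P_M[True]              # likelihood of MaryCalls=1 given Alarm=1
    return w

def estimate_p_burglary(n, rng):
    """Weighted fraction of samples with B=1, estimating P(B | A, M)."""
    total = with_b = 0.0
    for _ in range(n):
        b = rng.random() < PROPOSAL
        e = rng.random() < PROPOSAL
        w = weight(b, e)
        total += w
        if b:
            with_b += w
    return with_b / total

print(weight(False, True), estimate_p_burglary(20000, random.Random(0)))
```

Every sample agrees with the evidence, so none is discarded; the weights correct for the mismatch between how samples were generated and how likely they really are.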

Efficient inference in BNs
- Variable elimination
- Approximate methods: Monte-Carlo sampling

NEXT LECTURE

Statistical learning: from data to distributions R&N 20.1-2
