Sampling Bayesian Networks


1

Sampling Bayesian Networks

ICS 276

2007

2

Answering BN Queries

Probability of evidence P(e)? NP-hard
Conditional probability P(xi|e)? NP-hard
MPE: x = arg max P(x|e)? NP-hard
MAP: y = arg max P(y|e), y ⊆ x? NP^PP-hard
Approximating P(e) or P(xi|e) within ε: NP-hard

3

Approximation Algorithms

Structural Approximations: eliminate some dependencies
- Remove edges
- Mini-Bucket Approach

Search: approach for optimization tasks (MPE, MAP)

Sampling: generate random samples and compute the values of interest from the samples, not from the original network

4

Algorithm Tree

5

Sampling

Input: Bayesian network with set of nodes X
Sample = a tuple with assigned values:

s = (X1=x1, X2=x2, ..., Xk=xk)

A tuple may include all variables (except evidence) or a subset

Sampling schemas dictate how to generate samples (tuples)

Ideally, samples are distributed according to P(X|E)

6

Sampling Fundamentals

Given a set of variables X = {X1, X2, ..., Xn} that represent the joint probability distribution Π(X) and some function g(X), we can compute the expected value of g(X):

$E_{\Pi}[g(X)] = \int g(x)\,\Pi(x)\,dx$

7

Sampling From Π(X)

Given independent, identically distributed (iid) samples S1, S2, ..., ST from Π(X), it follows from the Strong Law of Large Numbers:

$\hat{g} = \frac{1}{T}\sum_{t=1}^{T} g(S^t)$

A sample S^t is an instantiation:

$S^t = \{x_1^t, x_2^t, \ldots, x_n^t\}$
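The estimator above can be made concrete with a minimal Python sketch (not from the slides): the Bernoulli(0.7) distribution and the identity function g are illustrative choices only.

```python
import random

def monte_carlo_expectation(sample_fn, g, T=100_000):
    """Estimate E[g(X)] as (1/T) * sum_t g(S_t), with S_t drawn i.i.d. from Pi(X)."""
    return sum(g(sample_fn()) for _ in range(T)) / T

# Illustrative example: X ~ Bernoulli(0.7) and g(x) = x, so E[g(X)] = 0.7.
estimate = monte_carlo_expectation(lambda: 1 if random.random() < 0.7 else 0, g=lambda x: x)
```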

8

Sampling Basics

Given random variable X, D(X) = {0, 1}
Given P(X) = {0.3, 0.7}
Generate k samples: 0,1,1,1,0,1,1,0,1
Approximate P'(X):

P'(X=0) = #samples(X=0) / #samples = 4/10 = 0.4
P'(X=1) = #samples(X=1) / #samples = 6/10 = 0.6
P'(X) = {0.4, 0.6}

9

How to draw a sample ?

Given random variable X, D(X)={0, 1}

Given P(X) = {0.3, 0.7}
Sample X ∼ P(X):

draw a random number r ∈ [0, 1]
If (r < 0.3) then set X=0, else set X=1

Can generalize for any domain size
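A minimal Python sketch of this sampling step, generalized to an arbitrary finite domain via a cumulative-probability scan (the function name is ours; it is reused in the later sketches):

```python
import random

def sample_discrete(values, probs):
    """Draw one value: split [0, 1] into segments of length probs[k] and return
    the value whose segment contains the uniform random number r."""
    r = random.random()
    cumulative = 0.0
    for value, p in zip(values, probs):
        cumulative += p
        if r < cumulative:
            return value
    return values[-1]  # guard against floating-point round-off

# The slide's example: D(X) = {0, 1}, P(X) = {0.3, 0.7}.
x = sample_discrete([0, 1], [0.3, 0.7])
```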

10

Sampling in BN

Same idea: generate a set of T samples
Estimate P(Xi|E) from the samples
Challenge: X is a vector and P(X) is a huge distribution represented by the BN

Need to know:
- How to generate a new sample?
- How many samples T do we need?
- How to estimate P(E=e) and P(Xi|e)?

11

Sampling Algorithms

Forward Sampling
Gibbs Sampling (MCMC)
- Blocking
- Rao-Blackwellised
Likelihood Weighting
Importance Sampling
Sequential Monte-Carlo (Particle Filtering) in Dynamic Bayesian Networks

12

Forward Sampling

Forward Sampling
- Case with no evidence, E = {}
- Case with evidence, E = e
- # samples N and error bounds

13

Forward Sampling No Evidence(Henrion 1988)

Input: Bayesian network X = {X1,...,XN}, N = # nodes, T = # samples
Output: T samples
Process nodes in topological order – first process the ancestors of a node, then the node itself:

1. For t = 0 to T
2.   For i = 0 to N
3.     Xi ← sample xi^t from P(xi | pai)
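A minimal Python sketch of the algorithm above, assuming a hypothetical encoding in which `nodes` is given in topological order and `cpts[X][parent_values]` lists the probabilities aligned with `domains[X]`; the two-node network and its numbers are illustrative only.

```python
import random

def sample_discrete(values, probs):
    r, cumulative = random.random(), 0.0
    for value, p in zip(values, probs):
        cumulative += p
        if r < cumulative:
            return value
    return values[-1]

def forward_sample(nodes, parents, domains, cpts):
    """One forward (logic) sample: visit nodes in topological order and draw
    each Xi from P(Xi | pai) given the already-sampled parent values."""
    sample = {}
    for x in nodes:
        pa_vals = tuple(sample[p] for p in parents[x])
        sample[x] = sample_discrete(domains[x], cpts[x][pa_vals])
    return sample

# Tiny illustrative network X1 -> X2 with binary domains (made-up CPTs).
nodes = ["X1", "X2"]
parents = {"X1": [], "X2": ["X1"]}
domains = {"X1": [0, 1], "X2": [0, 1]}
cpts = {"X1": {(): [0.6, 0.4]},
        "X2": {(0,): [0.9, 0.1], (1,): [0.2, 0.8]}}

samples = [forward_sample(nodes, parents, domains, cpts) for _ in range(10_000)]
p_x2 = sum(s["X2"] == 1 for s in samples) / len(samples)  # approx 0.6*0.1 + 0.4*0.8 = 0.38
```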

14

Sampling A Value

What does it mean to sample xi^t from P(Xi | pai)?

Assume D(Xi) = {0, 1}
Assume P(Xi | pai) = (0.3, 0.7)

Draw a random number r from [0, 1]:
If r falls in [0, 0.3], set Xi = 0
If r falls in [0.3, 1], set Xi = 1

15

Forward sampling (example)

(Figure: four-node network with CPTs P(x1), P(x2|x1), P(x3|x1), P(x4|x2,x3).)

No evidence; to generate sample k:
1. Sample x1 from P(x1)
2. Sample x2 from P(x2|x1)
3. Sample x3 from P(x3|x1)
4. Sample x4 from P(x4|x2,x3)

16

Forward Sampling-Answering Queries

Task: given T samples {S1, S2, ..., ST}, estimate P(Xi = xi):

$\hat{P}(X_i = x_i) = \frac{\#\text{samples}(X_i = x_i)}{T}$

Basically, count the proportion of samples where Xi = xi

17

Forward Sampling w/ Evidence

Input: Bayesian network X = {X1,...,XN}, N = # nodes, E = evidence, T = # samples
Output: T samples consistent with E

1. For t = 1 to T
2.   For i = 1 to N
3.     Xi ← sample xi^t from P(xi | pai)
4.     If Xi ∈ E and xi^t ≠ ei, reject sample:
5.       i = 1 and go to step 2
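A sketch of the rejection variant, reusing `forward_sample` and the illustrative two-node network from the earlier sketch; it checks the evidence after the full sample is drawn, which collects the same set of consistent samples as the early rejection in steps 4-5 above.

```python
def rejection_sample(nodes, parents, domains, cpts, evidence, T):
    """Collect T samples consistent with the evidence E: forward-sample the whole
    network and reject any sample that disagrees with an observed value.
    Can be very slow when P(e) is small (high rejection rate)."""
    accepted = []
    while len(accepted) < T:
        s = forward_sample(nodes, parents, domains, cpts)  # from the earlier sketch
        if all(s[x] == v for x, v in evidence.items()):
            accepted.append(s)
    return accepted

# Condition on X2 = 1 in the two-node example network:
consistent = rejection_sample(nodes, parents, domains, cpts, evidence={"X2": 1}, T=2_000)
p_x1 = sum(s["X1"] == 1 for s in consistent) / len(consistent)  # estimates P(X1=1 | X2=1)
```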

18

Forward sampling (example)

(Figure: same four-node network with CPTs P(x1), P(x2|x1), P(x3|x1), P(x4|x2,x3).)

Evidence: X3 = 0; to generate sample k:
1. Sample x1 from P(x1)
2. Sample x2 from P(x2|x1)
3. Sample x3 from P(x3|x1)
4. If x3 ≠ 0, reject the sample and start again from 1; otherwise
5. Sample x4 from P(x4|x2,x3)

19

Forward Sampling: Illustration

Let Y be a subset of evidence nodes s.t. Y=u

20

Forward Sampling –How many samples?

Theorem: Let ŝ(y) be the estimate of P(y) resulting from a randomly chosen sample set S with T samples. Then, to guarantee relative error at most ε with probability at least 1-δ, it is enough to have:

$T \geq \frac{c}{P(y)\,\epsilon^{2}\,\delta}$

Derived from Chebychev's Bound.

$P\left(\hat{P}(y) \in [P(y) - \epsilon,\; P(y) + \epsilon]\right) \geq 1 - 2e^{-2T\epsilon^{2}}$

21

Forward Sampling - How many samples?

Theorem: Let ŝ(y) be the estimate of P(y) resulting from a randomly chosen sample set S with T samples. Then, to guarantee relative error at most ε with probability at least 1-δ, it is enough to have:

$T \geq \frac{4}{P(y)^{2}\,\epsilon^{2}}\,\ln\frac{2}{\delta}$

Derived from Hoeffding's Bound (full proof is given in Koller).

$P\left(\hat{P}(y) \in [P(y) - \epsilon,\; P(y) + \epsilon]\right) \geq 1 - 2e^{-2T\epsilon^{2}}$

22

Forward Sampling: Performance

Advantages:
- P(xi | pa(xi)) is readily available
- Samples are independent!

Drawbacks:
- If evidence E is rare (P(e) is low), we will reject most of the samples!
- Since P(y) in the estimate of T is unknown, we must estimate P(y) from the samples themselves!
- If P(e) is small, T becomes very big!

23

Problem: Evidence

Forward Sampling → high rejection rate

Fix evidence values:
- Gibbs sampling (MCMC)
- Likelihood Weighting
- Importance Sampling

24

Forward Sampling Bibliography

[Henrion 1988] M. Henrion, "Propagating uncertainty in Bayesian networks by probabilistic logic sampling," Uncertainty in Artificial Intelligence, pp. 149-163, 1988.

25

Gibbs Sampling

Markov Chain Monte Carlo method (Gelfand and Smith, 1990; Smith and Roberts, 1993; Tierney, 1994)

- Samples are dependent and form a Markov chain
- Sample from P'(X|e), which converges to P(X|e)
- Guaranteed to converge when all P > 0
- Methods to improve convergence:
  - Blocking
  - Rao-Blackwellised
- Error bounds:
  - Lag-t autocovariance
  - Multiple chains, Chebyshev's inequality

26

Gibbs Sampling (Pearl, 1988)

A sample x^t, t ∈ {1, 2, ...}, is an instantiation of all variables in the network:

$x^t = \{X_1 = x_1^t, X_2 = x_2^t, \ldots, X_N = x_N^t\}$

Sampling process:
- Fix values of observed variables e
- Instantiate node values in sample x0 at random
- Generate samples x1, x2, ..., xT from P(x|e)
- Compute posteriors from samples

27

Ordered Gibbs Sampler

Generate sample x^{t+1} from x^t:

$X_1 = x_1^{t+1} \leftarrow$ sampled from $P(x_1 \mid x_2^t, x_3^t, \ldots, x_N^t, e)$
$X_2 = x_2^{t+1} \leftarrow$ sampled from $P(x_2 \mid x_1^{t+1}, x_3^t, \ldots, x_N^t, e)$
$\ldots$
$X_N = x_N^{t+1} \leftarrow$ sampled from $P(x_N \mid x_1^{t+1}, x_2^{t+1}, \ldots, x_{N-1}^{t+1}, e)$

In short, for i = 1 to N:

$X_i = x_i^{t+1} \leftarrow$ sampled from $P(x_i \mid x^t \setminus x_i, e)$

Process all variables in some order.

28

Gibbs Sampling (cont’d)(Pearl, 1988)

Important:

$P(x_i \mid x^t \setminus x_i) = P(x_i \mid markov_i^t)$

$P(x_i \mid x^t \setminus x_i) \propto P(x_i \mid pa_i) \prod_{X_j \in ch_i} P(x_j \mid pa_j)$

Markov blanket:

$M_i = pa_i \cup ch_i \cup \left(\bigcup_{X_j \in ch_i} pa_j\right)$

Given its Markov blanket (parents, children, and their parents), X_i is independent of all other nodes.

29

Ordered Gibbs Sampling Algorithm

Input: X, E
Output: T samples {x^t}
Fix evidence E, generate samples from P(X | E):

1. For t = 1 to T (compute samples)
2.   For i = 1 to N (loop through variables)
3.     Xi ← sample xi^t from P(Xi | markov_i^t)

Answering Queries

Query: P(xi|e) = ?

Method 1: count the # of samples where Xi = xi:

$\hat{P}(X_i = x_i) = \frac{\#\text{samples}(X_i = x_i)}{T}$

Method 2: average probability (mixture estimator):

$\hat{P}(X_i = x_i) = \frac{1}{T}\sum_{t=1}^{T} P(X_i = x_i \mid markov_i^t \setminus X_i)$
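A rough Python sketch of the ordered Gibbs sampler with both estimators, using the same hypothetical network encoding (`nodes`, `parents`, `domains`, `cpts`) as the forward-sampling sketch; the Markov-blanket conditional is computed by scoring each value of Xi against its own CPT and its children's CPTs, as on the previous slides.

```python
import random

def sample_discrete(values, probs):
    r, cumulative = random.random(), 0.0
    for value, p in zip(values, probs):
        cumulative += p
        if r < cumulative:
            return value
    return values[-1]

def cpt_prob(x, assignment, parents, domains, cpts):
    """Look up P(x = assignment[x] | pa_x) in the CPT."""
    pa_vals = tuple(assignment[p] for p in parents[x])
    return cpts[x][pa_vals][domains[x].index(assignment[x])]

def markov_blanket_dist(x, assignment, parents, children, domains, cpts):
    """P(x | markov_x) is proportional to P(x | pa_x) * prod over children c of P(c | pa_c),
    evaluated for every value of x with all other variables held fixed, then normalized."""
    original = assignment[x]
    scores = []
    for v in domains[x]:
        assignment[x] = v
        score = cpt_prob(x, assignment, parents, domains, cpts)
        for c in children[x]:
            score *= cpt_prob(c, assignment, parents, domains, cpts)
        scores.append(score)
    assignment[x] = original  # leave the state unchanged
    z = sum(scores)
    return [s / z for s in scores]

def ordered_gibbs(nodes, parents, domains, cpts, evidence, query, T):
    """Ordered Gibbs sampler: fix the evidence, initialize the rest at random, then
    resample each non-evidence variable from its Markov-blanket conditional.
    Returns the histogram (Method 1) and mixture (Method 2) estimates of
    P(query_var = query_val | e). No burn-in is discarded in this sketch."""
    qvar, qval = query
    children = {x: [y for y in nodes if x in parents[y]] for x in nodes}
    state = {x: evidence.get(x, random.choice(domains[x])) for x in nodes}
    count_est = mixture_est = 0.0
    for _ in range(T):
        for x in nodes:
            if x in evidence:
                continue
            probs = markov_blanket_dist(x, state, parents, children, domains, cpts)
            state[x] = sample_discrete(domains[x], probs)
        count_est += (state[qvar] == qval)
        q_probs = markov_blanket_dist(qvar, state, parents, children, domains, cpts)
        mixture_est += q_probs[domains[qvar].index(qval)]
    return count_est / T, mixture_est / T

# Using the two-node network from the forward-sampling sketch:
# hist, mix = ordered_gibbs(nodes, parents, domains, cpts, {"X2": 1}, ("X1", 1), T=5_000)
```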

31

Gibbs Sampling Example - BN

X = {X1, X2, ..., X9}
E = {X9}

(Figure: Bayesian network over X1, ..., X9.)

32

Gibbs Sampling Example - BN

X1 = x1^0, X2 = x2^0, X3 = x3^0, X4 = x4^0, X5 = x5^0, X6 = x6^0, X7 = x7^0, X8 = x8^0

(Figure: Bayesian network over X1, ..., X9.)

33

Gibbs Sampling Example - BN

X1 ← P(X1 | x2^0, ..., x8^0, x9)

E = {X9}

(Figure: Bayesian network over X1, ..., X9.)

34

Gibbs Sampling Example - BN

X2 ← P(X2 | x1^1, ..., x8^0, x9)

E = {X9}

(Figure: Bayesian network over X1, ..., X9.)

35

Gibbs Sampling: Illustration

40

Gibbs Sampling: Burn-In

We want to sample from P(X | E), but the starting point is random
Solution: throw away the first K samples, known as "burn-in"
What is K? Hard to tell; use intuition
Alternative: sample the first sample's values from an approximate P(x|e) (for example, run IBP first)

41

Gibbs Sampling: Convergence

Converges to a stationary distribution π*:

π* = π* P, where P is a transition kernel, p_ij = P(X_i → X_j)

Guaranteed to converge iff the chain is:
- irreducible
- aperiodic
- ergodic (∀ i,j: p_ij > 0)

42

Irreducible

A Markov chain (or its probability transition matrix) is said to be irreducible if it is possible to reach every state from every other state (not necessarily in one step).

In other words, ∀ i,j ∃ k: P^(k)_ij > 0, where k is the number of steps taken to get to state j from state i.

43

Aperiodic

Define d(i) = g.c.d.{n > 0 | it is possible to go from i to i in n steps}. Here, g.c.d. means the greatest common divisor of the integers in the set. If d(i) = 1 for all i, then the chain is aperiodic.

44

Ergodicity

A recurrent state is a state to which the chain returns with probability 1:

$\sum_{n} P^{(n)}_{ij} = \infty$

Recurrent, aperiodic states are ergodic.

Note: an extra condition for ergodicity is that expected recurrence time is finite. This holds for recurrent states in a finite state chain.

46

Gibbs Convergence

Gibbs convergence is generally guaranteed as long as all probabilities are positive!

Intuition for the ergodicity requirement: if nodes X and Y are correlated such that X=0 ⟺ Y=0, then:
- once we sample and assign X=0, we are forced to assign Y=0;
- once we sample and assign Y=0, we are forced to assign X=0;
- we will never be able to change their values again!

Another problem: it can take a very long time to converge!

47

Gibbs Sampling: Performance

+ Advantage: guaranteed to converge to P(X|E)
- Disadvantage: convergence may be slow

Problems:
- Samples are dependent!
- Statistical variance is too big in high-dimensional problems

48

Gibbs: Speeding Convergence

Objectives:
1. Reduce dependence between samples (autocorrelation)
   - Skip samples
   - Randomize variable sampling order
2. Reduce variance
   - Blocking Gibbs sampling
   - Rao-Blackwellisation

49

Skipping Samples

Pick only every k-th sample (Geyer, 1992)

Can reduce dependence between samples!
Increases variance! Wastes samples!

50

Randomized Variable Order

Random Scan Gibbs Sampler: pick each next variable Xi for update at random with probability pi, where Σi pi = 1.
(In the simplest case, the pi are distributed uniformly.)

In some instances, this reduces variance (MacEachern, Peruggia, 1999, "Subsampling the Gibbs Sampler: Variance Reduction").

51

Blocking

Sample several variables together, as a block.
Example: given three variables X, Y, Z with domains of size 2, group Y and Z together to form a variable W = {Y, Z} with domain size 4. Then, given sample (x^t, y^t, z^t), compute the next sample:

X^{t+1} ← P(x | y^t, z^t) = P(x | w^t)
(y^{t+1}, z^{t+1}) = W^{t+1} ← P(y, z | x^{t+1}) = P(w | x^{t+1})

+ Can improve convergence greatly when two variables are strongly correlated!
- Domain of the block variable grows exponentially with the # of variables in a block!

52

Blocking Gibbs Sampling

Jensen, Kong, Kjaerulff, 1993, "Blocking Gibbs Sampling in Very Large Probabilistic Expert Systems"

Select a set of subsets E1, E2, E3, ..., Ek such that:
- Ei ⊆ X
- ∪i Ei = X
- Ai = X \ Ei

Sample from P(Ei | Ai)

53

Rao-Blackwellisation

Do not sample all variables! Sample a subset!
Example: given three variables X, Y, Z, sample only X and Y, and sum out Z. Given sample (x^t, y^t), compute the next sample:

X^{t+1} ← P(x | y^t)
Y^{t+1} ← P(y | x^{t+1})

54

Rao-Blackwell Theorem

Bottom line: reducing the number of variables in a sample reduces variance!

55

Blocking vs. Rao-Blackwellisation

Standard Gibbs: P(x|y,z), P(y|x,z), P(z|x,y)  (1)

Blocking: P(x|y,z), P(y,z|x)  (2)

Rao-Blackwellised: P(x|y), P(y|x)  (3)

Var3 < Var2 < Var1 [Liu, Wong, Kong, 1994, "Covariance structure of the Gibbs sampler..."]

(Figure: three-node network over X, Y, Z.)

56

Rao-Blackwellised Gibbs: Cutset Sampling

Select C ⊆ X (possibly a cycle-cutset), |C| = m
Fix evidence E
Initialize nodes with random values: for i = 1 to m, set Ci = ci^0

For t = 1 to n, generate samples:
  For i = 1 to m:

$C_i = c_i^{t+1} \leftarrow P(c_i \mid c_1^{t+1}, \ldots, c_{i-1}^{t+1}, c_{i+1}^{t}, \ldots, c_m^{t}, e)$

57

Cutset Sampling

Select a subset C = {C1, ..., CK} ⊆ X
A sample c^t, t ∈ {1, 2, ...}, is an instantiation of C:

$c^t = \{C_1 = c_1^t, C_2 = c_2^t, \ldots, C_K = c_K^t\}$

Sampling process:
- Fix values of observed variables e
- Generate sample c0 at random
- Generate samples c1, c2, ..., cT from P(c|e)
- Compute posteriors from samples

58

Cutset Sampling: Generating Samples

Generate sample c^{t+1} from c^t:

$C_1 = c_1^{t+1} \leftarrow$ sampled from $P(c_1 \mid c_2^t, c_3^t, \ldots, c_K^t, e)$
$C_2 = c_2^{t+1} \leftarrow$ sampled from $P(c_2 \mid c_1^{t+1}, c_3^t, \ldots, c_K^t, e)$
$\ldots$
$C_K = c_K^{t+1} \leftarrow$ sampled from $P(c_K \mid c_1^{t+1}, c_2^{t+1}, \ldots, c_{K-1}^{t+1}, e)$

In short, for i = 1 to K:

$C_i = c_i^{t+1} \leftarrow$ sampled from $P(c_i \mid c^t \setminus c_i, e)$

59

Rao-Blackwellised Gibbs: Cutset Sampling

How to compute P(ci | c^t \ ci, e)?

- Compute the joint P(ci, c^t \ ci, e) for each value ci ∈ D(Ci)
- Then normalize: P(ci | c^t \ ci, e) = α · P(ci, c^t \ ci, e)
- Computational efficiency depends on the choice of C
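The normalization step reduces to a few lines; in this sketch, `joint_fn` stands in for whatever exact-inference routine (e.g., a BTE call) supplies P(ci, c\ci, e), and the numbers in the example are made up.

```python
def cutset_conditional(candidate_values, joint_fn):
    """Normalize the joint values P(ci, c\ci, e), supplied for each ci in D(Ci) by a
    hypothetical exact-inference routine joint_fn, into P(ci | c\ci, e)."""
    joints = [joint_fn(ci) for ci in candidate_values]
    z = sum(joints)
    return [j / z for j in joints]

# Example with made-up joint values standing in for BTE output:
probs = cutset_conditional([0, 1], joint_fn=lambda ci: [0.02, 0.06][ci])  # -> [0.25, 0.75]
```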

60

Rao-Blackwellised Gibbs: Cutset Sampling

How to choose C?
- Special case: C is a cycle-cutset, O(N)
- General case: apply Bucket Tree Elimination (BTE), O(exp(w)), where w is the induced width of the network when the nodes in C are observed
- Pick C wisely so as to minimize w ⇒ notion of w-cutset

61

w-cutset Sampling

C=w-cutset of the network, a set of nodes such that when C and E are instantiated, the adjusted induced width of the network is w

Complexity of exact inference: bounded by w !

cycle-cutset is a special case

62

Cutset Sampling-Answering Queries

Query: ci ∈ C, P(ci|e) = ? Same as Gibbs (Gibbs is a special case of w-cutset sampling), computed while generating sample t:

$\hat{P}(c_i \mid e) = \frac{1}{T}\sum_{t=1}^{T} P(c_i \mid c^t \setminus c_i, e)$

Query: P(xi|e) = ?, computed after generating sample t:

$\hat{P}(x_i \mid e) = \frac{1}{T}\sum_{t=1}^{T} P(x_i \mid c^t, e)$

63

Cutset Sampling Example

c^0 = {x2^0, x5^0}
E = {x9}

(Figure: Bayesian network over X1, ..., X9.)

64

Cutset Sampling Example

c^0 = {x2^0, x5^0}

Sample a new value for X2:

$x_2^1 \leftarrow P(x_2 \mid x_5^0, x_9) = \frac{BTE(x_2, x_5^0, x_9)}{BTE(x_2', x_5^0, x_9) + BTE(x_2'', x_5^0, x_9)}$

(Figure: Bayesian network over X1, ..., X9.)

65

Cutset Sampling Example

$x_2^1 \leftarrow P(x_2 \mid x_5^0, x_9)$

Sample a new value for X5:

$x_5^1 \leftarrow P(x_5 \mid x_2^1, x_9) = \frac{BTE(x_5, x_2^1, x_9)}{BTE(x_5', x_2^1, x_9) + BTE(x_5'', x_2^1, x_9)}$

$c^1 = \{x_2^1, x_5^1\}$

(Figure: Bayesian network over X1, ..., X9.)

66

Cutset Sampling Example

Query P(x2|e) for sampling node X2:

Sample 1: $x_2^1 \leftarrow P(x_2 \mid x_5^0, x_9)$
Sample 2: $x_2^2 \leftarrow P(x_2 \mid x_5^1, x_9)$
Sample 3: $x_2^3 \leftarrow P(x_2 \mid x_5^2, x_9)$

$\hat{P}(x_2 \mid x_9) = \frac{1}{3}\left[P(x_2 \mid x_5^0, x_9) + P(x_2 \mid x_5^1, x_9) + P(x_2 \mid x_5^2, x_9)\right]$

(Figure: Bayesian network over X1, ..., X9.)

67

Cutset Sampling Example

Query P(x3|e) for non-sampled node X3:

$c^1 = \{x_2^1, x_5^1\}: \quad P(x_3 \mid x_2^1, x_5^1, x_9)$
$c^2 = \{x_2^2, x_5^2\}: \quad P(x_3 \mid x_2^2, x_5^2, x_9)$
$c^3 = \{x_2^3, x_5^3\}: \quad P(x_3 \mid x_2^3, x_5^3, x_9)$

$\hat{P}(x_3 \mid x_9) = \frac{1}{3}\left[P(x_3 \mid x_2^1, x_5^1, x_9) + P(x_3 \mid x_2^2, x_5^2, x_9) + P(x_3 \mid x_2^3, x_5^3, x_9)\right]$

(Figure: Bayesian network over X1, ..., X9.)

68

Gibbs: Error Bounds

Objectives:
- Estimate the needed number of samples T
- Estimate the error

Methodology:
- 1 chain: use lag-k autocovariance → estimate T
- M chains: standard sampling variance → estimate error

69

Gibbs: lag-k autocovariance

$P_t = P(x_i \mid x^t \setminus x_i)$

$\hat{P} = \hat{P}(x_i \mid e) = \frac{1}{T}\sum_{t=1}^{T} P(x_i \mid x^t \setminus x_i)$

Lag-k autocovariance:

$\gamma(k) = \frac{1}{T}\sum_{t=1}^{T-k}\left(P_t - \hat{P}\right)\left(P_{t+k} - \hat{P}\right)$

$Var(\hat{P}) = \frac{1}{T}\left[\gamma(0) + 2\sum_{k=1}^{2\delta+1}\gamma(k)\right]$

70

Gibbs: lag-k autocovariance

Estimate the Monte Carlo variance:

$Var(\hat{P}) = \frac{1}{T}\left[\hat{\gamma}(0) + 2\sum_{k=1}^{2\delta+1}\hat{\gamma}(k)\right]$

Here, δ is the smallest positive integer satisfying $\hat{\gamma}(2\delta) + \hat{\gamma}(2\delta+1) > 0$.

Effective chain size:

$\hat{T} = \frac{\hat{\gamma}(0)}{Var(\hat{P})}$

In the absence of autocovariance: $\hat{T} = T$.
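A small Python sketch of these estimates, following the reconstructed formulas above; the truncation point delta is passed in by the caller rather than chosen automatically.

```python
def lag_autocovariance(p, k):
    """gamma(k) = (1/T) * sum over t = 1..T-k of (P_t - Pbar) * (P_{t+k} - Pbar)."""
    T = len(p)
    pbar = sum(p) / T
    return sum((p[t] - pbar) * (p[t + k] - pbar) for t in range(T - k)) / T

def effective_chain_size(p, delta):
    """Var(Pbar) is estimated as (1/T) * [gamma(0) + 2 * sum over k = 1..2*delta+1 of gamma(k)];
    the effective size is T_hat = gamma(0) / Var(Pbar), which equals T when there is
    no autocovariance."""
    T = len(p)
    var = (lag_autocovariance(p, 0)
           + 2 * sum(lag_autocovariance(p, k) for k in range(1, 2 * delta + 2))) / T
    return lag_autocovariance(p, 0) / var

# p would be the recorded sequence P_t = P(xi | x^t \ xi) from a single Gibbs chain.
```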

71

Gibbs: Multiple Chains

Generate M chains of size K.
Each chain m produces an independent estimate Pm:

$P_m = \hat{P}(x_i \mid e) = \frac{1}{K}\sum_{t=1}^{K} P(x_i \mid x^t \setminus x_i)$

Treat the Pm as independent random variables.
Estimate P(xi|e) as the average of the Pm(xi|e):

$\hat{P} = \frac{1}{M}\sum_{m=1}^{M} P_m$

72

Gibbs: Multiple Chains

{Pm} are independent random variables, therefore:

$Var(\hat{P}) = \frac{S^2}{M}, \qquad S^2 = \frac{1}{M-1}\sum_{m=1}^{M}\left(P_m - \hat{P}\right)^2$

and the sampling error (confidence interval half-width) is

$\Delta = t_{\alpha/2,\,M-1}\,\frac{S}{\sqrt{M}}$

73

Geman & Geman, 1984

Geman, S. & Geman, D., 1984. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell. 6, 721-741.

Introduce Gibbs sampling; place the idea of Gibbs sampling in a general setting in which the collection of variables is structured in a graphical model and each variable has a neighborhood corresponding to a local region of the graphical structure. Geman and Geman use the Gibbs distribution to define the joint distribution on this structured set of variables.

74

Tanner & Wong, 1987

Tanner and Wong (1987) Data-augmentation Convergence Results

75

Pearl, 1988

Pearl, 1988. Probabilistic Reasoning in Intelligent Systems, Morgan Kaufmann.

In the case of Bayesian networks, the neighborhoods correspond to the Markov blanket of a variable and the joint distribution is defined by the factorization of the network.

76

Gelfand & Smith, 1990

Gelfand, A.E. and Smith, A.F.M., 1990. Sampling-based approaches to calculating marginal densities. J. Am. Statist. Assoc. 85, 398-409.

Show variance reduction in using the mixture estimator for posterior marginals.

77

Neal, 1992

R. M. Neal, 1992. Connectionist learning of belief networks, Artificial Intelligence, v. 56, pp. 71-118. Stochastic simulation in Noisy-OR networks.

78

CPCS54 Test Results

MSE vs. #samples (left) and time (right)

Ergodic, |X| = 54, D(Xi) = 2, |C| = 15, |E| = 4

Exact Time = 30 sec using Cutset Conditioning

(Plots: "CPCS54, n=54, |C|=15, |E|=3", MSE vs. # samples and MSE vs. Time(sec), comparing Cutset and Gibbs.)

79

CPCS179 Test Results

MSE vs. #samples (left) and time (right)
Non-Ergodic (1 deterministic CPT entry), |X| = 179, |C| = 8, 2 <= D(Xi) <= 4, |E| = 35

Exact Time = 122 sec using Loop-Cutset Conditioning

(Plots: "CPCS179, n=179, |C|=8, |E|=35", MSE vs. # samples and MSE vs. Time(sec), comparing Cutset and Gibbs.)

80

CPCS360b Test Results

MSE vs. #samples (left) and time (right)

Ergodic, |X| = 360, D(Xi)=2, |C| = 21, |E| = 36

Exact Time > 60 min using Cutset Conditioning

Exact Values obtained via Bucket Elimination

(Plots: "CPCS360b, n=360, |C|=21, |E|=36", MSE vs. # samples and MSE vs. Time(sec), comparing Cutset and Gibbs.)

81

Random Networks

MSE vs. #samples (left) and time (right)

|X| = 100, D(Xi) =2,|C| = 13, |E| = 15-20

Exact Time = 30 sec using Cutset Conditioning

(Plots: "RANDOM, n=100, |C|=13, |E|=15-20", MSE vs. # samples and MSE vs. Time(sec), comparing Cutset and Gibbs.)

82

Coding Networks

MSE vs. time (right)

Non-Ergodic, |X| = 100, D(Xi)=2, |C| = 13-16, |E| = 50

Sample Ergodic Subspace U={U1, U2,…Uk}

Exact Time = 50 sec using Cutset Conditioning

(Figure: coding network with nodes u1-u4, x1-x4, p1-p4, y1-y4.)

(Plot: "Coding Networks, n=100, |C|=12-14", MSE vs. Time(sec), comparing IBP, Gibbs, and Cutset.)

83

Non-Ergodic Hailfinder

MSE vs. #samples (left) and time (right)

Non-Ergodic, |X| = 56, |C| = 5, 2 <=D(Xi) <=11, |E| = 0

Exact Time = 2 sec using Loop-Cutset Conditioning

(Plots: "HailFinder, n=56, |C|=5, |E|=1", MSE vs. Time(sec) and MSE vs. # samples, comparing Cutset and Gibbs.)

84

Non-Ergodic CPCS360b - MSE

(Plot: "cpcs360b, N=360, |E|=[20-34], w*=20, MSE", MSE vs. Time (sec), comparing Gibbs, IBP, and cutset sampling with |C|=26, fw=3 and |C|=48, fw=2.)

MSE vs. Time

Non-Ergodic, |X| = 360, |C| = 26, D(Xi)=2

Exact Time = 50 min using BTE

85

Non-Ergodic CPCS360b - MaxErr

(Plot: "cpcs360b, N=360, |E|=[20-34], MaxErr", maximum error vs. Time (sec), comparing Gibbs, IBP, and cutset sampling with |C|=26, fw=3 and |C|=48, fw=2.)

86

Likelihood Weighting (Fung and Chang, 1990; Shachter and Peot, 1990)

Works well for likely evidence!

"Clamping" evidence + forward sampling + weighing samples by evidence likelihood

87

Likelihood Weighting

(Figure: network with observed evidence nodes e interspersed among the sampled nodes.)

Sample in topological order over X!

xi ← P(Xi | pai)

P(Xi | pai) is a look-up in the CPT!

88

Likelihood Weighting Outline

w^t = 1
For each Xi in X do:
  If Xi ∈ E (evidence):
    xi^t = ei
    w^t ← w^t · P(ei | pai)
  Else:
    xi^t ← sample from P(Xi | pai)
EndFor
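A compact Python sketch of the outline above, reusing `sample_discrete` and the hypothetical two-node network encoding from the forward-sampling sketch, and returning the weighted estimate of a single marginal.

```python
def likelihood_weighting(nodes, parents, domains, cpts, evidence, query, T):
    """Likelihood weighting: clamp evidence nodes to their observed values, sample
    the remaining nodes forward from P(Xi | pai), and weigh each sample by
    w = product over evidence nodes of P(ei | pai). Returns the weighted estimate
    of P(query_var = query_val | e)."""
    qvar, qval = query
    numerator = denominator = 0.0
    for _ in range(T):
        sample, w = {}, 1.0
        for x in nodes:                                    # topological order
            pa_vals = tuple(sample[p] for p in parents[x])
            probs = cpts[x][pa_vals]
            if x in evidence:
                sample[x] = evidence[x]                    # clamp to the observed value
                w *= probs[domains[x].index(evidence[x])]  # multiply in P(ei | pai)
            else:
                sample[x] = sample_discrete(domains[x], probs)
        numerator += w * (sample[qvar] == qval)
        denominator += w
    return numerator / denominator

# Using the two-node network from the forward-sampling sketch:
# p = likelihood_weighting(nodes, parents, domains, cpts, {"X2": 1}, ("X1", 1), T=10_000)
```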

89

Likelihood Weighting

Estimate the posterior marginals:

$\hat{P}(x_i \mid e) = \frac{\hat{P}(x_i, e)}{\hat{P}(e)} = \frac{\sum_{t=1}^{T} w^t \, \mathbf{1}(x_i \in x^t)}{\sum_{t=1}^{T} w^t}$

where the weight of sample x^t is

$w^t = \frac{P(x^t)}{Q(x^t)} = \prod_{j} P(e_j \mid pa_j), \quad \text{since } Q(e_j \mid pa_j) = 1$

90

Likelihood Weighting

+ Converges to exact posterior marginals
+ Generates samples fast
- Sampling distribution is close to the prior (especially if E ⊆ leaf nodes)
- Increasing sampling variance ⇒ convergence may be slow
- Many samples with P(x^t) = 0 are rejected

93

Likelihood Convergence(Chebychev’s Inequality)

Assume P(X=x|e) has mean μ and variance σ².

Chebychev:

$P\left(|\hat{P} - \mu| \geq \epsilon\mu\right) \leq \frac{\sigma^2}{N\,\epsilon^2\,\mu^2} \leq \delta \;\Rightarrow\; N \geq \frac{\sigma^2}{\delta\,\epsilon^2\,\mu^2}$

μ = P(x|e) is unknown ⇒ obtain it from the samples!

94

Error Bound Derivation

Let K be a Bernoulli random variable indicating whether a sample takes the value x', with P(K = 1) = p = P(x'|e) and q = 1 − p.

Chebychev's inequality: $P\left(|\hat{P} - p| \geq k\sigma\right) \leq \frac{1}{k^2}$

Corollary: if K is Bernoulli with parameter p, then Var(K) = pq ≤ 1/4.

From the Law of Large Numbers, the estimate $\hat{P} = \frac{1}{T}\sum_t K_t$ has $Var(\hat{P}) = \frac{pq}{T} \leq \frac{1}{4T}$, so

$P\left(|\hat{P} - p| \geq \epsilon p\right) \leq \frac{pq}{T\,\epsilon^2\,p^2} \leq \frac{1}{4\,T\,\epsilon^2\,p^2}$

Setting the right-hand side to at most δ yields the required T.

95

Likelihood Convergence 2

Assume P(X=x|e) has mean μ and variance σ².

Zero-One Estimation Theory (Karp et al., 1989):

$T \geq \frac{4}{\mu\,\epsilon^2}\,\ln\frac{2}{\delta}$

μ = P(x|e) is unknown ⇒ obtain it from the samples!

96

Local Variance Bound (LVB)(Dagum&Luby, 1994)

Let Δ be the LVB of a binary-valued network: every CPT entry satisfies

$P(x \mid pa(x)) \in [l, u], \quad 0 < l \leq u < 1,$

or, equivalently, $P(\bar{x} \mid pa(x)) \in [1-u,\, 1-l]$, and

$\Delta = \max\left\{\frac{u}{l},\; \frac{1-l}{1-u}\right\}$

97

LVB Estimate(Pradhan,Dagum,1996)

Using the LVB, the Zero-One Estimator can be re-written:

$T \geq \frac{4\,\Delta^{k}}{\epsilon^2}\,\ln\frac{2}{\delta}$

98

Importance Sampling Idea

In general, it is hard to sample from target distribution P(X|E)

Generate samples from sampling (proposal) distribution Q(X)

Weigh each sample against P(X|E)

$I = \int f(x)\,P(x)\,dx = \int f(x)\,\frac{P(x)}{Q(x)}\,Q(x)\,dx$
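A generic Python sketch of this identity used as an estimator; the target (Exp(1)) and proposal (Uniform(0, 10)) densities are illustrative choices, not from the slides.

```python
import math
import random

def importance_estimate(f, p_pdf, q_pdf, q_sampler, T=100_000):
    """Estimate I = integral of f(x) * p(x) dx as (1/T) * sum_t f(x_t) * p(x_t) / q(x_t),
    with x_t drawn from the proposal q."""
    total = 0.0
    for _ in range(T):
        x = q_sampler()
        total += f(x) * p_pdf(x) / q_pdf(x)
    return total / T

# Illustrative check: E[X] under Exp(1), estimated with a Uniform(0, 10) proposal (about 1.0).
estimate = importance_estimate(f=lambda x: x,
                               p_pdf=lambda x: math.exp(-x),
                               q_pdf=lambda x: 0.1,
                               q_sampler=lambda: random.uniform(0, 10))
```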

99

Importance Sampling Variants

Importance sampling: forward, non-adaptive
- Nodes sampled in topological order
- Sampling distribution (for non-instantiated nodes) equal to the prior conditionals

Importance sampling: forward, adaptive
- Nodes sampled in topological order
- Sampling distribution adapted according to the average importance weights obtained in previous samples [Cheng, Druzdzel, 2000]

100

AIS-BN

The most efficient variant of importance sampling to-date is AIS-BN – Adaptive Importance Sampling for Bayesian networks.

Jian Cheng and Marek J. Druzdzel. AIS-BN: An adaptive importance sampling algorithm for evidential reasoning in large Bayesian networks. Journal of Artificial Intelligence Research (JAIR), 13:155-188, 2000.

101

Importance vs. Gibbs

Gibbs: $x^t \sim \tilde{P}(x \mid e)$, where $\tilde{P}(x \mid e)$ converges to $P(x \mid e)$:

$\hat{f} = \frac{1}{T}\sum_{t=1}^{T} f(x^t)$

Importance: $x^t \sim Q(x \mid e)$:

$\hat{f} = \frac{1}{T}\sum_{t=1}^{T} f(x^t)\,\frac{p(x^t)}{q(x^t)} = \frac{1}{T}\sum_{t=1}^{T} w^t\,f(x^t)$