Sampling Bayesian Networks

Transcript of Sampling Bayesian Networks

Page 1: Sampling Bayesian Networks

1

Sampling Bayesian Networks

ICS 276

2007

Page 2: Sampling Bayesian Networks

2

Answering BN Queries

Probability of Evidence P(e): NP-hard

Conditional Probability P(xi|e): NP-hard

MPE x* = arg max P(x|e): NP-hard

MAP y* = arg max P(y|e), Y ⊆ X: NP^PP-hard

Approximating P(e) or P(xi|e) within relative error ε: NP-hard

Page 3: Sampling Bayesian Networks

3

Approximation Algorithms

Structural Approximations: eliminate some dependencies (remove edges) - the Mini-Bucket approach

Search: an approach for optimization tasks (MPE, MAP)

Sampling: generate random samples and compute the values of interest from the samples, not from the original network

Page 4: Sampling Bayesian Networks

4

Algorithm Tree

Page 5: Sampling Bayesian Networks

5

Sampling

Input: a Bayesian network with a set of nodes X.

A sample is a tuple with assigned values:

s = (X1=x1, X2=x2, …, Xk=xk)

Tuple may include all variables (except evidence) or a subset

Sampling schemas dictate how to generate samples (tuples)

Ideally, samples are distributed according to P(X|E)

Page 6: Sampling Bayesian Networks

6

Sampling Fundamentals

Given a set of variables X = {X1, X2, …, Xn} that represents a joint probability distribution P(X), and some function g(X), we can compute the expected value of g(X):

E[g(X)] = ∫ g(X) P(X) dX

Page 7: Sampling Bayesian Networks

7

Sampling From P(X)

Given independent, identically distributed (iid) samples S1, S2, …, ST from P(X), it follows from the Strong Law of Large Numbers that:

ĝ = (1/T) Σ_{t=1..T} g(S^t)

where a sample S^t is an instantiation: S^t = {x1^t, x2^t, …, xn^t}

Page 8: Sampling Bayesian Networks

8

Sampling Basics

Given random variable X with D(X) = {0, 1} and P(X) = {0.3, 0.7}, generate k samples: 0,1,1,1,0,1,1,0,1. Approximate P'(X) from the sample counts:

P'(X=0) = #samples(X=0) / #samples = 4/10 = 0.4

P'(X=1) = #samples(X=1) / #samples = 6/10 = 0.6

P'(X) = {0.4, 0.6}

Page 9: Sampling Bayesian Networks

9

How to draw a sample ?

Given random variable X, D(X) = {0, 1}, and P(X) = {0.3, 0.7}, sample X ∼ P(X):

Draw a random number r ∈ [0, 1]. If r < 0.3 then set X = 0, else set X = 1.

Can generalize for any domain size
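A minimal Python sketch of this inverse-CDF draw; the helper name and the domain/probability lists are just illustrative:

```python
import random

def sample_discrete(domain, probs):
    """Draw one value from a discrete distribution by inverting the CDF."""
    r = random.random()          # uniform random number in [0, 1)
    cumulative = 0.0
    for value, p in zip(domain, probs):
        cumulative += p
        if r < cumulative:       # r fell inside this value's interval
            return value
    return domain[-1]            # guard against floating-point round-off

# Example from the slide: D(X) = {0, 1}, P(X) = {0.3, 0.7}
x = sample_discrete([0, 1], [0.3, 0.7])   # 0 with prob. 0.3, 1 with prob. 0.7
```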

Page 10: Sampling Bayesian Networks

10

Sampling in BN

Same idea: generate a set of T samples and estimate P(Xi|E) from the samples.

Challenge: X is a vector and P(X) is a huge distribution represented by the BN.

Need to know: How to generate a new sample? How many samples T do we need? How to estimate P(E=e) and P(Xi|e)?

Page 11: Sampling Bayesian Networks

11

Sampling Algorithms

Forward Sampling

Gibbs Sampling (MCMC): Blocking, Rao-Blackwellised

Likelihood Weighting

Importance Sampling

Sequential Monte-Carlo (Particle Filtering) in Dynamic Bayesian Networks

Page 12: Sampling Bayesian Networks

12

Forward Sampling

Forward Sampling: the case with no evidence E = {}, the case with evidence E = e, the number of samples needed, and error bounds

Page 13: Sampling Bayesian Networks

13

Forward Sampling, No Evidence (Henrion 1988)

Input: Bayesian network X = {X1,…,XN}; N - # nodes; T - # samples

Output: T samples

Process nodes in topological order - first process the ancestors of a node, then the node itself:

1. For t = 1 to T
2. For i = 1 to N
3. Xi ← sample xi^t from P(xi | pai)
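A small Python sketch of this loop, assuming a hypothetical dict-based encoding of the network (parent tuples plus CPTs keyed by parent assignments); none of these names come from the lecture:

```python
import random

def sample_discrete(domain, probs):
    r, acc = random.random(), 0.0
    for v, p in zip(domain, probs):
        acc += p
        if r < acc:
            return v
    return domain[-1]

def forward_sample(order, parents, cpt, domain):
    """One sample: visit nodes in topological order and sample each from P(xi | pai).
    order   - a topological ordering of the nodes
    parents - dict: node -> tuple of its parents
    cpt     - dict: node -> {parent-value tuple: list of probabilities over domain[node]}"""
    sample = {}
    for X in order:
        pa_values = tuple(sample[U] for U in parents[X])
        sample[X] = sample_discrete(domain[X], cpt[X][pa_values])
    return sample

# Hypothetical 2-node network X1 -> X2 with binary domains.
domain  = {"X1": [0, 1], "X2": [0, 1]}
parents = {"X1": (), "X2": ("X1",)}
cpt = {"X1": {(): [0.3, 0.7]},
       "X2": {(0,): [0.8, 0.2], (1,): [0.4, 0.6]}}
samples = [forward_sample(["X1", "X2"], parents, cpt, domain) for _ in range(1000)]
```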

Page 14: Sampling Bayesian Networks

14

Sampling A Value

What does it mean to sample xi^t from P(Xi | pai)?

Assume D(Xi) = {0,1} and P(Xi | pai) = (0.3, 0.7).

Draw a random number r from [0,1]. If r falls in [0, 0.3], set Xi = 0; if r falls in (0.3, 1], set Xi = 1.

(Number line: 0 ---- 0.3 -------- 1, with r landing in one of the two intervals.)

Page 15: Sampling Bayesian Networks

15

Forward sampling (example)

Network: X1 is a parent of X2 and X3; X2 and X3 are parents of X4. CPTs: P(x1), P(x2|x1), P(x3|x1), P(x4|x2,x3).

No evidence. Generate sample k:

1. Sample x1 from P(x1)
2. Sample x2 from P(x2|x1)
3. Sample x3 from P(x3|x1)
4. Sample x4 from P(x4|x2,x3)

Page 16: Sampling Bayesian Networks

16

Forward Sampling-Answering Queries

Task: given T samples {S1, S2, …, ST}, estimate P(Xi = xi):

P̂(Xi = xi) = #samples(Xi = xi) / T

Basically, count the proportion of samples where Xi = xi

Page 17: Sampling Bayesian Networks

17

Forward Sampling w/ Evidence

Input: Bayesian network X = {X1,…,XN}; N - # nodes; E - evidence; T - # samples

Output: T samples consistent with E

1. For t = 1 to T
2. For i = 1 to N
3. Xi ← sample xi^t from P(xi | pai)
4. If Xi ∈ E and the sampled xi differs from the observed value, reject the sample:
5. set i = 1 and go to step 2
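A sketch of the rejection variant, using the same hypothetical dict encoding as above; a sample is abandoned as soon as a sampled evidence variable disagrees with its observed value:

```python
import random

def sample_discrete(domain, probs):
    r, acc = random.random(), 0.0
    for v, p in zip(domain, probs):
        acc += p
        if r < acc:
            return v
    return domain[-1]

def rejection_sample(order, parents, cpt, domain, evidence, T):
    """Collect T samples consistent with the evidence dict {node: observed value}."""
    accepted = []
    while len(accepted) < T:
        sample, ok = {}, True
        for X in order:
            pa = tuple(sample[U] for U in parents[X])
            sample[X] = sample_discrete(domain[X], cpt[X][pa])
            if X in evidence and sample[X] != evidence[X]:
                ok = False        # sampled value disagrees with evidence: reject, restart
                break
        if ok:
            accepted.append(sample)
    return accepted
```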

Page 18: Sampling Bayesian Networks

18

Forward sampling (example)

Same network: X1 is a parent of X2 and X3; X2 and X3 are parents of X4. CPTs: P(x1), P(x2|x1), P(x3|x1), P(x4|x2,x3).

Evidence: X3 = 0. Generate sample k:

1. Sample x1 from P(x1)
2. Sample x2 from P(x2|x1)
3. Sample x3 from P(x3|x1)
4. If x3 ≠ 0, reject the sample and start again from step 1; otherwise
5. Sample x4 from P(x4|x2,x3)

Page 19: Sampling Bayesian Networks

19

Forward Sampling: Illustration

Let Y be a subset of evidence nodes s.t. Y=u

Page 20: Sampling Bayesian Networks

20

Forward Sampling - How many samples?

Theorem: Let s(y) be the estimate of P(y) resulting from a randomly chosen sample set S with T samples. Then, to guarantee a relative error of at most ε with probability at least 1-δ, it is enough to have:

T ≥ c / (P(y) ε²)

where c is a constant that depends on δ. Derived from Chebychev's bound. Equivalently, P( s(y) ∈ [ (1-ε)P(y), (1+ε)P(y) ] ) ≥ 1-δ.

Page 21: Sampling Bayesian Networks

21

Forward Sampling - How many samples?

Theorem: Let s(y) be the estimate of P(y) resulting from a randomly chosen sample set S with T samples. Then, to guarantee a relative error of at most ε with probability at least 1-δ, it is enough to have:

T ≥ (4 / (P(y) ε²)) ln(2/δ)

Derived from Hoeffding's bound (the full proof is given in Koller). Equivalently, P( s(y) ∈ [ (1-ε)P(y), (1+ε)P(y) ] ) ≥ 1-δ.

Page 22: Sampling Bayesian Networks

22

Forward Sampling: Performance

Advantages: P(xi | pa(xi)) is readily available; samples are independent!

Drawbacks: If the evidence E is rare (P(e) is low), then we will reject most of the samples! Since the P(y) in the sample-size estimate is unknown, we must estimate it from the samples themselves! If P(e) is small, T becomes very big!

Page 23: Sampling Bayesian Networks

23

Problem: Evidence

Forward sampling suffers from a high rejection rate when evidence is present.

Alternatives that fix the evidence values instead of rejecting samples: Gibbs sampling (MCMC), Likelihood Weighting, Importance Sampling

Page 24: Sampling Bayesian Networks

24

Forward Sampling Bibliography

{henrion88} M. Henrion, "Propagating uncertainty in Bayesian networks by probabilistic logic sampling", Uncertainty in AI, pp. 149-163, 1988.

Page 25: Sampling Bayesian Networks

25

Gibbs Sampling

Markov Chain Monte Carlo method (Gelfand and Smith, 1990; Smith and Roberts, 1993; Tierney, 1994)

Samples are dependent and form a Markov chain. Sample from P'(X|e), which converges to P(X|e). Guaranteed to converge when all P > 0.

Methods to improve convergence: Blocking, Rao-Blackwellisation

Error bounds: lag-t autocovariance; multiple chains and Chebyshev's inequality

Page 26: Sampling Bayesian Networks

26

Gibbs Sampling (Pearl, 1988)

A sample t ∈ [1, 2, …] is an instantiation of all variables in the network:

x^t = {X1 = x1^t, X2 = x2^t, …, XN = xN^t}

Sampling process: fix the values of the observed variables e; instantiate the node values in sample x^0 at random; generate samples x^1, x^2, …, x^T from P(x|e); compute posteriors from the samples.

Page 27: Sampling Bayesian Networks

27

Ordered Gibbs Sampler

Generate sample x^{t+1} from x^t by processing all variables in some order:

X1 = x1^{t+1} ← sampled from P(x1 | x2^t, x3^t, …, xN^t, e)
X2 = x2^{t+1} ← sampled from P(x2 | x1^{t+1}, x3^t, …, xN^t, e)
…
XN = xN^{t+1} ← sampled from P(xN | x1^{t+1}, x2^{t+1}, …, x_{N-1}^{t+1}, e)

In short, for i = 1 to N:

Xi = xi^{t+1} ← sampled from P(xi | x^t \ xi, e)

Page 28: Sampling Bayesian Networks

28

Gibbs Sampling (cont'd) (Pearl, 1988)

Important: P(xi | x^t \ xi) = P(xi | markov_i^t), i.e., the conditional depends only on Xi's Markov blanket:

P(xi | x^t \ xi) ∝ P(xi | pai) Π_{Xj ∈ ch_i} P(xj | paj)

Markov blanket of Xi: M_i = pai ∪ ch_i ∪ ( ∪_{Xj ∈ ch_i} paj )

Given its Markov blanket (parents, children, and the children's parents), Xi is independent of all other nodes.

Page 29: Sampling Bayesian Networks

29

Ordered Gibbs Sampling Algorithm

Input: X, E. Output: T samples {x^t}.

Fix the evidence E and generate samples from P(X | E):

1. For t = 1 to T (compute samples)
2. For i = 1 to N (loop through variables)
3. Xi ← sample xi^t from P(Xi | markov^t \ Xi)
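A Python sketch of the ordered Gibbs sampler. The Markov-blanket conditional is computed by multiplying Xi's CPT with the CPTs of its children and normalizing, as on the previous slide; the dict encoding of the network (parents, children, CPTs) is hypothetical:

```python
import random

def sample_discrete(domain, probs):
    r, acc = random.random(), 0.0
    for v, p in zip(domain, probs):
        acc += p
        if r < acc:
            return v
    return domain[-1]

def markov_blanket_dist(X, state, parents, children, cpt, domain):
    """P(X | Markov blanket): proportional to P(x | pa_X) * prod over children C of P(c | pa_C)."""
    saved = state.get(X)
    weights = []
    for v in domain[X]:
        state[X] = v                                    # try each value of X in turn
        pa = tuple(state[U] for U in parents[X])
        w = cpt[X][pa][domain[X].index(v)]
        for C in children[X]:
            pa_c = tuple(state[U] for U in parents[C])
            w *= cpt[C][pa_c][domain[C].index(state[C])]
        weights.append(w)
    state[X] = saved                                    # restore the caller's assignment
    z = sum(weights)
    return [w / z for w in weights]

def gibbs(order, parents, children, cpt, domain, evidence, T):
    """Generate T dependent samples from (approximately) P(X | evidence)."""
    state = dict(evidence)
    for X in order:                                     # random initial assignment
        if X not in evidence:
            state[X] = random.choice(domain[X])
    samples = []
    for _ in range(T):
        for X in order:
            if X in evidence:
                continue                                # evidence stays clamped
            probs = markov_blanket_dist(X, state, parents, children, cpt, domain)
            state[X] = sample_discrete(domain[X], probs)
        samples.append(dict(state))
    return samples
```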

Page 30: Sampling Bayesian Networks

Answering Queries

Query: P(xi | e) = ?

Method 1 - count the fraction of samples where Xi = xi (histogram estimator):

P̂(Xi = xi) = #samples(Xi = xi) / T

Method 2 - average the conditional probability (mixture estimator):

P̂(Xi = xi) = (1/T) Σ_{t=1..T} P(Xi = xi | markov^t \ Xi)
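A short sketch of the two estimators; it assumes the `gibbs` and `markov_blanket_dist` helpers from the sketch above:

```python
def histogram_estimate(samples, X, x):
    """Method 1: fraction of samples in which X = x."""
    return sum(1 for s in samples if s[X] == x) / len(samples)

def mixture_estimate(samples, X, x, parents, children, cpt, domain):
    """Method 2: average P(X = x | Markov blanket of X) over all samples."""
    total = 0.0
    for s in samples:
        probs = markov_blanket_dist(X, dict(s), parents, children, cpt, domain)
        total += probs[domain[X].index(x)]
    return total / len(samples)
```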

Page 31: Sampling Bayesian Networks

31

Gibbs Sampling Example - BN

X = {X1, X2, …, X9}, E = {X9}

(Network diagram over X1, …, X9; X9 is the evidence node.)

Page 32: Sampling Bayesian Networks

32

Gibbs Sampling Example - BN

Initialize the non-evidence variables at random: X1 = x1^0, X2 = x2^0, X3 = x3^0, X4 = x4^0, X5 = x5^0, X6 = x6^0, X7 = x7^0, X8 = x8^0.

(Network diagram over X1, …, X9.)

Page 33: Sampling Bayesian Networks

33

Gibbs Sampling Example - BN

X1 ← P(X1 | x2^0, …, x8^0, x9), with E = {X9}.

(Network diagram over X1, …, X9.)

Page 34: Sampling Bayesian Networks

34

Gibbs Sampling Example - BN

X2 ← P(X2 | x1^1, x3^0, …, x8^0, x9), with E = {X9}.

(Network diagram over X1, …, X9.)

Page 35: Sampling Bayesian Networks

35

Gibbs Sampling: Illustration

Page 36: Sampling Bayesian Networks

40

Gibbs Sampling: Burn-In

We want to sample from P(X | E), but the starting point is random. Solution: throw away the first K samples, known as "burn-in". What is K? Hard to tell; use intuition. Alternative: initialize the first sample with values drawn from an approximation of P(x|e) (for example, run IBP first).

Page 37: Sampling Bayesian Networks

41

Gibbs Sampling: Convergence

The chain converges to a stationary distribution π*:

π* = π* P, where P is the transition kernel with entries pij = P(xi → xj)

Guaranteed to converge iff the chain is: irreducible, aperiodic, ergodic (∀ i,j: pij > 0)

Page 38: Sampling Bayesian Networks

42

Irreducible

A Markov chain (or its probability transition matrix) is said to be irreducible if it is possible to reach every state from every other state (not necessarily in one step).

In other words, ∀ i,j ∃ k : P^(k)_ij > 0, where k is the number of steps taken to get to state j from state i.

Page 39: Sampling Bayesian Networks

43

Aperiodic

Define d(i) = g.c.d.{n > 0 | it is possible to go from i to i in n steps}, where g.c.d. means the greatest common divisor of the integers in the set. If d(i) = 1 for all i, then the chain is aperiodic.

Page 40: Sampling Bayesian Networks

44

Ergodicity

A recurrent state is a state to which the chain returns with probability 1:

Σ_n P^(n)_ii = ∞

Recurrent, aperiodic states are ergodic.

Note: an extra condition for ergodicity is that expected recurrence time is finite. This holds for recurrent states in a finite state chain.

Page 41: Sampling Bayesian Networks

46

Gibbs Convergence

Gibbs convergence is generally guaranteed as long as all probabilities are positive!

Intuition for the ergodicity requirement: if nodes X and Y are correlated such that X=0 ⇔ Y=0, then: once we sample and assign X=0, we are forced to assign Y=0; once we sample and assign Y=0, we are forced to assign X=0; and we will never be able to change their values again!

Another problem: it can take a very long time to converge!

Page 42: Sampling Bayesian Networks

47

Gibbs Sampling: Performance

+ Advantage: guaranteed to converge to P(X|E)
- Disadvantage: convergence may be slow

Problems: samples are dependent, and the statistical variance is too big in high-dimensional problems.

Page 43: Sampling Bayesian Networks

48

Gibbs: Speeding Convergence

Objectives:

1. Reduce dependence between samples (autocorrelation): skip samples; randomize the variable sampling order

2. Reduce variance: Blocking Gibbs Sampling; Rao-Blackwellisation

Page 44: Sampling Bayesian Networks

49

Skipping Samples

Pick only every k-th sample (Geyer, 1992).

Can reduce dependence between samples, but increases variance and wastes samples!

Page 45: Sampling Bayesian Networks

50

Randomized Variable Order

Random Scan Gibbs Sampler: pick each next variable Xi for update at random with probability pi, where Σi pi = 1. (In the simplest case, the pi are uniform.)

In some instances this reduces variance (MacEachern, Peruggia, 1999, "Subsampling the Gibbs Sampler: Variance Reduction").

Page 46: Sampling Bayesian Networks

51

Blocking

Sample several variables together, as a block.

Example: Given three variables X, Y, Z with domains of size 2, group Y and Z together to form a variable W = {Y,Z} with domain size 4. Then, given sample (x^t, y^t, z^t), compute the next sample:

x^{t+1} ← P(x | y^t, z^t) = P(x | w^t)
w^{t+1} = (y^{t+1}, z^{t+1}) ← P(y, z | x^{t+1})

+ Can improve convergence greatly when two variables are strongly correlated!
- The domain of the block variable grows exponentially with the number of variables in a block!

Page 47: Sampling Bayesian Networks

52

Blocking Gibbs Sampling

Jensen, Kong, Kjaerulff, 1993, "Blocking Gibbs Sampling in Very Large Probabilistic Expert Systems"

Select a set of subsets E1, E2, E3, …, Ek such that:

Ei ⊆ X
∪i Ei = X
Ai = X \ Ei

Sample from P(Ei | Ai).

Page 48: Sampling Bayesian Networks

53

Rao-Blackwellisation

Do not sample all variables! Sample a subset!

Example: Given three variables X, Y, Z, sample only X and Y and sum out Z. Given sample (x^t, y^t), compute the next sample:

x^{t+1} ← P(x | y^t)
y^{t+1} ← P(y | x^{t+1})
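A toy sketch of this X,Y scheme for three binary variables whose joint P(X,Y,Z) is given as an explicit table (the numbers are made up for illustration); Z is summed out analytically inside each conditional, so only X and Y are ever sampled:

```python
import random
from itertools import product

# Hypothetical joint distribution P(X, Y, Z) over three binary variables.
values = [0.02, 0.08, 0.10, 0.05, 0.20, 0.15, 0.25, 0.15]       # sums to 1.0
joint = dict(zip(product([0, 1], repeat=3), values))             # key: (x, y, z)

def cond_x_given_y(y):
    """P(X | Y = y) with Z summed out analytically."""
    w = [sum(joint[(x, y, z)] for z in (0, 1)) for x in (0, 1)]
    s = sum(w)
    return [wi / s for wi in w]

def cond_y_given_x(x):
    """P(Y | X = x) with Z summed out analytically."""
    w = [sum(joint[(x, y, z)] for z in (0, 1)) for y in (0, 1)]
    s = sum(w)
    return [wi / s for wi in w]

def rb_gibbs(T):
    """Rao-Blackwellised Gibbs over (X, Y) only; Z never appears in a sample."""
    x, y, samples = 0, 0, []
    for _ in range(T):
        x = 0 if random.random() < cond_x_given_y(y)[0] else 1
        y = 0 if random.random() < cond_y_given_x(x)[0] else 1
        samples.append((x, y))
    return samples
```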

Page 49: Sampling Bayesian Networks

54

Rao-Blackwell Theorem

Bottom line: reducing the number of variables in a sample reduces the variance!

Page 50: Sampling Bayesian Networks

55

Blocking vs. Rao-Blackwellisation

Standard Gibbs: P(x|y,z), P(y|x,z), P(z|x,y)  (1)

Blocking: P(x|y,z), P(y,z|x)  (2)

Rao-Blackwellised: P(x|y), P(y|x)  (3)

Var3 < Var2 < Var1 [Liu, Wong, Kong, 1994, "Covariance structure of the Gibbs sampler…"]

(Diagram: a three-node network over X, Y, Z.)

Page 51: Sampling Bayesian Networks

56

Rao-Blackwellised Gibbs: Cutset Sampling

Select C ⊆ X (possibly a cycle-cutset), |C| = m

Fix the evidence E

Initialize the cutset nodes with random values: for i = 1 to m, set Ci = ci^0

For t = 1 to T, generate samples: for i = 1 to m,

Ci = ci^{t+1} ← P(ci | c1^{t+1}, …, c_{i-1}^{t+1}, c_{i+1}^t, …, cm^t, e)
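A sketch of the cutset-sampling loop. The call `exact_conditional(Ci, rest, evidence)` stands in for an exact inference routine such as BTE over the network with the remaining cutset variables and the evidence instantiated; it is a hypothetical callback, not an actual implementation:

```python
import random

def sample_discrete(domain, probs):
    r, acc = random.random(), 0.0
    for v, p in zip(domain, probs):
        acc += p
        if r < acc:
            return v
    return domain[-1]

def cutset_sampling(cutset, domain, evidence, exact_conditional, T):
    """cutset: list of cutset variables C1..Cm.
    exact_conditional(Ci, rest, evidence) -> normalized list of P(ci | rest, evidence)
    over domain[Ci]; 'rest' is the current assignment to the other cutset variables.
    The callback is assumed to run exact inference (e.g. bucket/tree elimination)."""
    c = {Ci: random.choice(domain[Ci]) for Ci in cutset}    # random initial assignment
    samples = []
    for _ in range(T):
        for Ci in cutset:
            rest = {Cj: c[Cj] for Cj in cutset if Cj != Ci}
            probs = exact_conditional(Ci, rest, evidence)
            c[Ci] = sample_discrete(domain[Ci], probs)
        samples.append(dict(c))
    return samples
```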

Page 52: Sampling Bayesian Networks

57

Cutset Sampling

Select a subset C = {C1, …, CK} ⊆ X. A sample t ∈ [1, 2, …] is an instantiation of C:

c^t = {C1 = c1^t, C2 = c2^t, …, CK = cK^t}

Sampling process: fix the values of the observed variables e; generate sample c^0 at random; generate samples c^1, c^2, …, c^T from P(c|e); compute posteriors from the samples.

Page 53: Sampling Bayesian Networks

58

Cutset Sampling: Generating Samples

Generate sample c^{t+1} from c^t:

C1 = c1^{t+1} ← sampled from P(c1 | c2^t, c3^t, …, cK^t, e)
C2 = c2^{t+1} ← sampled from P(c2 | c1^{t+1}, c3^t, …, cK^t, e)
…
CK = cK^{t+1} ← sampled from P(cK | c1^{t+1}, c2^{t+1}, …, c_{K-1}^{t+1}, e)

In short, for i = 1 to K:

Ci = ci^{t+1} ← sampled from P(ci | c^t \ ci, e)

Page 54: Sampling Bayesian Networks

59

Rao-Blackwellised Gibbs: Cutset Sampling

How to compute P(ci | c^t \ ci, e)?

Compute the joint P(ci, c^t \ ci, e) for each value ci ∈ D(Ci), then normalize:

P(ci | c^t \ ci, e) = α P(ci, c^t \ ci, e)

Computation efficiency depends on the choice of C.

Page 55: Sampling Bayesian Networks

60

Rao-Blackwellised Gibbs: Cutset Sampling

How to choose C? Special case: C is a cycle-cutset - O(N). General case: apply Bucket Tree Elimination (BTE) - O(exp(w)), where w is the induced width of the network when the nodes in C are observed.

Pick C wisely so as to minimize w ⇒ the notion of a w-cutset.

Page 56: Sampling Bayesian Networks

61

w-cutset Sampling

C = w-cutset of the network: a set of nodes such that, when C and E are instantiated, the adjusted induced width of the network is ≤ w.

The complexity of exact inference is then bounded exponentially by w!

A cycle-cutset is a special case.

Page 57: Sampling Bayesian Networks

62

Cutset Sampling-Answering Queries

Query: ci ∈ C, P(ci | e) = ? Same as Gibbs (a special case of w-cutset sampling), computed while generating sample t:

P̂(ci | e) = (1/T) Σ_{t=1..T} P(ci | c^t \ ci, e)

Query: P(xi | e) = ? for Xi ∉ C, computed after generating sample t:

P̂(xi | e) = (1/T) Σ_{t=1..T} P(xi | c^t, e)

Page 58: Sampling Bayesian Networks

63

Cutset Sampling Example

Cutset C = {X2, X5}, with initial assignment c^0 = {x2^0, x5^0}. Evidence: E = x9.

(Network diagram over X1, …, X9.)

Page 59: Sampling Bayesian Networks

64

Cutset Sampling Example

Sample a new value for X2, starting from c^0 = {x2^0, x5^0}:

x2^1 ← P(x2 | x5^0, x9), where for each value x2':

P(x2' | x5^0, x9) = BTE(x2', x5^0, x9) / Σ_{x2''} BTE(x2'', x5^0, x9)

(Network diagram over X1, …, X9.)

Page 60: Sampling Bayesian Networks

65

Cutset Sampling Example

Sample a new value for X5, given x2^1:

x5^1 ← P(x5 | x2^1, x9), where for each value x5':

P(x5' | x2^1, x9) = BTE(x2^1, x5', x9) / Σ_{x5''} BTE(x2^1, x5'', x9)

This yields the next sample c^1 = {x2^1, x5^1}.

(Network diagram over X1, …, X9.)

Page 61: Sampling Bayesian Networks

66

Cutset Sampling Example

Query P(x2 | e) for the sampled (cutset) node X2, using three samples:

Sample 1: x2^1 ← P(x2 | x5^0, x9)
Sample 2: x2^2 ← P(x2 | x5^1, x9)
Sample 3: x2^3 ← P(x2 | x5^2, x9)

P̂(x2 | x9) = (1/3) [ P(x2 | x5^0, x9) + P(x2 | x5^1, x9) + P(x2 | x5^2, x9) ]

(Network diagram over X1, …, X9.)

Page 62: Sampling Bayesian Networks

67

Cutset Sampling Example

Query P(x3 | e) for the non-sampled node X3, using three samples:

c^1 = {x2^1, x5^1}: P(x3 | x2^1, x5^1, x9)
c^2 = {x2^2, x5^2}: P(x3 | x2^2, x5^2, x9)
c^3 = {x2^3, x5^3}: P(x3 | x2^3, x5^3, x9)

P̂(x3 | x9) = (1/3) [ P(x3 | x2^1, x5^1, x9) + P(x3 | x2^2, x5^2, x9) + P(x3 | x2^3, x5^3, x9) ]

(Network diagram over X1, …, X9.)

Page 63: Sampling Bayesian Networks

68

Gibbs: Error Bounds

Objectives: estimate the needed number of samples T; estimate the error.

Methodology:

1 chain: use the lag-k autocovariance to estimate T

M chains: use the standard sampling variance to estimate the error

Page 64: Sampling Bayesian Networks

69

Gibbs: lag-k autocovariance

Estimator and per-sample values:

P̂ = P̂(xi | e) = (1/T) Σ_{t=1..T} P(xi | x^t \ xi),   with P_t = P(xi | x^t \ xi)

Lag-k autocovariance:

γ(k) = (1/T) Σ_{t=1..T-k} (P_t - P̂)(P_{t+k} - P̂)

Variance of the estimator:

Var(P̂) = (1/T) [ γ(0) + 2 Σ_{i=1..2δ+1} γ(i) ]

Page 65: Sampling Bayesian Networks

70

Gibbs: lag-k autocovariance

Estimate the Monte Carlo variance:

Var(P̂) = (1/T) [ γ(0) + 2 Σ_{i=1..2δ+1} γ(i) ]

Here, δ is the smallest positive integer satisfying γ̂(2δ) + γ̂(2δ+1) > 0.

Effective chain size:

T̂ = γ̂(0) / Var(P̂)

In the absence of autocovariance: T̂ = T.
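A sketch of the single-chain variance estimate, assuming `p` is the list of per-sample values P(xi | x^t \ xi) recorded while sampling; the stopping rule for how many lags to include follows the pairwise-positivity reading of the slide's δ condition:

```python
def autocovariance(p, k):
    """Lag-k autocovariance of the sequence p[0..T-1] of per-sample probabilities."""
    T = len(p)
    mean = sum(p) / T
    return sum((p[t] - mean) * (p[t + k] - mean) for t in range(T - k)) / T

def mc_variance(p):
    """Var(P_hat) = (1/T) * (gamma(0) + 2 * sum of the leading autocovariances),
    truncated when a consecutive pair of autocovariances stops being positive."""
    T = len(p)
    total = autocovariance(p, 0)
    k = 1
    while k + 1 < T:
        pair = autocovariance(p, k) + autocovariance(p, k + 1)
        if pair <= 0:            # one reading of the slide's delta-based cutoff
            break
        total += 2 * pair
        k += 2
    return total / T

def effective_chain_size(p):
    """T_hat = gamma(0) / Var(P_hat); reduces to T when there is no autocovariance."""
    return autocovariance(p, 0) / mc_variance(p)
```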

Page 66: Sampling Bayesian Networks

71

Gibbs: Multiple Chains

Generate M chains of size K. Each chain m produces an independent estimate P_m:

P_m = P̂(xi | e) = (1/K) Σ_{t=1..K} P(xi | x^t \ xi)

Treat the P_m as independent random variables and estimate P(xi|e) as the average of the P_m(xi|e):

P̂ = (1/M) Σ_{m=1..M} P_m

Page 67: Sampling Bayesian Networks

72

Gibbs: Multiple Chains

The {P_m} are independent random variables. Therefore the sampling variance of their average can be estimated as:

S² = (1/(M-1)) Σ_{m=1..M} (P_m - P̂)²,   Var(P̂) ≈ S² / M

giving the confidence interval P̂ ± t_{α/2, M-1} S / √M.
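A sketch of the multiple-chains estimate, assuming `chain_estimates` holds the M per-chain values P_m(xi|e); a fixed z-value stands in for the Student-t quantile t_{α/2, M-1} to keep the sketch dependency-free:

```python
import math

def multi_chain_summary(chain_estimates, z=1.96):
    """Combine M independent per-chain estimates P_m into the overall estimate,
    its estimated variance, and a confidence half-width.
    z approximates the Student-t quantile t_{alpha/2, M-1} (exact for large M)."""
    M = len(chain_estimates)
    p_hat = sum(chain_estimates) / M
    s2 = sum((pm - p_hat) ** 2 for pm in chain_estimates) / (M - 1)   # sample variance
    return p_hat, s2 / M, z * math.sqrt(s2 / M)
```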

Page 68: Sampling Bayesian Networks

73

Geman & Geman, 1984

Geman, S. & Geman, D., 1984. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell. 6, 721-741.

Introduces Gibbs sampling. Places the idea of Gibbs sampling in a general setting in which the collection of variables is structured in a graphical model and each variable has a neighborhood corresponding to a local region of the graphical structure. Geman and Geman use the Gibbs distribution to define the joint distribution on this structured set of variables.

Page 69: Sampling Bayesian Networks

74

Tanner & Wong, 1987

Tanner and Wong (1987): data augmentation; convergence results.

Page 70: Sampling Bayesian Networks

75

Pearl, 1988

Pearl, 1988. Probabilistic Reasoning in Intelligent Systems, Morgan Kaufmann.

In the case of Bayesian networks, the neighborhoods correspond to the Markov blanket of a variable, and the joint distribution is defined by the factorization of the network.

Page 71: Sampling Bayesian Networks

76

Gelfand & Smith, 1990

Gelfand, A.E. and Smith, A.F.M., 1990. Sampling-based approaches to calculating marginal densities. J. Am. Statist. Assoc. 85, 398-409.

Shows variance reduction from using the mixture estimator for posterior marginals.

Page 72: Sampling Bayesian Networks

77

Neal, 1992

R. M. Neal, 1992. Connectionist learning of belief networks, Artificial Intelligence, v. 56, pp. 71-118. Stochastic simulation in Noisy-Or networks.

Page 73: Sampling Bayesian Networks

78

CPCS54 Test Results

MSE vs. #samples (left) and time (right)

Ergodic, |X| = 54, D(Xi) = 2, |C| = 15, |E| = 4

Exact Time = 30 sec using Cutset Conditioning

(Charts: CPCS54, n=54, |C|=15, |E|=3 - MSE vs. # samples and MSE vs. time (sec), comparing cutset sampling and Gibbs.)

Page 74: Sampling Bayesian Networks

79

CPCS179 Test Results

MSE vs. #samples (left) and time (right). Non-ergodic (1 deterministic CPT entry). |X| = 179, |C| = 8, 2 ≤ D(Xi) ≤ 4, |E| = 35.

Exact Time = 122 sec using Loop-Cutset Conditioning

(Charts: CPCS179, n=179, |C|=8, |E|=35 - MSE vs. # samples and MSE vs. time (sec), comparing cutset sampling and Gibbs.)

Page 75: Sampling Bayesian Networks

80

CPCS360b Test Results

MSE vs. #samples (left) and time (right)

Ergodic, |X| = 360, D(Xi)=2, |C| = 21, |E| = 36

Exact Time > 60 min using Cutset Conditioning

Exact Values obtained via Bucket Elimination

(Charts: CPCS360b, n=360, |C|=21, |E|=36 - MSE vs. # samples and MSE vs. time (sec), comparing cutset sampling and Gibbs.)

Page 76: Sampling Bayesian Networks

81

Random Networks

MSE vs. #samples (left) and time (right)

|X| = 100, D(Xi) =2,|C| = 13, |E| = 15-20

Exact Time = 30 sec using Cutset Conditioning

(Charts: RANDOM, n=100, |C|=13, |E|=15-20 - MSE vs. # samples and MSE vs. time (sec), comparing cutset sampling and Gibbs.)

Page 77: Sampling Bayesian Networks

82

Coding Networks

MSE vs. time (right)

Non-Ergodic, |X| = 100, D(Xi)=2, |C| = 13-16, |E| = 50

Sample Ergodic Subspace U={U1, U2,…Uk}

Exact Time = 50 sec using Cutset Conditioning

(Diagram of a coding network with layers x1-x4, u1-u4, p1-p4, y1-y4. Chart: Coding Networks, n=100, |C|=12-14 - MSE vs. time (sec), comparing IBP, Gibbs, and cutset sampling.)

Page 78: Sampling Bayesian Networks

83

Non-Ergodic Hailfinder

MSE vs. #samples (left) and time (right)

Non-Ergodic, |X| = 56, |C| = 5, 2 <=D(Xi) <=11, |E| = 0

Exact Time = 2 sec using Loop-Cutset Conditioning

(Charts: HailFinder, n=56, |C|=5, |E|=1 - MSE vs. time (sec) and MSE vs. # samples, comparing cutset sampling and Gibbs.)

Page 79: Sampling Bayesian Networks

84

Non-Ergodic CPCS360b - MSE

(Chart: cpcs360b, N=360, |E|=[20-34], w*=20 - MSE vs. time (sec), comparing Gibbs, IBP, and cutset sampling with |C|=26, fw=3 and |C|=48, fw=2.)

MSE vs. Time

Non-Ergodic, |X| = 360, |C| = 26, D(Xi)=2

Exact Time = 50 min using BTE

Page 80: Sampling Bayesian Networks

85

Non-Ergodic CPCS360b - MaxErr

(Chart: cpcs360b, N=360, |E|=[20-34] - maximum error vs. time (sec), comparing Gibbs, IBP, and cutset sampling with |C|=26, fw=3 and |C|=48, fw=2.)

Page 81: Sampling Bayesian Networks

86

Likelihood Weighting (Fung and Chang, 1990; Shachter and Peot, 1990)

Works well for likely evidence!

"Clamping" the evidence + forward sampling + weighing samples by the evidence likelihood.

Page 82: Sampling Bayesian Networks

87

Likelihood Weighting

(Diagram: a chain of nodes with the evidence nodes e marked among the sampled nodes.)

Sample in topological order over X!

xi ← P(Xi | pai); P(Xi | pai) is a simple look-up in the CPT!

Page 83: Sampling Bayesian Networks

88

Likelihood Weighting Outline

w ← 1

For each Xi ∈ X (in topological order) do:

  If Xi ∈ E then

    Xi ← ei   (assign the observed value)

    w ← w · P(ei | pai)

  Else

    Xi = xi^t ← sample from P(Xi | pai)

EndFor
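A Python sketch of this loop together with the weighted estimator from the next slide, again using the hypothetical dict encoding of the network from the forward-sampling sketches:

```python
import random

def sample_discrete(domain, probs):
    r, acc = random.random(), 0.0
    for v, p in zip(domain, probs):
        acc += p
        if r < acc:
            return v
    return domain[-1]

def likelihood_weighting(order, parents, cpt, domain, evidence, T):
    """Return T (sample, weight) pairs: evidence nodes are clamped, the other nodes
    are forward-sampled, and the weight is the product of P(ei | pai) over evidence."""
    weighted = []
    for _ in range(T):
        sample, w = {}, 1.0
        for X in order:
            pa = tuple(sample[U] for U in parents[X])
            if X in evidence:
                sample[X] = evidence[X]
                w *= cpt[X][pa][domain[X].index(evidence[X])]   # likelihood of evidence
            else:
                sample[X] = sample_discrete(domain[X], cpt[X][pa])
        weighted.append((sample, w))
    return weighted

def lw_posterior(weighted, X, x):
    """P_hat(X = x | e) = (sum of weights of samples with X = x) / (sum of all weights)."""
    num = sum(w for s, w in weighted if s[X] == x)
    den = sum(w for _, w in weighted)
    return num / den
```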

Page 84: Sampling Bayesian Networks

89

Likelihood Weighting

Estimate the posterior marginals:

P̂(xi | e) = P̂(xi, e) / P̂(e) = [ Σ_{t=1..T} w^t δ(xi, x^t) ] / [ Σ_{t=1..T} w^t ]

where δ(xi, x^t) = 1 if Xi = xi in sample x^t and 0 otherwise, and the weight of sample x^t is

w^t = P(x^t) / Q(x^t) = Π_j P(ej | paj),   since Q(ej | paj) = 1 (evidence is clamped)

Page 85: Sampling Bayesian Networks

90

Likelihood Weighting

Converges to the exact posterior marginals. Generates samples fast.

But: the sampling distribution is close to the prior (especially if E ⊆ leaf nodes), which increases the sampling variance; convergence may be slow; many samples with P(x^t) = 0 are rejected.

Page 86: Sampling Bayesian Networks

93

Likelihood Convergence (Chebychev's Inequality)

Assume P(X=x|e) has mean μ and variance σ².

Chebychev:

P( μ(1-ε) ≤ P̂(x|e) ≤ μ(1+ε) ) ≥ 1 - σ² / (N ε² μ²), so it suffices to take N ≥ c σ² / (ε² μ²), where c depends on the desired confidence.

μ = P(x|e) is unknown ⇒ obtain it from the samples!

Page 87: Sampling Bayesian Networks

94

Error Bound Derivation

Let P' = P̂(X = x' | e) = K/T, where K = # samples with X = x'. K is a Bernoulli (binomial) random variable, so with p = P(x'|e) and q = 1 - p:

Var(P') = pq / T

Chebychev: P( |X - μ| ≥ kσ ) ≤ 1/k²

Corollary: setting kσ = εp gives

P( |P' - p| ≥ εp ) ≤ pq / (T ε² p²), and since p(1-p) ≤ 1/4 this is at most 1 / (4 T ε² p²).

From the Law of Large Numbers: Var(P') = σ²/T, so P( |P' - p| ≥ ε ) ≤ σ² / (T ε²).

Page 88: Sampling Bayesian Networks

95

Likelihood Convergence 2

Assume P(X=x|e) has mean μ and variance σ².

Zero-One Estimation Theory (Karp et al., 1989):

T ≥ (4 / (μ ε²)) ln(2/δ)

μ = P(x|e) is unknown ⇒ obtain it from the samples!

Page 89: Sampling Bayesian Networks

96

Local Variance Bound (LVB)(Dagum&Luby, 1994)

Let Δ be the LVB of a binary-valued network: every CPT entry satisfies P(x | pa(x)) ∈ [l, u] or P(x | pa(x)) ∈ [1-u, 1-l], with l, u ∈ [0,1], l ≤ u, and

Δ = max{ u/l, (1-l)/(1-u) }

Page 90: Sampling Bayesian Networks

97

LVB Estimate(Pradhan,Dagum,1996)

Using the LVB, the Zero-One Estimator bound can be rewritten as:

T ≥ (4 Δ^k / ε²) ln(2/δ)

Page 91: Sampling Bayesian Networks

98

Importance Sampling Idea

In general, it is hard to sample from target distribution P(X|E)

Generate samples from sampling (proposal) distribution Q(X)

Weigh each sample against P(X|E)

I = E_P[f] = ∫ f(x) P(x) dx = ∫ f(x) [P(x) / Q(x)] Q(x) dx
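A minimal sketch of the idea on a one-dimensional toy problem: samples are drawn from a proposal Q and reweighted by P/Q to estimate E_P[f] (the particular P, Q, and f here are made up for illustration):

```python
import math
import random

def importance_estimate(f, p_pdf, q_pdf, q_sampler, T):
    """Estimate E_P[f] = integral of f(x) * (P(x)/Q(x)) * Q(x) dx by sampling from Q."""
    total = 0.0
    for _ in range(T):
        x = q_sampler()
        total += f(x) * p_pdf(x) / q_pdf(x)      # weight each sample by P/Q
    return total / T

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Toy check: target P = N(0, 1), proposal Q = N(0, 2), f(x) = x^2, so E_P[f] = 1.
estimate = importance_estimate(
    f=lambda x: x * x,
    p_pdf=lambda x: normal_pdf(x, 0.0, 1.0),
    q_pdf=lambda x: normal_pdf(x, 0.0, 2.0),
    q_sampler=lambda: random.gauss(0.0, 2.0),
    T=10000)
```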

Page 92: Sampling Bayesian Networks

99

Importance Sampling Variants

Importance sampling: forward, non-adaptive. Nodes are sampled in topological order; the sampling distribution (for non-instantiated nodes) equals the prior conditionals.

Importance sampling: forward, adaptive. Nodes are sampled in topological order; the sampling distribution is adapted according to the average importance weights obtained in previous samples [Cheng, Druzdzel 2000].

Page 93: Sampling Bayesian Networks

100

AIS-BN

The most efficient variant of importance sampling to date is AIS-BN: Adaptive Importance Sampling for Bayesian Networks.

Jian Cheng and Marek J. Druzdzel. AIS-BN: An adaptive importance sampling algorithm for evidential reasoning in large Bayesian networks. Journal of Artificial Intelligence Research (JAIR), 13:155-188, 2000.

Page 94: Sampling Bayesian Networks

101

Importance vs. Gibbs

Gibbs: x^t ∼ P̃(x | e), where P̃(x|e) ≈ P(x|e), and

f̂ = (1/T) Σ_{t=1..T} f(x^t)

Importance: x^t ∼ Q(x | e), and

f̂ = (1/T) Σ_{t=1..T} w^t f(x^t),   where w^t = P(x^t) / Q(x^t)