Learning Bayesian Networks


Page 1: Learning Bayesian Networks

Learning Bayesian Networks

Page 2: Learning Bayesian Networks

Dimensions of Learning

• Model: Bayes net vs. Markov net

• Data: Complete vs. Incomplete

• Structure: Known vs. Unknown

• Objective: Generative vs. Discriminative

Page 3: Learning Bayesian Networks

Learning Bayes nets from data

[Figure: a data table (e.g., X1: true, false, false, true; X2: 1, 5, 3, 2; X3: 0.7, -1.6, 5.9, 6.3; ...) plus prior/expert information is fed into a Bayes-net learner, which outputs one or more Bayes nets over X1, ..., X9.]

Page 4: Learning Bayesian Networks

From thumbtacks to Bayes nets

The thumbtack problem can be viewed as learning the probability for a very simple BN:

X: heads/tails

[Figure: the tosses X1, X2, ..., XN (toss 1, toss 2, ..., toss N) are i.i.d. copies of X, all governed by the same parameter.]

Page 5: Learning Bayesian Networks

The next simplest Bayes net

X: heads/tails    Y: heads/tails

[Figure: two unconnected nodes X and Y, each taking the values "heads" or "tails".]

Page 6: Learning Bayesian Networks

The next simplest Bayes net

X: heads/tails    Y: heads/tails

[Figure: parameter node θX with children X1, X2, ..., XN and parameter node θY with children Y1, Y2, ..., YN; each case i (case 1, case 2, ..., case N) contributes the pair (Xi, Yi). A "?" marks the question of whether θX and θY are dependent.]

Page 7: Learning Bayesian Networks

The next simplest Bayes net

X: heads/tails    Y: heads/tails

[Figure: the same two-parameter model, now with no arc between θX and θY: "parameter independence".]

Page 8: Learning Bayesian Networks

The next simplest Bayes net

X: heads/tails    Y: heads/tails

[Figure: the same model; with parameter independence, learning decomposes into two separate thumbtack-like learning problems, one for θX and one for θY.]

Page 9: Learning Bayesian Networks

A bit more difficult...

X: heads/tails → Y: heads/tails

Three probabilities to learn:

• θX=heads

• θY=heads|X=heads

• θY=heads|X=tails

Page 10: Learning Bayesian Networks

A bit more difficult...

X: heads/tails → Y: heads/tails

[Figure: parameter nodes θX, θY|X=heads, and θY|X=tails over the cases; case 1 contributes (X1, Y1), case 2 contributes (X2, Y2). Whether Yi is governed by θY|X=heads or θY|X=tails is selected by whether Xi is heads or tails.]

Page 11: Learning Bayesian Networks

A bit more difficult...

X: heads/tails → Y: heads/tails

[Figure: the same parameter structure θX, θY|X=heads, θY|X=tails over cases 1 and 2.]

Page 12: Learning Bayesian Networks

A bit more difficult...

X: heads/tails → Y: heads/tails

[Figure: the same structure, with question marks asking whether the three parameters are mutually independent.]

Page 13: Learning Bayesian Networks

A bit more difficult...

X: heads/tails → Y: heads/tails

[Figure: the same structure; with parameter independence, learning decomposes into 3 separate thumbtack-like problems.]

Page 14: Learning Bayesian Networks

In general …

Learning probabilities in a Bayes net is straightforward, reducing to separate counting problems as in the sketch below, if:

• Complete data

• Local distributions from the exponential family (binomial, Poisson, gamma, ...)

• Parameter independence

• Conjugate priors
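Under these four conditions, each local distribution is learned by its own conjugate count-and-update. A minimal sketch for the X → Y net, assuming binary variables, Beta priors, and a small hypothetical dataset:

```python
from collections import Counter

# Complete-data Bayesian learning for the two-node net X -> Y.
# Hypothetical data; parameter independence lets us count each
# CPT entry separately, one thumbtack-like problem at a time.
data = [
    {"X": "heads", "Y": "tails"},
    {"X": "heads", "Y": "heads"},
    {"X": "tails", "Y": "tails"},
    {"X": "heads", "Y": "heads"},
]

a_h, a_t = 1.0, 1.0  # uniform Beta(1,1) prior on every parameter

def posterior_mean(n_heads, n_tails):
    # Beta-binomial conjugate update: prior counts + observed counts.
    return (a_h + n_heads) / (a_h + a_t + n_heads + n_tails)

n_x = Counter(case["X"] for case in data)
theta_x = posterior_mean(n_x["heads"], n_x["tails"])

# One thumbtack-like problem per parent configuration of Y.
theta_y = {}
for x_val in ("heads", "tails"):
    n_y = Counter(c["Y"] for c in data if c["X"] == x_val)
    theta_y[x_val] = posterior_mean(n_y["heads"], n_y["tails"])

print(theta_x, theta_y)
```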

Page 15: Learning Bayesian Networks

Incomplete data makes parameters dependent

X: heads/tails → Y: heads/tails

[Figure: the same parameter structure, but with some cases' Xi unobserved; a missing Xi couples θX, θY|X=heads, and θY|X=tails, so the parameters are no longer independent given the data.]

Page 16: Learning Bayesian Networks

Solution: Use EM

• Initialize parameters ignoring missing data

• E step: Infer missing values using current parameters

• M step: Estimate parameters using completed data

• Can also use gradient descent
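A minimal EM sketch for the X → Y net, with hypothetical data in which some X values are missing (None) and maximum-likelihood M steps:

```python
# EM for the net X -> Y when some X observations are missing (None).
# E step: posterior over the missing X given Y and current parameters.
# M step: re-estimate parameters from the expected counts.
data = [("heads", "heads"), (None, "heads"), ("tails", "tails"), (None, "tails")]

# Initial guesses, ignoring the missing entries.
theta_x = 0.5                            # p(X = heads)
theta_y = {"heads": 0.5, "tails": 0.5}   # p(Y = heads | X = x)

for _ in range(50):
    n_x_h = 0.0
    n_y = {"heads": [0.0, 0.0], "tails": [0.0, 0.0]}  # [heads, total] per X
    for x, y in data:
        if x is None:
            # E step: w = p(X = heads | Y = y) by Bayes' rule.
            p_y = lambda xv: theta_y[xv] if y == "heads" else 1 - theta_y[xv]
            num = theta_x * p_y("heads")
            w = num / (num + (1 - theta_x) * p_y("tails"))
        else:
            w = 1.0 if x == "heads" else 0.0
        n_x_h += w
        for xv, wx in (("heads", w), ("tails", 1 - w)):
            n_y[xv][0] += wx * (y == "heads")
            n_y[xv][1] += wx
    # M step: maximum-likelihood estimates from expected counts.
    theta_x = n_x_h / len(data)
    theta_y = {xv: n_y[xv][0] / n_y[xv][1] for xv in n_y}

print(theta_x, theta_y)
```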

Page 17: Learning Bayesian Networks

Learning Bayes-net structure

Given data, which model is correct?

model 1: X   Y   (no arc: X and Y independent)

model 2: X → Y

Page 18: Learning Bayesian Networks

Bayesian approach

Given data, which model is correct? Or rather, which is more likely?

model 1: X   Y,   prior $p(m_1) = 0.7$

model 2: X → Y,   prior $p(m_2) = 0.3$

After seeing data d, the posteriors become $p(m_1 \mid d) = 0.1$ and $p(m_2 \mid d) = 0.9$.
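These posteriors follow from Bayes' theorem given the marginal likelihoods. As a hypothetical illustration (the slide does not give the likelihoods), suppose the data favor $m_2$ by a Bayes factor of 21, i.e. $p(d \mid m_2) = 21\, p(d \mid m_1)$. Then

$p(m_1 \mid d) = \frac{p(m_1)\, p(d \mid m_1)}{p(m_1)\, p(d \mid m_1) + p(m_2)\, p(d \mid m_2)} = \frac{0.7}{0.7 + 0.3 \cdot 21} = \frac{0.7}{7.0} = 0.1$

and hence $p(m_2 \mid d) = 0.9$.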

Page 19: Learning Bayesian Networks

Bayesian approach: Model averaging

Given data, which model is correct? Or rather, which is more likely?

model 1: X   Y,   $p(m_1) = 0.7$,   $p(m_1 \mid d) = 0.1$

model 2: X → Y,   $p(m_2) = 0.3$,   $p(m_2 \mid d) = 0.9$

Rather than committing to one model, average the models' predictions, weighting each by its posterior probability.
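In symbols (the standard model-averaging identity, stated here for concreteness), the prediction for the next observation mixes the per-model predictions with posterior weights:

$p(x_{N+1} \mid d) = \sum_m p(m \mid d)\, p(x_{N+1} \mid d, m) = 0.1 \cdot p(x_{N+1} \mid d, m_1) + 0.9 \cdot p(x_{N+1} \mid d, m_2)$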

Page 20: Learning Bayesian Networks

Bayesian approach: Model selection

Given data, which model is correct? Or rather, which is more likely?

model 1: X   Y,   $p(m_1) = 0.7$,   $p(m_1 \mid d) = 0.1$

model 2: X → Y,   $p(m_2) = 0.3$,   $p(m_2 \mid d) = 0.9$

Keep the best model, for:

• Explanation

• Understanding

• Tractability

Page 21: Learning Bayesian Networks

To score a model, use Bayes' theorem

Given data d:

$p(m \mid d) \propto p(m)\, p(d \mid m)$   (the model score)

$p(d \mid m) = \int p(d \mid \theta, m)\, p(\theta \mid m)\, d\theta$   (the "marginal likelihood", integrating the likelihood $p(d \mid \theta, m)$ over the parameter prior)

Page 22: Learning Bayesian Networks

Thumbtack example

X: heads/tails

$p(d \mid m) = \int \theta^{\#h} (1 - \theta)^{\#t}\, p(\theta \mid m)\, d\theta$

With the conjugate prior $p(\theta \mid m) \propto \theta^{\alpha_h - 1} (1 - \theta)^{\alpha_t - 1}$ (a Beta distribution), the integral has a closed form:

$p(d \mid m) = \frac{\Gamma(\alpha_h + \alpha_t)}{\Gamma(\alpha_h + \alpha_t + \#h + \#t)} \cdot \frac{\Gamma(\alpha_h + \#h)}{\Gamma(\alpha_h)} \cdot \frac{\Gamma(\alpha_t + \#t)}{\Gamma(\alpha_t)}$
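A minimal numerical sketch of this closed form, using the standard-library log-gamma to stay stable for large counts (the counts and hyperparameters below are hypothetical):

```python
from math import lgamma  # log Gamma; avoids overflow for large counts

def thumbtack_log_marginal(n_h, n_t, a_h=1.0, a_t=1.0):
    """Log marginal likelihood log p(d | m) for #heads = n_h and
    #tails = n_t under a conjugate Beta(a_h, a_t) prior."""
    return (lgamma(a_h + a_t) - lgamma(a_h + a_t + n_h + n_t)
            + lgamma(a_h + n_h) - lgamma(a_h)
            + lgamma(a_t + n_t) - lgamma(a_t))

# e.g. 6 heads and 4 tails under a uniform Beta(1,1) prior
print(thumbtack_log_marginal(6, 4))
```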

Page 23: Learning Bayesian Networks

More complicated graphs

X: heads/tails → Y: heads/tails

3 separate thumbtack-like learning problems, so $p(d \mid m)$ is a product of three factors, each of the thumbtack form
$\frac{\Gamma(\alpha_h + \alpha_t)}{\Gamma(\alpha_h + \alpha_t + \#h + \#t)} \cdot \frac{\Gamma(\alpha_h + \#h)}{\Gamma(\alpha_h)} \cdot \frac{\Gamma(\alpha_t + \#t)}{\Gamma(\alpha_t)}$
with its own hyperparameters and counts:

• one for θX, using the heads/tails counts of X

• one for θY|X=heads, using the counts of Y among cases where X = heads

• one for θY|X=tails, using the counts of Y among cases where X = tails

Page 24: Learning Bayesian Networks

Model score for a discrete Bayes net

$p(d \mid m) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij} + N_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(\alpha_{ijk} + N_{ijk})}{\Gamma(\alpha_{ijk})}$

where

• $N_{ijk}$: number of cases where $X_i = x_i^k$ and $\mathrm{Pa}_i = \mathrm{pa}_i^j$

• $r_i$: number of states of $X_i$

• $q_i$: number of instances (configurations) of the parents of $X_i$

• $N_{ij} = \sum_{k=1}^{r_i} N_{ijk}$,   $\alpha_{ij} = \sum_{k=1}^{r_i} \alpha_{ijk}$
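A sketch of the full product, given the sufficient statistics $N_{ijk}$ and hyperparameters $\alpha_{ijk}$ as nested lists (the arrays below are hypothetical; this generalizes the thumbtack term above):

```python
from math import lgamma

def bd_log_score(N, alpha):
    """Log marginal likelihood log p(d | m) for a discrete Bayes net.
    N[i][j][k] and alpha[i][j][k] are the observed and prior counts for
    variable i, parent configuration j, state k."""
    log_p = 0.0
    for N_i, a_i in zip(N, alpha):
        for N_ij, a_ij in zip(N_i, a_i):
            # Gamma(alpha_ij) / Gamma(alpha_ij + N_ij)
            log_p += lgamma(sum(a_ij)) - lgamma(sum(a_ij) + sum(N_ij))
            # product over states k of Gamma(a_ijk + N_ijk) / Gamma(a_ijk)
            for N_ijk, a_ijk in zip(N_ij, a_ij):
                log_p += lgamma(a_ijk + N_ijk) - lgamma(a_ijk)
    return log_p

# X -> Y example: X has one (empty) parent configuration,
# Y has two (X = heads, X = tails); Beta(1,1)-style priors.
N = [[[6, 4]],            # X: 6 heads, 4 tails
     [[4, 2], [1, 3]]]    # Y given X = heads, and given X = tails
alpha = [[[1, 1]], [[1, 1], [1, 1]]]
print(bd_log_score(N, alpha))
```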

Page 25: Learning Bayesian Networks

Computation of marginal likelihood

Efficient closed form if

• Local distributions from the exponential family (binomial, Poisson, gamma, ...)

• Parameter independence

• Conjugate priors

• No missing data (including no hidden variables)

Page 26: Learning Bayesian Networks

Structure search

• Finding the BN structure with the highest score among those structures with at most k parents is NP-hard for k > 1 (Chickering, 1995)

• Heuristic methods (the greedy loop is sketched below):

– Greedy
– Greedy with restarts
– MCMC methods

[Figure: the greedy search loop: initialize structure; score all possible single changes; if any change is better, perform the best change and repeat; otherwise return the saved structure.]
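A skeleton of that loop, assuming hypothetical helpers `score(structure, data)` (any model score, e.g. the log marginal likelihood above) and `neighbors(structure)` (all acyclic single-arc additions, deletions, and reversals); neither name is a fixed API:

```python
# Greedy hill-climbing skeleton for Bayes-net structure search,
# following the loop on the slide. `score` and `neighbors` are
# stand-ins supplied by the caller, not a particular library.
def greedy_search(initial, data, score, neighbors):
    current, current_score = initial, score(initial, data)
    while True:
        # Score all possible single changes ...
        candidates = [(score(s, data), s) for s in neighbors(current)]
        if not candidates:
            return current
        best_score, best = max(candidates, key=lambda t: t[0])
        # ... any changes better? If not, return the saved structure.
        if best_score <= current_score:
            return current
        # Perform the best change and repeat.
        current, current_score = best, best_score
```

Greedy with restarts reruns this from random initial structures and keeps the best result; MCMC instead samples structures in proportion to their scores.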

Page 27: Learning Bayesian Networks

Structure priors

1. All possible structures equally likely

2. Partial ordering, required / prohibited arcs

3. Prior(m) ∝ Similarity(m, prior BN)

Page 28: Learning Bayesian Networks

Parameter priors

• All uniform: Beta(1,1)

• Use a prior Bayes net

Page 29: Learning Bayesian Networks

Parameter priors

Recall the intuition behind the Beta prior for the thumbtack:

• The hyperparameters $\alpha_h$ and $\alpha_t$ can be thought of as imaginary counts from our prior experience, starting from "pure ignorance"

• Equivalent sample size = $\alpha_h + \alpha_t$

• The larger the equivalent sample size, the more confident we are about the long-run fraction
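A small worked example with hypothetical numbers: take $\alpha_h = 3$ and $\alpha_t = 7$, so the equivalent sample size is 10 and the prior mean is $3/10 = 0.3$. After observing 5 heads in 20 tosses, the posterior mean of the long-run fraction is

$\frac{\alpha_h + \#h}{(\alpha_h + \alpha_t) + N} = \frac{3 + 5}{10 + 20} = \frac{8}{30} \approx 0.27$

already much closer to the empirical fraction $5/20 = 0.25$ than to the prior mean.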

Page 30: Learning Bayesian Networks

Parameter priors

[Figure: a prior Bayes net over x1, ..., x9 plus an equivalent sample size gives an imaginary count for any variable configuration, and hence parameter priors for any Bayes net structure for X1…Xn: "parameter modularity".]

Page 31: Learning Bayesian Networks

Combining knowledge & data

[Figure: a prior network over x1, ..., x9 plus an equivalent sample size, combined with data (a table of true/false cases for x1, x2, x3, ...), yields one or more improved networks over x1, ..., x9.]