Learning Bayesian Networks (From David Heckerman’s tutorial)
Posted: 22-Dec-2015
Learning Bayesian Networks
(From David Heckerman's tutorial)
[Figure: Bayes net(s) alongside data, a table of cases with columns X1 (true, false, false, true), X2 (1, 5, 3, 2), X3 (0.7, -1.6, 5.9, 6.3), ...]
Learning Bayes Nets From Data

[Figure: data plus prior/expert information feed a Bayes-net learner, which outputs a network over X1, ..., X9]
Overview
- Introduction to Bayesian statistics: learning a probability
- Learning probabilities in a Bayes net
- Learning Bayes-net structure

Learning Probabilities: Classical Approach
Simple case: Flipping a thumbtack
[Figure: a thumbtack landing heads or tails]
- The true probability θ is unknown
- Given i.i.d. data, estimate θ using an estimator with good properties: low bias, low variance, consistency (e.g., the ML estimate)
Learning Probabilities: Bayesian Approach

[Figure: a thumbtack landing heads or tails]
- The true probability θ is unknown
- Bayesian approach: maintain a probability density p(θ) over θ ∈ [0, 1]
Bayesian Approach: use Bayes' rule to compute a new density for θ given data

p(θ | heads) = p(θ) p(heads | θ) / ∫ p(θ) p(heads | θ) dθ

(posterior ∝ prior × likelihood)
The Likelihood

p(heads | θ) = θ
p(tails | θ) = 1 − θ
p(hhth...ttth | θ) = θ^#h (1 − θ)^#t

("binomial distribution")
Example: Application of Bayes' rule to the observation of a single "heads"

[Figure: the prior p(θ), the likelihood p(heads | θ) = θ, and the resulting posterior p(θ | heads), each plotted over θ ∈ [0, 1]]
A Bayes net for learning probabilities

[Figure: θ is a parent of X1, X2, ..., XN (toss 1, toss 2, ..., toss N)]

p(hhth...ttth | θ) = θ^#h (1 − θ)^#t
Sufficient Statistics

p(θ | hhth...ttth) = p(θ) p(hhth...ttth | θ) / p(hhth...ttth) ∝ p(θ) θ^#h (1 − θ)^#t

(#h, #t) are sufficient statistics
The probability of heads on the next toss

p(the (N+1)th toss is heads | d) = ∫ p(heads | θ) p(θ | d) dθ
                                 = ∫ θ p(θ | d) dθ ≡ E(θ | d)
Prior Distributions for Prior Distributions for
Direct assessmentDirect assessment Parametric distributionsParametric distributions
– Conjugate distributions (for convenience)Conjugate distributions (for convenience)
– Mixtures of conjugate distributionsMixtures of conjugate distributions
Conjugate Family of Distributions

Beta distribution:
p(θ) = Beta(θ | α_h, α_t) ∝ θ^(α_h − 1) (1 − θ)^(α_t − 1),   α_h, α_t > 0

Properties:
p(θ | #h heads, #t tails) = Beta(θ | α_h + #h, α_t + #t) ∝ θ^(α_h + #h − 1) (1 − θ)^(α_t + #t − 1)
E(θ) = α_h / (α_h + α_t)
Intuition
- The hyperparameters α_h and α_t can be thought of as imaginary counts from our prior experience, starting from "pure ignorance"
- Equivalent sample size = α_h + α_t
- The larger the equivalent sample size, the more confident we are about the true probability
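Since the conjugate update is just count addition, the slide's properties can be sketched in a few lines of Python (a minimal illustration, not code from the tutorial):

```python
def beta_update(alpha_h, alpha_t, n_heads, n_tails):
    """Posterior hyperparameters after observing n_heads heads and n_tails tails."""
    return alpha_h + n_heads, alpha_t + n_tails

def prob_next_heads(alpha_h, alpha_t):
    """p(next toss = heads) = E(theta) = alpha_h / (alpha_h + alpha_t)."""
    return alpha_h / (alpha_h + alpha_t)

# Uniform prior Beta(1, 1), then observe 3 heads and 7 tails:
ah, at = beta_update(1, 1, 3, 7)
print(prob_next_heads(ah, at))  # (1 + 3) / (2 + 10) = 1/3
```

The imaginary counts (here one of each) are simply added to the observed counts, which is exactly the "equivalent sample size" intuition above.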
Beta Distributions

[Figure: densities of Beta(0.5, 0.5), Beta(1, 1), Beta(3, 2), and Beta(19, 39)]
Assessment of a Beta Distribution

Method 1: Equivalent samples
- assess α_h and α_t
- or assess α_h + α_t and α_h / (α_h + α_t)

Method 2: Imagined future samples
p(heads) = 0.2 and p(heads | 3 heads) = 0.5  ⇒  α_h = 1, α_t = 4
check: 0.2 = 1 / (1 + 4) and 0.5 = (1 + 3) / (1 + 3 + 4)
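Method 2 amounts to solving two linear equations for the hyperparameters. A small sketch (my own helper, not from the tutorial) makes the arithmetic explicit:

```python
def beta_from_imagined_samples(p0, pk, k):
    """Recover (alpha_h, alpha_t) from two assessments:
        p(heads) = p0            ->  alpha_h / s = p0, with s = alpha_h + alpha_t
        p(heads | k heads) = pk  ->  (alpha_h + k) / (s + k) = pk
    Solving for the equivalent sample size s, then for alpha_h."""
    s = k * (1 - pk) / (pk - p0)
    alpha_h = p0 * s
    return alpha_h, s - alpha_h

print(beta_from_imagined_samples(0.2, 0.5, 3))  # ≈ (1.0, 4.0), matching the slide
```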
Generalization to m discrete outcomes ("multinomial distribution")

Dirichlet distribution:
p(θ_1, ..., θ_m) = Dirichlet(θ_1, ..., θ_m | α_1, ..., α_m) ∝ ∏_{i=1}^m θ_i^(α_i − 1),   α_i > 0,  ∑_{i=1}^m θ_i = 1

Properties:
p(θ_1, ..., θ_m | N_1, ..., N_m) = Dirichlet(θ_1, ..., θ_m | α_1 + N_1, ..., α_m + N_m)
E(θ_i) = α_i / ∑_{k=1}^m α_k
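The Dirichlet update mirrors the Beta case, one count per outcome. A minimal sketch (illustrative, not from the tutorial):

```python
def dirichlet_update(alphas, counts):
    """Posterior hyperparameters: alpha_i + N_i for each outcome i."""
    return [a + n for a, n in zip(alphas, counts)]

def predictive(alphas):
    """p(next observation = outcome i) = E(theta_i) = alpha_i / sum(alphas)."""
    total = sum(alphas)
    return [a / total for a in alphas]

# Three outcomes, uniform Dirichlet(1, 1, 1), observed counts (2, 5, 3):
post = dirichlet_update([1, 1, 1], [2, 5, 3])
print(predictive(post))  # [3/13, 6/13, 4/13]
```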
More generalizations (see, e.g., Bernardo & Smith, 1994)
- Likelihoods from the exponential family: binomial, multinomial, Poisson, gamma, normal
Overview
- Intro to Bayesian statistics: learning a probability
- Learning probabilities in a Bayes net
- Learning Bayes-net structure
From thumbtacks to Bayes nets

The thumbtack problem can be viewed as learning the probability for a very simple BN:

[Figure: a single node X (heads/tails); unrolled, θ is a parent of X1, X2, ..., XN (toss 1, toss 2, ..., toss N)]
The next simplest Bayes net

[Figure: two unconnected nodes, X (heads/tails) and Y (heads/tails)]

[Figure, unrolled over cases 1, 2, ..., N: θ_X is a parent of X1, ..., XN, and θ_Y is a parent of Y1, ..., YN, with no arc between θ_X and θ_Y]

"parameter independence"

⇒ two separate thumbtack-like learning problems
A bit more difficult...

[Figure: X (heads/tails) → Y (heads/tails)]

Three probabilities to learn:
- θ_X = p(X = heads)
- θ_Y|X=heads = p(Y = heads | X = heads)
- θ_Y|X=tails = p(Y = heads | X = tails)

[Figure, unrolled over cases 1, 2, ...: θ_X is a parent of each Xi; θ_Y|X=heads and θ_Y|X=tails are parents of each Yi, with the relevant parameter selected by the value of Xi]

⇒ 3 separate thumbtack-like problems
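With complete data and parameter independence, the decomposition is just three separate count-and-update problems. A sketch with hypothetical data (my own example, not from the tutorial):

```python
# Hypothetical complete-data cases for X -> Y ("h" = heads, "t" = tails):
cases = [("h", "t"), ("h", "h"), ("t", "t"), ("h", "h"), ("t", "h")]

def posterior_mean(alpha_h, alpha_t, heads, tails):
    """Thumbtack-style Beta posterior mean with prior Beta(alpha_h, alpha_t)."""
    return (alpha_h + heads) / (alpha_h + alpha_t + heads + tails)

# Problem 1: theta_X, counting X over all cases.
x_heads = sum(1 for x, _ in cases if x == "h")
theta_x = posterior_mean(1, 1, x_heads, len(cases) - x_heads)

# Problems 2 and 3: theta_Y|X=h and theta_Y|X=t, counting Y over matching cases only.
theta_y_given = {}
for parent in ("h", "t"):
    ys = [y for x, y in cases if x == parent]
    y_heads = sum(1 for y in ys if y == "h")
    theta_y_given[parent] = posterior_mean(1, 1, y_heads, len(ys) - y_heads)
```

Each parameter sees only the counts relevant to it, which is exactly why the problems separate.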
In general…

Learning probabilities in a BN is straightforward if:
- Local distributions are from the exponential family (binomial, Poisson, gamma, ...)
- Parameter independence
- Conjugate priors
- Complete data
Incomplete data makes parameters dependent

[Figure: the unrolled network for X → Y; when some Xi or Yi are unobserved, θ_X, θ_Y|X=heads, and θ_Y|X=tails become dependent in the posterior]
Overview
- Intro to Bayesian statistics: learning a probability
- Learning probabilities in a Bayes net
- Learning Bayes-net structure
Learning Bayes-net structure

Given data, which model is correct?

model 1: X  Y (no arc)
model 2: X → Y
Bayesian approach

Given data, which model is correct? more likely?

model 1: X  Y (no arc),  p(m1) = 0.7
model 2: X → Y,  p(m2) = 0.3

Data d:
p(m1 | d) = 0.1
p(m2 | d) = 0.9
Bayesian approach: Model Averaging

With the same priors and posteriors (p(m1) = 0.7, p(m2) = 0.3; p(m1 | d) = 0.1, p(m2 | d) = 0.9), average the predictions of the models, weighting each by p(m | d).
Bayesian approach: Model Selection

Alternatively, keep only the best model (here m2, with p(m2 | d) = 0.9):
- Explanation
- Understanding
- Tractability
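The two strategies differ in one line of arithmetic. A sketch using the slide's posteriors and hypothetical per-model predictions (the prediction values are assumptions, not from the slides):

```python
# Posteriors from the slide; hypothetical predictions p(some event | d, m):
posterior = {"m1": 0.1, "m2": 0.9}
prediction = {"m1": 0.5, "m2": 0.8}   # hypothetical numbers

# Model averaging: weight each model's prediction by p(m | d).
averaged = sum(posterior[m] * prediction[m] for m in posterior)   # 0.1*0.5 + 0.9*0.8 = 0.77

# Model selection: keep only the most probable model (here m2).
best = max(posterior, key=posterior.get)
selected = prediction[best]   # 0.8
```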
To score a model, use Bayes' rule

Given data d:

p(m | d) ∝ p(m) p(d | m)     (model score = prior × likelihood)
p(d | m) = ∫ p(d | θ, m) p(θ | m) dθ     ("marginal likelihood")
Thumbtack example

[Figure: a single node X (heads/tails), with conjugate prior Beta(θ | α_h, α_t)]

p(d | m) = ∫ p(θ | m) θ^#h (1 − θ)^#t dθ
         = [Γ(α_h + α_t) / Γ(α_h + α_t + #h + #t)] · [Γ(α_h + #h) / Γ(α_h)] · [Γ(α_t + #t) / Γ(α_t)]
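The closed form above is easy to evaluate numerically; working in log space keeps the Gamma functions from overflowing. A minimal sketch (not from the tutorial):

```python
from math import exp, lgamma

def log_marginal_likelihood(alpha_h, alpha_t, heads, tails):
    """log p(d | m) for the thumbtack with a Beta(alpha_h, alpha_t) prior,
    using the Gamma-function closed form."""
    return (lgamma(alpha_h + alpha_t) - lgamma(alpha_h + alpha_t + heads + tails)
            + lgamma(alpha_h + heads) - lgamma(alpha_h)
            + lgamma(alpha_t + tails) - lgamma(alpha_t))

# Uniform Beta(1, 1) prior, 1 head and 1 tail:
# Gamma(2)/Gamma(4) * Gamma(2)/Gamma(1) * Gamma(2)/Gamma(1) = 1/6
print(exp(log_marginal_likelihood(1, 1, 1, 1)))  # ≈ 0.1667
```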
More complicated graphs

[Figure: X (heads/tails) → Y (heads/tails)]

3 separate thumbtack-like learning problems, so the marginal likelihood factors into three terms of the same Gamma-function form:

p(d | m) = ∏_θ [Γ(α_h + α_t) / Γ(α_h + α_t + #h + #t)] · [Γ(α_h + #h) / Γ(α_h)] · [Γ(α_t + #t) / Γ(α_t)]

where the product runs over θ_X, θ_Y|X=heads, and θ_Y|X=tails, and the counts #h, #t and hyperparameters α_h, α_t in each term are those for that parameter (e.g., for θ_Y|X=heads, Y is counted over only the cases with X = heads).
Model score for a discrete BN

p(d | m) = ∏_{i=1}^{n} ∏_{j=1}^{q_i} [Γ(α_ij) / Γ(α_ij + N_ij)] ∏_{k=1}^{r_i} [Γ(α_ijk + N_ijk) / Γ(α_ijk)]

where:
- N_ijk: number of cases in d where X_i = x_i^k and Pa_i = pa_i^j
- r_i: number of states of X_i
- q_i: number of instances (configurations) of the parents of X_i
- α_ij = ∑_{k=1}^{r_i} α_ijk,  N_ij = ∑_{k=1}^{r_i} N_ijk
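Because the score decomposes over variables (and over parent configurations within each variable), it can be computed one family at a time. A sketch with a hypothetical count table and uniform hyperparameters α_ijk = 1 (my own example, not from the tutorial):

```python
from math import lgamma

def family_log_score(counts, alpha=1.0):
    """Log of one variable's factor in p(d | m).
    counts[j][k] = N_ijk for parent configuration j and state k,
    with uniform hyperparameters alpha_ijk = alpha."""
    total = 0.0
    for row in counts:                               # one parent configuration j
        a_ij, n_ij = alpha * len(row), sum(row)
        total += lgamma(a_ij) - lgamma(a_ij + n_ij)
        total += sum(lgamma(alpha + n) - lgamma(alpha) for n in row)
    return total

# Hypothetical 6-case data for X -> Y (both binary):
# X alone: counts (4, 2); Y given X=heads: (3, 1); Y given X=tails: (0, 2).
log_score = family_log_score([[4, 2]]) + family_log_score([[3, 1], [0, 2]])
```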
Computation of Marginal Likelihood

Efficient closed form if:
- Local distributions are from the exponential family (binomial, Poisson, gamma, ...)
- Parameter independence
- Conjugate priors
- No missing data (including no hidden variables)
Practical considerations
- The number of possible BN structures for n variables is super-exponential in n
- How do we find the best graph(s)?
- How do we assign structure and parameter priors to all possible graphs?
Model search
- Finding the BN structure with the highest score among those structures with at most k parents is NP-hard for k > 1 (Chickering, 1995)
- Heuristic methods:
  - Greedy
  - Greedy with restarts
  - MCMC methods

[Figure: the greedy loop: initialize a structure; score all possible single changes; if any change is better, perform the best change and repeat; otherwise return the saved structure]
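The greedy loop in the flowchart can be sketched generically. Here `score` and `neighbors` are assumptions: `score(g)` returns the model score of structure `g`, and `neighbors(g)` yields all single-change variants (arc additions, deletions, and reversals that keep the graph acyclic):

```python
def greedy_search(initial, score, neighbors):
    """Hill-climb over structures: repeatedly take the best single change
    until no change improves the score, then return the saved structure."""
    current, current_score = initial, score(initial)
    while True:
        scored = [(score(g), g) for g in neighbors(current)]   # all single changes
        if not scored:
            return current
        best_score, best = max(scored, key=lambda t: t[0])
        if best_score <= current_score:     # no change is better: stop
            return current
        current, current_score = best, best_score   # perform the best change

# Toy usage: "structures" are integers, neighbors are +/-1, score peaks at 5.
print(greedy_search(0, lambda g: -(g - 5) ** 2, lambda g: [g - 1, g + 1]))  # 5
```

Greedy-with-restarts simply calls this from several random initial structures and keeps the best result, reducing the chance of stopping at a local maximum.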
Structure priors
1. All possible structures equally likely
2. Partial ordering, required / prohibited arcs
3. p(m) ∝ similarity(m, prior BN)
Parameter priors
- All uniform: Beta(1, 1)
- Use a prior BN
Parameter priors

Recall the intuition behind the Beta prior for the thumbtack:
- The hyperparameters α_h and α_t can be thought of as imaginary counts from our prior experience, starting from "pure ignorance"
- Equivalent sample size = α_h + α_t
- The larger the equivalent sample size, the more confident we are about the long-run fraction
Parameter priors

[Figure: a prior network over x1, ..., x9 plus an equivalent sample size yield an imaginary count for any variable configuration, hence parameter priors for any BN structure over X1, ..., Xn ("parameter modularity")]
Combine user knowledge and data

[Figure: a prior network over x1, ..., x9 plus an equivalent sample size, together with data (cases such as x1 = true/false/false/true, x2 = false/false/false/true, x3 = true/true/false/false, ...), yield improved network(s)]