Learning Bayesian Networks (From David Heckerman’s tutorial)


Page 1: Learning Bayesian Networks (From David Heckerman’s tutorial)

Learning Bayesian Networks

(From David Heckerman’s tutorial)

Page 2: Learning Bayesian Networks (From David Heckerman’s tutorial)

Learning Bayes Nets From Data

[Figure: a table of data over variables X1 (true/false), X2 (integers), X3 (real values), ..., together with prior/expert information, feeds into a Bayes-net learner, which outputs Bayes net(s) over X1, ..., X9.]

Page 3: Learning Bayesian Networks (From David Heckerman’s tutorial)

Overview

Introduction to Bayesian statistics: Learning a probability

Learning probabilities in a Bayes net

Learning Bayes-net structure

Page 4: Learning Bayesian Networks (From David Heckerman’s tutorial)

Learning Probabilities: Classical Approach

Simple case: Flipping a thumbtack

[Figure: a thumbtack landing "heads" or "tails".]

True probability θ is unknown

Given iid data, estimate θ using an estimator with good properties: low bias, low variance, consistent (e.g., the ML estimate)
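For concreteness, here is a minimal sketch of the ML estimate, which is just the observed fraction of heads (the flip sequence is made up for illustration):

```python
# Maximum-likelihood estimate for the thumbtack: the fraction of heads.
flips = ["h", "t", "t", "h", "h", "h", "t", "h"]  # illustrative data

n_heads = flips.count("h")
n_tails = flips.count("t")
theta_ml = n_heads / (n_heads + n_tails)

print(f"#h={n_heads}, #t={n_tails}, ML estimate = {theta_ml:.3f}")  # 0.625
```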

Page 5: Learning Bayesian Networks (From David Heckerman’s tutorial)

Learning Probabilities: Bayesian Approach

True probability θ is unknown

Bayesian probability density for θ:

[Figure: a density p(θ) plotted over θ ∈ [0, 1].]

Page 6: Learning Bayesian Networks (From David Heckerman’s tutorial)

Bayesian Approach: use Bayes' rule to compute a new density for θ given data

$$p(\theta \mid \text{heads}) = \frac{p(\theta)\, p(\text{heads} \mid \theta)}{\int p(\theta)\, p(\text{heads} \mid \theta)\, d\theta}$$

posterior ∝ prior × likelihood
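A minimal numerical sketch of this update, discretizing θ on a grid (the uniform prior and the single "heads" observation are illustrative choices):

```python
import numpy as np

# Discretize theta and apply Bayes' rule numerically.
theta = np.linspace(0.0, 1.0, 1001)
dtheta = theta[1] - theta[0]
prior = np.ones_like(theta)      # uniform prior p(theta), an illustrative choice
likelihood = theta               # p(heads | theta) = theta

posterior = prior * likelihood
posterior /= (posterior * dtheta).sum()  # normalize by the integral (Riemann sum)

print((posterior * dtheta).sum())  # ~1.0: the posterior integrates to one
```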

Page 7: Learning Bayesian Networks (From David Heckerman’s tutorial)

The Likelihood

$$p(\text{heads} \mid \theta) = \theta, \qquad p(\text{tails} \mid \theta) = 1 - \theta$$

$$p(hhth\ldots ttth \mid \theta) = \theta^{\#h}(1-\theta)^{\#t}$$

“binomial distribution”

Page 8: Learning Bayesian Networks (From David Heckerman’s tutorial)

Example: Application of Bayes' rule to the observation of a single "heads"

[Figure: three curves over θ ∈ [0, 1]: the prior p(θ), the likelihood p(heads | θ) = θ, and the posterior p(θ | heads).]

Page 9: Learning Bayesian Networks (From David Heckerman’s tutorial)

A Bayes net for learning probabilities

[Figure: a node θ with children X1, X2, ..., XN (toss 1, toss 2, ..., toss N).]

$$p(hhth\ldots ttth \mid \theta) = \theta^{\#h}(1-\theta)^{\#t}$$

Page 10: Learning Bayesian Networks (From David Heckerman’s tutorial)

Sufficient statistics

$$p(\theta \mid hhth\ldots ttth) = \frac{p(\theta)\, \theta^{\#h}(1-\theta)^{\#t}}{p(hhth\ldots ttth)}$$

(#h, #t) are sufficient statistics

Page 11: Learning Bayesian Networks (From David Heckerman’s tutorial)

The probability of heads on the next toss

$$p(X_{N+1} = \text{heads} \mid \mathbf{d}) = \int p(X_{N+1} = \text{heads} \mid \theta)\, p(\theta \mid \mathbf{d})\, d\theta = \int \theta\, p(\theta \mid \mathbf{d})\, d\theta = E_{p(\theta \mid \mathbf{d})}[\theta]$$

Page 12: Learning Bayesian Networks (From David Heckerman’s tutorial)

Prior Distributions for θ

Direct assessment

Parametric distributions

– Conjugate distributions (for convenience)

– Mixtures of conjugate distributions

Page 13: Learning Bayesian Networks (From David Heckerman’s tutorial)

Conjugate Family of Distributions

Beta distribution:

$$p(\theta) = \text{Beta}(\theta \mid \alpha_h, \alpha_t) \propto \theta^{\alpha_h - 1}(1-\theta)^{\alpha_t - 1}, \qquad \alpha_h, \alpha_t > 0,\ \ \theta \in [0, 1]$$

$$p(\theta \mid \#\text{heads}, \#\text{tails}) = \text{Beta}(\theta \mid \alpha_h + \#h,\ \alpha_t + \#t) \propto \theta^{\alpha_h + \#h - 1}(1-\theta)^{\alpha_t + \#t - 1}$$

Properties:

$$E(\theta) = \frac{\alpha_h}{\alpha_h + \alpha_t}$$
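A sketch of the conjugate update using scipy.stats.beta (the prior hyperparameters and the counts are illustrative):

```python
from scipy import stats

# Illustrative Beta prior hyperparameters and observed counts.
alpha_h, alpha_t = 2.0, 2.0   # imaginary counts
n_h, n_t = 7, 3               # observed #heads, #tails

# Conjugacy: Beta prior + thumbtack likelihood -> Beta posterior.
posterior = stats.beta(alpha_h + n_h, alpha_t + n_t)

# E(theta | data) = (alpha_h + #h) / (alpha_h + alpha_t + #h + #t)
print(posterior.mean())       # 9/14 ≈ 0.643
```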

Page 14: Learning Bayesian Networks (From David Heckerman’s tutorial)

Intuition

The hyperparameters α_h and α_t can be thought of as imaginary counts from our prior experience, starting from "pure ignorance"

Equivalent sample size = α_h + α_t

The larger the equivalent sample size, the more confident we are about the true probability

Page 15: Learning Bayesian Networks (From David Heckerman’s tutorial)

Beta Distributions

[Figure: densities of Beta(0.5, 0.5), Beta(1, 1), Beta(3, 2), and Beta(19, 39).]

Page 16: Learning Bayesian Networks (From David Heckerman’s tutorial)

Assessment of a Beta Distribution

Method 1: Equivalent sample

– assess α_h and α_t

– assess α_h + α_t and α_h/(α_h + α_t)

Method 2: Imagined future samples

p(heads) = 0.2 and p(heads | 3 heads) = 0.5  ⇒  α_h = 1, α_t = 4

check: 0.2 = 1/(1+4), 0.5 = (1+3)/(1+4+3)
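A quick check of those numbers (the two assessed probabilities pin down the two hyperparameters):

```python
# Verify that alpha_h = 1, alpha_t = 4 reproduce the two assessments.
alpha_h, alpha_t = 1.0, 4.0

p_heads = alpha_h / (alpha_h + alpha_t)                    # prior predictive
p_heads_after_3 = (alpha_h + 3) / (alpha_h + alpha_t + 3)  # after 3 imagined heads

print(p_heads)          # 0.2
print(p_heads_after_3)  # 0.5
```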

Page 17: Learning Bayesian Networks (From David Heckerman’s tutorial)

Generalization to m discrete outcomes ("multinomial distribution")

Dirichlet distribution:

$$p(\theta_1, \ldots, \theta_m) = \text{Dirichlet}(\theta \mid \alpha_1, \ldots, \alpha_m) \propto \prod_{i=1}^{m} \theta_i^{\alpha_i - 1}, \qquad \alpha_i > 0,\ \ \sum_{i=1}^{m} \theta_i = 1$$

$$p(\theta_1, \ldots, \theta_m \mid N_1, \ldots, N_m) = \text{Dirichlet}(\theta \mid \alpha_1 + N_1, \ldots, \alpha_m + N_m) \propto \prod_{i=1}^{m} \theta_i^{\alpha_i + N_i - 1}$$

Properties:

$$E(\theta_i) = \frac{\alpha_i}{\sum_{k=1}^{m} \alpha_k}$$
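The multinomial analogue of the Beta update, in a few lines (the hyperparameters and counts are illustrative):

```python
import numpy as np

# Dirichlet prior over m = 3 outcomes; hyperparameters and counts made up.
alpha = np.array([1.0, 1.0, 1.0])   # Dirichlet(alpha_1, ..., alpha_m)
counts = np.array([5, 2, 3])        # observed N_1, ..., N_m

posterior_alpha = alpha + counts                          # conjugate update
posterior_mean = posterior_alpha / posterior_alpha.sum()  # E(theta_i)

print(posterior_mean)               # approximately [0.462, 0.231, 0.308]
```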

Page 18: Learning Bayesian Networks (From David Heckerman’s tutorial)

More generalizations (see, e.g., Bernardo + Smith, 1994)

Likelihoods from the exponential family: Binomial, Multinomial, Poisson, Gamma, Normal

Page 19: Learning Bayesian Networks (From David Heckerman’s tutorial)

Overview

Intro to Bayesian statistics: Learning a probability

Learning probabilities in a Bayes net

Learning Bayes-net structure

Page 20: Learning Bayesian Networks (From David Heckerman’s tutorial)

From thumbtacks to Bayes nets

The thumbtack problem can be viewed as learning the probability for a very simple BN:

X    heads/tails

[Figure: a node θ with children X1, X2, ..., XN (toss 1, toss 2, ..., toss N).]

Page 21: Learning Bayesian Networks (From David Heckerman’s tutorial)

The next simplest Bayes net

[Figure: two separate nodes, X (heads/tails) and Y (heads/tails), each a thumbtack with outcomes "heads" and "tails".]

Page 22: Learning Bayesian Networks (From David Heckerman’s tutorial)

The next simplest Bayes net

[Figure: θX with children X1, X2, ..., XN and θY with children Y1, Y2, ..., YN; each case i = 1, ..., N contributes one (Xi, Yi) pair. A "?" marks the question of whether θX and θY are dependent.]

Page 23: Learning Bayesian Networks (From David Heckerman’s tutorial)

The next simplest Bayes net

[Figure: the same network with no arc between θX and θY: "parameter independence".]

Page 24: Learning Bayesian Networks (From David Heckerman’s tutorial)

The next simplest Bayes net

[Figure: under parameter independence, the network splits into two separate thumbtack-like learning problems, one for θX and one for θY.]

Page 25: Learning Bayesian Networks (From David Heckerman’s tutorial)

A bit more difficult...

X (heads/tails) → Y (heads/tails)

Three probabilities to learn:

θ_{X=heads}

θ_{Y=heads | X=heads}

θ_{Y=heads | X=tails}

Page 26: Learning Bayesian Networks (From David Heckerman’s tutorial)

A bit more difficult...

[Figure: parameter nodes θX, θ_{Y|X=heads}, θ_{Y|X=tails} with data nodes X1, Y1 (case 1, X1 = heads) and X2, Y2 (case 2, X2 = tails).]

Page 27: Learning Bayesian Networks (From David Heckerman’s tutorial)

A bit more difficult...

[Figure: the same network of parameter and data nodes.]

Page 28: Learning Bayesian Networks (From David Heckerman’s tutorial)

A bit more difficult...

[Figure: the same network, with "?" marks asking whether the parameters are dependent.]

Page 29: Learning Bayesian Networks (From David Heckerman’s tutorial)

A bit more difficult...

[Figure: with parameter independence, the network splits into 3 separate thumbtack-like problems: θX, θ_{Y|X=heads}, θ_{Y|X=tails}.]

Page 30: Learning Bayesian Networks (From David Heckerman’s tutorial)

In general…

Learning probabilities in a BN is straightforward if

– Local distributions from the exponential family (binomial, Poisson, gamma, ...)

– Parameter independence

– Conjugate priors

– Complete data (see the sketch after this list)
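Under these four conditions each CPT row is its own thumbtack-like problem. A sketch for the two-node net X → Y with Beta(1,1) priors on all three parameters (the cases are made up):

```python
# Complete-data parameter learning for X -> Y with Beta(1,1) priors.
# Each case is an (x, y) pair; the data below are made up.
cases = [("h", "h"), ("h", "t"), ("t", "t"), ("h", "h"), ("t", "h")]

def posterior_mean(n_pos, n_neg, a_pos=1.0, a_neg=1.0):
    """Posterior mean of one thumbtack-like parameter under a Beta prior."""
    return (a_pos + n_pos) / (a_pos + a_neg + n_pos + n_neg)

# theta_X = p(X = heads)
theta_x = posterior_mean(sum(x == "h" for x, _ in cases),
                         sum(x == "t" for x, _ in cases))

# theta_{Y|X=heads} and theta_{Y|X=tails}: one problem per parent state.
theta_y_given = {}
for parent in ("h", "t"):
    ys = [y for x, y in cases if x == parent]
    theta_y_given[parent] = posterior_mean(ys.count("h"), ys.count("t"))

print(theta_x, theta_y_given)   # 4/7, {'h': 3/5, 't': 1/2}
```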

Page 31: Learning Bayesian Networks (From David Heckerman’s tutorial)

Incomplete data makes parameters dependent

[Figure: the same X → Y network; when some cases have missing values, the parameters θX, θ_{Y|X=heads}, θ_{Y|X=tails} become dependent in the posterior.]

Page 32: Learning Bayesian Networks (From David Heckerman’s tutorial)

Overview

Intro to Bayesian statistics: Learning a probability

Learning probabilities in a Bayes net

Learning Bayes-net structure

Page 33: Learning Bayesian Networks (From David Heckerman’s tutorial)

Learning Bayes-net structure

Given data, which model is correct?

model 1: X  Y (no arc)

model 2: X → Y

Page 34: Learning Bayesian Networks (From David Heckerman’s tutorial)

Bayesian approach

Given data, which model is more likely?

model 1: X  Y (no arc)    p(m1) = 0.7 → p(m1 | d) = 0.1

model 2: X → Y            p(m2) = 0.3 → p(m2 | d) = 0.9

(priors on the left; posteriors after observing data d)

Page 35: Learning Bayesian Networks (From David Heckerman’s tutorial)

Bayesian approach: Model Averaging

Given data, which model is more likely?

model 1: X  Y (no arc)    p(m1) = 0.7 → p(m1 | d) = 0.1

model 2: X → Y            p(m2) = 0.3 → p(m2 | d) = 0.9

Average predictions over models, weighting each by p(m | d).

Page 36: Learning Bayesian Networks (From David Heckerman’s tutorial)

Bayesian approach: Model Selection

Given data, which model is more likely?

model 1: X  Y (no arc)    p(m1) = 0.7 → p(m1 | d) = 0.1

model 2: X → Y            p(m2) = 0.3 → p(m2 | d) = 0.9

Keep the best model:

– Explanation

– Understanding

– Tractability

Page 37: Learning Bayesian Networks (From David Heckerman’s tutorial)

To score a model, use Bayes' rule

Given data d:

model score:

$$p(m \mid \mathbf{d}) \propto p(m)\, p(\mathbf{d} \mid m)$$

likelihood ("marginal likelihood"):

$$p(\mathbf{d} \mid m) = \int p(\mathbf{d} \mid \theta, m)\, p(\theta \mid m)\, d\theta$$

Page 38: Learning Bayesian Networks (From David Heckerman’s tutorial)

Thumbtack example

X    heads/tails

With a conjugate Beta(α_h, α_t) prior:

$$p(\mathbf{d} \mid m) = \int \theta^{\#h}(1-\theta)^{\#t}\, p(\theta \mid m)\, d\theta = \frac{\Gamma(\alpha_h + \alpha_t)}{\Gamma(\alpha_h + \alpha_t + \#h + \#t)} \cdot \frac{\Gamma(\alpha_h + \#h)}{\Gamma(\alpha_h)} \cdot \frac{\Gamma(\alpha_t + \#t)}{\Gamma(\alpha_t)}$$
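A sketch of this closed form in log space, using scipy.special.gammaln (the prior and the counts are illustrative):

```python
from scipy.special import gammaln

def log_marginal_likelihood(n_h, n_t, alpha_h=1.0, alpha_t=1.0):
    """Log Beta-binomial marginal likelihood, as in the formula above."""
    a = alpha_h + alpha_t
    return (gammaln(a) - gammaln(a + n_h + n_t)
            + gammaln(alpha_h + n_h) - gammaln(alpha_h)
            + gammaln(alpha_t + n_t) - gammaln(alpha_t))

# Illustrative: 7 heads, 3 tails under a uniform Beta(1,1) prior.
# For Beta(1,1) this equals 1 / ((n+1) * C(n, #h)) = 1/1320 here.
print(log_marginal_likelihood(7, 3))
```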

Page 39: Learning Bayesian Networks (From David Heckerman’s tutorial)

More complicated graphs

X (heads/tails) → Y (heads/tails)

3 separate thumbtack-like learning problems: θX, θ_{Y|X=heads}, θ_{Y|X=tails}

$$p(\mathbf{d} \mid m) = \prod_{j \in \{X,\ Y|X=h,\ Y|X=t\}} \frac{\Gamma(\alpha_h^j + \alpha_t^j)}{\Gamma(\alpha_h^j + \alpha_t^j + \#h_j + \#t_j)} \cdot \frac{\Gamma(\alpha_h^j + \#h_j)}{\Gamma(\alpha_h^j)} \cdot \frac{\Gamma(\alpha_t^j + \#t_j)}{\Gamma(\alpha_t^j)}$$

i.e., one Gamma-ratio factor of the thumbtack form for each of the three parameters, with the counts restricted to the relevant cases.

Page 40: Learning Bayesian Networks (From David Heckerman’s tutorial)

Model score for a discrete BN

$$p(\mathbf{d} \mid m) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij} + N_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(\alpha_{ijk} + N_{ijk})}{\Gamma(\alpha_{ijk})}$$

where

N_ijk: # cases where X_i = x_i^k and Pa_i = pa_i^j

r_i: number of states of X_i

q_i: number of instances of the parents of X_i

α_ij = ∑_{k=1}^{r_i} α_ijk,  N_ij = ∑_{k=1}^{r_i} N_ijk
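A sketch of this score for a single node's family (the outer product over i just repeats it per node); the counts and uniform prior below are illustrative:

```python
import numpy as np
from scipy.special import gammaln

def family_log_score(N_ijk, alpha_ijk):
    """Log BD-score contribution of one node X_i.

    N_ijk, alpha_ijk: arrays of shape (q_i, r_i) -- one row per parent
    configuration j, one column per state k of X_i.
    """
    N_ij = N_ijk.sum(axis=1)        # N_ij = sum_k N_ijk
    a_ij = alpha_ijk.sum(axis=1)    # alpha_ij = sum_k alpha_ijk
    per_config = (gammaln(a_ij) - gammaln(a_ij + N_ij)
                  + (gammaln(alpha_ijk + N_ijk) - gammaln(alpha_ijk)).sum(axis=1))
    return per_config.sum()

# Illustrative: binary X_i with one binary parent (q_i = r_i = 2),
# alpha_ijk = 1 everywhere.
counts = np.array([[6.0, 2.0], [1.0, 5.0]])
print(family_log_score(counts, np.ones_like(counts)))
```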

Page 41: Learning Bayesian Networks (From David Heckerman’s tutorial)

Computation of Marginal Likelihood

Efficient closed form if

– Local distributions from the exponential family (binomial, Poisson, gamma, ...)

– Parameter independence

– Conjugate priors

– No missing data (including no hidden variables)

Page 42: Learning Bayesian Networks (From David Heckerman’s tutorial)

Practical considerations

The number of possible BN structures for n variables is super-exponential in n

How do we find the best graph(s)?

How do we assign structure and parameter priors to all possible graphs?

Page 43: Learning Bayesian Networks (From David Heckerman’s tutorial)

Model search

Finding the BN structure with the highest score among those structures with at most k parents is NP-hard for k > 1 (Chickering, 1995)

Heuristic methods

– Greedy (see the sketch after this slide)

– Greedy with restarts

– MCMC methods

[Flowchart: initialize structure → score all possible single changes → any changes better? → if yes, perform the best change and repeat; if no, return the saved structure.]
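A skeleton of the greedy loop in the flowchart. The `score` and `neighbors` functions (single arc additions, deletions, and reversals that keep the graph acyclic) are assumed to be supplied by the caller:

```python
def greedy_search(initial_structure, neighbors, score):
    """Greedy hill-climbing over BN structures, following the flowchart.

    neighbors(g): yields every structure reachable from g by one arc
        addition, deletion, or reversal that keeps g acyclic (assumed given).
    score(g): the model score, e.g. log marginal likelihood plus log
        structure prior (assumed given).
    """
    current, current_score = initial_structure, score(initial_structure)
    while True:
        # Score all possible single changes.
        scored = [(score(g), g) for g in neighbors(current)]
        if not scored:
            return current
        best_score, best = max(scored, key=lambda pair: pair[0])
        # Any change better? If not, return the saved structure.
        if best_score <= current_score:
            return current
        current, current_score = best, best_score  # perform the best change
```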

Page 44: Learning Bayesian Networks (From David Heckerman’s tutorial)

Structure priors

1. All possible structures equally likely

2. Partial ordering, required / prohibited arcs

3. p(m) ∝ similarity(m, prior BN)

Page 45: Learning Bayesian Networks (From David Heckerman’s tutorial)

Parameter priors

All uniform: Beta(1,1)

Use a prior BN

Page 46: Learning Bayesian Networks (From David Heckerman’s tutorial)

Parameter priors

Recall the intuition behind the Beta prior for the thumbtack:

The hyperparameters α_h and α_t can be thought of as imaginary counts from our prior experience, starting from "pure ignorance"

Equivalent sample size = α_h + α_t

The larger the equivalent sample size, the more confident we are about the long-run fraction

Page 47: Learning Bayesian Networks (From David Heckerman’s tutorial)

Parameter priors

[Figure: a prior network over x1, ..., x9 plus an equivalent sample size give an imaginary count for any variable configuration, which yields parameter priors for any BN structure for X1…Xn ("parameter modularity").]

Page 48: Learning Bayesian Networks (From David Heckerman’s tutorial)

Combine user knowledge and data

[Figure: a prior network over x1, ..., x9 with an equivalent sample size, combined with data (a table of cases over x1, x2, x3, ...), is fed to the learner to produce improved network(s).]