Learning Bayesian Networks from Data


Page 1: Learning Bayesian Networks from Data


Learning Bayesian Networks from Data

Nir Friedman (Hebrew U.)   Daphne Koller (Stanford)

Page 2: Learning Bayesian Networks from Data

2

Overview: Introduction; Parameter Estimation; Model Selection; Structure Discovery; Incomplete Data; Learning from Structured Data

Page 3: Learning Bayesian Networks from Data

3

Bayesian Networks

Qualitative part: directed acyclic graph (DAG)
Nodes - random variables
Edges - direct influence

Quantitative part: Set of conditional probability distributions

[Figure: example network with nodes Earthquake, Burglary, Alarm, Radio, Call, and the conditional probability table P(A | E,B); Alarm together with its parents is the "family" of Alarm.]

Compact representation of probability distributions via conditional independence

Together: define a unique distribution in factored form

P(B,E,A,C,R) = P(B) P(E) P(A | B,E) P(R | E) P(C | A)

Page 4: Learning Bayesian Networks from Data

4

Example: "ICU Alarm" network. Domain: monitoring intensive-care patients. 37 variables, 509 parameters instead of ≈ 2^54.

[Figure: the 37-node ICU Alarm network, with nodes such as MINVOLSET, VENTMACH, INTUBATION, PULMEMBOLUS, SAO2, CATECHOL, HRBP, CO, CVP, PCWP, BP.]

Page 5: Learning Bayesian Networks from Data

5

Inference

Posterior probabilities: probability of any event given any evidence
Most likely explanation: scenario that explains the evidence
Rational decision making: maximize expected utility
Value of information
Effect of intervention

[Figure: queries on the Earthquake/Burglary/Alarm/Radio/Call network, with Radio and Call highlighted.]

Page 6: Learning Bayesian Networks from Data

6

Why learning? The knowledge acquisition bottleneck:
Knowledge acquisition is an expensive process
Often we don't have an expert

Data is cheap:
The amount of available information is growing rapidly
Learning allows us to construct models from raw data

Page 7: Learning Bayesian Networks from Data

7

Why Learn Bayesian Networks?

Conditional independencies & the graphical language capture the structure of many real-world distributions
Graph structure provides much insight into the domain; allows "knowledge discovery"
Learned model can be used for many tasks
Supports all the features of probabilistic learning: model selection criteria, dealing with missing data & hidden variables

Page 8: Learning Bayesian Networks from Data

8

Learning Bayesian networks

[Figure: data plus prior information are fed to a learner, which outputs a Bayesian network over E, R, B, A, C together with its CPTs, e.g., P(A | E,B).]

Page 9: Learning Bayesian Networks from Data

9

Known Structure, Complete Data

Network structure is specified; the inducer needs to estimate parameters
Data does not contain missing values

[Figure: complete records over E, B, A (e.g., <Y,N,N>, <Y,N,Y>, <N,N,Y>, ...) plus the known structure are fed to the learner, which fills in the CPT P(A | E,B).]

Page 10: Learning Bayesian Networks from Data

10

Unknown Structure, Complete Data

Network structure is not specified; the inducer needs to select arcs & estimate parameters
Data does not contain missing values

[Figure: complete records over E, B, A are fed to the learner, which outputs both the structure and the CPTs.]

Page 11: Learning Bayesian Networks from Data

11

Known Structure, Incomplete Data

Network structure is specified
Data contains missing values; need to consider assignments to the missing values

[Figure: records with missing entries (e.g., <Y,?,Y>, <N,Y,?>, <?,Y,Y>) plus the known structure are fed to the learner, which estimates the CPTs.]

Page 12: Learning Bayesian Networks from Data

12

Unknown Structure, Incomplete Data

Network structure is not specified
Data contains missing values; need to consider assignments to the missing values

[Figure: records with missing entries are fed to the learner, which outputs both the structure and the CPTs.]

Page 13: Learning Bayesian Networks from Data

13

Overview: Introduction; Parameter Estimation (likelihood function, Bayesian estimation); Model Selection; Structure Discovery; Incomplete Data; Learning from Structured Data

Page 14: Learning Bayesian Networks from Data

14

Learning Parameters

[Figure: network over E, B, A, C with E, B → A → C.]

Training data has the form:

D = { <E[1], B[1], A[1], C[1]>, ..., <E[M], B[M], A[M], C[M]> }

Page 15: Learning Bayesian Networks from Data

15

Likelihood Function

Assume i.i.d. samples; the likelihood function is

L(Θ : D) = ∏_m P(E[m], B[m], A[m], C[m] : Θ)

Page 16: Learning Bayesian Networks from Data

16

Likelihood Function

By definition of the network, we get

L(Θ : D) = ∏_m P(E[m], B[m], A[m], C[m] : Θ)
         = ∏_m P(E[m] : Θ) P(B[m] : Θ) P(A[m] | B[m], E[m] : Θ) P(C[m] | A[m] : Θ)

Page 17: Learning Bayesian Networks from Data

17

Likelihood Function

Rewriting terms, we get

L(Θ : D) = ∏_m P(E[m], B[m], A[m], C[m] : Θ)
         = [∏_m P(E[m] : Θ)] · [∏_m P(B[m] : Θ)] · [∏_m P(A[m] | B[m], E[m] : Θ)] · [∏_m P(C[m] | A[m] : Θ)]

Page 18: Learning Bayesian Networks from Data

18

General Bayesian Networks

Generalizing for any Bayesian network:

L(Θ : D) = ∏_m P(x_1[m], ..., x_n[m] : Θ)
         = ∏_i ∏_m P(x_i[m] | Pa_i[m] : Θ_i)
         = ∏_i L_i(Θ_i : D)

Decomposition ⇒ independent estimation problems
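A minimal Python sketch of this decomposition, assuming discrete data stored as a list of dicts and CPTs stored as dicts keyed by (child value, parent values); these names and layouts are illustrative assumptions, not the tutorial's code. Each family's log-likelihood is computed independently:

import math

def family_log_likelihood(data, child, parents, cpt):
    # Contribution of one family: sum over instances m of log P(child[m] | parents[m]).
    ll = 0.0
    for record in data:
        pa_vals = tuple(record[p] for p in parents)
        ll += math.log(cpt[(record[child], pa_vals)])
    return ll

def network_log_likelihood(data, structure, cpts):
    # Decomposed log-likelihood: one independent term per family (X_i, Pa_i).
    return sum(family_log_likelihood(data, child, parents, cpts[child])
               for child, parents in structure.items())

# Example structure for the E, B -> A -> C network of the previous slides:
structure = {"E": (), "B": (), "A": ("E", "B"), "C": ("A",)}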

Page 19: Learning Bayesian Networks from Data

19

Likelihood Function: Multinomials

L(θ : D) = P(D | θ) = ∏_m P(x[m] | θ)

The likelihood for the sequence H, T, T, H, H is

L(θ : D) = θ (1-θ)(1-θ) θ θ = θ^3 (1-θ)^2

[Figure: plot of L(θ : D) as a function of θ over [0,1].]

General case:

L(θ : D) = ∏_{k=1}^K θ_k^{N_k}

where θ_k is the probability of the k-th outcome and N_k is the count of the k-th outcome in D.

Page 20: Learning Bayesian Networks from Data

20

Bayesian Inference

Represent uncertainty about parameters using a probability distribution over parameters and data
Learning using Bayes rule:

P(θ | x[1], ..., x[M]) = P(x[1], ..., x[M] | θ) P(θ) / P(x[1], ..., x[M])

Posterior = Likelihood × Prior / Probability of the data

Page 21: Learning Bayesian Networks from Data

21

Bayesian Inference

Represent the Bayesian distribution as a Bayes net:
The values of X are independent given θ
P(x[m] | θ) = θ_{x[m]}
Bayesian prediction is inference in this network

[Figure: θ is a parent of the observed data nodes X[1], X[2], ..., X[m].]

Page 22: Learning Bayesian Networks from Data

22

Example: Binomial Data

Prior: uniform for θ in [0,1], so P(θ | D) ∝ the likelihood L(θ : D)
(N_H, N_T) = (4,1)

MLE for P(X = H) is 4/5 = 0.8
Bayesian prediction is

P(x[M+1] = H | D) = ∫ θ P(θ | D) dθ = 5/7 ≈ 0.7142

[Figure: posterior P(θ | x[1], ..., x[M]) ∝ P(x[1], ..., x[M] | θ) P(θ) plotted over θ in [0,1].]

Page 23: Learning Bayesian Networks from Data

23

Dirichlet Priors

Recall that the likelihood function is

L(θ : D) = ∏_{k=1}^K θ_k^{N_k}

A Dirichlet prior with hyperparameters α_1, ..., α_K:

P(θ) ∝ ∏_{k=1}^K θ_k^{α_k - 1}

Then the posterior has the same form, with hyperparameters α_1 + N_1, ..., α_K + N_K:

P(θ | D) ∝ P(θ) P(D | θ) ∝ ∏_{k=1}^K θ_k^{α_k - 1} ∏_{k=1}^K θ_k^{N_k} = ∏_{k=1}^K θ_k^{α_k + N_k - 1}

Page 24: Learning Bayesian Networks from Data

24

Dirichlet Priors - Example

[Figure: Dirichlet(α_heads, α_tails) densities P(θ_heads) over θ_heads in [0,1], for Dirichlet(0.5,0.5), Dirichlet(1,1), Dirichlet(2,2), and Dirichlet(5,5).]

Page 25: Learning Bayesian Networks from Data

25

Dirichlet Priors (cont.)

If P(θ) is Dirichlet with hyperparameters α_1, ..., α_K, then

P(X[1] = k) = ∫ θ_k P(θ) dθ = α_k / Σ_ℓ α_ℓ

Since the posterior is also Dirichlet, we get

P(X[M+1] = k | D) = ∫ θ_k P(θ | D) dθ = (α_k + N_k) / Σ_ℓ (α_ℓ + N_ℓ)
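A small Python sketch of this posterior prediction, assuming outcomes are indexed 0..K-1; the prediction is just counts plus pseudo-counts, normalized (illustration only, not code from the tutorial):

from collections import Counter

def dirichlet_predict(alpha, data):
    # Posterior predictive P(X[M+1] = k | D) under a Dirichlet(alpha) prior.
    # alpha: hyperparameters alpha_1..alpha_K (pseudo-counts)
    # data:  observed outcomes, each an index in 0..K-1
    counts = Counter(data)
    total = sum(alpha) + len(data)
    return [(alpha[k] + counts[k]) / total for k in range(len(alpha))]

# Coin example from the slides: uniform prior Dirichlet(1,1), data (N_H, N_T) = (4,1)
print(dirichlet_predict([1, 1], [0, 0, 0, 0, 1]))   # ~[0.714, 0.286], matching 5/7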

Page 26: Learning Bayesian Networks from Data

26

Bayesian Nets & Bayesian Prediction

Priors for each parameter group are independent
Data instances are independent given the unknown parameters

[Figure: plate notation with parameter nodes θ_X and θ_Y|X as parents of the observed data X[1..M], Y[1..M] and of the query instance X[M+1], Y[M+1].]

Page 27: Learning Bayesian Networks from Data

27

Bayesian Nets & Bayesian Prediction

We can also "read" from the network: given complete data, the posteriors on the parameters are independent
So we can compute the posterior over each parameter group separately!

[Figure: θ_X and θ_Y|X with the observed data X[1..M], Y[1..M].]

Page 28: Learning Bayesian Networks from Data

28

Learning Parameters: Summary

Estimation relies on sufficient statistics; for multinomials these are the counts N(x_i, pa_i)

Parameter estimation:

MLE:                  θ̂_{x_i|pa_i} = N(x_i, pa_i) / N(pa_i)
Bayesian (Dirichlet): θ̃_{x_i|pa_i} = (α(x_i, pa_i) + N(x_i, pa_i)) / (α(pa_i) + N(pa_i))

Both are asymptotically equivalent and consistent
Both can be implemented in an on-line manner by accumulating sufficient statistics
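A minimal Python sketch of both estimators from accumulated counts; the CPT layout (a dict keyed by parent configuration) and the uniform pseudo-count are assumptions made for illustration:

from collections import defaultdict

def estimate_cpt(data, child, parents, alpha=0.0):
    # alpha = 0  -> maximum likelihood estimate
    # alpha > 0  -> Bayesian estimate with a uniform Dirichlet prior
    #               (pseudo-count alpha for every (child value, parent config))
    counts = defaultdict(float)                  # sufficient statistics N(x_i, pa_i)
    child_vals, pa_configs = set(), set()
    for record in data:
        pa = tuple(record[p] for p in parents)
        counts[(record[child], pa)] += 1.0
        child_vals.add(record[child])
        pa_configs.add(pa)
    cpt = {}
    for pa in pa_configs:
        denom = sum(counts[(x, pa)] for x in child_vals) + alpha * len(child_vals)
        for x in child_vals:
            cpt[(x, pa)] = (counts[(x, pa)] + alpha) / denom
    return cpt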

Page 29: Learning Bayesian Networks from Data

29

Learning Parameters: Case Study

Instances sampled from the ICU Alarm network; M' is the strength of the prior

[Figure: KL divergence to the true distribution vs. number of instances (0-5000). Curves: MLE, and Bayesian estimation with M' = 5, 20, 50.]

Page 30: Learning Bayesian Networks from Data

30

Overview: Introduction; Parameter Learning; Model Selection (scoring function, structure search); Structure Discovery; Incomplete Data; Learning from Structured Data

Page 31: Learning Bayesian Networks from Data

31

Why Struggle for Accurate Structure?

Adding an arc: increases the number of parameters to be estimated; wrong assumptions about the domain structure
Missing an arc: cannot be compensated for by fitting parameters; wrong assumptions about the domain structure

[Figure: the true Earthquake / Alarm Set / Burglary / Sound network, one variant with an added arc, and one with a missing arc.]

Page 32: Learning Bayesian Networks from Data

32

Score-based Learning

Define a scoring function that evaluates how well a structure matches the data
Search for a structure that maximizes the score

[Figure: records over E, B, A (e.g., <Y,N,N>, <Y,Y,Y>, <N,N,Y>, ...) scored against candidate structures over E, B, A.]

Page 33: Learning Bayesian Networks from Data

33

Likelihood Score for Structure

score_L(G : D) = log L(G : D) = M Σ_i [ I(X_i ; Pa_i^G) - H(X_i) ]

where I(X_i ; Pa_i^G) is the mutual information between X_i and its parents.

Larger dependence of X_i on Pa_i ⇒ higher score
Adding arcs always helps: I(X; Y) ≤ I(X; {Y,Z})
Max score attained by a fully connected network ⇒ overfitting: a bad idea…

Page 34: Learning Bayesian Networks from Data

34

Bayesian Score

Likelihood score: L(G : D) = P(D | G, θ̂_G), evaluated at the maximum likelihood parameters θ̂_G

Bayesian approach: deal with uncertainty by assigning probability to all possibilities

P(G | D) = P(D | G) P(G) / P(D)

where the marginal likelihood averages the likelihood over the prior on parameters:

P(D | G) = ∫ P(D | G, θ) P(θ | G) dθ

Page 35: Learning Bayesian Networks from Data

35

Marginal Likelihood: Multinomials

Fortunately, in many cases the integral has a closed form:
P(θ) is Dirichlet with hyperparameters α_1, ..., α_K
D is a dataset with sufficient statistics N_1, ..., N_K

Then

P(D) = [ Γ(Σ_k α_k) / Γ(Σ_k α_k + Σ_k N_k) ] · ∏_k [ Γ(α_k + N_k) / Γ(α_k) ]
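This closed form is easy to evaluate in log space. A small Python sketch using the standard-library lgamma (an illustration under the slide's assumptions, not the tutorial's code):

from math import lgamma

def log_marginal_likelihood(alpha, counts):
    # log P(D) for multinomial data with a Dirichlet(alpha) prior.
    # alpha:  hyperparameters alpha_1..alpha_K;  counts: sufficient statistics N_1..N_K
    a, n = sum(alpha), sum(counts)
    log_p = lgamma(a) - lgamma(a + n)
    for a_k, n_k in zip(alpha, counts):
        log_p += lgamma(a_k + n_k) - lgamma(a_k)
    return log_p

# Coin sequence H,T,T,H,H with a uniform Dirichlet(1,1) prior:
print(log_marginal_likelihood([1, 1], [3, 2]))   # log(3! * 2! / 6!) = log(1/60)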

Page 36: Learning Bayesian Networks from Data

36

Marginal Likelihood: Bayesian Networks

The network structure determines the form of the marginal likelihood.

Network 1: X and Y independent ⇒ two Dirichlet marginal likelihoods: an integral over θ_X and an integral over θ_Y

[Figure: a data sequence of 7 (X,Y) heads/tails pairs; the X column and the Y column are each scored separately.]

Page 37: Learning Bayesian Networks from Data

37

Marginal Likelihood: Bayesian Networks

The network structure determines the form of the marginal likelihood.

Network 2: X → Y ⇒ three Dirichlet marginal likelihoods: an integral over θ_X, an integral over θ_Y|X=H, and an integral over θ_Y|X=T

[Figure: the same 7 (X,Y) pairs; the Y column is now split according to the value of X.]

Page 38: Learning Bayesian Networks from Data

38

Marginal Likelihood for Networks

The marginal likelihood has the form:

P(D | G) = ∏_i ∏_{pa_i^G} [ Γ(α(pa_i^G)) / Γ(α(pa_i^G) + N(pa_i^G)) · ∏_{x_i} Γ(α(x_i, pa_i^G) + N(x_i, pa_i^G)) / Γ(α(x_i, pa_i^G)) ]

i.e., a Dirichlet marginal likelihood for the multinomial P(X_i | pa_i), for each family and each parent configuration.
N(..) are counts from the data; α(..) are hyperparameters for each family, given G.
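A sketch of this per-family computation in Python, reusing the multinomial formula above; the data layout and the uniform hyperparameter are illustrative assumptions:

from math import lgamma
from collections import defaultdict

def family_log_marginal(data, child, parents, child_values, alpha=1.0):
    # log Dirichlet marginal likelihood for one family P(child | parents):
    # group the data by parent configuration and apply the closed-form
    # multinomial marginal likelihood to each group (uniform pseudo-count alpha).
    counts = defaultdict(lambda: defaultdict(int))    # counts[pa_config][child_value]
    for record in data:
        pa = tuple(record[p] for p in parents)
        counts[pa][record[child]] += 1
    log_p = 0.0
    for pa, child_counts in counts.items():
        n_pa = sum(child_counts.values())
        a_pa = alpha * len(child_values)
        log_p += lgamma(a_pa) - lgamma(a_pa + n_pa)
        for x in child_values:
            log_p += lgamma(alpha + child_counts[x]) - lgamma(alpha)
    return log_p

def log_marginal_likelihood_network(data, structure, domains, alpha=1.0):
    # Decomposable score: sum of family scores over all (child, parents) pairs.
    return sum(family_log_marginal(data, c, ps, domains[c], alpha)
               for c, ps in structure.items())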

Page 39: Learning Bayesian Networks from Data

39

Bayesian Score: Asymptotic Behavior

log P(D | G) = log L(G : D) - (log M / 2) dim(G) + O(1)
             = M Σ_i [ I(X_i ; Pa_i^G) - H(X_i) ] - (log M / 2) dim(G) + O(1)

The first term fits the dependencies in the empirical distribution; the second is a complexity penalty.

As M (the amount of data) grows:
Increasing pressure to fit the dependencies in the distribution
The complexity term avoids fitting noise

Asymptotically equivalent to the MDL score
The Bayesian score is consistent: the observed data eventually overrides the prior
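A tiny Python sketch of this asymptotic (BIC/MDL-style) approximation, assuming a log-likelihood value like the one sketched earlier and a count dim(G) of free parameters; the example numbers are hypothetical:

from math import log

def bic_score(log_likelihood, num_params, num_instances):
    # BIC-style approximation to log P(D | G): fit term minus (log M / 2) * dim(G).
    return log_likelihood - (log(num_instances) / 2.0) * num_params

# Hypothetical comparison of two candidate structures on the same data:
# bic_score(-8410.0, 509, 5000) vs. bic_score(-8600.0, 300, 5000)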

Page 40: Learning Bayesian Networks from Data

40

Structure Search as Optimization

Input: training data, scoring function, set of possible structures
Output: a network that maximizes the score

Key computational property, decomposability:
score(G) = Σ score( family of X in G )

Page 41: Learning Bayesian Networks from Data

41

Tree-Structured Networks

Trees: at most one parent per variable

Why trees?
Elegant math: we can solve the optimization problem
Sparse parameterization: avoids overfitting

[Figure: a tree-structured version of the ICU Alarm network.]

Page 42: Learning Bayesian Networks from Data

42

Learning Trees

Let p(i) denote the parent of X_i. We can write the Bayesian score as

Score(G : D) = Σ_i Score(X_i : Pa_i)
             = Σ_i [ Score(X_i : X_{p(i)}) - Score(X_i) ] + Σ_i Score(X_i)

Score = sum of edge scores + constant: the first sum is the improvement over the "empty" network, the second is the score of the "empty" network.

Page 43: Learning Bayesian Networks from Data

43

Learning Trees

Set w(j→i) = Score(X_j → X_i) - Score(X_i)
Find the tree (or forest) with maximal weight
Standard maximum spanning tree algorithm, O(n^2 log n)

Theorem: this procedure finds the tree with the maximum score
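A compact Python sketch of this procedure, assuming the edge weights w(j→i) have already been computed (treated as symmetric here, as with likelihood-based edge scores) and using Kruskal's algorithm with a union-find; an illustration only:

def max_weight_spanning_tree(n, weights):
    # n:       number of variables X_0..X_{n-1}
    # weights: dict {(i, j): w} with i < j, e.g. w = Score(X_j -> X_i) - Score(X_i)
    # Returns the edges of a maximum-weight spanning tree (or forest, if only
    # positive-weight edges are kept).
    parent = list(range(n))

    def find(x):                         # union-find with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for (i, j), w in sorted(weights.items(), key=lambda kv: -kv[1]):
        if w <= 0:                       # skip edges that don't improve on the empty network
            continue
        ri, rj = find(i), find(j)
        if ri != rj:                     # adding the edge keeps the graph acyclic
            parent[ri] = rj
            tree.append((i, j))
    return tree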

Page 44: Learning Bayesian Networks from Data

44

Beyond Trees

When we consider more complex networks, the problem is not as easy:
Suppose we allow at most two parents per node
A greedy algorithm is no longer guaranteed to find the optimal network
In fact, no efficient algorithm exists

Theorem: finding the maximal-scoring structure with at most k parents per node is NP-hard for k > 1

Page 45: Learning Bayesian Networks from Data

45

Heuristic Search

Define a search space:
search states are possible structures
operators make small changes to the structure

Traverse the space looking for high-scoring structures
Search techniques: greedy hill-climbing, best-first search, simulated annealing, ...

Page 46: Learning Bayesian Networks from Data

46

Local Search

Start with a given network: the empty network, the best tree, or a random network
At each iteration: evaluate all possible changes and apply a change based on the score
Stop when no modification improves the score

Page 47: Learning Bayesian Networks from Data

47

Heuristic Search

Typical operations on a structure over S, C, E, D: add an arc (e.g., add C→D), delete an arc (e.g., delete C→E), reverse an arc (e.g., reverse C→E)

When adding C→D: Δscore = S({C,E} → D) - S({E} → D)

To update the score after a local change, only re-score the families that changed
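A hedged Python sketch of greedy hill-climbing with add/delete operators, exploiting decomposability so that each move only re-scores the changed family; family_score stands in for any decomposable score (e.g., the family marginal likelihood sketched earlier), and the acyclicity check is kept naive for brevity:

def greedy_hill_climb(variables, data, family_score, max_iters=100):
    # parents maps each variable to a frozenset of parents;
    # family_score(data, child, parent_set) must be decomposable across families.
    parents = {x: frozenset() for x in variables}
    scores = {x: family_score(data, x, parents[x]) for x in variables}

    def creates_cycle(child, new_parent):
        # adding new_parent -> child creates a cycle iff child is an ancestor of new_parent
        stack, seen = [new_parent], set()
        while stack:
            node = stack.pop()
            if node == child:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(parents[node])
        return False

    for _ in range(max_iters):
        best = None                                    # (delta, child, new parent set)
        for child in variables:
            for other in variables:
                if other == child:
                    continue
                if other in parents[child]:            # candidate: delete arc other -> child
                    cand = parents[child] - {other}
                elif not creates_cycle(child, other):  # candidate: add arc other -> child
                    cand = parents[child] | {other}
                else:
                    continue
                delta = family_score(data, child, cand) - scores[child]
                if best is None or delta > best[0]:
                    best = (delta, child, cand)
        if best is None or best[0] <= 0:               # no move improves the score
            break
        _, child, cand = best
        parents[child] = cand
        scores[child] = family_score(data, child, cand)
    return parents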

Page 48: Learning Bayesian Networks from Data

48

Learning in Practice: Alarm domain

[Figure: KL divergence to the true distribution vs. number of samples (0-5000). Curves: structure known, fit parameters only; learn both structure & parameters.]

Page 49: Learning Bayesian Networks from Data

49

Local Search: Possible Pitfalls

Local search can get stuck in:
Local maxima: all one-edge changes reduce the score
Plateaux: some one-edge changes leave the score unchanged

Standard heuristics can escape both: random restarts, TABU search, simulated annealing

Page 50: Learning Bayesian Networks from Data

50

Improved Search: Weight Annealing

Standard annealing process:
take bad steps with probability ∝ exp(Δscore / t)
the probability increases with the temperature

Weight annealing:
take uphill steps relative to a perturbed score
the perturbation increases with the temperature

[Figure: Score(G | D) plotted against structures G.]

Page 51: Learning Bayesian Networks from Data

51

Perturbing the Score

Perturb the score by reweighting instances
Each weight is sampled from a distribution with mean 1 and variance proportional to the temperature
Instances are still sampled from the "original" distribution, but the perturbation changes the emphasis
Benefit: allows global moves in the search space

Page 52: Learning Bayesian Networks from Data

52

Weight Annealing: ICU Alarm network

Cumulative performance of 100 runs of annealed structure search

[Figure: score distributions for greedy hill-climbing, annealed search, and the true structure with learned parameters.]

Page 53: Learning Bayesian Networks from Data

53

Structure Search: Summary

Discrete optimization problem
In some cases the optimization problem is easy, e.g., learning trees
In general it is NP-hard, so we need to resort to heuristic search
In practice, search is relatively fast (~100 variables in ~2-5 minutes), thanks to decomposability and sufficient statistics
Adding randomness to the search is critical

Page 54: Learning Bayesian Networks from Data

54

Overview: Introduction; Parameter Estimation; Model Selection; Structure Discovery; Incomplete Data; Learning from Structured Data

Page 55: Learning Bayesian Networks from Data

55

Structure Discovery

Task: discover structural properties
Is there a direct connection between X & Y?
Does X separate two "subsystems"?
Does X causally affect Y?

Example: scientific data mining
Disease properties and symptoms
Interactions between the expression of genes

Page 56: Learning Bayesian Networks from Data

56

Discovering Structure

Current practice: model selection
Pick a single high-scoring model
Use that model to infer domain structure

[Figure: the posterior P(G | D) concentrated on one network over E, R, B, A, C.]

Page 57: Learning Bayesian Networks from Data

57

Discovering Structure

Problem:
With a small sample size there are many high-scoring models
An answer based on one model is often useless
We want features that are common to many models

[Figure: the posterior P(G | D) spread over several different networks over E, R, B, A, C.]

Page 58: Learning Bayesian Networks from Data

58

Bayesian Approach

Posterior distribution over structures
Estimate the probability of features: edge X→Y, path X→ … →Y, ...

P(f | D) = Σ_G f(G) P(G | D)

where f(G) is the indicator function for feature f (e.g., X→Y) and P(G | D) is the Bayesian score for G.

Page 59: Learning Bayesian Networks from Data

59

MCMC over Networks

We cannot enumerate structures, so sample structures

MCMC sampling:
Define a Markov chain over BNs
Run the chain to get samples from the posterior P(G | D)
Estimate feature probabilities from the samples:

P(f(G) | D) ≈ (1/n) Σ_{i=1}^n f(G_i)

Possible pitfalls:
Huge (super-exponential) number of networks
Time for the chain to converge to the posterior is unknown
Islands of high posterior, connected by low bridges
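The feature estimate itself is a simple Monte Carlo average. A small Python sketch, assuming sample_structures is some MCMC sampler over structures (hypothetical, not shown) and structures are represented as parent-set dicts:

def edge_feature(parent, child):
    # Indicator f(G) for the feature "there is an edge parent -> child".
    return lambda G: 1.0 if parent in G[child] else 0.0

def estimate_feature(samples, f):
    # Monte Carlo estimate: P(f(G) | D) ~= (1/n) sum_i f(G_i)
    return sum(f(G) for G in samples) / len(samples)

# Usage (hypothetical sampler): samples = sample_structures(data, n=1000)
# estimate_feature(samples, edge_feature("E", "A"))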

Page 60: Learning Bayesian Networks from Data

60

ICU Alarm BN: No Mixing

500 instances: the runs clearly do not mix

[Figure: score of the current sample vs. MCMC iteration (0-600,000), for chains started from the empty network and from the greedy solution; the two runs stay at different score levels (roughly -9400 to -8400).]

Page 61: Learning Bayesian Networks from Data

61

Effects of Non-Mixing

Two MCMC runs over the same 500 instances
Probability estimates for edges from the two runs are highly variable and non-robust

[Figure: scatter plots of edge-probability estimates, comparing runs started from the true BN against a random start.]

Page 62: Learning Bayesian Networks from Data

62

Fixed Ordering

Suppose that we know the ordering of the variables, say X_1 > X_2 > X_3 > X_4 > … > X_n, so the parents of X_i must come from X_1, ..., X_{i-1}, and we limit the number of parents per node to k. This leaves roughly 2^{k·n·log n} networks.

Intuition: the order decouples the choice of parents; the choice of Pa(X_7) does not restrict the choice of Pa(X_12)

Upshot: we can compute efficiently, in closed form,
the likelihood P(D | ≺)
the feature probability P(f | D, ≺)

Page 63: Learning Bayesian Networks from Data

63

Our Approach: Sample Orderings

We can write

P(f | D) = Σ_≺ P(f | ≺, D) P(≺ | D)

Sample orderings and approximate

P(f | D) ≈ (1/n) Σ_{i=1}^n P(f | ≺_i, D)

MCMC sampling:
Define a Markov chain over orderings
Run the chain to get samples from the posterior P(≺ | D)

Page 64: Learning Bayesian Networks from Data

64

Mixing with MCMC-Orderings

4 runs on ICU-Alarm with 500 instances
Fewer iterations than MCMC over networks, approximately the same amount of computation
The process appears to be mixing!

[Figure: score of the current sample vs. MCMC iteration (0-60,000) for chains started from random orderings and from the greedy solution; all runs converge to roughly the same score level (about -8450 to -8400).]

Page 65: Learning Bayesian Networks from Data

65

Mixing of MCMC runs

Two MCMC runs over the same instances
Probability estimates for edges are very robust, for both 500 and 1000 instances

[Figure: scatter plots of edge-probability estimates from the two runs, at 500 and at 1000 instances.]

Page 66: Learning Bayesian Networks from Data

66

Application: Gene Expression

Input: measurements of gene expression under different conditions
Thousands of genes
Hundreds of experiments

Output: Models of gene interaction

Uncover pathways

Page 67: Learning Bayesian Networks from Data

67

Map of Feature Confidence

Yeast data [Hughes et al 2000]

600 genes 300 experiments

Page 68: Learning Bayesian Networks from Data

68

"Mating response" Substructure

Automatically constructed sub-network of high-confidence edges
Almost exact reconstruction of the yeast mating pathway

[Figure: sub-network over genes including KAR4, AGA1, PRM1, TEC1, SST2, STE6, KSS1, NDJ1, FUS3, AGA2, YEL059W, TOM6, FIG1, YLR343W, YLR334C, MFA1, FUS1.]

Page 69: Learning Bayesian Networks from Data

69

Overview: Introduction; Parameter Estimation; Model Selection; Structure Discovery; Incomplete Data (parameter estimation, structure search); Learning from Structured Data

Page 70: Learning Bayesian Networks from Data

70

Incomplete Data

Data is often incomplete: some variables of interest are not assigned values

This phenomenon happens when we have
Missing values: some variables are unobserved in some instances
Hidden variables: some variables are never observed; we might not even know they exist

Page 71: Learning Bayesian Networks from Data

71

Hidden (Latent) Variables

Why should we care about unobserved variables?

[Figure: with a hidden variable H mediating between X1, X2, X3 and Y1, Y2, Y3 the model has 17 parameters; without it, connecting the Xs directly to the Ys requires 59 parameters.]

Page 72: Learning Bayesian Networks from Data

72

Example

X → Y, with P(X) assumed to be known
Likelihood function of θ_{Y|X=T} and θ_{Y|X=H}
Contour plots of the log likelihood for different numbers of missing values of X (M = 8)

[Figure: contour plots over (θ_{Y|X=H}, θ_{Y|X=T}) with no missing values, 2 missing values, and 3 missing values.]

In general: the likelihood function has multiple modes

Page 73: Learning Bayesian Networks from Data

73

Incomplete Data

In the presence of incomplete data, the likelihood can have multiple maxima

Example: a hidden variable H with H → Y
We can rename the values of the hidden variable H
If H has two values, the likelihood has two equivalent maxima

In practice, there are many local maxima

Page 74: Learning Bayesian Networks from Data

74

EM: MLE from Incomplete Data

Use the current point to construct a "nice" alternative function
The maximum of the new function scores ≥ the current point

[Figure: L(θ | D) with the alternative function constructed at the current point.]

Page 75: Learning Bayesian Networks from Data

75

Expectation Maximization (EM)

A general-purpose method for learning from incomplete data

Intuition:
If we had true counts, we could estimate parameters
But with missing values, counts are unknown
We "complete" the counts using probabilistic inference based on the current parameter assignment
We then use the completed counts as if they were real to re-estimate the parameters

Page 76: Learning Bayesian Networks from Data

76

Expectation Maximization (EM)

Current model: e.g., P(Y=H | X=T, θ) = 0.4 and P(Y=H | X=H, Z=T, θ) = 0.3

Data (with missing values):
X: H  T  H  H  T
Y: ?  ?  H  T  T
Z: T  ?  ?  T  H

Expected counts N(X,Y):
X=H, Y=H: 1.3
X=T, Y=H: 0.4
X=H, Y=T: 1.7
X=T, Y=T: 1.6

Page 77: Learning Bayesian Networks from Data

77

Expectation Maximization (EM)

[Figure: the EM loop. Start from an initial network (G, Θ0) over X1, X2, X3, H, Y1, Y2, Y3. E-step: compute expected counts N(X1), N(X2), N(X3), N(H, X1, X2, X3), N(Y1, H), N(Y2, H), N(Y3, H) from the training data. M-step: re-parameterize to obtain the updated network (G, Θ1). Reiterate.]

Page 78: Learning Bayesian Networks from Data

78

Expectation Maximization (EM)

Formal guarantees:
L(Θ1 : D) ≥ L(Θ0 : D): each iteration improves the likelihood
If Θ1 = Θ0, then Θ0 is a stationary point of L(Θ : D); usually, this means a local maximum

Page 79: Learning Bayesian Networks from Data

79

Expectation Maximization (EM)

Computational bottleneck: computing the expected counts in the E-step
Need to compute a posterior for each unobserved variable in each instance of the training set
All posteriors for an instance can be derived from one pass of standard BN inference
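A minimal EM sketch in Python for the X → Y example of slide 72 (P(X=H) = p_x known, some X values missing): the E-step "completes" the missing X entries by inference under the current parameters, the M-step re-estimates from the completed counts. The starting values and data layout are illustrative assumptions, not the tutorial's implementation:

def em_xy(data, p_x=0.5, theta_h=0.6, theta_t=0.4, iters=50):
    # data: list of (x, y) pairs with x in {'H', 'T', None} and y in {'H', 'T'}
    # Estimates theta_h = P(Y=H | X=H) and theta_t = P(Y=H | X=T).
    for _ in range(iters):
        # E-step: expected counts N(X, Y), completing missing X by inference
        n = {('H', 'H'): 0.0, ('H', 'T'): 0.0, ('T', 'H'): 0.0, ('T', 'T'): 0.0}
        for x, y in data:
            py_h = theta_h if y == 'H' else 1 - theta_h     # P(y | X=H)
            py_t = theta_t if y == 'H' else 1 - theta_t     # P(y | X=T)
            if x is None:
                q = p_x * py_h / (p_x * py_h + (1 - p_x) * py_t)   # P(X=H | y)
                n[('H', y)] += q
                n[('T', y)] += 1 - q
            else:
                n[(x, y)] += 1.0
        # M-step: re-estimate the parameters from the completed counts
        theta_h = n[('H', 'H')] / (n[('H', 'H')] + n[('H', 'T')])
        theta_t = n[('T', 'H')] / (n[('T', 'H')] + n[('T', 'T')])
    return theta_h, theta_t

# Example with M = 8 instances, three of them with X missing:
# em_xy([('H','H'), ('H','T'), ('T','T'), ('T','H'), (None,'H'), (None,'T'), ('H','H'), (None,'H')])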

Page 80: Learning Bayesian Networks from Data

80

Summary: Parameter Learning with Incomplete Data

Incomplete data makes parameter estimation hard
The likelihood function does not have a closed form and is multimodal
Finding maximum likelihood parameters: EM, gradient ascent
Both exploit inference procedures for Bayesian networks to compute expected sufficient statistics

Page 81: Learning Bayesian Networks from Data

81

Incomplete Data: Structure Scores

Recall the Bayesian score:

P(G | D) ∝ P(G) P(D | G) = P(G) ∫ P(D | G, θ) P(θ | G) dθ

With incomplete data:
We cannot evaluate the marginal likelihood in closed form
We have to resort to approximations: evaluate the score around the MAP parameters, which requires finding the MAP parameters (e.g., by EM)

Page 82: Learning Bayesian Networks from Data

82

Naive Approach

Perform EM (parametric optimization) for each candidate graph G1, G2, G3, ..., Gn

Computationally expensive:
Parameter optimization via EM is non-trivial
Need to perform EM for all candidate structures
We spend time even on poor candidates
In practice, this considers only a few candidates

Page 83: Learning Bayesian Networks from Data

83

Structural EM

Recall that with complete data we had decomposition ⇒ efficient search

Idea: instead of optimizing the real score,
find a decomposable alternative score
such that maximizing the new score ⇒ improvement in the real score

Page 84: Learning Bayesian Networks from Data

84

Structural EM

Idea: use the current model to help evaluate new structures

Outline:
Perform search in (structure, parameters) space
At each iteration, use the current model to find either
better-scoring parameters (a "parametric" EM step), or
a better-scoring structure (a "structural" EM step)

Page 85: Learning Bayesian Networks from Data

85

[Figure: the Structural EM loop. From the current network over X1, X2, X3, H, Y1, Y2, Y3, compute expected counts from the training data, both for the current families (N(X1), N(X2), N(X3), N(H, X1, X2, X3), N(Y1, H), N(Y2, H), N(Y3, H)) and for candidate families (e.g., N(X2, X1), N(H, X1, X3), N(Y1, X2), N(Y2, Y1, H)); score & parameterize candidate structures; reiterate.]

Page 86: Learning Bayesian Networks from Data

86

Example: Phylogenetic Reconstruction

Input: biological sequences
Human  CGTTGC…
Chimp  CCTAGG…
Orang  CGAACG…
…

Output: a phylogeny, an "instance" of the evolutionary process (over, say, 10 billion years)
Assumption: positions are independent

[Figure: a phylogenetic tree with present-day species at the leaves.]

Page 87: Learning Bayesian Networks from Data

87

Phylogenetic Model

Topology: bifurcating
Observed species: 1…N; ancestral species: N+1…2N-2
Lengths t = {t_{i,j}} for each branch (i,j)
Evolutionary model: P(A changes to T | 10 billion yrs)

[Figure: a bifurcating tree with leaves 1-7, internal nodes 8-12, and a branch (8,9).]

Page 88: Learning Bayesian Networks from Data

88

Phylogenetic Tree as a Bayes Net

Variables: the letter at each position, for each species
Current-day species: observed; ancestral species: hidden
BN structure: the tree topology
BN parameters: branch lengths (time spans)

Main problem: learn the topology
If the ancestral species were observed, this would be an easy learning problem (learning trees)

Page 89: Learning Bayesian Networks from Data

89

Algorithm Outline

Start with an original tree (T0, t0)
Compute expected pairwise statistics
Weights: branch scores

Page 90: Learning Bayesian Networks from Data

90

Algorithm Outline

Compute expected pairwise statistics
Weights: branch scores (pairwise weights)
Find T' = argmax_T Σ_{(i,j)∈T} w_{i,j}

O(N^2) pairwise statistics suffice to evaluate all trees

Page 91: Learning Bayesian Networks from Data

91

Algorithm Outline

Compute expected pairwise statistics
Weights: branch scores
Find T' = argmax_T Σ_{(i,j)∈T} w_{i,j} (maximum spanning tree)
Construct the bifurcation T1

Page 92: Learning Bayesian Networks from Data

92

Algorithm Outline

Compute expected pairwise statistics
Weights: branch scores
Find T' = argmax_T Σ_{(i,j)∈T} w_{i,j}
Construct the bifurcation T1 (the new tree)

Theorem: L(T1, t1) ≥ L(T0, t0)

Repeat until convergence…

Page 93: Learning Bayesian Networks from Data

93

Real Life Data

                           Lysozyme c    Mitochondrial genomes
# sequences                43            34
# positions                122           3,578
Log-likelihood:
  Traditional approach     -2,916.2      -74,227.9
  Structural EM approach   -2,892.1      -70,533.5
Difference per position    0.19          1.03

On the mitochondrial data, each position is about twice as likely under the Structural EM model.

Page 94: Learning Bayesian Networks from Data

94

Overview: Introduction; Parameter Estimation; Model Selection; Structure Discovery; Incomplete Data; Learning from Structured Data

Page 95: Learning Bayesian Networks from Data

95

Bayesian Networks: Problem

Bayesian nets use a propositional representation
The real world has objects, related to each other

[Figure: generic attributes Intelligence, Difficulty, Grade.]

Page 96: Learning Bayesian Networks from Data

96

Bayesian Networks: Problem

Bayesian nets use a propositional representation
The real world has objects, related to each other
The propositional version needs a separate variable per object combination, e.g., Intell_JDoe, Diffic_CS101, Grade_JDoe_CS101; Intell_FGump, Diffic_Geo101, Grade_FGump_Geo101; Intell_FGump, Diffic_CS101, Grade_FGump_CS101

These "instances" are not independent!

Page 97: Learning Bayesian Networks from Data

97

St. Nordaf University

[Figure: an example relational skeleton. Professors (each with a Teaching-ability) teach the courses Welcome to CS101 and Welcome to Geo101 (each with a Difficulty); students Forrest Gump and Jane Doe (each with an Intelligence) are registered in courses; each registration has a Grade and a Satisfaction.]

Page 98: Learning Bayesian Networks from Data

98

Relational Schema

Specifies the types of objects in the domain, the attributes of each type of object, and the types of links between objects.

Classes and attributes:
Professor: Teaching-Ability
Student: Intelligence
Course: Difficulty
Registration: Grade, Satisfaction

Links: Teach, In, Take

Page 99: Learning Bayesian Networks from Data

99

Representing the Distribution

There are many possible worlds for a given university: all possible assignments of all attributes of all objects
There are infinitely many potential universities, each associated with a very different set of worlds
We need to represent an infinite set of complex distributions

Page 100: Learning Bayesian Networks from Data

100

Possible Worlds

World: an assignment to all attributes of all objects in the domain

[Figure: two possible worlds for the St. Nordaf example, each assigning values to every attribute: teaching abilities (High/Low), course difficulties (Easy/Hard), student intelligences (Smart/Weak), grades (A/B/C), and satisfactions (Like/Hate).]

Page 101: Learning Bayesian Networks from Data

101

Probabilistic Relational Models

Key ideas:
Universals: probabilistic patterns hold for all objects in a class
Locality: represent direct probabilistic dependencies; links give us the potential interactions!

[Figure: PRM over the schema: Professor.Teaching-Ability, Student.Intelligence, Course.Difficulty, Reg.Grade, Reg.Satisfaction, with a CPD for Grade given (Difficulty, Intelligence) shown as grade distributions for easy/weak, easy/smart, hard/weak, hard/smart.]

Page 102: Learning Bayesian Networks from Data

102

PRM Semantics

Instantiated PRM ⇒ BN
variables: the attributes of all objects
dependencies: determined by the links & the PRM
the CPD parameters θ_{Grade | Intell, Diffic} are shared across all registrations

[Figure: the ground Bayesian network for the St. Nordaf example, over the professors' teaching abilities, course difficulties, student intelligences, grades, and satisfactions.]

Page 103: Learning Bayesian Networks from Data

103

The Web of Influence

Objects are all correlated
Need to perform inference over the entire model
For large databases, use approximate inference, e.g., loopy belief propagation

[Figure: how evidence about one course (easy/hard) or one student (weak/smart) shifts the posteriors of other, linked objects in CS101 and Geo101.]

Page 104: Learning Bayesian Networks from Data

104

PRM Learning: Complete Data

The entire database is a single "instance"
Parameters are used many times in the instance
Introduce a prior over parameters; update the prior with sufficient statistics, e.g.
Count(Reg.Grade = A, Reg.Course.Diff = lo, Reg.Student.Intel = hi)

[Figure: the fully observed St. Nordaf database, with the shared CPD θ_{Grade | Intell, Diffic}.]

Page 105: Learning Bayesian Networks from Data

105

PRM Learning: Incomplete Data

Use expected sufficient statistics
But everything is correlated, so the E-step uses (approximate) inference over the entire model

[Figure: the St. Nordaf database with many unobserved attributes (marked '???').]

Page 106: Learning Bayesian Networks from Data

106

A Web of Data

[Craven et al.]: the WebKB domain
Example objects and links: Tom Mitchell (Professor), the WebKB Project, Sean Slattery (Student), a CMU CS Faculty page; links such as Advisee-of, Project-of, Works-on, Contains

Page 107: Learning Bayesian Networks from Data

107

Standard Approach

Naive Bayes: a page's Category is the parent of the words Word1 ... WordN on the page (e.g., "professor", "department", "extract", "information", "computer", "science", "machine", "learning")

[Figure: classification accuracy of the standard approach, on an axis from about 0.52 to 0.68.]

Page 108: Learning Bayesian Networks from Data

108

What's in a Link

Model both pages and links: each page has a Category and words Word1 ... WordN, and the existence of a link (LinkExists) depends on the categories of the From-page and the To-page

[Figure: classification accuracy improves when links are modeled, on the same 0.52-0.68 axis.]

Page 109: Learning Bayesian Networks from Data

109

Discovering Hidden Concepts

Internet Movie Database, http://www.imdb.com

Page 110: Learning Bayesian Networks from Data

110

Discovering Hidden Concepts

Internet Movie Database, http://www.imdb.com

[Figure: relational model with hidden Type attributes for Actor, Director, and Movie; observed attributes include Gender, Genre, Rating, Year, #Votes, MPAA Rating, Credit-Order, and the Appeared relation.]

Page 111: Learning Bayesian Networks from Data

111

Web of Influence, Yet Again

Movies (two discovered groups):
Terminator 2, Batman, Batman Forever, Mission: Impossible, GoldenEye, Starship Troopers, Hunt for Red October
Wizard of Oz, Cinderella, Sound of Music, The Love Bug, Pollyanna, The Parent Trap, Mary Poppins, Swiss Family Robinson

Actors (two discovered groups):
Anthony Hopkins, Robert De Niro, Tommy Lee Jones, Harvey Keitel, Morgan Freeman, Gary Oldman
Sylvester Stallone, Bruce Willis, Harrison Ford, Steven Seagal, Kurt Russell, Kevin Costner, Jean-Claude Van Damme, Arnold Schwarzenegger

Directors (two discovered groups):
Steven Spielberg, Tim Burton, Tony Scott, James Cameron, John McTiernan, Joel Schumacher
Alfred Hitchcock, Stanley Kubrick, David Lean, Milos Forman, Terry Gilliam, Francis Coppola

Page 112: Learning Bayesian Networks from Data

112

Conclusion

Many distributions have combinatorial dependency structure
Utilizing this structure is good
Discovering this structure has implications for density estimation and for knowledge discovery
Many applications: medicine, biology, the Web

Page 113: Learning Bayesian Networks from Data

113

The END

Thanks to: Gal Elidan, Lise Getoor, Moises Goldszmidt, Matan Ninio, Dana Pe'er, Eran Segal, Ben Taskar

Slides will be available from:
http://robotics.stanford.edu/~koller/
http://www.cs.huji.ac.il/~nir/