
Non-Informative Dirichlet Score for learning Bayesian networks

Maomi Ueno and Masaki Uto
University of Electro-Communications, Japan


BDe (Bayesian Dirichlet equivalence): Heckerman, Geiger and Chickering (1995)


Non-informative prior : BDeu

The metric that Buntine (1991) described is considered to be a special case of the BDe metric. Heckerman et al. (1995) called this special case “BDeu”. Its hyperparameters are

$$\alpha_{ijk} = \frac{\alpha}{r_i q_i}.$$
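As a concrete illustration (not part of the slides), the BDeu local marginal likelihood for one variable can be sketched as follows; the function name and the toy counts are ours, with hyperparameters set as above.

```python
from math import lgamma

def bdeu_local_score(counts, ess):
    """Log BDeu local score for one variable.
    counts[j][k] = N_ijk, the count of state k under parent configuration j.
    Hyperparameters: alpha_ijk = ess / (q * r), alpha_ij = ess / q."""
    q = len(counts)      # number of parent configurations q_i
    r = len(counts[0])   # number of states r_i
    a_ij = ess / q
    a_ijk = ess / (q * r)
    score = 0.0
    for j in range(q):
        n_ij = sum(counts[j])
        score += lgamma(a_ij) - lgamma(a_ij + n_ij)
        for k in range(r):
            score += lgamma(a_ijk + counts[j][k]) - lgamma(a_ijk)
    return score
```

With no data the score is exactly zero (the marginal likelihood of an empty sample is 1), and any real counts give a negative log score.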


Problems of BDeu

BDeu is the most popular Bayesian network score.

However, recent reports have described that learning Bayesian networks is highly sensitive to the chosen equivalent sample size (ESS) in the Bayesian Dirichlet equivalence uniform (BDeu) score, and that this often produces unstable or undesirable results (Steck and Jaakkola, 2002; Silander, Kontkanen and Myllymaki, 2007).


Theorem 1 (Ueno, 2010). When α is sufficiently large, the log-ML is approximated asymptotically as

$$\log p(X \mid g^h, g, \alpha) \approx \alpha\bigl(H(g^h, g) - H(g^h, g, X)\bigr) - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{q_i}\sum_{k=1}^{r_i}\log\Bigl(1 + \frac{N_{ijk}}{\alpha_{ijk}}\Bigr) + O(1).$$

Here the first term plays the role of a log-posterior and reflects the difference between the prior (the hypothetical structure g^h) and the data, while the triple summation is a penalty term that scales with the number of parameters.

As the two structures become equivalent, the penalty term is minimized with the fixed ESS. Conversely, as the two structures become different, the term increases. Moreover, α determines the magnitude of the user‘s prior belief for a hypothetical structure. Thus, the mechanism of BDe makes sense to us when we have approximate prior knowledge.


BDeu is not non-informative!!

Ueno (2011) showed that the prior of BDeu does not represent ignorance of prior knowledge, but rather a user’s prior belief in the uniformity of the conditional distribution. The results further imply that the optimal ESS becomes large/small when the empirical conditional distribution becomes uniform/skewed. This is the main factor behind the sensitivity of BDeu to the ESS.


Marginalization of ESS α

• Silander, Kontkanen and Myllymaki (2007) proposed a learning method that marginalizes out the ESS of BDeu. They averaged BDeu values over ESS values increasing from 1 to 100 in steps of 1.0.

• Cano et al. (2011) proposed averaging over a wide range of different chosen ESSs: ESS > 1.0, ESS >> 1.0, ESS < 1.0, ESS << 1.0.
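The averaging idea can be sketched as follows (our own illustration; `log_score` is a stand-in for log-BDeu as a function of the ESS). Averaging the marginal likelihoods over an ESS grid with uniform weights is done stably in log space:

```python
import math

def average_over_ess(log_score, ess_grid):
    """Average the (non-log) scores over the ESS grid with uniform
    weights; computed via log-sum-exp and returned as a log value."""
    logs = [log_score(ess) for ess in ess_grid]
    m = max(logs)
    return m + math.log(sum(math.exp(l - m) for l in logs) / len(logs))

# Toy stand-in for log-BDeu as a function of ESS (peaks at ESS = 10).
toy_log_bdeu = lambda ess: -((ess - 10.0) ** 2) / 50.0
averaged = average_over_ess(toy_log_bdeu, [float(e) for e in range(1, 101)])
```

If the score does not depend on the ESS, the average reproduces it exactly; otherwise the average is dominated by the ESS values near the peak.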


Why do we use the Bayesian approach?

The previous works eliminate the ESS. However, one of the most important features of the Bayesian approach is using prior knowledge to augment the data.

→When the score has a non-informative prior, the ESS is expected to work effectively as actual pseudo-samples to augment the data.


Main idea

To obtain a non-informative prior, we assume all possible hypothetical structures as the prior belief of BDe, because a problem of BDeu is that it assumes only a uniform distribution as its prior belief.


NIP-BDe

Definition 1. NIP-BDe (Non-Informative Prior Bayesian Dirichlet equivalence) is defined as

$$p(X \mid g, \alpha) = \sum_{g^h \in G} p(g^h)\, p(X \mid g^h, g, \alpha).$$
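Definition 1 is a mixture over hypothetical structures. A minimal sketch (our notation: `log_bde` maps each hypothetical structure to log p(X | g^h, g, α), and `prior` gives p(g^h)):

```python
import math

def nip_bde_log(log_bde, prior):
    """log p(X | g, alpha) = log sum over g^h of p(g^h) p(X | g^h, g, alpha),
    computed with log-sum-exp for numerical stability."""
    terms = [math.log(p) + log_bde[gh] for gh, p in prior.items()]
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))

# Two hypothetical structures with a uniform prior over them.
log_bde = {"g1": -5.0, "g2": -7.0}
prior = {"g1": 0.5, "g2": 0.5}
score = nip_bde_log(log_bde, prior)
```

The mixture score always lies between the smallest and largest component scores, since it is a weighted average of the (non-log) probabilities.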


Calculation method

Our method has the following steps:

1. Estimate the conditional probability parameters set given g^h from the data.
2. Estimate the joint probability.

First, we estimate the conditional probability parameters set given g^h as

$$\hat\Theta_{g^h} = \{\hat\theta^{g^h}_{ijk}\} \quad (i = 1, \dots, N;\ j = 1, \dots, q_i;\ k = 1, \dots, r_i),$$

$$\hat\theta^{g^h}_{ijk} = \frac{n^{g^h}_{ijk} + \alpha/(r_i q_i)}{n^{g^h}_{ij} + \alpha/q_i}.$$

Next, we calculate the estimated joint probability as shown below.

$$\hat p(x_i = k, \Pi^g_i = j \mid g^h, \alpha) = \sum_{x \setminus \{x_i, \Pi^g_i\}} \hat p(x_1, \dots, x_N \mid g^h, \hat\Theta_{g^h}).$$

Expected log-BDe

In practice, however, NIP-BDe is difficult to calculate because the product of multiple probabilities suffers from serious computational problems. To avoid this, we propose an alternative method, Expected log-BDe, as described below.

$$E\bigl[\log p(X \mid g^h, g, \alpha)\bigr] = \sum_{g^h \in G} p(g^h)\, \log p(X \mid g^h, g, \alpha).$$

Advantage

The optimal ESS of BDeu becomes large/small when the empirical conditional distribution becomes uniform/skewed, because its hypothetical structure assumes a uniform conditional probability distribution and the ESS adjusts the magnitude of the user’s belief in that hypothetical structure.

However, the ESS of a fully non-informative prior is expected to work effectively as actual pseudo-samples to augment the data, especially when the sample size is small, regardless of the uniformity of the empirical distribution.


Learning algorithm using dynamic programming

We employ the learning algorithm of Silander and Myllymaki (2006) using dynamic programming. Our algorithm comprises four steps:

1. Compute the local Expected log-BDe scores for all N 2^{N-1} possible (x_i, Π^g_i) pairs.
2. For each variable x_i, find the best parent set in the parent candidate set {x_1, ..., x_N} \ {x_i}.
3. Find the best sink for all 2^N subsets of the variables, and the best ordering.
4. Find the optimal Bayesian network.

Only Step 1 is different from the procedure described by Silander and Myllymaki (2006). Algorithm 1 gives pseudo-code for computing the local Expected log-BDe scores.
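The four steps can be sketched as a bitmask dynamic program in the style of Silander and Myllymaki (2006) (our simplified reconstruction, returning only the optimal total score; `local_score` stands in for the local Expected log-BDe):

```python
def best_network(n, local_score):
    """Dynamic program over variable subsets.
    local_score(i, parents_mask) -> local score of variable i with the
    given parent set (a bitmask). Returns the score of the optimal DAG."""
    full = 1 << n
    # best_ps[i][c]: best score for variable i with parents drawn from mask c
    best_ps = [[float("-inf")] * full for _ in range(n)]
    for i in range(n):
        for c in range(full):
            if c & (1 << i):
                continue
            best = local_score(i, c)
            for b in range(n):          # or restrict to a smaller candidate set
                if c & (1 << b):
                    best = max(best, best_ps[i][c ^ (1 << b)])
            best_ps[i][c] = best
    # best_net[s]: best score of a network over the variable subset s
    best_net = [0.0] * full
    for s in range(1, full):
        best_net[s] = float("-inf")
        for i in range(n):              # choose the sink of subset s
            if s & (1 << i):
                rest = s ^ (1 << i)
                best_net[s] = max(best_net[s],
                                  best_net[rest] + best_ps[i][rest])
    return best_net[full - 1]
```

Replacing `local_score` with BDeu, BIC, or the local Expected log-BDe changes only Step 1, exactly as the slide notes.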


Algorithm


Computational costs

• The time complexity of Algorithm 1 is O(N^2 2^{2(N-1)} exp(w)) given an elimination order of width w. The traditional scores (BDeu, AIC, BIC, and so on) run in O(N^2 2^{N-1}).

• However, the required memory for the proposed computation is equal to that of the traditional scores.


Experiments


Conclusions

This paper presented a proposal for a non-informative Dirichlet score obtained by marginalizing over the possible hypothetical structures as the user’s prior belief.

The results suggest that the proposed method is effective, especially for small sample sizes.
