
Non-Informative Dirichlet Score for learning Bayesian networks

Maomi Ueno and Masaki Uto
University of Electro-Communications, Japan


BDe (Bayesian Dirichlet equivalence): Heckerman, Geiger and Chickering (1995)


Non-informative prior : BDeu

The metric that Buntine (1991) described is considered to be a special case of the BDe metric. Heckerman et al. (1995) called this special case “BDeu”. Its hyperparameters are

$$\alpha_{ijk} = \frac{\alpha}{r_i q_i}.$$
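As a concrete illustration (not part of the slides), the BDeu local marginal likelihood for one variable can be sketched as follows; the function name and the toy counts are ours, with hyperparameters set as above.

```python
from math import lgamma

def bdeu_local_score(counts, ess):
    """Log BDeu local score for one variable.
    counts[j][k] = N_ijk, the count of state k under parent configuration j.
    Hyperparameters: alpha_ijk = ess / (q * r), alpha_ij = ess / q."""
    q = len(counts)      # number of parent configurations q_i
    r = len(counts[0])   # number of states r_i
    a_ij = ess / q
    a_ijk = ess / (q * r)
    score = 0.0
    for j in range(q):
        n_ij = sum(counts[j])
        score += lgamma(a_ij) - lgamma(a_ij + n_ij)
        for k in range(r):
            score += lgamma(a_ijk + counts[j][k]) - lgamma(a_ijk)
    return score
```

With no data the score is exactly zero (the marginal likelihood of an empty sample is 1), and any real counts give a negative log score.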


Problems of BDeu

BDeu is the most popular Bayesian network score.

However, recent reports have described that learning Bayesian networks is highly sensitive to the chosen equivalent sample size (ESS) in the Bayesian Dirichlet equivalence uniform (BDeu) score, and that this often produces unstable or undesirable results (Steck and Jaakkola, 2002; Silander, Kontkanen and Myllymaki, 2007).


Theorem 1 (Ueno, 2010). When α is sufficiently large, the log-ML is approximated asymptotically as

$$\log p(X \mid g^h, g, \alpha) \approx \alpha\bigl(H(g^h, g) - H(g^h, g, X)\bigr) - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{q_i}\sum_{k=1}^{r_i}\log\Bigl(1 + \frac{N_{ijk}}{\alpha_{ijk}}\Bigr) + O(1).$$

Here the first term plays the role of a log-posterior and reflects the difference between the prior (the hypothetical structure g^h) and the data, while the triple summation is a penalty term that scales with the number of parameters.

As the two structures become equivalent, the penalty term is minimized with the fixed ESS. Conversely, as the two structures become different, the term increases. Moreover, α determines the magnitude of the user‘s prior belief for a hypothetical structure. Thus, the mechanism of BDe makes sense to us when we have approximate prior knowledge.


BDeu is not non-informative!!

Ueno (2011) showed that the prior of BDeu does not represent ignorance of prior knowledge, but rather a user’s prior belief in the uniformity of the conditional distribution. The results further imply that the optimal ESS becomes large/small when the empirical conditional distribution becomes uniform/skewed. This is the main factor behind the sensitivity of BDeu to the ESS.


Marginalization of ESS α

• Silander, Kontkanen and Myllymaki (2007) proposed a learning method that marginalizes out the ESS of BDeu. They averaged BDeu values over ESS values increasing from 1 to 100 in steps of 1.0.

• Cano et al. (2011) proposed averaging over a wide range of different chosen ESSs: ESS > 1.0, ESS >> 1.0, ESS < 1.0, ESS << 1.0.
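The averaging idea can be sketched as follows (our own illustration; `log_score` is a stand-in for log-BDeu as a function of the ESS). Averaging the marginal likelihoods over an ESS grid with uniform weights is done stably in log space:

```python
import math

def average_over_ess(log_score, ess_grid):
    """Average the (non-log) scores over the ESS grid with uniform
    weights; computed via log-sum-exp and returned as a log value."""
    logs = [log_score(ess) for ess in ess_grid]
    m = max(logs)
    return m + math.log(sum(math.exp(l - m) for l in logs) / len(logs))

# Toy stand-in for log-BDeu as a function of ESS (peaks at ESS = 10).
toy_log_bdeu = lambda ess: -((ess - 10.0) ** 2) / 50.0
averaged = average_over_ess(toy_log_bdeu, [float(e) for e in range(1, 101)])
```

If the score does not depend on the ESS, the average reproduces it exactly; otherwise the average is dominated by the ESS values near the peak.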


Why do we use the Bayesian approach?

The previous works eliminate the ESS. However, one of the most important features of the Bayesian approach is using prior knowledge to augment the data.

→When the score has a non-informative prior, the ESS is expected to work effectively as actual pseudo-samples to augment the data.


Main idea

To obtain a non-informative prior, we assume all possible hypothetical structures as the prior belief of BDe, because a problem of BDeu is that it assumes only a uniform distribution as its prior belief.


NIP-BDe

Definition 1. NIP-BDe (Non-Informative Prior Bayesian Dirichlet equivalence) is defined as

$$p(X \mid g, \alpha) = \sum_{g^h \in G} p(g^h)\, p(X \mid g^h, g, \alpha).$$
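Definition 1 is a mixture over hypothetical structures. A minimal sketch (our notation: `log_bde` maps each hypothetical structure to log p(X | g^h, g, α), and `prior` gives p(g^h)):

```python
import math

def nip_bde_log(log_bde, prior):
    """log p(X | g, alpha) = log sum over g^h of p(g^h) p(X | g^h, g, alpha),
    computed with log-sum-exp for numerical stability."""
    terms = [math.log(p) + log_bde[gh] for gh, p in prior.items()]
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))

# Two hypothetical structures with a uniform prior over them.
log_bde = {"g1": -5.0, "g2": -7.0}
prior = {"g1": 0.5, "g2": 0.5}
score = nip_bde_log(log_bde, prior)
```

The mixture score always lies between the smallest and largest component scores, since it is a weighted average of the (non-log) probabilities.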


Calculation method

Our method has the following steps:

1. Estimate the conditional probability parameters set given g^h from the data.
2. Estimate the joint probability.

First, we estimate the conditional probability parameters set given g^h as

$$\hat\Theta_{g^h} = \{\hat\theta^{g^h}_{ijk}\} \quad (i = 1, \dots, N;\ j = 1, \dots, q_i;\ k = 1, \dots, r_i),$$

$$\hat\theta^{g^h}_{ijk} = \frac{n^{g^h}_{ijk} + \alpha/(r_i q_i)}{n^{g^h}_{ij} + \alpha/q_i}.$$

Next, we calculate the estimated joint probability as shown below.

$$\hat p(x_i = k, \Pi^g_i = j \mid g^h, \alpha) = \sum_{x \setminus \{x_i, \Pi^g_i\}} \hat p(x_1, \dots, x_N \mid g^h, \hat\Theta_{g^h}).$$

Expected log-BDe

In practice, however, NIP-BDe is difficult to calculate because the product of multiple probabilities suffers from serious computational problems. To avoid this, we propose an alternative method, Expected log-BDe, as described below.

$$E\bigl[\log p(X \mid g^h, g, \alpha)\bigr] = \sum_{g^h \in G} p(g^h)\, \log p(X \mid g^h, g, \alpha).$$

Advantage

The optimal ESS of BDeu becomes large/small when the empirical conditional distribution becomes uniform/skewed, because its hypothetical structure assumes a uniform conditional probability distribution and the ESS adjusts the magnitude of the user’s belief in that hypothetical structure.

However, the ESS of a fully non-informative prior is expected to work effectively as actual pseudo-samples to augment the data, especially when the sample size is small, regardless of the uniformity of the empirical distribution.


Learning algorithm using dynamic programming

We employ the learning algorithm of Silander and Myllymaki (2006) using dynamic programming. Our algorithm comprises four steps:

1. Compute the local Expected log-BDe scores for all N 2^{N-1} possible (x_i, Π^g_i) pairs.
2. For each variable x_i, find the best parent set in the parent candidate set {x_1, ..., x_N} \ {x_i}.
3. Find the best sink for all 2^N subsets of the variables, and the best ordering.
4. Find the optimal Bayesian network.

Only Step 1 is different from the procedure described by Silander and Myllymaki (2006). Algorithm 1 gives pseudo-code for computing the local Expected log-BDe scores.
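The four steps can be sketched as a bitmask dynamic program in the style of Silander and Myllymaki (2006) (our simplified reconstruction, returning only the optimal total score; `local_score` stands in for the local Expected log-BDe):

```python
def best_network(n, local_score):
    """Dynamic program over variable subsets.
    local_score(i, parents_mask) -> local score of variable i with the
    given parent set (a bitmask). Returns the score of the optimal DAG."""
    full = 1 << n
    # best_ps[i][c]: best score for variable i with parents drawn from mask c
    best_ps = [[float("-inf")] * full for _ in range(n)]
    for i in range(n):
        for c in range(full):
            if c & (1 << i):
                continue
            best = local_score(i, c)
            for b in range(n):          # or restrict to a smaller candidate set
                if c & (1 << b):
                    best = max(best, best_ps[i][c ^ (1 << b)])
            best_ps[i][c] = best
    # best_net[s]: best score of a network over the variable subset s
    best_net = [0.0] * full
    for s in range(1, full):
        best_net[s] = float("-inf")
        for i in range(n):              # choose the sink of subset s
            if s & (1 << i):
                rest = s ^ (1 << i)
                best_net[s] = max(best_net[s],
                                  best_net[rest] + best_ps[i][rest])
    return best_net[full - 1]
```

Replacing `local_score` with BDeu, BIC, or the local Expected log-BDe changes only Step 1, exactly as the slide notes.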


Algorithm


Computational costs

• The time complexity of Algorithm 1 is O(N^2 2^{2(N-1)} exp(w)) given an elimination order of width w. The traditional scores (BDeu, AIC, BIC, and so on) run in O(N^2 2^{N-1}).

• However, the required memory for the proposed computation is equal to that of the traditional scores.


Experiments


Conclusions

This paper presented a proposal for a non-informative Dirichlet score obtained by marginalizing over the possible hypothetical structures as the user’s prior belief.

The results suggest that the proposed method is effective, especially for small sample sizes.
