Learning Bayesian Networks (From David Heckerman’s tutorial)
Posted: 22-Dec-2015
Learning Bayesian Networks
(From David Heckerman's tutorial)
[Figure: Bayes net(s) alongside data, a table of cases with columns X1 (true, false, false, true), X2 (1, 5, 3, 2), X3 (0.7, -1.6, 5.9, 6.3), ...]
Learning Bayes Nets From Data

[Figure: data plus prior/expert information feed a Bayes-net learner, which outputs a network over X1, ..., X9]
Overview
- Introduction to Bayesian statistics: learning a probability
- Learning probabilities in a Bayes net
- Learning Bayes-net structure

Learning Probabilities: Classical Approach
Simple case: Flipping a thumbtack
[Figure: a thumbtack landing heads or tails]
- The true probability θ is unknown
- Given i.i.d. data, estimate θ using an estimator with good properties: low bias, low variance, consistency (e.g., the ML estimate)
Learning Probabilities: Bayesian Approach

[Figure: a thumbtack landing heads or tails]
- The true probability θ is unknown
- Bayesian approach: maintain a probability density p(θ) over θ ∈ [0, 1]
Bayesian Approach: use Bayes' rule to compute a new density for θ given data

p(θ | heads) = p(θ) p(heads | θ) / ∫ p(θ) p(heads | θ) dθ

(posterior ∝ prior × likelihood)
The Likelihood

p(heads | θ) = θ
p(tails | θ) = 1 − θ
p(hhth...ttth | θ) = θ^#h (1 − θ)^#t

("binomial distribution")
Example: Application of Bayes' rule to the observation of a single "heads"

[Figure: the prior p(θ), the likelihood p(heads | θ) = θ, and the resulting posterior p(θ | heads), each plotted over θ ∈ [0, 1]]
A Bayes net for learning probabilities

[Figure: θ is a parent of X1, X2, ..., XN (toss 1, toss 2, ..., toss N)]

p(hhth...ttth | θ) = θ^#h (1 − θ)^#t
Sufficient Statistics

p(θ | hhth...ttth) = p(θ) p(hhth...ttth | θ) / p(hhth...ttth) ∝ p(θ) θ^#h (1 − θ)^#t

(#h, #t) are sufficient statistics
The probability of heads on the next toss

p(the (N+1)th toss is heads | d) = ∫ p(heads | θ) p(θ | d) dθ
                                 = ∫ θ p(θ | d) dθ ≡ E(θ | d)
Prior Distributions for Prior Distributions for
Direct assessmentDirect assessment Parametric distributionsParametric distributions
– Conjugate distributions (for convenience)Conjugate distributions (for convenience)
– Mixtures of conjugate distributionsMixtures of conjugate distributions
Conjugate Family of Distributions

Beta distribution:
p(θ) = Beta(θ | α_h, α_t) ∝ θ^(α_h − 1) (1 − θ)^(α_t − 1),   α_h, α_t > 0

Properties:
p(θ | #h heads, #t tails) = Beta(θ | α_h + #h, α_t + #t) ∝ θ^(α_h + #h − 1) (1 − θ)^(α_t + #t − 1)
E(θ) = α_h / (α_h + α_t)
Intuition
- The hyperparameters α_h and α_t can be thought of as imaginary counts from our prior experience, starting from "pure ignorance"
- Equivalent sample size = α_h + α_t
- The larger the equivalent sample size, the more confident we are about the true probability
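Since the conjugate update is just count addition, the slide's properties can be sketched in a few lines of Python (a minimal illustration, not code from the tutorial):

```python
def beta_update(alpha_h, alpha_t, n_heads, n_tails):
    """Posterior hyperparameters after observing n_heads heads and n_tails tails."""
    return alpha_h + n_heads, alpha_t + n_tails

def prob_next_heads(alpha_h, alpha_t):
    """p(next toss = heads) = E(theta) = alpha_h / (alpha_h + alpha_t)."""
    return alpha_h / (alpha_h + alpha_t)

# Uniform prior Beta(1, 1), then observe 3 heads and 7 tails:
ah, at = beta_update(1, 1, 3, 7)
print(prob_next_heads(ah, at))  # (1 + 3) / (2 + 10) = 1/3
```

The imaginary counts (here one of each) are simply added to the observed counts, which is exactly the "equivalent sample size" intuition above.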
Beta Distributions

[Figure: densities of Beta(0.5, 0.5), Beta(1, 1), Beta(3, 2), and Beta(19, 39)]
Assessment of a Beta Distribution

Method 1: Equivalent samples
- assess α_h and α_t
- or assess α_h + α_t and α_h / (α_h + α_t)

Method 2: Imagined future samples
p(heads) = 0.2 and p(heads | 3 heads) = 0.5  ⇒  α_h = 1, α_t = 4
check: 0.2 = 1 / (1 + 4) and 0.5 = (1 + 3) / (1 + 3 + 4)
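Method 2 amounts to solving two linear equations for the hyperparameters. A small sketch (my own helper, not from the tutorial) makes the arithmetic explicit:

```python
def beta_from_imagined_samples(p0, pk, k):
    """Recover (alpha_h, alpha_t) from two assessments:
        p(heads) = p0            ->  alpha_h / s = p0, with s = alpha_h + alpha_t
        p(heads | k heads) = pk  ->  (alpha_h + k) / (s + k) = pk
    Solving for the equivalent sample size s, then for alpha_h."""
    s = k * (1 - pk) / (pk - p0)
    alpha_h = p0 * s
    return alpha_h, s - alpha_h

print(beta_from_imagined_samples(0.2, 0.5, 3))  # ≈ (1.0, 4.0), matching the slide
```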
Generalization to m discrete outcomes ("multinomial distribution")

Dirichlet distribution:
p(θ_1, ..., θ_m) = Dirichlet(θ_1, ..., θ_m | α_1, ..., α_m) ∝ ∏_{i=1}^m θ_i^(α_i − 1),   α_i > 0,  ∑_{i=1}^m θ_i = 1

Properties:
p(θ_1, ..., θ_m | N_1, ..., N_m) = Dirichlet(θ_1, ..., θ_m | α_1 + N_1, ..., α_m + N_m)
E(θ_i) = α_i / ∑_{k=1}^m α_k
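The Dirichlet update mirrors the Beta case, one count per outcome. A minimal sketch (illustrative, not from the tutorial):

```python
def dirichlet_update(alphas, counts):
    """Posterior hyperparameters: alpha_i + N_i for each outcome i."""
    return [a + n for a, n in zip(alphas, counts)]

def predictive(alphas):
    """p(next observation = outcome i) = E(theta_i) = alpha_i / sum(alphas)."""
    total = sum(alphas)
    return [a / total for a in alphas]

# Three outcomes, uniform Dirichlet(1, 1, 1), observed counts (2, 5, 3):
post = dirichlet_update([1, 1, 1], [2, 5, 3])
print(predictive(post))  # [3/13, 6/13, 4/13]
```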
More generalizations (see, e.g., Bernardo & Smith, 1994)
- Likelihoods from the exponential family: binomial, multinomial, Poisson, gamma, normal
Overview
- Intro to Bayesian statistics: learning a probability
- Learning probabilities in a Bayes net
- Learning Bayes-net structure
From thumbtacks to Bayes nets

The thumbtack problem can be viewed as learning the probability for a very simple BN:

[Figure: a single node X (heads/tails); unrolled, θ is a parent of X1, X2, ..., XN (toss 1, toss 2, ..., toss N)]
The next simplest Bayes net

[Figure: two unconnected nodes, X (heads/tails) and Y (heads/tails)]

[Figure, unrolled over cases 1, 2, ..., N: θ_X is a parent of X1, ..., XN, and θ_Y is a parent of Y1, ..., YN, with no arc between θ_X and θ_Y]

"parameter independence"

⇒ two separate thumbtack-like learning problems
A bit more difficult...

[Figure: X (heads/tails) → Y (heads/tails)]

Three probabilities to learn:
- θ_X = p(X = heads)
- θ_Y|X=heads = p(Y = heads | X = heads)
- θ_Y|X=tails = p(Y = heads | X = tails)

[Figure, unrolled over cases 1, 2, ...: θ_X is a parent of each Xi; θ_Y|X=heads and θ_Y|X=tails are parents of each Yi, with the relevant parameter selected by the value of Xi]

⇒ 3 separate thumbtack-like problems
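With complete data and parameter independence, the decomposition is just three separate count-and-update problems. A sketch with hypothetical data (my own example, not from the tutorial):

```python
# Hypothetical complete-data cases for X -> Y ("h" = heads, "t" = tails):
cases = [("h", "t"), ("h", "h"), ("t", "t"), ("h", "h"), ("t", "h")]

def posterior_mean(alpha_h, alpha_t, heads, tails):
    """Thumbtack-style Beta posterior mean with prior Beta(alpha_h, alpha_t)."""
    return (alpha_h + heads) / (alpha_h + alpha_t + heads + tails)

# Problem 1: theta_X, counting X over all cases.
x_heads = sum(1 for x, _ in cases if x == "h")
theta_x = posterior_mean(1, 1, x_heads, len(cases) - x_heads)

# Problems 2 and 3: theta_Y|X=h and theta_Y|X=t, counting Y over matching cases only.
theta_y_given = {}
for parent in ("h", "t"):
    ys = [y for x, y in cases if x == parent]
    y_heads = sum(1 for y in ys if y == "h")
    theta_y_given[parent] = posterior_mean(1, 1, y_heads, len(ys) - y_heads)
```

Each parameter sees only the counts relevant to it, which is exactly why the problems separate.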
In general…

Learning probabilities in a BN is straightforward if:
- Local distributions are from the exponential family (binomial, Poisson, gamma, ...)
- Parameter independence
- Conjugate priors
- Complete data
Incomplete data makes parameters dependent

[Figure: the unrolled network for X → Y; when some Xi or Yi are unobserved, θ_X, θ_Y|X=heads, and θ_Y|X=tails become dependent in the posterior]
Overview
- Intro to Bayesian statistics: learning a probability
- Learning probabilities in a Bayes net
- Learning Bayes-net structure
Learning Bayes-net structure

Given data, which model is correct?

model 1: X  Y (no arc)
model 2: X → Y
Bayesian approach

Given data, which model is correct? more likely?

model 1: X  Y (no arc),  p(m1) = 0.7
model 2: X → Y,  p(m2) = 0.3

Data d:
p(m1 | d) = 0.1
p(m2 | d) = 0.9
Bayesian approach: Model Averaging

With the same priors and posteriors (p(m1) = 0.7, p(m2) = 0.3; p(m1 | d) = 0.1, p(m2 | d) = 0.9), average the predictions of the models, weighting each by p(m | d).
Bayesian approach: Model Selection

Alternatively, keep only the best model (here m2, with p(m2 | d) = 0.9):
- Explanation
- Understanding
- Tractability
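The two strategies differ in one line of arithmetic. A sketch using the slide's posteriors and hypothetical per-model predictions (the prediction values are assumptions, not from the slides):

```python
# Posteriors from the slide; hypothetical predictions p(some event | d, m):
posterior = {"m1": 0.1, "m2": 0.9}
prediction = {"m1": 0.5, "m2": 0.8}   # hypothetical numbers

# Model averaging: weight each model's prediction by p(m | d).
averaged = sum(posterior[m] * prediction[m] for m in posterior)   # 0.1*0.5 + 0.9*0.8 = 0.77

# Model selection: keep only the most probable model (here m2).
best = max(posterior, key=posterior.get)
selected = prediction[best]   # 0.8
```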
To score a model, use Bayes' rule

Given data d:

p(m | d) ∝ p(m) p(d | m)     (model score = prior × likelihood)
p(d | m) = ∫ p(d | θ, m) p(θ | m) dθ     ("marginal likelihood")
Thumbtack example

[Figure: a single node X (heads/tails), with conjugate prior Beta(θ | α_h, α_t)]

p(d | m) = ∫ p(θ | m) θ^#h (1 − θ)^#t dθ
         = [Γ(α_h + α_t) / Γ(α_h + α_t + #h + #t)] · [Γ(α_h + #h) / Γ(α_h)] · [Γ(α_t + #t) / Γ(α_t)]
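The closed form above is easy to evaluate numerically; working in log space keeps the Gamma functions from overflowing. A minimal sketch (not from the tutorial):

```python
from math import exp, lgamma

def log_marginal_likelihood(alpha_h, alpha_t, heads, tails):
    """log p(d | m) for the thumbtack with a Beta(alpha_h, alpha_t) prior,
    using the Gamma-function closed form."""
    return (lgamma(alpha_h + alpha_t) - lgamma(alpha_h + alpha_t + heads + tails)
            + lgamma(alpha_h + heads) - lgamma(alpha_h)
            + lgamma(alpha_t + tails) - lgamma(alpha_t))

# Uniform Beta(1, 1) prior, 1 head and 1 tail:
# Gamma(2)/Gamma(4) * Gamma(2)/Gamma(1) * Gamma(2)/Gamma(1) = 1/6
print(exp(log_marginal_likelihood(1, 1, 1, 1)))  # ≈ 0.1667
```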
More complicated graphs

[Figure: X (heads/tails) → Y (heads/tails)]

3 separate thumbtack-like learning problems, so the marginal likelihood factors into three terms of the same Gamma-function form:

p(d | m) = ∏_θ [Γ(α_h + α_t) / Γ(α_h + α_t + #h + #t)] · [Γ(α_h + #h) / Γ(α_h)] · [Γ(α_t + #t) / Γ(α_t)]

where the product runs over θ_X, θ_Y|X=heads, and θ_Y|X=tails, and the counts #h, #t and hyperparameters α_h, α_t in each term are those for that parameter (e.g., for θ_Y|X=heads, Y is counted over only the cases with X = heads).
Model score for a discrete BN

p(d | m) = ∏_{i=1}^{n} ∏_{j=1}^{q_i} [Γ(α_ij) / Γ(α_ij + N_ij)] ∏_{k=1}^{r_i} [Γ(α_ijk + N_ijk) / Γ(α_ijk)]

where:
- N_ijk: number of cases in d where X_i = x_i^k and Pa_i = pa_i^j
- r_i: number of states of X_i
- q_i: number of instances (configurations) of the parents of X_i
- α_ij = ∑_{k=1}^{r_i} α_ijk,  N_ij = ∑_{k=1}^{r_i} N_ijk
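Because the score decomposes over variables (and over parent configurations within each variable), it can be computed one family at a time. A sketch with a hypothetical count table and uniform hyperparameters α_ijk = 1 (my own example, not from the tutorial):

```python
from math import lgamma

def family_log_score(counts, alpha=1.0):
    """Log of one variable's factor in p(d | m).
    counts[j][k] = N_ijk for parent configuration j and state k,
    with uniform hyperparameters alpha_ijk = alpha."""
    total = 0.0
    for row in counts:                               # one parent configuration j
        a_ij, n_ij = alpha * len(row), sum(row)
        total += lgamma(a_ij) - lgamma(a_ij + n_ij)
        total += sum(lgamma(alpha + n) - lgamma(alpha) for n in row)
    return total

# Hypothetical 6-case data for X -> Y (both binary):
# X alone: counts (4, 2); Y given X=heads: (3, 1); Y given X=tails: (0, 2).
log_score = family_log_score([[4, 2]]) + family_log_score([[3, 1], [0, 2]])
```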
Computation of Marginal Likelihood

Efficient closed form if:
- Local distributions are from the exponential family (binomial, Poisson, gamma, ...)
- Parameter independence
- Conjugate priors
- No missing data (including no hidden variables)
Practical considerations
- The number of possible BN structures for n variables is super-exponential in n
- How do we find the best graph(s)?
- How do we assign structure and parameter priors to all possible graphs?
Model search
- Finding the BN structure with the highest score among those structures with at most k parents is NP-hard for k > 1 (Chickering, 1995)
- Heuristic methods:
  - Greedy
  - Greedy with restarts
  - MCMC methods

[Figure: the greedy loop: initialize a structure; score all possible single changes; if any change is better, perform the best change and repeat; otherwise return the saved structure]
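The greedy loop in the flowchart can be sketched generically. Here `score` and `neighbors` are assumptions: `score(g)` returns the model score of structure `g`, and `neighbors(g)` yields all single-change variants (arc additions, deletions, and reversals that keep the graph acyclic):

```python
def greedy_search(initial, score, neighbors):
    """Hill-climb over structures: repeatedly take the best single change
    until no change improves the score, then return the saved structure."""
    current, current_score = initial, score(initial)
    while True:
        scored = [(score(g), g) for g in neighbors(current)]   # all single changes
        if not scored:
            return current
        best_score, best = max(scored, key=lambda t: t[0])
        if best_score <= current_score:     # no change is better: stop
            return current
        current, current_score = best, best_score   # perform the best change

# Toy usage: "structures" are integers, neighbors are +/-1, score peaks at 5.
print(greedy_search(0, lambda g: -(g - 5) ** 2, lambda g: [g - 1, g + 1]))  # 5
```

Greedy-with-restarts simply calls this from several random initial structures and keeps the best result, reducing the chance of stopping at a local maximum.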
Structure priors
1. All possible structures equally likely
2. Partial ordering, required / prohibited arcs
3. p(m) ∝ similarity(m, prior BN)
Parameter priors
- All uniform: Beta(1, 1)
- Use a prior BN
Parameter priors

Recall the intuition behind the Beta prior for the thumbtack:
- The hyperparameters α_h and α_t can be thought of as imaginary counts from our prior experience, starting from "pure ignorance"
- Equivalent sample size = α_h + α_t
- The larger the equivalent sample size, the more confident we are about the long-run fraction
Parameter priors

[Figure: a prior network over x1, ..., x9 plus an equivalent sample size yield an imaginary count for any variable configuration, hence parameter priors for any BN structure over X1, ..., Xn ("parameter modularity")]
Combine user knowledge and data

[Figure: a prior network over x1, ..., x9 plus an equivalent sample size, together with data (cases such as x1 = true/false/false/true, x2 = false/false/false/true, x3 = true/true/false/false, ...), yield improved network(s)]