Bayesian networks: Modeling
CS194-10 Fall 2011 Lecture 21
CS194-10 Fall 2011 Lecture 21 1
Outline
♦ Overview of Bayes nets
♦ Syntax and semantics
♦ Examples
♦ Compact conditional distributions
Learning with complex probability models
Learning cannot succeed without imposing some prior structure on the hypothesis space (by constraint or by preference)
Generative models P(X | θ) support MLE, MAP, and Bayesian learning for domains that (approximately) reflect the assumptions underlying the model:
♦ Naive Bayes: conditional independence of attributes given class value
♦ Mixture models: domain has a flat, discrete category structure
♦ All i.i.d. models: model doesn't change over time
♦ Etc.
Would like to express arbitrarily complex and flexible prior knowledge:
♦ Some attributes depend on others
♦ Categories have hierarchical structure; objects may be mixtures of several categories
♦ Observations at time t may depend on earlier observations
♦ Etc.
Bayesian networks
A simple, graphical notation for conditional independence assertions among a predefined set of random variables Xj, j = 1, …, D, and hence for compact specification of arbitrary joint distributions
Syntax:
– a set of nodes, one per variable
– a directed, acyclic graph (link ≈ "directly influences")
– a set of parameters for each node given its parents:

θ(Xj | Parents(Xj))

In the simplest case, parameters consist of a conditional probability table (CPT) giving the distribution over Xj for each combination of parent values
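As a concrete sketch of one node's parameters, a Boolean CPT can be stored as a dict from parent-value tuples to P(X = true); the variable names and numbers below are illustrative, not taken from the slides.

```python
# A hypothetical CPT for a Boolean child with one Boolean parent,
# stored as a dict: parent-value tuple -> P(child = true).
cpt_toothache = {
    (True,): 0.6,   # P(Toothache=true | Cavity=true)   (illustrative)
    (False,): 0.1,  # P(Toothache=true | Cavity=false)  (illustrative)
}

def theta(value, parents, cpt):
    """theta(X = value | Parents(X) = parents) for a Boolean X."""
    p_true = cpt[parents]
    return p_true if value else 1.0 - p_true

print(theta(False, (True,), cpt_toothache))
```

Only P(X = true) is stored per row; P(X = false) is recovered as 1 − p, matching the parameter counting later in the lecture.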
Example
Topology of network encodes conditional independence assertions:
[Network: Weather (no links); Cavity → Toothache, Cavity → Catch]
Weather is independent of the other variables
Toothache and Catch are conditionally independent given Cavity
Example
I'm at work, neighbor John calls to say my alarm is ringing, but neighbor Mary doesn't call. Sometimes it's set off by minor earthquakes. Is there a burglar?
Variables: Burglary, Earthquake, Alarm, JohnCalls, MaryCalls

Network topology reflects "causal" knowledge:
– A burglar can set the alarm off
– An earthquake can set the alarm off
– The alarm can cause Mary to call
– The alarm can cause John to call
Example contd.
[Network: Burglary → Alarm ← Earthquake; Alarm → JohnCalls; Alarm → MaryCalls]

P(B) = .001        P(E) = .002

B  E  | P(A|B,E)
T  T  | .95
T  F  | .94
F  T  | .29
F  F  | .001

A  | P(J|A)        A  | P(M|A)
T  | .90           T  | .70
F  | .05           F  | .01
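The CPTs above can be encoded directly as data; a hypothetical dict representation maps each variable to its parent tuple and its CPT (P(X = true) per parent assignment):

```python
# The burglary network's parameters as plain data (a sketch, not a
# library API): variable -> (parents, CPT from parent values to
# P(X = true)).
network = {
    "B": ((), {(): 0.001}),
    "E": ((), {(): 0.002}),
    "A": (("B", "E"), {(True, True): 0.95, (True, False): 0.94,
                       (False, True): 0.29, (False, False): 0.001}),
    "J": (("A",), {(True,): 0.90, (False,): 0.05}),
    "M": (("A",), {(True,): 0.70, (False,): 0.01}),
}

# Look up P(Alarm=true | Burglary=false, Earthquake=false):
print(network["A"][1][(False, False)])  # 0.001
```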
Compactness
A CPT for Boolean Xj with L Boolean parents has 2^L rows for the combinations of parent values

Each row requires one parameter p for Xj = true (the parameter for Xj = false is just 1 − p)

If each variable has no more than L parents, the complete network requires O(D · 2^L) parameters

I.e., grows linearly with D, vs. O(2^D) for the full joint distribution

For burglary net, 1 + 1 + 4 + 2 + 2 = 10 parameters (vs. 2^5 − 1 = 31)
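The parameter count above can be checked mechanically; `parents` below simply records each node's number of parents:

```python
# Parameter counting for the burglary net: a Boolean node with k
# Boolean parents needs 2**k parameters, vs 2**D - 1 for the full
# joint over D Boolean variables.
parents = {"B": 0, "E": 0, "A": 2, "J": 1, "M": 1}
n_net = sum(2 ** k for k in parents.values())
n_joint = 2 ** len(parents) - 1
print(n_net, n_joint)  # 10 31
```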
Global semantics
Global semantics defines the full joint distribution as the product of the local conditional distributions:

P(x1, …, xD) = ∏_{j=1}^D θ(xj | parents(Xj))

e.g., P(j ∧ m ∧ a ∧ ¬b ∧ ¬e)
= θ(j|a) θ(m|a) θ(a|¬b,¬e) θ(¬b) θ(¬e)
= 0.9 × 0.7 × 0.001 × 0.999 × 0.998
≈ 0.00063

Theorem: θ(Xj | Parents(Xj)) = P(Xj | Parents(Xj))
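The worked example can be reproduced by multiplying the five local conditionals, with the CPT numbers from the burglary slide:

```python
# Joint P(j, m, a, not-b, not-e) as a product of local conditionals.
def p(p_true, value):
    """P for a Boolean value given the stored P(X = true)."""
    return p_true if value else 1.0 - p_true

b, e, a, j, m = False, False, True, True, True
joint = (p(0.001, b)        # P(B = false) = .999
         * p(0.002, e)      # P(E = false) = .998
         * p(0.001, a)      # P(A = true | B=false, E=false) = .001
         * p(0.90, j)       # P(J = true | A=true) = .90
         * p(0.70, m))      # P(M = true | A=true) = .70
print(round(joint, 5))  # 0.00063
```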
Local semantics
Local semantics: each node is conditionally independent of its nondescendants given its parents

[Figure: node X with parents U1, …, Um, children Y1, …, Yn, and the children's other parents Z1j, …, Znj]
Theorem: Local semantics ⇔ global semantics
Markov blanket
Each node is conditionally independent of all others given its Markov blanket: parents + children + children's parents

[Figure: the Markov blanket of X comprises its parents U1, …, Um, its children Y1, …, Yn, and the children's other parents Z1j, …, Znj]
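The Markov blanket definition can be sketched as a small graph computation; here the graph is the burglary network, represented as a parents dict (a sketch, not a library API):

```python
# Markov blanket = parents + children + children's parents (minus X).
parents = {"B": [], "E": [], "A": ["B", "E"], "J": ["A"], "M": ["A"]}

def markov_blanket(x):
    children = [c for c, ps in parents.items() if x in ps]
    blanket = set(parents[x]) | set(children)
    for c in children:
        blanket |= set(parents[c])  # children's other parents
    blanket.discard(x)
    return blanket

print(sorted(markov_blanket("B")))  # ['A', 'E']
```

For Burglary the blanket is its child Alarm plus Alarm's other parent Earthquake; the calls J and M are screened off by Alarm.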
Constructing Bayesian networks
Need a method such that a series of locally testable assertions of conditional independence guarantees the required global semantics

1. Choose an ordering of variables X1, …, XD
2. For j = 1 to D
   add Xj to the network
   select parents from X1, …, Xj−1 such that
   P(Xj | Parents(Xj)) = P(Xj | X1, …, Xj−1)
   i.e., Xj is conditionally independent of other variables given parents

This choice of parents guarantees the global semantics:

P(X1, …, XD) = ∏_{j=1}^D P(Xj | X1, …, Xj−1)   (chain rule)
             = ∏_{j=1}^D P(Xj | Parents(Xj))   (by construction)
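Because each factor in this product is a proper conditional distribution, the factorization defines a normalized joint; enumerating all 2^5 assignments of the burglary net is a quick sanity check:

```python
# Sum of the factored joint over all assignments should be exactly 1.
from itertools import product

def p(p_true, v):
    return p_true if v else 1.0 - p_true

cpt_a = {(True, True): 0.95, (True, False): 0.94,
         (False, True): 0.29, (False, False): 0.001}

total = 0.0
for b, e, a, j, m in product([True, False], repeat=5):
    total += (p(0.001, b) * p(0.002, e) * p(cpt_a[(b, e)], a)
              * p(0.90 if a else 0.05, j)
              * p(0.70 if a else 0.01, m))
print(round(total, 10))  # 1.0
```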
Example: Car diagnosis
Initial evidence: car won't start
Testable variables (green), "broken, so fix it" variables (orange)
Hidden variables (gray) ensure sparse structure, reduce parameters

[Figure: diagnosis network with root causes (battery age, alternator broken, fan belt broken, no oil, no gas, starter broken, fuel line blocked), hidden intermediates (battery dead, no charging, battery flat), and observables (lights, oil light, gas gauge, battery meter, dipstick, car won't start)]
Example: Car insurance
[Figure: insurance network over Age, SocioEcon, GoodStudent, ExtraCar, RiskAversion, SeniorTrain, Mileage, VehicleYear, MakeModel, DrivingSkill, DrivingHist, DrivQuality, Antilock, Airbag, CarValue, HomeBase, AntiTheft, Theft, Accident, OwnDamage, Cushioning, Ruggedness, MedicalCost, LiabilityCost, PropertyCost, OtherCost, OwnCost]
Compact conditional distributions
CPT grows exponentially with number of parents
CPT becomes infinite with continuous-valued parent or child

Solution: canonical distributions that are defined compactly

Deterministic nodes are the simplest case: X = f(Parents(X)) for some function f

E.g., Boolean functions: NorthAmerican ⇔ Canadian ∨ US ∨ Mexican

E.g., numerical relationships among continuous variables:

∂LakeLevel/∂t = inflow + precipitation − outflow − evaporation
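A deterministic node is just a function of its parents; the Boolean example above, written as code:

```python
# Deterministic node: no parameters at all, just X = f(Parents(X)).
def north_american(canadian, us, mexican):
    return canadian or us or mexican

print(north_american(False, True, False))  # True
```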
Compact conditional distributions contd.
Noisy-OR distributions model multiple noninteracting causes:
1) Parents U1 … UL include all causes (can add leak node)
2) Independent failure probability qℓ for each cause alone

⇒ P(X | U1 … UM, ¬UM+1 … ¬UL) = 1 − ∏_{ℓ=1}^M qℓ

Cold  Flu  Malaria | P(Fever) | P(¬Fever)
F     F    F       | 0.0      | 1.0
F     F    T       | 0.9      | 0.1
F     T    F       | 0.8      | 0.2
F     T    T       | 0.98     | 0.02 = 0.2 × 0.1
T     F    F       | 0.4      | 0.6
T     F    T       | 0.94     | 0.06 = 0.6 × 0.1
T     T    F       | 0.88     | 0.12 = 0.6 × 0.2
T     T    T       | 0.988    | 0.012 = 0.6 × 0.2 × 0.1
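The whole table above can be regenerated from just the three per-cause failure probabilities (Cold 0.6, Flu 0.2, Malaria 0.1):

```python
# Noisy-OR: P(Fever=true | present causes) = 1 - product of the
# failure probabilities q of the causes that are present.
q = {"Cold": 0.6, "Flu": 0.2, "Malaria": 0.1}

def p_fever(present):
    """P(Fever=true) when the causes in `present` are true, others false."""
    fail = 1.0
    for cause in present:
        fail *= q[cause]
    return 1.0 - fail

print(round(p_fever({"Cold", "Flu", "Malaria"}), 3))  # 0.988
```

Three parameters generate all eight rows, illustrating the linear-in-parents parameter count.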
Number of parameters linear in number of parents
Hybrid (discrete+continuous) networks
Discrete (Subsidy? and Buys?); continuous (Harvest and Cost)

[Network: Subsidy? → Cost ← Harvest; Cost → Buys?]

Option 1: discretization (possibly large errors, large CPTs)
Option 2: finitely parameterized canonical families

1) Continuous variable, discrete+continuous parents (e.g., Cost)
2) Discrete variable, continuous parents (e.g., Buys?)
Continuous child variables
Need one conditional density function for the child variable given continuous parents, for each possible assignment to the discrete parents

Most common is the linear Gaussian model, e.g.:

P(Cost = c | Harvest = h, Subsidy? = true)
= N(a_t h + b_t, σ_t²)(c)
= (1 / (σ_t √(2π))) exp(−½ ((c − (a_t h + b_t)) / σ_t)²)

Mean Cost varies linearly with Harvest; variance is fixed

Linear variation is unreasonable over the full range, but works OK if the likely range of Harvest is narrow
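The linear-Gaussian density can be evaluated directly; the coefficients a_t, b_t, σ_t below are illustrative placeholders, not values from the slides:

```python
# Linear-Gaussian conditional density: mean is linear in the parent h,
# standard deviation is fixed. Coefficients are hypothetical.
import math

def linear_gaussian(c, h, a=-0.5, b=5.0, sigma=0.5):
    mean = a * h + b
    z = (c - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

# The density peaks where c equals the linear mean a*h + b:
print(linear_gaussian(4.0, 2.0) > linear_gaussian(6.0, 2.0))  # True
```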
Continuous child variables
[Figure: surface plot of the density P(c | h, subsidy) over Cost c and Harvest h]

All-continuous network with LG distributions ⇒ full joint distribution is a multivariate Gaussian

Discrete+continuous LG network is a conditional Gaussian network, i.e., a multivariate Gaussian over all continuous variables for each combination of discrete variable values
Discrete variable w/ continuous parents
Probability of Buys? given Cost should be a “soft” threshold:
[Figure: soft-threshold curve of P(Buys? = true | Cost = c), falling smoothly from 1 to 0 as c increases]

Probit distribution uses integral of Gaussian:

Φ(x) = ∫_{−∞}^{x} N(0, 1)(t) dt

P(Buys? = true | Cost = c) = Φ((−c + µ)/σ)
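Φ can be written with the error function as Φ(x) = (1 + erf(x/√2))/2, so the probit model above takes a few lines; the µ and σ values are illustrative:

```python
# Probit model for a discrete child with a continuous parent.
import math

def phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def p_buys(c, mu=6.0, sigma=1.0):
    """P(Buys?=true | Cost=c) = Phi((-c + mu) / sigma)."""
    return phi((-c + mu) / sigma)

print(round(p_buys(6.0), 3))  # 0.5 at the threshold c = mu
```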
Why the probit?
1. It’s sort of the right shape
2. Can view as hard threshold whose location is subject to noise
[Network: Cost and Noise combine into a noisy hard threshold that determines Buys?]
Discrete variable contd.
Sigmoid (or logit) distribution also used in neural networks:
P(Buys? = true | Cost = c) = 1 / (1 + exp(−2(−c + µ)/σ))
Sigmoid has similar shape to probit but much longer tails:
[Figure: logit and probit curves for P(buys | c) vs Cost c, nearly identical near the threshold, with the logit approaching 0 and 1 more slowly]
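A sketch comparing the two models under the slide's parameterization; the µ and σ values are illustrative:

```python
# Logit vs probit: both cross 0.5 at c = mu, but the logit has
# heavier tails far from the threshold.
import math

def logit(c, mu=6.0, sigma=1.0):
    return 1.0 / (1.0 + math.exp(-2.0 * (-c + mu) / sigma))

def probit(c, mu=6.0, sigma=1.0):
    return 0.5 * (1.0 + math.erf((-c + mu) / (sigma * math.sqrt(2.0))))

print(round(logit(6.0), 3), round(probit(6.0), 3))  # 0.5 0.5
print(logit(10.0) > probit(10.0))  # logit tail decays more slowly
```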
Summary (representation)
Bayes nets provide a natural representation for (causally induced) conditional independence

Topology + CPTs = compact representation of joint distribution
⇒ fast learning from few examples
Generally easy for (non)experts to construct
Canonical distributions (e.g., noisy-OR, linear Gaussian)
⇒ compact representation of CPTs
⇒ faster learning from fewer examples