Probabilistic Models

Aug 25th, 2001
Copyright © 2001, Andrew W. Moore

Brigham S. Anderson

School of Computer Science

Carnegie Mellon University

www.cs.cmu.edu/~brigham

[email protected]

Probability = relative area


OUTLINE

Probability
• Axioms
• Rules

Probabilistic Models
• Probability Tables
• Bayes Nets

Machine Learning
• Inference
• Anomaly detection
• Classification


Overview

The point of this section:

1. Introduce probability models.

2. Introduce Bayes Nets as powerful probability models.


Representing P(A,B,C)

• How can we represent the function P(A,B,C)?

• It ideally should contain all the information in this Venn diagram

[Venn diagram of A, B, and C; the eight region probabilities are 0.30, 0.25, 0.10, 0.10, 0.10, 0.05, 0.05, 0.05]


Recipe for making a joint distribution of M variables:

1. Make a truth table listing all combinations of values of your variables (if there are M boolean variables then the table will have 2^M rows).

2. For each combination of values, say how probable it is.

3. If you subscribe to the axioms of probability, those numbers must sum to 1.

A B C Prob

0 0 0 0.30

0 0 1 0.05

0 1 0 0.10

0 1 1 0.05

1 0 0 0.05

1 0 1 0.10

1 1 0 0.25

1 1 1 0.10

Example: P(A, B, C)


The Joint Probability Table
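As a small illustration (mine, not the slides'), the joint table above fits naturally in a Python dictionary keyed by truth assignments, and the axiom check that the 2^M entries sum to 1 is one line:

    # Joint probability table P(A,B,C), keyed by (A, B, C) truth values.
    JOINT = {
        (0, 0, 0): 0.30, (0, 0, 1): 0.05,
        (0, 1, 0): 0.10, (0, 1, 1): 0.05,
        (1, 0, 0): 0.05, (1, 0, 1): 0.10,
        (1, 1, 0): 0.25, (1, 1, 1): 0.10,
    }

    # Axiom of probability: the 2^M rows must sum to 1.
    assert abs(sum(JOINT.values()) - 1.0) < 1e-9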


Using the Prob. Table

Once you have the PT you can ask for the probability of any logical expression involving your attributes:

P(E) = \sum_{rows matching E} P(row)
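A minimal Python sketch of this rule (reusing the JOINT table from the sketch above; the predicate E is any function over a row):

    def prob(joint, E):
        # P(E): sum P(row) over the rows where the logical expression E holds.
        return sum(p for row, p in joint.items() if E(row))

    # Example query: P(A ^ ~B) = 0.05 + 0.10 = 0.15
    print(prob(JOINT, lambda r: r[0] == 1 and r[1] == 0))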


Using the Prob. Table

P(Poor, Male) = \sum_{rows matching Poor ^ Male} P(row) = 0.4654


Using the Prob. Table

P(Poor) = \sum_{rows matching Poor} P(row) = 0.7604


Inference with the Prob. Table

P(E_1 | E_2) = P(E_1, E_2) / P(E_2)
             = \sum_{rows matching E_1 and E_2} P(row) / \sum_{rows matching E_2} P(row)

P(Male | Poor) = 0.4654 / 0.7604 = 0.612
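A sketch of the same computation in Python, building on the prob() helper above (the 0.612 result mirrors 0.4654 / 0.7604):

    def cond_prob(joint, E1, E2):
        # P(E1 | E2) = P(E1 ^ E2) / P(E2), both by summing matching rows.
        return prob(joint, lambda r: E1(r) and E2(r)) / prob(joint, E2)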


Inference is a big deal

• I've got this evidence. What's the chance that this conclusion is true?
• I've got a sore neck: how likely am I to have meningitis?
• I see my lights are out and it's 9pm. What's the chance my spouse is already asleep?
• I see that there's a long queue by my gate and the clouds are dark and I'm traveling by US Airways. What's the chance I'll be home tonight? (Answer: 0.00000001)
• There's a thriving set of industries growing based around Bayesian Inference. Highlights are: Medicine, Pharma, Help Desk Support, Engine Fault Diagnosis


The Full Probability Table

Problem: We usually don't know P(A,B,C)!

Solutions:
1. Construct it analytically
2. Learn it from data

A B C Prob

0 0 0 0.30

0 0 1 0.05

0 1 0 0.10

0 1 1 0.05

1 0 0 0.05

1 0 1 0.10

1 1 0 0.25

1 1 1 0.10

P(A,B,C)


Learning a Probability Table

Build a PT for your attributes in which the probabilities are unspecified

Then fill in each row with

P̂(row) = (# records matching row) / (total number of records)

A B C Prob

0 0 0 ?

0 0 1 ?

0 1 0 ?

0 1 1 ?

1 0 0 ?

1 0 1 ?

1 1 0 ?

1 1 1 ?

A B C Prob

0 0 0 0.30

0 0 1 0.05

0 1 0 0.10

0 1 1 0.05

1 0 0 0.05

1 0 1 0.10

1 1 0 0.25

1 1 1 0.10

(e.g. the 0.25 in the A=1, B=1, C=0 row is the fraction of all records in which A and B are True but C is False)
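A Python sketch of this counting estimator (records are tuples of attribute values; the names are mine):

    from collections import Counter

    def learn_joint(records):
        # P-hat(row) = (# records matching row) / (total number of records).
        counts = Counter(tuple(r) for r in records)
        n = len(records)
        # Combinations never seen in the data get probability 0 -- the
        # overfitting problem discussed a few slides later.
        return {row: c / n for row, c in counts.items()}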


Example of Learning a Prob. Table

• This PTable was obtained by learning from three attributes in the UCI “Adult” Census Database [Kohavi 1995]


Where are we?

• We have covered the fundamentals of probability

• We have become content with PTables

• And we even know how to learn PTables from data.


Probability Table as Anomaly Detector

• Our PTable is our first example of something called an Anomaly Detector.

• An anomaly detector tells you how likely a datapoint is given a model.

Data point x → [Anomaly Detector] → P(x)


Evaluating a Model

• Given a record x, a model M can tell you how likely the record is: P̂(x | M)

• Given a dataset with R records, a model can tell you how likely the dataset is:

P̂(dataset | M) = P̂(x_1 ^ x_2 ^ … ^ x_R | M) = \prod_{k=1}^{R} P̂(x_k | M)

(Under the assumption that all records were independently generated from the model's probability function)
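As a sketch, the product above is a one-liner in Python (p_of is whatever model probability P̂(x | M) we are evaluating):

    import math

    def dataset_likelihood(records, p_of):
        # P-hat(dataset | M) = product over all R records of P-hat(x_k | M),
        # assuming the records were generated independently from the model.
        return math.prod(p_of(x) for x in records)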


A small dataset: Miles Per Gallon

From the UCI repository (thanks to Ross Quinlan)

192 Training Set Records

mpg   modelyear  maker

good  75to78     asia
bad   70to74     america
bad   75to78     europe
bad   70to74     america
bad   70to74     america
bad   70to74     asia
bad   70to74     asia
bad   75to78     america
:     :          :
bad   70to74     america
good  79to83     america
bad   75to78     america
good  79to83     america
bad   75to78     america
good  79to83     america
good  79to83     america
bad   70to74     america
good  75to78     europe
bad   75to78     europe


A small dataset: Miles Per Gallon

For the 192 training records above,

P̂(dataset | M) = P̂(x_1 ^ … ^ x_R | M) = \prod_{k=1}^{R} P̂(x_k | M) = 3.4 × 10^{-203} (in this case)


Log Probabilities

Since probabilities of datasets get so small, we usually use log probabilities:

log P̂(dataset | M) = log \prod_{k=1}^{R} P̂(x_k | M) = \sum_{k=1}^{R} log P̂(x_k | M)
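The log-space version of the previous sketch, which avoids underflow for tiny products like 3.4 × 10^{-203}:

    import math

    def dataset_log_likelihood(records, p_of):
        # log P-hat(dataset | M) = sum of log P-hat(x_k | M).
        # Note: math.log raises ValueError if any record has probability 0.
        return sum(math.log(p_of(x)) for x in records)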

A small dataset: Miles Per Gallon

For the same 192 training records,

log P̂(dataset | M) = \sum_{k=1}^{R} log P̂(x_k | M) = -466.19 (in this case)


Summary: The Good News

Prob. Tables allow us to learn P(X) from data.

• Can do anomaly detection: spot suspicious / incorrect records
• Can do inference: P(E1|E2) (Automatic Doctor, Help Desk, etc)
• Can do Bayesian classification (coming soon…)


Summary: The Bad News

Learning a Prob. Table for P(X) is for Cowboy data miners

• Those probability tables get big fast!
• With so many parameters relative to our data, the resulting model could be pretty wild.


Using a test set

An independent test set with 196 cars has a worse log likelihood

(actually it’s a billion quintillion quintillion quintillion quintillion times less likely)

….Density estimators can overfit. And the full joint density estimator is the overfittiest of them all!


Overfitting Prob. Tables

log P̂(testset | M) = \sum_{k=1}^{R} log P̂(x_k | M) = -∞ if P̂(x_k | M) = 0 for any x_k

If this ever happens, it means there are certain combinations that we learn are impossible.


Using a test set

The only reason that our test set didn't score -infinity is that Andrew's code is hard-wired to always predict a probability of at least one in 10^20

We need P() estimators that are less prone to overfitting
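That hard-wired floor can be sketched in one line of Python (the one-in-10^20 figure from above becomes floor=1e-20; this is a patch, not a cure):

    import math

    def safe_log_likelihood(records, p_of, floor=1e-20):
        # Clip each record's probability at `floor` so a single unseen
        # combination doesn't drive the whole test-set score to -infinity.
        return sum(math.log(max(p_of(x), floor)) for x in records)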


Naïve Density Estimation

The problem with the Probability Table is that it just mirrors the training data.

In fact, it is just another way of storing the data: we could reconstruct the original dataset perfectly from our Table.

We need something which generalizes more usefully.

The naïve model generalizes strongly: Assume that each attribute is distributed independently of any of the other attributes.


Probability Models

Full Prob. Table
• No assumptions
• Overfitting-prone
• Scales horribly

Naïve Density
• Strong assumptions
• Overfitting-resistant
• Scales incredibly well


Independently Distributed Data

• Let x[i] denote the i'th field of record x.
• The independently distributed assumption says that for any i, v, u_1, u_2, …, u_{i-1}, u_{i+1}, …, u_M:

P(x[i]=v | x[1]=u_1, x[2]=u_2, …, x[i-1]=u_{i-1}, x[i+1]=u_{i+1}, …, x[M]=u_M) = P(x[i]=v)

• Or in other words, x[i] is independent of {x[1], x[2], …, x[i-1], x[i+1], …, x[M]}
• This is often written as x[i] ⊥ {x[1], x[2], …, x[i-1], x[i+1], …, x[M]}


A note about independence

• Assume A and B are Boolean Random Variables. Then

“A and B are independent”

if and only if

P(A|B) = P(A)

• “A and B are independent” is often notated as

A ⊥ B


Independence Theorems

• Assume P(A|B) = P(A)
• Then P(A, B) = P(A|B) P(B) = P(A) P(B)

• Assume P(A|B) = P(A)
• Then P(B|A) = P(A, B) / P(A) = P(A) P(B) / P(A) = P(B)


Independence Theorems

• Assume P(A|B) = P(A)
• Then P(~A|B) = 1 - P(A|B) = 1 - P(A) = P(~A)

• Assume P(A|B) = P(A)
• Then P(A|~B) = (P(A) - P(A|B) P(B)) / P(~B) = P(A) (1 - P(B)) / P(~B) = P(A)


Multivalued Independence

For Random Variables A and B,

A ⊥ B if and only if

∀u,v : P(A=u | B=v) = P(A=u)

from which you can then prove things like…

∀u,v : P(A=u, B=v) = P(A=u) P(B=v)
∀u,v : P(B=v | A=u) = P(B=v)


Back to Naïve Density Estimation

• Let x[i] denote the i'th field of record x.
• Naïve DE assumes x[i] is independent of {x[1], x[2], …, x[i-1], x[i+1], …, x[M]}
• Example:

• Suppose that each record is generated by randomly shaking a green dice and a red dice

• Dataset 1: A = red value, B = green value

• Dataset 2: A = red value, B = sum of values

• Dataset 3: A = sum of values, B = difference of values

• Which of these datasets violates the naïve assumption?


Using the Naïve Distribution

• Once you have a Naïve Distribution you can easily compute any row of the joint distribution.

• Suppose A, B, C and D are independently distributed.

What is P(A^~B^C^~D)?

= P(A) P(~B) P(C) P(~D)


Naïve Distribution General Case

• Suppose x[1], x[2], … x[M] are independently distributed.

P(x[1]=u_1, x[2]=u_2, …, x[M]=u_M) = \prod_{k=1}^{M} P(x[k]=u_k)

• So if we have a Naïve Distribution we can construct any row of the implied Joint Distribution on demand.

• So we can do any inference*
• But how do we learn a Naïve Density Estimator?


Learning a Naïve Density Estimator

P̂(x[i]=u) = (# records in which x[i] = u) / (total number of records)

Another trivial learning algorithm!
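A sketch of that algorithm plus the product rule from the previous slides (function names are mine):

    from collections import Counter

    def learn_naive(records, n_attrs):
        # P-hat(x[i]=u) = (# records in which x[i]=u) / (total # records).
        n = len(records)
        counts = [Counter() for _ in range(n_attrs)]
        for r in records:
            for i, v in enumerate(r):
                counts[i][v] += 1
        return [{v: c / n for v, c in ctr.items()} for ctr in counts]

    def naive_prob(model, x):
        # Joint row on demand: the product of the per-attribute marginals.
        p = 1.0
        for i, v in enumerate(x):
            p *= model[i].get(v, 0.0)
        return p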


Contrast

Joint DE:
• Can model anything
• No problem to model "C is a noisy copy of A"
• Given 100 records and more than 6 boolean attributes, will screw up badly

Naïve DE:
• Can model only very boring distributions
• "C is a noisy copy of A" is outside Naïve's scope
• Given 100 records and 10,000 multivalued attributes, will be fine


Empirical Results: “Hopeless”

The “hopeless” dataset consists of 40,000 records and 21 boolean attributes called a,b,c, … u. Each attribute in each record is generated 50-50 randomly as 0 or 1.

Despite the vast amount of data, “Joint” overfits hopelessly and does much worse

Average test set log probability during 10 folds of k-fold cross-validation (*described in a future Andrew lecture)


Empirical Results: "Logical"

The "logical" dataset consists of 40,000 records and 4 boolean attributes called a,b,c,d where a,b,c are generated 50-50 randomly as 0 or 1. D = A^~C, except that in 10% of records it is flipped.

[Figure: the DE learned by "Joint" vs. the DE learned by "Naive"]



Empirical Results: "MPG"

The "MPG" dataset consists of 392 records and 8 attributes.

[Figure: a tiny part of the DE learned by "Joint" vs. the DE learned by "Naive"]


Empirical Results: "Weight vs MPG"

Suppose we train only from the "Weight" and "MPG" attributes.

[Figure: the DE learned by "Joint" vs. the DE learned by "Naive"]



"Weight vs MPG": The best that Naïve can do

[Figure: the DE learned by "Joint" vs. the DE learned by "Naive"]


Reminder: The Good News

• We have two ways to learn a Density Estimator from data.
• There are vastly more impressive Density Estimators (Mixture Models, Bayesian Networks, Density Trees, Kernel Densities and many more)
• Density estimators can do many good things…
  • Anomaly detection
  • Inference: P(E1|E2) (Automatic Doctor / Help Desk etc)
  • Ingredient for Bayes Classifiers


Probability Models

Full Prob. Table
• No assumptions
• Overfitting-prone
• Scales horribly

Naïve Density
• Strong assumptions
• Overfitting-resistant
• Scales incredibly well

Bayes Nets
• Carefully chosen assumptions
• Overfitting and scaling properties depend on assumptions


Bayes Nets


Bayes Nets

• What are they?
  • Bayesian nets are a network-based framework for representing and analyzing models involving uncertainty
• What are they used for?
  • Intelligent decision aids, data fusion, 3-E feature recognition, intelligent diagnostic aids, automated free text understanding, data mining
• Where did they come from?
  • Cross fertilization of ideas between the artificial intelligence, decision analysis, and statistics communities
• Why the sudden interest?
  • Development of propagation algorithms followed by availability of easy to use commercial software
  • Growing number of creative applications
• How are they different from other knowledge representation and probabilistic analysis tools?
  • Different from other knowledge-based systems tools because uncertainty is handled in a mathematically rigorous yet efficient and simple way


Bayes Net Concepts

1. Chain Rule: P(A,B) = P(A) P(B|A)

2. Conditional Independence: P(B|A) = P(B)


A Simple Bayes Net

• Let’s assume that we already have P(Mpg,Horse)

How would you rewrite this using the Chain rule?

P(Mpg, Horse):

         low    high
good     0.36   0.04
bad      0.12   0.48

i.e. P(good, low) = 0.36, P(good, high) = 0.04, P(bad, low) = 0.12, P(bad, high) = 0.48



Review: Chain Rule

P(Mpg, Horse) = P(Mpg) * P(Horse | Mpg)

P(Mpg):
P(good) = 0.4
P(bad) = 0.6

P(Horse | Mpg):
P(low | good) = 0.89
P(low | bad) = 0.21
P(high | good) = 0.11
P(high | bad) = 0.79

P(Mpg, Horse):
P(good, low) = 0.36   ( = P(good) * P(low | good) = 0.4 * 0.89 )
P(good, high) = 0.04   ( = P(good) * P(high | good) = 0.4 * 0.11 )
P(bad, low) = 0.12   ( = P(bad) * P(low | bad) = 0.6 * 0.21 )
P(bad, high) = 0.48   ( = P(bad) * P(high | bad) = 0.6 * 0.79 )
(values rounded)
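A quick sketch checking the factorization numerically (the products reproduce the joint table up to the slide's rounding):

    P_MPG = {"good": 0.4, "bad": 0.6}
    P_HORSE = {("good", "low"): 0.89, ("good", "high"): 0.11,
               ("bad", "low"): 0.21, ("bad", "high"): 0.79}  # P(Horse | Mpg)

    joint = {(m, h): P_MPG[m] * P_HORSE[(m, h)]
             for m in P_MPG for h in ("low", "high")}
    # joint[("good", "low")] == 0.4 * 0.89 == 0.356, which the slide rounds
    # to 0.36; the four entries still sum exactly to 1.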


How to Make a Bayes Net

P(Mpg, Horse) = P(Mpg) * P(Horse | Mpg)

[Diagram: Mpg → Horse]


How to Make a Bayes Net

P(Mpg, Horse) = P(Mpg) * P(Horse | Mpg)

[Diagram: Mpg → Horse]

P(Mpg):
P(good) = 0.4
P(bad) = 0.6

P(Horse | Mpg):
P(low | good) = 0.89
P(low | bad) = 0.21
P(high | good) = 0.11
P(high | bad) = 0.79


How to Make a Bayes Net

[Diagram: Mpg → Horse]

P(Mpg):
P(good) = 0.4
P(bad) = 0.6

P(Horse | Mpg):
P(low | good) = 0.89
P(low | bad) = 0.21
P(high | good) = 0.11
P(high | bad) = 0.79

• Each node is a probability function

• Each arc denotes conditional dependence


How to Make a Bayes Net

So, what have we accomplished thus far?

Nothing; we’ve just “Bayes Net-ified”

P(Mpg, Horse) using the Chain rule.

…the real excitement starts when we wield conditional independence

[Diagram: Mpg → Horse, with tables P(Mpg) and P(Horse | Mpg)]


How to Make a Bayes Net

Before we apply conditional independence, we need a worthier opponent than puny P(Mpg, Horse)…

We’ll use P(Mpg, Horse, Accel)


How to Make a Bayes Net

Rewrite joint using the Chain rule.

P(Mpg, Horse, Accel) = P(Mpg) P(Horse | Mpg) P(Accel | Mpg, Horse)

Note: obviously, we could have written this 3! = 6 different ways…

P(M, H, A) = P(M) * P(H|M) * P(A|M,H)
           = P(M) * P(A|M) * P(H|M,A)
           = P(H) * P(M|H) * P(A|H,M)
           = P(H) * P(A|H) * P(M|H,A)
           = …


How to Make a Bayes Net

[Diagram: Mpg → Horse, Mpg → Accel, Horse → Accel]

Rewrite joint using the Chain rule.

P(Mpg, Horse, Accel) = P(Mpg) P(Horse | Mpg) P(Accel | Mpg, Horse)


How to Make a Bayes Net

[Diagram: Mpg → Horse, Mpg → Accel, Horse → Accel]

P(Mpg):
P(good) = 0.4
P(bad) = 0.6

P(Horse | Mpg):
P(low | good) = 0.89
P(low | bad) = 0.21
P(high | good) = 0.11
P(high | bad) = 0.79

P(Accel | Mpg, Horse):
P(low | good, low) = 0.97
P(low | good, high) = 0.15
P(low | bad, low) = 0.90
P(low | bad, high) = 0.05
P(high | good, low) = 0.03
P(high | good, high) = 0.85
P(high | bad, low) = 0.10
P(high | bad, high) = 0.95

* Note: I made these up…


How to Make a Bayes Net

[Diagram and tables as on the previous slide]

A Miracle Occurs!

You are told by God (or another domain expert) that Accel is independent of Mpg given Horse!

P(Accel | Mpg, Horse) = P(Accel | Horse)


How to Make a Bayes Net

[Diagram: Mpg → Horse → Accel]

P(Mpg):
P(good) = 0.4
P(bad) = 0.6

P(Horse | Mpg):
P(low | good) = 0.89
P(low | bad) = 0.21
P(high | good) = 0.11
P(high | bad) = 0.79

P(Accel | Horse):
P(low | low) = 0.95
P(low | high) = 0.11
P(high | low) = 0.05
P(high | high) = 0.89


How to Make a Bayes Net

[Diagram and tables as on the previous slide]

Thank you, domain expert!

Now I only need to learn 5 parameters instead of 7 from my data!

My parameter estimates will be more accurate as a result!


Independence

"The Acceleration does not depend on the Mpg once I know the Horsepower."

This can be specified very simply:

P(Accel | Mpg, Horse) = P(Accel | Horse)

This is a powerful statement!

It required extra domain knowledge. A different kind of knowledge than numerical probabilities. It needed an understanding of causation.


Bayes Net-Building Tools

1. Chain Rule: P(A,B) = P(A) P(B|A)

2. Conditional Independence: P(B|A) = P(B)

This is Ridiculously Useful in general


Bayes Nets Formalized

A Bayes net (also called a belief network) is an augmented directed acyclic graph, represented by the pair (V, E) where:

• V is a set of vertices.
• E is a set of directed edges joining vertices. No loops of any length are allowed.

Each vertex in V contains the following information:
• The name of a random variable
• A probability distribution table indicating how the probability of this variable's values depends on all possible combinations of parental values.



Where are we now?

• We have a methodology for building Bayes nets.
• We don't require exponential storage to hold our probability table. Only exponential in the maximum number of parents of any node.
• We can compute probabilities of any given assignment of truth values to the variables. And we can do it in time linear with the number of nodes.
• So we can also compute answers to any questions.

E.G. What could we do to compute P(R | T, ~S)?

[Diagram: S → L, M → L, M → R, L → T]

P(S) = 0.3
P(M) = 0.6
P(R | M) = 0.3, P(R | ~M) = 0.6
P(T | L) = 0.3, P(T | ~L) = 0.8
P(L | M^S) = 0.05, P(L | M^~S) = 0.1, P(L | ~M^S) = 0.1, P(L | ~M^~S) = 0.2

Step 1: Compute P(R ^ T ^ ~S), the sum of all the rows in the Joint that match R ^ T ^ ~S (4 joint computes)
Step 2: Compute P(~R ^ T ^ ~S), the sum of all the rows in the Joint that match ~R ^ T ^ ~S (4 joint computes)
Step 3: Return P(R ^ T ^ ~S) / [ P(R ^ T ^ ~S) + P(~R ^ T ^ ~S) ]

Each sum is obtained by the "computing a joint probability entry" method of the later slides.
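Here is a sketch of Steps 1-3 in Python, with the slide's CPT values hard-coded (variable names are mine):

    from itertools import product

    P_S, P_M = 0.3, 0.6
    P_L = {(True, True): 0.05, (True, False): 0.10,
           (False, True): 0.10, (False, False): 0.20}  # P(L | M, S)
    P_R = {True: 0.3, False: 0.6}                       # P(R | M)
    P_T = {True: 0.3, False: 0.8}                       # P(T | L)

    def joint(s, m, l, r, t):
        # One joint entry: multiply each node's CPT row.
        p = (P_S if s else 1 - P_S) * (P_M if m else 1 - P_M)
        p *= P_L[(m, s)] if l else 1 - P_L[(m, s)]
        p *= P_R[m] if r else 1 - P_R[m]
        p *= P_T[l] if t else 1 - P_T[l]
        return p

    # Steps 1 and 2: each sum is 4 joint computes (over the unseen M and L).
    num = sum(joint(False, m, l, True, True)
              for m, l in product([True, False], repeat=2))
    den = num + sum(joint(False, m, l, False, True)
                    for m, l in product([True, False], repeat=2))
    print(num / den)  # Step 3: P(R | T, ~S)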


IndependenceWe’ve stated:

P(M) = 0.6P(S) = 0.3P(S M) = P(S)

M S Prob

T T

T F

F T

F F

And since we now have the joint pdf, we can make any queries we like.

From these statements, we can derive the full joint pdf.



The good news

We can do inference. We can compute any conditional probability:
P( Some variable | Some other variable values )

P(E_1 | E_2) = P(E_1, E_2) / P(E_2)
             = \sum_{joint entries matching E_1 and E_2} P(joint entry) / \sum_{joint entries matching E_2} P(joint entry)

Suppose you have m binary-valued variables in your Bayes Net and expression E_2 mentions k variables. How much work is the above computation?



The sad, bad news

Conditional probabilities by enumerating all matching entries in the joint are expensive:

Exponential in the number of variables.

But perhaps there are faster ways of querying Bayes nets?
• In fact, if I ever ask you to manually do a Bayes Net inference, you'll find there are often many tricks to save you time.
• So we've just got to program our computer to do those tricks too, right?

Sadder and worse news: general querying of Bayes nets is NP-complete.


Case Study I

Pathfinder system. (Heckerman 1991, Probabilistic Similarity Networks, MIT Press, Cambridge MA).

• Diagnostic system for lymph-node diseases.

• 60 diseases and 100 symptoms and test-results.

• 14,000 probabilities

• Expert consulted to make net.

• 8 hours to determine variables.

• 35 hours for net topology.

• 40 hours for probability table values.

• Apparently, the experts found it quite easy to invent the causal links and probabilities.

• Pathfinder is now outperforming the world experts in diagnosis. Being extended to several dozen other medical domains.


What you should know

• The meanings and importance of independence and conditional independence.

• The definition of a Bayes net.
• Computing probabilities of assignments of variables (i.e. members of the joint p.d.f.) with a Bayes net.
• The slow (exponential) method for computing arbitrary, conditional probabilities.
• The stochastic simulation method and likelihood weighting.


Chain Rule

P(Mpg, Horse) = P(Mpg) * P(Horse | Mpg)
P(Mpg, Horse) = P(Horse) * P(Mpg | Horse)

P(Mpg):  good 0.4, bad 0.6
P(Horse | Mpg):  low|good 0.89, high|good 0.11, low|bad 0.21, high|bad 0.79

P(Horse):  low 0.48, high 0.52
P(Mpg | Horse):  good|low 0.72, bad|low 0.28, good|high 0.08, bad|high 0.92

P(Mpg, Horse):
         low    high
good     0.36   0.04
bad      0.12   0.48


Chain Rule (as tables)

P(Mpg, Horse) = P(Mpg) * P(Horse | Mpg)

P(Mpg):            P(Horse | Mpg):
good 0.4           good: low 0.89, high 0.11
bad  0.6           bad:  low 0.21, high 0.79

P(Mpg, Horse) = P(Horse) * P(Mpg | Horse)

P(Horse):          P(Mpg | Horse):
low  0.48          low:  good 0.72, bad 0.28
high 0.52          high: good 0.08, bad 0.92

P(Mpg, Horse):
Mpg   Horse  Prob
good  low    0.36
good  high   0.04
bad   low    0.12
bad   high   0.48


Making a Bayes Net

1. Start with a Joint: P(Horse, Mpg)
2. Rewrite the joint with the Chain Rule: P(Horse) * P(Mpg | Horse)
3. Each component becomes a node (with arcs)


P(Mpg, Horsepower)

We can rewrite P(Mpg, Horsepower) as a Bayes net!
1. Decompose with the Chain rule.
2. Make each element of the decomposition into a node.

[Diagram: Mpg → Horsepower]

P(Mpg):  good 0.4, bad 0.6
P(Horsepower | Mpg):  low|good 0.89, high|good 0.11, low|bad 0.21, high|bad 0.79


P(Mpg, Horsepower)

[Diagram: Horsepower → Mpg]

P(Horse):  low 0.48, high 0.52
P(Mpg | Horse):  good|low 0.72, bad|low 0.28, good|high 0.08, bad|high 0.92


The Joint Distribution

       h      ~h
f      0.01   0.01
~f     0.09   0.89

[Venn diagram of h and f with the same region probabilities]


The Joint Distribution

[Venn diagram of h, f, and a]


Joint Distribution

Up to this point, we have represented a probability model (or joint distribution) in two ways:

Venn Diagrams
• Visual
• Good for tutorials

Probability Tables
• Efficient
• Good for computation

P(Mpg, Horse):
         low    high
good     0.36   0.04
bad      0.12   0.48


Where do Joint Distributions come from?

• Idea One: Expert Humans with lots of free time
• Idea Two: Simpler probabilistic facts and some algebra

Example: Suppose you knew

P(A) = 0.7

P(B|A) = 0.2
P(B|~A) = 0.1

P(C|A,B) = 0.1
P(C|A,~B) = 0.8
P(C|~A,B) = 0.3
P(C|~A,~B) = 0.1

Then you can automatically compute the PT using the chain rule:

P(x, y, z) = P(x) P(y|x) P(z|x,y)

Later: Bayes Nets will do this systematically.
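A sketch of that computation in Python with the numbers above, filling all 2^3 = 8 rows:

    P_A = 0.7
    P_B = {True: 0.2, False: 0.1}                    # P(B | A)
    P_C = {(True, True): 0.1, (True, False): 0.8,
           (False, True): 0.3, (False, False): 0.1}  # P(C | A, B)

    table = {}
    for a in (True, False):
        for b in (True, False):
            for c in (True, False):
                p = (P_A if a else 1 - P_A)
                p *= P_B[a] if b else 1 - P_B[a]
                p *= P_C[(a, b)] if c else 1 - P_C[(a, b)]
                table[(a, b, c)] = p
    # P(x, y, z) = P(x) P(y|x) P(z|x,y); the eight entries sum to 1.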


Where do Joint Distributions come from?

• Idea Three: Learn them from data!

Prepare to be impressed…



A more interesting case

• M : Manuela teaches the class
• S : It is sunny
• L : The lecturer arrives slightly late.

Assume both lecturers are sometimes delayed by bad weather. Andrew is more likely to arrive late than Manuela.

Let's begin with writing down knowledge we're happy about:
P(S | M) = P(S), P(S) = 0.3, P(M) = 0.6

Lateness is not independent of the weather and is not independent of the lecturer.

We already know the Joint of S and M, so all we need now is P(L | S=u, M=v) in the 4 cases of u/v = True/False.


A more interesting case

• M : Manuela teaches the class
• S : It is sunny
• L : The lecturer arrives slightly late.

Assume both lecturers are sometimes delayed by bad weather. Andrew is more likely to arrive late than Manuela.

P(S | M) = P(S)
P(S) = 0.3
P(M) = 0.6
P(L | M ^ S) = 0.05
P(L | M ^ ~S) = 0.1
P(L | ~M ^ S) = 0.1
P(L | ~M ^ ~S) = 0.2

Now we can derive a full joint p.d.f. with a "mere" six numbers instead of seven*

*Savings are larger for larger numbers of variables.


A more interesting case

(Same setup and numbers as the previous slide.)

Question: Express P(L=x ^ M=y ^ S=z) in terms that only need the above expressions, where x, y and z may each be True or False.



A bit of notation

P(S | M) = P(S)
P(S) = 0.3
P(M) = 0.6
P(L | M ^ S) = 0.05
P(L | M ^ ~S) = 0.1
P(L | ~M ^ S) = 0.1
P(L | ~M ^ ~S) = 0.2

[Diagram: S → L ← M, with tables P(S) = 0.3, P(M) = 0.6, and P(L | M, S) as above]

Read the absence of an arrow between S and M to mean "it would not help me predict M if I knew the value of S".

Read the two arrows into L to mean that if I want to know the value of L it may help me to know M and to know S.

This kind of stuff will be thoroughly formalized later.


An even cuter trick

Suppose we have these three events:
• M : Lecture taught by Manuela
• L : Lecturer arrives late
• R : Lecture concerns robots

Suppose:
• Andrew has a higher chance of being late than Manuela.
• Andrew has a higher chance of giving robotics lectures.

What kind of independence can we find?

How about:
• P(L | M) = P(L) ?
• P(R | M) = P(R) ?
• P(L | R) = P(L) ?


Conditional independence

Once you know who the lecturer is, then whether they arrive late doesn’t affect whether the lecture concerns robots.

P(R | M, L) = P(R | M) and
P(R | ~M, L) = P(R | ~M)

We express this in the following way:

"R and L are conditionally independent given M"

…which is also notated by the following diagram.

[Diagram: L ← M → R]

Given knowledge of M, knowing anything else in the diagram won't help us with L, etc.


Conditional Independence formalized

R and L are conditionally independent given M if

for all x,y,z in {T,F}:

P(R=x | M=y ^ L=z) = P(R=x | M=y)

More generally:

Let S1 and S2 and S3 be sets of variables.

Set-of-variables S1 and set-of-variables S2 are conditionally independent given S3 if for all assignments of values to the variables in the sets,

P(S1's assignments | S2's assignments & S3's assignments)
= P(S1's assignments | S3's assignments)


Example:

"Shoe-size is conditionally independent of Glove-size given height, weight and age"

means

for all s, g, h, w, a:
P(ShoeSize=s | Height=h, Weight=w, Age=a) = P(ShoeSize=s | Height=h, Weight=w, Age=a, GloveSize=g)


Example:

"Shoe-size is conditionally independent of Glove-size given height, weight and age"

does not mean

for all s, g, h:
P(ShoeSize=s | Height=h) = P(ShoeSize=s | Height=h, GloveSize=g)


Conditional independence

[Diagram: L ← M → R]

We can write down P(M). And then, since we know L is only directly influenced by M, we can write down the values of P(L | M) and P(L | ~M) and know we've fully specified L's behavior. Ditto for R.

P(M) = 0.6
P(L | M) = 0.085
P(L | ~M) = 0.17
P(R | M) = 0.3
P(R | ~M) = 0.6

'R and L conditionally independent given M'


Conditional independence

[Diagram: L ← M → R]

P(M) = 0.6
P(L | M) = 0.085
P(L | ~M) = 0.17
P(R | M) = 0.3
P(R | ~M) = 0.6

Conditional Independence:
P(R | M, L) = P(R | M)
P(R | ~M, L) = P(R | ~M)

Again, we can obtain any member of the Joint prob dist that we desire:

P(L=x ^ R=y ^ M=z) = P(M=z) P(L=x | M=z) P(R=y | M=z)


Assume five variables

T: The lecture started by 10:35
L: The lecturer arrives late
R: The lecture concerns robots
M: The lecturer is Manuela
S: It is sunny

• T only directly influenced by L (i.e. T is conditionally independent of R,M,S given L)

• L only directly influenced by M and S (i.e. L is conditionally independent of R given M & S)

• R only directly influenced by M (i.e. R is conditionally independent of L,S, given M)

• M and S are independent


Making a Bayes net

Step One: add variables.
• Just choose the variables you'd like to be included in the net.

[Diagram: nodes S, M, R, L, T]

T: The lecture started by 10:35
L: The lecturer arrives late
R: The lecture concerns robots
M: The lecturer is Manuela
S: It is sunny


Making a Bayes net

Step Two: add links.
• The link structure must be acyclic.
• If node X is given parents Q1, Q2, …, Qn you are promising that any variable that's a non-descendant of X is conditionally independent of X given {Q1, Q2, …, Qn}

[Diagram: S → L, M → L, M → R, L → T]

T: The lecture started by 10:35
L: The lecturer arrives late
R: The lecture concerns robots
M: The lecturer is Manuela
S: It is sunny


Making a Bayes net

Step Three: add a probability table for each node.
• The table for node X must list P(X | Parent Values) for each possible combination of parent values

[Diagram: S → L, M → L, M → R, L → T]

P(S) = 0.3
P(M) = 0.6
P(R | M) = 0.3, P(R | ~M) = 0.6
P(T | L) = 0.3, P(T | ~L) = 0.8
P(L | M^S) = 0.05, P(L | M^~S) = 0.1, P(L | ~M^S) = 0.1, P(L | ~M^~S) = 0.2

T: The lecture started by 10:35
L: The lecturer arrives late
R: The lecture concerns robots
M: The lecturer is Manuela
S: It is sunny


Making a Bayes net

[Diagram and tables as on the previous slide]

• Two unconnected variables may still be correlated
• Each node is conditionally independent of all non-descendants in the tree, given its parents.
• You can deduce many other conditional independence relations from a Bayes net. See the next lecture.



Building a Bayes Net

1. Choose a set of relevant variables.
2. Choose an ordering for them.
3. Assume they're called X1 .. Xm (where X1 is the first in the ordering, X2 is the second, etc)
4. For i = 1 to m:
   1. Add the Xi node to the network
   2. Set Parents(Xi) to be a minimal subset of {X1…Xi-1} such that we have conditional independence of Xi and all other members of {X1…Xi-1} given Parents(Xi)
   3. Define the probability table of P(Xi = k | Assignments of Parents(Xi)).


Example Bayes Net Building

Suppose we're building a nuclear power station. There are the following random variables:

GRL : Gauge Reads Low.
CTL : Core temperature is low.
FG : Gauge is faulty.
FA : Alarm is faulty
AS : Alarm sounds

• If alarm working properly, the alarm is meant to sound if the gauge stops reading a low temp.

• If gauge working properly, the gauge is meant to read the temp of the core.


Computing a Joint Entry

How to compute an entry in a joint distribution? E.G.: What is P(S ^ ~M ^ L ^ ~R ^ T)?

[Diagram: S → L, M → L, M → R, L → T]

P(S) = 0.3
P(M) = 0.6
P(R | M) = 0.3, P(R | ~M) = 0.6
P(T | L) = 0.3, P(T | ~L) = 0.8
P(L | M^S) = 0.05, P(L | M^~S) = 0.1, P(L | ~M^S) = 0.1, P(L | ~M^~S) = 0.2


Computing with Bayes Net

P(T ^ ~R ^ L ^ ~M ^ S)
= P(T | ~R ^ L ^ ~M ^ S) * P(~R ^ L ^ ~M ^ S)
= P(T | L) * P(~R ^ L ^ ~M ^ S)
= P(T | L) * P(~R | L ^ ~M ^ S) * P(L ^ ~M ^ S)
= P(T | L) * P(~R | ~M) * P(L ^ ~M ^ S)
= P(T | L) * P(~R | ~M) * P(L | ~M ^ S) * P(~M ^ S)
= P(T | L) * P(~R | ~M) * P(L | ~M ^ S) * P(~M | S) * P(S)
= P(T | L) * P(~R | ~M) * P(L | ~M ^ S) * P(~M) * P(S)

[Diagram: S → L, M → L, M → R, L → T, with the same tables as before]
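Plugging the slide's CPT numbers into the final line gives the entry directly; a one-line check:

    # P(T ^ ~R ^ L ^ ~M ^ S)
    #   = P(T|L) * P(~R|~M) * P(L|~M^S) * P(~M) * P(S)
    p = 0.3 * (1 - 0.6) * 0.1 * (1 - 0.6) * 0.3  # = 0.00144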


The general case

P(X_1=x_1 ^ X_2=x_2 ^ … ^ X_{n-1}=x_{n-1} ^ X_n=x_n)
= P(X_n=x_n | X_{n-1}=x_{n-1} ^ … ^ X_2=x_2 ^ X_1=x_1) * P(X_{n-1}=x_{n-1} ^ … ^ X_2=x_2 ^ X_1=x_1)
= P(X_n=x_n | X_{n-1}=x_{n-1} ^ … ^ X_1=x_1) * P(X_{n-1}=x_{n-1} | X_{n-2}=x_{n-2} ^ … ^ X_1=x_1) * P(X_{n-2}=x_{n-2} ^ … ^ X_1=x_1)
:
= \prod_{i=1}^{n} P(X_i=x_i | X_{i-1}=x_{i-1} ^ … ^ X_1=x_1)
= \prod_{i=1}^{n} P(X_i=x_i | Assignments of Parents(X_i))

So any entry in the joint pdf table can be computed. And so any conditional probability can be computed.
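A generic sketch of that product (the data layout, dicts for the assignment, parent lists, and CPTs, is my own illustration, not from the slides):

    def joint_entry(assignment, parents, cpt):
        # prod_i P(X_i = x_i | assignments of Parents(X_i)).
        # assignment: var name -> bool; parents[v]: tuple of v's parents;
        # cpt[v]: parent-value tuple -> P(v = True | those parent values).
        p = 1.0
        for var, value in assignment.items():
            key = tuple(assignment[q] for q in parents[var])
            p_true = cpt[var][key]
            p *= p_true if value else 1 - p_true
        return p

Root nodes fit the same scheme with parents[v] = () and cpt[v][()] set to the prior.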
