
Outline

• Introduction

• Bayesian Probability Theory

– Bayes rule

– Bayes rule applied to learning

– Bayesian learning and the MDL principle

• Sequence Prediction and Data Compression

• Bayesian Networks


Medical Diagnosis

Suppose in a population only 8 in 1000 people have a certain cancer. We have a lab test for the cancer that returns

• a positive result in 98% of the cases in which the cancer is actually present; and

• a negative result in 97% of the cases in which the cancer is not present.

Suppose we now observe a new patient for whom the lab test returns a positive result. Should we diagnose the patient as having cancer or not?


Monty Hall Problem

Suppose you’re on a game show, and you’re given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say No. 1, and the host, who knows what’s behind the doors, opens another door, say No. 3, which has a goat. He then says to you, “Do you want to pick door No. 2?” Is it to your advantage to switch your choice?


Bayesian Probability Theory

• Probability denotes degree of belief.

• Prescribes a rational approach to changing one’s belief upon seeing new evidence.

• A central concept is the Bayes Rule, named after Reverend Thomas Bayes (1702-1761).


Bayes Rule

Let A and B be events/propositions. Then

Pr(A|B) = Pr(B|A)Pr(A)/Pr(B).

The derivation is simple:

• By definition of conditional probability,

Pr(A|B) = Pr(A ∧ B)/Pr(B) and Pr(B|A) = Pr(A ∧ B)/Pr(A).

• Thus Pr(A|B)Pr(B) = Pr(A ∧ B) = Pr(B|A)Pr(A).

• Dividing by Pr(B) yields

Pr(A|B) = Pr(A ∧ B)/Pr(B) = Pr(B|A)Pr(A)/Pr(B).
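To make the rule concrete, here is a minimal Python sketch (mine, not from the slides) that applies it to two events with made-up illustrative numbers:

    def bayes_rule(p_b_given_a, p_a, p_b):
        """Pr(A|B) = Pr(B|A) Pr(A) / Pr(B)."""
        return p_b_given_a * p_a / p_b

    # Illustrative numbers: Pr(A) = 0.01, Pr(B|A) = 0.9, Pr(B) = 0.05.
    print(bayes_rule(0.9, 0.01, 0.05))  # Pr(A|B) = 0.18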


Application to Medical Diagnosis

Prior: Pr(cancer) = 0.008, Pr(¬cancer) = 0.992

Lab test:

Pr(posTest|cancer) = 0.98 Pr(negTest|cancer) = 0.02

Pr(posTest|¬cancer) = 0.03 Pr(negTest|¬cancer) = 0.97

Posterior probabilities:

Pr(cancer|posTest) = Pr(posTest|cancer)Pr(cancer)/Pr(posTest)
= 0.98 · 0.008/Pr(posTest) = 0.0078 · K

Pr(¬cancer|posTest) = Pr(posTest|¬cancer)Pr(¬cancer)/Pr(posTest)
= 0.03 · 0.992/Pr(posTest) = 0.0298 · K

Here K = 1/Pr(posTest) is a normalising constant. Since the two posteriors must sum to 1, Pr(cancer|posTest) = 0.0078/(0.0078 + 0.0298) ≈ 0.21. Even after a positive test, cancer remains the less likely explanation.
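A small Python sketch (mine, not the slides’) reproducing this calculation, with the normalisation made explicit:

    def posterior(prior, p_pos_given_true, p_pos_given_false):
        """Pr(cancer|posTest) via Bayes rule with explicit normalisation."""
        num_true = p_pos_given_true * prior          # 0.98 * 0.008 ≈ 0.0078
        num_false = p_pos_given_false * (1 - prior)  # 0.03 * 0.992 ≈ 0.0298
        return num_true / (num_true + num_false)     # divide by Pr(posTest)

    print(posterior(0.008, 0.98, 0.03))  # ≈ 0.21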


A Second Test

New prior after first test: Pr(cancer) = 0.207, Pr(¬cancer) = 0.793

Second lab test:

Pr(posTest2|cancer) = 0.8 Pr(negTest2|cancer) = 0.2

Pr(posTest2|¬cancer) = 0.1 Pr(negTest2|¬cancer) = 0.9

Posterior probabilities after second positive test:

Pr(cancer|posTest2) = Pr(posTest2|cancer)Pr(cancer)/Pr(posTest2)
= 0.8 · 0.207/Pr(posTest2) = 0.1656 · K2

Pr(¬cancer|posTest2) = Pr(posTest2|¬cancer)Pr(¬cancer)/Pr(posTest2)
= 0.1 · 0.793/Pr(posTest2) = 0.0793 · K2

Here K2 = 1/Pr(posTest2). Normalising as before, Pr(cancer|posTest2) = 0.1656/(0.1656 + 0.0793) ≈ 0.68: after two positive tests, cancer is more likely than not.
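The two tests chain naturally in code, with each posterior becoming the next prior. A self-contained sketch, assuming (as the slides implicitly do) that the tests are conditionally independent given the disease status:

    def posterior(prior, p_pos_true, p_pos_false):
        num = p_pos_true * prior
        return num / (num + p_pos_false * (1 - prior))

    p = 0.008                      # base rate of the cancer
    p = posterior(p, 0.98, 0.03)   # first positive test  -> ≈ 0.21
    p = posterior(p, 0.80, 0.10)   # second positive test -> ≈ 0.68
    print(p)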


Some Lessons

• One positive test alone left the probability of cancer at about 21%; a second positive test raised it to about 68%. That is why doctors order multiple tests.

• Probabilistic reasoning can appear counter-intuitive (in other words, we are not very good at probabilistic reasoning).


Application to Monty Hall Problem

Let Ci denote “car is behind door i”. Prior probability: Pr(C1) = Pr(C2) = Pr(C3) = 1/3.

Suppose we pick door 1 and the host then reveals a goat behind door 2 (event denoted by H2). If the car is behind door 1, the host could have opened door 2 or door 3, so Pr(H2|C1) = 1/2; if the car is behind door 3, the host must open door 2, so Pr(H2|C3) = 1.

Pr(C1|H2) = Pr(H2|C1)Pr(C1)/Pr(H2) = (1/2 · 1/3)/Pr(H2) = (1/6)/Pr(H2)

Pr(C3|H2) = Pr(H2|C3)Pr(C3)/Pr(H2) = (1 · 1/3)/Pr(H2) = (1/3)/Pr(H2)

Since Pr(C3|H2) > Pr(C1|H2), we should switch!
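A quick Monte Carlo check in Python (my sketch, not part of the slides) that switching wins about 2/3 of the time:

    import random

    def monty_hall(switch, trials=100_000):
        wins = 0
        for _ in range(trials):
            car = random.randrange(3)
            pick = random.randrange(3)
            # Host opens a door that is neither our pick nor the car
            # (deterministic choice here; it does not affect the win rate).
            opened = next(d for d in range(3) if d != pick and d != car)
            if switch:
                pick = next(d for d in range(3) if d != pick and d != opened)
            wins += (pick == car)
        return wins / trials

    print(monty_hall(switch=True))   # ≈ 0.667
    print(monty_hall(switch=False))  # ≈ 0.333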


Bayes Rule Applied to Learning

• Let H be a class of functions.

• Suppose we have seen data D.

• Given h ∈ H, what is Pr(h|D), i.e. the posterior probability of h given that we have seen the data D?

By Bayes rule,

Pr(h|D) = Pr(D|h)Pr(h)/Pr(D)

Pr(h) is the prior probability of h
Pr(D|h) is the likelihood of h


Bayes rule says the posterior probability of h is proportional to the prior probability of h times the likelihood of h.

Pr(h) incorporates our background knowledge.
Pr(D|h) can be simplified by making assumptions about the data-generation process.

For example, if D = [(x1,y1), (x2,y2), . . . , (xn,yn)] are generated independently at random, then

Pr(D|h) = ∏_{i=1}^{n} Pr((xi,yi)|h) = ∏_{i=1}^{n} Pr(h(xi) = yi).


MAP Estimator

A learning algorithm.
Input: H, D, and prior probability Pr(h)

1. For each function h ∈ H, calculate the posterior

Pr(h|D) = Pr(D|h)Pr(h)/Pr(D).

2. Output the function hMAP with the highest posterior

hMAP = argmax_{h ∈ H} Pr(h|D).

MAP = Maximum a posteriori
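A minimal Python sketch of this procedure for a finite hypothesis class, using the i.i.d. likelihood from the previous slide. The coin-flip hypotheses, prior, and data are illustrative inventions, not from the slides:

    # Each hypothesis is a predicted probability that a coin flip lands heads.
    hypotheses = {"fair": 0.5, "biased": 0.8}
    prior = {"fair": 0.7, "biased": 0.3}
    data = [1, 1, 0, 1, 1, 1]  # 1 = heads, 0 = tails

    def likelihood(p_heads, data):
        """Pr(D|h) for i.i.d. flips: product of per-flip probabilities."""
        result = 1.0
        for x in data:
            result *= p_heads if x == 1 else 1 - p_heads
        return result

    # Unnormalised posterior Pr(h|D) ∝ Pr(D|h) Pr(h); its argmax is hMAP.
    scores = {h: likelihood(p, data) * prior[h] for h, p in hypotheses.items()}
    h_map = max(scores, key=scores.get)
    print(h_map, scores)  # "biased" wins on this data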


Relationship to MDL Principle

• MDL = Minimum Description Length

• A formal version of Occam’s razor: choose the shortest explanation for the observed data.

• We have

hMAP = argmax_{h ∈ H} Pr(h)Pr(D|h)

= argmax_{h ∈ H} [log2 Pr(h) + log2 Pr(D|h)]

= argmin_{h ∈ H} [− log2 Pr(h) − log2 Pr(D|h)]

• The last line can be interpreted as saying hMAP is the shortest hypothesis that also describes the data well.
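Continuing the illustrative coin-flip numbers from the MAP sketch above, one can verify numerically that maximising the posterior and minimising the total code length select the same hypothesis:

    import math

    prior = {"fair": 0.7, "biased": 0.3}
    likelihood = {"fair": 0.5**6, "biased": 0.8**5 * 0.2}  # Pr(D|h) as before

    # Total description length: bits for h plus bits for D given h.
    bits = {h: -math.log2(prior[h]) - math.log2(likelihood[h]) for h in prior}
    posterior = {h: prior[h] * likelihood[h] for h in prior}

    assert max(posterior, key=posterior.get) == min(bits, key=bits.get)
    print(bits)  # the MAP hypothesis has the shortest total code length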


A Short Detour on Information Theory

• Consider the problem of designing a code to transmit messages drawn at random, where the probability of drawing message i is Pr(i).

• We want a code that minimises the expected number of bits we need to transmit.

• This implies assigning shorter codes to messages that are more probable.

• Shannon and Weaver (1949) showed that the optimal code assigns − log2 Pr(i) bits to encode message i.
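For instance, a sketch computing these optimal code lengths and the resulting expected bits per message (the message distribution is made up for illustration; the expected value is the Shannon entropy):

    import math

    pr = {"a": 0.5, "b": 0.25, "c": 0.25}  # hypothetical message distribution

    # Optimal code length for message i is -log2 Pr(i) bits.
    lengths = {m: -math.log2(p) for m, p in pr.items()}
    expected_bits = sum(p * lengths[m] for m, p in pr.items())

    print(lengths)        # {'a': 1.0, 'b': 2.0, 'c': 2.0}
    print(expected_bits)  # 1.5 bits per message on average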


Relationship to MDL Principle

We have

hMAP = argmin_{h ∈ H} [− log2 Pr(h) − log2 Pr(D|h)]

Information-theoretic interpretation:

• − log2 Pr(h) is the code length of h under the optimal code (wrt the prior) for hypothesis space H;

• − log2 Pr(D|h) is the code length of the observed data D under the optimal code wrt the conditional probability of the data given hypothesis h.

• Instead of transmitting D, we transmit a function h plus information on how to reconstruct D from h.


Bayes Optimal Estimator

• The hypothesis hMAP is not the best estimator!

• Consider H = {h1, h2, h3}.

• Suppose

Pr(h1|D) = 0.4, Pr(h2|D) = 0.3, Pr(h3|D) = 0.3.

• Given x, suppose further that

h1(x) = pos, h2(x) = neg, h3(x) = neg.

• We have hMAP(x) = h1(x) = pos.

• But taking h1, h2, h3 into account, x is positive with probability 0.4 and negative with probability 0.6.
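This weighted vote is easy to express in Python; a sketch of the slide’s example (the variable names are mine):

    from collections import defaultdict

    posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
    prediction = {"h1": "pos", "h2": "neg", "h3": "neg"}  # each h applied to x

    # Sum the posterior mass behind each label and take the argmax.
    votes = defaultdict(float)
    for h, p in posterior.items():
        votes[prediction[h]] += p

    print(max(votes, key=votes.get), dict(votes))  # neg {'pos': 0.4, 'neg': 0.6}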


Bayes Optimal Estimator

The best estimate of the label of x given data D and function class H is

argmax_{y ∈ Y} Σ_{hi ∈ H} Pr(hi|D) I(hi(x) = y)    (3)

• Any system that predicts using (3) is called a Bayes optimal estimator.

• On average, no other prediction method that uses the same function class H and same prior can outperform a Bayes optimal estimator.

• A Bayes optimal estimator may not be in H itself.


Bayes, Occam, Epicurus

The best estimate of the label of x given data D and function class H is

argmax_{y ∈ Y} Σ_{hi ∈ H} Pr(hi|D) I(hi(x) = y)    (4)

• Epicurus’ (342 BC – 270 BC) principle of multiple explanations:

keep all hypotheses that are consistent with the data.

• Occam’s razor (William of Ockham, 1285 – 1349):

entities should not be multiplied beyond necessity.

Widely understood to mean: among all hypotheses consistent with the observations, choose the simplest.

• The Bayes optimal estimator strikes a balance between the two principles.


Problems Solved with Bayesian Analysis

• A good prior allows us to cope with small sample sizes.

– If the sample size is large, then the prior is overwhelmed by the likelihood, which is determined entirely by the data.

• A Bayes optimal estimator weighs and combines the predictions of multiple plausible hypotheses.


Problems with Bayesian Analysis

• The prescribed solution is often computationally intractable.

• Choosing a good prior Pr(h) is not always straightforward.


Test Your Knowledge: Exercise I

You have been called to jury duty in a town where there are two taxi companies, Green Cabs Ltd. and Blue Taxi Inc. Blue Taxi uses cars painted blue; Green Cabs uses green cars. Green Cabs dominates the market, with 85% of the taxis on the road. On a misty winter night a taxi sideswiped another car and drove off. A witness says it was a blue cab.

The witness is tested under conditions like those on the night of the accident, and 80% of the time she correctly reports the colour of the cab that is seen.

What is the probability that the taxi that caused the accident is blue?
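If you want to check your answer afterwards, the calculation is a direct instance of the medical-diagnosis pattern. A Python sketch (spoiler: it prints the answer):

    # Pr(blue | witness says blue), by Bayes rule.
    p_blue = 0.15                  # prior: 15% of taxis are blue
    p_say_blue_if_blue = 0.80      # witness accuracy
    p_say_blue_if_green = 0.20     # witness error rate

    num = p_say_blue_if_blue * p_blue
    print(num / (num + p_say_blue_if_green * (1 - p_blue)))  # ≈ 0.41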


Test Your Knowledge: Exercise II

You are a physician. You think it is quite likely (say 60% probability) that one of your patients has strep throat, but you aren’t sure. You take some swabs from the throat and send them to a lab for testing. The test is (like nearly all lab tests) not perfect. If the patient has strep throat, then 70% of the time the lab says YES. But 30% of the time it says NO. If, however, the patient does not have strep throat, then 90% of the time the lab says NO. But 10% of the time it says YES. You send five successive swabs to the lab, from the same patient. You get back these results, in order: YES, NO, YES, NO, YES.

From that, which of the following can you conclude and why?

1. The results are worthless.

2. It is likely that the patient does not have strep throat.

3. It is more likely than not that the patient does have strep throat.
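Once you have reasoned it through, a sequential Bayes update in Python can check your conclusion (assuming, as in the second-test slide, that the five results are conditionally independent given the patient’s true condition):

    def update(prior, p_yes_if_strep, p_yes_if_not, says_yes):
        """One Bayes update on a YES/NO lab result."""
        l_strep = p_yes_if_strep if says_yes else 1 - p_yes_if_strep
        l_not = p_yes_if_not if says_yes else 1 - p_yes_if_not
        num = l_strep * prior
        return num / (num + l_not * (1 - prior))

    p = 0.6  # initial degree of belief in strep throat
    for result in [True, False, True, False, True]:  # YES, NO, YES, NO, YES
        p = update(p, 0.7, 0.1, result)
    print(p)  # ≈ 0.98, well above 0.5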


Session Review

• Bayes rule is a useful tool for problems of probabilistic reasoning.

• The maximum a posteriori (MAP) estimator can be interpreted as choosing the hypothesis that gives the shortest description of the data.

• A weighted average of the models (the Bayes optimal estimator) is better than the MAP estimator.


Review of Session 2

• Bayes optimal estimators are what we want ideally, but they are usually very hard to compute.

• We next look at a (rare) case where the Bayes optimal estimator can be computed cheaply.

Copyright 2015 Kee Siong Ng. All rights reserved.