CISC 4631 Data Mining Lecture 06: Bayes Theorem Theses slides are based on the slides by Tan,...
-
Upload
angela-small -
Category
Documents
-
view
213 -
download
0
Transcript of CISC 4631 Data Mining Lecture 06: Bayes Theorem Theses slides are based on the slides by Tan,...
![Page 1: CISC 4631 Data Mining Lecture 06: Bayes Theorem Theses slides are based on the slides by Tan, Steinbach and Kumar (textbook authors) Eamonn Koegh (UC Riverside)](https://reader036.fdocuments.in/reader036/viewer/2022062800/56649e045503460f94af0a32/html5/thumbnails/1.jpg)
1
CISC 4631Data Mining
Lecture 06:• Bayes Theorem
Theses slides are based on the slides by • Tan, Steinbach and Kumar (textbook authors)• Eamonn Koegh (UC Riverside)• Andrew Moore (CMU/Google)
![Page 2: CISC 4631 Data Mining Lecture 06: Bayes Theorem Theses slides are based on the slides by Tan, Steinbach and Kumar (textbook authors) Eamonn Koegh (UC Riverside)](https://reader036.fdocuments.in/reader036/viewer/2022062800/56649e045503460f94af0a32/html5/thumbnails/2.jpg)
2
Naïve Bayes Classifier
We will start off with a visual intuition, before looking at the math…
Thomas Bayes1702 - 1761
![Page 3: CISC 4631 Data Mining Lecture 06: Bayes Theorem Theses slides are based on the slides by Tan, Steinbach and Kumar (textbook authors) Eamonn Koegh (UC Riverside)](https://reader036.fdocuments.in/reader036/viewer/2022062800/56649e045503460f94af0a32/html5/thumbnails/3.jpg)
3
Ante
nna
Len
gth
10
1 2 3 4 5 6 7 8 9 10
123456789
Grasshoppers Katydids
Abdomen Length
Remember this example? Let’s get lots more data…
Remember this example? Let’s get lots more data…
![Page 4: CISC 4631 Data Mining Lecture 06: Bayes Theorem Theses slides are based on the slides by Tan, Steinbach and Kumar (textbook authors) Eamonn Koegh (UC Riverside)](https://reader036.fdocuments.in/reader036/viewer/2022062800/56649e045503460f94af0a32/html5/thumbnails/4.jpg)
4
Ante
nna
Len
gth
10
1 2 3 4 5 6 7 8 9 10
123456789
KatydidsGrasshoppers
With a lot of data, we can build a histogram. Let us just build one for “Antenna Length” for now…
![Page 5: CISC 4631 Data Mining Lecture 06: Bayes Theorem Theses slides are based on the slides by Tan, Steinbach and Kumar (textbook authors) Eamonn Koegh (UC Riverside)](https://reader036.fdocuments.in/reader036/viewer/2022062800/56649e045503460f94af0a32/html5/thumbnails/5.jpg)
5
We can leave the histograms as they are, or we can summarize them with two normal distributions.
Let us us two normal distributions for ease of visualization in the following slides…
![Page 6: CISC 4631 Data Mining Lecture 06: Bayes Theorem Theses slides are based on the slides by Tan, Steinbach and Kumar (textbook authors) Eamonn Koegh (UC Riverside)](https://reader036.fdocuments.in/reader036/viewer/2022062800/56649e045503460f94af0a32/html5/thumbnails/6.jpg)
6
p(cj | d) = probability of class cj, given that we have observed dp(cj | d) = probability of class cj, given that we have observed d
3
Antennae length is 3
• We want to classify an insect we have found. Its antennae are 3 units long. How can we classify it?
• We can just ask ourselves, give the distributions of antennae lengths we have seen, is it more probable that our insect is a Grasshopper or a Katydid.• There is a formal way to discuss the most probable classification…
![Page 7: CISC 4631 Data Mining Lecture 06: Bayes Theorem Theses slides are based on the slides by Tan, Steinbach and Kumar (textbook authors) Eamonn Koegh (UC Riverside)](https://reader036.fdocuments.in/reader036/viewer/2022062800/56649e045503460f94af0a32/html5/thumbnails/7.jpg)
7
Bayes Classifier• A probabilistic framework for classification problems• Often appropriate because the world is noisy and also some
relationships are probabilistic in nature– Is predicting who will win a baseball game probabilistic in
nature?• Before getting the heart of the matter, we will go over some
basic probability.• We will review the concept of reasoning with uncertainty also
known as probability– This is a fundamental building block for understanding how Bayesian
classifiers work– It’s really going to be worth it – You may find a few of these basic probability questions on your exam– Stop me if you have questions!!!!
![Page 8: CISC 4631 Data Mining Lecture 06: Bayes Theorem Theses slides are based on the slides by Tan, Steinbach and Kumar (textbook authors) Eamonn Koegh (UC Riverside)](https://reader036.fdocuments.in/reader036/viewer/2022062800/56649e045503460f94af0a32/html5/thumbnails/8.jpg)
8
Discrete Random Variables
• A is a Boolean-valued random variable if A denotes an event, and there is some degree of uncertainty as to whether A occurs.
• Examples– A = The next patient you examine is suffering from inhalational
anthrax– A = The next patient you examine has a cough– A = There is an active terrorist cell in your city
![Page 9: CISC 4631 Data Mining Lecture 06: Bayes Theorem Theses slides are based on the slides by Tan, Steinbach and Kumar (textbook authors) Eamonn Koegh (UC Riverside)](https://reader036.fdocuments.in/reader036/viewer/2022062800/56649e045503460f94af0a32/html5/thumbnails/9.jpg)
9
Probabilities
• We write P(A) as “the fraction of possible worlds in which A is true”
• We could at this point spend 2 hours on the philosophy of this.
• But we won’t.
![Page 10: CISC 4631 Data Mining Lecture 06: Bayes Theorem Theses slides are based on the slides by Tan, Steinbach and Kumar (textbook authors) Eamonn Koegh (UC Riverside)](https://reader036.fdocuments.in/reader036/viewer/2022062800/56649e045503460f94af0a32/html5/thumbnails/10.jpg)
10
Visualizing A
Event space of all possible worlds
Its area is 1Worlds in which A is False
Worlds in which A is true
P(A) = Area ofreddish oval
![Page 11: CISC 4631 Data Mining Lecture 06: Bayes Theorem Theses slides are based on the slides by Tan, Steinbach and Kumar (textbook authors) Eamonn Koegh (UC Riverside)](https://reader036.fdocuments.in/reader036/viewer/2022062800/56649e045503460f94af0a32/html5/thumbnails/11.jpg)
11
The Axioms Of Probability• 0 <= P(A) <= 1• P(True) = 1• P(False) = 0• P(A or B) = P(A) + P(B) - P(A and B)
The area of A can’t get any smaller than 0
And a zero area would mean no world could ever have A true
![Page 12: CISC 4631 Data Mining Lecture 06: Bayes Theorem Theses slides are based on the slides by Tan, Steinbach and Kumar (textbook authors) Eamonn Koegh (UC Riverside)](https://reader036.fdocuments.in/reader036/viewer/2022062800/56649e045503460f94af0a32/html5/thumbnails/12.jpg)
12
Interpreting the axioms• 0 <= P(A) <= 1• P(True) = 1• P(False) = 0• P(A or B) = P(A) + P(B) - P(A and B)
The area of A can’t get any bigger than 1
And an area of 1 would mean all worlds will have A true
![Page 13: CISC 4631 Data Mining Lecture 06: Bayes Theorem Theses slides are based on the slides by Tan, Steinbach and Kumar (textbook authors) Eamonn Koegh (UC Riverside)](https://reader036.fdocuments.in/reader036/viewer/2022062800/56649e045503460f94af0a32/html5/thumbnails/13.jpg)
13
Interpreting the axioms• 0 <= P(A) <= 1• P(True) = 1• P(False) = 0• P(A or B) = P(A) + P(B) - P(A and B)
A
B
![Page 14: CISC 4631 Data Mining Lecture 06: Bayes Theorem Theses slides are based on the slides by Tan, Steinbach and Kumar (textbook authors) Eamonn Koegh (UC Riverside)](https://reader036.fdocuments.in/reader036/viewer/2022062800/56649e045503460f94af0a32/html5/thumbnails/14.jpg)
14
A
B
Interpreting the axioms• 0 <= P(A) <= 1• P(True) = 1• P(False) = 0• P(A or B) = P(A) + P(B) - P(A and B)
P(A or B)
BP(A and B)
Simple addition and subtraction
![Page 15: CISC 4631 Data Mining Lecture 06: Bayes Theorem Theses slides are based on the slides by Tan, Steinbach and Kumar (textbook authors) Eamonn Koegh (UC Riverside)](https://reader036.fdocuments.in/reader036/viewer/2022062800/56649e045503460f94af0a32/html5/thumbnails/15.jpg)
15
Another important theorem
• 0 <= P(A) <= 1, P(True) = 1, P(False) = 0• P(A or B) = P(A) + P(B) - P(A and B)
From these we can prove:P(A) = P(A and B) + P(A and not B)
A B
![Page 16: CISC 4631 Data Mining Lecture 06: Bayes Theorem Theses slides are based on the slides by Tan, Steinbach and Kumar (textbook authors) Eamonn Koegh (UC Riverside)](https://reader036.fdocuments.in/reader036/viewer/2022062800/56649e045503460f94af0a32/html5/thumbnails/16.jpg)
16
Conditional Probability
• P(A|B) = Fraction of worlds in which B is true that also have A true
F
H
H = “Have a headache”F = “Coming down with Flu”
P(H) = 1/10P(F) = 1/40P(H|F) = 1/2
“Headaches are rare and flu is rarer, but if you’re coming down with ‘flu there’s a 50-50 chance you’ll have a headache.”
![Page 17: CISC 4631 Data Mining Lecture 06: Bayes Theorem Theses slides are based on the slides by Tan, Steinbach and Kumar (textbook authors) Eamonn Koegh (UC Riverside)](https://reader036.fdocuments.in/reader036/viewer/2022062800/56649e045503460f94af0a32/html5/thumbnails/17.jpg)
17
Conditional Probability
F
H
H = “Have a headache”F = “Coming down with Flu”
P(H) = 1/10P(F) = 1/40P(H|F) = 1/2
P(H|F) = Fraction of flu-inflicted worlds in which you have a headache
= #worlds with flu and headache ------------------------------------ #worlds with flu
= Area of “H and F” region ------------------------------ Area of “F” region
= P(H and F) --------------- P(F)
![Page 18: CISC 4631 Data Mining Lecture 06: Bayes Theorem Theses slides are based on the slides by Tan, Steinbach and Kumar (textbook authors) Eamonn Koegh (UC Riverside)](https://reader036.fdocuments.in/reader036/viewer/2022062800/56649e045503460f94af0a32/html5/thumbnails/18.jpg)
18
Definition of Conditional Probability P(A and B) P(A|B) = ----------- P(B)
Corollary: The Chain Rule
P(A and B) = P(A|B) P(B)
![Page 19: CISC 4631 Data Mining Lecture 06: Bayes Theorem Theses slides are based on the slides by Tan, Steinbach and Kumar (textbook authors) Eamonn Koegh (UC Riverside)](https://reader036.fdocuments.in/reader036/viewer/2022062800/56649e045503460f94af0a32/html5/thumbnails/19.jpg)
19
Probabilistic Inference
F
H
H = “Have a headache”F = “Coming down with Flu”
P(H) = 1/10P(F) = 1/40P(H|F) = 1/2
One day you wake up with a headache. You think: “Drat! 50% of flus are associated with headaches so I must have a 50-50 chance of coming down with flu”
Is this reasoning good?
![Page 20: CISC 4631 Data Mining Lecture 06: Bayes Theorem Theses slides are based on the slides by Tan, Steinbach and Kumar (textbook authors) Eamonn Koegh (UC Riverside)](https://reader036.fdocuments.in/reader036/viewer/2022062800/56649e045503460f94af0a32/html5/thumbnails/20.jpg)
20
Probabilistic Inference
F
H
H = “Have a headache”F = “Coming down with Flu”
P(H) = 1/10P(F) = 1/40P(H|F) = 1/2
P(F and H) = …
P(F|H) = …
![Page 21: CISC 4631 Data Mining Lecture 06: Bayes Theorem Theses slides are based on the slides by Tan, Steinbach and Kumar (textbook authors) Eamonn Koegh (UC Riverside)](https://reader036.fdocuments.in/reader036/viewer/2022062800/56649e045503460f94af0a32/html5/thumbnails/21.jpg)
21
Probabilistic Inference
F
H
H = “Have a headache”F = “Coming down with Flu”
P(H) = 1/10P(F) = 1/40P(H|F) = 1/2
8
1
10180
1
)(
) and ()|(
HP
HFPHFP
80
1
40
1
2
1)()|() and ( FPFHPHFP
![Page 22: CISC 4631 Data Mining Lecture 06: Bayes Theorem Theses slides are based on the slides by Tan, Steinbach and Kumar (textbook authors) Eamonn Koegh (UC Riverside)](https://reader036.fdocuments.in/reader036/viewer/2022062800/56649e045503460f94af0a32/html5/thumbnails/22.jpg)
22
What we just did…
P(A & B) P(A|B) P(B)P(B|A) = ----------- = --------------- P(A) P(A)
This is Bayes Rule
Bayes, Thomas (1763) An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53:370-418
![Page 23: CISC 4631 Data Mining Lecture 06: Bayes Theorem Theses slides are based on the slides by Tan, Steinbach and Kumar (textbook authors) Eamonn Koegh (UC Riverside)](https://reader036.fdocuments.in/reader036/viewer/2022062800/56649e045503460f94af0a32/html5/thumbnails/23.jpg)
23
Some more terminology
• The Prior Probability is the probability assuming no specific information. – Thus we would refer to P(A) as the prior probability of
even A occurring– We would not say that P(A|C) is the prior probability of A
occurring
• The Posterior probability is the probability given that we know something– We would say that P(A|C) is the posterior probability of A
(given that C occurs)
![Page 24: CISC 4631 Data Mining Lecture 06: Bayes Theorem Theses slides are based on the slides by Tan, Steinbach and Kumar (textbook authors) Eamonn Koegh (UC Riverside)](https://reader036.fdocuments.in/reader036/viewer/2022062800/56649e045503460f94af0a32/html5/thumbnails/24.jpg)
24
Example of Bayes Theorem• Given:
– A doctor knows that meningitis causes stiff neck 50% of the time– Prior probability of any patient having meningitis is 1/50,000– Prior probability of any patient having stiff neck is 1/20
• If a patient has stiff neck, what’s the probability he/she has meningitis?
0002.020/150000/15.0
)()()|(
)|( SP
MPMSPSMP
![Page 25: CISC 4631 Data Mining Lecture 06: Bayes Theorem Theses slides are based on the slides by Tan, Steinbach and Kumar (textbook authors) Eamonn Koegh (UC Riverside)](https://reader036.fdocuments.in/reader036/viewer/2022062800/56649e045503460f94af0a32/html5/thumbnails/25.jpg)
25
Menu
Bad Hygiene Good HygieneMenuMenu
Menu
MenuMenu
Menu
• You are a health official, deciding whether to investigate a restaurant
• You lose a dollar if you get it wrong.
• You win a dollar if you get it right
• Half of all restaurants have bad hygiene
• In a bad restaurant, ¾ of the menus are smudged
• In a good restaurant, 1/3 of the menus are smudged
• You are allowed to see a randomly chosen menu
Another Example of BT
![Page 26: CISC 4631 Data Mining Lecture 06: Bayes Theorem Theses slides are based on the slides by Tan, Steinbach and Kumar (textbook authors) Eamonn Koegh (UC Riverside)](https://reader036.fdocuments.in/reader036/viewer/2022062800/56649e045503460f94af0a32/html5/thumbnails/26.jpg)
26
)|( SBP)(
) and (
SP
SBP
)(
) and (
SP
BSP
)not and () and (
) and (
BSPBSP
BSP
)not and () and (
) () | (
BSPBSP
BPBSP
)not ()not | () () | (
) () | (
BPBSPBPBSP
BPBSP
13
9
21
31
21
43
21
43
![Page 27: CISC 4631 Data Mining Lecture 06: Bayes Theorem Theses slides are based on the slides by Tan, Steinbach and Kumar (textbook authors) Eamonn Koegh (UC Riverside)](https://reader036.fdocuments.in/reader036/viewer/2022062800/56649e045503460f94af0a32/html5/thumbnails/27.jpg)
27
Menu
Menu
Menu
Menu
Menu
Menu
Menu
Menu
Menu Menu Menu Menu
Menu Menu Menu Menu
![Page 28: CISC 4631 Data Mining Lecture 06: Bayes Theorem Theses slides are based on the slides by Tan, Steinbach and Kumar (textbook authors) Eamonn Koegh (UC Riverside)](https://reader036.fdocuments.in/reader036/viewer/2022062800/56649e045503460f94af0a32/html5/thumbnails/28.jpg)
28
Bayesian DiagnosisBuzzword Meaning In our
example
Our example’s value
True State The true state of the world, which you would like to know
Is the restaurant bad?
![Page 29: CISC 4631 Data Mining Lecture 06: Bayes Theorem Theses slides are based on the slides by Tan, Steinbach and Kumar (textbook authors) Eamonn Koegh (UC Riverside)](https://reader036.fdocuments.in/reader036/viewer/2022062800/56649e045503460f94af0a32/html5/thumbnails/29.jpg)
29
Bayesian DiagnosisBuzzword Meaning In our
example
Our example’s value
True State The true state of the world, which you would like to know
Is the restaurant bad?
Prior Prob(true state = x) P(Bad) 1/2
![Page 30: CISC 4631 Data Mining Lecture 06: Bayes Theorem Theses slides are based on the slides by Tan, Steinbach and Kumar (textbook authors) Eamonn Koegh (UC Riverside)](https://reader036.fdocuments.in/reader036/viewer/2022062800/56649e045503460f94af0a32/html5/thumbnails/30.jpg)
30
Bayesian DiagnosisBuzzword Meaning In our
example
Our example’s value
True State The true state of the world, which you would like to know
Is the restaurant bad?
Prior Prob(true state = x) P(Bad) 1/2
Evidence Some symptom, or other thing you can observe
Smudge
![Page 31: CISC 4631 Data Mining Lecture 06: Bayes Theorem Theses slides are based on the slides by Tan, Steinbach and Kumar (textbook authors) Eamonn Koegh (UC Riverside)](https://reader036.fdocuments.in/reader036/viewer/2022062800/56649e045503460f94af0a32/html5/thumbnails/31.jpg)
31
Bayesian DiagnosisBuzzword Meaning In our
example
Our example’s value
True State The true state of the world, which you would like to know
Is the restaurant bad?
Prior Prob(true state = x) P(Bad) 1/2
Evidence Some symptom, or other thing you can observe
Conditional Probability of seeing evidence if you did know the true state
P(Smudge|Bad) 3/4
P(Smudge|not Bad) 1/3
![Page 32: CISC 4631 Data Mining Lecture 06: Bayes Theorem Theses slides are based on the slides by Tan, Steinbach and Kumar (textbook authors) Eamonn Koegh (UC Riverside)](https://reader036.fdocuments.in/reader036/viewer/2022062800/56649e045503460f94af0a32/html5/thumbnails/32.jpg)
32
Bayesian DiagnosisBuzzword Meaning In our
example
Our example’s value
True State The true state of the world, which you would like to know
Is the restaurant bad?
Prior Prob(true state = x) P(Bad) 1/2
Evidence Some symptom, or other thing you can observe
Conditional Probability of seeing evidence if you did know the true state
P(Smudge|Bad) 3/4
P(Smudge|not Bad) 1/3
Posterior The Prob(true state = x | some evidence)
P(Bad|Smudge) 9/13
![Page 33: CISC 4631 Data Mining Lecture 06: Bayes Theorem Theses slides are based on the slides by Tan, Steinbach and Kumar (textbook authors) Eamonn Koegh (UC Riverside)](https://reader036.fdocuments.in/reader036/viewer/2022062800/56649e045503460f94af0a32/html5/thumbnails/33.jpg)
33
Bayesian DiagnosisBuzzword Meaning In our
example
Our example’s value
True State The true state of the world, which you would like to know
Is the restaurant bad?
Prior Prob(true state = x) P(Bad) 1/2
Evidence Some symptom, or other thing you can observe
Conditional Probability of seeing evidence if you did know the true state
P(Smudge|Bad) 3/4
P(Smudge|not Bad) 1/3
Posterior The Prob(true state = x | some evidence)
P(Bad|Smudge) 9/13
Inference, Diagnosis, Bayesian Reasoning
Getting the posterior from the prior and the evidence
![Page 34: CISC 4631 Data Mining Lecture 06: Bayes Theorem Theses slides are based on the slides by Tan, Steinbach and Kumar (textbook authors) Eamonn Koegh (UC Riverside)](https://reader036.fdocuments.in/reader036/viewer/2022062800/56649e045503460f94af0a32/html5/thumbnails/34.jpg)
34
Bayesian DiagnosisBuzzword Meaning In our
example
Our example’s value
True State The true state of the world, which you would like to know
Is the restaurant bad?
Prior Prob(true state = x) P(Bad) 1/2
Evidence Some symptom, or other thing you can observe
Conditional Probability of seeing evidence if you did know the true state
P(Smudge|Bad) 3/4
P(Smudge|not Bad) 1/3
Posterior The Prob(true state = x | some evidence)
P(Bad|Smudge) 9/13
Inference, Diagnosis, Bayesian Reasoning
Getting the posterior from the prior and the evidence
Decision theory
Combining the posterior with known costs in order to decide what to do
![Page 35: CISC 4631 Data Mining Lecture 06: Bayes Theorem Theses slides are based on the slides by Tan, Steinbach and Kumar (textbook authors) Eamonn Koegh (UC Riverside)](https://reader036.fdocuments.in/reader036/viewer/2022062800/56649e045503460f94af0a32/html5/thumbnails/35.jpg)
35
Why Bayes Theorem at all?
• Why modeling P(C|A) via P(A|C)• Why not model P(C|A) directly?
• P(A|C)P(C) decomposition allows us to be “sloppy”
– P(C) and P(A|C) can be trained independently
)()()|(
)|(APCPCAP
ACP
![Page 36: CISC 4631 Data Mining Lecture 06: Bayes Theorem Theses slides are based on the slides by Tan, Steinbach and Kumar (textbook authors) Eamonn Koegh (UC Riverside)](https://reader036.fdocuments.in/reader036/viewer/2022062800/56649e045503460f94af0a32/html5/thumbnails/36.jpg)
36
Crime Scene Analogy
• A is a crime scene. C is a person who may have committed the crime– P(C|A) - look at the scene - who did it?
– P(C) - who had a motive? (Profiler)
– P(A|C) - could they have done it? (CSI - transportation, access to weapons, alibi)