Transcript of: Probabilistic Machine Learning
Aug 25th, 2001. Copyright © 2001, Andrew W. Moore.
Slide 1
Probabilistic Machine Learning
Brigham S. Anderson
School of Computer Science
Carnegie Mellon University
www.cs.cmu.edu/~brigham
Slide 2
ML: Some Successful Applications
• Learning to recognize spoken words (speech recognition);
• Text categorization (SPAM, newsgroups);
• Learning to play world-class chess, backgammon and checkers;
• Handwriting recognition;
• Learning to classify new astronomical data;
• Learning to detect cancerous tissues (e.g. colon polyp detection).
Slide 3
Machine Learning Application Areas
• Science: astronomy, bioinformatics, drug discovery, …
• Business: advertising, CRM (Customer Relationship Management), investments, manufacturing, sports/entertainment, telecom, e-commerce, targeted marketing, health care, …
• Web: search engines, bots, …
• Government: law enforcement, profiling tax cheaters, anti-terror(?)
Slide 4
Classification Application: Assessing Credit Risk
• Situation: a person applies for a loan.
• Task: should the bank approve the loan?
• Banks develop credit models using a variety of machine learning methods.
• Mortgage and credit card proliferation are the results of being able to successfully predict if a person is likely to default on a loan.
• Widely deployed in many countries.
Slide 5
Probability Table Anomaly Detector

• Suppose we have the following model:

P(Mpg, Horse):
  P(good, low)  = 0.36
  P(good, high) = 0.04
  P(bad,  low)  = 0.12
  P(bad,  high) = 0.48

• We’re trying to detect anomalous cars.

• If the next example we see is <good, high>, how anomalous is it?
Slide 6
Probability Table Anomaly Detector

How likely is <good, high>?

likelihood(good, high) = P(good, high) = 0.04

P(Mpg, Horse):
  P(good, low)  = 0.36
  P(good, high) = 0.04
  P(bad,  low)  = 0.12
  P(bad,  high) = 0.48
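A minimal sketch of this table-based detector in Python. The joint table is from the slide; the 0.05 alert threshold is an illustrative assumption of mine, not from the slides:

```python
# Joint probability table P(Mpg, Horse), copied from the slide.
P_JOINT = {
    ("good", "low"):  0.36,
    ("good", "high"): 0.04,
    ("bad",  "low"):  0.12,
    ("bad",  "high"): 0.48,
}

def likelihood(mpg, horse):
    """Likelihood of a record is just a table lookup: P(mpg, horse)."""
    return P_JOINT[(mpg, horse)]

# The <good, high> car has likelihood 0.04, the rarest combination,
# so a threshold-based detector would flag it as anomalous.
THRESHOLD = 0.05  # illustrative alert threshold, not from the slides
print(likelihood("good", "high"))              # 0.04
print(likelihood("good", "high") < THRESHOLD)  # True -> flag as anomaly
```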
Slide 7

Bayes Net Anomaly Detector

How likely is a <good, high, fast> example?

P(good, high, fast) = P(good) · P(high | good) · P(fast | high)
                    = (0.4)(0.11)(0.89)
                    ≈ 0.039

Bayes net structure: Mpg → Horse → Accel

P(Mpg):
  P(good) = 0.4
  P(bad)  = 0.6

P(Horse | Mpg):
  P(low | good)  = 0.89
  P(low | bad)   = 0.21
  P(high | good) = 0.11
  P(high | bad)  = 0.79

P(Accel | Horse):
  P(slow | low)  = 0.95
  P(slow | high) = 0.11
  P(fast | low)  = 0.05
  P(fast | high) = 0.89
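The same chain-rule computation as a Python sketch (CPT values from the slide):

```python
# CPTs for the 3-node chain Mpg -> Horse -> Accel, copied from the slide.
P_MPG   = {"good": 0.4, "bad": 0.6}
P_HORSE = {("low", "good"): 0.89, ("low", "bad"): 0.21,
           ("high", "good"): 0.11, ("high", "bad"): 0.79}
P_ACCEL = {("slow", "low"): 0.95, ("slow", "high"): 0.11,
           ("fast", "low"): 0.05, ("fast", "high"): 0.89}

def likelihood(mpg, horse, accel):
    """Chain rule for this net: P(mpg) * P(horse|mpg) * P(accel|horse)."""
    return P_MPG[mpg] * P_HORSE[(horse, mpg)] * P_ACCEL[(accel, horse)]

print(likelihood("good", "high", "fast"))  # (0.4)(0.11)(0.89) ~= 0.039
```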
Slide 8
Probability Model Uses
• Classifier: data point x → P(C | x)
• Anomaly Detector: data point x → P(x)
• Inference Engine: evidence e1 → P(E2 | e1) for missing variables E2
Slide 9
Bayes Classifiers
• A formidable and sworn enemy of decision trees.

Classifier: data point x → P(C | x)
Slide 10
Dead-Simple Bayes Classifier Example

• Suppose we have the following model:

P(Mpg, Horse):
  P(good, low)  = 0.36
  P(good, high) = 0.04
  P(bad,  low)  = 0.12
  P(bad,  high) = 0.48

• We’re trying to classify cars as Mpg = “good” or “bad”.

• If the next example we see is Horse = “low”, how do we classify it?
Slide 11
Dead-Simple Bayes Classifier Example

How do we classify <Horse=low>?

P(good | low) = P(good, low) / P(low)
              = P(good, low) / (P(good, low) + P(bad, low))
              = 0.36 / (0.36 + 0.12)
              = 0.75

P(Mpg, Horse):
  P(good, low)  = 0.36
  P(good, high) = 0.04
  P(bad,  low)  = 0.12
  P(bad,  high) = 0.48

P(good | low) = 0.75, so we classify the example as “good”.
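As a runnable Python sketch (joint table from the slide; the helper name posterior_mpg is my own):

```python
P_JOINT = {("good", "low"): 0.36, ("good", "high"): 0.04,
           ("bad", "low"): 0.12, ("bad", "high"): 0.48}

def posterior_mpg(horse):
    """P(Mpg=v | Horse=horse) by normalizing the matching table rows."""
    marginal = sum(p for (mpg, h), p in P_JOINT.items() if h == horse)
    return {mpg: p / marginal
            for (mpg, h), p in P_JOINT.items() if h == horse}

post = posterior_mpg("low")
print(post)                     # {'good': 0.75, 'bad': 0.25}
print(max(post, key=post.get))  # 'good'
```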
Slide 12
Bayes Classifiers
• That was just inference!
• In fact, virtually all machine learning tasks are a form of inference
• Anomaly detection: P(x)
• Classification: P(Class | x)
• Regression: P(Y | x)
• Model learning: P(Model | dataset)
• Feature selection: P(Model | dataset)
Slide 13

Suppose we get a <Horse=low, Accel=fast> example?

P(good | low, fast) = P(good, low, fast) / P(low, fast)
                    = P(good) · P(low | good) · P(fast | low) / P(low, fast)
                    = (0.4)(0.89)(0.05) / P(low, fast)
                    = 0.0178 / (P(good, low, fast) + P(bad, low, fast))
                    ≈ 0.75

Note: this is not exactly 0.75 because I rounded some of the CPT numbers earlier…

Bayes net structure: Mpg → Horse → Accel

P(Mpg):
  P(good) = 0.4
  P(bad)  = 0.6

P(Horse | Mpg):
  P(low | good)  = 0.89
  P(low | bad)   = 0.21
  P(high | good) = 0.11
  P(high | bad)  = 0.79

P(Accel | Horse):
  P(slow | low)  = 0.95
  P(slow | high) = 0.11
  P(fast | low)  = 0.05
  P(fast | high) = 0.89
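A sketch of that normalization in Python, reusing the chain-rule likelihood from the earlier Bayes-net sketch (CPTs from the slide):

```python
P_MPG   = {"good": 0.4, "bad": 0.6}
P_HORSE = {("low", "good"): 0.89, ("low", "bad"): 0.21,
           ("high", "good"): 0.11, ("high", "bad"): 0.79}
P_ACCEL = {("slow", "low"): 0.95, ("slow", "high"): 0.11,
           ("fast", "low"): 0.05, ("fast", "high"): 0.89}

def joint(mpg, horse, accel):
    """Chain rule for this net: P(mpg) * P(horse|mpg) * P(accel|horse)."""
    return P_MPG[mpg] * P_HORSE[(horse, mpg)] * P_ACCEL[(accel, horse)]

# Normalize over the two Mpg classes to get the posterior.
num = joint("good", "low", "fast")       # (0.4)(0.89)(0.05) = 0.0178
den = num + joint("bad", "low", "fast")  # + (0.6)(0.21)(0.05) = 0.0241
print(num / den)  # ~0.739, which the slide rounds to 0.75
```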
Slide 14

Suppose we get a <Horse=low, Accel=fast> example?

(Same model and computation as the previous slide.)
P(good | low, fast) = 0.75, so we classify the example as “good”.

…but that seems somehow familiar…

Wasn’t that the same answer as P(good | low)?
Slide 15

Suppose we get a <Horse=low, Accel=fast> example?

(Same Bayes net as before: Mpg → Horse → Accel.)

Why are the answers the same? In this network, Accel depends on Mpg only through Horse: the P(fast | low) factor appears in both the numerator and the denominator of the posterior and cancels. Once Horse is known, observing Accel adds no information about Mpg, so P(good | low, fast) = P(good | low).
Slide 16
How to build a Bayes Classifier

• Assume you want to predict output Y, which has arity nY and values v1, v2, …, vnY.
• Assume there are m input attributes called X1, X2, …, Xm.
• Break the dataset into nY smaller datasets called DS1, DS2, …, DSnY.
• Define DSi = records in which Y = vi.
• For each DSi, learn a density estimator Mi to model the input distribution among the Y = vi records.
Slide 17
How to build a Bayes Classifier

(Same steps as the previous slide, plus:)

• Mi estimates P(X1, X2, …, Xm | Y = vi).
Slide 18
How to build a Bayes Classifier

(Same steps as before, plus:)

• Idea: when a new set of input values (X1 = u1, X2 = u2, …, Xm = um) comes along to be evaluated, predict the value of Y that makes P(X1, X2, …, Xm | Y = vi) most likely:

Ypredict = argmax_v P(X1 = u1, …, Xm = um | Y = v)

Is this a good idea?
Slide 19
How to build a Bayes Classifier

Ypredict = argmax_v P(X1 = u1, …, Xm = um | Y = v)

Is this a good idea?

This is a Maximum Likelihood classifier. It can get silly if some Ys are very unlikely.
Slide 20
How to build a Bayes Classifier

• Much better idea: when a new set of input values (X1 = u1, X2 = u2, …, Xm = um) comes along to be evaluated, predict the value of Y that makes P(Y = vi | X1, X2, …, Xm) most likely:

Ypredict = argmax_v P(Y = v | X1 = u1, …, Xm = um)
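A minimal sketch of this recipe in Python. The function names (fit_bayes_classifier, predict_map, fit_table) and the pluggable fit_density interface are my own illustrative choices, not from the slides:

```python
from collections import Counter, defaultdict

def fit_bayes_classifier(records, labels, fit_density):
    """Split records by class, fit one density estimator M_i per class
    (the DS_i above), and estimate the prior P(Y=v_i) as the fraction
    of records carrying that label."""
    by_class = defaultdict(list)
    for x, y in zip(records, labels):
        by_class[y].append(x)
    priors = {v: len(xs) / len(records) for v, xs in by_class.items()}
    models = {v: fit_density(xs) for v, xs in by_class.items()}
    return priors, models

def predict_map(x, priors, models):
    """MAP prediction: argmax_v P(X=x | Y=v) * P(Y=v)."""
    return max(priors, key=lambda v: models[v](x) * priors[v])

# Example plug-in estimator: a probability table over input tuples.
def fit_table(xs):
    counts = Counter(xs)
    n = len(xs)
    return lambda u: counts[u] / n  # Counter returns 0 for unseen tuples

priors, models = fit_bayes_classifier(
    [("low",), ("low",), ("high",), ("high",)],
    ["good", "good", "bad", "bad"],
    fit_table)
print(predict_map(("low",), priors, models))  # -> 'good'
```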
Slide 21
Terminology
• MLE (Maximum Likelihood Estimator):

  Ypredict = argmax_v P(X1 = u1, …, Xm = um | Y = v)

• MAP (Maximum A Posteriori Estimator):

  Ypredict = argmax_v P(Y = v | X1 = u1, …, Xm = um)
Slide 22
Getting what we need
Ypredict = argmax_v P(Y = v | X1 = u1, …, Xm = um)
Slide 23
Getting a posterior probability
P(Y = v | X1 = u1, …, Xm = um)

  = P(X1 = u1, …, Xm = um | Y = v) · P(Y = v) / P(X1 = u1, …, Xm = um)

  = P(X1 = u1, …, Xm = um | Y = v) · P(Y = v) / Σ(j=1..nY) P(X1 = u1, …, Xm = um | Y = vj) · P(Y = vj)
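Continuing the earlier sketch, the same normalization in Python (priors and models as returned by the fit_bayes_classifier sketch above):

```python
def posterior(x, priors, models):
    """Bayes' rule as above: P(Y=v | x) is P(x | Y=v) * P(Y=v),
    normalized by the same product summed over every class v_j."""
    joint = {v: models[v](x) * priors[v] for v in priors}
    z = sum(joint.values())  # this is P(X1=u1, ..., Xm=um)
    return {v: p / z for v, p in joint.items()}
```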
Slide 24
Bayes Classifiers in a nutshell

1. Learn the distribution over inputs for each value of Y.
2. This gives P(X1, X2, …, Xm | Y = vi).
3. Estimate P(Y = vi) as the fraction of records with Y = vi.
4. For a new prediction:

Ypredict = argmax_v P(Y = v | X1 = u1, …, Xm = um)
         = argmax_v P(X1 = u1, …, Xm = um | Y = v) · P(Y = v)
Slide 25
Bayes Classifiers in a nutshell

(Same recipe as the previous slide.)

We can use our favorite density estimator in step 1. Right now we have three options:

• Probability Table
• Naïve Density
• Bayes Net
Slide 26
Joint Density Bayes Classifier

Ypredict = argmax_v P(X1 = u1, …, Xm = um | Y = v) · P(Y = v)

In the case of the Joint Bayes Classifier this degenerates to a very simple rule:

Ypredict = the most common value of Y among records in which X1 = u1, X2 = u2, …, Xm = um.

Note that if no records have the exact set of inputs X1 = u1, X2 = u2, …, Xm = um, then P(X1, X2, …, Xm | Y = vi) = 0 for all values of Y. In that case we just have to guess Y’s value.
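A sketch of that rule in Python. The majority-class fallback is my own reading of “just have to guess”; the slides don’t prescribe one:

```python
from collections import Counter

def joint_bc_predict(train_x, train_y, query):
    """Most common Y among records whose inputs exactly equal the query.
    With no exact match every class likelihood is zero, so we can only
    guess; here the guess is the overall majority class."""
    matches = [y for x, y in zip(train_x, train_y) if x == query]
    pool = matches if matches else train_y  # fallback = guess
    return Counter(pool).most_common(1)[0][0]
```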
Slide 27
Joint BC Results: “Logical”

The “logical” dataset consists of 40,000 records and 4 boolean attributes called a, b, c, d, where a, b, c are generated 50-50 randomly as 0 or 1. D = A^~C, except that in 10% of records it is flipped.

[Figure: the classifier learned by “Joint BC”]
Slide 28
Joint BC Results: “All Irrelevant”

The “all irrelevant” dataset consists of 40,000 records and 15 boolean attributes called a, b, c, d, …, o, where the attributes are generated 50-50 randomly as 0 or 1. v (output) = 1 with probability 0.75, 0 with probability 0.25.
Slide 29

[Figure-only slide]
Slide 30
BC Results: “MPG”, 392 records

[Figure: the classifier learned by “Naive BC”]
Slide 31
Joint Distribution

[Figure: a single joint model over Mpg, Horsepower, Acceleration, and Maker]
Slide 32
Joint Distribution

Recall: a joint distribution can be decomposed via the chain rule…

P(Mpg, Horse) = P(Mpg) · P(Horse | Mpg)

For example, P(good, low) = P(good) · P(low | good) = (0.4)(0.89) ≈ 0.36, matching the joint table from before.

Note that this takes the same amount of information to create. We “gain” nothing from this decomposition.
Slide 33
Naive Distribution

[Figure: Mpg is the sole parent of every other attribute]

P(Mpg)
P(Cylinders | Mpg)
P(Horsepower | Mpg)
P(Weight | Mpg)
P(Modelyear | Mpg)
P(Maker | Mpg)
P(Acceleration | Mpg)
Slide 34
Naïve Bayes Classifier

Ypredict = argmax_v P(X1 = u1, …, Xm = um | Y = v) · P(Y = v)

In the case of the Naïve Bayes Classifier this can be simplified:

Ypredict = argmax_v P(Y = v) · Π(j=1..m) P(Xj = uj | Y = v)
Slide 35
Naïve Bayes Classifier

Ypredict = argmax_v P(Y = v) · Π(j=1..m) P(Xj = uj | Y = v)

Technical hint: if you have 10,000 input attributes, that product will underflow in floating-point math. You should use logs:

Ypredict = argmax_v [ log P(Y = v) + Σ(j=1..m) log P(Xj = uj | Y = v) ]
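A sketch of the log-space prediction in Python. The cpts[v][j] data layout, mapping attribute j’s observed value to P(Xj = value | Y = v), is my own illustrative choice:

```python
import math

def naive_bc_predict(x, priors, cpts):
    """argmax_v [ log P(Y=v) + sum_j log P(Xj = xj | Y=v) ].
    Summing logs avoids the underflow that the raw product suffers."""
    best_v, best_score = None, -math.inf
    for v, prior in priors.items():
        score = math.log(prior)
        for j, u in enumerate(x):
            score += math.log(cpts[v][j][u])
        if score > best_score:
            best_v, best_score = v, score
    return best_v

# Mpg example with a single attribute, Horse:
priors = {"good": 0.4, "bad": 0.6}
cpts = {"good": [{"low": 0.89, "high": 0.11}],
        "bad":  [{"low": 0.21, "high": 0.79}]}
print(naive_bc_predict(("low",), priors, cpts))  # -> 'good'
```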
Slide 36
BC Results: “XOR”

The “XOR” dataset consists of 40,000 records and 2 boolean inputs called a and b, generated 50-50 randomly as 0 or 1. c (output) = a XOR b.

[Figures: the classifiers learned by “Naive BC” and by “Joint BC”]
Slide 37
Naive BC Results: “Logical”

The “logical” dataset consists of 40,000 records and 4 boolean attributes called a, b, c, d, where a, b, c are generated 50-50 randomly as 0 or 1. D = A^~C, except that in 10% of records it is flipped.

[Figure: the classifier learned by “Naive BC”]
Slide 38
Naive BC Results: “Logical”

(Same dataset as the previous slide.)

[Figure: the classifier learned by “Joint BC”]

This result surprised Andrew until he had thought about it a little.
Slide 39
Naïve BC Results: “All Irrelevant”

The “all irrelevant” dataset consists of 40,000 records and 15 boolean attributes called a, b, c, d, …, o, where the attributes are generated 50-50 randomly as 0 or 1. v (output) = 1 with probability 0.75, 0 with probability 0.25.

[Figure: the classifier learned by “Naive BC”]
Slide 40
BC Results: “MPG”, 392 records

[Figure: the classifier learned by “Naive BC”]
Slide 41
BC Results: “MPG”, 40 records
Slide 42
More Facts About Bayes Classifiers

• Many other density estimators can be slotted in*.
• Density estimation can be performed with real-valued inputs*.
• Bayes Classifiers can be built with real-valued inputs*.
• Rather technical complaint: Bayes Classifiers don’t try to be maximally discriminative; they merely try to honestly model what’s going on*.
• Zero probabilities are painful for Joint and Naïve. A hack (justifiable with the magic words “Dirichlet Prior”) can help*; a sketch follows this list.
• Naïve Bayes is wonderfully cheap, and survives 10,000 attributes cheerfully!

*See future Andrew Lectures
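In its simplest form, that Dirichlet-prior hack is add-alpha smoothing of the estimated probabilities; a minimal sketch (the helper name and interface are my own):

```python
def smoothed_probs(counts, all_values, alpha=1.0):
    """Dirichlet-prior ("add-alpha") smoothing of a CPT column:
    P(X=u | Y=v) = (count(u) + alpha) / (total + alpha * |values|).
    alpha=1 is classic Laplace smoothing; no value ends up with
    probability exactly zero."""
    total = sum(counts.get(u, 0) for u in all_values)
    k = len(all_values)
    return {u: (counts.get(u, 0) + alpha) / (total + alpha * k)
            for u in all_values}

print(smoothed_probs({"low": 3, "high": 0}, ["low", "high"]))
# {'low': 0.8, 'high': 0.2} -- "high" is rare but no longer impossible
```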
Slide 43
What you should know
• Probability
  • Fundamentals of Probability and Bayes Rule
  • What’s a Joint Distribution
  • How to do inference (i.e. P(E1 | E2)) once you have a JD

• Density Estimation
  • What is DE and what is it good for
  • How to learn a Joint DE
  • How to learn a naïve DE
Slide 44
What you should know
• Bayes Classifiers
  • How to build one
  • How to predict with a BC
  • Contrast between naïve and joint BCs
Slide 45
Interesting Questions
• Suppose you were evaluating NaiveBC, JointBC, and Decision Trees:
  • Invent a problem where only NaiveBC would do well
  • Invent a problem where only Dtree would do well
  • Invent a problem where only JointBC would do well
  • Invent a problem where only NaiveBC would do poorly
  • Invent a problem where only Dtree would do poorly
  • Invent a problem where only JointBC would do poorly
Slide 46
Venn Diagram
Slide 47
For more information
• Two nice books:
  • L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth, Belmont, CA, 1984.
  • J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Series in Machine Learning.
• Dozens of nice papers, including:
  • Wray Buntine. Learning Classification Trees. Statistics and Computing (1992), Vol 2, pages 63-73.
  • Kearns and Mansour. On the Boosting Ability of Top-Down Decision Tree Learning Algorithms. STOC: ACM Symposium on Theory of Computing, 1996.
• Dozens of software implementations available on the web, for free and commercially, at prices ranging between $50 and $300,000.
Slide 48
Probability Model Uses
• Classifier: input attributes → P(C | E)
• Anomaly Detector: data point x → P(x | M)
• Inference Engine: subset evidence E1 → P(E2 | e1) for variables E2
• Clusterer: data set → clusters of points
Slide 49
How to Build a Bayes Classifier
Data Set → P(I, A, R, C)

This function simulates a four-dimensional lookup table of the probability of each possible Industry/Analyte/Result/Class combination.

Each record has a class of either “normal” or “outbreak”.
Slide 50
How to Build a Bayes Classifier
Data Set → split into an “Outbreaks” data set and a “Normals” data set

P(I, A, R, O) → P(I, A, R | normal)
Slide 51
How to Build a Bayes Classifier
Suppose that a new test result arrives…
<meat, salmonella, negative>
P(meat, salmonella, negative, normal) = 0.19
P(meat, salmonella, negative, outbreak) = 0.005
0.19 / 0.005 = 38.0
Class = “normal”!
Slide 52
How to Build a Bayes Classifier
Next test:
<Seafood, Vibrio, Positive>
P(seafood, vibrio, positive, normal) = 0.02
P(seafood, vibrio, positive, outbreak) = 0.07
0.02 / 0.07 ≈ 0.29
Class = “outbreak”!
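The decision on these last two slides is just a comparison of the two joint probabilities, which already include the class priors, so this is the MAP rule; a tiny Python sketch:

```python
def classify_by_ratio(p_normal, p_outbreak):
    """Compare the two joint probabilities directly; a ratio above 1
    means 'normal' is the more probable class for this record."""
    return "normal" if p_normal / p_outbreak > 1 else "outbreak"

print(classify_by_ratio(0.19, 0.005))  # -> 'normal'   (ratio 38.0)
print(classify_by_ratio(0.02, 0.07))   # -> 'outbreak' (ratio ~0.29)
```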