Data Mining – Algorithms: Naïve Bayes Chapter 4, Section 4.2.
Data Mining – Algorithms: Naïve Bayes
Chapter 4, Section 4.2
More Simplicity
• Direct contrast to OneR – use all attributes
• Assume that all attributes are equally important
• Assume that all (non-predicted) attributes are independent of each other
• This is clearly naïve!
• But it works pretty well
Again, let’s make this a little more realistic than the book does
• Divide into training and test data
• Let’s save the last record as a test
• (using my weather, nominal …
Determine Distribution of Attributes
Outlook \ Play Yes No
Sunny 4 1
Overcast 2 2
Rainy 0 4
Temperature \ Play Yes No
Hot 1 3
Mild 3 2
Cool 2 2
Determine Distribution of Attributes
Humidity \ Play Yes No
High 3 3
Normal 3 4
Windy \ Play Yes No
False 2 6
True 4 1
Also, the attribute to be predicted …
Yes No
Play 6 7
Inferring Probabilities from Observed
Outlook \ Play Yes No
Sunny 4/6 1/7
Overcast 2/6 2/7
Rainy 0/6 4/7
Temperature \ Play Yes No
Hot 1/6 3/7
Mild 3/6 2/7
Cool 2/6 2/7
Inferring Probabilities from Observed
Humidity \ Play Yes No
High 3/6 3/7
Normal 3/6 4/7
Windy \ Play Yes No
False 2/6 6/7
True 4/6 1/7
Also, the attribute to be predicted … proportion of days that were yes and no
Yes No
Play 6/13 7/13
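The count-to-probability step above can be sketched in a few lines of Python (counts copied from the outlook table; the variable names are illustrative, not from WEKA):

```python
# Convert the outlook count table into conditional probabilities by
# dividing each count by its class total (6 "yes" days, 7 "no" days).
outlook_counts = {"sunny": (4, 1), "overcast": (2, 2), "rainy": (0, 4)}
n_yes, n_no = 6, 7

outlook_probs = {value: (yes / n_yes, no / n_no)
                 for value, (yes, no) in outlook_counts.items()}

print(outlook_probs["rainy"])  # (0.0, 0.5714...)
```

The same division applies to each of the other three attribute tables.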
Now, suppose we must predict the test instance
• Rainy, mild, high, true
• Probability of Yes =
  Probability of rainy given Yes
  * Probability of mild given Yes
  * Probability of high humidity given Yes
  * Probability of windy given Yes
  * Probability of Yes (in general)
  = 0/6 * 3/6 * 3/6 * 4/6 * 6/13 = 0 / 16848 = 0.0
• Probability of No =
  Probability of rainy given No
  * Probability of mild given No
  * Probability of high humidity given No
  * Probability of windy given No
  * Probability of No (in general)
  = 4/7 * 2/7 * 3/7 * 1/7 * 7/13 = 168 / 31213 ≈ 0.005
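The two products above can be checked with a short Python sketch (the fractions come from the probability tables; `nb_score` is a made-up helper name, not a WEKA function):

```python
def nb_score(likelihoods, prior):
    """Multiply the per-attribute likelihoods P(value | class) by the class prior."""
    score = prior
    for p in likelihoods:
        score *= p
    return score

# Test instance: outlook=rainy, temperature=mild, humidity=high, windy=true
score_yes = nb_score([0/6, 3/6, 3/6, 4/6], 6/13)   # 0.0
score_no  = nb_score([4/7, 2/7, 3/7, 1/7], 7/13)   # 168/31213, about 0.005
```

Because the rainy count for Yes is zero, the whole Yes product collapses to zero – the problem the Laplace estimator fixes later.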
The Foundation:
• Bayes Rule of Conditional Probabilities
• P[H|E] = P[E|H] P[H] / P[E]
• The probability of a hypothesis (e.g. play=yes) given evidence E (the new test instance) is equal to:
– the probability of the evidence given the hypothesis
– times the probability of the hypothesis,
– all divided by the probability of the evidence
• We did the numerator of this (the denominator doesn’t matter since it is the same for both Yes and No)
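Since P[E] is the same for both classes, dividing each score by their sum recovers the normalized posteriors; a minimal sketch using the scores computed on the earlier slide:

```python
# Normalize the two un-normalized scores; P(E) is just their sum here.
score_yes = 0.0          # numerator for "yes" from the earlier slide
score_no = 168 / 31213   # numerator for "no" from the earlier slide

evidence = score_yes + score_no
p_yes = score_yes / evidence
p_no = score_no / evidence

print(p_yes, p_no)  # 0.0 1.0
```

This is why the denominator "doesn't matter" for picking the winner – it scales both scores equally.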
The Probability of Evidence given the Hypothesis:
• Since this is naïve bayes, we assume that the evidence in terms of the different attributes is independent (given the class), so the probabilities of the 4 attributes having the values that they do are multiplied together
The Probability of the Hypothesis:
• This is just the probability of Yes (or No)
• This is called the “prior probability” of the hypothesis – it would be your guess prior to seeing any evidence
• We multiplied this by the previous slide’s value as called for in the formula
A complication
• Our probability of “yes” came out zero since no rainy day had had play=yes
• This may be a little extreme – this one attribute has ruled all, no matter what the other evidence says
• Common adjustment – start all counts off at 1 instead of at 0 (“Laplace estimator”) …
With Laplace Estimator … Determine Distribution of Attributes
Outlook \ Play Yes No
Sunny 5 2
Overcast 3 3
Rainy 1 5
Temperature \ Play Yes No
Hot 2 4
Mild 4 3
Cool 3 3
With Laplace Estimator … Determine Distribution of Attributes
Humidity \ Play Yes No
High 4 4
Normal 4 5
Windy \ Play Yes No
False 3 7
True 5 2
With Laplace Estimator …, the attribute to be predicted …
Yes No
Play 7 8
With Laplace Estimator … Inferring Probabilities from Observed
Outlook \ Play Yes No
Sunny 5/9 2/10
Overcast 3/9 3/10
Rainy 1/9 5/10
Temperature \ Play Yes No
Hot 2/9 4/10
Mild 4/9 3/10
Cool 3/9 3/10
With Laplace Estimator … Inferring Probabilities from Observed
Humidity \ Play Yes No
High 4/8 4/9
Normal 4/8 5/9
Windy \ Play Yes No
False 3/8 7/9
True 5/8 2/9
With Laplace Estimator …, the attribute to be predicted …
Proportion of days that were yes and no
Yes No
Play 7/15 8/15
Now, predict the test instance
• Rainy, mild, high, true
• Probability of Yes = 1/9 * 4/9 * 4/8 * 5/8 * 7/15 = 560 / 77760 ≈ 0.007
• Probability of No = 5/10 * 3/10 * 4/9 * 2/9 * 8/15 = 960 / 121500 ≈ 0.008
• No wins narrowly, so the prediction is play = no
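With the Laplace-smoothed fractions the same multiplication reproduces the slide’s numbers (again via a hypothetical `nb_score` helper):

```python
def nb_score(likelihoods, prior):
    """Multiply smoothed likelihoods P(value | class) by the class prior."""
    score = prior
    for p in likelihoods:
        score *= p
    return score

# Test instance: rainy, mild, high, true, using Laplace-smoothed fractions
score_yes = nb_score([1/9, 4/9, 4/8, 5/8], 7/15)     # about 0.0072
score_no  = nb_score([5/10, 3/10, 4/9, 2/9], 8/15)   # about 0.0079

prediction = "yes" if score_yes > score_no else "no"
```

Note how smoothing turned the 0/6 for rainy into 1/9, so the Yes score is small but no longer forced to zero.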
In a 14-fold cross validation, this would continue 13 more times
• Let’s run WEKA on this … NaiveBayesSimple …
WEKA results – first look near the bottom
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances 9 64.2857 %
Incorrectly Classified Instances 5 35.7143 %
============================================
• On the cross-validation it got 9 out of 14 tests correct
• Same as OneR
More Detailed Results
=== Confusion Matrix ===
a b <-- classified as
3 3 | a = yes
2 6 | b = no
====================================
• Here we see: the program predicted play=yes 5 times, and on 3 of those it was correct – it is predicting yes less often than OneR did
• The program predicted play=no 9 times, and on 6 of those it was correct
• There were 6 instances whose actual value was play=yes; the program correctly predicted 3 of them
• There were 8 instances whose actual value was play=no; the program correctly predicted 6 of them
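Those figures can be read straight off the matrix; a small sketch (rows are actual classes, columns are predicted classes, as in the WEKA output):

```python
# Confusion matrix from the WEKA run: rows = actual, columns = predicted.
cm = [[3, 3],   # actual yes: 3 predicted yes, 3 predicted no
      [2, 6]]   # actual no:  2 predicted yes, 6 predicted no

predicted_yes = cm[0][0] + cm[1][0]       # 5 yes predictions, 3 correct
predicted_no  = cm[0][1] + cm[1][1]       # 9 no predictions, 6 correct
correct = cm[0][0] + cm[1][1]             # 9 correct overall
accuracy = correct / sum(map(sum, cm))    # 9/14, about 64.29 %
```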
Again, part of our purpose is to have a take-home message for humans
• Not 14 take home messages!
• So instead of reporting each of the things learned on each of the 14 training sets …
• … The program runs again on all of the data and builds a pattern for that – a take home message
• … However, for naïve Bayes, the take-home message is less easily interpreted …
WEKA - Take-Home
Naive Bayes (simple)

Class yes: P(C) = 0.4375
Attribute outlook
  sunny      overcast   rainy
  0.55555556 0.33333333 0.11111111
Attribute temperature
  hot        mild       cool
  0.22222222 0.44444444 0.33333333
Attribute humidity
  high normal
  0.5  0.5
Attribute windy
  TRUE  FALSE
  0.625 0.375

WEKA - Take-Home continued

Class no: P(C) = 0.5625
Attribute outlook
  sunny      overcast   rainy
  0.18181818 0.27272727 0.54545455
Attribute temperature
  hot        mild       cool
  0.36363636 0.36363636 0.27272727
Attribute humidity
  high normal
  0.5  0.5
Attribute windy
  TRUE FALSE
  0.3  0.7
Let’s Try WEKA Naïve Bayes on njcrimenominal
• Try 10-fold
=== Confusion Matrix ===
 a  b <-- classified as
 6  1 | a = bad
 7 18 | b = ok
• This represents a slight improvement over OneR (probably not significant)
• We note that OneR chose unemployment as the attribute to use; with the probabilities, note for bad crime:
Attribute unemploy
  hi  med low
  0.3 0.6 0.1
• … while for ok crime:
Attribute unemploy
  hi         med        low
  0.03571429 0.28571429 0.67857143
Naïve Bayes – Missing Values
• Training data – simply not included in the frequency counts; probability ratios are based on the percentage of instances where the value actually occurs rather than the total number of instances
• Test data – calculations omit the missing attribute – e.g. Prob(yes | sunny, ?, high, false) = 5/9 × 4/8 × 3/8 × 7/15 (skipping temperature) – since the factor is omitted for each class, this is not a problem
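The skip-the-missing-attribute rule can be sketched by marking a missing value as `None` (an illustrative convention for this sketch, not WEKA’s representation):

```python
def nb_score(likelihoods, prior):
    """Multiply likelihoods by the prior, skipping missing attributes."""
    score = prior
    for p in likelihoods:
        if p is not None:   # None marks a missing attribute value
            score *= p
    return score

# Prob(yes | sunny, ?, high, false): temperature is missing, so its
# factor is simply left out of the product (Laplace-smoothed fractions).
score_yes = nb_score([5/9, None, 4/8, 3/8], 7/15)
```

Since the same factor is dropped for every class, the comparison between classes stays fair.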
Naïve Bayes – Numeric Values
• Assume values fit a “normal” curve
• Calculate the mean and standard deviation for each class
• Known properties of normal curves allow us to use a formula for the “probability density function” to calculate the probability based on a value, the mean, and the standard deviation
• Book has the equation, p87 – don’t memorize; look it up if you are writing the program
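The density formula referred to above is the standard normal probability density function; a minimal sketch (the mean and standard deviation below are illustrative values, not computed from this dataset):

```python
import math

def gaussian_pdf(x, mean, stdev):
    """Normal probability density, used in place of a frequency ratio."""
    exponent = -((x - mean) ** 2) / (2 * stdev ** 2)
    return math.exp(exponent) / (math.sqrt(2 * math.pi) * stdev)

# e.g. a temperature of 66 against an assumed class mean of 73, stdev 6.2
density = gaussian_pdf(66, 73, 6.2)
```

The density is not itself a probability, but like the nominal fractions it is multiplied into the per-class product, so only its relative size across classes matters.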
Naïve Bayes – Discussion
• Naïve Bayes frequently does as well as or better than sophisticated classification algorithms on real datasets – despite its assumptions being violated
• Clearly redundant attributes hurt performance, because they have the effect of counting an attribute more than once (e.g. at a school with a very high percentage of “traditional students”, age and year in school are redundant)
• Many correlated or redundant attributes make Naïve Bayes a poor choice for a dataset
– (unless preprocessing removes them)
• Numeric data known not to follow a normal distribution can be handled using another distribution (e.g. Poisson) or, if the distribution is unknown, a generic “kernel density estimation”
Class Exercise
• Let’s run WEKA NaiveBayesSimple on japanbank
End Section 4.2