Data Mining – Algorithms: Naïve Bayes Chapter 4, Section 4.2.
Data Mining – Algorithms: Naïve Bayes
Chapter 4, Section 4.2
More Simplicity
• Direct contrast to OneR – use all attributes
• Assume that all attributes are equally important
• Assume that all (non-predicted) attributes are independent of each other
• This is clearly naïve!
• But it works pretty well
Again, let’s make this a little more realistic than the book does
• Divide into training and test data
• Let’s save the last record as a test
• (using my weather, nominal …
Determine Distribution of Attributes
Outlook \ Play Yes No
Sunny 4 1
Overcast 2 2
Rainy 0 4
Temperature \ Play Yes No
Hot 1 3
Mild 3 2
Cool 2 2
Determine Distribution of Attributes
Humidity \ Play Yes No
High 3 3
Normal 3 4
Windy \ Play Yes No
False 2 6
True 4 1
Also, the attribute to be predicted …
Yes No
Play 6 7
Inferring Probabilities from Observed
Outlook \ Play Yes No
Sunny 4/6 1/7
Overcast 2/6 2/7
Rainy 0/6 4/7
Temperature \ Play Yes No
Hot 1/6 3/7
Mild 3/6 2/7
Cool 2/6 2/7
Inferring Probabilities from Observed
Humidity \ Play Yes No
High 3/6 3/7
Normal 3/6 4/7
Windy \ Play Yes No
False 2/6 6/7
True 4/6 1/7
Also, the attribute to be predicted … proportion of days that were yes and no
Yes No
Play 6/13 7/13
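The count-to-probability step above can be sketched in a few lines of Python (counts copied from the outlook table; the variable names are illustrative, not from WEKA):

```python
# Convert the outlook count table into conditional probabilities by
# dividing each count by its class total (6 "yes" days, 7 "no" days).
outlook_counts = {"sunny": (4, 1), "overcast": (2, 2), "rainy": (0, 4)}
n_yes, n_no = 6, 7

outlook_probs = {value: (yes / n_yes, no / n_no)
                 for value, (yes, no) in outlook_counts.items()}

print(outlook_probs["rainy"])  # (0.0, 0.5714...)
```

The same division applies to each of the other three attribute tables.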
Now, suppose we must predict the test instance
• Rainy, mild, high, true
• Probability of Yes =
  Probability of rainy given Yes
  * Probability of mild given Yes
  * Probability of high humidity given Yes
  * Probability of windy given Yes
  * Probability of Yes (in general)
  = 0/6 * 3/6 * 3/6 * 4/6 * 6/13 = 0 / 16848 = 0.0
• Probability of No =
  Probability of rainy given No
  * Probability of mild given No
  * Probability of high humidity given No
  * Probability of windy given No
  * Probability of No (in general)
  = 4/7 * 2/7 * 3/7 * 1/7 * 7/13 = 168 / 31213 ≈ 0.005
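The two products above can be checked with a short Python sketch (the fractions come from the probability tables; `nb_score` is a made-up helper name, not a WEKA function):

```python
def nb_score(likelihoods, prior):
    """Multiply the per-attribute likelihoods P(value | class) by the class prior."""
    score = prior
    for p in likelihoods:
        score *= p
    return score

# Test instance: outlook=rainy, temperature=mild, humidity=high, windy=true
score_yes = nb_score([0/6, 3/6, 3/6, 4/6], 6/13)   # 0.0
score_no  = nb_score([4/7, 2/7, 3/7, 1/7], 7/13)   # 168/31213, about 0.005
```

Because the rainy count for Yes is zero, the whole Yes product collapses to zero – the problem the Laplace estimator fixes later.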
The Foundation:
• Bayes Rule of Conditional Probabilities
• P[H|E] = P[E|H] P[H] / P[E]
• The probability of a hypothesis (e.g. play=yes) given evidence E (the new test instance) is equal to:
– the probability of the evidence given the hypothesis
– times the probability of the hypothesis,
– all divided by the probability of the evidence
• We did the numerator of this (the denominator doesn’t matter since it is the same for both Yes and No)
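Since P[E] is the same for both classes, dividing each score by their sum recovers the normalized posteriors; a minimal sketch using the scores computed on the earlier slide:

```python
# Normalize the two un-normalized scores; P(E) is just their sum here.
score_yes = 0.0          # numerator for "yes" from the earlier slide
score_no = 168 / 31213   # numerator for "no" from the earlier slide

evidence = score_yes + score_no
p_yes = score_yes / evidence
p_no = score_no / evidence

print(p_yes, p_no)  # 0.0 1.0
```

This is why the denominator "doesn't matter" for picking the winner – it scales both scores equally.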
The Probability of Evidence given the Hypothesis:
• Since this is naïve bayes, we assume that the evidence in terms of the different attributes is independent (given the class), so the probabilities of the 4 attributes having the values that they do are multiplied together
The Probability of the Hypothesis:
• This is just the probability of Yes (or No)
• This is called the “prior probability” of the hypothesis – it would be your guess prior to seeing any evidence
• We multiplied this by the previous slide’s value as called for in the formula
A complication
• Our probability of “yes” came out zero since no rainy day had had play=yes
• This may be a little extreme – this one attribute has ruled all, no matter what the other evidence says
• Common adjustment – start all counts off at 1 instead of at 0 (“Laplace estimator”) …
With Laplace Estimator … Determine Distribution of Attributes
Outlook \ Play Yes No
Sunny 5 2
Overcast 3 3
Rainy 1 5
Temperature \ Play Yes No
Hot 2 4
Mild 4 3
Cool 3 3
With Laplace Estimator … Determine Distribution of Attributes
Humidity \ Play Yes No
High 4 4
Normal 4 5
Windy \ Play Yes No
False 3 7
True 5 2
With Laplace Estimator …, the attribute to be predicted …
Yes No
Play 7 8
With Laplace Estimator … Inferring Probabilities from Observed
Outlook \ Play Yes No
Sunny 5/9 2/10
Overcast 3/9 3/10
Rainy 1/9 5/10
Temperature \ Play Yes No
Hot 2/9 4/10
Mild 4/9 3/10
Cool 3/9 3/10
With Laplace Estimator … Inferring Probabilities from Observed
Humidity \ Play Yes No
High 4/8 4/9
Normal 4/8 5/9
Windy \ Play Yes No
False 3/8 7/9
True 5/8 2/9
With Laplace Estimator …, the attribute to be predicted …
Proportion of days that were yes and no
Yes No
Play 7/15 8/15
Now, predict the test instance
• Rainy, mild, high, true
• Probability of Yes = 1/9 * 4/9 * 4/8 * 5/8 * 7/15 = 560 / 77760 ≈ 0.007
• Probability of No = 5/10 * 3/10 * 4/9 * 2/9 * 8/15 = 960 / 121500 ≈ 0.008
• No wins narrowly, so the prediction is play = no
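With the Laplace-smoothed fractions the same multiplication reproduces the slide’s numbers (again via a hypothetical `nb_score` helper):

```python
def nb_score(likelihoods, prior):
    """Multiply smoothed likelihoods P(value | class) by the class prior."""
    score = prior
    for p in likelihoods:
        score *= p
    return score

# Test instance: rainy, mild, high, true, using Laplace-smoothed fractions
score_yes = nb_score([1/9, 4/9, 4/8, 5/8], 7/15)     # about 0.0072
score_no  = nb_score([5/10, 3/10, 4/9, 2/9], 8/15)   # about 0.0079

prediction = "yes" if score_yes > score_no else "no"
```

Note how smoothing turned the 0/6 for rainy into 1/9, so the Yes score is small but no longer forced to zero.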
In a 14-fold cross validation, this would continue 13 more times
• Let’s run WEKA on this … NaiveBayesSimple …
WEKA results – first look near the bottom
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances 9 64.2857 %
Incorrectly Classified Instances 5 35.7143 %
============================================
• On the cross-validation it got 9 out of 14 tests correct
• Same as OneR
More Detailed Results
=== Confusion Matrix ===
a b <-- classified as
3 3 | a = yes
2 6 | b = no
====================================
• Here we see: the program predicted play=yes 5 times, and on 3 of those it was correct – it is predicting yes less often than OneR did
• The program predicted play=no 9 times, and on 6 of those it was correct
• There were 6 instances whose actual value was play=yes; the program correctly predicted 3 of them
• There were 8 instances whose actual value was play=no; the program correctly predicted 6 of them
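Those figures can be read straight off the matrix; a small sketch (rows are actual classes, columns are predicted classes, as in the WEKA output):

```python
# Confusion matrix from the WEKA run: rows = actual, columns = predicted.
cm = [[3, 3],   # actual yes: 3 predicted yes, 3 predicted no
      [2, 6]]   # actual no:  2 predicted yes, 6 predicted no

predicted_yes = cm[0][0] + cm[1][0]       # 5 yes predictions, 3 correct
predicted_no  = cm[0][1] + cm[1][1]       # 9 no predictions, 6 correct
correct = cm[0][0] + cm[1][1]             # 9 correct overall
accuracy = correct / sum(map(sum, cm))    # 9/14, about 64.29 %
```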
Again, part of our purpose is to have a take-home message for humans
• Not 14 take home messages!
• So instead of reporting each of the things learned on each of the 14 training sets …
• … The program runs again on all of the data and builds a pattern for that – a take home message
• … However, for naïve Bayes, the take-home message is less easily interpreted …
WEKA - Take-Home
Naive Bayes (simple)

Class yes: P(C) = 0.4375
Attribute outlook
  sunny      overcast   rainy
  0.55555556 0.33333333 0.11111111
Attribute temperature
  hot        mild       cool
  0.22222222 0.44444444 0.33333333
Attribute humidity
  high normal
  0.5  0.5
Attribute windy
  TRUE  FALSE
  0.625 0.375

WEKA - Take-Home continued

Class no: P(C) = 0.5625
Attribute outlook
  sunny      overcast   rainy
  0.18181818 0.27272727 0.54545455
Attribute temperature
  hot        mild       cool
  0.36363636 0.36363636 0.27272727
Attribute humidity
  high normal
  0.5  0.5
Attribute windy
  TRUE FALSE
  0.3  0.7
Let’s Try WEKA Naïve Bayes on njcrimenominal
• Try 10-fold
=== Confusion Matrix ===
 a  b <-- classified as
 6  1 | a = bad
 7 18 | b = ok
• This represents a slight improvement over OneR (probably not significant)
• We note that OneR chose unemployment as the attribute to use; with the probabilities, note for bad crime:
Attribute unemploy
  hi  med low
  0.3 0.6 0.1
• … while for ok crime:
Attribute unemploy
  hi         med        low
  0.03571429 0.28571429 0.67857143
Naïve Bayes – Missing Values
• Training data – simply not included in the frequency counts; probability ratios are based on the percentage of instances where the value actually occurs rather than the total number of instances
• Test data – calculations omit the missing attribute – e.g. Prob(yes | sunny, ?, high, false) = 5/9 × 4/8 × 3/8 × 7/15 (skipping temperature) – since the factor is omitted for each class, this is not a problem
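The skip-the-missing-attribute rule can be sketched by marking a missing value as `None` (an illustrative convention for this sketch, not WEKA’s representation):

```python
def nb_score(likelihoods, prior):
    """Multiply likelihoods by the prior, skipping missing attributes."""
    score = prior
    for p in likelihoods:
        if p is not None:   # None marks a missing attribute value
            score *= p
    return score

# Prob(yes | sunny, ?, high, false): temperature is missing, so its
# factor is simply left out of the product (Laplace-smoothed fractions).
score_yes = nb_score([5/9, None, 4/8, 3/8], 7/15)
```

Since the same factor is dropped for every class, the comparison between classes stays fair.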
Naïve Bayes – Numeric Values
• Assume values fit a “normal” curve
• Calculate the mean and standard deviation for each class
• Known properties of normal curves allow us to use a formula for the “probability density function” to calculate the probability based on a value, the mean, and the standard deviation
• Book has the equation, p87 – don’t memorize; look it up if you are writing the program
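The density formula referred to above is the standard normal probability density function; a minimal sketch (the mean and standard deviation below are illustrative values, not computed from this dataset):

```python
import math

def gaussian_pdf(x, mean, stdev):
    """Normal probability density, used in place of a frequency ratio."""
    exponent = -((x - mean) ** 2) / (2 * stdev ** 2)
    return math.exp(exponent) / (math.sqrt(2 * math.pi) * stdev)

# e.g. a temperature of 66 against an assumed class mean of 73, stdev 6.2
density = gaussian_pdf(66, 73, 6.2)
```

The density is not itself a probability, but like the nominal fractions it is multiplied into the per-class product, so only its relative size across classes matters.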
Naïve Bayes – Discussion
• Naïve Bayes frequently does as well as or better than sophisticated classification algorithms on real datasets – despite its assumptions being violated
• Clearly redundant attributes hurt performance, because they have the effect of counting an attribute more than once (e.g. at a school with a very high percentage of “traditional students”, age and year in school are redundant)
• Many correlated or redundant attributes make Naïve Bayes a poor choice for a dataset
– (unless preprocessing removes them)
• Numeric data known not to follow a normal distribution can be handled using another distribution (e.g. Poisson) or, if the distribution is unknown, a generic “kernel density estimation”
Class Exercise
• Let’s run WEKA NaiveBayesSimple on japanbank
End Section 4.2