Transcript of Data Mining – Algorithms: Naïve Bayes, Chapter 4, Section 4.2

Page 1:

Data Mining – Algorithms: Naïve Bayes

Chapter 4, Section 4.2

Page 2:

More Simplicity

• Direct contrast to OneR – use all attributes

• Assume that all attributes are equally important

• Assume that all (non-predicted) attributes are independent of each other

• This is clearly naïve!

• But it works pretty well

Page 3:

Again, let's make this a little more realistic than the book does

• Divide into training and test data

• Let’s save the last record as a test

• (using my weather, nominal …

Page 4:

Determine Distribution of Attributes

Outlook \ Play      Yes   No
Sunny                4     1
Overcast             2     2
Rainy                0     4

Temperature \ Play  Yes   No
Hot                  1     3
Mild                 3     2
Cool                 2     2

Page 5:

Determine Distribution of Attributes

Humidity \ Play     Yes   No
High                 3     3
Normal               3     4

Windy \ Play        Yes   No
False                2     6
True                 4     1

Page 6:

Also, the attribute to be predicted …

        Yes   No
Play     6     7
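As a minimal sketch of this bookkeeping (Python, not part of the lecture; the counts are copied from the tables above), the tables fit in plain dictionaries, and dividing by the class counts reproduces the ratios shown on the next three pages:

```python
# Frequency counts transcribed from the slide tables above
# (13 training records; the 14th is held out as the test instance).
counts = {
    "outlook":     {"sunny": (4, 1), "overcast": (2, 2), "rainy": (0, 4)},
    "temperature": {"hot": (1, 3), "mild": (3, 2), "cool": (2, 2)},
    "humidity":    {"high": (3, 3), "normal": (3, 4)},
    "windy":       {"false": (2, 6), "true": (4, 1)},
}
class_counts = {"yes": 6, "no": 7}

# P(value | class) = count(value & class) / count(class)
for attribute, table in counts.items():
    for value, (n_yes, n_no) in table.items():
        p_yes = n_yes / class_counts["yes"]
        p_no = n_no / class_counts["no"]
        print(f"{attribute}={value}: {p_yes:.3f} (yes), {p_no:.3f} (no)")
```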

Page 7:

Inferring Probabilities from Observed

Outlook \ Play      Yes    No
Sunny                4/6    1/7
Overcast             2/6    2/7
Rainy                0/6    4/7

Temperature \ Play  Yes    No
Hot                  1/6    3/7
Mild                 3/6    2/7
Cool                 2/6    2/7

Page 8:

Inferring Probabilities from Observed

Humidity \ Play     Yes    No
High                 3/6    3/7
Normal               3/6    4/7

Windy \ Play        Yes    No
False                2/6    6/7
True                 4/6    1/7

Page 9:

Also, the attribute to be predicted … the proportion of days that were yes and no

        Yes    No
Play    6/13   7/13

Page 10:

Now, suppose we must predict the test instance

• Rainy, mild, high, true

• Probability of Yes =
probability of rainy given Yes
× probability of mild given Yes
× probability of high humidity given Yes
× probability of windy given Yes
× probability of Yes (in general)

= 0/6 × 3/6 × 3/6 × 4/6 × 6/13 = 0 / 16848 = 0.0

• Probability of No =
probability of rainy given No
× probability of mild given No
× probability of high humidity given No
× probability of windy given No
× probability of No (in general)

= 4/7 × 2/7 × 3/7 × 1/7 × 7/13 = 168 / 31213 ≈ 0.005
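The same arithmetic as a short sketch (the fractions are read straight off the tables on Pages 7–9):

```python
# Unnormalized Naive Bayes scores for the test instance
# (rainy, mild, high, true), reading the fractions off the tables above.
p_yes = (0/6) * (3/6) * (3/6) * (4/6) * (6/13)   # = 0 / 16848 = 0.0
p_no  = (4/7) * (2/7) * (3/7) * (1/7) * (7/13)   # = 168 / 31213 ~ 0.005
print("predict", "yes" if p_yes > p_no else "no")   # the larger score wins
```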

Page 11:

The Foundation:

• Bayes' Rule of Conditional Probabilities:

  P[H|E] = P[E|H] · P[H] / P[E]

• The probability of a hypothesis (e.g. play=yes) given evidence E (the new test instance) is equal to:

– the probability of the Evidence given the Hypothesis,

– times the probability of the Hypothesis,

– all divided by the probability of the Evidence

• We did the numerator of this (the denominator doesn't matter, since it is the same for both Yes and No)
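For reference, the rule and the naïve factorization discussed on the next page can be written out as follows (a standard statement of the formulas, not taken from the slides):

```latex
P(H \mid E) = \frac{P(E \mid H)\, P(H)}{P(E)},
\qquad
P(E \mid H) = \prod_{i=1}^{4} P(E_i \mid H)
```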

Page 12:

The Probability of Evidence given the Hypothesis:

• Since this is Naïve Bayes, we assume that the evidence in terms of the different attributes is independent (given the class), so the probabilities of the 4 attributes having the values that they do are multiplied together

Page 13:

The Probability of the Hypothesis:

• This is just the probability of Yes (or No)

• This is called the "prior probability" of the hypothesis – it would be your guess prior to seeing any evidence

• We multiplied this by the previous slide’s value as called for in the formula

Page 14:

A complication

• Our probability of "yes" came out zero, since no rainy day had play=yes

• This may be a little extreme – this one attribute has overruled everything, no matter what the other evidence says

• Common adjustment – start all counts off at 1 instead of at 0 (“Laplace estimator”) …
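As a sketch of the adjustment (Python; the formula is the standard add-one version of the Laplace estimator), starting every count at 1 means the denominator grows by the number of possible attribute values:

```python
# Laplace estimator: every count starts at 1 instead of 0, so
# P(value | class) = (count + 1) / (class_count + number_of_values).
def laplace(count, class_count, n_values):
    return (count + 1) / (class_count + n_values)

# outlook=rainy given yes: 0 observed rainy-yes days, 6 yes days, 3 outlook values
print(laplace(0, 6, 3))   # 1/9, matching the smoothed table on Page 18
```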

Page 15:

With Laplace Estimator … Determine Distribution of Attributes

Outlook \ Play      Yes   No
Sunny                5     2
Overcast             3     3
Rainy                1     5

Temperature \ Play  Yes   No
Hot                  2     4
Mild                 4     3
Cool                 3     3

Page 16:

With Laplace Estimator … Determine Distribution of Attributes

Humidity \ Play     Yes   No
High                 4     4
Normal               4     5

Windy \ Play        Yes   No
False                3     7
True                 5     2

Page 17:

With Laplace Estimator …, the attribute to be predicted …

        Yes   No
Play     7     8

Page 18:

With Laplace Estimator … Inferring Probabilities from Observed

Outlook \ Play      Yes    No
Sunny                5/9    2/10
Overcast             3/9    3/10
Rainy                1/9    5/10

Temperature \ Play  Yes    No
Hot                  2/9    4/10
Mild                 4/9    3/10
Cool                 3/9    3/10

Page 19:

With Laplace Estimator … Inferring Probabilities from Observed

Humidity \ Play     Yes    No
High                 4/8    4/9
Normal               4/8    5/9

Windy \ Play        Yes    No
False                3/8    7/9
True                 5/8    2/9

Page 20:

With Laplace Estimator …, the attribute to be predicted …

Proportion of days that were yes and no

        Yes    No
Play    7/15   8/15

Page 21:

Now, predict the test instance

• Rainy, mild, high, true

• Probability of Yes = 1/9 × 4/9 × 4/8 × 5/8 × 7/15 = 560 / 77760 ≈ 0.007

• Probability of No = 5/10 × 3/10 × 4/9 × 2/9 × 8/15 = 960 / 121500 ≈ 0.008
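A sketch of the same computation, with the extra normalization step (dividing each score by their sum, i.e. by the evidence term) that turns the scores into proper probabilities:

```python
# Laplace-smoothed scores for (rainy, mild, high, true), then normalized.
yes = (1/9)  * (4/9)  * (4/8) * (5/8) * (7/15)   # ~ 0.007
no  = (5/10) * (3/10) * (4/9) * (2/9) * (8/15)   # ~ 0.008
print(yes / (yes + no), no / (yes + no))          # ~ 0.48 vs 0.52 -> predict "no"
```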

Page 22:

In a 14-fold cross validation, this would continue 13 more times

• Let’s run WEKA on this … NaiveBayesSimple …

Page 23:

WEKA results – first look near the bottom

=== Stratified cross-validation ===

=== Summary ===

Correctly Classified Instances        9    64.2857 %

Incorrectly Classified Instances      5    35.7143 %

• On the cross-validation – it got 9 out of 14 tests correct

• Same as OneR

Page 24:

More Detailed Results

=== Confusion Matrix ===

a b <-- classified as

3 3 | a = yes

2 6 | b = no

• Here we see the program 5 times predicted play=yes; on 3 of those it was correct – it is predicting yes less often than OneR did

• The program 9 times predicted play=no; on 6 of those it was correct

• There were 6 instances whose actual value was play=yes; the program correctly predicted that on 3 of them

• There were 8 instances whose actual value was play=no; the program correctly predicted that on 6 of them
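The bullet arithmetic can be read off the matrix mechanically; a small sketch (Python, using WEKA's convention that rows are actual classes and columns are predictions):

```python
# Confusion matrix from the slide: rows = actual class, columns = predicted.
matrix = {"yes": (3, 3),        # actual yes: 3 predicted yes, 3 predicted no
          "no":  (2, 6)}        # actual no:  2 predicted yes, 6 predicted no

predicted_yes = matrix["yes"][0] + matrix["no"][0]   # 5 yes predictions, 3 correct
predicted_no  = matrix["yes"][1] + matrix["no"][1]   # 9 no predictions, 6 correct
correct = matrix["yes"][0] + matrix["no"][1]         # 9 of 14 overall
print(predicted_yes, predicted_no, correct)
```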

Page 25:

Again, part of our purpose is to have a take-home message for humans

• Not 14 take home messages!

• So instead of reporting each of the things learned on each of the 14 training sets …

• … The program runs again on all of the data and builds a pattern from that – a take-home message

• … However, for Naïve Bayes, the take-home message is less easily interpreted …

Page 26:

WEKA - Take-Home

Naive Bayes (simple)

Class yes: P(C) = 0.4375

Attribute outlook
sunny       overcast    rainy
0.55555556  0.33333333  0.11111111

Attribute temperature
hot         mild        cool
0.22222222  0.44444444  0.33333333

Attribute humidity
high   normal
0.5    0.5

Attribute windy
TRUE   FALSE
0.625  0.375

Page 27:

WEKA - Take-Home continued

Class no: P(C) = 0.5625

Attribute outlook
sunny       overcast    rainy
0.18181818  0.27272727  0.54545455

Attribute temperature
hot         mild        cool
0.36363636  0.36363636  0.27272727

Attribute humidity
high   normal
0.5    0.5

Attribute windy
TRUE   FALSE
0.3    0.7

Page 28:

Let’s Try WEKA Naïve Bayes on njcrimenominal

• Try 10-fold

=== Confusion Matrix ===
  a  b   <-- classified as
  6  1 |  a = bad
  7 18 |  b = ok

• This represents a slight improvement over OneR (probably not significant)

• We note that OneR chose unemployment as the attribute to use; with the probabilities, note for bad crime:

Attribute unemploy
hi    med   low
0.3   0.6   0.1

• … while for ok crime:

Attribute unemploy
hi          med         low
0.03571429  0.28571429  0.67857143

Page 29:

Naïve Bayes – Missing Values

• Training data – missing values are simply not included in the frequency counts; probability ratios are based on the number of instances where the value actually occurs, rather than the total number of instances

• Test data – the calculations omit the missing attribute, e.g.:

Prob(yes | sunny, ?, high, false) = 5/9 × 4/8 × 3/8 × 7/15 (skipping temperature)

– since the factor is omitted for each class, this is not a problem
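A minimal sketch of how that omission works in code (Python; the yes-class ratios are the Laplace-smoothed values copied from Pages 18–19):

```python
# Missing-value rule: skip the factor for any attribute whose value
# is unknown ("?") in the test instance.
p_yes = {"outlook": {"sunny": 5/9, "overcast": 3/9, "rainy": 1/9},
         "temperature": {"hot": 2/9, "mild": 4/9, "cool": 3/9},
         "humidity": {"high": 4/8, "normal": 4/8},
         "windy": {"false": 3/8, "true": 5/8}}

def score(instance, tables, prior):
    s = prior
    for attribute, value in instance.items():
        if value != "?":          # missing attribute: omit its factor entirely
            s *= tables[attribute][value]
    return s

test = {"outlook": "sunny", "temperature": "?", "humidity": "high", "windy": "false"}
print(score(test, p_yes, 7/15))   # 5/9 * 4/8 * 3/8 * 7/15, as above
```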

Page 30:

Naïve Bayes – Numeric Values

• Assume values fit a "normal" curve

• Calculate the mean and standard deviation for each class

• Known properties of normal curves allow us to use a formula for the "probability density function" to calculate the probability based on a value, the mean, and the standard deviation

• Book has the equation, p87 – don't memorize it; look it up if you are writing the program
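A sketch of that density function (Python; this is the standard normal pdf, and the numbers in the example call are illustrative, not taken from the slides):

```python
import math

# Normal probability density: replaces the frequency ratio for a numeric
# attribute; mean and stdev are estimated per class from the training data.
def gaussian_pdf(x, mean, stdev):
    exponent = -((x - mean) ** 2) / (2 * stdev ** 2)
    return math.exp(exponent) / (math.sqrt(2 * math.pi) * stdev)

# e.g. temperature = 66 for a class with mean 73, stdev 6.2 (made-up values)
print(gaussian_pdf(66, 73, 6.2))
```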

Page 31:

Naïve Bayes – Discussion

• Naïve Bayes frequently does as well as or better than sophisticated classification algorithms on real datasets – despite its assumptions being violated

• Clearly redundant attributes hurt performance, because they have the effect of counting an attribute more than once (e.g. at a school with a very high percentage of "traditional students", age and year in school are redundant)

• Many correlated or redundant attributes make Naïve Bayes a poor choice for a dataset – (unless preprocessing removes them)

• Numeric data known to not be in a normal distribution can be handled using another appropriate distribution (e.g. Poisson) or, if the distribution is unknown, a generic "kernel density estimation"

Page 32:

Class Exercise

Page 33:

Class Exercise

• Let’s run WEKA NaiveBayesSimple on japanbank

Page 34:

End Section 4.2