7.Simple Classification

Classification methods


K723 Data Mining

Transcript of 7.Simple Classification

Page 1: 7.Simple Classification

Classification methods

Page 2: 7.Simple Classification

Three Simple Classification Methods

Page 3: 7.Simple Classification

Methods & Characteristics

The three methods:
Naïve rule
Naïve Bayes
K-nearest-neighbor

Common characteristics:
Data-driven, not model-driven
Make no assumptions about the data

Page 4: 7.Simple Classification

Naïve Rule
Classify all records as the majority class.
Not a "real" method; it is introduced to serve as a benchmark against which other results are measured.

[Figure: 2x2 grid of Charge (Y/N) by Size (S/L); overall the data are 40% fraud and 60% truthful. Error rate of the naïve rule: 40%.]
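A minimal Python sketch of the naïve-rule benchmark; the 60/40 truthful/fraud labels mirror the slide, and the function name is illustrative:

```python
# Naive rule: ignore the predictors and always predict the majority class.
from collections import Counter

y_train = ["truthful"] * 6 + ["fraud"] * 4           # 60% truthful, 40% fraud, as on the slide

majority_class = Counter(y_train).most_common(1)[0][0]   # "truthful"

def naive_rule_predict(new_records):
    """Classify every record as the majority class, regardless of its predictors."""
    return [majority_class] * len(new_records)

print(naive_rule_predict([("y", "small"), ("n", "large")]))  # ['truthful', 'truthful']
# The training error rate equals the minority-class share: 4/10 = 40%, matching the slide.
```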

Page 5: 7.Simple Classification

Naïve Bayes

Page 6: 7.Simple Classification

Idea of Naïve Bayes: Financial Fraud

Target variable: fraud / truthful
Predictors:
Prior pending legal charges (yes/no)
Size of firm (small/large)

[Figure: 2x2 grid of Charge (Y/N) by Size (S/L). Classify based on the majority in each cell (conditional probability). Error rate: 20%.]

Page 7: 7.Simple Classification

Naïve Bayes: The Basic Idea

For a given new record to be classified, find other records like it (i.e., same values for the predictors)

What is the prevalent class among those records?

Assign that class to your new record

Page 8: 7.Simple Classification

Usage
Requires categorical variables; numerical variables must be binned and converted to categorical.
Can be used with very large data sets.
Example: spell check – the computer attempts to assign your misspelled word to an established "class" (i.e., a correctly spelled word).

Page 9: 7.Simple Classification

Exact Bayes Classifier
Relies on finding other records that share the same predictor values as the record to be classified.
We want to find the "probability of belonging to class C, given specified values of the predictors."

Conditional probability: P(Y = C | X1 = x1, ..., Xp = xp)

Page 10: 7.Simple Classification

Example: Financial Fraud

Target variable: fraud / truthful
Predictors:
Prior pending legal charges (yes/no)
Size of firm (small/large)

[Figure: 2x2 grid of Charge (Y/N) by Size (S/L). Classify based on the majority in each cell. Error rate: 20%.]

Page 11: 7.Simple Classification

Exact Bayes Calculations

Data (Charges?, Size, Outcome):
y  small  truthful
n  small  truthful
n  large  truthful
n  large  truthful
n  small  truthful
n  small  truthful
y  small  fraud
y  large  fraud
n  large  fraud
y  large  fraud

Counts (Truthful, Fraud):
              Small   Large
Charges = Y   (1,1)   (0,2)
Charges = N   (3,0)   (2,1)

P(F|C,S):
              Small   Large
Charges = Y   0.5     1
Charges = N   0       0.33

Classification rule:
              Small      Large
Charges = Y   ?          Fraud
Charges = N   Truthful   Truthful

Page 12: 7.Simple Classification

Exact Bayes Calculations
Goal: classify (as "fraudulent" or as "truthful") a small firm with charges filed.
There are 2 firms like that, one fraudulent and the other truthful.
P(fraud | charges=y, size=small) = 1/2 = 0.50
Note: the calculation is limited to the two firms matching those characteristics.
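A short Python sketch of this exact Bayes calculation, using the 10-record table from the previous slide (the function name is just for illustration):

```python
# Exact Bayes: estimate P(fraud | charges, size) only from records that match both predictors.
records = [
    ("y", "small", "truthful"), ("n", "small", "truthful"), ("n", "large", "truthful"),
    ("n", "large", "truthful"), ("n", "small", "truthful"), ("n", "small", "truthful"),
    ("y", "small", "fraud"),    ("y", "large", "fraud"),    ("n", "large", "fraud"),
    ("y", "large", "fraud"),
]

def exact_bayes(charges, size):
    matches = [r for r in records if r[0] == charges and r[1] == size]
    if not matches:                      # no record shares this exact predictor profile
        return None
    return sum(r[2] == "fraud" for r in matches) / len(matches)

print(exact_bayes("y", "small"))   # 0.5  -> two matching firms, one of them fraudulent
print(exact_bayes("n", "large"))   # 0.333...
```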

Page 13: 7.Simple Classification

Problem
Even with large data sets, it may be hard to find other records that exactly match your record in terms of predictor values.

Page 14: 7.Simple Classification

Solution – Naïve Bayes
Assume independence of the predictor variables (within each class).
Use the multiplication rule.
Find the same probability that the record belongs to class C, given its predictor values, without limiting the calculation to records that share all those same values.

Page 15: 7.Simple Classification


Refining the “primitive” idea: Naïve Bayes

Main idea: Instead of looking at combinations of predictors (crossed pivot table), look at each predictor separately

How can this be done? A probability trick! Based on Bayes' rule, then make a simplifying assumption, and get a powerful classifier!

Page 16: 7.Simple Classification


Conditional Probability

A = the event "X = A"; B = the event "Y = B".
P(A | B) denotes the probability of A given B (the conditional probability that A occurs given that B occurred):

P(A | B) = P(A ∩ B) / P(B),   if P(B) > 0

[Figure: Venn diagram of events A and B with their intersection A ∩ B]

P(Fraud | Charge) = P(Charge and Fraud) / P(Charge)

Page 17: 7.Simple Classification


Bayes' Rule (reverse conditioning)

What if I only know the conditional probability in the opposite direction? Bayes' rule gives a neat way to reverse the conditioning:

P(B | A) = P(A | B) P(B) / P(A)

P(A ∩ B) = P(B | A) P(A) = P(A | B) P(B)

[Figure: Venn diagram of events A and B with their intersection A ∩ B]

P(Fraud | Charge) P(Charge) = P(Charge | Fraud) P(Fraud)
P(Fraud | Charge) = P(Charge | Fraud) P(Fraud) / P(Charge)
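A quick numeric check of the reversal, using counts from the 10-record fraud table (plain arithmetic, nothing assumed beyond the slide's data):

```python
# Bayes' rule on the fraud example: 4 of 10 firms are fraudulent,
# 3 of those 4 have charges, and 4 of the 10 firms have charges overall.
p_charge_given_fraud = 3 / 4
p_fraud = 4 / 10
p_charge = 4 / 10

p_fraud_given_charge = p_charge_given_fraud * p_fraud / p_charge
print(p_fraud_given_charge)   # 0.75 -- same as counting directly: 3 of the 4 charged firms are fraudulent
```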

Page 18: 7.Simple Classification


Using Bayes' rule
Flipping the condition:

P(Y=1 | X1, ..., Xp) = P(X1, ..., Xp | Y=1) P(Y=1) / P(X1, ..., Xp)

where

P(X1, ..., Xp) = P(X1, ..., Xp | Y=1) P(Y=1) + P(X1, ..., Xp | Y=0) P(Y=0)

Page 19: 7.Simple Classification


How is this used to solve our problem?

We want to estimate P(Y=1 | X1, ..., Xp).
But we don't have enough examples of each possible profile X1, ..., Xp in the training set.
If we had P(X1, ..., Xp | Y=1) instead, we could separate it into P(X1|Y=1) * P(X2|Y=1) * ... * P(Xp|Y=1).
This is true if we can assume independence between X1, ..., Xp within each class.
That means we could use single pivot tables!
If the dependence is not extreme, this still works reasonably well.

Page 20: 7.Simple Classification

Independence Assumption
With the independence assumption, P(A ∩ B) = P(A) * P(B). We can thus calculate:

P(X1, ..., Xp | Y=1) = P(X1|Y=1) * P(X2|Y=1) * ... * P(Xp|Y=1)
P(X1, ..., Xp | Y=0) = P(X1|Y=0) * P(X2|Y=0) * ... * P(Xp|Y=0)
P(X1, ..., Xp) = P(X1, ..., Xp | Y=1) P(Y=1) + P(X1, ..., Xp | Y=0) P(Y=0)

[Figure: Venn diagram of events A and B with their intersection A ∩ B]

Page 21: 7.Simple Classification


Putting it all together: How it works

1. All predictors must be categorical.
2. From the training set, create all pivot tables of Y on each separate X. We can thus obtain P(X), P(X|Y=1), P(X|Y=0).
3. For a to-be-predicted observation with predictors X1, X2, ..., Xp, the software computes the probability of belonging to Y=1 using the formula

   P(Y=1 | X1, ..., Xp) = P(X1|Y=1) P(X2|Y=1) ... P(Xp|Y=1) P(Y=1) / P(X1, ..., Xp)

   Each of the probabilities in the formula is estimated from a pivot table, and the estimated P(Y=1) is the proportion of 1's in the training set.
4. Use a cutoff to determine the classification of this observation. Default: cutoff = 0.5 (classify to the group that is most likely).
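A minimal Python sketch of steps 1-4 for a data set with categorical predictors; the function names are illustrative, and this is not XLMiner's implementation:

```python
# Naive Bayes for a categorical-predictor data set: per-predictor pivot tables,
# multiply the conditional probabilities with the prior, then normalize.
from collections import Counter, defaultdict

def train_naive_bayes(X, y):
    """X: list of tuples of categorical predictor values; y: list of class labels."""
    prior = {c: n / len(y) for c, n in Counter(y).items()}
    cond = defaultdict(Counter)                  # (predictor index, class) -> value counts
    for xs, c in zip(X, y):
        for j, v in enumerate(xs):
            cond[(j, c)][v] += 1
    return prior, cond

def predict_proba(prior, cond, xs):
    """Return P(class | xs) for each class, via the formula on this slide."""
    score = {}
    for c, p_c in prior.items():
        p = p_c
        for j, v in enumerate(xs):
            counts = cond[(j, c)]
            p *= counts[v] / sum(counts.values())      # estimated P(Xj = v | Y = c)
        score[c] = p
    total = sum(score.values())                         # estimated P(X1, ..., Xp)
    return {c: s / total for c, s in score.items()}

# Fraud example: predictors are (charges, size).
X = [("y","small"), ("n","small"), ("n","large"), ("n","large"), ("n","small"),
     ("n","small"), ("y","small"), ("y","large"), ("n","large"), ("y","large")]
y = ["truthful"] * 6 + ["fraud"] * 4

prior, cond = train_naive_bayes(X, y)
probs = predict_proba(prior, cond, ("y", "small"))
print(round(probs["fraud"], 3))        # ~0.529 (0.528 on the later slides after rounding)
label = "fraud" if probs["fraud"] > 0.5 else "truthful"   # default cutoff = 0.5
```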

Page 22: 7.Simple Classification

Naïve Bayes, cont.
Note that the probability estimates do not differ greatly from the exact ones.
All records are used in the calculations, not just those matching the predictor values.
This makes the calculations practical in most circumstances.
Relies on the assumption of independence between predictor variables within each class.

Page 23: 7.Simple Classification

Independence Assumption

Not strictly justified (variables often correlated with one another)

Often “good enough”

Page 24: 7.Simple Classification

Example: Financial Fraud

Target variable: fraud / truthful
Predictors:
Prior pending legal charges (yes/no)
Size of firm (small/large)

[Figure: 2x2 grid of Charge (Y/N) by Size (S/L). Classify based on the estimated conditional probability.]

Page 25: 7.Simple Classification

[Figure: single-predictor pivot tables of Charge and Size by class. Truthful (6 records): Charge Y = 1, N = 5; Size S = 4, L = 2. Fraud (4 records): Charge Y = 3, N = 1; Size S = 1, L = 3.]

P(S,Y|T) P(T) = P(S|T) * P(Y|T) * P(T) = (4/6) * (1/6) * (6/10) = 0.067
P(S,Y|F) P(F) = P(S|F) * P(Y|F) * P(F) = (1/4) * (3/4) * (4/10) = 0.075

P(F|S,Y) = P(S,Y|F) P(F) / P(S,Y)
         = P(S,Y|F) P(F) / (P(S,Y|F) P(F) + P(S,Y|T) P(T))
         = 0.075 / (0.075 + 0.067) = 0.528
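The same calculation as plain arithmetic, for a small firm with charges filed (values taken from the slide):

```python
# Naive Bayes estimate for (Size = small, Charges = y).
p_SY_T = (4/6) * (1/6) * (6/10)    # P(S|T) * P(Y|T) * P(T) = 0.0667
p_SY_F = (1/4) * (3/4) * (4/10)    # P(S|F) * P(Y|F) * P(F) = 0.075

p_F_given_SY = p_SY_F / (p_SY_F + p_SY_T)
print(round(p_F_given_SY, 3))      # 0.529 (0.528 on the slide, which rounds intermediate values)
```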

Page 26: 7.Simple Classification

Naïve Bayes Calculations

Data (Charges?, Size, Outcome):
y  small  truthful
n  small  truthful
n  large  truthful
n  large  truthful
n  small  truthful
n  small  truthful
y  small  fraud
y  large  fraud
n  large  fraud
y  large  fraud

Counts (Truthful, Fraud):
          Small   Large   Sum
Y         (1,1)   (0,2)   (1,3)
N         (3,0)   (2,1)   (5,1)
Sum       (4,1)   (2,3)   (6,4)

P(C,S|F)P(F), estimated as P(C|F)P(S|F)P(F):
          Small   Large   P(C|F)
Y         0.075   0.225   0.75
N         0.025   0.075   0.25
P(S|F)    0.25    0.75    P(F) = 0.40

  e.g., 0.25 * 0.75 * 0.40 = 0.075

P(C,S|T)P(T), estimated as P(C|T)P(S|T)P(T):
          Small   Large   P(C|T)
Y         0.067   0.034   0.17
N         0.334   0.164   0.83
P(S|T)    0.67    0.33    P(T) = 0.60

Naïve Bayes formula:
P(F|C,S) = P(C,S|F)P(F) / P(C,S) = P(C|F)P(S|F)P(F) / P(C,S)
P(C,S) = P(C,S|F)P(F) + P(C,S|T)P(T)

Estimated P(F|C,S) (naïve Bayes):
          Small   Large
Y         0.528   0.869
N         0.070   0.316

  e.g., 0.075 / (0.075 + 0.067) = 0.528

Exact P(F|C,S) (for comparison):
          Small   Large
Y         0.5     1
N         0       0.33

Page 27: 7.Simple Classification

Example: Financial Fraud

Target variable: fraud / truthful
Predictors:
Prior pending legal charges (yes/no)
Size of firm (small/large)

Estimated conditional probability P(F|C,S) for the Charge (Y/N) by Size (S/L) grid:
          Small   Large
Y         0.528   0.869
N         0.070   0.316

Page 28: 7.Simple Classification


Advantages and Disadvantages

The good:
Simple
Can handle a large number of predictors
High accuracy when the goal is ranking
Pretty robust to violations of the independence assumption!

The bad:
Need to categorize continuous predictors
Predictors with "rare" categories -> zero probability (if such a category is important, this is a problem)
Gives biased estimates of the probability of class membership
No insight into the importance/role of each predictor

Page 29: 7.Simple Classification


Naïve Bayes in XLMiner
Classification > Naïve Bayes

Prior class probabilities (according to relative occurrences in training data):
Class   Prob.
1       0.095333333   <-- Success class
0       0.904666667

Conditional probabilities:
                     Class 1               Class 0
Input variable       Value   Prob          Value   Prob
Online               0       0.374125874   0       0.401621223
                     1       0.625874126   1       0.598378777
CreditCard           0       0.699300699   0       0.711864407
                     1       0.300699301   1       0.288135593

P(accept=1) = 0.095
P(CC=1 | accept=1) = 0.301

Sheet: NNB-Output1

Page 30: 7.Simple Classification


Naïve Bayes in XLMiner: scoring the validation data

XLMiner: Naive Bayes - Classification of Validation Data
Cutoff prob. value for success (updatable): 0.5 (updating the value here will NOT update the value in the summary report)
Data range: ['UniversalBank KNN NBayes.xls']'Data_Partition1'!$C$3019:$O$5018

Row Id.   Predicted Class   Actual Class   Prob. for 1 (success)   Online   CreditCard
2         0                 0              0.08795125              0        0
3         0                 0              0.08795125              0        0
7         0                 0              0.097697987             1        0
8         0                 0              0.092925663             0        1
11        0                 0              0.08795125              0        0
13        0                 0              0.08795125              0        0
14        0                 0              0.097697987             1        0
15        0                 0              0.08795125              0        0
16        0                 0              0.10316131              1        1

Sheet: NNB-ValidScore1

Page 31: 7.Simple Classification

K-Nearest Neighbors

Page 32: 7.Simple Classification

Basic Idea
For a given record to be classified, identify nearby records. "Near" means records with similar predictor values X1, X2, ..., Xp.
Classify the record as whatever the predominant class is among the nearby records (the "neighbors").

Page 33: 7.Simple Classification

How to Measure “nearby”?

The most popular distance measure is Euclidean distance: d(x, y) = sqrt((x1 - y1)^2 + ... + (xp - yp)^2).
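A tiny Python sketch of the distance calculation (in practice the predictors are usually normalized first so that no single variable dominates the distance; the example values are from the riding-mowers data shown later):

```python
import math

def euclidean(x, y):
    """Euclidean distance between two records' predictor vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# e.g., two riding-mower households (Income, Lot_Size):
print(round(euclidean((60.0, 18.4), (64.8, 21.6)), 2))   # 5.77
```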

Page 34: 7.Simple Classification

Choosing k
k is the number of nearby neighbors used to classify the new record.
k = 1 means use the single nearest record; k = 5 means use the 5 nearest records.
Typically, choose the value of k that has the lowest error rate on the validation data.

Page 35: 7.Simple Classification

[Figure: records plotted by predictors X1 and X2; a new record is classified by the majority class among its k = 3 nearest neighbors.]

Page 36: 7.Simple Classification

Low k vs. High k
Low values of k (1, 3, ...) capture local structure in the data (but also noise).
High values of k provide more smoothing and less noise, but may miss local structure.
Note: the extreme case of k = n (i.e., the entire data set) is the same as the naïve rule (classify all records according to the majority class).

Page 37: 7.Simple Classification

Example: Riding Mowers

Data: 24 households classified as owning or not owning riding mowers

Predictors = Income, Lot Size

Page 38: 7.Simple Classification

Income   Lot_Size   Ownership
60.0     18.4       owner
85.5     16.8       owner
64.8     21.6       owner
61.5     20.8       owner
87.0     23.6       owner
110.1    19.2       owner
108.0    17.6       owner
82.8     22.4       owner
69.0     20.0       owner
93.0     20.8       owner
51.0     22.0       owner
81.0     20.0       owner
75.0     19.6       non-owner
52.8     20.8       non-owner
64.8     17.2       non-owner
43.2     20.4       non-owner
84.0     17.6       non-owner
49.2     17.6       non-owner
59.4     16.0       non-owner
66.0     18.4       non-owner
47.4     16.4       non-owner
33.0     18.8       non-owner
51.0     14.0       non-owner
63.0     14.8       non-owner
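A minimal k-NN classification sketch on this table; the standardization step and the new household's values are illustrative additions, not part of the slides:

```python
import math

data = [  # (Income, Lot_Size, Ownership) -- the 24 households above
    (60.0,18.4,"owner"), (85.5,16.8,"owner"), (64.8,21.6,"owner"), (61.5,20.8,"owner"),
    (87.0,23.6,"owner"), (110.1,19.2,"owner"), (108.0,17.6,"owner"), (82.8,22.4,"owner"),
    (69.0,20.0,"owner"), (93.0,20.8,"owner"), (51.0,22.0,"owner"), (81.0,20.0,"owner"),
    (75.0,19.6,"non-owner"), (52.8,20.8,"non-owner"), (64.8,17.2,"non-owner"),
    (43.2,20.4,"non-owner"), (84.0,17.6,"non-owner"), (49.2,17.6,"non-owner"),
    (59.4,16.0,"non-owner"), (66.0,18.4,"non-owner"), (47.4,16.4,"non-owner"),
    (33.0,18.8,"non-owner"), (51.0,14.0,"non-owner"), (63.0,14.8,"non-owner"),
]
X = [(r[0], r[1]) for r in data]
y = [r[2] for r in data]

# Standardize each predictor ((value - mean) / std) so Income and Lot_Size are comparable.
means = [sum(col) / len(col) for col in zip(*X)]
stds = [math.sqrt(sum((v - m) ** 2 for v in col) / len(col)) for col, m in zip(zip(*X), means)]
Xz = [tuple((v - m) / s for v, m, s in zip(row, means, stds)) for row in X]

def knn_classify(new_record, k=3):
    """Classify by majority vote among the k nearest (standardized) training records."""
    z = tuple((v - m) / s for v, m, s in zip(new_record, means, stds))
    nearest = sorted(range(len(Xz)), key=lambda i: math.dist(z, Xz[i]))[:k]
    votes = [y[i] for i in nearest]
    return max(set(votes), key=votes.count)

# Classify a hypothetical new household with Income 60.0 and Lot_Size 20.0.
print(knn_classify((60.0, 20.0), k=3))
```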

Page 39: 7.Simple Classification

XLMiner Output
For each record in the validation data (6 records), XLMiner finds neighbors among the training data (18 records).
Each record is scored for k = 1, k = 2, ..., k = 18.
The best k seems to be k = 8. k = 9, k = 10, and k = 14 also share the low error rate, but it is best to choose the lowest such k.
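A sketch of that k-selection loop: score the validation records for each k and keep the smallest k among those with the lowest validation error (the helper names and the train/validation split are illustrative, not XLMiner's):

```python
import math

def knn_predict(train_X, train_y, x, k):
    """Classify x by majority vote among its k nearest training records."""
    idx = sorted(range(len(train_X)), key=lambda i: math.dist(x, train_X[i]))[:k]
    votes = [train_y[i] for i in idx]
    return max(set(votes), key=votes.count)

def best_k(train_X, train_y, valid_X, valid_y, max_k):
    """Return the smallest k with the lowest validation error rate, plus all error rates."""
    errors = {}
    for k in range(1, max_k + 1):
        wrong = sum(knn_predict(train_X, train_y, x, k) != yv
                    for x, yv in zip(valid_X, valid_y))
        errors[k] = wrong / len(valid_y)
    lowest = min(errors.values())
    return min(k for k, e in errors.items() if e == lowest), errors
```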

Page 40: 7.Simple Classification

Value of k   % Error Training   % Error Validation

1 0.00 33.33

2 16.67 33.33

3 11.11 33.33

4 22.22 33.33

5 11.11 33.33

6 27.78 33.33

7 22.22 33.33

8 22.22 16.67 <--- Best k

9 22.22 16.67

10 22.22 16.67

11 16.67 33.33

12 16.67 16.67

13 11.11 33.33

14 11.11 16.67

15 5.56 33.33

16 16.67 33.33

17 11.11 33.33

18 50.00 50.00

Page 41: 7.Simple Classification

Using K-NN for Prediction (for Numerical Outcome)

Instead of "majority vote determines the class," use the average of the neighbors' response values.
This may be a weighted average, with weights decreasing with distance.
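A minimal sketch of k-NN prediction for a numerical outcome, with an optional inverse-distance weighting (the epsilon guard is an illustrative detail):

```python
import math

def knn_predict_numeric(train_X, train_y, x, k, weighted=False):
    """Predict a numerical outcome as the (optionally weighted) average of the k nearest neighbors."""
    idx = sorted(range(len(train_X)), key=lambda i: math.dist(x, train_X[i]))[:k]
    if not weighted:
        return sum(train_y[i] for i in idx) / k
    # Weight each neighbor by 1/distance (small epsilon avoids division by zero for exact matches).
    weights = [1.0 / (math.dist(x, train_X[i]) + 1e-9) for i in idx]
    return sum(w * train_y[i] for w, i in zip(weights, idx)) / sum(weights)
```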

Page 42: 7.Simple Classification

Advantages
Simple.
No assumptions required about normal distributions, etc.
Effective at capturing complex interactions among variables without having to define a statistical model.

Page 43: 7.Simple Classification

Shortcomings
The required size of the training set increases exponentially with the number of predictors p, because the expected distance to the nearest neighbor increases with p (with a large vector of predictors, all records end up "far away" from each other).
In a large training set, it takes a long time to compute distances to all the neighbors and then identify the nearest one(s).
Together, these problems constitute the "curse of dimensionality."

Page 44: 7.Simple Classification

Dealing with the Curse

Reduce dimension of predictors (e.g., with PCA)

Computational shortcuts that settle for “almost nearest neighbors”
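One way to sketch the PCA option in Python, using a plain SVD rather than any particular library's PCA routine; k-NN distances would then be computed on the reduced scores:

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project the rows of X (records x predictors) onto the top principal components."""
    Xc = X - X.mean(axis=0)                      # center each predictor
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T              # scores on the leading components

# e.g., if X has many predictors, run k-NN on pca_reduce(X, 2) instead of on all of them.
```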

Page 45: 7.Simple Classification

Summary
Naïve rule: a benchmark.
Naïve Bayes and k-NN are two variations on the same theme: "classify a new record according to the class of similar records."
No statistical models are involved.
These methods pay attention to complex interactions and local structure.
Computational challenges remain.