7. Simple Classification
Classification Methods

Three Simple Classification Methods

Methods & Characteristics
The three methods: naïve rule, naïve Bayes, K-nearest-neighbor
Common characteristics: data-driven, not model-driven; make no assumptions about the data
Naïve Rule
Classify all records as the majority class.
Not a "real" method; introduced so it will serve as a benchmark against which to measure other results.

[Figure: Charge (Y/N) x Size (S/L) grid for the fraud example; 40% fraud, 60% truthful. Classifying everything as truthful gives an error rate of 40%.]
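The naïve-rule benchmark is trivial to compute; a minimal Python sketch, using the 40%/60% fraud example above (function name is our own):

```python
from collections import Counter

def naive_rule_error(labels):
    """Error rate of classifying every record as the majority class."""
    counts = Counter(labels)
    majority_count = counts.most_common(1)[0][1]
    return 1 - majority_count / len(labels)

# The 10-record fraud example: 6 truthful, 4 fraudulent
labels = ["truthful"] * 6 + ["fraud"] * 4
print(naive_rule_error(labels))  # 0.4, matching the 40% error rate above
```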
Naïve Bayes

Idea of Naïve Bayes: Financial Fraud
Target variable: fraud / truthful
Predictors:
- Prior pending legal charges (yes/no)
- Size of firm (small/large)

[Figure: Charge (Y/N) x Size (S/L) grid; classify based on the majority in each cell (conditional probability). Error rate: 20%.]
Naïve Bayes: The Basic Idea
For a given new record to be classified, find other records like it (i.e., with the same values for the predictors).
What is the prevalent class among those records?
Assign that class to your new record.

Usage
Requires categorical variables; numerical variables must be binned and converted to categorical.
Can be used with very large data sets.
Example: spell check. The computer attempts to assign your misspelled word to an established "class" (i.e., a correctly spelled word).
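Binning a numerical predictor takes only a few lines; a minimal sketch (the cut points and labels below are purely illustrative, not from the slides):

```python
def bin_value(x, edges, labels):
    """Map a numerical value to a categorical bin label.
    `edges` are interior cut points in ascending order; there must be
    len(edges) + 1 labels. Illustrative values only."""
    for edge, label in zip(edges, labels):
        if x < edge:
            return label
    return labels[-1]

# e.g., bin firm revenue (in $M) into three hypothetical size categories
edges = [10, 100]                      # interior cut points
labels = ["small", "medium", "large"]  # len(edges) + 1 labels
print(bin_value(3, edges, labels))     # small
print(bin_value(250, edges, labels))   # large
```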
Exact Bayes Classifier
Relies on finding other records that share the same predictor values as the record-to-be-classified.
We want to find the "probability of belonging to class C, given specified values of the predictors": the conditional probability
P(Y = C | X = (x1, ..., xp))
Example: Financial Fraud
Target variable: fraud / truthful
Predictors:
- Prior pending legal charges (yes/no)
- Size of firm (small/large)

[Figure: Charge (Y/N) x Size (S/L) grid; classify based on the majority in each cell. Error rate: 20%.]

Training data (10 firms):

Charges?  Size   Outcome
y         small  truthful
n         small  truthful
n         large  truthful
n         large  truthful
n         small  truthful
n         small  truthful
y         small  fraud
y         large  fraud
n         large  fraud
y         large  fraud

Counts (Truthful, Fraud):

             Small   Large
Charges Yes  (1,1)   (0,2)
Charges No   (3,0)   (2,1)

Exact P(F|C,S):

     Small   Large
Y    0.5     1
N    0       0.33

Resulting rule:

     Small      Large
Y    ?          Fraud
N    Truthful   Truthful

(The "?" cell is a tie: one fraudulent and one truthful firm.)
Exact Bayes Calculations
Goal: classify (as "fraudulent" or as "truthful") a small firm with charges filed.
There are 2 firms like that, one fraudulent and the other truthful, so
P(fraud | charges = y, size = small) = 1/2 = 0.50
Note: the calculation is limited to the two firms matching those characteristics.
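The exact Bayes calculation above can be reproduced directly on the 10-firm table; a minimal sketch (function and field names are our own):

```python
def exact_bayes(records, target, predictors):
    """P(target class | exact predictor match): restrict to records whose
    predictor values all match, then take the class proportion among them."""
    matches = [r for r in records
               if all(r[k] == v for k, v in predictors.items())]
    return sum(r["outcome"] == target for r in matches) / len(matches)

# The 10 firms from the table above: (charges, size, outcome)
rows = [("y", "small", "truthful"), ("n", "small", "truthful"),
        ("n", "large", "truthful"), ("n", "large", "truthful"),
        ("n", "small", "truthful"), ("n", "small", "truthful"),
        ("y", "small", "fraud"),    ("y", "large", "fraud"),
        ("n", "large", "fraud"),    ("y", "large", "fraud")]
records = [{"charges": c, "size": s, "outcome": o} for c, s, o in rows]

p = exact_bayes(records, "fraud", {"charges": "y", "size": "small"})
print(p)  # 0.5 -- only 2 matching firms, one of each class
```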
Problem
Even with large data sets, it may be hard to find other records that exactly match your record in terms of predictor values.

Solution: Naïve Bayes
Assume independence of the predictor variables (within each class) and use the multiplication rule.
Find the same probability that the record belongs to class C, given the predictor values, without limiting the calculation to the records that share all those same values.
Refining the "Primitive" Idea: Naïve Bayes
Main idea: instead of looking at combinations of predictors (a crossed pivot table), look at each predictor separately.
How can this be done? A probability trick! Start from Bayes' rule, then make a simplifying assumption, and get a powerful classifier.
Conditional Probability
Let A be the event "X = A" and B the event "Y = B". P(A|B) denotes the probability of A given B (the conditional probability that A occurs given that B occurred):

P(A|B) = P(A∩B) / P(B),  if P(B) > 0

[Venn diagram: events A and B overlapping in A∩B]

Example: P(Fraud | Charge) = P(Charge and Fraud) / P(Charge)
Bayes' Rule (Reverse Conditioning)
What if I only know the opposite direction? Bayes' rule gives a neat way to reverse the conditioning:

P(B|A) = P(A|B) P(B) / P(A)

This follows because P(A∩B) = P(B|A) P(A) = P(A|B) P(B).

[Venn diagram: events A and B overlapping in A∩B]

Example: P(Fraud | Charge) P(Charge) = P(Charge | Fraud) P(Fraud), so
P(Fraud | Charge) = P(Charge | Fraud) P(Fraud) / P(Charge)
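As a numeric sanity check of Bayes' rule on the 10-firm fraud data from earlier: reversing P(Charge | Fraud) recovers exactly the P(Fraud | Charge) we would get by direct counting.

```python
# The 10 firms from the fraud example: (charges, outcome)
data = [("y", "T"), ("n", "T"), ("n", "T"), ("n", "T"), ("n", "T"),
        ("n", "T"), ("y", "F"), ("y", "F"), ("n", "F"), ("y", "F")]

n = len(data)
p_charge = sum(c == "y" for c, _ in data) / n                  # 4/10
p_fraud = sum(o == "F" for _, o in data) / n                   # 4/10
p_charge_given_fraud = (sum(c == "y" and o == "F" for c, o in data)
                        / sum(o == "F" for _, o in data))      # 3/4

# Bayes' rule: P(F|C) = P(C|F) P(F) / P(C)
p_fraud_given_charge = p_charge_given_fraud * p_fraud / p_charge
print(p_fraud_given_charge)  # ~0.75

# Direct count agrees: 3 of the 4 firms with charges are fraudulent
direct = (sum(c == "y" and o == "F" for c, o in data)
          / sum(c == "y" for c, _ in data))
assert abs(p_fraud_given_charge - direct) < 1e-9
```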
Using Bayes' Rule
Flipping the condition:

P(Y=1 | X1,...,Xp) = P(X1,...,Xp | Y=1) P(Y=1) / P(X1,...,Xp)

where, by the law of total probability,

P(X1,...,Xp) = P(X1,...,Xp | Y=1) P(Y=1) + P(X1,...,Xp | Y=0) P(Y=0)
How Is This Used to Solve Our Problem?
We want to estimate P(Y=1 | X1,...,Xp), but we don't have enough examples of each possible profile X1,...,Xp in the training set.
If we had instead P(X1,...,Xp | Y=1), we could separate it into P(X1|Y=1) · P(X2|Y=1) ··· P(Xp|Y=1).
This is true if we can assume independence between X1,...,Xp within each class.
That means we could use single pivot tables!
If the dependence is not extreme, it will still work reasonably well.
Independence Assumption
With the independence assumption P(A∩B) = P(A) · P(B) applied within each class, we can calculate:

P(X1,...,Xp | Y=1) = P(X1|Y=1) · P(X2|Y=1) ··· P(Xp|Y=1)
P(X1,...,Xp | Y=0) = P(X1|Y=0) · P(X2|Y=0) ··· P(Xp|Y=0)
P(X1,...,Xp) = P(X1,...,Xp | Y=1) P(Y=1) + P(X1,...,Xp | Y=0) P(Y=0)

(Note that the last line weights each class-conditional probability by its prior, P(Y=1) or P(Y=0).)
Putting It All Together: How It Works
1. All predictors must be categorical.
2. From the training set, create a pivot table of Y on each separate X. We can thus obtain P(X), P(X|Y=1), P(X|Y=0).
3. For a to-be-predicted observation with predictors X1, X2, ..., Xp, the software computes the probability of belonging to Y=1 using the formula

   P(Y=1 | X1,...,Xp) = P(X1|Y=1) P(X2|Y=1) ··· P(Xp|Y=1) P(Y=1) / P(X1,...,Xp)

   Each of the probabilities in the formula is estimated from a pivot table, and the estimated P(Y=1) is the proportion of 1's in the training set.
4. Use the cutoff to determine the classification of this observation. Default: cutoff = 0.5 (classify to the group that is most likely).
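Steps 2-3 can be sketched in a few lines of Python on the 10-firm fraud example (function and field names are our own). The result, about 0.53, is the naïve Bayes estimate the slides compute later, versus 0.50 from the exact Bayes calculation:

```python
from collections import Counter

def naive_bayes_posterior(records, x, target_var="outcome", positive="fraud"):
    """P(Y=positive | x) under the naïve (conditional independence)
    assumption: each probability comes from a single-predictor pivot,
    not from the full cross-table."""
    classes = Counter(r[target_var] for r in records)
    n = len(records)
    score = {}
    for cls, cnt in classes.items():
        p = cnt / n                                # prior P(Y=cls)
        for var, val in x.items():                 # product of P(Xi=val | Y=cls)
            p *= sum(r[var] == val
                     for r in records if r[target_var] == cls) / cnt
        score[cls] = p
    return score[positive] / sum(score.values())   # normalize over classes

rows = [("y", "small", "truthful"), ("n", "small", "truthful"),
        ("n", "large", "truthful"), ("n", "large", "truthful"),
        ("n", "small", "truthful"), ("n", "small", "truthful"),
        ("y", "small", "fraud"),    ("y", "large", "fraud"),
        ("n", "large", "fraud"),    ("y", "large", "fraud")]
records = [{"charges": c, "size": s, "outcome": o} for c, s, o in rows]

p = naive_bayes_posterior(records, {"charges": "y", "size": "small"})
print(round(p, 3))  # 0.529 (the slides round intermediate terms to get 0.528)
```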
Naïve Bayes, cont.
Note that the probability estimate does not differ greatly from the exact one.
All records are used in the calculations, not just those matching the predictor values. This makes the calculations practical in most circumstances.
Relies on the assumption of independence between the predictor variables within each class.

Independence Assumption
Not strictly justified (variables are often correlated with one another).
Often "good enough".
Example: Financial Fraud
Target variable: fraud / truthful
Predictors:
- Prior pending legal charges (yes/no)
- Size of firm (small/large)

[Figure: Charge (Y/N) x Size (S/L) grids showing the single-predictor counts used in the calculation: among truthful firms, 1 with charges vs. 5 without, and 4 small vs. 2 large; among fraudulent firms, 3 with charges vs. 1 without, and 1 small vs. 3 large. Classify based on the estimated conditional probability.]

P(S,Y|T) P(T) = P(S|T) · P(Y|T) · P(T) = (4/6) · (1/6) · (6/10) = 0.067
P(S,Y|F) P(F) = P(S|F) · P(Y|F) · P(F) = (1/4) · (3/4) · (4/10) = 0.075

P(F|S,Y) = P(S,Y|F) P(F) / P(S,Y)
         = P(S,Y|F) P(F) / [P(S,Y|F) P(F) + P(S,Y|T) P(T)]
         = 0.075 / (0.075 + 0.067) = 0.528
Naïve Bayes Calculations
Training data: the same 10 firms as above. Counts (Truthful, Fraud):

        Small   Large   Sum
Y       (1,1)   (0,2)   (1,3)
N       (3,0)   (2,1)   (5,1)
Sum     (4,1)   (2,3)   (6,4)

Estimated joint probabilities for the fraud class, P(C,S|F) P(F):

        Small   Large   P(C|F)
Y       0.075   0.225   0.75
N       0.025   0.075   0.25
P(S|F)  0.25    0.75    P(F) = 0.40

Estimated joint probabilities for the truthful class, P(C,S|T) P(T):

        Small   Large   P(C|T)
Y       0.067   0.034   0.17
N       0.334   0.164   0.83
P(S|T)  0.67    0.33    P(T) = 0.60

Combining:
P(F|C,S) = P(C,S|F) P(F) / P(C,S) = P(C|F) P(S|F) P(F) / P(C,S),
where P(C,S) = P(C,S|F) P(F) + P(C,S|T) P(T)

For example, for a small firm with charges filed:
P(C,S|F) P(F) = 0.25 · 0.75 · 0.40 = 0.075
P(F|C,S) = 0.075 / (0.075 + 0.067) = 0.528

Naïve Bayes estimate of P(F|C,S):

        Small   Large
Y       0.528   0.869
N       0.070   0.316

Exact P(F|C,S), for comparison:

        Small   Large
Y       0.5     1
N       0       0.33
Example: Financial Fraud (summary)
Target variable: fraud / truthful
Predictors:
- Prior pending legal charges (yes/no)
- Size of firm (small/large)

Estimated conditional probability P(F|C,S):

        Small   Large
Y       0.528   0.869
N       0.070   0.316
Advantages and Disadvantages
The good:
- Simple
- Can handle a large number of predictors
- High accuracy when the goal is ranking
- Pretty robust to violations of the independence assumption!
The bad:
- Continuous predictors must be categorized
- Predictors with "rare" categories yield zero probabilities (if such a category is important, this is a problem)
- Gives biased probabilities of class membership
- No insight into the importance/role of each predictor
Naïve Bayes in XLMiner
Classification > Naïve Bayes

Prior class probabilities (according to relative occurrences in the training data):

Class   Prob.
1       0.0953   <-- success class
0       0.9047

Conditional probabilities:

Input variable   Value   P(value | Class=1)   P(value | Class=0)
Online           0       0.374                0.402
Online           1       0.626                0.598
CreditCard       0       0.699                0.712
CreditCard       1       0.301                0.288

For example, P(CC=1 | accept=1) = 0.301 and P(accept=1) = 0.095.
Sheet: NNB-Output1
Naïve Bayes in XLMiner: Scoring the Validation Data

XLMiner: Naive Bayes - Classification of Validation Data
Cutoff prob. value for success (updatable): 0.5 (updating the value here will NOT update the value in the summary report)

Row Id   Predicted Class   Actual Class   Prob. for 1 (success)   Online   CreditCard
2        0                 0              0.08795125              0        0
3        0                 0              0.08795125              0        0
7        0                 0              0.097697987             1        0
8        0                 0              0.092925663             0        1
11       0                 0              0.08795125              0        0
13       0                 0              0.08795125              0        0
14       0                 0              0.097697987             1        0
15       0                 0              0.08795125              0        0
16       0                 0              0.10316131              1        1

Data range: ['UniversalBank KNN NBayes.xls']'Data_Partition1'!$C$3019:$O$5018
Sheet: NNB-ValidScore1
K-Nearest Neighbors

Basic Idea
For a given record to be classified, identify nearby records.
"Near" means records with similar predictor values X1, X2, ..., Xp.
Classify the record as whatever the predominant class is among the nearby records (the "neighbors").

How to Measure "Nearby"?
The most popular distance measure is Euclidean distance:
d(u, v) = sqrt((u1 - v1)² + (u2 - v2)² + ... + (up - vp)²)

Choosing k
k is the number of nearby neighbors used to classify the new record:
k = 1 means use the single nearest record; k = 5 means use the 5 nearest records.
Typically choose the value of k that has the lowest error rate on the validation data.

[Figure: scatter plot of X1 vs. X2 showing a new point and its k = 3 nearest neighbors.]

Low k vs. High k
Low values of k (1, 3, ...) capture local structure in the data (but also noise).
High values of k provide more smoothing and less noise, but may miss local structure.
Note: the extreme case of k = n (i.e., the entire data set) is the same as the naïve rule (classify all records according to the majority class).
Example: Riding Mowers
Data: 24 households classified as owning or not owning riding mowers.
Predictors: Income, Lot_Size

Income   Lot_Size   Ownership
60.0     18.4       owner
85.5     16.8       owner
64.8     21.6       owner
61.5     20.8       owner
87.0     23.6       owner
110.1    19.2       owner
108.0    17.6       owner
82.8     22.4       owner
69.0     20.0       owner
93.0     20.8       owner
51.0     22.0       owner
81.0     20.0       owner
75.0     19.6       non-owner
52.8     20.8       non-owner
64.8     17.2       non-owner
43.2     20.4       non-owner
84.0     17.6       non-owner
49.2     17.6       non-owner
59.4     16.0       non-owner
66.0     18.4       non-owner
47.4     16.4       non-owner
33.0     18.8       non-owner
51.0     14.0       non-owner
63.0     14.8       non-owner
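Scoring a new household with k-NN is straightforward; a minimal sketch using a subset of the records above (the new point is hypothetical, and in practice Income and Lot_Size should be standardized before computing distances, since their scales differ):

```python
import math
from collections import Counter

def knn_classify(train, new_point, k):
    """Classify by majority vote among the k nearest training records,
    using Euclidean distance on the raw predictor values."""
    nearest = sorted(train, key=lambda rec: math.dist(rec[0], new_point))
    votes = Counter(label for _, label in nearest[:k])
    return votes.most_common(1)[0][0]

# A few of the 24 riding-mower households: ((Income, Lot_Size), Ownership)
train = [((60.0, 18.4), "owner"), ((61.5, 20.8), "owner"),
         ((64.8, 21.6), "owner"), ((69.0, 20.0), "owner"),
         ((59.4, 16.0), "non-owner"), ((66.0, 18.4), "non-owner"),
         ((52.8, 20.8), "non-owner"), ((63.0, 14.8), "non-owner")]

print(knn_classify(train, (60.0, 20.0), k=3))  # owner
```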
XLMiner Output
For each record in the validation data (6 records), XLMiner finds the neighbors among the training data (18 records).
Each record is scored for k = 1, 2, ..., 18.
The best k seems to be k = 8. k = 9, k = 10, and k = 14 also share the low validation error rate, but it is best to choose the lowest such k.

Value of k   % Error Training   % Error Validation
1 0.00 33.33
2 16.67 33.33
3 11.11 33.33
4 22.22 33.33
5 11.11 33.33
6 27.78 33.33
7 22.22 33.33
8 22.22 16.67 <--- Best k
9 22.22 16.67
10 22.22 16.67
11 16.67 33.33
12 16.67 16.67
13 11.11 33.33
14 11.11 16.67
15 5.56 33.33
16 16.67 33.33
17 11.11 33.33
18 50.00 50.00
Using k-NN for Prediction (for a Numerical Outcome)
Instead of "majority vote determines class", use the average of the neighbors' response values.
This may be a weighted average, with weight decreasing with distance.
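A sketch of k-NN prediction for a numerical outcome, with optional inverse-distance weighting (one common decay choice among many; the toy data are ours):

```python
import math

def knn_predict(train, new_point, k, weighted=True):
    """Numerical prediction: average of the k nearest neighbors' responses,
    optionally weighted by inverse distance so closer neighbors count more."""
    nearest = sorted(train, key=lambda rec: math.dist(rec[0], new_point))[:k]
    if not weighted:
        return sum(y for _, y in nearest) / k
    eps = 1e-9  # guard against division by zero at an exact match
    weights = [1 / (math.dist(x, new_point) + eps) for x, _ in nearest]
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)

# Toy 1-D example: predict y at x = 2.5 from its two nearest neighbors
train = [((1.0,), 10.0), ((2.0,), 20.0), ((3.0,), 30.0), ((4.0,), 40.0)]
print(knn_predict(train, (2.5,), k=2, weighted=False))  # 25.0
```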
Advantages
Simple.
No assumptions required about normal distributions, etc.
Effective at capturing complex interactions among variables without having to define a statistical model.

Shortcomings
The required size of the training set increases exponentially with the number of predictors, p. This is because the expected distance to the nearest neighbor increases with p (with a large vector of predictors, all records end up "far away" from each other).
In a large training set, it takes a long time to find the distances to all the neighbors and then identify the nearest one(s).
These constitute the "curse of dimensionality".
Dealing with the Curse
Reduce the dimension of the predictors (e.g., with PCA).
Use computational shortcuts that settle for "almost nearest neighbors".

Summary
Naïve rule: a benchmark.
Naïve Bayes and k-NN are two variations on the same theme: "classify a new record according to the class of similar records".
No statistical models involved.
These methods pay attention to complex interactions and local structure.
Computational challenges remain.