7. Simple Classification
Classification Methods

Three Simple Classification Methods

Methods & Characteristics
The three methods: naïve rule, naïve Bayes, K-nearest-neighbor
Common characteristics: data-driven, not model-driven; make no assumptions about the data
Naïve Rule
Classify all records as the majority class.
Not a "real" method; introduced so it will serve as a benchmark against which to measure other results.

[Figure: Charge (Y/N) x Size (S/L) grid for the fraud example; 40% fraud, 60% truthful. Classifying everything as truthful gives an error rate of 40%.]
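The naïve-rule benchmark is trivial to compute; a minimal Python sketch, using the 40%/60% fraud example above (function name is our own):

```python
from collections import Counter

def naive_rule_error(labels):
    """Error rate of classifying every record as the majority class."""
    counts = Counter(labels)
    majority_count = counts.most_common(1)[0][1]
    return 1 - majority_count / len(labels)

# The 10-record fraud example: 6 truthful, 4 fraudulent
labels = ["truthful"] * 6 + ["fraud"] * 4
print(naive_rule_error(labels))  # 0.4, matching the 40% error rate above
```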
Naïve Bayes

Idea of Naïve Bayes: Financial Fraud
Target variable: fraud / truthful
Predictors:
- Prior pending legal charges (yes/no)
- Size of firm (small/large)

[Figure: Charge (Y/N) x Size (S/L) grid; classify based on the majority in each cell (conditional probability). Error rate: 20%.]
Naïve Bayes: The Basic Idea
For a given new record to be classified, find other records like it (i.e., with the same values for the predictors).
What is the prevalent class among those records?
Assign that class to your new record.

Usage
Requires categorical variables; numerical variables must be binned and converted to categorical.
Can be used with very large data sets.
Example: spell check. The computer attempts to assign your misspelled word to an established "class" (i.e., a correctly spelled word).
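Binning a numerical predictor takes only a few lines; a minimal sketch (the cut points and labels below are purely illustrative, not from the slides):

```python
def bin_value(x, edges, labels):
    """Map a numerical value to a categorical bin label.
    `edges` are interior cut points in ascending order; there must be
    len(edges) + 1 labels. Illustrative values only."""
    for edge, label in zip(edges, labels):
        if x < edge:
            return label
    return labels[-1]

# e.g., bin firm revenue (in $M) into three hypothetical size categories
edges = [10, 100]                      # interior cut points
labels = ["small", "medium", "large"]  # len(edges) + 1 labels
print(bin_value(3, edges, labels))     # small
print(bin_value(250, edges, labels))   # large
```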
Exact Bayes Classifier
Relies on finding other records that share the same predictor values as the record-to-be-classified.
We want to find the "probability of belonging to class C, given specified values of the predictors": the conditional probability
P(Y = C | X = (x1, ..., xp))
Example: Financial Fraud
Target variable: fraud / truthful
Predictors:
- Prior pending legal charges (yes/no)
- Size of firm (small/large)

[Figure: Charge (Y/N) x Size (S/L) grid; classify based on the majority in each cell. Error rate: 20%.]

Training data (10 firms):

Charges?  Size   Outcome
y         small  truthful
n         small  truthful
n         large  truthful
n         large  truthful
n         small  truthful
n         small  truthful
y         small  fraud
y         large  fraud
n         large  fraud
y         large  fraud

Counts (Truthful, Fraud):

             Small   Large
Charges Yes  (1,1)   (0,2)
Charges No   (3,0)   (2,1)

Exact P(F|C,S):

     Small   Large
Y    0.5     1
N    0       0.33

Resulting rule:

     Small      Large
Y    ?          Fraud
N    Truthful   Truthful

(The "?" cell is a tie: one fraudulent and one truthful firm.)
Exact Bayes Calculations
Goal: classify (as "fraudulent" or as "truthful") a small firm with charges filed.
There are 2 firms like that, one fraudulent and the other truthful, so
P(fraud | charges = y, size = small) = 1/2 = 0.50
Note: the calculation is limited to the two firms matching those characteristics.
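The exact Bayes calculation above can be reproduced directly on the 10-firm table; a minimal sketch (function and field names are our own):

```python
def exact_bayes(records, target, predictors):
    """P(target class | exact predictor match): restrict to records whose
    predictor values all match, then take the class proportion among them."""
    matches = [r for r in records
               if all(r[k] == v for k, v in predictors.items())]
    return sum(r["outcome"] == target for r in matches) / len(matches)

# The 10 firms from the table above: (charges, size, outcome)
rows = [("y", "small", "truthful"), ("n", "small", "truthful"),
        ("n", "large", "truthful"), ("n", "large", "truthful"),
        ("n", "small", "truthful"), ("n", "small", "truthful"),
        ("y", "small", "fraud"),    ("y", "large", "fraud"),
        ("n", "large", "fraud"),    ("y", "large", "fraud")]
records = [{"charges": c, "size": s, "outcome": o} for c, s, o in rows]

p = exact_bayes(records, "fraud", {"charges": "y", "size": "small"})
print(p)  # 0.5 -- only 2 matching firms, one of each class
```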
Problem
Even with large data sets, it may be hard to find other records that exactly match your record in terms of predictor values.

Solution: Naïve Bayes
Assume independence of the predictor variables (within each class) and use the multiplication rule.
Find the same probability that the record belongs to class C, given the predictor values, without limiting the calculation to the records that share all those same values.
Refining the "Primitive" Idea: Naïve Bayes
Main idea: instead of looking at combinations of predictors (a crossed pivot table), look at each predictor separately.
How can this be done? A probability trick! Start from Bayes' rule, then make a simplifying assumption, and get a powerful classifier.
Conditional Probability
Let A be the event "X = A" and B the event "Y = B". P(A|B) denotes the probability of A given B (the conditional probability that A occurs given that B occurred):

P(A|B) = P(A∩B) / P(B),  if P(B) > 0

[Venn diagram: events A and B overlapping in A∩B]

Example: P(Fraud | Charge) = P(Charge and Fraud) / P(Charge)
Bayes' Rule (Reverse Conditioning)
What if I only know the opposite direction? Bayes' rule gives a neat way to reverse the conditioning:

P(B|A) = P(A|B) P(B) / P(A)

This follows because P(A∩B) = P(B|A) P(A) = P(A|B) P(B).

[Venn diagram: events A and B overlapping in A∩B]

Example: P(Fraud | Charge) P(Charge) = P(Charge | Fraud) P(Fraud), so
P(Fraud | Charge) = P(Charge | Fraud) P(Fraud) / P(Charge)
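As a numeric sanity check of Bayes' rule on the 10-firm fraud data from earlier: reversing P(Charge | Fraud) recovers exactly the P(Fraud | Charge) we would get by direct counting.

```python
# The 10 firms from the fraud example: (charges, outcome)
data = [("y", "T"), ("n", "T"), ("n", "T"), ("n", "T"), ("n", "T"),
        ("n", "T"), ("y", "F"), ("y", "F"), ("n", "F"), ("y", "F")]

n = len(data)
p_charge = sum(c == "y" for c, _ in data) / n                  # 4/10
p_fraud = sum(o == "F" for _, o in data) / n                   # 4/10
p_charge_given_fraud = (sum(c == "y" and o == "F" for c, o in data)
                        / sum(o == "F" for _, o in data))      # 3/4

# Bayes' rule: P(F|C) = P(C|F) P(F) / P(C)
p_fraud_given_charge = p_charge_given_fraud * p_fraud / p_charge
print(p_fraud_given_charge)  # ~0.75

# Direct count agrees: 3 of the 4 firms with charges are fraudulent
direct = (sum(c == "y" and o == "F" for c, o in data)
          / sum(c == "y" for c, _ in data))
assert abs(p_fraud_given_charge - direct) < 1e-9
```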
Using Bayes' Rule
Flipping the condition:

P(Y=1 | X1,...,Xp) = P(X1,...,Xp | Y=1) P(Y=1) / P(X1,...,Xp)

where, by the law of total probability,

P(X1,...,Xp) = P(X1,...,Xp | Y=1) P(Y=1) + P(X1,...,Xp | Y=0) P(Y=0)
How Is This Used to Solve Our Problem?
We want to estimate P(Y=1 | X1,...,Xp), but we don't have enough examples of each possible profile X1,...,Xp in the training set.
If we had instead P(X1,...,Xp | Y=1), we could separate it into P(X1|Y=1) · P(X2|Y=1) ··· P(Xp|Y=1).
This is true if we can assume independence between X1,...,Xp within each class.
That means we could use single pivot tables!
If the dependence is not extreme, it will still work reasonably well.
Independence Assumption
With the independence assumption P(A∩B) = P(A) · P(B) applied within each class, we can calculate:

P(X1,...,Xp | Y=1) = P(X1|Y=1) · P(X2|Y=1) ··· P(Xp|Y=1)
P(X1,...,Xp | Y=0) = P(X1|Y=0) · P(X2|Y=0) ··· P(Xp|Y=0)
P(X1,...,Xp) = P(X1,...,Xp | Y=1) P(Y=1) + P(X1,...,Xp | Y=0) P(Y=0)

(Note that the last line weights each class-conditional probability by its prior, P(Y=1) or P(Y=0).)
Putting It All Together: How It Works
1. All predictors must be categorical.
2. From the training set, create a pivot table of Y on each separate X. We can thus obtain P(X), P(X|Y=1), P(X|Y=0).
3. For a to-be-predicted observation with predictors X1, X2, ..., Xp, the software computes the probability of belonging to Y=1 using the formula

   P(Y=1 | X1,...,Xp) = P(X1|Y=1) P(X2|Y=1) ··· P(Xp|Y=1) P(Y=1) / P(X1,...,Xp)

   Each of the probabilities in the formula is estimated from a pivot table, and the estimated P(Y=1) is the proportion of 1's in the training set.
4. Use the cutoff to determine the classification of this observation. Default: cutoff = 0.5 (classify to the group that is most likely).
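Steps 2-3 can be sketched in a few lines of Python on the 10-firm fraud example (function and field names are our own). The result, about 0.53, is the naïve Bayes estimate the slides compute later, versus 0.50 from the exact Bayes calculation:

```python
from collections import Counter

def naive_bayes_posterior(records, x, target_var="outcome", positive="fraud"):
    """P(Y=positive | x) under the naïve (conditional independence)
    assumption: each probability comes from a single-predictor pivot,
    not from the full cross-table."""
    classes = Counter(r[target_var] for r in records)
    n = len(records)
    score = {}
    for cls, cnt in classes.items():
        p = cnt / n                                # prior P(Y=cls)
        for var, val in x.items():                 # product of P(Xi=val | Y=cls)
            p *= sum(r[var] == val
                     for r in records if r[target_var] == cls) / cnt
        score[cls] = p
    return score[positive] / sum(score.values())   # normalize over classes

rows = [("y", "small", "truthful"), ("n", "small", "truthful"),
        ("n", "large", "truthful"), ("n", "large", "truthful"),
        ("n", "small", "truthful"), ("n", "small", "truthful"),
        ("y", "small", "fraud"),    ("y", "large", "fraud"),
        ("n", "large", "fraud"),    ("y", "large", "fraud")]
records = [{"charges": c, "size": s, "outcome": o} for c, s, o in rows]

p = naive_bayes_posterior(records, {"charges": "y", "size": "small"})
print(round(p, 3))  # 0.529 (the slides round intermediate terms to get 0.528)
```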
Naïve Bayes, cont.
Note that the probability estimate does not differ greatly from the exact one.
All records are used in the calculations, not just those matching the predictor values. This makes the calculations practical in most circumstances.
Relies on the assumption of independence between the predictor variables within each class.

Independence Assumption
Not strictly justified (variables are often correlated with one another).
Often "good enough".
Example: Financial Fraud
Target variable: fraud / truthful
Predictors:
- Prior pending legal charges (yes/no)
- Size of firm (small/large)

[Figure: Charge (Y/N) x Size (S/L) grids showing the single-predictor counts used in the calculation: among truthful firms, 1 with charges vs. 5 without, and 4 small vs. 2 large; among fraudulent firms, 3 with charges vs. 1 without, and 1 small vs. 3 large. Classify based on the estimated conditional probability.]

P(S,Y|T) P(T) = P(S|T) · P(Y|T) · P(T) = (4/6) · (1/6) · (6/10) = 0.067
P(S,Y|F) P(F) = P(S|F) · P(Y|F) · P(F) = (1/4) · (3/4) · (4/10) = 0.075

P(F|S,Y) = P(S,Y|F) P(F) / P(S,Y)
         = P(S,Y|F) P(F) / [P(S,Y|F) P(F) + P(S,Y|T) P(T)]
         = 0.075 / (0.075 + 0.067) = 0.528
Naïve Bayes Calculations
Training data: the same 10 firms as above. Counts (Truthful, Fraud):

        Small   Large   Sum
Y       (1,1)   (0,2)   (1,3)
N       (3,0)   (2,1)   (5,1)
Sum     (4,1)   (2,3)   (6,4)

Estimated joint probabilities for the fraud class, P(C,S|F) P(F):

        Small   Large   P(C|F)
Y       0.075   0.225   0.75
N       0.025   0.075   0.25
P(S|F)  0.25    0.75    P(F) = 0.40

Estimated joint probabilities for the truthful class, P(C,S|T) P(T):

        Small   Large   P(C|T)
Y       0.067   0.034   0.17
N       0.334   0.164   0.83
P(S|T)  0.67    0.33    P(T) = 0.60

Combining:
P(F|C,S) = P(C,S|F) P(F) / P(C,S) = P(C|F) P(S|F) P(F) / P(C,S),
where P(C,S) = P(C,S|F) P(F) + P(C,S|T) P(T)

For example, for a small firm with charges filed:
P(C,S|F) P(F) = 0.25 · 0.75 · 0.40 = 0.075
P(F|C,S) = 0.075 / (0.075 + 0.067) = 0.528

Naïve Bayes estimate of P(F|C,S):

        Small   Large
Y       0.528   0.869
N       0.070   0.316

Exact P(F|C,S), for comparison:

        Small   Large
Y       0.5     1
N       0       0.33
Example: Financial Fraud (summary)
Target variable: fraud / truthful
Predictors:
- Prior pending legal charges (yes/no)
- Size of firm (small/large)

Estimated conditional probability P(F|C,S):

        Small   Large
Y       0.528   0.869
N       0.070   0.316
Advantages and Disadvantages
The good:
- Simple
- Can handle a large number of predictors
- High accuracy when the goal is ranking
- Pretty robust to violations of the independence assumption!
The bad:
- Continuous predictors must be categorized
- Predictors with "rare" categories yield zero probabilities (if such a category is important, this is a problem)
- Gives biased probabilities of class membership
- No insight into the importance/role of each predictor
Naïve Bayes in XLMiner
Classification > Naïve Bayes

Prior class probabilities (according to relative occurrences in the training data):

Class   Prob.
1       0.0953   <-- success class
0       0.9047

Conditional probabilities:

Input variable   Value   P(value | Class=1)   P(value | Class=0)
Online           0       0.374                0.402
Online           1       0.626                0.598
CreditCard       0       0.699                0.712
CreditCard       1       0.301                0.288

For example, P(CC=1 | accept=1) = 0.301 and P(accept=1) = 0.095.
Sheet: NNB-Output1
Naïve Bayes in XLMiner: Scoring the Validation Data

XLMiner: Naive Bayes - Classification of Validation Data
Cutoff prob. value for success (updatable): 0.5 (updating the value here will NOT update the value in the summary report)

Row Id   Predicted Class   Actual Class   Prob. for 1 (success)   Online   CreditCard
2        0                 0              0.08795125              0        0
3        0                 0              0.08795125              0        0
7        0                 0              0.097697987             1        0
8        0                 0              0.092925663             0        1
11       0                 0              0.08795125              0        0
13       0                 0              0.08795125              0        0
14       0                 0              0.097697987             1        0
15       0                 0              0.08795125              0        0
16       0                 0              0.10316131              1        1

Data range: ['UniversalBank KNN NBayes.xls']'Data_Partition1'!$C$3019:$O$5018
Sheet: NNB-ValidScore1
K-Nearest Neighbors

Basic Idea
For a given record to be classified, identify nearby records.
"Near" means records with similar predictor values X1, X2, ..., Xp.
Classify the record as whatever the predominant class is among the nearby records (the "neighbors").

How to Measure "Nearby"?
The most popular distance measure is Euclidean distance:
d(u, v) = sqrt((u1 - v1)² + (u2 - v2)² + ... + (up - vp)²)

Choosing k
k is the number of nearby neighbors used to classify the new record:
k = 1 means use the single nearest record; k = 5 means use the 5 nearest records.
Typically choose the value of k that has the lowest error rate on the validation data.

[Figure: scatter plot of X1 vs. X2 showing a new point and its k = 3 nearest neighbors.]

Low k vs. High k
Low values of k (1, 3, ...) capture local structure in the data (but also noise).
High values of k provide more smoothing and less noise, but may miss local structure.
Note: the extreme case of k = n (i.e., the entire data set) is the same as the naïve rule (classify all records according to the majority class).
Example: Riding Mowers
Data: 24 households classified as owning or not owning riding mowers.
Predictors: Income, Lot_Size

Income   Lot_Size   Ownership
60.0     18.4       owner
85.5     16.8       owner
64.8     21.6       owner
61.5     20.8       owner
87.0     23.6       owner
110.1    19.2       owner
108.0    17.6       owner
82.8     22.4       owner
69.0     20.0       owner
93.0     20.8       owner
51.0     22.0       owner
81.0     20.0       owner
75.0     19.6       non-owner
52.8     20.8       non-owner
64.8     17.2       non-owner
43.2     20.4       non-owner
84.0     17.6       non-owner
49.2     17.6       non-owner
59.4     16.0       non-owner
66.0     18.4       non-owner
47.4     16.4       non-owner
33.0     18.8       non-owner
51.0     14.0       non-owner
63.0     14.8       non-owner
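Scoring a new household with k-NN is straightforward; a minimal sketch using a subset of the records above (the new point is hypothetical, and in practice Income and Lot_Size should be standardized before computing distances, since their scales differ):

```python
import math
from collections import Counter

def knn_classify(train, new_point, k):
    """Classify by majority vote among the k nearest training records,
    using Euclidean distance on the raw predictor values."""
    nearest = sorted(train, key=lambda rec: math.dist(rec[0], new_point))
    votes = Counter(label for _, label in nearest[:k])
    return votes.most_common(1)[0][0]

# A few of the 24 riding-mower households: ((Income, Lot_Size), Ownership)
train = [((60.0, 18.4), "owner"), ((61.5, 20.8), "owner"),
         ((64.8, 21.6), "owner"), ((69.0, 20.0), "owner"),
         ((59.4, 16.0), "non-owner"), ((66.0, 18.4), "non-owner"),
         ((52.8, 20.8), "non-owner"), ((63.0, 14.8), "non-owner")]

print(knn_classify(train, (60.0, 20.0), k=3))  # owner
```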
XLMiner Output
For each record in the validation data (6 records), XLMiner finds the neighbors among the training data (18 records).
Each record is scored for k = 1, 2, ..., 18.
The best k seems to be k = 8. k = 9, k = 10, and k = 14 also share the low validation error rate, but it is best to choose the lowest such k.

Value of k   % Error Training   % Error Validation
1 0.00 33.33
2 16.67 33.33
3 11.11 33.33
4 22.22 33.33
5 11.11 33.33
6 27.78 33.33
7 22.22 33.33
8 22.22 16.67 <--- Best k
9 22.22 16.67
10 22.22 16.67
11 16.67 33.33
12 16.67 16.67
13 11.11 33.33
14 11.11 16.67
15 5.56 33.33
16 16.67 33.33
17 11.11 33.33
18 50.00 50.00
Using k-NN for Prediction (for a Numerical Outcome)
Instead of "majority vote determines class", use the average of the neighbors' response values.
This may be a weighted average, with weight decreasing with distance.
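A sketch of k-NN prediction for a numerical outcome, with optional inverse-distance weighting (one common decay choice among many; the toy data are ours):

```python
import math

def knn_predict(train, new_point, k, weighted=True):
    """Numerical prediction: average of the k nearest neighbors' responses,
    optionally weighted by inverse distance so closer neighbors count more."""
    nearest = sorted(train, key=lambda rec: math.dist(rec[0], new_point))[:k]
    if not weighted:
        return sum(y for _, y in nearest) / k
    eps = 1e-9  # guard against division by zero at an exact match
    weights = [1 / (math.dist(x, new_point) + eps) for x, _ in nearest]
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)

# Toy 1-D example: predict y at x = 2.5 from its two nearest neighbors
train = [((1.0,), 10.0), ((2.0,), 20.0), ((3.0,), 30.0), ((4.0,), 40.0)]
print(knn_predict(train, (2.5,), k=2, weighted=False))  # 25.0
```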
Advantages
Simple.
No assumptions required about normal distributions, etc.
Effective at capturing complex interactions among variables without having to define a statistical model.

Shortcomings
The required size of the training set increases exponentially with the number of predictors, p. This is because the expected distance to the nearest neighbor increases with p (with a large vector of predictors, all records end up "far away" from each other).
In a large training set, it takes a long time to find the distances to all the neighbors and then identify the nearest one(s).
These constitute the "curse of dimensionality".
Dealing with the Curse
Reduce the dimension of the predictors (e.g., with PCA).
Use computational shortcuts that settle for "almost nearest neighbors".

Summary
Naïve rule: a benchmark.
Naïve Bayes and k-NN are two variations on the same theme: "classify a new record according to the class of similar records".
No statistical models involved.
These methods pay attention to complex interactions and local structure.
Computational challenges remain.