
Page 1:

A Brief Introduction and Issues on the Classification Problem

Jin Mao, Postdoc, School of Information, University of Arizona

Sept 18, 2015

Page 2:

Outline

Page 3:

Classification Examples

Spam email filtering

Fraud detection

Self-driving automobiles

Page 4:

The Classification Problem

Page 5:

The Classification Problem

Page 6:

The Classification Problem

Page 7:

Classic Classifiers

Naïve Bayes

Decision Tree: J48 (C4.5)

KNN (k-nearest neighbors)

Random Forest

SVM: SMO, LibSVM

Neural Network
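All of the classic classifiers above are available off the shelf. A minimal sketch in scikit-learn (assumed available; note that J48/C4.5 and SMO are Weka implementations, and scikit-learn's closest equivalents are a CART decision tree and `SVC`, which wraps LibSVM):

```python
# Instantiating the listed classifier families in scikit-learn (assumed available).
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier        # CART, not J48/C4.5
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC                            # wraps LibSVM
from sklearn.neural_network import MLPClassifier

classifiers = {
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Random Forest": RandomForestClassifier(n_estimators=100),
    "SVM": SVC(),
    "Neural Network": MLPClassifier(max_iter=500),
}
```

Each object exposes the same `fit`/`predict` interface, so they can be swapped freely in the experiments below.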

Page 8:

How to Choose the Classifier?

Observe your data: how much you have, and what the features look like.

Consider your application: is precision or recall more important? Must the model be explainable, incrementally updatable, or low in complexity?

A decision tree is easy to understand, but it cannot predict numerical values and can be slow.

Naïve Bayes is fairly robust and is easy to update incrementally.

Neural networks and SVMs are "black boxes". An SVM is fast at predicting yes or no.

Don't worry: you can try all of them.

Model Selection with Cross Validation
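As a sketch of "try all of them", assuming scikit-learn and its bundled iris dataset, each candidate can be scored with 5-fold cross-validation and the best mean accuracy kept:

```python
# Compare several classifiers by 5-fold cross-validation on a toy dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
for name, clf in [("Naive Bayes", GaussianNB()),
                  ("Decision Tree", DecisionTreeClassifier(random_state=0)),
                  ("SVM", SVC())]:
    scores = cross_val_score(clf, X, y, cv=5)   # one accuracy per fold
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```

The classifier with the best cross-validated score is the one to carry forward.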

Page 9:

How to Choose the Classifier?

Page 11:

Train Your Classifier

Page 12:

Obtain Training Set

Instances should be labeled. Sources of labels:

From running systems in practice

Annotation by multiple experts (with inter-rater agreement)

Crowdsourcing (e.g., Google's reCAPTCHA)

…

Page 13:

Obtain Training Set

Large enough

More data can reduce the noise; the benefit of enough data can even dominate that of the choice of classification algorithm.

Redundant data will help little.

Selection strategies: nearest neighbors, ordered removals, random sampling, particle swarms, or evolutionary methods.

Page 14:

Obtain Training Set

Unbalanced training instances across classes

Evaluation: with simple measures such as precision/recall, a classifier that handles only the instances of the majority class (the class with many samples) still gets a high rate. (AUC is better.)

Not enough information in the features to find the class boundaries.
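A small sketch of why a simple rate misleads on unbalanced data, assuming scikit-learn: a degenerate classifier that always outputs the majority class reaches 95% accuracy on a 95:5 dataset, while AUC shows it is no better than chance.

```python
# Accuracy vs. AUC on a 95:5 imbalanced label set.
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = np.array([0] * 95 + [1] * 5)   # 95:5 imbalance
scores = np.zeros(100)                  # "always majority": every score is 0
y_pred = (scores > 0.5).astype(int)     # hence every prediction is class 0

print(accuracy_score(y_true, y_pred))   # 0.95 -- looks great
print(roc_auc_score(y_true, scores))    # 0.5  -- no better than chance
```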

Page 15:

Obtain Training Set

Strategies:

Divide the majority class into L distinct clusters, train L predictors, and average them as the final one.

Generate synthetic data for the rare class (e.g., SMOTE).

Reduce the imbalance level: cut down the majority class.

…
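A minimal sketch of the last strategy, cutting down the majority class by random undersampling with NumPy (SMOTE itself is provided by the third-party imbalanced-learn package and is not shown here):

```python
# Random undersampling: keep as many majority samples as minority samples.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.array([0] * 90 + [1] * 10)       # 90:10 imbalance

minority = np.where(y == 1)[0]
majority = np.where(y == 0)[0]
keep = rng.choice(majority, size=len(minority), replace=False)
idx = np.concatenate([keep, minority])

X_bal, y_bal = X[idx], y[idx]           # balanced 10:10 training set
```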

Page 16:

Obtain Training Set

More materials:

https://www.quora.com/In-classification-how-do-you-handle-an-unbalanced-training-set

http://stats.stackexchange.com/questions/57259/highly-unbalanced-test-data-set-and-balanced-training-data-in-classification

He, Haibo, and Edwardo A. Garcia. "Learning from imbalanced data." IEEE Transactions on Knowledge and Data Engineering 21, no. 9 (2009): 1263-1284.

Page 17:

Feature Selection

Why?

Unrelated features introduce noise and heavy computation.

Interdependent features are redundant.

Removing both yields a better model.

More materials:

http://machinelearningmastery.com/an-introduction-to-feature-selection/

Guyon and Elisseeff, "An Introduction to Variable and Feature Selection" (PDF)

Page 18:

Feature Selection

Feature selection methods

Filter methods: apply a statistical measure to assign a score to each feature, e.g., the chi-squared test, information gain, or correlation coefficient scores.

Wrapper methods: consider the selection of a set of features as a search problem.

Embedded methods: learn which features best contribute to the accuracy of the model while the model is being created, e.g., LASSO, Elastic Net, and Ridge Regression.
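As a sketch of a filter method, assuming scikit-learn and its iris dataset, `SelectKBest` scores each of the four features with the chi-squared test and keeps the top k:

```python
# Filter-method feature selection: chi-squared scores, keep the best 2 of 4.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)           # 150 samples, 4 features
selector = SelectKBest(chi2, k=2).fit(X, y)
X_new = selector.transform(X)               # keep the 2 highest-scoring features
print(X_new.shape)                          # (150, 2)
```

Swapping `chi2` for another scoring function (e.g., mutual information) changes the filter criterion without touching the rest of the pipeline.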

Page 19:

Evaluation Method

Basic evaluation measures

Precision

Confusion matrix

Per-class accuracy

AUC (Area Under the Curve): the ROC curve shows the sensitivity of the classifier by plotting the rate of true positives against the rate of false positives.
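A small sketch of the confusion matrix and per-class accuracy on hand-made predictions, assuming scikit-learn:

```python
# Confusion matrix and per-class accuracy (recall of each class).
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0, 1]
cm = confusion_matrix(y_true, y_pred)        # rows: true class, cols: predicted
print(cm)                                    # [[2 1]
                                             #  [1 3]]
per_class_acc = cm.diagonal() / cm.sum(axis=1)
print(per_class_acc)                         # class 0: 2/3, class 1: 3/4
```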

Page 20:

Evaluation Method

Cross Validation

Random Subsampling

K-fold Cross Validation

Leave-one-out Cross Validation
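The three schemes can be sketched with scikit-learn's splitters, assumed available (`ShuffleSplit` plays the role of random subsampling):

```python
# Counting the train/test splits produced by each validation scheme.
import numpy as np
from sklearn.model_selection import ShuffleSplit, KFold, LeaveOneOut

X = np.arange(10).reshape(-1, 1)            # 10 toy samples

random_sub = ShuffleSplit(n_splits=3, test_size=0.3, random_state=0)
kfold = KFold(n_splits=5)
loo = LeaveOneOut()

print(sum(1 for _ in random_sub.split(X)))  # 3 independent random 70/30 splits
print(sum(1 for _ in kfold.split(X)))       # 5 folds, each sample tested once
print(sum(1 for _ in loo.split(X)))         # 10 splits: one held-out sample each
```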

Page 21:

Cross Validation

Random Subsampling

Page 22:

Cross Validation

K-fold Cross Validation

Page 23:

Cross Validation

Leave-one-out Cross Validation

Page 24:

Cross Validation

Three-way data splits
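A sketch of the three-way split, assuming scikit-learn; the 60/20/20 ratio here is only illustrative:

```python
# Three-way split: hold out a test set first, then carve a validation set
# out of the remainder (train for fitting, validation for model selection,
# test for the final estimate).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)           # 150 samples
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)    # 120 / 30
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=0)  # 90 / 30

print(len(X_train), len(X_val), len(X_test))  # 90 30 30
```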

Page 25:

Apply the Classifier

Save the Model

Make the model dynamic: update or retrain it as new labeled data arrives.
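A minimal sketch of saving a trained model with the standard-library pickle module (joblib is a common alternative for large scikit-learn models):

```python
# Serialize a fitted model and reload it; the restored copy predicts identically.
import pickle
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
model = GaussianNB().fit(X, y)

blob = pickle.dumps(model)        # in practice, write these bytes to a file
restored = pickle.loads(blob)
assert (restored.predict(X) == model.predict(X)).all()
```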

Page 26:

Thank you!