An Exercise in Machine Learning

Slides: web.cs.iastate.edu/~cs573x/BBSIlab/2006/BBSI.pdf

http://www.cs.iastate.edu/~cs573x/BBSIlab/2006/

Cornelia Caragea

Outline

• Machine Learning Software

• Preparing Data

• Building Classifiers

• Interpreting Results

Machine Learning Software

• Suites (General Purpose)
  • WEKA (Source: Java)
  • MLC++ (Source: C++)
  • SAS
  • List from KDNuggets (Various)
• Specific
  • Classification: C4.5, SVMlight
  • Association Rule Mining
  • Bayesian Net
  • …
• Commercial vs. Free

What does WEKA do?

• Implementation of state-of-the-art learning algorithms
• Main strengths in classification
• Regression, association rules, and clustering algorithms
• Extensible to try new learning schemes
• Large variety of handy tools (transforming datasets, filters, visualization, etc.)

WEKA resources

• API documentation, tutorials, source code
• WEKA mailing list
• Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations
• Weka-related projects:
  • Weka-Parallel - parallel processing for Weka
  • RWeka - linking R and Weka
  • YALE - Yet Another Learning Environment
  • Many others…

Outline

• Machine Learning Software

• Preparing Data

• Building Classifiers

• Interpreting Results

Preparing Data

ARFF Data Format
• Header – describing the attribute types
• Data – (instances, examples) comma-separated list
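To make the header/data layout concrete, here is a minimal Python sketch that renders a tiny ARFF file. The relation name, attribute names, and rows are hypothetical and are not taken from the slides; only the `@relation`, `@attribute`, and `@data` keywords come from the ARFF format itself.

```python
# Hypothetical toy dataset (not from the slides): two attributes plus a class.
attributes = [
    ("outlook", ["sunny", "overcast", "rainy"]),  # nominal attribute
    ("temperature", "numeric"),                   # numeric attribute
    ("play", ["yes", "no"]),                      # class attribute, listed last
]
rows = [
    ("sunny", 85, "no"),
    ("overcast", 83, "yes"),
    ("rainy", 70, "yes"),
]


def to_arff(relation, attributes, rows):
    """Render a header section and a data section in ARFF layout."""
    lines = [f"@relation {relation}", ""]
    # Header: one @attribute line per column, describing its type.
    for name, kind in attributes:
        if isinstance(kind, list):
            kind = "{" + ",".join(kind) + "}"
        lines.append(f"@attribute {name} {kind}")
    # Data: one comma-separated line per instance.
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


arff_text = to_arff("weather", attributes, rows)
print(arff_text)
```

Saving the printed text to a file with an `.arff` extension should give something WEKA can load, assuming the values contain no spaces or commas that would need quoting.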

Launching WEKA

• java -jar weka.jar

Load Dataset into WEKA

Data Filters

• Useful support for data preprocessing
• Removing or adding attributes, resampling the dataset, removing examples, etc.
• Can create stratified cross-validation folds of the given dataset; class distributions are approximately retained within each fold.
• Typically split data as 2/3 for training and 1/3 for testing
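The 2/3 and 1/3 split with class distributions retained can be sketched in plain Python. This is an illustration of the idea only, not WEKA's filter code; the function name `stratified_split` and the toy labels are assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=2 / 3, seed=0):
    """Return (train_idx, test_idx), splitting each class separately
    so that class proportions stay roughly the same in both parts."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train += indices[:cut]
        test += indices[cut:]
    return sorted(train), sorted(test)


# Hypothetical class column: 9 "yes" and 6 "no" instances.
labels = ["yes"] * 9 + ["no"] * 6
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

With these toy labels the training part receives 6 "yes" and 4 "no" instances and the test part 3 "yes" and 2 "no", so both keep the original 3:2 class ratio.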

Data Filters

Outline

• Machine Learning Software

• Preparing Data

• Building Classifiers

• Interpreting Results

Building Classifiers

• A classifier model is a mapping from dataset attributes to the class (target) attribute. Creation and form differ between classifiers.
• Decision Tree and Naïve Bayes Classifiers
• Which one is the best?
  • No Free Lunch!

Building Classifiers

(1) weka.classifiers.rules.ZeroR

• Class for building and using a 0-R classifier
• Majority class classifier
• Predicts the mean (for a numeric class) or the mode (for a nominal class)
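The 0-R rule is simple enough to sketch in a few lines of Python. This is a stand-alone illustration of the behaviour described above, not WEKA's Java class; the name `ZeroRSketch` and its `fit`/`predict` methods are hypothetical.

```python
from collections import Counter
from statistics import mean


class ZeroRSketch:
    """Ignores all attributes and always predicts one constant value."""

    def fit(self, targets):
        if all(isinstance(t, (int, float)) for t in targets):
            # Numeric class: predict the mean of the training targets.
            self.prediction = mean(targets)
        else:
            # Nominal class: predict the mode (majority class).
            self.prediction = Counter(targets).most_common(1)[0][0]
        return self

    def predict(self, n_instances):
        return [self.prediction] * n_instances


nominal = ZeroRSketch().fit(["yes", "yes", "no"])
numeric = ZeroRSketch().fit([1.0, 2.0, 6.0])
print(nominal.predict(2), numeric.predict(1))
```

Because it never looks at the attributes, this kind of classifier is mainly useful as a baseline that other classifiers should beat.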

Exercise 1

http://www.cs.iastate.edu/~cs573x/BBSIlab/2006/exercises/ex1.html

(2) weka.classifiers.bayes.NaiveBayes

• Class for building a Naive Bayes classifier

(3) weka.classifiers.trees.J48

• Class for generating a pruned or unpruned C4.5 decision tree

Test Options

• Percentage Split (2/3 Training; 1/3 Testing)
• Cross-validation
  • estimates the generalization error by resampling when data is limited; the error estimate is averaged over the folds
  • stratified
  • 10-fold
  • leave-one-out (LOO)
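The cross-validation options above can be sketched as follows. This is a simplified illustration under stated assumptions, not WEKA's evaluation code: the helper names, the round-robin fold assignment, and the majority-class baseline used as the learner are all hypothetical. Leave-one-out is the special case where the number of folds equals the number of instances.

```python
from collections import Counter, defaultdict


def stratified_folds(labels, k):
    """Deal instance indices into k folds class by class (round-robin),
    so each fold keeps roughly the overall class distribution."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    position = 0
    for indices in by_class.values():
        for index in indices:
            folds[position % k].append(index)
            position += 1
    return folds


def cross_validated_error(labels, k, train_and_predict):
    """Average the per-fold error rate over all non-empty folds."""
    fold_errors = []
    for test in stratified_folds(labels, k):
        if not test:
            continue
        held_out = set(test)
        train = [i for i in range(len(labels)) if i not in held_out]
        predictions = train_and_predict(train, test)
        wrong = sum(p != labels[i] for p, i in zip(predictions, test))
        fold_errors.append(wrong / len(test))
    return sum(fold_errors) / len(fold_errors)


# Hypothetical class column and a majority-class baseline as the learner.
labels = ["yes"] * 7 + ["no"] * 3


def majority_baseline(train, test):
    majority = Counter(labels[i] for i in train).most_common(1)[0][0]
    return [majority] * len(test)


five_fold_error = cross_validated_error(labels, 5, majority_baseline)
loo_error = cross_validated_error(labels, len(labels), majority_baseline)
print(five_fold_error, loo_error)
```

A real run would shuffle the instances before assigning folds; the shuffle is left out here to keep the example deterministic.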

Outline

• Machine Learning Software

• Preparing Data

• Building Classifiers

• Interpreting Results

Understanding Output

Decision Tree Output (1)

Decision Tree Output (2)

Exercise 2

http://www.cs.iastate.edu/~cs573x/BBSIlab/2006/exercises/ex2.html

Performance Measures

• Accuracy & Error rate
• Confusion matrix – contingency table
• True Positive rate & False Positive rate (Area under Receiver Operating Characteristic)
• Precision, Recall & F-Measure
• Sensitivity & Specificity
• For more information on these, see uisp09-Evaluation.ppt
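For a two-class problem, most of the measures listed above follow directly from the four cells of the confusion matrix. The sketch below uses the standard textbook definitions; the function name `binary_metrics` and the counts are made up for illustration, and it assumes no denominator is zero.

```python
def binary_metrics(tp, fp, fn, tn):
    """Derive common measures from a 2x2 confusion matrix.

    tp/fn: positive instances predicted positive/negative
    fp/tn: negative instances predicted positive/negative
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)        # same as TP rate and sensitivity
    specificity = tn / (tn + fp)   # equals 1 - FP rate
    return {
        "accuracy": accuracy,
        "error_rate": 1 - accuracy,
        "tp_rate": recall,
        "fp_rate": fp / (fp + tn),
        "precision": precision,
        "recall": recall,
        "f_measure": 2 * precision * recall / (precision + recall),
        "sensitivity": recall,
        "specificity": specificity,
    }


# Hypothetical counts, not taken from any WEKA run.
metrics = binary_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The area under the ROC curve is not computed here, since it needs ranked prediction scores rather than a single confusion matrix.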

Decision Tree Pruning

• Overcome Over-fitting
• Pre-pruning and Post-pruning
• Reduced error pruning
• Subtree raising with different confidence
• Comparing tree size and accuracy

Subtree replacement

• Bottom-up: a tree is considered for replacement once all its subtrees have been considered
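The bottom-up order can be sketched with a small recursive function. Note the assumption: this sketch decides each replacement by counting errors on a held-out pruning set (in the spirit of reduced error pruning), which is not necessarily the criterion J48 applies by default. The `Node` class, the tree, and the pruning instances are all hypothetical.

```python
class Node:
    def __init__(self, label, attribute=None, children=None):
        self.label = label              # majority class at this node
        self.attribute = attribute      # None means the node is a leaf
        self.children = children or {}  # attribute value -> child Node

    def is_leaf(self):
        return self.attribute is None


def classify(node, instance):
    while not node.is_leaf():
        child = node.children.get(instance[node.attribute])
        if child is None:
            break  # unseen value: fall back to this node's majority label
        node = child
    return node.label


def errors(node, data):
    return sum(classify(node, x) != y for x, y in data)


def prune(node, data):
    """Subtree replacement: prune the children first, then consider
    replacing this node by a single leaf with its majority label."""
    if node.is_leaf():
        return node
    for value, child in list(node.children.items()):
        subset = [(x, y) for x, y in data if x[node.attribute] == value]
        node.children[value] = prune(child, subset)
    leaf = Node(node.label)
    # Replace when the leaf is no worse on the pruning data.
    if errors(leaf, data) <= errors(node, data):
        return leaf
    return node


# Hypothetical tree: outlook at the root, windy under the "sunny" branch.
tree = Node("yes", "outlook", {
    "sunny": Node("yes", "windy", {True: Node("no"), False: Node("yes")}),
    "rainy": Node("no"),
})
pruning_set = [
    ({"outlook": "sunny", "windy": True}, "yes"),
    ({"outlook": "sunny", "windy": False}, "yes"),
    ({"outlook": "rainy", "windy": False}, "no"),
    ({"outlook": "rainy", "windy": True}, "no"),
]
pruned = prune(tree, pruning_set)
print(pruned.is_leaf(), pruned.children["sunny"].is_leaf())
```

On this toy data the "windy" subtree is replaced by a "yes" leaf because the leaf makes fewer errors on the pruning set, while the root split on "outlook" is kept.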

Subtree Raising

• Deletes node and redistributes instances
• Slower than subtree replacement

Exercise 3

http://www.cs.iastate.edu/~cs573x/BBSIlab/2006/exercises/ex3.html