Weka Project assignment 3 by Jason Chang. Overview Assignment ID3 revisit Weka Development process...

Weka Project assignment 3

by Jason Chang

Overview

Assignment
ID3 revisit
Weka
Development process
Result
Problems
Conclusion
Future development

Assignment requirement

Write a program to implement Quinlan's basic two-category ID3 program.
Test it on the weather data (weather.nominal.arff).
Implement a way to deal with missing values in the program.
Test it on the breast-cancer data (breast-cancer.arff).
Extend your program to work with multiple categories.
Run a series of tests on the soybean data (soybean.arff).

ID3 revisit

J. Ross Quinlan originally developed ID3 at the University of Sydney

Decision tree classifier
Makes use of entropy: the “degree of doubt”
Information gain of Class from an attribute A
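The two quantities named on this slide can be sketched in plain Java. This is a minimal illustration rather than Weka's code: the class name InfoGainSketch, the method names, and the encoding (one attribute column as strings, class labels as integer indices) are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

final class InfoGainSketch {
    /** Entropy in bits of a vector of class counts, e.g. {9, 5}. */
    static double entropy(int[] counts) {
        int total = 0;
        for (int c : counts) total += c;
        double h = 0.0;
        for (int c : counts) {
            if (c == 0) continue;                 // 0 * log(0) is taken as 0
            double p = (double) c / total;
            h -= p * Math.log(p) / Math.log(2);
        }
        return h;
    }

    /** Gain(Class, A) = H(Class) - sum over values v of |S_v|/|S| * H(Class | A = v). */
    static double gain(String[] attrValues, int[] classIdx, int numClasses) {
        int n = attrValues.length;
        int[] overall = new int[numClasses];
        Map<String, int[]> perValue = new HashMap<>();
        for (int i = 0; i < n; i++) {
            overall[classIdx[i]]++;
            perValue.computeIfAbsent(attrValues[i], k -> new int[numClasses])[classIdx[i]]++;
        }
        double remainder = 0.0;                   // weighted entropy after the split
        for (int[] counts : perValue.values()) {
            int size = 0;
            for (int c : counts) size += c;
            remainder += (double) size / n * entropy(counts);
        }
        return entropy(overall) - remainder;
    }
}
```

For a class split of 9 to 5, entropy(new int[]{9, 5}) is about 0.940 bits. An attribute whose values separate the classes perfectly has a gain equal to the full class entropy, and an attribute that tells nothing about the class has a gain of 0.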

ID3 revisit cont.

Quick summary of the algorithm

1. For each attribute, compute its entropy with respect to the conclusion class.
2. Compute the gain of each attribute and select the attribute (say A) with the highest gain.
3. Divide the data into separate sets so that within a set, A has a fixed value.
4. Build a tree in which each branch represents an attribute value.
5. For each subtree, repeat this process from step 1.
6. At each iteration, one attribute is removed from consideration. The process stops when there are no attributes left to consider, or when all the data being considered in a subtree have the same value for the conclusion class.
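The steps above can be sketched as a short recursive routine. This is a self-contained illustration, not Weka's Id3 class: the names Id3Sketch, Node, build and classify are hypothetical, rows are assumed to be String arrays with the class label in column classCol, and missing values are not handled.

```java
import java.util.*;

final class Id3Sketch {
    /** A leaf has label != null; an inner node splits on column attr. */
    static final class Node {
        String label;
        int attr = -1;
        Map<String, Node> children = new HashMap<>();
    }

    static double entropy(List<String[]> rows, int classCol) {
        Map<String, Integer> counts = new HashMap<>();
        for (String[] r : rows) counts.merge(r[classCol], 1, Integer::sum);
        double h = 0.0;
        for (int c : counts.values()) {
            double p = (double) c / rows.size();
            h -= p * Math.log(p) / Math.log(2);
        }
        return h;
    }

    /** Step 3: group rows by their value for one attribute. */
    static Map<String, List<String[]>> split(List<String[]> rows, int attr) {
        Map<String, List<String[]>> parts = new LinkedHashMap<>();
        for (String[] r : rows) parts.computeIfAbsent(r[attr], k -> new ArrayList<>()).add(r);
        return parts;
    }

    static String majority(List<String[]> rows, int classCol) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String[] r : rows) counts.merge(r[classCol], 1, Integer::sum);
        return Collections.max(counts.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    static Node build(List<String[]> rows, Set<Integer> attrs, int classCol) {
        Node node = new Node();
        double base = entropy(rows, classCol);
        if (base == 0.0 || attrs.isEmpty()) {        // stop: pure subset or no attributes left
            node.label = majority(rows, classCol);
            return node;
        }
        int best = -1;
        double bestGain = -1.0;
        for (int a : attrs) {                         // steps 1-2: pick the highest gain
            double remainder = 0.0;
            for (List<String[]> part : split(rows, a).values())
                remainder += (double) part.size() / rows.size() * entropy(part, classCol);
            double gain = base - remainder;
            if (gain > bestGain) { bestGain = gain; best = a; }
        }
        node.attr = best;
        Set<Integer> rest = new TreeSet<>(attrs);     // step 6: drop the used attribute
        rest.remove(Integer.valueOf(best));
        for (Map.Entry<String, List<String[]>> e : split(rows, best).entrySet())
            node.children.put(e.getKey(), build(e.getValue(), rest, classCol)); // steps 4-5
        return node;
    }

    /** Walks the tree; returns null when a value was never seen during training. */
    static String classify(Node node, String[] row) {
        while (node.label == null) {
            Node next = node.children.get(row[node.attr]);
            if (next == null) return null;
            node = next;
        }
        return node.label;
    }
}
```

Calling build with the full set of attribute column indices returns the root; classify then follows one branch per attribute value until it reaches a leaf.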

ID3 implementation

I cheated!!
Modify Weka’s ID3 code
Understand Weka’s implementation

Weka data objects

Eatable   skin     color   size
yes       rough    Brown   large
no        rough    Green   large
yes       smooth   Red     large

Instances - table
Instance - row
Attribute - column
Attribute value - cell entry
Class - the attribute being predicted

Understanding these objects is essential if you are to implement an algorithm that makes use of Weka's data processing.

Weka’s ID3

A nested Id3 object with an array of child Id3 objects

      id3
     /   \
   id3   id3

Each node holds:
Class value
Class attribute
Class distribution
Split attribute
Array of child nodes, where the index represents the attribute value
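The nested structure described on this slide can be illustrated with a small stand-alone class. This is a sketch of the idea only: the class and field names are hypothetical, and instances are assumed to be encoded as arrays of value indices, which is how the child array can be indexed by attribute value.

```java
final class Id3NodeSketch {
    Id3NodeSketch[] successors;  // one child per value of the split attribute; null at a leaf
    int splitAttribute = -1;     // index of the attribute tested at this node; -1 at a leaf
    int classValue = -1;         // predicted class index; only meaningful at a leaf
    double[] distribution;       // class distribution at a leaf

    static Id3NodeSketch leaf(int classValue, double[] distribution) {
        Id3NodeSketch n = new Id3NodeSketch();
        n.classValue = classValue;
        n.distribution = distribution;
        return n;
    }

    static Id3NodeSketch split(int attribute, Id3NodeSketch... children) {
        Id3NodeSketch n = new Id3NodeSketch();
        n.splitAttribute = attribute;
        n.successors = children;
        return n;
    }

    /** Follows the child whose index equals the instance's value for the split attribute. */
    int classify(int[] instance) {
        Id3NodeSketch node = this;
        while (node.successors != null) {
            node = node.successors[instance[node.splitAttribute]];
        }
        return node.classValue;
    }
}
```

For example, split(1, leaf(0, ...), leaf(1, ...)) builds a root that tests attribute 1 and sends value index 0 to the first leaf and value index 1 to the second.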

Modification to Weka’s ID3

Build a two-category ID3?
Add to the buildClassifier method:

    // numClasses() is defined on Instances, not on the class Attribute
    if (data.numClasses() != 2)
    {
        throw new UnsupportedClassTypeException("Id3: class with two category only, please.");
    }

Modification to Weka’s ID3 cont.

Add a missing-value handling feature? A naive approach!!

1. Find the row with the most similar attribute values.
2. Copy its value to replace the missing value.
3. If no row is found or the match ratio is low, delete the row with the missing value.

Modification to Weka’s ID3 cont.

My algorithm reduces the dataset to only the rows with similar attributes.
It loops through all attributes and rows.
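The naive approach can be illustrated without Weka. The sketch below is not the code from the slides: it scores each candidate row by how many attribute values it shares with the incomplete row, instead of filtering the dataset attribute by attribute. The names NaiveImputeSketch, donorValue and minRatio are hypothetical, and a missing value is assumed to be stored as null.

```java
final class NaiveImputeSketch {
    /**
     * Returns the value to copy into rows[target][attr], or null when no row
     * matches on at least minRatio of the other attributes. A null result
     * means the caller should delete the incomplete row.
     */
    static String donorValue(String[][] rows, int target, int attr, double minRatio) {
        String best = null;
        int bestMatches = -1;
        int others = rows[target].length - 1;     // attributes other than the missing one
        for (int r = 0; r < rows.length; r++) {
            if (r == target || rows[r][attr] == null) continue;  // a donor must have the value
            int matches = 0;
            for (int a = 0; a < rows[r].length; a++) {
                if (a != attr && rows[target][a] != null
                        && rows[target][a].equals(rows[r][a])) {
                    matches++;
                }
            }
            if (matches > bestMatches) {           // strict '>' keeps the earliest best row
                bestMatches = matches;
                best = rows[r][attr];
            }
        }
        if (best == null || others == 0 || (double) bestMatches / others < minRatio) {
            return null;
        }
        return best;
    }
}
```

Because ties are resolved in favour of the first row found, a sketch like this prefers similar rows that appear early in the data, so the row order influences which value gets copied.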

Modification to Weka’s ID3 cont.

Part of the actual code

    Instances tempdata = new Instances(data);                          // copy of the original instances
    Instances tempInstance = new Instances(data, data.numInstances()); // empty set with the same header
    int attrnum = therow.numAttributes();
    boolean noMatch = false;
    int count = 0; // how many attributes examined so far produced a similar value
    Enumeration rows = tempdata.enumerateInstances(); // "enum" is a reserved word in current Java

    tempdata.delete(instindex); // delete the row with the missing attribute

    // loop through all attributes
    for (int i = 0; i < attrnum; i++)
    {
        Attribute cur_attr = therow.attribute(i);
        rows = tempdata.enumerateInstances();

        // loop through all rows; skip the scan when the current attribute is the missing one
        while (rows.hasMoreElements() && !cur_attr.equals(attr))
        {
            Instance cur_inst = (Instance) rows.nextElement();

            // current row has the same attribute value as the row with the missing value
            if (!cur_inst.isMissing(cur_attr) && !cur_inst.isMissing(attr) && cur_inst.value(cur_attr) == therow.value(i))
            {
                tempInstance.add(cur_inst); // add to the temp table
            }
        }

        if (tempInstance.numInstances() == 1 && count >= attrnum / 3) // only 1 left! must be it
        {
            tempdata = tempInstance;
            break;
        }
        // the excerpt on the slide ends here; the rest of the loop body is not shown
    }

Result on breast-cancer data

Bad!

Training data
Correctly Classified Instances      279    97.5524 %
Incorrectly Classified Instances      7     2.4476 %

Cross-validation
Correctly Classified Instances      162    56.6434 %
Incorrectly Classified Instances     92    32.1678 %

Overfitting!!
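The cross-validation percentages on this slide do not add up to 100 %. A short arithmetic sketch shows why, assuming the dataset total is the 286 instances implied by the training counts (279 + 7): the remainder is presumably instances for which the tree produced no prediction. The class and method names below are hypothetical.

```java
final class EvalSummarySketch {
    /** Instances that were neither correctly nor incorrectly classified. */
    static int unclassified(int total, int correct, int incorrect) {
        return total - correct - incorrect;
    }

    /** Share of the whole dataset, in percent. */
    static double percent(int count, int total) {
        return 100.0 * count / total;
    }
}
```

With total = 286, correct = 162 and incorrect = 92, unclassified returns 32, and percent(162, 286) reproduces the 56.6434 % reported above, so the percentages are taken over all 286 instances rather than over the classified ones only.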

Result on breast-cancer data

How about simply ignoring missing values!? 1 line of code

Training data
Correctly Classified Instances      275    96.1538 %
Incorrectly Classified Instances     10     3.4965 %

Cross-validation
Correctly Classified Instances      166    58.042  %
Incorrectly Classified Instances     87    30.4196 %

A little better but still bad!
1 line of code (58%) vs 30 lines of code (56%)

Result on Soybean data

Fairly decent performance

Training data
Correctly Classified Instances      679    99.4143 %
Incorrectly Classified Instances      4     0.5857 %

Cross-validation
Correctly Classified Instances      606    88.7262 %
Incorrectly Classified Instances     64     9.3704 %

Lesson learned

A simpler approach might work better!!
Ignoring missing values in a dataset with a large volume of missing values is not appropriate
Became familiar with Weka's implementation
Understand ID3 more clearly

Problems!!

It won’t compile!
Solution: javac -classpath weka.jar Id3m.java

It won’t run with the Weka Explorer!
Solution: run it in the Simple CLI (soybean in this sample):
java weka.classifiers.trees.Id3m -t data/soybean.arff

Conclusion

The breast-cancer dataset seems to perform badly with tree classifiers across the board
Choosing the highest information gain is not always the way to go
ID3 handles multi-class data well
My algorithm is biased toward similar rows that show up early in the dataset

Future development

Further modify ID3 to satisfy the assignment requirements
Try something else to improve the result
An innovative way to compute a unique value per row. This would increase speed and eliminate the bias problem

References

Rule induction: Ross Quinlan's ID3 algorithm
http://www.dcs.napier.ac.uk/~peter/vldb/dm/node11.html

Weka 3: Data Mining Software in Java
http://www.cs.waikato.ac.nz/~ml/weka/index.html

Weka javadoc

Data Mining: Practical Machine Learning Tools and Techniques
http://web.archive.org/web/20011112215049/www.mkp.com/books_catalog/weka/teaching_material/Assignment3.html