Weka Project assignment 3 by Jason Chang. Overview Assignment ID3 revisit Weka Development process...

Weka Project assignment 3

by Jason Chang

Overview

Assignment

ID3 revisit

Weka

Development process

Result

Problems

Conclusion

Future development

Assignment requirement

Write a program to implement Quinlan's basic two-category ID3 program.

Test it on the weather data (weather.nominal.arff).

Implement a way to deal with missing values in the program.

Test it on the breast-cancer data (in breast-cancer.arff).

Extend your program to work with multiple categories.

Run a series of tests on the soybean data (in soybean.arff).

ID3 revisit

J. Ross Quinlan originally developed ID3 at the University of Sydney

Decision tree classifier

Makes use of entropy, the “degree of doubt”

Information gain of Class from an attribute A
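The two quantities named on this slide can be sketched in plain Java. This is a minimal illustration using the standard definitions (entropy in bits; gain = entropy of the class minus the weighted entropy after splitting on A). The class name `EntropySketch` and the string-table layout are assumptions made for the example, not part of Weka or of the assignment code.

```java
import java.util.HashMap;
import java.util.Map;

class EntropySketch {

    // Shannon entropy (in bits) of a class distribution given as counts.
    static double entropy(int[] counts) {
        int total = 0;
        for (int c : counts) total += c;
        double h = 0.0;
        for (int c : counts) {
            if (c == 0) continue;                 // 0 * log(0) is treated as 0
            double p = (double) c / total;
            h -= p * Math.log(p) / Math.log(2);
        }
        return h;
    }

    // Information gain of the class from attribute column `attr`.
    // rows[i][attr] is the attribute value, rows[i][classIdx] the class label.
    static double gain(String[][] rows, int attr, int classIdx) {
        // Map each class label to a small integer index.
        Map<String, Integer> classIds = new HashMap<>();
        for (String[] r : rows) classIds.putIfAbsent(r[classIdx], classIds.size());
        int k = classIds.size();

        int[] all = new int[k];                          // class counts over all rows
        Map<String, int[]> byValue = new HashMap<>();    // class counts per attribute value
        for (String[] r : rows) {
            int c = classIds.get(r[classIdx]);
            all[c]++;
            byValue.computeIfAbsent(r[attr], v -> new int[k])[c]++;
        }

        // Weighted entropy of the subsets produced by splitting on `attr`.
        double remainder = 0.0;
        for (int[] counts : byValue.values()) {
            int n = 0;
            for (int c : counts) n += c;
            remainder += (double) n / rows.length * entropy(counts);
        }
        return entropy(all) - remainder;
    }

    public static void main(String[] args) {
        // The weather data has 9 "yes" and 5 "no" rows: roughly 0.940 bits.
        System.out.println(entropy(new int[] {9, 5}));
    }
}
```

An attribute that separates the classes perfectly has a gain equal to the class entropy; an attribute that tells nothing about the class has a gain of zero.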

ID3 revisit cont.

Quick summary of the algorithm

1. For each attribute, compute its entropy with respect to the conclusion class.

2. Compute the gain and select the attribute (say A) with the highest gain.

3. Divide the data into separate sets so that within a set, A has a fixed value.

4. Build a tree with each branch representing an attribute value.

5. For each subtree, repeat this process from step 1.

6. At each iteration, one attribute gets removed from consideration. The process stops when there are no attributes left to consider, or when all the data being considered in a subtree have the same value for the conclusion class.
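The six steps above can be sketched as a short recursive builder. This is a plain-Java illustration over rows of strings, not Weka's Id3 and not the assignment code; `Id3Sketch`, `Node`, and the helper names are hypothetical. It assumes nominal attributes and no missing values.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class Id3Sketch {

    // A tree node: a leaf carries a class label, an internal node a split attribute.
    static class Node {
        String label;                                   // non-null for leaves only
        int attr = -1;                                  // column index tested at this node
        Map<String, Node> children = new HashMap<>();   // one branch per attribute value
    }

    static Map<String, Integer> classCounts(List<String[]> rows, int classIdx) {
        Map<String, Integer> counts = new HashMap<>();
        for (String[] r : rows) counts.merge(r[classIdx], 1, Integer::sum);
        return counts;
    }

    static double entropy(List<String[]> rows, int classIdx) {
        double h = 0.0;
        for (int c : classCounts(rows, classIdx).values()) {
            double p = (double) c / rows.size();
            h -= p * Math.log(p) / Math.log(2);
        }
        return h;
    }

    // Step 3: divide the rows so that within each part the attribute has a fixed value.
    static Map<String, List<String[]>> split(List<String[]> rows, int attr) {
        Map<String, List<String[]>> parts = new HashMap<>();
        for (String[] r : rows) parts.computeIfAbsent(r[attr], v -> new ArrayList<>()).add(r);
        return parts;
    }

    static String majority(List<String[]> rows, int classIdx) {
        String best = null;
        int bestCount = -1;
        for (Map.Entry<String, Integer> e : classCounts(rows, classIdx).entrySet()) {
            if (e.getValue() > bestCount) { best = e.getKey(); bestCount = e.getValue(); }
        }
        return best;
    }

    static Node build(List<String[]> rows, List<Integer> attrs, int classIdx) {
        Node node = new Node();
        // Stopping rule: one class left, or no attributes left to consider.
        if (classCounts(rows, classIdx).size() == 1 || attrs.isEmpty()) {
            node.label = majority(rows, classIdx);
            return node;
        }
        // Steps 1-2: compute the gain of every remaining attribute, keep the best.
        double base = entropy(rows, classIdx);
        int bestAttr = -1;
        double bestGain = -1.0;
        for (int a : attrs) {
            double remainder = 0.0;
            for (List<String[]> part : split(rows, a).values()) {
                remainder += (double) part.size() / rows.size() * entropy(part, classIdx);
            }
            if (base - remainder > bestGain) { bestGain = base - remainder; bestAttr = a; }
        }
        // Steps 4-6: one branch per value, recurse without the chosen attribute.
        node.attr = bestAttr;
        List<Integer> rest = new ArrayList<>(attrs);
        rest.remove(Integer.valueOf(bestAttr));
        for (Map.Entry<String, List<String[]>> e : split(rows, bestAttr).entrySet()) {
            node.children.put(e.getKey(), build(e.getValue(), rest, classIdx));
        }
        return node;
    }

    // Follows the branches matching the row's values; returns null for an unseen value.
    static String classify(Node node, String[] row) {
        while (node != null && node.label == null) node = node.children.get(row[node.attr]);
        return node == null ? null : node.label;
    }

    public static void main(String[] args) {
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"sunny", "false", "no"});
        rows.add(new String[] {"sunny", "true", "no"});
        rows.add(new String[] {"overcast", "false", "yes"});
        rows.add(new String[] {"rainy", "false", "yes"});
        rows.add(new String[] {"rainy", "true", "no"});
        List<Integer> attrs = new ArrayList<>();
        attrs.add(0);
        attrs.add(1);
        Node root = build(rows, attrs, 2);
        System.out.println(classify(root, new String[] {"rainy", "true"}));
    }
}
```

Because the stopping rule only checks for a pure subset or an empty attribute list, the sketch works unchanged with more than two class values.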

ID3 implementation

I cheated!!

Modify Weka’s ID3 code

Understand Weka’s implementation

Weka data objects

Eatable   skin     color   size
yes       rough    Brown   large
no        rough    Green   Large
yes       smooth   Red     large

Instances - table

Instance - row

Attribute

Attribute value

Class

Understanding these objects is essential if you are to implement an algorithm that makes use of Weka's data processing.

Weka’s ID3

A nested Id3 object with an array of child Id3 objects

[Diagram: a root id3 node with two child id3 nodes]

Each node holds: class value, class attribute, class distribution, split attribute

The array index represents the attribute value
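The nested structure can be pictured with a small stand-alone class. This is only a sketch of the idea (a node holding an array of child nodes, indexed by attribute value); the names `NestedNodeSketch`, `Node`, `splitAttr`, and `successors` are made up for the example and are not Weka's actual fields.

```java
class NestedNodeSketch {

    static class Node {
        int splitAttr = -1;       // index of the attribute tested here; -1 marks a leaf
        Node[] successors;        // one child per attribute value; array index = value index
        int classValue;           // predicted class index (meaningful for leaves)
        double[] distribution;    // class distribution at a leaf
    }

    static Node leaf(int classValue, double[] distribution) {
        Node n = new Node();
        n.classValue = classValue;
        n.distribution = distribution;
        return n;
    }

    static Node split(int attr, Node... successors) {
        Node n = new Node();
        n.splitAttr = attr;
        n.successors = successors;
        return n;
    }

    // instance[i] holds the index of the value taken by attribute i.
    static int classify(Node node, int[] instance) {
        while (node.splitAttr >= 0) {
            node = node.successors[instance[node.splitAttr]];
        }
        return node.classValue;
    }

    public static void main(String[] args) {
        // Attribute 1 has three values; each value index selects one child.
        Node tree = split(1,
                leaf(0, new double[] {1.0, 0.0}),
                leaf(1, new double[] {0.0, 1.0}),
                leaf(0, new double[] {1.0, 0.0}));
        System.out.println(classify(tree, new int[] {0, 1}));
    }
}
```

Storing children in an array makes classification a chain of array lookups, at the cost of requiring every nominal value to be mapped to an index first.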

Modification to Weka’s ID3

Build a two-category ID3?

Add to the buildClassifier method:

if (data.numClasses() != 2)
{
    throw new UnsupportedClassTypeException("Id3: class with two category only, please.");
}

Modification to Weka’s ID3 cont.

Add a missing value handling feature? A naive approach!!

1. Find the row with the most similar attribute value match

2. Copy and replace the missing value

3. If no row is found or the match ratio is low, delete the row with the missing value
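A simplified plain-Java version of these three steps is sketched below. It is not the Weka-based code from the assignment: it scores every other row by the number of matching attribute values instead of narrowing the dataset attribute by attribute. The names `ImputeSketch`, `impute`, `mostSimilar`, the "?" marker, and the `minMatchRatio` parameter are assumptions made for the illustration.

```java
import java.util.ArrayList;
import java.util.List;

class ImputeSketch {

    static final String MISSING = "?";

    // Returns a new list in which each missing value is copied from the most
    // similar row; rows without a good enough match are dropped.
    static List<String[]> impute(List<String[]> rows, double minMatchRatio) {
        List<String[]> out = new ArrayList<>();
        for (String[] row : rows) {
            String[] fixed = row.clone();
            boolean keep = true;
            for (int a = 0; a < row.length && keep; a++) {
                if (!MISSING.equals(row[a])) continue;
                String[] donor = mostSimilar(rows, row, a, minMatchRatio);
                if (donor == null) {
                    keep = false;             // step 3: delete the row
                } else {
                    fixed[a] = donor[a];      // step 2: copy the donor's value
                }
            }
            if (keep) out.add(fixed);
        }
        return out;
    }

    // Step 1: find the row that agrees with `target` on the most attributes
    // other than `attr`, and that has a value for `attr`.
    static String[] mostSimilar(List<String[]> rows, String[] target, int attr, double minMatchRatio) {
        String[] best = null;
        int bestMatches = -1;
        for (String[] cand : rows) {
            if (cand == target || MISSING.equals(cand[attr])) continue;
            int matches = 0;
            for (int i = 0; i < target.length; i++) {
                if (i != attr && !MISSING.equals(target[i]) && target[i].equals(cand[i])) matches++;
            }
            if (matches > bestMatches) { bestMatches = matches; best = cand; }
        }
        int comparable = target.length - 1;   // every column except the missing one
        if (best == null || comparable == 0) return null;
        return (double) bestMatches / comparable >= minMatchRatio ? best : null;
    }

    public static void main(String[] args) {
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"rough", "brown", "large", "yes"});
        rows.add(new String[] {"rough", "?", "large", "yes"});
        rows.add(new String[] {"smooth", "red", "small", "no"});
        for (String[] r : impute(rows, 0.5)) System.out.println(String.join(" ", r));
    }
}
```

Ties go to the earliest candidate, so this sketch also favors rows that appear early in the data.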

Modification to Weka’s ID3 cont.

My algorithm reduces the dataset to only the rows with similar attributes

It loops through all attributes and rows

Modification to Weka’s ID3 cont.

Part of actual code

Instances tempdata = new Instances(data); // create a copy of the original instances
Instances tempInstance = new Instances(data, data.numInstances()); // create an empty set with the same header
int attrnum = therow.numAttributes();
boolean noMatch = false;
int count; // counts how many attributes examined so far produced a similar value
Enumeration rows = tempdata.enumerateInstances(); // named "enum" originally, which is a reserved word since Java 5
count = 0;
tempdata.delete(instindex); // delete the row with the missing attribute

// loop through all attributes
for (int i = 0; i < attrnum; i++)
{
    Attribute cur_attr = therow.attribute(i);
    rows = tempdata.enumerateInstances();

    // loop through all rows; skip this pass if the current attribute is the missing one
    while (rows.hasMoreElements() && !cur_attr.equals(attr))
    {
        Instance cur_inst = (Instance) rows.nextElement();

        // the current row has the same value for this attribute as the row with the missing value
        if (!cur_inst.isMissing(cur_attr) && !cur_inst.isMissing(attr) && cur_inst.value(cur_attr) == therow.value(i))
        {
            tempInstance.add(cur_inst); // add to the temp table
        }
    }

    if (tempInstance.numInstances() == 1 && count >= attrnum / 3) // only one row left, so it must be the match
    {
        tempdata = tempInstance;
        break;
    }

Result on breast-cancer data

Bad!

Training data

Correctly Classified Instances 279 97.5524 %

Incorrectly Classified Instances 7 2.4476 %

Cross-validation

Incorrectly Classified Instances 92 32.1678 %

Overfitting!!

Result on breast-cancer data

How about simply ignoring missing values!? 1 line of code

Training data




Incorrectly Classified Instances 87 30.4196 %

A little better but still bad!

1 line of code (58%) vs 30 lines of code (56%)

Result on Soybean data

Fairly decent performance

Training data




Incorrectly Classified Instances 64 9.3704 %

Lesson learned

A simpler approach might work better!!

Ignoring missing values in a dataset with a large volume of missing values is not appropriate

Became familiar with the Weka implementation

Understand ID3 more clearly

Problems!!

It won’t compile!

Solution: javac Id3m.java -classpath weka.jar

It won’t run with Weka explorer!

Solution: run in the simple CLI with (soybean in this sample)

java weka.classifiers.trees.Id3m -t data/soybean.arff

Conclusion

The breast-cancer dataset seems to perform badly across tree classifiers

High information gain is not always the way to go

ID3 handles multi-class data well

My algorithm is biased toward similar rows that show up early in the record

Future development

Further modify ID3 to satisfy the assignment requirements

Try something else to improve the results

An innovative way to compute a unique value per row. This would increase speed and eliminate the bias problem

