Weka Project assignment 3
by Jason Chang
Overview
Assignment
ID3 revisit
Weka
Development process
Result
Problems
Conclusion
Future development
Assignment requirement
Write a program to implement Quinlan's basic two-category ID3 program. Test it on the weather data (weather.nominal.arff).
Implement a way to deal with missing values in the program. Test it on the breast-cancer data (breast-cancer.arff).
Extend your program to work with multiple categories. Run a series of tests on the soybean data (soybean.arff).
ID3 revisit
J. Ross Quinlan originally developed ID3 at the University of Sydney
Decision tree classifier
Makes use of entropy (“degree of doubt”)
Information gain of Class from an attribute A
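The formula on this slide did not survive extraction. The standard definitions are Entropy(S) = -sum_i p_i log2(p_i) over the class proportions p_i, and Gain(S, A) = Entropy(S) - sum_v (|S_v| / |S|) Entropy(S_v) over the values v of attribute A. The following is a minimal sketch of both in plain Java. The class and method names are hypothetical, and the counts in main are the well-known yes/no counts of the weather data grouped by outlook.

```java
import java.util.Arrays;

final class InfoGainSketch {
    /** Entropy in bits of a class distribution given as counts. */
    static double entropy(int[] counts) {
        int total = Arrays.stream(counts).sum();
        double h = 0.0;
        for (int c : counts) {
            if (c > 0) {
                double p = (double) c / total;
                h -= p * Math.log(p) / Math.log(2);
            }
        }
        return h;
    }

    /** Information gain of an attribute; subsets[v] holds the class counts of the rows where A = v. */
    static double gain(int[][] subsets) {
        int[] overall = new int[subsets[0].length];
        int total = 0;
        for (int[] s : subsets) {
            for (int k = 0; k < s.length; k++) {
                overall[k] += s[k];
                total += s[k];
            }
        }
        double remainder = 0.0; // weighted entropy after splitting on A
        for (int[] s : subsets) {
            int n = Arrays.stream(s).sum();
            if (n > 0) remainder += (double) n / total * entropy(s);
        }
        return entropy(overall) - remainder;
    }

    public static void main(String[] args) {
        // {yes, no} counts of "play" for outlook = sunny, overcast, rainy
        int[][] outlook = { {2, 3}, {4, 0}, {3, 2} };
        System.out.printf("entropy(play) = %.3f%n", entropy(new int[] {9, 5}));
        System.out.printf("gain(outlook) = %.3f%n", gain(outlook));
    }
}
```

On these counts the entropy of the class is about 0.940 bits and the gain of outlook is about 0.247 bits.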
ID3 revisit cont.
Quick summary of the algorithm
1. For each attribute, compute its entropy with respect to the conclusion class.
2. Compute the gain of each attribute and select the attribute (say A) with the highest gain.
3. Divide the data into separate sets so that within a set, A has a fixed value.
4. Build a tree with each branch representing an attribute value.
5. For each subtree, repeat this process from step 1.
6. At each iteration, one attribute gets removed from consideration. The process stops when there are no attributes left to consider, or when all the data being considered in a subtree have the same value for the conclusion class.
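The steps above can be sketched without Weka as a short recursive builder. This is only an illustration under stated assumptions, not Weka's Id3 code: rows are arrays of value indices with the class in the last column, all names (Id3Sketch, build, split, weather) are hypothetical, and the sketch also stops when no remaining attribute has a positive gain.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class Id3Sketch {
    int splitAttr = -1;    // attribute tested at this node; -1 marks a leaf
    int classValue = -1;   // majority class of the rows at this node; -1 if no rows reached it
    Id3Sketch[] children;  // children[v] handles rows whose splitAttr value is v

    static int[] classCounts(List<int[]> rows, int classIdx, int numClasses) {
        int[] counts = new int[numClasses];
        for (int[] r : rows) counts[r[classIdx]]++;
        return counts;
    }

    static double entropy(List<int[]> rows, int classIdx, int numClasses) {
        double h = 0.0;
        for (int c : classCounts(rows, classIdx, numClasses)) {
            if (c > 0) {
                double p = (double) c / rows.size();
                h -= p * Math.log(p) / Math.log(2);
            }
        }
        return h;
    }

    /** Step 3: one subset per value of attr. */
    static List<List<int[]>> split(List<int[]> rows, int attr, int numValues) {
        List<List<int[]>> parts = new ArrayList<>();
        for (int v = 0; v < numValues; v++) parts.add(new ArrayList<>());
        for (int[] r : rows) parts.get(r[attr]).add(r);
        return parts;
    }

    /** numValues[a] is the number of distinct values in column a; used[a] marks consumed attributes. */
    static Id3Sketch build(List<int[]> rows, int[] numValues, boolean[] used) {
        int classIdx = numValues.length - 1;
        int numClasses = numValues[classIdx];
        Id3Sketch node = new Id3Sketch();
        if (rows.isEmpty()) return node;                  // empty leaf: no prediction
        int[] counts = classCounts(rows, classIdx, numClasses);
        for (int k = 0; k < numClasses; k++) {
            if (node.classValue < 0 || counts[k] > counts[node.classValue]) node.classValue = k;
        }
        double base = entropy(rows, classIdx, numClasses);
        if (base == 0.0) return node;                     // all rows share one class

        int best = -1;
        double bestGain = 1e-9;                           // steps 1-2: pick the highest gain
        for (int a = 0; a < classIdx; a++) {
            if (used[a]) continue;
            double remainder = 0.0;
            for (List<int[]> part : split(rows, a, numValues[a])) {
                if (!part.isEmpty()) {
                    remainder += (double) part.size() / rows.size() * entropy(part, classIdx, numClasses);
                }
            }
            if (base - remainder > bestGain) {
                bestGain = base - remainder;
                best = a;
            }
        }
        if (best == -1) return node;                      // nothing useful left to split on

        node.splitAttr = best;
        used[best] = true;                                // step 6: remove the attribute
        List<List<int[]>> parts = split(rows, best, numValues[best]);
        node.children = new Id3Sketch[parts.size()];      // step 4: one branch per value
        for (int v = 0; v < parts.size(); v++) {
            node.children[v] = build(parts.get(v), numValues, used); // step 5: recurse
        }
        used[best] = false;                               // restore for sibling subtrees
        return node;
    }

    int classify(int[] row) {
        return splitAttr < 0 ? classValue : children[row[splitAttr]].classify(row);
    }

    /** Weather data as value indices: outlook (sunny, overcast, rainy), temperature (hot, mild, cool),
     *  humidity (high, normal), windy (false, true), play (yes, no). */
    static List<int[]> weather() {
        return Arrays.asList(new int[][] {
            {0, 0, 0, 0, 1}, {0, 0, 0, 1, 1}, {1, 0, 0, 0, 0}, {2, 1, 0, 0, 0}, {2, 2, 1, 0, 0},
            {2, 2, 1, 1, 1}, {1, 2, 1, 1, 0}, {0, 1, 0, 0, 1}, {0, 2, 1, 0, 0}, {2, 1, 1, 0, 0},
            {0, 1, 1, 1, 0}, {1, 1, 0, 1, 0}, {1, 0, 1, 0, 0}, {2, 1, 0, 1, 1} });
    }

    public static void main(String[] args) {
        int[] numValues = {3, 3, 2, 2, 2};
        Id3Sketch tree = build(weather(), numValues, new boolean[numValues.length]);
        System.out.println("root splits on column " + tree.splitAttr); // column 0 is outlook
    }
}
```

A branch that receives no training rows keeps classValue at -1, so a test row that ends up there gets no prediction.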
ID3 implementation
I cheated!!
Modify Weka’s ID3 code
Understand Weka’s implementation
Weka data objects
Eatable  skin    color  size
yes      rough   Brown  large
no       rough   Green  large
yes      smooth  Red    large
Instances - table
Instance - row
Attribute - column
Attribute value - cell
Class - the attribute being predicted
Understanding these objects is essential if you want to implement an algorithm that makes use of Weka's data processing.
Weka’s ID3
A nested Id3 object with an array of child Id3 objects
[Diagram: an id3 root node with two id3 child nodes]
Each node holds: class value, class attribute, class distribution, split attribute
The array index represents the attribute value
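To make the nested layout concrete, here is a simplified stand-in written in plain Java. It is not Weka's class: the name TreeNodeSketch, its fields and the hand-built tree are hypothetical, and treating Eatable as the class of the small table shown earlier is an assumption.

```java
final class TreeNodeSketch {
    final int splitAttribute;           // column tested at this node; -1 for a leaf
    final TreeNodeSketch[] successors;  // successors[v] is the subtree for attribute value index v
    final int classValue;               // predicted class index at a leaf
    final double[] distribution;        // class distribution at a leaf

    private TreeNodeSketch(int splitAttribute, TreeNodeSketch[] successors,
                           int classValue, double[] distribution) {
        this.splitAttribute = splitAttribute;
        this.successors = successors;
        this.classValue = classValue;
        this.distribution = distribution;
    }

    /** Leaf whose prediction is the most frequent class in the distribution. */
    static TreeNodeSketch leaf(double... distribution) {
        int best = 0;
        for (int k = 1; k < distribution.length; k++) {
            if (distribution[k] > distribution[best]) best = k;
        }
        return new TreeNodeSketch(-1, null, best, distribution);
    }

    /** Inner node: one successor per value of the split attribute. */
    static TreeNodeSketch split(int attribute, TreeNodeSketch... successors) {
        return new TreeNodeSketch(attribute, successors, -1, null);
    }

    /** Follow the array index given by the row's value for each split attribute. */
    int classify(int[] row) {
        TreeNodeSketch node = this;
        while (node.splitAttribute >= 0) {
            node = node.successors[row[node.splitAttribute]];
        }
        return node.classValue;
    }

    public static void main(String[] args) {
        // Columns: skin (0 = rough, 1 = smooth), color (0 = Brown, 1 = Green, 2 = Red).
        // Classes: 0 = yes, 1 = no. The tree splits on color only.
        TreeNodeSketch tree = split(1, leaf(1, 0), leaf(0, 1), leaf(1, 0));
        System.out.println(tree.classify(new int[] {0, 1})); // rough, Green: prints 1 (no)
    }
}
```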
Modification to Weka’s ID3
Build a two-category ID3: add to the buildClassifier method
if (data.numClasses() != 2)
{
    throw new UnsupportedClassTypeException("Id3: class with two category only, please.");
}
Modification to Weka’s ID3 cont.
Add a missing-value handling feature: a naive approach!!
1. Find the row with the most similar attribute values.
2. Copy its value to replace the missing value.
3. If no row is found or the match ratio is low, delete the row with the missing value.
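The three steps can be sketched in plain Java as follows. This is an illustration of the idea only and differs from the modified Weka code shown later in the slides: rows are String arrays with null for a missing value, and the names fillMissing, bestDonor and minMatchRatio are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ImputeSketch {
    /** Returns a copy of rows with missing values (null) filled in, dropping rows that cannot be filled. */
    static List<String[]> fillMissing(List<String[]> rows, double minMatchRatio) {
        List<String[]> result = new ArrayList<>();
        for (String[] row : rows) {
            String[] fixed = row.clone();
            boolean keep = true;
            for (int a = 0; a < fixed.length && keep; a++) {
                if (fixed[a] != null) continue;
                String[] donor = bestDonor(rows, row, a, minMatchRatio);
                if (donor == null) keep = false;   // step 3: no good match, delete the row
                else fixed[a] = donor[a];          // step 2: copy the donor's value
            }
            if (keep) result.add(fixed);
        }
        return result;
    }

    /** Step 1: the row that agrees with target on the most known attributes, or null if the match is poor. */
    static String[] bestDonor(List<String[]> rows, String[] target, int missingAttr, double minMatchRatio) {
        int comparable = 0; // known attributes of target, other than the missing one
        for (int a = 0; a < target.length; a++) {
            if (a != missingAttr && target[a] != null) comparable++;
        }
        String[] best = null;
        int bestMatches = -1;
        for (String[] cand : rows) {
            if (cand == target || cand[missingAttr] == null) continue;
            int matches = 0;
            for (int a = 0; a < target.length; a++) {
                if (a != missingAttr && target[a] != null && target[a].equals(cand[a])) matches++;
            }
            if (matches > bestMatches) { // ties keep the earliest row
                bestMatches = matches;
                best = cand;
            }
        }
        if (best == null || comparable == 0 || (double) bestMatches / comparable < minMatchRatio) return null;
        return best;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
            new String[] {"sunny", "hot", "high"},
            new String[] {"sunny", "hot", null},       // filled from the first row
            new String[] {"rainy", "cool", "normal"},
            new String[] {"overcast", "mild", null});  // no similar row: dropped
        for (String[] r : fillMissing(rows, 0.5)) System.out.println(String.join(",", r));
    }
}
```

Because ties go to the earliest candidate, a sketch like this favours rows that appear early in the file.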
Modification to Weka’s ID3 cont.
My algorithm reduces the dataset to rows with similar attribute values only.
It loops through all attributes and rows.
Modification to Weka’s ID3 cont.
Part of actual code
// Excerpt from inside a method. It needs java.util.Enumeration and
// weka.core.Attribute, Instance and Instances; data, therow, attr and
// instindex are defined in code that is not shown on the slide.
Instances tempdata = new Instances(data);                          // copy of the original instances
Instances tempInstance = new Instances(data, data.numInstances()); // empty set with the same header
int attrnum = therow.numAttributes();
boolean noMatch = false;
int count = 0; // number of attributes examined that produced a similar value
tempdata.delete(instindex); // delete the row with the missing attribute
// loop through all attributes
for (int i = 0; i < attrnum; i++)
{
    Attribute cur_attr = therow.attribute(i);
    // named rows because "enum" is a reserved word since Java 5
    Enumeration rows = tempdata.enumerateInstances();
    // loop through all rows, unless the attribute being looked at is the missing one
    while (rows.hasMoreElements() && !cur_attr.equals(attr))
    {
        Instance cur_inst = (Instance) rows.nextElement();
        // current row has the same attribute value as the row with the missing value
        if (!cur_inst.isMissing(cur_attr) && !cur_inst.isMissing(attr)
            && cur_inst.value(cur_attr) == therow.value(i))
        {
            tempInstance.add(cur_inst); // add to the temporary table
        }
    }
    if (tempInstance.numInstances() == 1 && count >= attrnum / 3) // only 1 left! must be it
    {
        tempdata = tempInstance;
        break;
    }
    // the slide ends here; the rest of the loop body is not shown
}
Result on breast-cancer data
Bad!
Training data
Correctly Classified Instances      279    97.5524 %
Incorrectly Classified Instances      7     2.4476 %
Cross-validation
Correctly Classified Instances      162    56.6434 %
Incorrectly Classified Instances     92    32.1678 %
Overfitting!!
Result on breast-cancer data
How about simply ignoring missing values? 1 line of code
Training data
Correctly Classified Instances      275    96.1538 %
Incorrectly Classified Instances     10     3.4965 %
Cross-validation
Correctly Classified Instances      166    58.042 %
Incorrectly Classified Instances     87    30.4196 %
A little better, but still bad!
1 line of code (58%) vs 30 lines of code (56%)
Result on Soybean data
Fairly decent performance
Training data
Correctly Classified Instances      679    99.4143 %
Incorrectly Classified Instances      4     0.5857 %
Cross-validation
Correctly Classified Instances      606    88.7262 %
Incorrectly Classified Instances     64     9.3704 %
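In several of the runs above, the correct and incorrect counts do not add up to the size of the dataset, and the percentages are taken over the full dataset. The small check below makes that arithmetic explicit. It assumes totals of 286 instances for breast-cancer and 683 for soybean, which are inferred from the reported percentages (for example 279 / 0.975524 is about 286), and it assumes that the remaining instances are ones the tree left unclassified. The class and method names are hypothetical.

```java
final class ResultCheck {
    /** Instances that were counted as neither correct nor incorrect. */
    static int unclassified(int total, int correct, int incorrect) {
        return total - correct - incorrect;
    }

    /** Share of the whole dataset, in percent. */
    static double percent(int part, int total) {
        return 100.0 * part / total;
    }

    public static void main(String[] args) {
        String[] names = {
            "breast-cancer, my algorithm, cross-validation",
            "breast-cancer, ignore missing, cross-validation",
            "soybean, cross-validation" };
        int[][] runs = { {286, 162, 92}, {286, 166, 87}, {683, 606, 64} }; // total, correct, incorrect
        for (int i = 0; i < runs.length; i++) {
            int[] r = runs[i];
            System.out.println(names[i] + ": accuracy " + percent(r[1], r[0])
                + " %, unclassified " + unclassified(r[0], r[1], r[2]));
        }
    }
}
```

Under these assumptions the three cross-validation runs leave 32, 33 and 13 instances without a prediction.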
Lesson learned
A simpler approach might work better!!
Ignoring missing values in a dataset with a large volume of missing values is not appropriate
Became familiar with Weka's implementation
Understand ID3 more clearly
Problems!!
It won’t compile! Solution:
javac -classpath weka.jar Id3m.java
It won’t run with the Weka Explorer! Solution: run it in the Simple CLI (soybean in this sample)
java weka.classifiers.trees.Id3m -t data/soybean.arff
Conclusion
The breast-cancer dataset seems to perform badly on tree classifiers in general
High information gain is not always the way to go
ID3 handles multi-class data well
My algorithm is biased toward similar rows that show up early in the record
Future development
Further modify ID3 to satisfy the assignment requirements
Try something else to improve the results
An innovative way to compute a unique value per row. This would increase speed and eliminate the bias problem
References
Rule induction: Ross Quinlan's ID3 algorithm, http://www.dcs.napier.ac.uk/~peter/vldb/dm/node11.html
Weka 3: Data Mining Software in Java, http://www.cs.waikato.ac.nz/~ml/weka/index.html
Weka javadoc
Data Mining: Practical Machine Learning Tools and Techniques, http://web.archive.org/web/20011112215049/www.mkp.com/books_catalog/weka/teaching_material/Assignment3.html