Weka j48 Pruning
7/23/2019
CSC 578 Neural Networks and Machine Learning
Homework #1
Due: January 20 (Wed)
Do all questions below.
1. Textbook Exercise 3.1 (p. 77).
In case you don't have the textbook yet, the question is: "Give decision trees to represent the following boolean functions:
a. A and not B
b. A or [B and C]
c. A xor B
d. [A and B] or [C and D]"
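Not required for the submission, but one way to sanity-check a hand-drawn tree is to encode it and test it against the full truth table. A minimal Python sketch for part (a); the nested-dict representation and the "split" key are illustrative choices, not anything from the textbook:

```python
from itertools import product

def evaluate(tree, assignment):
    """Walk a decision tree (nested dicts) down to a leaf label."""
    while isinstance(tree, dict):
        tree = tree[assignment[tree["split"]]]  # follow the branch for this attribute's value
    return tree

# A decision tree for part (a), "A and not B": test A first, then B.
tree_a = {"split": "A",
          False: False,                                    # A false -> whole expression false
          True: {"split": "B", True: False, False: True}}

# Exhaustively verify the tree against the boolean function it represents.
for a, b in product([False, True], repeat=2):
    assert evaluate(tree_a, {"A": a, "B": b}) == (a and not b)
print("tree for 'A and not B' verified on all 4 assignments")
```

The same check extends to parts (b)-(d) by encoding each tree and comparing against the corresponding boolean expression.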
2. The following data gives the conditions under which an optician might want to prescribe soft contact lenses, hard contact lenses, or no contact lenses for a patient. Show the decision tree that would be learned by ID3. The target attribute is 'Contact-lenses'.
Show all your work including the calculations of IG (REQUIRED). Do NOT use any decision-tree induction tools such as Weka. You may use tools/software for numeric calculation (including Excel), but NOT those that produce a decision tree.
Age             Spectacle-prescrip  Astigmatism  Tear-prod-rate  Contact-lenses
young           myope               no           normal          soft
young           myope               yes          reduced         none
young           myope               yes          normal          hard
young           hypermetrope        no           reduced         none
young           hypermetrope        no           normal          soft
young           hypermetrope        yes          reduced         none
pre-presbyopic  myope               no           reduced         none
pre-presbyopic  myope               no           normal          soft
pre-presbyopic  myope               yes          normal          hard
pre-presbyopic  hypermetrope        no           reduced         none
pre-presbyopic  hypermetrope        no           normal          soft
pre-presbyopic  hypermetrope        yes          reduced         none
pre-presbyopic  hypermetrope        yes          normal          none
presbyopic      myope               no           normal          none
presbyopic      myope               yes          reduced         none
presbyopic      myope               yes          normal          hard
presbyopic      hypermetrope        no           reduced         none
presbyopic      hypermetrope        no           normal          soft
presbyopic      hypermetrope        yes          reduced         none
presbyopic      hypermetrope        yes          normal          none
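Since tools for numeric calculation are allowed, a short Python sketch like the following can double-check the root-level entropy and IG numbers you compute by hand. It does not build a tree, it only performs the calculations; the attribute names are taken from the table above:

```python
from collections import Counter
from math import log2

rows = [r.split() for r in """
young myope no normal soft
young myope yes reduced none
young myope yes normal hard
young hypermetrope no reduced none
young hypermetrope no normal soft
young hypermetrope yes reduced none
pre-presbyopic myope no reduced none
pre-presbyopic myope no normal soft
pre-presbyopic myope yes normal hard
pre-presbyopic hypermetrope no reduced none
pre-presbyopic hypermetrope no normal soft
pre-presbyopic hypermetrope yes reduced none
pre-presbyopic hypermetrope yes normal none
presbyopic myope no normal none
presbyopic myope yes reduced none
presbyopic myope yes normal hard
presbyopic hypermetrope no reduced none
presbyopic hypermetrope no normal soft
presbyopic hypermetrope yes reduced none
presbyopic hypermetrope yes normal none
""".strip().splitlines()]

ATTRS = ["Age", "Spectacle-prescrip", "Astigmatism", "Tear-prod-rate"]

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, i):
    """Information gain of splitting the given rows on attribute index i."""
    labels = [r[-1] for r in rows]
    gain = entropy(labels)
    for v in set(r[i] for r in rows):
        subset = [r[-1] for r in rows if r[i] == v]
        gain -= len(subset) / len(rows) * entropy(subset)
    return gain

for i, name in enumerate(ATTRS):
    print(f"IG({name}) = {info_gain(rows, i):.4f}")
```

The printed values are only a check on the root split; the hand calculations for every level of the tree still need to be shown in your answer.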
3. Why is a decision tree that fits the data really well not necessarily better than another that doesn't fit it so
well? [Assume the whole data can fit in the computer memory.]
Write at least 3 sentences.
4/3/2010 Homework #1
condor.depaul.edu//hw1.html 1
4. Download WEKA and install it on your system. Then conduct the following experiment.
Setup:
If your system already has Java 1.5 or newer, the second choice, "a self-extracting executable without the Java VM (weka-3-6-1.exe)", will do.
In case you encounter problems with the Weka site, here is a local ZIP file of the self-extracting executable (weka-3-6-1.zip, 18MB).
J48 in Weka:
For this and the next question (4 & 5), you experiment with the effect of pruning in decision trees. Weka's 'weka.classifiers.trees.J48' lets you generate pruned as well as unpruned trees. The J48 classifier provides two methods for pruning a decision tree:
a. By using a "pessimistic estimate" function (described in Mitchell's textbook p. 71, the 9th line from the bottom: "Another method, used by C4.5, ..."); and
b. By using a validation set to test if the pruning will improve accuracy -- the 'reducedErrorPruning' scheme.
[FYI, J48 does NOT convert trees to rules. Both pruning schemes alter a tree after it is fully grown (thus post-pruning).]
For question 4, we experiment with the former scheme.
The pessimistic estimate function has a parameter: confidence level. By setting this parameter to various values, we can experiment with the degree of pruning -- minimal to aggressive -- and its effect on classification accuracy. In J48, this confidence level can be set by the 'confidenceFactor' parameter, in the pop-up window which appears after clicking in the text box to the right of the "Choose" button. It is set to 0.25 by default. Changing it to a smaller value gives more aggressive pruning, while a larger value gives minimal pruning.
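For intuition on why a smaller confidence factor prunes more aggressively: the pessimistic estimate is commonly described as the upper limit of a confidence interval on a node's training error. A rough Python sketch of that calculation (the formula follows the usual textbook description of C4.5's estimate, not Weka's actual source, so treat it as an illustration only):

```python
from math import sqrt
from statistics import NormalDist

def pessimistic_error(f, n, cf):
    """Upper confidence limit on the true error rate (C4.5-style),
    given observed error rate f over n training instances at confidence cf."""
    z = NormalDist().inv_cdf(1 - cf)   # one-sided z for the chosen confidence
    return (f + z*z/(2*n) + z * sqrt(f/n - f*f/n + z*z/(4*n*n))) / (1 + z*z/n)

# A node that misclassifies 2 of 6 training instances: the smaller the
# confidence factor, the more pessimistic (larger) the error estimate,
# hence the more aggressive the pruning.
for cf in (0.50, 0.25, 0.10, 0.05):
    print(f"CF={cf:.2f}: estimated error = {pessimistic_error(2/6, 6, cf):.3f}")
```

A subtree is replaced by a leaf when the leaf's estimated error is no worse than the subtree's, so inflating every node's error estimate (small CF) makes that condition easier to satisfy.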
Specifics:
The purpose of the experiment is to derive the (sub-)optimal confidence factor value. To do so, we try various confidence factor values with several datasets. The datasets are as follows. Here is also a zip file which includes all of them.
vote.arff (40 kb)
  16 attributes (nominal), 2 classes, 435 instances
  This dataset contains the party affiliation of the 435 members of the 1984 US House of Representatives, as well as their voting records on 16 different bills.

tic-tac-toe.arff (31 kb)
  9 attributes (nominal), 2 classes, 958 instances
  This database encodes the complete set of possible board configurations at the end of tic-tac-toe games.

splice2.arff (393 kb)
  62 attributes (nominal), 3 classes, 3190 instances
  Primate splice-junction gene sequences (DNA). Given a sequence of DNA, recognize the boundaries between exons (the parts of the DNA sequence retained after splicing) and introns (the parts of the DNA sequence that are spliced out).

breast-cancer.arff (30 kb)
  9 attributes (nominal), 2 classes, 286 instances
  Breast cancer data; classify into recurrent/non-recurrent events.

(*) halloffame.arff (140 kb)
  17 attributes (nominal/numeric mixed), 3 classes, 1338 instances
  Records of baseball players inducted into the Baseball Hall of Fame.
NOTE: (*) When you run the Hall of Fame data, remove the 'Player' attribute (the first attribute). To do so, after you open the file (in the "Preprocess" step), select the attribute and hit "Remove".
For each dataset,
You run J48 with the confidence factor 0.50, 0.45, 0.40, ... down to 0.05 (i.e., decrements of 0.05, so you'll do a total of 10 runs). Also be sure to set the 'minNumObj' to 1, and make sure the 'reducedErrorPruning' is False and 'unpruned' is False, along with all other parameters as indicated in the previous figure.
For each run, do the evaluation by 10-fold cross-validation. In the "Weka Explorer" window, under "Test Options", select "Cross-validation" and set the number of folds to 10 (which is the default in Weka).
After each run, record the confidence factor, the size of the tree, and the classification accuracy.
Do the same procedure for all datasets.
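If you prefer to script the runs instead of clicking through the Explorer, the same sweep can be expressed on Weka's command line (-t training file, -x cross-validation folds, -C confidence factor, -M minNumObj). A Python sketch that only prints the commands; the weka.jar and .arff paths are placeholders for your own install:

```python
# Hypothetical paths; adjust weka.jar and the .arff locations for your install.
datasets = ["vote.arff", "tic-tac-toe.arff", "splice2.arff",
            "breast-cancer.arff", "halloffame.arff"]
factors = [round(0.50 - 0.05 * i, 2) for i in range(10)]   # 0.50 down to 0.05

commands = []
for data in datasets:
    for cf in factors:
        # -t training file, -x 10-fold CV, -C confidence factor, -M minNumObj
        commands.append(f"java -cp weka.jar weka.classifiers.trees.J48 "
                        f"-t {data} -x 10 -C {cf} -M 1")

for cmd in commands:
    print(cmd)          # replace with subprocess.run(cmd.split()) to execute
```

Each run's output reports the tree size and the cross-validated accuracy, which is exactly what you record in the table.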
To Answer:
Answer the following questions. In addition to running the experiments, I strongly recommend you read
the description of each dataset (written at the top of each file) in order to learn its domain.
a. Show a table which tabulates the values obtained for all runs for each dataset (confidence factor,
size of the tree, classification accuracy).
b. Your results probably indicated that pruning helped improve the accuracy greatly for some datasets but only marginally, if at all, for others. Or in some cases, pruning might have hindered the accuracy. Based on the results, discuss in detail what factor or factors you think influenced the effect of pruning. Write at least 3 sentences.
c. Weka uses 0.25 as the default confidence value. Do you think it is a good value to use? Explain
why or why not.
5. For this question, you experiment with the other pruning scheme (reduced error pruning), using the five datasets from the previous question.
Specifics:
The reduced error pruning in J48 has a parameter: 'numFolds'. It specifies the number of subsets into which the training data is divided -- one fold is reserved as a validation set (used only for testing the effect of pruning a particular subtree) and the remaining folds are used for training/building a tree. By changing the number of folds, you essentially control the portion of the data used for training -- a small number of folds makes the validation set larger, thus leaving the training set smaller, while a large number of folds makes the validation set smaller, thus leaving the training set larger (although it is still a subset of the original training data).
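A quick numeric illustration of this trade-off (958 is the tic-tac-toe dataset's size; any n works):

```python
# How 'numFolds' divides the training data for reduced error pruning:
# one fold is held out as the pruning/validation set, the rest grow the tree.
n = 958   # e.g., the tic-tac-toe dataset; sizes are approximate (integer folds)

for num_folds in (2, 5, 10):
    validation = n // num_folds
    training = n - validation
    print(f"numFolds={num_folds:>2}: ~{validation} instances for validation, "
          f"~{training} for growing the tree")
```

So numFolds=2 trains on only about half the data, while numFolds=10 trains on about 90% of it.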
For each dataset,
You run J48 with 'numFolds' = 2, 5 and 10 (so you'll do a total of 3 runs). Also, set the 'reducedErrorPruning' to True and the 'minNumObj' to 1. Note that you can leave the 'seed' as 1. You can ignore other parameters (because J48 does too).
As for the overall evaluation, as with the previous question, do the evaluation by 10-fold cross-validation.
After each run, record the number of folds, the size of the tree, and the classification accuracy.
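As with question 4, these runs can also be scripted on Weka's command line (-R turns on reducedErrorPruning, -N sets numFolds, -Q the seed). A Python sketch that only prints the commands; the jar and file paths are again placeholders:

```python
# Hypothetical paths, as before; -R reducedErrorPruning, -N numFolds,
# -Q seed, -M minNumObj, -x the overall cross-validation folds.
datasets = ["vote.arff", "tic-tac-toe.arff", "splice2.arff",
            "breast-cancer.arff", "halloffame.arff"]

commands = [f"java -cp weka.jar weka.classifiers.trees.J48 "
            f"-t {data} -x 10 -R -N {folds} -Q 1 -M 1"
            for data in datasets for folds in (2, 5, 10)]

for cmd in commands:
    print(cmd)          # replace with subprocess.run(cmd.split()) to execute
```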
To Answer:
Answer the following questions.
a. Show a table which tabulates the values obtained for all runs for each dataset (number of folds, size
of the tree, classification accuracy).
b. Describe your observation on the effect of the size of the training set (i.e., the number of folds).
Write at least 3 sentences.
c. How did this pruning scheme compare with the pessimistic estimate function? Were there large differences in the accuracy or tree size between the two schemes? Which pruning scheme "works better" or "is preferred" in your opinion?
Submission
Type all your answers in an electronic file (in doc, txt, or pdf), and submit the file on COL (under 'Submit
Homework' and 'HW#1' bin) before 11:59 pm on the due date.
If it's difficult for you to draw figures (trees in this homework) using software, you can alternatively hand-draw them on paper, scan the paper, and insert/paste the scanned images into the file. No matter how you create the figures, make ONE file which contains ALL answers and submit that file.
Be sure to WRITE YOUR NAME at the beginning of the file. As stated on the syllabus, "Assignments with NO NAME may be penalized by some points."