
    CSC 578 Neural Networks and Machine Learning

    Homework #1

    Due: January 20 (Wed)

    Do all questions below.

    1. Textbook Exercise 3.1 (p. 77).

    In case you don't have the textbook yet, the question is: "Give decision trees to represent the following boolean functions:" (A small worked example for a function not on this list is sketched after the list.)

    a. A and not B

    b. A or [B and C]

    c. A xor B

    d. [A and B] or [C and D]
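    For instance, the function "A and B" (not one of the assigned ones) is represented by a tree that tests A at the root and tests B only along the A = true branch:

        A
        |-- false: false
        |-- true:  B
                   |-- false: false
                   |-- true:  true

    Each internal node tests one variable, each branch corresponds to one of its values, and each leaf gives the value of the function.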

    2. The following data gives the conditions under which an optician might want to prescribe soft contact

    lenses, hard contact lenses, or no contact lenses for a patient. Show the decision tree that would be learned by ID3. The target attribute is 'Contact-lenses'.

    Show all your work, including the calculations of IG (REQUIRED); the entropy and information-gain formulas are recalled after the table below. Do NOT use any decision-tree induction tools such as Weka. You may use tools/software for numeric calculation (including Excel), but NOT those that produce a decision tree.

    Age             Spectacle-prescrip   Astigmatism   Tear-prod-rate   Contact-lenses
    young           myope                no            normal           soft
    young           myope                yes           reduced          none
    young           myope                yes           normal           hard
    young           hypermetrope         no            reduced          none
    young           hypermetrope         no            normal           soft
    young           hypermetrope         yes           reduced          none
    pre-presbyopic  myope                no            reduced          none
    pre-presbyopic  myope                no            normal           soft
    pre-presbyopic  myope                yes           normal           hard
    pre-presbyopic  hypermetrope         no            reduced          none
    pre-presbyopic  hypermetrope         no            normal           soft
    pre-presbyopic  hypermetrope         yes           reduced          none
    pre-presbyopic  hypermetrope         yes           normal           none
    presbyopic      myope                no            normal           none
    presbyopic      myope                yes           reduced          none
    presbyopic      myope                yes           normal           hard
    presbyopic      hypermetrope         no            reduced          none
    presbyopic      hypermetrope         no            normal           soft
    presbyopic      hypermetrope         yes           reduced          none
    presbyopic      hypermetrope         yes           normal           none
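    For reference, the formulas these IG calculations use (standard definitions, as in Mitchell, Chapter 3):

        Entropy(S) = \sum_{c \in Classes} -p_c \log_2 p_c

        Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} \, Entropy(S_v)

    where p_c is the proportion of examples in S belonging to class c, and S_v is the subset of S for which attribute A takes value v.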

    3. Why is a decision tree that fits the data really well not necessarily better than another that doesn't fit it so

    well? [Assume the whole dataset fits in the computer's memory.]

    Write at least 3 sentences.


    4. Download WEKA and install it on your system. Then conduct the following experiment.

    Setup:

    If your system already has Java 1.5 or newer, the second choice "a self-extracting executable

    without the Java VM (weka-3-6-1.exe)" will do.

    In case you encounter problems with the Weka site, here is a local ZIP file of the self-extracting executable (weka-3-6-1.zip, 18 MB).

    J48 in Weka:

    For this and the next question (questions 4 & 5), you experiment with the effect of pruning in decision trees. Weka's

    'weka.classifiers.trees.J48' lets you generate pruned as well as unpruned trees. The J48 classifier

    provides two methods for pruning a decision tree:

    a. By using a "pessimistic estimate" function (described in Mitchell's textbook p. 71, the 9th line from the bottom: "Another method, used by C4.5, ..."); and

    b. By using a validation set to test whether pruning improves accuracy -- the 'reducedErrorPruning' scheme.

    [FYI, J48 does NOT convert trees to rules. Both pruning schemes alter a tree after it is fully grown (thus

    post-pruning).]
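    For orientation, the pessimistic estimate works roughly as follows (this is the normal-approximation form given in Witten & Frank's "Data Mining" textbook; J48's internal computation may differ in detail). The observed error rate f = E/N at a node covering N training instances is replaced by the upper bound of a confidence interval:

        e = \frac{ f + \frac{z^2}{2N} + z \sqrt{ \frac{f}{N} - \frac{f^2}{N} + \frac{z^2}{4N^2} } }{ 1 + \frac{z^2}{N} }

    where z is the standard-normal deviate for the chosen confidence level. A subtree is replaced by a leaf when the leaf's estimated error is no worse than the subtree's; a lower confidence level gives a larger z, a more pessimistic estimate, and hence more aggressive pruning.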

    For question 4, we experiment with the former scheme.

    The pessimistic estimate function has a parameter: Confidence Level. By setting this parameter to various values, we can experiment with the degree of pruning -- minimal to aggressive -- and its effect on classification accuracy. In J48, this confidence level can be set by the 'confidenceFactor' parameter, in the pop-up window which appears after clicking in the text box to the right of the "Choose" button. It is set to 0.25 by default. Changing it to a smaller value gives more aggressive pruning, while a larger value gives minimal pruning. (A scripted alternative to the Explorer is sketched below.)
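    If you prefer to script these runs instead of clicking through the Explorer, here is a minimal sketch using the Weka Java API (a sketch only: class and setter names are as I recall them from Weka 3.6, and 'vote.arff' is just an example file):

        import java.util.Random;

        import weka.classifiers.Evaluation;
        import weka.classifiers.trees.J48;
        import weka.core.Instances;
        import weka.core.converters.ConverterUtils.DataSource;

        public class SingleRun {
            public static void main(String[] args) throws Exception {
                // Load one of the ARFF files; the class is the last attribute.
                Instances data = DataSource.read("vote.arff");
                data.setClassIndex(data.numAttributes() - 1);

                // Configure J48 as in the Explorer pop-up window.
                J48 tree = new J48();
                tree.setConfidenceFactor(0.25f);    // 'confidenceFactor'
                tree.setMinNumObj(1);               // 'minNumObj'
                tree.setReducedErrorPruning(false); // use the pessimistic estimate
                tree.setUnpruned(false);            // pruning is on

                // Evaluate by 10-fold cross-validation ("Test Options").
                Evaluation eval = new Evaluation(data);
                eval.crossValidateModel(tree, data, 10, new Random(1));

                // Build on the full data to read off the tree size, as the
                // Explorer does for the model it prints.
                tree.buildClassifier(data);
                System.out.printf("size=%.0f  accuracy=%.2f%%%n",
                        tree.measureTreeSize(), eval.pctCorrect());
            }
        }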


    Specifics:

    The purpose of the experiment is to derive the (sub-)optimal confidence factor value. To do so, we try various confidence factor values with several datasets. The datasets are as follows; here is also a zip file which includes them all.

    vote.arff (40 kb) -- 16 attributes (nominal), 2 classes, 435 instances.
        This dataset contains the party affiliation of the 435 members of the 1984 US House of Representatives, as well as their voting records on 16 different bills.

    tic-tac-toe.arff (31 kb) -- 9 attributes (nominal), 2 classes, 958 instances.
        This database encodes the complete set of possible board configurations at the end of tic-tac-toe games.

    splice2.arff (393 kb) -- 62 attributes (nominal), 3 classes, 3190 instances.
        Primate splice-junction gene sequences (DNA). Given a sequence of DNA, recognize the boundaries between exons (the parts of the DNA sequence retained after splicing) and introns (the parts of the DNA sequence that are spliced out).

    breast-cancer.arff (30 kb) -- 9 attributes (nominal), 2 classes, 286 instances.
        Breast cancer data; classify into recurrent/non-recurrent events.

    (*) halloffame.arff (140 kb) -- 17 attributes (nominal/numeric mixed), 3 classes, 1338 instances.
        Records of baseball players inducted to the Baseball Hall of Fame.


    NOTE: (*) When you run the Hall of Fame data, remove the 'Player' attribute (the first

    attribute). To do so, after you open the file (in the "Preprocess" step), select the attribute

    and hit "Remove".

    For each dataset,

    You run J48 with the confidence factor 0.50, 0.45, 0.40, ... down to 0.05 (i.e., a decrement of 0.05, so you'll do a total of 10 runs). Also be sure to set the 'minNumObj' to 1, and make sure 'reducedErrorPruning' is False and 'unpruned' is False, along with all other parameters as indicated in the previous figure.

    For each run, do the evaluation by 10-fold cross-validation. In the "Weka Explorer" window,

    under "Test Options", select "Cross-validation" and set the number of folds to 10(which is the

    default in Weka).


    After each run, record the confidence factor, the size of the tree, and the classification accuracy.

    Do the same procedure for all datasets. (For those who prefer scripting the runs, a sketch follows.)
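    Extending the single-run sketch from earlier (same caveat: API names as I recall them from Weka 3.6), the ten runs for one dataset might look like this:

        // Inside main(), after loading 'data' as before.
        // For halloffame.arff, first drop the 'Player' attribute:
        //   data.deleteAttributeAt(0);
        for (int i = 10; i >= 1; i--) {
            double c = 0.05 * i;                 // 0.50, 0.45, ..., 0.05
            J48 tree = new J48();
            tree.setConfidenceFactor((float) c);
            tree.setMinNumObj(1);
            tree.setReducedErrorPruning(false);
            tree.setUnpruned(false);

            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(tree, data, 10, new Random(1));
            tree.buildClassifier(data);          // for the tree size

            System.out.printf("C=%.2f  size=%.0f  accuracy=%.2f%%%n",
                    c, tree.measureTreeSize(), eval.pctCorrect());
        }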

    To Answer:

    Answer the following questions. In addition to running the experiments, I strongly recommend you read

    the description of each dataset (written at the top of each file) in order to learn its domain.

    a. Show a table which tabulates the values obtained for all runs for each dataset (confidence factor,

    size of the tree, classification accuracy).

    b. Your result probably indicated that pruning helped improve the accuracy greatly for some datasets

    but only marginally, if at all, for others; or in some cases, pruning might even have hindered accuracy.

    Based on the results, discuss in detail what factor or factors you think influenced the effect of

    pruning. Write at least 3 sentences.

    c. Weka uses 0.25 as the default confidence value. Do you think it is a good value to use? Explain

    why or why not.

    5. For this question, you experiment with the other pruning scheme (reduced error pruning), using the five

    datasets from the previous question.

    Specifics:


    The reduced error pruning in J48 has a parameter: 'numFolds'. It specifies the number of subsets into which the training data is divided: one fold is reserved as a validation set (used only for testing the effect of pruning a particular subtree), and the remaining folds are used for training/building a tree. By changing the number of folds, you essentially control the portion of the data used for training -- a small number of folds makes the validation set larger, thus leaving the training set smaller, while a large number of folds makes the validation set smaller, thus leaving the training set larger (although it is still a subset of the original training data). For example, numFolds = 2 holds out half of the training data for pruning, while numFolds = 10 holds out only one tenth.

    For each dataset,

    You run J48 with 'numFolds' = 2, 5, and 10 (so you'll do a total of 3 runs; a scripted sketch follows below). Also, set 'reducedErrorPruning' to True and 'minNumObj' to 1. Note that you can leave the 'seed' as 1. You can ignore other parameters (because J48 does too).

    As for the overall evaluation, as with the previous question, do the evaluation by 10-fold cross-

    validation.

    After each run, record the number of folds, the size of the tree, and the classification accuracy.
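    As with question 4, these runs can be scripted; a minimal sketch under the same Weka 3.6 API assumptions, with 'data' loaded as in the earlier examples:

        // Inside main(), after loading 'data' as before.
        for (int folds : new int[] {2, 5, 10}) {
            J48 tree = new J48();
            tree.setReducedErrorPruning(true); // switch to the validation-set scheme
            tree.setNumFolds(folds);           // one fold is held out for pruning
            tree.setMinNumObj(1);
            tree.setSeed(1);                   // leave 'seed' at its default of 1

            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(tree, data, 10, new Random(1));
            tree.buildClassifier(data);        // for the tree size

            System.out.printf("numFolds=%d  size=%.0f  accuracy=%.2f%%%n",
                    folds, tree.measureTreeSize(), eval.pctCorrect());
        }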

    To Answer:

    Answer the following questions.

    a. Show a table which tabulates the values obtained for all runs for each dataset (number of folds, size

    of the tree, classification accuracy).

    b. Describe your observations on the effect of the size of the training set (i.e., the number of folds).

    Write at least 3 sentences.


    c. How did this pruning scheme compare with the pessimistic estimate function? Were there large differences in the accuracy or tree size between the two schemes? Which pruning scheme "works

    better" or "is preferred" in your opinion?

    Submission

    Type all your answers in an electronic file (doc, txt, or pdf), and submit the file on COL (under 'Submit Homework', in the 'HW#1' bin) before 11:59 pm on the due date.

    If it's difficult for you to draw figures (trees in this homework) using software, you can alternatively hand-draw them on paper, scan the paper, and insert/paste the scanned image into the file. No matter how you create figures, make ONE file which contains ALL answers and submit that file.

    Be sure to WRITE YOUR NAME at the beginning of the file. As stated on the syllabus, "Assignments

    with NO NAME may be penalized by some points."
