
    CSC 578 Neural Networks and Machine Learning

    Homework #1

    Due: January 20 (Wed)

    Do all questions below.

    1. Textbook Exercise 3.1 (p. 77).

    In case you don't have the textbook yet, the question is: "Give decision trees to represent the following boolean functions:" (A small worked example for a function not on this list is sketched after the list.)

    a. A and not B

    b. A or [B and C]

    c. A xor B

    d. [A and B] or [C and D]
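    For instance, the function "A and B" (not one of the assigned ones) is represented by a tree that tests A at the root and tests B only along the A = true branch:

        A
        |-- false: false
        |-- true:  B
                   |-- false: false
                   |-- true:  true

    Each internal node tests one variable, each branch corresponds to one of its values, and each leaf gives the value of the function.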

    2. The following data gives the conditions under which an optician might want to prescribe soft contact

    lenses, hard contact lenses, or no contact lenses for a patient. Show the decision tree that would be learned by ID3. The target attribute is 'Contact-lenses'.

    Show all your work, including the calculations of IG (REQUIRED); the entropy and information-gain formulas are recalled after the table below. Do NOT use any decision-tree induction tools such as Weka. You may use tools/software for numeric calculation (including Excel), but NOT those that produce a decision tree.

    Age             Spectacle-prescrip   Astigmatism   Tear-prod-rate   Contact-lenses
    young           myope                no            normal           soft
    young           myope                yes           reduced          none
    young           myope                yes           normal           hard
    young           hypermetrope         no            reduced          none
    young           hypermetrope         no            normal           soft
    young           hypermetrope         yes           reduced          none
    pre-presbyopic  myope                no            reduced          none
    pre-presbyopic  myope                no            normal           soft
    pre-presbyopic  myope                yes           normal           hard
    pre-presbyopic  hypermetrope         no            reduced          none
    pre-presbyopic  hypermetrope         no            normal           soft
    pre-presbyopic  hypermetrope         yes           reduced          none
    pre-presbyopic  hypermetrope         yes           normal           none
    presbyopic      myope                no            normal           none
    presbyopic      myope                yes           reduced          none
    presbyopic      myope                yes           normal           hard
    presbyopic      hypermetrope         no            reduced          none
    presbyopic      hypermetrope         no            normal           soft
    presbyopic      hypermetrope         yes           reduced          none
    presbyopic      hypermetrope         yes           normal           none
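    For reference, the formulas these IG calculations use (standard definitions, as in Mitchell, Chapter 3):

        Entropy(S) = \sum_{c \in Classes} -p_c \log_2 p_c

        Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} \, Entropy(S_v)

    where p_c is the proportion of examples in S belonging to class c, and S_v is the subset of S for which attribute A takes value v.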

    3. Why is a decision tree that fits the data really well not necessarily better than another that doesn't fit it so

    well? [Assume the whole dataset fits in the computer's memory.]

    Write at least 3 sentences.


    4. Download WEKA and install it on your system. Then conduct the following experiment.

    Setup:

    If your system already has Java 1.5 or newer, the second choice "a self-extracting executable

    without the Java VM (weka-3-6-1.exe)" will do.

    In case you encounter problems with the Weka site, here is a local ZIP file of the self-extracting executable (weka-3-6-1.zip, 18 MB).

    J48 in Weka:

    For this and the next question (questions 4 & 5), you experiment with the effect of pruning in decision trees. Weka's

    'weka.classifiers.trees.J48' lets you generate pruned as well as unpruned trees. The J48 classifier

    provides two methods for pruning a decision tree:

    a. By using a "pessimistic estimate" function (described in Mitchell's textbook p. 71, the 9th line from the bottom: "Another method, used by C4.5, ..."); and

    b. By using a validation set to test whether pruning improves accuracy -- the 'reducedErrorPruning' scheme.

    [FYI, J48 does NOT convert trees to rules. Both pruning schemes alter a tree after it is fully grown (thus

    post-pruning).]
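    For orientation, the pessimistic estimate works roughly as follows (this is the normal-approximation form given in Witten & Frank's "Data Mining" textbook; J48's internal computation may differ in detail). The observed error rate f = E/N at a node covering N training instances is replaced by the upper bound of a confidence interval:

        e = \frac{ f + \frac{z^2}{2N} + z \sqrt{ \frac{f}{N} - \frac{f^2}{N} + \frac{z^2}{4N^2} } }{ 1 + \frac{z^2}{N} }

    where z is the standard-normal deviate for the chosen confidence level. A subtree is replaced by a leaf when the leaf's estimated error is no worse than the subtree's; a lower confidence level gives a larger z, a more pessimistic estimate, and hence more aggressive pruning.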

    For question 4, we experiment with the former scheme.

    The pessimistic estimate function has a parameter: Confidence Level. By setting this parameter to various values, we can experiment with the degree of pruning -- minimal to aggressive -- and its effect on classification accuracy. In J48, this confidence level can be set by the 'confidenceFactor' parameter, in the pop-up window which appears after clicking in the text box to the right of the "Choose" button. It is set to 0.25 by default. Changing it to a smaller value gives more aggressive pruning, while a larger value gives minimal pruning. (A scripted alternative to the Explorer is sketched below.)
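    If you prefer to script these runs instead of clicking through the Explorer, here is a minimal sketch using the Weka Java API (a sketch only: class and setter names are as I recall them from Weka 3.6, and 'vote.arff' is just an example file):

        import java.util.Random;

        import weka.classifiers.Evaluation;
        import weka.classifiers.trees.J48;
        import weka.core.Instances;
        import weka.core.converters.ConverterUtils.DataSource;

        public class SingleRun {
            public static void main(String[] args) throws Exception {
                // Load one of the ARFF files; the class is the last attribute.
                Instances data = DataSource.read("vote.arff");
                data.setClassIndex(data.numAttributes() - 1);

                // Configure J48 as in the Explorer pop-up window.
                J48 tree = new J48();
                tree.setConfidenceFactor(0.25f);    // 'confidenceFactor'
                tree.setMinNumObj(1);               // 'minNumObj'
                tree.setReducedErrorPruning(false); // use the pessimistic estimate
                tree.setUnpruned(false);            // pruning is on

                // Evaluate by 10-fold cross-validation ("Test Options").
                Evaluation eval = new Evaluation(data);
                eval.crossValidateModel(tree, data, 10, new Random(1));

                // Build on the full data to read off the tree size, as the
                // Explorer does for the model it prints.
                tree.buildClassifier(data);
                System.out.printf("size=%.0f  accuracy=%.2f%%%n",
                        tree.measureTreeSize(), eval.pctCorrect());
            }
        }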


    Specifics:

    The purpose of the experiment is to derive the (sub-)optimal confidence factor value. To do so, we try various confidence factor values with several datasets. The datasets are as follows; here is also a zip file which includes them all.

    vote.arff (40 kb) -- 16 attributes (nominal), 2 classes, 435 instances.
        This dataset contains the party affiliation of the 435 members of the 1984 US House of Representatives, as well as their voting records on 16 different bills.

    tic-tac-toe.arff (31 kb) -- 9 attributes (nominal), 2 classes, 958 instances.
        This database encodes the complete set of possible board configurations at the end of tic-tac-toe games.

    splice2.arff (393 kb) -- 62 attributes (nominal), 3 classes, 3190 instances.
        Primate splice-junction gene sequences (DNA). Given a sequence of DNA, recognize the boundaries between exons (the parts of the DNA sequence retained after splicing) and introns (the parts of the DNA sequence that are spliced out).

    breast-cancer.arff (30 kb) -- 9 attributes (nominal), 2 classes, 286 instances.
        Breast cancer data; classify into recurrent/non-recurrent events.

    (*) halloffame.arff (140 kb) -- 17 attributes (nominal/numeric mixed), 3 classes, 1338 instances.
        Records of baseball players inducted to the Baseball Hall of Fame.


    NOTE: (*) When you run the Hall of Fame data, remove the 'Player' attribute (the first

    attribute). To do so, after you open the file (in the "Preprocess" step), select the attribute

    and hit "Remove".

    For each dataset,

    You run J48 with the confidence factor 0.50, 0.45, 0.40, ... down to 0.05 (i.e., a decrement of 0.05, so you'll do a total of 10 runs). Also be sure to set the 'minNumObj' to 1, and make sure 'reducedErrorPruning' is False and 'unpruned' is False, along with all other parameters as indicated in the previous figure.

    For each run, do the evaluation by 10-fold cross-validation. In the "Weka Explorer" window,

    under "Test Options", select "Cross-validation" and set the number of folds to 10(which is the

    default in Weka).


    After each run, record the confidence factor, the size of the tree, and the classification accuracy.

    Do the same procedure for all datasets. (For those who prefer scripting the runs, a sketch follows.)
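    Extending the single-run sketch from earlier (same caveat: API names as I recall them from Weka 3.6), the ten runs for one dataset might look like this:

        // Inside main(), after loading 'data' as before.
        // For halloffame.arff, first drop the 'Player' attribute:
        //   data.deleteAttributeAt(0);
        for (int i = 10; i >= 1; i--) {
            double c = 0.05 * i;                 // 0.50, 0.45, ..., 0.05
            J48 tree = new J48();
            tree.setConfidenceFactor((float) c);
            tree.setMinNumObj(1);
            tree.setReducedErrorPruning(false);
            tree.setUnpruned(false);

            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(tree, data, 10, new Random(1));
            tree.buildClassifier(data);          // for the tree size

            System.out.printf("C=%.2f  size=%.0f  accuracy=%.2f%%%n",
                    c, tree.measureTreeSize(), eval.pctCorrect());
        }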

    To Answer:

    Answer the following questions. In addition to running the experiments, I strongly recommend you read

    the description of each dataset (written at the top of each file) in order to learn its domain.

    a. Show a table which tabulates the values obtained for all runs for each dataset (confidence factor,

    size of the tree, classification accuracy).

    b. Your result probably indicated that pruning helped improve the accuracy greatly for some datasets

    but only marginally, if at all, for others; or in some cases, pruning might even have hindered accuracy.

    Based on the results, discuss in detail what factor or factors you think influenced the effect of

    pruning. Write at least 3 sentences.

    c. Weka uses 0.25 as the default confidence value. Do you think it is a good value to use? Explain

    why or why not.

    5. For this question, you experiment with the other pruning scheme (reduced error pruning), using the five

    datasets from the previous question.

    Specifics:


    The reduced error pruning in J48 has a parameter: 'numFolds'. It specifies the number of subsets into which the training data is divided: one fold is reserved as a validation set (used only for testing the effect of pruning a particular subtree), and the remaining folds are used for training/building a tree. By changing the number of folds, you essentially control the portion of the data used for training -- a small number of folds makes the validation set larger, thus leaving the training set smaller, while a large number of folds makes the validation set smaller, thus leaving the training set larger (although it is still a subset of the original training data). For example, numFolds = 2 holds out half of the training data for pruning, while numFolds = 10 holds out only one tenth.

    For each dataset,

    You run J48 with 'numFolds' = 2, 5, and 10 (so you'll do a total of 3 runs; a scripted sketch follows below). Also, set 'reducedErrorPruning' to True and 'minNumObj' to 1. Note that you can leave the 'seed' as 1. You can ignore other parameters (because J48 does too).

    As for the overall evaluation, as with the previous question, do the evaluation by 10-fold cross-

    validation.

    After each run, record the number of folds, the size of the tree, and the classification accuracy.
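    As with question 4, these runs can be scripted; a minimal sketch under the same Weka 3.6 API assumptions, with 'data' loaded as in the earlier examples:

        // Inside main(), after loading 'data' as before.
        for (int folds : new int[] {2, 5, 10}) {
            J48 tree = new J48();
            tree.setReducedErrorPruning(true); // switch to the validation-set scheme
            tree.setNumFolds(folds);           // one fold is held out for pruning
            tree.setMinNumObj(1);
            tree.setSeed(1);                   // leave 'seed' at its default of 1

            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(tree, data, 10, new Random(1));
            tree.buildClassifier(data);        // for the tree size

            System.out.printf("numFolds=%d  size=%.0f  accuracy=%.2f%%%n",
                    folds, tree.measureTreeSize(), eval.pctCorrect());
        }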

    To Answer:

    Answer the following questions.

    a. Show a table which tabulates the values obtained for all runs for each dataset (number of folds, size

    of the tree, classification accuracy).

    b. Describe your observations on the effect of the size of the training set (i.e., the number of folds).

    Write at least 3 sentences.


    c. How did this pruning scheme compare with the pessimistic estimate function? Were there large differences in the accuracy or tree size between the two schemes? Which pruning scheme "works

    better" or "is preferred" in your opinion?

    Submission

    Type all your answers in an electronic file (doc, txt, or pdf), and submit the file on COL (under 'Submit Homework', in the 'HW#1' bin) before 11:59 pm on the due date.

    If it's difficult for you to draw figures (trees in this homework) using software, you can alternatively hand-draw them on paper, scan the paper, and insert/paste the scanned image into the file. No matter how you create figures, make ONE file which contains ALL answers and submit that file.

    Be sure to WRITE YOUR NAME at the beginning of the file. As stated on the syllabus, "Assignments

    with NO NAME may be penalized by some points."
