CSC 4510 – Machine Learning


Transcript of CSC 4510 – Machine Learning

Page 1: CSC 4510 – Machine Learning

CSC 4510 – Machine Learning
Dr. Mary-Angela Papalaskari
Department of Computing Sciences
Villanova University

Course website: www.csc.villanova.edu/~map/4510/

Lecture 4: Decision Trees, continued


Page 2: CSC 4510 – Machine Learning


Tiny Example of Decision Tree Learning

• Instance attributes: <size, color, shape>
• C = {positive, negative}

• D:

Example  Size   Color  Shape     Category
1        small  red    circle    positive
2        large  red    circle    positive
3        small  red    triangle  negative
4        large  blue   circle    negative

[Decision tree figure: the root tests color; red → test shape (circle → pos, square → neg, triangle → neg); blue → neg; green → neg]
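For concreteness, the training set D above could be encoded as follows (a minimal sketch in Python; the attribute and label names are my own choices, not prescribed by the slides):

    # Hypothetical encoding of the four training examples from the table above.
    D = [
        {"size": "small", "color": "red",  "shape": "circle",   "category": "positive"},
        {"size": "large", "color": "red",  "shape": "circle",   "category": "positive"},
        {"size": "small", "color": "red",  "shape": "triangle", "category": "negative"},
        {"size": "large", "color": "blue", "shape": "circle",   "category": "negative"},
    ]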

Page 3: CSC 4510 – Machine Learning


Decision Trees
• Tree-based classifiers for instances represented as feature vectors. Nodes test features, there is one branch for each value of the feature, and leaves specify the category.
• Can represent arbitrary conjunction and disjunction. Can represent any classification function over discrete feature vectors.
• Can be rewritten as a set of rules, i.e. disjunctive normal form (DNF).
  – red ∧ circle → pos
  – red ∧ circle → A; blue → B; red ∧ square → B; green → C; red ∧ triangle → C

[Decision tree figure 1: color — red → shape (circle → pos, square → neg, triangle → neg); blue → neg; green → neg]
[Decision tree figure 2: color — red → shape (circle → A, square → B, triangle → C); blue → B; green → C]
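A tree like the first one above can be stored directly as a nested mapping and applied to new instances; here is a minimal sketch (the dictionary encoding and function name are mine, not from the slides):

    # Interior nodes map a feature name to {value: subtree}; leaves are category labels.
    tree = {"color": {"red":   {"shape": {"circle": "pos",
                                          "square": "neg",
                                          "triangle": "neg"}},
                      "blue":  "neg",
                      "green": "neg"}}

    def classify(node, instance):
        """Follow the branch matching the instance's value at each test until a leaf."""
        while isinstance(node, dict):
            feature, branches = next(iter(node.items()))
            node = branches[instance[feature]]
        return node

    print(classify(tree, {"size": "big", "color": "red", "shape": "circle"}))  # pos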


Page 4: CSC 4510 – Machine Learning


Top-Down Decision Tree Creation

Examples at the root:
<big, red, circle>: +   <small, red, circle>: +   <small, red, square>: −   <big, blue, circle>: −

Split on color (branches: red, blue, green).

Examples in the red branch:
<big, red, circle>: +   <small, red, circle>: +   <small, red, square>: −


Page 5: CSC 4510 – Machine Learning


Top-Down Decision Tree Creation

[Figure: the tree after the next split. The root examples <big, red, circle>: +, <small, red, circle>: +, <small, red, square>: −, <big, blue, circle>: − are split on color. The red branch (<big, red, circle>: +, <small, red, circle>: +, <small, red, square>: −) is split on shape: circle → pos, square → neg, triangle → neg. The blue branch (<big, blue, circle>: −) → neg; green → neg.]


Page 6: CSC 4510 – Machine Learning


First step in detail

Start
Value     Examples                                    Count  Probability
positive  <big, red, circle>, <small, red, circle>    2      .5
negative  <small, red, square>, <big, blue, circle>   2      .5

Split on color: red / blue / green

Color = red
Value     Examples                                    Count  Probability
positive  <big, red, circle>, <small, red, circle>    2      .667
negative  <small, red, square>                        1      .333

Color = blue
Value     Examples                                    Count  Probability
positive  none                                        0      0
negative  <big, blue, circle>                         1      1

Color = green
Value     Examples                                    Count  Probability
positive  none                                        0      0
negative  none                                        0      0


Page 7: CSC 4510 – Machine Learning


First step in detail (continued)

(Start, Color = red, Color = blue, and Color = green tables as on the previous slide, now annotated with their entropies:)

Entropy(start) = 1
Entropy(red) = 0.92
Entropy(blue) = 0


Page 8: CSC 4510 – Machine Learning


First step in detail (continued)

(Start, Color = red, Color = blue, and Color = green tables as on the previous slides, with their entropies:)

Entropy(start) = 1
Entropy(red) = 0.92
Entropy(blue) = 0
Entropy(green) = 0

Gain(Color) = Entropy(start) − (Entropy(red)·(3/4) + Entropy(blue)·(1/4)) = 1 − 0.69 − 0 = 0.31
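As a quick check of the numbers on this slide, here is a small Python computation (mine, not from the slides):

    import math

    def H(p):
        """Binary entropy in bits of a class distribution (p, 1-p)."""
        return 0.0 if p in (0.0, 1.0) else -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

    e_start = H(2 / 4)                                        # 1.0
    e_red   = H(2 / 3)                                        # ≈ 0.918 (the 0.92 above)
    e_blue  = H(0 / 1)                                        # 0.0
    gain_color = e_start - (3/4) * e_red - (1/4) * e_blue     # ≈ 0.311 (the 0.31 above)
    print(round(e_red, 3), round(gain_color, 3))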


Page 9: CSC 4510 – Machine Learning


First step in detail – Split on Size?

Start
Value     Examples                                    Count  Probability
positive  <big, red, circle>, <small, red, circle>    2      .5
negative  <small, red, square>, <big, blue, circle>   2      .5

Split on size: BIG / small

Size = BIG
Value     Examples               Count  Probability
positive  <big, red, circle>     1      .5
negative  <big, blue, circle>    1      .5

Size = small
Value     Examples               Count  Probability
positive  <small, red, circle>   1      .5
negative  <small, red, square>   1      .5


Page 10: CSC 4510 – Machine Learning


First step in detail – Split on Size (continued)

(Start, Size = BIG, and Size = small tables as on the previous slide, now annotated with their entropies:)

Entropy(start) = 1
Entropy(BIG) = 1
Entropy(small) = 1


Page 11: CSC 4510 – Machine Learning


First step in detail – Split on Size (continued)

(Start, Size = BIG, and Size = small tables as on the previous slides, with their entropies:)

Entropy(start) = 1
Entropy(BIG) = 1
Entropy(small) = 1

Gain(Size) = Entropy(start) − (Entropy(BIG)·(2/4) + Entropy(small)·(2/4)) = 1 − 1/2 − 1/2 = 0


Page 12: CSC 4510 – Machine Learning


Top-Down Decision Tree Induction

• Build a tree top-down by divide and conquer (a code sketch of this recursion follows the figure below).

[Figure: the worked example. The root examples <big, red, circle>: +, <small, red, circle>: +, <small, red, square>: −, <big, blue, circle>: − are split on color; the red branch is then split on shape, giving circle → pos, square → neg, triangle → neg; blue → neg; green → neg.]
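The divide-and-conquer recursion described above can be sketched as follows (a minimal ID3-style sketch in Python; the function and variable names are mine, ties are broken by feature order, and branches with no training examples, e.g. green, simply do not appear — details the slides do not specify):

    from collections import Counter
    import math

    def entropy(labels):
        """Entropy in bits of a list of class labels."""
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values()) if n else 0.0

    def gain(examples, labels, feature):
        """Expected reduction in entropy from splitting on `feature`."""
        groups = {}
        for x, y in zip(examples, labels):
            groups.setdefault(x[feature], []).append(y)
        return entropy(labels) - sum(len(ys) / len(labels) * entropy(ys) for ys in groups.values())

    def build_tree(examples, labels, features):
        """Grow the tree top-down: stop when the node is pure or no features remain,
        otherwise split on the feature with the highest information gain."""
        if len(set(labels)) == 1:                       # all one class -> leaf
            return labels[0]
        if not features:                                # no tests left -> majority-class leaf
            return Counter(labels).most_common(1)[0][0]
        best = max(features, key=lambda f: gain(examples, labels, f))
        branches = {}
        for x, y in zip(examples, labels):              # partition examples by value of `best`
            xs, ys = branches.setdefault(x[best], ([], []))
            xs.append(x)
            ys.append(y)
        rest = [f for f in features if f != best]
        return {best: {v: build_tree(xs, ys, rest) for v, (xs, ys) in branches.items()}}

    X = [{"size": "big",   "color": "red",  "shape": "circle"},
         {"size": "small", "color": "red",  "shape": "circle"},
         {"size": "small", "color": "red",  "shape": "square"},
         {"size": "big",   "color": "blue", "shape": "circle"}]
    y = ["+", "+", "-", "-"]
    print(build_tree(X, y, ["size", "color", "shape"]))
    # {'color': {'red': {'shape': {'circle': '+', 'square': '-'}}, 'blue': '-'}}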


Page 13: CSC 4510 – Machine Learning

Another view (AIspace)


Page 14: CSC 4510 – Machine Learning


Picking a Good Split Feature
• Goal is to have the resulting tree be as small as possible, per Occam's razor.
• Finding a minimal decision tree (nodes, leaves, or depth) is an NP-hard optimization problem.
• Top-down divide-and-conquer method does a greedy search for a simple tree but does not guarantee finding the smallest.
  – General lesson in ML: "Greed is good."
• Want to pick a feature that creates subsets of examples that are relatively "pure" in a single class so they are "closer" to being leaf nodes.
• There are a variety of heuristics for picking a good test; a popular one is based on information gain, which originated with the ID3 system of Quinlan (1979).


Page 15: CSC 4510 – Machine Learning


Entropy
• Entropy (disorder, impurity) of a set of examples, S, relative to a binary classification is:

  Entropy(S) = −p1·log2(p1) − p0·log2(p0)

  where p1 is the fraction of positive examples in S and p0 is the fraction of negatives.
• If all examples are in one category, entropy is zero (we define 0·log(0) = 0).
• If examples are equally mixed (p1 = p0 = 0.5), entropy is a maximum of 1.
• Entropy can be viewed as the number of bits required on average to encode the class of an example in S where data compression (e.g. Huffman coding) is used to give shorter codes to more likely cases.
• For multi-class problems with c categories, entropy generalizes to:

  Entropy(S) = −Σ (i = 1..c) pi·log2(pi)
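The general formula translates directly into code; a minimal Python sketch (mine, not from the slides):

    from collections import Counter
    import math

    def entropy(labels):
        """Entropy(S) = -sum_i p_i * log2(p_i), computed from a list of class labels."""
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values()) if n else 0.0

    print(entropy(["pos", "pos", "neg", "neg"]))   # 1.0  (equally mixed)
    print(entropy(["pos", "pos", "pos"]))          # 0.0  (pure)
    print(entropy(["pos", "pos", "neg"]))          # ≈ 0.918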


Page 16: CSC 4510 – Machine Learning


Entropy Plot for Binary Classification
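The plot on this slide (entropy of a binary classification as a function of the positive-class fraction) could be reproduced with a few lines of Python; a sketch, assuming numpy and matplotlib are available:

    import numpy as np
    import matplotlib.pyplot as plt

    p = np.linspace(0.001, 0.999, 500)             # fraction of positive examples
    H = -p * np.log2(p) - (1 - p) * np.log2(1 - p)
    plt.plot(p, H)
    plt.xlabel("p1 (fraction of positive examples)")
    plt.ylabel("Entropy(S)")
    plt.show()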


Page 17: CSC 4510 – Machine Learning


Information Gain
• The information gain of a feature F is the expected reduction in entropy resulting from splitting on this feature:

  Gain(S, F) = Entropy(S) − Σ (v ∈ Values(F)) (|Sv| / |S|)·Entropy(Sv)

  where Sv is the subset of S having value v for feature F.
• Entropy of each resulting subset is weighted by its relative size.
• Example summary
  – <big, red, circle>: +   <small, red, circle>: +
  – <small, red, square>: −   <big, blue, circle>: −

  2+, 2−: E = 1;  split on size:  big → 1+, 1− (E = 1),  small → 1+, 1− (E = 1)
  Gain = 1 − (0.5·1 + 0.5·1) = 0

  2+, 2−: E = 1;  split on color:  red → 2+, 1− (E = 0.918),  blue → 0+, 1− (E = 0)
  Gain = 1 − (0.75·0.918 + 0.25·0) = 0.311

  2+, 2−: E = 1;  split on shape:  circle → 2+, 1− (E = 0.918),  square → 0+, 1− (E = 0)
  Gain = 1 − (0.75·0.918 + 0.25·0) = 0.311
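The three gains above can be checked with a short Python computation (mine, not from the slides; the entropy helper is repeated so the snippet stands alone):

    from collections import Counter
    import math

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values()) if n else 0.0

    def information_gain(examples, labels, feature):
        """Gain(S, F) = Entropy(S) - sum_v |Sv|/|S| * Entropy(Sv)."""
        groups = {}
        for x, y in zip(examples, labels):
            groups.setdefault(x[feature], []).append(y)
        return entropy(labels) - sum(len(ys) / len(labels) * entropy(ys) for ys in groups.values())

    X = [{"size": "big",   "color": "red",  "shape": "circle"},
         {"size": "small", "color": "red",  "shape": "circle"},
         {"size": "small", "color": "red",  "shape": "square"},
         {"size": "big",   "color": "blue", "shape": "circle"}]
    y = ["+", "+", "-", "-"]
    for f in ("size", "color", "shape"):
        print(f, round(information_gain(X, y, f), 3))   # size 0.0, color 0.311, shape 0.311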


Page 18: CSC 4510 – Machine Learning


Bias in Decision-Tree Induction

• Information gain gives a bias for trees with minimal depth.

Recall:
• Bias
  – Any criterion other than consistency with the training data that is used to select a hypothesis.


Page 19: CSC 4510 – Machine Learning


History of Decision-Tree Research
• Hunt and colleagues use exhaustive-search decision-tree methods (CLS) to model human concept learning in the 1960s.
• In the late 1970s, Quinlan develops ID3 with the information gain heuristic to learn expert systems from examples.
• Simultaneously, Breiman, Friedman, and colleagues develop CART (Classification and Regression Trees), similar to ID3.
• In the 1980s, a variety of improvements are introduced to handle noise, continuous features, missing features, and improved splitting criteria. Various expert-system development tools result.
• Quinlan's updated decision-tree package (C4.5) is released in 1993.
• Weka includes a Java version of C4.5 called J48.


Page 20: CSC 4510 – Machine Learning


Additional Decision Tree Issues

• Better splitting criteria
  – Information gain prefers features with many values.
• Continuous features
• Predicting a real-valued function (regression trees)
• Missing feature values
• Features with costs
• Misclassification costs
• Incremental learning
  – ID4
  – ID5
• Mining large databases that do not fit in main memory
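As a side note on the first bullet (this remark is mine, not from the slides, though it is the standard remedy used in C4.5): the many-valued-feature bias can be reduced by dividing information gain by the "split information", giving the gain ratio. A minimal sketch:

    import math

    def split_info(subset_sizes):
        """Entropy of the partition itself: -sum |Sv|/|S| * log2(|Sv|/|S|)."""
        n = sum(subset_sizes)
        return -sum((c / n) * math.log2(c / n) for c in subset_sizes if c)

    def gain_ratio(gain, subset_sizes):
        """C4.5-style normalization of information gain by the split information."""
        si = split_info(subset_sizes)
        return gain / si if si else 0.0

    # Splitting the 4-example set on color gives subsets of sizes 3 (red) and 1 (blue):
    print(round(gain_ratio(0.311, [3, 1]), 3))   # 0.311 / 0.811 ≈ 0.383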

Page 21: CSC 4510 – Machine Learning

Class Exercise
• Practice using decision tree learning on some of the sample datasets available in AIspace.


Some of the slides in this presentation are adapted from:
• Prof. Frank Klassner's ML class at Villanova
• the University of Manchester ML course http://www.cs.manchester.ac.uk/ugt/COMP24111/
• the Stanford online ML course http://www.ml-class.org/

Page 22: CSC 4510 – Machine Learning

Resources: Datasets

• UCI Repository: http://www.ics.uci.edu/~mlearn/MLRepository.html

• UCI KDD Archive: http://kdd.ics.uci.edu/summary.data.application.html

• Statlib: http://lib.stat.cmu.edu/

• Delve: http://www.cs.utoronto.ca/~delve/
