CSC 4510 – Machine Learning


Transcript of CSC 4510 – Machine Learning

Page 1: CSC 4510 – Machine Learning

CSC 4510 – Machine Learning
Dr. Mary-Angela Papalaskari
Department of Computing Sciences
Villanova University

Course website: www.csc.villanova.edu/~map/4510/

Lecture 4: Decision Trees, continued


Page 2: CSC 4510 – Machine Learning


Tiny Example of Decision Tree Learning

• Instance attributes: <size, color, shape>
• C = {positive, negative}

• D:

Example  Size   Color  Shape     Category
1        small  red    circle    positive
2        large  red    circle    positive
3        small  red    triangle  negative
4        large  blue   circle    negative

[Decision tree figure: the root tests color; red → test shape (circle → pos, square → neg, triangle → neg); blue → neg; green → neg]
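For concreteness, the training set D above could be encoded as follows (a minimal sketch in Python; the attribute and label names are my own choices, not prescribed by the slides):

    # Hypothetical encoding of the four training examples from the table above.
    D = [
        {"size": "small", "color": "red",  "shape": "circle",   "category": "positive"},
        {"size": "large", "color": "red",  "shape": "circle",   "category": "positive"},
        {"size": "small", "color": "red",  "shape": "triangle", "category": "negative"},
        {"size": "large", "color": "blue", "shape": "circle",   "category": "negative"},
    ]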

Page 3: CSC 4510 – Machine Learning


Decision Trees
• Tree-based classifiers for instances represented as feature vectors. Nodes test features, there is one branch for each value of the feature, and leaves specify the category.
• Can represent arbitrary conjunction and disjunction. Can represent any classification function over discrete feature vectors.
• Can be rewritten as a set of rules, i.e. disjunctive normal form (DNF).
  – red ∧ circle → pos
  – red ∧ circle → A; blue → B; red ∧ square → B; green → C; red ∧ triangle → C

[Decision tree figure 1: color — red → shape (circle → pos, square → neg, triangle → neg); blue → neg; green → neg]
[Decision tree figure 2: color — red → shape (circle → A, square → B, triangle → C); blue → B; green → C]
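A tree like the first one above can be stored directly as a nested mapping and applied to new instances; here is a minimal sketch (the dictionary encoding and function name are mine, not from the slides):

    # Interior nodes map a feature name to {value: subtree}; leaves are category labels.
    tree = {"color": {"red":   {"shape": {"circle": "pos",
                                          "square": "neg",
                                          "triangle": "neg"}},
                      "blue":  "neg",
                      "green": "neg"}}

    def classify(node, instance):
        """Follow the branch matching the instance's value at each test until a leaf."""
        while isinstance(node, dict):
            feature, branches = next(iter(node.items()))
            node = branches[instance[feature]]
        return node

    print(classify(tree, {"size": "big", "color": "red", "shape": "circle"}))  # pos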


Page 4: CSC 4510 – Machine Learning


Top-Down Decision Tree Creation

Examples at the root:
<big, red, circle>: +   <small, red, circle>: +   <small, red, square>: −   <big, blue, circle>: −

Split on color (branches: red, blue, green).

Examples in the red branch:
<big, red, circle>: +   <small, red, circle>: +   <small, red, square>: −


Page 5: CSC 4510 – Machine Learning


Top-Down Decision Tree Creation

[Figure: the tree after the next split. The root examples <big, red, circle>: +, <small, red, circle>: +, <small, red, square>: −, <big, blue, circle>: − are split on color. The red branch (<big, red, circle>: +, <small, red, circle>: +, <small, red, square>: −) is split on shape: circle → pos, square → neg, triangle → neg. The blue branch (<big, blue, circle>: −) → neg; green → neg.]


Page 6: CSC 4510 – Machine Learning


First step in detail

Start
Value     Examples                                    Count  Probability
positive  <big, red, circle>, <small, red, circle>    2      .5
negative  <small, red, square>, <big, blue, circle>   2      .5

Split on color: red / blue / green

Color = red
Value     Examples                                    Count  Probability
positive  <big, red, circle>, <small, red, circle>    2      .667
negative  <small, red, square>                        1      .333

Color = blue
Value     Examples                                    Count  Probability
positive  none                                        0      0
negative  <big, blue, circle>                         1      1

Color = green
Value     Examples                                    Count  Probability
positive  none                                        0      0
negative  none                                        0      0


Page 7: CSC 4510 – Machine Learning


First step in detail (continued)

(Start, Color = red, Color = blue, and Color = green tables as on the previous slide, now annotated with their entropies:)

Entropy(start) = 1
Entropy(red) = 0.92
Entropy(blue) = 0


Page 8: CSC 4510 – Machine Learning


First step in detail (continued)

(Start, Color = red, Color = blue, and Color = green tables as on the previous slides, with their entropies:)

Entropy(start) = 1
Entropy(red) = 0.92
Entropy(blue) = 0
Entropy(green) = 0

Gain(Color) = Entropy(start) − (Entropy(red)·(3/4) + Entropy(blue)·(1/4)) = 1 − 0.69 − 0 = 0.31
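As a quick check of the numbers on this slide, here is a small Python computation (mine, not from the slides):

    import math

    def H(p):
        """Binary entropy in bits of a class distribution (p, 1-p)."""
        return 0.0 if p in (0.0, 1.0) else -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

    e_start = H(2 / 4)                                        # 1.0
    e_red   = H(2 / 3)                                        # ≈ 0.918 (the 0.92 above)
    e_blue  = H(0 / 1)                                        # 0.0
    gain_color = e_start - (3/4) * e_red - (1/4) * e_blue     # ≈ 0.311 (the 0.31 above)
    print(round(e_red, 3), round(gain_color, 3))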


Page 9: CSC 4510 – Machine Learning


First step in detail – Split on Size?

Start
Value     Examples                                    Count  Probability
positive  <big, red, circle>, <small, red, circle>    2      .5
negative  <small, red, square>, <big, blue, circle>   2      .5

Split on size: BIG / small

Size = BIG
Value     Examples               Count  Probability
positive  <big, red, circle>     1      .5
negative  <big, blue, circle>    1      .5

Size = small
Value     Examples               Count  Probability
positive  <small, red, circle>   1      .5
negative  <small, red, square>   1      .5


Page 10: CSC 4510 – Machine Learning


First step in detail – Split on Size (continued)

(Start, Size = BIG, and Size = small tables as on the previous slide, now annotated with their entropies:)

Entropy(start) = 1
Entropy(BIG) = 1
Entropy(small) = 1


Page 11: CSC 4510 – Machine Learning


First step in detail – Split on Size (continued)

(Start, Size = BIG, and Size = small tables as on the previous slides, with their entropies:)

Entropy(start) = 1
Entropy(BIG) = 1
Entropy(small) = 1

Gain(Size) = Entropy(start) − (Entropy(BIG)·(2/4) + Entropy(small)·(2/4)) = 1 − 1/2 − 1/2 = 0


Page 12: CSC 4510 – Machine Learning


Top-Down Decision Tree Induction

• Build a tree top-down by divide and conquer (a code sketch of this recursion follows the figure below).

[Figure: the worked example. The root examples <big, red, circle>: +, <small, red, circle>: +, <small, red, square>: −, <big, blue, circle>: − are split on color; the red branch is then split on shape, giving circle → pos, square → neg, triangle → neg; blue → neg; green → neg.]
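The divide-and-conquer recursion described above can be sketched as follows (a minimal ID3-style sketch in Python; the function and variable names are mine, ties are broken by feature order, and branches with no training examples, e.g. green, simply do not appear — details the slides do not specify):

    from collections import Counter
    import math

    def entropy(labels):
        """Entropy in bits of a list of class labels."""
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values()) if n else 0.0

    def gain(examples, labels, feature):
        """Expected reduction in entropy from splitting on `feature`."""
        groups = {}
        for x, y in zip(examples, labels):
            groups.setdefault(x[feature], []).append(y)
        return entropy(labels) - sum(len(ys) / len(labels) * entropy(ys) for ys in groups.values())

    def build_tree(examples, labels, features):
        """Grow the tree top-down: stop when the node is pure or no features remain,
        otherwise split on the feature with the highest information gain."""
        if len(set(labels)) == 1:                       # all one class -> leaf
            return labels[0]
        if not features:                                # no tests left -> majority-class leaf
            return Counter(labels).most_common(1)[0][0]
        best = max(features, key=lambda f: gain(examples, labels, f))
        branches = {}
        for x, y in zip(examples, labels):              # partition examples by value of `best`
            xs, ys = branches.setdefault(x[best], ([], []))
            xs.append(x)
            ys.append(y)
        rest = [f for f in features if f != best]
        return {best: {v: build_tree(xs, ys, rest) for v, (xs, ys) in branches.items()}}

    X = [{"size": "big",   "color": "red",  "shape": "circle"},
         {"size": "small", "color": "red",  "shape": "circle"},
         {"size": "small", "color": "red",  "shape": "square"},
         {"size": "big",   "color": "blue", "shape": "circle"}]
    y = ["+", "+", "-", "-"]
    print(build_tree(X, y, ["size", "color", "shape"]))
    # {'color': {'red': {'shape': {'circle': '+', 'square': '-'}}, 'blue': '-'}}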


Page 13: CSC 4510 – Machine Learning

Another view (AIspace)


Page 14: CSC 4510 – Machine Learning


Picking a Good Split Feature
• Goal is to have the resulting tree be as small as possible, per Occam's razor.
• Finding a minimal decision tree (nodes, leaves, or depth) is an NP-hard optimization problem.
• Top-down divide-and-conquer method does a greedy search for a simple tree but does not guarantee finding the smallest.
  – General lesson in ML: "Greed is good."
• Want to pick a feature that creates subsets of examples that are relatively "pure" in a single class so they are "closer" to being leaf nodes.
• There are a variety of heuristics for picking a good test; a popular one is based on information gain, which originated with the ID3 system of Quinlan (1979).


Page 15: CSC 4510 – Machine Learning


Entropy
• Entropy (disorder, impurity) of a set of examples, S, relative to a binary classification is:

  Entropy(S) = −p1·log2(p1) − p0·log2(p0)

  where p1 is the fraction of positive examples in S and p0 is the fraction of negatives.
• If all examples are in one category, entropy is zero (we define 0·log(0) = 0).
• If examples are equally mixed (p1 = p0 = 0.5), entropy is a maximum of 1.
• Entropy can be viewed as the number of bits required on average to encode the class of an example in S where data compression (e.g. Huffman coding) is used to give shorter codes to more likely cases.
• For multi-class problems with c categories, entropy generalizes to:

  Entropy(S) = −Σ (i = 1..c) pi·log2(pi)
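The general formula translates directly into code; a minimal Python sketch (mine, not from the slides):

    from collections import Counter
    import math

    def entropy(labels):
        """Entropy(S) = -sum_i p_i * log2(p_i), computed from a list of class labels."""
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values()) if n else 0.0

    print(entropy(["pos", "pos", "neg", "neg"]))   # 1.0  (equally mixed)
    print(entropy(["pos", "pos", "pos"]))          # 0.0  (pure)
    print(entropy(["pos", "pos", "neg"]))          # ≈ 0.918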


Page 16: CSC 4510 – Machine Learning


Entropy Plot for Binary Classification
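The plot on this slide (entropy of a binary classification as a function of the positive-class fraction) could be reproduced with a few lines of Python; a sketch, assuming numpy and matplotlib are available:

    import numpy as np
    import matplotlib.pyplot as plt

    p = np.linspace(0.001, 0.999, 500)             # fraction of positive examples
    H = -p * np.log2(p) - (1 - p) * np.log2(1 - p)
    plt.plot(p, H)
    plt.xlabel("p1 (fraction of positive examples)")
    plt.ylabel("Entropy(S)")
    plt.show()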


Page 17: CSC 4510 – Machine Learning


Information Gain
• The information gain of a feature F is the expected reduction in entropy resulting from splitting on this feature:

  Gain(S, F) = Entropy(S) − Σ (v ∈ Values(F)) (|Sv| / |S|)·Entropy(Sv)

  where Sv is the subset of S having value v for feature F.
• Entropy of each resulting subset is weighted by its relative size.
• Example summary
  – <big, red, circle>: +   <small, red, circle>: +
  – <small, red, square>: −   <big, blue, circle>: −

  2+, 2−: E = 1;  split on size:  big → 1+, 1− (E = 1),  small → 1+, 1− (E = 1)
  Gain = 1 − (0.5·1 + 0.5·1) = 0

  2+, 2−: E = 1;  split on color:  red → 2+, 1− (E = 0.918),  blue → 0+, 1− (E = 0)
  Gain = 1 − (0.75·0.918 + 0.25·0) = 0.311

  2+, 2−: E = 1;  split on shape:  circle → 2+, 1− (E = 0.918),  square → 0+, 1− (E = 0)
  Gain = 1 − (0.75·0.918 + 0.25·0) = 0.311
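The three gains above can be checked with a short Python computation (mine, not from the slides; the entropy helper is repeated so the snippet stands alone):

    from collections import Counter
    import math

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values()) if n else 0.0

    def information_gain(examples, labels, feature):
        """Gain(S, F) = Entropy(S) - sum_v |Sv|/|S| * Entropy(Sv)."""
        groups = {}
        for x, y in zip(examples, labels):
            groups.setdefault(x[feature], []).append(y)
        return entropy(labels) - sum(len(ys) / len(labels) * entropy(ys) for ys in groups.values())

    X = [{"size": "big",   "color": "red",  "shape": "circle"},
         {"size": "small", "color": "red",  "shape": "circle"},
         {"size": "small", "color": "red",  "shape": "square"},
         {"size": "big",   "color": "blue", "shape": "circle"}]
    y = ["+", "+", "-", "-"]
    for f in ("size", "color", "shape"):
        print(f, round(information_gain(X, y, f), 3))   # size 0.0, color 0.311, shape 0.311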


Page 18: CSC 4510 – Machine Learning


Bias in Decision-Tree Induction

• Information gain gives a bias for trees with minimal depth.

Recall:
• Bias
  – Any criterion other than consistency with the training data that is used to select a hypothesis.


Page 19: CSC 4510 – Machine Learning


History of Decision-Tree Research
• Hunt and colleagues use exhaustive-search decision-tree methods (CLS) to model human concept learning in the 1960s.
• In the late 1970s, Quinlan develops ID3 with the information gain heuristic to learn expert systems from examples.
• Simultaneously, Breiman, Friedman, and colleagues develop CART (Classification and Regression Trees), similar to ID3.
• In the 1980s, a variety of improvements are introduced to handle noise, continuous features, missing features, and improved splitting criteria. Various expert-system development tools result.
• Quinlan's updated decision-tree package (C4.5) is released in 1993.
• Weka includes a Java version of C4.5 called J48.


Page 20: CSC 4510 – Machine Learning


Additional Decision Tree Issues

• Better splitting criteria
  – Information gain prefers features with many values.
• Continuous features
• Predicting a real-valued function (regression trees)
• Missing feature values
• Features with costs
• Misclassification costs
• Incremental learning
  – ID4
  – ID5
• Mining large databases that do not fit in main memory
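As a side note on the first bullet (this remark is mine, not from the slides, though it is the standard remedy used in C4.5): the many-valued-feature bias can be reduced by dividing information gain by the "split information", giving the gain ratio. A minimal sketch:

    import math

    def split_info(subset_sizes):
        """Entropy of the partition itself: -sum |Sv|/|S| * log2(|Sv|/|S|)."""
        n = sum(subset_sizes)
        return -sum((c / n) * math.log2(c / n) for c in subset_sizes if c)

    def gain_ratio(gain, subset_sizes):
        """C4.5-style normalization of information gain by the split information."""
        si = split_info(subset_sizes)
        return gain / si if si else 0.0

    # Splitting the 4-example set on color gives subsets of sizes 3 (red) and 1 (blue):
    print(round(gain_ratio(0.311, [3, 1]), 3))   # 0.311 / 0.811 ≈ 0.383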

Page 21: CSC 4510 – Machine Learning

Class Exercise
• Practice using decision tree learning on some of the sample datasets available in AIspace.


Some of the slides in this presentation are adapted from:
• Prof. Frank Klassner's ML class at Villanova
• the University of Manchester ML course http://www.cs.manchester.ac.uk/ugt/COMP24111/
• the Stanford online ML course http://www.ml-class.org/

Page 22: CSC 4510 – Machine Learning

Resources: Datasets

• UCI Repository: http://www.ics.uci.edu/~mlearn/MLRepository.html

• UCI KDD Archive: http://kdd.ics.uci.edu/summary.data.application.html

• Statlib: http://lib.stat.cmu.edu/

• Delve: http://www.cs.utoronto.ca/~delve/
