Decision Trees, ID3, C4.5, CART
Umut ORHAN, PhD.
Machine Learning, Week 3
Entropy
Decision Trees
ID3
C4.5
Classification and Regression Trees (CART)
What is a Decision Tree?
Briefly, a decision tree is a data classification procedure that works by asking questions in the right order.
Decision Tree Sample

Weather?
- Sunny → Humidity?
  - High → No
  - Normal → Yes
- Cloudy → Yes
- Rainy → Wind?
  - Strong → No
  - Weak → Yes
Decision Trees

We start with entropy-based classification tree methods such as ID3 and C4.5. But first, we should learn something about entropy, uncertainty, and information.
Entropy and Uncertainty
In probability theory, entropy is a measure of the uncertainty about a random variable.
In information theory, (Shannon) entropy is the expected value (average) of the information in each situation of the random variable.
According to Shannon, the proper choice for a "function to measure information" must be logarithmic.
Uncertainty and Information
The entropy and the uncertainty are maximized if all situations of the random variable have the same probability.
The difference between the prior and posterior probability distributions shows the amount of information gained or lost (Kullback–Leibler divergence).
Accordingly, updating the posteriors of situations with maximum uncertainty is likely to yield the maximum information gain.
Information
In fact, information and uncertainty are like antonyms of each other. But since maximum uncertainty can lead to maximum information, information gain and uncertainty focus on the same concept. Self-information is represented by
$$I(x) = \log \frac{1}{P(x)} = -\log P(x)$$
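For instance, observing one roll of a fair die, where each of the six situations has probability 1/6, yields (taking the logarithm in base 2, so the result is in bits):

$$I(x) = \log_2 6 \approx 2.585 \text{ bits}$$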
Entropy
The Shannon entropy (H) is the expected value of the information over all situations $a_i$ with probabilities $P_i$:
$$H(X) = E[I(X)] = \sum_{i=1}^{n} P(x_i)\,I(x_i) = \sum_{i=1}^{n} P(x_i)\log_2\frac{1}{P(x_i)} = -\sum_{i=1}^{n} P_i \log_2 P_i$$
Entropy
Let X be the random process of a coin toss. Since its two situations have the same probability, the entropy of X is found to be 1, as below.
We can interpret this as follows: after the process, we will win 1 bit of information.
$$H(X) = -\sum_{i=1}^{2} p_i \log_2 p_i = -(0.5\log_2 0.5 + 0.5\log_2 0.5) = 1$$
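As an illustration, here is a minimal Python sketch of the same computation; the function name `entropy` and the use of `collections.Counter` are my own choices, not from the slides.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy H(X) in bits of a list of outcomes."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

# A fair coin toss: two equally likely situations -> 1 bit.
print(entropy(['heads', 'tails']))  # 1.0
```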
Entropy in Decision Tree
A decision tree divides multidimensional data into several sets with a condition on a specified feature.
At each step, deciding which feature of the data to work on, and based on which condition, requires solving a big combinatorial problem.
Entropy in Decision Tree
Even in a dataset with only 20 samples and 5 features, we can construct more than $10^6$ decision trees.
Therefore, each branching should be performed methodically.
Entropy in Decision Tree
According to Quinlan, we should branch the tree so as to gain maximum information.
For this reason, we must find the feature that causes the minimum uncertainty.
In ID3, all branches are examined individually, and the feature causing the minimum uncertainty is preferred for branching the tree.
ID3 Algorithm
It works with only categorical data.
In the first step of each iteration, the entropy of the dataset is computed for all features.
Then the entropy of each feature depending on the class is computed, and it is subtracted from the entropy calculated in the first step.
The feature that supports the maximum information gain is chosen.
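Although the slides do not write the selection rule out explicitly, it is the standard information gain: for a feature V that splits the set S into subsets S_v,

$$IG(V) = H(S) - \sum_{v} \frac{|S_v|}{|S|}\, H(S_v)$$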
ID3 Sample
V1 V2 S
A C E
B C F
B D E
B D F
We have a dataset with 2 features (V1 and V2), a class vector (S), and 4 samples.
Find the first branching feature:
H(S) − H(V1, S) = ?   H(S) − H(V2, S) = ?
At first, the general entropy:
$$H(S) = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = 1$$
Entropy of V1:
$$H(V1) = \tfrac{1}{4}H(A) + \tfrac{3}{4}H(B) = \tfrac{1}{4}\cdot 0 + \tfrac{3}{4}\left(-\tfrac{1}{3}\log_2\tfrac{1}{3} - \tfrac{2}{3}\log_2\tfrac{2}{3}\right) = \tfrac{3}{4}\cdot 0.9183 = 0.6887$$
Entropy of V2:
$$H(V2) = \tfrac{1}{2}H(C) + \tfrac{1}{2}H(D) = \tfrac{1}{2}\cdot 1 + \tfrac{1}{2}\cdot 1 = 1$$
Since H(S) − H(V1, S) = 1 − 0.6887 = 0.3113 while H(S) − H(V2, S) = 0, we choose V1.
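The worked example can be double-checked with a short, self-contained Python sketch (function and variable names are my own, not from the slides):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature, classes):
    """H(S) minus the weighted entropy of the subsets induced by the feature."""
    gain = entropy(classes)
    for v in set(feature):
        subset = [c for f, c in zip(feature, classes) if f == v]
        gain -= len(subset) / len(classes) * entropy(subset)
    return gain

V1 = ['A', 'B', 'B', 'B']
V2 = ['C', 'C', 'D', 'D']
S = ['E', 'F', 'E', 'F']
print(information_gain(V1, S))  # ~0.3113, so V1 is chosen
print(information_gain(V2, S))  # 0.0
```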
C4.5 Algorithm
It is a form of the ID3 algorithm that can be applied to numerical data.
It transforms numerical features into categorical form by a thresholding method.
Different threshold values are tested.
The threshold value that supports the best information gain is selected.
After categorizing the feature vector with the selected threshold, ID3 can be applied directly.
C4.5 Algorithm
To determine the best threshold (see the sketch after this list):
1. All numerical data are sorted (a_1, a_2, …, a_n).
2. The average of each sorted pair is computed as (a_i + a_{i+1})/2.
3. This gives n − 1 average values as candidate thresholds.
4. Each average is tested for branching the data.
5. The average that supports the maximum information gain is chosen.
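A minimal sketch of this threshold search, assuming a numeric feature paired with class labels (the `entropy` helper is repeated so the snippet runs on its own; all names are illustrative):

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_threshold(values, classes):
    """Test the midpoint of each sorted adjacent pair; keep the one with maximum gain."""
    pairs = sorted(zip(values, classes))
    ys = [c for _, c in pairs]
    n = len(ys)
    best_t, best_gain = None, -1.0
    for i in range(n - 1):
        t = (pairs[i][0] + pairs[i + 1][0]) / 2   # average of a sorted pair
        left, right = ys[:i + 1], ys[i + 1:]      # binary split at threshold t
        gain = entropy(ys) - len(left) / n * entropy(left) \
                           - len(right) / n * entropy(right)
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

print(best_threshold([1.0, 2.0, 3.0, 10.0], ['E', 'E', 'F', 'F']))  # (2.5, 1.0)
```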
Lost Data
If some samples are lost, there are two ways to proceed:
1. Samples with lost features are completely removed from the dataset.
2. The algorithm is arranged to operate with lost data.
If the number of lost samples is too large, the second option should be applied.
Lost Data
When calculating the information gain for a feature vector with lost data, the information gain is calculated by excluding the lost samples, and then it is multiplied by the coefficient F. In Quinlan's formulation, F is the ratio of samples with known (non-lost) values in the dataset:
$$IG(X) = F\cdot\left(H(X) - H(V,X)\right)$$
Lost Data
Another proposed method is to write the most frequent value of the feature vector with lost data into the lost cells, as in the example below.
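With pandas, for example, this could look as follows (the DataFrame and the column name 'V1' are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({'V1': ['A', 'B', None, 'B']})
# Write the most frequent value of the feature into the lost cells.
df['V1'] = df['V1'].fillna(df['V1'].mode()[0])
print(df['V1'].tolist())  # ['A', 'B', 'B', 'B']
```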
Overfitting
As in other machine learning methods, we should avoid overfitting in decision trees as well.
If precautions are not taken, all decision trees overfit.
Therefore, trees should be pruned during or after creation.
Pruning
Pruning is the process of removing from the decision tree the parts that are needless for classification.
In this way, the decision tree becomes both simpler and more understandable.
There are two types of pruning methods:
pre-pruning
post-pruning
Pre-Pruning
Pre-pruning is done during the creation of the tree.
If the ratio of the divided feature's values is less than a threshold value (fault tolerance), the partitioning process is stopped, as sketched below.
At that moment a leaf is created, which carries the label of the dominant class of the available dataset.
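A sketch of that stopping test, under my reading (consistent with the pruning sample below) that partitioning stops when the minority-class ratio falls below the fault tolerance:

```python
from collections import Counter

def should_stop(labels, fault_tolerance=0.33):
    """Return (stop?, label of the dominant class)."""
    top_label, top_count = Counter(labels).most_common(1)[0]
    minority_ratio = 1 - top_count / len(labels)
    return minority_ratio < fault_tolerance, top_label

# 15 "Yes" vs 35 "No": the "Yes" ratio is 0.30 < 0.33, so a "No" leaf is created.
print(should_stop(['Yes'] * 15 + ['No'] * 35))  # (True, 'No')
```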
Post-Pruning
Post-pruning is done after the creation of the tree by
creating a leaf instead of a sub-tree,
raising a sub-tree,
and removing some branches.
Pruning Sample
If the fault tolerance is 33%, the "Yes" ratio at the sub-tree of the "Humidity" node is computed as 30% (15 of 50 samples). Therefore, instead of the "Humidity" node, a "No" leaf is created.
Weather?
- Sunny → Humidity?
  - High → No (35)
  - Normal → Yes (15)
- Cloudy → Yes (15)
- Rainy → Wind?
  - Strong → No (25)
  - Weak → Yes (20)
Pruning Sample
After pruning:

Weather?
- Sunny → No
- Cloudy → Yes
- Rainy → Wind?
  - Strong → No
  - Weak → Yes
Classification and Regression Trees (CART)
Briefly, CART trees divide each node into 2 branches. The best-known methods are:
Twoing algorithm
Gini algorithm
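As context for the Gini algorithm, here is a minimal sketch of the Gini impurity that CART minimizes when choosing a binary split (the standard definition, not taken from the slides):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini(['E', 'E']))            # 0.0 -> a pure node
print(gini(['E', 'F', 'E', 'F']))  # 0.5 -> maximum impurity for two classes
```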
MATLAB Application
>edit C4_5_ornek.m
Students should study this sample by using the given datasets.
Presentation Task
You can choose one of the two CART algorithms listed below.
Twoing
Gini