Decision Trees, ID3, C4.5, CART
Umut ORHAN, PhD.
Machine Learning, Week 3
Entropy
Decision Trees
ID3
C4.5
Classification and Regression Trees (CART)
What is a Decision Tree?
Briefly, a decision tree is a data classification procedure that works by asking questions in the right order.
Decision Tree Sample

Weather?
- Sunny → Humidity?
  - High → No
  - Normal → Yes
- Cloudy → Yes
- Rainy → Wind?
  - Strong → No
  - Weak → Yes
Decision Trees

We start with entropy-based classification tree methods such as ID3 and C4.5. But first, we should learn something about entropy, uncertainty, and information.
Entropy and Uncertainty
In probability theory, entropy is a measure of the uncertainty about a random variable.
In information theory, (Shannon) entropy is the expected value (average) of the information in each situation of the random variable.
According to Shannon, the proper choice for a "function to measure information" must be logarithmic.
Uncertainty and Information
The entropy and the uncertainty are maximized if all situations of the random variable have the same probability.
The difference between the prior and posterior probability distributions shows the amount of information gained or lost (Kullback–Leibler divergence).
Accordingly, updating the posteriors of situations with maximum uncertainty is likely to yield the maximum information gain.
Information
In fact, information and uncertainty are like antonyms of each other. But since maximum uncertainty can lead to maximum information, information gain and uncertainty focus on the same concept. Self-information is represented by
$$I(x) = \log \frac{1}{P(x)} = -\log P(x)$$
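For instance, observing one roll of a fair die, where each of the six situations has probability 1/6, yields (taking the logarithm in base 2, so the result is in bits):

$$I(x) = \log_2 6 \approx 2.585 \text{ bits}$$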
Entropy
The Shannon entropy (H) is the expected value of the information over all situations $a_i$ with probabilities $P_i$:
$$H(X) = E[I(X)] = \sum_{i=1}^{n} P(x_i)\,I(x_i) = \sum_{i=1}^{n} P(x_i)\log_2\frac{1}{P(x_i)} = -\sum_{i=1}^{n} P_i \log_2 P_i$$
Entropy
Let X be the random process of a coin toss. Since its two situations have the same probability, the entropy of X is found to be 1, as below.
We can interpret this as follows: after the process, we will win 1 bit of information.
$$H(X) = -\sum_{i=1}^{2} p_i \log_2 p_i = -(0.5\log_2 0.5 + 0.5\log_2 0.5) = 1$$
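As an illustration, here is a minimal Python sketch of the same computation; the function name `entropy` and the use of `collections.Counter` are my own choices, not from the slides.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy H(X) in bits of a list of outcomes."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

# A fair coin toss: two equally likely situations -> 1 bit.
print(entropy(['heads', 'tails']))  # 1.0
```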
Entropy in Decision Tree
A decision tree divides multidimensional data into several sets with a condition on a specified feature.
At each step, deciding which feature of the data to work on, and based on which condition, requires solving a big combinatorial problem.
Entropy in Decision Tree
Even in a dataset with only 20 samples and 5 features, we can construct more than $10^6$ decision trees.
Therefore, each branching should be performed methodically.
Entropy in Decision Tree
According to Quinlan, we should branch the tree so as to gain maximum information.
For this reason, we must find the feature that causes the minimum uncertainty.
In ID3, all branches are examined individually, and the feature causing the minimum uncertainty is preferred for branching the tree.
ID3 Algorithm
It works with only categorical data.
In the first step of each iteration, the entropy of the dataset is computed for all features.
Then the entropy of each feature depending on the class is computed, and it is subtracted from the entropy calculated in the first step.
The feature that supports the maximum information gain is chosen.
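Although the slides do not write the selection rule out explicitly, it is the standard information gain: for a feature V that splits the set S into subsets S_v,

$$IG(V) = H(S) - \sum_{v} \frac{|S_v|}{|S|}\, H(S_v)$$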
ID3 Sample
V1 V2 S
A C E
B C F
B D E
B D F
We have a dataset with 2 features (V1 and V2), a class vector (S), and 4 samples.
Find the first branching feature:
H(S) − H(V1, S) = ?   H(S) − H(V2, S) = ?
At first, the general entropy:
$$H(S) = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = 1$$
Entropy of V1:
$$H(V1) = \tfrac{1}{4}H(A) + \tfrac{3}{4}H(B) = \tfrac{1}{4}\cdot 0 + \tfrac{3}{4}\left(-\tfrac{1}{3}\log_2\tfrac{1}{3} - \tfrac{2}{3}\log_2\tfrac{2}{3}\right) = \tfrac{3}{4}\cdot 0.9183 = 0.6887$$
Entropy of V2:
$$H(V2) = \tfrac{1}{2}H(C) + \tfrac{1}{2}H(D) = \tfrac{1}{2}\cdot 1 + \tfrac{1}{2}\cdot 1 = 1$$
Since H(S) − H(V1, S) = 1 − 0.6887 = 0.3113 while H(S) − H(V2, S) = 0, we choose V1.
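The worked example can be double-checked with a short, self-contained Python sketch (function and variable names are my own, not from the slides):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature, classes):
    """H(S) minus the weighted entropy of the subsets induced by the feature."""
    gain = entropy(classes)
    for v in set(feature):
        subset = [c for f, c in zip(feature, classes) if f == v]
        gain -= len(subset) / len(classes) * entropy(subset)
    return gain

V1 = ['A', 'B', 'B', 'B']
V2 = ['C', 'C', 'D', 'D']
S = ['E', 'F', 'E', 'F']
print(information_gain(V1, S))  # ~0.3113, so V1 is chosen
print(information_gain(V2, S))  # 0.0
```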
C4.5 Algorithm
It is a form of the ID3 algorithm that can be applied to numerical data.
It transforms numerical features into categorical form by a thresholding method.
Different threshold values are tested.
The threshold value that supports the best information gain is selected.
After categorizing the feature vector with the selected threshold, ID3 can be applied directly.
C4.5 Algorithm
To determine the best threshold (see the sketch after this list):
1. All numerical data are sorted (a_1, a_2, …, a_n).
2. The average of each sorted pair is computed as (a_i + a_{i+1})/2.
3. This gives n − 1 average values as candidate thresholds.
4. Each average is tested for branching the data.
5. The average that supports the maximum information gain is chosen.
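A minimal sketch of this threshold search, assuming a numeric feature paired with class labels (the `entropy` helper is repeated so the snippet runs on its own; all names are illustrative):

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_threshold(values, classes):
    """Test the midpoint of each sorted adjacent pair; keep the one with maximum gain."""
    pairs = sorted(zip(values, classes))
    ys = [c for _, c in pairs]
    n = len(ys)
    best_t, best_gain = None, -1.0
    for i in range(n - 1):
        t = (pairs[i][0] + pairs[i + 1][0]) / 2   # average of a sorted pair
        left, right = ys[:i + 1], ys[i + 1:]      # binary split at threshold t
        gain = entropy(ys) - len(left) / n * entropy(left) \
                           - len(right) / n * entropy(right)
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

print(best_threshold([1.0, 2.0, 3.0, 10.0], ['E', 'E', 'F', 'F']))  # (2.5, 1.0)
```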
Lost Data
If some samples are lost, there are two ways to proceed:
1. Samples with lost features are completely removed from the dataset.
2. The algorithm is arranged to operate with lost data.
If the number of lost samples is too large, the second option should be applied.
Lost Data
When calculating the information gain for a feature vector with lost data, the information gain is calculated by excluding the lost samples, and then it is multiplied by the coefficient F. In Quinlan's formulation, F is the ratio of samples with known (non-lost) values in the dataset:
$$IG(X) = F\cdot\left(H(X) - H(V,X)\right)$$
Lost Data
Another proposed method is to write the most frequent value of the feature vector with lost data into the lost cells, as in the example below.
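With pandas, for example, this could look as follows (the DataFrame and the column name 'V1' are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({'V1': ['A', 'B', None, 'B']})
# Write the most frequent value of the feature into the lost cells.
df['V1'] = df['V1'].fillna(df['V1'].mode()[0])
print(df['V1'].tolist())  # ['A', 'B', 'B', 'B']
```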
Overfitting
As in other machine learning methods, we should avoid overfitting in decision trees as well.
If precautions are not taken, all decision trees overfit.
Therefore, trees should be pruned during or after creation.
Pruning
Pruning is the process of removing from the decision tree the parts that are needless for classification.
In this way, the decision tree becomes both simpler and more understandable.
There are two types of pruning methods:
pre-pruning
post-pruning
Pre-Pruning
Pre-pruning is done during the creation of the tree.
If the ratio of the divided feature's values is less than a threshold value (fault tolerance), the partitioning process is stopped, as sketched below.
At that moment a leaf is created, which carries the label of the dominant class of the available dataset.
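A sketch of that stopping test, under my reading (consistent with the pruning sample below) that partitioning stops when the minority-class ratio falls below the fault tolerance:

```python
from collections import Counter

def should_stop(labels, fault_tolerance=0.33):
    """Return (stop?, label of the dominant class)."""
    top_label, top_count = Counter(labels).most_common(1)[0]
    minority_ratio = 1 - top_count / len(labels)
    return minority_ratio < fault_tolerance, top_label

# 15 "Yes" vs 35 "No": the "Yes" ratio is 0.30 < 0.33, so a "No" leaf is created.
print(should_stop(['Yes'] * 15 + ['No'] * 35))  # (True, 'No')
```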
Post-Pruning
Post-pruning is done after the creation of the tree by
creating a leaf instead of a sub-tree,
raising a sub-tree,
and removing some branches.
Pruning Sample
If the fault tolerance is 33%, the "Yes" ratio at the sub-tree of the "Humidity" node is computed as 30% (15 of 50 samples). Therefore, instead of the "Humidity" node, a "No" leaf is created.
Weather?
- Sunny → Humidity?
  - High → No (35)
  - Normal → Yes (15)
- Cloudy → Yes (15)
- Rainy → Wind?
  - Strong → No (25)
  - Weak → Yes (20)
Pruning Sample
After pruning:

Weather?
- Sunny → No
- Cloudy → Yes
- Rainy → Wind?
  - Strong → No
  - Weak → Yes
Classification and Regression Trees (CART)
Briefly, CART trees divide each node into 2 branches. The best-known methods are:
Twoing algorithm
Gini algorithm
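As context for the Gini algorithm, here is a minimal sketch of the Gini impurity that CART minimizes when choosing a binary split (the standard definition, not taken from the slides):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini(['E', 'E']))            # 0.0 -> a pure node
print(gini(['E', 'F', 'E', 'F']))  # 0.5 -> maximum impurity for two classes
```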
MATLAB Application
>edit C4_5_ornek.m
Students should study this sample by using the given datasets.
Presentation Task
You can choose one of the two CART algorithms listed below.
Twoing
Gini