8/8/2019 1__dt
1/36
Decision Trees
SLIQ: A Fast Scalable Classifier
Group 12 - Vaibhav Chopda, Tarun Bahadur
Agenda
- What is classification?
- Why decision trees?
- The ID3 algorithm
- Limitations of the ID3 algorithm
- SLIQ: a fast scalable classifier for data mining
- SPRINT: the successor of SLIQ
Classification Process: Model Construction

Training Data:

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

A classification algorithm builds the classifier (model) from the training data, e.g.:

IF rank = professor OR years > 6 THEN tenured = yes

CSE634 course notes, Prof. Anita Wasilewska
Testing and Prediction (by a classifier)

Testing Data:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen Data: (Jeff, Professor, 4): Tenured?

CSE634 course notes, Prof. Anita Wasilewska
Classification by Decision Tree Induction

Decision tree (tuples flow along the tree structure):
- An internal node denotes an attribute
- A branch represents a value of the node's attribute
- Leaf nodes represent class labels or class distributions
CSE634 course notes P rof. Anita Wasilewska
Classification by Decision Tree Induction

Decision tree generation consists of two phases:

1. Tree construction
   - At the start, we choose one attribute as the root and put all its values as branches.
   - We recursively choose internal nodes (attributes) with their proper values as branches.
   - We stop when:
     - all the samples (records) at a node are of the same class; the node then becomes a leaf labeled with that class, or
     - there are no samples left, or
     - there are no new attributes left to use as nodes; in this case we apply MAJORITY VOTING to classify the node.
2. Tree pruning
   - Identify and remove branches that reflect noise or outliers.
CSE634 course notes P rof. Anita Wasilewska
Classification by Decision Tree Induction

Where is the challenge? A good choice of the root attribute, and of the internal node attributes, is the crucial point.

Decision tree induction algorithms differ in how they evaluate and choose the root and internal node attributes.
CSE634 course notes P rof. Anita Wasilewska
Basic Idea of the ID3/C4.5 Algorithm
- A greedy algorithm: constructs decision trees in a top-down, recursive, divide-and-conquer manner.
- The tree STARTS as a single node (root) representing the whole training dataset (samples).
- IF the samples are ALL in the same class, THEN the node becomes a LEAF and is labeled with that class.
- OTHERWISE, the algorithm uses an entropy-based measure, known as information gain, as a heuristic for selecting the ATTRIBUTE that will BEST separate the samples into individual classes. This attribute becomes the node name (the test, or tree-split decision attribute).
CSE634 course notes P rof. Anita Wasilewska
Basic Idea of the ID3/C4.5 Algorithm (2)
- A branch is created for each value of the node attribute (and is labeled by this value; this is syntax), and the samples are partitioned accordingly (this is semantics; see the example that follows).
- The algorithm uses the same process recursively to form a decision tree at each partition. Once an attribute has occurred at a node, it need not be considered in any of the node's descendants.
- The recursive partitioning STOPS only when one of the following conditions is true:
  - All records (samples) at the given node belong to the same class, or
  - There are no remaining attributes on which the records (samples) may be further partitioned; in this case we convert the given node into a LEAF and label it with the majority class among its samples (majority voting), or
  - There are no records (samples) left; a leaf is created with the majority vote for the training samples.

CSE634 course notes, Prof. Anita Wasilewska
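The entropy-based attribute selection described above can be sketched in Python. This is a minimal illustration under assumptions, not the original ID3 implementation: the helper names (`entropy`, `information_gain`) and the dict-based record format are chosen for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(records, attr, target):
    """Entropy reduction obtained by partitioning `records` on `attr`."""
    base = entropy([r[target] for r in records])
    n = len(records)
    remainder = 0.0
    for value in {r[attr] for r in records}:
        subset = [r[target] for r in records if r[attr] == value]
        remainder += len(subset) / n * entropy(subset)
    return base - remainder

# The tenured example from the earlier slides:
training = [
    {"rank": "Assistant Prof", "years": 3, "tenured": "no"},
    {"rank": "Assistant Prof", "years": 7, "tenured": "yes"},
    {"rank": "Professor",      "years": 2, "tenured": "yes"},
    {"rank": "Associate Prof", "years": 7, "tenured": "yes"},
    {"rank": "Assistant Prof", "years": 6, "tenured": "no"},
    {"rank": "Associate Prof", "years": 3, "tenured": "no"},
]
gain_rank = information_gain(training, "rank", "tenured")  # positive: rank helps separate the classes
```

ID3 would compute such a gain for every candidate attribute at a node and split on the attribute with the highest value.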
Example from Professor Anita's slides: a training table with the attributes age, income, student, and credit_rating. (The table rows were garbled in conversion and are omitted.)
Shortcomings of ID3
- Scalability: requires a lot of computation at every stage of decision tree construction.
- Scalability: needs all the training data to fit in memory.
- It does not suggest any standard splitting index for range (numeric) attributes.
SLIQ - A Decision Tree Classifier

Features of SLIQ:
- Applies to both numeric and categorical attributes
- Builds compact and accurate trees
- Uses a pre-sorting technique in the tree-growing phase and an inexpensive pruning algorithm
- Suitable for classification of large disk-resident datasets, independently of the number of classes, attributes, and records
SLIQ Methodology:

Start -> Generate an attribute list for each attribute -> Sort the attribute lists for NUMERIC attributes -> Create the decision tree by partitioning records -> End
Example:

Driver's Age  CarType  Class
23            Family   HIGH
17            Sports   HIGH
43            Sports   HIGH
68            Family   LOW
32            Truck    LOW
20            Family   HIGH
Attribute Listing Phase:

Original table:

RecId  Age  CarType  Class
0      23   Family   HIGH
1      17   Sports   HIGH
2      43   Sports   HIGH
3      68   Family   LOW
4      32   Truck    LOW
5      20   Family   HIGH

Age attribute list (NUMERIC attribute):

Age  Class  RecId
23   HIGH   0
17   HIGH   1
43   HIGH   2
68   LOW    3
32   LOW    4
20   HIGH   5

CarType attribute list (CATEGORICAL attribute):

CarType  Class  RecId
Family   HIGH   0
Sports   HIGH   1
Sports   HIGH   2
Family   LOW    3
Truck    LOW    4
Family   HIGH   5
Presorting Phase:

Sorted Age attribute list:

Age  Class  RecId
17   HIGH   1
20   HIGH   5
23   HIGH   0
32   LOW    4
43   HIGH   2
68   LOW    3

CarType attribute list (unchanged):

CarType  Class  RecId
Family   HIGH   0
Sports   HIGH   1
Sports   HIGH   2
Family   LOW    3
Truck    LOW    4
Family   HIGH   5

Only NUMERIC attributes are sorted; CATEGORICAL attributes need not be sorted.
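The attribute-listing and presorting phases above can be sketched as follows. The helper name `build_attribute_lists` and the dict-based record format are assumptions for the example; SLIQ itself keeps these lists on disk for large datasets.

```python
def build_attribute_lists(records, numeric_attrs):
    """Build one (value, class, rec_id) list per attribute.

    Lists for attributes named in `numeric_attrs` are sorted by value
    once, up front (SLIQ's presorting), so the tree-growing phase never
    needs to re-sort them.
    """
    attrs = [a for a in records[0] if a != "Class"]
    lists = {}
    for attr in attrs:
        entries = [(r[attr], r["Class"], i) for i, r in enumerate(records)]
        if attr in numeric_attrs:
            entries.sort(key=lambda e: e[0])  # presort numeric attribute list
        lists[attr] = entries
    return lists

# The driver example from the earlier slides:
records = [
    {"Age": 23, "CarType": "Family", "Class": "HIGH"},
    {"Age": 17, "CarType": "Sports", "Class": "HIGH"},
    {"Age": 43, "CarType": "Sports", "Class": "HIGH"},
    {"Age": 68, "CarType": "Family", "Class": "LOW"},
    {"Age": 32, "CarType": "Truck",  "Class": "LOW"},
    {"Age": 20, "CarType": "Family", "Class": "HIGH"},
]
lists = build_attribute_lists(records, numeric_attrs={"Age"})
# lists["Age"] is sorted by age; lists["CarType"] keeps the original record order
```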
Constructing the decision tree
Constructing the decision tree:
- (Block 20) For each leaf node being examined, the method determines a split test that best separates the records at that node, using the attribute lists (block 21).
- (Block 22) The records at the examined leaf node are partitioned according to the best split test at that node to form new leaf nodes, which are also child nodes of the examined node.
- The records at each new leaf node are checked (block 23) to see whether they are all of the same class. If not, the splitting process is repeated, starting with block 24, for each newly formed leaf node until each leaf node contains records from one class.
- In finding the best split test (or split point) at a leaf node, a splitting index corresponding to the criterion used for splitting the records may be used to evaluate possible splits. This splitting index indicates how well the criterion separates the record classes. The splitting index is preferably a gini index.
Gini Index

The gini index is used to evaluate the goodness of the alternative splits for an attribute.

If a data set T contains examples from n classes, gini(T) is defined as

    gini(T) = 1 - sum_j (p_j)^2

where p_j is the relative frequency of class j in the data set T.

After splitting T into two subsets T1 and T2, with N1 and N2 records out of N respectively, the gini index of the split data is defined as

    gini_split(T) = (N1/N) * gini(T1) + (N2/N) * gini(T2)
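The two formulas above can be sketched directly in Python; the function names `gini` and `gini_split` are chosen for this example.

```python
from collections import Counter

def gini(labels):
    """gini(T) = 1 - sum_j (p_j)^2, where p_j is the relative
    frequency of class j among `labels`."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(left, right):
    """Weighted gini of a binary split:
    (N1/N) * gini(T1) + (N2/N) * gini(T2)."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)
```

A pure partition gives gini 0, and a 50/50 two-class partition gives the maximum of 0.5, which is why lower gini_split values mean better splits.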
Gini Index: The Preferred Splitting Index

Advantage of the gini index: its calculation requires only the distribution of the class values in each record partition.

To find the best split point for a node, the node's attribute lists are scanned to evaluate the candidate splits for each attribute. The attribute containing the split point with the lowest gini index value is used for splitting the node's records.

The splitting test follows (next slide); its flow chart fits into block 21 of the decision tree construction.
Numeric Attributes: Splitting Index
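The original slide's flow chart did not survive conversion, but the scan it describes can be sketched as follows: walk the presorted attribute list once, evaluate the test `attr <= v` at every boundary, and keep the split with the lowest gini index. The helper name `best_numeric_split` is an assumption for the example.

```python
from collections import Counter

def gini(labels):
    """gini(T) = 1 - sum_j (p_j)^2 over the class frequencies."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_numeric_split(sorted_entries):
    """Scan a presorted (value, class, rec_id) attribute list and return
    (split_value, gini_of_split) for the best test `attr <= split_value`."""
    labels = [cls for _, cls, _ in sorted_entries]
    n = len(labels)
    best_value, best_score = None, float("inf")
    for i in range(1, n):
        # candidate split: attr <= value of entry i-1
        score = i / n * gini(labels[:i]) + (n - i) / n * gini(labels[i:])
        if score < best_score:
            best_value, best_score = sorted_entries[i - 1][0], score
    return best_value, best_score

# The sorted Age list from the presorting slide:
age_list = [(17, "HIGH", 1), (20, "HIGH", 5), (23, "HIGH", 0),
            (32, "LOW", 4), (43, "HIGH", 2), (68, "LOW", 3)]
best_value, best_score = best_numeric_split(age_list)  # best test: Age <= 23
```

Because the list is presorted, each candidate boundary is examined in a single pass; SLIQ maintains running class histograms to avoid the repeated recounting this sketch does for clarity.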
Splitting for Categorical Attributes
Determining the Subset with the Highest Index

The logic for finding the best subset substitutes for block 39. A greedy algorithm may be used here.
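One way to sketch that greedy algorithm: start from the empty set and repeatedly add the categorical value that most lowers the gini index of the split `value in S`, stopping when no addition improves it. The function name `greedy_best_subset` and the (value, class, rec_id) tuple format are assumptions for the example.

```python
from collections import Counter

def gini(labels):
    """gini(T) = 1 - sum_j (p_j)^2 over the class frequencies."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def weighted_gini(left, right):
    """(N1/N) * gini(T1) + (N2/N) * gini(T2) for a binary split."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def greedy_best_subset(entries):
    """Greedily grow the value subset S minimizing the gini index of the
    split `value in S`; `entries` are (value, class, rec_id) tuples."""
    values = {v for v, _, _ in entries}
    subset, best = set(), float("inf")
    while True:
        best_add, best_add_score = None, best
        for v in values - subset:
            cand = subset | {v}
            left = [c for val, c, _ in entries if val in cand]
            right = [c for val, c, _ in entries if val not in cand]
            score = weighted_gini(left, right)
            if score < best_add_score:
                best_add, best_add_score = v, score
        if best_add is None:  # no single addition improves the split
            return subset, best
        subset.add(best_add)
        best = best_add_score

# The CarType list from the attribute-listing slide:
cartype_list = [("Family", "HIGH", 0), ("Sports", "HIGH", 1),
                ("Sports", "HIGH", 2), ("Family", "LOW", 3),
                ("Truck", "LOW", 4), ("Family", "HIGH", 5)]
best_subset, best_subset_score = greedy_best_subset(cartype_list)
```

The greedy search avoids enumerating all 2^k subsets of a k-valued attribute, at the cost of possibly missing the global optimum.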
The Decision Tree Being Constructed (Level 0)
Decision tree (level 1)
The classification at level 1
Performance:
Performance: Classification Accuracy
Performance: Decision Tree Size
Performance: Execution Time
Performance: Scalability
Conclusion:
THANK YOU !!!