8/8/2019 1__dt
1/36
Decision Trees
SLIQ: A Fast Scalable Classifier
Group 12 - Vaibhav Chopda, Tarun Bahadur
Agenda
- What is classification?
- Why decision trees?
- The ID3 algorithm
- Limitations of the ID3 algorithm
- SLIQ: a fast scalable classifier for data mining
- SPRINT: the successor of SLIQ
Classification Process: Model Construction

Training Data:

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

A classification algorithm builds the classifier (model) from the training data, e.g.:

IF rank = professor OR years > 6 THEN tenured = yes

CSE634 course notes, Prof. Anita Wasilewska
Testing and Prediction (by a classifier)

Testing Data:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen Data: (Jeff, Professor, 4): Tenured?

CSE634 course notes, Prof. Anita Wasilewska
Classification by Decision Tree Induction

Decision tree (tuples flow along the tree structure):
- An internal node denotes an attribute
- A branch represents a value of the node's attribute
- Leaf nodes represent class labels or class distributions
CSE634 course notes P rof. Anita Wasilewska
Classification by Decision Tree Induction

Decision tree generation consists of two phases:

1. Tree construction
   - At the start, we choose one attribute as the root and put all its values as branches.
   - We recursively choose internal nodes (attributes) with their proper values as branches.
   - We stop when:
     - all the samples (records) at a node are of the same class; the node then becomes a leaf labeled with that class, or
     - there are no samples left, or
     - there are no new attributes left to use as nodes; in this case we apply MAJORITY VOTING to classify the node.
2. Tree pruning
   - Identify and remove branches that reflect noise or outliers.
CSE634 course notes P rof. Anita Wasilewska
Classification by Decision Tree Induction

Where is the challenge? A good choice of the root attribute, and of the internal node attributes, is the crucial point.

Decision tree induction algorithms differ in how they evaluate and choose the root and internal node attributes.
CSE634 course notes P rof. Anita Wasilewska
Basic Idea of the ID3/C4.5 Algorithm
- A greedy algorithm: constructs decision trees in a top-down, recursive, divide-and-conquer manner.
- The tree STARTS as a single node (root) representing the whole training dataset (samples).
- IF the samples are ALL in the same class, THEN the node becomes a LEAF and is labeled with that class.
- OTHERWISE, the algorithm uses an entropy-based measure, known as information gain, as a heuristic for selecting the ATTRIBUTE that will BEST separate the samples into individual classes. This attribute becomes the node name (the test, or tree-split decision attribute).
CSE634 course notes P rof. Anita Wasilewska
Basic Idea of the ID3/C4.5 Algorithm (2)
- A branch is created for each value of the node attribute (and is labeled by this value; this is syntax), and the samples are partitioned accordingly (this is semantics; see the example that follows).
- The algorithm uses the same process recursively to form a decision tree at each partition. Once an attribute has occurred at a node, it need not be considered in any of the node's descendants.
- The recursive partitioning STOPS only when one of the following conditions is true:
  - All records (samples) at the given node belong to the same class, or
  - There are no remaining attributes on which the records (samples) may be further partitioned; in this case we convert the given node into a LEAF and label it with the majority class among its samples (majority voting), or
  - There are no records (samples) left; a leaf is created with the majority vote for the training samples.

CSE634 course notes, Prof. Anita Wasilewska
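The entropy-based attribute selection described above can be sketched in Python. This is a minimal illustration under assumptions, not the original ID3 implementation: the helper names (`entropy`, `information_gain`) and the dict-based record format are chosen for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(records, attr, target):
    """Entropy reduction obtained by partitioning `records` on `attr`."""
    base = entropy([r[target] for r in records])
    n = len(records)
    remainder = 0.0
    for value in {r[attr] for r in records}:
        subset = [r[target] for r in records if r[attr] == value]
        remainder += len(subset) / n * entropy(subset)
    return base - remainder

# The tenured example from the earlier slides:
training = [
    {"rank": "Assistant Prof", "years": 3, "tenured": "no"},
    {"rank": "Assistant Prof", "years": 7, "tenured": "yes"},
    {"rank": "Professor",      "years": 2, "tenured": "yes"},
    {"rank": "Associate Prof", "years": 7, "tenured": "yes"},
    {"rank": "Assistant Prof", "years": 6, "tenured": "no"},
    {"rank": "Associate Prof", "years": 3, "tenured": "no"},
]
gain_rank = information_gain(training, "rank", "tenured")  # positive: rank helps separate the classes
```

ID3 would compute such a gain for every candidate attribute at a node and split on the attribute with the highest value.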
Example from Professor Anita's slides: a training table with the attributes age, income, student, and credit_rating. (The table rows were garbled in conversion and are omitted.)
Shortcomings of ID3
- Scalability: requires a lot of computation at every stage of decision tree construction.
- Scalability: needs all the training data to fit in memory.
- It does not suggest any standard splitting index for range (numeric) attributes.
SLIQ - A Decision Tree Classifier

Features of SLIQ:
- Applies to both numeric and categorical attributes
- Builds compact and accurate trees
- Uses a pre-sorting technique in the tree-growing phase and an inexpensive pruning algorithm
- Suitable for classification of large disk-resident datasets, independently of the number of classes, attributes, and records
SLIQ Methodology:

Start -> Generate an attribute list for each attribute -> Sort the attribute lists for NUMERIC attributes -> Create the decision tree by partitioning records -> End
Example:

Driver's Age  CarType  Class
23            Family   HIGH
17            Sports   HIGH
43            Sports   HIGH
68            Family   LOW
32            Truck    LOW
20            Family   HIGH
Attribute Listing Phase:

Original table:

RecId  Age  CarType  Class
0      23   Family   HIGH
1      17   Sports   HIGH
2      43   Sports   HIGH
3      68   Family   LOW
4      32   Truck    LOW
5      20   Family   HIGH

Age attribute list (NUMERIC attribute):

Age  Class  RecId
23   HIGH   0
17   HIGH   1
43   HIGH   2
68   LOW    3
32   LOW    4
20   HIGH   5

CarType attribute list (CATEGORICAL attribute):

CarType  Class  RecId
Family   HIGH   0
Sports   HIGH   1
Sports   HIGH   2
Family   LOW    3
Truck    LOW    4
Family   HIGH   5
Presorting Phase:

Sorted Age attribute list:

Age  Class  RecId
17   HIGH   1
20   HIGH   5
23   HIGH   0
32   LOW    4
43   HIGH   2
68   LOW    3

CarType attribute list (unchanged):

CarType  Class  RecId
Family   HIGH   0
Sports   HIGH   1
Sports   HIGH   2
Family   LOW    3
Truck    LOW    4
Family   HIGH   5

Only NUMERIC attributes are sorted; CATEGORICAL attributes need not be sorted.
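The attribute-listing and presorting phases above can be sketched as follows. The helper name `build_attribute_lists` and the dict-based record format are assumptions for the example; SLIQ itself keeps these lists on disk for large datasets.

```python
def build_attribute_lists(records, numeric_attrs):
    """Build one (value, class, rec_id) list per attribute.

    Lists for attributes named in `numeric_attrs` are sorted by value
    once, up front (SLIQ's presorting), so the tree-growing phase never
    needs to re-sort them.
    """
    attrs = [a for a in records[0] if a != "Class"]
    lists = {}
    for attr in attrs:
        entries = [(r[attr], r["Class"], i) for i, r in enumerate(records)]
        if attr in numeric_attrs:
            entries.sort(key=lambda e: e[0])  # presort numeric attribute list
        lists[attr] = entries
    return lists

# The driver example from the earlier slides:
records = [
    {"Age": 23, "CarType": "Family", "Class": "HIGH"},
    {"Age": 17, "CarType": "Sports", "Class": "HIGH"},
    {"Age": 43, "CarType": "Sports", "Class": "HIGH"},
    {"Age": 68, "CarType": "Family", "Class": "LOW"},
    {"Age": 32, "CarType": "Truck",  "Class": "LOW"},
    {"Age": 20, "CarType": "Family", "Class": "HIGH"},
]
lists = build_attribute_lists(records, numeric_attrs={"Age"})
# lists["Age"] is sorted by age; lists["CarType"] keeps the original record order
```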
Constructing the decision tree
Constructing the decision tree:
- (Block 20) For each leaf node being examined, the method determines a split test that best separates the records at that node, using the attribute lists (block 21).
- (Block 22) The records at the examined leaf node are partitioned according to the best split test at that node to form new leaf nodes, which are also child nodes of the examined node.
- The records at each new leaf node are checked (block 23) to see whether they are all of the same class. If not, the splitting process is repeated, starting with block 24, for each newly formed leaf node until each leaf node contains records from one class.
- In finding the best split test (or split point) at a leaf node, a splitting index corresponding to the criterion used for splitting the records may be used to evaluate possible splits. This splitting index indicates how well the criterion separates the record classes. The splitting index is preferably a gini index.
Gini Index

The gini index is used to evaluate the goodness of the alternative splits for an attribute.

If a data set T contains examples from n classes, gini(T) is defined as

    gini(T) = 1 - sum_j (p_j)^2

where p_j is the relative frequency of class j in the data set T.

After splitting T into two subsets T1 and T2, with N1 and N2 records out of N respectively, the gini index of the split data is defined as

    gini_split(T) = (N1/N) * gini(T1) + (N2/N) * gini(T2)
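The two formulas above can be sketched directly in Python; the function names `gini` and `gini_split` are chosen for this example.

```python
from collections import Counter

def gini(labels):
    """gini(T) = 1 - sum_j (p_j)^2, where p_j is the relative
    frequency of class j among `labels`."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(left, right):
    """Weighted gini of a binary split:
    (N1/N) * gini(T1) + (N2/N) * gini(T2)."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)
```

A pure partition gives gini 0, and a 50/50 two-class partition gives the maximum of 0.5, which is why lower gini_split values mean better splits.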
Gini Index: The Preferred Splitting Index

Advantage of the gini index: its calculation requires only the distribution of the class values in each record partition.

To find the best split point for a node, the node's attribute lists are scanned to evaluate the candidate splits for each attribute. The attribute containing the split point with the lowest gini index value is used for splitting the node's records.

The splitting test follows (next slide); its flow chart fits into block 21 of the decision tree construction.
Numeric Attributes: Splitting Index
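The original slide's flow chart did not survive conversion, but the scan it describes can be sketched as follows: walk the presorted attribute list once, evaluate the test `attr <= v` at every boundary, and keep the split with the lowest gini index. The helper name `best_numeric_split` is an assumption for the example.

```python
from collections import Counter

def gini(labels):
    """gini(T) = 1 - sum_j (p_j)^2 over the class frequencies."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_numeric_split(sorted_entries):
    """Scan a presorted (value, class, rec_id) attribute list and return
    (split_value, gini_of_split) for the best test `attr <= split_value`."""
    labels = [cls for _, cls, _ in sorted_entries]
    n = len(labels)
    best_value, best_score = None, float("inf")
    for i in range(1, n):
        # candidate split: attr <= value of entry i-1
        score = i / n * gini(labels[:i]) + (n - i) / n * gini(labels[i:])
        if score < best_score:
            best_value, best_score = sorted_entries[i - 1][0], score
    return best_value, best_score

# The sorted Age list from the presorting slide:
age_list = [(17, "HIGH", 1), (20, "HIGH", 5), (23, "HIGH", 0),
            (32, "LOW", 4), (43, "HIGH", 2), (68, "LOW", 3)]
best_value, best_score = best_numeric_split(age_list)  # best test: Age <= 23
```

Because the list is presorted, each candidate boundary is examined in a single pass; SLIQ maintains running class histograms to avoid the repeated recounting this sketch does for clarity.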
Splitting for Categorical Attributes
Determining the Subset with the Highest Index

The logic for finding the best subset substitutes for block 39. A greedy algorithm may be used here.
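One way to sketch that greedy algorithm: start from the empty set and repeatedly add the categorical value that most lowers the gini index of the split `value in S`, stopping when no addition improves it. The function name `greedy_best_subset` and the (value, class, rec_id) tuple format are assumptions for the example.

```python
from collections import Counter

def gini(labels):
    """gini(T) = 1 - sum_j (p_j)^2 over the class frequencies."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def weighted_gini(left, right):
    """(N1/N) * gini(T1) + (N2/N) * gini(T2) for a binary split."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def greedy_best_subset(entries):
    """Greedily grow the value subset S minimizing the gini index of the
    split `value in S`; `entries` are (value, class, rec_id) tuples."""
    values = {v for v, _, _ in entries}
    subset, best = set(), float("inf")
    while True:
        best_add, best_add_score = None, best
        for v in values - subset:
            cand = subset | {v}
            left = [c for val, c, _ in entries if val in cand]
            right = [c for val, c, _ in entries if val not in cand]
            score = weighted_gini(left, right)
            if score < best_add_score:
                best_add, best_add_score = v, score
        if best_add is None:  # no single addition improves the split
            return subset, best
        subset.add(best_add)
        best = best_add_score

# The CarType list from the attribute-listing slide:
cartype_list = [("Family", "HIGH", 0), ("Sports", "HIGH", 1),
                ("Sports", "HIGH", 2), ("Family", "LOW", 3),
                ("Truck", "LOW", 4), ("Family", "HIGH", 5)]
best_subset, best_subset_score = greedy_best_subset(cartype_list)
```

The greedy search avoids enumerating all 2^k subsets of a k-valued attribute, at the cost of possibly missing the global optimum.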
The Decision Tree Being Constructed (Level 0)
Decision tree (level 1)
The classification at level 1
Performance:
Performance: Classification Accuracy
Performance: Decision Tree Size
Performance: Execution Time
Performance: Scalability
Conclusion:
THANK YOU !!!