
FHSM: Fuzzy Heterogeneous Split Measure Algorithm for Decision Trees

Shalini Bhaskar Bajaj, Department of Computer Science, G D Goenka University, Sohna, Gurgaon, [email protected]

Akshaya Kubba, Department of Information Technology, Banasthali Vidyapeeth University, Rajasthan, [email protected]

Abstract— Classification is a standard way to partition a given data set, and the decision tree is one of the common methods for extracting knowledge from it. The traditional decision tree faces the problem of crisp boundaries; hence fuzzy boundary conditions are proposed in this research. The paper proposes the Fuzzy Heterogeneous Split Measure (FHSM) algorithm for decision tree construction, which uses a trapezoidal membership function to assign fuzzy membership values to the attributes. The size of the decision tree is one of the main concerns, as a larger tree leads to incomprehensible rules. The proposed algorithm reduces the size of the generated decision tree by fixing the value of a control variable, without compromising classification accuracy.

Keywords—Classification, fuzzy decision tree, HSM, fuzzy membership function.

I. INTRODUCTION

Decision making is one of the major problems to be dealt with. It works on the principle of if-then-else rules. In order to make efficient decisions, decision problems can be solved using appropriate classification techniques. Classification is an important concept, as it allows one to search for a desired record in a fast and inexpensive way from a given database. A number of classification algorithms have been proposed in the past. In 1986, Quinlan proposed the ID3 (Iterative Dichotomiser 3) algorithm [7], which uses the entropy measure to find the splitting attribute. Entropy is defined as the randomness of a system, or the number of ways in which an arrangement can be made. Quinlan later introduced the C4.5 algorithm, which uses the gain ratio to classify the data set [8]. In 1996, Mehta, Agrawal and Rissanen developed the SLIQ algorithm [10]. It uses the gini index [5] as a measure to classify the data set; the gini index measures the diversity of a given data set. In 2009, Paul and Chandra developed a linear node split measure [2] based on three distinct classes. In 2010, Chandra, Kothari and Paul developed a new nonlinear node split measure [1].

The split measure [6] is the main decision criterion in classification algorithms. Various split measures have been proposed in the past, including the gini index [5], information gain [7], gain ratio [8], etc. In 2011, B. Chandra and V. Venkatanaresh Babu developed a split measure known as the heterogeneous split measure (HSM) [3], based on distinct classes. It works on a quasi-linear (non-linear) mean of an exponential function [3]. HSM proved to have better accuracy than the previously described algorithms.

Decision making with crisp boundaries is an inefficient way of classification. In this paper, fuzzy decision trees are proposed to overcome this problem, using the fuzzy concepts introduced by Zadeh [9] in 1965. A fuzzy set introduces the concept of membership functions, which define how much membership an attribute has in a particular class. A number of algorithms [4][5][11][12] have been proposed to fuzzify the well-known ID3, C4.5 and SLIQ algorithms.

In this paper a new approach, FHSM (Fuzzy Heterogeneous Split Measure), is proposed to fuzzify the HSM algorithm using a trapezoidal membership function. The proposed algorithm FHSM deals with a 3-class problem and uses the trapezoidal membership function to generate a fuzzy decision tree. Fuzzy decision trees model the uncertainty around the split values, making soft rather than hard splits.

The paper is organized as follows. Section II describes the HSM algorithm; Section III proposes the FHSM (Fuzzy Heterogeneous Split Measure) algorithm; Section IV gives an illustration of the proposed algorithm; Section V discusses the results obtained; and finally Section VI gives concluding remarks.

II. HSM (HETEROGENEOUS SPLIT MEASURE) ALGORITHM

HSM [3] works on both global and local information of the dataset and uses a quasi-linear mean of an exponential function. The node splitting value in HSM is computed as the weighted sum of the partial information gain of the sub-partitions. For calculating the partial information gain, the ratio of the class frequencies within a sub-partition (local) to the class frequencies in the entire dataset (global) is aggregated through a weighted quasi-linear mean. Finally, the decision tree is constructed, and partitioning is continued till the records at each partitioned node belong to the same class.

Consider a data set D having R records divided into k classes C1, C2, ..., Ck. A split point partitions the data set into two



portions, a lower partition set L and an upper partition set U. Let li be the number of records of L belonging to class Ci and ui the number of records of U belonging to class Ci, where i = 1, 2, ..., k. The node splitting value HSM for a given attribute can then be calculated as in (1):

HSM = (|L|/R) * log(PIG(L)) + (|U|/R) * log(PIG(U))    (1)

The partial information gain PIG describes the quasi-linear mean of the information gain of the probability of occurrence of events. Following the description in [3], for a partition P in {L, U} with class counts p1, ..., pk it is taken here as in (2):

PIG(P) = Σ i=1..k (pi/|P|) * exp((pi/|P|) / (ri/R))    (2)

where ri is the total number of records of class Ci in D.

The split point at which HSM is maximum is selected as the best splitting point, and the corresponding attribute is the splitting attribute. Further partitioning of the decision tree continues until all the records in a given partition belong to a single class.

Calculation of HSM for a three-class problem

Let A, B, C be the three classes, and let L and U be the lower and upper partitions at the splitting point T. L1 and U1 are the numbers of records of a particular attribute belonging to class A, L2 and U2 the numbers belonging to class B, L3 and U3 the numbers belonging to class C, and N is the total number of records (refer Table I). The HSM value is calculated as shown in (3).

TABLE I. SAMPLE CLASS HISTOGRAM

Partition                    A     B     C
L (attribute value < T)      L1    L2    L3
U (attribute value >= T)     U1    U2    U3

HSM = (|L|/N) * log( Σ i=1..3 (Li/|L|) * exp((Li/|L|) / ((Li+Ui)/N)) )
    + (|U|/N) * log( Σ i=1..3 (Ui/|U|) * exp((Ui/|U|) / ((Li+Ui)/N)) )    (3)

where |L| = L1 + L2 + L3 and |U| = U1 + U2 + U3.
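As a concrete check of (1)-(3), the following minimal Python sketch mirrors the reconstruction above. The exp-of-frequency-ratio form of the partial information gain is an assumption made here for illustration, the definitive form being the one in [3], and the histogram in the last line is hypothetical.

import math

def partial_info_gain(part_counts, total_counts, n_total):
    """Quasi-linear mean term of (2): weighted sum of exponentials of the
    ratio of local class frequency to global class frequency (assumed form)."""
    size = sum(part_counts)
    if size == 0:
        return 0.0
    acc = 0.0
    for p_i, r_i in zip(part_counts, total_counts):
        if p_i == 0 or r_i == 0:
            continue
        local = p_i / size      # class frequency inside the partition (local)
        glob = r_i / n_total    # class frequency in the whole data set (global)
        acc += local * math.exp(local / glob)
    return acc

def hsm(lower_counts, upper_counts):
    """Weighted sum of the logs of the partial information gains, per (1)/(3)."""
    n_total = sum(lower_counts) + sum(upper_counts)
    totals = [l + u for l, u in zip(lower_counts, upper_counts)]
    value = 0.0
    for counts in (lower_counts, upper_counts):
        size = sum(counts)
        pig = partial_info_gain(counts, totals, n_total)
        if size > 0 and pig > 0:
            value += (size / n_total) * math.log(pig)
    return value

# Hypothetical 3-class histogram in the shape of Table I:
# L = (L1, L2, L3) = (4, 1, 0), U = (U1, U2, U3) = (2, 1, 5)
print(hsm([4, 1, 0], [2, 1, 5]))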

III. PROPOSED METHOD

A. FHSM (Fuzzy Heterogeneous Split Measure) algorithm

In order to overcome the problems associated with crisp boundaries and to handle the data in a more flexible manner, the concept of fuzzy logic has been used in the proposed approach. This section discusses the proposed method FHSM (Fuzzy Heterogeneous Split Measure). The proposed method uses the trapezoidal membership function to fuzzify HSM. A split point is chosen as the mid-point of the attribute values where the class information changes. The fuzzy membership value is calculated for each attribute using the trapezoidal membership function.

In FHSM, let L1 and U1 be the sums of the fuzzy values of the records of an attribute belonging to class A, L2 and U2 the sums of the fuzzy values of the records belonging to class B, and L3 and U3 the sums of the fuzzy values of the records belonging to class C.

For a given record x, the fuzzy membership value can be calculated using the trapezoidal membership function, where a, b, c and d are the points of the trapezoid, as given in (4). Fig. 1 shows the trapezoidal membership function.

μ(x) = 0,              x <= a
μ(x) = (x-a)/(b-a),    a <= x <= b
μ(x) = 1,              b <= x <= c
μ(x) = (d-x)/(d-c),    c <= x <= d
μ(x) = 0,              x >= d        (4)

Figure 1. Trapezoidal membership function
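A direct rendering of (4) in Python is given below; the (a, b, c, d) values in the usage lines are hypothetical, chosen only to exercise the plateau and the rising edge of the trapezoid.

def trapezoid(x, a, b, c, d):
    """Trapezoidal membership function of (4): rises linearly on [a, b],
    is 1 on [b, c], falls linearly on [c, d], and is 0 elsewhere."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)

# Hypothetical trapezoid (a, b, c, d) = (40, 45, 55, 60):
print(trapezoid(50, 40, 45, 55, 60))   # -> 1.0 (on the plateau)
print(trapezoid(42, 40, 45, 55, 60))   # -> 0.4 (on the rising edge)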

For a 3-class problem the trapezoidal membership functions will look like the ones given in Fig. 2, where c1 and d1 belong to class 1; a2, b2, c2 and d2 belong to class 2; and a3 and b3 belong to class 3.

Figure 2. Membership function used in this paper

For each attribute, the membership value is calculated, followed by the calculation of the HSM value of each split point using the trapezoidal membership function. Initially, all records are assigned equal membership, i.e. 1/(number of classes), which is 1/3 ≈ 0.33 for a 3-class problem. The attribute with the maximum fuzzy HSM value is selected as the root node and is further partitioned. The process is repeated until one of the following stopping criteria for decision tree construction using the Fuzzy HSM (FHSM) algorithm is met:
1. All the records in a particular node belong to the same class.
2. The dataset chosen is empty and no record is left for further partitioning.
3. The depth has reached a given threshold value.



Proposed Algorithm FHSM:

If (all records belong to the same class) Then
    Create a single leaf node labelled with that class name
Else {
    Initialize the fuzzy membership value of each record in the data set to 1/(number of classes).
    Update the fuzzy membership value of each record by multiplying it with the value obtained from the trapezoidal membership function.
    For each attribute {
        Sort all the records in increasing order of the attribute value, along with their corresponding class values and fuzzy membership values.
        Find the split points (attribute values where the class information changes).
        For each split point
            Evaluate the HSM value
    }
    Find the split point that has the highest HSM value
}

The attribute with the highest HSM value becomes the root node, and the records of the dataset are partitioned into the left and right subtrees of the root node on the basis of the split value, which is nothing but an attribute value. The same procedure is repeated till the stopping criterion mentioned above is satisfied.
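The pseudocode can be turned into a short recursive sketch, shown below. This is a minimal illustration rather than the authors' implementation: it reuses the hedged hsm() helper sketched in Section II, represents records as plain dicts carrying the attribute values, a "class" label and a fuzzy membership "mu" (assumed to be initialized and updated as described above), and the fuzzy-weighted histogram inside evaluate_hsm is an assumption about how the membership values enter the measure.

def evaluate_hsm(records, attr, t, classes):
    """Fuzzy class histogram at threshold t, fed to the hsm() sketch above."""
    lower = [sum(r["mu"] for r in records if r[attr] <= t and r["class"] == c)
             for c in classes]
    upper = [sum(r["mu"] for r in records if r[attr] > t and r["class"] == c)
             for c in classes]
    return hsm(lower, upper)

def split_points(records, attr):
    """Candidate splits: mid-points of consecutive attribute values where
    the class information changes (Section III)."""
    rows = sorted(records, key=lambda r: r[attr])
    return [(p[attr] + c[attr]) / 2.0
            for p, c in zip(rows, rows[1:])
            if p["class"] != c["class"] and p[attr] != c[attr]]

def majority_class(records, classes):
    """Class with the largest summed fuzzy membership, for forced leaves."""
    return max(classes,
               key=lambda c: sum(r["mu"] for r in records if r["class"] == c))

def build_fhsm_tree(records, attrs, classes, depth=0, max_depth=10):
    """Recursive FHSM construction following the pseudocode above."""
    labels = {r["class"] for r in records}
    if len(labels) == 1:                       # stopping criterion 1
        return {"leaf": labels.pop()}
    if not records or depth >= max_depth:      # stopping criteria 2 and 3
        return {"leaf": majority_class(records, classes)}
    candidates = [(evaluate_hsm(records, a, t, classes), a, t)
                  for a in attrs for t in split_points(records, a)]
    if not candidates:                         # no admissible split point left
        return {"leaf": majority_class(records, classes)}
    _, attr, t = max(candidates)
    left = [r for r in records if r[attr] <= t]
    right = [r for r in records if r[attr] > t]
    return {"attr": attr, "split": t,
            "left": build_fhsm_tree(left, attrs, classes, depth + 1, max_depth),
            "right": build_fhsm_tree(right, attrs, classes, depth + 1, max_depth)}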

IV. ILLUSTRATION

The FHSM algorithm described above is explained with the help of a small dataset taken from the UCI Machine Learning Repository [13]. The dataset is the Ecoli dataset (refer Table II) [13]. It has 7 attributes and 3 classes in total. Initially the membership of each record is considered equal, i.e. 1/(total number of classes).

TABLE II. EXAMPLE DATASET

A1 A2 A3 A4 A5 A6 A7 Class

48 42 48 5 45 25 35 1

49 42 48 5 53 79 81 2

53 42 48 5 16 29 39 1

62 42 48 5 58 79 81 3

63 42 48 5 48 77 8 3

72 42 48 5 65 77 79 3

88 42 48 5 52 73 75 3

12 43 48 5 63 7 74 2

16 43 48 5 54 27 37 1

22 43 48 5 48 16 28 1

24 43 48 5 54 52 59 1

24 43 48 5 37 28 38 1

41 43 48 5 45 31 41 1

In the proposed algorithm FHSM, initially all the attributes have the same membership value. The fuzzy values are calculated using the trapezoidal membership function, the new fuzzy values are updated, and the attribute values are sorted. For the attributes A1, A2, A3, A4, A5, A6 and A7 the split points and their corresponding HSM values are calculated. Table III shows the sorted list of split points along with their HSM values (in increasing order of HSM value). A split point is an attribute value where the class value changes.

TABLE III. SORTED LIST OF ATTRIBUTES ON THE BASIS OF HSM VALUE

Attribute    Split Point    HSM

A3 48.0 0.2766

A5 52.5 0.2906

A1 48.5 0.2949

A7 66.5 0.3001

A2 43.0 0.3010

A4 5.0 0.3010

A6 78.0 0.3010

From Table III, it can be observed that the attribute A6 has the highest HSM value, at split point 78. Hence A6 is selected as the root node at split point 78. The dataset is divided into two parts on the basis of the selected root node, and the same procedure is applied on each partitioned dataset till the stopping criterion is satisfied (a sketch of this selection step is given below). The left and right partitions of the example dataset of Table II are given in Table IV and Table V respectively.
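Using the helpers sketched in Section III, the root-selection step of this illustration could be driven as follows; here records would hold the thirteen rows of Table II as dicts and classes would be (1, 2, 3), both assumed to be prepared as above. The paper reports the maximum at attribute A6 with split point 78.

# Hypothetical driver for the root split, reusing the sketch above:
attrs = ["A1", "A2", "A3", "A4", "A5", "A6", "A7"]
scores = [(evaluate_hsm(records, a, t, classes), a, t)
          for a in attrs for t in split_points(records, a)]
best_value, best_attr, best_split = max(scores)   # paper: A6 at 78.0
left = [r for r in records if r[best_attr] <= best_split]
right = [r for r in records if r[best_attr] > best_split]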

TABLE IV. LEFT SUBTREE AT LEVEL 1

A1 A2 A3 A4 A5 A6 A7 Class

12 43 48 5 63 7 74 2

22 43 48 5 48 16 28 1

48 42 48 5 45 25 35 1

16 43 48 5 54 27 37 1

24 43 48 5 37 28 38 1

53 42 48 5 16 29 38 1

41 43 48 5 45 31 41 1

24 43 48 5 54 52 59 1

88 42 48 5 52 73 75 3

63 42 48 5 48 77 8 3

72 72 48 5 65 77 79 3

From Table IV, it can be seen that the membership of Class 1, Class 2 and Class 3 in the resultant dataset of the left subtree at level 1 is 0.637, 0.090 and 0.273 respectively.
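The membership figures quoted for the level-1 partitions follow from the fuzzy weights; with the equal initial weights of this example they reduce to simple class-frequency fractions, which can be checked as below (the list is the class column of Table IV).

left_classes = [2, 1, 1, 1, 1, 1, 1, 1, 3, 3, 3]   # class column of Table IV
for c in (1, 2, 3):
    print(c, round(left_classes.count(c) / len(left_classes), 3))
# prints 1 0.636, 2 0.091, 3 0.273; the paper reports 0.637, 0.090 and 0.273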


TABLE V. RIGHT SUBTREE AT LEVEL 1

A1 A2 A3 A4 A5 A6 A7 Class

49 42 48 5 53 79 81 2

62 42 48 5 58 79 81 3

From Table V, it can be observed that the membership of Class 2 and of Class 3 in the resultant dataset of the right subtree at level 1 is 0.5 each. As there is no record of Class 1 in this dataset, Class 1 has 0 membership value.

The records of the left subtree (refer Table IV) at level 1 are further partitioned using the same procedure, and the results obtained are given in Table VI, which displays the sorted list of attributes in accordance with the HSM value. A split point is an attribute value where the class value changes.

TABLE VI. SORTED LIST OF ATTRIBUTES ON THE BASIS OF HSM VALUE FOR THE LEFT SUBTREE AT LEVEL 1

Attribute    Split Point    HSM

A4 0 0

A6 0 0

A7 66.5 0.0469

A5 58.5 0.0557

A3 48.0 0.0781

A2 43.0 0.1252

A1 58.0 0.3010

From Table VI, it can be observed that attribute A1 acts as the splitting attribute at split point 58 as it has the highest HSM value. Hence, further partitioning is done as shown in Table VII and Table VIII.

TABLE VII. LEFT SUBTREE OBTAINED BY SPLITTING THE DATASET OF LEFT SUBTREE AT LEVEL 1

A1 A2 A3 A4 A5 A6 A7 Class

12 43 48 5 63 7 74 2

16 43 48 5 54 27 37 1

22 43 48 5 48 16 28 1

24 43 48 5 37 28 38 1

24 43 48 5 54 52 59 1

41 43 48 5 45 31 41 1

48 42 48 5 45 25 35 1

53 42 48 5 16 29 39 1

From Table VII, it can be observed that the membership of Class 1, Class 2 and Class 3 in the resultant dataset of the left subtree of the left subtree at level 1 is 0.875, 0.125 and 0 respectively. As there is no record of Class 3 in this dataset, Class 3 has 0 membership value.

TABLE VIII. RIGHT SUBTREE OBTAINED BY SPLITTING THE DATASET OF LEFT SUBTREE AT LEVEL 1

A1 A2 A3 A4 A5 A6 A7 Class

63 42 48 5 48 77 8 3

72 42 48 5 65 77 79 3

88 42 48 5 52 73 75 3

From Table VIII, it can be observed that the membership of Class 3 in the resultant dataset of the right subtree of the left subtree at level 1 is 1. As there are no records of Class 1 and Class 2 in this dataset, Class 1 and Class 2 have 0 membership value.

The decision tree formed by the above algorithm on the dataset of Table II is shown in Fig. 3. For the same dataset, the decision tree obtained by the HSM algorithm [3] is shown in Fig. 4.

Figure 3. Decision tree using fuzzy HSM (FHSM): root A6 (split at 78), whose left child A1 (split at 58) and right child A7 (split at 81) lead directly to class leaves.

Figure 4. Decision tree using HSM: root A6 (split at 62.5), whose children A7 (split at 66.5) and A1 (split at 55.5) lead to the leaves Class 1, Class 2, Class 2 and Class 3.

V. RESULTS

The proposed algorithm FHSM has been tested on a number of datasets taken from the UCI Machine Learning Repository [13]. Table IX gives details about the number of attributes and the number of classes of the various datasets taken for classification. The classification accuracy of the proposed algorithm FHSM has been compared with that of the HSM algorithm.


TABLE IX. DATASETS

Sr. No.   Dataset Name            No. of Attributes   No. of Classes
1.        Haberman                3                   2
2.        Liver                   6                   2
3.        Balanced Scale          4                   3
4.        Breast Cancer           9                   2
5.        Credit                  14                  2
6.        Wine                    13                  3
7.        Ionosphere              34                  2
8.        Lymphograph             18                  4
9.        Pima Indian Diabetes    8                   2
10.       Zoo                     16                  7

Table X gives the classification accuracy results of the FHSM and HSM algorithms on the datasets mentioned in Table IX. From the results obtained, it can be observed that the proposed algorithm gives better classification accuracy on a number of datasets.

TABLE X. CLASSIFICATION ACCURACY ON DATA SETS USING HSM AND FHSM

Sr. No.   Dataset Name            Accuracy using HSM (%)   Accuracy using FHSM (%)
1.        Haberman                66.77                    69.2
2.        Liver                   63.33                    66.7
3.        Balanced Scale          75.73                    78.5
4.        Breast Cancer           94.92                    92.4
5.        Credit                  83.75                    85.8
6.        Wine                    93.33                    91.6
7.        Ionosphere              86.57                    88.9
8.        Lymphograph             78.82                    81.7
9.        Pima Indian Diabetes    68.96                    72.1
10.       Zoo                     97                       94.2

From the results listed in Table X and graphically shown in Fig. 5, it can be observed that the average accuracy of the proposed algorithm FHSM is better than that of the HSM algorithm. The FHSM algorithm gives better accuracy on all datasets except the Breast Cancer, Wine and Zoo datasets.

Figure 5. Classification accuracy on different datasets using HSM and Fuzzy HSM

VI. CONCLUSION

The classification algorithm HSM generates decision trees that have crisp boundaries, and decision trees with crisp boundaries have hard split point values. In real life, soft boundaries play an important role. Hence the FHSM algorithm is proposed, using a trapezoidal membership function to make the crisp boundaries softer while generating the decision trees. The classification accuracy obtained using the fuzzy concept is better as compared to the one obtained using crisp data partitioning. FHSM has proved to have better classification accuracy than HSM for most of the datasets, which is quite evident from the experimental analysis.

REFERENCES

[1] B. Chandra, R. Kothari, P. Paul, "A new node splitting measure for decision tree construction", Pattern Recognition, Vol. 43(8), pp. 2725-2731, Aug. 2010.
[2] B. Chandra, P. Paul Varghese, "Moving towards efficient decision tree construction", Information Sciences, Vol. 179(8), pp. 1059-1069, Mar. 2009.
[3] B. Chandra and V. Babu, "Heterogeneous Node Split Measure for Decision Tree Construction", IEEE, 2011.
[4] B. Chandra and P. Paul Varghese, "Fuzzy SLIQ Decision Tree Algorithm", IEEE Transactions on Systems, Man, and Cybernetics, Part B, Vol. 38(5), Oct. 2008.
[5] E. Hullermeier, S. Vanderlooy, "Why Fuzzy Decision Trees are Good Rankers", to appear in IEEE Transactions on Fuzzy Systems.
[6] L. Breiman, "Some properties of splitting criteria", Machine Learning, Vol. 24, pp. 41-47, 1996.
[7] J. R. Quinlan, "Induction of decision trees", Machine Learning, pp. 81-106, 1986.
[8] J. R. Quinlan, "C4.5: Programs for Machine Learning", San Mateo, CA: Morgan Kaufmann, 1993.
[9] T. J. Ross, "Fuzzy Logic with Engineering Applications", third edition, 2010.
[10] M. Mehta, R. Agrawal, J. Rissanen, "SLIQ: A Fast Scalable Classifier for Data Mining", Proc. Extending Database Technology (EDBT), Avignon, France, 1996.
[11] S. Ruggieri, "Efficient C4.5", IEEE Trans. Knowl. Data Eng., Vol. 14(2), pp. 438-444, Mar./Apr. 2002.
[12] Y. Yuan and M. J. Shaw, "Induction of fuzzy decision trees", Fuzzy Sets Syst., Vol. 69(2), pp. 125-139, Jan. 1995.
[13] UCI Machine Learning Repository, http://archive.ics.uci.edu/ml/