
Weighted Naive Bayesian Classifier

Hamad Alhammady Etisalat University College - UAE

[email protected]

Abstract

The naive Bayesian (NB) classifier is a simple yet powerful classification method. One of the important problems in NB (and many other classifiers) is that it is built using crisp classes assigned to the training data. In this paper, we propose an improvement over the NB classifier by employing emerging patterns (EPs) to weight the training instances. That is, we generalize the NB classifier so that it can take into account weighted classes assigned to the training data. EPs are itemsets whose frequencies in one class are significantly higher than their frequencies in the other classes. Our experiments show that our proposed method is superior to the original NB classifier.

1. Introduction

The naive Bayesian (NB) classifier [1] is considered an effective classifier despite its simplicity. However, the NB classifier is based on probability calculations over crisp classes in the dataset. That is, probabilities are calculated assuming that each training instance is related completely to one class only. This assumption conflicts with the fact that most real-life datasets suffer from noise; that is, a training instance might not always be assigned to its real class. We proposed the notion of weighted classes in previous research [2]. Assume that we have a dataset consisting of three classes: C_1, C_2, and C_3. An instance i is said to have a crisp class if it is assigned completely to one of the three classes. However, instance i may still have some relation to the other two classes. The notion of weighted classes indicates that i is related to the three classes with different weights. Figure 1 shows examples of a crisp class and a weighted class. In the crisp class, 100% of the weight of instance i is assigned to one of the three classes (in this example, class C_1). In the weighted class, the weight is distributed among the three classes. The weight assigned to each class is proportional to the strength of the relation between this class and instance i.

Figure 1. Examples of a crisp class and a weighted class

In this paper, we apply the concept of weighted classes to the NB classifier. Our weighting scheme, proposed in [2], is based on emerging patterns (EPs). EPs are a new kind of pattern introduced recently [3]. They have been shown to have a great impact in many applications [4] [5] [6] [7] [8] [9]. EPs can capture significant changes between datasets. They are defined as itemsets whose supports increase significantly from one class to another. The discriminating power of EPs can be measured by their growth rates. The growth rate of an EP is the ratio of its support in a certain class over that in another class. Usually the discriminating power of an EP is proportional to its growth rate.

For example, the Mushroom dataset, from the UCI Machine Learning Repository [10], contains a large number of EPs between the poisonous and the edible mushroom classes. Table 1 shows two examples of these EPs. These two EPs consist of 3 items. e1 is an EP from the poisonous mushroom class to the edible mushroom class. It never exists in the poisonous mushroom class, and exists in 63.9% of the instances in the edible mushroom class; hence, its growth rate is ∞ (63.9 / 0). It has a very high predictive power to contrast edible mushrooms against poisonous mushrooms. On the other hand, e2 is an EP from the edible mushroom class to the poisonous mushroom class. It exists in 3.8% of the instances in the edible mushroom class, and in 81.4% of the instances in the poisonous mushroom class; hence, its growth rate is 21.4 (81.4 / 3.8). It has a high predictive power to contrast poisonous mushrooms against edible mushrooms.

[Figure 1 contrasts a crisp class, where instance i is weighted 100% to C_1 and 0% to C_2 and C_3, with a weighted class, where the weight is distributed as 70% to C_1, 20% to C_2, and 10% to C_3.]


Table 1. Examples of emerging patterns.

EP    Support in poisonous mushrooms    Support in edible mushrooms    Growth rate
e1    0%                                63.9%                          ∞
e2    81.4%                             3.8%                           21.4

e1 = {(ODOR = none), (GILL_SIZE = broad), (RING_NUMBER = one)}
e2 = {(BRUISES = no), (GILL_SPACING = close), (VEIL_COLOR = white)}

2. Related Work

The NB classifier is based on using the Bayes theorem to compute the probability scores of a test instance in each class in the dataset (the probabilities are estimated from the training set). The classifier then assigns the test instance to the class with the highest probability. For example, the probability of a test instance t in class C_j is as follows.

P(C_j | t) = P(C_j) P(t | C_j) / P(t)    (1)

As the denominator does not depend on the class and t is given, equation 1 can be written as follows.

P(C_j | t) = P(C_j) P(t | C_j)    (2)

Assume that P(x | y) denotes the probability of x given y, m is the number of attributes, and a_l is the value of the l-th attribute in t. Equation 2 can then be written as follows.

P(C_j | t) = P(C_j) Π_{l=1}^{m} P(a_l | C_j)    (3)

Equation 3 is used to calculate the probability of each class given t. The class with the highest probability is assigned to instance t.
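To make the crisp-class computation concrete, the following is a minimal Python sketch of equations 2 and 3 (not the paper's implementation); it assumes categorical attributes, instances represented as tuples of values, and plain frequency estimates without smoothing.

```python
from collections import Counter, defaultdict

def train_nb(instances, labels):
    """Estimate P(Cj) and P(a_l | Cj) by counting over crisp class labels."""
    n = len(instances)
    class_counts = Counter(labels)
    priors = {c: class_counts[c] / n for c in class_counts}           # P(Cj)
    value_counts = defaultdict(Counter)                                # (class, attribute index) -> value counts
    for inst, c in zip(instances, labels):
        for l, value in enumerate(inst):
            value_counts[(c, l)][value] += 1
    return priors, value_counts, class_counts

def classify_nb(t, priors, value_counts, class_counts):
    """Equation 3: assign t to the class maximizing P(Cj) * prod_l P(a_l | Cj)."""
    best_class, best_score = None, -1.0
    for c, prior in priors.items():
        score = prior
        for l, value in enumerate(t):
            score *= value_counts[(c, l)][value] / class_counts[c]    # P(a_l | Cj)
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```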

Before revisiting our weighting scheme proposed in [2], we first define EPs and other related terminology.

Let obj = {a1, a2, a3, ... an} be a data object following the schema {A1, A2, A3, ... An}. A1, A2, A3.... An are called attributes, and a1, a2, a3, ... an are values related to these attributes. We call each pair (attribute, value) an item.

Let I denote the set of all items in an encoding dataset D. Itemsets are subsets of I. We say an instance Y contains an itemset X, if X ⊆ Y.

Definition 1. Given a dataset D and an itemset X, the support of X in D, s_D(X), is defined as

s_D(X) = count_D(X) / |D|    (4)

where count_D(X) is the number of instances in D containing X.

Definition 2. Given two different classes of datasets D1 and D2, let s_i(X) denote the support of the itemset X in the dataset D_i. The growth rate of an itemset X from D1 to D2, gr_{D1→D2}(X), is defined as

gr_{D1→D2}(X) = 0, if s_1(X) = 0 and s_2(X) = 0;
gr_{D1→D2}(X) = ∞, if s_1(X) = 0 and s_2(X) ≠ 0;
gr_{D1→D2}(X) = s_2(X) / s_1(X), otherwise.    (5)

Definition 3. Given a growth rate threshold ρ > 1, an itemset X is said to be a ρ-emerging pattern (ρ-EP, or simply EP) from D1 to D2 if gr_{D1→D2}(X) ≥ ρ.

The strength of an EP e, strg(e), is defined as follows.

strg(e) = ( gr(e) / (gr(e) + 1) ) · s(e)    (6)

The strength of an EP is proportional to both its growth rate (discriminating power) and its support. Notice that if an EP has a high growth rate and a low support, its strength might be low. In addition, if it has a low growth rate and a high support, its strength might also be low.
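For illustration only, equations 4-6 can be computed directly; the sketch below assumes instances and itemsets are represented as Python sets of (attribute, value) pairs, and it reads s(e) in equation 6 as the support of e in the target class, which is our assumption rather than something stated in the paper.

```python
import math

def support(itemset, dataset):
    """Equation 4: fraction of instances in the dataset that contain the itemset."""
    return sum(1 for inst in dataset if itemset <= inst) / len(dataset)

def growth_rate(itemset, d1, d2):
    """Equation 5: growth rate of the itemset from class dataset d1 to class dataset d2."""
    s1, s2 = support(itemset, d1), support(itemset, d2)
    if s1 == 0 and s2 == 0:
        return 0.0
    if s1 == 0:
        return math.inf
    return s2 / s1

def strength(itemset, d1, d2):
    """Equation 6: strg(e) = gr(e) / (gr(e) + 1) * s(e)."""
    gr = growth_rate(itemset, d1, d2)
    s = support(itemset, d2)                 # assumed: support taken in the target class d2
    if math.isinf(gr):
        return s                             # gr / (gr + 1) tends to 1 as gr grows
    return gr / (gr + 1) * s
```

For example, applying growth_rate to e2 of Table 1 with d1 = edible and d2 = poisonous reproduces 0.814 / 0.038 ≈ 21.4.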

Let C = {c1, …, ck} be a set of class labels. A training dataset is a set of data objects such that, for each object obj, there exists a class label cobj ∈ C associated with it. A classifier is a function from attributes {A1, A2, A3, ..., An} to class labels {c1, …, ck} that assigns class labels to unseen examples.

Our weighting scheme [2] is defined as follows. Assume that we have a set of n training instances, T = {i_1, i_2, ..., i_n}, and a set of k classes, C = {C_1, C_2, ..., C_k}. We have a set of EPs mined for each class, E = {E_{C_1}, E_{C_2}, ..., E_{C_k}}, such that E_{C_j} is the set of EPs related to class C_j. The overall contribution of the EPs contained in an instance i ∈ T in class C_j, β_{C_j}(i), is found by aggregating the strengths of these EPs.



β_{C_j}(i) = Σ_{e ⊆ i, e ∈ E_{C_j}} strg(e)    (7)

The aggregated values of instances in a class are then divided by the median aggregated value in the same class, to eliminate the effect of the different numbers of instances in different classes.

ω_{C_j}(i) = β_{C_j}(i) / Median_{C_j}    (8)

The normalized weight of a training instance i ∈ T in class C_j, δ_{C_j}(i), is defined as follows.

δ_{C_j}(i) = ω_{C_j}(i) / Σ_{f=1}^{k} ω_{C_f}(i)    (9)

After applying the above weighting scheme to the data instances, these instances change from crisp instances, where every instance is assigned completely to one class (table (a) in Figure 2), to weighted instances, where the weight of each instance is distributed among different classes (table (b) in Figure 2).

Figure 2. Crisp instances and weighted instances
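A minimal sketch of equations 7-9 follows, using the same set-of-items representation as above; it assumes the per-class EP sets (as frozensets of items) and their strengths have already been mined with some threshold ρ, and the guards against zero medians and zero totals are our additions, not part of the paper.

```python
from statistics import median

def contribution(instance, eps_of_class, strengths):
    """Equation 7: aggregate the strengths of the class's EPs contained in the instance."""
    return sum(strengths[e] for e in eps_of_class if e <= instance)

def weight_instances(training_set, ep_sets, strengths):
    """Equations 8-9: divide by the per-class median contribution, then normalize across classes."""
    beta = [{c: contribution(inst, eps, strengths[c]) for c, eps in ep_sets.items()}
            for inst in training_set]
    medians = {c: median(b[c] for b in beta) for c in ep_sets}                       # Median_Cj
    deltas = []
    for b in beta:
        omega = {c: (b[c] / medians[c]) if medians[c] else 0.0 for c in ep_sets}     # equation 8
        total = sum(omega.values())
        deltas.append({c: (omega[c] / total) if total else 0.0 for c in ep_sets})    # equation 9
    return deltas
```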

3. Weighted Naive Bayesian Classifier

In [2], we used the previous weighting scheme to build effective weighted decision trees which outperform the original decision trees. In this paper, we propose using the same weighting scheme to define a new type of Bayesian classifier, the weighted naive Bayesian classifier (WNBC). We start by defining the two parts of equation 3, which represents a proper definition of the naive Bayesian classifier. These two parts are P(C_j) and P(a_l | C_j). Suppose that T represents all the instances in the dataset. The probability of class C_j, P(C_j), is defined according to the crisp classes as follows.

P(C_j) = (total number of instances in T assigned to C_j) / |T|    (10)

Considering the weighted classes, P(C_j) is defined by aggregating the weights of the instances in class C_j as follows.

P(C_j) = ( Σ_{i ∈ T} δ_{C_j}(i) ) / |T|    (11)

The probability of attribute value a_l in class C_j, P(a_l | C_j), is defined according to the crisp classes as follows.

P(a_l | C_j) = (total number of instances in T assigned to C_j and containing a_l) / (total number of instances in T assigned to C_j)    (12)

This probability is defined according to the weighted classes by aggregating the weights of the instances that contain the attribute value a_l as follows.

P(a_l | C_j) = ( Σ_{i ∈ T, a_l ⊂ i} δ_{C_j}(i) ) / ( Σ_{i ∈ T} δ_{C_j}(i) )    (13)

By substituting equations 11 and 13 in equation 3, we end up with equation 14, which defines the basic idea of our proposed classifier, the weighted naive Bayesian classifier (WNBC). That is, instead of counting instances that satisfy certain conditions, we aggregate the weights of instances related to the required class under specific conditions. At the end, instance t is assigned to the class that maximizes the probability in equation 14.


P(C_j | t) = ( ( Σ_{i ∈ T} δ_{C_j}(i) ) / |T| ) · Π_{l=1}^{m} ( ( Σ_{i ∈ T, a_l ⊂ i} δ_{C_j}(i) ) / ( Σ_{i ∈ T} δ_{C_j}(i) ) )    (14)

4. Experimental Evaluation

In this section, we compare our proposed Bayesian classifier with two other Bayesian classifiers: the naive Bayesian classifier (NB) and Bayesian classification using emerging patterns (BCEP) [7]. We carry out experiments on 34 datasets from the UCI repository of machine learning databases [10]. The accuracy is obtained using stratified 10-fold cross-validation. The results are shown in Table 2. The following points summarize the obtained results:

• Our proposed classifier, WNBC, scores the highest average accuracy. This proves that it outperforms the other two Bayesian classifiers, NB and BCEP.

• Compared to NB, WNBC wins on all 34 datasets.

• Compared to BCEP, WNBC wins on 25 datasets, BCEP wins on 3 datasets, and they tie on 6 datasets.

5. Conclusions

In this paper, we propose a new Bayesian classifier called weighted naive Bayesian classifier (WNBC). Our proposal is based on using weighted classes instead of crisp classes to find the probabilities required in the Bayesian classifiers. We redefine the naive Bayesian classifier according to the weights obtained using emerging patterns (EPs). We experimentally show that our new Bayesian classifier outperforms other Bayesian classifiers. Our future work will concentrate on applying our weighting scheme on other types of classification methods.

Table 2. Accuracy comparison

Dataset        NB    BCEP   WNBC
Adult          83.2  85     86.2
Australian     84.5  86.4   87
Breast         96.1  97.3   97.7
Chess          87.9  98.9   98.9
Cleve          82.8  82.4   83.1
Crx            77.9  86.8   88.3
Diabetes       75.7  76.8   77.9
Flare          80.5  80.6   81.6
German         74.1  74.7   75.7
Glass          65.8  73.7   75.3
Heart          81.9  81.9   83.8
Hepatitis      83.9  83.3   86.1
Horse          78.6  83.1   84.2
Hypo           97.9  98.9   98.9
Iono           89.5  93.2   93.4
Labor          86.3  90     90.1
Letter         74.9  84.3   85.2
Lymph          78.8  83.1   84.6
Mushroom       95.8  100    100
Pima           75.7  75.7   75.9
Satimage       81.8  87.4   88.2
Segment        91.8  95.2   95.7
Shuttle        99.3  99.9   99.9
Shuttle-small  98.7  99.7   99.7
Sick           84.3  97.3   96.1
Sonar          75.4  78.4   79.2
Splice         94.6  94.1   94.8
Tic-tac-toe    70.1  99.3   97
Vehicle        61.1  68.1   68.4
Vote           87.9  89.8   90.2
Waveform-21    81    82.7   83.6
Wine           96.9  97.5   97.9
Yeast          57.4  58.2   58.2
Zoo            92.7  94.6   93
Average        83.1  87.0   87.5

References

[1] R. Duda and P. Hart. Pattern Classification and Scene Analysis. New York: John Wiley & Sons, 1973.
[2] H. Alhammady and K. Ramamohanarao. Using Emerging Patterns to Construct Weighted Decision Trees. IEEE Transactions on Knowledge and Data Engineering, Vol. 18, No. 7, July 2006.


[3] G. Dong and J. Li. Efficient Mining of Emerging Patterns: Discovering Trends and Differences. In Proceedings of the 1999 International Conference on Knowledge Discovery and Data Mining (KDD'99), San Diego, CA, USA.
[4] H. Alhammady and K. Ramamohanarao. The Application of Emerging Patterns for Improving the Quality of Rare-class Classification. In Proceedings of the 2004 Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD'04), Sydney, Australia.
[5] H. Alhammady and K. Ramamohanarao. Using Emerging Patterns and Decision Trees in Rare-class Classification. In Proceedings of the 2004 IEEE International Conference on Data Mining (ICDM'04), Brighton, UK.
[6] H. Alhammady and K. Ramamohanarao. Expanding the Training Data Space Using Emerging Patterns and Genetic Methods. In Proceedings of the 2005 SIAM International Conference on Data Mining (SDM'05), Newport Beach, CA, USA.
[7] H. Fan and K. Ramamohanarao. A Bayesian Approach to Use Emerging Patterns for Classification. In Proceedings of the 14th Australasian Database Conference (ADC'03), Adelaide, Australia.
[8] G. Dong, X. Zhang, L. Wong, and J. Li. CAEP: Classification by Aggregating Emerging Patterns. In Proceedings of the 2nd International Conference on Discovery Science (DS'99), Tokyo, Japan.
[9] H. Alhammady and K. Ramamohanarao. Mining Emerging Patterns and Classification in Data Streams. In Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence (WI), Compiegne, France, 2005, pp. 272-275.
[10] C. Blake, E. Keogh, and C. J. Merz. UCI Repository of Machine Learning Databases. Department of Information and Computer Science, University of California at Irvine, CA, 1999.
