[IEEE 2007 IEEE/ACS International Conference on Computer Systems and Applications - Amman, Jordan...
Weighted Naive Bayesian Classifier
Hamad Alhammady Etisalat University College - UAE
Abstract
The naive Bayesian (NB) classifier is one of the simplest yet most powerful classification methods. One of the important problems in NB (and many other classifiers) is that it is built using crisp classes assigned to the training data. In this paper, we propose an improvement over the NB classifier by employing emerging patterns (EPs) to weight the training instances. That is, we generalize the NB classifier so that it can take into account weighted classes assigned to the training data. EPs are those itemsets whose frequencies in one class are significantly higher than their frequencies in the other classes. Our experiments show that our proposed method is superior to the original NB classifier.

1. Introduction
The naive Bayesian (NB) classifier [1] is considered one of the most effective classifiers despite its simplicity. However, the NB classifier is based on probability calculations over crisp classes in the dataset. That is, probabilities are calculated assuming that each training instance is related completely to one class only. This assumption conflicts with the fact that most real-life datasets suffer from noise. That is, a training instance might not always be assigned to its real class. We proposed the notion of weighted classes in previous research [2]. Assume that we have a dataset consisting of three classes: C_1, C_2, and C_3. An instance i is said to have a crisp class if it is assigned completely to one of the three classes. However, instance i may still have some relation with the other two classes. The notion of weighted classes indicates that i is related to the three classes with different weights. Figure 1 shows examples of a crisp class and a weighted class. In the crisp class, 100% of the weight of instance i is assigned to one of the three classes (in this example, class C_1). In the weighted class, the weight is distributed among the three classes. The weight assigned to each class is proportional to the strength of the relation between that class and instance i.
Figure 1. Examples of a crisp class and a weighted class
In this paper, we apply the concept of weighted classes to the NB classifier. Our weighting scheme, proposed in [2], is based on emerging patterns (EPs). EPs are a new kind of pattern introduced recently [3]. They have been proved to have a great impact in many applications [4] [5] [6] [7] [8] [9]. EPs can capture significant changes between datasets. They are defined as itemsets whose supports increase significantly from one class to another. The discriminating power of EPs can be measured by their growth rates. The growth rate of an EP is the ratio of its support in a certain class over that in another class. Usually, the discriminating power of an EP is proportional to its growth rate.
For example, the Mushroom dataset, from the UCI Machine Learning Repository [10], contains a large number of EPs between the poisonous and the edible mushroom classes. Table 1 shows two examples of these EPs. These two EPs consist of 3 items. e1 is an EP from the poisonous mushroom class to the edible mushroom class. It never exists in the poisonous mushroom class, and exists in 63.9% of the instances in the edible mushroom class; hence, its growth rate is ∞ (63.9 / 0). It has a very high predictive power to contrast edible mushrooms against poisonous mushrooms. On the other hand, e2 is an EP from the edible mushroom class to the poisonous mushroom class. It exists in 3.8% of the instances in the edible mushroom class, and in 81.4% of the instances in the poisonous mushroom class; hence, its growth rate is 21.4 (81.4 / 3.8). It has a high predictive power to contrast poisonous mushrooms against edible mushrooms.
[Figure 1 shows three classes C_1, C_2, and C_3: in the crisp class, 100% of the weight of instance i is assigned to C_1 and 0% to C_2 and C_3; in the weighted class, the weight is distributed as 70% to C_1, 20% to C_2, and 10% to C_3.]
1-4244-1031-2/07/$25.00 ©2007 IEEE
Table 1. Examples of emerging patterns.
2. Related Work

The NB classifier is based on using the Bayesian theorem to compute the probability scores of a test instance in each class in the dataset (the probabilities are estimated from the training set). The classifier then assigns the test instance to the class with the highest probability. For example, the probability of a test instance t in class C_j is as follows.

P(C_j | t) = \frac{P(C_j) \, P(t | C_j)}{P(t)}    (1)

As the denominator does not depend on the class and t is given, equation 1 can be written as follows.

P(C_j | t) = P(C_j) \, P(t | C_j)    (2)

Assuming that P(x | y) denotes the probability of x given y, m is the number of attributes, and a_l is the value of the l-th attribute in t, equation 2 can be written as follows.

P(C_j | t) = P(C_j) \prod_{l=1}^{m} P(a_l | C_j)    (3)

Equation 3 is used to calculate the probability of each class given t. The class with the highest probability is assigned to instance t.
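As a minimal sketch of the counting-based estimates behind equation 3, the following illustrates training and classification (the toy dataset and function names are our own illustration, not from the paper; no smoothing is applied, so an unseen attribute value yields a zero probability):

```python
from collections import Counter, defaultdict

def train_nb(instances, labels):
    """Estimate the class priors P(C_j) and the per-class value counts
    needed for P(a_l | C_j) by simple counting over the training set."""
    class_counts = Counter(labels)
    cond = defaultdict(Counter)  # cond[c][(l, a)]: count of value a at attribute l in class c
    for inst, c in zip(instances, labels):
        for l, a in enumerate(inst):
            cond[c][(l, a)] += 1
    priors = {c: class_counts[c] / len(labels) for c in class_counts}
    return priors, cond, class_counts

def classify_nb(t, priors, cond, class_counts):
    """Assign t to the class maximizing P(C_j) * prod_l P(a_l | C_j) (equation 3)."""
    best, best_p = None, -1.0
    for c, prior in priors.items():
        p = prior
        for l, a in enumerate(t):
            p *= cond[c][(l, a)] / class_counts[c]
        if p > best_p:
            best, best_p = c, p
    return best

# Hypothetical two-attribute dataset
X = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "hot")]
y = ["no", "no", "yes", "no"]
model = train_nb(X, y)
```

Here `classify_nb(("rainy", "mild"), *model)` compares 0.75 * (1/3) * (1/3) for class "no" against 0.25 * 1 * 1 for class "yes" and picks "yes".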
Before reviewing our weighting scheme proposed in [2], we start by defining EPs and other related terminology.
Let obj = {a_1, a_2, a_3, ..., a_n} be a data object following the schema {A_1, A_2, A_3, ..., A_n}. A_1, A_2, A_3, ..., A_n are called attributes, and a_1, a_2, a_3, ..., a_n are values related to these attributes. We call each pair (attribute, value) an item.
Let I denote the set of all items in an encoding dataset D. Itemsets are subsets of I. We say an instance Y contains an itemset X if X \subseteq Y.
Definition 1. Given a dataset D and an itemset X, the support of X in D, s_D(X), is defined as

s_D(X) = \frac{count_D(X)}{|D|}    (4)

where count_D(X) is the number of instances in D containing X.

Definition 2. Given two different classes of datasets D_1 and D_2, let s_i(X) denote the support of the itemset X in the dataset D_i. The growth rate of an itemset X from D_1 to D_2, gr_{D_1 \to D_2}(X), is defined as

gr_{D_1 \to D_2}(X) =
  0, if s_1(X) = 0 and s_2(X) = 0
  \infty, if s_1(X) = 0 and s_2(X) \neq 0
  \frac{s_2(X)}{s_1(X)}, otherwise    (5)
Definition 3. Given a growth rate threshold \rho > 1, an itemset X is said to be a \rho-emerging pattern (\rho-EP, or simply EP) from D_1 to D_2 if gr_{D_1 \to D_2}(X) \geq \rho.

The strength of an EP e, strg(e), is defined as follows.

strg(e) = \frac{gr(e)}{gr(e) + 1} \cdot s(e)    (6)
The strength of an EP is proportional to both its
growth rate (discriminating power) and support. Notice that if an EP has a high growth rate and a low support its strength might be low. In addition, if it has a low growth rate and a high support its strength might also be low.
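Equations 4-6 can be sketched directly in code as follows (representing instances and itemsets as Python sets is our own modelling choice; we take s(e) in equation 6 to be the support in the target class D_2, which the transcript does not spell out):

```python
from math import inf

def support(itemset, dataset):
    """s_D(X) = count_D(X) / |D| (equation 4); dataset is a list of item-sets."""
    return sum(1 for inst in dataset if itemset <= inst) / len(dataset)

def growth_rate(itemset, d1, d2):
    """gr_{D1->D2}(X) per equation 5, including the 0 and infinity cases."""
    s1, s2 = support(itemset, d1), support(itemset, d2)
    if s1 == 0 and s2 == 0:
        return 0.0
    if s1 == 0:
        return inf
    return s2 / s1

def strength(itemset, d1, d2):
    """strg(e) = gr(e) / (gr(e) + 1) * s(e) (equation 6), with s(e) in D2 (assumed)."""
    gr = growth_rate(itemset, d1, d2)
    if gr == inf:
        return support(itemset, d2)  # gr / (gr + 1) tends to 1 as gr -> infinity
    return gr / (gr + 1) * support(itemset, d2)
```

For example, with d1 = [{"a","b"}, {"a"}] and d2 = [{"a","b"}, {"a","b"}, {"b"}], the itemset {"a","b"} has supports 1/2 and 2/3, so its growth rate from d1 to d2 is 4/3.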
Let C = {c_1, ..., c_k} be a set of class labels. A training dataset is a set of data objects such that, for each object obj, there exists a class label c_obj \in C associated with it. A classifier is a function from attributes {A_1, A_2, A_3, ..., A_n} to class labels {c_1, ..., c_k} that assigns class labels to unseen examples.
Our weighting scheme [2] is defined as follows. Assume that we have a set of n training instances, T = {i_1, i_2, ..., i_n}, and a set of k classes, C = {C_1, C_2, ..., C_k}. We have a set of EPs mined for each class, E = {E_{C_1}, E_{C_2}, ..., E_{C_k}}, such that E_{C_j} is the set of EPs related to class C_j. The overall contribution of the EPs contained in an instance i \in T in class C_j, \beta_{C_j}(i), is found by aggregating the strengths of these EPs.
EP    Support in poisonous mushrooms    Support in edible mushrooms    Growth rate
e1    0%                                63.9%                          ∞
e2    81.4%                             3.8%                           21.4

e1 = {(ODOR = none), (GILL_SIZE = broad), (RING_NUMBER = one)}
e2 = {(BRUISES = no), (GILL_SPACING = close), (VEIL_COLOR = white)}
\beta_{C_j}(i) = \sum_{e \subseteq i, \, e \in E_{C_j}} strg(e)    (7)
The aggregated values of instances in a class are then divided by the median aggregated value in the same class to eliminate the effect of different numbers of instances in different classes.

\omega_{C_j}(i) = \frac{\beta_{C_j}(i)}{Median_{C_j}}    (8)
The normalized weight of a training instance i \in T in class C_j, \delta_{C_j}(i), is defined as follows.

\delta_{C_j}(i) = \frac{\omega_{C_j}(i)}{\sum_{f=1}^{k} \omega_{C_f}(i)}    (9)
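The three steps of the weighting scheme (equations 7-9) can be sketched as follows; the data layout (`class_eps` mapping each class to its EPs with precomputed strengths) and the guards against zero medians and zero totals are our own assumptions:

```python
from statistics import median

def instance_weights(instances, class_eps):
    """Compute normalized weights delta_Cj(i) for each training instance.

    instances: list of item-sets.
    class_eps: {class: {frozenset(EP items): strg(e)}}, mined per class."""
    classes = list(class_eps)
    # Equation 7: beta_Cj(i) aggregates strengths of EPs of Cj contained in i.
    beta = [{c: sum(s for ep, s in class_eps[c].items() if ep <= inst)
             for c in classes} for inst in instances]
    # Equation 8: divide by the per-class median aggregated value
    # (the `or 1.0` fallback for an all-zero class is our addition).
    med = {c: (median(b[c] for b in beta) or 1.0) for c in classes}
    omega = [{c: b[c] / med[c] for c in classes} for b in beta]
    # Equation 9: normalize across classes so each instance's weights sum to 1.
    return [{c: w[c] / (sum(w.values()) or 1.0) for c in classes} for w in omega]
```

With two toy instances and one EP per class, an instance containing only the EPs of C_1 ends up with weight 1 for C_1 and 0 for the other class.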
After applying the above weighting scheme to the data instances, these instances change from crisp instances, where every instance is assigned completely to one class (table (a) in figure 2), to weighted instances, where the weight of each instance is distributed among different classes (table (b) in figure 2).
Figure 2. Crisp instances and weighted instances
3. Weighted Naive Bayesian Classifier
In [2], we used the previous weighting scheme to build effective weighted decision trees which outperform the original decision trees. In this paper, we propose using the same weighting scheme to define a new type of Bayesian classifier, the weighted naive Bayesian classifier (WNBC). We start by defining the two parts of equation 3, which represents a proper definition of the naive Bayesian classifier. These two parts are P(C_j) and P(a_l | C_j). Suppose that T represents all the instances in the dataset. The probability of class C_j, P(C_j), is defined according to the crisp classes as follows.

P(C_j) = \frac{\text{total number of instances in } T \text{ assigned to } C_j}{|T|}    (10)
Considering the weighted classes, P(C_j) is defined by aggregating the weights of the instances in class C_j as follows.

P(C_j) = \frac{\sum_{i \in T} \delta_{C_j}(i)}{|T|}    (11)
The probability of attribute value a_l in class C_j, P(a_l | C_j), is defined according to the crisp classes as follows.

P(a_l | C_j) = \frac{\text{number of instances in } T \text{ assigned to } C_j \text{ and containing } a_l}{\text{number of instances in } T \text{ assigned to } C_j}    (12)
This probability is defined according to the weighted
classes by aggregating the weights of instances that contain the attribute value la as follows.
∑∑
∈
⊂∈=
TiCj
aiTiCj
jl i
iCaP l
)(
)()|( ,
δ
δ (13)
By substituting equations 11 and 13 in equation 3, we end up with equation 14, which defines the basic idea of our proposed classifier, the weighted naive Bayesian classifier (WNBC). That is, instead of counting instances that satisfy certain conditions, we aggregate the weights of instances related to the required class under specific conditions. At the end, instance t is assigned to the class that maximizes the probability in equation 14.
P(C_j | t) = \frac{\sum_{i \in T} \delta_{C_j}(i)}{|T|} \prod_{l=1}^{m} \frac{\sum_{i \in T, \, a_l \in i} \delta_{C_j}(i)}{\sum_{i \in T} \delta_{C_j}(i)}    (14)
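The weight-aggregation idea of equation 14 can be sketched as follows (representing a test instance as a set of attribute values and taking the `delta` weights as given are our own modelling choices for illustration):

```python
def wnbc_classify(t, instances, classes, delta):
    """Assign t to the class maximizing the score of equation 14.

    t: set of attribute values of the test instance.
    instances: the training set T as a list of item-sets.
    delta: per-instance dicts of class weights delta_Cj(i) (equation 9)."""
    n = len(instances)
    best, best_p = None, -1.0
    for c in classes:
        class_weight = sum(d[c] for d in delta)  # sum_i delta_Cj(i)
        p = class_weight / n                     # equation 11: weighted P(Cj)
        for a in t:
            # Equation 13: weighted P(a_l | Cj) aggregates weights of
            # instances containing the attribute value a_l.
            num = sum(d[c] for inst, d in zip(instances, delta) if a in inst)
            p *= num / class_weight if class_weight else 0.0
        if p > best_p:
            best, best_p = c, p
    return best
```

With weights of 1 concentrated on one class per training instance, the score reduces to the crisp NB estimate of equations 10 and 12, which is the sense in which WNBC generalizes NB.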
4. Experimental Evaluation
In this section, we compare our proposed Bayesian classifier with two other Bayesian classifiers, the naive Bayesian classifier (NB) and Bayesian classification using emerging patterns (BCEP) [7]. We carry out experiments on 34 datasets from the UCI repository of machine learning databases [10]. The accuracy is obtained using the methodology of stratified 10-fold cross-validation. The results are shown in table 2. The following points summarize the obtained results:
• Our proposed classifier, WNBC, scores the highest average accuracy. This proves that it outperforms the other two Bayesian classifiers, NB and BCEP.
• Compared to NB, WNBC wins on all 34 datasets. Compared to BCEP, WNBC wins on 25 datasets, BCEP wins on 3 datasets, and they tie on 6 datasets.

5. Conclusions
In this paper, we propose a new Bayesian classifier called the weighted naive Bayesian classifier (WNBC). Our proposal is based on using weighted classes instead of crisp classes to find the probabilities required in Bayesian classifiers. We redefine the naive Bayesian classifier according to the weights obtained using emerging patterns (EPs). We experimentally show that our new Bayesian classifier outperforms other Bayesian classifiers. Our future work will concentrate on applying our weighting scheme to other types of classification methods.
Table 2. Accuracy comparison
Dataset        NB    BCEP  WNBC
Adult          83.2  85    86.2
Australian     84.5  86.4  87
Breast         96.1  97.3  97.7
Chess          87.9  98.9  98.9
Cleve          82.8  82.4  83.1
Crx            77.9  86.8  88.3
Diabetes       75.7  76.8  77.9
Flare          80.5  80.6  81.6
German         74.1  74.7  75.7
Glass          65.8  73.7  75.3
Heart          81.9  81.9  83.8
Hepatitis      83.9  83.3  86.1
Horse          78.6  83.1  84.2
Hypo           97.9  98.9  98.9
Iono           89.5  93.2  93.4
Labor          86.3  90    90.1
Letter         74.9  84.3  85.2
Lymph          78.8  83.1  84.6
Mushroom       95.8  100   100
Pima           75.7  75.7  75.9
Satimage       81.8  87.4  88.2
Segment        91.8  95.2  95.7
Shuttle        99.3  99.9  99.9
Shuttle-small  98.7  99.7  99.7
Sick           84.3  97.3  96.1
Sonar          75.4  78.4  79.2
Splice         94.6  94.1  94.8
Tic-tac-toe    70.1  99.3  97
Vehicle        61.1  68.1  68.4
Vote           87.9  89.8  90.2
Waveform-21    81    82.7  83.6
Wine           96.9  97.5  97.9
Yeast          57.4  58.2  58.2
Zoo            92.7  94.6  93
Average        83.1  87.0  87.5

References

[1] R. Duda and P. Hart. Pattern Classification and Scene Analysis. New York: John Wiley & Sons, 1973.
[2] H. Alhammady and K. Ramamohanarao. Using Emerging Patterns to Construct Weighted Decision Trees. IEEE Transactions on Knowledge and Data Engineering, Vol. 18, No. 7, July 2006.
[3] G. Dong and J. Li. Efficient Mining of Emerging Patterns: Discovering Trends and Differences. In Proceedings of the 1999 International Conference on Knowledge Discovery and Data Mining (KDD'99), San Diego, CA, USA.
[4] H. Alhammady and K. Ramamohanarao. The Application of Emerging Patterns for Improving the Quality of Rare-class Classification. In Proceedings of the 2004 Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD'04), Sydney, Australia.
[5] H. Alhammady and K. Ramamohanarao. Using Emerging Patterns and Decision Trees in Rare-class Classification. In Proceedings of the 2004 IEEE International Conference on Data Mining (ICDM'04), Brighton, UK.
[6] H. Alhammady and K. Ramamohanarao. Expanding the Training Data Space Using Emerging Patterns and Genetic Methods. In Proceedings of the 2005 SIAM International Conference on Data Mining (SDM'05), Newport Beach, CA, USA.
[7] H. Fan and K. Ramamohanarao. A Bayesian Approach to Use Emerging Patterns for Classification. In Proceedings of the 14th Australasian Database Conference (ADC'03), Adelaide, Australia.
[8] G. Dong, X. Zhang, L. Wong, and J. Li. CAEP: Classification by Aggregating Emerging Patterns. In Proceedings of the 2nd International Conference on Discovery Science (DS'99), Tokyo, Japan.
[9] H. Alhammady and K. Ramamohanarao. Mining Emerging Patterns and Classification in Data Streams. In Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence (WI'05), Compiegne, France, pp. 272-275.
[10] C. Blake, E. Keogh, and C. J. Merz. UCI Repository of Machine Learning Databases. Department of Information and Computer Science, University of California, Irvine, CA, 1999.