Chapter 5
Rough Set Strategies to Handle Missing Attribute Values
Studies on Rough sets theory with applications to data mining
In real world datasets, missing data are one of the major factors affecting data quality. Fortunately, missing data imputation techniques can be used to improve the quality of the data. The conventional methods for handling missing values focus either on ignoring all cases with missing attribute values or on substituting plausible values for the missing items. In contrast to the conventional methods, RST is an effective tool for knowledge acquisition directly from the original dataset with missing attribute values. In this chapter, different RST based paradigms for handling missing values are explained in detail with suitable examples. A popular RST based method of handling missing values, namely the Characteristic Set based approach, is implemented and tested with a small dataset. A modification of the Characteristic Set based approach is suggested in order to eliminate some of the drawbacks encountered while calculating the similarity among the objects in the dataset. The merits of the proposed approach are examined through experimental analysis.
5.1 Introduction
Missing attribute values are very common in real world data sets. The
existence of missing data can cause serious problems during data analysis.
It leads to biased conclusions, which may affect the efficiency and power of
many data mining operations. Hence the handling of missing values
in a data set is one of the most important and challenging problems in research
activities, especially in the area of data mining. The nature of missing data is
an important factor in the selection of an appropriate missing data handling
technique for a particular data domain. A number of different techniques
have been proposed by many researchers to handle the issues related to
missing attribute values.
In a dataset, data values are missing because of two main reasons –
either they are lost or omitted due to don’t care conditions [Grzymala-Busse
and Siddhaye, 2004a]. A value may be lost when it is mistakenly erased or
unknowingly neglected. Similarly at the time of data collection, respondents
often leave items blank on questionnaires or refuse to report some personal
information. For example, people with low income are less likely to report
their family income than people with higher income. Such values, which are
important for processing but are missing, are termed lost values
[Grzymala-Busse, 2003]. Another reason for incompleteness is the
irrelevance of the data to be stored. If it is possible to classify a given case on
the basis of some attribute values, the other attribute values are irrelevant and are
not recorded. For example, if a doctor examines and diagnoses a patient using
only some selected medical test reports, the other test results might be
irrelevant and hence the doctor may keep such entries blank. Such missing
attribute values are called don’t care conditions [Grzymala-Busse and
Siddhaye, 2004b].
The most appropriate way to handle missing data depends on how data
points are missing, specifically on the extent to which they are completely or
conditionally random. Based on the relationship between measured variables
and the probability of missing data, Little and Rubin classified missing
data distributions into three types: Missing Completely at Random (MCAR),
Missing at Random (MAR) and Missing not at Random (MNAR) [Little and
Rubin, 1987][Little and Rubin, 2002]. These mechanisms are assumptions
that dictate how a particular missing data technique will perform.
Missing Completely at Random (MCAR): Data are MCAR when the
probability of missing data on a variable (attribute) x is unrelated to other
measured variables and to the values of x itself. This means that missingness in
this case is completely unsystematic and the observed data can be considered
a random subsample of the hypothetically complete data. As an example,
consider a dataset collected in connection with an educational study. In this
study, consider the case of a child shifted to another district midway through
his study. If the reason for the shift is a field in the data set, then the missing
values for this attribute (field) are MCAR, provided they are unrelated to other
attributes in the data set such as socio-economic status, disciplinary problems
etc. In a practical data set, the possibility of other attribute values being
unrelated to the missing value is very rare because usually essential fields are
included in the dataset. However, most missing data handling methods
produce accurate estimates when the data are MCAR [Raghunathan,
2004][Muthen et al., 1987].
Missing at Random (MAR): This is a systematic type of missingness. In
MAR the missingness is not related to the underlying values of the
incomplete attribute, but it is related to other measured attributes in the dataset
[Baraldi and Enders, 2010]. This means that missing values for the attribute x
can be explained by other variables in the data set, but the missingness is
random with respect to the underlying values of the attribute x itself. Hence
the data are called Missing at Random. For example, ‘income’ is MAR if the
probability of missing data on income depends on an observed attribute like
‘marital status’ (with values single, married, divorced etc.) and, at the same
time, the probability of missingness is unrelated to the values of the attribute
‘income’ itself.
Missing not at Random (MNAR): This is also a systematic type of missingness.
Data are MNAR if the probability of missing data is systematically related to the
hypothetical values that are missing, even after the other observed data are taken
into account. In this case, the missing value is not completely at random and
cannot be completely explained by other attributes in the data set [Little and
Rubin, 2002]. As an example, ‘income’ is MNAR if households with low
income are less likely to report their income for reasons related to the income itself.
In such cases researchers may adjust the estimates based on other observed
attributes such as occupation.
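The three mechanisms can be illustrated with a small simulation. The sketch below generates a hypothetical complete dataset and deletes ‘income’ under each mechanism; the attribute names, rates and thresholds are made up for illustration and are not from the source.

```python
import random

random.seed(0)

# Hypothetical complete data: an always-observed attribute
# 'marital_status' and the attribute of interest, 'income'.
data = [{"marital_status": random.choice(["single", "married"]),
         "income": random.uniform(10_000, 100_000)}
        for _ in range(1000)]

def delete_mcar(rec):
    # MCAR: missingness is unrelated to anything in the data.
    return random.random() < 0.2

def delete_mar(rec):
    # MAR: missingness depends only on an observed attribute.
    return random.random() < (0.4 if rec["marital_status"] == "single" else 0.1)

def delete_mnar(rec):
    # MNAR: missingness depends on the unobserved value itself;
    # low incomes are less likely to be reported.
    return random.random() < (0.4 if rec["income"] < 40_000 else 0.05)

for mechanism in (delete_mcar, delete_mar, delete_mnar):
    masked = [dict(rec, income=None) if mechanism(rec) else rec
              for rec in data]
    n_missing = sum(rec["income"] is None for rec in masked)
    print(mechanism.__name__, n_missing)
```

Note that the three masked datasets look alike; the mechanisms differ only in what the missingness depends on, which is why they cannot be distinguished from the observed data alone.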
In the next section, some common methods of handling missing
values are discussed which include three different Rough Set based
approaches. The first Rough Set based method, called the RSFit approach [Li and
Cercone, 2006], predicts missing attribute values based on a distance
function. The second method is the Characteristic Set based approach
[Grzymala-Busse and Siddhaye, 2004a], which generates decision rules directly
from the incomplete information system. Finally, a new parallel approach to
handling missing values based on a similarity relation is proposed.
The significance of these approaches in handling missing values is studied using
experimental datasets and the results are analyzed.
5.2 Methods of Handling Missing Attribute Values
To deal with missing values, researchers employ a wide variety
of techniques. In data mining, these techniques are usually classified into two
groups – sequential and parallel [Grzymala-Busse and Grzymala-Busse, 2005].
5.2.1 Sequential Methods
In sequential methods, missing attribute values are replaced by known
values as a preprocessing step. This produces a complete dataset with no
missing values, which is suitable for performing various data mining operations.
Methodologists have proposed a number of such methods which are reported
in the literature [Peugh and Enders, 2004][Breiman et al., 1984][Brazdil and
Bruha, 1992][Bruha, 2004][Allison, 2002][Little and Rubin, 2002]. These
include traditional approaches such as List Wise Deletion, Mean Imputation,
the Most Common Value method, the Global Closest Fit method etc. [Baraldi and
Enders, 2010][Allison, 2002][Little and Rubin, 2002][Schafer and Graham,
2002]. In addition to these techniques, a modern technique based on the EM
algorithm [Schafer, 1997] and a popular Rough Set based missing value
imputation method, the RSFit approach [Li and Cercone, 2006], are widely used.
Traditional Approaches of Handling Missing Values
(i) Deleting all cases with missing attribute values: This is the most basic
traditional method of handling missing data. It is also called List Wise (or
Case Wise) deletion. In this method, all cases with missing attribute values
are simply discarded so that the data analysis is restricted to the cases which
have complete data [Baraldi and Enders, 2010]. List Wise deletion assumes
that the data are MCAR. When this assumption is violated in a dataset,
the analysis has a chance of producing biased results. There are, nevertheless,
some reasons [Allison, 2002][Little and Rubin, 2002] to consider it a
reasonable method. However, since the method discards all the cases
containing missing values, valuable information may be lost from the
original data set.
(ii) Most Common Value Method: This is one of the simplest methods to
handle missing attribute values. In this method, the missing values are
replaced with the most common known value of the attribute. An
implementation of this method is available in the literature [Clark and
Niblett, 1989]. Kononenko proposed a modification of this approach
[Kononenko et al., 1984] in which the most common value of the attribute
restricted to a concept is used to fill the missing value instead of considering
all the cases. In other words, the replacement is based on a probability
distribution representing the likelihood of possible values for the missing
attribute calculated using the frequency counts of the existing entries of the
missing attribute [Prajapathi and Prajapathi, 2011]. Using this approach, the
complexity of the overall process can be reduced to a great extent.
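The plain and concept-restricted variants of the Most Common Value method can be sketched as follows; the toy table, attribute names and decision values below are illustrative, not from the source.

```python
from collections import Counter

# Toy incomplete table; '?' marks a missing value, 'flu' is the decision.
rows = [
    {"temperature": "high",   "flu": "yes"},
    {"temperature": "?",      "flu": "yes"},
    {"temperature": "normal", "flu": "no"},
    {"temperature": "normal", "flu": "no"},
    {"temperature": "high",   "flu": "yes"},
    {"temperature": "high",   "flu": "yes"},
    {"temperature": "?",      "flu": "no"},
]

def most_common(pool, attr):
    # Mode of the known values of attr within the given pool of cases.
    counts = Counter(r[attr] for r in pool if r[attr] != "?")
    return counts.most_common(1)[0][0]

def impute(rows, attr, concept_attr=None):
    out = []
    for r in rows:
        if r[attr] == "?":
            # Restrict the pool to the case's concept if requested.
            pool = [s for s in rows if concept_attr is None
                    or s[concept_attr] == r[concept_attr]]
            r = dict(r, **{attr: most_common(pool, attr)})
        out.append(r)
    return out

print(impute(rows, "temperature")[1]["temperature"])          # → high
print(impute(rows, "temperature", "flu")[6]["temperature"])   # → normal
```

Case 7 illustrates the difference: globally the mode is "high", but within its own concept (flu = no) the mode is "normal", so Kononenko's variant fills it differently.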
(iii) Missing value imputation using all possible known values of the missing
attribute: This method was suggested by Grzymala-Busse and is implemented
in LERS (Learning from Examples using Rough Sets). Here, every case with
a missing value is replaced with a set of cases formed by replacing the
missing value with all possible known values of the considered missing
attribute [Grzymala-Busse, 1991]. A modification of this approach is
proposed by Jerzy W.Grzymala-Busse and Ming Hu, in which a case with
missing value is replaced with a set of cases obtained by assigning all
possible known values of the missing attribute restricted to the concept in
which the case belongs [Grzymala-Busse and Hu, 2000]. A major limitation
of this approach is the possibility of inconsistency in the resulting
dataset. However, standard Rough Set methods are available to induce rules
from an inconsistent information system.
(iv) Replacing missing attribute values by the attribute mean: In this method,
every missing value of an attribute is replaced by the arithmetic mean, or
average, of the attribute's known values. This method is suitable only for
attributes having numerical values. The method assigns the same imputed
value to all missing values of an attribute, which may lead to
considerable distortions in the data distribution [Baraldi and Enders, 2010].
To overcome the above drawback, a modified method was reported in which
the missing value is replaced with the arithmetic mean of all known values of
the attribute restricted to the concept [Grzymala-Busse and Grzymala-Busse,
2005].
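Both the plain attribute-mean and the concept-restricted variants can be sketched as follows; the values and decision labels are toy data, and None marks a missing entry.

```python
# Attribute-mean imputation; None marks a missing numeric value.
ages = [25, None, 40, 35, None, 20]

def mean_impute(values):
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

def concept_mean_impute(values, decisions):
    # Replace each missing value with the mean of the known values
    # restricted to cases having the same decision (concept).
    out = []
    for v, d in zip(values, decisions):
        if v is None:
            known = [w for w, e in zip(values, decisions)
                     if w is not None and e == d]
            v = sum(known) / len(known)
        out.append(v)
    return out

print(mean_impute(ages))                                      # → [25, 30.0, 40, 35, 30.0, 20]
print(concept_mean_impute(ages, ["y", "y", "n", "n", "n", "y"]))  # → [25, 22.5, 40, 35, 37.5, 20]
```

The concept-restricted version produces different imputed values for the two missing entries, which is exactly the distortion-reducing effect the modification aims at.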
(v) Closest Fit Method: Grzymala-Busse et al. proposed an efficient missing
value imputation method in which a missing attribute value in a particular
case is replaced by the known value of another case which is approximately
similar to the former [Grzymala-Busse et al., 2002]. This similar case is
considered as the Closest Fit case. To find out the Closest Fit case the target
vector, the vector representing the case with missing attribute value, is
compared with all other candidate vectors in the dataset. Hence this method is
also known as Global Closest Fit approach. For each case, a distance which is
a measure of the similarity is computed with the target vector [Grzymala-
Busse et al., 2002]. The case which gives the minimum distance is the closest
fitting case and the value in this closest fit case is used to replace the missing
value in the target vector. If there is a tie, any one vector is arbitrarily selected
as the closest fit case. The same authors proposed a modification of this
approach known as the Concept Closest Fit. In this method, the closest fit
case restricted to the concept is used to impute the missing values in the
target case. A major limitation of the Closest Fit Method is that it does not
guarantee that all missing values are replaced with known values.
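The Global Closest Fit search can be sketched for numeric attributes as follows; the three cases and the distance convention (a missing value contributes the maximum distance, 1) follow the description above, while the concrete numbers are made up.

```python
# A sketch of the Global Closest Fit search over numeric attributes;
# None marks a missing value (the data below are illustrative).
def distance(x, y, ranges):
    # Sum of per-attribute normalized differences; a missing value on
    # either side contributes the maximum distance, 1.
    d = 0.0
    for a, (lo, hi) in enumerate(ranges):
        if x[a] is None or y[a] is None:
            d += 1.0
        elif hi > lo:
            d += abs(x[a] - y[a]) / (hi - lo)
    return d

def global_closest_fit(cases, target_idx, attr):
    # Normalize by each attribute's range over its known values.
    ranges = [(min(c[a] for c in cases if c[a] is not None),
               max(c[a] for c in cases if c[a] is not None))
              for a in range(len(cases[0]))]
    target = cases[target_idx]
    best_case, best_d = None, float("inf")
    for i, c in enumerate(cases):
        if i == target_idx or c[attr] is None:
            continue  # candidates must have the value we need
        d = distance(target, c, ranges)
        if d < best_d:
            best_case, best_d = c, d
    return best_case[attr]

cases = [
    (1.0, 10.0, None),   # target: third attribute missing
    (1.0, 12.0, 5.0),
    (9.0, 40.0, 7.0),
]
print(global_closest_fit(cases, 0, 2))  # → 5.0
```

The skip over candidates whose own value is missing also shows why the method cannot guarantee that every missing value gets filled: there may be no candidate with a known value at a small or heavily incomplete dataset.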
Expectation – Maximization (EM) Algorithm
The EM algorithm, as applied to missing data by Schafer, is a widely
recommended statistical method for handling missing attribute values [Schafer,
1997][Schafer, 1999][Schafer and Olsen, 1998]. To impute the missing
values, the algorithm first computes statistical parameters such as variances,
covariances and means from the complete data, perhaps obtained after
List Wise deletion. With the help of these parameter values the algorithm
generates a regression model. Using this model, the missing values are
initially imputed in the original dataset. Having filled in the missing data, the
parameter values are re-estimated to modify the regression model. Using this
modified model the missing data values are imputed again. The process is
repeated until the solution stabilizes. At this point the algorithm will return
the maximum likelihood estimate of the parameters and the final regression
equation to fill in the missing data is constructed from these parameters. A
small error value is added at each stage of the computation of the variance to
compensate for the error in estimation. This approach is superior to the
traditional missing value handling techniques because it produces unbiased
estimates with both MCAR and MAR data [Schafer and Olsen,
1998][Enders, 2006]. As no data values are discarded from the dataset, this
method can be considered more powerful in comparison with the other
methods. Although the EM algorithm is efficient in many situations, it
produces biased estimates when the data are MNAR. Even then, compared to
the traditional methods, the bias tends to be far less.
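The impute/re-estimate loop can be illustrated for a single incomplete numeric attribute regressed on a complete one. This is only a sketch of the loop structure described above, not Schafer's full multivariate EM (no covariance matrix, no added error term), and the data are made up.

```python
# Iterative impute / re-estimate loop on one incomplete attribute y
# regressed on a complete attribute x (a sketch, not full EM).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 4.0, None, 8.2, None, 12.1]

def fit_line(xs, ys):
    # Ordinary least squares for y = a + b*x.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

# Step 1: initial estimates from the complete cases only.
known = [(x, y) for x, y in zip(xs, ys) if y is not None]
a, b = fit_line([x for x, _ in known], [y for _, y in known])

# Steps 2..k: impute, re-estimate, repeat until the fit stabilizes.
for _ in range(20):
    filled = [a + b * x if y is None else y for x, y in zip(xs, ys)]
    a, b = fit_line(xs, filled)

print([round(v, 2) for v in filled])
```

Because the toy data are nearly linear, the loop converges almost immediately; on real multivariate data the repeated re-estimation is what drives the parameters toward their maximum likelihood values.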
RSFit approach to Assign Missing Values
This is a Rough Set based approach for handling missing attribute
values, proposed by J. Li and N. Cercone. In this method, the RST concepts
of reduct and core are effectively utilized for the prediction of missing values.
The attributes in the reduct/core are interdependent according to certain
statistical measures. Since the reduct/core represents the whole information
system, the attribute value pairs contained in it are sufficient to predict the
missing values [Li and Cercone, 2006].
In the RSFit approach, attribute value pairs similar to those of the data
instances containing missing values are identified and the most relevant value
is supplied. Let a decision table T be the input to the RSFit approach and Ck
represent the target attribute for which the value is missing. As a first step of
the process, the core of the data set is generated. If the target attribute Ck does
not belong to the core, a reduct of T is considered, and if Ck does not belong to
the reduct, it is added to the reduct. By considering the attributes in the
reduct/core including Ck, a reduced decision table T′ is constructed. To
predict the missing values, a match function (or a distance function) is
formulated from this new decision table T′. To design the match function
there exist two possibilities. The first one is called global, where all data
instances in T′ are considered for designing the match function [Li and
Cercone, 2006]. In the second possibility, instead of considering all the data
instances, only the data instances having the same decision attribute value
(i.e., concepts) are considered. Hence this process is called concept RSFit
approach. To define the match function, let Ui = {Vi1, Vi2, ..., Vik, ..., Vim, di}
be the ith object containing the missing attribute value Vik for Ck, 1 ≤ k ≤ m.
Let Uj be any data instance from the universe U. Now, the distance from the
target data instance Ui to Uj, dist(Ui, Uj), is defined as:

$$dist(U_i, U_j) = \frac{|v_{i1} - v_{j1}|}{\max v_1 - \min v_1} + \frac{|v_{i2} - v_{j2}|}{\max v_2 - \min v_2} + \cdots + \frac{|v_{im} - v_{jm}|}{\max v_m - \min v_m} \qquad (5.1)$$

where $\frac{|v_{i1} - v_{j1}|}{\max v_1 - \min v_1}$ represents the normalized difference between the
values of the first attribute in Ui and Uj [Li and Cercone, 2006]. For an attribute
with missing values, this component is set to 1, the maximum difference
between unknown values. The data instance Uj giving the smallest value of
dist(Ui, Uj) is taken as the best matched object for Ui, and the target attribute Ck
in Ui is assigned the corresponding value from Uj. If there are multiple
matching cases, one case is randomly selected for Ck. Non numeric attributes
are converted to numeric ones during the pre-processing stage.
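A sketch of the match function of Eq. 5.1; the table, attribute ranges and values below are made up for illustration, attributes are assumed already numeric, and None marks a missing value.

```python
# A sketch of the RSFit match (distance) function of Eq. 5.1.
def rsfit_dist(ui, uj, mins, maxs):
    d = 0.0
    for k, (vi, vj) in enumerate(zip(ui, uj)):
        if vi is None or vj is None:
            d += 1.0  # maximum difference for unknown values
        elif maxs[k] > mins[k]:
            d += abs(vi - vj) / (maxs[k] - mins[k])
    return d

# U1 is the target instance, missing its second attribute.
table = [
    (0.2, None, 0.9),
    (0.1, 0.5, 0.8),
    (0.9, 0.4, 0.1),
]
mins, maxs = (0.1, 0.4, 0.1), (0.9, 0.5, 0.9)
dists = [rsfit_dist(table[0], u, mins, maxs) for u in table[1:]]
print(dists)  # U2 gives the smaller distance, so it is the best match
```

Here U2 would supply the value 0.8-row's second attribute (0.5) for the target, since its distance to U1 is the smallest.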
5.2.2 Parallel Methods
In parallel methods, input data sets are not pre-processed to handle
missing data values as in sequential methods. Instead, the algorithms for
performing various data mining operations are modified so that they operate
directly on the original incomplete datasets. So in parallel
methods, missing data values are handled in parallel with the data mining
operations.
Characteristic Set based Approach
The Characteristic Set based approach [Grzymala-Busse and
Siddhaye, 2004] is a Rough Set based parallel method for handling missing
values. Using this method, it is possible to induce decision rules directly from
the incomplete decision table. From the viewpoint of RST, any decision
table T = (U, A, d) defines an information function f that assigns the
attribute values. A decision table with an incompletely specified information
function is called an incomplete decision table. In such a table, lost values are
denoted by ‘?’ and don’t care conditions are denoted by ‘*’ [Grzymala-
Busse, 2003].
Grzymala Busse and Sachin Siddhaye [Grzymala-Busse and
Siddhaye, 2004] generalized the usual definition of indiscernibility relation
to describe incomplete decision tables. To implement this idea, the block of
attribute-value pairs introduced in Section 4.2.1 has been modified as
• If for an attribute a there exists a case x such that f(x, a) = ?, i.e., the
corresponding value is lost, then x should not be included in any block.
• If for an attribute a there exists a case x such that f(x, a) = *, i.e., the
value is a don’t care condition, then x should be included in the blocks
[(a, v)] of all specified values v of attribute a [Grzymala-Busse and
Siddhaye, 2004].
Based on the above argument, they suggested the idea of a Characteristic
Set KB(x), which is defined as the intersection of the blocks of attribute value
pairs (a, v) for all attributes a from B ⊆ A for which f(x, a) is specified. The
characteristic set KB(x) may be interpreted as the smallest set of cases that
are indistinguishable from x using all attributes from B, under the given
interpretation of the missing attribute values [Grzymala-Busse,
2004a][Grzymala-Busse, 2004b][Grzymala-Busse, 2004c].
From the definition of KB(x), for any attribute subset B of A, it is
possible to define a binary relation called a characteristic relation R(B) which
is defined as
$$(x, y) \in R(B) \iff y \in K_B(x) \qquad (5.2)$$
To induce rules from an incomplete decision table, the basic definition
of lower and upper approximations of a concept X is modified by considering
the definition of Characteristic Set KB(x). There are two ways of defining
lower and upper approximations. The first definition, called the subset
B-lower ($\underline{B}X$) and subset B-upper ($\overline{B}X$) approximations of X, is given by

$$\underline{B}X = \bigcup \{ K_B(x) \mid x \in U,\ K_B(x) \subseteq X \} \qquad (5.3)$$

$$\overline{B}X = \bigcup \{ K_B(x) \mid x \in U,\ K_B(x) \cap X \neq \emptyset \} \qquad (5.4)$$
The second possibility is a modification of the subset B-lower and
subset B-upper approximations obtained by replacing the universe U by the
concept X. With this modification, the subset B-lower approximation becomes
the concept B-lower approximation, denoted $\underline{B}X$ and defined as
$$\underline{B}X = \bigcup \{ K_B(x) \mid x \in X,\ K_B(x) \subseteq X \} \qquad (5.5)$$

Similarly, the concept B-upper approximation is defined as

$$\overline{B}X = \bigcup \{ K_B(x) \mid x \in X,\ K_B(x) \cap X \neq \emptyset \} \qquad (5.6)$$
These lower and upper approximations lead to the generation of
decision rules from the incomplete decision table. By providing the lower and
upper approximations of a concept X separately in the rule induction
algorithm MLEM2 [Grzymala-Busse, 1988][Grzymala-Busse, 2002], certain
rules and possible rules can be generated. So in this approach, missing
attribute values in a dataset are handled by computing the blocks of attribute
value pairs, the characteristic sets, and the lower and upper approximations.
Finally, it generates decision rules using the MLEM2 algorithm.
In the case of rule generation from incomplete decision tables, the
concept based lower and upper approximations are more useful compared to
that of the subset based lower and upper approximations. This is because
from the concept lower approximation, it is possible to generate the same set
of certain rules as from the subset lower approximation. In the case of the
concept upper approximation, the possible rules generated are more significant
and fewer in number compared to the rules from the subset approximation.
An example of an incomplete decision table is presented in Table 5.1.
Here the lost values are denoted by “?” and don’t care conditions are denoted
by “*”.
Table 5.1: Incomplete decision table with lost and don’t care values

Case  Temperature  Headache  Nausea  Flu
1     high         ?         no      yes
2     very_high    yes       yes     yes
3     ?            no        no      no
4     high         yes       yes     yes
5     high         ?         yes     no
6     normal       yes       no      no
7     normal       no        yes     no
8     *            yes       *       yes
The blocks of attribute value pairs, consistent with the interpretation of
lost values and don’t care conditions, are as follows:
[(Temperature, high)] = {1, 4, 5, 8} [(Temperature, very_high)] = {2, 8}
[(Temperature, normal)] = {6, 7, 8} [(Headache, yes)] = {2, 4, 6, 8}
[(Headache, no)] = {3, 7} [(Nausea, no)] = {1, 3, 6, 8}
[(Nausea, yes)] = {2, 4, 5, 7, 8}
For Table 5.1 and B = A, the values of the characteristic sets KB(x) are:
KA(1) = {1, 4, 5, 8}∩{1, 3, 6, 8}={1, 8}
KA(2) = {2, 8}∩{2, 4, 6, 8}∩{2, 4, 5, 7, 8} = {2, 8}
KA(3) = {3, 7}∩{1, 3, 6, 8} = {3}
KA(4) = {1, 4, 5, 8}∩{2, 4, 6, 8}∩{2, 4, 5, 7, 8} = {4, 8}
KA(5) = {1, 4, 5, 8}∩{2, 4, 5, 7, 8} = {4, 5, 8}
KA(6) = {6, 7, 8}∩{2, 4, 6, 8}∩{1, 3, 6, 8} = {6, 8}
KA(7) = {6, 7, 8}∩{3, 7}∩{2, 4, 5, 7, 8} = {7} and
KA(8) = {2, 4, 6, 8}
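The block and characteristic set computations above can be reproduced in a few lines of code; this is a sketch in which attribute positions 0, 1, 2 stand for Temperature, Headache and Nausea.

```python
# Recomputing the blocks and characteristic sets K_A(x) of Table 5.1;
# '?' is a lost value and '*' a don't care condition.
table = {
    1: ("high", "?", "no"),
    2: ("very_high", "yes", "yes"),
    3: ("?", "no", "no"),
    4: ("high", "yes", "yes"),
    5: ("high", "?", "yes"),
    6: ("normal", "yes", "no"),
    7: ("normal", "no", "yes"),
    8: ("*", "yes", "*"),
}

def block(a, v):
    # Cases whose value for attribute a is v; a '*' joins every block,
    # a '?' joins none.
    return {x for x, row in table.items() if row[a] in (v, "*")}

def characteristic_set(x):
    # Intersect the blocks of the specified attribute-value pairs of x.
    k = set(table)
    for a, v in enumerate(table[x]):
        if v not in ("?", "*"):
            k &= block(a, v)
    return k

print(sorted(block(0, "high")))        # → [1, 4, 5, 8]
print(sorted(characteristic_set(1)))   # → [1, 8]
print(sorted(characteristic_set(5)))   # → [4, 5, 8]
print(sorted(characteristic_set(8)))   # → [2, 4, 6, 8]
```

The printed sets agree with the blocks and the KA(x) values listed above.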
From Table 5.1, the concept A-lower and A-upper approximations are:

$\underline{A}\{1, 2, 4, 8\} = \{1, 2, 4, 8\}$
$\underline{A}\{3, 5, 6, 7\} = \{3, 7\}$
$\overline{A}\{1, 2, 4, 8\} = \{1, 2, 4, 6, 8\}$ and
$\overline{A}\{3, 5, 6, 7\} = \{3, 4, 5, 6, 7, 8\}$
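The concept lower and upper approximations of Eqs. 5.5 and 5.6 can be recomputed directly from the characteristic sets listed above; a short sketch:

```python
# Concept A-lower and A-upper approximations (Eqs. 5.5 and 5.6)
# computed from the characteristic sets K_A(x) of Table 5.1.
K = {1: {1, 8}, 2: {2, 8}, 3: {3}, 4: {4, 8},
     5: {4, 5, 8}, 6: {6, 8}, 7: {7}, 8: {2, 4, 6, 8}}

def concept_lower(X):
    out = set()
    for x in X:
        if K[x] <= X:        # K_A(x) lies fully inside the concept
            out |= K[x]
    return out

def concept_upper(X):
    out = set()
    for x in X:
        if K[x] & X:         # K_A(x) overlaps the concept
            out |= K[x]
    return out

flu_yes, flu_no = {1, 2, 4, 8}, {3, 5, 6, 7}
print(sorted(concept_lower(flu_yes)))  # → [1, 2, 4, 8]
print(sorted(concept_lower(flu_no)))   # → [3, 7]
print(sorted(concept_upper(flu_yes)))  # → [1, 2, 4, 6, 8]
print(sorted(concept_upper(flu_no)))   # → [3, 4, 5, 6, 7, 8]
```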
The following is the set of certain rules, in LERS [Grzymala-Busse,
1992] format, induced from Table 5.1 using the concept lower approximation:
2, 2, 2 (Temperature, high) & (Nausea, no) → (Flu, yes)
2, 3, 3 (Headache, yes) & (Nausea, yes) → (Flu, yes)
1, 2, 2 (Headache, no) → (Flu, no)
The corresponding possible rule set, induced from the concept upper
approximation, is:
2, 2, 2 (Temperature, high) & (Nausea, no) → (Flu, yes)
1, 3, 4 (Headache, yes) → (Flu, yes)
2, 3, 1 (Temperature, high) & (Nausea, yes) → (Flu, no)
1, 2, 3 (Temperature, normal) → (Flu, no)
1, 2, 2 (Headache, no) → (Flu, no)
5.3 A Modified Rough Set Approach to Handle Missing Values
In this method, to handle missing attribute values, a similarity relation
is proposed in place of the characteristic relation. A limitation of the
Characteristic Set based missing value handling approach is that it considers
only don’t care conditions when measuring the similarity between two
objects. But when the decision making process is considered, lost values
are more important than don’t care conditions. Also, in a practical dataset, it is
very difficult to distinguish lost values from don’t care conditions. Hence in
this approach, to measure the similarity, both lost values and don’t care
conditions are considered. Based on this, a new Similarity relation SIM is
suggested. For an attribute subset B, and interpreting both of these values as
missing values, the Similarity relation SIM(B) is defined as

$$SIM(B) = \{(x, y) \in U \times U \mid \forall a \in B,\ f(x, a) = f(y, a)\ \text{or}\ f(x, a) \in \{*, ?\}\ \text{or}\ f(y, a) \in \{*, ?\}\},\ B \subseteq A \qquad (5.7)$$
With this new Similarity relation, the set of objects similar to x using
all attributes from B, SB(x), is defined as

$$S_B(x) = \{ y \in U \mid (x, y) \in SIM(B) \},\ B \subseteq A \qquad (5.8)$$
SB(x) represents the maximal set of objects which are possibly
indiscernible by B with x using this new interpretation of the missing values.
The novelty of this modified approach is that instead of generating
concept lower and upper approximations using Characteristic Set, it is
possible to generate these approximations using the Similarity relation and
the objects in SB(x). So in the proposed approach, the concept B-lower
approximation $\underline{B}X$ is defined as

$$\underline{B}X = \bigcup \{ S_B(x) \mid x \in X\ \text{and}\ S_B(x) \subseteq X \} \qquad (5.9)$$

and the concept B-upper approximation $\overline{B}X$ is defined as

$$\overline{B}X = \bigcup \{ S_B(x) \mid x \in X\ \text{and}\ S_B(x) \cap X \neq \emptyset \} \qquad (5.10)$$
Since the Similarity relation and the objects in SB(x) are generated by
considering both lost values and don’t care conditions, rules induced from
the similarity based lower and upper approximations bring more consistency
in handling missing values in comparison with the Characteristic Set based
approach.
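As a sketch, the similarity classes SB(x) of Eq. 5.8 can be computed for Table 5.1, treating both ‘?’ and ‘*’ as missing values as the relation prescribes:

```python
# Similarity classes S_A(x) of Eq. 5.8 on Table 5.1, with both '?'
# and '*' interpreted as missing values.
MISSING = {"?", "*"}
table = {
    1: ("high", "?", "no"),
    2: ("very_high", "yes", "yes"),
    3: ("?", "no", "no"),
    4: ("high", "yes", "yes"),
    5: ("high", "?", "yes"),
    6: ("normal", "yes", "no"),
    7: ("normal", "no", "yes"),
    8: ("*", "yes", "*"),
}

def sim(x, y):
    # (x, y) is in SIM(A) iff every attribute agrees or is
    # missing on either side.
    return all(vx == vy or vx in MISSING or vy in MISSING
               for vx, vy in zip(table[x], table[y]))

def S(x):
    return {y for y in table if sim(x, y)}

print(sorted(S(1)))  # → [1, 3, 8]
print(sorted(S(4)))  # → [4, 5, 8]
```

Compared with the characteristic sets (e.g. KA(1) = {1, 8}), case 3 now joins S(1) because its lost Temperature value is treated the same way as a don’t care condition.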
From the decision rules generated using the Characteristic Set based and
Similarity based approaches, it is evident that as the number of missing
values increases, the quality of the derived rules is greatly affected and
decision making may lead to wrong conclusions. In these two approaches, the
generalized decisions can therefore be far from the actual decisions. So the
number of missing values in an object must be taken into account during the
process of rule induction from an incomplete decision table. Considering this,
we again modified the similarity relation defined in 5.7. For this modification,
the following basic ideas of the information function f are employed.
i. f(x, a) = v means that the object x has the value v for the attribute a,
where x ∈ U and a ∈ A.
ii. f(x, a) is said to represent a defined value if and only if f(x, a) ≠ * and
f(x, a) ≠ ?; i.e., in object x the attribute a is assigned a specific value v
from its domain.
iii. If f(x, a) is defined ∀a ∈ A, the set of conditional attributes, then x is
called a completely defined object.
iv. The number E = |{a ∈ B | f(x, a) = f(y, a), both values defined}|
represents the number of attributes having the same defined value in
both x and y with respect to the attribute subset B.
v. If Mx represents the number of missing values in x, then x is called a
well defined object with respect to the attribute subset B if and only if
Mx ≤ N/2, where N represents the number of attributes in B; otherwise
x is called a poorly defined object [Rady et al., 2007].
Based on these properties, a modified similarity relation MSIM(B) to handle
missing attribute values is defined as

$$MSIM(B) = \{(x, y) \in U \times U \mid E \geq N/2,\ x\ \text{and}\ y\ \text{are well defined objects in}\ U\} \qquad (5.11)$$

The set of objects similar to x based on MSIM(B), MSB(x), is given by

$$MS_B(x) = \{ y \in U \mid (x, y) \in MSIM(B) \} \qquad (5.12)$$
The set MSB(x) represents the set of objects possibly indiscernible
with x based on the defined basic properties and with the interpretation of the
missing values used to define the similarity relation SIM(B) given in 5.7. By
applying the modified similarity relation MSIM(B), the concept B-lower and
concept B-upper approximations are defined respectively as

$$\underline{B}X = \bigcup \{ MS_B(x) \mid x \in X,\ MS_B(x) \subseteq X \} \qquad (5.13)$$

$$\overline{B}X = \bigcup \{ MS_B(x) \mid x \in X,\ MS_B(x) \cap X \neq \emptyset \} \qquad (5.14)$$
Table 5.2: Incomplete decision table

Case  Temperature  Headache  Nausea  Flu
1     high         *         no      yes
2     very_high    yes       yes     yes
3     *            no        no      no
4     high         yes       yes     yes
5     high         *         yes     no
6     normal       yes       no      no
7     normal       no        yes     no
8     *            yes       *       yes
The application of the above definitions on Table 5.2 with B = A, the
entire set of attributes, produces the following results:
MSIM(A) = {(1,1), (2,2), (3,3), (4,4), (4,5), (5,4), (5,5), (6,6), (7,7), (8,8)}
MSA(1) = {1} MSA(2) = {2} MSA(3) = {3} MSA(4) ={4, 5}
MSA(5) = {4, 5} MSA(6) = {6} MSA(7) = {7} MSA(8) = {8}
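These sets can be recomputed with a short sketch. The reading of MSIM assumed here combines the E ≥ N/2 count with the no-conflict condition of SIM and always keeps the reflexive pairs (x, x); this interpretation is an assumption made because it reproduces the sets reported above, which Eq. 5.11 alone does not fully pin down.

```python
# Recomputing MSIM(A) and MS_A(x) on Table 5.2 (a sketch; the
# no-conflict condition and the reflexive pairs are assumed here).
MISSING = {"?", "*"}
N = 3  # number of conditional attributes
table = {
    1: ("high", "*", "no"),
    2: ("very_high", "yes", "yes"),
    3: ("*", "no", "no"),
    4: ("high", "yes", "yes"),
    5: ("high", "*", "yes"),
    6: ("normal", "yes", "no"),
    7: ("normal", "no", "yes"),
    8: ("*", "yes", "*"),
}

def well_defined(x):
    return sum(v in MISSING for v in table[x]) <= N / 2

def msim(x, y):
    if x == y:
        return True  # reflexive by convention
    if not (well_defined(x) and well_defined(y)):
        return False
    pairs = list(zip(table[x], table[y]))
    if any(vx != vy and vx not in MISSING and vy not in MISSING
           for vx, vy in pairs):
        return False  # conflicting specified values
    # E: attributes with equal specified values on both sides
    E = sum(vx == vy and vx not in MISSING for vx, vy in pairs)
    return E >= N / 2

MS = {x: {y for y in table if msim(x, y)} for x in table}
print(sorted(MS[4]), sorted(MS[1]))  # → [4, 5] [1]
```

Only cases 4 and 5 share enough equal specified values to be grouped; case 8, being poorly defined, is similar only to itself.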
By considering the set A of all the attributes of the decision table
presented in Table 5.2, the concept A-lower and A-upper approximations of
the two concepts {1, 2, 4, 8} and {3, 5, 6, 7} are:

$\underline{A}\{1, 2, 4, 8\} = \{1, 2, 8\}$    $\overline{A}\{1, 2, 4, 8\} = \{1, 2, 4, 5, 8\}$
$\underline{A}\{3, 5, 6, 7\} = \{3, 6, 7\}$    $\overline{A}\{3, 5, 6, 7\} = \{3, 4, 5, 6, 7\}$
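These approximations follow from Eqs. 5.13 and 5.14 applied to the MS_A(x) sets listed above; a short sketch:

```python
# Concept A-lower and A-upper approximations (Eqs. 5.13 and 5.14)
# from the MS_A(x) sets of Table 5.2.
MS = {1: {1}, 2: {2}, 3: {3}, 4: {4, 5},
      5: {4, 5}, 6: {6}, 7: {7}, 8: {8}}

def lower(X):
    return {e for x in X if MS[x] <= X for e in MS[x]}

def upper(X):
    return {e for x in X if MS[x] & X for e in MS[x]}

flu_yes, flu_no = {1, 2, 4, 8}, {3, 5, 6, 7}
print(sorted(lower(flu_yes)), sorted(upper(flu_yes)))  # → [1, 2, 8] [1, 2, 4, 5, 8]
print(sorted(lower(flu_no)), sorted(upper(flu_no)))    # → [3, 6, 7] [3, 4, 5, 6, 7]
```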
Rules in LERS format induced from Table 5.2 using concept
approximations are:
The certain rule set:
2, 3, 3 (Temperature, high) & (Nausea, no) → (Flu, yes)
2, 2, 2 (Headache, yes) & (Nausea, yes) → (Flu, yes)
1, 2, 2 (Temperature, normal) → (Flu, no)
The possible rule set:
2, 2, 2 (Temperature, high) & (Nausea, no) → (Flu, yes)
1, 3, 4 (Headache, yes) → (Flu, yes)
2, 1, 3 (Temperature, high) & (Nausea, yes) → (Flu, no)
1, 2, 2 (Headache, no) → (Flu, no)
Table 5.3: Complete decision table

Case  Temperature  Headache  Nausea  Flu
1     high         yes       no      yes
2     very_high    yes       yes     yes
3     high         no        no      no
4     high         yes       yes     yes
5     high         yes       yes     no
6     normal       yes       no      no
7     normal       no        yes     no
8     normal       yes       no      yes
The algorithm LEM2 induced the following rule set from the complete
decision table presented in Table 5.3.
Certain rule set:
1, 1, 1 (Temperature, very_high) → (Flu, yes)
3, 1, 1 (Temperature, high) & (Nausea, no) & (Headache, yes) →
(Flu, yes)
1, 2, 2 (Headache, no) → (Flu, no)
And possible rule set:
1, 4, 6 (Headache, yes) → (Flu, yes)
1, 2, 3 (Temperature, normal) → (Flu, no)
2, 1, 2 (Temperature, high) & (Nausea, yes) → (Flu, no)
1, 2, 2 (Headache, no) → (Flu, no)
5.4 Experimental Analysis and Results
Rules generated by the Characteristic Set based approach and the
proposed Similarity based approach are compared with the rules obtained
from the corresponding complete decision table. The Characteristic Set
based approach generates eight rules from the incomplete decision table,
whereas the proposed Similarity based approach generates only seven rules,
which is the same as the number of rules generated from the complete
decision table. In both the Characteristic Set based approach and the proposed
approach, four of the rules coincide with the rule set given by the complete
decision table. Hence the performance of the proposed approach is comparable
with the performance of the Characteristic Set based approach.
5.5 Summary
In this chapter, a survey of existing conventional missing value
handling methods, as well as three popular Rough Set based approaches to
deal with missing values, is presented. The missing value handling methods
are broadly classified into two groups – Sequential and Parallel. In Sequential
methods, as a preprocessing step, missing attribute values are directly replaced
with known values to produce a complete dataset. Then data mining
operations are performed on this complete dataset. Some traditional
approaches as well as the Rough Set based RSFit approach are presented
under sequential methods. In parallel methods, missing values are handled in
parallel with the data mining process, e.g., rule induction. A popular Rough
Set based parallel method of handling missing values called the Characteristic
Set based approach is implemented and tested with a small dataset. Based on
this, a modified approach to handle missing values is presented. The
proposed method is implemented and tested with the same dataset. It is
found that the method has performance comparable to that of the
Characteristic Set based approach. A detailed study of the effectiveness of
the proposed approach is left as future work.
Major contributions:
• A survey of Rough Set based sequential and parallel approaches to
handling missing values is presented in a concise manner.
• A novel Rough Set based parallel method of handling missing values
is proposed and its performance is shown to be comparable to that of
an existing method, namely the Characteristic Set based approach.