
Page 1: First pages-------------- - INFLIBNETshodhganga.inflibnet.ac.in/bitstream/10603/28504/14/14_chapter5.pdf · Chapter 5 146 Studies on Rough sets theory with applications to data mining

In real-world datasets, missing data is one of the major factors affecting data quality. Fortunately, missing data imputation techniques can be used to improve the quality of the data. The conventional methods for handling missing values focus either on ignoring all cases with missing attribute values or on substituting plausible values for the missing items. In contrast to the conventional methods, RST is an effective tool for knowledge acquisition directly from the original dataset with missing attribute values. In this chapter, different RST based paradigms for handling missing values are explained in detail with suitable examples. A popular RST based method of handling missing values, namely the Characteristic Set based approach, is implemented and tested with a small dataset. A modification of the Characteristic Set based approach is suggested in order to eliminate some of the drawbacks encountered while calculating the similarity among the objects in the dataset. The superiority of the proposed approach is established through experimental analysis.

5.1 Introduction

Missing attribute values are very common in real-world data sets. The existence of missing data can cause serious problems during data analysis: it leads to biased conclusions, which may affect the efficiency and power of many data mining operations. Hence the handling of missing values in a data set is one of the most important and challenging problems in research activities, especially in the area of data mining. The nature of the missing data is an important factor in the selection of an appropriate missing data handling technique for a particular data domain. A number of different techniques have been proposed by many researchers to handle the issues related to missing attribute values.

In a dataset, data values are missing for two main reasons: either they are lost or they are omitted due to don't care conditions [Grzymala-Busse


and Siddhaye, 2004a]. A value may be lost when it is mistakenly erased or unknowingly neglected. Similarly, at the time of data collection, respondents often leave items blank on questionnaires or refuse to report some personal information. For example, people with low income are less likely to report their family income than people with higher income. Such values, which are important for processing but missing, are termed lost values [Grzymala-Busse, 2003]. Another reason for incompleteness is the irrelevance of the data to be stored. If it is possible to classify a given case on the basis of some attribute values, the other attribute values are irrelevant and are not recorded. For example, if a doctor examines and diagnoses a patient using only some selected medical test reports, the other test results might be irrelevant and hence the doctor may leave such entries blank. Such missing attribute values are called don't care conditions [Grzymala-Busse and Siddhaye, 2004b].

The most appropriate way to handle missing data depends on how the data points are missing, specifically on the extent to which they are missing completely or conditionally at random. Based on the relationship between the measured variables and the probability of missing data, Little and Rubin classified missing data distributions into three types: Missing Completely at Random (MCAR), Missing at Random (MAR) and Missing Not at Random (MNAR) [Little and Rubin, 1987][Little and Rubin, 2002]. These assumptions dictate how a particular missing data technique will perform.

Missing Completely at Random (MCAR): Data are MCAR when the probability of missing data on a variable (attribute) x is unrelated both to the other measured variables and to the values of x itself. This means that the missingness is completely unsystematic and the observed data can be considered a random subsample of the hypothetically complete data. As an example,


consider a dataset collected in connection with an educational study, and in it the case of a child shifted to another district midway through the study. If the reason for the shift is a field in the data set, then the missing values for this attribute (field) are MCAR, provided the reason is unrelated to other attributes in the data set such as socio-economic status, disciplinary problems etc. In a practical data set, the possibility of the other attribute values being unrelated to the missing value is very rare, because usually the essential fields are included in the dataset. Yet most missing data handling methods produce accurate estimates only when the data are MCAR [Raghunathan, 2004][Muthen et al., 1987].

Missing at Random (MAR): This is a systematic type of missingness. In MAR, the missingness is not related to the underlying values of the incomplete attribute, but it is related to the other measured attributes in the dataset [Baraldi and Enders, 2010]. This means that missing values for an attribute x can be explained by other variables in the data set, but the missingness is random with respect to the underlying values of x itself; hence the data are called Missing at Random. For example, 'income' is MAR if the probability of missing data on income depends on an observed attribute like 'marital status' (with values single, married, divorced etc.) while, at the same time, the probability of a missing income value is unrelated to the values of the attribute 'income'.

Missing Not at Random (MNAR): This is also a systematic type of missingness. Data are MNAR if the probability of missing data is systematically related to the hypothetical values that are missing, even after accounting for the other observed data. In this case, the missingness is not completely at random and cannot be completely explained by the other attributes in the data set [Little and Rubin, 2002]. As an example, 'income' is MNAR if households with low income are less likely to report their income precisely because their income is low.


In this case, researchers adjust the imputed value based on other observed attributes such as occupation.
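The three mechanisms can be made concrete with a small simulation. The sketch below is illustrative only (plain Python, None marking a missing value; the dataset, field names and rates are hypothetical): under MCAR a value is masked by a coin flip, under MAR the masking depends on an observed attribute (marital status), and under MNAR it depends on the unobserved income itself.

```python
import random

def make_missing(records, mechanism, rate=0.3, seed=0):
    """Return a copy of `records` with 'income' masked (None) under the
    given missingness mechanism (illustrative sketch, not a real dataset)."""
    rng = random.Random(seed)
    out = [dict(r) for r in records]
    for r in out:
        if mechanism == "MCAR":
            # missingness unrelated to any variable
            drop = rng.random() < rate
        elif mechanism == "MAR":
            # missingness depends only on an observed attribute
            drop = r["marital_status"] == "single" and rng.random() < 2 * rate
        elif mechanism == "MNAR":
            # missingness depends on the (unobserved) income value itself
            drop = r["income"] < 30000 and rng.random() < 2 * rate
        else:
            raise ValueError(mechanism)
        if drop:
            r["income"] = None
    return out

people = [{"income": 20000 + 5000 * i,
           "marital_status": "single" if i % 2 else "married"}
          for i in range(10)]
mcar = make_missing(people, "MCAR")
mar = make_missing(people, "MAR")
mnar = make_missing(people, "MNAR")
```

Note the structural difference: MNAR can never mask a high income here, whereas MCAR can mask anything.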

In the next section, some common methods of handling missing values are discussed, including three different Rough Set based approaches. The first Rough Set based method, called the RSFit approach [Li and Cercone, 2006], predicts the missing attribute values based on a distance function. The second method, the Characteristic Set based approach [Grzymala-Busse and Siddhaye, 2004a], generates decision rules directly from the incomplete information system. Finally, a new parallel approach for handling missing values based on a similarity relation is proposed. The significance of these approaches in handling missing values is studied using experimental datasets and the results are analyzed.

5.2 Methods of Handling Missing Attribute Values

To deal with missing values, researchers employ a wide variety of techniques. In data mining, these techniques are usually classified into two groups: sequential and parallel [Grzymala-Busse and Grzymala-Busse, 2005].

5.2.1 Sequential Methods

In sequential methods, missing attribute values are replaced by known values as a preprocessing step. This produces a complete dataset with no missing values, which is therefore suitable for performing various data mining operations. Methodologists have proposed a number of such methods, which are reported in the literature [Peugh and Enders, 2004][Breiman et al., 1984][Brazdil and Bruha, 1992][Bruha, 2004][Allison, 2002][Little and Rubin, 2002]. These include traditional approaches such as List Wise Deletion, Mean Imputation, the Most Common Value method, the Global Closest Fit method etc. [Baraldi and Enders, 2010][Allison, 2002][Little and Rubin, 2002][Schafer and Graham,


2002]. In addition to these techniques, a modern technique based on the EM algorithm [Schafer, 1997] and a popular Rough Set based missing value imputation method, the RSFit approach [Li and Cercone, 2006], are widely used.

Traditional Approaches of Handling Missing Values

(i) Deleting all cases with missing attribute values: This is the most basic traditional method of handling missing data. It is also called List Wise (or Case Wise) deletion. In this method, all cases with missing attribute values are simply discarded, so that the data analysis is restricted to the cases with complete data [Baraldi and Enders, 2010]. List Wise deletion assumes that all the data are MCAR; when this assumption is violated, the analysis is likely to produce biased results. There are nevertheless some reasons [Allison, 2002][Little and Rubin, 2002] to consider it a reasonable method. Since the method discards all the cases containing missing values, valuable information may be lost from the original data set, and as a result the analysis may produce biased results.
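The deletion step itself is simple to express. A minimal sketch in plain Python (None marking a missing value; the example records are hypothetical):

```python
def listwise_delete(records):
    """List Wise deletion: keep only the cases with no missing (None) values."""
    return [r for r in records if all(v is not None for v in r.values())]

cases = [
    {"temperature": "high", "headache": "yes", "flu": "yes"},
    {"temperature": None, "headache": "no", "flu": "no"},
    {"temperature": "normal", "headache": None, "flu": "no"},
]
complete = listwise_delete(cases)  # only the first, fully specified case survives
```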

(ii) Most Common Value Method: This is one of the simplest methods for handling missing attribute values. In this method, the missing values are replaced with the most common known value of the attribute. An implementation of this method is available in the literature [Clark and Niblett, 1989]. Kononenko proposed a modification of this approach [Kononenko et al., 1984] in which, instead of considering all the cases, the most common value of the attribute restricted to a concept is used to fill the missing value. In other words, the replacement is based on a probability distribution representing the likelihood of possible values for the missing attribute, calculated from the frequency counts of the existing entries of the


missing attribute [Prajapathi and Prajapathi, 2011]. Using this approach, the complexity of the overall process can be reduced to a great extent.
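Both variants can be sketched together. In the illustration below (plain Python, None for a missing value; function and field names are hypothetical), passing a decision attribute restricts the mode to the case's own concept, as in Kononenko's modification:

```python
from collections import Counter

def most_common_imputation(records, attribute, concept_attr=None):
    """Replace None values of `attribute` with the most common known value,
    taken globally, or only over cases with the same decision (concept)
    value when `concept_attr` is given."""
    out = [dict(r) for r in records]
    for r in out:
        if r[attribute] is not None:
            continue
        pool = [s[attribute] for s in records
                if s[attribute] is not None
                and (concept_attr is None or s[concept_attr] == r[concept_attr])]
        if pool:
            r[attribute] = Counter(pool).most_common(1)[0][0]
    return out

cases = [
    {"headache": "yes", "flu": "yes"},
    {"headache": "yes", "flu": "yes"},
    {"headache": "no",  "flu": "no"},
    {"headache": None,  "flu": "no"},
]
globally = most_common_imputation(cases, "headache")
by_concept = most_common_imputation(cases, "headache", concept_attr="flu")
```

The global mode fills the gap with "yes", while the concept-restricted mode (only Flu = no cases) fills it with "no", illustrating why the restriction matters.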

(iii) Missing value imputation using all possible known values of the missing attribute: This method was suggested by Grzymala-Busse and is implemented in LERS (Learning from Examples using Rough Sets). Here, every case with a missing value is replaced with a set of cases formed by substituting each possible known value of the considered missing attribute [Grzymala-Busse, 1991]. A modification of this approach was proposed by Jerzy W. Grzymala-Busse and Ming Hu, in which a case with a missing value is replaced with the set of cases obtained by assigning all possible known values of the missing attribute restricted to the concept to which the case belongs [Grzymala-Busse and Hu, 2000]. A major limitation of this approach is the possibility of inconsistency in the resulting dataset. However, standard Rough Set methods are available to induce rules from an inconsistent information system.
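The expansion step can be sketched as follows (plain Python, None for a missing value; all names are hypothetical, and the tiny table is illustrative). Passing the decision attribute restricts the substituted values to the case's concept, as in the Grzymala-Busse and Hu variant:

```python
def expand_case(records, case, attribute, decision=None):
    """Replace one case with case[attribute] missing (None) by a set of
    cases, one per known value of the attribute; if `decision` is given,
    the values are taken only from cases of the same concept."""
    pool = records if decision is None else \
        [r for r in records if r[decision] == case[decision]]
    values = sorted({r[attribute] for r in pool if r[attribute] is not None})
    return [{**case, attribute: v} for v in values]

cases = [
    {"temp": "high", "flu": "yes"},
    {"temp": "very_high", "flu": "yes"},
    {"temp": "normal", "flu": "no"},
    {"temp": None, "flu": "yes"},
]
expanded = expand_case(cases, cases[3], "temp")                  # all known values
concept_expanded = expand_case(cases, cases[3], "temp", "flu")   # same concept only
```

The unrestricted expansion yields three cases (high, normal, very_high); restricting to the Flu = yes concept yields only two.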

(iv) Replacing missing attribute values by the attribute mean: In this method, every missing value of an attribute is replaced by the arithmetic mean (average) of its known values. This method is suitable only for attributes with numerical values. Since all missing values of an attribute receive the same imputed value, the method may introduce considerable distortions in the data distribution [Baraldi and Enders, 2010]. To overcome this drawback, a modified method was reported in which the missing value is replaced with the arithmetic mean of all known values of the attribute restricted to the concept [Grzymala-Busse and Grzymala-Busse, 2005].
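The two mean-based variants can be sketched in the same shape as the mode-based method above (plain Python, None for a missing value; the temperature figures are hypothetical):

```python
def mean_imputation(records, attribute, concept_attr=None):
    """Replace None values of a numerical attribute by the arithmetic mean
    of its known values, optionally restricted to the case's concept."""
    out = [dict(r) for r in records]
    for r in out:
        if r[attribute] is not None:
            continue
        pool = [s[attribute] for s in records
                if s[attribute] is not None
                and (concept_attr is None or s[concept_attr] == r[concept_attr])]
        if pool:
            r[attribute] = sum(pool) / len(pool)
    return out

temps = [
    {"temp": 39.0, "flu": "yes"},
    {"temp": 41.0, "flu": "yes"},
    {"temp": 37.0, "flu": "no"},
    {"temp": None, "flu": "yes"},
]
global_fill = mean_imputation(temps, "temp")          # mean of 39, 41, 37
concept_fill = mean_imputation(temps, "temp", "flu")  # mean of 39, 41 only
```

The global mean gives 39.0, while the concept-restricted mean (Flu = yes cases only) gives 40.0.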


(v) Closest Fit Method: Grzymala-Busse et al. proposed an efficient missing value imputation method based on replacing a missing attribute value in a particular case by the known value of another case which is approximately similar to the former [Grzymala-Busse et al., 2002]. This similar case is considered the Closest Fit case. To find the Closest Fit case, the target vector (the vector representing the case with the missing attribute value) is compared with all other candidate vectors in the dataset; hence this method is also known as the Global Closest Fit approach. For each case, a distance to the target vector, which is a measure of similarity, is computed [Grzymala-Busse et al., 2002]. The case which gives the minimum distance is the closest fitting case, and the value in this closest fit case is used to replace the missing value in the target vector. If there is a tie, any one vector is arbitrarily selected as the closest fit case. The same authors proposed a modification of this approach known as the Concept Closest Fit, in which the closest fit case restricted to the concept is used to fill in the missing values in the target case. A major limitation of the Closest Fit Method is that it does not guarantee that all missing values are replaced with known values.
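A minimal Global Closest Fit sketch is given below. It is only an illustration: the distance used here simply counts mismatched or unknown symbolic values, a simplified stand-in for the distance of [Grzymala-Busse et al., 2002], and all names and cases are hypothetical.

```python
def mismatch_distance(u, v, attrs):
    """0 per attribute when both values are known and equal, else 1
    (a simplified symbolic distance)."""
    return sum(0 if u[a] is not None and u[a] == v[a] else 1 for a in attrs)

def global_closest_fit(records, target_idx, attribute, attrs):
    """Fill records[target_idx][attribute] from the candidate case at
    minimum distance; ties are broken by first occurrence."""
    target = records[target_idx]
    best, best_d = None, None
    for i, cand in enumerate(records):
        if i == target_idx or cand[attribute] is None:
            continue  # a candidate must itself have the value known
        d = mismatch_distance(target, cand, attrs)
        if best_d is None or d < best_d:
            best, best_d = cand, d
    return {**target, attribute: best[attribute]}

cases = [
    {"temp": "high", "headache": "yes", "nausea": None},
    {"temp": "high", "headache": "yes", "nausea": "yes"},
    {"temp": "normal", "headache": "no", "nausea": "no"},
]
filled = global_closest_fit(cases, 0, "nausea", ["temp", "headache"])
```

Case 2 matches the target on both known attributes (distance 0), so its nausea value is copied in. Restricting the candidate loop to cases with the same decision value would give the Concept Closest Fit variant.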

Expectation-Maximization (EM) Algorithm

The EM algorithm, proposed by Schafer, is a widely recommended statistical method for handling missing attribute values [Schafer, 1997][Schafer, 1999][Schafer and Olsen, 1998]. To impute the missing values, the algorithm first computes statistical parameters such as variances, covariances and means from the complete data, perhaps obtained after List Wise deletion. With the help of these parameter values, the algorithm generates a regression model, and using this model the missing values are initially imputed in the original dataset. Having filled in the missing data, the parameter values are re-estimated to modify the regression model. Using this


modified model, the missing data values are imputed again. The process is repeated until the solution stabilizes. At this point, the algorithm returns the maximum likelihood estimates of the parameters, and the final regression equation used to fill in the missing data is constructed from these parameters. A small error value is added at each stage of the computation of the variance to compensate for the error in estimation. This approach is superior to the traditional missing value handling techniques because it produces unbiased estimates with both MCAR and MAR data [Schafer and Olsen, 1998][Enders, 2006]. As no data values are thrown out of the dataset, this method can be considered more powerful than the other methods. Although the EM algorithm is efficient in many situations, it produces biased estimates when the data are MNAR; but compared to the traditional methods, the bias tends to be far less.
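The iterate-until-stable idea can be illustrated with a deliberately stripped-down sketch: a single predictor, an ordinary least-squares line refit after each imputation round, and no added error term, so this is the fit-impute-refit loop in miniature rather than Schafer's full EM. All names and data are hypothetical.

```python
def fit_line(pairs):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    b = sxy / sxx
    return my - b * mx, b

def em_style_impute(xs, ys, tol=1e-9, max_iter=100):
    """Impute missing y values (None) from a regression refitted after each
    imputation round, until the filled-in values stabilise."""
    # initial fit on the complete cases only (as after List Wise deletion)
    a, b = fit_line([(x, y) for x, y in zip(xs, ys) if y is not None])
    filled = [y if y is not None else a + b * x for x, y in zip(xs, ys)]
    for _ in range(max_iter):
        a, b = fit_line(list(zip(xs, filled)))  # re-estimate the parameters
        new = [y if y is not None else a + b * x for x, y in zip(xs, ys)]
        if max(abs(u - v) for u, v in zip(new, filled)) < tol:
            return new
        filled = new
    return filled

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, None, 8.0]
imputed = em_style_impute(xs, ys)  # the gap falls on the line y = 2x
```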

RSFit approach to Assign Missing Values

This is a Rough Set based approach for handling missing attribute values, proposed by J. Li and N. Cercone. In this method, the RST concepts of reduct and core are effectively utilized for the prediction of missing values. The attributes of a reduct/core depend on each other according to certain statistical measures. Since it represents the whole information system, the attribute-value pairs contained in the reduct/core are sufficient to predict the missing values [Li and Cercone, 2006].

In the RSFit approach, the similar attribute-value pairs for the data instances containing missing values are identified and the most relevant value is supplied. Let the decision table T be the input to the RSFit approach and Ck represent the target attribute for which the value is missing. As a first step of the process, the core of the data set is generated. If the target attribute Ck does


not belong to the core, a reduct of T is considered, and if Ck does not belong to the reduct either, the attribute Ck is added to the reduct. By considering the attributes in the reduct/core including Ck, a reduced decision table T′ is constructed. To predict the missing values, a match function (or distance function) is formulated from this new decision table T′. There are two possibilities for designing the match function. The first one is called global, where all data instances in T′ are considered for designing the match function [Li and Cercone, 2006]. In the second possibility, instead of considering all the data instances, only the data instances having the same decision attribute value (i.e., the same concept) are considered; hence this process is called the concept RSFit approach. To define the match function, let Ui = {Vi1, Vi2, ..., Vik, ..., Vim, di} be the ith object containing the missing attribute value Vik for Ck, 1 ≤ k ≤ m, and let Uj be any data instance from the universe U. The distance from the target data instance Ui to Uj, dist(Ui, Uj), is defined as:

$dist(U_i, U_j) = \frac{|v_{i1} - v_{j1}|}{\max_1 - \min_1} + \frac{|v_{i2} - v_{j2}|}{\max_2 - \min_2} + \cdots + \frac{|v_{im} - v_{jm}|}{\max_m - \min_m}$      5.1

where $\frac{|v_{i1} - v_{j1}|}{\max_1 - \min_1}$ represents the normalized difference between the values of the first attribute in Ui and Uj, with max1 and min1 the largest and smallest known values of that attribute [Li and Cercone, 2006]. For an attribute

with a missing value, this component is set to 1, the maximum possible difference. The smallest value of dist(Ui, Uj) is taken and the corresponding Uj is concluded to be the best matched object for Ui. The target attribute Ck of Ui is then assigned the corresponding value from Uj. If there are multiple matching cases, one case is selected randomly for Ck. Non-numeric attributes are converted to numeric ones during the pre-processing stage.
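The distance of equation 5.1 can be sketched directly (plain Python; attributes assumed already numeric, None marking the missing component, and the ranges and values purely hypothetical):

```python
def rsfit_distance(ui, uj, ranges):
    """Distance of equation 5.1: sum over the reduct/core attributes of
    |v_ik - v_jk| / (max_k - min_k); a component involving a missing value
    (None) contributes 1, the maximum difference."""
    d = 0.0
    for k, (lo, hi) in enumerate(ranges):
        vi, vj = ui[k], uj[k]
        if vi is None or vj is None:
            d += 1.0
        else:
            d += abs(vi - vj) / (hi - lo)
    return d

# hypothetical decision table restricted to a two-attribute reduct
ranges = [(36.0, 42.0), (0.0, 10.0)]       # (min_k, max_k) per attribute
target = [39.0, None]                      # U_i with the target value missing
candidates = [[39.0, 2.0], [42.0, 9.0]]
dists = [rsfit_distance(target, u, ranges) for u in candidates]
best = candidates[dists.index(min(dists))]  # best matched object U_j
```

The first candidate matches the target on the known attribute (component 0), so it wins and its value for the target attribute would be copied into Ui.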


5.2.2 Parallel Methods

In parallel methods, the input data sets are not pre-processed to handle missing data values as in sequential methods. Instead, the algorithms for performing the various data mining operations are modified so that they operate directly on the original incomplete datasets. So, in parallel methods, missing data values are handled in parallel with the data mining operations.

Characteristic Set based Approach

The Characteristic Set based approach [Grzymala-Busse and Siddhaye, 2004] is a Rough Set based parallel method for handling missing values. Using this method, it is possible to induce decision rules directly from an incomplete decision table. From the viewpoint of RST, any decision table T = (U, A, d) defines an information function f that assigns the various attribute values. A decision table with an incompletely specified information function is called an incomplete decision table. In it, lost values are denoted by '?' and don't care conditions by '*' [Grzymala-Busse, 2003].

Grzymala-Busse and Sachin Siddhaye [Grzymala-Busse and Siddhaye, 2004] generalized the usual definition of the indiscernibility relation to describe incomplete decision tables. To implement this idea, the block of attribute-value pairs introduced in Section 4.2.1 is modified as follows:

• If for an attribute a there exists a case x such that f(x, a) = ?, i.e., the corresponding value is lost, then x should not be included in any block.

• If for an attribute a there exists a case x such that f(x, a) = *, i.e., the value is a don't care condition, then x should be included in the blocks


[(a, v)] of all specified values v of attribute a [Grzymala-Busse and

Siddhaye, 2004].
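These two rules can be sketched in plain Python (all names hypothetical) on the small flu table presented later as Table 5.1: a '?' case joins no block of that attribute, while a '*' case joins the block of every specified value.

```python
def attribute_value_blocks(table, attributes):
    """Blocks [(a, v)]: a lost value '?' puts the case into no block of that
    attribute, while a don't care '*' puts it into the blocks of all
    specified values v."""
    blocks = {}
    for a in attributes:
        specified = {row[a] for row in table.values() if row[a] not in ("?", "*")}
        for v in specified:
            blocks[(a, v)] = {case for case, row in table.items()
                              if row[a] == v or row[a] == "*"}
    return blocks

# the incomplete decision table of Table 5.1 (condition attributes only)
table = {
    1: {"Temperature": "high",      "Headache": "?",   "Nausea": "no"},
    2: {"Temperature": "very_high", "Headache": "yes", "Nausea": "yes"},
    3: {"Temperature": "?",         "Headache": "no",  "Nausea": "no"},
    4: {"Temperature": "high",      "Headache": "yes", "Nausea": "yes"},
    5: {"Temperature": "high",      "Headache": "?",   "Nausea": "yes"},
    6: {"Temperature": "normal",    "Headache": "yes", "Nausea": "no"},
    7: {"Temperature": "normal",    "Headache": "no",  "Nausea": "yes"},
    8: {"Temperature": "*",         "Headache": "yes", "Nausea": "*"},
}
blocks = attribute_value_blocks(table, ["Temperature", "Headache", "Nausea"])
```

Case 8, with Temperature = *, appears in every Temperature block, while case 3, with Temperature = ?, appears in none.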

Based on the above argument, they suggested the idea of a Characteristic Set KB(x), defined as the intersection of the blocks of attribute-value pairs (a, v) over all attributes a from B ⊆ A for which f(x, a) is specified. The characteristic set KB(x) may be interpreted as the smallest set of cases that are indistinguishable from x using all attributes from B, under the given interpretation of the missing attribute values [Grzymala-Busse, 2004a][Grzymala-Busse, 2004b][Grzymala-Busse, 2004c].

From the definition of KB(x), for any attribute subset B of A it is possible to define a binary relation, called the characteristic relation R(B), as

(x, y) ∈ R(B) iff y ∈ KB(x)      5.2

To induce rules from an incomplete decision table, the basic definitions of the lower and upper approximations of a concept X are modified by considering the definition of the Characteristic Set KB(x). There are two ways of defining the lower and upper approximations. The first definitions, called the subset B-lower ($\underline{B}X$) and subset B-upper ($\overline{B}X$) approximations of X, are

$\underline{B}X = \bigcup \{K_B(x) \mid x \in U, K_B(x) \subseteq X\}$ and      5.3

$\overline{B}X = \bigcup \{K_B(x) \mid x \in U, K_B(x) \cap X \neq \emptyset\}$      5.4

The second possibility is a modification of the subset B-lower and subset B-upper approximations in which the universe U is replaced by the concept X. With this modification, the subset B-lower approximation becomes the concept B-lower approximation, denoted $\underline{B}X$ and defined as


$\underline{B}X = \bigcup \{K_B(x) \mid x \in X, K_B(x) \subseteq X\}$      5.5

Similarly, the concept B-upper approximation is defined as

$\overline{B}X = \bigcup \{K_B(x) \mid x \in X, K_B(x) \cap X \neq \emptyset\}$      5.6

These lower and upper approximations lead to the generation of decision rules from the incomplete decision table. By providing the lower and upper approximations of a concept X separately to the rule induction algorithm MLEM2 [Grzymala-Busse, 1988][Grzymala-Busse, 2002], the certain rules and the possible rules can be generated. So, in this approach, missing attribute values in a dataset are handled by computing the blocks of attribute-value pairs, the characteristic sets, and the lower and upper approximations; finally, decision rules are generated using the MLEM2 algorithm.

For rule generation from incomplete decision tables, the concept based lower and upper approximations are more useful than the subset based ones. From the concept lower approximation it is possible to generate the same set of certain rules as from the subset lower approximation, while the possible rules generated from the concept upper approximation are more significant and fewer in number than the rules from the subset approximation.

An example of an incomplete decision table is presented in Table 5.1.

Here the lost values are denoted by “?” and don’t care conditions are denoted

by “*”.


Table 5.1: Incomplete decision table with lost and don't care values

Case   Temperature   Headache   Nausea   Flu
1      high          ?          no       yes
2      very_high     yes        yes      yes
3      ?             no         no       no
4      high          yes        yes      yes
5      high          ?          yes      no
6      normal        yes        no       no
7      normal        no         yes      no
8      *             yes        *        yes

The blocks of attribute-value pairs consistent with the interpretation of missing attribute values as lost values and don't care conditions are as follows:

[(Temperature, high)] = {1, 4, 5, 8}
[(Temperature, very_high)] = {2, 8}
[(Temperature, normal)] = {6, 7, 8}
[(Headache, yes)] = {2, 4, 6, 8}
[(Headache, no)] = {3, 7}
[(Nausea, no)] = {1, 3, 6, 8}
[(Nausea, yes)] = {2, 4, 5, 7, 8}

For Table 5.1 and B = A, the values of the characteristic sets KB(x) are:

KA(1) = {1, 4, 5, 8}∩{1, 3, 6, 8}={1, 8}

KA(2) = {2, 8}∩{2, 4, 6, 8}∩{2, 4, 5, 7, 8} = {2, 8}

KA(3) = {3, 7}∩{1, 3, 6, 8} = {3}

KA(4) = {1, 4, 5, 8}∩{2, 4, 6, 8}∩{2, 4, 5, 7, 8} = {4, 8}

KA(5) = {1, 4, 5, 8}∩{2, 4, 5, 7, 8} = {4, 5, 8}

KA(6) = {6, 7, 8}∩{2, 4, 6, 8}∩{1, 3, 6, 8} = {6, 8}

KA(7) = {6, 7, 8}∩{3, 7}∩{2, 4, 5, 7, 8} = {7} and

KA(8) = {2, 4, 6, 8}
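The intersection step is mechanical and can be sketched in plain Python (all names hypothetical), reproducing the KA(x) values listed above from the blocks of Table 5.1:

```python
# blocks of attribute-value pairs for Table 5.1, as listed above
blocks = {
    ("Temperature", "high"): {1, 4, 5, 8}, ("Temperature", "very_high"): {2, 8},
    ("Temperature", "normal"): {6, 7, 8}, ("Headache", "yes"): {2, 4, 6, 8},
    ("Headache", "no"): {3, 7}, ("Nausea", "no"): {1, 3, 6, 8},
    ("Nausea", "yes"): {2, 4, 5, 7, 8},
}
rows = {
    1: ("high", "?", "no"),      2: ("very_high", "yes", "yes"),
    3: ("?", "no", "no"),        4: ("high", "yes", "yes"),
    5: ("high", "?", "yes"),     6: ("normal", "yes", "no"),
    7: ("normal", "no", "yes"),  8: ("*", "yes", "*"),
}
attrs = ("Temperature", "Headache", "Nausea")

def characteristic_set(x):
    """K_A(x): intersection of the blocks [(a, v)] over all attributes of x
    whose value is specified (neither '?' nor '*')."""
    k = set(rows)                       # start from the whole universe U
    for a, v in zip(attrs, rows[x]):
        if v not in ("?", "*"):
            k &= blocks[(a, v)]
    return k

K = {x: characteristic_set(x) for x in rows}
```

For case 1, only Temperature and Nausea are specified, so K(1) = {1, 4, 5, 8} ∩ {1, 3, 6, 8} = {1, 8}, matching the hand computation above.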


From Table 5.1, the concept A-lower and A-upper approximations are:

$\underline{A}\{1, 2, 4, 8\} = \{1, 2, 4, 8\}$
$\underline{A}\{3, 5, 6, 7\} = \{3, 7\}$
$\overline{A}\{1, 2, 4, 8\} = \{1, 2, 4, 6, 8\}$ and
$\overline{A}\{3, 5, 6, 7\} = \{3, 4, 5, 6, 7, 8\}$
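These four sets follow directly from equations 5.5 and 5.6. A plain Python sketch (hypothetical names; the characteristic sets hard-coded from the computation above) reproduces them:

```python
# characteristic sets K_A(x) for Table 5.1, as computed above
K = {1: {1, 8}, 2: {2, 8}, 3: {3}, 4: {4, 8},
     5: {4, 5, 8}, 6: {6, 8}, 7: {7}, 8: {2, 4, 6, 8}}

def concept_lower(X):
    """Concept B-lower approximation (eq. 5.5): union of K(x) over x in X
    with K(x) contained in X."""
    out = set()
    for x in X:
        if K[x] <= X:
            out |= K[x]
    return out

def concept_upper(X):
    """Concept B-upper approximation (eq. 5.6): union of K(x) over x in X
    with K(x) intersecting X."""
    out = set()
    for x in X:
        if K[x] & X:
            out |= K[x]
    return out

flu_yes, flu_no = {1, 2, 4, 8}, {3, 5, 6, 7}
```

For instance, K(8) = {2, 4, 6, 8} is not contained in {1, 2, 4, 8} (it contains case 6), so case 8 contributes nothing to the lower approximation of Flu = yes, but K(8) does contribute case 6 to its upper approximation.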

The following is the set of certain rules, in LERS [Grzymala-Busse, 1992] format, induced from Table 5.1 using the concept lower approximation.

2, 2, 2 (Temperature, high) & (Nausea, no) → (Flu, yes)

2, 3, 3 (Headache, yes) & (Nausea, yes) → (Flu, yes)

1, 2, 2 (Headache, no) → (Flu, no)

The corresponding set of possible rules induced from the concept upper approximation is

2, 2, 2 (Temperature, high) & (Nausea, no) → (Flu, yes)

1, 3, 4 (Headache, yes) → (Flu, yes)

2, 3, 1 (Temperature, high) & (Nausea, yes) → (Flu, no)

1, 2, 3 (Temperature, normal) → (Flu, no)

1, 2, 2 (Headache, no) → (Flu, no)

5.3 A Modified Rough Set Approach to Handle Missing Values

In this method, a similarity relation is proposed in place of the characteristic relation to handle missing attribute values. A limitation of the Characteristic Set based missing value handling approach is that it considers only don't care conditions when measuring the similarity between two objects. But when the decision making process is considered, the lost values are more important than the don't care conditions. Also, in a practical dataset, it is


very difficult to distinguish lost values from don't care conditions. Hence, in this approach, both lost values and don't care conditions are considered when measuring the similarity. Based on this, a new similarity relation SIM is suggested. For an attribute subset B, interpreting both these values simply as missing values, the similarity relation SIM(B) is defined as

$SIM(B) = \{(x, y) \in U \times U \mid \forall a \in B,\ f(x, a) = f(y, a) \text{ or } f(x, a) = */? \text{ or } f(y, a) = */?\},\ B \subseteq A$      5.7

With this new similarity relation, the set of objects similar to x using all attributes from B, SB(x), is defined as

$S_B(x) = \{y \in U \mid (x, y) \in SIM(B)\},\ B \subseteq A$      5.8

SB(x) represents the maximal set of objects which are possibly indiscernible from x by B under this new interpretation of the missing values.

The novelty of this modified approach is that, instead of generating the concept lower and upper approximations using the Characteristic Set, these approximations are generated using the similarity relation and the objects in SB(x). So, in the proposed approach, the concept B-lower approximation $\underline{B}X$ is defined as

$\underline{B}X = \bigcup \{S_B(x) \mid x \in X \text{ and } S_B(x) \subseteq X\}$ and      5.9

the concept B-upper approximation $\overline{B}X$ is defined as

$\overline{B}X = \bigcup \{S_B(x) \mid x \in X \text{ and } S_B(x) \cap X \neq \emptyset\}$      5.10

Since the similarity relation and the objects in SB(x) are generated by considering both lost values and don't care conditions, rules induced from the similarity based lower and upper approximations bring more consistency


in handling missing values in comparison with the Characteristic Set based

approach.
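Equations 5.9 and 5.10 can be sketched directly on the condition attributes of Table 5.2 (given later in this chapter). The helper names below are illustrative, and the printed sets follow from this reading of the similarity relation, not from the original text.

```python
# Sketch of the concept B-lower (Eq. 5.9) and B-upper (Eq. 5.10)
# approximations built from the similarity classes S_B(x) of Eq. 5.8.
MISSING = {"*", "?"}

# case id -> (Temperature, Headache, Nausea), condition attributes of Table 5.2
table = {
    1: ("high", "*", "no"),       2: ("very_high", "yes", "yes"),
    3: ("*", "no", "no"),         4: ("high", "yes", "yes"),
    5: ("high", "*", "yes"),      6: ("normal", "yes", "no"),
    7: ("normal", "no", "yes"),   8: ("*", "yes", "*"),
}

def S(x):
    """S_B(x): objects agreeing with x wherever both values are defined."""
    return {y for y in table
            if all(table[x][a] == table[y][a]
                   or table[x][a] in MISSING or table[y][a] in MISSING
                   for a in range(3))}

def lower(X):
    """Eq. 5.9: union of the classes S_B(x), x in X, wholly contained in X."""
    return set().union(*[S(x) for x in X if S(x) <= X])

def upper(X):
    """Eq. 5.10: union of the classes S_B(x), x in X, that intersect X."""
    return set().union(*[S(x) for x in X if S(x) & X])

flu_yes = {1, 2, 4, 8}  # the concept (Flu, yes) in Table 5.2
print(sorted(lower(flu_yes)))  # -> [2, 8]
print(sorted(upper(flu_yes)))  # -> [1, 2, 3, 4, 5, 6, 8]
```

The wide gap between the lower and upper approximations here illustrates the point made next: as missing values accumulate, the SB(x) classes grow and the approximations become loose.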

From the decision rules generated using the Characteristic Set based and similarity based approaches, it is evident that as the number of missing values increases, the quality of the derived rules deteriorates, and decision making based on them can lead to wrong conclusions. In these two approaches the generalized decisions are therefore far from the actual decisions. So the number of missing values in an object must be taken into account during rule induction from an incomplete decision table. Considering this, the similarity relation defined in 5.7 is modified again. For this modification, the following basic properties of the information function f are employed.

i. f(x, a) = v means that the object x has the value v for the attribute a, where x ∈ U and a ∈ A.

ii. f(x, a) is said to represent a defined value if and only if f(x, a) ≠ * / ?; i.e., in object x the attribute a is assigned a specific value v from its domain.

iii. If f(x, a) ≠ * / ? for all a ∈ A, the set of conditional attributes, then x is called a completely defined object.

iv. The number E = |{a ∈ B : f(x, a) = f(y, a)}| represents the number of attributes having the same value in both x and y with respect to the attribute subset B.

v. If Mx represents the number of missing values in x, then x is called a well defined object with respect to the attribute subset B if and only if Mx ≤ N/2, where N represents the number of attributes in B; otherwise x is called a poorly defined object [Rady et al., 2007].
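Properties (iv) and (v) translate into two small predicates. In the sketch below, E counts only attributes where both objects carry the same defined value; that is one possible reading of (iv), chosen here because it is consistent with the MSIM(A) results reported later. Names are illustrative.

```python
# Sketch of properties (iv) and (v): E counts attributes on which x and y
# carry the same defined value, and an object is "well defined" when at
# most half of its attribute values are missing.
MISSING = {"*", "?"}

# case id -> (Temperature, Headache, Nausea), condition attributes of Table 5.2
table = {
    1: ("high", "*", "no"),       2: ("very_high", "yes", "yes"),
    3: ("*", "no", "no"),         4: ("high", "yes", "yes"),
    5: ("high", "*", "yes"),      6: ("normal", "yes", "no"),
    7: ("normal", "no", "yes"),   8: ("*", "yes", "*"),
}
N = 3  # number of condition attributes in B

def E(x, y):
    """Property (iv): number of attributes with the same defined value."""
    return sum(u == v and u not in MISSING
               for u, v in zip(table[x], table[y]))

def well_defined(x):
    """Property (v): M_x <= N/2, with M_x the number of missing values in x."""
    return sum(v in MISSING for v in table[x]) <= N / 2

print(E(4, 5), well_defined(8))  # -> 2 False
```

Case 8, with two of three values missing, is the only poorly defined object in Table 5.2.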


Based on these properties, a modified similarity relation MSIM(B) to handle missing attribute values is defined as

MSIM(B) = {(x, y) ∈ U × U | E ≥ N/2, x and y are well defined objects in U}   5.11

The set of objects similar to x based on MSIM(B), MSB(x), is given by

MSB(x) = {y ∈ U : (x, y) ∈ MSIM(B)}   5.12

The set MSB(x) represents the set of objects possibly indiscernible from x based on the basic properties defined above, together with the interpretation of missing values used to define the similarity relation SIM(B) in 5.7. By applying the modified similarity relation MSIM(B), the concept B-lower and concept B-upper approximations are defined respectively as

B̲X = ∪ {MSB(x) | x ∈ X and MSB(x) ⊆ X}   5.13

B̄X = ∪ {MSB(x) | x ∈ X and MSB(x) ∩ X ≠ ∅}   5.14

Table 5.2: Incomplete decision table

Case  Temperature  Headache  Nausea  Flu
 1    high         *         no      yes
 2    very_high    yes       yes     yes
 3    *            no        no      no
 4    high         yes       yes     yes
 5    high         *         yes     no
 6    normal       yes       no      no
 7    normal       no        yes     no
 8    *            yes       *       yes


The application of the above definitions to Table 5.2 with B = A, the entire set of condition attributes, produces the following results:

MSIM(A) = {(1,1), (2,2), (3,3), (4,4), (4,5), (5,4), (5,5), (6,6), (7,7), (8,8)}

MSA(1) = {1}    MSA(2) = {2}    MSA(3) = {3}    MSA(4) = {4, 5}

MSA(5) = {4, 5}    MSA(6) = {6}    MSA(7) = {7}    MSA(8) = {8}

Considering the set A of all the attributes of the decision table presented in Table 5.2, the A-lower and A-upper approximations of the two concepts {1, 2, 4, 8} and {3, 5, 6, 7} are:

A̲{1, 2, 4, 8} = {1, 2, 8}    Ā{1, 2, 4, 8} = {1, 2, 4, 5, 8}

A̲{3, 5, 6, 7} = {3, 6, 7}    Ā{3, 5, 6, 7} = {3, 4, 5, 6, 7}
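These classes and approximations can be reproduced with a short script. One point needs an assumption: read literally, Eq. 5.11 excludes pairs involving the poorly defined case 8, yet MSA(8) = {8} above, and case pairs such as (2, 4) agree on two defined values but do not appear in MSIM(A). The sketch below therefore reads MSIM(A) as: x = y, or both objects are well defined, compatible under SIM(A) of Eq. 5.7, and E ≥ N/2. That reflexive, SIM-refining reading is an assumption, chosen because it matches the listed results.

```python
# Reproduces MSIM(A), the classes MS_A(x), and the approximations for
# Table 5.2.  ASSUMPTION: (x, y) in MSIM iff x == y, or x and y are well
# defined, (x, y) is in SIM(A) (Eq. 5.7), and E >= N/2 (Eq. 5.11).
MISSING = {"*", "?"}
table = {
    1: ("high", "*", "no"),       2: ("very_high", "yes", "yes"),
    3: ("*", "no", "no"),         4: ("high", "yes", "yes"),
    5: ("high", "*", "yes"),      6: ("normal", "yes", "no"),
    7: ("normal", "no", "yes"),   8: ("*", "yes", "*"),
}
N = 3

def sim(x, y):
    """Eq. 5.7: values agree or at least one is missing, on every attribute."""
    return all(table[x][a] == table[y][a]
               or table[x][a] in MISSING or table[y][a] in MISSING
               for a in range(N))

def E(x, y):
    """Number of attributes with the same defined value in x and y."""
    return sum(u == v and u not in MISSING for u, v in zip(table[x], table[y]))

def well(x):
    """Well defined: at most N/2 missing values."""
    return sum(v in MISSING for v in table[x]) <= N / 2

def MS(x):
    """MS_A(x) under the assumed reflexive reading of Eq. 5.11."""
    return {y for y in table
            if y == x or (well(x) and well(y) and sim(x, y) and E(x, y) >= N / 2)}

def lower(X):   # Eq. 5.13
    return set().union(*[MS(x) for x in X if MS(x) <= X])

def upper(X):   # Eq. 5.14
    return set().union(*[MS(x) for x in X if MS(x) & X])

print(sorted(MS(4)))                # -> [4, 5]
print(sorted(lower({1, 2, 4, 8})))  # -> [1, 2, 8]
print(sorted(upper({1, 2, 4, 8})))  # -> [1, 2, 4, 5, 8]
```

Compared with the SIM-based classes, the MS_A(x) classes are much smaller, which is why the lower and upper approximations above are so much tighter.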

Rules in LERS format induced from Table 5.2 using concept

approximations are:

The certain rule set:

2, 3, 3 (Temperature, high) & (Nausea, no) → (Flu, yes)

2, 2, 2 (Headache, yes) & (Nausea, yes) → (Flu, yes)

1, 2, 2 (Temperature, normal) → (Flu, no)

The possible rule set:

2, 2, 2 (Temperature, high) & (Nausea, no) → (Flu, yes)

1, 3, 4 (Headache, yes) → (Flu, yes)

2, 1, 3 (Temperature, high) & (Nausea, yes) → (Flu, no)

1, 2, 2 (Headache, no) → (Flu, no)


Table 5.3: Complete decision table

Case  Temperature  Headache  Nausea  Flu
 1    high         yes       no      yes
 2    very_high    yes       yes     yes
 3    high         no        no      no
 4    high         yes       yes     yes
 5    high         yes       yes     no
 6    normal       yes       no      no
 7    normal       no        yes     no
 8    normal       yes       no      yes

The algorithm LEM2 induced the following rule set from the complete

decision table presented in Table 5.3.

Certain rule set:

1, 1, 1 (Temperature, very_high) → (Flu, yes)

3, 1, 1 (Temperature, high) & (Nausea, no) & (Headache, yes) →

(Flu, yes)

1, 2, 2 (Headache, no) → (Flu, no)

Possible rule set:

1, 4, 6 (Headache, yes) → (Flu, yes)

1, 2, 3 (Temperature, normal) → (Flu, no)

2, 1, 2 (Temperature, high) & (Nausea, yes) → (Flu, no)

1, 2, 2 (Headache, no) → (Flu, no)


5.4 Experimental Analysis and Results

Rules generated by the Characteristic Set based approach and by the proposed similarity based approach are compared with the rules obtained from the corresponding complete decision table. The Characteristic Set based approach generates eight rules from the incomplete decision table, whereas the proposed similarity based approach generates only seven rules, which equals the number of rules generated from the complete decision table. In both the Characteristic Set based approach and the proposed approach, four of the induced rules coincide with rules in the set given by the complete decision table. Hence the performance of the proposed approach is comparable with that of the Characteristic Set based approach.
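The rule-set comparison described above can be sketched by encoding each rule as a (conditions, decision) pair, dropping the LERS strength numbers, and intersecting the two sets. Note that rules repeated across the certain and possible sets are merged here, so the sets below hold the distinct rules only; the encoding itself is illustrative.

```python
# Counting rules shared between the similarity-based rule set induced from
# the incomplete Table 5.2 and the LEM2 rule set from the complete Table 5.3.
# Each rule is a (sorted conditions, decision) pair; LERS numbers are dropped.
sim_rules = {
    (("Nausea=no", "Temperature=high"), "Flu=yes"),
    (("Headache=yes", "Nausea=yes"), "Flu=yes"),
    (("Temperature=normal",), "Flu=no"),
    (("Headache=yes",), "Flu=yes"),
    (("Nausea=yes", "Temperature=high"), "Flu=no"),
    (("Headache=no",), "Flu=no"),
}
complete_rules = {
    (("Temperature=very_high",), "Flu=yes"),
    (("Headache=yes", "Nausea=no", "Temperature=high"), "Flu=yes"),
    (("Headache=no",), "Flu=no"),
    (("Headache=yes",), "Flu=yes"),
    (("Temperature=normal",), "Flu=no"),
    (("Nausea=yes", "Temperature=high"), "Flu=no"),
}

shared = sim_rules & complete_rules
print(len(shared))  # -> 4
```

The four shared rules are exactly those reported in the comparison: (Temperature, normal) → (Flu, no), (Headache, yes) → (Flu, yes), (Temperature, high) & (Nausea, yes) → (Flu, no), and (Headache, no) → (Flu, no).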

5.5 Summary

In this chapter, a survey of existing conventional missing value handling methods, as well as three popular Rough Set based approaches to deal with missing values, is presented. The missing value handling methods are broadly classified into two groups: sequential and parallel. In sequential methods, missing attribute values are directly replaced with known values as a preprocessing step, producing a complete dataset on which data mining operations are then performed. Some traditional approaches, as well as a Rough Set based approach called RSFit, are presented among the sequential methods. In parallel methods, missing values are handled in parallel with the data mining process, e.g., rule induction. A popular Rough Set based parallel method of handling missing values, the Characteristic Set based approach, is implemented and tested with a small dataset. Based on this, a modified approach to handle missing values is presented. The


proposed method is implemented and tested with the same dataset. It is found that the method has performance comparable to that of the Characteristic Set based approach. A detailed study of the effectiveness of the proposed approach is left as future work.

Major contributions:

• A survey of Rough Set based sequential and parallel approaches to handling missing values is presented in a concise manner.

• A novel Rough Set based parallel method of handling missing values is proposed, and its effectiveness is established by comparing its performance with that of an existing method, namely the Characteristic Set based approach.
