Missing values problem in Data Mining

Missing values problem in Data Mining JELENA STOJANOVIC 03/20/2014

Page 1: Missing values  problem in Data  Mining

Missing values problem in Data Mining
JELENA STOJANOVIC
03/20/2014

Page 2: Missing values  problem in Data  Mining

Outline

Missing data problem
  Missing values in attributes
  Missing values in target variable
  Missingness mechanisms

Approaches to missing values
  Eliminate Data Objects
  Estimate Missing Values
  Handling the Missing Value During Analysis

Experimental analysis
Conclusion

Page 3: Missing values  problem in Data  Mining

Missing Data problem

Real datasets have many serious data quality problems: incomplete, redundant, inconsistent and noisy data reduce the performance of data mining algorithms.

Missing data is a common issue in almost every real dataset. It is caused by varied factors:

high cost involved in measuring variables,
failure of sensors,
reluctance of respondents to answer certain questions,
an ill-designed questionnaire.

Page 4: Missing values  problem in Data  Mining

Missing values in datasets

The missing data problem arises when values for one or more variables are missing from recorded observations.

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Page 5: Missing values  problem in Data  Mining

Missing values in attributes (independent variables)

Page 6: Missing values  problem in Data  Mining

Missing labels

Page 7: Missing values  problem in Data  Mining

Missingness mechanism

Missing Completely At Random

Missing At Random

Missing Not At Random

Page 8: Missing values  problem in Data  Mining

Missing Completely at Random - MCAR

Missing Completely at Random: the missingness mechanism does not depend on the variable of interest or on any other variable observed in the dataset.

Which values get collected and observed is arbitrary and does not depend on any other variable of the dataset.

Example: respondents decide whether to reveal their income levels based on coin flips.

This type of missing data is very rarely found, and the best method is to ignore the cases with missing values.

Page 9: Missing values  problem in Data  Mining

MCAR (continued)

Estimate E(X) from partially observed data (m marks a missing value):
X* = [0, 1, m, m, 1, 1, m, 0, 0, m, …]   E(X) = ?

True data and missingness indicator (Rx = 1 where X is missing):
X  = [0, 1, 0, 0, 1, 1, 0, 0, 0, 1, …]   E(X) = 0.5
Rx = [0, 0, 1, 1, 0, 0, 1, 0, 0, 1, …]

If MCAR:
X* = [0, 1, m, m, 1, 1, m, 0, 0, m, …] and E(X) = 3/6 = 0.5
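The estimate on this slide can be reproduced with a few lines of Python. This is only a sketch: it uses the ten values shown on the slide and ignores the elided "…" part, and it represents "m" with None, which is a choice made here and not part of the original slides.

```python
# Partially observed sample from the slide; None stands for the "m" entries.
x_star = [0, 1, None, None, 1, 1, None, 0, 0, None]

# Missingness indicator Rx: 1 where the value is missing, 0 where it is observed.
r_x = [1 if v is None else 0 for v in x_star]

# Under MCAR the observed values behave like a random subsample of X,
# so their plain average is a reasonable estimate of E(X).
observed = [v for v in x_star if v is not None]
estimate = sum(observed) / len(observed)

print(r_x)       # [0, 0, 1, 1, 0, 0, 1, 0, 0, 1]
print(estimate)  # 0.5
```

Three of the six observed values are 1, which gives the 3/6 = 0.5 shown on the slide. Under MAR or MNAR the same average would generally not be a trustworthy estimate.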

Page 10: Missing values  problem in Data  Mining

Missing At Random - MAR

Missing at Random: the probability that an instance has a missing value for an attribute may depend on the known values, but not on the value of the missing data itself.

Missingness can only be explained by variables that are fully observed; partially observed variables cannot be responsible for missingness in others. This is an unrealistic assumption in many cases.

Example: if women in the population are more likely not to reveal their age, the percentage of missing data among female individuals will be higher.

Page 11: Missing values  problem in Data  Mining

Missing Not At Random - MNAR

When data are neither MCAR nor MAR.

The missingness mechanism depends on another partially observed variable.

Situation in which the missingness mechanism depends on the actual value of the missing data: the probability of an instance having a missing value for an attribute could depend on the value of that attribute.

Difficult task: the missingness has to be modeled.

Page 12: Missing values  problem in Data  Mining

Missing data consequences

Missing data can significantly bias the outcome of research studies.

Response profiles of non-respondents and respondents can be significantly different from each other.

Performing the analysis using only complete cases and ignoring the cases with missing values can reduce the sample size, thereby substantially reducing estimation efficiency.

Many algorithms and statistical techniques are tailored to draw inferences from complete datasets. It may be difficult or even inappropriate to apply them to incomplete datasets.

Page 13: Missing values  problem in Data  Mining

Handling missing values

In general, methods to handle missing values belong either to sequential methods (preprocessing methods) or to parallel methods (methods in which missing attribute values are taken into account during the main process of acquiring knowledge).

Existing approaches:
Eliminate Data Objects or Attributes
Estimate Missing Values
Handling the Missing Values During Analysis

Page 14: Missing values  problem in Data  Mining

Eliminate data objects

Page 15: Missing values  problem in Data  Mining

Eliminating data attributes

Page 16: Missing values  problem in Data  Mining

Estimate Missing Values

most common/mean value

Page 17: Missing values  problem in Data  Mining

Imputation

Page 18: Missing values  problem in Data  Mining

Imputation - nearest neighbor

K-NN

Page 19: Missing values  problem in Data  Mining

Handling the Missing Value During Analysis

Missing values are taken into account during the main process of acquiring knowledge

Some examples:
Clustering - similarity between the objects is calculated using only the attributes that do not have missing values.
C4.5 - splitting cases with missing attribute values into fractions and adding these fractions to new case subsets.
CART - a method of surrogate splits to handle missing attribute values.
Rule-based induction algorithms - missing values treated as "do not care" conditions.
Pairwise deletion - used to evaluate statistical parameters from available information.
CRF - marginalizing out the effect of missing-label instances on labeled data.
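The clustering item above can be illustrated with a small distance function. This is a minimal sketch, assuming numeric attributes, Euclidean distance, and None as the marker for a missing value; the function name partial_distance is hypothetical, and no rescaling for the number of skipped attributes is applied.

```python
import math


def partial_distance(a, b):
    """Euclidean distance over the attributes observed in both objects.

    None marks a missing value. Returns None when the two objects
    share no observed attribute at all.
    """
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return None
    return math.sqrt(sum((x - y) ** 2 for x, y in shared))


p = [1.0, None, 3.0, 4.0]
q = [1.0, 2.0, None, 8.0]

# Only the first and last attributes are observed in both objects.
print(partial_distance(p, q))  # 4.0
```

Because attributes are silently skipped, distances computed from different attribute subsets are only roughly comparable, which is why the result is an approximation.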

Page 20: Missing values  problem in Data  Mining

Internal missing data strategy used by C4.5

C4.5 uses a probabilistic approach to handle missing data.

C4.5:
Multiple split (each node T can be partitioned into subsets T1, T2, …, Tn)
Evaluation measure: information gain ratio

If there exist missing values in an attribute X, C4.5 uses the subset with all known values of X to calculate the information gain.

Once a test based on an attribute X is chosen, C4.5 uses a probabilistic approach to partition the instances with missing values in X.

Page 21: Missing values  problem in Data  Mining

Internal missing data strategy used by C4.5 (continued)

When an instance in T with known value is assigned to a subset Ti:
the probability of that instance belonging to subset Ti is 1
the probability of that instance belonging to all other subsets is 0

C4.5 associates to each instance in Ti a weight representing the probability of that instance belonging to Ti.

If the instance has a known value and satisfies the test with outcome Oi, then it is assigned to Ti with weight 1.

If the instance has an unknown value, it is assigned to all partitions, with a different weight for each one. The weight for the partition Ti is the probability that the instance belongs to Ti. This probability is estimated as the sum of the weights of instances in T known to satisfy the test with outcome Oi, divided by the sum of the weights of the cases in T with known values on the attribute X.
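The weighting rule described above can be sketched in Python. This is an illustration of the rule only, not C4.5's actual code: the function name partition_with_weights, the (values, weight) case representation and the use of None for an unknown value are assumptions made here. The sketch also multiplies the estimated probability by the case's incoming weight, so that a case that is already fractional stays fractional; with incoming weight 1 this reduces to the rule on the slide.

```python
from collections import defaultdict


def partition_with_weights(cases, attribute):
    """Split weighted cases on `attribute` using fractional weights.

    Each case is a (values, weight) pair; `values` is a dict and None
    marks an unknown value. Returns {outcome: [(values, weight), ...]}.
    """
    # Sum of weights of the cases known to satisfy each outcome Oi.
    known_weight = defaultdict(float)
    for values, weight in cases:
        if values[attribute] is not None:
            known_weight[values[attribute]] += weight
    total_known = sum(known_weight.values())

    subsets = defaultdict(list)
    for values, weight in cases:
        outcome = values[attribute]
        if outcome is not None:
            # Known value: the case goes to exactly one subset, weight unchanged.
            subsets[outcome].append((values, weight))
        else:
            # Unknown value: the case goes to every subset, scaled by the
            # estimated probability of that outcome.
            for o, w in known_weight.items():
                subsets[o].append((values, weight * w / total_known))
    return dict(subsets)


cases = [
    ({"Refund": "Yes"}, 1.0),
    ({"Refund": "No"}, 1.0),
    ({"Refund": "No"}, 1.0),
    ({"Refund": "No"}, 1.0),
    ({"Refund": None}, 1.0),  # unknown value, split 1/4 : 3/4
]
subsets = partition_with_weights(cases, "Refund")
totals = {o: sum(w for _, w in members) for o, members in subsets.items()}
print(totals)  # {'Yes': 1.25, 'No': 3.75}
```

The total weight across subsets stays equal to the number of cases (5.0), since the unknown case is only redistributed, never duplicated.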

Page 22: Missing values  problem in Data  Mining

Experimental Analysis*

Using error rates estimated by cross-validation, compare the performance of:
K-nearest neighbour algorithm as an imputation method
Mean or mode imputation method
Internal algorithms used by C4.5 and CN2 to learn with missing data

Missing values were artificially implanted, in different rates and attributes (more than 50%)

Data sets from UCI [10]: Bupa, Cmc, Pima and Breast

*G. Batista and M.C. Monard, "An Analysis of Four Missing Data Treatment Methods for Supervised Learning," Applied Artificial Intelligence, vol. 17, pp. 519-533, 2003

Page 23: Missing values  problem in Data  Mining

Comparative results for the Breast data set

Page 24: Missing values  problem in Data  Mining

Comparative results for the Bupa data set

Page 25: Missing values  problem in Data  Mining

Comparative results for the Cmc data set

Page 26: Missing values  problem in Data  Mining

Comparative results for the Pima data set

Page 27: Missing values  problem in Data  Mining

Conclusion

Missing data is a huge data quality problem.
There is a vast variety of causes of missingness.
In general, there is no best, universal method of handling missing values.
Different types of missingness mechanism (MCAR, MAR, MNAR) and different datasets require different approaches to dealing with missing values.

Page 28: Missing values  problem in Data  Mining

Thank you for your attention!

Questions?

Page 29: Missing values  problem in Data  Mining

Homework problem:

1. List the types of missingness mechanisms. For each, state one approach you think is appropriate for handling it and briefly explain why.

Page 30: Missing values  problem in Data  Mining

Eliminate data objects or attributes

Eliminate objects with missing values (listwise deletion)
  Simple and effective strategy
  Even partially specified objects contain some information
  If many objects have missing values, reliable analysis can be difficult or impossible
  Unless data are missing completely at random, listwise deletion can bias the outcome

Eliminate attributes that have missing values
  Carefully: these attributes may be critical for analysis

Listwise deletion and pairwise deletion are used in approximately 96% of studies in the social and behavioral sciences.
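Both elimination strategies are easy to sketch on a list of records. The example below is an assumption-laden illustration: the rows, the shortened column names and the missing entries are made up for this sketch (they are not the data from the slides), and None stands for a missing value.

```python
def listwise_delete(rows):
    """Keep only the objects (rows) that have no missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]


def drop_incomplete_attributes(rows):
    """Remove every attribute (column) that is missing in at least one row."""
    incomplete = {k for r in rows for k, v in r.items() if v is None}
    return [{k: v for k, v in r.items() if k not in incomplete} for r in rows]


# Hypothetical records with two missing entries.
rows = [
    {"Refund": "Yes", "Status": "Single", "Income": 125},
    {"Refund": "No", "Status": None, "Income": 100},
    {"Refund": "No", "Status": "Single", "Income": None},
    {"Refund": "Yes", "Status": "Married", "Income": 120},
]

complete_rows = listwise_delete(rows)
reduced_rows = drop_incomplete_attributes(rows)

print(len(complete_rows))  # 2
print(reduced_rows[0])     # {'Refund': 'Yes'}
```

With only two missing entries, listwise deletion already discards half of the objects and attribute elimination discards two of the three attributes, which shows how quickly both strategies can lose information.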

Page 31: Missing values  problem in Data  Mining

Estimate Missing Values

Missing data can sometimes be estimated reliably using values of the remaining cases or attributes:
replacing a missing attribute value by the most common value of that attribute,
replacing a missing attribute value by the mean for numerical attributes,
assigning all possible values to the missing attribute value,
assigning to a missing attribute value the corresponding value taken from the closest case,
replacing a missing attribute value by a new value, computed from a new data set, considering the original attribute as a decision (imputation).

For this strategy, machine learning algorithms are commonly used:
Unstructured (Decision trees, Naive Bayes, K-Nearest neighbors…)
Structured (Hidden Markov Models, Conditional Random Fields, Structured SVM…)

Some of these methods are more accurate but more computationally expensive, so different situations require different solutions.
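Three of the simpler estimation strategies listed above (mean, most common value, closest case) can be sketched as follows. The function names are hypothetical, None marks a missing value, and the closest-case version assumes numeric rows, squared Euclidean distance over the attributes both rows have observed, and at least one donor row with the target attribute present.

```python
from collections import Counter
from statistics import mean


def impute_mean(column):
    """Replace missing numeric values by the mean of the observed ones."""
    m = mean(v for v in column if v is not None)
    return [m if v is None else v for v in column]


def impute_most_common(column):
    """Replace missing values by the most common observed value."""
    observed = [v for v in column if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in column]


def impute_from_closest_case(rows, target):
    """Copy the target attribute from the closest case that has it (1-NN)."""
    donors = [r for r in rows if r[target] is not None]

    def distance(r, d):
        # Compare only attributes, other than the target, observed in both rows.
        return sum(
            (a - b) ** 2
            for i, (a, b) in enumerate(zip(r, d))
            if i != target and a is not None and b is not None
        )

    result = []
    for r in rows:
        filled = list(r)
        if filled[target] is None:
            nearest = min(donors, key=lambda d: distance(r, d))
            filled[target] = nearest[target]
        result.append(filled)
    return result


print(impute_mean([125, None, 70, 120]))
print(impute_most_common(["Single", None, "Married", "Single"]))
print(impute_from_closest_case([[1.0, 10.0], [2.0, 20.0], [1.1, None]], target=1))
```

Mean and most-common-value imputation look at one attribute at a time, while the closest-case variant uses the other attributes of the same object, which is usually more informative but costs a search over the donor cases.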

Page 32: Missing values  problem in Data  Mining

Handling the Missing Value During Analysis

Missing attribute values are taken into account during the main process of acquiring knowledge.

In clustering, similarity between the objects is calculated using only the attributes that do not have missing values. Similarity in this case is only an approximation, but unless the total number of attributes is small or the number of missing values is high, the degree of inaccuracy may not matter much.

C4.5 induces a decision tree; during tree generation, it splits cases with missing attribute values into fractions and adds these fractions to new case subsets.

A method of surrogate splits to handle missing attribute values was introduced in CART.

In a modification of the LEM2 (Learning from Examples Module, version 2) rule induction algorithm, rules are induced from the original data set, with missing attribute values considered to be "do not care" conditions or lost values.

In statistics, pairwise deletion is used to evaluate statistical parameters from available information: to compute the covariance of variables X and Y, all those cases or observations in which both X and Y are observed are used, regardless of whether other variables in the dataset have missing values.

In CRFs, the effect of missing-label instances on labeled data is marginalized out, thus utilizing information from all observations and preserving the observed graph structure.
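The pairwise deletion rule for the covariance of X and Y can be sketched in a few lines. This is a minimal illustration, assuming None marks a missing value and using the sample covariance with an n - 1 denominator; the function name pairwise_covariance and the example data are made up for this sketch.

```python
def pairwise_covariance(x, y):
    """Sample covariance of x and y over the cases where both are observed."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two cases with both values observed")
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in pairs) / (n - 1)


# Hypothetical data: each variable is missing in a different case.
x = [1.0, 2.0, None, 4.0, 5.0]
y = [2.0, None, 6.0, 8.0, 10.0]

# Only cases 1, 4 and 5 have both X and Y observed.
cov_xy = pairwise_covariance(x, y)
print(round(cov_xy, 4))  # 8.6667
```

Unlike listwise deletion, a case that is missing some third variable Z would still contribute here, so different entries of a covariance matrix may end up being computed from different subsets of cases.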