Evaluating data quality issues from an industrial data set
Gernot Liebchen, Bheki Twala, Mark Stephens, Martin Shepperd, Michelle Cartwright
What is it all about?
• Motivations
• Dataset – the origin & quality issues
• Noise & cleaning methods
• The Experiment
• Issues & conclusion
• Future Work
Motivations
• A previous investigation compared three noise-handling methods: robust algorithms (pruning), filtering, and polishing
• Predictive accuracy was highest with polishing, followed by pruning, and only then by filtering
• But suspicions about the results were raised (at EASE)
Suspicions about previous investigation
• The dataset contained missing values, which were imputed (artificially created) while building the model (a decision tree)
• Polishing alters the data (what impact can that have?)
• The methods were evaluated using the predictions of another decision tree – can the findings be supported by a metrics specialist?
Why do we bother?
• Good quality data is important for good quality predictions and assessments
• How can we hope for good quality results if the quality of the input data is not good?
• The data is used for a variety of different purposes – esp. analysis and estimation support
The Dataset
• Given a large dataset provided by EDS
• The original dataset contains more than 10,000 cases with 22 attributes
• Contains information about software projects carried out since the beginning of the 1990s
• Some attributes are more administrative (e.g. Project Name, Project ID) and will not have any impact on software productivity
Suspicions
• The data might contain noise
• This was confirmed by preliminary analysis of the data, which also indicated the existence of outliers
How could it occur? (in the case of the dataset)
• Input errors (some teams might be more meticulous than others), and the person approving the data might not be as meticulous
• Misunderstood standards
• The input tool might not provide range checking (or only limited checking)
• “Service Excellence” dashboard in headquarters
• Local management pressure
Suspicious Data Example
• Start Date: 01/08/2002 vs 01/06/2002
• Finish Date: 24/02/2004 vs 09/02/2004
• Name: *******Rel 24 vs *******Rel 24
• FP Count: 1522 vs 1522
• Effort: 38182.75 vs 33461.5
• Country: IRELAND vs UK
• Industry Sector: Government vs Government
• Project Type: Enhance. vs Enhance.
• Etc.
• But there were also examples with extremely high/low FP counts per hour (1 FP for 6916.25 hours; or 52 FP in 4 hours; 1746 FP in 468 hours)
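The extreme productivity figures above suggest a simple sanity check on hours per function point. A minimal sketch – the plausible band limits here are illustrative assumptions, not thresholds from the study:

```python
def flag_unrealistic(cases, min_h_per_fp=0.5, max_h_per_fp=100.0):
    """Return (name, hours-per-FP) for cases whose productivity falls
    outside a plausible band; the band limits are assumptions."""
    flagged = []
    for name, fp, effort in cases:
        ratio = effort / fp  # hours per function point
        if not (min_h_per_fp <= ratio <= max_h_per_fp):
            flagged.append((name, ratio))
    return flagged

# the extreme examples from the slide, plus one plausible case
cases = [
    ("low-productivity", 1, 6916.25),    # 6916.25 hours for 1 FP
    ("high-productivity", 52, 4),        # 52 FP in 4 hours
    ("plausible", 1522, 38182.75),       # ~25 hours per FP
]
print([name for name, _ in flag_unrealistic(cases)])
# → ['low-productivity', 'high-productivity']
```

A check like this encodes the domain knowledge the slides later argue is needed; the decision-tree detector alone cannot tell an unrealistic case from a genuine outlier.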
What imperfections could occur?
• Noise – random errors
• Outliers – exceptional “true” cases
• Missing data
• From now on, noise and outliers will both be called noise, because both are unwanted
Noise detection can be
• Distance based (e.g. visualisation methods; Cook’s, Mahalanobis and Euclidean distance; distance clustering)
• Distribution based (e.g. neural networks, forward search algorithms and robust tree modelling)
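As an illustration of the distance-based family, here is a pure-Python sketch of the Mahalanobis distance for two attributes (e.g. size and effort); a real analysis would use a statistics library, and the toy data below is invented:

```python
def mahalanobis_sq(points):
    """Squared Mahalanobis distance of each 2-D point from the
    sample mean, using the sample covariance matrix."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # sample covariance matrix entries (n-1 denominator)
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy * sxy
    ixx, iyy, ixy = syy / det, sxx / det, -sxy / det  # 2x2 inverse
    return [
        ixx * (x - mx) ** 2 + 2 * ixy * (x - mx) * (y - my)
        + iyy * (y - my) ** 2
        for x, y in points
    ]

# toy (size, effort) data with one gross outlier at the end
data = [(1, 10), (2, 21), (3, 29), (4, 41), (5, 50), (50, 2)]
d2 = mahalanobis_sq(data)
print(max(range(len(d2)), key=d2.__getitem__))  # → 5, the outlier
```

Unlike plain Euclidean distance, the Mahalanobis distance accounts for the correlation between size and effort, so a large project with proportionally large effort is not flagged.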
What to do with noise?
• First, detection (we used decision trees – usually a pattern-detection tool in data mining – here used to categorise the data in a training set, with cases tested in a test set)
• Three basic options for cleaning: polishing, filtering, pruning
Polishing/Filtering/Pruning
• Polishing – identifying the noise and correcting it
• Filtering – Identifying the noise and eliminating it
• Pruning – avoiding overfitting (trying to ignore leverage effects); the instances which lead to overfitting can be seen as noise and are taken out
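The slides use a decision tree as the noise detector. As a toy stand-in (not the study's actual tree), the sketch below fits a one-level decision stump and applies filtering: instances the model misclassifies are treated as noise and dropped.

```python
def fit_stump(data):
    """Fit a one-level decision stump on `data`, a list of
    (feature_vector, label) pairs: try every (feature, threshold)
    split and keep the one with fewest misclassifications."""
    labels = {lab for _, lab in data}
    best = None
    for f in range(len(data[0][0])):
        for xs, _ in data:
            t = xs[f]
            for left in labels:
                for right in labels:
                    errs = sum(
                        (left if v[f] <= t else right) != lab
                        for v, lab in data
                    )
                    if best is None or errs < best[0]:
                        best = (errs, f, t, left, right)
    _, f, t, left, right = best
    return lambda xs: left if xs[f] <= t else right

def filter_noise(data):
    """Filtering: drop every instance the fitted model misclassifies."""
    model = fit_stump(data)
    return [(xs, lab) for xs, lab in data if model(xs) == lab]

# six clean instances plus one mislabelled ("noisy") instance
data = [([1], "a"), ([2], "a"), ([3], "a"),
        ([4], "b"), ([5], "b"), ([6], "b"),
        ([1.5], "b")]  # injected noise
print(len(filter_noise(data)))  # → 6: the noisy instance is dropped
```

Polishing would instead replace the noisy label with the model's prediction, and pruning would restrict the model itself so the noisy instance never gets its own leaf.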
What did we do? & How did we do it?
• Compared the results of filtering and pruning, and discussed the implications of pruning
• Reduced the dataset to eliminate cases with missing values (avoiding missing-value imputation)
• Produced lists of “noisy” instances and polished counterparts
• Passed them on to Mark (as the metrics specialist)
Results
• Filtering produced a list of 226 cases from 436 (36% in noise list / 21% in cleaned set)
• Pruning produced a list of 191 from 436 (33% in noise list / 25% in cleaned set)
• Both were inspected and both contain a large number of possible true cases and unrealistic cases (productivity)
Results 2
• By just inspecting historical data it was not possible to judge which method performed better
• As a noise detector, the decision tree does not detect unrealistic instances but outliers in the dataset; this can only be overcome with domain knowledge
So what about polishing?
• Polishing does not necessarily alter size or effort, and we are still left with unrealistic instances
• It makes them fit into the regression model
• Is this acceptable from the point of view of the data owner?
– Depends on the application of the results
– What if unrealistic cases impact on the model?
Issues/Conclusions
• In order to build the models we had to categorise the dependent variable into 3 categories (<=1042, <=2985.5, >2985.5), BUT these categories appeared too coarse for our evaluation of the predictions
• If we know there are unrealistic cases, we should really take them out before we apply the cleaning methods (avoid the inclusion of these cases in the building of the model)
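The three-way categorisation of the dependent variable mentioned above can be sketched as follows; only the cut-offs come from the slide, the category names are illustrative labels:

```python
def effort_category(effort_hours):
    """Discretise effort into the three categories from the slide:
    <= 1042, <= 2985.5, > 2985.5 (names are illustrative labels)."""
    if effort_hours <= 1042:
        return "low"
    if effort_hours <= 2985.5:
        return "medium"
    return "high"

print(effort_category(1042), effort_category(2985.5), effort_category(2986))
# → low medium high
```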
Where to go from here?
• Rerun the experiment without “unrealistic cases”
• Simulate a dataset with model, induce noise and missing values and evaluate methods with the knowledge of what the real underlying model is
What was it all about?
• Motivations
• Dataset – the origin & quality issues
• Noise & cleaning methods
• The Experiment
• Issues & conclusion
• Future Work
Any Questions?