Transcript of “Evaluating data quality issues from an industrial data set”

Gernot Liebchen, Bheki Twala, Mark Stephens, Martin Shepperd, Michelle Cartwright

Page 1:

Evaluating data quality issues from an industrial data set

Gernot Liebchen, Bheki Twala, Mark Stephens, Martin Shepperd, Michelle Cartwright

Page 2:

What is it all about?

• Motivations
• Dataset – the origin & quality issues
• Noise & cleaning methods
• The Experiment
• Issues & conclusion
• Future Work

Page 3:

Motivations

• A previous investigation compared 3 noise-handling methods: robust algorithms (pruning), filtering, and polishing

• Predictive accuracy was highest with polishing, followed by pruning, and only then by filtering

• But suspicions were raised (at EASE)

Page 4:

Suspicions about previous investigation

• The dataset contained missing values, which were imputed (artificially created) during the building of the model (a decision tree)

• Polishing alters the data (what impact can that have?)

• The methods were evaluated by using the predictions of another decision tree -> can the findings be supported by a metrics specialist?

Page 5:

Why do we bother?

• Good-quality data is important for good-quality predictions and assessments

• How can we hope for good-quality results if the quality of the input data is not good?

• The data is used for a variety of different purposes, especially analysis and estimation support

Page 6:

The Dataset

• We were given a large dataset provided by EDS

• The original dataset contains more than 10,000 cases with 22 attributes

• It contains information about software projects carried out since the beginning of the 1990s

• Some attributes are more administrative (e.g. Project Name, Project ID) and will not have any impact on software productivity

Page 7:

Suspicions

• The data might contain noise

• This was confirmed by the preliminary analysis of the data, which also indicated the existence of outliers

Page 8:

How could it occur? (in the case of the dataset)

• Input errors (some teams might be more meticulous than others), and the person approving the data might not be as meticulous

• Misunderstood standards

• The input tool might not provide range checking (or only limited checking)

• A “Service Excellence” dashboard at headquarters

• Local management pressure

Page 9:

Suspicious Data Example

• Two apparently duplicated cases, side by side:

  Start Date: 01/08/2002 vs 01/06/2002
  Finish Date: 24/02/2004 vs 09/02/2004
  Name: *******Rel 24 vs *******Rel 24
  FP Count: 1522 vs 1522
  Effort: 38182.75 vs 33461.5
  Country: IRELAND vs UK
  Industry Sector: Government vs Government
  Project Type: Enhance. vs Enhance.
  Etc.

• But there were also examples with extremely high or low FP counts per hour (1 FP for 6916.25 hours; 52 FP in 4 hours; 1746 FP in 468 hours)
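To make that sanity check concrete, here is a minimal Python sketch of flagging implausible FP-per-hour rates. The column names and the plausibility band are assumptions for illustration; only the three quoted cases come from the slide.

```python
import pandas as pd

# Hypothetical column names and an illustrative plausibility band; the real
# dataset's schema and acceptable productivity range are not given in the talk.
MIN_FP_PER_HOUR = 0.01   # slower than 1 FP per 100 hours looks suspicious
MAX_FP_PER_HOUR = 1.0    # faster than 1 FP per hour looks suspicious

def flag_unrealistic(df: pd.DataFrame) -> pd.DataFrame:
    """Return the cases whose FP-per-hour rate falls outside the band."""
    rate = df["fp_count"] / df["effort_hours"]
    return df[(rate < MIN_FP_PER_HOUR) | (rate > MAX_FP_PER_HOUR)]

# The three cases quoted above all trip the check.
cases = pd.DataFrame({
    "fp_count":     [1,       52, 1746],
    "effort_hours": [6916.25,  4,  468],
})
print(flag_unrealistic(cases))
```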

Page 10:

What imperfections could occur?

• Noise – Random Errors

• Outliers – Exceptional “True” Cases

• Missing data

• From now on, Noise and Outliers will be called Noise, because both are unwanted

Page 11:

Noise Detection can be

• Distance-based (e.g. visualisation methods; Cook's, Mahalanobis, and Euclidean distances; distance-based clustering)

• Distribution-based (e.g. neural networks, forward search algorithms, and robust tree modelling)
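As an illustration of the distance-based family, here is a minimal Mahalanobis-distance sketch. It is not the authors' implementation, and the cut-off of 3 is an arbitrary assumption.

```python
import numpy as np

def mahalanobis_outliers(X: np.ndarray, threshold: float = 3.0) -> np.ndarray:
    """Boolean mask of rows whose Mahalanobis distance from the sample
    mean exceeds `threshold`. X is an (n_cases, n_attributes) array."""
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    cov_inv = np.linalg.pinv(cov)   # pseudo-inverse guards against a singular covariance
    diff = X - mu
    d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)   # diff[i] @ cov_inv @ diff[i]
    return np.sqrt(d2) > threshold

# Demo on synthetic data with one planted outlier.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
X[0] = [10.0, 10.0]
print(np.where(mahalanobis_outliers(X))[0])   # index 0 should be flagged
```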

Page 12:

What to do with noise?

• First, detection: we used decision trees (usually a pattern-detection tool in data mining) to categorise the data in a training set, with cases then tested in a test set

• 3 basic options for cleaning: polishing, filtering, pruning
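A sketch of one common way to use a decision tree as a noise detector, consistent with the train/test description above though not necessarily the authors' exact setup: each case is predicted by a tree trained on the other cross-validation folds, and a mismatch marks it as candidate noise.

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

def detect_noise(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Flag cases a decision tree cannot reproduce under cross-validation.

    Each case is predicted by a tree trained on the other folds; a mismatch
    between the prediction and the recorded class marks it as candidate noise.
    """
    tree = DecisionTreeClassifier(random_state=0)
    y_pred = cross_val_predict(tree, X, y, cv=10)
    return y_pred != y

# Demo: flip some labels in synthetic data and see whether they get flagged.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)
y[:10] = 1 - y[:10]                     # planted label noise
print(np.where(detect_noise(X, y))[0])
```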

Page 13:

Polishing/Filtering/Pruning

• Polishing – identifying the noise and correcting it

• Filtering – identifying the noise and eliminating it

• Pruning – avoiding overfitting (trying to ignore the leverage effects); the instances which lead to overfitting can be seen as noise and are taken out
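A minimal sketch of how the three options differ in code, assuming a boolean `noisy` mask from a detector such as the one sketched earlier. For brevity, polishing here corrects only the class value (full polishing schemes may also correct attribute values), and `min_samples_leaf` stands in for one of several possible pruning mechanisms.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def filter_cases(X, y, noisy):
    """Filtering: drop the flagged cases entirely."""
    return X[~noisy], y[~noisy]

def polish_cases(X, y, noisy):
    """Polishing: keep the flagged cases, but replace their class with the
    value predicted by a model trained on the unflagged cases."""
    model = DecisionTreeClassifier(random_state=0).fit(X[~noisy], y[~noisy])
    y_polished = y.copy()
    y_polished[noisy] = model.predict(X[noisy])
    return X, y_polished

def pruned_model(X, y):
    """Pruning: keep all cases, but restrain the tree so that isolated
    (possibly noisy) cases cannot carve out leaves of their own."""
    return DecisionTreeClassifier(min_samples_leaf=5, random_state=0).fit(X, y)

# Demo with synthetic data and an arbitrary noise mask.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = (X[:, 0] > 0).astype(int)
noisy = rng.random(60) < 0.1
X_f, y_f = filter_cases(X, y, noisy)
X_p, y_p = polish_cases(X, y, noisy)
tree = pruned_model(X, y)
```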

Page 14:

What did we do? & How did we do it?

• Compared the results of filtering and pruning and discussed the implications of pruning

• Reduced the dataset to eliminate cases with missing values (to avoid missing-value imputation)

• Produced lists of “noisy” instances and their polished counterparts

• Passed them on to Mark (as a metrics specialist)

Page 15:

Results

• Filtering produced a list of 226 cases out of 436 (36% in the noise list / 21% in the cleaned set)

• Pruning produced a list of 191 cases out of 436 (33% in the noise list / 25% in the cleaned set)

• Both lists were inspected; both contain a large number of possible true cases as well as unrealistic cases (in terms of productivity)

Page 16:

Results 2

• Just by inspecting the historical data, it was not possible to judge which method performed better

• The decision tree as a noise detector does not detect unrealistic instances, only outliers within the dataset; this limitation can only be overcome with domain knowledge

Page 17:

So what about polishing?

• Polishing does not necessarily alter size or effort, and we are still left with unrealistic instances

• It merely makes them fit the regression model

• Is this acceptable from the point of view of the data owner?
  - It depends on the application of the results
  - What if unrealistic cases impact the model?

Page 18:

Issues/Conclusions

• In order to build the models we had to categorise the dependent variable into 3 categories (<=1042, <=2985.5, >2985.5), BUT these categories appeared too coarse for our evaluation of the predictions (see the binning sketch below)

• If we know there are unrealistic cases, we should really take them out before we apply the cleaning methods (to avoid including these cases when building the model)
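For reference, the categorisation can be reproduced with a simple binning. The bin edges come straight from the slide; the sample effort values are invented.

```python
import numpy as np
import pandas as pd

# Bin the dependent variable (effort) into the slide's three categories.
effort = pd.Series([800.0, 1042.0, 2500.0, 2985.5, 38182.75])
category = pd.cut(effort,
                  bins=[-np.inf, 1042, 2985.5, np.inf],
                  labels=["<=1042", "<=2985.5", ">2985.5"])
print(category)
```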

Page 19:

Where to go from here?

• Rerun the experiment without “unrealistic cases”

• Simulate a dataset from a model, induce noise and missing values, and evaluate the methods with knowledge of what the real underlying model is (sketched below)
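A minimal sketch of such a simulation, with the generating model, its parameters, and the corruption rates all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Generate data from a known linear model (all parameters are invented).
n = 1000
size = rng.uniform(10, 2000, n)                  # e.g. function points
effort = 25 * size + rng.normal(0, 200, n)       # known "true" model plus residual

# Induce noise: replace 10% of the effort values with random garbage.
noisy = rng.random(n) < 0.10
effort_obs = effort.copy()
effort_obs[noisy] = rng.uniform(1, 50_000, noisy.sum())

# Induce missingness: blank out 15% of the size values.
missing = rng.random(n) < 0.15
size_obs = size.copy()
size_obs[missing] = np.nan

# Because the generating model is known, any cleaning method can now be scored
# against ground truth, e.g. by how many of the induced-noise cases it flags.
print(f"induced noise in {noisy.sum()} cases, missing size in {missing.sum()} cases")
```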

Page 20:

What was it all about?

• Motivations
• Dataset – the origin & quality issues
• Noise & cleaning methods
• The Experiment
• Issues & conclusion
• Future Work

Page 21:

Any Questions?