Evaluating data quality issues from an industrial data set
Gernot Liebchen, Bheki Twala, Mark Stephens, Martin Shepperd, Michelle Cartwright
What is it all about?
• Motivations
• Dataset – the origin & quality issues
• Noise & cleaning methods
• The Experiment
• Issues & conclusion
• Future Work
Motivations
• A previous investigation compared three noise-handling methods: robust algorithms (pruning), filtering, and polishing
• Predictive accuracy was highest with polishing, followed by pruning, and only then by filtering
• But suspicions about the results were raised (at EASE)
Suspicions about previous investigation
• The dataset contained missing values, which were imputed (artificially created) while building the model (a decision tree)
• Polishing alters the data (what impact can that have?)
• The methods were evaluated using the predictions of another decision tree – can the findings be supported by a metrics specialist?
Why do we bother?
• Good quality data is important for good quality predictions and assessments
• How can we hope for good quality results if the quality of the input data is not good?
• The data is used for a variety of different purposes – esp. analysis and estimation support
The Dataset
• Given a large dataset provided by EDS
• The original dataset contains more than 10,000 cases with 22 attributes
• Contains information about software projects carried out since the beginning of the 1990s
• Some attributes are more administrative (e.g. Project Name, Project ID) and will not have any impact on software productivity
Suspicions
• The data might contain noise
• This was confirmed by preliminary analysis of the data, which also indicated the existence of outliers
How could it occur? (in the case of the dataset)
• Input errors (some teams might be more meticulous than others), and the person approving the data might not be as meticulous
• Misunderstood standards
• The input tool might not provide range checking (or only limited checking)
• “Service Excellence” dashboard in headquarters
• Local management pressure
Suspicious Data Example
• Start Date: 01/08/2002 vs 01/06/2002
• Finish Date: 24/02/2004 vs 09/02/2004
• Name: *******Rel 24 vs *******Rel 24
• FP Count: 1522 vs 1522
• Effort: 38182.75 vs 33461.5
• Country: IRELAND vs UK
• Industry Sector: Government vs Government
• Project Type: Enhance. vs Enhance.
• Etc.
• But there were also examples with extremely high/low FP counts per hour (1 FP for 6916.25 hours; or 52 FP in 4 hours; 1746 FP in 468 hours)
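The extreme productivity figures above suggest a simple sanity check on hours per function point. A minimal sketch – the plausible band limits here are illustrative assumptions, not thresholds from the study:

```python
def flag_unrealistic(cases, min_h_per_fp=0.5, max_h_per_fp=100.0):
    """Return (name, hours-per-FP) for cases whose productivity falls
    outside a plausible band; the band limits are assumptions."""
    flagged = []
    for name, fp, effort in cases:
        ratio = effort / fp  # hours per function point
        if not (min_h_per_fp <= ratio <= max_h_per_fp):
            flagged.append((name, ratio))
    return flagged

# the extreme examples from the slide, plus one plausible case
cases = [
    ("low-productivity", 1, 6916.25),    # 6916.25 hours for 1 FP
    ("high-productivity", 52, 4),        # 52 FP in 4 hours
    ("plausible", 1522, 38182.75),       # ~25 hours per FP
]
print([name for name, _ in flag_unrealistic(cases)])
# → ['low-productivity', 'high-productivity']
```

A check like this encodes the domain knowledge the slides later argue is needed; the decision-tree detector alone cannot tell an unrealistic case from a genuine outlier.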
What imperfections could occur?
• Noise – random errors
• Outliers – exceptional “true” cases
• Missing data
• From now on, noise and outliers will both be called noise, because both are unwanted
Noise detection can be
• Distance based (e.g. visualisation methods; Cook’s, Mahalanobis and Euclidean distance; distance clustering)
• Distribution based (e.g. neural networks, forward search algorithms and robust tree modelling)
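As an illustration of the distance-based family, here is a pure-Python sketch of the Mahalanobis distance for two attributes (e.g. size and effort); a real analysis would use a statistics library, and the toy data below is invented:

```python
def mahalanobis_sq(points):
    """Squared Mahalanobis distance of each 2-D point from the
    sample mean, using the sample covariance matrix."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # sample covariance matrix entries (n-1 denominator)
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy * sxy
    ixx, iyy, ixy = syy / det, sxx / det, -sxy / det  # 2x2 inverse
    return [
        ixx * (x - mx) ** 2 + 2 * ixy * (x - mx) * (y - my)
        + iyy * (y - my) ** 2
        for x, y in points
    ]

# toy (size, effort) data with one gross outlier at the end
data = [(1, 10), (2, 21), (3, 29), (4, 41), (5, 50), (50, 2)]
d2 = mahalanobis_sq(data)
print(max(range(len(d2)), key=d2.__getitem__))  # → 5, the outlier
```

Unlike plain Euclidean distance, the Mahalanobis distance accounts for the correlation between size and effort, so a large project with proportionally large effort is not flagged.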
What to do with noise?
• First, detection (we used decision trees – usually a pattern-detection tool in data mining – here used to categorise the data in a training set, with cases tested in a test set)
• Three basic options for cleaning: polishing, filtering, pruning
Polishing/Filtering/Pruning
• Polishing – identifying the noise and correcting it
• Filtering – Identifying the noise and eliminating it
• Pruning – avoiding overfitting (trying to ignore leverage effects); the instances which lead to overfitting can be seen as noise and are taken out
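The slides use a decision tree as the noise detector. As a toy stand-in (not the study's actual tree), the sketch below fits a one-level decision stump and applies filtering: instances the model misclassifies are treated as noise and dropped.

```python
def fit_stump(data):
    """Fit a one-level decision stump on `data`, a list of
    (feature_vector, label) pairs: try every (feature, threshold)
    split and keep the one with fewest misclassifications."""
    labels = {lab for _, lab in data}
    best = None
    for f in range(len(data[0][0])):
        for xs, _ in data:
            t = xs[f]
            for left in labels:
                for right in labels:
                    errs = sum(
                        (left if v[f] <= t else right) != lab
                        for v, lab in data
                    )
                    if best is None or errs < best[0]:
                        best = (errs, f, t, left, right)
    _, f, t, left, right = best
    return lambda xs: left if xs[f] <= t else right

def filter_noise(data):
    """Filtering: drop every instance the fitted model misclassifies."""
    model = fit_stump(data)
    return [(xs, lab) for xs, lab in data if model(xs) == lab]

# six clean instances plus one mislabelled ("noisy") instance
data = [([1], "a"), ([2], "a"), ([3], "a"),
        ([4], "b"), ([5], "b"), ([6], "b"),
        ([1.5], "b")]  # injected noise
print(len(filter_noise(data)))  # → 6: the noisy instance is dropped
```

Polishing would instead replace the noisy label with the model's prediction, and pruning would restrict the model itself so the noisy instance never gets its own leaf.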
What did we do? & How did we do it?
• Compared the results of filtering and pruning, and discussed the implications of pruning
• Reduced the dataset to eliminate cases with missing values (avoiding missing-value imputation)
• Produced lists of “noisy” instances and polished counterparts
• Passed them on to Mark (as the metrics specialist)
Results
• Filtering produced a list of 226 cases from 436 (36% in noise list / 21% in cleaned set)
• Pruning produced a list of 191 from 436 (33% in noise list / 25% in cleaned set)
• Both were inspected and both contain a large number of possible true cases and unrealistic cases (productivity)
Results 2
• By just inspecting historical data it was not possible to judge which method performed better
• As a noise detector, the decision tree does not detect unrealistic instances but outliers in the dataset; this can only be overcome with domain knowledge
So what about polishing?
• Polishing does not necessarily alter size or effort, and we are still left with unrealistic instances
• It makes them fit into the regression model
• Is this acceptable from the point of view of the data owner?
– Depends on the application of the results
– What if unrealistic cases impact on the model?
Issues/Conclusions
• In order to build the models we had to categorise the dependent variable into 3 categories (<=1042, <=2985.5, >2985.5), BUT these categories appeared too coarse for our evaluation of the predictions
• If we know there are unrealistic cases, we should really take them out before we apply the cleaning methods (avoid the inclusion of these cases in the building of the model)
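The three-way categorisation of the dependent variable mentioned above can be sketched as follows; only the cut-offs come from the slide, the category names are illustrative labels:

```python
def effort_category(effort_hours):
    """Discretise effort into the three categories from the slide:
    <= 1042, <= 2985.5, > 2985.5 (names are illustrative labels)."""
    if effort_hours <= 1042:
        return "low"
    if effort_hours <= 2985.5:
        return "medium"
    return "high"

print(effort_category(1042), effort_category(2985.5), effort_category(2986))
# → low medium high
```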
Where to go from here?
• Rerun the experiment without “unrealistic cases”
• Simulate a dataset with model, induce noise and missing values and evaluate methods with the knowledge of what the real underlying model is
What was it all about?
• Motivations
• Dataset – the origin & quality issues
• Noise & cleaning methods
• The Experiment
• Issues & conclusion
• Future Work
Any Questions?