DATA QUALITY: THE DATA SCIENCE STRUGGLE NOBODY MENTIONS
MAURICE VAN KEULEN
ASSOCIATE PROFESSOR
DATA MANAGEMENT TECHNOLOGY
HEAD OF DATABASE GROUP
First, a little story
Data imperfections
Some general mechanisms
My research
7 Sept 2017, Data quality: the data science struggle nobody mentions
WHAT IS THIS TALK ABOUT?
Domain understanding
Data understanding
Data preparation
Modeling (Analysis)
Evaluation (Interpretation)
Deployment (Use)

Where the magic happens
Research on
Pregnancy processes based on
Electronic Patient Dossiers (EPDs)
of some population of women
The start
1. Data scientist is a parent him/herself
2. Gather records from patients’ EPDs for the
pregnancy period, from multiple sources
3. Fairly straightforward to identify consults,
tests, scans, conditions
4. Extract and store them
First analysis
and evaluation
1. Analysis with process mining tool
2. Interaction with visualization
3. Interpret results
4. (S)he already sees some interesting patterns
But then
(s)he notices …
and realizes …
… a broken leg … dozens of specialists …
Assumption wrong: “all records that belong to
a pregnant woman are related to pregnancy”
Too many records selected during preparation
No objective means to ascertain this:
No field ‘related to pregnancy’
and a painstaking
process starts
Specify complex filter rules
Inspect samples of (not) selected records
Repeat!
Quick and dirty or thorough?
Never perfect! What is good enough?
How does it affect results?
Example: fatigue can show up late in a pregnancy; related to the pregnancy or not?
Fast forward a bit …
Re-perform analysis
and evaluation
1. Interaction with mining tool and visualization
2. Something strange in the times of consults:
A consult appears after the blood test it prescribed???
Realization
More cleaning
1. Clinician contacted for an explanation:
notes taken during consults are put into the EPD in the evenings!
2. Moment of modification of the EPD record (what is recorded)
≠ actual moment of the activity (semantics)
Sequence and duration noise
3. More data cleaning ensues
Complex
Many unexpected surprises during
domain/data understanding
Time-consuming
Most time is spent on data preparation
(up to 50-80%)
A data scientist should know and tell you about the
deficiencies in the data and the results
Quick and
dirty or
thorough?
Never
perfect!
What is good
enough?
How does it
affect results?
When presented with analytics results / visualizations,
let people know you don’t trust any results if they don’t have
good answers to questions like these:
How was it
cleaned?
Data quality
problems?
Data
discarded?
How reliable
are these
results?
How does
this affect
the results?
What is it
Data and specification on parts, substances, etc.
Why is it a problem?
High requirements on data quality
Errors and duplicates may be
costly or even pose health risks
Even so, it is a mess
PRODUCT DATA: WHAT IS IT AND WHY IS IT A PROBLEM?
Proposed approach
Given catalogue / database with data on products
Gather data on the same products from websites
(many more or less independent sources)
Consolidate: match, merge and clean
One enriched description
for each product
PRODUCT INFORMATION CLEANING AND ENRICHMENT
PILOT: BALL BEARINGS. 1. GIVEN CATALOGUE / DATABASE WITH DATA ON PRODUCTS
PILOT: BALL BEARINGS. 2. GATHER DATA ON THE SAME PRODUCTS FROM WEBSITES; 3. CONSOLIDATE
Get product
pages
Extract data
Consolidate
(match,
merge, clean)
PILOT EXPERIENCES: THE DIRT WE FOUND!!
CustID Sales Name
1234 6000 John
2345 5000 Mary
3456 12000 Bart
… … …
DATA IMPERFECTIONS
SELECT SUM(Sales)
FROM CustSales
3423000
The data sample looks fine
The result of the analysis looks perfectly reasonable
If you don’t look hard enough,
if you don’t pay proper attention to it,
… you will be unaware
… that you are possibly looking at significantly erroneous figures!!!
CustID Sales Name
1234 6000 John
2345 5000 Mary
3456 12000 Bart
… … …
DATA IMPERFECTIONS
SELECT SUM(Sales)
FROM CustSales
3423000
CustID Sales Name
6789 2 Tom
4567 6000 Jon
5678 NULL Nina
… … …
????
Wrong values included
Missing data
Double counting
etc.
Many more problems
at value, record,
schema, source, trust
levels
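The effect of these slides can be condensed into a small runnable sketch (the table contents are illustrative, echoing the slides' CustSales example): SQL's SUM silently skips NULLs and happily adds wrong and doubled values, so the aggregate still looks perfectly reasonable.

```python
import sqlite3

# Illustrative data only: a plausible-looking aggregate over dirty rows.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE CustSales (CustID INT, Sales INT, Name TEXT)")
con.executemany("INSERT INTO CustSales VALUES (?, ?, ?)", [
    (1234, 6000, "John"),
    (2345, 5000, "Mary"),
    (4567, 6000, "Jon"),    # possible semantic duplicate of John: double counting
    (6789, 2,    "Tom"),    # suspiciously low: possibly a wrong value
    (5678, None, "Nina"),   # SUM silently skips NULL: missing data goes unnoticed
])

(total,) = con.execute("SELECT SUM(Sales) FROM CustSales").fetchone()
print(total)  # 17002 -- a perfectly reasonable-looking number
```

Nothing in the answer itself signals that a value is wrong, a row is counted twice, or a sale is missing entirely.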
Data integration very important in Data Science
Example purpose: Data enrichment
Many names for the same problem
Record linkage, Entity resolution, Entity linking,
Semantic duplicates, Data coupling, Data fusion, etc.
DATA INTEGRATION: SEMANTIC ENTITY LINKING. ONE OF THE MOST IMPORTANT DATA QUALITY PROBLEMS
CustID Name …
1234 P. Jansen …
2345 P. Janssen …
3456 P. Janssen …
… … …
EmpNr Name …
6789 P. Jansen …
4567 P. Jansen …
5678 P. Janssen …
… … …
?
They could all be the same person or all different persons!
Data perspective
  Context independent:
    Spelling error; Missing data; Incorrect value; Duplicate data;
    Inconsistent data format; Syntax violation; Violation of integrity
    constraints; Heterogeneity of measurement units; Existence of
    synonyms and homonyms
  Context dependent:
    Violation of domain constraints; Violation of organization’s
    business rules; Violation of company and government regulations;
    Violation of constraints provided by the database administrator
User perspective
  Context independent:
    Information is inaccessible; Information is insecure; Information
    is hardly retrievable; Information is difficult to aggregate;
    Errors in the information transformation
  Context dependent:
    The information is not based on fact; Information is of doubtful
    credibility; Information presents impartial view; Information is
    irrelevant to the work; Information is incomplete; Information is
    compactly represented; Information is hard to manipulate;
    Information is hard to understand; Information is outdated
CATEGORIZATION OF DATA QUALITY PROBLEMS. SOURCE: P. Woodall, M. Oberhofer, A. Borek, “A Classification of Data Quality Assessment and Improvement Methods”, JDIQ 3(4), 2014
Data/information quality cannot be measured with one value
A lot of research on describing desirable properties (dimensions)
Examples: Accuracy, Correctness, Currency,
Completeness, Relevance, Reliability, etc.
Metrics: concrete means to estimate a score on a dimension
Quite a futile endeavor in my opinion:
200+ such dimensions have been identified,
and even more possible metrics
Little agreement
Expensive, too complex
DATA QUALITY DIMENSIONS AND METRICS
“The data quality of a data set” is a meaningless notion
Data quality depends on the purpose!
What is good enough for one purpose,
may be insufficient for another
Alternative means of measurement
Define purpose as a set of queries on the data
Measure quality of the answers
Purpose determines relevant dimensions & metrics
DATA QUALITY DEPENDS ON THE PURPOSE
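The purpose-as-queries idea can be sketched in a few lines (all data, names, and reference answers below are illustrative): the same dirty dataset is "good enough" for one purpose query and insufficient for another.

```python
# Purpose-based measurement sketch: define the purpose as queries,
# then score the data by how well it answers them against a trusted
# reference answer. Data and reference values are made up.
dirty = {"BMW": 72, "B.M.W.": 25, "Renault": 45}   # duplicate brand unresolved
reference_total, reference_bmw = 142, 97            # assumed ground truth

# Purpose 1: overall sales -> unaffected by the duplicate split
q1 = sum(dirty.values())
# Purpose 2: sales per brand -> badly affected by the duplicate split
q2 = dirty["BMW"]

print(q1 == reference_total)  # True : good enough for purpose 1
print(q2 == reference_bmw)    # False: insufficient for purpose 2
```

The relevant quality dimensions fall out of the purpose: purpose 1 cares about value accuracy, purpose 2 additionally about duplicate-free entities.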
Wikipedia: “Process of implementing and developing
technical standards”
Requires consensus
Typically quite costly
Benefits but also downsides (lack of variety,
exceptions, slow evolution, defectors, etc.)
In essence a non-technical solution, but it can be
supported by technology
Useful, but inherently imperfect itself
STANDARDIZATION
One can do quite a bit of checking:
Profiling of domains: detects improper values
Profiling of keys: are the values in a column unique; which columns are the expected keys?
Verification of foreign key/primary key relationships
Verification of constraints and business rules
Verification of expected dependencies: inclusion,
(conditional) functional dependencies
Detects missing / erroneous rows
Matching with reference data
etc.
ERROR DETECTION BY VERIFICATION & PROFILING
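A minimal sketch of two of the checks above, on made-up rows (the country codes and the expectation that CustID is a key are assumptions for illustration):

```python
# Tiny profiling sketch: domain profiling and key profiling.
rows = [
    {"CustID": 1234, "Country": "NL", "Sales": 6000},
    {"CustID": 2345, "Country": "NL", "Sales": 5000},
    {"CustID": 2345, "Country": "XX", "Sales": 12000},  # repeated key, odd code
]

# Profiling a domain: flag values outside the expected code list
valid_countries = {"NL", "BE", "DE"}
bad_domain = [r for r in rows if r["Country"] not in valid_countries]

# Profiling a key: CustID was expected to be unique
ids = [r["CustID"] for r in rows]
duplicate_keys = {i for i in ids if ids.count(i) > 1}

print(bad_domain)       # the row with country code "XX"
print(duplicate_keys)   # {2345}
```

Both checks only *detect* possible problems; deciding whether "XX" is truly wrong still needs domain knowledge.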
What can be automatically discovered?
Detection of outliers: possible erroneous values
Similarity matching: possible duplicates
Inclusion detection: possible inclusion dependencies ➠ possible missing / erroneous rows
Dependency detection: possible (conditional) functional dependencies ➠ possible erroneous values, possible duplicates
Prediction: machine learning to predict a value based on other data ➠ possible erroneous values (e.g., categorizations)
There is much much more
ADVANCED PROFILING / AUTOMATIC ERROR DISCOVERY
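The first two discovery techniques can be sketched with the standard library alone (the data, the 1.5-standard-deviation threshold, and the 0.7 similarity threshold are illustrative choices, not prescriptions):

```python
import statistics
from difflib import SequenceMatcher

# Outlier detection: values far from the rest are *possibly* erroneous
sales = [6000, 5000, 5500, 2, 6200]
mean, sd = statistics.mean(sales), statistics.stdev(sales)
outliers = [v for v in sales if abs(v - mean) > 1.5 * sd]
print(outliers)  # [2]

# Similarity matching: highly similar strings are *possible* duplicates
def similar(a, b, threshold=0.7):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

pairs = [("Mercedes", "Mercedes-Benz"), ("BMW", "Renault")]
matches = [(a, b) for a, b in pairs if similar(a, b)]
print(matches)  # [('Mercedes', 'Mercedes-Benz')]
```

Note the hedged outputs: an outlier is only a *candidate* error and a similar pair only a *candidate* duplicate.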
Very useful but again inherently imperfect itself
Manually / semi-automatic / automatic
The techniques of the previous slide can not only be used to
discover errors, but also to clean them automatically;
if sufficiently probable:
duplicate rows can be merged
erroneous rows deleted
erroneous values corrected
Missing rows / values
Data imputation: fill with predicted values
DATA CLEANING
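Two of these repairs, merging and imputation, can be sketched on illustrative records (the "sufficiently probable" judgment is assumed to come from a matcher, and mean-imputation is just one simple prediction strategy):

```python
# Sketch of two automatic repairs: merge probable duplicates,
# impute a missing value with a predicted one. Data is illustrative.
records = [
    {"name": "Mercedes",      "sales": 67},
    {"name": "Mercedes-Benz", "sales": 39},
    {"name": "Renault",       "sales": None},
]

# 1. Merge: suppose a matcher judged the first two rows duplicates
merged = {"name": "Mercedes-Benz", "sales": 67 + 39}
print(merged["sales"])  # 106

# 2. Impute: fill the missing value with, e.g., the mean of known values
known = [r["sales"] for r in records if r["sales"] is not None]
records[2]["sales"] = sum(known) / len(known)
print(records[2]["sales"])  # 53.0
```

Both repairs embed decisions (which name is canonical, which prediction to trust) that may themselves be wrong, which is exactly why cleaning is never perfect.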
We have domain knowledge
We know what to expect
We know what is likely and what is not
We compare with what we know and expect
We doubt
We (dis)trust
We learn from others
We reconsider
etc.
HOW DO WE HUMANS FIND OUR WAY IN THIS MESS?
1. Probabilities / Uncertainty
2. Metadata of context: source, situation, …
3. Evidence gathering / User feedback
THREE PRINCIPLES
Let me illustrate 2 & 3 with an example of combining data
EXAMPLE: COMBINING DATA. OBJECTIVE: DETERMINE PREFERRED CUSTOMERS (WITH SALES > 100)
van Keulen, M. (2012). Managing Uncertainty: The Road Towards Better Data Interoperability. IT - Information Technology, 54(3), pp. 138-146. ISSN 1611-2776
Car brand Sales
B.M.W. 25
Mercedes 32
Renault 10
Car brand Sales
BMW 72
Mercedes-Benz 39
Renault 20
Car brand Sales
Bayerische Motoren Werke 8
Mercedes 35
Renault 15
Car brand Sales
B.M.W. 25
Bayerische Motoren Werke 8
BMW 72
Mercedes 67
Mercedes-Benz 39
Renault 45
VOILA! SEMANTIC DUPLICATES PROBLEM
Car brand Sales
B.M.W. 25
Bayerische Motoren Werke 8
BMW 72
Mercedes 67
Mercedes-Benz 39
Renault 45
Preferred customers …
SELECT SUM(Sales)
FROM CarSales
WHERE Sales>100
0
‘No preferred customers’
[Figure: database rows d1-d6 (B.M.W. 25, Bayerische Motoren Werke 8, BMW 72, Mercedes 67, Mercedes-Benz 39, Renault 45) mapped with uncertainty to the real world of car brands (objects o1-o4)]
SEMANTIC DUPLICATES
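The query above finds no preferred customer because each brand's sales are scattered over spelling variants. A runnable sketch (the canonical-name mapping below is a manual resolution, shown only for illustration):

```python
import sqlite3

# Reproduce the slide's table and query, then show the effect of
# resolving the semantic duplicates by hand.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE CarSales (brand TEXT, sales INT)")
con.executemany("INSERT INTO CarSales VALUES (?, ?)", [
    ("B.M.W.", 25), ("Bayerische Motoren Werke", 8), ("BMW", 72),
    ("Mercedes", 67), ("Mercedes-Benz", 39), ("Renault", 45),
])

(naive,) = con.execute("SELECT SUM(sales) FROM CarSales WHERE sales > 100").fetchone()
print(naive)  # None (rendered as 0 on the slide): no single row exceeds 100

# Resolve variants to one canonical name, then re-ask the question
canon = {"B.M.W.": "BMW", "Bayerische Motoren Werke": "BMW",
         "Mercedes-Benz": "Mercedes"}
totals = {}
for brand, sales in con.execute("SELECT brand, sales FROM CarSales"):
    key = canon.get(brand, brand)
    totals[key] = totals.get(key, 0) + sales
preferred = {b: s for b, s in totals.items() if s > 100}
print(preferred)  # {'BMW': 105, 'Mercedes': 106}
```

Whether to resolve the duplicates this way is itself an uncertain decision, which motivates the probabilistic representation on the next slides.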
MOST DATA IMPERFECTIONS
CAN BE MODELED AS UNCERTAINTY IN DATA
Car brand                  Sales   Condition
1  B.M.W.                    25
2  Bayerische Motoren Werke   8
3  BMW                       72
4  Mercedes                  67    X=0
5  Mercedes-Benz             39    X=0
6  Renault                   45
   Mercedes                 106    X=1 ∧ Y=0
   Mercedes-Benz            106    X=1 ∧ Y=1

X=0  rows 4 and 5 are different           probability 0.2
X=1  rows 4 and 5 are the same            probability 0.8
Y=0  “Mercedes” is the correct name       probability 0.5
Y=1  “Mercedes-Benz” is the correct name  probability 0.5

B.M.W. / BMW / Bayerische Motoren Werke handled analogously
Run some duplicate detection
tool (similarity matching)
Looks like ordinary database
Several “possible”, “most likely” or “approximate”
answers to queries
Important: Scalability (big data!)
Sales of “preferred customers”
SELECT SUM(sales)
FROM carsales
WHERE sales ≥ 100
Answer: 106 (most likely)
WHAT I HAVE NOW IS A PROBABILISTIC DATABASE
SUM(sales) P
0 14%
105 6%
106 56%
211 24%
Second most likely answer at 24%,
with a factor-2 impact on the sales figure (211 vs 106)
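The answer distribution can be reproduced by enumerating the possible worlds. The slide gives P(the two Mercedes rows are the same) = 0.8; for the three BMW-variant rows I assume P(all the same) = 0.3 and collapse that case to "all the same" vs "all different" (assumptions chosen to be consistent with the slide's table):

```python
from itertools import product

# Possible-worlds evaluation of SUM(sales) WHERE sales >= 100.
p_merc_same, p_bmw_same = 0.8, 0.3  # 0.3 is an assumed merge probability

dist = {}
for merc_same, bmw_same in product([True, False], repeat=2):
    p = (p_merc_same if merc_same else 1 - p_merc_same) * \
        (p_bmw_same if bmw_same else 1 - p_bmw_same)
    rows = [106] if merc_same else [67, 39]      # Mercedes variants
    rows += [105] if bmw_same else [25, 8, 72]   # BMW variants
    rows.append(45)                              # Renault, certain
    answer = sum(s for s in rows if s >= 100)
    dist[answer] = dist.get(answer, 0.0) + p

print({k: round(v, 2) for k, v in sorted(dist.items())})
# {0: 0.14, 105: 0.06, 106: 0.56, 211: 0.24}
```

The most likely answer (106 at 56%) is what an ordinary query returns, but the full distribution also exposes the 24% chance of a very different figure.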
Data integration very important in Data Science
Example purpose: Data enrichment
Many names for the same problem
Record linkage, Entity resolution, Entity linking,
Semantic duplicates, Data coupling, Data fusion, etc.
DATA INTEGRATION: SEMANTIC ENTITY LINKING. ONE OF THE MOST IMPORTANT DATA QUALITY PROBLEMS
CustID Name …
1234 P. Jansen …
2345 P. Janssen …
3456 P. Janssen …
… … …
EmpNr Name …
6789 P. Jansen …
4567 P. Jansen …
5678 P. Janssen …
… … …
?
They could all be the same person or all different persons!
Remember this slide?
Let’s go for an initial
integration that can readily
and meaningfully be used
Let it improve during use
“Good is good enough”
DATA INTEGRATION THE PROBABILISTIC WAY
Initial integration:
  Partial data integration
  Enumerate cases for remaining problems
  Store data with uncertainty in UDBMS
Continuous improvement:
  Use
  Gather evidence
  Improve data quality
User perceives (part of) a source as of doubtful credibility; we humans say “I don’t trust this data”
Rows are correct or not, so each row gets a variable & probability
Sources contain conflicting data about (possibly) the same entities; we humans say “I’m left with some doubt”
We already saw this one: the car brand example!
Automatic error discovery tool (AEDT) detects possible erroneous values or rows; we humans say “I doubt that this is correct”
Alternative rows, each with a probability of correctness
TRUST AND DOUBT THE PROBABILISTIC WAY
Little story
Peek into the life and frustrations of a data scientist
Awareness for responsible analytics
Data imperfections: What a mess!
Ball bearings pilot
Models for data quality (problems)
Some general mechanisms: Still quite a mess!
Standardization, error detection & discovery, cleaning
… and I talked about my own research
WHAT HAVE I TALKED ABOUT?
With this, a data scientist can
trust certain parts of the input data less than others
integrate data in a quick-and-dirty but responsible way
keep information about data problems *in* the data
solve data quality problems only to the degree needed
measure data uncertainty and quality
document data manipulations and the reasons for them
know how data quality problems affect the reliability of the results
dripping clock: http://www.gemfive.com
flower: http://cdn.tinybuddha.com
big data: http://tr1.cbsistatic.com
awareness: http://7minutesinthemorning.com
SOURCES OF IMAGES