DATA QUALITY: THE DATA SCIENCE STRUGGLE NOBODY MENTIONS
MAURICE VAN KEULEN
ASSOCIATE PROFESSOR
DATA MANAGEMENT TECHNOLOGY
HEAD OF DATABASE GROUP
First, a little story
Data imperfections
Some general mechanisms
My research
7 Sept 2017, Data quality: the data science struggle nobody mentions
WHAT IS THIS TALK ABOUT?
Domain understanding
Data understanding
Data preparation
Modeling (Analysis)
Evaluation (Interpretation)
Deployment (Use)

Where the magic happens
Research on
Pregnancy processes based on
Electronic Patient Dossiers (EPDs)
of some population of women
The start
1. Data scientist is a parent him/herself
2. Gather records from patients’ EPDs for the
pregnancy period, from multiple sources
3. Fairly straightforward to identify consults,
tests, scans, conditions
4. Extract and store them
First analysis
and evaluation
1. Analysis with process mining tool
2. Interaction with visualization
3. Interpret results
4. (S)he already sees some interesting patterns
But then
(s)he notices …
and realizes …
… a broken leg … dozens of specialists …
Assumption wrong: “all records that belong to
a pregnant woman are related to pregnancy”
Too many records selected during preparation
No objective means to ascertain this:
No field ‘related to pregnancy’
and a painstaking
process starts
Specify complex filter rules
Inspect samples of (not) selected records
Repeat!
Quick and dirty or thorough?
Never perfect! What is good enough?
How does it affect results?
Example: fatigue can show up late in a pregnancy; related to the pregnancy or not?
Fast forward a bit …
Re-perform analysis
and evaluation
1. Interaction with mining tool and visualization
2. Something strange in the times of consults:
A consult appears after the blood test it prescribed???
Realization
More cleaning
1. Clinician contacted for an explanation:
notes taken during consults are put into the EPD in the evenings!
2. Moment of modification of the EPD record (what is recorded)
≠ actual moment of the activity (semantics)
Sequence and duration noise
3. More data cleaning ensues
Complex
Many unexpected surprises during
domain/data understanding
Time-consuming
Most time is spent on data preparation
(up to 50-80%)
A data scientist should know and tell you about the
deficiencies in the data and the results
Quick and
dirty or
thorough?
Never
perfect!
What is good
enough?
How does it
affect results?
When presented with analytics results / visualizations,
let people know you don’t trust any results if they don’t have
good answers to questions like these:
How was it
cleaned?
Data quality
problems?
Data
discarded?
How reliable
are these
results?
How does
this affect
the results?
What is it
Data and specification on parts, substances, etc.
Why is it a problem?
High requirements on data quality
Errors and duplicates may be
costly or even pose health risks
Even so, it is a mess
PRODUCT DATA: WHAT IS IT AND WHY IS IT A PROBLEM?
Proposed approach
Given catalogue / database with data on products
Gather data on the same products from websites
(many more or less independent sources)
Consolidate: match, merge and clean
One enriched description
for each product
PRODUCT INFORMATION CLEANING AND ENRICHMENT
PILOT: BALL BEARINGS. 1. GIVEN CATALOGUE / DATABASE WITH DATA ON PRODUCTS
PILOT: BALL BEARINGS. 2. GATHER DATA ON THE SAME PRODUCTS FROM WEBSITES; 3. CONSOLIDATE
Get product
pages
Extract data
Consolidate
(match,
merge, clean)
PILOT EXPERIENCES: THE DIRT WE FOUND!!
CustID Sales Name
1234 6000 John
2345 5000 Mary
3456 12000 Bart
… … …
DATA IMPERFECTIONS
SELECT SUM(Sales)
FROM CustSales
3423000
The data sample looks fine
The result of the analysis looks perfectly reasonable
If you don’t look hard enough,
if you don’t pay proper attention to it,
… you will be unaware
… that you are possibly looking at significantly erroneous figures!!!
CustID Sales Name
1234 6000 John
2345 5000 Mary
3456 12000 Bart
… … …
DATA IMPERFECTIONS
SELECT SUM(Sales)
FROM CustSales
3423000
CustID Sales Name
6789 2 Tom
4567 6000 Jon
5678 NULL Nina
… … …
????
Wrong values included
Missing data
Double counting
etc.
Many more problems
at value, record,
schema, source, trust
levels
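The effect of these slides can be condensed into a small runnable sketch (the table contents are illustrative, echoing the slides' CustSales example): SQL's SUM silently skips NULLs and happily adds wrong and doubled values, so the aggregate still looks perfectly reasonable.

```python
import sqlite3

# Illustrative data only: a plausible-looking aggregate over dirty rows.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE CustSales (CustID INT, Sales INT, Name TEXT)")
con.executemany("INSERT INTO CustSales VALUES (?, ?, ?)", [
    (1234, 6000, "John"),
    (2345, 5000, "Mary"),
    (4567, 6000, "Jon"),    # possible semantic duplicate of John: double counting
    (6789, 2,    "Tom"),    # suspiciously low: possibly a wrong value
    (5678, None, "Nina"),   # SUM silently skips NULL: missing data goes unnoticed
])

(total,) = con.execute("SELECT SUM(Sales) FROM CustSales").fetchone()
print(total)  # 17002 -- a perfectly reasonable-looking number
```

Nothing in the answer itself signals that a value is wrong, a row is counted twice, or a sale is missing entirely.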
Data integration very important in Data Science
Example purpose: Data enrichment
Many names for the same problem
Record linkage, Entity resolution, Entity linking,
Semantic duplicates, Data coupling, Data fusion, etc.
DATA INTEGRATION: SEMANTIC ENTITY LINKING. ONE OF THE MOST IMPORTANT DATA QUALITY PROBLEMS
CustID Name …
1234 P. Jansen …
2345 P. Janssen …
3456 P. Janssen …
… … …
EmpNr Name …
6789 P. Jansen …
4567 P. Jansen …
5678 P. Janssen …
… … …
?
They could all be the same person or all different persons!
Data perspective
  Context independent:
    Spelling error; Missing data; Incorrect value; Duplicate data;
    Inconsistent data format; Syntax violation; Violation of integrity
    constraints; Heterogeneity of measurement units; Existence of
    synonyms and homonyms
  Context dependent:
    Violation of domain constraints; Violation of organization’s
    business rules; Violation of company and government regulations;
    Violation of constraints provided by the database administrator
User perspective
  Context independent:
    Information is inaccessible; Information is insecure; Information
    is hardly retrievable; Information is difficult to aggregate;
    Errors in the information transformation
  Context dependent:
    The information is not based on fact; Information is of doubtful
    credibility; Information presents impartial view; Information is
    irrelevant to the work; Information is incomplete; Information is
    compactly represented; Information is hard to manipulate;
    Information is hard to understand; Information is outdated
CATEGORIZATION OF DATA QUALITY PROBLEMS. SOURCE: P. Woodall, M. Oberhofer, A. Borek, “A Classification of Data Quality Assessment and Improvement Methods”, JDIQ 3(4), 2014
Data/information quality cannot be measured with one value
A lot of research on describing desirable properties (dimensions)
Examples: Accuracy, Correctness, Currency,
Completeness, Relevance, Reliability, etc.
Metrics: concrete means to estimate a score on a dimension
Quite a futile endeavor in my opinion:
200+ such dimensions have been identified,
and even more possible metrics
Little agreement
Expensive, too complex
DATA QUALITY DIMENSIONS AND METRICS
“The data quality of a data set” is a meaningless notion
Data quality depends on the purpose!
What is good enough for one purpose,
may be insufficient for another
Alternative means of measurement
Define purpose as a set of queries on the data
Measure quality of the answers
Purpose determines relevant dimensions & metrics
DATA QUALITY DEPENDS ON THE PURPOSE
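The purpose-as-queries idea can be sketched in a few lines (all data, names, and reference answers below are illustrative): the same dirty dataset is "good enough" for one purpose query and insufficient for another.

```python
# Purpose-based measurement sketch: define the purpose as queries,
# then score the data by how well it answers them against a trusted
# reference answer. Data and reference values are made up.
dirty = {"BMW": 72, "B.M.W.": 25, "Renault": 45}   # duplicate brand unresolved
reference_total, reference_bmw = 142, 97            # assumed ground truth

# Purpose 1: overall sales -> unaffected by the duplicate split
q1 = sum(dirty.values())
# Purpose 2: sales per brand -> badly affected by the duplicate split
q2 = dirty["BMW"]

print(q1 == reference_total)  # True : good enough for purpose 1
print(q2 == reference_bmw)    # False: insufficient for purpose 2
```

The relevant quality dimensions fall out of the purpose: purpose 1 cares about value accuracy, purpose 2 additionally about duplicate-free entities.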
Wikipedia: “Process of implementing and developing
technical standards”
Requires consensus
Typically quite costly
Benefits but also downsides (lack of variety,
exceptions, slow evolution, defectors, etc.)
In essence a non-technical solution, but it can be
supported by technology
Useful, but inherently imperfect itself
STANDARDIZATION
One can do quite a bit of checking:
Profiling of domains: detects improper values
Profiling of keys: are the values in a column unique; which columns are the expected keys?
Verification of foreign key/primary key relationships
Verification of constraints and business rules
Verification of expected dependencies: inclusion,
(conditional) functional dependencies
Detects missing / erroneous rows
Matching with reference data
etc.
ERROR DETECTION BY VERIFICATION & PROFILING
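A minimal sketch of two of the checks above, on made-up rows (the country codes and the expectation that CustID is a key are assumptions for illustration):

```python
# Tiny profiling sketch: domain profiling and key profiling.
rows = [
    {"CustID": 1234, "Country": "NL", "Sales": 6000},
    {"CustID": 2345, "Country": "NL", "Sales": 5000},
    {"CustID": 2345, "Country": "XX", "Sales": 12000},  # repeated key, odd code
]

# Profiling a domain: flag values outside the expected code list
valid_countries = {"NL", "BE", "DE"}
bad_domain = [r for r in rows if r["Country"] not in valid_countries]

# Profiling a key: CustID was expected to be unique
ids = [r["CustID"] for r in rows]
duplicate_keys = {i for i in ids if ids.count(i) > 1}

print(bad_domain)       # the row with country code "XX"
print(duplicate_keys)   # {2345}
```

Both checks only *detect* possible problems; deciding whether "XX" is truly wrong still needs domain knowledge.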
What can be automatically discovered?
Detection of outliers: possible erroneous values
Similarity matching: possible duplicates
Inclusion detection: possible inclusion dependencies ➠ possible missing / erroneous rows
Dependency detection: possible (conditional) functional dependencies ➠ possible erroneous values, possible duplicates
Prediction: machine learning to predict a value based on other data ➠ possible erroneous values (e.g., categorizations)
There is much much more
ADVANCED PROFILING / AUTOMATIC ERROR DISCOVERY
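The first two discovery techniques can be sketched with the standard library alone (the data, the 1.5-standard-deviation threshold, and the 0.7 similarity threshold are illustrative choices, not prescriptions):

```python
import statistics
from difflib import SequenceMatcher

# Outlier detection: values far from the rest are *possibly* erroneous
sales = [6000, 5000, 5500, 2, 6200]
mean, sd = statistics.mean(sales), statistics.stdev(sales)
outliers = [v for v in sales if abs(v - mean) > 1.5 * sd]
print(outliers)  # [2]

# Similarity matching: highly similar strings are *possible* duplicates
def similar(a, b, threshold=0.7):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

pairs = [("Mercedes", "Mercedes-Benz"), ("BMW", "Renault")]
matches = [(a, b) for a, b in pairs if similar(a, b)]
print(matches)  # [('Mercedes', 'Mercedes-Benz')]
```

Note the hedged outputs: an outlier is only a *candidate* error and a similar pair only a *candidate* duplicate.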
Very useful but again inherently imperfect itself
Manually / semi-automatic / automatic
The techniques of the previous slide can not only be used to
discover errors, but also to clean them automatically;
if sufficiently probable:
duplicate rows can be merged
erroneous rows deleted
erroneous values corrected
Missing rows / values
Data imputation: fill with predicted values
DATA CLEANING
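Two of these repairs, merging and imputation, can be sketched on illustrative records (the "sufficiently probable" judgment is assumed to come from a matcher, and mean-imputation is just one simple prediction strategy):

```python
# Sketch of two automatic repairs: merge probable duplicates,
# impute a missing value with a predicted one. Data is illustrative.
records = [
    {"name": "Mercedes",      "sales": 67},
    {"name": "Mercedes-Benz", "sales": 39},
    {"name": "Renault",       "sales": None},
]

# 1. Merge: suppose a matcher judged the first two rows duplicates
merged = {"name": "Mercedes-Benz", "sales": 67 + 39}
print(merged["sales"])  # 106

# 2. Impute: fill the missing value with, e.g., the mean of known values
known = [r["sales"] for r in records if r["sales"] is not None]
records[2]["sales"] = sum(known) / len(known)
print(records[2]["sales"])  # 53.0
```

Both repairs embed decisions (which name is canonical, which prediction to trust) that may themselves be wrong, which is exactly why cleaning is never perfect.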
We have domain knowledge
We know what to expect
We know what is likely and what is not
We compare with what we know and expect
We doubt
We (dis)trust
We learn from others
We reconsider
etc.
HOW DO WE HUMANS FIND OUR WAY IN THIS MESS?
1. Probabilities / Uncertainty
2. Metadata of context: source, situation, …
3. Evidence gathering / User feedback
THREE PRINCIPLES
Let me illustrate 2 & 3 with an example of combining data
EXAMPLE: COMBINING DATA. OBJECTIVE: DETERMINE PREFERRED CUSTOMERS (WITH SALES > 100)
van Keulen, M. (2012). Managing Uncertainty: The Road Towards Better Data Interoperability. IT - Information Technology, 54(3), pp. 138-146. ISSN 1611-2776
Car brand Sales
B.M.W. 25
Mercedes 32
Renault 10
Car brand Sales
BMW 72
Mercedes-Benz 39
Renault 20
Car brand Sales
Bayerische Motoren Werke 8
Mercedes 35
Renault 15
Car brand Sales
B.M.W. 25
Bayerische Motoren Werke 8
BMW 72
Mercedes 67
Mercedes-Benz 39
Renault 45
VOILA! SEMANTIC DUPLICATES PROBLEM
Car brand Sales
B.M.W. 25
Bayerische Motoren Werke 8
BMW 72
Mercedes 67
Mercedes-Benz 39
Renault 45
Preferred customers …
SELECT SUM(Sales)
FROM CarSales
WHERE Sales>100
0
‘No preferred customers’
[Figure: database rows d1-d6 (B.M.W. 25, Bayerische Motoren Werke 8, BMW 72, Mercedes 67, Mercedes-Benz 39, Renault 45) mapped with uncertainty to the real world of car brands (objects o1-o4)]
SEMANTIC DUPLICATES
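The query above finds no preferred customer because each brand's sales are scattered over spelling variants. A runnable sketch (the canonical-name mapping below is a manual resolution, shown only for illustration):

```python
import sqlite3

# Reproduce the slide's table and query, then show the effect of
# resolving the semantic duplicates by hand.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE CarSales (brand TEXT, sales INT)")
con.executemany("INSERT INTO CarSales VALUES (?, ?)", [
    ("B.M.W.", 25), ("Bayerische Motoren Werke", 8), ("BMW", 72),
    ("Mercedes", 67), ("Mercedes-Benz", 39), ("Renault", 45),
])

(naive,) = con.execute("SELECT SUM(sales) FROM CarSales WHERE sales > 100").fetchone()
print(naive)  # None (rendered as 0 on the slide): no single row exceeds 100

# Resolve variants to one canonical name, then re-ask the question
canon = {"B.M.W.": "BMW", "Bayerische Motoren Werke": "BMW",
         "Mercedes-Benz": "Mercedes"}
totals = {}
for brand, sales in con.execute("SELECT brand, sales FROM CarSales"):
    key = canon.get(brand, brand)
    totals[key] = totals.get(key, 0) + sales
preferred = {b: s for b, s in totals.items() if s > 100}
print(preferred)  # {'BMW': 105, 'Mercedes': 106}
```

Whether to resolve the duplicates this way is itself an uncertain decision, which motivates the probabilistic representation on the next slides.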
MOST DATA IMPERFECTIONS
CAN BE MODELED AS UNCERTAINTY IN DATA
Car brand                  Sales   Condition
1  B.M.W.                    25
2  Bayerische Motoren Werke   8
3  BMW                       72
4  Mercedes                  67    X=0
5  Mercedes-Benz             39    X=0
6  Renault                   45
   Mercedes                 106    X=1 ∧ Y=0
   Mercedes-Benz            106    X=1 ∧ Y=1

X=0  rows 4 and 5 are different           probability 0.2
X=1  rows 4 and 5 are the same            probability 0.8
Y=0  “Mercedes” is the correct name       probability 0.5
Y=1  “Mercedes-Benz” is the correct name  probability 0.5

B.M.W. / BMW / Bayerische Motoren Werke handled analogously
Run some duplicate detection
tool (similarity matching)
Looks like ordinary database
Several “possible”, “most likely” or “approximate”
answers to queries
Important: Scalability (big data!)
Sales of “preferred customers”
SELECT SUM(sales)
FROM carsales
WHERE sales ≥ 100
Answer: 106 (most likely)
WHAT I HAVE NOW IS A PROBABILISTIC DATABASE
SUM(sales) P
0 14%
105 6%
106 56%
211 24%
Second most likely answer at 24%,
with a factor-2 impact on the sales figure (211 vs 106)
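The answer distribution can be reproduced by enumerating the possible worlds. The slide gives P(the two Mercedes rows are the same) = 0.8; for the three BMW-variant rows I assume P(all the same) = 0.3 and collapse that case to "all the same" vs "all different" (assumptions chosen to be consistent with the slide's table):

```python
from itertools import product

# Possible-worlds evaluation of SUM(sales) WHERE sales >= 100.
p_merc_same, p_bmw_same = 0.8, 0.3  # 0.3 is an assumed merge probability

dist = {}
for merc_same, bmw_same in product([True, False], repeat=2):
    p = (p_merc_same if merc_same else 1 - p_merc_same) * \
        (p_bmw_same if bmw_same else 1 - p_bmw_same)
    rows = [106] if merc_same else [67, 39]      # Mercedes variants
    rows += [105] if bmw_same else [25, 8, 72]   # BMW variants
    rows.append(45)                              # Renault, certain
    answer = sum(s for s in rows if s >= 100)
    dist[answer] = dist.get(answer, 0.0) + p

print({k: round(v, 2) for k, v in sorted(dist.items())})
# {0: 0.14, 105: 0.06, 106: 0.56, 211: 0.24}
```

The most likely answer (106 at 56%) is what an ordinary query returns, but the full distribution also exposes the 24% chance of a very different figure.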
Data integration very important in Data Science
Example purpose: Data enrichment
Many names for the same problem
Record linkage, Entity resolution, Entity linking,
Semantic duplicates, Data coupling, Data fusion, etc.
DATA INTEGRATION: SEMANTIC ENTITY LINKING. ONE OF THE MOST IMPORTANT DATA QUALITY PROBLEMS
CustID Name …
1234 P. Jansen …
2345 P. Janssen …
3456 P. Janssen …
… … …
EmpNr Name …
6789 P. Jansen …
4567 P. Jansen …
5678 P. Janssen …
… … …
?
They could all be the same person or all different persons!
Remember this slide?
Let’s go for an initial
integration that can readily
and meaningfully be used
Let it improve during use
“Good is good enough”
DATA INTEGRATION THE PROBABILISTIC WAY
Initial integration:
  Partial data integration
  Enumerate cases for remaining problems
  Store data with uncertainty in UDBMS
Continuous improvement:
  Use
  Gather evidence
  Improve data quality
User perceives (part of) a source as of doubtful credibility; we humans say “I don’t trust this data”
Rows are correct or not, so each row gets a variable & probability
Sources contain conflicting data about (possibly) the same entities; we humans say “I’m left with some doubt”
We already saw this one: the car brand example!
Automatic error discovery tool (AEDT) detects possible erroneous values or rows; we humans say “I doubt that this is correct”
Alternative rows, each with a probability of correctness
TRUST AND DOUBT THE PROBABILISTIC WAY
Little story
Peek into the life and frustrations of a data scientist
Awareness for responsible analytics
Data imperfections: What a mess!
Ball bearings pilot
Models for data quality (problems)
Some general mechanisms: Still quite a mess!
Standardization, error detection & discovery, cleaning
… and I talked about my own research
WHAT HAVE I TALKED ABOUT?
With this, a data scientist can
trust certain parts of the input data less than others
integrate data in a quick-and-dirty but responsible way
keep information about data problems *in* the data
solve data quality problems only to the degree needed
measure data uncertainty and quality
document data manipulations and the reasons for them
know how data quality problems affect the reliability of the results
dripping clock: http://www.gemfive.com
flower: http://cdn.tinybuddha.com
big data: http://tr1.cbsistatic.com
awareness: http://7minutesinthemorning.com
SOURCES OF IMAGES