Automated Data Quality Assurance with Machine Learning ......Digitization - Data - Intelligence...

Digitization - Data - Intelligence

Automated Data Quality Assurance with Machine Learning and Autoencoders

SDS2019

Bern, 14.06.2019

Martin Müller-Lennert

Senior Data Scientist


Milica Petrović

Senior Data Scientist


Talk Outline

1 What’s Wrong with Data Quality?

2 Error Detection using ML

3 Demo

4 Findings and Outlook


Data Quality Today: What’s wrong with it?

Data Sources

Data Quality

End User


Data Quality Today: Our Take at a Solution

Data Quality Today
▪ Manually coded SQL rules
▪ Uni-/bi-variate checks

Challenges
▪ Too much data: set of rules
▪ Too few rules: undetected errors
▪ Too narrow focus: one-dimensional
▪ Too late: new error types detected after occurrence

Solutions with Machine Learning
▪ Automate: simultaneous error detection & faster process
▪ Reusability: tailored ML algorithms reused for fields of similar type
▪ Deep dive: discovery of new types of errors based on multivariate relationships

Autoencoders
▪ Unsupervised
▪ Model of input data → anomalies easily detected
▪ Capture multivariate relationships


Autoencoders for Data Quality: Architecture and Training

Target: Reconstruct input

Bottleneck: Enforced by architecture or regularization

Ensures network learns structure of input data

For good data only

[Figure: autoencoder diagram; labels: INPUT, OUTPUT]
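The reconstruction target and the architectural bottleneck can be illustrated with a deliberately tiny example. The sketch below is not the model used in the talk: it assumes a linear autoencoder with tied weights and a single hidden unit, trained by plain gradient descent on made-up two-dimensional records that all satisfy x2 = 2 * x1. The function names, the learning rate and the toy data are assumptions chosen for illustration; a real setup would use a deep learning framework.

```python
def train_linear_autoencoder(records, steps=500, lr=0.05):
    """Fit a tied-weight linear autoencoder with a one-unit bottleneck.

    Encoder: h = w . x      Decoder: x_hat = w * h
    The loss is the squared reconstruction error, averaged over records.
    """
    dim = len(records[0])
    w = [0.5] * dim  # fixed starting point keeps the sketch deterministic
    for _ in range(steps):
        grad = [0.0] * dim
        for x in records:
            h = sum(wi * xi for wi, xi in zip(w, x))   # bottleneck activation
            r = [xi - wi * h for xi, wi in zip(x, w)]  # residual x - x_hat
            wr = sum(wi * ri for wi, ri in zip(w, r))
            for j in range(dim):
                # gradient of ||x - w (w . x)||^2 with respect to w[j]
                grad[j] += -2.0 * (h * r[j] + wr * x[j]) / len(records)
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w


def reconstruction_error(w, x):
    """Mean squared error between a record and its reconstruction."""
    h = sum(wi * xi for wi, xi in zip(w, x))
    return sum((xi - wi * h) ** 2 for xi, wi in zip(x, w)) / len(x)


# Hypothetical "good" records: the second field is always twice the first.
good = [[t / 5.0, 2.0 * t / 5.0] for t in range(-5, 6)]
weights = train_linear_autoencoder(good)

error_good = reconstruction_error(weights, [0.4, 0.8])   # follows the pattern
error_bad = reconstruction_error(weights, [0.4, -0.8])   # violates the pattern
```

Because the single hidden unit can only encode one direction, the network has to learn the relationship between the two fields. A record that follows it is reconstructed almost exactly, while a record that breaks it gets a large reconstruction error.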

Autoencoders for Data Quality: Architecture and Training

Training on imperfect data: Requires large share of good data

Limits capacity of network: More layers are not always better

From simple one-layer NN up to VAE with LSTM cells


Discriminating Good and Bad Data Records: Clustering the Reconstruction Errors

[Figure: Mean Squared Error (y-axis) per Individual Data Record (x-axis), with Kernel Density Estimate of the errors]

Challenge: Many data points and potentially extreme class imbalance
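One way to turn a kernel density estimate of the reconstruction errors into a cut-off is sketched below. The slides do not say how the threshold was derived, so the rule used here is an assumption: start at the densest point (taken to be the good records, which presumes a large share of good data) and move right until the density stops falling. The function `kde_threshold`, its `bandwidth` parameter and the example errors are hypothetical.

```python
import math


def kde_threshold(errors, bandwidth, grid_size=200):
    """Return the first density valley to the right of the main mode.

    Uses an unnormalized Gaussian kernel density estimate on an evenly
    spaced grid; normalization does not change where the valley lies.
    """
    lo, hi = min(errors), max(errors)
    if hi == lo:
        return hi  # all errors identical: nothing can be separated
    step = (hi - lo) / (grid_size - 1)
    grid = [lo + i * step for i in range(grid_size)]
    density = [
        sum(math.exp(-0.5 * ((g - e) / bandwidth) ** 2) for e in errors)
        for g in grid
    ]
    # Start at the densest grid point (assumed to be the good records).
    i = max(range(grid_size), key=density.__getitem__)
    # Walk right while the density keeps falling or stays flat.
    while i + 1 < grid_size and density[i + 1] <= density[i]:
        i += 1
    if i == grid_size - 1:
        return hi  # no second mode found: flag nothing
    return grid[i]


# Made-up per-record reconstruction errors: eight good records, two bad ones.
errors = [0.010, 0.012, 0.009, 0.011, 0.010, 0.013, 0.008, 0.011, 0.50, 0.52]
threshold = kde_threshold(errors, bandwidth=0.05)
flagged = [i for i, e in enumerate(errors) if e > threshold]
```

With strong class imbalance the second mode is small, so the result depends heavily on the bandwidth. That is why it is left as an explicit argument here rather than derived from a rule of thumb.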


Discriminating Good and Bad Data Records: Sequence of Autoencoders

Challenge: Magnitude of reconstruction error varies across data error types

[Figure: 1st iteration; labels: Remove Detected Anomalies, Keep Rest of Data]


Across iterations: Increase model complexity

Stopping: When threshold separates large chunk of data
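The iterative scheme can be sketched as a loop: fit a model, remove the records above the threshold, refit on the rest, and stop once the threshold would cut off a large share of the remaining data. Everything concrete in this sketch is an assumption. The stand-in scorer uses the squared distance from the median instead of an autoencoder and ignores the `complexity` argument, the threshold is placed in the largest gap between sorted scores, and `max_share` is a made-up stopping parameter.

```python
import statistics


def fit_stand_in_scorer(values, complexity):
    """Placeholder for training an autoencoder of the given complexity.

    Returns a function that maps a value to a score playing the role of
    the reconstruction error. `complexity` is accepted but unused here.
    """
    center = statistics.median(values)
    return lambda v: (v - center) ** 2


def largest_gap_threshold(scores):
    """Place the threshold in the middle of the widest gap between scores."""
    ordered = sorted(scores)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    k = max(range(len(gaps)), key=gaps.__getitem__)
    return (ordered[k] + ordered[k + 1]) / 2.0


def iterative_detection(values, max_share=0.3, max_rounds=10):
    """Repeatedly remove detected anomalies and refit on the rest."""
    remaining = list(values)
    rounds = []
    for complexity in range(1, max_rounds + 1):  # complexity grows per round
        if len(remaining) < 2:
            break
        score = fit_stand_in_scorer(remaining, complexity)
        scores = [score(v) for v in remaining]
        threshold = largest_gap_threshold(scores)
        flagged = [v for v, s in zip(remaining, scores) if s > threshold]
        # Stop when nothing is flagged or the cut separates a large chunk.
        if not flagged or len(flagged) / len(remaining) > max_share:
            break
        rounds.append(flagged)
        remaining = [v for v, s in zip(remaining, scores) if s <= threshold]
    return rounds, remaining


# Made-up field values: one gross error (1000) masks two smaller ones.
data = [9, 11, 1000, 9, 11, 10, 50, 9, 11, 10, 48, 9, 11]
rounds, kept = iterative_detection(data)
```

In this toy run the first round only finds the gross error, because its score dwarfs everything else. The two smaller errors become separable in the second round. In the third round the widest gap would split off most of the remaining records, so the loop stops.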


Demo: Birth date

Demo: First name

Demo: Revenue


Reusability of Pre-Processing and Model Setup

Type of variable   Pre-processing                                  Model
Character          One-hot encoding of characters                  Variational autoencoders with LSTM cells
Categorical        One-hot encoding                                Complete autoencoder with regularization
Date               Numerical features from digits, normalization   Complete autoencoder with regularization
Numerical          Normalization                                   Undercomplete autoencoder with custom loss

Generic pipeline per field type → can be reused for other fields of same type
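The pre-processing column can be sketched as one small function per field type, looked up by type so that a new field of a known type reuses the same step. The details are assumptions, not taken from the slides: the lowercase ASCII alphabet, the fixed `max_len`, scaling date digits by 9, min-max scaling for numerical fields, and all function names.

```python
import string

ALPHABET = string.ascii_lowercase  # assumed character set


def one_hot_characters(text, alphabet=ALPHABET, max_len=8):
    """One row per character position; all-zero rows pad short input
    and stand in for characters outside the alphabet."""
    rows = []
    for pos in range(max_len):
        row = [0] * len(alphabet)
        if pos < len(text):
            idx = alphabet.find(text[pos].lower())
            if idx >= 0:
                row[idx] = 1
        rows.append(row)
    return rows


def one_hot_category(value, categories):
    """One indicator per known category."""
    return [1 if value == c else 0 for c in categories]


def date_digit_features(iso_date):
    """'1985-03-27' -> its eight digits, each scaled to [0, 1]."""
    return [int(ch) / 9.0 for ch in iso_date if ch in string.digits]


def min_max_normalize(values):
    """Scale a numerical column to [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# One pre-processing step per field type, reusable for fields of that type.
PIPELINES = {
    "character": one_hot_characters,
    "categorical": one_hot_category,
    "date": date_digit_features,
    "numerical": min_max_normalize,
}

name_rows = PIPELINES["character"]("Ana", max_len=4)
date_features = PIPELINES["date"]("1985-03-27")
revenue_scaled = PIPELINES["numerical"]([10.0, 20.0, 30.0])
```

A first-name field and a last-name field would both go through `PIPELINES["character"]`; only the model trained on the encoded output differs per field.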


Key Findings from Application to Production Data

High reusability: One-time customization effort

Replication: ML can automatically replicate rule-based data quality checks

Extension: Autoencoders can find additional errors

SME feedback necessary: Sanity checks during model building

Multivariate relationships: Detection of interdependencies

Unsupervised learning! Training data quality matters

[Slide group labels: Training, Performance]

Outlook: Future Endeavors

Model lifecycle: improve models over time from feedback

Detected anomalies are errors: correct or leave out

Detected anomalies are false positives: increase weight during training

Automated error correction

Data remediation using RPA

Batch processing: extend error detection to whole batches of data

Detect faulty data sourcing process
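The slides do not specify how the weight of false positives would be increased. One plausible reading, sketched here purely as an assumption, is a per-record sample weight in the reconstruction loss, so that records confirmed as valid contribute more to the next training run. The function names and the `boost` factor are hypothetical.

```python
def record_mse(record, reconstruction):
    """Mean squared error between one record and its reconstruction."""
    return sum((x - y) ** 2 for x, y in zip(record, reconstruction)) / len(record)


def feedback_weights(n_records, false_positive_indices, boost=3.0):
    """Weight 1.0 for every record, `boost` for confirmed false positives."""
    confirmed = set(false_positive_indices)
    return [boost if i in confirmed else 1.0 for i in range(n_records)]


def weighted_reconstruction_loss(records, reconstructions, weights):
    """Weighted average of the per-record reconstruction errors."""
    total = sum(
        w * record_mse(r, p)
        for r, p, w in zip(records, reconstructions, weights)
    )
    return total / sum(weights)


# Made-up example: record 1 was flagged, then confirmed as valid by an SME.
records = [[1.0, 2.0], [3.0, 4.0]]
reconstructions = [[1.0, 2.0], [2.0, 4.0]]

uniform_loss = weighted_reconstruction_loss(
    records, reconstructions, feedback_weights(2, [])
)
boosted_loss = weighted_reconstruction_loss(
    records, reconstructions, feedback_weights(2, [1])
)
```

With the boost, the poorly reconstructed but valid record dominates the loss, which pushes the retrained model to reconstruct it better and makes it less likely to be flagged again.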
