Automated Data Quality Assurance with Machine Learning ......Digitization - Data - Intelligence...
Transcript of Automated Data Quality Assurance with Machine Learning ......Digitization - Data - Intelligence...
Digitization - Data - Intelligence
Automated Data Quality Assurancewith Machine Learning and Autoencoders
SDS2019
Bern, 14.06.2019
Martin Müller-Lennert
Senior Data Scientist
Milica Petrović
Senior Data Scientist
2
Talk Outline
2
What’s Wrong with Data Quality?
3
Error Detection using ML
4
Demo
Findings and Outlook
1
3
Talk Outline
2
What’s Wrong with Data Quality?
3
Error Detection using ML
4
Demo
Findings and Outlook
1
4
Data Quality TodayWhat’s wrong with it?
Data Sources
Data Quality
End User
5
Data Quality TodayWhat’s wrong with it?
Data Sources
Data Quality
End User
6
Data Quality TodayWhat’s wrong with it?
Data Sources
Data Quality
End User
7
Data Quality TodayWhat’s wrong with it?
Data Sources
Data Quality
End User
8
Data Quality TodayWhat’s wrong with it?
Data Sources
Data Quality
End User
9
Data Quality TodayWhat’s wrong with it?
Data Sources
Data Quality
End User
10
Data Quality TodayWhat’s wrong with it?
Data Sources
Data Quality
End User
11
Data Quality TodayWhat’s wrong with it?
Data Sources
Data Quality
End User
12
Data Quality TodayOur Take at a Solution
Data Quality Today
▪ Manually coded
SQL rules
▪ Uni-/bi-variate
checks
▪ Too much data: set of rules
▪ Too few rules: undetected
errors
▪ Too narrow focus: one-
dimensional
▪ Too late: new errors types
detected after occurrence
▪ Automate: simultaneous error detection & faster process
▪ Reusability: tailored ML algorithms reused for fields of similar type
▪ Deep dive: discovery of new types of errors based on multivariate
relationships
▪ Unsupervised
▪ Model of input data:
→ Anomalies easily
detected
▪ Capture multivariate
relationships
Ch
all
en
ges
Solutions with Machine Learning
Au
toan
co
ders
13
Talk Outline
2
What’s Wrong with Data Quality?
3
Error Detection using ML
4
Demo
Findings and Outlook
1
14
Autoencoders for Data QualityArchitecture and Training
Target: Reconstruct input
Bottleneck: Enforced by architecture or regularization
Ensures network learns structure of input data
For good data only
INPUTINPUT OUTPUT
15
Autoencoders for Data QualityArchitecture and Training
Training on imperfect data: Requires large share of good data
Limits potency of network: More layers not always better
From simple one-layer NN up to VAE with LSTM cells
INPUT OUTPUT INPUT
For good data only
16
Discriminating Good and Bad Data RecordsClustering the Reconstruction Errors
Mean
Sq
uare
d E
rro
r
Individual Data Records
Challenge: Many data points and potentially extreme class imbalance
Kernel Density Estimate
17
Discriminating Good and Bad Data RecordsClustering the Reconstruction Errors
Mean
Sq
uare
d E
rro
r
Individual Data Records
Challenge: Many data points and potentially extreme class imbalance
Kern
el D
en
sity E
stimate
18
Discriminating Good and Bad Data RecordsSequence of Autoencoders
1st iteration
Keep Rest of Data
Challenge: Magnitude of reconstruction error
varies across data error types
Remove Detected
Anomalies
19
Discriminating Good and Bad Data RecordsSequence of Autoencoders
20
Discriminating Good and Bad Data RecordsSequence of Autoencoders
21
Discriminating Good and Bad Data RecordsSequence of Autoencoders
Across iterations: Increase model complexity
Stopping: When threshold separates large chunk of data
22
Talk Outline
2
What’s Wrong with Data Quality?
3
Error Detection using ML
4
Demo
Findings and Outlook
1
23
DemoBirth date
24
DemoBirth date
25
DemoFirst name
26
DemoFirst name
27
DemoRevenue
28
DemoRevenue
29
Talk Outline
2
What’s Wrong with Data Quality?
3
Error Detection using ML
4
Demo
Findings and Outlook
1
30
Reusability of Pre-Processing and Model Setup
Type of variable Pre-processing Model
Character One-hot encoding of charactersVariational autoencoders with
LSTM cells
Categorical One-hot encodingComplete autoencoder with
regularization
DateNumerical features from digits Complete autoencoder with
regularizationNormalization
Numerical NormalizationUndercomplete autoencoder
with custom loss
Generic pipeline per field type → can be reused for other fields of same type
31
Key Findings from Application to Production Data
High reusability: One-time customization effort
Replication: ML can automatically replicate rule-based data quality checks
Extension: Autoencoders can find additional errors
SME feedback necessary: Sanity checks during model building
Multivariate relationships: Detection of interdependencies
Unsupervised learning! Training data quality matters
Tra
inin
gP
erf
orm
an
ce
32
Model lifecycle: improve models over time from feedback
Detected anomalies are errors: correct or leave out
Automated error correction
Data remediation using RPA
Batch processing: extend error detection to whole batches of data
Detect faulty data sourcing process
Detected anomalies are false positives: increase weight during training
OutlookFuture Endeavors