DATA CURATION AND CLEANING USING ARTIFICIAL INTELLIGENCE
A CRITICAL STEP FOR SCALABLE DATA UNIFICATION
A PAPER BY CAPT (IN) VISHNU MOHINDRA
CAPTAIN(IT)
IHQ MOD(N)/ DIT
CONTENTS
• Introduction
• AI/ ML for Data Cleaning & Curation
• Stanford Initiatives
• Three Generations of AI Based Data Processing
• ‘Tamr’ – Data Unification At Scale
• Conclusion and Takeaways
INTRODUCTION
• The 800-Pound Gorilla Problem
✓ Big Variety
✓ Data within the Navy has grown in Silos – almost 300 databases, geographically distributed
✓ Industry Solutions cannot simply be Push-Fitted
✓ The Navy needs to Rapidly Deliver Solutions & Insights
AI/ ML FOR DATA CURATION & CLEANING
• Multiple Data Schemas, Sources & Metadata
• Data Cleaning & Organising is the most Time-Consuming Task of a Modern Enterprise Data Scientist
• Data Cleaning needs to be Automated using AI/ ML
• Human-in-the-loop
• 60% of Time is Spent Cleaning & Organising Data
DATA WRANGLING – STANFORD INITIATIVES
• Wrangler to Trifacta
• DeepDive – Pre-Trained Systems
• Emergence of New-Generation Programming Languages
• Quality is the Key
• Automation and Scaling
DATA UNIFICATION – THREE GENERATIONS OF AI
• Generation 1 (Mid to Late 1990s)
✓ Extract, Transform & Load
✓ Build Data Warehouses
✓ Static Rules with Scripts
• Generation 2
✓ ETL with Cleaning Suites
✓ Rules-Based MDM
✓ Human-Generated Golden Records
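The "Static Rules with Scripts" approach of the first two generations can be pictured with a minimal sketch. The rules, the `apply_rules` helper and the supplier names are hypothetical and chosen only for illustration; they are not taken from any specific cleaning suite.

```python
import re

# Hypothetical hand-written rules: each pairs a pattern with a canonical replacement.
RULES = [
    (re.compile(r"\s+"), " "),                            # collapse runs of whitespace
    (re.compile(r"\bpvt\b\.?", re.IGNORECASE), "Private"),  # expand "Pvt" / "Pvt."
    (re.compile(r"\bltd\b\.?", re.IGNORECASE), "Limited"),  # expand "Ltd" / "Ltd."
]

def apply_rules(value: str) -> str:
    """Apply every static rule, in order, to a single field value."""
    cleaned = value.strip()
    for pattern, replacement in RULES:
        cleaned = pattern.sub(replacement, cleaned)
    return cleaned

if __name__ == "__main__":
    for raw in ["  Acme   Pvt.  Ltd. ", "ACME PVT LTD"]:
        print(repr(raw), "->", repr(apply_rules(raw)))
```

Each new spelling variant needs another hand-written rule (the second input above keeps its upper-case "ACME" because no rule covers casing), which is one reason such scripts struggle once the number of sources grows.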
DATA UNIFICATION – THREE GENERATIONS OF AI
• Generation 3
✓ Machine-Learning-Based Unification
✓ Apply Customer Rules to Construct a Classification Model
✓ Millions of Transactions Classified with ML
✓ Schema Integration & Golden Records
✓ Human in the Loop
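The third-generation idea (seed a model from existing rules or labels, let it classify the bulk, and send only uncertain cases to a person) can be sketched as follows. This is a deliberately tiny stand-in and not Tamr's method: the "model" is a single string-similarity feature with a learned threshold (a nearest-centroid classifier on one feature), and the seed pairs, function names and the 0.1 review margin are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from statistics import mean

def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in the range [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def fit_threshold(labelled_pairs):
    """Learn a decision threshold: the midpoint between the mean similarity
    of known matches and the mean similarity of known non-matches."""
    matches = [similarity(a, b) for a, b, is_match in labelled_pairs if is_match]
    others = [similarity(a, b) for a, b, is_match in labelled_pairs if not is_match]
    return (mean(matches) + mean(others)) / 2

def classify(a: str, b: str, threshold: float, margin: float = 0.1) -> str:
    """Label a pair, routing scores close to the threshold to a human reviewer."""
    score = similarity(a, b)
    if abs(score - threshold) < margin:
        return "needs_review"
    return "match" if score > threshold else "distinct"

# Hypothetical seed labels, e.g. produced by existing customer rules.
SEED = [
    ("Acme Ltd", "ACME LTD", True),
    ("Orion Electronics", "Orion Electronics Ltd", True),
    ("Acme Ltd", "Zenith Corp", False),
    ("Orion Electronics", "Delta Shipyard", False),
]

if __name__ == "__main__":
    threshold = fit_threshold(SEED)
    print(classify("Acme Ltd", "Acme Limited", threshold))
    # Human in the loop: an expert's verdict on a reviewed pair becomes training data.
    SEED.append(("Acme Ltd", "Acme Limited", True))
    threshold = fit_threshold(SEED)
```

The design point is the review band: the model decides the clear cases at volume, and each expert answer is fed back so that fewer pairs need review over time.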
DATA UNIFICATION AT SCALE
• Hadoop Based Data Lakes don’t Solve Everything
• The Seven Tenets by Tamr
✓ Ingest
✓ Clean
✓ Transform
✓ Integrate Schema
✓ Deduplicate
✓ Classify
✓ Export
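The seven steps listed above can be read as stages of a pipeline. The sketch below walks two small, made-up CSV sources through the stages in the same order. It is not Tamr's API: the column names, `SCHEMA_MAP`, `CATEGORY_KEYWORDS` and every function name are assumptions, and the deduplicate and classify stages use exact-match and keyword stand-ins where a real system would use learned models.

```python
import csv
import io

# Hypothetical mapping from source-specific column names to a unified schema.
SCHEMA_MAP = {"item_name": "name", "ItemName": "name",
              "qty": "quantity", "Quantity": "quantity"}
# Hypothetical keyword-to-category lookup standing in for a trained classifier.
CATEGORY_KEYWORDS = {"bolt": "fastener", "gasket": "seal"}

def ingest(csv_text):
    """Ingest: read raw rows from one source."""
    return list(csv.DictReader(io.StringIO(csv_text)))

def clean(rows):
    """Clean: trim whitespace and drop rows with no values at all."""
    trimmed = [{k: (v or "").strip() for k, v in row.items()} for row in rows]
    return [row for row in trimmed if any(row.values())]

def transform(rows):
    """Transform: convert digit strings to int and title-case text values."""
    return [{k: int(v) if v.isdigit() else v.title() for k, v in row.items()}
            for row in rows]

def integrate_schema(rows):
    """Integrate schema: rename source columns to the unified names."""
    return [{SCHEMA_MAP.get(k, k): v for k, v in row.items()} for row in rows]

def deduplicate(rows):
    """Deduplicate: keep the first record seen for each name."""
    seen, unique = set(), []
    for row in rows:
        key = row["name"].lower()
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

def classify(rows):
    """Classify: attach a category based on keywords in the name."""
    for row in rows:
        name = row["name"].lower()
        row["category"] = next(
            (cat for kw, cat in CATEGORY_KEYWORDS.items() if kw in name),
            "uncategorised")
    return rows

def export(rows):
    """Export: serialise the unified records as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["name", "quantity", "category"],
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

def unify(sources):
    """Run every source through the per-source stages, then the shared ones."""
    unified = []
    for text in sources:
        unified.extend(integrate_schema(transform(clean(ingest(text)))))
    return export(classify(deduplicate(unified)))

if __name__ == "__main__":
    source_a = "item_name,qty\n bolt m8 ,10\n,\n"
    source_b = "ItemName,Quantity\nBOLT M8,10\nGasket,4\n"
    print(unify([source_a, source_b]))
```

Keeping each stage as a separate function means any one of them can later be swapped for a learned component without touching the others.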
CONCLUSION & TAKEAWAYS
• Data is Dirty and Needs to be Cleaned (Context & User Matter)
• Clean Data at High Speed
• Data Transformation Needs a ‘Human in the Loop’
• AI/ ML for Data Unification
• Automated & Scalable Frameworks
• Collaborate with Start-ups & Be Open to New Ideas
• Avoid Buzzwords & Marketing Spin