DATA CURATION AND CLEANING USING ARTIFICIAL INTELLIGENCE

A CRITICAL STEP FOR SCALABLE DATA UNIFICATION

A PAPER BY CAPT (IN) VISHNU MOHINDRA, CAPTAIN (IT), IHQ MOD(N)/DIT

CONTENTS

• Introduction

• AI/ ML for Data Cleaning & Curation

• Stanford Initiatives

• Three Generations of AI Based Data Processing

• ‘Tamr’ – Data Unification At Scale

• Conclusion and Takeaways

INTRODUCTION

• The 800-Pound Gorilla Problem

✓ Big Variety

✓ Data within the Navy has grown in Silos – almost 300 geographically distributed databases

✓ Industry Solutions cannot be Push-Fitted

✓ Navy needs to Rapidly Deliver Solutions & Insights

AI/ ML FOR DATA CURATION & CLEANING

• Multiple Data Schemas, Sources & Metadata

• Data Cleaning & Organising is the most Time Consuming Task of a Modern Enterprise Data Scientist

• Data Cleaning needs to be Automated using AI/ ML

• Human-in-the-loop

[Figure: 60% of Time Spent Cleaning & Organising Data]

DATA WRANGLING – STANFORD INITIATIVES

• Wrangler to Trifacta

• DeepDive – Pre-Trained Systems

• Emergence of New Generation Programming Languages

• Quality is the Key

• Automation and Scaling

DATA UNIFICATION – THREE GENERATIONS OF AI

• Generation 1 (mid to late 1990s)

✓ Extract, Transform & Load (ETL)

✓ Build Data Warehouses

✓ Static Rules with Scripts

• Generation 2

✓ ETL with Cleaning Suites

✓ Rules-Based MDM (Master Data Management)

✓ Human-Generated Golden Records
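As a rough illustration of what "static rules with scripts" means in practice, the sketch below hard-codes a small matching rule in Python. The abbreviation table, the record fields, and the function names are assumptions made for this example only and do not come from any particular ETL or MDM product.

```python
import re

# Hypothetical static rules of the kind a Generation 1/2 script would hard-code.
ABBREVIATIONS = {"ltd": "limited", "pvt": "private", "co": "company"}

def normalise_name(name: str) -> str:
    """Lower-case, strip punctuation, and expand known abbreviations."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(ABBREVIATIONS.get(t, t) for t in tokens)

def same_entity(a: dict, b: dict) -> bool:
    """Static rule: two records match only if their normalised names are identical."""
    return normalise_name(a["name"]) == normalise_name(b["name"])

r1 = {"name": "Acme Pvt. Ltd."}
r2 = {"name": "ACME Private Limited"}
r3 = {"name": "Acme Privte Limited"}  # a typo the rule set does not anticipate

print(same_entity(r1, r2))  # True
print(same_entity(r1, r3))  # False
```

Every new spelling variant needs another hand-written rule, which is why this approach struggles as the number of sources grows.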

• Generation 3

✓ Machine Learning Based Unification

✓ Apply Customer Rules to Construct a Classification Model

✓ Millions of Transactions Classified with ML

✓ Schema Integration & Golden Records

✓ Human in the Loop
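The Generation 3 idea can be sketched in a few lines: a model is trained from a handful of expert-labelled record pairs, confident predictions are applied automatically, and uncertain ones are routed to a person. The example below is only a minimal sketch under stated assumptions: it uses a one-feature logistic regression written with the Python standard library, and the sample names, the thresholds (0.1 and 0.9), and all function names are hypothetical choices for illustration, not the method of any specific product.

```python
import math
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Name similarity in [0, 1] using the standard-library SequenceMatcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def train(pairs, labels, epochs=2000, lr=1.0):
    """Fit a one-feature logistic regression by batch gradient descent."""
    xs = [similarity(left, right) for left, right in pairs]
    weight, bias = 0.0, 0.0
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, labels):
            err = sigmoid(weight * x + bias) - y
            grad_w += err * x
            grad_b += err
        weight -= lr * grad_w / len(xs)
        bias -= lr * grad_b / len(xs)
    return weight, bias

def match_probability(model, left, right):
    """Probability that two names refer to the same entity."""
    weight, bias = model
    return sigmoid(weight * similarity(left, right) + bias)

def route(p, low=0.1, high=0.9):
    """Auto-decide confident cases; send the rest to a human reviewer."""
    if p >= high:
        return "auto_match"
    if p <= low:
        return "auto_non_match"
    return "human_review"

# Seed labels, standing in for pairs labelled by expert ("customer") rules.
seed_pairs = [
    ("acme limited", "acme ltd"),
    ("global traders", "global traders inc"),
    ("acme limited", "zenith motors"),
    ("global traders", "blue ocean foods"),
]
seed_labels = [1, 1, 0, 0]

model = train(seed_pairs, seed_labels)

for left, right in [("Acme Limited", "ACME LTD"), ("Acme Limited", "Zenith Motors")]:
    p = match_probability(model, left, right)
    print(left, "|", right, "->", round(p, 2), route(p))
```

In a fuller loop, the pairs sent to "human_review" would be labelled by a reviewer, appended to the seed set, and the model retrained, so that the share of automatic decisions grows over time.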

DATA UNIFICATION AT SCALE

• Hadoop-Based Data Lakes don’t Solve Everything

• The Seven Tenets by Tamr

✓ Ingest

✓ Clean

✓ Transform

✓ Integrate Schema

✓ Deduplicate

✓ Classify

✓ Export
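The seven steps listed above can be read as stages of a pipeline. The sketch below chains them in the same order on two tiny in-memory sources with different column names. It is an illustration only, not Tamr's API: the column mapping, the sample rows, and the keyword rule used in place of a trained classifier are all assumptions, and deduplication here is by exact match rather than by a learned model.

```python
import json

# Hypothetical mapping from source column names to one unified schema.
SCHEMA_MAP = {"Name": "name", "supplier_name": "name", "City": "city", "town": "city"}

def ingest(sources):
    """Pull rows from every source into one list."""
    return [dict(row) for rows in sources.values() for row in rows]

def clean(rows):
    """Trim whitespace and drop rows that have an empty field."""
    trimmed = [{k: v.strip() for k, v in row.items()} for row in rows]
    return [row for row in trimmed if all(row.values())]

def transform(rows):
    """Normalise values to lower case."""
    return [{k: v.lower() for k, v in row.items()} for row in rows]

def integrate_schema(rows):
    """Rename source-specific columns to the unified schema."""
    return [{SCHEMA_MAP[k]: v for k, v in row.items()} for row in rows]

def deduplicate(rows):
    """Keep the first copy of each identical record (exact match only)."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

def classify(rows):
    """Keyword rule standing in for a trained classifier."""
    return [dict(row, category="food" if "food" in row["name"] else "general") for row in rows]

def export(rows):
    """Serialise the unified records as JSON."""
    return json.dumps(rows, sort_keys=True)

# Two hypothetical sources describing overlapping suppliers.
sources = {
    "a": [{"Name": " Acme Ltd ", "City": "Mumbai"}],
    "b": [
        {"supplier_name": "ACME LTD", "town": "mumbai"},
        {"supplier_name": "Blue Ocean Foods", "town": "Kochi "},
    ],
}

data = sources
for step in [ingest, clean, transform, integrate_schema, deduplicate, classify]:
    data = step(data)
output = export(data)
print(output)
```

Keeping each step as a separate function makes it possible to replace the exact-match deduplication or the keyword classifier with a learned model without touching the other stages.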

DATA MASTERING USING ML

CONCLUSION & TAKEAWAYS

• Data is Dirty and Needs to be Cleaned (Context & User Matter)

• Clean Data at High Speed

• Data Transformation Needs a ‘Human in the Loop’

• AI/ ML for Data Unification

• Automated & Scalable Frameworks

• Collaborate with Start-ups & Stay Open to New Ideas

• Avoid Buzzwords & Marketing Spins

DISCUSSIONS