Introducing Databricks Delta
-
Upload
databricks -
Category
Data & Analytics
-
view
680 -
download
1
Transcript of Introducing Databricks Delta
Databricks Customers Across IndustriesFinancialServices Healthcare&Pharma Media&Entertainment Technology
PublicSector Retail&CPG ConsumerServices Energy&IndustrialIoTMarketing&AdTech
Data&AnalyticsServices
Health care AI cloud dataset use caseFinancialServices Healthcare&Pharma Media&Entertainment Technology
PublicSector Retail&CPG ConsumerServices Energy&IndustrialIoTMarketing&AdTech
Data&AnalyticsServices
Correlate EMR of 50,000 patients compared with their DNA
FinancialServices Healthcare&Pharma Media&Entertainment Technology
PublicSector Retail&CPG ConsumerServices Energy&IndustrialIoTMarketing&AdTech
Data&AnalyticsServices
Enterprise AI use case
5
Provide recommendations to sales using NLP and deep learning
FinancialServices Healthcare&Pharma Media&Entertainment Technology
PublicSector Retail&CPG ConsumerServices Energy&IndustrialIoTMarketing&AdTech
Data&AnalyticsServices
6
Real-time AI use-case
Curb abusive behavior across gamers globally
Big Data was the Missing Link for AIBIG DATA
Customer Data
Emails/Web pages
Click Streams
Sensor data (IoT)
Video/Speech
…
Most companies are Struggling with Big Data
GREAT RESULTS
Hardest part of AI isn’t AI“Hidden Technical Debt in Machine Learning Systems", Google NIPS 2015
The hardest part of AI is Big Data
MLCode
Data Warehouse (DW)
THE GOOD• Pristine Data• Fast Queries• Transactional
THE BAD• Expensive to Scale, not Elastic• Requires ETL, Stale Data, No Real-Time• No Predictions, No ML• Closed formats (lock in)
Not Future Proof – Missing Predictions, Real-time, Scale
ETLimportantdatatocentralDWandgetBusinessIntelligence(BI)
THE BAD• Inconsistent Data• Unreliable for Analytics• Lack of Schema• Poor Performance
Hadoop Data Lake
Become a cheap messy data store with poor performance
ETLalldatatocentralscalableopenlakeforallusecases
THE GOOD• Massive scale • Inexpensive Storage• Open Formats (Parquet, ORC)• Promise of ML & Real Time
Streaming
Info Sec at a Fortune 100 Company
DISADVANTAGES OF ARCHITECTURE• Poor agility in responding to new threats• Scale Limitations, no historical data• 6 Months and twenty people to build
ENTERPRISE DATA WAREHOUSE• Only 2 weeks of data • Very expensive to scale• Proprietary Formats• No Predictions (ML)
Messy data not ready for analytics
Billions of records a day HADOOPDATA LAKE
Complex ETL
EDW
EDW
EDWIncidence Response
Alerting
Reports
First UNIFIED data management system that delivers:
The
SCALEof data lake
The
LOW-LATENCYof streaming
TheRELIABILITY &
PERFORMANCEof data warehouse
Announcing Databricks Delta
Databricks Delta
Enables Predictions, Real-time and Ad Hoc Analytics at Massive Scale
THE GOOD OF DATA LAKES• Massive scale on Amazon S3• Open Formats (Parquet, ORC)• Predictions (ML) & Real Time
Streaming
THE GOOD OF DATA WAREHOUSES• Pristine Data• Transactional Reliability• Fast Queries (10-100x)
Databricks Delta Under the Hood• Decouple Compute & Storage
• ACID Transactions & Data Validation
• Data Indexing & Caching (10-100x)
• Real-Time Streaming Ingest
MASSIVE SCALE
RELIABILITY
PERFORMANCE
LOW-LATENCY
Info Sec with Databricks Delta
DATABRICKS RUNTIMEpowered by
DATABRICKSRUNTIMETrillion Records a Day DATABRICKS
DELTA
ETL, Schema Validation SQL , ML, StreamADVANTAGES• AI capable data warehouse at the scale of a data lake• Interactive analysis on 2 years of data• 2 Weeks to build with a 5 person data platform team
Unified Analytics Platform
+UNIFIEDDATA
MANAGEMENTReliable Transactions, Performance
UNIFIEDEXPERIENCE
ACROSS TEAMSNotebooks, Dashboards, Reports
Challenge #1: Historical Queries?
Data Lake
λ-arch
λ-arch
StreamingAnalytics
Reporting
Eventsλ-arch1
1
1
Challenge #2: Messy Data?
Data Lake
λ-arch
λ-arch
StreamingAnalytics
Reporting
Events
Validation
λ-archValidation
1
21
1
2
Reprocessing
Challenge #3: Mistakes and Failures?
Data Lake
λ-arch
λ-arch
StreamingAnalytics
Reporting
Events
Validation
λ-archValidation
Reprocessing
Partitioned
1
2
3
1
1
3
2
Reprocessing
Challenge #4: Query Performance?
Data Lake
λ-arch
λ-arch
StreamingAnalytics
Reporting
Events
Validation
λ-archValidation
Reprocessing
Compaction
Partitioned
CompactSmall Files
Scheduled to Avoid Compaction
1
2
3
1
1
2
4
4
4
2
Reprocessing
The Canonical Data Pipeline
Data Lake
λ-arch
λ-arch
StreamingAnalytics
Reporting
Events
Validation
λ-archValidation
Reprocessing
Compaction
Partitioned
CompactSmall Files
Scheduled to Avoid Compaction
1
2
3
1
1
2
4
4
4
2
Challenge
DELTA
DATA LAKE Reporting
StreamingAnalytics
TheLOW-LATENCY
of streaming
TheRELIABILITY &
PERFORMANCEof data warehouse
TheSCALE
of data lake
The Delta Architecture