Big Data Simplified: "It's all about abˈstrakSH(ə)n"
HEMAL GANDHI, DIRECTOR OF DATA ENGINEERING
Background
Analyze Current State
• Challenges
• Facts
New Platform Design
• Define Goals
• Feature List
• Implementation Approach
Compare
• Feature List
• Trade-Offs
• Cost Structure
Decision: Fix vs. Build?
Analyze Current State
Platform is very complex
Struggling to keep up with business needs
Huge backlog
Code base is increasing rapidly
We are slow to respond to market needs
Outdated technology stack
Missing best practices
High cost of data storage
Challenge dimensions: Finding Insights, Integration, Maintenance, Strategic Value, Data Identity, Time Value, Dependencies
Lack of understanding of the business impact of data
Process and organization: "Agile" in name, mini-waterfall in practice
High Investment Costs
Adoption Issues
Complex Framework
Lots of challenges
Platform is NOT scalable
Can negatively impact revenue!
New Platform Design
Keep it simple
Keep up with business needs
Move fast
Keep technology stack current over time
Low cost of data storage
Address the same dimensions: Finding Insights, Integration, Maintenance, Strategic Value, Data Identity, Time Value, Dependencies
Understand business impact of data
Measure data
Be Agile – Do Less
Improve data ROI
Compare: Current Platform vs. New Platform

Feature          | Current Platform | New Platform
Investment needs | High             | High
Scalability      | Not scalable     | Initially scalable
Maintenance cost | High             | Initially low, grows over time
Technology       | Outdated         | Technology choices

Big data tools provide technology, not solutions to design problems.
Decision: Fix vs. Build?
Next Steps
Goal: Build a feature-based, scalable big data platform in 6 months with limited resources, while supporting the legacy system.
Design Patterns
Take a Platform Approach:
• Project Requirements → Data Platform Features
• Reusable Components
• Technology Abstraction
• Business Logic as Declarative Configuration
• Pick Technology at Runtime (Execution Engine); a sketch follows below
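To make the last two patterns concrete, here is a minimal sketch, assuming hypothetical class names and an invented "engine" configuration key (not the platform's actual API), of how declarative configuration could pick an execution engine at runtime:

```java
import java.util.Map;

// Minimal sketch: the business logic stays declarative while the
// execution engine is chosen at runtime from configuration.
interface ExecutionEngine {
    void run(String logicalFlow); // execute a logical flow definition
}

class PigEngine implements ExecutionEngine {
    public void run(String logicalFlow) {
        System.out.println("Translating flow to Pig Latin: " + logicalFlow);
    }
}

class SparkEngine implements ExecutionEngine {
    public void run(String logicalFlow) {
        System.out.println("Submitting flow as a Spark job: " + logicalFlow);
    }
}

class EngineFactory {
    // The "engine" key is an assumed configuration property for illustration.
    static ExecutionEngine fromConfig(Map<String, String> config) {
        switch (config.getOrDefault("engine", "pig")) {
            case "spark": return new SparkEngine();
            case "pig":   return new PigEngine();
            default:
                throw new IllegalArgumentException("Unknown engine: " + config.get("engine"));
        }
    }
}
```

With this shape, `EngineFactory.fromConfig(Map.of("engine", "spark")).run("daily-clickstream")` swaps Pig for Spark without touching the business logic.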
Data Access & Ingestion Abstraction
[Diagram: Data Producers feed a Data Ingestion Framework that streams data to the storage layer; Data Consumers read through a Data Access API; Data Integration Jobs run against Data Storage; Hot/Cold Data Management moves data between hot and cold storage, driven by Configuration]
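As one illustration of the hot/cold management shown above, here is a minimal sketch, assuming an invented `HotColdManager` and a configured retention window (none of these names come from the deck):

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

// Minimal sketch of configuration-driven hot/cold data tiering.
record Partition(String path, Instant createdAt) {}

class HotColdManager {
    private final Duration hotRetention; // e.g. configured as 30 days

    HotColdManager(Duration hotRetention) { this.hotRetention = hotRetention; }

    // Move every partition older than the configured retention to cold storage.
    void tier(List<Partition> hotPartitions) {
        Instant cutoff = Instant.now().minus(hotRetention);
        for (Partition p : hotPartitions) {
            if (p.createdAt().isBefore(cutoff)) {
                moveToCold(p);
            }
        }
    }

    private void moveToCold(Partition p) {
        // In practice: copy to cheaper storage, then drop the hot copy.
        System.out.println("Archiving " + p.path() + " to cold storage");
    }
}
```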
abˈstrakSH(ə)n
High Level Architecture
Cross-cutting services: Data Quality Service (Data Lineage & Profiling), Security, Scheduling & Cluster Monitoring
Applications & Visualization Tools run on top of Dredge, the data platform's data access abstraction layer, which spans:
• Collection: Apache Flume, Sqoop
• Flow: Kafka, Spark
• Processing: Pig, Spark, MapReduce
• Storage: Hive, HBase, Vertica
• Delivery: Looker, Tableau, visualization (d3.js), Email/FTP
WHAT IS DREDGE
A declarative abstraction layer for integrating big data tools, enabling a loosely coupled big data platform.
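To ground the definition, here is a minimal sketch with an invented fluent API (this is not Dredge's real interface) showing what "declarative" means in practice: the flow is described as data, and no step names Flume, Kafka, Pig, or Spark.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration: a logical flow captured as data, so the
// platform can bind it to concrete tools at runtime.
class Flow {
    final String name;
    final List<String> steps = new ArrayList<>();

    private Flow(String name) { this.name = name; }
    static Flow named(String name) { return new Flow(name); }

    Flow readFrom(String sourceEndpoint) { steps.add("read: " + sourceEndpoint); return this; }
    Flow transform(String logicalOp)     { steps.add("transform: " + logicalOp); return this; }
    Flow writeTo(String targetEndpoint)  { steps.add("write: " + targetEndpoint); return this; }
}

class Demo {
    public static void main(String[] args) {
        Flow clicks = Flow.named("clickstream-daily")
                .readFrom("logs:/data/raw/clicks")
                .transform("filter status == 200")
                .transform("aggregate count by page")
                .writeTo("hive:analytics.page_views");
        // The platform later binds these logical steps to physical tools.
        clicks.steps.forEach(System.out::println);
    }
}
```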
Dredge Logical View
[Diagram: source endpoints feed Source Readers, and Target Writers deliver to target endpoints via streams or direct writes (both sketched below); tasks run on the Hadoop cluster with events management and log streaming; the Dredge repository and its configuration abstraction live in HBase; the Dredge runtime keeps a temp store and temp cache on HDFS alongside event management and a logger stream; underlying Lambda architecture: HDFS, Hive, HBase, Pig, Flume, Kafka, Oozie]
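A minimal sketch of the Source Reader / Target Writer abstraction named in the logical view; the interface shapes are assumptions, not Dredge's actual signatures:

```java
import java.util.Iterator;
import java.util.Map;

// Concrete readers (logs, RDBMS, unstructured data) and writers
// (Hive, HBase, RDBMS) plug in behind these two interfaces, so sources
// and targets stay configuration-defined endpoints.
interface SourceReader {
    // Open a source endpoint described entirely by configuration.
    Iterator<Map<String, Object>> read(Map<String, String> endpointConfig);
}

interface TargetWriter {
    // Deliver records to a target endpoint, direct or streamed.
    void write(Iterator<Map<String, Object>> records,
               Map<String, String> endpointConfig);
}
```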
Dredge Architecture
Dredge Data Services:
• Aggregators, UDFs, Combiners, Routers
• Rank, Sorter, Set Operations
• Filters/Patterns, Analysis
• Plugins (Java/Shell, Pig, SQL); see the sketch below
Abstraction Builder (Kafka, Flume, Pig, Custom)
Source Readers (Logs, RDBMS, Unstructured Data, Custom): Direct/Stream
Target Writers (Hive, HBase, RDBMS, Custom): Direct/Stream
Dredge UI: Declarative Configuration, Logical Flows, Data Lineage, Runtime Logs, Admin
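The data services listed above suggest a common plugin surface. Here is a minimal sketch, assuming an invented `DredgePlugin` interface and configuration keys:

```java
import java.util.Map;
import java.util.stream.Stream;

// Hypothetical plugin surface: reusable components (aggregators, filters,
// routers) share one interface, so new logic plugs in without platform changes.
interface DredgePlugin {
    String name();
    // Transform a stream of records; declarative parameters arrive as config.
    Stream<Map<String, Object>> apply(Stream<Map<String, Object>> records,
                                      Map<String, String> config);
}

class StatusFilter implements DredgePlugin {
    public String name() { return "status-filter"; }

    public Stream<Map<String, Object>> apply(Stream<Map<String, Object>> records,
                                             Map<String, String> config) {
        String wanted = config.get("status"); // e.g. "200"
        return records.filter(r -> String.valueOf(r.get("status")).equals(wanted));
    }
}
```

Because every component takes the same record-stream shape, composing aggregators, filters, and routers into a flow is a configuration exercise rather than a coding one.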
DREDGE BENEFITS
• From 1000+ scripts to 50-100 scripts
• From 1000+ configuration files to <5 files
• Logical view of workflows; the physical implementation is abstracted
• New big data tools integrate quickly through declarative configuration
• Improved SLAs, faster time to market, better cluster utilization, higher performance
• Simplified integration
• Minimal migration costs
• Low maintenance; configurable archiving of data
Summarizing
✓ Abstraction layer: technology, data access, data ingestion, dependencies... it is all about abˈstrakSH(ə)n
✓ Reusable data components
✓ Event-driven dependencies
✓ Plug & play integration, loosely coupled (cluster resources, data)
Big data requires a different mindset: innovate, iterate often, and keep it simple.
Thank you.
ENGINEERING.ONEKINGSLANE.COM