The Evolution of Big Data Pipelines at Intuit
Transcript of The Evolution of Big Data Pipelines at Intuit
June 30, 2016
#hadoopsummit #HS16SJ
Your Speakers
Lokesh Rajaram, Senior Software Engineer, Intuit
Likes photography
Rekha Joshi, Principal Software Engineer, Intuit
Currently likes Chopped
The Plan
Unicellular Amoeba
Multicellular Humans
Cannot Evolve? Disappear.
Gone!
Evolution of Big Data
Our Mission
To improve our customers’ financial lives so profoundly … they can’t imagine going back to the old way!
Consumers Small Businesses Accounting Professionals
Who we serve
42M file their own taxes with TurboTax
2.3M run their small businesses with QuickBooks
7M manage their personal finances with Mint
The Numbers Are Growing
65+ Applications, 25% of US GDP
Era of DOS → Era of Windows → Era of the Web → Era of the Cloud
Intuit - An Evolution Case Study
Compliant data
Mobile First
1980s: Employees 150 • Customers 1.3M • Revenue $33M
1990s: Employees 4,500 • Customers 5.6M • Revenue $1.04B
2000s: Employees 7,700 • Customers 37M • Revenue $4.2B
2010 → 2016
Regulatory data → Transactional data → Batch data → Real-time data → Complex, secure data
Data Is The Decision Maker
Evolution of Big Data Pipelines – The Need
Secure Cloud Environment
Single Cohesive Data Pipeline
AB Testing
Personalization
Streaming
Profile Store
Fraud Detection
Support Varied Use Cases
and more..
Evolution of Big Data Pipelines
Thin Slices - Minimum Viable Product
Evolution of Big Data Pipelines – The Recipe
Taking the Data In
Transforming Data
Handling The Indigestion With Scale
Evolution of Big Data Pipelines – The Recipe
No Snowflake Solutions
Getting Vested Stakeholders' Agreement
Establishing The Standards
Evolution of Big Data Pipelines – The Recipe
Breaking The Silos
Moving The Organization In One Direction
Evolution of Big Data Pipelines – The Recipe
● Making the configuration knobs work
● At scale: latency, throughput
● Schema, PII, metadata, changes, audit, governance
● Controlled access ←→ innovation
● Error monitoring
● Cluster deployment
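The latency/throughput knobs above can be made concrete on the producer side. A minimal sketch using standard Apache Kafka producer configuration names; the specific values are illustrative, not Intuit's production tuning:

```python
# Hypothetical producer settings biased toward throughput; the config
# key names are standard Kafka producer options, the values are examples.
producer_tuning = {
    "acks": "all",                 # durability over latency
    "linger.ms": 20,               # wait up to 20 ms to fill a batch
    "batch.size": 256 * 1024,      # larger batches amortize request overhead
    "compression.type": "snappy",  # cheaper network/disk at slight CPU cost
}

# For a latency-sensitive consumer of the same pipeline, the same knobs
# move the other way:
low_latency_overrides = {"linger.ms": 0, "batch.size": 16 * 1024}

def tuned_config(base: dict, overrides: dict) -> dict:
    """Merge overrides into a base config without mutating either."""
    merged = dict(base)
    merged.update(overrides)
    return merged
```

The same two dials (batching delay, batch size) are typically the first things revisited when an SLA changes.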
Organization Evolution Data Evolution
SDK
User-entered data
Apache Kafka
Collector: User-entered and clickstream data
Real-time processing
Personalization Engine
Profile Store
Big Data Pipeline Slice View
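The slice above (SDK → Kafka → real-time processing → profile store) can be mocked end to end in a few lines. A toy in-memory sketch, with a deque standing in for Kafka and a dict for the profile store; all names are illustrative, not Intuit's actual components:

```python
from collections import defaultdict, deque

# Stand-ins: the deque plays the role of the Kafka topic, the dict the
# profile store that the personalization engine reads from.
event_queue = deque()
profile_store = defaultdict(lambda: {"clicks": 0})

def sdk_send(user_id: str, event_type: str) -> None:
    """SDK/collector side: publish the event onto the queue (the Kafka hop)."""
    event_queue.append({"user": user_id, "type": event_type})

def consume_once() -> None:
    """Real-time processing side: drain the queue into the profile store."""
    while event_queue:
        evt = event_queue.popleft()
        if evt["type"] == "click":
            profile_store[evt["user"]]["clicks"] += 1

sdk_send("u1", "click")
sdk_send("u1", "click")
sdk_send("u2", "click")
consume_once()
```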
Big Data Pipeline Components
Monitoring The Pipeline
AWS resource alarms
Custom app metrics
JVM and app metrics
Custom process alerts
Logging and alerting
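A custom process alert of the kind listed above can be as simple as comparing the observed ingest rate against a baseline. A hedged sketch with illustrative thresholds (the baseline echoes the ~200k events/sec requirement stated later in the deck; the tolerance is an assumption):

```python
def check_traffic_rate(events_per_sec: float,
                       baseline: float = 200_000,
                       tolerance: float = 0.5) -> str:
    """Toy custom-process alert: classify the observed ingest rate
    relative to a baseline. Thresholds are illustrative only."""
    if events_per_sec < baseline * (1 - tolerance):
        return "ALERT: traffic dropped below baseline"  # possible data loss
    if events_per_sec > baseline * (1 + tolerance):
        return "WARN: traffic spike above baseline"     # possible replay/abuse
    return "OK"
```

In practice this check would feed the same alerting channel as the AWS resource alarms above.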
Evolution In Stages
Evolution - Stage 0: Disparate And Chaotic
Disparate Databases
Data Pipeline (an example)
• Collect event stream data into one location
• Handle ~ 200k events / sec
• Payload ~ 3-5KB
• Enrich messages and load them into Hive within a defined SLA
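Those requirements pin down the sustained bandwidth the pipeline must absorb. A quick back-of-envelope check:

```python
# Sizing from the requirements above: ~200k events/sec at 3-5 KB each.
events_per_sec = 200_000
payload_kb_low, payload_kb_high = 3, 5

mb_per_sec_low = events_per_sec * payload_kb_low / 1024
mb_per_sec_high = events_per_sec * payload_kb_high / 1024

print(f"Sustained ingest: {mb_per_sec_low:.0f}-{mb_per_sec_high:.0f} MB/s")
# Roughly 0.6-1.0 GB/s sustained, which is why network pipe size, broker
# count and partitioning show up as scaling knobs in the later stages.
```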
Evolution - Stage 1
Event Stream
Oozie, Sqoop, Netezza Loader, HiveQL operations, Storm, Samza, Flume
Evolution - Stage 2
Event Stream
{ ReST }
Evolution - Stage 3 (HA & DR)
SDK
{ ReST }
SDK
{ ReST }
Mirroring
Challenges & Opportunities
Set of Changes
• Network upgrades
• Increase pipe
• Broker
• MirrorMaker
• Host TCP
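The broker and MirrorMaker changes above correspond to configuration along these lines. A hedged sketch using stock Kafka MirrorMaker properties files; hosts and values are placeholders, not Intuit's actual deployment:

```properties
# consumer.properties (source cluster) -- placeholder host
bootstrap.servers=source-kafka:9092
group.id=mirrormaker-dr
auto.offset.reset=earliest

# producer.properties (target / DR cluster) -- placeholder host
bootstrap.servers=target-kafka:9092
acks=all
compression.type=snappy
```

Launched with the stock tool, e.g. `kafka-mirror-maker.sh --consumer.config consumer.properties --producer.config producer.properties --whitelist '.*'`.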
Evolution - Stage 4 (Streaming + Batch)
SDK
{ ReST }
SDK
{ ReST }
Mirroring
Evolution - Stage 5 (Cloud only)
SDK
{ ReST }
Kafka Connectors
Evolution - Stage 5 (Cloud only - Future state)
Pipeline Essentials
SDK { ReST }
SDK { ReST }
Traffic Rate Monitoring
Trust by Verification
• Test all observable end-points: functional, data loss, data parity
• Measure for SLA: baseline tests
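The data-loss and data-parity checks can be approximated by comparing a record count plus an order-insensitive digest on each side of the pipe. A minimal sketch, not Intuit's actual verification tooling:

```python
import hashlib

def partition_digest(records: list) -> tuple:
    """Count plus an order-insensitive digest for one partition's records."""
    count = len(records)
    # XOR of per-record hashes is order-insensitive, so source and
    # destination agree even if delivery reordered records.
    acc = 0
    for r in records:
        acc ^= int.from_bytes(hashlib.sha256(r.encode()).digest()[:8], "big")
    return count, format(acc, "016x")

def parity_ok(source: list, destination: list) -> bool:
    """Data-parity check: same count and same digest on both ends."""
    return partition_digest(source) == partition_digest(destination)
```

A count mismatch flags data loss; a digest mismatch with equal counts flags corruption or duplication.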
Thank You!