Lambda architecture @ Indix
Lambda Architecture
Analyzing large scale, unstructured, dynamic data
Rajesh Muppalla (@codingnirvana)
Indix - Quick Overview
Am I priced higher or lower than my competitor on the Nikon D700?
Which product has the UPC - 8745354434?
What are all the variants of Apple Macbook Air 13”?
What is the average price change of all Nike Shoes in Walmart in the last 3 months?
Data Pipeline @ Indix
[Diagram: Crawling, Parsing, Classification (ML Model), Matching (ML Model), Product & Price Catalog]
Data Pipeline @ Indix
[Diagram: Product & Price Catalog, Analytics (Precomputes, Insights), Search Index, Experiences]
We released the v1.0 of our API today - developer.indix.com
Data is Dynamic
[Diagram: Crawling, Parsing, Classification, Matching, with the ML Model replaced by ML Model (new)]
Data Scale
● 400 M Product URLs
● 4 TB HTML Data Crawled Daily
● 100 TB Data Processed Daily
● 3000 Categories
● 10 B Price Points
● 2000 Sites
Data Pipeline v1.0
Batch using HBase & MapReduce
Problem 1
Data Systems should be Human Fault Tolerant
Mutable State
Problem 2
Compactions
Random Write databases are hard to manage at large scale
Problem 3
16 hours
A latency of 16 hours is a lot. We wanted it to be a couple of hours.
Three Problems
● No Human Fault Tolerance
○ Mutable State
● Operational Complexity
○ Random Writes (Compactions)
● High Latency
○ Batch system architectural tradeoff
Rethink our data systems
Lambda Architecture
Lambda Architecture
● An approach to build big data systems
○ Architectural Components & Principles
○ Ties Batch & Real Time Systems
○ General Purpose - Domain Agnostic
● Coined by Nathan Marz
○ Ex-Twitter Engineer
○ Creator of Storm
Data System - Traditional Approach
[Diagram: Application on top of HBase, which is the Source of Truth]
Data System - New Approach
[Diagram: Immutable Raw Data (Source of Truth), Processed View(s), Application]
Let’s take an example
Find the count of unique products in any given category for the entire time range
Two Requirements
● Recomputations
● Large Scale
Batch Layer Implementation
[Diagram: Products Master Data on HDFS (Vertical Partitioning), with hourly partitions from 9 am to 2 pm and New Data; MR Job 1; Intermediate view (C1 to C5); MR Job 2; Batch View in HBase (C1 5, C2 7, C3 4, C4 7, C5 1); Query]
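The batch layer above can be illustrated with a small in-memory sketch. The slides do not say what each MapReduce job does, so the split below (a first stage that deduplicates and a second stage that counts) is an assumption, as are the record layout and the function names `build_intermediate_view` and `build_batch_view`.

```python
from collections import defaultdict

# Hypothetical master data: one immutable list of (product_id, category)
# records per hourly partition. Records are only ever appended.
master_data = {
    "9 am": [("p1", "C1"), ("p2", "C1"), ("p3", "C2")],
    "10 am": [("p1", "C1"), ("p4", "C2")],
    "11 am": [("p5", "C3"), ("p3", "C2")],
}


def build_intermediate_view(partitions):
    """Stage 1 (assumed role of MR Job 1): distinct (category, product) pairs."""
    pairs = set()
    for records in partitions.values():
        for product_id, category in records:
            pairs.add((category, product_id))
    return pairs


def build_batch_view(intermediate):
    """Stage 2 (assumed role of MR Job 2): unique product count per category."""
    counts = defaultdict(int)
    for category, _product_id in intermediate:
        counts[category] += 1
    return dict(counts)


# The whole view is a pure function of the master data, so it can be
# thrown away and rebuilt at any time.
batch_view = build_batch_view(build_intermediate_view(master_data))
print(batch_view)
```

Because nothing in the master data is ever updated in place, rerunning both stages over all partitions always yields the same view.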
Handling Recomputations
[Diagram: the same batch layer pipeline: Products Master Data on HDFS (hourly partitions from 9 am to 2 pm) and New Data; MR Job 1; Intermediate view; MR Job 2; Batch View in HBase; Query]
Handling Scale
● Hadoop HDFS, MapReduce, HBase
● Proven Linear Scalability
Three Problems (Recap)
● No Human Fault Tolerance
○ Mutable State
● Operational Complexity
○ Random Writes (Compactions)
● High Latency
○ Batch system architectural tradeoff
Human Fault Tolerance
● Bugs in the batch jobs
○ Discard views & Recompute
● Bugs in the master data jobs
○ Re-process the master data to hide the old data
● Bugs in the query
○ Re-deploy the query layer
● Traceability as a side effect
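A minimal sketch of the "discard views & recompute" case, assuming a hypothetical bug in which the batch job counts duplicate records. The data and the names `buggy_view` and `fixed_view` are illustrative only and do not come from the Indix pipeline.

```python
# Hypothetical immutable master data: (product_id, category) records.
# "p1" appears twice, for example because it was crawled twice.
MASTER_DATA = (
    ("p1", "C1"),
    ("p2", "C1"),
    ("p1", "C1"),
    ("p3", "C2"),
)


def buggy_view(records):
    """Bug: counts every record, so duplicates inflate the result."""
    view = {}
    for _product_id, category in records:
        view[category] = view.get(category, 0) + 1
    return view


def fixed_view(records):
    """Fix: count each (category, product) pair exactly once."""
    view = {}
    for category, _product_id in {(c, p) for p, c in records}:
        view[category] = view.get(category, 0) + 1
    return view


wrong_view = buggy_view(MASTER_DATA)        # C1 is over-counted
recomputed_view = fixed_view(MASTER_DATA)   # discard the old view, recompute
print(wrong_view, recomputed_view)
```

The bug never touched the master data, so deploying the fix and recomputing is enough to repair the view.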
Operational Complexity
● No random writes in the batch layer
○ Bulk Updates to build the batch view
Great… What about Latency?
Speed Layer
[Diagram: Recent Data, Queue (Kafka), Real Time Processing (Storm), Random Writes (Updates) of HyperLogLog sets to a Read-Write Data Store (Riak, HBase, Cassandra), Query]
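The speed layer keeps HyperLogLog sets instead of exact sets of product IDs. The sketch below is a minimal textbook-style HyperLogLog, not the implementation used at Indix. The class name, the SHA-1 hash and the default of 4096 registers are assumptions. With that many registers the typical error is on the order of 1 to 2 percent, so this sketch does not reach the 0.02% figure quoted on the slides.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog cardinality estimator (illustrative only)."""

    def __init__(self, p=12):
        self.p = p                  # number of index bits
        self.m = 1 << p             # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")          # 64-bit hash value
        index = h >> (64 - self.p)                     # first p bits
        rest = h & ((1 << (64 - self.p)) - 1)          # remaining bits
        # Position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def merge(self, other):
        """Union of two sketches: element-wise maximum of the registers."""
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in
                            zip(self.registers, other.registers)]
        return merged

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


recent = HyperLogLog()
for i in range(499):
    recent.add(f"product-{i}")
    recent.add(f"product-{i}")   # re-adding an item does not change the estimate
print(round(recent.count()))
```

Each update touches a single register, which is why the speed layer needs a store that supports random writes.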
Speed Layer has mutation... But
● Speed layer deals with much smaller data
○ Batch Layer - Months/years of data
○ Speed Layer - Few hours or 1 day of data
● Easy to manage operationally
Complexity Isolation
Final Step - Merging Results
[Diagram: Data feeds the Batch Layer and the Speed Layer; the Query merges both]
Batch Layer: C1 - 50000
Speed Layer: C1 - 499 (Approximate with error 0.02%)
Merged Results: C1 - 50499
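The merge step in the example is a per-category sum. The sketch below uses the numbers from the slide. The function name `merge_counts` is hypothetical, and the sum is only valid under the assumption that the speed layer counts products that are not yet covered by the batch view.

```python
def merge_counts(batch_view, speed_view):
    """Add the recent (speed layer) count to the batch count per category."""
    categories = set(batch_view) | set(speed_view)
    return {c: batch_view.get(c, 0) + speed_view.get(c, 0)
            for c in categories}


batch_view = {"C1": 50000}   # exact, computed by the last batch run
speed_view = {"C1": 499}     # approximate, recent data only

merged = merge_counts(batch_view, speed_view)
print(merged)  # {'C1': 50499}
```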
What about Accuracy?
[Diagram: Data feeds the Batch Layer and the Speed Layer; the Query merges both]
Speed Layer: C1 - 499 (Approximate with error 0.02%)
Batch Layer: C1 - 50000, then C1’ - 50500
Merged Results: C1’ - 50500
Eventually Accurate
Lambda Architecture
Lambda Architecture @ Indix
Batch Layer @ Indix
● Pail
○ Vertical partitioning
○ Consolidation of small files
● Scalding
● Thrift for enforcing schemas
● HBase/Solr for views
○ Bulk updates to create views
Speed Layer @ Indix
● Still WIP
● To reduce latency
○ Micro batches for Speed layer
○ Use the last batch run + bulk update views
Open Challenges
● Managing both Batch & Real Time is still painful
● Two broad directions
○ Abstractions
■ Summingbird (Twitter)
○ Unified Stack
■ Spark
■ Kafka + Samza/Storm (LinkedIn)
■ Cloud Dataflow (Google)
In Conclusion...
● Lambda Architecture
○ A different approach to build data systems
○ Solid principles
○ Domain Agnostic
○ Tools not yet mature
Resources
● Indix Engineering Blog - http://engineering.indix.com
● Runaway Complexity in Big Data Systems
● Lambda Architecture
● Big Data Book - Manning
● Scalding
● Spark
● Pail
● Summingbird
Key Takeaways
- Human Fault Tolerance
- Complexity Isolation
- Higher Level Abstractions
Thank You
Batch vs Real Time Choices
Tying it all together - Go-CD
Extras
● Monoids
● LA is not new
○ Search Engines (fast, slow crawl)
○ Event Sourcing (immutable events to maintain state)
○ Patch, Audit, Bootstrap
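The monoid bullet can be illustrated with set union. An associative combine operation with an identity element lets partial results from micro batches be merged in any grouping. The data and names below are hypothetical, and plain sets stand in for the HyperLogLog sets mentioned earlier.

```python
from functools import reduce

EMPTY = frozenset()   # identity element: combine(EMPTY, x) == x


def combine(a, b):
    """Associative combine operation: set union."""
    return a | b


# Hypothetical micro batches of product IDs seen for one category.
micro_batches = [
    frozenset({"p1", "p2"}),
    frozenset({"p2", "p3"}),
    frozenset({"p4"}),
]

# Grouping does not matter, so batches can be combined incrementally
# or in parallel and still give the same answer.
left_to_right = reduce(combine, micro_batches, EMPTY)
right_first = combine(micro_batches[0],
                      combine(micro_batches[1], micro_batches[2]))

unique_count = len(left_to_right)
print(unique_count)  # 4
```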
Problem Statement - Optimization