Lambda architecture @ Indix


description

Slides from my presentation on Lambda Architecture at Indix, presented at Fifth Elephant 2014. It covers our experience using the Lambda Architecture at Indix to build a large-scale analytics system on unstructured, dynamically changing data sources using Hadoop, HBase, Scalding, Spark and Solr.

Transcript of Lambda architecture @ Indix

Page 1: Lambda architecture @ Indix

Lambda Architecture

Analyzing large scale, unstructured, dynamic data

Rajesh Muppalla (@codingnirvana)

Page 2: Lambda architecture @ Indix

Indix - Quick Overview

Am I priced higher or lower w.r.t my competitor on Nikon D700?

Which product has the UPC - 8745354434?

What are all the variants of Apple Macbook Air 13”?

What is the average price change of all Nike Shoes in Walmart in the last 3 months?

Page 3: Lambda architecture @ Indix

Data Pipeline @ Indix

[Diagram: pipeline stages Crawling, Parsing, Classification and Matching, with two ML Models, items grouped into clusters C1 and C2, ending in the Product & Price Catalog]

Page 4: Lambda architecture @ Indix

Data Pipeline @ Indix

Analytics (Precomputes, Insights)

Search Index

Product & Price Catalog

Experiences

We released the v1.0 of our API today - developer.indix.com

Page 5: Lambda architecture @ Indix

Data is Dynamic

[Diagram: pipeline stages Crawling, Parsing, Classification and Matching with clusters C1 and C2, showing an ML Model next to an ML Model (new)]

Page 6: Lambda architecture @ Indix

Data Scale

400 M Product URLs

4 TB HTML Data Crawled Daily

100 TB Data Processed Daily

3000 Categories

10 B Price Points

2000 Sites

Page 7: Lambda architecture @ Indix

Data Pipeline v1.0

Page 8: Lambda architecture @ Indix

Batch using HBase & MapReduce

Page 9: Lambda architecture @ Indix

Problem 1

Data Systems should be Human Fault Tolerant

Mutable State

Page 10: Lambda architecture @ Indix

Problem 2

Compactions

Random Write databases are hard to manage at large scale

Page 11: Lambda architecture @ Indix

Problem 3

16 hours

A latency of 16 hours is a lot. We wanted it to be a couple of hours.

Page 12: Lambda architecture @ Indix

Three Problems

● No Human Fault Tolerance
  ○ Mutable State

● Operational Complexity
  ○ Random Writes (Compactions)

● High Latency
  ○ Batch system architectural tradeoff

Page 13: Lambda architecture @ Indix

Rethink our data systems

Page 14: Lambda architecture @ Indix

Lambda Architecture

Page 15: Lambda architecture @ Indix

Lambda Architecture

● An approach to build big data systems
  ○ Architectural Components & Principles
  ○ Ties Batch & Real Time Systems
  ○ General Purpose - Domain Agnostic

● Coined by Nathan Marz
  ○ Ex-Twitter Engineer
  ○ Creator of Storm

Page 16: Lambda architecture @ Indix

Data System - Traditional Approach

[Diagram: Application on top of HBase, with HBase as the Source of Truth]

Page 17: Lambda architecture @ Indix

Data System - New Approach

[Diagram: Immutable Raw Data as the Source of Truth, Processed View(s) derived from it, and the Application on top of the views]

Page 18: Lambda architecture @ Indix

Let’s take an example

Find the count of unique products in any given category for the entire time range
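A minimal sketch of this query, assuming each raw record is a (category, product id) pair. The record shape, function name and sample values are illustrative and do not come from the slides.

```python
from collections import defaultdict


def unique_products_per_category(records):
    """Count distinct product ids per category over the whole record set."""
    seen = defaultdict(set)
    for category, product_id in records:
        seen[category].add(product_id)
    return {category: len(ids) for category, ids in seen.items()}


# Hypothetical raw records; the same product can be crawled many times.
records = [
    ("C1", "p1"), ("C1", "p2"), ("C1", "p1"),
    ("C2", "p3"), ("C2", "p3"),
]
counts = unique_products_per_category(records)
```

Because the function reads the raw records and keeps no state of its own, it can be rerun over the full time range whenever the data or the logic changes.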

Page 19: Lambda architecture @ Indix

Two Requirements

● Recomputations
● Large Scale

Page 20: Lambda architecture @ Indix

Batch Layer Implementation

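[Diagram: Products Master Data on HDFS (Vertical Partitioning) in hourly partitions (9 am, 10 am, 11 am, 12 pm, 1 pm, 2 pm) plus New Data; MR Job 1 and MR Job 2; an Intermediate view per category (C1 to C5); a Batch View in HBase; a Query reading the Batch View]

Batch View:

C1 5

C2 7

C3 4

C4 7

C5 1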
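The slide does not spell out what each MapReduce job does, so the following is only one plausible reading, sketched in plain Python rather than MapReduce: job 1 turns each hourly partition into an intermediate view of distinct product ids per category, and job 2 unions those views into the batch view of counts. All function names and sample records are hypothetical.

```python
from collections import defaultdict


def job1_intermediate(partition):
    """Assumed role of MR Job 1: distinct product ids per category for one partition."""
    view = defaultdict(set)
    for category, product_id in partition:
        view[category].add(product_id)
    return dict(view)


def job2_batch_view(intermediate_views):
    """Assumed role of MR Job 2: union the per-partition sets and emit counts."""
    merged = defaultdict(set)
    for view in intermediate_views:
        for category, ids in view.items():
            merged[category] |= ids
    return {category: len(ids) for category, ids in merged.items()}


# Hypothetical hourly partitions of the master data.
partitions = {
    "9 am": [("C1", "p1"), ("C2", "p2")],
    "10 am": [("C1", "p1"), ("C1", "p3")],
}
intermediate = {hour: job1_intermediate(rows) for hour, rows in partitions.items()}
batch_view = job2_batch_view(intermediate.values())

# New data: only the new partition has to go through job 1.
intermediate["11 am"] = job1_intermediate([("C2", "p4"), ("C3", "p5")])
updated_view = job2_batch_view(intermediate.values())
```

Keeping sets rather than counts in the intermediate view is what makes the second step correct: counts of distinct items cannot simply be added across partitions that share products.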

Page 21: Lambda architecture @ Indix

Handling Recomputations

[Diagram: same components as on the Batch Layer Implementation slide: Products Master Data on HDFS (Vertical Partitioning) in hourly partitions from 9 am to 2 pm, New Data, MR Job 1, Intermediate view (C1 to C5), MR Job 2, Batch View in HBase (C1 5, C2 7, C3 4, C4 7, C5 1), Query]

Page 22: Lambda architecture @ Indix

Handling Scale

● Hadoop HDFS, MapReduce, HBase
● Proven Linear Scalability

Page 23: Lambda architecture @ Indix

Three Problems (Recap)

● No Human Fault Tolerance
  ○ Mutable State

● Operational Complexity
  ○ Random Writes (Compactions)

● High Latency
  ○ Batch system architectural tradeoff

Page 24: Lambda architecture @ Indix

Human Fault Tolerance

● Bugs in the batch jobs
  ○ Discard views & Recompute

● Bugs in the master data jobs
  ○ Re-process the master data to hide the old data

● Bugs in the query
  ○ Re-deploy the query layer

● Traceability as a side effect
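The "discard views & recompute" case can be illustrated with a small sketch. It assumes the raw events are never modified and that a view is a pure function of them; the events, the bug and both function names are made up for illustration.

```python
# Immutable raw events (category, product id); never edited in place.
RAW_EVENTS = (
    ("C1", "p1"), ("C1", "p1"), ("C1", "p2"), ("C2", "p3"),
)


def buggy_view(events):
    """Deliberate bug: counts every event, so duplicates inflate the result."""
    counts = {}
    for category, _product_id in events:
        counts[category] = counts.get(category, 0) + 1
    return counts


def fixed_view(events):
    """Corrected job: counts distinct product ids per category."""
    distinct = {}
    for category, product_id in events:
        distinct.setdefault(category, set()).add(product_id)
    return {category: len(ids) for category, ids in distinct.items()}


view = buggy_view(RAW_EVENTS)   # wrong view goes live
view = fixed_view(RAW_EVENTS)   # discard it and recompute from the raw data
```

The bad view leaves no trace once it is replaced, because nothing derived from it was ever written back into the raw data.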

Page 25: Lambda architecture @ Indix

Operational Complexity

● No random writes in the batch layer
  ○ Bulk Updates to build the batch view

Page 26: Lambda architecture @ Indix

Great… What about Latency?

Page 27: Lambda architecture @ Indix

Speed Layer

[Diagram: Recent Data enters a Queue (Kafka) and goes through Real Time Processing (Storm), which applies Random Writes (Updates) to a Read-Write Data Store (Riak, HBase, Cassandra) holding HyperLogLog sets; the Query reads from that store]
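The speed layer on this slide keeps HyperLogLog sets, which estimate distinct counts in fixed memory and can be merged. Below is a compact sketch of the textbook algorithm; the class, the default of 12 index bits and the use of SHA-1 as the hash are choices made for this example, not details from the talk.

```python
import hashlib
import math


class HyperLogLog:
    """Approximate distinct counter; p index bits give m = 2**p registers."""

    def __init__(self, p=12):
        if not 7 <= p <= 16:
            raise ValueError("p must be between 7 and 16 for this sketch")
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        x = int.from_bytes(digest[:8], "big")          # 64-bit hash value
        index = x >> (64 - self.p)                     # first p bits pick a register
        rest = x & ((1 << (64 - self.p)) - 1)          # remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1   # position of the leftmost 1-bit
        self.registers[index] = max(self.registers[index], rank)

    def merge(self, other):
        """Union of two sketches: element-wise maximum of the registers."""
        if self.p != other.p:
            raise ValueError("cannot merge sketches with different p")
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)          # bias constant, valid for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)   # small-range correction
        return raw


# Two sketches over overlapping, made-up product ids, merged into one.
morning = HyperLogLog()
afternoon = HyperLogLog()
for i in range(6000):
    morning.add(f"product-{i}")
for i in range(4000, 10000):
    afternoon.add(f"product-{i}")
morning.merge(afternoon)
estimate = morning.count()   # close to 10000, not exact
```

Merging by register maximum gives the same sketch as adding every item to a single counter, so products seen in both halves are not counted twice.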

Page 28: Lambda architecture @ Indix

Speed Layer has mutation... But

● Speed layer deals with much smaller data
  ○ Batch Layer - Months/years of data
  ○ Speed Layer - Few hours or 1 day of data

● Easy to manage operationally

Complexity Isolation

Page 29: Lambda architecture @ Indix

Final Step - Merging Results

[Diagram: Data feeds the Batch Layer and the Speed Layer; the Query merges the results of both]

Batch Layer: C1 - 50000

Speed Layer: C1 - 499 (Approximate with error 0.02%)

Merged Results: C1 - 50499
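The merge on this slide is a per-category addition of the two views. A small sketch using the slide's numbers follows; the function name and the dictionary representation of the views are assumptions.

```python
def merge_counts(batch_view, speed_view):
    """Add the batch and speed counts for every category present in either view."""
    categories = sorted(set(batch_view) | set(speed_view))
    return {c: batch_view.get(c, 0) + speed_view.get(c, 0) for c in categories}


batch_view = {"C1": 50000}   # exact, from the last batch run
speed_view = {"C1": 499}     # approximate, from recent data only
merged = merge_counts(batch_view, speed_view)
```

Plain addition treats the recent products as new ones. When the same product can appear in both layers, the merge would need to combine the underlying sets or sketches instead of their counts.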

Page 30: Lambda architecture @ Indix

What about Accuracy?

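[Diagram: Data feeds the Batch Layer and the Speed Layer; the Query merges the results of both]

Speed Layer: C1 - 499 (Approximate with error 0.02%)

Batch Layer: C1 - 50000, replaced by C1’ - 50500 after the next batch run

Merged Results: C1’ - 50500

Eventually Accurate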

Page 31: Lambda architecture @ Indix

Lambda Architecture

Page 32: Lambda architecture @ Indix

Lambda Architecture @ INDIX

Page 33: Lambda architecture @ Indix

Lambda Architecture @ Indix

Page 34: Lambda architecture @ Indix

Batch Layer @ Indix

● Pail
  ○ Vertical partitioning
  ○ Consolidation of small files

● Scalding
● Thrift for enforcing schemas
● HBase/Solr for views
  ○ Bulk updates to create views
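The two Pail features listed above can be illustrated without Pail itself. The sketch below does not use Pail's API; it only mimics the ideas on a local directory, with a made-up layout of one directory per record type and day and made-up file names.

```python
from pathlib import Path
import tempfile


def partition_dir(root, record):
    """Vertical partitioning: the record's attributes decide its directory."""
    return Path(root) / record["type"] / record["day"]


def append_record(root, record, file_name):
    """Write one record into a (possibly new) small file inside its partition."""
    target = partition_dir(root, record)
    target.mkdir(parents=True, exist_ok=True)
    with open(target / file_name, "a", encoding="utf-8") as handle:
        handle.write(record["payload"] + "\n")


def consolidate(directory, merged_name="merged.txt"):
    """Concatenate the small files of one partition into a single file."""
    directory = Path(directory)
    small_files = sorted(p for p in directory.iterdir() if p.name != merged_name)
    with open(directory / merged_name, "a", encoding="utf-8") as out:
        for path in small_files:
            out.write(path.read_text(encoding="utf-8"))
            path.unlink()
    return directory / merged_name


# Hypothetical records written as many small files, then consolidated.
root = tempfile.mkdtemp()
day = "2014-01-01"
append_record(root, {"type": "price", "day": day, "payload": "p1,10.0"}, "part-0001")
append_record(root, {"type": "price", "day": day, "payload": "p2,12.5"}, "part-0002")
append_record(root, {"type": "product", "day": day, "payload": "p1,Nikon D700"}, "part-0001")
merged = consolidate(Path(root) / "price" / day)
```

With this layout a job that only needs prices reads the price directory and never touches product records, and consolidation keeps the number of files per partition small.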

Page 35: Lambda architecture @ Indix

Speed Layer @ Indix

● Still WIP
● To reduce latency
  ○ Micro batches for Speed layer
  ○ Use the last batch run + bulk update views

Page 36: Lambda architecture @ Indix

Open Challenges

● Managing both Batch & Real Time is still painful
● Two broad directions
  ○ Abstractions
    ■ SummingBird (Twitter)
  ○ Unified Stack
    ■ Spark
    ■ Kafka + Samza/Storm (LinkedIn)
    ■ Cloud Dataflow (Google)

Page 37: Lambda architecture @ Indix

In Conclusion...

● Lambda Architecture
  ○ A different approach to build data systems
  ○ Solid principles
  ○ Domain Agnostic
  ○ Tools not yet mature

Page 39: Lambda architecture @ Indix

Key Takeaways

- Human Fault Tolerance

- Complexity Isolation

- Higher Level Abstractions

Page 40: Lambda architecture @ Indix

Thank You

Page 41: Lambda architecture @ Indix

Batch vs Real Time Choices

Page 42: Lambda architecture @ Indix

Tying it all together - Go-CD

Page 43: Lambda architecture @ Indix

Extras

● Monoids
● LA is not new
  ○ Search Engines (fast, slow crawl)
  ○ Event Sourcing (immutable events to maintain state)
  ○ Patch, Audit, Bootstrap
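For the Monoids bullet, a brief sketch of why the property matters here: a monoid is an associative combine operation with an identity element, so partial results can be computed separately and combined in any grouping. The two classes and the sample data are illustrative only.

```python
from functools import reduce


class SetUnionMonoid:
    """Distinct ids: identity is the empty set, combine is union."""
    zero = frozenset()

    @staticmethod
    def plus(a, b):
        return a | b


class SumMonoid:
    """Plain counts: identity is 0, combine is addition."""
    zero = 0

    @staticmethod
    def plus(a, b):
        return a + b


def fold(monoid, values):
    """Combine any number of values, returning the identity for none."""
    return reduce(monoid.plus, values, monoid.zero)


# Hypothetical hourly sets of product ids.
hourly = [frozenset({"p1", "p2"}), frozenset({"p2", "p3"}), frozenset({"p4"})]

all_at_once = fold(SetUnionMonoid, hourly)
in_two_steps = SetUnionMonoid.plus(
    fold(SetUnionMonoid, hourly[:2]),   # e.g. an older partial result
    fold(SetUnionMonoid, hourly[2:]),   # e.g. a more recent partial result
)
```

Because both routes give the same answer, a result built from older data and one built from recent data can be merged at query time without reprocessing either.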

Page 44: Lambda architecture @ Indix

Problem Statement - Optimization