Lambda architecture @ Indix

Post on 01-Dec-2014

Slides from my presentation on Lambda Architecture at Indix, presented at Fifth Elephant 2014. It covers our experience using Lambda Architecture at Indix to build a large-scale analytics system on unstructured, dynamically changing data sources using Hadoop, HBase, Scalding, Spark and Solr.

Transcript of Lambda architecture @ Indix

Lambda Architecture

Analyzing large scale, unstructured, dynamic data

Rajesh Muppalla (@codingnirvana), rajesh@indix.com

Indix - Quick Overview

Am I priced higher or lower than my competitor on the Nikon D700?

Which product has the UPC 8745354434?

What are all the variants of the Apple MacBook Air 13”?

What is the average price change of all Nike shoes at Walmart in the last 3 months?

Data Pipeline @ Indix

[Diagram: Crawling → Parsing → Classification → Matching → Product & Price Catalog, with ML models feeding the pipeline and items grouped under labels C1 and C2]

Data Pipeline @ Indix

[Diagram: Product & Price Catalog, Analytics (Precomputes, Insights), Search Index]

Experiences

We released v1.0 of our API today: developer.indix.com

Data is Dynamic

[Diagram: the pipeline stages (Crawling, Parsing, Classification, Matching) with items grouped under labels C1 and C2, and an ML Model shown alongside an ML Model (new)]

Data Scale

● 400 M product URLs
● 4 TB HTML data crawled daily
● 100 TB data processed daily
● 3000 categories
● 10 B price points
● 2000 sites

Data Pipeline v1.0

Batch using HBase & MapReduce

Problem 1

Data Systems should be Human Fault Tolerant

Mutable State

Problem 2

Compactions

Random-write databases are hard to manage at large scale

Problem 3

16 hours

A latency of 16 hours is too high. We wanted it down to a couple of hours.

Three Problems

● No Human Fault Tolerance
  ○ Mutable State
● Operational Complexity
  ○ Random Writes (Compactions)
● High Latency
  ○ Batch system architectural tradeoff

Rethink our data systems

Lambda Architecture

Lambda Architecture

● An approach to build big data systems
  ○ Architectural Components & Principles
  ○ Ties Batch & Real Time Systems
  ○ General Purpose - Domain Agnostic
● Coined by Nathan Marz
  ○ Ex-Twitter Engineer
  ○ Creator of Storm

Data System - Traditional Approach

[Diagram: Application, HBase (Source of Truth)]

Data System - New Approach

[Diagram: Immutable Raw Data (Source of Truth), Processed View(s), Application]

Let’s take an example

Find the count of unique products in any given category for the entire time range

Two Requirements

● Recomputations
● Large Scale

Batch Layer Implementation

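[Diagram: New Data flows into the Products Master Data on HDFS (vertical partitioning; hourly partitions at 9 am, 10 am, 11 am, 12 pm, 1 pm and 2 pm). MR Job 1 and MR Job 2 produce an intermediate view keyed by category (C1 to C5) and a Batch View in HBase, which answers the Query with per-category counts: C1 5, C2 7, C3 4, C4 7, C5 1.]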
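The batch layer example can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the Indix implementation: the record layout `(hour, category, product_id)`, the sample data, and the function names are all assumptions, and the two functions only loosely mirror the two MapReduce jobs on the slide.

```python
from collections import defaultdict

# Hypothetical immutable master records: (hour, category, product_id).
# Records are only ever appended, never updated in place.
MASTER_DATA = [
    ("9 am", "C1", "p1"),
    ("9 am", "C1", "p2"),
    ("10 am", "C2", "p3"),
    ("10 am", "C1", "p1"),  # same product seen again; must not be double counted
    ("11 am", "C2", "p4"),
    ("12 pm", "C3", "p5"),
]


def build_intermediate_view(records):
    """Step 1 (assumed): collect the distinct product ids under each category."""
    products_by_category = defaultdict(set)
    for _hour, category, product_id in records:
        products_by_category[category].add(product_id)
    return products_by_category


def build_batch_view(intermediate):
    """Step 2 (assumed): reduce each category's product set to a count."""
    return {category: len(products) for category, products in intermediate.items()}


def recompute(records):
    """The batch view is a pure function of the master data, so a bad view
    can be discarded and rebuilt from scratch."""
    return build_batch_view(build_intermediate_view(records))


batch_view = recompute(MASTER_DATA)
print(batch_view)
```

Because `recompute` never mutates `MASTER_DATA`, fixing a bug in either step only requires rerunning it over the same records.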

Handling Recomputations

[Diagram: same layout as the Batch Layer Implementation slide: New Data, Products Master Data on HDFS (vertical partitioning), MR Job 1, intermediate view, MR Job 2, Batch View in HBase, Query]

Handling Scale

● Hadoop HDFS, MapReduce, HBase
● Proven Linear Scalability

Three Problems (Recap)

● No Human Fault Tolerance
  ○ Mutable State
● Operational Complexity
  ○ Random Writes (Compactions)
● High Latency
  ○ Batch system architectural tradeoff

Human Fault Tolerance

● Bugs in the batch jobs
  ○ Discard views & Recompute
● Bugs in the master data jobs
  ○ Re-process the master data to hide the old data
● Bugs in the query
  ○ Re-deploy the query layer
● Traceability as a side effect

Operational Complexity

● No random writes in the batch layer
  ○ Bulk Updates to build the batch view

Great… What about Latency?

Speed Layer

[Diagram: Recent Data, Queue (Kafka), Real Time Processing (Storm), random writes (updates) to a Read-Write Data Store (Riak, HBase, Cassandra) holding HyperLogLog sets, Query]
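The speed layer slide names HyperLogLog sets as the structure kept in the read-write store. The following is a toy HyperLogLog, written only to show why it suits approximate unique counts over a stream; it is not the implementation used at Indix. The class name, the SHA-1 hash, and the register count (`p=12`) are assumptions, and a real system would use a tested library instead.

```python
import hashlib
import math


class HyperLogLog:
    """Toy HyperLogLog with 2**p registers (illustrative, not production code)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        # Take a 64-bit hash of the item.
        h = int.from_bytes(hashlib.sha1(item.encode("utf-8")).digest()[:8], "big")
        idx = h >> (64 - self.p)                # first p bits pick the register
        rest = h & ((1 << (64 - self.p)) - 1)   # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        # Two sketches combine by taking the register-wise maximum.
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


speed_view = HyperLogLog()
for i in range(499):
    speed_view.add(f"product-{i}")
    speed_view.add(f"product-{i}")  # a repeated sighting does not change the sketch
estimate = speed_view.count()
print(round(estimate))
```

The estimate is approximate by design; the error bound quoted on the slides depends on the register count of the real system and is not reproduced by this sketch.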

Speed Layer has mutation... But

● Speed layer deals with much smaller data
  ○ Batch Layer - Months/years of data
  ○ Speed Layer - Few hours or 1 day of data
● Easy to manage operationally

Complexity Isolation

Final Step - Merging Results

[Diagram: Data, Batch Layer, Speed Layer, Query, Merged Results]

● Batch Layer: C1 - 50000
● Speed Layer: C1 - 499 (approximate, with error 0.02%)
● Merged Results: C1 - 50499

What about Accuracy?

[Diagram: Data, Batch Layer, Speed Layer, Query, Merged Results]

● Speed Layer: C1 - 499 (approximate, with error 0.02%)
● Batch Layer: C1 - 50000 is replaced by C1’ - 50500
● Merged Results: C1’ - 50500

Eventually Accurate
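One reading of the two merge slides, sketched in Python with the numbers from the slides: a query adds the exact batch count to the approximate speed-layer count, and once a later batch run has absorbed the recent data, the batch count alone is the answer. The function name and the dictionary-based views are assumptions, and plain addition assumes the speed layer only holds products the batch view has not yet counted.

```python
def merged_count(batch_view, speed_view, category):
    """Answer a query by combining the batch count with the recent (approximate) count."""
    return batch_view.get(category, 0) + speed_view.get(category, 0)


# Before the next batch run: batch view is exact but stale, speed view is approximate.
batch_view = {"C1": 50000}
speed_view = {"C1": 499}
before = merged_count(batch_view, speed_view, "C1")

# After the next batch run: the recent data is now part of the exact batch view,
# so the corresponding speed-layer state can be dropped.
batch_view = {"C1": 50500}
speed_view = {}
after = merged_count(batch_view, speed_view, "C1")

print(before, after)
```

Any approximation error introduced by the speed layer lives only until the batch layer catches up, which is what "eventually accurate" refers to.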

Lambda Architecture

Lambda Architecture @ Indix

Batch Layer @ Indix

● Pail
  ○ Vertical partitioning
  ○ Consolidation of small files
● Scalding
● Thrift for enforcing schemas
● HBase/Solr for views
  ○ Bulk updates to create views

Speed Layer @ Indix

● Still WIP
● To reduce latency
  ○ Micro batches for Speed layer
  ○ Use the last batch run + bulk update views

Open Challenges

● Managing both Batch & Real Time is still painful
● Two broad directions
  ○ Abstractions
    ■ Summingbird (Twitter)
  ○ Unified Stack
    ■ Spark
    ■ Kafka + Samza/Storm (LinkedIn)
    ■ Cloud Dataflow (Google)

In Conclusion...

● Lambda Architecture
  ○ A different approach to build data systems
  ○ Solid principles
  ○ Domain Agnostic
  ○ Tools not yet mature

Key Takeaways

- Human Fault Tolerance

- Complexity Isolation

- Higher Level Abstractions

Thank You

Batch vs Real Time Choices

Tying it all together - Go-CD

Extras

● Monoids
● LA is not new
  ○ Search Engines (fast, slow crawl)
  ○ Event Sourcing (immutable events to maintain state)
  ○ Patch, Audit, Bootstrap
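The slide lists monoids without elaboration. A likely connection, offered here as an assumption rather than something the slides state, is that an aggregate with an associative combine operation and an identity value can be computed per partition and merged in any grouping, which suits both batch and incremental updates. A minimal Python sketch with a hypothetical `UniqueProducts` type:

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class UniqueProducts:
    """Hypothetical aggregate: the set of product ids seen so far."""
    ids: frozenset = frozenset()

    def combine(self, other):
        # Set union is associative, and the empty set is its identity.
        return UniqueProducts(self.ids | other.ids)


EMPTY = UniqueProducts()

# One partial aggregate per hourly partition (made-up data).
hourly = [
    UniqueProducts(frozenset({"p1", "p2"})),
    UniqueProducts(frozenset({"p2", "p3"})),
    UniqueProducts(frozenset({"p4"})),
]

total = reduce(UniqueProducts.combine, hourly, EMPTY)

# Grouping does not matter, so partitions can be merged in any order of nesting.
left = hourly[0].combine(hourly[1]).combine(hourly[2])
right = hourly[0].combine(hourly[1].combine(hourly[2]))

print(len(total.ids), left == right)
```

Approximate structures such as HyperLogLog sketches have the same shape: merging two sketches is associative, and an empty sketch is the identity.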

Problem Statement - Optimization