Lambda architecture @ Indix

Lambda Architecture

Analyzing large scale, unstructured, dynamic data

Rajesh Muppalla (@codingnirvana)

rajesh@indix.com

Indix - Quick Overview

Am I priced higher or lower than my competitor on the Nikon D700?

Which product has the UPC - 8745354434?

What are all the variants of Apple Macbook Air 13”?

What is the average price change of all Nike Shoes in Walmart in the last 3 months?

Data Pipeline @ Indix

Crawling → Parsing → Classification (ML Model; categories C1, C2) → Matching → Product & Price Catalog

Data Pipeline @ Indix

Analytics (Precomputes, Insights)

Search Index

Product & Price Catalog

Experiences

We released v1.0 of our API today - developer.indix.com

Data is Dynamic

Crawling → Parsing → Classification → Matching

ML Model → ML Model (new)

Categories: C1, C2

Data Scale

● 400 M Product URLs
● 4 TB HTML Data Crawled
● 100 TB Data Processed Daily
● 3000 Categories
● 10 B Price Points
● 2000 Sites

Data Pipeline v1.0

Batch using HBase & MapReduce

Problem 1

Data Systems should be Human Fault Tolerant

Mutable State

Problem 2

Compactions

Random Write databases are hard to manage at large scale

Problem 3

16 hours

A latency of 16 hours is a lot. We wanted it to be a couple of hours.

Three Problems

● No Human Fault Tolerance
  ○ Mutable State

● Operational Complexity
  ○ Random Writes (Compactions)

● High Latency
  ○ Batch system architectural tradeoff

Rethink our data systems

Lambda Architecture

● An approach to build big data systems
  ○ Architectural Components & Principles
  ○ Ties Batch & Real Time Systems
  ○ General Purpose - Domain Agnostic

● Coined by Nathan Marz
  ○ Ex-Twitter Engineer
  ○ Creator of Storm

Data System - Traditional Approach

Application

Source of Truth

Data System - New Approach

Immutable Raw Data (Source of Truth) → Processed View(s) → Application
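A minimal sketch of this approach, assuming a hypothetical `PriceFact` record and an in-memory list standing in for the immutable store (neither comes from the slides): raw facts are only ever appended, and the view the application reads is a pure function of them.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)  # frozen: a fact cannot be modified after creation
class PriceFact:
    """Hypothetical raw record; field names are illustrative only."""
    product_id: str
    store: str
    price: float
    observed_at: int  # epoch seconds


def append_fact(log: List[PriceFact], fact: PriceFact) -> None:
    """The only write operation: append. No updates, no deletes."""
    log.append(fact)


def latest_price_view(log: List[PriceFact]) -> Dict[Tuple[str, str], float]:
    """Derive a processed view: the latest price per (product, store)."""
    latest: Dict[Tuple[str, str], PriceFact] = {}
    for fact in log:
        key = (fact.product_id, fact.store)
        if key not in latest or fact.observed_at >= latest[key].observed_at:
            latest[key] = fact
    return {key: fact.price for key, fact in latest.items()}


log: List[PriceFact] = []
append_fact(log, PriceFact("nikon-d700", "store-a", 1999.0, 100))
append_fact(log, PriceFact("nikon-d700", "store-a", 1899.0, 200))
view = latest_price_view(log)  # both facts stay in the log; the view shows the newer one
```

Because the view is derived, it can be thrown away and rebuilt at any time without losing information.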

Let’s take an example

Find the count of unique products in any given category for the entire time range

Two Requirements

● Recomputations

● Large Scale

Batch Layer Implementation

HDFS (Vertical Partitioning) HBase

Products Master Data

Intermediate view

MR Job 1

Batch View

MR Job 2

New Data
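The unique-products example could be computed in two passes, loosely mirroring the two jobs in the diagram. This is a plain-Python sketch rather than actual MapReduce code; the `(category, product_id)` record shape, the function names, and the split of work between the two jobs are all assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Record = Tuple[str, str]  # (category, product_id); hypothetical record shape


def job1_dedupe(master_data: Iterable[Record]) -> Set[Record]:
    """Job 1: collapse repeated sightings into an intermediate view."""
    return set(master_data)


def job2_count(intermediate: Iterable[Record]) -> Dict[str, int]:
    """Job 2: count distinct products per category to build the batch view."""
    counts: Dict[str, int] = defaultdict(int)
    for category, _product_id in intermediate:
        counts[category] += 1
    return dict(counts)


master_data = [("C1", "p1"), ("C1", "p1"), ("C1", "p2"), ("C2", "p3")]
batch_view = job2_count(job1_dedupe(master_data))
```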

Handling Recomputations

HDFS (Vertical Partitioning) HBase

Products Master Data

Intermediate view

MR Job 1

Batch View

MR Job 2

New Data

Handling Scale

● Hadoop HDFS, MapReduce, HBase

● Proven Linear Scalability

Three Problems (Recap)

● No Human Fault Tolerance
  ○ Mutable State

● Operational Complexity
  ○ Random Writes (Compactions)

● High Latency
  ○ Batch system architectural tradeoff

Human Fault Tolerance

● Bugs in the batch jobs
  ○ Discard views & Recompute

● Bugs in the master data jobs
  ○ Re-process the master data to hide the old data

● Bugs in the query
  ○ Re-deploy the query layer

● Traceability as a side effect
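The "discard views & recompute" case can be sketched as follows. The job functions and the record shape are hypothetical; the point is only that a buggy batch job never damages the master data, so a fixed job can rebuild the view from scratch.

```python
from typing import Callable, Dict, Iterable, Tuple

Record = Tuple[str, str]  # (category, product_id); hypothetical record shape
View = Dict[str, int]


def buggy_job(records: Iterable[Record]) -> View:
    """Bug: counts every sighting, so duplicates inflate the result."""
    counts: View = {}
    for category, _product_id in records:
        counts[category] = counts.get(category, 0) + 1
    return counts


def fixed_job(records: Iterable[Record]) -> View:
    """Fix: count each (category, product) pair once."""
    counts: View = {}
    for category, _product_id in set(records):
        counts[category] = counts.get(category, 0) + 1
    return counts


def rebuild_view(master_data: Tuple[Record, ...],
                 job: Callable[[Iterable[Record]], View]) -> View:
    """Throw away the old view and recompute it from the untouched master data."""
    return job(master_data)


master_data = (("C1", "p1"), ("C1", "p1"), ("C1", "p2"))  # never modified
bad_view = rebuild_view(master_data, buggy_job)   # wrong view from the buggy job
good_view = rebuild_view(master_data, fixed_job)  # recomputed after the fix
```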

Operational Complexity

● No random writes in the batch layer
  ○ Bulk Updates to build the batch view

Great… What about Latency?

Speed Layer

Queue (Kafka)

Recent Data

Real Time Processing (Storm)

Query

HyperLogLog Sets

Random Writes (Updates)

Read-Write Data Store (Riak, HBase, Cassandra)
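The slides do not show how the HyperLogLog sets are implemented. The following is a from-scratch sketch of the general technique, assuming SHA-1 as the hash function and 4096 registers (both arbitrary choices made here, not taken from the talk). Each item updates one small register, so the structure stays a fixed size no matter how many products stream through.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog sketch; parameters are illustrative assumptions."""

    def __init__(self, p: int = 12) -> None:
        self.p = p                    # number of index bits
        self.m = 1 << p               # number of registers (4096 for p=12)
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")      # 64-bit hash value
        width = 64 - self.p
        idx = h >> width                           # first p bits pick a register
        rest = h & ((1 << width) - 1)              # remaining bits
        rank = width - rest.bit_length() + 1       # position of the leftmost 1-bit
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)      # bias constant, valid for m >= 128
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


hll = HyperLogLog()
for i in range(499):
    hll.add(f"product-{i}")
    hll.add(f"product-{i}")  # a repeated item leaves the registers unchanged
estimate = hll.count()       # approximate number of distinct products seen
```

The estimate is approximate by design; with m registers the typical relative error is on the order of 1.04/sqrt(m), so the accuracy of this sketch will not match the figure quoted on the slides.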

Speed Layer has mutation... But

● Speed layer deals with much smaller data
  ○ Batch Layer - Months/years of data
  ○ Speed Layer - Few hours or 1 day of data

● Easy to manage operationally

Complexity Isolation

Final Step - Merging Results

Data → Batch Layer: C1 - 50000

Data → Speed Layer: C1 - 499 (Approximate with error 0.02%)

Query → Merged Results: C1 - 50499
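The merge in this example is a simple addition at query time. A minimal sketch, with a hypothetical `merge_counts` helper, and assuming (as the slide's arithmetic implies) that the products seen by the speed layer are not already counted in the batch view:

```python
from typing import Dict


def merge_counts(batch_view: Dict[str, int],
                 speed_view: Dict[str, int]) -> Dict[str, int]:
    """Combine per-category counts from the batch and speed layers."""
    merged = dict(batch_view)  # copy so the batch view itself is not mutated
    for category, recent in speed_view.items():
        merged[category] = merged.get(category, 0) + recent
    return merged


merged = merge_counts({"C1": 50000}, {"C1": 499})
```

If the two layers could see the same product, plain addition would double count, and the underlying sets would have to be unioned instead.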

What about Accuracy?

Data → Batch Layer: C1 - 50000, recomputed as C1’ - 50500

Data → Speed Layer: C1 - 499 (Approximate with error 0.02%)

Query → Merged Results: C1’ - 50500

Eventually Accurate

Lambda Architecture

Lambda Architecture @ Indix

Batch Layer @ Indix

● Pail
  ○ Vertical partitioning
  ○ Consolidation of small files

● Scalding

● Thrift for enforcing schemas

● HBase/Solr for views
  ○ Bulk updates to create views

Speed Layer @ Indix

● Still WIP

● To reduce latency
  ○ Micro batches for Speed layer
  ○ Use the last batch run + bulk update views

Open Challenges

● Managing both Batch & Real Time is still painful

● Two broad directions
  ○ Abstractions
    ■ Summingbird (Twitter)
  ○ Unified Stack
    ■ Spark
    ■ Kafka + Samza/Storm (LinkedIn)
    ■ Cloud Dataflow (Google)

In Conclusion...

● Lambda Architecture
  ○ A different approach to build data systems
  ○ Solid principles
  ○ Domain Agnostic
  ○ Tools not yet mature

Resources

● Indix Engineering Blog - http://engineering.indix.com

● Runaway Complexity in Big Data Systems

● Lambda Architecture

● Big Data Book - Manning

● Scalding

● Spark

● Pail

● Summingbird

Key Takeaways

- Human Fault Tolerance

- Complexity Isolation

- Higher Level Abstractions

Thank You

Batch vs Real Time Choices

Tying it all together - Go-CD

Extras

● Monoids

● LA is not new
  ○ Search Engines (fast, slow crawl)
  ○ Event Sourcing (immutable events to maintain state)
  ○ Patch, Audit, Bootstrap
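The slide only names monoids, so the following is an interpretation rather than the speaker's example. A monoid is a type with an associative combine operation and an identity element, which is what lets partial results from separate batches or micro batches be merged in any grouping. The sketch uses element-wise max over small register lists (the way HyperLogLog-style sketches are usually merged); the names and sample values are made up for illustration.

```python
from functools import reduce
from typing import List

Registers = List[int]


def empty(m: int) -> Registers:
    """Identity element: combining with it changes nothing."""
    return [0] * m


def combine(a: Registers, b: Registers) -> Registers:
    """Associative merge: element-wise maximum of two register lists."""
    return [max(x, y) for x, y in zip(a, b)]


# Hypothetical partial results from three separate processing windows.
hour1 = [0, 3, 1, 0]
hour2 = [2, 1, 1, 0]
hour3 = [0, 0, 4, 1]

left = combine(combine(hour1, hour2), hour3)   # ((h1 + h2) + h3)
right = combine(hour1, combine(hour2, hour3))  # (h1 + (h2 + h3))
total = reduce(combine, [hour1, hour2, hour3], empty(4))
```

Since the grouping does not matter, the same merge works whether the partial results come from the batch layer, the speed layer, or both.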

Problem Statement - Optimization
