Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Overview

Lambda Architecture. Florian Douetteau (@fdouetteau), CEO, Dataiku, www.dataiku.com. #lambdataiku

description

This talk, given at Devoxx Paris 2014, gives an overview of the lambda architecture and of possible alternatives for its implementation.

Transcript of Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Overview

Page 1: Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Overview

Lambda Architecture

Florian Douetteau (@fdouetteau), CEO, Dataiku, www.dataiku.com

#lambdataiku

Page 2: Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Overview

Topics For Today

• WHAT is a lambda architecture?
  • Examples - Principle
  • Motivation - Hard Points

• HOW do you build a lambda architecture?
  • Component by component

Page 3: Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Overview

Lambda

[Diagram: EVENTS -> PROCESS -> STATE -> SERVE]

Page 4: Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Overview

ƛ : SOME USE CASES

• Online Advertising
  • Keep track of the number of displays / clicks per position / campaign

• Recommender Systems
  • Keep track of product displays / views / clicks / buys

• Statistical Time Line
  • Keep track of the number of tweets per hashtag / hour

Page 5: Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Overview

SQL WAY

[Diagram: EVENTS -> PROCESS -> STATE -> SERVE, with the state held in an RDBMS and updated through SQL]

Events:
USER1 ITEM1 VIEW
USER1 ITEM2 BUY

Process:
INSERT OR UPDATE VIEWS SET pageviews = pageviews + 1 WHERE user=USER1 …

Page 6: Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Overview

Functional Programming Append Only

[Diagram: EVENTS -> PROCESS -> STATE (APPEND ONLY) -> SERVE]

newstate = Fagg(oldstate, Fstore(events))

result = F(state, lastevents, scope)

Page 7: Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Overview

E.g. counting Twitter hashtags

[Diagram: EVENTS -> PROCESS -> STATE -> SERVE]

Fmap(events) = { (#tag, time) -> count }

Freduce(hashmap, hashmap) = fuse counts in maps

Fdisplay(hashmap, events) = Freduce(hashmap, Fmap(events))

TWEET COUNTS (state), one row per (hour, hashtag):
(2014-02-31 13, #foo) -> 3

NEW TWEETS TABLE (events), one row per tweet:
2014-02-31 13:14 #foo bar
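The three functions on this slide can be sketched in plain Java. This is only an illustration under assumptions that are not in the talk: the class name `HashtagCounts` is hypothetical, a tweet is modelled as a `{timestamp, text}` string pair, and the key is the hour prefix of the timestamp followed by the hashtag.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper illustrating Fmap / Freduce / Fdisplay on hashtag counts. */
final class HashtagCounts {
    private HashtagCounts() {}

    /** Fmap: turn raw tweets into {"hour #tag" -> count}. Each tweet is {timestamp, text}. */
    static Map<String, Long> fmap(List<String[]> tweets) {
        Map<String, Long> counts = new HashMap<>();
        for (String[] tweet : tweets) {
            String hour = tweet[0].substring(0, 13); // assumes "yyyy-MM-dd HH:mm" timestamps
            for (String word : tweet[1].split("\\s+")) {
                if (word.startsWith("#")) {
                    counts.merge(hour + " " + word, 1L, Long::sum);
                }
            }
        }
        return counts;
    }

    /** Freduce: fuse two count maps into a new one, leaving both inputs untouched. */
    static Map<String, Long> freduce(Map<String, Long> a, Map<String, Long> b) {
        Map<String, Long> merged = new HashMap<>(a);
        b.forEach((key, count) -> merged.merge(key, count, Long::sum));
        return merged;
    }

    /** Fdisplay: the stored state plus the events that are not folded into it yet. */
    static Map<String, Long> fdisplay(Map<String, Long> state, List<String[]> lastEvents) {
        return freduce(state, fmap(lastEvents));
    }
}
```

Because `freduce` never mutates its arguments, the stored state stays append-only: a new state is derived from the old one rather than updated in place.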

Page 8: Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Overview

E.g. counting Twitter hashtags in “SQL”

[Diagram: EVENTS -> SERVE, going through the tables below]

NEW TWEETS TABLE (one row per tweet)
2014-02-31 13:14 #foo bar

PARTIAL TWEET COUNT TABLE (EXECUTE EACH 5 MINUTES)
(2014-02-31 13, #foo) -> 1
(2014-02-31 14, #foo) -> 3
CREATE … AS SELECT time, tag, COUNT(*) GROUP BY time, tag

TWEET COUNTS TABLE
(2014-02-31 13, #foo) -> 8
(2014-02-31 13, #foo2) -> 3
(2014-02-31 13, #foo3) -> 3
(2014-02-31 13, #foo4) -> 1

NEW TWEET COUNT TABLE (EXECUTE EACH HOUR)
(2014-02-31 13, #foo) -> 9
(2014-02-31 13, #foo) -> 3
CREATE AS SELECT time, tag, SUM(counts) FROM (oldtable … UNION ALL partialtable) GROUP BY time, tag
INSERT VALUES …
RENAME TABLE …

SERVE
SELECT time, tag, SUM(c) FROM (SELECT time, tag, c FROM oldtable WHERE tag = … UNION ALL SELECT time, tag, c FROM partialtable WHERE tag = …)

UNION ALL is used rather than UNION so that identical rows from the old and partial tables are both kept in the sum.

Page 9: Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Overview

ƛ : PRINCIPLE

[Diagram: EVENTS -> BATCH PROC -> BATCH VIEW; EVENTS -> REAL-TIME PROC -> REAL-TIME RESULT; both combined by FEDERATION]

Page 10: Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Overview

Backtype Story

Capture events and logs from Twitter

• 25 TB of binary data
• 100 billion records
• 400 QPS on average
• Scale 1 -> 150 on peak

Took off in 2008 with seed funding and a team of 3 engineers: Christopher Golda, Michael Montano, Nathan Marz

Acquired by Twitter (to power Twitter trends …) in 2011

Cascalog, Storm, ElephantDB

Page 11: Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Overview

TWITTER HASHTAGS

[Diagram: EVENTS -> BATCH PROC -> BATCH VIEW; EVENTS -> REAL-TIME PROC -> REAL-TIME RESULT; both combined by FEDERATION]

Events: 2014-02-31 13:14 #foo bar

REAL-TIME PROC: COMPUTE EVERY 5 MINUTES HASHTAG COUNTS FOR THE LAST 5 MINUTES (IN MEMORY)
REAL-TIME RESULT: (2014-02-31 13, #foo) -> 3

BATCH PROC: COMPUTE EVERY HOUR HASHTAG COUNT FOR THE LAST HOUR (ON DISK)
BATCH VIEW: (2014-02-31 13, #foo) -> 3
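The federation step can be pictured as a lookup that adds the batch view and the real-time result. The sketch below is a minimal illustration, not the talk's implementation: the class `FederatedView` is hypothetical, and it assumes the real-time result only covers events that the batch view has not absorbed yet, so the two counts never overlap.

```java
import java.util.Map;

/** Hypothetical federation layer: merges the batch view with the real-time result. */
final class FederatedView {
    private final Map<String, Long> batchView;      // recomputed by the batch layer
    private final Map<String, Long> realTimeResult; // recent counts; null when the real-time layer is offline

    FederatedView(Map<String, Long> batchView, Map<String, Long> realTimeResult) {
        this.batchView = batchView;
        this.realTimeResult = realTimeResult;
    }

    /** Count for one key: batch part plus the recent part, or batch only as a partial result. */
    long count(String key) {
        long batch = batchView.getOrDefault(key, 0L);
        long recent = realTimeResult == null ? 0L : realTimeResult.getOrDefault(key, 0L);
        return batch + recent;
    }
}
```

Passing `null` for the real-time result models the degraded case where only the batch view can be served.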

Page 12: Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Overview

RECOMMENDER SYSTEM

[Diagram: EVENTS -> BATCH PROC -> BATCH VIEW; EVENTS -> REAL-TIME PROC -> REAL-TIME RESULT; both combined by FEDERATION]

Events:
USER1 ITEM1 VIEW
USER1 ITEM2 BUY
USER1 ITEM1 VIEW
USER1 ITEM1 VIEW

BATCH VIEW: ITEM-ITEM SIMILARITY MATRIX
REAL-TIME RESULT: USER -> [ ITEM1, … ITEMn]
FEDERATION: RECOMMENDATION

Page 13: Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Overview


THREE KEY DRIVERS FOR LAMBDA ARCH

Page 14: Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Overview

DRIVER 1: Support Smooth Evolution

[Diagram: EVENTS -> BATCH PROC -> BATCH VIEW; EVENTS -> REAL-TIME PROC -> REAL-TIME RESULT; both combined by FEDERATION]

Events: 2014-02-31 13:14 #foo bar

Current views: (2014-02-31 13:14, #foo) -> 3

(1) RECOMPUTE NEW VERSION ON BATCH WHILE KEEPING THE OLD ONE: (2014-02-31 13, #foo) -> 3

(2) THEN UPDATE THE ONLINE VERSION

Page 15: Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Overview

DRIVER 2: Real-Time System Offline

[Diagram: EVENTS -> BATCH PROC -> BATCH VIEW; EVENTS -> REAL-TIME PROC -> REAL-TIME RESULT; both combined by FEDERATION]

Events: 2014-02-31 13:14 #foo bar

BATCH PROC: COMPUTE EVERY HOUR HASHTAG COUNT FOR THE LAST HOUR (ON DISK)
BATCH VIEW: (2014-02-31 13, #foo) -> 3

FALLBACK TO PARTIAL RESULT WHEN REAL-TIME GRID IS OFFLINE

Page 16: Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Overview

DRIVER 3 : CAN’T RECOMPUTE

[Diagram: EVENTS -> BATCH PROC -> BATCH VIEW; EVENTS -> REAL-TIME PROC -> REAL-TIME RESULT; both combined by FEDERATION]

Events:
USER1 ITEM1 VIEW
USER1 ITEM2 BUY
USER1 ITEM1 VIEW
USER1 ITEM1 VIEW

BATCH VIEW: ITEM-ITEM SIMILARITY MATRIX
REAL-TIME RESULT: USER -> [ ITEM1, … ITEMn]
FEDERATION: RECOMMENDATION

Page 17: Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Overview


PAIN POINTS

Page 18: Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Overview

PAIN POINT 1 : EXACTLY ONCE

2014-02-31 13:14 #foo bar
2014-02-31 13:15 toto
2014-02-31 13:15 tutu
2014-02-31 13:16 #two

Retry

Page 19: Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Overview

PAIN POINT 2 : DYNAMIC SCALE

START AT 100 events per second
HOW TO GROW TO 10k events per second without rebuilding everything?

Page 20: Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Overview

PAIN POINT 3 : SCHEMA CHANGE

[Diagram: EVENTS -> BATCH PROC -> BATCH VIEW; EVENTS -> REAL-TIME PROC -> REAL-TIME RESULT; both combined by FEDERATION. The event stream contains both EVENTS V1 and EVENTS V2]

MIX OF VERSION 1 AND VERSION 2 !!!!

Page 21: Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Overview

TOOLS AND FRAMEWORK

Page 22: Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Overview

Lambda Architecture Building Blocks

[Diagram of the building blocks: Message Queue, Batch Pump, Batch State, Batch Processing, Batch Views Service, Real-Time Processing, Real-Time State, Real-Time Views Service, Federated View]

Page 23: Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Overview

Components

[Diagram of the building blocks: Message Queue, Batch Pump, Batch State, Batch Processing, Batch Views Service, Real-Time Processing, Real-Time State, Real-Time Views Service, Federated View]

Example technologies placed on the diagram: Storm, HDFS, MapReduce, HBase, Memcache, MongoDB, web app, RabbitMQ, Flume

Page 24: Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Overview

Components

[Diagram of the building blocks: Message Queue, Batch Pump, Batch State, Batch Processing, Batch Views Service, Real-Time Processing, Real-Time State, Real-Time Views Service, Federated View]

Page 25: Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Overview

Message Queues

• Kestrel (Single Node), Kafka (LinkedIn, Distributed): Micro-Batch, State in Processor, Persistent
• RabbitMQ, ActiveMQ: Event, State in Queue, Rich Routing

Page 26: Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Overview

TOPOLOGY : SINGLE PIPE

[Diagram of the building blocks: Message Queue, Batch Pump, Batch State, Batch Processing, Batch Views Service, Real-Time Processing, Real-Time State, Real-Time Views Service, Federated View. STORM is marked on two of the blocks]

Page 27: Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Overview

Storm

Developed in 2008-2009 at BackType

First open source release in 2011

[Diagram: a SPOUT emits TUPLEs to a BOLT, which emits further TUPLEs]

Page 28: Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Overview

Topologies

[Diagram: two SPOUTs feed a graph of four BOLTs. One bolt is annotated "This one is likely to write in a State", another "This one too"]

Page 29: Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Overview

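    import java.util.Map;
    // Storm 0.9.x packages; from Storm 1.0 on they live under org.apache.storm.
    import backtype.storm.task.OutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichBolt;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;
    import backtype.storm.tuple.Values;

    public class HashTagParseBolt extends BaseRichBolt {
        private OutputCollector _collector;

        @Override
        public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
            _collector = collector;
        }

        @Override
        public void execute(Tuple tweet) {
            // Assumption: the spout emits a "time" field and a space-separated "hashtags" string.
            for (String hashtag : tweet.getStringByField("hashtags").split(" ")) {
                _collector.emit(new Values(tweet.getValueByField("time"), hashtag));
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("time", "hashtag"));
        }
    }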

Parse Tweet Bolt

Page 30: Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Overview

Topologies

[Diagram: Tweet -> TweetSpout -> ParseTweetBolt -> Count HashTags Bolt -> Store in Flat File]

Page 31: Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Overview

BALANCING

[Diagram: a CLUSTER contains NODEs; each NODE runs PROCESSes (one per topology); each PROCESS runs EXECUTORs (per spout or bolt); each EXECUTOR runs one or more TASKs. REBALANCE redistributes them across nodes]

Page 32: Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Overview

(Optional) RELIABILITY

• When emitting a tuple from an existing tuple, trace its origin
• “Ack” or “Fail” each tuple
• If a tuple or its dependent tuples are not fully “acked”: REPLAY
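The bookkeeping behind these three rules can be illustrated with a small tracker. This is a simplified sketch in plain Java, not Storm's API and not how Storm implements acking internally: the class `AckTracker` and the string tuple ids are hypothetical, and the tracker simply keeps, for every root tuple, the set of tuples that are still waiting for an ack.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical tracker: a root tuple is done once it and all tuples anchored to it are acked. */
final class AckTracker {
    private final Map<String, Set<String>> pending = new HashMap<>();

    /** A spout emits a new root tuple. */
    void emitRoot(String rootId) {
        pending.computeIfAbsent(rootId, k -> new HashSet<>()).add(rootId);
    }

    /** A bolt emits a child tuple anchored to a root: remember where it came from. */
    void emitAnchored(String rootId, String childId) {
        Set<String> tree = pending.get(rootId);
        if (tree != null) {
            tree.add(childId);
        }
    }

    /** Ack one tuple; returns true when the whole tree of the root is fully acked. */
    boolean ack(String rootId, String tupleId) {
        Set<String> tree = pending.get(rootId);
        if (tree == null) {
            return false;
        }
        tree.remove(tupleId);
        if (tree.isEmpty()) {
            pending.remove(rootId);
            return true;
        }
        return false;
    }

    /** Fail any tuple of the tree; returns true when the root must be replayed. */
    boolean fail(String rootId) {
        return pending.remove(rootId) != null;
    }
}
```

A bolt would call `emitAnchored` before acking its input, so the tree never looks complete while child tuples are still in flight.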

Page 33: Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Overview

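    import java.util.Map;
    // Storm 0.9.x packages; from Storm 1.0 on they live under org.apache.storm.
    import backtype.storm.task.OutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichBolt;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;
    import backtype.storm.tuple.Values;

    public class HashTagParseBolt extends BaseRichBolt {
        private OutputCollector _collector;

        @Override
        public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
            _collector = collector;
        }

        @Override
        public void execute(Tuple tweet) {
            // Assumption: the spout emits a "time" field and a space-separated "hashtags" string.
            for (String hashtag : tweet.getStringByField("hashtags").split(" ")) {
                // Anchor each emitted tuple to the input tweet so a failure can be traced back to it.
                _collector.emit(tweet, new Values(tweet.getValueByField("time"), hashtag));
            }
            // Acknowledge the input once all derived tuples have been emitted.
            _collector.ack(tweet);
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("time", "hashtag"));
        }
    }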

Reliable Parse Tweet

Page 34: Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Overview

TOPOLOGY 2 : SHARE RT

[Diagram of the building blocks: Message Queue, Batch Pump, Batch State, Batch Processing, Batch Views Service, Real-Time Processing, Real-Time State, Real-Time Views Service, Federated View. TRIDENT is marked on three of the blocks]

Page 35: Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Overview


TRIDENT

• Higher Level Operations

• Use Storm as an RPC Framework

• State “Management”

Page 36: Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Overview


From Schema To Storm Topology

Page 37: Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Overview

How is exactly-once implemented?

[Diagram: the stream of events is split into batches, each tagged with a transaction id: txid=1, txid=2, txid=3]

Batch:
{user=paul, item=car, event=imp}
{user=pierre, item=car, event=imp}
{user=1, item=car, event=imp}

Batch:
{user=paul, item=car, event=imp}
{user=pierre, item=car, event=imp}
{user=pierre, item=car, event=imp}

Page 38: Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Overview

Exactly-Once in state

State before:
paul -> { car: 2, txid=2 }
pierre -> { car: 5, txid=3 }

Incoming batch, txid=3:
{user=paul, item=car, event=imp}
{user=pierre, item=car, event=imp}
{user=pierre, item=car, event=imp}

State after:
paul -> { car: 3, txid=3 }
pierre -> { car: 5, txid=3 }

Keep track of the last transaction in the state.

A transaction is not applied to the parts of the state that already carry its txid (or a newer one): here pierre is left unchanged, while paul is updated.
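The txid check on this slide can be sketched as follows. This is an illustration in plain Java rather than Trident's State API: the class `TransactionalCounts` and its methods are hypothetical, and it assumes that txids start at 1, that batches are applied in txid order, and that a replayed batch contains exactly the same events as the first attempt.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical counter store that remembers, per key, the last transaction applied. */
final class TransactionalCounts {
    private static final class Cell {
        long count;
        long txid; // 0 means "no transaction applied yet"
    }

    private final Map<String, Cell> state = new HashMap<>();

    /** Load an existing value, e.g. paul -> { car: 2, txid=2 }. */
    void seed(String key, long count, long txid) {
        Cell cell = state.computeIfAbsent(key, k -> new Cell());
        cell.count = count;
        cell.txid = txid;
    }

    /** Apply one batch; keys already updated by this txid (or a newer one) are skipped. */
    void applyBatch(long txid, List<String> keys) {
        Map<String, Long> delta = new HashMap<>();
        for (String key : keys) {
            delta.merge(key, 1L, Long::sum); // aggregate inside the batch first
        }
        for (Map.Entry<String, Long> change : delta.entrySet()) {
            Cell cell = state.computeIfAbsent(change.getKey(), k -> new Cell());
            if (cell.txid >= txid) {
                continue; // this part of the state has already seen the batch
            }
            cell.count += change.getValue();
            cell.txid = txid;
        }
    }

    long count(String key) {
        Cell cell = state.get(key);
        return cell == null ? 0L : cell.count;
    }
}
```

Seeding the slide's state and applying the txid=3 batch moves paul from 2 to 3 and leaves pierre at 5; applying the same batch again changes nothing.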

Page 39: Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Overview

TOPOLOGY 1 : SHARE STATE

[Diagram of the building blocks: Message Queue, Batch Pump, Batch State, Batch Processing, Batch Views Service, Real-Time Processing, Real-Time State, Real-Time Views Service, Federated View]

USE A SINGLE NOSQL SERVICE FOR ALL USE CASES

Page 40: Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Overview

REDIS VARIANT

[Diagram of the building blocks: Message Queue, Batch Pump, Batch State, Batch Processing, Batch Views Service, Real-Time Processing, Real-Time State, Real-Time Views Service, Federated View. REDIS is marked on four of the blocks]

ALSO USE THE NOSQL AS A MESSAGE QUEUE

Page 41: Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Overview

TOPOLOGY 3 : SHARED PROCESSING

[Diagram of the building blocks: Message Queue, Batch Pump, Batch State, Batch Processing, Batch Views Service, Real-Time Processing, Real-Time State, Real-Time Views Service, Federated View]

Page 42: Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Overview

SummingBird

A single Scala specification that can run in “Batch” or “Real-Time” mode

[Diagram: Single Scala Code -> Run on Storm Topology; Single Scala Code -> Run on Cascading (Batch)]

Page 43: Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Overview

    object TweetHashTagCount {
      implicit val timeOf: TimeExtractor[Status] = TimeExtractor(_.getCreatedAt.getTime)
      implicit val batcher = Batcher.ofHours(1)

      // …
      def hashTagCount[P <: Platform[P]](
        source: Producer[P, Status],
        store: P#Store[String, Long]) =
        source
          .filter(_.getText != null)
          .flatMap { tweet: Status => tweet.getHashTags.map(_ -> 1L) }
          .sumByKey(store)
    }

Tweet SummingBird
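The idea of writing the logic once and running it in both modes can be illustrated without Summingbird. The sketch below is a plain Java analogy under stated assumptions, not Summingbird's API: the class `SharedHashtagJob` is hypothetical, tweets are reduced to their text, and both modes reuse the same `hashtags` function and the same sum-by-key update. It works because adding counts is associative, so folding tweets one at a time or all at once gives the same totals.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical job whose logic is shared by a batch run and a streaming update. */
final class SharedHashtagJob {
    private SharedHashtagJob() {}

    /** The single piece of business logic: tweet text -> hashtags. */
    static List<String> hashtags(String text) {
        List<String> tags = new ArrayList<>();
        if (text == null) {
            return tags; // same role as the filter on null text in the Scala job
        }
        for (String word : text.split("\\s+")) {
            if (word.startsWith("#")) {
                tags.add(word);
            }
        }
        return tags;
    }

    /** Real-time mode: update a running store with one tweet. */
    static void onTweet(Map<String, Long> store, String text) {
        for (String tag : hashtags(text)) {
            store.merge(tag, 1L, Long::sum);
        }
    }

    /** Batch mode: fold a complete collection of tweets into a fresh store. */
    static Map<String, Long> runBatch(List<String> texts) {
        Map<String, Long> store = new HashMap<>();
        for (String text : texts) {
            onTweet(store, text);
        }
        return store;
    }
}
```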

Page 44: Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Overview

Putting this together

• SUMMINGBIRD: common abstraction
• CASCADING: SQL-level abstraction; TRIDENT: state, RPC
• MAP REDUCE: distributed batch computation; STORM: distributed RT computation
• BATCH STORES (HDFS …); RT STORES (NoSQL, etc.)

Page 45: Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Overview

WEB-SCALE VARIANT

[Diagram of the building blocks: Message Queue, Batch Pump, Batch State, Batch Processing, Batch Views Service, Real-Time Processing, Real-Time State, Real-Time Views Service, Federated View]

Labels placed on the blocks: Insert in Mongo (twice), Mongo MapReduce, Mongo Collection, Mongo, Mongo Aggregation

Page 46: Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Overview

HADOOPY VARIANT

[Diagram of the building blocks: Message Queue, Batch Pump, Batch State, Batch Processing, Batch Views Service, Real-Time Processing, Real-Time State, Real-Time Views Service, Federated View]

Labels placed on the blocks: INSERT IN HBASE, HIVE / MAP REDUCE, HBASE (three times), HBASE Queries

Page 47: Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Overview

Integrated Publish

[Diagram of the building blocks: Message Queue, Batch Pump, Batch State, Batch Processing, Batch Views Service, Real-Time Processing, Real-Time State, Real-Time Views Service, Federated View]

Page 48: Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Overview


SploutSQL

Page 49: Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Overview

SPARK VARIANT

[Diagram of the building blocks: Message Queue, Batch Pump, Batch State, Batch Processing, Batch Views Service, Real-Time Processing, Real-Time State, Real-Time Views Service, Federated View]

Labels placed on the blocks: SPARK STREAMING, HDFS, SPARK, MEMORY

Page 50: Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Overview

QUESTIONS

[Diagram: question answering drawn as a lambda architecture. Inputs: QUESTION QUEUE and MAIL (florian.douetteau@dataiku.com). Real-Time Processing: MY MEMORY -> ANSWER. Batch Processing: ANSWER TO MAIL. Result: AUDIENCE HAPPY]