Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Overview

Post on 10-May-2015


This talk, given at Devoxx Paris 2014, gives an overview of the lambda architecture and of possible alternatives for implementing it.


@fdouetteau#lambdataiku

Lambda Architecture

Florian Douetteau, CEO Dataiku (@fdouetteau)

Dataiku, www.dataiku.com

Topics For Today

• WHAT is a lambda architecture?
  • Examples - Principle
  • Motivation - Hard Points

• HOW do you build a lambda architecture?
  • Component by component

Lambda

[Diagram: EVENTS, PROCESS, STATE, SERVE]

ƛ : SOME USE CASES

• Online Advertising
  • Keep track of the number of displays / clicks per position / campaign

• Recommender Systems
  • Keep track of product displays / views / clicks / buys

• Statistical Time Line
  • Keep track of the number of tweets per hashtag / hour

SQL WAY

[Diagram: EVENTS, PROCESS, STATE, SERVE; the state is held in an RDBMS and updated with SQL]

USER1 ITEM1 VIEW
USER1 ITEM2 BUY

INSERT OR UPDATE VIEWS SET pageviews = pageviews + 1 WHERE user=USER1 …

Functional Programming Append Only

[Diagram: EVENTS, PROCESS, STATE (APPEND ONLY), SERVE]

newstate = Fagg(oldstate, Fstore(events))

result = F(state, lastevents, scope)
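The two append-only formulas can be sketched in plain Java. This is only an illustration under assumptions that are not in the slides: events are strings of the form "USER ITEM ACTION", the state is a map of event counts per user, and the "scope" of a query is a single user. The class and method names (AppendOnlyState, store, aggregate, result) are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the append-only formulas on the slide.
final class AppendOnlyState {
    // Fstore: turn raw events ("USER ITEM ACTION") into per-user counts.
    static Map<String, Long> store(List<String> events) {
        Map<String, Long> counts = new HashMap<>();
        for (String event : events) {
            String user = event.split(" ")[0];
            counts.merge(user, 1L, Long::sum);
        }
        return counts;
    }

    // Fagg: build a NEW state from the old one; the old state is never mutated.
    static Map<String, Long> aggregate(Map<String, Long> oldState, Map<String, Long> delta) {
        Map<String, Long> newState = new HashMap<>(oldState);
        delta.forEach((user, n) -> newState.merge(user, n, Long::sum));
        return Collections.unmodifiableMap(newState);
    }

    // F: answer a query for one user (the "scope") from the state plus
    // the last events that have not been aggregated yet.
    static long result(Map<String, Long> state, List<String> lastEvents, String scope) {
        return state.getOrDefault(scope, 0L) + store(lastEvents).getOrDefault(scope, 0L);
    }

    public static void main(String[] args) {
        Map<String, Long> state = aggregate(Collections.emptyMap(),
                store(Arrays.asList("USER1 ITEM1 VIEW", "USER1 ITEM2 BUY")));
        System.out.println(result(state, Arrays.asList("USER1 ITEM1 VIEW"), "USER1")); // prints 3
    }
}
```

Because aggregate returns a new map instead of updating in place, an older state can be kept around while a newer one is being computed.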

E.g. counting twitter hashtags

[Diagram: EVENTS, PROCESS, STATE, SERVE]

Fmap(events) = { (#tag, time) -> count }

FReduce(hashmap, hashmap) = fuse counts in maps

FDisplay(hashmap, events) = FReduce(hashmap, Fmap(events))

TWEET COUNTS
(2014-02-31 13, #foo) -> 3
(2014-02-31 13, #foo) -> 3
(2014-02-31 13, #foo) -> 3
(2014-02-31 13, #foo) -> 3

NEW TWEETS TABLE
2014-02-31 13:14 #foo bar
2014-02-31 13:14 #foo bar
2014-02-31 13:14 #foo bar
2014-02-31 13:14 #foo bar
2014-02-31 13:14 #foo bar
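A minimal Java sketch of Fmap, FReduce and FDisplay for the hashtag example. It assumes each tweet is a line such as "2014-02-31 13:14 #foo bar" (a 16-character timestamp followed by the text) and treats the hour bucket as an opaque string; the class name HashtagCounts and the string key format are illustrative choices, not something the slides prescribe.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class HashtagCounts {
    // Fmap: tweets ("yyyy-MM-dd HH:mm text") -> { "(hour, #tag)" -> count }.
    static Map<String, Long> fmap(List<String> tweets) {
        Map<String, Long> counts = new HashMap<>();
        for (String tweet : tweets) {
            String hour = tweet.substring(0, 13); // "yyyy-MM-dd HH"
            for (String word : tweet.substring(16).trim().split("\\s+")) {
                if (word.startsWith("#")) {
                    counts.merge("(" + hour + ", " + word + ")", 1L, Long::sum);
                }
            }
        }
        return counts;
    }

    // FReduce: fuse the counts of two maps into a new map.
    static Map<String, Long> freduce(Map<String, Long> left, Map<String, Long> right) {
        Map<String, Long> fused = new HashMap<>(left);
        right.forEach((key, n) -> fused.merge(key, n, Long::sum));
        return fused;
    }

    // FDisplay: stored counts plus the counts of the events not yet stored.
    static Map<String, Long> fdisplay(Map<String, Long> stored, List<String> newTweets) {
        return freduce(stored, fmap(newTweets));
    }

    public static void main(String[] args) {
        Map<String, Long> stored = new HashMap<>();
        stored.put("(2014-02-31 13, #foo)", 3L);
        List<String> newTweets = Collections.nCopies(5, "2014-02-31 13:14 #foo bar");
        System.out.println(fdisplay(stored, newTweets)); // prints {(2014-02-31 13, #foo)=8}
    }
}
```

The stored map is never modified by fdisplay, so the same stored counts can serve several queries while new tweets keep arriving.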

E.g. counting twitter hashtags in “SQL”

[Diagram: EVENTS, SERVE]

TWEET COUNTS TABLE
(2014-02-31 13, #foo) -> 8
(2014-02-31 13, #foo2) -> 3
(2014-02-31 13, #foo3) -> 3
(2014-02-31 13, #foo4) -> 1

NEW TWEETS TABLE
2014-02-31 13:14 #foo bar
2014-02-31 13:14 #foo bar
2014-02-31 13:14 #foo bar
2014-02-31 13:14 #foo bar
2014-02-31 13:14 #foo bar

PARTIAL TWEET COUNT TABLE
(2014-02-31 13, #foo) -> 1
(2014-02-31 14, #foo) -> 3
(2014-02-31 14, #foo) -> 3
(2014-02-31 14, #foo) ->

NEW TWEET COUNT TABLE
(2014-02-31 13, #foo) -> 9
(2014-02-31 13, #foo) -> 3
(2014-02-31 13, #foo) -> 3
(2014-02-31 13, #foo) -> 3

CREATE … AS SELECT time, tag, COUNT(*) GROUP BY TIME, TAG

CREATE AS SELECT time, tag, SUM(counts) FROM (oldtable … UNION ALL partialtable) GROUP BY TIME, TAG

SELECT time, tag, SUM(c) FROM (SELECT time, tag, c FROM oldtable WHERE tag = … UNION ALL SELECT time, tag, c FROM partialtable WHERE tag = …)

INSERT VALUES …

RENAME TABLE …

EXECUTE EACH 5 MINUTES

EXECUTE EACH HOUR

ƛ : PRINCIPLE

[Diagram: EVENTS, BATCH PROC, BATCH VIEW, REAL-TIME PROC, REAL-TIME RESULT, FEDERATION]

Backtype Story

Capture events and logs from Twitter

• 25 TB binary data
• 100 billion records
• 400 QPS average
• Scale 1 -> 150 on peak

Took off in 2008 with seed funding and a team of 3 engineers: Christopher Golda, Michael Montano, Nathan Marz

Acquired by Twitter (to power Twitter trends …) in 2011

Cascalog, Storm, ElephantDB

TWITTER HASHTAGS

[Diagram: tweets such as "2014-02-31 13:14 #foo bar" enter BATCH PROC and REAL-TIME PROC; BATCH VIEW and REAL-TIME RESULT hold counts such as (2014-02-31 13, #foo) -> 3; FEDERATION]

COMPUTE EVERY 5 MINUTES HASHTAG COUNTS FOR THE LAST 5 MINUTES (IN MEMORY)

COMPUTE EVERY HOUR HASHTAG COUNT FOR THE LAST HOUR (ON DISK)
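One way to picture the federation step is a view that adds the counts published by the batch side to the in-memory 5-minute buckets that the batch side has not covered yet. The sketch below is an assumption-laden illustration, not the talk's implementation: time is a plain minute counter, the batch horizon is assumed to fall on a 5-minute boundary, counts are kept per tag only, and every name (FederatedHashtagView, publishBatch, recordRealTime) is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

final class FederatedHashtagView {
    // Batch view: tag -> count for all events before batchHorizonMinute.
    private final Map<String, Long> batchCounts = new HashMap<>();
    // Real-time result: 5-minute bucket start -> (tag -> count), kept in memory.
    private final TreeMap<Long, Map<String, Long>> realTimeBuckets = new TreeMap<>();
    private long batchHorizonMinute = 0;

    // Called when a batch run finishes; horizonMinute is assumed to be a multiple of 5.
    void publishBatch(Map<String, Long> counts, long horizonMinute) {
        batchCounts.clear();
        batchCounts.putAll(counts);
        batchHorizonMinute = horizonMinute;
        // Buckets before the horizon are now covered by the batch view.
        realTimeBuckets.headMap(horizonMinute, false).clear();
    }

    // Called by the real-time side for each hashtag occurrence.
    void recordRealTime(long minute, String tag) {
        long bucketStart = minute - (minute % 5);
        realTimeBuckets.computeIfAbsent(bucketStart, b -> new HashMap<>())
                       .merge(tag, 1L, Long::sum);
    }

    // Federation: batch count plus every real-time bucket at or after the horizon.
    long count(String tag) {
        long total = batchCounts.getOrDefault(tag, 0L);
        for (Map<String, Long> bucket : realTimeBuckets.tailMap(batchHorizonMinute, true).values()) {
            total += bucket.getOrDefault(tag, 0L);
        }
        return total;
    }
}
```

Dropping the real-time buckets that a new batch run covers keeps an event from being counted twice, once by each side.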

RECOMMENDER SYSTEM

[Diagram: BATCH PROC, BATCH VIEW, REAL-TIME PROC, REAL-TIME RESULT, FEDERATION]

USER1 ITEM1 VIEW
USER1 ITEM2 BUY
USER1 ITEM1 VIEW
USER1 ITEM1 VIEW

ITEM-ITEM SIMILARITY MATRIX

USER -> [ ITEM1, … ITEMn]

RECOMMENDATION

THREE KEY DRIVERS FOR LAMBDA ARCH


DRIVER 1: Support Smooth Evolution

[Diagram: tweets such as "2014-02-31 13:14 #foo bar" enter BATCH PROC and REAL-TIME PROC; BATCH VIEW and REAL-TIME RESULT hold counts such as (2014-02-31 13:14, #foo) -> 3 and (2014-02-31 13, #foo) -> 3; FEDERATION]

(1) RECOMPUTE THE NEW VERSION ON BATCH WHILE KEEPING THE OLD ONE

(2) THEN UPDATE THE ONLINE VERSION

DRIVER 2: Real-Time System Offline

[Diagram: tweets such as "2014-02-31 13:14 #foo bar" enter BATCH PROC and REAL-TIME PROC; BATCH VIEW and REAL-TIME RESULT hold counts such as (2014-02-31 13, #foo) -> 3; FEDERATION]

COMPUTE EVERY HOUR HASHTAG COUNT FOR THE LAST HOUR (ON DISK)

FALL BACK TO THE PARTIAL RESULT WHEN THE REAL-TIME GRID IS OFFLINE

DRIVER 3 : CAN’T RECOMPUTE

[Diagram: BATCH PROC, BATCH VIEW, REAL-TIME PROC, REAL-TIME RESULT, FEDERATION]

USER1 ITEM1 VIEW
USER1 ITEM2 BUY
USER1 ITEM1 VIEW
USER1 ITEM1 VIEW

ITEM-ITEM SIMILARITY MATRIX

USER -> [ ITEM1, … ITEMn]

RECOMMENDATION

PAIN POINTS

PAIN POINT 1 : EXACTLY ONCE

2014-02-31 13:14 #foo bar

2014-02-31 13:15 toto

2014-02-31 13:15 tutu

2014-02-31 13:16 #two

Retry

PAIN POINT 2 : DYNAMIC SCALE

START AT 100 events per second

HOW TO GROW TO 10k events per second without rebuilding everything?

PAIN POINT 3 : SCHEMA CHANGE

[Diagram: EVENTS V1 and EVENTS V2 enter BATCH PROC and REAL-TIME PROC; BATCH VIEW, REAL-TIME RESULT, FEDERATION]

MIX OF VERSION 1 AND VERSION 2 !!!!

TOOLS AND FRAMEWORK

Lambda Architecture Building Blocks

• Message Queue
• Batch State
• Batch Pump
• Real-Time State
• Real-Time Views Service
• Federated View
• Batch Views Service
• Batch Processing
• Real-Time Processing

Components

[Diagram: the lambda architecture building blocks (Message Queue, Batch Pump, Batch State, Batch Processing, Batch Views Service, Real-Time Processing, Real-Time State, Real-Time Views Service, Federated View), annotated with example tools: STORM, HDFS, MapRed, HBASE, MEMCACHE, MONGODB, WEBAPP, RABBITMQ, FLUME]


Message Queues

• Kestrel (Single Node)
• Kafka (LinkedIn, Distributed)
• RabbitMQ
• ActiveMQ

Micro-Batch, State in Processor, Persistent

Event, State in Queue, Rich Routing

TOPOLOGY : SINGLE PIPE

[Diagram: the lambda architecture building blocks, with STORM marked on two of them]

Storm

Developed in 2008-2009 at BackType

First open source release in 2011

[Diagram: a SPOUT and a BOLT, each emitting TUPLEs]

Topologies

[Diagram: two SPOUTs feeding four BOLTs; two of the bolts are annotated "This one is likely to write in a State" and "This one too"]

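import java.util.Map;

// Package names as of Storm 0.9.x; Storm 1.0+ uses org.apache.storm instead.
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Emits one (time, hashtag) tuple per hashtag found in the incoming tweet.
public class HashTagParseBolt extends BaseRichBolt {
    private OutputCollector _collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        _collector = collector;
    }

    @Override
    public void execute(Tuple tweet) {
        // Assumption: the "hashtags" field is a space-separated string.
        for (String hashtag : tweet.getStringByField("hashtags").split(" ")) {
            _collector.emit(new Values(tweet.getValueByField("time"), hashtag));
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("time", "hashtag"));
    }
}

Parse Tweet Bolt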

Topologies

[Diagram: Tweet -> TweetSpout -> ParseTweetBolt -> Count HashTags Bolt -> Store in Flat File]
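The data flow of this topology can be sketched without any Storm dependency. The code below is a single-threaded stand-in, so it shows only what each stage does and not how Storm distributes the work; the class name, the sample tweets and the method names are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class HashtagTopologySketch {
    // Stand-in for TweetSpout: the source of tuples.
    static List<String> tweetSpout() {
        return Arrays.asList("#foo bar", "hello #foo #two", "toto");
    }

    // Stand-in for ParseTweetBolt: one output tuple per hashtag.
    static List<String> parseTweetBolt(String tweet) {
        List<String> hashtags = new ArrayList<>();
        for (String word : tweet.split("\\s+")) {
            if (word.startsWith("#")) {
                hashtags.add(word);
            }
        }
        return hashtags;
    }

    // Stand-in for Count HashTags Bolt: keeps a running count per hashtag.
    static void countHashTagsBolt(Map<String, Integer> counts, String hashtag) {
        counts.merge(hashtag, 1, Integer::sum);
    }

    // Wires the stages together in the order shown in the diagram.
    static Map<String, Integer> run() {
        Map<String, Integer> counts = new TreeMap<>();
        for (String tweet : tweetSpout()) {
            for (String hashtag : parseTweetBolt(tweet)) {
                countHashTagsBolt(counts, hashtag);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Stand-in for "Store in Flat File": one tab-separated line per hashtag.
        run().forEach((tag, n) -> System.out.println(tag + "\t" + n));
    }
}
```

In a real topology each stage would run as many parallel tasks, and tuples for the same hashtag would have to be routed to the same counting task so that its running count stays complete.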

BALANCING

[Diagram: a CLUSTER contains NODEs; each NODE runs PROCESSes (ONE PER TOPOLOGY); each PROCESS runs EXECUTORs (PER SPOUT OR BOLT); each EXECUTOR runs TASKs; REBALANCE]

(Optional) RELIABILITY

• When emitting a tuple from an existing tuple, trace its origin
• “Ack” or “Fail” each tuple
• If a tuple or its dependent tuples are not fully “acked”: REPLAY

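import java.util.Map;

// Package names as of Storm 0.9.x; Storm 1.0+ uses org.apache.storm instead.
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Same bolt as before, but every output is anchored to its input and the input is acked.
public class HashTagParseBolt extends BaseRichBolt {
    private OutputCollector _collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        _collector = collector;
    }

    @Override
    public void execute(Tuple tweet) {
        // Assumption: the "hashtags" field is a space-separated string.
        for (String hashtag : tweet.getStringByField("hashtags").split(" ")) {
            // Passing the input tuple as first argument anchors the new tuple to it.
            _collector.emit(tweet, new Values(tweet.getValueByField("time"), hashtag));
        }
        _collector.ack(tweet);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("time", "hashtag"));
    }
}

Reliable Parse Tweet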
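The bookkeeping behind "trace origin, ack or fail, replay" can be illustrated with a small tracker. This is a didactic sketch and not how Storm implements acking internally; the class TupleTreeTracker, its methods and the string tuple ids are all hypothetical. Each spout tuple is the root of a tree, an anchored emit adds a node to the tree of its parent, and the root counts as fully processed only when every node has been acked.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class TupleTreeTracker {
    // Root tuple id -> ids of the tuples in its tree that are not acked yet.
    private final Map<String, Set<String>> pendingByRoot = new HashMap<>();
    // Tuple id -> id of the root tuple it descends from.
    private final Map<String, String> rootOf = new HashMap<>();
    // Roots that have to be replayed by the spout.
    private final Set<String> toReplay = new HashSet<>();

    // A spout emits a new root tuple.
    void emitRoot(String rootId) {
        Set<String> pending = new HashSet<>();
        pending.add(rootId);
        pendingByRoot.put(rootId, pending);
        rootOf.put(rootId, rootId);
    }

    // A bolt emits childId anchored to parentId; call this before acking the parent.
    void emitAnchored(String parentId, String childId) {
        String root = rootOf.get(parentId);
        rootOf.put(childId, root);
        pendingByRoot.get(root).add(childId);
    }

    // Returns true when this ack completes the whole tree.
    boolean ack(String tupleId) {
        String root = rootOf.remove(tupleId);
        Set<String> pending = pendingByRoot.get(root);
        if (pending == null) {
            return false; // unknown tuple, or its tree has already failed
        }
        pending.remove(tupleId);
        if (pending.isEmpty()) {
            pendingByRoot.remove(root);
            return true;
        }
        return false;
    }

    // Any failure in the tree schedules the root tuple for replay.
    void fail(String tupleId) {
        String root = rootOf.get(tupleId);
        if (root != null) {
            pendingByRoot.remove(root);
            toReplay.add(root);
        }
    }

    boolean mustReplay(String rootId) {
        return toReplay.contains(rootId);
    }
}
```

Replaying the root means the downstream bolts may see the same data twice, which is why reliability alone gives at-least-once rather than exactly-once processing.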

TOPOLOGY 2 : SHARE RT

[Diagram: the lambda architecture building blocks, with TRIDENT marked on three of them]

TRIDENT

• Higher Level Operations

• Use Storm as an RPC Framework

• State “Management”


From Schema To Storm Topology


How is exactly-once implemented?

[Diagram: the event stream is cut into batches, each tagged with a transaction id (txid=1, txid=2, txid=3)]

{user=paul, item=car, event=imp}
{user=pierre, item=car, event=imp}
{user=1, item=car, event=imp}

{user=paul, item=car, event=imp}
{user=pierre, item=car, event=imp}
{user=pierre, item=car, event=imp}

Exactly-Once in state

State before the batch:
paul -> { car: 2, txid=2 }
pierre -> { car: 5, txid=3 }

Batch with txid=3:
{user=paul, item=car, event=imp}
{user=pierre, item=car, event=imp}
{user=pierre, item=car, event=imp}

State after the batch:
paul -> { car: 3, txid=3 }
pierre -> { car: 5, txid=3 }

Keep track of the last transaction in the state

A transaction does not apply to state parts that are already at that txid or newer
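The txid check can be sketched as follows, using the paul / pierre example from the slide. It assumes that batches are applied in txid order and that a replayed batch carries exactly the same events as the first attempt; the class TransactionalCounts and its methods are hypothetical names, not Trident's API.

```java
import java.util.HashMap;
import java.util.Map;

final class TransactionalCounts {
    // One state cell: the count and the txid of the last batch applied to it.
    private static final class Cell {
        long count;
        long txid;
        Cell(long count, long txid) {
            this.count = count;
            this.txid = txid;
        }
    }

    private final Map<String, Cell> state = new HashMap<>();

    // Seeds a cell, e.g. to reproduce the "before" state from the slide.
    void put(String key, long count, long txid) {
        state.put(key, new Cell(count, txid));
    }

    // Applies the increments of one batch; cells that already saw this txid are skipped.
    void applyBatch(long txid, Map<String, Long> increments) {
        for (Map.Entry<String, Long> inc : increments.entrySet()) {
            Cell cell = state.computeIfAbsent(inc.getKey(), k -> new Cell(0L, -1L));
            if (cell.txid >= txid) {
                continue; // this part of the state is already at that txid or newer
            }
            cell.count += inc.getValue();
            cell.txid = txid;
        }
    }

    long count(String key) {
        Cell cell = state.get(key);
        return cell == null ? 0L : cell.count;
    }

    long txid(String key) {
        Cell cell = state.get(key);
        return cell == null ? -1L : cell.txid;
    }
}
```

With paul at { car: 2, txid=2 } and pierre at { car: 5, txid=3 }, applying the txid=3 batch (one event for paul, two for pierre) moves paul to 3 and leaves pierre at 5, and applying the same batch again changes nothing.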

TOPOLOGY 1 : SHARE STATE

[Diagram: the lambda architecture building blocks]

USE A SINGLE NOSQL SERVICE FOR ALL USE CASES

REDIS VARIANT

[Diagram: the lambda architecture building blocks, with REDIS marked on four of them]

ALSO USE THE NOSQL AS A MESSAGE QUEUE

TOPOLOGY 3 : SHARED PROCESSING

[Diagram: the lambda architecture building blocks]

SummingBird

Single Scala specification that can run in “Batch” or “Real-Time” mode

[Diagram: Single Scala Code -> Run on Storm Topology; Single Scala Code -> Run on Cascading (Batch)]

object TweetHashTagCount {
  implicit val timeOf: TimeExtractor[Status] = TimeExtractor(_.getCreatedAt.getTime)
  implicit val batcher = Batcher.ofHours(1)

  ….

  def hashTagCount[P <: Platform[P]](
    source: Producer[P, Status],
    store: P#Store[String, Long]) =
    source
      .filter(_.getText != null)
      .flatMap { tweet: Status => tweet.getHashTags.map(_ -> 1L) }
      .sumByKey(store)
}

Tweet SummingBird
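The idea of writing the logic once and running it in two modes can be illustrated in plain Java. This is a loose analogy and not Summingbird's API: the specification is reduced to a function from a tweet's text to its hashtags, and the two "platforms" are two small runner methods. All names below are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

final class SingleSpecTwoModes {
    // The shared specification: tweet text -> hashtags to count (null text yields nothing).
    static final Function<String, List<String>> HASHTAGS = tweet -> {
        List<String> tags = new ArrayList<>();
        if (tweet == null) {
            return tags;
        }
        for (String word : tweet.split("\\s+")) {
            if (word.startsWith("#")) {
                tags.add(word);
            }
        }
        return tags;
    };

    // "Batch" mode: process a complete data set in one pass and return a fresh store.
    static Map<String, Long> runBatch(Function<String, List<String>> spec, List<String> tweets) {
        return tweets.stream()
                .flatMap(tweet -> spec.apply(tweet).stream())
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    // "Real-time" mode: update an existing store one event at a time.
    static void runRealTime(Function<String, List<String>> spec, String tweet, Map<String, Long> store) {
        for (String tag : spec.apply(tweet)) {
            store.merge(tag, 1L, Long::sum);
        }
    }

    public static void main(String[] args) {
        List<String> tweets = Arrays.asList("#foo bar", "#foo #two");
        Map<String, Long> store = new HashMap<>();
        for (String tweet : tweets) {
            runRealTime(HASHTAGS, tweet, store);
        }
        System.out.println(runBatch(HASHTAGS, tweets).equals(store)); // prints true
    }
}
```

Both modes reach the same counts because summing is associative and commutative, so it does not matter whether the events are added all at once or one by one.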

Putting this together

[Diagram: stack of layers, top to bottom]

SUMMING BIRD (COMMON ABSTRACTION)

CASCADING, TRIDENT (SQL Level Abstraction; STATE, RPC)

MAP REDUCE (Distributed Batch Computation), STORM (Distributed RT Computation)

BATCH STORES (HDFS …), RT STORES (NoSQL .. etc..)

WEB-SCALE VARIANT

[Diagram: the lambda architecture building blocks implemented with MongoDB: Insert in Mongo (twice), Mongo MapReduce, Mongo Collection, Mongo, Mongo Aggregation]

HADOOPY VARIANT

[Diagram: the lambda architecture building blocks implemented with Hadoop tools: INSERT IN HBASE, HIVE/MAP REDUCE, HBASE, HBASE, HBASE Queries]

Integrated Publish

[Diagram: the lambda architecture building blocks]

SploutSQL


SPARK VARIANT

[Diagram: the lambda architecture building blocks, annotated with SPARK STREAMING, HDFS, SPARK, MEMORY]

QUESTIONS

[Diagram: questions drawn as a lambda architecture: QUESTION QUEUE; Batch Processing: MAIL (florian.douetteau@dataiku.com), ANSWER TO MAIL; Real-Time Processing: MY MEMORY, ANSWER; AUDIENCE HAPPY]