@fdouetteau#lambdataiku
Lambda Architecture
Florian Douetteau, CEO Dataiku (@fdouetteau), www.dataiku.com

Topics For Today
• WHAT is a lambda architecture?
  • Examples, principle
  • Motivation, hard points
• HOW do you build a lambda architecture?
  • Component by component

Lambda
[Diagram: EVENTS, PROCESS, STATE, SERVE]

ƛ : SOME USE CASES
• Online Advertising
  • Keep track of the number of displays / clicks per position / campaign
• Recommender Systems
  • Keep track of product displays / views / clicks / buys
• Statistical Time Line
  • Keep track of the number of tweets per hashtag / hour

SQL WAY
[Diagram: EVENTS, PROCESS, STATE, SERVE; labels: RDBMS, SQL]
USER1 ITEM1 VIEW
USER1 ITEM2 BUY
INSERT OR UPDATE VIEWS SET pageviews = pageviews + 1
WHERE user=USER1 …

Functional Programming, Append Only
[Diagram: EVENTS, PROCESS, STATE (APPEND ONLY), SERVE]
newstate = Fagg(oldstate, Fstore(events))
result = F(state, lastevents, scope)

E.g. counting Twitter hashtags
[Diagram: EVENTS, PROCESS, STATE, SERVE]
Fmap(events) = { (#tag, time) -> count }
Freduce(hashmap, hashmap) = fuse counts in maps
Fdisplay(hashmap, events) = Freduce(hashmap, Fmap(events))
TWEET COUNTS
(2014-02-31 13, #foo) -> 3
(2014-02-31 13, #foo) -> 3
(2014-02-31 13, #foo) -> 3
(2014-02-31 13, #foo) -> 3
NEW TWEETS TABLE
2014-02-31 13:14 #foo bar
2014-02-31 13:14 #foo bar
2014-02-31 13:14 #foo bar
2014-02-31 13:14 #foo bar
2014-02-31 13:14 #foo bar
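A minimal Java sketch of the three functions above, under stated assumptions: the `HashtagCounts` class, the `Tweet` type and the string key built from hour plus tag are hypothetical and do not come from the slides. It only illustrates that the displayed result is the stored counts fused with counts computed from the events not yet folded in.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical names throughout; the slides only name Fmap, Freduce and Fdisplay.
class HashtagCounts {
    // One raw event: an hour bucket such as "2014-02-31 13" and the tweet text.
    static final class Tweet {
        final String hour;
        final String text;
        Tweet(String hour, String text) { this.hour = hour; this.text = text; }
    }

    // Fmap: turn a list of tweets into {"hour #tag" -> count}.
    static Map<String, Long> fmap(List<Tweet> tweets) {
        Map<String, Long> counts = new HashMap<>();
        for (Tweet t : tweets) {
            for (String word : t.text.split("\\s+")) {
                if (word.startsWith("#") && word.length() > 1) {
                    counts.merge(t.hour + " " + word, 1L, Long::sum);
                }
            }
        }
        return counts;
    }

    // Freduce: fuse two count maps by adding counts key by key (inputs are not modified).
    static Map<String, Long> freduce(Map<String, Long> a, Map<String, Long> b) {
        Map<String, Long> out = new HashMap<>(a);
        b.forEach((k, v) -> out.merge(k, v, Long::sum));
        return out;
    }

    // Fdisplay: stored state fused with counts from events not yet stored.
    static Map<String, Long> fdisplay(Map<String, Long> state, List<Tweet> newTweets) {
        return freduce(state, fmap(newTweets));
    }

    public static void main(String[] args) {
        Map<String, Long> state = new HashMap<>();
        state.put("2014-02-31 13 #foo", 3L);
        List<Tweet> fresh = List.of(new Tweet("2014-02-31 13", "#foo bar"));
        System.out.println(fdisplay(state, fresh));
    }
}
```

Because `freduce` returns a new map, the stored state stays append-only: new events never overwrite it in place.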

E.g. counting Twitter hashtags in “SQL”
[Diagram: EVENTS, SERVE]
TWEET COUNTS TABLE
(2014-02-31 13, #foo) -> 8
(2014-02-31 13, #foo2) -> 3
(2014-02-31 13, #foo3) -> 3
(2014-02-31 13, #foo4) -> 1
NEW TWEETS TABLE
2014-02-31 13:14 #foo bar
2014-02-31 13:14 #foo bar
2014-02-31 13:14 #foo bar
2014-02-31 13:14 #foo bar
2014-02-31 13:14 #foo bar
PARTIAL TWEET COUNT TABLE
(2014-02-31 13, #foo) -> 1
(2014-02-31 14, #foo) -> 3
(2014-02-31 14, #foo) -> 3
(2014-02-31 14, #foo) ->
NEW TWEET COUNT TABLE
(2014-02-31 13, #foo) -> 9
(2014-02-31 13, #foo) -> 3
(2014-02-31 13, #foo) -> 3
(2014-02-31 13, #foo) -> 3
CREATE … AS SELECT time, tag, COUNT(*) GROUP BY TIME, TAG
CREATE AS SELECT time, tag, SUM(counts) FROM ( oldtable … UNION partialtable) GROUP BY TIME, TAG
SELECT time, tag, SUM(c) FROM (SELECT time, tag, c FROM oldtable WHERE tag = … UNION SELECT time, tag, c FROM partialtable WHERE tag=…)
INSERT VALUES …
RENAME TABLE …
EXECUTE EVERY 5 MINUTES
EXECUTE EVERY HOUR

ƛ : PRINCIPLE
[Diagram: EVENTS, BATCH PROC, BATCH VIEW, REAL-TIME PROC, REAL-TIME RESULT, FEDERATION]

BackType Story
Capture events and logs from Twitter
• 25 TB binary data
• 100 billion records
• 400 QPS average
• Scale 1 -> 150 on peak
Took off with a team of 3 engineers and seed funding in 2008: Christopher Golda, Michael Montano, Nathan Marz
Acquired by Twitter in 2011 (powers Twitter trends …)
Cascalog, Storm, ElephantDB

TWITTER HASHTAGS
[Diagram: BATCH PROC, BATCH VIEW, REAL-TIME PROC, REAL-TIME RESULT, FEDERATION]
2014-02-31 13:14 #foo bar
2014-02-31 13:14 #foo bar
2014-02-31 13:14 #foo bar
(2014-02-31 13, #foo) -> 3
(2014-02-31 13, #foo) -> 3
COMPUTE EVERY 5 MINUTES HASHTAG COUNTS FOR THE LAST 5 MINUTES (IN MEMORY)
COMPUTE EVERY HOUR HASHTAG COUNT FOR THE LAST HOUR (ON DISK)
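One way to picture the federation step is the Java sketch below. It is an illustration under assumptions, not the deck's implementation: the `FederatedCounts` class and its method names are hypothetical, and it assumes that each batch refresh already covers every event the real-time side has counted so far, so the in-memory counters can simply be reset.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: none of these names come from the slides.
class FederatedCounts {
    // Batch view: rebuilt from scratch by the batch job (hourly in the slide).
    private Map<String, Long> batchView = new HashMap<>();
    // Real-time result: increments seen since the last batch run, kept in memory.
    private final Map<String, Long> realTime = new HashMap<>();

    // Real-time path: count one occurrence of a key such as "2014-02-31 13 #foo".
    void onEvent(String key) {
        realTime.merge(key, 1L, Long::sum);
    }

    // Batch path: swap in the recomputed view. Assumption: the new view already
    // covers every event counted in realTime, so those counters are dropped.
    void onBatchRefresh(Map<String, Long> freshView) {
        batchView = new HashMap<>(freshView);
        realTime.clear();
    }

    // Federation: a query sums both sides for the requested key.
    long query(String key) {
        return batchView.getOrDefault(key, 0L) + realTime.getOrDefault(key, 0L);
    }

    public static void main(String[] args) {
        FederatedCounts counts = new FederatedCounts();
        counts.onEvent("2014-02-31 13 #foo");
        System.out.println(counts.query("2014-02-31 13 #foo"));
    }
}
```

In a real deployment the hand-over between the two sides is the delicate part; the blanket `clear()` here stands in for whatever bookkeeping decides which real-time increments the batch view has absorbed.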

RECOMMENDER SYSTEM
[Diagram: BATCH PROC, BATCH VIEW, REAL-TIME PROC, REAL-TIME RESULT, FEDERATION]
USER1 ITEM1 VIEW
USER1 ITEM2 BUY
USER1 ITEM1 VIEW
USER1 ITEM1 VIEW
ITEM-ITEM SIMILARITY MATRIX
USER -> [ ITEM1, … ITEMn]
RECOMMENDATION
THREE KEY DRIVERS FOR LAMBDA ARCH

DRIVER 1: Support Smooth Evolution
[Diagram: BATCH PROC, BATCH VIEW, REAL-TIME PROC, REAL-TIME RESULT, FEDERATION]
2014-02-31 13:14 #foo bar
2014-02-31 13:14 #foo bar
2014-02-31 13:14 #foo bar
(2014-02-31 13:14, #foo) -> 3
(2014-02-31 13:14, #foo) -> 3
(2014-02-31 13, #foo) -> 3
(1) RECOMPUTE THE NEW VERSION ON BATCH WHILE KEEPING THE OLD ONE
(2) THEN UPDATE THE ONLINE VERSION

DRIVER 2: Real-Time System Offline
[Diagram: BATCH PROC, BATCH VIEW, REAL-TIME PROC, REAL-TIME RESULT, FEDERATION]
2014-02-31 13:14 #foo bar
2014-02-31 13:14 #foo bar
2014-02-31 13:14 #foo bar
(2014-02-31 13, #foo) -> 3
(2014-02-31 13, #foo) -> 3
COMPUTE EVERY HOUR HASHTAG COUNT FOR THE LAST HOUR (ON DISK)
FALLBACK TO PARTIAL RESULT WHEN THE REAL-TIME GRID IS OFFLINE
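The fallback can be sketched in Java as follows. This is a hedged illustration only: `FallbackQuery`, `Result` and the `realTimeLookup` function are hypothetical names, and an unreachable real-time grid is modeled simply as a lookup that throws a runtime exception.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: none of these names come from the slides.
class FallbackQuery {
    // A count plus a flag telling the caller whether the real-time part is missing.
    static final class Result {
        final long count;
        final boolean partial;
        Result(long count, boolean partial) { this.count = count; this.partial = partial; }
    }

    // Assumption: realTimeLookup returns a non-null count, or throws a
    // RuntimeException when the real-time grid cannot be reached.
    static Result query(Map<String, Long> batchView,
                        Function<String, Long> realTimeLookup,
                        String key) {
        long batch = batchView.getOrDefault(key, 0L);
        try {
            return new Result(batch + realTimeLookup.apply(key), false);
        } catch (RuntimeException offline) {
            // Real-time side is down: serve the batch view alone, flagged as partial.
            return new Result(batch, true);
        }
    }

    public static void main(String[] args) {
        Map<String, Long> batchView = new HashMap<>();
        batchView.put("2014-02-31 13 #foo", 3L);
        Result r = query(batchView, k -> { throw new IllegalStateException("offline"); },
                "2014-02-31 13 #foo");
        System.out.println(r.count + " partial=" + r.partial);
    }
}
```

Returning the `partial` flag rather than hiding the failure lets the serving layer decide whether a stale count is acceptable for a given page.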

DRIVER 3 : CAN’T RECOMPUTE
[Diagram: BATCH PROC, BATCH VIEW, REAL-TIME PROC, REAL-TIME RESULT, FEDERATION]
USER1 ITEM1 VIEW
USER1 ITEM2 BUY
USER1 ITEM1 VIEW
USER1 ITEM1 VIEW
ITEM-ITEM SIMILARITY MATRIX
USER -> [ ITEM1, … ITEMn]
RECOMMENDATION
PAIN POINTS

PAIN POINT 1 : EXACTLY ONCE
2014-02-31 13:14 #foo bar
2014-02-31 13:15 toto
2014-02-31 13:15 tutu
2014-02-31 13:16 #two
…
…
Retry

PAIN POINT 2 : DYNAMIC SCALE
START AT 100 events per second
HOW TO GROW TO 10k events per second without rebuilding everything?

PAIN POINT 3 : SCHEMA CHANGE
[Diagram: BATCH PROC, BATCH VIEW, REAL-TIME PROC, REAL-TIME RESULT, FEDERATION, with inputs EVENTS V1 and EVENTS V2]
MIX OF VERSION 1 AND VERSION 2 !!!!

TOOLS AND FRAMEWORK

Lambda Architecture Building Blocks
• Message Queue
• Batch State
• Batch Pump
• Real-Time State
• Real-Time Views Service
• Federated View
• Batch Views Service
• Batch Processing
• Real-Time Processing

Components
[Diagram: the Lambda Architecture building blocks, annotated with example tools: STORM, HDFS, MapRed, HBASE, MEMCACHE, MONGODB, WEBAPP, RABBITMQ, FLUME]

Components
[Diagram: the Lambda Architecture building blocks]

Message Queues
• Kestrel (Single Node)
• Kafka (LinkedIn, Distributed)
• RabbitMQ
• ActiveMQ
• Micro-Batch, State in Processor, Persistent
• Event, State in Queue, Rich Routing

TOPOLOGY : SINGLE PIPE
[Diagram: the Lambda Architecture building blocks, with two blocks labeled STORM]

Storm
Developed in 2008-2009 at BackType
First open source release in 2011
[Diagram: a SPOUT emits TUPLEs to a BOLT, which emits further TUPLEs]

Topologies
[Diagram: two SPOUTs feeding four BOLTs]
This one is likely to write in a State
This one too
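
Parse Tweet Bolt
import java.util.List;
import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Assumes the spout emits a "time" field and a "hashtags" field holding a List<String>.
// Storm releases before 1.0 use the backtype.storm package prefix instead.
public class HashTagParseBolt extends BaseRichBolt {
    private OutputCollector _collector;

    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        _collector = collector;
    }

    // Emit one (time, hashtag) tuple per hashtag of the incoming tweet.
    public void execute(Tuple tweet) {
        List<String> hashtags = (List<String>) tweet.getValueByField("hashtags");
        for (String hashtag : hashtags) {
            _collector.emit(new Values(tweet.getValueByField("time"), hashtag));
        }
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("time", "hashtag"));
    }
}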

Topologies
[Diagram: TweetSpout -> ParseTweetBolt -> Count HashTags Bolt -> Store in Flat File; the tuple leaving the spout is labeled Tweet]

BALANCING
[Diagram: a CLUSTER of NODEs; each NODE runs PROCESSes, each PROCESS runs EXECUTORs, each EXECUTOR runs TASKs]
• PROCESS: one per topology
• EXECUTOR: per spout or bolt
• REBALANCE

(Optional) RELIABILITY
• When emitting a tuple from an existing tuple, trace its origin
• “Ack” or “Fail” each tuple
• If a tuple or its dependent tuples are not fully “acked”: REPLAY
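
Reliable Parse Tweet
import java.util.List;
import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Assumes the spout emits a "time" field and a "hashtags" field holding a List<String>.
// Storm releases before 1.0 use the backtype.storm package prefix instead.
public class HashTagParseBolt extends BaseRichBolt {
    private OutputCollector _collector;

    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        _collector = collector;
    }

    public void execute(Tuple tweet) {
        List<String> hashtags = (List<String>) tweet.getValueByField("hashtags");
        for (String hashtag : hashtags) {
            // Passing the input tuple as first argument anchors the new tuple to its origin.
            _collector.emit(tweet, new Values(tweet.getValueByField("time"), hashtag));
        }
        // Acknowledge the input once all derived tuples have been emitted.
        _collector.ack(tweet);
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("time", "hashtag"));
    }
}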

TOPOLOGY 2 : SHARE RT
[Diagram: the Lambda Architecture building blocks, with three blocks labeled TRIDENT]
TRIDENT
• Higher Level Operations
• Use Storm as an RPC Framework
• State “Management”
From Schema To Storm Topology

How is exactly-once implemented?
Batch: {user=paul, item=car, event=imp} {user=pierre, item=car, event=imp} {user=1, item=car, event=imp}
Batch: {user=paul, item=car, event=imp} {user=pierre, item=car, event=imp} {user=pierre, item=car, event=imp}
…
Transaction ids: txid=1, txid=2, txid=3

Exactly-Once in state
State before: paul -> { car: 2, txid=2 }, pierre -> { car: 5, txid=3 }
Batch txid=3: {user=paul, item=car, event=imp} {user=pierre, item=car, event=imp} {user=pierre, item=car, event=imp}
State after: paul -> { car: 3, txid=3 }, pierre -> { car: 5, txid=3 }
• Keep track of the last transaction in the state
• A transaction does not apply to newer state parts
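The txid rule above can be sketched in Java as follows. This is a simplified illustration, not Trident's actual state API: `TxCounterState`, `Entry` and `applyBatch` are hypothetical names, and the sketch assumes that a replayed batch carries the same txid and the same content as the first attempt.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: none of these names come from the slides or from Trident.
class TxCounterState {
    // A counter value together with the id of the last transaction applied to it.
    static final class Entry {
        final long count;
        final long txid;
        Entry(long count, long txid) { this.count = count; this.txid = txid; }
    }

    private final Map<String, Entry> state = new HashMap<>();

    void put(String key, long count, long txid) {
        state.put(key, new Entry(count, txid));
    }

    Entry get(String key) {
        return state.get(key);
    }

    // Apply the per-key increments of one batch. A key whose stored txid is
    // already at or past this batch's txid is skipped, so a replay is harmless.
    void applyBatch(long txid, Map<String, Long> increments) {
        increments.forEach((key, delta) -> {
            Entry old = state.get(key);
            if (old != null && old.txid >= txid) {
                return; // this part of the state has already seen the transaction
            }
            long base = (old == null) ? 0L : old.count;
            state.put(key, new Entry(base + delta, txid));
        });
    }

    public static void main(String[] args) {
        TxCounterState s = new TxCounterState();
        s.put("paul/car", 2, 2);
        s.put("pierre/car", 5, 3);
        Map<String, Long> batch = new HashMap<>();
        batch.put("paul/car", 1L);
        batch.put("pierre/car", 2L);
        s.applyBatch(3, batch);
        System.out.println(s.get("paul/car").count + " " + s.get("pierre/car").count);
    }
}
```

Storing the txid next to each value is what makes a retry safe: the partially applied batch from the slide only touches the keys that had not yet recorded txid=3.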

TOPOLOGY 1 : SHARE STATE
[Diagram: the Lambda Architecture building blocks]
USE A SINGLE NOSQL SERVICE FOR ALL USE CASES

REDIS VARIANT
[Diagram: the Lambda Architecture building blocks, with four blocks labeled REDIS]
ALSO USE THE NOSQL AS A MESSAGE QUEUE

TOPOLOGY 3 : SHARED PROCESSING
[Diagram: the Lambda Architecture building blocks]

SummingBird
A single Scala specification that can run in “Batch” or “Real-Time” mode
[Diagram: Single Scala Code -> Run on Storm Topology; Single Scala Code -> Run on Cascading (Batch)]

Tweet SummingBird
object TweetHashTagCount {
  implicit val timeOf: TimeExtractor[Status] = TimeExtractor(_.getCreatedAt.getTime)
  implicit val batcher = Batcher.ofHours(1)
  ….
  def hashTagCount[P <: Platform[P]](
      source: Producer[P, Status],
      store: P#Store[String, Long]) =
    source
      .filter(_.getText != null)
      .flatMap { tweet: Status => tweet.getHashTags.map(_ -> 1L) }
      .sumByKey(store)
}

Putting this together
[Diagram: layered stack]
Layers: SUMMING BIRD; CASCADING; MAP REDUCE; TRIDENT; STORM; RT STORES (NoSQL, etc.); BATCH STORES (HDFS …)
Labels: COMMON ABSTRACTION; SQL Level Abstraction; Distributed Batch Computation; Distributed RT Computation; STATE; RPC

WEB-SCALE VARIANT
[Diagram: the Lambda Architecture building blocks, annotated with: Insert in Mongo, Insert in Mongo, Mongo MapReduce, Mongo Collection, Mongo, Mongo Aggregation]

HADOOPY VARIANT
[Diagram: the Lambda Architecture building blocks, annotated with: INSERT IN HBASE, HIVE/MAP REDUCE, HBASE, HBASE, HBASE Queries]

Integrated Publish
[Diagram: the Lambda Architecture building blocks]
SploutSQL

SPARK VARIANT
[Diagram: the Lambda Architecture building blocks, annotated with: SPARK STREAMING, HDFS, SPARK, MEMORY]

QUESTIONS
florian.douetteau@dataiku.com
[Diagram: QUESTION QUEUE, MY MEMORY, ANSWER, ANSWER TO, AUDIENCE HAPPY, Batch Processing, Real-Time Processing]