Post on 27-Jan-2015
Apache Samza
Reliable Stream Processing Atop Apache Kafka and Hadoop YARN
Jakob Homan London HUG
Who I am
• Samza for five months
• Before that Hadoop, Hive, Giraph
• Say hi: @blueboxtraveler
Things we would like to do (better)
Provide timely, relevant updates to your newsfeed
Update search results with new information as it appears
Sculpt metrics and logs into useful shapes
Tools?
Response latency
• RPC: Synchronous
• Samza: Milliseconds to minutes
• Later. Possibly much later.
Frame(work) of reference
                    Classic Hadoop                  Samza
Storage layer       HDFS                            Kafka
Execution engine    Map-Reduce                      YARN
API                 map(k, v) => (k, v)             process(msg(k, v)) => msg(k, v)
                    reduce(k, list(v)) => (k, v)
Storage layer: Kafka
Apache Kafka
• Persistent, reliable, distributed message queue
Shiny new logo!
At LinkedIn
• 10+ billion writes per day
• 172k messages per second (average)
• 55+ billion messages per day to real-time consumers
Quick aside…
Kafka: First among (pluggable) equals
LinkedIn: Espresso and Databus
Coming soon? HDFS, ActiveMQ, Amazon SQS
Kafka in four bullet points
• Producers send messages to brokers
• Messages are key, value pairs
• Brokers store messages in topics for consumers
• Consumers pull messages from brokers
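The four bullets can be sketched as a toy in-memory model. This is not the Kafka client API: ToyBroker, Message, send and poll are hypothetical names made up for illustration, and the single unpartitioned log per topic is a simplifying assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy in-memory stand-in for a broker; NOT the real Kafka API.
class ToyBroker {
    // One key/value message, as in "Messages are key, value pairs".
    static final class Message {
        final String key;
        final String value;

        Message(String key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    // Each topic is an append-only list of messages held by the broker.
    private final Map<String, List<Message>> topics = new HashMap<>();

    // Producer side: append a message to the named topic.
    void send(String topic, String key, String value) {
        topics.computeIfAbsent(topic, t -> new ArrayList<>()).add(new Message(key, value));
    }

    // Consumer side: pull up to maxMessages starting at the consumer's own offset.
    // The broker does not push; the consumer decides where to read from.
    List<Message> poll(String topic, int offset, int maxMessages) {
        List<Message> log = topics.getOrDefault(topic, new ArrayList<>());
        int from = Math.min(Math.max(offset, 0), log.size());
        int to = Math.min(from + maxMessages, log.size());
        return new ArrayList<>(log.subList(from, to));
    }
}
```

A consumer that remembers the last offset it read can call poll again later and pick up where it left off, which is what makes the pull model replayable.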
A Kafka Topic
• Key 534: “Very sleepy”
• Key 755: “Car nicked!”
• Key 234: “The ref’s blind!”
• Key 534: “Nicked a car!”
Topic: StatusUpdateEvent
Key: User ID of user who updated the status
Value: Timestamp, new status, geolocation, etc.
Kafka topics are partitioned
[Diagram: messages, each with a key and message contents, spread across Partition 0, Partition 1, and Partition 2]
For our purposes, hash partitioned on the key!
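Hash partitioning on the key can be sketched in a few lines. KeyPartitioner and partitionFor are hypothetical names, and String.hashCode() is an assumption made for illustration; the hash function Kafka itself applies may differ.

```java
// Minimal sketch of hash partitioning: the same key always maps to the same partition.
class KeyPartitioner {
    // Map a key to a partition number in [0, numPartitions).
    static int partitionFor(String key, int numPartitions) {
        // floorMod keeps the result non-negative even when hashCode() is negative.
        return Math.floorMod(key.hashCode(), numPartitions);
    }
}
```

Because the mapping depends only on the key and the partition count, every status update from one user lands in the same partition, in the order it was sent.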
A Samza job
Input topics
• StatusUpdateEvent
• NewConnectionEvent
• LikeUpdateEvent
Some code
MyStreamTask implements StreamTask{ …………. }
Output topics
• NewsUpdatePost
• UpdatesPerHourMetric
Execution engine: YARN
What we use YARN for
• Distributing our tasks across multiple machines
• Letting us know when one has died
• Distributing a replacement
• Isolating our tasks from each other
YARN: Execution and reliability
[Diagram: two machines with Node Manager 1 and Node Manager 2, plus two Kafka Brokers. Samza TaskRunners for Partition 0 and Partition 1 each call MyStreamTask:process(), alongside a Samza App Master.]
Co-partitioning of topics
[Diagram: the Samza TaskRunner for Partition 0 runs MyStreamTask:process(), reading StatusUpdateEvent, Partition 0 and NewConnectionEvent, Partition 0, and writing to NewsUpdatePost]
An instance of StreamTask is responsible for a specific partition
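Why co-partitioning matters can be shown with a small routing sketch. It assumes both topics have the same number of partitions and use the same key hash; CoPartitioningSketch, Task, newTasks and route are hypothetical names, not Samza classes.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: with co-partitioned topics, one task instance sees every message for a given key.
class CoPartitioningSketch {
    // Same hash for every topic (an assumption of this sketch).
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // One task instance per partition; it records which (topic, key) pairs it received.
    static final class Task {
        final int partition;
        final List<String> seen = new ArrayList<>();

        Task(int partition) {
            this.partition = partition;
        }

        void process(String topic, String key) {
            seen.add(topic + ":" + key);
        }
    }

    // Create one task per partition.
    static Task[] newTasks(int numPartitions) {
        Task[] tasks = new Task[numPartitions];
        for (int p = 0; p < numPartitions; p++) {
            tasks[p] = new Task(p);
        }
        return tasks;
    }

    // Deliver a message to the task that owns the partition of its key.
    static void route(Task[] tasks, String topic, String key) {
        tasks[partitionFor(key, tasks.length)].process(topic, key);
    }
}
```

Routing a StatusUpdateEvent and a NewConnectionEvent with the same user ID delivers both to the same Task, so that task can keep the user's connections locally instead of looking them up remotely.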
API: process()
public interface StreamTask {
  void process(IncomingMessageEnvelope envelope,
               MessageCollector collector,
               TaskCoordinator coordinator);
}
• envelope: getKey(), getMsg()
• collector: sendMsg(topic, key, value)
• coordinator: commit(), shutdown()
Awesome feature: State
• Generic data store interface
• Key-value out-of-box
  – More soon? Bloom filter, Lucene, etc.
• Restored by Samza upon task crash
[Diagram: the Samza TaskRunner for Partition 0 runs MyStreamTask:process() with its own state store]
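The slides do not spell out how state is restored after a crash. One plausible mechanism, assumed here purely for illustration, is to log every write to a durable changelog and replay it into a fresh store. ChangelogStoreSketch, Changelog and KvStore are hypothetical names, not the Samza store API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of restore-by-replay; an assumed mechanism, not Samza's actual implementation.
class ChangelogStoreSketch {
    // Durable log of writes; stands in for storage that survives a task crash.
    static final class Changelog {
        final List<String[]> entries = new ArrayList<>();
    }

    // Key-value store that records every put in the changelog.
    static final class KvStore {
        private final Map<String, String> data = new HashMap<>();
        private final Changelog changelog;

        KvStore(Changelog changelog) {
            this.changelog = changelog;
        }

        void put(String key, String value) {
            data.put(key, value);
            changelog.entries.add(new String[] {key, value});
        }

        String get(String key) {
            return data.get(key);
        }

        // Rebuild a fresh store by replaying the changelog in write order.
        static KvStore restore(Changelog changelog) {
            KvStore store = new KvStore(changelog);
            for (String[] entry : changelog.entries) {
                // Write to the map directly so the replay does not re-log entries.
                store.data.put(entry[0], entry[1]);
            }
            return store;
        }
    }
}
```

Replaying in order means later writes to a key overwrite earlier ones, so the restored store ends up with the last value written before the crash.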
(Pseudo)code snippet: Newsfeed
• Consume StatusUpdateEvent
  – Send those updates to all your connections via the NewsUpdatePost topic
• Consume NewConnectionEvent
  – Maintain state of connections to know who to send to
public class NewsFeed implements StreamTask {
  void process(envelope, collector, coordinator) {
    msg = envelope.getMsg()
    userId = msg.get("userID")
    if (msg.get("type") == STATUS_UPDATE) {
      // Fan the new status out to every connection of this user
      foreach (conn : kvStore.get(userId)) {
        collector.send("NewsUpdatePost", new Msg(conn, msg.get("newStatus")))
      }
    } else {
      // NewConnectionEvent: record the new connection in local state
      newConn = msg.get("newConnection")
      connections = kvStore.get(userId)
      kvStore.put(userId, connections ++ newConn)
    }
  }
}
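The newsfeed logic can also be written as a self-contained Java sketch that runs without Samza on the classpath. The Collector class, the map-based message, and the msg helper are hypothetical stand-ins for the envelope, collector, and key-value store; they are not the Samza API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Runnable stand-in for the newsfeed task; all types here are illustrative only.
class NewsFeedSketch {
    // Hypothetical collector: records each send as "topic|recipient|status".
    static final class Collector {
        final List<String> sent = new ArrayList<>();

        void send(String topic, String recipient, String status) {
            sent.add(topic + "|" + recipient + "|" + status);
        }
    }

    // Local state: user ID -> that user's connections.
    private final Map<String, List<String>> kvStore = new HashMap<>();

    // msg carries "type", "userID", and either "newStatus" or "newConnection".
    void process(Map<String, String> msg, Collector collector) {
        String userId = msg.get("userID");
        if ("STATUS_UPDATE".equals(msg.get("type"))) {
            // Fan the new status out to every known connection of this user.
            for (String conn : kvStore.getOrDefault(userId, new ArrayList<>())) {
                collector.send("NewsUpdatePost", conn, msg.get("newStatus"));
            }
        } else {
            // Treat anything else as a new connection and remember it.
            kvStore.computeIfAbsent(userId, id -> new ArrayList<>())
                   .add(msg.get("newConnection"));
        }
    }

    // Helper: build a message map from alternating keys and values.
    static Map<String, String> msg(String... kv) {
        Map<String, String> m = new HashMap<>();
        for (int i = 0; i + 1 < kv.length; i += 2) {
            m.put(kv[i], kv[i + 1]);
        }
        return m;
    }
}
```

Feeding it a new connection for a user and then a status update from that user produces one NewsUpdatePost send per stored connection; a status update from a user with no stored connections produces none.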
Current status
Hello, Samza!
• Up and running in 3 minutes
• Consume Wikipedia edits live
• Generate stats on those edits
• Cool, eh? bit.ly/hello-samza
samza.incubator.apache.org
bit.ly/samza_newbie_issues
Cheers!
• Quick start: bit.ly/hello-samza
• Project homepage: samza.incubator.apache.org
• Newbie issues: bit.ly/samza_newbie_issues
• Detailed Samza and YARN talk: bit.ly/samza_and_yarn
• Twitter: @samzastream