Post on 21-May-2015
Spark Streaming with C*
jacek.lewandowski@datastax.com
…applies where you need near-realtime data analysis
Spark vs Spark Streaming
Spark: static dataset (zillions of bytes)
Spark Streaming: stream of data (gigabytes per second)
What can you do with it?
Sources: applications, sensors, web, mobile phones
Use cases:
• intrusion detection
• malfunction detection
• site analytics
• network metrics analysis
• fraud detection
• dynamic process optimisation
• recommendations
• location based ads
• log processing
• supply chain planning
• sentiment analysis
• spying
Almost Whatever Source You Want
Almost Whatever Destination You Want
so, let’s see how it works
DStream
DStream: a continuous sequence of micro-batches (μBatches); each μBatch is an ordinary RDD
Processing of a DStream = processing of its μBatches (RDDs)
Receiver
• Interface between different stream sources and Spark
• Block Manager (inside the Spark memory boundary): replication and building μBatches
• Incoming data is stored as blocks of input data
• A μBatch is made of blocks; the blocks become its partitions
Ingestion from multiple sources
Each source: receiving, μBatch building
[Figure: timeline (0s, 1s, 2s) divided into μBatches]
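The path from received messages to blocks, and from blocks to the partitions of a μBatch, can be sketched in plain Scala. This is a simplified model, not Spark code: `Message`, `microBatches` and both interval parameters are hypothetical names introduced here, and the real receiver machinery does considerably more (replication, storage in the Block Manager).

```scala
object BlockModel {
  // Hypothetical message type: arrival time in milliseconds plus a payload.
  final case class Message(timestampMs: Long, payload: String)

  /** Cuts messages into blocks (one per block interval) and groups the
    * blocks into μBatches (one per batch interval).
    * Returns (batch index, blocks of that batch), ordered by time.
    * In this model every block plays the role of one partition.
    */
  def microBatches(
      messages: Seq[Message],
      blockIntervalMs: Long,
      batchIntervalMs: Long
  ): Seq[(Long, Seq[Seq[Message]])] =
    messages
      .groupBy(_.timestampMs / batchIntervalMs) // which μBatch?
      .toSeq
      .sortBy(_._1)
      .map { case (batchIndex, batchMessages) =>
        val blocks = batchMessages
          .groupBy(_.timestampMs / blockIntervalMs) // which block?
          .toSeq
          .sortBy(_._1)
          .map(_._2)
        batchIndex -> blocks
      }
}
```

With a 200 ms block interval and a 1 s batch interval (both assumed values), messages arriving at 0, 100, 250, 900 and 1100 ms end up in two μBatches: the first has three blocks (sizes 2, 1, 1), the second has one block.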
A well-worn example
• ingestion of text messages
• splitting them into separate words
• counting the occurrences of words within 5-second windows
• saving the word counts from the last 5 seconds to Cassandra every 5 seconds, and displaying the first few results on the console
How to do that?
well…
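Yes, it is that easy
import org.apache.spark.SparkContext._            // pair RDD functions (needed before Spark 1.3)
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream
import com.datastax.spark.connector.streaming._   // adds saveToCassandra to DStream

case class WordCount(time: Long, word: String, count: Int)

// `stream` is assumed to be a DStream of (key, paragraph) pairs created elsewhere
val paragraphs: DStream[String] = stream.map { case (_, paragraph) => paragraph }
val words: DStream[String] = paragraphs.flatMap(_.split("""\s+"""))
val wordCounts: DStream[(String, Long)] = words.countByValue()

// Sort the counts of every μBatch in descending order
val topWordCounts: DStream[WordCount] = wordCounts.transform { (rdd, time) =>
  val mappedWordCounts: RDD[(Int, WordCount)] = rdd.map { case (word, count) =>
    (count.toInt, WordCount(time.milliseconds, word, count.toInt))
  }
  val topWordCountsRDD: RDD[WordCount] =
    mappedWordCounts.sortByKey(ascending = false).values
  topWordCountsRDD
}

topWordCounts.saveToCassandra("meetup", "word_counts")
topWordCounts.print()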
DStream stateless operators (quick recap)
• map
• flatMap
• filter
• repartition
• union
• count
• countByValue
• reduce
• reduceByKey
• joins
• cogroup
• transform
• transformWith
DStream[Bean].count()
[Figure: count applied to each 1s μBatch separately, producing counts such as 4 and 3]
DStream[Orange].union(DStream[Apple])
[Figure: union applied to the corresponding 1s μBatches of both streams]
Other stateless operations
• join(DStream[(K, W)])
• leftOuterJoin(DStream[(K, W)])
• rightOuterJoin(DStream[(K, W)])
• cogroup(DStream[(K, W)])
These are applied to pairs of corresponding μBatches
transform, transformWith
• DStream[T].transform(RDD[T] => RDD[U]): DStream[U]
• DStream[T].transformWith(DStream[U], (RDD[T], RDD[U]) => RDD[V]): DStream[V]
These allow you to create new stateless operators
DStream[Blue].transformWith(DStream[Red], …): DStream[Violet]
[Figure: μBatches 1-A, 2-A, 3-A are combined pairwise with μBatches 1-B, 2-B, 3-B, giving 1-A x 1-B, 2-A x 2-B, 3-A x 3-B]
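The pairing of corresponding μBatches can be modelled in a few lines of plain Scala. This is only a sketch of the semantics, not the Spark API: a stream is represented as a `Seq` of batches, each batch as a `Seq` of elements, and the names `TransformModel`, `transform` and `transformWith` are hypothetical stand-ins for the DStream operators.

```scala
object TransformModel {
  /** Model of transform: the function is applied to every μBatch on its own. */
  def transform[T, U](stream: Seq[Seq[T]])(f: Seq[T] => Seq[U]): Seq[Seq[U]] =
    stream.map(f)

  /** Model of transformWith: the function is applied to each pair of
    * corresponding μBatches (first with first, second with second, ...).
    */
  def transformWith[T, U, V](left: Seq[Seq[T]], right: Seq[Seq[U]])(
      f: (Seq[T], Seq[U]) => Seq[V]
  ): Seq[Seq[V]] =
    left.zip(right).map { case (l, r) => f(l, r) }

  /** Example combining function: all combinations of elements of two batches. */
  def cross(l: Seq[String], r: Seq[String]): Seq[String] =
    for (a <- l; b <- r) yield s"$a x $b"
}
```

For example, `TransformModel.transformWith(Seq(Seq("1-A"), Seq("2-A")), Seq(Seq("1-B"), Seq("2-B")))(TransformModel.cross)` yields `Seq(Seq("1-A x 1-B"), Seq("2-A x 2-B"))`; a batch is never combined with a non-corresponding batch of the other stream.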
Windowing
By default: window = slide = μBatch duration
[Figure: timeline from 0s to 7s with the window and the slide marked]
Windowing
The resulting DStream consists of 3-second μBatches!
Each resulting μBatch overlaps the preceding one by 1 second
[Figure: timeline from 0s to 7s with the window and the slide marked]
Windowing
A μBatch appears in the output stream every 1s!
It contains messages collected during 3s
[Figure: input messages 1 2 3 4 5 6 7 8; consecutive windows contain 1 2 3 4 5 6 and 3 4 5 6 7 8; the slide is 1s]
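The figure can be reproduced with a plain-Scala model of windowing. It is a sketch under stated assumptions, not Spark code: every inner `Seq` stands for one 1-second μBatch, window and slide are given as numbers of μBatches, and only complete windows are emitted (the name `windowed` is hypothetical).

```scala
object WindowModel {
  /** Combines `window` consecutive μBatches into one output μBatch and
    * advances by `slide` μBatches between outputs.
    * Simplification: incomplete windows at the edges are dropped.
    */
  def windowed[A](batches: Seq[Seq[A]], window: Int, slide: Int): Seq[Seq[A]] =
    batches
      .sliding(window, slide)
      .filter(_.size == window) // keep complete windows only
      .map(_.flatten)           // merge the μBatches inside the window
      .toList
}
```

With input μBatches `Seq(1, 2)`, `Seq(3, 4)`, `Seq(5, 6)`, `Seq(7, 8)`, a window of 3 and a slide of 1, the model returns `List(1, 2, 3, 4, 5, 6)` followed by `List(3, 4, 5, 6, 7, 8)`, matching the two windows in the figure.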
DStream window operators
• groupByKeyAndWindow(Duration, Duration)
• reduceByKeyAndWindow((V, V) => V, Duration, Duration)
• window(Duration, Duration)
• countByWindow(Duration, Duration)
• reduceByWindow((T, T) => T, Duration, Duration)
• countByValueAndWindow(Duration, Duration)
Let’s modify the example
• ingestion of text messages
• splitting them into separate words
• counting the occurrences of words within 10-second windows
• saving the word counts from the last 10 seconds to Cassandra every 2 seconds, and displaying the first few results on the console
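Yes, it is still easy to do
import org.apache.spark.SparkContext._            // pair RDD functions (needed before Spark 1.3)
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.dstream.DStream
import com.datastax.spark.connector.streaming._   // adds saveToCassandra to DStream

case class WordCount(time: Long, word: String, count: Int)

// `stream` is assumed to be a DStream of (key, paragraph) pairs created elsewhere
val paragraphs: DStream[String] = stream.map { case (_, paragraph) => paragraph }
val words: DStream[String] = paragraphs.flatMap(_.split("""\s+"""))

// Window of 10 seconds, evaluated every 2 seconds
// (requires checkpointing to be enabled on the StreamingContext)
val wordCounts: DStream[(String, Long)] =
  words.countByValueAndWindow(Seconds(10), Seconds(2))

val topWordCounts: DStream[WordCount] = wordCounts.transform { (rdd, time) =>
  val mappedWordCounts: RDD[(Int, WordCount)] = rdd.map { case (word, count) =>
    (count.toInt, WordCount(time.milliseconds, word, count.toInt))
  }
  val topWordCountsRDD: RDD[WordCount] =
    mappedWordCounts.sortByKey(ascending = false).values
  topWordCountsRDD
}

topWordCounts.saveToCassandra("meetup", "word_counts")
topWordCounts.print()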
DStream stateful operator
• DStream[(K, V)].updateStateByKey(f: (Seq[V], Option[S]) => Option[S]): DStream[(K, S)]
[Figure: new values 1 (A), 2 (B), 3 (A), 4 (C), 5 (A), 6 (B); previous state 7 (A), 8 (B), 9 (C)]
• R1 = f(Seq(1, 3, 5), Some(7))
• R2 = f(Seq(2, 6), Some(8))
• R3 = f(Seq(4), Some(9))
[Figure: resulting state R1 (A), R2 (B), R3 (C)]
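One step of this stateful update can be modelled in plain Scala. This is a sketch of the semantics only, not the Spark implementation: the state is a `Map`, a μBatch is a `Seq` of key-value pairs, and the object and method names are hypothetical. As in Spark, the update function is also called for keys that have state but no new values, and returning `None` drops the key.

```scala
object StateModel {
  /** Applies `f` once per key, passing the new values of the current
    * μBatch and the previous state of that key (if any).
    */
  def updateStateByKey[K, V, S](batch: Seq[(K, V)], state: Map[K, S])(
      f: (Seq[V], Option[S]) => Option[S]
  ): Map[K, S] = {
    // New values of this μBatch, grouped by key, in arrival order
    val newValues: Map[K, Seq[V]] =
      batch.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
    val keys = (state.keySet ++ newValues.keySet).toList
    keys.flatMap { k =>
      f(newValues.getOrElse(k, Seq.empty), state.get(k)).map(s => k -> s).toList
    }.toMap
  }

  /** Running total: previous total (or 0) plus the sum of the new values. */
  def update(counts: Seq[Long], state: Option[Long]): Option[Long] =
    Some(state.getOrElse(0L) + counts.sum)
}
```

Using the values from the figure with `update` as `f`, key A gets 1 + 3 + 5 + 7 = 16, key B gets 2 + 6 + 8 = 16, and key C gets 4 + 9 = 13.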
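Total word count example
import org.apache.spark.SparkContext._            // pair RDD functions (needed before Spark 1.3)
import org.apache.spark.streaming.StreamingContext._  // pair DStream functions (needed before Spark 1.3)
import org.apache.spark.streaming.dstream.DStream
import com.datastax.spark.connector.streaming._   // adds saveToCassandra to DStream

case class WordCount(time: Long, word: String, count: Int)

// New total = previous total (0 if the word is new) + counts from the current μBatch
def update(counts: Seq[Long], state: Option[Long]): Option[Long] = {
  val sum = counts.sum
  Some(state.getOrElse(0L) + sum)
}

// `stream` is assumed to be a DStream of (key, paragraph) pairs created elsewhere;
// updateStateByKey requires checkpointing to be enabled on the StreamingContext
val totalWords: DStream[(String, Long)] =
  stream.map { case (_, paragraph) => paragraph }
    .flatMap(_.split("""\s+"""))
    .countByValue()
    .updateStateByKey(update)

val topTotalWordCounts: DStream[WordCount] = totalWords.transform { (rdd, time) =>
  rdd.map { case (word, count) =>
    (count, WordCount(time.milliseconds, word, count.toInt))
  }.sortByKey(ascending = false).values
}

topTotalWordCounts.saveToCassandra("meetup", "word_counts_total")
topTotalWordCounts.print()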
Obtaining DStreams
• ZeroMQ
• Kinesis
• HDFS compatible file system
• Akka actor
• Twitter
• MQTT
• Kafka
• Socket
• Flume
• …
Particular DStreams are available in separate modules
GroupId            ArtifactId                         Latest Version
org.apache.spark   spark-streaming-kinesis-asl_2.10   1.1.0
org.apache.spark   spark-streaming-mqtt_2.10          1.1.0
org.apache.spark   spark-streaming-zeromq_2.10        1.1.0
org.apache.spark   spark-streaming-flume_2.10         1.1.0
org.apache.spark   spark-streaming-flume-sink_2.10    1.1.0
org.apache.spark   spark-streaming-kafka_2.10         1.1.0
org.apache.spark   spark-streaming-twitter_2.10       1.1.0
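As an illustration of how one of these modules might be pulled into a project, here is a minimal sbt fragment. It is a sketch: the choice of the Kafka module and the exact Scala patch version are assumptions, while the group id, artifact names and version 1.1.0 come from the table.

```scala
// build.sbt (sketch)
scalaVersion := "2.10.4" // assumption: any 2.10.x release matches the _2.10 artifacts

libraryDependencies ++= Seq(
  // core streaming API
  "org.apache.spark" %% "spark-streaming"       % "1.1.0",
  // one source-specific module; swap in mqtt, zeromq, flume, twitter, ... as needed
  "org.apache.spark" %% "spark-streaming-kafka" % "1.1.0"
)
```

The `%%` operator appends the Scala binary version, so `spark-streaming-kafka` resolves to `spark-streaming-kafka_2.10`.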
If something goes wrong…
Fault tolerance
The sequence of transformations is known to Spark Streaming
μBatches are replicated once they are received
Lost data can be recomputed
But there are pitfalls
• Spark replicates blocks, not single messages
• It is up to a particular receiver to decide whether to form the block from a single message or to collect more messages before pushing the block
• The data collected in the receiver before the block is pushed will be lost in case of failure of the receiver
• Typical tradeoff - efficiency vs fault tolerance
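The tradeoff can be illustrated with a small plain-Scala model of a receiver that buffers messages before pushing a block. This is a sketch, not the Spark `Receiver` API: `BufferingReceiver`, `blockSize`, `receive` and `fail` are hypothetical names, and "pushed" here simply means handed over for replication.

```scala
import scala.collection.mutable.ArrayBuffer

object ReceiverModel {
  /** Collects messages and pushes them as one block once `blockSize`
    * messages have been gathered. With blockSize = 1 every message is
    * pushed immediately (safer, less efficient).
    */
  final class BufferingReceiver(blockSize: Int) {
    private val buffer = ArrayBuffer.empty[String]
    private val pushed = ArrayBuffer.empty[List[String]]

    def receive(message: String): Unit = {
      buffer += message
      if (buffer.size >= blockSize) {
        pushed += buffer.toList // the block is now replicated by Spark
        buffer.clear()
      }
    }

    /** Simulates a receiver failure: returns the messages that were
      * collected but never pushed, and are therefore lost.
      */
    def fail(): List[String] = {
      val lost = buffer.toList
      buffer.clear()
      lost
    }

    def pushedBlocks: List[List[String]] = pushed.toList
  }
}
```

After five messages with `blockSize = 3`, one block of three messages has been pushed and a failure loses the remaining two; with `blockSize = 1` nothing is lost, at the cost of one block per message.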
Built-in receivers breakdown
Pushing single messages   Can do both   Pushing whole blocks
Kafka                     Akka          RawNetworkReceiver
Twitter                   Custom        ZeroMQ
Socket
MQTT
Thank you!
Questions?
http://spark.apache.org/
https://github.com/datastax/spark-cassandra-connector
http://cassandra.apache.org/
http://www.datastax.com/