Processing Fast Data with Apache Spark: The Tale of Two Streaming APIs
Gerard Maas | Senior SW Engineer, Lightbend, Inc.
@maasg
https://github.com/maasg
https://www.linkedin.com/in/gerardmaas/
https://stackoverflow.com/users/764040/maasg
Agenda
● What is Spark and Why Should We Care?
● Streaming APIs in Spark
  ○ Structured Streaming
  ○ Interactive Session 1
  ○ Spark Streaming
  ○ Interactive Session 2
● Spark Streaming vs. Structured Streaming
Streaming | Big Data
100 TB | 5 MB/s
∑ Stream = Dataset
𝚫 Dataset = Stream
- Tyler Akidau, Google
(Accumulating all the events of a stream yields a dataset; the sequence of changes to a dataset is a stream.)
Once upon a time...
[Diagram: the Apache Spark stack. Apache Spark Core at the base; Spark SQL with Datasets/DataFrames above it; Spark MLlib, Spark Streaming, Structured Streaming, and GraphFrames as libraries on top; Data Sources underneath.]
Structured Streaming

[Diagram: sources (Kafka, Sockets, HDFS/S3, Custom) are read into a streaming DataFrame; a Query processes it and writes to sinks (Kafka, Files, foreachSink, console, memory) under a chosen OutputMode.]
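To make the diagram concrete, here is a minimal, self-contained sketch of the flow: a source is read into a streaming DataFrame, a query transforms it, and an OutputMode governs how results reach the sink (the socket host/port and app name are assumptions for local testing):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("streaming-demo").getOrCreate()
// Source: a socket yields a streaming DataFrame with a single `value` column.
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()
// Query: count occurrences of each distinct line.
val counts = lines.groupBy("value").count()
// Sink + OutputMode: write the full updated result table to the console.
val query = counts.writeStream
  .format("console")
  .outputMode("complete")  // OutputMode: append | update | complete
  .start()
query.awaitTermination()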
Interactive Session 1: Structured Streaming
[Demo setup: a SensorData Multiplexer running as a local process feeds a Structured Streaming job in a Spark Notebook that performs Sensor Anomaly Detection.]
Live
Interactive Session 1: Structured Streaming - Quick Recap
Sources

val rawData = sparkSession.readStream
  .format("kafka")  // csv, json, parquet, socket, rate
  .option("kafka.bootstrap.servers", kafkaBootstrapServer)
  .option("subscribe", sourceTopic)
  .option("startingOffsets", "latest")
  .load()
Operations

val rawValues  = rawData.selectExpr("CAST(value AS STRING)").as[String]
val jsonValues = rawValues.select(from_json($"value", schema) as "record")
val sensorData = jsonValues.select("record.*").as[SensorData]
…
Event Time

val movingAverage = sensorData
  .withColumn("timestamp", toSeconds($"ts").cast(TimestampType))
  .withWatermark("timestamp", "30 seconds")
  .groupBy($"id", window($"timestamp", "30 seconds", "10 seconds"))
  .agg(avg($"temp"))
...
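The toSeconds helper is not shown in the talk; a minimal sketch, assuming the demo's ts field carries an epoch timestamp in milliseconds:

import org.apache.spark.sql.functions.udf

// Assumption: `ts` is milliseconds since the epoch; the cast to
// TimestampType interprets a numeric value as seconds, hence the division.
val toSeconds = udf((millis: Long) => millis / 1000)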
Sinks

val visualizationQuery = sensorData.writeStream
  .queryName("visualization")  // this will be the SQL table name
  .format("memory")
  .outputMode("update")
  .start()

...

val kafkaWriterQuery = kafkaFormat.writeStream
  .queryName("kafkaWriter")
  .format("kafka")
  .outputMode("append")
  .option("kafka.bootstrap.servers", kafkaBootstrapServer)
  .option("topic", targetTopic)
  .option("checkpointLocation", "/tmp/spark/checkpoint")
  .start()
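A usage note on the memory sink above: the queryName becomes an in-memory SQL table, which is how the notebook polls results for visualization (a sketch, assuming the same sparkSession):

// The "visualization" table is kept up to date by the running query.
val snapshot = sparkSession.sql("SELECT * FROM visualization")
snapshot.show()
// Streaming queries run until stopped explicitly:
// visualizationQuery.stop()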
Use Cases
● Streaming ETL
● Stream aggregations, windows
● Event-time oriented analytics
● Arbitrary stateful stream processing (see the sketch below)
● Join streams with other streams and with fixed datasets
● Apply machine learning models
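For the stateful item above, a minimal sketch with mapGroupsWithState, assuming the demo's SensorData case class with fields id and temp: it tracks a running maximum temperature per sensor.

import org.apache.spark.sql.streaming.GroupState

case class MaxTemp(id: String, maxTemp: Double)

val maxPerSensor = sensorData
  .groupByKey(_.id)
  .mapGroupsWithState[Double, MaxTemp] {
    (id: String, readings: Iterator[SensorData], state: GroupState[Double]) =>
      val prev   = state.getOption.getOrElse(Double.MinValue)
      val newMax = (readings.map(_.temp) ++ Iterator(prev)).max
      state.update(newMax)  // carried across micro-batches, per key
      MaxTemp(id, newMax)
  }
// Write with outputMode("update"); mapGroupsWithState requires it.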
Spark Streaming

[Diagram: sources (Kafka, Flume, Kinesis, Sockets, HDFS/S3, Custom) feed Spark Streaming, which runs on Apache Spark (Spark SQL, Spark ML, ...) and writes out to databases, HDFS, API servers, and downstream streams.]
[Diagram: a DStream[T] is a sequence of RDD[T]s, one per batch interval (t0, t1, t2, ..., ti, ti+1). A transformation T -> U turns each RDD[T] into an RDD[U], yielding a new DStream[U]; actions then run on each resulting RDD.]
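A minimal sketch of this model, the classic word count: each batch interval yields one RDD, the transformations below apply to every RDD, and print() runs as the action on each batch (the socket source and port are assumptions; streamingContext is created as in the recap below):

val lines  = streamingContext.socketTextStream("localhost", 9999)
val words  = lines.flatMap(_.split(" "))               // DStream[String] -> DStream[String]
val counts = words.map(w => (w, 1)).reduceByKey(_ + _) // per-batch aggregation
counts.print()                                         // action, runs on each batch's RDD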
API: Transformations
map, flatMap, filter
count, reduce, countByValue, reduceByKey
union, join, cogroup
API: Transformations
mapWithState …
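A minimal sketch of mapWithState, assuming a hypothetical pairStream: DStream[(String, Double)] of (sensor id, temperature) pairs: it keeps a running count of readings per sensor across batches.

import org.apache.spark.streaming.{State, StateSpec}

// State[Long] holds the per-key count; the function returns the mapped record.
val countReadings = (id: String, temp: Option[Double], state: State[Long]) => {
  val n = state.getOption.getOrElse(0L) + 1
  state.update(n)
  (id, n)
}
val countsPerSensor = pairStream.mapWithState(StateSpec.function(countReadings))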
API: Transformations
transform

val iotDstream = MQTTUtils.createStream(...)
val devicePriority = sparkContext.cassandraTable(...)
val prioritizedDStream = iotDstream.transform { rdd =>
  rdd.map(d => (d.id, d)).join(devicePriority)
}
Actions

print:
-------------------------------------------
Time: 1459875469000 ms
-------------------------------------------
data1
data2

saveAsTextFiles, saveAsObjectFiles, saveAsHadoopFiles

foreachRDD * (opens each batch to Spark SQL, DataFrames, GraphFrames, any API)
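As the slide notes, foreachRDD opens each micro-batch to the full Spark API; a sketch, assuming a dstream of (id, temp) pairs and a sparkSession in scope:

dstream.foreachRDD { rdd =>
  import sparkSession.implicits._
  // Each batch's RDD becomes a DataFrame, queryable with Spark SQL.
  val df = rdd.toDF("id", "temp")
  df.createOrReplaceTempView("batch")
  sparkSession.sql("SELECT id, avg(temp) AS avgTemp FROM batch GROUP BY id").show()
}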
Interactive Session 2: Spark Streaming
[Demo setup: the same SensorData Multiplexer (local process) now feeds a Spark Streaming job in a Spark Notebook for Sensor Anomaly Detection.]
Live
Interactive Session 2: Spark Streaming - Quick Recap
Streaming Context

import org.apache.spark.streaming.{Seconds, StreamingContext}
val streamingContext = new StreamingContext(sparkContext, Seconds(10))
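One step the recap leaves implicit: nothing executes until the context is started, so a program typically ends with:

streamingContext.start()            // begin consuming and scheduling batches
streamingContext.awaitTermination()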
Source

val kafkaParams = Map[String, String](
  "metadata.broker.list" -> kafkaBootstrapServer,
  "group.id" -> "sensor-tracker-group",
  "auto.offset.reset" -> "largest",
  "enable.auto.commit" -> (false: java.lang.Boolean).toString
)
val topics = Set(topic)

@transient val stream = KafkaUtils.createDirectStream[
    String, String, StringDecoder, StringDecoder](
  streamingContext, kafkaParams, topics)
Transformations

import spark.implicits._
val sensorDataStream = stream.transform { rdd =>
  val jsonData = rdd.map { case (k, v) => v }
  val ds = sparkSession.createDataset(jsonData)
  val jsonDF = spark.read.json(ds)
  val sensorDataDS = jsonDF.as[SensorData]
  sensorDataDS.rdd
}
DIY Custom Model

val model = new M2Model()
…
model.trainOn(inputData)
…
val scoredDStream = model.predictOnValues(inputData)
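The trainOn/predictOnValues shape of the DIY model mirrors Spark's built-in streaming ML; a minimal sketch with StreamingLinearRegressionWithSGD (numFeatures and the two stream names are assumptions):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD

val numFeatures = 3  // assumed feature vector size
val lr = new StreamingLinearRegressionWithSGD()
  .setInitialWeights(Vectors.zeros(numFeatures))
lr.trainOn(trainingStream)                  // DStream[LabeledPoint], assumed
val scored = lr.predictOnValues(testStream) // DStream[(K, Vector)], assumed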
Output

suspects.foreachRDD { rdd =>
  val sample = rdd.take(20).map(_.toString)
  val total  = s"total found: ${rdd.count}"
  outputBox(total +: sample)
}
Use Cases
● Complex computing/state management (local + cluster)
● Streaming machine learning
  ○ Learn
  ○ Score
● Join streams with updatable datasets
● RDD-based streaming computations
● [-] Event-time oriented analytics
● [-] Optimizations: query & data
● [-] Continuous processing
[Demo setup: the SensorData Multiplexer (local process) feeds a pipeline combining Spark Streaming with Structured Streaming for real-time Sensor Anomaly Detection.]
Spark Streaming + Structured Streaming
Structured Streaming:

val parse: Dataset[String] => Dataset[Record] = ???
val process: Dataset[Record] => Dataset[Result] = ???
val serialize: Dataset[Result] => Dataset[String] = ???

val kafkaStream = spark.readStream…
val f = parse andThen process andThen serialize
val result = f(kafkaStream)
result.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", bootstrapServers)
  .option("topic", writeTopic)
  .option("checkpointLocation", checkpointLocation)
  .start()

Spark Streaming:

val dstream = KafkaUtils.createDirectStream(...)
dstream.foreachRDD { rdd =>
  val ds = sparkSession.createDataset(rdd)
  val f = parse andThen process andThen serialize
  val result = f(ds)
  result.write.format("kafka")
    .option("kafka.bootstrap.servers", bootstrapServers)
    .option("topic", writeTopic)
    .option("checkpointLocation", checkpointLocation)
    .save()
}

The same parse/process/serialize functions over Datasets can be reused in both APIs.
Streaming Pipelines

[Diagram: a Structured Streaming pipeline: Keyword Extraction -> Keyword Relevance -> Similarity -> DB Storage.]
             | Structured Streaming                        | Spark Streaming
Time         | Abstract (Processing Time, Event Time)      | Fixed to micro-batch (streaming interval)
Execution    | Fixed micro-batch, best-effort micro-batch, | Fixed micro-batch,
             | continuous (NRT)                            | access to the scheduler
Abstraction  | DataFrames/Datasets                         | DStream, RDD
New Project?

[Chart: Structured Streaming 80% | Spark Streaming 20%]

lightbend.com/fast-data-platform
Gerard Maas | Señor SW Engineer
@maasg
https://github.com/maasg
https://www.linkedin.com/in/gerardmaas/
https://stackoverflow.com/users/764040/maasg
Thank You!