Processing Fast Data with Apache Spark: The Tale of Two Streaming APIs Gerard Maas Senior SW Engineer, Lightbend, Inc.

Page 1:

Processing Fast Data with Apache Spark: The Tale of Two Streaming APIs

Gerard Maas, Senior SW Engineer, Lightbend, Inc.

Page 2:

Gerard Maas, Señor SW Engineer

@maasg

https://github.com/maasg

https://www.linkedin.com/in/gerardmaas/

https://stackoverflow.com/users/764040/maasg

Page 3:

Agenda

- What is Spark and Why We Should Care?
- Streaming APIs in Spark
  - Structured Streaming
    - Interactive Session 1
  - Spark Streaming
    - Interactive Session 2
- Spark Streaming X Structured Streaming

Page 4:

Streaming | Big Data

Page 5:

100Tb 5Mb

Page 6:

100Tb 5Mb/s

Page 7:

∑ Stream = Dataset

Δ Dataset = Stream

- Tyler Akidau, Google
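The duality on this slide can be illustrated without Spark. The following is a minimal sketch using plain Scala collections; the object name `StreamTableDuality` and the sample increments are made up for illustration and do not come from the talk. Summing the increments of a stream gives the dataset at each point in time, and taking the difference between successive snapshots recovers the stream.

```scala
object StreamTableDuality {
  // Hypothetical stream of increments (for example, new records per tick)
  val stream: Seq[Int] = Seq(5, 3, 7, 2)

  // "Sum of stream = dataset": running totals are the dataset snapshots over time
  def snapshots(increments: Seq[Int]): Seq[Int] =
    increments.scanLeft(0)(_ + _).tail

  // "Delta of dataset = stream": differences between successive snapshots
  def deltas(snaps: Seq[Int]): Seq[Int] =
    snaps.zip(0 +: snaps).map { case (current, previous) => current - previous }
}
```

Applying `deltas` to the result of `snapshots` returns the original increments, which is the round trip the two formulas describe.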

Page 8:

Once upon a time...

Page 9:

Apache Spark Core

Spark SQL

Spark MLLib

Spark Streaming

Structured Streaming

Datasets/Frames

GraphFrames

Data Sources

Page 10:

Apache Spark Core

Spark SQL

Spark MLLib

Spark Streaming

Structured Streaming

GraphFrames

Data Sources

Datasets/Frames

Page 11:

Structured Streaming

Page 12:

Structured Streaming

Page 13:

Structured Streaming

Sources: Kafka, Sockets, HDFS/S3, Custom

Streaming DataFrame

Page 15:

Structured Streaming

Sources: Kafka, Sockets, HDFS/S3, Custom

Streaming DataFrame

Query

Sinks: Kafka, Files, foreachSink, console, memory

OutputMode

Page 16:

Interactive Session 1: Structured Streaming

Page 17:

SensorData

Multiplexer

Structured Streaming

Spark Notebook

Local Process

Sensor Anomaly Detection

Page 18:

Live

Page 19:

Interactive Session 1: Structured Streaming

QUICK RECAP

Page 20:

Sources

val rawData = sparkSession.readStream
  .format("kafka") // csv, json, parquet, socket, rate
  .option("kafka.bootstrap.servers", kafkaBootstrapServer)
  .option("subscribe", sourceTopic)
  .option("startingOffsets", "latest")
  .load()

Page 21:

Operations...

val rawValues = rawData.selectExpr("CAST(value AS STRING)")
  .as[String]

val jsonValues = rawValues.select(from_json($"value", schema) as "record")

val sensorData = jsonValues.select("record.*").as[SensorData]
…

Page 22:

Event Time...

val movingAverage = sensorData
  .withColumn("timestamp", toSeconds($"ts").cast(TimestampType))
  .withWatermark("timestamp", "30 seconds")
  .groupBy($"id", window($"timestamp", "30 seconds", "10 seconds"))
  .agg(avg($"temp"))
...
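To make the `window($"timestamp", "30 seconds", "10 seconds")` grouping more concrete, here is a small Spark-free sketch of how one event time maps to several overlapping windows. It assumes window starts are aligned to the epoch, which is my understanding of Spark's default; the object and method names are hypothetical, and the sketch does not model the watermark.

```scala
object SlidingWindows {
  // Returns the [start, end) bounds, in seconds, of every sliding window that
  // contains eventTime, assuming window starts are multiples of `slide`.
  def windowsFor(eventTime: Long, length: Long, slide: Long): Seq[(Long, Long)] = {
    // Latest window start that is not after the event
    val lastStart = Math.floorDiv(eventTime, slide) * slide
    Iterator.iterate(lastStart)(_ - slide)
      .takeWhile(start => start + length > eventTime) // window must still cover the event
      .map(start => (start, start + length))
      .toList
      .reverse // earliest window first
  }
}
```

With a 30-second window sliding every 10 seconds, `SlidingWindows.windowsFor(45L, 30L, 10L)` yields the three windows starting at 20, 30 and 40, so each reading contributes to three averages.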

Page 23:

Sinks...

val visualizationQuery = sensorData.writeStream
  .queryName("visualization") // this will be the SQL table name
  .format("memory")
  .outputMode("update")
  .start()

...

val kafkaWriterQuery = kafkaFormat.writeStream
  .queryName("kafkaWriter")
  .format("kafka")
  .outputMode("append")
  .option("kafka.bootstrap.servers", kafkaBootstrapServer)
  .option("topic", targetTopic)
  .option("checkpointLocation", "/tmp/spark/checkpoint")
  .start()

Page 24:

Use Cases

● Streaming ETL
● Stream aggregations, windows
● Event-time oriented analytics
● Arbitrary stateful stream processing
● Join Streams with other streams and with Fixed Datasets
● Apply Machine Learning Models

Page 25:

Structured Streaming

Page 26:

Spark Streaming

Sources: Kafka, Flume, Kinesis, Twitter, Sockets, HDFS/S3, Custom

Apache Spark: Spark SQL, Spark ML, ...

Outputs: Databases, HDFS, API Server, Streams

Page 27:

DStream[T]

RDD[T] RDD[T] RDD[T] RDD[T] RDD[T]

t0 t1 t2 t3 ti ti+1

Page 28:

DStream[T]

RDD[T] RDD[T] RDD[T] RDD[T] RDD[T]

t0 t1 t2 t3 ti ti+1

RDD[U] RDD[U] RDD[U] RDD[U] RDD[U]

Transformation T -> U

Page 29:

DStream[T]

RDD[T] RDD[T] RDD[T] RDD[T] RDD[T]

t0 t1 t2 t3 ti ti+1

RDD[U] RDD[U] RDD[U] RDD[U] RDD[U]

Transformation T -> U

Actions
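The picture above (a DStream as one RDD per interval, transformations applied batch by batch, actions at the end) can be mimicked with ordinary collections. This is only a toy model to show the shape of the API: `MiniDStream`, `Batches` and `foreachBatch` are hypothetical names, and plain `Seq`s stand in for RDDs, so nothing here is distributed or scheduled.

```scala
object MiniDStream {
  // Toy stand-in for DStream[T]: one batch (playing the role of RDD[T]) per interval
  final case class Batches[T](batches: Seq[Seq[T]]) {
    // A transformation T -> U is applied to every batch and yields a new stream
    def map[U](f: T => U): Batches[U] = Batches(batches.map(_.map(f)))

    // An action runs a side effect once per batch, like an output operation
    def foreachBatch(action: Seq[T] => Unit): Unit = batches.foreach(action)
  }
}
```

For example, `MiniDStream.Batches(Seq(Seq(1, 2), Seq(3))).map(_ * 10)` keeps the two intervals separate and transforms each one independently.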

Page 30:

API: Transformations

map, flatMap, filter

count, reduce, countByValue, reduceByKey

union, join, cogroup

Page 31:

API: Transformations

mapWithState… …

Page 32:

API: Transformations

transform

val iotDstream = MQTTUtils.createStream(...)
val devicePriority = sparkContext.cassandraTable(...)
val prioritizedDStream = iotDstream.transform { rdd =>
  rdd.map(d => (d.id, d)).join(devicePriority)
}

Page 33:

Actions

print

-------------------------------------------
Time: 1459875469000 ms
-------------------------------------------
data1
data2

saveAsTextFiles, saveAsObjectFiles, saveAsHadoopFiles

foreachRDD *

Page 34:

Actions

print

saveAsTextFiles, saveAsObjectFiles, saveAsHadoopFiles

foreachRDD *

Spark SQL, DataFrames, GraphFrames, Any API

Page 35:

Interactive Session 2: Spark Streaming

Page 36:

SensorData

Multiplexer

Structured Streaming

Spark Notebook

Local Process

Sensor Anomaly Detection

Page 37:

Live

Page 38:

Interactive Session 2: Spark Streaming

QUICK RECAP

Page 39:

Streaming Context

import org.apache.spark.streaming.{Seconds, StreamingContext}

val streamingContext = new StreamingContext(sparkContext, Seconds(10))

Page 40:

Source

val kafkaParams = Map[String, String](
  "metadata.broker.list" -> kafkaBootstrapServer,
  "group.id" -> "sensor-tracker-group",
  "auto.offset.reset" -> "largest",
  "enable.auto.commit" -> (false: java.lang.Boolean).toString
)

val topics = Set(topic)
@transient val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  streamingContext, kafkaParams, topics)

Page 41:

Transformations

import sparkSession.implicits._

val sensorDataStream = stream.transform { rdd =>
  val jsonData = rdd.map { case (k, v) => v } // keep only the message value
  val ds = sparkSession.createDataset(jsonData)
  val jsonDF = sparkSession.read.json(ds)
  val sensorDataDS = jsonDF.as[SensorData]
  sensorDataDS.rdd
}

Page 42:

DIY Custom Model

val model = new M2Model()

model.trainOn(inputData)

val scoredDStream = model.predictOnValues(inputData)

Page 43:

Output

suspects.foreachRDD { rdd =>
  val sample = rdd.take(20).map(_.toString)
  val total = s"total found: ${rdd.count}"
  outputBox(total +: sample)
}

Page 44:

Use Cases

● Complex computing/state management (local + cluster)
● Streaming Machine Learning
  ○ Learn
  ○ Score
● Join Streams with Updatable Datasets
● RDD-based streaming computations

● [-] Event-time oriented analytics
● [-] Optimizations: Query & Data
● [-] Continuous processing

Page 45:

SensorData

Multiplexer

Structured Streaming

Local Process

Sensor Anomaly Detection (Real Time Detection)

Structured Streaming

Page 46:

Structured Streaming+

Page 47:

Spark Streaming + Structured Streaming

val parse: Dataset[String] => Dataset[Record] = ???
val process: Dataset[Record] => Dataset[Result] = ???
val serialize: Dataset[Result] => Dataset[String] = ???

Structured Streaming

val kafkaStream = spark.readStream…
val f = parse andThen process andThen serialize
val result = f(kafkaStream)
result.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", bootstrapServers)
  .option("topic", writeTopic)
  .option("checkpointLocation", checkpointLocation)
  .start()

Spark Streaming

val dstream = KafkaUtils.createDirectStream(...)
// foreachRDD (an output operation) rather than map: the body writes each batch out
dstream.foreachRDD { rdd =>
  val ds = sparkSession.createDataset(rdd)
  val f = parse andThen process andThen serialize
  val result = f(ds)
  result.write.format("kafka")
    .option("kafka.bootstrap.servers", bootstrapServers)
    .option("topic", writeTopic)
    .option("checkpointLocation", checkpointLocation)
    .save()
}
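The point of this slide is that the business logic is one composed function that both APIs can call. The sketch below shows the same `parse andThen process andThen serialize` composition on plain `Seq`s so it runs without Spark. The `Record` and `Result` fields, the comma-separated input format and the 30-degree threshold are assumptions made up for the example, since the slide leaves all three functions as `???`.

```scala
import scala.util.Try

object SharedPipeline {
  // Hypothetical shapes; the talk does not define Record or Result
  final case class Record(id: String, temp: Double)
  final case class Result(id: String, hot: Boolean)

  // Seq stands in for Dataset; malformed lines are dropped
  val parse: Seq[String] => Seq[Record] = _.flatMap { line =>
    line.split(",") match {
      case Array(id, temp) => Try(temp.trim.toDouble).toOption.map(Record(id.trim, _))
      case _               => None
    }
  }

  // Flags readings above an arbitrary illustrative threshold
  val process: Seq[Record] => Seq[Result] = _.map(r => Result(r.id, r.temp > 30.0))

  val serialize: Seq[Result] => Seq[String] = _.map(r => s"${r.id}:${r.hot}")

  // One composed function that either streaming API could apply to its data
  val pipeline: Seq[String] => Seq[String] = parse andThen process andThen serialize
}
```

Because `pipeline` is an ordinary value, it can be unit tested on small in-memory inputs before it is wired into a streaming job.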

Page 48:

Streaming Pipelines

Structured Streaming

Keyword Extraction

Keyword Relevance Similarity

DB Storage

Page 49:

            | Structured Streaming                                | Spark Streaming
Time        | Abstract (Processing Time, Event Time)              | Fixed to microbatch Streaming Interval
Execution   | Fixed Micro batch, Best Effort MB, Continuous (NRT) | Fixed Micro batch
Abstraction | DataFrames/Dataset                                  | DStream, RDD; Access to the scheduler

Page 50:

Structured Streaming

New Project?

80% 20%

Page 51:

lightbend.com/fast-data-platform

Page 53:

Gerard Maas, Señor SW Engineer

@maasg

https://github.com/maasg

https://www.linkedin.com/in/gerardmaas/

https://stackoverflow.com/users/764040/maasg

Page 54:

Thank You!