Introduction to Structured Streaming
Manish Mishra
Software Consultant, Knoldus Software LLP
Agenda
What is Structured Streaming?
How is it different from the previous streaming engine?
Structured Streaming programming model
Basic operations: selection, projection, and aggregation
Window operations on event time
Example demo
What is Structured Streaming?
It is a scalable, fault-tolerant stream processing engine built on the Spark SQL engine.
It was introduced as part of the Spark 2.0 release.
It offers a unified API for streams, which can combine stream computation and batch processing.
Computations can be expressed as SQL-like queries or Dataset/DataFrame operations applied directly to streaming DataFrames.
What is new in Structured Streaming?
The entry point of a streaming app is SparkSession, instead of the earlier StreamingContext.
Unlike DStreams, a stream is represented as an unbounded DataFrame.
It is interoperable with DStreams.
It can harness the power of the Catalyst optimizer to improve query performance without changing query semantics.
A unified API makes the developer's task easier: no one has to reason about how a streaming computation differs from a normal batch (map-reduce) computation.
Structured Streaming Programming Model
It treats a live data stream as an unbounded table.
Any streaming computation can be expressed as a batch-like query on a static table.
Spark runs this computation internally as an incremental query.
The result of the computation depends on the output mode specified in the streaming query.
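The incremental model above can be sketched in plain Scala (a conceptual simulation only, not the Spark API): each trigger folds only the newly arrived rows into a running result table, instead of recomputing over the entire unbounded input.

```scala
// Conceptual simulation of Structured Streaming's incremental model
// (plain Scala, not the Spark API).

// The running result table: word -> count, updated incrementally per trigger.
var resultTable = Map.empty[String, Long]

// One "trigger": fold only the newly arrived rows into the result,
// rather than recomputing over the whole (unbounded) input table.
def onTrigger(newRows: Seq[String]): Map[String, Long] = {
  resultTable = newRows.foldLeft(resultTable) { (acc, word) =>
    acc.updated(word, acc.getOrElse(word, 0L) + 1L)
  }
  resultTable
}

// Two micro-batches arriving on the "stream":
onTrigger(Seq("cat", "dog", "cat"))  // result: cat -> 2, dog -> 1
val after = onTrigger(Seq("dog"))    // result: cat -> 2, dog -> 2
```

Each trigger touches only the new rows, which is what makes the query incremental rather than a full recomputation.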
Structured Streaming Programming Model
Image Source: Apache Spark Documentation
Structured Streaming Programming Model: Output Modes
There are three output modes that decide what result output goes into the sink:
Complete Mode
Append Mode
Update Mode
Complete Mode: The entire updated Result Table will be written to the sink. It is up to the storage connector to decide how to handle writing of the entire table. It can be specified with outputMode("complete") when starting the streaming query.
Append Mode (default): Only the new rows appended to the Result Table since the last trigger will be written to the external storage.
This is applicable only to queries where existing rows in the Result Table are not expected to change.
Update Mode: Only the rows that were updated in the Result Table since the last trigger will be written to the external storage. Unlike Complete Mode, it does not output rows that have not changed.
Note: This mode is not yet implemented as of Spark 2.0.
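The difference between Complete and Update mode can be illustrated with a small plain-Scala sketch (conceptual only, not the Spark API): given the result table before and after a trigger, Complete mode would write the whole table to the sink, while Update mode would write only the changed or newly added rows.

```scala
// Conceptual contrast of Complete vs Update output modes
// (plain Scala sketch, not the Spark API).
// `prev` and `curr` are the result table before and after a trigger.
val prev = Map("cat" -> 2L, "dog" -> 1L)
val curr = Map("cat" -> 2L, "dog" -> 2L, "owl" -> 1L)

// Complete mode: the entire updated result table goes to the sink.
val completeOutput = curr

// Update mode: only rows that changed (or are new) since the last trigger.
val updateOutput = curr.filter { case (k, v) => prev.get(k) != Some(v) }
// "cat" is unchanged, so it is excluded; "dog" changed and "owl" is new.
```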
Example: Running Word Count
// Create DataFrame representing the stream of input lines from connection to localhost:9000
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9000)
  .load()

// Split the lines into words
val words = lines.as[String].flatMap(_.split(" "))

// Generate running word count
val wordCounts = words.groupBy("value").count()

// Start running the query that prints the running counts to the console
val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()
Basic Operations - Selection, Projection, Aggregation
// Note: `type` is a reserved word in Scala, so it must be written in backticks.
case class DeviceData(device: String, `type`: String, signal: Double, time: DateTime)

val df: DataFrame = ... // streaming DataFrame with IoT device data, with schema { device: string, type: string, signal: double, time: string }
val ds: Dataset[DeviceData] = df.as[DeviceData] // streaming Dataset with IoT device data

// Select the devices which have signal more than 10
df.select("device").where("signal > 10")  // using untyped APIs
ds.filter(_.signal > 10).map(_.device)    // using typed APIs
// Running count of the number of updates for each device type
df.groupBy("type").count()  // using untyped API

// Running average signal for each device type
import org.apache.spark.sql.expressions.scalalang.typed
ds.groupByKey(_.`type`).agg(typed.avg(_.signal))  // using typed API
Window Operations on Event Time
import spark.implicits._

val words = ... // streaming DataFrame of schema { timestamp: Timestamp, word: String }

// Group the data by window and word and compute the count of each group
val windowedCounts = words.groupBy(
  window($"timestamp", "10 minutes", "5 minutes"),
  $"word"
).count()
Image Source: Apache Spark Documentation
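The sliding-window grouping above can be reasoned about with simple arithmetic (a plain-Scala sketch of the bucketing rule, not Spark internals): with a 10-minute window sliding every 5 minutes, each event generally falls into two overlapping windows.

```scala
// Which (start, end) windows does an event at time t fall into, for a
// window of `windowSize` sliding every `slide` (all in minutes)?
// Plain-Scala sketch of the bucketing rule, not Spark internals;
// windows are clipped at t = 0 for simplicity.
def windowsFor(t: Int, windowSize: Int, slide: Int): Seq[(Int, Int)] = {
  // Latest window start (aligned to the slide) that still covers t.
  val lastStart  = (t / slide) * slide
  // Earliest aligned start whose window still contains t.
  val firstStart = lastStart - windowSize + slide
  (firstStart to lastStart by slide)
    .filter(start => start >= 0 && t >= start && t < start + windowSize)
    .map(start => (start, start + windowSize))
}

// An event at minute 7 of the hour (e.g. 12:07) with a 10-minute window
// sliding every 5 minutes lands in [12:00, 12:10) and [12:05, 12:15).
val w = windowsFor(7, 10, 5)  // Seq((0, 10), (5, 15))
```

This is why the grouped count above produces a row per (window, word) pair: one event can contribute to more than one window.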
References
Structured Streaming Programming Guide http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
Thanks!!