Gearpump akka streams
Implementing an akka-streams materializer for big data
The Gearpump Materializer
Kam Kasravi
Technical Presentation
● Familiarity with akka-streams flow and graph DSLs
● Familiarity with big data and real-time streaming platforms
● Familiarity with Scala
● Effort between the akka-streams and Gearpump teams started late last year
● Resulted in a number of pull requests into akka-streams to enable different materializers
● Close to completion with good support of the akka-streams DSL (all GraphStages)
● Fairly seamless to switch between local and distributed
Who am I?
● Committer on Apache Gearpump (incubating)
  - http://gearpump.apache.org
● Architect on Trusted Analytics Platform (TAP)
  - http://trustedanalytics.org
● Lead or Architect across many companies, industries
  - NYSE, eBay, PayPal, Yahoo, ...
What is Apache Gearpump?
● Accepted into Apache incubator last March
● Similar to Apache Beam and Apache Flink (real-time message delivery)
● Heavily leverages the actor model and akka (more so than others)
● Unique features like dynamic DAG
● Excellent runtime visualization tooling of cluster and application DAGs
● One of the best big data performance profiles (both throughput, latency)
Agenda
● Why?
  ○ Why integrate akka-streams into a big data platform?
● Big Data platform evolving features
  ○ Functionality big data platforms are embracing
● Prerequisites needed for any Big Data platform
  ○ Minimal features a big data platform must have
● Big data platform integration challenges
  ○ What concepts do not map well within big data platforms?
● Object models: akka-streams, Gearpump
● Materialization
  ○ ActorMaterializer - materializing the module tree
  ○ GearpumpMaterializer - rewriting the module tree
Why?
● Akka-streams has limitations inherent within a single JVM
  ○ Throughput and latency are key big data features that require scaling beyond a single JVM
● Akka-streams DSL is a superset of other big data platform DSLs
  ○ Has a logical plan (declarative) that can be transformed to an execution plan (runtime)
● Akka-streams programming paradigm is declarative, composable, extensible*, stackable* and reusable*
* Provides a level of extensibility and functionality beyond most big data platform DSLs
Extensible
● Extend GraphStage
● Extend Source, Sink, Flow or BidiFlow
● All derive from Graph
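As an illustration of the first bullet, here is a minimal sketch of a custom stage, assuming the akka-stream library is on the classpath. The class name `Doubler` and its port names are hypothetical; the handler structure follows the usual GraphStage pattern (one InHandler, one OutHandler).

```scala
import akka.stream.{Attributes, FlowShape, Inlet, Outlet}
import akka.stream.stage.{GraphStage, GraphStageLogic, InHandler, OutHandler}

// Hypothetical example stage: doubles every Int that passes through it.
final class Doubler extends GraphStage[FlowShape[Int, Int]] {
  val in: Inlet[Int] = Inlet[Int]("Doubler.in")
  val out: Outlet[Int] = Outlet[Int]("Doubler.out")
  override val shape: FlowShape[Int, Int] = FlowShape.of(in, out)

  override def createLogic(inheritedAttributes: Attributes): GraphStageLogic =
    new GraphStageLogic(shape) {
      // Upstream pushed an element: transform it and push it downstream.
      setHandler(in, new InHandler {
        override def onPush(): Unit = push(out, grab(in) * 2)
      })
      // Downstream signalled demand: propagate the pull upstream.
      setHandler(out, new OutHandler {
        override def onPull(): Unit = pull(in)
      })
    }
}
```

Because the stage is itself a Graph, it can be attached like any built-in Flow, for example `Source(1 to 3).via(new Doubler)`.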
Stackable
● Another term for nestable or recursive. Reference to Kleisli (theoretical).
● Source, Sink, Flow or BidiFlow may contain their own topologies
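A small sketch of what "stackable" can look like in practice, assuming akka-stream from the 2.4/2.5 line (where `ActorMaterializer()` is the usual entry point). The object and value names are hypothetical. An inner Flow is nested inside an outer Flow, which is in turn nested inside a Source.

```scala
import akka.NotUsed
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Flow, Sink, Source}
import scala.concurrent.Await
import scala.concurrent.duration._

object StackableSketch {
  // Inner topology: two stages packaged as one reusable Flow.
  val evensDoubled: Flow[Int, Int, NotUsed] =
    Flow[Int].filter(_ % 2 == 0).map(_ * 2)

  // Outer Flow that contains the inner Flow.
  val labelled: Flow[Int, String, NotUsed] =
    Flow[Int].via(evensDoubled).map(n => s"item-$n")

  // A Source that itself contains the nested Flow.
  val source: Source[String, NotUsed] = Source(1 to 6).via(labelled)

  def run(): Seq[String] = {
    implicit val system: ActorSystem = ActorSystem("stackable-sketch")
    implicit val materializer: ActorMaterializer = ActorMaterializer()
    try Await.result(source.runWith(Sink.seq), 5.seconds)
    finally system.terminate()
  }
}
```

Calling `StackableSketch.run()` yields the elements "item-4", "item-8" and "item-12".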
Reusable
● Graph topologies can be attached anywhere (any Graph)
● Recent akka-streams feature is dynamic attachment via hubs
● Hubs will take advantage of Gearpump dynamic DAG within the GearpumpMaterializer
Big Data platform evolving features (1)
● Big data platforms are moving to consolidate disparate APIs
  ○ Too many APIs: Concord, Flink, Heron, Pulsar, Spark, Storm, Samza
  ○ Common DSL is also an approach being taken by Apache Beam
  ○ Analogy to SQL - common grammar that different platforms execute
Big Data platform evolving features (2)
● Big data platforms will increasingly require dynamic pipelines that are compositional and reusable
● Examples include:
  ○ Machine learning
  ○ IoT sensors
Big Data platform evolving features (3)
● Machine learning use cases
  ○ Replace or update scoring models
  ○ Model Ensembles
    ■ concept drift
    ■ data drift
Big Data platform evolving features (4)
● IoT use cases
  ○ Bring new sensors on line with no interruption
  ○ Change or update configuration parameters at remote sensors
Prerequisites needed for any Big Data platform (1)
1. Push and Pull
  ● Downstream must be able to pull
  ● Upstream must be able to push
2. Backpressure
  ● Downstream must be able to backpressure all the way to source
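The two prerequisites above can be sketched with plain Scala, without any streaming library. This is a toy model, not how akka-streams or Gearpump implement it: `DemandSource` and `BoundedSink` are hypothetical names, and demand is modelled as a simple method call. The point is only that data moves downstream (push) in response to demand moving upstream (pull), so the source never emits more than the sink asked for.

```scala
import scala.collection.mutable

object BackpressureSketch {
  /** Upstream side: hands out elements, but never more than the requested demand. */
  final class DemandSource[A](data: Seq[A]) {
    private val remaining = mutable.Queue(data: _*)
    var emitted = 0

    def request(n: Int): Seq[A] = {
      val batch = (1 to math.min(n, remaining.size)).map(_ => remaining.dequeue())
      emitted += batch.size
      batch
    }

    def isExhausted: Boolean = remaining.isEmpty
  }

  /** Downstream side: requests only as much as its buffer capacity allows. */
  final class BoundedSink[A](capacity: Int) {
    val received = mutable.ArrayBuffer.empty[A]
    var maxBatch = 0

    def drain(source: DemandSource[A]): Unit =
      while (!source.isExhausted) {
        val batch = source.request(capacity) // demand travels upstream (pull)
        maxBatch = math.max(maxBatch, batch.size)
        received ++= batch                   // data travels downstream (push)
      }
  }
}
```

With a sink capacity of 3 and ten source elements, every batch contains at most three elements, and all ten arrive in order.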
Prerequisites needed for any Big Data platform (2)
3. Parallelization
4. Asynchronous
5. Bidirectional
Big data platform integration challenges (1)
A number of GraphStages have completion or cancellation semantics. Big data pipelines are often infinite streams and do not complete. Cancel is often viewed as a failure.
● Balance[T]
● Completion[T]
● Merge[T]
● Split[T]
Big data platform integration challenges (2)
A number of GraphStages have specific upstream and downstream ordering and timing directives.
● Batch[T]
● Concat[T]
● Delay[T]
● DelayInitial[T]
● Interleave[T]
Big data platform integration challenges (3)
The async attribute as well as fusing do not map cleanly when distributing GraphStage functionality across machines.
● Graph.async
● Fusing
Graph.async
● Collapses multiple operations (GraphStageLogic) into one actor
● Distributed scenarios where one may want actors within the same JVM or on the same machine
Fusing
● Creates one or more islands delimited by async boundaries
● For distributed scenario no fusing should occur until the materializer can evaluate and optimize the execution plan
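For reference, this is roughly how an async boundary is declared with the local ActorMaterializer, assuming akka-stream from the 2.4/2.5 line; the object name is hypothetical. The stages before `.async` form one fused island and the stages after it form another. How a distributed materializer should interpret the same attribute is exactly the open question raised above.

```scala
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Sink, Source}
import scala.concurrent.Await
import scala.concurrent.duration._

object AsyncBoundarySketch {
  def run(): Seq[Int] = {
    implicit val system: ActorSystem = ActorSystem("async-sketch")
    implicit val materializer: ActorMaterializer = ActorMaterializer()
    try {
      val result = Source(1 to 3)
        .map(_ + 1) // part of the first island
        .async      // asynchronous boundary between the two islands
        .map(_ * 2) // part of the second island
        .runWith(Sink.seq)
      Await.result(result, 5.seconds)
    } finally system.terminate()
  }
}
```

The boundary changes how the stages are scheduled, not what they compute: the elements produced are 4, 6 and 8 either way.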
Object Models
● Akka-streams' GraphStage, Module, Shape
● Gearpump's Graph, Task, Partitioner
Akka-streams Object Model
↪ Base type is a Graph. Common base type is a GraphStage
↪ Graph contains a
  ↳ Module contains a
    ↳ Shape
↪ Only a RunnableGraph can be materialized
↪ A RunnableGraph needs at least one Source and one Sink
Akka-streams Graph[S, M]
● Graph is parameterized by
  ○ Shape
  ○ Materialized Value
● Graph contains a Module, which contains a Shape
  ○ Module is where the runtime is constructed and manipulated
● Graph's first level subtypes provide basic functionality
  ○ Source
  ○ Sink
  ○ Flow
  ○ BidiFlow
[Figure: Graph, parameterized by S and M, with Source, Sink, Flow and BidiFlow, plus Module and Shape]
GraphStage[S <: Shape]
[Figure: class diagram relating Graph, GraphStageWithMaterializedValue, GraphStage, GraphStageModule and Module]
GraphStage[S <: Shape] subtypes (incomplete)
↳ Balance[T]
↳ Batch[In, Out]
↳ Broadcast[T]
↳ Collect[In, Out]
↳ Concat[T]
↳ DelayInitial[T]
↳ DropWhile[T]
↳ Expand[In, Out]
↳ FlattenMerge[T, M]
↳ Fold[In, Out]
↳ FoldAsync[T]
↳ FutureSource[T]
↳ GroupBy[T, K]
↳ Grouped[T]
↳ GroupedWithin[T]
↳ Interleave[T]
↳ Intersperse[T]
↳ LimitWeighted[T]
↳ Map[In, Out]
↳ MapAsync[In, Out]
↳ Merge[T]
↳ MergePreferred[T]
↳ MergeSorted[T]
↳ OrElse[T]
↳ Partition[T]
↳ PrefixAndTail[T]
↳ Recover[T]
↳ Scan[In, Out]
↳ SimpleLinearGraph[T]
↳ Sliding[T]
What about Module?
● Module is a recursive structure containing a Set[Module]
● Module is a declarative data structure used as the AST
● Module is used to represent a graph of nodes and edges from the original GraphStages
● Module contains downstream and upstream ports (edges)
● Materializers walk the module tree to create and run instances of publishers and subscribers.
● Each publisher and subscriber is an actor (ActorGraphInterpreter)
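The recursive shape of the module tree, and the kind of walk a materializer performs over it, can be sketched in plain Scala. The types below are simplified, hypothetical stand-ins and not Akka's actual Module API; ports and attributes are left out entirely.

```scala
object ModuleTreeSketch {
  // Hypothetical, simplified stand-in for the module AST.
  sealed trait Module { def subModules: Set[Module] }

  // A leaf wrapping a single stage, identified here only by name.
  final case class Atomic(stage: String) extends Module {
    val subModules: Set[Module] = Set.empty
  }

  // An inner node holding further modules (the recursive case).
  final case class Composite(name: String, subModules: Set[Module]) extends Module

  /** Walks the tree depth-first and collects the names of all atomic stages. */
  def atomicStages(module: Module): Set[String] = module match {
    case Atomic(stage)      => Set(stage)
    case Composite(_, subs) => subs.flatMap(atomicStages)
  }
}
```

A real materializer would create a publisher or subscriber at each leaf instead of collecting a name, but the traversal has the same structure.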
Gearpump Object Model
↪ Graph[Node, Edge] holds
  ↳ Tasks (Node)
  ↳ Partitioners (Edge)
↪ This is a Gearpump Graph, not to be confused with akka-streams Graph.
Gearpump Graph[N<:Task, E<:Partitioner]
● Graph is parameterized by
  ○ Node - must be a subtype of Task
  ○ Edge - must be a subtype of Partitioner
[Figure: Graph, parameterized by N and E, with List[Task] and List[Partitioner]]
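A rough plain-Scala sketch of a graph with the same type bounds. Everything here is a hypothetical stand-in: Gearpump's real Task, Partitioner and Graph classes have richer interfaces, and the method names below are chosen only for illustration.

```scala
object GearpumpGraphSketch {
  // Hypothetical stand-ins for Gearpump's Task and Partitioner.
  trait Task { def name: String }
  trait Partitioner { def partition(msg: Any, downstreamCount: Int): Int }

  final case class NamedTask(name: String) extends Task

  // Routes a message to a downstream instance by its hash code.
  object HashPartitioner extends Partitioner {
    def partition(msg: Any, downstreamCount: Int): Int =
      Math.floorMod(msg.hashCode, downstreamCount)
  }

  /** Directed graph whose nodes are Tasks and whose edges carry a Partitioner. */
  final case class Graph[N <: Task, E <: Partitioner](
      vertices: List[N],
      edges: List[(N, E, N)]) {

    def addEdge(from: N, edge: E, to: N): Graph[N, E] =
      Graph((vertices ++ List(from, to)).distinct, edges :+ ((from, edge, to)))

    def downstreamOf(node: N): List[N] =
      edges.collect { case (`node`, _, to) => to }
  }
}
```

The bounds mean that something which is not a Task cannot be used as a node, and something which is not a Partitioner cannot be used as an edge; the compiler rejects both.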
Task
[Figure: class diagram relating Task and GraphTask]
GraphTask subtypes (incomplete)
↳ BalanceTask
↳ BatchTask[In, Out]
↳ BroadcastTask[T]
↳ CollectTask[In, Out]
↳ ConcatTask
↳ DelayInitialTask[T]
↳ DropWhileTask[T]
↳ ExpandTask[In, Out]
↳ FlattenMerge[T, M]
↳ FoldTask[In, Out]
↳ FutureSourceTask[T]
↳ GroupByTask[T, K]
↳ GroupedTask[T]
↳ GroupedWithinTask[T]
↳ InterleaveTask[T]
↳ IntersperseTask[T]
↳ LimitWeightedTask[T]
↳ MapTask[In, Out]
↳ MapAsyncTask[In, Out]
↳ MergeTask[T]
↳ OrElseTask[T]
↳ PartitionTask[T]
↳ PrefixAndTailTask[T]
↳ RecoverTask[T]
↳ ScanTask[In, Out]
↳ SlidingTask[T]
Materializer Variations
1. AST (module tree) is matched for every module type (GearpumpMaterializer)
2. AST (module tree) is matched for certain module types
  ○ After distribution - local ActorMaterializer is used for operations on that worker
  ○ Materializer works more as a distribution coordinator
Example 1
[Figure: Source, Broadcast, two Flows, Merge, Sink]
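import akka.NotUsed
import akka.actor.{ActorSystem, Props}
import akka.stream.{ActorMaterializer, ClosedShape}
import akka.stream.scaladsl.{Broadcast, Flow, GraphDSL, Merge, RunnableGraph, Sink, Source}

implicit val system = ActorSystem("example1")
implicit val materializer = ActorMaterializer()
val sinkActor = system.actorOf(Props(new SinkActor()))
val source = Source(1 to 5)
// The sink actor receives "COMPLETE" once the stream has finished.
val sink = Sink.actorRef(sinkActor, "COMPLETE")
val flowA: Flow[Int, Int, NotUsed] = Flow[Int].map {
  x => println(s"processing broadcasted element : $x in flowA"); x
}
val flowB: Flow[Int, Int, NotUsed] = Flow[Int].map {
  x => println(s"processing broadcasted element : $x in flowB"); x
}
val graph = RunnableGraph.fromGraph(GraphDSL.create() {
  implicit b =>
    import GraphDSL.Implicits._ // brings the ~> operator into scope
    val broadcast = b.add(Broadcast[Int](2))
    // Fan-in stage: a Merge, matching the diagram and the module tree.
    val merge = b.add(Merge[Int](2))
    source ~> broadcast
    broadcast ~> flowA ~> merge
    broadcast ~> flowB ~> merge
    merge ~> sink
    ClosedShape
})
graph.run()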
import akka.actor.Actor

class SinkActor extends Actor {
  def receive: Receive = {
    case any: Any =>
      println(s"Confirm received: $any")
  }
}
Example 1: Module Tree
[Figure: the GraphStages (Source, Broadcast, Flow, Flow, Merge, Sink) alongside the module tree]
GraphStageModule(stage=SingleSource)
GraphStageModule(stage=StatefulMapConcat)
GraphStageModule(stage=Broadcast)
GraphStageModule(stage=Map)
GraphStageModule(stage=Merge)
ActorRefSink
Example 1
processing broadcasted element : 1 in flowA
processing broadcasted element : 1 in flowB
processing broadcasted element : 2 in flowA
Confirm received: 1
Confirm received: 1
processing broadcasted element : 2 in flowB
Confirm received: 2
Confirm received: 2
processing broadcasted element : 3 in flowA
processing broadcasted element : 3 in flowB
processing broadcasted element : 4 in flowA
processing broadcasted element : 4 in flowB
Confirm received: 3
Confirm received: 3
processing broadcasted element : 5 in flowA
processing broadcasted element : 5 in flowB
Confirm received: 4
Confirm received: 4
Confirm received: 5
Confirm received: 5
Confirm received: COMPLETE
ActorMaterializer Output
Example 1
// Same graph as before; only the materializer changes.
implicit val materializer = GearpumpMaterializer()
val sinkActor = system.actorOf(Props(new SinkActor()))
val source = Source(1 to 5)
val sink = Sink.actorRef(sinkActor, "COMPLETE")
val flowA: Flow[Int, Int, NotUsed] = Flow[Int].map {
  x => println(s"processing broadcasted element : $x in flowA"); x
}
val flowB: Flow[Int, Int, NotUsed] = Flow[Int].map {
  x => println(s"processing broadcasted element : $x in flowB"); x
}
val graph = RunnableGraph.fromGraph(GraphDSL.create() {
  implicit b =>
    import GraphDSL.Implicits._ // brings the ~> operator into scope
    val broadcast = b.add(Broadcast[Int](2))
    // Fan-in stage: a Merge, matching the diagram and the module tree.
    val merge = b.add(Merge[Int](2))
    source ~> broadcast
    broadcast ~> flowA ~> merge
    broadcast ~> flowB ~> merge
    merge ~> sink
    ClosedShape
})
graph.run()
Example 1
processing broadcasted element : 1 in flowA
processing broadcasted element : 1 in flowB
processing broadcasted element : 2 in flowB
processing broadcasted element : 2 in flowA
processing broadcasted element : 3 in flowB
processing broadcasted element : 3 in flowA
processing broadcasted element : 4 in flowB
processing broadcasted element : 4 in flowA
processing broadcasted element : 5 in flowB
Confirm received: 1
processing broadcasted element : 5 in flowA
Confirm received: 1
Confirm received: 2
Confirm received: 2
Confirm received: 3
Confirm received: 3
Confirm received: 4
Confirm received: 4
Confirm received: 5
Confirm received: 5
GearpumpMaterializer Output
Demo
ActorMaterializer
[Figure: module tree with GraphStageModule(stage=SingleSource), GraphStageModule(stage=StatefulMapConcat), GraphStageModule(stage=Broadcast), GraphStageModule(stage=Map), GraphStageModule(stage=Merge) and ActorRefSink]
1. Traverses the Module Tree.
2. Builds a runtime graph of BoundaryPublishers and BoundarySubscribers (Reactive API).
3. Each Publisher or Subscriber contains an instance of GraphStageLogic specific to that GraphStage.
4. Each Publisher or Subscriber also contains an instance of ActorGraphInterpreter - an Actor that manages the message flow using GraphStageLogic.
GearpumpMaterializer
[Figure: module tree with GraphStageModule(stage=SingleSource), GraphStageModule(stage=StatefulMapConcat), GraphStageModule(stage=Broadcast), GraphStageModule(stage=Map), GraphStageModule(stage=Merge) and ActorRefSink]
1. Rewrites the Module Tree into ‘local’ and ‘remote’ Gearpump Graphs.
2. Choice of ‘local’ and ‘remote’ is determined by a ‘Strategy’. The default Strategy is to put Source and Sink types in local.
3. Inserts BridgeModules into both Graphs.
[Figure: after the rewrite, SingleSource, ActorRefSink, a SourceBridgeModule and a SinkBridgeModule on one side; Broadcast, Map, Merge, StatefulMapConcat, a SourceBridgeModule and a SinkBridgeModule on the other]
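Steps 1 to 3 can be approximated in plain Scala. This is only a sketch under stated assumptions: all types are hypothetical stand-ins, the graph is reduced to a list of edges, and a single neutral `Bridge` node stands in for the SourceBridgeModule and SinkBridgeModule named on the slides, because the transcript does not show how those are wired.

```scala
object RewriteSketch {
  // Hypothetical, simplified module types for illustration only.
  sealed trait Module { def name: String }
  final case class SourceModule(name: String) extends Module
  final case class SinkModule(name: String) extends Module
  final case class StageModule(name: String) extends Module
  final case class Bridge(name: String) extends Module

  sealed trait Placement
  case object Local extends Placement
  case object Remote extends Placement

  type Strategy = Module => Placement
  type Edge = (Module, Module)

  // Mirrors the default described in step 2: Source and Sink types stay local.
  val defaultStrategy: Strategy = {
    case _: SourceModule | _: SinkModule => Local
    case _                               => Remote
  }

  /** Splits the edges by placement; an edge that crosses sides is cut at a Bridge. */
  def rewrite(edges: List[Edge], strategy: Strategy): Map[Placement, List[Edge]] = {
    val placed: List[(Placement, Edge)] = edges.flatMap { case (from, to) =>
      val (fromSide, toSide) = (strategy(from), strategy(to))
      if (fromSide == toSide) List(fromSide -> (from -> to))
      else {
        val bridge = Bridge(s"${from.name}->${to.name}")
        List(fromSide -> (from -> bridge), toSide -> (bridge -> to))
      }
    }
    placed.groupBy(_._1).map { case (side, es) => side -> es.map(_._2) }
  }
}
```

Applied to the six edges of Example 1, the local side ends up with two edges (source to bridge, bridge to sink) and the remote side with six (two bridge edges plus the four edges between Broadcast, the two Flows and Merge).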
4. Local graph is passed to a LocalGraphMaterializer. LocalGraphMaterializer is a variant (subtype) of ActorMaterializer.
[Figure: local graph with GraphStageModule(stage=SingleSource), SourceBridgeModule, SinkBridgeModule and ActorRefSink]
5. Converts the remote graph’s Modules into Tasks.
[Figure: remote graph with SourceBridgeTask, BroadcastTask, TransformTask, StatefulMapConcatTask, MergeTask and SinkBridgeTask]
6. Sends this Graph to the Gearpump master.
7. Materialization is controlled at BridgeTasks.
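Step 5 is essentially a pattern match over module types. The sketch below uses plain Scala and hypothetical stand-in types; the task names are taken from the slides, but pairing `Map` with `TransformTask` is an assumption read off the figure labels, not something the transcript states.

```scala
object TaskConversionSketch {
  // Hypothetical stand-ins for modules in the remote graph.
  sealed trait Module
  final case class GraphStageModule(stage: String) extends Module
  case object SourceBridgeModule extends Module
  case object SinkBridgeModule extends Module

  // Hypothetical stand-ins for the Gearpump tasks named on the slides.
  sealed trait Task
  case object SourceBridgeTask extends Task
  case object SinkBridgeTask extends Task
  case object BroadcastTask extends Task
  case object MergeTask extends Task
  case object StatefulMapConcatTask extends Task
  final case class TransformTask(stage: String) extends Task

  /** Matches each module type to a task; unknown stages are reported, not guessed. */
  def toTask(module: Module): Either[String, Task] = module match {
    case SourceBridgeModule                    => Right(SourceBridgeTask)
    case SinkBridgeModule                      => Right(SinkBridgeTask)
    case GraphStageModule("Broadcast")         => Right(BroadcastTask)
    case GraphStageModule("Merge")             => Right(MergeTask)
    case GraphStageModule("StatefulMapConcat") => Right(StatefulMapConcatTask)
    case GraphStageModule("Map")               => Right(TransformTask("Map")) // assumed pairing
    case GraphStageModule(other)               => Left(s"no task mapping for stage=$other")
  }
}
```

Matching every module type this way corresponds to the first materializer variation; the second variation would match only some types and hand the remainder to a local ActorMaterializer on the worker.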
Example 2
No local graph. More typical of distributed apps.
implicit val materializer = GearpumpMaterializer()
val sink = GearSink.to(new LoggerSink[String])
val sourceData = new CollectionDataSource(
  List("red hat", "yellow sweater", "blue jack", "red apple", "green plant", "blue sky"))
val source = GearSource.from[String](sourceData)
source.filter(_.startsWith("red")).map("I want to order item: " + _).runWith(sink)
Example 3
More complex Graph with loops
implicit val materializer = GearpumpMaterializer()
RunnableGraph.fromGraph(GraphDSL.create() {
  implicit builder =>
    import GraphDSL.Implicits._ // brings the ~> and <~ operators into scope
    val A = builder.add(Source.single(0)).out
    val B = builder.add(Broadcast[Int](2))
    val C = builder.add(Merge[Int](2))
    val D = builder.add(Flow[Int].map(_ + 1))
    val E = builder.add(Balance[Int](2))
    val F = builder.add(Merge[Int](2))
    val G = builder.add(Sink.foreach(println)).in
    C <~ F
    A ~> B ~> C ~> F
    B ~> D ~> E ~> F
    E ~> G
    ClosedShape
}).run()
Summary
● Akka-streams provides a compelling programming model that enables declarative pipeline reuse and extensibility.
● Akka-streams allows different materializers to control and materialize different parts of the module tree.
● It’s possible to provide a seamless (or nearly seamless) conversion of akka-streams to run in a distributed setting by merely replacing ActorMaterializer with GearpumpMaterializer.
● Alternative distributed materializers can be implemented using a similar approach.
● Distributed akka-streams via Apache Gearpump will be available in the next release of Apache Gearpump (0.8.2) or will be made available within an akka specific repo.
Thank you
twitter: @ApacheGearpump, @kkasravi