Implementing an akka-streams materializer for big data
The Gearpump Materializer
Kam Kasravi
Technical Presentation
● Familiarity with akka-streams flow and graph DSLs
● Familiarity with big data and real-time streaming platforms
● Familiarity with Scala
● Effort between the akka-streams and Gearpump teams started late last year
● Resulted in a number of pull requests into akka-streams to enable different materializers
● Close to completion with good support of the akka-streams DSL (all GraphStages)
● Fairly seamless to switch between local and distributed
Who am I?
● Committer on Apache Gearpump (incubating)
- http://gearpump.apache.org
● Architect on Trusted Analytics Platform (TAP)
- http://trustedanalytics.org
● Lead or Architect across many companies, industries
- NYSE, eBay, PayPal, Yahoo, ...
What is Apache Gearpump?
● Accepted into Apache incubator last March
● Similar to Apache Beam and Apache Flink (real-time message delivery)
● Heavily leverages the actor model and Akka (more so than other platforms)
● Unique features like dynamic DAG
● Excellent runtime visualization tooling of cluster and application DAGs
● One of the best big data performance profiles (both throughput, latency)
Agenda
● Why?
○ Why integrate akka-streams into a big data platform?
● Big Data platform evolving features
○ Functionality big data platforms are embracing
● Prerequisites needed for any Big Data platform
○ Minimal features a big data platform must have
● Big data platform integration challenges
○ What concepts do not map well within big data platforms?
● Object models: akka-streams, Gearpump
● Materialization
○ ActorMaterializer - materializing the module tree
○ GearpumpMaterializer - rewriting the module tree
Why?
● Akka-streams has limitations inherent within a single JVM
○ Throughput and latency are key big data features that require scaling beyond a single JVM
● Akka-streams DSL is a superset of other big data platform DSLs
○ Has a logical plan (declarative) that can be transformed to an execution plan (runtime)
● Akka-streams programming paradigm is declarative, composable,
extensible*, stackable* and reusable*
* Provides a level of extensibility and functionality beyond most big data platform DSLs
Extensible
● Extend GraphStage
● Extend Source, Sink, Flow or BidiFlow
● All derive from Graph
* Provides a level of extensibility and functionality beyond most big data platform DSLs
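To make the first bullet concrete, here is a minimal sketch of extending GraphStage (akka-streams 2.4.x API). The stage name and logic are hypothetical, not from the talk; standard akka-streams imports are assumed.

import akka.stream.{Attributes, FlowShape, Inlet, Outlet}
import akka.stream.stage.{GraphStage, GraphStageLogic, InHandler, OutHandler}

// Hypothetical example stage: doubles every element that passes through it.
class DoublerStage extends GraphStage[FlowShape[Int, Int]] {
  val in: Inlet[Int] = Inlet("Doubler.in")
  val out: Outlet[Int] = Outlet("Doubler.out")
  override val shape: FlowShape[Int, Int] = FlowShape(in, out)

  override def createLogic(attrs: Attributes): GraphStageLogic =
    new GraphStageLogic(shape) {
      setHandler(in, new InHandler {
        override def onPush(): Unit = push(out, grab(in) * 2)
      })
      setHandler(out, new OutHandler {
        override def onPull(): Unit = pull(in)
      })
    }
}

// Usage (with an implicit materializer in scope):
// Source(1 to 5).via(new DoublerStage).runWith(Sink.foreach(println))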
Stackable
● Another term for nestable or recursive. Reference to Kleisli (theoretical).
● Source, Sink, Flow or BidiFlow may contain their own topologies
* Provides a level of extensibility and functionality beyond most big data platform DSLs
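As an illustrative sketch of stackability (names are mine, not from the talk): a Flow whose body is itself a small GraphDSL topology.

import akka.NotUsed
import akka.stream.FlowShape
import akka.stream.scaladsl.{Broadcast, Flow, GraphDSL, Zip}

// A Flow whose internals are a nested graph: broadcast each element,
// transform one branch, then zip the branches back together.
val nestedFlow: Flow[Int, (Int, Int), NotUsed] =
  Flow.fromGraph(GraphDSL.create() { implicit b =>
    import GraphDSL.Implicits._
    val bcast = b.add(Broadcast[Int](2))
    val zip = b.add(Zip[Int, Int]())
    bcast.out(0) ~> zip.in0
    bcast.out(1) ~> Flow[Int].map(_ * 10) ~> zip.in1
    FlowShape(bcast.in, zip.out)
  })

// nestedFlow can now be used anywhere a plain Flow[Int, (Int, Int), _] is expected.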
Reusable
● Graph topologies can be attached anywhere (any Graph)
● A recent akka-streams feature is dynamic attachment via hubs
● Hubs will take advantage of Gearpump's dynamic DAG within the GearpumpMaterializer
* Provides a level of extensibility and functionality beyond most big data platform DSLs
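A minimal sketch of the hubs mentioned above (MergeHub and BroadcastHub, available in recent akka-streams releases), assuming an implicit materializer is in scope; producers and consumers can attach after the stream is already running.

import akka.stream.scaladsl.{BroadcastHub, Keep, MergeHub, Sink, Source}

// Materialize once; the resulting Sink and Source can be attached to
// dynamically, long after the hub graph below is running.
val (toHub, fromHub) =
  MergeHub.source[String](perProducerBufferSize = 16)
    .toMat(BroadcastHub.sink(bufferSize = 256))(Keep.both)
    .run()

// Attach a new producer and a new consumer at any later point.
Source.single("hello").runWith(toHub)
fromHub.runWith(Sink.foreach(println))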
Big Data platform evolving features (1)
● Big data platforms are moving to consolidate disparate APIs
○ Too many APIs: Concord, Flink, Heron, Pulsar, Spark, Storm, Samza
○ Common DSL is also an approach being taken by Apache Beam
○ Analogy to SQL - common grammar that different platforms execute
Big Data platform evolving features (2)
● Big data platforms will increasingly require dynamic pipelines that are compositional and reusable
● Examples include:
○ Machine learning
○ IoT sensors
Big Data platform evolving features (3)
● Machine learning use cases
○ Replace or update scoring models
○ Model Ensembles
■ concept drift
■ data drift
Big Data platform evolving features (4)
● IoT use cases
○ Bring new sensors on line with no interruption
○ Change or update configuration parameters at remote sensors
Prerequisites needed for any Big Data platform (1)
1. Push and Pull
● Upstream must be able to push
● Downstream must be able to pull
2. Backpressure
● Downstream must be able to backpressure all the way to the source
Prerequisites needed for any Big Data platform (2)
3. Parallelization
4. Asynchronous
5. Bidirectional
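In akka-streams terms these prerequisites map onto existing combinators; a small illustrative sketch (names are mine, not from the talk):

import scala.concurrent.Future
import akka.NotUsed
import akka.stream.scaladsl.{BidiFlow, Flow}

// 3 + 4: parallel, asynchronous processing that still honours backpressure;
// up to 4 futures are in flight at once, results stay in upstream order.
val parallel: Flow[Int, Int, NotUsed] =
  Flow[Int].mapAsync(parallelism = 4)(i => Future.successful(i * 2))

// 5: a bidirectional stage, e.g. encode outbound ints, decode inbound strings.
val codec: BidiFlow[Int, String, String, Int, NotUsed] =
  BidiFlow.fromFunctions((i: Int) => i.toString, (s: String) => s.toInt)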
Big data platform integration challenges (1)
A number of GraphStages have completion or cancellation semantics. Big data pipelines are often infinite streams and do not complete. Cancel is often viewed as a failure.
● Balance[T]
● Completion[T]
● Merge[T]
● Split[T]
Big data platform integration challenges (2)
A number of GraphStages have specific upstream and downstream ordering and timing directives.
● Batch[T]
● Concat[T]
● Delay[T]
● DelayInitial[T]
● Interleave[T]
Big data platform integration challenges (3)
The async attribute as well as fusing do not map cleanly when distributing GraphStage functionality across machines.
● Graph.async
● Fusing
Graph.async
● Collapses multiple operations (GraphStageLogic) into one actor
● In distributed scenarios one may want to control which actors run within the same JVM or on the same machine
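For example, a hedged sketch of placing an explicit boundary with .async (assuming an implicit materializer is in scope):

import akka.stream.scaladsl.{Sink, Source}

Source(1 to 100)
  .map(_ * 2)   // fused with the source stage by default
  .async        // explicit asynchronous boundary
  .map(_ + 1)   // everything after the boundary runs in a separate actor
  .runWith(Sink.ignore)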
Fusing
● Creates one or more islands delimited by async boundaries
● For distributed scenarios, no fusing should occur until the materializer can evaluate and optimize the execution plan
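A sketch of the akka-streams 2.4.x fusing API, plus the configuration switch that turns automatic fusing off so a distributed materializer can decide later; exact setting names may vary by version, and an implicit materializer is assumed in scope.

import akka.stream.Fusing
import akka.stream.scaladsl.{Flow, Sink, Source}

// Explicitly pre-fuse a small sub-graph into a single island.
val fused = Fusing.aggressive(Flow[Int].map(_ + 1).filter(_ % 2 == 0))
Source(1 to 10).via(fused).runWith(Sink.foreach(println))

// application.conf: leave fusing decisions to the materializer instead
// akka.stream.materializer.auto-fusing = off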
Object Models
● Akka-stream’s GraphStage, Module, Shape
● Gearpump’s Graph, Task, Partitioner
Akka-streams Object Model
↪ Base type is a Graph. Common base type is a GraphStage
↪ Graph contains a
↳ Module contains a
↳ Shape
↪ Only a RunnableGraph can be materialized
↪ A RunnableGraph needs at least one Source and one Sink
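For instance, a minimal RunnableGraph: a closed graph with one Source and one Sink that any materializer can run (a sketch, assuming an implicit materializer for the commented run()).

import scala.concurrent.Future
import akka.stream.scaladsl.{Keep, RunnableGraph, Sink, Source}

// Closed: every port is connected, so this graph can be materialized.
val runnable: RunnableGraph[Future[Int]] =
  Source(1 to 5).toMat(Sink.fold[Int, Int](0)(_ + _))(Keep.right)

// run() uses the implicit materializer and yields the materialized value.
// val sum: Future[Int] = runnable.run()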
Akka-streams Graph[S, M]
● Graph is parameterized by
○ Shape
○ Materialized Value
● A Graph contains a Module, which contains a Shape
○ Module is where the runtime is constructed and manipulated
● Graph’s first level subtypes provide basic functionality
○ Source
○ Sink
○ Flow
○ BidiFlow
(Diagram: Graph[S, M] contains a Module, which contains a Shape; first-level subtypes of Graph are Source, Sink, Flow and BidiFlow)
(Diagram: GraphStage[S <: Shape] extends GraphStageWithMaterializedValue, which extends Graph; a GraphStage is carried by a GraphStageModule, a subtype of Module)
GraphStage[S <: Shape]
subtypes (incomplete)
↳ Balance[T]
↳ Batch[In, Out]
↳ Broadcast[T]
↳ Collect[In, Out]
↳ Concat[T]
↳ DelayInitial[T]
↳ DropWhile[T]
↳ Expand[In, Out]
↳ FlattenMerge[T, M]
↳ Fold[In, Out]
↳ FoldAsync[T]
↳ FutureSource[T]
↳ GroupBy[T, K]
↳ Grouped[T]
↳ GroupedWithin[T]
↳ Interleave[T]
↳ Intersperse[T]
↳ LimitWeighted[T]
↳ Map[In, Out]
↳ MapAsync[In, Out]
↳ Merge[T]
↳ MergePreferred[T]
↳ MergeSorted[T]
↳ OrElse[T]
↳ Partition[T]
↳ PrefixAndTail[T]
↳ Recover[T]
↳ Scan[In, Out]
↳ SimpleLinearGraph[T]
↳ Sliding[T]
What about Module?
● Module is a recursive structure containing a Set[Module]
● Module is a declarative data structure used as the AST
● Module is used to represent a graph of nodes and edges from the original GraphStages
● Module contains downstream and upstream ports (edges)
● Materializers walk the module tree to create and run instances of publishers
and subscribers.
● Each publisher and subscriber is an actor (ActorGraphInterpreter)
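As an illustration only (a toy model, not akka's internal StreamLayout API), the recursive structure described above and the walk a materializer performs look roughly like this:

// Toy model of the recursive module tree (illustrative names, not akka internals).
sealed trait Module { def subModules: Set[Module] }
final case class AtomicModule(name: String) extends Module {
  val subModules: Set[Module] = Set.empty
}
final case class CompositeModule(subModules: Set[Module]) extends Module

// A materializer walks the tree and materializes each atomic (leaf) module,
// e.g. as a publisher/subscriber pair backed by an actor.
def atomicModules(m: Module): Set[Module] =
  if (m.subModules.isEmpty) Set(m)
  else m.subModules.flatMap(atomicModules)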
Gearpump Object Model
↪ Graph[Node, Edge] holds
↳ Tasks (Node)
↳ Partitioners (Edge)
↪ This is a Gearpump Graph, not to be
confused with akka-streams Graph.
Gearpump Graph[N<:Task, E<:Partitioner]
● Graph is parameterized by
○ Node - must be a subtype of Task
○ Edge - must be a subtype of Partitioner
(Diagram: a Gearpump Graph[N, E] holds a List[Task] and a List[Partitioner]; GraphTask is a subtype of Task)
GraphTask
subtypes (incomplete)
↳ BalanceTask
↳ BatchTask[In, Out]
↳ BroadcastTask[T]
↳ CollectTask[In, Out]
↳ ConcatTask
↳ DelayInitialTask[T]
↳ DropWhileTask[T]
↳ ExpandTask[In, Out]
↳ FlattenMerge[T, M]
↳ FoldTask[In, Out]
↳ FutureSourceTask[T]
↳ GroupByTask[T, K]
↳ GroupedTask[T]
↳ GroupedWithinTask[T]
↳ InterleaveTask[T]
↳ IntersperseTask[T]
↳ LimitWeightedTask[T]
↳ MapTask[In, Out]
↳ MapAsyncTask[In, Out]
↳ MergeTask[T]
↳ OrElseTask[T]
↳ PartitionTask[T]
↳ PrefixAndTailTask[T]
↳ RecoverTask[T]
↳ ScanTask[In, Out]
↳ SlidingTask[T]
Materializer Variations
1. AST (module tree) is matched for every module type
(GearpumpMaterializer)
2. AST (module tree) is matched for certain module types
○ After distribution - local ActorMaterializer is used for
operations on that worker
○ Materializer works more as a distribution coordinator
Example 1
(Diagram: GraphStages used are Source, Broadcast, Flow, Merge and Sink)
implicit val materializer = ActorMaterializer()
val sinkActor = system.actorOf(Props(new SinkActor()))
val source = Source(1 to 5)
val sink = Sink.actorRef(sinkActor, "COMPLETE")
val flowA: Flow[Int, Int, NotUsed] = Flow[Int].map {
  x => println(s"processing broadcasted element : $x in flowA"); x
}
val flowB: Flow[Int, Int, NotUsed] = Flow[Int].map {
  x => println(s"processing broadcasted element : $x in flowB"); x
}
val graph = RunnableGraph.fromGraph(GraphDSL.create() { implicit b =>
  import GraphDSL.Implicits._
  val broadcast = b.add(Broadcast[Int](2))
  val merge = b.add(Merge[Int](2))
  source ~> broadcast
  broadcast ~> flowA ~> merge
  broadcast ~> flowB ~> merge
  merge ~> sink
  ClosedShape
})
graph.run()
Example 1
implicit val materializer = ActorMaterializer()
val sinkActor = system.actorOf(Props(new SinkActor()))
val source = Source(1 to 5)
val sink = Sink.actorRef(sinkActor, "COMPLETE")
val flowA: Flow[Int, Int, NotUsed] = Flow[Int].map {
  x => println(s"processing broadcasted element : $x in flowA"); x
}
val flowB: Flow[Int, Int, NotUsed] = Flow[Int].map {
  x => println(s"processing broadcasted element : $x in flowB"); x
}
val graph = RunnableGraph.fromGraph(GraphDSL.create() { implicit b =>
  import GraphDSL.Implicits._
  val broadcast = b.add(Broadcast[Int](2))
  val merge = b.add(Merge[Int](2))
  source ~> broadcast
  broadcast ~> flowA ~> merge
  broadcast ~> flowB ~> merge
  merge ~> sink
  ClosedShape
})
graph.run()
(Diagram: GraphStages Source ~> Broadcast ~> (Flow, Flow) ~> Merge ~> Sink)
class SinkActor extends Actor {
  def receive: Receive = {
    case any: Any =>
      println(s"Confirm received: $any")
  }
}
Example 1
(Diagram: GraphStages Source ~> Broadcast ~> (Flow, Flow) ~> Merge ~> Sink)
Module Tree: GraphStageModule(stage=SingleSource), GraphStageModule(stage=StatefulMapConcat), GraphStageModule(stage=Broadcast), GraphStageModule(stage=Map), GraphStageModule(stage=Merge), ActorRefSink
Example 1
implicit val materializer = ActorMaterializer()
val sinkActor = system.actorOf(Props(new SinkActor()))
val source = Source(1 to 5)
val sink = Sink.actorRef(sinkActor, "COMPLETE")
val flowA: Flow[Int, Int, NotUsed] = Flow[Int].map {
  x => println(s"processing broadcasted element : $x in flowA"); x
}
val flowB: Flow[Int, Int, NotUsed] = Flow[Int].map {
  x => println(s"processing broadcasted element : $x in flowB"); x
}
val graph = RunnableGraph.fromGraph(GraphDSL.create() { implicit b =>
  import GraphDSL.Implicits._
  val broadcast = b.add(Broadcast[Int](2))
  val merge = b.add(Merge[Int](2))
  source ~> broadcast
  broadcast ~> flowA ~> merge
  broadcast ~> flowB ~> merge
  merge ~> sink
  ClosedShape
})
graph.run()
(Diagram: GraphStages source ~> broadcast ~> (flowA, flowB) ~> merge ~> sink)
Example 1
processing broadcasted element : 1 in flowA
processing broadcasted element : 1 in flowB
processing broadcasted element : 2 in flowA
Confirm received: 1
Confirm received: 1
processing broadcasted element : 2 in flowB
Confirm received: 2
Confirm received: 2
processing broadcasted element : 3 in flowA
processing broadcasted element : 3 in flowB
processing broadcasted element : 4 in flowA
processing broadcasted element : 4 in flowB
Confirm received: 3
Confirm received: 3
processing broadcasted element : 5 in flowA
processing broadcasted element : 5 in flowB
Confirm received: 4
Confirm received: 4
Confirm received: 5
Confirm received: 5
Confirm received: COMPLETE
ActorMaterializer Output
Example 1
implicit val materializer = GearpumpMaterializer()
val sinkActor = system.actorOf(Props(new SinkActor()))
val source = Source(1 to 5)
val sink = Sink.actorRef(sinkActor, "COMPLETE")
val flowA: Flow[Int, Int, NotUsed] = Flow[Int].map {
  x => println(s"processing broadcasted element : $x in flowA"); x
}
val flowB: Flow[Int, Int, NotUsed] = Flow[Int].map {
  x => println(s"processing broadcasted element : $x in flowB"); x
}
val graph = RunnableGraph.fromGraph(GraphDSL.create() { implicit b =>
  import GraphDSL.Implicits._
  val broadcast = b.add(Broadcast[Int](2))
  val merge = b.add(Merge[Int](2))
  source ~> broadcast
  broadcast ~> flowA ~> merge
  broadcast ~> flowB ~> merge
  merge ~> sink
  ClosedShape
})
graph.run()
Example 1
processing broadcasted element : 1 in flowA
processing broadcasted element : 1 in flowB
processing broadcasted element : 2 in flowB
processing broadcasted element : 2 in flowA
processing broadcasted element : 3 in flowB
processing broadcasted element : 3 in flowA
processing broadcasted element : 4 in flowB
processing broadcasted element : 4 in flowA
processing broadcasted element : 5 in flowB
Confirm received: 1
processing broadcasted element : 5 in flowA
Confirm received: 1
Confirm received: 2
Confirm received: 2
Confirm received: 3
Confirm received: 3
Confirm received: 4
Confirm received: 4
Confirm received: 5
Confirm received: 5
GearpumpMaterializer Output
Demo
ActorMaterializer
1. Traverses the Module Tree
(Module Tree: GraphStageModule(stage=SingleSource), GraphStageModule(stage=StatefulMapConcat), GraphStageModule(stage=Broadcast), GraphStageModule(stage=Map), GraphStageModule(stage=Merge), ActorRefSink)
ActorMaterializer
2. Builds a runtime graph of BoundaryPublisher and
BoundarySubscribers (Reactive API).
3. Each Publisher or Subscriber contains an instance of
GraphStageLogic specific to that GraphStage.
4. Each Publisher or Subscriber also contains an
instance of ActorGraphInterpreter - an Actor that
manages the message flow using GraphStageLogic.
GearpumpMaterializer
1. Rewrites the Module Tree into 'local' and 'remote' Gearpump Graphs.
(Module Tree: GraphStageModule(stage=SingleSource), GraphStageModule(stage=StatefulMapConcat), GraphStageModule(stage=Broadcast), GraphStageModule(stage=Map), GraphStageModule(stage=Merge), ActorRefSink)
GearpumpMaterializer
2. The choice of 'local' vs. 'remote' is determined by a Strategy. The default Strategy is to put Source and Sink types in the local graph.
(Local: GraphStageModule(stage=SingleSource), ActorRefSink; Remote: GraphStageModule(stage=StatefulMapConcat), GraphStageModule(stage=Broadcast), GraphStageModule(stage=Map), GraphStageModule(stage=Merge))
GearpumpMaterializer
3. Inserts BridgeModules into both Graphs.
(Local graph: GraphStageModule(stage=SingleSource), SinkBridgeModule, SourceBridgeModule, ActorRefSink; Remote graph: SourceBridgeModule, GraphStageModule(stage=StatefulMapConcat), GraphStageModule(stage=Broadcast), GraphStageModule(stage=Map), GraphStageModule(stage=Merge), SinkBridgeModule)
GearpumpMaterializer
4. The local graph is passed to a LocalGraphMaterializer, a variant (subtype) of ActorMaterializer.
(Local graph: GraphStageModule(stage=SingleSource), SinkBridgeModule, SourceBridgeModule, ActorRefSink)
GearpumpMaterializer
5. Converts the remote graph's Modules into Tasks.
(Remote graph: SourceBridgeTask, StatefulMapConcatTask, BroadcastTask, TransformTask, MergeTask, SinkBridgeTask)
GearpumpMaterializer
6. Sends this Task Graph to the Gearpump master.
GearpumpMaterializer
7. Materialization is controlled at the BridgeTasks.
Example 2
No local graph.
More typical of distributed apps.
implicit val materializer = GearpumpMaterializer()
val sink = GearSink.to(new LoggerSink[String])
val sourceData = new CollectionDataSource(
  List("red hat", "yellow sweater", "blue jack", "red apple", "green plant", "blue sky"))
val source = GearSource.from[String](sourceData)
source.filter(_.startsWith("red")).map("I want to order item: " + _).runWith(sink)
Example 3
More complex Graph with loops
implicit val materializer = GearpumpMaterializer()
RunnableGraph.fromGraph(GraphDSL.create() {
implicit builder =>
import GraphDSL.Implicits._
val A = builder.add(Source.single(0)).out
val B = builder.add(Broadcast[Int](2))
val C = builder.add(Merge[Int](2))
val D = builder.add(Flow[Int].map(_ + 1))
val E = builder.add(Balance[Int](2))
val F = builder.add(Merge[Int](2))
val G = builder.add(Sink.foreach(println)).in
C <~ F
A ~> B ~> C ~> F
B ~> D ~> E ~> F
E ~> G
ClosedShape
}).run()
Summary
● Akka-streams provides a compelling programming model
that enables declarative pipeline reuse and extensibility.
● Akka-streams allows different materializers to control and
materialize different parts of the module tree.
● It’s possible to provide a seamless (or nearly seamless)
conversion of akka-streams to run in a distributed setting
by merely replacing ActorMaterializer with
GearpumpMaterializer.
● Alternative distributed materializers can be implemented
using a similar approach.
● Distributed akka-streams via Apache Gearpump will be available in the next release of Apache Gearpump (0.8.2) or will be made available in an Akka-specific repository.
Thank you
twitter:
@ApacheGearpump
@kkasravi

  • 51. Summary ● Akka-streams provides a compelling programming model that enables declarative pipeline reuse and extensibility. ● Akka-streams allows different materializers to control and materialize different parts of the module tree. ● It’s possible to provide a seamless (or nearly seamless) conversion of akka-streams to run in a distributed setting by merely replacing ActorMaterializer with GearpumpMaterializer. ● Alternative distributed materializers can be implemented using a similar approach. ● Distributed akka-streams via Apache Gearpump will be available in the next release of Apache Gearpump (0.8.2) or will be made available within an akka specific repo.