Large volume data analysis on the Typesafe Reactive Platform - Big Data Scala by the Bay 2015
Martin Zapletal @zapletal_martin
Cake Solutions @cakesolutions
● Increasing importance of data analytics, data mining and
machine learning
● Current state
○ Destructive updates
○ Analytics tools with poor scalability and integration
○ Manual processes
○ Slow iterations
○ Not suitable for large volumes of fast-arriving data
● Shared memory, disk, shared nothing, threads, mutexes, transactional memory, message
passing, CSP, actors, futures, coroutines, evented, dataflow, ...
"We can think of two reasons for using distributed machine learning: because you have to (so
much data), or because you want to (hoping it will be faster). Only the first reason is good." - Zygmunt Z
[Chart: elapsed times for 20 PageRank iterations [1, 2]]
● Complementary
● Distributed data processing framework Apache Spark won the Daytona GraySort 100 TB benchmark
● Distributed databases
● Whole lifecycle of data
● Data processing - Futures, Akka, Akka Cluster, Reactive Streams,
Spark, …
● Data stores
● Integration and messaging
● Distributed computing primitives
● Cluster managers and task schedulers
● Deployment, configuration management and DevOps
● Data analytics and machine learning
[Diagrams: data platform architectures - ACID mutable state; CQRS (clients issue commands and queries against separate models, with denormalised/precomputed views maintained by a stream processor); Kappa architecture (all your data in a Kafka log feeding a stream processor and views); batch pipeline (Flume, Sqoop, Hive, Impala, Oozie, HDFS and a serving DB); Lambda architecture (batch layer and serving layer plus a fast stream layer, with queries merging both over all your data)]
[3]
[4, 5]
Output 0 with result 0.6615020337700888 in 12:15:53.564
Output 0 with result 0.6622847063345205 in 12:15:53.564
● Pure Scala
● Functional programming
● Synchronization and memory management
● Actor framework for truly concurrent and distributed systems
● Thread-safe mutable state - consistency boundary
● Domain modelling
● Distributed state, work, communication patterns
● Simple programming model - send messages, create new actors,
change behaviour
class UserActor extends PersistentActor {
  override def persistenceId: String = UserPersistenceId(self.path.name).persistenceId

  // CRDT key for the replicated set of registered accounts (Akka Distributed Data).
  private[this] val userAccountKey = GSetKey[Account]("userAccountKey")

  override def receiveCommand: Receive = notRegistered(DistributedData(context.system).replicator)

  def notRegistered(distributedData: ActorRef): Receive = {
    case cmd: AccountCommand =>
      // Persist the event first; only then update the replicated CRDT and switch behaviour.
      persist(AccountEvent(cmd.account)) { evt =>
        distributedData ! Update(userAccountKey, GSet.empty[Account], WriteLocal)(_ + evt.account)
        context.become(registered(evt.account))
      }
  }

  def registered(account: Account): Receive = {
    case eres @ EntireResistanceExerciseSession(id, session, sets, examples, deviations) =>
      // Acknowledge with a scalaz success (\/-) once the event has been journalled.
      persist(eres)(data => sender() ! \/-(id))
  }

  override def receiveRecover: Receive = {
    ...
  }
}
class SensorDataProcessor[P, S] extends ActorPublisher[SensorData] with DataSink[P] with DataProcessingFlow[S] {
  implicit val materializer = ActorMaterializer()

  override def preStart() = {
    // Materialise the stream once: this actor publishes SensorData into the processing flow.
    FlowGraph.closed(sink) { implicit builder: FlowGraph.Builder[Future[Unit]] => s =>
      Source(ActorPublisher[SensorData](self)) ~> flow ~> s
    }.run()
    super.preStart()
  }

  // Buffer incoming sensor data while there is no downstream demand (backpressure).
  def source(buffer: Seq[SensorData]): Receive = {
    case data: SensorData if totalDemand > 0 && buffer.isEmpty => onNext(data)
    case data: SensorData => context.become(source(buffer :+ data))
    case Request(_) if buffer.nonEmpty =>
      onNext(buffer.head)
      context.become(source(buffer.tail))
  }

  override def receive: Receive = source(Seq())
}
[Diagrams: persistence, sharding and replication; a numbered message-flow sequence (steps 1-11); and a distributed counter whose value is shown as "?", "? + 1" and "? + 2" on different nodes]
● At-most-once. Messages may be lost, but never duplicated.
● At-least-once. Messages may be duplicated, but not lost - the sender keeps redelivering until it receives an Ack.
● Exactly-once. Each message is delivered and processed exactly once; the hardest guarantee to provide in a distributed system.
(A minimal at-least-once sketch follows the reference below.)
[6]
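As a minimal at-least-once sketch - assuming Akka Persistence's AtLeastOnceDelivery trait, with hypothetical Command, Msg, Confirm and event types invented for illustration - the sender journals an event, then keeps redelivering Msg to the destination until a Confirm arrives and confirmDelivery is called; the destination must therefore tolerate duplicates.

import akka.actor.ActorPath
import akka.persistence.{AtLeastOnceDelivery, PersistentActor}

// Hypothetical protocol and events, invented for this sketch.
case class Command(payload: String)
case class Msg(deliveryId: Long, payload: String)
case class Confirm(deliveryId: Long)
case class MsgSent(payload: String)
case class MsgConfirmed(deliveryId: Long)

class AtLeastOnceSender(destination: ActorPath) extends PersistentActor with AtLeastOnceDelivery {
  override def persistenceId: String = "at-least-once-sender"

  override def receiveCommand: Receive = {
    case Command(payload)    => persist(MsgSent(payload))(updateState)
    case Confirm(deliveryId) => persist(MsgConfirmed(deliveryId))(updateState)
  }

  // Replaying the journal rebuilds the set of unconfirmed deliveries after a crash.
  override def receiveRecover: Receive = {
    case evt @ (_: MsgSent | _: MsgConfirmed) => updateState(evt)
  }

  private def updateState(evt: Any): Unit = evt match {
    case MsgSent(payload) =>
      // Redelivered until confirmDelivery is called, so the destination may see duplicates.
      deliver(destination)(deliveryId => Msg(deliveryId, payload))
    case MsgConfirmed(deliveryId) =>
      confirmDelivery(deliveryId)
  }
}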
Output 18853 with result 0.6445355972059068 in 17:33:12.248
Output 18854 with result 0.6392081778097862 in 17:33:12.248
Output 18855 with result 0.6476549338361918 in 17:33:12.248
[17:33:12.353] [ClusterSystem-akka.actor.default-dispatcher-21] [Cluster(akka://ClusterSystem)] Cluster Node [akka.tcp://ClusterSystem@127.0.0.1:2551] - Leader is removing unreachable node [akka.tcp://ClusterSystem@127.0.0.1:54495]
[17:33:12.388] [ClusterSystem-akka.actor.default-dispatcher-22] [akka.tcp://ClusterSystem@127.0.0.1:2551/user/sharding/PerceptronCoordinator] Member removed [akka.tcp://ClusterSystem@127.0.0.1:54495]
[17:33:12.388] [ClusterSystem-akka.actor.default-dispatcher-35]
[17:33:12.415] [ClusterSystem-akka.actor.default-dispatcher-18] [akka://ClusterSystem/user/sharding/Edge/e-2-1-3-1] null java.lang.NullPointerException
● Microsoft's data centers average 5.2 device failures and 40.8 link failures per day, with a median time to repair of approximately five minutes (and a maximum of one week).
● In the first year of a new Google cluster: roughly five rack issues (40-80 machines seeing 50 percent packet loss), eight network maintenance events (four of which might cause ~30-minute random connectivity losses) and three router failures (requiring traffic to be pulled immediately for an hour).
● CENIC recorded some 500 isolating network partitions, with median durations of 2.7 and 32 minutes and 95th percentiles of 19.9 minutes and 3.7 days, for software and hardware problems respectively.
[7]
● A partition separated a MongoDB primary from its two secondaries; when the old primary rejoined two hours later, it rolled back everything written to the new primary.
● A network partition isolated the Redis primary from all secondaries. Every API call caused the billing system to recharge customer credit cards automatically, resulting in 1.1 percent of customers being overbilled over a period of 40 minutes.
● A partition left a MySQL database inconsistent. Because foreign key relationships were no longer consistent, GitHub showed private repositories on the wrong users' dashboards and incorrectly routed some newly created repositories.
● For several seconds, Elasticsearch is happy to believe two nodes in the same cluster are both primaries, will accept writes on both of those nodes, and later discard the writes to one side.
● RabbitMQ lost ~35% of acknowledged writes under those conditions.
● Redis threw away 56% of the writes it told us succeeded.
● In Riak, last-write-wins resulted in dropping 30-70% of acknowledged writes, even with the strongest consistency settings.
● MongoDB "strictly consistent" reads can see stale versions of documents, and can also return garbage data from writes that never should have occurred.
[8]
● Publisher and subscriber with non-blocking backpressure
● Lazy topology definition - nothing runs until the stream is materialised
Source[Circle].map(_.toSquare).filter(_.color == blue)
(A runnable sketch follows the diagram below.)
[Diagram: Publisher → toSquare → color == blue → Subscriber, with backpressure signalled upstream]
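A minimal runnable sketch of the same idea, assuming Akka Streams with an ActorSystem and ActorMaterializer; the Circle and Square types above are illustrative, so plain integers stand in for them here. The slow Sink's demand propagates upstream, so the Source never produces faster than the subscriber consumes.

import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Sink, Source}

object BackpressureSketch extends App {
  implicit val system = ActorSystem("streams-demo")
  implicit val materializer = ActorMaterializer()

  // Lazy topology definition: nothing runs until runWith materialises the stream.
  Source(1 to 1000)
    .map(n => n * n)
    .filter(_ % 2 == 0)
    .runWith(Sink.foreach { n =>
      Thread.sleep(100) // deliberately slow subscriber; its demand throttles the publisher
      println(n)
    })
}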
weights ~> zip.in0
zip.out ~> transform ~> broadcast
broadcast ~> zipWithIndex ~> sink
zip.in1 <~ concat <~ input
concat <~ broadcast
[Diagram: a neural network layer expressed as the stream graph above - weights and the layer's input are zipped, transformed (*), broadcast, and indexed with zipWithIndex to produce input for layer n + 1]
[9]
[Slide: example exercise classification - 7 × Dumbbell Alternating Curl]
● In-memory distributed dataflow framework for data processing, both streaming and batch
● Distributes computation using a higher level API
● Load balancing
● Moves computation to data
● Fault tolerant
● Resilient Distributed Datasets
● Fault tolerance
● Caching
● Serialization
● Transformations
○ Lazy, form the DAG
○ map, filter, flatMap, union, group, reduce, sort, join, repartition, cartesian, glom, ...
● Actions
○ Execute DAG, retrieve result
○ reduce, collect, count, first, take, foreach, saveAs…, min, max, ...
● Accumulators
● Broadcast Variables (a short sketch of both follows this list)
● Integration
● Streaming
● Machine Learning
● Graph Processing
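A minimal sketch of accumulators and broadcast variables against the Spark 1.x API of the time; the stop-word set and word list are made up for illustration. The broadcast value is shipped to each executor once, and the accumulator total is read back on the driver once an action has run.

import org.apache.spark.{SparkConf, SparkContext}

object SharedVariablesSketch extends App {
  val sc = new SparkContext(new SparkConf().setAppName("shared-vars").setMaster("local[*]"))

  // Broadcast variable: read-only data shipped to each executor once, not with every task.
  val stopWords = sc.broadcast(Set("a", "an", "the"))

  // Accumulator: executors only add to it; the driver reads the total after an action.
  val dropped = sc.accumulator(0L, "dropped words")

  val words = sc.parallelize(Seq("the", "quick", "brown", "fox", "a", "lazy", "dog"))
  val kept = words.filter { w =>
    val keep = !stopWords.value.contains(w)
    if (!keep) dropped += 1L
    keep
  }

  println(s"kept ${kept.count()} words, dropped ${dropped.value}") // count() triggers execution
  sc.stop()
}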
[Diagrams: an RDD lineage of transformations ending in a collect action; the word count below runs textFile → map → map → reduceByKey → collect]
sc.textFile("counts")
.map(line => line.split("t"))
.map(word => (word(0), word(1).toInt))
.reduceByKey(_ + _)
.collect()
[10]
● Catalyst optimiser
● Multiple phases - analysis, logical optimisation, physical planning, code generation
● DataFrame (a sketch follows below)
[11]
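A small sketch of the DataFrame API, assuming a Spark 1.x SQLContext, a hypothetical events.json input and made-up column names; explain(true) prints the plans Catalyst produces at each phase.

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.avg

val sqlContext = new SQLContext(sc) // assumes an existing SparkContext named sc
import sqlContext.implicits._

val events = sqlContext.read.json("events.json") // hypothetical input

events
  .filter($"intensity" > 0.5)                // a candidate for predicate pushdown
  .groupBy($"userId")
  .agg(avg($"intensity").as("avgIntensity"))
  .explain(true)                             // prints the analysed, optimised and physical plans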
[Diagram: the machine learning workflow - data → preprocessing → features → training and testing → error %]
val events = sc.eventTable().cache().toDF()

// One Spark ML pipeline per user: filter, normalise, extract features, fit a regression.
val pipeline = new Pipeline().setStages(Array(
  new UserFilter(),
  new ZScoreNormalizer(),
  new IntensityFeatureExtractor(),
  new LinearRegression()
))

getEligibleUsers(events, sessionEndedBefore)
  .map { user =>
    // Fit the pipeline for this user only, overriding the userId parameter.
    val model = pipeline.fit(
      events,
      ParamMap(ParamPair(userIdParam, user)))
    val testData = // Prepare test data.
    val predictions = model.transform(testData)
    submitResult(user, predictions, config)
  }
1) Choose the best combination of tools for a given use case.
2) Understand the internals of the selected tools.
3) The environment is often fully asynchronous and distributed.
● Jobs at www.cakesolutions.net/careers
● Code at https://github.com/muvr
● Twitter @zapletal_martin
[1] http://www.csie.ntu.edu.tw/~cjlin/talks/twdatasci_cjlin.pdf
[2] http://blog.acolyer.org/2015/06/05/scalability-but-at-what-cost/
[3] http://www.benstopford.com/2015/04/28/elements-of-scale-composing-and-scaling-data-platforms/
[4] http://malteschwarzkopf.de/research/assets/google-stack.pdf
[5] http://malteschwarzkopf.de/research/assets/facebook-stack.pdf
[6] http://en.wikipedia.org/wiki/Two_Generals%27_Problem
[7] https://queue.acm.org/detail.cfm?id=2655736
[8] https://aphyr.com/
[9] http://www.smartjava.org/content/visualizing-back-pressure-and-reactive-streams-akka-streams-statsd-grafana-and-influxdb
[10] http://www.slideshare.net/LisaHua/spark-overview-37479609
[11] https://ogirardot.wordpress.com/2015/05/29/rdds-are-the-new-bytecode-of-apache-spark/