Big Data Analytics
with Scala
What is Big Data Analytics?

It’s about doing aggregations and running
complex models on large datasets, offline, in
real time or both.
Lambda Architecture
Blueprint for a Big Data analytics architecture
Big Data Analytics with Scala at SCALA.IO 2013

Map Reduce redux
map : (Km, Vm) → List(Km, Vm)
in Scala : T => List[(K, V)]
reduce : (Km, List(Vm)) → List(Kr, Vr)
in Scala : (K, List[V]) => List[(K, V)]
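The two signatures above can be tried out on ordinary Scala collections. The following is only a single-machine sketch: the names MapReduceSketch, mapper, reducer and run are made up for illustration, and the groupBy call stands in for the shuffle that a real framework performs across machines.

```scala
object MapReduceSketch {
  // map phase: one input record -> list of (key, value) pairs
  def mapper(line: String): List[(String, Int)] =
    line.split("\\s+").filter(_.nonEmpty).map(word => (word, 1)).toList

  // reduce phase: one key with all its values -> list of (key, value) pairs
  def reducer(word: String, counts: List[Int]): List[(String, Int)] =
    List((word, counts.sum))

  // Stand-in for the framework: map, shuffle (group by key), reduce
  def run(lines: List[String]): Map[String, Int] =
    lines
      .flatMap(mapper)
      .groupBy(_._1)
      .toList
      .flatMap { case (word, pairs) => reducer(word, pairs.map(_._2)) }
      .toMap
}
```

Calling MapReduceSketch.run on a list of lines returns the word counts as a Map.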

Big data "Hello World" : Word count
Enter Cascading

Word Count Redux
(Flat)Map -Reduce
import com.twitter.scalding._

class WordCount(args : Args) extends Job(args) {
  TextLine(args("input"))
    .flatMap('line -> 'word) { line : String => line.split("\\s+") }
    .groupBy('word) { group => group.size }
    .write(Tsv(args("output")))
}

SCALDING : Clustering with Mahout
lazy val clust = new StreamingKMeans(
  new FastProjectionSearch(new EuclideanDistanceMeasure, 5, 10),
  args("sloppyclusters").toInt, (10e-6).toFloat)
var count = 0  // incremented inside map below, so it has to be a var
val sloppyClusters =
  TextLine(args("input"))
    .map { str =>
      val vec = str.split("\t").map(_.toDouble)
      val cent = new Centroid(count, new DenseVector(vec))
      count += 1
      cent }
    .unorderedFold[StreamingKMeans, Centroid](clust) { (cl, cent) => cl.cluster(cent); cl }
    .flatMap(c => c.iterator.asScala.toIterable)
SCALDING : Clustering with Mahout
val finalClusters = sloppyClusters.groupAll
  .mapValueStream { centList =>
    lazy val bclusterer = new BallKMeans(
      new BruteSearch(new EuclideanDistanceMeasure),
      args("numclusters").toInt, 100)
    bclusterer.cluster(centList.toList.asJava)
    bclusterer.iterator.asScala }
  .values

Scalding
- Two APIs : Field based API, and Typed API
- Field API : project, map, discard, groupBy…
- Typed API : TypedPipe[T], works like scala.collection.Iterator[T]
- Matrix Library
- ALGEBIRD : Abstract Algebra library … we’ll talk about it later
STORM
- Distributed, fault tolerant, real time stream computation engine.
- Four concepts
- Streams : infinite sequence of tuples
- Spouts : source of streams
- Bolts : process and produce streams. Can do : filtering, aggregations, joins, …
- Topologies : define a flow or network of spouts and bolts.
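To make the four concepts concrete, here is a toy model in plain Scala. It does not use the Storm API: streams are modelled as Iterators, and TopologySketch, sentenceSpout, splitBolt and filterBolt are hypothetical names chosen for this sketch. A real spout would be unbounded; this one is finite so that the example terminates.

```scala
object TopologySketch {
  // Spout: a source of tuples (finite here, unbounded in Storm)
  def sentenceSpout: Iterator[String] =
    Iterator("the cat sat", "the dog sat")

  // Bolt: consumes one stream and produces another
  val splitBolt: Iterator[String] => Iterator[String] =
    _.flatMap(_.split(" ").toList)

  val filterBolt: Iterator[String] => Iterator[String] =
    _.filter(_ != "the")

  // Topology: the wiring of spouts and bolts into a flow
  def topology: Iterator[String] =
    (splitBolt andThen filterBolt)(sentenceSpout)
}
```

Because the stream is an Iterator, tuples flow through both bolts one at a time instead of being materialised between stages.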

Streaming Word Count
TridentTopology topology = new TridentTopology();

TridentState wordCounts =
  topology.newStream("spout1", spout)
    .each(new Fields("sentence"), new Split(), new Fields("word"))
    .groupBy(new Fields("word"))
    .persistentAggregate(new Factory(), new Count(), new Fields("count"))
    .parallelismHint(6);
ScalaStorm by Evan Chan

class SplitSentence extends StormBolt(outputFields = List("word")) {
  def execute(t: Tuple) = t matchSeq {
    case Seq(line: String) =>
      line.split(" ").foreach { word => using anchor t emit (word) }
      t ack
  }
}


SummingBird
Write your job once and run it on Storm and Hadoop
def wordCount[P <: Platform[P]]
  (source: Producer[P, String], store: P#Store[String, Long]) =
  source.flatMap { line => line.split("\\s+").map(_ -> 1L) }
    .sumByKey(store)
trait Platform[P <: Platform[P]] {
  type Source[+T]
  type Store[-K, V]
  type Sink[-T]
  type Service[-K, +V]
  type Plan[T]
}

On Storm

- Source[+T] : Spout[(Long, T)]
- Store[-K, V] : StormStore [K, V]
- Sink[-T] : (T => Future[Unit])
- Service[-K, +V] : StormService[K,V]
- Plan[T] : StormTopology
SummingBird dependencies

• StoreHaus
• Chill
• Scalding
• Algebird
• Tormenta

But
- Can only aggregate values whose combining operation is associative : Monoids!

trait Monoid[V] {
  def zero : V
  def aggregate(left : V, right : V) : V
}
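Why associativity matters can be shown with a small self-contained sketch. It reuses the zero/aggregate naming of the trait above (Algebird's own Monoid uses different method names); longSum, mapMonoid and sumAll are helpers invented for this illustration. Since the operation is associative, partial results computed separately, for example per batch or per machine, can be merged afterwards and give the same answer as one sequential pass.

```scala
object MonoidSketch {
  trait Monoid[V] {
    def zero: V
    def aggregate(left: V, right: V): V
  }

  // Sum of Longs: zero is 0, aggregate is +
  val longSum: Monoid[Long] = new Monoid[Long] {
    def zero: Long = 0L
    def aggregate(left: Long, right: Long): Long = left + right
  }

  // Maps form a monoid whenever their values do: merge key by key
  def mapMonoid[K, V](m: Monoid[V]): Monoid[Map[K, V]] = new Monoid[Map[K, V]] {
    def zero: Map[K, V] = Map.empty
    def aggregate(left: Map[K, V], right: Map[K, V]): Map[K, V] =
      right.foldLeft(left) { case (acc, (k, v)) =>
        acc.updated(k, m.aggregate(acc.getOrElse(k, m.zero), v))
      }
  }

  // Fold any sequence of values with the monoid
  def sumAll[V](values: Seq[V])(m: Monoid[V]): V =
    values.foldLeft(m.zero)(m.aggregate)
}
```

With mapMonoid[String, Long](longSum), summing three partial word-count maps in one pass gives the same result as summing the first map and the remaining two separately and then aggregating the two partial results.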

Clustering with Mahout redux
def StreamClustering[P <: Platform[P]](source : Producer[P, String], store : P#Store[_, _]) {
  lazy val clust = new StreamingKMeans(
    new FastProjectionSearch(new EuclideanDistanceMeasure, 5, 10),
    args("sloppyclusters").toInt, (10e-6).toFloat)
  var count = 0  // incremented inside map below, so it has to be a var
  val sloppyClusters =
    source
      .map { str =>
        val vec = str.split("\t").map(_.toDouble)
        val cent = new Centroid(count, new DenseVector(vec))
        count += 1
        cent }
      .unorderedFold[StreamingKMeans, Centroid](clust) { (cl, cent) => cl.cluster(cent); cl }
      .flatMap(c => c.iterator.asScala.toIterable)
SCALDING : Clustering with Mahout
  val finalClusters = sloppyClusters.groupAll
    .mapValueStream { centList =>
      lazy val bclusterer = new BallKMeans(
        new BruteSearch(new EuclideanDistanceMeasure),
        args("numclusters").toInt, 100)
      bclusterer.cluster(centList.toList.asJava)
      bclusterer.iterator.asScala }
    .values
    .saveTo(store)
}

What is Spark?


Fast and expressive cluster computing system compatible with Apache Hadoop, but an order of magnitude faster

Improves efficiency through:
- General execution graphs
- In-memory storage
Improves usability through:
- Rich APIs in Java, Scala, Python
- Interactive shell
Key idea


Write programs in terms of transformations on distributed datasets
Concept: resilient distributed datasets (RDDs)
- Collections of objects spread across a cluster
- Built through parallel transformations (map, filter, etc)
- Automatically rebuilt on failure
- Controllable persistence (e.g. caching in RAM)
Example: Word Count
Other RDD Operators




Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")                  // base RDD
errors = lines.filter(s => s.startsWith("ERROR"))     // transformed RDD
messages = errors.map(s => s.split("\t"))
messages.cache()

messages.filter(s => s.contains("foo")).count()       // action
messages.filter(s => s.contains("bar")).count()
. . .

[Diagram: the driver sends tasks to three workers and collects the results; each worker reads one block of the file (Block 1, 2, 3) and keeps its part of messages in a cache (Cache 1, 2, 3).]

Result: full-text search of Wikipedia data in 0.5 sec (vs 20 s for on-disk data)
Result: scaled to 1 TB of data in 5 sec (vs 180 sec for on-disk data)
Fault Recovery
RDDs track lineage information that can be
used to efficiently recompute lost data

Ex: msgs = textFile.filter(_.startsWith("ERROR"))
                   .map(_.split("\t"))

[Diagram: HDFS File -> filter (func = _.contains(...)) -> Filtered RDD -> map (func = _.split(...)) -> Mapped RDD]
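The lineage idea can be imitated on a single machine. The sketch below is not Spark code: Dataset, Source, Filtered, Mapped and Cached are names invented here. Each dataset only remembers its parent and the function applied to it, so a cached result that is dropped can be rebuilt by replaying that chain from the source.

```scala
object LineageSketch {
  // A dataset is described by how to compute it, not by its contents
  sealed trait Dataset[A] {
    def compute(): List[A] // replays the lineage from the source
    def map[B](f: A => B): Dataset[B] = Mapped(this, f)
    def filter(p: A => Boolean): Dataset[A] = Filtered(this, p)
  }
  final case class Source[A](load: () => List[A]) extends Dataset[A] {
    def compute(): List[A] = load()
  }
  final case class Filtered[A](parent: Dataset[A], p: A => Boolean) extends Dataset[A] {
    def compute(): List[A] = parent.compute().filter(p)
  }
  final case class Mapped[A, B](parent: Dataset[A], f: A => B) extends Dataset[B] {
    def compute(): List[B] = parent.compute().map(f)
  }

  // In-memory cache that can be dropped to simulate a lost partition
  final class Cached[A](parent: Dataset[A]) {
    private var data: Option[List[A]] = None
    def get(): List[A] = {
      if (data.isEmpty) data = Some(parent.compute())
      data.get
    }
    def lose(): Unit = data = None
  }
}
```

Wrapping a filter-then-map chain in Cached, calling get(), then lose(), then get() again returns the same elements both times; the second call goes back to the source and recomputes them.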
Spark Streaming

- Extends Spark capabilities to large scale stream processing
- Scales to 100s of nodes and achieves second scale latencies
- Efficient and fault-tolerant stateful stream processing
- Simple batch-like API for implementing complex algorithms
Discretized Stream Processing
- Chop up the live stream into batches of X seconds
- Spark treats each batch of data as an RDD and processes it using RDD operations
- Finally, the processed results of the RDD operations are returned in batches

[Diagram: live data stream -> Spark Streaming -> batches of X seconds -> Spark -> processed results]


Discretized Stream Processing
- Batch sizes as low as ½ second, latency of about 1 second
- Potential for combining batch processing and streaming processing in the same system

[Diagram: live data stream -> Spark Streaming -> batches of X seconds -> Spark -> processed results]
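The discretization step can be sketched without Spark. In the example below the names MicroBatchSketch, Record, discretize and wordCountPerBatch are hypothetical, and arrival times are plain millisecond values. The stream is chopped into fixed-length batches, and the same batch computation (a word count) is then applied to each batch.

```scala
object MicroBatchSketch {
  // Each record carries the time (in ms) at which it arrived
  final case class Record(timeMs: Long, value: String)

  // Chop the stream into consecutive batches of batchMs milliseconds
  def discretize(stream: Seq[Record], batchMs: Long): Seq[(Long, Seq[String])] =
    stream
      .groupBy(r => r.timeMs / batchMs)
      .toSeq
      .sortBy(_._1)
      .map { case (batch, records) => (batch, records.map(_.value)) }

  // Run the same batch computation on every batch
  def wordCountPerBatch(stream: Seq[Record], batchMs: Long): Seq[(Long, Map[String, Int])] =
    discretize(stream, batchMs).map { case (batch, values) =>
      (batch, values.groupBy(identity).map { case (w, ws) => (w, ws.size) })
    }
}
```

The per-batch function is ordinary batch code, which is the point of the model: the batch logic and the streaming logic can be the same.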


Example – Get hashtags from Twitter
val tweets = ssc.twitterStream()

DStream: a sequence of RDDs representing a stream of data

[Diagram: Twitter Streaming API -> tweets DStream, one batch each at t, t+1, t+2; every batch is stored in memory as an RDD (immutable, distributed)]
Example – Get hashtags from Twitter
val tweets = ssc.twitterStream()

val hashTags = tweets.flatMap (status => getTags(status))
transformation: modify data in one DStream to create another DStream (hashTags is a new DStream)

[Diagram: for each batch (t, t+1, t+2), flatMap turns the batch of the tweets DStream into a batch of the hashTags DStream, e.g. [#cat, #dog, … ]; new RDDs are created for every batch]
Example – Get hashtags from Twitter
val tweets = ssc.twitterStream()

val hashTags = tweets.flatMap (status => getTags(status))
hashTags.foreach(hashTagRDD => { ... })

foreach: do whatever you want with the processed data

[Diagram: for each batch (t, t+1, t+2), flatMap turns the tweets DStream into the hashTags DStream, then foreach runs on each hashTags batch: write to database, update analytics UI, do whatever you want]

Example – Get hashtags from Twitter
val tweets = ssc.twitterStream()

val hashTags = tweets.flatMap (status => getTags(status))
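hashTags.saveAsHadoopFiles("hdfs://...")

output operation: to push data to external storage

[Diagram: for each batch (t, t+1, t+2), flatMap turns the tweets DStream into the hashTags DStream, then save writes it out; every batch is saved to HDFS]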
Window-based Transformations
val tweets = ssc.twitterStream()

val hashTags = tweets.flatMap (status => getTags(status))
val tagCounts = hashTags.window(Minutes(1), Seconds(5)).countByValue()

sliding window operation: window(window length, sliding interval)

[Diagram: a window of the given length moves over the DStream of data, advancing by the sliding interval each time]
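The windowed count can be mimicked on in-memory batches. This is a sketch rather than the Spark Streaming implementation: WindowSketch and countByValueAndWindow are illustrative names, and window length and sliding interval are expressed as numbers of batches instead of durations.

```scala
object WindowSketch {
  // batches: one Seq of hashtags per batch interval, oldest first.
  // windowLen and slide are given in numbers of batches.
  def countByValueAndWindow(
      batches: Seq[Seq[String]],
      windowLen: Int,
      slide: Int): Seq[Map[String, Int]] =
    batches.indices
      .filter(i => (i + 1) % slide == 0) // a window is emitted every `slide` batches
      .map { end =>
        // the window covers the last windowLen batches up to and including `end`
        val window = batches.slice(math.max(0, end + 1 - windowLen), end + 1).flatten
        window.groupBy(identity).map { case (tag, hits) => (tag, hits.size) }
      }
}
```

Assuming 5-second batches, the one-minute window sliding every 5 seconds from the slide would correspond to windowLen = 12 and slide = 1 in this sketch.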
Compute TopK Ip addresses
val ssc = new StreamingContext(master, "AlgebirdCMS", Seconds(10), …)
val stream = ssc.KafkaStream(None, filters, StorageLevel.MEMORY, ..)
val addresses = stream.map(ipAddress => ipAddress.getText)

val cms = new CountMinSketchMonoid(EPS, DELTA, SEED, PERC)
var globalCMS = cms.zero  // var: merged with every batch below; starting from the empty sketch is assumed
val mm = new MapMonoid[Long, Int]()
// One sketch per element, merged into a single sketch per batch
val topAddresses = addresses.mapPartitions(ids => {
  ids.map(id => cms.create(id))
}).reduce(_ ++ _)

topAddresses.foreach(rdd => {
  if (rdd.count() != 0) {
    val partial = rdd.first()
    // heavyHitters is assumed here, following Spark's Algebird CMS example
    val partialTopK = partial.heavyHitters.map(id =>
      (id, partial.frequency(id).estimate))
      .toSeq.sortBy(_._2).reverse.slice(0, TOPK)
    globalCMS ++= partial
    val globalTopK = globalCMS.heavyHitters.map(id =>
      (id, globalCMS.frequency(id).estimate))
      .toSeq.sortBy(_._2).reverse.slice(0, TOPK)
    println(globalTopK.mkString("[", ",", "]"))
  }
})

Multi purpose analytics stack









- Almost the same API for batch and streaming
- Single platform with fewer moving parts
- Order of magnitude faster
References
Sam Ritchie : SummingBird
Chris Severs, Vitaly Gordon : Scalable Machine Learning with Scala
Apache Spark :

Matei Zaharia : Parallel Programming with Spark

  • 1. Big Data Analytics with Scala Sam BESSALAH @samklr
  • 2. What is Big Data Analytics? It’s about doing aggregations and running complex models on large datasets, offline, in real time or both.
  • 3. Lambda Architecture Blueprint for a Big Data analytics architecture
  • 8. Map Reduce redux map : (Km, Vm) → List(Km, Vm) in Scala : T => List[(K,V)] reduce : (Km, List(Vm)) → List(Kr, Vr) in Scala : (K, List[V]) => List[(K,V)]
  • 10. Big data "Hello World" : Word count
  • 14. SCALDING class WordCount(args : Args) extends Job(args) { TextLine(args("input")) .flatMap('line -> 'word) { line : String => line.split("\\s+") } .groupBy('word) { group => group.size } .write(Tsv(args("output"))) }
  • 15. SCALDING : Clustering with Mahout lazy val clust = new StreamingKMeans(new FastProjectionSearch( new EuclideanDistanceMeasure, 5, 10), args("sloppyclusters").toInt, (10e-6).toFloat) var count = 0; val sloppyClusters = TextLine(args("input")) .map { str => val vec = str.split("\t").map(_.toDouble); val cent = new Centroid(count, new DenseVector(vec)); count += 1; cent } .unorderedFold[StreamingKMeans, Centroid](clust) { (cl, cent) => cl.cluster(cent); cl } .flatMap(c => c.iterator.asScala.toIterable)
  • 16. SCALDING : Clustering with Mahout val finalClusters = sloppyClusters.groupAll .mapValueStream { centList => lazy val bclusterer = new BallKMeans(new BruteSearch( new EuclideanDistanceMeasure), args("numclusters").toInt, 100) bclusterer.cluster(centList.toList.asJava) bclusterer.iterator.asScala } .values
  • 17. Scalding - Two APIs : Field based API, and Typed API - Field API : project, map, discard , groupBy… - Typed API : TypedPipe[T], works like scala.collection.Iterator[T] - Matrix Library - ALGEBIRD : Abstract Algebra library … we’ll talk about it later
  • 19. STORM
  • 20. - Distributed, fault tolerant, real time stream computation engine. - Four concepts - Streams : infinite sequence of tuples - Spouts : source of streams - Bolts : process and produce streams. Can do : filtering, aggregations, joins, … - Topologies : define a flow or network of spouts and bolts.
  • 23. Trident TridentTopology topology = new TridentTopology(); TridentState wordCounts = topology.newStream("spout1", spout) .each(new Fields("sentence"), new Split(), new Fields("word")) .groupBy(new Fields("word")) .persistentAggregate(new Factory(), new Count(), new Fields("count")) .parallelismHint(6);
  • 24. ScalaStorm by Evan Chan class SplitSentence extends StormBolt(outputFields = List("word")) { def execute(t: Tuple) = t matchSeq { case Seq(line: String) => line.split(" ").foreach { word => using anchor t emit (word) }; t ack } }
  • 26. SummingBird Write your job once and run it on Storm and Hadoop
  • 27. def wordCount[P <: Platform[P]] (source: Producer[P, String], store: P#Store[String, Long]) = source.flatMap { line => line.split("\\s+").map(_ -> 1L) } .sumByKey(store)
  • 28. SummingBird trait Platform[P <: Platform[P]] { type Source[+T] type Store[-K, V] type Sink[-T] type Service[-K, +V] type Plan[T] }
  • 29. On Storm - Source[+T] : Spout[(Long, T)] - Store[-K, V] : StormStore [K, V] - Sink[-T] : (T => Future[Unit]) - Service[-K, +V] : StormService[K,V] - Plan[T] : StormTopology
  • 31. SummingBird dependencies • StoreHaus • Chill • Scalding • Algebird • Tormenta
  • 32. But - Can only aggregate values whose combining operation is associative : Monoids! trait Monoid[V] { def zero : V def aggregate(left : V, right : V) : V }
  • 34. Clustering with Mahout redux def StreamClustering[P <: Platform[P]](source : Producer[P, String], store : P#Store[_, _]) { lazy val clust = new StreamingKMeans(new FastProjectionSearch( new EuclideanDistanceMeasure, 5, 10), args("sloppyclusters").toInt, (10e-6).toFloat) var count = 0; val sloppyClusters = source .map { str => val vec = str.split("\t").map(_.toDouble); val cent = new Centroid(count, new DenseVector(vec)); count += 1; cent } .unorderedFold[StreamingKMeans, Centroid](clust) { (cl, cent) => cl.cluster(cent); cl } .flatMap(c => c.iterator.asScala.toIterable)
  • 35. SCALDING : Clustering with Mahout val finalClusters = sloppyClusters.groupAll .mapValueStream { centList => lazy val bclusterer = new BallKMeans(new BruteSearch( new EuclideanDistanceMeasure), args("numclusters").toInt, 100) bclusterer.cluster(centList.toList.asJava) bclusterer.iterator.asScala } .values .saveTo(store) }
  • 37. What is Spark? Fast and expressive cluster computing system compatible with Apache Hadoop, but an order of magnitude faster. Improves efficiency through: - General execution graphs - In-memory storage Improves usability through: - Rich APIs in Java, Scala, Python - Interactive shell
  • 38. Key idea • • Write programs in terms of transformations on distributed datasets Concept: resilient distributed datasets (RDDs) - Collections of objects spread across a cluster - Built through parallel transformations (map, filter, etc) - Automatically rebuilt on failure - Controllable persistence (e.g. caching in RAM)
  • 41. Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns. lines = spark.textFile("hdfs://...") errors = lines.filter(s => s.startsWith("ERROR")) messages = errors.map(s => s.split("\t")) messages.cache() messages.filter(s => s.contains("foo")).count() messages.filter(s => s.contains("bar")).count() . . . [Diagram: base RDD -> transformed RDD -> action; the driver sends tasks to three workers and collects results; each worker reads one block (Block 1, 2, 3) and keeps a cache (Cache 1, 2, 3).] Result: full-text search of Wikipedia data in 0.5 sec (vs 20 s for on-disk data). Result: scaled to 1 TB of data in 5 sec (vs 180 sec for on-disk data).
  • 42. Fault Recovery RDDs track lineage information that can be used to efficiently recompute lost data. Ex: msgs = textFile.filter(_.startsWith("ERROR")) .map(_.split("\t")) [Diagram: HDFS File -> filter (func = _.contains(...)) -> Filtered RDD -> map (func = _.split(...)) -> Mapped RDD]
  • 43. Spark Streaming - Extends Spark capabilities to large scale stream processing. - Scales to 100s of nodes and achieves second scale latencies -Efficient and fault-tolerant stateful stream processing - Simple batch-like API for implementing complex algorithms
  • 44. Discretized Stream Processing - Chop up the live stream into batches of X seconds - Spark treats each batch of data as an RDD and processes it using RDD operations - Finally, the processed results of the RDD operations are returned in batches [Diagram: live data stream -> Spark Streaming -> batches of X seconds -> Spark -> processed results]
  • 45. Discretized Stream Processing - Batch sizes as low as ½ second, latency of about 1 second - Potential for combining batch processing and streaming processing in the same system [Diagram: live data stream -> Spark Streaming -> batches of X seconds -> Spark -> processed results]
  • 46. Example – Get hashtags from Twitter val tweets = ssc.twitterStream() DStream: a sequence of RDDs representing a stream of data Twitter Streaming API batch @ t batch @ t+1 batch @ t+2 tweets DStream stored in memory as an RDD (immutable, distributed)
  • 47. Example – Get hashtags from Twitter val tweets = ssc.twitterStream() val hashTags = tweets.flatMap (status => getTags(status)) new DStream transformation: modify data in one DStream to create another DStream batch @ t batch @ t+1 batch @ t+2 tweets DStream hashTags Dstream [#cat, #dog, … ] flatMap flatMap … flatMap new RDDs created for every batch
  • 48. Example – Get hashtags from Twitter val tweets = ssc.twitterStream() val hashTags = tweets.flatMap (status => getTags(status)) hashTags.foreach(hashTagRDD => { ... }) foreach: do whatever you want with the processed data batch @ t batch @ t+1 batch @ t+2 tweets DStream flatMap hashTags DStream flatMap flatMap foreach foreach foreach Write to database, update analytics UI, do whatever you want
  • 49. Example – Get hashtags from Twitter val tweets = ssc.twitterStream() val hashTags = tweets.flatMap (status => getTags(status)) hashTags.saveAsHadoopFiles("hdfs://...") output operation: to push data to external storage batch @ t batch @ t+1 batch @ t+2 tweets DStream flatMap flatMap flatMap save save save hashTags DStream every batch saved to HDFS
  • 50. Window-based Transformations val tweets = ssc.twitterStream() val hashTags = tweets.flatMap (status => getTags(status)) val tagCounts = hashTags.window(Minutes(1), Seconds(5)).countByValue() sliding window operation window length sliding interval window length DStream of data sliding interval
  • 51. Compute TopK Ip addresses val ssc = new StreamingContext(master, "AlgebirdCMS", Seconds(10), …) val stream = ssc.KafkaStream(None, filters, StorageLevel.MEMORY, ..) val addresses = stream.map(ipAddress => ipAddress.getText) val cms = new CountMinSketchMonoid(EPS, DELTA, SEED, PERC) var globalCMS = cms.zero val mm = new MapMonoid[Long, Int]() val topAddresses = addresses.mapPartitions(ids => { ids.map(id => cms.create(id)) }) .reduce(_ ++ _)
  • 52. topAddresses.foreach(rdd => { if (rdd.count() != 0) { val partial = rdd.first() val partialTopK = partial.heavyHitters.map(id => (id, partial.frequency(id).estimate)) .toSeq.sortBy(_._2).reverse.slice(0, TOPK) globalCMS ++= partial val globalTopK = globalCMS.heavyHitters.map(id => (id, globalCMS.frequency(id).estimate)) .toSeq.sortBy(_._2).reverse.slice(0, TOPK) println(globalTopK.mkString("[", ",", "]")) } })
  • 53. Multi purpose analytics stack MLBASE TACHYON Stream Processing Spark + Shark + Spark Streaming Batch Processing Ad-hoc Queries GraphX BLINK DB
  • 54. SPARK / SPARK STREAMING - Almost the same API for batch and streaming - Single platform with fewer moving parts - Order of magnitude faster
  • 55. References Sam Ritchie : SummingBird Chris Severs, Vitaly Gordon : Scalable Machine Learning with Scala Apache Spark : Matei Zaharia : Parallel Programming with Spark