This document provides an overview of big data analytics with Scala, including common frameworks and techniques. It discusses the Lambda architecture, MapReduce, word counting examples, Scalding for batch and streaming jobs, Apache Storm, Trident, Summingbird for unified batch and streaming, and Apache Spark for fast cluster computing with resilient distributed datasets. It also covers clustering with Mahout, streaming word counting, and analytics platforms that combine batch and stream processing.
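For the word-counting example mentioned above, here is a minimal sketch of the classic Scalding word count using the fields-based API; the input and output paths are hypothetical job arguments, not taken from the document:

```scala
import com.twitter.scalding._

// Classic fields-based Scalding word count; typically run via com.twitter.scalding.Tool,
// passing --input and --output as job arguments.
class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))
    .flatMap('line -> 'word) { line: String => line.toLowerCase.split("\\s+") }
    .groupBy('word) { _.size }
    .write(Tsv(args("output")))
}
```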
Big Data with Hadoop & Spark Training: http://bit.ly/2LCTufA This CloudxLab Introduction to SparkR tutorial helps you to understand SparkR in detail. Below are the topics covered in this tutorial: 1) SparkR (R on Spark) 2) SparkR DataFrames 3) Launch SparkR 4) Creating DataFrames from Local DataFrames 5) DataFrame Operation 6) Creating DataFrames - From JSON 7) Running SQL Queries from SparkR
The document discusses Spark's DataFrame API and the Tungsten project. DataFrames make Spark accessible to different users by providing a common API across languages like Python, R and Scala. Tungsten aims to improve Spark's performance for the next five years through techniques like runtime code generation and off-heap memory management. Initial results show Tungsten doubling performance. Together, DataFrames and Tungsten will help Spark scale to larger data and queries across different languages and execution backends.
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai - To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Big Data with Hadoop & Spark Training: http://bit.ly/2sh5b3E This CloudxLab Hadoop Streaming tutorial helps you to understand Hadoop Streaming in detail. Below are the topics covered in this tutorial: 1) Hadoop Streaming and Why Do We Need it? 2) Writing Streaming Jobs 3) Testing Streaming jobs and Hands-on on CloudxLab
Spark Streaming allows processing of live data streams using the Spark framework. This document discusses using Spark Streaming to process event streams from Meetup.com, including RSVP data and event metadata. It describes extracting features from event descriptions, clustering events based on these features, and using the results to recommend connections between Meetup members with similar interests.
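As a rough illustration of the pipeline described above, here is a minimal Spark Streaming sketch that reads pre-extracted feature vectors and clusters them incrementally with MLlib's StreamingKMeans; the socket source, host/port, batch interval, and feature dimensionality are assumptions, not details from the talk:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MeetupClusteringSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("meetup-clustering").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(10))

    // Assume each line is a comma-separated feature vector extracted from an event description.
    val features = ssc.socketTextStream("localhost", 9999)
      .map(line => Vectors.dense(line.split(",").map(_.toDouble)))

    // Cluster events incrementally as new batches arrive (5-dimensional features assumed).
    val model = new StreamingKMeans()
      .setK(10)
      .setDecayFactor(1.0)
      .setRandomCenters(5, 0.0)

    model.trainOn(features)
    model.predictOn(features).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```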
This was the first session about Hadoop and MapReduce. It introduces what Hadoop is and its main components. It also covers how to program your first MapReduce task and how to run it on a pseudo-distributed Hadoop installation. The session was given in Arabic, and I may provide a video for it soon.
Sparkling Water provides a transparent integration of H2O algorithms and data structures into the Spark ecosystem. It allows users to use H2O machine learning algorithms on data stored in Spark and HDFS. The presentation demonstrates loading weather and flight data using Spark and H2O APIs, building regression models to predict flight delays, and accessing prediction results from R for residual analysis. Sparkling Water applications can be developed and run as standalone jobs by creating a SparkContext and H2OContext and submitting to a Spark cluster.
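A minimal sketch of the standalone-job setup described above; the exact entry points vary across Sparkling Water versions, and the file path and column layout here are hypothetical:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.h2o.H2OContext

object SparklingWaterSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("sparkling-water-demo").getOrCreate()

    // Start an H2O cloud inside the Spark cluster; older Sparkling Water versions
    // take the SparkSession/SparkContext as an argument here.
    val h2oContext = H2OContext.getOrCreate()

    // Hypothetical input; the talk used weather and flight data.
    val flights = spark.read.option("header", "true").csv("hdfs:///data/flights.csv")

    // Convert the Spark DataFrame into an H2O frame for H2O's algorithms.
    val flightsHF = h2oContext.asH2OFrame(flights)
    // ... train a regression model on flightsHF with an H2O estimator,
    // then inspect predictions/residuals from R as described above ...
  }
}
```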
Talk that I gave at #BcnDevCon13 about using Scalding and the strong points of using Scala for big data processing.
This is a deck of slides from a recent meetup of AWS Usergroup Greece, presented by Ioannis Konstantinou from the National Technical University of Athens. The presentation gives an overview of the MapReduce framework and a description of its open-source implementation (Hadoop). Amazon's own Elastic MapReduce (EMR) service is also mentioned. With the growing interest in Big Data, this is a good introduction to the subject.
Big Data with Hadoop & Spark Training: http://bit.ly/2kyRTuW This CloudxLab Advanced Spark Programming tutorial helps you to understand Advanced Spark Programming in detail. Below are the topics covered in this slide: 1) Shared Variables - Accumulators & Broadcast Variables 2) Accumulators and Fault Tolerance 3) Custom Accumulators - Version 1.x & Version 2.x 4) Examples of Broadcast Variables 5) Key Performance Considerations - Level of Parallelism 6) Serialization Format - Kryo 7) Memory Management 8) Hardware Provisioning
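A short sketch of the shared variables covered in this tutorial, using the 2.x accumulator API and a broadcast lookup table; the input path and record layout are invented for illustration:

```scala
import org.apache.spark.sql.SparkSession

object SharedVariablesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("shared-vars").getOrCreate()
    val sc = spark.sparkContext

    // Accumulator (2.x API): count malformed records without a separate pass.
    val badRecords = sc.longAccumulator("badRecords")

    // Broadcast variable: ship a small lookup table to every executor once.
    val countryNames = sc.broadcast(Map("FR" -> "France", "DE" -> "Germany"))

    val lines = sc.textFile("hdfs:///data/events.txt")
    val parsed = lines.flatMap { line =>
      val fields = line.split(",")
      if (fields.length < 2) { badRecords.add(1); None }
      else Some(countryNames.value.getOrElse(fields(1), "unknown"))
    }

    parsed.count()   // an action triggers the accumulator updates
    println(s"bad records: ${badRecords.value}")
  }
}
```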
This document summarizes a presentation about productionizing streaming jobs with Spark Streaming. It discusses: 1. The lifecycle of a Spark streaming application including how data is received in batches and processed through transformations. 2. Best practices for aggregations including reducing over windows, incremental aggregation, and checkpointing. 3. How to achieve high throughput by increasing parallelism through more receivers and partitions. 4. Tips for debugging streaming jobs using the Spark UI and ensuring processing time is less than the batch interval.
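A hedged sketch of the incremental windowed aggregation and checkpointing practices described above; the window lengths, socket source, and checkpoint path are illustrative assumptions:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowedCountsSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("windowed-counts")
    val ssc  = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint("hdfs:///checkpoints/windowed-counts")   // required for the inverse-reduce form

    val events = ssc.socketTextStream("localhost", 9999)

    // Incremental aggregation: add new batches and subtract batches leaving the window,
    // instead of recomputing the whole 10-minute window every 5 seconds.
    val counts = events
      .map(word => (word, 1L))
      .reduceByKeyAndWindow(_ + _, _ - _, Seconds(600), Seconds(5))

    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```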
Cascalog is an internal DSL for Clojure that allows defining MapReduce workflows for Hadoop. It provides helper functions, a way to define custom functions analogous to UDFs, and functions to programmatically generate all possible data aggregations from an input based on business requirements. The workflows can be unit tested and executed on Hadoop. Cascalog abstracts away lower-level MapReduce details and allows defining the entire workflow within a single language.
Big Data with Hadoop & Spark Training: http://bit.ly/2sm9c61 This CloudxLab Introduction to Spark SQL & DataFrames tutorial helps you to understand Spark SQL & DataFrames in detail. Below are the topics covered in this slide: 1) Loading XML 2) What is RPC - Remote Process Call 3) Loading AVRO 4) Data Sources - Parquet 5) Creating DataFrames From Hive Table 6) Setting up Distributed SQL Engine
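A brief sketch of the data-source loading covered in the tutorial (JSON, Parquet, and Hive tables); the paths and the Hive table name are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

object DataSourcesSketch {
  def main(args: Array[String]): Unit = {
    // enableHiveSupport lets spark.table / spark.sql see the Hive metastore.
    val spark = SparkSession.builder()
      .appName("data-sources")
      .enableHiveSupport()
      .getOrCreate()

    val fromJson    = spark.read.json("hdfs:///data/people.json")       // schema inferred from JSON
    val fromParquet = spark.read.parquet("hdfs:///data/events.parquet") // columnar, self-describing
    val fromHive    = spark.table("analytics.daily_visits")             // existing Hive table

    fromJson.printSchema()
    fromParquet.createOrReplaceTempView("events")
    spark.sql("SELECT count(*) FROM events").show()
    fromHive.show(5)
  }
}
```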
Mapreduce examples starting from the basic WordCount to a more complex K-means algorithm. The code contained in these slides is available at https://github.com/andreaiacono/MapReduce
The document discusses different approaches to sorting in MapReduce frameworks over time. It describes Hadoop versions 0.10 through 0.22, where sorting was handled by buffering records in memory, spilling to disk when thresholds were exceeded, and merging the spilled files. Later versions improved on this by distributing the sorting work across maps and making the memory footprint more predictable.
Video: https://www.youtube.com/watch?v=kkOG_aJ9KjQ This talk gives details about Spark internals and an explanation of the runtime behavior of a Spark application. It explains how high level user programs are compiled into physical execution plans in Spark. It then reviews common performance bottlenecks encountered by Spark users, along with tips for diagnosing performance problems in a production application.
In this talk I discuss my recent experience working with Spark DataFrames and the Spark TimeSeries library. For DataFrames, the focus is on usability: a lot of the documentation does not cover common use cases such as the intricacies of creating DataFrames, adding or manipulating individual columns, and doing quick-and-dirty analytics. For the time series library, I dive into the kinds of use cases it supports and why it's actually super useful.
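As a rough illustration of the DataFrame usability points above, a small sketch covering creating a DataFrame from a local collection, adding and renaming columns, and quick-and-dirty analytics; the data and column names are invented:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object DataFrameUsabilitySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("df-usability").getOrCreate()
    import spark.implicits._

    // Creating a DataFrame from a local collection.
    val trips = Seq(("a", 12.5, 3), ("b", 7.0, 1), ("a", 30.2, 5)).toDF("driver", "km", "stops")

    // Adding / manipulating individual columns.
    val enriched = trips
      .withColumn("km_per_stop", $"km" / $"stops")
      .withColumnRenamed("driver", "driver_id")

    // Quick and dirty analytics.
    enriched.describe("km", "km_per_stop").show()
    enriched.groupBy("driver_id").agg(sum("km").as("total_km")).show()
  }
}
```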
Date: 16th November 2017. Location: Fast Data Theatre. Time: 12:30 - 13:00. Speaker: Gerard Maas. Organisation: Lightbend.
Apache Spark is a fast and general-purpose cluster computing system for large-scale data processing. It improves on MapReduce by allowing data to be kept in memory across jobs, enabling faster iterative jobs. Spark consists of a core engine along with libraries for SQL, streaming, machine learning, and graph processing. The document discusses new APIs in Spark including DataFrames, which provide a tabular interface like in R/Python, and data sources, which allow plugging external data systems into Spark. These changes aim to make Spark easier for data scientists to use at scale.
After migrating a three-year-old C# project to Java, we ended up with a significant portion of legacy code using lambdas in Java. We cover some of the good use cases, code that could have been written better, and the problems we had migrating from C#. At the end we look at the performance implications of using lambdas.
Paris Scala Group event, May 2019: No more struggles with Apache Spark workloads in production. Topics: Apache Spark's primary data structures (RDD, Dataset, DataFrame); a pragmatic explanation of executors, cores, containers, stages, jobs, and tasks in Spark; parallel reads from JDBC, with challenges and best practices; Bulk Load API vs JDBC write; an optimization strategy for joins (SortMergeJoin vs BroadcastHashJoin); avoiding unnecessary shuffle; an alternative to Spark's default sort; why dropDuplicates() doesn't guarantee consistent results, and what the alternative is; optimizing the Spark stage generation plan; predicate pushdown with partitioning and bucketing; and why not to use Scala's concurrent Future explicitly.
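A small sketch of the join-strategy point above, showing how hinting the small side switches the physical plan from SortMergeJoin to BroadcastHashJoin; the table paths and join key are hypothetical:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object JoinStrategySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("join-strategy").getOrCreate()

    val orders    = spark.read.parquet("hdfs:///data/orders")      // large fact table
    val countries = spark.read.parquet("hdfs:///data/countries")   // small dimension table

    // Default for two large inputs: SortMergeJoin, which shuffles both sides.
    val smj = orders.join(countries, "country_code")

    // Hinting the small side avoids the shuffle entirely: BroadcastHashJoin.
    val bhj = orders.join(broadcast(countries), "country_code")

    smj.explain()   // look for SortMergeJoin in the physical plan
    bhj.explain()   // look for BroadcastHashJoin in the physical plan
  }
}
```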
This is a quick introduction to Scalding and Monoids. Scalding is a Scala library that makes writing MapReduce jobs very easy. Monoids, on the other hand, promise parallelism and quality, and they make some more challenging algorithms look very easy. The talk was held at the Helsinki Data Science meetup on January 9th, 2014.
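To make the monoid idea concrete, here is a minimal plain-Scala sketch (not the Algebird/Scalding types from the talk) showing why an associative combine lets partial results from different mappers or partitions be merged in any grouping:

```scala
// A minimal Monoid typeclass: an associative combine plus an identity element.
trait Monoid[A] {
  def zero: A
  def plus(x: A, y: A): A
}

object Monoid {
  implicit val longSum: Monoid[Long] = new Monoid[Long] {
    val zero = 0L
    def plus(x: Long, y: Long): Long = x + y
  }

  // Map values combine pointwise - exactly what a distributed word count needs.
  implicit def mapMonoid[K, V](implicit m: Monoid[V]): Monoid[Map[K, V]] =
    new Monoid[Map[K, V]] {
      val zero = Map.empty[K, V]
      def plus(x: Map[K, V], y: Map[K, V]): Map[K, V] =
        y.foldLeft(x) { case (acc, (k, v)) =>
          acc.updated(k, m.plus(acc.getOrElse(k, m.zero), v))
        }
    }
}

object MonoidDemo {
  // Because plus is associative, partial results can be combined in any order and still agree.
  def sumAll[A](xs: Seq[A])(implicit m: Monoid[A]): A = xs.foldLeft(m.zero)(m.plus)

  def main(args: Array[String]): Unit = {
    val partialCounts = Seq(Map("spark" -> 2L, "scala" -> 1L), Map("scala" -> 3L))
    println(sumAll(partialCounts))   // Map(spark -> 2, scala -> 4)
  }
}
```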
Knoldus organized a Meetup on 1 April 2015. In this Meetup, we introduced Spark with Scala. Apache Spark is a fast and general engine for large-scale data processing. Spark is used at a wide range of organizations to process large datasets.
This document provides an overview of Spark Streaming concepts, including:
- Streams are sequences of data elements made available over time that can be accessed sequentially.
- Stream processing involves continuously and concurrently processing live data streams in micro-batches.
- Spark Streaming provides scalable and fault-tolerant stream processing using a micro-batch architecture, where streams are divided into batches that are processed through transformations on resilient distributed datasets (RDDs).
- Transformations on DStreams apply operations like map, filter, and reduce to the underlying RDDs of each batch.
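A minimal sketch of the micro-batch model described above, showing DStream transformations and foreachRDD exposing the RDD behind each batch; the socket source, batch interval, and record layout are illustrative assumptions:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MicroBatchSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("micro-batch").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(2))   // each micro-batch covers 2 seconds

    val lines = ssc.socketTextStream("localhost", 9999)

    // DStream transformations are applied to the RDD behind every batch.
    val errorCounts = lines
      .filter(_.contains("ERROR"))
      .map(line => (line.split(" ")(0), 1))   // key by the first field, e.g. a service name
      .reduceByKey(_ + _)

    // foreachRDD exposes the underlying RDD of each batch directly.
    errorCounts.foreachRDD { (rdd, time) =>
      println(s"batch at $time produced ${rdd.count()} distinct keys")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```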
This document discusses refactoring Java code to Clojure using macros. It provides examples of refactoring Java code that uses method chaining to equivalent Clojure code using the threading macros (->> and -<>). It also discusses other Clojure features like type hints, the doto macro, and polyglot projects using Leiningen.
Defining customized scalable aggregation logic is one of Apache Spark’s most powerful features. User Defined Aggregate Functions (UDAF) are a flexible mechanism for extending both Spark data frames and Structured Streaming with new functionality ranging from specialized summary techniques to building blocks for exploratory data analysis.
This document summarizes a user's journey developing a custom aggregation function for Apache Spark based on a T-Digest sketch. The user initially implemented it as a User Defined Aggregate Function (UDAF) but ran into performance problems caused by excessive serialization/deserialization of the aggregation buffer. They resolved this by reimplementing the function as a custom Aggregator using Spark 3.0's new aggregation APIs, which avoided the unnecessary serialization and yielded roughly a 70x performance improvement. The story highlights the importance of understanding how custom functions interact with Spark's execution model and optimizations such as avoiding excessive serialization.
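A hedged sketch of the Spark 3.0 Aggregator approach described above, using a simple running mean as a stand-in for the T-Digest sketch from the talk:

```scala
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.functions

// Buffer for a running mean; stands in for the T-Digest sketch in the talk.
case class MeanBuffer(sum: Double, count: Long)

object MeanAggregator extends Aggregator[Double, MeanBuffer, Double] {
  def zero: MeanBuffer = MeanBuffer(0.0, 0L)
  def reduce(b: MeanBuffer, x: Double): MeanBuffer = MeanBuffer(b.sum + x, b.count + 1)
  def merge(a: MeanBuffer, b: MeanBuffer): MeanBuffer = MeanBuffer(a.sum + b.sum, a.count + b.count)
  def finish(b: MeanBuffer): Double = if (b.count == 0) Double.NaN else b.sum / b.count
  // The buffer stays a typed object between updates; no per-row SerDe round trip as in the old UDAF API.
  def bufferEncoder: Encoder[MeanBuffer] = Encoders.product[MeanBuffer]
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

object AggregatorSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("aggregator-sketch").getOrCreate()
    import spark.implicits._

    val df = Seq(("a", 1.0), ("a", 3.0), ("b", 10.0)).toDF("key", "value")

    // Spark 3.0: wrap the Aggregator as an untyped UDAF usable from DataFrames and SQL.
    val meanUdaf = functions.udaf(MeanAggregator)
    df.groupBy("key").agg(meanUdaf($"value").as("mean_value")).show()
  }
}
```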
This document provides an agenda and overview for a Spark workshop covering Spark basics and streaming. The agenda includes sections on Scala, Spark, Spark SQL, and Spark Streaming. It discusses Scala concepts like vals, vars, defs, classes, objects, and pattern matching. It also covers Spark RDDs, transformations, actions, sources, and the spark-shell. Finally, it briefly introduces Spark concepts like broadcast variables, accumulators, and spark-submit.
The document discusses Apache Spark, an open-source cluster computing framework. It describes Spark's core components like Spark SQL, MLlib, and GraphX. It provides examples of using Spark from Python and Scala for word count tasks and joining datasets. It also demonstrates running Spark interactively on a Spark REPL and deploying Spark on Amazon EMR. Key points are that Spark can handle batch, interactive, and real-time processing and integrates with Python, Scala, and Java while programming at a higher level of abstraction than MapReduce.
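A short sketch of the dataset-join example mentioned above, using pair RDDs in Scala with invented page-view data:

```scala
import org.apache.spark.sql.SparkSession

object JoinDatasetsSketch {
  def main(args: Array[String]): Unit = {
    val sc = SparkSession.builder().appName("rdd-join").getOrCreate().sparkContext

    // Two keyed datasets: page views and page titles.
    val views  = sc.parallelize(Seq(("p1", 12), ("p2", 5), ("p3", 7)))
    val titles = sc.parallelize(Seq(("p1", "Home"), ("p2", "Pricing")))

    // join keeps keys present in both RDDs; the result pairs up both values.
    val joined = views.join(titles)          // RDD[(String, (Int, String))]
    joined.collect().foreach { case (page, (count, title)) =>
      println(s"$title ($page): $count views")
    }
  }
}
```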
At the Dublin Fashion Insights Centre, we are exploring methods of categorising the web into a set of known fashion related topics. This raises questions such as: How many fashion related topics are there? How closely are they related to each other, or to other non-fashion topics? Furthermore, what topic hierarchies exist in this landscape? Using Clojure and MLlib to harness the data available from crowd-sourced websites such as DMOZ (a categorisation of millions of websites) and Common Crawl (a monthly crawl of billions of websites), we are answering these questions to understand fashion in a quantitative manner. The latest generation of big data tools such as Apache Spark routinely handle petabytes of data while also addressing real-world realities like node and network failures. Spark's transformations and operations on data sets are a natural fit with Clojure's everyday use of transformations and reductions. Spark MLlib's excellent implementations of distributed machine learning algorithms puts the power of large-scale analytics in the hands of Clojure developers. At Zalando's Dublin Fashion Insights Centre, we're using the Clojure bindings to Spark and MLlib to answer fashion-related questions that until recently have been nearly impossible to answer quantitatively. Hunter Kelly @retnuh tech.zalando.com
Slides from Tathagata Das's talk at the Spark Meetup entitled "Deep Dive with Spark Streaming" on June 17, 2013 in Sunnyvale California at Plug and Play. Tathagata Das is the lead developer on Spark Streaming and a PhD student in computer science in the UC Berkeley AMPLab.
This document summarizes Spark and Spark Streaming internals. It discusses the Resilient Distributed Dataset (RDD) model in Spark, which allows for fault tolerance through lineage-based recomputation. It provides an example of log mining using RDD transformations and actions. It then discusses Spark Streaming, which provides a simple API for stream processing by treating streams as series of small batch jobs on RDDs. Key concepts discussed include Discretized Stream (DStream), transformations, and output operations. An example Twitter hashtag extraction job is outlined.
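A rough sketch of the hashtag-extraction job outlined above; a plain socket source stands in for the Twitter stream, and the window length is an assumption:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object HashtagSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("hashtags").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(1))

    // Stand-in for a Twitter source: assume each line on the socket is a tweet's text.
    val tweets = ssc.socketTextStream("localhost", 9999)

    val topTags = tweets
      .flatMap(_.split("\\s+"))
      .filter(_.startsWith("#"))
      .map(tag => (tag, 1))
      .reduceByKeyAndWindow(_ + _, Seconds(60))          // counts over the last minute
      .map { case (tag, count) => (count, tag) }
      .transform(_.sortByKey(ascending = false))         // per-batch RDD sort

    topTags.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```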
This document summarizes a presentation on Spark SQL and its capabilities. Spark SQL allows users to run SQL queries on Spark, including HiveQL queries with UDFs, UDAFs, and SerDes. It provides a unified interface for reading and writing data in various formats. Spark SQL also allows users to express common operations like selecting columns, joining data, and aggregation concisely through its DataFrame API. This reduces the amount of code users need to write compared to lower-level APIs like RDDs.
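A brief sketch of registering a Scala UDF and calling it from SQL, one of the capabilities summarized above; the table, columns, and function name are invented:

```scala
import org.apache.spark.sql.SparkSession

object SparkSqlUdfSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("sql-udf").getOrCreate()
    import spark.implicits._

    Seq(("alice", 120), ("bob", 45)).toDF("user", "session_seconds")
      .createOrReplaceTempView("sessions")

    // Register a Scala function so it can be called from SQL like a built-in.
    spark.udf.register("to_minutes", (s: Int) => s / 60.0)

    spark.sql(
      """SELECT user, to_minutes(session_seconds) AS minutes
        |FROM sessions
        |WHERE session_seconds > 60""".stripMargin).show()
  }
}
```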
Apache Spark is a cluster computing platform designed to be fast and general-purpose. It provides a unified analytics engine for large-scale data processing across SQL, streaming, machine learning, and graph processing. Spark programs can be written in Java, Scala, Python and R. It works by building resilient distributed datasets (RDDs) that can be operated on in parallel. RDDs support transformations like map, filter and join and actions like count, collect and save. Spark also provides caching of RDDs in memory for improved performance.
This document summarizes Spark SQL and DataFrames in Spark. It notes that Spark SQL is part of the core Spark distribution and allows running SQL and HiveQL queries. DataFrames provide a way to select, filter, aggregate and plot structured data like in R and Pandas. DataFrames allow writing less code through a high-level API and reading less data by using optimized formats and partitioning. The optimizer can optimize queries across functions and push down predicates to read less data. This allows creating and running Spark programs faster.
As a developer, data engineer, or data scientist, you’ve seen how Apache Spark is expressive enough to let you solve problems elegantly and efficient enough to let you scale out to handle more data. However, if you’re solving the same problems again and again, you probably want to capture and distribute your solutions so that you can focus on new problems and so other people can reuse and remix them: you want to develop a library that extends Spark. You faced a learning curve when you first started using Spark, and you’ll face a different learning curve as you start to develop reusable abstractions atop Spark. In this talk, two experienced Spark library developers will give you the background and context you’ll need to turn your code into a library that you can share with the world. We’ll cover: issues to consider when developing parallel algorithms with Spark; designing generic, robust functions that operate on data frames and datasets; extending data frames with user-defined functions (UDFs) and user-defined aggregates (UDAFs); best practices around caching and broadcasting, and why these are especially important for library developers; integrating with ML pipelines; exposing key functionality in both Python and Scala; and how to test, build, and publish your library for the community. We’ll back up our advice with concrete examples from real packages built atop Spark. You’ll leave this talk informed and inspired to take your Spark proficiency to the next level and develop and publish an awesome library of your own.
Scala Toronto, July 2019 event at 500px. Topics: pure functional API integration; Apache Spark internals tuning; performance tuning; query execution plan optimisation; Cats Effect for switching the execution model runtime; and discovery/experience with Monix and Scala Futures.
This document discusses challenges in running machine learning applications in production environments. It notes that while Kaggle competitions focus on accuracy, real-world applications require balancing accuracy with interpretability, speed and infrastructure constraints. It also emphasizes that machine learning in production is as much a software and systems problem as a modeling problem. Key aspects that are discussed include flexible and scalable deployment architectures, model versioning, packaging and serving, online evaluation and experiments, and ensuring reproducibility of results.
The document provides guidance on tuning Apache Spark jobs. It discusses tuning memory and garbage collection, optimizing shuffle operations, increasing parallelism through partitioning, monitoring jobs, and testing Spark applications.
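A hedged sketch of a few of the tuning levers mentioned above (Kryo serialization, the shuffle-partition count, repartitioning before a wide operation, and coalescing before writing); the specific numbers and paths are illustrative, not recommendations from the document:

```scala
import org.apache.spark.sql.SparkSession

object TuningSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("tuning-sketch")
      // Kryo is usually faster and more compact than Java serialization.
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Fewer or more post-shuffle partitions depending on data volume (default is 200).
      .config("spark.sql.shuffle.partitions", "400")
      .getOrCreate()

    val events = spark.read.parquet("hdfs:///data/events")

    // Increase parallelism before an expensive wide operation...
    val repartitioned = events.repartition(400, events("user_id"))

    // ...and coalesce before writing so the output isn't thousands of tiny files.
    repartitioned
      .groupBy("user_id").count()
      .coalesce(32)
      .write.parquet("hdfs:///data/event-counts")
  }
}
```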
The document discusses the use of CRDTs (Convergent Replicated Data Types) to achieve eventual consistency in distributed systems without consensus. It describes the CAP theorem and challenges with achieving consistency in a distributed manner. CRDTs are introduced as a way to build datatypes that can automatically resolve conflicts as they propagate through replicas. Examples of commonly used CRDTs include registers, counters, sets and graphs. The document outlines some real-world implementations of CRDTs and notes their limitations.
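To illustrate, here is a minimal state-based grow-only counter (G-Counter), one of the counter CRDTs mentioned above, sketched in plain Scala:

```scala
// G-Counter: each replica increments only its own slot, and merge takes the per-replica
// maximum, so concurrent updates converge without coordination.
final case class GCounter(counts: Map[String, Long] = Map.empty) {
  def increment(replicaId: String, by: Long = 1L): GCounter =
    copy(counts.updated(replicaId, counts.getOrElse(replicaId, 0L) + by))

  def value: Long = counts.values.sum

  // Merge is commutative, associative, and idempotent - the CRDT requirements.
  def merge(other: GCounter): GCounter =
    GCounter((counts.keySet ++ other.counts.keySet).map { id =>
      id -> math.max(counts.getOrElse(id, 0L), other.counts.getOrElse(id, 0L))
    }.toMap)
}

object GCounterDemo {
  def main(args: Array[String]): Unit = {
    // Two replicas diverge, then reconcile to the same value regardless of merge order.
    val a = GCounter().increment("replica-a").increment("replica-a")
    val b = GCounter().increment("replica-b")
    assert(a.merge(b).value == 3 && b.merge(a).value == 3)
  }
}
```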
The document is a presentation on deep learning. It defines deep learning and describes techniques like convolutional neural networks and recurrent neural networks. It discusses how deep learning works by using neural networks with multiple layers to learn representations of data. It also covers challenges like vanishing gradients and overfitting when using deep networks. Examples of deep learning applications in machine translation and image captioning are provided. Finally, popular frameworks for developing deep learning models are mentioned.
This document summarizes a presentation about Finagle, a framework developed by Twitter for building reliable services. It discusses how Finagle uses asynchronous Futures and composable Filters and Services to provide high performance RPC. It also covers key Finagle concepts like load balancing, failure handling, and how it is used by many large companies for building distributed systems. The document provides code examples of defining Services and applying Filters in Finagle and Scala.
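A small sketch of the Service/Filter composition described above, in the style of Finagle's HTTP quickstart; the port and the trivial timing filter are illustrative assumptions:

```scala
import com.twitter.finagle.{Http, Service, SimpleFilter}
import com.twitter.finagle.http.{Request, Response, Status}
import com.twitter.util.{Await, Future}

object FinagleSketch {
  // A Service is an asynchronous function Request => Future[Response].
  val hello: Service[Request, Response] = new Service[Request, Response] {
    def apply(req: Request): Future[Response] = {
      val rep = Response(req.version, Status.Ok)
      rep.contentString = "hello\n"
      Future.value(rep)
    }
  }

  // A Filter wraps a Service with reusable behaviour, here simple request timing.
  val timing = new SimpleFilter[Request, Response] {
    def apply(req: Request, service: Service[Request, Response]): Future[Response] = {
      val start = System.nanoTime()
      service(req).onSuccess { _ =>
        println(s"${req.path} took ${(System.nanoTime() - start) / 1e6} ms")
      }
    }
  }

  def main(args: Array[String]): Unit = {
    // Filters compose with Services via andThen.
    val server = Http.serve(":8080", timing.andThen(hello))
    Await.ready(server)
  }
}
```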