Spark Deep Dive
Corey Nolet
Tetra Concepts
Design Philosophies
● Akka
  ● Remote actor model
  ● Designing for scalability
  ● Distributed / concurrent processing across threads, processes, and machines
● Scala
  ● Functional / closure-based
  ● Lazily evaluated
  ● Immutable
  ● Type inference
  ● Terse but safe
Hadoop-based
● Integration with HDFS
● Preserves data locality
● Shuffles for all-to-all communications
● Integrates natively with resource negotiators like YARN
● Can use existing Hadoop input/output formats
Spark Deep Dive
New concepts for the community
● Dependency graph instead of map/combine/reduce
  ● Can be narrow or wide depending on communication model
● Reprocessing partitions instead of restarting entire tasks
● Dataset appears like a local collection, but actions cause distributed computation
● Memory can be used to cache data for reuse across different transformations & actions
Spark Deep Dive
RDD
● API similar to Scala's collections API
● Provides lazy transformations like map() and flatMap(), plus actions like reduce() and collect() (see the sketch below)
● Transformation lineage is tracked
● Partitions can be rebuilt in the case of failure
● Broken up into partitions that get scheduled as tasks
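A minimal sketch of that laziness, assuming an existing SparkContext sc:

val rdd = sc.parallelize(1 to 100)  // distribute a local collection
val doubled = rdd.map(_ * 2)        // lazy: only records lineage
val sum = doubled.reduce(_ + _)     // action: runs the distributed job
println(sum)                        // 10100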
Jobs, Stages, Tasks
● SparkContext can be a long-running object, and we can submit many jobs to it
● Job: a sequence of transformations and actions on an RDD (see the sketch below)
● Stage: a specific transformation or action on an RDD that gets scheduled on the executors
● Tasks: the actual closures executing on executors to process stages
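As an illustration (hypothetical input path), one action triggers one job, and the shuffle introduced by reduceByKey() splits it into two stages:

val counts = sc.textFile("hdfs://...")  // stage 1: read, flatMap, map
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)                   // shuffle boundary => stage 2
counts.collect()                        // action: submits the job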
What are partitions?
● Chunks of data that make up an RDD
● Distributed across the cluster and control the parallelism of processing
● Often start in a job from an input format
  ● Similar to input splits in MapReduce
● Number can change throughout the stages that make up a job
● Default can be set using spark.default.parallelism (see the sketch below)
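A small sketch with hypothetical values:

val rdd = sc.parallelize(1 to 1000, 8)      // explicitly request 8 partitions
println(rdd.partitions.length)              // 8
val defaultRdd = sc.parallelize(1 to 1000)  // spark.default.parallelism decides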
Partitions
Partition Locality
● Can carry a set of “preferred locations” for which tasks should be scheduled
● Like splits in MapReduce
● Locality levels are lowered when the preferred executors stay too busy
● Process, Node, Rack, Any, or No Pref
  ● Process is most preferred
● Set through spark.locality.wait.[process,node,rack] (see the sketch below)
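An illustrative configuration sketch (the 3s values are assumptions; each property bounds how long the scheduler waits at that locality level before falling back to the next):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.locality.wait", "3s")          // base wait for all levels
  .set("spark.locality.wait.process", "3s")  // PROCESS_LOCAL
  .set("spark.locality.wait.node", "3s")     // NODE_LOCAL
  .set("spark.locality.wait.rack", "3s")     // RACK_LOCAL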
Partition Sizes
● Can be changed manually using rdd.coalesce()
● Low overhead in deserializing tasks to process partitions
● Unlike MapReduce, many small partitions are recommended over a few large ones
  ● Generally 2-3 partitions per core
  ● Tasks can be small enough to run in 200ms and still be efficient
Changing Partition Sizes

rdd.coalesce(numPartitions: Int, shuffle: Boolean = false)

rdd.repartition(numPartitions: Int)
Changing Partition Sizes

rdd.repartition(numParts) actually calls rdd.coalesce(numParts, shuffle = true)
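A brief usage sketch (hypothetical partition counts), assuming an existing RDD rdd:

val narrowed = rdd.coalesce(4)        // shrink without a shuffle (narrow dependency)
val rebalanced = rdd.repartition(32)  // grow or rebalance via a shuffle (wide dependency)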
Coalesce (not shuffled)
● Results in a narrow dependency
● For reducing the number of partitions
  ● A drastic decrease (e.g. 1000 → 1) usually benefits more from shuffling
● Final number of partitions will never be greater than the specified amount
  ● Could be less if the number of parent partitions is less
Coalesce (not shuffled)
● Groups final partitions so they map to the same number of parent partitions
● When parents have locality information:
  ● Attempts to group parent partitions on their local nodes
● When parents don't have locality information:
  ● Creates groups by chunking parents that are close in the array of partitions
Coalesce (shuffled)
● Results in a wide dependency
● Allows the number of partitions to be increased, at the expense of a shuffle
● Evens out the distribution of data using a hash partitioner
Memory in Spark
Executor Memory
● Divided among cache and processing
● 60% used for cached objects:
  spark.storage.memoryFraction=0.6
  spark.storage.safetyFraction=0.9
● 20% used for shuffles:
  spark.shuffle.memoryFraction=0.2
  spark.shuffle.safetyFraction=0.8
● What's left over is for task execution
● Usable memory is defined as follows (worked example below):
  (max memory allocated to the JVM - overhead memory used in the JVM) * memoryFraction * safetyFraction
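A worked example with hypothetical numbers (4 GB of heap, with JVM overhead already subtracted for simplicity):

val heapMb    = 4096.0               // memory left after JVM overhead
val storageMb = heapMb * 0.6 * 0.9   // ~2212 MB usable for cached RDDs
val shuffleMb = heapMb * 0.2 * 0.8   // ~655 MB usable for shuffles
// the remaining ~1229 MB is left for task execution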
Executor Memory
● High JVM overhead can significantly reduce the amount of memory available for caching, shuffles, and task execution
● Default amount allocated for YARN executors used to be 7%; raised to 10% in 1.3
● Dependent on choices of data structures and the overhead of classes used
● spark.yarn.executor.memoryOverhead
RDD Caching
● Useful when multiple downstream transformations depend on a single upstream RDD (see the sketch below):
  inputRdd.map(...).saveAsTextFile(...)
  inputRdd.map(...).saveAsTextFile(...)
● Done through rdd.persist()
● LRU eviction of memory-cached RDDs when memory is full (automatic cleanup)
● Can be manually evicted using rdd.unpersist()
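A minimal sketch (parseA, parseB, and the paths are hypothetical) of caching the shared upstream RDD so the second job reads from memory instead of recomputing:

val inputRdd = sc.textFile("hdfs://...")
inputRdd.persist()                              // defaults to MEMORY_ONLY
inputRdd.map(parseA).saveAsTextFile(outPathA)   // first job also populates the cache
inputRdd.map(parseB).saveAsTextFile(outPathB)   // second job reuses cached partitions
inputRdd.unpersist()                            // manual eviction when done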
RDD Caching
● Deserialized / raw
  ● Generally faster: no cost of serializing data
  ● Larger data sets put pressure on the garbage collector
● Serialized
  ● Can take 2-4x less memory
  ● Can be slower to process than raw data as long as the garbage collector is keeping up
Storage Levels
● MEMORY_ONLY
● MEMORY_AND_DISK
● MEMORY_ONLY_SER
● MEMORY_AND_DISK_SER
● DISK_ONLY
● MEMORY_ONLY_SER_2
● MEMORY_AND_DISK_SER_2
● MEMORY_ONLY_2
● MEMORY_AND_DISK_2
● OFF_HEAP
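The _SER suffix stores partitions in serialized form, and the _2 suffix replicates each partition to two nodes. A one-line usage sketch, assuming an existing RDD rdd:

import org.apache.spark.storage.StorageLevel

rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)  // cache() is shorthand for persist(MEMORY_ONLY)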
Tachyon
● Uses a ramdisk (in-memory file system) to expose an HDFS API
● Asynchronously writes to HDFS
● Allows off-heap caching to put less pressure on the garbage collector
● Data can be shared by multiple executors
● Cached data is not lost when an executor dies
● Still experimental as of Spark 1.4.0
Project Tungsten
● Designs for three major optimizations to Spark
  ● One provides off-heap memory management to lower object overheads and bypass garbage collection
  ● Another provides cache-aware data structures that can minimize memory lookups
● https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html
Shared Memory
● Broadcast variables
  ● Read-only memory cached on each executor and shared across tasks
  ● Can be used like the distributed cache in MapReduce to share large lookup tables across tasks
● Accumulators
  ● Can be used like counters in MapReduce
  ● Can also perform any generic associative operation
Broadcast & Accumulators

// Using a broadcast variable
val valueToWrap = "fubar"
val broadcastVal = sc.broadcast(valueToWrap)
...
rdd.filter(_ == broadcastVal.value)

// Using an accumulator
val accumulator = sc.accumulator(0)
rdd.map(it => {
  accumulator += 1
  it
})
Serialization in Spark
● Two different types
  ● Closures
  ● Data
Closures
● Scala can be a little confusing
  ● Functions vs. methods
  ● Objects vs. classes
● A closure is just an anonymous implementation of the FunctionX trait in Scala
● A closure will always contain a reference to its outer object
  ● Any objects used inside the closure will be serialized
Functions vs. Methods

class MyClass {
  // compiles down to a Java method
  def myMethod(): Unit = {}
}

class MyClass {
  // compiles to an impl of the FunctionX trait
  val myFunction: () => Unit = () => {}
}

Methods can also be coerced into functions, allowing them to be passed around like functions (see the sketch below).
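A small sketch of that coercion (eta-expansion; the trailing underscore forces it):

class MyClass {
  def myMethod(): Unit = {}
  // eta-expansion: the method becomes a Function0 value
  val asFunction: () => Unit = myMethod _
}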
Objects vs. Classes

object MyObject {
  // compiles to a static member of MyObject
  val myVal: Boolean = true
  // compiles to a Java static method
  def myMethod(): Unit = {}
}

class MyClass {
  // compiles to an instance value
  val myVal: Boolean = true
  // compiles to a method on MyClass
  def myMethod(): Unit = {}
}
Closure Serialization
● The primary way code makes it from the driver to executors
  ● No more extends Mapper/Reducer
  ● Closures can be shipped at runtime
● Currently only supports Java serialization
● The closure cleaner attempts to prune unused references from the object graph
  ● Can still use unnecessary memory if not careful
Closure Cleaner

class MyProcessor {
  def process(rdd: RDD[String]) {
    rdd.filter(_ == "good")
    ...
  }
}

The filter() closure's reference to the outer class MyProcessor gets pruned by the ClosureCleaner because it is not used.
Closure Cleaner

class MyProcessor(
  filterWord: String
) {
  def process(rdd: RDD[String]) {
    rdd.filter(_ == filterWord)
    ...
  }
}

The whole class gets serialized, but it doesn't extend Serializable, so execution will fail.
Closure Cleaner

object MyProcessor {
  val filterWord = ...
  def process(rdd: RDD[String]) {
    rdd.filter(_ == filterWord)
    ...
  }
}

process() compiles to a Java static method, so only filter()'s closure gets serialized.
Closure Cleaner

class MyProcessor(
  filterWord: String
) {
  def process(rdd: RDD[String]) {
    val filterWord2 = filterWord
    rdd.filter(_ == filterWord2)
    ...
  }
}

The filter() closure serializes cleanly because the local filterWord2 separates the value from the MyProcessor instance.
Data Serialization
● Kryo & Java are both supported
● Kryo is faster and more compact than Java:
  spark.serializer = org.apache.spark.serializer.KryoSerializer
● Kryo requires object serializers to be registered (see the sketch below)
  ● Native Scala classes are supported
● Serialization errors will not be noticed until the data leaves the JVM
● Used both in memory and on disk
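A hedged registration sketch (MyRecord is a stand-in for an application class):

import org.apache.spark.SparkConf

case class MyRecord(id: Int, name: String)  // stand-in application class

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[MyRecord]))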
Shuffling in Spark
Shuffling
● Required for all-to-all communications
  ● reduceByKey(), aggregateByKey(), sortByKey(), etc.
● Always a bottleneck
  ● Network & disk IO
  ● Serialization
  ● Compression
● Receiving lots of attention
Spark vs MapReduce
● The reduce phase does not overlap with the map phase as it does in MapReduce
  ● Spark reducers pull shuffle data from mappers
  ● MapReduce pushes it during a concurrent copy phase
● Map and reduce tasks all run on the same executor JVMs
  ● MapReduce uses different JVMs for these tasks
First there was a hash-based shuffle...
● Originally required M * R intermediate files (# of mappers * # of reducers)
● Concurrently opened files are C * R (# of cores * # of reducers)
● Enabling shuffle spilling created even more temporary files
● Many random writes/reads meant CPU time in the reducers was mostly spent waiting on disk I/O
Original hash-based shuffle
Then they consolidated files...
● Introduced an extra merge phase
● All map tasks running on the same core write to the same set of files in tandem
● File consolidation reduces the total number of files to C * R
● Each reducer fetches a smaller number of files
● Still bad for high numbers of reducers
  ● Concurrently opened files are still C * R
● spark.shuffle.consolidateFiles=true
Consolidated files
And along came sort-based shuffle
1) Records are sorted in memory by partition ID and merged into a single file per core, along with an index file
   ● If there is a map-side combine, buckets are sorted by key & partition and run through the combiner
   ● Otherwise, they are sorted only by partition
2) Ranges of buckets in each file are served to reducers upon request
3) Each segment is merged together on the reducer
4) Records are deserialized and passed through the all-to-all function (e.g. aggregateByKey(), reduceByKey()) to complete the stage
   ● In the case of sortByKey() and other ordered functions, the partitions are sorted before being run through the all-to-all function
5) When there are <= 200 reducers and no sort or aggregation is needed, hash-based shuffle is used instead
   ● spark.shuffle.sort.bypassMergeThreshold
And along came a sort-based shuffle
Shuffle Evolution
● Shuffle write consolidation in 0.9
● Pluggable shuffle managers in Spark 1.0
  ● Hash-based (default pre-1.2)
  ● Sort-based (introduced in 1.1, default in 1.2+)
● NettyBlockTransferService introduced in 1.2 for transferring shuffle “blocks”
● External shuffle service introduced in 1.2
● In 1.5+, the community is working on a tiered merge strategy
Shuffle Durability
● Failure of an executor will lose its shuffle files unless the aux shuffle service is configured on the YARN NodeManager.

SparkConf:
  spark.shuffle.service.enabled = true
In yarn-site.xml, add spark_shuffle to:
  yarn.nodemanager.aux-services
In yarn-site.xml, add:
  yarn.nodemanager.aux-services.spark_shuffle.class =
    org.apache.spark.network.yarn.YarnShuffleService
Perhaps we could establish some best practices
● Consider the parallelism at each stage of your jobs based on your data and number of cores
● Executor memory should be fine-tuned for expected cache and shuffle sizes
● Minimize the footprint of closures
● Use broadcasts for large values
● Use Kryo to serialize data
● Know your communication patterns (one-to-all, all-to-all, etc.) and optimize accordingly
● Use the aux shuffle service
Shuffle Optimization
● A couple of properties that affect shuffle performance:
  ● spark.akka.threads
  ● spark.reducer.maxMbInFlight
● By default, shuffles will use only 20% of the memory allocated to the executor
● Increase spark.shuffle.memoryFraction at the expense of spark.storage.memoryFraction (see the sketch below)
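An illustrative rebalancing sketch (the values are assumptions, not recommendations):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.shuffle.memoryFraction", "0.4")  // up from the 0.2 default
  .set("spark.storage.memoryFraction", "0.4")  // down from the 0.6 default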
Questions?
corey@tetraconcepts.com


Editor's Notes

  1. I think a great place to start would be with the core design philosophies that the architecture was designed around. First, we have Akka. This is a framework that, similar to many other distributed frameworks today, has you thinking about breaking your problems into their atomic particles first, so that those particles can be designed on a single node and then scaled out as needed to run on clusters of machines. At its heart, it focuses on Actors, which know how to intercept, process, and create new messages that are sent to other Actors. Actors are nothing more than objects which get serialized, deployed onto nodes, and then start doing their magic, processing the messages they choose to accept. Then there's Scala, which is the backbone that supports the expressive nature of the Akka framework. I could go on for hours about why Scala is such a useful language for processing data, but I'll opt for the basic bullet points since we're going to be strapped for time. Scala blends together functional and imperative programming in the JVM by using Java objects to define closures, or first-class functions that remember any variables available in the environment in which they were created. Similar to Java, Scala has a rich collections API. Unlike Java, however, Scala promotes immutability and uses shared structural state to create new lightweight objects out of older objects for many operations that would normally require mutating state. Of course it does have mutable objects for those who need them, but it recommends sticking with immutability. When operations like add, remove, concatenate, etc. are performed on many of the collection objects, they need not be copied; instead, the inner structure is shared. Similar to Java 8's streams API and Guava's Iterables, Scala has lazily evaluated collection views.
  2. It turns out, this was no coincidence. A guy by the name of Matei Zaharia wrote Hadoop's FairScheduler in 2008 while interning at Facebook. While getting his PhD at UC Berkeley, he created Apache Spark and Apache Mesos. Clearly this guy is no stranger to large-scale scheduling of computations on distributed systems. He is one of the co-founders of Databricks and is currently their CTO, while also an assistant professor of Computer Science at MIT. We love this guy, even though many of us don't know who he is.