Spark Tips & Tricks

1© Cloudera, Inc. All rights reserved.
Tips + Tricks
Best Practices + Recommendations for Apache Spark
Jason Hubbard | Systems Engineer @ Cloudera

Quick Spark Overview
Promise!

Overview: Spark Components
Spark is a fast, general purpose cluster
computing platform.
Spark takes advantage of parallelism, by
distributing processing across a cluster of
nodes, in order to provide fast processing of
data.
Each Spark application gets
its own executor processes
which stay up for the
duration of the application.
The executor runs tasks in
multiple threads.
The driver program
coordinates tasks and
handles resource requests to
the cluster manager. The
driver distributes application
code to each executor.
Normally, each task takes 1
core.
Example: When Spark is run interactively via pyspark or spark-shell
executors are assigned to the shell (driver) until the shell is exited.
Each time the user invokes an action on the data the driver invokes
that action on each executor as a task.

Spark on Yarn
YARN ( Yet Another Resource Negotiator) is a
resource manager which can be used by Spark as
the cluster manager (and is recommend for use
with CDH)
.
Resource Manager Node Manager
Container
Executor
Node Manager
Container
Executor
Application
Mstr.
Spark Driver
• The Spark Driver submits initial
request to Resource Manager
• Spark Application Master is launched
• Spark Application Master coordinates
with Resource Manager and Node
Managers to launch containers and
Executors
• Spark Driver Coordinates execution
of the application with the executors

What About pyspark?
•Spark operates on JVM
•Objects serialized for performance
•Python =/= Java
•Spark uses Py4J to move data from JVM to python
•Extra serialization cost (Pickle)
•Note: DataFrames create a query plan out of
pyspark and execute in JVM
•But only if there are no UDFs
•UDFs kick back to double serialization cost
•Apache Arrow aims to solve some of these
issues

Resource Management
How do I figure out # of executors, cores, memory?

What are we doing here?
•num-executors, executor-cores, executor-memory
•obviously pretty important
•How do we configure these?
•Tip: use executor-cores to drive the rest.
•i.e. X=total core, Y=executor-cores, then X/Y = num executors
•i.e. Z=total mem, Z/(X/Y) = executor-memory
•Good rule of thumb – try ~5 executor-cores
•Notes!
•Too few cores doesn’t take advantage of multiple tasks running in a executor (ex: sharing broadcast
variables)
•Too many tasks can create bad HDFS I/O (problems with concurrent threads)
•You can’t give all your resources to Spark
•At the very least, YARN AM needs a container, OS needs memory/core, HDFS needs memory/core,
YARN needs memory/core, offheap memory, other services..

Quick Aside on Memory Usage
• yarn.nodemanager.resource.memory-mb – what yarn is working with
• spark.executor.memory – memory per executor process (heap memory)
• Spark.yarn.executor.memoryOverhead – offheap memory
• Default is max (384 mb, 0.1 * spark.executor.memory)
• Memory is pretty key – controls how much data you can process, group, join,
cache, shuffle, etc,

Unified Memory Management Spark
• StorageExecution 1.6
• Evicts Storage, not execution
• Specify minimum unevictable amount (not
reservation)
• Tasks 1.0
• Static vs dynamic (slots determined
dynamically)
• Fair and starvation free, static simpler,
dynamic better for stragglers
• Off by default in CDH (performance
regressions)
• Toggle with spark.memory.useLegacyMode

Worked Example
16
Core
16
Core
16
Core
16
Core
64 Total Cores in Cluster
512 GB RAM
C
1 Core for OS
4 GB RAM
111C 12 Cores
105 RAM for
Executors
Core/RAM Allocation (per Host) GO AND TEST!
Also, keep executor mem < 64 GB [GC delays]
1 Executor
4 Cores
48 Cores
12 Executors with
4 Cores, 35 GB
RAM Each
x
Allocate Resources Try differ executor/core ratios
1 Executor
5 Cores
48 Cores
9 Executors with 5
Cores, 46 GB RAM
Each
x
(Leaves cores un-utilized)
Determine the optimal resource allocation for the Spark job
128
GB
128
GB
128
GB
128
GB
Worker Nodes
C 1 Core for CM agent
1 GB RAM
C 1 Core for NM
1 GB RAM
C 1 Core for DN
1 GB RAM
12 x 4 = 48 total cores
105 x 4 = 420 GB RAM
1 Executor
6 Cores
48 Cores
8 Executors with 6
Cores, 52 GB RAM
Each
xC 12 GB RAM for
overhead

This could be easier: Dynamic Resource Allocation
Yarn will handle the sizing and distribution of the requested executors.
CDH 5.5+
This configuration will allow the dynamic allocation of between 1 and
20 executors.
Spark will initially attempt to run the job with 5 executors, this helps
to speed up jobs which you know will require a certain number of
executors ahead of time.
Configuration settings should be placed in the spark configuration file. However
they can also be submitted with the job.
spark-submit --class com.cloudera.example.YarnExample
--master yarn-cluster
--conf "spark.dynamicAllocation.enabled=true"
--conf "spark.dynamicAllocation.minExecutors=1"
--conf "spark.dynamicAllocation.maxExecutors=20"
--conf "spark.dynamicAllocation.initialExecutors=5"
lib/yarn-example.jar
10
Dynamic Allocation of
Resources in Yarn only handles
allocation of Executors.
The number of cores per
executor is handled via
Spark.conf they are not
dynamically sized.
It is still important to
understand the sizing
limitations of your cluster in
order to properly set the
Min & Max executor settings
as well as the executor-
cores setting.
Spark also lets you control the
timeout of executors when
they are not being used.

Warning: Shuffle Block Size
•Spark imposes a 2 GB limit on shuffle block size
•Anything larger will cause application errors
•How do we fix this?
•Create more partitions
•Spark core: rdd.repartition, rdd.coalesce.
•Spark SQL: spark.sql.shuffle.partitions
•Default is 200!
•Note: If partitions > 2000, Spark uses HighlyCompressedMapStatus
•Rule of Thumb: 128 MB/partition
•Avoid Data Skew

Data Skew
•Avoid This!
•Skew slows down jobs/queries
•May even causes errors/break Spark (partition > 2GB, etc)
Good Not Good

How to Avoid Skew
• If things are running slowly, always inspect the data/partition
sizes
• If you notice skew, try adding a salt to keys
• With salt, do two stage operation, one on salted keys, then one on unsalted
results
• There are more transformations, but we spread the work around better so job
should be faster
• If there is a small number of of skewed keys, you can try isolated salting
• Note: Less of a problem in SparkSQL
• More efficient columnar cached representation
• Able to push some operations down to the data store
• Optimizer is able to look inside our operations (partition pruning, predicate
pushdowns, etc)

Finding Skew

DAGnabit
•ReduceByKey over GroupByKey
•GroupByKey is unbounded and dependent on data
•Tree(ReduceAggregate) over ReduceAggregate
•Push more work to workers and return less data to driver
•Less Spark
•mapPartitions - reuse resources (JDBC)
•Group or Reduce by then process in memory
•Submit multiple Jobs concurrently
•Java concurrency

Serialization
•In general, Spark stores data in memory as deserialized java objects and on
disk/through network as serialized binary
•Serialize stuff before you send it to executors, or don’t send stuff that don’t need
to be serialized
•This is bad:
val myObj=<something>
myRDD.filter(x => x == myObj.value)
•This is good:
val myObj=<something>
val myVal=myObj.value
myRDD.filter(x => x == myValue)
•Kyro is better than Java serialization. Register custom classes
•Cut fat off of data objects. Only use what you need.

Labeling
•This seems trivial but getting in good habits can save a lot of time
•Name your jobs
•sc.setJobGroup
•Name your RDDs
•rdd.setName
•Which is easier to understand?

Check Your Dependencies
•Incorrectly built jars and version mismatch are common issues
•Common errors:
•Could not find or load main class
org.apache.spark.deploy.yarn.ApplicationMaster
•java.lang.ClassNotFoundException:
org.apache.hadoop.mapreduce.MRJobConfig
•java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream
•ImportError: No module named qt_returns_summary_wrapper
•NoSuchMethodException
•Double check and verify your artifacts
•Use versions of components provided by your distro – versions will work
together

Spark Streaming
It’s all about the microbatches

Spark Streaming
•Incoming data represented as DStreams (Discretized Streams)
•Data commonly read from streaming data channels like Kafka or Flume
•A spark-streaming application is a DAG of Transformations and Actions on
DStreams (and RDDs)

Discretized Stream
•Incoming data stream is broken down into micro-batches
•Micro-batch size is user defined, usually 0.3 to 1 second
•Micro-batches are disjoint
•Each micro-batch is an RDD
•Effectively, a DStream is a sequence of RDDs, one per micro-batch
•Spark Streaming known for high throughput

Windowed DStreams
• Defined by specifying a window size and a step size
• Both are multiples of micro-batch size
• Operations invoked on each window’s data

Fault Tolerance
• Handle Driver Failures
• Process Control System
• Submit with deploy-mode=cluster, runs in Application Master
• Driver failure causes Application Master Failure
• Set yarn.resourcemanager.am.max-attempts higher (default 2), try 4
• Reset Failure Counts
• spark.yarn.am.attemptFailuresValidityInterval (default none), try 1h
• spark.yarn.max.executor.failures (default max(2 * num executors, 3)), try {8 *
num_executors}
• spark.yarn.executor.failuresValidityInterval (default none), try 1h
• spark.task.maxFailures (default 4), try 8

Graceful Shutdown
• Design for failure, but may need to finish batches
• yarn application -kill [applicationId] (may stop in the middle)
• Shutdown hook too late
• spark.streaming.stopGracefullyOnShutdown doesn’t work on Yarn
• Marker file or http endpoint

Prevent Data Loss
• Receiver
• Enable Checkpoint
• Enable WAL
• Upgrades won’t work, delete checkpoint dir!
• Direct
• Checkpoint offsets (upgrades won’t work)
• Save checkpoints manually in ZK, HDFS, Hbase, RDBMS, etc

Performance
• Prevent starvation, create dedicated Pool
• --queue realtime_queue
• Protect against stragglers, enable speculation
• --conf spark.speculation=true
• Single receiver most execute same task and node as receiver
• Increase replication or lower spark.locality.wait (default 10 ms)
• Batch time, Inverse function
• mapWithState instead of updateStateByKey (size of batch instead of state)

Parallelism/Partitions
• Repartition (may cause shuffle)
• Batch Interval & Block Interval determine # tasks (may cause shuffle)
• Lower block interval to increase tasks (min 50 ms)
• Batch Interval / Block Interval should = # executors
• Multiple receivers w/ union (avoids shuffle)
• Kafka direct, increase kafka partitions
• For receivers, don’t forger receiver consumes a long running task

Security
• Submit principal and keytab for secured cluster via submit
• --principal user/hostname@domain --keytab keytabfile
• Disable HDFS Cache with HA Namenode (HDFS-9276, SPARK-11182)
• --conf spark.hadoop.fs.hdfs.impl.disable.cache=true

Backpressure
• Find optimal records for processing time
• May hide lag, monitor
• Smooth startup
• Spark 2
• spark.streaming.backpressure.initialRate
• Spark 1
• Receiver: spark.streaming.backpressure.initialRate
• Direct: spark.streaming.kafka.maxRatePerPartition

Logging
• Enable YARN rolling aggregator
• yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds
• Configure Spark rolling strategy
- or -
• Custom log4j appender
• --conf spark.driver.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties
• --conf spark.executor.extraJavaOptions=-
Dlog4j.configuration=file:log4j.properties
• --files /path/to/log4j.properties

Cloud

S3
• Treats S3 as a filesystem, S3A is preferred over S3N and S3
• Eventually consistent listing
• Consider writing to HDFS first then copy to S3
• DirectParquetOutputCommitter removed from Spark 2
• If writing directly to S3, use version 2 commit algorithm and turn off speculation
• spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2
• spark.speculation=false

Parquet on S3
• When reading Parquet enable random I/O
• fs.s3a.experimental.input.fadvise=random (Filesystem Level when created)
• spark.hadoop.parquet.enable.summary-metadata=false
• spark.sql.parquet.mergeSchema=false
• spark.sql.parquet.filterPushdown=true
• spark.sql.hive.metastorePartitionPruning=true

S3 Performance
• fs.s3a.block.size
• Tune fs.s3a.multipart.threshold fs.s3a.multipart.size
• spark.hadoop.fs.s3a.readahead.range 67108864
• Expiremental fs.s3a.fast.upload (buffer in memory) fs.s3a.fast.buffer.size
• fs.s3a.threads.max active uploads fs.s3a.max.total.tasks queued

YARN
• yarn.scheduler.fair.locality.threshold.node = -1
• yarn.scheduler.fair.locality.threshold.rack = -1
• spark.locality.wait.rack=0

jason.hubbard@cloudera.com
Thank You

Spark Tips & Tricks

Related slideshows

More Related Content

Spark Tips & Tricks