1© Cloudera, Inc. All rights reserved.
Tips + Tricks
Best Practices + Recommendations for Apache Spark
Jason Hubbard | Systems Engineer @ Cloudera
2© Cloudera, Inc. All rights reserved.
Quick Spark Overview
Promise!
3© Cloudera, Inc. All rights reserved.
Overview: Spark Components
Spark is a fast, general-purpose cluster computing platform. Spark takes
advantage of parallelism, distributing processing across a cluster of
nodes to provide fast processing of data.
Each Spark application gets its own executor processes, which stay up for
the duration of the application. Each executor runs tasks in multiple
threads. The driver program coordinates tasks and handles resource
requests to the cluster manager. The driver distributes application code
to each executor. Normally, each task takes 1 core.
Example: When Spark is run interactively via pyspark or spark-shell,
executors are assigned to the shell (driver) until the shell is exited.
Each time the user invokes an action on the data, the driver invokes that
action on each executor as a task.
4© Cloudera, Inc. All rights reserved.
Spark on YARN
YARN (Yet Another Resource Negotiator) is a
resource manager which can be used by Spark as
the cluster manager (and is recommended for use
with CDH).
[Diagram: Spark Driver and Application Master coordinate with the Resource Manager; Node Managers host containers running Executors]
• The Spark Driver submits an initial request to the Resource Manager
• The Spark Application Master is launched
• The Spark Application Master coordinates with the Resource Manager and Node Managers to launch containers and Executors
• The Spark Driver coordinates execution of the application with the executors
5© Cloudera, Inc. All rights reserved.
What About pyspark?
•Spark operates on the JVM
•Objects are serialized for performance
•Python ≠ Java
•Spark uses Py4J to move data between the JVM and Python
•Extra serialization cost (Pickle)
•Note: DataFrames build a query plan from pyspark calls and execute it in the JVM
•But only if there are no UDFs
•Python UDFs kick you back to the double serialization cost
•Apache Arrow aims to solve some of these issues
6© Cloudera, Inc. All rights reserved.
Resource Management
How do I figure out # of executors, cores, memory?
7© Cloudera, Inc. All rights reserved.
What are we doing here?
•num-executors, executor-cores, executor-memory
•obviously pretty important
•How do we configure these?
•Tip: use executor-cores to drive the rest.
•i.e. X = total cores, Y = executor-cores, then X/Y = num-executors
•i.e. Z = total memory, then Z/(X/Y) = executor-memory
•Good rule of thumb – try ~5 executor-cores
•Notes!
•Too few cores doesn’t take advantage of multiple tasks running in an executor (e.g. sharing broadcast variables)
•Too many tasks can create bad HDFS I/O (problems with concurrent threads)
•You can’t give all your resources to Spark
•At the very least, the YARN AM needs a container, the OS needs memory/cores, HDFS needs memory/cores, YARN needs memory/cores, plus off-heap memory, other services...
8© Cloudera, Inc. All rights reserved.
Quick Aside on Memory Usage
• yarn.nodemanager.resource.memory-mb – the memory YARN has to work with per node
• spark.executor.memory – memory per executor process (heap memory)
• spark.yarn.executor.memoryOverhead – off-heap memory
• Default is max(384 MB, 0.1 * spark.executor.memory)
• Memory is pretty key – it controls how much data you can process, group, join, cache, shuffle, etc.
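For example (values are illustrative, and the class/jar are placeholders) – each executor’s YARN container request is the heap plus the overhead, and must fit under yarn.nodemanager.resource.memory-mb:

  # 20 GB heap + 2 GB off-heap = 22 GB container per executor
  spark-submit --executor-memory 20g \
    --conf "spark.yarn.executor.memoryOverhead=2048" \
    --class com.example.MyApp myapp.jar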
9© Cloudera, Inc. All rights reserved.
Unified Memory Management
• Storage/Execution (Spark 1.6+)
• Evicts storage, not execution
• Specify a minimum unevictable storage amount (not a reservation)
• Tasks (Spark 1.0+)
• Static vs. dynamic (slots determined dynamically)
• Fair and starvation-free; static is simpler, dynamic is better for stragglers
• Off by default in CDH (performance regressions)
• Toggle with spark.memory.useLegacyMode
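The relevant knobs, sketched as Spark conf entries (defaults vary by version – verify against yours):

  spark.memory.useLegacyMode=false    # unified memory management (default since 1.6)
  spark.memory.fraction=0.6           # heap share for execution + storage (0.75 in 1.6, 0.6 in 2.x)
  spark.memory.storageFraction=0.5    # minimum unevictable storage share within that pool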
10© Cloudera, Inc. All rights reserved.
Worked Example
Determine the optimal resource allocation for the Spark job.

Cluster: 4 worker nodes, each with 16 cores and 128 GB RAM
(64 total cores, 512 GB RAM).

Core/RAM allocation (per host):
• 1 core + 4 GB RAM for the OS
• 1 core + 1 GB RAM for the CM agent
• 1 core + 1 GB RAM for the NM
• 1 core + 1 GB RAM for the DN
• 12 GB RAM for overhead
• Remaining: 12 cores + 105 GB RAM for executors

Cluster-wide: 12 x 4 = 48 total cores, 105 x 4 = 420 GB RAM.

Allocate resources – try different executor/core ratios against the 48 cores:
• 4 cores per executor → 12 executors with 4 cores, 35 GB RAM each
• 5 cores per executor → 9 executors with 5 cores, 46 GB RAM each (leaves cores un-utilized)
• 6 cores per executor → 8 executors with 6 cores, 52 GB RAM each

Also, keep executor mem < 64 GB [GC delays]. GO AND TEST!
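Expressed as static submit flags, the 5-core option might look like this (class and jar names are placeholders; executor-memory is set below the 46 GB ceiling to leave room for the off-heap overhead from slide 8):

  spark-submit --class com.example.MyApp \
    --master yarn-cluster \
    --num-executors 9 \
    --executor-cores 5 \
    --executor-memory 40g \
    myapp.jar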
11© Cloudera, Inc. All rights reserved.
This could be easier: Dynamic Resource Allocation
YARN will handle the sizing and distribution of the requested executors
(CDH 5.5+).
The configuration below allows dynamic allocation of between 1 and 20
executors.
Spark will initially attempt to run the job with 5 executors; this helps
speed up jobs which you know will require a certain number of executors
ahead of time.
Configuration settings should be placed in the Spark configuration file, but
they can also be submitted with the job:
spark-submit --class com.cloudera.example.YarnExample \
  --master yarn-cluster \
  --conf "spark.dynamicAllocation.enabled=true" \
  --conf "spark.dynamicAllocation.minExecutors=1" \
  --conf "spark.dynamicAllocation.maxExecutors=20" \
  --conf "spark.dynamicAllocation.initialExecutors=5" \
  lib/yarn-example.jar \
  10
Dynamic allocation in YARN only
handles the allocation of
executors. The number of cores
per executor is set via Spark
conf; it is not dynamically sized.
It is still important to
understand the sizing
limitations of your cluster in
order to properly set the
min & max executor settings
as well as the executor-cores
setting.
Spark also lets you control how
long idle executors are kept
around
(spark.dynamicAllocation.executorIdleTimeout).
Note: dynamic allocation on YARN
also requires the external shuffle
service (spark.shuffle.service.enabled=true).
12© Cloudera, Inc. All rights reserved.
Warning: Shuffle Block Size
•Spark imposes a 2 GB limit on shuffle block size
•Anything larger will cause application errors
•How do we fix this?
•Create more partitions (see the sketch below)
•Spark core: rdd.repartition, rdd.coalesce
•Spark SQL: spark.sql.shuffle.partitions
•Default is 200!
•Note: If partitions > 2000, Spark uses HighlyCompressedMapStatus
•Rule of Thumb: 128 MB/partition
•Avoid Data Skew
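A minimal sketch of each option (partition counts are illustrative):

  // Spark core: redistribute into more partitions (triggers a full shuffle)
  val repartitioned = bigRdd.repartition(400)
  // coalesce reduces the partition count without a full shuffle
  val merged = bigRdd.coalesce(50)
  // Spark SQL: raise the shuffle partition count above the 200 default
  sqlContext.setConf("spark.sql.shuffle.partitions", "400")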
13© Cloudera, Inc. All rights reserved.
Data Skew
•Avoid This!
•Skew slows down jobs/queries
•May even cause errors/break Spark (partition > 2 GB, etc.)
[Charts: evenly sized partitions (Good) vs. one oversized partition (Not Good)]
14© Cloudera, Inc. All rights reserved.
How to Avoid Skew
• If things are running slowly, always inspect the data/partition sizes
• If you notice skew, try adding a salt to keys
• With salt, do a two-stage operation: one on salted keys, then one on the unsalted results (see the sketch after this list)
• There are more transformations, but we spread the work around better, so the job should be faster
• If there is a small number of skewed keys, you can try isolated salting
• Note: Less of a problem in SparkSQL
• More efficient columnar cached representation
• Able to push some operations down to the data store
• Optimizer is able to look inside our operations (partition pruning, predicate
pushdowns, etc)
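A sketch of the two-stage salted aggregation, assuming an RDD of (key, count) pairs and a salt range of 100:

  import scala.util.Random
  // Stage 1: append a random salt so one hot key spreads across many tasks
  val partial = rdd.map { case (k, v) => ((k, Random.nextInt(100)), v) }
                   .reduceByKey(_ + _)
  // Stage 2: drop the salt and combine the partial sums
  val result = partial.map { case ((k, _), v) => (k, v) }
                      .reduceByKey(_ + _)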
15© Cloudera, Inc. All rights reserved.
Finding Skew
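A quick way to eyeball the distribution from the shell (record counts stand in for byte sizes):

  // Count records per partition; a handful of huge partitions = skew
  val sizes = rdd.mapPartitionsWithIndex { case (i, it) => Iterator((i, it.size)) }.collect()
  sizes.sortBy(-_._2).take(10).foreach { case (i, n) => println(s"partition $i: $n records") }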
16© Cloudera, Inc. All rights reserved.
DAGnabit
•reduceByKey over groupByKey (see the sketch below)
•groupByKey is unbounded and dependent on the data
•treeReduce/treeAggregate over reduce/aggregate
•Push more work to the workers and return less data to the driver
•Less Spark
•mapPartitions – reuse expensive resources (e.g. JDBC connections) across a partition
•Group or reduce by key, then process in memory
•Submit multiple jobs concurrently
•Java concurrency
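A sketch of the first tip, assuming pairs is an RDD of (key, Int):

  // groupByKey ships every value across the network, then sums at the reducer
  val slow = pairs.groupByKey().mapValues(_.sum)
  // reduceByKey combines map-side first, shuffling only partial sums
  val fast = pairs.reduceByKey(_ + _)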
17© Cloudera, Inc. All rights reserved.
Serialization
•In general, Spark stores data in memory as deserialized Java objects, and on disk/over the network as serialized binary
•Serialize things before you send them to executors, or don’t send things that don’t need to be serialized
•This is bad (ships the whole object with the closure):
val myObj = <something>
myRDD.filter(x => x == myObj.value)
•This is good (ships only the extracted value):
val myObj = <something>
val myVal = myObj.value
myRDD.filter(x => x == myVal)
•Kryo is better than Java serialization. Register custom classes (see the sketch below)
•Cut fat off of data objects. Only use what you need.
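A registration sketch (MyRecord stands in for your own classes):

  import org.apache.spark.SparkConf
  val conf = new SparkConf()
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .registerKryoClasses(Array(classOf[MyRecord]))   // MyRecord is a placeholder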
18© Cloudera, Inc. All rights reserved.
Labeling
•This seems trivial, but getting into good habits can save a lot of time
•Name your jobs
•sc.setJobGroup
•Name your RDDs
•rdd.setName
•Which is easier to understand?
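For instance (the names and paths are illustrative):

  // Groups the jobs this thread launches under one label in the Spark UI
  sc.setJobGroup("nightly-etl", "join clickstream to user profiles")
  // Named RDDs show up by name on the Storage tab
  val events = sc.textFile("/data/events").setName("raw-events").cache()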
19© Cloudera, Inc. All rights reserved.
Check Your Dependencies
•Incorrectly built jars and version mismatches are common issues
•Common errors:
•Could not find or load main class org.apache.spark.deploy.yarn.ApplicationMaster
•java.lang.ClassNotFoundException: org.apache.hadoop.mapreduce.MRJobConfig
•java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream
•ImportError: No module named qt_returns_summary_wrapper
•NoSuchMethodException
•Double-check and verify your artifacts
•Use the versions of components provided by your distro – those versions will work together
20© Cloudera, Inc. All rights reserved.
Spark Streaming
It’s all about the micro-batches
21© Cloudera, Inc. All rights reserved.
Spark Streaming
•Incoming data is represented as DStreams (Discretized Streams)
•Data is commonly read from streaming data channels like Kafka or Flume
•A Spark Streaming application is a DAG of transformations and actions on DStreams (and RDDs)
22© Cloudera, Inc. All rights reserved.
Discretized Stream
•The incoming data stream is broken down into micro-batches
•Micro-batch size is user-defined, usually 0.3 to 1 second
•Micro-batches are disjoint
•Each micro-batch is an RDD
•Effectively, a DStream is a sequence of RDDs, one per micro-batch
•Spark Streaming is known for high throughput
23© Cloudera, Inc. All rights reserved.
Windowed DStreams
• Defined by specifying a window size and a step size
• Both are multiples of micro-batch size
• Operations invoked on each window’s data
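A minimal sketch, assuming pairCounts is a DStream of (key, Int) and both durations are multiples of the micro-batch size:

  import org.apache.spark.streaming.Seconds
  // 30-second windows, recomputed every 10 seconds
  val windowed = pairCounts.reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(10))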
24© Cloudera, Inc. All rights reserved.
Fault Tolerance
• Handle driver failures
• Use a process control system, or submit with deploy-mode=cluster so the driver runs in the Application Master
• A driver failure then causes an Application Master failure
• Set yarn.resourcemanager.am.max-attempts higher (default 2), try 4
• Reset failure counts (pulled together as submit flags below)
• spark.yarn.am.attemptFailuresValidityInterval (default none), try 1h
• spark.yarn.max.executor.failures (default max(2 * num executors, 3)), try 8 * num_executors
• spark.yarn.executor.failuresValidityInterval (default none), try 1h
• spark.task.maxFailures (default 4), try 8
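As submit flags (values follow the suggestions above, with 64 = 8 * an assumed 8 executors; the class and jar are placeholders):

  spark-submit --master yarn-cluster \
    --conf spark.yarn.am.attemptFailuresValidityInterval=1h \
    --conf spark.yarn.max.executor.failures=64 \
    --conf spark.yarn.executor.failuresValidityInterval=1h \
    --conf spark.task.maxFailures=8 \
    --class com.example.StreamingApp streaming-app.jar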
25© Cloudera, Inc. All rights reserved.
Graceful Shutdown
• Design for failure, but you may need to finish in-flight batches
• yarn application -kill [applicationId] (may stop mid-batch)
• A shutdown hook runs too late
• spark.streaming.stopGracefullyOnShutdown doesn’t work on YARN
• Use a marker file or an HTTP endpoint instead (see the sketch below)
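A minimal marker-file sketch, assuming a StreamingContext ssc and a hypothetical HDFS path:

  import org.apache.hadoop.fs.{FileSystem, Path}
  val markerPath = new Path("/tmp/stop-this-app")   // hypothetical marker location
  val fs = FileSystem.get(ssc.sparkContext.hadoopConfiguration)
  ssc.start()
  var stopped = false
  while (!stopped) {
    // awaitTerminationOrTimeout returns true once the context has stopped
    stopped = ssc.awaitTerminationOrTimeout(10000)
    if (!stopped && fs.exists(markerPath)) {
      // Finish in-flight batches, then stop everything
      ssc.stop(stopSparkContext = true, stopGracefully = true)
    }
  }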
26© Cloudera, Inc. All rights reserved.
Prevent Data Loss
• Receiver-based
• Enable checkpointing
• Enable the WAL
• Upgrades won’t work – delete the checkpoint dir!
• Direct
• Checkpoint offsets (upgrades won’t work)
• Or save offsets manually in ZK, HDFS, HBase, an RDBMS, etc. (see the sketch below)
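A sketch of saving direct-stream offsets yourself (saveOffsets is a hypothetical helper backed by ZK/HDFS/HBase/an RDBMS):

  import org.apache.spark.streaming.kafka.HasOffsetRanges
  stream.foreachRDD { rdd =>
    val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    // Persist after the batch's work succeeds, so a restart resumes from here
    ranges.foreach(r => saveOffsets(r.topic, r.partition, r.untilOffset))
  }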
27© Cloudera, Inc. All rights reserved.
Performance
• Prevent starvation – create a dedicated pool
• --queue realtime_queue
• Protect against stragglers – enable speculation
• --conf spark.speculation=true
• With a single receiver, tasks tend to execute on the same node as the receiver
• Increase replication or lower spark.locality.wait (default 3s)
• Keep window computation proportional to batch time by using an inverse function (see the sketch below)
• mapWithState instead of updateStateByKey (cost scales with the size of the batch instead of the state)
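A sketch of the inverse-function idea (requires a checkpoint directory; pairCounts is a DStream of (key, Int)):

  // Adds the newest batch and subtracts the expired one, instead of
  // recomputing the whole 60-second window every 10 seconds
  val counts = pairCounts.reduceByKeyAndWindow(_ + _, _ - _, Seconds(60), Seconds(10))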
28© Cloudera, Inc. All rights reserved.
Parallelism/Partitions
• Repartition (may cause a shuffle)
• Batch interval & block interval determine the # of tasks (may cause a shuffle)
• Lower the block interval to increase tasks (min 50 ms)
• Batch interval / block interval should = # of executors
• Multiple receivers w/ union (avoids a shuffle)
• Kafka direct: increase Kafka partitions
• For receivers, don’t forget each receiver consumes a long-running task
29© Cloudera, Inc. All rights reserved.
Security
• Submit the principal and keytab for a secured cluster via spark-submit
• --principal user/hostname@domain --keytab keytabfile
• Disable HDFS Cache with HA Namenode (HDFS-9276, SPARK-11182)
• --conf spark.hadoop.fs.hdfs.impl.disable.cache=true
30© Cloudera, Inc. All rights reserved.
Backpressure
• Automatically finds the optimal ingestion rate for the processing time
• May hide lag – keep monitoring
• For a smooth startup:
• Spark 2
• spark.streaming.backpressure.initialRate
• Spark 1
• Receiver: spark.streaming.backpressure.initialRate
• Direct: spark.streaming.kafka.maxRatePerPartition
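Backpressure itself is a separate switch (the initial rate value is illustrative):

  --conf spark.streaming.backpressure.enabled=true \
  --conf spark.streaming.backpressure.initialRate=10000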
31© Cloudera, Inc. All rights reserved.
Logging
• Enable the YARN rolling log aggregator
• yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds
• Configure a Spark rolling strategy
- or -
• Use a custom log4j appender (a sketch follows below)
• --conf spark.driver.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties
• --conf spark.executor.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties
• --files /path/to/log4j.properties
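A minimal rolling log4j.properties sketch (sizes and counts are illustrative; ${spark.yarn.app.container.log.dir} resolves inside YARN containers):

  log4j.rootLogger=INFO, rolling
  log4j.appender.rolling=org.apache.log4j.RollingFileAppender
  log4j.appender.rolling.File=${spark.yarn.app.container.log.dir}/spark.log
  log4j.appender.rolling.MaxFileSize=50MB
  log4j.appender.rolling.MaxBackupIndex=5
  log4j.appender.rolling.layout=org.apache.log4j.PatternLayout
  log4j.appender.rolling.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c: %m%n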
32© Cloudera, Inc. All rights reserved.
Cloud
33© Cloudera, Inc. All rights reserved.
S3
• S3A treats S3 as a filesystem and is preferred over S3N and S3
• Listings are only eventually consistent
• Consider writing to HDFS first, then copying to S3
• DirectParquetOutputCommitter was removed in Spark 2
• If writing directly to S3, use the version 2 commit algorithm and turn off speculation:
• spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2
• spark.speculation=false
34© Cloudera, Inc. All rights reserved.
Parquet on S3
• When reading Parquet, enable random I/O
• fs.s3a.experimental.input.fadvise=random (set at the filesystem level, when created)
• spark.hadoop.parquet.enable.summary-metadata=false
• spark.sql.parquet.mergeSchema=false
• spark.sql.parquet.filterPushdown=true
• spark.sql.hive.metastorePartitionPruning=true
35© Cloudera, Inc. All rights reserved.
S3 Performance
• fs.s3a.block.size
• Tune fs.s3a.multipart.threshold and fs.s3a.multipart.size
• spark.hadoop.fs.s3a.readahead.range=67108864 (64 MB)
• Experimental: fs.s3a.fast.upload (buffer in memory) with fs.s3a.fast.buffer.size
• fs.s3a.threads.max (active uploads) and fs.s3a.max.total.tasks (queued)
36© Cloudera, Inc. All rights reserved.
YARN
• yarn.scheduler.fair.locality.threshold.node = -1
• yarn.scheduler.fair.locality.threshold.rack = -1
• spark.locality.wait.rack=0
• (With data in S3 rather than on local disks, waiting for node/rack locality only adds latency)
37© Cloudera, Inc. All rights reserved.
jason.hubbard@cloudera.com
Thank You