Running on a Cluster
Basics of RDD
Spark Runtime Architecture
[Diagram: Worker Nodes on Machine or Node 1, 2, and 3; Driver on Machine or Node 4; User]
The Driver
● The process where the main() method runs
● When you launch a Spark shell, you have created a driver program
● Once the driver terminates, the application is finished
● While running, it performs the following:
○ Converts a user program into tasks
■ Tasks are the units of execution
■ Converts the DAG (logical graph) into a physical execution plan
○ Schedules tasks on executors
Driver: Scheduling tasks on executors
● Coordinates the scheduling of individual tasks on executors
● Schedules tasks based on data placement
● Tracks cached data and uses it to schedule future tasks
● Runs the Spark web interface, by default on port 4040
Understanding The Architecture
[Diagram: Worker Nodes on Machine or Node 1, 2, and 3; Driver on Machine or Node 4; User; Resource Manager (YARN/Mesos/EC2/Standalone)]
Cluster Manager
● It is a pluggable component in Spark.
● This allows Spark to run on YARN, Mesos, and the built-in Standalone cluster manager.
Understanding The Architecture (continued)
[Diagram: the same layout with the Spark Context and three Executors added]
Executors
● Worker processes that run the tasks of a job
● Return results to the driver
● Launched once at the beginning of a Spark application
● Run for the entire lifetime of the application
● Provide in-memory storage for cached RDDs via the Block Manager
Understanding The Architecture (continued)
[Diagram: the full layout. Worker Nodes on Machine or Node 1, 2, and 3; Driver on Machine or Node 4; User; Resource Manager (YARN/Mesos/EC2/Standalone); Spark Context; three Executors; five Tasks]
Diagram annotations:
● Maintains RDD & executes workloads
● Converts the user's program into tasks & launches Spark applications
Launching a Program
[Diagram flow: Spark-Submit, Driver, Cluster Manager, Starts Executors, Exit / sc.stop()]
● The user submits an application using spark-submit.
● spark-submit launches the driver program.
● The driver invokes main() and creates the Spark context.
● The driver program contacts the cluster manager for resources.
● The cluster manager launches executors.
● The driver process runs through the user application.
● The driver sends work to the executors in the form of tasks.
● Tasks run on executor processes to compute and save results.
● When the driver's main() exits or sc.stop() is called, the executors are terminated and their resources are released.
Getting Started - Two Modes
1. Local Mode
2. Cluster Mode
The mode is selected with the --master option: spark-shell --master …
Local Mode or Spark in-process
1. Default mode
2. Does not require any resource manager
a. Simply download and run.
3. Good for utilizing multiple cores for processing
4. The number of partitions generally equals the number of CPU cores.
5. Generally used for testing
Getting Started - Local Mode
We can run spark-shell or spark-submit in local mode with any of:
○ spark-shell
○ spark-shell --master local (one worker thread)
○ spark-shell --master local[n] (n worker threads)
○ spark-shell --master local[*] (one worker thread per core)
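A minimal sketch of how the local master values above translate into worker-thread counts. The helper name threads_for_master is hypothetical and not part of Spark; it only mirrors the naming convention and does not start anything.

```shell
# Interpret a local-mode --master value as a worker-thread count.
# threads_for_master is a hypothetical helper, not a Spark command.
threads_for_master() {
    case "$1" in
        local)
            echo 1 ;;
        'local[*]')
            # One thread per logical core; fall back to 1 if getconf is unavailable.
            getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1 ;;
        'local['*']')
            # Strip the "local[" prefix and the "]" suffix to get n.
            n=${1#"local["}
            n=${n%"]"}
            echo "$n" ;;
        *)
            echo "not a local master: $1" >&2
            return 1 ;;
    esac
}

threads_for_master 'local'      # prints 1
threads_for_master 'local[4]'   # prints 4
```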
Local Mode - Check!
scala> sc.isLocal
res0: Boolean = true
scala> sc.master
res1: String = local[*]
Local Mode - HandsOn!
Cluster Modes
Different kinds of resource managers:
a. Standalone
b. YARN
c. Mesos
d. EC2
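As a rough illustration, the --master value is what selects among these resource managers. The sketch below classifies a master string by its prefix. The helper name cluster_manager_for is hypothetical, and the spark:// form with port 7077 (the usual standalone master port) is an assumption not shown on the slides.

```shell
# Map a --master value to the cluster manager it selects.
# cluster_manager_for is a hypothetical helper for illustration only.
cluster_manager_for() {
    case "$1" in
        local|'local['*']') echo "local" ;;
        spark://*)          echo "standalone" ;;
        yarn)               echo "yarn" ;;
        mesos://*)          echo "mesos" ;;
        *)                  echo "unknown" ;;
    esac
}

# Print the classification for a few example master values.
for m in 'local[*]' spark://masternode:7077 yarn mesos://masternode:5050; do
    printf '%s -> %s\n' "$m" "$(cluster_manager_for "$m")"
done
```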
Cluster Mode - Standalone
Uses Spark's built-in resource manager.
How to set up?
a. Install Spark on all nodes.
b. Inform all nodes about each other.
c. Launch Spark on all nodes.
d. The Spark nodes will discover each other.
Installing Standalone Cluster
1. Copy a compiled version of Spark to the same location on all your machines, for example /home/yourname/spark.
2. Set up password-less SSH access from your master machine to the others.
3. Edit the conf/slaves file on your master and fill in the workers' hostnames.
4. Run sbin/ on your master
5. Check the master web UI at http://masternode:8080
6. To stop the cluster, run bin/ on your master node.
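A small sketch of step 3, assuming the slaves file lists one worker hostname per line. The hostnames worker1 to worker3 are placeholders, and the file is written to a temporary directory rather than a real Spark installation.

```shell
# Write one worker hostname per line into a slaves file.
# The hostnames and the target directory are placeholders.
conf_dir=$(mktemp -d)
slaves_file="$conf_dir/slaves"

for host in worker1 worker2 worker3; do
    echo "$host"
done > "$slaves_file"

# Show what would be placed in conf/slaves on the master.
cat "$slaves_file"
```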
Cluster Mode - YARN
Runs Spark inside Hadoop's YARN.
Tasks run inside YARN containers.
How to use?
export YARN_CONF_DIR=/etc/hadoop/conf/
export HADOOP_CONF_DIR=/etc/hadoop/conf/
spark-shell --master yarn
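The exports above matter because Spark locates the YARN cluster through the Hadoop client configuration directory. A hedged sketch of a pre-flight check follows. The helper name yarn_conf_ready is hypothetical, and the spark-shell command is only printed, not executed.

```shell
# Succeed only if one of the Hadoop config variables points at an existing directory.
# yarn_conf_ready is a hypothetical helper name.
yarn_conf_ready() {
    for dir in "$HADOOP_CONF_DIR" "$YARN_CONF_DIR"; do
        if [ -n "$dir" ] && [ -d "$dir" ]; then
            return 0
        fi
    done
    echo "Set HADOOP_CONF_DIR or YARN_CONF_DIR before using --master yarn" >&2
    return 1
}

if yarn_conf_ready; then
    echo "spark-shell --master yarn"   # printed, not executed
fi
```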
Launching a program on yarn - Hands On
1. export YARN_CONF_DIR=/etc/hadoop/conf/
2. export HADOOP_CONF_DIR=/etc/hadoop/conf/
3. spark-submit --master yarn --class org.apache.spark.examples.SparkPi
/usr/hdp/current/spark-client/lib/spark-examples-*.jar 10
Launching a program on yarn - Hands On
export YARN_CONF_DIR=/etc/hadoop/conf/
export HADOOP_CONF_DIR=/etc/hadoop/conf/
git clone
cd bigdata
cd spark/
cd projects/
cd apache-log-parsing_sbt/
sbt clean
sbt package
spark-submit --master yarn target/scala-2.10/apache-log-parsing_2.10-0.0.1.jar 10 10 /data/spark/project/access/access.log.45.gz
Hands On: Launching a program on yarn
Cluster Mode - MESOS
1. Mesos is a general-purpose cluster manager.
2. It runs both analytics workloads and long-running services (e.g., databases).
3. To use Spark on Mesos, pass a mesos:// URI to spark-submit:
spark-submit --master mesos://masternode:5050 yourapp
4. In a multi-master Mesos setup, you can use ZooKeeper to elect the master.
5. In that case, use a mesos://zk:// URI pointing to a list of ZooKeeper nodes.
6. Example: with three nodes (n1, n2, n3) running ZooKeeper on port 2181, use the URI:
mesos://zk://n1:2181,n2:2181,n3:2181/mesos
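A minimal sketch of assembling the ZooKeeper-based master URI from a host list. It assumes the comma-separated host:port form followed by a single znode path, as in the Spark documentation. The helper name mesos_zk_uri is hypothetical.

```shell
# Build a mesos://zk:// master URI from a port, a znode path, and ZooKeeper hosts.
# mesos_zk_uri is a hypothetical helper, not part of Spark or Mesos.
mesos_zk_uri() {
    port=$1
    znode=$2
    shift 2
    hosts=""
    for h in "$@"; do
        # Join host:port pairs with commas.
        hosts="${hosts:+$hosts,}$h:$port"
    done
    echo "mesos://zk://$hosts$znode"
}

mesos_zk_uri 2181 /mesos n1 n2 n3
# prints mesos://zk://n1:2181,n2:2181,n3:2181/mesos
```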
Cluster Mode - Amazon EC2
● Spark comes with a built-in script to launch clusters on Amazon EC2.
● First, create an Amazon Web Services (AWS) account.
● Obtain an access key ID and secret access key.
● Export these as environment variables:
○ export AWS_ACCESS_KEY_ID="..."
○ export AWS_SECRET_ACCESS_KEY="..."
● Create an EC2 SSH key pair and download its private key file (used for SSH access).
● Launch a cluster with the spark-ec2 script:
○ cd /path/to/spark/ec2
○ ./spark-ec2 -k mykeypair -i mykeypair.pem launch mycluster
Deployment Modes
● Determined by where the driver runs.
● Two modes:
○ Client - launches the driver program locally. This is the default.
○ Cluster - launches the driver on one of the worker machines inside the cluster.
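A small sketch of how the deploy mode is chosen on the command line. The wrapper submit_cmd and the file name app.jar are placeholders. The command is only printed, so nothing is submitted.

```shell
# Print a spark-submit command for a given deploy mode (default: client).
# submit_cmd is a hypothetical wrapper; app.jar is a placeholder.
submit_cmd() {
    mode=${1:-client}
    case "$mode" in
        client|cluster) ;;
        *) echo "deploy mode must be client or cluster" >&2; return 1 ;;
    esac
    echo "spark-submit --master yarn --deploy-mode $mode app.jar"
}

submit_cmd            # client mode: the driver runs on the submitting machine
submit_cmd cluster    # cluster mode: the driver runs inside the cluster
```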
Architecture Yarn Client Mode
1. The driver application runs outside YARN
a. On the machine where it is launched
2. If the driver application shuts down, the process is killed
3. Not resilient, but quicker to run
Architecture Yarn Cluster Mode
1. The driver application runs inside YARN, in the application master
2. If the launcher shuts down, the process continues in the background, like a batch process
3. Preferred way to run long-running processes
Architecture Yarn cluster Mode - Example
export YARN_CONF_DIR=/etc/hadoop/conf/
export HADOOP_CONF_DIR=/etc/hadoop/conf/
spark-submit --master yarn --deploy-mode cluster --class org.apache.spark.examples.SparkPi /usr/hdp/current/spark-client/lib/spark-examples-*.jar 10
To check the status, use:
Architecture Yarn cluster Mode - Demo
Hands On
Which Cluster Manager to Use?
1. Start with local mode if this is a new deployment.
2. For richer resource scheduling capabilities (e.g., queues), use YARN or Mesos.
3. When sharing among many users is the primary criterion, use Mesos.
4. In all cases, it is best to run Spark on the same nodes as HDFS for fast access to storage.
a. You can install either Mesos or a Standalone cluster on the DataNodes
b. Or rely on Hadoop distributions, which already install YARN and HDFS together
Packaging Your Code and Dependencies
1. Bundle all the libraries that your program depends upon.
2. There is no need to bundle the Spark libraries (org.apache.spark) or the language libraries (java…).
3. Python users can:
a. Either install the libraries on all nodes using pip or easy_install
b. Or use the --py-files argument of spark-submit (copies the files to every node's working directory)
4. Java & Scala users can:
a. Submit libraries using --jars
b. If there are many libraries, use a build tool such as sbt or Maven
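A brief sketch of the --jars option for a handful of libraries. The option takes a comma-separated list, so the helper below joins its arguments with commas. The helper name join_jars and the JAR paths are placeholders, and the command is printed rather than run.

```shell
# Join JAR paths with commas, the separator that --jars expects.
# join_jars is a hypothetical helper; the paths are placeholders.
join_jars() {
    joined=""
    for jar in "$@"; do
        joined="${joined:+$joined,}$jar"
    done
    echo "$joined"
}

echo "spark-submit --jars $(join_jars libs/a.jar libs/b.jar) app.jar"
# prints spark-submit --jars libs/a.jar,libs/b.jar app.jar
```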
Common flags for spark-submit
Flag               Explanation
--master           Indicates the cluster manager to connect to. The options for this flag are described earlier.
--deploy-mode      Whether to launch the driver program locally (“client”) or on one of the worker machines inside the cluster (“cluster”). In client mode, spark-submit runs your driver on the same machine where spark-submit itself is invoked. In cluster mode, the driver is shipped to execute on a worker node in the cluster. The default is client mode.
--class            The “main” class of your application if you’re running a Java or Scala program.
--name             A human-readable name for your application. This will be displayed in Spark’s web UI.
--jars             A list of JAR files to upload and place on the classpath of your application. If your application depends on a small number of third-party JARs, you can add them here.
--files            A list of files to be placed in the working directory of your application. This can be used for data files that you want to distribute to each node.
--py-files         A list of files to be added to the PYTHONPATH of your application. This can contain .py, .egg, or .zip files.
--executor-memory  The amount of memory to use for executors, in bytes. Suffixes can be used to specify larger quantities such as “512m” (512 megabytes) or “15g” (15 gigabytes).
--driver-memory    The amount of memory to use for the driver process, in bytes. Suffixes can be used to specify larger quantities such as “512m” (512 megabytes) or “15g” (15 gigabytes).
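To make the memory suffixes concrete, here is a minimal sketch that converts a value such as 512m or 15g into megabytes. It assumes 1 g = 1024 m. The helper name to_mb is hypothetical and handles only the m and g suffixes.

```shell
# Convert a memory setting such as 512m or 15g to megabytes (1 g = 1024 m assumed).
# to_mb is a hypothetical helper; only the m and g suffixes are handled.
to_mb() {
    case "$1" in
        *[mM]) echo "${1%?}" ;;
        *[gG]) echo $(( ${1%?} * 1024 )) ;;
        *) echo "expected a value ending in m or g" >&2; return 1 ;;
    esac
}

to_mb 512m   # prints 512
to_mb 15g    # prints 15360
```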
Thank you!
Apache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLab
