This document summarizes using Kubernetes to deploy a Spark big data computing environment. It discusses why Kubernetes is preferable to alternatives such as Cloudera for managing Spark, shows the architecture of running Spark on Kubernetes with the Spark master and worker controllers, and compares performance between Spark on Kubernetes and standalone Spark using the SparkPi and WordCount examples. As of Spark 2.3.0, Kubernetes support is official.
5. About Big Data Solution
● Famous management tool -- Cloudera
○ Too big
○ Too difficult
○ Users do not want it (most important)
● Famous container management tool -- K8S
○ Small
○ Simple
○ Users want it
12. Spark on K8S Architecture
● Only one master
● Using nodeAffinity to keep the Worker and the Master off the same node
● Using podAntiAffinity to ensure each node has only one worker
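The two affinity rules above can be sketched in the worker controller's pod template. This is a minimal illustration, not the deck's actual manifests: the labels `component: spark-worker` and `spark-role: master` are assumed names for whatever labels the deployment really uses.

```yaml
# Sketch of the affinity section of a worker pod template.
# Label names (component, spark-role) are illustrative assumptions.
spec:
  affinity:
    # podAntiAffinity: refuse to schedule a worker onto a node that
    # already runs a pod labeled as a worker, so each node hosts at
    # most one worker.
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            component: spark-worker
        topologyKey: kubernetes.io/hostname
    # nodeAffinity: only schedule workers on nodes NOT labeled for
    # the master, keeping Worker and Master on separate nodes.
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: spark-role
            operator: NotIn
            values: ["master"]
```

`topologyKey: kubernetes.io/hostname` makes the anti-affinity rule per-node, which is what "each node has only one worker" requires.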
18. How it works
$ bin/spark-submit \
  --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=5 \
  --conf spark.kubernetes.container.image=<spark-image> \
  local:///path/to/examples.jar
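In practice the driver also needs a namespace and a service account with permission to create executor pods. A minimal sketch of the extra flags, using two properties from the Spark 2.3 Kubernetes backend (the angle-bracket values are placeholders):

```
--conf spark.kubernetes.namespace=<namespace>
--conf spark.kubernetes.authenticate.driver.serviceAccountName=<service-account>
```

Without a suitably privileged service account, the driver pod starts but cannot spawn its executors.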
19. Currently experimental...
● Client mode is not currently supported.
● Future Work
○ PySpark
○ R
○ Dynamic Executor Scaling
○ Local File Dependency Management
○ Spark Application Management
○ Job Queues and Resource Management