No more struggles with Apache Spark (PySpark)
workloads in production
Chetan Khatri, Solution Architect - Data Science.
Accionlabs India.
PyconZA 2019, The Wanderers Club in Illovo.
Johannesburg, South Africa
11th Oct, 2019
Twitter: @khatri_chetan,
Email: chetan.khatri@live.com
chetan.khatri@accionlabs.com
LinkedIn: https://www.linkedin.com/in/chetkhatri
Github: chetkhatri
Who am I?
Solution Architect - Data Science @ Accion labs India Pvt. Ltd.
Contributor @ Apache Spark, Apache HBase, Elixir Lang.
Co-Authored University Curriculum @ University of Kachchh, India.
Ex - Data Engineering @: Nazara Games, Eccella Corporation.
Masters - Computer Science from University of Kachchh, India.
Daily Activity?
Functional Programming, Distributed Computing, Python, Scala, Haskell, Data
Science, Product Development
Helping organizations create innovative products and solutions using emerging technologies
An Innovation-Focused Technology Services Firm
2300+ Employees | 75+ Clients | 20+ Accelerators | 12+ Global Offices | 7 Development Centers
Accion Labs - Introduction
● A Global Technology Services firm focused on Emerging Technologies
○ 12 offices, 7 dev centers, 2300+ employees, 75+ active clients
● Profitable, venture-backed company
○ 3 rounds of funding, 8 acquisitions to bolster emerging tech capability and leadership
● Flexible Outcome-based Engagement Models
○ Projects, Extended teams, Shared IP, Co-development, Professional Services
● Framework Based Approach to Accelerate Digital Transformation
○ Breeze Digital Blueprint, a collection of tools and frameworks, helps gain 25-30% efficiency
● Action-oriented Leadership Team
○ Fastest growing firm from Pittsburgh (2014, 2015, 2016), E&Y award 2015, PTC Finalist 2018
Accion’s Emerging Tech Capabilities
● Adaptive UI, UX Engineering
● NLP, Voice Interface & Chat Bots
● Artificial Intelligence and Machine Learning
● Data Lake & Big Data Analytics
● Blockchain, Payment Technologies
● Cloud Strategy and Transformation
● Mobile Development
● MicroServices and Serverless Computing
● QA Engineering, RPA and DevOps Automation
● SFDC, ServiceNow, IBM Solutions, Azure
Agenda
● Apache Spark
● Primary data structures (RDD, Dataset, DataFrame)
● Pragmatic explanation - executors, cores, containers, stages, jobs, and tasks in Spark.
● Parallel read from JDBC: Challenges and best practices.
● Bulk Load API vs JDBC write
● An optimization strategy for Joins: SortMergeJoin vs BroadcastHashJoin
● Avoid unnecessary shuffle
● Optimize Spark stage generation plan
● Predicate pushdown with partitioning and bucketing
● Airflow DAG scheduling for Apache Spark workflows - Design, Architecture, Demo.
What is Apache Spark?
● Apache Spark is a fast, general-purpose cluster computing system / unified engine for large-scale data processing.
● It provides high-level APIs in Scala, Java, Python and R, and an optimized engine that supports general execution graphs.
Structured Data / SQL - Spark SQL
Graph Processing - GraphX
Machine Learning - MLlib
Streaming - Spark Streaming, Structured Streaming
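For orientation, a minimal PySpark entry point looks like the sketch below (the app name and master here are placeholders):

from pyspark.sql import SparkSession

# Entry point for the DataFrame / SQL APIs; the SparkContext is spark.sparkContext
spark = (SparkSession.builder
         .appName("pyconza-demo")   # placeholder name
         .master("yarn")            # or "local[*]" on a laptop
         .getOrCreate())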
What are RDDs ?
1. Distributed Data Abstraction
A logical model of data partitioned across distributed storage on the cluster (HDFS, S3).
2. Resilient & Immutable
RDD -> T -> RDD -> T -> RDD (T = Transformation)
Each transformation produces a new immutable RDD; the recorded lineage lets lost partitions be recomputed, which is what makes RDDs resilient.
3. Compile-time Type Safety / Strong Type Inference
Integer RDD
String or Text RDD
Double or Binary RDD
4. Lazy evaluation
RDD - T - RDD - T - RDD - T - RDD - A - RDD
T = Transformation, A = Action
Transformations only build the lineage; nothing executes until an action is invoked.
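A small sketch of laziness in PySpark (assumes the spark session from earlier):

# Transformations only build lineage; the action at the end triggers execution.
rdd = spark.sparkContext.parallelize(range(1, 101), 4)

doubled = rdd.map(lambda x: x * 2)             # T: nothing runs yet
evens = doubled.filter(lambda x: x % 4 == 0)   # T: still nothing

total = evens.reduce(lambda a, b: a + b)       # A: the whole chain executes here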
Apache Spark Operations
Every operation is either a Transformation (lazy) or an Action (triggers execution).
Essential Spark Operations
TRANSFORMATIONS
General: map, filter, flatMap, mapPartitions, mapPartitionsWithIndex, groupBy, sortBy
Math / Statistical: sample, randomSplit
Set Theory / Relational: union, intersection, subtract, distinct, cartesian, zip
Data Structure / I/O: keyBy, zipWithIndex, zipWithUniqueID, zipPartitions, coalesce, repartition, repartitionAndSortWithinPartitions, pipe

ACTIONS
General: reduce, collect, aggregate, fold, first, take, forEach, top, treeAggregate, treeReduce, forEachPartition, collectAsMap
Math / Statistical: count, takeSample, max, min, sum, histogram, mean, variance, stdev, sampleVariance, countApprox, countApproxDistinct
Set Theory / Relational: takeOrdered
Data Structure / I/O: saveAsTextFile, saveAsSequenceFile, saveAsObjectFile, saveAsHadoopDataset, saveAsHadoopFile, saveAsNewAPIHadoopDataset, saveAsNewAPIHadoopFile
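A tiny sketch touching a few of these operations (made-up data, illustrative only):

pairs = spark.sparkContext.parallelize(
    [("finance", 20), ("games", 7), ("finance", 22)])

grouped = pairs.groupBy(lambda kv: kv[0])     # transformation: lazy
total = pairs.map(lambda kv: kv[1]).sum()     # action: 49
top_one = pairs.top(1, key=lambda kv: kv[1])  # action: [('finance', 22)]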
When to use RDDs ?
You want low-level control of your dataset and know exactly how the data looks; you want the low-level API.
You prefer lots of lambda functions over a DSL.
You don't care about a schema or structure for your data.
You are willing to give up the optimizations, performance, and efficiency that the structured APIs provide!
Caveat: RDDs are very slow for non-JVM languages like Python and R.
You accept inadvertent inefficiencies.
Inadvertent inefficiencies in RDDs
Structured in Spark
DataFrames
Datasets
Structured APIs in Apache Spark
                 SQL        DataFrames     Datasets
Syntax Errors    Runtime    Compile Time   Compile Time
Analysis Errors  Runtime    Runtime        Compile Time

Analysis errors are caught before a job runs on the cluster.
DataFrame API Code
from pyspark.sql import functions as F

# convert RDD -> DF with column names
parsedDF = parsedRDD.toDF("project", "sprint", "numStories")

# filter, groupBy, then agg()
(parsedDF.filter(parsedDF["project"] == "finance")
    .groupBy("sprint")
    .agg(F.sum("numStories").alias("count"))
    .limit(100)
    .show(100))
project sprint numStories
finance 3 20
finance 4 22
DataFrame -> SQL View -> SQL Query
parsedDF.createOrReplaceTempView("audits")
results = spark.sql("""
    SELECT sprint, SUM(numStories) AS count
    FROM audits
    WHERE project = 'finance'
    GROUP BY sprint
    LIMIT 100
""")
results.show(100)
project sprint numStories
finance 3 20
finance 4 22
Catalyst in Spark
SQL AST / DataFrame / Dataset
  -> Unresolved Logical Plan
  -> Logical Plan (analysis)
  -> Optimized Logical Plan (Catalyst optimizer)
  -> Physical Plans
  -> Selected Physical Plan (chosen by Cost Model)
  -> RDDs
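You can watch Catalyst's work from PySpark: explain(True) prints the parsed, analyzed, and optimized logical plans plus the physical plan. A sketch (sampleDF here is a made-up frame):

sampleDF = spark.range(10).withColumnRenamed("id", "eid")
sampleDF.filter(sampleDF["eid"] > 5).explain(True)
# Prints: Parsed Logical Plan, Analyzed Logical Plan,
# Optimized Logical Plan, Physical Plan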
Example: DataFrame Optimization
employees.join(events, employees("id") === events("eid"))
.filter(events("date") > "2015-01-01")
Logical Plan: scan the events file and the employees table -> join -> filter.
Physical Plan: scan (employees), scan (events) -> filter -> join.
Physical Plan with Predicate Pushdown and Column Pruning: optimized scan (employees) and optimized scan (events) feed the join directly; the date filter is pushed into the events scan.
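Predicate pushdown pairs naturally with partitioned and bucketed storage. A sketch (paths and table names are placeholders; employees is assumed to be a DataFrame):

# Partition by a filter column; reads that filter on it prune whole directories.
employees.write.partitionBy("country").parquet("/tmp/employees_parq")

# Bucketing pre-hashes rows by the join key, which can avoid a shuffle at join time.
(employees.write
    .bucketBy(8, "id")
    .sortBy("id")
    .saveAsTable("employees_bucketed"))

# Only the country='ZA' partition directory is scanned here.
spark.read.parquet("/tmp/employees_parq").filter("country = 'ZA'").explain()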
DataFrames are Faster than RDDs
Source: Databricks
Pragmatic Approach: Executors, Cores, Containers, Stages, Jobs, Tasks
Spark Internals terminology
Job - each action in Spark triggers a separate job (transformations are lazy and only build the plan).
Stage - a set of tasks within a job that can run in parallel; stage boundaries fall at shuffles.
Task - the lowest-level unit of concurrent and parallel execution.
Each stage is split into one task per partition,
i.e. number of tasks in a stage = number of partitions in that stage.
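A sketch that makes the terminology concrete: one action, one job, two stages split at the shuffle, and one task per partition in each stage.

rdd = spark.sparkContext.parallelize(range(1000), 8)   # 8 partitions

result = (rdd.map(lambda x: (x % 10, 1))               # stage 1 (narrow)
             .reduceByKey(lambda a, b: a + b)          # shuffle = stage boundary
             .collect())                               # action: triggers 1 job
# Spark UI: 1 job, 2 stages, 8 tasks per stage
# (reduceByKey keeps the parent partition count by default).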
Spark Internals: Jobs
Spark Internals: Stages
Spark Internals: Tasks
Spark on Yarn Internals terminology
yarn.scheduler.minimum-allocation-vcores = 1
yarn.scheduler.maximum-allocation-vcores = 6
yarn.scheduler.minimum-allocation-mb = 4096
yarn.scheduler.maximum-allocation-mb = 28832
yarn.nodemanager.resource.memory-mb = 54000
Max containers per node = floor(yarn.nodemanager.resource.memory-mb / yarn.scheduler.minimum-allocation-mb) = floor(54000 / 4096) = 13
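The same arithmetic as a quick sanity check, using the values above:

node_memory_mb = 54000   # yarn.nodemanager.resource.memory-mb
min_alloc_mb = 4096      # yarn.scheduler.minimum-allocation-mb
max_containers = node_memory_mb // min_alloc_mb   # = 13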
Resource Manager (Yarn) Tuning
Spark Scheduler FIFO to FAIR
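The FIFO-to-FAIR switch itself is one config line; a sketch (the pool name and allocation-file path are placeholders):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.scheduler.mode", "FAIR")
         # optional: pool definitions live in an XML allocation file
         .config("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
         .getOrCreate())

# Jobs submitted from this thread are scheduled in the named pool.
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "etl_pool")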
Parallel read from JDBC: Challenges
and best practices.
Spark JDBC Read
What happens when you run this code?
What would be the impact on the database engine side?
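The code on this slide was a screenshot; below is a minimal reconstruction of the naive read (URL, table, and credentials are placeholders). With no partitioning options, Spark opens a single connection and pulls the whole table through one task:

employeeDF = (spark.read
    .format("jdbc")
    .option("url", "jdbc:sqlserver://dbhost:1433;database=hr")  # placeholder
    .option("dbtable", "dbo.EMPLOYEE")                          # placeholder
    .option("user", "spark_user")
    .option("password", "****")
    .load())
# One partition, one connection: a full-table SELECT hits the DB engine.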
Spark JDBC Read: Impact on Database engine, e.g. MSSQL Server
Spark Parallel JDBC Read
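These slides were screenshots; the standard recipe is the four partitioning options below (the column and bounds are placeholders). Spark issues numPartitions range-bounded queries in parallel:

employeeDF = (spark.read
    .format("jdbc")
    .option("url", "jdbc:sqlserver://dbhost:1433;database=hr")  # placeholder
    .option("dbtable", "dbo.EMPLOYEE")
    .option("user", "spark_user")
    .option("password", "****")
    .option("partitionColumn", "employee_id")  # numeric column, placeholder
    .option("lowerBound", "1")
    .option("upperBound", "1000000")
    .option("numPartitions", "10")             # 10 concurrent connections
    .load())
# Each task runs: SELECT ... WHERE employee_id >= X AND employee_id < Y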
Impact on Database after Spark Parallel Read
Bulk Load API vs JDBC write
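These slides were screenshots. For contrast: a plain JDBC write issues batched INSERTs per partition, while the SQL Server Bulk Copy API (the Scala bulkCopyToSqlDB call shown later in this deck, from the azure-sqldb-spark connector) streams rows via bulk insert and is typically much faster. A sketch of the JDBC side in PySpark (connection details are placeholders):

(employeeDF.write
    .format("jdbc")
    .option("url", "jdbc:sqlserver://dbhost:1433;database=hr")  # placeholder
    .option("dbtable", "dbo.EMPLOYEE_CLIENT")
    .option("user", "spark_user")
    .option("password", "****")
    .option("batchsize", "2500")   # rows per INSERT batch, per partition
    .mode("append")
    .save())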
An optimization strategy for Joins: SortMergeJoin vs BroadcastHashJoin
The JoinSelection execution planning strategy uses the spark.sql.autoBroadcastJoinThreshold property (default: 10MB) as the maximum size of a dataset that Spark will broadcast to all worker nodes when performing a join.
# check broadcast join threshold (PySpark)
>>> int(spark.conf.get("spark.sql.autoBroadcastJoinThreshold")) / 1024 / 1024
10.0

# logical plan with numbered tree (Scala-side API)
sampleDF.queryExecution.logical.numberedTreeString

# query plan
sampleDF.explain()
An optimization strategy for Joins: SortMergeJoin vs BroadcastHashJoin
Repartition: boosts parallelism by increasing the number of partitions; partitioning on the join key makes same-key joins faster.

// coalesce reduces the number of partitions without a full shuffle, whereas repartition shuffles data evenly across the cluster.
employeeDF.coalesce(10).bulkCopyToSqlDB(bulkWriteConfig("EMPLOYEE_CLIENT"))

For example, in a bulk JDBC write with "bulkCopyBatchSize" -> "2500": the DataFrame has 10 partitions, and each partition writes batches of 2,500 records in parallel.

Fewer partitions reduce network communication, file I/O, and bandwidth overhead.
An optimization strategy for Joins: SortMergeJoin vs BroadcastHashJoin
1. Disable auto-broadcast joins:
   spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
2. Order doesn't matter for the broadcast decision:
   table1.join(table2, on, "left") or table2.join(table1, on, "left")
3. Don't force a broadcast if one DataFrame is not small!
4. Minimize shuffling and boost parallelism: partitioning, bucketing, coalesce, repartition, HashPartitioner. (A sketch follows below.)
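A sketch of both join strategies side by side (tiny in-memory frames, illustrative only):

from pyspark.sql.functions import broadcast

big = spark.range(1000000).withColumnRenamed("id", "eid")
small = spark.range(100).withColumnRenamed("id", "eid")

# Below the threshold (or with an explicit hint): BroadcastHashJoin
big.join(broadcast(small), "eid").explain()

# With auto-broadcast disabled, Spark falls back to SortMergeJoin
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
big.join(small, "eid").explain()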
An optimization strategy for Joins: SortMergeJoin vs BroadcastHashJoin
Spark Submit Hyper-parameters and Dynamic Allocation
./bin/spark-submit \
  --name PyConLT19 \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 18g \
  --executor-memory 24g \
  --num-executors 4 \
  --executor-cores 6 \
  --conf spark.yarn.maxAppAttempts=1 \
  --conf spark.speculation=false \
  --conf spark.broadcast.compress=true \
  --conf spark.sql.broadcastTimeout=36000 \
  --conf spark.network.timeout=2500s \
  --conf spark.executor.heartbeatInterval=30s \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.executorAllocationRatio=1 \
  --conf spark.dynamicAllocation.executorIdleTimeout=60s \
  --conf spark.dynamicAllocation.schedulerBacklogTimeout=15s \
  --conf spark.dynamicAllocation.sustainedSchedulerBacklogTimeout=15s \
  --conf spark.dynamicAllocation.minExecutors=2 \
  --conf spark.dynamicAllocation.initialExecutors=2 \
  --conf spark.dynamicAllocation.maxExecutors=6 \
  examples/src/main/python/pi.py
Case Study: High Level Architecture
OLTP Shadow Data Source -> Sqoop -> HDFS (Parquet) -> Apache Spark / Spark SQL on the Yarn cluster manager -> Bulk Load -> Customer-Specific Reporting DB
Orchestration and parallelism: Airflow
Spark Streaming
Code!
Ref. https://github.com/chetkhatri/getting-started-airflow-for-spark/blob/master/spark_streaming_kafka.py
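A minimal sketch of a job of that shape (the linked file is the real code; the broker, topic, and batch interval here are placeholders; this uses the Spark 2.x DStream API, which needs the spark-streaming-kafka package):

from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

ssc = StreamingContext(spark.sparkContext, 10)   # 10-second micro-batches

stream = KafkaUtils.createDirectStream(
    ssc,
    topics=["events"],                                   # placeholder topic
    kafkaParams={"metadata.broker.list": "kafka:9092"})  # placeholder broker

stream.map(lambda kv: kv[1]).pprint()   # print the value of each record

ssc.start()
ssc.awaitTermination()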
Key role of Apache Airflow for Scheduling Data
Pipelines
Codebase: https://github.com/chetkhatri/getting-started-airflow-for-spark
Trigger the Airflow DAG from API
curl -d '{"conf": "{\"retail_id\": \"29\", \"env_type\": \"dev\", \"size_is\": \"medium\"}", "run_id": "retailer_1111"}' \
  -H "Content-Type: application/json" \
  -X POST http://localhost:8000/api/experimental/dags/nextgen_data_platforms/dag_runs
Ref. https://github.com/teamclairvoyant/airflow-rest-api-plugin
Spark Submit Operator (SparkSubmitOperator, which inherits BaseOperator and runs spark-submit via SparkSubmitHook)
https://github.com/apache/airflow/blob/master/airflow/contrib/operators/spark_submit_operator.py
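A hedged sketch of wiring that operator into a DAG (Airflow 1.10-era contrib import, as in the deck; the dag_id, connection id, and paths are placeholders):

from datetime import datetime
from airflow import DAG
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator

dag = DAG("nextgen_data_platform",        # placeholder dag_id
          start_date=datetime(2019, 10, 1),
          schedule_interval=None)         # trigger via the REST API instead

submit_etl = SparkSubmitOperator(
    task_id="spark_etl",
    application="/opt/jobs/etl.py",       # placeholder PySpark script
    conn_id="spark_yarn",                 # Airflow connection to the cluster
    name="nextgen-etl",
    executor_memory="24g",
    num_executors=4,
    dag=dag)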
Airflow - config.txt
Airflow spark_config.txt
Airflow - spark_hyperparameters.json
Airflow - nextgen_data_platform DAG
Airflow - nextgen_data_master_tables_subdag
Airflow - common_util
References
[1] How to Setup Airflow Multi-Node Cluster with Celery & RabbitMQ.
https://medium.com/@khatri_chetan/how-to-setup-airflow-multi-node-cluster-with-celery-rabbitmq-cfde7756bb6a
[2] Setup and Configure Multi Node Airflow Cluster with HDP Ambari and Celery for Data Pipelines.
https://medium.com/@khatri_chetan/setup-and-configure-multi-node-airflow-cluster-with-hdp-ambari-and-celery-for-data-pipelines-dc1e96f3d773
[3] Challenges and Struggle while Setting up Multi-Node Airflow Cluster.
https://medium.com/@khatri_chetan/challenges-and-struggle-while-setting-up-multi-node-airflow-cluster-7f19e998ebb
[4] Leveraging Spark Speculation To Identify And Re-Schedule Slow Running Tasks.
https://blog.yuvalitzchakov.com/leveraging-spark-speculation-to-identify-and-re-schedule-slow-running-tasks/
Questions ?
Thank you!
PyCon ZA Organizers and South Africa Python Community.