Presented By: Amit Kumar
Apache Spark Performance Tuning and Best Practices
Lack of etiquette and manners is a huge turn off.
KnolX Etiquettes
Punctuality
Join the session 5 minutes prior to
the session start time. We start on
time and conclude on time!
Feedback
Make sure to submit constructive
feedback for all sessions, as it is
very helpful for the presenter.
Silent Mode
Keep your mobile devices in silent
mode; feel free to step out of the
session in case you need to attend
an urgent call.
Avoid Disturbance
Avoid unwanted chit chat during
the session.
Our Agenda
01 Spark Introduction
02 Code Level Optimization
03 Outside Code Technique
04 Demo
05 Summary
Introduction
● Apache Spark is an open-source, in-memory computation framework.
● It gives high performance for both batch and streaming jobs.
● It is built for big data processing.
● It is roughly 100 times faster than MapReduce because of in-memory computation.
As a big data processing framework it also consumes a lot of resources such as
CPU, RAM and storage. Optimising one or more of these together leads to significant cost savings.
In the next 40 minutes we will learn about approaches that help us do so.
Ways to Optimise
Code Level:-
Here we will learn the best practices to follow in order to achieve high performance with minimal
resources, such as: caching, broadcasting, serialization, using DataSet/DataFrame over RDD, avoiding
UDFs, filtering data at the earliest, and reducing shuffle.
Beyond Code:-
Here we will learn to tune configuration parameters at the cluster/resource level, such as:
file format, level of parallelism, executor config, memory tuning, and batch interval.
Major Bottleneck
● CPU
● Network Bandwidth
● Memory
Our goal is to optimise each of these as much as possible in order to reduce the resources used
and the computation time, achieving optimum performance.
Caching
Suppose in our analytics project we have a text file of flight data. We have to read it, compute the number of flights leaving
from a particular country, and reuse that result multiple times.
● Raw data is in a text file
● Reading the text file gives DF1
● Grouping by origin country gives DF2
Caching
JOB1:- Number of flights leaving the US, as DF3
JOB2:- Number of flights leaving Singapore, as DF4
JOB3:- Number of flights leaving India, as DF5
Execution plan for JOB1 :- DF1 > DF2 > DF3
Execution plan for JOB2 :- DF1 > DF2 > DF4; after caching DF2 it becomes DF2 > DF4, so the DF1 > DF2 step is not needed.
Execution plan for JOB3 :- DF1 > DF2 > DF5; after caching DF2 it becomes DF2 > DF5, so the DF1 > DF2 step is not needed.
Here, instead of computing DF1 and DF2 again, we cache the last reusable DataFrame in memory so that we can
reuse it in other jobs, reducing computation resources and saving time.
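A minimal Scala sketch of this scenario (the file path and column names are assumptions for illustration, not from the slides):
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("CachingExample").master("local[*]").getOrCreate()
// DF1: read the raw flight data (hypothetical path and schema)
val df1 = spark.read.option("header", "true").csv("/data/flights.csv")
// DF2: group by origin country once, then cache the reusable result
val df2 = df1.groupBy("origin_country").count().cache()
// JOB1/JOB2/JOB3 reuse the cached DF2 instead of re-reading and re-grouping
val df3 = df2.filter(df2("origin_country") === "United States")
val df4 = df2.filter(df2("origin_country") === "Singapore")
val df5 = df2.filter(df2("origin_country") === "India")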
Broadcasting
A broadcast variable allows us to keep a read-only variable cached on each executor, so we don’t have to send it with
every task. This helps reduce network bandwidth usage and time consumption.
When to Use a Broadcast Variable:-
Suppose we have lookup data that needs to be used by each executor while performing its tasks.
We have 100 partitions and a 10-executor cluster (every executor takes care of 10 partitions),
so we need to execute at least 100 tasks and would send the lookup data 100 times to the executors (once with every task).
But if we broadcast it, the lookup data is sent to each executor only once, so only 10 copies are
sent.
Benefit = sending 100 copies vs sending 10 copies
val states = Map(("NY","New York"),("CA","California"),("FL","Florida"))
val countries = Map(("USA","United States of America"),("IN","India"))
val broadcastStates = spark.sparkContext.broadcast(states)
val broadcastCountries = spark.sparkContext.broadcast(countries)
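A short usage sketch (the sample records and lookup-by-code logic are assumptions for illustration): on the executors the broadcast value is read via .value, so the map is shipped only once per executor rather than with every task.
val people = spark.sparkContext.parallelize(Seq(("James", "NY"), ("Maria", "CA")))
// Look up the full state name on the executors through the broadcast variable
val withStateNames = people.map { case (name, code) =>
  (name, broadcastStates.value.getOrElse(code, "Unknown"))
}
withStateNames.collect().foreach(println)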
– - Continue
In the above diagram, m is the broadcast variable; it sits in the memory of each executor and is used during task execution.
Hence the driver doesn’t need to ship the variable (m) with every task, which reduces network I/O and time.
Serialization
From the above diagram it is clear that serialization is needed when we write data to some storage, and
de-serialization is needed when we read it back from a source.
In the Spark ecosystem we deal with both of them all the time, e.g. while caching, broadcasting and shuffling.
Hence it becomes very important to optimize the serialization process.
Serialization
Kryo serialization over Java serialization:-
Kryo is significantly faster (often up to 10x) and more compact than Java serialization, but it doesn’t support all
serializable types and requires you to register the classes you use to get the best performance.
Note that spark.serializer must be set before the SparkContext is created, e.g. on the builder or on a SparkConf:
val spark = SparkSession.builder().appName("Broadcast").master("local")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer").getOrCreate()
A further optimization is to register your classes with Kryo in advance, especially if rows are large: if you don’t register a class,
Kryo stores the full class name with each object of it (for every row).
conf.set("spark.kryo.registrationRequired", "true") // conf is the SparkConf used to build the session
conf.registerKryoClasses(Array(classOf[Foo]))
DataSet/DataFrame over RDD
RDDs serialize and deserialize data whenever they distribute it across the cluster, such as during repartition
and shuffle, and serialization and de-serialization are very expensive operations in Spark.
DataFrames, on the other hand, store data in a binary format using off-heap storage, so there is no need to serialize and
deserialize Java objects when data is distributed across the cluster. We see a big performance improvement in DataFrame over RDD.
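A small sketch of the same aggregation written both ways (the sample data is an assumption; spark is the SparkSession). The Dataset version works on Spark's binary row format and avoids shuffling Java objects:
case class Flight(origin: String, count: Long)
// RDD version: Java/Kryo-serialized objects are moved around during the shuffle
val flightsRdd = spark.sparkContext.parallelize(Seq(Flight("US", 1L), Flight("IN", 1L), Flight("US", 1L)))
val rddCounts = flightsRdd.map(f => (f.origin, f.count)).reduceByKey(_ + _)
// Dataset version: operates on the binary (Tungsten) representation, no Java object serialization
import spark.implicits._
val flightsDs = flightsRdd.toDS()
val dsCounts = flightsDs.groupBy("origin").count()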
Avoid UDF
When we use UDFs we lose the optimizations Spark applies to our DataFrame/Dataset. Hence,
whenever an inbuilt Spark function is available we should use it and avoid UDFs as much as possible.
If we do have to use one, we first define it like a normal Scala function and then
register it with Spark's UDF API:
● val plusOne = udf((x: Int) => x + 1) // define the function
● spark.udf.register("plusOne", plusOne) // register the udf
● spark.sql("SELECT plusOne(5)").show() // call the udf
// +------+
// |UDF(5)|
// +------+
// | 6|
// +------+
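For comparison, the same logic written with a built-in column expression (a sketch; a DataFrame df with an integer column "x" is hypothetical) keeps Catalyst optimizations intact:
import org.apache.spark.sql.functions.col
// Built-in expression instead of a UDF: Catalyst can optimize this
df.withColumn("x_plus_one", col("x") + 1).show()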
Filter Data at Earliest
Example:- Suppose we have a data set of employees with columns like employee number, age, gender, salary, department, city, address,
past experience, marital status, etc.
But we have to find the number of employees belonging to a particular city. In this case we only need a groupBy on the city column,
and the other columns become irrelevant, so select and filter the data as early as possible (see the sketch below the diagram):
df.select("city").groupBy("city").count().show()
rather than carrying every column through the aggregation, e.g. df.groupBy("city").count().select("city", "count").show()
[Diagram: two query plans with Scan, Filter and Aggregate stages, contrasting filtering early vs late]
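A minimal sketch of filtering at the earliest (the column names and predicate are assumptions for illustration): pruning columns and rows before the wide transformation means less data is scanned, shuffled and aggregated.
import org.apache.spark.sql.functions.col
df.select("city")                  // keep only the column we need
  .where(col("city") === "Delhi")  // hypothetical filter applied before the shuffle
  .groupBy("city").count()
  .show()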
Shuffling
Shuffling is the mechanism Spark uses to redistribute data across different executors and even across
machines. A Spark shuffle is triggered when we perform certain transformations like groupByKey(),
reduceByKey() or join() on an RDD or DataFrame. It involves:
● Disk I/O
● Data serialization and deserialization
● Network I/O
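A quick way to spot a shuffle (a sketch; df is hypothetical): in the physical plan printed by explain(), the Exchange node marks where data is shuffled.
// groupBy triggers a shuffle; it shows up as an Exchange node in the physical plan
df.groupBy("city").count().explain()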
Reduce Shuffle Operation
We cannot completely avoid shuffle operations, but where possible we should try to reduce the number of shuffles and
remove any unused operations.
Spark provides the spark.sql.shuffle.partitions configuration to control the number of shuffle partitions; by tuning this property
you can improve Spark performance.
spark.conf.set("spark.sql.shuffle.partitions", 100)
Here 100 is the shuffle partition count. We tune this number by trial and error based on the data size: if we have less data we
don’t need 100 shuffle partitions, and if we have much bigger data and can execute a larger number of parallel tasks we can increase
it to 200 or more.
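Another common way to shrink shuffle traffic on RDDs (a sketch with assumed sample data): prefer reduceByKey, which pre-aggregates on each partition before the shuffle, over groupByKey, which ships every value across the network.
val pairs = spark.sparkContext.parallelize(Seq(("US", 1), ("IN", 1), ("US", 1)))
// groupByKey shuffles every (key, value) pair, then sums on the reduce side
val grouped = pairs.groupByKey().mapValues(_.sum)
// reduceByKey combines values per partition first, so far less data is shuffled
val reduced = pairs.reduceByKey(_ + _)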
File Format
Suppose we have a pipeline like this: DataBase1 > SparkJob1 > DataBase2 > SparkJob2 > Database3.
SparkJob1 reads the data from DataBase1 (the source) and writes it to DataBase2; SparkJob2 then reads
from DataBase2, performs its calculation and writes to Database3.
So DataBase2 is only used for writing the intermediate data and reading it back.
In this scenario we should prefer writing the intermediate data in serialized and optimized formats like Avro, Parquet,
etc.
Any transformation on these formats performs better than on text, CSV or JSON.
[Diagram: DataBase1 > Spark Job1 > DataBase2 > Spark Job2 > Database3]
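A minimal sketch of the intermediate hand-off using Parquet (the DataFrame name and paths are assumptions for illustration):
// Spark Job1: write the intermediate result as Parquet instead of text/CSV/JSON
intermediateDf.write.mode("overwrite").parquet("/data/intermediate/flights")
// Spark Job2: read it back; columnar Parquet allows column pruning and predicate pushdown
val intermediate = spark.read.parquet("/data/intermediate/flights")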
Executor Config
● JOB > Stage > Task
● One job can have multiple stages; one stage can have multiple tasks.
● Number of cores = number of parallel tasks per executor.
● We have to give a proper number of cores to each executor in order to optimise the resources.
● Allocating too many cores per executor leads to more parallel tasks on each executor, which can
lead to out-of-memory (OOM) errors.
● Allocating too few cores per executor reduces parallelism and loses its benefit; the executor
memory will also not be fully utilised.
● After many iterations, the common recommendation is to allocate 5 cores per executor to get the maximum benefit of
parallelism with proper memory usage.
./bin/spark-submit --driver-memory 8G --executor-memory 16G --num-executors 3 --executor-cores 5
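A back-of-the-envelope sizing sketch under assumed cluster numbers (3 worker nodes with 16 cores and 64 GB RAM each; none of these figures come from the slides): reserve about 1 core and 1 GB per node for the OS and daemons, leaving 15 usable cores per node; at 5 cores per executor that gives 3 executors per node, i.e. 9 in total (minus one if the driver/ApplicationMaster runs on the cluster), and roughly 63 GB / 3 ≈ 21 GB per executor, part of which should be left for spark.executor.memoryOverhead. A submit line for that hypothetical cluster might look like:
./bin/spark-submit --driver-memory 8G --executor-memory 18G --num-executors 8 --executor-cores 5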
Memory Tuning
There are three considerations in tuning memory usage:
● the amount of memory used by your objects (you may want your entire dataset to fit in memory),
● the cost of accessing those objects, and
● the overhead of garbage collection
● Strings use less storage space than linked lists and maps, as those collection objects carry not only a
header but also pointers (typically 8 bytes each) to the next object in the list.
● We can also optimise memory usage by storing data in a serialized format.
● Java objects are fast to access but consume 2-5x more space than the “raw” data inside their fields.
● Using data structures with fewer objects and caching data in serialized form help reduce garbage
collection cost. Broadcast variables also help in reducing GC.
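A minimal sketch of caching in serialized form (rdd is hypothetical; the storage level choice is an assumption):
import org.apache.spark.storage.StorageLevel
// A serialized storage level trades a little CPU on access for a much smaller heap footprint and less GC
rdd.persist(StorageLevel.MEMORY_ONLY_SER)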
Thank You !
Get in touch with us: