Advanced Spark Programming - Part 2
Shared Variables
● When we pass a function such as map() to Spark, it runs on remote worker nodes.
● Every node gets its own copy of any variable the function references.
● Changes made to these copies on the workers are not communicated back to the driver.
● Likewise, once map() has started, changes to the variable on the driver do not affect the workers.
Two kinds of shared variables address this:
1. Accumulators - to aggregate information
2. Broadcast variables - to efficiently distribute large values
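To see why ordinary variables do not work as shared state, consider this minimal sketch (the counter here is hypothetical, not from the deck):

var counter = 0                        // lives on the driver
val numbers = sc.parallelize(1 to 100)
numbers.foreach(x => counter += x)     // each executor increments its own local copy
println(counter)                       // prints 0 in cluster mode - updates never came back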
Accumulators - Shared Memory
[Diagram: two worker tasks apply += 10 and += 20 to one shared accumulator.
Accumulators are only "added" to through an associative operation,
e.g. 2 + 3 + 4 = 2 + 4 + 3 = 9, so the order of the updates does not matter.]
Accumulators
● Accumulators are variables that are only "added" to through an associative operation.
● They can therefore be efficiently supported in parallel.
● They can be used to implement counters (as in MapReduce) or sums.
Accumulator Example: Empty Line Count
https://gist.github.com/girisandeep/161d1d5ea09517b1ab44df81b9b148c0

sc.setLogLevel("ERROR")
var file = sc.textFile("/data/mr/wordcount/input/")
var numBlankLines = sc.accumulator(0)

def toWords(line: String): Array[String] = {
  if (line.length == 0) { numBlankLines += 1 }
  return line.split(" ")
}

var words = file.flatMap(toWords)
words.saveAsTextFile("words3")

printf("Blank lines: %d", numBlankLines.value)
// Blank lines: 24857
Accumulators and Fault Tolerance
● Spark re-executes failed or slow tasks.
● It also preemptively launches a "speculative" copy of a slow worker task.
The net result: the same function may run multiple times on the same data.
Does this mean accumulators will give the wrong result?
● YES, for accumulators updated in transformations.
● NO, for accumulators updated in actions.
○ For accumulators used in actions, each task's accumulator update is applied exactly once.
○ So, for a reliable absolute-value counter, put the accumulator update inside an action.
○ In transformations, this guarantee doesn't exist: a restarted or speculative task may apply the same update again.
○ In transformations, use accumulators for debugging purposes only.
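To illustrate the difference (a minimal sketch, reusing the 1.x accumulator API from the examples in this deck): an accumulator updated inside foreach(), an action, is counted exactly once per task, whereas the flatMap() version above may overcount if a task is re-executed.

val blankLines = sc.accumulator(0)
val lines = sc.textFile("/data/mr/wordcount/input/")
lines.foreach { line =>
  if (line.length == 0) blankLines += 1   // update inside an action: applied exactly once
}
println("Blank lines: " + blankLines.value)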
Custom Accumulators
● Out of the box, Spark supports accumulators of type Double, Long, and Float.
● Spark also includes an API to define custom accumulator types and custom aggregation operations (e.g., finding the maximum of the accumulated values instead of adding them).
● In Spark 2.x, custom accumulators need to extend AccumulatorV2.
Custom Accumulators - version 1.x
https://gist.github.com/girisandeep/450ff3d29f20f2e31cdd09ad0f1c0df2

class MyComplex(var x: Int, var y: Int) extends Serializable {
  def reset(): Unit = {
    x = 0
    y = 0
  }
  def add(p: MyComplex): MyComplex = {
    x = x + p.x
    y = y + p.y
    return this
  }
}

import org.apache.spark.AccumulatorParam

class ComplexAccumulatorV1 extends AccumulatorParam[MyComplex] {
  def zero(initialVal: MyComplex): MyComplex = {
    return initialVal
  }
  def addInPlace(v1: MyComplex, v2: MyComplex): MyComplex = {
    v1.add(v2)
    return v1
  }
}
val vecAccum = sc.accumulator(new MyComplex(0, 0))(new ComplexAccumulatorV1)

var myrdd = sc.parallelize(Array(1, 2, 3))
def myfunc(x: Int): Int = {
  vecAccum += new MyComplex(x, x)
  return x * 3
}
var myrdd1 = myrdd.map(myfunc)
myrdd1.collect()   // map() is lazy; the accumulator is populated only once an action runs

vecAccum.value.x
vecAccum.value.y
Custom Accumulators - version 2.x
https://gist.github.com/girisandeep/35b21cca890157afe0084a9e400e2e70

import org.apache.spark.util.AccumulatorV2

object ComplexAccumulatorV2 extends AccumulatorV2[MyComplex, MyComplex] {
  private val myc: MyComplex = new MyComplex(0, 0)
  def reset(): Unit = {
    myc.reset()
  }
  def add(v: MyComplex): Unit = {
    myc.add(v)
  }
  def value(): MyComplex = {
    return myc
  }
  def isZero(): Boolean = {
    return (myc.x == 0 && myc.y == 0)
  }
  def copy(): AccumulatorV2[MyComplex, MyComplex] = {
    return ComplexAccumulatorV2
  }
  def merge(other: AccumulatorV2[MyComplex, MyComplex]): Unit = {
    myc.add(other.value)
  }
}

sc.register(ComplexAccumulatorV2, "mycomplexacc")
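Usage mirrors the 1.x example (a sketch, assuming the same MyComplex class; once registered above, updates made on executors are merged back into the value on the driver):

var myrdd = sc.parallelize(Array(1, 2, 3))
myrdd.foreach(x => ComplexAccumulatorV2.add(new MyComplex(x, x)))  // update inside an action
ComplexAccumulatorV2.value.x   // expected: 6 = 1 + 2 + 3
ComplexAccumulatorV2.value.y   // expected: 6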
Broadcast Variables: Introduction
commonWords = List("a", "an", "the", "of", "at", "is", "am", "are", "this", "that", "", "at")
If we need to remove these common words from our word count, what do we need to do?
> We can create a local variable and use it.
> But is that inefficient?
Yes, because:
1. Spark sends all variables referenced in a closure to all the workers.
2. The default task-launching mechanism is optimised for small task sizes.
3. If the variable is used multiple times, Spark sends it again to all nodes for each operation.
So, we use a broadcast variable instead.
Broadcast Variables - Shared Memory
[Diagram: the driver creates a broadcast variable with Broadcast(); each Spark
application task reads it via broadcast.value() from a local cache, alongside the
Resilient Distributed Dataset (RDD) it processes from the Hadoop Distributed
File System (HDFS).]
Broadcast Variables
● Efficiently send a large, read-only value to the workers.
● For example:
  ○ A large, read-only lookup table sent to all the nodes
  ○ A large feature vector in a machine learning algorithm
● Broadcast variables are similar to Hadoop's distributed cache.
● Spark distributes broadcast variables efficiently to reduce communication cost.
● Useful when:
  ○ Tasks across multiple stages need the same data
  ○ Caching the data in deserialized form is important
Broadcast Variables: Example - Removing Common Words using Broadcast
https://gist.github.com/girisandeep/f12ab4bf2536dc5f0a8ca673efbac1db

var commonWords = Array("a", "an", "the", "of", "at", "is", "am", "are", "this", "that", "at",
  "in", "or", "and", "or", "not", "be", "for", "to", "it")

val commonWordsMap = collection.mutable.Map[String, Int]()
for (word <- commonWords) {
  commonWordsMap(word) = 1
}
var commonWordsBC = sc.broadcast(commonWordsMap)

var file = sc.textFile("/data/mr/wordcount/input/big.txt")

def toWords(line: String): Array[String] = {
  var words = line.split(" ")
  var output = Array[String]()
  for (word <- words) {
    if (!(commonWordsBC.value contains word.toLowerCase.trim.replaceAll("[^a-z]", "")))
      output = output :+ word
  }
  return output
}

var uncommonWords = file.flatMap(toWords)
uncommonWords.take(100)
Key Performance Considerations
1. Level of Parallelism
2. Serialization Format
3. Memory Management
4. Hardware Provisioning
Level of Parallelism
By default:
● There is a single task per partition.
● Each task uses a single core in the cluster to execute.
● The default number of partitions is based on the underlying storage or the CPU count.
● For HDFS-backed RDDs, there is one partition per block.
Too few partitions ⇒ resources might be left idle.
Too many partitions ⇒ the small overhead of each partition adds up.
Level of Parallelism - how many partitions do we get by default?
Key Performance Considerations - Partitions
/data/msprojects/in_table.csv should have 62 blocks, theoretically. Let's check:

$ hadoop fs -ls /data/msprojects/in_table.csv
-rw-r--r-- 3 sandeep sandeep 8303338297 2017-04-18 02:26 /data/msprojects/in_table.csv
$ python
>>> 8303338297.0/128.0/1024.0/1024.0    # file size / 128 MB block size
61.86469120532274
>>>
$ hdfs fsck /data/msprojects/in_table.csv
.....
Total blocks (validated): 62 (avg. block size 133924811 B)

Yes, it actually has 62 blocks.
Advanced Spark Programming
$ spark-shell --master yarn
scala> var myrdd = sc.textFile("/data/msprojects/in_table.csv")
scala> myrdd.partitions.length
res1: Int = 62
Key Performance Considerations - Partitions
So, number of partitions is a function of number of data blocks in case
of sc.textFile.
// In local mode
$ spark-shell
scala> var myrdd = sc.parallelize(1 to 100000)
scala> myrdd.partitions.length
res1: Int = 4

[sandeep@ip-172-31-60-179 ~]$ cat /proc/cpuinfo | grep processor
processor : 0
processor : 1
processor : 2
processor : 3

Since this machine has 4 cores, sc.parallelize created 4 partitions.
Advanced Spark Programming
$ spark-shell --master yarn
scala> var myrdd = sc.parallelize(1 to 100000)
scala> myrdd.partitions.length
res6: Int = 2
When we are running in yarn mode, the number of partitions is function
of tasks that can be executed on a node, Here it is 2.
Key Performance Considerations - Partitions
Level of Parallelism - How do we control parallelism?
1. Specify the number of partitions in sc.parallelize() and sc.textFile(), as sketched below.
2. Shuffling operations accept the degree of parallelism as a parameter.
3. Redistribute an RDD with repartition() or partitionBy().
4. To efficiently shrink the number of partitions, prefer coalesce() over repartition().
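For example (a sketch; the partition counts are illustrative, not recommendations):

val rdd = sc.textFile("/data/msprojects/in_table.csv", 120)  // ask for at least 120 partitions
val counts = rdd.map(line => (line.split(",")(0), 1))
                .reduceByKey(_ + _, 60)                      // degree of parallelism for the shuffle
val fewer = counts.coalesce(10)                              // shrink without a full shuffle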
Level of Parallelism - Example
1. We are reading a large amount of data from S3.
2. A filter() operation is likely to leave only a tiny fraction of the data.
3. The result of filter() will be an RDD with the same number of partitions as its parent, but with many empty or small partitions.
4. We can improve the application's performance by coalescing, as sketched below.
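A minimal sketch of that pattern (the S3 bucket name and the partition count are hypothetical):

val logs = sc.textFile("s3a://mybucket/logs/")     // large input, many partitions
val errors = logs.filter(_.contains("ERROR"))      // same partition count, mostly empty now
val compact = errors.coalesce(8)                   // merge into 8 partitions, no full shuffle
compact.saveAsTextFile("/tmp/errors")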
Serialization Format
● When objects are transferred over the network or saved to disk, they need to be serialized.
● Serialization costs come into play mainly during large transfers, such as shuffles.
● By default, Spark uses Java's built-in serializer.
● Benchmarks of JVM serializers: https://github.com/eishay/jvm-serializers/wiki
Serialization Format - Kryo
● Spark also supports the use of the Kryo serializer.
● Kryo is faster and more compact than Java serialization.
● But it cannot serialize all types of objects "out of the box."
● Almost all applications will benefit from switching to Kryo.
● To use it, set the serializer on the SparkConf before creating the SparkContext:
  ○ conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
● For best performance, register your classes with Kryo:
  ○ conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))
  ○ The classes need to implement Java's Serializable interface.
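Putting it together (a sketch for a standalone application; in the shell, pass these as --conf options instead):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("KryoExample")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[MyComplex]))  // e.g. the class from the accumulator example
val sc = new SparkContext(conf)                    // Kryo is now used for shuffles and caching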
Memory Management
RDD storage
● Memory used by persist()'ed (or cache()'ed) RDDs.
● spark.storage.memoryFraction - default: 60%
● If the limit is exceeded, older partitions will be dropped
  ○ and recomputed on demand when accessed again.
● For huge datasets, use persist() with the MEMORY_AND_DISK storage level, as sketched below.
Shuffle and aggregation buffers
● Used for storing shuffle output data.
● spark.shuffle.memoryFraction - default: 20%
User code
● Gets whatever remains - default: 20% of memory.
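For example (a sketch of the MEMORY_AND_DISK recommendation):

import org.apache.spark.storage.StorageLevel

val big = sc.textFile("/data/mr/wordcount/input/")
big.persist(StorageLevel.MEMORY_AND_DISK)  // partitions that don't fit in memory spill to disk
big.count()                                // the first action materializes the cache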
Hardware Provisioning
● Main parameters:
  ○ Executor memory (spark.executor.memory)
  ○ Number of cores per executor
  ○ Total number of executors
  ○ Number of disks
● Application speed is driven by the combined impact of memory and cores, as sketched below.
  ○ Huge heaps cause long GC pauses; keep executors at 64 GB or less.
● Spark scales roughly linearly:
  ○ 2 x hardware ≈ 2 x speed
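These parameters map directly onto Spark configuration (a sketch; the values are illustrative, not recommendations):

val conf = new SparkConf()
  .set("spark.executor.memory", "8g")      // memory per executor
  .set("spark.executor.cores", "4")        // cores per executor
  .set("spark.executor.instances", "10")   // total number of executors (on YARN)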
Thank you!
Stream, stream, stream: Different streaming methods with Spark and Kafka
 
Apache Cassandra and Apche Spark
Apache Cassandra and Apche SparkApache Cassandra and Apche Spark
Apache Cassandra and Apche Spark
 
A Step to programming with Apache Spark
A Step to programming with Apache SparkA Step to programming with Apache Spark
A Step to programming with Apache Spark
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...
 
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F... Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 
Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”
Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”
Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Typesafe spark- Zalando meetup
Typesafe spark- Zalando meetupTypesafe spark- Zalando meetup
Typesafe spark- Zalando meetup
 
NVIDIA HPC ソフトウエア斜め読み
NVIDIA HPC ソフトウエア斜め読みNVIDIA HPC ソフトウエア斜め読み
NVIDIA HPC ソフトウエア斜め読み
 
Apache MXNet Distributed Training Explained In Depth by Viacheslav Kovalevsky...
Apache MXNet Distributed Training Explained In Depth by Viacheslav Kovalevsky...Apache MXNet Distributed Training Explained In Depth by Viacheslav Kovalevsky...
Apache MXNet Distributed Training Explained In Depth by Viacheslav Kovalevsky...
 
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsRunning Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
 
Power-Efficient Programming Using Qualcomm Multicore Asynchronous Runtime Env...
Power-Efficient Programming Using Qualcomm Multicore Asynchronous Runtime Env...Power-Efficient Programming Using Qualcomm Multicore Asynchronous Runtime Env...
Power-Efficient Programming Using Qualcomm Multicore Asynchronous Runtime Env...
 
Introduction to Spark ML Pipelines Workshop
Introduction to Spark ML Pipelines WorkshopIntroduction to Spark ML Pipelines Workshop
Introduction to Spark ML Pipelines Workshop
 
Fast federated SQL with Apache Calcite
Fast federated SQL with Apache CalciteFast federated SQL with Apache Calcite
Fast federated SQL with Apache Calcite
 

More from CloudxLab

Understanding computer vision with Deep Learning
Understanding computer vision with Deep LearningUnderstanding computer vision with Deep Learning
Understanding computer vision with Deep Learning
CloudxLab
 
Deep Learning Overview
Deep Learning OverviewDeep Learning Overview
Deep Learning Overview
CloudxLab
 
Recurrent Neural Networks
Recurrent Neural NetworksRecurrent Neural Networks
Recurrent Neural Networks
CloudxLab
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
CloudxLab
 
Naive Bayes
Naive BayesNaive Bayes
Naive Bayes
CloudxLab
 
Autoencoders
AutoencodersAutoencoders
Autoencoders
CloudxLab
 
Training Deep Neural Nets
Training Deep Neural NetsTraining Deep Neural Nets
Training Deep Neural Nets
CloudxLab
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
CloudxLab
 
Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...
Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...
Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...
CloudxLab
 
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...
CloudxLab
 
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab
 
Introduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab
 
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
CloudxLab
 
Introduction To TensorFlow | Deep Learning Using TensorFlow | CloudxLab
Introduction To TensorFlow | Deep Learning Using TensorFlow | CloudxLabIntroduction To TensorFlow | Deep Learning Using TensorFlow | CloudxLab
Introduction To TensorFlow | Deep Learning Using TensorFlow | CloudxLab
CloudxLab
 
Introduction to Deep Learning | CloudxLab
Introduction to Deep Learning | CloudxLabIntroduction to Deep Learning | CloudxLab
Introduction to Deep Learning | CloudxLab
CloudxLab
 
Dimensionality Reduction | Machine Learning | CloudxLab
Dimensionality Reduction | Machine Learning | CloudxLabDimensionality Reduction | Machine Learning | CloudxLab
Dimensionality Reduction | Machine Learning | CloudxLab
CloudxLab
 
Ensemble Learning and Random Forests
Ensemble Learning and Random ForestsEnsemble Learning and Random Forests
Ensemble Learning and Random Forests
CloudxLab
 
Decision Trees
Decision TreesDecision Trees
Decision Trees
CloudxLab
 
Support Vector Machines
Support Vector MachinesSupport Vector Machines
Support Vector Machines
CloudxLab
 
Introduction to Linux | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Linux | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to Linux | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Linux | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab
 

More from CloudxLab (20)

Understanding computer vision with Deep Learning
Understanding computer vision with Deep LearningUnderstanding computer vision with Deep Learning
Understanding computer vision with Deep Learning
 
Deep Learning Overview
Deep Learning OverviewDeep Learning Overview
Deep Learning Overview
 
Recurrent Neural Networks
Recurrent Neural NetworksRecurrent Neural Networks
Recurrent Neural Networks
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Naive Bayes
Naive BayesNaive Bayes
Naive Bayes
 
Autoencoders
AutoencodersAutoencoders
Autoencoders
 
Training Deep Neural Nets
Training Deep Neural NetsTraining Deep Neural Nets
Training Deep Neural Nets
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
 
Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...
Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...
Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...
 
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...
 
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLab
 
Introduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLab
 
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...
 
Introduction To TensorFlow | Deep Learning Using TensorFlow | CloudxLab
Introduction To TensorFlow | Deep Learning Using TensorFlow | CloudxLabIntroduction To TensorFlow | Deep Learning Using TensorFlow | CloudxLab
Introduction To TensorFlow | Deep Learning Using TensorFlow | CloudxLab
 
Introduction to Deep Learning | CloudxLab
Introduction to Deep Learning | CloudxLabIntroduction to Deep Learning | CloudxLab
Introduction to Deep Learning | CloudxLab
 
Dimensionality Reduction | Machine Learning | CloudxLab
Dimensionality Reduction | Machine Learning | CloudxLabDimensionality Reduction | Machine Learning | CloudxLab
Dimensionality Reduction | Machine Learning | CloudxLab
 
Ensemble Learning and Random Forests
Ensemble Learning and Random ForestsEnsemble Learning and Random Forests
Ensemble Learning and Random Forests
 
Decision Trees
Decision TreesDecision Trees
Decision Trees
 
Support Vector Machines
Support Vector MachinesSupport Vector Machines
Support Vector Machines
 
Introduction to Linux | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Linux | Big Data Hadoop Spark Tutorial | CloudxLabIntroduction to Linux | Big Data Hadoop Spark Tutorial | CloudxLab
Introduction to Linux | Big Data Hadoop Spark Tutorial | CloudxLab
 

Recently uploaded

Scaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - Mydbops
Scaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - MydbopsScaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - Mydbops
Scaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - Mydbops
Mydbops
 
How Social Media Hackers Help You to See Your Wife's Message.pdf
How Social Media Hackers Help You to See Your Wife's Message.pdfHow Social Media Hackers Help You to See Your Wife's Message.pdf
How Social Media Hackers Help You to See Your Wife's Message.pdf
HackersList
 
Transcript: Details of description part II: Describing images in practice - T...
Transcript: Details of description part II: Describing images in practice - T...Transcript: Details of description part II: Describing images in practice - T...
Transcript: Details of description part II: Describing images in practice - T...
BookNet Canada
 
Quantum Communications Q&A with Gemini LLM
Quantum Communications Q&A with Gemini LLMQuantum Communications Q&A with Gemini LLM
Quantum Communications Q&A with Gemini LLM
Vijayananda Mohire
 
Cookies program to display the information though cookie creation
Cookies program to display the information though cookie creationCookies program to display the information though cookie creation
Cookies program to display the information though cookie creation
shanthidl1
 
Password Rotation in 2024 is still Relevant
Password Rotation in 2024 is still RelevantPassword Rotation in 2024 is still Relevant
Password Rotation in 2024 is still Relevant
Bert Blevins
 
Mitigating the Impact of State Management in Cloud Stream Processing Systems
Mitigating the Impact of State Management in Cloud Stream Processing SystemsMitigating the Impact of State Management in Cloud Stream Processing Systems
Mitigating the Impact of State Management in Cloud Stream Processing Systems
ScyllaDB
 
Pigging Solutions Sustainability brochure.pdf
Pigging Solutions Sustainability brochure.pdfPigging Solutions Sustainability brochure.pdf
Pigging Solutions Sustainability brochure.pdf
Pigging Solutions
 
Best Practices for Effectively Running dbt in Airflow.pdf
Best Practices for Effectively Running dbt in Airflow.pdfBest Practices for Effectively Running dbt in Airflow.pdf
Best Practices for Effectively Running dbt in Airflow.pdf
Tatiana Al-Chueyr
 
UiPath Community Day Kraków: Devs4Devs Conference
UiPath Community Day Kraków: Devs4Devs ConferenceUiPath Community Day Kraków: Devs4Devs Conference
UiPath Community Day Kraków: Devs4Devs Conference
UiPathCommunity
 
Advanced Techniques for Cyber Security Analysis and Anomaly Detection
Advanced Techniques for Cyber Security Analysis and Anomaly DetectionAdvanced Techniques for Cyber Security Analysis and Anomaly Detection
Advanced Techniques for Cyber Security Analysis and Anomaly Detection
Bert Blevins
 
Calgary MuleSoft Meetup APM and IDP .pptx
Calgary MuleSoft Meetup APM and IDP .pptxCalgary MuleSoft Meetup APM and IDP .pptx
Calgary MuleSoft Meetup APM and IDP .pptx
ishalveerrandhawa1
 
How to Build a Profitable IoT Product.pptx
How to Build a Profitable IoT Product.pptxHow to Build a Profitable IoT Product.pptx
How to Build a Profitable IoT Product.pptx
Adam Dunkels
 
Comparison Table of DiskWarrior Alternatives.pdf
Comparison Table of DiskWarrior Alternatives.pdfComparison Table of DiskWarrior Alternatives.pdf
Comparison Table of DiskWarrior Alternatives.pdf
Andrey Yasko
 
find out more about the role of autonomous vehicles in facing global challenges
find out more about the role of autonomous vehicles in facing global challengesfind out more about the role of autonomous vehicles in facing global challenges
find out more about the role of autonomous vehicles in facing global challenges
huseindihon
 
20240704 QFM023 Engineering Leadership Reading List June 2024
20240704 QFM023 Engineering Leadership Reading List June 202420240704 QFM023 Engineering Leadership Reading List June 2024
20240704 QFM023 Engineering Leadership Reading List June 2024
Matthew Sinclair
 
20240705 QFM024 Irresponsible AI Reading List June 2024
20240705 QFM024 Irresponsible AI Reading List June 202420240705 QFM024 Irresponsible AI Reading List June 2024
20240705 QFM024 Irresponsible AI Reading List June 2024
Matthew Sinclair
 
7 Most Powerful Solar Storms in the History of Earth.pdf
7 Most Powerful Solar Storms in the History of Earth.pdf7 Most Powerful Solar Storms in the History of Earth.pdf
7 Most Powerful Solar Storms in the History of Earth.pdf
Enterprise Wired
 
Recent Advancements in the NIST-JARVIS Infrastructure
Recent Advancements in the NIST-JARVIS InfrastructureRecent Advancements in the NIST-JARVIS Infrastructure
Recent Advancements in the NIST-JARVIS Infrastructure
KAMAL CHOUDHARY
 
Quality Patents: Patents That Stand the Test of Time
Quality Patents: Patents That Stand the Test of TimeQuality Patents: Patents That Stand the Test of Time
Quality Patents: Patents That Stand the Test of Time
Aurora Consulting
 

Recently uploaded (20)

Scaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - Mydbops
Scaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - MydbopsScaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - Mydbops
Scaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - Mydbops
 
How Social Media Hackers Help You to See Your Wife's Message.pdf
How Social Media Hackers Help You to See Your Wife's Message.pdfHow Social Media Hackers Help You to See Your Wife's Message.pdf
How Social Media Hackers Help You to See Your Wife's Message.pdf
 
Transcript: Details of description part II: Describing images in practice - T...
Transcript: Details of description part II: Describing images in practice - T...Transcript: Details of description part II: Describing images in practice - T...
Transcript: Details of description part II: Describing images in practice - T...
 
Quantum Communications Q&A with Gemini LLM
Quantum Communications Q&A with Gemini LLMQuantum Communications Q&A with Gemini LLM
Quantum Communications Q&A with Gemini LLM
 
Cookies program to display the information though cookie creation
Cookies program to display the information though cookie creationCookies program to display the information though cookie creation
Cookies program to display the information though cookie creation
 
Password Rotation in 2024 is still Relevant
Password Rotation in 2024 is still RelevantPassword Rotation in 2024 is still Relevant
Password Rotation in 2024 is still Relevant
 
Mitigating the Impact of State Management in Cloud Stream Processing Systems
Mitigating the Impact of State Management in Cloud Stream Processing SystemsMitigating the Impact of State Management in Cloud Stream Processing Systems
Mitigating the Impact of State Management in Cloud Stream Processing Systems
 
Pigging Solutions Sustainability brochure.pdf
Pigging Solutions Sustainability brochure.pdfPigging Solutions Sustainability brochure.pdf
Pigging Solutions Sustainability brochure.pdf
 
Best Practices for Effectively Running dbt in Airflow.pdf
Best Practices for Effectively Running dbt in Airflow.pdfBest Practices for Effectively Running dbt in Airflow.pdf
Best Practices for Effectively Running dbt in Airflow.pdf
 
UiPath Community Day Kraków: Devs4Devs Conference
UiPath Community Day Kraków: Devs4Devs ConferenceUiPath Community Day Kraków: Devs4Devs Conference
UiPath Community Day Kraków: Devs4Devs Conference
 
Advanced Techniques for Cyber Security Analysis and Anomaly Detection
Advanced Techniques for Cyber Security Analysis and Anomaly DetectionAdvanced Techniques for Cyber Security Analysis and Anomaly Detection
Advanced Techniques for Cyber Security Analysis and Anomaly Detection
 
Calgary MuleSoft Meetup APM and IDP .pptx
Calgary MuleSoft Meetup APM and IDP .pptxCalgary MuleSoft Meetup APM and IDP .pptx
Calgary MuleSoft Meetup APM and IDP .pptx
 
How to Build a Profitable IoT Product.pptx
How to Build a Profitable IoT Product.pptxHow to Build a Profitable IoT Product.pptx
How to Build a Profitable IoT Product.pptx
 
Comparison Table of DiskWarrior Alternatives.pdf
Comparison Table of DiskWarrior Alternatives.pdfComparison Table of DiskWarrior Alternatives.pdf
Comparison Table of DiskWarrior Alternatives.pdf
 
find out more about the role of autonomous vehicles in facing global challenges
find out more about the role of autonomous vehicles in facing global challengesfind out more about the role of autonomous vehicles in facing global challenges
find out more about the role of autonomous vehicles in facing global challenges
 
20240704 QFM023 Engineering Leadership Reading List June 2024
20240704 QFM023 Engineering Leadership Reading List June 202420240704 QFM023 Engineering Leadership Reading List June 2024
20240704 QFM023 Engineering Leadership Reading List June 2024
 
20240705 QFM024 Irresponsible AI Reading List June 2024
20240705 QFM024 Irresponsible AI Reading List June 202420240705 QFM024 Irresponsible AI Reading List June 2024
20240705 QFM024 Irresponsible AI Reading List June 2024
 
7 Most Powerful Solar Storms in the History of Earth.pdf
7 Most Powerful Solar Storms in the History of Earth.pdf7 Most Powerful Solar Storms in the History of Earth.pdf
7 Most Powerful Solar Storms in the History of Earth.pdf
 
Recent Advancements in the NIST-JARVIS Infrastructure
Recent Advancements in the NIST-JARVIS InfrastructureRecent Advancements in the NIST-JARVIS Infrastructure
Recent Advancements in the NIST-JARVIS Infrastructure
 
Quality Patents: Patents That Stand the Test of Time
Quality Patents: Patents That Stand the Test of TimeQuality Patents: Patents That Stand the Test of Time
Quality Patents: Patents That Stand the Test of Time
 

Advanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLab

  • 2. Advanced Spark Programming Shared variables: ● When we pass functions such as map() to Spark, every node gets a copy of the variables they use ● Changes to these variables on the workers are not communicated back to the driver ● Once map() has started, changes to the variable on the driver don't affect the workers Two Kinds: 1. Accumulators to aggregate information 2. Broadcast variables to efficiently distribute large values
  • 3. Advanced Spark Programming SHARED MEMORY - Accumulators [diagram: two workers updating a shared accumulator with += 10 and += 20] Accumulators are only “added” to through an associative and commutative operation, e.g. 2+3+4 = 2+4+3 = 9
  • 4. Advanced Spark Programming ● Accumulators are variables that are only “added” to through an associative operation ● Can therefore be efficiently supported in parallel. ● They can be used to implement counters (as in MapReduce) or sums. Accumulators
  • 5. Advanced Spark Programming Accumulator : Empty line count https://gist.github.com/girisandeep/161d1d5ea09517b1ab44df81b9b148c0
  • 6. Advanced Spark Programming sc.setLogLevel("ERROR") Accumulator : Empty line count https://gist.github.com/girisandeep/161d1d5ea09517b1ab44df81b9b148c0
  • 7. Advanced Spark Programming sc.setLogLevel("ERROR") var file = sc.textFile("/data/mr/wordcount/input/") Accumulator : Empty line count https://gist.github.com/girisandeep/161d1d5ea09517b1ab44df81b9b148c0
  • 8. Advanced Spark Programming sc.setLogLevel("ERROR") var file = sc.textFile("/data/mr/wordcount/input/") var numBlankLines = sc.accumulator(0) Accumulator : Empty line count https://gist.github.com/girisandeep/161d1d5ea09517b1ab44df81b9b148c0
  • 9. Advanced Spark Programming sc.setLogLevel("ERROR") var file = sc.textFile("/data/mr/wordcount/input/") var numBlankLines = sc.accumulator(0) def toWords(line:String): Array[String] = { if(line.length == 0) {numBlankLines += 1} return line.split(" "); } Accumulator : Empty line count https://gist.github.com/girisandeep/161d1d5ea09517b1ab44df81b9b148c0
  • 10. Advanced Spark Programming sc.setLogLevel("ERROR") var file = sc.textFile("/data/mr/wordcount/input/") var numBlankLines = sc.accumulator(0) def toWords(line:String): Array[String] = { if(line.length == 0) {numBlankLines += 1} return line.split(" "); } var words = file.flatMap(toWords) words.saveAsTextFile("words3") Accumulator : Empty line count https://gist.github.com/girisandeep/161d1d5ea09517b1ab44df81b9b148c0
  • 11. Advanced Spark Programming Accumulator : Empty line count https://gist.github.com/girisandeep/161d1d5ea09517b1ab44df81b9b148c0 sc.setLogLevel("ERROR") var file = sc.textFile("/data/mr/wordcount/input/") var numBlankLines = sc.accumulator(0) def toWords(line:String): Array[String] = { if(line.length == 0) {numBlankLines += 1} return line.split(" "); } var words = file.flatMap(toWords) words.saveAsTextFile("words3") printf("Blank lines: %d", numBlankLines.value) //Blank lines: 24857
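A note on the API: sc.accumulator is deprecated in Spark 2.x. A minimal sketch of the same empty-line counter using the built-in longAccumulator (the output path "words3-v2" is illustrative):

val numBlankLines2 = sc.longAccumulator("numBlankLines")
val file2 = sc.textFile("/data/mr/wordcount/input/")
val words2 = file2.flatMap { line =>
  if (line.length == 0) numBlankLines2.add(1) // counted on the workers
  line.split(" ")
}
words2.saveAsTextFile("words3-v2")
println(s"Blank lines: ${numBlankLines2.value}") // read back on the driver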
  • 12. Advanced Spark Programming ● Spark re-executes failed or slow tasks. ● It preemptively launches a “speculative” copy of a slow worker task. The net result is ??? Accumulators and Fault Tolerance
  • 13. Advanced Spark Programming ● Spark re-executes failed or slow tasks. ● It preemptively launches a “speculative” copy of a slow worker task. The net result is: the same function may run multiple times on the same data. Accumulators and Fault Tolerance
  • 14. Advanced Spark Programming ● Spark re-executes failed or slow tasks. ● It preemptively launches a “speculative” copy of a slow worker task. The net result is: the same function may run multiple times on the same data. Does that mean accumulators will give wrong results? Accumulators and Fault Tolerance
  • 15. Advanced Spark Programming ● Spark re-executes failed or slow tasks. ● It preemptively launches a “speculative” copy of a slow worker task. The net result is: the same function may run multiple times on the same data. Does that mean accumulators will give wrong results? YES, for accumulators in transformations. NO, for accumulators in actions. Accumulators and Fault Tolerance
  • 16. Advanced Spark Programming ○ For accumulators in actions, each task's accumulator update is applied exactly once. ○ For a reliable absolute counter, put the update inside an action. ○ In transformations, this guarantee doesn't exist. ○ In transformations, use accumulators for debugging only. Accumulators and Fault Tolerance
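To make the action-vs-transformation guarantee concrete, a small sketch (the RDD and counts are illustrative):

val counter = sc.longAccumulator("records")
val rdd = sc.parallelize(1 to 100)

// Update inside an action: applied exactly once per task,
// even if a failed or speculative task re-runs.
rdd.foreach(x => counter.add(1))
println(counter.value) // 100

// Update inside a transformation: if the map() is recomputed
// (the RDD is not cached and is used twice), updates repeat.
val mapped = rdd.map { x => counter.add(1); x }
mapped.count()
mapped.count() // map() runs again, so the counter keeps growing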
  • 17. Advanced Spark Programming Custom Accumulators ● Out of the box, Spark supports accumulators of type Double, Long, and Float. ● Spark also includes an API to define custom accumulator types and custom aggregation operations ○ (e.g., finding the maximum of the accumulated values instead of adding them). ● Custom accumulators need to extend AccumulatorV2.
  • 18. Advanced Spark Programming Custom Accumulators - version 1.x
  • 19. Advanced Spark Programming class MyComplex(var x: Int, var y: Int) extends Serializable{ def reset(): Unit = { x = 0 y = 0 } def add(p:MyComplex): MyComplex = { x = x + p.x y = y + p.y return this } } Custom Accumulators - version 1.x https://gist.github.com/girisandeep/450ff3d29f20f2e31cdd09ad0f1c0df2
  • 20. Advanced Spark Programming import org.apache.spark.AccumulatorParam class ComplexAccumulatorV1 extends AccumulatorParam[MyComplex] { def zero(initialVal: MyComplex): MyComplex = { return initialVal } def addInPlace(v1: MyComplex, v2: MyComplex): MyComplex = { v1.add(v2) return v1; } } Custom Accumulators - version 1.x https://gist.github.com/girisandeep/450ff3d29f20f2e31cdd09ad0f1c0df2
  • 21. Advanced Spark Programming val vecAccum = sc.accumulator(new MyComplex(0,0))(new ComplexAccumulatorV1) Custom Accumulators - version 1.x https://gist.github.com/girisandeep/450ff3d29f20f2e31cdd09ad0f1c0df2
  • 22. Advanced Spark Programming val vecAccum = sc.accumulator(new MyComplex(0,0))(new ComplexAccumulatorV1) var myrdd = sc.parallelize(Array(1,2,3)) def myfunc(x:Int):Int = { vecAccum += new MyComplex(x, x) return x * 3 } var myrdd1 = myrdd.map(myfunc) Custom Accumulators - version 1.x https://gist.github.com/girisandeep/450ff3d29f20f2e31cdd09ad0f1c0df2
  • 23. Advanced Spark Programming val vecAccum = sc.accumulator(new MyComplex(0,0))(new ComplexAccumulatorV1) var myrdd = sc.parallelize(Array(1,2,3)) def myfunc(x:Int):Int = { vecAccum += new MyComplex(x, x) return x * 3 } var myrdd1 = myrdd.map(myfunc) myrdd1.collect() vecAccum.value.x vecAccum.value.y Custom Accumulators - version 1.x https://gist.github.com/girisandeep/450ff3d29f20f2e31cdd09ad0f1c0df2
  • 24. Advanced Spark Programming import org.apache.spark.util.AccumulatorV2 class ComplexAccumulatorV2 extends AccumulatorV2[MyComplex, MyComplex] { private val myc: MyComplex = new MyComplex(0, 0) override def reset(): Unit = { myc.reset() } override def add(v: MyComplex): Unit = { myc.add(v) } override def value: MyComplex = myc override def isZero: Boolean = (myc.x == 0 && myc.y == 0) override def copy(): AccumulatorV2[MyComplex, MyComplex] = { val acc = new ComplexAccumulatorV2; acc.myc.add(myc); acc } override def merge(other: AccumulatorV2[MyComplex, MyComplex]): Unit = { myc.add(other.value) } } val complexAcc = new ComplexAccumulatorV2 sc.register(complexAcc, "mycomplexacc") Custom Accumulators - version 2.x https://gist.github.com/girisandeep/35b21cca890157afe0084a9e400e2e70
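The accumulator above is written as a class rather than a singleton object so that copy() can hand each task an independent instance. A hypothetical usage sketch for the instance registered above (values illustrative); note the update happens inside foreach, an action:

sc.parallelize(Array(1, 2, 3)).foreach(x => complexAcc.add(new MyComplex(x, x)))
println(complexAcc.value.x) // 6
println(complexAcc.value.y) // 6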
  • 25. Advanced Spark Programming Broadcast Variables : Introduction commonWords = List("a", "an", "the", "of", "at", "is", "am", "are", "this", "that", "") If we need to remove the common words from our wordcount, what do we need to do?
  • 26. Advanced Spark Programming Broadcast Variables : Introduction commonWords = List("a", "an", "the", "of", "at", "is", "am", "are", "this", "that", "") If we need to remove the common words from our wordcount, what do we need to do? > We can create a local variable and use it
  • 27. Advanced Spark Programming commonWords = List("a", "an", "the", "of", "at", "is", "am", "are", "this", "that", "") If we need to remove the common words from our wordcount, what do we need to do? > We can create a local variable and use it > Is it inefficient? Broadcast Variables : Introduction
  • 28. Advanced Spark Programming Yes, because: 1. Spark sends all referenced variables to all the workers with every task. 2. The default task-launching mechanism is optimised for small task sizes. 3. If the variable is used multiple times, Spark sends it again to all the nodes. So, we use a broadcast variable instead. Broadcast Variables : Introduction
  • 29. Advanced Spark Programming SHARED MEMORY - Broadcast Variables [diagram: the driver calls broadcast() once; each Spark application process then reads the value via broadcast.value from a local broadcast-variable cache instead of re-fetching it, alongside HDFS and the RDD]
  • 30. Advanced Spark Programming ● Efficiently send a large, read-only value to workers Broadcast Variables
  • 31. Advanced Spark Programming ● Efficiently send a large, read-only value to workers ● For example: ○ Send a large, read-only lookup table to all the nodes Broadcast Variables
  • 32. Advanced Spark Programming ● Efficiently send a large, read-only value to workers ● For example: ○ Send a large, read-only lookup table to all the nodes ○ Large feature vector in a machine learning algorithm Broadcast Variables
  • 33. Advanced Spark Programming ● Efficiently send a large, read-only value to workers ● For example: ○ Send a large, read-only lookup table to all the nodes ○ Large feature vector in a machine learning algorithm ● It is like Hadoop's distributed cache ● Spark distributes broadcast variables efficiently to reduce communication cost. Broadcast Variables
  • 34. Advanced Spark Programming ● Efficiently send a large, read-only value to workers ● For example: ○ Send a large, read-only lookup table to all the nodes ○ Large feature vector in a machine learning algorithm ● It is like Hadoop's distributed cache ● Spark distributes broadcast variables efficiently to reduce communication cost. ● Useful when ○ Tasks across multiple stages need the same data ○ Caching the data in deserialized form is important. Broadcast Variables
  • 35. Advanced Spark Programming Broadcast Variables : Example Removing Common Words using Broadcast. https://gist.github.com/girisandeep/f12ab4bf2536dc5f0a8ca673efbac1db
  • 36. Advanced Spark Programming var commonWords = Array("a", "an", "the", "of", "at", "is", "am", "are", "this", "that", "in", "or", "and", "not", "be", "for", "to", "it") Broadcast Variables : Example Removing Common Words using Broadcast. https://gist.github.com/girisandeep/f12ab4bf2536dc5f0a8ca673efbac1db
  • 37. Advanced Spark Programming Broadcast Variables : Example Removing Common Words using Broadcast. https://gist.github.com/girisandeep/f12ab4bf2536dc5f0a8ca673efbac1db var commonWords = Array("a", "an", "the", "of", "at", "is", "am", "are", "this", "that", "in", "or", "and", "not", "be", "for", "to", "it") val commonWordsMap = collection.mutable.Map[String, Int]() for(word <- commonWords){ commonWordsMap(word) = 1 } var commonWordsBC = sc.broadcast(commonWordsMap)
  • 38. Advanced Spark Programming var commonWords = Array("a", "an", "the", "of", "at", "is", "am", "are", "this", "that", "in", "or", "and", "not", "be", "for", "to", "it") val commonWordsMap = collection.mutable.Map[String, Int]() for(word <- commonWords){ commonWordsMap(word) = 1 } var commonWordsBC = sc.broadcast(commonWordsMap) var file = sc.textFile("/data/mr/wordcount/input/big.txt") Broadcast Variables : Example Removing Common Words using Broadcast. https://gist.github.com/girisandeep/f12ab4bf2536dc5f0a8ca673efbac1db
  • 39. Advanced Spark Programming Broadcast Variables : Example var commonWords = Array("a", "an", "the", "of", "at", "is", "am", "are", "this", "that", "in", "or", "and", "not", "be", "for", "to", "it") val commonWordsMap = collection.mutable.Map[String, Int]() for(word <- commonWords){ commonWordsMap(word) = 1 } var commonWordsBC = sc.broadcast(commonWordsMap) var file = sc.textFile("/data/mr/wordcount/input/big.txt") def toWords(line:String):Array[String] = { var words = line.split(" ") var output = Array[String](); for(word <- words){ if(! (commonWordsBC.value contains word.toLowerCase.trim.replaceAll("[^a-z]",""))) output = output :+ word; } return output; } var uncommonWords = file.flatMap(toWords) Removing Common Words using Broadcast. https://gist.github.com/girisandeep/f12ab4bf2536dc5f0a8ca673efbac1db
  • 40. Advanced Spark Programming Broadcast Variables : Example Removing Common Words using Broadcast. https://gist.github.com/girisandeep/f12ab4bf2536dc5f0a8ca673efbac1db var commonWords = Array("a", "an", "the", "of", "at", "is", "am", "are", "this", "that", "in", "or", "and", "not", "be", "for", "to", "it") val commonWordsMap = collection.mutable.Map[String, Int]() for(word <- commonWords){ commonWordsMap(word) = 1 } var commonWordsBC = sc.broadcast(commonWordsMap) var file = sc.textFile("/data/mr/wordcount/input/big.txt") def toWords(line:String):Array[String] = { var words = line.split(" ") var output = Array[String](); for(word <- words){ if(! (commonWordsBC.value contains word.toLowerCase.trim.replaceAll("[^a-z]",""))) output = output :+ word; } return output; } var uncommonWords = file.flatMap(toWords) uncommonWords.take(100)
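Once the common-words lookup is no longer needed, the broadcast can be released. unpersist() and destroy() are part of Spark's Broadcast API; the comments below state their standard semantics:

commonWordsBC.unpersist() // drops the copies cached on the executors; re-sent if used again
// commonWordsBC.destroy() // removes it everywhere, irreversibly; the variable can't be used after this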
  • 41. Advanced Spark Programming Key Performance Considerations 1. Level of Parallelism 2. Serialization Format 3. Memory Management 4. Hardware Provisioning
  • 42. Advanced Spark Programming Level of Parallelism By Default ● One task per partition ● One core in the cluster executes each task ● Default partitions are based on the underlying storage or CPU ● HDFS RDDs - one partition per block
  • 43. Advanced Spark Programming Level of Parallelism Too few partitions ⇒ might leave resources idle. Too many ⇒ the small overhead of each partition adds up. By Default ● One task per partition ● One core in the cluster executes each task ● Default partitions are based on the underlying storage or CPU ● HDFS RDDs - one partition per block
  • 44. Advanced Spark Programming Key Performance Considerations 1. Level of Parallelism - How many default partitions?
  • 45. Advanced Spark Programming Key Performance Considerations - Partitions /data/msprojects/in_table.csv should have 62 blocks, theoretically. Let's check. $ hadoop fs -ls /data/msprojects/in_table.csv -rw-r--r-- 3 sandeep sandeep 8303338297 2017-04-18 02:26 /data/msprojects/in_table.csv $ python >>> 8303338297.0/128.0/1024.0/1024.0 61.86469120532274 >>> $ hdfs fsck /data/msprojects/in_table.csv ….. Total blocks (validated): 62 (avg. block size 133924811 B) Yes, it actually has 62 blocks.
  • 46. Advanced Spark Programming $ spark-shell --master yarn scala> var myrdd = sc.textFile("/data/msprojects/in_table.csv") scala> myrdd.partitions.length res1: Int = 62 Key Performance Considerations - Partitions So, with sc.textFile the number of partitions is a function of the number of data blocks.
  • 47. Advanced Spark Programming Key Performance Considerations - Partitions // In the local mode spark-shell scala> var myrdd = sc.parallelize(1 to 100000) scala> myrdd.partitions.length res1: Int = 4 [sandeep@ip-172-31-60-179 ~]$ cat /proc/cpuinfo|grep processor processor : 0 processor : 1 processor : 2 processor : 3 Since my machine has 4 cores, it has created 4 partitions.
  • 48. Advanced Spark Programming $ spark-shell --master yarn scala> var myrdd = sc.parallelize(1 to 100000) scala> myrdd.partitions.length res6: Int = 2 When running in YARN mode, the default number of partitions for sc.parallelize is a function of the cores available to the executors, with a minimum of 2. Here it is 2. Key Performance Considerations - Partitions
  • 49. Advanced Spark Programming Level of Parallelism 1. Specify the number of partitions in sc.parallelize and sc.textFile 2. Shuffle operations accept a degree-of-parallelism parameter 3. repartition() or partitionBy() 4. To efficiently shrink an RDD, prefer coalesce() over repartition() How to control parallelism?
  • 50. Advanced Spark Programming Level of Parallelism 1. We are reading a large amount of data from S3. 2. The filter() operation is likely to keep only a tiny fraction of it. 3. The result of filter() will have the same number of partitions as its parent, but with many empty or small partitions. 4. Improve the application's performance by coalescing, as sketched below. Example
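A sketch of the knobs above, under assumed inputs (the CSV layout and the partition counts 100/40/8 are illustrative, not from the slides):

val rdd = sc.textFile("/data/msprojects/in_table.csv", 100) // hint: at least 100 partitions
val pairs = rdd.map(line => (line.split(",")(0), 1))        // assume the first field is the key
val counts = pairs.reduceByKey(_ + _, 40)                   // degree of parallelism for the shuffle
val filtered = counts.filter(_._2 > 1000)                   // keeps only a tiny fraction
val shrunk = filtered.coalesce(8)                           // shrink without a full shuffle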
  • 51. Advanced Spark Programming Serialization Format ● Objects need to be serialized when they are transferred over the network or saved to disk ● This comes into play during large transfers ● By default Spark uses Java's built-in serializer.
  • 52. Advanced Spark Programming Serialization Format Benchmarks https://github.com/eishay/jvm-serializers/wiki
  • 53. Advanced Spark Programming Serialization Format Kryo ● Spark also supports the use of Kryo ● Faster and more compact ● But cannot serialize all types of objects “out of the box” ● Almost all applications will benefit from shifting to Kryo ● To use it, set the serializer on the SparkConf before creating the SparkContext: ○ conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") ● For best performance, register classes with Kryo ○ conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2])) ○ Unregistered classes still work, but Kryo then stores the full class name with each object
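Putting the Kryo settings together, a minimal sketch (the app name and registered class are illustrative; the configuration must be applied before the SparkContext is created, so setting it on a live context has no effect):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("KryoExample")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[MyComplex])) // optional, but avoids storing class names
val sc = new SparkContext(conf)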
  • 54. Advanced Spark Programming RDD storage ● persist()'ed memory ● spark.storage.memoryFraction - Default: 60% ● If exceeded, older partitions will be dropped ○ and recomputed on demand ● For huge data, use persist() with MEMORY_AND_DISK Memory Management
  • 55. Advanced Spark Programming RDD storage ● persist()'ed memory ● spark.storage.memoryFraction - Default: 60% ● If exceeded, older partitions will be dropped ○ and recomputed on demand ● For huge data, use persist() with MEMORY_AND_DISK Memory Management Shuffle and aggregation buffers ● For storing shuffle output data ● spark.shuffle.memoryFraction - Default: 20%
  • 56. Advanced Spark Programming RDD storage ● persist()'ed memory ● spark.storage.memoryFraction - Default: 60% ● If exceeded, older partitions will be dropped ○ and recomputed on demand ● For huge data, use persist() with MEMORY_AND_DISK Memory Management Shuffle and aggregation buffers ● For storing shuffle output data ● spark.shuffle.memoryFraction - Default: 20% User Code ● Gets the remaining memory - Default: 20%
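Two notes: the memoryFraction settings above describe the static memory manager of early Spark releases; from Spark 1.6 onward these regions are unified and governed by spark.memory.fraction instead. And a minimal sketch of persisting with MEMORY_AND_DISK as recommended above:

import org.apache.spark.storage.StorageLevel

val big = sc.textFile("/data/mr/wordcount/input/")
  .persist(StorageLevel.MEMORY_AND_DISK) // spill to disk rather than recompute on eviction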
  • 57. Advanced Spark Programming ● Main Parameters ○ Executor's memory (spark.executor.memory) ○ Number of cores per executor ○ Total number of executors ○ Number of local disks Hardware Provisioning
  • 58. Advanced Spark Programming ● Main Parameters ○ Executor's memory (spark.executor.memory) ○ Number of cores per executor ○ Total number of executors ○ Number of local disks ● Application speed is driven mainly by memory and cores ○ But huge heaps cause long GC pauses ○ Keep executor memory at 64GB or less Hardware Provisioning
  • 59. Advanced Spark Programming ● Main Parameters ○ Executor's memory (spark.executor.memory) ○ Number of cores per executor ○ Total number of executors ○ Number of local disks ● Application speed is driven mainly by memory and cores ○ But huge heaps cause long GC pauses ○ Keep executor memory at 64GB or less ● Roughly linear scaling ○ 2x the hardware ≈ 2x the speed Hardware Provisioning
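As an illustration of the main parameters, a typical spark-submit invocation in the style of the shell commands above (all values are examples, not recommendations from the slides):

$ spark-submit --master yarn \
    --num-executors 10 \
    --executor-cores 4 \
    --executor-memory 16G \
    myapp.jar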