Big Data with Hadoop & Spark Training: http://bit.ly/2kyRTuW
This CloudxLab Advanced Spark Programming tutorial helps you to understand Advanced Spark Programming in detail. Below are the topics covered in this slide:
1) Shared Variables - Accumulators & Broadcast Variables
2) Accumulators and Fault Tolerance
3) Custom Accumulators - Version 1.x & Version 2.x
4) Examples of Broadcast Variables
5) Key Performance Considerations - Level of Parallelism
6) Serialization Format - Kryo
7) Memory Management
8) Hardware Provisioning
This presentation show the main Spark characteristics, like RDD, Transformations and Actions.
I used this presentation for many Spark Intro workshops from Cluj-Napoca Big Data community : http://www.meetup.com/Big-Data-Data-Science-Meetup-Cluj-Napoca/
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
Slides from Tathagata Das's talk at the Spark Meetup entitled "Deep Dive with Spark Streaming" on June 17, 2013 in Sunnyvale California at Plug and Play. Tathagata Das is the lead developer on Spark Streaming and a PhD student in computer science in the UC Berkeley AMPLab.
Transformations and actions a visual guide training
The document summarizes key Spark API operations including transformations like map, filter, flatMap, groupBy, and actions like collect, count, and reduce. It provides visual diagrams and examples to illustrate how each operation works, the inputs and outputs, and whether the operation is narrow or wide.
1. Introduction to SparkR
2. Demo
Starting to use SparkR
DataFrames: dplyr style, SQL style
RDD v.s. DataFrames
SparkR on MLlib: GLM, K-means
3. User Case
Median: approxQuantile()
ID Match: dplyr style, SQL style, SparkR function
SparkR + Shiny
4. The Future of SparkR
This document provides an introduction to Apache Spark, including its architecture and programming model. Spark is a cluster computing framework that provides fast, in-memory processing of large datasets across multiple cores and nodes. It improves upon Hadoop MapReduce by allowing iterative algorithms and interactive querying of datasets through its use of resilient distributed datasets (RDDs) that can be cached in memory. RDDs act as immutable distributed collections that can be manipulated using transformations and actions to implement parallel operations.
This is the presentation I made on JavaDay Kiev 2015 regarding the architecture of Apache Spark. It covers the memory model, the shuffle implementations, data frames and some other high-level staff and can be used as an introduction to Apache Spark
This document discusses common mistakes made when writing Spark applications and provides recommendations to address them. It covers issues like having executors that are too small or large, shuffle blocks exceeding size limits, data skew slowing jobs, and excessive stages. The key recommendations are to optimize executor and partition sizes, increase partitions to reduce skew, use techniques like salting to address skew, and favor transformations like ReduceByKey over GroupByKey to minimize shuffles and memory usage.
This document provides an overview of Apache Spark, including how it compares to Hadoop, the Spark ecosystem, Resilient Distributed Datasets (RDDs), transformations and actions on RDDs, the directed acyclic graph (DAG) scheduler, Spark Streaming, and the DataFrames API. Key points covered include Spark's faster performance versus Hadoop through its use of memory instead of disk, the RDD abstraction for distributed collections, common RDD operations, and Spark's capabilities for real-time streaming data processing and SQL queries on structured data.
This presentation is an introduction to Apache Spark. It covers the basic API, some advanced features and describes how Spark physically executes its jobs.
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
We will give a detailed introduction to Apache Spark and why and how Spark can change the analytics world. Apache Spark's memory abstraction is RDD (Resilient Distributed DataSet). One of the key reason why Apache Spark is so different is because of the introduction of RDD. You cannot do anything in Apache Spark without knowing about RDDs. We will give a high level introduction to RDD and in the second half we will have a deep dive into RDDs.
A tutorial presentation based on spark.apache.org documentation.
I gave this presentation at Amirkabir University of Technology as Teaching Assistant of Cloud Computing course of Dr. Amir H. Payberah in spring semester 2015.
In this webcast, Reynold Xin from Databricks will be speaking about Apache Spark's new 2.0 major release.
The major themes for Spark 2.0 are:
- Unified APIs: Emphasis on building up higher level APIs including the merging of DataFrame and Dataset APIs
- Structured Streaming: Simplify streaming by building continuous applications on top of DataFrames allow us to unify streaming, interactive, and batch queries.
- Tungsten Phase 2: Speed up Apache Spark by 10X
Frustration-Reduced PySpark: Data engineering with DataFrames
In this talk I talk about my recent experience working with Spark Data Frames in Python. For DataFrames, the focus will be on usability. Specifically, a lot of the documentation does not cover common use cases like intricacies of creating data frames, adding or manipulating individual columns, and doing quick and dirty analytics.
Spark is an open source cluster computing framework for large-scale data processing. It provides high-level APIs and runs on Hadoop clusters. Spark components include Spark Core for execution, Spark SQL for SQL queries, Spark Streaming for real-time data, and MLlib for machine learning. The core abstraction in Spark is the resilient distributed dataset (RDD), which allows data to be partitioned across nodes for parallel processing. A word count example demonstrates how to use transformations like flatMap and reduceByKey to count word frequencies from an input file in Spark.
This document discusses Spark shuffle, which is an expensive operation that involves data partitioning, serialization/deserialization, compression, and disk I/O. It provides an overview of how shuffle works in Spark and the history of optimizations like sort-based shuffle and an external shuffle service. Key concepts discussed include shuffle writers, readers, and the pluggable block transfer service that handles data transfer. The document also covers shuffle-related configuration options and potential future work.
Apache Spark 2.0 includes improvements that provide considerable speedups for CPU-intensive queries through techniques like code generation. Profiling tools like flame graphs can help analyze where CPU cycles are spent by visualizing stack traces. Flame graphs are useful for performance troubleshooting but have limitations. Testing Spark applications locally and through unit tests allows faster iteration compared to running on clusters and saves resources. It is also important to test with local approximations of distributed components like HDFS and Hive.
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
As spark applications move to a containerized environment, there are many questions about how to best configure server systems in the container world. In this talk we will demonstrate a set of tools to better monitor performance and identify optimal configuration settings. We will demonstrate how Prometheus, a project that is now part of the Cloud Native Computing Foundation (CNCF), can be applied to monitor and archive system performance data in a containerized spark environment. In our examples, we will gather spark metric output through Prometheus and present the data with Grafana dashboards. We will use our examples to demonstrate how performance can be enhanced through different tuned configuration settings. Our demo will show how to configure settings across the cluster as well as within each node.
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Watch video at: http://youtu.be/Wg2boMqLjCg
Want to learn how to write faster and more efficient programs for Apache Spark? Two Spark experts from Databricks, Vida Ha and Holden Karau, provide some performance tuning and testing tips for your Spark applications
Introduction to Structured Streaming | Big Data Hadoop Spark Tutorial | Cloud...
This document provides an introduction to Spark Structured Streaming. It discusses that Structured Streaming is a scalable, fault-tolerant stream processing engine built on the Spark SQL engine. It expresses streaming computations similar to batch processing and guarantees end-to-end exactly-once processing. The document also provides a code example of a word count application using Structured Streaming and discusses output modes for writing streaming query results.
HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated ...
Presentation HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated Processing Units, by Robert Engel at the AMD Developer Summit (APU13) Nov. 11-13, 2013.
Stream, stream, stream: Different streaming methods with Spark and Kafka
Going into different streaming methods, we will share our experience as early-adopters of Spark Streaming and Spark Structured Streaming, and how we overcame technical barriers (and there were plenty...).
We will also present a rather unique solution of using Kafka to imitate streaming over our Data Lake, while significantly reducing our cloud services’ costs.
Topics include :
* Kafka and Spark Streaming for stateless and stateful use-cases
* Spark Structured Streaming as a possible alternative
* Combining Spark Streaming with batch ETLs
* “Streaming” over Data Lake using Kafka
Powerful big data processing and storage combined, this presentation walks thru the basics of integrating Apache Spark and Apache Cassandra. Presented by Alex Thompson at the Sydney Cassandra Meetup.
Spark is an in-memory cluster computing framework that can access data from HDFS. It uses Resilient Distributed Datasets (RDDs) as its fundamental data structure. RDDs support transformations that create new datasets and actions that return values. DataFrames are equivalent to relational tables that allow for optimizations. HiveContext allows Spark to query data stored in Hive. Queries can be written using HiveQL, which is converted to Spark jobs.
Hadoop became the most common systm to store big data.
With Hadoop, many supporting systems emerged to complete the aspects that are missing in Hadoop itself.
Together they form a big ecosystem.
This presentation covers some of those systems.
While not capable to cover too many in one presentation, I tried to focus on the most famous/popular ones and on the most interesting ones.
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...
Recently, there has been increased interest in running analytics and machine learning workloads on top of serverless frameworks in the cloud. The serverless execution model provides fine-grained scaling and unburdens users from having to manage servers, but also adds substantial performance overheads due to the fact that all data and intermediate state of compute task is stored on remote shared storage.
In this talk I first provide a detailed performance breakdown from a machine learning workload using Spark on AWS Lambda. I show how the intermediate state of tasks — such as model updates or broadcast messages — is exchanged using remote storage and what the performance overheads are. Later, I illustrate how the same workload performs on-premise using Apache Spark and Apache Crail deployed on a high-performance cluster (100Gbps network, NVMe Flash, etc.). Serverless computing simplifies the deployment of machine learning applications. The talk shows that performance does not need to be sacrificed.
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
As Apache Spark applications move to a containerized environment, there are many questions about how to best configure server systems in the container world. In this talk we will demonstrate a set of tools to better monitor performance and identify optimal configuration settings. We will demonstrate how Prometheus, a project that is now part of the Cloud Native Computing Foundation (CNCF: https://www.cncf.io/projects/), can be applied to monitor and archive system performance data in a containerized spark environment.
In our examples, we will gather spark metric output through Prometheus and present the data with Grafana dashboards. We will use our examples to demonstrate how performance can be enhanced through different tuned configuration settings. Our demo will show how to configure settings across the cluster as well as within each node.
Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”
NEC has developed a new vector processor called SX-Aurora TSUBASA to accelerate machine learning and data analytics workloads. They developed a middleware framework called Frovedis that provides Spark-like functionality and is optimized for SX-Aurora TSUBASA. Frovedis achieved 10-100x speedups on machine learning algorithms and SQL-like queries compared to Spark on CPUs. NEC has also opened a lab called VEDAC for external users to access SX-Aurora TSUBASA systems running Frovedis.
Hadoop became the most common systm to store big data.
With Hadoop, many supporting systems emerged to complete the aspects that are missing in Hadoop itself.
Together they form a big ecosystem.
This presentation covers some of those systems.
While not capable to cover too many in one presentation, I tried to focus on the most famous/popular ones and on the most interesting ones.
This document discusses Typesafe's Reactive Platform and Apache Spark. It describes Typesafe's Fast Data strategy of using a microservices architecture with Spark, Kafka, HDFS and databases. It outlines contributions Typesafe has made to Spark, including backpressure support, dynamic resource allocation in Mesos, and integration tests. The document also discusses Typesafe's customer support and roadmap, including plans to introduce Kerberos security and evaluate Tachyon.
Apache MXNet Distributed Training Explained In Depth by Viacheslav Kovalevsky...
Distributed training is a complex process that does more harm than good if it not setup correctly.
https://www.bigdataspain.org/2017/talk/apache-mxnet-distributed-training-explained-in-depth
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Power-Efficient Programming Using Qualcomm Multicore Asynchronous Runtime Env...
The need to support both today’s multicore performance and tomorrow’s heterogeneous computing has become increasingly important. Qualcomm® Multicore Asynchronous Runtime Environment (MARE) provides powerful and easy-to-use abstractions to write parallel software. This session will provide a deep dive into the concepts of power-efficient programming and how to use Qualcomm MARE APIs to get energy and thermal benefits for Android apps. Qualcomm Multicore Asynchronous Runtime Environment is a product of Qualcomm Technologies, Inc.
Learn more about Qualcomm Multicore Asynchronous Runtime Environment: https://developer.qualcomm.com/MARE
Watch this presentation on YouTube:
https://www.youtube.com/watch?v=RI8yXhBb8Hg
Introduction to Spark ML Pipelines Workshop slides - companion IJupyter notebooks in Python & Scala are available from my github at https://github.com/holdenk/spark-intro-ml-pipeline-workshop
This document discusses Apache Calcite, an open source framework for federated SQL queries. It provides an introduction to Calcite and its components. It then evaluates Calcite's performance on single data sources through benchmarks. Lastly, it proposes a hybrid approach to enable efficient federated queries using Calcite and Spark.
Computer vision is a branch of computer science which deals with recognising objects, people and identifying patterns in visuals. It is basically analogous to the vision of an animal.
Topics covered:
1. Overview of Machine Learning
2. Basics of Deep Learning
3. What is computer vision and its use-cases?
4. Various algorithms used in Computer Vision (mostly CNN)
5. Live hands-on demo of either Auto Cameraman or Face recognition system
6. What next?
This document provides an agenda for an introduction to deep learning presentation. It begins with an introduction to basic AI, machine learning, and deep learning terms. It then briefly discusses use cases of deep learning. The document outlines how to approach a deep learning problem, including which tools and algorithms to use. It concludes with a question and answer section.
This document discusses recurrent neural networks (RNNs) and their applications. It begins by explaining that RNNs can process input sequences of arbitrary lengths, unlike other neural networks. It then provides examples of RNN applications, such as predicting time series data, autonomous driving, natural language processing, and music generation. The document goes on to describe the fundamental concepts of RNNs, including recurrent neurons, memory cells, and different types of RNN architectures for processing input/output sequences. It concludes by demonstrating how to implement basic RNNs using TensorFlow's static_rnn function.
Natural Language Processing (NLP) is a field of artificial intelligence that deals with interactions between computers and human languages. NLP aims to program computers to process and analyze large amounts of natural language data. Some common NLP tasks include speech recognition, text classification, machine translation, question answering, and more. Popular NLP tools include Stanford CoreNLP, NLTK, OpenNLP, and TextBlob. Vectorization is commonly used to represent text in a way that can be used for machine learning algorithms like calculating text similarity. Tf-idf is a common technique used to weigh words based on their frequency and importance.
- Naive Bayes is a classification technique based on Bayes' theorem that uses "naive" independence assumptions. It is easy to build and can perform well even with large datasets.
- It works by calculating the posterior probability for each class given predictor values using the Bayes theorem and independence assumptions between predictors. The class with the highest posterior probability is predicted.
- It is commonly used for text classification, spam filtering, and sentiment analysis due to its fast performance and high success rates compared to other algorithms.
An autoencoder is an artificial neural network that is trained to copy its input to its output. It consists of an encoder that compresses the input into a lower-dimensional latent-space encoding, and a decoder that reconstructs the output from this encoding. Autoencoders are useful for dimensionality reduction, feature learning, and generative modeling. When constrained by limiting the latent space or adding noise, autoencoders are forced to learn efficient representations of the input data. For example, a linear autoencoder trained with mean squared error performs principal component analysis.
The document discusses challenges in training deep neural networks and solutions to those challenges. Training deep neural networks with many layers and parameters can be slow and prone to overfitting. A key challenge is the vanishing gradient problem, where the gradients shrink exponentially small as they propagate through many layers, making earlier layers very slow to train. Solutions include using initialization techniques like He initialization and activation functions like ReLU and leaky ReLU that do not saturate, preventing gradients from vanishing. Later improvements include the ELU activation function.
( Machine Learning & Deep Learning Specialization Training: https://goo.gl/5u2RiS )
This CloudxLab Reinforcement Learning tutorial helps you to understand Reinforcement Learning in detail. Below are the topics covered in this tutorial:
1) What is Reinforcement?
2) Reinforcement Learning an Introduction
3) Reinforcement Learning Example
4) Learning to Optimize Rewards
5) Policy Search - Brute Force Approach, Genetic Algorithms and Optimization Techniques
6) OpenAI Gym
7) The Credit Assignment Problem
8) Inverse Reinforcement Learning
9) Playing Atari with Deep Reinforcement Learning
10) Policy Gradients
11) Markov Decision Processes
Apache Spark in Depth: Core Concepts, Architecture & InternalsAnton Kirillov
Slides cover Spark core concepts of Apache Spark such as RDD, DAG, execution workflow, forming stages of tasks and shuffle implementation and also describes architecture and main components of Spark Driver. The workshop part covers Spark execution modes , provides link to github repo which contains Spark Applications examples and dockerized Hadoop environment to experiment with
Apache Spark - Running on a Cluster | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
(Big Data with Hadoop & Spark Training: http://bit.ly/2IUsWca
This CloudxLab Running in a Cluster tutorial helps you to understand running Spark in the cluster in detail. Below are the topics covered in this tutorial:
1) Spark Runtime Architecture
2) Driver Node
3) Scheduling Tasks on Executors
4) Understanding the Architecture
5) Cluster Managers
6) Executors
7) Launching a Program using spark-submit
8) Local Mode & Cluster-Mode
9) Installing Standalone Cluster
10) Cluster Mode - YARN
11) Launching a Program on YARN
12) Cluster Mode - Mesos and AWS EC2
13) Deployment Modes - Client and Cluster
14) Which Cluster Manager to Use?
15) Common flags for spark-submit
This document discusses Apache Spark, an open-source cluster computing framework. It provides an overview of Spark, including its main concepts like RDDs (Resilient Distributed Datasets) and transformations. Spark is presented as a faster alternative to Hadoop for iterative jobs and machine learning through its ability to keep data in-memory. Example code is shown for Spark's programming model in Scala and Python. The document concludes that Spark offers a rich API to make data analytics fast, achieving speedups of up to 100x over Hadoop in real applications.
This presentation show the main Spark characteristics, like RDD, Transformations and Actions.
I used this presentation for many Spark Intro workshops from Cluj-Napoca Big Data community : http://www.meetup.com/Big-Data-Data-Science-Meetup-Cluj-Napoca/
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17spark-project
Slides from Tathagata Das's talk at the Spark Meetup entitled "Deep Dive with Spark Streaming" on June 17, 2013 in Sunnyvale California at Plug and Play. Tathagata Das is the lead developer on Spark Streaming and a PhD student in computer science in the UC Berkeley AMPLab.
Transformations and actions a visual guide trainingSpark Summit
The document summarizes key Spark API operations including transformations like map, filter, flatMap, groupBy, and actions like collect, count, and reduce. It provides visual diagrams and examples to illustrate how each operation works, the inputs and outputs, and whether the operation is narrow or wide.
SparkR - Play Spark Using R (20160909 HadoopCon)wqchen
1. Introduction to SparkR
2. Demo
Starting to use SparkR
DataFrames: dplyr style, SQL style
RDD v.s. DataFrames
SparkR on MLlib: GLM, K-means
3. User Case
Median: approxQuantile()
ID Match: dplyr style, SQL style, SparkR function
SparkR + Shiny
4. The Future of SparkR
This document provides an introduction to Apache Spark, including its architecture and programming model. Spark is a cluster computing framework that provides fast, in-memory processing of large datasets across multiple cores and nodes. It improves upon Hadoop MapReduce by allowing iterative algorithms and interactive querying of datasets through its use of resilient distributed datasets (RDDs) that can be cached in memory. RDDs act as immutable distributed collections that can be manipulated using transformations and actions to implement parallel operations.
This is the presentation I made on JavaDay Kiev 2015 regarding the architecture of Apache Spark. It covers the memory model, the shuffle implementations, data frames and some other high-level staff and can be used as an introduction to Apache Spark
Top 5 mistakes when writing Spark applicationshadooparchbook
This document discusses common mistakes made when writing Spark applications and provides recommendations to address them. It covers issues like having executors that are too small or large, shuffle blocks exceeding size limits, data skew slowing jobs, and excessive stages. The key recommendations are to optimize executor and partition sizes, increase partitions to reduce skew, use techniques like salting to address skew, and favor transformations like ReduceByKey over GroupByKey to minimize shuffles and memory usage.
This document provides an overview of Apache Spark, including how it compares to Hadoop, the Spark ecosystem, Resilient Distributed Datasets (RDDs), transformations and actions on RDDs, the directed acyclic graph (DAG) scheduler, Spark Streaming, and the DataFrames API. Key points covered include Spark's faster performance versus Hadoop through its use of memory instead of disk, the RDD abstraction for distributed collections, common RDD operations, and Spark's capabilities for real-time streaming data processing and SQL queries on structured data.
This presentation is an introduction to Apache Spark. It covers the basic API, some advanced features and describes how Spark physically executes its jobs.
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveSachin Aggarwal
We will give a detailed introduction to Apache Spark and why and how Spark can change the analytics world. Apache Spark's memory abstraction is RDD (Resilient Distributed DataSet). One of the key reason why Apache Spark is so different is because of the introduction of RDD. You cannot do anything in Apache Spark without knowing about RDDs. We will give a high level introduction to RDD and in the second half we will have a deep dive into RDDs.
A tutorial presentation based on spark.apache.org documentation.
I gave this presentation at Amirkabir University of Technology as Teaching Assistant of Cloud Computing course of Dr. Amir H. Payberah in spring semester 2015.
Apache Spark 2.0: Faster, Easier, and SmarterDatabricks
In this webcast, Reynold Xin from Databricks will be speaking about Apache Spark's new 2.0 major release.
The major themes for Spark 2.0 are:
- Unified APIs: Emphasis on building up higher level APIs including the merging of DataFrame and Dataset APIs
- Structured Streaming: Simplify streaming by building continuous applications on top of DataFrames allow us to unify streaming, interactive, and batch queries.
- Tungsten Phase 2: Speed up Apache Spark by 10X
Frustration-Reduced PySpark: Data engineering with DataFramesIlya Ganelin
In this talk I talk about my recent experience working with Spark Data Frames in Python. For DataFrames, the focus will be on usability. Specifically, a lot of the documentation does not cover common use cases like intricacies of creating data frames, adding or manipulating individual columns, and doing quick and dirty analytics.
Spark is an open source cluster computing framework for large-scale data processing. It provides high-level APIs and runs on Hadoop clusters. Spark components include Spark Core for execution, Spark SQL for SQL queries, Spark Streaming for real-time data, and MLlib for machine learning. The core abstraction in Spark is the resilient distributed dataset (RDD), which allows data to be partitioned across nodes for parallel processing. A word count example demonstrates how to use transformations like flatMap and reduceByKey to count word frequencies from an input file in Spark.
This document discusses Spark shuffle, which is an expensive operation that involves data partitioning, serialization/deserialization, compression, and disk I/O. It provides an overview of how shuffle works in Spark and the history of optimizations like sort-based shuffle and an external shuffle service. Key concepts discussed include shuffle writers, readers, and the pluggable block transfer service that handles data transfer. The document also covers shuffle-related configuration options and potential future work.
Apache Spark 2.0 includes improvements that provide considerable speedups for CPU-intensive queries through techniques like code generation. Profiling tools like flame graphs can help analyze where CPU cycles are spent by visualizing stack traces. Flame graphs are useful for performance troubleshooting but have limitations. Testing Spark applications locally and through unit tests allows faster iteration compared to running on clusters and saves resources. It is also important to test with local approximations of distributed components like HDFS and Hive.
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKzmhassan
As spark applications move to a containerized environment, there are many questions about how to best configure server systems in the container world. In this talk we will demonstrate a set of tools to better monitor performance and identify optimal configuration settings. We will demonstrate how Prometheus, a project that is now part of the Cloud Native Computing Foundation (CNCF), can be applied to monitor and archive system performance data in a containerized spark environment. In our examples, we will gather spark metric output through Prometheus and present the data with Grafana dashboards. We will use our examples to demonstrate how performance can be enhanced through different tuned configuration settings. Our demo will show how to configure settings across the cluster as well as within each node.
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Databricks
Watch video at: http://youtu.be/Wg2boMqLjCg
Want to learn how to write faster and more efficient programs for Apache Spark? Two Spark experts from Databricks, Vida Ha and Holden Karau, provide some performance tuning and testing tips for your Spark applications
Introduction to Structured Streaming | Big Data Hadoop Spark Tutorial | Cloud...CloudxLab
This document provides an introduction to Spark Structured Streaming. It discusses that Structured Streaming is a scalable, fault-tolerant stream processing engine built on the Spark SQL engine. It expresses streaming computations similar to batch processing and guarantees end-to-end exactly-once processing. The document also provides a code example of a word count application using Structured Streaming and discusses output modes for writing streaming query results.
HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated ...AMD Developer Central
Presentation HC-4021, Efficient scheduling of OpenMP and OpenCL™ workloads on Accelerated Processing Units, by Robert Engel at the AMD Developer Summit (APU13) Nov. 11-13, 2013.
Stream, stream, stream: Different streaming methods with Spark and KafkaItai Yaffe
Going into different streaming methods, we will share our experience as early-adopters of Spark Streaming and Spark Structured Streaming, and how we overcame technical barriers (and there were plenty...).
We will also present a rather unique solution of using Kafka to imitate streaming over our Data Lake, while significantly reducing our cloud services’ costs.
Topics include :
* Kafka and Spark Streaming for stateless and stateful use-cases
* Spark Structured Streaming as a possible alternative
* Combining Spark Streaming with batch ETLs
* “Streaming” over Data Lake using Kafka
Powerful big data processing and storage combined, this presentation walks thru the basics of integrating Apache Spark and Apache Cassandra. Presented by Alex Thompson at the Sydney Cassandra Meetup.
A Step to programming with Apache SparkKnoldus Inc.
Spark is an in-memory cluster computing framework that can access data from HDFS. It uses Resilient Distributed Datasets (RDDs) as its fundamental data structure. RDDs support transformations that create new datasets and actions that return values. DataFrames are equivalent to relational tables that allow for optimizations. HiveContext allows Spark to query data stored in Hive. Queries can be written using HiveQL, which is converted to Spark jobs.
Hadoop became the most common systm to store big data.
With Hadoop, many supporting systems emerged to complete the aspects that are missing in Hadoop itself.
Together they form a big ecosystem.
This presentation covers some of those systems.
While not capable to cover too many in one presentation, I tried to focus on the most famous/popular ones and on the most interesting ones.
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...Databricks
Recently, there has been increased interest in running analytics and machine learning workloads on top of serverless frameworks in the cloud. The serverless execution model provides fine-grained scaling and unburdens users from having to manage servers, but also adds substantial performance overheads due to the fact that all data and intermediate state of compute task is stored on remote shared storage.
In this talk I first provide a detailed performance breakdown from a machine learning workload using Spark on AWS Lambda. I show how the intermediate state of tasks — such as model updates or broadcast messages — is exchanged using remote storage and what the performance overheads are. Later, I illustrate how the same workload performs on-premise using Apache Spark and Apache Crail deployed on a high-performance cluster (100Gbps network, NVMe Flash, etc.). Serverless computing simplifies the deployment of machine learning applications. The talk shows that performance does not need to be sacrificed.
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...Databricks
As Apache Spark applications move to a containerized environment, there are many questions about how to best configure server systems in the container world. In this talk we will demonstrate a set of tools to better monitor performance and identify optimal configuration settings. We will demonstrate how Prometheus, a project that is now part of the Cloud Native Computing Foundation (CNCF: https://www.cncf.io/projects/), can be applied to monitor and archive system performance data in a containerized spark environment.
In our examples, we will gather spark metric output through Prometheus and present the data with Grafana dashboards. We will use our examples to demonstrate how performance can be enhanced through different tuned configuration settings. Our demo will show how to configure settings across the cluster as well as within each node.
Accelerating Spark MLlib and DataFrame with Vector Processor “SX-Aurora TSUBASA”Databricks
NEC has developed a new vector processor called SX-Aurora TSUBASA to accelerate machine learning and data analytics workloads. They developed a middleware framework called Frovedis that provides Spark-like functionality and is optimized for SX-Aurora TSUBASA. Frovedis achieved 10-100x speedups on machine learning algorithms and SQL-like queries compared to Spark on CPUs. NEC has also opened a lab called VEDAC for external users to access SX-Aurora TSUBASA systems running Frovedis.
Hadoop became the most common systm to store big data.
With Hadoop, many supporting systems emerged to complete the aspects that are missing in Hadoop itself.
Together they form a big ecosystem.
This presentation covers some of those systems.
While not capable to cover too many in one presentation, I tried to focus on the most famous/popular ones and on the most interesting ones.
This document discusses Typesafe's Reactive Platform and Apache Spark. It describes Typesafe's Fast Data strategy of using a microservices architecture with Spark, Kafka, HDFS and databases. It outlines contributions Typesafe has made to Spark, including backpressure support, dynamic resource allocation in Mesos, and integration tests. The document also discusses Typesafe's customer support and roadmap, including plans to introduce Kerberos security and evaluate Tachyon.
Apache MXNet Distributed Training Explained In Depth by Viacheslav Kovalevsky...Big Data Spain
Distributed training is a complex process that does more harm than good if it not setup correctly.
https://www.bigdataspain.org/2017/talk/apache-mxnet-distributed-training-explained-in-depth
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Power-Efficient Programming Using Qualcomm Multicore Asynchronous Runtime Env...Qualcomm Developer Network
The need to support both today’s multicore performance and tomorrow’s heterogeneous computing has become increasingly important. Qualcomm® Multicore Asynchronous Runtime Environment (MARE) provides powerful and easy-to-use abstractions to write parallel software. This session will provide a deep dive into the concepts of power-efficient programming and how to use Qualcomm MARE APIs to get energy and thermal benefits for Android apps. Qualcomm Multicore Asynchronous Runtime Environment is a product of Qualcomm Technologies, Inc.
Learn more about Qualcomm Multicore Asynchronous Runtime Environment: https://developer.qualcomm.com/MARE
Watch this presentation on YouTube:
https://www.youtube.com/watch?v=RI8yXhBb8Hg
Introduction to Spark ML Pipelines WorkshopHolden Karau
Introduction to Spark ML Pipelines Workshop slides - companion IJupyter notebooks in Python & Scala are available from my github at https://github.com/holdenk/spark-intro-ml-pipeline-workshop
Fast federated SQL with Apache CalciteChris Baynes
This document discusses Apache Calcite, an open source framework for federated SQL queries. It provides an introduction to Calcite and its components. It then evaluates Calcite's performance on single data sources through benchmarks. Lastly, it proposes a hybrid approach to enable efficient federated queries using Calcite and Spark.
Similar to Advanced Spark Programming - Part 2 | Big Data Hadoop Spark Tutorial | CloudxLab (20)
Understanding computer vision with Deep LearningCloudxLab
Computer vision is a branch of computer science which deals with recognising objects, people and identifying patterns in visuals. It is basically analogous to the vision of an animal.
Topics covered:
1. Overview of Machine Learning
2. Basics of Deep Learning
3. What is computer vision and its use-cases?
4. Various algorithms used in Computer Vision (mostly CNN)
5. Live hands-on demo of either Auto Cameraman or Face recognition system
6. What next?
This document provides an agenda for an introduction to deep learning presentation. It begins with an introduction to basic AI, machine learning, and deep learning terms. It then briefly discusses use cases of deep learning. The document outlines how to approach a deep learning problem, including which tools and algorithms to use. It concludes with a question and answer section.
This document discusses recurrent neural networks (RNNs) and their applications. It begins by explaining that RNNs can process input sequences of arbitrary lengths, unlike other neural networks. It then provides examples of RNN applications, such as predicting time series data, autonomous driving, natural language processing, and music generation. The document goes on to describe the fundamental concepts of RNNs, including recurrent neurons, memory cells, and different types of RNN architectures for processing input/output sequences. It concludes by demonstrating how to implement basic RNNs using TensorFlow's static_rnn function.
Natural Language Processing (NLP) is a field of artificial intelligence that deals with interactions between computers and human languages. NLP aims to program computers to process and analyze large amounts of natural language data. Some common NLP tasks include speech recognition, text classification, machine translation, question answering, and more. Popular NLP tools include Stanford CoreNLP, NLTK, OpenNLP, and TextBlob. Vectorization is commonly used to represent text in a way that can be used for machine learning algorithms like calculating text similarity. Tf-idf is a common technique used to weigh words based on their frequency and importance.
- Naive Bayes is a classification technique based on Bayes' theorem that uses "naive" independence assumptions. It is easy to build and can perform well even with large datasets.
- It works by calculating the posterior probability for each class given predictor values using the Bayes theorem and independence assumptions between predictors. The class with the highest posterior probability is predicted.
- It is commonly used for text classification, spam filtering, and sentiment analysis due to its fast performance and high success rates compared to other algorithms.
An autoencoder is an artificial neural network that is trained to copy its input to its output. It consists of an encoder that compresses the input into a lower-dimensional latent-space encoding, and a decoder that reconstructs the output from this encoding. Autoencoders are useful for dimensionality reduction, feature learning, and generative modeling. When constrained by limiting the latent space or adding noise, autoencoders are forced to learn efficient representations of the input data. For example, a linear autoencoder trained with mean squared error performs principal component analysis.
The document discusses challenges in training deep neural networks and solutions to those challenges. Training deep neural networks with many layers and parameters can be slow and prone to overfitting. A key challenge is the vanishing gradient problem, where the gradients shrink exponentially small as they propagate through many layers, making earlier layers very slow to train. Solutions include using initialization techniques like He initialization and activation functions like ReLU and leaky ReLU that do not saturate, preventing gradients from vanishing. Later improvements include the ELU activation function.
( Machine Learning & Deep Learning Specialization Training: https://goo.gl/5u2RiS )
This CloudxLab Reinforcement Learning tutorial helps you to understand Reinforcement Learning in detail. Below are the topics covered in this tutorial:
1) What is Reinforcement?
2) Reinforcement Learning an Introduction
3) Reinforcement Learning Example
4) Learning to Optimize Rewards
5) Policy Search - Brute Force Approach, Genetic Algorithms and Optimization Techniques
6) OpenAI Gym
7) The Credit Assignment Problem
8) Inverse Reinforcement Learning
9) Playing Atari with Deep Reinforcement Learning
10) Policy Gradients
11) Markov Decision Processes
Apache Spark - Key Value RDD - Transformations | Big Data Hadoop Spark Tutori...CloudxLab
The document provides information about key-value RDD transformations and actions in Spark. It defines transformations like keys(), values(), groupByKey(), combineByKey(), sortByKey(), subtractByKey(), join(), leftOuterJoin(), rightOuterJoin(), and cogroup(). It also defines actions like countByKey() and lookup() that can be performed on pair RDDs. Examples are given showing how to use these transformations and actions to manipulate key-value RDDs.
Apache Spark - Dataframes & Spark SQL - Part 1 | Big Data Hadoop Spark Tutori...CloudxLab
Big Data with Hadoop & Spark Training: http://bit.ly/2sf2z6i
This CloudxLab Introduction to Spark SQL & DataFrames tutorial helps you to understand Spark SQL & DataFrames in detail. Below are the topics covered in this slide:
1) Introduction to DataFrames
2) Creating DataFrames from JSON
3) DataFrame Operations
4) Running SQL Queries Programmatically
5) Datasets
6) Inferring the Schema Using Reflection
7) Programmatically Specifying the Schema
Introduction to SparkR | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
Big Data with Hadoop & Spark Training: http://bit.ly/2LCTufA
This CloudxLab Introduction to SparkR tutorial helps you to understand SparkR in detail. Below are the topics covered in this tutorial:
1) SparkR (R on Spark)
2) SparkR DataFrames
3) Launch SparkR
4) Creating DataFrames from Local DataFrames
5) DataFrame Operation
6) Creating DataFrames - From JSON
7) Running SQL Queries from SparkR
Introduction to NoSQL | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
1) NoSQL databases are non-relational and schema-free, providing alternatives to SQL databases for big data and high availability applications.
2) Common NoSQL database models include key-value stores, column-oriented databases, document databases, and graph databases.
3) The CAP theorem states that a distributed data store can only provide two out of three guarantees around consistency, availability, and partition tolerance.
Introduction to MapReduce - Hadoop Streaming | Big Data Hadoop Spark Tutorial...CloudxLab
Big Data with Hadoop & Spark Training: http://bit.ly/2sh5b3E
This CloudxLab Hadoop Streaming tutorial helps you to understand Hadoop Streaming in detail. Below are the topics covered in this tutorial:
1) Hadoop Streaming and Why Do We Need it?
2) Writing Streaming Jobs
3) Testing Streaming jobs and Hands-on on CloudxLab
Introduction To TensorFlow | Deep Learning Using TensorFlow | CloudxLabCloudxLab
This document provides instructions for getting started with TensorFlow using a free CloudxLab. It outlines the following steps:
1. Open CloudxLab and enroll if not already enrolled. Otherwise go to "My Lab".
2. In "My Lab", open Jupyter and run commands to clone an ML repository containing TensorFlow examples.
3. Go to the deep learning folder in Jupyter and open the TensorFlow notebook to get started with examples.
Introduction to Deep Learning | CloudxLabCloudxLab
( Machine Learning & Deep Learning Specialization Training: https://goo.gl/goQxnL )
This CloudxLab Deep Learning tutorial helps you to understand Deep Learning in detail. Below are the topics covered in this tutorial:
1) What is Deep Learning
2) Deep Learning Applications
3) Artificial Neural Network
4) Deep Learning Neural Networks
5) Deep Learning Frameworks
6) AI vs Machine Learning
In this tutorial, we will learn the the following topics -
+ The Curse of Dimensionality
+ Main Approaches for Dimensionality Reduction
+ PCA - Principal Component Analysis
+ Kernel PCA
+ LLE
+ Other Dimensionality Reduction Techniques
In this tutorial, we will learn the the following topics -
+ Voting Classifiers
+ Bagging and Pasting
+ Random Patches and Random Subspaces
+ Random Forests
+ Boosting
+ Stacking
In this tutorial, we will learn the the following topics -
+ Training and Visualizing a Decision Tree
+ Making Predictions
+ Estimating Class Probabilities
+ The CART Training Algorithm
+ Computational Complexity
+ Gini Impurity or Entropy?
+ Regularization Hyperparameters
+ Regression
+ Instability
In this tutorial, we will learn the the following topics -
+ Linear SVM Classification
+ Soft Margin Classification
+ Nonlinear SVM Classification
+ Polynomial Kernel
+ Adding Similarity Features
+ Gaussian RBF Kernel
+ Computational Complexity
+ SVM Regression
Introduction to Linux | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
Big Data with Hadoop & Spark Training: http://bit.ly/2wLh5aF
This CloudxLab Introduction to Linux helps you to understand Linux in detail. Below are the topics covered in this tutorial:
1) Linux Overview
2) Linux Components - The Programs, The Kernel, The Shell
3) Overview of Linux File System
4) Connect to Linux Console
5) Linux - Quick Start Commands
6) Overview of Linux File System
Scaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - MydbopsMydbops
This presentation, delivered at the Postgres Bangalore (PGBLR) Meetup-2 on June 29th, 2024, dives deep into connection pooling for PostgreSQL databases. Aakash M, a PostgreSQL Tech Lead at Mydbops, explores the challenges of managing numerous connections and explains how connection pooling optimizes performance and resource utilization.
Key Takeaways:
* Understand why connection pooling is essential for high-traffic applications
* Explore various connection poolers available for PostgreSQL, including pgbouncer
* Learn the configuration options and functionalities of pgbouncer
* Discover best practices for monitoring and troubleshooting connection pooling setups
* Gain insights into real-world use cases and considerations for production environments
This presentation is ideal for:
* Database administrators (DBAs)
* Developers working with PostgreSQL
* DevOps engineers
* Anyone interested in optimizing PostgreSQL performance
Contact info@mydbops.com for PostgreSQL Managed, Consulting and Remote DBA Services
How Social Media Hackers Help You to See Your Wife's Message.pdfHackersList
In the modern digital era, social media platforms have become integral to our daily lives. These platforms, including Facebook, Instagram, WhatsApp, and Snapchat, offer countless ways to connect, share, and communicate.
Transcript: Details of description part II: Describing images in practice - T...BookNet Canada
This presentation explores the practical application of image description techniques. Familiar guidelines will be demonstrated in practice, and descriptions will be developed “live”! If you have learned a lot about the theory of image description techniques but want to feel more confident putting them into practice, this is the presentation for you. There will be useful, actionable information for everyone, whether you are working with authors, colleagues, alone, or leveraging AI as a collaborator.
Link to presentation recording and slides: https://bnctechforum.ca/sessions/details-of-description-part-ii-describing-images-in-practice/
Presented by BookNet Canada on June 25, 2024, with support from the Department of Canadian Heritage.
Quantum Communications Q&A with Gemini LLM. These are based on Shannon's Noisy channel Theorem and offers how the classical theory applies to the quantum world.
Mitigating the Impact of State Management in Cloud Stream Processing SystemsScyllaDB
Stream processing is a crucial component of modern data infrastructure, but constructing an efficient and scalable stream processing system can be challenging. Decoupling compute and storage architecture has emerged as an effective solution to these challenges, but it can introduce high latency issues, especially when dealing with complex continuous queries that necessitate managing extra-large internal states.
In this talk, we focus on addressing the high latency issues associated with S3 storage in stream processing systems that employ a decoupled compute and storage architecture. We delve into the root causes of latency in this context and explore various techniques to minimize the impact of S3 latency on stream processing performance. Our proposed approach is to implement a tiered storage mechanism that leverages a blend of high-performance and low-cost storage tiers to reduce data movement between the compute and storage layers while maintaining efficient processing.
Throughout the talk, we will present experimental results that demonstrate the effectiveness of our approach in mitigating the impact of S3 latency on stream processing. By the end of the talk, attendees will have gained insights into how to optimize their stream processing systems for reduced latency and improved cost-efficiency.
Sustainability requires ingenuity and stewardship. Did you know Pigging Solutions pigging systems help you achieve your sustainable manufacturing goals AND provide rapid return on investment.
How? Our systems recover over 99% of product in transfer piping. Recovering trapped product from transfer lines that would otherwise become flush-waste, means you can increase batch yields and eliminate flush waste. From raw materials to finished product, if you can pump it, we can pig it.
Best Practices for Effectively Running dbt in Airflow.pdfTatiana Al-Chueyr
As a popular open-source library for analytics engineering, dbt is often used in combination with Airflow. Orchestrating and executing dbt models as DAGs ensures an additional layer of control over tasks, observability, and provides a reliable, scalable environment to run dbt models.
This webinar will cover a step-by-step guide to Cosmos, an open source package from Astronomer that helps you easily run your dbt Core projects as Airflow DAGs and Task Groups, all with just a few lines of code. We’ll walk through:
- Standard ways of running dbt (and when to utilize other methods)
- How Cosmos can be used to run and visualize your dbt projects in Airflow
- Common challenges and how to address them, including performance, dependency conflicts, and more
- How running dbt projects in Airflow helps with cost optimization
Webinar given on 9 July 2024
UiPath Community Day Kraków: Devs4Devs ConferenceUiPathCommunity
We are honored to launch and host this event for our UiPath Polish Community, with the help of our partners - Proservartner!
We certainly hope we have managed to spike your interest in the subjects to be presented and the incredible networking opportunities at hand, too!
Check out our proposed agenda below 👇👇
08:30 ☕ Welcome coffee (30')
09:00 Opening note/ Intro to UiPath Community (10')
Cristina Vidu, Global Manager, Marketing Community @UiPath
Dawid Kot, Digital Transformation Lead @Proservartner
09:10 Cloud migration - Proservartner & DOVISTA case study (30')
Marcin Drozdowski, Automation CoE Manager @DOVISTA
Pawel Kamiński, RPA developer @DOVISTA
Mikolaj Zielinski, UiPath MVP, Senior Solutions Engineer @Proservartner
09:40 From bottlenecks to breakthroughs: Citizen Development in action (25')
Pawel Poplawski, Director, Improvement and Automation @McCormick & Company
Michał Cieślak, Senior Manager, Automation Programs @McCormick & Company
10:05 Next-level bots: API integration in UiPath Studio (30')
Mikolaj Zielinski, UiPath MVP, Senior Solutions Engineer @Proservartner
10:35 ☕ Coffee Break (15')
10:50 Document Understanding with my RPA Companion (45')
Ewa Gruszka, Enterprise Sales Specialist, AI & ML @UiPath
11:35 Power up your Robots: GenAI and GPT in REFramework (45')
Krzysztof Karaszewski, Global RPA Product Manager
12:20 🍕 Lunch Break (1hr)
13:20 From Concept to Quality: UiPath Test Suite for AI-powered Knowledge Bots (30')
Kamil Miśko, UiPath MVP, Senior RPA Developer @Zurich Insurance
13:50 Communications Mining - focus on AI capabilities (30')
Thomasz Wierzbicki, Business Analyst @Office Samurai
14:20 Polish MVP panel: Insights on MVP award achievements and career profiling
Advanced Techniques for Cyber Security Analysis and Anomaly DetectionBert Blevins
Cybersecurity is a major concern in today's connected digital world. Threats to organizations are constantly evolving and have the potential to compromise sensitive information, disrupt operations, and lead to significant financial losses. Traditional cybersecurity techniques often fall short against modern attackers. Therefore, advanced techniques for cyber security analysis and anomaly detection are essential for protecting digital assets. This blog explores these cutting-edge methods, providing a comprehensive overview of their application and importance.
Comparison Table of DiskWarrior Alternatives.pdfAndrey Yasko
To help you choose the best DiskWarrior alternative, we've compiled a comparison table summarizing the features, pros, cons, and pricing of six alternatives.
7 Most Powerful Solar Storms in the History of Earth.pdfEnterprise Wired
Solar Storms (Geo Magnetic Storms) are the motion of accelerated charged particles in the solar environment with high velocities due to the coronal mass ejection (CME).
Quality Patents: Patents That Stand the Test of TimeAurora Consulting
Is your patent a vanity piece of paper for your office wall? Or is it a reliable, defendable, assertable, property right? The difference is often quality.
Is your patent simply a transactional cost and a large pile of legal bills for your startup? Or is it a leverageable asset worthy of attracting precious investment dollars, worth its cost in multiples of valuation? The difference is often quality.
Is your patent application only good enough to get through the examination process? Or has it been crafted to stand the tests of time and varied audiences if you later need to assert that document against an infringer, find yourself litigating with it in an Article 3 Court at the hands of a judge and jury, God forbid, end up having to defend its validity at the PTAB, or even needing to use it to block pirated imports at the International Trade Commission? The difference is often quality.
Quality will be our focus for a good chunk of the remainder of this season. What goes into a quality patent, and where possible, how do you get it without breaking the bank?
** Episode Overview **
In this first episode of our quality series, Kristen Hansen and the panel discuss:
⦿ What do we mean when we say patent quality?
⦿ Why is patent quality important?
⦿ How to balance quality and budget
⦿ The importance of searching, continuations, and draftsperson domain expertise
⦿ Very practical tips, tricks, examples, and Kristen’s Musts for drafting quality applications
https://www.aurorapatents.com/patently-strategic-podcast.html
2. Advanced Spark Programming
Shared variables:
● When we pass functions such map()
● Every node gets a copy of the variable
● The change to these variables is not communicated back
● After starting of the map(), changes to the variable on driver
doesn't impact the worker.
Two Kinds:
1. Accumulators to aggregate information
2. Broadcast variables to efficiently distribute large values
3. Advanced Spark Programming
SHARED MEMORY - Accumulators
+= 10 += 20
are only “added” to
through associative operation
assoc.: 2+3+4=2+4+3=9
4. Advanced Spark Programming
● Accumulators are variables that are only “added” to through an
associative operation
● Can therefore be efficiently supported in parallel.
● They can be used to implement counters (as in MapReduce) or
sums.
Accumulators
10. Advanced Spark Programming
sc.setLogLevel("ERROR")
var file = sc.textFile("/data/mr/wordcount/input/")
var numBlankLines = sc.accumulator(0)
def toWords(line:String): Array[String] = {
if(line.length == 0) {numBlankLines += 1}
return line.split(" ");
}
var words = file.flatMap(toWords)
words.saveAsTextFile("words3")
Accumulator : Empty line count
https://gist.github.com/girisandeep/161d1d5ea09517b1ab44df81b9b148c0
11. Advanced Spark Programming
Accumulator : Empty line count
https://gist.github.com/girisandeep/161d1d5ea09517b1ab44df81b9b148c0
sc.setLogLevel("ERROR")
var file = sc.textFile("/data/mr/wordcount/input/")
var numBlankLines = sc.accumulator(0)
def toWords(line:String): Array[String] = {
if(line.length == 0) {numBlankLines += 1}
return line.split(" ");
}
var words = file.flatMap(toWords)
words.saveAsTextFile("words3")
printf("Blank lines: %d", numBlankLines.value)
//Blank lines: 24857
12. Advanced Spark Programming
● Spark Re-executes failed or slow tasks.
● Preemptively launches “speculative” copy of slow worker task
The net result is ???
Accumulators and Fault Tolerance
13. Advanced Spark Programming
● Spark Re-executes failed or slow tasks.
● Preemptively launches “speculative” copy of slow worker task
The net result is: The same function may run multiple times on the
same data.
Accumulators and Fault Tolerance
14. Advanced Spark Programming
● Spark Re-executes failed or slow tasks.
● Preemptively launches “speculative” copy of slow worker task
The net result is: The same function may run multiple times on the
same data.
Does it mean accumulators will give wrong result?
Accumulators and Fault Tolerance
15. Advanced Spark Programming
● Spark Re-executes failed or slow tasks.
● Preemptively launches “speculative” copy of slow worker task
The net result is: The same function may run multiple times on the
same data.
Does it mean accumulators will give wrong result?
YES, for accumulators in Transformation.
No, for accumulators in Action
Accumulators and Fault Tolerance
16. Advanced Spark Programming
○ For accumulators in actions, Each task’s accumulator update
applied once.
○ For reliable absolute value counter, put it inside an action
○ In transformations, this guarantee doesn't exist.
○ In transformations, use accumulators for debug only.
Accumulators and Fault Tolerance
17. Advanced Spark Programming
Custom Accumulators
● Out of the box, Spark supports accumulators of type Double, Long,
and Float.
● Spark also includes an API to define custom accumulator types
and custom aggregation operations
○ (e.g., finding the maximum of the accumulated values instead of
adding them).
● Custom accumulators need to extend AccumulatorV2.
19. Advanced Spark Programming
class MyComplex(var x: Int, var y: Int) extends Serializable{
def reset(): Unit = {
x = 0
y = 0
}
def add(p:MyComplex): MyComplex = {
x = x + p.x
y = y + p.y
return this
}
}
Custom Accumulators - version 1.x
https://gist.github.com/girisandeep/450ff3d29f20f2e31cdd09ad0f1c0df2
25. Advanced Spark Programming
Broadcast Variables : Introduction
commonWords = ["a", "an", "the", "of", "at", "is",
"am", "are", "this", "that", '', 'at']
If we need to remove the common words from our
wordcount, what do we need to do?
26. Advanced Spark Programming
Broadcast Variables : Introduction
commonWords = ["a", "an", "the", "of", "at", "is",
"am", "are", "this", "that", '', 'at']
If we need to remove the common words from our
wordcount, what do we need to do?
> We can create a local variable and use it
27. Advanced Spark Programming
commonWords = List("a", "an", "the", "of", "at",
"is", "am", "are", "this", "that", "", "at")
If we need to remove the common words from our
wordcount, what do we need to do?
> We can create a local variable and use it
> Is it inefficient?
Broadcast Variables : Introduction
28. Advanced Spark Programming
Yes, because
1. Spark sends referenced variables to all workers.
1. The default task launching mechanism is optimised for small task sizes.
2. If using multiple times, spark will be sending it again to all nodes
So, we use broadcast variable instead.
Broadcast Variables : Introduction
31. Advanced Spark Programming
● Efficiently send a large, read-only value to workers
● For example:
○ Send a large, read-only lookup table to all the nodes
Broadcast Variables
32. Advanced Spark Programming
● Efficiently send a large, read-only value to workers
● For example:
○ Send a large, read-only lookup table to all the nodes
○ Large feature vector in a machine learning algorithm
Broadcast Variables
33. Advanced Spark Programming
● Efficiently send a large, read-only value to workers
● For example:
○ Send a large, read-only lookup table to all the nodes
○ Large feature vector in a machine learning algorithm
● It is like a distributed cache of Hadoop
● Spark distributes broadcast variables efficiently to reduce communication
cost.
Broadcast Variables
34. Advanced Spark Programming
● Efficiently send a large, read-only value to workers
● For example:
○ Send a large, read-only lookup table to all the nodes
○ Large feature vector in a machine learning algorithm
● It is like a distributed cache of Hadoop
● Spark distributes broadcast variables efficiently to reduce communication
cost.
● Useful when
○ Tasks across multiple stages need the same data
○ Caching the data in deserialized form is important.
Broadcast Variables
35. Advanced Spark Programming
Broadcast Variables : Example
Removing Common Words using Broadcast.
https://gist.github.com/girisandeep/f12ab4bf2536dc5f0a8ca673efbac1db
36. Advanced Spark Programming
var commonWords = Array("a", "an", "the", "of", "at", "is", "am","are","this","that","at",
"in", "or", "and", "or", "not", "be", "for", "to", "it")
Broadcast Variables : Example
Removing Common Words using Broadcast.
https://gist.github.com/girisandeep/f12ab4bf2536dc5f0a8ca673efbac1db
37. Advanced Spark Programming
Broadcast Variables : Example
Removing Common Words using Broadcast.
https://gist.github.com/girisandeep/f12ab4bf2536dc5f0a8ca673efbac1db
var commonWords = Array("a", "an", "the", "of", "at", "is", "am","are","this","that","at",
"in", "or", "and", "or", "not", "be", "for", "to", "it")
val commonWordsMap = collection.mutable.Map[String, Int]()
for(word <- commonWords){
commonWordsMap(word) = 1
}
var commonWordsBC = sc.broadcast(commonWordsMap)
38. Advanced Spark Programming
var commonWords = Array("a", "an", "the", "of", "at", "is", "am","are","this","that","at",
"in", "or", "and", "or", "not", "be", "for", "to", "it")
val commonWordsMap = collection.mutable.Map[String, Int]()
for(word <- commonWords){
commonWordsMap(word) = 1
}
var commonWordsBC = sc.broadcast(commonWordsMap)
var file = sc.textFile("/data/mr/wordcount/input/big.txt")
Broadcast Variables : Example
Removing Common Words using Broadcast.
https://gist.github.com/girisandeep/f12ab4bf2536dc5f0a8ca673efbac1db
39. Advanced Spark Programming
Broadcast Variables : Example
var commonWords = Array("a", "an", "the", "of", "at", "is", "am","are","this","that","at",
"in", "or", "and", "or", "not", "be", "for", "to", "it")
val commonWordsMap = collection.mutable.Map[String, Int]()
for(word <- commonWords){
commonWordsMap(word) = 1
}
var commonWordsBC = sc.broadcast(commonWordsMap)
var file = sc.textFile("/data/mr/wordcount/input/big.txt")
def toWords(line:String):Array[String] = {
var words = line.split(" ")
var output = Array[String]();
for(word <- words){
if(! (commonWordsBC.value contains word.toLowerCase.trim.replaceAll("[^a-z]","")))
output = output :+ word;
}
return output;
}
var uncommonWords = file.flatMap(toWords)
Removing Common Words using Broadcast.
https://gist.github.com/girisandeep/f12ab4bf2536dc5f0a8ca673efbac1db
40. Advanced Spark Programming
Broadcast Variables : Example
Removing Common Words using Broadcast.
https://gist.github.com/girisandeep/f12ab4bf2536dc5f0a8ca673efbac1db
var commonWords = Array("a", "an", "the", "of", "at", "is", "am","are","this","that","at",
"in", "or", "and", "or", "not", "be", "for", "to", "it")
val commonWordsMap = collection.mutable.Map[String, Int]()
for(word <- commonWords){
commonWordsMap(word) = 1
}
var commonWordsBC = sc.broadcast(commonWordsMap)
var file = sc.textFile("/data/mr/wordcount/input/big.txt")
def toWords(line:String):Array[String] = {
var words = line.split(" ")
var output = Array[String]();
for(word <- words){
if(! (commonWordsBC.value contains word.toLowerCase.trim.replaceAll("[^a-z]","")))
output = output :+ word;
}
return output;
}
var uncommonWords = file.flatMap(toWords)
uncommonWords.take(100)
41. Advanced Spark Programming
Key Performance Considerations
1. Level of Parallelism
2. Serialization Format
3. Memory Management
4. Hardware Provisioning
42. Advanced Spark Programming
Level of Parallelism
By Default
● A single task per one partition,
● A single core in the cluster to execute.
● Default partitions are based on underlying storage or CPU
● HDFS RDDs - One partition per block
43. Advanced Spark Programming
Level of Parallelism
Too Less ⇒ Might leave resources idle
Too Much ⇒ Small overheads due to each partition adds up
By Default
● A single task per one partition,
● A single core in the cluster to execute.
● Default partitions are based on underlying storage or CPU
● HDFS RDDs - One partition per block
46. Advanced Spark Programming
$ spark-shell --master yarn
scala> var myrdd = sc.textFile("/data/msprojects/in_table.csv")
scala> myrdd.partitions.length
res1: Int = 62
Key Performance Considerations - Partitions
So, number of partitions is a function of number of data blocks in case
of sc.textFile.
47. Advanced Spark Programming
Key Performance Considerations - Partitions
// In the local mode
spark-shell
scala> var myrdd = sc.parallelize(1 to 100000)
scala> myrdd.partitions.length
res1: Int = 4
[sandeep@ip-172-31-60-179 ~]$ cat /proc/cpuinfo|grep processor
processor : 0
processor : 1
processor : 2
processor : 3
Since my machine has 4 cores, it has created 4 partitions.
48. Advanced Spark Programming
$ spark-shell --master yarn
scala> var myrdd = sc.parallelize(1 to 100000)
scala> myrdd.partitions.length
res6: Int = 2
When we are running in yarn mode, the number of partitions is function
of tasks that can be executed on a node, Here it is 2.
Key Performance Considerations - Partitions
49. Advanced Spark Programming
Level of Parallelism
1. Specify number of partitions in sc.parallelize and sc.textFile
2. Shuffling operations accept degree of parallelism in parameter
3. repartition() or partitionBy
4. To efficiently shrink, prefer coalesce() over repartition()
How to control parallelism?
50. Advanced Spark Programming
Level of Parallelism
1. We are reading a large amount of data from S3.
2. filter() operation is likely to leave a tiny fraction
3. Result of filter() will have same size RDD as parent but with
many empty or small partitions.
4. Improve the application’s performance by coalescing
Example
51. Advanced Spark Programming
Serialization Format
● While transferring or saving objects need serialization
● Comes into play during large transfers
● By default Spark will use Java’s built-in serializer.
53. Advanced Spark Programming
Serialization Format
Kryo
● Spark also supports the use of Kryo
● Faster and more compact
● But cannot serialize all types of objects “out of the box.”
● Almost all applications will benefit from shifting to Kryo
● To use,
○ sc.getConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
● For best performance, register classes with Kryo
○ sc.getConf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))
○ Class needs to implement Java’s Serializable interface
54. Advanced Spark Programming
RDD storage
● persist()'ed memory
● spark.storage.memoryFraction - Default: 60%
● If exceeded, older will be dropped
○ will be computed on demand
● For huge data, use persist() with MEMORY_AND_DISK
Memory Management
55. Advanced Spark Programming
RDD storage
● persist()'ed memory
● spark.storage.memoryFraction - Default: 60%
● If exceeded, older will be dropped
○ will be computed on demand
● For huge data, use persist() with MEMORY_AND_DISK
Memory Management
Shuffle and aggregation buffers
● For storing shuffle output data
● spark.shuffle.memoryFraction - Default: 20%
56. Advanced Spark Programming
RDD storage
● persist()'ed memory
● spark.storage.memoryFraction - Default: 60%
● If exceeded, older will be dropped
○ will be computed on demand
● For huge data, use persist() with MEMORY_AND_DISK
Memory Management
Shuffle and aggregation buffers
● For storing shuffle output data
● spark.shuffle.memoryFraction - Default: 20%
User Code
Remaining
Default: 20% of memory
57. Advanced Spark Programming
● Main Parameters
○ Executor’s Memory (spark.executor.memory)
○ Number of cores per Executor,
○ Total number of executors
○ No. of disks
Hardware Provisioning
58. Advanced Spark Programming
● Main Parameters
○ Executor’s Memory (spark.executor.memory)
○ Number of cores per Executor,
○ Total number of executors
○ No. of disks
● App Speed = (Impact of Memory + Cores)
○ Huge memory -> GC pauses
○ 64GB or less
Hardware Provisioning
59. Advanced Spark Programming
● Main Parameters
○ Executor’s Memory (spark.executor.memory)
○ Number of cores per Executor,
○ Total number of executors
○ No. of disks
● App Speed = (Impact of Memory + Cores)
○ Huge memory -> GC pauses
○ 64GB or less
● Linear scaling
○ 2 x Hardware == 2 x speed
Hardware Provisioning