MapReduce is a programming model for processing large datasets in parallel. It works by breaking the dataset into independent chunks which are processed by the map function, and then grouping the output of the maps into partitions to be processed by the reduce function. Hadoop uses MapReduce to provide fault tolerance by restarting failed tasks and monitoring the JobTracker and TaskTrackers. MapReduce programs can be written in languages other than Java using Hadoop Streaming.
This document describes how to set up a single-node Hadoop installation to perform MapReduce operations. It discusses supported platforms, required software including Java and SSH, and preparing the Hadoop cluster in either local, pseudo-distributed, or fully-distributed mode. The main components of the MapReduce execution pipeline are explained, including the driver, mapper, reducer, and input/output formats. Finally, a simple word count example MapReduce job is described to demonstrate how it works.
This is the presentation I made at JavaDay Kiev 2015 on the architecture of Apache Spark. It covers the memory model, the shuffle implementations, DataFrames and some other high-level stuff, and can be used as an introduction to Apache Spark.
This document provides an overview of the MapReduce paradigm and Hadoop framework. It describes how MapReduce uses a map and reduce phase to process large amounts of distributed data in parallel. Hadoop is an open-source implementation of MapReduce that stores data in HDFS. It allows applications to work with thousands of computers and petabytes of data. Key advantages of MapReduce include fault tolerance, scalability, and flexibility. While it is well-suited for batch processing, it may not replace traditional databases for data warehousing. Overall efficiency remains an area for improvement.
This document summarizes Viadeo's use of Apache Spark. It discusses how Spark is used to build models for job offer click prediction and member segmentation. Spark jobs process event log data from HDFS and HBase to cluster job titles, build relationship graphs, compute input variables for regression models, and evaluate segments. The models improve click-through rates and allow flexible, fast member targeting. Future work includes indexing segmentations and exposing them for analytics and online campaign building.
This document introduces Apache Spark, an open-source cluster computing system that provides fast, general execution engines for large-scale data processing. It summarizes key Spark concepts including resilient distributed datasets (RDDs) that let users spread data across a cluster, transformations that operate on RDDs, and actions that return values to the driver program. Examples demonstrate how to load data from files, filter and transform it using RDDs, and run Spark programs on a local or cluster environment.
SF Big Analytics 20191112: How to performance-tune Spark applications in larg...
Uber developed a new Spark ingestion system, Marmaray, for data ingestion from various sources. It's designed to ingest billions of Kafka messages every 30 minutes, and the amount of data handled by the pipeline is on the order of hundreds of TBs. Omkar details how to tackle such scale and shares insights into the optimization techniques used. Some key highlights are how to understand bottlenecks in Spark applications, whether or not to cache your Spark DAG to avoid rereading your input data, how to effectively use accumulators to avoid unnecessary Spark actions, how to inspect your heap and non-heap memory usage across hundreds of executors, how you can change the layout of data to save long-term storage cost, how to effectively use serializers and compression to save network and disk traffic, how to amortize the cost of your application by multiplexing your jobs, and different techniques for reducing memory footprint, runtime, and on-disk usage. These techniques significantly (~10%–40%) reduced memory footprint, runtime, and disk usage.
Speaker: Omkar Joshi (Uber)
Omkar Joshi is a senior software engineer on Uber’s Hadoop platform team, where he’s architecting Marmaray. Previously, he led object store and NFS solutions at Hedvig and was an initial contributor to Hadoop’s YARN scheduler.
Apache Hadoop India Summit 2011 talk "Hadoop Map-Reduce Programming & Best Pr...
This document provides an overview of MapReduce programming and best practices for Apache Hadoop. It describes the key components of Hadoop including HDFS, MapReduce, and the data flow. It also discusses optimizations that can be made to MapReduce jobs, such as using combiners, compression, and speculation. Finally, it outlines some anti-patterns to avoid and tips for debugging MapReduce applications.
Pig is a platform for analyzing large datasets that uses a high-level language to express data analysis programs. It compiles programs into MapReduce jobs that can run in parallel on a Hadoop cluster. Pig provides built-in functions for common tasks and allows users to define their own custom functions (UDFs). Programs can be run locally or on a Hadoop cluster by placing commands in a script or Grunt shell.
The document provides an overview of MapReduce and how it addresses the problem of processing large datasets in a distributed computing environment. It explains how MapReduce, inspired by functional programming, works by splitting data, mapping functions to the pieces in parallel, and then reducing the results. Examples are given of word count and of sorting word counts to find the most frequent word. Finally, it discusses how Hadoop popularized MapReduce by providing an open-source implementation and ecosystem.
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
Zstandard is a fast compression algorithm which you can use in Apache Spark in various ways. In this talk, I briefly summarize the evolution of Apache Spark in this area, four main use cases, their benefits, and the next steps:
1) ZStandard can optimize Spark local disk IO by compressing shuffle files significantly. This is very useful in K8s environments. It's beneficial not only when you use `emptyDir` with `memory` medium; it also maximizes the OS cache benefit when you use shared SSDs or container local storage. In Spark 3.2, SPARK-34390 takes advantage of ZStandard's buffer pool feature, and its performance gain is impressive, too.
2) Event log compression is another area where you can save storage cost on cloud storage like S3 and improve usability. SPARK-34503 officially switched the default event log compression codec from LZ4 to Zstandard.
3) Zstandard data file compression can give you more benefits when you use ORC/Parquet files as your input and output. Apache ORC 1.6 already supports Zstandard, and Apache Spark enables it via SPARK-33978. The upcoming Parquet 1.12 will support Zstandard compression.
4) Last, but not least, since Apache Spark 3.0, Zstandard is used to serialize/deserialize MapStatus data instead of Gzip.
There is more community work to utilize Zstandard to improve Spark. For example, the Apache Avro community also supports Zstandard, and SPARK-34479 aims to support Zstandard in Spark's Avro file format in Spark 3.2.0.
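As a rough illustration of how the use cases above map onto Spark settings, the sketch below sets the relevant configuration keys from PySpark; exact key names and supported values depend on your Spark, ORC, and Parquet versions, so treat this as an assumption-laden example rather than a recipe.

# Hypothetical sketch: enable Zstandard for shuffle, event logs, and data files.
from pyspark import SparkConf

conf = (SparkConf()
        .set('spark.io.compression.codec', 'zstd')            # shuffle/spill compression
        .set('spark.eventLog.compress', 'true')
        .set('spark.eventLog.compression.codec', 'zstd')      # event logs (Spark 3.x)
        .set('spark.sql.orc.compression.codec', 'zstd')       # ORC output files
        .set('spark.sql.parquet.compression.codec', 'zstd'))  # Parquet output files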
This document discusses common mistakes people make when writing Spark applications and provides recommendations to address them. It covers issues related to executor configuration, application failures due to shuffle block sizes exceeding limits, slow jobs caused by data skew, and managing the DAG to avoid excessive shuffles and stages. Recommendations include using smaller executors, increasing the number of partitions, addressing skew through techniques like salting, and preferring ReduceByKey over GroupByKey and TreeReduce over Reduce to improve performance and resource usage.
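To make the ReduceByKey-versus-GroupByKey recommendation concrete, here is a small PySpark sketch (illustrative only, not taken from the talk): reduceByKey combines values on the map side before the shuffle, while groupByKey ships every individual value across the network first.

# Illustrative sketch of the ReduceByKey vs. GroupByKey recommendation ('sc' is a SparkContext).
pairs = sc.parallelize([('a', 1), ('b', 1), ('a', 1)])

# reduceByKey: partial sums are computed per partition, so only one value per key
# per partition crosses the network.
sums = pairs.reduceByKey(lambda x, y: x + y)

# groupByKey: every (key, value) record is shuffled, then summed afterwards.
sums_slow = pairs.groupByKey().mapValues(sum)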
This document provides an overview of the Hadoop MapReduce Fundamentals course. It discusses what Hadoop is, why it is used, common business problems it can address, and companies that use Hadoop. It also outlines the core parts of Hadoop distributions and the Hadoop ecosystem. Additionally, it covers common MapReduce concepts like HDFS, the MapReduce programming model, and Hadoop distributions. The document includes several code examples and screenshots related to Hadoop and MapReduce.
This document discusses using Python for Hadoop and data mining. It introduces Dumbo, which allows writing Hadoop programs in Python. K-means clustering in MapReduce is also covered. Dumbo provides a Pythonic API for MapReduce and allows extending Hadoop functionality. Examples demonstrate implementing K-means in Dumbo and optimizing it by computing partial centroids locally in mappers. The document also lists Python books and tools for data mining and scientific computing.
The document provides an overview of various Apache Pig features including:
- The Grunt shell which allows interactive execution of Pig Latin scripts and access to HDFS.
- Advanced relational operators like SPLIT, ASSERT, CUBE, SAMPLE, and RANK for transforming data.
- Built-in functions and user defined functions (UDFs) for data processing. Macros can also be defined.
- Running Pig in local or MapReduce mode and accessing HDFS from within Pig scripts.
Overview of myHadoop 0.30, a framework for deploying Hadoop on existing high-performance computing infrastructure. Discussion of how to install it, spin up a Hadoop cluster, and use the new features.
myHadoop 0.30's project page is now on GitHub (https://github.com/glennklockwood/myhadoop) and the latest release tarball can be downloaded from my website (glennklockwood.com/files/myhadoop-0.30.tar.gz)
A brief overview of a case study where 438 human genomes underwent read mapping and variant calling in under two months. Architectural requirements for the multi-stage pipeline are covered.
A transpiler is a type of compiler that takes source code from one programming language and outputs source code in another programming language, while a compiler converts source code directly into machine code. Transpilers allow code to be translated between languages at similar levels of abstraction, such as C++ to C, while compilers translate to a lower level like C to assembly code. Transpilers are useful for porting codebases to new languages, translating between language versions, or implementing domain-specific languages. Popular transpilers include Babel, TypeScript, and Emscripten.
ASCI Terascale Simulation Requirements and Deployments
Presented at the Oak Ridge Interconnects Workshop in 1999. A fun historical perspective on where the HPC industry in 1999 thought we would be going as we moved toward the petascale era.
Use of spark for proteomic scoring seattle presentation
This document discusses using Apache Spark to parallelize proteomic scoring, which involves matching tandem mass spectra against a large database of peptides. The author developed a version of the Comet scoring algorithm and implemented it on a Spark cluster. This outperformed single machines by over 10x, allowing searches that took 8 hours to be done in under 30 minutes. Key considerations for running large jobs in parallel on Spark are discussed, such as input formatting, accumulator functions for debugging, and smart partitioning of data. The performance improvements allow searching larger databases and considering more modifications.
Technosphere, Mail.ru Group and Lomonosov Moscow State University.
Course: "Methods for Distributed Processing of Large Data Volumes in Hadoop"
Course lecture videos: https://www.youtube.com/playlist?list=PLrCZzMib1e9rPxMIgPri9YnOpvyDAL9HD
The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...
Glenn K. Lockwood's document summarizes his professional background and experience with data-intensive computing systems. It then discusses the Gordon supercomputer deployed at SDSC in 2012, which was one of the world's first systems to use flash storage. The document analyzes Gordon's architecture using burst buffers and SSDs, experiences using the flash file system, and lessons learned. It also compares Gordon's proto-burst buffer approach to the dedicated burst buffer nodes on the Cori supercomputer.
Apache Pig: Introduction, Description, Installation, Pig Latin Commands, Use, Examples, Usefulness are demonstrated in this presentation.
Tushar B. Kute
Researcher,
http://tusharkute.com
Pig is a data flow language that sits on top of Hadoop and allows users to quickly process large volumes of data across many servers simultaneously. It supports relational features like joins, groups, and aggregates, making it well-suited for extract, transform, load (ETL) tasks. Common ETL use cases for Pig include time-sensitive data loads from various sources into databases, and processing multiple data sources to gain insights into customer behavior. While Pig can handle ETL tasks, it is also capable of sampling large datasets for analysis and providing analytical insights beyond basic ETL functions.
Apache Hadoop, HDFS and MapReduce Overview
This document provides an overview of Apache Hadoop, HDFS, and MapReduce. It describes how Hadoop uses a distributed file system (HDFS) to store large amounts of data across commodity hardware. It also explains how MapReduce allows distributed processing of that data by allocating map and reduce tasks across nodes. Key components discussed include the HDFS architecture with NameNodes and DataNodes, data replication for fault tolerance, and how the MapReduce engine works with a JobTracker and TaskTrackers to parallelize jobs.
Generalized Linear Models in Spark MLlib and SparkR
Generalized linear models (GLMs) unify various statistical models such as linear regression and logistic regression through the specification of a model family and link function. They are widely used in modeling, inference, and prediction with applications in numerous fields. In this talk, we will summarize recent community efforts in supporting GLMs in Spark MLlib and SparkR. We will review supported model families, link functions, and regularization types, as well as their use cases, e.g., logistic regression for classification and log-linear model for survival analysis. Then we discuss the choices of solvers and their pros and cons given training datasets of different sizes, and implementation details in order to match R’s model output and summary statistics. We will also demonstrate the APIs in MLlib and SparkR, including R model formula support, which make building linear models a simple task in Spark. This is a joint work with Eric Liang, Yanbo Liang, and some other Spark contributors.
Linux Performance Analysis: New Tools and Old Secrets
Talk for USENIX LISA 2014 by Brendan Gregg, Netflix. At Netflix performance is crucial, and we use many high- to low-level tools to analyze our stack in different ways. In this talk, I will introduce new system observability tools we are using at Netflix, which I've ported from my DTraceToolkit, and which are intended for our Linux 3.2 cloud instances. These show that Linux can do more than you may think, by using creative hacks and workarounds with existing kernel features (ftrace, perf_events). While these are solving issues on current versions of Linux, I'll also briefly summarize the future in this space: eBPF, ktap, SystemTap, sysdig, etc.
Talk for PerconaLive 2016 by Brendan Gregg. Video: https://www.youtube.com/watch?v=CbmEDXq7es0 . "Systems performance provides a different perspective for analysis and tuning, and can help you find performance wins for your databases, applications, and the kernel. However, most of us are not performance or kernel engineers, and have limited time to study this topic. This talk summarizes six important areas of Linux systems performance in 50 minutes: observability tools, methodologies, benchmarking, profiling, tracing, and tuning. Included are recipes for Linux performance analysis and tuning (using vmstat, mpstat, iostat, etc), overviews of complex areas including profiling (perf_events), static tracing (tracepoints), and dynamic tracing (kprobes, uprobes), and much advice about what is and isn't important to learn. This talk is aimed at everyone: DBAs, developers, operations, etc, and in any environment running Linux, bare-metal or the cloud."
Broken benchmarks, misleading metrics, and terrible tools. This talk will help you navigate the treacherous waters of Linux performance tools, touring common problems with system tools, metrics, statistics, visualizations, measurement overhead, and benchmarks. You might discover that tools you have been using for years, are in fact, misleading, dangerous, or broken.
The speaker, Brendan Gregg, has given many talks on tools that work, including giving the Linux Performance Tools talk originally at SCALE. This is an anti-version of that talk, focusing on broken tools and metrics instead of the working ones. Metrics can be misleading, and counters can be counter-intuitive! This talk will include advice for verifying new performance tools, understanding how they work, and using them successfully.
Spark on Kubernetes - Advanced Spark and Tensorflow Meetup - Jan 19 2017 - An...
https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/227622666/
Title: Spark on Kubernetes
Abstract: Engineers across several organizations are working on support for Kubernetes as a cluster scheduler backend within Spark. While designing this, we have encountered several challenges in translating Spark to use idiomatic Kubernetes constructs natively. This talk is about our high level design decisions and the current state of our work.
Speaker:
Anirudh Ramanathan is a software engineer on the Kubernetes team at Google. His focus is on running stateful and batch workloads. Previously, he worked on GGC (Google Global Cache) and prior to that, on the infrastructure team at NVIDIA.
Video: https://www.youtube.com/watch?v=JRFNIKUROPE . Talk for linux.conf.au 2017 (LCA2017) by Brendan Gregg, about Linux enhanced BPF (eBPF). Abstract:
A world of new capabilities is emerging for the Linux 4.x series, thanks to enhancements to Berkeley Packet Filter (BPF) that have been included in Linux: an in-kernel virtual machine that can execute user space-defined programs. It is finding uses for security auditing and enforcement, enhancing networking (including eXpress Data Path), and performance observability and troubleshooting. Many new open source tools that use BPF for performance analysis have been written in the past 12 months. Tracing superpowers have finally arrived for Linux!
For its use with tracing, BPF provides the programmable capabilities to the existing tracing frameworks: kprobes, uprobes, and tracepoints. In particular, BPF allows timestamps to be recorded and compared from custom events, allowing latency to be studied in many new places: kernel and application internals. It also allows data to be efficiently summarized in-kernel, including as histograms. This has allowed dozens of new observability tools to be developed so far, including measuring latency distributions for file system I/O and run queue latency, printing details of storage device I/O and TCP retransmits, investigating blocked stack traces and memory leaks, and a whole lot more.
This talk will summarize BPF capabilities and use cases so far, and then focus on its use to enhance Linux tracing, especially with the open source bcc collection. bcc includes BPF versions of old classics, and many new tools, including execsnoop, opensnoop, funccount, ext4slower, and more (many of which I developed). Perhaps you'd like to develop new tools, or use the existing tools to find performance wins large and small, especially when instrumenting areas that previously had zero visibility. I'll also summarize how we intend to use these new capabilities to enhance systems analysis at Netflix.
Spark is an open-source distributed computing framework used for processing large datasets. It allows for in-memory cluster computing, which enhances processing speed. Spark core components include Resilient Distributed Datasets (RDDs) and a directed acyclic graph (DAG) that represents the lineage of transformations and actions on RDDs. Spark Streaming is an extension that allows for processing of live data streams with low latency.
This talk gives details about Spark internals and an explanation of the runtime behavior of a Spark application. It explains how high level user programs are compiled into physical execution plans in Spark. It then reviews common performance bottlenecks encountered by Spark users, along with tips for diagnosing performance problems in a production application.
Spark improves on Hadoop MapReduce by keeping data in-memory between jobs. It reads data into resilient distributed datasets (RDDs) that can be transformed and cached in memory across nodes for faster iterative jobs. RDDs are immutable, partitioned collections distributed across a Spark cluster. Transformations define operations on RDDs, while actions trigger computation by passing data to the driver program.
This document provides an introduction to Apache Spark, including its architecture and programming model. Spark is a cluster computing framework that provides fast, in-memory processing of large datasets across multiple cores and nodes. It improves upon Hadoop MapReduce by allowing iterative algorithms and interactive querying of datasets through its use of resilient distributed datasets (RDDs) that can be cached in memory. RDDs act as immutable distributed collections that can be manipulated using transformations and actions to implement parallel operations.
Spark is a fast and general engine for large-scale data processing. It runs programs up to 100x faster than Hadoop in memory, and 10x faster on disk. Spark supports Scala, Java, Python and can run on standalone, YARN, or Mesos clusters. It provides high-level APIs for SQL, streaming, machine learning, and graph processing.
This is an introductory tutorial to Apache Spark given at the Lagos Scala Meetup II. We discussed the basics of the processing engine, Spark, and how it relates to Hadoop MapReduce, with a little hands-on at the end of the session.
Video: https://www.youtube.com/watch?v=kkOG_aJ9KjQ
This talk gives details about Spark internals and an explanation of the runtime behavior of a Spark application. It explains how high level user programs are compiled into physical execution plans in Spark. It then reviews common performance bottlenecks encountered by Spark users, along with tips for diagnosing performance problems in a production application.
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
We will give a detailed introduction to Apache Spark and why and how Spark can change the analytics world. Apache Spark's memory abstraction is the RDD (Resilient Distributed DataSet). One of the key reasons why Apache Spark is so different is the introduction of RDDs. You cannot do anything in Apache Spark without knowing about RDDs. We will give a high-level introduction to RDDs, and in the second half we will have a deep dive into RDDs.
Introduction to Apache Spark. With an emphasis on the RDD API, Spark SQL (DataFrame and Dataset API) and Spark Streaming.
Presented at the Desert Code Camp:
http://oct2016.desertcodecamp.com/sessions/all
Spark is an open-source cluster computing framework. It started as a project in 2009 at UC Berkeley and was open sourced in 2010. It has over 300 contributors from 50+ organizations. Spark uses Resilient Distributed Datasets (RDDs) that allow in-memory cluster computing across clusters. RDDs provide a programming model for distributed datasets that can be created from external storage or by transforming existing RDDs. RDDs support operations like map, filter, reduce to perform distributed computations lazily.
Apache Spark is a cluster computing framework that allows for fast, easy, and general processing of large datasets. It extends the MapReduce model to support iterative algorithms and interactive queries. Spark uses Resilient Distributed Datasets (RDDs), which allow data to be distributed across a cluster and cached in memory for faster processing. RDDs support transformations like map, filter, and reduce and actions like count and collect. This functional programming approach allows Spark to efficiently handle iterative algorithms and interactive data analysis.
This document provides steps to install and run Apache Spark. It discusses:
1. Installing Scala, Spark, and configuring environment variables for Hadoop.
2. Running Spark programs using RDDs and transformations in standalone, YARN, and Hadoop modes.
3. Using SparkSQL and SparkR to read CSV files from HDFS and perform operations on DataFrames.
This document provides an overview of the MapReduce paradigm and Hadoop framework. It describes how MapReduce uses a map and reduce phase to process large amounts of distributed data in parallel. Hadoop is an open-source implementation of MapReduce that stores data in HDFS. It allows applications to work with thousands of computers and petabytes of data. Key advantages of MapReduce include fault tolerance, scalability, and flexibility. While it is well-suited for batch processing, it may not replace traditional databases for data warehousing. Overall efficiency remains an area for improvement.
This document provides an overview of Apache Spark's architectural components through the life of simple Spark jobs. It begins with a simple Spark application analyzing airline on-time arrival data, then covers Resilient Distributed Datasets (RDDs), the cluster architecture, job execution through Spark components like tasks and scheduling, and techniques for writing better Spark applications like optimizing partitioning and reducing shuffle size.
In this one day workshop, we will introduce Spark at a high level context. Spark is fundamentally different than writing MapReduce jobs so no prior Hadoop experience is needed. You will learn how to interact with Spark on the command line and conduct rapid in-memory data analyses. We will then work on writing Spark applications to perform large cluster-based analyses including SQL-like aggregations, machine learning applications, and graph algorithms. The course will be conducted in Python using PySpark.
Apache Spark is all the rage these days. For people who work with Big Data, Spark is a household name. We have been using it for quite some time now, so we already know that Spark is a lightning-fast cluster computing technology that is faster than Hadoop MapReduce.
If you ask any of these Spark techies how Spark is fast, they will give you a vague answer by saying Spark uses a DAG to carry out the in-memory computations.
So, how satisfying is this answer?
Well, to a Spark expert, this answer is about as good as poison.
Let's try to understand how exactly Spark handles our computations through the DAG.
Apache Spark - Sneha Challa - Google Pittsburgh - Aug 25th
The document is a presentation about Apache Spark given on August 25th, 2015 in Pittsburgh by Sneha Challa. It introduces Spark as a fast and general cluster computing engine for large-scale data processing. It discusses Spark's Resilient Distributed Datasets (RDDs) and transformations/actions. It provides examples of Spark APIs like map, reduce, and explains running Spark on standalone, Mesos, YARN, or EC2 clusters. It also covers Spark libraries like MLlib and running machine learning algorithms like k-means clustering and logistic regression.
Spark Summit East 2015 Advanced Devops Student Slides
This document provides an agenda for an advanced Spark class covering topics such as RDD fundamentals, Spark runtime architecture, memory and persistence, shuffle operations, and Spark Streaming. The class will be held in March 2015 and include lectures, labs, and Q&A sessions. It notes that some slides may be skipped and asks attendees to keep Q&A low during the class, with a dedicated Q&A period at the end.
2. Outline
I. Hadoop/MapReduce Recap and Limitations
II. Complex Workflows and RDDs
III. The Spark Framework
IV. Spark on Gordon
V. Practical Limitations of Spark
3. Map/Reduce Parallelism
[Diagram: independent tasks (task 0 through task 5), each processing its own chunk of data in parallel]
6. Shuffle/Sort
MapReduce Disk Spill
1. Map – convert raw input into key/value pairs. Output to local disk ("spill")
2. Shuffle/Sort – All reducers retrieve all spilled records from all mappers over the network
3. Reduce – For each unique key, do something with all the corresponding values. Output to HDFS
[Diagram: three Map tasks spilling to local disk, a Shuffle/Sort over the network, and three Reduce tasks]
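To make those three steps concrete, here is a minimal word-count sketch in the Hadoop Streaming style (an illustration, not code from the slides): the mapper emits (word, 1) key/value pairs, the framework's shuffle/sort groups them by key, and the reducer sums the values for each unique key.

#!/usr/bin/env python
# Minimal Hadoop Streaming-style word count (illustrative sketch only).
import sys
from itertools import groupby

def mapper(lines):
    # Map: turn raw input lines into (word, 1) key/value pairs.
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(sorted_pairs):
    # Reduce: for each unique key, sum all of its values.
    for word, group in groupby(sorted_pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == '__main__':
    # The framework normally performs the spill and shuffle/sort between these
    # two steps; an in-memory sort stands in for it here.
    pairs = sorted(mapper(sys.stdin))
    for word, count in reducer(pairs):
        print('%s\t%d' % (word, count))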
7. MapReduce: Two Fundamental Limitations
1. MapReduce prescribes the workflow.
• You map, then you reduce.
• You cannot reduce, then map...
• ...or anything else. See first point.
2. Full* data dump to disk between workflow steps.
• Mappers deliver output on local disk (mapred.local.dir)
• Reducers pull input over the network from other nodes' local disks
• Output goes right back to local disks via HDFS
* Combiners do local reductions to prevent a full, unreduced dump of data to local disk
[Diagram: Map tasks spilling to local disk, a Shuffle/Sort over the network, and Reduce tasks writing to HDFS]
8. Beyond MapReduce
• What if the workflow could be arbitrary in length?
• map-map-reduce
• reduce-map-reduce
• What if higher-level map/reduce operations could be applied?
• sampling or filtering of a large dataset
• mean and variance of a dataset
• sum/subtract all elements of a dataset
• SQL JOIN operator
9. Beyond MapReduce: Complex Workflows
• What if the workflow could be arbitrary in length?
• map-map-reduce
• reduce-map-reduce
How can you do this without flushing intermediate results to disk after every operation?
• What if higher-level map/reduce operations could be applied?
• sampling or filtering of a large dataset
• mean and variance of a dataset
• sum/subtract all elements of a dataset
• SQL JOIN operator
How can you ensure fault tolerance for all of these baked-in operations?
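As a sketch of what such an arbitrary-length, in-memory workflow looks like in practice, the hypothetical PySpark pipeline below chains filter, map, reduce, and map steps without ever writing intermediate results to disk (the paths and record format are made up for illustration).

# Illustrative only: an arbitrary-length pipeline with no intermediate disk writes.
# 'sc' is an existing SparkContext; the input path and record format are hypothetical.
events = sc.textFile('hdfs://master.ibnet0/user/glock/events.txt')

totals = (events
          .filter(lambda line: not line.startswith('#'))   # drop comment lines
          .map(lambda line: (line.split()[0], 1))          # map: (key, 1) pairs
          .reduceByKey(lambda a, b: a + b)                 # reduce: count per key
          .map(lambda kv: '%s\t%d' % kv))                  # map again: format output

totals.saveAsTextFile('hdfs://master.ibnet0/user/glock/event_counts')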
10. MapReduce Fault Tolerance
[Diagram: Map tasks feeding Reduce tasks]
Mapper Failure:
1. Re-run map task and spill to disk
2. Block until finished
3. Reducers proceed as normal
Reducer Failure:
1. Re-fetch spills from all mappers' disks
2. Re-run reducer task
11. Performing Complex Workflows
How can you do complex workflows without flushing intermediate results to disk after every operation?
1. Cache intermediate results in-memory
2. Allow users to specify persistence in memory and partitioning of the dataset across nodes
How can you ensure fault tolerance?
1. Coarse-grained atomicity via partitions (transform chunks of data, not record-by-record)
2. Use transaction logging--forget replication
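In PySpark terms, the user-specified persistence and partitioning mentioned above look roughly like the sketch below (paths and partition counts are made-up examples, not details from the talk).

# Sketch of user-controlled partitioning and in-memory persistence in PySpark.
from pyspark import StorageLevel

# Parse (user, bytes) pairs out of a hypothetical access log.
records = sc.textFile('hdfs://master.ibnet0/user/glock/access.log') \
            .map(lambda line: line.split()) \
            .map(lambda fields: (fields[0], int(fields[-1])))

# Hash-partition across 16 partitions and keep the result in memory.
by_user = records.partitionBy(16).persist(StorageLevel.MEMORY_ONLY)

# Two different aggregations reuse the same cached, partitioned RDD
# instead of re-reading the input from disk.
total_bytes = by_user.reduceByKey(lambda a, b: a + b)
num_requests = by_user.countByKey()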
12. Resilient Distributed Dataset (RDD)
• Comprised of distributed, atomic partitions of elements
• Apply transformations to generate new RDDs
• RDDs are immutable (read-only)
• RDDs can only be created from persistent storage (e.g., HDFS, POSIX, S3) or by transforming other RDDs
# Create an RDD from a file on HDFS
text = sc.textFile('hdfs://master.ibnet0/user/glock/mobydick.txt')
# Transform the RDD of lines into an RDD of words (one word per element)
words = text.flatMap( lambda line: line.split() )
# Transform the RDD of words into an RDD of key/value pairs
keyvals = words.map( lambda word: (word, 1) )
sc is a SparkContext object that describes our Spark cluster
lambda declares a "lambda function" in Python (an anonymous function in Perl and other languages)
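The slide stops at building the key/value pairs. A natural continuation (a sketch, not shown in the deck) would reduce them into per-word counts; note that this is still only a transformation, so nothing is computed yet.

# Continuation sketch (not from the slides): sum the 1s for each word.
counts = keyvals.reduceByKey( lambda a, b: a + b )
# 'counts' only records the lineage text -> words -> keyvals -> counts;
# nothing runs until an action is called (see the next slides).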
14. RDD Transformation vs. Action
• Transformations are lazy: nothing actually happens when this code is evaluated
• RDDs are computed only when an action is called on them, e.g.,
• Calculate statistics over the elements of an RDD (count, mean)
• Save the RDD to a file (saveAsTextFile)
• Reduce elements of an RDD into a single object or value (reduce)
• Allows you to define partitioning/caching behavior after defining the RDD but before calculating its contents
15. RDD Transformation vs. Action
• Must insert an action here to get the pipeline to execute.
• Actions create files or objects:
# The saveAsTextFile action dumps the contents of an RDD to disk
>>> rdd.saveAsTextFile('hdfs://master.ibnet0/user/glock/output.txt')
# The count action returns the number of elements in an RDD
>>> num_elements = rdd.count()
>>> num_elements
215136
>>> type(num_elements)
<type 'int'>
16. Resiliency: The 'R' in 'RDD'
• No replication of in-memory data
• Restrict transformations to coarse granularity
• Partition-level operations simplify data lineage
17. Resiliency: The 'R' in 'RDD'
• Reconstruct missing data from its lineage
• Data in RDDs are deterministic since partitions are immutable and atomic
18. Resiliency: The 'R' in 'RDD'
• Long lineages or complex interactions (reductions, shuffles) can be checkpointed
• RDD immutability means checkpointing can be nonblocking (done in the background)
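For reference, checkpointing an RDD with a long lineage looks roughly like this in PySpark (a generic sketch; the paths are hypothetical and not from the talk).

# Truncate a long lineage by checkpointing to reliable storage.
sc.setCheckpointDir('hdfs://master.ibnet0/user/glock/checkpoints')

rdd = sc.textFile('hdfs://master.ibnet0/user/glock/mobydick.txt') \
        .flatMap(lambda line: line.split()) \
        .map(lambda word: (word, 1)) \
        .reduceByKey(lambda a, b: a + b)

rdd.checkpoint()   # mark the RDD for checkpointing
rdd.count()        # the action computes the RDD; the checkpoint is written afterwards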
19. Introduction to Spark
SPARK: AN IMPLEMENTATION OF RDDS
20. Spark Framework
• Master/worker model
• Spark Master is analogous to the Hadoop JobTracker (MRv1) or Application Master (MRv2)
• Spark Worker is analogous to the Hadoop TaskTracker
• Relies on "3rd party" storage for RDD generation (hdfs://, s3n://, file://, http://)
• Spark clusters take three forms:
• Standalone mode - workers communicate directly with the master via a spark://master:7077 URI
• Mesos - mesos://master:5050 URI
• YARN - no HA; complicated job launch
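To connect a driver program to a standalone-mode cluster like the one described above, a PySpark application would typically be configured along these lines (hostname and application name are illustrative).

# Point the driver at a standalone Spark Master (hostname is illustrative).
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster('spark://master.ibnet0:7077')
        .setAppName('gordon-example'))
sc = SparkContext(conf=conf)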
21. Spark on Gordon: Configuration
1. Standalone mode is the simplest configuration and execution model (similar to MRv1)
2. Leverage existing HDFS support in myHadoop for storage
3. Combine #1 and #2 to extend myHadoop to support Spark:
$ export HADOOP_CONF_DIR=/home/glock/hadoop.conf
$ myhadoop-configure.sh
...
myHadoop: Enabling experimental Spark support
myHadoop: Using SPARK_CONF_DIR=/home/glock/hadoop.conf/spark
myHadoop: To use Spark, you will want to type the following commands:
source /home/glock/hadoop.conf/spark/spark-env.sh
myspark start
22. Spark on Gordon: Storage
• Spark can use HDFS
$ start-dfs.sh # after you run myhadoop-configure.sh, of course
...
$ pyspark
>>> mydata = sc.textFile('hdfs://localhost:54310/user/glock/mydata.txt')
>>> mydata.count()
982394
• Spark can use POSIX file systems too
$ pyspark
>>> mydata = sc.textFile('file:///oasis/scratch/glock/temp_project/mydata.txt')
>>> mydata.count()
982394
• S3 Native (s3n://) and HTTP (http://) also work
• file:// input will be served in chunks to Spark
workers via the Spark driver's built-in httpd
23. Spark on Gordon: Running
Spark treats several languages as first-class
citizens:
Feature       Scala  Java  Python
Interactive   YES    NO    YES
Shark (SQL)   YES    YES   YES
Streaming     YES    YES   NO
MLlib         YES    YES   YES
GraphX        YES    YES   NO
R is a second-class citizen; basic RDD API is
available outside of CRAN
(http://amplab-extras.github.io/SparkR-pkg/)
24. myHadoop/Spark on Gordon (1/2)
#!/bin/bash
#PBS -l nodes=2:ppn=16:native:flash
#PBS -l walltime=00:30:00
#PBS -q normal
### Environment setup for Hadoop
export MODULEPATH=/home/glock/apps/modulefiles:$MODULEPATH
module load hadoop/2.2.0
export HADOOP_CONF_DIR=$HOME/mycluster.conf
myhadoop-configure.sh
### Start HDFS. Starting YARN isn't necessary since Spark will be running in
### standalone mode on our cluster.
start-dfs.sh
### Load in the necessary Spark environment variables
source $HADOOP_CONF_DIR/spark/spark-env.sh
### Start the Spark masters and workers. Do NOT use the start-all.sh provided
### by Spark, as it does not correctly honor $SPARK_CONF_DIR
myspark start
25. myHadoop/Spark on Gordon (2/2)
### Run our example problem.
### Step 1. Load data into HDFS (Hadoop 2.x does not create the user's HDFS home
### dir by default, unlike Hadoop 1.x!)
hdfs dfs -mkdir -p /user/$USER
hdfs dfs -put /home/glock/hadoop/run/gutenberg.txt /user/$USER/gutenberg.txt
### Step 2. Run our Python Spark job. Note that Spark implicitly requires
### Python 2.6 (some features, like MLlib, require 2.7)
module load python scipy
/home/glock/hadoop/run/wordcount-spark.py
### Step 3. Copy output back out
hdfs dfs -get /user/$USER/output.dir $PBS_O_WORKDIR/
### Shut down Spark and HDFS
myspark stop
stop-dfs.sh
### Clean up
myhadoop-cleanup.sh
Wordcount submit script and Python code online:
https://github.com/glennklockwood/sparktutorial
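For orientation, a minimal sketch of what a wordcount-spark.py along these lines might contain (the actual script is in the repository above; the paths, app name, and master handling here are assumptions):
#!/usr/bin/env python
from pyspark import SparkContext

sc = SparkContext(appName='wordcount')
text = sc.textFile('hdfs://localhost:54310/user/glock/gutenberg.txt')
counts = text.flatMap( lambda line: line.split() ) \
             .map( lambda word: (word, 1) ) \
             .reduceByKey( lambda a, b: a + b )
counts.saveAsTextFile('hdfs://localhost:54310/user/glock/output.dir')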
27. Major Problems with Spark
1. Still smells like a CS project
2. Debugging is a dark art
3. Not battle-tested at scale
28. #1: Spark Smells Like CS
• Components are constantly breaking
• Graph.partitionBy broken in 1.0.0 (SPARK-1931)
• Some components never worked
• SPARK_CONF_DIR (start-all.sh) doesn't work (SPARK-2058)
• stop-master.sh doesn't work
• Spark with YARN will break with large data sets (SPARK-2398)
• spark-submit for standalone mode doesn't work (SPARK-2260)
29. #1: Spark Smells Like CS
• Really obvious usability issues:
>>> data = sc.textFile('file:///oasis/scratch/glock/temp_project/gutenberg.txt')
>>> data.saveAsTextFile('hdfs://gcn-8-42.ibnet0:54310/user/glock/output.dir')
14/04/30 16:23:07 ERROR Executor: Exception in task ID 19
scala.MatchError: 0 (of class java.lang.Integer)
at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:110)
at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:153)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
...
at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:722)
Read an RDD, then write it out = unhandled exception with
cryptic Scala errors from Python (SPARK-1690)
30. #2: Debugging is a Dark Art
>>> data.saveAsTextFile('hdfs://s12ib:54310/user/glock/gutenberg.out')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/N/u/glock/apps/spark-0.9.0/python/pyspark/rdd.py", line 682, in
saveAsTextFile
keyed._jrdd.map(self.ctx._jvm.BytesToString()).saveAsTextFile(path)
File "/N/u/glock/apps/spark-0.9.0/python/lib/py4j-0.8.1-
src.zip/py4j/java_gateway.py", line 537, in __call__
File "/N/u/glock/apps/spark-0.9.0/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py",
line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o23.saveAsTextFile.
: org.apache.hadoop.ipc.RemoteException: Server IPC version 9 cannot communicate with
client version 4
at org.apache.hadoop.ipc.Client.call(Client.java:1070)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
at $Proxy7.getProtocolVersion(Unknown Source)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379)
Cause: Spark built against Hadoop 2 DFS trying to access data
on Hadoop 1 DFS
31. #2: Debugging is a Dark Art
>>> data.count()
14/04/30 16:15:11 ERROR Executor: Exception in task ID 12
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop1/python/pyspark/worker.py",
serializer.dump_stream(func(split_index, iterator), outfile)
File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop1/python/pyspark/serializers.
self.serializer.dump_stream(self._batched(iterator), stream)
File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop1/python/pyspark/serializers.
for obj in iterator:
File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop1/python/pyspark/serializers.
for item in iterator:
File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop1/python/pyspark/rdd.py", lin
if acc is None:
TypeError: an integer is required
at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:131)
at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:153)
...
Cause: Master was using Python 2.6, but workers were only
able to find Python 2.4
32. #2: Debugging is a Dark Art
>>> data.saveAsTextFile('hdfs://user/glock/output.dir/')
14/04/30 17:53:20 WARN scheduler.TaskSetManager: Loss was due to org.apache.spark.api.p
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/worker.py",
serializer.dump_stream(func(split_index, iterator), outfile)
File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/serializers.
for obj in iterator:
File "/home/glock/apps/spark-0.9.0-incubating-bin-hadoop2/python/pyspark/rdd.py", lin
if not isinstance(x, basestring):
SystemError: unknown opcode
at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:131)
at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:153)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
...
Cause: Master was using Python 2.6, but workers were only
able to find Python 2.4
33. #2: Spark Debugging Tips
• $SPARK_LOG_DIR/app-* contains master/worker
logs with failure information
• Try to find the salient error amidst the stack traces
• Google that error--odds are, it is a known issue
• Stick any required environment variables ($PATH, $PYTHONPATH, $JAVA_HOME) in
$SPARK_CONF_DIR/spark-env.sh to rule out these problems; see the sketch below
• If all else fails, look at the Spark source code
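For example, a sketch of the sort of lines one might put in $SPARK_CONF_DIR/spark-env.sh (the paths are purely illustrative assumptions):
# spark-env.sh is sourced when Spark daemons start, so settings here reach the workers too
export JAVA_HOME=/usr/java/latest
export PATH=/opt/python/2.7/bin:$PATH
export PYTHONPATH=/opt/scipy/lib/python2.7/site-packages:$PYTHONPATH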
34. #3: Spark Isn't Battle Tested
• Companies (Cloudera, SAP, etc.) are jumping on the Spark bandwagon, with disclaimers about scaling
• Spark does not handle multitenancy well at all; wait scheduling is considered the best way to achieve memory/disk data locality
• Largest Spark clusters ~ hundreds of nodes
35. Spark Take-Aways
• FACTS
• Data is represented as resilient distributed datasets
(RDDs) which remain in-memory and read-only
• RDDs are comprised of elements
• Elements are distributed across physical nodes in user-defined
groups called partitions
• RDDs are subject to transformations and actions
• Fault tolerance achieved by lineage, not replication
• Opinions
• Spark is still in its infancy, but its progress is promising
• Good for evaluating; a good fit for Gordon and Comet
37. Lazy Evaluation + In-Memory Caching =
Optimized JOIN Operations
Start every webpage with a rank R = 1.0
1. For each webpage that links to N neighbor webpages, have it "contribute" R/N to each of those N neighbors
2. Then, for each webpage, set its rank R to (0.15 + 0.85 * the sum of the contributions it received)
3. Repeat
38. Lazy Evaluation + In-Memory Caching =
Optimized JOIN Operations
from operator import add   # reduceByKey(add) below needs this import

lines = sc.textFile('hdfs://master.ibnet0:54310/user/glock/links.txt')
# Load key/value pairs of (url, link), eliminate duplicates, and partition them such
# that all common keys are kept together. Then retain this RDD in memory.
# (tuple() makes each pair hashable so distinct() can work on it)
links = lines.map(lambda urls: tuple(urls.split())).distinct().groupByKey().cache()
# Create a new RDD of key/value pairs of (url, rank) and initialize all ranks to 1.0
ranks = links.map(lambda (url, neighbors): (url, 1.0))
# Calculate and update URL ranks (computeContribs is defined in the Editor's Notes below)
for iteration in range(10):
    # Calculate URL contributions to their neighbors
    contribs = links.join(ranks).flatMap(
        lambda (url, (urls, rank)): computeContribs(urls, rank))
    # Recalculate URL ranks based on neighbor contributions
    ranks = contribs.reduceByKey(add).mapValues(lambda rank: 0.15 + 0.85*rank)
# Print all URLs and their ranks
for (link, rank) in ranks.collect():
    print '%s has rank %s' % (link, rank)
Editor's Notes
groupByKey: group the values for each key in the RDD into a single sequence
mapValues: apply map function to all values of key/value pairs without modifying keys (or their partitioning)
collect: return a list containing all elements of the RDD
def computeContribs(urls, rank):
    """Calculates URL contributions to the rank of other URLs."""
    num_urls = len(urls)
    for url in urls:
        yield (url, rank / num_urls)
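For illustration (not part of the original notes), computeContribs splits a page's rank evenly among its outgoing links:
>>> list(computeContribs(['urlA', 'urlB'], 1.0))
[('urlA', 0.5), ('urlB', 0.5)]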