Scaling Spark workloads on YARN and Mesos can provide significant performance improvements, but the benefits vary across workloads. Adding resources alone may not fully utilize the new nodes, because Spark delays scheduling tasks on them while it waits for data-local slots. Tuning Spark's locality wait time parameter so that task placement preference changes quickly can help make better use of new resources. Dynamic executor allocation in Spark can also be enhanced to adjust configuration settings such as locality wait time during auto-scaling.
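As a rough illustration of the tuning described above, the sketch below renders a few Spark properties as `spark-submit` arguments. The property names (`spark.locality.wait`, `spark.dynamicAllocation.enabled`, `spark.shuffle.service.enabled`) are standard Spark settings; the helper name and the chosen value of `1s` are assumptions for illustration, not values taken from the talk.

```python
# Hypothetical helper: renders Spark properties as spark-submit --conf flags.
def spark_submit_conf_flags(conf):
    """Return a flat list of ["--conf", "key=value", ...] in sorted key order."""
    flags = []
    for key in sorted(conf):
        flags.extend(["--conf", f"{key}={conf[key]}"])
    return flags


# Illustrative values only; the right locality wait depends on the workload.
scaling_conf = {
    "spark.dynamicAllocation.enabled": "true",
    # Dynamic allocation normally relies on the external shuffle service.
    "spark.shuffle.service.enabled": "true",
    # Spark's default is 3s; a shorter wait lets tasks fall back to
    # less-local (for example, newly added) executors sooner.
    "spark.locality.wait": "1s",
}

print(" ".join(spark_submit_conf_flags(scaling_conf)))
```

A shorter wait trades data locality for faster use of idle executors, so it is worth measuring both settings on the actual workload.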
This document discusses analyzing large genomic datasets with ADAM and Toil. It summarizes the sequencing and analysis process, and how ADAM implemented on Spark can provide horizontal scalability and speedups of 30-50x over traditional tools. Toil is introduced as a pipeline system for massive genomic workflows that can run on thousands of nodes and is resilient to failures. Results show ADAM produces equivalent variants to GATK while being 3.5x faster and 4x cheaper.
This document describes Drizzle, a low latency execution engine for Apache Spark. It addresses the high overheads of Spark's centralized scheduling model by decoupling execution from scheduling through batch scheduling and pre-scheduling of shuffles. Microbenchmarks show Drizzle achieves milliseconds latency for iterative workloads compared to hundreds of milliseconds for Spark. End-to-end experiments show Drizzle improves latency for streaming and machine learning workloads like logistic regression. The authors are working on automatic batch tuning and an open source release of Drizzle.
Stepping beyond ETL in batches, large enterprises are looking at ways to generate more up-to-date insights. As we step into the age of Continuous Applications, this session will explore the increasingly popular Structured Streaming API in Apache Spark, its application to R, and examples of machine learning use cases. Starting with an introduction to the high-level concepts, the session will dive into the core of the execution plan internals and examine how SparkR extends the existing system to add the streaming capability. Learn how to build various data science applications on data streams, integrating with R packages to leverage the rich R ecosystem of 10k+ packages.
The document describes a new architecture called "monotasks" for Apache Spark that aims to make reasoning about Spark job performance easier. The monotasks architecture decomposes Spark tasks so that each task uses only one resource (e.g. CPU, disk, network). This avoids issues where Spark tasks bottleneck on different resources over time or experience resource contention. With monotasks, dedicated schedulers control resource contention and monotask timing data can be used to model ideal performance. Results show monotasks match Spark's performance and provide clearer insight into bottlenecks.
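The idea of using monotask timings to model ideal performance can be sketched as follows. This is a deliberately simplified model and not the authors' implementation: it assumes every monotask uses exactly one resource, that work on each resource parallelizes perfectly across its units, and that resources overlap fully, so the ideal stage time is set by the busiest resource. The function name and the sample timings are hypothetical.

```python
from collections import defaultdict


def ideal_stage_time(monotasks, capacity):
    """Estimate the ideal stage time and its bottleneck resource.

    monotasks: iterable of (resource, seconds) pairs, one per monotask.
    capacity:  dict mapping resource -> number of parallel units
               (for example, CPU cores or disks).
    """
    total = defaultdict(float)
    for resource, seconds in monotasks:
        total[resource] += seconds
    # Time each resource would need if its work were spread evenly.
    per_resource = {r: total[r] / capacity[r] for r in total}
    bottleneck = max(per_resource, key=per_resource.get)
    return bottleneck, per_resource[bottleneck]


# Illustrative timings: 16 s of CPU work, 6 s of disk work, 3 s of network.
sample = [("cpu", 8.0), ("cpu", 8.0), ("disk", 6.0), ("network", 3.0)]

print(ideal_stage_time(sample, {"cpu": 4, "disk": 1, "network": 1}))
print(ideal_stage_time(sample, {"cpu": 4, "disk": 2, "network": 1}))
```

Changing the capacity map answers "what if" questions, such as how the bottleneck shifts when a second disk is added.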
Luca Canali presented on using flame graphs to investigate performance improvements in Spark 2.0 over Spark 1.6 for a CPU-intensive workload. Flame graphs of the Spark 1.6 and 2.0 executions showed Spark 2.0 spending less time in core Spark functions and more time in whole stage code generation functions, indicating improved optimizations. Additional tools like Linux perf confirmed Spark 2.0 utilized CPU and memory throughput better. The presentation demonstrated how flame graphs and other profiling tools can help pinpoint performance bottlenecks and understand the impact of changes like Spark 2.0's code generation optimizations.
Random walks on graphs are a useful technique in machine learning, with applications in personalized PageRank, representation learning and others. This session will describe a novel algorithm for enumerating walks on large-scale graphs that benefits from several unique abilities of Apache Spark. The algorithm generates a recursive branching DAG of stages that separates out the “closed” and “open” walks. Spark’s shuffle file management system is used to accumulate the walks while the computation is progressing. In-memory caching over multi-core executors enables moving the walks several “steps” forward before shuffling to the next stage. See performance benchmarks, and hear about LinkedIn’s experience with Spark in production clusters. The session will conclude with an observation of how Spark’s unique and powerful constructs open new models of computation, not possible with the existing state of the art, for developing high-performance and scalable algorithms in data science and machine learning.
http://shelan.org/blog/2016/01/31/reproducible-experiment-to-compare-apache-spark-and-apache-flink-batch-processing/ http://blog.ashansa.org/2016/02/stream-processing-is-becoming-crucial.html Batch Processing. https://github.com/karamel-lab/batch-processing-comparison Stream Processing. https://github.com/karamel-lab/stream-processing-comparison
This document discusses Project Tungsten, which aims to substantially improve the memory and CPU efficiency of Spark. It describes how Spark has optimized IO but the CPU has become the bottleneck. Project Tungsten focuses on improving execution performance through techniques like explicit memory management, code generation, cache-aware algorithms, whole-stage code generation, and columnar in-memory data formats. It shows how these techniques provide significant performance improvements, such as 5-30x speedups on operators and 10-100x speedups on radix sort. Future work includes cost-based optimization and improving performance on many-core machines.
700 Updatable Queries Per Second: Spark as a Real-Time Web Service. Find out how to use Apache Spark with FiloDB for low-latency queries, something you never thought possible with Spark. Scale it down, not just scale it up!
This document presents a Spark framework for personalized DNA analysis at large scale for under $100 and less than 1 hour. The framework segments input DNA data and runs it through three stages on a Spark cluster: 1) mapping and static load balancing, 2) sorting and dynamic load balancing, and 3) Picard deduplication and GATK variant calling. It achieves high CPU utilization, scales linearly from 1 to 20 nodes, analyzes 400GB of data in under an hour on a 35-node cluster for under $100, and has a 99.1% concordance with serial GATK. Future work involves accelerating it using FPGAs.
This talk presents a continuous application example that relies on the Spark FAIR scheduler as the conductor to orchestrate an entire “lambda architecture” in a single Spark context. A typical time series event stream analysis involves four key components: an ETL step to store the raw data; a series of real-time aggregations on the join of streaming input and historical data to power a model; model execution; and ad hoc queries for human inspection. Compared to a typical design in which a number of Spark applications run individually, the key benefits of this setup are: 1. Stream batch processing is decoupled from triggering model calculation, so model calculations are triggered at a different pace from the stream processing. 2. The model always processes the latest data, using pure RDD APIs. 3. Operations are launched in different threads on the driver node, each submitted to the appropriate fair scheduler pool, and the FAIR scheduler handles resource distribution. 4. Code and time are shared by sharing the actual data transformations (such as the RDDs in the intermediate steps). 5. Ad hoc queries on intermediate state are supported without a dedicated serving layer or output protocol. 6. There is only one application to monitor and tune.
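The pool setup described above could look roughly like the sketch below, which generates a fair scheduler allocation file. The XML layout (`allocations`, `pool`, `schedulingMode`, `weight`, `minShare`) follows Spark's documented allocation file format; the pool names, weights and minimum shares are hypothetical and only mirror the four components listed in the abstract.

```python
import xml.etree.ElementTree as ET


def build_fair_scheduler_xml(pools):
    """Build the contents of a Spark fair scheduler allocation file.

    pools: dict mapping pool name -> (weight, min_share).
    """
    root = ET.Element("allocations")
    for name, (weight, min_share) in pools.items():
        pool = ET.SubElement(root, "pool", name=name)
        ET.SubElement(pool, "schedulingMode").text = "FAIR"
        ET.SubElement(pool, "weight").text = str(weight)
        ET.SubElement(pool, "minShare").text = str(min_share)
    return ET.tostring(root, encoding="unicode")


# Hypothetical pools, one per component of the pipeline.
pools = {
    "etl": (1, 2),
    "aggregation": (2, 2),
    "model": (2, 1),
    "adhoc": (1, 0),
}

print(build_fair_scheduler_xml(pools))
```

To use such a file, the application would set `spark.scheduler.mode` to `FAIR` and point `spark.scheduler.allocation.file` at it; each driver thread then selects its pool with `sc.setLocalProperty("spark.scheduler.pool", "<pool name>")` before submitting jobs.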
This document provides an overview of Spark: Data Science as a Service by Sridhar Alla and Kiran Muglurmath of Comcast. It discusses Comcast's data science challenges due to massive data size and lack of scalable architecture. It introduces Roadrunner, Comcast's solution built on Spark, which provides a centralized processing system with SQL and machine learning capabilities to enable data ingestion, quality checks, feature engineering, modeling and workflow management. Roadrunner is accessed through REST APIs and helps multiple teams work with the same large datasets. Examples of transformations, joins, aggregations and anomaly detection algorithms demonstrated in Roadrunner are also included.
The combination of Deep Learning with Apache Spark has the potential for tremendous impact in many sectors of the industry. This webinar, based on the experience gained in assisting customers with the Databricks Virtual Analytics Platform, will present some best practices for building deep learning pipelines with Spark. Rather than comparing deep learning systems or specific optimizations, this webinar will focus on issues that are common to deep learning frameworks when running on a Spark cluster, including: * optimizing cluster setup; * configuring the cluster; * ingesting data; and * monitoring long-running jobs. We will demonstrate the techniques we cover using Google’s popular TensorFlow library. More specifically, we will cover typical issues users encounter when integrating deep learning libraries with Spark clusters. Clusters can be configured to avoid task conflicts on GPUs and to allow using multiple GPUs per worker. Setting up pipelines for efficient data ingest improves job throughput, and monitoring facilitates both the work of configuration and the stability of deep learning jobs.
There’s a growing number of data scientists who use R as their primary language. While the SparkR API has made tremendous progress since release 1.6, with major advancements in Apache Spark 2.0 and 2.1, it can be difficult for traditional R programmers to embrace the Spark ecosystem. In this session, Zaidi will discuss the sparklyr package, which is a feature-rich and tidy interface for data science with Spark, and will show how it can be coupled with Microsoft R Server and extended with its lower-level API to become a full, first-class citizen of Spark. Learn how easy it is to go from single-threaded, memory-bound R functions to multi-threaded, multi-node, out-of-memory applications that can be deployed in a distributed cluster environment with a minimal amount of code changes. You’ll also get best practices for reproducibility and performance by looking at a real-world case study of default risk classification and prediction entirely through R and Spark.
This document provides an overview of spark-timeseries, an open source time series library for Apache Spark. It discusses the library's design choices around representing multivariate time series data, partitioning time series data for distributed processing, and handling operations like lagging and differencing on irregular time series data. It also presents examples of using the library to test for stationarity, generate lagged features, and perform Holt-Winters forecasting on seasonal passenger data.
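For readers unfamiliar with the two operations named above, here is a minimal plain-Python sketch of lagging and differencing. It is not the spark-timeseries API: the function names are hypothetical, the series is assumed to be regularly sampled (timestamps are ignored, so the irregular case the talk discusses is out of scope), and the sample values are illustrative.

```python
def lag(series, k, fill=None):
    """Shift a series forward by k steps, padding the start with `fill`."""
    if k < 0:
        raise ValueError("k must be non-negative")
    n = len(series)
    return [fill] * min(k, n) + list(series[: max(n - k, 0)])


def difference(series, k=1):
    """Return series[i] - series[i - k] for every i that has a k-step predecessor."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return [series[i] - series[i - k] for i in range(k, len(series))]


passengers = [112, 118, 132, 129]  # illustrative monthly counts

print(lag(passengers, 1))
print(difference(passengers))
```

Lagged copies of a series are a common way to build features for forecasting, and differencing is a standard step toward making a series stationary before testing or modeling.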
Apache Spark MLlib provides scalable implementations of popular machine learning algorithms, which let users train models on big datasets and iterate fast. The existing implementations assume that the number of parameters is small enough to fit in the memory of a single machine. However, many applications, such as ads CTR prediction and deep neural networks, require solving problems with billions of parameters on a huge amount of data. This requirement far exceeds the capacity of existing MLlib algorithms, many of which use L-BFGS as the underlying solver. To fill this gap, we developed Vector-free L-BFGS for MLlib. It can solve optimization problems with billions of parameters in the Spark SQL framework, where the training data are often generated. The algorithm scales very well and enables a variety of MLlib algorithms to handle a massive number of parameters over large datasets. In this talk, we will illustrate the power of Vector-free L-BFGS via logistic regression with a real-world dataset and requirements. We will also discuss how this approach could be applied to other ML algorithms.
Imagine an organization with thousands of users who want to run data analytics workloads. These users shouldn’t have to worry about provisioning instances from a cloud provider, deploying a runtime processing engine, scaling resources based on utilization, or ensuring their data is secure. Nor should the organization’s system administrators. In this talk we will highlight some of the exciting problems we’re working on at Databricks in order to meet the demands of organizations that are analyzing data at scale. In particular, data engineers attending this session will learn how we: manage a typical query lifetime through the Databricks software stack; dynamically allocate resources to satisfy the elastic demands of a single cluster; and isolate the data and the generated state within a large organization with multiple clusters.
This document discusses the architectural pattern of Command Query Responsibility Segregation (CQRS). It summarizes that CQRS separates read (query) and write (command) operations into different models to allow for more scalability and performance. Queries use a read-only data store optimized for reading, while commands express user intentions and are validated before being asynchronously processed to update data. The pattern allows for eventual consistency by keeping query data slightly stale, and improves scalability by allowing separate optimization of queries and commands.
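The separation described above can be sketched in a few lines. This is a toy, in-memory illustration under stated assumptions, not a production design: all class and method names are hypothetical, and the asynchronous update of the read store is simulated by an explicit `apply_pending` call instead of a real message queue.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class DepositMoney:
    """A command: expresses the user's intention to deposit money."""
    account_id: str
    amount: int


class WriteSide:
    """Validates commands, updates the write model, and records change events."""

    def __init__(self):
        self.balances = {}
        self.events = deque()

    def handle(self, command):
        if command.amount <= 0:
            raise ValueError("deposit amount must be positive")
        new_balance = self.balances.get(command.account_id, 0) + command.amount
        self.balances[command.account_id] = new_balance
        self.events.append((command.account_id, new_balance))


class ReadModel:
    """Read-only view that is brought up to date from pending events."""

    def __init__(self):
        self.view = {}

    def apply_pending(self, events):
        while events:
            account_id, balance = events.popleft()
            self.view[account_id] = balance

    def balance_of(self, account_id):
        return self.view.get(account_id, 0)


write_side, read_model = WriteSide(), ReadModel()
write_side.handle(DepositMoney("acc-1", 50))
stale = read_model.balance_of("acc-1")    # read store not yet updated
read_model.apply_pending(write_side.events)
fresh = read_model.balance_of("acc-1")    # consistent after the events are applied
print(stale, fresh)
```

The gap between `stale` and `fresh` is the eventual consistency the pattern accepts in exchange for being able to scale and optimize reads and writes separately.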
This document provides an overview of building a first Apache Apex application. It describes the main concepts of an Apex application including operators that implement interfaces to process streaming data within windows. The document outlines a "Sorted Word Count" application that uses various operators like LineReader, WordReader, WindowWordCount, and FileWordCount. It also demonstrates wiring these operators together in a directed acyclic graph and running the application to process streaming data.
Presenter: Priyanka Gugale, Committer for Apache Apex and Software Engineer at DataTorrent. In this session we will cover an introduction to YARN, the YARN architecture, and the YARN application lifecycle. We will also learn how Apache Apex runs as a YARN application in Hadoop.
How to troubleshoot MySQL.
Windowing in Apache Apex divides unbounded streaming data into finite time slices called windows to allow for computation. It uses time as a reference to break streams into windows, addressing issues like failure recovery and providing frames of reference. Operators can perform window-level processing by implementing callbacks for window start and end. Windows provide rolling statistics by accumulating results over multiple windows and emitting periodically. Windowing has lower latency than micro-batch systems as records are processed immediately rather than waiting for batch boundaries.
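The window callbacks described above can be illustrated with a small sketch. Apex operators are written in Java, where the callbacks are `beginWindow(long windowId)` and `endWindow()`; the Python below only mirrors that shape. The class, the driver function, and the 500 ms window width are assumptions for illustration, and unlike a real engine this driver expects records ordered by timestamp and skips windows that contain no records.

```python
class WindowedSum:
    """Operator-like object that sums the values seen in each window."""

    def __init__(self):
        self.emitted = []
        self._window_id = None
        self._sum = 0

    def begin_window(self, window_id):
        self._window_id = window_id
        self._sum = 0

    def process(self, value):
        # Records are handled immediately, not held until the window closes.
        self._sum += value

    def end_window(self):
        self.emitted.append((self._window_id, self._sum))


def run(operator, timed_records, window_millis=500):
    """Feed (timestamp_ms, value) records to the operator, window by window."""
    current = None
    for timestamp, value in timed_records:
        window_id = timestamp // window_millis
        if window_id != current:
            if current is not None:
                operator.end_window()
            operator.begin_window(window_id)
            current = window_id
        operator.process(value)
    if current is not None:
        operator.end_window()
    return operator.emitted


records = [(0, 1), (120, 2), (510, 5), (990, 1), (1600, 4)]
print(run(WindowedSum(), records))
```

Rolling statistics over several windows could be built on the same callbacks by keeping the last few per-window sums and emitting their total in `end_window`.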
Have you ever looked at a random piece of code and wanted to rewrite it so badly? It’s natural to have legacy code in your application at some point. It’s something that you need to accept and learn to live with. So is this a lost cause? Should we just throw in the towel and give up? Hell no! Over the years, I learned to identify 5 main creators/enablers of legacy code on the engineering side, which I’m sharing here with you using real development stories (with a little humour in the mix). Learn to keep them in line and your code will live longer!