This document summarizes work done by an Intel software team in China to improve Apache Spark performance for real-world applications. It describes the benchmarking and profiling tools the team developed, such as HiBench and HiMeter. It also discusses several case studies where the team worked with customers to optimize joins, manage memory usage, and reduce network bandwidth. The overall goal was to help solve common issues around ease of use, reliability, and scalability for Spark in production environments.
Using Apache Spark to analyze large datasets in the cloud presents a range of challenges. Different stages of your pipeline may be constrained by CPU, memory, disk, and/or network IO. But what if all those stages have to run on the same cluster? In the cloud, you have limited control over the hardware your cluster runs on, and you may have even less control over the size and format of your raw input files. Performance tuning is an iterative and experimental process, and it is frustrating with very large datasets: what worked great with 30 billion rows may not work at all with 400 billion rows. But with strategic optimizations and compromises, 50+ TiB datasets can be no big deal. Using the Spark UI and simple metrics, we explore how to diagnose and remedy issues in jobs, including:
- Sizing the cluster for your dataset (shuffle partitions)
- Ingestion challenges: well begun is half done (globbing S3, small files)
- Managing memory (sorting out GC: when to go parallel, when to go G1, when off-heap can help you)
- Shuffle: give a little to get a lot (configs for better out-of-the-box shuffle), and spill (partitioning for the win)
- Scheduling (FAIR vs. FIFO: is there a difference for your pipeline?)
- Caching and persistence (it's the cost of doing business, so what are your options?)
- Fault tolerance (blacklisting, speculation, task reaping)
- Making the best of a bad deal (skew joins, windowing, UDFs, very large query plans)
- Writing to S3 (dealing with write partitions, HDFS and s3DistCp vs. writing directly to S3)
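As a rough illustration of several settings touched on above (shuffle sizing, speculation, blacklisting, G1 and off-heap memory, FAIR scheduling), here is a minimal PySpark sketch; the specific values are placeholders, not recommendations, and should be tuned against your own dataset and cluster.

```python
from pyspark.sql import SparkSession

# Illustrative values only; every one of these depends on data volume and cluster size.
spark = (
    SparkSession.builder
    .appName("tuning-sketch")
    # Size shuffles to the data instead of the default 200 partitions.
    .config("spark.sql.shuffle.partitions", "4000")
    # Fault tolerance: retry straggling tasks and avoid repeatedly bad nodes.
    .config("spark.speculation", "true")
    .config("spark.blacklist.enabled", "true")
    # Memory: G1 GC for large executor heaps, optional off-heap storage.
    .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC")
    .config("spark.memory.offHeap.enabled", "true")
    .config("spark.memory.offHeap.size", "4g")
    # Scheduling: FAIR lets concurrent jobs within the application share resources.
    .config("spark.scheduler.mode", "FAIR")
    .getOrCreate()
)
```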
Come explore a feature we've created that is not supported out of the box: the ability to add or remove nodes from always-on, real-time Spark Streaming jobs. Elastic Spark Streaming jobs can automatically adjust to the demands of traffic or volume. Using a set of configurable utility classes, these jobs scale down when lulls are detected and scale up when load is too high. We process multiple TBs per day with billions of events, and our traffic pattern has natural peaks and valleys with the occasional sustained, unexpected spike. Elastic jobs have freed us from manual intervention, given back developer time, and made a large financial impact through maximized resource utilization.
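The utility classes described in this talk are not public; the sketch below only illustrates the kind of scale-up/scale-down decision they make, comparing recent batch processing times against the batch interval. The request/release step itself would go through whatever resource-manager API your cluster exposes and is left as a hypothetical.

```python
# Hypothetical scaling decision for an elastic streaming job: +1 means request
# more executors, -1 means release some, 0 means hold steady.
def decide_scaling(batch_interval_s, recent_processing_times_s,
                   scale_up_ratio=0.9, scale_down_ratio=0.4):
    avg = sum(recent_processing_times_s) / len(recent_processing_times_s)
    utilization = avg / batch_interval_s
    if utilization > scale_up_ratio:      # batches are close to falling behind
        return +1
    if utilization < scale_down_ratio:    # sustained lull, over-provisioned
        return -1
    return 0

# Example: 60-second batches that have been taking ~55 seconds to process.
print(decide_scaling(60, [54, 56, 55]))  # -> +1, scale up
```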
Blagoy Kaloferov presented on building a data warehouse at Edmunds.com using Spark SQL. He discussed how Spark SQL simplified ETL and enabled business analysts to build data marts more quickly. He showed how Spark SQL was used to optimize a dealer leads dataset in Platfora, reducing build time from hours to minutes. Finally, he proposed an approach using Spark SQL to automate OEM ad revenue billing by modeling complex rules through collaboration between analysts and developers.
The document discusses the goals of establishing a new research lab called RISELab to develop a secure real-time decision stack that can enable real-time decisions on live data with strong security guarantees. It outlines some of the challenges in building such a system and presents early work on Drizzle, a low-latency streaming engine, and Opaque, which leverages hardware enclaves to provide encryption and hide data access patterns. The goal is to build an open source platform and tools over the next 5 years to enable applications requiring sophisticated, accurate, and robust real-time decisions on private data.
This document provides an overview of SK Telecom's use of big data analytics and Spark. Some key points:
- SKT collects around 250 TB of data per day, which is stored and analyzed using a Hadoop cluster of over 1,400 nodes.
- Spark is used for both batch and real-time processing due to its performance benefits over other frameworks. Two main use cases are described: real-time network analytics and a network enterprise data warehouse (DW) built on Spark SQL.
- The network DW consolidates data from over 130 legacy databases to enable thorough analysis of the entire network. Spark SQL, dynamic resource allocation in YARN, and integration with BI tools help meet requirements for timely processing and quick query responses.
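For reference, dynamic resource allocation on YARN of the kind mentioned above is enabled through standard Spark settings such as the following; the executor counts and the `network_events` table are illustrative assumptions, not SKT's actual configuration.

```python
from pyspark.sql import SparkSession

# Illustrative dynamic-allocation settings; the external shuffle service must
# also be running on the YARN NodeManagers for executors to be released safely.
spark = (
    SparkSession.builder
    .appName("network-dw")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "10")
    .config("spark.dynamicAllocation.maxExecutors", "200")
    .config("spark.shuffle.service.enabled", "true")
    .getOrCreate()
)

# BI tools typically reach the same tables through the Spark Thrift server over
# JDBC/ODBC; from code the equivalent is plain Spark SQL (hypothetical table).
spark.sql("SELECT cell_id, count(*) FROM network_events GROUP BY cell_id").show()
```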
Unbounded, unordered, global scale datasets are increasingly common in day-to-day business, and consumers of these datasets have detailed requirements for latency, cost, and completeness. Apache Beam defines a new data processing programming model that evolved from more than a decade of experience building Big Data infrastructure within Google, including MapReduce, FlumeJava, Millwheel, and Cloud Dataflow. Apache Beam handles both batch and streaming use cases, offering a powerful, unified model. It neatly separates properties of the data from run-time characteristics, allowing pipelines to be portable across multiple run-time environments, both open source, including Apache Apex, Apache Flink, Apache Gearpump, Apache Spark, and proprietary. Finally, Beam's model enables newer optimizations, like dynamic work rebalancing and autoscaling, resulting in an efficient execution. This talk will cover the basics of Apache Beam, touch on its evolution, and describe main concepts in its powerful programming model. We'll show how Beam unifies batch and streaming use cases, and show efficient execution in real-world scenarios. Finally, we'll demonstrate pipeline portability across Apache Apex, Apache Flink, Apache Spark and Google Cloud Dataflow in a live setting.
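To make the unified model concrete, here is a minimal word-count pipeline in the Beam Python SDK; the same transforms apply whether the source is bounded (batch) or unbounded (streaming), with only the source and runner options changing. The bounded in-memory source here is purely for illustration.

```python
import apache_beam as beam

# Runs on the local DirectRunner by default; the same pipeline code can be
# submitted to Flink, Spark, Dataflow, etc. by changing pipeline options.
with beam.Pipeline() as p:
    (p
     | "Create" >> beam.Create(["alpha", "beta", "alpha", "gamma"])
     | "PairWithOne" >> beam.Map(lambda word: (word, 1))
     | "CountPerKey" >> beam.CombinePerKey(sum)
     | "Print" >> beam.Map(print))
```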
This document discusses Apache Ignite and how it can be used with Apache Spark for fast data applications. It provides an overview of Ignite's in-memory data fabric capabilities, how it compares to Spark, and how Ignite can be integrated with Spark to provide shared resilient storage and distributed computing. Examples are given of reading and writing data between Ignite and Spark and using Ignite's in-memory file system and SQL support from Spark.
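As a rough sketch of the Ignite/Spark data exchange mentioned above, the Ignite DataFrame integration can be driven from PySpark as below; this assumes the ignite-spark dependency is on the classpath and that the configuration file and PERSON table exist in your setup, so treat the option names and values as assumptions to verify against the Ignite documentation.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ignite-example").getOrCreate()

# Read an Ignite SQL table into a Spark DataFrame (paths/tables are hypothetical).
persons = (spark.read
           .format("ignite")
           .option("config", "example-ignite.xml")  # Ignite node configuration file
           .option("table", "PERSON")
           .load())

# Query the shared data with plain Spark SQL.
persons.createOrReplaceTempView("person")
spark.sql("SELECT NAME, AGE FROM person WHERE AGE > 30").show()
```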
This document discusses patterns for modern data integration using streaming data. It outlines an evolution from data warehouses to data lakes to streaming data. It then describes four key patterns: 1) Stream all things (data) in one place, 2) Keep schemas compatible and process data on, 3) Enable ridiculously parallel single message transformations, and 4) Perform streaming data enrichment to add additional context to events. Examples are provided of using Apache Kafka and Kafka Connect to implement these patterns for a large hotel chain integrating various data sources and performing real-time analytics on customer events.
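A small Python sketch of the single-message-transformation and enrichment patterns described above, using the kafka-python client; the broker address, topic names, and the in-memory lookup table are illustrative assumptions rather than details from the talk.

```python
import json
from kafka import KafkaConsumer, KafkaProducer

# Enrichment source: reference data keyed by hotel id (hypothetical).
hotel_by_id = {"42": {"name": "Grand Plaza", "city": "Vienna"}}

consumer = KafkaConsumer("raw-bookings", bootstrap_servers="localhost:9092",
                         value_deserializer=lambda b: json.loads(b.decode("utf-8")))
producer = KafkaProducer(bootstrap_servers="localhost:9092",
                         value_serializer=lambda d: json.dumps(d).encode("utf-8"))

for message in consumer:
    event = message.value
    # Stateless per-message transformation: each record is handled independently,
    # so this step parallelizes trivially across partitions and consumers.
    event["amount_eur"] = round(event.get("amount_usd", 0) * 0.92, 2)
    # Streaming enrichment: attach additional context to the event.
    event["hotel"] = hotel_by_id.get(event.get("hotel_id"), {})
    producer.send("enriched-bookings", event)
```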
This document discusses using Spark Streaming for IoT applications and the challenges involved. It notes that while Spark simplifies programming across different processing intervals from batch to stream, programming models alone are not sufficient as IoT data streams can have varying rates and delays. It proposes a unified data infrastructure with abstractions like data series that support joining real-time and historical data while handling delays transparently. It also suggests approaches for Spark Streaming to better support processing many independent low-volume IoT streams concurrently and improving resource utilization for such applications. Finally, it introduces the Device-Model-Infra framework for addressing these IoT analytics challenges through combined programming models and data abstractions.
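The data-series abstraction and Device-Model-Infra framework described in this talk are not open source; as a rough analogue of one idea it covers, joining a live IoT stream with historical data, here is a plain Structured Streaming stream-static join. The Kafka topic, schema, paths, and threshold are assumptions, and the Kafka source requires the spark-sql-kafka package on the classpath.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("iot-join").getOrCreate()

schema = (StructType()
          .add("device_id", StringType())
          .add("temperature", DoubleType())
          .add("event_time", TimestampType()))

# Real-time readings from many independent devices (hypothetical topic).
live = (spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("subscribe", "sensor-readings")
        .load()
        .select(from_json(col("value").cast("string"), schema).alias("r"))
        .select("r.*"))

# Historical per-device baselines loaded as static reference data.
history = spark.read.parquet("s3://bucket/device-baselines/")

# Stream-static join: flag readings well above the device's historical baseline.
alerts = (live.join(history, "device_id")
          .where(col("temperature") > col("baseline_temperature") * 1.2))

alerts.writeStream.format("console").start().awaitTermination()
```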
A look at our experience running real-time analyses on a never-ending stream of user events. We discuss the Lambda architecture, the Kappa architecture, Apache Kafka, and our own approach.
This document provides an overview of Spark NLP, an open-source library for natural language processing (NLP). It introduces Spark NLP and discusses its state-of-the-art accuracy on NLP tasks like named entity recognition and text classification. It also covers Spark NLP's speed, scalability, and ease of use. Examples are given of training NLP models with Spark NLP for tasks like part-of-speech tagging, named entity recognition, and text classification.
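A small example of the ease-of-use angle, using the Spark NLP Python API; the pretrained pipeline name `explain_document_dl` is a commonly published one and is used here as an assumption rather than something taken from this document.

```python
import sparknlp
from sparknlp.pretrained import PretrainedPipeline

# Starts a SparkSession with the Spark NLP jar attached.
spark = sparknlp.start()

# Download a pretrained pipeline and annotate a sentence.
pipeline = PretrainedPipeline("explain_document_dl", lang="en")
result = pipeline.annotate("Spark NLP ships production-grade NER and classifiers.")

print(result["entities"])  # named entities from the pretrained NER model
print(result["pos"])       # part-of-speech tags
```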
Hopsworks is an open-source data platform that can be used to both develop and operate horizontally scalable machine learning pipelines. A key part of our pipelines is the world's first open-source Feature Store, based on Apache Hive, that acts as a data warehouse for features, providing a natural API between data engineers, who write feature engineering code in Spark (in Scala or Python), and data scientists, who select features from the feature store to generate training/test data for models. In this talk, we will discuss how Databricks Delta solves several of the key challenges in building both feature engineering pipelines that feed our Feature Store and in managing the feature data itself. Firstly, we will show how expectations and schema enforcement in Databricks Delta can be used to provide data validation, ensuring that feature data does not have missing or invalid values that could negatively affect model training. Secondly, time travel in Databricks Delta can be used to provide version management and experiment reproducibility for training/test datasets. That is, given a model, you can re-run the training experiment for that model using the same version of the data that was used to train the model. We will also discuss the steps needed to take this work further. Finally, we will perform a live demo, showing how Delta can be used in end-to-end ML pipelines using Spark on Hopsworks.
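A minimal sketch of the two Delta capabilities discussed, schema enforcement on write and time travel on read; the feature-store path, source data, and version number are illustrative assumptions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("feature-store-delta").getOrCreate()

features = spark.read.parquet("s3://bucket/raw-features/")  # hypothetical source

# Schema enforcement: appends that do not match the Delta table's schema are
# rejected, which is what catches missing or mistyped feature columns early.
(features.write
 .format("delta")
 .mode("append")
 .save("/feature_store/customer_features"))

# Time travel: reproduce a training run against the exact data version that was
# used when the model was trained.
training_v5 = (spark.read.format("delta")
               .option("versionAsOf", 5)
               .load("/feature_store/customer_features"))
```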
Over the past few years, Python has become the default language for data scientists. Packages such as pandas, numpy, statsmodel, and scikit-learn have gained great adoption and become the mainstream toolkits. At the same time, Apache Spark has become the de facto standard in processing big data. Spark ships with a Python interface, aka PySpark; however, because Spark's runtime is implemented on top of the JVM, using PySpark with native Python libraries sometimes results in poor performance and usability. In this talk, we introduce a new type of PySpark UDF designed to solve this problem: the Vectorized UDF. Vectorized UDFs are built on top of Apache Arrow and bring you the best of both worlds, the ability to define easy-to-use, high-performance UDFs and scale up your analysis with Spark.
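A minimal vectorized (pandas) UDF in the form the current PySpark API exposes; Arrow moves data between the JVM and Python in batches, and the function operates on whole pandas Series rather than one row at a time. The column and example values are arbitrary.

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.appName("pandas-udf-demo").getOrCreate()

@pandas_udf("double")
def celsius_to_fahrenheit(c: pd.Series) -> pd.Series:
    # Executed once per Arrow batch, vectorized via pandas/numpy.
    return c * 9.0 / 5.0 + 32.0

df = spark.createDataFrame([(0.0,), (21.5,), (100.0,)], ["celsius"])
df.select(celsius_to_fahrenheit("celsius").alias("fahrenheit")).show()
```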
This document summarizes a talk given by Stephan Kessler at the Spark Summit Europe 2016 about integrating business functionality and specialized engines into Apache Spark using SAP HANA Vora. Key points discussed include using currency conversion and time series query capabilities directly in Spark by pushing computations to the relevant data sources via Spark extensions. SAP HANA Vora allows moving parts of the Spark logical query plan to various data sources like HANA, graph and document stores to perform analysis close to the data.
At the end of day, the only thing that data scientists want is tabular data for their analysis. They do not want to spend hours or days preparing data. How does a data engineer handle the massive amount of data that is being streamed at them from IoT devices and apps, and at the same time add structure to it so that data scientists can focus on finding insights and not preparing data? By the way, you need to do this within minutes (sometimes seconds). Oh… and there are a lot of other data sources that you need to ingest, and the current providers of data are changing their structure. GoPro has massive amounts of heterogeneous data being streamed from their consumer devices and applications, and they have developed the concept of “dynamic DDL” to structure their streamed data on the fly using Spark Streaming, Kafka, HBase, Hive and S3. The idea is simple: Add structure (schema) to the data as soon as possible; allow the providers of the data to dictate the structure; and automatically create event-based and state-based tables (DDL) for all data sources to allow data scientists to access the data via their lingua franca, SQL, within minutes.
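GoPro's dynamic DDL implementation is not public; the sketch below only illustrates the underlying idea in plain PySpark, letting the incoming JSON dictate the schema and then issuing the table DDL automatically so the data is queryable in SQL within minutes. The paths, topic of data, table names, and columns are assumptions.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("dynamic-ddl-sketch")
         .enableHiveSupport()          # register tables in the Hive metastore
         .getOrCreate())

# Let Spark infer the schema from a batch of landed events (hypothetical path);
# the data providers effectively dictate the structure.
sample = spark.read.json("s3://bucket/landing/camera-events/dt=2017-06-01/")
sample.printSchema()

# saveAsTable issues the DDL for us, so analysts can immediately query the data
# through their lingua franca, SQL.
(sample.write
 .mode("append")
 .format("parquet")
 .saveAsTable("events.camera_events"))

spark.sql(
    "SELECT event_type, count(*) FROM events.camera_events GROUP BY event_type"
).show()
```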