This document provides an introduction to Structured Streaming in Apache Spark. It discusses the evolution of stream processing, drawbacks of the DStream API, and advantages of Structured Streaming. Key points include: Structured Streaming models streams as infinite tables/datasets, allowing stream transformations to be expressed using SQL and Dataset APIs; it supports features like event time processing, state management, and checkpointing for fault tolerance; and it allows stream processing to be combined more easily with batch processing using the common Dataset abstraction. The document also provides examples of reading from and writing to different streaming sources and sinks using Structured Streaming.
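To make the infinite-table idea concrete, here is a minimal sketch of reading a stream, transforming it with the usual Dataset operators, and writing to a sink; the socket source, console sink, and host/port are illustrative assumptions, not examples taken from the deck.

```scala
import org.apache.spark.sql.SparkSession

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("structured-streaming-intro")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // The stream is treated as an unbounded table of lines.
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // The same Dataset operators used for batch queries work on the streaming DataFrame.
    val counts = lines.as[String]
      .flatMap(_.split(" "))
      .groupBy("value")
      .count()

    // Write the continuously updated result to the console sink.
    val query = counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```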
This document provides an introduction to concurrent programming with Akka Actors. It discusses concurrency and parallelism, how the end of Moore's Law necessitated a shift to concurrent programming, and introduces key concepts of actors including message passing concurrency, actor systems, actor operations like create and send, routing, supervision, configuration and testing. Remote actors are also discussed. Examples are provided in Scala.
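As a rough illustration of the create and send operations mentioned, here is a minimal classic Akka actor sketch in Scala; the actor name, message type, and greeting logic are invented purely for illustration.

```scala
import akka.actor.{Actor, ActorSystem, Props}

// A message is just an immutable value passed between actors.
case class Greet(name: String)

// The actor owns its behavior and communicates only through messages.
class Greeter extends Actor {
  def receive: Receive = {
    case Greet(name) => println(s"Hello, $name")
  }
}

object Main extends App {
  val system = ActorSystem("demo")                        // create the actor system
  val greeter = system.actorOf(Props[Greeter], "greeter") // create an actor
  greeter ! Greet("Akka")                                 // send a message (fire-and-forget)
  Thread.sleep(500)                                       // crude wait so the message is handled
  system.terminate()
}
```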
This document discusses productionizing a financial analytics application built on Spark over multiple iterations. It started with data scientists prototyping in Python and data engineers porting the code to Scala RDDs; the team then moved to DataFrames and deployed on EMR. Issues with code quality and testing led to adding ScalaTest, PR reviews, and daily Jenkins builds, while architectural challenges were addressed by moving to Databricks Cloud, which provided notebooks, jobs, and throwaway clusters. Future work includes using Spark SQL windows and the Dataset API for stronger typing and schema support. Each iteration improved the code, testing, deployment, and adoption of the latest Spark features.
This document discusses strategies for building interactive streaming applications with Spark Streaming. It describes using ZooKeeper as a dynamic configuration source so that a Spark Streaming application's behavior can be modified at runtime. The key points are:
- ZooKeeper can track configuration changes and trigger Spark Streaming context restarts through its watch mechanism and the Curator library.
- This allows building interactive applications that adapt to configuration updates without restarting the whole streaming job.
- Examples are provided of using Curator caches, such as node and path caches, to monitor ZooKeeper for changes and restart Spark Streaming contexts in response, as sketched below.
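A hedged sketch of the Curator watch pattern described above; the ZooKeeper connection string, the config znode path, and what happens on a change are assumptions for illustration, not code from the talk.

```scala
import org.apache.curator.framework.CuratorFrameworkFactory
import org.apache.curator.framework.recipes.cache.{NodeCache, NodeCacheListener}
import org.apache.curator.retry.ExponentialBackoffRetry

object ConfigWatcher {
  def main(args: Array[String]): Unit = {
    // Connect to ZooKeeper with a retry policy.
    val client = CuratorFrameworkFactory.newClient(
      "localhost:2181", new ExponentialBackoffRetry(1000, 3))
    client.start()

    val configPath = "/app/streaming-config"   // hypothetical config znode
    val cache = new NodeCache(client, configPath)

    // The listener fires whenever the znode's data changes.
    cache.getListenable.addListener(new NodeCacheListener {
      override def nodeChanged(): Unit = {
        Option(cache.getCurrentData).foreach { data =>
          println(s"Config changed: ${new String(data.getData)}")
          // Here an application could stop the running StreamingContext
          // and start a new one built from the updated configuration.
        }
      }
    })
    cache.start()
  }
}
```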
This document summarizes the key challenges and solutions in building a real-time data pipeline that ingests data from a database, transforms it using Spark Streaming, and publishes the output to Salesforce. The pipeline aims for a latency of one minute with zero data loss and ordering guarantees. Challenges discussed include handling out-of-sequence and late-arriving events, schema evolution, bootstrap loading, data loss and corruption, and diagnosing issues. The proposed solutions use Kafka, checkpointing, replay capabilities, and careful broker and connector setups to meet the pipeline's reliability requirements.
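As a sketch of the ingredients mentioned (Kafka ingestion, checkpointing, recovery and replay), the following shows a Kafka-backed DStream with a checkpoint directory so the job can recover after a failure; the topic, broker address, and checkpoint path are illustrative assumptions.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

object Pipeline {
  def main(args: Array[String]): Unit = {
    val checkpointDir = "hdfs:///checkpoints/pipeline"

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("db-to-salesforce-pipeline")
      val ssc = new StreamingContext(conf, Seconds(60))   // one-minute micro-batches
      ssc.checkpoint(checkpointDir)

      val kafkaParams = Map[String, Object](
        "bootstrap.servers" -> "broker:9092",
        "key.deserializer" -> classOf[StringDeserializer],
        "value.deserializer" -> classOf[StringDeserializer],
        "group.id" -> "pipeline",
        "auto.offset.reset" -> "earliest")

      val stream = KafkaUtils.createDirectStream[String, String](
        ssc,
        LocationStrategies.PreferConsistent,
        ConsumerStrategies.Subscribe[String, String](Seq("changes"), kafkaParams))

      stream.map(record => record.value).foreachRDD { rdd =>
        // transform and publish each micro-batch downstream (e.g. to Salesforce)
      }
      ssc
    }

    // Recover from the checkpoint if one exists, otherwise build a fresh context.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
```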
Continuation of http://www.slideshare.net/datamantra/building-distributed-systems-from-scratch-part-1
This document discusses state management in Apache Spark Structured Streaming. It begins by introducing Structured Streaming and differentiating between stateless and stateful stream processing. It then explains the need for state stores to manage intermediate data in stateful processing. It describes how state was managed inefficiently in old Spark Streaming using RDDs and snapshots, and how Structured Streaming improved on this with its decoupled, asynchronous, and incremental state persistence approach. The document outlines Apache Spark's implementation of storing state to HDFS and the involved code entities. It closes by discussing potential issues with this approach and how embedded stores like RocksDB may help address them in production stream processing systems.
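To illustrate where the state store comes in, here is a minimal stateful query: the running counts are the intermediate state that Structured Streaming persists incrementally under the checkpoint location (HDFS-backed by default). The socket source, console sink, and paths are assumptions for illustration.

```scala
// spark-shell / notebook style snippet
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("stateful-counts").getOrCreate()
import spark.implicits._

val events = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()
  .as[String]

// A streaming aggregation is stateful: Spark keeps a running count per key.
val counts = events.groupBy("value").count()

counts.writeStream
  .outputMode("update")
  .format("console")
  .option("checkpointLocation", "hdfs:///checkpoints/state-demo") // offsets + state land here
  .start()
  .awaitTermination()
```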
Managing workflows with the Azkaban scheduler, which can be used for batch as well as interactive workloads.
Implicit parameters and implicit conversions are Scala language features that let the compiler supply arguments and conversions that would otherwise have to be written explicitly. Implicits enable concise and elegant code through patterns like dependency injection, context passing, and ad hoc polymorphism, and implicit resolution happens at compile time rather than at runtime. While powerful, implicits can cause conflicts and slow compilation if overused. Frameworks like the Scala collections library, Spark, and Spray JSON use implicits extensively to provide type classes and conversions between Scala and Java types.
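A small sketch of both features; the names RetryConfig and RichInt are invented purely for illustration.

```scala
object ImplicitsDemo extends App {
  // 1. Implicit parameter: a context value the compiler supplies for us.
  case class RetryConfig(maxRetries: Int)

  def fetch(url: String)(implicit config: RetryConfig): String =
    s"fetching $url with up to ${config.maxRetries} retries"

  implicit val defaultConfig: RetryConfig = RetryConfig(3)
  println(fetch("http://example.com"))   // compiler fills in defaultConfig

  // 2. Implicit conversion via an implicit class: add a method to Int.
  implicit class RichInt(n: Int) {
    def squared: Int = n * n
  }
  println(5.squared)                     // resolved at compile time, not at runtime
}
```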
This document discusses building a real-time streaming application on Spark to analyze sensor data. It describes collecting data from servers through Flume into Kafka and processing it with Spark Streaming to generate analytics stored in Cassandra. The application evolves in stages, using files, then Kafka, and finally Cassandra for input/output. Testing streaming applications and redesigning them for testability are also covered.
Spark can run on Kubernetes in two ways: as a static cluster or with native integration. As a static cluster, Spark pods are deployed manually without autoscaling. Native integration treats Kubernetes as a resource manager, allowing Spark to dynamically acquire and release containers as it does on YARN. It uses Kubernetes custom controllers to create driver pods that then launch executor pods, providing autoscaling of resources based on job demands.
This document provides an introduction and overview of Spark's Dataset API. It discusses how Dataset combines the best aspects of RDDs and DataFrames into a single API, providing strongly typed transformations on structured data. The document also covers how Dataset moves Spark away from RDDs towards a more SQL-like programming model and optimized data handling. Key topics include the Spark Session entry point, differences between DataFrames and Datasets, and examples of Dataset operations like word count.
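A minimal word count in the typed Dataset API, roughly the kind of example the document refers to; the input path is an assumption.

```scala
import org.apache.spark.sql.SparkSession

object DatasetWordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()          // SparkSession is the single entry point
      .appName("dataset-wordcount")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val lines = spark.read.textFile("input.txt")   // Dataset[String]
    val counts = lines
      .flatMap(_.split("\\s+"))                    // still strongly typed
      .groupByKey(identity)
      .count()                                     // Dataset[(String, Long)]

    counts.show()
    spark.stop()
  }
}
```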
Multi Source Data Analysis Using Apache Spark and Tellius. This document discusses analyzing data from multiple sources using Apache Spark and the Tellius platform. It covers loading data from different sources like databases and files into Spark DataFrames, defining a data model by joining the sources, and performing analysis like calculating revenues by department across sources. It also discusses challenges like double counting values when directly querying the joined data. The Tellius platform addresses this by implementing a custom query layer on top of Spark SQL to enable accurate multi-source analysis.
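A hypothetical illustration of the double-counting issue mentioned above: when a one-row-per-department source is joined to a many-rows-per-department source, a department-level measure summed over the joined rows is inflated. The column names and figures are invented.

```scala
// spark-shell / notebook style snippet
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder().appName("double-counting").master("local[*]").getOrCreate()
import spark.implicits._

val departments = Seq(("d1", 1000.0), ("d2", 500.0)).toDF("dept_id", "budget")
val orders = Seq(("o1", "d1", 10.0), ("o2", "d1", 20.0)).toDF("order_id", "dept_id", "revenue")

val joined = departments.join(orders, "dept_id")

// d1's budget appears once per matching order, so the naive sum is inflated.
joined.agg(sum("budget")).show()        // 2000.0 over the joined rows
// Aggregating each source before (or instead of) joining avoids the inflation.
departments.agg(sum("budget")).show()   // 1500.0, the true total
```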
This document introduces Spark 2.0 and its key features, including the Dataset abstraction, Spark Session API, moving from RDDs to Datasets, Dataset and DataFrame APIs, handling time windows, and adding custom optimizations. The major focus of Spark 2.0 is standardizing on the Dataset abstraction and improving performance by an order of magnitude. Datasets provide a strongly typed API that combines the best of RDDs and DataFrames.
Meetup talk about building distributed systems like Spark from scratch on top of frameworks like Apache Mesos and YARN
The document introduces the Dataset API in Spark, which provides type safety and performance benefits over DataFrames. Datasets allow operating on domain objects using compiled functions rather than Rows, and encoders efficiently serialize those objects to and from the JVM. This enables compile-time checking of operations and lets domain objects be retained across distributed operations. The document outlines the history of Spark's APIs, the limitations of DataFrames, and how Datasets address them through compiled encoders and case classes rather than Rows.
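A short sketch of the case-class-plus-encoder style described above; Person and the sample data are invented for illustration.

```scala
import org.apache.spark.sql.SparkSession

// The case class gives the Dataset a schema and an encoder.
case class Person(name: String, age: Int)

object TypedExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("typed").master("local[*]").getOrCreate()
    import spark.implicits._   // brings Encoder[Person] derivation into scope

    val people = Seq(Person("Ann", 34), Person("Bob", 19)).toDS()

    // filter takes a compiled function on Person; a typo like _.agee fails at compile time
    val adults = people.filter(_.age >= 21)
    adults.show()
  }
}
```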
In this presentation, we discuss the internals of Spark's DataFrame API. All the code discussed in this presentation is available at https://github.com/phatak-dev/anatomy_of_spark_dataframe_api
Building a REST-based generic big data tool for solving business problems across verticals using Apache Spark.
Last year, in Apache Spark 2.0, Databricks introduced Structured Streaming, a new stream processing engine built on Spark SQL, which revolutionized how developers write stream processing applications. Structured Streaming enables users to express their computations the same way they would express a batch query on static data. Developers can express queries using powerful high-level APIs including DataFrames, Datasets, and SQL. The Spark SQL engine then converts these batch-like transformations into an incremental execution plan that can process streaming data, while automatically handling late, out-of-order data and ensuring end-to-end exactly-once fault-tolerance guarantees. Since Spark 2.0, Databricks has been hard at work building first-class integration with Kafka. With this new connectivity, performing complex, low-latency analytics is now as easy as writing a standard SQL query. This functionality, in addition to the existing connectivity of Spark SQL, makes it easy to analyze data using one unified framework. Users can now seamlessly extract insights from data, independent of whether it is coming from messy, unstructured files, a structured, columnar historical data warehouse, or arriving in real time from Kafka or Kinesis. In this session, Das will walk through a concrete example where, in less than 10 lines, you read Kafka, parse JSON payload data into separate columns, transform it, enrich it by joining with static data, and write it out as a table ready for batch and ad-hoc queries on up-to-the-last-minute data. He will use techniques including event-time-based aggregations, arbitrary stateful operations, and automatic state management using event-time watermarks.
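A hedged sketch of the pattern the session describes (read Kafka, parse the JSON payload into columns, enrich by joining with static data, aggregate on event time with a watermark, and write out a table); the topic name, schema, and paths are assumptions rather than the speaker's actual code.

```scala
// spark-shell / notebook style snippet
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json, window}
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("kafka-etl").getOrCreate()
import spark.implicits._

val schema = new StructType()
  .add("device", StringType)
  .add("value", DoubleType)
  .add("eventTime", TimestampType)

// Read the Kafka topic as a streaming DataFrame.
val raw = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()

// Parse the JSON payload into separate columns.
val parsed = raw
  .select(from_json(col("value").cast("string"), schema).as("data"))
  .select("data.*")

// Enrich with a static lookup table, then aggregate on event time with a watermark.
val devices = spark.read.parquet("/tables/devices")
val enriched = parsed.join(devices, "device")

val counts = enriched
  .withWatermark("eventTime", "10 minutes")
  .groupBy(window(col("eventTime"), "5 minutes"), col("device"))
  .count()

// Write out a table that stays ready for batch and ad-hoc queries.
counts.writeStream
  .format("parquet")
  .option("path", "/tables/event_counts")
  .option("checkpointLocation", "/checkpoints/event_counts")
  .outputMode("append")
  .start()
```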
Topics covered:
- What is Structured Streaming?
- Programming Model
- Windowing
- Delayed events
- Watermarking
- Watermarking and output mode
By Yaswanth Yadlapalli
This document provides an overview of Structured Streaming in Apache Spark. It begins with a brief history of streaming in Spark and outlines some of the limitations of the previous DStream API. It then introduces the new Structured Streaming API, which allows for continuous queries to be expressed as standard Spark SQL queries against continuously arriving data. It describes the new processing model and how queries are executed incrementally. It also covers features like event-time processing, windows, joins, and fault-tolerance guarantees through checkpointing and write-ahead logging. Overall, the document presents Structured Streaming as providing a simpler way to perform streaming analytics by allowing streaming queries to be expressed using the same APIs as batch queries.
Structured Streaming in Apache Spark allows users to write streaming applications as batch-style queries on static or streaming data sources. It treats streams as continuous unbounded tables and allows batch queries written on DataFrames/Datasets to be automatically converted into incremental execution plans to process streaming data in micro-batches. This provides a simple yet powerful API for building robust stream processing applications with end-to-end fault tolerance guarantees and integration with various data sources and sinks.
Tathagata 'TD' Das presented at Bay Area Apache Spark Meetup. This talk covers the merits and motivations of Structured Streaming, and how you can start writing end-to-end continuous applications using Structured Streaming APIs.
This document summarizes a presentation about building a structured streaming connector for continuous applications using Azure Event Hubs as the streaming data source. It discusses key design considerations like representing offsets, implementing the getOffset and getBatch methods required by structured streaming sources, and challenges with testing asynchronous behavior. It also outlines issues contributed back to the Apache Spark community around streaming checkpoints and recovery.
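For orientation, here is a skeleton of the two methods the summary calls out, written against Spark's (internal) v1 Source API; EventHubsOffset and the fetch/latest-sequence-number helpers are placeholders, not the real connector code.

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.execution.streaming.{Offset, Source}
import org.apache.spark.sql.types.{LongType, StringType, StructType}

// A custom offset type representing a position in the Event Hubs stream (placeholder).
case class EventHubsOffset(sequenceNumber: Long) extends Offset {
  override def json: String = s"""{"sequenceNumber":$sequenceNumber}"""
}

class EventHubsSource(sqlContext: SQLContext) extends Source {
  override def schema: StructType =
    new StructType().add("body", StringType).add("sequenceNumber", LongType)

  // Report the highest offset currently available, or None if nothing has arrived yet.
  override def getOffset: Option[Offset] = {
    val latest = latestSequenceNumber()
    if (latest < 0) None else Some(EventHubsOffset(latest))
  }

  // Return the data in (start, end] as a DataFrame for the next micro-batch.
  override def getBatch(start: Option[Offset], end: Offset): DataFrame = {
    val from = start.map(_.asInstanceOf[EventHubsOffset].sequenceNumber).getOrElse(-1L)
    val to = end.asInstanceOf[EventHubsOffset].sequenceNumber
    val rows = fetchEvents(from, to)
    sqlContext.createDataFrame(rows).toDF("body", "sequenceNumber")
  }

  override def stop(): Unit = {}

  // Stubs standing in for calls to the Event Hubs service.
  private def latestSequenceNumber(): Long = -1L
  private def fetchEvents(from: Long, to: Long): Seq[(String, Long)] = Seq.empty
}
```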
Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine.