SlideShare a Scribd company logo
Introduction to Structured
Streaming
Next Generation Streaming API for Spark
https://github.com/phatak-dev/spark2.0-examples/tree/master/src/main/scala/co
m/madhukaraphatak/examples/sparktwo/streaming
● Madhukara Phatak
● Big data consultant and
trainer at datamantra.io
● Consult in Hadoop, Spark
and Scala
● www.madhukaraphatak.com
Agenda
● Evolution in Stream Processing
● Drawbacks of DStream API
● Introduction to Structured Streaming
● Understanding Source and Sinks
● Stateful stream applications
● Handling State recovery
● Joins
● Window API
Evolution of Stream Processing

Recommended for you

Introduction to concurrent programming with Akka actors
Introduction to concurrent programming with Akka actorsIntroduction to concurrent programming with Akka actors
Introduction to concurrent programming with Akka actors

This document provides an introduction to concurrent programming with Akka Actors. It discusses concurrency and parallelism, how the end of Moore's Law necessitated a shift to concurrent programming, and introduces key concepts of actors including message passing concurrency, actor systems, actor operations like create and send, routing, supervision, configuration and testing. Remote actors are also discussed. Examples are provided in Scala.

concurrent programmingshashankakka
Productionalizing a spark application
Productionalizing a spark applicationProductionalizing a spark application
Productionalizing a spark application

1. The document discusses the process of productionalizing a financial analytics application built on Spark over multiple iterations. It started with data scientists using Python and data engineers porting code to Scala RDDs. They then moved to using DataFrames and deployed on EMR. 2. Issues with code quality and testing led to adding ScalaTest, PR reviews, and daily Jenkins builds. Architectural challenges were addressed by moving to Databricks Cloud which provided notebooks, jobs, and throwaway clusters. 3. Future work includes using Spark SQL windows and Dataset API for stronger typing and schema support. The iterations improved the code, testing, deployment, and use of latest Spark features.

spark meetupbangalore apache spark meetupapache spark
Interactive Data Analysis in Spark Streaming
Interactive Data Analysis in Spark StreamingInteractive Data Analysis in Spark Streaming
Interactive Data Analysis in Spark Streaming

This document discusses strategies for building interactive streaming applications in Spark Streaming. It describes using Zookeeper as a dynamic configuration source to allow modifying a Spark Streaming application's behavior at runtime. The key points are: - Zookeeper can be used to track configuration changes and trigger Spark Streaming context restarts through its watch mechanism and Curator library. - This allows building interactive applications that can adapt to configuration updates without needing to restart the whole streaming job. - Examples are provided of using Curator caches like node and path caches to monitor Zookeeper for changes and restart Spark Streaming contexts in response.

bangalore apache spark meetupapache sparkspark streaming
Stream as Fast Batch Processing
● Stream processing viewed as low latency batch
processing
● Storm took stateless per message and spark took
minibatch approach
● Focused on mostly stateless / limited state workloads
● Reconciled using Lamda architecture
● Less features and less powerful API compared to the
batch system
● Ex : Storm, Spark DStream API
Drawbacks of Stream as Fast Batch
● Handling state for long time and efficiently is a
challenge in these systems
● Lambda architecture forces the duplication of efforts in
stream and batch
● As the API is limited, doing any kind of complex
operation takes lot of effort
● No clear abstractions for handling stream specific
interactions like late events, event time, state recovery
etc
Stream as the default abstraction
● Stream becomes the default abstraction on which both
stream processing and batch processing is built
● Batch processing is looked at as bounded stream
processing
● Supports most of the advanced stream processing
constructs out of the box
● Strong state API’s
● In par with functionalities of Batch API
● Ex : Flink, Beam
Challenges with Stream as default
● Stream as abstraction makes it hard to combine stream
with batch data
● Stream abstraction works well for piping based API’s
like map, flatMap but challenging for SQL
● Stream abstraction also sometimes make it difficult to
map it to structured world as in the platform level it’s
viewed as byte stream
● There are efforts like flink SQL but we have to wait how
it turns out

Recommended for you

Building real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark StreamingBuilding real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark Streaming

This document summarizes the key challenges and solutions in building a real-time data pipeline that ingests data from a database, transforms it using Spark Streaming, and publishes the output to Salesforce. The pipeline aims to have a latency of 1 minute with zero data loss and ordering guarantees. Some challenges discussed include handling out of sequence and late arrival events, schema evolution, bootstrap loading, data loss/corruption, and diagnosing issues. Solutions proposed use Kafka, checkpointing, replay capabilities, and careful broker/connect setups to help meet the reliability requirements for the pipeline.

apache sparkbangalore apache spark meetupspark meetup
Building distributed processing system from scratch - Part 2
Building distributed processing system from scratch - Part 2Building distributed processing system from scratch - Part 2
Building distributed processing system from scratch - Part 2

Continuation of http://www.slideshare.net/datamantra/building-distributed-systems-from-scratch-part-1

bangaloreapache sparkspark meetup
State management in Structured Streaming
State management in Structured StreamingState management in Structured Streaming
State management in Structured Streaming

This document discusses state management in Apache Spark Structured Streaming. It begins by introducing Structured Streaming and differentiating between stateless and stateful stream processing. It then explains the need for state stores to manage intermediate data in stateful processing. It describes how state was managed inefficiently in old Spark Streaming using RDDs and snapshots, and how Structured Streaming improved on this with its decoupled, asynchronous, and incremental state persistence approach. The document outlines Apache Spark's implementation of storing state to HDFS and the involved code entities. It closes by discussing potential issues with this approach and how embedded stores like RocksDB may help address them in production stream processing systems.

apache sparkbangalore apache spark meetupspark streaming
Drawbacks of DStream API
Tied to Minibatch execution
● DStream API looks stream as fast batch processing in
both API and runtime level
● Batch time integral part of the API which makes it
minibatch only API
● Batch time dicates how different abstractions of API like
window and state will behave
RDDs based API
● DStream API is based on RDD API which is deprecated
for user API’s in Spark 2.0
● As DStream API uses RDD, it doesn’t get benefit of the
all runtime improvements happened in spark sql
● Difficult to combine in batch API’s as they use Dataset
abstraction
● Running SQL queries over stream are awkward and not
straight forward
Limited support for Time abstraction
● Only supports the concept of Processing time
● No support for ingestion time and event time
● As batch time is defined at application level, there is no
framework level construct to handle late events
● Windowing other than time, is not possible

Recommended for you

Interactive workflow management using Azkaban
Interactive workflow management using AzkabanInteractive workflow management using Azkaban
Interactive workflow management using Azkaban

Managing a workflow using Azkaban scheduler. It can be used in batch as well as interactive workloads

workflow managementazkabanbangalore apache spark meetup
Understanding Implicits in Scala
Understanding Implicits in ScalaUnderstanding Implicits in Scala
Understanding Implicits in Scala

Implicit parameters and implicit conversions are Scala language features that allow omitting explicit calls to methods or variables. Implicits enable concise and elegant code through features like dependency injection, context passing, and ad hoc polymorphism. Implicits resolve types at compile-time rather than runtime. While powerful, implicits can cause conflicts and slow compilation if overused. Frameworks like Scala collections, Spark, and Spray JSON extensively use implicits to provide type classes and conversions between Scala and Java types.

implicitsscalaimplicit
Building end to end streaming application on Spark
Building end to end streaming application on SparkBuilding end to end streaming application on Spark
Building end to end streaming application on Spark

This document discusses building a real-time streaming application on Spark to analyze sensor data. It describes collecting data from servers through Flume into Kafka and processing it using Spark Streaming to generate analytics stored in Cassandra. The stages involve using files, then Kafka, and finally Cassandra for input/output. Testing streaming applications and redesigning for testability is also covered.

bangalore apache spark meetupstreaming applicationapache spark
Introduction to Structured Streaming
Stream as the infinite table
● In structured streaming, a stream is modeled as an
infinite table aka infinite Dataset
● As we are using structured abstraction, it’s called
structured streaming API
● All input sources, stream transformations and output
sinks modeled as Dataset
● As Dataset is underlying abstraction, stream
transformations are represented using SQL and Dataset
DSL
Advantage of Stream as infinite table
● Structured data analysis is first class not layered over
the unstructured runtime
● Easy to combine with batch data as both use same
Dataset abstraction
● Can use full power of SQL language to express stateful
stream operations
● Benefits from SQL optimisations learnt over decades
● Easy to learn and maintain
Source and Sinks API

Recommended for you

Spark on Kubernetes
Spark on KubernetesSpark on Kubernetes
Spark on Kubernetes

Spark can run on Kubernetes containers in two ways - as a static cluster or with native integration. As a static cluster, Spark pods are manually deployed without autoscaling. Native integration treats Kubernetes as a resource manager, allowing Spark to dynamically acquire and release containers like in YARN. It uses Kubernetes custom controllers to create driver pods that then launch worker pods. This provides autoscaling of resources based on job demands.

apache sparkkubernetesbangalore apache spark meetup
Introduction to Spark 2.0 Dataset API
Introduction to Spark 2.0 Dataset APIIntroduction to Spark 2.0 Dataset API
Introduction to Spark 2.0 Dataset API

This document provides an introduction and overview of Spark's Dataset API. It discusses how Dataset combines the best aspects of RDDs and DataFrames into a single API, providing strongly typed transformations on structured data. The document also covers how Dataset moves Spark away from RDDs towards a more SQL-like programming model and optimized data handling. Key topics include the Spark Session entry point, differences between DataFrames and Datasets, and examples of Dataset operations like word count.

spark architecturebig datardd
Multi Source Data Analysis using Spark and Tellius
Multi Source Data Analysis using Spark and TelliusMulti Source Data Analysis using Spark and Tellius
Multi Source Data Analysis using Spark and Tellius

Multi Source Data Analysis Using Apache Spark and Tellius This document discusses analyzing data from multiple sources using Apache Spark and the Tellius platform. It covers loading data from different sources like databases and files into Spark DataFrames, defining a data model by joining the sources, and performing analysis like calculating revenues by department across sources. It also discusses challenges like double counting values when directly querying the joined data. The Tellius platform addresses this by implementing a custom query layer on top of Spark SQL to enable accurate multi-source analysis.

apache sparkdatasciencetellius
Reading from Socket
● Socket is built in source for structured streaming
● As with DStream API, we can read socket by specifying
hostname and port
● Returns a DataFrame with single column called value
● Using console as the sink to write the output
● Once we have setup source and sink, we use query
interface to start the execution
● Ex : SocketReadExample
Questions from DStream users
● Where is batch time? Or how frequently this is going to
run?
● awaitTermination is on query not on session? Does
that mean we can have multiple queries running
parallely?
● We didn't specify local[2], how does that work?
● As this program using Dataframe, how does the schema
inference works?
Flink vs Spark stream processing
● Spark run as soon as possible may sound like per event
processing but it’s not
● In flink, all the operations like map / flatMap will be
running as processes and data will be streamed through
it
● But in spark asap, tasks are launched for given batch
and destroyed once it’s completed
● So spark still does minibatch but with much lower
latency.
Flink Operator Graph

Recommended for you

Introduction to spark 2.0
Introduction to spark 2.0Introduction to spark 2.0
Introduction to spark 2.0

This document introduces Spark 2.0 and its key features, including the Dataset abstraction, Spark Session API, moving from RDDs to Datasets, Dataset and DataFrame APIs, handling time windows, and adding custom optimizations. The major focus of Spark 2.0 is standardizing on the Dataset abstraction and improving performance by an order of magnitude. Datasets provide a strongly typed API that combines the best of RDDs and DataFrames.

bangalore apache spark meetuplatest sparkspark 2.0
Building Distributed Systems from Scratch - Part 1
Building Distributed Systems from Scratch - Part 1Building Distributed Systems from Scratch - Part 1
Building Distributed Systems from Scratch - Part 1

Meetup talk about building distributed systems like Spark from scratch on top of frameworks like Apache Mesos and YARN

functional programmingdistributed systemsspark advanced
Introduction to dataset
Introduction to datasetIntroduction to dataset
Introduction to dataset

The document introduces the Dataset API in Spark, which provides type safety and performance benefits over DataFrames. Datasets allow operating on domain objects using compiled functions rather than Rows. Encoders efficiently serialize objects to and from the JVM. This allows type checking of operations and retaining objects in distributed operations. The document outlines the history of Spark APIs, limitations of DataFrames, and how Datasets address these through compiled encoding and working with case classes rather than Rows.

bangaloredatasetdataframe
Spark Execution Graph
a b
1 2
3 4
Batch 1
Map
Stage
Aggregat
e Stage
Spawn tasks
Batch 2
Map
Stage
Aggregat
e Stage
Sink
Stage
Sink
Stage
Spawn tasks
Socket Stream
Independence from Execution Model
● Even though current structured streaming runtime is
minibatch, API doesn’t dictate the nature of runtime
● Structured Streaming API is built in such a way that
query execution model can be change in future
● Already plan for continuous processing mode to bring
structured streaming in par with flink per message
semantics
● https://issues.apache.org/jira/browse/SPARK-20928
Socket Minibatch
● In last example, we used asap trigger.
● We can mimic the DStream mini batch behaviour by
changing the trigger API
● Trigger is specified for the query, as it determines the
frequency for query execution
● In this example, we create a 5 second trigger, which will
create a batch for every 5 seconds
● Ex : SocketMiniBatchExample
Word count on Socket Stream
● Once we know how to read from a source, we can do
operations on the same
● In this example, we will do word count using Dataframe
and Dataset API’s
● We will be using Dataset API’s for data
cleanup/preparation and Dataframe API to define the
aggregations
● Ex : SocketWordCount

Recommended for you

Anatomy of Data Frame API : A deep dive into Spark Data Frame API
Anatomy of Data Frame API :  A deep dive into Spark Data Frame APIAnatomy of Data Frame API :  A deep dive into Spark Data Frame API
Anatomy of Data Frame API : A deep dive into Spark Data Frame API

In this presentation, we discuss about internals of spark data frame API. All the code discussed in this presentation available at https://github.com/phatak-dev/anatomy_of_spark_dataframe_api

dataframeapache sparkr
A Tool For Big Data Analysis using Apache Spark
A Tool For Big Data Analysis using Apache SparkA Tool For Big Data Analysis using Apache Spark
A Tool For Big Data Analysis using Apache Spark

Building a REST based generic big data tool for solving business problems across verticles using Apache Spark

buisnessmachine learningapache spark
Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...

Last year, in Apache Spark 2.0, Databricks introduced Structured Streaming, a new stream processing engine built on Spark SQL, which revolutionized how developers could write stream processing application. Structured Streaming enables users to express their computations the same way they would express a batch query on static data. Developers can express queries using powerful high-level APIs including DataFrames, Dataset and SQL. Then, the Spark SQL engine is capable of converting these batch-like transformations into an incremental execution plan that can process streaming data, while automatically handling late, out-of-order data and ensuring end-to-end exactly-once fault-tolerance guarantees. Since Spark 2.0, Databricks has been hard at work building first-class integration with Kafka. With this new connectivity, performing complex, low-latency analytics is now as easy as writing a standard SQL query. This functionality, in addition to the existing connectivity of Spark SQL, makes it easy to analyze data using one unified framework. Users can now seamlessly extract insights from data, independent of whether it is coming from messy / unstructured files, a structured / columnar historical data warehouse, or arriving in real-time from Kafka/Kinesis. In this session, Das will walk through a concrete example where – in less than 10 lines – you read Kafka, parse JSON payload data into separate columns, transform it, enrich it by joining with static data and write it out as a table ready for batch and ad-hoc queries on up-to-the-last-minute data. He’ll use techniques including event-time based aggregations, arbitrary stateful operations, and automatic state management using event-time watermarks.

structured streamingapache sparkspark streaming
Understanding State
Stateful operations
● In last example, we observed that spark remembers the
state across batches
● In structured streaming, all aggregations are stateful
● Developer needs to choose output mode complete so
that aggregations are always up to date
● Spark internally uses the both disk and memory state
store for remembering state
● No more complicated state management in application
code
Understanding output mode
● Output mode defines what’s the dataframe seen by the
sink after each batch
● APPEND signifies sink only sees the records from last
batch
● UPDATE signifies sink sees all the changed records
across the batches
● COMPLETE signifies sink sess complete output for
every batch
● Depending on operations, we need to choose output
mode
Stateless aggregations
● Most of the stream applications benefit from default
statefulness
● But sometime we need aggregations done on batch
data rather on complete data
● Helpful for porting existing DStream code to structured
streaming code
● Spark exposes flatMapGroups API to define the
stateless aggregations

Recommended for you

Structured Streaming Using Spark 2.1
Structured Streaming Using Spark 2.1Structured Streaming Using Spark 2.1
Structured Streaming Using Spark 2.1

-What is Structured Streaming? -Programming Model -Windowing -Delayed events -Watermarking -Watermarking and output mode By Yaswanth Yadlapalli

A Deep Dive into Structured Streaming in Apache Spark
A Deep Dive into Structured Streaming in Apache Spark A Deep Dive into Structured Streaming in Apache Spark
A Deep Dive into Structured Streaming in Apache Spark

This document provides an overview of Structured Streaming in Apache Spark. It begins with a brief history of streaming in Spark and outlines some of the limitations of the previous DStream API. It then introduces the new Structured Streaming API, which allows for continuous queries to be expressed as standard Spark SQL queries against continuously arriving data. It describes the new processing model and how queries are executed incrementally. It also covers features like event-time processing, windows, joins, and fault-tolerance guarantees through checkpointing and write-ahead logging. Overall, the document presents Structured Streaming as providing a simpler way to perform streaming analytics by allowing streaming queries to be expressed using the same APIs as batch queries.

apache spark 2.xstructured streamingspark meetup
What's new with Apache Spark's Structured Streaming?
What's new with Apache Spark's Structured Streaming?What's new with Apache Spark's Structured Streaming?
What's new with Apache Spark's Structured Streaming?

Structured Streaming in Apache Spark allows users to write streaming applications as batch-style queries on static or streaming data sources. It treats streams as continuous unbounded tables and allows batch queries written on DataFrames/Datasets to be automatically converted into incremental execution plans to process streaming data in micro-batches. This provides a simple yet powerful API for building robust stream processing applications with end-to-end fault tolerance guarantees and integration with various data sources and sinks.

big dataapache sparkdatabricks
Stateless wordcount
● In this example, we will define word count on a batch
data
● Batch is defined for 5 seconds.
● Rather than using groupBy and count API’s we will use
groupByKey and flatMapGroups API
● flatMapGroups defines operations to be done on each
group
● We will be using output mode APPEND
● Ex : StatelessWordCount
Limitations of flatMapGroups
● flatMapGroups will be slower than groupBy and count
as it doesn’t support partial aggregations
● flatMapGroups can be used only with output mode
APPEND as output size of the function is unbounded
● flatMapGroups needs grouping done using Dataset API
not using Dataframe API
Checkpoint and state recovery
● Building stateful applications comes with additional
responsibility of checkpointing the state for safe
recovery
● Checkpointing is achieved by writing state of the
application to a HDFS compatible storage
● Checkpointing is specific for queries. So you can mix
and match stateless and stateful queries in same
application
● Ex : RecoverableAggregation
Working with Files

Recommended for you

A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming:  Apache Spark Meetup at Bloomberg 2016 A Deep Dive into Structured Streaming:  Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016

Tathagata 'TD' Das presented at Bay Area Apache Spark Meetup. This talk covers the merits and motivations of Structured Streaming, and how you can start writing end-to-end continuous applications using Structured Streaming APIs.

structured streamingapache spark 2.0
Building Continuous Application with Structured Streaming and Real-Time Data ...
Building Continuous Application with Structured Streaming and Real-Time Data ...Building Continuous Application with Structured Streaming and Real-Time Data ...
Building Continuous Application with Structured Streaming and Real-Time Data ...

This document summarizes a presentation about building a structured streaming connector for continuous applications using Azure Event Hubs as the streaming data source. It discusses key design considerations like representing offsets, implementing the getOffset and getBatch methods required by structured streaming sources, and challenges with testing asynchronous behavior. It also outlines issues contributed back to the Apache Spark community around streaming checkpoints and recovery.

apache sparkspark summit
Introduction to Structured Streaming
Introduction to Structured StreamingIntroduction to Structured Streaming
Introduction to Structured Streaming

Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine.

sparkapache sparkspark streaming
File streams
● Structured Streaming has excellent support for the file
based streams
● Supports file types like csv, json,parquet out of the box
● Schema inference is not supported
● Picking up new files on arrival is same as DStream file
stream API
● Ex : FileStreamExample
Joins with static data
● As Dataset is common abstraction across batch and
stream API’s , we can easily enrich structured stream
with static data
● As both have schema built in, spark can use the catalyst
optimiser to optimise joins between files and streams
● In our example,we will be enriching sales stream with
customer data
● Ex : StreamJoin
References
● http://blog.madhukaraphatak.com/categories/introductio
n-structured-streaming/
● https://databricks.com/blog/2017/01/19/real-time-stream
ing-etl-structured-streaming-apache-spark-2-1.html
● https://flink.apache.org/news/2016/05/24/stream-sql.htm
l

More Related Content

What's hot

Core Services behind Spark Job Execution
Core Services behind Spark Job ExecutionCore Services behind Spark Job Execution
Core Services behind Spark Job Execution
datamantra
 
Understanding time in structured streaming
Understanding time in structured streamingUnderstanding time in structured streaming
Understanding time in structured streaming
datamantra
 
Introduction to Flink Streaming
Introduction to Flink StreamingIntroduction to Flink Streaming
Introduction to Flink Streaming
datamantra
 
Introduction to concurrent programming with Akka actors
Introduction to concurrent programming with Akka actorsIntroduction to concurrent programming with Akka actors
Introduction to concurrent programming with Akka actors
Shashank L
 
Productionalizing a spark application
Productionalizing a spark applicationProductionalizing a spark application
Productionalizing a spark application
datamantra
 
Interactive Data Analysis in Spark Streaming
Interactive Data Analysis in Spark StreamingInteractive Data Analysis in Spark Streaming
Interactive Data Analysis in Spark Streaming
datamantra
 
Building real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark StreamingBuilding real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark Streaming
datamantra
 
Building distributed processing system from scratch - Part 2
Building distributed processing system from scratch - Part 2Building distributed processing system from scratch - Part 2
Building distributed processing system from scratch - Part 2
datamantra
 
State management in Structured Streaming
State management in Structured StreamingState management in Structured Streaming
State management in Structured Streaming
datamantra
 
Interactive workflow management using Azkaban
Interactive workflow management using AzkabanInteractive workflow management using Azkaban
Interactive workflow management using Azkaban
datamantra
 
Understanding Implicits in Scala
Understanding Implicits in ScalaUnderstanding Implicits in Scala
Understanding Implicits in Scala
datamantra
 
Building end to end streaming application on Spark
Building end to end streaming application on SparkBuilding end to end streaming application on Spark
Building end to end streaming application on Spark
datamantra
 
Spark on Kubernetes
Spark on KubernetesSpark on Kubernetes
Spark on Kubernetes
datamantra
 
Introduction to Spark 2.0 Dataset API
Introduction to Spark 2.0 Dataset APIIntroduction to Spark 2.0 Dataset API
Introduction to Spark 2.0 Dataset API
datamantra
 
Multi Source Data Analysis using Spark and Tellius
Multi Source Data Analysis using Spark and TelliusMulti Source Data Analysis using Spark and Tellius
Multi Source Data Analysis using Spark and Tellius
datamantra
 
Introduction to spark 2.0
Introduction to spark 2.0Introduction to spark 2.0
Introduction to spark 2.0
datamantra
 
Building Distributed Systems from Scratch - Part 1
Building Distributed Systems from Scratch - Part 1Building Distributed Systems from Scratch - Part 1
Building Distributed Systems from Scratch - Part 1
datamantra
 
Introduction to dataset
Introduction to datasetIntroduction to dataset
Introduction to dataset
datamantra
 
Anatomy of Data Frame API : A deep dive into Spark Data Frame API
Anatomy of Data Frame API :  A deep dive into Spark Data Frame APIAnatomy of Data Frame API :  A deep dive into Spark Data Frame API
Anatomy of Data Frame API : A deep dive into Spark Data Frame API
datamantra
 
A Tool For Big Data Analysis using Apache Spark
A Tool For Big Data Analysis using Apache SparkA Tool For Big Data Analysis using Apache Spark
A Tool For Big Data Analysis using Apache Spark
datamantra
 

What's hot (20)

Core Services behind Spark Job Execution
Core Services behind Spark Job ExecutionCore Services behind Spark Job Execution
Core Services behind Spark Job Execution
 
Understanding time in structured streaming
Understanding time in structured streamingUnderstanding time in structured streaming
Understanding time in structured streaming
 
Introduction to Flink Streaming
Introduction to Flink StreamingIntroduction to Flink Streaming
Introduction to Flink Streaming
 
Introduction to concurrent programming with Akka actors
Introduction to concurrent programming with Akka actorsIntroduction to concurrent programming with Akka actors
Introduction to concurrent programming with Akka actors
 
Productionalizing a spark application
Productionalizing a spark applicationProductionalizing a spark application
Productionalizing a spark application
 
Interactive Data Analysis in Spark Streaming
Interactive Data Analysis in Spark StreamingInteractive Data Analysis in Spark Streaming
Interactive Data Analysis in Spark Streaming
 
Building real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark StreamingBuilding real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark Streaming
 
Building distributed processing system from scratch - Part 2
Building distributed processing system from scratch - Part 2Building distributed processing system from scratch - Part 2
Building distributed processing system from scratch - Part 2
 
State management in Structured Streaming
State management in Structured StreamingState management in Structured Streaming
State management in Structured Streaming
 
Interactive workflow management using Azkaban
Interactive workflow management using AzkabanInteractive workflow management using Azkaban
Interactive workflow management using Azkaban
 
Understanding Implicits in Scala
Understanding Implicits in ScalaUnderstanding Implicits in Scala
Understanding Implicits in Scala
 
Building end to end streaming application on Spark
Building end to end streaming application on SparkBuilding end to end streaming application on Spark
Building end to end streaming application on Spark
 
Spark on Kubernetes
Spark on KubernetesSpark on Kubernetes
Spark on Kubernetes
 
Introduction to Spark 2.0 Dataset API
Introduction to Spark 2.0 Dataset APIIntroduction to Spark 2.0 Dataset API
Introduction to Spark 2.0 Dataset API
 
Multi Source Data Analysis using Spark and Tellius
Multi Source Data Analysis using Spark and TelliusMulti Source Data Analysis using Spark and Tellius
Multi Source Data Analysis using Spark and Tellius
 
Introduction to spark 2.0
Introduction to spark 2.0Introduction to spark 2.0
Introduction to spark 2.0
 
Building Distributed Systems from Scratch - Part 1
Building Distributed Systems from Scratch - Part 1Building Distributed Systems from Scratch - Part 1
Building Distributed Systems from Scratch - Part 1
 
Introduction to dataset
Introduction to datasetIntroduction to dataset
Introduction to dataset
 
Anatomy of Data Frame API : A deep dive into Spark Data Frame API
Anatomy of Data Frame API :  A deep dive into Spark Data Frame APIAnatomy of Data Frame API :  A deep dive into Spark Data Frame API
Anatomy of Data Frame API : A deep dive into Spark Data Frame API
 
A Tool For Big Data Analysis using Apache Spark
A Tool For Big Data Analysis using Apache SparkA Tool For Big Data Analysis using Apache Spark
A Tool For Big Data Analysis using Apache Spark
 

Viewers also liked

Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...
Databricks
 
Structured Streaming Using Spark 2.1
Structured Streaming Using Spark 2.1Structured Streaming Using Spark 2.1
Structured Streaming Using Spark 2.1
Sigmoid
 
A Deep Dive into Structured Streaming in Apache Spark
A Deep Dive into Structured Streaming in Apache Spark A Deep Dive into Structured Streaming in Apache Spark
A Deep Dive into Structured Streaming in Apache Spark
Anyscale
 
What's new with Apache Spark's Structured Streaming?
What's new with Apache Spark's Structured Streaming?What's new with Apache Spark's Structured Streaming?
What's new with Apache Spark's Structured Streaming?
Miklos Christine
 
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming:  Apache Spark Meetup at Bloomberg 2016 A Deep Dive into Structured Streaming:  Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016
Databricks
 
Building Continuous Application with Structured Streaming and Real-Time Data ...
Building Continuous Application with Structured Streaming and Real-Time Data ...Building Continuous Application with Structured Streaming and Real-Time Data ...
Building Continuous Application with Structured Streaming and Real-Time Data ...
Databricks
 
Introduction to Structured Streaming
Introduction to Structured StreamingIntroduction to Structured Streaming
Introduction to Structured Streaming
Knoldus Inc.
 
Making Structured Streaming Ready for Production
Making Structured Streaming Ready for ProductionMaking Structured Streaming Ready for Production
Making Structured Streaming Ready for Production
Databricks
 
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
Databricks
 
Deep dive into stateful stream processing in structured streaming by Tathaga...
Deep dive into stateful stream processing in structured streaming  by Tathaga...Deep dive into stateful stream processing in structured streaming  by Tathaga...
Deep dive into stateful stream processing in structured streaming by Tathaga...
Databricks
 

Viewers also liked (10)

Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...Easy, scalable, fault tolerant stream processing with structured streaming - ...
Easy, scalable, fault tolerant stream processing with structured streaming - ...
 
Structured Streaming Using Spark 2.1
Structured Streaming Using Spark 2.1Structured Streaming Using Spark 2.1
Structured Streaming Using Spark 2.1
 
A Deep Dive into Structured Streaming in Apache Spark
A Deep Dive into Structured Streaming in Apache Spark A Deep Dive into Structured Streaming in Apache Spark
A Deep Dive into Structured Streaming in Apache Spark
 
What's new with Apache Spark's Structured Streaming?
What's new with Apache Spark's Structured Streaming?What's new with Apache Spark's Structured Streaming?
What's new with Apache Spark's Structured Streaming?
 
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming:  Apache Spark Meetup at Bloomberg 2016 A Deep Dive into Structured Streaming:  Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016
 
Building Continuous Application with Structured Streaming and Real-Time Data ...
Building Continuous Application with Structured Streaming and Real-Time Data ...Building Continuous Application with Structured Streaming and Real-Time Data ...
Building Continuous Application with Structured Streaming and Real-Time Data ...
 
Introduction to Structured Streaming
Introduction to Structured StreamingIntroduction to Structured Streaming
Introduction to Structured Streaming
 
Making Structured Streaming Ready for Production
Making Structured Streaming Ready for ProductionMaking Structured Streaming Ready for Production
Making Structured Streaming Ready for Production
 
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
 
Deep dive into stateful stream processing in structured streaming by Tathaga...
Deep dive into stateful stream processing in structured streaming  by Tathaga...Deep dive into stateful stream processing in structured streaming  by Tathaga...
Deep dive into stateful stream processing in structured streaming by Tathaga...
 

Similar to Introduction to Structured streaming

Introduction to Apache Flink
Introduction to Apache FlinkIntroduction to Apache Flink
Introduction to Apache Flink
datamantra
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streaming
datamantra
 
Cassandra Lunch #88: Cadence
Cassandra Lunch #88: CadenceCassandra Lunch #88: Cadence
Cassandra Lunch #88: Cadence
Anant Corporation
 
Seattle Spark Meetup Mobius CSharp API
Seattle Spark Meetup Mobius CSharp APISeattle Spark Meetup Mobius CSharp API
Seattle Spark Meetup Mobius CSharp API
shareddatamsft
 
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
Omid Vahdaty
 
Introduction to Structured Data Processing with Spark SQL
Introduction to Structured Data Processing with Spark SQLIntroduction to Structured Data Processing with Spark SQL
Introduction to Structured Data Processing with Spark SQL
datamantra
 
Understanding transactional writes in datasource v2
Understanding transactional writes in  datasource v2Understanding transactional writes in  datasource v2
Understanding transactional writes in datasource v2
datamantra
 
Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uber
Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ UberKafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uber
Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uber
confluent
 
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F... Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
Databricks
 
Big data Argentina meetup 2020-09: Intro to presto on docker
Big data Argentina meetup 2020-09: Intro to presto on dockerBig data Argentina meetup 2020-09: Intro to presto on docker
Big data Argentina meetup 2020-09: Intro to presto on docker
Federico Palladoro
 
Apache Spark vs Apache Flink
Apache Spark vs Apache FlinkApache Spark vs Apache Flink
Apache Spark vs Apache Flink
AKASH SIHAG
 
How a BEAM runner executes a pipeline. Apache BEAM Summit London 2018
How a BEAM runner executes a pipeline. Apache BEAM Summit London 2018How a BEAM runner executes a pipeline. Apache BEAM Summit London 2018
How a BEAM runner executes a pipeline. Apache BEAM Summit London 2018
javier ramirez
 
Spark Structured Streaming
Spark Structured StreamingSpark Structured Streaming
Spark Structured Streaming
Knoldus Inc.
 
Scalable complex event processing on samza @UBER
Scalable complex event processing on samza @UBERScalable complex event processing on samza @UBER
Scalable complex event processing on samza @UBER
Shuyi Chen
 
Simplified Troubleshooting through API Scripting
Simplified Troubleshooting through API Scripting Simplified Troubleshooting through API Scripting
Simplified Troubleshooting through API Scripting
Network Automation Forum
 
Scala like distributed collections - dumping time-series data with apache spark
Scala like distributed collections - dumping time-series data with apache sparkScala like distributed collections - dumping time-series data with apache spark
Scala like distributed collections - dumping time-series data with apache spark
Demi Ben-Ari
 
Introduction to Apache Spark 2.0
Introduction to Apache Spark 2.0Introduction to Apache Spark 2.0
Introduction to Apache Spark 2.0
Knoldus Inc.
 
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKSCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
zmhassan
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2
aspyker
 
Stream, Stream, Stream: Different Streaming Methods with Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Spark and KafkaStream, Stream, Stream: Different Streaming Methods with Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Spark and Kafka
DataWorks Summit
 

Similar to Introduction to Structured streaming (20)

Introduction to Apache Flink
Introduction to Apache FlinkIntroduction to Apache Flink
Introduction to Apache Flink
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streaming
 
Cassandra Lunch #88: Cadence
Cassandra Lunch #88: CadenceCassandra Lunch #88: Cadence
Cassandra Lunch #88: Cadence
 
Seattle Spark Meetup Mobius CSharp API
Seattle Spark Meetup Mobius CSharp APISeattle Spark Meetup Mobius CSharp API
Seattle Spark Meetup Mobius CSharp API
 
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
 
Introduction to Structured Data Processing with Spark SQL
Introduction to Structured Data Processing with Spark SQLIntroduction to Structured Data Processing with Spark SQL
Introduction to Structured Data Processing with Spark SQL
 
Understanding transactional writes in datasource v2
Understanding transactional writes in  datasource v2Understanding transactional writes in  datasource v2
Understanding transactional writes in datasource v2
 
Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uber
Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ UberKafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uber
Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uber
 
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F... Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 
Big data Argentina meetup 2020-09: Intro to presto on docker
Big data Argentina meetup 2020-09: Intro to presto on dockerBig data Argentina meetup 2020-09: Intro to presto on docker
Big data Argentina meetup 2020-09: Intro to presto on docker
 
Apache Spark vs Apache Flink
Apache Spark vs Apache FlinkApache Spark vs Apache Flink
Apache Spark vs Apache Flink
 
How a BEAM runner executes a pipeline. Apache BEAM Summit London 2018
How a BEAM runner executes a pipeline. Apache BEAM Summit London 2018How a BEAM runner executes a pipeline. Apache BEAM Summit London 2018
How a BEAM runner executes a pipeline. Apache BEAM Summit London 2018
 
Spark Structured Streaming
Spark Structured StreamingSpark Structured Streaming
Spark Structured Streaming
 
Scalable complex event processing on samza @UBER
Scalable complex event processing on samza @UBERScalable complex event processing on samza @UBER
Scalable complex event processing on samza @UBER
 
Simplified Troubleshooting through API Scripting
Simplified Troubleshooting through API Scripting Simplified Troubleshooting through API Scripting
Simplified Troubleshooting through API Scripting
 
Scala like distributed collections - dumping time-series data with apache spark
Scala like distributed collections - dumping time-series data with apache sparkScala like distributed collections - dumping time-series data with apache spark
Scala like distributed collections - dumping time-series data with apache spark
 
Introduction to Apache Spark 2.0
Introduction to Apache Spark 2.0Introduction to Apache Spark 2.0
Introduction to Apache Spark 2.0
 
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKSCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2
 
Stream, Stream, Stream: Different Streaming Methods with Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Spark and KafkaStream, Stream, Stream: Different Streaming Methods with Spark and Kafka
Stream, Stream, Stream: Different Streaming Methods with Spark and Kafka
 

More from datamantra

Optimizing S3 Write-heavy Spark workloads
Optimizing S3 Write-heavy Spark workloadsOptimizing S3 Write-heavy Spark workloads
Optimizing S3 Write-heavy Spark workloads
datamantra
 
Spark stack for Model life-cycle management
Spark stack for Model life-cycle managementSpark stack for Model life-cycle management
Spark stack for Model life-cycle management
datamantra
 
Testing Spark and Scala
Testing Spark and ScalaTesting Spark and Scala
Testing Spark and Scala
datamantra
 
Scalable Spark deployment using Kubernetes
Scalable Spark deployment using KubernetesScalable Spark deployment using Kubernetes
Scalable Spark deployment using Kubernetes
datamantra
 
Introduction to concurrent programming with akka actors
Introduction to concurrent programming with akka actorsIntroduction to concurrent programming with akka actors
Introduction to concurrent programming with akka actors
datamantra
 
Functional programming in Scala
Functional programming in ScalaFunctional programming in Scala
Functional programming in Scala
datamantra
 
Telco analytics at scale
Telco analytics at scaleTelco analytics at scale
Telco analytics at scale
datamantra
 
Platform for Data Scientists
Platform for Data ScientistsPlatform for Data Scientists
Platform for Data Scientists
datamantra
 
Building scalable rest service using Akka HTTP
Building scalable rest service using Akka HTTPBuilding scalable rest service using Akka HTTP
Building scalable rest service using Akka HTTP
datamantra
 
Real time ETL processing using Spark streaming
Real time ETL processing using Spark streamingReal time ETL processing using Spark streaming
Real time ETL processing using Spark streaming
datamantra
 
Anatomy of Spark SQL Catalyst - Part 2
Anatomy of Spark SQL Catalyst - Part 2Anatomy of Spark SQL Catalyst - Part 2
Anatomy of Spark SQL Catalyst - Part 2
datamantra
 
Anatomy of spark catalyst
Anatomy of spark catalystAnatomy of spark catalyst
Anatomy of spark catalyst
datamantra
 

More from datamantra (12)

Optimizing S3 Write-heavy Spark workloads
Optimizing S3 Write-heavy Spark workloadsOptimizing S3 Write-heavy Spark workloads
Optimizing S3 Write-heavy Spark workloads
 
Spark stack for Model life-cycle management
Spark stack for Model life-cycle managementSpark stack for Model life-cycle management
Spark stack for Model life-cycle management
 
Testing Spark and Scala
Testing Spark and ScalaTesting Spark and Scala
Testing Spark and Scala
 
Scalable Spark deployment using Kubernetes
Scalable Spark deployment using KubernetesScalable Spark deployment using Kubernetes
Scalable Spark deployment using Kubernetes
 
Introduction to concurrent programming with akka actors
Introduction to concurrent programming with akka actorsIntroduction to concurrent programming with akka actors
Introduction to concurrent programming with akka actors
 
Functional programming in Scala
Functional programming in ScalaFunctional programming in Scala
Functional programming in Scala
 
Telco analytics at scale
Telco analytics at scaleTelco analytics at scale
Telco analytics at scale
 
Platform for Data Scientists
Platform for Data ScientistsPlatform for Data Scientists
Platform for Data Scientists
 
Building scalable rest service using Akka HTTP
Building scalable rest service using Akka HTTPBuilding scalable rest service using Akka HTTP
Building scalable rest service using Akka HTTP
 
Real time ETL processing using Spark streaming
Real time ETL processing using Spark streamingReal time ETL processing using Spark streaming
Real time ETL processing using Spark streaming
 
Anatomy of Spark SQL Catalyst - Part 2
Anatomy of Spark SQL Catalyst - Part 2Anatomy of Spark SQL Catalyst - Part 2
Anatomy of Spark SQL Catalyst - Part 2
 
Anatomy of spark catalyst
Anatomy of spark catalystAnatomy of spark catalyst
Anatomy of spark catalyst
 

Recently uploaded

EGU2020-10385_presentation LSTM algorithm
EGU2020-10385_presentation LSTM algorithmEGU2020-10385_presentation LSTM algorithm
EGU2020-10385_presentation LSTM algorithm
fatimaezzahraboumaiz2
 
From Clues to Connections: How Social Media Investigators Expose Hidden Networks
From Clues to Connections: How Social Media Investigators Expose Hidden NetworksFrom Clues to Connections: How Social Media Investigators Expose Hidden Networks
From Clues to Connections: How Social Media Investigators Expose Hidden Networks
Milind Agarwal
 
Malviya Nagar @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model Safe
Malviya Nagar @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model SafeMalviya Nagar @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model Safe
Malviya Nagar @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model Safe
butwhat24
 
[D3T1S03] Amazon DynamoDB design puzzlers
[D3T1S03] Amazon DynamoDB design puzzlers[D3T1S03] Amazon DynamoDB design puzzlers
[D3T1S03] Amazon DynamoDB design puzzlers
Amazon Web Services Korea
 
Mahipalpur @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
Mahipalpur @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model SafeMahipalpur @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
Mahipalpur @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
aashuverma204
 
Delhi @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe
Delhi @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model SafeDelhi @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe
Delhi @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe
dipti singh$A17
 
Karol Bagh @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model Safe
Karol Bagh @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model SafeKarol Bagh @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model Safe
Karol Bagh @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model Safe
bookmybebe1
 
[D3T1S02] Aurora Limitless Database Introduction
[D3T1S02] Aurora Limitless Database Introduction[D3T1S02] Aurora Limitless Database Introduction
[D3T1S02] Aurora Limitless Database Introduction
Amazon Web Services Korea
 
University of Toronto degree offer diploma Transcript
University of Toronto  degree offer diploma TranscriptUniversity of Toronto  degree offer diploma Transcript
University of Toronto degree offer diploma Transcript
taqyea
 
Rohini @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Megha Singla Top Model Safe
Rohini @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Megha Singla Top Model SafeRohini @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Megha Singla Top Model Safe
Rohini @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Megha Singla Top Model Safe
depikasharma
 
AIRLINE_SATISFACTION_Data Science Solution on Azure
AIRLINE_SATISFACTION_Data Science Solution on AzureAIRLINE_SATISFACTION_Data Science Solution on Azure
AIRLINE_SATISFACTION_Data Science Solution on Azure
SanelaNikodinoska1
 
[D3T2S03] Data&AI Roadshow 2024 - Amazon DocumentDB 실습
[D3T2S03] Data&AI Roadshow 2024 - Amazon DocumentDB 실습[D3T2S03] Data&AI Roadshow 2024 - Amazon DocumentDB 실습
[D3T2S03] Data&AI Roadshow 2024 - Amazon DocumentDB 실습
Amazon Web Services Korea
 
NPS_Presentation_V3.pptx it is regarding National pension scheme
NPS_Presentation_V3.pptx it is regarding National pension schemeNPS_Presentation_V3.pptx it is regarding National pension scheme
NPS_Presentation_V3.pptx it is regarding National pension scheme
ASISHSABAT3
 
Simon Fraser University degree offer diploma Transcript
Simon Fraser University  degree offer diploma TranscriptSimon Fraser University  degree offer diploma Transcript
Simon Fraser University degree offer diploma Transcript
taqyea
 
South Ex @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe
South Ex @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model SafeSouth Ex @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe
South Ex @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe
simmi singh$A17
 
Sin Involves More Than You Might Think (We'll Explain)
Sin Involves More Than You Might Think (We'll Explain)Sin Involves More Than You Might Think (We'll Explain)
Sin Involves More Than You Might Think (We'll Explain)
sapna sharmap11
 
RK Puram @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Neha Singla Top Model Safe
RK Puram @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Neha Singla Top Model SafeRK Puram @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Neha Singla Top Model Safe
RK Puram @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Neha Singla Top Model Safe
Alisha Pathan $A17
 
Pitampura @ℂall @Girls ꧁❤ 9873777170 ❤꧂Fabulous sonam Mehra Top Model Safe
Pitampura @ℂall @Girls ꧁❤ 9873777170 ❤꧂Fabulous sonam Mehra Top Model SafePitampura @ℂall @Girls ꧁❤ 9873777170 ❤꧂Fabulous sonam Mehra Top Model Safe
Pitampura @ℂall @Girls ꧁❤ 9873777170 ❤꧂Fabulous sonam Mehra Top Model Safe
vasudha malikmonii$A17
 
Nehru Place @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model Safe
Nehru Place @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model SafeNehru Place @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model Safe
Nehru Place @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model Safe
butwhat24
 
Lajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Ruhi Singla Top Model Safe
Lajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Ruhi Singla Top Model SafeLajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Ruhi Singla Top Model Safe
Lajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Ruhi Singla Top Model Safe
jiya khan$A17
 

Recently uploaded (20)

EGU2020-10385_presentation LSTM algorithm
EGU2020-10385_presentation LSTM algorithmEGU2020-10385_presentation LSTM algorithm
EGU2020-10385_presentation LSTM algorithm
 
From Clues to Connections: How Social Media Investigators Expose Hidden Networks
From Clues to Connections: How Social Media Investigators Expose Hidden NetworksFrom Clues to Connections: How Social Media Investigators Expose Hidden Networks
From Clues to Connections: How Social Media Investigators Expose Hidden Networks
 
Malviya Nagar @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model Safe
Malviya Nagar @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model SafeMalviya Nagar @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model Safe
Malviya Nagar @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model Safe
 
[D3T1S03] Amazon DynamoDB design puzzlers
[D3T1S03] Amazon DynamoDB design puzzlers[D3T1S03] Amazon DynamoDB design puzzlers
[D3T1S03] Amazon DynamoDB design puzzlers
 
Mahipalpur @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
Mahipalpur @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model SafeMahipalpur @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
Mahipalpur @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Yogita Mehra Top Model Safe
 
Delhi @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe
Delhi @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model SafeDelhi @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe
Delhi @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe
 
Karol Bagh @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model Safe
Karol Bagh @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model SafeKarol Bagh @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model Safe
Karol Bagh @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model Safe
 
[D3T1S02] Aurora Limitless Database Introduction
[D3T1S02] Aurora Limitless Database Introduction[D3T1S02] Aurora Limitless Database Introduction
[D3T1S02] Aurora Limitless Database Introduction
 
University of Toronto degree offer diploma Transcript
University of Toronto  degree offer diploma TranscriptUniversity of Toronto  degree offer diploma Transcript
University of Toronto degree offer diploma Transcript
 
Rohini @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Megha Singla Top Model Safe
Rohini @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Megha Singla Top Model SafeRohini @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Megha Singla Top Model Safe
Rohini @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Megha Singla Top Model Safe
 
AIRLINE_SATISFACTION_Data Science Solution on Azure
AIRLINE_SATISFACTION_Data Science Solution on AzureAIRLINE_SATISFACTION_Data Science Solution on Azure
AIRLINE_SATISFACTION_Data Science Solution on Azure
 
[D3T2S03] Data&AI Roadshow 2024 - Amazon DocumentDB 실습
[D3T2S03] Data&AI Roadshow 2024 - Amazon DocumentDB 실습[D3T2S03] Data&AI Roadshow 2024 - Amazon DocumentDB 실습
[D3T2S03] Data&AI Roadshow 2024 - Amazon DocumentDB 실습
 
NPS_Presentation_V3.pptx it is regarding National pension scheme
NPS_Presentation_V3.pptx it is regarding National pension schemeNPS_Presentation_V3.pptx it is regarding National pension scheme
NPS_Presentation_V3.pptx it is regarding National pension scheme
 
Simon Fraser University degree offer diploma Transcript
Simon Fraser University  degree offer diploma TranscriptSimon Fraser University  degree offer diploma Transcript
Simon Fraser University degree offer diploma Transcript
 
South Ex @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe
South Ex @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model SafeSouth Ex @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe
South Ex @ℂall @Girls ꧁❤ 9711199012 ❤꧂Glamorous sonam Mehra Top Model Safe
 
Sin Involves More Than You Might Think (We'll Explain)
Sin Involves More Than You Might Think (We'll Explain)Sin Involves More Than You Might Think (We'll Explain)
Sin Involves More Than You Might Think (We'll Explain)
 
RK Puram @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Neha Singla Top Model Safe
RK Puram @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Neha Singla Top Model SafeRK Puram @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Neha Singla Top Model Safe
RK Puram @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Neha Singla Top Model Safe
 
Pitampura @ℂall @Girls ꧁❤ 9873777170 ❤꧂Fabulous sonam Mehra Top Model Safe
Pitampura @ℂall @Girls ꧁❤ 9873777170 ❤꧂Fabulous sonam Mehra Top Model SafePitampura @ℂall @Girls ꧁❤ 9873777170 ❤꧂Fabulous sonam Mehra Top Model Safe
Pitampura @ℂall @Girls ꧁❤ 9873777170 ❤꧂Fabulous sonam Mehra Top Model Safe
 
Nehru Place @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model Safe
Nehru Place @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model SafeNehru Place @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model Safe
Nehru Place @ℂall @Girls ꧁❤ 9873940964 ❤꧂VIP Jina Singh Top Model Safe
 
Lajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Ruhi Singla Top Model Safe
Lajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Ruhi Singla Top Model SafeLajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Ruhi Singla Top Model Safe
Lajpat Nagar @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Ruhi Singla Top Model Safe
 

Introduction to Structured streaming

  • 1. Introduction to Structured Streaming Next Generation Streaming API for Spark https://github.com/phatak-dev/spark2.0-examples/tree/master/src/main/scala/co m/madhukaraphatak/examples/sparktwo/streaming
  • 2. ● Madhukara Phatak ● Big data consultant and trainer at datamantra.io ● Consult in Hadoop, Spark and Scala ● www.madhukaraphatak.com
  • 3. Agenda ● Evolution in Stream Processing ● Drawbacks of DStream API ● Introduction to Structured Streaming ● Understanding Source and Sinks ● Stateful stream applications ● Handling State recovery ● Joins ● Window API
  • 4. Evolution of Stream Processing
  • 5. Stream as Fast Batch Processing ● Stream processing viewed as low latency batch processing ● Storm took stateless per message and spark took minibatch approach ● Focused on mostly stateless / limited state workloads ● Reconciled using Lamda architecture ● Less features and less powerful API compared to the batch system ● Ex : Storm, Spark DStream API
  • 6. Drawbacks of Stream as Fast Batch ● Handling state for long time and efficiently is a challenge in these systems ● Lambda architecture forces the duplication of efforts in stream and batch ● As the API is limited, doing any kind of complex operation takes lot of effort ● No clear abstractions for handling stream specific interactions like late events, event time, state recovery etc
  • 7. Stream as the default abstraction ● Stream becomes the default abstraction on which both stream processing and batch processing is built ● Batch processing is looked at as bounded stream processing ● Supports most of the advanced stream processing constructs out of the box ● Strong state API’s ● In par with functionalities of Batch API ● Ex : Flink, Beam
  • 8. Challenges with Stream as default ● Stream as abstraction makes it hard to combine stream with batch data ● Stream abstraction works well for piping based API’s like map, flatMap but challenging for SQL ● Stream abstraction also sometimes make it difficult to map it to structured world as in the platform level it’s viewed as byte stream ● There are efforts like flink SQL but we have to wait how it turns out
  • 10. Tied to Minibatch execution ● DStream API looks stream as fast batch processing in both API and runtime level ● Batch time integral part of the API which makes it minibatch only API ● Batch time dicates how different abstractions of API like window and state will behave
  • 11. RDDs based API ● DStream API is based on RDD API which is deprecated for user API’s in Spark 2.0 ● As DStream API uses RDD, it doesn’t get benefit of the all runtime improvements happened in spark sql ● Difficult to combine in batch API’s as they use Dataset abstraction ● Running SQL queries over stream are awkward and not straight forward
  • 12. Limited support for Time abstraction ● Only supports the concept of Processing time ● No support for ingestion time and event time ● As batch time is defined at application level, there is no framework level construct to handle late events ● Windowing other than time, is not possible
  • 14. Stream as the infinite table ● In structured streaming, a stream is modeled as an infinite table aka infinite Dataset ● As we are using structured abstraction, it’s called structured streaming API ● All input sources, stream transformations and output sinks modeled as Dataset ● As Dataset is underlying abstraction, stream transformations are represented using SQL and Dataset DSL
  • 15. Advantage of Stream as infinite table ● Structured data analysis is first class not layered over the unstructured runtime ● Easy to combine with batch data as both use same Dataset abstraction ● Can use full power of SQL language to express stateful stream operations ● Benefits from SQL optimisations learnt over decades ● Easy to learn and maintain
  • 17. Reading from Socket ● Socket is built in source for structured streaming ● As with DStream API, we can read socket by specifying hostname and port ● Returns a DataFrame with single column called value ● Using console as the sink to write the output ● Once we have setup source and sink, we use query interface to start the execution ● Ex : SocketReadExample
  • 18. Questions from DStream users ● Where is batch time? Or how frequently this is going to run? ● awaitTermination is on query not on session? Does that mean we can have multiple queries running parallely? ● We didn't specify local[2], how does that work? ● As this program using Dataframe, how does the schema inference works?
  • 19. Flink vs Spark stream processing ● Spark run as soon as possible may sound like per event processing but it’s not ● In flink, all the operations like map / flatMap will be running as processes and data will be streamed through it ● But in spark asap, tasks are launched for given batch and destroyed once it’s completed ● So spark still does minibatch but with much lower latency.
  • 21. Spark Execution Graph a b 1 2 3 4 Batch 1 Map Stage Aggregat e Stage Spawn tasks Batch 2 Map Stage Aggregat e Stage Sink Stage Sink Stage Spawn tasks Socket Stream
  • 22. Independence from Execution Model ● Even though current structured streaming runtime is minibatch, API doesn’t dictate the nature of runtime ● Structured Streaming API is built in such a way that query execution model can be change in future ● Already plan for continuous processing mode to bring structured streaming in par with flink per message semantics ● https://issues.apache.org/jira/browse/SPARK-20928
  • 23. Socket Minibatch ● In last example, we used asap trigger. ● We can mimic the DStream mini batch behaviour by changing the trigger API ● Trigger is specified for the query, as it determines the frequency for query execution ● In this example, we create a 5 second trigger, which will create a batch for every 5 seconds ● Ex : SocketMiniBatchExample
  • 24. Word count on Socket Stream ● Once we know how to read from a source, we can do operations on the same ● In this example, we will do word count using Dataframe and Dataset API’s ● We will be using Dataset API’s for data cleanup/preparation and Dataframe API to define the aggregations ● Ex : SocketWordCount
  • 26. Stateful operations ● In last example, we observed that spark remembers the state across batches ● In structured streaming, all aggregations are stateful ● Developer needs to choose output mode complete so that aggregations are always up to date ● Spark internally uses the both disk and memory state store for remembering state ● No more complicated state management in application code
  • 27. Understanding output mode ● Output mode defines what’s the dataframe seen by the sink after each batch ● APPEND signifies sink only sees the records from last batch ● UPDATE signifies sink sees all the changed records across the batches ● COMPLETE signifies sink sess complete output for every batch ● Depending on operations, we need to choose output mode
  • 28. Stateless aggregations ● Most of the stream applications benefit from default statefulness ● But sometime we need aggregations done on batch data rather on complete data ● Helpful for porting existing DStream code to structured streaming code ● Spark exposes flatMapGroups API to define the stateless aggregations
  • 29. Stateless wordcount ● In this example, we will define word count on a batch data ● Batch is defined for 5 seconds. ● Rather than using groupBy and count API’s we will use groupByKey and flatMapGroups API ● flatMapGroups defines operations to be done on each group ● We will be using output mode APPEND ● Ex : StatelessWordCount
  • 30. Limitations of flatMapGroups ● flatMapGroups will be slower than groupBy and count as it doesn’t support partial aggregations ● flatMapGroups can be used only with output mode APPEND as output size of the function is unbounded ● flatMapGroups needs grouping done using Dataset API not using Dataframe API
  • 31. Checkpoint and state recovery ● Building stateful applications comes with additional responsibility of checkpointing the state for safe recovery ● Checkpointing is achieved by writing state of the application to a HDFS compatible storage ● Checkpointing is specific for queries. So you can mix and match stateless and stateful queries in same application ● Ex : RecoverableAggregation
  • 33. File streams ● Structured Streaming has excellent support for the file based streams ● Supports file types like csv, json,parquet out of the box ● Schema inference is not supported ● Picking up new files on arrival is same as DStream file stream API ● Ex : FileStreamExample
  • 34. Joins with static data ● As Dataset is common abstraction across batch and stream API’s , we can easily enrich structured stream with static data ● As both have schema built in, spark can use the catalyst optimiser to optimise joins between files and streams ● In our example,we will be enriching sales stream with customer data ● Ex : StreamJoin