When working with streaming data, stateful operations are a common use case. If you would like to de-duplicate data, calculate aggregations over event-time windows, or track user activity over sessions, you are performing a stateful operation.
Apache Spark provides users with a high-level, simple-to-use DataFrame/Dataset API for working with both batch and streaming data. The funny thing about batch workloads is that people tend to run them over and over again. Structured Streaming allows users to run those same workloads, with the exact same business logic, in a streaming fashion, helping users answer questions at lower latencies.
In this talk, we will focus on stateful operations with Structured Streaming and we will demonstrate through live demos, how NoSQL stores can be plugged in as a fault tolerant state store to store intermediate state, as well as used as a streaming sink, where the output data can be stored indefinitely for downstream applications.
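As a rough illustration of the stateful logic involved, here is a minimal plain-Python sketch of event-time tumbling-window counting with a watermark that drops overly late events (a toy model, not the Structured Streaming API; all names are made up):

```python
from collections import defaultdict

def tumbling_window_counts(events, window_sec, watermark_sec):
    """Toy event-time aggregation: count events per tumbling window,
    discarding events that arrive later than the watermark allows."""
    counts = defaultdict(int)
    max_event_time = 0
    for event_time, _payload in events:
        max_event_time = max(max_event_time, event_time)
        # Watermark: events older than (max seen time - allowed delay) are dropped,
        # which is what lets the engine eventually finalize old windows.
        if event_time < max_event_time - watermark_sec:
            continue
        window_start = (event_time // window_sec) * window_sec
        counts[window_start] += 1
    return dict(counts)

# Events are (event_time_sec, payload); the last two arrive far too late.
events = [(10, "a"), (11, "b"), (70, "c"), (12, "d"), (5, "late")]
```

With a 60-second window and a 20-second watermark, the two late events are dropped and only the windows starting at 0 and 60 accumulate counts.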
Scylla Summit 2017: Scylla on Samsung NVMe Z-SSDs
I will be giving a talk about performance characterization and tuning of Scylla on Samsung NVMe SSDs. We will characterize the performance of Scylla on Samsung high-performance NVMe SSDs and show how Z-SSD ─ the Samsung ultra-low-latency NVMe drive ─ can significantly shrink the performance gap between in-memory and in-storage with Scylla.
We will further evaluate the throughput-vs-latency profile of Scylla with NVMe devices and present end-to-end latencies (from the client's viewpoint) as well as the latencies of the software/hardware stack. We will show that a Z-SSD-backed Scylla cluster can provide competitive performance to an in-memory deployment while sharply reducing costs.
Scylla Summit 2017 Keynote: NextGen NoSQL with CEO Dor Laor
ScyllaDB CEO and co-founder Dor Laor shares his vision for Scylla and announces Scylla 2.0, a big step towards the first autonomous NoSQL database—one that dynamically tunes itself to varying conditions while always maintaining a high level of performance.
Scylla Summit 2017: Running a Soft Real-time Service at One Million QPS
AdGear runs an ad tech gateway at more than one million queries per second to Scylla and recently transitioned from Apache Cassandra. In this talk, we will highlight the tools and languages that we use (Erlang), how we do bulk imports, and how performance compares between the two database engines.
Duarte Nunes presented on distributed materialized views in ScyllaDB. He discussed the challenges of implementing materialized views in a distributed system without a single master, including propagating updates from base tables to views, handling consistency when tables can diverge, and managing concurrent updates safely. His proposed solution uses asynchronous replica-based propagation paired with repair mechanisms and locking or optimistic concurrency to address these issues. Materialized views provide powerful indexing capabilities but also introduce performance overhead that is difficult to avoid given Scylla's data model.
Scylla Summit 2017: How to Optimize and Reduce Inter-DC Network Traffic and S...
This presentation covers optimizing inter-data center communication: what it involves, the costs associated with it, and best practices for configuring snitches, keyspaces, client drivers, and per-query consistency levels. It recommends network topology replication strategies over simple strategies for multi-region deployments, setting load balancing and consistency levels appropriately in clients, and enabling internode compression to reduce the cost of communication between data centers. It also encourages reviewing client locations, data access patterns, and who is reading and writing data, and holding conversations between operations and development teams to determine the best approach for each use case.
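To make the consistency-level advice concrete, here is a toy calculation of how many replicas must respond under QUORUM versus LOCAL_QUORUM (a sketch assuming a NetworkTopologyStrategy-style replication map; the function names are illustrative):

```python
def quorum(replication_factor):
    """Majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1

def local_quorum_nodes(rf_per_dc, local_dc):
    """LOCAL_QUORUM only needs a quorum of replicas in the coordinator's
    own data center, keeping cross-DC round trips off the request path."""
    return quorum(rf_per_dc[local_dc])

# NetworkTopologyStrategy-style replication: 3 replicas in each DC.
rf = {"us-east": 3, "eu-west": 3}
```

With three replicas per DC, LOCAL_QUORUM waits for two local replicas, while a full QUORUM over all six replicas would wait for four, potentially crossing the WAN.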
Scylla Summit 2017: Scylla's Open Source Monitoring Solution
Scylla's monitoring capability has come a long way in the last year. We now have native support for Prometheus. Through scylla-grafana-monitoring, we have started providing default dashboards summarizing the most important aspects of Scylla for users. In this talk, I will cover what is currently available in our metrics, other non-standard metrics that are interesting but not available in our main dashboard, as well as our future plans for enhancement.
Kubernetes is a declarative system for automatically deploying, managing, and scaling applications and their dependencies. In this short talk, I'll demonstrate a small Scylla cluster running in Google Compute Engine via Kubernetes and our publicly-published Docker images.
Scylla Summit 2017: Snapfish's Journey Towards Scylla
Snapfish, a web-based photo and printing service, will walk through their evaluation process for a new database, discuss use cases, and how they plan to use Scylla in their production systems.
Scylla Summit 2017: How to Ruin Your Workload's Performance by Choosing the W...
In my talk, I will present the different compaction strategies that Scylla provides, and demonstrate when it is appropriate and when it is inappropriate to use each one. I will then present a new compaction strategy that we designed as a lesson from the existing compaction strategies by picking the best features of the existing strategies while avoiding their problems.
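The trade-offs between strategies are easier to see with a toy model. The sketch below mimics size-tiered bucketing, grouping SSTables of similar size into compaction candidates (illustrative only; the 0.5/1.5 thresholds mirror commonly documented defaults, and this is not Scylla's implementation):

```python
def size_tiered_buckets(sstable_sizes, bucket_low=0.5, bucket_high=1.5):
    """Toy size-tiered bucketing: group SSTables whose size falls within
    [bucket_low, bucket_high] times a bucket's running average size."""
    buckets = []  # each bucket is [average_size, [member_sizes]]
    for size in sorted(sstable_sizes):
        for bucket in buckets:
            avg, members = bucket
            if bucket_low * avg <= size <= bucket_high * avg:
                members.append(size)
                bucket[0] = sum(members) / len(members)  # update average
                break
        else:
            buckets.append([size, [size]])
    return [members for _avg, members in buckets]
```

Similar-sized tables end up in the same bucket and would be compacted together, which is exactly why this strategy behaves badly for some workloads (e.g., it can keep overwritten data alive across tiers).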
Scylla Summit 2017: A Toolbox for Understanding Scylla in the Field
In this talk, we will share useful tools and techniques that we are using in the field to understand Scylla clusters. Users will learn how to use those same tools to better understand their deployment.
Some of the questions that will be answered are:
- how to find out which queries are the slowest and why
- how we go about understanding the impact of the data model in a node's performance
- how to check which resources are the bottlenecks in the cluster
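As a flavor of the first question, a trace of (statement, latency) pairs can be aggregated to surface the slowest statements. This plain-Python sketch is purely illustrative, not one of the actual field tools:

```python
from statistics import mean

def slowest_queries(trace_rows, top_n=2):
    """Aggregate per-statement latencies from (statement, latency_ms) rows
    and return the top-N statements ranked by mean latency."""
    by_stmt = {}
    for stmt, latency in trace_rows:
        by_stmt.setdefault(stmt, []).append(latency)
    ranked = sorted(by_stmt.items(), key=lambda kv: mean(kv[1]), reverse=True)
    return [(stmt, round(mean(lat), 1)) for stmt, lat in ranked[:top_n]]

# Hypothetical trace rows: (CQL statement, latency in milliseconds).
rows = [("SELECT a", 5), ("SELECT a", 7), ("SELECT b", 30), ("SELECT c", 1)]
```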
If You Care About Performance, Use User Defined Types
Shlomi Livne, VP of R&D at ScyllaDB, presented on the performance benefits of using user-defined types (UDTs) in ScyllaDB. He explained that with traditional columns, each column has overhead and flexibility comes at a price. However, with frozen UDTs, the columns are treated as a single unit, sharing metadata and improving performance. Livne showed results of a test where UDTs with many fields outperformed traditional columns with the same number of fields. However, he noted that Scylla's row cache and Java driver performance need improvement for UDTs.
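A toy cost model makes the intuition concrete: count one storage cell per regular column, versus a single cell holding all fields of a frozen UDT (illustrative only; real per-cell overhead and serialization details differ):

```python
def cell_count(row, frozen_udt_fields=()):
    """Toy cost model: each regular column is its own cell with per-cell
    metadata, while all fields of a frozen UDT serialize into one cell."""
    frozen = set(frozen_udt_fields)
    regular = [column for column in row if column not in frozen]
    return len(regular) + (1 if frozen else 0)

# Hypothetical row: an id plus three address fields.
row = {"id": 1, "street": "x", "city": "y", "zip": "z"}
```

Folding the three address fields into a frozen UDT halves the cell count in this toy model, which is the kind of per-cell overhead reduction the talk measures.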
Scylla Summit 2017: Saving Thousands by Running Scylla on EC2 Spot Instances
Scylla and Spotinst together provide a strong combination of extreme performance and cost reduction. In this talk, we will present how a Scylla cluster can be used on AWS’s EC2 Spot without losing consistency with the help of Spotinst prediction technology and advanced stateful features. We will show a live demo on how to run Scylla on the Spotinst platform.
Scylla Summit 2017: How Baidu Runs Scylla on a Petabyte-Level Big Data Platform
In this presentation, I'll speak of the benefits of running Scylla on our Big Data environment which stores over 500TB of data as well as using Scylla as the indexing engine to replace MongoDB and Cassandra for our log data analysis platform.
Scylla Summit 2017: Welcome and Keynote - Nextgen NoSQL
Our CEO and co-founder Dor Laor and our chairman Benny Schnaider shared their vision for Scylla. This was also our opportunity to announce Scylla 2.0. Our latest release is a big step toward the first autonomous NoSQL database—one that dynamically tunes itself to varying conditions while always maintaining a high level of performance.
Scylla Summit 2017: Repair, Backup, Restore: Last Thing Before You Go to Prod...
Benchmarks are fun to do but when going to production, all sorts of things can happen: anything from hardware outages to human error bringing your database down. Even in a healthy database, a lot of maintenance operations have to periodically run. Do you have the tools necessary to make sure you are good to go?
Scylla Summit 2017 Keynote: NextGen NoSQL with Chairman Benny Schnaider
The document summarizes the keynote given by ScyllaDB chairman Benny Schnaider on next-generation NoSQL. It discusses the evolution of NoSQL databases, with early generations having inefficiencies and issues that required workarounds. The presentation introduces Scylla, a next-generation NoSQL database built from the ground up by storage and operating-systems experts to massively scale modern applications. Scylla leverages 20 years of database evolution and is implemented in C++ to provide better performance, stability, and the ability to scale out across infrastructure.
Scylla Summit 2017: Stretching Scylla Silly: The Datastore of a Graph Databas...
In this talk, we will cover the lay of the land of graph databases. We will talk about what it takes to run a highly available hosted solution in the cloud while giving users a seamless vertical and horizontal scaling solution, and share our experiences migrating from an Apache Cassandra backed graphDB as-a-service solution.
Scylla Summit 2017: Cry in the Dojo, Laugh in the Battlefield: How We Constan...
Testing a complex system like Scylla is a challenge on its own. There are many environments, workloads, and problems. Simple problems become increasingly worse at scale. In this talk, we will explore the testing method that we employ in our QA lab and our plans to make it even better in years to come.
Scylla Summit 2017: Managing 10,000 Node Storage Clusters at Twitter
If you’ve ever run a distributed database, you know that managing stateful systems is time-consuming and hard. I’ll talk about why that is, the path we took to make Twitter’s Manhattan database easy to run with thousands of nodes and multiple feature sets, and how you should think about operations.
Scylla Summit 2017: Streaming ETL in Kafka for Everyone with KSQL
Apache Kafka is a high-throughput distributed streaming platform that is being adopted by hundreds of companies to manage their real-time data. KSQL is an open source streaming SQL engine that implements continuous, interactive queries against Apache Kafka™. KSQL makes it easy to read, write and process streaming data in real-time, at scale, using SQL-like semantics. In my talk, I will discuss streaming ETL from Kafka into stores like Apache Cassandra using KSQL.
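KSQL itself is SQL over Kafka topics; as a language-neutral illustration, this plain-Python sketch mimics a continuous `SELECT ... WHERE type = 'pageview'` ETL from a source stream into a sink table (all names and fields are hypothetical):

```python
def streaming_etl(source_events, sink):
    """Toy continuous ETL: filter and reshape events from a 'topic' and
    append them to a 'table' (here, a plain list standing in for the sink)."""
    for event in source_events:
        if event.get("type") != "pageview":  # the KSQL-style WHERE clause
            continue
        sink.append({"user": event["user"], "url": event["url"]})
    return sink

# Hypothetical source stream with mixed event types.
events = [
    {"type": "pageview", "user": "u1", "url": "/a"},
    {"type": "click",    "user": "u1", "url": "/a"},
    {"type": "pageview", "user": "u2", "url": "/b"},
]
```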
CassieQ: The Distributed Message Queue Built on Cassandra (Anton Kropp, Cural...
Building queues on distributed data stores is hard, and has long been considered an antipattern. However, with careful consideration and tactics, it is possible to do. CassieQ is an implementation of a distributed queue on Cassandra which supports easy installation, massive data ingest, authentication, a simple-to-use HTTP-based API, and no dependencies other than your already existing Cassandra environment.
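One common tactic for queues on a data store is an invisibility timeout: a consumed message is hidden until it is acked, and reappears if the consumer dies. This in-memory toy models that semantic (it is not CassieQ's actual design or API; the logical clock stands in for wall time):

```python
class ToyQueue:
    """Toy queue with invisibility timeouts, driven by a logical clock."""

    def __init__(self, timeout=30):
        self.timeout = timeout
        self.messages = {}  # message id -> (payload, visible_at)

    def put(self, mid, payload, now=0):
        self.messages[mid] = (payload, now)

    def get(self, now):
        """Return the first visible message and hide it until acked or timed out."""
        for mid, (payload, visible_at) in self.messages.items():
            if visible_at <= now:
                self.messages[mid] = (payload, now + self.timeout)
                return mid, payload
        return None

    def ack(self, mid):
        """Acknowledge (delete) a message so it can never be redelivered."""
        self.messages.pop(mid, None)
```

A message fetched at time 0 is invisible until time 30; if it is never acked, a later consumer receives it again, which gives at-least-once delivery.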
About the Speakers
Anton Kropp Senior Software Engineer, Curalate
Anton Kropp is a senior engineer with over 8 years of experience building distributed and fault-tolerant systems. He has worked at companies big and small (Godaddy, PracticeFusion), and enjoys building frameworks and tooling to make life easier, with a penchant for dockerized containers and simple APIs. When he's not messing around on his computer, he's drinking local Seattle beers, zipping around the city on his electric bike, and hanging out with his wife and dog.
ScyllaDB CTO Avi Kivity gave a keynote on how Scylla has evolved. He discussed new features in Scylla 2.0—including Materialized Views and Heat-Weighted Load Balancing, changes in monitoring—and shared our product roadmap. He also talked about our recent acquisition of Seastar.io and how it will enable us to deliver a database-as-a-service offering.
How to Monitor and Size Workloads on AWS i3 instances
There is a new class of machines in town! Amazon recently unveiled i3, an instance family targeted at I/O-intensive workloads. Scylla will officially support i3, and previews are already available.
Join our webinar to learn how to build a state-of-the-art database solution. Presenters Glauber Costa and Eyal Gutkind will cover how to:
- Determine which workloads can benefit from i3 instances
- Ensure Scylla fully leverages the great resources in the i3 family
- Effectively navigate the Scylla monitoring system and identify bottlenecks
You'll also see a live demonstration with a dashboard featuring an i3 cluster with different data models and workloads.
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
“In Spark 2.0, we have extended DataFrames and Datasets to handle real time streaming data. This not only provides a single programming abstraction for batch and streaming data, it also brings support for event-time based processing, out-of-order/delayed data, sessionization and tight integration with non-streaming data sources and sinks. In this talk, I will take a deep dive into the concepts and the API and show how this simplifies building complex “Continuous Applications”.” - T.D.
Databricks Blog: "Structured Streaming In Apache Spark 2.0: A new high-level API for streaming"
https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
// About the Presenter //
Tathagata Das is an Apache Spark Committer and a member of the PMC. He’s the lead developer behind Spark Streaming, and is currently employed at Databricks. Before Databricks, you could find him at the AMPLab of UC Berkeley, researching datacenter frameworks and networks with professors Scott Shenker and Ion Stoica.
Follow T.D. on -
Twitter: https://twitter.com/tathadas
LinkedIn: https://www.linkedin.com/in/tathadas
This document provides an overview of Spark Streaming and Structured Streaming. It discusses what Spark Streaming is, its framework, and its drawbacks. It then introduces Structured Streaming, which models a stream as an infinite dataset, and describes its output modes and advantages, such as event-time processing and handling of late data. It covers window operations, watermarking for late data, and the different types of stream-stream joins, such as inner and outer joins; watermarks and time constraints are needed in joins to bound state and produce correct results.
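A stream-stream join's time constraint can be sketched in a few lines: only records with matching keys whose event times are close enough are joined, which is also what lets an engine expire buffered state (toy model, not Spark's implementation):

```python
def interval_join(left, right, max_gap):
    """Toy stream-stream inner join: match records with equal keys whose
    event times differ by at most max_gap. In a real engine this bound is
    what allows old buffered rows to be dropped from the state store."""
    out = []
    for lkey, ltime in left:
        for rkey, rtime in right:
            if lkey == rkey and abs(ltime - rtime) <= max_gap:
                out.append((lkey, ltime, rtime))
    return out
```

Without the `max_gap` bound, every row would have to be buffered forever in case a match arrives later, so the constraint is a correctness and a resource requirement at once.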
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016
Tathagata 'TD' Das presented at Bay Area Apache Spark Meetup. This talk covers the merits and motivations of Structured Streaming, and how you can start writing end-to-end continuous applications using Structured Streaming APIs.
This document discusses the new Data Pump utilities in Oracle Database 10g for high-performance data movement. Data Pump allows loading and unloading of data and metadata in a server-based, parallel manner using direct path APIs. It provides automatic parallelism, checkpoint/restart capabilities, fine-grained object selection, monitoring, and improved performance over traditional Export/Import - achieving speeds up to 40x faster for data loading. The new expdp/impdp clients offer enhanced functionality while Data Pump serves as the foundation for other Oracle technologies requiring fast data movement. Customers have reported significant performance gains during beta testing of Data Pump.
This presentation aims to be useful by covering the following topics:
- Modern Data Processing System Architectures and Models,
- Batch and Stream Processing Pipelines' details,
- Apache Spark Architecture and Internals,
- Real life use cases used with Apache Spark.
There’s a lot of buzz around the different DevOps tools being thrown around, and it can be difficult to break through the noise. We plan to share our success story of what to do and what not to do while powering your software with the most acclaimed DevOps technologies. From provisioning clusters with Kubernetes to scaling the product for a global user base; from streaming live data using Kafka/Spark to consolidating it in Athena; from monitoring with Kibana to continuously integrating and deploying with GoCD, we promise you a smooth ride. Come hear our journey of moving a monolith to elastic infrastructure.
An Architect's guide to real time big data systems
Introduction to real time big data, stream computing using Infosphere Streams and Apache Storm. Presented in a Big Data Conference in Singapore, Jul 2014.
Scylla Summit 2017: How to Run Cassandra/Scylla from a MySQL DBA's Point of View
Are you a MySQL DBA or DevOps individual being asked to run Cassandra or Scylla? Feeling overwhelmed? In this talk, I will present Cassandra/Scylla operations in terms that directly relate to MySQL. I will show you comparisons between the Information Schema and the Cassandra/Scylla System keyspace(s). I will also talk about metrics available in MySQL versus Cassandra/Scylla and how to retrieve them. Finally, I will talk about how MySQL replication compares with Cassandra replication. Hopefully, when I am done you will be able to relate to Cassandra operations in a practical and useful way.
Scylla Summit 2017: A Deep Dive on Heat Weighted Load Balancing
This presentation discusses the "cold node problem" that occurs when a node restarts in a Cassandra cluster. When a node restarts, it loses its cached data and becomes a bottleneck. The presentation proposes a "heat weighted load balancing" solution where the cluster tracks each node's cache hit ratio and redistributes requests based on this ratio after a restart. Testing shows this solution significantly improves throughput after a node restart by distributing requests more evenly across nodes based on their "heat" or cache contents.
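The core idea can be sketched as weighting each replica's share of requests by its cache hit ratio (a toy model with made-up numbers, not Scylla's actual algorithm):

```python
def heat_weighted_shares(cache_hit_ratios, total_requests):
    """Toy heat-weighted balancing: give each replica a share of requests
    proportional to its cache hit ratio ('heat'), so a freshly restarted
    cold node warms up gradually instead of becoming the latency bottleneck."""
    total_heat = sum(cache_hit_ratios.values())
    return {node: round(total_requests * heat / total_heat)
            for node, heat in cache_hit_ratios.items()}

# Hypothetical cluster state: n3 just restarted with an empty cache.
heat = {"n1": 0.9, "n2": 0.9, "n3": 0.1}
```

Out of 1,000 requests, the cold node receives only a small slice; as its cache fills and its hit ratio rises, its share grows back toward an even split.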
Scylla Summit 2017: How to Use Gocql to Execute Queries and What the Driver D...
This document outlines a presentation on using the GoCQL driver to execute queries against Cassandra and Scylla databases. It discusses connecting to a Cassandra cluster, executing queries, iterating over results, and using asynchronous queries. It also mentions some additional Cassandra libraries built on top of GoCQL, including gocqlx for data binding and queries, and gocassa for queries and migrations. The presentation aims to explain how GoCQL works behind the scenes and how to get started with basic querying functionality.
Scylla Summit 2017: Scylla on Samsung NVMe Z-SSDsScyllaDB
I will be giving a talk about performance characterization and tuning of Scylla on Samsung NVMe SSDs. We will characterize the performance of Scylla on Samsung high-performance NVMe SSDs and show how Z-SSD ─ the Samsung ultra-low-latency NVMe drive ─ can significantly shrink the performance gap between in-memory and in-storage with Scylla.
We will further evaluate the throughput-vs-latency profile of Scylla with NVMe devices and present end-to-end latencies (from the client's viewpoint) as well as the latencies of the software/hardware stack. We will show that a Z-SSD-backed Scylla cluster can provide competitive performance to an in-memory deployment while sharply reducing costs.
Scylla Summit 2017 Keynote: NextGen NoSQL with CEO Dor LaorScyllaDB
ScyllaDB CEO and co-founder Dor Laor shares his vision for Scylla and announces Scylla 2.0, a big step towards the first autonomous NoSQL database—one that dynamically tunes itself to varying conditions while always maintaining a high level of performance.
Scylla Summit 2017: Running a Soft Real-time Service at One Million QPSScyllaDB
AdGear runs an ad tech gateway at more than one million queries per second to Scylla and recently transitioned from Apache Cassandra. In this talk, we will highlight the tools and languages that we use (Erlang), how we do bulk imports, and how performance compares between the two database engines.
Duarte Nunes presented on distributed materialized views in ScyllaDB. He discussed the challenges of implementing materialized views in a distributed system without a single master, including propagating updates from base tables to views, handling consistency when tables can diverge, and managing concurrent updates safely. His proposed solution uses asynchronous replica-based propagation paired with repair mechanisms and locking or optimistic concurrency to address these issues. Materialized views provide powerful indexing capabilities but also introduce performance overhead that is difficult to avoid given Scylla's data model.
Scylla Summit 2017: How to Optimize and Reduce Inter-DC Network Traffic and S...ScyllaDB
The document appears to be a presentation on optimizing inter-data center communication. It discusses key topics like what inter-data center communication involves, the costs associated with it, best practices for setting snitches, keyspaces, client drivers and consistency levels for queries to optimize performance between data centers. It recommends using network topology replication strategies over simple strategies for multi-region deployments, setting load balancing and consistency levels appropriately in clients, and enabling internode compression to reduce costs of communication between data centers. The presentation encourages reviewing client locations, data access patterns, who is reading/writing data, and having conversations between operations and development teams to determine the best use cases.
Scylla Summit 2017: Scylla's Open Source Monitoring SolutionScyllaDB
Scylla's monitoring capability has come a long way in the last year. We now have native support for Prometheus. Through scylla-grafana-monitoring, we have started providing default dashboards summarizing the most important aspects of Scylla for users. In this talk, I will cover what is currently available in our metrics, other non-standard metrics that are interesting but not available in our main dashboard, as well as our future plans for enhancement.
Kubernetes is a declarative system for automatically deploying, managing, and scaling applications and their dependencies. In this short talk, I'll demonstrate a small Scylla cluster running in Google Compute Engine via Kubernetes and our publicly-published Docker images.
Scylla Summit 2017: Snapfish's Journey Towards ScyllaScyllaDB
Snapfish, a web-based photo and printing service, will walk through their evaluation process for a new database, discuss use cases, and how they plan to use Scylla in their production systems.
Scylla Summit 2017: How to Ruin Your Workload's Performance by Choosing the W...ScyllaDB
In my talk, I will present the different compaction strategies that Scylla provides, and demonstrate when it is appropriate and when it is inappropriate to use each one. I will then present a new compaction strategy that we designed as a lesson from the existing compaction strategies by picking the best features of the existing strategies while avoiding their problems.
Scylla Summit 2017: A Toolbox for Understanding Scylla in the FieldScyllaDB
In this talk, we will share useful tools and techniques that we are using in the field to understand Scylla clusters. Users will learn how to use those same tools to better understand their deployment.
Some of the questions that will be answered are:
- how to find out which queries are the slowest and why
- how we go about understanding the impact of the data model in a node's performance
- how to check which resources are the bottlenecks in the cluster
If You Care About Performance, Use User Defined TypesScyllaDB
Shlomi Livne, VP of R&D at ScyllaDB, presented on the performance benefits of using user-defined types (UDTs) in ScyllaDB. He explained that with traditional columns, each column has overhead and flexibility comes at a price. However, with frozen UDTs, the columns are treated as a single unit, sharing metadata and improving performance. Livne showed results of a test where UDTs with many fields outperformed traditional columns with the same number of fields. However, he noted that Scylla's row cache and Java driver performance need improvement for UDTs.
Scylla Summit 2017: Saving Thousands by Running Scylla on EC2 Spot InstancesScyllaDB
Scylla and Spotinst together provide a strong combination of extreme performance and cost reduction. In this talk, we will present how a Scylla cluster can be used on AWS’s EC2 Spot without losing consistency with the help of Spotinst prediction technology and advanced stateful features. We will show a live demo on how to run Scylla on the Spotinst platform.
Scylla Summit 2017: How Baidu Runs Scylla on a Petabyte-Level Big Data PlatformScyllaDB
In this presentation, I'll speak of the benefits of running Scylla on our Big Data environment which stores over 500TB of data as well as using Scylla as the indexing engine to replace MongoDB and Cassandra for our log data analysis platform.
Scylla Summit 2017: Welcome and Keynote - Nextgen NoSQLScyllaDB
Our CEO and co-founder Dor Laor and our chairman Benny Schnaider sharing their vision for Scylla. This was also our opportunity to announce Scylla 2.0. Our latest release is a big step toward the first autonomous NoSQL database—one that dynamically tunes itself to varying conditions while always maintaining a high level of performance.
Scylla Summit 2017: Repair, Backup, Restore: Last Thing Before You Go to Prod...ScyllaDB
Benchmarks are fun to do but when going to production, all sorts of things can happen: anything from hardware outages to human error bringing your database down. Even in a healthy database, a lot of maintenance operations have to periodically run. Do you have the tools necessary to make sure you are good to go?
Scylla Summit 2017 Keynote: NextGen NoSQL with Chairman Benny SchnaiderScyllaDB
The document summarizes Benny Schnaider's presentation as the Chairman of NEXTGEN NOSQL. It discusses the evolution of NoSQL databases, with early generations having inefficiencies and issues that required workarounds. The presentation introduces Scylla, a next-generation NoSQL database that was built from the ground up by storage and operating systems experts to massively scale modern applications. Scylla leverages 20 years of database evolution and is implemented in C++ to provide better performance, stability and the ability to scale out across infrastructure.
Scylla Summit 2017: Stretching Scylla Silly: The Datastore of a Graph Databas...ScyllaDB
In this talk, we will cover the lay of the land of graph databases. We will talk about what it takes to run a highly available hosted solution in the cloud while giving users a seamless vertical and horizontal scaling solution, and share our experiences migrating from an Apache Cassandra backed graphDB as-a-service solution.
Scylla Summit 2017: Cry in the Dojo, Laugh in the Battlefield: How We Constan...ScyllaDB
Testing a complex system like Scylla is a challenge on its own. There are many environments, workloads, and problems. Simple problems become increasingly worse at scale. In this talk, we will explore the testing method that we employ in our QA lab and our plans to make it even better in years to come.
Scylla Summit 2017: Managing 10,000 Node Storage Clusters at TwitterScyllaDB
If you’ve ever run a distributed database, you know that managing stateful systems is time-consuming and hard. I’ll talk about why that is, the path we took to make Twitter’s Manhattan database easy to run with thousands of nodes and multiple feature sets, and how you should think about operations.
Scylla Summit 2017: Streaming ETL in Kafka for Everyone with KSQLScyllaDB
Apache Kafka is a high-throughput distributed streaming platform that is being adopted by hundreds of companies to manage their real-time data. KSQL is an open source streaming SQL engine that implements continuous, interactive queries against Apache Kafka™. KSQL makes it easy to read, write and process streaming data in real-time, at scale, using SQL-like semantics. In my talk, I will discuss streaming ETL from Kafka into stores like Apache Cassandra using KSQL.
CassieQ: The Distributed Message Queue Built on Cassandra (Anton Kropp, Cural...DataStax
Building queues on distributed data stores is hard, and long been considered an antipattern. However, with careful consideration and tactics, it is possible to do. CassieQ is an implementation of a distributed queue on Cassandra which supports easy installation, massive data ingest, authentication, a simple to use HTTP based API, and no dependencies other than your already existing Cassandra environment.
About the Speakers
Anton Kropp Senior Software Engineer, Curalate
Anton Kropp is a senior engineer with over 8 years experience building distributed and fault tolerant systems. He has worked at companies big and small (Godaddy, PracticeFusion), and enjoys building frameworks and tooling to make life easier with a penchant for dockerized containers and simple API's. When he's not messing around on his computer he's drinking local Seattle beers, zipping around the city on his electric bike, and hanging out with his wife and dog.
ScyllaDB CTO Avi Kivity gave a keynote on how Scylla has evolved. He discussed new features in Scylla 2.0—including Materialized Views and Heat-Weighted Load Balancing, changes in monitoring—and shared our product roadmap. He also talked about our recent acquisition of Seastar.io and how it will enable us to deliver a database-as-a-service offering.
How to Monitor and Size Workloads on AWS i3 instancesScyllaDB
There is a new class of machines in town! Amazon recently unveiled i3, a new class of machines targeted at I/O-intensive workloads. Scylla will officially support i3, and previews are already available.
Join our webinar to learn how to build a state-of-the-art database solution. Presenters Glauber Costa and Eyal Gutkind will cover how to:
- Determine which workloads can benefit from i3 instances
- Ensure Scylla fully leverages the great resources in the i3 family
- Effectively navigate the Scylla monitoring system and identify bottlenecks
You'll also see a live demonstration with a dashboard featuring an i3 cluster with different data models and workloads.
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das Databricks
“In Spark 2.0, we have extended DataFrames and Datasets to handle real time streaming data. This not only provides a single programming abstraction for batch and streaming data, it also brings support for event-time based processing, out-or-order/delayed data, sessionization and tight integration with non-streaming data sources and sinks. In this talk, I will take a deep dive into the concepts and the API and show how this simplifies building complex “Continuous Applications”.” - T.D.
Databricks Blog: "Structured Streaming In Apache Spark 2.0: A new high-level API for streaming"
https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
// About the Presenter //
Tathagata Das is an Apache Spark Committer and a member of the PMC. He’s the lead developer behind Spark Streaming, and is currently employed at Databricks. Before Databricks, you could find him at the AMPLab of UC Berkeley, researching datacenter frameworks and networks with professors Scott Shenker and Ion Stoica.
Follow T.D. on -
Twitter: https://twitter.com/tathadas
LinkedIn: https://www.linkedin.com/in/tathadas
This document provides an overview of Spark Streaming and Structured Streaming. It discusses what Spark Streaming is, its framework, and drawbacks. It then introduces Structured Streaming, which models streams as infinite datasets. It describes output modes, advantages like handling late data and event times. It covers window operations, watermarking for late data, and different types of stream-stream joins like inner and outer joins. Watermarks and time constraints are needed for joins to handle state and provide correct results.
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016 Databricks
Tathagata 'TD' Das presented at Bay Area Apache Spark Meetup. This talk covers the merits and motivations of Structured Streaming, and how you can start writing end-to-end continuous applications using Structured Streaming APIs.
This document discusses the new Data Pump utilities in Oracle Database 10g for high-performance data movement. Data Pump allows loading and unloading of data and metadata in a server-based, parallel manner using direct path APIs. It provides automatic parallelism, checkpoint/restart capabilities, fine-grained object selection, monitoring, and improved performance over traditional Export/Import - achieving speeds up to 40x faster for data loading. The new expdp/impdp clients offer enhanced functionality while Data Pump serves as the foundation for other Oracle technologies requiring fast data movement. Customers have reported significant performance gains during beta testing of Data Pump.
This presentation aims to be useful by covering the following topics:
- Modern Data Processing System Architectures and Models,
- Batch and Stream Processing Pipelines' details,
- Apache Spark Architecture and Internals,
- Real life use cases used with Apache Spark.
There’s a lot of buzz around different DevOps tools, and it can be difficult to break through the noise. We plan to share our success story of what to do and not to do while powering your software with the most acclaimed DevOps technologies. From provisioning clusters with Kubernetes to scaling the product for a global user base; from streaming live data using Kafka/Spark to consolidating it in Athena; from monitoring with Kibana to continuously integrating and deploying with GoCD, we promise you a smooth ride. Come hear our journey of moving a monolith to elastic infrastructure.
An Architect's Guide to Real Time Big Data Systems - Raja SP
Introduction to real time big data, stream computing using Infosphere Streams and Apache Storm. Presented in a Big Data Conference in Singapore, Jul 2014.
Taking Spark Streaming to the Next Level with Datasets and DataFrames - Databricks
Structured Streaming provides a simple way to perform streaming analytics by treating unbounded, continuous data streams similarly to static DataFrames and Datasets. It allows for event-time processing, windowing, joins, and other SQL operations on streaming data. Under the hood, it uses micro-batch processing to incrementally and continuously execute queries on streaming data using Spark's SQL engine and Catalyst optimizer. This allows for high-level APIs as well as end-to-end guarantees like exactly-once processing and fault tolerance through mechanisms like offset tracking and a fault-tolerant state store.
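The micro-batch model described above, where each trigger incrementally folds only the newly arrived rows into running state instead of recomputing from scratch, can be illustrated with a minimal pure-Python sketch (the function name and structure are invented for the example, not the actual Spark engine):

```python
from collections import Counter

# Conceptual sketch of micro-batch incremental execution: each trigger
# processes only the newly arrived rows and folds them into running state,
# then emits the full result table ("complete" output mode in miniature).
def run_micro_batches(batches):
    """batches: list of lists of words; each inner list is one trigger's new data."""
    state = Counter()            # the fault-tolerant state store, in miniature
    results = []
    for batch in batches:
        state.update(batch)      # incremental: only new rows are processed
        results.append(dict(state))
    return results

results = run_micro_batches([["a", "b"], ["b", "c"]])
```

After the second trigger the running counts reflect both batches, even though only the second batch's rows were touched on that trigger.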
WITSML Data Processing with Kafka and Spark Streaming - Mark Kerzner
This document summarizes a presentation about using Kafka and Spark Streaming to process real-time well data in WITSML format. It discusses WITSML data standards, using Kafka as a messaging system to ingest WITSML data from rigs and service companies, and Spark Streaming to consume Kafka topics and apply rules to detect anomalies and send alerts. Visualizing the data in real-time using Highcharts javascript is also covered. Lessons learned focus on improving data partitioning and managing producer/consumer services.
There’s a lot of buzz around different DevOps tools, and it can be difficult to break through the noise. We plan to share our success story of what to do and not to do while powering your software with the most acclaimed DevOps technologies. From provisioning clusters with Kubernetes to scaling the product for a global user base; from streaming live data using Kafka/Spark to consolidating it in Athena; from monitoring with Kibana to continuously integrating and deploying with CircleCI, we promise you a smooth ride. Come hear our journey of moving a monolith to elastic infrastructure.
This document provides an overview of Oracle Stream Analytics capabilities for processing fast streaming data. It discusses deployment approaches on Oracle Cloud, hybrid cloud, and on-premises. It also covers event processing techniques like pattern detection, time windows, and continuous querying enabled by Oracle Stream Analytics. Specific use cases for retail and healthcare are also presented.
Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark - Databricks
In this talk, we will introduce some of the new available APIs around stateful aggregation in Structured Streaming, namely flatMapGroupsWithState. We will show how this API can be used to power many complex real-time workflows, including stream-to-stream joins, through live demos using Databricks and Apache Kafka.
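As a rough, pure-Python analog of what an API like flatMapGroupsWithState lets you express - an arbitrary user-defined state value kept per key, updated as each group's new events arrive - consider this sketch (the "count plus last-seen timestamp" summary logic is invented for illustration; it is not the Spark API itself):

```python
# Pure-Python analog (not the Spark API) of per-key arbitrary stateful
# processing: a user-defined state value is kept per key and updated with
# each group's new events, as in sessionization-style workflows.
def update_group(state, key, events):
    """Fold new `events` (timestamps) for `key` into per-key state and
    return the updated summary row, mimicking a user-defined update function."""
    count, last_seen = state.get(key, (0, None))
    for t in events:
        count += 1
        last_seen = t if last_seen is None else max(last_seen, t)
    state[key] = (count, last_seen)   # persisted between triggers
    return key, count, last_seen

state = {}
update_group(state, "user1", [100, 105])   # first trigger's events for user1
row = update_group(state, "user1", [110])  # next trigger updates the same state
```

The key point mirrored here is that the state survives across triggers, so each new micro-batch of a group's events updates a summary rather than starting over.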
The document provides an overview of the SAS system and its components. It describes the four main data-driven tasks of data access, data management, data analysis, and data presentation. It also outlines the structure of SAS programs and data sets, and how to run and submit SAS programs. Key concepts covered include DATA and PROC steps, the SAS log and output, browsing descriptor and data portions of SAS data sets, and SAS syntax rules.
A Deep Dive into Structured Streaming in Apache Spark - Anyscale
This document provides an overview of Structured Streaming in Apache Spark. It begins with a brief history of streaming in Spark and outlines some of the limitations of the previous DStream API. It then introduces the new Structured Streaming API, which allows for continuous queries to be expressed as standard Spark SQL queries against continuously arriving data. It describes the new processing model and how queries are executed incrementally. It also covers features like event-time processing, windows, joins, and fault-tolerance guarantees through checkpointing and write-ahead logging. Overall, the document presents Structured Streaming as providing a simpler way to perform streaming analytics by allowing streaming queries to be expressed using the same APIs as batch queries.
Data Pipeline for The Big Data/Data Science OKC - Mark Smith
The document discusses and evaluates several data pipeline platforms: Spark Structured Streaming, Spring Cloud Data Flow, Apache NIFI, and AWS Glue. It provides an overview of each platform and evaluates them based on several criteria such as real-time processing, managing failures and duplicates, security, scaling to large data sets, and integration with machine learning and data catalogs. Overall, AWS Glue received strong ratings for its data catalog integration, extraction and transformation capabilities as an ETL tool, while Spark Structured Streaming, Apache NIFI, and Spring Cloud Data Flow demonstrated strengths in real-time processing, scalability, and maturity.
Databricks Spark Chief Architect Reynold Xin's keynote at Spark Summit East 2016, discussing streaming, continuous applications, and DataFrames in Spark.
Técnicas e Instrumentos de Recolección de Datos (Data Collection Techniques and Instruments) - Angel Giraldo
This document summarizes the future of real-time analytics in Spark. It discusses how Spark Streaming currently works and outlines new capabilities in Spark 2.0 called Structured Streaming that allow for continuous queries on streaming data using the Spark SQL engine and DataFrame API. This makes streaming analytics simpler by allowing the same queries to run on both batch and streaming data. Structured Streaming will provide features like output modes, event-time processing, and integration with machine learning. It is aimed to unify batch, interactive, and streaming analytics in Spark.
At Improve Digital we collect and store large volumes of machine-generated and behavioural data from our fleet of ad servers. For some time we have performed mostly batch processing through a data warehouse that combines traditional RDBMSs (MySQL), columnar stores (Infobright, Impala+Parquet) and Hadoop.
We wish to share our experiences in enhancing this capability with systems and techniques that process the data as streams in near-realtime. In particular we will cover:
• The architectural need for an approach to data collection and distribution as a first-class capability
• The different needs of the ingest pipeline required by streamed realtime data, the challenges faced in building these pipelines and how they forced us to start thinking about the concept of production-ready data.
• The tools we used, in particular Apache Kafka as the message broker, Apache Samza for stream processing and Apache Avro to allow schema evolution; an essential element to handle data whose formats will change over time.
• The unexpected capabilities enabled by this approach, including the value in using realtime alerting as a strong adjunct to data validation and testing.
• What this has meant for our approach to analytics and how we are moving to online learning and realtime simulation.
This is still a work in progress at Improve Digital with differing levels of production-deployed capability across the topics above. We feel our experiences can help inform others embarking on a similar journey and hopefully allow them to learn from our initiative in this space.
Similar to Scylla Summit 2017: Stateful Streaming Applications with Apache Spark (20)
Unconventional Methods to Identify Bottlenecks in Low-Latency and High-Throug... - ScyllaDB
In this presentation, we explore how standard profiling and monitoring methods may fall short in identifying bottlenecks in low-latency data ingestion workflows. Instead, we showcase the power of simple yet clever methods that can uncover hidden performance limitations.
Attendees will discover unconventional techniques, including clever logging, targeted instrumentation, and specialized metrics, to pinpoint bottlenecks accurately. Real-world use cases will be presented to demonstrate the effectiveness of these methods. By the end of the session, attendees will be equipped with alternative approaches to identify bottlenecks and optimize their low-latency data ingestion workflows for high throughput.
Mitigating the Impact of State Management in Cloud Stream Processing Systems - ScyllaDB
Stream processing is a crucial component of modern data infrastructure, but constructing an efficient and scalable stream processing system can be challenging. Decoupling compute and storage architecture has emerged as an effective solution to these challenges, but it can introduce high latency issues, especially when dealing with complex continuous queries that necessitate managing extra-large internal states.
In this talk, we focus on addressing the high latency issues associated with S3 storage in stream processing systems that employ a decoupled compute and storage architecture. We delve into the root causes of latency in this context and explore various techniques to minimize the impact of S3 latency on stream processing performance. Our proposed approach is to implement a tiered storage mechanism that leverages a blend of high-performance and low-cost storage tiers to reduce data movement between the compute and storage layers while maintaining efficient processing.
Throughout the talk, we will present experimental results that demonstrate the effectiveness of our approach in mitigating the impact of S3 latency on stream processing. By the end of the talk, attendees will have gained insights into how to optimize their stream processing systems for reduced latency and improved cost-efficiency.
Measuring the Impact of Network Latency at Twitter - ScyllaDB
Widya Salim and Victor Ma will outline the causal impact analysis, framework, and key learnings used to quantify the impact of reducing Twitter's network latency.
Architecting a High-Performance (Open Source) Distributed Message Queuing Sys... - ScyllaDB
BlazingMQ is a new open source* distributed message queuing system developed at and published by Bloomberg. It provides highly-performant queues to applications for asynchronous, efficient, and reliable communication. This system has been used at scale at Bloomberg for eight years, where it moves terabytes of data and billions of messages across tens of thousands of queues in production every day.
BlazingMQ provides highly-available, fault-tolerant queues courtesy of replication based on the Raft consensus algorithm. In addition, it provides a rich set of enterprise message routing strategies, enabling users to implement a variety of scenarios for message processing.
Written in C++ from the ground up, BlazingMQ has been architected with low latency as one of its core requirements. This has resulted in some unique design and implementation choices at all levels of the system, such as its lock-free threading model, custom memory allocators, compact wire protocol, multi-hop network topology, and more.
This talk will provide an overview of BlazingMQ. We will then delve into the system’s core design principles, architecture, and implementation details in order to explore the crucial role they play in its performance and reliability.
*BlazingMQ will be released as open source between now and P99 (exact timing is still TBD)
Noise Canceling RUM by Tim Vereecke, Akamai - ScyllaDB
Noisy Real User Monitoring (RUM) data can ruin your P99!
We introduce a fresh concept called "Human Visible Navigations" (HVN) to tackle this risk; we focus on the experiences you actually care about when talking about the speed of our sites:
- Human: We exclude noise coming from bots and synthetic measurements.
- Visible: We remove any partial or fully hidden experiences. These tend to be very slow but users don’t see this slowness.
- Navigations: We ignore lightning fast back-forward navigations which usually have few optimisation opportunities.
Adopting Human Visible Navigations provides you with these key benefits:
- Fewer changes staying below the radar
- Fewer data fluctuations
- Fewer blindspots when finding bottlenecks
- Better correlation with business metrics
This is supported by plenty of real-world examples from the world's largest scale modeling site (6M monthly visits), in combination with aggregated data from the brand new rumarchive.com (open source).
After attending this session, your P99 and other percentiles will become less noisy and easier to tune!
Always-on Profiling of All Linux Threads, On-CPU and Off-CPU, with eBPF & Con... - ScyllaDB
In this session, Tanel introduces a new open source eBPF tool for efficiently sampling both on-CPU events and off-CPU events for every thread (task) in the OS. Linux standard performance tools (like perf) allow you to easily profile on-CPU threads doing work, but if we want to include the off-CPU timing and reasons for the full picture, things get complicated. Combining eBPF task state arrays with periodic sampling for profiling allows us to get both a system-level overview of where threads spend their time, even when blocked and sleeping, and allow us to drill down into individual thread level, to understand why.
Performance Budgets for the Real World by Tammy Everts - ScyllaDB
Performance budgets have been around for more than ten years. Over those years, we’ve learned a lot about what works, what doesn’t, and what we need to improve. In this session, Tammy revisits old assumptions about performance budgets and offers some new best practices. Topics include:
• Understanding performance budgets vs. performance goals
• Aligning budgets with user experience
• Pros and cons of Core Web Vitals
• How to stay on top of your budgets to fight regressions
Using Libtracecmd to Analyze Your Latency and Performance Troubles - ScyllaDB
Trying to figure out why your application is responding late can be difficult, especially if it is because of interference from the operating system. This talk will briefly go over how to write a C program that can analyze what in the Linux system is interfering with your application. It will use trace-cmd to enable kernel trace events as well as tracing lock functions, and it will then go over a quick tutorial on how to use libtracecmd to read the created trace.dat file and uncover the cause of interference to your application.
Reducing P99 Latencies with Generational ZGC - ScyllaDB
With the low-latency garbage collector ZGC, GC pause times are no longer a big problem in Java. With sub-millisecond pause times there are instead other things in the GC and JVM that can cause application threads to experience unexpected latencies. This talk will dig into a specific use where the GC pauses are no longer the cause of unexpected latencies and look at how adding generations to ZGC help lower the p99 application latencies.
5 Hours to 7.7 Seconds: How Database Tricks Sped up Rust Linting Over 2000X - ScyllaDB
Linters are a type of database! They are a collection of lint rules — queries that look for rule violations to report — plus a way to execute those queries over a source code dataset.
This is a case study about using database ideas to build a linter that looks for breaking changes in Rust library APIs. Maintainability and performance are key: new Rust releases tend to have mutually-incompatible ways of representing API information, and we cannot afford to reimplement and optimize dozens of rules for each Rust version separately. Fortunately, databases don't require rewriting queries when the underlying storage format or query plan changes! This allows us to ship massive optimizations and support multiple Rust versions without making any changes to the queries that describe lint rules.
"Ship now, optimize later" can be a sustainable development practice after all — join us to see how!
How Netflix Builds High Performance Applications at Global Scale - ScyllaDB
We all want to build applications that are blazingly fast. We also want to scale them to users all over the world. Can the two happen together? Can users in the slowest of environments also get a fast experience? Learn how we do this at Netflix: how we understand every user's needs and preferences and build high performance applications that work for every user, every time.
Conquering Load Balancing: Experiences from ScyllaDB Drivers - ScyllaDB
Load balancing seems simple on the surface, with algorithms like round-robin, but the real world loves throwing curveballs. Join me in this session as we delve into the intricacies of load balancing within ScyllaDB Drivers. Discover firsthand experiences from our journey in driver development, where we employed the Power of Two Choices algorithm, optimized the implementation of load balancing in Rust Driver, mitigated cloud costs through zone-aware load balancing and combated the issue of overloading a particular core of ScyllaDB. Be prepared to delve into the practical and theoretical aspects of load balancing, gaining valuable insights along the way.
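The Power of Two Choices policy mentioned in this abstract has a well-known standard form, sketched below (this is the textbook algorithm, not ScyllaDB's driver code): sample two candidate nodes at random and route the request to the less loaded one, which keeps load far more balanced than a single random choice.

```python
import random

# "Power of two choices" load-balancing sketch (textbook form, not
# ScyllaDB's driver implementation): pick two nodes at random, send the
# request to the one with less outstanding load.
def pick_node(loads, rng):
    a, b = rng.sample(range(len(loads)), 2)   # two distinct random candidates
    return a if loads[a] <= loads[b] else b   # route to the less-loaded one

rng = random.Random(42)                 # seeded for reproducibility
loads = [0, 0, 0, 0]
for _ in range(1000):
    loads[pick_node(loads, rng)] += 1   # route and record the request
```

Even after a thousand requests, the gap between the busiest and idlest node stays small, which is the property that makes this policy attractive for driver-side balancing.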
Interaction Latency: Square's User-Centric Mobile Performance Metric - ScyllaDB
Mobile performance metrics often take inspiration from the backend world and measure resource usage (CPU usage, memory usage, etc) and workload durations (how long a piece of code takes to run).
However, mobile apps are used by humans and the app performance directly impacts their experience, so we should primarily track user-centric mobile performance metrics. Following the lead of tech giants, the mobile industry at large is now adopting the tracking of app launch time and smoothness (jank during motion).
At Square, our customers spend most of their time in the app long after it's launched, and they don't scroll much, so app launch time and smoothness aren't critical metrics. What should we track instead?
This talk will introduce you to Interaction Latency, a user-centric mobile performance metric inspired by the Web Vital metric "Interaction to Next Paint" (web.dev/inp). We'll go over why apps need to track this, how to properly implement its tracking (it's tricky!), how to aggregate this metric and what thresholds you should target.
How to Avoid Learning the Linux-Kernel Memory Model - ScyllaDB
The Linux-kernel memory model (LKMM) is a powerful tool for developing highly concurrent Linux-kernel code, but it also has a steep learning curve. Wouldn't it be great to get most of LKMM's benefits without the learning curve?
This talk will describe how to do exactly that by using the standard Linux-kernel APIs (locking, reference counting, RCU) along with a simple rules of thumb, thus gaining most of LKMM's power with less learning. And the full LKMM is always there when you need it!
99.99% of Your Traces are Trash by Paige Cruz - ScyllaDB
Distributed tracing is still finding its footing in many organizations today. One challenge to overcome is the data volume: keeping 100% of your traces is expensive and unnecessary. Enter sampling - head-based vs. tail-based, how do you decide? Let’s look at the design of Sifter and get familiar with why tail-based sampling is the way to enact a cost-effective tracing solution while actually increasing the system’s observability.
Square's Lessons Learned from Implementing a Key-Value Store with Raft - ScyllaDB
To put it simply, Raft is used to make a use case (e.g., key-value store, indexing system) more fault tolerant to increase availability using replication (despite server and network failures). Raft has been gaining ground due to its simplicity without sacrificing consistency and performance.
Although we'll cover Raft's building blocks, this is not about the Raft algorithm; it is more about the micro-lessons one can learn from building fault-tolerant, strongly consistent distributed systems using Raft. Things like majority agreement rule (quorum), write-ahead log, split votes & randomness to reduce contention, heartbeats, split-brain syndrome, snapshots & logs replay, client requests dedupe & idempotency, consistency guarantees (linearizability), leases & stale reads, batching & streaming, parallelizing persisting & broadcasting, version control, and more!
And believe it or not, you might be using some of these techniques without even realizing it!
This is inspired by Raft paper (raft.github.io), publications & courses on Raft, and an attempt to implement a key-value store using Raft as a side project.
A Deep Dive Into Concurrent React by Matheus Albuquerque - ScyllaDB
Writing fluid user interfaces becomes more and more challenging as the application complexity increases. In this talk, we’ll explore how proper scheduling improves your app’s experience by diving into some of the concurrent React features, understanding their rationales, and how they work under the hood.
The Latency Stack: Discovering Surprising Sources of Latency - ScyllaDB
Usually, when an API call is slow, developers blame ourselves and our code. We held a lock too long, or used a blocking operation, or built an inefficient query. But often, the simple picture of latency as “the time a server takes to process a message” hides a great deal of end-to-end complexity. Debugging tail latencies requires unpacking the abstractions that we normally ignore: virtualization, hidden queues, and network behavior.
In this talk, I’ll describe how developers can diagnose more sources of delay and failure by building a more realistic and broad understanding of networked services. I’ll give some real-world cases when high end-to-end latency or elevated failure rates occurred due to factors we ordinarily might not even measure. Some examples include TCP SYN retransmission; virtualization on the client; and surprising behavior from AWS load balancers. Unfortunately, many measurement techniques don’t cover anything but the portion most directly under developer control. But developers can do better by comparing multiple measurements, applying Little’s law, investing in eBPF probes, and paying attention to the network layer.
Understanding API performance to find and fix issues faster ultimately means understanding the entire stack: the client, your code, and the underlying infrastructure.
Scylla Summit 2017: Stateful Streaming Applications with Apache Spark
1. Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark™
Burak Yavuz, Software Engineer, Databricks
2. Burak Yavuz
- Software Engineer at Databricks ("We make your streams come true")
- Apache Spark Committer as of Feb 2017
- MS in Management Science & Engineering, Stanford University
- BS in Mechanical Engineering, Bogazici University, Istanbul
3. About Databricks
TEAM: Started the Spark project (now Apache Spark) at UC Berkeley in 2009
PRODUCT: Unified Analytics Platform
MISSION: Making Big Data Simple
4. Outline
- Structured Streaming Concepts
- Stateful Processing in Structured Streaming
- Use Cases and How NoSQL Stores Fit In
- Demos
5. The simplest way to perform streaming analytics is not having to reason about streaming at all
7. New Model
Input: data from source as an append-only table
Trigger: how frequently to check input for new data
Query: operations on input - the usual map/filter/reduce plus new window and session ops
[diagram: trigger every 1 sec; at times 1, 2, 3 the input table grows to "data up to t" and the query runs over it]
8. New Model
Result: final operated table, updated every trigger interval
Output: what part of the result to write to the data sink after every trigger
Complete output: write the full result table every time
[diagram: complete mode - at each trigger the query produces "result for data up to t" and all rows of the result table are output]
9. PRESENTATION
TITLE
ON
ONE
LINE
AND
ON
TWO
LINES
First
and
last
name
Position,
company
New Model
Result: the final operated table, updated every trigger interval
Output: what part of the result to write to the data sink after every trigger
Complete output: write the full result table every time
Append output: write only the new rows that got added to the result table since the previous batch
*Not all output modes are feasible with all queries
[Diagram: in append mode, only result rows new since the last trigger are output]
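In code, the pieces of this model map onto the DataFrame read/write API. A minimal sketch, assuming a running SparkSession `spark` and a local socket source; all names here are illustrative, not from the deck:

```scala
import org.apache.spark.sql.streaming.Trigger

// Input: the source is read as an append-only table
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Query: ordinary DataFrame operations over the input table
val counts = lines.groupBy("value").count()

// Trigger + Output: check for new data every second and, in complete mode,
// write the full result table to the sink after every trigger
val query = counts.writeStream
  .outputMode("complete")
  .trigger(Trigger.ProcessingTime("1 second"))
  .format("console")
  .start()
```

With complete mode, the console sink reprints the whole result table at each trigger, matching the diagram above.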
Output Modes
▪ Append mode (default): new rows added to the Result Table since the last trigger are output to the sink. Rows are output only once and cannot be rescinded. Example use cases: ETL
Output Modes
▪ Complete mode: the whole Result Table is output to the sink after every trigger. This is supported for aggregation queries. Example use cases: Monitoring
Output Modes
▪ Update mode (available since Spark 2.1.1): only the rows in the Result Table that were updated since the last trigger are output to the sink. Example use cases: Alerting, Sessionization
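All three modes are selected via `outputMode` on the write path. A hedged sketch, assuming `agg` is a streaming aggregation DataFrame; the sink paths are placeholders:

```scala
// Illustrative only: `agg` is assumed to be a streaming aggregation with a
// watermark (append mode on an aggregation requires one).
agg.writeStream.outputMode("append")    // ETL: each row output exactly once
  .format("parquet")
  .option("path", "/tmp/etl-output")
  .option("checkpointLocation", "/tmp/etl-checkpoint")
  .start()

agg.writeStream.outputMode("complete")  // monitoring: full table every trigger
  .format("console")
  .start()

agg.writeStream.outputMode("update")    // alerting: only changed rows
  .format("console")
  .start()
```

As the previous slide notes, not every mode is feasible for every query; Spark rejects unsupported combinations at query start.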
Outline
o Structured Streaming Concepts
o Stateful Processing in Structured Streaming
o Use Cases and How NoSQL Stores Fit In
o Demos
Event-time Aggregations
Many use cases require aggregate statistics by event time
E.g. what's the #errors in each system in 1-hour windows?
Many challenges: extracting event time from the data; handling late, out-of-order data
DStream APIs were insufficient for event-time operations
Event-time Aggregations
Windowing is just another type of grouping in Structured Streaming

Number of records every hour:
parsedData
  .groupBy(window("timestamp", "1 hour"))
  .count()

Avg signal strength of each device every 10 mins:
parsedData
  .groupBy(
    $"device",
    window("timestamp", "10 mins"))
  .avg("signal")

Use built-in functions to extract event time; no need for separate extractors
Advanced Aggregations
Powerful built-in aggregations: variance, stddev, kurtosis, stddev_samp, collect_list, collect_set, corr, approx_count_distinct, ...
Multiple simultaneous aggregations
Custom aggs using reduceGroups, UDAFs

parsedData
  .groupBy(window("timestamp", "1 hour"))
  .agg(avg("signal"), stddev("signal"), max("signal"))

// Compute a histogram of signal strength by device type
val hist = ds.groupByKey(_.deviceType).mapGroups {
  case (deviceType, data: Iterator[DeviceData]) =>
    val buckets = new Array[Int](10)
    data.foreach { d => buckets(d.signal / 10) += 1 }
    (deviceType, buckets)
}
Stateful Processing for Aggregations
In-memory streaming state is maintained for aggregations
[Diagram: hourly window counts (12:00-17:00) updated across triggers; red cells mark state updated with late data]
Keeping state allows late data to update counts of old windows
But the size of the state increases indefinitely if old windows are not dropped
Watermarking and Late Data
Watermark [Spark 2.1]: a moving threshold that trails behind the max seen event time
The trailing gap defines how late data is expected to be
[Diagram: max event time 12:30 PM; watermark 12:20 PM, a trailing gap of 10 mins; data older than the watermark is not expected]
Watermarking and Late Data
Data newer than the watermark may be late, but is allowed to aggregate
Data older than the watermark is "too late" and is dropped
State older than the watermark is automatically deleted to limit the amount of intermediate state
[Diagram: late data above the watermark is allowed to aggregate; data below it is too late and dropped]
Watermarking and Late Data
Control the tradeoff between state size and lateness requirements
Handle more lateness → keep more state
Reduce state → handle less lateness

parsedData
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window("timestamp", "5 minutes"))
  .count()

[Diagram: an allowed lateness of 10 mins between the max event time and the watermark; late data above the watermark aggregates, data below it is dropped]
Watermarking to Limit State [Spark 2.1]

parsedData
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window("timestamp", "5 minutes"))
  .count()

[Diagram: the system tracks the max observed event time (12:14); the watermark is updated to 12:14 - 10m = 12:04 for the next trigger, and state older than 12:04 is deleted. A 12:08 record arriving late is still considered in the counts; a 12:04 record arriving after that is too late, ignored in the counts, and its state is dropped]
More details in blog post!
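The bookkeeping in that example can be written out as plain Scala (illustrative only; Spark maintains the watermark internally). Times are minutes past 12:00:

```scala
// withWatermark("timestamp", "10 minutes") => allowedLateness = 10
val allowedLateness = 10
var maxEventTime = 0 // max observed event time so far
var watermark = 0    // trails maxEventTime by allowedLateness

// Returns true if the record is still considered in the counts.
def observe(eventTime: Int): Boolean = {
  val counted = eventTime >= watermark // older than the watermark => dropped
  maxEventTime = math.max(maxEventTime, eventTime)
  // The watermark for the NEXT trigger trails the max seen event time;
  // state for windows entirely below it can be deleted.
  watermark = math.max(watermark, maxEventTime - allowedLateness)
  counted
}

observe(14) // max event time is now 12:14, so the next watermark is 12:04
observe(8)  // 12:08 is late but not older than 12:04: still counted
observe(3)  // 12:03 is below the watermark: too late, dropped
```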
Working With Time

df.withWatermark("timestampColumn", "5 hours")
  .groupBy(window("timestampColumn", "1 minute"))
  .count()
  .writeStream
  .trigger(Trigger.ProcessingTime("10 seconds"))

Separate processing details (output rate, late data tolerance) from query semantics.
Working With Time

df.withWatermark("timestampColumn", "5 hours")
  .groupBy(window("timestampColumn", "1 minute"))  // how to group data by time (same in streaming & batch)
  .count()
  .writeStream
  .trigger(Trigger.ProcessingTime("10 seconds"))
Working With Time

df.withWatermark("timestampColumn", "5 hours")  // how late data can be
  .groupBy(window("timestampColumn", "1 minute"))
  .count()
  .writeStream
  .trigger(Trigger.ProcessingTime("10 seconds"))
Working With Time

df.withWatermark("timestampColumn", "5 hours")
  .groupBy(window("timestampColumn", "1 minute"))
  .count()
  .writeStream
  .trigger(Trigger.ProcessingTime("10 seconds"))  // how often to emit updates
Arbitrary Stateful Operations [Spark 2.2]
mapGroupsWithState allows any user-defined stateful ops against a user-defined state
Direct support for per-key timeouts in event time or processing time
Supports Scala and Java

ds.groupByKey(groupingFunc)
  .mapGroupsWithState(timeoutConf)(mappingWithStateFunc)

def mappingWithStateFunc(
    key: K,
    values: Iterator[V],
    state: GroupState[S]): U = {
  // update or remove state
  // set timeouts
  // return mapped value
}
flatMapGroupsWithState
▪ Applies the given function to each group of data, while maintaining a user-defined per-group state
▪ Invoked once per group in batch
▪ Invoked per group at each trigger (when the group has data) in streaming
▪ Requires the user to provide an output mode for the function
flatMapGroupsWithState
▪ mapGroupsWithState is a special case with:
o Output mode: Update
o Output size: 1 row per group
▪ Supports both processing-time and event-time timeouts
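A fuller sketch of what such a call looks like with a processing-time timeout. `Event`, `SessionInfo`, and `SessionUpdate` are assumed case classes and `events` an assumed streaming `Dataset[Event]`; this mirrors the style of the Spark examples, not code from the deck:

```scala
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

case class Event(user: String, timestamp: java.sql.Timestamp)
case class SessionInfo(numEvents: Int)
case class SessionUpdate(user: String, numEvents: Int, expired: Boolean)

val sessionUpdates = events
  .groupByKey(_.user)
  .flatMapGroupsWithState[SessionInfo, SessionUpdate](
      OutputMode.Append(), GroupStateTimeout.ProcessingTimeTimeout) {
    (user: String, newEvents: Iterator[Event], state: GroupState[SessionInfo]) =>
      if (state.hasTimedOut) {
        // No data arrived for this key before the timeout: emit and drop state
        val finalUpdate = SessionUpdate(user, state.get.numEvents, expired = true)
        state.remove()
        Iterator(finalUpdate)
      } else {
        // Fold the new events into the per-group state and re-arm the timeout
        val updated = SessionInfo(
          state.getOption.map(_.numEvents).getOrElse(0) + newEvents.size)
        state.update(updated)
        state.setTimeoutDuration("10 minutes")
        Iterator.empty
      }
  }
```

Because the function may emit zero or many rows per group, an output mode (here Append) must be supplied; with mapGroupsWithState, Update mode and one row per group are implied.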
Outline
o Structured Streaming Concepts
o Stateful Processing in Structured Streaming
o Use Cases and How NoSQL Stores Fit In
o Demos
Alerting

val monitoring = stream
  .as[Event]
  .groupByKey(_.id)
  .flatMapGroupsWithState(OutputMode.Append(), GroupStateTimeout.ProcessingTimeTimeout) {
    (id: Int, events: Iterator[Event], state: GroupState[…]) =>
      ...
  }
  .writeStream
  .queryName("alerts")
  .foreach(new PagerdutySink(credentials))

Monitor a stream using custom stateful logic with timeouts.
Alerting
▪ Save your state to Scylla to power dashboards
▪ Have the stream trigger alerts ASAP
Sessionization

val monitoring = stream
  .as[Event]
  .groupByKey(_.session_id)
  .mapGroupsWithState(GroupStateTimeout.EventTimeTimeout) {
    (id: Int, events: Iterator[Event], state: GroupState[…]) =>
      ...
  }
  .writeStream
  .scylla("trips")

Analyze sessions of user/system behavior
Sessionization
▪ Update sessions in your stream
▪ Save them to a NoSQL store like Scylla!
Demo
Try Spark 2.2 on Community Edition today!
https://databricks.com/try-databricks
Apache Spark’s Structured Streaming at Scale Series
https://databricks.com/blog/category/engineering
Twitter: @databricks
We are hiring!
https://databricks.com/company/careers
THANK YOU
burak@databricks.com
“Does anyone have any questions for my answers?”
- Henry Kissinger