Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark™
Burak Yavuz
Software Engineer, Databricks
Burak Yavuz
● Software Engineer – Databricks
  - "We make your streams come true"
● Apache Spark Committer as of Feb 2017
● MS in Management Science & Engineering - Stanford University
● BS in Mechanical Engineering - Bogazici University, Istanbul
About
TEAM: Started Spark project (now Apache Spark) at UC Berkeley in 2009
PRODUCT: Unified Analytics Platform
MISSION: Making Big Data Simple
Outline
o Structured Streaming Concepts
o Stateful Processing in Structured Streaming
o Use Cases and How NoSQL Stores Fit In
o Demos

The simplest way to perform streaming analytics
is not having to reason about streaming at all
New Model
Input: data from source as an append-only table
Trigger: how frequently to check input for new data
Query: operations on input
  usual map/filter/reduce
  new window, session ops
[Figure: with a trigger every 1 sec, the query runs at times 1, 2 and 3 over input data up to 1, up to 2 and up to 3]
New Model
Result: final operated table updated every trigger interval
Output: what part of result to write to data sink after every trigger
Complete output: Write full result table every time
[Figure: with a trigger every 1 sec, the query over input data up to times 1, 2 and 3 yields the result for data up to 1, 2 and 3; output in complete mode writes all the rows in the result table]

New Model
Result: final operated table updated every trigger interval
Output: what part of result to write to data sink after every trigger
Complete output: Write full result table every time
Append output: Write only new rows that got added to result table since previous batch
*Not all output modes are feasible with all queries
[Figure: with a trigger every 1 sec, the query over input data up to times 1, 2 and 3 yields the result for data up to 1, 2 and 3; output in append mode writes only new rows since last trigger]
Output Modes
▪ Append mode (default) - New rows added to the Result Table since the last trigger will be output to the sink. Rows will be output only once, and cannot be rescinded.
Example use cases: ETL
Output Modes
▪ Complete mode - The whole Result Table will be output to the sink after every trigger. This is supported for aggregation queries.
Example use cases: Monitoring

Output Modes
▪ Update mode - (Available since Spark 2.1.1) Only the rows in the Result Table that were updated since the last trigger will be output to the sink.
Example use cases: Alerting, Sessionization
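The three output modes can be compared with a small, Spark-free model. This is only a sketch: the result table is assumed to be a plain `Map` from key to count, and the names `OutputModesSketch`, `completeOutput`, `updateOutput` and `appendOutput` are hypothetical, not part of any Spark API. It does not model how Spark decides that a row is final before appending it.

```scala
object OutputModesSketch {
  // Assumption: the result table is modeled as key -> aggregated count.
  type ResultTable = Map[String, Long]

  // Complete mode: the whole result table is written after every trigger.
  def completeOutput(current: ResultTable): ResultTable = current

  // Update mode: only rows that are new or whose value changed since the last trigger.
  def updateOutput(previous: ResultTable, current: ResultTable): ResultTable =
    current.filter { case (key, value) => !previous.get(key).contains(value) }

  // Append mode: only rows whose key was not in the table at the last trigger.
  def appendOutput(previous: ResultTable, current: ResultTable): ResultTable =
    current.filter { case (key, _) => !previous.contains(key) }
}
```

Given a previous table with keys `a` and `b` and a current table where `b` changed and `c` is new, complete mode writes all three rows, update mode writes `b` and `c`, and append mode writes only `c`.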
Outline
o Structured Streaming Concepts
o Stateful Processing in Structured Streaming
o Use Cases and How NoSQL Stores Fit In
o Demos
Event time Aggregations
Many use cases require aggregate statistics by event time
  E.g. what's the #errors in each system in 1 hour windows?
Many challenges
  Extracting event time from data, handling late, out-of-order data
DStream APIs were insufficient for event time operations
Event time Aggregations
Windowing is just another type of grouping in Struct. Streaming

import org.apache.spark.sql.functions.{col, window}

// number of records every hour
parsedData
  .groupBy(window(col("timestamp"), "1 hour"))
  .count()

// avg signal strength of each device every 10 mins
parsedData
  .groupBy(col("device"), window(col("timestamp"), "10 minutes"))
  .avg("signal")

Use built-in functions to extract event-time
No need for separate extractors
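To make "windowing is just another type of grouping" concrete, here is a minimal sketch in plain Scala collections rather than Spark. It assumes event times are non-negative epoch milliseconds and that windows are fixed-size and non-overlapping; `TumblingWindowSketch`, `Event` and the helper names are hypothetical and chosen only for illustration.

```scala
object TumblingWindowSketch {
  // Hypothetical record type: a device reading with an event time in epoch millis.
  final case class Event(device: String, timestampMs: Long, signal: Double)

  // Start of the fixed-size window that contains the timestamp
  // (assumes non-negative timestamps).
  def windowStart(timestampMs: Long, windowMs: Long): Long =
    timestampMs - (timestampMs % windowMs)

  // Number of records per window: grouping by the window start.
  def countPerWindow(events: Seq[Event], windowMs: Long): Map[Long, Int] =
    events
      .groupBy(e => windowStart(e.timestampMs, windowMs))
      .map { case (start, es) => start -> es.size }

  // Average signal per (device, window): the window is one more grouping key.
  def avgSignalPerDeviceWindow(events: Seq[Event], windowMs: Long): Map[(String, Long), Double] =
    events
      .groupBy(e => (e.device, windowStart(e.timestampMs, windowMs)))
      .map { case (key, es) => key -> es.map(_.signal).sum / es.size }
}
```

The window start acts as an ordinary grouping key, which is why the same query shape works for both batch and streaming input.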

Advanced Aggregations
Powerful built-in aggregations
  variance, stddev, kurtosis, stddev_samp, collect_list, collect_set, corr, approx_count_distinct, ...
Multiple simultaneous aggregations
  parsedData
    .groupBy(window(col("timestamp"), "1 hour"))
    .agg(avg("signal"), stddev("signal"), max("signal"))
Custom aggs using reduceGroups, UDAFs
  // Compute a histogram of signal by device type.
  // Assumes signal is an Int in [0, 100); `type` needs backticks because it is a Scala keyword.
  val hist = ds.groupByKey(_.`type`).mapGroups {
    (deviceType, data: Iterator[DeviceData]) =>
      val buckets = new Array[Int](10)
      data.map(_.signal).foreach { a => buckets(a / 10) += 1 }
      (deviceType, buckets)
  }
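The bucketing logic of the histogram example can be tried without a Spark session. The sketch below uses plain Scala collections and assumes integer signal values in the range 0 to 99; `HistogramSketch` and its method names are hypothetical, and the grouping stands in for what `groupByKey(...).mapGroups` does per key.

```scala
object HistogramSketch {
  // Ten buckets of width 10; values are assumed to lie in [0, 100).
  def histogram(signals: Iterator[Int]): Array[Int] = {
    val buckets = new Array[Int](10)
    signals.foreach { s => buckets(s / 10) += 1 }
    buckets
  }

  // One histogram per key, built from (key, signal) rows.
  def histogramByKey(rows: Seq[(String, Int)]): Map[String, List[Int]] =
    rows
      .groupBy(_._1)
      .map { case (key, rs) => key -> histogram(rs.iterator.map(_._2)).toList }
}
```

A value outside the assumed range would index past the array, so a real job would need to clamp or validate the input first.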
Stateful Processing for Aggregations
In-memory, streaming state maintained for aggregations
[Figure: window counts held in state at each trigger from 13:00 to 17:00; red = state updated with late data]
  at 13:00: 12:00 - 13:00 = 1
  at 14:00: 12:00 - 13:00 = 3, 13:00 - 14:00 = 1
  at 15:00: 12:00 - 13:00 = 3, 13:00 - 14:00 = 2, 14:00 - 15:00 = 5
  at 16:00: 12:00 - 13:00 = 5, 13:00 - 14:00 = 2, 14:00 - 15:00 = 5, 15:00 - 16:00 = 4
  at 17:00: 12:00 - 13:00 = 3, 13:00 - 14:00 = 2, 14:00 - 15:00 = 6, 15:00 - 16:00 = 4, 16:00 - 17:00 = 3
Keeping state allows late data to update counts of old windows
But size of the state increases indefinitely if old windows not dropped
Watermarking and Late Data
Watermark [Spark 2.1] - a moving threshold that trails behind the max seen event time
Trailing gap defines how late data is expected to be
[Figure: event-time axis with max event time at 12:30 PM and the watermark at 12:20 PM, a trailing gap of 10 mins; data older than the watermark is not expected]

Watermarking and Late Data
Data newer than watermark may be late, but allowed to aggregate
Data older than watermark is "too late" and dropped
State older than watermark automatically deleted to limit the amount of intermediate state
[Figure: event-time axis with max event time and watermark; late data between the two is allowed to aggregate, data older than the watermark is too late and dropped]
Watermarking and Late Data
Control the tradeoff between state size and lateness requirements
  Handle more late → keep more state
  Reduce state → handle less lateness

parsedData
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window(col("timestamp"), "5 minutes"))
  .count()

[Figure: event-time axis with max event time and watermark, allowed lateness of 10 mins; late data within that gap is allowed to aggregate, data too late is dropped]
Watermarking to Limit State [Spark 2.1]

parsedData
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window(col("timestamp"), "5 minutes"))
  .count()

[Figure: event time plotted against processing time, with triggers at 12:00, 12:05, 12:10 and 12:15]
  System tracks max observed event time (12:14)
  Watermark updated to 12:14 - 10m = 12:04 for next trigger, state < 12:04 deleted
  Data that is late but newer than the watermark is considered in counts
  Data too late is ignored in counts, state dropped
More details in blog post!
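The arithmetic on this slide (12:14 - 10m = 12:04) can be written out as a small Spark-free sketch. It assumes event times are expressed as minutes since midnight and compares individual event times against the watermark; how Spark applies the watermark to window state is not modeled here. `WatermarkSketch` and its functions are hypothetical names.

```scala
object WatermarkSketch {
  // Assumption: event times are minutes since midnight, e.g. 12:14 -> 734.
  def minutes(hour: Int, minute: Int): Int = hour * 60 + minute

  // The watermark trails the max observed event time by the allowed delay.
  def watermark(maxEventTime: Int, delayMinutes: Int): Int =
    maxEventTime - delayMinutes

  // Split a batch into records still considered in counts (at or after the
  // watermark) and records that are too late (before the watermark).
  def partitionByWatermark(batch: Seq[Int], wm: Int): (Seq[Int], Seq[Int]) =
    batch.partition(_ >= wm)
}
```

With a max observed event time of 12:14 and a 10 minute delay, a record stamped 12:08 is late but still counted, while one stamped 12:03 falls behind the 12:04 watermark and is dropped.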
Working With Time

import org.apache.spark.sql.functions.{col, window}
import org.apache.spark.sql.streaming.Trigger

df.withWatermark("timestampColumn", "5 hours")          // how late data can be
  .groupBy(window(col("timestampColumn"), "1 minute"))  // how to group data by time (same in streaming & batch)
  .count()
  .writeStream
  .trigger(Trigger.ProcessingTime("10 seconds"))        // how often to emit updates

Separate processing details (output rate, late data tolerance) from query semantics.

Arbitrary Stateful Operations [Spark 2.2]
mapGroupsWithState applies any user-defined stateful operation to a user-defined state
Direct support for per-key timeouts in event-time or processing-time
Supports Scala and Java

ds.groupByKey(groupingFunc)
  .mapGroupsWithState(timeoutConf)(mappingWithStateFunc)

def mappingWithStateFunc(
    key: K,
    values: Iterator[V],
    state: GroupState[S]): U = {
  // update or remove state
  // set timeouts
  // return mapped value
}
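The shape of such a mapping function can be exercised without Spark. The sketch below is an illustration under stated assumptions, not Spark's implementation: `SimpleState` is a hypothetical stand-in for `GroupState` that only supports reading, updating and removing a value (timeouts are left out), and `runBatches` is a hypothetical driver that plays the role of the engine by keeping one state object per key across micro-batches.

```scala
import scala.collection.mutable

object StatefulMappingSketch {
  // Hypothetical stand-in for GroupState: one optional value per key, no timeouts.
  final class SimpleState[S](private var value: Option[S]) {
    def getOption: Option[S] = value
    def update(newValue: S): Unit = { value = Some(newValue) }
    def remove(): Unit = { value = None }
  }

  // Mapping function in the shape shown above: keeps a running count per key
  // and returns one (key, count) row per invocation.
  def countEvents(key: String, values: Iterator[String], state: SimpleState[Long]): (String, Long) = {
    val total = state.getOption.getOrElse(0L) + values.size
    state.update(total)
    (key, total)
  }

  // Calls the function once per key that has data in each batch,
  // reusing that key's state from earlier batches.
  def runBatches(batches: Seq[Seq[(String, String)]]): Seq[Map[String, Long]] = {
    val states = mutable.Map.empty[String, SimpleState[Long]]
    batches.map { batch =>
      batch.groupBy(_._1).map { case (key, rows) =>
        val state = states.getOrElseUpdate(key, new SimpleState[Long](None))
        countEvents(key, rows.iterator.map(_._2), state)
      }
    }
  }
}
```

In this model a key with no rows in a batch is not visited at all, so its state is carried forward untouched until new data for that key arrives.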
flatMapGroupsWithState
▪ Applies the given function to each group of data, while maintaining a user-defined per-group state
▪ Invoked once per group in batch
▪ Invoked each trigger (with the existence of data) per group in streaming
▪ Requires user to provide an output mode for the function
flatMapGroupsWithState
▪ mapGroupsWithState is a special case with
  o Output mode: Update
  o Output size: 1 row per group
▪ Supports both Processing Time and Event Time timeouts
Outline
o Structured Streaming Concepts
o Stateful Processing in Structured Streaming
o Use Cases and How NoSQL Stores Fit In
o Demos

Alerting
val monitoring = stream
  .as[Event]
  // groupByKey (not groupBy) returns the typed grouping that exposes flatMapGroupsWithState
  .groupByKey(_.id)
  .flatMapGroupsWithState(OutputMode.Append, GroupStateTimeout.ProcessingTimeTimeout) {
    (id: Int, events: Iterator[Event], state: GroupState[…]) =>
      ...
  }
  .writeStream
  .queryName("alerts")
  .foreach(new PagerdutySink(credentials))
Monitor a stream using custom stateful logic with timeouts.
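The body elided on the slide could look roughly like the sketch below. It is only an illustration: the `Event` fields, the `ErrorCount` state, the `Alert` output, the threshold of three errors and the ten-minute timeout are all assumptions, not taken from the talk. It counts errors per key, emits an alert when the assumed threshold is reached, and emits a different alert when a key goes quiet long enough for the processing-time timeout to fire.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

// Hypothetical record types; the slide does not show the real Event schema.
case class Event(id: Int, isError: Boolean)
case class ErrorCount(errors: Long)          // per-key state kept by Spark
case class Alert(id: Int, message: String)   // rows emitted downstream

object Alerting {
  val ErrorThreshold = 3L // assumed value, for illustration only

  // Pure transition: fold the new events into the running error count.
  def nextCount(previous: Option[ErrorCount], events: Seq[Event]): ErrorCount =
    ErrorCount(previous.map(_.errors).getOrElse(0L) + events.count(_.isError))

  // Called once per key per trigger, either with new events or on timeout.
  def updateAlerts(id: Int, events: Iterator[Event],
                   state: GroupState[ErrorCount]): Iterator[Alert] =
    if (state.hasTimedOut) {
      // No events arrived for this key before the timeout expired.
      state.remove()
      Iterator(Alert(id, "no events received before the timeout"))
    } else {
      val updated = nextCount(state.getOption, events.toList)
      if (updated.errors >= ErrorThreshold) {
        state.remove() // start counting afresh after alerting
        Iterator(Alert(id, s"${updated.errors} errors seen"))
      } else {
        state.update(updated)                  // must be set before the timeout
        state.setTimeoutDuration("10 minutes") // assumed duration
        Iterator.empty
      }
    }

  def alerts(spark: SparkSession, stream: Dataset[Event]): Dataset[Alert] = {
    import spark.implicits._ // encoders for the case classes above
    stream
      .groupByKey(_.id)
      .flatMapGroupsWithState[ErrorCount, Alert](
        OutputMode.Append, GroupStateTimeout.ProcessingTimeTimeout)(updateAlerts _)
  }
}
```

Keeping the count arithmetic in `nextCount`, separate from the `GroupState` bookkeeping, makes the logic testable without a running stream.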
Alerting
▪ Save your state to Scylla to power dashboards
▪ Have the stream trigger alerts ASAP
Sessionization
val monitoring = stream
  .as[Event]
  // groupByKey (not groupBy) returns the typed grouping that exposes mapGroupsWithState
  .groupByKey(_.session_id)
  .mapGroupsWithState(GroupStateTimeout.EventTimeTimeout) {
    (id: Int, events: Iterator[Event], state: GroupState[…]) =>
      ...
  }
  .writeStream
  .scylla("trips")
Analyze sessions of user/system behavior
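One way the elided session logic might be filled in is sketched below. The `Event` fields, the `SessionInfo` state, the `SessionUpdate` output, the 30-minute inactivity gap and the 10-minute watermark are assumptions made for illustration. Because the slide uses `EventTimeTimeout`, the sketch also adds `withWatermark`, which Spark requires for event-time timeouts.

```scala
import java.sql.Timestamp
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}

// Hypothetical record types; the slide does not show the real schema.
case class Event(session_id: Int, eventTime: Timestamp)
case class SessionInfo(startMs: Long, endMs: Long, numEvents: Long)
case class SessionUpdate(session_id: Int, durationMs: Long,
                         numEvents: Long, expired: Boolean)

object Sessionization {
  val SessionGapMs: Long = 30 * 60 * 1000L // assumed inactivity gap

  // Pure transition: widen the session to cover the new event times.
  // Expects at least one timestamp when there is no previous state.
  def extend(previous: Option[SessionInfo], timesMs: Seq[Long]): SessionInfo = {
    val all = previous.toSeq.flatMap(s => Seq(s.startMs, s.endMs)) ++ timesMs
    SessionInfo(all.min, all.max,
      previous.map(_.numEvents).getOrElse(0L) + timesMs.size)
  }

  // Called once per session id per trigger, with new events or on timeout.
  def updateSession(id: Int, events: Iterator[Event],
                    state: GroupState[SessionInfo]): SessionUpdate =
    if (state.hasTimedOut) {
      // The watermark passed the session's timeout: close the session.
      val last = state.get
      state.remove()
      SessionUpdate(id, last.endMs - last.startMs, last.numEvents, expired = true)
    } else {
      val updated = extend(state.getOption, events.map(_.eventTime.getTime).toList)
      state.update(updated)
      // Expire once the watermark moves past the last event plus the gap.
      state.setTimeoutTimestamp(updated.endMs + SessionGapMs)
      SessionUpdate(id, updated.endMs - updated.startMs, updated.numEvents, expired = false)
    }

  def sessions(spark: SparkSession, stream: Dataset[Event]): Dataset[SessionUpdate] = {
    import spark.implicits._ // encoders for the case classes above
    stream
      .withWatermark("eventTime", "10 minutes") // assumed lateness bound
      .groupByKey(_.session_id)
      .mapGroupsWithState[SessionInfo, SessionUpdate](
        GroupStateTimeout.EventTimeTimeout)(updateSession _)
  }
}
```

A query built on `mapGroupsWithState` has to be written with the update output mode, so each trigger emits the latest `SessionUpdate` for every session that changed or expired.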
Sessionization
▪ Update sessions in your stream
▪ Save it to a NoSQL store like Scylla!
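The `.scylla("trips")` call on the slide is not a built-in `DataStreamWriter` method and is presumably shorthand for a custom sink. One generic way to write session rows to Scylla is a `ForeachWriter` that uses a Cassandra-compatible driver. The sketch below assumes the DataStax Java driver 3.x is on the classpath; the `demo.trips` table, its columns and the `SessionUpdate` shape are hypothetical.

```scala
import com.datastax.driver.core.{Cluster, Session}
import org.apache.spark.sql.ForeachWriter

// Hypothetical row shape produced by the sessionization query.
case class SessionUpdate(session_id: Int, durationMs: Long,
                         numEvents: Long, expired: Boolean)

object ScyllaSessionWriter {
  // Hypothetical keyspace, table and column names.
  val InsertCql: String =
    "INSERT INTO demo.trips (session_id, duration_ms, num_events, expired) " +
    "VALUES (?, ?, ?, ?)"
}

class ScyllaSessionWriter(host: String) extends ForeachWriter[SessionUpdate] {
  // Connections are not serializable, so they are created on the executor in open().
  @transient private var cluster: Cluster = _
  @transient private var session: Session = _

  // Called once per partition and trigger; returning true means "process the rows".
  override def open(partitionId: Long, version: Long): Boolean = {
    cluster = Cluster.builder().addContactPoint(host).build()
    session = cluster.connect()
    true
  }

  // An INSERT keyed by session_id overwrites the previous row (upsert semantics),
  // so repeated updates for one session leave a single current row.
  override def process(value: SessionUpdate): Unit = {
    session.execute(
      ScyllaSessionWriter.InsertCql,
      Int.box(value.session_id), Long.box(value.durationMs),
      Long.box(value.numEvents), Boolean.box(value.expired))
  }

  override def close(errorOrNull: Throwable): Unit = {
    if (session != null) session.close()
    if (cluster != null) cluster.close()
  }
}
```

It could be attached with something like `sessions.writeStream.outputMode("update").foreach(new ScyllaSessionWriter("127.0.0.1")).start()`. Opening a driver connection for every partition and trigger is simple but expensive; a production sink would more likely share a pooled connection per executor.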

Demo
Try Spark 2.2 on Community Edition today!
https://databricks.com/try-databricks
Apache Spark’s Structured Streaming at Scale Series
https://databricks.com/blog/category/engineering
Twitter: @databricks
We are hiring!
https://databricks.com/company/careers

THANK YOU
burak@databricks.com
“Does anyone have any questions for my answers?”
- Henry Kissinger

More Related Content

What's hot

Scylla Summit 2017: How to Run Cassandra/Scylla from a MySQL DBA's Point of View
Scylla Summit 2017: How to Run Cassandra/Scylla from a MySQL DBA's Point of ViewScylla Summit 2017: How to Run Cassandra/Scylla from a MySQL DBA's Point of View
Scylla Summit 2017: How to Run Cassandra/Scylla from a MySQL DBA's Point of View
ScyllaDB
 
Scylla Summit 2017: A Deep Dive on Heat Weighted Load Balancing
Scylla Summit 2017: A Deep Dive on Heat Weighted Load BalancingScylla Summit 2017: A Deep Dive on Heat Weighted Load Balancing
Scylla Summit 2017: A Deep Dive on Heat Weighted Load Balancing
ScyllaDB
 
Scylla Summit 2017: How to Use Gocql to Execute Queries and What the Driver D...
Scylla Summit 2017: How to Use Gocql to Execute Queries and What the Driver D...Scylla Summit 2017: How to Use Gocql to Execute Queries and What the Driver D...
Scylla Summit 2017: How to Use Gocql to Execute Queries and What the Driver D...
ScyllaDB
 
Scylla Summit 2017: Scylla on Samsung NVMe Z-SSDs
Scylla Summit 2017: Scylla on Samsung NVMe Z-SSDsScylla Summit 2017: Scylla on Samsung NVMe Z-SSDs
Scylla Summit 2017: Scylla on Samsung NVMe Z-SSDs
ScyllaDB
 
Scylla Summit 2017 Keynote: NextGen NoSQL with CEO Dor Laor
Scylla Summit 2017 Keynote: NextGen NoSQL with CEO Dor LaorScylla Summit 2017 Keynote: NextGen NoSQL with CEO Dor Laor
Scylla Summit 2017 Keynote: NextGen NoSQL with CEO Dor Laor
ScyllaDB
 
Scylla Summit 2017: Running a Soft Real-time Service at One Million QPS
Scylla Summit 2017: Running a Soft Real-time Service at One Million QPSScylla Summit 2017: Running a Soft Real-time Service at One Million QPS
Scylla Summit 2017: Running a Soft Real-time Service at One Million QPS
ScyllaDB
 
Scylla Summit 2017: Distributed Materialized Views
Scylla Summit 2017: Distributed Materialized ViewsScylla Summit 2017: Distributed Materialized Views
Scylla Summit 2017: Distributed Materialized Views
ScyllaDB
 
Scylla Summit 2017: How to Optimize and Reduce Inter-DC Network Traffic and S...
Scylla Summit 2017: How to Optimize and Reduce Inter-DC Network Traffic and S...Scylla Summit 2017: How to Optimize and Reduce Inter-DC Network Traffic and S...
Scylla Summit 2017: How to Optimize and Reduce Inter-DC Network Traffic and S...
ScyllaDB
 
Scylla Summit 2017: Scylla's Open Source Monitoring Solution
Scylla Summit 2017: Scylla's Open Source Monitoring SolutionScylla Summit 2017: Scylla's Open Source Monitoring Solution
Scylla Summit 2017: Scylla's Open Source Monitoring Solution
ScyllaDB
 
Scylla Summit 2017: Scylla on Kubernetes
Scylla Summit 2017: Scylla on KubernetesScylla Summit 2017: Scylla on Kubernetes
Scylla Summit 2017: Scylla on Kubernetes
ScyllaDB
 
Scylla Summit 2017: Snapfish's Journey Towards Scylla
Scylla Summit 2017: Snapfish's Journey Towards ScyllaScylla Summit 2017: Snapfish's Journey Towards Scylla
Scylla Summit 2017: Snapfish's Journey Towards Scylla
ScyllaDB
 
Scylla Summit 2017: How to Ruin Your Workload's Performance by Choosing the W...
Scylla Summit 2017: How to Ruin Your Workload's Performance by Choosing the W...Scylla Summit 2017: How to Ruin Your Workload's Performance by Choosing the W...
Scylla Summit 2017: How to Ruin Your Workload's Performance by Choosing the W...
ScyllaDB
 
Scylla Summit 2017: A Toolbox for Understanding Scylla in the Field
Scylla Summit 2017: A Toolbox for Understanding Scylla in the FieldScylla Summit 2017: A Toolbox for Understanding Scylla in the Field
Scylla Summit 2017: A Toolbox for Understanding Scylla in the Field
ScyllaDB
 
If You Care About Performance, Use User Defined Types
If You Care About Performance, Use User Defined TypesIf You Care About Performance, Use User Defined Types
If You Care About Performance, Use User Defined Types
ScyllaDB
 
Scylla Summit 2017: Saving Thousands by Running Scylla on EC2 Spot Instances
Scylla Summit 2017: Saving Thousands by Running Scylla on EC2 Spot InstancesScylla Summit 2017: Saving Thousands by Running Scylla on EC2 Spot Instances
Scylla Summit 2017: Saving Thousands by Running Scylla on EC2 Spot Instances
ScyllaDB
 
Scylla Summit 2017: How Baidu Runs Scylla on a Petabyte-Level Big Data Platform
Scylla Summit 2017: How Baidu Runs Scylla on a Petabyte-Level Big Data PlatformScylla Summit 2017: How Baidu Runs Scylla on a Petabyte-Level Big Data Platform
Scylla Summit 2017: How Baidu Runs Scylla on a Petabyte-Level Big Data Platform
ScyllaDB
 
Scylla Summit 2017: Welcome and Keynote - Nextgen NoSQL
Scylla Summit 2017: Welcome and Keynote - Nextgen NoSQLScylla Summit 2017: Welcome and Keynote - Nextgen NoSQL
Scylla Summit 2017: Welcome and Keynote - Nextgen NoSQL
ScyllaDB
 
Scylla Summit 2017: Repair, Backup, Restore: Last Thing Before You Go to Prod...
Scylla Summit 2017: Repair, Backup, Restore: Last Thing Before You Go to Prod...Scylla Summit 2017: Repair, Backup, Restore: Last Thing Before You Go to Prod...
Scylla Summit 2017: Repair, Backup, Restore: Last Thing Before You Go to Prod...
ScyllaDB
 
Scylla Summit 2017 Keynote: NextGen NoSQL with Chairman Benny Schnaider
Scylla Summit 2017 Keynote: NextGen NoSQL with Chairman Benny SchnaiderScylla Summit 2017 Keynote: NextGen NoSQL with Chairman Benny Schnaider
Scylla Summit 2017 Keynote: NextGen NoSQL with Chairman Benny Schnaider
ScyllaDB
 
Scylla Summit 2017: Stretching Scylla Silly: The Datastore of a Graph Databas...
Scylla Summit 2017: Stretching Scylla Silly: The Datastore of a Graph Databas...Scylla Summit 2017: Stretching Scylla Silly: The Datastore of a Graph Databas...
Scylla Summit 2017: Stretching Scylla Silly: The Datastore of a Graph Databas...
ScyllaDB
 

What's hot (20)

Scylla Summit 2017: How to Run Cassandra/Scylla from a MySQL DBA's Point of View
Scylla Summit 2017: How to Run Cassandra/Scylla from a MySQL DBA's Point of ViewScylla Summit 2017: How to Run Cassandra/Scylla from a MySQL DBA's Point of View
Scylla Summit 2017: How to Run Cassandra/Scylla from a MySQL DBA's Point of View
 
Scylla Summit 2017: A Deep Dive on Heat Weighted Load Balancing
Scylla Summit 2017: A Deep Dive on Heat Weighted Load BalancingScylla Summit 2017: A Deep Dive on Heat Weighted Load Balancing
Scylla Summit 2017: A Deep Dive on Heat Weighted Load Balancing
 
Scylla Summit 2017: How to Use Gocql to Execute Queries and What the Driver D...
Scylla Summit 2017: How to Use Gocql to Execute Queries and What the Driver D...Scylla Summit 2017: How to Use Gocql to Execute Queries and What the Driver D...
Scylla Summit 2017: How to Use Gocql to Execute Queries and What the Driver D...
 
Scylla Summit 2017: Scylla on Samsung NVMe Z-SSDs
Scylla Summit 2017: Scylla on Samsung NVMe Z-SSDsScylla Summit 2017: Scylla on Samsung NVMe Z-SSDs
Scylla Summit 2017: Scylla on Samsung NVMe Z-SSDs
 
Scylla Summit 2017 Keynote: NextGen NoSQL with CEO Dor Laor
Scylla Summit 2017 Keynote: NextGen NoSQL with CEO Dor LaorScylla Summit 2017 Keynote: NextGen NoSQL with CEO Dor Laor
Scylla Summit 2017 Keynote: NextGen NoSQL with CEO Dor Laor
 
Scylla Summit 2017: Running a Soft Real-time Service at One Million QPS
Scylla Summit 2017: Running a Soft Real-time Service at One Million QPSScylla Summit 2017: Running a Soft Real-time Service at One Million QPS
Scylla Summit 2017: Running a Soft Real-time Service at One Million QPS
 
Scylla Summit 2017: Distributed Materialized Views
Scylla Summit 2017: Distributed Materialized ViewsScylla Summit 2017: Distributed Materialized Views
Scylla Summit 2017: Distributed Materialized Views
 
Scylla Summit 2017: How to Optimize and Reduce Inter-DC Network Traffic and S...
Scylla Summit 2017: How to Optimize and Reduce Inter-DC Network Traffic and S...Scylla Summit 2017: How to Optimize and Reduce Inter-DC Network Traffic and S...
Scylla Summit 2017: How to Optimize and Reduce Inter-DC Network Traffic and S...
 
Scylla Summit 2017: Scylla's Open Source Monitoring Solution
Scylla Summit 2017: Scylla's Open Source Monitoring SolutionScylla Summit 2017: Scylla's Open Source Monitoring Solution
Scylla Summit 2017: Scylla's Open Source Monitoring Solution
 
Scylla Summit 2017: Scylla on Kubernetes
Scylla Summit 2017: Scylla on KubernetesScylla Summit 2017: Scylla on Kubernetes
Scylla Summit 2017: Scylla on Kubernetes
 
Scylla Summit 2017: Snapfish's Journey Towards Scylla
Scylla Summit 2017: Snapfish's Journey Towards ScyllaScylla Summit 2017: Snapfish's Journey Towards Scylla
Scylla Summit 2017: Snapfish's Journey Towards Scylla
 
Scylla Summit 2017: How to Ruin Your Workload's Performance by Choosing the W...
Scylla Summit 2017: How to Ruin Your Workload's Performance by Choosing the W...Scylla Summit 2017: How to Ruin Your Workload's Performance by Choosing the W...
Scylla Summit 2017: How to Ruin Your Workload's Performance by Choosing the W...
 
Scylla Summit 2017: A Toolbox for Understanding Scylla in the Field
Scylla Summit 2017: A Toolbox for Understanding Scylla in the FieldScylla Summit 2017: A Toolbox for Understanding Scylla in the Field
Scylla Summit 2017: A Toolbox for Understanding Scylla in the Field
 
If You Care About Performance, Use User Defined Types
If You Care About Performance, Use User Defined TypesIf You Care About Performance, Use User Defined Types
If You Care About Performance, Use User Defined Types
 
Scylla Summit 2017: Saving Thousands by Running Scylla on EC2 Spot Instances
Scylla Summit 2017: Saving Thousands by Running Scylla on EC2 Spot InstancesScylla Summit 2017: Saving Thousands by Running Scylla on EC2 Spot Instances
Scylla Summit 2017: Saving Thousands by Running Scylla on EC2 Spot Instances
 
Scylla Summit 2017: How Baidu Runs Scylla on a Petabyte-Level Big Data Platform
Scylla Summit 2017: How Baidu Runs Scylla on a Petabyte-Level Big Data PlatformScylla Summit 2017: How Baidu Runs Scylla on a Petabyte-Level Big Data Platform
Scylla Summit 2017: How Baidu Runs Scylla on a Petabyte-Level Big Data Platform
 
Scylla Summit 2017: Welcome and Keynote - Nextgen NoSQL
Scylla Summit 2017: Welcome and Keynote - Nextgen NoSQLScylla Summit 2017: Welcome and Keynote - Nextgen NoSQL
Scylla Summit 2017: Welcome and Keynote - Nextgen NoSQL
 
Scylla Summit 2017: Repair, Backup, Restore: Last Thing Before You Go to Prod...
Scylla Summit 2017: Repair, Backup, Restore: Last Thing Before You Go to Prod...Scylla Summit 2017: Repair, Backup, Restore: Last Thing Before You Go to Prod...
Scylla Summit 2017: Repair, Backup, Restore: Last Thing Before You Go to Prod...
 
Scylla Summit 2017 Keynote: NextGen NoSQL with Chairman Benny Schnaider
Scylla Summit 2017 Keynote: NextGen NoSQL with Chairman Benny SchnaiderScylla Summit 2017 Keynote: NextGen NoSQL with Chairman Benny Schnaider
Scylla Summit 2017 Keynote: NextGen NoSQL with Chairman Benny Schnaider
 
Scylla Summit 2017: Stretching Scylla Silly: The Datastore of a Graph Databas...
Scylla Summit 2017: Stretching Scylla Silly: The Datastore of a Graph Databas...Scylla Summit 2017: Stretching Scylla Silly: The Datastore of a Graph Databas...
Scylla Summit 2017: Stretching Scylla Silly: The Datastore of a Graph Databas...
 

Viewers also liked

Scylla Summit 2017: Cry in the Dojo, Laugh in the Battlefield: How We Constan...
Scylla Summit 2017: Cry in the Dojo, Laugh in the Battlefield: How We Constan...Scylla Summit 2017: Cry in the Dojo, Laugh in the Battlefield: How We Constan...
Scylla Summit 2017: Cry in the Dojo, Laugh in the Battlefield: How We Constan...
ScyllaDB
 
Scylla Summit 2017: Managing 10,000 Node Storage Clusters at Twitter
Scylla Summit 2017: Managing 10,000 Node Storage Clusters at TwitterScylla Summit 2017: Managing 10,000 Node Storage Clusters at Twitter
Scylla Summit 2017: Managing 10,000 Node Storage Clusters at Twitter
ScyllaDB
 
Scylla Summit 2017: Streaming ETL in Kafka for Everyone with KSQL
Scylla Summit 2017: Streaming ETL in Kafka for Everyone with KSQLScylla Summit 2017: Streaming ETL in Kafka for Everyone with KSQL
Scylla Summit 2017: Streaming ETL in Kafka for Everyone with KSQL
ScyllaDB
 
CassieQ: The Distributed Message Queue Built on Cassandra (Anton Kropp, Cural...
CassieQ: The Distributed Message Queue Built on Cassandra (Anton Kropp, Cural...CassieQ: The Distributed Message Queue Built on Cassandra (Anton Kropp, Cural...
CassieQ: The Distributed Message Queue Built on Cassandra (Anton Kropp, Cural...
DataStax
 
Scylla Summit 2017: Keynote, Looking back, looking ahead
Scylla Summit 2017: Keynote, Looking back, looking aheadScylla Summit 2017: Keynote, Looking back, looking ahead
Scylla Summit 2017: Keynote, Looking back, looking ahead
ScyllaDB
 
How to Monitor and Size Workloads on AWS i3 instances
How to Monitor and Size Workloads on AWS i3 instancesHow to Monitor and Size Workloads on AWS i3 instances
How to Monitor and Size Workloads on AWS i3 instances
ScyllaDB
 

Viewers also liked (6)

Scylla Summit 2017: Cry in the Dojo, Laugh in the Battlefield: How We Constan...
Scylla Summit 2017: Cry in the Dojo, Laugh in the Battlefield: How We Constan...Scylla Summit 2017: Cry in the Dojo, Laugh in the Battlefield: How We Constan...
Scylla Summit 2017: Cry in the Dojo, Laugh in the Battlefield: How We Constan...
 
Scylla Summit 2017: Managing 10,000 Node Storage Clusters at Twitter
Scylla Summit 2017: Managing 10,000 Node Storage Clusters at TwitterScylla Summit 2017: Managing 10,000 Node Storage Clusters at Twitter
Scylla Summit 2017: Managing 10,000 Node Storage Clusters at Twitter
 
Scylla Summit 2017: Streaming ETL in Kafka for Everyone with KSQL
Scylla Summit 2017: Streaming ETL in Kafka for Everyone with KSQLScylla Summit 2017: Streaming ETL in Kafka for Everyone with KSQL
Scylla Summit 2017: Streaming ETL in Kafka for Everyone with KSQL
 
CassieQ: The Distributed Message Queue Built on Cassandra (Anton Kropp, Cural...
CassieQ: The Distributed Message Queue Built on Cassandra (Anton Kropp, Cural...CassieQ: The Distributed Message Queue Built on Cassandra (Anton Kropp, Cural...
CassieQ: The Distributed Message Queue Built on Cassandra (Anton Kropp, Cural...
 
Scylla Summit 2017: Keynote, Looking back, looking ahead
Scylla Summit 2017: Keynote, Looking back, looking aheadScylla Summit 2017: Keynote, Looking back, looking ahead
Scylla Summit 2017: Keynote, Looking back, looking ahead
 
How to Monitor and Size Workloads on AWS i3 instances
How to Monitor and Size Workloads on AWS i3 instancesHow to Monitor and Size Workloads on AWS i3 instances
How to Monitor and Size Workloads on AWS i3 instances
 

Similar to Scylla Summit 2017: Stateful Streaming Applications with Apache Spark

Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
Databricks
 
Spark streaming
Spark streamingSpark streaming
Spark streaming
Whiteklay
 
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming:  Apache Spark Meetup at Bloomberg 2016 A Deep Dive into Structured Streaming:  Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016
Databricks
 
40043 claborn
40043 claborn40043 claborn
40043 claborn
Baba Ib
 
Data Processing with Apache Spark Meetup Talk
Data Processing with Apache Spark Meetup TalkData Processing with Apache Spark Meetup Talk
Data Processing with Apache Spark Meetup Talk
Eren Avşaroğulları
 
Wamika Singh, Suman Kumari - Let's decipher the DevOps macedonia - Codemotion...
Wamika Singh, Suman Kumari - Let's decipher the DevOps macedonia - Codemotion...Wamika Singh, Suman Kumari - Let's decipher the DevOps macedonia - Codemotion...
Wamika Singh, Suman Kumari - Let's decipher the DevOps macedonia - Codemotion...
Codemotion
 
An Architect's guide to real time big data systems
An Architect's guide to real time big data systemsAn Architect's guide to real time big data systems
An Architect's guide to real time big data systems
Raja SP
 
Taking Spark Streaming to the Next Level with Datasets and DataFrames
Taking Spark Streaming to the Next Level with Datasets and DataFramesTaking Spark Streaming to the Next Level with Datasets and DataFrames
Taking Spark Streaming to the Next Level with Datasets and DataFrames
Databricks
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streaming
Knoldus Inc.
 
Witsml data processing with kafka and spark streaming
Witsml data processing with kafka and spark streamingWitsml data processing with kafka and spark streaming
Witsml data processing with kafka and spark streaming
Mark Kerzner
 
Let's decipher the DevOps macedonia
Let's decipher the DevOps macedoniaLet's decipher the DevOps macedonia
Let's decipher the DevOps macedonia
Wamika Singh
 
Stream Analytics
Stream Analytics Stream Analytics
Stream Analytics
Franco Ucci
 
Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark
Arbitrary Stateful Aggregations using Structured Streaming in Apache SparkArbitrary Stateful Aggregations using Structured Streaming in Apache Spark
Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark
Databricks
 
Prog1 chap1 and chap 2
Prog1 chap1 and chap 2Prog1 chap1 and chap 2
Prog1 chap1 and chap 2
rowensCap
 
A Deep Dive into Structured Streaming in Apache Spark
A Deep Dive into Structured Streaming in Apache Spark A Deep Dive into Structured Streaming in Apache Spark
A Deep Dive into Structured Streaming in Apache Spark
Anyscale
 
Data Pipeline for The Big Data/Data Science OKC
Data Pipeline for The Big Data/Data Science OKCData Pipeline for The Big Data/Data Science OKC
Data Pipeline for The Big Data/Data Science OKC
Mark Smith
 
The Future of Real-Time in Spark
The Future of Real-Time in SparkThe Future of Real-Time in Spark
The Future of Real-Time in Spark
Reynold Xin
 
The Future of Real-Time in Spark
The Future of Real-Time in SparkThe Future of Real-Time in Spark
The Future of Real-Time in Spark
Databricks
 
Tecnicas e Instrumentos de Recoleccion de Datos
Tecnicas e Instrumentos de Recoleccion de DatosTecnicas e Instrumentos de Recoleccion de Datos
Tecnicas e Instrumentos de Recoleccion de Datos
Angel Giraldo
 
Moving Towards a Streaming Architecture
Moving Towards a Streaming ArchitectureMoving Towards a Streaming Architecture
Moving Towards a Streaming Architecture
Gabriele Modena
 

Similar to Scylla Summit 2017: Stateful Streaming Applications with Apache Spark (20)

Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das
 
Spark streaming
Spark streamingSpark streaming
Spark streaming
 
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming:  Apache Spark Meetup at Bloomberg 2016 A Deep Dive into Structured Streaming:  Apache Spark Meetup at Bloomberg 2016
A Deep Dive into Structured Streaming: Apache Spark Meetup at Bloomberg 2016
 
40043 claborn
40043 claborn40043 claborn
40043 claborn
 
Data Processing with Apache Spark Meetup Talk
Data Processing with Apache Spark Meetup TalkData Processing with Apache Spark Meetup Talk
Data Processing with Apache Spark Meetup Talk
 
Wamika Singh, Suman Kumari - Let's decipher the DevOps macedonia - Codemotion...
Wamika Singh, Suman Kumari - Let's decipher the DevOps macedonia - Codemotion...Wamika Singh, Suman Kumari - Let's decipher the DevOps macedonia - Codemotion...
Wamika Singh, Suman Kumari - Let's decipher the DevOps macedonia - Codemotion...
 
An Architect's guide to real time big data systems
An Architect's guide to real time big data systemsAn Architect's guide to real time big data systems
An Architect's guide to real time big data systems
 
Taking Spark Streaming to the Next Level with Datasets and DataFrames
Taking Spark Streaming to the Next Level with Datasets and DataFramesTaking Spark Streaming to the Next Level with Datasets and DataFrames
Taking Spark Streaming to the Next Level with Datasets and DataFrames
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streaming
 
Witsml data processing with kafka and spark streaming
Witsml data processing with kafka and spark streamingWitsml data processing with kafka and spark streaming
Witsml data processing with kafka and spark streaming
 
Let's decipher the DevOps macedonia
Let's decipher the DevOps macedoniaLet's decipher the DevOps macedonia
Let's decipher the DevOps macedonia
 
Stream Analytics
Stream Analytics Stream Analytics
Stream Analytics
 
Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark
Arbitrary Stateful Aggregations using Structured Streaming in Apache SparkArbitrary Stateful Aggregations using Structured Streaming in Apache Spark
Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark
 
Prog1 chap1 and chap 2
Prog1 chap1 and chap 2Prog1 chap1 and chap 2
Prog1 chap1 and chap 2
 
A Deep Dive into Structured Streaming in Apache Spark
A Deep Dive into Structured Streaming in Apache Spark A Deep Dive into Structured Streaming in Apache Spark
A Deep Dive into Structured Streaming in Apache Spark
 
Data Pipeline for The Big Data/Data Science OKC
Data Pipeline for The Big Data/Data Science OKCData Pipeline for The Big Data/Data Science OKC
Data Pipeline for The Big Data/Data Science OKC
 
The Future of Real-Time in Spark
The Future of Real-Time in SparkThe Future of Real-Time in Spark
The Future of Real-Time in Spark
 
The Future of Real-Time in Spark
The Future of Real-Time in SparkThe Future of Real-Time in Spark
The Future of Real-Time in Spark
 
Tecnicas e Instrumentos de Recoleccion de Datos
Tecnicas e Instrumentos de Recoleccion de DatosTecnicas e Instrumentos de Recoleccion de Datos
Tecnicas e Instrumentos de Recoleccion de Datos
 
Moving Towards a Streaming Architecture
Moving Towards a Streaming ArchitectureMoving Towards a Streaming Architecture
Moving Towards a Streaming Architecture
 

More from ScyllaDB

Unconventional Methods to Identify Bottlenecks in Low-Latency and High-Throug...
Unconventional Methods to Identify Bottlenecks in Low-Latency and High-Throug...Unconventional Methods to Identify Bottlenecks in Low-Latency and High-Throug...
Unconventional Methods to Identify Bottlenecks in Low-Latency and High-Throug...
ScyllaDB
 
Mitigating the Impact of State Management in Cloud Stream Processing Systems
Mitigating the Impact of State Management in Cloud Stream Processing SystemsMitigating the Impact of State Management in Cloud Stream Processing Systems
Mitigating the Impact of State Management in Cloud Stream Processing Systems
ScyllaDB
 
Measuring the Impact of Network Latency at Twitter
Measuring the Impact of Network Latency at TwitterMeasuring the Impact of Network Latency at Twitter
Measuring the Impact of Network Latency at Twitter
ScyllaDB
 
Architecting a High-Performance (Open Source) Distributed Message Queuing Sys...
Architecting a High-Performance (Open Source) Distributed Message Queuing Sys...Architecting a High-Performance (Open Source) Distributed Message Queuing Sys...
Architecting a High-Performance (Open Source) Distributed Message Queuing Sys...
ScyllaDB
 
Noise Canceling RUM by Tim Vereecke, Akamai
Noise Canceling RUM by Tim Vereecke, AkamaiNoise Canceling RUM by Tim Vereecke, Akamai
Noise Canceling RUM by Tim Vereecke, Akamai
ScyllaDB
 
Running a Go App in Kubernetes: CPU Impacts
Running a Go App in Kubernetes: CPU ImpactsRunning a Go App in Kubernetes: CPU Impacts
Running a Go App in Kubernetes: CPU Impacts
ScyllaDB
 
Always-on Profiling of All Linux Threads, On-CPU and Off-CPU, with eBPF & Con...
Scylla Summit 2017: Stateful Streaming Applications with Apache Spark

  • 1. Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark™. Burak Yavuz, Software Engineer, Databricks
  • 2. Burak Yavuz. Software Engineer, Databricks ("We make your streams come true"); Apache Spark Committer as of Feb 2017; MS in Management Science & Engineering, Stanford University; BS in Mechanical Engineering, Bogazici University, Istanbul
  • 3. About. TEAM: Started Spark project (now Apache Spark) at UC Berkeley in 2009. PRODUCT: Unified Analytics Platform. MISSION: Making Big Data Simple
  • 4. Outline: Structured Streaming Concepts; Stateful Processing in Structured Streaming; Use Cases and How NoSQL Stores Fit In; Demos
  • 5. The simplest way to perform streaming analytics is not having to reason about streaming at all
  • 7. New Model. Input: data from source as an append-only table. Trigger: how frequently to check input for new data. Query: operations on input (usual map/filter/reduce; new window, session ops). [Diagram: with a trigger every 1 sec, the query runs at times 1, 2, 3 over input data up to 1, up to 2, up to 3.]
  • 8. New Model. Result: final operated table updated every trigger interval. Output: what part of result to write to data sink after every trigger. Complete output: write full result table every time. [Diagram: with a trigger every 1 sec, the query produces the result for data up to 1, up to 2, up to 3; complete mode outputs all the rows in the result table.]
  • 9. New Model. Result: final operated table updated every trigger interval. Output: what part of result to write to data sink after every trigger. Complete output: write full result table every time. Append output: write only new rows that got added to result table since previous batch. *Not all output modes are feasible with all queries. [Diagram: same timeline as slide 8; append mode outputs only new rows since last trigger.]
  • 11. Output Modes. Append mode (default): New rows added to the Result Table since the last trigger will be output to the sink. Rows will be output only once, and cannot be rescinded. Example use cases: ETL
  • 12. Output Modes. Complete mode: The whole Result Table will be output to the sink after every trigger. This is supported for aggregation queries. Example use cases: Monitoring
  • 13. Output Modes. Update mode (available since Spark 2.1.1): Only the rows in the Result Table that were updated since the last trigger will be output to the sink. Example use cases: Alerting, Sessionization
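The three output modes on slides 11 to 13 can be illustrated with a small plain-Scala sketch that compares the result table before and after a trigger. It does not use Spark; `OutputModesSketch`, `ResultTable`, and the three function names are hypothetical names chosen for illustration. It is also a simplification: it follows the slide definitions literally, whereas in Spark itself append mode on an aggregation only emits a row once the watermark guarantees it will no longer change.

```scala
// Plain-Scala illustration (no Spark dependency) of what each output mode emits.
object OutputModesSketch {
  type ResultTable = Map[String, Long] // row key -> aggregate value

  // Complete mode: the whole result table after every trigger.
  def complete(prev: ResultTable, curr: ResultTable): ResultTable = curr

  // Append mode: only rows that did not exist at the previous trigger.
  def append(prev: ResultTable, curr: ResultTable): ResultTable =
    curr.filter { case (k, _) => !prev.contains(k) }

  // Update mode: rows that are new or whose value changed since the previous trigger.
  def update(prev: ResultTable, curr: ResultTable): ResultTable =
    curr.filter { case (k, v) => !prev.get(k).contains(v) }

  def main(args: Array[String]): Unit = {
    val afterTrigger1: ResultTable = Map("12:00-13:00" -> 1L)
    val afterTrigger2: ResultTable = Map("12:00-13:00" -> 3L, "13:00-14:00" -> 1L)
    println(complete(afterTrigger1, afterTrigger2))
    println(append(afterTrigger1, afterTrigger2))
    println(update(afterTrigger1, afterTrigger2))
  }
}
```

Update mode sits between the other two: it emits less than complete mode but, unlike append mode, it can revise a row that was already written, so the sink has to support upserts.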
  • 14. Outline: Structured Streaming Concepts; Stateful Processing in Structured Streaming; Use Cases and How NoSQL Stores Fit In; Demos
  • 15. Event time Aggregations. Many use cases require aggregate statistics by event time, e.g. what's the #errors in each system in 1 hour windows? Many challenges: extracting event time from data, handling late, out-of-order data. DStream APIs were insufficient for event time operations
  • 16. Event time Aggregations. Windowing is just another type of grouping in Structured Streaming. Number of records every hour: parsedData.groupBy(window("timestamp", "1 hour")).count() Average signal strength of each device every 10 mins: parsedData.groupBy("device", window("timestamp", "10 mins")).avg("signal") Use built-in functions to extract event-time. No need for separate extractors
  • 17. Advanced Aggregations. Powerful built-in aggregations: variance, stddev, kurtosis, stddev_samp, collect_list, collect_set, corr, approx_count_distinct, ... Multiple simultaneous aggregations: parsedData.groupBy(window("timestamp", "1 hour")).agg(avg("signal"), stddev("signal"), max("signal")) Custom aggs using reduceGroups, UDAFs: /* Compute histogram of signal by device type. */ val hist = ds.groupByKey(_.`type`).mapGroups { case (deviceType, data: Iterator[DeviceData]) => val buckets = new Array[Int](10); data.map(_.signal).foreach { a => buckets(a / 10) += 1 }; (deviceType, buckets) }
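Slide 16's point that windowing is just another kind of grouping can be sketched in plain Scala, without Spark. The sketch assumes tumbling (non-overlapping), epoch-aligned windows and non-negative millisecond timestamps; `WindowSketch`, `Event`, `windowStart`, `countPerWindow`, and `avgSignal` are hypothetical names for illustration, not Spark APIs.

```scala
// Plain-Scala illustration: a tumbling event-time window is just a grouping key.
object WindowSketch {
  final case class Event(device: String, timestampMs: Long, signal: Double)

  // Start of the tumbling window that contains timestampMs (assumes timestampMs >= 0).
  def windowStart(timestampMs: Long, windowMs: Long): Long =
    timestampMs - (timestampMs % windowMs)

  // Number of records per window, keyed by window start.
  def countPerWindow(events: Seq[Event], windowMs: Long): Map[Long, Int] =
    events
      .groupBy(e => windowStart(e.timestampMs, windowMs))
      .map { case (start, es) => start -> es.size }

  // Average signal per (device, window start).
  def avgSignal(events: Seq[Event], windowMs: Long): Map[(String, Long), Double] =
    events
      .groupBy(e => (e.device, windowStart(e.timestampMs, windowMs)))
      .map { case (key, es) => key -> es.map(_.signal).sum / es.size }
}
```

Because the window start is derived from the event's own timestamp, an out-of-order record still lands in the window it belongs to, which is what makes event-time aggregation tolerant of late data.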
  • 18. Stateful Processing for Aggregations. In-memory, streaming state maintained for aggregations. Keeping state allows late data to update counts of old windows. But size of the state increases indefinitely if old windows not dropped. [Diagram: window counts held in state after each trigger. 13:00: 12:00-13:00 = 1. 14:00: 12:00-13:00 = 3, 13:00-14:00 = 1. 15:00: 12:00-13:00 = 3, 13:00-14:00 = 2, 14:00-15:00 = 5. 16:00: 12:00-13:00 = 5, 13:00-14:00 = 2, 14:00-15:00 = 5, 15:00-16:00 = 4. 17:00: 12:00-13:00 = 3, 13:00-14:00 = 2, 14:00-15:00 = 6, 15:00-16:00 = 4, 16:00-17:00 = 3. Red = state updated with late data.]
  • 20. Watermarking and Late Data. Watermark [Spark 2.1]: a moving threshold that trails behind the max seen event time. Trailing gap defines how late data is expected to be. [Diagram: max event time 12:30 PM, watermark 12:20 PM, trailing gap of 10 mins; data older than watermark not expected.]
  • 21. Watermarking and Late Data. Data newer than watermark may be late, but allowed to aggregate. Data older than watermark is "too late" and dropped. State older than watermark automatically deleted to limit the amount of intermediate state. [Diagram: event time axis with max event time and watermark; late data allowed to aggregate; data too late, dropped.]
  • 22. Watermarking and Late Data. Control the tradeoff between state size and lateness requirements. Handle more late data → keep more state. Reduce state → handle less lateness. parsedData.withWatermark("timestamp", "10 minutes").groupBy(window("timestamp", "5 minutes")).count() [Diagram: allowed lateness of 10 mins between max event time and watermark; late data allowed to aggregate; data too late, dropped.]
  • 23. Watermarking to Limit State [Spark 2.1]. parsedData.withWatermark("timestamp", "10 minutes").groupBy(window("timestamp", "5 minutes")).count() [Diagram: event time vs. processing time. The system tracks the max observed event time; the watermark is updated to 12:14 - 10m = 12:04 for the next trigger, and state < 12:04 is deleted. Data that is late is still considered in counts; data that is too late is ignored in counts and its state dropped.] More details in blog post!
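The watermark mechanics on slides 20 to 23 can be traced with a plain-Scala sketch that does not use Spark. Event times are minutes since midnight (12:00 is 720), and `WatermarkSketch`, `State`, and `processTrigger` are hypothetical names for illustration. Two simplifying assumptions: an event is dropped when its own timestamp is older than the watermark, and a window's state is evicted once the window's end is at or before the watermark. Spark's exact rules may differ in detail.

```scala
// Plain-Scala illustration of a watermark that trails the max event time by a fixed delay.
object WatermarkSketch {
  final case class State(
      maxEventTime: Long,     // max event time observed so far
      watermark: Long,        // threshold applied to the next trigger
      counts: Map[Long, Int]) // window start -> count

  val empty: State = State(0L, 0L, Map.empty)

  def windowStart(t: Long, windowMin: Long): Long = t - (t % windowMin)

  def processTrigger(s: State, batch: Seq[Long], delayMin: Long, windowMin: Long): State = {
    // Data older than the current watermark is "too late" and is dropped.
    val accepted = batch.filter(_ >= s.watermark)
    val counted = accepted.foldLeft(s.counts) { (acc, t) =>
      val w = windowStart(t, windowMin)
      acc.updated(w, acc.getOrElse(w, 0) + 1)
    }
    // The watermark for the next trigger trails the max observed event time; it never moves back.
    val maxSeen = (s.maxEventTime +: batch).max
    val newWatermark = math.max(s.watermark, maxSeen - delayMin)
    // State for windows that ended at or before the new watermark is deleted.
    val kept = counted.filter { case (start, _) => start + windowMin > newWatermark }
    State(maxSeen, newWatermark, kept)
  }
}
```

With a 10-minute delay and 5-minute windows, a first batch of 12:07, 12:08, 12:14 sets the watermark to 12:04, matching the slide's 12:14 - 10m = 12:04. A later 12:08 event still updates the 12:05 window, while a 12:01 event would be dropped.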
  • 25. Working With Time. df.withWatermark("timestampColumn", "5 hours").groupBy(window("timestampColumn", "1 minute")).count().writeStream.trigger(Trigger.ProcessingTime("10 seconds")) Separate processing details (output rate, late data tolerance) from query semantics.
  • 26. Working With Time (same query). groupBy(window("timestampColumn", "1 minute")): how to group data by time. Same in streaming & batch
  • 27. Working With Time (same query). withWatermark("timestampColumn", "5 hours"): how late data can be
  • 28. Working With Time (same query). trigger(Trigger.ProcessingTime("10 seconds")): how often to emit updates
  • 29. Arbitrary Stateful Operations [Spark 2.2]. mapGroupsWithState allows any user-defined stateful ops to a user-defined state. Direct support for per-key timeouts in event-time or processing-time. Supports Scala and Java. ds.groupByKey(groupingFunc).mapGroupsWithState(timeoutConf)(mappingWithStateFunc) def mappingWithStateFunc(key: K, values: Iterator[V], state: GroupState[S]): U = { /* update or remove state; set timeouts; return mapped value */ }
  • 30. flatMapGroupsWithState. Applies the given function to each group of data, while maintaining a user-defined per-group state. Invoked once per group in batch. Invoked each trigger (with the existence of data) per group in streaming. Requires user to provide an output mode for the function
  • 31. flatMapGroupsWithState. mapGroupsWithState is a special case with output mode Update and output size of 1 row per group. Supports both Processing Time and Event Time timeouts
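The calling pattern described on slides 29 to 31 (once per trigger, the function is invoked for each key that has new data, with that key's values and its saved state) can be mimicked in plain Scala. This is a sketch under stated assumptions, not Spark's implementation: `SimpleState` is a hypothetical stand-in for `GroupState[S]`, `runTrigger` and `runningCount` are illustrative names, and timeouts are not modeled.

```scala
// Plain-Scala illustration of the mapGroupsWithState calling pattern (timeouts not modeled).
object StatefulSketch {
  // Hypothetical stand-in for Spark's GroupState[S]: optional, mutable per-key state.
  final class SimpleState[S](private var value: Option[S]) {
    def exists: Boolean = value.isDefined
    def getOption: Option[S] = value
    def update(s: S): Unit = value = Some(s)
    def remove(): Unit = value = None
  }

  // One trigger: for every key present in the batch, call func once with that key's
  // values and saved state, then persist (or delete) the state the function left behind.
  def runTrigger[K, V, S, U](
      store: Map[K, S],
      batch: Seq[(K, V)])(
      func: (K, Iterator[V], SimpleState[S]) => U): (Map[K, S], Seq[U]) =
    batch.groupBy(_._1).foldLeft((store, Vector.empty[U])) {
      case ((st, out), (key, rows)) =>
        val state = new SimpleState[S](st.get(key))
        val result = func(key, rows.iterator.map(_._2), state)
        val next = state.getOption match {
          case Some(s) => st.updated(key, s)
          case None    => st - key
        }
        (next, out :+ result)
    }

  // Example mapping function: running count per key, one output row per group.
  def runningCount(key: String, values: Iterator[Int], state: SimpleState[Long]): (String, Long) = {
    val total = state.getOption.getOrElse(0L) + values.size
    state.update(total)
    (key, total)
  }
}
```

Keys without new data in a trigger are not visited here, which mirrors the "with the existence of data" wording on slide 30. In Spark, a timeout is what lets the function run for a key that has gone quiet.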
  • 32. Outline: Structured Streaming Concepts; Stateful Processing in Structured Streaming; Use Cases and How NoSQL Stores Fit In; Demos
  • 33. Alerting. val monitoring = stream.as[Event].groupByKey(_.id).flatMapGroupsWithState(Append, GroupStateTimeout.ProcessingTimeTimeout) { (id: Int, events: Iterator[Event], state: GroupState[…]) => ... }.writeStream.queryName("alerts").foreach(new PagerdutySink(credentials)) Monitor a stream using custom stateful logic with timeouts.
  • 34. Alerting. Save your state to Scylla to power dashboards. Have the stream trigger alerts ASAP
  • 35. Sessionization. val monitoring = stream.as[Event].groupByKey(_.session_id).mapGroupsWithState(GroupStateTimeout.EventTimeTimeout) { (id: Int, events: Iterator[Event], state: GroupState[…]) => ... }.writeStream.scylla("trips") Analyze sessions of user/system behavior
  • 36. Sessionization. Update sessions in your stream. Save it to a NoSQL store like Scylla!
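The sessionization use case on slides 35 and 36 leaves the body of the state function elided. One possible shape for that logic, sketched in plain Scala without Spark, is below. Everything in it is an assumption made for illustration: the `Event` and `Session` fields, the `gap` parameter, and the rule that a session closes once the watermark passes its last event plus the gap (a rough analogue of an event-time timeout).

```scala
// Plain-Scala illustration of session tracking with an event-time style timeout.
object SessionSketch {
  final case class Event(sessionId: String, timestamp: Long)
  final case class Session(start: Long, end: Long, numEvents: Int)

  // Fold one trigger's events into the open sessions, then split off the sessions
  // that have timed out. Returns (still open, closed).
  def update(
      open: Map[String, Session],
      batch: Seq[Event],
      watermark: Long,
      gap: Long): (Map[String, Session], Map[String, Session]) = {
    val merged = batch.foldLeft(open) { (acc, e) =>
      val s = acc.get(e.sessionId) match {
        case Some(prev) =>
          Session(
            math.min(prev.start, e.timestamp),
            math.max(prev.end, e.timestamp),
            prev.numEvents + 1)
        case None => Session(e.timestamp, e.timestamp, 1)
      }
      acc.updated(e.sessionId, s)
    }
    // A session is closed once the watermark passes its last event plus the gap.
    val (closed, stillOpen) = merged.partition { case (_, s) => s.end + gap <= watermark }
    (stillOpen, closed)
  }
}
```

In a real pipeline the closed sessions are what would be written to the external store, while the open ones remain as streaming state.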
  • 37. Demo
  • 38. Try Spark 2.2 on Community Edition today! https://databricks.com/try-databricks
  • 39. Apache Spark's Structured Streaming at Scale Series. https://databricks.com/blog/category/engineering Twitter: @databricks
  • 40. We are hiring! https://databricks.com/company/careers
  • 41. THANK YOU. burak@databricks.com "Does anyone have any questions for my answers?" - Henry Kissinger