What happens to a request that reaches Scylla, and why should one care? Understanding how Scylla executes your queries can help you make better architectural decisions and better understand the performance of your application.
Are my rows too big? Should I make that other column a part of my partition key instead? This talk will cover the interaction between nodes, shards and the role of Scylla's internal components like memtables, cache and sstables. I will explain how different types of queries are executed and how to plan your queries for maximum performance.
Scylla Summit 2017 Keynote: NextGen NoSQL with CEO Dor Laor
ScyllaDB CEO and co-founder Dor Laor shares his vision for Scylla and announces Scylla 2.0, a big step towards the first autonomous NoSQL database—one that dynamically tunes itself to varying conditions while always maintaining a high level of performance.
Scylla Summit 2017: A Deep Dive on Heat Weighted Load Balancing
This presentation discusses the "cold node problem" that occurs when a node restarts in a Cassandra cluster. When a node restarts, it loses its cached data and becomes a bottleneck. The presentation proposes a "heat weighted load balancing" solution where the cluster tracks each node's cache hit ratio and redistributes requests based on this ratio after a restart. Testing shows this solution significantly improves throughput after a node restart by distributing requests more evenly across nodes based on their "heat" or cache contents.
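The mechanism above can be sketched in a few lines of Python. This is an illustrative model only, not Scylla's actual algorithm: the `request_weights` function and the floor constant are hypothetical, chosen to show how distributing reads in proportion to each node's cache hit ratio keeps a cold node lightly loaded while still letting its cache warm up.

```python
def request_weights(hit_ratios):
    """Return per-node request probabilities proportional to cache hit ratio.

    A small floor keeps a cold node receiving some traffic, so its
    cache can warm up at all. Illustrative only, not Scylla's formula.
    """
    floor = 0.05
    heats = [max(r, floor) for r in hit_ratios]
    total = sum(heats)
    return [h / total for h in heats]

# Three warm nodes and one node that just restarted with an empty cache:
weights = request_weights([0.9, 0.9, 0.9, 0.0])
# The cold node gets a small share of requests instead of an equal
# quarter, which would otherwise make it the cluster's bottleneck.
```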
Duarte Nunes presented on distributed materialized views in ScyllaDB. He discussed the challenges of implementing materialized views in a distributed system without a single master, including propagating updates from base tables to views, handling consistency when tables can diverge, and managing concurrent updates safely. His proposed solution uses asynchronous replica-based propagation paired with repair mechanisms and locking or optimistic concurrency to address these issues. Materialized views provide powerful indexing capabilities but also introduce performance overhead that is difficult to avoid given Scylla's data model.
Scylla Summit 2017: Running a Soft Real-time Service at One Million QPS
AdGear runs an ad tech gateway at more than one million queries per second to Scylla and recently transitioned from Apache Cassandra. In this talk, we will highlight the tools and languages that we use (Erlang), how we do bulk imports, and how performance compares between the two database engines.
Kubernetes is a declarative system for automatically deploying, managing, and scaling applications and their dependencies. In this short talk, I'll demonstrate a small Scylla cluster running in Google Compute Engine via Kubernetes and our publicly-published Docker images.
I will be giving a talk about performance characterization and tuning of Scylla on Samsung NVMe SSDs. We will characterize the performance of Scylla on Samsung high-performance NVMe SSDs and show how Z-SSD ─ the Samsung ultra-low-latency NVMe drive ─ can significantly shrink the performance gap between in-memory and in-storage with Scylla.
We will further evaluate the throughput-vs-latency profile of Scylla with NVMe devices and present end-to-end latencies (from the client's viewpoint) as well as the latencies of the software/hardware stack. We will show that a Z-SSD-backed Scylla cluster can provide competitive performance to an in-memory deployment while sharply reducing costs.
If You Care About Performance, Use User Defined Types
Shlomi Livne, VP of R&D at ScyllaDB, presented on the performance benefits of using user-defined types (UDTs) in ScyllaDB. He explained that with traditional columns, each column has overhead and flexibility comes at a price. However, with frozen UDTs, the columns are treated as a single unit, sharing metadata and improving performance. Livne showed results of a test where UDTs with many fields outperformed traditional columns with the same number of fields. However, he noted that Scylla's row cache and Java driver performance need improvement for UDTs.
Scylla Summit 2017: How to Ruin Your Workload's Performance by Choosing the W...
In my talk, I will present the different compaction strategies that Scylla provides, and demonstrate when it is appropriate and when it is inappropriate to use each one. I will then present a new compaction strategy that we designed as a lesson from the existing compaction strategies by picking the best features of the existing strategies while avoiding their problems.
ScyllaDB CTO Avi Kivity gave a keynote on how Scylla has evolved. He discussed new features in Scylla 2.0—including Materialized Views and Heat-Weighted Load Balancing, changes in monitoring—and shared our product roadmap. He also talked about our recent acquisition of Seastar.io and how it will enable us to deliver a database-as-a-service offering.
Scylla Summit 2017: How Baidu Runs Scylla on a Petabyte-Level Big Data Platform
In this presentation, I'll speak of the benefits of running Scylla on our Big Data environment which stores over 500TB of data as well as using Scylla as the indexing engine to replace MongoDB and Cassandra for our log data analysis platform.
Scylla Summit 2017: How to Optimize and Reduce Inter-DC Network Traffic and S...
The document appears to be a presentation on optimizing inter-data center communication. It discusses key topics like what inter-data center communication involves, the costs associated with it, best practices for setting snitches, keyspaces, client drivers and consistency levels for queries to optimize performance between data centers. It recommends using network topology replication strategies over simple strategies for multi-region deployments, setting load balancing and consistency levels appropriately in clients, and enabling internode compression to reduce costs of communication between data centers. The presentation encourages reviewing client locations, data access patterns, who is reading/writing data, and having conversations between operations and development teams to determine the best use cases.
Scylla Summit 2017: Scylla's Open Source Monitoring Solution
Scylla's monitoring capability has come a long way in the last year. We now have native support for Prometheus. Through scylla-grafana-monitoring, we have started providing default dashboards summarizing the most important aspects of Scylla for users. In this talk, I will cover what is currently available in our metrics, other non-standard metrics that are interesting but not available in our main dashboard, as well as our future plans for enhancement.
Scylla Summit 2017: Streaming ETL in Kafka for Everyone with KSQL
Apache Kafka is a high-throughput distributed streaming platform that is being adopted by hundreds of companies to manage their real-time data. KSQL is an open source streaming SQL engine that implements continuous, interactive queries against Apache Kafka™. KSQL makes it easy to read, write and process streaming data in real-time, at scale, using SQL-like semantics. In my talk, I will discuss streaming ETL from Kafka into stores like Apache Cassandra using KSQL.
Scylla Summit 2017: A Toolbox for Understanding Scylla in the Field
In this talk, we will share useful tools and techniques that we are using in the field to understand Scylla clusters. Users will learn how to use those same tools to better understand their deployment.
Some of the questions that will be answered are:
- how to find out which queries are the slowest and why
- how we go about understanding the impact of the data model in a node's performance
- how to check which resources are the bottlenecks in the cluster
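As a rough illustration of the first question, here is a small Python sketch (not one of the field tools from the talk) that ranks statements by mean latency from (query, latency) samples such as query tracing might produce; the sample queries are invented:

```python
from collections import defaultdict

def slowest_queries(samples, top=3):
    """Rank query strings by mean observed latency, slowest first."""
    by_query = defaultdict(list)
    for query, latency_ms in samples:
        by_query[query].append(latency_ms)
    means = {q: sum(ls) / len(ls) for q, ls in by_query.items()}
    return sorted(means, key=means.get, reverse=True)[:top]

samples = [
    ("SELECT * FROM t WHERE pk=?", 2.1),
    ("SELECT * FROM t WHERE pk=?", 1.9),
    ("SELECT * FROM big ALLOW FILTERING", 250.0),
]
```

In practice the same aggregation over tracing output quickly surfaces full scans and other data-model problems.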
Scylla Summit 2017: Performance Evaluation of Scylla as a Database Backend fo...
JanusGraph, a highly scalable graph database solution, has historically supported Cassandra and HBase as database backends. We decided to put Scylla in the mix, in search of the best-performing backend. We ran test scenarios covering high-volume reads and writes. In this talk, we will show you the performance results of Scylla versus the others and share the lessons we learned during the evaluation.
Scylla Summit 2017: Saving Thousands by Running Scylla on EC2 Spot Instances
Scylla and Spotinst together provide a strong combination of extreme performance and cost reduction. In this talk, we will present how a Scylla cluster can be used on AWS’s EC2 Spot without losing consistency with the help of Spotinst prediction technology and advanced stateful features. We will show a live demo on how to run Scylla on the Spotinst platform.
Scylla Summit 2017: Welcome and Keynote - Nextgen NoSQL
Our CEO and co-founder Dor Laor and our chairman Benny Schnaider share their vision for Scylla. This was also our opportunity to announce Scylla 2.0. Our latest release is a big step toward the first autonomous NoSQL database—one that dynamically tunes itself to varying conditions while always maintaining a high level of performance.
The document discusses new features and improvements in the MySQL 8.0 optimizer. Key highlights include:
- New SQL syntax like SELECT...FOR UPDATE SKIP LOCKED and NOWAIT to handle row locking contention.
- Support for common table expressions to improve readability and allow referencing derived tables multiple times.
- Enhancements to the cost model to produce more accurate estimates based on factors like data location.
- Better support for data types like UUID and IPv6, including optimized storage formats and new functions.
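To illustrate why an optimized storage format for UUIDs matters, the following sketch contrasts the 36-character text form with the 16-byte binary form — the same saving behind MySQL 8.0's UUID_TO_BIN()/BIN_TO_UUID() functions. The Python `uuid` module stands in for the server-side conversion:

```python
import uuid

u = uuid.UUID("12345678-1234-5678-1234-567812345678")

text_form = str(u)      # canonical text: 36 characters
binary_form = u.bytes   # raw value: 16 bytes

# Storing the raw bytes (e.g. in a BINARY(16) column) costs less than
# half the space of the text form, and compares faster in indexes.
```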
How to use Impala query plan and profile to fix performance issues
Apache Impala is a best-of-breed massively parallel processing SQL query engine and a fundamental component of the big data software stack. Juan Yu demystifies the cost model the Impala planner uses and how Impala optimizes queries, and explains how to identify performance bottlenecks through query plans and profiles and how to drive Impala to its full potential.
Best Practices for Migrating Legacy Data Warehouses into Amazon Redshift
The document summarizes best practices for migrating legacy data warehouses to Amazon Redshift. It covers architectural concepts like columnar storage and compression, data distribution styles, sort keys to optimize query performance, and materializing dimension columns in fact tables. The presentation provides an overview of these topics and their impact on storage, I/O and querying. Real-world examples are also given to illustrate key points.
The document discusses query execution in database management systems. It begins with an example query on a City, Country database and represents it in relational algebra. It then discusses different query execution strategies like table scan, nested loop join, sort merge join, and hash join. The strategies are compared based on their memory and disk I/O requirements. The document emphasizes that query execution plans can be optimized for parallelism and pipelining to improve performance.
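For example, the hash join strategy compared above can be sketched in Python (the table contents and column layout are invented for illustration): build a hash table on the smaller input, then probe it with the larger one.

```python
def hash_join(cities, countries):
    """Join city rows (name, country_code) with country rows (code, country)."""
    # Build phase: hash the smaller input on the join key.
    build = {code: country for code, country in countries}
    # Probe phase: one pass over the larger input.
    return [(name, build[cc]) for name, cc in cities if cc in build]

cities = [("Paris", "FR"), ("Lyon", "FR"), ("Oslo", "NO")]
countries = [("FR", "France"), ("NO", "Norway")]
joined = hash_join(cities, countries)
```

One pass over each input replaces the O(n·m) comparisons of a naive nested loop join, at the cost of the memory needed to hold the build side — exactly the memory-vs-I/O trade-off the strategies are compared on.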
Accurately and Reliably Extracting Data from the Web:
STALKER is a machine learning algorithm that learns to extract data from web pages using a small number of labeled examples provided by the user. It generates extraction rules in a hierarchical manner, exploiting the structure of the web source. The algorithm is efficient because most web pages have a fixed template with few variations. It also uses an active learning approach called co-testing to select the most informative examples for the user to label. The system verifies extracted data by comparing it to learned statistical patterns, and can automatically repair wrappers when sites change.
The document contains log output from a plugin making calls to load, write, and render PDF files from various URLs and local files. It initializes the plugin, opens streams for the PDF files, writes the stream data in chunks, and finally destroys the streams and plugin instance. This process is repeated for multiple PDF files loaded by the plugin.
The document provides an overview of various Oracle tips and tricks, including CASE statements, joins, timestamps, renaming tables/columns, merge statements, subqueries, window functions, hierarchical queries, XML, grouping sets, rollups and cubes, indexes, temporary tables and more. Key features introduced in Oracle 9i such as the CASE statement, full outer joins, timestamps and the WITH clause are highlighted.
After completing this lesson, you should be able to do the following:
Describe a view
Create a view
Retrieve data through a view
Alter the definition of a view
Insert, update, and delete data through a view
Drop a view
This document provides examples of SQL queries using aggregation functions such as SUM, AVG, MIN, MAX, and COUNT. It demonstrates how to use aggregation functions to calculate values across entire tables or groups of rows. It also shows how to use the GROUP BY clause to aggregate values for each unique value in a column, and the HAVING clause to filter groups based on aggregation results. Proper order of operations for aggregation queries is also discussed.
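The GROUP BY and HAVING behavior described above can be demonstrated with Python's standard-library sqlite3 module; the sales table is invented for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("north", 100), ("north", 300), ("south", 50)])

# HAVING filters groups after aggregation, unlike WHERE,
# which filters individual rows before grouping.
rows = conn.execute("""
    SELECT region, SUM(amount) AS total
    FROM sales
    GROUP BY region
    HAVING SUM(amount) > 200
""").fetchall()
# rows == [('north', 400)]: south's total of 50 is filtered out.
```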
This document provides an overview of the MySQL query optimizer. It discusses the main phases of the optimizer including logical transformations, cost-based optimizations, analyzing access methods, join ordering, and plan refinements. Logical transformations prepare the query for cost-based optimization by simplifying conditions. Cost-based optimizations select the optimal join order and access methods to minimize resources used. Access methods analyzed include table scans, index scans, and ref access. The join optimizer searches for the best join order. Plan refinements include sort avoidance and index condition pushdown.
Scylla Summit 2017: Scylla for Mass Simultaneous Sensor Data Processing of ME... (ScyllaDB)
We will share Scylla adoption practices in equipment sensor data management of MES, Data Modeling Tips, Data Architecture using Scylla, configurations, and tunings.
Scylla Summit 2017: Snapfish's Journey Towards Scylla (ScyllaDB)
Snapfish, a web-based photo and printing service, will walk through their evaluation process for a new database, discuss use cases, and how they plan to use Scylla in their production systems.
Scylla Summit 2017: How to Use Gocql to Execute Queries and What the Driver D... (ScyllaDB)
This document outlines a presentation on using the GoCQL driver to execute queries against Cassandra and Scylla databases. It discusses connecting to a Cassandra cluster, executing queries, iterating over results, and using asynchronous queries. It also mentions some additional Cassandra libraries built on top of GoCQL, including gocqlx for data binding and queries, and gocassa for queries and migrations. The presentation aims to explain how GoCQL works behind the scenes and how to get started with basic querying functionality.
OrientDB vs Neo4j - Comparison of query/speed/functionality (Curtis Mosters)
This presentation gives an overview of OrientDB and Neo4j. It also compares some specific queries, their speed, and the overall functionality of both databases.
The queries may not be optimized in either case; at least they produce the same results and are both written as queries. In Neo4j you would ideally do this in Java code, but that is much harder to write, so this presentation is more of a direct comparison than an attempt at the best possible results.
The evaluation was done with real data, roughly 200 GB in total.
This is a paper I wrote at Hotsos where we used Method-R and Trace Data to optimize performance. SQL tuning can be simple if you ask the right questions.
Homework 5
due Wednesday, March 8, 5pm
Submit zipped .m files on Canvas and printed published file in 182 George St box #15 or #16
You are encouraged to work with other students on this assignment, but you are expected to write and work on your own answers. You don't need to provide the names of students you worked with.
You can find information about the usage and syntax of any built-in Matlab function by typing
help <functionname>
Presentation of Common Table Expressions (CTE), recursive or not, a new feature in MySQL 8.0; slides written by Guilhem Bichot, developer of the feature, and presented by him at the Percona Live conference in Dublin on 2017-09-26.
WITSML Data Processing with Kafka and Spark Streaming (Mark Kerzner)
This document summarizes a presentation about using Kafka and Spark Streaming to process real-time well data in WITSML format. It discusses WITSML data standards, using Kafka as a messaging system to ingest WITSML data from rigs and service companies, and Spark Streaming to consume Kafka topics and apply rules to detect anomalies and send alerts. Visualizing the data in real-time using Highcharts javascript is also covered. Lessons learned focus on improving data partitioning and managing producer/consumer services.
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De... (Databricks)
This document summarizes a presentation on extending Spark SQL Data Sources APIs with join push down. The presentation discusses how join push down can significantly improve query performance by reducing data transfer and exploiting data source capabilities like indexes. It provides examples of join push down in enterprise data pipelines and SQL acceleration use cases. The presentation also outlines the challenges of network speeds and exploiting data source capabilities, and how join push down addresses these challenges. Future work discussed includes building a cost model for global optimization across data sources.
Tech talk by Serena Signorelli (https://www.linkedin.com/in/serenasignorelli/) in the event ''Tensorflow and Sparklyr: Scaling Deep Learning and R to the Big Data ecosystem'', May 15, 2017 at ICTeam Grassobbio (BG). The event was part of the Data Science Milan Meetup (https://www.meetup.com/it-IT/Data-Science-Milan/).
Similar to Scylla Summit 2017: Planning Your Queries for Maximum Performance (20)
Unconventional Methods to Identify Bottlenecks in Low-Latency and High-Throug... (ScyllaDB)
In this presentation, we explore how standard profiling and monitoring methods may fall short in identifying bottlenecks in low-latency data ingestion workflows. Instead, we showcase the power of simple yet clever methods that can uncover hidden performance limitations.
Attendees will discover unconventional techniques, including clever logging, targeted instrumentation, and specialized metrics, to pinpoint bottlenecks accurately. Real-world use cases will be presented to demonstrate the effectiveness of these methods. By the end of the session, attendees will be equipped with alternative approaches to identify bottlenecks and optimize their low-latency data ingestion workflows for high throughput.
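One form of the "clever logging, targeted instrumentation" idea can be sketched as a Python decorator that accumulates per-stage wall-clock time, so the slowest stage of an ingestion pipeline stands out even when an aggregate profile looks flat. The stage names and pipeline here are invented for illustration:

```python
import time
from collections import defaultdict

stage_times = defaultdict(float)

def timed(stage):
    """Decorator accumulating wall-clock time spent in each pipeline stage."""
    def wrap(fn):
        def inner(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                stage_times[stage] += time.perf_counter() - start
        return inner
    return wrap

@timed("parse")
def parse(record):
    return record.strip()

@timed("write")
def write(record):
    time.sleep(0.01)  # stand-in for a slow downstream write

for r in ["a ", "b "]:
    write(parse(r))

bottleneck = max(stage_times, key=stage_times.get)
```

Summing wall-clock time per stage (rather than CPU time) is what exposes waits on I/O and backpressure, which sampling profilers often miss.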
Mitigating the Impact of State Management in Cloud Stream Processing SystemsScyllaDB
Stream processing is a crucial component of modern data infrastructure, but constructing an efficient and scalable stream processing system can be challenging. Decoupling compute and storage architecture has emerged as an effective solution to these challenges, but it can introduce high latency issues, especially when dealing with complex continuous queries that necessitate managing extra-large internal states.
In this talk, we focus on addressing the high latency issues associated with S3 storage in stream processing systems that employ a decoupled compute and storage architecture. We delve into the root causes of latency in this context and explore various techniques to minimize the impact of S3 latency on stream processing performance. Our proposed approach is to implement a tiered storage mechanism that leverages a blend of high-performance and low-cost storage tiers to reduce data movement between the compute and storage layers while maintaining efficient processing.
Throughout the talk, we will present experimental results that demonstrate the effectiveness of our approach in mitigating the impact of S3 latency on stream processing. By the end of the talk, attendees will have gained insights into how to optimize their stream processing systems for reduced latency and improved cost-efficiency.
Measuring the Impact of Network Latency at TwitterScyllaDB
Widya Salim and Victor Ma will outline the causal impact analysis, framework, and key learnings used to quantify the impact of reducing Twitter's network latency.
Architecting a High-Performance (Open Source) Distributed Message Queuing Sys...ScyllaDB
BlazingMQ is a new open source* distributed message queuing system developed at and published by Bloomberg. It provides highly-performant queues to applications for asynchronous, efficient, and reliable communication. This system has been used at scale at Bloomberg for eight years, where it moves terabytes of data and billions of messages across tens of thousands of queues in production every day.
BlazingMQ provides highly-available, fault-tolerant queues courtesy of replication based on the Raft consensus algorithm. In addition, it provides a rich set of enterprise message routing strategies, enabling users to implement a variety of scenarios for message processing.
Written in C++ from the ground up, BlazingMQ has been architected with low latency as one of its core requirements. This has resulted in some unique design and implementation choices at all levels of the system, such as its lock-free threading model, custom memory allocators, compact wire protocol, multi-hop network topology, and more.
This talk will provide an overview of BlazingMQ. We will then delve into the system’s core design principles, architecture, and implementation details in order to explore the crucial role they play in its performance and reliability.
*BlazingMQ will be released as open source between now and P99 (exact timing is still TBD)
Noise Canceling RUM by Tim Vereecke, AkamaiScyllaDB
Noisy Real User Monitoring (RUM) data can ruin your P99!
We introduce a fresh concept called "Human Visible Navigations" (HVN) to tackle this risk; we focus on the experiences you actually care about when talking about the speed of your sites:
- Human: We exclude noise coming from bots and synthetic measurements.
- Visible: We remove any partial or fully hidden experiences. These tend to be very slow but users don’t see this slowness.
- Navigations: We ignore lightning fast back-forward navigations which usually have few optimisation opportunities.
Adopting Human Visible Navigations provides you with these key benefits:
- Fewer changes staying below the radar
- Fewer data fluctuations
- Fewer blindspots when finding bottlenecks
- Better correlation with business metrics
This is supported by plenty of real-world examples coming from the world's largest scale modeling site (6M monthly visits), in combination with aggregated data from the brand new rumarchive.com (open source).
After attending this session; your P99 and other percentiles will become less noisy and easier to tune!
Always-on Profiling of All Linux Threads, On-CPU and Off-CPU, with eBPF & Con...ScyllaDB
In this session, Tanel introduces a new open source eBPF tool for efficiently sampling both on-CPU events and off-CPU events for every thread (task) in the OS. Linux standard performance tools (like perf) allow you to easily profile on-CPU threads doing work, but if we want to include the off-CPU timing and reasons for the full picture, things get complicated. Combining eBPF task state arrays with periodic sampling for profiling allows us to get both a system-level overview of where threads spend their time, even when blocked and sleeping, and allow us to drill down into individual thread level, to understand why.
Performance Budgets for the Real World by Tammy EvertsScyllaDB
Performance budgets have been around for more than ten years. Over those years, we’ve learned a lot about what works, what doesn’t, and what we need to improve. In this session, Tammy revisits old assumptions about performance budgets and offers some new best practices. Topics include:
• Understanding performance budgets vs. performance goals
• Aligning budgets with user experience
• Pros and cons of Core Web Vitals
• How to stay on top of your budgets to fight regressions
Using Libtracecmd to Analyze Your Latency and Performance TroublesScyllaDB
Trying to figure out why your application is responding late can be difficult, especially if it is because of interference from the operating system. This talk will briefly go over how to write a C program that can analyze what in the Linux system is interfering with your application. It will use trace-cmd to enable kernel trace events as well as tracing lock functions, and it will then go over a quick tutorial on how to use libtracecmd to read the created trace.dat file to uncover the cause of interference to your application.
Reducing P99 Latencies with Generational ZGCScyllaDB
With the low-latency garbage collector ZGC, GC pause times are no longer a big problem in Java. With sub-millisecond pause times, there are instead other things in the GC and JVM that can cause application threads to experience unexpected latencies. This talk will dig into a specific use case where the GC pauses are no longer the cause of unexpected latencies, and look at how adding generations to ZGC helps lower the p99 application latencies.
5 Hours to 7.7 Seconds: How Database Tricks Sped up Rust Linting Over 2000XScyllaDB
Linters are a type of database! They are a collection of lint rules — queries that look for rule violations to report — plus a way to execute those queries over a source code dataset.
This is a case study about using database ideas to build a linter that looks for breaking changes in Rust library APIs. Maintainability and performance are key: new Rust releases tend to have mutually-incompatible ways of representing API information, and we cannot afford to reimplement and optimize dozens of rules for each Rust version separately. Fortunately, databases don't require rewriting queries when the underlying storage format or query plan changes! This allows us to ship massive optimizations and support multiple Rust versions without making any changes to the queries that describe lint rules.
"Ship now, optimize later" can be a sustainable development practice after all. Join us to see how!
How Netflix Builds High Performance Applications at Global ScaleScyllaDB
We all want to build applications that are blazingly fast. We also want to scale them to users all over the world. Can the two happen together? Can users in the slowest of environments also get a fast experience? Learn how we do this at Netflix: how we understand every user's needs and preferences and build high performance applications that work for every user, every time.
Conquering Load Balancing: Experiences from ScyllaDB DriversScyllaDB
Load balancing seems simple on the surface, with algorithms like round-robin, but the real world loves throwing curveballs. Join me in this session as we delve into the intricacies of load balancing within ScyllaDB Drivers. Discover firsthand experiences from our journey in driver development, where we employed the Power of Two Choices algorithm, optimized the implementation of load balancing in Rust Driver, mitigated cloud costs through zone-aware load balancing and combated the issue of overloading a particular core of ScyllaDB. Be prepared to delve into the practical and theoretical aspects of load balancing, gaining valuable insights along the way.
Interaction Latency: Square's User-Centric Mobile Performance MetricScyllaDB
Mobile performance metrics often take inspiration from the backend world and measure resource usage (CPU usage, memory usage, etc) and workload durations (how long a piece of code takes to run).
However, mobile apps are used by humans and the app performance directly impacts their experience, so we should primarily track user-centric mobile performance metrics. Following the lead of tech giants, the mobile industry at large is now adopting the tracking of app launch time and smoothness (jank during motion).
At Square, our customers spend most of their time in the app long after it's launched, and they don't scroll much, so app launch time and smoothness aren't critical metrics. What should we track instead?
This talk will introduce you to Interaction Latency, a user-centric mobile performance metric inspired by the Web Vital metric "Interaction to Next Paint" (web.dev/inp). We'll go over why apps need to track this, how to properly implement its tracking (it's tricky!), how to aggregate this metric, and what thresholds you should target.
How to Avoid Learning the Linux-Kernel Memory ModelScyllaDB
The Linux-kernel memory model (LKMM) is a powerful tool for developing highly concurrent Linux-kernel code, but it also has a steep learning curve. Wouldn't it be great to get most of LKMM's benefits without the learning curve?
This talk will describe how to do exactly that by using the standard Linux-kernel APIs (locking, reference counting, RCU) along with a simple rules of thumb, thus gaining most of LKMM's power with less learning. And the full LKMM is always there when you need it!
99.99% of Your Traces are Trash by Paige CruzScyllaDB
Distributed tracing is still finding its footing in many organizations today, one challenge to overcome is the data volume - keeping 100% of your traces is expensive and unnecessary. Enter sampling - head vs tail how do you decide? Let’s look at the design of Sifter and get familiar with why tail-based sampling is the way to enact a cost-effective tracing solution while actually increasing the system’s observability.
Square's Lessons Learned from Implementing a Key-Value Store with RaftScyllaDB
To put it simply, Raft is used to make a use case (e.g., key-value store, indexing system) more fault tolerant to increase availability using replication (despite server and network failures). Raft has been gaining ground due to its simplicity without sacrificing consistency and performance.
Although we'll cover Raft's building blocks, this is not about the Raft algorithm; it is more about the micro-lessons one can learn from building fault-tolerant, strongly consistent distributed systems using Raft. Things like majority agreement rule (quorum), write-ahead log, split votes & randomness to reduce contention, heartbeats, split-brain syndrome, snapshots & logs replay, client requests dedupe & idempotency, consistency guarantees (linearizability), leases & stale reads, batching & streaming, parallelizing persisting & broadcasting, version control, and more!
And believe it or not, you might be using some of these techniques without even realizing it!
This is inspired by Raft paper (raft.github.io), publications & courses on Raft, and an attempt to implement a key-value store using Raft as a side project.
A Deep Dive Into Concurrent React by Matheus AlbuquerqueScyllaDB
Writing fluid user interfaces becomes more and more challenging as the application complexity increases. In this talk, we’ll explore how proper scheduling improves your app’s experience by diving into some of the concurrent React features, understanding their rationales, and how they work under the hood.
The Latency Stack: Discovering Surprising Sources of LatencyScyllaDB
Usually, when an API call is slow, developers blame ourselves and our code. We held a lock too long, or used a blocking operation, or built an inefficient query. But often, the simple picture of latency as “the time a server takes to process a message” hides a great deal of end-to-end complexity. Debugging tail latencies requires unpacking the abstractions that we normally ignore: virtualization, hidden queues, and network behavior.
In this talk, I’ll describe how developers can diagnose more sources of delay and failure by building a more realistic and broad understanding of networked services. I’ll give some real-world cases when high end-to-end latency or elevated failure rates occurred due to factors we ordinarily might not even measure. Some examples include TCP SYN retransmission; virtualization on the client; and surprising behavior from AWS load balancers. Unfortunately, many measurement techniques don’t cover anything but the portion most directly under developer control. But developers can do better by comparing multiple measurements, applying Little’s law, investing in eBPF probes, and paying attention to the network layer.
Understanding API performance to find and fix issues faster ultimately means understanding the entire stack: the client, your code, and the underlying infrastructure.
Scaling Connections in PostgreSQL Postgres Bangalore(PGBLR) Meetup-2 - MydbopsMydbops
This presentation, delivered at the Postgres Bangalore (PGBLR) Meetup-2 on June 29th, 2024, dives deep into connection pooling for PostgreSQL databases. Aakash M, a PostgreSQL Tech Lead at Mydbops, explores the challenges of managing numerous connections and explains how connection pooling optimizes performance and resource utilization.
Key Takeaways:
* Understand why connection pooling is essential for high-traffic applications
* Explore various connection poolers available for PostgreSQL, including pgbouncer
* Learn the configuration options and functionalities of pgbouncer
* Discover best practices for monitoring and troubleshooting connection pooling setups
* Gain insights into real-world use cases and considerations for production environments
This presentation is ideal for:
* Database administrators (DBAs)
* Developers working with PostgreSQL
* DevOps engineers
* Anyone interested in optimizing PostgreSQL performance
Contact info@mydbops.com for PostgreSQL Managed, Consulting and Remote DBA Services
Details of description part II: Describing images in practice - Tech Forum 2024BookNet Canada
This presentation explores the practical application of image description techniques. Familiar guidelines will be demonstrated in practice, and descriptions will be developed “live”! If you have learned a lot about the theory of image description techniques but want to feel more confident putting them into practice, this is the presentation for you. There will be useful, actionable information for everyone, whether you are working with authors, colleagues, alone, or leveraging AI as a collaborator.
Link to presentation recording and transcript: https://bnctechforum.ca/sessions/details-of-description-part-ii-describing-images-in-practice/
Presented by BookNet Canada on June 25, 2024, with support from the Department of Canadian Heritage.
Blockchain technology is transforming industries and reshaping the way we conduct business, manage data, and secure transactions. Whether you're new to blockchain or looking to deepen your knowledge, our guidebook, "Blockchain for Dummies", is your ultimate resource.
Coordinate Systems in FME 101 - Webinar SlidesSafe Software
If you’ve ever had to analyze a map or GPS data, chances are you’ve encountered and even worked with coordinate systems. As historical data continually updates through GPS, understanding coordinate systems is increasingly crucial. However, not everyone knows why they exist or how to effectively use them for data-driven insights.
During this webinar, you’ll learn exactly what coordinate systems are and how you can use FME to maintain and transform your data’s coordinate systems in an easy-to-digest way, accurately representing the geographical space that it exists within. During this webinar, you will have the chance to:
- Enhance Your Understanding: Gain a clear overview of what coordinate systems are and their value
- Learn Practical Applications: Why we need datums and projections, plus units between coordinate systems
- Maximize with FME: Understand how FME handles coordinate systems, including a brief summary of the 3 main reprojectors
- Custom Coordinate Systems: Learn how to work with FME and coordinate systems beyond what is natively supported
- Look Ahead: Gain insights into where FME is headed with coordinate systems in the future
Don’t miss the opportunity to improve the value you receive from your coordinate system data, ultimately allowing you to streamline your data analysis and maximize your time. See you there!
Paper introduction: A Systematic Survey of Prompt Engineering on Vision-Language Foundation Models (Toru Tamaki)
Jindong Gu, Zhen Han, Shuo Chen, Ahmad Beirami, Bailan He, Gengyuan Zhang, Ruotong Liao, Yao Qin, Volker Tresp, Philip Torr "A Systematic Survey of Prompt Engineering on Vision-Language Foundation Models" arXiv2023
https://arxiv.org/abs/2307.12980
Quality Patents: Patents That Stand the Test of TimeAurora Consulting
Is your patent a vanity piece of paper for your office wall? Or is it a reliable, defendable, assertable, property right? The difference is often quality.
Is your patent simply a transactional cost and a large pile of legal bills for your startup? Or is it a leverageable asset worthy of attracting precious investment dollars, worth its cost in multiples of valuation? The difference is often quality.
Is your patent application only good enough to get through the examination process? Or has it been crafted to stand the tests of time and varied audiences if you later need to assert that document against an infringer, find yourself litigating with it in an Article 3 Court at the hands of a judge and jury, God forbid, end up having to defend its validity at the PTAB, or even needing to use it to block pirated imports at the International Trade Commission? The difference is often quality.
Quality will be our focus for a good chunk of the remainder of this season. What goes into a quality patent, and where possible, how do you get it without breaking the bank?
** Episode Overview **
In this first episode of our quality series, Kristen Hansen and the panel discuss:
⦿ What do we mean when we say patent quality?
⦿ Why is patent quality important?
⦿ How to balance quality and budget
⦿ The importance of searching, continuations, and draftsperson domain expertise
⦿ Very practical tips, tricks, examples, and Kristen’s Musts for drafting quality applications
https://www.aurorapatents.com/patently-strategic-podcast.html
BT & Neo4j: Knowledge Graphs for Critical Enterprise Systems.pptx.pdfNeo4j
Presented at Gartner Data & Analytics, London, May 2024. BT Group has used the Neo4j Graph Database to enable impressive digital transformation programs over the last 6 years. By re-imagining their operational support systems to adopt self-serve and data-led principles, they have substantially reduced the number of applications and the complexity of their operations. The result has been a substantial reduction in risk and costs while improving time to value, innovation, and process automation. Join this session to hear their story, the lessons they learned along the way, and how their future innovation plans include the exploration of uses of EKG + Generative AI.
Fluttercon 2024: Showing that you care about security - OpenSSF Scorecards fo...Chris Swan
Have you noticed the OpenSSF Scorecard badges on the official Dart and Flutter repos? It's Google's way of showing that they care about security. Practices such as pinning dependencies, branch protection, required reviews, continuous integration tests etc. are measured to provide a score and accompanying badge.
You can do the same for your projects, and this presentation will show you how, with an emphasis on the unique challenges that come up when working with Dart and Flutter.
The session will provide a walkthrough of the steps involved in securing a first repository, and then what it takes to repeat that process across an organization with multiple repos. It will also look at the ongoing maintenance involved once scorecards have been implemented, and how aspects of that maintenance can be better automated to minimize toil.
YOUR RELIABLE WEB DESIGN & DEVELOPMENT TEAM — FOR LASTING SUCCESS
WPRiders is a web development company specialized in WordPress and WooCommerce websites and plugins for customers around the world. The company is headquartered in Bucharest, Romania, but our team members are located all over the world. Our customers are primarily from the US and Western Europe, but we have clients from Australia, Canada and other areas as well.
Some facts about WPRiders and why we are one of the best firms around:
More than 700 five-star reviews! You can check them here.
1500 WordPress projects delivered.
We respond 80% faster than other firms! Data provided by Freshdesk.
We’ve been in business since 2015.
We are located in 7 countries and have 22 team members.
With so many projects delivered, our team knows what works and what doesn’t when it comes to WordPress and WooCommerce.
Our team members are:
- highly experienced developers (employees & contractors with 5 -10+ years of experience),
- great designers with an eye for UX/UI with 10+ years of experience
- project managers with development background who speak both tech and non-tech
- QA specialists
- Conversion Rate Optimisation - CRO experts
They are all working together to provide you with the best possible service. We are passionate about WordPress, and we love creating custom solutions that help our clients achieve their goals.
At WPRiders, we are committed to building long-term relationships with our clients. We believe in accountability, in doing the right thing, as well as in transparency and open communication. You can read more about WPRiders on the About us page.
Advanced Techniques for Cyber Security Analysis and Anomaly DetectionBert Blevins
Cybersecurity is a major concern in today's connected digital world. Threats to organizations are constantly evolving and have the potential to compromise sensitive information, disrupt operations, and lead to significant financial losses. Traditional cybersecurity techniques often fall short against modern attackers. Therefore, advanced techniques for cyber security analysis and anomaly detection are essential for protecting digital assets. This blog explores these cutting-edge methods, providing a comprehensive overview of their application and importance.
Scylla Summit 2017: Planning Your Queries for Maximum Performance
1. Planning your queries for maximum performance
Shlomi Livne, VP R&D, ScyllaDB
2. Shlomi Livne
Shlomi is VP of R&D at ScyllaDB. Prior to ScyllaDB he led the research and development team at Convergin, which was acquired by Oracle.
3. How Scylla executes your queries
4. Cluster View
[diagram: a client sends a request to one node of an 8-node cluster; that node acts as the coordinator and forwards the request to the replica nodes]
5. Coordinator Tasks
1. Prepare the statement
2. Single-partition queries:
   a. Select replicas (using cache heat info) and send query/digest requests, requesting a page of results
   b. Compare the digests; if there is a mismatch:
      i. Request data from the selected replicas
      ii. Repair the data on the replicas
   c. Return the result
3. Partition scan queries:
   a. Split the request up based on the ring
   b. Send requests for data using ranges, requesting a page of results
   c. Merge the results
   d. Return the result
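The digest path in step 2 can be sketched as a toy simulation (not Scylla's actual code): one replica returns data, the others return digests, and a mismatch triggers a full read plus read repair. Replicas here are plain dicts mapping a partition key to cells of (value, timestamp); the digest function is a stand-in for the real digest computation.

```python
import hashlib

def digest(row):
    """Stand-in for Scylla's digest: a hash over a replica's row state."""
    return hashlib.md5(repr(sorted(row.items())).encode()).hexdigest()

def coordinator_read(replicas, key):
    """Toy model of step 2: ask one replica for data and the rest for
    digests; on mismatch, fetch the divergent data and repair."""
    data_replica, *digest_replicas = replicas
    row = data_replica[key]
    mismatched = [r for r in digest_replicas if digest(r[key]) != digest(row)]
    if mismatched:
        # Merge newest-wins: each cell is a (value, timestamp) pair.
        for r in mismatched:
            for col, (val, ts) in r[key].items():
                if col not in row or row[col][1] < ts:
                    row = {**row, col: (val, ts)}
        # Read repair: write the reconciled row back to every replica.
        for r in replicas:
            r[key] = row
    return row
```

After a mismatched read, all replicas hold the reconciled row, so subsequent digest comparisons succeed without repair.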
6. Replica Tasks
1. Receive a data/digest/range request
2. Split the request up according to shards
3. On each shard:
   a. Execute the request, merging data from memtables + cache/sstables
   b. For a data request: prepare a result and return it (compute a digest if RF > 1)
   c. For a digest request: compute the digest and return it
   d. For a partition scan request: return the partition range data (do not prepare a result)
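The merge in step 3a can be sketched as a newest-write-wins merge per column. This is a simplification (real Scylla merges mutations at the cell level, including tombstones), but it captures why a row's cells can come from several sources, like the P8:R1 example in the following slides where sstables hold A=8 and B=7 while the memtable holds C=3:

```python
def merge_sources(*sources):
    """Merge one row's cells from several sources (memtable, cache,
    sstables), keeping the newest write per column. Each source maps
    column -> (value, timestamp)."""
    merged = {}
    for source in sources:
        for col, (val, ts) in source.items():
            # A later timestamp wins, regardless of source order.
            if col not in merged or merged[col][1] < ts:
                merged[col] = (val, ts)
    return merged
```

Because the rule is timestamp-based rather than source-based, the same function serves for merging two sstables, an sstable with the cache, or everything with the memtable.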
7.-17. Replica Shard Read Diagram
[diagram sequence, one step per slide: a read for partition P8, row R1 arrives at a shard whose memtable holds P8:R1:C=3. With a warm cache, the cached cells P8:R1:A=8,B=7 are merged with the memtable and the result P8:R1 A=8,B=7,C=3 is returned. On a cache miss, each sstable is consulted through its bloom filter, summary, index, compression, and data components; two sstables contribute P8:R1:A=8 and P8:R1:B=7, while a third is ruled out by its bloom filter. The merged cells A=8,B=7 populate the row cache, and the combined result P8:R1 A=8,B=7,C=3 is returned.]
18. Row Cache
▪ The cache stores complete row data
▪ In addition to storing existing rows, the cache stores information about the completeness of clustering ranges (continuity), so it doesn't miss between cached rows
▪ The cache is populated on:
   o Queries
   o Memtable flush:
      • Data is merged, to keep the cache up to date with newly written sstables
      • Data is inserted, in case there is no data for that partition on disk
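The two population paths can be illustrated with a toy cache (continuity tracking omitted; this is a sketch of the policy described above, not Scylla's implementation):

```python
class RowCache:
    """Toy row cache: reads insert rows; memtable flushes merge into an
    existing entry, or insert one when the partition has no data on disk."""

    def __init__(self):
        self.rows = {}  # partition key -> {column: value}

    def on_read(self, key, row):
        # Populate on query.
        self.rows[key] = dict(row)

    def on_memtable_flush(self, key, row, partition_on_disk):
        if key in self.rows:
            # Merge: keep the cached row consistent with the new sstable.
            self.rows[key].update(row)
        elif not partition_on_disk:
            # Insert: nothing on disk for this partition, so the flushed
            # data alone is the complete row.
            self.rows[key] = dict(row)
```

The merge-on-flush step is what lets a later read like P8:R1 be served entirely from cache even though column C was written after the row was cached.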
19. Selecting SStables
▪ Given a partition key (pk), the current set of sstables is reduced so that sstable X is included iff:
   o min_partition_key(sstable X) ≤ pk ≤ max_partition_key(sstable X)
   o bloom_filter(sstable X, pk) = True
▪ Scylla 2.0: sstables are read in parallel
▪ Scylla 2.1:
   o The reduced set of sstables is searched newest to oldest, until a result can be constructed and we can prove that older sstables are not relevant
   o SStable read parallelism grows, starting from a single sstable
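The two inclusion conditions can be sketched with a minimal sstable stand-in: a partition-key range plus a tiny one-hash bloom filter (real bloom filters use several hash functions, and may return false positives but never false negatives):

```python
import hashlib

class SSTable:
    """Minimal sstable stand-in: key range + a one-hash, m-bit bloom filter."""

    def __init__(self, keys, m=64):
        self.min_key, self.max_key = min(keys), max(keys)
        self.m = m
        self.bits = 0
        for k in keys:
            self.bits |= 1 << self._bit(k, m)

    @staticmethod
    def _bit(key, m):
        return int(hashlib.md5(key.encode()).hexdigest(), 16) % m

    def may_contain(self, pk):
        # Condition 1: pk inside the sstable's partition-key range.
        # Condition 2: the bloom filter does not rule pk out.
        return (self.min_key <= pk <= self.max_key and
                bool(self.bits >> self._bit(pk, self.m) & 1))

def select_sstables(sstables, pk):
    """Reduce the sstable set per the rule above."""
    return [t for t in sstables if t.may_contain(pk)]
```

Fewer surviving sstables means fewer disk touches per read, which is why wide, overlapping sstable sets (e.g. from a lagging compaction) hurt read latency.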
20. 7 Rules to Optimize Your Queries
21. Rule #1 - Use Prepared Statements
▪ Unprepared, the coordinator needs to pre-process the query:
   o A lot of repetitive work that could be done only once
   o Adds overhead to the execution of a query, which directly translates to throughput and latency
▪ The driver is not able to send the request to a coordinator node that holds the data (an additional hop)
▪ Tip: compare scylla_query_processor_statements_prepared to the number of executed scylla_transport_requests_served
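The repetitive-work cost can be modeled with a toy query processor (the "parse" here is just a placeholder for the real CQL parsing and validation; this is not the driver's or Scylla's API):

```python
class QueryProcessor:
    """Toy model: preparing parses a statement once; unprepared
    execution re-parses on every request."""

    def __init__(self):
        self.prepared = {}
        self.parse_count = 0

    def _parse(self, cql):
        self.parse_count += 1
        return cql.count("?")  # stand-in for a parsed plan

    def prepare(self, cql):
        if cql not in self.prepared:
            self.prepared[cql] = self._parse(cql)
        return cql  # handle reused by later executes

    def execute_prepared(self, handle, params):
        return self.prepared[handle], params  # no re-parse

    def execute_unprepared(self, cql, params):
        return self._parse(cql), params       # re-parsed every time
```

Running 100 executions prepared costs one parse; unprepared it costs 100, and that per-request overhead is what shows up as the throughput and latency gap on the next slide.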
22. Sample: single Scylla server, using cassandra-stress (c-s)

Results                          Unprepared   Prepared
op rate (ops/s)                       13037      18704
partition rate (pk/s)                 13037      18704
row rate (rows/s)                     13037      18704
latency mean (ms)                       1.5        1.1
latency median (ms)                     1.3        1.0
latency 95th percentile (ms)            2.9        1.6
latency 99th percentile (ms)            6.2        2.5
latency 99.9th percentile (ms)         12.2        7.1
latency max (ms)                       31.1       16.9
Total partitions                     100000     100000
23. Rule #2 - Use Paging
▪ With paging disabled, the coordinator is forced to prepare a single
result that holds all the data and send it back:
o If the coordinator is not able to return a response (allocate enough memory for
the single result), an error will be returned to the client
o tip: compare scylla_transport_unpaged_queries to scylla_cql_reads to
detect if many of your read queries are unpaged
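The paging contract can be sketched as a loop that keeps fetching while the server hands back a paging state; the function names and the integer paging state are illustrative, not a real driver API.

```python
# Toy sketch of paged reads: fetch page by page instead of one huge result.
def fetch_all(execute_page, query, page_size):
    rows, state = [], None
    while True:
        page, state = execute_page(query, page_size, state)
        rows.extend(page)
        if state is None:          # no more pages
            return rows

def make_toy_server(data):
    # Returns an execute_page(query, page_size, state) callable over `data`,
    # using the next offset as the paging state.
    def execute_page(query, page_size, state):
        start = state or 0
        page = data[start:start + page_size]
        nxt = start + page_size
        return page, (nxt if nxt < len(data) else None)
    return execute_page
```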
24. Rule #3 - Use correct Page Size
▪ Drivers enable paging by default, with a default page_size of 5000
rows (java, python, gocql)
▪ CQL requires returning at least one result per page and allows returning
fewer results than the page size
▪ Scylla utilizes this:
o Scylla caps a page to ~1MB of memory - it will return fewer rows than
requested when rows are large
o Do not use the number of returned results as an indication that there are
no more results
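The byte cap can be modeled with a toy sketch (the real accounting in Scylla is more involved): a page closes when either the row budget or the byte budget is exhausted, so a "full" page may hold fewer rows than page_size.

```python
# Toy model of the ~1 MB page cap: rows are strings, len() stands in for size.
PAGE_BYTE_CAP = 1_000_000

def build_page(rows, page_size):
    page, used = [], 0
    for row in rows:
        # The `page` guard ensures at least one row is always returned,
        # even if that single row exceeds the byte cap.
        if len(page) >= page_size or (page and used + len(row) > PAGE_BYTE_CAP):
            break
        page.append(row)
        used += len(row)
    return page
```

With 100 KB rows and page_size 5000, a page holds only ~10 rows, which is why the returned count cannot signal the end of the result set.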
25. [screenshot: “Has more pages”]
26. Scylla 2.0: does the default page_size make sense?
page size   10^6 rows of 100 bytes   10^5 rows of 1000 bytes   10^4 rows of 10^4 bytes   1000 rows of 10^5 bytes
10          timed out                2104.492031               331.087871                173.932543
50          5679.087615              737.148927                202.113023                168.165375
100         4034.920447             573.046783                186.384383                168.951807
500         2663.383039             415.760383                183.894015                173.015039
1000        2451.570687             395.313151                182.976511                168.427519
5000        2285.895679             400.031743                184.942591                169.345023
10000       2281.701375             399.769599                183.369727                169.738239
50000       2273.312767             396.099583                183.107583                170.000383
Test: duration in milliseconds fetching a single wide partition of 10^8 bytes
split into rows, using different page sizes
27. C* 3.11.0: does the default page_size make sense?
page size   10^6 rows of 100 bytes   10^5 rows of 1000 bytes   10^4 rows of 10^4 bytes   1000 rows of 10^5 bytes
10          timed out                4030.726143               903.872511                364.380159
50          12876.51328             1535.115263               419.430399                300.941311
100         8992.587775             1202.716671               405.274623                316.407807
500         6400.507903             907.542527                354.680831                348.651519
1000        6077.546495             874.512383                360.972287                370.409471
5000        5620.367359             791.674879                422.051839                358.612991
10000       5490.343935             793.772031                389.021695                360.447999
50000       5662.310399             913.833983                383.516671                355.467263
Test: duration in milliseconds fetching a single wide partition of 10^8 bytes
split into rows, using different page sizes
tip: consider changing the page size if your rows are large
28. Rule #4 - Beware of Multi-Partition CQL IN Queries
▪ Multi-partition CQL IN queries force the coordinator node to split
the query into single-partition queries and aggregate the results.
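The coordinator-side fan-out can be sketched as follows; the function names are illustrative, and a real coordinator would issue the per-key reads concurrently and route each one to its replica set.

```python
# Toy sketch of "SELECT ... WHERE pk IN (...)": one single-partition read
# per key in the IN list, with the coordinator merging the results.
def execute_multi_partition_in(read_partition, keys):
    results = []
    for pk in keys:   # each key may live on a different set of replicas
        results.extend(read_partition(pk))
    return results
```

This is why a large IN list concentrates work (and memory for the aggregated result) on a single coordinator.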
29. Rule #5 - Beware of Single-Partition CQL IN Queries
Question: Should I split the CQL IN query?
Sample:
▪ CQL: “Select * from ks.cf where pk = X and ck in (Y1, Y2, … Yn)“
Translated to:
▪ CQL:
o “Select * from ks.cf where pk = X and ck = Y1“
o “Select * from ks.cf where pk = X and ck = Y2“
o …
o “Select * from ks.cf where pk = X and ck = Yn“
34. Question: Should I split the CQL IN query?
Answer: It depends on how wide your rows are
Comments:
▪ Prior to Scylla 2.0, single-partition CQL IN queries performed very
badly in some wide-partition cases.
▪ All reported results are using Scylla 2.0
35. Rule #6 - There’s a faster way to do full scans
▪ The blog post efficient-full-table-scans-with-scylla outlined an
algorithm to do full scans; at a high level:
o split the token range up into small sub-ranges
o run “enough” sub-ranges in parallel
▪ The follow-up blog post How to scan 475 million partitions 12x faster
using efficient full table scan provided a sample implementation of
this approach
▪ Is there an even “faster” way?
36.
▪ Yes, there is:
o Using the token ownership of nodes in the ring, one can select ranges of
tokens. Once a “range” has been processed, the next “range” can be
selected based on the ownership in the ring.
o An even more optimized solution would use the “sharding” information and
target ranges at specific shards on a machine, so that all cores are executing
requests in parallel.
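The range-splitting step can be sketched as follows. The helper and the CQL template are illustrative; a real implementation would derive the splits from the cluster's actual token ownership (and per-node shard count) rather than cutting the ring evenly.

```python
# Minimal sketch of splitting the token ring into sub-ranges for a full scan.
MIN_TOKEN, MAX_TOKEN = -2**63, 2**63 - 1   # Murmur3 partitioner token range

def subranges(lo, hi, n):
    """Split (lo, hi] into n contiguous sub-ranges."""
    step = (hi - lo) // n
    edges = [lo + i * step for i in range(n)] + [hi]
    return list(zip(edges[:-1], edges[1:]))

def range_query(table, lo, hi):
    # Each sub-range is scanned with a token-restricted query.
    return f"SELECT * FROM {table} WHERE token(pk) > {lo} AND token(pk) <= {hi}"
```

Running "enough" of these sub-range queries in parallel, and picking the next range based on ring ownership, keeps every node (and ideally every shard) busy.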
37. Rule #7: Use the tools …
▪ Probabilistic tracing
▪ Slow query tracing
▪ Wireshark
▪ CQL Trace
▪ Enable client-side tracing
38. THANK YOU
shlomi@scylladb.com
@ShlomiLivne
Any questions?