In my talk, I will present the different compaction strategies that Scylla provides, and demonstrate when it is appropriate and when it is inappropriate to use each one. I will then present a new compaction strategy that we designed by learning from the existing strategies, picking their best features while avoiding their problems.
The document discusses the circumstances under which an auditor may need to resign from an audit engagement before completion. It outlines that resignation is a last resort that requires careful consideration. There are professional standards and legal requirements that must be followed. The auditor must communicate openly with the client and fully document the reasons for withdrawal. Withdrawing from an audit is a difficult decision that depends on factors like the audit's stage of completion and whether laws allow resignation without completing the audit.
I. AWS IAM provides identity and access management for AWS services and resources. It allows customization of access controls through policies and provides features like MFA and identity federation. IAM roles are preferable to users where possible for additional security.
II. EC2 allows launching virtual computing instances in AWS. AMIs contain templates for instances including the OS. Instance types determine hardware configurations. Security groups act as virtual firewalls controlling traffic to instances. EBS provides persistent storage volumes for instances.
III. Core AWS services discussed include IAM, EC2, S3, RDS, CloudWatch which provide fundamental cloud capabilities for security, computing, storage, databases and monitoring.
This document discusses GraphQL and Dgraph with Go. It begins by introducing GraphQL and some popular GraphQL implementations in Go, like graphql-go. It then discusses Dgraph, describing it as a distributed, high-performance graph database written in Go. It provides examples of using the Dgraph Go client to perform CRUD operations, query for single and multiple objects, commit transactions, and more.
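For readers new to the client flow the summary describes, here is a minimal pure-Python sketch of the payloads involved. The commented `pydgraph` calls show roughly where they would be used against a running server; all names, the address, and the schema are illustrative assumptions, not taken from the talk.

```python
import json

def person_mutation(name, age):
    """Build a set-mutation object of the kind passed to txn.mutate(set_obj=...)."""
    return {"uid": "_:new", "dgraph.type": "Person", "name": name, "age": age}

def person_query(name):
    """Build a Dgraph query that looks a person up by name."""
    return """{
  people(func: eq(name, %s)) {
    uid
    name
    age
  }
}""" % json.dumps(name)

# Against a live server the flow would be roughly (assumed pydgraph usage):
#   stub = pydgraph.DgraphClientStub("localhost:9080")
#   client = pydgraph.DgraphClient(stub)
#   txn = client.txn()
#   txn.mutate(set_obj=person_mutation("Alice", 30))
#   txn.commit()
#   resp = client.txn(read_only=True).query(person_query("Alice"))

mutation = person_mutation("Alice", 30)
query = person_query("Alice")
```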
Advanced MySQL Data-at-Rest Encryption in Percona Server - Severalnines
Iwo Panowicz - Percona & Bart Oles - Severalnines AB
The purpose of the talk is to present data-at-rest encryption implementation in Percona Server for MySQL.
Differences between Oracle's MySQL and MariaDB implementation.
- How is it implemented?
- What is encrypted:
- Tablespaces?
- General tablespace?
- Double write buffer/parallel double write buffer?
- Temporary tablespaces? (KEY BLOCKS)
- Binlogs?
- Slow/general/error logs?
- MyISAM? MyRocks? X?
- Performance overhead.
- Backups?
- Transportable tablespaces. Transfer key.
- Plugins
- Keyrings in general
- Key rotation?
- General-Purpose Keyring Key-Management Functions
- Keyring_file
- Is it useful? How to make the most of it?
- Keyring Vault
- How does it work?
- How to make a transition from keyring_file
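As a rough illustration of the pieces this outline touches on, enabling the keyring_file plugin looks approximately like this in my.cnf (paths are examples; check the Percona Server documentation for your version):

```ini
[mysqld]
# The keyring plugin must load before InnoDB initializes,
# hence early-plugin-load rather than INSTALL PLUGIN.
early-plugin-load = keyring_file.so
# Where the key file lives (obfuscated on disk, but not externally protected).
keyring_file_data = /var/lib/mysql-keyring/keyring
```

With the keyring in place, individual InnoDB tables opt in via `CREATE TABLE ... ENCRYPTION='Y'`, and the master key can be rotated with `ALTER INSTANCE ROTATE INNODB MASTER KEY`, which re-encrypts the tablespace keys rather than the data itself.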
No Hassle NoSQL - Amazon DynamoDB & Amazon DocumentDB | AWS Summit Tel Aviv ... - Amazon Web Services
NoSQL databases are a great fit for many modern applications, such as mobile, web, and gaming, that require flexible, scalable, high-performance, and highly functional databases to provide great user experiences, but they can be hard to manage and require high proficiency and attention. In this session we will present Amazon DynamoDB, a fully managed, multi-region, multi-master database that provides consistent single-digit-millisecond latency at any scale.
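To make the "flexible" part concrete: DynamoDB's low-level API represents every attribute as a typed value. The sketch below is an illustrative marshaller, not the official serializer (boto3's DynamoDB resource does this for you):

```python
def to_attribute_value(value):
    """Convert a Python value to DynamoDB's low-level typed form."""
    if isinstance(value, bool):          # check bool before int: bool subclasses int
        return {"BOOL": value}
    if isinstance(value, str):
        return {"S": value}
    if isinstance(value, (int, float)):
        return {"N": str(value)}         # numbers travel as strings on the wire
    if isinstance(value, list):
        return {"L": [to_attribute_value(v) for v in value]}
    if isinstance(value, dict):
        return {"M": {k: to_attribute_value(v) for k, v in value.items()}}
    raise TypeError(f"unsupported type: {type(value).__name__}")

item = {"pk": "user#42", "score": 17, "tags": ["web", "mobile"]}
marshalled = {k: to_attribute_value(v) for k, v in item.items()}
# e.g. marshalled["score"] == {"N": "17"}
```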
AWSome Day Online 2020_Module 1: Introduction to the AWS Cloud - Amazon Web Services
This document outlines an AWS online conference for IT professionals. The conference will cover introducing AWS cloud concepts, getting started with AWS services, building applications in AWS, security best practices, and AWS pricing, support and architecture. The document provides an overview of AWS cloud concepts including infrastructure, services, security, pricing and support. It also summarizes AWS global infrastructure, regions, availability zones and interfaces for managing AWS resources.
The document discusses five key principles for architecting applications on AWS: elasticity, designing for failure, loose coupling, security, and performance. It provides examples and services for each principle such as using Amazon EC2 for elasticity, designing with fault tolerance using services like RDS and Route 53, loosely coupling components with services like SQS and SWF, leveraging security services like IAM, and scaling vertically with cluster compute or horizontally using services like ElastiCache for performance.
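The loose-coupling principle is easy to picture with a queue between components; the in-memory sketch below stands in for SQS (no AWS calls; all names are invented for illustration):

```python
from collections import deque

class MessageQueue:
    """Stand-in for SQS: producers and consumers share only the queue."""
    def __init__(self):
        self._messages = deque()
    def send(self, body):
        self._messages.append(body)
    def receive(self):
        return self._messages.popleft() if self._messages else None

def order_service(queue, order_id):
    # The order service never calls the shipper directly...
    queue.send({"order_id": order_id})

def shipping_worker(queue, shipped):
    # ...so the shipper can be scaled, replaced, or restarted independently.
    msg = queue.receive()
    if msg is not None:
        shipped.append(msg["order_id"])

q, shipped = MessageQueue(), []
order_service(q, 1)
order_service(q, 2)
shipping_worker(q, shipped)
shipping_worker(q, shipped)
# shipped == [1, 2]
```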
Automated Solution for Deploying AWS Landing Zone (GPSWS407) - AWS re:Invent ... - Amazon Web Services
The document discusses an automated solution for deploying an AWS Landing Zone. It describes the AWS Landing Zone as providing an easy way to set up a new multi-account AWS environment based on AWS best practices. The solution automates the initial setup of accounts and baseline security and governance controls. It also includes an AWS Account Vending Machine that allows additional accounts to be automatically provisioned with security baselines. The workshop will include demos of deploying a Landing Zone, creating new accounts via the AWS Vending Machine, and extending the Landing Zone with add-on features.
As part of the Introduction to AWS Workshop Series, see how to scale your website from your first user, right up to a complex architecture to support 10 million users.
Introduction to the Well-Architected Framework and Tool - SVC208 - Anaheim AW... - Amazon Web Services
Most modern businesses depend on a portfolio of technology solutions to operate and be successful every day. How do you know whether your team is following best practices or what the risks are in your architectures? This session shows how the AWS Well-Architected Framework provides prescriptive advice on best practices and how the AWS Well-Architected Tool enables you to measure and improve your technology portfolio. We explain how other customers are using AWS Well-Architected in their businesses, and we share what we learned from reviewing tens of thousands of architectures across operational excellence, security, reliability, performance efficiency, and cost optimization.
As enterprises move to the cloud, robust connectivity is often an early consideration. AWS Direct Connect provides a more consistent network experience for accessing your AWS resources, typically with greater bandwidth and reduced network costs. This session dives deep into the features of AWS Direct Connect and VPNs. We discuss deployment architectures and demonstrate the process from start to finish. We’ll show you how to configure public and private virtual interfaces, configure routers, use VPN backup, and provide secure communication between sites by using the AWS VPN CloudHub.
This document describes a cloud assessment service that helps organizations evaluate their readiness to move to the cloud. The assessment identifies current infrastructure and applications, evaluates business and IT value, and develops a cloud adoption strategy and roadmap. It determines which applications are suitable for the cloud, how to optimize existing resources, and recommends industry best practices for portfolio transformation and cloud migration.
AWS Technical Due Diligence Workshop Session One - Tom Laszewski
First session in the one-day Technical Due Diligence workshop. Understand the AWS approach to TDD along with the common use cases and hypotheses. Covers the AWS TDD case studies and the outputs from TDDs.
Failure is not an Option - Designing Highly Resilient AWS Systems - Amazon Web Services
Customers moving mission-critical applications to the cloud are seeking guidance to replicate and improve the resiliency of their Tier-1 systems, while simultaneously meeting compliance and regulatory requirements. Natural disasters, internet disruptions, hardware or software failure can lead to events requiring customers to invoke disaster recovery (DR) plans. Join us in this session to learn how to “design for failure” and remain resilient in the event of disaster by designing applications using highly resilient components and service features.
Cloud Based Business Intelligence with Amazon QuickSight - AWS Online Tech Talks - Amazon Web Services
Learning Objectives:
- Connect QuickSight to your data (Redshift, Athena, S3, RDS, Private VPCs, On-Premise databases)
- Create interactive dashboards
- Publish reports and dashboards at scale (Row Level Security, AD integration, Groups, User Management)
by Robbie Wright, Head of Amazon S3 & Amazon Glacier Product Marketing, AWS
Learn from AWS on how we've designed S3 and Glacier to be durable, available, and massively scalable. Hear how customers are using these services to enhance the accessibility and usability of their data. We will also dive into the benefits of object storage, its applications, and some best practices to follow.
Big data tools evolve rapidly, and it can be hard to keep up. With this in mind, Amazon Web Services offers a broad and comprehensive portfolio of cloud computing services to help you build, maintain, and deploy big data applications.
This webinar provides an accessible introduction to the big data options available on the AWS cloud platform, including popular big data frameworks such as Hadoop, Spark, and NoSQL databases, and uses real-world cases to illustrate best practices. Finally, you will learn how to apply these tools and services to bring big data into your own applications.
This document discusses AWS Auto Scaling, which automatically launches and terminates EC2 instances based on demand. It describes the key components of Auto Scaling including launch configurations, Auto Scaling groups, scaling policies, and CloudWatch alarms. It provides step-by-step instructions for setting up a simple Auto Scaling group to support a web application, including creating an AMI, load balancer, launch configuration, Auto Scaling group, scaling policies, and CloudWatch alarms to dynamically scale the number of EC2 instances.
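The wiring described above ends in a scaling policy. As a hedged, boto3-shaped sketch, here is one common policy type, target tracking, expressed as the request dict it would be submitted with (the helper only builds the dict, so no AWS credentials are needed; group and policy names are made up):

```python
def target_tracking_policy(asg_name, target_cpu=50.0):
    """Build kwargs for autoscaling put_scaling_policy (shape per boto3 docs)."""
    return {
        "AutoScalingGroupName": asg_name,
        "PolicyName": f"{asg_name}-cpu-{int(target_cpu)}",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingConfiguration": {
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ASGAverageCPUUtilization"
            },
            "TargetValue": target_cpu,  # keep average CPU near this percentage
        },
    }

policy = target_tracking_policy("web-asg", target_cpu=60.0)
# With credentials configured, this would be submitted as:
#   boto3.client("autoscaling").put_scaling_policy(**policy)
```

Target tracking creates and manages the CloudWatch alarms for you, whereas the step-by-step setup in the document wires alarms to scaling policies by hand.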
Best Practices for Database Migration to the Cloud: Improve Application Perfo... - Amazon Web Services
In this session you will hear from AWS ProServe experts as they present the business value and technical best practices in database migrations to the AWS Cloud. Learn key methodologies from provisioning to decommissioning, while taking into consideration your applications’ performance and availability, system scalability, data security, and database maintenance requirements, as well as project costs and timelines. You will also gain an understanding of steps you can take to help accelerate the preparation of data for analytics, and faster delivery of key data insights.
The Total Cost of Ownership of Cloud Storage (TCO) - AWS Cloud Storage for th... - Amazon Web Services
The document discusses factors to consider when calculating the total cost of ownership (TCO) of cloud storage versus on-premises storage. It notes that cloud storage costs less when properly accounting for: 1) usable versus raw storage capacity and utilization rates, 2) different redundancy and durability levels between storage classes, 3) all fixed costs including hardware, staffing, facilities, etc., 4) updated pricing models like price cuts, tiered pricing and recurring savings from optimization, and 5) intangible benefits of cloud like security, agility and support. The cloud's economies of scale allow for continuous price reductions while customers only pay for what they use.
Introduction to graph databases, with detailed installation steps, Cypher query language examples, demos, and visualization tools like RedisInsight. It also contains benchmarks comparing RedisGraph against TigerGraph, Neo4j, Neptune, JanusGraph, and ArangoDB, explains the differences between native and non-native graph databases, lists use cases for graph databases, and provides a scoring approach for choosing a graph DB over traditional SQL and NoSQL DBs.
Databases on AWS: Scaling Applications & Modern Data Architectures - Amazon Web Services
This document provides an overview of AWS and discusses how AWS can help companies handle large scale events. It begins with an introduction to AWS regions, availability zones, and networking concepts. It then discusses AWS's wide range of services across compute, storage, databases, analytics, machine learning and more. The document also highlights AWS's pace of innovation, security capabilities, compliance certifications, and enterprise customers. It positions AWS as a leader in Gartner's Magic Quadrant for cloud infrastructure and operational database management systems. Finally, it defines what a large scale event is and notes that AWS can help companies address problems of unknown infrastructure requirements and short event durations for situations that require temporary increases in capacity.
AWS provides a range of security services and features that AWS customers can use to secure their content and applications and meet their own specific business requirements for security. This presentation focuses on how you can make use of AWS security features to meet your own organization's security and compliance objectives.
View a recording of the webinar based on this presentation on YouTube here: http://youtu.be/rXPyGDWKHIo
Automating document analysis and text extraction with Amazon Textract - AIM20... - Amazon Web Services
Many companies today extract data from documents and forms through manual data entry, which is slow and expensive, or through simple optical character recognition (OCR) software, which is difficult to customize. Amazon Textract overcomes these challenges by using machine learning to instantly “read” virtually any type of document to accurately extract text and data without the need for any manual effort or custom code. In this session, you learn how to extract data from documents using Amazon Textract. We also demonstrate how you can create smart search indexes and better maintain compliance with document archival rules after the information is captured.
Scylla Summit 2017: From Elasticsearch to Scylla at Zenly - ScyllaDB
Zenly (recently acquired by Snap) makes a social map app. Their team has been running Scylla in production for the past eight months. Get an overview of the reasons they chose Scylla, its deployment on Google Cloud, the performance they achieved, and some of the hiccups they hit along the way.
Scylla Summit 2017: Planning Your Queries for Maximum Performance - ScyllaDB
What happens to a request that reaches Scylla, and why should one care? Understanding how Scylla executes your queries can help you make better architectural decisions and also better understand the performance of your application.
Are my rows too big? Should I make that other column a part of my partition key instead? This talk will cover the interaction between nodes, shards and the role of Scylla's internal components like memtables, cache and sstables. I will explain how different types of queries are executed and how to plan your queries for maximum performance.
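One standard answer to "are my rows too big?" is time-bucketing the partition key so no partition grows without bound. A small illustrative sketch (the sensor-ID schema and the one-week bucket size are made-up choices, not from the talk):

```python
from datetime import datetime, timezone

def partition_key(sensor_id, ts, bucket_days=7):
    """Derive a compound partition key (sensor_id, week_bucket).

    Instead of PRIMARY KEY ((sensor_id), ts) -- one ever-growing partition
    per sensor -- use ((sensor_id, bucket), ts) so each partition holds at
    most bucket_days worth of rows.
    """
    epoch_day = int(ts.timestamp()) // 86400
    return (sensor_id, epoch_day // bucket_days)

a = partition_key("s1", datetime(2017, 1, 2, tzinfo=timezone.utc))
b = partition_key("s1", datetime(2017, 1, 4, tzinfo=timezone.utc))
c = partition_key("s1", datetime(2017, 3, 1, tzinfo=timezone.utc))
# a == b (same week, same partition); a != c (later bucket, new partition)
```

The trade-off: point reads within a week stay single-partition, but scanning a long time range now fans out across buckets, which is exactly the kind of query-planning decision the talk is about.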
Scylla Summit 2017: A Toolbox for Understanding Scylla in the Field - ScyllaDB
In this talk, we will share useful tools and techniques that we are using in the field to understand Scylla clusters. Users will learn how to use those same tools to better understand their deployment.
Some of the questions that will be answered are:
- how to find out which queries are the slowest and why
- how we go about understanding the impact of the data model in a node's performance
- how to check which resources are the bottlenecks in the cluster
Scylla Summit 2017: Migrating to Scylla From Cassandra and Others With No Dow... - ScyllaDB
The session will cover the best practices to migrate existing data from Apache Cassandra to Scylla and how to do it while being online all of the time.
If You Care About Performance, Use User Defined Types - ScyllaDB
Shlomi Livne, VP of R&D at ScyllaDB, presented on the performance benefits of using user-defined types (UDTs) in ScyllaDB. He explained that with traditional columns, each column has overhead and flexibility comes at a price. However, with frozen UDTs, the columns are treated as a single unit, sharing metadata and improving performance. Livne showed results of a test where UDTs with many fields outperformed traditional columns with the same number of fields. However, he noted that Scylla's row cache and Java driver performance need improvement for UDTs.
Scylla Summit 2017: Stateful Streaming Applications with Apache Spark - ScyllaDB
When working with streaming data, stateful operations are a common use case. If you would like to perform data de-duplication, calculate aggregations over event-time windows, track user activity over sessions, you are performing a stateful operation.
Apache Spark provides users with a high level, simple to use DataFrame/Dataset API to work with both batch and streaming data. The funny thing about batch workloads is that people tend to run these batch workloads over and over again. Structured Streaming allows users to run these same workloads, with the exact same business logic in a streaming fashion, helping users answer questions at lower latencies.
In this talk, we will focus on stateful operations with Structured Streaming and we will demonstrate through live demos, how NoSQL stores can be plugged in as a fault tolerant state store to store intermediate state, as well as used as a streaming sink, where the output data can be stored indefinitely for downstream applications.
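To picture what "aggregations over event-time windows" plus a state store means, here is a dependency-free Python sketch of windowed counting with watermark-style eviction. Structured Streaming performs this bookkeeping for you, with the state store made fault tolerant; the class and parameters below are illustrative assumptions:

```python
def window_start(event_time, window_sec=300):
    """Align an event time to the start of its 5-minute window."""
    return event_time - (event_time % window_sec)

class WindowedCounter:
    """Count events per (key, window); evict windows behind the watermark."""
    def __init__(self, window_sec=300, watermark_sec=600):
        self.window_sec = window_sec
        self.watermark_sec = watermark_sec
        self.state = {}          # (key, window_start) -> count
        self.max_event_time = 0

    def update(self, key, event_time):
        self.max_event_time = max(self.max_event_time, event_time)
        watermark = self.max_event_time - self.watermark_sec
        if event_time < watermark:
            return               # too late: dropped, as a watermark would
        w = window_start(event_time, self.window_sec)
        self.state[(key, w)] = self.state.get((key, w), 0) + 1
        # evict windows that can no longer receive events
        for k in [k for k in self.state if k[1] + self.window_sec < watermark]:
            del self.state[k]

counter = WindowedCounter()
for t in (10, 40, 400):
    counter.update("user-1", t)
# two events land in window 0, one in window 300
```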
Scylla Summit 2017: Cry in the Dojo, Laugh in the Battlefield: How We Constan... - ScyllaDB
Testing a complex system like Scylla is a challenge on its own. There are many environments, workloads, and problems. Simple problems become increasingly worse at scale. In this talk, we will explore the testing method that we employ in our QA lab and our plans to make it even better in years to come.
Scylla Summit 2017: Managing 10,000 Node Storage Clusters at Twitter (ScyllaDB)
If you’ve ever run a distributed database, you know that managing stateful systems is time-consuming and hard. I’ll talk about why that is, the path we took to make Twitter’s Manhattan database easy to run with thousands of nodes and multiple feature sets, and how you should think about operations.
Scylla Summit 2017: Streaming ETL in Kafka for Everyone with KSQL (ScyllaDB)
Apache Kafka is a high-throughput distributed streaming platform that is being adopted by hundreds of companies to manage their real-time data. KSQL is an open source streaming SQL engine that implements continuous, interactive queries against Apache Kafka™. KSQL makes it easy to read, write and process streaming data in real-time, at scale, using SQL-like semantics. In my talk, I will discuss streaming ETL from Kafka into stores like Apache Cassandra using KSQL.
Scylla Summit 2017: How to Optimize and Reduce Inter-DC Network Traffic and S... (ScyllaDB)
This presentation covers optimizing inter-data center communication: what it involves, the costs associated with it, and best practices for configuring snitches, keyspaces, client drivers, and query consistency levels to optimize performance between data centers. It recommends using a network topology replication strategy rather than the simple strategy for multi-region deployments, setting load balancing and consistency levels appropriately in clients, and enabling internode compression to reduce the cost of communication between data centers. It also encourages reviewing client locations and data access patterns, knowing who is reading and writing data, and holding conversations between operations and development teams to determine the best approach for each use case.
Scylla Summit 2017: Repair, Backup, Restore: Last Thing Before You Go to Prod... (ScyllaDB)
Benchmarks are fun to do but when going to production, all sorts of things can happen: anything from hardware outages to human error bringing your database down. Even in a healthy database, a lot of maintenance operations have to periodically run. Do you have the tools necessary to make sure you are good to go?
Scylla Summit 2017: The Upcoming HPC Evolution (ScyllaDB)
In this talk, I will explain how HPC is beginning to evolve and how we use supercomputers to monitor supercomputers. First we will look at how HPC is different from cloud computing in terms of infrastructure and application architecture. Then I will discuss how those things are changing and why. Finally, I will dive into a use case of monitoring supercomputers as an application area for Scylla.
Scylla Summit 2017: Scylla on Samsung NVMe Z-SSDs (ScyllaDB)
I will be giving a talk about performance characterization and tuning of Scylla on Samsung NVMe SSDs. We will characterize the performance of Scylla on Samsung high-performance NVMe SSDs and show how Z-SSD ─ the Samsung ultra-low-latency NVMe drive ─ can significantly shrink the performance gap between in-memory and in-storage with Scylla.
We will further evaluate the throughput-vs-latency profile of Scylla with NVMe devices and present end-to-end latencies (from the client's viewpoint) as well as the latencies of the software/hardware stack. We will show that a Z-SSD-backed Scylla cluster can provide competitive performance to an in-memory deployment while sharply reducing costs.
Kubernetes is a declarative system for automatically deploying, managing, and scaling applications and their dependencies. In this short talk, I'll demonstrate a small Scylla cluster running in Google Compute Engine via Kubernetes and our publicly-published Docker images.
Scylla Summit 2017: How to Use Gocql to Execute Queries and What the Driver D... (ScyllaDB)
This document outlines a presentation on using the GoCQL driver to execute queries against Cassandra and Scylla databases. It discusses connecting to a Cassandra cluster, executing queries, iterating over results, and using asynchronous queries. It also mentions some additional Cassandra libraries built on top of GoCQL, including gocqlx for data binding and queries, and gocassa for queries and migrations. The presentation aims to explain how GoCQL works behind the scenes and how to get started with basic querying functionality.
CassieQ: The Distributed Message Queue Built on Cassandra (Anton Kropp, Cural...) (DataStax)
Building queues on distributed data stores is hard, and has long been considered an antipattern. However, with careful consideration and tactics, it is possible. CassieQ is an implementation of a distributed queue on Cassandra which supports easy installation, massive data ingest, authentication, a simple-to-use HTTP-based API, and no dependencies other than your already existing Cassandra environment.
About the Speakers
Anton Kropp Senior Software Engineer, Curalate
Anton Kropp is a senior engineer with over 8 years of experience building distributed and fault-tolerant systems. He has worked at companies big and small (Godaddy, PracticeFusion), and enjoys building frameworks and tooling to make life easier, with a penchant for dockerized containers and simple APIs. When he's not messing around on his computer, he's drinking local Seattle beers, zipping around the city on his electric bike, and hanging out with his wife and dog.
mParticle's Journey to Scylla from Cassandra (ScyllaDB)
mParticle processes 50 billion monthly messages and needed a data store that provides full availability and performance. They previously used Cassandra but faced issues with high latency, complicated tuning, and backlogs of up to 20 hours. They tested Scylla and found it provided significantly lower latency and compaction backlogs with minimal tuning needed. Scylla also offered knowledgeable support. mParticle migrated their data from Cassandra to Scylla, which immediately kept up with their data loads with little to no backlog.
Scylla Summit 2017: How Baidu Runs Scylla on a Petabyte-Level Big Data Platform (ScyllaDB)
In this presentation, I'll speak of the benefits of running Scylla on our Big Data environment which stores over 500TB of data as well as using Scylla as the indexing engine to replace MongoDB and Cassandra for our log data analysis platform.
Scylla Summit 2017: Stretching Scylla Silly: The Datastore of a Graph Databas... (ScyllaDB)
In this talk, we will cover the lay of the land of graph databases. We will talk about what it takes to run a highly available hosted solution in the cloud while giving users a seamless vertical and horizontal scaling solution, and share our experiences migrating from an Apache Cassandra backed graphDB as-a-service solution.
ScyllaDB CTO Avi Kivity gave a keynote on how Scylla has evolved. He discussed new features in Scylla 2.0—including Materialized Views and Heat-Weighted Load Balancing, changes in monitoring—and shared our product roadmap. He also talked about our recent acquisition of Seastar.io and how it will enable us to deliver a database-as-a-service offering.
A presentation on compaction in Scylla and Cassandra: why it is needed and how it works; the different compaction strategies, their strengths and weaknesses; and the different types of "amplification" and how to use them to reason about those strategies. And finally, what Scylla does better than Cassandra in this area. These slides were presented at a joint meetup in Tel-Aviv of the following two groups:
https://www.meetup.com/Israel-Cassandra-Users/events/259322355/
https://www.meetup.com/Big-things-are-happening-here/events/259495379/
Balancing Compaction Principles and Practices (ScyllaDB)
Compaction is a crucial component for preventing storage consumption from exploding. In this session, we’ll talk about why compaction is required and its principles of operation, the main compaction strategies available for use, when they should be used, and how they can be configured. Finally, we’ll present new compaction features recently introduced in ScyllaDB Enterprise and ScyllaDB Cloud.
Scylla Summit 2017: Intel Optane SSDs as the New Accelerator in Your Data Center (ScyllaDB)
Frank will share the motivation behind the 3D XPoint memory, the current shipping Optane SSD product and key values of why it is better than NAND-based SSDs, and a few use cases that exist in the Open Source space for Database usages of Optane SSDs.
TechTalk: Reduce Your Storage Footprint with a Revolutionary New Compaction S... (ScyllaDB)
Compaction. A necessary reality in databases with immutable table designs. To date, Scylla and Cassandra compaction strategies for SSTables have had tradeoffs. For example, size-tiered compaction strategy requires leaving 50% of your total drive space unused in order to compact large tables.
What if there was a new, better, more efficient way to handle compactions in Scylla? One that allows you to use your storage much more efficiently? Enter Scylla’s unique Incremental Compaction Strategy (ICS).
Join us for a comparison of common compaction strategies and a technical deep dive into ICS. You’ll learn why ICS will become the new standard for compaction, including an overview of how much disk space you can save with ICS.
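The 50% figure for size-tiered compaction comes from its worst case: a major compaction must hold both the input SSTables and the merged output on disk until the merge completes. A back-of-the-envelope model makes the contrast with an incremental approach concrete (this is my own simplification; `fragment_bytes` and both formulas are illustrative, not Scylla's actual accounting):

```python
def stcs_peak(data_bytes):
    """Size-tiered worst case: inputs and output coexist on disk
    during a major compaction, roughly doubling transient usage."""
    return 2 * data_bytes

def ics_peak(data_bytes, fragment_bytes=10**9, parallel_merges=1):
    """Incremental idea: compact in fixed-size fragments and release
    each input fragment as soon as the corresponding output fragment
    is sealed, so transient overhead stays near a few fragments
    rather than the whole dataset (simplified)."""
    return data_bytes + parallel_merges * 2 * fragment_bytes

TB = 10**12
print(stcs_peak(4 * TB) / TB)  # 8.0  -> a 4 TB dataset needs ~8 TB of disk
print(ics_peak(4 * TB) / TB)   # 4.002 -> overhead shrinks to ~a fragment or two
```

Under this toy model, the incremental strategy's overhead is independent of total data size, which is the intuition behind the disk-space savings claimed for ICS.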
Scylla Summit 2017: Performance Evaluation of Scylla as a Database Backend fo... (ScyllaDB)
JanusGraph, a highly scalable graph database solution, has historically supported Cassandra and HBase as storage backends. We decided to put Scylla in the mix in search of the best-performing backend. We ran test scenarios that cover high-volume reads and writes. In this talk, we will show you the performance results of Scylla versus the others, and also share the lessons we learned during the evaluation.
1. Log structured merge trees store data in multiple levels with different storage speeds and costs, requiring data to periodically merge across levels.
2. This structure allows fast writes by storing new data in faster levels before merging to slower levels, and efficient reads by querying multiple levels and merging results.
3. The merging process involves loading, sorting, and rewriting levels to consolidate and propagate deletions and updates between levels.
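The merge described in points 1-3 can be sketched in a few lines of Python (a toy in-memory model; real engines stream sorted runs from disk): newer levels win on key conflicts, and deletions (tombstones) are dropped once they reach the last level.

```python
TOMBSTONE = object()  # sentinel marking a deleted key

def merge_levels(levels, is_last_level=True):
    """Merge LSM levels (levels[0] is newest) into one sorted level.

    Newer values shadow older ones; tombstones are propagated so the
    deletion reaches older data, and are dropped entirely when merging
    into the last (bottom) level, where nothing older can remain.
    """
    merged = {}
    for level in reversed(levels):      # apply oldest first, newest overwrites
        merged.update(level)
    if is_last_level:                   # deletions need not persist further
        merged = {k: v for k, v in merged.items() if v is not TOMBSTONE}
    return dict(sorted(merged.items()))

# Newest level deletes "a" and updates "b"; the older level holds a, b, c.
newest = {"a": TOMBSTONE, "b": 2}
oldest = {"a": 1, "b": 1, "c": 1}
print(merge_levels([newest, oldest]))  # {'b': 2, 'c': 1}
```

Note how the update to "b" and the deletion of "a" both propagate in a single rewrite of the level, which is exactly why merging consolidates updates and deletions.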
Breakthrough OLAP performance with Cassandra and Spark (Evan Chan)
Find out about breakthrough architectures for fast OLAP performance querying Cassandra data with Apache Spark, including a new open source project, FiloDB.
Scylla Summit 2017: Welcome and Keynote - Nextgen NoSQL (ScyllaDB)
Our CEO and co-founder Dor Laor and our chairman Benny Schnaider sharing their vision for Scylla. This was also our opportunity to announce Scylla 2.0. Our latest release is a big step toward the first autonomous NoSQL database—one that dynamically tunes itself to varying conditions while always maintaining a high level of performance.
Scylla Summit 2017: Saving Thousands by Running Scylla on EC2 Spot Instances (ScyllaDB)
Scylla and Spotinst together provide a strong combination of extreme performance and cost reduction. In this talk, we will present how a Scylla cluster can be used on AWS’s EC2 Spot without losing consistency with the help of Spotinst prediction technology and advanced stateful features. We will show a live demo on how to run Scylla on the Spotinst platform.
This document provides a summary of key concepts in Hadoop including MapReduce, HDFS, and the Hadoop ecosystem. It begins with an introduction to MapReduce processing using the map and reduce functions. It then discusses HDFS storage and failures that can occur. The document concludes with a brief overview of additional tools in the Hadoop ecosystem such as Pig, Hive, HBase, Sqoop and Flume.
Scylla Summit 2018: Keynote - 4 Years of Scylla (ScyllaDB)
This document summarizes Dor Laor's experience over 4+ years with ScyllaDB, including key milestones and achievements as well as ongoing goals and challenges. It notes Scylla's initial release in 2016 and improvements over time to features such as materialized views and global secondary indexes. It also discusses optimizing performance on cloud infrastructure and addressing challenges related to workload types and capacity planning. Going forward, it outlines priorities like lightweight transactions, change data capture, and improving Cassandra compatibility. The overall message is one of pride in accomplishments while still feeling challenged to achieve further dreams and improvements.
This document discusses optimizing columnar data stores. It begins with an overview of row-oriented versus column-oriented data stores, noting that column stores are well-suited for read-heavy analytical loads as they only need to read relevant data. The document then covers the history of columnar stores and notable features like data encoding, compression techniques like run-length encoding, and lazy decompression. Specific columnar file formats like RCFile, ORC, and Parquet are mentioned. The document concludes with a case study describing optimizations made to a 1PB Hive table that resulted in a 3x query performance improvement through techniques like explicit sorting, improved compression, increased bucketing, and stripe size tuning.
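Run-length encoding, one of the techniques mentioned, is easy to sketch (a simplified illustration of the idea, not any particular file format's implementation); it pays off precisely because sorted columnar data tends to contain long runs of repeated values:

```python
def rle_encode(values):
    """Collapse runs of repeated values into (value, count) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)   # extend the current run
        else:
            runs.append((v, 1))               # start a new run
    return runs

def rle_decode(runs):
    """Expand (value, count) pairs back into the original sequence."""
    return [v for v, n in runs for _ in range(n)]

# A low-cardinality, sorted column compresses from 6 entries to 3 pairs.
col = ["US", "US", "US", "UK", "UK", "US"]
encoded = rle_encode(col)
print(encoded)  # [('US', 3), ('UK', 2), ('US', 1)]
assert rle_decode(encoded) == col
```

This also illustrates why the case study's explicit sorting helped: sorting lengthens runs, which directly shrinks the encoded size.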
Webinar: Using Control Theory to Keep Compactions Under Control (ScyllaDB)
As data is ingested into a database, it must be constantly rewritten for easy querying. Scylla writes incoming data to immutable files that must later be compacted into fewer files in order to maintain good read performance. The question becomes how fast should you compact? The traditional approach is to expose throughput tunables so the user can control the compaction speed. That means finding a good value involves a lot of trial and error. And what if the workload changes?
We take a different approach at ScyllaDB. We use the mathematical foundation of control theory to make automatic decisions about compactions, putting an end to compaction tuning altogether.
Watch this webinar to learn:
- How we created mathematical models of compaction backlog
- How to use that model to feed a control theory framework that can automatically tune compactions
- Other exciting developments that are coming in this area
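The core idea of a backlog-driven controller can be shown as a toy proportional controller (my own illustration; Scylla's actual controller and its backlog model are more sophisticated): instead of a fixed throughput tunable, resources granted to compaction scale with how far the backlog has grown.

```python
def compaction_shares(backlog_bytes, max_backlog_bytes,
                      min_shares=50, max_shares=1000):
    """Map the current compaction backlog to scheduler shares.

    A simple proportional controller: the larger the backlog relative
    to the allowed maximum, the more CPU/disk shares compaction gets,
    so it speeds up under pressure and backs off when idle.
    """
    fraction = min(backlog_bytes / max_backlog_bytes, 1.0)  # clamp at 100%
    return round(min_shares + fraction * (max_shares - min_shares))

print(compaction_shares(0, 10**9))          # 50   (idle: minimal resources)
print(compaction_shares(5 * 10**8, 10**9))  # 525  (half-full backlog)
print(compaction_shares(2 * 10**9, 10**9))  # 1000 (saturated: full throttle)
```

Because the output responds continuously to the measured backlog, there is no throughput knob for the operator to guess at, which is the point of the webinar's approach.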
DB2 Version 8 introduces several new features for developers including the ability to join up to 225 tables in a single query, support for SQL statements up to 2MB in size, longer object names up to 128 bytes, and multi-row fetch and insert capabilities that allow retrieving and inserting multiple rows of data in a single operation for improved performance.
Scylla Summit 2017: A Deep Dive on Heat Weighted Load Balancing (ScyllaDB)
This presentation discusses the "cold node problem" that occurs when a node restarts in a Cassandra cluster. When a node restarts, it loses its cached data and becomes a bottleneck. The presentation proposes a "heat weighted load balancing" solution where the cluster tracks each node's cache hit ratio and redistributes requests based on this ratio after a restart. Testing shows this solution significantly improves throughput after a node restart by distributing requests more evenly across nodes based on their "heat" or cache contents.
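The redistribution idea can be sketched as weighting replica choice by cache hit ratio (a hypothetical simplification for illustration, not the algorithm as implemented in Scylla):

```python
import random

def pick_replica(hit_ratios, rng=random.Random(42)):
    """Choose a replica with probability proportional to its cache hit ratio.

    A freshly restarted ("cold") node reports a low hit ratio and so
    receives few requests at first, then more as its cache warms up,
    instead of becoming a bottleneck the moment it rejoins.
    """
    nodes = list(hit_ratios)
    weights = [hit_ratios[n] for n in nodes]
    return rng.choices(nodes, weights=weights, k=1)[0]

# node3 just restarted and has an almost empty cache.
ratios = {"node1": 0.95, "node2": 0.90, "node3": 0.05}
sample = [pick_replica(ratios) for _ in range(1000)]
print(sample.count("node3"))  # only a small fraction of requests hit the cold node
```

As node3's measured hit ratio climbs back toward its peers', the same weighting automatically restores an even request distribution.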
How Incremental Compaction Reduces Your Storage Footprint (ScyllaDB)
What if there was a new, better, more efficient way to handle compactions in Scylla? One that allows you to use your storage much more efficiently? Enter Scylla’s unique Incremental Compaction Strategy (ICS). Get a comparison of common compaction strategies and a technical deep dive into ICS. You’ll learn why ICS will become the new standard for compaction, including an overview of how much disk space you can save with ICS.
- Scala originated from Martin Odersky's work on functional programming languages like OCaml in the 1980s and 1990s. It was designed to combine object-oriented and functional programming in a practical way.
- Key aspects of Scala include being statically typed while also being flexible, its unified object and module system, and treating libraries as primary over language features.
- Scala has grown into a large ecosystem centered around the JVM and also targets JavaScript via Scala.js. Tooling continues to improve with faster compilers and new IDE support.
- Future work includes establishing TASTY as a portable intermediate representation, connecting Scala to formal foundations through the DOT calculus, and exploring new type
Similar to Scylla Summit 2017: How to Ruin Your Workload's Performance by Choosing the Wrong Compaction Strategy
Unconventional Methods to Identify Bottlenecks in Low-Latency and High-Throug... (ScyllaDB)
In this presentation, we explore how standard profiling and monitoring methods may fall short in identifying bottlenecks in low-latency data ingestion workflows. Instead, we showcase the power of simple yet clever methods that can uncover hidden performance limitations.
Attendees will discover unconventional techniques, including clever logging, targeted instrumentation, and specialized metrics, to pinpoint bottlenecks accurately. Real-world use cases will be presented to demonstrate the effectiveness of these methods. By the end of the session, attendees will be equipped with alternative approaches to identify bottlenecks and optimize their low-latency data ingestion workflows for high throughput.
Mitigating the Impact of State Management in Cloud Stream Processing Systems (ScyllaDB)
Stream processing is a crucial component of modern data infrastructure, but constructing an efficient and scalable stream processing system can be challenging. Decoupling compute and storage architecture has emerged as an effective solution to these challenges, but it can introduce high latency issues, especially when dealing with complex continuous queries that necessitate managing extra-large internal states.
In this talk, we focus on addressing the high latency issues associated with S3 storage in stream processing systems that employ a decoupled compute and storage architecture. We delve into the root causes of latency in this context and explore various techniques to minimize the impact of S3 latency on stream processing performance. Our proposed approach is to implement a tiered storage mechanism that leverages a blend of high-performance and low-cost storage tiers to reduce data movement between the compute and storage layers while maintaining efficient processing.
Throughout the talk, we will present experimental results that demonstrate the effectiveness of our approach in mitigating the impact of S3 latency on stream processing. By the end of the talk, attendees will have gained insights into how to optimize their stream processing systems for reduced latency and improved cost-efficiency.
Measuring the Impact of Network Latency at Twitter (ScyllaDB)
Widya Salim and Victor Ma will outline the causal impact analysis, framework, and key learnings used to quantify the impact of reducing Twitter's network latency.
Architecting a High-Performance (Open Source) Distributed Message Queuing Sys... (ScyllaDB)
BlazingMQ is a new open source* distributed message queuing system developed at and published by Bloomberg. It provides highly-performant queues to applications for asynchronous, efficient, and reliable communication. This system has been used at scale at Bloomberg for eight years, where it moves terabytes of data and billions of messages across tens of thousands of queues in production every day.
BlazingMQ provides highly-available, fault-tolerant queues courtesy of replication based on the Raft consensus algorithm. In addition, it provides a rich set of enterprise message routing strategies, enabling users to implement a variety of scenarios for message processing.
Written in C++ from the ground up, BlazingMQ has been architected with low latency as one of its core requirements. This has resulted in some unique design and implementation choices at all levels of the system, such as its lock-free threading model, custom memory allocators, compact wire protocol, multi-hop network topology, and more.
This talk will provide an overview of BlazingMQ. We will then delve into the system’s core design principles, architecture, and implementation details in order to explore the crucial role they play in its performance and reliability.
*BlazingMQ will be released as open source between now and P99 (exact timing is still TBD)
Noise Canceling RUM by Tim Vereecke, Akamai (ScyllaDB)
Noisy Real User Monitoring (RUM) data can ruin your P99!
We introduce a fresh concept called "Human Visible Navigations" (HVN) to tackle this risk; we focus on the experiences you actually care about when talking about the speed of our sites:
- Human: We exclude noise coming from bots and synthetic measurements.
- Visible: We remove any partial or fully hidden experiences. These tend to be very slow but users don’t see this slowness.
- Navigations: We ignore lightning fast back-forward navigations which usually have few optimisation opportunities.
Adopting Human Visible Navigations provides you with these key benefits:
- Fewer changes staying below the radar
- Fewer data fluctuations
- Fewer blindspots when finding bottlenecks
- Better correlation with business metrics
This is supported by plenty of real-world examples coming from the world's largest scale modeling site (6M monthly visits), in combination with aggregated data from the brand new rumarchive.com (open source).
After attending this session; your P99 and other percentiles will become less noisy and easier to tune!
Always-on Profiling of All Linux Threads, On-CPU and Off-CPU, with eBPF & Con... (ScyllaDB)
In this session, Tanel introduces a new open source eBPF tool for efficiently sampling both on-CPU events and off-CPU events for every thread (task) in the OS. Linux standard performance tools (like perf) allow you to easily profile on-CPU threads doing work, but if we want to include the off-CPU timing and reasons for the full picture, things get complicated. Combining eBPF task state arrays with periodic sampling for profiling allows us to get both a system-level overview of where threads spend their time, even when blocked and sleeping, and allow us to drill down into individual thread level, to understand why.
Performance Budgets for the Real World by Tammy Everts (ScyllaDB)
Performance budgets have been around for more than ten years. Over those years, we’ve learned a lot about what works, what doesn’t, and what we need to improve. In this session, Tammy revisits old assumptions about performance budgets and offers some new best practices. Topics include:
• Understanding performance budgets vs. performance goals
• Aligning budgets with user experience
• Pros and cons of Core Web Vitals
• How to stay on top of your budgets to fight regressions
Using Libtracecmd to Analyze Your Latency and Performance Troubles (ScyllaDB)
Trying to figure out why your application is responding late can be difficult, especially if the cause is interference from the operating system. This talk will briefly go over how to write a C program that can analyze what in the Linux system is interfering with your application. It will use trace-cmd to enable kernel trace events as well as tracing lock functions, and will then give a quick tutorial on how to use libtracecmd to read the resulting trace.dat file and uncover the cause of the interference with your application.
Reducing P99 Latencies with Generational ZGC (ScyllaDB)
With the low-latency garbage collector ZGC, GC pause times are no longer a big problem in Java. With sub-millisecond pause times there are instead other things in the GC and JVM that can cause application threads to experience unexpected latencies. This talk will dig into a specific use where the GC pauses are no longer the cause of unexpected latencies and look at how adding generations to ZGC help lower the p99 application latencies.
5 Hours to 7.7 Seconds: How Database Tricks Sped up Rust Linting Over 2000X (ScyllaDB)
Linters are a type of database! They are a collection of lint rules — queries that look for rule violations to report — plus a way to execute those queries over a source code dataset.
This is a case study about using database ideas to build a linter that looks for breaking changes in Rust library APIs. Maintainability and performance are key: new Rust releases tend to have mutually-incompatible ways of representing API information, and we cannot afford to reimplement and optimize dozens of rules for each Rust version separately. Fortunately, databases don't require rewriting queries when the underlying storage format or query plan changes! This allows us to ship massive optimizations and support multiple Rust versions without making any changes to the queries that describe lint rules.
"Ship now, optimize later" can be a sustainable development practice after all — join us to see how!
How Netflix Builds High Performance Applications at Global Scale (ScyllaDB)
We all want to build applications that are blazingly fast. We also want to scale them to users all over the world. Can the two happen together? Can users in the slowest of environments also get a fast experience? Learn how we do this at Netflix: how we understand every user's needs and preferences and build high performance applications that work for every user, every time.
Conquering Load Balancing: Experiences from ScyllaDB Drivers (ScyllaDB)
Load balancing seems simple on the surface, with algorithms like round-robin, but the real world loves throwing curveballs. Join me in this session as we delve into the intricacies of load balancing within ScyllaDB Drivers. Discover firsthand experiences from our journey in driver development, where we employed the Power of Two Choices algorithm, optimized the implementation of load balancing in Rust Driver, mitigated cloud costs through zone-aware load balancing and combated the issue of overloading a particular core of ScyllaDB. Be prepared to delve into the practical and theoretical aspects of load balancing, gaining valuable insights along the way.
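The Power of Two Choices algorithm mentioned above is compact enough to sketch (illustrative Python, not the Rust driver's implementation):

```python
import random

def pick_node(loads, rng=random.Random(0)):
    """Power of Two Choices: sample two nodes uniformly at random and
    send the request to the less loaded of the pair.

    This avoids both the herding of "always pick the least loaded"
    (where every client stampedes the same node) and the imbalance
    of plain random choice.
    """
    a, b = rng.sample(list(loads), 2)
    return a if loads[a] <= loads[b] else b

loads = {"n1": 10, "n2": 3, "n3": 7}   # in-flight requests per node
chosen = pick_node(loads)
loads[chosen] += 1                      # account for the new in-flight request
```

Because each decision needs only two load readings, the algorithm stays cheap per request while still steering traffic away from overloaded nodes (or, with per-shard loads, away from one hot core).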
Interaction Latency: Square's User-Centric Mobile Performance Metric (ScyllaDB)
Mobile performance metrics often take inspiration from the backend world and measure resource usage (CPU usage, memory usage, etc) and workload durations (how long a piece of code takes to run).
However, mobile apps are used by humans and the app performance directly impacts their experience, so we should primarily track user-centric mobile performance metrics. Following the lead of tech giants, the mobile industry at large is now adopting the tracking of app launch time and smoothness (jank during motion).
At Square, our customers spend most of their time in the app long after it's launched, and they don't scroll much, so app launch time and smoothness aren't critical metrics. What should we track instead?
This talk will introduce you to Interaction Latency, a user-centric mobile performance metric inspired by the Web Vital metric "Interaction to Next Paint" (web.dev/inp). We'll go over why apps need to track this, how to properly implement its tracking (it's tricky!), how to aggregate this metric, and what thresholds you should target.
How to Avoid Learning the Linux-Kernel Memory Model (ScyllaDB)
The Linux-kernel memory model (LKMM) is a powerful tool for developing highly concurrent Linux-kernel code, but it also has a steep learning curve. Wouldn't it be great to get most of LKMM's benefits without the learning curve?
This talk will describe how to do exactly that by using the standard Linux-kernel APIs (locking, reference counting, RCU) along with simple rules of thumb, thus gaining most of LKMM's power with less learning. And the full LKMM is always there when you need it!
99.99% of Your Traces are Trash by Paige Cruz (ScyllaDB)
Distributed tracing is still finding its footing in many organizations today, one challenge to overcome is the data volume - keeping 100% of your traces is expensive and unnecessary. Enter sampling - head vs tail how do you decide? Let’s look at the design of Sifter and get familiar with why tail-based sampling is the way to enact a cost-effective tracing solution while actually increasing the system’s observability.
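Tail-based sampling decides after a trace completes, so the policy can keep the interesting traces and discard the boring majority. A hypothetical policy sketch (my own illustration of the general technique, not Sifter itself; the thresholds are made up):

```python
import random

def keep_trace(trace, baseline_rate=0.01, rng=random.Random(1)):
    """Tail-based sampling: decide only once the trace is complete.

    Always keep traces that errored or were slow (the ones that make
    debugging possible); keep only a small random baseline of the
    healthy, fast majority to control storage cost.
    """
    if trace["error"] or trace["duration_ms"] > 1000:
        return True
    return rng.random() < baseline_rate

# Interesting traces are always retained...
assert keep_trace({"error": True, "duration_ms": 12})
assert keep_trace({"error": False, "duration_ms": 5000})
# ...while a boring trace survives only ~1% of the time.
```

Head-based sampling, by contrast, must commit before the outcome is known, which is why it tends to drop exactly the slow and failed traces you wanted.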
Square's Lessons Learned from Implementing a Key-Value Store with Raft (ScyllaDB)
Put simply, Raft makes a use case (e.g., a key-value store or an indexing system) more fault tolerant, increasing availability through replication despite server and network failures. Raft has been gaining ground due to its simplicity, without sacrificing consistency or performance.
Although we'll cover Raft's building blocks, this is not about the Raft algorithm; it is more about the micro-lessons one can learn from building fault-tolerant, strongly consistent distributed systems using Raft. Things like majority agreement rule (quorum), write-ahead log, split votes & randomness to reduce contention, heartbeats, split-brain syndrome, snapshots & logs replay, client requests dedupe & idempotency, consistency guarantees (linearizability), leases & stale reads, batching & streaming, parallelizing persisting & broadcasting, version control, and more!
And believe it or not, you might be using some of these techniques without even realizing it!
This is inspired by Raft paper (raft.github.io), publications & courses on Raft, and an attempt to implement a key-value store using Raft as a side project.
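The majority-agreement (quorum) rule at the heart of that list is simple to state in code (an illustrative sketch of the rule, not a Raft implementation):

```python
def quorum(cluster_size):
    """Smallest number of nodes that constitutes a majority."""
    return cluster_size // 2 + 1

def is_committed(ack_count, cluster_size):
    """A log entry is committed once a majority has persisted it.

    Any two majorities of the same cluster must overlap in at least
    one node, so a committed entry survives any future leader election.
    """
    return ack_count >= quorum(cluster_size)

print(quorum(5))           # 3
print(is_committed(2, 5))  # False: 2 of 5 is not a majority
print(is_committed(3, 5))  # True
```

The overlap property in the docstring is the micro-lesson: it is why quorum writes plus quorum reads give strong consistency, with or without the rest of Raft.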
A Deep Dive Into Concurrent React by Matheus Albuquerque (ScyllaDB)
Writing fluid user interfaces becomes more and more challenging as the application complexity increases. In this talk, we’ll explore how proper scheduling improves your app’s experience by diving into some of the concurrent React features, understanding their rationales, and how they work under the hood.
The Latency Stack: Discovering Surprising Sources of Latency (ScyllaDB)
Usually, when an API call is slow, developers blame ourselves and our code. We held a lock too long, or used a blocking operation, or built an inefficient query. But often, the simple picture of latency as “the time a server takes to process a message” hides a great deal of end-to-end complexity. Debugging tail latencies requires unpacking the abstractions that we normally ignore: virtualization, hidden queues, and network behavior.
In this talk, I’ll describe how developers can diagnose more sources of delay and failure by building a more realistic and broad understanding of networked services. I’ll give some real-world cases when high end-to-end latency or elevated failure rates occurred due to factors we ordinarily might not even measure. Some examples include TCP SYN retransmission; virtualization on the client; and surprising behavior from AWS load balancers. Unfortunately, many measurement techniques don’t cover anything but the portion most directly under developer control. But developers can do better by comparing multiple measurements, applying Little’s law, investing in eBPF probes, and paying attention to the network layer.
Understanding API performance to find and fix issues faster ultimately means understanding the entire stack: the client, your code, and the underlying infrastructure.
Dev Dives: Mining your data with AI-powered Continuous Discovery (UiPathCommunity)
Want to learn how AI and Continuous Discovery can uncover impactful automation opportunities? Watch this webinar to find out more about UiPath Discovery products!
Watch this session and:
👉 See the power of UiPath Discovery products, including Process Mining, Task Mining, Communications Mining, and Automation Hub
👉 Watch the demo of how to leverage system data, desktop data, or unstructured communications data to gain deeper understanding of existing processes
👉 Learn how you can benefit from each of the discovery products as an Automation Developer
🗣 Speakers:
Jyoti Raghav, Principal Technical Enablement Engineer @UiPath
Anja le Clercq, Principal Technical Enablement Engineer @UiPath
⏩ Register for our upcoming Dev Dives July session: Boosting Tester Productivity with Coded Automation and Autopilot™
👉 Link: https://bit.ly/Dev_Dives_July
This session was streamed live on June 27, 2024.
Check out all our upcoming Dev Dives 2024 sessions at:
🚩 https://bit.ly/Dev_Dives_2024
Chapter 3 of ISTQB Foundation 2018 syllabus with sample questions. Answers about what is static testing, what is review, types of review, informal review, walkthrough, technical review, inspection.
Transcript: Details of description part II: Describing images in practice - T...BookNet Canada
This presentation explores the practical application of image description techniques. Familiar guidelines will be demonstrated in practice, and descriptions will be developed “live”! If you have learned a lot about the theory of image description techniques but want to feel more confident putting them into practice, this is the presentation for you. There will be useful, actionable information for everyone, whether you are working with authors, colleagues, alone, or leveraging AI as a collaborator.
Link to presentation recording and slides: https://bnctechforum.ca/sessions/details-of-description-part-ii-describing-images-in-practice/
Presented by BookNet Canada on June 25, 2024, with support from the Department of Canadian Heritage.
MYIR Product Brochure - A Global Provider of Embedded SOMs & SolutionsLinda Zhang
This brochure gives introduction of MYIR Electronics company and MYIR's products and services.
MYIR Electronics Limited (MYIR for short), established in 2011, is a global provider of embedded System-On-Modules (SOMs) and
comprehensive solutions based on various architectures such as ARM, FPGA, RISC-V, and AI. We cater to customers' needs for large-scale production, offering customized design, industry-specific application solutions, and one-stop OEM services.
MYIR, recognized as a national high-tech enterprise, is also listed among the "Specialized
and Special new" Enterprises in Shenzhen, China. Our core belief is that "Our success stems from our customers' success" and embraces the philosophy
of "Make Your Idea Real, then My Idea Realizing!"
This slide deck is a deep dive the Salesforce latest release - Summer 24, by the famous Stephen Stanley. He has examined the release notes very carefully, and summarised them for the Wellington Salesforce user group, virtual meeting June 27 2024.
The document discusses fundamentals of software testing including definitions of testing, why testing is necessary, seven testing principles, and the test process. It describes the test process as consisting of test planning, monitoring and control, analysis, design, implementation, execution, and completion. It also outlines the typical work products created during each phase of the test process.
The DealBook is our annual overview of the Ukrainian tech investment industry. This edition comprehensively covers the full year 2023 and the first deals of 2024.
Database Management Myths for DevelopersJohn Sterrett
Myths, Mistakes, and Lessons learned about Managing SQL Server databases. We also focus on automating and validating your critical database management tasks.
Blockchain and Cyber Defense Strategies in new genre timesanupriti
Explore robust defense strategies at the intersection of blockchain technology and cybersecurity. This presentation delves into proactive measures and innovative approaches to safeguarding blockchain networks against evolving cyber threats. Discover how secure blockchain implementations can enhance resilience, protect data integrity, and ensure trust in digital transactions. Gain insights into cutting-edge security protocols and best practices essential for mitigating risks in the blockchain ecosystem.
An invited talk given by Mark Billinghurst on Research Directions for Cross Reality Interfaces. This was given on July 2nd 2024 as part of the 2024 Summer School on Cross Reality in Hagenberg, Austria (July 1st - 7th)
Metadata Lakes for Next-Gen AI/ML - DatastratoZilliz
As data catalogs evolve to meet the growing and new demands of high-velocity, unstructured data, we see them taking a new shape as an emergent and flexible way to activate metadata for multiple uses. This talk discusses modern uses of metadata at the infrastructure level for AI-enablement in RAG pipelines in response to the new demands of the ecosystem. We will also discuss Apache (incubating) Gravitino and its open source-first approach to data cataloging across multi-cloud and geo-distributed architectures.
The presentation will delve into the ASIMOV project, a novel initiative that leverages Retrieval-Augmented Generation (RAG) to provide precise, domain-specific assistance to telecommunications engineers and technicians. The session will focus on the unique capabilities of Milvus, the chosen vector database for the project, and its advantages over other vector databases.
Attending this session will give you a deeper understanding of the potential of RAG and Milvus DB in telecommunications engineering. You will learn how to address common challenges in the field and enhance the efficiency of their operations. The session will equip you with the knowledge to make informed decisions about the choice of vector databases, and how best to use them for your use-cases
Quantum Communications Q&A with Gemini LLM. These are based on Shannon's Noisy channel Theorem and offers how the classical theory applies to the quantum world.
Data Protection in a Connected World: Sovereignty and Cyber Securityanupriti
Delve into the critical intersection of data sovereignty and cyber security in this presentation. Explore unconventional cyber threat vectors and strategies to safeguard data integrity and sovereignty in an increasingly interconnected world. Gain insights into emerging threats and proactive defense measures essential for modern digital ecosystems.
Data Protection in a Connected World: Sovereignty and Cyber Security
Scylla Summit 2017: How to Ruin Your Workload's Performance by Choosing the Wrong Compaction Strategy
1. PRESENTATION TITLE ON ONE LINE
AND ON TWO LINES
First and last name
Position, company
SCYLLA’S COMPACTION STRATEGIES
OR
HOW TO RUIN YOUR WORKLOAD'S PERFORMANCE
BY CHOOSING THE WRONG COMPACTION STRATEGY
Nadav Har’El, Raphael Carvalho
2. Nadav Har’El
Nadav Har’El has had a diverse 20-year career in
computer programming and computer science.
In the past he worked on scientific computing,
networking software, and information retrieval.
In recent years his focus has been on virtualization
and operating systems. He also worked on nested
virtualization and exit-less I/O in KVM. Today, he
maintains the OSv kernel and also works on Seastar
and Scylla.
3. Raphael Carvalho
Raphael S. Carvalho is a computer programmer who
loves file systems and has developed a huge
interest in distributed systems since he started
working on Scylla. Previously, he worked on ZFS
support for OSv and also drivers for the Syslinux
project. At ScyllaDB, Raphael has been mostly
working on compaction and compaction strategies.
4. Agenda
▪ What is compaction?
▪ Scylla’s compaction strategies:
o Size Tier
o Leveled
o Hybrid
o Date Tier
o Time Window
▪ Which should I use for my workload and why?
▪ Examples!
5. What is compaction?
Scylla’s write path:
(Diagram of the write path: writes, commit log, compaction)
6. (What is compaction?)
▪ Scylla’s write path:
o Updates are inserted into a memory table (“memtable”)
o Memtables are periodically flushed to a new sorted file (“sstable”)
▪ After a while, we have many separate sstables
o Different sstables may contain old and new values of the same cell
o Or different rows in the same partition
o Wastes disk space
o Slows down reads
▪ Compaction: read several sstables and output one (or more)
containing the merged and most recent information
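The merge step described above can be sketched in a few lines of Python. This is an illustrative model, not Scylla’s implementation: each "sstable" here is just a sorted list of (key, timestamp, value) tuples.

```python
def compact(sstables):
    """Merge sorted sstables, keeping only the newest value per key.

    Each sstable is a list of (key, timestamp, value) tuples sorted by key.
    Returns one merged sstable holding the most recent value for every key,
    which is what frees the space wasted by old cell versions.
    """
    merged = {}
    for table in sstables:
        for key, ts, value in table:
            # Keep the value with the highest write timestamp.
            if key not in merged or ts > merged[key][0]:
                merged[key] = (ts, value)
    # Output is sorted by key, like a real sstable.
    return [(k, ts, v) for k, (ts, v) in sorted(merged.items())]

old = [("a", 1, "v1"), ("b", 1, "v1")]
new = [("a", 2, "v2"), ("c", 2, "v1")]
print(compact([old, new]))  # → [('a', 2, 'v2'), ('b', 1, 'v1'), ('c', 2, 'v1')]
```

A real implementation streams the merge in key order instead of materializing a dict, which is what makes the I/O contiguous.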
7. What is compaction? (cont.)
▪ This technique of keeping sorted files and merging them is
well-known and often called Log-Structured Merge (LSM) Tree
▪ Published in 1996, earliest popular application that I know of is the
Lucene search engine, 1999
o High performance write.
o Immediately readable.
o Reasonable performance for read.
8. (Compaction efficiency requirements)
▪ Sstable merge is efficient
o Merging sorted sstables is efficient, with contiguous I/O for both reads and writes
▪ Background compaction does not increase request tail-latency
o Scylla breaks compaction work into small pieces
▪ Background compaction does not fluctuate request throughput
o “Workload conditioning”: compaction done not faster than needed
9. Compaction Strategy
▪ Which sstables to compact, and when?
▪ This is called the compaction strategy
▪ The goal of the strategy is low amplification:
o Avoid read requests needing many sstables.
• read amplification
o Avoid overwritten/deleted/expired data staying on disk.
o Avoid excessive temporary disk space needs (scary!)
• space amplification
o Avoid compacting the same data again and again.
• write amplification
Which compaction strategy shall I choose?
10. Strategy #1: Size-Tiered Compaction
▪ Cassandra’s oldest, and still default, compaction strategy
▪ Dates back to Google’s BigTable paper (2006)
o Idea used even earlier (e.g., Lucene, 1999)
11. Size-Tiered compaction strategy
12. (Size-Tiered compaction strategy)
▪ Each time enough data accumulates in the memory table, flush it to a
small sstable
▪ When several small sstables exist, compact them into one bigger
sstable
▪ When several bigger sstables exist, compact them into one very big
sstable
▪ …
▪ Each time one “size tier” has enough sstables, compact them into
one sstable in the (usually) next size tier
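The tiering rule above can be sketched as follows. This is illustrative only; the bucket_low/bucket_high/threshold parameters are assumptions modeled on typical size-tiered options, not Scylla’s exact values.

```python
def size_tiers(sstable_sizes, bucket_low=0.5, bucket_high=1.5, threshold=4):
    """Group sstables of similar size into tiers (a simplified STCS sketch).

    An sstable joins a tier if its size is within [bucket_low, bucket_high]
    times the tier's average size. Returns the tiers that reached
    `threshold` members and are candidates for compaction into one
    bigger sstable in the next tier.
    """
    tiers = []
    for size in sorted(sstable_sizes):
        for tier in tiers:
            avg = sum(tier) / len(tier)
            if bucket_low * avg <= size <= bucket_high * avg:
                tier.append(size)
                break
        else:
            tiers.append([size])  # no similar tier found: start a new one
    return [t for t in tiers if len(t) >= threshold]

# Four small sstables form a full tier; the big one waits alone.
print(size_tiers([10, 10, 10, 10, 100]))  # → [[10, 10, 10, 10]]
```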
13. Size-Tiered compaction - amplification
▪ write amplification: O(logN)
o Where “N” is (data size) / (flushed sstable size).
o Most data is in highest tier - needed to pass through O(logN) tiers
o This is asymptotically optimal
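The O(logN) bound is easy to check with a toy model, sketched here under the assumption of a fixed fan-out of 4 sstables per tier:

```python
def write_amplification(data_size, flush_size, fan_out=4):
    """Count how many times a row is written on its way to the top tier.

    Simplified model: each tier's sstables are fan_out times larger than
    the tier below, and a row is rewritten once per tier it passes
    through - so the total is 1 flush + O(log N) compactions.
    """
    writes = 1                   # the initial memtable flush
    size = flush_size
    while size < data_size:
        size *= fan_out          # compacted into the next size tier
        writes += 1
    return writes

# N = 256 flush-sized units with fan-out 4: 4 tiers above the flush.
print(write_amplification(256, 1))  # → 5
```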
14. Size-Tiered compaction - amplification
What is read amplification? O(logN) sstables, but:
▪ If workload writes a partition once and never modifies it:
o Eventually each partition’s data will be compacted into one sstable
o In-memory bloom filter will usually allow reading only one sstable
o Optimal
▪ But if workload continues to update a partition:
o All sstables will contain updates to the same partition
o O(logN) reads per read request
o Reasonable, but not great
15. Size-Tiered compaction - amplification
▪ Space amplification
16. Size-Tiered compaction - amplification
▪ Space amplification:
o Obsolete data in a huge sstable will remain for a very long time
o Compaction needs a lot of temporary space:
• Worst-case, needs to merge all existing sstables into one and may need
half the disk to be empty for the merged result. (2x)
• Less of a problem in Scylla than Cassandra because of sharding
o When workload is overwrite-intensive, it is even worse:
• We wait until 4 large sstables
• All 4 overwrote the same data, so merged amount is same as in 1 sstable
• 5-fold space amplification!
• Or worse: if compaction falls behind, the same data can appear in several
tiers of unequal sizes
17. Strategy #2: Leveled Compaction
▪ Introduced in Cassandra 1.0, in 2011.
▪ Based on Google’s LevelDB (itself based on Google’s BigTable)
▪ No longer has size-tiered’s huge sstables
▪ Instead have runs:
o A run is a collection of small (160 MB by default) SSTables
o Have non-overlapping key ranges
o A huge SSTable must be rewritten as a whole, but in a run we can modify only
parts of it (individual sstables) while keeping the disjoint key requirement
▪ In leveled compaction strategy:
18. Leveled compaction strategy
(Diagram: Level 0; Level 1, a run of 10 sstables; Level 2, a run of 100 sstables; ...)
19. (Leveled compaction strategy)
▪ SSTables are divided into “levels”:
o New SSTables (dumped from memtables) are created in Level 0
o Each other level is a run of SSTables of exponentially increasing size:
• Level 1 is a run of 10 SSTables (of 160 MB each)
• Level 2 is a run of 100 SSTables (of 160 MB each)
• etc.
▪ When we have enough (e.g., 4) sstables in Level 0, we compact
them with all 10 sstables in Level 1
o We don't create one large sstable - rather, a run: we write one sstable and
when we reach the size limit (160 MB), we start a new sstable
20. (Leveled compaction strategy)
▪ After the compaction of level 0 into level 1, level 1 may have more
than 10 sstables. We pick one and compact it into level 2:
o Take one sstable from level 1
o Look at its key range and find all sstables in level 2 which overlap with it
o Typically, there are about 12 of these
• The level 1 sstable spans roughly 1/10th of the keys, while each level 2
sstable spans 1/100th of the keys; so a level-1 sstable’s range roughly
overlaps 10 level-2 sstables plus two more on the edges
o As before, we compact the one sstable from level 1 and the 12 sstables from
level 2 and replace all of those with new sstables in level 2
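Finding the ~12 overlapping next-level sstables is a simple range intersection. A sketch, with made-up integer key ranges:

```python
def overlapping(candidate, next_level):
    """Return the next-level sstables whose key ranges overlap a candidate.

    Ranges are (first_key, last_key) tuples. Within a level the ranges are
    disjoint and sorted, so one upper-level sstable typically overlaps
    about fan-out (~10) of them, plus the two edge neighbours.
    """
    lo, hi = candidate
    # Two ranges overlap unless one ends before the other begins.
    return [r for r in next_level if not (r[1] < lo or r[0] > hi)]

level2 = [(0, 9), (10, 19), (20, 29), (30, 39)]
print(overlapping((5, 25), level2))  # → [(0, 9), (10, 19), (20, 29)]
```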
21. (Leveled compaction strategy)
▪ After this compaction of level 1 into level 2, now we can have
excess sstables in level 2 so we merge them into level 3. Again, one
sstable from level 2 will need to be compacted with around 10
sstables from level 3.
22. Leveled compaction - amplification
▪ Space amplification:
o Because of sstable counts, 90% of the data is in the deepest level (if full!)
o These sstables do not overlap, so it can’t have duplicate data!
o So at most, 10% of the space is wasted
o Also, each compaction needs a constant (~12*160MB) temporary space
o Nearly optimal
23. Leveled compaction - amplification
▪ Read amplification:
o We have O(N) tables!
o But in each level sstables have disjoint ranges (cached in memory)
o Worst-case, O(logN) sstables relevant to a partition - plus L0 size.
o Under some assumptions (update complete rows, of similar sizes)
space amplification implies: 90% of the reads will need just one sstable!
o Nearly optimal
24. Leveled compaction - amplification
▪ Write amplification:
25. Leveled compaction - amplification
▪ Write amplification:
o Again, most of the data is in the deepest level k
• E.g., k=3 is enough for 160 GB of data (per shard!)
• All data was written once in L0, then compacted into L1, … then to Lk
• So each row written k+1 times
o For each input (level i>1) sstable we compact, we compact it with ~12
overlapping sstables in level i+1. Writing ~13 output sstables. (lower for L0)
o Worst-case, write amplification is around 13k
o Also O(logN) but higher constant factor than size-tiered...
o If enough writing and LCS can’t keep up, its read and space advantages are
lost
o If also have cache-miss reads, they will get less disk bandwidth
26. Example 1 - write-only workload
▪ Write-only workload
o Cassandra-stress writing 30 million partitions (about 9 GB of data)
o Constant write rate 10,000 writes/second
o One shard
27. Example 1 - write-only workload
▪ Size-tiered compaction:
at some points needs twice the disk space
o In Scylla with many shards, “usually” maximum space use is not concurrent
▪ Level-tiered compaction:
more than double the amount of disk I/O
o Test used smaller-than-default sstables (10 MB) to illustrate the problem
o Same problem with default sstable size (160 MB) - with larger workloads
28. Example 1 (space amplification)
(Graph annotations: “constant multiple of flushed memtable & sstable size”; “x2 space amplification”)
29. Example 1 (write amplification)
▪ Amount of actual data collected: 8.8 GB
▪ Size-tiered compaction: 50 GB writes (4 tiers + commit log)
▪ Leveled compaction: 111 GB writes
30. Example 1 - note
▪ Leveled compaction’s write amplification is not only a problem with
100%-write workloads...
▪ Even with just 10% writes, the amplified write workload can be so high that
o Uncached reads slowed down because we need the disk to write
o Compaction can’t keep up, uncompacted sstables pile up, even slower reads
▪ Leveled compaction is unsuitable for many workloads with a
non-negligible amount of writes even if they seem “read mostly”
31. Can we create a new compaction strategy with
▪ Low write amplification of size-tiered compaction
▪ Without its high temporary disk space usage during compaction?
32. Strategy #3: Hybrid Compaction
▪ New in upcoming version of Scylla Enterprise
▪ Hybrid of Size-Tiered and Leveled strategies:
33. Strategy #3: Hybrid Compaction
▪ Size-tiered compaction needs temporary space because we only
remove a huge sstable after we fully compact it.
▪ Let’s split each huge sstable into a run (a la LCS) of “fragments”:
o Treat the entire run (not individual sstables) as a file for STCS
o Remove individual sstables as compacted. Low temporary space.
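The temporary-space benefit of runs can be illustrated with a toy model. This is a sketch only; the sizes and fragment counts are made up.

```python
def peak_temp_space(n_fragments, frag_size, incremental):
    """Peak temporary disk space while compacting a run of fragments.

    incremental=True deletes each input fragment as soon as its data has
    been merged into the output (the hybrid/run approach);
    incremental=False deletes the inputs only when the whole compaction
    finishes, as with one huge monolithic sstable.
    """
    extra = peak = 0
    for _ in range(n_fragments):
        extra += frag_size       # one more output fragment written
        peak = max(peak, extra)
        if incremental:
            extra -= frag_size   # matching input fragment removed
    return peak

# Ten 160 MB fragments: monolithic needs ~the full output as extra space,
# the fragmented run needs only ~one fragment in flight.
print(peak_temp_space(10, 160, incremental=False))  # → 1600
print(peak_temp_space(10, 160, incremental=True))   # → 160
```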
34. Strategy #3: Hybrid Compaction
▪ Solve 4x worst-case in overwrite workloads with other techniques:
o Compact fewer sstables if disk is getting full
• Not a risk because small temporary disk needs
o Compact fewer sstables if they have large overlaps
35. Hybrid compaction - amplification
▪ Space amplification:
o Small constant temporary space needs, even smaller than LCS
(M*S per parallel compaction, e.g., M=4, S=160 MB)
o Overwrite-mostly still a worst-case, but 2-fold instead of 5-fold
o Optimal.
▪ Write amplification:
o O(logN), small constant — same as Size-Tiered compaction
▪ Read amplification:
o Like Size-Tiered, at worst O(logN) if updating the same partitions
36. Example 1, with Hybrid compaction strategy
37. Example 2 - overwrite workload
▪ Write the same 4 million partitions 15 times
o cassandra-stress write n=4000000 -pop seq=1..4000000 -schema
"replication(strategy=org.apache.cassandra.locator.SimpleStrategy,factor=1)"
o In this test cassandra-stress not rate limited
o Again, small (10MB) LCS tables
▪ Necessary amount of sstable data: 1.2 GB
▪ STCS space amplification: x7.7 !
▪ LCS space amplification lower, constant multiple of sstable size
▪ Hybrid will be around x2
38. Example 2
(Graph: x7.7 space amplification)
39. Example 3 - read+updates workload
▪ When workloads are read-mostly, read amplification is important
▪ When workloads also have updates to existing partitions
o With STCS, each partition ends up in multiple sstables
o Read amplification
▪ An example to simulate this:
o Do a write-only update workload
• cassandra-stress write n=4000000 -pop seq=1..1000000
o Now run a read-only workload
• cassandra-stress read n=1000000 -pop seq=1..1000000
• measure avg. number of read bytes per request
40. Example 3 - read+updates workload
▪ Size-tiered: 46,915 bytes read per request
o Optimal after major compaction - 11,979
▪ Leveled: 11,982
o Equal to optimal because in this case all sstables fit in L1...
▪ Increasing the number of partitions 8-fold:
o Size-tiered: 29,794 luckier this time
o Leveled: 16,713 unlucky (0.5 of data, not 0.9, in L2)
▪ BUT: Remember that if we have a non-negligible amount of writes,
LCS write amplification may slow down reads
41. Example 3, and major compaction
▪ We saw that size-tiered major compaction reduces read
amplification
▪ It also reduces space amplification (expired/overwritten data)
▪ Major compaction only makes sense if very few writes
o But in that case, LCS’s write amplification is not a problem!
o So LCS is recommended instead of major compaction
• Easier to use
• No huge operations like major compaction (need to find when to run)
• No 50%-free-disk worst-case requirement
• Good read amplification and space amplification
42. Why major compaction? Is it suboptimal? (from STCS perspective)
▪ STCS is quite inefficient / slow at getting rid of obsolete data
(droppable tombstones, shadowed data).
o For droppable tombstones, there’s tombstone compaction. Suboptimal though.
o For shadowed (overwritten) data, there’s nothing to do but wait for the new
and obsolete data to reach the same tier and be compacted together.
43. Tombstone compaction
▪ Triggered when standard compaction has nothing to do
▪ Tombstone compaction selects an sstable whose percentage of droppable
tombstones is higher than N% and hopes space will be released.
▪ That’s suboptimal though…
▪ A tombstone cannot be purged unless it is compacted with the data it
deletes/shadows.
▪ CASSANDRA-7019 suggests improving the feature by compacting an
sstable with older overlapping sstables. That will be inefficient
with STCS though. What can we do instead?
44. Making improved tombstone compaction efficient with hybrid
▪ Hybrid can choose a fragment from high tiers and compact it with
all overlapping fragments from sstable runs of same tier or above.
▪ All sstable run(s) involved will have their (often only one) fragment
replaced by another with: (LIVE DATA) – (SHADOWED DATA) –
(DROPPABLE TOMBSTONES)
▪ Temporary space requirement of N * fragment size, N = number of
fragments involved
▪ Make it optional for regular scenarios but use it if running out of
disk space.
45. Hybrid tombstone compaction - Example
(Diagram: sstable runs and their fragments; choose an sstable run fragment with N% of droppable tombstones)
46. Hybrid tombstone compaction - Example
(Diagram: include *older* fragment(s) that overlap with the one previously chosen)
47. Hybrid tombstone compaction - Example
(Diagram: replace fragments by ones without shadowed data and droppable tombstones)
48. Making hybrid take action when lots of duplicate data wastes disk space
▪ Compact fewer tables of the same tier if they contain lots of duplicate
data. This affects only overwrite-intensive workloads.
▪ Cardinality information may help us estimate duplication between tables.
It works only at the partition level though…
▪ Nadav came up with the idea of doing a compaction sample to help with
estimation at the clustering level. This works thanks to the murmur
tokenizer.
▪ In the worst case (running out of space), Hybrid can afford to compact the
biggest tiers together to get rid of all obsolete data with a low
temporary space requirement.
49. Conclusion on this hybrid strategy topic
▪ The goal is to have hybrid do the cleanup job itself rather than relying
on a sysadmin to run a manual major compaction at intervals.
▪ Hybrid can make smart decisions due to its nature: non-aggressive,
incremental steps towards improving space amplification without
hurting system performance the way major compaction does.
▪ Trying to bring the best of both worlds.
50. Strategy #4: Time-Window Compaction
▪ Introduced in Cassandra 3.0.8, designed for time-series data
▪ Replaces Date-Tiered compaction strategy of Cassandra 2.1
(which is also supported by Scylla, but not recommended)
51. Time-Window compaction strategy (cont.)
In a time-series use case:
▪ Clustering key and write time are correlated
▪ Data is added in time order. Only a few out-of-order writes occur, typically
rearranged by just a few seconds
▪ Data is only deleted through expiration (TTL) or by deleting an
entire partition, usually with the same TTL on all the data
▪ The rate at which data is written is nearly constant
▪ A query is a clustering-key range query on a given partition
Most common query: "values from the last hour/day/week"
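Bucketing sstables by time window can be sketched as follows. This is illustrative only: in real TWCS the bucket is chosen by the sstable’s maximum data timestamp, and the window size should match the typical query range.

```python
def time_buckets(sstables, window_seconds):
    """Group sstables into TWCS-style time buckets.

    sstables: list of (name, max_timestamp) pairs, timestamps in seconds.
    Returns a dict mapping window start -> sstable names. Size-tiered
    compaction then runs inside each bucket, and a bucket that leaves the
    current window is major-compacted down to one sstable.
    """
    buckets = {}
    for name, ts in sstables:
        window = ts - (ts % window_seconds)   # align to the window start
        buckets.setdefault(window, []).append(name)
    return buckets

day = 86400
tables = [("a", 10), ("b", 20), ("c", day + 5)]
print(time_buckets(tables, day))  # → {0: ['a', 'b'], 86400: ['c']}
```

Whole-bucket expiry is what makes TTL data cheap here: once every row in a bucket’s sstable has expired, the file can simply be deleted.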
52. Time-Window compaction strategy (cont.)
▪ Scylla remembers in memory the minimum and maximum
clustering key in each newly-flushed sstable
o Efficiently find only the sstables with data relevant to a query
▪ Other compaction strategies
o Destroy this feature by merging “old” and “new” sstables
o Move all rows of a partition to the same sstable…
• But time series queries don’t need all rows of a partition, just rows in a
given time range
• Makes it impossible to drop old sstables wholesale when everything in
them has expired
• Read and write amplification (needless compactions)
53. Time-Window compaction strategy (cont.)
So TWCS:
▪ Divides time into “time windows”
o E.g., if typical query asks for 1 day of data, choose a time window of 1 day
▪ Divide sstables into time buckets, according to time window
▪ Compact using Size-Tiered strategy inside each time bucket
o If the 2-day old window has just one big sstable and a repair creates an
additional tiny “old” sstable, the two will not get compacted
o A tradeoff: slows read but avoids the write amplification problem of DTCS
▪ When a time bucket exits the current window, do a major compaction
o Except for small repair-produced sstables, we get 1 sstable per time window
54. Summary
Workload                          | Size-Tiered                    | Leveled                       | Hybrid                                    | Time-Window
Write-only                        | 2x peak space                  | 2x writes                     | Best                                      | -
Overwrite                         | Huge peak space                | write amplification           | high peak space, but not like size-tiered | -
Read-mostly, few updates          | read amplification             | Best                          | read amplification                        | -
Read-mostly, but a lot of updates | read and space amplification   | write amplification           | read amplification                        | -
                                  |                                | may overwhelm                 |                                           |
Time series                       | write, read, and space ampl.   | write and space amplification | write and read amplification              | Best
55. THANK YOU
nyh@scylladb.com
Please stay in touch
Any questions?