Compaction is a crucial component for preventing storage consumption from exploding. In this session, we’ll talk about why compaction is required and its principles of operation, the main compaction strategies available for use, when they should be used, and how they can be configured. Finally, we’ll present new compaction features recently introduced in ScyllaDB Enterprise and ScyllaDB Cloud.
Presentation on Scylla's and Cassandra's compaction, why it is needed and how it works, and the different compaction strategies: their strengths and weaknesses, and the different types of "amplification" and how to use them to reason about the different compaction strategies. And finally, what Scylla does better than Cassandra in this area. These slides were presented at a meetup in Tel-Aviv, a joint meetup of the following two groups:
https://www.meetup.com/Israel-Cassandra-Users/events/259322355/
https://www.meetup.com/Big-things-are-happening-here/events/259495379/
Scaling ScyllaDB Storage Engine with State-of-Art Compaction (ScyllaDB)
The document discusses techniques for optimizing the performance of log-structured merge trees (LSM trees), which are commonly used as the storage engine in databases. It describes the basic in-place and out-of-place update approaches, and focuses on LSM trees. Key topics covered include compaction strategies like leveled, tiered, and hybrid approaches; techniques like file partitioning to reduce disk usage and improve concurrency; and efficient handling of deletes using tombstone recycling. The goal is to squeeze the most performance out of LSM tree storage engines by applying state-of-the-art techniques.
TechTalk: Reduce Your Storage Footprint with a Revolutionary New Compaction S... (ScyllaDB)
Compaction. A necessary reality in databases with immutable table designs. To date, Scylla and Cassandra compaction strategies for SSTables have had tradeoffs. For example, size-tiered compaction strategy requires leaving 50% of your total drive space unused in order to compact large tables.
What if there was a new, better, more efficient way to handle compactions in Scylla? One that allows you to use your storage much more efficiently? Enter Scylla’s unique Incremental Compaction Strategy (ICS).
Join us for a comparison of common compaction strategies and a technical deep dive into ICS. You’ll learn why ICS will become the new standard for compaction, including an overview of how much disk space you can save with ICS.
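The 50% free-space rule of thumb above follows from simple arithmetic: a size-tiered merge writes its combined output before any input SSTable can be deleted, so inputs and output coexist on disk. A minimal sketch (illustrative only, not ScyllaDB's actual space accounting):

```python
# Worst-case transient disk usage for a size-tiered compaction.
# Illustrative sketch: assumes no overwrites or tombstones are dropped,
# so the merged output is as large as the inputs combined.

def stcs_peak_disk(sstable_sizes_gb):
    """Peak disk usage (GB) while compacting all inputs into one output."""
    inputs = sum(sstable_sizes_gb)
    output = inputs          # worst case: nothing is deduplicated away
    return inputs + output   # inputs are removed only after the merge finishes

# Four 250 GB SSTables hold 1 TB of data but need up to 2 TB during the
# merge, which is why roughly half the drive must stay free.
print(stcs_peak_disk([250, 250, 250, 250]))  # 2000
```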
How Incremental Compaction Reduces Your Storage Footprint (ScyllaDB)
What if there was a new, better, more efficient way to handle compactions in Scylla? One that allows you to use your storage much more efficiently? Enter Scylla’s unique Incremental Compaction Strategy (ICS). Get a comparison of common compaction strategies and a technical deep dive into ICS. You’ll learn why ICS will become the new standard for compaction, including an overview of how much disk space you can save with ICS.
The Wikimedia Foundation is a non-profit and charitable organization driven by a vision of a world where every human can freely share in the sum of all knowledge. Each month Wikimedia sites serve over 18 billion page views to 500 million unique visitors around the world.
Among the many resources offered by Wikimedia is a public-facing API that provides low-latency, programmatic access to full-history content and meta-data, in a variety of formats. Commonly, results from this system are the product of computationally intensive transformations, and must be pre-generated and persisted to meet latency expectations. Unsurprisingly, there are numerous challenges to providing low-latency storage of such a massive data-set, in a demanding, globally distributed environment.
This talk covers the Wikimedia Content API and its use of Apache Cassandra, a massively-scalable distributed database, as storage for a diverse and growing set of use-cases. Trials, tribulations, and triumphs, of both a development and operational nature, are discussed.
Scylla Summit 2017: How to Ruin Your Workload's Performance by Choosing the W... (ScyllaDB)
In my talk, I will present the different compaction strategies that Scylla provides, and demonstrate when it is appropriate and when it is inappropriate to use each one. I will then present a new compaction strategy that we designed as a lesson from the existing compaction strategies by picking the best features of the existing strategies while avoiding their problems.
This document discusses scaling MySQL databases in Amazon Web Services. It provides an overview of using Amazon RDS versus managing MySQL databases on EC2 instances. While RDS offers ease of use, it has higher costs and less flexibility. The document recommends using EC2 for high performance or flexible setups, and automating database provisioning, backups, and failover. It also discusses sharding databases across multiple instances, using replication and multiple availability zones for resiliency, and tools for monitoring and operations visibility.
This document discusses options for running MySQL in AWS. It describes using Amazon RDS, where AWS manages the infrastructure and MySQL version, but has limitations like lack of root access. It also describes using EC2, where one provisions and manages their own instances, storage, and MySQL binaries, allowing more flexibility but also more management overhead. Key tradeoffs discussed are ease of use vs customization options and control in RDS vs EC2.
Slides from a brief talk I gave at the local JUG, javaBin. It's about our experiences using Cassandra in a production environment, with some philosophizing here and there.
Webinar: Using Control Theory to Keep Compactions Under Control (ScyllaDB)
As data is ingested into a database, it must be constantly rewritten for easy querying. Scylla writes incoming data to immutable files that must later be compacted into fewer files in order to maintain good read performance. The question becomes how fast should you compact? The traditional approach is to expose throughput tunables so the user can control the compaction speed. That means finding a good value involves a lot of trial and error. And what if the workload changes?
We take a different approach at ScyllaDB. We use the mathematical foundation of control theory to make automatic decisions about compactions, putting an end to compaction tuning altogether.
Watch this webinar to learn:
- How we created mathematical models of compaction backlog
- How to use that model to feed a control theory framework that can automatically tune compactions
- Other exciting developments that are coming in this area
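As a rough illustration of the control-theory idea, a backlog measure can be mapped to scheduler shares with a simple proportional controller: the larger the backlog, the more bandwidth compaction gets. This sketch is hypothetical; it is not ScyllaDB's actual controller, and the gain `k` and clamp values are invented for the example:

```python
# Proportional controller sketch: compaction shares grow linearly with
# the measured backlog, clamped to a sane range, so a bursty ingest
# automatically earns more compaction bandwidth with no manual tuning.

def compaction_shares(backlog_gb, k=1.0, min_shares=1.0, max_shares=1000.0):
    """Map a compaction backlog (GB) to scheduler shares: k * backlog,
    clamped between min_shares and max_shares."""
    return max(min_shares, min(max_shares, k * backlog_gb))

print(compaction_shares(0))     # 1.0    (idle: minimum shares)
print(compaction_shares(200))   # 200.0  (proportional region)
print(compaction_shares(5000))  # 1000.0 (saturated at the clamp)
```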
MapReduce - Basics | Big Data Hadoop Spark Tutorial (CloudxLab)
Big Data with Hadoop & Spark Training: http://bit.ly/2skCodH
This CloudxLab Understanding MapReduce tutorial helps you to understand MapReduce in detail. Below are the topics covered in this tutorial:
1) Thinking in Map / Reduce
2) Understanding Unix Pipeline
3) Examples to understand MapReduce
4) Merging
5) Mappers & Reducers
6) Mapper Example
7) Input Split
8) mapper() & reducer() Code
9) Example - Count number of words in a file using MapReduce
10) Example - Compute Max Temperature using MapReduce
11) Hands-on - Count number of words in a file using MapReduce on CloudxLab
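The word-count example from the list above can be sketched in plain Python, with no Hadoop involved: map emits (word, 1) pairs, a shuffle groups them by key, and reduce sums each group.

```python
# Toy word count in the MapReduce style (plain Python, no Hadoop).
from collections import defaultdict

def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in the line."""
    for word in line.split():
        yield word.lower(), 1

def reducer(word, counts):
    """Reduce phase: sum all the 1s emitted for a given word."""
    return word, sum(counts)

def word_count(lines):
    groups = defaultdict(list)          # shuffle: group values by key
    for line in lines:
        for word, one in mapper(line):
            groups[word].append(one)
    return dict(reducer(w, c) for w, c in groups.items())

print(word_count(["the quick fox", "the lazy dog"]))
# {'the': 2, 'quick': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```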
This document discusses monitoring Cassandra, including an overview of Cassandra, its internal concepts like read/write paths and compactions, and important metrics to monitor. Key metrics to monitor Cassandra's performance include read/write latency, live SSTable count, thread pool pending/completed tasks, and memtable flush count. Operations like compactions and hinted handoff replication should also be monitored. Resource usage metrics like JVM garbage collection time and memory usage are important to monitor as well. Monitoring these metrics helps detect anomalies, optimize performance, and ensure Cassandra's successful operation over the long run.
This document provides performance optimization tips for Hadoop jobs, including recommendations around compression, speculative execution, number of maps/reducers, block size, sort size, JVM tuning, and more. It suggests how to configure properties like mapred.compress.map.output, mapred.map/reduce.tasks.speculative.execution, and dfs.block.size based on factors like cluster size, job characteristics, and data size. It also identifies antipatterns to avoid like processing thousands of small files or using many maps with very short runtimes.
This document provides an overview of how Bloomberg uses Ceph and OpenStack in its cloud infrastructure. Some key points:
- Bloomberg uses Ceph for object storage with RGW and block storage with RBD. It uses OpenStack for compute functions.
- Initially Bloomberg had a fully converged architecture with Ceph and OpenStack on the same nodes, but this caused performance issues.
- Bloomberg now uses a semi-converged "POD" architecture with dedicated Ceph and OpenStack nodes in separate clusters for better scalability and performance.
- Ephemeral storage provides faster performance than Ceph but lacks data integrity protections. Ceph offers replication and reliability at the cost of some latency.
- Automation with Chef
The document provides tips for tuning DB2 database performance for non-DBAs. It discusses checking hardware configuration, optimizing memory allocation for the sort heap, shared sort, and buffer pool based on database size, leveraging multiple CPUs, designing tables and indexes efficiently, and gathering statistics on commonly used columns. Specific steps are provided to set configuration parameters like DB2_PARALLEL_IO, SHEAPTHRES_SHR, and SORTHEAP, and to alter buffer pools and indexes. Overall, following these tips can help optimize a DB2 database for better query performance.
AWS Activate webinar - Scalable databases for fast growing startups (Amazon Web Services)
Fast growing startups building high scale applications demand a lot from their infrastructure and in particular from their databases. Often, databases become the bottleneck of the startups’ technology stack, with the risk of inhibiting fast growth as they are not easy to set up, operate and scale in the cloud. This webinar focuses on how to build scalable databases in the Cloud and covers how to effectively combine the use of relational, NoSQL, and even data warehouse databases, which have become a reality for startups with the launch of Amazon Redshift.
Key takeaways:
Understand the trade-off between SQL and NoSQL and when to go for a hybrid model.
Best practices in setting up your database in the AWS cloud whether using managed services or managing it yourself.
Learn how to minimize the costs of your database with the right architecture and pricing models.
Who should attend:
DBAs
Startup CTOs
Developers
Engineers
Architects
Growth Hackers
Modeling Data and Queries for Wide Column NoSQL (ScyllaDB)
Discover how to model data for wide column databases such as ScyllaDB and Apache Cassandra. Contrast the difference from traditional RDBMS data modeling, going from a normalized “schema first” design to a denormalized “query first” design. Plus how to use advanced features like secondary indexes and materialized views to use the same base table to get the answers you need.
Scylla Summit 2022: Scylla 5.0 New Features, Part 2 (ScyllaDB)
Scylla 5.0 introduces several new features to improve node operations and compaction:
1. Repair-based node operations (RBNO) provide more efficient, consistent, and simplified bootstrap, replace, rebuild, and other node operations by using row-level repair as the underlying mechanism instead of streaming.
2. Off-strategy compaction keeps sstables generated during node operations in a separate data set and compacts them together after the operation finishes for less compaction work and faster completion.
3. Space amplification goal (SAG) for compaction optimizes space efficiency for overwrite workloads by dynamically adapting compaction to meet latency and space goals, improving storage density.
VMworld 2013: Just Because You Could, Doesn't Mean You Should: Lessons Learne... (VMworld)
This document provides an overview and best practices for storage technologies. It discusses factors that affect storage performance like interconnect bandwidth versus IOPS and command sizing. It covers tiering strategies and when auto-tiering may not be effective. It also discusses SSDs versus spinning disks, large VMDK and VMFS support, thin provisioning at the VM and LUN level, and architecting storage for failure including individual component failure, temporary and permanent site loss. It provides examples of how to implement a low-cost disaster recovery site using inexpensive hardware.
This document discusses optimizing columnar data stores. It begins with an overview of row-oriented versus column-oriented data stores, noting that column stores are well-suited for read-heavy analytical loads as they only need to read relevant data. The document then covers the history of columnar stores and notable features like data encoding, compression, and lazy decompression. It provides examples of run length and dictionary encoding. The document also discusses columnar file formats like RCFile, ORC, and Parquet, providing more details on ORC. It concludes with a case study where optimizations to a petabyte-scale data warehouse including sorting, changed compression, and other configuration changes improved query performance significantly through reduced data size.
This document discusses optimizing columnar data stores. It begins with an overview of row-oriented versus column-oriented data stores, noting that column stores are well-suited for read-heavy analytical loads as they only need to read relevant data. The document then covers the history of columnar stores and notable features like data encoding, compression techniques like run-length encoding, and lazy decompression. Specific columnar file formats like RCFile, ORC, and Parquet are mentioned. The document concludes with a case study describing optimizations made to a 1PB Hive table that resulted in a 3x query performance improvement through techniques like explicit sorting, improved compression, increased bucketing, and stripe size tuning.
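The run-length and dictionary encodings mentioned in both summaries can be sketched in a few lines (illustrative only; real formats like ORC and Parquet use more elaborate layouts):

```python
# Two classic columnar encodings, sketched on plain Python lists.

def rle_encode(column):
    """Run-length encoding: collapse runs of repeated values
    into (value, run_length) pairs."""
    runs = []
    for v in column:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs

def dict_encode(column):
    """Dictionary encoding: replace each value with a small integer
    code, plus a dictionary mapping values to codes."""
    codes, dictionary = [], {}
    for v in column:
        codes.append(dictionary.setdefault(v, len(dictionary)))
    return codes, dictionary

print(rle_encode(["US", "US", "US", "DE", "DE"]))  # [('US', 3), ('DE', 2)]
print(dict_encode(["US", "US", "DE", "US"]))       # ([0, 0, 1, 0], {'US': 0, 'DE': 1})
```

Both shrink low-cardinality columns dramatically, which is why sorting a column before writing (as in the case study above) can improve compression so much: sorting creates longer runs.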
The document discusses various techniques for performance tuning and cluster administration in HBase, including garbage collection tuning, use of memstore-local allocation buffers (MSLAB), enabling compression, optimizing splits and compactions through pre-splitting regions, and addressing hotspotting through manual splits. It provides guidance on configuring garbage collection, compression codecs, and approaches for managing splits and compactions to reduce disk I/O loads.
Database Performance at Scale Masterclass: Workload Characteristics by Felipe... (ScyllaDB)
Felipe Cardeneti Mendes, Solutions Architect at ScyllaDB
Navigating workload-specific performance challenges and tradeoffs.
Felipe Mendes covers how to navigate the top performance challenges and tradeoffs that you’re likely to face with your project’s specific workload characteristics and technical/business requirements.
Laine Campbell, CEO of Blackbird, will explain the options for running MySQL at high volumes at Amazon Web Services, exploring options around database as a service, hosted instances/storage and all appropriate availability, performance and provisioning considerations using real-world examples from Call of Duty, Obama for America and many more. Laine will show how to build highly available, manageable and performant MySQL environments that scale in AWS: how to maintain them, grow them and deal with failure. Some of the specific topics covered are:
* Overview of RDS and EC2 – pros, cons and usage patterns/antipatterns.
* Implementation choices in both offerings: instance sizing, ephemeral SSDs, EBS, provisioned IOPS and advanced techniques (RAID, mixed storage environments, etc…)
* Leveraging regions and availability zones for availability, business continuity and disaster recovery.
* Scaling patterns including read/write splitting, read distribution, functional dataset partitioning and horizontal dataset partitioning (aka sharding)
* Common failure modes – AZ and Region failures, EBS corruption, EBS performance inconsistencies and more.
* Managing and mitigating cost with various instance and storage options
Jvm & Garbage collection tuning for low latencies application (Quentin Ambard)
G1, CMS, Shenandoah, or Zing? Heap size at 8GB or 31GB? Compressed pointers? Region size? What is the maximum pause time? Throughput or latency... what gain? MaxGCPauseMillis, G1HeapRegionSize, MaxTenuringThreshold, UnlockExperimentalVMOptions, ParallelGCThreads, InitiatingHeapOccupancyPercent, G1RSetUpdatingPauseTimePercent: which parameters have the most impact?
Unconventional Methods to Identify Bottlenecks in Low-Latency and High-Throug... (ScyllaDB)
In this presentation, we explore how standard profiling and monitoring methods may fall short in identifying bottlenecks in low-latency data ingestion workflows. Instead, we showcase the power of simple yet clever methods that can uncover hidden performance limitations.
Attendees will discover unconventional techniques, including clever logging, targeted instrumentation, and specialized metrics, to pinpoint bottlenecks accurately. Real-world use cases will be presented to demonstrate the effectiveness of these methods. By the end of the session, attendees will be equipped with alternative approaches to identify bottlenecks and optimize their low-latency data ingestion workflows for high throughput.
Mitigating the Impact of State Management in Cloud Stream Processing Systems (ScyllaDB)
Stream processing is a crucial component of modern data infrastructure, but constructing an efficient and scalable stream processing system can be challenging. Decoupling compute and storage architecture has emerged as an effective solution to these challenges, but it can introduce high latency issues, especially when dealing with complex continuous queries that necessitate managing extra-large internal states.
In this talk, we focus on addressing the high latency issues associated with S3 storage in stream processing systems that employ a decoupled compute and storage architecture. We delve into the root causes of latency in this context and explore various techniques to minimize the impact of S3 latency on stream processing performance. Our proposed approach is to implement a tiered storage mechanism that leverages a blend of high-performance and low-cost storage tiers to reduce data movement between the compute and storage layers while maintaining efficient processing.
Throughout the talk, we will present experimental results that demonstrate the effectiveness of our approach in mitigating the impact of S3 latency on stream processing. By the end of the talk, attendees will have gained insights into how to optimize their stream processing systems for reduced latency and improved cost-efficiency.
More Related Content
Similar to Balancing Compaction Principles and Practices
Slides from a brief talk I gave at the local JUG, javaBin. It's about our experiences using Cassandra in a production environment, with some philosophizing here and there.
Webinar: Using Control Theory to Keep Compactions Under ControlScyllaDB
As data is ingested into a database, it must be constantly rewritten for easy querying. Scylla writes incoming data to immutable files that must later be compacted into fewer files in order to maintain good read performance. The question becomes how fast should you compact? The traditional approach is to expose throughput tunables so the user can control the compaction speed. That means finding a good value involves a lot of trial and error. And what if the workload changes?
We take a different approach at ScyllaDB. We use the mathematical foundation of control theory to make automatic decisions about compactions, putting an end to compaction tuning altogether.
Watch this webinar to learn:
- How we created mathematical models of compaction backlog
- How to use that model to feed a control theory framework that can automatically tune compactions.
- Other exciting developments that are coming in this area
MapReduce - Basics | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
Big Data with Hadoop & Spark Training: http://bit.ly/2skCodH
This CloudxLab Understanding MapReduce tutorial helps you to understand MapReduce in detail. Below are the topics covered in this tutorial:
1) Thinking in Map / Reduce
2) Understanding Unix Pipeline
3) Examples to understand MapReduce
4) Merging
5) Mappers & Reducers
6) Mapper Example
7) Input Split
8) mapper() & reducer() Code
9) Example - Count number of words in a file using MapReduce
10) Example - Compute Max Temperature using MapReduce
11) Hands-on - Count number of words in a file using MapReduce on CloudxLab
This document discusses monitoring Cassandra, including an overview of Cassandra, its internal concepts like read/write paths and compactions, and important metrics to monitor. Key metrics to monitor Cassandra's performance include read/write latency, live SSTable count, thread pool pending/completed tasks, and memtable flush count. Operations like compactions and hinted handoff replication should also be monitored. Resource usage metrics like JVM garbage collection time and memory usage are important to monitor as well. Monitoring these metrics helps detect anomalies, optimize performance, and ensure Cassandra's successful operation over the long run.
This document provides performance optimization tips for Hadoop jobs, including recommendations around compression, speculative execution, number of maps/reducers, block size, sort size, JVM tuning, and more. It suggests how to configure properties like mapred.compress.map.output, mapred.map/reduce.tasks.speculative.execution, and dfs.block.size based on factors like cluster size, job characteristics, and data size. It also identifies antipatterns to avoid like processing thousands of small files or using many maps with very short runtimes.
This document provides an overview of how Bloomberg uses Ceph and OpenStack in its cloud infrastructure. Some key points:
- Bloomberg uses Ceph for object storage with RGW and block storage with RBD. It uses OpenStack for compute functions.
- Initially Bloomberg had a fully converged architecture with Ceph and OpenStack on the same nodes, but this caused performance issues.
- Bloomberg now uses a semi-converged "POD" architecture with dedicated Ceph and OpenStack nodes in separate clusters for better scalability and performance.
- Ephemeral storage provides faster performance than Ceph but lacks data integrity protections. Ceph offers replication and reliability at the cost of some latency.
- Automation with Chef
The document provides tips for tuning DB2 database performance for non-DBAs. It discusses checking hardware configuration, optimizing memory allocation for sort heap, share sort and buffer pool based on database size, leveraging multiple CPUs, designing tables and indexes efficiently, and gathering statistics on commonly used columns. Specific steps are provided to set configuration parameters like DB2_PARALLEL_IO, SHEATHRES_SHR, SORTHEAP, and altering buffer pools and indexes. Overall, following these tips can help optimize a DB2 database for better query performance.
AWS Activate webinar - Scalable databases for fast growing startupsAmazon Web Services
Fast growing startups building high scale applications demand a lot from their infrastructure and in particular from their databases. Often, databases become the bottleneck of the startups’ technology stack, with the risk of inhibiting fast growth as they are not easy to set up, operate and scale in the cloud. This webinar focuses on how to build scalable databases in the Cloud and covers how to effectively combine the use of relational, NoSQL, and even data warehouse databases, which have become a reality for startups with the launch of Amazon Redshift.
Key takeaways:
Understand the trade-off between SQL and NoSQL and when to go for a hybrid model.
Best practices in setting up your database in the AWS cloud whether using managed services or managing it yourself.
Learn how to minimize the costs of your database with the right architecture and pricing models.
Who should attend:
DBA’s
Startup CTO’s
Developers
Engineers
Architects
Growth Hackers
Modeling Data and Queries for Wide Column NoSQLScyllaDB
Discover how to model data for wide column databases such as ScyllaDB and Apache Cassandra. Contrast the differerence from traditional RDBMS data modeling, going from a normalized “schema first” design to a denormalized “query first” design. Plus how to use advanced features like secondary indexes and materialized views to use the same base table to get the answers you need.
Scylla Summit 2022: Scylla 5.0 New Features, Part 2ScyllaDB
Scylla 5.0 introduces several new features to improve node operations and compaction:
1. Repair-based node operations (RBNO) provide more efficient, consistent, and simplified bootstrap, replace, rebuild, and other node operations by using row-level repair as the underlying mechanism instead of streaming.
2. Off-strategy compaction keeps sstables generated during node operations in a separate data set and compacts them together after the operation finishes for less compaction work and faster completion.
3. Space amplification goal (SAG) for compaction optimizes space efficiency for overwrite workloads by dynamically adapting compaction to meet latency and space goals, improving storage density.
2. Raphael Carvalho
■ Syslinux (bootloader)
■ OSv (unikernel)
■ Seastar (ScyllaDB’s heart)
■ ScyllaDB (the best db in the world)
3. Presentation Agenda
■ LSM tree
■ What is Compaction? Why is it Needed?
■ Read, Write and Space Amplification
■ Different Compaction Strategies
■ When to Use Each One
■ SAG in ICS
■ Time Series Compaction Strategy (TWCS)
4.–9. What is LSM-tree Compaction?
LSM storage engine’s write path (animated diagram built up across these slides; only its labels survive extraction): Writes, commit log, compaction.
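The write path above can be sketched in a few lines of Python. This is a toy model for illustration only; the class, method, and parameter names are my own, not ScyllaDB's:

```python
# Minimal LSM write-path sketch: writes go to a commit log (durability)
# and an in-memory memtable; when the memtable fills, it is flushed as a
# sorted, immutable sstable.

class LSMEngine:
    def __init__(self, memtable_limit=3):
        self.commit_log = []
        self.memtable = {}
        self.sstables = []              # list of sorted, immutable snapshots
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))   # durability first
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Flush the memtable as one sorted sstable, then start fresh
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable.clear()

    def read(self, key):
        if key in self.memtable:        # immediately readable
            return self.memtable[key]
        for sstable in reversed(self.sstables):  # newest sstable first
            if key in sstable:
                return sstable[key]
        return None

db = LSMEngine()
for i in range(7):
    db.write(f"k{i}", i)
print(len(db.sstables), db.read("k1"))  # 2 sstables; old value still readable
```

Reads consult the memtable first and then the sstables from newest to oldest, which is why an ever-growing sstable count (the next slides' topic) hurts read performance.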
10. What is Compaction? (cont.)
■ This technique of keeping sorted files and merging them is well known
and often called a Log-Structured Merge (LSM) Tree
■ Published in 1996; the earliest popular application that I know of is the
Lucene search engine, 1999
■ High write performance.
■ Immediately readable.
■ Reasonable read performance.
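The merge step at the heart of compaction can be sketched as follows (illustrative toy code, not ScyllaDB's implementation): sorted sstables are merged into one, and when a key appears in several inputs the newest version wins.

```python
import heapq

def compact(sstables):
    """Merge sorted sstables (oldest first); the newest version of a key wins."""
    def tagged(run, seq):
        for key, value in run:
            yield key, seq, value             # seq breaks ties: newer sorts later

    merged = heapq.merge(*(tagged(run, seq) for seq, run in enumerate(sstables)))
    out = []
    for key, seq, value in merged:
        if out and out[-1][0] == key:
            out[-1] = (key, value)            # newer sstable overwrites the entry
        else:
            out.append((key, value))
    return out

old = [("a", 1), ("b", 2), ("c", 3)]          # older sstable
new = [("b", 20), ("d", 40)]                  # newer sstable
print(compact([old, new]))  # [('a', 1), ('b', 20), ('c', 3), ('d', 40)]
```

Because the inputs are already sorted, the merge is a sequential streaming pass, which is what makes compaction I/O-friendly.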
11. Compaction Strategy
(a.k.a. File picking policy)
■ Which files to Compact, and When?
■ This is called the Compaction Strategy
■ The Goal of the Strategy is Low Amplification:
■ Avoid read requests needing many sstables (Read Amplification)
■ Avoid overwritten/deleted/expired data staying on disk, and avoid excessive temporary disk space needs (scary!) (Space Amplification)
■ Avoid compacting the same data again and again (Write Amplification)
“Which compaction strategy shall I choose?”
12. Read, Write and Space Amplification
Make a choice!
■ This kind of trade-off is well known in distributed databases (much like CAP)
■ The RUM Conjecture states:
■ We cannot design an access method for a storage system that is optimal in all of the following three aspects: Reads, Updates, and Memory.
■ It is impossible to decrease Read, Write & Space Amplification all at once
■ A strategy can, e.g., optimize for Write while sacrificing Read & Space
■ Whereas another can optimize for Space and Read while sacrificing Write
14. Compaction Strategies History
Cassandra and ScyllaDB
■ It starts with the Size-Tiered Compaction Strategy
■ Efficient Write performance
■ Inconsistent Read performance
■ Substantial waste of disk space = bad space amplification (due to slow GC)
■ To fix the Read / Space issues of Tiered Compaction, Leveled Compaction is introduced
■ Fixes the Read & Space issues
■ BUT it introduces a new problem: Write Amplification
15. Strategy #1: Size-Tiered Compaction
■ Cassandra’s oldest and still default Compaction Strategy
■ Dates back to Google’s BigTable paper (2006)
■ Idea used even earlier (e.g., Lucene, 1999)
21. Size-Tiered Compaction - Amplification
■ Write Amplification: O(logN)
■ Where “N” is (data size) / (flushed sstable size)
■ Most data is in the highest tier, having passed through O(logN) tiers
■ This is asymptotically optimal
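The O(logN) claim can be checked with a tiny simulation. This is a simplified model I am assuming for illustration: unit-size flushes, a merge threshold of 4, and no overwrites, so a merged sstable is the sum of its inputs:

```python
def simulate_stcs(flushes, threshold=4):
    """Measure write amplification under a toy size-tiered model."""
    tiers = {}      # sstable size -> number of sstables of that size
    written = 0     # total data written to disk (flushes + compactions)
    for _ in range(flushes):
        size = 1
        written += size                         # the memtable flush itself
        tiers[size] = tiers.get(size, 0) + 1
        # Cascade: whenever a tier reaches the threshold, merge it upward
        while tiers.get(size, 0) >= threshold:
            tiers[size] -= threshold
            size *= threshold                   # merged output size
            written += size                     # compaction rewrites the data
            tiers[size] = tiers.get(size, 0) + 1
    return written / flushes                    # write amplification

# Each datum is rewritten once per tier it climbs: WA ~ log_threshold(N) + 1
print(simulate_stcs(256))   # 5.0  (4 tier climbs + the initial flush)
print(simulate_stcs(4096))  # 7.0  (6 tier climbs + the initial flush)
```

Quadrupling the data adds one tier, so the write amplification grows only logarithmically with N.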
22. Size-Tiered Compaction - Amplification
What is Read Amplification? O(logN) sstables, but:
■ If workload writes a partition once and never modifies it:
■ Eventually each partition’s data will be compacted into one sstable
■ In-memory bloom filter will usually allow reading only one sstable
■ Optimal
■ But if workload continues to update a partition:
■ All sstables will contain updates to the same partition
■ O(logN) reads per read request
■ Reasonable, but not great
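The bloom filter's role can be illustrated with a minimal sketch (a hypothetical, far simpler implementation than a real one; the class and sizes are mine): each sstable keeps one, and a read skips every sstable whose filter says the partition is definitely absent.

```python
import hashlib

class BloomFilter:
    def __init__(self, bits=1024, hashes=3):
        self.bits, self.hashes = bits, hashes
        self.array = 0                      # bit array packed into one int

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.bits

    def add(self, key):
        for pos in self._positions(key):
            self.array |= 1 << pos

    def might_contain(self, key):
        # May return a false positive, but never a false negative
        return all(self.array & (1 << pos) for pos in self._positions(key))

# One filter per sstable: a read only touches sstables whose filter matches.
sstable_filters = [BloomFilter() for _ in range(4)]
sstable_filters[2].add("partition-42")
sstables_read = sum(f.might_contain("partition-42") for f in sstable_filters)
print(sstables_read)  # 1: only the sstable that holds the partition is read
```

This is why the write-once workload is optimal: even with O(logN) sstables on disk, the filters usually narrow a read down to the single sstable that holds the partition.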
24. Strategy #2: Leveled Compaction
■ Introduced in Cassandra 1.0, in 2011
■ Based on Google’s LevelDB (itself based on Google’s BigTable)
■ No longer has size-tiered’s huge sstables
■ Instead we have runs:
■ A run is a collection of small (160 MB by default) SSTables
■ with non-overlapping key ranges
■ A huge SSTable must be rewritten as a whole, but in a run we can modify only parts of it (individual sstables) while keeping the disjoint-key requirement
26. Leveled Compaction Strategy
[Diagram: Level 0; Level 1 (run of 10 sstables); Level 2 (run of 100 sstables); ...]
Compacting level 0 into all sstables from level 1, due to key-range overlapping.
27. Leveled Compaction Strategy
The output is placed into level 1, which may have exceeded its capacity... we may need to compact level 1 into level 2.
28. Leveled Compaction Strategy
Picks one exceeding sstable from level 1 and compacts it with the overlapping sstables in level 2 (about ~10, due to the default fan-out of 10).
29. Leveled Compaction Strategy
The input is removed from level 1 and the output is placed into level 2, without breaking key disjointness in level 2.
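The picking logic of slides 26–29 can be sketched roughly as below. This is a simplified model: the function names, the victim-selection rule (just the first sstable), and the capacity rule are my own assumptions, not ScyllaDB's actual code.

```python
def overlapping(sstable, run):
    """Sstables in `run` whose key range intersects `sstable`'s range."""
    lo, hi = sstable["first_key"], sstable["last_key"]
    return [t for t in run if not (t["last_key"] < lo or hi < t["first_key"])]

def pick_compaction(levels, fan_out=10):
    """Find the first over-capacity level and what to compact out of it."""
    for lvl, run in enumerate(levels):
        if len(run) > fan_out ** lvl:            # level exceeds its target size
            victim = run[0]                      # simplest possible pick
            next_run = levels[lvl + 1] if lvl + 1 < len(levels) else []
            return lvl, victim, overlapping(victim, next_run)
    return None

levels = [
    [{"first_key": 0, "last_key": 100}, {"first_key": 50, "last_key": 150}],  # L0 (may overlap)
    [{"first_key": 0, "last_key": 60}, {"first_key": 61, "last_key": 200}],   # L1 run (disjoint)
]
lvl, victim, targets = pick_compaction(levels)
print(lvl, len(targets))  # 0 2: the L0 sstable overlaps both L1 sstables
```

Note how an L0 sstable (whose range may span anything) drags in every overlapping L1 sstable, while deeper levels only drag in the ~fan-out sstables their victim overlaps.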
30. Leveled Compaction - Amplification
■ Space Amplification:
■ Because level sizes grow by the fan-out of 10, ~90% of the data is in the deepest level (if full!)
■ These sstables do not overlap, so the deepest level cannot hold duplicate data!
■ So at most ~10% of the space is wasted
■ Also, each compaction needs only constant (~12 × 160 MB) temporary space
■ Nearly optimal
31. Leveled Compaction - Amplification
■ Read Amplification:
■ We have O(N) sstables!
■ But in each level the sstables have disjoint ranges (cached in memory)
■ Worst case, O(logN) sstables are relevant to a partition - plus the L0 size
■ Under some assumptions (updating complete rows of similar sizes), the space-amplification argument implies that 90% of reads will need just one sstable!
■ Nearly optimal
33. Example 1 - Write-Heavy Workload
■ Size-tiered compaction:
At some points it needs twice the disk space
■ In ScyllaDB with many shards, the maximum space use is “usually” not concurrent
■ Leveled compaction:
More than double the amount of disk I/O
■ The test used smaller-than-default sstables (10 MB) to illustrate the problem
■ Same problem with the default sstable size (160 MB) - with larger workloads
34. Example 1 (Space Amplification)
[Graph annotations: “constant multiple of flushed memtable & sstable size”; “x2 space amplification”]
35. Example 2 - Overwrite Workload
■ Write the same 4 million partitions 15 times
■ cassandra-stress write n=4000000 -pop seq=1..4000000 -schema "replication(strategy=org.apache.cassandra.locator.SimpleStrategy,factor=1)"
■ In this test cassandra-stress is not rate limited
■ Again, small (10 MB) LCS sstables
■ Necessary amount of sstable data: 1.2 GB
■ STCS space amplification: x7.7 !
■ LCS space amplification is lower: a constant multiple of the sstable size
■ Incremental will be around x2 (if it decides to compact fewer files)
37. Strategy #3: Incremental Compaction
■ Size-tiered Compaction needs temporary space because we only remove a huge SSTable after we have fully compacted it.
■ Let’s split each huge sstable into a run (à la LCS) of “fragments”:
■ Treat the entire run (not individual SSTables) as one file for STCS
■ Remove individual sstables as soon as they are compacted. Low temporary space.
38. Incremental Compaction - Amplification
■ Space Amplification:
■ Small constant temporary space needs, even smaller than LCS
(M*S per parallel compaction, e.g., M=4, S=160 MB)
■ Overwrite-mostly still a worst-case, but 2-fold instead of 5-fold
■ Optimal.
■ Write Amplification:
■ O(logN) with a small constant, the same as Size-Tiered compaction
■ Read Amplification:
■ Like Size-Tiered, at worst O(logN) if updating the same partitions
39. Example 1 - Size Tiered vs Incremental
[Graph: disk usage over time, size-tiered vs. incremental compaction.]
40. Is it Enough?
■ The space overhead problem was efficiently fixed by Incremental (ICS), however…
■ Incremental (ICS) and size-tiered (STCS) strategies share the same space
amplification (~2-4x) when facing overwrite-intensive workloads
■ They cover a similar region in the three-dimensional efficiency space (RUM
trade-offs):
[RUM triangle diagram: READ, WRITE, and SPACE corners; STCS and ICS occupy
a similar region.]
41. Space Amplification Goal (SAG)
■ Leveled and Size-Tiered (or ICS) cover different regions
■ Interesting regions cannot be reached with either strategy alone.
■ But interesting regions can be reached by combining the data layouts of both strategies
[RUM triangle diagram: READ, WRITE, and SPACE corners; STCS/ICS and LCS
occupy different regions, with the hybrid approach in between.]
42. ■ Space Amplification Goal (SAG) is a property controlling the size ratio of
the largest and the second-largest tier
■ It's a value between 1 and 2 (defined in the table's schema). A value of 1.5
triggers Cross-Tier Compaction when the second-largest tier is half the size
of the largest.
■ Effectively, it helps control Space Amplification. It is not a hard upper
bound, but results show that compaction works towards reducing the actual
SA to below the configured value.
■ The lower the SAG value, the lower the SA but the higher the WA. A good
initial value is 1.5; decrease it conservatively from there.
Further on ICS + SAG
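The trigger condition described above can be sketched as follows (illustrative Python only; the actual ScyllaDB heuristics are more involved):

```python
def sag_triggers_cross_tier(largest_tier, second_largest_tier, sag=1.5):
    """With SAG = 1 + f, cross-tier compaction of the two largest tiers
    fires once the second-largest tier reaches a fraction f of the
    largest; e.g. SAG=1.5 fires when it is half the largest tier's size."""
    return second_largest_tier >= (sag - 1.0) * largest_tier

print(sag_triggers_cross_tier(100, 40))  # False: 40 < 0.5 * 100
print(sag_triggers_cross_tier(100, 55))  # True:  55 >= 0.5 * 100
```

Lowering SAG toward 1 makes this condition fire earlier, merging tiers more aggressively: less space amplification at the cost of more write amplification.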
43. ALTER TABLE foo.bar
WITH compaction = {
'class': 'IncrementalCompactionStrategy',
'space_amplification_goal': '1.5'
};
Schema for ICS’ SAG
45. ■ Accumulation of tombstone records is a known problem in LSM trees
■ Makes queries slower
■ Read amplification
■ More CPU work (preserve latest)
■ ICS employs a SAG-like mechanism, but focused on expired data rather than space.
■ Enabled by default; can be controlled with the usual parameters:
■ tombstone_compaction_interval (defaults to 864000 seconds (10 days))
■ tombstone_threshold (defaults to 0.2)
ICS has more efficient GC
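A rough sketch of how these two parameters gate tombstone garbage collection (illustrative Python approximating the documented semantics, not the actual implementation):

```python
def eligible_for_tombstone_compaction(sstable_age_s, droppable_tombstone_ratio,
                                      tombstone_compaction_interval=864000,
                                      tombstone_threshold=0.2):
    """An sstable becomes a tombstone-GC candidate only when it is older
    than the interval AND its estimated droppable-tombstone ratio exceeds
    the threshold (defaults: 10 days, 0.2)."""
    return (sstable_age_s >= tombstone_compaction_interval
            and droppable_tombstone_ratio > tombstone_threshold)

print(eligible_for_tombstone_compaction(5 * 86400, 0.5))    # False: too young
print(eligible_for_tombstone_compaction(12 * 86400, 0.25))  # True
```

Both conditions must hold, which prevents compaction from repeatedly rewriting young sstables whose tombstones cannot be dropped yet.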
46. Strategy #4: Time Window (TWCS)
■ Designed for time series workloads
■ Groups data of similar age together
■ Helps with:
■ Garbage collecting expired data
■ as data with similar age will be expired roughly at the same time
■ Read performance
■ Queries over a time range will find the data in a small number of files
■ Common anti-patterns:
■ Not having every cell TTL'd (the recommendation is to use default_time_to_live)
■ Deletions and overwrites (not well supported; a major compaction is usually needed afterwards)
■ Keep the number of windows to a small constant. Recommendation: 20.
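The windowing idea and the ~20-window recommendation can be sketched as follows (illustrative Python; `pick_window_size` is a hypothetical helper, not a ScyllaDB API):

```python
def twcs_window(write_time_s, window_size_s):
    """Map a write timestamp to its time window: all data written within
    the same window is grouped (and eventually compacted) together, so a
    whole window's sstables can be dropped once every cell has expired."""
    return write_time_s // window_size_s

def pick_window_size(ttl_s, target_windows=20):
    """Choose a window size so live data spans ~target_windows windows,
    per the slide's recommendation of keeping ~20 windows."""
    return ttl_s // target_windows

ttl = 30 * 86400                 # e.g. a 30-day default_time_to_live
window = pick_window_size(ttl)   # 129600 s, i.e. 1.5-day windows
print(window // 3600)            # window size in hours: 36
```

Sizing windows from the TTL keeps the window count bounded regardless of how long the table retains data.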
47. Compaction Strategies Summary
Workload                  | Size-Tiered                  | Leveled                         | Incremental                         | Time-Window
--------------------------|------------------------------|---------------------------------|-------------------------------------|------------
Write-only                | 2x peak space                | 2x writes                       | Best                                | -
Overwrite                 | Huge peak space              | write amplification             | SAG helps                           | -
Read-mostly, few updates  | read amplification           | Best                            | read amplification                  | -
Read-mostly, many updates | read and space amplification | write amplification may overwhelm | read amplification, again SAG helps | -
Time series               | write, read, and space ampl. | write and space amplification   | write and read amplification        | Best
48. Stay in Touch
Raphael S. Carvalho
raphaelsc@scylladb.com
@raphael_scarv
@raphaelsc
https://www.linkedin.com/in/raphaelscarvalho/