MyRocks is an open source LSM-based MySQL storage engine, created by Facebook. These slides introduce MyRocks and describe how it was deployed at Facebook, as of 2017.
In this talk, we'll walk through RocksDB technology and look into areas where MyRocks is a good fit compared to other engines such as InnoDB. We will go over internals, benchmarks, and tuning of the MyRocks engine, and explore the benefits of using MyRocks within the MySQL ecosystem. Attendees will come away with a view of the latest development of tools and integration within MySQL.
1. MyRocks Deep Dive
Yoshinori Matsunobu
Production Database Engineer, Facebook
Apr 18, 2016 – Percona Live Tutorial
2. Agenda
▪ MyRocks overview
▪ Getting Started
▪ Architecture
▪ Data structure and table definition
▪ Query optimizer and statistics
▪ Row Locking and Concurrency
▪ Replication, Backup and Recovery
▪ Performance Tuning
▪ Monitoring
3. Target audiences
▪ Interested in efficiency -- Reducing the number of MySQL servers
▪ Already familiar with MySQL
▪ InnoDB and MySQL Replication
▪ It is OK if you don’t know anything about RocksDB or LSM – they’re covered in this tutorial!
4. H/W trends and limitations
▪ SSD/Flash is getting affordable, but MLC Flash is still a bit expensive
▪ HDD: Large enough capacity but very limited IOPS
▪ Reducing read/write IOPS is very important -- Reducing write is harder
▪ SSD/Flash: Great read iops but limited space and write endurance
▪ Reducing space is higher priority
5. Random Write on B+Tree
▪ B+Tree index leaf page size is small (16KB in InnoDB)
▪ Modifications in random order => Random Writes, and Random Reads if not cached
▪ N rows modification => In the worst case N different random page reads and writes
per index
INSERT INTO message (user_id) VALUES (31);
INSERT INTO message (user_id) VALUES (10000);
…..
(Diagram: B+Tree index on user_id – a branch page pointing to leaf pages containing 31 and 10000)
8. Compression issues in InnoDB
(Diagram) An uncompressed 16KB page holding several rows is compressed to 5KB, but still uses 8KB of space on storage, because storage space is allocated in aligned steps:
0~4KB => 4KB
4~8KB => 8KB
8~16KB => 16KB
New (5.7~) punch-hole compression has similar issue
9. RocksDB
▪ http://rocksdb.org/
▪ Forked from LevelDB
▪ Key-Value LSM persistent store
▪ Embedded
▪ Data stored locally
▪ Optimized for fast storage
▪ LevelDB was created by Google
▪ Facebook forked and developed RocksDB
▪ Used at many backend services at Facebook, and many external large
services
▪ MyRocks == MySQL with RocksDB Storage Engine
22. Reducing Space/Write Amplification
Three techniques: Append Only, Prefix Key Encoding, and Zero-Filling metadata.

Append Only: rows are appended to the WAL/MemStore and flushed to SST files – no in-place page updates.

Prefix Key Encoding – repeated key prefixes are stored only once:
  Before:               After:
  id1  id2  id3         id1  id2  id3
  100  200  1           100  200  1
  100  200  2                     2
  100  200  3                     3
  100  200  4                     4

Zero-Filling metadata – Seq id is 7 bytes in RocksDB; after compression, “0” uses very little space:
  Before:                           After:
  key  value  seq id   flag         key  value  seq id  flag
  k1   v1     1234561  W            k1   v1     0       W
  k2   v2     1234562  W            k2   v2     0       W
  k3   v3     1234563  W            k3   v3     0       W
  k4   v4     1234564  W            k4   v4     0       W
23. LSM Compaction Algorithm -- Level
▪ For each level, data is sorted by key
▪ Read Amplification: 1 ~ number of levels (depending on cache -- L0~L3 are usually cached)
▪ Write Amplification: 1 + 1 + fanout * (number of levels – 2) / 2
▪ Space Amplification: 1.11
▪ 11% is much smaller than B+Tree’s fragmentation
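▪ A worked example (assuming a fanout of 10 and 6 levels): write amplification = 1 + 1 + 10 * (6 – 2) / 2 = 22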
24. Read Penalty on LSM
SELECT id1, id2, time FROM t WHERE id1=100 AND id2=100 ORDER BY time DESC LIMIT 1000;
Index on (id1, id2, time)

InnoDB (Branch -> Leaf): Range scan with a covering index is done by just reading leaves sequentially, and ordering is guaranteed (very efficient).

RocksDB (MemTable, L0, L1, L2, L3): A merge across levels is needed to do a range scan with ORDER BY. (L0–L2 are usually cached, but in total it needs more CPU cycles than InnoDB.)
26. Delete penalty
INSERT INTO t VALUES (1),(2),(3),(4),(5);
=> Put(1), Put(2), Put(3), Put(4), Put(5)

DELETE FROM t WHERE id <= 4;
=> Delete(1), Delete(2), Delete(3), Delete(4), Put(5)

SELECT COUNT(*) FROM t;
=> scans Delete(1), Delete(2), Delete(3), Delete(4), Put(5)

▪ “Delete” adds a tombstone
▪ When reading, ignoring *all* Puts for the same key
▪ Tombstones can’t disappear until bottom level compaction happens
▪ Some reads need to scan lots of tombstones => inefficient
▪ In this example, reading 5 entries is needed just for getting one row
27. “SingleDelete” optimization in RocksDB
INSERT INTO t VALUES (1),(2),(3),(4),(5);
=> Put(1), Put(2), Put(3), Put(4), Put(5)

DELETE FROM t WHERE id <= 4;
=> SD(1), SD(2), SD(3), SD(4) – each SingleDelete removes its matching Put(1)..Put(4) and itself

SELECT COUNT(*) FROM t;
=> reads only Put(5)

▪ If Put for the same key is guaranteed to happen only once, SingleDelete can remove the Put and itself
▪ Reading just one entry is ok – more efficient than reading 5 entries
▪ MyRocks uses SingleDelete whenever possible
28. LSM on Disk
▪ Main Advantage
▪ Lower write penalty
▪ Main Disadvantage
▪ Higher read penalty
▪ Good fit for write heavy applications
29. LSM on Flash
▪ Main Advantages
▪ Smaller space with compression
▪ Lower write amplification
▪ Main Disadvantage
▪ Higher read penalty
30. MyRocks (RocksDB storage engine for MySQL)
▪ Taking both LSM advantages and MySQL features
▪ LSM advantage: Smaller space and lower write amplification
▪ MySQL features: SQL, Replication, Connectors and many tools
▪ Fully Open Source
▪ Working with MariaDB Company
▪ Currently RC stage
▪ https://github.com/facebook/mysql-5.6/
31. Major feature sets in MyRocks
▪ Similar feature sets as InnoDB
▪ Transaction
▪ Atomicity
▪ MVCC / Non locking consistent reads
▪ Read Committed, Repeatable Read (PostgreSQL-style)
▪ Crash safe slave and master
▪ Online Backup
▪ Logical backup by mysqldump
▪ Binary backup by myrocks_hotbackup
32. Performance and Efficiency in MyRocks
▪ Much smaller space and write amplification compared to InnoDB
▪ Reverse order index (Reverse Column Family)
▪ SingleDelete
▪ Prefix bloom filter
▪ “SELECT … WHERE id=1 and time >= X” => using bloom filter for id
▪ Mem-comparable keys when using case sensitive collations
▪ Optimizer statistics without diving into pages
33. Performance (LinkBench)
▪ Space Usage
▪ QPS
▪ Flash reads per query
▪ Flash writes per query
▪ Data Loading
▪ Latency
▪ HDD
▪ http://smalldatum.blogspot.com/2016/01/myrocks-vs-innodb-with-linkbench-over-7.html
38. Data Loading (migration)
▪ Dump and Reload by mysqldump | mysql
▪ InnoDB to InnoDB: Logical Copy completed in 6:07:03
▪ 1276GB to 1044GB (uncompressed)
▪ InnoDB to MyRocks: Logical Copy completed in 4:20:17
▪ 1276GB to 233GB (zlib L1 compressed)
39. Latency
▪ Slow query log to capture query times
(Chart: Cumulative Response Times – Queries (K) on the y-axis from 0 to 300; series: InnoDB, 2x RocksDB)
41. Agenda
▪ MyRocks overview
▪ Getting Started
▪ Architecture
▪ Data structure and table definition
▪ Query optimizer and statistics
▪ Row Locking and Concurrency
▪ Replication, Backup and Recovery
▪ Performance Tuning
▪ Monitoring
42. Getting Started
▪ Downloading Facebook MySQL 5.6 source code from GitHub
▪ Building MySQL binary by cmake
▪ Configuring my.cnf
▪ Installing MySQL and initializing data directory
▪ Starting mysqld
▪ Creating and manipulating some tables
▪ Shutting down mysqld
▪ https://github.com/facebook/mysql-5.6/wiki/Getting-Started-with-MyRocks
43. Downloading Facebook MySQL 5.6
▪ Everything is published as Open Source Software at
https://github.com/facebook/mysql-5.6
▪ Added many features/enhancements
▪ MyRocks, crash safe master/gtid, admission control, RBR, and many more
▪ Forked from official (Oracle) MySQL 5.6, rebased regularly
▪ Currently rebased from 5.6.27 (three revisions older than the latest – not very old)
▪ Actively developed
▪ Not based on MySQL 5.7
▪ Most features in MySQL 5.7 were not interesting to us, so we decided to skip. Will revisit in 5.8
▪ We backported some really useful features for us – like Loss-Less Semisync
44. Building MySQL
▪ See https://github.com/facebook/mysql-5.6/wiki/Build-Steps for details
▪ RocksDB (https://github.com/facebook/rocksdb/) is managed as submodule
▪ Optionally Snappy, LZ4, ZSTD and Bzip2 compression libraries can be added
▪ “export WITH_SNAPPY=/usr” if libsnappy.a is located at /usr/lib/.
▪ See storage/rocksdb/CMakeLists.txt for details
$ git clone https://github.com/facebook/mysql-5.6.git
$ cd mysql-5.6
$ git submodule init
$ git submodule update
$ cmake . -DCMAKE_BUILD_TYPE=RelWithDebInfo -DWITH_SSL=system -DWITH_ZLIB=bundled -DMYSQL_MAINTAINER_MODE=0 -DENABLED_LOCAL_INFILE=1
$ make -j24
$ make install
45. my.cnf (minimal configuration)
▪ You shouldn’t mix multiple transactional storage engines within the same instance
▪ Not transactional across engines, Not tested at all
▪ Add “allow-multiple-engines” in my.cnf, if you really want to mix InnoDB and MyRocks
[mysqld]
rocksdb
default-storage-engine=rocksdb
skip-innodb
default-tmp-storage-engine=MyISAM
collation-server=latin1_bin (or utf8_bin, binary)
log-bin
binlog-format=ROW
46. Creating tables
▪ “ENGINE=RocksDB” is a syntax to create RocksDB tables
▪ Setting “default-storage-engine=RocksDB” in my.cnf is fine too
▪ It is generally recommended to have a PRIMARY KEY, though MyRocks allows tables without
PRIMARY KEYs
▪ Tables are automatically compressed, without any configuration in DDL. Compression
algorithm can be configured via my.cnf
CREATE TABLE t (
id INT PRIMARY KEY,
value1 INT,
value2 VARCHAR (100),
INDEX (value1)
) ENGINE=RocksDB COLLATE latin1_bin;
47. MyRocks in Depth
▪ MyRocks data structure
▪ Query Optimizer and Optimizer Statistics
▪ Row Locking and concurrency
▪ Backup
▪ Crash Recovery
48. MyRocks Data Structure and Schema Design
▪ Supports PRIMARY KEY and SECONDARY KEY
▪ Primary Key is clustered
▪ Similar to InnoDB
▪ Primary key lookup can be done by single step
▪ “Index comment” specifies Column Family
▪ MySQL has a syntax to add a comment for each index
▪ Fulltext, Foreign, Spatial indexes are not supported
▪ Tablespace is not supported
▪ Online DDL is not supported yet
49. Internal Index ID
▪ MyRocks assigns an internal 4 byte index id for each index
▪ You don’t have to be aware of index ids, unless debugging internal data structures
▪ The internal index id is used for all MyRocks internal operations, such as reading/writing/updating/deleting rows and dropping indexes
RocksDB Key:   [Internal Index ID][Primary Key]
RocksDB Value: [the rest of the columns]
50. Internal Key/Value format
▪ Primary Key
▪ Key:
▪ Internal 4 byte index id (auto-assigned)
▪ Packed primary key columns
▪ Value:
▪ Packed other columns
▪ Record Checksum (optional)
▪ Secondary Key
▪ Key:
▪ Internal 4 byte index id
▪ Packed secondary key columns
▪ Packed primary key columns (excluding
duplicate columns)
▪ Value:
▪ Record Checksum (optional)
Primary Key:
  RocksDB Key:      [Internal Index ID][Primary Key columns]
  RocksDB Value:    [the rest of the columns][Checksum]
  RocksDB Metadata: [SeqID, Flag]

Secondary Key:
  RocksDB Key:      [Internal Index ID][Secondary Key columns][Primary Key columns]
  RocksDB Value:    [Checksum]
  RocksDB Metadata: [SeqID, Flag]

The Secondary Key structure is called an Extended Key – containing both the Secondary Key and the Primary Key.
51. Example
▪ CREATE TABLE t1 (id INT PRIMARY KEY, c1 INT, c2 INT, INDEX i1 (c1));
▪ INSERT INTO t1 VALUES (1, 10, 100), (2, 20, 200), (3, 30, 300), (4, 40, 400), (5, 50, 500);
▪ Primary key index id was 256, Secondary key (i1) index id was 257
Primary key (internal index id 256):
  RocksDB Key        RocksDB Value
  256, 1             10,100
  256, 2             20,200
  256, 3             30,300
  256, 4             40,400
  256, 5             50,500

Secondary key i1 (internal index id 257):
  RocksDB Key        RocksDB Value
  257, 10, 1         NULL
  257, 20, 2         NULL
  257, 30, 3         NULL
  257, 40, 4         NULL
  257, 50, 5         NULL
Space overhead by Internal Index ID is very small, thanks to Prefix Key Encoding feature of RocksDB
52. Index lookup/scan efficiency
▪ It’s very similar to InnoDB since both InnoDB and MyRocks use
clustered index
▪ Single step read if looking up by primary key (i.e. WHERE pk = 1)
▪ Covering Index: If queries using secondary key touch only secondary key +
primary key fields, they don’t read Primary Keys (no extra lookup)
▪ Both InnoDB and MyRocks support covering index
53. Avoid non-covering secondary index scan
- When doing an index scan, index leaf blocks can be fetched by *sequential* reads, but *random* reads for table records are required
- The overhead is very high for both InnoDB and MyRocks
- Try to rewrite queries to use covering index range scan (see the sketch below)
SELECT * FROM tbl WHERE key1 < 2000000

(Diagram: secondary index leaf blocks map key1 -> PK (1 -> 10000, 2 -> 5, 3 -> 15321, …); the leaves are read sequentially, but each matching PK requires a random read into the table records, e.g. 10000: col2=‘abc’, col3=100; 5: col2=‘aaa’, col3=10; 15321: col2=‘a’, col3=7)
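A sketch of such a rewrite (“tbl”, “key1” and “pk” are the hypothetical names from the diagram above):

SELECT * FROM tbl WHERE key1 < 2000000;
-- non-covering: a random read into the table records for every match
SELECT key1, pk FROM tbl WHERE key1 < 2000000;
-- covering: sequential index leaf reads only, no extra lookups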
54. Tables without primary key
▪ Hidden Primary Key
▪ Key:
▪ Internal 4 byte index id
▪ Internal 8 byte auto generated id (internal primary key)
▪ Hidden keys are not visible from applications
▪ Value:
▪ All columns (packed format)
▪ Record Checksum (optional)
▪ Secondary Key
▪ Same as tables with primary key
Hidden Primary Key:
  RocksDB Key:      [Internal Index ID][Hidden PK]
  RocksDB Value:    [All columns][Checksum]
  RocksDB Metadata: [SeqID, Flag]

Secondary Key:
  RocksDB Key:      [Internal Index ID][Secondary Key columns][Hidden PK]
  RocksDB Value:    [Checksum]
  RocksDB Metadata: [SeqID, Flag]
55. Index and Column Family
▪ Column Family and MyRocks Index mapping is 1:N
▪ Each MyRocks index belongs to one Column Family
▪ Multiple indexes can belong to the same Column Family
▪ If not specified in DDL, the index belongs to “default” Column Family
▪ Most RocksDB configuration parameters are per Column Family
▪ MemTable, Bloom Filter, etc
▪ Different types of indexes should be allocated to different Column Families
▪ Do not create too many Column Families
▪ ~20 would be good enough
▪ INDEX COMMENT specifies associated column family
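A minimal sketch (hypothetical table and column family names), using the INDEX COMMENT syntax described above:

CREATE TABLE t2 (
  id INT,
  c1 INT,
  PRIMARY KEY (id) COMMENT 'cf_pk',   -- primary key placed in column family cf_pk
  INDEX i1 (c1) COMMENT 'cf_i1'       -- secondary index placed in column family cf_i1
) ENGINE=RocksDB;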
56. Column Family
Query atomicity across different key spaces.
▪ Column families:
▪ Separate MemTables and SST files
▪ Share transactional logs
(Diagram: CF1 and CF2 each have their own Active MemTable, Read Only MemTable(s), persistent SST files, and compaction, while sharing a single WAL.)
57. Index maintenance efficiency
▪ In MyRocks, non-unique Index does not need reads on index writes
▪ Better write QPS especially if index does not fit in memory
InnoDB: INSERT INTO message (user_id) VALUES (31)
  1. Check whether the index blocks are cached in the buffer pool
  2. pread() the leaf block from disk (if not cached)
  3. Modify the indexes
  (Diagram: buffer pool and disk holding index leaf blocks mapping user_id -> RowID)

MyRocks: INSERT INTO message (user_id) VALUES (31)
  1. Put (user_id=31, pk=…) into the MemTable/WAL – no reads needed
58. Reverse column families
▪ RocksDB is great at scanning forward
▪ But ORDER BY DESC queries are slow
▪ Reverse column families make descending scan a lot faster

Prefix Key Encoding, forward vs reverse column family:
  Original keys:        Forward CF:           Reverse CF:
  id1  id2  id3         id1  id2  id3         id1  id2  id3
  100  200  1           100  200  1           100  200  4
  100  200  2                     2                     3
  100  200  3                     3                     2
  100  200  4                     4                     1
59. Table Definition Example
▪ Index Comment specifies Column Family name
▪ If the column family does not exist, RocksDB automatically creates it
▪ “rev:” is a syntax to create reverse order column family
▪ Column Family statistics can be viewed via “SHOW ENGINE ROCKSDB STATUS\G”
CREATE TABLE `linktable` (
`id1` bigint unsigned,
`id1_type` int unsigned,
`id2` bigint unsigned,
`id2_type` int unsigned,
`link_type` bigint unsigned,
`visibility` tinyint NOT NULL,
`data` varchar(255) NOT NULL,
`time` bigint unsigned NOT NULL,
`version` int unsigned NOT NULL,
PRIMARY KEY (link_type, `id1`,`id2`) COMMENT 'cf_link_pk',
KEY `id1_type` (`id1`,`link_type`,`visibility`,`time`,`version`,`data`) COMMENT 'rev:cf_link_id1_type'
) ENGINE=RocksDB DEFAULT COLLATE=latin1_bin;
60. Write path
▪ Example: “UPDATE t SET c1=100 WHERE c1=1;”
▪ There were 4 indexes – Primary key, i1 (c1), i2 (c2), i3(c2, c1)
▪ Only 1 row was affected
▪ These write operations were issued into RocksDB at transaction commit
▪ Put (primary key, c1=100) -- overwrite
▪ SingleDelete (i1, c1=1)
▪ Put (i1, c1=100)
▪ SingleDelete(i3, c2=X, c1=1)
▪ Put (i3, c2=X, c1=100)
▪ Put (new binlog state) -- overwrite
▪ Put (new relay log state) -- overwrite
▪ i2 was not affected because i2 didn’t include c1
61. Mem-comparable Keys and Case Sensitiveness
▪ MyRocks is optimized for CHAR/VARCHAR indexes using case
sensitive collations, such as latin1_bin, utf8_bin and binary.
▪ Set collation-server=latin1_bin (or utf8_bin, binary) in my.cnf
▪ Default collation in MySQL is latin1_swedish_ci, which is not case
sensitive
▪ MyRocks by default does not allow to have indexes with case
insensitive collations
▪ You can optionally allow by setting
rocksdb_strict_collation_exceptions=‘.*’ in my.cnf
62. Case Sensitive collations
▪ Schema Definition
▪ CREATE TABLE … ENGINE=ROCKSDB COLLATE latin1_bin;
▪ Less strict unique constraint
▪ ‘AAA’, ‘aaa’, ‘aAa’ are treated as different values, so no unique key error is returned
▪ WHERE key=‘a’ no longer matches key ‘A’
▪ Sort ordering is changed
▪ Case sensitive: ‘AAA’ -> ‘ABA’ -> ‘AaB’
▪ Case insensitive: ‘AAA’ -> ‘AaB’ -> ‘ABA’
▪ (A=0x41, B=0x42, a=0x61)
▪ Replication may stop if replicating from case insensitive master to case sensitive slaves
▪ Storing ‘A’ -> DELETE FROM t WHERE key=‘a’ (deleted on master, ignored on slave) -> INSERT INTO t (key) VALUES
(‘A’) (inserted on master, duplicate key error on slave)
63. Data Dictionary
▪ MyRocks stores some metadata information into RocksDB’s
dedicated column family named __system__
▪ Dictionary Examples
▪ Table name => internal index id mapping
▪ Internal index id => index metadata, Column Family id
▪ Column Family id => Column Family flags (i.e. reverse order or not)
▪ Binlog state
▪ Binlog name, position, and GTID
▪ Written at transaction commit
▪ Index statistics
▪ Can be read via information_schema
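One way to inspect these mappings (a sketch; the exact information_schema table names and columns may vary by MyRocks version):

SELECT * FROM INFORMATION_SCHEMA.ROCKSDB_DDL;
-- table/index name => internal index id and column family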
64. Query optimizer, table and index statistics
▪ MyRocks automatically stores index statistics, so that MySQL can choose proper
query execution plans
▪ MyRocks stores following estimated statistics
▪ For each index:
▪ Index name
▪ Index size
▪ Number of rows
▪ Actual disk space
▪ Number of deletes (tombstones)
▪ For each index prefixes:
▪ Distinct number of keys (for cardinality)
65. Where statistics are stored
▪ SST files
▪ Table Property
▪ RocksDB’s extension format, allocated for each sst file
▪ Written when SST files are created (flush and compaction)
▪ Calculating statistics during flush/compaction, for all associated indexes
for each sst
▪ Data Dictionary
▪ Summary of these statistics, for each index
66. When index statistics are updated
▪ At MemTable Flush (automatic)
▪ At Compaction (automatic)
▪ ANALYZE TABLE (manual)
67. Index statistics in small tables
▪ Until MemTable is flushed, index statistics are not written
▪ Typical MemTable size is 128MB, allocated for each Column Family
▪ If your MySQL instance is not updated, index statistics are not
written
▪ SHOW INDEX reports zero index cardinality
▪ ANALYZE TABLE triggers MemTable flush, which updates cardinality
on small tables too
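A minimal illustration (hypothetical small table t):

SHOW INDEX FROM t;   -- Cardinality may be reported as 0 before the first MemTable flush
ANALYZE TABLE t;     -- triggers a MemTable flush and updates the statistics
SHOW INDEX FROM t;   -- Cardinality now reflects the estimated distinct keys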
68. How MyRocks estimates range scan efficiency
▪ “SELECT * FROM t WHERE idx < 10000000”
▪ How MyRocks decides to do full index scan or range scan?
▪ RocksDB has an API GetApproximateSizes() for specified ranges. MyRocks
uses this function
▪ Range Scan: GetApproximateSizes(begin, end) / average_row_length (from data
dictionary) => Estimated number of rows scanned
▪ Full Scan: Estimated total number of rows (from data dictionary)
▪ The MySQL optimizer uses this information
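▪ A worked example (hypothetical numbers): if GetApproximateSizes(begin, end) returns 100MB for the range and the average row length from the data dictionary is 100 bytes, the estimated number of rows scanned is about 1 million; the optimizer compares that against the estimated total number of rows to choose between a range scan and a full scan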
69. MyRocks specific optimizer features/limitations
▪ Need to execute ANALYZE TABLE on small tables
▪ No “index dive” overhead
▪ Index Condition Pushdown is supported
▪ Multi Range Read (MRR) is not supported yet
70. MyRocks in Depth
▪ MyRocks data structure
▪ Query Optimizer and Optimizer
Statistics
▪ Row Locking and concurrency
▪ Backup
▪ Crash Recovery
71. MyRocks row locking
▪ Lock granularity in MyRocks is row
▪ Locks are released at the end of the transaction (commit or rollback)
▪ All locking states are kept in memory
▪ Can’t modify billions of rows within one transaction
▪ Shared level lock is not supported yet
▪ SELECT .. LOCK IN SHARE MODE holds exclusive lock (as .. FOR UPDATE)
▪ Gap lock support is very limited
▪ Rows not found are not locked, except when using all primary keys in WHERE condition
▪ Row based binary logging (binlog_format=ROW) must be used
▪ Read Uncommitted and Serializable isolation levels are not supported
▪ Automatic deadlock detection is not supported
▪ SAVEPOINT is not supported
72. Primary key locking reads
▪ SELECT * FROM t WHERE id = 1 FOR UPDATE;
▪ GetForUpdate(id=1) – reserving a row lock for id=1 only
▪ Nobody else can do locking read or insert for the same id
▪ id1=1 is locked even if the row does not exist
▪ This is the only Gap Lock pattern that MyRocks supports
▪ MyRocks holds gap lock when using all primary key columns in WHERE clause
▪ Used for unique key checking
73. Releasing locks for rows scanned but not matched
▪ “SELECT * FROM t WHERE non_indexed_column = 1 FOR UPDATE”
▪ InnoDB with Statement based Binlog: Does full table scan, lock all rows
▪ InnoDB with Row based Binlog + Read Committed Isolation: Does full table
scan, lock rows, release rows immediately if non_indexed_column != 1
▪ After the SELECT completes, only rows with non_indexed_column = 1 are
locked
▪ MyRocks: Same as InnoDB with RBR + RC
74. Secondary index locking reads
▪ SELECT sk FROM t WHERE sk = 1 FOR UPDATE;
▪ Locking ordering is Secondary key -> Primary key
▪ Locking the secondary key where sk=1
▪ Reading primary key from the secondary key (extended key)
▪ Locking the primary key so that nobody can modify the row
▪ Reading primary key values is skipped thanks to covering index
▪ Other secondary keys are not locked
▪ All locking reads lock primary key
▪ Unique Constraint + Secondary Key is also supported
75. Transaction Isolation Differences compared to
InnoDB/PostgreSQL
▪ MyRocks transaction isolation implementation is close to PostgreSQL’s style – snapshot isolation
▪ Behavior is very close, though there are some minor differences
▪ There are some behavior differences between InnoDB and PostgreSQL/MyRocks
▪ Locking reads (UPDATE, SELECT FOR UPDATE, etc) read current data in InnoDB, regardless of isolation
levels
▪ Locking reads in MyRocks and PostgreSQL read from snapshot
▪ Read Committed: Snapshot is created at each statement
▪ Repeatable Read (PostgreSQL): Snapshot is created at the beginning of the transaction
▪ Repeatable Read (MyRocks): Snapshot is created at the first statement of the transaction
▪ https://github.com/ept/hermitage is a great resource to understand with examples
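A sketch of the Repeatable Read difference described above (hypothetical table t):

BEGIN;
-- PostgreSQL RR: snapshot is created here, at the beginning of the transaction
SELECT * FROM t WHERE id = 1;
-- MyRocks RR: snapshot is created here, at the first statement
SELECT * FROM t WHERE id = 2;
-- both engines read from the same snapshot for the rest of the transaction
COMMIT;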
76. Multi-Statement Transactions
▪ Client 1
BEGIN;
INSERT INTO linktable (id1=1, id2=1, link_type=1)
Client 2
BEGIN;
INSERT INTO linktable (id1=1, id2=2, link_type=1)
Client 1
INSERT INTO counttable (id1=1, link_type=1) ON
DUPLICATE KEY UPDATE...
COMMIT;
Client 2
INSERT INTO counttable (id1=1, link_type=1) ON
DUPLICATE KEY UPDATE
???
▪ InnoDB RC/RR: Client 2 overwrites client 1
PostgreSQL RC: Client 2 overwrites client 1
MyRocks RC: Client 2 overwrites client 1
▪ PostgreSQL RR: ERROR: could not serialize
access due to concurrent update
▪ MyRocks RR: ERROR 1213 (40001): Deadlock
found when trying to get lock; try restarting
transaction
Trying to update a row that was changed by somebody else after my transaction created its snapshot
=> Allowed with MyRocks RC, but not allowed with MyRocks RR. InnoDB allows both.
77. Gap Lock
▪ MyRocks has very limited Gap Lock support – only when using all primary keys in WHERE condition
▪ InnoDB vs MyRocks major locking differences
▪ Update with Select (INSERT/CREATE .. SELECT, UPDATE .. SELECT etc) locks rows scanned by select in
InnoDB. MyRocks does not by default
▪ Locking reads in InnoDB lock rows that don’t exist. MyRocks has limited support for it.
▪ InnoDB can lock in range – i.e. blocking to insert any id greater than 10. MyRocks can’t do that
▪ MyRocks needs Row Based Binary Logging
▪ It returns errors on update when using SBR, except SQL threads
▪ Set “rocksdb_unsafe_for_binlog=true” if you’re sure your queries are 100% safe with SBR
78. Why lack of Gap Lock support needs RBR

  id    value
  1     0
  10    9
  100   1
  1000  0

1) DELETE FROM t WHERE value = 9;
2) UPDATE t SET value = 9 WHERE id = 1;

1) Checked id = 1 first, not matched; the row is unlocked if gap lock is not supported
2) Updated value = 9 where id = 1, committed, written to binlog
1) Finished the delete statement. id=10 was deleted. Written to binlog

Final Result: id=1 exists with value=9. id=10 was deleted

Binary log (statement based):
1. UPDATE t SET value = 9 WHERE id = 1;
2. DELETE FROM t WHERE value = 9;
=> Both id=1 and id=10 are deleted on slaves

Data consistency is broken. Row based Binary Logging & Replication (RBR) can prevent this issue
79. Gap Lock
▪ When migrating from InnoDB to MyRocks:
▪ Fewer row lock contentions than InnoDB
▪ For some queries relying on gap lock, you may get “Deadlock” errors in MyRocks
▪ For some queries relying on gap lock, MyRocks does not work at all
▪ We added session variables for handling gap lock gracefully
▪ “gap_lock_raise_error” -- raising errors if queries relying on gap lock
▪ “gap_lock_write_log” – writing into log file for queries using gap lock
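For example (a sketch using the session variables named above; exact defaults may vary):

SET SESSION gap_lock_raise_error = 1;  -- raise an error for queries relying on gap lock
SET SESSION gap_lock_write_log = 1;    -- write such queries to a log file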
80. Queries relying on Gap Lock – empty check
link table has primary key on (id1, type, id2)
user table has primary key on (id)
If matching (id1, type) does not exist in link, set link_exist=0 on user
After inserting into link, set link_exist=1 on user
MyRocks:
BEGIN;
INSERT INTO link (id1, type, id2) VALUES (2, 2, 3);
=> Not blocked in MyRocks, because of no Gap Lock
UPDATE user SET link_exist=1 WHERE id=2;
COMMIT;

BEGIN;
SELECT * FROM link WHERE id1=2 AND type=2 LOCK IN SHARE MODE;
=> Empty Set
UPDATE user SET link_exist=0 WHERE id=2;
=> Getting “Deadlock” error in MyRocks, with Repeatable Read
(if UPDATE was done, the row became inconsistent)

InnoDB:
BEGIN;
INSERT INTO link (id1, type, id2) VALUES (2, 2, 3);
=> Blocked in InnoDB, because of Gap Lock
UPDATE user SET link_exist=1 WHERE id=2;
COMMIT;

UPDATE user SET link_exist=0 WHERE id=2;
COMMIT;
=> INSERT proceeded
- MyRocks does not implicitly overwrite changes after creating a snapshot, with Repeatable Read isolation.
This rarely causes unexpected data inconsistency
- With MyRocks, applications will get more errors than InnoDB
81. Queries relying on Gap Lock – empty check
link table has primary key on (id1, type, id2)
user table has primary key on (id)
If matching (id1, type) does not exist in link, set link_exist=0 on user
After inserting into link, set link_exist=1 on user
MyRocks:
BEGIN;
INSERT INTO link (id1, type, id2) VALUES (2, 2, 3);
=> Not blocked in MyRocks, because of no Gap Lock

BEGIN;
SELECT * FROM link WHERE id1=2 AND type=2 LOCK IN SHARE MODE;
=> Empty Set
UPDATE user SET link_exist=0 WHERE id=2;
COMMIT;

UPDATE user SET link_exist=1 WHERE id=2;
=> Getting “Deadlock” error in MyRocks, with Repeatable Read
82. Queries relying on Gap Lock – prefix uniqueness
link table has primary key on (id1, type, id2)
Want uniqueness on prefix (id1=1 and type=10)
InnoDB:
BEGIN;
SELECT * FROM link WHERE id1=1 AND type=10 FOR UPDATE;
=> Returning Empty Set, holding gap lock on (id1=1, type=10)
=> Nobody else can insert a row starting with (id1=1, type=10) until transaction ends, thanks to Gap Lock
INSERT INTO link (id1, type, id2) VALUES (1, 10, 150);
COMMIT;
Other transactions:
BEGIN;
SELECT * FROM link WHERE id1=1 AND type=10 FOR UPDATE;
=> Row exists
UPDATE link SET … WHERE id1=1 AND type=10 AND id2=150;
COMMIT;
This does NOT work with MyRocks – will end up inserting many rows starting with (id1=1, type=10)
83. Queries relying on Gap Lock – prefix uniqueness
link table has primary key on (id1, type, id2)
Want uniqueness on prefix (id1=1 and type=10)
MyRocks:
BEGIN;
SELECT * FROM link WHERE id1=1 AND type=10 AND id2=0 FOR UPDATE;
=> Returning Empty Set, holding gap lock on (id1=1, type=10, id2=0)
=> Nobody else can proceed with SELECT FOR UPDATE on (id1=1, type=10, id2=0). id2=0 is a fixed dummy id.
INSERT INTO link (id1, type, id2) VALUES (1, 10, 150);
COMMIT;
Other transactions:
BEGIN;
SELECT * FROM link WHERE id1=1 AND type=10 AND id2=0 FOR UPDATE;
=> Row exists
UPDATE link SET … WHERE id1=1 AND type=10 AND id2=150;
COMMIT;
MyRocks holds a gap lock only when ALL primary key columns are specified in the condition. Prefix uniqueness can be implemented this way.
84. Queries relying on Gap Lock – blocking queue
▪ Gap Lock can be used to block inserting any value greater (or smaller) than X
link table has primary key on (id1, type, id2)
BEGIN;
SELECT * FROM link WHERE id1=100 AND type=10 ORDER BY id2 DESC LIMIT 1 FOR UPDATE;
=> Returning id2 = 20
BEGIN;
INSERT INTO link (id1, type, id2) VALUES (100, 10, 21);
=> Blocked
(can't insert id2=22 or greater either)
MyRocks does not support this
85. Queries relying on Gap Lock – pt-table-checksum
▪ REPLACE INTO percona.checksums SELECT * FROM app_tables WHERE key > X AND key < Y;
▪ Runs the same statement on both master and slaves via replication, then compares results (checksums)
▪ This relies on Gap Lock, so it doesn't work with MyRocks
▪ We check table checksums with a different algorithm at Facebook
86. Locking rows that do not exist (1)
▪ create table t0 (id1 int, id2 int, value int, primary key(id1, id2));
insert into t0 values (1,1,0),(3,3,0),(4,4,0),(6,6,0);
T1:
begin;
select * from t0 where id1=1 for update;
T2:
begin;
select * from t0 where id1=1 and id2=4 for update;
insert into t0 values (1,5,0);
▪ MyRocks RR/RC: succeeds
InnoDB RBR+RC: succeeds
InnoDB RR: T2 blocked by T1
PostgreSQL RR/RC: succeeds
Only InnoDB supports this (Gap Locking). The MyRocks and PostgreSQL behavior is expected.
87. Locking rows that do not exist (2)
▪ create table t0 (id1 int, id2 int, value int, primary key(id1, id2));
insert into t0 values (1,1,0),(3,3,0),(4,4,0),(6,6,0);
T1:
begin;
select * from t0 where id1=1 and id2=5 for update;
T2:
begin;
insert into t0 values (1,5,0);
T3:
begin;
select * from t0 where id1=1 and id2=5 for update;
▪ PostgreSQL RC/RR: succeeds
InnoDB RR: T2/T3 blocked by T1
InnoDB RBR+RC: succeeds
MyRocks RC/RR: T2/T3 blocked by T1
88. After creating a snapshot, other clients updating rows
▪ create table t (id int primary key, value int);
(t had 10M rows, all values were zero)
T1:
select * from t where value > 0 for update;
(taking a long time)
T2:
update t set value=value+1 where id=5000000;
commit;
(T2 starts just after T1 and finishes before T1 touches the row)
▪ InnoDB:
RC/RR: T1 returned id=5000000 (returning rows updated after creating a snapshot at the beginning of the long running select for update)
▪ PostgreSQL:
RC: T1 returned id=5000000
RR: ERROR: could not serialize access due to concurrent update
▪ MyRocks:
RC/RR: ERROR 1213 (40001): Deadlock found when trying to get lock; try restarting transaction
89. Replication, Backup and Recovery
▪ Replication
▪ Row Based Binary Logging and Replication (RBR)
▪ Read Free Replication
▪ Facebook’s Replication Enhancements
▪ Crash Safety
▪ Slave recovery
▪ Master failover
▪ Backup
▪ Online logical backup
▪ Consistent Snapshot and long running transactions
▪ Online binary backup
90. RBR overview
▪ MyRocks needs binlog_format=ROW
▪ Statement based binary logging (SBR) may cause data mismatch between master and slaves without full Gap Lock support, so MyRocks raises errors when using SBR
▪ It is possible to use binlog_format=STATEMENT on slaves
▪ Can be useful when replicating from InnoDB with statement based binary logging
▪ MyRocks returns errors on SBR, except from replication threads
▪ Set rocksdb_unsafe_for_binlog=1 if you are sure it is safe
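A minimal my.cnf sketch of the above (enable rocksdb_unsafe_for_binlog only if you are certain your statements are safe with SBR):
binlog_format=ROW
# rocksdb_unsafe_for_binlog=1   # opt out of the SBR error, at your own risk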
91. RBR advantages
▪ Generally better replication speed because no query parsing is needed
▪ No statement is unsafe for row based binary logging
▪ Costly queries that change little are efficiently applied
▪ Data can be consumed by non-MySQL consumers
92. RBR disadvantages
▪ Some statements are inefficient when applied via RBR
▪ small queries producing large changes
▪ queries on tables without a usable key
▪ lack of visibility into replication progress
▪ events contain incomplete type information
▪ external consumers don't have necessary data available
▪ schema mismatches between master and slave
▪ binary logs may be substantially larger
▪ internal type changes may cause incompatibility
▪ triggers don't run on slaves when applying changes
▪ statements that modify no rows are not written to the binlog
93. Notable differences between SBR and RBR
▪ how data mismatches between master and slave are handled
▪ default -- both duplicate key error and no data found become errors
▪ SBR -- duplicate key error stops the slave, no data found does not
▪ slave_exec_mode=IDEMPOTENT -- both duplicate key error and no data found are skipped
▪ slave_exec_mode=SEMI_STRICT (Facebook extension) -- only no data found is ignored (compatible with SBR)
▪ dropping columns on the slave will cause replication to stop
▪ Errors like "Column 2 of table 'test.a' cannot be converted from type 'enum' to type 'int(11)'" (FB extension: log_column_names=ON)
95. Read Free Replication
▪ Read Free Replication is a feature to skip random reads on slaves
▪ Skipping unique constraint checking on INSERT
▪ Skipping row checking on UPDATE/DELETE
▪ RBR and an append-only database (LSM, Fractal Tree) make it possible
▪ TokuDB implemented this feature first. We implemented it in MyRocks based on its idea and codebase
▪ Unique key checking can be skipped by setting rocksdb_skip_unique_check=1
96. Avoiding row look-ups
▪ Update and Delete path on a slave, with RBR, without Read Free Replication
▪ Relay log has the "Before Row Image" (BI)
▪ Get the primary key from the BI
▪ Point lookup by the primary key (random read)
▪ If the row does not exist, skip or raise an error depending on slave_exec_mode. If it exists, update it by the "After Row Image" (AI) or delete it
▪ With Read Free Replication
▪ Delete/SingleDelete (BI keys)
▪ Put (AI)
▪ Eliminates the random read overhead on update/delete
▪ Can be configured by the rocksdb_rpl_lookup_rows parameter (sketch below)
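A minimal sketch of enabling the read-free path on a slave (variable names as listed on these slides; check the defaults for your fb-mysql build before relying on it):
SET GLOBAL rocksdb_rpl_lookup_rows = 0;    -- skip row look-ups for UPDATE/DELETE in the SQL thread
SET GLOBAL rocksdb_skip_unique_check = 1;  -- skip unique constraint checks on INSERT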
97. Facebook’s Replication Extensions
▪ GTID + Multi-Threaded Slave (MTS) + Reduced Durability
▪ Replication states are written to mysql.slave_gtid_info table (transactional
table, installed by mysql_install_db in fb-mysql)
▪ Made crash safe slave work with GTID
▪ slave-transaction-retries working with MTS
▪ Changed the Binlog Index format so that GTID auto positioning is faster
▪ RBR (described before)
▪ Semisync (backporting Loss-Less from 5.7, semisync mysqlbinlog)
▪ And many more…
98. Crash Safety
▪ How to recover MyRocks instances when they crash, without restoring the entire data set from other instances
▪ Restoring multi-TB instances is painful
▪ We want to recover by 1. restarting mysqld, then 2. starting the slave
▪ We call this "Crash Safe" – recoverable without a full restore
▪ MyRocks is designed to be crash safe
99. Crash Scenario and how to recover MyRocks
▪ Slave Failure
▪ MyRocks supports "Crash Safe Slave". You can just restart the mysqld instance and MySQL recovers everything
▪ my.cnf
▪ relay_log_recovery=1
▪ Master Failure
▪ This is much more complex than slave failure, but it's possible to make it crash safe
100. How crash recovery works
▪ All modifications are written to WAL (transaction log files) at commit
▪ Flushed to kernel buffer, but fsync() is not called at commit
▪ On mysqld process down
▪ All committed transactions are written to WAL file / kernel buffer. No data loss
▪ On OS/machine down
▪ Some of the committed transactions may be lost
▪ Need to catch up missing transactions from master
▪ Binlog and replication state are also written to the WAL atomically at commit
101. Crash Safe MyRocks on slaves
Slave instance WAL stream:
WAL entry 1: Put (key=1111, value=100), Master_binlog_file=mysql-bin.000100, Master_binlog_pos=1000
WAL entry 2: Put (key=1112, value=200), Master_binlog_file=mysql-bin.000100, Master_binlog_pos=2000
WAL entry 3: Put (key=1113, value=300), Master_binlog_file=mysql-bin.000100, Master_binlog_pos=3000
▪ WAL is append only and has internal checksums. On crash recovery, if RocksDB detects a broken WAL entry, it discards the broken entry and all WAL entries after it, so the state after crash recovery is consistent
▪ Even if WAL entry 3 is lost, after crash recovery the replication state becomes "master_binlog_pos=2000", so the slave can fetch the binlog events for WAL entry 3 from the master
102. Crash Safe Master with GTID and semisync
▪ What is a crash safe master?
▪ When the master goes down and a slave is promoted, we want to add the crashed master back as a new slave after its recovery, without rebuilding the whole instance
▪ Recovery (from an OS reboot etc.) shouldn't take days. Applying a few minutes/hours of binlogs is much faster than copying and rebuilding the entire instance
M1 GTID: m1-gtid:1-10000
S1 GTID: m1-gtid:1-10000, s1-gtid:1-….
After the crashed master's recovery, continue replication by:
CHANGE MASTER TO MASTER_HOST='S1', MASTER_AUTO_POSITION=1;
103. Difficulties of the crash safe master
M1 GTID in binlog: m1-gtid:1-10005
M1 GTID in MyRocks: with loss-less semisync, guaranteed to be before the new master's GTID
S1 GTID: m1-gtid:1-10000, s1-gtid:1-….
▪ If the original master's GTID is ahead of the new master's, the original master can't join as a slave
▪ Because the new master doesn't have the data
▪ GTID in the binlog may be ahead of semisync slaves (even with loss-less semisync), and crash recovery is done by the GTID in the binlog
▪ During crash recovery, prepared-state transactions are rolled forward
▪ We needed to trim the original master's binary logs so that prepared-state transactions are rolled back instead
104. Detailed crashed master recovery steps
▪ Before restarting the crashed master's mysqld process, remove its binary logs
▪ InnoDB has prepared-state transactions that were not sent to any slave yet. If matching binlog events exist, they will be rolled forward
▪ Find the last committed GTID on the crashed master, and set GTID_PURGED
▪ We parse the mysql.err log with a Python script to find the last committed GTID
▪ Apply missing binlog events from the new master
▪ Either by mysqlbinlog or replication with --replicate-same-server-id
▪ We made all these steps work with MyRocks too
M1 GTID in binlog: m1-gtid:1-10005
M1 GTID in InnoDB: m1-gtid:1-9998 (with semisync, guaranteed to be before the new master's GTID)
S1 GTID: m1-gtid:1-10000, s1-gtid:1-…. (has received m1-gtid:1-10000)
105. MyRocks limitations around crash safety
▪ MyRocks currently does not support XA between binary logs and RocksDB, so you may lose data or hit data inconsistency on master machine failure
▪ MyRocks by default does not call fsync() at commit. By setting rocksdb_use_fsync=1 it calls fsync(), but that does not prevent inconsistency because of the lack of XA support
▪ All metadata operations (i.e. adding indexes) are always synchronized
▪ With GTID and Loss-Less Semi-Synchronous Replication, you can fail over without data loss and without data inconsistency
▪ You need to promote a slave. Do not wait for the dead master to recover
▪ Regular (non-semisync) failover may cause data loss but can prevent data inconsistency
▪ We're working on supporting XA and full durability with good enough performance in MyRocks. We're going to implement:
▪ 2PC in RocksDB
▪ Group commit support in MyRocks
106. Backup
▪ Logical Backup by mysqldump
▪ Consistent Snapshot and long running transactions
▪ Binary Backup by myrocks_hotbackup
107. Logical Backup by mysqldump
▪ Facebook mysql-5.6 extended mysqldump to support MyRocks
▪ mysqldump --single-transaction
▪ Can take either an InnoDB or a MyRocks consistent snapshot, not both
▪ Checks default-storage-engine
▪ default-storage-engine=RocksDB => takes a consistent MyRocks dump
▪ default-storage-engine=InnoDB => takes a consistent InnoDB dump
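A hedged usage sketch (a standard mysqldump invocation; fb-mysql picks the engine to snapshot from default-storage-engine as described above):
$ mysqldump --single-transaction --all-databases > backup.sql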
108. How consistent backup works
▪ SET TRANSACTION ISOLATION LEVEL REPEATABLE READ
▪ START TRANSACTION WITH CONSISTENT ROCKSDB SNAPSHOT
▪ New syntax (Facebook patch)
▪ Get an internal binlog lock so that nobody can write to binlog
▪ Much lighter than FLUSH TABLES WITH READ LOCK
▪ Create RocksDB snapshot
▪ Get current binlog position and GTID
▪ Unlock internal binlog lock
▪ Dump all tables (SELECT …)
▪ Repeatable Read guarantees all dump data is based on the acquired snapshot
109. Snapshot and long running transactions
▪ Logical backup holds a snapshot for a very long time. This is bad from a performance point of view
▪ The overhead is high in MyRocks too, but not as high as in InnoDB
Backup client: START TRANSACTION WITH CONSISTENT SNAPSHOT; SELECT * FROM t; => returns value=0
Applications meanwhile keep updating:
PK=1, value=0
UPDATE t SET value=value+1 WHERE PK=1; => PK=1, value=1
UPDATE t SET value=value+1 WHERE PK=1; => PK=1, value=2
UPDATE t SET value=value+1 WHERE PK=1; => PK=1, value=3
….. (modified 1,000,000 times) => PK=1, value=1,000,000
110. How snapshot works in InnoDB
[Diagram: tbl.ibd holds the current row (PK=1, value=1,000,000); the UNDO tablespace holds the prior row versions, e.g. (PK=1, value=999,999), (PK=1, value=999,998), ..., chained back through roll pointers to Tx id=3 (PK=1, value=0)]
- InnoDB needs to look back through UNDO pages 1,000,000 times (the number of modifications after starting the transaction) to find the exact row (the target row's txid is the one just before executing START TRANSACTION WITH CONSISTENT SNAPSHOT)
- Each lookup is a random read, which is much less efficient than a sequential read
111. How snapshot works in MyRocks
START TRANSACTION WITH CONSISTENT ROCKSDB SNAPSHOT => Sequence Id = 3
MemTable: Seq=3000 (PK=1, value=3001)
SST files after Flush and Compaction: Seq=2 (PK=1, value=0), Seq=3000 (PK=1, value=3001)
All intermediate rows (Seq=4, Seq=5, ...) were removed during flush/compaction because RocksDB could tell they were definitely not needed, while Seq=2 is kept because it is still visible to the snapshot (Sequence Id = 3)
This makes the total number of lookups much smaller than 1,000,000
But it's still good practice to keep transaction duration short
112. Online Binary Backup
▪ MyRocks provides online binary backup solutions and tools
▪ Online binary backup is useful for:
▪ Creating new slave instances much faster than logical backup
▪ Restoring from backup much faster than logical backup
▪ Currently only full binary backup is possible. Partial or incremental binary backup is not supported yet
113. How MyRocks online backup works
▪ Create hard links for sst files in the RocksDB directory ($datadir/.rocksdb/*.sst)
▪ Then back up all hard-linked files and WAL files somewhere (local or remote storage)
▪ Hard-linked files (.sst) are immutable, so the source instance does not have to be stopped
▪ MyRocks has a special syntax "SET GLOBAL rocksdb_create_checkpoint = '/path/to/backup'" to create hard links and copy WAL files
▪ WAL files are mutable (append only) but become consistent after starting the instance
[Diagram: the instance's data directory (000001.sst ... 000004.sst, WAL 1, WAL 2) => Create Hard Links => checkpoint files (hard links 000001.sst ... 000004.sst plus WAL 1, WAL 2) => Backup]
114. How to restore and recover from backups
▪ MyRocks writes the binlog state and replication state into the WAL at commit
▪ After crash recovery, the last replication state is reflected in mysql.slave_relay_log_info or mysql.slave_gtid_info
▪ You can restart replication from the last replication state
▪ The binlog state is from the snapshot time. If a long time has passed since then, it will take a very long time to sync up with the master. You can't recover if the master no longer has the necessary binlogs.
[Diagram: backups from hard links (000001.sst ... 000004.sst, WAL 1, WAL 2) => copy to the target server => restart mysqld => after crash recovery the WAL is applied (producing 000005.sst) and the source's binlog state is available, e.g. Master_binlog_file=binlog.000003, Master_binlog_pos=5000, Master_binlog_GTID=uuid:30000 => start replication from the prod master]
115. Tricks: Renewing Checkpoints
▪ After enough time has passed since the last checkpoint, destroy it and create another checkpoint, then copy from the new checkpoint
▪ Copy only sst files at intermediate checkpoints
▪ Skip copying sst files that were already copied during previous checkpoints
▪ At the last checkpoint, copy the rest of the files (WAL, manifest, etc.)
▪ The binlog/replication state is from the last checkpoint time, so replication sync-up time can be much shorter and does not depend on instance size
▪ Automating these steps helps
[Diagram: Checkpoint 1 contains 000001.sst ... 000004.sst plus WAL 1-2 and everything is copied; Checkpoint 2 contains 000001.sst, 000003.sst ... 000006.sst plus WAL 3-4 and only the new 000005.sst and 000006.sst are copied; the last Checkpoint X contains 000113.sst ... 000116.sst plus WAL 210-211, and only the not-yet-copied sst files and the final WAL/metadata files are copied]
116. myrocks_hotbackup: MyRocks online backup tool
▪ myrocks_hotbackup is a tool to take online binary backup
▪ Automates all of the algorithms described in previous slides
▪ Included in fb-mysql 5.6 (under scripts/), Open Source, written in Python
▪ Works as a "Streaming Backup"
▪ Can send backup files to remote servers without writing locally
▪ After the streaming backup, a "move-back" step (placing WAL, SST, frm, and metadata files properly) is needed
▪ "tar" and "xbstream" streaming options are supported. xbstream is recommended over tar to prevent burst writes on the destination
▪ Supports "WDT", a faster network transfer method
▪ https://github.com/facebook/wdt
117. Online binary backup with myrocks_hotbackup
▪ Major restrictions
▪ Source MySQL instance must be running
▪ You can't take backups of storage engines other than RocksDB
▪ Major benefits
▪ You don’t need much extra space on source machine
▪ Apply-log phase is extremely short
▪ Example usage:
▪ [src]$ myrocks_hotbackup --user=root --port=3306 --checkpoint_dir=/data/backup --stream=xbstream | ssh $dst 'xbstream -x /data/backup'
▪ [dst]$ myrocks_hotbackup --move_back --datadir=/data/mysql --rocksdb_datadir=/data/mysql/.rocksdb --rocksdb_waldir=/txlogs --backup_dir=/data/backup
118. How myrocks_hotbackup works (1)
▪ Create a checkpoint, start a backup round
▪ SET GLOBAL rocksdb_create_checkpoint = '$checkpoint_dir/$i'
▪ RocksDB creates hard links from the data directory into the checkpoint directory
▪ Back up SST files one by one
▪ Takes advantage of SST files being immutable
▪ Streaming backup (tar or xbstream over ssh) is supported
▪ Delete the checkpoint when
▪ some time has passed (every --interval seconds), or
▪ all files have been backed up
▪ If the SST backup has not finished, re-create a checkpoint and repeat the same steps
▪ myrocks_hotbackup remembers previously sent SST files and skips sending the same file again
▪ As long as the backup speed is faster than the SST generation speed, the backup will eventually finish
119. How myrocks_hotbackup works (2)
▪ Repeat "create checkpoint" / "delete checkpoint" cycles
▪ After backing up all SST files, back up the remaining files
▪ WAL files, *.frm files, other metadata files
▪ These are much smaller than SST files
▪ Impact of multiple checkpoints
▪ The necessary WAL files do not depend on data size (much shorter apply-log time)
▪ May send some SST files that are no longer used (i.e. deleted by compaction)
▪ Move-back
▪ Put WAL files under rocksdb_wal_dir
▪ Put *.frm files under datadir
▪ Put SST and Manifest files under rocksdb_datadir
120. Starting instance
▪ Starting mysqld
▪ Crash recovery happens – applying WAL files
▪ By renewing the checkpoint frequently, the number of WAL files can be kept small
▪ Fewer WAL files mean shorter crash recovery time
▪ Unnecessary SST files are automatically removed at mysqld start
▪ The binlog state is printed into the mysql *.err log
121. Configuring replication from backup
▪ The backup source's binlog state is written to the *.err log on the destination
▪ Connect to the source MySQL, then run
▪ SHOW GTID EXECUTED IN '$binlog_file' FROM $binlog_pos;
▪ Returns the GTID position where the destination instance should start replication
▪ Configure replication on the destination
▪ SET GLOBAL GTID_PURGED='$gtid';
▪ CHANGE MASTER TO MASTER_HOST='$master', MASTER_AUTO_POSITION=1;
122. Performance Tuning
▪ Reverse order column family
▪ Useful if the index is intensively used for descending range scan
▪ Space and Compression
▪ Bloom Filter
▪ Data Loading
▪ my.cnf configuration examples
123. Space and Compression
▪ Estimated steady-state total size is 1.11 * (size of the bottommost level)
▪ With the default level size multiplier of 10, the upper levels add roughly 1/10 + 1/100 + … of the bottommost level, hence the ~1.11 factor
▪ Compression algorithm can be configured per level
124. Leveled LSM and Compression basics
▪ First of all, compression and decompression are different
▪ Compression happens only at flush and compaction. Both are done by background threads, not user facing
▪ Decompression happens whenever reading compressed data
▪ Data blocks are cached in the RocksDB block cache in uncompressed form
▪ Decompression must be fast and efficient, since it happens much more often than compression
▪ Level 0 and 1 sst files are written very frequently but don't occupy much space
▪ It is possible to use a stronger compression level than with InnoDB
▪ InnoDB compression happens on user facing operations
▪ zlib compression level 1 for InnoDB => can use zlib compression level 6 for MyRocks
125. Space and Compression
▪ Use compression. Default is Snappy
▪ zlib is also recommended since it gives better space savings than Snappy/LZ4, and decompression is fast enough
▪ zstd gives a better compression ratio and speed than zlib. It's new and actively developed
▪ Compressing Level 0 ~ Level 2 sst files doesn't give much space benefit. Not compressing these levels makes sense to reduce CPU usage (sketch below)
126. Size optimizations for leveled placement
▪ level_compaction_dynamic_level_bytes=true
▪ http://rocksdb.org/blog/2207/dynamic-level/
127. Bloom Filter
[Diagram: an LSM tree with a MemTable and levels L0-L3, each sst holding a sorted key range (1~10, 11~20, 21~30, 31~40, 41~50); KeyMayExist(id=32)? returns false at levels whose bloom filter rules the key out]
Checking whether a key may exist without reading data, and skipping read I/O if it definitely does not exist
▪ Rows are sorted by key for each level
▪ Each SST file stores its min and max key in the header, so out-of-range sst files are skipped regardless of the bloom filter
▪ The bloom filter is useful when a key falls within an sst's min/max range but does not actually exist there
128. How to configure bloom filter
▪ By default, bloom filter is disabled
▪ Can be set in my.cnf
▪ rocksdb_default_cf_options=prefix_extractor=capped:12;block_based_table_factory={cache_index_and_filter_blocks=1;filter_policy=bloomfilter:10:false;whole_key_filtering=0}
▪ Define the bloom filter length -- internal 4 byte index id + major condition length
▪ The condition length must be larger than or equal to the bloom filter length
▪ Example: "WHERE id_bigint=1" => the bloom filter length has to be shorter than or equal to 12 (4 byte index id + 8 byte bigint)
▪ MyRocks checks the WHERE condition length and skips the bloom filter if it can't be used
▪ Can be set per column family
▪ Can't change bloom filter settings without a dump & restore
▪ Because bloom filter information is stored persistently in sst files
129. When bloom filter can be used (1)
▪ MyRocks automatically decides to use bloom filter if possible, based on WHERE
conditions and CF parameters
▪ Can be used for equal conditions in WHERE clause
▪ WHERE id=1 (index id)
▪ Can be used for AND conditions
▪ WHERE id1 = 1 and id2=1 (index id1,id2)
▪ Can be used for prefix index lookup
▪ WHERE id1=1 (index id1, id2)
▪ Can be used for prefix range scan with equal predicates
▪ WHERE id1 = 1 AND time < now()
▪ Bloom filter is used for filtering id1=1 only. Can’t be used for range conditions
130. When bloom filter can be used (2)
▪ WHERE key IN ('01234567890123456789', '1'), with BF length 16
▪ BF is used for the first condition, not for the second
▪ MyRocks bloom filter supports both point lookup and prefix lookup
▪ Prefix bloom filter is not used on descending range/full scans (or ascending scans when using a reverse column family)
131. Bloom filter size overhead
▪ SST file size increases approximately 2~3%
▪ If your equal lookup rarely returns empty set, it makes sense to
disable bloom filter on the bottommost level
▪ All data reside on the bottommost level
▪ If your equal lookups always find data (i.e. using negative cache in front),
bloom filter on the bottommost level is useless
▪ The CF option optimize_filters_for_hits=true turns off the bloom filter on the bottommost level (sketch below)
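A hedged my.cnf sketch (append the option to your existing column family options string; "..." is a placeholder for the rest of your CF options, and the option name comes from the slide above):
rocksdb_default_cf_options=...;optimize_filters_for_hits=true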
132. Data Loading
▪ There are some session variables to make data loading faster
▪ rocksdb_skip_unique_check=1
▪ rocksdb_commit_in_the_middle=1
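A hedged loading sketch using these variables (table and file names are hypothetical):
SET SESSION rocksdb_skip_unique_check=1;     -- skip unique key checks while loading
SET SESSION rocksdb_commit_in_the_middle=1;  -- commit at intervals instead of one huge transaction
LOAD DATA INFILE '/tmp/t1.csv' INTO TABLE t1;
SET SESSION rocksdb_skip_unique_check=0;
SET SESSION rocksdb_commit_in_the_middle=0;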
133. Deleting large number of rows
▪ SET session sql_log_bin=0; SET session rocksdb_commit_in_the_middle=1;
DELETE FROM x WHERE id < 10000000;
▪ Run these on both master and slaves
▪ The DELETE is not written to the binlog, so taking a long time is ok
▪ rocksdb_commit_in_the_middle=1 avoids holding lots of uncommitted changes in memory (WriteBatch)
▪ Since MyRocks doesn't hold gap locks, this won't block inserts
▪ Repeating deletes with LIMIT is not recommended
▪ i.e. (for 1..100,000: DELETE FROM x WHERE id < 10000000 LIMIT 100)
▪ Each iteration has to scan over more and more tombstones
134. Other configurations (DB options)
▪ rocksdb_block_size
▪ I/O unit (not fully aligned). Default is 4KB. 16KB gives better space savings but needs extra CPU for decompression. Measure the trade-offs between 4K, 8K, 16K and 32K
▪ rocksdb_block_cache_size
▪ RocksDB's internal cache. Less important than innodb_buffer_pool_size since RocksDB also relies on the OS cache
▪ rocksdb_max_total_wal_size
▪ Controls the maximum WAL size. Setting it as large as the total InnoDB log size would be fine
▪ rocksdb_base_background_compactions
▪ rocksdb_max_background_compactions
▪ rocksdb_max_background_flushes
▪ rocksdb_lock_wait_timeout
▪ rocksdb_max_open_files=-1
▪ Increase file descriptor limit for mysqld process (Increase nofile in /etc/security/limits.conf)
▪ rocksdb_rpl_lookup_rows=0 (on slaves; see the my.cnf sketch below)
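A hedged my.cnf sketch pulling these options together (values are illustrative, not recommendations; tune per workload and hardware):
rocksdb_block_size=16384
rocksdb_block_cache_size=8G
rocksdb_max_total_wal_size=4G
rocksdb_max_background_compactions=8
rocksdb_max_background_flushes=4
rocksdb_lock_wait_timeout=2
rocksdb_max_open_files=-1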
138. Monitoring
▪ MyRocks files
▪ SHOW ENGINE ROCKSDB STATUS
▪ SHOW GLOBAL STATUS
▪ information_schema
▪ sst_dump
▪ Perf Context
▪ Checking data consistency between InnoDB and MyRocks
139. MyRocks/RocksDB files
▪ Data Files (*.sst)
▪ WAL files (*.log)
▪ Manifest files
▪ Options files
▪ LOG files
▪ All files are created at $datadir/.rocksdb by default
▪ Can be changed by rocksdb_datadir and rocksdb_wal_dir
141. SHOW ENGINE ROCKSDB TRANSACTION STATUS
▪ Similar to the transaction section of SHOW ENGINE INNODB STATUS. Useful to find long running sessions
mysql> show engine rocksdb transaction status\G
*************************** 1. row ***************************
Type: SNAPSHOTS
Name: rocksdb
Status:
============================================================
2016-04-14 14:29:46 ROCKSDB TRANSACTION MONITOR OUTPUT
============================================================
---------
SNAPSHOTS
---------
LIST OF SNAPSHOTS FOR EACH SESSION:
---SNAPSHOT, ACTIVE 27 sec
MySQL thread id 9, OS thread handle 0x7fbbfcc0c000
-----------------------------------------
END OF ROCKSDB TRANSACTION MONITOR OUTPUT
=========================================
142. SHOW GLOBAL STATUS
mysql> show global status like 'rocksdb%';
+---------------------------------------+-------------+
| Variable_name | Value |
+---------------------------------------+-------------+
| rocksdb_rows_deleted | 216223 |
| rocksdb_rows_inserted | 1318158 |
| rocksdb_rows_read | 7102838 |
| rocksdb_rows_updated | 1997116 |
....
| rocksdb_bloom_filter_prefix_checked | 773124 |
| rocksdb_bloom_filter_prefix_useful | 308445 |
| rocksdb_bloom_filter_useful | 10108448 |
....
143. information_schema
mysql> select f.index_number, f.sst_name from information_schema.rocksdb_index_file_map f, information_schema.rocksdb_ddl d where f.column_family = d.column_family and f.index_number = d.index_number and d.table_schema='test' and d.table_name='linktable' and d.cf='rev:cf_id1_type' order by 1, 2;
+--------------+------------+
| index_number | sst_name |
+--------------+------------+
| 2822 | 156068.sst |
| 2822 | 156119.sst |
| 2822 | 156164.sst |
| 2822 | 156191.sst |
| 2822 | 156240.sst |
| 2822 | 156294.sst |
| 2822 | 156333.sst |
| 2822 | 259093.sst |
| 2822 | 268721.sst |
| 2822 | 268764.sst |
| 2822 | 270503.sst |
| 2822 | 270722.sst |
| 2822 | 270971.sst |
| 2822 | 271147.sst |
+--------------+------------+
14 rows in set (0.41 sec)
This example returns all sst file names that hold data for the specified db.table.cf_name
144. sst_dump
▪ sst_dump is a leveldb/rocksdb tool to parse an SST file. This is useful for debugging purposes, if you want to investigate SST files
▪ Automatically built and installed with fb-mysql
# sst_dump --command=scan --output_hex --file=/data/mysql/.rocksdb/000020.sst
145. Perf Context
▪ RocksDB exposes many internal statistics
▪ It’s disabled in MyRocks by default, since it’s relatively expensive
▪ Can be enabled by “set global rocksdb_perf_context_level=1;”
▪ Global level context
▪ select * from information_schema.rocksdb_perf_context_global;
▪ Per table context
▪ select * from information_schema.rocksdb_perf_context where table_schema='test' and table_name='t1';
▪ If "INTERNAL_DELETE_SKIPPED_COUNT" is very high, it's a sign that there are many tombstones
146. Checking data consistency
▪ Comparing data between InnoDB and MyRocks instances, without stopping traffic
▪ Also works for multiple InnoDB instances, or multiple MyRocks instances
▪ Take consistent snapshots at the same binlog position, then run any
non-locking selects, compare results
▪ pt-table-checksum does not work for MyRocks, because of lack of Gap
Lock support
147. Writing SELECT correctness tools
▪ Suppose comparing two slaves
▪ slave2) STOP SLAVE SQL_THREAD; SET GLOBAL slave_parallel_workers=0; sleep 1 second
▪ slave1) START TRANSACTION WITH CONSISTENT INNODB|ROCKSDB SNAPSHOT; => getting binlog GTID X
▪ slave2) START SLAVE UNTIL SQL_AFTER_GTIDS = X;
SELECT @@global.gtid_executed => Y
Compare X and Y and confirm Y is a subset of X (Y does not exceed X)
SELECT WAIT_UNTIL_SQL_THREAD_AFTER_GTIDS(X);
▪ slave2) START TRANSACTION WITH CONSISTENT INNODB|ROCKSDB SNAPSHOT;
▪ slave2) SET GLOBAL slave_parallel_workers=N; START SLAVE SQL_THREAD;
▪ slave1, slave2) Run any SELECTs and compare the results
▪ slave1, slave2) ROLLBACK; -- release the snapshots. They're expensive to hold for a long time
▪ START SLAVE does an implicit commit in 5.7. Make sure to run it via a different thread
▪ Keep the snapshot retention period short. Beware of wait_timeout.
148. Memory Management
▪ Jemalloc is the recommended memory allocator for MyRocks
▪ Memory (RSS) is less fragmented than with glibc malloc
▪ Major memory components
▪ Block Cache
▪ rocksdb_block_cache_size
▪ MemTable
▪ write_buffer_size * max_write_buffer_number * number of column families (worked example after this list)
▪ SHOW ENGINE ROCKSDB STATUS prints memory usage from MemTable
▪ WriteBatch
▪ O_DIRECT is not supported (yet)
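For a rough sense of scale (hypothetical values, applying the formula above): write_buffer_size=128MB with max_write_buffer_number=4 and 2 column families allows up to 128MB * 4 * 2 = 1GB of MemTable memory, in addition to the block cache.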
149. Memory usage
▪ All modifications and row lock state held by transactions are kept in memory
▪ They're released at transaction commit
▪ The maximum number of row locks or modifications per transaction can be controlled by the rocksdb_max_row_locks session variable (sketch below)
▪ If you are running huge inserts/updates/deletes, consider using the rocksdb_commit_in_the_middle session variable
▪ Commits at regular intervals
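A hedged sketch for a session about to run a huge statement (variables as listed above; the values are illustrative):
SET SESSION rocksdb_max_row_locks = 1048576;   -- cap locks/modifications held per transaction
SET SESSION rocksdb_commit_in_the_middle = 1;  -- commit at regular intervals during the statement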
150. Mixing both InnoDB and MyRocks
▪ Add allow-multiple-engines in my.cnf
▪ Not recommended in production
▪ May be useful for some test experiments
▪ We haven't tested it enough
▪ Transactions cannot span multiple storage engines
151. Contributing to MyRocks
▪ Bug Reports
▪ https://github.com/facebook/mysql-5.6/issues
▪ Documentation
▪ MyRocks: https://github.com/facebook/mysql-5.6/wiki
▪ RocksDB: https://github.com/facebook/rocksdb/wiki
▪ Development
▪ Test Cases
▪ New Features